[AAAI 2026] Learning Diffusion Policy from Primitive Skills for Robot Manipulation

In most existing methods the high-level instruction is too abstract, so a granularity mismatch arises between it and the short-term actions. For example, a high-level task description such as “Pick up the lemon and put it into the pan” contains no concrete directive such as “close the gripper”.

The paper decomposes the high-level instruction into short-term skills based on the current observation, and trains a diffusion policy conditioned on the selected skill.

Contribution

Proposes SDP (skill-conditioned diffusion policy), which mitigates the misalignment between the global instruction and short-term actions.

Defines 8 reusable primitive skills and selects among them in a state-aware manner with a lightweight router network.

Designs a single-skill diffusion policy that generates actions well aligned with each skill.

Improves performance in both simulation and the real world, with better multi-task and generalization ability. Skill visualization confirms interpretability.

Proposed Approach

Approach Overview

In the overview figure, the upper part predicts the primitive skill, and the lower part is where the single-skill policy integrates state information to produce skill-aligned actions.

Primitive Skill Assignment

The task is decomposed into primitive skills so that a short-term action can be predicted for each state.

These skills provide precise, actionable guidance when generating short-term actions.

Compositional prompt ensemble (CPE)

Every skill uses the template “the robot arm is going to {skill}.”

Building the prompt ensemble

$P_{En} := \text{“the robot arm is going to \{skill\}.”} \otimes P$

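Generating the prompt embeddings
- uses a frozen CLIP text encoder
  $CLIP_{text}(P_{En})$
- passed through an MLP $f$
  $p = f(CLIP_{text}(P_{En})) \in \mathbb{R}^{8 \times C_{img}}$
- yields 8 skill embeddings whose dimension is that of the joint space used for skill assignment
- the CPE text is reusable
- the embeddings can be pre-computed before inference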
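
The template filling and the pre-computation step can be sketched as follows. This is a minimal illustration, not the paper's code: the skill names are hypothetical placeholders (the post does not list the 8 skills), and `encode_prompt` is a hash-based stand-in for $f(CLIP_{text}(\cdot))$, not a real text encoder.

```python
from functools import lru_cache
import hashlib

TEMPLATE = "the robot arm is going to {skill}."

# Hypothetical placeholder names: the 8 actual skills are not listed in this post.
SKILLS = [f"<skill_{i}>" for i in range(1, 9)]


def build_prompt_ensemble(skills):
    """Fill the shared template once per primitive skill."""
    return [TEMPLATE.format(skill=s) for s in skills]


@lru_cache(maxsize=None)
def encode_prompt(prompt, dim=4):
    """Stand-in for f(CLIP_text(prompt)): a deterministic pseudo-embedding.

    A real system would call a frozen text encoder followed by an MLP here.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:dim])


prompts = build_prompt_ensemble(SKILLS)
# The text side never depends on the observation, so it is computed once and cached.
skill_embeddings = [encode_prompt(p) for p in prompts]
print(len(prompts), len(skill_embeddings[0]))
```

Because the prompts depend only on the fixed skill set, the cached embeddings can be reused at every control step.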

Vision-language model

Given the per-skill prompts produced by CPE, the model still has to decide which skill to execute.

The visual observations and the high-level instruction are turned into a vision-language representation, which serves as guidance.
- static camera image: $I_s \in \mathbb{R}^{3 \times H \times W}$
- wrist camera image: $I_w \in \mathbb{R}^{3 \times H \times W}$
- high-level instruction: $l$

Image Encoding
- shared image encoder
  $f_{img}(\cdot): \mathbb{R}^{3 \times H \times W} \rightarrow \mathbb{R}^{N_{img} \times C_{img}}$
- output: vision tokens

Text Encoding
- tokenizer + embedding layer
  $f_t(l) \in \mathbb{R}^{N_t \times C_{text}}$
- output: text embeddings

The vision and text tokens are concatenated and fed to a transformer $\Phi$, which produces the vision-language representation.

$z_{vl} = \Phi([f_t(l), f_{img}(I_s), f_{img}(I_w)])$
$z_{vl} \in \mathbb{R}^{(N_t + 2N_{img}) \times C_{img}}$
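
The sequence length $N_t + 2N_{img}$ follows directly from concatenating along the token axis. A small shape check, assuming the text tokens already share the channel width of the vision tokens (the post gives $C_{text}$ for the text side and $C_{img}$ for the output, so some projection or equal width is assumed here); the sizes are illustrative, not from the paper:

```python
def concat_tokens(text_tokens, static_tokens, wrist_tokens):
    """Concatenate along the sequence axis: [f_t(l), f_img(I_s), f_img(I_w)]."""
    width = len(text_tokens[0])
    for tok in static_tokens + wrist_tokens:
        if len(tok) != width:
            raise ValueError("all tokens must share the channel width")
    return text_tokens + static_tokens + wrist_tokens


# Illustrative sizes only.
N_t, N_img, C = 5, 3, 4
text = [[0.0] * C for _ in range(N_t)]
static = [[1.0] * C for _ in range(N_img)]
wrist = [[2.0] * C for _ in range(N_img)]

sequence = concat_tokens(text, static, wrist)
print(len(sequence), len(sequence[0]))  # 11 4, i.e. (N_t + 2 * N_img, C)
```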

Primitive skill selection

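The vision-language representation $z_{vl}$ is averaged over tokens. An MLP followed by a softmax then scores the 8 skills from the averaged feature $z_{avg}$, and the highest-scoring skill is selected.

$R(z_{vl}) = \text{top-1}(\sigma(\mathrm{MLP}(\mathrm{Avg}(z_{vl}))))$
$R(z_{vl}) \in \mathbb{R}^8$

The skill embedding that is finally selected is:

$z = \sum_{i=1}^{8} R(z_{vl})_i \cdot p_i$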
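
The routing step can be sketched in plain Python. This is an illustration under assumptions: the MLP is reduced to a single linear layer, top-1 is represented as a one-hot gate (the paper may keep the softmax weight instead), and all weights and embeddings are made-up numbers.

```python
import math


def mean_pool(tokens):
    """Avg(z_vl): average over the token axis."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(z_vl, weight, bias):
    """top-1(softmax(MLP(Avg(z_vl)))), returned as a one-hot gate plus the probabilities."""
    probs = softmax(linear(mean_pool(z_vl), weight, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))], probs


def select_skill_embedding(gate, skill_embeddings):
    """z = sum_i R(z_vl)_i * p_i."""
    dim = len(skill_embeddings[0])
    return [sum(g * p[d] for g, p in zip(gate, skill_embeddings)) for d in range(dim)]


# Toy inputs: 2 tokens of width 2, 8 skills.
z_vl = [[1.0, 0.0], [3.0, 2.0]]
weight = [[0.0, 0.0] for _ in range(8)]
weight[5] = [1.0, 1.0]          # makes skill index 5 score highest
bias = [0.0] * 8
skill_embeddings = [[float(i), 2.0 * i] for i in range(8)]

gate, probs = route(z_vl, weight, bias)
z = select_skill_embedding(gate, skill_embeddings)
print(gate.index(1.0), z)
```

With a one-hot gate the sum simply picks out one row of $p$, which is why the selected skill stays readable by a human.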

Analysis

SDP explicitly defines primitive skills that are shared across tasks, so the skill assignment is human-understandable.

Skill-conditioned Diffusion Policy Learning

The goal is to predict skill-aligned actions through single-skill diffusion policies.

Priors injection

Each state provides the time step, proprioception, visual observations, and the high-level instruction.

The VLM output tokens are projected by a linear layer with RMSNorm, then injected in every block through cross-attention.

This operation integrates the state priors and performs the conditional injection.
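
A rough sketch of this injection path, under several assumptions that the post does not settle: RMSNorm is applied before the linear projection, attention is single-head with no learned query/key/value matrices, and the result is added back residually. All numbers and the `proj` matrix are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector by the inverse of its root mean square (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query (action token) reads a softmax-weighted mix of the context tokens."""
    width = len(context[0])
    scale = 1.0 / math.sqrt(width)
    attended = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in context])
        attended.append([sum(w * k[d] for w, k in zip(weights, context)) for d in range(width)])
    return attended


# Illustrative numbers only.
vlm_tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]   # 2 VLM output tokens, width 3
proj = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]          # hypothetical 3 -> 2 projection
context = [matvec(proj, rms_norm(t)) for t in vlm_tokens]

action_tokens = [[0.5, -0.5], [1.0, 0.0]]          # noisy action tokens, width 2
attended = cross_attention(action_tokens, context)
# Residual add (assumed): the condition is injected on top of the existing features.
updated = [[a + b for a, b in zip(x, y)] for x, y in zip(action_tokens, attended)]
print(updated)
```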

Skill-dependent FFN layer

A LoRA-like FFN layer is added to create a dependency between the primitive skill and action generation.

$FFN(x) = W_2^z(\mathrm{SwishGLU}(W_1^z x)) + FFN_{ori}(x)$

New weight matrices are generated from the skill embedding $z$, and the FFN is computed with them. When the skill changes, $W_1^z$ and $W_2^z$ change, so the feature transformation changes as well: the skill alters the feature extractor itself.
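
The formula can be made concrete with a toy example. How the weights are generated from $z$ is not described in this post, so the sketch assumes a simple linear hypernetwork, $W^z = \sum_k z_k B_k$, with made-up basis matrices; `ffn_ori` is an identity placeholder for the original FFN, and SwishGLU is taken as the usual gated form, swish(gate) times value.

```python
import math


def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swish(a):
    return a / (1.0 + math.exp(-a))


def swish_glu(h):
    """Split h into (gate, value) halves and return swish(gate) * value."""
    half = len(h) // 2
    return [swish(g) * v for g, v in zip(h[:half], h[half:])]


def generate_weight(z, basis):
    """Hypothetical hypernetwork: W^z = sum_k z_k * basis[k]."""
    rows, cols = len(basis[0]), len(basis[0][0])
    return [[sum(zk * b[r][c] for zk, b in zip(z, basis)) for c in range(cols)]
            for r in range(rows)]


def skill_ffn(x, z, basis_1, basis_2, ffn_ori):
    """FFN(x) = W_2^z(SwishGLU(W_1^z x)) + FFN_ori(x)."""
    w1 = generate_weight(z, basis_1)   # shape (2 * r, d)
    w2 = generate_weight(z, basis_2)   # shape (d, r)
    delta = matvec(w2, swish_glu(matvec(w1, x)))
    return [a + b for a, b in zip(delta, ffn_ori(x))]


# Toy setup: feature width d = 2, low rank r = 1, skill embedding width 2.
basis_1 = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
basis_2 = [[[1.0], [0.0]], [[0.0], [1.0]]]
ffn_ori = lambda x: list(x)            # placeholder for the original FFN

x = [1.0, 2.0]
out_a = skill_ffn(x, [1.0, 0.0], basis_1, basis_2, ffn_ori)
out_b = skill_ffn(x, [0.0, 1.0], basis_1, basis_2, ffn_ori)
print(out_a, out_b)                    # same input, different skill, different output
```

The low-rank branch is added on top of the shared FFN, so a zero skill embedding in this toy setup leaves the original output unchanged.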

Training Objective

The orthogonal loss $L_{Orth}(\theta)$ aims to reduce the pairwise cosine similarity between skill embeddings $p_i$ and $p_j$.

$L(\theta) = L_{SM}(\theta) + \gamma L_{Orth}(\theta)$
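
One way such a penalty could look is sketched below. The exact form of $L_{Orth}$ is not given in this post; the sketch assumes the mean squared cosine similarity over all pairs $i < j$, and the value of $\gamma$ is arbitrary.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def orthogonal_loss(embeddings):
    """Assumed form: mean squared cosine similarity over all pairs i < j."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) ** 2 for i, j in pairs) / len(pairs)


def total_loss(loss_sm, embeddings, gamma):
    """L = L_SM + gamma * L_Orth."""
    return loss_sm + gamma * orthogonal_loss(embeddings)


orthogonal = [[1.0, 0.0], [0.0, 1.0]]
collinear = [[1.0, 0.0], [2.0, 0.0]]
print(orthogonal_loss(orthogonal), orthogonal_loss(collinear))
print(total_loss(0.5, collinear, gamma=0.1))
```

The penalty vanishes when the skill embeddings are mutually orthogonal and grows as they become collinear, which pushes the 8 skills toward distinct directions.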

Experiments

CALVIN
- 4 scene configurations: A, B, C, D
- 34 tasks
- 24,000 language-annotated demonstrations

Evaluation settings
- ABC → D: train on environments A, B, C; zero-shot evaluation on D
- ABCD → D: train on A, B, C, D; evaluate on D
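
The two settings differ only in whether the test environment appears in the training set. A small sketch of how they could be written down as configuration; the `EvalSetting` class is hypothetical and not part of CALVIN or the paper's code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetting:
    name: str
    train_envs: tuple
    test_env: str

    @property
    def is_zero_shot(self):
        """True when the test environment was never seen during training."""
        return self.test_env not in self.train_envs


SETTINGS = [
    EvalSetting("ABC->D", ("A", "B", "C"), "D"),
    EvalSetting("ABCD->D", ("A", "B", "C", "D"), "D"),
]

for s in SETTINGS:
    print(s.name, "zero-shot" if s.is_zero_shot else "seen environment")
```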
