[CoRL 2025] Humanoid Policy ∼ Human Policy

Training on diverse data helps make a humanoid robot's manipulation policy more robust and generalizable. Learning from robot demonstrations alone, however, makes data collection labor-intensive and depends on tele-operation, so it is hard to scale.

The paper treats humanoid teleoperation as a process that maps human motion onto robot motion through geometric transformation or retargeting. From that viewpoint, the robot is modeled in a human-centric representation, and robot actions are obtained by transforming human actions.

The authors collect PH2D, an egocentric task-oriented dataset, and propose the Human Action Transformer (HAT), which trains on humans and humanoids together in a unified state-action space by co-training on human and robot data.

Contribution

Creation of the PH2D dataset (a large egocentric, task-oriented human-humanoid dataset with accurate hand and wrist poses)

The Human-humanoid Action Transformer (HAT), which introduces a unified state-action space and other alignment techniques

Experiments and ablations that verify the benefits (robustness and generalization) of co-training with human data

Method

PH2D: Task-oriented Physical Humanoid-Human Data

Existing egocentric human videos either provide demonstrations of non-task-oriented skills or lack the world-frame 3D head and hand poses needed for imitation learning.

PH2D addresses both problems by:
1. Collecting task-oriented human demonstrations that are directly related to robot execution
2. Using the VR device's SDK to provide supervision
3. Reducing the vision/behavior domain gap by diversifying tasks and camera sensors and by reducing whole-body movement

Adapting Low-cost Commercial Devices

Data is collected with existing commercial hardware, such as Apple Vision Pro with its built-in camera, and Meta Quest 3 or Apple Vision Pro paired with a ZED camera.

Data Collection Pipeline

A person wearing the VR device performs tasks that overlap with robot execution (e.g., grasping, pouring).

Each demonstration comes with a language instruction (e.g., "grasp a can of coke zero with right hand").

Proprioception input and visual input are synchronized by timestamp.
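The summary only says the two streams are synchronized by timestamp. As a minimal sketch of one way to do that, the code below pairs each camera frame with the nearest pose sample. The nearest-neighbor rule, the `max_gap` tolerance, and the function names are assumptions for illustration, not details from the paper.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Return the index of the entry in sorted_times closest to t."""
    pos = bisect_left(sorted_times, t)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    # Ties go to the earlier sample.
    return pos - 1 if t - before <= after - t else pos


def synchronize(frame_times, pose_times, max_gap=0.02):
    """Pair each camera frame with the closest pose sample.

    Both inputs are ascending timestamps in seconds. Returns a list of
    (frame_index, pose_index); frames with no pose sample within
    max_gap seconds (a hypothetical tolerance) are dropped.
    """
    pairs = []
    for i, t in enumerate(frame_times):
        j = nearest_index(pose_times, t)
        if abs(pose_times[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs
```

Dropping unmatched frames rather than interpolating poses keeps the sketch simple; a real pipeline might interpolate instead.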

Action Domain Gap

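Human actions and tele-operated robot actions differ in two ways:
- Human manipulation involves whole-body movement
  → Participants are asked to sit upright during human data collection
- Humans are much faster and more dexterous than robots, so task completion times differ
  → During training, the translation and rotation of human trajectories are interpolated to "slow down" the motion
  → Slow-down factor: $\alpha_{\text{slow}}$ (computed by normalizing the ratio of average task completion time between humans and humanoids; empirically about 4)
  → $\alpha_{\text{slow}} = 4$ is used for all tasks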
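The slow-down step can be sketched as upsampling a wrist trajectory by an integer factor, so that replaying it at the same control rate takes about `alpha_slow` times longer. This sketch assumes positions are interpolated linearly and rotations are unit quaternions in `(w, x, y, z)` order interpolated with slerp. The summary does not state the rotation representation or interpolation scheme, so both are illustrative choices.

```python
import math


def lerp(p, q, u):
    """Linear interpolation between two vectors."""
    return [a + u * (b - a) for a, b in zip(p, q)]


def slerp(q0, q1, u):
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion to follow the shorter arc
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:  # nearly identical: normalized lerp is stable here
        out = lerp(q0, q1, u)
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        s1 = math.sin(u * theta) / math.sin(theta)
        out = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in out))
    return [c / norm for c in out]


def slow_down(positions, quats, alpha_slow=4):
    """Upsample a non-empty trajectory by an integer factor alpha_slow.

    Each original segment is split into alpha_slow steps, so n input
    samples become (n - 1) * alpha_slow + 1 output samples.
    """
    new_pos, new_quat = [], []
    for k in range(len(positions) - 1):
        for step in range(alpha_slow):
            u = step / alpha_slow
            new_pos.append(lerp(positions[k], positions[k + 1], u))
            new_quat.append(slerp(quats[k], quats[k + 1], u))
    new_pos.append(list(positions[-1]))
    new_quat.append(list(quats[-1]))
    return new_pos, new_quat
```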

HAT: Human Action Transformer

HAT learns a cross-embodied robot policy by modeling humans.

The authors argue that treating the bimanual humanoid robot and the human as different embodiments, related through retargeting, improves HAT's generalizability and robustness.

Robot data: $D_{robot}=\{(S_i,A_i)\}_{i=1}^N$
- $S_i$: proprioceptive and visual observations for the $i$-th demonstration
- $A_i$: actions for the $i$-th demonstration

Human data (PH2D): $D_{human}=\{(S_i,A_i)\}_{i=1}^M$

Human data is far more efficient to collect, so $M \gg N$.

The goal is to learn a policy $\pi: S \to A$ that, given the current robot observation $s_t$ at time $t$, predicts the future robot action $a_{t+1}$.

Each action is an action chunk intended for multi-step execution.
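To make the action-chunk idea concrete, here is a rough sketch of how one demonstration might be cut into (state, chunk) training pairs. The indexing convention, the `chunk_size` parameter, and padding the tail by repeating the last action are assumptions for illustration; the summary does not specify them.

```python
def make_chunks(states, actions, chunk_size):
    """Build (state, action_chunk) training pairs from one demonstration.

    Assumes actions[t] is the action that follows states[t]. The chunk
    for time t holds actions[t : t + chunk_size]; near the end of the
    demonstration it is padded by repeating the final action.
    """
    pairs = []
    for t, state in enumerate(states):
        chunk = list(actions[t:t + chunk_size])
        while len(chunk) < chunk_size:
            chunk.append(actions[-1])
        pairs.append((state, chunk))
    return pairs
```

At deployment, a chunked policy predicts several future actions from one observation and executes them before (or while) querying the policy again.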

Unified State-Action Space

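A unified state-action space is designed for humans and bimanual humanoid robots: $(S,A)≡(\tilde S, \tilde A)$

The proprioceptive observation is a 54-dimensional vector:
- 6D rotations of the head, left wrist, and right wrist (3 × 6 = 18 values)
- x/y/z positions of the left and right wrists and of the 10 fingertips (12 × 3 = 36 values)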
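The 54 dimensions add up as 3 × 6 + 2 × 3 + 10 × 3 = 18 + 6 + 30. Below is a minimal sketch of assembling such a vector. It assumes the common 6D rotation convention (the first two columns of the rotation matrix) and one particular ordering of the parts. The summary fixes neither, so treat `build_state` and its layout as hypothetical.

```python
def rotmat_to_6d(R):
    """6D rotation: the first two columns of a 3x3 rotation matrix."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def build_state(head_R, left_wrist_R, right_wrist_R,
                left_wrist_pos, right_wrist_pos, fingertip_pos):
    """Concatenate head/wrist rotations and wrist/fingertip positions.

    The ordering (rotations, then wrist positions, then fingertips)
    is an assumed layout for illustration.
    """
    if len(fingertip_pos) != 10:
        raise ValueError("expected 10 fingertip positions")
    state = []
    for R in (head_R, left_wrist_R, right_wrist_R):
        state.extend(rotmat_to_6d(R))  # 3 x 6 = 18 values
    state.extend(left_wrist_pos)       # 3 values
    state.extend(right_wrist_pos)      # 3 values
    for p in fingertip_pos:
        state.extend(p)                # 10 x 3 = 30 values
    return state
```

Because the same layout is filled from VR hand tracking for humans and from retargeted kinematics for the robot, one state encoder can consume both.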

The paper deploys on robots with 5-fingered dexterous hands.

A bijective mapping therefore exists between robot fingertips and human fingertips.

Visual Domain Gap

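Human-robot co-training faces two visual domain gaps:
- Camera sensor differences
  - The camera used for human data collection differs from the one deployed on the robot, causing differences in tone and similar properties
- End-effector appearance differences
  - Human hands and humanoid hands look different

The authors note, however, that with sufficiently large and diverse data, basic image augmentations (color jittering, Gaussian blurring) act as effective regularization without adding visual artifacts or resorting to generative methods.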

Training

The final policy can be written as

$\pi: f_\theta(\cdot) \to A$

$f_\theta$: a transformer-based neural network with parameters $\theta$

The same policy is used for both humans and robots.

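The final loss is defined as

$$L = \ell_1(\pi(s_i), a_i) + \lambda \cdot \ell_1(\pi(s_i)_{\text{EEF}}, a_{i,\text{EEF}})$$

- $\ell_1(\pi(s_i), a_i)$: L1 loss between the policy's action prediction and the ground-truth action
- EEF denotes the indices of the left/right wrist translation vectors, so the second term is an L1 loss computed only on the wrist position (translation) part

$\lambda=2$: a hyperparameter to which results are insensitive
- It puts more emphasis on end-effector position while keeping the policy from over-focusing on unnecessarily precise fingertip keypoints
- In other words, the loss is designed to weight wrist position accuracy more heavily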
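As a small sketch of how this loss combines its two terms, the code below computes an L1 loss over the full action vector plus a weighted L1 loss over a subset of indices. The mean reduction and the index set `eef_idx` are assumptions; which positions of the action vector hold the wrist translations depends on a layout the summary does not give.

```python
def l1(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hat_loss(pred, target, eef_idx, lam=2.0):
    """L1 on the full action plus lam times L1 on the wrist-translation entries.

    eef_idx lists the (hypothetical) positions of the left/right wrist
    translation components inside the action vector.
    """
    full_term = l1(pred, target)
    eef_term = l1([pred[i] for i in eef_idx], [target[i] for i in eef_idx])
    return full_term + lam * eef_term
```

Since the wrist entries already appear in the first term, an error there is penalized by both terms, while a fingertip error is penalized only once.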

Experiment

Hardware Platforms

Two humanoid robots are used:
- Humanoid A: Unitree H1
- Humanoid B: Unitree H1-2 (different arm configuration)

Both robots are equipped with 6-DoF Inspire dexterous hands and an actuated neck (enabling an egocentric view), and neither has wrist cameras.

Most of the data is collected on Humanoid A.

Humanoid B is used to test cross-humanoid generalization.

Implementation Details

A transformer-based architecture is used.

Visual backbone: frozen DINOv2 ViT-S

Two models are implemented:
- ACT (baseline): Action Chunking Transformer, trained on robot data only. The robot state is joint positions
- HAT: same architecture as ACT, but the state encoder operates on the unified state-action space, and the model is co-trained on robot and human data. One checkpoint is trained per task, using 250–400 robot demonstrations

Experimental Protocol

In-Distribution (I.D.)
- Scene setup similar to the training robot demonstrations
- Similar background, object types, and placement

Out-of-Distribution (O.O.D.)
- New setups that are absent from the robot data but present in the human data
- Intended to evaluate generalization and robustness

Main Evaluation

Human data has minor effects on I.D. testing

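Performance is nearly the same with or without human data co-training.

A small amount of robot data is enough to perform well in the I.D. setting.

This is consistent with recent work showing that frozen visual foundation models are robust to perturbations such as lighting.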

Human data improves the O.O.D. settings with many generalizations

Co-training drastically improves O.O.D. performance.

In settings absent from the robot data, the relative improvement is nearly 100%.
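For reference, relative improvement is usually computed against the baseline success rate. The symbols $p_{\text{robot-only}}$ and $p_{\text{co-train}}$ and the numbers in the example are hypothetical, chosen only to show the arithmetic; they are not results from the paper.

$$\text{relative improvement} = \frac{p_{\text{co-train}} - p_{\text{robot-only}}}{p_{\text{robot-only}}}$$

For example, going from a 30% to a 60% success rate gives $(0.60 - 0.30) / 0.30 = 1.0$, i.e. 100%. A relative improvement near 100% therefore means the O.O.D. success rate roughly doubles.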

Types of generalization improved by human data: background, object placement, appearance

Each task is designed to focus on a specific type of generalization.

Few-Shot Transfer across Heterogeneous Embodiments

Few-shot experiments are run on Humanoid B (a different embodiment from Humanoid A, in a different environment).

Experiment 1: Cross-embodiment co-training

Three policies are trained using only 20 Humanoid B demonstrations: (i) Humanoid B only, (ii) Humanoid B + Humanoid A, (iii) Humanoid B + Humanoid A + Human

(ii) and (iii) outperform (i) on every task.

This shows that latent task structure can transfer across embodiments.

Experiment 2: Few-shot scaling

The Humanoid A + Human dataset is held fixed, and only the number of Humanoid B demonstrations varies.

Co-training (B + A + Human) outperforms training on B alone in every setting, with the largest gap in the few-data regime.
