[CVPR 2025] OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Motivation

Recent work applies LLMs and VLMs to robotics for high-level task planning such as semantic reasoning. However, because VLMs are trained only on 2D data, they lack the 3D spatial understanding needed for precise, low-level manipulation tasks.

One recent response is to fine-tune a VLM into a VLA, but this has two problems:
- Diverse, high-quality robotic data for fine-tuning is scarce.
- Fine-tuning a VLM into a VLA yields agent-specific representations that are tied to a particular robot and reduce generalizability.

The alternative is to abstract robotic actions into interaction primitives and use the VLM to define the spatial constraints on those primitives.

Prior primitive-based work has two weaknesses: primitive proposals are generated in a task-agnostic way, and post-processing them with manually designed rules is unstable.

The paper therefore asks how to build an effective, generalizable representation that bridges high-level VLM reasoning and low-level robotic manipulation.

Contribution

Proposes a new object-centric interaction representation that bridges the VLM's high-level commonsense reasoning and low-level robotic manipulation.

Proposes, for the first time, an open-vocabulary manipulation system with a dual closed loop over planning and execution that requires no VLM fine-tuning.

Zero-shot generalization across diverse manipulation scenarios.

3. Method

3.1. Manipulation with Interaction Primitives

Task Decomposition

Given a manipulation task $\mathcal{T}$ (e.g., pouring tea into a cup), GroundingDINO and SAM are used to extract masks for all objects in the scene.

GPT-4 then splits the task into multiple stages $\mathcal{S} = \{ \mathcal{S}_1, \mathcal{S}_2, \dots, \mathcal{S}_n \}$.

Each stage is written as
$\mathcal{S}_i= \{ A_i, \mathcal{O}_i^{active} , \mathcal{O}_i^{passive}\}$
- $A_i$ : the action to be performed (e.g., grasp, pour)
- $\mathcal{O}_i^{active}$ : the object that initiates the interaction
- $\mathcal{O}_i^{passive}$ : the object that is acted upon
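
The stage tuple maps naturally onto a small record type. The following is a minimal sketch, not the paper's code: the `Stage` class, its field names, and the hand-written decomposition (including treating the gripper as the active object while grasping) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage S_i = {A_i, O_i^active, O_i^passive}."""
    action: str          # A_i, e.g. "grasp" or "pour"
    active_object: str   # O_i^active: the object that drives the interaction
    passive_object: str  # O_i^passive: the object that is acted upon


# Hypothetical, hand-written decomposition of "pour tea into a cup".
# In the pipeline this list would come from the LLM, not from code.
stages = [
    Stage(action="grasp", active_object="gripper", passive_object="teapot"),
    Stage(action="pour", active_object="teapot", passive_object="cup"),
]

for i, stage in enumerate(stages, start=1):
    print(f"S_{i}: {stage.action} (active={stage.active_object}, passive={stage.passive_object})")
```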

Object-Centric Canonical Interaction Primitives

$\text{interaction primitive}: \mathcal{O} = \{p, v\}$

To describe how objects interact in manipulation tasks, the paper proposes an object-centric representation built on canonical interaction primitives.
- $p \in \mathbb{R}^3$ : interaction point
- $v \in \mathbb{R}^3$ : interaction direction

Because primitives are expressed in each object's canonical space, an object-centric coordinate frame unique to that object, they stay consistent across different scenarios.
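
To see why a canonical-space primitive carries over between scenes, it helps to map one into the world frame using an object pose. The sketch below assumes the pose is given as a rotation matrix `R` and a translation `t`; the function names and the teapot numbers are made up for illustration.

```python
import math


def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested tuples) to a 3-vector."""
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))


def primitive_to_world(R, t, p, v):
    """Map a canonical primitive (p, v) to the world frame given object pose (R, t)."""
    p_world = tuple(a + b for a, b in zip(rotate(R, p), t))  # points rotate and translate
    v_world = rotate(R, v)                                   # directions only rotate
    return p_world, v_world


def rot_z(angle_rad):
    """Rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


# Hypothetical spout primitive in the teapot's canonical frame.
p_canonical = (0.10, 0.0, 0.05)
v_canonical = (1.0, 0.0, 0.0)

# The same teapot placed at (0.5, 0.2, 0.0) and turned 90 degrees about z.
p_world, v_world = primitive_to_world(rot_z(math.pi / 2), (0.5, 0.2, 0.0), p_canonical, v_canonical)
print(p_world, v_world)
```

The primitive itself never changes; only the pose does, which is the property the canonical representation relies on.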

Interaction Primitives with Spatial Constraints

The spatial constraint $\mathcal{C}_i$ describes the relationship between the active and passive objects in stage $\mathcal{S}_i$.
$\mathcal{C}_i=\{ \mathcal{O}_i^{active}, \mathcal{O}_i^{passive}, d_i, \theta_i \}$
- $d_i$ : distance constraint (distance between the interaction points)
- $\theta_i$ : angular constraint (ensures proper alignment of the interaction directions)
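
One plausible reading of $d_i$ and $\theta_i$ is a target distance between the two interaction points and a target angle between the two interaction directions. The sketch below checks a configuration against such targets. The residual definition, the function names, and the pouring numbers are assumptions; the notes do not give the exact formula.

```python
import math


def angle_deg(v_a, v_b):
    """Angle between two 3-vectors in degrees."""
    dot = sum(a * b for a, b in zip(v_a, v_b))
    norm = math.hypot(*v_a) * math.hypot(*v_b)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def constraint_residual(p_active, v_active, p_passive, v_passive, d_target, theta_target_deg):
    """Return (distance error, angle error) relative to the targets d_i and theta_i."""
    d_err = abs(math.dist(p_active, p_passive) - d_target)
    theta_err = abs(angle_deg(v_active, v_passive) - theta_target_deg)
    return d_err, theta_err


# Hypothetical pouring pose: spout tip 5 cm above the cup opening,
# spout direction pointing straight down into the upward-facing opening.
residual = constraint_residual(
    p_active=(0.0, 0.0, 0.15), v_active=(0.0, 0.0, -1.0),
    p_passive=(0.0, 0.0, 0.10), v_passive=(0.0, 0.0, 1.0),
    d_target=0.05, theta_target_deg=180.0,
)
print(residual)
```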

3.2. Primitives and Constraints Extraction

Grounding Interaction Point

Interaction points fall into two categories: Visible/Tangible and Invisible/Intangible.

To help the VLM ground interaction points, the SCAFFOLD visual prompting mechanism is used, which overlays a Cartesian grid on the image.
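
The idea behind a grid overlay is that the VLM can answer with a grid coordinate, which is then mapped back to a pixel. The following sketch covers only that mapping. It is not the SCAFFOLD implementation; the grid size, the cell-center convention, and the function name are assumptions.

```python
def grid_anchor_pixels(width, height, rows, cols):
    """Return {(row, col): (x, y)} with the pixel center of each grid cell."""
    cell_w = width / cols
    cell_h = height / rows
    return {
        (r, c): ((c + 0.5) * cell_w, (r + 0.5) * cell_h)
        for r in range(rows)
        for c in range(cols)
    }


# Hypothetical 640x480 image with a 6x8 grid.
anchors = grid_anchor_pixels(width=640, height=480, rows=6, cols=8)

# If the VLM answered "row 2, column 5", the grounded pixel would be:
print(anchors[(2, 5)])
```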

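Visible points are extracted directly from the image, while invisible points are obtained by multi-view reasoning over the canonical object representation.
- If the location is ambiguous from the primary viewpoint, the system switches to an orthogonal view and reasons again.

For grasping tasks, the interaction point is not fixed to a single point; instead, a heatmap is generated from multiple interaction points.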
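
The notes do not say how the heatmap is built. One simple possibility, shown here purely as an assumption, is to place a Gaussian bump at each predicted point and normalize the sum. The function name, `sigma`, and the sample points are hypothetical.

```python
import math


def grasp_heatmap(points, width, height, sigma=1.5):
    """Sum one Gaussian per (x, y) point on a width x height grid, scaled so the peak is 1."""
    heat = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                sq_dist = (x - px) ** 2 + (y - py) ** 2
                heat[y][x] += math.exp(-sq_dist / (2.0 * sigma ** 2))
    peak = max(max(row) for row in heat)
    return [[value / peak for value in row] for row in heat]


# Two hypothetical grasp points next to each other on a teapot handle.
heatmap = grasp_heatmap(points=[(2, 2), (3, 2)], width=6, height=5)
for row in heatmap:
    print(" ".join(f"{value:.2f}" for value in row))
```

A grasp planner could then prefer candidates that fall in high-value cells rather than committing to one pixel.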

Sampling Interaction Direction

The principal axes of the object's canonical space serve as candidate interaction directions.

Because VLMs have limited spatial understanding, they struggle to judge how a direction relates to the task. The paper addresses this with VLM captioning followed by LLM scoring:
- VLM captioning: for each candidate axis, the VLM generates a semantic description.
- LLM scoring: given the task description, the LLM scores how relevant each axis description is to the task.

The output is a list of candidate interaction directions sorted by relevance, with the direction best suited to the task at the top. For each stage this gives an ordered list of constrained interaction primitives $K_i$:
- $C_i^{(j)}$ : one interaction primitive together with its spatial constraint

$K_i=\{C_i^{(1)},C_i^{(2)},\dots,C_i^{(N)}\}$
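
The ordering step can be sketched as a sort over scored axes. Everything concrete below is an assumption: the six signed axes as the candidate set, the axis names, and the scores, which stand in for what the LLM would return.

```python
# Candidate directions: both signs of each principal axis (an assumption of this sketch).
CANDIDATE_AXES = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def rank_directions(scores):
    """Return (axis name, unit vector) pairs sorted by descending relevance score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return [(name, CANDIDATE_AXES[name]) for name in ordered]


# Made-up relevance scores standing in for the LLM's output.
scores = {"+x": 0.9, "-x": 0.1, "+y": 0.2, "-y": 0.2, "+z": 0.4, "-z": 0.3}
ranked = rank_directions(scores)
for name, axis in ranked:
    print(name, axis, scores[name])
```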

3.3. Dual Closed-Loop System

What Sec. 3.2 produces is the interaction primitives $\mathcal{O}^{active}$, $\mathcal{O}^{passive}$ and the spatial constraints $C$. These, however, are the result of open-loop inference (a decision inferred once and never revisited).
- Open-loop inference has two limitations: large-model hallucination and the dynamic nature of real-world environments. The paper proposes a dual closed-loop system to address both.

Closed-loop Planning

This loop filters out hallucinated or inaccurate primitives in advance, during the selection of interaction primitives and constraints.

Resampling, Rendering, and Checking (RRC) is a self-correction loop that uses the VLM as an evaluator.

The RRC process consists of two phases:
- initial phase
  - The candidate list $K_i=\{C_i^{(1)},C_i^{(2)},\dots,C_i^{(N)}\}$ generated in Sec. 3.2 contains the interaction point, interaction direction, and distance / angle constraints.
  - For each candidate $C_i^{(k)}$, an image $I_i$ of the interaction scene with that constraint applied is rendered.
  - The task $\mathcal{T}$, stage $\mathcal{S}_i$, rendered image $I_i$, and constraint $C_i^{(k)}$ are given to the VLM, which returns one of three results: success, failure, or refinement.
  - success: the constraint is accepted and the task proceeds
  - failure: the next constraint is evaluated
  - refinement: the refinement phase begins
- refinement phase
  - To correct misalignments between the functional and geometric axes of the objects, 6 refined directions $v_i^{(j)}$ are generated by uniform sampling around the predicted interaction direction $v_i$.
  - The constraints containing the new directions are evaluated again.
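
The control flow of RRC can be sketched with a stub in place of the VLM. Several details are assumptions of this sketch rather than facts from the paper: the 6 refined directions are spaced evenly on a cone tilted 15 degrees from $v_i$, a candidate whose refinements all fail is skipped, and `stub_evaluator` replaces the render-then-ask-the-VLM step.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def refine_directions(v, tilt_deg=15.0, count=6):
    """Return `count` unit vectors evenly spaced on a cone of half-angle tilt_deg around v."""
    v = normalize(v)
    helper = (1.0, 0.0, 0.0) if abs(v[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(v, helper))  # first axis perpendicular to v
    w = cross(v, u)                  # second axis perpendicular to v and u
    tilt = math.radians(tilt_deg)
    refined = []
    for k in range(count):
        phi = 2.0 * math.pi * k / count
        refined.append(tuple(
            math.cos(tilt) * v[i] + math.sin(tilt) * (math.cos(phi) * u[i] + math.sin(phi) * w[i])
            for i in range(3)
        ))
    return refined


def rrc(candidates, evaluate):
    """Walk the ordered candidates; evaluate() returns 'success', 'failure', or 'refinement'."""
    for v in candidates:
        verdict = evaluate(v)
        if verdict == "success":
            return v
        if verdict == "refinement":
            for v_refined in refine_directions(v):
                if evaluate(v_refined) == "success":
                    return v_refined
    return None  # no candidate was accepted


def stub_evaluator(direction):
    """Stand-in for rendering the scene and querying the VLM."""
    if direction == (1.0, 0.0, 0.0):
        return "failure"
    if direction == (0.0, 0.0, 1.0):
        return "refinement"
    return "success" if direction[1] > 0.2 else "failure"


chosen = rrc([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], stub_evaluator)
print(chosen)
```

Here the first candidate is rejected, the second triggers refinement, and the first tilted direction is accepted.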

Closed-loop Execution

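The goal is to find the target end-effector pose $P^{ee*}$ that minimizes a loss made up of the following three terms.

constraint loss $\mathcal{L}_C$
- Measures how far the current spatial relationship between the active and passive objects deviates from the desired constraint $C$.
- $\Phi(\cdot)$ : function that maps the end-effector pose to the active object pose

collision loss $\mathcal{L}_{collision}$
- A penalty is incurred when the end-effector comes closer to an obstacle $O_j$ in the environment than the minimum safe distance $d_{min}$. If it is far enough away, the loss is 0.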
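
A hinge-style penalty matches this description: zero beyond $d_{min}$, growing as the end-effector gets closer. The squared hinge and the treatment of obstacles as points are assumptions of this sketch; the notes do not give the exact form.

```python
import math


def collision_loss(ee_position, obstacles, d_min):
    """Sum of squared violations of the minimum safe distance d_min."""
    loss = 0.0
    for obstacle in obstacles:
        violation = d_min - math.dist(ee_position, obstacle)
        loss += max(0.0, violation) ** 2  # zero when the obstacle is far enough away
    return loss


# Hypothetical scene: one obstacle 3 cm away, one 1 m away, safe distance 5 cm.
near_and_far = collision_loss((0.0, 0.0, 0.0), [(0.03, 0.0, 0.0), (1.0, 0.0, 0.0)], d_min=0.05)
only_far = collision_loss((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0)], d_min=0.05)
print(near_and_far, only_far)
```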

path loss $\mathcal{L}_{path}$
- Limits the translation and rotation change when moving from the current pose $P_{ee}^t$ to the new pose $P_{ee}$ so that neither becomes too large.
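
A path term of this kind can be sketched as a weighted sum of translation distance and rotation angle. The unit-quaternion pose representation, the weights `w_trans` and `w_rot`, and the function names are assumptions for illustration.

```python
import math


def rotation_angle(q_a, q_b):
    """Angle in radians between two unit quaternions given as (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))  # abs: q and -q are the same rotation
    return 2.0 * math.acos(min(1.0, dot))


def path_loss(pos_prev, quat_prev, pos_new, quat_new, w_trans=1.0, w_rot=1.0):
    """Penalize large translation and rotation changes between consecutive poses."""
    translation = math.dist(pos_prev, pos_new)
    rotation = rotation_angle(quat_prev, quat_new)
    return w_trans * translation + w_rot * rotation


# Hypothetical step: move 10 cm along x and rotate 90 degrees about z.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
step_cost = path_loss((0.0, 0.0, 0.0), identity, (0.1, 0.0, 0.0), quarter_turn_z)
print(step_cost)
```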

The end-effector pose $P_{ee}$ is adjusted continuously so that the constraint stays satisfied, collisions are avoided, and the motion remains smooth.
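
Taken together, the three terms suggest an objective of the following form. The unweighted sum is an assumption of this summary; the notes do not state how the terms are weighted.

$P^{ee*} = \arg\min_{P_{ee}} \left[ \mathcal{L}_C(\Phi(P_{ee})) + \mathcal{L}_{collision}(P_{ee}) + \mathcal{L}_{path}(P_{ee}) \right]$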

The constraint loss lets the end-effector pose be optimized through the interaction primitives and spatial constraints, but that formulation assumes a static environment. In practice, objects can move and the grasp pose can change.

To handle this, an off-the-shelf 6D object pose tracking algorithm continuously updates the active object pose $P^{active}_t$ and the passive object pose $P^{passive}_t$ in real time.

When an object pose changes, the collision loss value changes, and the end-effector target pose is recomputed accordingly.
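
The effect of closing the loop can be shown with a toy example in which the target is recomputed from the tracked object position at every tick. This is a heavily simplified stand-in, not the paper's optimizer: positions only, a fixed offset in place of the spatial constraint, and a proportional step in place of loss minimization. All names and numbers are hypothetical.

```python
def closed_loop_step(ee_pos, passive_point, offset, step=0.5):
    """Recompute the target from the latest tracked point and move part of the way there."""
    target = tuple(p + o for p, o in zip(passive_point, offset))
    return tuple(e + step * (t - e) for e, t in zip(ee_pos, target))


# Hypothetical tracker output: the cup sits still for 5 ticks, then is pushed aside.
tracked_cup_positions = [(0.5, 0.0, 0.1)] * 5 + [(0.6, 0.1, 0.1)] * 20

ee = (0.0, 0.0, 0.3)
for cup in tracked_cup_positions:
    ee = closed_loop_step(ee, cup, offset=(0.0, 0.0, 0.05))

print(ee)  # ends near 5 cm above the cup's final position
```

An open-loop plan would have kept heading for the first cup position; here the target follows the cup because it is rebuilt from the tracked pose on every iteration.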

4. Experiment


