[ECCV 2024] Grounding Image Matching in 3D with MASt3R

gomduribo 2026. 1. 25. 15:54

MASt3R is a pipeline that, given two images, produces pairwise pixel correspondences (matches).

Traditional keypoint-based matching pipelines usually consist of three stages:
- Sparse & repeatable keypoint detection
- Locally invariant feature description
- Keypoint pairing based on feature-space distance
→ Accurate when illumination and viewpoint changes are small; cheap to compute because keypoints are sparse; under good conditions, very accurate matching is possible within a few ms

However, these keypoint-based methods treat matching as a problem of matching two independent sets of keypoints (bag-of-keypoints) and ignore global geometric context (relative positions between keypoints, the overall scene layout, etc.).

Works such as SuperGlue perform global optimization at the pairing stage (e.g. using attention to model relations between keypoints) and reach reasonable performance, but since the global optimization only acts at the pairing stage, the keypoint locations and the descriptors themselves may already lack the needed information.

The arrival of global attention mechanisms made it possible to process the whole image holistically → dense holistic matching methods appeared, which match the entire image instead of keypoints in it (e.g. LoFTR).

LoFTR, however, still treats matching as a 2D problem in image space.

DUSt3R tackles 3D reconstruction rather than matching, so correspondences between the two images are merely a by-product of its 3D output. Even so, correspondences naively extracted from this 3D output outperform keypoint-based and matching-based methods.

DUSt3R could therefore be used for matching as is, but its 2D correspondences are imprecise, so a second head that outputs dense local feature maps is attached and trained with an InfoNCE loss.

MASt3R, a 3D-aware matching approach built on DUSt3R, is proposed:
- Extracts local feature maps that enable highly accurate / robust matching
- Proposes a coarse-to-fine matching scheme that also handles high-resolution images
- Achieves SOTA on absolute / relative pose localization benchmarks

3. Method

Given two images ($I^1, I^2$) captured by cameras $C^1, C^2$ with unknown parameters, the goal is to obtain pixel correspondences $\{ (i,j) \}$.

3.1. The DUSt3R framework

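DUSt3R is a framework that takes two images and jointly solves calibration and 3D reconstruction.

DUSt3R uses a transformer-based network to predict a local 3D reconstruction from the two images, in the form of pointmaps $X^{1,1}, X^{2,1}$ (point clouds aligned with the pixel grid).

A pointmap $X^{a,b} \in \mathbb{R}^{H \times W \times 3}$ is a 2D-to-3D mapping between each pixel $i=(u,v)$ of image $I^a$ and the 3D point $X^{a,b}_{u,v} \in \mathbb{R}^3$ expressed in the coordinate frame of camera $C^b$.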
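To make the pointmap data structure concrete, here is a minimal NumPy sketch. DUSt3R regresses pointmaps directly with a network; this sketch only shows what an $H \times W \times 3$ pointmap contains, under the assumption that a depth map and pinhole intrinsics `K` are available. The helper names `depth_to_pointmap` and `change_frame` are hypothetical.

```python
import numpy as np

def depth_to_pointmap(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject an (H, W) depth map into an (H, W, 3) pointmap X^{a,a},
    i.e. 3D points expressed in the frame of the camera that took the image.
    Assumes a pinhole camera with intrinsics K (illustrative assumption)."""
    H, W = depth.shape
    # u indexes columns (x), v indexes rows (y); both grids have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def change_frame(pointmap: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Re-express a pointmap in another camera frame: X' = R X + t.
    For example, X^{2,2} -> X^{2,1} given the pose of camera 2 in camera 1."""
    return pointmap @ R.T + t

# Usage: a flat wall 2 units away, seen by a toy camera.
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
X_11 = depth_to_pointmap(np.full((3, 4), 2.0), K)
```

Each entry `X_11[v, u]` is the 3D point seen at pixel `(u, v)`, which is exactly the 2D-to-3D mapping described above.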

Both input images are encoded by the same (shared) encoder.

$$H^1 = \text{Encoder}(I^1), \qquad H^2 = \text{Encoder}(I^2)$$

The extracted representations $H^1, H^2$ are fed to the decoder, where cross-attention computes the spatial relationship between the two viewpoints.

$$H^{\prime 1}, H^{\prime 2} = \text{Decoder}(H^1, H^2)$$

The encoder and decoder representations are concatenated, and two prediction heads regress the final pointmaps and confidence maps.

$$X^{1,1}, C^1 = \text{Head}^1_{3D}([H^1, H^{\prime 1}]), \qquad X^{2,1}, C^2 = \text{Head}^2_{3D}([H^2, H^{\prime 2}])$$

Regression loss

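DUSt3R is trained with the following loss:

$$\ell_{\text{regr}}(v,i) = \left\| \frac{1}{z} X^{v,1}_i - \frac{1}{\hat z} \hat X^{v,1}_i \right\|$$

- $v \in \{1, 2\}$ : the two views
- $i$ : a pixel for which the ground-truth 3D point $\hat X^{v,1}_i \in \mathbb{R}^3$ is defined
- $z, \hat z$ : scale normalization factors for the prediction and the ground truth (mean distance of all valid 3D points to the origin → scale-invariant reconstruction)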

MASt3R argues that scale invariance is not always desirable; map-free visual localization, for example, needs metric scale.
→ When the ground truth is metric, the prediction is no longer normalized by its own factor: $z := \hat z$ is used, which gives

$$\ell_{\text{regr}}(v,i) = \frac{\left\| X^{v,1}_i - \hat X^{v,1}_i \right\|}{\hat z}$$
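The difference between the scale-invariant and the metric variant can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it handles a single view, computes the normalization factor over that one pointmap only, and the names `regression_loss` and `gt_is_metric` are hypothetical.

```python
import numpy as np

def mean_distance_to_origin(points: np.ndarray, valid: np.ndarray) -> float:
    """Normalization factor: mean norm of the valid 3D points."""
    return float(np.linalg.norm(points[valid], axis=-1).mean())

def regression_loss(pred, gt, valid, gt_is_metric):
    """Per-pixel regression loss for one view.
    pred, gt: (H, W, 3) pointmaps; valid: (H, W) bool mask of pixels with GT."""
    z_hat = mean_distance_to_origin(gt, valid)
    # Metric GT: reuse the GT factor for the prediction (z := z_hat), so a
    # prediction at the wrong scale is penalized. Otherwise normalize the
    # prediction by its own factor, which makes the loss scale-invariant.
    z = z_hat if gt_is_metric else mean_distance_to_origin(pred, valid)
    return np.linalg.norm(pred[valid] / z - gt[valid] / z_hat, axis=-1)

# Usage: a prediction with the right shape but 3x the true scale.
rng = np.random.default_rng(0)
gt = rng.normal(size=(4, 5, 3))
valid = np.ones((4, 5), dtype=bool)
loss_invariant = regression_loss(3.0 * gt, gt, valid, gt_is_metric=False)
loss_metric = regression_loss(3.0 * gt, gt, valid, gt_is_metric=True)
```

With the scale-invariant variant the wrongly scaled prediction costs nothing, while the metric variant penalizes it, which is the behaviour map-free localization needs.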

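DUSt3R's final confidence-aware regression loss is:

$$\mathcal{L}_{\text{conf}} = \sum_{v \in \{1,2\}} \sum_{i \in \mathcal{V}^v} C^v_i \, \ell_{\text{regr}}(v,i) - \alpha \log C^v_i$$

where $C^v_i$ is the predicted confidence at pixel $i$ of view $v$, $\mathcal{V}^v$ is the set of valid pixels and $\alpha$ is a regularization hyper-parameter.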
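As a small sketch of how a confidence-aware loss behaves, assume the form $\sum_i C_i \, \ell_i - \alpha \log C_i$: the confidence scales the regression error, and the $-\alpha \log C_i$ term stops the network from pushing every confidence to zero. The function name and the value of `alpha` below are illustrative assumptions.

```python
import numpy as np

def confidence_loss(regr: np.ndarray, conf: np.ndarray, alpha: float) -> float:
    """Confidence-weighted regression loss, summed over valid pixels.
    regr: per-pixel regression losses (> 0); conf: per-pixel confidences (> 0)."""
    return float(np.sum(conf * regr - alpha * np.log(conf)))

# For a fixed error l, d/dC (C*l - alpha*log C) = l - alpha/C = 0 gives
# C = alpha / l: the larger the error, the lower the optimal confidence.
regr = np.array([0.5, 2.0])
alpha = 0.2  # illustrative value, not taken from the post
best_conf = alpha / regr
loss_at_best = confidence_loss(regr, best_conf, alpha)
```

Pixels that are hard to regress therefore end up with low confidence, which is what makes the confidence map usable for filtering.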

3.2. Matching prediction head and loss

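With DUSt3R, correspondences are obtained by looking for mutual (reciprocal) matches in an invariant feature space (the regressed 3D pointmaps). This is robust even to extreme viewpoint changes, but the pixel-level accuracy of the correspondences is low.

MASt3R adds a head that predicts dense feature maps $D^1, D^2 \in \mathbb{R}^{H \times W \times d}$.

$$D^1 = \text{Head}^1_{\text{desc}}([H^1, H^{\prime 1}]), \qquad D^2 = \text{Head}^2_{\text{desc}}([H^2, H^{\prime 2}])$$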

An InfoNCE loss is applied over the ground-truth correspondences $\hat M = \{ (i,j) \mid \hat X_i^{1,1} = \hat X_j^{2,1} \}$:
- $\mathcal{P}^1 = \{ i \mid (i,j) \in \hat M \}$, $\mathcal{P}^2 = \{ j \mid (i,j) \in \hat M \}$ : subsets of considered pixels in each image
- $\tau$ : temperature hyper-parameter

Unlike the regression loss used to train DUSt3R, this loss is designed to give a reward only when the exactly correct pixel is found, not a nearby one.
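A symmetric InfoNCE loss of this kind can be sketched as two cross-entropy terms over a similarity matrix: for each ground-truth pair, the matching descriptor must win against all other considered pixels, in both directions. The sketch below assumes the generic form (dot-product similarity divided by a temperature); the exact similarity and temperature convention of the paper may differ, and `matching_loss` and the default `temperature` are hypothetical.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def matching_loss(desc1: np.ndarray, desc2: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over N ground-truth matches.
    desc1, desc2: (N, d) descriptors; row n of desc1 matches row n of desc2."""
    logits = desc1 @ desc2.T / temperature          # (N, N) similarities
    idx = np.arange(len(desc1))
    # Image 1 -> image 2: the true match competes with all pixels in P^2.
    loss_12 = -log_softmax(logits, axis=1)[idx, idx]
    # Image 2 -> image 1: the true match competes with all pixels in P^1.
    loss_21 = -log_softmax(logits, axis=0)[idx, idx]
    return float((loss_12 + loss_21).sum())

# Usage: correct descriptors vs. descriptors that point at the wrong pixel.
desc = np.eye(4)
loss_correct = matching_loss(desc, desc)
loss_shifted = matching_loss(desc, np.roll(desc, 1, axis=0))
```

Because the loss is a classification over pixels, a descriptor that is most similar to a neighbouring pixel gets no partial credit, in contrast to a regression loss.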

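The final loss is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{conf}} + \beta \, \mathcal{L}_{\text{match}}$$

where $\mathcal{L}_{\text{conf}}$ is the confidence-aware regression loss, $\mathcal{L}_{\text{match}}$ is the matching loss and $\beta$ is a weighting hyper-parameter.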

3.3. Fast reciprocal matching

When extracting a set of reliable pixel correspondences from the predicted dense feature maps $D^1, D^2 \in \mathbb{R}^{H \times W \times d}$, conventional reciprocal matching has to compare every pair of pixels and therefore has a complexity of $O(W^2H^2)$.
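For reference, brute-force reciprocal (mutual nearest-neighbor) matching can be sketched as follows. It assumes the descriptor maps are flattened to shape `(H*W, d)` and uses dot-product similarity; the function name is hypothetical.

```python
import numpy as np

def reciprocal_matches_dense(D1: np.ndarray, D2: np.ndarray) -> np.ndarray:
    """Mutual nearest neighbors between two flattened descriptor maps.
    D1: (N1, d), D2: (N2, d). Returns an (M, 2) array of index pairs (i, j)."""
    sim = D1 @ D2.T                 # (N1, N2): the O(W^2 H^2) all-pairs step
    nn12 = sim.argmax(axis=1)       # best pixel in image 2 for each pixel of image 1
    nn21 = sim.argmax(axis=0)       # best pixel in image 1 for each pixel of image 2
    i = np.arange(len(D1))
    mutual = nn21[nn12] == i        # keep i only if its match points back to i
    return np.stack([i[mutual], nn12[mutual]], axis=1)

# Usage: image 2 holds the same three descriptors in permuted order.
D1 = np.eye(3)
D2 = np.eye(3)[[2, 0, 1]]
pairs = reciprocal_matches_dense(D1, D2)
```

The full similarity matrix has `N1 * N2` entries, which is what becomes prohibitive for dense feature maps.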

The paper therefore proposes a method based on sub-sampling. It starts from a sparse set of $k$ pixels $U^0 = \{ U_n^0 \}_{n=1}^k$ selected by grid sampling in image $I^1$.

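Each selected pixel is mapped to its nearest neighbor (NN) in $I^2$, giving $V^0$. $V^0$ is then mapped back to $I^1$ in the same way, giving $U^1$.

Pairs that form a cycle, $\mathcal{M}_k^t = \{ (U_n^t, V_n^t) \mid U_n^t = U_n^{t+1} \}$, are accepted as reciprocal matches.

Pixels that have already converged are removed for the next iteration: $U^{t+1} := U^{t+1} \setminus U^t$

$V^{t+1}$ is filtered in the same way.

Final correspondences: $\mathcal{M}_k = \bigcup_t \mathcal{M}_k^t$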

As the corresponding figure shows, most correspondences converge after a few iterations, which lowers the computational cost.
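The iterative scheme can be sketched in NumPy as below. This is a simplified illustration, not the authors' implementation: it uses brute-force dot-product nearest-neighbor search on flattened descriptors, samples the start set evenly along the flattened index instead of on a 2D grid, and only filters $U$ (not $V$). All names are hypothetical.

```python
import numpy as np

def nearest(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Index of the most similar database descriptor for each query row."""
    return (query @ database.T).argmax(axis=1)

def fast_reciprocal_matching(D1, D2, k, max_iters=10):
    """D1: (N1, d), D2: (N2, d) flattened descriptor maps.
    Returns an (M, 2) array of reciprocal matches (index in D1, index in D2)."""
    U = np.unique(np.linspace(0, len(D1) - 1, k).astype(int))   # U^0
    matches = set()
    for _ in range(max_iters):
        if len(U) == 0:
            break
        V = nearest(D1[U], D2)           # U^t -> V^t
        U_next = nearest(D2[V], D1)      # V^t -> U^{t+1}
        converged = U_next == U          # cycle closed: reciprocal match
        matches.update(zip(U[converged].tolist(), V[converged].tolist()))
        # U^{t+1} := U^{t+1} \ U^t, so converged points are not revisited.
        U = np.setdiff1d(U_next, U)
    return np.array(sorted(matches), dtype=int).reshape(-1, 2)

# Usage: identical descriptor sets, starting from 3 of the 5 pixels.
D = np.eye(5)
found = fast_reciprocal_matching(D, D, k=3)
```

Each iteration costs on the order of `k * N` similarity evaluations instead of `N1 * N2`, and the active set shrinks quickly as points converge.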

3.4. Coarse-to-fine matching

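Because of the computational complexity of attention, MASt3R only handles images whose longest side is 512 pixels, so high-resolution images have to be downscaled before being fed to the network.

When correspondences computed this way are upscaled back to the original resolution, accuracy can degrade.

Coarse-to-fine matching is a technique that keeps the benefit of matching high-resolution images while still using the low-resolution algorithm.

Coarse matching: downscale the input images $I^1, I^2$ and run MASt3R + fast reciprocal matching → $\mathcal{M}_k^0$ (set of coarse correspondences)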

Generate a grid of overlapping windows $W^1, W^2 \in \mathbb{R}^{w \times 4}$ on each original image (each row holds the four coordinates of one window).
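A grid of overlapping windows can be generated as in the sketch below. The window size of 512 follows the network's input limit mentioned above; the 50% overlap, the `(x0, y0, x1, y1)` row layout and the function names are assumptions for illustration.

```python
import numpy as np

def starts(length: int, size: int, stride: int) -> list:
    """Start offsets so that windows of `size` cover [0, length)."""
    if length <= size:
        return [0]
    s = list(range(0, length - size, stride))
    s.append(length - size)          # last window sits flush with the border
    return s

def make_windows(width: int, height: int, size: int = 512, overlap: float = 0.5) -> np.ndarray:
    """Return a (w, 4) array of windows, one (x0, y0, x1, y1) row per window."""
    stride = max(1, int(size * (1 - overlap)))
    boxes = [(x, y, min(x + size, width), min(y + size, height))
             for y in starts(height, size, stride)
             for x in starts(width, size, stride)]
    return np.array(boxes)

# Usage: a 1024 x 512 image yields three half-overlapping windows.
W1 = make_windows(1024, 512)
```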

Over all possible window pairs $(w_1, w_2) \in W^1 \times W^2$:
- Select only the window pairs that contain the most coarse correspondences $\mathcal{M}_k^0$
- Add window pairs greedily one at a time until the selected pairs cover at least 90% of the coarse correspondences
- Run MASt3R + fast reciprocal matching on each selected window pair independently
- Convert the correspondences obtained in each window pair from window coordinates → original image coordinates
- Concatenate the results of all window pairs → dense correspondence set at full resolution
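The greedy window-pair selection and the coordinate conversion can be sketched as follows. The sketch assumes coarse correspondences are already expressed in full-resolution pixel coordinates as rows `(x1, y1, x2, y2)`, windows are `(x0, y0, x1, y1)` boxes, and crops are matched at native resolution so that only an offset is needed. All function names are hypothetical.

```python
import numpy as np

def inside(points: np.ndarray, box) -> np.ndarray:
    """Boolean mask of the (M, 2) points that fall inside box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1))

def select_window_pairs(coarse, windows1, windows2, coverage=0.9):
    """Greedily pick window pairs until `coverage` of coarse matches is covered.
    Returns a list of (index into windows1, index into windows2)."""
    if len(coarse) == 0:
        return []
    in1 = np.stack([inside(coarse[:, :2], w) for w in windows1])   # (n1, M)
    in2 = np.stack([inside(coarse[:, 2:], w) for w in windows2])   # (n2, M)
    covers = in1[:, None, :] & in2[None, :, :]                     # (n1, n2, M)
    covered = np.zeros(len(coarse), dtype=bool)
    selected = []
    while covered.mean() < coverage:
        gain = (covers & ~covered).sum(axis=-1)    # new matches each pair adds
        a, b = np.unravel_index(gain.argmax(), gain.shape)
        if gain[a, b] == 0:                        # nothing left to gain
            break
        selected.append((int(a), int(b)))
        covered |= covers[a, b]
    return selected

def to_full_resolution(window_matches, box1, box2):
    """Shift (x1, y1, x2, y2) matches from window to original image coordinates."""
    offset = np.array([box1[0], box1[1], box2[0], box2[1]])
    return window_matches + offset

# Usage: two side-by-side windows per image and 20 coarse matches.
wins = np.array([[0, 0, 10, 10], [10, 0, 20, 10]])
coarse = np.array([[1, 1, 2, 2]] * 12 + [[11, 1, 12, 2]] * 7 + [[1, 1, 12, 2]])
chosen = select_window_pairs(coarse, wins, wins)
shifted = to_full_resolution(np.array([[1, 2, 3, 4]]), wins[1], wins[0])
```

Selecting by marginal gain keeps the number of network passes small, since window pairs that share no coarse correspondences are never evaluated.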

4. Experimental results
