One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency

CVPR 2025

Li Jin1,7
Yujie Wang3,4
Wenzheng Chen5,6
Qiyu Dai3
Qingzhe Gao3
Xueying Qin1,7
Baoquan Chen2*
1School of Software, Shandong University
2State Key Laboratory of General Artificial Intelligence, Peking University
3School of Intelligence Science and Technology, Peking University
4UNC Chapel Hill
5Wangxuan Institute of Computer Technology, Peking University
6State Key Laboratory of Multimedia Information Processing, Peking University, Beijing, P.R. China
7Engineering Research Center of Digital Media Technology, Ministry of Education, P.R. China
3D object canonicalization is a fundamental task, essential for a variety of downstream tasks. Existing methods rely on either cumbersome manual processes or priors learned from extensive, per-category training samples. Real-world datasets, however, often exhibit long-tail distributions, challenging existing learning-based methods, especially in categories with limited samples. We address this by introducing the first one-shot category-level object canonicalization framework, requiring only a single canonical model as a reference (the "prior model") for each category. To canonicalize any object, our framework first extracts semantic cues with large language models (LLMs) and vision-language models (VLMs) to establish correspondences with the prior model. We introduce a novel joint energy function to enforce geometric and semantic consistency, aligning object orientations precisely despite significant shape variations. Moreover, we adopt a support-plane strategy to reduce the search space for initial poses and utilize a semantic relationship map to select the canonical pose from multiple hypotheses. Extensive experiments on multiple datasets demonstrate that our framework achieves state-of-the-art performance and validate its key design choices. Using our framework, we create the Canonical Objaverse Dataset (COD), canonicalizing 32K samples in the Objaverse-LVIS dataset, underscoring the effectiveness of our framework in handling large-scale datasets.
Method Overview

Our approach enables category-level object canonicalization using a single prior model for each category. We begin by using large language models (LLMs) and vision-language models (VLMs) to capture the 3D semantics of both the prior model and the test model, establishing semantic correspondences between them (left). Next, we generate canonical pose hypotheses and introduce a joint energy function that integrates semantic and geometric cues, enabling accurate alignment with the prior model (middle). Finally, we identify the optimal canonical pose using a semantic relationship map, which evaluates the consistency of semantic positions (right).

[Figure: method overview]
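
As a rough illustration of the middle stage, the sketch below scores a handful of pose hypotheses with a joint energy and keeps the lowest-energy one. This page does not give the actual form of the energy, so everything here is an assumption made for illustration: the geometric term is a symmetric chamfer distance, the semantic term is the mean distance between centroids of parts sharing a label, `weight` is a hypothetical balancing factor, and the hypotheses are four yaw rotations rather than support-plane poses. The final selection with a semantic relationship map is not covered.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation matrix about the z (up) axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def chamfer(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def semantic_term(pts, labels, ref_pts, ref_labels):
    """Mean distance between centroids of parts that share a semantic label."""
    shared = np.intersect1d(labels, ref_labels)
    if shared.size == 0:
        return 0.0
    dists = [
        np.linalg.norm(pts[labels == k].mean(axis=0)
                       - ref_pts[ref_labels == k].mean(axis=0))
        for k in shared
    ]
    return float(np.mean(dists))

def joint_energy(pts, labels, ref_pts, ref_labels, weight=1.0):
    """Assumed form: geometric term plus weighted semantic term."""
    return chamfer(pts, ref_pts) + weight * semantic_term(
        pts, labels, ref_pts, ref_labels)

def best_hypothesis(pts, labels, ref_pts, ref_labels, hypotheses, weight=1.0):
    """Apply each candidate rotation to pts and return the index of the
    lowest-energy candidate together with all energies."""
    energies = [
        joint_energy(pts @ R.T, labels, ref_pts, ref_labels, weight)
        for R in hypotheses
    ]
    return int(np.argmin(energies)), energies
```

A possible usage: label a synthetic prior point cloud by the sign of its x coordinate, rotate a copy by 90 degrees about the up axis, and call `best_hypothesis` with `[yaw_rotation(k * np.pi / 2) for k in range(4)]`. The candidate that undoes the rotation should come out with near-zero energy.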

Comparison with Canonical Dataset

Compared to existing datasets, COD features the largest number of categories and shapes. More importantly, with just two annotators working for approximately eight hours, we completed the alignment of the 40k-shape dataset and obtained 33k valid samples, which demonstrates the framework's capability to handle larger-scale datasets. Next, we will process the Objaverse-1.0 dataset, which contains 800k shapes.

[Figure: comparison of COD with existing canonical datasets]

Data Statistics and Analysis

The figure below compares the Objaverse-LVIS dataset before and after applying our canonicalization method. Before canonicalization, only 24% of the objects were properly aligned. After our processing, the proportion of canonicalized data increased by 55 percentage points, highlighting the effectiveness of our approach. We then created the Canonical Objaverse Dataset (COD) by extracting the 79% of canonical objects from the Canonical Objaverse-LVIS Dataset.

[Figure: Objaverse-LVIS before and after canonicalization]
Visual Comparisons

We propose a one-shot canonicalization method based on semantic and support-plane information, enabling the canonicalization of 3D objects within the same category even when they differ significantly in shape and appearance. In the examples, "Initial" refers to the original objects and "Canonical" to the canonicalized objects. "Objects" and "Semantics" denote the shape of each object and its semantic parts, respectively.


Canonicalization Process
"banana"
"bass_horn"
"bicycle"
"fighter_jet"
"foal"
"foal"
"pocket_watch"
"shears"
Canonicalization Presentation
Category: "race_car"
Category: "raincoat"
Category: "ring"
Category: "shears"
Category: "shepherd_dog"
Category: "shield"
Category: "shoe"
Category: "shopping_cart"
Category: "sink"
Category: "skateboard"
Category: "squirrel"
Category: "tapestry"
Category: "trailer_truck"
Category: "vodka"