Our approach enables category-level object canonicalization using a single prior model per category. We begin by using large language models (LLMs) and vision-language models (VLMs) to capture the 3D semantics of both the prior model and the test model, establishing semantic correspondences (left). Next, we generate canonical pose hypotheses and introduce a joint energy function that integrates semantic and geometric cues, enabling accurate alignment with the prior model (middle). Finally, we identify the optimal canonical pose using a semantic relationship map (right) by evaluating the consistency of semantic positions.
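The hypothesis-scoring step above can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the energy terms (`semantic_energy`, `geometric_energy`), the weighting scheme, and the rotation-hypothesis search are all simplified placeholders we introduce here. We score each candidate rotation with a weighted sum of a keypoint-correspondence term (semantic cue) and a Chamfer-style term (geometric cue), then keep the lowest-energy pose.

```python
import numpy as np

def semantic_energy(test_kp, prior_kp):
    # Mean distance between corresponding semantic keypoints
    # (assumes correspondences were already established by the LLM/VLM stage).
    return float(np.linalg.norm(test_kp - prior_kp, axis=1).mean())

def geometric_energy(test_pts, prior_pts):
    # Symmetric Chamfer distance between the two point sets as a geometric cue.
    d = np.linalg.norm(test_pts[:, None] - prior_pts[None], axis=2)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def joint_energy(R, test_kp, prior_kp, test_pts, prior_pts, w=0.5):
    # Apply the candidate rotation R to the test model, then combine both cues.
    # The 50/50 weight w is an arbitrary choice for this sketch.
    return (w * semantic_energy(test_kp @ R.T, prior_kp)
            + (1.0 - w) * geometric_energy(test_pts @ R.T, prior_pts))

def best_canonical_pose(hypotheses, test_kp, prior_kp, test_pts, prior_pts):
    # Pick the pose hypothesis that minimizes the joint energy.
    return min(hypotheses,
               key=lambda R: joint_energy(R, test_kp, prior_kp, test_pts, prior_pts))
```

For example, if the test model is the prior rotated 90° about the z-axis, scoring the hypothesis set {identity, −90° about z} selects the −90° rotation, which maps the test model back onto the prior.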
Compared to existing datasets, COD features the largest number of categories and shapes. More importantly, with just two annotators working for approximately eight hours, we obtained 33k valid aligned shapes and completed the alignment of the full 40k-shape dataset, demonstrating the method's capability to scale to larger datasets. Next, we plan to process the Objaverse-1.0 dataset, which contains 800k shapes.
The figure below compares the Objaverse-LVIS dataset before and after applying our canonicalization method. Before canonicalization, only 24% of the objects were properly aligned. After our processing, the proportion of canonicalized data increased by 55 percentage points to 79%, highlighting the effectiveness of our approach. We then created the Canonical Objaverse Dataset (COD) by extracting this 79% of canonical objects from the Canonical Objaverse-LVIS Dataset.
We propose a one-shot canonicalization method based on semantic and support information, enabling the canonicalization of 3D objects within the same category even in the presence of significant differences in shape and appearance. The term "Initial" refers to the original objects, while "Canonical" denotes the canonicalized objects. The terms "Objects" and "Semantics" represent the shape and meaning of the objects, respectively.
(Use the slider to compare the results before and after canonicalization.)