CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling

CVPR 2026 Oral ✨
Li Jin*, Weikai Chen*, Yujie Wang†, Yingda Yin, Zeyu Hu, Runze Zhang,
Keyang Luo, Shengju Qian, Xin Wang, Xueying Qin†

Shandong University, Tencent LIGHTSPEED, UNC Chapel Hill

Task. Open-world promptable 3D semantic part segmentation: given a natural language query such as "chair leg" or "door handle", the model segments the target part directly on 3D geometry. CoSMo3D addresses this task by learning to reason in a canonical space rather than raw input pose space.
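The prompt-to-mask step above can be sketched with a common open-vocabulary recipe: embed the text query and each point into a shared feature space, then threshold cosine similarity. This is a minimal illustration, not CoSMo3D's exact segmentation head; `prompt_segment` and its threshold are assumptions for the sketch.

```python
import numpy as np

def prompt_segment(point_feats, text_feat, threshold=0.5):
    """Segment points matching a text prompt by cosine similarity.

    point_feats: (N, D) per-point features; text_feat: (D,) prompt embedding.
    Returns a boolean mask over the N points. The similarity-threshold
    scheme is a generic open-vocabulary recipe, not the paper's exact head.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t                      # (N,) cosine similarities in [-1, 1]
    return sim > threshold

# Toy usage: two feature clusters, prompt aligned with the first.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
mask = prompt_segment(feats, np.array([1.0, 0.0]))
```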

CoSMo3D teaser

Open-world promptable 3D semantic part segmentation remains brittle because semantics are inferred in raw input sensor coordinates. Humans, in contrast, interpret parts through their functional roles in a canonical space: wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To close this gap, we propose CoSMo3D, which attains canonical-space perception by inducing a latent canonical reference frame learned directly from data. By construction, we build a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding space yields far more stable and transferable part semantics. Experiments show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.

Motivation

Method: Canonical-Aware Dual Branch

CoSMo3D uses a dual-branch architecture: a feature extraction branch used for both training and inference, and a canonical embedding branch used only during training. Canonical supervision is distilled into the main representation, so inference incurs no extra runtime branch.
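The training-only distillation described above can be sketched as a per-point alignment loss between the two branches. The function name and the (1 − cosine) form are illustrative assumptions; the paper's actual objectives may differ.

```python
import numpy as np

def distill_loss(feat_branch, canon_branch):
    """Training-only distillation: pull the feature branch toward the
    canonical embedding branch (loss form is illustrative, not the
    paper's exact objective).

    Both inputs are (N, D) per-point embeddings. Returns the mean
    (1 - cosine similarity) over points; 0 when the branches agree,
    approaching 2 when they point in opposite directions.
    """
    f = feat_branch / np.linalg.norm(feat_branch, axis=1, keepdims=True)
    c = canon_branch / np.linalg.norm(canon_branch, axis=1, keepdims=True)
    cos = np.sum(f * c, axis=1)      # per-point cosine similarity
    return float(np.mean(1.0 - cos))
```

At inference, only the feature branch runs; the canonical branch and this loss are dropped entirely.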

framework

Canonical Learning

Data Construction. CoSMo3D builds a unified canonical dataset spanning 200 categories and around 17K shapes. LLM-guided semantic grouping and cross-category alignment expose transferable canonical spatial regularities.
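The cross-category alignment step might be pictured as mapping dataset-specific part labels into shared canonical groups. The table below is a hypothetical stand-in for the kind of mapping an LLM could produce; `CANONICAL_GROUPS` and `align_label` are not names from the paper.

```python
# Hypothetical alignment table of the kind an LLM-guided pipeline
# could produce: category-specific labels map to shared canonical groups.
CANONICAL_GROUPS = {
    "chair/leg": "leg",
    "table/leg": "leg",
    "chair/seat": "seat",
    "airplane/wing": "wing",
}

def align_label(category, raw_part):
    """Map a (category, raw part label) pair to its canonical group,
    falling back to the raw label when no alignment entry exists."""
    return CANONICAL_GROUPS.get(f"{category}/{raw_part}", raw_part)
```

Aligning "chair/leg" and "table/leg" to one "leg" group is what lets canonical spatial regularities (legs support from below) transfer across categories.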

Model Induction. CoSMo3D learns canonical-aware representations through three objectives: hard-negative semantic alignment, canonical map anchoring, and canonical box calibration, jointly improving boundary precision, symmetry robustness, and spatial calibration.
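One plausible form of the canonical box calibration objective is an L1 penalty between the predicted part's axis-aligned box in canonical space and a ground-truth canonical box. This is a sketch under that assumption; the paper's loss may be parameterized differently.

```python
import numpy as np

def box_calibration_loss(part_points, gt_box):
    """Penalize deviation of a predicted part's canonical-frame
    axis-aligned bounding box from a ground-truth canonical box
    (an illustrative form of 'canonical box calibration').

    part_points: (N, 3) points assigned to the part in canonical space.
    gt_box: (mins, maxs), each shape (3,).
    Returns the mean absolute error over both box corners.
    """
    mins, maxs = part_points.min(axis=0), part_points.max(axis=0)
    gt_min, gt_max = gt_box
    return float(np.mean(np.abs(mins - gt_min)) +
                 np.mean(np.abs(maxs - gt_max)))
```

A loss of this shape directly constrains the spatial envelope of thin and elongated parts, which is where the ablation reports the largest stabilization.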

canonical dataset construction canonical training objectives

Ablation: Why Canonical Losses Matter

Canonical map anchoring suppresses bleeding and false activation, while canonical box calibration stabilizes thin and elongated parts with better spatial envelopes.

canomap ablation bbox ablation

Experiments: Qualitative Results

CoSMo3D is more robust under geometry-semantic ambiguity, noisy small parts, cross-category prompts, and arbitrary object poses.

qualitative results

Experiments: Feature Space Visualization

Features from CoSMo3D are more semantically consistent across shape variation and rotation, indicating a more stable canonical embedding.

feature visualization

Experiments: Quantitative Highlights

Relative to the strong Find3D* baseline, CoSMo3D improves by +27.85% (Canonical) and +25.01% (Rotated) on 3DCompat-Coarse, and by +18.09% on ShapeNet-Part (Canonical). These gains indicate that reasoning in canonical embedding space improves stability and transferability.

BibTeX

@article{jin2026cosmo3d,
  title={CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling},
  author={Jin, Li and Chen, Weikai and Wang, Yujie and Yin, Yingda and Hu, Zeyu and Zhang, Runze and Luo, Keyang and Qian, Shengju and Wang, Xin and Qin, Xueying},
  journal={arXiv preprint arXiv:2603.01205},
  year={2026}
}