Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via their functional roles in a canonical space: wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To close this gap, we propose CoSMo3D, which attains canonical-space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from the input pose space to a canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
CoSMo3D uses a dual-branch architecture: a feature extraction branch used for both training and inference, and a canonical embedding branch used only during training. Canonical supervision is distilled into the main representation, so inference incurs no extra branch or runtime cost.
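A minimal PyTorch sketch of this training setup follows. The module shapes, distillation loss, and canonical input are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of the dual-branch training step; module and loss
# names are our own placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchModel(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Main branch: per-point features used at training and inference.
        self.feature_branch = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        # Canonical branch: consumes canonically aligned points; training only.
        self.canonical_branch = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, points, canonical_points=None):
        feats = self.feature_branch(points)        # (N, D)
        if canonical_points is None:               # inference path: one branch
            return feats, None
        canon_feats = self.canonical_branch(canonical_points)
        return feats, canon_feats

def distillation_loss(feats, canon_feats):
    # Pull main-branch features toward the (detached) canonical embedding,
    # so inference can drop the canonical branch entirely. The canonical
    # branch itself would be trained by its own objectives (not shown).
    return 1.0 - F.cosine_similarity(feats, canon_feats.detach(), dim=-1).mean()

model = DualBranchModel()
pts = torch.randn(1024, 3)
canon_pts = pts  # stand-in: in practice, the canonically aligned copy
f, c = model(pts, canon_pts)
loss = distillation_loss(f, c)
loss.backward()
```

At inference, calling `model(points)` with no canonical input exercises only the feature branch, matching the claim that no extra runtime branch is introduced.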
Data Construction. CoSMo3D builds a unified canonical dataset spanning 200 categories and around 17K shapes. LLM-guided semantic grouping and cross-category alignment expose transferable canonical spatial regularities.
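The grouping step could look roughly like the sketch below, where `query_llm` is a hypothetical stub for whatever chat-completion client is used; the actual prompts, LLM, and grouping procedure are not specified here.

```python
# Hypothetical sketch of LLM-guided cross-category part grouping; prompts
# and post-processing are illustrative only.
import json

def query_llm(prompt: str) -> str:
    """Stub for an LLM call; plug in any chat-completion client here."""
    raise NotImplementedError

def group_part_labels(part_labels):
    # Ask the LLM to merge part labels that play the same functional role
    # across categories (e.g., chair leg / table leg -> "leg").
    prompt = (
        "Group these 3D part labels by shared functional role and return "
        "JSON mapping group_name -> list of labels:\n" + json.dumps(part_labels)
    )
    return json.loads(query_llm(prompt))

labels = ["chair/leg", "table/leg", "airplane/wing", "bird/wing", "mug/handle"]
# Expected result shape: {"leg": [...], "wing": [...], "handle": [...]}
```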
Model Induction. CoSMo3D learns canonical-aware representations through three objectives: hard-negative semantic alignment, canonical map anchoring, and canonical box calibration, jointly improving boundary precision, symmetry robustness, and spatial calibration.
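A hedged sketch of what these three objectives might look like as losses; the function names, inputs, and exact formulations below are our illustration, not the paper's definitions.

```python
# Illustrative loss sketches for the three objectives (assumed interfaces).
import torch
import torch.nn.functional as F

def hard_negative_alignment(point_feats, text_feats, pos_idx, neg_idx, tau=0.07):
    # InfoNCE-style: contrast each point feature against its positive text
    # prompt and K hard negatives (e.g., geometrically similar parts).
    logits = point_feats @ text_feats.t() / tau        # (N, T)
    pos = logits.gather(1, pos_idx[:, None])           # (N, 1)
    neg = logits.gather(1, neg_idx)                    # (N, K)
    return -F.log_softmax(torch.cat([pos, neg], dim=1), dim=1)[:, 0].mean()

def canonical_map_anchoring(pred_map_logits, canonical_map):
    # Anchor predicted part activations to a canonical spatial map,
    # penalizing activations in regions the part never occupies.
    return F.binary_cross_entropy_with_logits(pred_map_logits, canonical_map)

def canonical_box_calibration(pred_box, canonical_box):
    # Regress each part's canonical-space bounding box (center + extent).
    return F.l1_loss(pred_box, canonical_box)

def total_loss(outs, w=(1.0, 1.0, 1.0)):
    # outs holds the tuples of inputs for each objective.
    return (w[0] * hard_negative_alignment(*outs["align"]) +
            w[1] * canonical_map_anchoring(*outs["map"]) +
            w[2] * canonical_box_calibration(*outs["box"]))
```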
Canonical map anchoring suppresses label bleeding and false activations, while canonical box calibration stabilizes thin and elongated parts with tighter spatial envelopes.
CoSMo3D is more robust to geometry-semantic ambiguity, noisy small parts, cross-category prompts, and arbitrary object poses.
Features from CoSMo3D are more semantically consistent across shape variation and rotation, indicating a more stable canonical embedding.
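One way to probe this stability is a rotation-consistency check like the sketch below; the metric and the assumed per-point-feature model interface are our own illustration, not the paper's evaluation protocol.

```python
# Hypothetical rotation-consistency probe; assumes model(points) returns
# per-point features of shape (N, D).
import torch
import torch.nn.functional as F

def random_rotation():
    # Random 3D rotation via QR decomposition of a Gaussian matrix.
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    if torch.det(q) < 0:
        q[:, 0] = -q[:, 0]  # ensure a proper rotation (det = +1)
    return q

@torch.no_grad()
def rotation_consistency(model, points, n_rot=8):
    ref = F.normalize(model(points), dim=-1)
    sims = []
    for _ in range(n_rot):
        rotated = points @ random_rotation().t()
        feats = F.normalize(model(rotated), dim=-1)
        sims.append((ref * feats).sum(-1).mean())  # per-point cosine sim
    return torch.stack(sims).mean()  # closer to 1 = more rotation-stable
```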
Relative to the strong Find3D* baseline, CoSMo3D gains +27.85% (Canonical) and +25.01% (Rotated) on 3DCompat-Coarse, and +18.09% on ShapeNet-Part (Canonical). These gains indicate that reasoning in a canonical embedding space improves stability and transferability.
@article{jin2026cosmo3d,
  title={CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling},
  author={Jin, Li and Chen, Weikai and Wang, Yujie and Yin, Yingda and Hu, Zeyu and Zhang, Runze and Luo, Keyang and Qian, Shengju and Wang, Xin and Qin, Xueying},
  journal={arXiv preprint arXiv:2603.01205},
  year={2026}
}