Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot model-free object orientation estimation via analysis-by-synthesis and efficient arrow-based object manipulation.
We curate our Objaverse-OA dataset with VLM pre-processing followed by manual correction, striking a balance between efficiency and accuracy. The figure above shows the VLM's estimation error rate across categories. We observe that (1) the VLM struggles in particular to recognize front-facing orientations for stick-like objects, and (2) a significant portion of recognition errors occur on objects with inherently ambiguous or unclear frontal views. These challenges highlight the necessity of our manual curation.
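The exact VLM, prompt, and rendering setup used for pre-labeling are not detailed here; the snippet below is only a minimal sketch of how such a pass could look, assuming an OpenAI-style chat API over pre-rendered views (the model choice, prompt wording, and `estimate_front_view` helper are all illustrative assumptions):

```python
# Minimal sketch of a VLM pre-labeling pass (illustrative; the actual
# model, prompt, and renders used for Objaverse-OA may differ).
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK

client = OpenAI()

def encode_image(path: str) -> str:
    """Read a rendered view and base64-encode it for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def estimate_front_view(render_paths: list[str]) -> int:
    """Ask the VLM which of the candidate renders shows the object's front.

    render_paths: renders of one object from N azimuths (e.g., every 90°).
    Returns the index of the view the VLM judges to be front-facing.
    """
    content = [{
        "type": "text",
        "text": ("These images show one object from different azimuth angles. "
                 "Reply with only the number (0-based) of the image that "
                 "shows the object's canonical front view."),
    }]
    for path in render_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; the paper's VLM may differ
        messages=[{"role": "user", "content": content}],
    )
    return int(response.choices[0].message.content.strip())
```

Labels from such a pass would then be verified by hand, especially for the stick-like and ambiguous-front categories noted above.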
We fine-tune two representative methods: Trellis, based on a 3D-VAE backbone (top), and Wonder3D, based on a multi-view diffusion backbone (bottom). For the 3D-VAE, we find that fine-tuning only the sparse structure generator is sufficient to produce orientation-aligned objects. For the multi-view diffusion model, we adopt LoRA as a lightweight domain adapter to enable the generation of orientation-aligned target images.
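As a rough illustration of this kind of lightweight domain adaptation, the sketch below attaches LoRA adapters to the attention projections of a diffusion UNet via the `peft` config and the `diffusers` adapter API; the backbone checkpoint, target module names, and rank are assumptions, not the paper's actual configuration:

```python
# Sketch: wrapping a multi-view diffusion UNet with LoRA adapters.
# Checkpoint, module names, and rank are illustrative assumptions.
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers",  # image-variation SD backbone (assumed)
    subfolder="unet",
)

# Freeze the backbone; only the LoRA parameters will be trained, so the
# adapter only has to learn the orientation-aligned output distribution.
unet.requires_grad_(False)

lora_config = LoraConfig(
    r=16,                    # low-rank dimension (assumed)
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
    lora_dropout=0.0,
)
unet.add_adapter(lora_config)

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable params: {trainable} / {total}")
```

Because only the adapter weights are updated, the fine-tuned model retains the backbone's generalization while shifting its outputs toward the aligned orientation convention.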
Our zero-shot, model-free orientation estimation method consists of three stages: 3D generation, pose refinement, and pose selection. The generated orientation-aligned 3D object serves as a template for pose estimation: we render it from multiple viewpoints, refine each candidate pose, and select the best-matching one. Note that no training is performed for this downstream task: the pose refinement module is taken directly from FoundationPose, and the pose selection module uses the pre-trained DINO feature extractor as-is.
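The sketch below gives a schematic of the selection stage: each refined candidate pose is scored by cosine similarity between DINO features of the rendered template and the query image, and the best match is kept. The `render` and `refine_pose` helpers are placeholders for the renderer and FoundationPose's refinement module, which are not reproduced here:

```python
# Sketch of the pose-selection stage: score each refined candidate pose by
# DINO feature similarity and keep the best one. `render` and `refine_pose`
# are placeholders for the renderer and FoundationPose's refiner.
import torch
import torch.nn.functional as F

# Pre-trained DINOv2 feature extractor (frozen; no training in this pipeline).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def extract_feature(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W), ImageNet-normalized, H and W multiples of 14."""
    return dino(image)  # (1, 384) global feature for ViT-S/14

@torch.no_grad()
def select_pose(query_image, mesh, candidate_poses, render, refine_pose):
    """Pick the candidate pose whose refined rendering best matches the query."""
    query_feat = extract_feature(query_image)
    best_pose, best_score = None, -float("inf")
    for pose in candidate_poses:        # e.g., viewpoints sampled on a sphere
        refined = refine_pose(mesh, query_image, pose)  # FoundationPose refiner
        rendering = render(mesh, refined)               # render template at refined pose
        score = F.cosine_similarity(query_feat, extract_feature(rendering)).item()
        if score > best_score:
            best_pose, best_score = refined, score
    return best_pose
```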
Manipulating the rotation of 3D models with non-canonical poses is difficult in downstream applications. Building on our generated orientation-aligned 3D models, we design an arrow-based rotation manipulation operation that lets users efficiently rotate 3D models in both augmented reality applications and generic 3D software.
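Because generated objects share a canonical frame, a dragged arrow can be converted directly into a rotation. Below is a minimal sketch with SciPy, assuming the canonical front axis is +Y (the actual axis convention of the dataset may differ):

```python
# Sketch: rotate an orientation-aligned model so its canonical front axis
# follows a user-dragged arrow. The +Y front-axis convention is an assumption.
import numpy as np
from scipy.spatial.transform import Rotation

FRONT_AXIS = np.array([0.0, 1.0, 0.0])  # assumed canonical "front" direction

def arrow_to_rotation(arrow_dir: np.ndarray) -> Rotation:
    """Return the minimal rotation taking the canonical front axis onto the arrow."""
    target = arrow_dir / np.linalg.norm(arrow_dir)
    axis = np.cross(FRONT_AXIS, target)
    cos_angle = float(np.clip(np.dot(FRONT_AXIS, target), -1.0, 1.0))
    if np.linalg.norm(axis) < 1e-8:             # parallel or anti-parallel
        if cos_angle > 0:
            return Rotation.identity()
        # 180° turn about any axis orthogonal to the front axis
        return Rotation.from_rotvec(np.pi * np.array([1.0, 0.0, 0.0]))
    axis = axis / np.linalg.norm(axis)
    return Rotation.from_rotvec(np.arccos(cos_angle) * axis)

# Example: point the model toward the +X direction.
R = arrow_to_rotation(np.array([1.0, 0.0, 0.0]))
vertices_rotated = R.apply(np.random.rand(100, 3))  # apply to mesh vertices
```

This only works reliably because all generated models share the same canonical frame; with non-canonical poses, the front axis would have to be estimated per object before any such mapping.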
@misc{lu2025orientationmattersmaking3d,
  title={Orientation Matters: Making 3D Generative Models Orientation-Aligned},
  author={Yichong Lu and Yuzhuo Tian and Zijin Jiang and Yikun Zhao and Yuanbo Yang and Hao Ouyang and Haoji Hu and Huimin Yu and Yujun Shen and Yiyi Liao},
  year={2025},
  eprint={2506.08640},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.08640},
}