Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

¹Zhejiang University  ²ByteDance

TL;DR: From input images, jointly generate RGB video and multiple geometric quantities (camera poses, depth maps, point clouds) via a unified latent space that aligns geometry and appearance.

Abstract

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens; the resulting latents are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
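
The abstract describes the adapter-and-alignment mechanism only in prose; below is a minimal PyTorch sketch of that idea. GeometryAdapter, alignment_loss, the MSE objective, and all tensor dimensions are illustrative assumptions, not the authors' released implementation; the frozen VGGT tokens and video-VAE appearance latents are stood in by random tensors.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeometryAdapter(nn.Module):
        """Hypothetical adapter: maps reconstruction-model tokens to geometric latents."""
        def __init__(self, token_dim=1024, latent_dim=16):
            super().__init__()
            self.proj = nn.Sequential(
                nn.LayerNorm(token_dim),
                nn.Linear(token_dim, latent_dim),
            )

        def forward(self, tokens):
            # tokens: (B, N, token_dim) from a frozen reconstruction backbone (e.g., VGGT)
            return self.proj(tokens)

    def alignment_loss(geo_latents, app_latents):
        # Regularize geometric latents toward frozen appearance latents
        # (MSE is an assumption; the paper only says "regularized to align").
        return F.mse_loss(geo_latents, app_latents.detach())

    # Toy usage with random stand-ins for backbone tokens and video-VAE latents.
    B, N, D, C = 2, 256, 1024, 16
    tokens = torch.randn(B, N, D)        # stand-in for VGGT tokens
    app_latents = torch.randn(B, N, C)   # stand-in for appearance latents
    adapter = GeometryAdapter(D, C)
    geo_latents = adapter(tokens)
    loss = alignment_loss(geo_latents, app_latents)
    loss.backward()

In the full method, the aligned geometric and appearance latents would then be generated jointly by the video diffusion model; this sketch covers only the adapter-alignment step.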

Interactive Examples


Comparisons

More Results

BibTeX


@misc{huang2026gen3r3dscenegeneration,
    title={Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction},
    author={Jiaxin Huang and Yuanbo Yang and Bangbang Yang and Lin Ma and Yuewen Ma and Yiyi Liao},
    year={2026},
    eprint={2601.04090},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2601.04090},
}