EDUS: Efficient Depth-Guided Urban View Synthesis

ECCV 2024

1Zhejiang University 2Huawei Noah's Ark Lab 3University of Tübingen 4Tübingen AI Center
*equal contribution; corresponding author

Recent advances in implicit scene representation enable high-fidelity street view novel view synthesis. However, existing methods optimize a neural radiance field for each scene, relying heavily on dense training images and extensive computational resources. To mitigate this shortcoming, we introduce a new method called Efficient Depth-Guided Urban View Synthesis (EDUS) for fast feed-forward inference and efficient per-scene fine-tuning. Different from prior generalizable methods that infer geometry based on feature matching, EDUS leverages noisy predicted geometric priors as guidance to enable generalizable urban view synthesis from sparse input images. The geometric priors allow us to apply our generalizable model directly in the 3D space, gaining robustness across various sparsity levels. Through comprehensive experiments on the KITTI-360 and Waymo datasets, we demonstrate promising generalization abilities on novel street scenes. Moreover, our results indicate that EDUS achieves state-of-the-art performance in sparse view settings when combined with fast test-time optimization.

Results Gallery

Drop50 Feed-forward on KITTI-360

Drop50 Zero-shot on Waymo (Pretrained on KITTI-360)

Comparison with Baselines on KITTI-360 Drop80 Feed-forward

Comparison with Baselines on KITTI-360 Drop90 Test-time Optimization

How it works

training pipeline

Illustration. Left: Most existing generalizable NeRF methods rely on feature matching to recover geometry, e.g., by constructing local cost volumes, potentially overfitting to certain reference camera pose distributions. Right: Our method lifts geometric priors to the 3D space and fuses them into a global volume to be processed by a generalizable network. This enhances robustness, as the geometric priors are unaffected by the reference image poses. In the middle, we show images and depth maps synthesized via feed-forward inference.
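To make the "lift and fuse" idea concrete, below is a minimal sketch (not the released code) of how per-view depth predictions could be back-projected into world space and accumulated into a global volume for a 3D network. The function names, camera parameters, grid bounds, and resolution are illustrative placeholders.

```python
# Minimal sketch: unproject predicted depth maps into a shared world-space
# voxel volume. All names and parameters here are hypothetical.
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Back-project an HxW depth map to world-space 3D points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # (HW, 3) pixel coords
    rays_cam = pix @ np.linalg.inv(K).T                               # pixels -> camera rays
    pts_cam = rays_cam * depth.reshape(-1, 1)                         # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                            # camera -> world

def fuse_into_volume(points, bounds_min, bounds_max, resolution=128):
    """Accumulate world-space points into an occupancy-style voxel grid."""
    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    norm = (points - bounds_min) / (bounds_max - bounds_min)          # normalize to [0, 1]^3
    idx = (norm * resolution).astype(int)
    valid = np.all((idx >= 0) & (idx < resolution), axis=1)           # drop out-of-bounds points
    np.add.at(grid, tuple(idx[valid].T), 1.0)                         # scatter-add point counts
    return grid                                                       # input to a generalizable 3D network

# Toy usage with random placeholder data
depth = np.random.uniform(2.0, 50.0, size=(94, 352))
K = np.array([[552.0, 0.0, 176.0], [0.0, 552.0, 47.0], [0.0, 0.0, 1.0]])
pts = unproject_depth(depth, K, np.eye(4))
volume = fuse_into_volume(pts, pts.min(0), pts.max(0))
```

Because the fusion happens in world space, the resulting volume does not depend on how the reference views are distributed, which is the robustness argument made above.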

Overall Framework

training pipeline

Pipeline. Our model takes as input sparse reference images and renders a color image at a target viewpoint. This is achieved by decomposing the scene into three generalizable modules: 1) depth-guided generalizable foreground fields to model the scene within a foreground volume; 2) generalizable background fields to model the background objects and stuff; and 3) generalizable sky fields. Our model is trained on multiple street scenes using RGB supervision and optionally LiDAR supervision. We further apply a sky loss to decompose the sky from the other regions and an entropy loss to penalize semi-transparent reconstructions.
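The sketch below illustrates how the losses named above could be combined during training. It is an assumption-laden illustration, not the released training code: the exact formulations and weights (w_sky, w_entropy, w_lidar) are placeholders.

```python
# Hypothetical sketch of the training objective: RGB supervision plus a sky
# loss, an entropy loss, and optional LiDAR depth supervision.
import torch
import torch.nn.functional as F

def training_loss(pred_rgb, gt_rgb, fg_bg_acc, sky_mask, weights,
                  pred_depth=None, lidar_depth=None,
                  w_sky=0.01, w_entropy=0.01, w_lidar=0.1):
    # Photometric reconstruction loss on rendered pixels.
    loss = F.mse_loss(pred_rgb, gt_rgb)

    # Sky loss: the accumulated opacity of the foreground/background fields
    # should vanish inside the sky mask, so sky pixels are explained by the sky field.
    loss = loss + w_sky * (fg_bg_acc * sky_mask).mean()

    # Entropy loss: push per-ray blending weights toward 0 or 1 to penalize
    # semi-transparent "floater" geometry.
    entropy = -(weights * torch.log(weights + 1e-6)
                + (1.0 - weights) * torch.log(1.0 - weights + 1e-6))
    loss = loss + w_entropy * entropy.mean()

    # Optional LiDAR depth supervision where sparse range measurements exist.
    if pred_depth is not None and lidar_depth is not None:
        valid = lidar_depth > 0
        loss = loss + w_lidar * F.l1_loss(pred_depth[valid], lidar_depth[valid])
    return loss
```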

Citation

@inproceedings{EDUS,
  author    = {Sheng Miao and Jiaxin Huang and Dongfeng Bai and Weichao Qiu and Bingbing Liu and Andreas Geiger and Yiyi Liao},
  title     = {EDUS: Efficient Depth-Guided Urban View Synthesis},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}