This paper presents EVolSplat, an efficient feed-forward method for urban scene reconstruction. Unlike previous pixel-aligned 3DGS frameworks, we exploit geometric priors to construct a global volume and predict a standalone 3D representation, achieving state-of-the-art performance across several street-view datasets while enabling real-time rendering.
EVolSplat learns to predict 3D Gaussians of urban scenes in a feed-forward manner. Given a set of posed images $\{I_n\}_{n=1}^N$, we first leverage off-the-shelf metric depth estimators to obtain depth estimates $\{D_n\}_{n=1}^N$. The depth maps are unprojected and accumulated into a global point cloud $P$, which is fed into a sparse 3D CNN to extract a feature volume $F$. We leverage the 3D context of $F$ to predict the geometry attributes of the 3D Gaussians. Furthermore, we project the 3D Gaussians onto the nearest reference views to retrieve 2D context, including color windows $\{c_k\}_{k=1}^K$ and visibility maps $\{v_k\}_{k=1}^K$, to decode their color. To model far regions, we propose a generalizable hemisphere Gaussian model, where the geometry is fixed and the color is predicted in the same manner as for the foreground volume.
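To make the first step concrete, the PyTorch sketch below lifts per-view metric depth maps into a shared world frame and concatenates them into the global point cloud $P$. It assumes a simple pinhole camera model with known intrinsics and camera-to-world poses; the function names and interfaces are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def unproject_depth(depth, K, cam_to_world):
    """Lift a metric depth map to world-space points.

    depth:        (H, W) metric depth map
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    Returns an (H*W, 3) tensor of world-space points.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Back-project pixels through the pinhole model:
    # x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    # Transform camera-space points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

def accumulate_point_cloud(depths, Ks, poses):
    """Fuse all per-view unprojections into one global point cloud P."""
    return torch.cat(
        [unproject_depth(d, K, T) for d, K, T in zip(depths, Ks, poses)],
        dim=0,
    )
```

In practice the accumulated cloud would additionally be filtered and voxelized before being passed to the sparse 3D CNN; those steps are omitted here for brevity.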
@InProceedings{Miao_2025_CVPR,
author = {Sheng Miao and Jiaxin Huang and Dongfeng Bai and Xu Yan and Hongyu Zhou and Yue Wang and Bingbing Liu and Andreas Geiger and Yiyi Liao},
title = {EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025}
}