We propose EVolSplat4D, a unified feed-forward reconstruction model tailored for static and dynamic urban scenes that achieves real-time rendering speeds. Leveraging both camera and LiDAR cues as input, EVolSplat4D produces photo-realistic novel views in approximately 1.3 seconds, with quality comparable to time-consuming per-scene optimization methods. EVolSplat4D also supports various downstream applications, including high-fidelity scene editing and scene decomposition.
We reconstruct urban scenes by disentangling them into a close-range volume, dynamic actors, and a distant view, predicting 4D Gaussians in a feed-forward manner. a) Given a set of images and sparse LiDAR points, we initialize our model with a pretrained depth model and a DINO feature extractor. b) For the close-range volume, we leverage the 3D context $\psi^\text{3D}$ to predict the geometric attributes of the 3D Gaussians and project them onto the reference views to retrieve 2D context, including color windows and visibility maps, to decode their colors. c) For dynamic actors, we model each instance in an instance-wise canonical space and perform generalizable reconstruction via our proposed motion-adjusted IBR module. d) To model distant regions, we employ a 2D U-Net backbone with cross-view self-attention to aggregate information from nearby reference images and predict per-pixel Gaussians. e) Composing the three parts yields our full model for unbounded scenes.
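To make the compositional step e) concrete, the following is a minimal PyTorch sketch (with hypothetical class and function names, not the authors' implementation) of how Gaussians from the three branches could be merged: close-range Gaussians remain in the world frame, each actor's canonical-space Gaussians are mapped into the world frame by its per-frame pose, and far-field per-pixel Gaussians are appended.

```python
# Hypothetical sketch of composing close-range, dynamic-actor, and far-field Gaussians.
from dataclasses import dataclass
import torch


@dataclass
class Gaussians:
    """A set of 3D Gaussian primitives."""
    means: torch.Tensor      # (N, 3) positions
    scales: torch.Tensor     # (N, 3) per-axis scales
    rotations: torch.Tensor  # (N, 4) unit quaternions
    opacities: torch.Tensor  # (N, 1)
    colors: torch.Tensor     # (N, 3)

    def concat(self, other: "Gaussians") -> "Gaussians":
        cat = lambda a, b: torch.cat([a, b], dim=0)
        return Gaussians(cat(self.means, other.means),
                         cat(self.scales, other.scales),
                         cat(self.rotations, other.rotations),
                         cat(self.opacities, other.opacities),
                         cat(self.colors, other.colors))


def actor_to_world(canonical: Gaussians, pose: torch.Tensor) -> Gaussians:
    """Map an actor's canonical-space Gaussians into the world frame with a 4x4 pose."""
    R, t = pose[:3, :3], pose[:3, 3]
    return Gaussians(
        means=canonical.means @ R.T + t,
        scales=canonical.scales,
        rotations=canonical.rotations,  # quaternion update by R omitted in this sketch
        opacities=canonical.opacities,
        colors=canonical.colors,
    )


def compose_scene(close_range: Gaussians,
                  actors_canonical: list[Gaussians],
                  actor_poses: list[torch.Tensor],
                  far_field: Gaussians) -> Gaussians:
    """Compose the three branches into one Gaussian set for unbounded-scene rendering."""
    scene = close_range
    for gs, pose in zip(actors_canonical, actor_poses):
        scene = scene.concat(actor_to_world(gs, pose))
    return scene.concat(far_field)
```

The composed set can then be passed to any standard 3D Gaussian rasterizer; the sketch omits the quaternion rotation of per-Gaussian orientations and any time-dependent deformation of actors.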