We propose EVolSplat4D, a unified feed-forward 3D Gaussian Splatting framework tailored to both static and dynamic urban scenes that achieves real-time rendering speeds. Leveraging camera images and tracked 3D bounding boxes as inputs, EVolSplat4D completes scene reconstruction in approximately 1.3 seconds, achieving photo-realistic quality comparable to time-consuming per-scene optimization methods. EVolSplat4D also supports various downstream applications, including high-fidelity scene editing and scene decomposition.
We reconstruct urban scenes by disentangling them into a close-range volume, dynamic actors, and far-field scenery, predicting the 3D Gaussians of each component in a feed-forward manner. a) Given a set of input images, we initialize our model with a pretrained depth model and a DINO feature extractor. b) For the close-range volume, we leverage the 3D context of $\mathcal{F}^\text{3D}$ to predict the geometry attributes of the 3D Gaussians, then project these Gaussians onto the reference views to retrieve 2D context, including color windows and visibility maps, from which their colors are decoded. c) For dynamic actors, we model each instance in an instance-wise canonical space and perform feed-forward reconstruction through our proposed motion-adjusted IBR module. d) To model far-field regions, we employ a 2D U-Net backbone $\mathcal{F}^\text{2D}$ with cross-view self-attention that aggregates information from nearby reference images and predicts per-pixel Gaussians. e) Composing these three parts yields our full model for unbounded scenes, as sketched below.
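To make the composition in e) concrete, the following is a minimal sketch of the three-branch feed-forward assembly, assuming a PyTorch-style implementation. All names here (`make_gaussians`, `close_range_branch`, `actor_branch`, `far_field_branch`, `compose`) and the placeholder attribute values are hypothetical illustrations, not the actual EVolSplat4D API; each branch stub stands in for the real $\mathcal{F}^\text{3D}$ volume prediction, motion-adjusted IBR, and $\mathcal{F}^\text{2D}$ U-Net outputs.

```python
# Hypothetical sketch of the three-branch composition (e); not the paper's code.
import math
import torch


def make_gaussians(n):
    """Allocate per-Gaussian attributes with placeholder values."""
    return {
        "means": torch.randn(n, 3),
        "scales": torch.rand(n, 3) * 0.05,
        "rotations": torch.nn.functional.normalize(torch.randn(n, 4), dim=-1),
        "opacities": torch.rand(n, 1),
        "colors": torch.rand(n, 3),
    }


def close_range_branch(n=4096):
    # (b) Stand-in for geometry predicted from the 3D feature volume F^3D,
    # with colors decoded from projected 2D context (color windows, visibility).
    return make_gaussians(n)


def actor_branch(canonical, box_rotation, box_translation):
    # (c) Move instance-wise canonical-space Gaussians into world space using
    # the tracked 3D bounding-box pose for the current frame.
    world = dict(canonical)
    world["means"] = canonical["means"] @ box_rotation.T + box_translation
    return world


def far_field_branch(n=2048):
    # (d) Stand-in for the per-pixel Gaussians predicted by the 2D U-Net F^2D
    # from nearby reference images.
    return make_gaussians(n)


def compose(*parts):
    # (e) Concatenate all branches along the point axis into one scene.
    return {k: torch.cat([p[k] for p in parts], dim=0) for k in parts[0]}


if __name__ == "__main__":
    canonical_car = make_gaussians(512)  # one dynamic actor in canonical space
    c, s = math.cos(0.3), math.sin(0.3)  # example yaw from the tracked box
    R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t = torch.tensor([12.0, -3.0, 0.0])  # example box center at this frame
    scene = compose(close_range_branch(),
                    actor_branch(canonical_car, R, t),
                    far_field_branch())
    print({k: tuple(v.shape) for k, v in scene.items()})
```

Because each branch only predicts Gaussian attributes and the composition is a concatenation, the assembled scene can be handed directly to a standard 3DGS rasterizer, which is what makes single-pass feed-forward reconstruction compatible with real-time rendering.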