Recent advances in implicit scene representation enable high-fidelity novel view synthesis of street views. However, existing methods optimize a neural radiance field for each scene, relying heavily on dense training images and extensive computational resources. To mitigate this shortcoming, we introduce Efficient Depth-Guided Urban View Synthesis (EDUS), a new method for fast feed-forward inference and efficient per-scene fine-tuning. Unlike prior generalizable methods that infer geometry from feature matching, EDUS leverages noisy predicted geometric priors as guidance to enable generalizable urban view synthesis from sparse input images. These geometric priors allow us to apply our generalizable model directly in 3D space, gaining robustness across various sparsity levels. Through comprehensive experiments on the KITTI-360 and Waymo datasets, we demonstrate promising generalization to novel street scenes. Moreover, our results indicate that EDUS achieves state-of-the-art performance in sparse-view settings when combined with fast test-time optimization.
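To give a rough intuition for how predicted depth can anchor a generalizable model in 3D space, the sketch below unprojects a (possibly noisy) monocular depth map into world-space points that could serve as geometric guidance. This is a minimal illustration only; the function name `unproject_depth` and all parameters are hypothetical and are not taken from the EDUS implementation.

```python
import numpy as np

def unproject_depth(depth, K, c2w):
    """Lift a predicted depth map into world-space 3D points.

    depth: (H, W) predicted metric depth (may be noisy).
    K:     (3, 3) camera intrinsics.
    c2w:   (4, 4) camera-to-world pose.
    Returns (H*W, 3) world-space points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates [u, v, 1]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project to camera space: X_cam = depth * K^{-1} [u, v, 1]^T
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Transform to world space with the camera-to-world pose
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=-1)
    world = (c2w @ cam_h.T).T[:, :3]
    return world

# Toy example: a 4x4 depth map with an identity pose (illustrative values only)
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 10.0)          # stand-in for network-predicted depth
points = unproject_depth(depth, K, np.eye(4))
print(points.shape)                    # (16, 3) noisy 3D anchors for a generalizable model
```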