Learning Temporally Consistent Video Depth from Video Diffusion Priors

1Zhejiang University 2University of Bologna 3University of Southern California

4Ant Group 5Toyota Research Institute 6Rock Universe
*equal contribution; corresponding author

TL;DR: ChronoDepth addresses the challenge of streamed video depth estimation with a video diffusion model.

Abstract

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that the absence of contextual information shared between frames or clips is pivotal in fostering inconsistency. Instead of directly developing a depth estimator from scratch, we reformulate this predictive task as a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context: during training we sample independent noise levels for each frame within a clip, while at inference we use a sliding window strategy and initialize overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth.

[Figure: inference comparison]

Illustration of different inference strategies for infinitely long videos. (a) Naive sliding window inference; (b) Inference with replacement trick; (c) Our proposed consistent context-aware inference.

Comparison Gallery

Overall Framework

[Figure: training pipeline]

Training pipeline. We add an RGB video conditioning branch to a pre-trained video diffusion model (SVD) and fine-tune it for depth estimation via denoising score matching (DSM), sampling a different noise level for each frame. Training involves two stages: 1) we train the spatial layers with single-frame depths; 2) we freeze the spatial layers and train the temporal layers on clips of random lengths.
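To make the frame-wise noise sampling concrete, below is a minimal PyTorch sketch of one DSM training step. It assumes EDM-style log-normal sigma sampling as used by SVD; `denoiser`, `rgb_latents`, and `depth_latents` are hypothetical placeholders, not the released code.

import torch

def training_step(denoiser, rgb_latents, depth_latents,
                  P_mean=-1.2, P_std=1.2, sigma_data=0.5):
    # rgb_latents, depth_latents: (B, T, C, H, W) latents for a clip of T frames.
    # `denoiser(noisy, sigma, cond)` is a hypothetical EDM-preconditioned video UNet
    # that predicts clean depth latents and accepts a per-frame sigma of shape (B, T).
    B, T = depth_latents.shape[:2]
    device = depth_latents.device

    # Sample an independent noise level for every frame in the clip (log-normal, as in EDM).
    sigma = torch.exp(P_mean + P_std * torch.randn(B, T, device=device))
    sigma_b = sigma.view(B, T, 1, 1, 1)

    # Corrupt only the depth latents; the RGB conditioning stays clean.
    noisy_depth = depth_latents + sigma_b * torch.randn_like(depth_latents)

    # Predict the clean depth latents conditioned on RGB and the per-frame noise levels.
    pred = denoiser(noisy_depth, sigma, cond=rgb_latents)

    # EDM-style loss weighting, applied frame-wise.
    weight = (sigma_b ** 2 + sigma_data ** 2) / (sigma_b * sigma_data) ** 2
    return (weight * (pred - depth_latents) ** 2).mean()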

[Figure: inference scheme]

We explore an inference strategy for infinitely long videos. We segment the video into several $F$-frame clips with overlap $W$ and use a sliding window strategy for inference. In addition, we initialize the overlapping $W$ frames with previously predicted depth latents $\hat{\mathbf{z}}^{(\mathbf{d}_{0:W})}_0$ to provide consistent contextual information at each sampling step.
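As a concrete reference, here is a minimal PyTorch sketch of this consistent context-aware sliding-window loop, not the released implementation; `sample_clip` is a hypothetical per-clip diffusion sampler and the latent shapes are assumptions.

import torch

def infer_video_depth(sample_clip, rgb_latents, clip_len, overlap):
    # rgb_latents: (T, C, H, W) latents of the full video.
    # `sample_clip(rgb_clip, init_depth, known_mask)` is a hypothetical sampler that
    # denoises the frames where known_mask is False while keeping the frames where
    # it is True fixed at init_depth (previous clean predictions, no noise added).
    T = rgb_latents.shape[0]
    stride = clip_len - overlap
    depth = torch.zeros_like(rgb_latents)

    start = 0
    while start < T:
        end = min(start + clip_len, T)
        rgb_clip = rgb_latents[start:end]

        init = torch.randn_like(rgb_clip)              # unknown frames start from noise
        known = torch.zeros(end - start, dtype=torch.bool)
        if start > 0:                                  # reuse the last `overlap` predictions
            known[:overlap] = True
            init[:overlap] = depth[start:start + overlap]

        depth[start:end] = sample_clip(rgb_clip, init, known)
        if end == T:
            break
        start += stride

    return depth

Because the overlapping frames stay fixed at their previously predicted, noise-free latents throughout sampling, each new clip is denoised under the same clean context, which is what enforces cross-clip consistency.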

Citation

@misc{shao2024learningtemporallyconsistentvideo,
      title={Learning Temporally Consistent Video Depth from Video Diffusion Priors}, 
      author={Jiahao Shao and Yuanbo Yang and Hongyu Zhou and Youmin Zhang and Yujun Shen and Vitor Guizilini and Yue Wang and Matteo Poggi and Yiyi Liao},
      year={2024},
      eprint={2406.01493},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.01493}, 
}