Learning Temporally Consistent Video Depth from Video Diffusion Priors

1Zhejiang University 2University of Bologna 3University of Southern California

4Ant Group 5Toyota Research Institute 6Rock Universe
*equal contribution; corresponding author

TL;DR: ChronoDepth addresses the challenge of streamed video depth estimation with a video diffusion model.

Abstract

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that the absence of contextual information shared between frames or clips is pivotal in fostering inconsistency. Instead of directly developing a depth estimator from scratch, we reformulate this predictive task as a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context: during training we sample independent noise levels for each frame within a clip, while at inference we use a sliding window strategy and initialize overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth.

[Figure: inference comparison]

Illustration of different inference strategies for infinitely long videos. (a) Naive sliding window inference; (b) Inference with replacement trick; (c) Our proposed consistent context-aware inference.

Comparison Gallery

Overall Framework

[Figure: training pipeline]

Training pipeline. We add an RGB video conditioning branch to a pre-trained video diffusion model (SVD) and fine-tune it for depth estimation via denoising score matching (DSM), sampling a different noise level for each frame. Training involves two stages: 1) we train the spatial layers with single-frame depths; 2) we freeze the spatial layers and train the temporal layers on clips of random lengths.
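To make the frame-wise noise sampling concrete, below is a minimal PyTorch sketch of one DSM training step. It assumes EDM-style log-normal sigma sampling as used by SVD; `denoiser`, `rgb_latents`, and `depth_latents` are hypothetical placeholders, not the released code.

import torch

def training_step(denoiser, rgb_latents, depth_latents,
                  P_mean=-1.2, P_std=1.2, sigma_data=0.5):
    # rgb_latents, depth_latents: (B, T, C, H, W) latents for a clip of T frames.
    # `denoiser(noisy, sigma, cond)` is a hypothetical EDM-preconditioned video UNet
    # that predicts clean depth latents and accepts a per-frame sigma of shape (B, T).
    B, T = depth_latents.shape[:2]
    device = depth_latents.device

    # Sample an independent noise level for every frame in the clip (log-normal, as in EDM).
    sigma = torch.exp(P_mean + P_std * torch.randn(B, T, device=device))
    sigma_b = sigma.view(B, T, 1, 1, 1)

    # Corrupt only the depth latents; the RGB conditioning stays clean.
    noisy_depth = depth_latents + sigma_b * torch.randn_like(depth_latents)

    # Predict the clean depth latents conditioned on RGB and the per-frame noise levels.
    pred = denoiser(noisy_depth, sigma, cond=rgb_latents)

    # EDM-style loss weighting, applied frame-wise.
    weight = (sigma_b ** 2 + sigma_data ** 2) / (sigma_b * sigma_data) ** 2
    return (weight * (pred - depth_latents) ** 2).mean()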

[Figure: inference scheme]

We explore an inference strategy for infinitely long videos. We segment the video into several $F$-frame clips with overlap $W$ and use a sliding window strategy for inference. In addition, we initialize the overlapping $W$ frames with previously predicted depth latents $\hat{\mathbf{z}}^{(\mathbf{d}_{0:W})}_0$ to provide consistent contextual information at each sampling step.
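As a concrete reference, here is a minimal PyTorch sketch of this consistent context-aware sliding-window loop, not the released implementation; `sample_clip` is a hypothetical per-clip diffusion sampler and the latent shapes are assumptions.

import torch

def infer_video_depth(sample_clip, rgb_latents, clip_len, overlap):
    # rgb_latents: (T, C, H, W) latents of the full video.
    # `sample_clip(rgb_clip, init_depth, known_mask)` is a hypothetical sampler that
    # denoises the frames where known_mask is False while keeping the frames where
    # it is True fixed at init_depth (previous clean predictions, no noise added).
    T = rgb_latents.shape[0]
    stride = clip_len - overlap
    depth = torch.zeros_like(rgb_latents)

    start = 0
    while start < T:
        end = min(start + clip_len, T)
        rgb_clip = rgb_latents[start:end]

        init = torch.randn_like(rgb_clip)              # unknown frames start from noise
        known = torch.zeros(end - start, dtype=torch.bool)
        if start > 0:                                  # reuse the last `overlap` predictions
            known[:overlap] = True
            init[:overlap] = depth[start:start + overlap]

        depth[start:end] = sample_clip(rgb_clip, init, known)
        if end == T:
            break
        start += stride

    return depth

Because the overlapping frames stay fixed at their previously predicted, noise-free latents throughout sampling, each new clip is denoised under the same clean context, which is what enforces cross-clip consistency.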

Citation

@misc{shao2024learningtemporallyconsistentvideo,
      title={Learning Temporally Consistent Video Depth from Video Diffusion Priors}, 
      author={Jiahao Shao and Yuanbo Yang and Hongyu Zhou and Youmin Zhang and Yujun Shen and Vitor Guizilini and Yue Wang and Matteo Poggi and Yiyi Liao},
      year={2024},
      eprint={2406.01493},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.01493}, 
}