ICCV 2023

DiffDreamer

Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models

Shengqu Cai¹,²* Eric Ryan Chan¹ Songyou Peng²,³ Mohamad Shahbazi² Anton Obukhov² Luc Van Gool²,⁴ Gordon Wetzstein¹

¹Stanford University ²ETH Zürich ³MPI for Intelligent Systems, Tübingen ⁴KU Leuven

*Work done as a visiting researcher at Stanford.


Single image to long trajectory

Nature scenes that keep their shape as the camera moves.

DiffDreamer comparison showing consistent long-range scene extrapolation

Abstract

Consistent scene extrapolation from a single view.

Scene extrapolation, the idea of generating novel views by flying into a given image, is a promising yet challenging task. Each predicted frame requires solving a joint inpainting and 3D refinement problem, which is ill-posed and highly ambiguous. Moreover, training data for long-range scenes is difficult to obtain and usually lacks sufficient views to infer accurate camera poses.

We introduce DiffDreamer, an unsupervised framework that synthesizes novel views along a long camera trajectory while training solely on internet-collected images of nature scenes. Exploiting the stochastic nature of the guided denoising steps, we train diffusion models to refine projected RGBD images and, at inference, condition the denoising steps on multiple past and future frames.

We demonstrate that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods.
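To make the phrase "projected RGBD images" concrete, the following is a minimal NumPy sketch of forward warping an RGBD frame into a nearby camera. It is an illustration under stated assumptions, not the authors' implementation: the function name `forward_warp_rgbd`, the pinhole intrinsics `K`, the 4x4 source-to-target transform `T_src_to_tgt`, and the nearest-pixel splatting with a z-buffer are all choices made for this example. The returned mask marks the pixels no source point landed on, which is the region a refinement model would have to inpaint.

```python
import numpy as np

def forward_warp_rgbd(rgb, depth, K, T_src_to_tgt):
    """Splat source RGBD pixels into a target camera (illustrative sketch).

    rgb: (h, w, c) image, depth: (h, w) positive depths,
    K: (3, 3) pinhole intrinsics, T_src_to_tgt: (4, 4) rigid transform.
    Returns the warped image, the warped depth, and a missing-region mask.
    """
    h, w = depth.shape
    channels = rgb.shape[-1]

    # Homogeneous pixel coordinates of every source pixel, shape (3, h*w).
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(np.float64)

    # Unproject to 3D in the source frame, then move to the target frame.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = (T_src_to_tgt @ np.vstack([pts, np.ones((1, h * w))]))[:3]

    # Project into the target image and round to the nearest pixel.
    z = pts_tgt[2]
    in_front = z > 1e-6
    proj = K @ pts_tgt
    z_safe = np.where(in_front, z, 1.0)
    u_t = np.rint(proj[0] / z_safe).astype(np.int64)
    v_t = np.rint(proj[1] / z_safe).astype(np.int64)
    valid = in_front & (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)

    # Z-buffer: keep the nearest source point for each target pixel.
    src = np.flatnonzero(valid)
    dst = v_t[src] * w + u_t[src]
    zbuf = np.full(h * w, np.inf)
    np.minimum.at(zbuf, dst, z[src])
    nearest = z[src] == zbuf[dst]

    out = np.zeros((h * w, channels), dtype=rgb.dtype)
    out[dst[nearest]] = rgb.reshape(-1, channels)[src[nearest]]

    # Pixels that received no point form the region to be inpainted.
    missing = np.isinf(zbuf).reshape(h, w)
    return out.reshape(h, w, channels), zbuf.reshape(h, w), missing
```

Moving the camera forward with this sketch magnifies the image around the principal point and leaves gaps between the splatted pixels, so even a small step produces a corrupted image plus a non-empty mask. A practical renderer would likely use a mesh or soft splatting instead of nearest-pixel rounding.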

Method

Render, refine, and repeat with stochastic conditioning.

DiffDreamer pipeline with forward warping, anchored conditioning, and lookahead conditioning

We train an image-conditional diffusion model to perform image-to-image refinement and inpainting given a corrupted image and its missing-region mask. At inference, stochastic conditioning combines naive forward warping from the previous frame, anchored conditioning from a further frame, and lookahead conditioning from a virtual future frame.
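The inference procedure described above can be sketched as a reverse diffusion loop that draws one conditioning input per denoising step. This is a hedged toy illustration rather than the released code: `stochastic_conditioning_sample`, `toy_denoise_step`, the uniform choice among conditions, and the fixed blend factor are all assumptions made for this example, and the toy step merely stands in for a trained image-conditional diffusion model.

```python
import numpy as np

def stochastic_conditioning_sample(denoise_step, conditions, num_steps, shape, rng):
    """Reverse diffusion loop that picks one conditioning input per step.

    conditions: list of (image, missing_mask) pairs, e.g. the forward-warped
    previous frame, an anchored frame, and a lookahead frame.
    Uniform sampling over the conditions is an assumption of this sketch.
    """
    x = rng.standard_normal(shape)  # start from pure noise
    picks = []
    for t in range(num_steps - 1, -1, -1):
        k = int(rng.integers(len(conditions)))
        cond_image, cond_mask = conditions[k]
        x = denoise_step(x, t, cond_image, cond_mask)
        picks.append(k)
    return x, picks

def toy_denoise_step(x, t, cond_image, cond_mask, blend=0.2):
    """Hypothetical stand-in for a learned denoiser (t is unused here).

    Pulls the sample toward the condition where the condition has valid
    pixels and leaves the current estimate alone inside the missing region.
    """
    target = np.where(cond_mask, x, cond_image)
    return (1.0 - blend) * x + blend * target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 4)
    no_holes = np.zeros(shape, dtype=bool)
    # Three toy conditions standing in for warped, anchored, and lookahead views.
    conditions = [(np.full(shape, float(c)), no_holes) for c in (0.0, 1.0, 2.0)]
    sample, picks = stochastic_conditioning_sample(
        toy_denoise_step, conditions, num_steps=50, shape=shape, rng=rng
    )
    print(sample.round(2))
    print(picks[:10])
```

Because a different condition can be drawn at every step, the final sample carries information from all of the conditioning frames even though each individual step sees only one of them.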

Gallery

Long camera motion from internet-collected nature images.

Flying into input images

Flying out of input images

Citation

BibTeX

@inproceedings{cai2023diffdreamer,
  title     = {DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models},
  author    = {Cai, Shengqu and Chan, Eric Ryan and Peng, Songyou and Shahbazi, Mohamad and Obukhov, Anton
               and Van Gool, Luc and Wetzstein, Gordon},
  booktitle = {ICCV},
  year      = {2023}
}