CVPR 2024
Seattle, WA
Video Generation / 4D-Guided Diffusion

Controllable 4D-guided video generation with 2D diffusion models — combining the precision of animated 3D meshes with the expressivity of pre-trained image diffusion.

§01 Abstract

A bridge between control and creativity.

Summary We inject ground-truth 4D correspondences into a pre-trained text-to-image diffusion model to render high-quality, temporally consistent video.

Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which emerging text-to-video diffusion models promise to automate. Despite great promise, video diffusion models are difficult to control, preventing users from applying their own creativity rather than amplifying it.

To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. Our approach takes an animated, low-fidelity rendered mesh as input and injects the ground-truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames.

We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.

§02 Method

UV correspondences, extended attention, unified features.

Pipeline Depth-conditioned ControlNet backbone, UV-noise initialization, keyframe extended attention, UV-space feature unification.

Our system takes as input a set of UV and depth maps rendered from an animated 3D scene. We use a depth-conditioned ControlNet to generate corresponding frames while using the UV correspondences to preserve consistency. We initialize the noise in the UV space of each object, which is then rendered into each image.
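The UV-space noise initialization described above can be sketched as follows. This is a hypothetical, simplified illustration (the function names, UV resolution, and nearest-neighbour lookup are our assumptions, not the paper's implementation): a single Gaussian noise texture is sampled per object in UV space, and each frame's UV map then pulls that noise into image space, so corresponding surface points receive identical noise across frames.

```python
import numpy as np

def init_uv_noise(uv_res=64, seed=0):
    """Sample one fixed Gaussian noise texture in an object's UV space.

    Hypothetical sketch: initializing diffusion noise in UV space ensures
    that the same surface point sees the same noise in every frame.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((uv_res, uv_res))

def render_noise_to_frame(uv_noise, uv_map):
    """Project UV-space noise into image space via a per-frame UV map.

    `uv_map` has shape (H, W, 2) with coordinates in [0, 1); a
    nearest-neighbour lookup keeps the noise attached to the surface
    as the object or camera moves.
    """
    res = uv_noise.shape[0]
    u = np.clip((uv_map[..., 0] * res).astype(int), 0, res - 1)
    v = np.clip((uv_map[..., 1] * res).astype(int), 0, res - 1)
    return uv_noise[v, u]
```

For example, if a surface point is visible at pixel (0, 0) in one frame and at pixel (3, 3) in the next, both pixels sample the same UV texel and thus receive the same noise value.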

For each diffusion step, we first apply extended attention to a set of keyframes and extract their pre- and post-attention features. The post-attention features are projected into UV space and unified. Finally, each frame is generated using a weighted combination of the output of extended attention over the keyframes' pre-attention features and the UV-composed post-attention features.
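The UV unification and weighted combination steps can be sketched as below. This is a minimal illustration under our own assumptions (function names, a coarse UV resolution, simple averaging of overlapping contributions, and a scalar blend weight are ours, not the paper's): keyframe post-attention features are scattered into a shared UV texture and averaged, then sampled back into each frame and blended with that frame's own attention output.

```python
import numpy as np

def unify_in_uv(frame_feats, uv_maps, uv_res=16):
    """Scatter per-frame post-attention features into a shared UV texture.

    Overlapping contributions from different frames are averaged, giving
    one unified feature per surface point (hypothetical sketch).
    """
    C = frame_feats[0].shape[-1]
    acc = np.zeros((uv_res, uv_res, C))
    cnt = np.zeros((uv_res, uv_res, 1))
    for feats, uv in zip(frame_feats, uv_maps):
        u = np.clip((uv[..., 0] * uv_res).astype(int), 0, uv_res - 1)
        v = np.clip((uv[..., 1] * uv_res).astype(int), 0, uv_res - 1)
        for i in range(feats.shape[0]):
            for j in range(feats.shape[1]):
                acc[v[i, j], u[i, j]] += feats[i, j]
                cnt[v[i, j], u[i, j]] += 1
    return acc / np.maximum(cnt, 1)

def blend_with_uv(frame_feat, uv_texture, uv_map, alpha=0.5):
    """Weighted combination of a frame's own attention output with the
    UV-composed features sampled back into image space."""
    res = uv_texture.shape[0]
    u = np.clip((uv_map[..., 0] * res).astype(int), 0, res - 1)
    v = np.clip((uv_map[..., 1] * res).astype(int), 0, res - 1)
    return alpha * frame_feat + (1 - alpha) * uv_texture[v, u]
```

Averaging in UV space is one simple choice for unification; the key property is that every frame reads the same per-surface-point feature back out, which is what enforces temporal consistency.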

FIG. 01 — Pipeline overview: UV and depth maps rendered from the animated scene feed a depth-conditioned ControlNet, with extended attention across keyframes and feature unification in UV space.
§04 Comparisons

Qualitative comparisons.

We compare against per-frame editing, adapted versions of the state-of-the-art video editing methods Pix2Video and TokenFlow, and the state-of-the-art video diffusion model Gen-1.

Prompt 01 a basketball bouncing in a chamber under light
Prompt 02 a Swarovski blue fox running
§05 Citation

BibTeX.

cai2023genren.bib  
@inproceedings{cai2023genren,
  author    = {Cai, Shengqu and Ceylan, Duygu and Gadelha, Matheus
               and Huang, Chun-Hao and Wang, Tuanfeng and Wetzstein, Gordon},
  title     = {Generative Rendering: Controllable 4D-Guided
               Video Generation with 2D Diffusion Models},
  booktitle = {CVPR},
  year      = {2024}
}