Principle of the WoVoGen framework for generating models for autonomous driving data sets

Posted Date: 2024-01-27

1. Foreword

Generative models for autonomous driving datasets have become very popular recently, chiefly NeRF-based and diffusion-based approaches. The difficulty with diffusion models is maintaining world-level consistency and consistency across sensors. Today, the author recommends a paper on WoVoGen, the latest open-source solution from Fudan University, which can generate street-view videos from vehicle control inputs and also supports scene editing.

Let’s read about this work together~

2. Abstract

Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent need for extensive and diverse data. Because of limited diversity and difficulty in handling lighting conditions, traditional rendering-based methods are increasingly being replaced by diffusion-based methods. However, a major challenge for diffusion-based approaches is ensuring that the generated sensor data maintains both world-level consistency and inter-sensor consistency. To address these challenges, we incorporate an additional explicit world voxel representation and propose a world voxel-aware multi-camera driving scene generator (WoVoGen). This system is specifically designed to use 4D world voxels as the basic elements for video generation. Our model runs in two distinct stages: (i) envisioning a future 4D temporal world of voxels based on vehicle control sequences, and (ii) generating multi-camera video from this envisioned 4D temporal world of voxels and knowledge of sensor interconnectivity. The addition of 4D world voxels enables WoVoGen not only to generate high-quality street-view videos based on vehicle control inputs, but also to facilitate scene editing tasks.
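To make the two-stage flow concrete, here is a minimal sketch of the control loop only. It is not WoVoGen's implementation: `predict_next_voxels` and `render_cameras` are hypothetical toy stand-ins for the two diffusion stages, and the action format (an integer cell shift) and the camera count of 6 are assumptions chosen purely for illustration.

```python
import numpy as np

def predict_next_voxels(history, action):
    """Toy stand-in for stage (i): shift the latest grid opposite to the ego motion.
    A real world model would be a learned generative model, not a roll."""
    dx, dy = action  # hypothetical action: ego displacement in grid cells
    return np.roll(history[-1], shift=(-dx, -dy), axis=(0, 1))

def render_cameras(voxels, num_cameras=6):
    """Toy stand-in for stage (ii): one fake 'image' per camera.
    Collapses height, then rotates the top-down view per camera."""
    top_down = voxels.max(axis=2)
    return [np.rot90(top_down, k) for k in range(num_cameras)]

def generate(past_voxels, actions, num_cameras=6):
    """Roll the world forward one step per action, then render each step."""
    history = list(past_voxels)
    video = []
    for action in actions:
        nxt = predict_next_voxels(history, action)  # stage (i)
        history.append(nxt)                         # feed the prediction back in
        video.append(render_cameras(nxt, num_cameras))  # stage (ii)
    return history[len(past_voxels):], video
```

The point of the sketch is the data flow: each predicted voxel grid is appended to the history before the next step, and every camera view of a frame is rendered from the same grid, which is what ties the views together.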

3. Results showcase

WoVoGen can predict the surrounding environment and generate plausible visual feedback in response to the ego vehicle's driving maneuvers. To leverage the capabilities of rapidly developing generative models, WoVoGen encodes structured traffic information into a regular grid, known as world voxels, and designs a new latent diffusion-based world model that predicts world voxels autoregressively.
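As a rough illustration of what "encoding structured traffic information into a regular grid" can look like, the sketch below rasterizes axis-aligned 3D boxes into an integer class grid. The grid extent, voxel size, box format, and class IDs are all assumptions made for this example; the article does not specify how WoVoGen builds its world voxels.

```python
import numpy as np

def voxelize_boxes(boxes, grid_min=(-40.0, -40.0, -1.0),
                   voxel_size=0.5, grid_shape=(160, 160, 16)):
    """Rasterize (center, size, class_id) boxes into a class-ID voxel grid.

    All defaults are hypothetical; boxes are treated as axis-aligned.
    Cells not covered by any box keep the value 0 (empty).
    """
    grid = np.zeros(grid_shape, dtype=np.int64)
    gmin = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)
    for center, size, class_id in boxes:
        center = np.asarray(center, dtype=float)
        half = np.asarray(size, dtype=float) / 2.0
        # Convert the box corners from metres to voxel indices.
        lo = np.floor((center - half - gmin) / voxel_size).astype(int)
        hi = np.ceil((center + half - gmin) / voxel_size).astype(int)
        # Clamp to the grid so out-of-range boxes become empty slices.
        lo = np.clip(lo, 0, shape)
        hi = np.clip(hi, 0, shape)
        grid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = class_id
    return grid
```

For instance, `voxelize_boxes([((0, 0, 0), (2, 2, 2), 3)])` marks a 4 x 4 x 4 block of cells with class 3 under these default settings. A sequence of such grids over time is one simple way to picture the 4D structure the article refers to.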

WoVoGen does a good job of generating temporally consistent future world voxels (first two rows). Then, the world voxel-aware 2D image features output by the world model are used to synthesize a driving video with both multi-camera consistency and temporal consistency (bottom two rows).

4. What is the specific principle?

The overall framework of WoVoGen. Top: the world model branch. The authors fine-tune AutoencoderKL and train the 4D diffusion model from scratch to generate future world voxels conditioned on past world voxels and ego-vehicle actions. Bottom: the world voxel-aware synthesis branch. Taking the generated future world voxels as input, the world encoder produces Fw. Subsequent sampling yields Fimg, which is then aggregated. Finally, panoramic diffusion is applied to produce the future videos.
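The caption does not say how Fimg is sampled from Fw or how it is aggregated. One plausible reading, sketched below purely as an assumption, is to sample the world feature volume at points along each camera ray and average the samples that fall inside the grid. The function name, nearest-neighbor lookup, and mean aggregation are illustrative choices, not details confirmed by the source.

```python
import numpy as np

def sample_image_features(F_w, origin, ray_dirs, depths, grid_min, voxel_size):
    """Sample a world feature volume along camera rays (hypothetical scheme).

    F_w:      (X, Y, Z, C) world feature volume
    origin:   (3,) camera position in world coordinates
    ray_dirs: (H, W, 3) unit ray direction per pixel
    depths:   (D,) sample distances along each ray
    Returns   (H, W, C) per-pixel features, averaged over in-grid samples.
    """
    dims = np.array(F_w.shape[:3])
    origin = np.asarray(origin, dtype=float)
    # 3D sample points for every pixel and depth: (H, W, D, 3)
    points = origin + ray_dirs[:, :, None, :] * depths[None, None, :, None]
    idx = np.floor((points - np.asarray(grid_min, dtype=float)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=-1)   # (H, W, D)
    idx = np.clip(idx, 0, dims - 1)                       # keep indexing safe
    feats = F_w[idx[..., 0], idx[..., 1], idx[..., 2]]    # (H, W, D, C)
    feats = feats * inside[..., None]                     # zero out-of-grid samples
    count = np.maximum(inside.sum(axis=-1, keepdims=True), 1)  # (H, W, 1)
    return feats.sum(axis=2) / count
```

Because every camera reads from the same volume, features for overlapping fields of view come from the same voxels, which gives an intuition for why a shared world representation helps multi-camera consistency.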

5. How does it compare with other SOTA methods?

Quantitative comparison of image/video generation quality on the nuScenes validation set. WoVoGen supports both multi-view and multi-frame generation and achieves the lowest FID and FVD among the compared methods (lower is better for both metrics).
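For readers unfamiliar with these metrics: FID and FVD both fit a Gaussian to features of real data and of generated data, then compute the Fréchet distance between the two Gaussians. The sketch below covers only that distance. Feature extraction (a pretrained image network for FID, a pretrained video network for FVD) is omitted, and the function name is an assumption for this example.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: (N, D) arrays of features with D >= 2.
    d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((S_r S_g)^(1/2)) is the sum of square roots of the eigenvalues of S_r S_g.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    # Drop tiny imaginary/negative parts caused by floating-point error.
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero, and the value grows as the means or covariances drift apart, which is why a lower score indicates generated data that is statistically closer to the real data.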

6. Summary

This article introduces WoVoGen, which leverages 4D world voxels to combine temporal and spatial data, addressing the complexity of generating content from multi-sensor data while ensuring consistency. The two-stage system not only produces high-quality video from vehicle control inputs, but also enables complex scene editing.

Review Editor: Huang Fei
