Mixed Neural Voxels for Fast Multi-view Video Synthesis


Feng Wang1    Sinan Tan1    Xinghang Li1    Zeyue Tian2    Huaping Liu1   
1Tsinghua University
2HKUST
[Paper]
[Code]

Trained for 40 minutes on a single RTX 3090 GPU (300 frames)

Trained for 15 minutes on a single RTX 3090 GPU (300 frames)

More complex scenes with fast motion and large moving regions

Abstract

Synthesizing high-fidelity videos from real-world multi-view input is challenging because of the complexity of real-world environments and highly dynamic motions. Previous works based on neural radiance fields have demonstrated high-quality reconstructions of dynamic scenes. However, training such models on real-world scenes is time-consuming, usually taking days or weeks. In this paper, we present MixVoxels, a novel method that represents dynamic scenes with fast training speed and competitive rendering quality. MixVoxels represents a 4D dynamic scene as a mixture of static and dynamic voxels and processes the two with different networks. This way, the quantities required for static voxels can be computed by a lightweight model, which substantially reduces computation, especially for the many everyday dynamic scenes dominated by a static background. To separate the two kinds of voxels, we propose a novel variation field that estimates the temporal variance of each voxel. For the dynamic voxels, we design an inner-product time query that efficiently evaluates multiple time steps at once, which is essential for recovering highly dynamic motions. As a result, with only 15 minutes of training on 300-frame input videos, MixVoxels achieves better PSNR than previous methods.
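
The two core ideas from the abstract can be summarized in a few lines. The PyTorch sketch below illustrates them under stated assumptions: a variation-field threshold splits voxels into static and dynamic sets, and the inner-product time query evaluates all time steps of a dynamic voxel with a single matrix multiply. All names and shapes (`variation`, `VAR_THRESHOLD`, `time_basis`, the feature dimension) are illustrative assumptions, not the released code's API.

```python
# Minimal sketch of MixVoxels' static/dynamic split and inner-product
# time query; values are random stand-ins, not trained parameters.
import torch

T, D, N = 300, 64, 4096         # frames, feature dim, sampled voxels

# 1) Variation field: a predicted temporal variance per voxel
#    (stand-in values here), thresholded to separate the two sets.
variation = torch.rand(N)
VAR_THRESHOLD = 0.1             # hypothetical split threshold
dynamic_mask = variation > VAR_THRESHOLD
n_dyn = int(dynamic_mask.sum())
n_sta = N - n_dyn

# 2) Static voxels: a lightweight, time-independent head suffices,
#    since their appearance does not change across frames.
static_feat = torch.randn(n_sta, D)
static_rgb = torch.sigmoid(static_feat @ torch.randn(D, 3))   # (n_sta, 3)

# 3) Dynamic voxels: inner-product time query. Each voxel's feature is
#    dotted with T shared time embeddings, yielding outputs for all T
#    time steps in one matrix multiply instead of T separate queries.
dynamic_feat = torch.randn(n_dyn, D)
time_basis = torch.randn(D, T)             # shared time embeddings
dynamic_out = dynamic_feat @ time_basis    # (n_dyn, T)
```

Querying all frames in one matrix multiply is what makes the dynamic branch cheap enough to train in minutes, while the static branch skips the time dimension entirely.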



All Scenes on the Plenoptic Video Dataset


Long dynamic scenes (40s)