We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a 3D dataset of 700K scenes (generated in only 3 days), 70 times larger than the prior real dataset DL3DV, dramatically scaling the training data. To enable scalable data generation, our key idea is to eliminate semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, which scale readily. In addition, we control data complexity to facilitate training while loosely aligning it with the real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data, enabling wide-coverage scene reconstruction within 0.3 seconds.
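As a rough illustration of co-training on MegaSynth and real data, a minimal sketch is shown below; the loader interfaces, the `p_synthetic` mixing ratio, and the batch contents are assumptions for illustration, not the paper's actual training recipe.

```python
import random

def mixed_batches(megasynth_loader, real_loader, p_synthetic=0.5):
    """Interleave training batches from the synthetic and real sources.

    Both loaders are assumed to be infinite iterators over training batches
    (images, cameras, and, for MegaSynth, exact geometry meta-data);
    `p_synthetic` is a placeholder mixing ratio, not the value used in the paper.
    """
    synthetic_iter, real_iter = iter(megasynth_loader), iter(real_loader)
    while True:
        if random.random() < p_synthetic:
            yield next(synthetic_iter)  # MegaSynth batch
        else:
            yield next(real_iter)       # real batch, e.g., DL3DV
```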
MegaSynth synthesizes data using non-learning-based procedural generation. We first generate the scene floor plan, where each 3D box represents a shape and different colors indicate different object types. We then compose shape primitives into objects with geometry augmentations, and these objects in turn compose the scene. Finally, we randomize the texture and lighting, and generate random cameras for rendering. MegaSynth benefits 3D reconstruction models with: (1) scalability, as the procedural data generation is efficient; (2) controllability, as we have full control over data complexity, distribution, and alignment with real-world data; (3) diversity, through randomized geometry, lighting, materials, and spatial structures; (4) accurate meta-data, which provides geometry supervision and stabilizes training.
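A minimal sketch of this kind of non-learned procedural generation is given below; every helper name, primitive list, and parameter range here is an illustrative assumption, not the released MegaSynth implementation.

```python
import random

PRIMITIVES = ["cube", "sphere", "cylinder", "cone"]  # assumed primitive set

def sample_floor_plan(num_boxes=(4, 12)):
    """Lay out axis-aligned 3D boxes on a ground plane; each box is one object slot."""
    return [{
        "center": [random.uniform(-5.0, 5.0), 0.0, random.uniform(-5.0, 5.0)],
        "size": [random.uniform(0.3, 2.0) for _ in range(3)],
        "type": random.choice(PRIMITIVES),
    } for _ in range(random.randint(*num_boxes))]

def compose_object(box, num_extra=(0, 4)):
    """Fill a box with its base primitive plus randomly augmented extra primitives."""
    shapes = [box["type"]] + [random.choice(PRIMITIVES)
                              for _ in range(random.randint(*num_extra))]
    return [{
        "shape": shape,
        "scale": [s * random.uniform(0.2, 1.0) for s in box["size"]],
        "rotation": [random.uniform(0.0, 360.0) for _ in range(3)],
        "texture": random.choice(["noise", "checker", "voronoi"]),  # procedural textures
    } for shape in shapes]

def generate_scene(num_views=32):
    """One synthetic scene: primitive-based geometry, randomized lighting, random cameras."""
    return {
        "objects": [compose_object(b) for b in sample_floor_plan()],
        "lighting": {"num_lights": random.randint(1, 4),
                     "intensity": random.uniform(0.5, 5.0)},
        "cameras": [{"position": [random.uniform(-6.0, 6.0) for _ in range(3)],
                     "fov_deg": random.uniform(40.0, 80.0)}
                    for _ in range(num_views)],
    }  # the scene description is then rendered offline to produce training images
```

Since generation is non-learned and each scene is independent, the process parallelizes trivially, which is what makes producing 700K scenes within a few days feasible.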
We compare with baselines on three test datasets: DL3DV (real, in-domain), Hypersim (synthetic, out-of-domain, indoor), and MipNeRF360 + Tanks & Temples (real, out-of-domain, outdoor). We run experiments with two base models (GS-LRM and Long-LRM) under two settings (resolution 128 and 256), and show the benefit of training with MegaSynth over training with DL3DV alone.
Using MegaSynth, our results demonstrate consistent PSNR gains of 1.2 to 1.8 dB across experiment settings, base models, and test datasets.
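For reference, the PSNR metric behind these numbers can be computed as in the sketch below (assuming images normalized to [0, 1]); a 1.2 to 1.8 dB gain corresponds to roughly a 24 to 34 percent reduction in mean squared error.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# A +1.2 dB change in PSNR means the MSE dropped by a factor of 10**(-1.2 / 10) ≈ 0.76,
# i.e., roughly a 24% reduction; +1.8 dB corresponds to about a 34% reduction.
```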
We include results of Long-LRM at resolution 256 on in-domain DL3DV and out-of-domain data. On DL3DV, with MegaSynth, the model performs better on thin structures (e.g., middle left), complicated lighting (e.g., top left), and cluttered scenes (e.g., middle right).
For out-of-domain data, we show results on Hypersim and MipNeRF360 in the first and second rows, respectively. With MegaSynth, the model performs better under strong lighting (e.g., top left), on thin structures (e.g., middle left), and on complicated materials (e.g., bottom right).
We also experiment with training GS-LRM (res-128) on MegaSynth only, which achieves results comparable to training on real data. The results are consistent across different numbers of input images. This implies that 3D reconstruction requires nearly no semantic information, akin to the success of non-semantic, optimization-based methods such as COLMAP and NeRF, and suggests that 3D reconstruction is a low-level task!
@article{jiang2024megasynth,
  title={MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data},
  author={Jiang, Hanwen and Xu, Zexiang and Xie, Desai and Chen, Ziwen and Jin, Haian and Luan, Fujun and Shu, Zhixin and Zhang, Kai and Bi, Sai and Sun, Xin and Gu, Jiuxiang and Huang, Qixing and Pavlakos, Georgios and Tan, Hao},
  journal={arXiv preprint arXiv:2412.14166},
  year={2024}
}