Real3D: Scaling Up Large Reconstruction Models with Real-World Images

UT Austin
Code · Paper · Hugging Face Demo

TLDR: We scale up the training data of single-view LRMs by enabling self-training on in-the-wild images.

Abstract

Since single-view 3D reconstruction is ill-posed due to the ambiguity of lifting 2D to 3D, reconstruction models must learn generic shape and texture priors from large amounts of data. The default strategy for training single-view Large Reconstruction Models (LRMs) follows the fully supervised route, using synthetic 3D assets or multi-view captures. Although these resources simplify the training procedure, they are hard to scale up beyond existing datasets, and they are not necessarily representative of the real distribution of object shapes. To address these limitations, we introduce Real3D, the first LRM system that can be trained using single-view real-world images. Real3D introduces a novel self-training framework that benefits from both existing 3D/multi-view synthetic data and diverse single-view real images. We propose two unsupervised losses that allow us to supervise LRMs at the pixel and semantic level, even for training examples without ground-truth 3D or novel views. To further improve performance and scale up the image data, we develop an automatic data curation approach that collects high-quality examples from in-the-wild images. Our experiments show that Real3D consistently outperforms prior work in four diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes.

Overview

Real3D is trained jointly on synthetic data (fully supervised) and on single-view real images (unsupervised self-training). The former stabilizes training with supervision from ground-truth novel views. The latter introduces new information, which improves reconstruction quality and generalization. A curation strategy identifies and leverages high-quality training instances from the initial real-image collection. We adopt the LRM model architecture.

[Figure: Real3D training overview]
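To make the joint training concrete, below is a minimal PyTorch-style sketch of a single training step that mixes the two data streams. All names here (supervised_loss, self_training_loss, lambda_real) are hypothetical placeholders, not the released Real3D code.

    import torch

    # Hypothetical joint training step: synthetic batches carry ground-truth
    # novel views (fully supervised), while real batches contain only single
    # views and are handled by the unsupervised self-training losses.
    def training_step(model, optimizer, synthetic_batch, real_batch,
                      supervised_loss, self_training_loss, lambda_real=1.0):
        loss_syn = supervised_loss(model, synthetic_batch)
        loss_real = self_training_loss(model, real_batch)
        loss = loss_syn + lambda_real * loss_real
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The weighting lambda_real balances the stabilizing supervised signal against the new information contributed by the real images.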

Self-Training

We develop novel unsupervised pixel-level and semantic-level supervision for training the model on single-view images, where ground-truth novel views are not available. The pixel-level supervision uses a cycle-consistency rendering loss (shown below). We found that applying a stop-gradient to the intermediate renderings avoids trivial reconstruction solutions that degenerate the model. Moreover, we apply a pose sampling curriculum, increasing the difficulty of the learning target from simple to hard over the course of training. The semantic-level supervision uses the CLIP similarity between the input image and novel views of the reconstruction, combined with additional regularization and hard negative mining.
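Below is a minimal sketch of the cycle-consistency loss and the pose curriculum, assuming a hypothetical lrm(image) encoder that produces a triplane and a render(triplane, azimuth) function; the MSE reconstruction term and the azimuth-only pose parameterization are simplifying assumptions, not the paper's exact formulation.

    import math
    import random
    import torch.nn.functional as F

    def sample_novel_azimuth(progress, max_azimuth=math.pi):
        # Pose-sampling curriculum: the admissible azimuth range widens as
        # training progresses, moving the target from simple to difficult.
        limit = max_azimuth * min(progress, 1.0)
        return random.uniform(-limit, limit)

    def cycle_consistency_loss(lrm, render, image, progress):
        # First pass: reconstruct a 3D representation from the input view.
        triplane = lrm(image)
        # Render an intermediate novel view; the stop-gradient (detach)
        # avoids trivial reconstruction solutions that degenerate the model.
        novel_view = render(triplane, sample_novel_azimuth(progress)).detach()
        # Second pass: reconstruct again from the rendered novel view.
        triplane_cycle = lrm(novel_view)
        # Render back at the input pose (azimuth 0) and compare pixel-wise
        # with the original input image.
        image_cycle = render(triplane_cycle, 0.0)
        return F.mse_loss(image_cycle, image)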

[Figure: self-training supervision]
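For the semantic-level term, here is a sketch that uses in-batch hard negatives as one concrete instance of hard negative mining; clip_model.encode_image follows the OpenAI CLIP interface, while the temperature and the contrastive form are assumptions (the paper's exact regularizers may differ).

    import torch
    import torch.nn.functional as F

    def semantic_clip_loss(clip_model, input_images, novel_views, temperature=0.07):
        # Each rendered novel view should be semantically close (in CLIP space)
        # to its own input image; the other inputs in the batch serve as hard
        # negatives in a contrastive objective.
        with torch.no_grad():
            feat_in = F.normalize(clip_model.encode_image(input_images), dim=-1)
        feat_nv = F.normalize(clip_model.encode_image(novel_views), dim=-1)
        logits = feat_nv @ feat_in.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)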

Comparison with Baselines

We compare with baselines on four test datasets (MVImgNet, CO3D, in-the-wild images, and OmniObject3D), encompassing both real and synthetic data, as well as both in-domain and out-of-domain shapes. We show visualization results below.


We show more comparisons with our base model, TripoSR.

Scaling Effect

We experiment with using different ratios of real data for self-training. Real3D performs better as more real data is used, showcasing its potential for further scaling up.

[Figure: scaling effect]

BibTeX


@article{jiang2024real3d,
  title={Real3D: Scaling Up Large Reconstruction Models with Real-World Images},
  author={Jiang, Hanwen and Huang, Qixing and Pavlakos, Georgios},
  journal={arXiv preprint arXiv:2406.08479},
  year={2024}
}