What are we missing in 3D Vision?
Data and Self-supervision

Supervised 3D learning with COLMAP labels is cool, but this is far from the end of the story.

Hanwen Jiang November 15, 2024

Foundation Model Philosophy is Missing in 3D Vision

Current large foundation models, e.g., LLMs, are built with stage-wise training. They are first pre-trained on large-scale data to learn general knowledge and good representations, most likely via self-supervision (e.g., next-token prediction). Then they are mid-trained to learn specific capabilities, e.g., reasoning. Since the data in this stage is carefully curated, this stage is de facto supervised learning, even though the LLMs still learn through next-token prediction. Finally, they are post-trained to align with human preferences.
However, this recipe is entirely missing in 3D Vision -- we are mixing all stages together under supervised learning all the time.
Why is this a problem? We will have a series of blogs (papers) to discuss this, and we will go beyond 3D Vision later. Specifically, this blog gives a perspective on the problem of DATA.


Data! Data! Data!

DATA is the fuel of everything!

    Jitendra Malik: Data! data! data! I can't make bricks without clay.
    Alyosha Efros: We are (still!) not giving data enough credit.

However, we lack 3D DATA -- The COLMAP Era

3D Vision has followed supervised learning for a long time. We first label data with COLMAP, and turn everything into a supervised learning problem.
Why is this bad? Because this is not what modern foundation models do. Specifically, COLMAP is bad because:

  • It is inefficient, as per-scene optimization is super slow and hard to deploy at large scale (unless you are Google).
  • It is noisy, as no one can guarantee the quality of the labels. Your model is a distillation of the noisy COLMAP labels.
  • Most importantly, it is hard to scale, as it only works on limited types of data, e.g., static scenes with big camera motion, and ultimately limits the amount of 3D DATA you can use.

    But, this is a FAKE problem

    Data is actually everywhere! As Engels said, "The unity of the world consists in its materiality". We live in the 3D physical world, and any projection of this world is (partial) 3D data. We just need to recover the 3D from the projections, and we actually have very strong inductive biases -- the disentanglement between structure (scene) and motion (camera), space-time continuity, object permanence, physical laws and plausibility, etc. From this perspective, we have infinite data -- videos on the Internet reveal their underlying 3D structure (COLMAP doesn't work on most of them). Thus, what we want to do is simply let models learn the underlying 3D by themselves. This is self-supervised learning.


    RayZer: Self-supervised Learning for 3D Vision

    Self-supervised Learning is actually EASY

    Self-supervised Learning Comparison

    Building a self-supervised learning model for 3D Vision is not as hard as you think. We can learn from the success of other domains, e.g., language and image. In these domains, a prevailing paradigm is predicting masked raw data. For example, in language, we predict the (masked) missing tokens in a sentence. In image, we predict the (masked) missing pixels in an image.
    So, in 3D Vision, what raw data should we mask and predict? The masked view (frame)! -- since in multi-view geometry, a view is the smallest unit, just like a word in a sentence or a pixel in an image.
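    The analogy above can be made concrete with a toy sketch, using hypothetical tensor shapes (10 tokens, 32x32 images, 8 views) purely for illustration: in each domain, we hide part of the raw data and ask the model to fill it in.

```python
import torch

# Toy illustration of the same masked-prediction recipe at three granularities.
tokens = torch.arange(10)                 # "sentence": mask a token, predict it
masked_token = tokens.clone()
masked_token[3] = -1                      # -1 stands for a [MASK] token

image = torch.rand(3, 32, 32)             # "image": mask pixels, predict them
mask = torch.rand(32, 32) > 0.75          # hide ~25% of the pixels
masked_image = image * (~mask)            # zero out the masked pixels

views = torch.rand(8, 3, 32, 32)          # "scene": mask a view, predict it
held_out = torch.randint(0, 8, (1,)).item()
context = torch.cat([views[:held_out], views[held_out + 1:]])  # 7 input views
target = views[held_out]                  # the masked view to reconstruct
```

    In all three cases the supervision signal comes from the raw data itself; no external labels are needed.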

    Masked View Prediction with 3D Structure

    The remaining question is how to predict the masked view. As we intend to recover the underlying 3D structure, we can use the 3D structure to predict the masked view. This leads to the disentanglement between structure and motion.
    During training, the input of RayZer is unlabeled (unposed & uncalibrated) multi-view images. RayZer splits the images into two sets -- set A and set B.
    RayZer predicts the scene representation from set A, and uses the predicted cameras of set B to render the scene, getting predictions for set B.
    Thus, RayZer is trained with only a photometric loss between the original and predicted images of set B, requiring zero 3D supervision of camera or geometry. This split helps us disentangle the scene and camera representations. The learning paradigm is effectively auto-encoding of multi-view images, but with controlled information flow and explicit 3D structure.
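    A minimal sketch of one such training step is below. The sub-module names (`pose_net`, `scene_net`, `renderer`) and tensor layout are my own placeholders, not RayZer's actual API; the point is only the information flow: scene from set A, cameras for set B, photometric loss on set B.

```python
import torch

def self_supervised_step(model, images, optimizer):
    """One self-supervised training step on unposed multi-view images.

    images: (batch, num_views, C, H, W), no camera labels.
    model: assumed to expose three hypothetical sub-modules:
      pose_net(images)      -> per-view camera predictions
      scene_net(imgs, pose) -> scene representation (e.g., a token set)
      renderer(scene, pose) -> rendered images for the given cameras
    """
    num_views = images.shape[1]
    # Randomly split the unposed views into input set A and target set B.
    perm = torch.randperm(num_views)
    set_a, set_b = perm[: num_views // 2], perm[num_views // 2:]
    # Predict cameras for all views from pixels alone (no labels).
    poses = model.pose_net(images)
    # Build the scene representation only from set A.
    scene = model.scene_net(images[:, set_a], poses[:, set_a])
    # Re-render the held-out set B views from their predicted cameras.
    pred = model.renderer(scene, poses[:, set_b])
    # A photometric loss is the only supervision signal.
    loss = torch.nn.functional.mse_loss(pred, images[:, set_b])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    Note how no gradient path requires ground-truth cameras: the pose network is trained end-to-end because bad poses make set B impossible to render correctly.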

    Instantiation of RayZer with minimal 3D inductive bias

    Similar to foundation models in other domains, we build RayZer with minimal 3D inductive bias using transformers. The key points are:
  • We first predict cameras as SE(3) poses, as they have limited capacity, which helps disentangle them from the scene
  • We then convert the cameras into pixel-aligned Plücker rays, which provide fine-grained, ray-level physical structure of the camera
  • We represent the scene as a token set by compressing the images, and use a learned transformer renderer (LVSM architecture)
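    For readers unfamiliar with Plücker rays: a ray with origin o and unit direction d is commonly encoded as the 6-vector (d, o × d). The sketch below converts one camera (a hypothetical camera-to-world matrix `c2w` and intrinsics `K`) into such a pixel-aligned ray map; the exact parameterization in RayZer may differ.

```python
import torch

def plucker_rays(c2w, K, H, W):
    """Per-pixel Plücker ray embeddings (d, o x d) of shape (H, W, 6).

    c2w: (4, 4) camera-to-world matrix; K: (3, 3) intrinsics.
    """
    device = c2w.device
    # Pixel grid sampled at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32) + 0.5,
        torch.arange(W, device=device, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject pixels to camera-space ray directions.
    dirs = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)],
        dim=-1,
    )                                        # (H, W, 3)
    # Rotate into world space and normalize.
    dirs = dirs @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)      # shared camera center, (H, W, 3)
    moment = torch.cross(origin, dirs, dim=-1)  # o x d
    return torch.cat([dirs, moment], dim=-1)    # (H, W, 6)
```

    This representation is translation-aware (through the moment o × d) yet purely pixel-aligned, which is what lets a transformer consume cameras as just another per-pixel feature.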
    Findings

    Experiments show that RayZer can even outperform the "oracle" supervised method LVSM on data annotated by COLMAP. This exposes the problem of supervised learning with COLMAP labels, even on static scenes. We have two pieces of evidence:
  • First, on synthetic Objaverse (where camera labels are perfect for training LVSM), RayZer performs on par with, but slightly worse than, LVSM
  • Second, in visualizations, we find that supervised methods usually fail on the cases where COLMAP fails

    What's Next?

    The value of self-supervised learning lies in training on large-scale data and learning transferable representations as pre-training. Towards this goal, we need to scale up further and verify that the learned representations are indeed transferable.

    This post reflects my thoughts after working on projects like RayZer, Real3D, and MegaSynth. If you're interested in discussing these ideas, feel free to reach out!