Modern self-supervised methods share the same principle of learning by predicting "missing" data. For example, GPT predicts the missing next token by leveraging the sequential prior of language; MAE predicts missing (masked) visual tokens by leveraging 2D spatial structure.
Can we make an analogy to build a self-supervised 3D model? What would the "missing" data be then?
Our answer is to predict "missing" views from observed views.
During training, the input to RayZer is a set of unlabeled (unposed & uncalibrated) multi-view images. RayZer splits the images into two sets -- \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \) and \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \).
RayZer predicts the \( \textcolor{#47D45A}{\text{scene representation}} \) from \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \), and uses the predicted \( \textcolor{#E97132}{\text{cameras}} \) of \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \) to render the \( \textcolor{#47D45A}{\text{scene}} \), yielding predictions \( \mathcal{\hat{I}}_\textcolor{#E97132}{\mathcal{B}} \).
Thus, RayZer is trained with only a photometric loss between \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \) and \( \mathcal{\hat{I}}_\textcolor{#E97132}{\mathcal{B}} \), requiring zero 3D supervision of cameras and geometry.
This split of \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \) and \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \) helps us disentangle scene and camera representations.
To make the cameras and the scene register with each other, we predict the cameras of all views (both \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \) and \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \)) jointly, and use the predicted \( \textcolor{#47D45A}{\text{cameras}} \) of \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \) to condition the \( \textcolor{#47D45A}{\text{scene}} \) prediction.
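This training scheme fits in a few lines of pseudo-PyTorch. The sketch below is illustrative only: the module names (`estimate_cameras`, `reconstruct_scene`, `render`) and the random 50/50 split are our assumptions, not the exact RayZer implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, images):
    """One self-supervised RayZer-style training step (minimal sketch;
    module interfaces are illustrative, not the official implementation).

    images: (V, 3, H, W) unposed, uncalibrated multi-view images of one scene.
    """
    V = images.shape[0]
    # Randomly split the views into an input set A and a target set B.
    perm = torch.randperm(V)
    idx_a, idx_b = perm[: V // 2], perm[V // 2 :]

    # Jointly predict cameras for ALL views and convert them to pixel-aligned
    # Plücker ray maps, so the cameras of A and B share one coordinate frame.
    rays = model.estimate_cameras(images)          # (V, 6, H, W)

    # Reconstruct a latent scene representation from set A only.
    scene = model.reconstruct_scene(images[idx_a], rays[idx_a])

    # Render the held-out views B using their predicted cameras.
    pred_b = model.render(scene, rays[idx_b])      # (|B|, 3, H, W)

    # Photometric loss is the only supervision: no GT poses or geometry.
    return F.mse_loss(pred_b, images[idx_b])
```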
To facilitate self-supervised learning, the RayZer model is built with minimal 3D inductive bias, motivated by self-supervised large models in other modalities. RayZer's three main components, the camera estimator, the scene reconstructor, and the renderer, are plain Transformers with self-attention. The only 3D prior incorporated is the ray structure, which simultaneously models the relationship between cameras, pixels (images), and the scene.

(Left) RayZer first estimates camera parameters, where the middle view is selected as the canonical reference view (in \( \textcolor{#44B3E1}{blue} \) box). RayZer predicts the intrinsics and the relative camera poses \( \mathcal{P} \) of all views. The camera poses live in low-dimensional \( SE(3) \), helping disentangle cameras and scene. The predicted cameras are then converted into pixel-aligned Plücker ray maps \( \mathcal{R} \). (Middle) RayZer uses the subset of input images, \( \mathcal{I}_\textcolor{#47D45A}{\mathcal{A}} \), together with their previously predicted camera Plücker ray maps, \( \mathcal{R}_\textcolor{#47D45A}{\mathcal{A}} \), to predict a latent \( \textcolor{#47D45A}{\text{scene}} \) representation. Here, the Plücker ray maps \( \mathcal{R}_\textcolor{#47D45A}{\mathcal{A}} \) serve as an effective condition for scene reconstruction by providing fine-grained ray-level information. (Right) RayZer renders a target image given the scene representation \( \mathbf{z}^{*} \) and a target camera. During training, we use \( \mathcal{R}_\textcolor{#E97132}{\mathcal{B}} \), the Plücker ray maps of the previously predicted cameras of \( \mathcal{I}_\textcolor{#E97132}{\mathcal{B}} \), to render \( \hat{\mathcal{I}}_\textcolor{#E97132}{\mathcal{B}} \). This allows training RayZer end-to-end with self-supervised photometric losses.
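To make the ray structure concrete, below is a minimal sketch of how a camera (intrinsics \( K \) and an \( SE(3) \) camera-to-world pose) can be converted into a pixel-aligned Plücker ray map. This follows the standard Plücker construction; RayZer's exact conventions (pixel centers, normalization) may differ.

```python
import torch

def plucker_ray_map(K, c2w, H, W):
    """Convert intrinsics K (3x3) and a camera-to-world pose c2w (4x4) into a
    pixel-aligned Plücker ray map of shape (6, H, W). Sketch of the standard
    construction, not necessarily the paper's exact convention.
    """
    # Pixel-center coordinates.
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # (H, W, 3)

    # Back-project to camera-space directions, then rotate to world space.
    dirs = pix @ torch.linalg.inv(K).T                       # (H, W, 3)
    dirs = dirs @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)

    # All rays share the camera center as origin; moment m = o x d.
    origin = c2w[:3, 3].expand_as(dirs)                      # (H, W, 3)
    moment = torch.cross(origin, dirs, dim=-1)

    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)  # (6, H, W)
```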
Experiment setting: We compare with two 3D-supervised methods, GS-LRM and LVSM. They are trained with 3D camera annotations and use labeled cameras during testing. RayZer shares similar reconstruction and rendering modules with LVSM.
We compare with baselines on DL3DV, RealEstate, and Objaverse. Note that the camera annotations of DL3DV and RealEstate are obtained with COLMAP.
RayZer (self-supervised) shows better novel view synthesis performance on DL3DV and RealEstate. This result not only demonstrates the strong capability of RayZer, but also implies that COLMAP annotations are not perfect. Supervised learning with COLMAP IS NOT ALWAYS THE BEST OPTION!
We find that GS-LRM and LVSM consistently fail on certain cases during inference. Interestingly, these are exactly the scenarios where COLMAP usually fails: for example, the glasses in the first row, and the high luminance and white walls in the second row. This result again verifies the limitation of supervised learning with COLMAP annotations and highlights the importance of self-supervised learning.
(Left) We visualize RayZer's predicted camera poses learned with self-supervision. We also visualize 3 of the 5 rendered views. The predicted poses correctly capture the camera motion patterns. (Right) At the same time, we find that the learned pose space does not exactly match the real-world pose space. Since RayZer is built on a latent 3D representation, the learned pose space only needs to be compatible with the learned scene representation, and is not guaranteed to be geometrically correct. To understand the learned pose space, we probe it by training a prediction MLP with supervised learning (with the Camera Encoder frozen and initialized from self-supervised pre-training). We also compare with a supervised baseline where both the Camera Encoder and the prediction MLP are trained from scratch. The result shows that RayZer's learned pose space is meaningful, and that RayZer's self-supervised novel view synthesis pre-training is more effective than supervised pose training.
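For reference, the pose-probing setup can be sketched as follows. The feature dimension, pose parameterization, and module interfaces below are assumptions for illustration, not the exact experimental configuration.

```python
import torch
import torch.nn as nn

class PoseProbe(nn.Module):
    """Minimal sketch of the pose-probing experiment: the self-supervised
    Camera Encoder is frozen and only a small MLP head is trained with
    ground-truth poses (feat_dim and the pose target dim are assumptions)."""

    def __init__(self, camera_encoder, feat_dim=768, pose_dim=9):
        super().__init__()
        self.encoder = camera_encoder
        for p in self.encoder.parameters():   # freeze pre-trained weights
            p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, pose_dim)
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)      # per-view pose features
        return self.head(feats)               # supervised pose regression

# Training uses a standard regression loss against labeled poses, e.g.:
# loss = F.mse_loss(probe(images), gt_poses)
```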
We include more novel view synthesis results of RayZer.
@article{jiang2025rayzer,
title={RayZer: A Self-supervised Large View Synthesis Model},
author={Jiang, Hanwen and Tan, Hao and Wang, Peng and Jin, Haian and Zhao, Yue and Bi, Sai and Zhang, Kai and Luan, Fujun and Sunkavalli, Kalyan and Huang, Qixing and Pavlakos, Georgios},
journal={arXiv preprint arXiv:2505.00702},
year={2025},
}