Model Overview
RUST builds on the Scene Representation Transformer (SRT). Unlike prior methods, it requires no pose supervision of any form, neither for training nor for inference. Instead, the model learns a latent pose space through self-supervision by peeking at the target view during training.
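The data flow described above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation. Random linear maps stand in for the learned networks, and all sizes and function names (`encode_scene`, `encode_pose`, `decode`, `POSE_DIM`, and so on) are hypothetical. Feeding the scene latent to the pose encoder, and keeping the latent pose small so the target image cannot simply be copied through, are also assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
IMG_DIM = 48     # flattened toy "image"
SCENE_DIM = 32   # scene latent size
POSE_DIM = 8     # latent pose size (kept small in this sketch)


def init_linear(n_in, n_out):
    """Random linear map standing in for a learned network."""
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))


W_enc = init_linear(IMG_DIM, SCENE_DIM)
W_pose = init_linear(IMG_DIM + SCENE_DIM, POSE_DIM)
W_dec = init_linear(SCENE_DIM + POSE_DIM, IMG_DIM)


def encode_scene(input_views):
    """Unposed input views (n_views, IMG_DIM) -> scene latent (SCENE_DIM,)."""
    return np.tanh(input_views @ W_enc).mean(axis=0)


def encode_pose(target_view, scene_latent):
    """Peek at the target view and compress it to a latent pose (POSE_DIM,)."""
    return np.tanh(np.concatenate([target_view, scene_latent]) @ W_pose)


def decode(scene_latent, latent_pose):
    """Render a view from the scene latent, conditioned on the latent pose."""
    return np.concatenate([scene_latent, latent_pose]) @ W_dec


def training_loss(input_views, target_view):
    """Reconstruction loss on RGB alone; no camera pose appears anywhere."""
    scene_latent = encode_scene(input_views)
    latent_pose = encode_pose(target_view, scene_latent)
    prediction = decode(scene_latent, latent_pose)
    return float(np.mean((prediction - target_view) ** 2))


# Toy data: three unposed input views and one target view of the same scene.
input_views = rng.normal(size=(3, IMG_DIM))
target_view = rng.normal(size=IMG_DIM)
loss = training_loss(input_views, target_view)
```

In this sketch, only the decoder's conditioning changes at test time: a latent pose can be taken from another image or modified directly, which is the kind of test-time camera transformation the abstract refers to.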

Abstract

Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have had tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that provides latent representations which generalize effectively beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves quality similar to that of methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.

Examples and Visualizations

PCA Analysis (MSN dataset)
PCA Analysis (MSN streetview)
Video results (MSN dataset)
Video results RUST p-64 and p-768 (MSN dataset)
Video results (SV dataset)
Video results (GNeRF baseline)

Dataset

Our dataset is identical to the one used in Object Scene Representation Transformer (OSRT). Please see the OSRT project website for details.

Code

An official implementation is currently unavailable. However, there are third-party implementations of SRT and OSRT (including the improved SRT architecture) that could be extended to RUST by adding a pose estimator. Please see the SRT and OSRT project websites for further resources.

Related Projects

Scene Representation Transformer (SRT)
Object Scene Representation Transformer (OSRT)
Dynamic Scene Transformer (DyST)

Reference

  @article{sajjadi2022rust,
    author = {
      Sajjadi, Mehdi S. M.
      and Mahendran, Aravindh
      and Kipf, Thomas
      and Pot, Etienne
      and Duckworth, Daniel
      and Lu{\v{c}}i{\'c}, Mario
      and Greff, Klaus
    },
    title = {{RUST: Latent Neural Scene Representations from Unposed Imagery}},
    journal = {CVPR},
    year = {2023},
  }