Geometry-biased Transformers for Novel View Synthesis

Naveen Venkat*1, Mayank Agarwal*1, Maneesh Singh, Shubham Tulsiani1
1Carnegie Mellon University
(* indicates equal contribution)

[Paper] [Video] [Code]


Given a small set of context images with known camera viewpoints, our Geometry-biased Transformer (GBT) synthesizes novel views from arbitrary query viewpoints. The use of global context ensures meaningful predictions despite large viewpoint variation, while the geometric bias allows more accurate inference than a baseline without such bias (GBT-nb).



Abstract

We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints. Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation, which is then used to predict the color for arbitrary query rays. While this representation yields (coarsely) accurate images corresponding to novel viewpoints, the lack of geometric reasoning limits the quality of these outputs. To overcome this limitation, we propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference to encourage multi-view geometric consistency. We induce the geometric bias by augmenting the dot-product attention mechanism to also incorporate 3D distances between rays associated with tokens as a learnable bias. We find that this, along with camera-aware embeddings as input, allows our models to generate significantly more accurate outputs. We validate our approach on the real-world CO3D dataset, where we train our system across 10 categories and evaluate its view-synthesis ability for novel objects as well as for unseen categories. We empirically demonstrate the benefits of the proposed geometric biases and show that our approach significantly improves over prior works.
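
To make the geometric bias concrete, the sketch below shows one way to add a learnable ray-distance offset to scaled dot-product attention, following the description above. This is a minimal PyTorch illustration, not our released implementation: the single-head formulation, the use of the closest distance between two 3D lines as the ray distance, and the scalar learnable scale gamma are simplifying assumptions for exposition; please refer to the code release for the actual model.

import torch
import torch.nn as nn
import torch.nn.functional as F

def pairwise_ray_distance(origins, dirs, eps=1e-8):
    """Closest distance between every pair of rays, treated as 3D lines.
    origins: (N, 3) ray origins; dirs: (N, 3) unit ray directions.
    Returns an (N, N) matrix of pairwise distances."""
    delta = origins[None, :, :] - origins[:, None, :]        # (N, N, 3)
    d1 = dirs[:, None, :].expand_as(delta)                   # (N, N, 3)
    d2 = dirs[None, :, :].expand_as(delta)                   # (N, N, 3)
    n = torch.cross(d1, d2, dim=-1)                          # common normal
    n_norm = n.norm(dim=-1)
    # Skew (non-parallel) rays: |delta . n| / |n|
    skew = (delta * n).sum(-1).abs() / n_norm.clamp(min=eps)
    # Near-parallel rays: distance from one origin to the other line
    parallel = (delta - (delta * d1).sum(-1, keepdim=True) * d1).norm(dim=-1)
    return torch.where(n_norm > eps, skew, parallel)

class GeometryBiasedAttention(nn.Module):
    """Single-head self-attention whose logits are offset by a learnable
    multiple of the pairwise distance between the tokens' rays."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.gamma = nn.Parameter(torch.tensor(1.0))  # learnable bias scale
        self.scale = dim ** -0.5

    def forward(self, tokens, ray_origins, ray_dirs):
        # tokens: (N, dim); each token is associated with one ray.
        q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)
        logits = (q @ k.transpose(-1, -2)) * self.scale      # (N, N)
        logits = logits - self.gamma * pairwise_ray_distance(ray_origins, ray_dirs)
        return F.softmax(logits, dim=-1) @ v

Intuitively, subtracting gamma times the ray distance makes tokens whose rays pass close to each other in 3D attend to each other more strongly, while the learned gamma lets the network modulate how strongly geometry constrains the attention.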




Video





Results

Qualitative results on held-out objects from training categories


For each object, we consider V = 3 input views and compare the reconstruction quality of each method on novel query views. See more randomly sampled results here.


Qualitative results on held-out categories


Given V = 3 input views, we visualize the rendered views obtained from GBT. Note that the model has never seen these categories of objects during training. See more random results here.


Effect of camera noise


Given 3 input views with noisy camera poses (noise increasing from left to right), we visualize the predictions for a common query view across three methods (top row: pixelNeRF, middle row: GBT-nb, bottom row: GBT).



BibTeX

          
@article{venkat2023geometry,
  title={Geometry-biased Transformers for Novel View Synthesis},
  author={Venkat, Naveen and Agarwal, Mayank and Singh, Maneesh and Tulsiani, Shubham},
  journal={arXiv preprint arXiv:2301.04650},
  year={2023}
}
          
        



Acknowledgements

We thank Zhizhuo Zhou, Jason Zhang, Yufei Ye, Ambareesh Revanur, Yehonathan Litman, and Anish Madan for helpful discussions and feedback. We also thank David Novotny and Jonáš Kulhánek for sharing outputs of their work and helpful correspondence. This project was supported in part by a Verisk AI Faculty Research Award. This webpage template was borrowed from here.