SuperGaussian:

Repurposing Video Models for 3D Super Resolution

Yuan Shen1,2, Duygu Ceylan2, Paul Guerrero2, Zexiang Xu2, Niloy J. Mitra2, 3, Shenlong Wang1, and Anna Frühstück2

1. University of Illinois at Urbana-Champaign       2. Adobe Research       3. University College London


We present SuperGaussian, a novel method that repurposes existing video upsampling models for the 3D super-resolution task. SuperGaussian can handle various input types, such as NeRFs, Gaussian Splats, reconstructions obtained from noisy scans, models generated by recent text-to-3D methods, or low-poly meshes. SuperGaussian generates high-resolution 3D outputs with rich geometric and texture details in the form of Gaussian Splats.

Abstract

We present a simple, modular, and generic method that upsamples coarse 3D models by adding geometric and appearance details. While generative 3D models now exist, they do not yet match the quality of their counterparts in the image and video domains. We demonstrate that it is possible to directly repurpose existing (pre-trained) video models for 3D super-resolution and thus sidestep the shortage of large repositories of high-quality 3D training models. We describe how to repurpose video upsampling models – which are not 3D-consistent – and combine them with 3D consolidation to produce 3D-consistent results. As output, we produce high-quality Gaussian Splat models, which are object-centric and effective. Our method is category-agnostic and can be easily incorporated into existing 3D workflows. We evaluate our proposed SuperGaussian on a variety of 3D inputs, which are diverse both in terms of complexity and representation (e.g., Gaussian Splats or NeRFs), and demonstrate that our simple method significantly improves the fidelity of the final 3D models.

Video Overview

Pipeline in a Nutshell

Given an input low-res 3D representation, which can be in various formats, we first sample a smooth camera trajectory and render an intermediate low-resolution video. We then upsample this video using an existing video upsampler, obtaining frames with sharper and more vivid details. Next, we perform 3D optimization on the upsampled video to consolidate geometric and texture details. Our method, SuperGaussian, produces a final 3D representation in the form of high-resolution Gaussian Splats. A minimal code sketch of these stages follows the stage labels below.

Low-res Input

Trajectory Sampling

Video Upsampling

Reconstruction
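The stages above map naturally onto a short script. Below is a minimal sketch in Python, assuming the input representation comes with a renderer; the names supergaussian_upsample, renderer, video_upsampler, and splat_optimizer are illustrative placeholders rather than the released implementation, and any pre-trained video super-resolution model and Gaussian Splatting optimizer could be plugged in. The final 3D optimization step is what resolves the frame-to-frame inconsistencies left by the video upsampler.

import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 camera-to-world matrix looking from `eye` toward `target`."""
    forward = (target - eye) / np.linalg.norm(target - eye)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, true_up, -forward, eye
    return pose

def sample_smooth_trajectory(center, radius, n_frames=60):
    """Sample a smooth circular orbit of camera poses around the object center."""
    poses = []
    for t in np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False):
        eye = center + radius * np.array([np.cos(t), 0.25, np.sin(t)])
        poses.append(look_at(eye, center))
    return poses

def supergaussian_upsample(low_res_scene, renderer, video_upsampler, splat_optimizer):
    """Pipeline sketch: render a low-res video, upsample it, then fit Gaussian Splats."""
    poses = sample_smooth_trajectory(center=low_res_scene.center, radius=2.0)
    # 1. Render an intermediate low-resolution video from the coarse 3D input.
    low_res_frames = [renderer(low_res_scene, pose, resolution=64) for pose in poses]
    # 2. Upsample the video with a pre-trained video super-resolution model;
    #    per-frame outputs are sharper but not yet 3D-consistent.
    high_res_frames = video_upsampler(low_res_frames)
    # 3. Consolidate in 3D: optimize Gaussian Splats against the upsampled frames.
    return splat_optimizer(high_res_frames, poses)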

Paper


arXiv Page

@article{Shen2024SuperGaussian,
  title = {SuperGaussian: Repurposing Video Models for 3D Super Resolution},
  author = {Shen, Yuan and Ceylan, Duygu and Guerrero, Paul and Xu, Zexiang and Mitra, {Niloy J.} and Wang, Shenlong and Fr{\"u}hst{\"u}ck, Anna},
  journal = {arXiv preprint arXiv:2406.00609},
  year = {2024},
}

Quantitative Comparison

Comparison on MVImgNet to upsample low-res Gaussian Splats

We compare our method against baselines using perceptual metrics. Even though our method is generic, it consistently produces the best quantitative results. We encourage the reader to inspect the visual results in the sections below, which highlight that the visual quality of our method surpasses the baselines.

Method                    LPIPS ↓   NIQE ↓   FID ↓   IS ↑
Instruct-NeRF2NeRF [1]    0.1867    8.33     32.56   10.52 ± 1.06
Super-NeRF [2]            0.2204    8.84     37.54   10.40 ± 1.03
Pre-hoc Image [3]         0.1524    7.65     27.04   11.27 ± 0.99
SuperGaussian (ours)      0.1290    6.80     24.32   11.69 ± 1.08

[1] Haque et al., "Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions", ICCV 2023
[2] Han et al., "Super-NeRF: View-Consistent Detail Generation for NeRF Super-Resolution", arXiv 2023
[3] Pre-hoc Image: our customized baseline, identical to SuperGaussian except that it uses a SOTA image upsampler instead of the video upsampler
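The perceptual metrics reported above can be reproduced with standard packages. Below is a minimal sketch, assuming the rendered frames of a method and the real high-resolution frames are available as uint8 tensors of shape (N, 3, H, W); it uses the lpips package together with torchmetrics for FID and Inception Score, while NIQE is omitted since it requires a dedicated no-reference IQA toolbox. This illustrates the metric setup only and is not the authors' evaluation code.

import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def perceptual_scores(pred_u8, ref_u8):
    """pred_u8, ref_u8: uint8 tensors of shape (N, 3, H, W) with values in [0, 255]."""
    # LPIPS expects float images scaled to [-1, 1].
    lpips_fn = lpips.LPIPS(net='alex')
    pred_f = pred_u8.float() / 127.5 - 1.0
    ref_f = ref_u8.float() / 127.5 - 1.0
    lpips_val = lpips_fn(pred_f, ref_f).mean().item()

    # FID compares Inception statistics of predicted vs. reference frames.
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(ref_u8, real=True)
    fid.update(pred_u8, real=False)
    fid_val = fid.compute().item()

    # Inception Score is computed from the predicted frames alone.
    inception = InceptionScore()
    inception.update(pred_u8)
    is_mean, is_std = inception.compute()

    return {"LPIPS": lpips_val, "FID": fid_val, "IS": (is_mean.item(), is_std.item())}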

Comparison on Blender-Synthetic to upsample low-res RGB

Here, we compare ×4 upsampling from 200×200 to 800×800 px against baselines on the official test set, using the metrics reported in prior work. Our method produces on-par quantitative results; moreover, our results contain more generated detail, which the reference-based metrics do not capture. For a fair comparison against these baselines, we use a Neural Radiance Field, i.e., TensoRF, as our 3D representation. Baseline results are taken directly from the respective papers.

Method                  LPIPS ↓   PSNR ↑   SSIM ↑
FastSR-NeRF [4]         0.075     30.47    0.944
NeRF-SR [5]             0.076     28.46    0.921
SuperGaussian (ours)    0.067     28.44    0.923

[4] Lin et al., "FastSR-NeRF: Improving NeRF Efficiency on Consumer Devices with a Simple Super-Resolution Pipeline", WACV 2024
[5] Wang et al., "NeRF-SR: High-Quality Neural Radiance Fields using Supersampling", MM 2022
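For the reference-based metrics in this table, PSNR and SSIM can be computed per test view with scikit-image and averaged over the official Blender test split. A minimal sketch, assuming pred and gt are float RGB arrays in [0, 1] (illustrative only, not the exact evaluation script):

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def psnr_ssim(pred, gt):
    """pred, gt: float arrays of shape (H, W, 3) with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
    return psnr, ssim

def evaluate_scene(pred_frames, gt_frames):
    """Average PSNR/SSIM over all test views of one scene."""
    scores = [psnr_ssim(p, g) for p, g in zip(pred_frames, gt_frames)]
    psnrs, ssims = zip(*scores)
    return float(np.mean(psnrs)), float(np.mean(ssims))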

SuperGaussian Results on MVImgNet

The videos below show the 16× upsampling results of our method (right) when applied to the low-resolution inputs (left). The videos are synchronized so that the same frame is shown at the same time. We provide an interactive zoom lens to better appreciate the differences between the low-res input and our output. The zoomed-in views are enlarged by a factor of 2.5, showing the output at approximately the native video resolution (1024×1024 px).

low-res (64×64)
2.5×
3D Upsampled (1024×1024)
Baseline Comparison on MVImgNet

Baseline comparison on 4× upsampling of low-resolution Gaussian Splats against several related methods.

Low Resolution

Instruct-NeRF2NeRF

Super-NeRF

Pre-hoc Image

Ours

SuperGaussian Results on Blender-Synthetic

We follow the same experimental protocol as prior work, i.e., NeRF-SR and FastSR-NeRF, and report performance for 4× upsampling in 3D. For a fair comparison with other NeRF-based baselines, we reconstruct the upsampled video with TensoRF, a NeRF-style representation, after video upsampling.

low-res (200×200)
2.5×
3D Upsampled (800×800)
Superresolving Text-to-3D

Our method can also improve the output of state-of-the-art text-to-3D models. Here, we show results where we upsample objects generated by Instant3D.

2.5×
Multi-scale Upsampling

We show a comparison of multi-scale upsampling by running our method iteratively (4× and 16×). A minimal sketch of this iterative scheme follows the examples below.

low-res (64×64)
2.5×

low-res

16×
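Below is a minimal sketch of the iterative scheme, reusing the hypothetical supergaussian_upsample pipeline from the overview above: each pass renders the current 3D model, upsamples the rendered video by 4×, and re-fits Gaussian Splats, so two passes yield the 16× result.

def multi_scale_upsample(scene, renderer, video_upsampler_4x, splat_optimizer, passes=2):
    """Run the (hypothetical) 4x pipeline repeatedly: one pass gives 4x, two passes give 16x."""
    for _ in range(passes):
        scene = supergaussian_upsample(scene, renderer, video_upsampler_4x, splat_optimizer)
    return scene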