- Introduction
Recent advances in generative autoregressive modeling, most notably the appearance of Next-Scale Prediction (VAR), have spurred excitement and innovation in the generative arena. Autoregression has long dominated language modeling (most notably LLMs) and is a highly effective paradigm. Although there were early efforts to apply it to visual generation (e.g., VQGAN, Parti, DALL-E), the consensus eventually shifted to diffusion models, which are known to produce high-quality results. However, VAR's surpassing of the diffusion baseline DiT on the class-conditional ImageNet generation benchmark, while also exhibiting scaling laws, has revealed the potential of autoregressive schemes that combine desirable scaling properties with far greater efficiency than diffusion models (whose many sampling steps force trade-offs between speed and quality; current high-quality diffusion models are often very slow). We aim to carry this momentum into single-image novel view synthesis. The foundational work Zero 1-to-3 has already demonstrated the effectiveness of fine-tuned Stable Diffusion for this task; we will demonstrate that the next-scale autoregression paradigm is even more capable, and faster, in this direction.
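To make the efficiency contrast with diffusion concrete, below is a minimal sketch of next-scale autoregressive decoding in the spirit of the VAR paper: the model predicts an entire token map at each scale in a single forward pass, so generation takes only as many steps as there are scales (around ten) rather than hundreds of denoising steps. The `transformer`, `codebook`, and `decoder` interfaces here are hypothetical placeholders, not the actual VAR API.

```python
import torch
import torch.nn.functional as F

def next_scale_generate(transformer, codebook, decoder,
                        scales=(1, 2, 3, 4, 6, 8, 10, 13, 16),
                        cond=None):
    """Simplified next-scale decoding loop (hypothetical interfaces)."""
    tokens_per_scale = []
    f = None  # accumulated feature map, refined coarse-to-fine
    for s in scales:
        # One forward pass predicts the entire s x s token map at once,
        # conditioned on all previously generated (coarser) scales.
        logits = transformer(tokens_per_scale, cond=cond, scale=s)  # (s*s, V)
        idx = torch.multinomial(F.softmax(logits, dim=-1), 1).squeeze(-1)
        tokens_per_scale.append(idx.view(s, s))
        # Look up code vectors and upsample to the final resolution,
        # accumulating residuals across scales.
        z = codebook(idx).view(s, s, -1).permute(2, 0, 1)[None]  # (1, C, s, s)
        z = F.interpolate(z, size=(scales[-1], scales[-1]), mode="bicubic")
        f = z if f is None else f + z
    # Total cost: len(scales) forward passes, vs. hundreds of diffusion steps.
    return decoder(f)
```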
- Training Pipeline and Datasets
To train the image-variation stage, we plan to use a subset of LAION-Aesthetics v2 5+ (~200M images, resolution > 256) with resizing and random-cropping augmentation; for the 3D training stage, we will use the official renderings of Objaverse-XL (~10M objects × 12 views each, ~18 TB). As a first stage, we will train a state-of-the-art image-variation model built on the next-scale autoregressive structure. (There are unfortunately no established metrics for image-variation models, but where possible we will run a subjective user study against the current de facto state of the art, Stable Diffusion Image Variation v2 by Lambda Labs.) We will then exploit its visual knowledge by fine-tuning with added pose conditioning, producing the 3D-aware foundation model we seek; a sketch of the pose-conditioning scheme follows below.
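The sketch below illustrates one plausible form of the pose conditioning for the fine-tuning stage, following the relative-pose parameterization used by Zero 1-to-3 (relative elevation, azimuth, and radius, with the azimuth wrapped through sin/cos so the embedding is continuous at ±180°). The `PoseConditioner` module and the fusion with the image condition are our assumptions for illustration, not a finalized design.

```python
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    """Hypothetical pose-conditioning head for the fine-tuning stage.

    The relative camera pose is summarized as (d_elev, d_azim, d_radius),
    as in Zero 1-to-3; d_azim is given in radians.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, d_elev, d_azim, d_radius):
        # Wrap azimuth with sin/cos to avoid a discontinuity at +-180 degrees.
        pose = torch.stack(
            [d_elev, torch.sin(d_azim), torch.cos(d_azim), d_radius], dim=-1
        )
        return self.mlp(pose)  # (B, dim), added to the image-condition tokens

# Fine-tuning step (sketch; image_embed and the loss wiring are assumptions):
#   cond = image_embed(ref_view) + pose_conditioner(d_elev, d_azim, d_radius)
#   logits = transformer(target_token_prefix, cond=cond)
#   loss = F.cross_entropy(logits, target_tokens)
```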
- Applications
Our foundation model will empower many 3D-modeling-related tasks. Prior works such as ReconFusion have shown that generative priors can greatly increase the capability of 3D reconstruction models. Our model can also be used directly in a wide range of domains: entertainment and gaming (creating 3D assets), NFT creation, e-commerce (creating product models), architecture, healthcare (inferring organ structure), microbiology (our team has previously conducted research in this direction using Zero 1-to-3), manufacturing, cultural heritage preservation, and more.
- Current Progress
We have verified the feasibility of our idea with a small prototype (7× fewer parameters than planned), trained on 16 A100s for two days per stage, using LAION-Aesthetics v2 6+ and Objaverse as the datasets for the two stages, respectively. The prototype achieves good image-variation results. In experimental comparisons on Google Scanned Objects, our model exceeds all non-diffusion baselines by a large margin, suggesting it has gained a preliminary ability to understand depth. Notably, our model takes only 0.2 s to produce an output, versus 2 s for Zero 1-to-3, empirically demonstrating the efficiency of our autoregressive backbone.
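For context on how latency figures like 0.2 s vs. 2 s are typically obtained, the sketch below shows a standard GPU timing harness with warm-up and synchronization; the `model.generate(ref_image, pose)` interface is a hypothetical stand-in, not our actual measurement script.

```python
import time
import torch

@torch.no_grad()
def time_inference(model, ref_image, pose, n_warmup=3, n_runs=20):
    """Average single-image latency on a CUDA device: warm up first,
    then synchronize around the timed region so GPU work is counted."""
    for _ in range(n_warmup):
        model.generate(ref_image, pose)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(ref_image, pose)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```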
- Conclusion
We propose the first single-image-to-multiview model based on autoregression. This brings many advantages over conventional diffusion-based models (most representatively Zero 1-to-3). Preliminary results validate our idea and our pipeline implementation, and demonstrate our model's very fast inference. We look forward to scaling up our prototype into a groundbreaking new model.
Below are some results from the current image-variation model (~1B parameters, matching Stable Diffusion for fairness), compared with diffusion-based baselines. The image on the far left is the original; of the four images on the right, the first is ours and the other three are baselines.