AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
Xuying Zhang1*    Yupeng Zhou1*    Kai Wang1    Yikai Wang2    Zhen Li1    Shaohui Jiao3   
Daquan Zhou3    Qibin Hou1✉    Ming-Ming Cheng1   
1Nankai University    2Tsinghua University     3ByteDance Inc.   

🔥 3D Asset Gallery of Our AR-1-to-3 🔥

[Figure: 3D asset gallery generated by AR-1-to-3]


Abstract

Novel view synthesis (NVS) is a cornerstone of image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input view, especially under large camera pose differences, leading to poor-quality 3D geometries and textures. We attribute this issue to their treating all target views with equal priority, based on our empirical observation that target views closer to the input view exhibit higher fidelity. Motivated by this, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input view and then uses them as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for next-view prediction, we develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input view, as well as among the novel views, producing high-fidelity 3D assets.

Method

Overview of our AR-1-to-3 framework. The left side shows the AR-1-to-3 workflow; the right side illustrates the denoising process for the target views. Starting from the input single-view image, our method employs a diffusion model to generate all target views incrementally from near to far, with the views generated in previous steps serving as contextual information about the object itself. To achieve this, the Stacked-LE and LSTM-GE strategies encode the local and global features of the partial view sequence as the image conditions of the denoising UNet for the view prediction at the current step.
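The near-to-far autoregressive loop above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the denoiser and the two encoders are stood in for by toy scalar functions, and names such as `predict_next_view`, `stacked_le`, and `lstm_ge` are hypothetical. It only shows the control flow: sort target poses by distance to the input view, then generate each view conditioned on all previously generated ones.

```python
# Toy sketch of AR-1-to-3's next-view prediction loop.
# Assumptions: views are scalars, poses are azimuth angles in degrees,
# and all component functions below are hypothetical stand-ins.

def pose_distance(az_a, az_b):
    """Angular distance between two camera azimuths (degrees)."""
    d = abs(az_a - az_b) % 360
    return min(d, 360 - d)

def stacked_le(views):
    """Stand-in for Stacked-LE: keep per-view local features stacked."""
    return list(views)

def lstm_ge(views):
    """Stand-in for LSTM-GE: recurrently fold the sequence into one code."""
    g = 0.0
    for v in views:
        g = 0.5 * g + 0.5 * v  # recurrent aggregation, LSTM-like in spirit
    return g

def predict_next_view(local_feats, global_feat, target_az):
    """Stand-in 'denoiser' conditioned on local/global features."""
    return 0.5 * (sum(local_feats) / len(local_feats) + global_feat)

def ar_1_to_3(input_view, target_azimuths, input_az=0):
    """Generate target views from near to far, autoregressively."""
    order = sorted(target_azimuths, key=lambda a: pose_distance(input_az, a))
    generated = [input_view]       # the input view initializes the sequence
    outputs = {}
    for az in order:               # nearest views first, farther ones later
        local = stacked_le(generated)   # local conditions from existing views
        glob = lstm_ge(generated)       # global condition from the sequence
        view = predict_next_view(local, glob, az)
        outputs[az] = view
        generated.append(view)     # feed back as context for the next step
    return outputs

views = ar_1_to_3(input_view=1.0, target_azimuths=[180, 30, 90])
print(list(views))  # generation order: [30, 90, 180]
```

The key design choice mirrored here is that each step conditions on the full partial sequence, so errors at far poses are anchored by the higher-fidelity near views generated first.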

Novel View Synthesis

Image-to-3D

Citation

@article{zhang2025AR123,
    title={AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction},
    author={Zhang, Xuying and Zhou, Yupeng and Wang, Kai and Wang, Yikai and Li, Zhen and Shao, Xiuli and Zhou, Daquan and Hou, Qibin and Cheng, Ming-Ming},
    journal={arXiv preprint arXiv:2503.12929},
    year={2025}
}

Contact

Feel free to contact us at zhangxuying1004@gmail.com or ypzhousdu@gmail.com.