Visual Tokenizer is Key to Unification
UniAR is a unified multimodal autoregressive framework that handles visual understanding, image generation, and image editing with a single discrete visual tokenizer. Unlike prior unified models that rely on separate tokenizers for perception and generation, UniAR places both in one shared token space, so the model can understand its own generated visual tokens directly within the same context, without an extra visual re-encoding step. Built on lookup-free bitwise quantization, multi-level visual feature modeling, and parallel bitwise prediction, UniAR achieves strong performance across generation, editing, and understanding benchmarks, and it exhibits an emergent interleaved generation-understanding capability.
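To make "lookup-free bitwise quantization" concrete, here is a minimal sketch of the generic sign-based formulation, in which each latent channel is quantized to one bit and no learned codebook is searched. It is an illustration under stated assumptions, not the UniAR tokenizer itself: the function names, the 4-channel latent, and the bit-packing order are all hypothetical.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], List[float]]:
    """Quantize each channel independently to one bit using its sign.

    Assumption: generic lookup-free quantization, not UniAR's exact design.
    Returns the bits (0/1) and the quantized codes (-1.0/+1.0).
    """
    bits = [1 if v > 0 else 0 for v in latent]
    codes = [1.0 if b else -1.0 for b in bits]
    return bits, codes


def bits_to_index(bits: List[int]) -> int:
    """Pack bits into an integer token id (first channel = least significant bit)."""
    return sum(b << i for i, b in enumerate(bits))


def index_to_bits(index: int, num_bits: int) -> List[int]:
    """Unpack an integer token id back into its per-channel bits."""
    return [(index >> i) & 1 for i in range(num_bits)]


# Hypothetical 4-channel latent vector for one visual token.
latent = [0.7, -1.2, 0.1, -0.4]
bits, codes = lfq_quantize(latent)
token_id = bits_to_index(bits)
```

With K channels the implicit vocabulary has 2^K entries, yet a token is fully described by its K bits. That is what makes bitwise prediction attractive in principle: a model can emit K binary outputs in parallel rather than one softmax over 2^K classes.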
@inproceedings{peng2026uniar,
title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
booktitle={ICML},
year={2026}
}