UniAR: Unified Multimodal Autoregressive Modeling with Shared Context

Visual Tokenizer is Key to Unification


Wujian Peng1,2,*    Lingchen Meng3,*,‡    Yuxuan Cai3    Xianwei Zhuang3    Yuhuan Yang3    Rongyao Fang3    Chenfei Wu3    Junyang Lin3    Zuxuan Wu1,2,†    Shuai Bai3,†
* Equal contribution     † Corresponding authors     ‡ Project lead
1 Institute of Trustworthy Embodied AI, Fudan University     2 Shanghai Innovation Institute
3 Qwen Team, Alibaba Group

[Figure: UniAR teaser]

UniAR is a unified multimodal autoregressive framework that bridges visual understanding, image generation, and image editing with a single discrete visual tokenizer. Unlike prior unified models that rely on separate tokenizers for perception and generation, UniAR places perceived and generated images in one shared token space, so the model can directly understand its own generated visual tokens within the same context, without an extra visual re-encoding step. Built on lookup-free bitwise quantization, multi-level visual feature modeling, and parallel bitwise prediction, UniAR achieves strong performance across generation, editing, and understanding benchmarks while exposing an emergent interleaved generation-understanding capability.

Highlights

  • True unification via shared context. A single visual tokenizer is used for both understanding and generation, allowing UniAR to interpret its own generated visual tokens directly.
  • Bitwise visual tokenization at scale. Lookup-free binary spherical quantization (BSQ) expands the effective visual vocabulary with low overhead while preserving semantic alignment.
  • Multi-level visual features. Hierarchical feature fusion retains both high-level semantics and fine-grained details, which is especially important for text rendering and editing.
  • Fast autoregressive generation. Parallel bitwise prediction and a diffusion-based visual decoder with resolution upsampling reduce sequence length and accelerate image synthesis.
  • Strong multimodal performance. UniAR achieves state-of-the-art or highly competitive results on image generation, image editing, OCR-heavy understanding, and long-text rendering.
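
As a rough illustration of the lookup-free bitwise quantization named in the highlights, the sketch below follows the generic binary spherical quantization recipe: normalize a feature vector, keep the sign of each dimension as one bit, and rescale so the quantized point stays on the unit sphere. It is not UniAR's actual tokenizer, whose feature dimension and training details are not given on this page. The function names `bsq_quantize` and `bits_to_index` and the 4-dimensional input are assumptions made for the example.

```python
import math


def bsq_quantize(v):
    """Generic binary spherical quantization of one feature vector (sketch)."""
    d = len(v)
    norm = math.sqrt(sum(x * x for x in v)) or 1.0   # guard against an all-zero vector
    u = [x / norm for x in v]                        # project onto the unit sphere
    bits = [1 if x > 0 else 0 for x in u]            # one bit per dimension, no codebook lookup
    # Quantized point: every coordinate is +/- 1/sqrt(d), so it also has unit norm.
    code = [(1.0 if b else -1.0) / math.sqrt(d) for b in bits]
    return bits, code


def bits_to_index(bits):
    """Pack the bits into the implicit token id of a 2**len(bits) vocabulary."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


bits, code = bsq_quantize([0.5, -2.0, 0.1, -0.3])   # hypothetical 4-d feature
index = bits_to_index(bits)                          # bits [1, 0, 1, 0] -> id 10 of 16
```

With d bits per token the implicit vocabulary has 2**d entries, yet no table of 2**d code vectors is stored or searched, which is one way to read "expands the effective visual vocabulary with low overhead".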

Method Overview

[Figure: UniAR architecture]
UniAR consists of three core components: a unified visual tokenizer that discretizes semantic visual features into shared bitwise tokens, a unified autoregressive backbone that jointly models text and visual tokens, and a diffusion-based visual decoder that reconstructs high-fidelity images from predicted visual tokens.
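
To make the interface between the backbone and the bitwise tokens more concrete, here is a minimal sketch of how one visual token could be emitted in a single step from per-bit logits, in the spirit of the parallel bitwise prediction listed in the highlights. It assumes the bits of a token are scored independently given the context, which this page does not spell out. `predict_bits`, `token_log_prob`, and the example logits are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def predict_bits(bit_logits):
    """Greedy decode: decide every bit of one visual token from one step's logits."""
    return [1 if z > 0 else 0 for z in bit_logits]


def token_log_prob(bit_logits, bits):
    """Log-probability of a bit pattern under independent per-bit Bernoulli heads."""
    return sum(log_sigmoid(z if b else -z) for z, b in zip(bit_logits, bits))


logits = [2.0, -1.0, 0.5]          # hypothetical output for a 3-bit token
bits = predict_bits(logits)        # all bits are chosen together, not one per step
score = token_log_prob(logits, bits)
```

Under this assumption the output layer needs d logits per token rather than a softmax over 2**d classes, and a token costs one decoding step rather than d.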

Interleaved Generation and Understanding

[Figure: UniAR interleaved generation-understanding example]
UniAR unifies generation and understanding in the same discrete visual space. This enables an emergent interleaved capability: after generating an image, the model can answer follow-up questions about its own output without re-encoding the image.
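
The shared-context idea can be pictured with a toy token stream. This is only a sketch under stated assumptions: `SharedContext`, its methods, the whitespace "tokenizer", and the visual token ids are all made up for illustration and do not describe UniAR's real data format.

```python
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """One token stream holding both text tokens and discrete visual tokens."""
    tokens: list = field(default_factory=list)

    def append_text(self, text):
        # Whitespace splitting stands in for a real text tokenizer.
        self.tokens.extend(("text", word) for word in text.split())

    def append_image(self, visual_token_ids):
        # Generated visual tokens are appended as they are. Nothing is decoded
        # to pixels and re-encoded before the model reads them again.
        self.tokens.extend(("image", t) for t in visual_token_ids)


ctx = SharedContext()
ctx.append_text("draw a red cube")
ctx.append_image([10, 3, 7, 12])            # stand-in for tokens the model generated
ctx.append_text("what color is the cube ?")

# The follow-up question sits in the same sequence as the generated image tokens.
image_positions = [i for i, (kind, _) in enumerate(ctx.tokens) if kind == "image"]
```

With separate tokenizers for generation and understanding, the generated image would first have to be rendered and passed through the understanding encoder before a follow-up question could attend to it. Here the question simply continues the same sequence.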

BibTeX

@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}