UniAR: Unified Multimodal Autoregressive Modeling with Shared Context

UniAR is a unified autoregressive multimodal model that handles image understanding, image generation, and image editing in a single Transformer. Prior unified models that rely on two separate visual tokenizers split the representation space. UniAR instead uses a single discrete visual tokenizer as the bridge between understanding and generation. The resulting shared context lets the model interpret its own generated visual tokens directly, without re-encoding them.

Key design choices:

Multi-level BSQ tokenizer. Fuses shallow (low-level detail) and deep (high-level semantic) visual features via lookup-free Binary Spherical Quantization, scaling the effective vocabulary to 2⁶⁴ codes with minimal overhead.
Parallel bitwise prediction. Jointly predicts spatially grouped, multi-level visual codes per AR step, achieving a 32× visual compression ratio (a 1024×1024 image needs only 256 AR tokens).
DiT-based visual decoder. An SD3-medium transformer with semantic visual feature injection that reconstructs high-fidelity images from discrete visual tokens, with resolution upsampling support.
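
The lookup-free quantization step named above can be illustrated with a minimal pure-Python sketch of Binary Spherical Quantization: project a feature vector onto the unit sphere, keep only the sign of each coordinate, and rescale so the result is again unit-norm. This is not UniAR's implementation. The names `bsq_quantize`, `bits_to_index`, `index_to_bits`, and `NUM_BITS` are hypothetical, and the sketch treats the 64 bits as one vector because the split of bits between shallow and deep feature levels is not given here.

```python
import math
import random

NUM_BITS = 64  # assumption: 64 bits in total, matching the stated 2**64 codes


def bsq_quantize(features):
    """Quantize one feature vector; returns (bits, unit-norm quantized vector)."""
    dim = len(features)
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0.0:
        raise ValueError("cannot project a zero vector onto the unit sphere")
    # Step 1: project onto the unit hypersphere.
    unit = [x / norm for x in features]
    # Step 2: binarize each coordinate by sign; no codebook lookup is needed.
    bits = [1 if u > 0 else 0 for u in unit]
    # Step 3: the code is a hypercube corner rescaled to unit norm.
    scale = 1.0 / math.sqrt(dim)
    quantized = [scale if b else -scale for b in bits]
    return bits, quantized


def bits_to_index(bits):
    """Pack bits (most significant first) into one integer code index."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_bits(index, num_bits):
    """Inverse of bits_to_index."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


rng = random.Random(0)  # stand-in for encoder features
features = [rng.gauss(0.0, 1.0) for _ in range(NUM_BITS)]
bits, quantized = bsq_quantize(features)
index = bits_to_index(bits)
print(f"code index {index} out of {2 ** NUM_BITS} possible codes")
```

Because every code is determined by its bits, the vocabulary grows as 2 to the number of bits without storing a codebook, and a model can predict the bits individually instead of scoring every code in the vocabulary.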

Highlights:

True unification via shared context. A single visual tokenizer is used for both understanding and generation, allowing UniAR to interpret its own generated visual tokens directly.
Bitwise visual tokenization at scale. Lookup-free BSQ expands the effective visual vocabulary with low overhead while preserving semantic alignment.
Multi-level visual features. Hierarchical feature fusion retains both high-level semantics and fine-grained details, which is especially important for text rendering and editing.
Fast autoregressive generation. Parallel bitwise prediction and a diffusion-based visual decoder with resolution upsampling reduce sequence length and accelerate image synthesis.
Strong multimodal performance. UniAR achieves state-of-the-art or highly competitive results on image generation, image editing, OCR-heavy understanding, and long-text rendering.
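
The sequence-length claim can be checked with simple arithmetic. The sketch below assumes a square image and a compression ratio measured per side. Both are assumptions, and the helper names `token_grid` and `implied_per_side_ratio` are hypothetical. The text above does not define how the 32× ratio is measured, or how spatial grouping and the decoder's resolution upsampling enter the count, so the code only shows what each reading implies.

```python
import math


def token_grid(image_side, per_side_ratio):
    """Return (tokens per side, total positions) for a square image."""
    if image_side % per_side_ratio != 0:
        raise ValueError("image side must be divisible by the ratio")
    side = image_side // per_side_ratio
    return side, side * side


def implied_per_side_ratio(image_side, num_tokens):
    """Per-side pixel footprint of one token, given a square token grid."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens or image_side % side != 0:
        raise ValueError("token count must form a square grid dividing the side")
    return image_side // side


# A per-side ratio of 32 applied directly to a 1024-pixel side.
print(token_grid(1024, 32))               # (32, 1024)
# The same ratio applied to a 512-pixel side.
print(token_grid(512, 32))                # (16, 256)
# Footprint per token if 256 tokens cover a 1024-pixel side.
print(implied_per_side_ratio(1024, 256))  # 64
```

Under this per-side reading, 256 AR tokens form a 16×16 grid. That grid matches a 32× ratio at a 512-pixel side, or a 64-pixel footprint per token at a 1024-pixel side.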

BibTeX

@article{peng2026unified,
  title={Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  journal={arXiv preprint arXiv:2606.18249},
  year={2026}
}
