CaTok: Taming Mean Flows
for One-Dimensional Causal Image Tokenization

Yitong Chen¹,²,³, Zuxuan Wu¹,²,³,†, Xipeng Qiu¹,²,³, Yu-Gang Jiang¹,³,†

† Corresponding authors

¹ Institute of Trustworthy Embodied AI, Fudan University ² Shanghai Innovation Institute

³ Shanghai Key Laboratory of Multimodal Embodied AI

CaTok is a 1D causal image tokenizer with a MeanFlow decoder. It supports fast one-step sampling and strong multi-step reconstruction, and it captures different visual concepts across token intervals. Columns 2 and 3 show the one-step and multi-step results, respectively; columns 3–7 show a fine-to-coarse trend as the number of tokens is reduced; columns 7–10 show reconstructions from different token segments, revealing distinct visual concepts.

Method Overview

CaTok combines a causal ViT encoder with a MeanFlow decoder, selecting 1D tokens over time intervals to preserve causality.

Decoder Comparison

MeanFlow decoding maintains causality and balance, supporting efficient one-step sampling.
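
To make the one-step versus multi-step distinction concrete, the sketch below applies the generic MeanFlow update rule, z_r = z_t - (t - r) · u(z_t, r, t), where u is the average velocity over the interval [r, t]. It is a minimal illustration, not CaTok's decoder: the function names, the scalar "image", and the analytic `mean_velocity` (which stands in for a learned network and ignores token conditioning entirely) are all assumptions made for this example.

```python
def mean_velocity(z, r, t, target=2.0):
    """Hypothetical stand-in for a learned average-velocity network u(z, r, t).

    Assumes a linear path z_t = (1 - t) * x + t * noise and a data
    distribution concentrated on a single scalar `target`. In that toy
    case the average velocity over [r, t] is (z - target) / t, which
    happens not to depend on r.
    """
    return (z - target) / t


def sample(z1, steps=1, u=mean_velocity):
    """Move from t=1 (noise) to t=0 (data) in `steps` equal intervals.

    Each interval [r, t] uses the update z_r = z_t - (t - r) * u(z_t, r, t).
    With steps=1 this is a single jump over the whole interval [0, 1].
    """
    z = z1
    for k in range(steps):
        t = 1.0 - k / steps          # start of the current interval
        r = 1.0 - (k + 1) / steps    # end of the current interval
        z = z - (t - r) * u(z, r, t)
    return z


# One network evaluation versus four, starting from the same noise value.
one_step = sample(z1=-0.7, steps=1)
multi_step = sample(z1=-0.7, steps=4)
```

Because the toy average velocity is exact, both calls land on the same target value. With a learned network the two would generally differ, which is why one-step and multi-step results are reported separately.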

BibTeX

@inproceedings{catok2026,
  title={CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization},
  author={Chen, Yitong and Wu, Zuxuan and Qiu, Xipeng and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2026}
}

This website is adapted from Nerfies and MathVista, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
