CaTok: Taming Mean Flows
for One-Dimensional Causal Image Tokenization


Yitong Chen1,2,3     Zuxuan Wu1,2,3†     Xipeng Qiu1,2,3     Yu-Gang Jiang1,3†
† Corresponding authors
1 Institute of Trustworthy Embodied AI, Fudan University     2 Shanghai Innovation Institute
3 Shanghai Key Laboratory of Multimodal Embodied AI

[Figure: CaTok teaser]

CaTok is a 1D causal image tokenizer with a MeanFlow decoder that enables fast one-step sampling and strong multi-step reconstruction while capturing diverse visual concepts across token intervals. Columns 2 and 3 show the one-step and multi-step reconstructions, respectively; columns 3–7 show a fine-to-coarse trend as the number of tokens is reduced; columns 7–10 show reconstructions from different token segments, revealing distinct visual concepts.

Method Overview

[Figure: CaTok architecture]
CaTok combines a causal ViT encoder with a MeanFlow decoder, selecting 1D tokens over time intervals to preserve causality.

Decoder Comparison

[Figure: Decoder comparison]
MeanFlow decoding maintains causality and balance, supporting efficient one-step sampling.
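
To make the one-step versus multi-step distinction concrete, here is a minimal sketch of sampling with an average-velocity field in the MeanFlow style, where a state is moved from time t to an earlier time r via z_r = z_t - (t - r) * u(z_t, r, t). It is an illustration under stated assumptions, not the CaTok decoder: the names `meanflow_sample` and `make_toy_u` are hypothetical, the field is a closed-form toy for a single target point rather than a trained network, and the conditioning on 1D tokens is omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# u(z, r, t): average velocity between times r and t (with r < t), evaluated at state z.
AvgVelocity = Callable[[Sequence[float], float, float], Vector]


def meanflow_sample(u: AvgVelocity, z1: Sequence[float], num_steps: int = 1) -> Vector:
    """Move a state from t=1 (noise) to t=0 (data) in `num_steps` jumps.

    Each jump applies z_r = z_t - (t - r) * u(z_t, r, t).
    With num_steps=1 this is a single update from t=1 straight to t=0.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    z = list(z1)
    # Uniform time grid: 1 = t_0 > t_1 > ... > t_n = 0.
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, r in zip(times[:-1], times[1:]):
        velocity = u(z, r, t)
        z = [zi - (t - r) * vi for zi, vi in zip(z, velocity)]
    return z


def make_toy_u(target: Sequence[float]) -> AvgVelocity:
    """Toy stand-in for a learned decoder (assumption, for illustration only).

    For the linear path z_t = (1 - t) * x + t * noise with a single data
    point x = target, the average velocity is (z_t - target) / t for any r < t.
    """
    def u(z: Sequence[float], r: float, t: float) -> Vector:
        return [(zi - xi) / t for zi, xi in zip(z, target)]
    return u


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    noise = [1.3, 0.2, -0.7]
    u = make_toy_u(target)
    print(meanflow_sample(u, noise, num_steps=1))
    print(meanflow_sample(u, noise, num_steps=4))
```

In this toy setting the field is exact, so one step and several steps land on the same point. With a learned field the two generally differ, which is consistent with the page reporting one-step and multi-step reconstructions separately.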

BibTeX

@inproceedings{catok2026,
  title={CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization},
  author={Chen, Yitong and Wu, Zuxuan and Qiu, Xipeng and Jiang, Yu-Gang},
  booktitle={CVPR},
  year={2026}
}
