Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation.

FQGAN Visual Tokenizer

We present a novel VQ-based visual tokenizer that achieves state-of-the-art performance on discrete image reconstruction, surpassing existing VQ-based and LFQ-based visual tokenizers.

Method

Overview of our method. The left part shows FQGAN-Dual, the factorized tokenizer design in an example scenario where a large codebook is factorized into two sub-codebooks. The framework extends to more than two sub-codebooks. The right part demonstrates how we add an AR head on top of a standard AR generative transformer to accommodate the factorized sub-codes.
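The factorization described above can be illustrated with a minimal NumPy sketch. It is not the FQGAN implementation: the function names, the toy sizes, and the assumption that each sub-codebook receives its own feature branch are all hypothetical, and `redundancy_penalty` is only one plausible form of a disentanglement term (a mean squared dot product between paired codes), which the text above does not specify.

```python
import numpy as np


def quantize(features, codebook):
    """Nearest-neighbour lookup. features: (N, D), codebook: (K, D)."""
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


def factorized_quantize(branch_features, codebooks):
    """Quantize each branch with its own sub-codebook, then concatenate.

    Assumption: one feature branch per sub-codebook (hypothetical layout).
    """
    indices, codes = [], []
    for feats, cb in zip(branch_features, codebooks):
        idx, q = quantize(feats, cb)
        indices.append(idx)
        codes.append(q)
    # One index per sub-codebook for every token; codes joined channel-wise.
    return np.stack(indices, axis=1), np.concatenate(codes, axis=1)


def redundancy_penalty(codes_a, codes_b):
    """Hypothetical regularizer: mean squared dot product of paired codes.

    It is zero when the codes chosen from the two sub-codebooks are orthogonal.
    """
    return float(np.mean(np.sum(codes_a * codes_b, axis=1) ** 2))


rng = np.random.default_rng(0)
K, D, N = 16, 8, 5  # toy sizes: codes per sub-codebook, code dim, tokens
codebooks = [rng.normal(size=(K, D)) for _ in range(2)]
branch_features = [rng.normal(size=(N, D)) for _ in range(2)]

indices, quantized = factorized_quantize(branch_features, codebooks)

# Two lookups over K codes each, instead of one lookup over K * K codes.
comparisons_per_token = sum(len(cb) for cb in codebooks)
effective_vocab = int(np.prod([len(cb) for cb in codebooks]))
```

With these toy sizes, each token is compared against 32 codes in total while the pair of indices can take 256 distinct values, which is the sense in which factorization reduces lookup cost relative to a single codebook of the same effective size.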

Image Reconstruction Results with Different Sub-codebooks

T-SNE Visualization of the Distribution of the Sub-codebooks

Image Generation Results with the FQGAN Tokenizer

Comparison with Tokenizers

Comparison with Generation Models