GenLIP: Generative Language-Image Pre-training

Let ViT Speak

A minimalist generative pretraining framework for scalable vision encoders in multimodal large language models.

1 Beijing Jiaotong University 2 ByteDance 3 Nanyang Technological University

*Equal contribution   Corresponding authors

Figure: GenLIP training and deployment pipeline. Left, pretrain by speaking: a single transformer with a single autoregressive objective is pretrained on image-caption data. Right, deploy as vision encoder: the pretrained GenLIP ViT, followed by a projector, serves as a competitive vision encoder in an MLLM.

GenLIP builds an efficient, language-aligned vision encoder with a single transformer and a single autoregressive objective.

Single Transformer · Single NTP Loss · No Text Transformer · 8B Pretraining Samples · Excellent Scalability

Key results

73.6 ALL AVG with GenLIP-g/16 and Qwen2.5-7B under frozen visual representation
+4.7 ALL AVG over SigLIP2-g/16 in the same setting
8B pretraining samples, compared with the 40B-pair SigLIP2 baselines
Doc/OCR: largest gains on detail-sensitive document and text-centric benchmarks

Abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a simplified generative pretraining approach for Vision Transformers tailored to multimodal large language models. To better align ViTs with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder.

GenLIP offers three key advantages: simplicity, scalability, and performance. Trained on a total of 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines such as SigLIP2. With continued pretraining on multi-resolution images at native aspect ratios, GenLIP further excels at detail-sensitive tasks such as OCR, chart understanding, and visual question answering.

Simplicity

A single Transformer jointly models visual and linguistic tokens with one language modeling objective.

Scalability

Performance improves with both data and model size, supported by gated attention for stable pretraining.

Performance

GenLIP delivers strong MLLM vision encoder performance, especially on document and OCR-heavy tasks.

Method

Minimal Generative Pretraining For Vision Encoders

GenLIP uses language generation as the pretraining signal, then deploys the trained Transformer as a visual feature extractor for MLLMs.

Unified Transformer

Image patches and text tokens are concatenated into one sequence and modeled by one Transformer.
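
As a rough PyTorch sketch of this sequence construction (the module names and dimensions below are illustrative, not GenLIP's released configuration):

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the paper's actual configuration.
D, PATCH, VOCAB = 1024, 16, 32000

patch_embed = nn.Conv2d(3, D, kernel_size=PATCH, stride=PATCH)  # ViT patchifier
text_embed = nn.Embedding(VOCAB, D)                             # token embedding table

def build_sequence(images, text_ids):
    """Concatenate image patches and caption tokens into one joint sequence.

    images:   (B, 3, H, W) pixels; text_ids: (B, T) caption token ids.
    Returns the (B, N + T, D) sequence and N, the visual prefix length.
    """
    vis = patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
    txt = text_embed(text_ids)                            # (B, T, D) text tokens
    return torch.cat([vis, txt], dim=1), vis.size(1)
```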

Prefix-LM Objective

Visual tokens attend bidirectionally, text tokens attend causally, and loss is applied only to text tokens.
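
A minimal sketch of that attention pattern and the text-only loss, continuing the illustrative names above; the indexing assumes the last visual token predicts the first caption token:

```python
import torch
import torch.nn.functional as F

def prefix_lm_mask(n_vis, n_txt, device=None):
    """Boolean attention mask (True = may attend). Visual tokens attend
    bidirectionally within the prefix; text tokens attend causally over
    everything before them, the full visual prefix included."""
    n = n_vis + n_txt
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool, device=device))
    mask[:n_vis, :n_vis] = True  # bidirectional attention inside the visual prefix
    return mask

def ntp_loss(logits, text_ids, n_vis):
    """Next-token prediction loss on text positions only.

    logits: (B, N + T, V) joint-transformer outputs; text_ids: (B, T).
    Position n_vis + t - 1 predicts caption token t.
    """
    T = text_ids.size(1)
    pred = logits[:, n_vis - 1 : n_vis - 1 + T, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1))
```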

Gated Attention

A lightweight gate regulates attention outputs, reducing attention sink and stabilizing visual representation learning.
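
One common way to realize such a gate is a sigmoid output gate on self-attention; the sketch below illustrates that general idea and should not be read as GenLIP's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Self-attention with a sigmoid output gate. The gate is computed from
    the same hidden states and multiplies the attention output elementwise,
    letting the model suppress positions that would otherwise act as
    attention sinks."""

    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)   # lightweight per-channel gate
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, attn_mask=None):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        out = out.transpose(1, 2).reshape(B, N, D)
        out = torch.sigmoid(self.gate(x)) * out   # gate regulates attention output
        return self.proj(out)
```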

Figure: GenLIP architecture with image-text sequence modeling, gated attention, and Prefix-LM attention. Training uses the LM head for next-token prediction; deployment keeps the Transformer's visual features and discards language-only modules.
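
A deployment-time sketch under hypothetical attribute names (`genlip.patch_embed`, `genlip.blocks`); the LM head and text embeddings are simply never invoked:

```python
import torch

@torch.no_grad()
def as_vision_encoder(genlip, images):
    """Run only the visual tokens through the pretrained transformer and
    return patch features for the MLLM projector. Since visual tokens
    attend bidirectionally, no causal mask is needed here."""
    x = genlip.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
    for blk in genlip.blocks:
        x = blk(x)
    return x  # patch features handed to the MLLM projector
```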

Results

Strong Vision Encoder Performance With Less Pretraining Data

GenLIP consistently improves frozen visual representation evaluation, with its clearest gains on Doc/OCR tasks.

Qwen2.5-1.5B Frozen Visual Representation

Full table in paper
Qwen2.5-1.5B frozen visual representation results across Doc/OCR, VQA, caption, and overall averages.
| Model | Arch | Data | Doc/OCR Avg (of 7) | MME-P | Nocaps | ALL AVG |
|---|---|---|---|---|---|---|
| OpenVision2 | L/16 | 12.8B | 44.3 | 1230 | 84.3 | 58.7 |
| SigLIP | L/16 | 40.0B | 42.4 | 1203 | 84.0 | 56.9 |
| SigLIP2 | L/16 | 40.0B | 45.0 | 1165 | 82.9 | 58.7 |
| GenLIP | L/16 | 8.0B | 49.3 | 1258 | 82.6 | 61.5 |
| SigLIP2 | So/16 | 40.0B | 46.8 | 1220 | 84.3 | 60.6 |
| GenLIP | So/16 | 8.0B | 50.1 | 1215 | 87.5 | 62.6 |
| SigLIP2 | g/16 | 40.0B | 47.3 | 1284 | 84.4 | 61.5 |
| GenLIP | g/16 | 8.0B | 53.2 | 1256 | 88.3 | 65.2 |

Qwen2.5-7B Frozen Visual Representation

Full table in paper
Qwen2.5-7B frozen visual representation results across Doc/OCR, VQA, caption, and overall averages.
| Model | Arch | Data | Doc/OCR Avg (of 7) | MME-P | Textcaps | ALL AVG |
|---|---|---|---|---|---|---|
| SigLIP2 | So/16 | 40.0B | 56.7 | 1422 | 139.3 | 69.4 |
| GenLIP | So/16 | 8.0B | 62.0 | 1424 | 142.1 | 71.8 |
| SigLIP2 | g/16 | 40.0B | 56.6 | 1422 | 142.7 | 68.9 |
| GenLIP | g/16 | 8.0B | 63.5 | 1483 | 144.8 | 73.6 |
Figure: data scaling curves for OCR, VQA, and caption tasks from 1B to 8B pretraining samples. Scaling shows sustained gains, with gated attention improving stability across scales.

Scaling signal. As pretraining grows from 1B to 8B samples, GenLIP keeps improving on OCR, VQA, and caption probes, indicating a genuine vision-encoder scaling trend rather than a one-off benchmark gain.

Figure: Stage 1 versus Stage 2 validation curves across evaluation resolutions. Native-aspect adaptation improves OCR, VQA, and caption performance at higher evaluation resolutions.

Resolution signal. Continued pretraining with native aspect ratios strengthens detail-sensitive recognition, which is where an MLLM vision encoder most needs reliable spatial evidence.
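
A sketch of one plausible native-aspect-ratio preprocessing step; the `max_tokens` budget and the rounding policy are assumptions for illustration, not the paper's recipe:

```python
import math
from PIL import Image

def native_aspect_resize(img, patch=16, max_tokens=1024):
    """Resize while preserving aspect ratio so that both sides are multiples
    of the patch size and the patch count stays roughly within a budget."""
    w, h = img.size
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (w * h)))  # never upsample
    nw = max(patch, round(w * scale / patch) * patch)
    nh = max(patch, round(h * scale / patch) * patch)
    return img.resize((nw, nh), Image.Resampling.BICUBIC)
```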

Representation Probes

What does "speak" reveal?

Generation and patch semantics readout are used here to inspect the learned visual-language representation, not to redefine GenLIP as a captioning model.

Figure: GenLIP caption generation examples comparing model scales and pretraining stages. Direct caption generation shows that the pretrained ViT can produce grounded descriptions from visual tokens.

Generation probe. Captioning is used here as an inspection tool for visual-language alignment, while deployment still uses GenLIP as the MLLM vision encoder.
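
A sketch of such a probe as greedy decoding from the visual prefix; `model.forward_logits` is a hypothetical call returning next-token logits over the joint sequence:

```python
import torch

@torch.no_grad()
def generation_probe(model, images, bos_id, eos_id, max_len=64):
    """Greedily decode a caption straight from the pretrained ViT,
    purely as an inspection tool for visual-language alignment."""
    ids = torch.full((images.size(0), 1), bos_id,
                     dtype=torch.long, device=images.device)
    for _ in range(max_len):
        logits = model.forward_logits(images, ids)     # (B, n_vis + t, V)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if (next_id == eos_id).all():                  # every sample finished
            break
    return ids
```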

Figure: patch-semantics readout mapping local visual regions to language tokens. Patch semantics reveal local visual embeddings aligned with language concepts.

Patch semantics. Local readouts make the learned representation more legible by showing which visual regions align with recognizable language concepts.
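
One plausible way to implement such a readout is to push each patch feature through the shared LM head and keep the top tokens; `model.encode_image` and `model.lm_head` below are hypothetical names:

```python
import torch

@torch.no_grad()
def patch_semantics(model, images, tokenizer, topk=3):
    """Map every visual patch to its nearest language tokens via the LM head."""
    feats = model.encode_image(images)        # (B, N, D) patch features
    logits = model.lm_head(feats)             # (B, N, V) per-patch token logits
    top = logits.topk(topk, dim=-1).indices   # (B, N, topk) token ids
    # Decode the top tokens per patch for the first image in the batch.
    return [[tokenizer.decode([t]) for t in patch] for patch in top[0].tolist()]
```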

Figure: OCR-heavy GenLIP generation probes on receipt, geometry, and tiny-text examples. These probes explain the strong Doc/OCR benchmark behavior and expose remaining failure modes.

Doc/OCR behavior. The qualitative probes connect the benchmark gains to visible text-centric evidence, while keeping failure modes inspectable instead of hiding them behind aggregate scores.

Demo

Explore GenLIP Through Generation Probes

Explore the pretrained encoder's visual-language alignment through generation probes. The deployed model is intended as an MLLM vision encoder.

Probe modes: Natural image · Document/OCR · Global caption · Local semantics

Release

Paper, Code, Models, And Citation

All release resources (paper, code, and models) are collected here for quick, repeated access.

GenLIP-L/16

300M parameters - 24 layers - recommended for efficient evaluation.


GenLIP-So/16

400M parameters - 27 layers - balanced scale for MLLM experiments.


GenLIP-g/16

1.1B parameters - 40 layers - strongest benchmark and Doc/OCR performance.


BibTeX

@article{fang2026letvitspeakgenerative,
  title={Let ViT Speak: Generative Language-Image Pre-training}, 
  author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal={arXiv preprint arXiv:2605.00809},
  year={2026}
}