LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

Abstract

Vision-language pretraining remains dominated by contrastive objectives, whereas vision-only self-supervised learning has largely adopted non-contrastive methods. At the same time, the role of vision-language encoders has shifted: they are increasingly deployed not as zero-shot classifiers but as the frozen visual backbone of vision-language models and dense prediction systems, which consume the full grid of patch tokens rather than a single pooled embedding. We introduce LeVLJEPA, the first fully non-contrastive end-to-end vision-language pretraining method. LeVLJEPA learns through cross-modal prediction with stop-gradient targets and per-modality distributional regularization, without negatives, temperature, momentum encoder, or teacher-student schedule, and trains stably at large scale.

Our central finding is that non-contrastive pretraining yields a vision encoder with substantially stronger dense semantic features than contrastive pretraining: as a frozen vision-language-model backbone, LeVLJEPA is the strongest of the evaluated encoders across GQA, VQAv2, and POPE under two distinct language models, and it outperforms contrastive baselines on semantic segmentation, while remaining on par on global readouts such as linear probing.

Cross-modal prediction replaces contrastive discrimination

LeVLJEPA learns image-text alignment through matched-pair prediction: image embeddings predict stop-gradient text embeddings, and text embeddings predict stop-gradient image embeddings through modality-specific predictors.

The stopped targets are the central asymmetry. Each encoder receives gradients only from its own branch, while the predictors absorb the residual mismatch between image content and caption content instead of forcing both encoders into the same under-specified shared space.
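
One way to write the matched-pair prediction loss described above, offered as a sketch rather than the exact published objective. Here \(Z_V\) and \(Z_T\) are the image and text embeddings, \(P_V\) and \(P_T\) denote the modality-specific predictors, and \(\mathrm{sg}(\cdot)\) is the stop-gradient operator. The symbols \(P_V\), \(P_T\), \(\mathrm{sg}\), the equal weighting of the two directions, and the average over a batch of \(B\) matched pairs are assumptions made for illustration.

\[
\mathcal{L}_{\text{pred}} = \frac{1}{B} \sum_{i=1}^{B} \Big( \big\| P_V\big(Z_V^{(i)}\big) - \mathrm{sg}\big(Z_T^{(i)}\big) \big\|_2^2 + \big\| P_T\big(Z_T^{(i)}\big) - \mathrm{sg}\big(Z_V^{(i)}\big) \big\|_2^2 \Big)
\]

Under this form, the first term sends gradients only to the image encoder and \(P_V\), and the second only to the text encoder and \(P_T\), which is the asymmetry the stopped targets create.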

How LeVLJEPA learns

LeVLJEPA overview. Image \(X_V\) and caption \(X_T\) are encoded into pre-projected embeddings \(Z_V, Z_T\), then passed through modality-specific predictors to produce cross-modal predictions \(\hat{Z}_V, \hat{Z}_T\). The objective combines (i) a cross-modal MSE against a stop-gradient target from the other modality and (ii) SIGReg applied independently to \(Z_V\) and \(Z_T\) — no negatives, temperature, momentum encoder, or teacher–student schedule.

SIGReg supplies the anti-collapse signal

To prevent collapse, each modality is regularized independently so its marginal embedding distribution stays close to an isotropic Gaussian. This gives LeVLJEPA an explicit anti-collapse objective without negatives, temperature scaling, a momentum encoder, or a teacher network.

Rather than matching a full density in \(d\) dimensions, SIGReg samples random one-dimensional projections and applies a characteristic-function normality test to each projection, yielding linear time and memory complexity in batch size and embedding dimension.
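
The projection-then-test idea can be sketched in plain Python. This is an illustrative approximation, not the authors' implementation: the function names, the frequency grid (17 points on [-5, 5]), the Gaussian weighting window, and the number of projections are all assumptions. For each random unit direction it compares the empirical characteristic function of the projected embeddings with that of a standard Gaussian, so the cost grows linearly with batch size and embedding dimension per projection.

```python
import cmath
import math
import random


def epps_pulley_1d(xs, t_max=5.0, num_t=17):
    """Weighted squared distance between the empirical characteristic
    function of xs and the N(0, 1) characteristic function, integrated
    over t with the trapezoid rule (grid and window are assumptions)."""
    n = len(xs)
    step = 2.0 * t_max / (num_t - 1)
    total = 0.0
    for k in range(num_t):
        t = -t_max + k * step
        ecf = sum(cmath.exp(1j * t * x) for x in xs) / n  # empirical CF at t
        target = math.exp(-0.5 * t * t)                   # CF of N(0, 1) at t
        window = math.exp(-0.5 * t * t)                   # assumed weighting
        trap = 0.5 if k in (0, num_t - 1) else 1.0        # trapezoid end weights
        total += trap * window * abs(ecf - target) ** 2 * step
    return total


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sigreg_sketch(embeddings, num_projections=16, seed=0):
    """Average the 1-D statistic over random projections of the batch."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    stats = []
    for _ in range(num_projections):
        u = random_unit_vector(dim, rng)
        projected = [sum(a * b for a, b in zip(z, u)) for z in embeddings]
        stats.append(epps_pulley_1d(projected))
    return sum(stats) / len(stats)


rng = random.Random(1)
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed_batch = [[0.0] * 8 for _ in range(256)]  # every embedding identical

print(sigreg_sketch(gaussian_batch) < sigreg_sketch(collapsed_batch))
```

A batch that is already close to an isotropic Gaussian scores near zero, while a collapsed batch scores much higher, which is the signal a regularizer of this kind would penalize.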

SIGReg drags embeddings toward isotropy

SIGReg never matches a density in \(d\) dimensions directly. It projects the embeddings onto random one-dimensional directions (colored arrows) and applies a characteristic-function normality test (Epps–Pulley) to each 1-D projection, pushing every marginal toward the standard Gaussian (navy). As the regularizer is applied, the embedding cloud becomes isotropic.

Frozen transfer isolates visual feature quality

For VLM evaluation, both the ViT-B/16 vision encoder and the language model are held frozen. Only a lightweight MLP bridge is trained to map patch tokens into the language model input space.

Because neither endpoint is updated, downstream accuracy tests whether the pretrained patch tokens already carry the visual structure the frozen LLM needs for question answering.
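
A toy sketch of the only trainable component in this protocol. The source specifies only a lightweight MLP bridge, so the two-layer shape, the GELU activation, the helper names, and the small dimensions below are assumptions chosen for illustration. The patch tokens are random stand-ins for the frozen encoder's output.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_linear(layer, x):
    """Compute W x + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def bridge(patch_tokens, layer1, layer2):
    """Map each patch token independently to the language-model input width."""
    return [apply_linear(layer2, [gelu(h) for h in apply_linear(layer1, tok)])
            for tok in patch_tokens]


rng = random.Random(0)
# Toy sizes for illustration only; the real widths are not given here.
VISION_DIM, HIDDEN_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 12, 4

layer1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)  # trainable
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)     # trainable

# Stand-in for the frozen vision encoder's patch tokens (never updated).
patch_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                for _ in range(NUM_PATCHES)]

visual_prefix = bridge(patch_tokens, layer1, layer2)
print(len(visual_prefix), len(visual_prefix[0]))  # → 4 12
```

The bridge keeps one output vector per patch token, so the frozen language model sees the full grid rather than a pooled summary.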

How downstream transfer is trained

Frozen transfer protocol. For VLM evaluation, the pretrained vision encoder and the language model are frozen. Only a small MLP bridge learns to map the encoder's patch tokens into the language model input space for question answering, isolating how readable the frozen visual representation is to a downstream language model.

Explore what the models retrieve

We embedded 500 held-out images from ten Imagenette categories with the LeVLJEPA and CLIP ViT-B/16 checkpoints. The explorer has two views: an image embedding space for comparing the two models, and text-to-image retrieval, which returns the images nearest to a chosen text prompt.

Embedding explorer. Image and text features come from the released LeVLJEPA and CLIP checkpoints after 200k DataComp training steps. The interaction only filters precomputed values.

Results

Global Readouts

Pooled readouts tell two different stories

Zero-shot transfer and linear probing both reduce an image to a single pooled vector before scoring. Zero-shot compares that vector to text embeddings, while linear probing asks whether the frozen CLS feature is directly separable by class.
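
The two readouts can be contrasted in a few lines. This is a minimal sketch with toy vectors: the function names are hypothetical, and scoring zero-shot by cosine similarity is an assumption about the protocol rather than something stated here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_vec, class_text_vecs):
    """Pick the class whose text embedding best matches the pooled image vector."""
    scores = [cosine(image_vec, t) for t in class_text_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def linear_probe_predict(image_vec, weights, biases):
    """Pick the class with the highest logit from a linear head on the frozen feature."""
    logits = [sum(w * x for w, x in zip(row, image_vec)) + b
              for row, b in zip(weights, biases)]
    return max(range(len(logits)), key=logits.__getitem__)


image_vec = [0.9, 0.1, 0.2]                            # toy pooled image feature
class_text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy class-prompt embeddings
probe_w = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # toy trained probe weights
probe_b = [0.0, 0.0]

zs = zero_shot_predict(image_vec, class_text_vecs)
lp = linear_probe_predict(image_vec, probe_w, probe_b)
print(zs, lp)  # → 0 0
```

The zero-shot path depends on the text encoder through `class_text_vecs`, while the linear probe uses only the image feature and a trained head.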

This separates alignment from visual feature quality: contrastive baselines retain the advantage on the zero-shot protocol they optimize directly, but the objectives are close once the text encoder is removed from the readout.

Zero-shot transfer (accuracy, %)

| Encoder | ImageNet | Places365 | Aircraft | Pets |
|---|---|---|---|---|
| CLIP | 47.32 | 34.46 | 8.10 | 68.98 |
| SigLIP | 50.78 | 33.76 | 10.62 | 77.27 |
| LeVLJEPA | 42.45 | 29.97 | 7.65 | 59.63 |

LeVLJEPA still learns nontrivial image-text alignment without negatives, but zero-shot directly rewards the matched-vs-unmatched discrimination used by contrastive objectives.

Linear probing (accuracy, %)

| Encoder | ImageNet | Places365 | Aircraft | Pets |
|---|---|---|---|---|
| CLIP | 65.75 | 37.11 | 44.10 | 82.86 |
| SigLIP | 66.34 | 36.81 | 47.46 | 82.64 |
| LeVLJEPA | 65.42 | 36.07 | 46.38 | 81.28 |

A linear probe discards the text encoder and isolates the pooled visual feature; here LeVLJEPA stays within 1.6 points of the strongest baseline on every benchmark.
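
The gap to the strongest baseline can be checked directly from the linear-probing numbers reported above. The snippet below only restates those accuracies and subtracts; the variable names are illustrative.

```python
# Linear-probing accuracies (%) as reported in this section.
linear_probe = {
    "ImageNet":  {"CLIP": 65.75, "SigLIP": 66.34, "LeVLJEPA": 65.42},
    "Places365": {"CLIP": 37.11, "SigLIP": 36.81, "LeVLJEPA": 36.07},
    "Aircraft":  {"CLIP": 44.10, "SigLIP": 47.46, "LeVLJEPA": 46.38},
    "Pets":      {"CLIP": 82.86, "SigLIP": 82.64, "LeVLJEPA": 81.28},
}

gaps = {}
for bench, acc in linear_probe.items():
    best_baseline = max(acc["CLIP"], acc["SigLIP"])
    # Positive gap means LeVLJEPA trails the stronger contrastive baseline.
    gaps[bench] = round(best_baseline - acc["LeVLJEPA"], 2)

print(gaps)
# → {'ImageNet': 0.92, 'Places365': 1.04, 'Aircraft': 1.08, 'Pets': 1.58}
```

The largest gap is 1.58 points on Pets, and on Aircraft LeVLJEPA trails only SigLIP while exceeding CLIP.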

Background Robustness

Even the global feature is less background-sensitive

The standard IID linear probe leaves the methods closely matched, but controlled background swaps expose a smaller degradation for LeVLJEPA. This is consistent with a more object-focused global representation, even before moving to patch-token evaluations.

Accuracy drop under background shift (points, lower is better)

| Encoder | Mixed-Same | Mixed-Rand |
|---|---|---|
| CLIP | 6.57 | 18.67 |
| SigLIP | 7.03 | 18.09 |
| LeVLJEPA | 5.95 | 17.21 |

From Original to Mixed-Same, LeVLJEPA drops 5.95 points, compared with 6.57 for CLIP and 7.03 for SigLIP. Under Mixed-Rand, it again has the smallest drop.

Dense Semantic Features

Linear segmentation on frozen patch tokens (mIoU)

| Encoder | ADE20K | COCO-Stuff |
|---|---|---|
| CLIP | 20.90 | 29.02 |
| SigLIP | 19.24 | 28.88 |
| LeVLJEPA | 23.15 | 31.10 |

A single linear head exposes the per-token semantic structure that downstream dense systems consume.

Patch tokens separate the objectives

A frozen ViT emits a grid of patch tokens, and downstream dense systems consume that sequence rather than a pooled CLS embedding. Semantic segmentation is therefore a direct readout of spatial and semantic structure already present in the frozen patch features.

With only a single linear head, LeVLJEPA exceeds the stronger contrastive baseline by 2.25 mIoU on ADE20K and 2.08 mIoU on COCO-Stuff, so the margin cannot be attributed to head capacity.
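
For readers unfamiliar with the metric, mean intersection-over-union can be sketched as follows. This is a simplified illustration on a flat list of per-position labels: the function name is hypothetical, and real benchmark evaluation typically accumulates counts over the whole dataset and handles ignored labels, which is omitted here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy per-position labels, e.g. the argmax of a linear head on each patch token.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
score = mean_iou(pred, target, num_classes=3)
```

Because the head is linear, any difference in this score between encoders reflects how linearly separable the frozen patch features already are.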

Frozen VLM Backbone

The deployment setting: frozen encoder, frozen LLM

In the VLM protocol, the patch tokens are projected into a language model's input space and consumed during question answering. Both the ViT-B/16 encoder and the language model stay frozen; only a lightweight MLP bridge is trained.

Because neither endpoint is updated, downstream accuracy is a function of the pretrained visual features and the bridge alone. Repeating the evaluation with Llama-1B and Qwen-1.5B separates backbone quality from quirks of a particular language model.

Llama-1B (gain over random, points)

| Encoder | GQA | VQAv2 | POPE |
|---|---|---|---|
| CLIP | +6.3 | +8.4 | +16.2 |
| SigLIP | +6.0 | +6.0 | +12.4 |
| LeVLJEPA | +8.2 | +11.0 | +17.3 |

Qwen-1.5B (gain over random, points)

| Encoder | GQA | VQAv2 | POPE |
|---|---|---|---|
| CLIP | +5.2 | +5.8 | +19.1 |
| SigLIP | +4.6 | +4.1 | +18.0 |
| LeVLJEPA | +6.7 | +10.5 | +22.6 |

Random initialization is the zero point for each task and language model. LeVLJEPA adds the largest gain in every column, preserving the same encoder ordering across both language models.
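
The ordering claim can be verified from the reported gains. The snippet restates the numbers from this section and checks the ranking for every task and language model; the variable names are illustrative.

```python
# Gains over the random-initialization zero point (points), as reported above.
gains = {
    "Llama-1B": {
        "GQA":   {"CLIP": 6.3,  "SigLIP": 6.0,  "LeVLJEPA": 8.2},
        "VQAv2": {"CLIP": 8.4,  "SigLIP": 6.0,  "LeVLJEPA": 11.0},
        "POPE":  {"CLIP": 16.2, "SigLIP": 12.4, "LeVLJEPA": 17.3},
    },
    "Qwen-1.5B": {
        "GQA":   {"CLIP": 5.2,  "SigLIP": 4.6,  "LeVLJEPA": 6.7},
        "VQAv2": {"CLIP": 5.8,  "SigLIP": 4.1,  "LeVLJEPA": 10.5},
        "POPE":  {"CLIP": 19.1, "SigLIP": 18.0, "LeVLJEPA": 22.6},
    },
}

rankings = {}
margins = {}
for llm, tasks in gains.items():
    for task, scores in tasks.items():
        # Encoders sorted from largest to smallest gain.
        rankings[(llm, task)] = sorted(scores, key=scores.get, reverse=True)
        best_baseline = max(scores["CLIP"], scores["SigLIP"])
        margins[(llm, task)] = round(scores["LeVLJEPA"] - best_baseline, 1)

same_order = all(r == ["LeVLJEPA", "CLIP", "SigLIP"] for r in rankings.values())
print(same_order)
print(min(margins.values()), max(margins.values()))
```

The margin over the stronger baseline ranges from 1.1 points (POPE with Llama-1B) to 4.7 points (VQAv2 with Qwen-1.5B).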

Acknowledgements

This project page is built on top of the LeWorldModel website template by Lucas Maes et al. We thank them for making their code publicly available.

We gratefully acknowledge support from the hessian.AI Service Center (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no. 16IS22091) and the hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation, grant no. S-DIW04/0013/003).

This work was funded by a European Union grant (ERC, TAIPO, 101088594 to F.B.). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or ERC. Neither the European Union nor the granting authority can be held responsible for them.
