LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

¹German Cancer Research Center (DKFZ), ²German Cancer Consortium (DKTK), ³Goethe University Frankfurt, ⁴Brown University. *Joint last author

Abstract

Vision-language pretraining remains dominated by contrastive objectives, whereas vision-only self-supervised learning has largely adopted non-contrastive methods. At the same time, the role of vision-language encoders has shifted: they are increasingly deployed not as zero-shot classifiers but as the frozen visual backbone of vision-language models and dense prediction systems, which consume the full grid of patch tokens rather than a single pooled embedding. We introduce LeVLJEPA, the first fully non-contrastive end-to-end vision-language pretraining method. LeVLJEPA learns through cross-modal prediction with stop-gradient targets and per-modality distributional regularization, without negatives, temperature, momentum encoder, or teacher-student schedule, and trains stably at large scale.

Our central finding is that non-contrastive pretraining yields a vision encoder with substantially stronger dense semantic features than contrastive pretraining: as a frozen vision-language-model backbone, LeVLJEPA is the strongest of the evaluated encoders across GQA, VQAv2, and POPE under two distinct language models, and it outperforms contrastive baselines on semantic segmentation, while remaining on par on global readouts such as linear probing.

Cross-modal prediction replaces contrastive discrimination

LeVLJEPA learns image-text alignment through matched-pair prediction: image embeddings predict stop-gradient text embeddings, and text embeddings predict stop-gradient image embeddings through modality-specific predictors.

The stopped targets are the central asymmetry. Each encoder receives gradients only from its own branch, while the predictors absorb the residual mismatch between image content and caption content instead of forcing both encoders into the same under-specified shared space.
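The gradient routing described above can be made concrete with a small NumPy sketch. It is illustrative only: the linear predictors `w_v` and `w_t`, the function name, and the unweighted sum of the two directions are assumptions, and the gradients are derived by hand so that the effect of treating each target as a constant is visible.

```python
import numpy as np

def cross_modal_loss_and_grads(z_v, z_t, w_v, w_t):
    """Cross-modal MSE with stop-gradient targets (hypothetical linear predictors).

    z_v, z_t: (n, d) image and text embeddings for matched pairs.
    w_v, w_t: (d, d) stand-ins for the modality-specific predictors.
    """
    n = z_v.shape[0]
    res_v = z_v @ w_v - z_t  # image branch predicts the stopped text embedding
    res_t = z_t @ w_t - z_v  # text branch predicts the stopped image embedding
    loss = (np.sum(res_v ** 2) + np.sum(res_t ** 2)) / n

    # Targets are constants, so each embedding receives gradient only
    # through its own predictor, never through its role as a target.
    grad_z_v = 2.0 * res_v @ w_v.T / n
    grad_z_t = 2.0 * res_t @ w_t.T / n
    grad_w_v = 2.0 * z_v.T @ res_v / n
    grad_w_t = 2.0 * z_t.T @ res_t / n
    return loss, grad_z_v, grad_z_t, grad_w_v, grad_w_t
```

If the image branch already predicts the text embedding exactly, `grad_z_v` is zero even when the text branch still has error. Without the stop-gradient, the image embedding would also be pulled by the text branch's residual.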

How LeVLJEPA learns

LeVLJEPA overview. Image \(X_V\) and caption \(X_T\) are encoded into pre-projected embeddings \(Z_V, Z_T\), then passed through modality-specific predictors to produce cross-modal predictions \(\hat{Z}_V, \hat{Z}_T\). The objective combines (i) a cross-modal MSE against a stop-gradient target from the other modality and (ii) SIGReg applied independently to \(Z_V\) and \(Z_T\) — no negatives, temperature, momentum encoder, or teacher–student schedule.
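Read literally, the objective in the caption can be written as below. This is a sketch under assumed notation: \(P_V\) and \(P_T\) name the image-side and text-side predictors (their outputs are the caption's cross-modal predictions), \(z_V^{(i)}\) and \(z_T^{(i)}\) are the \(i\)-th rows of \(Z_V\) and \(Z_T\), \(\mathrm{sg}(\cdot)\) is the stop-gradient, and \(N\) is the batch size. The weight \(\lambda\) and the equal weighting of the two prediction directions are assumptions not stated here.

\[
\mathcal{L} \;=\; \frac{1}{N}\sum_{i=1}^{N}\Big( \big\| P_V\big(z_V^{(i)}\big) - \mathrm{sg}\big(z_T^{(i)}\big) \big\|_2^2 \;+\; \big\| P_T\big(z_T^{(i)}\big) - \mathrm{sg}\big(z_V^{(i)}\big) \big\|_2^2 \Big) \;+\; \lambda\,\big( \mathrm{SIGReg}(Z_V) + \mathrm{SIGReg}(Z_T) \big)
\]

The first two terms are the cross-modal MSE in each direction; the last term applies SIGReg to each modality separately.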

SIGReg supplies the anti-collapse signal

To prevent collapse, each modality is regularized independently so its marginal embedding distribution stays close to an isotropic Gaussian. This gives LeVLJEPA an explicit anti-collapse objective without negatives, temperature scaling, a momentum encoder, or a teacher network.

Rather than matching a full density in \(d\) dimensions, SIGReg samples random one-dimensional projections and applies a characteristic-function normality test to each projection, yielding linear time and memory complexity in batch size and embedding dimension.
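A minimal NumPy sketch of this idea follows. It is not the released implementation: the function name, the number of directions, the integration grid over \(t\), and the Gaussian weight function are all assumptions chosen for illustration. For each random direction it compares the empirical characteristic function of the projected embeddings with that of a standard Gaussian, \(e^{-t^2/2}\), and integrates the weighted squared difference, in the style of an Epps–Pulley statistic.

```python
import numpy as np

def sigreg_sketch(z, num_directions=64, t_max=3.0, num_t=17, seed=0):
    """Average characteristic-function normality statistic over random 1-D projections.

    z: (n, d) batch of embeddings. All hyperparameters here are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, d = z.shape

    # Random unit directions; cost is linear in n and d for a fixed number of directions.
    directions = rng.standard_normal((d, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions  # (n, num_directions)

    # Empirical characteristic function on a grid of frequencies t.
    t = np.linspace(-t_max, t_max, num_t)
    tx = proj[:, :, None] * t[None, None, :]  # (n, K, T)
    ecf_real = np.cos(tx).mean(axis=0)
    ecf_imag = np.sin(tx).mean(axis=0)

    # Standard Gaussian characteristic function (real-valued) and an assumed weight.
    target = np.exp(-0.5 * t ** 2)
    weight = np.exp(-0.5 * t ** 2)
    integrand = ((ecf_real - target) ** 2 + ecf_imag ** 2) * weight

    # Trapezoidal rule over t, one statistic per direction.
    dt = t[1] - t[0]
    per_direction = 0.5 * dt * (integrand[:, :-1] + integrand[:, 1:]).sum(axis=1)
    return float(n * per_direction.mean())
```

Under these assumptions, a batch drawn from an isotropic Gaussian scores far lower than a collapsed batch (all embeddings near one point) or a strongly anisotropic one, which is the anti-collapse behavior described above.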

SIGReg: drag toward isotropy

SIGReg never matches a density in \(d\) dimensions directly. It projects the embeddings onto random univariate directions (colored arrows) and applies a characteristic-function normality test (Epps–Pulley) to each 1-D projection, pushing every marginal toward the standard Gaussian (navy). As the regularizer is applied, the embedding cloud becomes isotropic.

Frozen transfer isolates visual feature quality

For VLM evaluation, both the ViT-B/16 vision encoder and the language model are held frozen. Only a lightweight MLP bridge is trained to map patch tokens into the language model input space.

Because neither endpoint is updated, downstream accuracy tests whether the pretrained patch tokens already carry the visual structure the frozen LLM needs for question answering.
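The shape of this protocol can be sketched in NumPy. Everything specific here is an assumption: the two-layer depth, the GELU activation, the hidden width, and the language-model width `llm_dim=2048` are placeholders, and the 196-token grid assumes a 224×224 input to ViT-B/16 (14×14 patches of width 768). Only the bridge parameters would be trained; the encoder output and the language model are treated as fixed.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init_bridge(vision_dim, hidden_dim, llm_dim, seed=0):
    """Hypothetical two-layer MLP bridge; these are the only trainable parameters."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((vision_dim, hidden_dim)) * 0.02,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, llm_dim)) * 0.02,
        "b2": np.zeros(llm_dim),
    }

def bridge_forward(patch_tokens, params):
    """Map frozen patch tokens (num_patches, vision_dim) to (num_patches, llm_dim)."""
    hidden = gelu(patch_tokens @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

# Stand-in for the frozen encoder's output on one image.
patch_tokens = np.random.default_rng(0).standard_normal((196, 768))
params = init_bridge(vision_dim=768, hidden_dim=1024, llm_dim=2048)
visual_inputs = bridge_forward(patch_tokens, params)  # one LLM-width vector per patch
```

The projected vectors would then be fed to the frozen language model alongside the question's token embeddings; how the two sequences are combined is not specified here.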

How downstream transfer is trained

Frozen transfer protocol. For VLM evaluation, the pretrained vision encoder and the language model are frozen. Only a small MLP bridge learns to map the encoder's patch tokens into the language model input space for question answering, isolating how readable the frozen visual representation is to a downstream language model.

Explore what the models retrieve

We embedded 500 held-out images from the ten Imagenette categories with the LeVLJEPA and CLIP ViT-B/16 checkpoints. Compare their image spaces, then choose a text prompt to retrieve the nearest images.


Embedding explorer. Image and text features come from the released LeVLJEPA and CLIP checkpoints after 200k DataComp training steps. The interaction only filters precomputed values.

Results

Global Readouts

Pooled readouts tell two different stories

Zero-shot transfer and linear probing both reduce an image to a single pooled vector before scoring. Zero-shot compares that vector to text embeddings, while linear probing asks whether the frozen CLS feature is directly separable by class.

This separates alignment from visual feature quality: contrastive baselines retain the advantage on the zero-shot protocol they optimize directly, but the objectives are close once the text encoder is removed from the readout.
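The two readouts can be contrasted in a few lines of NumPy. This is a sketch under assumptions: zero-shot scoring is shown as cosine similarity between the pooled image vector and one text embedding per class, which is the usual convention for contrastive models; how LeVLJEPA's predictors enter this comparison is not specified here. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_predict(image_emb, class_text_emb):
    """Pick the class whose text embedding is most similar to each image vector.

    image_emb: (n, d) pooled image vectors; class_text_emb: (num_classes, d).
    Depends on both encoders, so it measures image-text alignment.
    """
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

def linear_probe_predict(image_feat, weight, bias):
    """Apply a trained linear classifier to frozen pooled features.

    The text encoder plays no role, so this measures the visual feature alone.
    """
    return (image_feat @ weight + bias).argmax(axis=1)
```

The zero-shot path needs no labeled training data but inherits any misalignment between the two encoders, whereas the linear probe fits `weight` and `bias` on labeled images and never consults the text side.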

Zero-shot transfer (accuracy, %)

Dataset      CLIP     SigLIP   LeVLJEPA
ImageNet     47.32    50.78    42.45
Places365    34.46    33.76    29.97
Aircraft      8.10    10.62     7.65
Pets         68.98    77.27    59.63

LeVLJEPA still learns nontrivial image-text alignment without negatives, but zero-shot directly rewards the matched-vs-unmatched discrimination used by contrastive objectives.

Linear probing (accuracy, %)

Dataset      CLIP     SigLIP   LeVLJEPA
ImageNet     65.75    66.34    65.42
Places365    37.11    36.81    36.07
Aircraft     44.10    47.46    46.38
Pets         82.86    82.64    81.28

A linear probe discards the text encoder and isolates the pooled visual feature; here LeVLJEPA stays within 1.6 points of the strongest baseline on every benchmark (the largest gap is 1.58 points, on Pets).

Background Robustness

Even the global feature is less background-sensitive

The standard IID linear probe leaves the methods closely matched, but controlled background swaps expose a smaller degradation for LeVLJEPA. This is consistent with a more object-focused global representation, even before moving to patch-token evaluations.

Accuracy drop under background shift (points, lower is better)

Setting       CLIP     SigLIP   LeVLJEPA
Mixed-Same     6.57     7.03     5.95
Mixed-Rand    18.67    18.09    17.21

From Original to Mixed-Same, LeVLJEPA drops 5.95 points, compared with 6.57 for CLIP and 7.03 for SigLIP. Under Mixed-Rand, it again has the smallest drop.

Dense Semantic Features

Linear segmentation on frozen patch tokens (mIoU)

Dataset       CLIP     SigLIP   LeVLJEPA
ADE20K        20.90    19.24    23.15
COCO-Stuff    29.02    28.88    31.10

A single linear head exposes the per-token semantic structure that downstream dense systems consume.

Patch tokens separate the objectives

A frozen ViT emits a grid of patch tokens, and downstream dense systems consume that sequence rather than a pooled CLS embedding. Semantic segmentation is therefore a direct readout of spatial and semantic structure already present in the frozen patch features.

With only a single linear head, LeVLJEPA exceeds the stronger contrastive baseline by 2.25 mIoU on ADE20K and 2.08 mIoU on COCO-Stuff, so the margin cannot be attributed to head capacity.
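To make the readout concrete, here is a simplified NumPy sketch of a linear head on patch tokens and of the mIoU metric. It is not the evaluation code: the function names are hypothetical, the prediction stays at patch resolution (no upsampling to pixels), and mIoU is computed on a single label map, whereas benchmark mIoU is normally accumulated over the whole dataset.

```python
import numpy as np

def linear_head_segmentation(patch_tokens, weight, bias, grid_hw):
    """Classify each frozen patch token independently with one linear layer.

    patch_tokens: (h*w, d); weight: (d, num_classes); returns an (h, w) label map.
    """
    h, w = grid_hw
    logits = patch_tokens @ weight + bias
    return logits.argmax(axis=-1).reshape(h, w)

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```

Because the head is a single matrix multiply per token, any class structure it recovers must already be linearly separable in the frozen patch features.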

Frozen VLM Backbone

The deployment setting: frozen encoder, frozen LLM

In the VLM protocol, the patch tokens are projected into a language model's input space and consumed during question answering. Both the ViT-B/16 encoder and the language model stay frozen; only a lightweight MLP bridge is trained.

Because neither endpoint is updated, downstream accuracy is a function of the pretrained visual features and the bridge alone. Repeating the evaluation with Llama-1B and Qwen-1.5B separates backbone quality from quirks of a particular language model.

Random initialization is the zero point for each task and language model. LeVLJEPA adds the largest gain in every column, preserving the same encoder ordering across both language models.

Acknowledgements

This project page is built on top of the LeWorldModel website template by Lucas Maes et al. We thank them for making their code publicly available.

We gratefully acknowledge support from the hessian.AI Service Center (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no. 16IS22091) and the hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation, grant no. S-DIW04/0013/003).

This work was funded by a European Union grant (ERC, TAIPO, 101088594 to F.B.). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or ERC. Neither the European Union nor the granting authority can be held responsible for them.