How JEPA Learns World Models Without Predicting Every Pixel
Introduction
There is a simple intuition that makes JEPA feel less mysterious.
Take this sentence:
The cat climbed up the __.
Most humans can guess that the missing word is tree, stairs, sofa, or something structurally similar. We do not mentally generate every possible image of a cat, every texture of bark, every camera angle, or every exact pixel. We jump straight to the meaningful structure of the situation.
That is the spirit behind JEPA, short for Joint-Embedding Predictive Architecture. Instead of predicting raw pixels, raw audio samples, or the next token directly, JEPA tries to predict a useful internal representation of what is missing or what comes next. This makes it one of the most interesting ideas in modern AI research, especially if you care about world models, reasoning, planning, and systems that can ask:
If I do this, what is likely to happen next?
In this post, I want to give a high-level but technically grounded overview of JEPA, explain why Yann LeCun considers it important, and finish with a small PyTorch notebook example that demonstrates the idea using a toy missing-word task.
What Is JEPA?
At a high level, JEPA is a self-supervised learning framework. It learns from structure already present in data, without requiring humans to label everything by hand.
The core idea is simple:
- Take an observed context.
- Encode that context into an abstract representation, also called an embedding.
- Hide or skip another part of the input.
- Train a predictor to infer the embedding of the missing or future part.
The important detail is that the prediction happens in latent space, not in raw data space.
That means JEPA is usually not trying to reconstruct:
- every pixel in an image
- every waveform value in an audio clip
- every token in a sentence
Instead, it tries to predict the semantic state that matters.
This is why JEPA often gets described as a move from surface reconstruction toward meaningful prediction.
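To make that recipe concrete, here is a deliberately naive sketch of the objective. Every name, shape, and module below is a placeholder chosen for illustration; real JEPA systems use large encoders, carefully designed masking, and extra machinery to avoid representational collapse.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules; in real JEPA systems these are deep networks (e.g. Vision Transformers).
dim = 16
context_encoder = nn.Linear(dim, dim)   # encodes the visible part of the input
target_encoder = nn.Linear(dim, dim)    # encodes the hidden or future part
predictor = nn.Linear(dim, dim)         # maps context latent -> predicted target latent

x_visible = torch.randn(1, dim)         # placeholder for the observed context
x_hidden = torch.randn(1, dim)          # placeholder for the masked or future part

z_context = context_encoder(x_visible)
z_target = target_encoder(x_hidden).detach()   # targets are typically held fixed (stop-gradient or EMA)
z_pred = predictor(z_context)

loss = F.mse_loss(z_pred, z_target)     # the loss lives in latent space, not in pixel space
```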
Why Predict Embeddings Instead of Pixels?
If you predict pixels directly, the model is forced to care about many details that may not matter for understanding.
For example, if a system watches a short video of a ball rolling behind a chair, the key thing is not the exact lighting of the room. The important thing is that:
- there is a ball
- it is moving
- it is temporarily occluded
- it should reappear in a physically plausible location
A representation-learning approach like JEPA tries to keep that important structure while discarding irrelevant noise.
This is one reason JEPA is often discussed as a promising route toward world modeling. A world model should not merely reproduce appearances. It should capture:
- objects
- relations
- motion
- persistence over time
- plausible consequences of actions
In other words, a useful AI system should not only say what something looks like. It should build an internal model that supports expectation.
The Three Main Pieces of a JEPA System
Most JEPA-style systems can be understood through three components.
1. Context Encoder
The context encoder reads the visible part of the input.
If the input is an image, it might see only some image patches. If the input is video, it might see a sequence of frames. If the input is text, it might see the words around a missing span.
Its job is to convert that visible context into a compact embedding that preserves the meaningful structure.
2. Target Encoder
The target encoder processes the hidden, masked, or future part of the input and produces the target embedding.
This target is not usually exposed directly to the predictor as raw data. It serves as the representation the model should learn to match.
In many practical JEPA systems, the target encoder is designed to be stable so the model does not collapse to trivial solutions.
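A common way to achieve that stability (I-JEPA, for instance, uses an exponential moving average of the context encoder as its target encoder) is to update the target encoder not by gradient descent but by slowly tracking the context encoder. Here is a minimal sketch of that EMA update, with small stand-in modules:

```python
import copy
import torch
import torch.nn as nn

context_encoder = nn.Linear(16, 16)               # stand-in for a real encoder
target_encoder = copy.deepcopy(context_encoder)   # starts as an exact copy
for p in target_encoder.parameters():
    p.requires_grad_(False)                       # never touched by the optimizer

@torch.no_grad()
def ema_update(target, context, momentum=0.996):
    # Pull each target parameter slightly toward its context counterpart.
    for p_t, p_c in zip(target.parameters(), context.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

ema_update(target_encoder, context_encoder)       # called once per training step
```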
3. Predictor
The predictor receives the context embedding and learns to predict the target embedding.
This is the heart of the architecture. The predictor is effectively answering:
Given what I can already observe, what latent state should I expect for the hidden or future part?
That question is much closer to planning and reasoning than raw reconstruction.
Why Yann LeCun Pushes JEPA So Strongly
Yann LeCun has argued for several years that intelligence cannot come only from next-token prediction. His broader vision is that advanced AI systems need:
- world models
- memory
- reasoning
- hierarchical planning
JEPA fits into that picture because it offers a way to train systems to predict at a level of abstraction that is more aligned with how agents interact with the world.
This does not mean JEPA is already a complete artificial general intelligence recipe. It is better to think of it as a foundational ingredient in a larger architecture for autonomous intelligence.
The appeal is clear:
- It can be more efficient than reconstructing everything.
- It can focus on what matters rather than visual noise.
- It can support prediction in settings where multiple futures are possible.
- It naturally connects to robotics, video understanding, and planning.
If an agent must decide whether opening a door, moving a cup, or turning left will produce a useful outcome, latent-space prediction is often more practical than brute-force generation of every sensory detail.
JEPA vs Generative AI
It is tempting to frame this as “JEPA versus generative AI,” but the more accurate view is that they solve different problems.
Generative models are excellent when you want to produce a detailed output:
- generate an image
- write a paragraph
- synthesize speech
- autocomplete code
JEPA-style models are appealing when you want a system to learn:
- what structure is important
- what hidden content is compatible with a scene
- what future states are plausible
- what consequences follow from an action
Generative models often optimize for high-fidelity output.
JEPA optimizes for useful predictive representation.
That distinction matters. An AI that can draw a photorealistic cat is not automatically an AI that understands how the cat will move after jumping onto a branch.
The “Missing Tree” Analogy
Let us return to the sentence example:
The cat climbed up the __.
A generative language model might treat this as a token prediction problem, scoring candidate next tokens as it works left to right.
A JEPA-like framing asks a different question:
What latent concept best fits the missing part, given the context?
The answer does not need to begin as a word. It can begin as an internal semantic representation that is close to ideas like:
- climbable object
- common outdoor object
- physically plausible continuation
Later, for the sake of a demo or downstream task, we may map that latent representation back to a word such as tree.
That difference is subtle but important. The model is encouraged to learn compatibility in concept space, not merely surface continuation.
Types of JEPA You Should Know
As of April 26, 2026, the JEPA family has expanded across several modalities.
I-JEPA
I-JEPA stands for Image-JEPA. It was introduced in 2023 as a self-supervised image model that predicts the representations of target image blocks from a visible context block.
It became one of the clearest demonstrations of the JEPA idea: learn from images by predicting representations of missing regions, not by reconstructing the exact pixels.
V-JEPA and V-JEPA 2
For video, the idea becomes even more compelling because video naturally contains time, motion, and occlusion.
V-JEPA extends latent prediction into the temporal setting. Rather than asking, “What exact pixels come next?”, it asks, “What representation of the future scene should follow from the current one?”
Meta later introduced V-JEPA 2 in 2025, positioning it as a world-model-oriented step toward understanding, prediction, and planning from video.
VL-JEPA
VL-JEPA brings the idea into the vision-language setting. The intuition here is powerful: instead of autoregressively generating language token by token, the model can predict continuous latent targets tied to language and vision jointly.
This makes JEPA relevant not only to perception, but also to multimodal reasoning systems.
Why JEPA Matters for World Models
The phrase world model can sound philosophical, but the practical meaning is straightforward.
A world model is an internal system that helps an agent answer:
- What state is the world in now?
- What parts of the state matter?
- If I act, what changes next?
- Which future outcomes are more plausible?
JEPA contributes strongly to the first three of those questions.
It does this by learning a state representation that is abstract enough to ignore noise, but structured enough to preserve what matters for prediction.
That is exactly why JEPA is interesting for:
- robotics
- physical reasoning
- autonomous agents
- video understanding
- long-horizon planning
You can think of JEPA as helping a model build an internal simulator, not necessarily a renderer.
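To make the "simulator, not renderer" point concrete, imagine rolling a predictor forward in representation space to evaluate a candidate plan. The action-conditioned predictor below is purely illustrative; its shape and interface are my assumptions, not part of any published JEPA model.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 32, 4

# Hypothetical action-conditioned predictor: given a latent state and an action,
# guess the latent state that should follow.
predictor = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim),
)

z = torch.randn(1, state_dim)                            # latent state from some encoder
plan = [torch.randn(1, action_dim) for _ in range(3)]    # a candidate sequence of actions

for action in plan:
    z = predictor(torch.cat([z, action], dim=-1))        # roll the world model forward in latent space

# z now approximates the state after executing the plan, without rendering a single pixel.
```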
A Tiny PyTorch Example: JEPA for a Missing Word
Real JEPA systems are much more sophisticated than a toy notebook, but a small example can still make the idea intuitive.
For this post, I created a Colab/Jupyter-ready notebook here:
The notebook uses a miniature synthetic dataset of short sentences such as:
```
the cat climbed the tree
the dog chased the ball
the child opened the door
the bird sat on the branch
```
Then it masks one important token and trains a model to predict the embedding of the missing word from the remaining context.
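A minimal way to turn such sentences into training pairs is to hide one content word per sentence and keep the rest as context. The notebook's exact preprocessing may differ; for simplicity, this sketch masks the final word of each sentence.

```python
sentences = [
    "the cat climbed the tree",
    "the dog chased the ball",
    "the child opened the door",
    "the bird sat on the branch",
]

# Build a tiny vocabulary over all words.
vocab = sorted({w for s in sentences for w in s.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

# Mask the last word of each sentence and treat it as the prediction target.
pairs = []
for s in sentences:
    words = s.split()
    context_ids = [word_to_id[w] for w in words[:-1]]
    target_id = word_to_id[words[-1]]
    pairs.append((context_ids, target_id))

print(pairs[0])  # ids for "the cat climbed the", plus the id of "tree"
```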
The Architecture Used in the Notebook
The notebook keeps the architecture intentionally small so the idea is easy to follow:
- Embedding layer: Each word is mapped to a dense vector.
- Context encoder: The visible words are embedded, averaged, and passed through a small feed-forward network.
- Target encoder: The hidden word is also embedded and passed through another small projection layer.
- Predictor: A multilayer perceptron predicts the target latent representation from the context latent representation.
- Similarity search: After training, we compare the predicted latent vector against the latent vectors of vocabulary words and choose the nearest one by cosine similarity.
This final step is just for interpretation. The important training objective is the latent prediction itself.
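That retrieval step can be as simple as encoding every vocabulary word with the target encoder and picking the nearest neighbor of the predicted vector. Here is a sketch of the lookup, assuming a trained model with the `ToyJEPA` interface shown later in this post and the `word_to_id` mapping from the data sketch above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_word(model, predicted_latent, word_to_id):
    # Encode every vocabulary word with the target encoder.
    ids = torch.tensor(list(word_to_id.values()))
    vocab_latents = model.encode_target(ids)          # shape: (vocab_size, latent_dim)
    # Cosine similarity between the prediction and each vocabulary latent.
    sims = F.cosine_similarity(predicted_latent, vocab_latents, dim=-1)
    best = sims.argmax().item()
    id_to_word = {i: w for w, i in word_to_id.items()}
    return id_to_word[ids[best].item()]
```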
Why These Layers Make Sense
The notebook uses simple fully connected hidden layers with ReLU activations.
That choice is deliberate:
- The embedding layer gives each word a trainable semantic vector.
- The encoder hidden layer learns a compressed representation of the observed context.
- The predictor hidden layer transforms context meaning into a guess about the missing concept.
- The target projection layer puts the true hidden word into the same latent space the predictor is trying to reach.
This is obviously much smaller than real JEPA systems, which often use Vision Transformers or more advanced spatiotemporal architectures. But structurally it captures the same teaching idea:
observe context -> encode meaning -> predict a compatible latent target
A Minimal Code Sketch
Here is the core modeling idea in compact form:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyJEPA(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, latent_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.context_encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.target_encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def encode_context(self, context_ids):
        # Average the embeddings of the visible words, then project to latent space.
        context_emb = self.embedding(context_ids).mean(dim=1)
        return self.context_encoder(context_emb)

    def encode_target(self, target_ids):
        # Embed the hidden word and project it into the same latent space.
        target_emb = self.embedding(target_ids)
        return self.target_encoder(target_emb)

    def forward(self, context_ids, target_ids):
        context_latent = self.encode_context(context_ids)
        predicted_target = self.predictor(context_latent)
        true_target = self.encode_target(target_ids)
        return predicted_target, true_target
```
And the training objective is just a latent-space regression loss:
```python
loss = F.mse_loss(predicted_target, true_target)
```
That is the key educational point. We are not training the model to emit the word directly. We are training it to predict the representation of the hidden word.
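Putting the pieces together, a minimal training loop over the toy pairs looks roughly like this. Batching, padding, and evaluation are omitted, and the `vocab` and `pairs` variables are the ones from the data-preparation sketch earlier in the post.

```python
import torch
import torch.nn.functional as F

model = ToyJEPA(vocab_size=len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    total = 0.0
    for context_ids, target_id in pairs:
        context = torch.tensor([context_ids])    # shape (1, context_len)
        target = torch.tensor([target_id])       # shape (1,)

        predicted_target, true_target = model(context, target)
        loss = F.mse_loss(predicted_target, true_target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
```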
What This Toy Model Teaches Correctly, and What It Leaves Out
The notebook captures the main intuition correctly:
- prediction happens in latent space
- context and target are represented separately
- the model learns compatibility, not exact reconstruction
But it also leaves out many things used in serious JEPA research:
- large-scale training
- masking strategies over image or video patches
- momentum or EMA target encoders
- uncertainty variables for multiple futures
- action-conditioned prediction
- hierarchical planning
So the notebook should be seen as a teaching scaffold, not as a research-grade JEPA implementation.
Conclusion
JEPA is one of the clearest signs that AI research is pushing beyond pure generation and back toward a deeper question:
What kind of internal representation lets a system understand the world well enough to predict meaningful change?
That is why JEPA matters.
It asks models to predict what matters, not just what appears. It shifts the focus from reproducing surfaces to building structure. And if future AI systems are going to plan, reason, and act with more grounded intelligence, ideas like JEPA will likely sit close to the center of that story.
The missing word in “The cat climbed up the __” is not interesting because of the word itself. It is interesting because of the world knowledge behind it. JEPA is an attempt to teach machines to operate closer to that level.