Nested Learning: The Illusion of Deep Learning Architectures

We have grown comfortable with the Transformer paradigm. We gather massive datasets, train a model for months, and then, crucially, we freeze it.

Once deployed, an LLM such as GPT-5 (or any other, for that matter) is a static artifact. It can “learn” temporarily within its context window (In-Context Learning), but the moment that window closes, the insight is lost. It experiences the present, but it cannot encode it into its long-term memory.

In neuroscience, this inability to transfer new information from short-term to long-term memory is called Anterograde Amnesia.

A groundbreaking new paper from Google Research, presented at NeurIPS 2025 (“Nested Learning: The Illusion of Deep Learning Architectures”, Behrouz et al.), argues that all modern deep learning architectures suffer from this condition.

More importantly, the authors propose a solution. They argue that what we consider “architecture” is actually an illusion, and that by rethinking networks as Nested Optimization problems, we can build models that truly learn in real time.

Here is my analysis of why this paper might just signal the end of the “frozen weight” era.

1. The Diagnosis: Anterograde Amnesia

To understand the solution, we must first understand the critique. The authors offer a scathing but accurate diagnosis of current Foundation Models.

The “Frozen” Problem

In a standard neural network, we have two distinct phases:

Training: We update the weights (long-term memory) using an optimizer like Adam.
Inference: We freeze the weights. The model processes input using only the static weights and the temporary activation buffer (short-term memory).
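
To make the two phases concrete, here is a minimal sketch of my own (not code from the paper). It assumes a one-weight linear model, plain SGD instead of Adam, and a made-up shift from y = 2x at training time to y = 3x at deployment. All names are hypothetical.

```python
import random

def predict(w, x):
    """Linear model y = w * x with a single scalar weight."""
    return w * x

def train(data, lr=0.05, epochs=50):
    """Training phase: the weight is updated by plain SGD on squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (predict(w, x) - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

random.seed(0)

# "Childhood memories": pre-training data that follows y = 2x.
train_data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(20))]
w_frozen = train(train_data)

# Deployment: the world now follows y = 3x (an assumed shift for illustration),
# but inference only reads the weight and never writes to it.
stream = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(20))]
errors = [(predict(w_frozen, x) - y) ** 2 for x, y in stream]
mean_error = sum(errors) / len(errors)

print(f"frozen weight: {w_frozen:.4f}, mean squared error on stream: {mean_error:.4f}")
```

However long the stream runs, the error never shrinks, because nothing in the inference loop is allowed to change `w_frozen`.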

The paper argues that this creates a functional disconnect. The model has “childhood memories” (pre-training data) and “working memory” (context window), but no mechanism to bridge the two. It is stuck in a perpetual state of now, unable to grow.

The Biological Discrepancy

Biological brains do not work this way. You do not freeze your synapses when you walk out of school. Your brain uses a multi-scale learning process where short-term synaptic changes constantly consolidate into long-term structural changes. Nested Learning attempts to mimic this continuum.

2. The Theory: The “Illusion” of Architecture

This is the most mathematically dense and fascinating part of the paper. The authors ask a simple question: What is the difference between an architecture layer and an optimizer step?

Their answer: Mathematically, nothing.

We tend to view a Transformer as a stack of specific components (Attention heads, MLPs, LayerNorms). Behrouz et al. demonstrate that these components can be reformulated as simple linear layers undergoing specific high-frequency optimization updates.

“What we call ‘In-Context Learning’ in LLMs is actually an emergent property of these inner-loop optimization processes.”
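
A small sketch helps me see what “a linear layer undergoing optimization updates” can mean. The example below is my own illustration, not the paper's construction: it assumes a delta-rule associative memory, where each key/value pair in the context triggers one gradient step on the loss 0.5 * ||M k - v||^2. The function names and the toy key/value pairs are hypothetical.

```python
def matvec(M, k):
    """Multiply matrix M (a list of rows) by vector k."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]

def delta_rule_write(M, k, v, lr=1.0):
    """One gradient step on L(M) = 0.5 * ||M k - v||^2, applied in place.

    The gradient with respect to M is the outer product (M k - v) k^T.
    """
    err = [p - t for p, t in zip(matvec(M, k), v)]
    for i, e in enumerate(err):
        for j, k_j in enumerate(k):
            M[i][j] -= lr * e * k_j

# The "layer" is a 2x2 linear map that starts out empty.
M = [[0.0, 0.0], [0.0, 0.0]]

# Two key/value pairs seen in context. The keys are orthonormal, so a single
# step with lr=1.0 stores each pair exactly without disturbing the other.
context = [([1.0, 0.0], [5.0, -1.0]), ([0.0, 1.0], [2.0, 7.0])]
for key, value in context:
    delta_rule_write(M, key, value)

# Reading is an ordinary forward pass through the updated linear layer.
recalled = matvec(M, [1.0, 0.0])
print(recalled)
```

Seen this way, the forward pass is just a matrix multiply, and the “learning from context” lives entirely in the inner update loop.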

The Unification

The paper shows that popular optimizers (like SGD with Momentum or Adam) can themselves be formulated as associative memory modules: they compress the history of gradients they have seen into their internal state.

If you look at a network spatially, it looks like layers.
If you look at it temporally, it looks like a hierarchy of optimizers.

This means a deep neural network is essentially a hierarchy of memory modules learning at different time scales. By realizing this, we can stop manually designing architectures and start designing “Nested Optimization” routines.
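
For SGD with Momentum, the memory reading can be worked out in a few lines. The notation here is my own assumption and may not match the paper's exact formulation: $g_t$ is the gradient at step $t$, $\beta$ the momentum coefficient, $\eta$ the learning rate, and $m_0 = 0$.

$$
m_{t+1} = \beta\, m_t + g_t, \qquad W_{t+1} = W_t - \eta\, m_{t+1}
$$

Unrolling the recursion shows that the momentum buffer is an exponentially weighted summary of every past gradient:

$$
m_{t+1} = \sum_{i=0}^{t} \beta^{\,t-i}\, g_i
$$

The same update is also one gradient descent step, with unit step size, on a small inner objective that asks $m$ to align with the current gradient while staying bounded:

$$
\ell_t(m) = -\langle m, g_t \rangle + \frac{1-\beta}{2}\,\lVert m \rVert^2, \qquad m_{t+1} = m_t - \nabla_m \ell_t(m_t) = \beta\, m_t + g_t
$$

So the optimizer state is itself a parameter being trained by its own inner loop, one level below the weights $W$.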

3. The Solution: Self-Modifying Titans & “Hope”

If layers are just optimizers, why are we using static layers?

The authors introduce a new class of models called Self-Modifying Titans, and a specific architecture implementation named Hope.

How “Hope” Works

Unlike a Transformer that passes data through frozen weights, Hope utilizes Deep Optimizers.

Standard Model: $y = f(x; W)$ where $W$ is fixed.
Nested Learning: $y = f(x; W_t)$ where $W_t$ is updated by the model itself based on the input stream $x$.

The layers of the Hope architecture are themselves optimization algorithms. They adapt their parameters in real-time. This creates a Continuum Memory, effectively bridging the gap between the volatile context window and the static training weights.
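
The difference between $W$ and $W_t$ can be sketched in a few lines. To be clear, this is not the Hope architecture: it is a toy of my own that assumes a scalar linear model, a supervised target for every input, and two hypothetical memory levels that update at different frequencies (every input, and every fourth input).

```python
class Level:
    """One memory level: a scalar weight with its own update period."""

    def __init__(self, period, lr):
        self.w = 0.0
        self.period = period   # apply an update once every `period` inputs
        self.lr = lr
        self.grad_sum = 0.0    # gradients accumulated since the last update
        self.seen = 0

    def observe(self, grad):
        """Accumulate a gradient; apply the averaged step when the period ends."""
        self.grad_sum += grad
        self.seen += 1
        if self.seen == self.period:
            self.w -= self.lr * self.grad_sum / self.period
            self.grad_sum = 0.0
            self.seen = 0


class NestedModel:
    """y = (w_base + sum of level weights) * x, with levels updated online."""

    def __init__(self, w_base, levels):
        self.w_base = w_base   # stands in for the frozen pre-trained weights
        self.levels = levels

    def effective_weight(self):
        return self.w_base + sum(level.w for level in self.levels)

    def step(self, x, target):
        """Predict first, then let every level learn from the error."""
        y = self.effective_weight() * x
        grad = (y - target) * x   # d/dw of 0.5 * (w*x - target)^2
        for level in self.levels:
            level.observe(grad)
        return y


# Pre-trained on y = 2x; the deployment stream follows y = 3x (assumed shift).
inputs = [1.0, -0.5, 0.8, -1.0] * 5
stream = [(x, 3.0 * x) for x in inputs]

frozen_w = 2.0
nested = NestedModel(w_base=2.0, levels=[Level(period=1, lr=0.5),
                                          Level(period=4, lr=0.1)])

frozen_err = sum((frozen_w * x - t) ** 2 for x, t in stream)
nested_err = sum((nested.step(x, t) - t) ** 2 for x, t in stream)

print(f"frozen error: {frozen_err:.3f}, nested error: {nested_err:.3f}")
print(f"effective weight after the stream: {nested.effective_weight():.4f}")
```

The fast level reacts to each input, while the slow level only moves after it has averaged several gradients. That is a crude stand-in for the idea of a spectrum of memories between the context window and the pre-trained weights.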

4. Why This Matters: The Senior Researcher’s Take

It is easy to get lost in the math of a NeurIPS paper, but the practical implications here are massive.

A. True Test-Time Training (TTT)

We are seeing a surge of interest in Test-Time Compute (like OpenAI’s o1). Nested Learning formalizes this. It suggests a future where inference is training. The model gets smarter the longer you talk to it, not just because of the context window, but because the weights are locally shifting to accommodate your specific domain.

B. Unifying NAS and Optimization

For years, Neural Architecture Search (NAS) and Optimization Theory have been separate fields. This paper unifies them. By treating layers as optimizers, the search space for “better models” becomes much more rigorous and less about trial-and-error alchemy.

C. Efficiency

A model that can update itself on the fly likely requires fewer parameters to achieve the same performance as a static model. Instead of storing every possible fact in a trillion parameters, the model learns the algorithm to retrieve and adapt to facts.

Conclusion

“Nested Learning” is a white-box approach to a field that has become increasingly black-box.

We have spent the last five years scaling up Transformers, hoping that magic emerges from size. This paper suggests that the next leap in performance won’t come from making the models bigger. It will come from unfreezing them.

By curing the “Anterograde Amnesia” of our models, we move one step closer to AGI that learns, adapts, and remembers, just like we do.

Reference: https://abehrouz.github.io/files/NL.pdf

Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures. NeurIPS 2025.
