Why every language model discards its best reasoning the moment you close the tab, and what an architecture that doesn't would look like.
Close a session with a 40,000-token chain of reasoning that just solved a hard problem. The base weights didn't move. Next time you ask the same class of question, the model recomputes from scratch. The work is gone.
That is not a bug. It is the design. The transformer architecture treats every inference session as stateless. Useful for isolation. Catastrophic for learning. The compute is not cumulative, which means the model can never be said to have mastered anything in a permanent sense, only to have reconstructed it one more time.
What I want to describe here is an architecture where compute does compound, where offline time is not idle time but consolidation time, and where the model wakes up each morning having genuinely learned from the night before.
For a transformer processing N tokens, attention is $O(N^2)$ in both time and memory. The key-value cache that holds the intermediate representations grows linearly with context length. For a 128k-token window, the model reconstructs its entire working memory on every forward pass. When the session ends, the cache is wiped.
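To make the scale concrete, here is a back-of-envelope KV-cache calculation. The hyperparameters are a hypothetical 7B-class configuration, not any specific model:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """Size of the key-value cache: two tensors (K and V) per layer,
    each of shape [n_tokens, n_kv_heads, head_dim]."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_param

# Assumed 7B-class configuration: 32 layers, 8 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(n_tokens=128_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # roughly 15.6 GiB for one 128k-token session
```

All of that working memory is rebuilt per session and discarded at the end of it.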
A model that has just derived a novel proof over a 30,000-token session learns nothing from the derivation. The weights are identical before and after. From the perspective of the base model, the session never happened.
The response to this has been more context, faster attention kernels, and better prompting. These are engineering improvements on top of a fundamentally stateless design. The question nobody has seriously pursued as an architectural primitive is: what if the model had a sleep cycle?
Not metaphorically. Literally an offline state where it runs self-directed computation on its own internals, consolidates the verified reasoning traces from the day into its weights, and prunes the useless hallucinations. The way biological systems have done it for 500 million years.
The Sleep-Wake Model (SWM) splits the operational cycle into two phases.
The Awake Phase is standard inference with one addition: for every interaction where an external verifier confirms correctness (a compiler, a Lean proof assistant, a physics simulator), the model logs the internal state trajectory. These are the verified targets, denoted $x^+$. The base model weights don't change during waking. We are only accumulating evidence.
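A minimal sketch of the awake-phase logging loop. The trace format and verifier interface here are assumptions for illustration, not a fixed spec:

```python
from dataclasses import dataclass, field

@dataclass
class VerifiedTrace:
    prompt: str
    trajectory: list   # the logged internal-state trajectory (an x+ target)
    logits: list       # output distribution at storage time, kept for DER later

@dataclass
class AwakeLogger:
    buffer: list = field(default_factory=list)

    def observe(self, prompt, trajectory, logits, verifier):
        # Only externally verified interactions become x+ targets;
        # base weights are untouched during the awake phase.
        if verifier(trajectory):
            self.buffer.append(VerifiedTrace(prompt, trajectory, logits))

# Toy usage with a stand-in verifier.
logger = AwakeLogger()
logger.observe("sort a list", trajectory=["t1", "t2"], logits=[0.1, 0.9],
               verifier=lambda traj: len(traj) > 0)
```

Storing the logits alongside each trace matters later: the consolidation phase anchors against them.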
The Sleep Phase runs when the cluster is idle. No user input. The model runs two sequential processes: REM exploration and deep consolidation.
The goal of REM is to generate reasoning trajectories that standard greedy decoding would never surface. Beam search finds local optima. What is needed is a sampler that explores the full distribution of plausible reasoning paths, weighted by correctness, not by likelihood under the current model.
The right tool here is a Generative Flow Network (GFlowNet) [1, 2]. GFlowNets don't maximize a reward. They learn to sample from a distribution proportional to a reward. Given a trajectory from state $s_0$ to terminal state $s_n$, the trajectory balance condition requires:

$$Z \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t) = R(s_n) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1})$$

where $P_F$ is the forward sampling policy, $P_B$ the backward policy, $Z$ the learned partition function, and $R(s_n)$ the terminal reward.
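In log space, the trajectory balance condition becomes a squared-error training loss. A minimal sketch with log-probabilities as plain floats (no resemblance to any real GFlowNet library intended):

```python
def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared violation of log Z + sum log P_F = log R(s_n) + sum log P_B.
    log_pf and log_pb are per-step forward/backward log-probabilities."""
    lhs = log_Z + sum(log_pf)
    rhs = log_reward + sum(log_pb)
    return (lhs - rhs) ** 2

# A perfectly balanced trajectory incurs zero loss.
loss = trajectory_balance_loss(log_Z=1.0, log_pf=[-0.5, -0.5],
                               log_pb=[-0.3, -0.7], log_reward=1.0)
```

Driving this loss to zero on all trajectories is what forces the sampler toward the reward-proportional distribution rather than the reward-maximizing one.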
The key property: this sampler visits every region of state space proportional to its reward, rather than zeroing out low-reward regions the way policy gradient RL does. Rare, correct, unusual ideas get found. They don't get optimized away.
During REM, the model injects Langevin dynamics noise into its own latent states to escape trained attractors and generate dream trajectories $x^-$. A binary verifiable reward $R_{\text{ver}}$ scores each: 1 if the external oracle confirms it, 0 otherwise. The contrastive objective then shapes the energy landscape over verified trajectories $x^+$ and failed trajectories $x^-$:

$$\mathcal{L}_{\text{contrast}} = \mathbb{E}_{x^+}\big[E_W(x^+)\big] - \mathbb{E}_{x^-}\big[E_W(x^-)\big]$$

pushing energy down on trajectories the oracle confirmed and up on the failures.
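A sketch of one REM perturbation step: a Langevin update on a latent state (gradient drift plus Gaussian noise) followed by binary oracle scoring. Latent states are plain lists here and the energy gradient is supplied by the caller; both are simplifying assumptions:

```python
import random

def langevin_step(latent, grad_energy, step=0.01):
    """One Langevin update: drift down the energy gradient plus
    Gaussian noise scaled by sqrt(2 * step) to escape trained attractors."""
    noise_scale = (2 * step) ** 0.5
    return [z - step * g + noise_scale * random.gauss(0, 1)
            for z, g in zip(latent, grad_energy)]

def score_dream(trajectory, oracle):
    """Binary verifiable reward R_ver: 1 iff the external oracle accepts."""
    return 1 if oracle(trajectory) else 0

random.seed(0)
perturbed = langevin_step([0.5, -1.2, 3.0], grad_energy=[0.1, 0.0, -0.2])
reward = score_dream(perturbed, oracle=lambda t: sum(t) > 0)
```

Only trajectories with reward 1 survive into the dream buffer; everything else is discarded before consolidation.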
REM produces a small dream buffer $\mathcal{D}_{\text{sleep}}$: verified, externally-confirmed trajectories the model generated without any human prompting. The question is how to integrate these into the base weights without catastrophic forgetting [4, 5].
Standard fine-tuning on new data degrades performance on old tasks in proportion to how far the new data sits from the training distribution. This is the entire problem. The solution uses Dark Experience Replay (DER) [6], which stores not just input-output pairs but the model's internal logit distributions $z_{\text{past}}$ at the time of original experience. The consolidation loss is:

$$\mathcal{L}_{\text{cons}} = \mathcal{L}_{\text{task}}(x, y) + \alpha \,\big\| h_W(x) - z_{\text{past}} \big\|^2$$

where $h_W(x)$ are the model's current logits on input $x$.
The second term is the anchor. For any input in the dream buffer, the model's current output distribution must stay close to what it predicted when the memory was stored. New knowledge updates the weights; the anchor prevents drift on everything else.
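A sketch of the DER-style anchor in plain Python, with the task loss as a precomputed stand-in and logits as lists:

```python
def der_consolidation_loss(task_loss, current_logits, past_logits, alpha=0.5):
    """task_loss integrates the new dream; the alpha-weighted mean squared
    error between current logits and the logits stored at experience time
    is the anchor that prevents drift on everything else."""
    anchor = sum((c - p) ** 2 for c, p in zip(current_logits, past_logits))
    anchor /= len(past_logits)
    return task_loss + alpha * anchor

# With identical logits the anchor vanishes and only the task loss remains.
loss = der_consolidation_loss(task_loss=0.8,
                              current_logits=[2.0, -1.0, 0.5],
                              past_logits=[2.0, -1.0, 0.5])
```

The hyperparameter `alpha` sets the trade-off: higher values protect old behavior more aggressively at the cost of slower integration of the dream buffer.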
The weight update during consolidation is a thermodynamic gradient flow [7] over the energy landscape $E_W(x)$:

$$dW_t = -\nabla_W \Big( \mathbb{E}_{x^+}\big[E_W(x^+)\big] - \mathbb{E}_{x^-}\big[E_W(x^-)\big] \Big)\, dt + \sqrt{2T}\, dB_t$$

where $T$ is a temperature and $dB_t$ is Brownian noise.
This is contrastive divergence [7] applied to the full parameter manifold of the language model. The attractor basin framing is not a metaphor here. Ramsauer et al. [8] showed that the transformer self-attention update coincides with the update rule of a modern continuous Hopfield network. The weight geometry already supports associative attractors. The SWM is exploiting that structure deliberately.
The standard objection: the REM phase is generating hallucinations, and deep sleep is fine-tuning on them. That is only true if the verifier is weak.
The verifiable reward Rver is an external oracle. A Python interpreter does not grade on a curve. A formal proof assistant does not accept near-proofs. The dream buffer contains only trajectories that are structurally correct, even if stylistically bizarre or unlike anything in the training distribution. The model is not reinforcing existing patterns. It is integrating confirmed novel structure.
This connects directly to what makes RLVR work [9, 10]. Training against a hard external verifier rather than a human preference model produces qualitatively different capabilities because the verifier has no prior distribution of its own. The SWM extends this offline: the model generates its own problems, verifies the solutions, and consolidates the survivors, with no human-labeled data entering the loop after initialization.
The simplest falsifiable test: take a code-generation model and establish a baseline pass@1 on a competitive programming benchmark. Run the REM-sleep loop overnight with a real compiler as the verifier. Evaluate on a held-out problem set the next morning, zero-shot, with no new human examples.
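The protocol above can be sketched as a harness. The model, verifier, and sleep cycle are all placeholder callables here; a real run swaps in an actual model, compiler, and the REM-consolidation loop:

```python
def pass_at_1(model, problems, verifier):
    """Fraction of problems solved on the first sampled attempt."""
    solved = sum(1 for p in problems if verifier(model(p)))
    return solved / len(problems)

def overnight_experiment(model, train_problems, heldout_problems,
                         verifier, sleep_cycle):
    baseline = pass_at_1(model, heldout_problems, verifier)
    # REM + consolidation against a real compiler; no human examples enter.
    slept_model = sleep_cycle(model, train_problems, verifier)
    morning = pass_at_1(slept_model, heldout_problems, verifier)
    return baseline, morning

# Toy stand-ins: the "model" echoes its input, the "verifier" checks
# non-emptiness, and the "sleep cycle" is the identity.
baseline, morning = overnight_experiment(
    model=lambda p: p, train_problems=["a"], heldout_problems=["b", ""],
    verifier=lambda out: bool(out), sleep_cycle=lambda m, ps, v: m)
```

The comparison of `morning` against `baseline` on the held-out set, broken down by problem type, is the falsification criterion.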
If the attractor hypothesis is correct, the model shows measurable improvement on problem types it explored during sleep, with minimal regression on unrelated tasks. If it doesn't, the hypothesis is falsified and we learn something specific about whether EBM-style attractor formation via contrastive divergence is achievable at this scale.
The reason this hasn't been done is not cost. It is that no one has assembled the GFlowNet sampler, external verifier loop, and DER consolidation into a unified training pipeline on a production-scale model. That is the contribution this architecture proposes.
The architecture works cleanly in domains with formal verification: competitive programming, mathematical theorem proving, molecular dynamics, circuit synthesis. For open-ended language tasks, the verifier is harder. Ensemble disagreement [11] as a proxy for verification is weaker than a formal oracle but not vacuous. That gap is an open research question, not a reason to abandon the idea.
The REM compute cost is real. The practical implementation likely runs the exploration phase on a smaller distilled model [12] and transfers the dream buffer to the full model for consolidation. This is an engineering detail with precedent.