Why every language model discards its best reasoning the moment you close the tab, and what an architecture that doesn't would look like.
Close a session with a 40,000-token chain of reasoning that just solved a hard problem. The base weights didn't move. Next time you ask the same class of question, the model recomputes from scratch. The work is gone.
That is not a bug. It is the design. The transformer architecture treats every inference session as stateless. Useful for isolation. Catastrophic for learning. The compute is not cumulative, which means the model can never be said to have mastered anything in a permanent sense, only to have reconstructed it one more time.
What I want to describe here is an architecture where compute does compound, where offline time is not idle time but consolidation time, and where the model wakes up each morning having genuinely learned from the night before.
For a transformer processing N tokens, attention is $O(N^2)$ in both time and memory. The key-value cache that holds the intermediate representations grows linearly with context length. For a 128k-token window, the model reconstructs its entire working memory on every forward pass. When the session ends, the cache is wiped.
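To make the scale concrete, here is a back-of-envelope KV-cache calculation. The hyperparameters are a hypothetical 7B-class configuration, not any specific model:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """Size of the key-value cache: two tensors (K and V) per layer,
    each of shape [n_tokens, n_kv_heads, head_dim]."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_param

# Assumed 7B-class configuration: 32 layers, 8 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(n_tokens=128_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # roughly 15.6 GiB for one 128k-token session
```

All of that working memory is rebuilt per session and discarded at the end of it.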
A model that has just derived a novel proof over a 30,000-token session learns nothing from the derivation. The weights are identical before and after. From the perspective of the base model, the session never happened.
The response to this has been more context, faster attention kernels, and better prompting. These are engineering improvements on top of a fundamentally stateless design. The question nobody has seriously pursued as an architectural primitive is: what if the model had a sleep cycle?
Not metaphorically. Literally an offline state where it runs self-directed computation on its own internals, consolidates the verified reasoning traces from the day into its weights, and prunes the useless hallucinations. The way biological systems have done it for 500 million years.
The Sleep-Wake Model (SWM) splits the operational cycle into two phases.
The Awake Phase is standard inference with one addition: for every interaction where an external verifier confirms correctness (a compiler, a Lean proof assistant, a physics simulator), the model logs the internal state trajectory. These are the verified targets, denoted $x^+$. The base model weights don't change during waking. We are only accumulating evidence.
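A minimal sketch of the awake-phase logging loop. The trace format and verifier interface here are assumptions for illustration, not a fixed spec:

```python
from dataclasses import dataclass, field

@dataclass
class VerifiedTrace:
    prompt: str
    trajectory: list   # the logged internal-state trajectory (an x+ target)
    logits: list       # output distribution at storage time, kept for DER later

@dataclass
class AwakeLogger:
    buffer: list = field(default_factory=list)

    def observe(self, prompt, trajectory, logits, verifier):
        # Only externally verified interactions become x+ targets;
        # base weights are untouched during the awake phase.
        if verifier(trajectory):
            self.buffer.append(VerifiedTrace(prompt, trajectory, logits))

# Toy usage with a stand-in verifier.
logger = AwakeLogger()
logger.observe("sort a list", trajectory=["t1", "t2"], logits=[0.1, 0.9],
               verifier=lambda traj: len(traj) > 0)
```

Storing the logits alongside each trace matters later: the consolidation phase anchors against them.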
The Sleep Phase runs when the cluster is idle. No user input. The model runs two sequential processes: REM exploration and deep consolidation.
The goal of REM is to generate reasoning trajectories that standard greedy decoding would never surface. Beam search finds local optima. What is needed is a sampler that explores the full distribution of plausible reasoning paths, weighted by correctness, not by likelihood under the current model.
The right tool here is a Generative Flow Network (GFlowNet) [1, 2]. GFlowNets don't maximize a reward. They learn to sample from a distribution proportional to a reward. Given a trajectory from state $s_0$ to terminal state $s_n$, the trajectory balance condition requires:

$$Z \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t) = R(s_n) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1})$$

where $P_F$ is the forward sampling policy, $P_B$ the backward policy, $Z$ the learned partition function, and $R(s_n)$ the terminal reward.
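In log space, the trajectory balance condition becomes a squared-error training loss. A minimal sketch with log-probabilities as plain floats (no resemblance to any real GFlowNet library intended):

```python
def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared violation of log Z + sum log P_F = log R(s_n) + sum log P_B.
    log_pf and log_pb are per-step forward/backward log-probabilities."""
    lhs = log_Z + sum(log_pf)
    rhs = log_reward + sum(log_pb)
    return (lhs - rhs) ** 2

# A perfectly balanced trajectory incurs zero loss.
loss = trajectory_balance_loss(log_Z=1.0, log_pf=[-0.5, -0.5],
                               log_pb=[-0.3, -0.7], log_reward=1.0)
```

Driving this loss to zero on all trajectories is what forces the sampler toward the reward-proportional distribution rather than the reward-maximizing one.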
The key property: this sampler visits every region of state space proportional to its reward, rather than zeroing out low-reward regions the way policy gradient RL does. Rare, correct, unusual ideas get found. They don't get optimized away.
During REM, the model injects Langevin dynamics noise into its own latent states to escape trained attractors and generate dream trajectories $x^-$. A binary verifiable reward $R_{\text{ver}}$ scores each: 1 if the external oracle confirms it, 0 otherwise. The contrastive objective then shapes the energy landscape over verified trajectories $x^+$ and failed trajectories $x^-$:

$$\mathcal{L}_{\text{contrast}} = \mathbb{E}_{x^+}\big[E_W(x^+)\big] - \mathbb{E}_{x^-}\big[E_W(x^-)\big]$$

pushing energy down on trajectories the oracle confirmed and up on the failures.
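A sketch of one REM perturbation step: a Langevin update on a latent state (gradient drift plus Gaussian noise) followed by binary oracle scoring. Latent states are plain lists here and the energy gradient is supplied by the caller; both are simplifying assumptions:

```python
import random

def langevin_step(latent, grad_energy, step=0.01):
    """One Langevin update: drift down the energy gradient plus
    Gaussian noise scaled by sqrt(2 * step) to escape trained attractors."""
    noise_scale = (2 * step) ** 0.5
    return [z - step * g + noise_scale * random.gauss(0, 1)
            for z, g in zip(latent, grad_energy)]

def score_dream(trajectory, oracle):
    """Binary verifiable reward R_ver: 1 iff the external oracle accepts."""
    return 1 if oracle(trajectory) else 0

random.seed(0)
perturbed = langevin_step([0.5, -1.2, 3.0], grad_energy=[0.1, 0.0, -0.2])
reward = score_dream(perturbed, oracle=lambda t: sum(t) > 0)
```

Only trajectories with reward 1 survive into the dream buffer; everything else is discarded before consolidation.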
REM produces a small dream buffer $\mathcal{D}_{\text{sleep}}$: verified, externally-confirmed trajectories the model generated without any human prompting. The question is how to integrate these into the base weights without catastrophic forgetting [4, 5].
Standard fine-tuning on new data degrades performance on old tasks in proportion to how far the new data sits from the training distribution. This is the entire problem. The solution uses Dark Experience Replay (DER) [6], which stores not just input-output pairs but the model's internal logit distributions $z_{\text{past}}$ at the time of original experience. The consolidation loss is:

$$\mathcal{L}_{\text{cons}} = \mathcal{L}_{\text{task}}(x, y) + \alpha \,\big\| h_W(x) - z_{\text{past}} \big\|^2$$

where $h_W(x)$ are the model's current logits on input $x$.
The second term is the anchor. For any input in the dream buffer, the model's current output distribution must stay close to what it predicted when the memory was stored. New knowledge updates the weights; the anchor prevents drift on everything else.
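A sketch of the DER-style anchor in plain Python, with the task loss as a precomputed stand-in and logits as lists:

```python
def der_consolidation_loss(task_loss, current_logits, past_logits, alpha=0.5):
    """task_loss integrates the new dream; the alpha-weighted mean squared
    error between current logits and the logits stored at experience time
    is the anchor that prevents drift on everything else."""
    anchor = sum((c - p) ** 2 for c, p in zip(current_logits, past_logits))
    anchor /= len(past_logits)
    return task_loss + alpha * anchor

# With identical logits the anchor vanishes and only the task loss remains.
loss = der_consolidation_loss(task_loss=0.8,
                              current_logits=[2.0, -1.0, 0.5],
                              past_logits=[2.0, -1.0, 0.5])
```

The hyperparameter `alpha` sets the trade-off: higher values protect old behavior more aggressively at the cost of slower integration of the dream buffer.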
The weight update during consolidation is a thermodynamic gradient flow [7] over the energy landscape $E_W(x)$:

$$dW_t = -\nabla_W \Big( \mathbb{E}_{x^+}\big[E_W(x^+)\big] - \mathbb{E}_{x^-}\big[E_W(x^-)\big] \Big)\, dt + \sqrt{2T}\, dB_t$$

where $T$ is a temperature and $dB_t$ is Brownian noise.
This is contrastive divergence [7] applied to the full parameter manifold of the language model. The attractor basin framing is not a metaphor here. Ramsauer et al. [8] showed that the transformer self-attention update coincides with the update rule of a modern continuous Hopfield network. The weight geometry already supports associative attractors. The SWM is exploiting that structure deliberately.
The standard objection: the REM phase is generating hallucinations, and deep sleep is fine-tuning on them. That is only true if the verifier is weak.
The verifiable reward Rver is an external oracle. A Python interpreter does not grade on a curve. A formal proof assistant does not accept near-proofs. The dream buffer contains only trajectories that are structurally correct, even if stylistically bizarre or unlike anything in the training distribution. The model is not reinforcing existing patterns. It is integrating confirmed novel structure.
This connects directly to what makes RLVR work [9, 10]. Training against a hard external verifier rather than a human preference model produces qualitatively different capabilities because the verifier has no prior distribution of its own. The SWM extends this offline: the model generates its own problems, verifies the solutions, and consolidates the survivors, with no human-labeled data entering the loop after initialization.
The simplest falsifiable test: take a code-generation model and establish a baseline pass@1 on a competitive programming benchmark. Run the REM-sleep loop overnight with a real compiler as the verifier. Evaluate on a held-out problem set the next morning, zero-shot, with no new human examples.
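The protocol above can be sketched as a harness. The model, verifier, and sleep cycle are all placeholder callables here; a real run swaps in an actual model, compiler, and the REM-consolidation loop:

```python
def pass_at_1(model, problems, verifier):
    """Fraction of problems solved on the first sampled attempt."""
    solved = sum(1 for p in problems if verifier(model(p)))
    return solved / len(problems)

def overnight_experiment(model, train_problems, heldout_problems,
                         verifier, sleep_cycle):
    baseline = pass_at_1(model, heldout_problems, verifier)
    # REM + consolidation against a real compiler; no human examples enter.
    slept_model = sleep_cycle(model, train_problems, verifier)
    morning = pass_at_1(slept_model, heldout_problems, verifier)
    return baseline, morning

# Toy stand-ins: the "model" echoes its input, the "verifier" checks
# non-emptiness, and the "sleep cycle" is the identity.
baseline, morning = overnight_experiment(
    model=lambda p: p, train_problems=["a"], heldout_problems=["b", ""],
    verifier=lambda out: bool(out), sleep_cycle=lambda m, ps, v: m)
```

The comparison of `morning` against `baseline` on the held-out set, broken down by problem type, is the falsification criterion.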
If the attractor hypothesis is correct, the model shows measurable improvement on problem types it explored during sleep, with minimal regression on unrelated tasks. If it doesn't, the hypothesis is falsified and we learn something specific about whether EBM-style attractor formation via contrastive divergence is achievable at this scale.
The reason this hasn't been done is not cost. It is that no one has assembled the GFlowNet sampler, external verifier loop, and DER consolidation into a unified training pipeline on a production-scale model. That is the contribution this architecture proposes.
The architecture works cleanly in domains with formal verification: competitive programming, mathematical theorem proving, molecular dynamics, circuit synthesis. For open-ended language tasks, the verifier is harder. Ensemble disagreement [11] as a proxy for verification is weaker than a formal oracle but not vacuous. That gap is an open research question, not a reason to abandon the idea.
The REM compute cost is real. The practical implementation likely runs the exploration phase on a smaller distilled model [12] and transfers the dream buffer to the full model for consolidation. This is an engineering detail with precedent.