
Rational Amnesia
Stop decaying your gradients. Start amputating your graphs.
January 2026
Kabir Murjani
Here is a failure mode I keep seeing, and I think it is more fundamental than people treat it as.
When the rules of a game change abruptly, most reinforcement learning systems fall apart. Not because they are stupid, and not because they are too small. They fall apart because they have excellent memory and terrible amnesia. They remember the old world beautifully, and they are catastrophically bad at forgetting it on demand.
This shows up everywhere the action space is large and combinatorial. You are routing a fleet across a supply chain and a trade route gets blocked overnight. You are bidding in a multi-agent auction and a hostile bidder changes strategy. You are allocating a portfolio and a flash crash rewrites the correlation structure of the entire market in an afternoon. The world has just undergone a structural shift, and your agent needs to respond to a reality that did not exist five minutes ago.
The standard playbook for this is soft adaptation. Decay the learning rate. Lean on an exponential moving average. Crank up an entropy penalty. Let the agent gently unlearn the old regime over many updates. And it sort of works, eventually, if the shift is mild and you have time to burn.
But notice what is actually happening underneath. The old policy shrinks, but it does not disappear. It is still sitting there in the network weights, decaying asymptotically toward irrelevance, never quite getting there. I call this combinatorial drag: the agent keeps spending compute differentiating through constraint-violating noise, inheriting the structural errors of a reality that is already gone. It is dragging a corpse behind it and calling it memory.
The fix is not a bigger model. A one-billion-parameter policy network drags the same corpse, just more expensively. The fix is a different attitude toward forgetting. We need to move from soft learning to hard topological pruning. I call that rational amnesia: the ability to decide that a region of the search space is now structurally invalid and to make it cease to exist, not merely become unlikely.
The Flat Action Space Had to Die
Start with where the drag comes from. Standard deep RL for a neural combinatorial optimization problem treats the action space as a flat probability distribution. At every node, the policy network spits out logits over every possible edge, softmaxes them, and maybe applies a soft penalty to discourage the moves that violate constraints.
Think about what that softmax is doing. It is assigning nonzero probability to mathematically invalid branches. The network is dutifully evaluating moves that cannot, under the current rules, ever be part of a feasible solution, and it is letting that noise leak into the exact solver downstream. Your solver gets poisoned by the dead ends. And again, scaling does not save you here. A bigger network just computes the same invalid logits with more parameters.
So the flat action space has to go. Instead of letting the solver stare at a flat distribution over all actions and hope the penalties steer it right, we enforce hard topological isolation on the feasible regions. The space gets physically fractured along the lines of what is actually legal right now.
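To make the difference concrete, here is a minimal sketch in plain Python contrasting a soft penalty with a hard mask. The function names, the logits, and the penalty value are all hypothetical, chosen only for illustration. The point is that a finite penalty always leaves some probability on illegal moves, while a mask of negative infinity leaves exactly none.

```python
import math

def softmax(logits):
    # Subtract the largest finite logit for numerical stability.
    # Assumes at least one entry is finite (at least one legal move).
    m = max(l for l in logits if l != -math.inf)
    exps = [math.exp(l - m) for l in logits]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def soft_penalized(logits, legal, penalty=5.0):
    # Soft adaptation: illegal moves are discouraged, not removed.
    return softmax([l if ok else l - penalty for l, ok in zip(logits, legal)])

def hard_masked(logits, legal):
    # Hard pruning: illegal moves get probability exactly zero.
    return softmax([l if ok else -math.inf for l, ok in zip(logits, legal)])

logits = [2.0, 1.0, 0.5, 1.5]        # hypothetical policy outputs
legal = [True, False, True, False]   # hypothetical feasibility mask

soft = soft_penalized(logits, legal)
hard = hard_masked(logits, legal)
print("soft:", [round(p, 4) for p in soft])
print("hard:", [round(p, 4) for p in hard])
```

Under the soft penalty the two illegal moves keep a small but nonzero share of the probability mass. Under the hard mask they are gone, and the remaining mass is renormalized over the legal moves only.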
Three Layers
If you want a solver that survives a macro shock or an adversarial opponent, I think you structure it in three layers, each doing a job the others should not.
Layer 1: isolated feasibility sets. Rather than partitioning the problem by something incidental like spatial clusters on a map, you partition it along strictly defined constraint manifolds. Set A is high-speed, high-emission routes. Set B is low-capacity, precision-packing routes. They do not overlap. Membership in a set is a hard fact, not a soft preference.
Layer 2: the latent meta-router. A small, heavily quantized autoregressive transformer, think sub-50M parameters, whose only job is to read the macroscopic state and emit a deterministic binary mask over the constraint sets. It runs with constrained decoding, so it physically cannot hallucinate a constraint set that violates current reality. If liquidity dries up, the router slams the gate shut on the high-leverage manifolds. It does not reason about them, debate them, or assign them a low probability. It closes the door. The router is a regime classifier with a hand on the circuit breaker.
Layer 3: the exact solver. Once the router unlocks a specific feasibility set, the heavy exact solver takes over. Because the action space has already been walled off from invalid constraints, there is zero combinatorial drag. The solver computes the exact non-dominated operating points strictly inside a pristine manifold. It never wastes a cycle on the dead regions because, as far as it can see, the dead regions are not there.
The division of labor matters. The cheap router decides what is allowed. The expensive solver decides what is optimal. You never pay solver-grade compute to rediscover feasibility, which is most of where the waste lives.
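A toy sketch of the three layers, under heavy assumptions: the feasibility sets, the state flags, and the rule-based `route_mask` are hypothetical stand-ins (a real router would be the small transformer described above, not two boolean rules), and the "exact solver" here is just a brute-force non-dominated filter over (cost, hours) pairs.

```python
# Layer 1: hypothetical feasibility sets, disjoint by construction.
# Each option is (name, cost, hours); lower is better on both axes.
FEASIBILITY_SETS = {
    "A_high_speed": [("air", 9.0, 4.0), ("express_truck", 6.0, 10.0),
                     ("charter", 10.0, 5.0)],
    "B_precision": [("rail", 3.0, 30.0), ("barge", 2.0, 60.0),
                    ("slow_truck", 4.0, 40.0)],
}

def route_mask(state):
    """Layer 2 stand-in: a deterministic binary mask over constraint sets."""
    return {
        "A_high_speed": not state["emissions_cap_active"],
        "B_precision": state["capacity_available"],
    }

def pareto_front(points):
    """Layer 3 stand-in: exact non-dominated points by brute force."""
    front = []
    for name, cost, hours in points:
        dominated = any(
            c <= cost and h <= hours and (c < cost or h < hours)
            for _, c, h in points
        )
        if not dominated:
            front.append((name, cost, hours))
    return front

def solve(state):
    mask = route_mask(state)
    # The solver only ever sees options from sets the router unlocked.
    visible = [
        option
        for key, is_open in mask.items() if is_open
        for option in FEASIBILITY_SETS[key]
    ]
    return pareto_front(visible)

capped = solve({"emissions_cap_active": True, "capacity_available": True})
calm = solve({"emissions_cap_active": False, "capacity_available": True})
print([name for name, _, _ in capped])
print([name for name, _, _ in calm])
```

When the emissions cap is active, the solver never evaluates a single high-speed route: they are not penalized, they are simply absent from its input.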
The Cost of Opening More Doors
The math gets genuinely fun when the router wants to satisfy several sub-objectives at once. Each additional manifold you open inflates the search space the solver has to grind through, and it inflates it fast. So you do not want to open doors for free.
The lever I like is a logarithmic relaxation penalty on the router's gating decision:
P_threshold(n) = T_base + log(n) · I_complexity

Here n is the number of orthogonal constraint manifolds the solver is being asked to merge. T_base is the baseline expected reward you demand before opening any set at all. And I_complexity is a volatility term, a measure of how hairy the current state is.
The behavior falls out nicely. Under tight, volatile conditions, near a hard deadline, or mid-selloff, I_complexity is high, the penalty spikes, and the router forces the solver into a single strict manifold. Do one thing, do it exactly, do not get cute. When the environment is calm, I_complexity flattens, the penalty relaxes, and the gates swing open for richer multi-objective synthesis. The architecture becomes conservative exactly when conservatism is cheap insurance, and ambitious exactly when ambition is affordable.
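A small sketch of how that threshold could drive a gating decision. Two things here are assumptions rather than anything stated above: the logarithm is taken as the natural log, and the decision rule (open the largest n whose expected reward clears the threshold) along with the `expected_reward_by_n` table are hypothetical.

```python
import math

def gate_threshold(n, t_base, i_complexity):
    """P_threshold(n) = T_base + log(n) * I_complexity (natural log assumed)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return t_base + math.log(n) * i_complexity

def max_open_sets(expected_reward_by_n, t_base, i_complexity):
    """Hypothetical rule: largest n whose expected reward clears the bar."""
    best = 0
    for n, reward in sorted(expected_reward_by_n.items()):
        if reward >= gate_threshold(n, t_base, i_complexity):
            best = n
    return best

# Hypothetical expected reward from merging n manifolds.
expected_reward_by_n = {1: 1.2, 2: 1.5, 3: 1.7}

calm = max_open_sets(expected_reward_by_n, t_base=1.0, i_complexity=0.3)
volatile = max_open_sets(expected_reward_by_n, t_base=1.0, i_complexity=2.0)
print("calm:", calm, "volatile:", volatile)
```

Because log(1) is zero, a single manifold only ever has to clear T_base. Every additional manifold has to pay for itself against a bar that rises with volatility, which is what pushes the router toward a single strict set when I_complexity is high.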
Where the Amnesia Earns Its Keep
Capital allocation under a macro shock. A shock hits and your historical covariance matrices instantly become toxic: they describe correlations from a market regime that no longer exists. An agent leaning on standard moving averages will smear that dead regime into the new one and bleed alpha the whole way through the transition, lagging reality by roughly the time constant of its averaging window. With a meta-router acting as a regime classifier, the system isolates the new volatility cluster directly. No legacy noise, and no lag beyond the time the router needs to flag the shift. The toxic priors are not down-weighted, they are gated out.
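A deliberately simplified illustration of the smearing effect, using a scalar variance proxy instead of a full covariance matrix. The regime labels are assumed to come from the router; here they are hard-coded, and the sample values and `alpha` are hypothetical.

```python
def ema_estimate(samples, alpha):
    # Soft adaptation: every old sample keeps a geometrically decaying weight.
    estimate = samples[0]
    for x in samples[1:]:
        estimate = (1 - alpha) * estimate + alpha * x
    return estimate

def gated_estimate(samples, regimes):
    # Hard gating: average only the samples from the current regime.
    current = regimes[-1]
    start = len(samples)
    while start > 0 and regimes[start - 1] == current:
        start -= 1
    window = samples[start:]
    return sum(window) / len(window)

# Hypothetical squared returns: variance jumps from 1.0 to 9.0 at the shock.
samples = [1.0] * 50 + [9.0] * 5
regimes = ["calm"] * 50 + ["shock"] * 5   # assumed router output

ema = ema_estimate(samples, alpha=0.1)
gated = gated_estimate(samples, regimes)
print("ema:", round(ema, 3), "gated:", gated)
```

Five steps after the shock, the moving average is still sitting roughly halfway between the two regimes, while the gated estimate reflects only post-shock data. The gated version is only as good as the regime labels, of course, and it starts from a very small sample.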
Adversarial signaling games. In a multi-agent signaling game, a smart opponent will feed you a long run of consistent signals specifically to build your confidence in a particular prior, and then execute an orthogonal shift the moment you have committed. A standard Bayesian update drags the poisoned prior along for the ride, because Bayesian updating is, by design, a soft operation. Topological pruning lets you do something a posterior cannot do gracefully: a rank-one downdate that annihilates the compromised belief from the probability manifold outright. When the belief is stored as D independent weights, that costs O(D) time; on a dense D-by-D covariance factor a rank-one downdate is O(D^2).
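The contrast can be sketched on a categorical belief over D hypotheses. To be clear about the assumption: this is not a rank-one downdate on a covariance factor, only the simplest categorical analogue of the same idea, and the numbers and function names are hypothetical.

```python
def bayes_update(belief, likelihood):
    # Soft update: any hypothesis with nonzero likelihood survives.
    posterior = [b * l for b, l in zip(belief, likelihood)]
    z = sum(posterior)
    return [p / z for p in posterior]

def hard_prune(belief, compromised):
    # Hard prune: zero the compromised hypotheses, renormalize in one O(D) pass.
    kept = [0.0 if i in compromised else b for i, b in enumerate(belief)]
    z = sum(kept)
    if z == 0.0:
        raise ValueError("pruning removed every hypothesis")
    return [k / z for k in kept]

# The opponent has spent many rounds building confidence in hypothesis 0.
belief = [0.90, 0.05, 0.05]
# Hypothetical post-shift evidence, mildly against hypothesis 0.
likelihood = [0.2, 0.4, 0.4]

soft = bayes_update(belief, likelihood)
hard = hard_prune(belief, compromised={0})
print("soft:", [round(p, 3) for p in soft])
print("hard:", [round(p, 3) for p in hard])
```

After one round of contrary evidence, the soft update still puts most of its mass on the poisoned hypothesis. The hard prune removes it in a single step, at the price of having to trust whatever flagged it as compromised.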
The Thesis, Compressed
The default goal of robust RL is usually framed as teaching agents to ignore bad information. I think that framing is the trap. Ignoring is soft, ignoring is expensive, ignoring leaves the corpse in the weights. The better goal is to build architectures where bad information mathematically ceases to exist the moment it is identified as bad.
Memory is easy. Every overparameterized network has plenty of it. The hard and underrated capability is amnesia: the disciplined, structural, on-demand kind. Not forgetting because you decayed, but forgetting because you decided.
If forgetting is a capability rather than an accident, we should probably start engineering it like one.
- Knowing What Not to Do: Leveraging LLM Insights for Action Space Pruning in MARL (Liu et al., 2024). The clearest recent argument that hard pruning of the action space is the right move when the agent count blows the action space up combinatorially.
- CORectifier: Hierarchical Trajectory Rectifications Boost RL for Combinatorial Optimization (2025). Interesting as the complement: instead of amputating bad trajectory segments, it probabilistically injects high-quality expert segments to regularize RL's exploration.
- Neural Solver Selection for Combinatorial Optimization (Gao et al., ICML 2025). The cleanest published echo of the meta-router philosophy: routing each instance to the right specialist beats one big generalist.