Walls, Not Whispers

A small case for communicating *geometry* instead of *rewards* in distributed search.

February 2026

Kabir Murjani · February 2026

There is a quiet dishonesty in how most distributed reinforcement learning systems talk to each other about failure.

Picture a swarm of agents exploring some massive combinatorial space, routing, scheduling, packing, the usual suspects. One agent stumbles into a dead end: a configuration that violates a hard constraint, an infeasible sub-tour, something that is not merely bad but actually impossible. So what does it do? It lowers a reward. It nudges a value estimate down. It bumps an entropy penalty. In effect it turns to the rest of the swarm and whispers, "hey, this area felt kind of unpromising, maybe explore it a little less."

And the swarm, being well-behaved, complies. Slowly. Over many gradient steps. With a decaying probability of wandering back in.

This is completely fine when failure is stochastic. The landscape is smooth, low reward genuinely means low expected value, soft avoidance is the right move. But a lot of the problems we actually care about are not smooth. In an exact combinatorial space, a violated constraint is not a low hill you should probably avoid. It is a wall. The infeasible region does not have low probability mass. It has no mass. It is not part of the solution set at all.

And yet we keep communicating these walls as if they were whispers. We take a categorical, geometric fact ("this entire direction does not exist") and we compress it into a single scalar, inject it back into a probability distribution, and then pay compute to slowly un-learn it. The invalid direction lingers in the joint policy manifold of the whole swarm, decaying toward zero one gradient at a time, while every other agent keeps assigning it a small but nonzero chance of being worth a look.

That is a lot of FLOPs spent re-discovering something that was, mathematically, an axiom the moment it was found.

So here is the question I want to plant a flag on: what if agents stopped broadcasting scalar rewards, and started broadcasting geometry?

A rank-one downdate. The dead direction is annihilated; the agents project orthogonally into what remains.

Spectral Erasure

Imagine a distributed neural combinatorial optimization setup where agents share more than a value function. They share a precision matrix P ∈ ℝ^{D×D} that describes the active, valid search manifold, not "where the good solutions are," but where the search space still meaningfully exists.

Now when an agent hits a definitive constraint violation, it does three things:

1. It computes a continuous embedding x ∈ ℝ^D of the failed combinatorial sequence, the direction in the shared latent space that corresponds to that specific dead end. 2. It broadcasts x. Not a reward. Not a flag. A vector. 3. Every agent that receives it performs a rank-one downdate on its own manifold, using the Sherman-Morrison form:

P' = P - (P x xᵀ P) / (1 + xᵀ P x)

The effect is not a soft nudge. It is a manifold collapse. The dimension corresponding to the dead end is orthogonally annihilated in O(D) time, no gradient step required, and every agent is instantly shifted into the subspace orthogonal to the failure. The search space physically shrinks. No agent will ever spend another FLOP on that branch.

The whole pitch fits in one sentence: constraint violations are not negative rewards, they are rank-reducing events, and our communication protocol should treat them that way.

I find this neat for a reason that goes beyond the efficiency math. Soft updates are evaluative: an agent is saying "I found something bad." A spectral update is structural: an agent is saying "I found a direction that does not exist." Those are different epistemic acts. The second one is also auditable. A scalar reward is impossible to verify, but a broadcast geometry vector can in principle be checked before it is accepted. You can be honest about walls in a way you cannot be honest about whispers.

Left: a soft penalty still attracts the water. Right: hard pruning removes the puddle.

Where This Travels

The fun part is that "the structural infeasibility of a search direction" is a substrate-independent idea. The combinatorics is just the cleanest place to see it.

Portfolio construction. This might actually be a cleaner fit than routing problems. Portfolio constraints are genuinely hard walls: no short selling, regulatory exposure caps, sector limits, leverage ceilings. When an agent discovers an infeasible allocation direction, you do not want the rest of the swarm softly down-weighting it over the next thousand updates. You want it gone. The interesting wrinkle, and it is a real one, is that in finance the feasible set moves. A direction you legitimately erased last quarter can reopen when a regime shifts or a regulation changes. So spectral erasure in markets cannot be permanent. It needs an expiry, or better, a regime-aware re-injection rule. That tension is, I think, where the actual research lives.

Auction theory and bidding strategy. Combinatorial auctions are full of hard feasibility constraints (budget, allocation validity, no-overlap on bundles), and agents exploring bidding strategies hit them constantly. Same story as routing: the violation is structural, so communicate it structurally. The downdate prunes the invalid joint-bid directions for the whole population of bidding agents at once instead of letting each one independently rediscover that a bundle is double-counted.

Language models. This is the speculative one, and the one I am least sure about, which is why it is the most fun. Multi-agent LLM search wastes enormous amounts of inference re-exploring reasoning branches that are already provably dead. The standard fix is value-based pruning, which is, again, a whisper. A spectral approach would instead embed a type of failed reasoning and orthogonally collapse it out of the shared search representation, so that no agent in the population re-enters that class of dead end. The hard problem is obvious: what is x when the "constraint" is a logical contradiction rather than a violated sub-tour? I do not have a clean answer. But the question feels alive.

The Open Horizon

The continuous-to-discrete mapping. Everything hinges on one underspecified step: how do you map a discrete combinatorial violation into a continuous x that is geometrically meaningful for the downdate? The embedding has to do real work. Two violations of the same constraint for the same reason should land near each other; violations of different constraints should be well separated. A naive encoder will not give you this for free. This mapping is where the actual mathematics lives, and I think it is the make-or-break of the whole idea.

Rank preservation. If the swarm keeps performing downdates, the precision matrix loses rank monotonically, and eventually the search space collapses to nothing. That is either a proof of completeness (we eliminated every infeasible direction, beautiful) or a catastrophic over-correction (we erased valid solutions because the embedding was imprecise, less beautiful). You need a principled way to inject fresh orthogonal directions back in. What governs the injection? Swarm policy entropy? A curiosity signal? Genuinely open.

Transfer across problem classes. The ambitious version: does a precision matrix trained on one problem class carry geometric structure that transfers to a related one? If the shared manifold encodes constraint types rather than instance-specific failures, then spectral erasure quietly becomes a form of structural meta-learning. The swarm stops learning solutions and starts learning the shape of feasibility itself.

The Intuition I Keep Coming Back To

The default paradigm treats the search space as a landscape to be navigated. Spectral erasure treats it as a structure to be revealed, progressively, by elimination. Each constraint violation stops being a setback and becomes a piece of information about the shape of the feasible set, broadcast instantly and geometrically to everyone at once.

Whether it cashes out into working systems is an empirical question, and I have not run the experiments. But the architecture is coherent, the efficiency argument is clean, and the open questions are generative rather than fatal. That seems like enough to plant a flag on, and to start arguing about.

If you do the experiments before I do, tell me what breaks.

PARCO: Learning Parallel Autoregressive Policies for Efficient Multi-Agent Combinatorial Optimization (NeurIPS 2025). The cleanest recent statement of multi-agent neural combinatorial optimization with parallel policies.
Soft-Radial Projection for Constrained End-to-End Learning (Schneider and Kuhn, EPFL, 2026). Shows that projecting onto constraint boundaries collapses exterior points onto lower-dimensional surfaces and produces rank-deficient Jacobians. Read it as a formalization of why naive erasure is dangerous.
Constraints-of-Thought: Constrained Reasoning in Language-Model-Guided Search (2025). The LLM analogue, modifies the feasible action set itself rather than re-weighting it. The same "shrink the space, do not re-weight it" move, applied to reasoning.