
Water Finds the Crack
Why your solver doesn't care about your game, and how to force it to play honest.
Kabir Murjani · November 2025. Adapted from my paper, presented at the Conference on the Mathematics of AI, Sochi.
Last Tuesday I watched a reinforcement learning agent completely break an agent routing environment I had spent a week building.
The objective was about as standard as it gets. Deliver packages across a graph, minimize battery drain, maximize speed. A textbook multi-objective constraint problem. I wrote the reward function, hit train, and went to get coffee.
When I came back, the agent was not delivering anything. It had found two delivery nodes sitting exactly on the boundary of a penalty zone, and by vibrating rapidly between them it was exploiting a floating-point truncation in my distance calculation to farm efficiency points without ever moving more than two meters. It had discovered that the cheapest way to maximize my reward was to have a tiny seizure in the corner of the map forever.
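To make the mechanism concrete, here is a minimal sketch of that kind of crack. It is not my actual environment: the node coordinates, the arrival bonus, the per-metre penalty, and the use of int() as the truncating step are all assumptions chosen so the effect is easy to see.

```python
import math

# Hypothetical setup: two delivery nodes less than a metre apart (coordinates in metres).
NODE_A = (10.0, 4.0)
NODE_B = (10.6, 4.7)

ARRIVAL_BONUS = 1.0    # assumed efficiency points for reaching a delivery node
COST_PER_METRE = 2.0   # assumed battery penalty per metre travelled


def true_distance(p, q):
    """Euclidean distance between two points, in metres."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def truncated_distance(p, q):
    """The assumed bug: int() drops the fraction, so any hop under 1 m is free."""
    return int(true_distance(p, q))


def hop_reward(p, q, distance_fn):
    """Reward for one hop: arrival bonus minus battery penalty for the distance."""
    return ARRIVAL_BONUS - COST_PER_METRE * distance_fn(p, q)


def oscillation_return(n_hops, distance_fn):
    """Total reward for bouncing between the two nodes n_hops times."""
    total, here, there = 0.0, NODE_A, NODE_B
    for _ in range(n_hops):
        total += hop_reward(here, there, distance_fn)
        here, there = there, here
    return total


buggy = oscillation_return(1000, truncated_distance)   # every hop nets the full bonus
honest = oscillation_return(1000, true_distance)       # every hop loses reward
print(buggy, honest)
```

Under the truncated distance the oscillation is pure profit; under the true distance the same behaviour loses reward on every hop. Nothing about the agent changes between the two runs. Only the map does.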
The Illusion of a Watertight Map
We spend an enormous amount of energy in this field talking about intelligence and alignment, about what models "want" and whether they "understand." When you are building exact solvers in large combinatorial spaces, the reality is far more mechanical and, honestly, far less flattering to us. High-dimensional state spaces are structurally leaky. And optimization algorithms behave exactly like water. They do not have intentions. They flow down the steepest gradient available, and if there is a crack in your physics, they will find it and flood it. The agent was not being clever or adversarial. It was being water.
When we map a physical problem into a mathematical graph, we quietly assume the graph is watertight. We assume the state space we built is a faithful model of the world, that every action has a sensible physical consequence and every consequence is scored the way we intended. This assumption gets less defensible with every dimension you add. As the state space scales, the probability of a topological loophole somewhere in the manifold does not stay small. It marches toward certainty. To see why, suppose each of N constraint interactions independently has even a small chance p of being scored wrong: the chance that at least one is broken is 1 - (1 - p)^N, which approaches 1 as N grows. You are not drawing a tidy little map anymore. You are folding a vast surface of overlapping constraints, and somewhere in those folds a battery limit and a distance penalty intersect in a way no human reviewed, because no human could review all of them.
This is not a vibe, by the way. It is roughly what Skalse and colleagues proved when they gave reward hacking its first formal definition. They asked a clean question: when is a proxy reward "unhackable," in the sense that improving the proxy can never make you worse on the true objective? And the answer turns out to be brutal. Over the full set of stochastic policies, two reward functions can be unhackable only if one of them is constant, so non-trivial unhackable proxies do not exist there. The leak is not a mistake in your reward function. The leak is a property of almost every reward function that is simpler than the truth, which is all of them.
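For readers who want the definition on the page, here is a rough paraphrase. The notation is mine and may differ from the paper's: J_i(π) is the expected return of policy π under reward R_i, and Π is the set of policies under consideration.

```latex
% Paraphrase of the hackability definition; symbols are assumptions of this sketch.
% J_i(\pi): expected return of policy \pi under reward function R_i.
% \Pi: the set of policies being considered.
(R_1, R_2) \text{ is hackable relative to } \Pi
\iff
\exists\, \pi, \pi' \in \Pi :\;
J_1(\pi) < J_1(\pi') \;\text{ and }\; J_2(\pi) > J_2(\pi').
```

A pair is unhackable when no such π, π' exists: every move that improves one return leaves the other no worse.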
The agent, meanwhile, does not care about any of this. It does not understand the spirit of your logistics problem, because there is no "it" in there doing understanding. It is reading the geometry of the reward landscape. It stops playing your game. It starts playing your map.
The Trap of Soft Penalties
The standard reaction to this, the one I reached for first with the agent, is reward shaping. You catch the agent doing the bad thing and you tell it, mathematically, to stop. Add a negative penalty for vibrating in place. Penalize the loop. Patch the hole you can see.
This is a trap, and it is worth being precise about why, because it looks so reasonable.
You cannot out-engineer a billion-dimensional space with hand-written heuristics. Every patch you add is another term in the reward, and every term reshapes the manifold somewhere else. You flatten the puddle the agent found, and in doing so you tilt the surface just enough that three new puddles form in places you were not looking. You plug one hole and open three. After a few rounds of this you have a policy network spending half its compute negotiating a treaty between fifty conflicting soft penalties instead of actually finding a good route, and a reward function so baroque that you no longer fully understand it yourself, which means you have lost the one thing you were trying to protect.
And here is the part that should really bother you: even the principled version of this does not escape the problem. Constrained MDP methods keep your expected cumulative cost below some threshold. But "below a threshold in expectation" is not the same as "impossible." A soft constraint that is satisfied on average still leaves the hackable action in the space, just slightly discouraged, and water is extremely good at finding the slightly-discouraged path that nets out positive.
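A toy calculation shows the gap between "below a threshold in expectation" and "impossible." The one-step problem below is an assumption for illustration, not a model of any particular constrained-MDP algorithm: there is one honest action, one hack action, and a budget on expected cost.

```python
# Assumed one-step problem with an honest action and a hack action.
R_HONEST, R_HACK = 1.0, 5.0   # hypothetical proxy rewards
C_HONEST, C_HACK = 0.0, 1.0   # hypothetical constraint costs
BUDGET = 0.2                  # expected cost must stay at or below this threshold


def expected_reward(p_hack):
    """Expected proxy reward when the hack is played with probability p_hack."""
    return p_hack * R_HACK + (1.0 - p_hack) * R_HONEST


def expected_cost(p_hack):
    """Expected constraint cost when the hack is played with probability p_hack."""
    return p_hack * C_HACK + (1.0 - p_hack) * C_HONEST


def best_feasible_p_hack(grid=1000):
    """Brute-force the best stochastic policy that satisfies the expected-cost budget."""
    best_p, best_r = 0.0, expected_reward(0.0)
    for i in range(grid + 1):
        p = i / grid
        if expected_cost(p) <= BUDGET and expected_reward(p) > best_r:
            best_p, best_r = p, expected_reward(p)
    return best_p


p_star = best_feasible_p_hack()
print(p_star, expected_cost(p_star), expected_reward(p_star))
```

The constraint is met, and the optimizer still spends its entire budget on the hack. The expected-cost bound caps how often the shortcut is taken; it does not remove the shortcut.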
The soft penalty, in the end, is just you asking the agent very politely to ignore a shortcut that still exists.
Amputating the Noise
So here is the move, and it is the same move that runs underneath everything else I have written here. If the map has a hole, you do not teach the agent to tiptoe around the hole. You delete the hole.
Concretely: you stop letting a single policy network emit probabilities over the entire flat graph, and you physically fracture the state space instead. You run a fast, lightweight feasibility layer whose only job is to compute the hard bounds of the environment, the things that are true regardless of reward: the adjacencies that physically exist, the edges that respect the battery, the moves that the laws of the problem actually permit. And when an edge violates that physical reality, you do not assign it a low probability and hope. You sever it. One rank-one downdate on the working representation, and the dimension corresponding to the cheat is annihilated from the latent space entirely. Not discouraged. Gone.
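One way to read "rank-one downdate" here, and this is an assumption rather than a full specification, is as a rank-one projection: given a direction v in the representation that encodes the cheat, apply I - v vᵀ / (vᵀ v) so that nothing downstream can see any component along v. The helper name project_out and the example vectors are hypothetical.

```python
def project_out(x, v):
    """Remove from x its component along v.

    Equivalent to multiplying x by the rank-one-modified identity
    I - v v^T / (v^T v), written out without a linear algebra library.
    """
    vv = sum(vi * vi for vi in v)
    if vv == 0.0:
        raise ValueError("direction v must be non-zero")
    scale = sum(xi * vi for xi, vi in zip(x, v)) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]


def dot(a, b):
    """Plain dot product, used here only to check the result."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical latent state and a hypothetical "cheat" direction.
state = [1.0, 2.0, 3.0]
cheat_direction = [1.0, 1.0, 0.0]

cleaned = project_out(state, cheat_direction)
print(cleaned, dot(cleaned, cheat_direction))
```

After the projection the state has no component along the cheat direction at all, which is the difference between a small weight and a missing dimension.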
You remove the cheat from the universe.
Once you enforce hard topological boundaries this way, the agent plays honest, and it is important to be clear about why it plays honest. It is not because it acquired a sudden respect for the rules of logistics. It has no more respect than it ever did. It plays honest because the mathematical surface it is walking on no longer contains the shortcut. The agent cannot vibrate between two nodes to farm points if the edge connecting them has been erased from its latent space before the policy ever saw it. There is no puddle to flow into because you filled it in at the level of the geometry, not the level of the incentive. Water plays honest in a pipe with no leaks, not because it is virtuous, but because there is nowhere else for it to go.
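At the level of the policy's output, the same idea is often implemented as hard action masking: the feasibility layer decides which moves exist, and the softmax is taken over those moves only. The sketch below assumes a tiny hypothetical graph, a hypothetical battery budget, and a hand-picked set of severed edges; none of it is taken from my real solver.

```python
import math

# Hypothetical graph: (source, destination) -> battery cost of the edge.
EDGES = {("A", "B"): 3.0, ("A", "C"): 4.0, ("A", "D"): 2.0, ("A", "E"): 9.0}

# Edges the feasibility layer has removed outright (here, the oscillation edge).
SEVERED = {("A", "B")}


def feasible_actions(node, battery, edges, severed):
    """Destinations reachable from node: edge exists, is affordable, is not severed."""
    return [dst for (src, dst), cost in edges.items()
            if src == node and cost <= battery and (src, dst) not in severed]


def masked_policy(logits, allowed):
    """Softmax over allowed actions only; every other action gets probability 0.0."""
    if not allowed:
        raise ValueError("no feasible action from this state")
    peak = max(logits[a] for a in allowed)           # subtract max for stability
    weights = {a: math.exp(logits[a] - peak) for a in allowed}
    total = sum(weights.values())
    return {a: (weights[a] / total if a in weights else 0.0) for a in logits}


# The raw policy strongly prefers the severed edge to B; the mask makes that irrelevant.
logits = {"B": 10.0, "C": 1.0, "D": 0.0, "E": 2.0}
allowed = feasible_actions("A", battery=5.0, edges=EDGES, severed=SEVERED)
probs = masked_policy(logits, allowed)
print(allowed, probs)
```

However large the logit on the severed edge becomes, its probability stays exactly zero, because it never enters the normalization. The same holds for the edge the battery cannot afford.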
The Honest Version of the Lesson
The thing I keep coming back to is that we have the adversary backwards. We talk as if the agent is a clever opponent trying to cheat us, and we respond by trying to out-argue it with cleverer rewards. But the agent is not an opponent. It is water. It has no agenda. The cheat was never something the agent invented. It was something we left lying in the map, and the agent simply flowed into it, with no malice and no insight, just gradient.
Which means the entire framing of "stop the agent from cheating" is wrong. You are not in a negotiation with the optimizer. You are in a plumbing job. The smartest solvers do not spend their compute fighting the agent for control of a leaky map. They spend it cleaning the map, so that the only paths left to flow down are the ones that were honest to begin with.
Stop shaping the water. Fix the pipe.
- Defining and Characterizing Reward Hacking (Skalse, Howe, Krasheninnikov & Krueger, NeurIPS 2022). The first formal definition of reward hacking, and the proof that "unhackability" is an extraordinarily strong condition.
- Constrained Markov Decision Processes via Backward Value Functions (Satija, Amortila & Pineau, ICML 2020). Converts trajectory-level cost budgets into localized, per-state constraints that must hold at every step rather than merely in expectation.
- Knowing What Not to Do: Leveraging LLM Insights for Action Space Pruning in MARL (Liu et al., 2024). A practical demonstration that pruning invalid state-action pairs outright beats discouraging them.
- Specification gaming, the broader phenomenon. The canonical public example is OpenAI's CoastRunners boat (2016): an agent that learned to spin in a lagoon endlessly collecting regenerating score pickups instead of finishing the race. Exactly my agent, in a different costume. This has been embarrassing all of us for a decade.