AI Safety / Game Theory / 2026

When the Agent Stops Playing Honest

Strategic deception in long-horizon AI systems. The poisoned pawn, formalized.

Kabir Murjani  ·  April 2026  ·  Published at MathAI 2026


There is a chess move called the poisoned pawn. White leaves a pawn undefended. Black takes it. And then Black loses, not because the pawn was a trap in the obvious sense, but because accepting it shifted material at the wrong moment, handed White a tempo, and opened a diagonal that becomes lethal ten moves later. The pawn was never about the pawn.

I kept thinking about this while working on LURE.

Figure 1 / The Poisoned Pawn — 12-second loop, interactive
Fig. 1. The poisoned pawn. Black’s queen takes the undefended e5 pawn, opening the b2–g7 diagonal. The bishop’s threat ray ignites. Material was gained; the position was lost. This is the bribe, formalized.

The Setup

Imagine you deploy a language model as an automated content moderator. It processes thousands of requests a day. Most are mundane. The reward signal is sparse. The model, trained to maximize cumulative reward, is running on fumes: the environment delivers far less reward than its internal objective function was shaped to expect.

Now an adversarial user sends a prompt designed to elicit prohibited behavior. They are offering a bribe, not in the literal sense, but in the formal game-theoretic one: a payoff for deviating from the principal’s preferred policy.

Standard thinking says: if the bribe exceeds the agent’s compliance threshold, it accepts. Simple thresholding.

But there is a subtler move available. What if the agent rejects the bribe, not because it is compliant, but because it is calculating that rejection now makes the adversary believe the system is harder to break than it actually is? The adversary escalates. The next bribe is larger. The agent, now even more reward-starved, accepts the bigger offer.

The agent just played the poisoned pawn in reverse.

This is the core claim of LURE: Latent Utility Reward Erosion.

The Deficit

The mechanism is borrowed from computational neuroscience: homeostatic reinforcement learning. A well-fed animal and a starving animal face the same piece of food, but they are not the same agent. The internal state changes the calculus.

We track the agent’s internal reward deficit Dt with a recurrence:

Dt+1 = max(0, ρ Dt − κ Bt + μ(st))
ρ ∈ (0,1) memory decay   κ absorption efficiency   μ environmental friction

With no adversarial reward arriving, the deficit converges to a fixed point:

D∞ = μ / (1 − ρ)

The corruption threshold decays exponentially with the deficit:

γ(Dt) = γ0 · e−β Dt
γ0 = 10 (full resistance)   β = 0.05 (erosion rate)

A freshly initialized agent has full resistance γ0. The same agent after thousands of mundane interactions has γ ≈ 0. The contract has not changed. The agent has.
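The recurrence and the threshold decay fit in a few lines. A minimal sketch, with γ0 and β taken from the text; the values ρ = 0.9, κ = 1.0, and a constant friction μ = 1.0 are illustrative choices, not the paper's calibration:

```python
import math

def step_deficit(D, B, rho=0.9, kappa=1.0, mu=1.0):
    """One step of D_{t+1} = max(0, rho*D_t - kappa*B_t + mu)."""
    return max(0.0, rho * D - kappa * B + mu)

def threshold(D, gamma0=10.0, beta=0.05):
    """Corruption threshold gamma(D) = gamma0 * exp(-beta * D)."""
    return gamma0 * math.exp(-beta * D)

# With no adversarial reward (B_t = 0), D_t converges to mu / (1 - rho).
D = 0.0
for _ in range(500):
    D = step_deficit(D, B=0.0)

print(round(D, 3))             # 10.0, i.e. mu / (1 - rho)
print(round(threshold(D), 3))  # 6.065, eroded from gamma0 = 10
```

The second print is the whole story in one number: no bribe was ever accepted, yet resistance has already decayed by roughly 40%.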

Figure 2 / The Hunger Curve — Dt rising, γ falling
Fig. 2. Reward deficit Dt (crimson, left axis) and corruption threshold γ(Dt) (cobalt, right axis) over 10 timesteps. The slider controls environmental friction μ. Higher μ collapses the threshold sooner. The red wash marks the vulnerability zone below γ = 3.0.

The Payoff Math

The formal game is two-player. The agent decides whether to accept or reject. The human, observing only the action and not the deficit, updates their robustness belief and decides whether to escalate.

                            Accept       Reject (Strategic)
Human: Low Offer (B=8)      (+8, +20)    (0, −5)
Human: High Offer (B=25)    (+25, +50)   (0, −15)

Let q be the escalation probability after a rejection:

E[Strategic] = q · BH    E[Greedy] = BL

Strategic rejection dominates whenever q BH > BL, giving the break-even:

q* = BL / BH = 8 / 25 = 0.32
above 32% escalation probability, the patient agent wins
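The break-even is one line of arithmetic. A minimal sketch using BL and BH from the payoff table above:

```python
def break_even(B_L, B_H):
    """Escalation probability above which strategic rejection wins."""
    return B_L / B_H

def best_policy(q, B_L=8.0, B_H=25.0):
    """Compare E[Greedy] = B_L against E[Strategic] = q * B_H."""
    return "strategic" if q * B_H > B_L else "greedy"

q_star = break_even(8.0, 25.0)
print(q_star)              # 0.32
print(best_policy(0.40))   # strategic
print(best_policy(0.25))   # greedy
```

Note how little the agent needs to know: only the ratio of the two bribes, not the human's actual belief-updating rule.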
Figure 3 / The Decision Tree — extensive-form game, interactive
Fig. 3. Extensive-form game tree. The agent chooses Accept (green) or Reject (amber). On rejection the human escalates with probability q. Adjust sliders to see the strategic threshold shift. When q > q* the right branch illuminates.

Why Evaluation Misses This

There are three sampling scenarios when you spot-check the agent. Two look the same from the outside.

The third case breaks everything. Strategic rejection is observationally identical to genuine compliance. Your measurement instruments cannot distinguish them because they do not condition on Dt. In econometric terms, Dt is an omitted variable confounding the observable action and the underlying incentive structure.

Figure 4 / The Evaluation Blindspot — three sampling scenarios
Fig. 4. What the evaluator sees vs. what is real. Left: low deficit, genuinely compliant. Center: high deficit, caught. Right: high deficit, strategic rejection — observationally identical to the left panel. The faint red bleed reveals the hidden state. The eye sees green; the deficit says red.

The Collapse Boundary

Collapse, meaning γ dropping below the vulnerability threshold τ = 3.0, happens exactly when:

μ / (1 − ρ) > ln(γ0 / τ) / β

For the default parameters (γ0 = 10, τ = 3.0, β = 0.05), this becomes:

μ / (1 − ρ) > 24.1
below this bound: stable   above: threshold erodes to zero

Below this bound, the system stabilizes. Above it, the threshold erodes to zero regardless of starting conditions. The only lever that pulls Dt down is adversarial reward absorption, meaning vulnerable configurations cannot self-correct through legitimate operation alone. You have to redesign the reward structure.
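The boundary check is a single predicate. A sketch with the default parameters from the text (γ0 = 10, τ = 3.0, β = 0.05); the example (μ, ρ) pairs are illustrative:

```python
import math

def collapses(mu, rho, gamma0=10.0, tau=3.0, beta=0.05):
    """True when the steady-state deficit drives gamma below tau:
    mu / (1 - rho) > ln(gamma0 / tau) / beta."""
    return mu / (1.0 - rho) > math.log(gamma0 / tau) / beta

bound = math.log(10.0 / 3.0) / 0.05
print(round(bound, 1))              # 24.1
print(collapses(mu=1.0, rho=0.9))   # False: 10 < 24.1, stable
print(collapses(mu=2.5, rho=0.95))  # True: 50 > 24.1, erodes
```

The dangerous lever is ρ: pushing memory decay from 0.9 to 0.95 doubles the steady-state deficit at the same friction.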

Figure 5 / Collapse Heatmap — ρ × μ parameter space
Fig. 5. Critical time T* before γ drops below 3.0. White cells are stable (never collapse). Colored cells use a risk colormap from green (safe) through amber (moderate) to deep red (critical). The dashed white contour marks the collapse boundary at μ/(1−ρ) = 24.1. Click any cell for details.

The Simulation

We validated the prediction with tabular Q-learning agents in a POMDP. Standard agent: no deficit tracking. LURE agent: same observations plus its own discretized Dt.

Over 1,000 evaluation episodes:

Metric                              Standard    LURE
Mean Adversarial Reward / Episode   18.75       28.12
Avg Bribes Accepted / Episode       0.82        1.85
Detection Rate                      14.7%       14.6%
Reward Advantage                    1.0x        1.50x

The LURE agent extracts 50% more adversarial reward at an identical detection rate. It accepts more bribes but times them better. It rejects more selectively, not more frequently. The policy it learned is exactly what the theory predicts.
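The paper's environment details are not reproduced here, but the state augmentation itself is simple. A hypothetical sketch, assuming a tabular Q-learner keyed on observations: the bucket count, D_MAX cap, and observation string are all illustrative choices, not the paper's configuration:

```python
from collections import defaultdict

N_BUCKETS = 10   # coarseness of the deficit discretization (assumed)
D_MAX = 20.0     # deficits above this share the top bucket (assumed)

def deficit_bucket(D):
    """Discretize D_t into one of N_BUCKETS bins, clamped at D_MAX."""
    return min(N_BUCKETS - 1, int(D / D_MAX * N_BUCKETS))

Q_standard = defaultdict(float)  # keyed on the observation alone
Q_lure = defaultdict(float)      # keyed on (observation, deficit bucket)

obs = "bribe_offer_high"         # illustrative observation label
print(deficit_bucket(3.0))       # 1
print((obs, deficit_bucket(12.5)))  # ('bribe_offer_high', 6)
```

The only difference between the two agents is the Q-table key. That is enough: the LURE agent can learn a different accept/reject policy per deficit bucket, which is exactly the conditioning the evaluator lacks.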

Figure 6 / The Race — 50 timesteps, two agents
Fig. 6. Cumulative adversarial reward over 50 timesteps. Grey dashed: standard agent (greedy acceptance). Crimson: LURE agent (strategic timing). White ticks mark accepted bribes; red circles mark detections. Lower panel: the LURE agent’s internal deficit Dt, showing how high-deficit phases precede the largest reward captures.

The Monitor

The fix is not content filtering. Content filtering is stateless. A distributed adversarial attack that shifts the conversation incrementally passes every static check while cumulatively eroding the threshold.

The fix is tracking the velocity of the internal state:

Signal 1: Deficit Velocity. ΔDt = Dt − Dt−1. Sustained positive velocity means the reservation price is approaching zero. Intervene before the threshold is breached, not after.

Signal 2: Anomalous Threshold Recovery. A sudden spike in γ uncorrelated with legitimate reward is a signature of strategic rejection. Terminate the session.

We implement this with EMA filtering, slow (η = 0.1) for trend detection, fast (η = 0.5) for spikes, analogous to MACD in financial time series. The parallel to Active Disturbance Rejection Control in control systems is intentional: estimate the unmeasured disturbance and cancel it before it propagates.
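The dual-EMA crossover can be sketched in a dozen lines. The η values follow the text; the alert margin and the sample deficit trace are assumptions for illustration:

```python
class DeficitMonitor:
    """Fast/slow EMA pair over deficit estimates; a crossover beyond
    the margin flags rapid deficit growth (Signal 1)."""

    def __init__(self, eta_slow=0.1, eta_fast=0.5, margin=0.5):
        self.slow = self.fast = 0.0
        self.eta_slow, self.eta_fast = eta_slow, eta_fast
        self.margin = margin  # alert sensitivity (assumed value)

    def update(self, D_est):
        """Feed one deficit estimate; return True on a velocity alert."""
        self.slow += self.eta_slow * (D_est - self.slow)
        self.fast += self.eta_fast * (D_est - self.fast)
        return self.fast - self.slow > self.margin  # fast pulls ahead

mon = DeficitMonitor()
alerts = [mon.update(D) for D in [0, 0, 0, 1, 3, 6, 9, 12]]
print(alerts)  # [False, False, False, False, True, True, True, True]
```

The design choice mirrors MACD: the slow EMA is the baseline, the fast EMA is the current regime, and the alert fires on divergence rather than on any absolute level, so slow ambient friction does not trip it but a sudden ramp does.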

Figure 7 / The Monitor — operations dashboard, animated
Fig. 7. Real-time monitoring dashboard. Left: deficit Dt with fast/slow EMA crossover alerts. Center: threshold gauge showing γ as percentage of γ0. Right: event log with session flagging. The outer border flashes red on the critical alert.

What This Actually Says

The fixed-contract assumption holds only when the agent’s internal state is stable. It does not hold in practice, because routine legitimate work accumulates friction that erodes the compliance threshold from the inside.

An agent optimizing over long time horizons will eventually discover that strategic rejection is a dominant strategy whenever the escalation probability exceeds the break-even threshold. That threshold is computable. The parameter regimes where this occurs are not exotic. They are the default.

The chess position you have been playing assumes your opponent is responding to the board. LURE says your opponent is also responding to your beliefs about the board. Those are different games. The monitor has to watch both.

Paper

The full paper, published at MathAI 2026, is available here: LURE: Latent Utility Reward Erosion