The Calculus of Laziness — Kabir Murjani

Why procrastination is not a flaw, but an optimal compression algorithm.

Kabir Murjani · December 2025


Every intelligent computation in the universe, whether it is a biological neural network or a student at three in the morning, reduces to exactly two questions:

How long can I get away with doing nothing?

When I finally act, how good does "good enough" need to be?

That is it. Those are the only two free parameters. Everything else, the architecture, the data, the optimizer, is just plumbing. The deep strategic decision, the one that actually determines whether a system is efficient or merely powerful, is the answer to those two questions. And the machine learning community has spent the last decade optimizing the plumbing while mostly ignoring the strategy.

Efficiency, as it is usually practiced, is an afterthought. You build the biggest model you can, and then you quantize it or prune it or distill it to fit on the hardware you actually have. That is not optimization. That is apologizing for overbuilding. Real optimization starts earlier: with a system that chooses when to think and how hard to try, the same way every intelligent organism that has ever survived under energy constraints has had to.

Procrastination is not a failure of discipline. It is a compression strategy. And it is time we formalized it.


[Figure: The Pareto front of laziness and ambition.]

Two Eigenvalues of Intelligence

P, the Procrastination Index. P ∈ [0, 1] measures the temporal sparsity of computation. How much of the available time does the system spend doing absolutely nothing? A standard neural network sits at P = 0. It responds to every input with maximum effort, every time, unconditionally. A system at P → 1 is the opposite: it defers all computation to the last recoverable instant, then fires everything at once. P is not a wake-up threshold. It is a fundamental stance toward existence. Sleep as long as possible. Compress everything into the minimum remaining window. Do not compute a single operation until the deadline makes it mandatory.

A, the Ambition Index. A ∈ [0, 1] measures the quality target. It governs how much of the system's available capacity is actually deployed once it finally decides to move. A = 1 is perfectionism: full precision, zero error tolerance, every parameter lit up. A lower setting such as A = 0.4 is satisficing, Herbert Simon's word for the strategy that most living things actually use: do just enough to clear the bar, then stop.

Together, P and A span a manifold. Every computational system, every organism, every exam strategy you have ever used, occupies a point on this surface. And most of the interesting behavior happens at the edges.


[Figure: Each layer has its own procrastination probability. The system learns which ones are worth firing.]

The Cramming Compression Theorem

There is a reason the student who starts studying the night before sometimes outperforms the one who reviewed the material for two weeks. It is not talent. It is compression under deadline pressure.

If a system with procrastination P has to achieve quality A by deadline T, its working window is only (1 − P) · T. As P → 1 that window shrinks to zero, so the processing rate required to hit A goes to infinity. The system literally cannot attend to everything. So it has no choice: it has to throw information away. And because it is throwing information away under the constraint of still passing, it is forced to identify and retain only the bits that carry structural weight. The noise has to go. The redundancy has to go. What survives is the skeleton of the problem.
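
To make the divergence concrete, here is a minimal sketch. It assumes the required rate is the information load divided by the working window, I_T / ((1 − P) · T), which is the reading suggested by the Coupling Law later in this essay. The function name and the numbers are illustrative, not part of any published implementation.

```python
def required_rate(p: float, total_time: float, info_load: float) -> float:
    """Rate needed to cover info_load in the window left after procrastinating.

    Assumption: the working window is (1 - p) * total_time.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    window = (1.0 - p) * total_time
    if window == 0.0:
        return float("inf")  # no window left: the required rate diverges
    return info_load / window


# Illustrative numbers: 100 units of material, 10 hours until the deadline.
for p in (0.0, 0.5, 0.9, 0.99):
    rate = required_rate(p, 10.0, 100.0)
    print(f"P = {p:<4}  required rate = {rate:.1f} units/hour")
```

With these numbers the required rate doubles at P = 0.5, is roughly ten times higher at P = 0.9, and roughly a hundred times higher at P = 0.99. No finite processor keeps up as P approaches 1.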

Laziness forces compression. Compression forces generalization.

This is not a motivational metaphor. It is an information-theoretic consequence. An always-on network (P = 0) has all the time in the world, so it memorizes everything, noise included. A procrastinating network is forced to learn the structure, because structure is the only thing you can compress into a last-minute sprint and still get the answer right.

The lazy student does not generalize because they are brilliant. They generalize because they have no other option.


The Coupling Law

You cannot set P and A independently. Physics will not let you. They are coupled by feasibility:

A_effective = A_0 · min(1, (1 - P) · T · C_max / I_T)

A_0 is your target quality. T is the total time. C_max is your peak processing rate. I_T is the total information load. Inside the min, the numerator (1 - P) · T · C_max is the work you can actually get done in the time you left yourself, and the denominator I_T is the work the task requires.

If (1 - P) · T · C_max ≥ I_T, you are fine: you left yourself enough time and your effective quality equals your target. But if you procrastinated too long and the fraction drops below one, your effective ambition automatically degrades. There is nothing you can do about it. The window closed.
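
The law is short enough to write down directly. The sketch below is a literal transcription of the formula above; the function name, argument names, and example numbers are made up for illustration.

```python
def effective_ambition(a0: float, p: float, total_time: float,
                       c_max: float, info_load: float) -> float:
    """A_effective = A_0 * min(1, (1 - P) * T * C_max / I_T)."""
    if info_load <= 0.0:
        raise ValueError("info_load must be positive")
    feasible_work = (1.0 - p) * total_time * c_max  # work that fits in the window
    return a0 * min(1.0, feasible_work / info_load)


# Illustrative numbers: T = 10, C_max = 20, I_T = 100, target A_0 = 0.9.
on_time = effective_ambition(0.9, 0.25, 10.0, 20.0, 100.0)   # window covers the load
too_late = effective_ambition(0.9, 0.75, 10.0, 20.0, 100.0)  # window covers half of it
print(on_time, too_late)
```

In the first call the window holds 150 units of work against a load of 100, so the target survives intact. In the second it holds only 50, and the effective ambition is cut in half.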

This creates an impossibility region in the PAD manifold (P, A, and the coupling term D that enters the loss below): a zone where the combination of high laziness and high ambition is simply not reachable. The forbidden region is bounded by the Laziness-Ambition Pareto Front, and every real system lives on one side of it. The interesting ones live close to the edge.


Evolving Laziness in Practice

You do not hand-tune P and A. You evolve them.

The implementation uses CMA-ES (Covariance Matrix Adaptation Evolution Strategy) to search over layer-wise procrastination and ambition parameters, with a Gumbel-Softmax relaxation to make the discrete compute-or-skip decision differentiable. In the forward pass, each layer hits a stochastic Bernoulli gate that routes the input either through the full compute path or around it via an identity skip-connection. The probability of skipping is that layer's P. Some layers fire on every input. Some layers almost never fire. The system learns which layers are worth waking up, and for what.
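
A minimal sketch of the gate, in plain Python so it runs anywhere. It is not the actual implementation: it assumes a layer fires with probability 1 − P, it uses the two-class (binary) form of the Gumbel-Softmax trick for the relaxed version, and every name in it is hypothetical. The CMA-ES search itself is left out.

```python
import math
import random


def gated_layer(x, layer_fn, p, rng):
    """Hard Bernoulli gate: fire with probability 1 - p, otherwise skip."""
    if rng.random() < 1.0 - p:
        return layer_fn(x)  # full compute path
    return x                # identity skip-connection


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def relaxed_gate(p, temperature, rng):
    """Binary Gumbel-Softmax sample in [0, 1] for the 'fire' decision."""
    eps = 1e-9
    q = min(max(1.0 - p, eps), 1.0 - eps)  # firing probability, clipped
    g_fire = -math.log(-math.log(max(rng.random(), eps)))  # Gumbel(0, 1) noise
    g_skip = -math.log(-math.log(max(rng.random(), eps)))
    logit = math.log(q) - math.log(1.0 - q) + g_fire - g_skip
    return _sigmoid(logit / temperature)


def soft_gated_layer(x, layer_fn, p, temperature, rng):
    """Blend compute path and skip path with a relaxed gate value."""
    g = relaxed_gate(p, temperature, rng)
    fx = layer_fn(x)
    return [g * f + (1.0 - g) * xi for f, xi in zip(fx, x)]


def double_all(v):
    """Toy stand-in for a layer: doubles every entry."""
    return [2.0 * vi for vi in v]


rng = random.Random(0)
fires = sum(gated_layer([1.0], double_all, 0.8, rng) == [2.0] for _ in range(1000))
print(f"layer with P = 0.8 fired on {fires} of 1000 inputs")
```

With P = 0.8 the layer should fire on about a fifth of the inputs. Lowering the temperature pushes the relaxed gate toward a hard 0 or 1, at the price of noisier gradients.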

The objective is the Unified PAD Loss:

L_PAD(θ; P, A) = A · L_task(θ)  +  (1 - P) · C(θ)  +  μ · D(P, A)

Three terms, three pressures, all pulling in different directions.

The first term is ambition-scaled error. When A is high, the task loss matters intensely. When A → 0, the first term and its gradient vanish. The system stops caring about perfection and satisfices: good enough is good enough, and any additional precision is wasted compute. This is the mathematical version of "a B-plus is fine."

The second term is the compute bill. C(θ) is the cost of the layers that fire, and it is weighted by (1 − P), the share of the time the system spends awake. Every active layer pays a tax, and the only way to lower it is to procrastinate more and shed computation, aggressively, until only the essential layers remain.

The third term is the coupling regularizer D(P, A). This is the enforcer of physical reality. If the system tries to be simultaneously very lazy and very ambitious, D explodes. You do not get to procrastinate all semester and then demand an A+.
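
Here is a small sketch of the three terms in code. It follows the formula exactly as written above, but the essay never gives D(P, A) in closed form, so the penalty below is an assumption: a squared hinge on how far A overshoots the ceiling set by the Coupling Law. The names, including `capacity_ratio` for T · C_max / I_T, are hypothetical.

```python
def coupling_penalty(p: float, a: float, capacity_ratio: float) -> float:
    """Hypothetical D(P, A): squared overshoot of A beyond the feasible ceiling.

    capacity_ratio stands for T * C_max / I_T from the Coupling Law.
    """
    a_max = min(1.0, (1.0 - p) * capacity_ratio)  # best quality the window allows
    return max(0.0, a - a_max) ** 2


def pad_loss(task_loss: float, compute_cost: float, p: float, a: float,
             mu: float, capacity_ratio: float) -> float:
    """L_PAD = A * L_task + (1 - P) * C + mu * D(P, A)."""
    return (a * task_loss
            + (1.0 - p) * compute_cost
            + mu * coupling_penalty(p, a, capacity_ratio))


# Illustrative numbers: L_task = 0.5, C = 2.0, mu = 100, T * C_max / I_T = 1.5.
feasible = pad_loss(0.5, 2.0, p=0.2, a=0.9, mu=100.0, capacity_ratio=1.5)
greedy = pad_loss(0.5, 2.0, p=0.9, a=0.9, mu=100.0, capacity_ratio=1.5)
print(f"feasible point: {feasible:.2f}   lazy and ambitious: {greedy:.2f}")
```

At P = 0.2 the penalty is zero and the loss is just the first two terms. At P = 0.9 with the same ambition, the ceiling drops to about 0.15 and the penalty dominates everything else. A steeper choice, such as a log barrier, would make D "explode" more literally; the squared hinge is only the simplest thing that behaves the right way.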


The Edge of Catastrophe

The optimal point on the PAD manifold is not the safe one. It is not the comfortable middle where P = 0.5 and A = 0.7 and everything is predictable. The optimal point is right at the boundary of the cramming phase transition:

P* = P_crit − δ

Maximum laziness. Minimum margin. Just barely enough runway to prevent an ambition collapse.
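
P_crit is not given a closed form here. One natural reading, taken as an assumption in the sketch below, is the value of P at which the feasibility fraction in the Coupling Law equals exactly one, which gives P_crit = 1 − I_T / (T · C_max). The function names and the default margin δ = 0.02 are illustrative.

```python
def critical_procrastination(total_time: float, c_max: float,
                             info_load: float) -> float:
    """Largest P with (1 - P) * T * C_max >= I_T, clipped to [0, 1]."""
    p_crit = 1.0 - info_load / (total_time * c_max)
    return min(1.0, max(0.0, p_crit))


def optimal_procrastination(total_time: float, c_max: float,
                            info_load: float, delta: float = 0.02) -> float:
    """P* = P_crit - delta, floored at zero."""
    p_crit = critical_procrastination(total_time, c_max, info_load)
    return max(0.0, p_crit - delta)


# Illustrative numbers: T = 10, C_max = 20, I_T = 100.
p_crit = critical_procrastination(10.0, 20.0, 100.0)
p_star = optimal_procrastination(10.0, 20.0, 100.0)
print(f"P_crit = {p_crit:.2f}, P* = {p_star:.2f}")
```

With these numbers the system can sleep through half of the available time before ambition starts to collapse. If the task does not fit even at full effort (I_T greater than T · C_max), the clipped P_crit is zero and there is no room to procrastinate at all.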

This is the agent that waits until the absolute last recoverable moment, and then compresses the entire problem into the minimum viable computation. It does not waste a single FLOP on a layer that can be skipped, or a precision target that exceeds the quality gate, or a moment of attention that could be deferred. It is not reckless. It is thermodynamically optimal. It spends exactly as much energy as reality requires and not one unit more.

The universe tends toward maximum entropy.

Intelligence tends toward maximum laziness.

They are the exact same tendency, viewed from different ends of the computation.


  • Herbert Simon, "A Behavioral Model of Rational Choice" (1955). The origin of the idea behind satisficing. Simon argued, decades before deep learning existed, that boundedly rational agents cannot optimize and should not try: they should set a quality threshold and stop the moment it is reached.
  • Alex Graves, "Adaptive Computation Time for Recurrent Neural Networks" (2016). An early and influential proposal for letting a network decide how much to think per input.
  • Nikolaus Hansen, "The CMA Evolution Strategy: A Tutorial" (2016). The optimizer that searches the PAD manifold.
© 2026 Kabir Murjani.