And a zero-sum spectral update that doesn't.
Forgetting is not the same as fading. When a sequence model applies a decay matrix with $|\lambda_i| < 1$, the old state shrinks but it does not disappear. It persists in every eigenmode, contributing noise to every subsequent prediction. When a KV cache evicts by recency or by accumulated attention score, the eviction criterion has nothing to do with whether the evicted direction actually conflicts with what the model is currently trying to encode. Everyone is solving a proxy for the right problem.
The right problem: given that new information just arrived and the memory budget is full, which existing direction is most geometrically in the way? That question has a closed-form answer, it is computable in $O(d)$ time via a rank-one Sherman-Morrison identity, and the update it produces guarantees exact spectral erasure of the removed direction. Not approximate erasure. Not small-weight-in-the-spectrum erasure. The direction ceases to exist in memory.
This note works through that update, why it works, and where existing architectures fall short of it.
Define working memory as a rank-$k$ positive semidefinite precision matrix:
$$\Omega_{t-1} \;\in\; \mathcal{S}^d_+, \qquad \operatorname{rank}(\Omega_{t-1}) = k.$$
Think of $\Omega_{t-1}$ as a $k$-dimensional ellipsoid in $\mathbb{R}^d$. Each axis is a direction the system currently holds in memory, and the length of that axis encodes confidence. The rank $k$ is a hard constraint: the KV-cache budget, the FPGA register file, the maximum simultaneous factors in a market model. It cannot grow. When a new state $x_t \in \mathbb{R}^d$ arrives and the ellipsoid is full, one axis gets replaced. The question is which one.
A precision matrix $\Omega$ encodes certainty. A high eigenvalue along direction $v$ means the system is confident about $v$. Decaying an SSM state toward zero is equivalent to letting the eigenvalues of $\Omega$ shrink toward zero: dwindling confidence, but never absence of knowledge, because a decayed eigenvalue is small yet nonzero. Spectral annihilation sets the eigenvalue to exactly zero. The two mechanisms produce different nullspaces, and only annihilation puts the stale direction in the nullspace, which is what prevents it from affecting future queries.
The direction to remove is not the smallest eigenvalue of $\Omega_{t-1}$. It is the direction in current memory most activated by the new signal, because that is the one that would interfere most with encoding $x_t$. Define it as:
$$y \;=\; \frac{\Omega_{t-1}\, x_t}{\|\Omega_{t-1}\, x_t\|}.$$
This is the memory's response when $x_t$ is pushed through it: the direction inside $\operatorname{Col}(\Omega_{t-1})$ that the new input activates most strongly. If $y$ is closely aligned with $x_t$, memory is already holding something geometrically close to the new signal. That is what gets replaced. The containment $y \in \operatorname{Col}(\Omega_{t-1})$ holds by construction, which is required for the rank argument to work cleanly.
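The factored memory and the response direction can be sketched in a few lines of numpy (a sketch; the names `V`, `lam`, `Omega` are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8

# Rank-k precision matrix in factored form: Omega = V diag(lam) V^T.
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # k orthonormal axes
lam = rng.uniform(0.5, 2.0, size=k)                # per-axis confidence
Omega = V @ np.diag(lam) @ V.T

x_t = rng.standard_normal(d)                       # new input

# Memory's normalized response: the stale direction y.
y = Omega @ x_t
y /= np.linalg.norm(y)

# Containment: y lies in Col(Omega) = Col(V) by construction.
print(np.linalg.norm(y - V @ (V.T @ y)) < 1e-10)   # True
```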
We want a unit vector $w \in \mathbb{R}^d$ such that projecting memory through $(I - ww^\top)$ kills $y$ while leaving everything orthogonal to $y$ untouched. Write this as minimizing a generalized Rayleigh quotient over the matrix pencil $(A, B)$:
$$A = x_t x_t^\top \qquad \text{(new signal covariance)}$$
$$B = yy^\top + \varepsilon I \qquad \text{(stale covariance, noise-regularized)}$$
$$R(w) = \frac{w^\top B\, w}{w^\top A\, w}.$$
Small $R(w)$ means $w$ has high projection onto the new signal and low projection onto the stale direction. By the Courant-Fischer minimax theorem [7], the minimizer is the dominant eigenvector of $B^{-1}A$. Since $A = x_t x_t^\top$ has rank one:
$$B^{-1}A\, v = B^{-1}x_t \cdot (x_t^\top v).$$
Every eigenvector with a nonzero eigenvalue is proportional to $B^{-1}x_t$. The full generalized eigenvalue problem reduces to one matrix-vector product.
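A quick numerical check of the reduction (a sketch; `M` and `u` are illustrative names): $u = B^{-1}x_t$ is an eigenvector of $B^{-1}A$ with eigenvalue $x_t^\top B^{-1}x_t$, and the matrix is rank one, so no other eigendirection carries weight.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps = 32, 1e-2

y = rng.standard_normal(d); y /= np.linalg.norm(y)   # unit stale direction
x = rng.standard_normal(d)                           # new signal

B = np.outer(y, y) + eps * np.eye(d)
M = np.linalg.solve(B, np.outer(x, x))               # B^{-1} A

# u = B^{-1} x is an eigenvector of M with eigenvalue x^T B^{-1} x ...
u = np.linalg.solve(B, x)
print(np.allclose(M @ u, (x @ u) * u))               # True

# ... and M is rank one, so that is the only eigendirection that matters.
s = np.linalg.svd(M, compute_uv=False)
print(s[1] / s[0] < 1e-10)                           # True
```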
$B = yy^\top + \varepsilon I$ is a rank-one perturbation of a scaled identity. Sherman-Morrison [8] gives the exact inverse analytically:
$$B^{-1} = \frac{1}{\varepsilon}\!\left(I - \frac{yy^\top}{\varepsilon + \|y\|^2}\right).$$
Applying this to $x_t$, with $\|y\| = 1$ by construction:
$$B^{-1}x_t = \frac{1}{\varepsilon}\!\left(x_t - \frac{(y^\top x_t)}{\varepsilon + 1}\,y\right).$$
Two dot products, two scalar multiplications, one vector subtraction. No matrix inversion. No iterative solver. Normalize to get $w^*$. The entire path from receiving $x_t$ to having the projection direction is $O(d)$ in time and space.
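The closed form is easy to verify against a direct solve; a minimal sketch (`closed` and `w_star` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 32, 1e-2

y = rng.standard_normal(d); y /= np.linalg.norm(y)
x = rng.standard_normal(d)

# Closed form, using ||y|| = 1: B^{-1} x = (x - (y.x)/(eps+1) y) / eps
closed = (x - (y @ x) / (eps + 1.0) * y) / eps

# Reference: explicit solve against B = y y^T + eps I.
B = np.outer(y, y) + eps * np.eye(d)
print(np.allclose(closed, np.linalg.solve(B, x)))    # True

w_star = closed / np.linalg.norm(closed)             # projection direction
```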
The memory update is a two-sided congruence projection followed by a rank-one write:
$$\Omega_t = (I - w^*{w^*}^\top)\;\Omega_{t-1}\;(I - w^*{w^*}^\top) + x_t x_t^\top.$$
The first term takes the current memory, removes the $w^*$ axis from both sides (because $\Omega$ is symmetric), and leaves a matrix with rank at most $k-1$. The second term adds $x_t$ as a fresh axis, restoring rank to exactly $k$. The subspace rotates. It does not grow.
For the erasure guarantee: at the next timestep, a query in direction $y$ computes $\Omega_t y$. The congruence term gives $(I - w^*{w^*}^\top)\Omega_{t-1}(I - w^*{w^*}^\top)y$. Since $w^*$ is constructed to align with $y$, we have ${w^*}^\top y \approx 1$, so $(I - w^*{w^*}^\top)y \approx 0$, and the whole congruence term collapses. Memory cannot retrieve $y$. The direction has been removed from the spectrum.
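The erasure property can be sketched numerically under an engineered conflict, where the new input nearly repeats a stored axis so that $w^*$ aligns with $y$ (the setup and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, eps = 64, 8, 0.1

V, _ = np.linalg.qr(rng.standard_normal((d, k)))
lam = rng.uniform(0.5, 2.0, size=k)
Omega = V @ np.diag(lam) @ V.T

# Engineer a conflict: the new input nearly repeats a stored axis.
x_t = V[:, 0] + 1e-5 * rng.standard_normal(d)

y = Omega @ x_t
y /= np.linalg.norm(y)

w = x_t - (y @ x_t) / (eps + 1.0) * y    # Sherman-Morrison direction
w /= np.linalg.norm(w)

P = np.eye(d) - np.outer(w, w)
congruence = P @ Omega @ P

# Exact: the projector annihilates w itself.
print(np.linalg.norm(P @ w) < 1e-12)     # True

# In the conflict regime w aligns with y, so the old memory term
# can barely respond along the stale direction.
print(np.linalg.norm(congruence @ y) / np.linalg.norm(Omega @ y))
```

With a generic (non-conflicting) $x_t$, the alignment ${w^*}^\top y \approx 1$ need not hold; the guarantee is about erasing whatever was geometrically in the way of the write.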
Maintain $\Omega$ in factored form: $\Omega = V\Lambda V^\top$ where $V \in \mathbb{R}^{d \times k}$ has orthonormal columns and $\Lambda$ is diagonal. Memory footprint is $O(dk)$. Per-step cost:
1. $y = V(\Lambda(V^\top x_t))$, normalize. $O(dk)$.
2. $w^* \propto x_t - \tfrac{(y^\top x_t)}{\varepsilon+1}\,y$, normalize. $O(d)$.
3. Rank-1 downdate of $(V,\Lambda)$ along $w^*$: removes one column. $O(dk)$.
4. Rank-1 update: append $x_t$ to $V$ and a new eigenvalue to $\Lambda$. $O(dk)$.
Total: $O(dk)$ per step. Every operation is a dot product or a rank-one outer product. No backward pass, no optimizer state, no iteration. Maps directly to INT8 tensor cores. On an FPGA with $k$ parallel dot-product units, the four steps pipeline to $O(d)$ cycles.
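The four steps above can be sketched end to end. For brevity this sketch keeps a square-root factor $\Omega = WW^\top$ rather than the eigenfactored $(V, \Lambda)$ pair, and it re-compresses with an SVD, which costs $O(dk^2)$ rather than the $O(dk)$ of a dedicated rank-one downdate; `step` and `W` are illustrative names:

```python
import numpy as np

def step(W, x_t, eps=0.1):
    """One memory step on a square-root factor: Omega = W @ W.T, W is d x k.

    1-2: stale direction y and projection direction w (Sherman-Morrison).
    3:   congruence downdate, re-compressed from k to k-1 columns.
    4:   rank-one write of the new input x_t.
    """
    d, k = W.shape
    y = W @ (W.T @ x_t)                     # Omega @ x_t
    y /= np.linalg.norm(y)
    w = x_t - (y @ x_t) / (eps + 1.0) * y   # closed-form B^{-1} x_t direction
    w /= np.linalg.norm(w)
    U = W - np.outer(w, w @ W)              # (I - w w^T) W
    # Truncate to k-1 columns; exact when w lies in Col(W), otherwise a
    # small spectral tail (the k-th singular value squared) is dropped.
    Q, s, _ = np.linalg.svd(U, full_matrices=False)
    U = Q[:, : k - 1] * s[: k - 1]
    return np.column_stack([U, x_t])        # rank restored to exactly k
```

Under this factoring the congruence term is just a projection of the factor's columns and the rank-one write is a column append; recovering the eigenfactored $(V,\Lambda)$ form costs one extra $k \times k$ eigendecomposition.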
| Architecture | Update | Forget mechanism | Rank |
|---|---|---|---|
| Transformer [1] | Append to KV cache | Attention score at query time | Grows; hard eviction when full |
| Linear Attention [4] | $\Omega_t = \Omega_{t-1} + x_tx_t^\top$ | None | Unbounded growth |
| S4 / Mamba [5,6] | $h_t = Ah_{t-1} + Bx_t$ | Eigenvalue decay | Fixed; small eigenvalues persist |
| StreamingLLM [2] | Sliding window | Recency | Fixed window, age-based eviction |
| H2O [3] | KV with scoring | Cumulative attention score | Fixed budget, proxy eviction |
| This | Congruence downdate + rank-1 update | Spectral annihilation by geometric conflict | Exactly $k$, provably |
**Staleness without a trigger.** The scheme annihilates on arrival of a new input. In sparse-input regimes, old directions age without a signal to evict them. A time-indexed eigenvalue schedule on $\Lambda$, separate from the structural annihilation, would handle this. The right schedule is an open question and interacts nontrivially with the rank budget.

**Multi-head coupling.** Running $h$ independent rank-$k$ precision matrices in parallel gives $h$ concurrent rotating subspaces. The information-geometric mean on the positive-definite cone [9] is the natural merge operation, but computing it naively on dense $d \times d$ matrices breaks the budget. Getting it to $O(dk)$ is open.

**Perturbation stability.** If $x_t$ is estimated from a noisy upstream layer, an error $\delta$ in $x_t$ propagates through Sherman-Morrison into $w^*$ and into the column space of $\Omega_t$. A bound on $\|w^*(x_t + \delta) - w^*(x_t)\|$ as a function of $\|\delta\|$ and $\varepsilon$ would give the minimum regularization needed. The answer likely depends on the eigenvalue gap of $\Omega_{t-1}$.

**Learning the rank.** The rank $k$ is fixed here. A trainable version would require backpropagation through the congruence update and through Sherman-Morrison. Both are differentiable with rank-one Jacobians, so the gradient cost is $O(d)$ per step. Whether the loss landscape supports stable SGD is an empirical question.
The core claim is narrow. When a bounded-memory system receives a new input and the budget is full, the right direction to remove is the one most geometrically conflicted with the new signal. That direction is computable in $O(d)$ via Sherman-Morrison on the dominant eigenvector of a rank-one matrix pencil. The update maintains rank exactly $k$, guarantees spectral erasure of the removed direction, and costs $O(dk)$ per step with no iterative components.
Decay is an approximation of forgetting. This update achieves it exactly. The cost is the same order. There is no reason to approximate.