Sequence models / Linear algebra

Every Sequence Model
Forgets Wrong

And a zero-sum spectral update that doesn't.

Kabir Murjani  ·  28 March 2026


Forgetting is not the same as fading. When a sequence model applies a decay matrix with $|\lambda_i| < 1$, the old state shrinks but it does not disappear. It persists in every eigenmode, contributing noise to every subsequent prediction. When a KV cache evicts by recency or by accumulated attention score, the eviction criterion has nothing to do with whether the evicted direction actually conflicts with what the model is currently trying to encode. Everyone is solving a proxy for the right problem.

The right problem: given that new information just arrived and the memory budget is full, which existing direction is most geometrically in the way? That question has a closed-form answer, it is computable in $O(d)$ time via a rank-one Sherman-Morrison identity, and the update it produces guarantees exact spectral erasure of the removed direction. Not approximate erasure. Not small-weight-in-the-spectrum erasure. The direction ceases to exist in memory.

This note works through that update, why it works, and why existing forgetting mechanisms fall short of it.

I. What Everyone Has Tried

What if we weighted everything by relevance at query time?
Transformers [1]. It worked. But the KV cache grows linearly with sequence length, and eviction has no principled answer. StreamingLLM [2] keeps recency windows. H2O [3] evicts lowest cumulative attention. Both are heuristics built on top of the wrong objective.
What if we approximated softmax attention with a running outer product?
Linear attention [4]: $\Omega_t = \Omega_{t-1} + x_t x_t^\top$. Elegant kernel interpretation, linear cost. But rank grows without bound. Memory never forgets anything. It just accumulates.
What if the state transition used a structured diagonal matrix?
S4 [5] and the structured SSM family. HiPPO initialization, efficient convolutions. The old state decays by $|\lambda_i(A)| < 1$. The past is exponentially weighted, not removed.
What if the decay were input-dependent instead of fixed?
Mamba [6]. A real improvement. Data-dependent gates mean the model can learn when to forget more aggressively. But the mechanism is still decay. The stale direction shrinks; it does not cease to exist. In precision matrix terms, that corresponds to near-infinite uncertainty in a direction, which is not the same as that direction being absent.
What if the right direction to remove were determined by the geometric conflict between the incoming signal and the current memory?
Nobody has answered this. That is the gap this note fills.
Figure 1. Simulated memory direction activations over 80 timesteps under spectral annihilation (top panel) versus exponential decay (bottom panel). Cyan markers indicate annihilation events: a direction is killed in one step and a new one appears. Under decay, stale directions linger in the spectrum indefinitely. The colormap encodes activation magnitude: black (zero) through red to yellow (peak).

II. Memory as a Fixed-Rank Manifold

Define working memory as a rank-$k$ positive semidefinite precision matrix:

$$\Omega_{t-1} \;\in\; \mathcal{S}^d_+, \qquad \operatorname{rank}(\Omega_{t-1}) = k.$$

Think of $\Omega_{t-1}$ as a $k$-dimensional ellipsoid in $\mathbb{R}^d$. Each axis is a direction the system currently holds in memory, and the length of that axis encodes confidence. The rank $k$ is a hard constraint: the KV-cache budget, the FPGA register file, the maximum simultaneous factors in a market model. It cannot grow. When a new state $x_t \in \mathbb{R}^d$ arrives and the ellipsoid is full, one axis gets replaced. The question is which one.

Why a precision matrix

A precision matrix $\Omega$ encodes certainty. High eigenvalue along direction $v$ means the system is confident about $v$. Decaying an SSM state toward zero is equivalent to letting the corresponding eigenvalues of $\Omega$ shrink toward zero, i.e., letting the covariance eigenvalues grow toward infinity: unbounded uncertainty, not absence of knowledge. A decayed eigenvalue is small but still in the spectrum; spectral annihilation removes it outright. These produce different nullspaces, and only the latter prevents the stale direction from affecting future queries.
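The nullspace distinction can be checked in a few lines of numpy. This is a toy sketch (dimensions, eigenvalues, and seed are illustrative, not from the text): decay leaves the stale direction retrievable, while a two-sided projection removes it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-2 precision matrix in R^4: two confident directions.
V = np.linalg.qr(rng.standard_normal((4, 2)))[0]
Omega = V @ np.diag([3.0, 1.5]) @ V.T
v_stale = V[:, 1]  # the direction we want the memory to forget

# Exponential decay: the stale eigenvalue shrinks but stays in the spectrum.
Omega_decay = 0.9 * Omega

# Spectral annihilation: project the stale axis out of both sides.
P = np.eye(4) - np.outer(v_stale, v_stale)
Omega_kill = P @ Omega @ P

print(np.linalg.norm(Omega_decay @ v_stale))  # still well above zero
print(np.linalg.norm(Omega_kill @ v_stale))   # ~0: direction removed exactly
```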

III. The Target: Memory Response to the New Input

The direction to remove is not the eigenvector of $\Omega_{t-1}$ with the smallest eigenvalue. It is the direction in current memory most activated by the new signal, because that is the one that would interfere most with encoding $x_t$. Define it as:

$$y \;=\; \frac{\Omega_{t-1}\, x_t}{\|\Omega_{t-1}\, x_t\|}.$$

This is the memory's response when $x_t$ is pushed through it: the direction inside $\operatorname{Col}(\Omega_{t-1})$ that the new input activates most strongly. If $y$ is closely aligned with $x_t$, memory is already holding something geometrically close to the new signal. That is what gets replaced. The containment $y \in \operatorname{Col}(\Omega_{t-1})$ holds by construction, which is required for the rank argument to work cleanly.
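The containment claim is cheap to verify numerically. The sketch below (factored toy memory, arbitrary seed and eigenvalues) builds $y$ and confirms it lies in the column space of $\Omega_{t-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3

# Factored rank-k memory: Omega = V diag(lam) V^T  (toy state).
V = np.linalg.qr(rng.standard_normal((d, k)))[0]
lam = np.array([4.0, 2.0, 1.0])
x = rng.standard_normal(d)

# Memory's response to the new input: the stale direction to target.
Omega_x = V @ (lam * (V.T @ x))
y = Omega_x / np.linalg.norm(Omega_x)

# y is in Col(Omega) by construction: residual after projecting onto V is ~0.
residual = y - V @ (V.T @ y)
print(np.linalg.norm(residual))  # ~0
```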

IV. The Generalized Eigenvalue Problem

We want a unit vector $w \in \mathbb{R}^d$ such that projecting memory through $(I - ww^\top)$ kills $y$ while leaving everything orthogonal to $y$ untouched. Write this as minimizing a generalized Rayleigh quotient over the matrix pencil $(A, B)$:

$$A = x_t x_t^\top \qquad \text{(new signal covariance)}$$

$$B = yy^\top + \varepsilon I \qquad \text{(stale covariance, noise-regularized)}$$

$$R(w) = \frac{w^\top B\, w}{w^\top A\, w}.$$

Small $R(w)$ means $w$ has high projection onto the new signal and low projection onto the stale direction. By the Courant-Fischer minimax theorem [7], the minimizer is the dominant eigenvector of $B^{-1}A$. Since $A = x_t x_t^\top$ has rank one:

$$B^{-1}A\, v = B^{-1}x_t \cdot (x_t^\top v).$$

Every eigenvector is proportional to $B^{-1}x_t$. The full GEVP reduces to one matrix-vector product.
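One way to sanity-check the reduction: form $B^{-1}A$ densely, then confirm that the normalized $B^{-1}x_t$ is its eigenvector without running an eigensolver (dimensions, $\varepsilon$, and seed below are arbitrary choices, not from the text).

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 6, 1e-2
x = rng.standard_normal(d)
y = rng.standard_normal(d)
y /= np.linalg.norm(y)

A = np.outer(x, x)                    # new-signal covariance, rank one
B = np.outer(y, y) + eps * np.eye(d)  # regularized stale covariance

# B^{-1} A maps every vector into span(B^{-1} x), so its single
# nonzero-eigenvalue eigenvector must be proportional to B^{-1} x.
M = np.linalg.solve(B, A)
w = np.linalg.solve(B, x)
w /= np.linalg.norm(w)

lam = w @ (M @ w)                       # Rayleigh estimate of the eigenvalue
print(np.linalg.norm(M @ w - lam * w))  # ~0: w is an eigenvector of B^{-1} A
```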

V. O(d) via Sherman-Morrison

$B = yy^\top + \varepsilon I$ is a rank-one perturbation of a scaled identity. Sherman-Morrison [8] gives the exact inverse analytically:

$$B^{-1} = \frac{1}{\varepsilon}\!\left(I - \frac{yy^\top}{\varepsilon + \|y\|^2}\right).$$

Apply to $x_t$ directly, with $\|y\| = 1$:

$$B^{-1}x_t = \frac{1}{\varepsilon}\!\left(x_t - \frac{(y^\top x_t)}{\varepsilon + 1}\,y\right).$$

Two dot products, two scalar multiplications, one vector subtraction. No matrix inversion. No iterative solver. Normalize to get $w^*$. The entire path from receiving $x_t$ to having the projection direction is $O(d)$ in time and space.
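A quick numerical check of the closed form against a dense solve (sizes, $\varepsilon$, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, eps = 512, 1e-3
x = rng.standard_normal(d)
y = rng.standard_normal(d)
y /= np.linalg.norm(y)

# O(d) Sherman-Morrison closed form for B^{-1} x with ||y|| = 1.
w = (x - (np.dot(y, x) / (eps + 1.0)) * y) / eps

# Reference: dense solve of B w = x, which costs O(d^3).
B = np.outer(y, y) + eps * np.eye(d)
w_ref = np.linalg.solve(B, x)

# Relative error should be near machine precision.
print(np.linalg.norm(w - w_ref) / np.linalg.norm(w_ref))
```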

VI. The Update

$$\Omega_t = (I - w^*{w^*}^\top)\;\Omega_{t-1}\;(I - w^*{w^*}^\top) + x_t x_t^\top.$$

The first term takes the current memory and removes the $w^*$ axis from both sides (the projection must be applied on both sides because $\Omega$ is symmetric). Whenever $w^*$ lies in $\operatorname{Col}(\Omega_{t-1})$, which holds in the conflict case that triggers replacement, the result has rank at most $k-1$. The second term adds $x_t$ as a fresh axis, restoring rank to exactly $k$. The subspace rotates. It does not grow.

For the erasure guarantee: at the next timestep, a query in direction $y$ computes $\Omega_t y$. The congruence term gives $(I - w^*{w^*}^\top)\Omega_{t-1}(I - w^*{w^*}^\top)y$. In the high-conflict regime, where $y$ is strongly aligned with $x_t$, the Sherman-Morrison subtraction leaves $w^*$ pointing along $y$, so ${w^*}^\top y \approx 1$, $(I - w^*{w^*}^\top)y \approx 0$, and the whole congruence term collapses. Memory cannot retrieve $y$. The direction has been removed from the spectrum.
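The update can be exercised end to end in a few lines. The sketch below deliberately sends $x_t$ along a stored direction so that the conflict case applies; all dimensions and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps = 6, 1e-4

# Memory holds two directions; the new input arrives along the first one.
V = np.linalg.qr(rng.standard_normal((d, 2)))[0]
Omega = V @ np.diag([3.0, 1.0]) @ V.T
x = 2.0 * V[:, 0]          # new signal conflicts with a stored direction

# Target: the memory's response to x.
Oy = Omega @ x
y = Oy / np.linalg.norm(Oy)

# Projection axis via the Sherman-Morrison closed form (||y|| = 1).
w = x - (np.dot(y, x) / (eps + 1.0)) * y
w /= np.linalg.norm(w)

# Congruence downdate + rank-one update.
P = np.eye(d) - np.outer(w, w)
Omega_new = P @ Omega @ P + np.outer(x, x)

print(np.linalg.matrix_rank(Omega_new))   # rank budget preserved: 2
print(np.linalg.norm(P @ Omega @ P @ y))  # ~0: y erased from the old memory
```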

Figure 2. Eigenvalue spectrum of the memory matrix $\Omega$ after 50 timesteps under exponential decay (left) versus spectral annihilation (right). Under decay, stale eigenvalues accumulate near zero but remain in the spectrum, continuing to influence future queries. Under annihilation, the spectrum stays clean: exactly $k$ well-separated eigenvalues, no near-zero residue.

VII. Implementation

Maintain $\Omega$ in factored form: $\Omega = V\Lambda V^\top$ where $V \in \mathbb{R}^{d \times k}$ has orthonormal columns and $\Lambda$ is diagonal. Memory footprint is $O(dk)$. Per-step cost:

1. $y = V(\Lambda(V^\top x_t))$, normalize. $O(dk)$.

2. $w^* \propto x_t - \tfrac{(y^\top x_t)}{\varepsilon+1}\,y$, normalize. $O(d)$.

3. Rank-1 downdate of $(V,\Lambda)$ along $w^*$: removes one column. $O(dk)$.

4. Rank-1 update: append $x_t$ to $V$, new eigenvalue to $\Lambda$. $O(dk)$.

Total: $O(dk)$ per step. Every operation is a dot product or a rank-one outer product. No backward pass, no optimizer state, no iteration. Maps directly to INT8 tensor cores. On an FPGA with $k$ parallel dot-product units, the four steps pipeline to $O(d)$ cycles.
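The four steps above can be sketched with the memory held as a single factor $W$ with $\Omega = WW^\top$, a simplification of the $(V,\Lambda)$ form assumed here for brevity. Dropping the smallest-norm column in step 3 is a heuristic this sketch introduces, not something the text specifies; in the conflict case it is the column collapsed by the projection.

```python
import numpy as np

def step(W, x, eps=1e-4):
    """One O(dk) memory update with Omega = W @ W.T (sketch; W is (d, k))."""
    # 1. Memory response to the new input -- the stale target.  O(dk).
    Oy = W @ (W.T @ x)
    y = Oy / np.linalg.norm(Oy)
    # 2. Projection axis via Sherman-Morrison (||y|| = 1).  O(d).
    w = x - (np.dot(y, x) / (eps + 1.0)) * y
    w /= np.linalg.norm(w)
    # 3. Congruence downdate Omega -> P Omega P, applied to the factor.  O(dk).
    W = W - np.outer(w, w @ W)
    # 4. Drop the collapsed column, append x as the fresh axis.  O(dk).
    drop = int(np.argmin(np.linalg.norm(W, axis=0)))
    W = np.delete(W, drop, axis=1)
    return np.column_stack([W, x])
```

With the new input aligned to a stored axis, the rank budget is preserved and the conflicting column is the one replaced.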

VIII. How It Compares

| Architecture | Update | Forget mechanism | Rank |
|---|---|---|---|
| Transformer [1] | Append to KV cache | Attention score at query time | Grows; hard eviction when full |
| Linear attention [4] | $\Omega_t = \Omega_{t-1} + x_tx_t^\top$ | None | Unbounded growth |
| S4 / Mamba [5,6] | $h_t = Ah_{t-1} + Bx_t$ | Eigenvalue decay | Fixed; small eigenvalues persist |
| StreamingLLM [2] | Sliding window | Recency | Fixed window, age-based eviction |
| H2O [3] | KV cache with scoring | Cumulative attention score | Fixed budget, proxy eviction |
| This note | Congruence downdate + rank-1 update | Spectral annihilation by geometric conflict | Exactly $k$, provably |
Figure 3. 2D projection of the memory subspace (rank $k=2$ for illustration). Each ellipse axis is a memory direction. Under annihilation, one axis is removed in a single step and a new axis grows to replace it; the ellipse rotates sharply. Under decay, both axes shrink toward zero and the ellipse collapses rather than rotates.

IX. Open Problems

Staleness without a trigger. The scheme annihilates on arrival of a new input. In sparse-input regimes, old directions age without a signal to evict them. A time-indexed eigenvalue schedule on $\Lambda$, separate from the structural annihilation, would handle this. The right schedule is an open question and interacts nontrivially with the rank budget.

Multi-head coupling. Running $h$ independent rank-$k$ precision matrices in parallel gives $h$ concurrent rotating subspaces. The information-geometric mean on the positive definite cone [9] is the natural merge operation, but its naive $O(d^2)$ cost kills the budget. Getting it to $O(dk)$ is open.

Perturbation stability. If $x_t$ is estimated from a noisy upstream layer, error $\delta$ in $x_t$ propagates through Sherman-Morrison into $w^*$ and into the column space of $\Omega_t$. A bound on $\|w^*(x_t + \delta) - w^*(x_t)\|$ as a function of $\|\delta\|$ and $\varepsilon$ would give the minimum regularization needed. The answer likely depends on the eigenvalue gap in $\Omega_{t-1}$.

Learning the rank. The rank $k$ is fixed here. A trainable version would require backpropagation through the congruence update and through Sherman-Morrison. Both are differentiable with rank-one Jacobians, so the gradient cost is $O(d)$ per step. Whether the loss landscape supports stable SGD is empirical.

The core claim is narrow. When a bounded-memory system receives a new input and the budget is full, the right direction to remove is the one most geometrically conflicted with the new signal. That direction is computable in $O(d)$ via Sherman-Morrison on the dominant eigenvector of a rank-one matrix pencil. The update maintains rank exactly $k$, guarantees spectral erasure of the removed direction, and costs $O(dk)$ per step with no iterative components.

Decay is an approximation of forgetting. This achieves it. The cost is the same order. There is no reason to approximate.

References

  1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
  2. Xiao, G. et al. (2023). Efficient Streaming Language Models with Attention Sinks. ICLR 2024.
  3. Zhang, Z. et al. (2023). H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. NeurIPS 2023.
  4. Katharopoulos, A. et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
  5. Gu, A., Goel, K., and Re, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022.
  6. Gu, A., and Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  7. Horn, R. A., and Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press. (Courant-Fischer theorem, Ch. 4.)
  8. Sherman, J., and Morrison, W. J. (1950). Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. Annals of Mathematical Statistics, 21(1):124-127.
  9. Moakher, M. (2005). A Differential Geometric Approach to the Geometric Mean of Symmetric Positive-Definite Matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):735-747.