any time with Transformers, you already know attention is the brain of the whole operation. It is what lets the model figure out which tokens are talking to each other, and that one mechanism is responsible for almost everything impressive LLMs do.
Attention works with three components: Query (Q), Key (K), and Value (V) [1]. The dot product between Q and K is what tells the model how much each token should focus on the others, and that is essentially the core of what attention does.
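To make this concrete, here is a minimal numpy sketch of scaled dot-product attention; the function name and toy shapes are mine, purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch: scores from Q.K^T, softmax, then weight V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # attention-weighted mix of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))  # 5 tokens, head_dim 16
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 16)
```

The Q·Kᵀ product is exactly the "who should focus on whom" signal described above; everything else is bookkeeping around it.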
Now, calling attention the “brain” also means it comes with a cost. During inference, every time a new token is being predicted, the K and V matrices are recalculated for all the previous tokens too. So if 90 tokens are already there and the model is predicting the 91st, it goes back and recomputes KV for all 90. Isn’t this repetitiveness a waste?
KV cache changed this. The idea is straightforward: instead of recomputing, store the K and V matrices in VRAM and reuse them during inference. Sounds simple, right? That is probably why every major LLM out there has adopted it; the drop in latency is hard to argue with.
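In sketch form, the mechanism is just an append-only store per layer. The class and shapes below are illustrative, not any particular framework's API:

```python
import numpy as np

class KVCache:
    """Minimal per-layer KV cache sketch: compute K/V for a token once, keep it."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # the new token's K and V rows are stored, never recomputed
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

rng = np.random.default_rng(0)
cache = KVCache(head_dim=16)
for _ in range(90):  # 90 tokens already processed
    cache.append(rng.standard_normal((1, 16)), rng.standard_normal((1, 16)))
# predicting token 91: only the newest token's K/V is computed and appended
cache.append(rng.standard_normal((1, 16)), rng.standard_normal((1, 16)))
```

Attention for the new token then reads `cache.keys` and `cache.values` directly instead of re-deriving them from all 90 previous tokens.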
Though the KV cache came as a boon for LLMs, it brought new challenges: additional memory overhead. This might not be a big issue for SLMs, but mega-LLMs with billions of parameters became harder to load on machines; roughly 20-30% of additional VRAM is consumed by the KV cache alone. The bigger limitation is that this overhead is not static, it keeps growing. It can approach the size of the model itself with long contexts or many concurrent users, since each user gets their own KV cache. To resolve this, researchers introduced different approaches such as Grouped-Query Attention (GQA) [2], PagedAttention (vLLM) [3], and quantization (to 4-bit or 8-bit). All of these helped with the memory overhead, but accuracy had to be compromised. There was no solution that could both compress the cache and retain the original accuracy. Then came TurboQuant from Google, which manages to do both. The authors also prove that this solution sits at the theoretical optimum: the best possible for this class of problem.
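To get a feel for how fast this overhead grows, here is a back-of-the-envelope size calculation. The model dimensions are assumed, illustrative values in the style of a 7B model, not any specific model's published config:

```python
# KV cache bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes × users
layers, kv_heads, head_dim = 32, 32, 128   # assumed 7B-style dimensions
seq_len, users = 4096, 8                   # each concurrent user gets its own cache
bytes_fp16 = 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16 * users
print(f"{kv_bytes / 2**30:.1f} GiB")       # → 16.0 GiB
```

Note that the total scales linearly with both context length and number of users, which is exactly why the overhead "keeps growing" in practice.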
TurboQuant comes with two stages: PolarQuant and Residual Correction [4].
PolarQuant (Stage 1): Compresses the K and V matrices.
Residual Correction (Stage 2): Corrects the quantization error left after PolarQuant, recovering lost information.
Applying both sequentially is what makes it different from traditional quantization. Here is a visual breakdown:
That should give you a clear picture of TurboQuant's pipeline and how it differs from the traditional quantization we talked about. Before we dive into each stage, let us address another important question: since we are talking about reducing memory overhead, what exactly does TurboQuant store in the cache, and how much less memory does it actually take up? Let us look into that visually below:

You might not fully grasp what Idx, QJL, and ε mean just yet, but they’ll become clear as we unpack this pipeline step by step. For now, the table above gives you the essential idea: it shows exactly what TurboQuant stores compared with traditional quantization.
The key takeaway? Even though both techniques achieve identical compression rates (the extra ε scalar is negligible once you spread it across the vector dimensions), TurboQuant keeps accuracy on par with the original full-precision model. In fact, the official paper reports that TurboQuant delivers 4.5-5x KV cache compression (an effective 3.5-2.5 bits per channel) with near-zero accuracy loss in practice. That is pretty phenomenal.
Now let’s walk through the actual step-by-step flow of TurboQuant, the exact sequence we previewed in the diagram earlier.
Stage 1 (PolarQuant):
This involves two major operations: Rotation and Lloyd-Max Quantization.
But why rotation in the first place? The major flaw of traditional quantization is how badly it handles outliers. To make this concrete, let's assume we have a 4-dimensional key vector for a token: [0.125, 0.103, 0.220, 6.030] (outliers like this are actually pretty common in attention keys). If we quantize it traditionally, the quantizer has to stretch its limited levels to cover that massive 6.030 spike. The result? Something like [0, 0, 0, 1]; almost all the information is lost.
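You can reproduce this failure in a few lines. The snippet uses plain min-max uniform quantization as a toy baseline of my own, not any specific library's implementation:

```python
import numpy as np

def uniform_quantize(x, bits):
    """Plain min-max uniform quantization: outliers stretch the whole range."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    idx = np.round((x - lo) / step)   # snap each value to its nearest level
    return lo + idx * step

k = np.array([0.125, 0.103, 0.220, 6.030])   # one outlier dominates the range
k_hat = uniform_quantize(k, bits=1)
print(k_hat)   # the three small values all collapse onto a single level
```

With only two levels stretched across [0.103, 6.030], the distinctions among the three small coordinates are wiped out entirely, exactly the [0, 0, 0, 1] pattern described above.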
Rotating the vector resolves this issue. This “spinning” of the vector in high-dimensional space (y = R·x, where R is a random orthogonal rotation matrix) removes the spike and spreads its energy across the other coordinates, making the vector's distribution smooth (isotropic). The individual values change, but the overall magnitude stays the same. After rotation, the same example vector might look more balanced, something like [1.42, -0.85, 2.31, 0.97].
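A quick sketch of this step: sampling R via QR decomposition of a Gaussian matrix is one common way to get a random orthogonal matrix, though the paper's exact choice of rotation may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# random orthogonal R: QR decomposition of a Gaussian matrix (one common recipe)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = np.array([0.125, 0.103, 0.220, 6.030])   # spiky key vector
y = R @ x                                     # rotated: the spike's energy spreads out

print(np.linalg.norm(x), np.linalg.norm(y))   # norms match: rotation is lossless
```

Because R is orthogonal, ‖y‖ = ‖x‖ exactly, so nothing is lost; the information is only redistributed across coordinates.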

This smoothed distribution for high-dimensional vectors brings us close to a Gaussian: after a random rotation, the vector is uniformly distributed on the sphere of its original norm, and for large dimensions each coordinate is approximately Gaussian. As a result, the fraction of the vector's energy in each coordinate follows a Beta distribution:

y_i² / ‖y‖² ~ Beta(1/2, (d−1)/2)

where d is the head dimension.
Tip (skip if you’re not into the math details): This is linked to a fundamental property in multivariate statistics: if X₁, X₂, …, X_d ~ N(0, 1) are independent and identically distributed (i.i.d.), then each X_i² follows a Chi-squared distribution with one degree of freedom, and a classical theorem states that X_i² / (X₁² + X₂² + … + X_d²) ~ Beta(1/2, (d−1)/2).
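You can sanity-check this property numerically. The snippet below estimates the energy fraction of one coordinate across many random Gaussian vectors and compares its mean to 1/d, the mean of Beta(1/2, (d−1)/2):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
X = rng.standard_normal((100_000, d))        # i.i.d. N(0, 1) samples
frac = X[:, 0] ** 2 / (X ** 2).sum(axis=1)   # energy fraction of coordinate 0

# Beta(1/2, (d-1)/2) has mean 1/d, so the empirical mean should be ≈ 1/128
print(frac.mean(), 1 / d)
```

This is exactly the distribution the fixed Lloyd-Max codebook is built against, which is why (as we will see shortly) the codebook depends only on d and the bit budget.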
Rotation has brought us to a point where we know the distribution of the coordinates. Now comes the second major operation in Stage 1: Lloyd-Max Quantization.
The whole idea behind Lloyd-Max [5,6] is to place the quantization levels (centroids) in exactly the right spots so the mean squared error is minimized. It’s basically smart clustering for 1-D data. Let’s simplify it with an example, taking the same rotated vector as above: [1.42, -0.85, 2.31, 0.97], and suppose we are doing 1-bit quantization.
- The number of centroids or levels here is 2^bits = 2^1 = 2.
- Let us take initial random levels [0.5, 1.5]. Their mid-point, or boundary, is (0.5 + 1.5)/2 = 1, so the quantized values become [1.5, 0.5, 1.5, 0.5] (all values below 1 map to 0.5, all values above 1 map to 1.5). That is the idea of quantization, right? But what we notice is that there is a lot of error here, i.e., the MSE is very high.
- Thus we have to find optimal levels such that MSE is minimum and values are best represented around them.
- This is done by Lloyd-Max quantization. Since the quantized values are now [1.5, 0.5, 1.5, 0.5], we allot two clusters:
{-0.85, 0.97} → the 0.5-level cluster,
{1.42, 2.31} → the 1.5-level cluster.
Taking their means, the 0.5-level cluster mean is ≈ 0.06 and the 1.5-level cluster mean is ≈ 1.86.
- So the levels are updated from [0.5, 1.5] to [0.06, 1.86], and the new boundary is (0.06 + 1.86)/2 ≈ 0.96: values below 0.96 now map to 0.06 and values above 0.96 map to 1.86. This keeps iterating until the MSE no longer improves.
Tip: There’s a fundamental statistical reason this works: the value that minimizes squared error for any group of points is simply their mean.
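The walkthrough above is exactly a 1-D k-means loop; here is a minimal sketch (my own toy code, not the paper's implementation):

```python
import numpy as np

def lloyd_max_1d(data, levels, iters=20):
    """Alternate two steps: assign each sample to its nearest level, then move
    each level to the mean of its assigned samples (the MSE-minimizing choice)."""
    levels = np.array(levels, dtype=float)
    for _ in range(iters):
        idx = np.argmin(np.abs(data[:, None] - levels[None, :]), axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = data[idx == k].mean()
    return levels, idx

x = np.array([1.42, -0.85, 2.31, 0.97])
levels, idx = lloyd_max_1d(x, [0.5, 1.5])
print(levels)   # converges to roughly [-0.85, 1.57] on this toy vector
```

Note that on this tiny example the loop keeps refining past the [0.06, 1.86] step from the walkthrough, since the new boundary reassigns 0.97 to the upper cluster; the fixed point is where no sample changes cluster.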
But wait, running this repetitive process on every new vector during inference would be way too slow, right? Here’s where the rotation pays off again. Because every coordinate now follows the same known distribution (the Beta we saw earlier), we don’t have to compute a fresh Lloyd-Max codebook for every new piece of data. Instead, the optimal codebook depends only on two fixed parameters: the head dimension (d) and the number of bits (b). We compute it once, offline, and reuse it forever. A snippet of this codebook is shown below:

The quantized values are not stored as floats but as indexes (idx) of the levels. Example: if there are 8 levels, their indexed (idx) form is 0, 1, 2, 3, 4, 5, 6, 7, so each value needs only 3 bits of storage.
Note: In TurboQuant’s Stage 1 (PolarQuant), the actual stored index (idx) uses b-1 bits per dimension (codebook size = 2^(b-1)), not b bits. The extra bit per dimension goes to the QJL residual correction in Stage 2 (the same split appeared in the storage comparison diagram earlier in this article; hopefully it is clear now). The table above shows the general Lloyd-Max setup; TurboQuant cleverly splits the bit budget to leave room for that correction.
These indexes are stored in the cache until the token is evicted. Dequantization happens on the fly whenever that token’s K is needed for attention: each idx is looked up in the codebook to retrieve its float value, and the resulting matrix is multiplied with the transpose of the original rotation matrix to get back K̂ in the original space. This completes the first stage.
Therefore, finally we are able to extract residuals:
ε = Original K matrix – K̂ matrix [dequantized]
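Putting Stage 1 together end to end, here is an illustrative sketch; the QR-based rotation and the toy 1-bit codebook are my assumptions for demonstration, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal rotation
codebook = np.array([0.06, 1.86])                   # toy 1-bit codebook

k = np.array([0.125, 0.103, 0.220, 6.030])          # original key vector
y = R @ k                                           # rotate to smooth outliers
idx = np.argmin(np.abs(y[:, None] - codebook), axis=1)   # indexes stored in cache

k_hat = R.T @ codebook[idx]   # dequantize on the fly: lookup, then rotate back
eps = k - k_hat               # residual, handed off to Stage 2
```

By construction k = k_hat + eps, so whatever Stage 1 could not represent is fully contained in eps; Stage 2's job is to cache a cheap summary of it.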
Stage 2 (Residual Correction):
Now that we have the residuals, the most intriguing part of TurboQuant follows.
Traditional quantization never looked at the residual. TurboQuant does not discard it. Instead, it asks a clever question: whatever information was lost during Stage 1 compression, can we capture its essential characteristics rather than storing it fully? Think of it as asking simple yes/no questions about the residual: is this dimension leaning positive or negative? The answers to these yes/no questions are what Stage 2 preserves.
To do this, a random projection matrix S of shape (d, d) is multiplied with the residual vector. The signs of the resulting values, either +1 or -1, are what actually get stored.
sign(ε · S), where ε has shape (seq_length, d) and S has shape (d, d)
These sign projections are known as the Quantized Johnson-Lindenstrauss (QJL) Transform [7].
Note: The randomness of S is not arbitrary, the Johnson-Lindenstrauss lemma guarantees that random projections preserve inner product structure with high probability.
But signs alone only capture direction, not magnitude. So alongside QJL, the L2 norm of the residual (‖ε‖₂) is also stored as a single scalar per vector. This scalar is what restores the magnitude during reconstruction.
During dequantization, the stored sign bits are multiplied back with the transposed S, then scaled by √(π/2)/d and by the stored norm ‖ε‖₂. The authors show that without this scaling factor, the sign-based estimate of the inner product is biased; this correction is what makes it unbiased. The exact formula is shown below:
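To see the unbiasedness numerically, the sketch below caches only the sign bits plus ‖ε‖₂ and averages reconstructions over many random draws of S. The averaging is purely to expose the expectation; the actual method uses a single S:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
eps = rng.standard_normal(d)                 # a toy residual vector

recon = np.zeros(d)
trials = 2000
for _ in range(trials):
    S = rng.standard_normal((d, d))          # random Gaussian projection matrix
    signs = np.sign(S @ eps)                 # 1 bit per dimension: what gets cached
    # sqrt(pi/2)/d de-biases the direction; the stored norm restores the magnitude
    recon += (np.sqrt(np.pi / 2) / d) * np.linalg.norm(eps) * (S.T @ signs)
recon /= trials
print(np.linalg.norm(recon - eps) / np.linalg.norm(eps))   # small relative error
```

The √(π/2) factor comes from E[|g|] = √(2/π) for a standard Gaussian g: it cancels the shrinkage introduced by replacing projections with their signs, so the average reconstruction lands back on ε.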
Finally the two parts from both stages are added together to get:
K̃ = K̂ + ε̂_QJL (the Stage 1 dequantized matrix plus the QJL-reconstructed residual)
A few closing observations:
- Complete TurboQuant pipeline summed up: Stage 1 handles the bulk compression, Stage 2 hunts down what was lost and adds it back.
- So what actually sits in the cache for each token is three things: the codebook index (Idx), the QJL sign bits, and the scalar norm ‖ε‖₂. That is the complete compressed representation.
- The authors formally prove that this two-stage design reaches the theoretical optimum, meaning no method working within the same bit budget can do better at preserving attention dot products.
Conclusion:
At the end of the day, TurboQuant works because it stops obsessing over perfect vector reconstruction and cleverly focuses on what the attention mechanism actually needs to see. Instead of fighting the VRAM “tax” with more complex calibration, it just uses a cleaner mathematical pipeline to get the job done.
As we keep pushing for longer context windows, the KV cache bottleneck isn’t going away. But as this framework shows, we don’t necessarily need more hardware, we just need to be more intentional about how we handle the data we already have.
With the introduction of TurboQuant, is the chapter of KV Cache memory management finally closed? Or is this just the foundation for something even more powerful?
Note: This breakdown represents my current understanding of the TurboQuant pipeline. Any errors in interpretation are entirely my own, and I encourage readers to refer to the original research for the full mathematical proofs.
References:
[1] Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS 2017).
[2] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
[3] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[4] Zandieh, A., et al. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874.
[5] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
[6] Max, J. (1960). Quantizing for Minimum Distortion. IRE Transactions on Information Theory, 6(1), 7–12.
[7] Zandieh, A., et al. (2024). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. AAAI 2025.

