RotorQuant: Clifford Algebra Vector Quantization for LLM KV Cache Compression
Replacing d×d matrix rotations with Cl(3,0) rotor sandwich products. 10-31× faster (CUDA + Metal), 44× fewer parameters, matching attention fidelity on real models.
1 Abstract
We present RotorQuant, a reimagining of Google's TurboQuant (ICLR 2026) that replaces the d×d random orthogonal rotation matrix Π with Clifford rotors R = exp(B/2) in the geometric algebra Cl(3,0). Instead of a matrix multiply Πx requiring d² = 16,384 multiply-adds for d=128, RotorQuant performs the rotor sandwich product RxR̃ using only ~100 multiply-adds per vector — exploiting the algebraic sparsity of rotors (4 of 8 multivector components are zero).
Fused GPU kernels implementing the full pipeline (embed → rotor sandwich → Lloyd-Max quantize → inverse sandwich → extract) achieve 10-19× speedup on NVIDIA (CUDA) and 9-31× speedup on Apple Silicon (Metal) over TurboQuant's BLAS matmul, while using 44× fewer parameters (372 vs 16,399 for d=128).
Validated on real KV cache data from Qwen2.5-3B-Instruct, RotorQuant matches TurboQuant's attention fidelity (cosine similarity 0.990 vs 0.991) and achieves higher top-1/top-5 retrieval accuracy at 4K context — suggesting the Clifford rotor decorrelation better preserves directional structure of real attention heads.
2 The Intuition
TurboQuant says: "randomly rotate the space so quantization becomes easy."
RotorQuant says: "why use a sledgehammer when Clifford algebra gives us a scalpel that does the same job geometrically, with 44× fewer params and a kernel that screams?"
Quick TurboQuant Recap
TurboQuant's magic is in Stage 1: you take a high-dim vector v ∈ ℝd (typical d=128 for attention heads) and multiply it by a fixed random orthogonal matrix Q (generated via QR on a Gaussian matrix): v' = Qv.
This mixes the coordinates so thoroughly that each one becomes almost independent and follows a very predictable distribution (roughly Gaussian / Beta). That lets you slap a single precomputed Lloyd-Max quantizer on every coordinate independently and get near-optimal scalar quantization. Then QJL adds a tiny 1-bit-per-dim residual correction so that inner products (i.e. attention scores) stay unbiased even though the per-vector reconstruction error is large.
The problem: that Q is dense and expensive — for d=128 it costs ~16k parameters and 16,384 multiply-adds per vector.
RotorQuant's Trick: Tiny Clifford Rotors
Instead of one huge d×d orthogonal matrix, RotorQuant chunks the d-dimensional vector into groups of 3 dimensions and rotates each little 3D block with its own cheap Clifford rotor from Cl(3,0).
In Cl(3,0):
- A rotor is R = exp(B/2) where B is a bivector (the 6 possible plane rotations in 3D).
- R has only 4 non-zero components (scalar + 3 bivector terms) and is normalized so R˜R = 1.
- To rotate a 3D vector v you embed it as a pure grade-1 multivector and do the sandwich product: v' = Rv˜R
- This is exactly a rotation in 3D (preserves norm, length, angles, etc.).
Do this independently on each 3D chunk (with its own rotor) and you get a block-diagonal orthogonal transformation that still mixes coordinates very effectively — just locally instead of globally.
Why This Is Elegant
Compact Parametrization
Each rotor needs only ~4 parameters (vs. thousands for a dense matrix). For d=128: 372 total parameters — 44× fewer than TurboQuant's QR matrix.
Sparsity & Geometry
Rotors are even-grade multivectors. The sandwich product is extremely sparse (lots of zeros), so the fused CUDA kernel blasts through it with far fewer FMAs.
Grade-Aware Quantization
After the rotor sandwich you have an 8-component multivector. RotorQuant splits it by grade (scalar vs. bivector) and quantizes each with its own Lloyd-Max codebook. It respects geometric structure.
Distribution Still Works
Random rotors in orthogonal 3D subspaces are enough to decorrelate for Lloyd-Max. Synthetic MSE is slightly higher, but on real KV-cache vectors + QJL correction the attention fidelity is identical (or better).
The Fused Kernel Win (CUDA + Metal)
Everything (embed → rotor sandwich → grade-aware quant → inverse → extract) lives in one single GPU kernel — CUDA on NVIDIA, Metal on Apple Silicon. No intermediate tensors bouncing between memory levels, no separate matmul. That's why you see 10–19× speed-ups on NVIDIA (6 μs vs 69 μs, RTX PRO 4000) and 9–31× on Apple Silicon (650 μs vs 6 ms, Mac Mini M4).
It's a perfect example of stealing a trick from physics/mathematics (Clifford rotors are the cleanest way to represent 3D rotations) and making it practical for modern LLM inference.
3 Background: The KV Cache Problem
When an LLM generates text, it stores key and value vectors for every token across every layer. At 8K tokens on Qwen2.5-3B (36 layers), this KV cache consumes 289 MB in FP16. On a 24GB GPU, the cache — not the model weights — becomes the bottleneck for long context.
TurboQuant compresses these vectors by: (1) applying a random orthogonal rotation Π to decorrelate coordinates, (2) quantizing each coordinate independently via Lloyd-Max optimal scalar quantization, and (3) applying a 1-bit QJL correction on the residual for unbiased inner product estimation. This achieves 5× compression at 3-bit with 99.5% attention fidelity.
4 Method: Clifford Rotors Replace Matrix Rotations
RotorQuant embeds d-dimensional vectors as Cl(3,0) multivectors (groups of 3 dimensions → 8-component multivectors: [1, e1, e2, e3, e12, e13, e23, e123]), then applies per-group rotor decorrelation via the sandwich product RxR̃.
| Property | TurboQuant (Π matrix) | RotorQuant (Rotor R) |
|---|---|---|
| Parameters | d² = 16,384 | 8 × ceil(d/3) = 344 |
| Operations / vector | d² = 16,384 FMAs | ~100 FMAs (sparse GP) |
| Preserves | Norms + inner products | + outer products + grades |
| Composition | Π2Π1 (matmul) | R2R1 (geometric product) |
| At d=4096 | 16.7M params | ~11K params |
Rotor Sparsity Exploitation
A rotor R in Cl(3,0) has only 4 non-zero components (scalar + 3 bivectors). The sparse geometric product reduces from 64 to 28 FMAs:
// Sparse GP: rotor * multivector (28 FMAs vs 64 for full)
r[0] = s*x[0] - p12*x[4] - p13*x[5] - p23*x[6];
r[1] = s*x[1] + p12*x[2] + p13*x[3] + p23*x[7];
r[2] = s*x[2] - p12*x[1] + p23*x[3] - p13*x[7];
r[3] = s*x[3] - p13*x[1] - p23*x[2] + p12*x[7];
r[4] = s*x[4] + p12*x[0];
r[5] = s*x[5] + p13*x[0];
r[6] = s*x[6] + p23*x[0];
r[7] = s*x[7] - p23*x[1] + p13*x[2] - p12*x[3]; 5 Results: CUDA Fused Kernel Speed
RTX PRO 4000 Blackwell, d=128, 3-bit quantization. Full pipeline: embed → rotor sandwich → quantize → inverse → extract.
| n_vectors | TurboQuant | RQ PyTorch | RQ CUDA | vs TQ |
|---|---|---|---|---|
| 1,024 | 69 us | 3.30 ms | 6 us | 11x faster |
| 4,096 | 132 us | 3.86 ms | 12 us | 11x faster |
| 8,192 | 285 us | 4.70 ms | 20 us | 14x faster |
| 16,384 | 740 us | 6.71 ms | 39 us | 19x faster |
Apple Silicon: Fused Metal Shader
Mac Mini M4, d=128, 3-bit. Same fused pipeline as CUDA but via Metal compute shader.
| n_vectors | TurboQuant (MPS) | RQ Metal | vs TQ |
|---|---|---|---|
| 1,024 | 764 us | 471 us | 1.6x faster |
| 4,096 | 6.02 ms | 650 us | 9.3x faster |
| 16,384 | 21.94 ms | 1.12 ms | 19.6x faster |
| 65,536 | 86.46 ms | 2.76 ms | 31.3x faster |
6 Real Model Validation: Qwen2.5-3B-Instruct
Actual KV cache from forward pass on real text. RotorQuant matches TurboQuant and beats it on top-1/top-5 at 4K context.
| Context | Bits | Method | Cosine Sim | Top-1 | Top-5 |
|---|---|---|---|---|---|
| 2K | 3-bit | TurboQuant | 0.9906 | 81.2% | 93.8% |
| 2K | 3-bit | RotorQuant | 0.9903 | 81.2% | 93.8% |
| 4K | 3-bit | TurboQuant | 0.9875 | 81.2% | 87.5% |
| 4K | 3-bit | RotorQuant | 0.9870 | 81.2% | 93.8% |
| 4K | 4-bit | TurboQuant | 0.9880 | 75.0% | 93.8% |
| 4K | 4-bit | RotorQuant | 0.9874 | 81.2% | 93.8% |
KV Cache Compression (8K context, all 36 layers)
| Config | Cache Size | Compression | Cosine Sim |
|---|---|---|---|
| FP16 | 289.0 MB | 1.0x | - |
| TQ 4-bit | 75.6 MB | 3.8x | 0.9983 |
| TQ 3-bit | 57.6 MB | 5.0x | 0.9945 |
| TQ 2-bit | 39.5 MB | 7.3x | 0.9851 |
7 Synthetic Benchmarks
MSE Distortion (d=128, 2000 unit vectors)
| Bits | TurboQuant | RotorQuant | Theory Bound |
|---|---|---|---|
| 1-bit | 0.361 | 0.457 | 0.680 |
| 2-bit | 0.116 | 0.197 | 0.170 |
| 3-bit | 0.034 | 0.081 | 0.043 |
| 4-bit | 0.009 | 0.032 | 0.011 |
TurboQuant wins on raw MSE — its full d×d rotation exactly induces the Beta distribution Lloyd-Max was optimized for. However, the QJL residual correction compensates, and on real model data the accuracy gap disappears.
Needle-in-Haystack Retrieval
Perfect 9/9 exact match for both methods across all bit-widths (2, 3, 4) and context lengths (512, 2048, 8192). Both quantizers correctly identify the closest vector every time.
8 Profiling: Where the Time Goes
Before the CUDA kernel, 80% of RotorQuant's time was in the geometric product (Python/PyTorch launching hundreds of tiny kernels):
The fused CUDA kernel eliminated this bottleneck entirely — the full pipeline now takes 6-39 us instead of 3.3-6.7 ms.
9 References
- Zandieh et al. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" ICLR 2026.
- Zandieh et al. "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization" 2024.
- Shwu et al. "PolarQuant: Quantizing KV Caches with Polar Transformation" AISTATS 2026.
- ParaMind. "CliffordNet: All You Need is Geometric Algebra" Jan 2026.
- QJL Reference Implementation: github.com/amirzandieh/QJL
- RotorQuant Code: github.com/scrya-com/rotorquant
- TurboQuant PyTorch: github.com/tonbistudio/turboquant-pytorch
- TurboQuant Website: turboquant.net
Try RotorQuant
pip install, build the CUDA or Metal kernel, run the benchmarks.