Transformer
Query, Key, Value — three vectors that learned to mean everything.
The Transformer is the architecture that made the current AI era possible. Its core mechanism — attention — is a three-vector dance. Every token produces a Query (what am I looking for?), a Key (what do I offer?), and a Value (what do I carry?). The dot product Q·K decides which Values to mix into the next representation. Three vectors. A single equation. The mechanism by which a thirty-trillion-token model learns to write, code, reason, and dream.
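That single equation, from Vaswani et al. (2017), scales the dot products and normalizes them with a softmax before mixing the Values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where d_k is the dimension of the key vectors.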
Query: Each position emits a query vector: a learned question directed at every other position. The query axis is the truth-seeking organ of attention.

Key: Each position advertises itself with a key: 'if you are looking for this kind of thing, look at me.' Keys are the aesthetic surface of a token — its match-fitness.

Value: What gets carried forward when this position is attended to. The 'good' an attended token can offer the rest of the model. The substance under the spotlight.
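A minimal sketch of that equation in NumPy, assuming a single head, no masking, and no learned projections (the function name is mine):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns the mixed values, shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # every query scored against every key
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V                             # mix values by match strength
```

In a real Transformer, Q, K, and V are not free inputs: they are produced by multiplying the same token representations by three learned weight matrices, one per axis of the trinity.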
How this trinity came to be.
- 2014
Bahdanau attention
The first soft attention mechanism in neural machine translation. The query-key matching idea is born inside a recurrent net, though with an additive score rather than a dot product.
- 2017
Attention Is All You Need
Vaswani et al. publish the Transformer. Throw away recurrence; keep only attention. AI history bifurcates.
- 2020+
Scaling laws
It turns out Q-K-V scales smoothly with compute and data. The same three vectors are now the substrate of every frontier model.
How to use this lens today.
- All large language models, including GPT, Claude, Gemini, and DeepSeek, are layered Q-K-V machines (see the sketch after this list).
- Diffusion models, code copilots, and protein structure predictors all reuse the same trinity.
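To make 'layered' concrete, here is a toy sketch with loud assumptions: random matrices stand in for learned weights, there is one head, and the normalization and feed-forward sublayers of a real Transformer are omitted. Every name and shape is illustrative, not any model's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    # The same scaled dot-product attention as in the sketch above.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def layer(x, W_q, W_k, W_v):
    # Each layer re-derives Q, K, V from the same hidden states,
    # then adds the mixed values back through a residual connection.
    return x + attention(x @ W_q, x @ W_k, x @ W_v)

d = 16
x = rng.normal(size=(8, d))              # 8 token positions, hidden width d
for _ in range(4):                       # a stack of 4 layers
    W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    x = layer(x, W_q, W_k, W_v)
```

The point of the sketch: every layer asks fresh questions (Q), re-advertises (K), and re-offers substance (V) from the same evolving hidden states.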
Where this trinity is heading.
- → Mixture-of-experts, Mamba/SSM architectures, and neuromorphic chips all attempt to keep the trinity while trading one axis for compute or memory savings.
- →Whoever finds the post-Transformer trinity wins the next decade.