Transformer Architecture: From First Principles
Mon Jun 08 2026
In our previous post, we broke down Self-Attention into its simplest mathematical form—discovering how Query, Key, and Value matrices route contextual information between words.
But Self-Attention is just an engine. To build a system capable of translating languages or generating code, you need the complete vehicle: The Transformer Architecture.
Rather than dumping the entire complex wiring diagram on you, we are going to trace a single sentence from the moment it enters the system to the moment the model outputs its response. We will build the architecture layer by layer, tracking the exact mathematical transformations.
Understanding all of this in one go, in depth and with its mathematical foundations, is definitely not everyone's cup of tea. It takes iterative learning: revisiting the same concept again and again from different perspectives.
The Core Bottleneck: Why Invent Transformers?
Before 2017, Sequence-to-Sequence (Seq2Seq) models built on Recurrent Neural Networks (RNNs) dominated language tasks such as machine translation. RNNs processed text strictly sequentially:
"My" ──► [Step 1] ──► "name" ──► [Step 2] ──► "is" ──► [Step 3] ──► "Ankit"
If an RNN processed a 500-word paragraph, the context of the first word had to survive 499 mathematical updates before reaching the end. Information degraded rapidly. Gradients vanished. On top of that, everything the encoder had read was squeezed into one fixed-size hidden state. This was the Encoder Bottleneck.
The Transformer abandoned recurrence entirely. It ingests the entire sentence simultaneously, replacing time-stepping loops with parallel mathematics.
The Complete End-to-End Architecture Flow
Here is the blueprint of the Transformer. Notice that there are no recurrent loops inside the blocks: data flows strictly top to bottom.
 [ Input Sentence ]          [ Output So Far ]
          │                          │
          ▼                          ▼
Embedding + Positional    Embedding + Positional
      Encoding                   Encoding
          │                          │
          ▼                          ▼
┌───────────────────┐   ┌──────────────────────────┐
│      ENCODER      │   │         DECODER          │
│                   │   │                          │
│    Multi-Head     │   │  Masked Self-Attention   │
│  Self-Attention   │   │ (can't see future words) │
│         │         │   │            │             │
│    Add & Norm     │   │        Add & Norm        │
│         │         │   │            │             │
│   Feed-Forward    │   │     Cross-Attention      │
│      Network      │──►│    Q   ← Decoder         │
│         │         │K,V│    K,V ← Encoder (Z)     │
│    Add & Norm     │   │            │             │
└───────────────────┘   │        Add & Norm        │
          │             │            │             │
 Context Matrix (Z)     │   Feed-Forward Network   │
    (K and V for        │            │             │
  Cross-Attention)      │        Add & Norm        │
                        └──────────────────────────┘
                                     │
                                     ▼
                             Linear Projection
                                     │
                                     ▼
                          Softmax (Probabilities)
                                     │
                                     ▼
                           [ Next Word Output ]
Let's break down exactly what happens to our data at every single stage of this pipeline.
Phase 1: Preparation (Tokenization-Embeddings-Positional Encoding)
Neural networks cannot read text; they only operate on numbers. So before the sentence "My name is Ankit" can enter the model, it has to be converted into a numeric representation.
Step 1.1: Tokenization & Embeddings
The sentence is broken into IDs and mapped into a continuous representation matrix $X$.
$$\text{"My name is Ankit"} \ \longrightarrow \ [523, 1024, 98, 4567] \ \longrightarrow \ X \in \mathbb{R}^{4 \times d_{\text{model}}}$$
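The two arrows in the formula above can be sketched in a few lines of Python. This is a minimal illustration, not how production tokenizers work: the whitespace-splitting `tokenize` helper, the `VOCAB` dictionary, `VOCAB_SIZE`, and the tiny `D_MODEL = 8` are all assumptions made for the example, and the embedding rows are random stand-ins for weights that a real model would learn.

```python
import random

# Hypothetical toy vocabulary; the IDs mirror the example in the text.
VOCAB = {"My": 523, "name": 1024, "is": 98, "Ankit": 4567}
D_MODEL = 8        # assumed tiny width; real models use much larger values
VOCAB_SIZE = 5000  # assumed, just large enough to hold the IDs above

def tokenize(sentence):
    """Split on whitespace and map each word to its integer ID."""
    return [VOCAB[word] for word in sentence.split()]

def build_embedding_table(vocab_size, d_model, seed=0):
    """One row of d_model numbers per token ID (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """An embedding lookup is plain row indexing into the table."""
    return [table[token_id] for token_id in token_ids]

token_ids = tokenize("My name is Ankit")
table = build_embedding_table(VOCAB_SIZE, D_MODEL)
X = embed(token_ids, table)

print(token_ids)          # [523, 1024, 98, 4567]
print(len(X), len(X[0]))  # 4 8  -> X has shape 4 x d_model
```

The point to notice is that the "embedding layer" is nothing more than a lookup table: token ID 523 always fetches row 523.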
Step 1.2: Positional Encoding
Because the Transformer ingests all 4 tokens simultaneously, it has zero concept of word order. To the raw network, "My name is Ankit" and "Ankit is my name" are mathematically identical.
To fix this, we fuse a mathematical watermark—a Positional Encoding Vector—directly into our embeddings.
[0.12, 0.33] // Semantic Meaning ("My")
+ [0.84, 0.11] // Sine Waveform (Position 1)
────────────────
= [0.96, 0.44] // Position-Aware Vector
Because each position contributes its own pattern of sine and cosine waves at different frequencies, the model can learn to separate the underlying meaning of a word from its position in the sentence.
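As a rough sketch of where such a "watermark" could come from, the snippet below uses the sinusoidal formula from the original Transformer paper: even feature indices get a sine, odd ones a cosine, with wavelengths that grow with the feature index. The function names `positional_encoding` and `add_position` and the 4-dimensional size are assumptions for illustration only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even feature indices, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(X, pe):
    """Element-wise sum of embeddings and their positional vectors."""
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]

pe = positional_encoding(seq_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 2) for v in row])
```

Every position receives a different vector, so the same word embedding ends up at a different point in space depending on where it sits in the sentence.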
Phase 2: The Encoder (Building Deep Understanding)
Our position-aware matrix $X$ now enters the Encoder. The Encoder's solitary job is to analyze the input sentence and construct a deeply interconnected representation of it.
Step 2.1: Multi-Head Self-Attention
↳ Deep Dive: Read Self-Attention from First Principles
Language is deeply complicated. Ever wondered how the exact same sentence in English can have multiple interpretations?
"She saw the man with the telescope."
To read this sentence, a Transformer must be able to represent not just one but several different attention patterns:
- Dependency 1: (saw → man, man → telescope) — The man has the telescope.
- Dependency 2: (she → telescope, saw → man) — She is using the telescope to see the man.
With one attention head, the model can focus on only one type of relationship at a time. But language has many relationships simultaneously.
The Limitation of a Single Head
Consider the sentence: "I love Apple Phone"
Things happening at once:
- Apple → Phone (Brand / Tech)
- Love → Apple (Sentiment)
- I → Love (Subject-Verb)
- Word Order (Positional structure)
A single attention head must average all of these into one focus. That loses information.
The Solution: Multi-Head Attention
Multi-Head Attention = Multiple independent attention mechanisms running in parallel.
Each head:
- Sees the same sentence.
- Uses its own learned projections ($W^Q, W^K, W^V$).
- Focuses on a different relationship.
Think of it as multiple viewpoints looking at the exact same data. Each word embedding is projected separately for every head, so each head creates its own independent geometry.
- Head 1: Focuses on syntax.
- Head 2: Focuses on semantics.
- Head 3: Focuses on entities.
- Head 4: Focuses on long-range relations.
No one tells them this explicitly. Training discovers it.
Why Splitting into Heads Helps (Important Intuition)
Instead of doing one massive 512-dimensional attention pass, we do 8 heads $\times$ 64-dimensional attention.
Why don't all heads just learn the exact same thing?
- Different random initialization: At the start, all $W$ matrices are random. They are literally looking in different directions in vector space.
- Dimensional Bottleneck: Because each head only gets 64 dimensions, it cannot memorize everything. It is forced to specialize.
What Happens During the First Forward Pass?
Sentence: "Apple released a new phone and people love it."
Because the weights are random, the early attention patterns are accidental:
- Head 1 Attention: Apple → released → phone → it (Nearby words)
- Head 2 Attention: love → people → it (Emotion-heavy words)
- Head 3 Attention: Apple → phone (Capitalized / Noun words)
What Happens During Backpropagation?
Suppose the model makes a mistake in understanding entity reference (failing to realize "it" refers to "phone"). Loss is high.
Backprop asks: "Which parameters most influenced the wrong output?"
Each head gets different gradients because their outputs and contributions were different:
- Head 2 (Sentiment) helped prediction a bit. Gradient reinforces emotional alignment.
- Head 3 (Entities) helped entity resolution. Gradient strongly reinforces entity links.
- Head 1 (Nearby words) didn't help much. Gradient weakens or pushes it elsewhere.
Over millions of examples, gradients strengthen useful patterns and weaken useless ones. Just like CNNs naturally learn filters for edges and textures, different heads capture different aspects of the sentence.
The Linear Transformation ($W_O$)
Each head produces an output vector. All head outputs are concatenated and passed through one final linear layer ($W_O$). The job of $W_O$ is to mix the heads: it learns how much of each head's information to carry forward into the final combined representation.
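The split, attend, concatenate, and project pipeline described in this step can be sketched as follows. Treat it as a toy under stated assumptions: the weights are random rather than trained, the sizes are shrunk to 2 heads of 4 dimensions (instead of 8 heads of 64), the scores are divided by $\sqrt{d_k}$ as in standard scaled dot-product attention, and helper names such as `matmul` and `random_matrix` are made up for this example.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head owns its own projections W^Q, W^K, W^V.
        W_q = random_matrix(d_model, d_head, rng)
        W_k = random_matrix(d_model, d_head, rng)
        W_v = random_matrix(d_model, d_head, rng)
        heads.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the head outputs token by token, then mix them with W_O.
    concat = [[v for head in heads for v in head[t]] for t in range(len(X))]
    W_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, W_o)

rng = random.Random(0)
X = random_matrix(4, 8, rng)  # 4 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
print(len(out), len(out[0]))  # 4 8
```

The output has the same shape as the input, which is what lets the residual connection in the next step add the two together.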
Step 2.2: The "Add & Norm"
Immediately after attention routing, the mixed output passes through an Add & Norm stabilization layer.
$$\text{Output} = \text{LayerNorm}\Big(\text{Attention}(X) + X \Big)$$
1. The "Add" (Residual Connection): Preventing Information Loss
We physically add the original input $X$ directly back to the Attention output vector.
[ 0.22, 0.44 ] // Processed Attention Output
+ [ 0.10, 0.80 ] // Original Input X (Uncorrupted)
────────────────
= [ 0.32, 1.24 ] // Bypassed Residual Vector
Why is this necessary? When a Transformer is initialized, its weights are completely random. Each pass through another randomly initialized matrix multiplication distorts the input vector a little more.
In the first few layers, this isn't a huge issue. But as the network gets deeper, the original signal becomes progressively corrupted, and by the time it reaches the output layer much of the information has been lost.
The residual connection gives the input a direct path around each sub-layer. If a specific attention layer fails to learn anything useful early in training, the raw features flow safely past it, and gradients flow back along the same shortcut, which greatly reduces the Vanishing Gradient problem.
2. The "Norm" (Layer Normalization): Stabilizing Training
Deep networks are notoriously unstable. Without normalization, some activations grow very large (which in turn feeds Exploding Gradients), while others shrink toward zero.
Why is this catastrophic for speed? If you have unnormalized data with wild variance, you are forced to use a very Small Learning Rate to prevent the model from crashing. A smaller learning rate means the model will take a significantly longer time to converge to the minimum loss. Training becomes painfully slow.
Layer Normalization solves this by standardizing each token's vector to zero mean and unit variance across its features. Let's see the exact internal computation:
// 1. Token Vector
x = [2.0, 4.0, 6.0, 8.0]
// 2. Compute Mean (μ)
μ = (2 + 4 + 6 + 8) / 4 = 5.0
// 3. Subtract Mean
x - μ = [-3.0, -1.0, 1.0, 3.0]
// 4. Divide by Standard Deviation (σ = √5 ≈ 2.24)
Normalized = [-1.34, -0.45, 0.45, 1.34]
By explicitly keeping the numbers tightly bounded around zero, the model avoids gradient explosions. This allows us to use a High Learning Rate, radically improving training speed and stability.
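The Add & Norm step can be sketched in a few lines. This is a simplified version under two assumptions: the learnable scale and shift parameters that real Layer Normalization layers carry are left out, and a small `eps` is added under the square root to avoid division by zero. Rounded to two decimals, the middle entries of the token vector come out as ±0.45 (the exact value is about 0.447).

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual)

normalized = layer_norm([2.0, 4.0, 6.0, 8.0])
print([round(v, 2) for v in normalized])  # [-1.34, -0.45, 0.45, 1.34]

# The residual example from the text: original input plus attention output.
combined = add_and_norm([0.10, 0.80], [0.22, 0.44])
print([round(v, 2) for v in combined])
```

Whatever the scale of the incoming numbers, the output of `layer_norm` is centered on zero with a spread close to one, which is what keeps later layers numerically well-behaved.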
Step 2.3: Feed-Forward Network (The Concept Factory)
Self-Attention only routes information between tokens. Once the attention weights are computed, each output is just a weighted average of Value vectors, which is a linear mix. But to truly understand language, a model must be able to draw complex, non-linear boundaries.
The Feed-Forward Network (FFN) introduces Non-Linearity.
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Let's look at the exact mathematical mechanics inside the Concept Factory:
// Step 1: Input vector (2-dim, simplified)
Input = [0.5, 0.3]
// W_1 projects it up to 5 dimensions (learned weighted sums)
Expanded = [4.2, -1.8, 3.1, -0.5, 8.9]
// Step 2: Apply ReLU → max(0, x) → Kill every negative
Before: [ 4.2, -1.8, 3.1, -0.5, 8.9 ]
After: [ 4.2, 0, 3.1, 0, 8.9 ] // ◄── -1.8 and -0.5 zeroed out
// Step 3: W_2 projects back down (5 → 2 dims)
// Each output is a dot product of ALL 5 ReLU values × a learned row of W_2
W_2 row 1 = [0.3, 0.1, 0.5, 0.2, 0.4]
Output[0] = (4.2×0.3)+(0×0.1)+(3.1×0.5)+(0×0.2)+(8.9×0.4) = 1.26+0+1.55+0+3.56 = 6.37
W_2 row 2 = [0.2, 0.4, 0.1, 0.3, 0.2]
Output[1] = (4.2×0.2)+(0×0.4)+(3.1×0.1)+(0×0.3)+(8.9×0.2) = 0.84+0+0.31+0+1.78 = 2.93
Final Output = [6.37, 2.93] // ◄── Back to the input's 2 dimensions
$W_2$ is a learned matrix. Each output value is a dot product that mixes all 5 ReLU signals into a single summary number. The network learns exactly which combination of signals encodes useful concepts. At full scale, the same mechanism distills 2048 hidden neurons back down to 512 dimensions.
(Note: The output of the FFN passes through a second Add & Norm layer before officially exiting the Encoder).
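The FFN formula can be checked against the worked numbers with a short sketch. One caveat: the text never gives $W_1$, so the `W1` below is a hypothetical matrix chosen only so that the input `[0.5, 0.3]` expands to the same five values; the biases are assumed to be zero, and `W2` stores the text's two "rows" as columns because the code follows the $xW$ convention.

```python
def relu(v):
    """max(0, x) applied element-wise."""
    return [max(0.0, a) for a in v]

def linear(x, W, b):
    """Compute x @ W + b for a row vector x; W is a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2"""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical W1 (2 x 5), picked so that [0.5, 0.3] expands to the values in the text.
W1 = [[8.4, -3.6, 6.2, -1.0, 17.8],
      [0.0, 0.0, 0.0, 0.0, 0.0]]
b1 = [0.0] * 5
# W2 (5 x 2): column 0 is "W_2 row 1" from the text, column 1 is "W_2 row 2".
W2 = [[0.3, 0.2],
      [0.1, 0.4],
      [0.5, 0.1],
      [0.2, 0.3],
      [0.4, 0.2]]
b2 = [0.0] * 2

hidden = linear([0.5, 0.3], W1, b1)
output = ffn([0.5, 0.3], W1, b1, W2, b2)
print([round(v, 2) for v in hidden])  # [4.2, -1.8, 3.1, -0.5, 8.9]
print([round(v, 2) for v in output])  # [6.37, 2.93]
```

Removing the `relu` call would collapse the two matrix multiplications into a single linear map, which is why the non-linearity in the middle matters.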
Phase 3: The Decoder (Generating the Answer)
The Encoder is finished. It outputs a highly dense, fully contextualized matrix called the Context Matrix ($Z$).
Now, the Decoder steps in. Its job is to take that Context Matrix ($Z$) and generate the output text autoregressively (one word at a time).
Step 3.1: Masked Self-Attention
↳ Deep Dive: Read Self-Attention from First Principles
Masked Multi-Head Attention is self-attention where a token is NOT allowed to look at future tokens.
Before we do the math, let's understand the structural difference between the two halves of the Transformer:
- Encoder: Sees the full input at once. Builds a global understanding. No masking.
- Decoder: Generates output step by step. Uses masked self-attention to see only past outputs.
Why Do We Need Masking?
The Decoder's job is to generate text one word at a time. Consider the sentence: "I love apple phone"
When predicting:
- word 1 → "I"
- word 2 → "love"
- word 3 → "apple"
- word 4 → "phone"
At the exact moment of predicting "apple", the model must NOT see "phone". Why? Because "phone" is the exact answer it is supposed to predict later!
Without Masking
During training, the Decoder processes all target words simultaneously for massive GPU efficiency. Suppose the Decoder sees the entire sequence "I love apple phone" at once. While predicting "apple", the attention mechanism would simply look ahead at "phone".
The model learns a lazy, catastrophic rule: "Just peek at the future to get the answer."
At inference time—when you are actually chatting with the AI—future words do not exist yet. Because the model relied on cheating during training, it completely fails to generate anything useful.
The Math of Masking
To prevent cheating, we must enforce causality. The model applies a Mask to the raw $QK^T$ attention scores before applying Softmax. It explicitly overwrites all future coordinate scores with negative infinity ($-\infty$), creating a lower-triangular matrix:
[ Word 1, Word 2, Word 3, Word 4 ]
Word 1 [ 1.2, -Inf, -Inf, -Inf ]
Word 2 [ 0.4, 1.1, -Inf, -Inf ]
Word 3 [ 0.9, 0.1, 2.2, -Inf ]
Word 4 [ 0.2, 0.5, 1.4, 0.8 ]
When this matrix passes through the Softmax function ($\frac{e^{x_i}}{\sum_j e^{x_j}}$), every $e^{-\infty}$ term collapses exactly to $0$.
Future words are mathematically erased. Word 2 is only allowed to route information from Word 1 and itself. This strictly enforces causality, ensuring the model learns to truly predict the next word rather than simply copying it.
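Here is a small sketch of the masking step followed by a row-wise Softmax. The lower triangle of `raw_scores` matches the matrix in the text; the values above the diagonal are invented for this example, since the text only shows them after they have been overwritten.

```python
import math

def causal_mask(scores):
    """Keep scores[i][j] for j <= i; overwrite future positions with -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)  # always finite, because the diagonal is never masked
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Lower triangle from the text; upper-triangle values are made up.
raw_scores = [
    [1.2, 0.7, 0.3, 2.0],
    [0.4, 1.1, 0.9, 0.5],
    [0.9, 0.1, 2.2, 1.3],
    [0.2, 0.5, 1.4, 0.8],
]

weights = [softmax(row) for row in causal_mask(raw_scores)]
for row in weights:
    print([round(w, 2) for w in row])
```

The first row prints as `[1.0, 0.0, 0.0, 0.0]`: Word 1 can only attend to itself, no matter how large its raw score for a future word was.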
Step 3.2: Cross-Attention
↳ Deep Dive: Read Self-Attention from First Principles
Next, the Decoder must link its generated text with the original input sentence. It does this via Cross-Attention.
Unlike Self-Attention where $Q, K, V$ all come from the exact same sentence, Cross-Attention deliberately splits the source matrices.
The Math of the Bridge:
- The Queries ($Q$) are generated from the Decoder's current word.
- The Keys ($K$) and Values ($V$) are pulled directly from the Encoder's fully processed Context Matrix ($Z$).
Decoder's generated text so far ──► Generates Query (Q)
│
Encoder's Source text (Matrix Z) ──► Generates Key (K) & Value (V)
First-Principle Example: Suppose we are translating "I love Apple" to French. The Decoder has already generated "Je". Now it must generate the next word.
- The Query: The Decoder projects "Je" into a Query vector ($Q$). Mathematically, it asks: "I am a first-person pronoun. What context should I attach to?"
$Q_{\text{Je}} = [0.8, 0.2]$
- The Keys: The Encoder's Context Matrix contains the deeply processed Keys ($K$) for "I", "love", and "Apple" (the Key for "Apple" is left out below to keep the arithmetic short).
$K_{\text{I}} = [0.9, 0.1]$
$K_{\text{love}} = [0.1, 0.8]$
- The Matching (Dot Product): The Query takes the dot product with every Encoder Key to measure geometric similarity. It discovers a strong alignment with $K_{\text{I}}$.
Q("Je") • K("I") = (0.8 × 0.9) + (0.2 × 0.1) = 0.74 (High Alignment)
Q("Je") • K("love") = (0.8 × 0.1) + (0.2 × 0.8) = 0.24 (Low Alignment)
- The Value (Softmax + Weighted Blend): The raw scores are converted to attention weights via Softmax, then used to blend all Value vectors together:
// Softmax on dot product scores
e^0.74 = 2.10 → 2.10 / 3.37 = 62% weight on V("I")
e^0.24 = 1.27 → 1.27 / 3.37 = 38% weight on V("love")
// Value vectors (simplified)
V("I") = [1.0, 0.0]
V("love") = [0.0, 1.0]
// Final output = weighted blend
Output = 0.62 × [1.0, 0.0] + 0.38 × [0.0, 1.0] = [0.62, 0.38] // ◄── Carries 62% "I" meaning + 38% "love" meaning
The Decoder does not blindly pick one word. It blends the meaning of every source token, weighted by alignment strength. This [0.62, 0.38] context vector is then passed into the FFN.
Cross-Attention is a translator computing dot products against a dictionary. The Decoder maps its current state onto a weighted semantic blend of the entire source sentence.
After Cross-Attention, the Decoder runs the vectors through its own Feed-Forward Network and Add & Norm layers.
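The "Je" example can be reproduced numerically with the sketch below. To stay faithful to the worked arithmetic it makes the same simplifications: only the Keys and Values for "I" and "love" are used, and the scores are not divided by $\sqrt{d_k}$. The function name `cross_attention` is an illustrative choice, not a library API.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """One decoder query attending over the encoder's keys and values."""
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return scores, weights, output

q_je = [0.8, 0.2]                 # Query from the Decoder ("Je")
keys = [[0.9, 0.1], [0.1, 0.8]]   # Encoder Keys for "I" and "love"
values = [[1.0, 0.0], [0.0, 1.0]] # Encoder Values for "I" and "love"

scores, weights, output = cross_attention(q_je, keys, values)
print([round(s, 2) for s in scores])   # [0.74, 0.24]
print([round(w, 2) for w in weights])  # [0.62, 0.38]
print([round(o, 2) for o in output])   # [0.62, 0.38]
```

The only structural difference from self-attention is where the arguments come from: `query` is built from the Decoder's state, while `keys` and `values` come from the Encoder's output.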
Phase 4: Output Translation (Linear & Softmax)
We are translating "I love Apple" to French. The Decoder has generated "Je". Now the final layers must select the next French word.
The final context vector is pushed through a Linear Layer that scores every word in the 50,000-word French vocabulary. Each score is called a logit.
// Step 1: Raw Logits from the Linear Layer
aime = 4.5 // "love" in French ← likely next word
suis = 2.1 // "am" in French
mange = 1.2 // "eat" in French
... // 49,997 more words with tiny scores
// Step 2: Exponentiate every logit e^x (makes all values positive)
e^4.5 = 90.02 // aime
e^2.1 = 8.17 // suis
e^1.2 = 3.32 // mange
Sum = 101.51 // (real sum spans all 50,000 words)
// Step 3: Divide each by Sum → final probability for every word
aime = 90.02 / 101.51 = 89% // ◄── Winner. Model outputs "aime"
suis = 8.17 / 101.51 = 8%
mange = 3.32 / 101.51 = 3%
The model outputs "aime" with 89% confidence. The Decoder's input is now "Je aime". On the next loop it predicts the French word for Apple.
This loop runs until the model outputs <EOS> (End Of Sequence), terminating generation. The final translation: "Je aime Apple". (Proper French contracts this to "J'aime Apple"; the toy example keeps one token per word.)
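The final Softmax and the generate-until-`<EOS>` loop can be sketched together. The Softmax runs over the three logits from the text only, not a 50,000-word vocabulary. The `TOY_LOGITS` table is a made-up stand-in for the whole Decoder: it simply maps "text generated so far" to a set of logits, and every value in it beyond the first entry is invented for illustration.

```python
import math

def softmax(logits):
    """Turn a {word: logit} dict into a {word: probability} dict."""
    m = max(logits.values())
    exps = {word: math.exp(value - m) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical stand-in for the Decoder + Linear layer.
TOY_LOGITS = {
    ("Je",): {"aime": 4.5, "suis": 2.1, "mange": 1.2},  # logits from the text
    ("Je", "aime"): {"Apple": 5.0, "<EOS>": 1.0},        # invented
    ("Je", "aime", "Apple"): {"<EOS>": 6.0, "Apple": 0.5},  # invented
}

def greedy_decode(start, max_steps=10):
    """Repeatedly append the most probable next word until <EOS>."""
    output = list(start)
    for _ in range(max_steps):
        probs = softmax(TOY_LOGITS[tuple(output)])
        next_word = max(probs, key=probs.get)
        if next_word == "<EOS>":
            break
        output.append(next_word)
    return output

probs = softmax(TOY_LOGITS[("Je",)])
print({word: round(p, 2) for word, p in probs.items()})
translation = greedy_decode(["Je"])
print(" ".join(translation))  # Je aime Apple
```

Picking the single most probable word at each step is called greedy decoding; it is the simplest selection strategy and matches the walkthrough above.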
Summary of the Engine
| Block | Mathematical Purpose |
|---|---|
| Positional Encoding | Fuses spatial time-step coordinates directly into the semantic vectors. |
| Multi-Head Attention | Analyzes the sequence from multiple structural perspectives in parallel. |
| Add & Norm | Bypasses failing layers to preserve gradients and stabilizes numerical variance. |
| Feed-Forward | Constructs higher-order concepts from the routed contextual evidence. |
| Masked Attention | Zeroes out future coordinates to prevent the decoder from cheating. |
| Cross-Attention | Routes the decoder's current progress directly to the encoder's source understanding. |
I wrote this article mainly for myself, to grasp the entire concept more thoroughly and to keep from forgetting it.
Best,
Ankit