Deconstructing the Transformer: A Neuron-Level Analysis of a Modern Neural Circuit

Section 1: Foundational Principles: From Recurrence to Parallel Attention

The advent of the Transformer architecture in 2017 marked a watershed moment in the field of deep learning, particularly for sequence processing tasks.1 To fully appreciate the novelty and power of its internal circuits, it is essential to first understand the limitations of the architectures it superseded. For years, Recurrent Neural Networks (RNNs) and their more sophisticated variants, such as Long Short-Term Memory (LSTM) networks, were the dominant paradigms for modeling sequential data like natural language.3

1.1 The Limitations of Sequential Computation in Recurrent Architectures

Recurrent models process data sequentially, ingesting one element, or token, at a time while maintaining an internal “hidden state” that is updated at each step.4 This recurrent loop, where the output at time $t$ is a function of the input at time $t$ and the hidden state from time $t-1$, creates a form of internal memory, allowing the network to theoretically retain information from arbitrarily far back in the sequence.7

However, this design introduces two fundamental challenges. The first is a practical constraint on learning long-range dependencies known as the vanishing and exploding gradient problem.5 During backpropagation, gradients are multiplied through the network’s layers over time. For long sequences, this repeated multiplication can cause the gradients to shrink exponentially toward zero (vanish) or grow uncontrollably (explode). Vanishing gradients leave the model unable to learn connections between distant elements in a sequence, as the error signal from the output cannot effectively propagate back to update the weights associated with early inputs.3

LSTM networks were a significant breakthrough designed specifically to mitigate this issue.3 They introduced a more complex internal structure involving a “cell state” and a series of “gates”—the input gate, forget gate, and output gate—that meticulously regulate the flow of information.5 These gating mechanisms, which are essentially small neural networks, learn to selectively add, remove, or pass through information from the cell state, allowing LSTMs to maintain context over much longer sequences than simple RNNs.5

Despite this improvement, both RNNs and LSTMs are hamstrung by a second, more fundamental limitation: their inherently sequential nature.3 The computation for the token at position $t$ is strictly dependent on the completion of the computation for the token at position $t-1$. This dependency makes it impossible to parallelize the processing of tokens within a single training example.4 In an era of massively parallel hardware like Graphics Processing Units (GPUs), this sequential bottleneck became a major impediment to scaling models and training on the vast datasets required for state-of-the-art performance.8 Furthermore, the path that information must travel between two distant tokens, say at positions $i$ and $j$, has a length proportional to their distance, $O(j-i)$. This long path makes it computationally difficult to capture complex, long-range dependencies effectively, even with the aid of LSTM gates.10

 

1.2 The Transformer Paradigm Shift: Parallel Set Transformation

 

The Transformer architecture, introduced in the seminal paper “Attention Is All You Need,” proposed a radical departure from this recurrent philosophy.1 It completely dispenses with recurrence and convolutions, relying instead on a mechanism called self-attention to draw global dependencies between input and output.10

This architectural choice directly addresses the limitations of its predecessors. By eliminating the sequential processing loop, the Transformer can process all tokens in an input sequence simultaneously.1 Every token’s representation is computed in parallel, with the self-attention mechanism directly modeling the relationships between all pairs of tokens in the sequence, regardless of their distance.10 This parallelizability dramatically reduces training time on modern hardware and has enabled the development of the massive models that define the current state of the art.3

The core conceptual leap is the re-framing of “context.” RNNs build a compressed, temporal state that evolves sequentially. This state is inherently biased towards more recent information, as the memories of earlier tokens are repeatedly transformed and potentially diluted. In contrast, the Transformer builds a dynamic, relational context. Each token’s meaning is defined not by a summary of what came before it, but by a weighted relationship to every other token in the input, treated as a complete set. This moves the central problem of sequence modeling from “how to remember the past” to “how to relate all parts of the present input.”

This is made possible by the self-attention mechanism, which connects all positions with a constant number of sequentially executed operations. This reduces the maximum path length for information to travel between any two tokens in the sequence to $O(1)$, a profound advantage for capturing long-range dependencies compared to the $O(n)$ path length in RNNs.10 The rise of the Transformer is thus inextricably linked to the maturation of GPU hardware. The architecture’s near-total reliance on matrix multiplications—the core of self-attention and subsequent feed-forward layers—is perfectly suited to the parallel processing capabilities of modern GPUs. The Transformer’s design can be seen as a deliberate reformulation of sequence processing into a computational language that GPUs speak fluently, thereby unlocking unprecedented scalability.1

The following table provides a concise summary of the key architectural differences that underpin this paradigm shift.

| Attribute | RNN/LSTM | Transformer |
| --- | --- | --- |
| Processing Paradigm | Sequential (token-by-token) | Parallel (whole sequence at once) |
| Parallelizability | Low (within a sequence) | High |
| Long-Range Dependency | Challenging (vanishing gradients) | Effective (direct connections) |
| Max Path Length | $O(n)$ | $O(1)$ |
| Computational Complexity | $O(n \cdot d^2)$ | $O(n^2 \cdot d)$ |

Table 1: Architectural Comparison of Sequence Models. Here, $n$ is the sequence length and $d$ is the representation dimension. The Transformer’s complexity is higher for long sequences but is highly parallelizable, making it faster in practice on modern hardware for typical $n < d$ scenarios.10

 

Section 2: Encoding Information: The Journey from Text to Context-Aware Tensors

 

Before the core computational circuits of the Transformer can operate, the raw input text must be converted into a numerical format that is both machine-readable and imbued with the necessary information about meaning and order. This is a multi-stage process that transforms a string of characters into a matrix of high-dimensional vectors.

 

2.1 Tokenization and Semantic Embedding

 

The first step in this pipeline is Tokenization. This process breaks down the input text into a sequence of smaller, manageable units called tokens.17 These tokens can be words, but more commonly, modern systems use subword tokenization algorithms like Byte-Pair Encoding (BPE).18 Subword tokenization is advantageous because it can handle rare words by breaking them down into more common sub-units (e.g., “empowers” might become “empower” and “s”), and it keeps the overall vocabulary size manageable.17 Each token in the model’s pre-defined vocabulary is assigned a unique integer ID.17

Once the text is a sequence of integer IDs, it is passed to an Embedding Layer. This layer acts as a lookup table, mapping each token ID to a high-dimensional vector.19 This lookup table is, in practice, a large weight matrix of size (vocabulary_size, d_model), where d_model is the dimensionality of the embedding space (a key hyperparameter, typically 512, 768, or larger).3 Each row of this matrix is a vector that has been learned during training to capture the semantic meaning of the corresponding token. Words with similar meanings or that appear in similar contexts will have embedding vectors that are close to each other in this high-dimensional space.19 The output of this stage is a matrix of size (sequence_length, d_model), where each row is the semantic embedding of a token in the input sequence.
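To make these shapes concrete, the following is a minimal NumPy sketch of the embedding lookup. The vocabulary size, dimensionality, and token IDs are illustrative placeholders, not values from any particular model.

```python
import numpy as np

# Illustrative sizes; real models use much larger vocabularies and d_model.
vocab_size, d_model = 10_000, 8
rng = np.random.default_rng(0)

# The embedding layer is a (vocab_size, d_model) weight matrix learned during
# training; looking up a token ID simply selects the corresponding row.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([17, 942, 7])               # hypothetical tokenizer output
token_embeddings = embedding_matrix[token_ids]   # shape: (sequence_length, d_model)
print(token_embeddings.shape)                    # (3, 8)
```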

 

2.2 The Problem of Order: Sinusoidal Positional Encoding

 

The self-attention mechanism, which forms the core of the Transformer, is permutation-invariant. It processes the input as an unordered set of vectors.22 This means that after the embedding step, the model has no inherent way of knowing the original order of the tokens. The sentences “The cat sat on the mat” and “The mat sat on the cat” would be represented by identical sets of vectors, leading to an obvious loss of meaning.24

To solve this, the Transformer injects explicit information about the position of each token into the model. This is achieved through Positional Encodings.20 A positional encoding is a vector of the same dimension as the token embeddings ($d_{model}$) that is unique to each position in the sequence. This positional vector is then added element-wise to the corresponding token embedding vector.15 The resulting vector thus contains information about both the token’s meaning (“what”) and its position (“where”).

This design choice represents a powerful form of feature engineering, explicitly decoupling semantic content from positional context. In an RNN, these two streams of information are inextricably fused within the evolving hidden state. In the Transformer, this separation allows the subsequent attention mechanism to operate on a representation where it can learn to focus on semantic similarity, positional proximity, or a complex combination of both, as determined by the learned weight matrices of the attention layer.

The original Transformer paper proposed a clever, deterministic method for generating these positional vectors using sine and cosine functions of varying frequencies 20:

$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$

 

Here, $pos$ is the position of the token in the sequence (e.g., 0, 1, 2, …), and $i$ indexes the sine/cosine dimension pairs within the encoding vector (from 0 to $d_{model}/2 - 1$).

This sinusoidal formulation is not arbitrary; it was chosen for several key properties. First, it produces a unique encoding for each position. Second, because the wavelengths of the sinusoids form a geometric progression (from $2\pi$ to $2\pi \cdot 10000$), it can generalize to sequence lengths not seen during training.20 Most importantly, it allows the model to easily learn relative positions. For any fixed offset $k$, the positional encoding $PE_{pos+k}$ can be represented as a linear transformation of $PE_{pos}$, a property stemming from the sum-of-angles identities for sine and cosine. This means the model can learn a general rule for “what it means to be $k$ steps away,” rather than memorizing absolute position encodings for every possible position.

This can be conceptualized as encoding position using a multi-scale clock. The low-frequency sinusoids (corresponding to small values of $i$) act like the hour hand, providing a coarse sense of position over the entire sequence. The high-frequency sinusoids (for large $i$) act like the second hand, providing fine-grained information about local neighbors. This rich, multi-scale representation gives the model a powerful and flexible way to understand sequence order. The final input to the first encoder layer is the matrix representing the sum of the token embeddings and their corresponding positional encodings.17
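As a concrete illustration, here is a short NumPy sketch of the sinusoidal scheme described above (assuming an even $d_{model}$); adding its output to the token embeddings yields the encoder input.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]       # pos = 0, 1, ..., seq_len - 1
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

# Encoder input: element-wise sum of token embeddings and positional encodings.
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```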

 

Section 3: The Core Computational Unit: Scaled Dot-Product Self-Attention

 

At the heart of the Transformer is the self-attention mechanism, the circuit responsible for relating different positions of a single sequence to compute a new representation of that sequence.13 This mechanism allows the model to weigh the importance of all other words in the input when processing a specific word, thereby “baking in” context directly.15 To understand this at a neuron-level, one must deconstruct its core components: the Query-Key-Value abstraction and the mathematical formula that governs their interaction.

 

3.1 The Query, Key, Value Abstraction

 

The self-attention mechanism is elegantly described as mapping a query and a set of key-value pairs to an output.13 This formulation is inspired by information retrieval systems. A helpful analogy is searching for a video on a platform like YouTube: your search text is the Query; the platform compares this query against the titles and descriptions of all videos, which act as the Keys; when a strong match is found, the system returns the video itself, which is the Value.28

In the context of a Transformer, this process is applied to the sequence of input vectors (token embedding + positional encoding). For each input vector $x_i$ corresponding to the $i$-th token, the model generates three distinct vectors 15:

  1. Query vector ($q_i$): This vector represents what the current token $i$ is “looking for” or what kind of information it is interested in. It acts as a probe to score the relevance of other tokens.29
  2. Key vector ($k_i$): This vector represents the “label” or the type of information that token $i$ can provide. It’s what other tokens will match their queries against.29
  3. Value vector ($v_i$): This vector represents the actual content or meaning of token $i$. This is the information that will be passed on if its key is deemed relevant by a query.29

These three vectors are generated by multiplying the input vector $x_i$ by three separate, learned weight matrices: $W^Q$, $W^K$, and $W^V$.15

$q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V$

 

These weight matrices are learned during training through backpropagation. Their role is to project the input vector into three different subspaces, each tailored to its specific role in the attention calculation. For instance, $W^Q$ learns to transform the input vector into a representation that is effective for querying, while $W^K$ learns to transform it into a representation that is effective for being matched against.

This Q-K-V mechanism is a fully differentiable implementation of a soft, content-based dictionary lookup.28 A standard dictionary lookup is a discrete operation: if query == key, then return value. Self-attention transforms this into a continuous process. The similarity between a query and a key is a continuous score, and the output is a “blended” value, aggregating information from all values based on their key’s relevance to the query. Because every operation is differentiable, the model can learn the optimal projection matrices ($W^Q, W^K, W^V$) to perform these lookups effectively for any given task.

 

3.2 The Attention Formula Deconstructed: $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

 

The interaction between the Query, Key, and Value vectors for all tokens in a sequence is captured in a single, elegant matrix formula known as Scaled Dot-Product Attention.10 This calculation can be broken down into four distinct steps. For this explanation, let $Q, K, V$ be matrices where each row is the $q, k, v$ vector for a token in the sequence, respectively.

Step 1: Compute Similarity Scores ($QK^T$)

To determine how much attention the token at position $i$ should pay to the token at position $j$, the model computes the dot product of their respective query and key vectors: $score_{ij} = q_i \cdot k_j$.2 The dot product is a geometric measure of similarity; if the query and key vectors point in similar directions in their high-dimensional space, the dot product will be large, indicating high relevance. If they are orthogonal, the dot product will be zero, indicating no relevance.34 This calculation is performed for all pairs of tokens simultaneously by computing the matrix product of the Query matrix $Q$ and the transpose of the Key matrix $K^T$. The resulting matrix, often called the attention score matrix, has dimensions (sequence_length, sequence_length), where the entry at (i, j) is the relevance score of token $j$ to token $i$.15

Step 2: Scale for Stability ($/\sqrt{d_k}$)

The attention scores are then scaled by dividing each element by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors.1 This scaling factor is a crucial, though seemingly minor, detail. The authors of the original paper observed that for large values of $d_k$, the magnitude of the dot products tends to grow large. These large values, when fed into the softmax function in the next step, push it into saturated regions where its gradient is extremely close to zero.2 The resulting vanishing gradients would effectively halt learning. Scaling by $\sqrt{d_k}$ counteracts this by keeping the variance of the scores at approximately 1, ensuring that the gradients remain stable and learning can proceed effectively.10

Step 3: Normalize to a Probability Distribution (softmax)

The scaled scores are then passed through a softmax function, which is applied independently to each row of the score matrix.2 The softmax function exponentiates each score (making them all positive) and then normalizes them so that the scores in each row sum to 1.37 The result is the final attention weights matrix, a probability distribution for each query token over all the key tokens.37 The weight $\alpha_{ij}$ in this matrix represents the proportion of attention that token $i$ will pay to token $j$.

Step 4: Create the Contextualized Output (…V)

The final step is to compute the output vector for each token as a weighted sum of all the Value vectors in the sequence.13 The output for token $i$, denoted $z_i$, is calculated as:

 

$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j$

 

where $n$ is the sequence length. This means that the Value vectors of tokens with high attention weights will contribute more significantly to the output representation of token $i$, while tokens with low weights will be effectively ignored.15 This is the critical step where information from across the sequence is aggregated and synthesized. The entire operation is performed for all tokens at once via a single matrix multiplication of the attention weights matrix and the Value matrix $V$.

The output of this process is a new sequence of vectors, $Z$, where each vector $z_i$ is a “contextualized” representation of the original input vector $x_i$. It now contains not only its own information but also a blend of information from all other tokens in the sequence, weighted by their relevance. The “neuron-level” computation for a single output vector is thus a distributed process, a dynamic circuit whose connections and weights ($\alpha_{ij}$) are re-calculated for every input based on the learned projections ($W^Q, W^K, W^V$) and the content of the sequence itself.
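The four steps above translate almost line-for-line into code. Below is a minimal NumPy sketch of scaled dot-product attention for a single sequence; the optional additive mask anticipates the decoder's look-ahead mask discussed in Section 6. Each row of the returned output is the contextualized vector $z_i$ for the corresponding token.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (n, d_k); V: (n, d_v). Returns (output, attention_weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # Steps 1-2: dot products, then scaling
    if mask is not None:
        scores = scores + mask             # e.g., -inf above the diagonal
    # Step 3: row-wise softmax turns scores into weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights            # Step 4: weighted sum of Value vectors
```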

 

Section 4: Expanding the Attentional Field: The Multi-Head Mechanism

 

While the scaled dot-product attention mechanism is powerful, relying on a single attention calculation can be limiting. It forces the model to average all types of linguistic and positional relationships into a single representation space, potentially conflating distinct signals.10 For instance, a model might need to simultaneously track syntactic dependencies (e.g., subject-verb agreement) and semantic relationships (e.g., synonymy). To address this, the Transformer introduces a more sophisticated mechanism called Multi-Head Attention.

 

4.1 Rationale for Multiple Heads

 

Multi-Head Attention allows the model to jointly attend to information from different “representation subspaces” at different positions.1 The core idea is to run the scaled dot-product attention mechanism multiple times in parallel, with each parallel run, or “head,” learning a different type of relationship.40

This can be conceptualized as an ensemble of specialists operating within a single layer. Each attention head is a “specialist” that learns to identify a particular kind of pattern or dependency within the sequence. For example, in the sentence “The animal didn’t cross the street because it was too tired,” one head might learn projection matrices that cause the query for “it” to have a high similarity with the key for “animal,” thus resolving the co-reference.30 Simultaneously, another head might learn to connect “tired” with “animal,” capturing a state-of-being relationship. A third might focus on the syntactic link between “didn’t” and “cross.” The final output can then integrate the findings from all these specialists to produce a much richer and more nuanced representation than any single head could achieve alone.41

 

4.2 The Split-Attend-Concatenate-Project Workflow

 

The multi-head mechanism is not simply running the same attention calculation multiple times. Instead, it involves a four-step process that is both powerful and computationally efficient 10:

Step 1: Linear Projections (Split)

Instead of having a single set of weight matrices ($W^Q, W^K, W^V$), the model has $h$ different sets, where $h$ is the number of attention heads (a hyperparameter, typically 8 or 12).10 The initial input vectors are projected $h$ times, once for each head, using these distinct weight matrices ($W_i^Q, W_i^K, W_i^V$ for head $i$). Crucially, these projections map the input vector from its full $d_{model}$ dimension to a lower dimension, typically $d_k = d_v = d_{model} / h$.10 For example, if $d_{model}=512$ and $h=8$, each head will work with Q, K, and V vectors of dimension 64.

Step 2: Parallel Attention (Attend)

Scaled dot-product attention is then performed independently and in parallel for each of the $h$ heads, using its respective lower-dimensional Q, K, and V matrices.10 This step is computationally identical to the single-head attention described in Section 3, but it happens $h$ times on different projections of the data. This results in $h$ separate output matrices, each of dimension (sequence_length, d_v).

Step 3: Concatenate

The $h$ output matrices from the attention heads are concatenated together along the last dimension.1 This combines the “findings” of all the specialist heads into a single, large matrix of dimension (sequence_length, $h \cdot d_v$). Since $h \cdot d_v = d_{model}$, this restores the representation to its original dimensionality.

Step 4: Final Linear Projection (Project)

Finally, this concatenated matrix is passed through one more linear projection, multiplying it by a final learned weight matrix, $W^O$, of size (d_model, d_model).10 This final projection layer allows the model to learn how to best mix and combine the outputs of the different attention heads, producing the final output of the multi-head attention layer.

This entire process is designed to be computationally efficient. While it seems more complex, the total number of computations is roughly the same as performing a single-head attention with the full $d_{model}$ dimension, because the work is distributed across multiple heads operating on smaller vectors.10 The model gains significant expressive power without a major increase in computational cost simply by restructuring the computation. This architectural pattern—achieving complexity through parallel simplicity—is a key element of the Transformer’s success.
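A compact NumPy sketch of the split-attend-concatenate-project workflow is shown below. It stores the $h$ per-head projections as slices of single $d_{model} \times d_{model}$ matrices, which is equivalent to the per-head matrices described above; all weight names are illustrative.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // num_heads

    def split_heads(m):                            # (n, d_model) -> (heads, n, d_k)
        return m.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    # Step 1: project the inputs, then split into lower-dimensional per-head views.
    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)

    # Step 2: scaled dot-product attention in parallel for every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (heads, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    head_outputs = weights @ V                               # (heads, n, d_k)

    # Steps 3-4: concatenate the heads and apply the final projection W^O.
    concat = head_outputs.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o
```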

 

Section 5: The Anatomy of a Transformer Block

 

The multi-head attention mechanism and the position-wise feed-forward network are the core computational engines of the Transformer. They are organized, along with two other crucial components, into a standardized unit called a Transformer Block. The full encoder and decoder are simply stacks of these identical blocks (the original paper uses a stack of six).13 Understanding the interplay of the components within a single block is key to understanding the model’s overall function.

 

5.1 The Attention Sub-Layer

 

The first component within a Transformer block is the multi-head attention sub-layer, as detailed in Section 4.13 In an encoder block, this is a standard multi-head self-attention mechanism, where the Q, K, and V inputs are all derived from the output of the previous layer. This sub-layer is responsible for all inter-token communication and information aggregation within the block. It takes a sequence of vectors as input and produces a sequence of contextualized vectors of the same shape as output.

 

5.2 The Position-wise Feed-Forward Sub-Layer (FFN)

 

The second major component is a Position-wise Feed-Forward Network (FFN), which is a two-layer fully connected neural network.2 After the attention sub-layer has aggregated information from across the sequence, the FFN processes the resulting vector for each token independently.15 The exact same FFN (with the same learned weight matrices) is applied to each position’s vector in the sequence, but there is no information sharing between positions within this sub-layer.15

The FFN consists of two linear transformations with a non-linear activation function, such as ReLU (Rectified Linear Unit) or GELU (Gaussian Error Linear Unit), in between.3 The standard configuration is:

$\mathrm{FFN}(z) = \mathrm{ReLU}(zW_1 + b_1)W_2 + b_2$

 

The first linear layer typically expands the dimensionality of the vector (e.g., from $d_{model}=512$ to an inner-layer dimension $d_{ff}=2048$), and the second linear layer projects it back down to $d_{model}$.3 This expansion and contraction, combined with the non-linear activation, allows the model to learn more complex transformations of the token representations.
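In code, the position-wise FFN is just two matrix multiplications with a ReLU in between, applied to every position's vector with the same weights. A minimal NumPy sketch, with hypothetical weight names:

```python
import numpy as np

def position_wise_ffn(z, W1, b1, W2, b2):
    """z: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    The same weights are applied to every position independently."""
    hidden = np.maximum(0.0, z @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                 # project back down to d_model
```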

The Transformer block thus exhibits a powerful division of labor. The multi-head attention layer performs communication and aggregation, gathering context from across the entire sequence. The FFN then performs computation and transformation, applying a rich, non-linear function to each token’s context-aware representation in isolation. The full Transformer architecture works by stacking these blocks, creating a deep network that alternates between these two modes: first, every token talks to every other token (attention), and then every token “thinks” about what it heard (FFN).

 

5.3 The Architectural Glue: Residual Connections and Layer Normalization

 

Training a deep stack of these blocks would be practically impossible without two additional components that act as the architectural “glue”: Residual Connections and Layer Normalization.13 Each of the two sub-layers (attention and FFN) is wrapped in these two operations. The output of a sub-layer is formally defined as LayerNorm(x + Sublayer(x)), where $x$ is the input to the sub-layer and Sublayer(x) is the function implemented by the sub-layer itself (e.g., multi-head attention).13

Residual Connections, also known as skip connections, add the input of the sub-layer to its output.46 This simple addition has a profound impact. First, it creates a direct path, or “information highway,” for the gradient to flow during backpropagation. The derivative of the residual connection with respect to its input is 1, which ensures that even if the gradient through the sub-layer itself becomes very small (vanishes), the overall gradient signal can still pass through unimpeded, making it possible to train very deep networks.47 From a forward-pass perspective, this structure forces each sub-layer to learn a modification, or residual, to the input, rather than the entire transformation from scratch. This makes the default behavior of a sub-layer closer to an identity function, which is a more stable starting point for learning. It also ensures that a token’s original information is always carried forward, with each layer adding new contextual refinements on top.49

Layer Normalization is a technique that stabilizes the training process by normalizing the activations of each layer.46 For each token’s vector in the sequence, Layer Normalization computes the mean and variance across the feature dimension ($d_{model}$) and uses them to rescale the vector to have a mean of zero and a variance of one. It also includes two learnable parameters, a gain ($\gamma$) and a bias ($\beta$), that allow the model to scale and shift the normalized output, preserving its representational capacity.47 By keeping the activations within a consistent range, layer normalization smooths the optimization landscape and makes the model less sensitive to the scale of the initial weights.46 The precise placement of the layer normalization step also matters: it can be applied to the sub-layer’s input, inside the residual branch (Pre-LN), or after the residual addition (Post-LN, as in the original paper). This choice has significant implications for training stability, with Pre-LN often being more robust for very deep models.50
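The two wrapping conventions can be sketched in a few lines of NumPy; `sublayer` stands for either the attention or FFN function, and the gain/bias parameter names are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's vector across the feature dimension (d_model)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def post_ln(x, sublayer, gamma, beta):
    """Post-LN, as in the original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_ln(x, sublayer, gamma, beta):
    """Pre-LN variant: normalize the sub-layer's input; the residual path is untouched."""
    return x + sublayer(layer_norm(x, gamma, beta))
```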

 

Section 6: The Generative Counterpart: The Decoder Architecture

 

While the encoder’s role is to build a rich, contextualized representation of the input sequence, the decoder’s role is to generate an output sequence, one token at a time. This generative task requires a slightly different architecture that incorporates the encoder’s output while respecting the sequential, causal nature of generation.52 The decoder block is similar to the encoder block but includes a third sub-layer and modifies one of the existing ones.

 

6.1 Causal Attention: The Masked Multi-Head Mechanism

 

The decoder is auto-regressive, meaning that the prediction for the token at position $t$ depends on the previously generated tokens from positions 1 to $t-1$.53 During inference, this happens naturally, as the model generates one token, appends it to the input, and then generates the next. During training, however, to enable parallel processing, the entire ground-truth output sequence is fed to the decoder at once (a technique known as teacher forcing).54

To prevent the model from “cheating” by looking ahead at future tokens it is supposed to be predicting, the first attention sub-layer in the decoder employs Masked Multi-Head Self-Attention.15 This mechanism is identical to the standard multi-head self-attention in the encoder, with one crucial modification: a look-ahead mask is applied to the attention scores.53

This mask is a matrix that is added to the scaled $QK^T$ matrix before the softmax function is applied. The mask sets all values in the upper triangle of the score matrix—corresponding to connections where a query at position $i$ attends to a key at position $j > i$—to negative infinity ($-\infty$).14 When the softmax function is applied, these negative infinities become zeros, effectively preventing any token from attending to subsequent tokens.58 This ensures that the prediction for any given position can only depend on the known outputs at previous positions, preserving the auto-regressive property even during parallelized training.58
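A look-ahead mask of this kind can be built in one line; the sketch below shows an additive mask that, when added to the scaled scores before the softmax, zeroes out all attention to future positions.

```python
import numpy as np

def look_ahead_mask(n):
    """(n, n) additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

# Usage with the attention sketch from Section 3 (illustrative):
# output, weights = scaled_dot_product_attention(Q, K, V, mask=look_ahead_mask(len(Q)))
# Row i of `weights` is then zero for every column j > i.
```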

 

6.2 Cross-Attention: The Encoder-Decoder Bridge

 

After the masked self-attention sub-layer (and its associated residual connection and layer normalization), the decoder block contains a second, distinct attention mechanism. This is the third sub-layer, known as Encoder-Decoder Attention or, more commonly, Cross-Attention.15

This is the critical component that connects the encoder and decoder. While it uses the same multi-head attention machinery, its inputs are sourced differently 15:

  • The Query (Q) matrix is generated from the output of the previous decoder layer (the masked self-attention layer).
  • The Key (K) and Value (V) matrices are generated from the final output of the encoder stack. This encoder output is computed once, after the encoder finishes, and is consumed by the cross-attention sub-layer of every decoder block in the stack (each block projects it with its own learned $W^K$ and $W^V$).

This mechanism allows every position in the decoder to attend over all positions in the original input sequence.14 It is the step where the decoder consults the source text to inform its generation. For example, in a machine translation task, when the decoder is about to generate a French verb, the cross-attention layer allows its query (representing the context of the French sentence generated so far) to find the most relevant English words in the encoder’s output (e.g., the corresponding English verb and its subject) and incorporate their meaning into its prediction.19

The decoder’s architecture thus elegantly solves two distinct problems. The masked self-attention builds a coherent internal representation of the sequence generated so far, answering the question, “Given what I’ve already written, what is the structure of my partial output?” The cross-attention then grounds this generation in the source material, answering the question, “Given my partial output, what part of the original input should I focus on to decide the next word?” The decoder is therefore constantly balancing internal linguistic consistency with external contextual alignment.
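Cross-attention reuses the same attention machinery sketched in Section 3; only the sources of Q, K, and V change. The snippet below is illustrative, with hypothetical variable and weight names, and assumes the `scaled_dot_product_attention` helper from the earlier sketch.

```python
# decoder_state:  (m, d_model) output of the masked self-attention sub-layer
# encoder_output: (n, d_model) final output of the encoder stack
Q = decoder_state @ W_q        # queries come from the decoder
K = encoder_output @ W_k       # keys and values come from the encoder
V = encoder_output @ W_v
context, weights = scaled_dot_product_attention(Q, K, V)  # no causal mask here
```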

 

Section 7: From Vector to Vocabulary: The Final Output Projection

 

After the data has passed through the entire stack of decoder blocks, the model has produced a final sequence of high-dimensional vectors. The last step of the process is to transform the vector corresponding to the final token position into a probability distribution over the entire vocabulary, from which the next token can be predicted. This is accomplished by a final linear layer and a softmax function.

 

7.1 The Final Linear Transformation

 

The output vector from the top decoder block at the current time step is fed into a final, fully connected linear layer.17 This layer acts as a projection, transforming the $d_{model}$-dimensional vector into a much larger vector with a dimension equal to the size of the vocabulary (e.g., 50,257 for GPT-2).17 The output of this layer is a vector of raw, unnormalized scores known as logits, where each element corresponds to a token in the vocabulary.

This final linear layer can be thought of as an “un-embedding” layer, performing the inverse operation of the initial input embedding layer. The input embedding layer maps a token ID to a dense semantic vector; this final linear layer maps a dense, context-rich vector back to a score for every possible token ID. This conceptual symmetry has led to the practice of weight tying, where the weight matrix of this final linear layer is constrained to be the transpose of the input embedding matrix.3 This reduces the total number of model parameters and imposes an elegant inductive bias: the knowledge used to convert a word into a vector should be the same as the knowledge used to convert a vector back into a word.

 

7.2 The Softmax Function and Probability Distribution

 

The logits vector, containing a raw score for every word in the vocabulary, is then passed through a softmax function.17 As in the attention mechanism, the softmax function converts these scores into a valid probability distribution.37 It exponentiates each logit and normalizes the results so that every value is between 0 and 1, and the sum of all values in the vector equals 1.

The resulting vector represents the model’s predicted probability for each possible next token. In the simplest decoding strategy, known as greedy decoding, the token with the highest probability is selected as the output for the current time step.17 This predicted token is then fed back into the decoder as input for the next time step, and the entire process repeats until the model generates a special <end-of-sequence> token, signaling the completion of the output.53 More advanced strategies like beam search or nucleus sampling can be used to generate more diverse or higher-quality outputs by considering multiple candidate tokens at each step.
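A minimal sketch of this final stage, assuming a single $d_{model}$-dimensional vector from the top of the decoder and an un-embedding matrix (possibly tied to the input embeddings); all names are illustrative.

```python
import numpy as np

def predict_next_token(decoder_vector, W_unembed):
    """decoder_vector: (d_model,); W_unembed: (d_model, vocab_size).
    Returns the greedily chosen token ID and the full probability distribution."""
    logits = decoder_vector @ W_unembed      # raw scores over the vocabulary
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs      # greedy decoding: take the arg-max
```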

 

Section 8: A Synthetic Walkthrough: Tracing a Vector’s Journey

 

To synthesize the concepts discussed, this section will trace a hypothetical input sequence through a simplified Transformer model designed for a machine translation task. We will follow the journey of the input sentence “The cat sat” as it is processed by the encoder and as the decoder begins to generate a translated output.

  1. Input Representation:
  • Tokenization & Embedding: The input “The cat sat” is tokenized into [<start>, “The”, “cat”, “sat”, <end>]. Each token is looked up in an embedding matrix, producing five vectors of dimension $d_{model}$ (e.g., 512).
  • Positional Encoding: Sinusoidal positional encoding vectors are generated for positions 0 through 4 and added to the corresponding token embeddings. The result is a matrix of size (5, 512), which is the final input to the encoder.
  2. Through the Encoder Block:

Let’s focus on the processing of the token “cat” at position 2. Its input vector, $x_{cat}$, enters the first encoder block.

  • Multi-Head Self-Attention:
  • Q, K, V Generation: $x_{cat}$ is multiplied by the weight matrices $W_i^Q, W_i^K, W_i^V$ for each of the $h$ heads, producing $h$ sets of lower-dimensional query, key, and value vectors (e.g., dimension 64).
  • Attention Scores: The query vector for “cat” from one head, $q_{cat, i}$, is used to calculate dot-product scores with the key vectors from that same head for all tokens: $k_{<start>, i}, k_{The, i}, k_{cat, i}, k_{sat, i}, k_{<end>, i}$. The model might learn, for example, that the key for “sat” is highly compatible with the query for “cat,” resulting in a high score.
  • Weights: These scores are divided by $\sqrt{d_k}$ and passed through a softmax function, yielding attention weights. The weight for “sat” might be high (e.g., 0.6), while the weight for “The” might be lower (e.g., 0.1).
  • Output Vector: The new vector for “cat” for this head, $z_{cat, i}$, is the weighted sum of all value vectors: $\alpha_{cat \to <start>}v_{<start>, i} +… + \alpha_{cat \to <end>}v_{<end>, i}$. This vector now contains strong contextual information from “sat.”
  • Concatenation & Projection: The output vectors from all $h$ heads are concatenated and projected by $W^O$ to produce the final attention output for “cat,” $z_{cat}$, of dimension 512.
  • Add & Norm: A residual connection adds the original input $x_{cat}$ to the attention output: $z_{cat}’ = x_{cat} + z_{cat}$. This sum is then layer-normalized.
  • Feed-Forward Network: The normalized vector $z_{cat}’$ is passed through the two-layer FFN for further non-linear transformation.
  • Final Add & Norm: Another residual connection and layer normalization are applied. The result is the final output of the first encoder block for the token “cat.” This process happens in parallel for all tokens and is repeated for each block in the encoder stack.
  3. The Encoder-Decoder Bridge:

After passing through all encoder blocks, the final output is a matrix of contextualized representations for the input sentence. These vectors are used to generate the Key ($K_{enc}$) and Value ($V_{enc}$) matrices for the cross-attention mechanism in every decoder block.

  4. One Step of Decoding:

Assume the decoder has already generated the start token <s> and the first word “Le”. It now needs to predict the next word.

  • Decoder Input: The input to the decoder is [<s>, Le]. These tokens are embedded and combined with positional encodings.
  • Masked Self-Attention: The decoder performs masked self-attention on its own input. When processing “Le”, it can only attend to <s> and “Le” itself. This builds a representation of the generated prefix.
  • Cross-Attention: The output from the masked self-attention layer for “Le” is used to form a query vector, $q_{Le}$. This query is then used to attend to the encoder’s output. It computes scores against the keys of the English words: $q_{Le} \cdot k_{<start>, enc}, q_{Le} \cdot k_{The, enc}, q_{Le} \cdot k_{cat, enc},…$. The model would likely learn to produce a high score for “cat,” as “Le chat” is the translation of “The cat.” The resulting cross-attention output vector for “Le” will be heavily influenced by the encoder’s representation of “cat.”
  • FFN and Output: This vector passes through the FFN and final Add & Norm layers. The resulting vector at the top of the decoder stack is then passed to the final linear layer.
  • Prediction: The linear layer projects this vector into a logits vector over the entire French vocabulary. The softmax function converts this to probabilities. The token with the highest probability (ideally, “chat”) is selected as the next output. The new input to the decoder in the next step will be [<s>, Le, chat].

This step-by-step process, combining self-attention for internal context, cross-attention for external grounding, and feed-forward networks for transformation, allows the Transformer to effectively model complex dependencies and perform tasks like machine translation with high fidelity. The following table provides a concrete map of how the data tensor shapes evolve through this process.

| Stage | Tensor Name | Shape | Description |
| --- | --- | --- | --- |
| Input | Input IDs | (B, N) | Batch of integer token IDs. |
| Embedding | Token Embeddings | (B, N, D) | Semantic vectors from lookup table. |
| Positional Enc. | Positional Encodings | (N, D) | Sinusoidal position vectors. |
| Encoder Input | Final Input | (B, N, D) | Sum of token and positional embeddings. |
| Multi-Head Attn. | Q, K, V Projections | (B, N, D) | Input to each head’s projection. |
| | Q, K, V (reshaped) | (B, H, N, D_k) | Vectors split across H heads. |
| | Attention Scores | (B, H, N, N) | Raw similarity scores from QK^T. |
| | Attention Weights | (B, H, N, N) | Scores after scaling and softmax. |
| | Attention Output | (B, H, N, D_k) | Weighted sum of Value vectors per head. |
| | Concatenated Output | (B, N, D) | Head outputs merged and projected. |
| FFN | FFN Input | (B, N, D) | Output from attention sub-layer. |
| | FFN Hidden | (B, N, D_ff) | Expanded representation after 1st linear layer. |
| | FFN Output | (B, N, D) | Output after 2nd linear layer. |
| Final Block Output | Encoder/Decoder Output | (B, N, D) | Final contextualized vectors. |
| Output Layer | Logits | (B, N, V_size) | Projection to vocabulary space. |
| | Probabilities | (B, N, V_size) | Final probability distribution after softmax. |

Table 2: Matrix Dimensionality Trace. B: batch size, N: sequence length, D: model dimension ($d_{model}$), H: number of heads, $D_k$: key/value dimension per head ($D/H$), $D_{ff}$: FFN inner dimension, $V_{size}$: vocabulary size.
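The shape bookkeeping in Table 2 can be checked mechanically; the following sketch uses illustrative sizes and zero-filled tensors purely to assert that the reshapes line up.

```python
import numpy as np

B, N, D, H, D_ff, V_size = 2, 5, 512, 8, 2048, 32_000   # illustrative sizes
D_k = D // H

x = np.zeros((B, N, D))                                   # embeddings + positions
q = x.reshape(B, N, H, D_k).transpose(0, 2, 1, 3)         # split across heads
assert q.shape == (B, H, N, D_k)
scores = np.zeros((B, H, N, N))                           # QK^T scores / weights
head_out = np.zeros((B, H, N, D_k))                       # per-head weighted sums
merged = head_out.transpose(0, 2, 1, 3).reshape(B, N, D)  # concatenated heads
hidden = np.zeros((B, N, D_ff))                           # FFN expansion
logits = np.zeros((B, N, V_size))                         # vocabulary projection
assert merged.shape == (B, N, D) and logits.shape == (B, N, V_size)
```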

 

Conclusion

 

The Transformer architecture represents a fundamental shift in processing sequential data, moving from iterative, recurrent computation to a parallelized, attention-based approach. At the “neuron level,” its circuits are not composed of simple, isolated neurons but are intricate, dynamic systems of vector transformations. The journey of a token begins with its conversion into a high-dimensional vector encoding both its semantic meaning and its absolute position. This vector then traverses a deep stack of Transformer blocks, each of which refines its representation through two key sub-layers. The multi-head self-attention layer acts as a communication hub, allowing each token to dynamically query all other tokens and aggregate relevant information into a new, context-aware vector. The position-wise feed-forward network then acts as a computational unit, applying a complex, non-linear transformation to this contextualized vector in isolation.

These core operations are enabled by architectural “glue”—residual connections that ensure a stable flow of information and gradients, and layer normalization that stabilizes the training dynamics. In generative tasks, the decoder builds upon this foundation with two specialized attention mechanisms: masked self-attention to enforce causality and maintain internal coherence, and cross-attention to ground the generated output in the context provided by the encoder. The process culminates in a projection back to the vocabulary space, where a softmax function yields a probability distribution for predicting the next token.

By deconstructing sequence processing into a series of parallelizable matrix operations, the Transformer not only overcame the primary limitations of its recurrent predecessors but also created an architecture that could fully leverage the power of modern parallel computing hardware. This synergy between algorithmic design and hardware capability has unlocked unprecedented scale, making the Transformer the foundational circuit for the current generation of large language models and a cornerstone of modern artificial intelligence.