We are on the edge of a new era of artificial intelligence, built on the recognition that information is intrinsically multi-dimensional and that, through training, computer systems can interact with it in powerful, human-like ways. In this discussion, we look specifically at transformers to gain more insight into how they do this. We must paint this picture with broad strokes, because no human fully understands how these systems work at a granular level.
Key Mathematical Principles
The math in transformers is surprisingly basic, but it does require some knowledge of vectors and matrices and at least an awareness of the ideas behind calculus. A vector is a point in n-dimensional space, or equivalently an arrow from the origin to that point. With the latter perspective, vectors have a direction and a magnitude (length). A one-row matrix with 300 columns is equivalent to a vector in a 300-dimensional space. A matrix with multiple rows (confusingly called a multi-dimensional matrix) is essentially a collection of vectors. This is also called a tensor.
We can use matrices as operands in various operations, like multiplication and addition. Dot-product multiplication is particularly important. This is an operation with several steps, not a simple product. We take each row in the first matrix, stand it up as a column, and multiply its values element by element against each column in the second matrix. This generates a temporary matrix. We then add the values in each of its columns and use those sums to form one row of a final matrix. Repeating this process for each row in the first matrix, we build the final matrix row by row. This operation requires the number of columns in the first matrix to equal the number of rows in the second.
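The steps above can be sketched in a few lines of Python with numpy (the numbers here are arbitrary; this is just the row-times-column recipe made explicit):

```python
import numpy as np

# Two small matrices: A is 2 x 3, B is 3 x 2.
# The column count of A (3) matches the row count of B (3),
# as the dot product requires.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# Manual dot product: entry (i, j) is the sum of A's row i
# multiplied element by element against B's column j.
result = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        result[i, j] = np.sum(A[i, :] * B[:, j])

# This matches numpy's built-in matrix product.
assert np.array_equal(result, A @ B)
```

In practice, libraries like numpy perform this with the `@` operator in one highly optimized step.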
In this discussion, we will use a basic concept in calculus called the derivative. A derivative is the instantaneous rate of change of one variable with a change in another variable in an equation. For example, the speed of a car at an instant in time is the derivative of the function giving the car's position over time; likewise, its acceleration is the derivative of its speed over time.
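As a quick numeric illustration of the car example (the position function here is made up for the demonstration), a derivative can be approximated by measuring change over a tiny interval:

```python
# Position of a car as a function of time (seconds -> meters).
# A made-up example: position grows as t squared.
def position(t):
    return t ** 2

# Approximate the derivative (the speed) at t = 3 by measuring
# the change in position over a tiny interval around t = 3.
h = 1e-6
speed_at_3 = (position(3 + h) - position(3 - h)) / (2 * h)

# The exact derivative of t**2 is 2*t, so the speed at t = 3
# should come out very close to 6 meters per second.
```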
The Concept Space
Imagine that every concept we ever use in life is represented by a point in multi-dimensional space, such as a 300-D space. Our transformer established the location of these points during its training, but that training required us to first break language into units representing the many distinct concepts we use in thought and understanding. Often, an individual word corresponds to a concept, but sometimes it is a part of a word or an entire phrase. We call these units tokens, and the software that converts language to tokens a tokenizer. In the training process, the transformer creates a point in 300-D space for each token. A different dimensionality is also possible, but 300 is common.
For now, we can just accept that we have these points in 300-D space, embedded there by the training process, and each corresponds to a token. We have no idea what the 300 axes measure; they are whatever the transformer optimized them to be during training. These axes don’t correspond exactly to any human ideas and the transformer doesn’t bother to name them. They are “whatever works.” We can suspect that there may be more than 300 aspects to a concept, and this is a reduction, combining aspects that tend to track together. Although many transformers use the same system of tokens, each transformer embeds them (sets their axis values) in its own way.
Since a 300-D space has 300 axes, each of these points has 300 values. We could show this as a matrix with 300 columns and one row for each concept. This is called the Embeddings Matrix or Embeddings Layer. However, we are taking a spatial perspective, so we will speak in terms of the collection of E points in 300-D space.
Having achieved the right position for each E point through the training process, we can use them for artificial intelligence. When the user provides input to the transformer, it first goes through a tokenizer that converts it to a sequence of token IDs and feeds those IDs to the transformer. The transformer only needs the token IDs, since it set up its E points in reference to these IDs. As an example, “you” might be token #432, and the transformer knows that corresponds to an E point at specific coordinates.
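This lookup is nothing more than row indexing into the embedding matrix. A toy sketch (with a tiny made-up vocabulary and dimensionality, and random numbers standing in for trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy embedding matrix: 5 tokens in the vocabulary, 4 dimensions.
# (A real transformer might use ~50,000 tokens and 300 dimensions;
# the values here are random stand-ins for trained embeddings.)
vocab_size, dim = 5, 4
embeddings = rng.normal(size=(vocab_size, dim))

# Hypothetical tokenizer output: e.g., "you" -> token #2.
token_ids = [2, 0, 3]

# Looking up the E points is just selecting rows by token ID.
E = embeddings[token_ids]   # shape (3, 4): one E point per input token
```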
Using the Transformer
The transformer has access to all the E points through its embedding layer, properly indexed, so it can look up the relevant E points from the input token indices. During the inference process (responding to users), it executes these steps:
1. Translocate the input E points to reflect token position. The transformer's first step after looking up the input's E points is to adjust them in the 300-D space based on their position in the sequence. Each point moves in a specific "direction" and distance in the 300-D space based exclusively on its position in the sequence of inputs. The E point for the first concept will always move the same distance in the same "direction" regardless of the concept. Traditional terminology for this is "adding positional encoding to the input embedding." If you want more detail, the transformer moves E by superimposing sine and cosine waves across the 300 dimension values; each dimension gets its own wave frequency, the token's position determines where along each wave we sample, and the sampled values (between -1 and 1) are added to the dimension values. We could say more about the rationale for doing it this way; suffice it to say it works and keeps the added values small.
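A sketch of the classic sinusoidal scheme (the formula from the original transformer paper; the sequence length and the use of 10000 as the frequency base follow that convention):

```python
import numpy as np

def positional_encoding(num_tokens, dim):
    """Sinusoidal positional encoding: each position gets a fixed
    vector of sine/cosine samples between -1 and 1. The wave
    frequency varies across the dimensions; the token's position
    determines where along each wave we sample."""
    positions = np.arange(num_tokens)[:, None]      # (T, 1)
    dims = np.arange(0, dim, 2)[None, :]            # (1, dim/2)
    angles = positions / (10000 ** (dims / dim))
    pe = np.zeros((num_tokens, dim))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(num_tokens=10, dim=300)

# Translocating the E points is then simple addition:
#   E_translocated = E + pe
```

Note that position 0 adds sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, and every added value stays within [-1, 1], which is what keeps the translocation small.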
2. Spawn three new points. We'll call these the Q, K, and V points. They also exist in the 300-D space. They are projections from the translocated E points, using a specific set of weights (from the training) for each of the three new points. The weights for Q, K, and V are the same for every token, so every token's E point is projected in the same way. Hence, the Q, K, and V point collections are each systematically transformed copies of the E point collection, sitting in different regions of the 300-D space. If it makes them easier to visualize, think of these points as colored: white E, red Q, green K, and blue V. Since Q, K, and V all derive from E, it is as though we have a prism breaking E into its parts.
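Spawning Q, K, and V is three matrix multiplications against shared weight matrices. A sketch (the weights here are random stand-ins for trained parameters, scaled small):

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim = 4, 300                       # 4 tokens, 300 dimensions

E = rng.normal(size=(T, dim))         # translocated E points (stand-ins)

# One learned weight matrix per projection, shared by every token.
W_Q = rng.normal(size=(dim, dim)) * 0.01
W_K = rng.normal(size=(dim, dim)) * 0.01
W_V = rng.normal(size=(dim, dim)) * 0.01

# Each token's E row goes through the same three projections.
Q = E @ W_Q
K = E @ W_K
V = E @ W_V
```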
3. Use the new Q and K points to create a matrix of attention scores. In this step, we execute a dot-product operation between two matrices. Let "T" be the number of tokens. The first matrix has a row for each token and a column for each dimension (T x 300) and shows the values for each token's Q point. The second has a column for each token and a row for each dimension (300 x T) and shows the values for each token's K point. The result of this dot-product multiplication is a T x T matrix. Each value is the sum of the Q x K dimension products for the two tokens. These are the attention scores. We can think of these as representing the relationships between tokens. After calculating the attention score matrix, the transformer scales the values (dividing by the square root of the dimension count) and normalizes each row with a softmax, so the values are between 0 and 1 and each row sums to 1.
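Step 3 can be sketched as follows (the Q and K values are random stand-ins; the scaling and softmax are the standard recipe):

```python
import numpy as np

def attention_scores(Q, K):
    """Scaled dot-product attention scores: a T x T matrix whose
    rows are softmax-normalized to lie in [0, 1] and sum to 1."""
    dim = Q.shape[1]
    raw = Q @ K.T / np.sqrt(dim)            # (T, T) scaled scores
    raw -= raw.max(axis=1, keepdims=True)   # shift for numerical stability
    weights = np.exp(raw)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
Q = rng.normal(size=(5, 300))
K = rng.normal(size=(5, 300))
A = attention_scores(Q, K)   # 5 tokens -> a 5 x 5 score matrix
```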
4. Create C points from V points and the attention matrix. As discussed above, if "T" is the number of tokens/concepts in the input, the attention score matrix will be T x T. The next step is a dot-product multiplication of this attention score matrix against the T x 300 matrix of all the V vectors (the V matrix), resulting in a T x 300 matrix of new vectors and corresponding points in space. These are called Context Vectors, so let's call these the C points and think of them as yellow when visualizing the constellation of points. There is one C point for every V point. If we looked at our concepts in 300-D space now, we would see that some of the white E points (the ones mapping to our input tokens) have a constellation of other red, green, blue, and yellow points associated with them (Q, K, V, and C).
To illustrate what this dot-product multiplication is doing, let us consider a specific V point that we will call "Bob" and another called "Tom." Bob and Tom are each attached to tokens; remember there is an attention score between every pair of tokens. To get the first dimension of Bob's C point, we add up the first dimension of every V point, with each multiplied by the relevant attention score. If the attention score between Bob and Tom is 0.5, and Tom's first dimension is 6, Tom contributes 0.5 x 6 = 3 to that sum.
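The Bob-and-Tom example looks like this in code (the V values are made up; note each attention row sums to 1):

```python
import numpy as np

# Attention row for "Bob": how much Bob attends to each of two tokens.
# Suppose Bob attends to himself with weight 0.5 and to Tom with 0.5.
attention_row = np.array([0.5, 0.5])        # sums to 1

# V points for [Bob, Tom], each with 2 dimensions for simplicity.
# Tom's first dimension is 6, as in the example above.
V = np.array([[2.0, 4.0],    # Bob's V point
              [6.0, 8.0]])   # Tom's V point

# Bob's C point: every V point scaled by its attention score, then summed.
C_bob = attention_row @ V

# Tom's contribution to the first dimension of Bob's C point:
tom_contribution = attention_row[1] * V[1, 0]   # 0.5 * 6 = 3
```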
The part of the transformer carrying out steps 1-4 above is called an attention head. It first translocated our E points to reflect each token's position in the sequence. Then, it spawned three new points from each E point (Q, K, V); the E-to-Q, E-to-K, and E-to-V projections differ from one another, but each projection uses the same weights for every E point. Next, it created a matrix reflecting, for each token, the multiplications of its Q and all K points, and used that matrix to create the C points. The output of the attention head is the C points. These trace back to the E points but bear the influences of token position and of attention scores reflecting their relationships to other tokens/concepts.
Since each row of attention scores sums to 1, a C point is a weighted average of the V points. A token that receives only low attention scores from the other tokens therefore contributes little to their C points. Many people have viewed such tokens as having less influence on the output of the transformer, but a 2019 study by Jain and Wallace, published in Proceedings of NAACL-HLT 2019 ("Attention is not Explanation"), casts doubt on that assumption. Although ideas about human cognition inspired the attention mechanism, as it turns out we still don't really know why it works so well!
Summary of Terms
Matrix or Vector | As Points in Space
The Embedding Matrix | The complete E point collection for the system of tokens
The Input Embeddings | The input E points
Positional Encoding | E point translocation direction/distance
Q (Query) Vector | Q point
K (Key) Vector | K point
V (Value) Vector and Value Matrix | V point and V point collection
Attention Scores | Weights used to combine V points into C points
Context Vector and Context Matrix | C point and C point collection
Multiple Heads. Although the training process created only one set of E points in 300-D space, the transformer layer may have several parallel attention heads for separate manipulations of these points, each operating in 300-D space. This allows the system to examine the context in several ways. Interestingly, there is evidence that the training process may result in individual attention heads specializing in identifiable aspects, such as grammatical relationships, but their function nevertheless remains somewhat opaque.
Suppose there are 3 heads that spawn the Q, K, and V points differently and hence produce different C points. The transformer layer then concatenates the results to create one big T x 900 matrix. Each token now has three C points in the 300-D space, described by this matrix. These are the points/vectors that go into further processing. (In practice, transformers often use more heads, with each head working in a smaller subspace so that the concatenated width matches the model dimension, but the principle is the same.)
The magic of the transformer is in this matrix, the result of all the transformations described above. The math is simple, there’s just a lot of it.
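The concatenation step itself is simple; a sketch with random stand-ins for each head's C points:

```python
import numpy as np

rng = np.random.default_rng(3)
T, dim, num_heads = 4, 300, 3

# Each head produces its own T x 300 matrix of C points.
# (Random stand-ins here for the full attention computation.)
head_outputs = [rng.normal(size=(T, dim)) for _ in range(num_heads)]

# Concatenate along the dimension axis: each token's three C points
# are laid side by side, giving a T x 900 matrix.
concatenated = np.concatenate(head_outputs, axis=1)
```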
Further processing. After the attention layer, the concatenated output from the attention heads goes through a plain old-fashioned feed-forward neural network (FFN). This network typically consists of two linear (fully connected) layers with a non-linear activation function in between, followed by a final normalization step. Feed-forward neural networks are old technology but can be powerful in specific scenarios. As with all neural networks, training optimizes their parameters, but we don't know how they make their decisions.
The FFN output is a collection of vectors in the 300-D space (a matrix), with one vector corresponding to each token. At this point, the system does an odd thing. It reaches back and adds the FFN's own input (our C vectors/points) to each corresponding FFN output vector. This is called the "residual connection" or "skip connection." Originally, it was introduced to address a technical problem in the training process (the "vanishing/exploding gradient" problem) and from a general consideration of information flow. It turns out that it does indeed improve transformer performance during use (inference).
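The FFN, residual connection, and normalization can be sketched together (weights are random stand-ins; the hidden width of 1200 and the ReLU activation are typical choices, not fixed requirements):

```python
import numpy as np

rng = np.random.default_rng(4)
T, dim, hidden = 4, 300, 1200

C = rng.normal(size=(T, dim))   # attention output (C points), stand-ins

# Two linear layers with a ReLU non-linearity in between.
W1, b1 = rng.normal(size=(dim, hidden)) * 0.01, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, dim)) * 0.01, np.zeros(dim)

ffn_out = np.maximum(0, C @ W1 + b1) @ W2 + b2

# Residual connection: add the FFN's input back to its output.
out = C + ffn_out

# Layer normalization: rescale each token's vector to mean 0, variance 1.
out = (out - out.mean(axis=1, keepdims=True)) / out.std(axis=1, keepdims=True)
```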
From here, the resultant T x 300 matrix may serve as the input sequence to another attention layer. Eventually, the output of the final layer passes through one last linear projection and normalization. This continued transformation finally gives rise to a set of probabilities indicating, for each token in the vocabulary, the chance that it would be the best next token in generating output.
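The final conversion of scores to probabilities is a softmax, just as in the attention step. A sketch (the per-token scores, called logits, are random stand-ins for the final layer's output):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size = 50_000

# One score ("logit") per vocabulary token for the next position.
# (Random stand-ins for the final linear projection's output.)
logits = rng.normal(size=vocab_size)

# Softmax turns the scores into next-token probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# A simple "greedy" strategy just picks the most probable token.
next_token = int(np.argmax(probs))
```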
The Points Perspective in Training
This discussion focuses on how the transformer operates after training, but it is worthwhile to briefly summarize the training process. Training refers to optimizing the transformer’s parameters so that in interacting with a user, it does a decent job in choosing its words. More specifically, given all the preceding tokens, it must choose each next token, so the entire response is of high quality and on target. (You will recall that tokens represent words, parts of words, word groups, etc.)
To understand parameters, we must first consider the nature of neural networks. These consist of many equations that take inputs and deliver outputs to other equations. We can visualize this as a web, with equations at the strand crossings. In this spatial visualization, each equation is a node in the web. The nodes use weights in their calculations; those weights remain constant during use. These are the parameters. We earlier described how the transformer spawns Q, K, and V points from E points in 300-D space. Each of those projections uses a 300 x 300 weight matrix (90,000 parameters) to specify how Q derives from E. Subsequently, as the inputs trigger a cascade of calculations throughout the network, a transformer may use billions of parameters.
Although these are constant numbers during use, during training they are variables. In the training process, we feed the transformer a large quantity of text, and it continually tries to predict the next token in that input, grading itself on its performance. A perfect performance would occur if the correct token received a score of "1" and all others in the vocabulary (e.g., 50,000) received a score of "0." However, there are always differences between the ideal and actual scores for every prediction. The average of these differences across all tokens and all inputs in a batch is the average loss. The average loss depends on all the parameter values, so adjusting those values can reduce it. The function that maps the parameter values to the average loss is called the loss function.
Since the parameters and the average loss L are all variables during training, we can think of the loss function as an equation with P+1 variables, where P is the number of parameters. Of course, not all combinations of values are possible with a given loss function. If we think of a gigantic multidimensional space with P+1 axes, the allowed combinations form a shape (a "manifold") in this space. The shape can resemble a landscape with respect to the L axis. If you start with one set of parameter values, adjusting them may increase the loss, decrease it, or keep it the same. Parameter adjustments move you from one allowed point to another and thereby define a vector. The direction of steepest increase in loss is called the loss gradient; training moves the parameters in the opposite direction, toward the steepest decrease.
To determine that direction, we can take the partial derivative of the loss with respect to each parameter. Visualize a plane in our high-dimensional space that includes two axes, P and L, where P is a given parameter value and L is the loss. The partial derivative dL/dP describes the instantaneous rate of change of loss per change in that parameter, with all other parameters held constant. We can use this value to create a vector in the direction of the parameter axis, with length equal to the derivative value. An optimizer component does this for each parameter during training, adds the vectors together, and then scales the result down according to its settings. This is a huge calculus exercise, but the chain rule of calculus helps make it feasible.
With each batch of input, the transformer executes this process and adjusts the parameters. We call this backpropagation. The size of the adjustments depends on how far we choose to move against the loss gradient (the learning rate, a setting we specify). Eventually, the transformer reaches a flat place in our manifold, and it may stop (again depending on settings).
The fact that neural networks can use calculus to determine the right "direction" for parameter adjustments to decrease loss is what makes them possible. We can thank Newton and Leibniz for that! Still, the process is daunting and requires huge computing capacity. In training, a transformer may calculate over a quadrillion derivatives!