Transformer Architecture, Part 1: The Encoder

The Transformer architecture in machine learning is a deep learning model primarily used for natural language processing tasks. Introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017, the Transformer utilizes a mechanism known as self-attention to process input data.
One of the key features of the Transformer is its ability to handle sequential data without relying on recurrent neural networks (RNNs) or convolutional neural networks (CNNs). This is achieved through the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence relative to one another, providing context and enhancing understanding.
The architecture consists of an encoder and a decoder. The encoder's role is to process the input data and generate a continuous representation, while the decoder takes this representation and generates the output sequence. Each encoder and decoder layer consists of multiple attention heads, enabling the model to focus on different parts of the input simultaneously.
Positional encoding is also a critical component of the Transformer, as it provides the model with information about the position of each token in the input sequence. This is essential because, unlike RNNs, Transformers do not inherently maintain the order of the input data.
The combination of self-attention and feed-forward neural networks allows Transformers to capture complex relationships in data, making them particularly effective for tasks such as translation, summarization, and sentiment analysis. The architecture has also led to the development of state-of-the-art models like BERT, GPT, and T5, which have achieved remarkable performance across various benchmarks in natural language processing.

Overall, the Transformer architecture has revolutionized the field of machine learning and continues to be a foundational element in the development of advanced AI systems.
Encoder: The encoder consists of a stack of six identical layers, each designed to enhance the representation of the input through a two-part structure. The first part is a multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence simultaneously, capturing various relationships and dependencies. The second part is a position-wise fully connected feed-forward network that processes the output of the attention mechanism.
To ensure smooth learning and maintain information flow, we incorporate residual connections around each of these two components. This means that for each sub-layer, the output is calculated as the sum of the layer’s input and the sub-layer's output, followed by a layer normalization step. This is mathematically expressed as LayerNorm(x + Sublayer(x)), where Sublayer(x) denotes the operation carried out by the particular sub-layer.
Furthermore, to support these residual connections, every sub-layer in the model—including the embedding layers—generates outputs with a consistent dimensionality of 512, known as d_model. This uniformity aids in effectively combining the outputs from different layers while preserving the integrity of the information being processed.
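As a concrete illustration, the LayerNorm(x + Sublayer(x)) step can be sketched in NumPy. This is a minimal version: it omits the learnable gain and bias parameters that real layer-normalization implementations include, and uses an identity function as a stand-in sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's 512-dim vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)), as described above.
    return layer_norm(x + sublayer(x))

# Example: 3 tokens, d_model = 512, with an identity "sub-layer" stand-in.
x = np.random.randn(3, 512)
out = residual_block(x, lambda t: t)
print(out.shape)  # (3, 512)
```

Because the residual sum keeps the same d_model = 512 shape as the input, the same wrapper can be placed around both the attention and feed-forward sub-layers.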
With that overview in place, the rest of this article breaks the Transformer down into its fundamental components, beginning with the encoder. Examining each part in turn clarifies how it functions and contributes to the overall effectiveness of the architecture.

In the encoder block of a Transformer model, the input consists of word embeddings enhanced by positional encodings, which provide information about the order of the words in a sequence. These enriched embeddings are then processed through a multi-head attention mechanism, which allows the model to focus on different parts of the input simultaneously, capturing various relationships and dependencies between words. After the attention step, a skip connection adds the original embeddings back in, enabling smoother gradient flow and facilitating learning. Finally, the combined output undergoes layer normalization, which standardizes the activations and helps stabilize and accelerate training. The encoder block is shown below; we will now break it down step by step.

Input Embeddings
A 512-dimensional word embedding vector is generated for each word in the sentence "I visited Spain." To enhance these embeddings, positional encodings, derived from sine and cosine functions, are calculated for each position. These positional encodings are added element-wise to the corresponding word embeddings, giving the model information about where each word sits in the sentence. The result captures the meaning of the words while retaining their sequential order, and becomes the input to the self-attention block.
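The sinusoidal positional encoding can be sketched as follows. The word embeddings here are random stand-ins for what would normally come from a learned embedding table.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# "I visited Spain" -> 3 tokens of dimension 512.
embeddings = np.random.randn(3, 512)     # stand-in word embeddings
x = embeddings + positional_encoding(3)  # input to the attention block
print(x.shape)  # (3, 512)
```

Because the encoding depends only on position and dimension, the same matrix can be precomputed once and added to any 3-token input.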

Attention
To transform words into the representations that attention operates on, three learned weight matrices are used: W_query, W_key, and W_value. Each word's positionally encoded embedding is multiplied by each of these matrices, producing three distinct vectors per word: a Query, a Key, and a Value. These vectors drive the attention computation, as they determine how strongly each word relates to every other word in the input.

\(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\)
In this example, we compute the attention contextual embeddings as follows. We start with three query vectors, each of size 512, stacked into a 3-by-512 matrix Q. Taking the dot product of Q with the transposed key matrix K^T, of size 512 by 3, produces a 3-by-3 score matrix. To improve the stability of the gradients during training, each element of this matrix is scaled by dividing it by the square root of the key dimension, here √512. (In the original paper, with 8 heads, each head works with d_k = 64, so the divisor there is √64 = 8.) The softmax function is then applied row-wise, transforming each row of the 3-by-3 matrix into a set of weights that sum to 1. Finally, these weights are multiplied by the 3-by-512 value matrix V. This produces the contextual embeddings: a final 3-by-512 output.
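A minimal NumPy sketch of this computation, with Q, K, and V each stored as 3-by-512 matrices (the keys are transposed inside the function), using random vectors as stand-ins for real projected embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (3, 3) score matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (3, 512) contextual embeddings

Q = np.random.randn(3, 512)
K = np.random.randn(3, 512)
V = np.random.randn(3, 512)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 512)
```

Each output row is a weighted average of the value vectors, so every word's new embedding mixes in information from the other words.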

Multi-Head Attention
Multi-head attention replicates the attention mechanism several times, with each head independently attending to the input; in practice each head operates on a lower-dimensional projection of size d_model / h, for example 64 dimensions when there are 8 heads. The outputs from the heads are concatenated into a single matrix, which is then multiplied by a learned output weight matrix W_O, bringing the result back to a size of 3 by 512. This allows the model to capture different aspects of the data through different attention heads, enhancing its ability to represent complex relationships within the input.
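A sketch of multi-head attention with 8 heads of dimension 64, using random matrices as stand-ins for the learned projection weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 512, 8, 3
d_k = d_model // n_heads  # 64 dimensions per head

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

x = rng.standard_normal((seq_len, d_model))

# One (W_q, W_k, W_v) triple per head; random stand-ins for learned weights.
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))  # each head: (3, 64)

concat = np.concatenate(heads, axis=-1)        # (3, 512)
W_o = rng.standard_normal((d_model, d_model))  # learned output projection W_O
out = concat @ W_o                             # back to (3, 512)
print(out.shape)  # (3, 512)
```

Projecting each head down to 64 dimensions keeps the total computation comparable to a single 512-dimensional attention while letting the heads specialize.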

Add and Layer Normalization
After the multi-head attention step, the original input is added back to the attention output through a residual connection, which helps preserve important information from the input data. Layer normalization is then applied to the combined result, standardizing the resulting embeddings. This normalization stabilizes and speeds up training by reducing internal covariate shift, ultimately leading to better performance in capturing relationships within the input data.

Feed Forward
The process takes the normalized embeddings, of shape 3 by 512, and feeds them into a position-wise feed-forward network consisting of two linear transformations with a ReLU activation in between.
In the first hidden layer, which consists of 2048 neurons, the weights are organized in a matrix of size 512 by 2048. The input matrix, which is of size 3 by 512, is multiplied by these weights to generate an intermediate output of size 3 by 2048. This product is then passed through a Rectified Linear Unit (ReLU) activation function, introducing non-linearity to the model.
Following this, the output enters a second hidden layer composed of 512 neurons, where a linear activation function is applied. This final transformation reduces the dimensions back to a matrix of size 3 by 512, completing the feedforward process.
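The two-layer feed-forward step can be sketched as follows, with random matrices standing in for the learned weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # First layer: (3, 512) @ (512, 2048) -> (3, 2048), then ReLU.
    hidden = np.maximum(0, x @ W1 + b1)
    # Second layer projects back down: (3, 2048) @ (2048, 512) -> (3, 512).
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 512))
W1, b1 = rng.standard_normal((512, 2048)), np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 512)), np.zeros(512)
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (3, 512)
```

The same weights are applied to every token position independently, which is why this layer is called position-wise.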

Add and Layer Normalization
In the final step of the encoder block, a residual connection feeds the input of the feed-forward layer back into its output. This step is essential as it improves the flow of information, allowing the model to retain important features of the input. Layer normalization is then applied to stabilize and improve training. This entire sequence constitutes one pass through an encoder layer; the encoder stacks six such layers, each with its own weights, gradually refining the embeddings at each stage. The embeddings produced by the sixth layer are sent to the decoder for further processing.

This output from the encoder is sent to the decoder block, indicated by the red arrow in the architecture diagram below.

The next article, on the decoder part of the Transformer, covers the decoder in detail.
