I've been wanting to read Sebastian Raschka's new book, "Build a Large Language Model (from scratch)", for a while. Last week, I finally got it and felt the urge to read it with others. So, I tweeted about organizing a study group to go through the book, hoping to get at least four like-minded people interested.
Turns out, over 600 people wanted to join something like that. I quickly put together a Discord server, and people have been joining all week. We're now at almost 700 learners. I named the community AI from Scratch—if this goes well, it would be awesome to continue studying ML/AI methods from scratch together.
I'll add my notes here as I read the book.
Some years ago, asking a computer to write an email from a list of keywords sounded crazy. This is now a trivial task for an LLM. Earlier NLP models were designed for specific tasks. LLMs perform well across a large variety of tasks.
Large amounts of training data have allowed LLMs to outperform previous approaches. These days, LLMs have around 10-100+ billion parameters (tunable weights). Transformers are the breakthrough that made this possible. They allow models "to pay selective attention to different parts of the inputs when making predictions."
BERT: Focuses on the encoder part of the transformer architecture. It can "see" text in both directions, hence its name, Bidirectional Encoder Representations from Transformers. It was trained using masked word prediction.
GPT: Focuses on the decoder part, performing next-word prediction, which works well for generative tasks. It’s impressive that it can handle translation tasks, given that it was only trained to predict the next word.
GPT-style models are considered autoregressive models, as they incorporate previous outputs as inputs for future predictions.
Preparing input text involves splitting it into individual words and subword tokens, which can then be encoded into vector representations.
Neural networks (NNs) can't process raw text directly. Text data is categorical, which doesn’t align with the mathematical representation that NNs require. To overcome this, we represent words as continuous-valued vectors, known as embeddings.
These embeddings can be created using a neural network layer within the model or by leveraging a pretrained neural network model. There are also embeddings for sentences, paragraphs, and even entire documents—not just for words. In fact, sentence or paragraph embeddings are often used in retrieval-augmented generation (RAG).
RAG combines a retrieval component, which fetches relevant passages from an external knowledge source, with a generative LLM that uses the retrieved text as additional context when producing its answer.
One of the first word embedding models was Word2Vec. It’s a neural network trained to predict either the context of a word given the target word or vice versa. The intuition is that words appearing in similar contexts tend to have similar meanings.
Today, large language models (LLMs) generate their own embeddings as part of the input layer. These embeddings can be updated during training, allowing them to be fine-tuned for specific tasks.
After tokenization, we need to convert tokens into unique IDs for further processing as embedding vectors. This requires creating a vocabulary: a dictionary that maps every unique token in the training text to a unique integer ID.
This process enables a mapping from token to ID. Later, we may also need to convert IDs back to tokens, so we should build an inverse dictionary as well.
The SimpleTokenizerV1 class performs tokenization and converts tokens to and from IDs using its encode and decode methods.
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> ID mapping
        self.int_to_str = {i: s for s, i in vocab.items()}   # inverse ID -> token mapping

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the delimiters
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserts before punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
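As a quick sketch of how this class might be used (the sample text here is made up, and the vocabulary-building steps follow the description above):

import re

raw_text = "Hello, world. Is this-- a test?"  # illustrative sample text

# Tokenize the raw text with the same regex the tokenizer uses internally
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Build the vocabulary: every unique token gets a unique integer ID
all_tokens = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_tokens)}

tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode("Is this a test?")
print(ids)                    # [5, 8, 6, 7, 3] with this particular vocabulary
print(tokenizer.decode(ids))  # "Is this a test?"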
However, this tokenizer has limitations—it can’t handle unknown words (i.e., words not present in the training set).
To handle unknown words, we can modify our tokenizer to include special tokens such as <|unk|> for unknown tokens and <|endoftext|> to mark the end of a text source.
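A minimal sketch of that change, assuming the SimpleTokenizerV1 class above and a vocabulary that already contains the special tokens (the subclass name is just illustrative):

class SimpleTokenizerV2(SimpleTokenizerV1):
    """Variant that maps out-of-vocabulary tokens to <|unk|>."""

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace any token missing from the vocabulary with the <|unk|> token
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>"
            for item in preprocessed
        ]
        return [self.str_to_int[s] for s in preprocessed]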
The tokenizer used by GPT models doesn't rely on an <|unk|> token for out-of-vocabulary words. Instead, it employs Byte Pair Encoding (BPE) to break words into subword units. BPE enables encoding of unknown words without the need for an <|unk|> token. It breaks down text into frequently occurring subword units, allowing words not in the original vocabulary to be represented.
Here’s an example of how BPE works:
Let’s take the sentence: "low low low lower"
Initialization: Each unique character is treated as a token. Initially, the sentence appears as:
["l", "o", "w", " ", "l", "o", "w", " ", "l", "o", "w", " ", "l", "o", "w", "e", "r"]
Count Pairs: BPE counts the frequency of adjacent pairs. "lo" and "ow" are frequent pairs. Suppose "lo" is the most frequent.
Merge Pairs: The most frequent pair, "lo", is merged into a single token, updating the sentence to:
["lo", "w", " ", "lo", "w", " ", "lo", "w", " ", "lo", "w", "e", "r"]
Repeat the Process: BPE counts pairs again and merges the next most frequent, such as "low", resulting in:
["low", " ", "low", " ", "low", " ", "low", "e", "r"]
Stop Condition: This process continues until a desired vocabulary size or stopping condition is met. Here, BPE stops with the tokens [" ", "low", "e", "r"].
Using BPE, models efficiently encode subword information, handling words absent from the original training data by breaking them into recognizable subwords.
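In practice, an existing BPE implementation is usually used rather than re-deriving the merges by hand; for example, the tiktoken library ships OpenAI's GPT-2 BPE tokenizer. A small usage sketch (the sample string is arbitrary):

import tiktoken  # pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

# Unknown words such as "someunknownPlace" are split into known subword units,
# so decoding recovers the original string without any <|unk|> token.
print(tokenizer.decode(ids))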
LLMs are pre-trained by predicting the next word based on the previous words. To facilitate this, we create input-target pairs in which the target is simply the input shifted one position to the right, for example:
Input:  ['LLMs', 'learn', 'to']
Target: ['learn', 'to', 'predict']
The process slides a fixed-size window over the tokenized text: each window of token IDs becomes an input sequence, and the same window shifted one position to the right becomes its target, as sketched below.
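A minimal, framework-free sketch of that sliding-window idea (the function name, window size, and token IDs are made up; in practice this usually lives inside a PyTorch Dataset/DataLoader):

def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over token_ids and pair each chunk with its shifted-by-one target."""
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        input_chunk = token_ids[start:start + context_length]
        target_chunk = token_ids[start + 1:start + context_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = [290, 4920, 2241, 287, 257, 4489, 64, 319, 262]  # illustrative token IDs
for x, y in make_input_target_pairs(token_ids, context_length=4, stride=4):
    print(x, "->", y)
# [290, 4920, 2241, 287] -> [4920, 2241, 287, 257]
# [257, 4489, 64, 319]   -> [4489, 64, 319, 262]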
Embedding vectors are typically initialized with random values, which are optimized during training.
Word embeddings, as described, don't capture the position of words in a sentence. For example, the token with ID 7 maps to the same embedding vector regardless of where it appears. This consistency is good for reproducibility, but since self-attention is also position-agnostic, we need a way to encode positional information in the model.
Two options are commonly used:
Absolute Positional Embeddings: a unique embedding is assigned to each position within the input sequence and added to the token embedding at that position. For example, a token embedding such as [1, 1, 1] would be combined with a different positional vector, say [1.1, 1.2, 1.3] or [2.1, 2.2, 2.3], depending on where the token appears.
Relative Positional Embeddings: Focuses on the relative distances between tokens rather than their exact positions. This approach helps generalize better to sequences of varying lengths.
OpenAI’s GPT models use absolute positional embeddings, which are optimized during training rather than fixed or predefined.
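A small sketch of how token and absolute positional embeddings can be combined in PyTorch (the sizes and token IDs below are made-up values, not a real configuration):

import torch

vocab_size, context_length, emb_dim = 50257, 4, 8  # illustrative sizes

token_emb = torch.nn.Embedding(vocab_size, emb_dim)     # one vector per token ID
pos_emb = torch.nn.Embedding(context_length, emb_dim)   # one vector per position

token_ids = torch.tensor([[40, 367, 2885, 1464]])       # (batch=1, seq_len=4)
positions = torch.arange(token_ids.shape[1])            # [0, 1, 2, 3]

# The model input is the sum of token and (absolute) positional embeddings
input_embeddings = token_emb(token_ids) + pos_emb(positions)
print(input_embeddings.shape)  # torch.Size([1, 4, 8])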
This chapter builds the multi-head attention mechanism from the ground up, starting with a simplified version and working up to multi-head attention.
Before LLM architectures (without self-attention), recurrent encoder-decoder networks were the standard choice for sequence-to-sequence tasks such as machine translation.
In the encoder-decoder approach, the encoder reads the input sequence and compresses it into a hidden state, and the decoder generates the output sequence from that hidden state.
Limitations of encoder-decoder RNNs: the decoder only sees the encoder's final hidden state, so the entire input has to be squeezed into a single fixed-size representation. This loses information on long sequences and was a key motivation for attention mechanisms.
The "self" in self-attention relates to the mechanism's ability to compute attention weights relating different positions within a single input sequence.
Goal: Compute a context vector for each input element, i.e., a weighted combination of all input vectors in the sequence.
Ingredients: the input token embeddings x, attention scores ω, normalized attention weights α, and the resulting context vectors z.
Process: for a given query token, compute attention scores ω as the dot products between its embedding and every other input embedding; normalize the scores with softmax to obtain attention weights α that sum to 1; then compute the context vector z as the attention-weighted sum of all input vectors.
Flow: (input token embeddings) → ω (attention scores) → α (normalized attention weights) → Z (context vector)
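A compact sketch of this simplified attention, with no trainable weights yet (the toy embeddings are arbitrary):

import torch

# Toy input: 4 tokens, each with a 3-dimensional embedding (values are arbitrary)
inputs = torch.tensor([
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
])

# Attention scores ω: dot product of every token with every other token
attn_scores = inputs @ inputs.T                    # shape (4, 4)

# Attention weights α: normalize each row so it sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)  # shape (4, 4)

# Context vectors z: weighted sum of all input vectors for each position
context_vecs = attn_weights @ inputs               # shape (4, 3)
print(context_vecs)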
Also called scaled dot-product attention, this version adds trainable matrices that are updated during model training to create better context vectors.
Three trainable weight matrices: the query matrix W_q, the key matrix W_k, and the value matrix W_v.
Process: multiply the inputs by W_q, W_k, and W_v to obtain queries, keys, and values; compute attention scores as the dot products between queries and keys; scale the scores by the square root of the key dimension and apply softmax to get the attention weights; finally, compute each context vector as the attention-weighted sum of the value vectors.
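A sketch of a self-attention module with trainable query, key, and value projections (the dimensions and the choice of nn.Linear without bias are illustrative):

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                      # x: (num_tokens, d_in)
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T         # (num_tokens, num_tokens)
        # Scale by sqrt(d_k) before softmax, hence "scaled dot-product" attention
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values           # context vectors: (num_tokens, d_out)

torch.manual_seed(123)
sa = SelfAttention(d_in=3, d_out=2)
x = torch.rand(4, 3)                           # 4 toy token embeddings
print(sa(x).shape)                             # torch.Size([4, 2])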
Causal attention restricts a model to only consider previous and current inputs in a sequence when processing any given token while computing attention scores.
Standard self-attention allows access to the entire sequence at once. Causal attention is also called masked attention.
In causal attention, we mask out the attention above the diagonal and normalize the non-masked attention weights such that the attention weights sum up to 1 in each row.
When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction, the remaining elements of the matrix are scaled up by a factor of 1/0.5 = 2. This is done to maintain the balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during training and inference.
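A short sketch of both ideas, the causal mask and attention dropout, on a toy score matrix (sizes and the 50% rate are only for illustration):

import torch

torch.manual_seed(123)
num_tokens = 4
attn_scores = torch.rand(num_tokens, num_tokens)   # toy attention scores

# Causal mask: positions above the diagonal (future tokens) are set to -inf,
# so softmax assigns them zero weight and each row still sums to 1.
mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
masked_scores = attn_scores.masked_fill(mask, float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights)

# Dropout with rate 0.5: half the weights are zeroed at random and the
# survivors are scaled by 1 / 0.5 = 2 (only active in training mode).
dropout = torch.nn.Dropout(0.5)
print(dropout(attn_weights))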
Multi-head attention divides the attention mechanism into multiple "heads," each operating independently. The idea of multi-head attention is to run the attention mechanism multiple times in parallel with different learned linear projections.
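A sketch of a causal multi-head attention module along those lines, following the common approach of projecting once and splitting the result into per-head slices (all hyperparameters are illustrative):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.0):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):                      # x: (batch, num_tokens, d_in)
        b, num_tokens, _ = x.shape
        # Project, then split the last dimension into separate heads
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        attn_scores = q @ k.transpose(2, 3)    # (batch, heads, tokens, tokens)
        attn_scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Concatenate the heads back together and apply an output projection
        context = (attn_weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)          # (batch, num_tokens, d_out)

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=8, d_out=8, context_length=16, num_heads=2)
print(mha(torch.rand(1, 4, 8)).shape)          # torch.Size([1, 4, 8])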
Sebastian takes a step-by-step approach to teaching how to code the general LLM architecture. We start by coding a GPT backbone with some placeholder methods and gradually improve on it until we have the general implementation.
Training deep neural networks with many layers can be challenging due to problems like vanishing or exploding gradients. These issues lead to unstable training dynamics, making it difficult for the network to effectively adjust its weights. As a result, the learning process struggles to minimize the loss function, limiting the network's ability to learn meaningful patterns in the data, which affects its accuracy in predictions or decisions.
Layer normalization addresses this by improving the stability and efficiency of neural network training.
The idea is to adjust the outputs of a neural network layer to have a mean of 0 and a variance of 1, which speeds up convergence.
This is typically done before the multi-head attention module and before the feed-forward module within each transformer block, as well as before the final output layer.
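A from-scratch sketch of layer normalization (the epsilon and the learnable scale/shift parameters follow the usual convention; the dimensions are illustrative):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps                                   # avoids division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)  # mean 0, variance 1
        return self.scale * norm_x + self.shift

torch.manual_seed(123)
ln = LayerNorm(emb_dim=5)
out = ln(torch.randn(2, 5))
print(out.mean(dim=-1), out.var(dim=-1, unbiased=False))  # ~0 and ~1 per row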
GELU: Gaussian Error Linear Unit
This activation function offers improved performance for deep learning models. In GPT-2, an approximation of GELU was used, derived through curve fitting.
GELU(x) ≈ 0.5 · x · (1 + tanh[ √(2/π) · (x + 0.044715 · x³) ])
GELU provides small, non-zero outputs for negative values. This allows neurons with negative inputs to still contribute to learning, albeit to a lesser extent than positive inputs.
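A sketch of that tanh-based approximation as a PyTorch module:

import math
import torch
import torch.nn as nn

class GELU(nn.Module):
    """Tanh-based approximation of the Gaussian Error Linear Unit."""

    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        ))

x = torch.linspace(-3.0, 3.0, 7)
# Small but non-zero outputs for negative x, roughly x for large positive x
print(GELU()(x))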
Although the input and output dimensions of this module are the same, the module internally expands the embedding dimension into a higher-dimensional space via the first linear layer. This expansion is followed by a non-linear GELU activation and then a contraction back to the original dimension through a second linear transformation. This design facilitates exploration of a richer representation space.
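A compact sketch of such a feed-forward module (the 4x expansion factor mirrors the GPT-2 convention; the embedding dimension is made up, and PyTorch's built-in GELU is used here for brevity):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),   # expand to a higher-dimensional space
            nn.GELU(),                         # non-linearity
            nn.Linear(4 * emb_dim, emb_dim),   # contract back to the original dimension
        )

    def forward(self, x):
        return self.layers(x)

ff = FeedForward(emb_dim=8)
print(ff(torch.rand(2, 4, 8)).shape)   # input and output shapes match: torch.Size([2, 4, 8])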
Shortcut connections, also known as skip or residual connections, were originally proposed for deep networks in computer vision (e.g., residual networks) to address the vanishing gradient problem.
Gradients, which guide weight updates during training, become progressively smaller as they propagate backward through the layers. This makes it difficult to effectively train earlier layers.
A shortcut connection creates an alternative, shorter path for the gradient to flow through the network, bypassing one or more layers.
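A minimal illustration of the pattern (the tiny one-layer block is made up purely to show the shortcut):

import torch
import torch.nn as nn

class BlockWithShortcut(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):
        # The input is added back onto the layer's output, giving gradients a
        # direct path around the layer during backpropagation.
        return x + self.layer(x)

x = torch.rand(2, 8)
print(BlockWithShortcut(dim=8)(x).shape)   # torch.Size([2, 8])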
A transformer block combines the causal multi-head attention module (discussed in the previous chapter) with linear layers and the feed-forward neural network implemented earlier. Additionally, it employs dropout and shortcut connections.
The preservation of shape throughout the transformer block architecture is not incidental—it is a crucial design aspect. This allows the block to be applied effectively to sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship.
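A compact sketch of such a block, using PyTorch's built-in nn.MultiheadAttention and nn.LayerNorm instead of the from-scratch modules, just to make the structure (pre-norm, shortcut connections, shape preservation) explicit; all hyperparameters are illustrative:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, num_heads, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, num_tokens, emb_dim)
        num_tokens = x.shape[1]
        # Causal mask: True above the diagonal means "do not attend"
        causal_mask = torch.triu(
            torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        # Attention sub-block with pre-LayerNorm and a shortcut connection
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=causal_mask)
        x = x + self.drop(attn_out)
        # Feed-forward sub-block, again with pre-LayerNorm and a shortcut
        x = x + self.drop(self.ff(self.norm2(x)))
        return x                                 # same shape as the input

block = TransformerBlock(emb_dim=8, num_heads=2)
print(block(torch.rand(1, 4, 8)).shape)          # torch.Size([1, 4, 8])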
Work in progress...