Join our weekly reading group on Discord where we are reading Chip Huyen's book on AI Engineering.
Preface
- With ChatGPT:
- Lots of new possibilities
- Lowered the entry barrier for many people, especially developers—they can now use AI in their applications.
- Many principles for building AI remain the same.
- However, new scale and capabilities introduce new challenges that require new solutions.
- Before building an AI application:
- Is this application necessary?
- Is AI needed?
- Do I have to build it myself?
- Evaluation is one of the hardest parts of AI engineering.
Chapter 1: Introduction to Building AI Applications with Foundation Models
- Post-2020 models are BIG.
- We risk running out of publicly available internet data to train them.
- Scale of AI models:
- Allows for more tasks → more applications
- Training is a complex, specialized task → models offered as a service (e.g., the ChatGPT API)
- AI engineering: the process of building applications on top of available models.
From Language Models to Large Language Models
- Large Language Models (LLMs) evolved from Language Models, which have been studied since the 1950s.
- A language model encodes statistical information about one or more languages.
- Determines how likely a word is to appear in a given context.
- Example: "My favorite color is ___"
- "Blue" should have a higher probability than "car."
- Why do language models use tokens instead of words or characters?
- Break words into meaningful components (e.g., "cook" and "ing").
- Fewer unique tokens than unique words → reduced model vocabulary → more efficiency.
- Helps the model process unknown words (see the tokenizer and language-model sketch after this list).
- Masked Language Model:
- "Fill-the-blank" model
- Example: "My favorite ____ is blue."
- Example: BERT
- Common use cases:
- Sentiment analysis and text classification
- Autoregressive Language Model:
- Predicts the next token using only preceding tokens.
- Example: "My favorite color is ____"
- Common use cases:
- Many tasks can be framed as completion tasks.
- Language models can be trained using self-supervision, whereas many other ML models require supervision.
- Why do larger models need more data?
- Larger models have more capacity to learn and therefore require more training data to maximize performance.
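To make subword tokenization and the masked vs. autoregressive distinction concrete, here is a minimal sketch using the Hugging Face `transformers` library (my own illustration, not from the book; the model names `bert-base-uncased` and `gpt2` are just convenient stand-ins for a masked and an autoregressive model):

```python
# Minimal sketch (not from the book): subword tokenization, a masked LM, and an
# autoregressive LM using the Hugging Face `transformers` library.
from transformers import AutoTokenizer, pipeline

# Subword tokenization: words are split into meaningful pieces, so the
# vocabulary stays small and unseen words can still be processed.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("cooking"))          # e.g. ['cooking'] or ['cook', '##ing']
print(tokenizer.tokenize("uncopyrightable"))  # rare word -> several subword tokens

# Masked language model ("fill-in-the-blank"), e.g. BERT.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("My favorite [MASK] is blue."):
    print(pred["token_str"], round(pred["score"], 3))

# Autoregressive language model: predicts the next token from preceding tokens.
generate = pipeline("text-generation", model="gpt2")
print(generate("My favorite color is", max_new_tokens=5)[0]["generated_text"])
```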
From Large Language Models to Foundation Models
- AI needs to process data beyond text to operate in the real world.
- Supporting new data modalities makes models more powerful.
- Self-supervision works for multimodal models too.
- OpenAI used a variant of self-supervision (Natural Language Supervision) to train their language-image model.
- Foundation models enabled the transition from task-specific models to general-purpose models.
From Foundation Models to AI Engineering
- AI engineering → the process of building applications on top of foundation models.
Foundation Model Use Cases
- A 2024 O’Reilly survey categorized use cases into eight categories: programming, data analysis, customer support, marketing copy, other copy, research, web design, and art.
Planning AI Applications
- If you just want to learn and have fun, jump right in—building is one of the best ways to learn.
- It’s easy to build a cool demo with foundation models. It’s hard to create a profitable product.
- Reasons to evaluate how AI fits your use case:
- If you don’t, competitors using AI can make you obsolete.
- If you don’t, you’ll miss opportunities to boost profits and productivity.
- You might not yet know where AI fits into your business, but you don’t want to be left behind.
The Role of AI and Humans in the Application
- Critical or complementary?
- "The more critical AI is to the application, the more accurate and reliable it has to be."
- Reactive or proactive?
- A chatbot is reactive, whereas traffic alerts on Google Maps are proactive.
- Dynamic or static?
- Face ID needs updates as people’s faces change over time.
- Object detection in Google Photos likely updates only when Google Photos is upgraded.
AI Product Defensibility
- If your product is easy to build, you have no moat to defend it.
- Three types of competitive advantages:
- Technology
- Data
- Distribution
- A good initial demo doesn’t guarantee a good end product.
The AI Engineering Stack
- Every AI application stack has three layers:
- Application development
- Model development
- Infrastructure
- Many principles of AI application development remain the same as in ML engineering.
- AI Engineering vs. ML Engineering:
- AI engineering focuses less on modeling and training, more on model adaptation.
- AI engineers work with bigger models and need more GPUs.
- AI engineering deals with open-ended output models.
- ML Engineering: Data → Model → Product
- AI Engineering: Product → Data → Model
- Inference optimization: making models faster and cheaper.
Chapter 2: Understanding Foundation Models
- Differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences.
- The impact of sampling is often overlooked.
- Sampling: How a model chooses an output from all possible options.
- Common source → Common Crawl
- OpenAI used only Reddit links with at least three upvotes to train GPT-2.
- Tokenization has different requirements in different languages: for equivalent content, the median number of tokens is 7 in English, 32 in Hindi, and 72 in Burmese.
- General-purpose foundation models are unlikely to perform well on domain-specific tasks.
Seq2Seq
- Before transformers, we had Seq2Seq, created by Google and popularized when Google updated Google Translate to use it.
- Uses an encoder-decoder paradigm.
- Uses an RNN as both encoder and decoder.
- The encoder processes the input tokens sequentially and outputs the final state. The decoder generates output tokens sequentially using the final hidden state and the previously generated tokens.
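A minimal PyTorch sketch of the encoder-decoder idea (my own illustration, not the original Google implementation; the vocabulary size, dimensions, and start token are made up):

```python
# Minimal PyTorch sketch of the seq2seq encoder-decoder idea (illustrative only).
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128  # made-up sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, final_state = self.rnn(self.embed(src))
        return final_state                   # (1, batch, HID): summary of the input

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, prev_token, state):    # one step: previous token + hidden state
        output, state = self.rnn(self.embed(prev_token), state)
        return self.out(output), state       # logits over the next token, new state

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, VOCAB, (1, 7))        # a fake 7-token input sentence
state = encoder(src)                          # encoder's final state seeds the decoder
token = torch.zeros(1, 1, dtype=torch.long)   # start-of-sequence token (id 0, made up)
for _ in range(5):                            # decode 5 tokens greedily
    logits, state = decoder(token, state)
    token = logits.argmax(dim=-1)
    print(token.item())
```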
Attention Mechanism
- Allows the model to weigh the importance of different input tokens when generating each output token.
- Inference for transformer-based models:
- Prefill:
- The model processes the input tokens in parallel. This creates the intermediate state necessary to generate the first output token. (Includes the key and value vectors for all input tokens.)
- Decode:
- The model generates one output token at a time.
- The attention mechanism computes how much attention to give an input token by computing the dot product between the query and key vectors.
- Multiple heads allow the model to attend to different groups of previous tokens simultaneously.
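A minimal NumPy sketch of scaled dot-product attention with multiple heads (my own illustration; the dimensions are made up, and a real transformer uses learned projections rather than the random matrices used here for shape-checking):

```python
# Minimal NumPy sketch of (multi-head) scaled dot-product attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention weight for each (query, key) pair = dot product of the query and key
    # vectors, scaled by sqrt(d_k) and normalized with softmax.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    return softmax(scores) @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))               # token representations
# Q, K, V would come from learned projections of x; random ones are used here.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Split into heads so each head can attend to different patterns in parallel.
def split_heads(t):
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

out = attention(split_heads(Q), split_heads(K), split_heads(V))    # (heads, seq, d_head)
out = out.transpose(1, 0, 2).reshape(seq_len, d_model)             # concatenate heads
print(out.shape)  # (6, 16)

# During prefill, all input tokens' keys/values are computed in parallel and cached;
# during decode, each new token's query attends over the cached keys and values.
```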
Other Model Architectures
- RWKV (RwaKuv) is gaining traction.
- An RNN-based model that can be parallelized.
- SSMs (State Space Models)
- Promising for long-range memory.
- Some examples: S4, H3, Mamba, Jamba.
- MoE (Mixture of Experts) is a model divided into different groups of parameters, where each group is an expert. Only a subset of the experts is active to process each token → Sparse model.
- If a dataset contains 1 trillion tokens and a model is trained on that dataset for two epochs, then the number of training tokens is 2 trillion.
- Chinchilla paper: They found that for compute-optimal training, the number of training tokens should be approximately 20 times the model size. This means that a 3B-parameter model needs approximately 60B training tokens.
- Scaling extrapolation (also called hyperparameter transferring): A research subfield that tries to predict, for large models, what hyperparameters will give the best performance.
- There are already two visible bottlenecks for scaling: training data and electricity.
- As of this writing, data centers are estimated to consume 1–2% of global electricity.
- Sampling: How a model constructs outputs.
- Makes AI outputs probabilistic.
- Some examples: temperature, top-k, and top-p.
- Top-k:
- Picks the top-k tokens and samples from them.
- Reduces computational workload when computing softmax.
- Top-p:
- The total number of values to consider is not fixed.
- The model sums the probabilities of the most likely next values in descending order and stops when the sum reaches p.
- After sampling multiple outputs, you pick the one with the highest average log probability (logprob).
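A minimal NumPy sketch (my own illustration, not from the book) of how temperature, top-k, and top-p shape which token gets sampled next from a set of made-up logits:

```python
# Minimal NumPy sketch of temperature, top-k, and top-p sampling over next-token logits.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])  # made-up scores for 5 candidate tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    return rng.choice(len(logits), p=softmax(logits / temperature))

def sample_top_k(logits, k=2):
    # Keep only the k highest-scoring tokens, then sample among them.
    top = np.argsort(logits)[-k:]
    return rng.choice(top, p=softmax(logits[top]))

def sample_top_p(logits, p=0.9):
    # Add tokens in descending probability until the cumulative mass reaches p.
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

print(sample_temperature(logits, temperature=0.7))
print(sample_top_k(logits, k=2))
print(sample_top_p(logits, p=0.9))
```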
Chapter 3: Evaluation Methodology
Work in progress...