Santiago Víquez

Notes on "AI Engineering" [WIP]

Join our weekly reading group on Discord, where we are reading Chip Huyen's book AI Engineering.

Preface

  • ChatGPT:
    • Opened up lots of new possibilities.
    • Lowered the entry barrier for many people, especially developers, who can now use AI in their applications.
  • Many principles for building AI remain the same.
    • However, new scale and capabilities introduce new challenges that require new solutions.
  • Before building an AI application:
    • Is this application necessary?
    • Is AI needed?
    • Do I have to build it myself?
  • Evaluation is one of the hardest parts of AI engineering.

Chapter 1: Introduction to Building AI Applications with Foundation Models

  • Post-2020 models are BIG.
    • We risk running out of publicly available internet data to train them.
  • Scale of AI models:
    • Allows for more tasks → more applications
    • Training is complex and specialized → models as a service (e.g., ChatGPT API)
  • AI engineering: the process of building applications on top of available models.

From Language Models to Large Language Models

  • Large Language Models (LLMs) evolved from Language Models, which have been studied since the 1950s.
  • A language model encodes statistical information about one or more languages.
    • Determines how likely a word is to appear in a given context.
    • Example: "My favorite color is ___"
      • "Blue" should have a higher probability than "car."
  • Why do language models use tokens instead of words or characters?
    • Break words into meaningful components (e.g., "cook" and "ing").
    • Fewer unique tokens than unique words → reduced model vocabulary → more efficiency.
    • Helps the model process unknown words.
  • Masked Language Model:
    • "Fill-the-blank" model
      • Example: "My favorite ____ is blue."
    • Example: BERT
    • Common use cases:
      • Sentiment analysis and text classification
  • Autoregressive Language Model (see the code sketch after this list):
    • Predicts the next token using only preceding tokens.
      • Example: "My favorite color is ____"
    • Common use cases:
      • Text generation
  • Many tasks can be framed as completion tasks.
  • Language models can be trained using self-supervision, whereas many other ML models require supervision.
  • Why do larger models need more data?
    • Larger models have more capacity to learn and therefore require more training data to maximize performance.
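
A quick way to see the two model types side by side, assuming the Hugging Face transformers library and the bert-base-uncased and gpt2 checkpoints (a sketch, not code from the book):

    # pip install transformers torch
    from transformers import pipeline

    # Masked LM ("fill in the blank"): BERT predicts the masked token
    # using context from both sides.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill_mask("My favorite [MASK] is blue.")[:3]:
        print(pred["token_str"], round(pred["score"], 3))

    # Autoregressive LM: GPT-2 predicts the next tokens using only
    # the preceding tokens.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("My favorite color is", max_new_tokens=5)[0]["generated_text"])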

From Large Language Models to Foundation Models

  • AI needs to process data beyond text to operate in the real world.
  • Supporting new data modalities makes models more powerful.
  • Self-supervision works for multimodal models too.
    • OpenAI used a variant of self-supervision (natural language supervision) to train CLIP, their language-image model.
  • Foundation models enabled the transition from task-specific models to general-purpose models.

From Foundation Models to AI Engineering

  • AI engineering → the process of building applications on top of foundation models.

Foundation Model Use Cases

  • A 2024 O’Reilly survey categorized use cases into eight categories: programming, data analysis, customer support, marketing copy, other copy, research, web design, and art.

Planning AI Applications

  • If you just want to learn and have fun, jump right in—building is one of the best ways to learn.
  • It’s easy to build a cool demo with foundation models. It’s hard to create a profitable product.

Use Case Evaluation

  • Common reasons businesses feel compelled to evaluate AI use cases:
    • If you don’t use AI, competitors who do can make you obsolete.
    • You’ll miss opportunities to boost profits and productivity.
    • You may not know where AI fits into your business yet, but you don’t want to be left behind.

The Role of AI and Humans in the Application

  • Critical or complementary?
    • "The more critical AI is to the application, the more accurate and reliable it has to be."
  • Reactive or proactive?
    • A chatbot is reactive, whereas traffic alerts on Google Maps are proactive.
  • Dynamic or static?
    • Face ID needs updates as people’s faces change over time.
    • Object detection in Google Photos likely updates only when Google Photos is upgraded.

AI Product Defensibility

  • If your product is easy to build, you have no moat to defend it.
  • Three types of competitive advantages:
    • Technology
    • Data
    • Distribution

Milestone Planning

  • A good initial demo doesn’t guarantee a good end product.

The AI Engineering Stack

  • Every AI application stack has three layers:
    1. Application development
    2. Model development
    3. Infrastructure
  • Many principles of AI application development remain the same as in ML engineering.
  • AI Engineering vs. ML Engineering:
    • AI engineering focuses less on modeling and training, more on model adaptation.
    • AI engineers work with bigger models and need more GPUs.
    • AI engineering deals with models whose outputs are open-ended.
    • ML Engineering: Data → Model → Product
    • AI Engineering: Product → Data → Model
  • Inference optimization: making models faster and cheaper.

Chapter 2: Understanding Foundation Models

  • Differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences.
  • The impact of sampling is often overlooked.
    • Sampling: How a model chooses an output from all possible options.

Training Data

  • Common source → Common Crawl
  • OpenAI used only Reddit links with at least three upvotes to train GPT-2.
  • Tokenization has different costs in different languages: the median length of a message in tokens is 7 in English, 32 in Hindi, and 72 in Burmese.
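
A small sketch of both tokenization ideas (subword splitting and per-language token cost), assuming the tiktoken library; the Hindi sentence is my translation of the English one:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

    # Inspect how a word is split into subword tokens.
    print([enc.decode([t]) for t in enc.encode("cooking")])

    # The same sentence costs a different number of tokens in each language.
    english = "My favorite color is blue."
    hindi = "मेरा पसंदीदा रंग नीला है।"
    print(len(enc.encode(english)), len(enc.encode(hindi)))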

Domain-Specific Models

  • General-purpose foundation models are unlikely to perform well on domain-specific tasks.

Modeling

Transformer Architecture

Seq2Seq

  • Before transformers, we had Seq2Seq, created by Google and popularized when Google updated Google Translate to use it.
    • Uses an encoder-decoder paradigm.
    • Uses an RNN as both encoder and decoder.
    • The encoder processes the input tokens sequentially and outputs the final hidden state. The decoder generates output tokens sequentially using that final hidden state and the previously generated tokens. Relying on a single final state is an information bottleneck for long inputs, which the attention mechanism later addressed.

Attention Mechanism

  • Allows the model to weigh the importance of different input tokens when generating each output token.
  • Inference for transformer-based models has two steps:
    • Prefill:
      • The model processes the input tokens in parallel, creating the intermediate state necessary to generate the first output token (including the key and value vectors for all input tokens).
    • Decode:
      • The model generates one output token at a time.
  • The attention mechanism computes how much attention to give each input token by taking the dot product between the current query vector and that token's key vector (sketched in code below).
  • Multiple heads allow the model to attend to different groups of previous tokens simultaneously.
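
A minimal NumPy sketch of single-head scaled dot-product attention (my own illustration; real transformers add learned per-head projections, a causal mask during decoding, and a KV cache so the prefill state is reused at decode time):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Q, K, V: (seq_len, d) arrays of query/key/value vectors."""
        d = Q.shape[-1]
        # Dot products between queries and keys decide how much attention
        # each position pays to each token; scaling by sqrt(d) keeps the
        # softmax well-behaved.
        scores = Q @ K.T / np.sqrt(d)
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ V                  # weighted sum of value vectors

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (4, 8)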

Other Model Architectures

  • RWKV (pronounced "RwaKuv") is gaining traction.
    • An RNN-based model that can be parallelized.
  • SSMs (State Space Models)
    • Promising for long-range memory.
    • Some examples: S4, H3, Mamba, Jamba.

Model Size

  • MoE (Mixture of Experts) is a model divided into different groups of parameters, where each group is an expert. Only a subset of the experts is active to process each token → Sparse model (see the sketch after this list).
  • If a dataset contains 1 trillion tokens and a model is trained on that dataset for two epochs, then the number of training tokens is 2 trillion.
  • Chinchilla paper: They found that for compute-optimal training, the number of training tokens should be approximately 20 times the model size. This means that a 3B-parameter model needs approximately 60B training tokens.
  • Scaling extrapolation (also called hyperparameter transferring): A research subfield that tries to predict, for large models, what hyperparameters will give the best performance.
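
A toy NumPy sketch of the sparse routing idea behind MoE (an illustration with made-up shapes, not any particular model's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2

    # Each "expert" is just a linear map here; the router is another one.
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
    router = rng.normal(size=(d, n_experts))

    def moe_layer(x):
        """x: (d,) hidden state of one token."""
        logits = x @ router                # one score per expert
        top = np.argsort(logits)[-top_k:]  # indices of the top-k experts
        w = np.exp(logits[top])
        w /= w.sum()                       # softmax over the selected experts
        # Only top_k of the n_experts run for this token -> sparse activation.
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

    print(moe_layer(rng.normal(size=d)).shape)  # (16,)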

Scaling Bottlenecks

  • There are already two visible bottlenecks for scaling: training data and electricity.
  • As of this writing, data centers are estimated to consume 1–2% of global electricity.

Sampling

  • Sampling: How a model constructs outputs.
  • Makes AI outputs probabilistic.
  • Some examples: temperature, top-k, and top-p.
    • Top-k:
      • Picks the top-k tokens and samples from them.
      • Reduces computational workload when computing softmax.
    • Top-p (nucleus sampling):
      • The number of candidate tokens considered is not fixed.
      • The model sums the probabilities of the most likely next tokens in descending order and stops when the sum reaches p.
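
A NumPy sketch of how temperature, top-k, and top-p could be combined to sample one next token from a vector of logits (an illustration, not a specific library's API):

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def sample(logits, temperature=1.0, top_k=None, top_p=None):
        # Temperature < 1 sharpens the distribution; > 1 flattens it.
        probs = softmax(np.asarray(logits, dtype=float) / temperature)
        order = np.argsort(probs)[::-1]  # token ids, most likely first
        if top_k is not None:
            order = order[:top_k]        # keep only the k most likely tokens
        if top_p is not None:
            # Keep the smallest prefix whose cumulative probability reaches p.
            cum = np.cumsum(probs[order])
            order = order[: int(np.searchsorted(cum, top_p)) + 1]
        kept = probs[order] / probs[order].sum()  # renormalize the survivors
        return int(rng.choice(order, p=kept))

    logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print(sample(logits, temperature=0.7, top_k=3))
    print(sample(logits, top_p=0.9))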

Test-Time Compute

  • After sampling multiple outputs, you pick the one with the highest average log probability (logprob).
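
A sketch of that selection step, assuming each candidate output comes back with per-token logprobs (the candidates here are made up):

    import numpy as np

    # Hypothetical sampled outputs: (text, per-token logprobs).
    candidates = [
        ("Answer A", [-0.2, -0.5, -0.1]),
        ("Answer B", [-0.1, -0.3, -0.2, -0.4]),
    ]

    # Averaging (rather than summing) avoids penalizing longer outputs.
    best = max(candidates, key=lambda c: np.mean(c[1]))
    print(best[0])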

Chapter 3: Evaluation Methodology

Work in progress...