Santiago Víquez

Notes on "AI Engineering" [WIP]

Join our weekly reading group on Discord, where we are reading Chip Huyen's book AI Engineering.

Preface

  • ChatGPT:
    • Opened up lots of new possibilities.
    • Lowered the entry barrier for many people, especially developers, who can now use AI in their applications.
  • Many principles for building AI remain the same.
    • However, new scale and capabilities introduce new challenges that require new solutions.
  • Before building an AI application:
    • Is this application necessary?
    • Is AI needed?
    • Do I have to build it myself?
  • Evaluation is one of the hardest parts of AI engineering.

Chapter 1: Introduction to Building AI Applications with Foundation Models

  • Post-2020 models are BIG.
    • We risk running out of publicly available internet data to train them.
  • Scale of AI models:
    • Allows for more tasks → more applications
    • Training is complex and specialized → models as a service (e.g., ChatGPT API)
  • AI engineering: the process of building applications on top of available models.

From Language Models to Large Language Models

  • Large Language Models (LLMs) evolved from Language Models, which have been studied since the 1950s.
  • A language model encodes statistical information about one or more languages.
    • Determines how likely a word is to appear in a given context.
    • Example: "My favorite color is ___"
      • "Blue" should have a higher probability than "car."
  • Why do language models use tokens instead of words or characters?
    • Break words into meaningful components (e.g., "cook" and "ing").
    • Fewer unique tokens than unique words → reduced model vocabulary → more efficiency.
    • Helps the model process unknown words.
  • Masked Language Model:
    • "Fill-the-blank" model
      • Example: "My favorite ____ is blue."
    • Example: BERT
    • Common use cases:
      • Sentiment analysis and text classification
  • Autoregressive Language Model (see the code sketch after this list):
    • Predicts the next token using only preceding tokens.
      • Example: "My favorite color is ____"
    • Common use cases:
      • Text generation
  • Many tasks can be framed as completion tasks.
  • Language models can be trained using self-supervision, whereas many other ML models require supervision.
  • Why do larger models need more data?
    • Larger models have more capacity to learn and therefore require more training data to maximize performance.
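
A quick way to see the two model types side by side, assuming the Hugging Face transformers library and the bert-base-uncased and gpt2 checkpoints (a sketch, not code from the book):

    # pip install transformers torch
    from transformers import pipeline

    # Masked LM ("fill in the blank"): BERT predicts the masked token
    # using context from both sides.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill_mask("My favorite [MASK] is blue.")[:3]:
        print(pred["token_str"], round(pred["score"], 3))

    # Autoregressive LM: GPT-2 predicts the next tokens using only
    # the preceding tokens.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("My favorite color is", max_new_tokens=5)[0]["generated_text"])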

From Large Language Models to Foundation Models

  • AI needs to process data beyond text to operate in the real world.
  • Supporting new data modalities makes models more powerful.
  • Self-supervision works for multimodal models too.
    • OpenAI used a variant of self-supervision (natural language supervision) to train CLIP, their language-image model.
  • Foundation models enabled the transition from task-specific models to general-purpose models.

From Foundation Models to AI Engineering

  • AI engineering → the process of building applications on top of foundation models.

Foundation Model Use Cases

  • A 2024 O’Reilly survey categorized use cases into eight categories: programming, data analysis, customer support, marketing copy, other copy, research, web design, and art.

Planning AI Applications

  • If you just want to learn and have fun, jump right in—building is one of the best ways to learn.
  • It’s easy to build a cool demo with foundation models. It’s hard to create a profitable product.

Use Case Evaluation

  • Common reasons businesses feel compelled to evaluate AI use cases:
    • If you don’t use AI, competitors who do can make you obsolete.
    • You’ll miss opportunities to boost profits and productivity.
    • You may not know where AI fits into your business yet, but you don’t want to be left behind.

The Role of AI and Humans in the Application

  • Critical or complementary?
    • "The more critical AI is to the application, the more accurate and reliable it has to be."
  • Reactive or proactive?
    • A chatbot is reactive, whereas traffic alerts on Google Maps are proactive.
  • Dynamic or static?
    • Face ID needs updates as people’s faces change over time.
    • Object detection in Google Photos likely updates only when Google Photos is upgraded.

AI Product Defensibility

  • If your product is easy to build, you have no moat to defend it.
  • Three types of competitive advantages:
    • Technology
    • Data
    • Distribution

Milestone Planning

  • A good initial demo doesn’t guarantee a good end product.

The AI Engineering Stack

  • Every AI application stack has three layers:
    1. Application development
    2. Model development
    3. Infrastructure
  • Many principles of AI application development remain the same as in ML engineering.
  • AI Engineering vs. ML Engineering:
    • AI engineering focuses less on modeling and training, more on model adaptation.
    • AI engineers work with bigger models and need more GPUs.
    • AI engineering deals with models whose outputs are open-ended.
    • ML Engineering: Data → Model → Product
    • AI Engineering: Product → Data → Model
  • Inference optimization: making models faster and cheaper.

Chapter 2: Understanding Foundation Models

  • Differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences.
  • The impact of sampling is often overlooked.
    • Sampling: How a model chooses an output from all possible options.

Training Data

  • Common source → Common Crawl
  • OpenAI used only Reddit links with at least three upvotes to train GPT-2.
  • Tokenization has different costs in different languages: the median length of a message in tokens is 7 in English, 32 in Hindi, and 72 in Burmese.
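
A small sketch of both tokenization ideas (subword splitting and per-language token cost), assuming the tiktoken library; the Hindi sentence is my translation of the English one:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

    # Inspect how a word is split into subword tokens.
    print([enc.decode([t]) for t in enc.encode("cooking")])

    # The same sentence costs a different number of tokens in each language.
    english = "My favorite color is blue."
    hindi = "मेरा पसंदीदा रंग नीला है।"
    print(len(enc.encode(english)), len(enc.encode(hindi)))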

Domain-Specific Models

  • General-purpose foundation models are unlikely to perform well on domain-specific tasks.

Modeling

Transformer Architecture

Seq2Seq

  • Before transformers, we had Seq2Seq, created by Google and popularized when Google updated Google Translate to use it.
    • Uses an encoder-decoder paradigm.
    • Uses an RNN as both encoder and decoder.
    • The encoder processes the input tokens sequentially and outputs the final hidden state. The decoder generates output tokens sequentially using that final hidden state and the previously generated tokens. Relying on a single final state is an information bottleneck for long inputs, which the attention mechanism later addressed.

Attention Mechanism

  • Allows the model to weigh the importance of different input tokens when generating each output token.
  • Inference for transformer-based models has two steps:
    • Prefill:
      • The model processes the input tokens in parallel, creating the intermediate state necessary to generate the first output token (including the key and value vectors for all input tokens).
    • Decode:
      • The model generates one output token at a time.
  • The attention mechanism computes how much attention to give each input token by taking the dot product between the current query vector and that token's key vector (sketched in code below).
  • Multiple heads allow the model to attend to different groups of previous tokens simultaneously.
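
A minimal NumPy sketch of single-head scaled dot-product attention (my own illustration; real transformers add learned per-head projections, a causal mask during decoding, and a KV cache so the prefill state is reused at decode time):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Q, K, V: (seq_len, d) arrays of query/key/value vectors."""
        d = Q.shape[-1]
        # Dot products between queries and keys decide how much attention
        # each position pays to each token; scaling by sqrt(d) keeps the
        # softmax well-behaved.
        scores = Q @ K.T / np.sqrt(d)
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ V                  # weighted sum of value vectors

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (4, 8)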

Other Model Architectures

  • RWKV (pronounced "RwaKuv") is gaining traction.
    • An RNN-based model that can be parallelized.
  • SSMs (State Space Models)
    • Promising for long-range memory.
    • Some examples: S4, H3, Mamba, Jamba.

Model Size

  • MoE (Mixture of Experts) is a model divided into different groups of parameters, where each group is an expert. Only a subset of the experts is active to process each token → Sparse model (see the sketch after this list).
  • If a dataset contains 1 trillion tokens and a model is trained on that dataset for two epochs, then the number of training tokens is 2 trillion.
  • Chinchilla paper: They found that for compute-optimal training, the number of training tokens should be approximately 20 times the model size. This means that a 3B-parameter model needs approximately 60B training tokens.
  • Scaling extrapolation (also called hyperparameter transferring): A research subfield that tries to predict, for large models, what hyperparameters will give the best performance.
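
A toy NumPy sketch of the sparse routing idea behind MoE (an illustration with made-up shapes, not any particular model's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2

    # Each "expert" is just a linear map here; the router is another one.
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
    router = rng.normal(size=(d, n_experts))

    def moe_layer(x):
        """x: (d,) hidden state of one token."""
        logits = x @ router                # one score per expert
        top = np.argsort(logits)[-top_k:]  # indices of the top-k experts
        w = np.exp(logits[top])
        w /= w.sum()                       # softmax over the selected experts
        # Only top_k of the n_experts run for this token -> sparse activation.
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

    print(moe_layer(rng.normal(size=d)).shape)  # (16,)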

Scaling Bottlenecks

  • There are already two visible bottlenecks for scaling: training data and electricity.
  • As of this writing, data centers are estimated to consume 1–2% of global electricity.

Sampling

  • Sampling: How a model constructs outputs.
  • Makes AI outputs probabilistic.
  • Some examples: temperature, top-k, and top-p.
    • Top-k:
      • Picks the top-k tokens and samples from them.
      • Reduces computational workload when computing softmax.
    • Top-p (nucleus sampling):
      • The number of candidate tokens considered is not fixed.
      • The model sums the probabilities of the most likely next tokens in descending order and stops when the sum reaches p.
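
A NumPy sketch of how temperature, top-k, and top-p could be combined to sample one next token from a vector of logits (an illustration, not a specific library's API):

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def sample(logits, temperature=1.0, top_k=None, top_p=None):
        # Temperature < 1 sharpens the distribution; > 1 flattens it.
        probs = softmax(np.asarray(logits, dtype=float) / temperature)
        order = np.argsort(probs)[::-1]  # token ids, most likely first
        if top_k is not None:
            order = order[:top_k]        # keep only the k most likely tokens
        if top_p is not None:
            # Keep the smallest prefix whose cumulative probability reaches p.
            cum = np.cumsum(probs[order])
            order = order[: int(np.searchsorted(cum, top_p)) + 1]
        kept = probs[order] / probs[order].sum()  # renormalize the survivors
        return int(rng.choice(order, p=kept))

    logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print(sample(logits, temperature=0.7, top_k=3))
    print(sample(logits, top_p=0.9))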

Test-Time Compute

  • After sampling multiple outputs, you pick the one with the highest average log probability (logprob).
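
A sketch of that selection step, assuming each candidate output comes back with per-token logprobs (the candidates here are made up):

    import numpy as np

    # Hypothetical sampled outputs: (text, per-token logprobs).
    candidates = [
        ("Answer A", [-0.2, -0.5, -0.1]),
        ("Answer B", [-0.1, -0.3, -0.2, -0.4]),
    ]

    # Averaging (rather than summing) avoids penalizing longer outputs.
    best = max(candidates, key=lambda c: np.mean(c[1]))
    print(best[0])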

Chapter 3: Evaluation Methodology

Work in progress...