Deep Dive into LLMs: Notes from Andrej Karpathy’s 3h30 video
It took me several weeks to finish watching this YouTube video. It was not a random one, but Deep Dive into LLMs like ChatGPT by Andrej Karpathy. This guy knows a thing or two about neural networks 😉, and the video came highly recommended to me. It’s a walkthrough of how LLMs are trained, the steps required to make them act as the assistants we know as LLM end-users, and how inference works.
Here are my notes, closely matching the different chapters of the video.
Pretraining: Data Collection and Processing
The process begins by crawling the internet.
Several filtering steps are applied:
- URL filtering to exclude unwanted content
- Text extraction to remove HTML markup
- Language filtering to focus on specific languages
- Deduplication to remove redundant content
- Removal of personal information
- Various other filters for quality control
Tokenization: Breaking Text into Manageable Pieces
- Finding the optimal tradeoff between vocabulary size and sequence length
- At one extreme, binary encoding uses only two symbols but creates very long sequences
- Byte pair encoding (BPE) replaces frequently occurring symbol pairs with new tokens
- A good tradeoff is around 1,000 different tokens
- Popular LLMs typically use 30,000–100,000 tokens in their vocabularies
- Tools like tiktokenizer help visualize tokenization
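The BPE idea above can be sketched in a few lines of Python. This is a toy version (a single merge step on a list of characters), not any production tokenizer:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Perform one BPE merge: replace the most frequent adjacent pair
    with a single new token (here, the concatenated string)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # the new token standing for the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababcab")
print(bpe_merge_step(tokens))  # the most common pair "a","b" becomes "ab"
```

Repeating this step grows the vocabulary while shrinking the sequence, which is exactly the tradeoff described above.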
Neural Network Input/Output
- Models work with a fixed context window of tokens
- Longer contexts increase computational requirements
- The fundamental task is predicting the next token in a sequence
- The neural network takes the context as input
- The output is a probability distribution across all possible tokens
- Initially, these probabilities are random
- Training adjusts the network to assign higher probabilities to correct tokens
- The goal is to make the model’s statistical predictions match the dataset
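The “probability distribution across all possible tokens” is typically produced by applying a softmax to the network’s raw outputs (logits). A minimal sketch with a made-up four-token vocabulary:

```python
import math

def softmax(logits):
    """Turn raw network outputs (logits) into a probability
    distribution over the vocabulary."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary; the network emits one logit per token.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 0.1, -1.0]   # hypothetical outputs for some context
probs = softmax(logits)
print(dict(zip(vocab, [round(p, 3) for p in probs])))  # probabilities sum to 1
```

Training nudges the parameters so that the probability assigned to the correct next token goes up.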
Neural Network Internals
The process involves:
- Token window inputs
- Billions of parameters (initially random)
- A massive mathematical expression
- Output statistics for every possible token
Interactive visualizations like bbycroft.net/llm help in understanding the architecture.
Inference: Generating Text
- Inference is the process of generating new text from the model
- Starting with an initial token, the model predicts the next one
- Each token is selected randomly based on its probability weight (like “flipping a weighted coin”)
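The “weighted coin flip” can be sketched with Python’s standard library; the vocabulary and probabilities below are made up for illustration:

```python
import random

def sample_next_token(vocab, probs, seed=None):
    """Pick the next token at random, weighted by its probability --
    the 'weighted coin flip' from the video."""
    rng = random.Random(seed)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Made-up distribution for a context like "the cat ...".
vocab = ["sat", "ran", "slept"]
probs = [0.7, 0.2, 0.1]
print(sample_next_token(vocab, probs, seed=42))
```

Appending the sampled token to the context and repeating is the whole inference loop.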
Base Models
- Base models only output tokens — essentially rewriting internet content
- They cannot be prompted directly in a useful way
- Example: OpenAI’s GPT-2 (`src/model.py` defines the steps executed during neural network training)
- The real value lies in the parameters (e.g., 1.5 billion numbers in GPT-2)
- These are stochastic systems where output is an estimate
- Base models function as “internet document simulators”
- To create a basic assistant from a base model, you prefix prompts with instructions and examples of desired interactions
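This prefixing trick is plain string construction; the instructions and example dialogue below are illustrative, not from any real dataset:

```python
# A base model only continues text, so we prepend instructions plus a
# few example interactions (few-shot prompting) to steer it toward
# assistant-style completions. All text here is made up.
few_shot_prefix = """You are a helpful assistant. Answer the user's questions.

User: What is the capital of France?
Assistant: The capital of France is Paris.

User: How many legs does a spider have?
Assistant: A spider has eight legs.

"""

def build_prompt(question):
    """Append the new question so the model's most likely continuation
    is an answer in the same format as the examples."""
    return few_shot_prefix + f"User: {question}\nAssistant:"

print(build_prompt("What is the tallest mountain on Earth?"))
```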
Post-Training: Refining the Model
- Models are fed large sets of human-written example conversations between a user and an assistant
- This requires a specific data format to encode conversations with metadata markers
- Example datasets include UltraChat
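Such a format can be sketched as follows; the `<|im_start|>`/`<|im_end|>` markers follow the ChatML-style convention used by some OpenAI models, but the exact special tokens are model-specific:

```python
# ChatML-style markers (model-specific in practice).
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def encode_conversation(turns):
    """Flatten a list of (role, message) turns into the single text
    stream the model is actually trained on."""
    parts = []
    for role, message in turns:
        parts.append(f"{IM_START}{role}\n{message}{IM_END}")
    return "\n".join(parts)

convo = [
    ("user", "What is 2 + 2?"),
    ("assistant", "2 + 2 equals 4."),
]
print(encode_conversation(convo))
```

The markers let the model learn where each speaker’s turn begins and ends, which is what makes the chat format possible at all.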
Addressing Hallucinations
Models imitate the confident tone of training data even when they don’t know answers
Solutions include:
- Adding “I don’t know” responses to the training set
- Implementing search capabilities with special tokens like `<SEARCH_WEB>query</>`
- With enough examples, models learn when to admit ignorance or search for information
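One plausible runtime flow for those search tokens, sketched below; the handler and the search function are hypothetical stand-ins:

```python
import re

# When the model emits a search tag, the runtime pauses generation,
# runs the query, and pastes the results back into the context so the
# model can continue with fresh information. Token format is the one
# quoted above; everything else here is a made-up stub.
SEARCH_PATTERN = re.compile(r"<SEARCH_WEB>(.*?)</>")

def handle_search_tokens(model_output, search_fn):
    """Replace each emitted search tag with the results returned by
    search_fn, simulating the tool-use round trip."""
    return SEARCH_PATTERN.sub(lambda m: search_fn(m.group(1)), model_output)

fake_search = lambda q: f"[results for '{q}']"
out = handle_search_tokens(
    "Let me check. <SEARCH_WEB>population of Lyon</> Based on this...",
    fake_search,
)
print(out)
```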
LLM Knowledge exists in two places:
- Model parameters (like long-term memory)
- Context window (like working memory)
Self-Knowledge
- By default, models don’t have accurate self-knowledge
- They might respond with “I’m ChatGPT by OpenAI” because this appears frequently online
- Post-training with identity-focused conversations helps, for example with questions like “Who are you?”
- System messages can prefix each context window with identity information, reminding assistants who they are, who built them, what their limits are, etc.
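A minimal sketch of that prefixing; the identity text and helper function are hypothetical:

```python
# Hypothetical identity text; real system prompts are written by the
# model provider and prepended to every conversation.
SYSTEM_MESSAGE = (
    "You are ExampleBot, an assistant built by ExampleCorp. "
    "You cannot browse the web or run code."
)

def with_system_prefix(user_message):
    """Prefix the context window with identity information so the
    model 'knows' who it is without any retraining."""
    return f"System: {SYSTEM_MESSAGE}\nUser: {user_message}\nAssistant:"

print(with_system_prefix("Who are you?"))
```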
Computational Limitations
- Models have finite computational capacity per token
- When creating training conversations, it’s better to place the final answer after the reasoning steps, so computation is spread across many tokens instead of being squeezed into the first ones
- This gives the model more context before it commits to an answer
- Requesting verifiable outputs like code helps ensure accuracy
From Supervised Fine-Tuning to Reinforcement Learning
Training progression:
- Pretraining (background knowledge)
- Supervised Fine-Tuning/SFT (problems with solutions)
- Reinforcement Learning (practice problems without immediate solutions)
Reinforcement Learning
- We often don’t know what’s difficult for LLMs, making it hard to design optimal incentives
- If we train the LLM to give short responses, we risk running into its per-token computational limits and producing errors
- But long responses may waste tokens
- RL lets the model discover optimal token sequences
- The Deepseek R1 paper noted models can learn to try multiple problem-solving approaches
- OpenAI’s o1/o3 models use these techniques, while GPT-4 is primarily SFT-based
- RL models excel at reasoning tasks like mathematics
- In games like Go, RL models surpassed SFT models and human champions (AlphaGo)
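The loop the video describes can be caricatured as “sample many attempts, keep the verified ones, train on them.” A toy sketch, where the model is a random-guessing stub and all names are made up:

```python
import random

def rl_round(generate, is_correct, prompt, num_samples=8, seed=0):
    """One simplified RL round: sample many candidate solutions, keep
    those that reach the verified answer, and return them as new
    training data to reinforce."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(num_samples)]
    return [c for c in candidates if is_correct(c)]

# Stub 'model': guesses an answer to 3 * 4 (illustrative only).
def toy_generate(prompt, rng):
    return f"{prompt} = {rng.choice([10, 11, 12, 13])}"

kept = rl_round(toy_generate, lambda s: s.endswith("= 12"), "3 * 4")
print(kept)  # only attempts reaching the verified answer survive
```

Unlike SFT, nobody wrote the surviving solutions by hand; the model found token sequences that work and gets trained toward them.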
RLHF: Reinforcement Learning from Human Feedback
Used to train AI in domains where answers aren’t objectively verifiable
Creates a reward model based on human preferences:
- Humans rank responses to prompts
- A model learns to predict these rankings
- This creates a “neural net simulator of human preferences”
RLHF works because:
- It’s easier for humans to discriminate between responses than to generate perfect ones
- This makes the labeling process more efficient
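Reward models are commonly trained with a pairwise (Bradley-Terry style) objective, which is one standard way to implement the ranking idea above; a minimal sketch with illustrative numbers:

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style objective often used for reward models:
    maximize the probability that the human-preferred response scores
    higher than the rejected one."""
    diff = score_preferred - score_rejected
    prob = 1.0 / (1.0 + math.exp(-diff))   # P(preferred beats rejected)
    return -math.log(prob)                  # negative log-likelihood

# Reward-model scores for two responses to the same prompt (made up).
loss_agree = pairwise_loss(2.0, -1.0)   # model agrees with the human ranking
loss_disagree = pairwise_loss(-1.0, 2.0)  # model disagrees
print(loss_agree < loss_disagree)
```

Minimizing this loss over many human rankings is what produces the “neural net simulator of human preferences.”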
Limitations:
- The reward model is a lossy simulation of human judgment
- Models can learn to “game” the reward function
- Extended RLHF training can degrade performance
- Traditional RL with precise, human-validated answers doesn’t suffer this limitation
Future Developments
- Multimodality: Integration of text, audio, and video capabilities (already emerging in current models)
- Supervised Agents: AI systems that can perform complex tasks while operating under human supervision
- Ubiquitous Integration: LLMs embedded in virtually every application (similar to the current “AI” marketing trend, but with actual integration everywhere)
- Computer-Using Capabilities (MCP): Models that can effectively utilize computers and digital tools as humans do
- Improved Learning Mechanisms: Development of a middle ground between static model parameters and dynamic context windows, allowing models to continuously learn and adapt
Useful Resources
- LM Arena (though Karpathy expresses some reservations)
- Smol AI Newsletter
- Together.ai (inference provider)
- LM Studio (for running local LLMs)
Conclusion
- LLMs are lossy simulations of human intelligence
- What ChatGPT generates is a neural simulation of data labelers following OpenAI’s instructions
- Reasoning models using RL demonstrate more complex thinking processes
- We still don’t know if RL capabilities in one domain transfer to others
By Thomas Martin