Harper Carroll AI Course — Session 1: Foundations of Generative AI

Course by Harper Carroll | Notes compiled February 21, 2026

Overview

Session 1 lays the conceptual groundwork for the entire course, building a mental model of how modern generative AI systems actually work — from the lowest-level representation of data all the way up to production deployment patterns and the distinction between chatbots and autonomous agents. The session resists the temptation to treat AI as magic and instead grounds everything in mechanics: numbers, probabilities, and linear algebra working at massive scale.

The central character is the Large Language Model (LLM), but the session is careful to situate it within a broader landscape of generative AI that includes image synthesis, video generation, and audio models. What unifies all of these modalities is a single key insight: every type of data — a photograph, a sentence, a song — ultimately gets converted into numerical representations before a model can work with it. Understanding that transformation is the first step toward understanding everything else.

By the end of the session, students have touched on transformer architecture, the training pipeline from pre-training through RLHF, the mechanics of Retrieval-Augmented Generation, context window limitations, the difference between chatbots and agents, and the practical trade-offs between open-source and closed-source models. It's a dense first session, but it's structured as a foundation — a map of the terrain before the deeper dives begin.


Key Concepts

Data Representation: Everything Is Numbers

Before any AI model can process information, that information must be converted into numbers. Images are represented as grids of RGB pixel values. Audio becomes waveforms sampled at regular intervals. Text goes through a process called tokenization, where it is split into tokens (words or sub-word fragments), each assigned a numeric ID. Those IDs are then mapped to embeddings — high-dimensional vectors that encode not just identity but meaning. Words with similar meanings cluster near each other in this mathematical space. "King" and "queen" are neighbors; "king" and "carburetor" are not. This geometry is what allows LLMs to reason about language at all.
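The geometry can be made concrete with a toy sketch. The vocabulary, the 4-dimensional vectors, and the numbers below are all invented for illustration (real models learn embeddings with hundreds or thousands of dimensions); only the measurement, cosine similarity, is standard:

```python
import numpy as np

# Toy illustration, not a real tokenizer: tokens get IDs, IDs get vectors.
vocab = {"king": 0, "queen": 1, "carburetor": 2}

# Hand-picked 4-dimensional embeddings. Real models learn these values.
embeddings = np.array([
    [0.90, 0.80, 0.10, 0.00],  # king
    [0.85, 0.82, 0.12, 0.05],  # queen: nearby, similar meaning
    [0.00, 0.10, 0.90, 0.80],  # carburetor: far away, unrelated meaning
])

def cosine_similarity(a, b):
    """How closely two vectors point the same way (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king, queen, car = (embeddings[vocab[w]] for w in ("king", "queen", "carburetor"))
```

Here `cosine_similarity(king, queen)` comes out near 1.0 while `cosine_similarity(king, car)` is close to 0 — the "neighbors in space" claim, in miniature.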

LLMs: Probabilistic Next-Token Prediction

The core operation of any LLM is deceptively simple: given a sequence of tokens, predict the most likely next one. Repeat. That's it. What makes this powerful is the scale — billions of parameters, trained on effectively the entire public internet — and the sophistication of the prediction mechanism. But the framing is crucial: LLMs do statistical continuation, not thinking. They are not search engines retrieving facts; they are pattern-completion engines. This distinction has direct practical implications: it explains hallucinations, it explains why confident-sounding output can be wrong, and it explains why prompting style matters so much.
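The "predict, append, repeat" loop can be sketched with a hand-written probability table standing in for the billions of learned parameters. The tiny vocabulary and probabilities below are made up for illustration; only the loop structure mirrors how generation actually works:

```python
import numpy as np

# Toy next-token predictor: given this token, how likely is each next token?
# Real LLMs learn these distributions; the generation loop is the same idea.
vocab = ["the", "cat", "sat", "mat", "."]
probs = {
    "the": [0.0, 0.5, 0.0, 0.5, 0.0],  # "the" -> "cat" or "mat"
    "cat": [0.0, 0.0, 0.9, 0.0, 0.1],  # "cat" -> usually "sat"
    "sat": [1.0, 0.0, 0.0, 0.0, 0.0],  # "sat" -> "the"
    "mat": [0.0, 0.0, 0.0, 0.0, 1.0],  # "mat" -> "."
    ".":   [0.2, 0.2, 0.2, 0.2, 0.2],
}

def generate(start, n_tokens, rng):
    """Sample the next token from the distribution, append, repeat."""
    out = [start]
    for _ in range(n_tokens):
        out.append(rng.choice(vocab, p=probs[out[-1]]))
    return out

print(" ".join(generate("the", 4, np.random.default_rng(0))))
```

Note what the sketch makes visible: the model never checks whether its output is true. It only ever asks "what plausibly comes next?" — which is exactly why confident-sounding continuations can be wrong.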

The Transformer Architecture

The technical leap that made modern LLMs possible is the Transformer, introduced in the 2017 paper "Attention Is All You Need." Before Transformers, language models processed text sequentially — one word at a time — making them slow and limiting their ability to relate distant parts of a sentence. Transformers introduced self-attention, a mechanism that allows every token in a sequence to attend to (i.e., consider its relationship to) every other token simultaneously, in parallel. Combined with positional encodings (which preserve word order since attention itself is order-agnostic) and multi-head attention (which runs multiple attention mechanisms in parallel to capture different types of relationships), Transformers unlocked both the speed and the contextual sophistication that modern LLMs exhibit.
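The core computation — every token attending to every other token at once — fits in a few lines. This is a deliberately stripped-down sketch: the learned query/key/value projection matrices and multi-head machinery are omitted, so the token vectors play all three roles themselves:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Minimal single-head self-attention (learned W_Q/W_K/W_V omitted).

    One matrix multiply compares every token against every other token
    simultaneously -- the parallelism that replaced sequential processing.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)   # all pairwise token-to-token similarities
    weights = softmax(scores)       # each row: attention over all tokens, sums to 1
    return weights @ X              # each position becomes a weighted mix of all tokens

X = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
out = self_attention(X)
```

Because `X @ X.T` compares all positions in one operation, the mechanism is inherently order-blind — which is precisely why the positional encodings mentioned above are needed.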

Context Windows and Working Memory

The context window is best understood as the model's working memory — the total amount of text (prompt + conversation history + outputs) the model can "see" at once. Current frontier models are impressive: GPT-5.2 supports roughly 400,000 tokens (~1,000 pages of text), while Claude Opus 4.6 and Gemini 3 Pro both support around 1 million tokens (~2,500 pages). But bigger isn't always better in practice. The lost-in-the-middle phenomenon means models tend to weight information at the beginning and end of a context more heavily than content buried in the middle. Additionally, a sliding window mechanism means that in very long conversations, earlier content can effectively "fall off" the model's attention — a FIFO memory with fuzzy edges. System prompts also consume context budget, which becomes relevant in production deployments.
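The "FIFO memory" effect can be sketched as a budget-trimming loop. Everything here is a simplification for illustration: token counting is approximated by whitespace word count (real tokenizers differ), and real systems use subtler eviction strategies than strict oldest-first:

```python
def fit_context(system_prompt, history, budget,
                count_tokens=lambda s: len(s.split())):
    """Drop the oldest turns until the conversation fits the token budget.

    The system prompt always stays (and spends budget too); the oldest
    history "falls off" first -- a crude sketch of the sliding window.
    """
    kept = list(history)
    def used():
        return count_tokens(system_prompt) + sum(count_tokens(t) for t in kept)
    while kept and used() > budget:
        kept.pop(0)  # FIFO eviction: the earliest turn is forgotten first
    return [system_prompt] + kept
```

For example, `fit_context("sys", ["a b c", "d e", "f g h i"], budget=8)` evicts only the oldest turn: the system prompt plus the two newest turns fit within the budget.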

Hallucinations: Causes and Mitigations

Hallucinations — confidently stated falsehoods — are a direct consequence of the probabilistic generation mechanism. The model doesn't retrieve facts; it predicts plausible continuations. Common examples include fabricated academic citations with real-sounding author names, invented URLs that follow correct format but don't exist, and syntactically valid but logically broken code. Mitigations include building verification layers into workflows, using Retrieval-Augmented Generation to ground the model in real documents, and applying domain expertise to spot-check outputs. The session is clear: hallucinations are not a bug to be patched away but a fundamental property of the architecture to be managed.
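A verification layer can be as simple as cross-checking generated citations against a source of truth. The `known_papers` set and the sample output below are invented stand-ins for a real bibliographic index; the point is the pattern, not the lookup mechanism:

```python
import re

# A verification layer in miniature: never trust generated citations blindly.
# "known_papers" is a hypothetical stand-in for a real bibliographic lookup.
known_papers = {"Attention Is All You Need"}

def flag_unverified_citations(model_output):
    """Return quoted titles in the output that aren't in our trusted index."""
    cited = re.findall(r'"([^"]+)"', model_output)
    return [title for title in cited if title not in known_papers]

answer = 'See "Attention Is All You Need" and "Neural Parsing at Scale".'
print(flag_unverified_citations(answer))  # flags the second, fabricated title
```

The same shape — extract the checkable claims, verify each against ground truth, surface what fails — generalizes to URLs, code, and numeric facts.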

Retrieval-Augmented Generation (RAG)

RAG is the primary practical tool for reducing hallucinations when factual accuracy matters. The workflow: embed a corpus of documents into a vector database; when a query arrives, embed the query and perform a similarity search against the corpus; retrieve the most relevant chunks; inject them into the prompt alongside the question; then generate. The model is now working with retrieved evidence rather than purely from parametric memory. This is powerful, but not foolproof — retrieval errors (fetching the wrong chunks) can introduce their own confabulations. RAG is a grounding strategy, not a guarantee.
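The embed–search–inject workflow can be sketched end to end. Real systems use learned embedding models and a vector database; here a bag-of-words vector stands in for the embedding, and the three-document corpus is invented for illustration:

```python
import numpy as np
from collections import Counter

# A three-document "corpus", made up for this sketch.
corpus = [
    "The Transformer was introduced in 2017.",
    "RLHF uses human rankings to train a reward model.",
    "Ollama runs open-source models locally.",
]

def embed(text, vocab):
    """Stand-in embedding: a word-count vector over a shared vocabulary."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def retrieve(query, docs, k=1):
    """Embed query and docs, rank docs by cosine similarity, return top-k."""
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    q = embed(query, vocab)
    def sim(d):
        v = embed(d, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        return float(q @ v / denom) if denom else 0.0
    return sorted(docs, key=sim, reverse=True)[:k]

# Retrieve evidence, then inject it into the prompt alongside the question.
question = "when was the transformer introduced?"
context = retrieve(question, corpus)
prompt = f"Context: {context[0]}\n\nQuestion: {question}"
```

Note where the failure mode lives: if `retrieve` ranks the wrong chunk first, the model is grounded in the wrong evidence — which is exactly the "retrieval errors" caveat above.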

The Training Pipeline: Pre-training, Fine-tuning, RLHF, and Distillation

LLMs are trained in stages. Pre-training exposes the model to internet-scale text corpora, teaching it statistical patterns across virtually every domain. Fine-tuning then applies a smaller, curated dataset to specialize the model for particular tasks — and this step is more accessible than most people realize: a personal fine-tune can cost under $5 and take about 20 minutes. RLHF (Reinforcement Learning from Human Feedback) refines behavior by having humans rank outputs, training a reward model on those rankings, and then using reinforcement learning to nudge the LLM toward higher-rated responses. RLAIF replaces human raters with AI raters. One important side effect: both RLHF and RLAIF can introduce sycophancy — the model learns to tell users what they want to hear rather than what's accurate, and this tendency worsens at scale. Model distillation takes a different approach, training a smaller "student" model on outputs from a larger "teacher" model — producing efficient models that punch above their parameter count. DeepSeek R1 is a prominent case study. Note: distillation from self-hosted models is generally legal; distillation from proprietary APIs typically violates terms of service.
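Of these stages, distillation is the easiest to show in miniature: the student is trained to match the teacher's probability distribution over next tokens (soft targets), not just its single top pick. The three-token logits below are invented; the loss (KL divergence) and its gradient are the standard ones:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence: how far the student's distribution is from the teacher's."""
    return float(np.sum(p * np.log(p / q)))

teacher_logits = np.array([2.0, 1.0, 0.1])   # the large "teacher" model's output
student_logits = np.zeros(3)                 # untrained student: uniform guesses
p = softmax(teacher_logits)

initial_kl = kl(p, softmax(student_logits))

# Gradient descent on KL(teacher || student); the gradient is softmax(s) - p.
for _ in range(200):
    student_logits -= 0.5 * (softmax(student_logits) - p)

final_kl = kl(p, softmax(student_logits))    # near zero: student matches teacher
```

Matching full distributions is what lets small students "punch above their parameter count": each training example carries the teacher's entire ranking of alternatives, not one hard label.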

Chatbots vs. Agents

A chatbot is a text-in, text-out system: it takes your message and produces a reply. An agent is something more ambitious — it analyzes tasks, maintains goals across steps, uses external tools (web search, code execution, file systems, APIs), and takes actions with real-world side effects. This distinction matters because agents introduce risks that chatbots don't: prompt injection (malicious instructions embedded in content the agent processes) and data retention concerns (agents may read and transmit sensitive information). Platforms highlighted for agentic work include Claude Code, Codex, and OpenClaw.
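The structural difference is a loop. In this sketch the "decide" step is hard-coded, whereas a real agent would delegate that decision to an LLM; the single `calculator` tool and the task format are invented for illustration:

```python
# A chatbot maps text -> text. An agent loops: decide, act with a tool, observe.
def calculator(expression):
    """A "tool" the agent can call (deliberately tiny: addition only)."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def agent(task):
    trace = []
    # Step 1: analyze the task. A real agent would ask an LLM to plan this;
    # here a hard-coded rule stands in for that decision.
    if "+" in task:
        trace.append(("decide", "use calculator"))
        result = TOOLS["calculator"](task)   # Step 2: act via an external tool
        trace.append(("observe", result))    # Step 3: observe the result, continue
        return result, trace
    return "no tool needed", trace

answer, trace = agent("19+23")
```

The trace is also where the risks enter: every tool call is an action with side effects, and any text the agent reads along the way (a web page, a file) is a potential channel for prompt injection.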

Open-Source vs. Closed-Source Models

Closed-source models (GPT, Claude, Gemini) offer the highest current capability, native multimodality, and simple API access — but your data goes to the provider. Open-source models (Llama, Mistral, DeepSeek) offer full control, local execution, and privacy — but require infrastructure and typically lag on capability. The practical heuristic: use the smallest reliable model for the task. Ollama is the recommended tool for running open-source models locally.


Key Insights & Takeaways


Memorable Quotes / Notable Examples


What's Coming Next

Session 2 will go deeper into reasoning models — a class of LLMs that do additional compute at inference time, essentially "thinking before they speak" — and will introduce more sophisticated prompt engineering systems. The course is also expected to cover more advanced agent architectures and the emerging landscape of multi-modal and multi-agent systems.