Harper Carroll AI Course — Session 1: Foundations of Generative AI

Course by Harper Carroll | Notes compiled February 21, 2026

Overview

Session 1 lays the conceptual groundwork for the entire course, building a mental model of how modern generative AI systems actually work — from the lowest-level representation of data all the way up to production deployment patterns and the distinction between chatbots and autonomous agents. The session resists the temptation to treat AI as magic and instead grounds everything in mechanics: numbers, probabilities, and linear algebra working at massive scale.

The central character is the Large Language Model (LLM), but the session is careful to situate it within a broader landscape of generative AI that includes image synthesis, video generation, and audio models. What unifies all of these modalities is a single key insight: every type of data — a photograph, a sentence, a song — ultimately gets converted into numerical representations before a model can work with it. Understanding that transformation is the first step toward understanding everything else.

By the end of the session, students have touched on transformer architecture, the training pipeline from pre-training through RLHF, the mechanics of Retrieval-Augmented Generation, context window limitations, the difference between chatbots and agents, and the practical trade-offs between open-source and closed-source models. It's a dense first session, but it's structured as a foundation — a map of the terrain before the deeper dives begin.


Key Concepts

Data Representation: Everything Is Numbers

Before any AI model can process information, that information must be converted into numbers. Images are represented as grids of RGB pixel values. Audio becomes waveforms sampled at regular intervals. Text goes through a process called tokenization, where it is split into tokens (words or sub-word fragments), each assigned a numeric ID. Those IDs are then mapped to embeddings — high-dimensional vectors that encode not just identity but meaning. Words with similar meanings cluster near each other in this mathematical space. "King" and "queen" are neighbors; "king" and "carburetor" are not. This geometry is what allows LLMs to reason about language at all.
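The geometry can be made concrete with a toy sketch. The vocabulary, the 4-dimensional vectors, and the numbers below are all invented for illustration (real models learn embeddings with hundreds or thousands of dimensions); only the measurement, cosine similarity, is standard:

```python
import numpy as np

# Toy illustration, not a real tokenizer: tokens get IDs, IDs get vectors.
vocab = {"king": 0, "queen": 1, "carburetor": 2}

# Hand-picked 4-dimensional embeddings. Real models learn these values.
embeddings = np.array([
    [0.90, 0.80, 0.10, 0.00],  # king
    [0.85, 0.82, 0.12, 0.05],  # queen: nearby, similar meaning
    [0.00, 0.10, 0.90, 0.80],  # carburetor: far away, unrelated meaning
])

def cosine_similarity(a, b):
    """How closely two vectors point the same way (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king, queen, car = (embeddings[vocab[w]] for w in ("king", "queen", "carburetor"))
```

Here `cosine_similarity(king, queen)` comes out near 1.0 while `cosine_similarity(king, car)` is close to 0 — the "neighbors in space" claim, in miniature.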

LLMs: Probabilistic Next-Token Prediction

The core operation of any LLM is deceptively simple: given a sequence of tokens, predict the most likely next one. Repeat. That's it. What makes this powerful is the scale — billions of parameters, trained on effectively the entire public internet — and the sophistication of the prediction mechanism. But the framing is crucial: LLMs do statistical continuation, not thinking. They are not search engines retrieving facts; they are pattern-completion engines. This distinction has direct practical implications: it explains hallucinations, it explains why confident-sounding output can be wrong, and it explains why prompting style matters so much.
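The "predict, append, repeat" loop can be sketched with a hand-written probability table standing in for the billions of learned parameters. The tiny vocabulary and probabilities below are made up for illustration; only the loop structure mirrors how generation actually works:

```python
import numpy as np

# Toy next-token predictor: given this token, how likely is each next token?
# Real LLMs learn these distributions; the generation loop is the same idea.
vocab = ["the", "cat", "sat", "mat", "."]
probs = {
    "the": [0.0, 0.5, 0.0, 0.5, 0.0],  # "the" -> "cat" or "mat"
    "cat": [0.0, 0.0, 0.9, 0.0, 0.1],  # "cat" -> usually "sat"
    "sat": [1.0, 0.0, 0.0, 0.0, 0.0],  # "sat" -> "the"
    "mat": [0.0, 0.0, 0.0, 0.0, 1.0],  # "mat" -> "."
    ".":   [0.2, 0.2, 0.2, 0.2, 0.2],
}

def generate(start, n_tokens, rng):
    """Sample the next token from the distribution, append, repeat."""
    out = [start]
    for _ in range(n_tokens):
        out.append(rng.choice(vocab, p=probs[out[-1]]))
    return out

print(" ".join(generate("the", 4, np.random.default_rng(0))))
```

Note what the sketch makes visible: the model never checks whether its output is true. It only ever asks "what plausibly comes next?" — which is exactly why confident-sounding continuations can be wrong.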

The Transformer Architecture

The technical leap that made modern LLMs possible is the Transformer, introduced in the 2017 paper "Attention Is All You Need." Before Transformers, language models processed text sequentially — one word at a time — making them slow and limiting their ability to relate distant parts of a sentence. Transformers introduced self-attention, a mechanism that allows every token in a sequence to attend to (i.e., consider its relationship to) every other token simultaneously, in parallel. Combined with positional encodings (which preserve word order since attention itself is order-agnostic) and multi-head attention (which runs multiple attention mechanisms in parallel to capture different types of relationships), Transformers unlocked both the speed and the contextual sophistication that modern LLMs exhibit.
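The core computation — every token attending to every other token at once — fits in a few lines. This is a deliberately stripped-down sketch: the learned query/key/value projection matrices and multi-head machinery are omitted, so the token vectors play all three roles themselves:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Minimal single-head self-attention (learned W_Q/W_K/W_V omitted).

    One matrix multiply compares every token against every other token
    simultaneously -- the parallelism that replaced sequential processing.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)   # all pairwise token-to-token similarities
    weights = softmax(scores)       # each row: attention over all tokens, sums to 1
    return weights @ X              # each position becomes a weighted mix of all tokens

X = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
out = self_attention(X)
```

Because `X @ X.T` compares all positions in one operation, the mechanism is inherently order-blind — which is precisely why the positional encodings mentioned above are needed.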

Context Windows and Working Memory

The context window is best understood as the model's working memory — the total amount of text (prompt + conversation history + outputs) the model can "see" at once. Current frontier models are impressive: GPT-5.2 supports roughly 400,000 tokens (~1,000 pages of text), while Claude Opus 4.6 and Gemini 3 Pro both support around 1 million tokens (~2,500 pages). But bigger isn't always better in practice. The lost-in-the-middle phenomenon means models tend to weight information at the beginning and end of a context more heavily than content buried in the middle. Additionally, a sliding window mechanism means that in very long conversations, earlier content can effectively "fall off" the model's attention — a FIFO memory with fuzzy edges. System prompts also consume context budget, which becomes relevant in production deployments.
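The "FIFO memory" effect can be sketched as a budget-trimming loop. Everything here is a simplification for illustration: token counting is approximated by whitespace word count (real tokenizers differ), and real systems use subtler eviction strategies than strict oldest-first:

```python
def fit_context(system_prompt, history, budget,
                count_tokens=lambda s: len(s.split())):
    """Drop the oldest turns until the conversation fits the token budget.

    The system prompt always stays (and spends budget too); the oldest
    history "falls off" first -- a crude sketch of the sliding window.
    """
    kept = list(history)
    def used():
        return count_tokens(system_prompt) + sum(count_tokens(t) for t in kept)
    while kept and used() > budget:
        kept.pop(0)  # FIFO eviction: the earliest turn is forgotten first
    return [system_prompt] + kept
```

For example, `fit_context("sys", ["a b c", "d e", "f g h i"], budget=8)` evicts only the oldest turn: the system prompt plus the two newest turns fit within the budget.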

Hallucinations: Causes and Mitigations

Hallucinations — confidently stated falsehoods — are a direct consequence of the probabilistic generation mechanism. The model doesn't retrieve facts; it predicts plausible continuations. Common examples include fabricated academic citations with real-sounding author names, invented URLs that follow correct format but don't exist, and syntactically valid but logically broken code. Mitigations include building verification layers into workflows, using Retrieval-Augmented Generation to ground the model in real documents, and applying domain expertise to spot-check outputs. The session is clear: hallucinations are not a bug to be patched away but a fundamental property of the architecture to be managed.
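A verification layer can be as simple as cross-checking generated citations against a source of truth. The `known_papers` set and the sample output below are invented stand-ins for a real bibliographic index; the point is the pattern, not the lookup mechanism:

```python
import re

# A verification layer in miniature: never trust generated citations blindly.
# "known_papers" is a hypothetical stand-in for a real bibliographic lookup.
known_papers = {"Attention Is All You Need"}

def flag_unverified_citations(model_output):
    """Return quoted titles in the output that aren't in our trusted index."""
    cited = re.findall(r'"([^"]+)"', model_output)
    return [title for title in cited if title not in known_papers]

answer = 'See "Attention Is All You Need" and "Neural Parsing at Scale".'
print(flag_unverified_citations(answer))  # flags the second, fabricated title
```

The same shape — extract the checkable claims, verify each against ground truth, surface what fails — generalizes to URLs, code, and numeric facts.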

Retrieval-Augmented Generation (RAG)

RAG is the primary practical tool for reducing hallucinations when factual accuracy matters. The workflow: embed a corpus of documents into a vector database; when a query arrives, embed the query and perform a similarity search against the corpus; retrieve the most relevant chunks; inject them into the prompt alongside the question; then generate. The model is now working with retrieved evidence rather than purely from parametric memory. This is powerful, but not foolproof — retrieval errors (fetching the wrong chunks) can introduce their own confabulations. RAG is a grounding strategy, not a guarantee.
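The embed–search–inject workflow can be sketched end to end. Real systems use learned embedding models and a vector database; here a bag-of-words vector stands in for the embedding, and the three-document corpus is invented for illustration:

```python
import numpy as np
from collections import Counter

# A three-document "corpus", made up for this sketch.
corpus = [
    "The Transformer was introduced in 2017.",
    "RLHF uses human rankings to train a reward model.",
    "Ollama runs open-source models locally.",
]

def embed(text, vocab):
    """Stand-in embedding: a word-count vector over a shared vocabulary."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def retrieve(query, docs, k=1):
    """Embed query and docs, rank docs by cosine similarity, return top-k."""
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    q = embed(query, vocab)
    def sim(d):
        v = embed(d, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        return float(q @ v / denom) if denom else 0.0
    return sorted(docs, key=sim, reverse=True)[:k]

# Retrieve evidence, then inject it into the prompt alongside the question.
question = "when was the transformer introduced?"
context = retrieve(question, corpus)
prompt = f"Context: {context[0]}\n\nQuestion: {question}"
```

Note where the failure mode lives: if `retrieve` ranks the wrong chunk first, the model is grounded in the wrong evidence — which is exactly the "retrieval errors" caveat above.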

The Training Pipeline: Pre-training, Fine-tuning, RLHF, and Distillation

LLMs are trained in stages. Pre-training exposes the model to internet-scale text corpora, teaching it statistical patterns across virtually every domain. Fine-tuning then applies a smaller, curated dataset to specialize the model for particular tasks — and this step is more accessible than most people realize: a personal fine-tune can cost under $5 and take about 20 minutes. RLHF (Reinforcement Learning from Human Feedback) refines behavior by having humans rank outputs, training a reward model on those rankings, and then using reinforcement learning to nudge the LLM toward higher-rated responses. RLAIF replaces human raters with AI raters. One important side effect: both RLHF and RLAIF can introduce sycophancy — the model learns to tell users what they want to hear rather than what's accurate, and this tendency worsens at scale. Model distillation takes a different approach, training a smaller "student" model on outputs from a larger "teacher" model — producing efficient models that punch above their parameter count. DeepSeek R1 is a prominent case study. Note: distillation from self-hosted models is generally legal; distillation from proprietary APIs typically violates terms of service.
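Of these stages, distillation is the easiest to show in miniature: the student is trained to match the teacher's probability distribution over next tokens (soft targets), not just its single top pick. The three-token logits below are invented; the loss (KL divergence) and its gradient are the standard ones:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence: how far the student's distribution is from the teacher's."""
    return float(np.sum(p * np.log(p / q)))

teacher_logits = np.array([2.0, 1.0, 0.1])   # the large "teacher" model's output
student_logits = np.zeros(3)                 # untrained student: uniform guesses
p = softmax(teacher_logits)

initial_kl = kl(p, softmax(student_logits))

# Gradient descent on KL(teacher || student); the gradient is softmax(s) - p.
for _ in range(200):
    student_logits -= 0.5 * (softmax(student_logits) - p)

final_kl = kl(p, softmax(student_logits))    # near zero: student matches teacher
```

Matching full distributions is what lets small students "punch above their parameter count": each training example carries the teacher's entire ranking of alternatives, not one hard label.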

Chatbots vs. Agents

A chatbot is a text-in, text-out system: it takes your message and produces a reply. An agent is something more ambitious — it analyzes tasks, maintains goals across steps, uses external tools (web search, code execution, file systems, APIs), and takes actions with real-world side effects. This distinction matters because agents introduce risks that chatbots don't: prompt injection (malicious instructions embedded in content the agent processes) and data retention concerns (agents may read and transmit sensitive information). Platforms highlighted for agentic work include Claude Code, Codex, and OpenClaw.
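The structural difference is a loop. In this sketch the "decide" step is hard-coded, whereas a real agent would delegate that decision to an LLM; the single `calculator` tool and the task format are invented for illustration:

```python
# A chatbot maps text -> text. An agent loops: decide, act with a tool, observe.
def calculator(expression):
    """A "tool" the agent can call (deliberately tiny: addition only)."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def agent(task):
    trace = []
    # Step 1: analyze the task. A real agent would ask an LLM to plan this;
    # here a hard-coded rule stands in for that decision.
    if "+" in task:
        trace.append(("decide", "use calculator"))
        result = TOOLS["calculator"](task)   # Step 2: act via an external tool
        trace.append(("observe", result))    # Step 3: observe the result, continue
        return result, trace
    return "no tool needed", trace

answer, trace = agent("19+23")
```

The trace is also where the risks enter: every tool call is an action with side effects, and any text the agent reads along the way (a web page, a file) is a potential channel for prompt injection.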

Open-Source vs. Closed-Source Models

Closed-source models (GPT, Claude, Gemini) offer the highest current capability, native multimodality, and simple API access — but your data goes to the provider. Open-source models (Llama, Mistral, DeepSeek) offer full control, local execution, and privacy — but require infrastructure and typically lag on capability. The practical heuristic: use the smallest reliable model for the task. Ollama is the recommended tool for running open-source models locally.


Key Insights & Takeaways


Memorable Quotes / Notable Examples


What's Coming Next

Session 2 will go deeper into reasoning models — a class of LLMs that do additional compute at inference time, essentially "thinking before they speak" — and will introduce more sophisticated prompt engineering systems. The course is also expected to cover more advanced agent architectures and the emerging landscape of multi-modal and multi-agent systems.