Introduction to Data Systems and Data Design
Data systems underpin much of modern computer science and artificial intelligence. In this course, we explore data systems of varying complexity—from classical relational databases to modern generative AI infrastructure.
Despite their differences, we will discover how a few fundamental ideas govern their design and optimization: the decoupling of logical and physical data representation, common performance modeling principles that apply across vastly different architectures, and recurring optimization techniques—such as tiling and exploiting locality—that reappear at every level of the system stack.
Concretely, we examine three workloads, each offering a different answer to the question “what is data and how should we represent it?”—relational databases, linear algebra, and large language models. We also navigate three hardware regimes, each offering a different answer to the question “what is a computer and how can we model and optimize workloads running on it?”—starting from a pure block I/O model for relational operations, then moving to CPU cache hierarchies and SIMD, and finally to GPU-scale parallelism.
The goal of this course is to equip students with a principled way of reasoning about data, data systems, and system performance—and to ground this reasoning in three case studies that, despite their surface differences, all benefit from the same underlying principles.
- Where: Harper Memorial Library 140
- When: Mon/Wed, 4:30-5:50 PM
Note
The structure below is tentative. It will evolve as we teach and learn together—consider it a living document that we’ll refine based on our pace and interests.
Part 1: Relational Databases and I/O Optimizations
In Part 1, we introduce the first data model of the course—the relational database—and develop a pure block I/O-based performance model. We then apply this model to analyze and optimize fundamental relational operators.
Learning Objectives:
- Model data using the relational model and manipulate it using relational algebra, understanding its foundations in first-order logic and model theory
- Construct and apply a simplified system performance model based on block I/O counts
- Analyze and optimize relational operators (scans, joins) under this I/O cost model
- Hands-on Exercise: Implement these concepts within SQLite’s page-based storage structure
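As a warm-up for this exercise, here is a minimal sketch that decodes a few fields of SQLite's 100-byte file header using only Python's standard library. The example.db file name is just for illustration; the script creates a throwaway database so it is self-contained.

```python
import sqlite3
import struct

# Create a tiny database so the script has a file to inspect.
con = sqlite3.connect("example.db")
con.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, name TEXT)")
con.commit()
con.close()

with open("example.db", "rb") as f:
    header = f.read(100)  # the SQLite file header is the first 100 bytes

magic = header[:16]                                 # b"SQLite format 3\x00"
page_size = struct.unpack(">H", header[16:18])[0]   # big-endian u16; the value 1 means 65536
page_count = struct.unpack(">I", header[28:32])[0]  # database size in pages (valid in files written by modern SQLite)

print(magic, page_size, page_count)
```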
Lecture 1: The Relational Model - A Logical Data Model
- Data models
- Relational Model: Relations, tuples, and attributes
- Relational algebra: selection, projection, join, set operations
- Relational semiring, and how relational data processing is tied to matrix multiplication (a toy sketch follows this list)
- Slides
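As a taste of the semiring connection, here is a toy sketch (not the course's implementation): encode two binary relations over a small hypothetical domain as 0/1 matrices, and their composition (join on the shared attribute, then project it away) becomes matrix multiplication.

```python
n = 4                         # hypothetical domain {0, 1, 2, 3}
R = {(0, 1), (1, 2), (2, 3)}  # relation R(a, b)
S = {(1, 0), (2, 0), (3, 1)}  # relation S(b, c)

# Encode each relation as an n-by-n 0/1 matrix.
MR = [[int((i, j) in R) for j in range(n)] for i in range(n)]
MS = [[int((i, j) in S) for j in range(n)] for i in range(n)]

# Matrix product over the counting semiring (+, *): entry [a][c] counts the
# b's with R(a, b) and S(b, c). Swapping (+, *) for (or, and) gives the
# boolean semiring, i.e., plain relational composition.
MT = [[sum(MR[i][k] * MS[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

composed = {(i, j) for i in range(n) for j in range(n) if MT[i][j] > 0}
print(composed)  # {(0, 0), (1, 0), (2, 1)} == project_{a,c}(R join S)
```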
Lecture 2: Physical Data Layout and the Block I/O Model
- From logical relations to physical storage: bridging the abstraction gap
- Layout algebra
- Data layout on disk: pages, slots, and record formats
- SQLite’s file format as a concrete example
- The storage/memory hierarchy and the I/O bottleneck
- Modeling cost as block (page) transfers (sketched in code below)
- Slides
- layout.zip – Play with layout algebra in CuTe without a GPU (works on a MacBook).
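To preview how spare the pure block-I/O model is, here is a minimal sketch with hypothetical sizes; the only cost it tracks is the number of page transfers.

```python
import math

def pages(num_tuples: int, tuple_bytes: int, page_bytes: int = 4096) -> int:
    """Pages needed to store a relation, assuming tuples never span pages."""
    tuples_per_page = page_bytes // tuple_bytes
    return math.ceil(num_tuples / tuples_per_page)

# Example: a 1M-tuple relation with 100-byte tuples on 4 KiB pages.
B_R = pages(1_000_000, 100)
print(B_R)               # 25000 pages
print(B_R * 4096 / 1e6)  # ~102 MB transferred by one full scan
```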
Lecture 3: Optimizing Select
- Full table scans and their I/O cost
- Index scans: B+ trees and their I/O characteristics
- Clustered vs. unclustered indexes
- Selectivity estimation and its impact on scan choice (a cost comparison is sketched below)
- Slides
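Here is the cost comparison as back-of-the-envelope Python; every number (page count, tuple count, tree height) is hypothetical, but the crossover is the point: past some selectivity, an unclustered index scan loses to a plain full scan.

```python
import math

def full_scan_cost(B: int) -> int:
    return B  # read every page exactly once

def index_scan_cost(height: int, matches: int, clustered: bool,
                    B: int, num_tuples: int) -> int:
    if clustered:
        # matching tuples are packed into contiguous pages
        return height + math.ceil(matches / num_tuples * B)
    return height + matches  # worst case: one page I/O per matching tuple

B, N, H = 25_000, 1_000_000, 3  # pages, tuples, B+ tree height
for sel in (0.001, 0.01, 0.1):
    k = int(sel * N)
    print(sel, full_scan_cost(B),
          index_scan_cost(H, k, clustered=False, B=B, num_tuples=N))
# At selectivity 0.1 the unclustered index scan costs 100003 page I/Os,
# four times the 25000-page full scan.
```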
Lecture 4: Optimizing Joins
- Hash tables as a building block
- Nested loop join and its I/O cost analysis
- Block nested loop join optimization
- Sort-merge join: leveraging sorted runs
- Hash join: partitioning for I/O efficiency
- Choosing join algorithms based on data characteristics (cost arithmetic sketched below)
- Slides
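And the join cost arithmetic under the same model; the formulas below are the textbook ones, with hypothetical page counts B(R), B(S) and buffer size M.

```python
import math

def nested_loop(B_R: int, B_S: int) -> int:
    # one full inner scan per outer page (tuple-at-a-time is even worse)
    return B_R + B_R * B_S

def block_nested_loop(B_R: int, B_S: int, M: int) -> int:
    # read the outer in chunks of M-2 pages; scan the inner once per chunk
    return B_R + math.ceil(B_R / (M - 2)) * B_S

def grace_hash_join(B_R: int, B_S: int) -> int:
    # partition both relations (read + write), then probe (read)
    return 3 * (B_R + B_S)

B_R, B_S, M = 1_000, 10_000, 102  # hypothetical sizes
print(nested_loop(B_R, B_S))           # 10,001,000 page I/Os
print(block_nested_loop(B_R, B_S, M))  # 101,000
print(grace_hash_join(B_R, B_S))       # 33,000
```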
Lecture 5: Other Topics in Database Systems 1 (Jan 21)
- Cool things about databases.
- Incomplete Information
- Relational Calculus and Codd's Theorem
- Slides
Lecture 6: Other Topics in Database Systems 2 (Jan 26)
- Cool things about databases.
- Views, view updates, and view maintenance
- Recursion (a runnable SQLite example follows this list)
- Logical plans, physical plans, and the query optimizer
- Slides
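The recursion topic has a sketch you can run today with Python's built-in sqlite3 module: a recursive common table expression computing the transitive closure of a small, made-up edge relation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])

rows = con.execute("""
    WITH RECURSIVE reach(src, dst) AS (
        SELECT src, dst FROM edge
        UNION
        SELECT r.src, e.dst FROM reach r, edge e WHERE r.dst = e.src
    )
    SELECT * FROM reach ORDER BY src, dst
""").fetchall()
print(rows)  # every pair connected by one or more hops
```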
Part 2: Linear Algebra and CPU Optimizations
In Part 2, we shift to a new data model: tensors manipulated via linear algebra operators, with General Matrix Multiplication (GEMM) as our central case study. Our performance model also grows more complex in several ways: we now account for computation (not just data movement), the granularity shrinks from I/O pages to CPU cache lines, and the memory hierarchy deepens from a simple Disk↔DRAM model to DRAM↔L3↔L2↔L1↔Registers.
Learning Objectives:
- Model data as tensors (vectors, matrices, higher-order arrays) and express computations as linear algebra operations
- Understand the CPU memory hierarchy: registers, L1/L2/L3 caches, and main memory
- Construct performance models using arithmetic intensity and the roofline model to predict whether a workload is memory-bound or compute-bound (a sketch follows this list)
- Identify the key factors that limit performance—locality, pipeline stalls, and instruction-level parallelism—and apply classical optimization techniques to address them
- Use GEMM as a case study to connect these principles to practice: loop tiling, SIMD vectorization, register blocking, and more
- Hands-on Exercise: Implement optimized GEMM and measure performance against theoretical peak
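A minimal sketch of the roofline reasoning above, with made-up machine numbers; the arithmetic-intensity figures for naive and tiled GEMM are standard ballpark estimates, not measurements.

```python
PEAK_GFLOPS = 500.0  # hypothetical CPU peak
PEAK_GBS = 50.0      # hypothetical DRAM bandwidth, GB/s

def roofline_gflops(arithmetic_intensity: float) -> float:
    # attainable performance = min(compute roof, bandwidth * AI)
    return min(PEAK_GFLOPS, PEAK_GBS * arithmetic_intensity)

# Naive GEMM with no cache reuse: each multiply-add (2 FLOPs) loads one
# fp32 element of A and one of B (8 bytes), so AI is about 0.25 FLOPs/byte.
print(roofline_gflops(0.25))  # 12.5 GFLOPS: badly memory-bound

# Tiling with b x b blocks held in cache raises AI to roughly b/4 for fp32;
# b = 64 gives AI of about 16.
print(roofline_gflops(16.0))  # 500.0 GFLOPS: now compute-bound
```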
Materials
- CPU Microarchitecture: A really great video from Intel about microarchitecture.
Lecture 7: CPU Microarchitecture and Performance Model (Jan 28)
- CPU Microarchitecture and Performance Hazards
- Slides and Code
Lecture 8: Benchmarking and Instruction-Level Parallelism (Feb 2)
Lecture 9: SIMD (Feb 4)
Lecture 10: Locality, Caches, Blocking (Feb 9)
Lecture 11: GEMM (Feb 11)
Lecture 12: GEMM (Feb 16)
Hands-on Exercise 2: Optimizing GEMM on CPUs
In this exercise, we will implement increasingly optimized versions of matrix multiplication and measure their performance against the CPU’s theoretical peak:
- We are given square matrices of varying sizes (e.g., 128×128 to 2048×2048)
- Target machine: modern x86-64 CPU with known cache sizes and SIMD width
- We will implement naive, blocked, and vectorized GEMM variants, measuring cache misses, arithmetic intensity, and GFLOPS to validate our roofline predictions
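As a preview (the exercise itself targets C with SIMD intrinsics), here is a NumPy sketch of the naive-to-blocked progression; the block size of 64 is an arbitrary illustrative choice.

```python
import numpy as np

def gemm_naive(A, B):
    """Reference triple loop, written for clarity, not speed."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def gemm_blocked(A, B, bs=64):
    """Tiled version: each bs x bs block of A and B is reused while cached."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.random((128, 128), dtype=np.float32)
B = rng.random((128, 128), dtype=np.float32)
assert np.allclose(gemm_naive(A[:32, :32], B[:32, :32]),
                   A[:32, :32] @ B[:32, :32], atol=1e-3)
assert np.allclose(gemm_blocked(A, B), A @ B, atol=1e-3)
```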
Part 3: LLM Inference and GPU Optimizations
In Part 3, we turn to a data model where information is tokenized and knowledge emerges from next-token prediction, with neural networks as the computational operators. We also introduce a performance model at GPU-scale: massively parallel execution across thousands of threads.
Learning Objectives:
- Understand LLMs from a dataflow perspective
- Understand the GPU execution model—threads, warps, blocks, and grids—and its performance characteristics: global memory bandwidth, shared memory, occupancy, and latency hiding
- Revisit locality, arithmetic intensity, and the roofline model in the context of GPUs
- Understand the principles underpinning GPU performance optimizations
- Analyze and optimize key LLM inference operations
- Hands-on Exercise: Implement and optimize several GPU kernels with tilelang, comparing against cuBLAS
Lecture 13: Generative Data Model — Tokens and Transformers (Feb 18)
Lecture 14: Performance Model and Tile-Based Programming Model (Feb 23)
Lecture 15: Kernel: GEMM (Feb 25)
Lecture 16: Inference Systems — Beyond Kernels (March 2)
Hands-on Exercise 3: Optimizing GEMM on GPUs
In this exercise, we will implement matrix multiplication kernels on modern GPUs using tilelang:
- We are given square matrices of varying sizes on a modern NVIDIA GPU with known compute capability and memory bandwidth
- We will implement progressively optimized GEMM variants
- We will measure GFLOPS and memory throughput, comparing our implementations to cuBLAS and validating against the GPU roofline model
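A hedged sketch of that validation step, with hypothetical GPU specs standing in for the real datasheet numbers; the traffic estimate is the lower bound of reading A and B and writing C exactly once.

```python
PEAK_TFLOPS = 300.0  # hypothetical tensor-core peak, TFLOPS
PEAK_GBS = 2000.0    # hypothetical HBM bandwidth, GB/s

def gemm_roofline_tflops(n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n**3                      # n x n x n GEMM
    traffic = 3 * n * n * bytes_per_elem  # read A, read B, write C once each
    ai = flops / traffic                  # FLOPs per byte
    return min(PEAK_TFLOPS, PEAK_GBS * ai / 1000.0)

for n in (64, 256, 4096):
    print(n, round(gemm_roofline_tflops(n), 1))
# 64 -> 42.7, 256 -> 170.7, 4096 -> 300.0: small GEMMs are bandwidth-bound,
# large ones hit the compute roof.
```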
Part 4: Finale
Lecture 17: Data Systems — The Last Ten Years and the Next Ten (March 4)
In this closing lecture, I'd like to share some personal reflections, one perspective shaped by a decade of watching this field evolve.