Introduction to Data Systems and Data Design
Data systems underpin much of modern computer science and artificial intelligence. In this course, we explore data systems of varying complexity—from classical relational databases to modern generative AI infrastructure.
Despite their differences, we will discover how a few fundamental ideas govern their design and optimization: the decoupling of logical and physical data representation, common performance modeling principles that apply across vastly different architectures, and recurring optimization techniques—such as tiling and exploiting locality—that reappear at every level of the system stack.
Concretely, we examine three workloads, each offering a different answer to the question “what is data and how should we represent it?”—relational databases, linear algebra, and large language models. We also navigate three hardware regimes, each offering a different answer to the question “what is a computer and how can we model and optimize workloads running on it?”—starting from a pure block I/O model for relational operations, then moving to CPU cache hierarchies and SIMD, and finally to GPU-scale parallelism.
The goal of this course is to equip students with a principled way of reasoning about data, data systems, and system performance—and to ground this reasoning in three case studies that, despite their surface differences, all benefit from the same underlying principles.
- Where: Harper Memorial Library 140
- When: Mon/Wed, 4:30-5:50 PM
Note
The structure below is tentative. It will evolve as we teach and learn together—consider it a living document that we’ll refine based on our pace and interests.
Part 1: Relational Databases and I/O Optimizations
In Part 1, we introduce the first data model of the course—the relational database—and develop a pure block I/O-based performance model. We then apply this model to analyze and optimize fundamental relational operators.
Learning Objectives:
- Model data using the relational model and manipulate it using relational algebra, understanding its foundations in first-order logic and model theory
- Construct and apply a simplified system performance model based on block I/O counts
- Analyze and optimize relational operators (scans, joins) under this I/O cost model
- Hands-on Exercise: Implement these concepts within SQLite’s page-based storage structure
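As a warm-up for this exercise, here is a minimal sketch that decodes a few fields of SQLite's 100-byte file header using only Python's standard library. The example.db file name is just for illustration; the script creates a throwaway database so it is self-contained.

```python
import sqlite3
import struct

# Create a tiny database so the script has a file to inspect.
con = sqlite3.connect("example.db")
con.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, name TEXT)")
con.commit()
con.close()

with open("example.db", "rb") as f:
    header = f.read(100)  # the SQLite file header is the first 100 bytes

magic = header[:16]                                 # b"SQLite format 3\x00"
page_size = struct.unpack(">H", header[16:18])[0]   # big-endian u16; the value 1 means 65536
page_count = struct.unpack(">I", header[28:32])[0]  # database size in pages (valid in files written by modern SQLite)

print(magic, page_size, page_count)
```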
Lecture 1: The Relational Model - A Logical Data Model
- Data models
- Relational Model: Relations, tuples, and attributes
- Relational algebra: selection, projection, join, set operations
- Relational semiring, and how relational data processing is tied to matrix multiplication (a toy sketch follows this list)
- Slides
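As a taste of the semiring connection, here is a toy sketch (not the course's implementation): encode two binary relations over a small hypothetical domain as 0/1 matrices, and their composition (join on the shared attribute, then project it away) becomes matrix multiplication.

```python
n = 4                         # hypothetical domain {0, 1, 2, 3}
R = {(0, 1), (1, 2), (2, 3)}  # relation R(a, b)
S = {(1, 0), (2, 0), (3, 1)}  # relation S(b, c)

# Encode each relation as an n-by-n 0/1 matrix.
MR = [[int((i, j) in R) for j in range(n)] for i in range(n)]
MS = [[int((i, j) in S) for j in range(n)] for i in range(n)]

# Matrix product over the counting semiring (+, *): entry [a][c] counts the
# b's with R(a, b) and S(b, c). Swapping (+, *) for (or, and) gives the
# boolean semiring, i.e., plain relational composition.
MT = [[sum(MR[i][k] * MS[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

composed = {(i, j) for i in range(n) for j in range(n) if MT[i][j] > 0}
print(composed)  # {(0, 0), (1, 0), (2, 1)} == project_{a,c}(R join S)
```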
Lecture 2: Physical Data Layout and the Block I/O Model
- From logical relations to physical storage: bridging the abstraction gap
- Layout algebra
- Data layout on disk: pages, slots, and record formats
- SQLite’s file format as a concrete example
- The storage/memory hierarchy and the I/O bottleneck
- Modeling cost as block (page) transfers (sketched in code below)
- Slides
- layout.zip – Play with layout algebra in CuTe without a GPU (works on a MacBook).
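To preview how spare the pure block-I/O model is, here is a minimal sketch with hypothetical sizes; the only cost it tracks is the number of page transfers.

```python
import math

def pages(num_tuples: int, tuple_bytes: int, page_bytes: int = 4096) -> int:
    """Pages needed to store a relation, assuming tuples never span pages."""
    tuples_per_page = page_bytes // tuple_bytes
    return math.ceil(num_tuples / tuples_per_page)

# Example: a 1M-tuple relation with 100-byte tuples on 4 KiB pages.
B_R = pages(1_000_000, 100)
print(B_R)               # 25000 pages
print(B_R * 4096 / 1e6)  # ~102 MB transferred by one full scan
```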
Lecture 3: Optimizing Select
- Full table scans and their I/O cost
- Index scans: B+ trees and their I/O characteristics
- Clustered vs. unclustered indexes
- Selectivity estimation and its impact on scan choice (a cost comparison is sketched below)
- Slides
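Here is the cost comparison as back-of-the-envelope Python; every number (page count, tuple count, tree height) is hypothetical, but the crossover is the point: past some selectivity, an unclustered index scan loses to a plain full scan.

```python
import math

def full_scan_cost(B: int) -> int:
    return B  # read every page exactly once

def index_scan_cost(height: int, matches: int, clustered: bool,
                    B: int, num_tuples: int) -> int:
    if clustered:
        # matching tuples are packed into contiguous pages
        return height + math.ceil(matches / num_tuples * B)
    return height + matches  # worst case: one page I/O per matching tuple

B, N, H = 25_000, 1_000_000, 3  # pages, tuples, B+ tree height
for sel in (0.001, 0.01, 0.1):
    k = int(sel * N)
    print(sel, full_scan_cost(B),
          index_scan_cost(H, k, clustered=False, B=B, num_tuples=N))
# At selectivity 0.1 the unclustered index scan costs 100003 page I/Os,
# four times the 25000-page full scan.
```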
Lecture 4: Optimizing Joins
- Hash tables as a building block
- Nested loop join and its I/O cost analysis
- Block nested loop join optimization
- Sort-merge join: leveraging sorted runs
- Hash join: partitioning for I/O efficiency
- Choosing join algorithms based on data characteristics (cost arithmetic sketched below)
- Slides
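And the join cost arithmetic under the same model; the formulas below are the textbook ones, with hypothetical page counts B(R), B(S) and buffer size M.

```python
import math

def nested_loop(B_R: int, B_S: int) -> int:
    # one full inner scan per outer page (tuple-at-a-time is even worse)
    return B_R + B_R * B_S

def block_nested_loop(B_R: int, B_S: int, M: int) -> int:
    # read the outer in chunks of M-2 pages; scan the inner once per chunk
    return B_R + math.ceil(B_R / (M - 2)) * B_S

def grace_hash_join(B_R: int, B_S: int) -> int:
    # partition both relations (read + write), then probe (read)
    return 3 * (B_R + B_S)

B_R, B_S, M = 1_000, 10_000, 102  # hypothetical sizes
print(nested_loop(B_R, B_S))           # 10,001,000 page I/Os
print(block_nested_loop(B_R, B_S, M))  # 101,000
print(grace_hash_join(B_R, B_S))       # 33,000
```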
Lecture 5: Other Topics in Database Systems 1 (Jan 21)
- Cool things about databases.
- Incomplete Information
- Relational Calculus and Codd's Theorem
- Slides
Lecture 6: Other Topics in Database Systems 2 (Jan 26)
- Cool things about databases.
- Views, view updates, and view maintenance
- Recursion (a runnable SQLite example follows this list)
- Logical plans, physical plans, and the query optimizer
- Slides
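The recursion topic has a sketch you can run today with Python's built-in sqlite3 module: a recursive common table expression computing the transitive closure of a small, made-up edge relation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])

rows = con.execute("""
    WITH RECURSIVE reach(src, dst) AS (
        SELECT src, dst FROM edge
        UNION
        SELECT r.src, e.dst FROM reach r, edge e WHERE r.dst = e.src
    )
    SELECT * FROM reach ORDER BY src, dst
""").fetchall()
print(rows)  # every pair connected by one or more hops
```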
Part 2: Linear Algebra and CPU Optimizations
In Part 2, we shift to a new data model: tensors manipulated via linear algebra operators, with General Matrix Multiplication (GEMM) as our central case study. Our performance model also grows more complex in several ways: we now account for computation (not just data movement), the granularity shrinks from I/O pages to CPU cache lines, and the memory hierarchy deepens from a simple Disk↔DRAM model to DRAM↔L3↔L2↔L1↔Registers.
Learning Objectives:
- Model data as tensors (vectors, matrices, higher-order arrays) and express computations as linear algebra operations
- Understand the CPU memory hierarchy: registers, L1/L2/L3 caches, and main memory
- Construct performance models using arithmetic intensity and the roofline model to predict whether a workload is memory-bound or compute-bound (a sketch follows this list)
- Identify the key factors that limit performance—locality, pipeline stalls, and instruction-level parallelism—and apply classical optimization techniques to address them
- Use GEMM as a case study to connect these principles to practice: loop tiling, SIMD vectorization, register blocking, and more
- Hands-on Exercise: Implement optimized GEMM and measure performance against theoretical peak
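A minimal sketch of the roofline reasoning above, with made-up machine numbers; the arithmetic-intensity figures for naive and tiled GEMM are standard ballpark estimates, not measurements.

```python
PEAK_GFLOPS = 500.0  # hypothetical CPU peak
PEAK_GBS = 50.0      # hypothetical DRAM bandwidth, GB/s

def roofline_gflops(arithmetic_intensity: float) -> float:
    # attainable performance = min(compute roof, bandwidth * AI)
    return min(PEAK_GFLOPS, PEAK_GBS * arithmetic_intensity)

# Naive GEMM with no cache reuse: each multiply-add (2 FLOPs) loads one
# fp32 element of A and one of B (8 bytes), so AI is about 0.25 FLOPs/byte.
print(roofline_gflops(0.25))  # 12.5 GFLOPS: badly memory-bound

# Tiling with b x b blocks held in cache raises AI to roughly b/4 for fp32;
# b = 64 gives AI of about 16.
print(roofline_gflops(16.0))  # 500.0 GFLOPS: now compute-bound
```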
Materials
- CPU Microarchitecture: A really great video from Intel about microarchitecture.
Lecture 7: CPU Microarchitecture and Performance Model (Jan 28)
- CPU Microarchitecture and Performance Hazards
- Slides and Code
Lecture 8: Benchmarking and Instruction-Level Parallelism (Feb 2)
Lecture 9: SIMD (Feb 4)
Lecture 10: Locality, Caches, Blocking (Feb 9)
Lecture 11: GEMM (Feb 11)
Lecture 12: GEMM (Feb 16)
Hands-on Exercise 2: Optimizing GEMM on CPUs
In this exercise, we will implement increasingly optimized versions of matrix multiplication and measure their performance against the CPU’s theoretical peak:
- We are given square matrices of varying sizes (e.g., 128×128 to 2048×2048)
- Target machine: modern x86-64 CPU with known cache sizes and SIMD width
- We will implement naive, blocked, and vectorized GEMM variants, measuring cache misses, arithmetic intensity, and GFLOPS to validate our roofline predictions
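As a preview (the exercise itself targets C with SIMD intrinsics), here is a NumPy sketch of the naive-to-blocked progression; the block size of 64 is an arbitrary illustrative choice.

```python
import numpy as np

def gemm_naive(A, B):
    """Reference triple loop, written for clarity, not speed."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def gemm_blocked(A, B, bs=64):
    """Tiled version: each bs x bs block of A and B is reused while cached."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.random((128, 128), dtype=np.float32)
B = rng.random((128, 128), dtype=np.float32)
assert np.allclose(gemm_naive(A[:32, :32], B[:32, :32]),
                   A[:32, :32] @ B[:32, :32], atol=1e-3)
assert np.allclose(gemm_blocked(A, B), A @ B, atol=1e-3)
```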
Part 3: LLM Inference and GPU Optimizations
In Part 3, we turn to a data model where information is tokenized and knowledge emerges from next-token prediction, with neural networks as the computational operators. We also introduce a performance model at GPU-scale: massively parallel execution across thousands of threads.
Learning Objectives:
- Understand LLMs from a dataflow perspective
- Understand the GPU execution model—threads, warps, blocks, and grids—and its performance characteristics: global memory bandwidth, shared memory, occupancy, and latency hiding
- Revisit locality, arithmetic intensity, and the roofline model in the context of GPUs
- Understand the principles underpinning GPU performance optimizations
- Analyze and optimize key LLM inference operations
- Hands-on Exercise: Implement and optimize several GPU kernels with tilelang, comparing against cuBLAS
Lecture 13: Generative Data Model — Tokens and Transformers (Feb 18)
Lecture 14: Performance Model and Tile-Based Programming Model (Feb 23)
Lecture 15: Kernel: GEMM (Feb 25)
Lecture 16: Inference Systems — Beyond Kernels (March 2)
Hands-on Exercise 3: Optimizing GEMM on GPUs
In this exercise, we will implement matrix multiplication kernels on modern GPUs using tilelang:
- We are given square matrices of varying sizes on a modern NVIDIA GPU with known compute capability and memory bandwidth
- We will implement progressively optimized GEMM variants
- We will measure GFLOPS and memory throughput, comparing our implementations to cuBLAS and validating against the GPU roofline model
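A hedged sketch of that validation step, with hypothetical GPU specs standing in for the real datasheet numbers; the traffic estimate is the lower bound of reading A and B and writing C exactly once.

```python
PEAK_TFLOPS = 300.0  # hypothetical tensor-core peak, TFLOPS
PEAK_GBS = 2000.0    # hypothetical HBM bandwidth, GB/s

def gemm_roofline_tflops(n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n**3                      # n x n x n GEMM
    traffic = 3 * n * n * bytes_per_elem  # read A, read B, write C once each
    ai = flops / traffic                  # FLOPs per byte
    return min(PEAK_TFLOPS, PEAK_GBS * ai / 1000.0)

for n in (64, 256, 4096):
    print(n, round(gemm_roofline_tflops(n), 1))
# 64 -> 42.7, 256 -> 170.7, 4096 -> 300.0: small GEMMs are bandwidth-bound,
# large ones hit the compute roof.
```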
Part 4: Finale
Lecture 17: Data Systems — The Last Ten Years and the Next Ten (March 4)
In this closing lecture, I'd like to share some personal reflections, one perspective shaped by a decade of watching this field evolve.