
Introduction to Data Systems and Data Design

Data systems underpin much of modern computer science and artificial intelligence. In this course, we explore data systems of varying complexity—from classical relational databases to modern generative AI infrastructure.

Despite their differences, we will discover how a few fundamental ideas govern their design and optimization: the decoupling of logical and physical data representation, common performance modeling principles that apply across vastly different architectures, and recurring optimization techniques—such as tiling and exploiting locality—that reappear at every level of the system stack.

Concretely, we examine three workloads, each offering a different answer to the question “what is data and how should we represent it?”—relational databases, linear algebra, and large language models. We also navigate three hardware regimes, each offering a different answer to the question “what is a computer and how can we model and optimize workloads running on it?”—starting from a pure block I/O model for relational operations, then moving to CPU cache hierarchies and SIMD, and finally to GPU-scale parallelism.

The goal of this course is to equip students with a principled way of reasoning about data, data systems, and system performance—and to ground this reasoning in three case studies that, despite their surface differences, all benefit from the same underlying principles.

  • Where: Harper Memorial Library 140
  • When: Mon/Wed 4:30-5:50 PM

Note

The structure below is tentative. It will evolve as we teach and learn together—consider it a living document that we’ll refine based on our pace and interests.

Part 1: Relational Databases and I/O Optimizations

In Part 1, we introduce the first data model of the course—the relational database—and develop a pure block I/O-based performance model. We then apply this model to analyze and optimize fundamental relational operators.

Learning Objectives:

  • Model data using the relational model and manipulate it using relational algebra, understanding its foundations in first-order logic and model theory
  • Construct and apply a simplified system performance model based on block I/O counts
  • Analyze and optimize relational operators (scans, joins) under this I/O cost model
  • Hands-on Exercise: Implement these concepts within SQLite’s page-based storage structure

Lecture 1: The Relational Model - A Logical Data Model

  • Relations, tuples, and attributes
  • Relational algebra: selection, projection, join, set operations
  • Connection to first-order logic and finite model theory
  • SQL as a practical realization of relational algebra
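To make the correspondence concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the schema and data are invented for illustration. The SELECT list plays the role of projection, the WHERE clause of selection, and JOIN of the join:

    import sqlite3

    # Build a toy instance of two relations in an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, year INTEGER);
        CREATE TABLE enrolled (sid INTEGER, course TEXT);
        INSERT INTO student VALUES (1, 'Ada', 2), (2, 'Alan', 3);
        INSERT INTO enrolled VALUES (1, 'Data Systems'), (2, 'Logic');
    """)

    # pi_{name, course}( sigma_{year >= 3}( student JOIN enrolled ) )
    rows = conn.execute("""
        SELECT s.name, e.course                -- projection (pi)
        FROM student s JOIN enrolled e         -- join
          ON s.sid = e.sid
        WHERE s.year >= 3                      -- selection (sigma)
    """).fetchall()
    print(rows)  # [('Alan', 'Logic')]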

Lecture 2: Physical Data Layout and the Block I/O Model

  • From logical relations to physical storage: bridging the abstraction gap
  • Data layout on disk: pages, slots, and record formats
  • SQLite’s file format as a concrete example (see the header-parsing sketch after this list)
  • The storage/memory hierarchy and the I/O bottleneck
  • Modeling cost as block (page) transfers
  • The buffer manager: caching pages in memory, replacement policies, and pinning
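As a preview of the hands-on exercise, the sketch below reads the documented 100-byte database header and the b-tree page header of page 1 using Python’s struct module; the file name example.db is a placeholder:

    import struct

    # A minimal sketch of SQLite's documented file format
    # (https://www.sqlite.org/fileformat.html).
    with open("example.db", "rb") as f:
        header = f.read(100)                                # 100-byte database header
        assert header[:16] == b"SQLite format 3\x00"        # magic string
        (page_size,) = struct.unpack(">H", header[16:18])   # big-endian u16
        page_size = 65536 if page_size == 1 else page_size  # the value 1 encodes 64 KiB
        (n_pages,) = struct.unpack(">I", header[28:32])     # database size in pages

        # Page 1 holds a b-tree page header right after the file header.
        btree = f.read(8)
        page_type = btree[0]                          # 0x0d = leaf table b-tree page
        (n_cells,) = struct.unpack(">H", btree[3:5])  # number of cells on the page
        print(page_size, n_pages, page_type, n_cells)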

Lecture 3: Optimizing Select

  • Full table scans and their I/O cost
  • Index scans: B+ trees and their I/O characteristics
  • Clustered vs. unclustered indexes
  • Selectivity estimation and its impact on scan choice
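A back-of-the-envelope calculation already shows how selectivity drives the scan choice; the formulas below are the usual textbook approximations, and the table sizes and tree height are assumed numbers:

    # I/O cost estimates for the access paths above (a sketch).
    def scan_costs(n_tuples, pages, selectivity, btree_height=3):
        full_scan = pages                                    # read every page once
        clustered = btree_height + selectivity * pages       # matches are contiguous
        unclustered = btree_height + selectivity * n_tuples  # worst case: one I/O per match
        return full_scan, clustered, unclustered

    # With 1M tuples on 10k pages, the unclustered index beats a full scan
    # only at very small selectivities.
    for sel in (0.0001, 0.001, 0.01, 0.1):
        print(sel, scan_costs(1_000_000, 10_000, sel))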

Lecture 4: Optimizing Joins

  • Nested loop join and its I/O cost analysis
  • Block nested loop join optimization
  • Sort-merge join: leveraging sorted runs
  • Hash join: partitioning for I/O efficiency
  • Choosing join algorithms based on data characteristics
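The standard I/O cost formulas make these trade-offs concrete; the sketch below encodes them in Python, where B_r and B_s are the page counts of the two inputs and M is the number of buffer pages (the specific numbers are illustrative):

    import math

    def block_nested_loop(B_r, B_s, M):
        # Read R once; for each (M-2)-page chunk of R, scan all of S.
        return B_r + math.ceil(B_r / (M - 2)) * B_s

    def sort_merge(B_r, B_s, M):
        def sort_cost(B):  # run generation plus merge passes, 2 I/Os per page per pass
            passes = 1 + math.ceil(math.log(math.ceil(B / M), M - 1)) if B > M else 1
            return 2 * B * passes
        return sort_cost(B_r) + sort_cost(B_s) + B_r + B_s  # plus the final merge

    def grace_hash(B_r, B_s, M):
        # One partitioning pass over both inputs, then one matching pass;
        # valid roughly when the smaller input fits in (M-1)^2 pages.
        return 3 * (B_r + B_s)

    for fn in (block_nested_loop, sort_merge, grace_hash):
        print(fn.__name__, fn(1_000, 10_000, M=100))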

Hands-on Exercise 1: Joins on SQLite Pages

In this exercise, we will implement various join algorithms by directly manipulating SQLite’s page format, bypassing the SQL layer entirely. This hands-on experience reinforces the I/O cost model and gives us concrete insight into how database systems manage data at the storage level:

  • We are given a SQLite database file containing two tables stored across multiple 4KB pages
  • We will read and parse pages directly using SQLite’s documented file format
  • We will implement the join algorithms from Lecture 4, comparing their I/O costs as a function of buffer size and seeing how our theoretical cost model is reflected in real-world performance
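To fix ideas, here is a sketch of what the exercise builds toward, with pages modeled as plain Python lists of tuples rather than raw SQLite bytes (names such as bnl_join and buffer_pages are illustrative):

    # Block nested loop join over pre-parsed "pages"; buffer_pages models
    # the memory budget from the I/O cost analysis in Lecture 4.
    def bnl_join(outer_pages, inner_pages, key_outer, key_inner, buffer_pages):
        chunk = buffer_pages - 2  # reserve one page for inner input, one for output
        for i in range(0, len(outer_pages), chunk):
            build = {}  # "load" a chunk of outer pages, indexed by join key
            for page in outer_pages[i:i + chunk]:
                for row in page:
                    build.setdefault(row[key_outer], []).append(row)
            # One scan of the inner input per chunk: this is the
            # ceil(B_outer / (M-2)) * B_inner term in the cost formula.
            for page in inner_pages:
                for row in page:
                    for match in build.get(row[key_inner], ()):
                        yield match + row

    pages_r = [[(1, "a"), (2, "b")], [(3, "c")]]  # toy two-tuple "pages"
    pages_s = [[("x", 1)], [("y", 3)]]
    print(list(bnl_join(pages_r, pages_s, 0, 1, buffer_pages=3)))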

Part 2: Linear Algebra and CPU Optimizations

In Part 2, we shift to a new data model: tensors manipulated via linear algebra operators, with General Matrix Multiplication (GEMM) as our central case study. Our performance model also grows more complex in several ways: we now account for computation (not just data movement), the granularity shrinks from I/O pages to CPU cache lines, and the memory hierarchy deepens from a simple Disk↔DRAM model to DRAM↔L3↔L2↔L1↔Registers.

Learning Objectives:

  • Model data as tensors (vectors, matrices, higher-order arrays) and express computations as linear algebra operations
  • Understand the CPU memory hierarchy: registers, L1/L2/L3 caches, and main memory
  • Construct performance models using arithmetic intensity and the roofline model to predict whether a workload is memory-bound or compute-bound (a worked example follows this list)
  • Identify the key factors that limit performance—locality, pipeline stalls, and instruction-level parallelism—and apply classical optimization techniques to address them
  • Use GEMM as a case study to connect these principles to practice: loop tiling, SIMD vectorization, register blocking, and more
  • Hands-on Exercise: Implement optimized GEMM and measure performance against theoretical peak
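To preview the roofline reasoning from the objectives above, here is a worked example in Python; the peak and bandwidth figures are assumed machine parameters, not measurements of any particular CPU:

    peak_gflops = 200.0    # assumed compute peak (cores x SIMD width x 2 x GHz)
    bandwidth_gbs = 40.0   # assumed DRAM bandwidth
    ridge = peak_gflops / bandwidth_gbs  # flops/byte needed to be compute-bound

    def attainable(intensity):  # the roofline itself
        return min(peak_gflops, bandwidth_gbs * intensity)

    # GEMM on n x n float32 matrices: 2n^3 flops over 3n^2 x 4 bytes if each
    # matrix crosses DRAM exactly once, so ideal intensity grows as n/6.
    print("ridge:", ridge)
    for n in (64, 256, 1024):
        intensity = (2 * n**3) / (3 * n**2 * 4)
        print(n, round(intensity, 1), attainable(intensity))

Even for modest matrices, the ideal intensity sits to the right of the ridge point; a naive triple loop falls far short because it streams each operand through DRAM many times, which is precisely what tiling fixes.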

Lecture 5: Data Model — Tensors and Linear Algebra

Lecture 6: Performance Model — CPU Caches, Arithmetic Intensity, and Roofline

Lecture 7: GEMM 1

Lecture 8: GEMM 2

Lecture 9: GEMM 3

Hands-on Exercise 2: Optimizing GEMM on CPUs

In this exercise, we will implement increasingly optimized versions of matrix multiplication and measure their performance against the CPU’s theoretical peak:

  • We are given square matrices of varying sizes (e.g., 128×128 to 2048×2048)
  • Target machine: modern x86-64 CPU with known cache sizes and SIMD width
  • We will implement naive, blocked, and vectorized GEMM variants, measuring cache misses, arithmetic intensity, and GFLOPS to validate our roofline predictions
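As a structural preview (not the exercise solution), the sketch below shows the tiling decomposition with numpy’s matmul standing in for the hand-written inner kernel; the block size of 64 is an assumption to be tuned against the actual cache sizes:

    import numpy as np

    def blocked_gemm(A, B, block=64):
        # Assumes square matrices with n a multiple of block, for brevity.
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i in range(0, n, block):          # a block of C stays hot in cache
            for j in range(0, n, block):
                for k in range(0, n, block):  # accumulate along the k dimension
                    C[i:i+block, j:j+block] += (
                        A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                    )
        return C

    A = np.random.rand(256, 256).astype(np.float32)
    B = np.random.rand(256, 256).astype(np.float32)
    assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-2)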

Part 3: LLM Inference and GPU Optimizations

In Part 3, we turn to a data model where information is tokenized and knowledge emerges from next-token prediction, with neural networks as the computational operators. We also introduce a performance model at GPU scale: massively parallel execution across thousands of threads.

Learning Objectives:

  • Understand LLMs from a dataflow perspective
  • Understand the GPU execution model—threads, warps, blocks, and grids—and its performance characteristics: global memory bandwidth, shared memory, occupancy, and latency hiding
  • Revisit locality, arithmetic intensity, and the roofline model in the context of GPUs (a worked example follows this list)
  • Understand the principles underpinning GPU performance optimizations
  • Analyze and optimize key LLM inference operations
  • Hands-on Exercise: Implement and optimize several GPU kernels with tilelang, comparing against cuBLAS
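As a preview of the roofline revisit above, the sketch below redoes the arithmetic with GPU-scale numbers; the figures are rough public specs for an A100-class part, used only to make the contrast with CPUs concrete:

    peak_tflops = 19.5   # assumed fp32 compute peak, Tflop/s
    bandwidth_tbs = 1.5  # assumed HBM bandwidth, TB/s
    ridge = peak_tflops / bandwidth_tbs  # ~13 flops per byte

    # Decode-phase LLM inference is dominated by GEMV-like operators: each
    # fp16 weight (2 bytes) contributes 2 flops per sequence in the batch,
    # so intensity is roughly the batch size in flops/byte.
    for batch in (1, 8, 64):
        intensity = batch
        print(batch, min(peak_tflops, bandwidth_tbs * intensity))

At batch size 1 the workload sits far to the left of the ridge and is memory-bound, which is why batching and memory traffic loom so large in inference systems.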

Lecture 11: Generative Data Model — Tokens and Transformers

Lecture 12: Performance Model 1

Lecture 13: Performance Model 2

Lecture 14: Tile-Based Programming Model

Lecture 15: Kernel: GEMM

Lecture 16: Kernel: Attention — Beyond GEMM

Lecture 17: Inference Systems — Beyond Kernels

Hands-on Exercise 3: Optimizing GEMM on GPUs

In this exercise, we will implement matrix multiplication kernels on modern GPUs using tilelang:

  • We are given square matrices of varying sizes on a modern NVIDIA GPU with known compute capability and memory bandwidth
  • We will implement progressively optimized GEMM variants
  • We will measure GFLOPS and memory throughput, comparing our implementations to cuBLAS and validating against the GPU roofline model
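A possible measurement harness, assuming CuPy is available (its float32 matmul dispatches to cuBLAS, which provides the baseline to compare against):

    import cupy as cp

    def bench_matmul(n, iters=10):
        a = cp.random.rand(n, n, dtype=cp.float32)
        b = cp.random.rand(n, n, dtype=cp.float32)
        _ = a @ b                          # warm-up
        cp.cuda.Stream.null.synchronize()
        start, end = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        for _ in range(iters):
            c = a @ b
        end.record()
        end.synchronize()
        ms = cp.cuda.get_elapsed_time(start, end) / iters
        return 2 * n**3 / (ms * 1e-3) / 1e9  # GFLOP/s

    for n in (1024, 2048, 4096):
        print(n, round(bench_matmul(n), 1))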

Part 4: Finale

Lecture 18: Data Systems — The Last Ten Years and the Next Ten

In this closing lecture, I’d like to share some personal reflections: one perspective shaped by a decade of watching this field evolve.