

Today’s machine learning and artificial intelligence are limited by their costs. Training a model can easily cost millions of dollars, while acquiring, cleaning, and improving the underlying datasets to ensure model quality, fairness, and robustness is no cheaper. These costs come from different, closely coupled sources: (1) the staggering amount of computation, storage, and communication that these models require; (2) the cost of infrastructure ownership in today’s centralized cloud; (3) the cost of data acquisition, cleaning, and debugging, and the associated human effort; (4) the cost of regulatory compliance; and (5) the cost of operational deployment, such as monitoring, continuous testing, and continuous adaptation.

The key belief behind my research is that we must bring these costs down, by orders of magnitude, on all of these fronts to bring ML/AI into a trustworthy and democratized future. To achieve this goal, our research focuses on building machine learning systems enabled by novel algorithms, theory, and system optimizations and abstractions. Our research falls into two directions.

Project Zip.ML: Distributed and Decentralized Learning at Scale

Bagua: Distributed learning with system relaxations: decentralization, asynchronization, and compression
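To illustrate one of the three relaxations named above, gradient compression lets each worker send a sparse summary of its gradient instead of the full dense tensor, trading exact communication for bandwidth. Below is a minimal top-k sparsification sketch; the function names and shapes are illustrative assumptions, not Bagua's actual API.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of a gradient.

    Returns (indices, values): the sparse message a worker would
    transmit in place of the dense gradient.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    """Rebuild a dense gradient from the sparse message; dropped
    entries are treated as zero."""
    dense = np.zeros(size)
    dense[idx] = vals
    return dense

# Toy example: a 5-dimensional gradient compressed to its 2 largest entries.
grad = np.array([0.1, -3.0, 0.02, 2.5, -0.4])
idx, vals = topk_compress(grad, k=2)
restored = topk_decompress(idx, vals, grad.size)
# Only the two largest-magnitude entries (-3.0 and 2.5) survive;
# the rest are zeroed out on the receiving side.
```

In practice, systems that use this relaxation typically accumulate the dropped residual locally and add it back into the next step's gradient, which is what keeps convergence close to that of exact communication.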

Persia: Deep recommendation models at the 100-trillion-parameter scale

OpenGauss DB4AI: In-database machine learning with deep physical integration

LambdaML: Distributed machine learning over serverless infrastructure

Project Ease.ML: Data-centric ML DevOps

Ease.ML Family: End-to-end lifecycle management for ML DevOps, rethinking data quality for machine learning