A list of All Publications (Chronological Order) can be found here.
1. Decentralized and Distributed Learning at Scale
Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang. Decentralized Training of Foundation Models in Heterogeneous Environments. NeurIPS 2022. (Oral Presentation, 186/9600 = 1.9% of submissions)
Training large language models over decentralized, geo-distributed devices (500Mbps bandwidth, 100ms latency). It is possible if we schedule carefully, taking heterogeneous network conditions into account!
Jue WANG, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang. Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees. NeurIPS 2022.
Compressing forward activations in pipeline parallelism requires careful thinking --- if done carelessly, it introduces bias in the gradient and hurts convergence. This paper provides, to the best of our knowledge, one of the first compression schemes that leads to unbiased convergence and enables decentralized training of large language models over slow networks (e.g., <300Mbps). The secret is to compress the difference!
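The "compress the difference" idea can be illustrated with a small sketch. This is not the paper's algorithm: the uniform quantizer, the per-sample buffer keyed by a hypothetical `sample_id`, and the synthetic activations that settle over epochs are all assumptions made for illustration. It only shows the intuition that, once activations change little between visits to the same sample, quantizing the change loses far less than quantizing the activation itself.

```python
import numpy as np

def quantize(x, num_bits=4):
    """Uniform quantizer over the range of x (a stand-in compressor, assumed here)."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    levels = 2 ** num_bits - 1
    codes = np.round((x - lo) / (hi - lo) * levels)
    return lo + codes / levels * (hi - lo)

class DeltaCompressedChannel:
    """Keeps one buffer per sample; after the first visit only quantized deltas are applied.

    In a real pipeline the sender and receiver would each hold an identical copy
    of the buffer; a single copy is enough for this sketch.
    """

    def __init__(self, num_bits=4):
        self.num_bits = num_bits
        self.buffers = {}  # sample_id -> last reconstructed activation

    def send(self, sample_id, activation):
        if sample_id not in self.buffers:
            # First visit: transmit the activation uncompressed.
            self.buffers[sample_id] = activation.copy()
        else:
            # Later visits: transmit only the quantized change since the last visit.
            delta = activation - self.buffers[sample_id]
            self.buffers[sample_id] = self.buffers[sample_id] + quantize(delta, self.num_bits)
        return self.buffers[sample_id]

def demo(epochs=10, dim=64, seed=0):
    """Compare reconstruction error of delta compression vs. direct quantization."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)   # where the activation settles (synthetic)
    drift = rng.normal(size=dim)    # transient part that halves every epoch (synthetic)
    channel = DeltaCompressedChannel(num_bits=4)
    for epoch in range(epochs + 1):
        activation = target + 0.5 ** epoch * drift
        received = channel.send(sample_id=0, activation=activation)
    delta_err = float(np.abs(received - activation).max())
    direct_err = float(np.abs(quantize(activation, 4) - activation).max())
    return delta_err, direct_err
```

Calling `demo()` returns the worst-case reconstruction error of both approaches on the final epoch; in this synthetic setting the delta-based error is much smaller because the quantizer's range shrinks along with the change.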
Shaoduo Gan, Xiangru Lian, Rui Wang, Jianbin Chang, Chengjun Liu, Hongmei Shi, Shengzhuo Zhang, Xianghong Li, Tengxu Sun, Jiawei Jiang, Binhang Yuan, Sen Yang, Ji Liu, Ce Zhang. BAGUA: Scaling up Distributed Learning with System Relaxations. VLDB 2022.
A unified distributed learning framework that supports a diverse range of communication-efficient algorithms taking advantage of (1) asynchrony, (2) decentralization, (3) communication compression, and (4) their combinations. The key is a declarative optimization framework that manages communication and computation. Change one line of code to use BaguaStrategy in PyTorch Lightning and speed up your training over Horovod and BytePS, even in data center networks; expect much larger speedups in slow networks!
Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang. In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle. SIGMOD 2022.
Optimizing training over large datasets stored on disk, usable both as a TensorFlow FileScanner and for in-database ML. On the ML side, a novel first-order method that does not require a full data shuffle (with convergence guarantees) and is thus I/O efficient. On the system side, a novel in-database ML framework that integrates ML physically into the DB. Check out this algorithm in OpenGauss and expect up to two orders of magnitude speedup over MADlib. Also available as a new TensorFlow FileScanner.
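One way to avoid a full shuffle, sketched below, is a two-level scheme: shuffle the order in which contiguous blocks are read, then shuffle tuples only inside a small in-memory buffer. The function name `two_level_shuffle_order` and its parameters are hypothetical, and the sketch works on tuple indices rather than real storage; it is an illustration of the idea, not the system's implementation.

```python
import random

def two_level_shuffle_order(num_tuples, block_size, buffer_blocks, seed=0):
    """Yield tuple indices in a partially shuffled order without a full data shuffle."""
    rng = random.Random(seed)
    # Each block is a contiguous run of tuples, so reading one block is sequential I/O.
    blocks = [
        list(range(start, min(start + block_size, num_tuples)))
        for start in range(0, num_tuples, block_size)
    ]
    rng.shuffle(blocks)  # block-level shuffle: randomize the order of block reads
    for i in range(0, len(blocks), buffer_blocks):
        # Fill the buffer with a few blocks, then shuffle tuples inside the buffer.
        buffer = [t for block in blocks[i:i + buffer_blocks] for t in block]
        rng.shuffle(buffer)
        yield from buffer

# Every tuple is visited exactly once per epoch, in a randomized order.
order = list(two_level_shuffle_order(num_tuples=103, block_size=10, buffer_blocks=3))
```

An SGD loop would consume `order` one index at a time; the buffer size trades memory for how close the visiting order gets to a full shuffle.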
Jiawei Jiang, Shaoduo Gan, Yue Liu, Fanlin Wang, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, Ce Zhang. Towards Demystifying Serverless Machine Learning Training. SIGMOD 2021.
A systematic depiction of the tradeoff space of training machine learning models over serverless infrastructure. When does serverless make sense for ML training? How should serverless infrastructure evolve to better support ML?
X Lian, Ce Zhang, H Zhang, CJ Hsieh, W Zhang, J Liu. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. NIPS 2017. (Oral Presentation)
A gossip-style algorithm for decentralized learning --- instead of all workers exchanging information with everyone, information is propagated by each machine talking only to two neighbors. Perhaps surprisingly, this algorithm has the same convergence rate (in the O(·) sense) as its centralized counterpart. Decentralized learning can be efficient if we co-design the algorithm and the system together!
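A minimal sketch of the gossip pattern on a ring, under simplifying assumptions: each worker holds a toy quadratic loss with its own optimum (the rows of `targets`), gradients are exact rather than stochastic, and the mixing weights are a uniform 1/3 over a worker and its two neighbors. The function name `gossip_sgd_ring` is hypothetical.

```python
import numpy as np

def gossip_sgd_ring(targets, lr=0.1, steps=300):
    """Decentralized gradient descent on a ring.

    Worker i minimizes 0.5 * ||x - targets[i]||^2 locally; the global optimum
    is the mean of all targets. Each step, a worker averages its model with
    its two ring neighbors and subtracts its own local gradient.
    """
    models = np.zeros_like(targets, dtype=float)
    for _ in range(steps):
        local_grads = models - targets  # exact gradient of each local quadratic
        # Gossip step: every worker only talks to its left and right neighbor.
        mixed = (np.roll(models, 1, axis=0) + models + np.roll(models, -1, axis=0)) / 3.0
        models = mixed - lr * local_grads
    return models

rng = np.random.default_rng(0)
targets = rng.normal(size=(8, 4))      # 8 workers, 4-dimensional model
models = gossip_sgd_ring(targets)
average_model = models.mean(axis=0)    # approaches targets.mean(axis=0)
```

In this toy setting the average of the workers' models converges to the global optimum even though no worker ever communicates with more than two peers. With a constant step size and heterogeneous local data, the individual models agree only approximately.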
H Zhang, J Li, K Kara, D Alistarh, J Liu, Ce Zhang. The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning. ICML 2017.
Can we apply lossy compression to the data when training ML models? This question is delicate even for simple models such as linear regression. Unlike with gradients, applying an unbiased compressor to the data introduces bias. In this paper, we propose a data compression framework and study its application to various models, from linear regression to deep neural networks.
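Why an unbiased compressor on the data can still bias training is easy to check on a scalar. The least-squares gradient for a sample (a, b) is a(aᵀx − b), which is quadratic in a, so reusing one quantized copy of a yields E[Q(a)²] rather than a². The sketch below computes these expectations exactly for stochastic rounding to the integer grid, and shows that multiplying two independent quantizations removes the bias. The grid and the helper names are assumptions chosen for illustration.

```python
import math

def stochastic_round_dist(a):
    """Outcomes and probabilities of unbiased stochastic rounding of a to the integers."""
    lo = math.floor(a)
    p_up = a - lo  # round up with probability equal to the fractional part
    return [(lo, 1.0 - p_up), (lo + 1, p_up)]

def expectations(a):
    """Return (E[Q(a)], E[Q(a)^2], E[Q1(a) * Q2(a)]) with Q1, Q2 independent."""
    dist = stochastic_round_dist(a)
    mean = sum(v * p for v, p in dist)
    same_sample = sum(v * v * p for v, p in dist)
    two_samples = sum(v1 * v2 * p1 * p2 for v1, p1 in dist for v2, p2 in dist)
    return mean, same_sample, two_samples

mean, same_sample, two_samples = expectations(0.3)
# mean equals a (the quantizer is unbiased),
# same_sample exceeds a**2 by p * (1 - p), where p is the fractional part of a,
# two_samples equals a**2 (independent quantizations remove the bias).
```

The gap `same_sample - a**2` is exactly the variance of the quantizer, which is why the quadratic term picks up bias even though each quantized value is unbiased on its own.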
2. Data-centric MLOps
Leonel Aguilar, David Dao, Shaoduo Gan, Nezihe Merve Gurel, Nora Hollenstein, Jiawei Jiang, Bojan Karlas, Thomas Lemmin, Tian Li, Yang Li, Susie Rao, Johannes Rausch, Cedric Renggli, Luka Rimanic, Maurice Weber, Shuai Zhang, Zhikuan Zhao, Kevin Schawinski, Wentao Wu, Ce Zhang. Ease.ML: A Lifecycle Management System for Machine Learning. CIDR 2021.
Building high-quality, trustworthy ML applications is not an easy task --- often improving the quality of a model is a journey of improving the quality of the underlying data. Users desperately need help today to conduct these "data iterations". Ease.ML is a framework that we developed over the years, consisting of a collection of tools to support this end-to-end process of "data engineering for ML".
Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, Ce Zhang. Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions. VLDB 2021.
A principled framework for "Data Cleaning for ML" via information maximization. Algorithmically, we show that entropy, a key quantity for information maximization, can be computed in PTIME for models with strong locality. This is inspired by the notion of "Certain Answer" in database theory, and brings together data and learning in a principled yet computationally feasible way.
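The notion of a certain prediction from the title can be made concrete with a brute-force sketch: if every possible repair of the incomplete training data leads a nearest neighbor classifier to the same label, the prediction is certain and no cleaning is needed for that query. The code below enumerates all possible worlds, which is exponential and only meant to illustrate the definition; it is not the polynomial-time approach described above. One-dimensional features and the names `knn_predict` and `certain_prediction` are assumptions.

```python
from collections import Counter
from itertools import product

def knn_predict(train_x, train_y, query, k=1):
    """Plain k-nearest-neighbor vote on 1-D features (ties broken by index)."""
    order = sorted(range(len(train_x)), key=lambda i: (abs(train_x[i] - query), i))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def certain_prediction(candidates, train_y, query, k=1):
    """Return the label if all possible worlds agree on it, otherwise None.

    candidates[i] lists the possible values of training example i; a complete
    example has exactly one candidate.
    """
    predictions = {
        knn_predict(world, train_y, query, k) for world in product(*candidates)
    }
    return predictions.pop() if len(predictions) == 1 else None

# The second training example has a missing value with two candidate repairs.
candidates = [[1.0], [2.0, 9.0], [10.0]]
labels = ["a", "a", "b"]
near_start = certain_prediction(candidates, labels, query=1.2)  # every world agrees
near_end = certain_prediction(candidates, labels, query=9.2)    # worlds disagree
```

For the query at 1.2 the missing value does not matter, while for the query at 9.2 the prediction depends on which repair is the true one, so that example would be worth cleaning.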
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, Ce Zhang. CleanML: A Benchmark for Evaluating the Impact of Data Cleaning on ML Classification Tasks. ICDE 2021.
Data quality has been studied intensively by the data management community for decades, but how do noise and biases in the training data influence the downstream ML model? This paper provides a systematic benchmark on this fundamental question. In a quantitative way, we show that data often matters much more than specific choices of models!
Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas J. Spanos, Dawn Song. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. VLDB 2019.
Data provenance has been studied intensively by the data management community for decades, but how can we reason about data influence for ML training? In previous work we proposed to use the Shapley value as one way to measure data influence. However, the Shapley value is often computationally infeasible to compute. In this paper, we show that the Shapley value can be computed in PTIME for models with strong locality, and we can use these simpler models as a proxy for more complex models!
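For an unweighted K-nearest-neighbor classifier and a single test point, the exact Shapley values follow from a simple recursion over training points sorted by distance. The sketch below is written from that recursion as I understand it, with the utility of a subset assumed to be the fraction of matching labels among its K nearest points (divided by K), and with 1 ≤ K ≤ N assumed. The function name `knn_shapley` is hypothetical.

```python
import numpy as np

def knn_shapley(train_x, train_y, test_x, test_y, k):
    """Exact Shapley value of each training point for one test point under KNN.

    Assumed utility of a subset S: (1 / k) * number of points among the
    min(k, |S|) nearest neighbors in S whose label equals test_y.
    """
    n = len(train_y)
    assert 1 <= k <= n
    # Rank training points from nearest to farthest.
    distances = np.linalg.norm(train_x - test_x, axis=1)
    order = np.argsort(distances, kind="stable")
    match = (train_y[order] == test_y).astype(float)

    ranked = np.zeros(n)
    ranked[n - 1] = match[n - 1] / n  # the farthest point
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point at position i
        ranked[i] = ranked[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    values = np.zeros(n)
    values[order] = ranked  # map back to the original training order
    return values

train_x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
train_y = np.array([1, 0, 1, 1, 0])
values = knn_shapley(train_x, train_y, test_x=np.array([0.4]), test_y=1, k=2)
```

The cost is dominated by sorting, so valuing all N training points for one test point takes O(N log N) time instead of enumerating exponentially many subsets. Values for a test set can be obtained by averaging over test points.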
Cedric Renggli, Bojan Karlas, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, Ce Zhang. Continuous Integration of Machine Learning Models: A Rigorous Yet Practical Treatment. SysML 2019.
Continuous integration is an important part of modern software development, so what does a CI/CD framework for ML look like? In this paper, we look at this problem and identify a unique challenge --- CI/CD for ML needs to carefully manage test case reuse and overfitting. To this end, we provide a statistically rigorous CI/CD framework rooted in adaptive analytics and description length. To the best of our knowledge, this is one of the first prototypes of CI/CD for ML.
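Why test set reuse is costly can be seen from a textbook estimate. The sketch below uses Hoeffding's inequality with a union bound over a fixed number of commits to size a test set so that every measured accuracy is within epsilon of the truth with probability at least 1 − delta. This is a simple non-adaptive baseline assumed here for illustration, not the estimator from the paper, and the function name is hypothetical.

```python
import math

def required_test_set_size(epsilon, delta, num_commits=1):
    """Smallest n with 2 * num_commits * exp(-2 * n * epsilon**2) <= delta.

    Assumes accuracy is an average of independent 0/1 outcomes and that the
    commits are chosen without looking at earlier test results.
    """
    return math.ceil(math.log(2 * num_commits / delta) / (2 * epsilon ** 2))

single = required_test_set_size(epsilon=0.01, delta=0.001)
reused = required_test_set_size(epsilon=0.01, delta=0.001, num_commits=32)
```

Even in this optimistic setting the required number of labels grows with the number of commits tested against the same data; when developers adapt to earlier test results, the guarantee needs more careful treatment.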
Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Re. Incremental knowledge base construction using DeepDive. VLDB 2015. (SIGMOD Research Highlight Award)
Knowledge-base construction is a task that is inherently knowledge-rich. DeepDive is a framework that we developed over several years to provide declarative knowledge integration and statistical reasoning at scale. It has enabled a diverse range of applications, from paleontology to anti-human trafficking.
Ce Zhang, Arun Kumar, and Christopher Re. Materialization optimizations for feature selection workloads. SIGMOD 2014. (SIGMOD Best Paper Award)
A declarative framework for feature selection, which automatically optimizes and manages the tradeoff. It is amazing to see how traditional data management concepts (such as materialization) can be applied to ML, and to see the unique challenges and opportunities that ML poses for these traditional concepts given different levels of error tolerance.