Machine Learning for Efficient and Robust Datacenter Performance Management

Modern datacenters host a wide range of applications, and managing application performance is critical to the overall cost efficiency of datacenter infrastructure. However, previous solutions struggle to address the unique challenges that datacenters present and cannot achieve robust and efficient management for datacenter-scale computing. In this thesis, I focus on representative datacenter management problems and address these challenges using practical machine learning frameworks.

First, I will present my work on tuning complex configuration parameters for datacenter applications. Configuration parameters have a significant impact on application execution, so finding the configuration that maximizes application performance is desirable. However, system noise, which is prevalent in datacenters, often disrupts configuration tuning. Specifically, system noise can cause unexpected performance outliers and unreliable performance measurements. I propose machine learning models and experiment design methods to address the challenges resulting from system noise. The proposed methods identify configurations that are resilient to system noise, and they minimize the effects of system noise to obtain a robust statistical estimate of configuration performance for machine learning-based optimization. In addition, machine learning-based tuning requires collecting training data, which can be expensive when the configuration space is high-dimensional. Therefore, I will also present a framework for reducing machine learning training overhead in high-dimensional configuration tuning. The framework achieves efficient model training by identifying the parts of the configuration space that are likely to improve model accuracy, while dynamically and cautiously growing the model capacity during training.

Second, I will discuss the problem of enforcing quality of service for user-interactive applications at runtime. Enforcing quality of service is challenging in the presence of contention for shared resources among applications colocated on the same datacenter servers. I introduce a reinforcement learning-based runtime controller that makes strategic runtime decisions, achieving efficient quality-of-service enforcement and improved server resource utilization compared with previous solutions.





Li, Yuhao (2022). Machine Learning for Efficient and Robust Datacenter Performance Management. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public under a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.