Machine Learning for Efficient and Robust Datacenter Performance Management
Date
2022
Abstract
Modern datacenters host a wide range of applications. Managing application performance is critical for the overall cost efficiency of datacenter infrastructures. However, previous solutions struggle to address the unique challenges that datacenters present and cannot achieve robust and efficient management for datacenter-scale computing. In this thesis, I will focus on representative datacenter management problems and address these challenges using practical machine learning frameworks.
First, I will present my work on tuning complex configuration parameters for datacenter applications. Configuration parameters have a significant impact on application execution, so finding the configuration that maximizes application performance is desirable. However, the system noise prevalent in datacenters often disrupts configuration tuning. Specifically, system noise can cause unexpected performance outliers and unreliable performance measurements. I propose machine learning models and experiment design methods to address the challenges resulting from system noise. The proposed methods identify configurations that are resilient to system noise and minimize its effects, yielding a robust statistical estimate of configuration performance for machine learning-based optimization. In addition, machine learning-based tuning requires collecting training data, which can be expensive if the configuration space is high-dimensional. Therefore, I will also present a framework for reducing machine learning training overhead for high-dimensional configuration tuning. The framework achieves efficient model training by identifying parts of the configuration space that are likely to improve model accuracy while dynamically and cautiously growing the model capacity during training.
Second, I will discuss the problem of enforcing quality of service for user-interactive applications at runtime. Enforcing quality of service is challenging in the presence of shared-resource contention among applications collocated on the same datacenter servers. I introduce a reinforcement learning-based runtime controller that makes strategic runtime decisions and, compared with previous solutions, enforces quality of service efficiently while improving server resource utilization.
Citation
Li, Yuhao (2022). Machine Learning for Efficient and Robust Datacenter Performance Management. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/25406.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.