Machine Learning for Efficient and Robust Datacenter Performance Management


Date

2022



Abstract

Modern datacenters host a wide range of applications, and managing application performance is critical to the overall cost efficiency of datacenter infrastructure. However, previous solutions struggle to address the unique challenges that datacenters present and cannot achieve robust and efficient management for datacenter-scale computing. In this thesis, I will focus on representative datacenter management problems and address these challenges using practical machine learning frameworks.

First, I will present my work on tuning complex configuration parameters for datacenter applications. Configuration parameters have a significant impact on application execution, so finding the configuration that maximizes application performance is desirable. However, the system noise prevalent in datacenters often disrupts configuration tuning. Specifically, system noise can cause unexpected performance outliers and unreliable performance measurements. I propose machine learning models and experiment design methods to address the challenges resulting from system noise. The proposed methods identify configurations that are resilient to system noise, and they minimize the effects of that noise to obtain a robust statistical estimate of configuration performance for machine learning-based optimization. In addition, machine learning-based tuning requires collecting training data, which can be expensive when the configuration space is high-dimensional. Therefore, I will also present a framework for reducing machine learning training overhead in high-dimensional configuration tuning. The framework achieves efficient model training by identifying the parts of the configuration space that are likely to improve model accuracy while dynamically and cautiously growing the model capacity during training.
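As a generic illustration of why a robust estimate matters under system noise (this is not the method developed in the dissertation), the sketch below compares two configurations using the median of repeated measurements rather than the mean. The function names, configuration labels, and latency values are all hypothetical and chosen only for the example.

```python
import statistics

def robust_estimate(samples):
    """Median of repeated measurements; resistant to a few outliers."""
    return statistics.median(samples)

def pick_best(measurements, estimator):
    """Return the configuration name with the lowest estimated latency.

    `measurements` maps a (hypothetical) configuration name to a list of
    repeated latency measurements in milliseconds.
    """
    return min(measurements, key=lambda name: estimator(measurements[name]))

# Hypothetical data: config "a" is faster, but one run hit a noise-induced outlier.
measurements = {
    "a": [10.2, 9.8, 10.1, 60.0, 9.9],
    "b": [12.0, 12.1, 11.9, 12.2, 12.0],
}

best_by_median = pick_best(measurements, robust_estimate)
best_by_mean = pick_best(measurements, statistics.mean)
```

With these made-up numbers, the single outlier is enough to make the mean prefer the slower configuration, while the median still selects the faster one.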

Second, I will discuss the problem of enforcing quality of service for user-interactive applications at runtime. Enforcing quality of service is challenging in the presence of shared-resource contention among applications co-located on the same datacenter servers. I introduce a reinforcement learning-based runtime controller that makes strategic runtime decisions, enforcing quality of service efficiently and improving server resource utilization compared with previous solutions.

Citation

Li, Yuhao (2022). Machine Learning for Efficient and Robust Datacenter Performance Management. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/25406.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.