Artificial Intelligence for Understanding Large and Complex Datacenters

With the democratization of global-scale web applications and cloud computing, understanding the performance of a live production datacenter has become a prerequisite for making strategic decisions about datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of the collected data challenge existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge using artificial intelligence and make the case with two important problems: datacenter performance diagnosis and datacenter workload characterization.

The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic modeling for datacenter straggler diagnosis. Stragglers are exceptionally slow tasks in parallel execution that delay overall job completion. Although uncommon within a single job, stragglers are pervasive in datacenters that run many jobs. A large body of research has focused on mitigating stragglers, but relatively little has focused on systematically identifying their causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs.

The second thrust of this dissertation is the use of graph theory and statistical semantic learning for datacenter workload understanding, which has significant impact on datacenter hardware architecture, capacity planning, and software re-optimization, among other areas. Datacenter engineers understand datacenter workloads through continuous, distributed profiling that produces snapshots of call stacks across datacenter machines. Unlike stack traces profiled for isolated micro-benchmarks or small applications, those for hyperscale datacenters are enormous and complex, reflect the scale and diversity of production code, and pose great challenges for efficient and effective interpretation. We present Limelight+, an algorithmic framework based on graph theory and statistical semantic learning, to extract workload insights from datacenter-scale stack traces and to gain design insights for datacenter architecture.





Zheng, Pengfei (2020). Artificial Intelligence for Understanding Large and Complex Datacenters. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.