Towards Large-Scale RDMA Networks without Performance Anomalies
Date
2024
Abstract
Remote Direct Memory Access (RDMA) has become a popular networking solution in modern data centers because it delivers high bandwidth, low latency, and high CPU efficiency. Applications such as machine learning training and inference, as well as remote storage, rely heavily on RDMA networks for inter-host communication. In the new era of artificial intelligence, RDMA is becoming, if it has not already become, one of the core components of modern data center network infrastructure.
Although RDMA can deliver extremely high performance, fully realizing this potential at large scale remains challenging: (1) specific workloads can trigger slow paths on the host or inside the RDMA NIC (RNIC), leading to unexpected RDMA performance degradation and even threatening the entire data center network; (2) severe performance interference caused by RDMA-specific resource contention prevents applications from efficiently sharing the network infrastructure. We call these two types of issues RDMA performance anomalies. These anomalies can lead to catastrophic consequences, such as application performance drops, Service Level Agreement (SLA) violations, head-of-line blocking, and even deadlock across the entire data center. Therefore, they have to be systematically uncovered and effectively addressed before a large-scale RDMA network starts to serve critical workloads.
Unfortunately, no existing approach can uncover and prevent these anomalies. The root cause is that current methods adopt a traditional network perspective and overlook critical aspects unique to RDMA networks, which have a highly complex microarchitecture. For instance, RNICs integrate on-NIC processing units and caches to support their hardware offloading capabilities. This lack of microarchitecture-awareness limits the effectiveness and efficiency of existing solutions. Moreover, because RNIC internals are not visible, prior work has limited, if any, understanding of the complex RDMA microarchitecture.
This raises an important question: is it possible for cloud operators to gain insight into RDMA's microarchitecture and develop microarchitecture-aware solutions that effectively and efficiently uncover and prevent performance anomalies? In this dissertation, I argue that this is both feasible and practical, and that being microarchitecture-aware is crucial to achieving these goals. I propose, design, and implement three software systems that support this thesis from different angles:
(i) Collie, a performance anomaly detection system that is the first to use qualitative microarchitecture information exposed by RDMA hardware counters to efficiently uncover RDMA performance anomalies. (ii) Husky, an end-to-end test suite that reveals the RDMA microarchitecture resource consumption model and identifies unique performance interference caused by microarchitecture resource contention in RDMA networks. (iii) Harmonic, the first microarchitecture-aware solution that monitors and modulates each application's RDMA microarchitecture resource usage to prevent performance interference and mitigate performance anomalies.
These systems have been comprehensively evaluated across various testbeds, and the results strongly support the proposed thesis statement. These systems and their evaluation results have also been successfully transferred to multiple industry collaborators, making a significant impact on the broader community.
Citation
Kong, Xinhao (2024). Towards Large-Scale RDMA Networks without Performance Anomalies. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32615.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.