Optimizing Distributed Workloads with Infrastructure-managed Communication and Deployment

Date

2024

Abstract

As the scale and complexity of distributed workloads grow, performance is no longer the sole objective of application developers and infrastructure operators, who increasingly demand cost efficiency and manageability as well. Existing system infrastructure struggles to meet these goals. In the lower-level datacenter network stack, existing solutions rely on a library-based approach in which tenants implement and control the communication of their own workloads. Without insight into the infrastructure and other tenants, these solutions achieve suboptimal performance and offer only limited, inefficient manageability. On the higher-level application deployment side, the space of deployment configurations has grown too large for users to tune manually, especially with new workloads such as machine learning inference workflows and new infrastructure options such as spot instances.

In this dissertation, I argue that decoupling the implementation of communication primitives and the control of deployment strategies from distributed applications can improve their performance, cost efficiency, and manageability. On the lower-level communication side, we can implement common primitives as managed system services provided by infrastructure operators, enabling new performance optimization opportunities and better manageability with negligible overhead. On the higher-level workload deployment side, we can build systems that manage and optimize deployment strategies for new workloads on new types of infrastructure, improving cost efficiency without sacrificing performance.

The contributions of this dissertation are the design, implementation, and evaluation of the following systems. (1) To improve the performance of remote procedure calls (RPCs) and enhance manageability, we present mRPC, a system service that decouples RPC marshalling and policy enforcement from applications, speeding up microservice applications by up to 2.5x compared to existing solutions for enforcing policies. (2) To improve the performance and manageability of collective communication, we introduce MCCS, a system service that exposes collective communication abstractions to applications while giving cloud providers control and flexibility over their implementation, improving tenant collective performance by up to 2.4x compared with existing library-based solutions. (3) To improve performance and cost efficiency when deploying machine learning inference workflows, we develop JellyBean, a system service that optimizes and serves them over heterogeneous infrastructure, reducing total serving cost by up to 58%. (4) To improve performance and cost efficiency when training mixture-of-experts (MoE) models, we build Lazarus, a system service that manages and optimizes the training of MoE models on spot instances with resiliency and elasticity, enabling cost reductions while outperforming existing checkpoint-based systems by up to 3.4x.

Subjects

Computer science

Citation

Wu, Yongji (2024). Optimizing Distributed Workloads with Infrastructure-managed Communication and Deployment. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32624.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.