Optimizing Distributed Workloads with Infrastructure-managed Communication and Deployment

dc.contributor.advisor

Zhuo, Danyang

dc.contributor.advisor

Lentz, Matthew

dc.contributor.author

Wu, Yongji

dc.date.accessioned

2025-07-02T19:02:55Z

dc.date.available

2025-07-02T19:02:55Z

dc.date.issued

2024

dc.department

Computer Science

dc.description.abstract

As the scale and complexity of distributed workloads grow, performance is no longer the sole objective of application developers and infrastructure operators, who increasingly demand cost efficiency and manageability. Existing system infrastructure struggles to meet these goals. In the lower-level datacenter network stack, existing solutions rely on a library-based approach in which tenants implement and control the communication of their workloads. Without insight into the infrastructure and other tenants, they achieve sub-optimal performance while offering limited manageability in an inefficient way. On the higher-level application deployment side, the space of deployment configurations has grown too large for users to tune manually, especially with new workloads like machine learning inference workflows and new infrastructure options like spot instances.

In this dissertation, I argue that decoupling the implementation of communication primitives and the control of deployment strategies from distributed applications can improve their performance, cost efficiency, and manageability. On the lower-level communication side, we can implement common primitives via managed system services provided by the infrastructure operators, enabling new performance optimization opportunities and better manageability with negligible overhead. On the higher-level workload deployment side, we can build systems that manage and optimize deployment strategies for new workloads on new types of infrastructure, improving cost efficiency without sacrificing performance.

The contributions of this dissertation are the design, implementation, and evaluation of the following systems. (1) To improve the performance of remote procedure calls (RPCs) and enhance manageability, we present mRPC, a system service that decouples RPC marshalling and policy enforcement from applications, speeding up microservice applications by up to 2.5x compared to existing solutions for enforcing policies. (2) To improve the performance and manageability of collective communication, we introduce MCCS, a system service that exposes collective communication abstractions to applications while providing control and flexibility to cloud providers for their implementation, improving tenant collective performance by up to 2.4x compared with existing library-based solutions. (3) To improve performance and cost efficiency when deploying machine learning inference workflows, we develop JellyBean, a system service that optimizes and serves them over heterogeneous infrastructure, reducing total serving cost by up to 58%. (4) To improve performance and cost efficiency for training mixture-of-experts (MoE) models, we build Lazarus, a system service that manages and optimizes training of MoE models on spot instances with resiliency and elasticity, enabling cost reductions while outperforming existing checkpoint-based systems by up to 3.4x.

dc.identifier.uri

https://hdl.handle.net/10161/32624

dc.rights.uri

https://creativecommons.org/licenses/by-nc-nd/4.0/

dc.subject

Computer science

dc.title

Optimizing Distributed Workloads with Infrastructure-managed Communication and Deployment

dc.type

Dissertation

duke.embargo.release

2025-07-08

Files

Original bundle

Name:
Wu_duke_0066D_18291.pdf
Size:
3.6 MB
Format:
Adobe Portable Document Format