Solving Practical Problems in Datacenter Networks

Wu, Xin

Date

2013

Authors

Wu, Xin

Advisors

Yang, Xiaowei

Abstract

The soaring demand for always-on, fast-response online services has driven tremendous growth in modern datacenter networks. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses in check. Today, datacenter network operators spend tremendous time and effort on two key challenges: 1) how to efficiently utilize the bandwidth connecting host pairs and 2) how to promptly handle network failures with minimal disruption to the hosted services.

To resolve the first challenge, we propose solutions in both the network layer and the transport layer. In the network layer solution, we advocate designing practical datacenter architectures for easy operation, i.e., an architecture should be reliable, capable of improving bisection bandwidth, scalable, and debugging-friendly. Strictly following these four guidelines, we propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to reallocate traffic from overloaded paths to underloaded paths without central coordination. We use congestion game theory to show that DARD converges to a Nash equilibrium in a finite number of steps and that its gap to the optimal flow allocation is bounded on the order of 1/log L, with L being the number of links. We use a testbed implementation and simulations to show that DARD can achieve a close-to-optimal flow allocation with small control overhead in practice.
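The idea of hosts selfishly shifting traffic until no one can improve can be illustrated with a toy congestion game. The sketch below is not DARD's algorithm. It assumes identical, parallel paths and unit-size flows, and the names `selfish_rebalance` and `is_nash` are hypothetical. Each flow moves to the least loaded path only if doing so strictly lowers the load it experiences. Every such move decreases the sum of squared path loads, so the process stops after finitely many moves at a Nash equilibrium of this toy game.

```python
def selfish_rebalance(assignment, num_paths, max_rounds=1000):
    """Toy better-response dynamics on parallel paths (illustrative only).

    assignment[f] is the index of the path currently used by flow f.
    Returns the final assignment and the number of moves performed.
    """
    assignment = list(assignment)
    load = [0] * num_paths
    for path in assignment:
        load[path] += 1

    moves = 0
    for _ in range(max_rounds):
        moved = False
        for flow, current in enumerate(assignment):
            best = min(range(num_paths), key=lambda p: load[p])
            # Move only if the flow is strictly better off after joining `best`.
            if load[best] + 1 < load[current]:
                load[current] -= 1
                load[best] += 1
                assignment[flow] = best
                moves += 1
                moved = True
        if not moved:
            break  # no flow can improve on its own: equilibrium of the toy game
    return assignment, moves


def is_nash(assignment, num_paths):
    """True if no single flow can lower its path load by switching paths."""
    load = [0] * num_paths
    for path in assignment:
        load[path] += 1
    return all(min(load) + 1 >= load[path] for path in assignment)


if __name__ == "__main__":
    # Six flows start on path 0 and two on path 1; paths 2 and 3 are idle.
    final, moves = selfish_rebalance([0, 0, 0, 0, 0, 0, 1, 1], num_paths=4)
    print(final, moves, is_nash(final, 4))
```

In this simplified setting the equilibrium is also perfectly balanced. With heterogeneous paths and shared links, as in a real datacenter topology, an equilibrium can fall short of the optimum, which is the gap the abstract's bound refers to.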

In the transport layer solution, we propose the Explicit Multipath Congestion Control Protocol (MPXCP), which achieves four desirable properties: fast convergence, efficiency, fairness to flows with different RTTs, and negligible queue size. Extensive ns-2 simulations show that MPXCP can quickly converge to efficiency and fairness without building up queues despite different delay-bandwidth products.

To resolve the second challenge, recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, we propose NetPilot, a system that aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do, by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
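The trial-and-error loop described above might look roughly like the following. This is a minimal sketch under stated assumptions, not NetPilot's implementation. The function `mitigate`, its callback parameters, and the `max_impact` threshold are all hypothetical names introduced for illustration. Candidate actions are assumed to arrive already ordered by a planner, and an estimated impact is assumed to be a single number.

```python
def mitigate(candidates, estimate_impact, apply_action, failure_persists,
             rollback, max_impact):
    """Try candidate mitigation actions in order until the failure clears.

    All callbacks are hypothetical placeholders:
      estimate_impact(action) -> float, predicted disruption of the action
      apply_action(action)    -> performs the action (e.g. deactivate a device)
      failure_persists()      -> True while the failure is still observed
      rollback(action)        -> undoes an action that did not help
    Returns the action that cleared the failure, or None if none did.
    """
    for action in candidates:
        # Guard: skip actions predicted to be too disruptive.
        if estimate_impact(action) > max_impact:
            continue
        apply_action(action)
        if not failure_persists():
            return action
        rollback(action)  # the action did not help, so undo it
    return None


if __name__ == "__main__":
    # Toy scenario: component "B" is the faulty one; disabling "A" is too costly.
    disabled, tried = set(), []
    impact = {"A": 0.9, "B": 0.1, "C": 0.2}

    def apply_action(name):
        tried.append(name)
        disabled.add(name)

    fixed_by = mitigate(
        candidates=["A", "C", "B"],
        estimate_impact=impact.get,
        apply_action=apply_action,
        failure_persists=lambda: "B" not in disabled,
        rollback=disabled.discard,
        max_impact=0.5,
    )
    print(fixed_by, tried)
```

In this toy run, "A" is skipped by the impact guard, "C" is tried and rolled back, and "B" clears the failure. Ordering candidates well keeps the number of trials low, and the root cause never has to be identified.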

Type

Dissertation

Department

Computer Science

Subjects

Computer science, Adaptive Routing, Datacenter Network, Failure Mitigation, Multipath Transport Protocol

Permalink

https://hdl.handle.net/10161/8201

Citation

Wu, Xin (2013). Solving Practical Problems in Datacenter Networks. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/8201.

Collections

Dissertations

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.