Solving Practical Problems in Datacenter Networks
Date
2013
Authors
Advisors
Journal Title
Journal ISSN
Volume Title
Repository Usage Stats
views
downloads
Abstract
The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under check. Today, datacenter network operators spend tremendous time and efforts on two key challenges: 1) how to efficiently utilize the bandwidth connecting host pairs and 2) how to promptly handle network failures with minimal disruptions to the hosted services.
To resolve the first challenge, we propose solutions in both network layer and transport layer. In the network layer solution, We advocate to design practical datacenter architectures for easy operation, i.e., an architecture should be reliable, capable of improving bisection bandwidth, scalable and debugging-friendly. By strictly following these four guidelines, We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to reallocate traffic from overloaded paths to underloaded paths without central coordination. We use congestion game theory to show that DARD converges to a Nash equilibrium in finite steps and its gap to the optimal flow allocation is bounded in the order of 1/logL, with L being the number of links. We use a testbed implementation and simulations to show that DARD can achieve a close-to-optimal flow allocation with small control overhead in practice.
In the transport layer solution, We propose Explicit Multipath Congestion Control Protocol (MPXCP), which achieves four desirable properties: fast convergence, efficiency, being fair to flows with different RTTs and negligible queue size. Intensive ns-2 simulation shows that MPXCP can quickly converge to efficiency and fairness without building up queues despite different delay-bandwidth products.
To resolve the second challenge, recent research efforts have focused on automatic failure localization. Yet, resolving failures still requires significant human interventions, resulting in prolonged failure recovery time. Unlike previous work, we propose NetPilot, a system aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do -- by deactivating or restarting suspected offending components. NetPilot circumvents the need for knowing the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot is comprised of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
Type
Department
Description
Provenance
Citation
Permalink
Citation
Wu, Xin (2013). Solving Practical Problems in Datacenter Networks. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/8201.
Collections
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.