Browsing by Author "Lee, Benjamin C"
Item Open Access Artificial Intelligence for Understanding Large and Complex Datacenters (2020) Zheng, Pengfei. With the democratization of global-scale web applications and cloud computing, understanding the performance of a live production datacenter becomes a prerequisite for making strategic decisions related to datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of the collected data challenge existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge using artificial intelligence and make the case with two important problems: datacenter performance diagnosis and datacenter workload characterization.
The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic models for datacenter straggler diagnosis. Stragglers are exceptionally slow tasks in parallel execution that delay overall job completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. A large body of research has focused on mitigating stragglers, but relatively little research has focused on systematically identifying their causes. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs.
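To make the diagnosis problem concrete, the following sketch flags stragglers by normalized latency within a job and ranks resource metrics by how strongly they separate stragglers from their peers. It is a crude stand-in for Hound's statistical causal inference; the task records and field names are illustrative, not Hound's actual inputs.

    import statistics

    # Hypothetical per-task records: latency plus resource counters.
    tasks = [
        {"latency": 12.0, "cpu_wait": 0.1, "disk_io": 0.2},
        {"latency": 11.0, "cpu_wait": 0.2, "disk_io": 0.1},
        {"latency": 55.0, "cpu_wait": 0.9, "disk_io": 0.2},  # straggler
    ]

    def stragglers(tasks, threshold=1.5):
        """Flag tasks whose latency exceeds threshold x the job median."""
        med = statistics.median(t["latency"] for t in tasks)
        return [t for t in tasks if t["latency"] > threshold * med]

    def suspect_causes(tasks, metrics=("cpu_wait", "disk_io")):
        """Rank metrics by the gap between stragglers and normal tasks."""
        slow = stragglers(tasks)
        normal = [t for t in tasks if t not in slow]
        gaps = {m: statistics.mean(t[m] for t in slow)
                   - statistics.mean(t[m] for t in normal) for m in metrics}
        return sorted(gaps.items(), key=lambda kv: -kv[1])

    print(suspect_causes(tasks))  # cpu_wait ranks first: the likely culprit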
The second thrust of this dissertation is the use of graph theory and statistical semantic learning for datacenter workload understanding, which has significant impact on datacenter hardware architecture, capacity planning, and software re-optimization. Datacenter engineers understand datacenter workloads with continuous, distributed profiling that produces snapshots of call stacks across datacenter machines. Unlike stack traces profiled for isolated micro-benchmarks or small applications, those for hyperscale datacenters are enormous and complex, reflect the scale and diversity of production codes, and pose great challenges for efficient and effective interpretation. We present Limelight+, an algorithmic framework based on graph theory and statistical semantic learning, to extract workload insights from datacenter-scale stack traces and to gain design insights for datacenter architecture.
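As an illustration of the underlying data structure, the sketch below aggregates stack samples into a weighted call graph and attributes exclusive time to leaf frames, the kind of representation a framework like Limelight+ would analyze at far larger scale. The samples and function names are invented.

    from collections import Counter

    # Hypothetical profiler snapshots: each sample is a call stack, leaf last.
    samples = [
        ("main", "rpc_serve", "compress"),
        ("main", "rpc_serve", "compress"),
        ("main", "rpc_serve", "memcpy"),
        ("main", "gc", "mark"),
    ]

    def call_graph(samples):
        """Aggregate stack samples into caller->callee edge weights."""
        edges = Counter()
        for stack in samples:
            for caller, callee in zip(stack, stack[1:]):
                edges[(caller, callee)] += 1
        return edges

    def exclusive_time(samples):
        """Attribute each sample to its leaf frame (exclusive cycles)."""
        return Counter(stack[-1] for stack in samples)

    print(exclusive_time(samples).most_common(1))  # [('compress', 2)]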
Item Open Access Coordinating Software and Hardware for Performance Under Power Constraints (2019) Huang, Ziqiang. For more than 50 years since its birth in 1965, Moore's Law has been a self-fulfilling prophecy that drives computing forward. However, as Dennard scaling ends, chip power density presents a challenge that becomes increasingly severe with every process generation. Consequently, a growing subset of transistors on a chip will need to be powered off in order to operate under a sustainable thermal envelope, a design strategy commonly referred to as "dark silicon".
Although dark silicon poses a major challenge for consistently delivering higher performance, it also inspires researchers to rethink how a chip should be designed and managed in a power- and thermal-constrained environment. The historical way of extracting performance, whether single-thread or multi-thread, by throwing complicated, power-hungry hardware at the problem is no longer viable. Instead, we need to rely more on software, but the hardware needs to provide new mechanisms. In this thesis, we present three pieces of work on software and hardware codesign to demonstrate how coordinating software, such as compilers and runtimes, with underlying hardware support can help boost performance in a power-efficient way.
First, out-of-order (OoO) processors achieve higher performance than in-order (IO) ones by aggressively scheduling instructions out of program order during execution. However, dynamic scheduling requires sophisticated control and numerous bookkeeping structures---e.g., the reorder buffer, load-store queue, and register alias table---that increase complexity, area, and, most importantly, power. Observing that a compiler produces better static schedules when the instruction set defines simple operations, we propose an ISA extension that decouples the data access and register write operations in a load instruction. We show that with modest system and hardware support, we can improve compilers' instruction scheduling by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with a geometric mean speedup of 8.4% for a wide range of applications, bringing IO designs a step closer to OoO performance.
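A toy latency model can show why the hoisting matters. In the sketch below, an in-order pipeline stalls the consumer of a load until the load completes; hoisting the decoupled data access above a may-alias store starts the load's latency earlier and shortens the schedule. The four-cycle load latency and the schedules are assumptions for illustration, not figures from the dissertation.

    def cycles(schedule):
        """Cycles for a toy in-order pipeline: one instruction issues per
        cycle, and issue stalls until all source registers are ready."""
        ready, t = {}, 0
        for dst, srcs, latency in schedule:
            t = max([t] + [ready[s] for s in srcs]) + 1  # stall for operands
            ready[dst] = t + latency - 1                 # result ready later
        return max(ready.values())

    LOAD, ALU = 4, 1  # assumed latencies

    # Baseline: the load cannot be scheduled above the may-alias store.
    baseline = [("mem", [], ALU), ("r1", [], LOAD), ("r2", ["r1"], ALU)]

    # Decoupled: the data access is hoisted above the store, while the
    # architectural register write stays in program order.
    decoupled = [("r1", [], LOAD), ("mem", [], ALU), ("r2", ["r1"], ALU)]

    print(cycles(baseline), cycles(decoupled))  # 6 5: the hoisted schedule wins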
Second, sprinting is a class of computational mechanisms that provides a short but significant performance boost while temporarily exceeding the thermal design point. Using phase change material to buffer heat, sprinting is a promising way to deliver high performance in future chip designs that are likely to be power and thermal constrained. However, because sprints cannot be sustained, the system needs a mechanism to decide when to start and stop a sprint. We propose UTAR, a software runtime framework that manages sprints by dynamically predicting utility and modeling thermal headroom. Moreover, we propose a new sprint mechanism for caches, increasing capacity briefly for enhanced performance. For a system that extends last-level cache capacity from 2MB to 4MB per core and can absorb 10J of heat with phase change material, UTAR-guided cache sprinting improves performance by 17% on average and by up to 40% over a non-sprinting system. These performance outcomes, within 95% of an oracular policy, are possible because UTAR accurately predicts phase behavior and sprint utility.
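A minimal sketch of a utility- and thermal-aware sprint policy in the spirit of UTAR appears below: sprint only when the predicted benefit clears a threshold and the phase change material still has headroom to absorb the sprint's heat. The 10 J budget comes from the abstract; the threshold, recharge rate, and interface are assumptions.

    class SprintGovernor:
        """Toy sprint controller, not UTAR's actual algorithm."""

        def __init__(self, heat_budget_j=10.0, recharge_w=1.0):
            self.budget = heat_budget_j
            self.headroom = heat_budget_j  # joules the PCM can still absorb
            self.recharge = recharge_w     # joules restored per idle second

        def step(self, predicted_speedup, sprint_heat_j, dt=1.0):
            """Sprint if predicted utility is high and the heat fits the budget."""
            if predicted_speedup > 1.10 and sprint_heat_j <= self.headroom:
                self.headroom -= sprint_heat_j
                return True
            self.headroom = min(self.headroom + self.recharge * dt, self.budget)
            return False

    gov = SprintGovernor()
    print(gov.step(predicted_speedup=1.4, sprint_heat_j=4.0))  # True: sprint
    print(gov.step(predicted_speedup=1.4, sprint_heat_j=8.0))  # False: no headroom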
Finally, applications often exhibit phase behaviors that demand different types of system resources. As a result, management frameworks need to coordinate different sprinting mechanisms to realize the full performance potential. We propose UTAR+, an extended version of UTAR that determines not only when to sprint, but also which resource to sprint and at what intensity. Building upon UTAR's phase predictor and utility- and thermal-aware policy, UTAR+ quickly identifies the most profitable sprinting option to maximize performance/watt. For a system that offers multiple sprinting options, UTAR+-guided multi-resource sprinting improves performance by 22% on average and by up to 83% over a non-sprinting system, outperforming UTAR+-guided single-resource sprinting for a variety of applications.
Item Open Access Coordinating the Design and Management of Heterogeneous Datacenter Resources (2014) Guevara, Marisabel Alejandra. Heterogeneous design presents an opportunity to improve energy efficiency but raises a challenge in management. Whereas prior work separates the two, we coordinate heterogeneous design and management. We present a market-based resource allocation mechanism that navigates the performance and power trade-offs of heterogeneous architectures. Given this management framework, we explore a design space of heterogeneous processors and show a 12x reduction in response time violations when equipping a datacenter with three processor types over a homogeneous system that consumes the same power. To better understand trade-offs in large heterogeneous design spaces, we explore dozens of design strategies and present a risk taxonomy that classifies the reasons why a deployed system may underperform relative to design targets. We propose design strategies that explicitly mitigate risk, such as a strategy that minimizes the coefficient of variation in performance. In our experiments, we find that risk-aware design accounts for more than 70% of the strategies that produce systems with the best service quality. We also present a new datacenter management mechanism that fairly allocates processors to latency-sensitive applications. Tasks express value for performance using sophisticated piecewise-linear utility functions. With fairness in market allocations, we show how datacenters can mitigate envy amongst latency-sensitive users. We quantify the price of fairness and detail efficiency-fairness trade-offs. Finally, we extend the market to fairly allocate heterogeneous processors.
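The sketch below shows the kind of piecewise-linear utility function a latency-sensitive task might submit to such a market, interpolating value between breakpoints and saturating beyond them. The breakpoints are hypothetical.

    import bisect

    def piecewise_linear_utility(breakpoints):
        """Build u(perf) from (performance, value) breakpoints, linearly
        interpolated between points and flat beyond the ends."""
        xs, ys = zip(*sorted(breakpoints))
        def u(x):
            if x <= xs[0]:
                return ys[0]
            if x >= xs[-1]:
                return ys[-1]
            i = bisect.bisect_right(xs, x)
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])
        return u

    # Hypothetical task: value rises steeply up to 100 requests/s, then
    # flattens -- extra performance beyond that is worth little.
    u = piecewise_linear_utility([(0, 0.0), (100, 9.0), (200, 10.0)])
    print(u(50), u(150))  # 4.5 9.5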
Item Open Access Design Strategies for Efficient and Secure Memory (2019) Lehman, Tamara Silbergleit. Recent computing trends force users to relinquish physical control to unknown parties, making the system vulnerable to physical attacks. Software alone is not well equipped to protect against physical attacks. Instead, software and hardware have to enforce security in collaboration to defend against physical attacks. Many secure processor implementations have surfaced over the last two decades (e.g., Intel SGX, ARM TrustZone), but inefficiencies are hindering their adoption.
Secure processors use secure memory to detect and guard against physical attacks. Secure memory assumes that everything within the chip boundary is trusted and provides confidentiality and integrity verification for data in memory. Both of these features, confidentiality and integrity, require large metadata structures which are
stored in memory. When a system equipped with secure memory misses at the last-level-cache (LLC), the memory controller has to issue additional memory requests to fetch the corresponding metadata from memory. These additional memory requests increase delay and energy. The main goal of this dissertation is to reduce overheads of secure memory in two dimensions: delay and energy.
First, to reduce the delay overhead we propose the first safe speculative integrity verification mechanism, PoisonIvy, that effectively hides the integrity verification latency while maintaining security guarantees. Secure memory has high delay overheads due to the long integrity verification latency. Speculation allows the system to return decrypted data back to the processor before the integrity verification completes, effectively removing the integrity verification latency from the critical path of a memory access. However, speculation without any other mechanism to safeguard security is unsafe. PoisonIvy safeguards security guarantees by preventing any effect of unverified data from leaving the trusted boundary. PoisonIvy is able to realize all the benefits of speculative integrity verification while maintaining the same security guarantees as the non-speculative system.
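The essence of the poison-bit idea can be sketched in a few lines: data returned speculatively carries a poison flag, poison propagates through any computation that consumes it, and poisoned values are blocked at the trusted boundary until verification completes. This is a simplified software analogy, not PoisonIvy's hardware design.

    class SpecValue:
        """A value returned to the pipeline before integrity verification."""
        def __init__(self, data, verified=False):
            self.data, self.poisoned = data, not verified

        def add(self, other):
            out = SpecValue(self.data + other.data, verified=True)
            out.poisoned = self.poisoned or other.poisoned  # poison propagates
            return out

    def egress(value):
        """Trusted-boundary exit: unverified effects must not escape."""
        if value.poisoned:
            raise RuntimeError("unverified data may not leave the chip")
        return value.data

    a = SpecValue(40)                 # speculative, unverified
    b = SpecValue(2, verified=True)
    c = a.add(b)                      # c inherits a's poison
    c.poisoned = False                # verification completes, poison clears
    print(egress(c))                  # 42: now safe to release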
Speculation is effective in reducing delay overhead but has no effect on the number of additional memory accesses, which cause large energy overheads. Secure memory metadata has unique memory access patterns that are not compatible with traditional cache designs. In the second part of this work, we provide the first in-depth study of metadata access patterns, MAPS, to help architects design more efficient cache architectures customized for secure memory metadata. Based on the unique characteristics of secure memory metadata observed in that analysis, in the third part of this work we explore the design space of efficient cache designs. We describe one possible design, Metadata Cache eXtension (MCX), which exploits the bimodal reuse distance distribution of metadata blocks to improve cache efficiency, thereby reducing the number of additional memory accesses. We also explore an LLC eviction policy suited to handling multiple types of blocks, further improving the efficiency of caching metadata blocks on-chip.
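As a sketch of how a bimodal reuse-distance distribution might be exploited, the toy cache below inserts blocks predicted to have short reuse at the MRU position and parks the rest at LRU so they are evicted quickly. The prediction input and policy details are assumptions, not MCX's actual design.

    from collections import OrderedDict

    class BimodalCache:
        """Toy cache with reuse-aware insertion (LRU first, MRU last)."""

        def __init__(self, capacity=4):
            self.capacity = capacity
            self.blocks = OrderedDict()

        def access(self, addr, short_reuse):
            if addr in self.blocks:
                self.blocks.move_to_end(addr)        # hit: promote to MRU
                return True
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)      # miss: evict LRU
            self.blocks[addr] = True
            if not short_reuse:
                self.blocks.move_to_end(addr, last=False)  # park at LRU
            return False

A block with a long predicted reuse distance then occupies the cache only briefly, leaving more capacity for metadata blocks that will be reused soon.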
Item Open Access Machine Learning for Efficient and Robust Datacenter Performance Management (2022) Li, Yuhao. Modern datacenters host a wide range of applications. Managing application performance is critical for the overall cost efficiency of datacenter infrastructures. However, previous solutions struggle to address the unique challenges presented in datacenters and cannot achieve robust and efficient management for datacenter-scale computing. In this thesis, I will focus on representative datacenter management problems and address these challenges using practical machine learning frameworks.
First, I will present my work on tuning complex configuration parameters for datacenter applications. Configuration parameters have a significant impact on application execution, and finding the optimal configuration to maximize application performance is desirable. However, the prevalent system noise in datacenters often disturbs configuration tuning. Specifically, system noise can cause unexpected performance outliers and unreliable performance measurements. I propose machine learning models and experiment design methods to address the challenges resulting from system noise. The proposed methods identify configurations that are resilient to system noise and minimize the effects of system noise to obtain a robust statistical estimate of configuration performance for machine learning-based optimization. Moreover, machine learning-based tuning requires collecting training data, which can be expensive if the configuration space is high-dimensional. Therefore, I will also present a framework for reducing machine learning training overhead for high-dimensional configuration tuning. The framework achieves efficient model training by identifying parts of the configuration space that are likely to improve model accuracy while dynamically and cautiously growing the model capacity during training.
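A minimal sketch of noise-robust measurement for configuration tuning follows: repeat each run, summarize with the median so heavy-tailed outliers cannot corrupt the estimate, and pick the best configuration. The response surface and noise model are invented; the thesis's actual machinery is machine learning models and experiment design, not this toy loop.

    import random
    import statistics

    def true_latency(config):
        # Hypothetical response surface over one parameter.
        return (config["buffer_mb"] - 64) ** 2 / 100 + 10

    def system_noise():
        # Heavy-tailed noise: occasional large outliers, as in datacenters.
        return random.expovariate(0.2) if random.random() < 0.1 else random.gauss(0, 0.2)

    def measure(config, trials=5):
        """Median of repeated runs: a robust statistical estimate."""
        runs = [true_latency(config) + system_noise() for _ in range(trials)]
        return statistics.median(runs)

    candidates = [{"buffer_mb": m} for m in (16, 32, 64, 128)]
    print(min(candidates, key=measure))  # most likely {'buffer_mb': 64}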
Second, I will discuss the problem of enforcing quality of service for user-interactive applications at runtime. Enforcing quality of service is challenging in the presence of shared resource contention among applications collocated on the same datacenter servers. I introduce a reinforcement learning-based runtime controller that makes strategic runtime decisions and improves both quality-of-service enforcement and server resource utilization compared with previous solutions.
Item Open Access Managing Shared Resources in the Data Center Era (2018) Zahedi, Seyed Majid. To improve efficiency and amortize cost over more computation, resource
sharing has become vital in high performance computing systems. In such
systems, the conventional wisdom assumes that users have to share,
regardless of the management policy. With a wide range of computing options
available, this assumption does not seem to hold for today's self-interested
users. These users selfishly pursue their individual performance without
regard for others or the system. And if they dislike management outcomes,
they will withdraw from the shared system. If they decide to share, they
will try to game the management system by misreporting their resource
demands to improve their performance, perhaps at the expense of others in
the system. Game theory is an effective tool for addressing this
challenge and studying the strategic behavior of self-interested users. Drawing
on game theory, this thesis encourages new thinking in designing management
platforms robust to strategic behavior. In this thesis, we present five
pieces of work on data center management platforms.
First, with the democratization of cloud and datacenter computing, users
increasingly share large hardware platforms. In this setting, architects
encounter two challenges: sharing fairly and sharing multiple resources.
Drawing on game theory, we rethink fairness in computer
architecture. A fair allocation must provide sharing incentives (SI),
envy-freeness (EF), and Pareto efficiency (PE). We show that Cobb-Douglas
utility functions are well suited to modeling user preferences for cache
capacity and memory bandwidth. Additionally, we present an allocation mechanism that
uses Cobb-Douglas preferences to determine each user's fair share of the
hardware. This mechanism provably guarantees SI, EF, and PE, as well as
strategy-proofness in the large (SPL). And it does so with modest
performance penalties, less than 10% throughput loss, relative to an unfair
mechanism.
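For concreteness, a Cobb-Douglas utility for user i over cache capacity C_i and memory bandwidth B_i takes the form below; the per-user exponent alpha_i, which weights the two resources, is an illustrative parameterization.

    u_i(C_i, B_i) = C_i^{\alpha_i} \, B_i^{1 - \alpha_i}, \qquad 0 < \alpha_i < 1

Diminishing returns in each resource, and the complementarity between them, are what make this family a natural model of preferences for cache capacity and memory bandwidth.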
Second, computational sprinting is a class of mechanisms that boost
performance but dissipate additional power. We describe a sprinting
architecture in which many independent chip multiprocessors share a power
supply and sprints are constrained by the chips' thermal limits and the
rack's power limits. Moreover, we present the computational sprinting game,
a multi-agent perspective on managing sprints. Strategic agents decide
whether to sprint based on application phases and system conditions. The
game produces an equilibrium that improves task throughput for data
analytics workloads by 4-6x over prior greedy heuristics and performs
within 90% of an upper bound on throughput from a globally optimized
policy.
Third, ensuring fairness in a system with scarce and commonly preferred resources
requires time sharing. We consider a heterogeneous system with a few
"big" and many "small" processors. We allocate heterogeneous processors
using a novel token mechanism that supports game-theoretic notions of
fairness such as sharing incentives and envy-freeness. The mechanism frames
the allocation problem as a repeated game. In each round of the game, users
request big processors and spend a token if their request is granted. We
formulate game dynamics and optimize users' strategies to produce an
equilibrium. Allocations from optimal strategies balance performance and
fairness. Our token mechanism outperforms classical, fair mechanisms by
1.7x, on average, in total performance gains, and is competitive with a
performance maximizing mechanism.
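A sketch of one round of such a token mechanism appears below: token-holding users who request a big processor enter a lottery for the scarce big cores; winners spend a token and everyone else runs on small cores. The lottery and data layout are assumptions that convey the mechanism's flavor, not its exact rules.

    import random

    def allocate_round(users, big_cores=1):
        """One round of a toy token mechanism for big/small processors."""
        requesters = [u for u in users if u["tokens"] > 0 and u["wants_big"]]
        winners = random.sample(requesters, min(big_cores, len(requesters)))
        for u in users:
            if u in winners:
                u["tokens"] -= 1          # spend a token for the big core
                u["core"] = "big"
            else:
                u["core"] = "small"
        return users

    users = [
        {"name": "a", "tokens": 2, "wants_big": True},
        {"name": "b", "tokens": 0, "wants_big": True},  # no token, no request
        {"name": "c", "tokens": 1, "wants_big": False},
    ]
    for u in allocate_round(users):
        print(u["name"], u["core"], u["tokens"])

Spending tokens creates the repeated-game dynamic: claiming a big processor now forfeits a claim on one later, so users reveal demand only when the benefit is real.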
Fourth, we present a processor allocation framework that uses Amdahl's Law to
model parallel performance and a market mechanism to allocate cores. We
propose the Amdahl utility function and demonstrate its accuracy when
modeling performance from processor core allocations. We then design a
market based on Amdahl utility and propose the Amdahl bidding procedure that
optimizes users' bids for processors based on workload parallelizability.
The framework uses entitlements to guarantee fairness yet outperforms
existing proportional share algorithms.
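The Amdahl utility function presumably takes the familiar form of Amdahl's Law: for a workload with parallelizable fraction f, the utility of an allocation of n cores is the modeled speedup

    u(n) = \frac{1}{(1 - f) + f / n}

so bids for additional processors are worth far more to highly parallel workloads (f near 1) than to mostly serial ones, which is exactly the information the bidding procedure exploits.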
Finally, sharing computational resources amortizes cost and improves
utilization and efficiency. When agents pool their resources together,
each becomes entitled to a portion of the shared pool. Static
allocations in each round can guarantee entitlements and are
strategy-proof, but efficiency suffers because allocations do not
reflect variations in agents' demands for resources across rounds.
Dynamic allocation mechanisms assign resources to agents across
multiple rounds while guaranteeing agents their entitlements.
Designing dynamic mechanisms is challenging, however, when agents are
strategic and can benefit by misreporting their demands for resources.
The Amdahl bidding mechanism facilitates the trade in resources between
users with static demands within a single management round. To facilitate
the trade in resources between users with dynamic demands across multiple
rounds, we propose two novel mechanisms. First, the T-period mechanism
satisfies strategy-proofness and sharing incentives but with low efficiency.
Second, the token mechanism satisfies strategy-proofness and guarantees at
least a 50% approximation of sharing incentives, which means users receive
at least half the utility they would have received by not participating in
the mechanism. Through simulations on data gathered from Google clusters, we
show that the performance of the token mechanism is comparable to that of
state-of-the-art mechanisms that do not guarantee our game-theoretic
properties. Further, although the token mechanism only guarantees a 50%
approximation of sharing incentives, in practice, users receive at least
98% of their sharing incentives guarantee.
Item Open Access Microeconomic Models for Managing Shared Datacenters (2017) Llull, Qiuyun. As the data demands of users’ applications increase, the world’s computing platforms are moving towards more capable machines – servers and warehouse-scale datacenters. Diverse users share datacenters for complex computation and compete for shared resources. In some systems, such as public clouds where users pay for reserved hardware, management policies pursue performance goals. In contrast, private systems consist of users who voluntarily combine their resources and subscribe to a common management policy. These users reserve the right to opt out of shared systems if resources are managed poorly. The system management framework needs to ensure fairness among strategic users, encouraging them to participate while guaranteeing individual performance and preserving the system’s performance. Microeconomic models are well suited for studying individual behavior and the allocation of scarce resources. In this thesis, we present three pieces of work on task colocation, resource allocation, and task scheduling problems to demonstrate the effectiveness of a microeconomic approach.
Colocating applications on shared hardware (e.g., chip multiprocessors) improves server utilization but introduces resource contention into the memory subsystem. In the first work, we design a colocation framework based on cooperative game theory to manage shared resource contention. Our framework uses a recommendation system to predict individual applications’ preferences for colocated tasks. It then uses these predictions to drive novel colocation mechanisms to guarantee user fairness and preserve system performance. Attractive system outcomes encourage strategic users to participate in the datacenter.
Processor allocations are inefficient when they are based on static reservations because reservations are often conservative; users rarely know their application’s needs across time, especially when applications have complex phase behavior. In the second work, we propose a fast, lightweight performance prediction framework to help users capture their phase behaviors in parallel applications. We design a dynamic and distributed core allocation framework so that users can trade resources for better efficiency based on predicted performance. Our management framework provides efficient allocations and game-theoretic fairness guarantees. In the last work, we characterize applications’ sensitivity to non-uniform memory access (NUMA) in big memory servers. We develop performance and energy models for communication costs in a blade server. We use this model to perform case studies on NUMA-aware scheduling policies and task queue management. Our parameterized models lay the foundation for the coordinated design of scheduling policies and hardware configurations. This method can be further used to design locality-aware schedulers with microeconomic models, e.g., dynamic pricing strategies for city parking.
Item Open Access Studying Recommender Systems to Enhance Distributed Computing Schedulers (2016) Demoulin, Henri Maxime. Distributed computing frameworks belong to a class of programming models that allow developers to
launch workloads on large clusters of machines. Due to the dramatic increase in the volume of
data gathered by ubiquitous computing devices, data analytic workloads have become a common
case among distributed computing applications, making Data Science an entire field of
Computer Science. We argue that a data scientist's concern lies in three main components: a dataset,
a sequence of operations to apply to this dataset, and constraints related to their work
(performance, QoS, budget, etc.). However, it is extremely difficult, without domain expertise,
to perform data science: one needs to select the right amount and type of resources, pick a
framework, and configure it. Moreover, users often run their applications in shared environments,
ruled by schedulers that expect them to specify their resource needs precisely. Owing to the
distributed and concurrent nature of these frameworks, monitoring and profiling are hard,
high-dimensional problems that keep users from making the right configuration choices and from
determining the right amount of resources they need. Paradoxically, the system gathers a large
amount of monitoring data at runtime, which remains unused.
In the ideal abstraction we envision for data scientists, the system is adaptive, able to exploit
monitoring data to learn about workloads, and process user requests into a tailored execution
context. In this work, we study techniques that have been used to move toward such
system awareness, and explore a new direction by applying machine learning
techniques to recommend a specific subset of system configurations for Apache Spark applications.
Furthermore, we present an in-depth study of Apache Spark executor configuration, which
highlights the complexity of choosing the best configuration for a given workload.
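For readers unfamiliar with these knobs, the snippet below sets the three standard executor parameters of a Spark application through the modern PySpark API; the property names are Spark's own, while the values are arbitrary and workload-dependent, which is precisely what makes the choice hard.

    from pyspark.sql import SparkSession

    # Example executor configuration; tuning these three interacting knobs
    # per workload is the problem the study highlights.
    spark = (
        SparkSession.builder
        .appName("example")
        .config("spark.executor.instances", "8")  # number of executors
        .config("spark.executor.cores", "4")      # cores per executor
        .config("spark.executor.memory", "8g")    # heap per executor
        .getOrCreate()
    )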
Item Open Access Towards Energy-Efficient Mobile Sensing: Architectures and Frameworks for Heterogeneous Sensing and Computing (2016) Fan, Songchun. Modern sensing apps require continuous and intense computation on data streams. Unfortunately, mobile devices are failing to keep pace despite advances in hardware capability. In contrast to powerful system-on-chips that evolve rapidly, battery capacities grow only marginally. This hinders the potential of long-running, compute-intensive sensing services such as image/audio processing, motion tracking, and health monitoring, especially on small, wearable devices.
In this thesis, we present three pieces of work that aim to improve the energy efficiency of mobile sensing. (1) In the first work, we study heterogeneous mobile processors that dynamically switch between high-performance and low-power cores according to tasks' performance requirements. We benchmark interactive mobile workloads and quantify the energy improvement of different microarchitectures. (2) Realizing that today's users often carry more than one mobile device, in the second work, we extend the resource boundary of individual devices by prototyping a distributed framework that coordinates multiple devices. When devices share common sensing goals, the framework schedules sensing and computing tasks according to devices' heterogeneity, improving the performance and latency of compute-intensive sensing apps. (3) In the third work, we study the power breakdown of motion sensing apps on wearable devices and show that traditional offloading schemes cannot mitigate sensing's high energy costs. We design a framework that allows the phone to take over sensing and computation by predicting the wearable's sensory data when the motions of the two devices are highly correlated. This allows the wearable to offload without communicating raw sensing data, resulting in little performance loss but significant energy savings.
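The third work's prediction idea can be sketched simply: when the phone's and the wearable's motion streams correlate strongly, fit a linear model from one to the other and let the phone stand in for the wearable's sensor. The signals, threshold, and model below are illustrative assumptions.

    import statistics

    def correlation(x, y):
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    def fit_predictor(phone, watch, min_corr=0.9):
        """Fit watch ~= a * phone + b when the motions are highly correlated;
        otherwise return None and keep sensing on the wearable."""
        r = correlation(phone, watch)
        if abs(r) < min_corr:
            return None
        a = r * statistics.stdev(watch) / statistics.stdev(phone)
        b = statistics.mean(watch) - a * statistics.mean(phone)
        return lambda x: a * x + b

    phone = [0.1, 0.4, 0.8, 1.2, 0.9]
    watch = [0.2, 0.9, 1.7, 2.5, 1.9]   # roughly 2x the phone's signal
    predict = fit_predictor(phone, watch)
    print(round(predict(1.0), 2))       # about 2.1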
Item Open Access Uncertainty propagation through software dependability models (2011) Mishra, Kesari. Systems in critical applications employ various hardware and software fault-tolerance techniques to ensure high dependability. Stochastic models are often used to analyze the dependability of these systems and assess the effectiveness of the fault-tolerance techniques employed. Measures like performance and performability of systems are also analyzed using stochastic models. These models take into account randomness in various events in the system (known as aleatory uncertainty) and are solved at fixed parameter values to obtain the measures of interest. However, in real life, the parameters of the stochastic models are themselves uncertain, as they are derived from a finite (limited) number of observations or are simply based on expert opinions. Solving the stochastic models at fixed values of the model input parameters results in estimates of model output metrics that do not take into account the uncertainty in model input parameters (known as epistemic uncertainty). In this research work, we address the computation of uncertainty in output metrics of stochastic models due to epistemic uncertainty in model input parameters, with a focus on dependability and performance models of current computer and communication systems. We develop an approach for propagating epistemic uncertainty in input parameters through stochastic dependability and performance models of varying complexity, to compute the uncertainty in the model output measures. The uncertainty propagation method can be applied to a wide range of stochastic model types with different model output measures. For simple analytic stochastic dependability models, we present a closed-form analytic method for epistemic uncertainty propagation, in which we derive closed-form expressions for the expectation, distribution, and variance of the model output metrics due to the epistemic uncertainty in the model input parameters. We analyze the results thus obtained and study their limiting behavior. For slightly more complex analytic stochastic models, where closed-form expressions for the expectation, distribution, and variance of the model output cannot be easily obtained, we present a numerical integration based method. For large and complex stochastic models, we develop a sampling-based epistemic uncertainty propagation method, which also considers dependencies in the input parameter values and improves on previous sampling-based uncertainty propagation approaches. The sampling-based method explained in this dissertation acts as a wrapper around existing models and their solution types (hence its wide applicability) and provides more robust estimates of uncertainty in the model output metrics than previous sampling-based methods. We demonstrate the applicability of the uncertainty propagation approach by applying it to analytic stochastic dependability and performance models of computer systems, ranging from simple non-state-space models with a few input parameters to large state-space models and even hierarchical models with more than fifty input parameters. We further apply the uncertainty propagation approach to stochastic models with not only analytic or analytic-numeric solutions but also those with simulative solutions.
We also consider a wide range of model output metrics, including the reliability and availability of computer systems, the response time of a web service, the capacity-oriented availability of a communication system, the security (probability of successful attack) of a network routing session, the expected number of jobs in a queueing system with breakdown and repair of servers, and the call handoff probability of a cellular wireless communication cell.
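To make the sampling-based method concrete, the sketch below propagates epistemic uncertainty through the simplest availability model: draw the uncertain input parameters from assumed distributions, solve the model at each draw, and summarize the resulting output distribution. The parameter distributions are illustrative, and the dissertation's method additionally handles dependencies among parameters, which this toy omits.

    import random
    import statistics

    def availability(mttf, mttr):
        """Simple non-state-space dependability model."""
        return mttf / (mttf + mttr)

    # Epistemic uncertainty: the inputs themselves are uncertain because they
    # derive from finitely many observations. These distributions are assumed.
    def sample_mttf():
        return random.gauss(1000.0, 100.0)        # hours

    def sample_mttr():
        return max(random.gauss(4.0, 1.0), 0.1)   # hours, truncated positive

    # Propagate: sample inputs, solve the model, aggregate the outputs.
    outputs = sorted(availability(sample_mttf(), sample_mttr())
                     for _ in range(10_000))
    print("mean:", statistics.mean(outputs))
    print("95% interval:", (outputs[250], outputs[9750]))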