Automating Memory Management in Data Analytics
Date
2019
Abstract
Recent years have seen unprecedented growth in the volume, velocity, and variety of the data managed by data analytics platforms. At the same time, the pool of skilled IT staff required to develop and operate datacenters is growing at a much slower pace. This trend has created strong interest in making data analytics platforms more autonomic (or, more popularly, self-driving). There are, however, several major challenges in this task. First, multiple 'one-size' systems need to coexist and cooperate in order to support a variety of computation needs such as log processing, business predictions, and real-time analysis. Second, cluster resources are managed at multiple levels, exhibiting complex interactions among the many distributed system components. Finally, multiple tenants share a cluster, each with specific performance expectations, which restricts opportunities for optimal use of resources.
We have built an integrated management platform, called Thoth, that provides a data-centric view over the data analytics system environment. We use this platform to develop multiple auto-tuning algorithms that help systems meet their performance goals. We focus specifically on memory-based data analytics, given the growing size, and in effect more aggressive use, of memory in data processing systems. Our first contribution is a cache manager targeted at multi-tenant cluster setups. It supports a novel fairness model that gives tenants guarantees on the performance speedups experienced by their workloads.
Our second contribution is the automatic tuning of memory management decisions taken at multiple levels during an application's execution. We approach this problem in two ways: (i) black-box modeling assisted by knowledge of system internals, and (ii) an empirically driven white-box approach. The two algorithms we have developed significantly improve on state-of-the-art tuning techniques, while exhibiting different trade-offs between convergence guarantees and speed of optimization.
We expect the work presented here to act as a major step towards building self-driving data processing systems, motivating further work in automating components such as the physical design of data storage and root cause analysis of performance problems.
Citation
Kunjir, Mayuresh (2019). Automating Memory Management in Data Analytics. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/18767.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.