dc.contributor.advisor Yang, Jun en_US
dc.contributor.author Zhang, Yi en_US
dc.date.accessioned 2012-05-25T20:09:09Z
dc.date.available 2013-05-20T04:30:05Z
dc.date.issued 2012 en_US
dc.identifier.uri http://hdl.handle.net/10161/5431
dc.description Dissertation en_US
dc.description.abstract <p>Statistical analysis of massive array data is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or more passes over the arrays to be analyzed and generating intermediate results. In the big data setting, storage and I/O efficiency are key to efficient analytics. Because of the distinct characteristics of disk-resident arrays and the operations performed on them, we need a computing environment that is easy to use, scalable to big data, and different from traditional, CPU- and memory-centric solutions.</p><p>R is a popular computing environment for statistical/numerical data analysis. Like many such environments, R performs poorly for large datasets. This dissertation presents RIOT (R with I/O Transparency), a framework to make R programs I/O-efficient in a way transparent to users. RIOT-DB, an implementation of RIOT using a relational database system as its backend, significantly outperforms R in many big-data scenarios. RIOT users are insulated from the data management backend and I/O optimization specifics. Because of this transparency, RIOT is easy for most R users to adopt.</p><p>While RIOT-DB demonstrates the feasibility of transparent I/O efficiency and the potential of database-style inter-operator optimizations, it also reveals significant deficiencies of database systems in handling statistical computation. To improve the efficiency of array storage, RIOT uses a novel storage structure called Linearized-Array B-tree, or LAB-tree. LAB-tree supports flexible array layouts and automatically adapts to varying sparsity across parts of an array and over time. It also implements splitting strategies and update batching policies with good theoretical guarantees and/or practical performance.</p><p>While LAB-tree removes many I/O inefficiencies that arise in accessing individual arrays, programs consisting of multiple operators need further optimization. To this end, RIOT incorporates an I/O optimization framework, RIOTShare, which jointly optimizes I/O sharing and array layouts for a broad range of analysis tasks expressible in nested-loop form. RIOTShare explores the middle ground between high-level, database-style operator-based query optimization and low-level, compiler-style loop-based code optimization.</p><p>In sum, by combining a transparent language binding mechanism, an efficient and flexible storage engine, and an accurate I/O sharing and array layout optimizer, RIOT provides a systematic solution for data-intensive array-based statistical computing.</p> en_US
dc.subject Computer science en_US
dc.subject Databases en_US
dc.subject Input/Output en_US
dc.subject Polyhedral optimization en_US
dc.subject R en_US
dc.subject Scientific computing en_US
dc.subject Statistical computing en_US
dc.title Transparent and Efficient I/O for Statistical Computing en_US
dc.type Dissertation en_US
dc.department Computer Science en_US
duke.embargo.months 12 en_US
