Browsing by Author "Yang, Jun"
Item Open Access A model of sequential heart and composite tissue allotransplant in rats. (Plast Reconstr Surg, 2010-07) Yang, Jun; Erdmann, Detlev; Chang, JC; Komatsu, Issei; Zhang, YiXin; Wang, DanRu; Hodavance, Michael S; Hollenbeck, Scott T; Levinson, Howard; Klitzman, Bruce; Levin, LS

BACKGROUND: Some of the 600,000 patients with solid organ allotransplants need reconstruction with a composite tissue allotransplant, such as the hand, abdominal wall, or face. The aim of this study was to develop a rat model for assessing the effects of a secondary composite tissue allotransplant on a primary heart allotransplant. METHODS: Hearts of Wistar Kyoto rats were harvested and transplanted heterotopically to the neck of recipient Fisher 344 rats. The anastomoses were performed between the donor brachiocephalic artery and the recipient left common carotid artery, and between the donor pulmonary artery and the recipient external jugular vein. Recipients received cyclosporine A for 10 days only. Heart rate was assessed noninvasively. The sequential composite tissue allotransplant consisted of a 3 x 3-cm abdominal musculocutaneous flap harvested from Lewis rats and transplanted to the abdomen of the heart allotransplant recipients. The abdominal flap vessels were connected to the femoral vessels. No further immunosuppression was administered following the composite tissue allotransplant. Ten days after composite tissue allotransplantation, rejection of the heart and abdominal flap was assessed histologically. RESULTS: The rat survival rate of the two-stage transplant surgery was 80 percent. The transplanted heart rate decreased from 150 +/- 22 beats per minute immediately after transplant to 83 +/- 12 beats per minute on day 20 (10 days after stopping immunosuppression). CONCLUSIONS: This sequential allotransplant model is technically demanding. It will facilitate investigation of the effects of a secondary composite tissue allotransplant following primary solid organ transplantation and could be useful in developing future immunotherapeutic strategies.

Item Open Access Algorithms for continuous queries: A geometric approach (2013) Yu, Albert

There has been an unprecedented growth in both the amount of data and the number of users interested in different types of data. Users often want to keep track of the data that match their interests over a period of time. A continuous query, once issued by a user, maintains the matching results for the user as new data (as well as updates to the existing data) continue to arrive in a stream. However, supporting potentially millions of continuous queries is a huge challenge. This dissertation addresses the problem of scalably processing a large number of continuous queries over a wide-area network.
Conceptually, the task of supporting distributed continuous queries can be divided into two components--event processing (computing the set of affected users for each data update) and notification dissemination (notifying the set of affected users). The first part of this dissertation focuses on event processing. Since interacting with large-scale data can easily frustrate and overwhelm users, top-k queries have attracted considerable interest from the database community because they allow users to focus on only the top-ranked results. However, it is nearly impossible to find a common set of top-ranked data that everyone is interested in; therefore, users are allowed to specify their interests through different forms of preferences, such as personalized ranking functions and range selections. This dissertation presents geometric frameworks, data structures, and algorithms for answering several types of preference queries efficiently. Experimental evaluations show that our approaches outperform previous ones by orders of magnitude.
The second part of the dissertation presents comprehensive solutions to the problem of processing and notifying a large number of continuous range top-k queries across a wide-area network. Simple solutions include using a content-driven network to notify all continuous queries whose ranges contain the update (ignoring top-k), or using a server to compute only the affected continuous queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. This dissertation presents a geometric framework which allows the set of affected continuous queries to be described succinctly with messages that can be efficiently disseminated using content-driven networks. Fast algorithms are also developed to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all continuous queries.
The final component of this dissertation is the design of a wide-area dissemination network for continuous range queries. In particular, this dissertation addresses the problem of assigning users to servers in a wide-area content-based publish/subscribe system. A good assignment should consider both users' interests and locations, and balance multiple performance criteria, including bandwidth, delay, and load. This dissertation presents a Monte Carlo approximation algorithm as well as a simple greedy algorithm. The Monte Carlo algorithm jointly considers multiple performance criteria to find a broker-subscriber assignment and provides theoretical performance guarantees. Using this algorithm as a yardstick, the greedy algorithm is also shown to work well across a wide range of workloads.
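To make the broker-subscriber assignment problem above concrete, here is a minimal sketch of one plausible greedy heuristic, assuming a simple cost model (estimated delay between a subscriber and a broker plus a load-balancing penalty). It is purely illustrative and is not the dissertation's actual algorithm; the cost model, weights, and data are made up.

```python
# Illustrative sketch only: a simple greedy broker-subscriber assignment under an
# assumed cost model (delay + load penalty). Not the dissertation's actual algorithm.
def greedy_assign(subscribers, brokers, delay, load_weight=1.0):
    """Assign each subscriber to the broker minimizing delay plus a load penalty.

    subscribers: list of subscriber ids
    brokers:     list of broker ids
    delay:       dict mapping (subscriber, broker) -> estimated network delay
    """
    load = {b: 0 for b in brokers}          # current number of subscribers per broker
    assignment = {}
    for s in subscribers:
        # Cost of placing s on broker b: its delay plus a penalty for b's current load.
        best = min(brokers, key=lambda b: delay[(s, b)] + load_weight * load[b])
        assignment[s] = best
        load[best] += 1
    return assignment

# Example usage with made-up numbers.
subs = ["u1", "u2", "u3"]
brks = ["b1", "b2"]
d = {("u1", "b1"): 10, ("u1", "b2"): 30,
     ("u2", "b1"): 25, ("u2", "b2"): 5,
     ("u3", "b1"): 12, ("u3", "b2"): 14}
print(greedy_assign(subs, brks, d))
```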
Item Open Access Computational Journalism: from Answering Question to Questioning Answers and Raising Good Questions (2015) Wu, You

Our media is saturated with claims of "facts" made from data. Database research has in the past focused on how to answer queries, but has not devoted much attention to discerning more subtle qualities of the resulting claims, e.g., is a claim "cherry-picking"? This paper proposes a Query Response Surface (QRS) based framework that models claims based on structured data as parameterized queries. A key insight is that we can learn a lot about a claim by perturbing its parameters and seeing how its conclusion changes. This framework lets us formulate and tackle practical fact-checking tasks --- reverse-engineering vague claims, and countering questionable claims --- as computational problems. Within the QRS-based framework, we take one step further and propose a problem, along with efficient algorithms, for finding high-quality claims of a given form from data, i.e., raising good questions in the first place. This is achieved by using a limited number of high-valued claims to represent high-valued regions of the QRS. Besides the general-purpose high-quality claim finding problem, lead-finding can be tailored toward specific claim quality measures, also defined within the QRS framework. An example of uniqueness-based lead-finding is presented for "one-of-the-few" claims, yielding interpretable high-quality claims and an adjustable mechanism for ranking objects, e.g., NBA players, based on what claims can be made for them. Finally, we study the use of visualization as a powerful way of conveying the results of a large number of claims. An efficient two-stage sampling algorithm is proposed for generating the input for a 2D scatter plot with a heatmap, evaluating only a limited amount of data while preserving the two essential visual features, namely outliers and clusters. For all the problems, we present real-world examples and experiments that demonstrate the power of our model, the efficiency of our algorithms, and the usefulness of their results.
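As a rough illustration of the perturbation idea behind the QRS framework, the sketch below perturbs the window parameters of a hypothetical "value grew by at least p% over window [s, e]" claim and reports how often the conclusion survives. The claim template, data, and robustness measure are all assumptions for illustration, not the paper's actual QRS algorithms.

```python
# Hedged illustration of parameter perturbation: not the paper's actual QRS machinery.
def claim_holds(series, start, end, pct):
    """Hypothetical claim template: value grew by at least pct% from year start to year end."""
    return series[end] >= series[start] * (1 + pct / 100.0)

def robustness(series, start, end, pct, wiggle=2):
    """Fraction of nearby (start, end) windows for which the claim still holds."""
    trials, holds = 0, 0
    for s in range(start - wiggle, start + wiggle + 1):
        for e in range(end - wiggle, end + wiggle + 1):
            if s in series and e in series and s < e:
                trials += 1
                holds += claim_holds(series, s, e, pct)
    return holds / trials if trials else 0.0

# Toy yearly data; a "cherry-picked" claim scores low because most nearby windows fail.
values = {2008: 100, 2009: 90, 2010: 140, 2011: 95, 2012: 98, 2013: 101, 2014: 99}
print(robustness(values, 2009, 2010, 20))   # claim picked on an unusually favorable window
```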
Item Open Access Cumulon: Simplified Matrix-Based Data Analytics in the Cloud (2016) Huang, Botong

Cumulon is a system aimed at simplifying the development and deployment of statistical analysis of big data in public clouds. Cumulon allows users to program in their familiar language of matrices and linear algebra, without worrying about how to map data and computation to specific hardware and cloud software platforms. Given user-specified requirements in terms of time, monetary cost, and risk tolerance, Cumulon automatically makes intelligent decisions on implementation alternatives, execution parameters, as well as hardware provisioning and configuration settings -- such as what type of machines and how many of them to acquire. Cumulon also supports clouds with auction-based markets: it effectively utilizes computing resources whose availability varies according to market conditions, and suggests best bidding strategies for them. Cumulon explores two alternative approaches toward supporting such markets, with different trade-offs between system and optimization complexity. An experimental study is conducted to show the efficiency of Cumulon's execution engine, as well as the optimizer's effectiveness in finding the optimal plan in the vast plan space.
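The kind of provisioning decision Cumulon automates can be caricatured as a tiny search over cluster configurations: pick the cheapest configuration whose estimated completion time meets a deadline. The cost model below (perfect parallel speedup, fixed hourly prices, made-up machine types) is a deliberately naive placeholder and not Cumulon's estimator.

```python
# Naive illustration of deadline-constrained provisioning; not Cumulon's cost model.
def pick_configuration(total_work_hours, machine_types, deadline_hours, max_machines=64):
    """Return (type, count, cost) minimizing estimated cost subject to a deadline.

    machine_types: dict name -> {"speed": relative speed, "price": $/machine-hour}
    Assumes (unrealistically) perfect parallel speedup.
    """
    best = None
    for name, spec in machine_types.items():
        for n in range(1, max_machines + 1):
            time = total_work_hours / (n * spec["speed"])
            if time > deadline_hours:
                continue
            cost = n * time * spec["price"]
            if best is None or cost < best[2]:
                best = (name, n, cost)
    return best

types = {"small": {"speed": 1.0, "price": 0.10}, "large": {"speed": 4.0, "price": 0.45}}
print(pick_configuration(total_work_hours=200, machine_types=types, deadline_hours=10))
```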
Item Open Access Durability Queries on Temporal Data (2020) Gao, Junyang

Temporal data is ubiquitous in our everyday life, but tends to be noisy and often exhibits transient patterns. To make better decisions with data, we must avoid jumping to conclusions based on particular query results or observations. Instead, a useful perspective is to consider "durability", or, intuitively speaking, finding results that are robust and stand "the test of time". This thesis studies durability queries on temporal data that return durable results efficiently and effectively.
The focus of this thesis is two-fold: (1) design meaningful and practical notions of durability (and corresponding queries) on different types of temporal data, and (2) develop efficient techniques for durability query processing.
We first study sequence-based temporal datasets where each temporal object has a series of values indexed by time.
Durability queries ask for objects whose (snapshot) values were among the top k for at least some fraction of the time during a given time interval; e.g., "from 2013 to 2016, United Airlines had the highest stock price among American-based airline companies for at least 80% of the time."
Second, we consider instant-stamped temporal datasets where each data record is stamped by a time instant.
Here, durability queries look for records that stand out among nearby records (defined by a time window) and retain their supremacy for a long period of time; e.g., "On January 22, 2006, Kobe Bryant dropped 81 points against the Toronto Raptors, a scoring record that has not been broken since."
Finally, going beyond analyzing historical data, we investigate the notion of durability into the future, where durability needs to be predicted by performing stochastic simulation of temporal models.
For answering durability queries across these problem settings, we apply principled approaches to design fast, scalable algorithms and indexing methods.
Our solutions broadly combine geometric, statistical, and approximate query processing techniques to provide a meaningful balance between query efficiency and result quality, along with theoretical worst-case (or average-case) guarantees.
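A brute-force version of the sequence-based durability query described above is easy to state, and the sketch below simply follows that definition: for each object, count the fraction of time steps in the interval during which its value ranks in the top k. The indexing methods and fast algorithms are the thesis's contribution; this is just the naive baseline on toy data.

```python
# Naive baseline for sequence-based durability queries (the thesis develops fast indexes;
# this sketch only restates the definition).
def durable_top_k(series_by_object, interval, k, tau):
    """Return objects whose value is in the top k for at least a tau fraction of the interval.

    series_by_object: dict obj -> list of values indexed by time step
    interval:         (start, end) time-step range, end exclusive
    """
    start, end = interval
    counts = {obj: 0 for obj in series_by_object}
    for t in range(start, end):
        snapshot = sorted(series_by_object, key=lambda o: series_by_object[o][t], reverse=True)
        for obj in snapshot[:k]:
            counts[obj] += 1
    window = end - start
    return [obj for obj, c in counts.items() if c >= tau * window]

# Toy stock-price series; "UAL" is the top-1 object in 3 of the 4 time steps.
prices = {"UAL": [40, 42, 45, 47], "DAL": [41, 40, 43, 46], "AAL": [30, 31, 32, 33]}
print(durable_top_k(prices, (0, 4), k=1, tau=0.75))
```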
Item Open Access Handling Resource Constraints and Scalability in Continuous Query Processing (2007-12-12) Xie, Junyi

Recent years have witnessed a rapid rise of a new class of data-intensive applications in which data arrive as transient, high-volume streams. Financial data processing, network monitoring, and sensor networks are all examples of such applications. Traditional relational database systems model data as persistent relations, but for this new class of applications, it is more appropriate to model data as unbounded streams with continuously arriving tuples. The stream data model necessitates a new style of queries called continuous queries. Unlike a one-time query executed over a single finite and static database state, a continuous query continuously generates new result tuples as new stream tuples arrive. This dissertation tackles a range of challenges that arise in processing continuous queries. Specifically, for resource-constrained settings, this dissertation proposes techniques for coping with response-time and memory constraints. To scale to a large number of continuous queries running concurrently, this dissertation proposes techniques for indexing continuous queries as data, and processing and optimizing incoming stream tuples as queries over such data. A common theme underlying most of these techniques is exploiting the characteristics of the data and the continuous queries, e.g., asymmetry in the costs of processing different streams, temporal trends in the values of stream attributes, and clusteredness that arises in a large number of continuous queries.

Item Open Access Interstitial engraftment of adipose-derived stem cells into an acellular dermal matrix results in improved inward angiogenesis and tissue incorporation. (J Biomed Mater Res A, 2013-10) Komatsu, Issei; Yang, Jun; Zhang, Ying; Levin, L Scott; Erdmann, D; Klitzman, Bruce; Hollenbeck, Scott T

Acellular dermal matrices (ADM) are commonly used in reconstructive procedures and rely on host cell invasion to become incorporated into host tissues. We investigated different approaches to adipose-derived stem cell (ASC) engraftment into ADM to enhance this process. Lewis rat adipose-derived stem cells were isolated and grafted (3.0 × 10^5 cells) to porcine ADM disks (1.5 mm thick × 6 mm diameter) using either passive onlay or interstitial injection seeding techniques. Following incubation, seeding efficiency and seeded cell viability were measured in vitro. In addition, eighteen Lewis rats underwent subcutaneous placement of ADM disks, either as controls or seeded with PKH67-labeled ASCs. ADM disks were seeded with ASCs using either onlay or injection techniques. On day 7 or 14, ADM disks were harvested and analyzed for host cell infiltration. Onlay and injection techniques resulted in unique seeding patterns; however, cell seeding efficiency and cell viability were similar. In vivo studies showed significantly increased host cell infiltration toward the ASC foci following injection seeding in comparison to the control group (p < 0.05). Moreover, regional endothelial cell invasion was significantly greater in ASC-injected grafts than with onlay seeding (p < 0.05). ADM can successfully be engrafted with ASCs. Interstitial engraftment of ASCs into ADM via injection enhances regional infiltration of host cells and angiogenesis, whereas onlay seeding showed relatively broad and superficial cell infiltration.
These findings may be applied to improve the incorporation of avascular engineered constructs.

Item Open Access Optimizing Database Algorithms for Random-Access Block Devices (2015) Thonangi, Risi

The past decade has seen wide availability of solid-state drives (SSDs) in settings ranging from personal computing to enterprise storage. Their success over hard disks is driven by performance considerations and cost savings. Besides SSDs based on flash memory, there have been ongoing efforts in developing other non-volatile memory technologies such as phase-change memory and MRAM. All these technologies enable what we refer to as random-access block devices. Unlike hard disks, these devices have fast random accesses; on the other hand, their writes are more expensive than their reads. In this work, we study how to optimize database and storage algorithms for the I/O characteristics of random-access block devices. Specifically, we tackle the following three problems.
The first one is about permuting data out-of-core. While external merge sort is popular for implementing permutation on hard disks, it carries unnecessary overhead in storing and comparing keys. We propose more efficient algorithms for a useful class of permutations called Address-Digit Permutations on random-access block devices.
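One way to read "Address-Digit Permutation" is: write each source address in a fixed radix, apply a fixed permutation to its digits, and use the result as the destination address; bit-reversal and transposing a 2^p × 2^q matrix are classic instances. The sketch below, offered under that reading, applies such a permutation in memory one element at a time and ignores all of the blocking and write-cost issues the dissertation actually targets.

```python
# In-memory illustration of an Address-Digit Permutation (radix 2): the destination
# address is obtained by permuting the bits of the source address. The dissertation is
# about doing this I/O-efficiently on block devices; this sketch ignores blocking entirely.
def address_digit_permute(data, digit_perm):
    """digit_perm[i] gives the source bit position that supplies destination bit i."""
    bits = len(digit_perm)
    assert len(data) == 1 << bits
    out = [None] * len(data)
    for src in range(len(data)):
        dst = 0
        for i, j in enumerate(digit_perm):
            dst |= ((src >> j) & 1) << i
        out[dst] = data[src]
    return out

# Transpose a 4x4 matrix stored in row-major order by swapping the two row bits of the
# address with the two column bits.
flat = list(range(16))
print(address_digit_permute(flat, digit_perm=[2, 3, 0, 1]))
```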
The second problem is concurrency control for indexes on SSDs. Various indexes have been proposed for these devices, but to make such indexes practical, we must address the issue of concurrency control. We propose a novel indexing and concurrency control scheme which allows concurrent accesses during ongoing index reorganizations, and does so with minimal memory and block-level locking.
The third problem concerns log-structured merge, a popular indexing technique well-suited to random-access block devices. We show how an intelligent partial merge policy, combined with a block-preserving merge procedure, can significantly lower write traffic while preserving other advantages of log-structured merge.
Item Open Access Perturbation Analysis of Database Queries (2019) Walenz, Brett

Data-driven decision making plays a dominant role across all domains, from health, business, and government to sports. High-impact decisions, decisions that may affect an entire organization, are now being automated by analytical queries. A bank account may be shut down if a query determines there may be fraudulent activity; a sporting event may decrease ticket costs if the rate of ticket sales is not fast enough. These data-driven decisions are often ad hoc and resource intensive: a bank has to compare and analyze all users, and sporting events might use previous events to estimate an acceptable ticket sales rate.

In this dissertation, I describe efficient methods for optimizing analytic queries where the query has a complex user-defined function or join. I introduce an analysis methodology for queries called query perturbation analysis. Query perturbation analysis models a query as a collection of parameter settings rather than a collection of database relations. In this model, a query conceptually involves evaluating the same query template multiple times using perturbed parameter settings.

I begin with a discussion of the motivation for modeling complex database queries in a perturbation analysis framework. Perturbation analysis often provides an efficient and natural (to the end user) way to break down complex queries into isolated parameterized queries. Analytical queries such as analyzing performance in sports, or efficiently finding nearest neighbors to identify intrusion attacks in computer networks, are prime candidates for perturbation analysis.

I then describe the optimizations enabled by modeling a problem using perturbation analysis. I discuss four common optimizations for query perturbation analysis: parallelization, memoization, grouping, and pruning. Each optimization applies in different scenarios and under different constraints, and I provide examples of where each can be used.

I show how to tackle this problem from three distinct angles: as a distributed system, within a database system, and using approximation methods.

As a distributed system, I introduce Perada, a system for perturbation analysis which exposes these common optimizations and simplifies the development of general, ad hoc perturbation analysis jobs. I first describe how to specify and interact with the underlying optimizations, and then describe the implications of each optimization for job performance. Perada hides this complexity from the user by tuning the optimizations at runtime, through multiple stages and on/off switches. I describe this runtime strategy for balancing the tradeoffs of the various optimizations, and the performance gain over a naive always-on strategy.

Within a database system, I push perturbation analysis through a SQL optimizer and show how a certain class of queries called iceberg queries can be rewritten for an order-of-magnitude decrease in execution time. I describe the specific constraints and requirements for identifying iceberg queries and for ensuring that the rewritten execution returns the same result set as a traditional SQL compiler.

Using approximation methods, I describe solutions for approximately counting perturbation analysis queries, and extend the approach to a general class of queries with expensive filters. All approaches rely on constructing a fast, probabilistic classifier in place of the expensive filter and classifying individual perturbations. Using the trained classifier, I describe two approaches to approximately count perturbation analysis queries. Quantification learning uses the classifier directly to count the predicted labels. Sampling uses the classifier to create unequal-probability sampling schemes that produce highly accurate, low-variance estimates.

For each distinct angle, I provide empirical results that show the effectiveness of perturbation analysis in data systems. This dissertation shows that perturbation analysis is a practical alternative to traditional query execution techniques and can be used in diverse settings: distributed, SQL, and approximate.
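To make the perturbation-analysis execution model concrete, the sketch below evaluates one hypothetical query template ("did the best single-game score in games [start, start+length) reach a threshold?") over many perturbed parameter settings, and prunes settings that cannot qualify because the predicate is monotone in the threshold. The template, data, and pruning rule are illustrative assumptions, not Perada's interface or the dissertation's actual optimizations.

```python
# Illustrative-only perturbation analysis: evaluate a single query template over many
# perturbed parameter settings, with a simple monotonicity-based pruning.
def template(points, start, length, threshold):
    """Hypothetical query template: did the max over games [start, start+length) reach threshold?"""
    return max(points[start:start + length]) >= threshold

def perturbation_analysis(points, lengths, thresholds):
    """Count how many (start, length, threshold) perturbations satisfy the template."""
    satisfied = 0
    for length in lengths:
        for start in range(len(points) - length + 1):
            for threshold in sorted(thresholds):
                if not template(points, start, length, threshold):
                    break          # pruning: every larger threshold fails as well
                satisfied += 1
    return satisfied

games = [12, 31, 44, 27, 81, 35, 22, 40]      # made-up per-game scores
print(perturbation_analysis(games, lengths=[3, 5], thresholds=[30, 40, 50, 80]))
```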
Item Open Access The Tsinghua-Lancet Commission on Healthy Cities in China: unlocking the power of cities for a healthy China. (Lancet (London, England), 2018-04-17) Yang, Jun; Siri, José G; Remais, Justin V; Cheng, Qu; Zhang, Han; Chan, Karen KY; Sun, Zhe; Zhao, Yuanyuan; Cong, Na; Li, Xueyan; Zhang, Wei; Bai, Yuqi; Bi, Jun; Cai, Wenjia; Chan, Emily YY; Chen, Wanqing; Fan, Weicheng; Fu, Hua; He, Jianqing; Huang, Hong; Ji, John S; Jia, Peng; Jiang, Xiaopeng; Kwan, Mei-Po; Li, Tianhong; Li, Xiguang; Liang, Song; Liang, Xiaofeng; Liang, Lu; Liu, Qiyong; Lu, Yongmei; Luo, Yong; Ma, Xiulian; Schwartländer, Bernhard; Shen, Zhiyong; Shi, Peijun; Su, Jing; Wu, Tinghai; Yang, Changhong; Yin, Yongyuan; Zhang, Qiang; Zhang, Yinping; Zhang, Yong; Xu, Bing; Gong, Peng

Item Open Access Transparent and Efficient I/O for Statistical Computing (2012) Zhang, Yi

Statistical analysis of massive array data is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or multiple passes over the arrays to be analyzed and generating intermediate results. In the big data setting, storage and I/O efficiency is a key to efficient analytics. Because of the distinct characteristics of disk-resident arrays and the operations performed on them, we need a computing environment that is easy to use, scalable to big data, and different from traditional, CPU- and memory-centric solutions.
R is a popular computing environment for statistical/numerical data analysis. Like many such environments, R performs poorly for large datasets. This dissertation presents RIOT (R with I/O Transparency), a framework to make R programs I/O-efficient in a way transparent to users. RIOT-DB, an implementation of RIOT using a relational database system as its backend, significantly outperforms R in many big-data scenarios. RIOT users are insulated from the data management backend and I/O optimization specifics. Because of this transparency, RIOT is easy to adopt by the majority of the R users.
While RIOT-DB demonstrates the feasibility of transparent I/O efficiency and the potential of database-style inter-operator optimizations, it also reveals significant deficiencies of database systems in handling statistical computation. To improve the efficiency of array storage, RIOT uses a novel storage structure called Linearized-Array B-tree, or LAB-tree. LAB-tree supports flexible array layouts and automatically adapts to varying sparsity across parts of an array and over time. It also implements splitting strategies and update batching policies with good theoretical guarantees and/or practical performance.
While LAB-tree removes many I/O inefficiencies that arise in accessing individual arrays, programs consisting of multiple operators need further optimization. To this end, RIOT incorporates an I/O optimization framework, RIOTShare, which is able to jointly optimize I/O sharing and array layouts for a broad range of analysis tasks expressible in nested-loop forms. RIOTShare explores the middle ground between the high-level, database-style operator-based query optimization and low-level, compiler-style loop-based code optimization.
In sum, combining a transparent language binding mechanism, an efficient and flexible storage engine, and an accurate I/O sharing and array layout optimizer, RIOT provides a systematic solution for data-intensive array-based statistical computing.
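The flavor of "transparent I/O efficiency" can be imitated with a tiny deferred-evaluation layer: element-wise operations are recorded rather than executed, and a single pass over on-disk chunks applies the whole fused pipeline instead of materializing one intermediate result per operation. The chunked array, operation recording, and fusion below are illustrative assumptions only, not RIOT's actual design or API.

```python
# Toy deferred evaluation with operation fusion: a stand-in for the idea of making
# multi-step array expressions I/O-efficient. Not RIOT's actual mechanism.
class LazyVector:
    def __init__(self, chunks, ops=None):
        self.chunks = chunks            # list of lists, standing in for on-disk blocks
        self.ops = ops or []            # recorded element-wise operations

    def map(self, fn):
        return LazyVector(self.chunks, self.ops + [fn])

    def materialize(self):
        # One pass over the chunks applies all recorded operations, instead of one
        # full pass (and one intermediate result) per operation.
        out = []
        for chunk in self.chunks:
            for x in chunk:
                for fn in self.ops:
                    x = fn(x)
                out.append(x)
        return out

v = LazyVector([[1, 2, 3], [4, 5, 6]])
w = v.map(lambda x: x * 2).map(lambda x: x + 1)   # nothing is read or computed yet
print(w.materialize())                            # single fused pass: [3, 5, 7, 9, 11, 13]
```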
Item Open Access Unifying Databases and Internet-Scale Publish/Subscribe (2008-08-01) Chandramouli, Badrish

With the advent of Web 2.0 and the Digital Age, we are witnessing an unprecedented increase in the amount of information collected, and in the number of users interested in different types of information. This growth means that traditional techniques, where users poll data sources for information of interest, are no longer sufficient. Polling too frequently does not scale, while polling less often may result in users missing important updates. The alternative push technology has long been the goal of publish/subscribe systems, which proactively push updates (events) to users with matching interests (expressed as subscriptions). The push model is better suited for ensuring scalability and timely delivery of updates, important in many application domains: personal (e.g., RSS feeds, online auctions), financial (e.g., portfolio monitoring), security (e.g., reporting network anomalies), etc.
Early publish/subscribe systems were based on predefined subjects (channels), and were too coarse-grained to meet the specific interests of different subscribers. The second generation of content-based publish/subscribe systems offer greater flexibility by supporting subscriptions defined as predicates over message contents. However, subscriptions are still stateless filters over individual messages, so they cannot express queries across different messages or over the event history. The few systems that support more powerful database-style subscriptions do not address the problem of efficiently delivering updates to a large number of subscribers over a wide-area network. Thus, there is a need to develop next-generation publish/subscribe systems that unify the support for richer database-style subscription queries and flexible wide-area notification. This support needs to be complemented with robust processing and dissemination techniques that scale to high event rates and large databases, as well as to a large number of subscribers over the Internet.
The main contribution of our work is a collection of techniques to support efficient and scalable event processing and notification dissemination for an Internet-scale publish/subscribe system with a rich subscription model. We investigate the interface between event processing by a database server and notification delivery by a dissemination network. Previous research in publish/subscribe has largely been compartmentalized; database-centric and network-centric approaches each have their own limitations, and simply putting them together does not lead to an efficient solution. A closer examination of database/network interfaces yields a spectrum of new and interesting possibilities. In particular, we propose message and subscription reformulation as general techniques to support stateful subscriptions over existing content-driven networks, by converting them into equivalent but stateless forms. We show how reformulation can successfully be applied to various stateful subscriptions including range-aggregation, select-joins, and subscriptions with value-based notification conditions. These techniques often provide orders-of-magnitude improvement over simpler techniques adopted by current systems, and are shown to scale to millions of subscriptions. Further, the use of a standard off-the-shelf content-driven dissemination interface allows these techniques to be easily deployed, managed, and maintained in a large-scale system.
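One way to picture the reformulation idea: a subscription such as "notify me when a stock crosses above my threshold" depends on both the previous and the new value, so a stateless content-driven network cannot evaluate it from a bare update. If the server enriches each outgoing message with the old value as well, the network can match the subscription with a purely stateless predicate. The message fields and predicate below are an illustrative reading of the idea under assumed names, not the system's actual wire format or API.

```python
# Hedged illustration of reformulating a stateful subscription into a stateless filter
# by enriching the published message; field names are made up for the example.
last_price = {}

def reformulate(symbol, new_price):
    """Turn a raw update into a self-contained message carrying old and new values."""
    old = last_price.get(symbol, new_price)
    last_price[symbol] = new_price
    return {"symbol": symbol, "old": old, "new": new_price}

def crossing_filter(threshold):
    """Stateless predicate a content-driven network could evaluate per message."""
    return lambda msg: msg["old"] < threshold <= msg["new"]

notify_me = crossing_filter(100.0)
for price in [95.0, 98.0, 101.5, 103.0]:
    msg = reformulate("XYZ", price)
    if notify_me(msg):
        print("notify:", msg)        # fires only on the 98.0 -> 101.5 crossing
```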
Based on our findings, we have built a high-performance publish/subscribe system named ProSem (to signify the inseparability of database processing and network dissemination). ProSem uses our novel techniques for group-processing many types of complex and expressive subscriptions, with a per-event optimization framework that chooses the best processing and dissemination strategy at runtime based on online statistics and system objectives.