Browsing by Subject "Database"
Item Open Access A Data-Intensive Framework for Analyzing Dynamic Supreme Court Behavior (2012) Calloway, Timothy Joseph
Many law professors and scholars think of the Supreme Court as a black box: issues and arguments go into the Court, and decisions come out. The almost mystical nature that these researchers impute to the Court seems to be a function of the lack of hard data and statistics about the Court's decisions. Without a robust dataset from which to draw proper conclusions, legal scholars are often left only with intuition and conjecture.
Explaining the inner workings of one of the most important institutions in the United States using such a subjective approach is obviously flawed. And, indeed, data is available that can provide researchers with a better understanding of the Court's actions, but scholars have been slow in adopting a methodology based on data and statistical analysis. The sheer quantity of available data is overwhelming and might provide one reason why such an analysis has not yet been undertaken.
Relevant data for these studies is available from a variety of sources, but two in particular are of note. First, legal database provider LexisNexis provides a huge amount of information about how the Court's opinions are treated by subsequent opinions; thus, if the Court later overrules one of its earlier opinions, that information is captured by LexisNexis. Second, researchers at Washington University in St. Louis have compiled a database that provides detailed information about each Supreme Court decision. Combining these two sources into a coherent database will provide a treasure trove of results for future researchers to study, use, and build upon.
This thesis will explore a first-of-its-kind attempt to parse these massive datasets to provide a powerful tool for future researchers. It will also provide a window to help the average citizen understand Supreme Court behavior more clearly. By utilizing traditional data extraction and dataset analysis methods, many informative conclusions can be reached to help explain why the Court acts the way it does. For example, the results show that decisions decided by a narrow margin (i.e., by a 5 to 4 vote) are almost four times more likely to be overruled than unanimous decisions of the Court. Many more results like these can be synthesized from the dataset and will be presented in this thesis. Perhaps more importantly, this thesis presents a framework to predict the outcomes of future and pending Supreme Court cases using statistical analysis of the data gleaned from the dataset.
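The kind of margin-versus-overruling comparison mentioned above amounts to a group-by over the merged dataset. The sketch below is a minimal illustration in Python/pandas; the column names and values are hypothetical placeholders, not the thesis's actual schema or data.

```python
import pandas as pd

# Hypothetical merged records: Supreme Court Database vote counts joined with
# LexisNexis treatment history (column names and values are illustrative only).
cases = pd.DataFrame({
    "majVotes":  [5, 9, 5, 7, 9, 5],
    "minVotes":  [4, 0, 4, 2, 0, 4],
    "overruled": [1, 0, 0, 0, 0, 1],
})

# Overrule rate as a function of the vote margin (margin 1 = a 5-4 decision).
cases["margin"] = cases["majVotes"] - cases["minVotes"]
print(cases.groupby("margin")["overruled"].mean())
```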
In the end, this thesis strives to provide input data as well as results data for future researchers to use in studying Supreme Court behavior. It also provides a framework that researchers can use to analyze the input data to create even more results data.
Item Open Access An Information-Theoretic Analysis of X-Ray Architectures for Anomaly Detection (2018) Coccarelli, David Scott
X-ray scanning equipment currently establishes a first line of defense in the aviation security space. The efficacy of these scanners is crucial to preventing the harmful use of threatening objects and materials. In this dissertation, I introduce a principled approach to the analysis of these systems by exploring performance limits of system architectures and modalities. Moreover, I validate the use of simulation as a design tool against experimental data, and extend the use of simulation to create high-fidelity realizations of real-world system measurements.
Conventional performance analysis of detection systems confounds the effects of the system architecture (sources, detectors, system geometry, etc.) with the effects of the detection algorithm. We disentangle the performance of the system hardware and the detection algorithm so as to focus on analyzing the performance of the system hardware alone. To accomplish this, we introduce an information-theoretic approach to this problem. This approach is based on a metric derived from the Cauchy-Schwarz mutual information and is analogous to the channel-capacity concept from communications engineering. We develop and utilize a framework that can produce thousands of system simulations representative of a notional baggage ensemble. These simulations and the prior knowledge of the virtual baggage allow us to analyze the system as it relays information pertinent to a detection task.
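For intuition, the sketch below computes a Cauchy-Schwarz mutual information for a discrete joint distribution, using the Cauchy-Schwarz divergence between the joint and the product of its marginals. This is a generic textbook-style formulation offered only for illustration; the dissertation's estimator for continuous measurement models may differ.

```python
import numpy as np

def cs_mutual_information(pxy: np.ndarray) -> float:
    """Cauchy-Schwarz mutual information of a discrete joint distribution pxy.

    Uses the CS divergence between the joint p and the product of marginals q:
        D_CS(p, q) = -log( (sum p*q)^2 / (sum p^2 * sum q^2) )
    """
    pxy = pxy / pxy.sum()                # normalize the joint distribution
    px = pxy.sum(axis=1, keepdims=True)  # marginal over x
    py = pxy.sum(axis=0, keepdims=True)  # marginal over y
    q = px @ py                          # product-of-marginals distribution
    cross = (pxy * q).sum()
    return float(-np.log(cross**2 / ((pxy**2).sum() * (q**2).sum())))

# Independent variables give ~0; correlated variables give a positive value.
print(cs_mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # ~0.0
print(cs_mutual_information(np.array([[0.45, 0.05], [0.05, 0.45]])))  # > 0
```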
In this dissertation, I discuss the application of this information-theoretic approach to study variations of X-ray transmission architectures as well as novel screening systems based on X-ray scatter and phase. The results show how effective use of this metric can impact design decisions for X-ray systems. Moreover, I introduce a database of experimentally acquired X-ray data, both as a means to validate the simulation approach and as a resource ripe for further reconstruction and classification investigations. Next, I show the implementation of improvements to the ensemble representation in the information-theoretic material model. Finally, I extend the simulation tool toward high-fidelity representation of real-world deployed systems.
Item Open Access Answering and Explaining SQL Queries Privately (2022) Tao, Yuchao
Data privacy has been receiving an increasing amount of attention in recent years. While large-scale personal information is collected for scientific research and commercial products, a privacy breach is not acceptable as a trade-off. In the last decade, differential privacy has become a gold standard for protecting data privacy and has been applied in many organizations. Past work focused on building a differentially private SQL query answering system as a building block for wider applications. However, answering counting queries with joins under differential privacy remains a challenge: the join operator allows any user to have an unbounded impact on the query result, which makes it hard for differential privacy to hide the existence of a single user. On the other hand, the introduction of differential privacy to query answering also prevents users from interpreting query results correctly, since they need to distinguish the effect of differential privacy from the contribution of the data.
In this thesis, we study two problems about answering and explaining SQL queries privately. First, we present efficient algorithms to compute local sensitivities of counting queries with joins, which is an important premise for answering these queries under differential privacy. We track the sensitivity contributed by each tuple, based on which we propose a truncation mechanism that answers counting queries with joins privately with high utility. Second, we propose DPXPlain, a three-phase framework that allows users to get explanations for group-by COUNT/SUM/AVG query results while preserving differential privacy. We utilize confidence intervals to help users understand the uncertainty introduced into the query results by differential privacy, and further provide top-k explanations under differential privacy to explain the contribution of the data to the results.
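As a rough illustration of the truncation idea and of noise-aware confidence intervals, the sketch below caps each user's join contribution and answers the count with the Laplace mechanism. It is a minimal sketch under simplified assumptions (a fixed truncation threshold and known per-user counts), not the thesis's sensitivity-tracking algorithm or the DPXPlain framework itself.

```python
import numpy as np

def truncated_join_count(per_user_counts, tau, epsilon, rng=None):
    """epsilon-DP answer to a join-counting query with per-user contributions capped at tau.

    Capping each user's contribution bounds the sensitivity of the truncated
    count at tau, so Laplace noise of scale tau/epsilon suffices.
    """
    rng = np.random.default_rng() if rng is None else rng
    truncated = np.minimum(per_user_counts, tau).sum()
    noisy = truncated + rng.laplace(scale=tau / epsilon)
    # 95% interval for the added noise alone (Laplace quantiles), helping a
    # user separate DP noise from real signal in the released count.
    half_width = (tau / epsilon) * np.log(1 / 0.05)
    return noisy, (noisy - half_width, noisy + half_width)

print(truncated_join_count(np.array([3, 1, 40, 2]), tau=5, epsilon=0.5))
```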
Item Open Access BUILDING A DATABASE FOR NANOMATERIAL EXPOSURE (2015-04-23) He, Linchen
Nanomaterials are materials with properties beyond those of conventional materials, and both scientists and engineers have strong motivation to apply them in many areas. Before they are widely applied, however, it is necessary to understand their toxicity to organisms. To date, a large number of studies have explored the toxicity of nanomaterials and have greatly helped people understand how nanomaterials affect organisms. However, progress in this field is slowing because it is becoming more difficult for researchers to effectively search for the information they need. Building a user-friendly database for nanomaterials and bioactivity is the main objective of this project; it is also an effective way to address this problem by strengthening information dissemination in the field. Based on the basic database structure developed by researchers in the Center for the Environmental Implications of NanoTechnology (CEINT), exposure data for carbon nanotubes (CNTs) will be collected and imported into the database, and the database structure will be further optimized to fit the newly imported dataset.

The method of this project follows five steps: 1. Finding related studies and sources. 2. Extracting data from sources. 3. Preparing source files for the database. 4. Importing data into the MySQL database. 5. Querying data from the database.

The database consists of six sections: 1. Materials section: recording the properties of the nanomaterials tested in each study. 2. Environmental System section: describing the environmental system in which the study was conducted. 3. Biological System section: recording information about the organisms chosen for the exposure experiments. 4. Functional Assay section: recording the assays that provide parameters describing the fate or effects of nanomaterial exposure. 5. Study section: serving as the main section that connects the other parts and functionalizes the whole database. 6. Study_PI_Publication section: recording information about the primary investigator and publication, and connecting this information with the Study table.

Based on this database structure, I have imported data from 21 studies on CNTs into the database. The whole database works well, and several applications have been developed; two are introduced in detail in my project report.

Application #1: the impacts of exposing the same organism to different CNTs. Different CNTs usually have different impacts on the same organism, but most studies focus on only one or a few types of CNTs, and reviewing all published papers to understand how an organism responds to different CNT exposures would be time-consuming. The database is an effective way to reduce the time spent searching for such data. In this project, I used C. elegans as an example of this application: C. elegans had been exposed to three types of CNTs, about 359 functional assays were found, and further analysis was conducted on this selected data.

Application #2: the impacts of exposing different organisms to the same type of CNT. The same type of CNT may have different impacts on different organisms, and the database is a useful tool for addressing this question. In this project, I wanted to know how single-wall carbon nanotubes (SWCNTs) influence different organisms.
As a result, among all the data stored in my database, six organisms had been exposed to SWCNTs, and a considerable number of functional assays had been conducted after SWCNT exposure. Currently, however, the impacts of exposing different organisms to the same CNTs are not directly comparable, for two reasons. First, the CNTs used in each study are not completely identical, even though they carry the same name. Second, because of the limited amount of data, the functional assays differ across studies, so a simple comparison cannot determine which organisms are more vulnerable to CNT exposure.

This report also provides several key points about the database and recommendations for building a better database for nanomaterial exposure and boosting the development of the field of nanomaterial safety: 1. The database can help researchers avoid redundant studies and strengthen communication among them; moreover, it acts as a different kind of search engine, focusing on specific studies rather than the keywords used by conventional search methods. 2. The database structure should be further optimized to better fit newly imported datasets. 3. The data quantity can be further expanded by developing a platform that lets database users self-report their data. 4. Designing a series of standards for conducting exposure experiments and manufacturing nanomaterials will make the results of different studies more comparable; it is also an effective way to increase the usability of the datasets imported into the database. 5. Designing a series of indices that include the results of common tests (e.g., biouptake, death rate) and other important biomarkers; based on these indices, a model can be built to evaluate the toxicity of exposing an organism to a certain type of nanomaterial.
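As a rough sketch of the six-section layout described in this abstract, the snippet below creates a toy version of the schema and runs an Application #1-style query. It uses SQLite for a self-contained example (the project itself used MySQL), and all table and column names beyond the six section names are illustrative guesses rather than the project's actual schema.

```python
import sqlite3

# Toy schema mirroring the six sections described above (columns are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material             (material_id INTEGER PRIMARY KEY, cnt_type TEXT, diameter_nm REAL);
CREATE TABLE environmental_system (env_id INTEGER PRIMARY KEY, medium TEXT, ph REAL);
CREATE TABLE biological_system    (bio_id INTEGER PRIMARY KEY, organism TEXT);
CREATE TABLE study                (study_id INTEGER PRIMARY KEY, material_id INTEGER,
                                   env_id INTEGER, bio_id INTEGER);
CREATE TABLE functional_assay     (assay_id INTEGER PRIMARY KEY, study_id INTEGER,
                                   parameter TEXT, value REAL, unit TEXT);
CREATE TABLE study_pi_publication (study_id INTEGER, pi TEXT, doi TEXT);
""")

# Application #1 style query: all functional assays in which C. elegans was
# exposed to any type of CNT, grouped by the CNT type.
rows = conn.execute("""
    SELECT m.cnt_type, f.parameter, f.value, f.unit
    FROM functional_assay f
    JOIN study s             ON f.study_id = s.study_id
    JOIN material m          ON s.material_id = m.material_id
    JOIN biological_system b ON s.bio_id = b.bio_id
    WHERE b.organism = 'C. elegans'
""").fetchall()
print(rows)  # empty here, since no data has been inserted into the toy schema
```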
Item Open Access Knowledge Discovery in Databases of Radiation Therapy Treatment Planning (2017) Sheng, Yang
Radiation has been used in the medical domain for multiple purposes, and treating cancer with radiation has gained popularity over the last century. The radiation beam is directed at the tumor while the surrounding healthy tissue is avoided as much as possible; radiation therapy treatment planning serves the goal of delivering highly concentrated radiation to the treatment volume while minimizing dose to normal tissue. With the advent of more sophisticated delivery technology, treatment planning time has increased, and plan quality depends on the experience of the planner. Several computer-assistance techniques have emerged to help the treatment planning process, among which knowledge-based planning (KBP) has been successful in inverse IMRT planning. KBP falls under the umbrella of Knowledge Discovery in Databases (KDD), which originated in industry; the philosophy is to extract useful knowledge from previous applications, data, and observations to make predictions in future practice. KBP reduces the iterative trial-and-error process of manual planning and, more importantly, guarantees consistent plan quality. Despite the great potential of treatment planning KDD (TPKDD), three major challenges remain before TPKDD can be widely implemented in the clinical environment: 1. a good knowledge model requires a sufficient amount of training data to extract useful knowledge, and is therefore less efficient; 2. a knowledge model is usually applicable only to a specific treatment site and treatment technique, and is therefore less generalizable; 3. a knowledge model needs meticulous inspection to verify its robustness before it is implemented in the clinic.
This study aims to fill this niche in TPKDD and to improve the current TPKDD workflow by tackling the aforementioned challenges. The study is divided into three parts. The first part aims to improve modeling efficiency by introducing atlas-based treatment planning guidance. In the second part, an automated treatment planning technique for whole breast radiation therapy (WBRT) is proposed to provide a solution for an area that TPKDD has not yet reached. In the third part, several topics related to knowledge model quality are addressed, including improving the model training workflow, identifying geometric novelty and dosimetric outlier cases, building a global model, and facilitating incremental learning.
I. Improvement of modeling efficiency. First, a prostate cancer patient anatomy atlas was established to generate 3D dose distribution guidance for a new patient. The anatomy pattern of each prostate cancer patient was parameterized with two descriptors, so that each training case was represented in a 2D feature space. All training cases were clustered using the k-medoids algorithm, and the optimal number of clusters was determined by the largest average silhouette width. For a new case, the most similar case in the atlas was identified and used to generate dose guidance: the anatomy of the atlas case and the query case were registered, and the deformation field was applied to the 3D radiation dose of the atlas case. The deformed dose served as the goal dose for the query case, and dose-volume objectives were then extracted from the goal dose to guide inverse IMRT planning. Results showed that the plans generated with atlas guidance had dosimetric quality similar to the clinical manual plans, and the monitor units (MU) of the auto plans were also comparable to those of the clinical plans. Atlas-guided radiation therapy has proven to be effective and efficient in inverse IMRT planning.
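A minimal sketch of the atlas-construction step described above, assuming the KMedoids implementation from the scikit-learn-extra package and scikit-learn's silhouette_score; the two anatomy descriptors here are random placeholders rather than the study's actual features.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

# One row per training patient, two anatomy descriptors (placeholder values).
X = np.random.default_rng(0).normal(size=(60, 2))

# Choose the number of clusters by the largest average silhouette width.
best_k, best_score, best_model = None, -1.0, None
for k in range(2, 10):
    model = KMedoids(n_clusters=k, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

# The medoid patients form the atlas; a new case is matched to its nearest medoid,
# whose deformed dose would then serve as the goal dose.
atlas_indices = best_model.medoid_indices_
new_case = np.array([[0.1, -0.3]])
closest_atlas = atlas_indices[np.argmin(np.linalg.norm(X[atlas_indices] - new_case, axis=1))]
print(best_k, closest_atlas)
```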
II. Improvement of model generalization. An automatic WBRT treatment planning workflow was developed. First, an energy selection tool was developed based on previous single-energy and dual-energy WBRT plans: the DRR intensity histograms of the training cases were collected, and principal component analysis (PCA) was performed to reduce the dimension of the histograms. The first two components were used to represent each case, and classification was performed in the 2D space; this tool selects the appropriate energy for a new patient based on anatomical information. Secondly, an anatomy-feature-based random forest (RF) model was proposed to predict the fluence map for the patient. The model takes multiple anatomical features as input and outputs the fluence intensity of each pixel within the fluence map. Finally, a physics-rule-based method was proposed to further fine-tune the fluence map to achieve an optimal dose distribution within the irradiated volume. Extra validation cases were tested on the proposed workflow. Results showed similar dosimetric quality between the auto plans and the clinical manual plans, while the treatment planning time was reduced from 1-4 hours for manual planning to within 1 minute for auto planning. The proposed automatic WBRT planning technique has proven to be efficient.
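The sketch below illustrates the two learning components described above with placeholder data: PCA on DRR-style histograms followed by a simple classifier for energy selection, and a random forest mapping anatomical features to fluence values. The classifier choice, feature set, and array sizes are illustrative assumptions, not the dissertation's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Energy selection: project DRR intensity histograms onto the first two PCs,
# then classify in the 2D space (logistic regression used here for illustration).
histograms = rng.random((40, 64))          # one 64-bin DRR histogram per training case
energy_label = rng.integers(0, 2, 40)      # 0 = single energy, 1 = dual energy
pca = PCA(n_components=2).fit(histograms)
clf = LogisticRegression().fit(pca.transform(histograms), energy_label)

# Fluence prediction: per-pixel anatomical features -> fluence intensity.
features = rng.random((5000, 8))           # e.g., depth, distance-to-target, ... (placeholders)
fluence = rng.random(5000)
rf = RandomForestRegressor(n_estimators=100).fit(features, fluence)
print(rf.predict(rng.random((1, 8))))
```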
III. Rapid learning of radiation therapy KBP. Several topics were analyzed in this part of the study. First, a systematic workflow was established to improve KBP model quality. The workflow started by identifying geometric novelty cases using the statistical metric "leverage" and removing them; dosimetric outliers were then identified using studentized residuals and likewise removed. The cleaned model was compared with the uncleaned model using extra validation cases, with pelvic cases used as an example. Results showed that the existence of novelty and outlier cases did degrade model quality, that the proposed statistical tools can effectively identify such cases, and that the workflow is able to improve the quality of the knowledge-based model.
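For concreteness, the following sketch computes leverage (the hat-matrix diagonal) and internally studentized residuals for a linear geometry-to-dosimetry model; the cut-offs shown are common rules of thumb, not necessarily the thresholds used in the study.

```python
import numpy as np

def leverage_and_studentized_residuals(X, y):
    """Leverage flags geometric novelty; studentized residuals flag dosimetric outliers."""
    X1 = np.column_stack([np.ones(len(X)), X])     # add intercept column
    H = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T      # hat matrix of the linear fit
    h = np.diag(H)                                 # leverage of each training case
    resid = y - H @ y                              # ordinary residuals
    dof = len(y) - X1.shape[1]
    sigma2 = resid @ resid / dof
    t = resid / np.sqrt(sigma2 * (1 - h))          # internally studentized residuals
    return h, t

X = np.random.default_rng(2).normal(size=(30, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * np.random.default_rng(3).normal(size=30)
h, t = leverage_and_studentized_residuals(X, y)
novelty = h > 2 * (X.shape[1] + 1) / len(X)        # common leverage rule of thumb
outlier = np.abs(t) > 3                            # common residual rule of thumb
```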
Secondly, a clustering-based method was proposed to identify multiple geometric novelty cases and dosimetric outlier cases at the same time. A one-class support vector machine (OCSVM) was applied to the feature vectors of all training cases to learn a frontier around the inliers; cases falling outside the frontier were assigned to the novelty group. Once the novelty cases were identified and removed, robust regression followed by outlier identification (ROUT) was applied to all remaining cases to identify dosimetric outliers. A cleaned model was trained on the novelty- and outlier-free case pool and was tested using 10-fold cross-validation; the initial training pool included intentionally added outlier cases to evaluate the efficacy of the proposed method. The model prediction on the inlier cases was compared with that on the novelty and outlier cases. Results showed that the method can successfully identify geometric novelties and dosimetric outliers, and that prediction accuracy differed significantly between the inliers and the novelty/outlier cases, indicating different dosimetric behavior between the two groups. The proposed method proved to be effective in identifying multiple geometric novelty and dosimetric outlier cases.
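A minimal sketch of the two-stage cleaning idea, assuming scikit-learn's OneClassSVM for the geometric novelty step and a Huber robust regression standing in for the ROUT procedure; the data and thresholds are placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(4)
features = rng.normal(size=(80, 5))                       # geometric feature vectors (placeholders)
dose_metric = features @ rng.normal(size=5) + 0.05 * rng.normal(size=80)

# Step 1: geometric novelty detection with a one-class SVM on the feature vectors.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(features)
inlier_mask = ocsvm.predict(features) == 1                # -1 marks novelty cases

# Step 2: dosimetric outliers among the remaining cases via robust regression
# (HuberRegressor used here as a stand-in for ROUT).
huber = HuberRegressor().fit(features[inlier_mask], dose_metric[inlier_mask])
resid = dose_metric[inlier_mask] - huber.predict(features[inlier_mask])
outlier_mask = np.abs(resid) > 3 * np.std(resid)
```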
Thirdly, a global model using a model tree and a clustering-based model was proposed to include cases with different clinical conditions and indications. The model tree is a combination of a decision tree and linear regression, where all cases are branched into leaves and a regression is performed within each leaf. The clustering-based model uses the k-means algorithm to segment all cases into more homogeneous groups, and regression is then performed within each small group. The overall philosophy of both the model tree and the clustering-based method is that cases with similar features have a similar geometry-dosimetry relation, and training on cases within a narrow feature range gives better model accuracy. The proposed method proved to be effective in improving model accuracy over a model trained on all cases without segmentation.
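A small sketch of the clustering-based variant: k-means segments the cases, and a separate linear regression is fit within each cluster. The feature dimensions and number of clusters are illustrative; the model-tree variant would branch on feature thresholds instead of cluster assignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
features = rng.normal(size=(120, 4))                   # case features (placeholders)
target = features[:, 0] ** 2 + features[:, 1] + 0.1 * rng.normal(size=120)

# Segment cases into more homogeneous groups, then fit a local regression per group.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
local_models = {
    c: LinearRegression().fit(features[kmeans.labels_ == c], target[kmeans.labels_ == c])
    for c in range(4)
}

# A new case is routed to its cluster's local model for prediction.
new_case = rng.normal(size=(1, 4))
prediction = local_models[int(kmeans.predict(new_case)[0])].predict(new_case)
print(prediction)
```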
Finally, incremental learning was analyzed for the radiation therapy treatment planning model. This part of the study tries to answer the question of when model re-training should be invoked: in the clinical environment, it is often unnecessary to re-train the model whenever a new case arrives. The scenario of incrementally adapting the model was simulated using the pelvic cases with different numbers of training cases and new incoming cases. The results showed that re-training was often necessary for small training datasets and that, as the number of cases increased, re-training became less frequent.
In summary, this study addressed three major challenges in TPKDD. In the first part, an atlas-guided treatment planning technique was proposed to improve modeling efficiency. In the second part, an automatic whole breast radiation therapy treatment planning technique was proposed to tackle an area that TPKDD had not yet addressed. In the final part, outlier analysis, global model training, and incremental learning were further analyzed to facilitate rapid learning, laying the foundation for future clinical implementation of radiation therapy knowledge models.
Item Open Access Unifying Databases and Internet-Scale Publish/Subscribe (2008-08-01) Chandramouli, Badrish
With the advent of Web 2.0 and the Digital Age, we are witnessing an unprecedented increase in the amount of information collected, and in the number of users interested in different types of information. This growth means that traditional techniques, where users poll data sources for information of interest, are no longer sufficient. Polling too frequently does not scale, while polling less often may result in users missing important updates. The alternative push technology has long been the goal of publish/subscribe systems, which proactively push updates (events) to users with matching interests (expressed as subscriptions). The push model is better suited for ensuring scalability and timely delivery of updates, important in many application domains: personal (e.g., RSS feeds, online auctions), financial (e.g., portfolio monitoring), security (e.g., reporting network anomalies), etc.
Early publish/subscribe systems were based on predefined subjects (channels), and were too coarse-grained to meet the specific interests of different subscribers. The second generation of content-based publish/subscribe systems offers greater flexibility by supporting subscriptions defined as predicates over message contents. However, subscriptions are still stateless filters over individual messages, so they cannot express queries across different messages or over the event history. The few systems that support more powerful database-style subscriptions do not address the problem of efficiently delivering updates to a large number of subscribers over a wide-area network. Thus, there is a need to develop next-generation publish/subscribe systems that unify support for richer database-style subscription queries and flexible wide-area notification. This support needs to be complemented with robust processing and dissemination techniques that scale to high event rates and large databases, as well as to a large number of subscribers over the Internet.
The main contribution of our work is a collection of techniques to support efficient and scalable event processing and notification dissemination for an Internet-scale publish/subscribe system with a rich subscription model. We investigate the interface between event processing by a database server and notification delivery by a dissemination network. Previous research in publish/subscribe has largely been compartmentalized; database-centric and network-centric approaches each have their own limitations, and simply putting them together does not lead to an efficient solution. A closer examination of database/network interfaces yields a spectrum of new and interesting possibilities. In particular, we propose message and subscription reformulation as general techniques to support stateful subscriptions over existing content-driven networks, by converting them into equivalent but stateless forms. We show how reformulation can successfully be applied to various stateful subscriptions including range-aggregation, select-joins, and subscriptions with value-based notification conditions. These techniques often provide orders-of-magnitude improvement over simpler techniques adopted by current systems, and are shown to scale to millions of subscriptions. Further, the use of a standard off-the-shelf content-driven dissemination interface allows these techniques to be easily deployed, managed, and maintained in a large-scale system.
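As a toy illustration of the reformulation idea (not the ProSem design or its actual message format), consider a stateful subscription such as "notify me when the total value of my portfolio exceeds T". A stateless content-driven network cannot evaluate the aggregate itself, but the database server can compute it on each update and publish a reformulated message that the network matches with a stateless per-message predicate:

```python
from dataclasses import dataclass

@dataclass
class ReformulatedMessage:
    """Message emitted by the database server after it evaluates the stateful part
    (the per-user aggregate); attribute names are illustrative only."""
    user: str
    portfolio_value: float

def matches(sub_user: str, threshold: float, msg: ReformulatedMessage) -> bool:
    """Stateless predicate that a content-driven network can evaluate per message."""
    return msg.user == sub_user and msg.portfolio_value > threshold

# The server recomputes the aggregate on each stock update and publishes it;
# the network delivers it only to subscribers whose stateless filter matches.
msg = ReformulatedMessage(user="alice", portfolio_value=105_000.0)
print(matches("alice", 100_000.0, msg))   # True -> notification delivered
```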
Based on our findings, we have built a high-performance publish/subscribe system named ProSem (to signify the inseparability of database processing and network dissemination). ProSem uses our novel techniques for group-processing many types of complex and expressive subscriptions, with a per-event optimization framework that chooses the best processing and dissemination strategy at runtime based on online statistics and system objectives.