Browsing by Subject "data-driven"
Item Open Access: Data-driven Analysis of Heavy Quark Transport in Ultra-relativistic Heavy-ion Collisions (2019), Xu, Yingru

Heavy flavor observables provide valuable information on the properties of the hot and dense Quark-Gluon Plasma (QGP) created in ultra-relativistic heavy-ion collisions.
Previous studies have made significant progress on heavy quark in-medium interactions, energy loss, and collective behavior. Various theoretical models have been developed to describe the evolution of heavy quarks in heavy-ion collisions, but their performance is limited, as they struggle to simultaneously describe all of the experimental data.
In this thesis, I present a state-of-the-art Bayesian model-to-data analysis that calibrates a heavy quark evolution model against experimental data from different collision systems at different energies. The heavy quark evolution model couples an improved Langevin dynamics for heavy quarks to an event-by-event viscous hydrodynamical model for the expanding QGP medium, and accounts for both collisional and radiative heavy quark energy loss. By applying the Bayesian analysis to such a modularized framework, the heavy quark evolution model is able to describe the heavy flavor observables in multiple collision systems and to make predictions for unseen observables. In addition, the estimated heavy quark diffusion coefficient shows a strong positive temperature dependence and indicates strong interaction around the critical temperature.
Finally, by comparing the transport coefficients estimated by various theoretical approaches, I quantitatively evaluate the contributions from different sources of deviation, providing a reference for the theoretical uncertainties in the heavy quark transport coefficients.
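To make the model-to-data workflow concrete, here is a minimal, self-contained sketch of Bayesian calibration in the spirit of the analysis described above: a Gaussian-process emulator stands in for the full heavy quark transport simulation, and a random-walk Metropolis sampler explores the posterior of two hypothetical diffusion-coefficient parameters. The toy model, parameter names, and synthetic "data" are illustrative assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical "design": model runs at sampled parameter points -> predicted observables
design = rng.uniform([0.5, 0.0], [4.0, 2.0], size=(40, 2))       # (alpha, beta)

def transport_model(theta):
    """Stand-in for the full Langevin + hydrodynamics simulation (toy formula)."""
    a, b = theta
    return np.array([np.exp(-a * pt / 10.0) + 0.1 * b for pt in (2.0, 5.0, 10.0)])

train_y = np.array([transport_model(t) for t in design])
emulator = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
emulator.fit(design, train_y)

# Synthetic "experimental" observable (e.g. R_AA at three pT bins) with uncertainties
y_exp = transport_model([2.0, 0.8]) + rng.normal(0.0, 0.02, 3)
sigma = np.full(3, 0.02)

def log_posterior(theta):
    if not (0.5 <= theta[0] <= 4.0 and 0.0 <= theta[1] <= 2.0):  # flat prior on a box
        return -np.inf
    pred = emulator.predict(theta.reshape(1, -1))[0]
    return -0.5 * np.sum(((pred - y_exp) / sigma) ** 2)          # Gaussian likelihood

# Random-walk Metropolis over (alpha, beta)
theta, chain = np.array([2.5, 1.0]), []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.1, 2)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    chain.append(theta.copy())
print("posterior mean (alpha, beta):", np.mean(chain[5000:], axis=0))
```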
Item Open Access: Data-driven Decision Making with Dynamic Learning under Uncertainty: Theory and Applications (2022), Li, Yuexing

Digital transformation is changing the landscape of business and sparking waves of innovation, calling for advanced big data analytics and artificial intelligence techniques. To survive intensified and rapidly changing market conditions in the big data era, it is crucial for companies to sharpen their competitive advantages by wielding the power of data. This thesis provides data-driven solutions that facilitate informed decision making under various forms of uncertainty, making contributions to both theory and applications.
Chapter 1 presents a study motivated by a real-life data set from a leading Chinese supermarket chain. The grocery retailer sells a perishable product, making joint pricing and inventory ordering decisions over a finite time horizon with lost sales. However, she does not have perfect information on (1) the demand-price relationship, (2) the demand noise distribution, (3) the inventory perishability rate, and (4) how the demand-price relationship changes over time. Moreover, the demand noise distribution is nonparametric for some products but parametric for others. To help the retailer tackle these challenges, we design two versions of data-driven pricing and ordering (DDPO) policies, for the settings of nonparametric and parametric noise distributions, respectively. Measuring performance by regret, i.e., the profit loss relative to a clairvoyant policy with perfect information, we show that both versions of our DDPO policies achieve the best possible rates of regret in their respective settings. Through a case study on the real-life data, we also demonstrate that both of our policies significantly outperform the historical decisions of the supermarket, establishing the practical value of our approach. In the end, we extend our model and policy to account for age-dependent product perishability and demand censoring.
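As a rough illustration of the flavor of such a policy (not the chapter's DDPO policy itself), the sketch below runs a simple explore-then-exploit loop: it experiments with prices for a short initial window, fits the unknown linear demand-price relationship by least squares, and then prices and orders against the estimated model. The demand model, horizon, costs, and perishing rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, cost, perish_rate = 200, 2.0, 0.1             # horizon, unit cost, assumed perishing rate
true_a, true_b = 50.0, 4.0                       # hidden demand curve: D(p) = a - b*p + noise

prices_hist, demand_hist, profit = [], [], 0.0
for t in range(T):
    if t < 20:                                   # exploration: spread prices to learn the curve
        price = rng.uniform(4.0, 10.0)
        forecast = 25.0                          # crude initial demand forecast
    else:                                        # exploitation: plug in least-squares estimates
        X = np.column_stack([np.ones(len(prices_hist)), prices_hist])
        a_hat, slope = np.linalg.lstsq(X, np.array(demand_hist), rcond=None)[0]
        b_hat = max(-slope, 1e-6)
        price = float(np.clip((a_hat / b_hat + cost) / 2.0, 4.0, 10.0))  # estimated profit-maximizing price
        forecast = max(a_hat - b_hat * price, 0.0)
    order_qty = forecast * (1.0 + perish_rate)   # pad the order to cover expected perishing
    demand = max(true_a - true_b * price + rng.normal(0.0, 3.0), 0.0)    # realized demand
    profit += price * min(order_qty, demand) - cost * order_qty          # lost sales if understocked
    prices_hist.append(price)
    demand_hist.append(demand)
print("cumulative profit:", round(profit, 1))
```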
Chapter 2 discusses a work inspired by a real-life smart meter data set from a major U.S. electric utility company. The company serves retail electricity customers over a finite time horizon. Besides granular data on customers' consumption, the company has access to high-dimensional features on customer characteristics and exogenous factors. What is unique in this context is that these features exhibit three types of heterogeneity: over time, across customers, or both. They induce an underlying cluster structure and influence consumption differently in each cluster. The company knows neither the underlying cluster structure nor the corresponding consumption models. To tackle this challenge, we design a novel data-driven policy of joint spectral clustering and feature-based dynamic pricing that efficiently learns the underlying cluster structure and the consumption behavior in each cluster, and maximizes profits on the fly. Measuring performance by average regret, i.e., the profit loss relative to a clairvoyant policy with perfect information per customer per period, we derive distinct theoretical performance guarantees by showing that our policy achieves the best possible rate of regret. Our case study based on the real-life smart meter data indicates that our policy significantly increases the company's profits, by 146% over a three-month period relative to the company's policy. Our policy's performance is also robust to various forms of model misspecification. Finally, we extend our model and method to allow for temporal shifts in feature means, general cost functions, and the potential effect of strategic customer behavior on consumption.
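The sketch below illustrates the general idea of combining clustering with per-cluster price learning (it is not the chapter's actual policy or its regret analysis): customers are clustered from their feature vectors with spectral clustering, and a separate linear price-response model is fitted per cluster to set a cluster-level price. The feature dimensions, cluster count, marginal cost, and demand model are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_customers, n_features, k = 300, 10, 3
features = rng.normal(size=(n_customers, n_features))
features[:100] += 2.0                                   # synthetic cluster structure
features[100:200] -= 2.0

labels = SpectralClustering(n_clusters=k, affinity="rbf", random_state=0).fit_predict(features)

# Historical (price, consumption) observations per customer, synthetic here
prices = rng.uniform(0.08, 0.20, n_customers)
consumption = 40.0 - 100.0 * prices + 5.0 * labels + rng.normal(0.0, 2.0, n_customers)

marginal_cost = 0.05
for c in range(k):
    idx = labels == c
    model = LinearRegression().fit(prices[idx].reshape(-1, 1), consumption[idx])
    a_hat, b_hat = model.intercept_, -model.coef_[0]    # consumption ~ a - b * price
    p_star = (a_hat / b_hat + marginal_cost) / 2.0      # profit-maximizing price for this cluster
    print(f"cluster {c}: estimated price {p_star:.3f}")
```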
Chapter 3 investigates an image cropping problem faced by a large Chinese digital platform. The platform aims to crop and display a large number of images to maximize customer conversions in an automated fashion, but it does not know how cropped images influence conversions, referred to as the reward function. What the platform knows is that good cropping should capture salient objects and texts, collectively referred to as salient features, as much as possible. Due to the high dimensionality of digital images and the unknown reward function, finding the optimal cropping for a given image is a highly unstructured learning problem. To overcome this challenge, we leverage advanced deep learning techniques to design a neural network policy with two types of neural networks, one for detecting salient features and the other for learning the reward function. We then show that our policy achieves the best possible theoretical performance guarantee by deriving matching upper and lower bounds on regret. To the best of our knowledge, these results are the first of their kind for deep learning applications in revenue management. Through case studies on the real-life data set and a field experiment, we demonstrate that our policy achieves a statistically significant improvement in conversions over the platform's incumbent policy, translating into an annual revenue increase of 2.85 million U.S. dollars. Moreover, our neural network policy significantly outperforms traditional machine learning methods and exhibits good performance even if the reward function is misspecified.
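Below is a minimal PyTorch sketch of a two-network cropping policy in the spirit of the approach described above; the architectures, the saliency-coverage rule, and all thresholds are illustrative assumptions, not the platform's or the chapter's actual design. One small network scores per-pixel saliency, another predicts the reward of a candidate crop, and the policy selects the highest-reward crop among candidates that retain most of the salient mass.

```python
import torch
import torch.nn as nn

class SaliencyNet(nn.Module):                 # per-pixel saliency scores in [0, 1]
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, img):
        return self.net(img)

class RewardNet(nn.Module):                   # predicted conversion reward of a cropped image
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
    def forward(self, crop):
        return self.net(crop)

def choose_crop(img, candidates, saliency_net, reward_net, min_coverage=0.8):
    """Pick the candidate crop (top, left, h, w) with the highest predicted reward
    among crops retaining at least `min_coverage` of the salient mass."""
    with torch.no_grad():
        sal = saliency_net(img)               # shape (1, 1, H, W)
        total = sal.sum()
        best, best_reward = None, -float("inf")
        for (t, l, h, w) in candidates:
            coverage = sal[:, :, t:t + h, l:l + w].sum() / total
            if coverage < min_coverage:
                continue
            r = reward_net(img[:, :, t:t + h, l:l + w]).item()
            if r > best_reward:
                best, best_reward = (t, l, h, w), r
    return best

img = torch.rand(1, 3, 128, 128)              # dummy image
crops = [(0, 0, 96, 96), (16, 16, 96, 96), (32, 32, 96, 96)]
print(choose_crop(img, crops, SaliencyNet(), RewardNet()))
```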
Item Open Access: Data-Driven Study of Polymer-Based Nanocomposites (PNC) – FAIR Online Data Resource Development and ML-Facilitated Material Design (2023), Lin, Anqi

Polymer-based nanocomposites (PNCs) are materials consisting of nanoparticles and polymers. The enhancement of the mechanical, thermal, electrical, and other properties brought by the nanoparticles makes PNCs useful in a wide range of applications. The large surface area introduced by the nanoparticles interacts with polymer chains to form an interphase, which drives these property changes. The presence of the interphase adds to the complexity of the processing-structure-property (p-s-p) relationship of PNCs that guides material design. As conventional trial-and-error approaches in the laboratory prove time-consuming and resource-intensive, an alternative is to use data-driven methods for PNC design. However, data-driven material design suffers from data scarcity.

To tackle the data scarcity issue on a cross-community level, there has been a growing emphasis on adopting the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles. In 2016, the NanoMine data resource (which later evolved into MaterialsMine with the inclusion of metamaterials) and its accompanying schema were introduced to handle PNC data, offering a user-friendly and FAIR approach to managing these complex data and facilitating data-driven material design. To make the schema more accessible to curators and material scientists, an Excel-based customizable master template was designed for experimental data. In parallel with the long-running cumulative effort of curating experimental PNC data from the literature, simulation data can be generated and curated much faster owing to their computational nature. The NanoMine schema and template for experimental data were therefore extended to support popular simulation methods such as Finite Element Analysis (FEA), with high reuse of existing fields, demonstrating the flexibility of the schema/template approach. With the schema and template in place for NanoMine to host FEA data, an efficient and highly automated end-to-end pipeline was developed for FEA data generation. A data management system was implemented to capture the FEA data and the associated metadata, which are critical for the data to be FAIR, and a resource management system was implemented to address system restrictions. From microstructure generation all the way to packaging the data into a curation-ready format, the pipeline lives in a standardized Jupyter notebook for easier use and better bookkeeping.

FEA simulations, while faster than laboratory experiments, remain resource-intensive and are often constrained by commercial software licenses. The last part of this research therefore develops ViscoNet, an efficient, reliable, and lightweight machine learning (ML) surrogate model for FEA simulations of the viscoelastic response of PNCs. Drawing inspiration from NLP models such as GPT, ViscoNet uses pre-training and fine-tuning to reproduce FEA simulations, achieving a mean absolute percentage error (MAPE) of less than 5% for the rubbery modulus, less than 1% for the glassy modulus, and 1.22% for the tan delta peak, with as few as 500 FEA simulations for fine-tuning. ViscoNet also demonstrates impressive generalization from thermoplastics to thermosets.
ViscoNet enables the generation of over 20,000 viscoelastic (VE) responses in under 2 minutes, making it a versatile tool for high-throughput PNC design and optimization. Notably, ViscoNet does not require a GPU for training, allowing anyone with Internet access to download 500 FEA data points from NanoMine and fine-tune ViscoNet on a personal laptop, thereby making data-driven materials design accessible to a broader scientific community.
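The following is a minimal pre-train/fine-tune sketch of an MLP surrogate, in the spirit of the workflow described above but not the actual ViscoNet architecture: the input descriptors, output discretization, layer sizes, and the synthetic data standing in for the 500 FEA curves are all assumptions.

```python
import torch
import torch.nn as nn

def make_surrogate(n_inputs=6, n_outputs=50):          # 50 frequency points, hypothetical
    return nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, n_outputs))

def train(model, x, y, epochs, lr):
    """Full-batch training loop; returns the final MSE loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

model = make_surrogate()
x_pre, y_pre = torch.rand(5000, 6), torch.rand(5000, 50)      # abundant pre-training data (synthetic here)
train(model, x_pre, y_pre, epochs=200, lr=1e-3)

x_fea, y_fea = torch.rand(500, 6), torch.rand(500, 50)        # stand-in for 500 curated FEA curves
final_loss = train(model, x_fea, y_fea, epochs=200, lr=1e-4)  # fine-tune at a lower learning rate
print("fine-tuning MSE:", round(final_loss, 4))
```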
Item Open Access: Extending Probabilistic Record Linkage (2020), Solomon, Nicole Chanel

Probabilistic record linkage is the task of combining multiple data sources for statistical analysis by identifying records pertaining to the same individual in different databases. The need to perform probabilistic record linkage arises in comparative effectiveness research and other clinical research scenarios when records in different databases do not share an error-free unique patient identifier. This dissertation seeks to develop new methodology for probabilistic record linkage to address two highly practical and recurring challenges: how to implement record linkage in a manner that optimizes downstream statistical analyses of the linked data, and how to efficiently link databases having a clustered or multi-level data structure.
In Chapter 2 we propose a new framework for balancing the tradeoff between false positive and false negative linkage errors when linked data are analyzed in a generalized linear model framework and non-linked records lead to missing data for the study outcome variable. Our method seeks to maximize the probability that the point estimate of the parameter of interest will have the correct sign and that the confidence interval around this estimate will correctly exclude the null value of zero. Using large sample approximations and a model for linkage errors, we derive expressions relating bias and hypothesis testing power to the user's choice of threshold that determines how many records will be linked. We use these results to propose three data-driven threshold selection rules. Under one set of simplifying assumptions we prove that maximizing asymptotic power requires that the threshold be relaxed at least until the point where all pairs with >50% probability of being a true match are linked.
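As a simplified illustration of threshold-based linkage (not the chapter's actual data-driven selection rules), the sketch below links every candidate pair whose estimated match probability clears a threshold and reports the expected numbers of true and false links; the result above implies that the power-maximizing threshold links at least all pairs with match probability above 0.5. The probabilities here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
match_prob = rng.beta(0.5, 0.5, size=1000)          # hypothetical estimated match probabilities

def linked_pairs(match_prob, threshold):
    """Indices of candidate pairs that get linked at a given threshold."""
    return np.flatnonzero(match_prob >= threshold)

for thr in (0.9, 0.7, 0.5):
    idx = linked_pairs(match_prob, thr)
    expected_true = match_prob[idx].sum()           # expected number of true matches among links
    expected_false = len(idx) - expected_true       # expected number of false links
    print(f"threshold {thr}: {len(idx)} links, "
          f"~{expected_true:.0f} true, ~{expected_false:.0f} false")
```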
In Chapter 3 we explore the consequences of linkage errors when the study outcome variable is determined by linkage status and so linkage errors may cause outcome misclassification. This scenario arises when the outcome is disease status and those linked are classified as having the disease while those not linked are classified as disease-free. We assume the parameter of interest can be expressed as a linear combination of binomial proportions having mean zero under the null hypothesis. We derive an expression for the asymptotic relative efficiency of a Wald test calculated with a misclassified outcome compared to an error-free outcome using a linkage error model and large sample approximations. We use this expression to generate insights for planning and implementing studies using record linkage.
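For concreteness, here is a small worked example of a Wald test for a linear combination of binomial proportions, the type of estimand considered in this chapter; the chapter's asymptotic relative efficiency expression itself is not reproduced, and the counts and weights below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Two groups; parameter of interest theta = p1 - p2 (weights c = [1, -1]), zero under H0
successes = np.array([130, 110])
n = np.array([500, 500])
c = np.array([1.0, -1.0])

p_hat = successes / n
theta_hat = c @ p_hat
var_hat = np.sum(c**2 * p_hat * (1 - p_hat) / n)    # large-sample variance of theta_hat
z = theta_hat / np.sqrt(var_hat)                    # Wald statistic
p_value = 2 * norm.sf(abs(z))
print(f"theta_hat = {theta_hat:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```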
In Chapter 4 we develop a modeling framework for linking files with a clustered data structure. Linking such clustered data is especially challenging when error-free identifiers are unavailable for both individual-level and cluster-level units. The proposed approach improves over current methodology by modeling inter-pair dependencies in clustered data and producing collective link decisions. It is novel in that it models both record attributes and record relationships, and resolves match statuses for individual-level and cluster-level units simultaneously. We show that linkage probabilities can be estimated without labeled training data using assumptions that are less restrictive compared to existing record linkage models. Using Monte Carlo simulations based on real study data, we demonstrate its advantages over the current standard method.
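As background only, the sketch below computes classical Fellegi-Sunter-style log-likelihood-ratio match weights from per-field agreement patterns; the clustered, collective linkage model of Chapter 4 goes well beyond this, and the m/u probabilities and fields shown are hypothetical.

```python
import numpy as np

fields = ["surname", "birth_year", "zip"]
m = np.array([0.95, 0.90, 0.85])        # P(field agrees | records are a true match)
u = np.array([0.05, 0.10, 0.20])        # P(field agrees | records are not a match)

def match_weight(agreements):
    """Log-likelihood-ratio weight for a record pair given per-field agreement flags."""
    agreements = np.asarray(agreements, dtype=bool)
    w_agree = np.log2(m / u)
    w_disagree = np.log2((1 - m) / (1 - u))
    return np.where(agreements, w_agree, w_disagree).sum()

print(match_weight([True, True, False]))   # high weight -> likely the same individual
print(match_weight([False, False, True]))  # low weight  -> likely different individuals
```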