Browsing by Subject "Data science"
Item (Open Access): Discovering Digital Biomarkers of Glycemic Health from Wearable Sensors (2021). Bent, Brinnae.

Prediabetes is a progressive, chronic condition characterized by abnormal glucose control that affects over one third of people in the United States. While prediabetes is highly prevalent and has serious consequences, it is also severely under-diagnosed: only ten percent of those with prediabetes are aware that they have the condition, and for those who have been diagnosed, prediabetes is often poorly managed. Innovative, practical strategies to improve monitoring and management of glycemic health are desperately needed.
Non-invasive wrist-worn biometric sensors, often referred to as 'wearables,' are becoming nearly ubiquitous in the United States, with 117 million currently in use and 100% growth expected over the next three years. Because of this widespread use, wearables have important potential to aid the development of digital biomarkers that facilitate detection and monitoring of chronic diseases. Digital biomarkers are digitally collected data (e.g., heart rate measurements from a wearable) that may be used as indicators of health outcomes (e.g., prediabetes). Glucose control and variability, important contributors to prediabetes, are physiologically linked to the autonomic nervous system (ANS), and wearable sensors can noninvasively measure metrics of the ANS. This suggests that non-invasive, wrist-worn wearables are a feasible means of monitoring glycemic health and improving monitoring of prediabetes. The primary objective of this dissertation is to explore the development of digital biomarkers from wearable sensors to assess glycemic health for remote diagnosis, monitoring, and management of prediabetes.
Digital biomarker development is a rapidly growing field facing numerous challenges, including the validation and optimization of wearable sensor data and a lack of standards for wearable sensor validation and digital biomarker development. In this dissertation, we address these challenges in order to build a platform for assessing the feasibility of developing digital biomarkers of glycemic health, which would aid early detection and management of prediabetes.
In this work we present a validation and verification framework for wearable sensor data and use it to investigate sources of inaccuracy in wearable optical heart rate sensors. We found activity level, device type, and the specific device to be significant contributors to sensor inaccuracy, but showed that accuracy was not affected by skin tone.
A challenge in digital biomarker discovery is that the data must be high resolution, which is at odds with data storage costs and battery power consumption. To optimize wearable sensor data for digital biomarker discovery, we determined the optimal sampling rate for optical blood volume pulse, finding that nearly all heart rate and heart rate variability metrics are best captured at 21–64 Hz. We then built and open-sourced a wearables data compression toolbox, testing five data compression methods on five different wearable sensor data types. We incorporated this toolbox into the Digital Biomarker Discovery Pipeline, an open-source platform for developing digital biomarkers and establishing best practices for digital biomarker development, which we launched as part of this work.
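To illustrate the kind of trade-off a sampling-rate study examines, here is a minimal, hypothetical Python sketch (not the dissertation's actual pipeline): heart rate is estimated by peak counting from a synthetic, noise-free blood volume pulse at a high and a reduced sampling rate. The signal, rates, and peak-counting method are all invented for illustration.

```python
import numpy as np

def estimate_hr(signal, fs):
    """Estimate heart rate (bpm) by counting local maxima in a pulse signal."""
    peaks = np.sum((signal[1:-1] > signal[:-2]) & (signal[1:-1] > signal[2:]))
    duration_min = len(signal) / fs / 60.0
    return peaks / duration_min

# Synthetic blood volume pulse: 72 bpm (1.2 Hz) for 60 s, no noise (illustrative only).
fs_high = 64
t = np.arange(0, 60, 1 / fs_high)
bvp = np.sin(2 * np.pi * 1.2 * t)

hr_high = estimate_hr(bvp, fs_high)          # sampled at 64 Hz, ~72 bpm
hr_low = estimate_hr(bvp[::8], fs_high / 8)  # naively decimated to 8 Hz, still ~72 bpm
```

On a clean signal both rates recover the heart rate; the sampling-rate question becomes interesting once noise and finer-grained heart rate variability metrics enter the picture.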
Building upon the frameworks we developed for digital biomarker discovery, we showed the feasibility of using noninvasive wearables to estimate glucose variability metrics and hemoglobin A1c (HbA1c). We developed 11 glucose variability estimation models using non-invasive wearables data that achieved high accuracy (<10% mean absolute percentage error, MAPE). Our HbA1c estimation model using wearables data achieved a MAPE of 5.1% on an external test set and performed comparably to the American Diabetes Association's estimated HbA1c model from continuous glucose monitors and to our own continuous glucose monitor-based HbA1c estimation model. This shows the feasibility of using noninvasive wearables for HbA1c estimation, although limitations of our study include a narrow HbA1c range, resulting in our models not being significantly different from the mean model. Combining glucose variability and HbA1c estimation could greatly improve screening for prediabetes.
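MAPE, the error metric reported above, is straightforward to compute; a minimal sketch (not the thesis code, and with made-up example values):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Hypothetical HbA1c values (%): measured vs. model estimates.
error = mape([5.4, 5.9, 6.2], [5.6, 5.7, 6.5])  # a few percent
```

Because the denominator is the true value, MAPE is scale-free, which is why a single threshold like 10% is meaningful across metrics with different units.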
We incorporated all of the previous work into the final component of this dissertation: engineering putative digital biomarkers for intraday interstitial glucose prediction. To manage glucose fluctuations, patients must understand how their lifestyle habits may influence their blood glucose levels so that they can begin to appropriately manage their disease. In this final component, we demonstrated the feasibility of using noninvasive and widely accessible methods to classify glucose excursions and predict interstitial glucose values. We also showed robust methods for both data-driven and domain-driven feature engineering from noninvasive wearables. Furthermore, we compared population-approach and personalized-approach machine learning for glucose prediction and demonstrated the existence of a "crossover point" at which the accuracy of the personalized model exceeds that of the traditional population approach.
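The "crossover point" idea can be illustrated with a toy simulation (purely hypothetical data and models, not the dissertation's experiments): a population model pooled across many individuals is compared against a model fit only to one person's data, whose accuracy improves as that person's training set grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(x, y):
    """Least-squares slope through the origin."""
    return float(x @ y / (x @ x))

# Population: pooled data from individuals whose response has slope ~1.0.
pop_x = rng.uniform(0, 1, 2000)
pop_y = 1.0 * pop_x + rng.normal(0, 0.1, 2000)
pop_slope = fit_slope(pop_x, pop_y)

# Target person deviates from the population average (true slope 2.0).
true_slope = 2.0
test_x = rng.uniform(0, 1, 500)
test_y = true_slope * test_x

def personal_error(n):
    """Test error of a model fit to only n of the target person's samples."""
    x = rng.uniform(0, 1, n)
    y = true_slope * x + rng.normal(0, 0.1, n)
    s = fit_slope(x, y)
    return float(np.mean((test_y - s * test_x) ** 2))

pop_error = float(np.mean((test_y - pop_slope * test_x) ** 2))
errors = {n: personal_error(n) for n in (2, 50)}
# With enough personal data, the personalized model overtakes the
# population model for this individual: that is the crossover.
```

The mechanism is generic: the population model is biased for any individual who deviates from the average, while the personalized model starts with high variance that shrinks as personal data accumulates.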
Overall, this dissertation addresses challenges to digital biomarker development, including the validation and optimization of wearable sensor data, an absence of open-source methodologies, and a lack of standards for wearable sensor validation and digital biomarker development, in order to establish a platform for discovering digital biomarkers of glycemic health. We showed the feasibility of estimating metrics of glycemic health using non-invasive wearable sensors, and the utility of digital biomarkers in the classification and prediction of interstitial glucose for intraday glycemic health monitoring and management. Because wearables are prevalent in the general population, leveraging them for glycemic health monitoring could represent a major advancement in the early detection, monitoring, and self-management of prediabetes.
Item (Open Access): Efficient Inference for High Dimensional Data Under Physical and Human Constraints (2017). Hunt, Xin Jiang.

Big data has become ubiquitous due to advances in modern sensors: high-resolution cameras capture millions of pixels every fraction of a second, both on the ground and from satellites; high-throughput experiments in biology and the physical sciences generate terabytes of data every day; people post on average 350,000 tweets on Twitter every minute. Big data problems are inherently different from traditional signals due to a few key salient features: high Volume, high Velocity, and high Variety. These three "V"s are the major challenges modern data science faces.
The volume of big data is reflected in both the number of data points, and the dimensionality of each data point. Large numbers of data points put hard constraints on the computational and space complexities of the systems, while high-dimensional data results in the classical "curse of dimensionality". The problem is further complicated by the fact that high-volume data often lacks meaningful labels or thorough annotations, which can make high-dimensional problems ill-posed even when large quantities of data are available.
The velocity of data refers to the speed of data acquisition in streaming data. For instance, commercial video systems usually work at thirty to sixty frames per second, while the new high-speed camera at MIT can capture a stunning one trillion frames per second. High-velocity data requires the system to be both efficient and "online", i.e., able to update models and estimates on the fly.
The variety of data includes both data types and data dynamics. Big data often come from multiple sources. For example, healthcare records may include numerical test readings, ultrasound, CT, and MRI images, and textual symptom descriptions. A person's Facebook profile is often composed of various types of data such as videos, images, text, and social interactions. Moreover, the distribution of data can change with time or location, and different applications may have various physical and human constraints that impose further dynamics on the systems. As a result, efficient methods not only need to work with multiple data sources, but also need to adapt to potential dynamics within the data.
Data science focuses on extracting useful information from such challenging data. Most existing methods and analyses fail in the big data setting because they do not account for dynamic environments, limited quantities of labeled data, physical models, or other system constraints. This dissertation describes methods that account for these challenges, and the novel insights that result from them.
The first contribution of this dissertation is minimax lower and upper bounds for high-dimensional Poisson inverse problems under physical constraints. In this problem, high dimensionality prevails, and physical constraints invalidate classical measurement matrices.
In addition to the bounds, a novel alternative analysis approach and a weighted LASSO estimator for sparse Poisson inverse problems are proposed to sidestep the technical challenges present in previous work. The next contribution is a method for online data thinning, in which large-scale streaming datasets are winnowed to preserve unique, anomalous, or salient elements for timely expert analysis. This application is challenged by the dimension and velocity of the data, as well as a highly dynamic environment. The last contribution is the development of a real-time interactive search system and an empirical evaluation of a new search algorithm against various state-of-the-art search algorithms on both simulated and real users. The main challenges in this application are the high data volume, unlabeled data, a finite time horizon, and the low processing time demanded by human interaction.
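As a rough illustration of the weighted LASSO idea (a generic sketch solved by iterative soft-thresholding, not the estimator or analysis developed in the dissertation), one can minimize 0.5||y - Ax||^2 + sum_i w_i|x_i|, where per-coordinate weights w_i allow the penalty to reflect, e.g., physical constraints on the measurement matrix. The problem sizes and data below are invented.

```python
import numpy as np

def weighted_lasso_ista(A, y, w, n_iter=500):
    """Weighted LASSO via ISTA: min_x 0.5*||y - Ax||^2 + sum_i w_i*|x_i|."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    t = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - t * grad
        # Per-coordinate soft-thresholding with weights w_i.
        x = np.sign(z) * np.maximum(np.abs(z) - t * w, 0.0)
    return x

# Toy sparse recovery problem (noiseless, for the sketch).
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20)) / np.sqrt(50)
x_true = np.zeros(20)
x_true[[3, 7, 12]] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = weighted_lasso_ista(A, y, w=np.full(20, 0.01))  # recovers the sparse support
```

With small uniform weights this reduces to ordinary LASSO; non-uniform weights are where the "weighted" analysis earns its keep.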
Item (Open Access): STA 112, Data Science, Statcast (2016-12-12). Coleman, Jake; Rundel, Colin.

In this Data Exploration, students were introduced to the Statcast baseball dataset, downloaded from baseballsavant.mlb.com, which included every pitch thrown in the first week of the 2016 season, with 21 characteristics per pitch. The students were tasked with using the R packages dplyr and ggplot2 to answer data exploration and summarization questions. The exercises challenged them to use information about the data as well as newly acquired computation skills.

The Statcast data is owned by MLB Advanced Media, L.P. and was downloaded from a search performed on baseballsavant.mlb.com for all pitches from 4/1/16 to 4/7/16. Statcast is a relatively new dataset (introduced in 2015) that includes all pitch characteristics from its precursor PitchF/X (such as pitch movement, type, and start and end velocity). Statcast also added tracking of the ball during the entirety of the play, as well as tracking for all fielders. Full Statcast data is not yet available to the public, but Baseball Savant gives the public access to Statcast-added batted ball variables such as launch angle and batted ball speed.

dplyr is an extremely powerful tool for exploring data, using simple structure to perform complex data management tasks. Students were introduced to dplyr in a previous lecture and used the Statcast data to gain hands-on experience working with data. Their tasks ranged from simple summaries to sophisticated manipulation (as real data is rarely in perfect form for the desired analysis). They also used the R package ggplot2 to visualize some of their findings and draw further conclusions.
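The kinds of dplyr tasks described here (grouping, summarizing, filtering) translate directly across tools. Since the examples in this listing are sketched in Python, here is a hypothetical pandas analogue of one such summary on made-up, Statcast-like rows; the column names and values are invented for illustration and are not the students' actual data or code.

```python
import pandas as pd

# Hypothetical rows shaped like the per-pitch Statcast data.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FF", "CU", "CU", "SL"],
    "start_speed": [95.1, 93.7, 78.2, 79.8, 86.0],
    "launch_angle": [12.0, None, 25.5, None, 8.1],
})

# Analogue of dplyr's group_by() + summarize(): mean start speed per pitch type.
speed_by_type = (
    pitches.groupby("pitch_type")["start_speed"]
    .mean()
    .sort_values(ascending=False)
)

# Analogue of dplyr's filter(): keep only pitches with a recorded launch angle.
batted = pitches.dropna(subset=["launch_angle"])
```

The same pipeline shape (data in, grouped summary out, then a filter for follow-up plotting) is what the ggplot2 visualizations in the assignment would consume.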