Efficient Inference for High Dimensional Data Under Physical and Human Constraints
Big data has become ubiquitous thanks to advances in modern sensing: high-resolution cameras, both on the ground and aboard satellites, capture millions of pixels every fraction of a second; high-throughput experiments in biology and the physical sciences generate terabytes of data every day; and people post, on average, 350,000 tweets on Twitter every minute. Big data problems are inherently different from traditional signal processing problems due to a few salient features: high Volume, high Velocity, and high Variety. These three "V"s are the major challenges facing modern data science.
The volume of big data is reflected in both the number of data points and the dimensionality of each point. Large numbers of data points place hard constraints on the computational and space complexity of processing systems, while high dimensionality leads to the classical "curse of dimensionality". The problem is further complicated by the fact that high-volume data often lacks meaningful labels or thorough annotations, which can render high-dimensional problems ill-posed even when large quantities of data are available.
The velocity of data refers to the speed of data acquisition in streaming settings. For instance, commercial video systems typically operate at thirty to sixty frames per second, while a high-speed camera developed at MIT can capture a stunning one trillion frames per second. High-velocity data requires systems to be both efficient and "online", i.e., able to update models and estimates on the fly.
The variety of data encompasses both data types and data dynamics. Big data often comes from multiple sources. For example, a healthcare record may include numerical test readings, ultrasound, CT, and MRI images, and textual symptom descriptions. A person's Facebook profile often comprises several data types, such as videos, images, text, and social interactions. Moreover, the distribution of the data can change with time or location, and different applications may have physical and human constraints that impose further dynamics on the system. As a result, efficient methods must not only handle multiple data sources but also adapt to potential dynamics within the data.
Data science focuses on extracting useful information from such challenging data. Most existing methods and analyses fail in the big data setting because they do not account for dynamic environments, limited quantities of labeled data, physical models, or other system constraints. This dissertation describes methods that account for these challenges, along with the novel insights those methods yield.
The first contribution of this dissertation is a set of minimax lower and upper bounds for high-dimensional Poisson inverse problems under physical constraints. In this setting, high dimensionality prevails, and physical constraints invalidate classical measurement matrices.
In addition to the bounds, a novel alternative analysis approach and a weighted LASSO estimator for sparse Poisson inverse problems are proposed to sidestep technical challenges present in previous work. The next contribution is a method for online data thinning, in which large-scale streaming datasets are winnowed to preserve unique, anomalous, or salient elements for timely expert analysis; this application is challenged by the dimension and velocity of the data, as well as a highly dynamic environment. The last contribution is the development of a real-time interactive search system and an empirical evaluation, with both simulated and real users, of a new search algorithm alongside various state-of-the-art search algorithms. The main challenges in this application are the high data volume, unlabeled data, a finite time horizon, and the short processing times demanded by human interaction.
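To give a flavor of the weighted LASSO idea, the sketch below minimizes a generic weighted-l1 objective, 0.5*||y - Ax||^2 + sum_i w_i |x_i|, via iterative soft-thresholding (ISTA). This is a standard formulation and solver, not the dissertation's Poisson-specific estimator or analysis; the function name, the toy data, and the choice of a least-squares data-fit term are illustrative assumptions.

```python
import numpy as np

def weighted_lasso(A, y, weights, n_iter=500):
    """Minimize 0.5*||y - A x||^2 + sum_i weights[i]*|x[i]|
    via ISTA (proximal gradient with soft-thresholding)."""
    n, p = A.shape
    # Step size = 1 / Lipschitz constant of the quadratic term's gradient
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(p)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)          # gradient of the data-fit term
        z = x - step * grad               # gradient step
        # Per-coordinate soft-thresholding: larger weights shrink harder
        x = np.sign(z) * np.maximum(np.abs(z) - step * weights, 0.0)
    return x

# Toy example: recover a 2-sparse vector from noisy linear measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[3], x_true[10] = 2.0, -1.5
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = weighted_lasso(A, y, weights=np.full(20, 0.1))
```

Allowing a different weight per coordinate is what distinguishes this from the ordinary LASSO: coordinates with more reliable measurements (or stronger prior support) can be penalized less, which is the lever the dissertation's estimator exploits in the Poisson setting.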
High-dimensional data
Statistical signal processing
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations