Browsing by Subject "Representation learning"
Item Open Access: Improving Clinical Prediction Models with Statistical Representation Learning (2021), Xiu, Zidi
This dissertation studies novel statistical machine learning approaches for healthcare risk prediction in the presence of challenging scenarios such as rare events, noisy observations, data imbalance, missingness, and censoring. Such scenarios arise frequently in practice and compromise the validity of standard predictive models, which typically expect clean and complete data. Alleviating the negative impact of these real-world data challenges is therefore of great significance and constitutes the overarching goal of this dissertation, which investigates novel strategies to (i) account for data uncertainties and the statistical characteristics of low-prevalence events; (ii) re-balance and augment the representations of minority data under proper causal assumptions; and (iii) dynamically score and attend to the observed units to derive robust features. By integrating ideas from representation learning, variational Bayes, causal inference, and contrastive training, this dissertation builds risk modeling frameworks that are robust to the various peculiarities of real-world datasets and yield reliable individualized risk evaluations.
This dissertation begins with a systematic review of classical risk prediction models in Chapter 1 and discusses the new opportunities and challenges presented by the big-data era. With the increasing availability of healthcare data and the rapid development of machine learning models, clinical decision support systems have new opportunities to improve clinical practice. However, in healthcare risk prediction applications, statistical analysis is challenged not only by data incompleteness and skewed distributions but also by the complexity of the inputs. To motivate the subsequent developments, the chapter discusses the limitations of risk minimization methods, robustness to high-dimensional incomplete data, and the need for individualization.
As a concrete example addressing a canonical problem, Chapter 2 proposes a variational disentanglement approach that learns semi-parametrically from heavily imbalanced binary classification datasets. In this new method, named Variational Inference for Extremals (VIE), we apply extreme value theory to enable efficient learning from few observations. By organically integrating generalized additive models and isotonic neural nets, VIE enjoys improved robustness, interpretability, and generalizability for the accurate prediction of rare events. An analysis of the COVID-19 cohort from Duke Hospitals demonstrates that the proposed approach outperforms competing solutions.
Chapter 3 investigates the more general setting of multi-class classification with heavily imbalanced data, from the perspective of causal machine learning, to promote sample efficiency and model generalization. Our solution, named Energy-based Causal Representation Transfer (ECRT), posits a meta-distributional scenario in which the data-generating mechanism for label-conditional features is invariant across labels. This causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even when their feature distributions show apparent disparities. It allows us to leverage a causally informed data augmentation procedure based on nonlinear independent component analysis to enrich the representations of minority classes while simultaneously whitening the data. The effectiveness and enhanced prediction accuracy are demonstrated on synthetic data and real-world benchmarks against state-of-the-art models.
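As a hedged sketch of how extreme value theory is commonly applied to rare events, and not the VIE method itself, one can fit a generalized Pareto distribution to threshold exceedances of a risk score and read off tail probabilities; all names, thresholds, and data below are purely illustrative.

```python
# Illustrative sketch only: peaks-over-threshold modeling with a generalized
# Pareto distribution (GPD), a standard extreme-value-theory tool for rare
# events. This is NOT the VIE model from Chapter 2; the data and numbers are
# made up for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
risk_scores = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # synthetic scores

threshold = np.quantile(risk_scores, 0.99)                # model only the extreme tail
exceedances = risk_scores[risk_scores > threshold] - threshold

# Fit a GPD to the exceedances (location fixed at 0 by construction).
shape, loc, scale = stats.genpareto.fit(exceedances, floc=0.0)

# Estimated probability that a new score exceeds some extreme level x0.
x0 = 2 * threshold
p_exceed_threshold = (risk_scores > threshold).mean()
p_tail = p_exceed_threshold * stats.genpareto.sf(x0 - threshold, shape, loc=0.0, scale=scale)
print(f"P(score > {x0:.2f}) ~ {p_tail:.2e}")
```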
Chapter 4 deals with time-to-event prediction under censored (missing outcome) and incomplete (missing covariate) observations. Also known as survival analysis, time-to-event prediction plays a crucial role in many clinical applications, yet classical survival solutions scale poorly with data complexity and incompleteness. To better handle sophisticated modern health data and alleviate the impact of real-world data challenges, we introduce a self-attention-based model, Energy-based Latent Self-Attentive Survival Analysis (ELSSA), that captures information helpful for time-to-event prediction. A key novelty of this approach is the integration of a contrastive, mutual-information-based loss that non-parametrically maximizes the informativeness of the learned data representations. The effectiveness of our approaches has been extensively validated on synthetic and real-world benchmarks, showing improved performance relative to competing solutions.
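The abstract does not spell out the exact form of ELSSA's contrastive mutual-information loss; as a hedged sketch, an InfoNCE-style estimator, a common contrastive lower bound on mutual information between paired representations, could look like the following.

```python
# Sketch of an InfoNCE-style contrastive loss, a common lower bound on mutual
# information between two views of the same items. ELSSA's actual loss may
# differ; this only illustrates the general idea.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) paired representations; row i of z1 matches row i of z2."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)       # matched pairs are the positives

# Example: two noisy views of the same latent codes.
z = torch.randn(32, 64)
loss = info_nce(z + 0.1 * torch.randn_like(z), z + 0.1 * torch.randn_like(z))
```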
In summary, this dissertation presents flexible risk prediction frameworks that acknowledge representation uncertainty, data heterogeneity, and incompleteness. Altogether, it makes three contributions: more efficient learning from imbalanced data, enhanced robustness to missing data, and better generalization to out-of-sample subjects.
Item Open Access: LEARNING BOTH EXPERT AND UNIVERSAL KNOWLEDGE USING TRANSFORMERS (2022), Li, Yuan
The Transformer has demonstrated superior performance in various natural language processing (NLP) tasks, including machine translation, language understanding, and text generation. Its multi-head attention mechanism provides strong flexibility for fusing contextual information and therefore facilitates long-range relation modeling. Further, Transformers have proved effective for learning universal knowledge at scale; representative models are BERT, GPT, and their subsequent variants. The Transformer has also been observed to be more tolerant of convergence plateaus and capable of scaling to more than one hundred billion parameters.
Despite these advances, we believe that the Transformer can be pushed further toward the two extremes of knowledge learning: expert knowledge and universal knowledge. On the one hand, professional knowledge, such as medical knowledge accumulated by humans through extensive education and practice, plays a vital role in professional disciplines. However, because expert knowledge comes in various forms (e.g., knowledge graphs, textual templates, and tables of statistics) and different Transformer models must currently be developed for different forms of knowledge, there is an urgent need for a unified framework that efficiently encodes and decodes different types of knowledge. On the other hand, learning universal knowledge requires substantial training data and a large model size to absorb the information in unlabeled data in a self-supervised manner. However, existing self-supervised language models lack a structured encoding of the input and therefore fail to generate plausible text in a controllable way. Moreover, learning from high-dimensional input, such as image pixels, is challenging for the Transformer because of the heavy computational cost and the sparse semantic information carried by individual pixels.
In this proposal, we address these challenges by first defining a unified formulation for acquiring both expert and universal knowledge and then developing several novel Transformer models and their variants, including Graph Transformers, Variational Autoencoders (VAEs) implemented with the Transformer architecture, and Visual-Linguistic Masked Autoencoders (VL-MAEs) for learning visual representations with additional language supervision. The techniques developed in this proposal will alleviate the burden and lower the barrier to entry of learning with universal knowledge and expertise for ML researchers and practitioners, and will also reduce the cost of research.
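As a minimal, hedged illustration of the multi-head attention mechanism the abstract refers to (not any of the dissertation's specific Graph Transformer, VAE, or VL-MAE architectures), PyTorch's built-in module can perform self-attention over a token sequence.

```python
# Minimal self-attention example using PyTorch's built-in multi-head attention.
# The dissertation's models are far more involved; this only shows how attention
# fuses contextual information across a sequence of tokens.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
tokens = torch.randn(2, 16, 64)                   # (batch, sequence length, embedding dim)
fused, weights = attn(tokens, tokens, tokens)     # self-attention: queries, keys, values all equal
print(fused.shape, weights.shape)                 # (2, 16, 64), (2, 16, 16)
```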
Item Open Access: Modeling Time Series and Sequences: Learning Representations and Making Predictions (2015), Lian, Wenzhao
The analysis of time series and sequences has long been challenging in both the statistics and machine learning communities because of properties such as high dimensionality, pattern dynamics, and irregular observations. In this thesis, novel methods are proposed to handle these difficulties, enabling representation learning (dimension reduction and pattern extraction) and prediction (classification and forecasting). The thesis consists of three main parts.
The first part analyzes multivariate time series, which is often non-stationary due to high levels of ambient noise and various interferences. We propose a nonlinear dimensionality reduction framework using diffusion maps on a learned statistical manifold, which gives rise to the construction of a low-dimensional representation of the high-dimensional non-stationary time series. We show that diffusion maps, with affinity kernels based on the Kullback-Leibler divergence between the local statistics of samples, allow for efficient approximation of pairwise geodesic distances. To construct the statistical manifold, we estimate time-evolving parametric distributions by designing a family of Bayesian generative models. The proposed framework can be applied to problems in which the time-evolving distributions (of temporally localized data), rather than the samples themselves, are driven by a low-dimensional underlying process. We provide efficient parameter estimation and dimensionality reduction methodology and apply it to two applications: music analysis and epileptic-seizure prediction.
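As a hedged, simplified sketch of the diffusion-maps construction described above: the dissertation estimates time-evolving distributions with Bayesian generative models, whereas the illustration below substitutes per-window histograms as stand-in local statistics and uses a symmetrized KL divergence to build the affinity kernel.

```python
# Rough sketch of diffusion maps with an affinity kernel based on a symmetrized
# KL divergence between locally estimated distributions. Histograms here are a
# crude stand-in for the learned statistical manifold in the dissertation.
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def diffusion_map(local_dists, n_components=2, epsilon=1.0):
    n = len(local_dists)
    d = np.array([[symmetric_kl(local_dists[i], local_dists[j]) for j in range(n)]
                  for i in range(n)])
    k = np.exp(-d / epsilon)                      # affinity kernel
    p = k / k.sum(axis=1, keepdims=True)          # row-normalized Markov transition matrix
    eigvals, eigvecs = np.linalg.eig(p)
    order = np.argsort(-eigvals.real)[1:n_components + 1]   # skip the trivial eigenvector
    return eigvecs.real[:, order] * eigvals.real[order]     # low-dimensional embedding

# Example: local histograms from sliding windows of a nonstationary series.
rng = np.random.default_rng(0)
series = np.concatenate([rng.standard_normal(500), 3 + rng.standard_normal(500)])
windows = series.reshape(20, 50)
hists = [np.histogram(w, bins=10, range=(-4, 7))[0].astype(float) for w in windows]
hists = [h / h.sum() for h in hists]
embedding = diffusion_map(hists)
```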
The second part focuses on a time series classification task in which we want to leverage temporal dynamics in the classifier design. In many time series classification problems, including fraud detection, a low false alarm rate is required while the detection rate is kept as high as possible. We therefore directly optimize the partial area under the curve (PAUC), which measures accuracy in the low false alarm rate region. Latent variables are introduced to incorporate the temporal information while keeping the max-margin formulation tractable. An optimization routine is proposed and its properties analyzed; the algorithm is designed to scale to web-scale data. Simulation results demonstrate the effectiveness of optimizing performance in the low false alarm rate region.
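To make the PAUC notion concrete, here is a hedged, evaluation-only sketch using scikit-learn's standardized partial AUC over a low false-alarm-rate range; the dissertation's contribution is a scalable max-margin method that optimizes this quantity directly with latent temporal variables, which this snippet does not attempt.

```python
# Illustration of the partial AUC (PAUC): area under the ROC curve restricted to
# a low false-positive-rate region. Data and scorer below are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.7 + rng.normal(scale=0.5, size=1000)   # a noisy scorer

full_auc = roc_auc_score(y_true, y_score)
# Standardized partial AUC over the false-positive-rate range [0, 0.1].
pauc_low_fpr = roc_auc_score(y_true, y_score, max_fpr=0.1)
print(full_auc, pauc_low_fpr)
```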
The third part focuses on pattern extraction from correlated point process data, which consist of multiple correlated sequences observed at irregular times. The analysis of correlated point process data has wide applications, ranging from biomedical research to network analysis. We model such data as generated by a latent collection of continuous-time binary semi-Markov processes, corresponding to external events appearing and disappearing. A continuous-time modeling framework is more appropriate for multichannel point process data than a binning approach requiring time discretization, and we show connections between our model and recent ideas from the discrete-time literature. We describe an efficient MCMC algorithm for posterior inference, and apply our ideas to both synthetic data and a real-world biometrics application.
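For intuition only, a hedged toy simulation of a single continuous-time binary semi-Markov latent process follows; the gamma holding times are an assumption made for the example, not the model's specification, and the dissertation couples several such latent processes to observed point-process channels.

```python
# Toy simulation of one continuous-time binary semi-Markov process: the latent
# state (an external event being "on" or "off") flips after holding times drawn
# from state-specific gamma distributions, making the process semi-Markov rather
# than Markov. This is only the simplest building block of the model.
import numpy as np

def simulate_binary_semi_markov(t_max, shape_on=2.0, scale_on=1.0,
                                shape_off=3.0, scale_off=2.0, seed=0):
    rng = np.random.default_rng(seed)
    t, state = 0.0, 0
    switch_times, states = [0.0], [state]
    while t < t_max:
        shape, scale = (shape_on, scale_on) if state == 1 else (shape_off, scale_off)
        t += rng.gamma(shape, scale)              # holding time in the current state
        state = 1 - state
        switch_times.append(min(t, t_max))
        states.append(state)
    return np.array(switch_times), np.array(states)

times, states = simulate_binary_semi_markov(t_max=50.0)
```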
Item Open Access: On the Knowledge Transfer via Pretraining, Distillation and Federated Learning (2022), Hao, Weituo
Modern machine learning technology based on a revival of deep neural networks has been successfully applied in many pragmatic domains such as computer vision (CV) and natural language processing (NLP). The standard paradigm is pre-training: a large model with billions of parameters is trained on a surrogate task and then adapted to the downstream task of interest via fine-tuning. Knowledge transfer is what makes pre-training possible, but scale is what makes it powerful, and scale requires the availability of much more training data and computing resources.
Along with the great success of deep learning, fueled by larger datasets and more computing capability, come a series of interesting research questions. First, most pre-trained models learn from a single-modality (vision or text) dataset and are designed for single-step downstream tasks such as classification. Does pre-training still work for more complex tasks such as reinforcement learning? Second, pre-trained models obtain impressive empirical performance at the price of deployment challenges on low-resource (memory- and computation-constrained) platforms. How can large models be compressed into smaller ones efficiently? Third, collecting sufficient training data is often expensive, time-consuming, or even unrealistic in many scenarios due to privacy constraints. Is there a training paradigm that requires no data exchange?
To address these less explored questions, I conducted several projects, including: (i) large-scale pre-training on multi-modal input for vision-and-language navigation, demonstrating the effectiveness of knowledge transfer across complex tasks via pre-training; (ii) data augmentation for compressing large-scale language models, improving the efficiency of knowledge transfer in the teacher-student distillation framework; and (iii) weight factorization for sharing model weights in federated learning, achieving a trade-off between model performance and data privacy.
Item Open Access: Probabilistic Time-to-Event Modeling Approaches for Risk Profiling (2021), Chapfuwa, Paidamoyo
Modern health data science applications leverage abundant molecular and electronic health data, providing opportunities for machine learning to build statistical models that support clinical practice. Time-to-event analysis, also called survival analysis, stands as one of the most representative examples of such statistical models. Models for predicting the time of a future event are crucial for risk assessment across a diverse range of applications, e.g., drug development, risk profiling, and clinical trials, and such data are also relevant in fields like manufacturing (e.g., equipment monitoring). Existing time-to-event (survival) models have focused primarily on preserving the pairwise ordering of estimated event times (i.e., relative risk).
In this dissertation, we propose neural time-to-event models that account for calibration and uncertainty, while predicting accurate absolute event times. Specifically, we introduce an adversarial nonparametric model for estimating matched time-to-event distributions for probabilistically concentrated and accurate predictions. We consider replacing the discriminator of the adversarial nonparametric model with a survival-function matching estimator that accounts for model calibration. The proposed estimator can be used as a means of estimating and comparing conditional survival distributions while accounting for the predictive uncertainty of probabilistic models.
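As a hedged point of reference for what estimating and comparing survival distributions means at the simplest nonparametric level, consider the Kaplan-Meier baseline below; the dissertation's proposed estimators are neural, covariate-conditional, and calibration-aware, and go well beyond this.

```python
# Hedged illustration: nonparametric estimation of a survival function under
# right censoring with the Kaplan-Meier estimator (via the lifelines package).
# Synthetic data; not the adversarial or matching estimators proposed above.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
event_times = rng.exponential(scale=10.0, size=200)      # true event times
censor_times = rng.exponential(scale=15.0, size=200)     # independent censoring
observed_time = np.minimum(event_times, censor_times)
event_observed = (event_times <= censor_times).astype(int)

kmf = KaplanMeierFitter()
kmf.fit(observed_time, event_observed=event_observed, label="cohort")
print(kmf.survival_function_.head())                     # estimated S(t) over time
```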
Moreover, we introduce a theoretically grounded, unified counterfactual inference framework for survival analysis, which adjusts for bias from two sources: confounding (covariates influencing both the treatment assignment and the outcome) and censoring (informative or non-informative). To account for censoring bias, we leverage a flexible, nonparametric probabilistic model for event times. We then formulate a model-free, nonparametric hazard ratio metric for comparing treatment effects or leveraging prior randomized real-world experiments in longitudinal studies. Further, the proposed model-free hazard ratio estimator can be used to identify or stratify heterogeneous treatment effects. For stratifying risk profiles, we formulate an interpretable, time-to-event-driven clustering method for observations (patients) via a Bayesian nonparametric stick-breaking representation of the Dirichlet process.
Finally, through experiments on real-world datasets, consistent improvements in predictive performance and interpretability are demonstrated relative to existing state-of-the-art survival analysis models.
Item Open Access: Theoretical Understanding of Neural Network Optimization Landscape and Self-Supervised Representation Learning (2023), Wu, Chenwei
Neural networks have achieved remarkable empirical success in various areas. One key factor in their success is their ability to automatically learn useful representations from data. Self-supervised representation learning, which learns representations during pre-training and applies the learned representations to downstream tasks, has become the dominant approach to representation learning in recent years. However, theoretical understanding of self-supervised representation learning is scarce. Two main bottlenecks in this understanding are the large differences between pre-training and downstream tasks and the difficulty of neural network optimization. In this thesis, we present an initial exploration of the benefit of pre-training in self-supervised representation learning and of two heuristics in neural network optimization.
The first part of this thesis presents our attempts to understand why the representations produced by pre-trained models are useful in downstream tasks. We assume we can optimize the training objective well in this part. For the over-realized sparse coding model with noise, we show that the masking objective used in pre-training ensures the recovery of ground-truth model parameters. For a more complicated log-linear word model, we characterize what downstream tasks can benefit from the learned representations in pre-training. Our experiments validate these theoretical results.
The second part of this thesis provides explanations for two important phenomena in the neural network optimization landscape. We first propose a novel conjecture that explains the low-rank structure of the layer-wise neural network Hessian; the conjecture is verified experimentally and can be used to tighten generalization bounds for neural networks. We also study training stability and generalization in the learning-to-learn framework, where machine learning algorithms are used to learn parameters for training neural networks. We rigorously prove our conjectures in simple models and empirically verify our theoretical results in experiments with practical neural networks and real data.
Our results provide theoretical understanding of the benefits of pre-training for downstream tasks and of two important heuristics in the neural network optimization landscape. We hope these insights will further improve the performance of self-supervised representation learning approaches and inspire the design of new algorithms.