Browsing by Subject "Natural language processing"
Item Open Access Applications of Deep Representation Learning to Natural Language Processing and Satellite Imagery (2020) Wang, Guoyin
Deep representation learning has shown its effectiveness in many tasks such as text classification and image processing. Much research has been done to directly improve representation quality. However, how to improve representation quality by incorporating ancillary data sources or by interacting with other representations is still not fully explored. Also, using representation learning to help other tasks is worth further exploration.
In this work, we explore these directions by solving various problems in natural language processing and image processing. In the natural language processing part, we first discuss how to introduce alternative representations to improve the original representation quality and hence boost model performance. We then discuss a text representation matching algorithm. By introducing such a matching algorithm, we can better align different text representations in text generation models and hence improve generation quality.
For the image processing part, we consider a real-world air quality prediction problem: ground-level $PM_{2.5}$ estimation. To solve this problem, we introduce a joint model that improves image representation learning by combining an image encoder with an ancillary data source and a random forest model. We then further extend this model with ranking information for a semi-supervised learning setup. The semi-supervised model can then utilize low-cost sensors for $PM_{2.5}$ estimation.
Finally, we introduce a recurrent kernel machine concept to explain the representation interaction mechanism within time-dependent neural network models and hence unify a variety of algorithms into a generalized framework.
Item Open Access Automated Learning of Event Coding Dictionaries for Novel Domains with an Application to Cyberspace (2016) Radford, Benjamin James
Event data provide high-resolution and high-volume information about political events. From COPDAB to KEDS, GDELT, ICEWS, and PHOENIX, event datasets and the frameworks that produce them have supported a variety of research efforts across fields, including political science. While these datasets are machine-coded from vast amounts of raw text input, they nonetheless require substantial human effort to produce and update sets of required dictionaries. I introduce a novel method for generating large dictionaries appropriate for event-coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and deep learning to greatly reduce the researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describes that domain. An application to cybersecurity is described and both the generated dictionaries and resultant event data are examined. The cybersecurity event data are also examined in relation to existing datasets in related domains.
Item Open Access Causal Inference for Natural Language Data and Multivariate Time Series (2023) Tierney, Graham
The central theme of this dissertation is causal inference for complex data; it highlights how, for certain estimation problems, collecting more data has limited benefit. The central application areas are natural language data and multivariate time series. For text, large language models are trained on predictive tasks not necessarily well-suited for causal inference. Moreover, documents that vary in some treatment feature will often also vary systematically in other, unknown ways that prohibit attribution of causal effects to the feature of interest. Multivariate time series, even with high-quality contemporaneous predictors, still exhibit positive dependencies such that even with many treated and control units, the amount of information available to estimate causal quantities is quite low.
Chapter 2 builds a model for short text, as is typically found on social media platforms. Chapter 3 analyzes a randomized experiment that paired Democrats and Republicans to have a conversation about politics, then develops a sensitivity procedure to test for mediation effects attributable to the politeness of the conversation. Chapter 4 expands on the limitations of observational, model-based methods for causal inference with text and designs an experiment to validate how significant those limitations are. Chapter 5 covers experimentation with multivariate time series.
The general conclusion from these chapters is that causal inference always requires untestable assumptions. A researcher trying to make causal conclusions needs to understand the underlying structure of the problem they are studying to validate whether those assumptions hold. The work here shows how to still conduct causal analysis when commonly made assumptions are violated.
Item Open Access Data curation of a findable, accessible, interoperable, reusable polymer nanocomposites data resource - MaterialsMine (2022) Hu, Bingyin
A polymer nanocomposite (PNC) is a composite material consisting of a polymer matrix and stiff fillers with at least one dimension smaller than 100 nm. With the addition of a small amount of filler to the polymer matrix, a PNC demonstrates large reinforcement of mechanical, viscoelastic, dielectric, thermal, optical, and other physicochemical properties as compared to pure polymer or pure fillers acting alone. PNCs have thus attracted significant amounts of research interest over recent years. To accelerate materials design, we need findable, accessible, interoperable, and reusable (FAIR) data resources to provide sufficient data for data-driven approaches to replace the traditional trial-and-error style of exploration in a lab. With the goal of building a FAIR data resource for the PNC community, we built NanoMine in 2016, which later evolved into MaterialsMine with the extension of MetaMine in the metamaterial domain. To be FAIR, we need a clear and extensible data representation to enable interoperable knowledge exchange. We thus designed the NanoMine XML schema. With the data framework and data representation in place, we still need tools and a user-friendly interface for data curation. This dissertation describes in detail the tools and data interfaces we developed to ensure a smooth data curation pathway for NanoMine/MaterialsMine. To reduce and prevent curation errors and thus improve data quality, we need data validation mechanisms. To address this need, we discuss the validation mechanisms embedded both during and after curation. On many occasions, even without human-caused curation errors, the data resource cannot perform to its full capacity due to data inconsistencies.
Examples include the inconsistent indexing of polymers, caused by the lack of uniformity in how polymer names are expressed, and the inconsistent use of mass fraction and volume fraction in specifying composite composition. To address the need for data standardization, this dissertation discusses in detail the tools developed to bypass manual curation: the mass fraction – volume fraction conversion agent and ChemProps, a RESTful API-enabled, multi-algorithm-based polymer/filler name mapping methodology. To create truly powerful and transformative materials design paradigms and to work towards a sustainable future for MaterialsMine, we need to harness the power of AI to efficiently extract a significant set of data from the published, archival literature. Natural Language Processing (NLP) offers an opportunity to make this data accessible and readily reusable by humans and machines. The first step is to generate a sample list where curators can easily find the number of samples, their compositions, and properties reported in the paper. The task is handled in a pretraining-finetuning fashion. Downstream tasks include Named Entity Recognition (NER) to detect sample code, sample composition, property, and group reference to samples in the captions, and Relation Extraction (RE), which predicts the relations between pairs of detected named entities. This dissertation provides a detailed discussion of how the two corpora for pretraining and finetuning are constructed. A T5-base model pretrained on the caption-mention corpus and finetuned for the NER and RE tasks is proposed. We evaluated it along with an array of BERT-based models and seq2seq models for potential applications in a semi-automated curation pipeline for MaterialsMine.
Item Open Access Deep Generative Models for Vision, Languages and Graphs (2019) Wang, Wenlin
Deep generative models have achieved remarkable success in modeling various types of data, ranging from vision and language to graphs. They offer flexible and complementary representations for both labeled and unlabeled data. Moreover, they are naturally capable of generating realistic data. In this thesis, novel variations of generative models are proposed for various learning tasks, which can be categorized into three parts.
In the first part, generative models are designed to learn generalized representations for images under the Zero-Shot Learning (ZSL) setting. An attribute-conditioned variational autoencoder is introduced, representing each class as a latent-space distribution and enabling the learning of highly discriminative and robust feature representations. It endows the generative model with discriminative power by choosing the class that maximizes the variational lower bound. I further show that the model can be naturally generalized to transductive and few-shot settings.
In the second part, generative models are proposed for controllable language generation. Specifically, two types of topic-enrolled language generation models are proposed. The first introduces a topic compositional neural language model for controllable and interpretable language generation via a mixture-of-experts model design. The second solves the problem via a VAE framework with a topic-conditioned GMM model design. Both models boost the performance of existing language generation systems with controllable properties.
In the third part, generative models are introduced for graph data more broadly. First, a variational homophilic embedding (VHE) model is proposed. It is a fully generative model that learns network embeddings by modeling the textual semantic information with a variational autoencoder, while accounting for the graph structure information through a homophilic prior design. Second, for heterogeneous multi-task learning, a novel graph-driven generative model is developed to unify the tasks into the same framework. It combines a graph convolutional network (GCN) with multiple VAEs, thus embedding the nodes of the graph in a uniform manner while specializing their organization and usage to different tasks.
Item Open Access Deep Latent-Variable Models for Natural Language Understanding and Generation (2020) Shen, Dinghan
Deep latent-variable models have been widely adopted to model various types of data, due to their ability to: 1) infer rich high-level information from the input data (especially in a low-resource setting); 2) result in a generative network that can synthesize samples unseen during training. In this dissertation, I will present the contributions I have made in applying the general framework of latent-variable models to various natural language processing problems, which is especially challenging given the discrete nature of text sequences. Specifically, the dissertation is divided into two parts.
In the first part, I will present two of my recent explorations on leveraging deep latent-variable models for natural language understanding. The goal here is to learn meaningful text representations that can be helpful for tasks such as sentence classification, natural language inference, question answering, etc. First, I will propose a variational autoencoder based on textual data to digest unlabeled information. To alleviate the observed posterior collapse issue, a specially designed deconvolutional decoder is employed as the generative network. The resulting sentence embeddings greatly boost performance on downstream tasks. Then I will present a model to learn compressed/binary sentence embeddings, which are storage-efficient and applicable to on-device applications.
In the second part, I will introduce a multi-level Variational Autoencoder (VAE) to model long-form text sequences (with as many as 60 words). A multi-level generative network is leveraged to capture word-level and sentence-level coherence. Moreover, with a hierarchical design of the latent space, long-form and coherent texts can be more reliably produced (relative to baseline text VAE models). Semantically rich latent representations are also obtained in this unsupervised manner. Human evaluation further demonstrates the superiority of the proposed method.
Item Embargo Examining How Patients Judge Their Physicians in Online Physician Reviews (2023) Madanay, Farrah Lynn
In three essays, this dissertation examines how patients judge their physicians in online physician reviews and whether those judgments align with traditional gender stereotypes. Specifically, I qualitatively explore patients’ judgments of their physicians’ interpersonal manner and technical competence, and the predominant factors within the two dimensions. I then train a machine-learning algorithm to code patients’ judgments in online physician reviews at scale. Finally, I use the machine-coded sample to analyze physician gender differences in judgments received from patients and how those judgments affect physicians’ review star ratings.
In Essay 1, I propose an elaborated theoretical framework to identify the predominant factors underlying patients’ interpersonal manner and technical competence judgments of their physicians. This framework expands on prior grounded theory work by Lopez et al. (2012) and uses findings from a qualitative content analysis of 2,000 reviews received by distinct physicians. For this framework, I draw on a larger, new dataset of physician reviews from Healthgrades.com, one of the leading physician review websites, and use a balanced sample of reviews representing primary care physicians and surgeons, male and female physicians, and low- and high-rated reviews. I provide rich descriptions and illustrative quotations of the factors comprising interpersonal manner and technical competence, and describe factors added to and removed from Lopez et al.’s original framework. This framework from Essay 1 demonstrates that patients value their physicians on a wide array of interpersonal manner and technical competence factors, including but not limited to bedside manner, going above and beyond, availability, knowledge, diagnostic skill, and open-mindedness about treatment.
In Essay 2, I train, test, and validate an advanced natural language processing algorithm called Robustly Optimized BERT Pre-Training Approach (i.e., RoBERTa) for classifying the presence and positive or negative valence of patients’ interpersonal manner and technical competence judgments in online physician reviews. I use the 2,000 manually coded physician reviews from Essay 1 to train and test two classification models, one for interpersonal manner and one for technical competence. Both models perform with 90% accuracy, with high precision, recall, and weighted F1 scores. I validate the models using the full sample of 345,053 RoBERTa-coded reviews for 167,150 physicians by testing associations between the valence-coded judgments and review star ratings and by comparing review rating and gender analyses with extant results in the literature. The fine-tuned algorithm from Essay 2 allows us to code a large dataset of unstructured textual review data with high efficiency and accuracy, enabling subsequent large-scale text analysis.
In Essay 3, I analyze whether patients’ judgments of their physicians’ interpersonal manner and technical competence align with traditional gender stereotypes. Drawing on the Stereotype Content Model, I hypothesize that patients’ judgments will conform with gender stereotypes, such that female physicians will be more likely to receive reviews with interpersonal manner judgments whereas male physicians will be more likely to receive reviews with technical competence judgments. Using the full sample of machine-coded reviews from Essay 2, I estimate multilevel logistic regressions to identify gender differences in interpersonal manner and technical competence judgments of physicians. Results from Essay 3 suggest that patients’ judgments partly align with traditional gender stereotypes: Female physicians are more likely to receive interpersonal manner judgments, but male physicians are not more likely to receive technical competence judgments.
Whether female physicians are relatively more likely to receive praise or criticism for their interpersonal manner depends on their specialty. In stereotypically warm specialties, like primary care, females are penalized for seeming cold, whereas in stereotypically technical specialties, like surgery, females are advantaged for appearing warm. Last, female physicians, in some cases, are either not rewarded as much or penalized more than their male counterparts in their star ratings when receiving positive or negative interpersonal manner and technical competence judgments.
Item Open Access Semantic Term “Blurring” and Stochastic “Barcoding” for Improved Unsupervised Text Classification (2018-04) Martorano, Robert
The abundance of text data being produced in the modern age makes it increasingly important to intuitively group, categorize, or classify text data by theme for efficient retrieval and search. Yet the high dimensionality and imprecision of text data, or more generally of language as a whole, prove challenging when attempting to perform unsupervised document clustering. In this thesis, we present two novel methods for improving unsupervised document clustering/classification by theme. The first is to improve document representations. We look to exploit “term neighborhoods” and “blur” semantic weight across neighboring terms. These neighborhoods are located in the semantic space afforded by “word embeddings.” The second method is for cluster revision, based on what we term “stochastic barcoding”, or “S-Barcode”, patterns. Text data is inherently high dimensional, yet clustering typically takes place in a low dimensional representation space. Our method utilizes lower dimension clustering results as initial cluster configurations, and iteratively revises the configuration in the high dimensional space. We show with experimental results how both methods improve the quality of document clustering. While this thesis elaborates on the two new conceptual contributions, a joint thesis by David Yan details the feature transformation and software architecture we developed for unsupervised document classification.
Item Open Access The Influence of Structural Information on Natural Language Processing (2020) Zhang, Xinyuan
Learning effective and efficient vector representations for text has been a core problem for many downstream tasks in natural language processing (NLP).
Most traditional NLP approaches learn a text representation by only modeling the text itself.
Recently, researchers have discovered that some structural information associated with the texts can also be used to learn richer text representations.
In this dissertation, I will present my recent contributions on how to utilize various kinds of structural information, including graphical networks, syntactic trees, knowledge graphs and implicit label dependencies, to improve model performance on different NLP tasks.
This dissertation consists of three main parts.
In the first part, I show that the semantic relatedness between different texts, represented by textual networks adding edges between correlated text vertices, can help with text embedding.
The proposed DMTE model embeds each vertex with a diffusion convolution operation applied on text inputs such that the complete level of connectivity between any two texts in the graph can be measured.
In the second part, I introduce the syntax-infused variational autoencoders (SIVAE) which jointly encode a sentence and its syntactic tree into two latent spaces and decode them simultaneously.
Sentences generated by this VAE-based framework are more grammatical and fluent, demonstrating the effectiveness of incorporating syntactic trees into language modeling.
In the third part, I focus on modeling the implicit structures of label dependencies for a multi-label medical text classification problem.
The proposed convolutional residual model successfully discovers label correlation structures and hence improves the multi-label classification results.
From the experimental results of the proposed models, we can conclude that leveraging structural information can contribute to better model performance.
It is essential to build a connection between the chosen structure and a specific NLP task.