Statistical Issues in Quantifying Text Mining Performance
dc.contributor.advisor | Banks, David L
dc.contributor.author | Chai, Christine Peijinn
dc.date.accessioned | 2017-05-16T17:28:30Z
dc.date.available | 2017-05-16T17:28:30Z
dc.date.issued | 2017
dc.department | Statistical Science
dc.description.abstract | Text mining is an emerging field in data science because text information is ubiquitous, but analyzing text data is much more complicated than analyzing numerical data. Topic modeling is a commonly used approach that classifies text documents into topics and identifies key words, so that the information of interest can be distilled from a large corpus. In this dissertation, I investigate various statistical issues in quantifying text mining performance; Chapter 1 is a brief introduction. Chapter 2 concerns adequate pre-processing for text data. For example, words with the same stem (e.g., "study" and "studied") should be assigned the same token because they carry essentially the same meaning. In addition, specific phrases such as "New York" and "White House" should be retained as single tokens because many topic models operate on individual words. Statistical methods, such as conditional probability and p-values, provide an objective approach to discovering these phrases. Chapter 3 begins the quantification of text mining performance by measuring how much text pre-processing improves topic modeling results. Retaining specific phrases increases their distinctiveness because the "signal" of the most probable topic becomes stronger (i.e., the maximum probability is higher) than the "signal" generated by either of the two words separately. Therefore, text pre-processing helps recover semantic information at the word level. Chapter 4 quantifies the uncertainty of a widely used topic model, latent Dirichlet allocation (LDA). A synthetic text dataset was created with known topic proportions, and I tried several methods to determine the appropriate number of topics from the data. The pre-set number of topics strongly influences the results because LDA tends to utilize all topics allotted, so that each topic receives roughly equal representation. Finally, Chapter 5 explores selected extensions, including supervised latent Dirichlet allocation (sLDA), an application to survey data, sentiment analysis, and the infinite Gaussian mixture model.
dc.identifier.uri |
dc.subject | Statistics
dc.subject | Data cleaning
dc.subject | Latent Dirichlet allocation
dc.subject | N-gramming
dc.subject | Text mining
dc.subject | Topic modeling
dc.subject | Uncertainty
dc.title | Statistical Issues in Quantifying Text Mining Performance
dc.type | Dissertation |
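The conditional-probability criterion for phrase discovery mentioned in the abstract (Chapter 2) can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure: the toy corpus, the 0.8 threshold, and the minimum-count cutoff are assumptions chosen here for demonstration.

```python
from collections import Counter

# Toy corpus of pre-tokenized documents; the real analysis would run on
# a full corpus after stemming and stop-word handling.
docs = [
    "the white house issued a statement".split(),
    "reporters toured the white house briefing room".split(),
    "new york is a large city".split(),
    "she moved from new york last year".split(),
]

unigrams = Counter(tok for doc in docs for tok in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

def phrase_candidates(threshold=0.8, min_count=2):
    """Retain bigrams (w1, w2) whose conditional probability
    P(w2 | w1) = count(w1 w2) / count(w1) exceeds the threshold."""
    for (w1, w2), n in bigrams.items():
        p = n / unigrams[w1]
        if n >= min_count and p >= threshold:
            yield f"{w1}_{w2}", round(p, 2)

print(dict(phrase_candidates()))
# {'white_house': 1.0, 'new_york': 1.0} -- these pairs would then be
# re-tokenized as single tokens before topic modeling.
```

Common bigrams with weak association (e.g. "the white", where P("white" | "the") = 0.5 in the toy corpus) fall below the threshold and are left as separate words.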
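The Chapter 4 finding that the pre-set number of topics drives LDA's output can also be illustrated. The sketch below assumes scikit-learn and an invented stand-in corpus; comparing perplexity across candidate values of k is one common heuristic for choosing the number of topics, not necessarily the dissertation's method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic stand-in corpus with two obvious themes (biomedical and
# financial); the dissertation builds its own dataset with known
# topic proportions.
docs = [
    "the team studied gene expression in cancer cells",
    "stock markets fell as investors weighed interest rates",
    "the vaccine trial enrolled hundreds of patients",
    "central banks raised rates to slow inflation",
    "researchers measured tumor growth in the study",
    "bond yields rose after the inflation report",
] * 5  # repeat to give the model a bit more data

X = CountVectorizer().fit_transform(docs)

for k in (2, 5, 10):  # candidate pre-set numbers of topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    theta = lda.fit_transform(X)  # document-topic proportions
    # Average topic mass across documents: with larger k, LDA still
    # spreads mass over all allotted topics rather than leaving some empty.
    print(k, round(lda.perplexity(X), 1), theta.mean(axis=0).round(2))
```

The printed topic-mass averages show each allotted topic receiving a nontrivial share even when k exceeds the true number of themes, which is why the choice of k matters so much.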
Files
Original bundle
- Name: Chai_duke_0066D_13982.pdf
- Size: 957.27 KB
- Format: Adobe Portable Document Format