Pitsianis, Nikos
Sun, Xiaobai
Yan, David
2018-05-18
2018-05-18
2018-04-24
https://hdl.handle.net/10161/16743
In our thesis work, we investigate unsupervised theme-based categorization, or classification, of documents in a large corpus, where the documents are represented in a high-dimensional feature space. Unsupervised classification of text documents is in high demand given the data explosion of the modern age, yet it remains a challenging problem. In particular, we re-examine two key areas: the manner in which the documents are represented and the process by which the documents are clustered. Two innovative methods are presented in joint research work with Rob Martorano. The first is feature transformation, which is elaborated on in this thesis. We point out that existing approaches to document feature description serve well for author identification but are limiting for theme-based classification. In our feature transformation, we discount esoteric use of words by individual authors, and we uncover and exploit semantic similarities and associations among different words used by different authors. We first locate semantically close words by utilizing word embedding techniques and products based on much larger word collections, external to the terms used in a particular document corpus. We then make numerical associations among term neighbors with similar semantic meanings; we denote these term neighborhoods as semantic elements. Using semantic elements, we apply a self-tuning Gaussian blurring technique to increase association between documents that share similar context patterns. The second contribution is cluster revision, which is briefly discussed in this thesis and elaborated further in Rob Martorano’s thesis. Clustering algorithms are typically used after feature dimension reduction. Some properties are preserved, and some are lost, in the reduced-dimension space. Some clusters are fragmented into smaller ones, and some are merged.
We revise the clustering results by going back to the high-dimensional space. We characterize the cluster features with what we refer to as stochastic barcodes. We developed a software architecture composed of two major components. The first component applies our novel feature transformation method, using semantic elements to form a refined document feature space. The second component performs a dimension reduction on the document feature space, then forms and refines the subsequent document clusters. We show, with experimental results on real-world document corpora, the improvements made by our approach in comparison to existing, influential ones.
en-US
Text processing
feature transformation
unsupervised clustering
cluster revision
word embeddings
term associations
Exploiting Semantic Word Relationships for Improved Unsupervised Academic Document Classification
Honors thesis