Browsing by Author "Banks, David L"
Item Open Access A Bayesian Strategy to the 20 Question Game with Applications to Recommender Systems (2017) Suresh, Sunith Raj. In this paper, we develop an algorithm that uses a Bayesian strategy to determine a sequence of questions for playing the 20 Questions game. The algorithm is motivated by an application to active recommender systems. We first develop an algorithm that constructs a sequence of questions in which each question asks about a single binary feature. We test the performance of the algorithm in simulation studies and find that it performs relatively well under an informed prior. We then modify the algorithm to construct questions that each ask about two binary features via an AND conjunction. We test the modified algorithm in the same simulation studies and find that it does not significantly improve performance.
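A minimal sketch of the kind of question-selection step such a Bayesian strategy involves (not the author's exact algorithm): maintain a posterior over candidate items and greedily ask about the binary feature whose answer yields the largest expected information gain. The items, features, and numbers below are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (ignoring zero entries)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_question(prior, features):
    """Pick the binary feature whose answer maximizes expected information gain.

    prior    : (n_items,) probability of each candidate item
    features : (n_items, n_features) 0/1 matrix, features[i, j] = 1 if item i has feature j
    """
    base = entropy(prior)
    gains = []
    for j in range(features.shape[1]):
        p_yes = prior @ features[:, j]           # probability the answer is "yes"
        post_yes = prior * features[:, j]
        post_no = prior * (1 - features[:, j])
        h = 0.0
        if p_yes > 0:
            h += p_yes * entropy(post_yes / p_yes)
        if p_yes < 1:
            h += (1 - p_yes) * entropy(post_no / (1 - p_yes))
        gains.append(base - h)
    return int(np.argmax(gains))

def update(prior, features, j, answer):
    """Bayesian update of the prior after observing a yes/no answer to feature j."""
    like = features[:, j] if answer else 1 - features[:, j]
    post = prior * like
    return post / post.sum()

# Toy run with hypothetical items and features.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(50, 12))
prior = np.full(50, 1 / 50)
truth = 7
for _ in range(5):
    j = best_question(prior, features)
    prior = update(prior, features, j, features[truth, j] == 1)
print("posterior mass on the true item:", round(prior[truth], 3))
```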
Item Open Access Analyzing Amazon CD Reviews with Bayesian Monitoring and Machine Learning Methods (2020) Su, Eric. This paper analyzes customer reviews of CDs sold on Amazon.com using various statistical and machine learning methods. We investigated the distributional properties of the reviews through exploratory analyses and used Bayesian monitoring to analyze the life cycles of CDs. We propose an adjustment to the classic Bayesian monitoring technique that allows it to handle extreme changes in the data. To predict how many reviews a CD will receive, we compared the performance of a range of machine learning models and identified important features affecting the number of reviews using permutation importance.
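The paper's specific monitoring adjustment is not reproduced here; as a hedged illustration of Bayesian monitoring for review counts, the sketch below scores each new count against a conjugate Gamma-Poisson one-step-ahead predictive and flags observations that fall far in either tail. The thresholds, discount factor, and reset rule are assumptions, not the author's method.

```python
import numpy as np
from scipy.stats import nbinom

def monitor(counts, a=1.0, b=1.0, discount=0.9, tail=0.005):
    """Sequential Gamma-Poisson monitoring of a count series.

    Each new count is scored by its negative-binomial one-step-ahead
    predictive distribution; a very small tail probability flags a
    possible change in the review rate (hypothetical thresholds).
    """
    flags = []
    for t, y in enumerate(counts):
        # Predictive: y ~ NegBin(n=a, p=b/(b+1)) under the current Gamma(a, b) prior.
        p_low = nbinom.cdf(y, a, b / (b + 1.0))
        p_high = nbinom.sf(y - 1, a, b / (b + 1.0))
        if min(p_low, p_high) < tail:
            flags.append(t)
            a, b = 1.0, 1.0                                  # restart after an extreme observation
        else:
            a, b = discount * a + y, discount * b + 1.0      # discounted conjugate update
    return flags

series = [3, 4, 2, 5, 3, 4, 40, 35, 30, 4, 3]                # a burst of reviews, then decay
print("flagged time points:", monitor(series))
```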
Item Open Access Latent Space Diffusion (2015) Fisher, Jacob Charles. Social networks represent two different facets of social life: (1) stable paths for diffusion, or the spread of something through a connected population, and (2) random draws from an underlying social space, which indicate the relative positions of the people in the network to one another. The dual nature of networks creates a challenge: if the observed network ties are a single random draw, is it realistic to expect that diffusion only follows the observed ties? This study takes a first step towards integrating these two perspectives by introducing a social space diffusion model. In the model, network ties indicate positions in social space, and diffusion occurs proportionally to distance in social space. Practically, the simulation occurs in two parts: positions are estimated using a latent space model, and then either the predicted probabilities of a tie from that model (representing the distances in social space) or a series of networks drawn from those probabilities (representing routine churn in the network) are used as weights in a weighted averaging framework. Using a school friendship network, I show that the model is more consistent and, when probabilities are used, converges faster than diffusion following only the observed network ties.
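A minimal sketch of the weighted-averaging step described above, with a matrix of predicted tie probabilities standing in for the latent space model's output; the probabilities here are random placeholders, not estimates from any data.

```python
import numpy as np

def diffuse(y0, weights, steps=10):
    """Spread an attribute by repeated weighted averaging over the network.

    y0      : (n,) initial attribute values (e.g., adopters marked 1)
    weights : (n, n) nonnegative weights, e.g. predicted tie probabilities
              from a latent space model (placeholder values here)
    """
    W = weights / weights.sum(axis=1, keepdims=True)   # row-normalize
    y = y0.astype(float)
    for _ in range(steps):
        y = W @ y
    return y

rng = np.random.default_rng(1)
n = 20
probs = rng.uniform(0.01, 0.5, size=(n, n))            # stand-in for latent-space tie probabilities
np.fill_diagonal(probs, 0.0)
y0 = np.zeros(n); y0[:3] = 1.0                          # three initial adopters
print(np.round(diffuse(y0, probs), 3))
```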
Item Open Access Mining Political Blogs With Network Based Topic Models (2014) Liang, Jiawei. We develop a Network Based Topic Model (NBTM), which integrates a random graph model with the Latent Dirichlet Allocation (LDA) model. The NBTM assumes that the topic proportion of a document has a fixed variance across the document corpus, with author differences treated as random effects. It also assumes that the links between documents are binary variables whose probabilities depend on the author random effects. We fit the model to political blog posts from the calendar year 2012 that mention Trayvon Martin. This paper presents the topic extraction results and posterior prediction results for hidden links within the blogosphere.
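The abstract does not give the exact link function; one common choice for making link probabilities depend on author random effects is a logistic model, sketched below purely as an illustration. The baseline log-odds, random-effect scale, and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_authors = 30
sigma = 1.0                                   # random-effect scale (assumed)
effects = rng.normal(0.0, sigma, n_authors)   # author random effects
mu = -3.0                                     # baseline log-odds of a link (assumed)

# P(link between documents by authors i and j) = logistic(mu + b_i + b_j)
logits = mu + effects[:, None] + effects[None, :]
prob = 1.0 / (1.0 + np.exp(-logits))
links = rng.random((n_authors, n_authors)) < prob
print("simulated link density:", links[np.triu_indices(n_authors, k=1)].mean().round(3))
```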
Item Open Access Momentum Scale Estimation Using Maximum Likelihood Template Fitting (2010) Zeng, Yu. A maximum likelihood template fitting procedure is performed using Upsilon --> mu+mu- events to extract the momentum scale, a scale factor applied to measured momentum, of the CDF detector at Fermilab. The invariant mass spectrum constructed from data events is compared with the invariant mass spectrum from Monte Carlo simulated events, with the momentum scale varying as a free parameter in the simulation. The simulated invariant mass spectrum that best matches the data spectrum gives the maximum likelihood estimate of the momentum scale. We find the momentum scale is dp/p = (-1.330 ± 0.028(stat) ± 0.099(syst)) × 10^{-3}.
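A toy version of binned maximum-likelihood template fitting, in the spirit of the procedure described (toy Gaussian mass peaks, not CDF data or the actual detector simulation): templates are generated at a grid of trial momentum scales, and the scale whose template maximizes the Poisson likelihood of the observed histogram is reported.

```python
import numpy as np

def template_loglik(data_counts, template_density, n_events):
    """Binned Poisson log-likelihood of observed counts given a template."""
    expected = n_events * template_density
    return np.sum(data_counts * np.log(expected) - expected)

rng = np.random.default_rng(3)
bins = np.linspace(9.0, 10.0, 51)                       # GeV, around the Upsilon(1S) mass
true_scale = 0.999
data = rng.normal(9.46 * true_scale, 0.07, 100_000)     # toy "data" mass spectrum
data_counts, _ = np.histogram(data, bins)

best = None
for scale in np.linspace(0.995, 1.003, 81):             # scan trial momentum scales
    mc = rng.normal(9.46 * scale, 0.07, 200_000)        # toy simulated spectrum at this scale
    mc_counts, _ = np.histogram(mc, bins)
    dens = (mc_counts + 0.5) / (mc_counts + 0.5).sum()  # smoothed, normalized template
    ll = template_loglik(data_counts, dens, data_counts.sum())
    if best is None or ll > best[1]:
        best = (scale, ll)
print("maximum-likelihood momentum scale:", round(best[0], 4))
```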
Item Open Access Probabilistic Models for Text in Social Networks (2018) Owens-Oas, Derek. Text in social networks is a common form of data; examples include emails between coworkers, text messages in a group chat, and comments on Facebook. There is value in developing models for such data. Related services include archiving emails by topic and recommending job prospects for those seeking employment. However, due to privacy concerns, these data are relatively hard to obtain. We therefore design and experiment with publicly available data of the same structure.
Motivated primarily by topic discovery, this thesis begins with a thorough survey of models that extend the foundational probabilistic topic model, latent Dirichlet allocation (LDA). My focus is on those which endow documents with metadata, such as a time stamp, the author, or a set of links to other authors. Each approach is given common notation, described in terms of a structural innovation to LDA, and presented as a graphical model. The review reveals that, to our knowledge, no previous model combines dynamic topic modeling and community detection.
The first data set studied in this thesis is a corpus of political blog posts. Our motivation is to learn communities, guided by the presence of links and dynamic topic interests. This formulation enables new link recommendation. We therefore develop an appropriate Bayesian probabilistic model to learn these parameters jointly. Experiments reveal that the model successfully identifies a group of blogs which discuss sensational crime, despite there being very few links between these blogs. It also enables presentation of top blogs, according to various criteria, for a specified topic interest community.
In a second analysis of the blog post data I develop a similar model. The motivation is to partition documents into groups. The groups are defined by shared topic interest proportions and shared linking patterns. Documents in the same group are reasonable recommendations to a reader. The model is designed to extend the foundational LDA. This enables easy comparison to a strong baseline. Also, it offers an alternative to LDA for situations where a hard clustering of documents is desired, and documents with similar enough topic proportions are clustered together. It simultaneously learns the linking tendency for each of these groups.
We also apply a probabilistic model for text in social networks to related text event-sequence data. Here we analyze a transcription of group conversation from the movie 12 Angry Men. A main contribution is an algorithm based on marked multivariate Hawkes processes to recover latent structure, learning the root source of each event. The algorithm is tested on synthetic data and on a Reddit data set where the structure is observed. The algorithm enables partial credit attribution, distributing credit over the people likely to have started each new conversation thread.
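A hedged illustration of the attribution idea (not the thesis algorithm itself): in a Hawkes process with an exponential triggering kernel, the probability that an event was triggered by a particular earlier event, versus the background process, is proportional to the corresponding intensity term, which yields the kind of partial credit described. Parameter values below are illustrative, not estimated.

```python
import numpy as np

def attribution(times, mu=0.2, alpha=0.8, beta=1.0):
    """For each event, distribute 'credit' over the background process and
    earlier events, proportional to a Hawkes intensity with an exponential
    kernel (parameter values are illustrative)."""
    credits = []
    for i, t in enumerate(times):
        prev = np.array(times[:i])
        kernel = alpha * beta * np.exp(-beta * (t - prev)) if i else np.array([])
        weights = np.concatenate(([mu], kernel))         # [background, earlier events...]
        credits.append(weights / weights.sum())
    return credits

times = [0.0, 0.4, 0.5, 3.0]
for t, c in zip(times, attribution(times)):
    print(f"t={t}: background {c[0]:.2f}, parents {np.round(c[1:], 2)}")
```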
The above models and applications demonstrate the value of text network data. Generalized software for such data enables visualization and summarization of model outputs for text data in social networks.
Item Open Access Problems in Computational Advertising (2021) Guo, Yi. Computational advertising is a multi-billion-dollar industry, yet it has received little attention from academic statisticians. Despite this, the performance of its collection of pricing models, keyword auctions, A/B testing, and recommender systems relies heavily on statistical techniques in almost every element of its design and implementation.
Online ad auctions and e-commerce logistics are two of the major components of computational advertising. In a real-time bidding scenario, the objective of the former is to maximize expected utility. The latter is concerned with developing statistical models for dynamic continuous flows. In turn, this leads to a range of issues, three of which are discussed in this thesis.
Chapter 1 briefly introduces the topics of online advertising and computational advertising. Chapter 2 proposes a new method, the Backwards Indifference Derivation (BID) algorithm, to numerically approximate the pure strategy Nash equilibrium (PSNE) bidding functions in asymmetric first-price auctions. The classic PSNE solution assumes that all parties agree on the type distribution for each participant, and all know that this information is held in common. This common knowledge assumption is strong and often unrealistic. Chapter 3 addresses that gap by providing two alternative solutions, each based upon an adversarial risk analysis (ARA) perspective. Chapter 4 extends the previous methodology for Bayesian dynamic flow models of discrete data to real-valued and positive flows. Finally, Chapter 5 presents some concluding remarks and briefly discusses other problems in computational advertising.
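Neither the BID algorithm nor the ARA solutions are reproduced here; the sketch below only shows the basic expected-utility calculation that underlies these bidding problems: given a private value and a subjective distribution over the opponents' highest bid, choose the bid maximizing (value - bid) times the probability of winning. The rival-bid distribution below is a placeholder.

```python
import numpy as np

def best_bid(value, rival_bids, grid_size=500):
    """Grid-search the bid maximizing expected utility (value - bid) * P(win),
    where P(win) is estimated from draws of the opponents' highest bid
    (e.g., simulated from a subjective/ARA model of the rivals)."""
    grid = np.linspace(0.0, value, grid_size)
    p_win = np.array([(rival_bids < b).mean() for b in grid])
    utility = (value - grid) * p_win
    return grid[np.argmax(utility)]

rng = np.random.default_rng(4)
rival_bids = rng.beta(2, 3, size=20_000)        # placeholder belief about the highest rival bid
print("approximately optimal bid for value 0.9:", round(best_bid(0.9, rival_bids), 3))
```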
Item Open Access Scheduling Optimization with LDA and Greedy Algorithm (2016) Bi, Yongjian. Scheduling optimization is concerned with the optimal allocation of events to time slots. In this paper, we look at one particular scheduling problem: the 2015 Joint Statistical Meetings. We want to assign sessions to time slots so that sessions on similar topics do not conflict. Chapter 1 briefly describes the motivation for this example as well as the constraints and the optimality criterion. Chapter 2 proposes the use of Latent Dirichlet Allocation (LDA) to identify the topic proportions in each session and discusses fitting the model. Chapter 3 translates these ideas into a mathematical formulation and introduces a greedy algorithm to minimize conflicts. Chapter 4 demonstrates the improvement in the schedule achieved with this method.
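A minimal sketch of the greedy step this approach suggests (slot capacity, the cosine similarity measure, and the Dirichlet stand-in for fitted LDA proportions are all assumptions): place each session in the time slot where it adds the least topic similarity to the sessions already scheduled there.

```python
import numpy as np

def greedy_schedule(topics, n_slots, capacity):
    """topics : (n_sessions, n_topics) LDA topic proportions per session.
    Greedily place each session in the slot that adds the least total
    cosine similarity with sessions already in that slot."""
    norm = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    slots = [[] for _ in range(n_slots)]
    for s in range(len(topics)):
        costs = []
        for slot in slots:
            if len(slot) >= capacity:
                costs.append(np.inf)
            else:
                costs.append(sum(norm[s] @ norm[t] for t in slot))
        slots[np.argmin(costs)].append(s)
    return slots

rng = np.random.default_rng(5)
proportions = rng.dirichlet(np.ones(8), size=30)    # stand-in for fitted LDA proportions
print(greedy_schedule(proportions, n_slots=6, capacity=5))
```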
Item Open Access Statistical Inference Utilizing Agent Based Models (2014) Heard, Daniel Philip. Agent-based models (ABMs) are computational models used to simulate the behaviors, actions, and interactions of agents within a system. The individual agents each have their own set of assigned attributes and rules, which determine their behavior within the ABM system. These rules can be deterministic or probabilistic, allowing for a great deal of flexibility. ABMs allow us to observe how the behaviors of the individual agents affect the system as a whole and whether any emergent structure develops within the system. Examining rule sets in conjunction with the corresponding emergent structure shows how small-scale changes can affect large-scale outcomes within the system. Thus, we can better understand and predict the development and evolution of systems of interest.
ABMs have become ubiquitous: they are used in business (virtual auctions to select electronic ads for display), atmospheric science (weather forecasting), and public health (to model epidemics). But there is limited understanding of the statistical properties of ABMs. Specifically, there are no formal procedures for calculating confidence intervals on predictions, nor for assessing goodness-of-fit, nor for testing whether a specific parameter (rule) is needed in an ABM. Motivated by important challenges of this sort, this dissertation focuses on developing methodology for uncertainty quantification and statistical inference in a likelihood-free context for ABMs.
Chapter 2 of the thesis develops theory related to ABMs, including procedures for model validation, assessing model equivalence, and measuring model complexity. Chapters 3 and 4 of the thesis focus on two approaches for performing likelihood-free inference involving ABMs, which is necessary because the likelihood function is intractable due to the variety of input rules and the complexity of outputs.
Chapter 3 explores the use of Gaussian process emulators in conjunction with ABMs to perform statistical inference. This draws upon a wealth of research on emulators, which find smooth functions on lower-dimensional Euclidean spaces that approximate the ABM. Emulator methods combine observed data with output from ABM simulations, using these to fit and calibrate Gaussian-process approximations.
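A rough sketch of emulator-based calibration under simple assumptions: run the simulator (a cheap placeholder function below, standing in for an expensive ABM) at a small design of parameter values, fit a Gaussian process to a scalar summary of the output, and choose the parameter whose emulated summary best matches the observed summary.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def abm_summary(theta, rng):
    """Placeholder for an expensive ABM run returning a scalar summary statistic."""
    return np.sin(3 * theta) + 0.5 * theta + rng.normal(0, 0.05)

rng = np.random.default_rng(6)
design = np.linspace(0, 2, 15)                               # parameter design points
summaries = np.array([abm_summary(t, rng) for t in design])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=0.05**2)
gp.fit(design.reshape(-1, 1), summaries)                     # emulator of the ABM summary

observed = 1.2                                               # summary computed from "real" data
grid = np.linspace(0, 2, 400).reshape(-1, 1)
pred = gp.predict(grid)
best = grid[np.argmin((pred - observed) ** 2)][0]
print("calibrated parameter (emulator-based):", round(best, 3))
```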
Chapter 4 discusses Approximate Bayesian Computation (ABC) for ABM inference, the goal of which is to obtain an approximation of the posterior distribution of a set of parameters given observed data.
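A minimal rejection-ABC sketch of the idea (the model below is a quick placeholder, not an ABM): draw parameters from the prior, simulate data, and keep the draws whose simulated summary statistic falls within a tolerance of the observed summary; the accepted draws approximate the posterior.

```python
import numpy as np

def simulate(theta, rng, n=200):
    """Placeholder for an ABM run; returns a summary statistic of its output."""
    return rng.poisson(theta, size=n).mean()

rng = np.random.default_rng(7)
observed_summary = 4.2
tolerance = 0.1
accepted = []
for _ in range(20_000):
    theta = rng.gamma(2.0, 2.0)                 # prior draw
    if abs(simulate(theta, rng) - observed_summary) < tolerance:
        accepted.append(theta)
posterior = np.array(accepted)
print(f"ABC posterior mean {posterior.mean():.2f} from {len(posterior)} accepted draws")
```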
The final chapters of the thesis demonstrate the approaches for inference in two applications. Chapter 5 presents an application modeling the spread of HIV based on detailed data on a social network of men who have sex with men (MSM) in southern India. Use of an ABM allows us to determine which social, economic, and policy factors contribute to the transmission of the disease. We aim to estimate the effect that proposed medical interventions will have on the spread of HIV in this community. Chapter 6 examines the function of a heroin market in the Denver, Colorado metropolitan area. Extending an ABM developed from ethnographic research, we explore a procedure for reducing the model, as well as estimating posterior distributions of important quantities based on simulations.
Item Open Access Statistical Issues in Quantifying Text Mining Performance (2017) Chai, Christine Peijinn. Text mining is an emerging field in data science because text information is ubiquitous, but analyzing text data is much more complicated than analyzing numerical data. Topic modeling is a commonly-used approach to classify text documents into topics and identify key words, so that the text information of interest is distilled from the large corpus. In this dissertation, I investigate various statistical issues in quantifying text mining performance; Chapter 1 is a brief introduction.
Chapter 2 is about adequate pre-processing for text data. For example, words of the same stem (e.g., "study" and "studied") should be assigned the same token because they carry the same meaning. In addition, specific phrases such as "New York" and "White House" should be retained because many topic classification models focus exclusively on single words. Statistical methods, such as conditional probability and p-values, are used as an objective approach to discovering these phrases.
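A small illustration of the conditional-probability idea for retaining phrases (the counts threshold, probability threshold, and toy text are made up): flag bigrams where the second word follows the first a large fraction of the time.

```python
from collections import Counter

def find_phrases(tokens, min_count=5, threshold=0.7):
    """Flag bigrams (w1, w2) where P(w2 | w1) exceeds a threshold, a simple
    conditional-probability test for fixed phrases such as 'new york'."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = []
    for (w1, w2), c in bigrams.items():
        if c >= min_count and c / unigrams[w1] >= threshold:
            phrases.append((w1, w2, c / unigrams[w1]))
    return phrases

text = ("officials in new york said white house aides will meet new york city "
        "leaders in new york next week while white house staff and the white "
        "house press office review the new budget")
print(find_phrases(text.split(), min_count=3, threshold=0.6))
```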
Chapter 3 begins the quantification of text mining performance by measuring the improvement in topic modeling results from text pre-processing. Retaining specific phrases increases their distinctiveness because the "signal" of the most probable topic becomes stronger (i.e., the maximum probability is higher) than the signal generated by either of the two words separately. Therefore, text pre-processing helps recover semantic information at the word level.
Chapter 4 quantifies the uncertainty of a widely-used topic model, latent Dirichlet allocation (LDA). A synthetic text dataset was created with known topic proportions, and I tried several methods to determine the appropriate number of topics from the data. The pre-set number of topics strongly affects the topic model results because LDA tends to utilize all the topics allotted, so that each topic has roughly equal representation.
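One common, if imperfect, way to probe the number-of-topics question (not necessarily among the methods tried in the dissertation): fit LDA at several candidate values of K and compare held-out perplexity. The document-term matrix below is a random placeholder, not the synthetic corpus described.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
# Placeholder document-term counts; a real test would use a synthetic corpus
# generated from known topic proportions, as in the dissertation.
X = rng.poisson(0.3, size=(400, 200))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

for k in (5, 10, 20):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    print(f"K={k:2d}  held-out perplexity = {lda.perplexity(X_test):.1f}")
```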
Last but not least, Chapter 5 explores a few selected text models as extensions, such as supervised latent Dirichlet allocation (sLDA), survey data application, sentiment analysis, and the infinite Gaussian mixture model.
Item Open Access Topics in Applied Statistics (2023) LeBlanc, Patrick M. One of the fundamental goals of statistics is to develop methods which provide improved inference in applied problems. This dissertation introduces novel methodology and reviews state-of-the-art existing methods in three different areas of applied statistics. Chapter 2 focuses on modelling subcommunity dynamics in gut microbiome data. Existing methods ignore cross-sample heterogeneity in subcommunity composition; we propose a novel mixed-membership model which captures cross-sample heterogeneity using the phylogenetic tree and, as a result, is robust to misspecifying the number of subcommunities. Chapter 3 reviews state-of-the-art methods in recommender systems, including collaborative filtering, content-based filtering, hybrid recommenders, and active recommender systems. Existing literature has focused primarily on bespoke applications; statisticians have an opportunity to build recommender system theory. Chapter 4 proposes a novel method of accounting for time-based design inconsistencies in Bayesian network meta-analysis models and discovers non-linear time trends in the effectiveness of vancomycin as a MRSA treatment. Chapter 5 provides some concluding remarks.
Item Open Access Topics in Computational Advertising (2014) Au, Timothy ChunWai. Computational advertising is an emerging scientific discipline that incorporates tools and ideas from fields such as statistics, computer science, and economics. Although a consequence of the rapid growth of the Internet, computational advertising has since helped transform the online advertising business into a multi-billion dollar industry.
The fundamental goal of computational advertising is to determine the ``best'' online ad to display to any given user. This ``best'' ad, however, changes depending upon the specific context that is under consideration. This leads to a variety of different problems, three of which are discussed in this thesis.
Chapter 1 briefly introduces the topics of online advertising and computational advertising. Chapter 2 proposes a numerical method to approximate the pure strategy Nash equilibrium bidding functions in an independent private value first-price sealed-bid auction where bidders draw their types from continuous and atomless distributions, a setting in which solutions cannot generally be derived analytically, despite the fact that they are known to exist and be unique. Chapter 3 proposes a cross-domain recommender system that is a multiple-domain extension of the Bayesian Probabilistic Matrix Factorization model. Chapter 4 discusses some of the tools and challenges of text mining, using the Trayvon Martin shooting incident as a case study in analyzing the lexical content and network connectivity structure of the political blogosphere. Finally, Chapter 5 presents some concluding remarks and briefly discusses other problems in computational advertising.
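Chapter 3's cross-domain model is not reproduced here; as background, the sketch below shows a minimal (non-Bayesian) matrix-factorization recommender: learn low-rank user and item factors from partially observed ratings by gradient descent. Bayesian Probabilistic Matrix Factorization additionally places priors on the factor matrices and averages over them; all sizes and hyperparameters below are made up.

```python
import numpy as np

def pmf(R, mask, rank=5, lam=0.1, lr=0.01, iters=2000, seed=0):
    """Minimal matrix factorization sketch: learn user/item factors by
    gradient descent on the observed ratings only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(iters):
        err = mask * (R - U @ V.T)               # error on observed entries only
        U += lr * (err @ V - lam * U)
        V += lr * (err.T @ U - lam * V)
    return U, V

rng = np.random.default_rng(9)
true = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))
mask = rng.random((30, 20)) < 0.4                # 40% of ratings observed
U, V = pmf(true * mask, mask)
rmse = np.sqrt((((U @ V.T - true) * ~mask) ** 2).sum() / (~mask).sum())
print("RMSE on held-out entries:", round(float(rmse), 3))
```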
Item Open Access Two Applications of Adversarial Risk Analysis (2011) Wang, Shouqiang. Adversarial risk analysis (ARA) attempts to apply statistical methodology to game-theoretic problems and provides an alternative to the solution concepts of traditional game theory. Specifically, it uses a Bayesian model for the decision-making processes of one's opponents to develop a subjective distribution over their actions, enabling the application of traditional risk analysis to maximize expected utility. This thesis applies the ARA framework to network routing problems in adversarial contexts and to a range of simple Borel gambling games.
Item Open Access VizMaps: A Bayesian Topic Modeling Based PubMed Search Interface (2015) Kamboj, Kirti. A common challenge that users of academic databases face is making sense of their query outputs for knowledge discovery. This is exacerbated by the size and growth of modern databases. PubMed, a central index of biomedical literature, contains over 25 million citations, and a search can return hundreds of thousands of them. Under these conditions, efficient knowledge discovery requires a different data structure than a chronological list of articles. It requires a method of conveying what the important ideas are, where they are located, and how they are connected: a method of allowing users to see the underlying topical structure of their search. This paper presents VizMaps, a PubMed search interface that addresses some of these problems. Given search terms, our main backend pipeline extracts relevant words from the title and abstract of each result, clusters them into discovered topics using Bayesian topic models, in particular latent Dirichlet allocation (LDA), and then outputs a visual, navigable map of the query results.
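A toy version of the kind of backend pipeline described (scikit-learn components; the documents are invented stand-ins for PubMed titles and abstracts, not actual query results): vectorize the text, fit LDA, and print the top words per discovered topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for PubMed titles/abstracts returned by a query.
docs = [
    "randomized trial of statin therapy for cholesterol reduction",
    "statin dose response and ldl cholesterol outcomes",
    "deep learning segmentation of brain mri images",
    "convolutional networks for tumor detection in mri scans",
    "bayesian hierarchical model for clinical trial analysis",
    "posterior inference for adaptive clinical trials",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```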