# Browsing by Department "Statistical and Economic Modeling"

Item Open Access A Comparison Of Multiple Imputation Methods For Categorical Data (2015) Akande, Olanrewaju Michael

This thesis evaluates the performance of several multiple imputation methods for categorical data, including multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees and non-parametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the American Community Survey 2012. These data afford exploration of practical problems such as multicollinearity and large dimensions. This thesis highlights some advantages and limitations of each method compared to others. Finally, it provides suggestions on which method should be preferred, and conditions under which the suggestions hold.

Item Open Access A Comparison of Serial & Parallel Particle Filters for Time Series Analysis (2014) Klemish, David

This paper discusses the application of parallel programming techniques to the estimation of hidden Markov models via the use of a particle filter. It highlights how the Thrust parallel programming language can be used to implement a particle filter in parallel. The impact of a parallel particle filter on the running times of three different models is investigated. For particle filters using a large number of particles, Thrust provides a speed-up of five to ten times over a serial C++ implementation, which is less than reported in other research.

Item Open Access Applications of Statistical and Economic Analysis in Finance and Health Industry (2015) Sun, Xuan

This paper summarizes my internship and several individual and team academic projects, including a quantitative and statistical analysis of some important macro factors and financial models, and a data analysis project in drug cost reduction. The first chapter discusses the mechanism and impact of pass-through from the dynamics of the RMB exchange rate in China, using basic econometric regression analysis. The result is significant and coincides with common sense about investment decisions. The second chapter is about the revised CCAPM model. Through a modified distribution of error terms, the CCAPM model shows improved explanatory power. The third chapter is a data analysis project on drug cost reduction. I used a Bayesian method to explore the relationship between drug cost and other predictors, and the result gives advice on designing health plans to minimize the cost.

Item Open Access Bayes High-Dimensional Density Estimation Using Multiscale Dictionaries (2014) Wang, Ye

Although Bayesian density estimation using discrete mixtures has good performance in modest dimensions, there is a lack of statistical and computational scalability to high-dimensional multivariate cases. To combat the curse of dimensionality, it is necessary to assume the data are concentrated near a lower-dimensional subspace. However, Bayesian methods for learning this subspace along with the density of the data scale poorly computationally. To solve this problem, we propose an empirical Bayes approach, which estimates a multiscale dictionary using geometric multiresolution analysis in a first stage. We use this dictionary within a multiscale mixture model, which allows uncertainty in component allocation, mixture weights and scaling factors over a binary tree. A computational algorithm is proposed, which scales efficiently to massive dimensional problems. We provide some theoretical support for this method, and illustrate the performance through simulated and real data examples.

Item Open Access Bayesian Hierarchical Models to Address Problems in Neuroscience and Economics (2017) Zaman, Azeem Zahid

In the first chapter, motivated by a model used to analyze spike train data, we present a method for learning multiple probability vectors by using information from large samples to improve estimates for smaller samples. The method makes use of Polya-gamma data augmentation to construct a conjugate model whose posterior can estimate the weights of a mixture distribution. This novel method successfully borrows information from large samples to increase the precision and accuracy of estimates for smaller samples.

In the second chapter, data from the Federal Communications Commission spectrum auction number 73 is analyzed. By analyzing the structure of the auction, bounds are placed on the valuations that govern the bidders' decisions in the auction. With these bounds, common models are estimated by imputing valuations and the results are compared with the estimates from standard methods used in the literature. The comparison shows some important differences between the approaches. A second model that accounts for the geographic relationship between the licenses sold finds strong evidence of a correlation between the value of adjacent licenses, as expected by economic theory.

Item Open Access Bayesian Models for Causal Analysis with Many Potentially Weak Instruments (2015) Jiang, Sheng

This paper investigates Bayesian instrumental variable models with many instruments. The number of instrumental variables grows with the sample size and is allowed to be much larger than the sample size. With some sparsity condition on the coefficients on the instruments, we characterize a general prior specification where the posterior consistency of the parameters is established and calculate the corresponding convergence rate.

In particular, we show the posterior consistency for a class of spike and slab priors on the many potentially weak instruments. The spike and slab prior shrinks the number of instrumental variables, which avoids overfitting and provides uncertainty quantifications on the first stage. A simulation study is conducted to illustrate the convergence notion and estimation/selection performance under dependent instruments. Computational issues related to the Gibbs sampler are also discussed.

Item Open Access Claims Severity Modeling (2015) Anand, Radhika

This study is presented as a portfolio of three projects, two of which were part of my summer internship at CNA Insurance, Chicago, and one of which was part of the course STA 663: Statistical Computation.

Project 1, Text Mining Software, aimed at building efficient text mining software for CNA Insurance, in the form of an R package, to mine sizable amounts of unstructured claim notes data to prepare structured input for claims severity modeling. This software decreased run-time 30-fold compared to the software used previously at CNA.

Project 2, Workers’ Compensation Panel Data Analysis, aimed at tracking workers’ compensation claims over time and pointing out variables (particularly medical) that made a claim successful in the long run. It involved creating a parsimonious Mixed Effects model on a panel dataset of Workers’ Compensation claims at CNA Insurance.

Project 3, Infinite Latent Feature Models and the Indian Buffet Process (IBP), used IBP as a prior in models for unsupervised learning by deriving and testing an infinite Gaussian binary latent feature model. An image dataset was simulated (similar to Griffiths and Ghahramani (2005)) to test the model.

Item Open Access Default Prior Choice for Bayesian Model Selection in Generalized Linear Models with Applications in Mortgage Default (2014) Kramer, Zachary

The adoption of Zellner's g prior is a popular prior choice in Bayesian Model Averaging, although literature has shown that using a fixed g has undesirable properties. Mixtures of g priors have recently been proposed for generalized linear models, extending results from the Gaussian linear model context. This paper will specifically look at the model selection problem as it applies to logistic regression. The effect of prior choice on both model selection and prediction using Bayesian Model Averaging is analyzed. This is done by testing a variety of model space and mixtures of g priors in a simulated data study as well as illustrating their use in mortgage default data. This paper shows that the different mixtures of g priors tend to fall into one of two groups that have similar behavior. Additionally, priors in one of these groups, specifically the n/2, Beta Prime, and Robust mixtures of g priors, tend to outperform the other choices.

Item Open Access Differentially Private Verification of Predictions from Synthetic Data (2017) Yu, Haoyang

When data are confidential, one approach for releasing publicly available files is to make synthetic data, i.e., data simulated from statistical models estimated on the confidential data. Given access only to synthetic data, users have no way to tell if the synthetic data can preserve the adequacy of their analysis. Thus, I present methods that can help users to make such assessments automatically while controlling the information disclosure risks in the confidential data. There are three verification methods presented in this thesis: differentially private prediction tolerance intervals, differentially private prediction histogram, and differentially private Kolmogorov-Smirnov test. I use simulation to illustrate these prediction verification methods.

Item Open Access Do Chinese Investors Get What They Don’t Pay For? Expense Ratios, Loads, and The Returns to China's Open-End Mutual Funds (2015) Wang, Yang

In this paper we analyze the performance of China's open-end mutual funds by different approaches. Using the data of 467 open-end mutual funds from 60 fund families from Jan 2010 to Apr 2015, we find that the performance of most mutual funds does not beat the collection of indexes that most closely track the fund, and the fund families with high expense ratios serve investors less well than those with low expense ratios. Investors would earn higher returns by investing in mutual funds with low expenses and low front-end loads.

Item Open Access Dynamic Time Varying Models for Predicting Patient Deterioration (2017) McCreanor, Reuben Knowles

Streaming data are becoming more common in a variety of fields. One common data stream in clinical medicine is electronic health records (EHRs) which have been used to develop risk prediction models. Our motivating application considers the risk of patient deterioration, which is defined as in-hospital mortality or transfer to the Intensive Care Unit (ICU). Duke University Hospital recently implemented an alert risk score for acute care wards: the National Early Warning Score (NEWS). However, NEWS was designed to be hand-calculable from patient vital data rather than to optimize prediction. Our approach considers three further methods to use on real-time EHR data to predict patient deterioration. We propose a Cox model, a joint modeling approach, and a Gaussian process. By evaluating the implementation of these models on clinical EHR data from more than 51,000 patients, we are able to provide a comparison of the methods on real EHR data for patient deterioration. We evaluate the results on both performance and scalability and consider the feasibility of implementing each approach in a clinical environment. While the more complicated models may potentially offer a small gain in predictive performance, they do not scale to a full patient data set. Thus, within a clinical setting, the Cox model is clearly the best approach.

Item Open Access Mining Political Blogs With Network Based Topic Models (2014) Liang, Jiawei

We develop a Network Based Topic Model (NBTM), which integrates a Random Graph model with the Latent Dirichlet Allocation (LDA) model. The NBTM assumes that the topic proportion of a document has a fixed variance across the document corpus with author differences treated as random effects. It also assumes that the links between documents are binary variables whose probabilities depend upon the author random effects. We fit the model to political blog posts during the calendar year 2012 that mention Trayvon Martin. This paper presents the topic extraction results and posterior prediction results for hidden links within the blogosphere.

Item Open Access Multi-View Weighted Network (2016) Yang, Xi

Extensive investigation has been conducted on network data, especially weighted networks in the form of symmetric matrices with discrete count entries. Motivated by statistical inference on multi-view weighted network structure, this paper proposes a Poisson-Gamma latent factor model that not only separates view-shared and view-specific spaces but also achieves reduced dimensionality. A multiplicative gamma process shrinkage prior is implemented to avoid overparameterization, and an efficient full conditional conjugate posterior for Gibbs sampling is obtained. By accommodating both view-shared and view-specific parameters, the model adapts flexibly to the extent of similarity across view-specific spaces. Accuracy and efficiency are tested in a simulated experiment. An application to real soccer network data is also presented to illustrate the model.

Item Open Access Multiple Imputation on Missing Values in Time Series Data (2015) Oh, Sohae

Financial stock market data, for various reasons, frequently contain missing values. One reason for this is that, because the markets close for holidays, daily stock prices are not always observed. This creates gaps in information, making it difficult to predict the following day’s stock prices. In this situation, information during the holiday can be “borrowed” from other countries’ stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from various markets around the world and develop an algorithm to impute the missing values in individual stock indices using “information-sharing” between different time series. To develop an imputation algorithm that accommodates time series-specific features, we take a multiple imputation approach using a dynamic linear model for time-series and panel data. This algorithm assumes an ignorable missing data mechanism, since missingness is due to holidays. The posterior distributions of parameters, including missing values, are simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from sets of draws are then combined using Rubin’s combination rule, rendering final inference for the data set. Specifically, we use the Gibbs sampler and Forward Filtering and Backward Sampling (FFBS) to simulate the joint posterior distribution and posterior predictive distribution of latent variables and other parameters. A simulation study is conducted to check the validity and the performance of the algorithm using two error-based measurements: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compared the overall trend of the imputed time series with the complete data set, and inspected the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark.

The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK and Germany. From both the simulation and the application, we concluded that the imputation algorithm performs well enough to achieve our original goal, predicting the opening stock price after a holiday, and that it outperforms the benchmark method. We believe this multiple imputation algorithm can be used in many applications that deal with time series with missing values such as financial and economic data and biomedical data.
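
The abstract above combines estimates from several imputed data sets using Rubin's combination rule. As a rough illustration only, the standard form of that rule for a scalar quantity can be sketched as follows. The function name `rubin_combine` and the example numbers are hypothetical and are not taken from the thesis; the thesis's own implementation may differ.

```python
from statistics import mean, variance


def rubin_combine(estimates, within_variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates:        one point estimate per imputed data set
    within_variances: the estimated variance of each point estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations of equal length")

    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical example: three imputed data sets, illustrative numbers only.
pooled, total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The factor `1 + 1/m` inflates the between-imputation variance to account for using a finite number of imputations.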

Item Open Access Simulation Study on Exchangeability and Significant Test on Survey Data (2015) Cao, Yong

The two years of the Master of Science in Statistical and Economic Modeling program were the most rewarding time of my life. This thesis acts as a portfolio of project and applied experience gained while I was enrolled in the program. It summarizes my graduate study in two parts: Simulation Study of Exchangeability for Binary Data, and Summary of Summer Internship at Center for Responsible Lending. The project Simulation Study of Exchangeability for Binary Data contains materials from a team project, which was jointly performed by Sheng Jiang, Xuan Sun and me. Abstracts for both projects are below in order.

(1) Simulation Study of Exchangeability for Binary Data

To investigate tractable Bayesian tests on exchangeability, this project considers special cases of nonexchangeable random sequences: Markov chains. Asymptotic results for the Bayes factor (BF) are derived. When the null hypothesis is true, the Bayes factor in favor of the null goes to infinity at a geometric rate (true odds is not one half). When the null hypothesis is not true, the Bayes factor in favor of the null goes to 0 faster than a geometric rate. The results are robust under misspecifications. Simulation studies are employed to see the performance of the test when the sample size is small, prior beliefs change and true parameters change.

(2) Summary of Summer Internship at Center for Responsible Lending

My summer internship dealt with survey data from Social Science Research Solution about auto financing. The dataset includes about one thousand valid responses and 114 variables for each response. My exploratory statistical analysis uncovered many interesting findings. For example, African Americans and Latinos are receiving 2.02% higher APR on average than white buyers, excluding the effects of relevant variables. In addition, Fisher's exact test is used extensively to assess the significance of a series of variables. Results are presented in neatly organized tables. Findings are included in weekly reports. One example finding is that warranty add-ons of a financed car have significant impacts on all three aspects of a loan: Annual Percentage Rate, Loan Amount, and Monthly Payment.

Item Open Access Spatial Assignments Using Intrinsic Markers to Infer Migratory Patterns (2017) Qian, Lei

In ecology, it is extremely useful to model migratory connections as it can be built upon to produce further research into organisms and the environment. However, it can be difficult due to the high cost of tracking animals using extrinsic factors such as tagging or electronic chips and the possibility of influencing organisms' behavior afterwards. I use a Bayesian Gaussian Process method developed in (Rundel et al. 2013) to model migratory bird connectivity patterns using intrinsic markers, allele counts, on a spatial scope to predict breeding grounds of a genetic sample.

Item Open Access U.S. Fiscal Multipliers (2015) Lusompa, Amaze Basilwa

This paper investigates whether government spending multipliers are time-varying. The multipliers are measured using time-varying parameter (TVP) local projections. This paper uses a simple modification to local projections that corrects for the inherent autocorrelated errors in local projections. The results indicate that there is evidence of time variation in government spending multipliers and that the results of previous studies should be seriously questioned. The results also indicate that there is significant time variation in the strength of Blanchard-Perotti and defense news identified shocks.

Item Open Access VizMaps: A Bayesian Topic Modeling Based PubMed Search Interface (2015) Kamboj, Kirti

A common challenge that users of academic databases face is making sense of their query outputs for knowledge discovery. This is exacerbated by the size and growth of modern databases. PubMed, a central index of biomedical literature, contains over 25 million citations, and can output search results containing hundreds of thousands of citations. Under these conditions, efficient knowledge discovery requires a different data structure than a chronological list of articles. It requires a method of conveying what the important ideas are, where they are located, and how they are connected; a method of allowing users to see the underlying topical structure of their search. This paper presents VizMaps, a PubMed search interface that addresses some of these problems. Given search terms, our main backend pipeline extracts relevant words from the title and abstract, and clusters them into discovered topics using Bayesian topic models, in particular the Latent Dirichlet Allocation (LDA). It then outputs a visual, navigable map of the query results.