Browsing by Subject "Community detection"
Item Open Access Communities in Social Networks: Detection, Heterogeneity and Experimentation (2022) Mathews, Heather
The study of network data in the social and health sciences frequently concentrates on understanding how and why connections form. In particular, the task of determining latent mechanisms driving connection has received a lot of attention across statistics, machine learning, and information theory. In social networks, this mechanism often manifests as community structure. As a result, this work provides methods for discovering and leveraging these communities to better understand networks and the data they generate.
We provide three main contributions. First, we present methodology for performing community detection in challenging regimes. Existing literature has focused on modeling the spectral embedding of a network using Gaussian mixture models (GMMs) in scaling regimes where the ability to detect community memberships improves with the size of the network. However, these regimes are not very realistic. As such, we provide tractable methodology motivated by new theoretical results for networks with non-vanishing noise by using GMMs that incorporate truncation and shrinkage effects.
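As a rough illustration of the baseline approach the abstract refers to (fitting a Gaussian mixture to a spectral embedding), the sketch below simulates a two-community stochastic block model, extracts the second eigenvector of the adjacency matrix by power iteration, and clusters its entries with a two-component one-dimensional GMM fitted by EM. It does not implement the truncation and shrinkage effects proposed in the thesis. All function names, the network size, and the edge probabilities are assumptions made for the example.

```python
import math
import random


def sample_sbm(n, p_in, p_out, rng):
    """Two equal-sized communities; returns (adjacency matrix, true labels)."""
    labels = [0 if i < n // 2 else 1 for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1.0
    return adj, labels


def top_eigvec(mat, rng, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    n = len(mat)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in mat]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v


def fit_gmm_1d(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture; returns hard labels."""
    mean = sum(xs) / len(xs)
    var0 = sum((x - mean) ** 2 for x in xs) / len(xs)
    mu, var, pi = [min(xs), max(xs)], [var0, var0], [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = dens[0] + dens[1]
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and (floored) variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-8
            )
            pi[k] = nk / len(xs)
    return [0 if r[0] >= r[1] else 1 for r in resp]


def demo(seed=0, n=60, p_in=0.6, p_out=0.1):
    """Return clustering accuracy (up to label swap) on one simulated network."""
    rng = random.Random(seed)
    adj, truth = sample_sbm(n, p_in, p_out, rng)
    lam1, v1 = top_eigvec(adj, rng)
    # Deflate the leading eigenpair so power iteration finds the second one.
    deflated = [
        [adj[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)
    ]
    _, v2 = top_eigvec(deflated, rng)
    guess = fit_gmm_1d(v2)
    agree = sum(1 for g, t in zip(guess, truth) if g == t)
    return max(agree, n - agree) / n


if __name__ == "__main__":
    print("accuracy:", demo())
```

The chosen edge probabilities give well-separated communities, which is the easy regime; shrinking the gap between `p_in` and `p_out` moves the simulation toward the noisier setting the thesis targets.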
Further, when covariate information is available, often we want to understand how covariates impact connections. It is likely that the effects of covariates on edge formation differ between communities (e.g. age might play a different role in friendship formation in communities across a city). To address this issue, we introduce a latent space network model where coefficients associated with certain covariates can depend on latent community membership of the nodes. We show that ignoring such structure can lead to either over- or under-estimation of covariate importance to edge formation and propose a Markov Chain Monte Carlo approach for simultaneously learning the latent community structure and the community specific coefficients.
Finally, we consider how community structure can impact experimentation. It is evident that communities can act in different ways, and it is natural that this propagates into experimental design. As a result, this observation motivates our development of community-informed experimental design. This design recognizes that information between individuals likely flows along within-community edges rather than across-community edges. We demonstrate that this design improves estimation of the global average treatment effect, even when the community structure of the graph needs to be estimated.
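One simple way to picture the idea is community-level randomization, where every node inherits the treatment arm of its community, so no within-community edge connects a treated node to a control node. The sketch below contrasts this with node-level randomization on a toy graph. It is an illustrative assumption, not necessarily the design developed in the thesis; the function names, the toy graph, and the treatment probability are all hypothetical.

```python
import random


def assign_by_community(membership, rng, p_treat=0.5):
    """Randomize at the community level: each node inherits its community's arm."""
    arms = {c: rng.random() < p_treat for c in sorted(set(membership.values()))}
    return {node: arms[c] for node, c in membership.items()}


def assign_by_node(membership, rng, p_treat=0.5):
    """Baseline: independent randomization of every node."""
    return {node: rng.random() < p_treat for node in sorted(membership)}


def mixed_edge_fraction(edges, treated):
    """Share of edges whose endpoints are in different treatment arms."""
    mixed = sum(1 for u, v in edges if treated[u] != treated[v])
    return mixed / len(edges)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
within = [(u, v) for u, v in edges if membership[u] == membership[v]]

rng = random.Random(0)
by_comm = assign_by_community(membership, rng)
by_node = assign_by_node(membership, rng)

if __name__ == "__main__":
    print("community-level, mixed within-community edges:",
          mixed_edge_fraction(within, by_comm))
    print("node-level, mixed within-community edges:",
          mixed_edge_fraction(within, by_node))
```

Under community-level assignment the mixed fraction over within-community edges is zero by construction; only the bridge edge can cross arms.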
Item Open Access FUNDAMENTAL LIMITS FOR COMMUNITY DETECTION IN LABELLED NETWORKS (2020) Mayya, Vaishakhi Sathish
The problem of detecting the community structure of networks, as well as closely related problems involving low-rank matrix factorization, arises in applications throughout science and engineering. This dissertation focuses on the fundamental limits of detection and recovery associated with a broad class of probabilistic network models that includes the stochastic block model with labeled edges. The main theoretical results are formulas that describe the asymptotically exact limits of the mutual information and reconstruction error. The formulas are described in terms of low-dimensional estimation problems in additive Gaussian noise.
The analysis builds upon a number of recent theoretical advances at the interface of information theory, random matrix theory, and statistical physics, including concepts such as channel universality and interpolation methods. The theoretical formulas provide insight into the ability to recover the community structure in the network. The analysis is supported by numerical simulations: across different network models, the observed performance closely follows the performance predicted by the formulas.
Item Open Access Probabilistic Models for Text in Social Networks (2018) Owens-Oas, Derek
Text in social networks is a common form of data. Common examples include emails between coworkers, text messages in a group chat, or comments on Facebook. There is value in developing models for such data. Examples of related services include archiving emails by topic and recommending job prospects for those seeking employment. However, due to privacy concerns, these data are relatively hard to obtain. We therefore design and experiment with publicly available data of the same structure.
Motivated primarily by topic discovery, this thesis begins with a thorough survey of models which extend the foundational probabilistic topic model, latent Dirichlet allocation (LDA). My focus is on those which endow documents with metadata, like a time stamp, the author, or a set of links to other authors. Each approach is given common notation, described in terms of a structural innovation to LDA, and presented in a graphical model. The review reveals that, to our knowledge, no previous model combines dynamic topic modeling and community detection.
The first data set studied in this thesis is a corpus of political blog posts. Our motivation is to learn communities, guided by the presence of links and dynamic topic interests. This formulation enables new link recommendation. We therefore develop an appropriate Bayesian probabilistic model to learn these parameters jointly. Experiments reveal the model successfully identifies a group of blogs which discuss sensational crime, despite having very few links between these blogs. It also enables presentation of top blogs, according to various criteria, for a specified topic interest community.
In a second analysis of the blog post data I develop a similar model. The motivation is to partition documents into groups. The groups are defined by shared topic interest proportions and shared linking patterns. Documents in the same group are reasonable recommendations to a reader. The model is designed to extend the foundational LDA. This enables easy comparison to a strong baseline. Also, it offers an alternative to LDA for situations where a hard clustering of documents is desired, and documents with similar enough topic proportions are clustered together. It simultaneously learns the linking tendency for each of these groups.
We show a different application of a probabilistic model for text data in social networks to related text event sequence data. Here we analyze a transcription of group conversation data from the movie 12 Angry Men. A main contribution is an algorithm based on marked multivariate Hawkes processes to recover latent structure, learning the root source of an event. The algorithm is tested on synthetic data and a Reddit data set where structure is observed. The algorithm enables partial credit attribution, distributing the credit over likely people who start each new conversation thread.
The above models and applications demonstrate the value of text network data. Generalized software for such data enables visualization and summarization of model outputs for text data in social networks.
Item Open Access Statistical Inference and Community Detection in Proximity and Spatial Proteomics: Resolving the Organization of the Neuronal Proteome (2021) Bradshaw, Tyler Wesley
Technological advances in protein mass spectrometry (MS), aka proteomics, have enabled high-throughput quantification of spatially-resolved, subcellular-specific proteomes. Biological insight in these experiments depends upon sound statistical analysis. Despite the myriad of existing proprietary and open-source software solutions for statistical analysis of proteomics data, these tools suffer a drawback inherent in any general solution: a loss of specificity. These tools often fail to be easily adapted to analyze experiment-specific designs. I present a flexible, linear mixed-effects model framework for assessing differential abundance in protein mass spectrometry experiments. Combined with methods to identify communities of proteins in biological networks, I extend this framework to perform inference at the level of protein groups or modules. Using these software tools, I demonstrate how module-level insight in proximity and spatial proteomics generates hypotheses that identify foci of biological function and dysfunction which may underlie the neuropathology of disease.