Abstract:
This paper addresses the problem of detecting anomalous interactions or traffic within a very large
network using a limited number of unlabeled observations. In particular, consider n recorded interactions
among p nodes, where p may be very large relative to n. A novel method based on using a hypergraph
representation of the data is proposed to deal with this very high-dimensional, “big p, small n” problem.
Hypergraphs constitute an important extension of graphs which allows edges to connect more than two
vertices simultaneously. An algorithm for detecting anomalies directly on the corresponding discrete
space, without any feature selection or dimensionality reduction, is presented. The algorithm has O(np)
computational complexity, making it ideally suited for very large networks, and requires no tuning,
bandwidth or regularization parameters.
The distribution of the data is modeled as a two-component mixture, consisting of a “nominal”
and an “anomalous” component. The deviance of each observation from nominal behavior, as well
as the mixture parameters, are learned using Expectation-Maximization (EM), assuming a multivariate
Bernoulli variational approximation. This approach is related to probability mass function level set
estimation and is shown to allow False Discovery Rate control. The identifiability of the underlying
distribution, the local consistency of the EM algorithm, and the avoidance of singular solutions are
proved. The proposed approach is validated on high-dimensional synthetic data and it is shown that, for
a useful class of data distributions, it can outperform other state-of-the-art methods.