Anomaly-Detection and Health-Analysis Techniques for Core Router Systems
Modern telecommunication systems typically use a three-layer hierarchy to achieve high performance and reliability. The three layers, namely core, distribution, and access, perform different roles in service fulfillment. The core layer, also referred to as the network backbone, is responsible for transferring large amounts of traffic reliably and in a timely manner. The network devices (such as routers) in the core layer are vulnerable to errors that are hard to detect and hard to recover from. For example, the cards that constitute core router systems and the components that constitute a card can encounter hardware failures. Moreover, connectors between cards and interconnects between different components inside a card are also subject to hard faults. In addition, since the performance requirement of network devices in the core layer is approaching Tbps levels, failures caused by subtle interactions between parallel threads or applications have become more frequent. All these types of faults can incapacitate a core router, necessitating the design and implementation of fault-tolerant mechanisms in the core layer.
Proactive fault tolerance is a promising solution because it takes preventive action before a failure occurs. The state of the system is monitored in real time. When anomalies are detected, proactive repair actions such as job migration are executed to avoid errors, thereby maintaining the non-stop utilization of the entire system. The effectiveness of proactive fault-tolerance solutions depends on whether abnormal behaviors of core routers can be pinpointed accurately and in a timely manner.
This dissertation first presents an anomaly detector for core router systems using correlation-based time series analysis. The proposed technique monitors a set of features obtained from a system deployed in the field. Various types of correlations among the extracted features are identified. A set of features with minimum redundancy and maximum relevance is then grouped into different categories based on their statistical characteristics. A hybrid approach is developed to analyze the various feature categories using a combination of different anomaly detection methods, leading to the detection of realistic anomalies.
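As a rough illustration of the redundancy side of this feature-selection step, the following sketch drops features that are almost perfectly correlated with a feature already kept. It is a minimal example under stated assumptions, not the dissertation's method: the greedy strategy, the 0.95 threshold, and the feature names (`cpu_util`, `cpu_util_pct`, `queue_depth`) are all hypothetical, and the relevance criterion is not modeled at all.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant series: treat as uncorrelated
    return cov / (sx * sy)


def drop_redundant(features, threshold=0.95):
    """Greedily keep each feature unless it is highly correlated
    (|r| >= threshold) with a feature that was already kept."""
    kept = []
    for name, series in features.items():
        if all(abs(pearson(series, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical monitored features (names and values are illustrative only).
features = {
    "cpu_util": [10, 20, 30, 40, 50],
    "cpu_util_pct": [0.1, 0.2, 0.3, 0.4, 0.5],  # rescaled copy of cpu_util
    "queue_depth": [5, 1, 4, 2, 3],
}
selected = drop_redundant(features)
print(selected)
```

Here `cpu_util_pct` carries no information beyond `cpu_util`, so only one of the two survives, while the weakly correlated `queue_depth` is retained.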
Next, this dissertation presents the design of a changepoint-based anomaly detector that allows anomaly detection to adapt to changes in the statistical features of data streams. The proposed method first detects changepoints in the collected time-series data, and then utilizes these changepoints to detect anomalies. A clustering method is developed to identify a wide range of normal and abnormal patterns from changepoint windows. Experimental results show that the changepoint-based anomaly detector can detect outliers even when the statistical properties of the monitored data change significantly with time.
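The two-stage idea (find changepoints first, then judge each point against its own segment) can be sketched as follows. This is a simplified stand-in and an assumption throughout: it finds a single mean-shift changepoint by minimizing within-segment squared error and flags outliers with a median/MAD rule, whereas the detector described above clusters patterns from changepoint windows. The function names, the threshold `k`, and the data are hypothetical.

```python
from statistics import mean, median


def best_changepoint(xs):
    """Return the index that best splits xs into two segments,
    chosen by minimizing the total within-segment squared error."""
    def sse(seg):
        m = mean(seg)
        return sum((v - m) ** 2 for v in seg)

    candidates = range(2, len(xs) - 1)  # at least 2 points per segment
    return min(candidates, key=lambda i: sse(xs[:i]) + sse(xs[i:]))


def segment_outliers(xs, cp, k=5.0):
    """Flag indices whose value lies more than k MADs away from the
    median of the segment (before or after cp) that contains them."""
    flagged = []
    for start, seg in ((0, xs[:cp]), (cp, xs[cp:])):
        med = median(seg)
        # Fall back to a tiny scale if the segment has zero spread.
        mad = median(abs(v - med) for v in seg) or 1e-9
        flagged += [start + i for i, v in enumerate(seg)
                    if abs(v - med) / mad > k]
    return flagged


# Illustrative series: a level shift at index 6, plus one spike at index 10.
xs = [10, 11, 9, 10, 11, 10, 50, 51, 49, 50, 80, 50, 51, 49]
cp = best_changepoint(xs)
anomalies = segment_outliers(xs, cp)
print(cp, anomalies)
```

Because each point is compared only with its own segment, the legitimate shift from roughly 10 to roughly 50 is not reported as an anomaly, while the spike inside the second segment is.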
An efficient data-driven anomaly detector is not adequate to obtain a full picture of the health status of monitored core routers. It is also essential to learn how healthy a core router system is and how different task scenarios can affect the system. Therefore, this dissertation presents a symbol-based health status analyzer that first encodes, as a symbol sequence, the long-term complex time series collected from a number of core routers, and then utilizes the symbol sequence for health analysis. Symbol-based clustering and classification methods are developed to identify the health status.
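To make the notion of encoding a time series as a symbol sequence concrete, the sketch below uses a SAX-style scheme: z-normalize the series, average it over equal-width segments, and map each average to a letter using standard-normal quartile breakpoints. This is only one common encoding and is an assumption here; the abstract does not specify which encoding the analyzer uses, and the four-letter alphabet and the function name `to_symbols` are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Breakpoints that split a standard normal distribution into
# four equiprobable bins (its quartiles).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"


def to_symbols(series, n_segments):
    """Encode a numeric series as a word of n_segments symbols.
    Assumes len(series) is a multiple of n_segments."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros.
    z = [(v - mu) / sigma if sigma else 0.0 for v in series]
    size = len(z) // n_segments
    # Piecewise aggregate approximation: one mean per segment.
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(n_segments)]
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, p)] for p in paa)


word = to_symbols([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4)
print(word)
```

Once series are reduced to short words like this, string-oriented clustering and classification methods can be applied to them.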
In order to accurately identify the health status, historical operation data needs to be fully labeled, which is a challenge in the early stages of monitoring. Therefore, this dissertation presents an iterative self-learning procedure for assessing the health status. This procedure first computes a representative feature matrix to capture different characteristics of time-series data. Hierarchical clustering is then utilized to infer labels for the unlabeled dataset. Finally, a classifier is built and iteratively updated using both labeled and unlabeled datasets. Partially-labeled field data collected from a set of commercial core routers are used to experimentally validate the proposed method.
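The iterative part of such a procedure can be illustrated with a small self-training loop. The sketch is an assumption-laden simplification: it uses a nearest-centroid classifier, adopts an unlabeled point only when its nearest class centroid is at least `margin` times closer than the runner-up, and omits both the feature-matrix computation and the hierarchical clustering step. The labels `healthy` and `degraded`, the margin, and the data points are hypothetical.

```python
from math import dist


def centroids(points, labels):
    """Mean vector of the points belonging to each class label."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: tuple(sum(c) / len(ps) for c in zip(*ps))
            for y, ps in groups.items()}


def self_train(labeled, labels, unlabeled, margin=2.0, max_rounds=10):
    """Nearest-centroid self-training (requires at least two classes).
    Each round adopts the unlabeled points that are classified with
    enough margin, then recomputes the centroids."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled, labels)
        remaining, added = [], False
        for p in pool:
            ranked = sorted(cents, key=lambda y: dist(p, cents[y]))
            d1 = dist(p, cents[ranked[0]])
            d2 = dist(p, cents[ranked[1]])
            if d2 >= margin * d1:  # confident: adopt the predicted label
                labeled.append(p)
                labels.append(ranked[0])
                added = True
            else:
                remaining.append(p)
        pool = remaining
        if not added or not pool:
            break
    return centroids(labeled, labels), pool


model, undecided = self_train(
    labeled=[(0.0, 0.0), (10.0, 10.0)],
    labels=["healthy", "degraded"],
    unlabeled=[(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (5.2, 5.0)],
)
print(model, undecided)
```

Points that never reach the confidence margin stay in the undecided pool rather than being forced into a class, which limits the risk of reinforcing early mistakes.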
In summary, the dissertation tackles important problems of anomaly detection and health status analysis in complex core router systems. The results emerging from this dissertation provide the first comprehensive set of data-driven resiliency solutions for core router systems. It is anticipated that other high-performance computing systems will also benefit from this framework.
Core router systems
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations