Simplifying System Management Through Automated Forecasting, Diagnosis, and Configuration Tuning
Large-scale networked computing systems are widely deployed to run business-critical applications in environments where changes are frequent. Manual management of these complex systems can be tedious and error-prone. Meanwhile, the high costs of application downtime make it critical to ensure system availability and reliability. Recent progress in monitoring tools enables system administrators to collect fine-grained data about system activity with low overhead. This data provides valuable information for system management. However, the monitoring data collected from production systems is massive in size and noisy; which makes it hard for system administrators to fully utilize this data for effective system management.
This dissertation describes a data-management platform, called Fa, where system administrators can pose declarative queries over system monitoring data. Fa automatically finds fairly accurate and efficient execution plans for given queries, and returns query results in easy-to-interpret formats. Fa supports three key query types, namely, forecasting queries (for predicting or detecting performance problems), diagnosis queries (for finding the cause of performance problems), and tuning queries (for recommending changes to system configuration to resolve diagnosed problems):
(a) For processing diagnosis queries, Fa constructs problem signatures from system monitoring data to identify recurrent problems and to reuse past diagnostic information. For a rare or new problem, Fa employs an anomaly-based clustering technique to generate performance baselines and to characterize the deviation from baselines to pinpoint root causes. Fa also incorporates an active-learning component that identifies diagnosis queries whose results, if provided or confirmed by system administrators, can be used to update problem signatures and to improve the accuracy and efficiency for processing future queries.
(b) For processing tuning queries to resolve problems caused by system misconfiguration, Fa employs an adaptive sampling algorithm that plans experiments to efficiently identify high-impact configuration parameters and high-performance settings. These experiments bring in information---required for generating accurate query results---that is missing in the monitoring data collected so far.
(c) For both one-time and continuous forecasting queries, Fa automatically searches for efficient execution plans in a large space of plans composed of data-transformation operators as well as synopsis-learning and prediction operators. Forecasting queries can be composed with diagnosis and tuning queries to enable proactive system management that avoids potential problems.
We have evaluated the Fa platform with monitoring data collected from database-backed multitier services, and with synthetic data that models the noisy nature of monitoring data from production systems. Our evaluation shows that Fa's query plan selection and execution strategies provide actionable information for system management automatically, accurately, and efficiently. Critical features like reliable confidence estimates, robustness to noise, and providing supporting evidence for query results make Fa a practical and useful platform.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations