dc.description.abstract |
<p>Over the next decade, it is estimated that the number of servers (virtual and physical)
in enterprise datacenters will grow by a factor of 10, the amount of data managed
by these datacenters will grow by a factor of 50, and the number of files the datacenter
has to deal with will grow by a factor of 75. Meanwhile, the number of skilled information
technology (IT) staff available to manage these servers and data will grow by a factor
of less than 1.5. Thus, a system administrator will face the challenging task of managing
larger and larger numbers of production systems. We have developed solutions to make
the system administrator more productive by automating some of the hard and time-consuming
tasks in system management. In particular, we make new contributions in the Monitoring,
Problem Diagnosing, and Testing phases of the system management cycle.</p><p>We start
by describing our contributions in the Monitoring phase. We have developed a tool
called Amulet that can continuously monitor and proactively detect problems on production
systems. A notoriously hard problem that Amulet can detect is data corruption,
where bits of data in persistent storage differ from their true values. Once a problem
is detected, our DiaDS tool helps in diagnosing the cause of the problem. DiaDS uses
a novel combination of machine learning techniques and domain knowledge encoded in
a symptoms database to guide the system administrator towards the root cause of the
problem.</p><p>Before applying any change (e.g., changing a configuration parameter
setting) to the production system, the system administrator needs to thoroughly understand
the effect that this change can have. Well-meaning changes to production systems have
led to performance or availability problems in the past. For the Testing phase, our Flex
tool enables administrators to evaluate a change hypothetically, in a manner that
is fairly accurate and avoids overhead on the production system. We have conducted
a comprehensive evaluation of Amulet, DiaDS, and Flex in terms of effectiveness, efficiency,
integration of these contributions in the system management cycle, and how these tools
bring data-intensive computing systems closer to the goal of self-managing systems.</p>
|
|