dc.description.abstract |
<p>Today, large-scale scientific applications are both data driven and distributed.
To support the scale and inherent distribution of these applications, significant
heterogeneous and geographically distributed resources are required over long periods
of time to ensure adequate performance. Furthermore, the behavior of these applications
depends on a large number of factors related to the application, the system software,
the underlying hardware, and other running applications, as well as potential interactions
among these factors.</p>
<p>Most Grid application users are primarily concerned with obtaining the result
of the application as quickly as possible, without worrying about the details involved
in monitoring and understanding factors affecting application performance. In this
work, we aim to provide the application users with a simple and intuitive performance
evaluation mechanism during the execution of their long-running Grid applications
or workflows. Our performance evaluation mechanism provides a qualitative and periodic
assessment of the application's behavior by informing the user whether the application's
performance is expected or unexpected. Furthermore, it can help improve overall application
performance by informing and guiding fault-tolerance services when the application
exhibits persistent unexpected performance behaviors.</p>
<p>This thesis tests two hypotheses about qualitatively assessing application
behavioral states in long-running scientific Grid applications: (1) it is necessary
to extract temporal information from performance time series data, and (2) it is
sufficient to extract variance and pattern as specific examples of temporal information.
Evidence supporting these hypotheses can lead to the ability to qualitatively assess
the overall behavior of the application and, if needed, to offer the most likely diagnosis
of the underlying problem.</p>
<p>To test the stated hypotheses, we develop and evaluate a general <em>qualitative
performance analysis</em> framework that incorporates (a) techniques from time series
analysis and machine learning that extract structural and temporal features associated
with application performance from the data and learn from them, in order to reach
a qualitative interpretation of the application's behavior, and (b) mechanisms and
policies to reason over time and across the distributed resource space about the
behavior of the application.</p>
<p>Experiments with two scientific applications from meteorology and astronomy compare
signatures generated from instantaneous values of performance data with those generated
from temporal characteristics. The results support the first hypothesis: temporal
information must be extracted from performance time series data in order to interpret
the behavior of these applications accurately. Furthermore, temporal signatures
incorporating variance and pattern information have distinct characteristics during
well-performing versus poorly performing executions. This allows the framework to
classify instances of similar behavior accurately, which is supporting evidence for
the second hypothesis. The proposed framework's ability to generate a qualitative
assessment of performance behavior for scientific applications using the temporal
information present in performance time series data represents a step towards
simplifying and improving the quality of service for Grid applications.</p>
|
|