Abstract
The objective of this project is to develop
failure prediction models based on failure logs for large-scale
parallel systems such as IBM BlueGene/L.
Due to the increased complexity of both hardware and software,
today's large-scale clusters (with tens of thousands of processors)
experience frequent failures, which are impossible to avoid and can
significantly degrade system performance. Hence, the ability to
predict failures is key to graceful operation under
faulty conditions in such systems. Toward this goal, this research
expects to make the following contributions. The first set of
contributions will be the collection and analysis of system events and
failure data from an actual BlueGene/L system (today's fastest
supercomputer with 128,000 processors) over an extended period of
time. The second set of contributions will be models for online
analysis and prediction of evolving failure data by exploiting
correlations between system events over time, across the nodes, and
with respect to external factors such as imposed workload and
operating temperature. The third set of contributions will be
a suite of prediction-assisted runtime management strategies that
hide the impact of failures from applications.
This is a collaborative project between Rutgers University, Penn
State University, IBM T. J. Watson Research Center, Lawrence
Livermore National Laboratory, and IBM Almaden Research Center, with
a major portion of the research conducted at Rutgers. To date, we
have developed a set of filtering tools to compress voluminous
failure logs, analysis tools to extract characteristics from
filtered logs, and prediction models that can forecast failure
occurrence with reasonable precision (yielding an F-measure of
70%). We have also designed a set of runtime fault-tolerant
algorithms that take advantage of failure predictions. Our next
step is to refine the prediction models and port them onto an
actual platform.
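
For readers unfamiliar with the F-measure quoted above, the following is a minimal sketch of how it is commonly computed, assuming the standard balanced definition (the harmonic mean of precision and recall). The function name and the counts in the usage line are hypothetical illustrations, not results or code from this project.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: a "true positive" is a failure that was predicted and did occur,
    a "false positive" is a predicted failure that did not occur, and a
    "false negative" is a failure that occurred without being predicted.
    """
    predicted = true_positives + false_positives   # all failures the model predicted
    actual = true_positives + false_negatives      # all failures that really happened
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the arithmetic:
# 70 correct predictions, 30 false alarms, 30 missed failures.
example_score = f_measure(70, 30, 30)
print(f"F-measure: {example_score:.2f}")
```

With these illustrative counts, precision and recall are both 0.7, so the F-measure is also about 0.70. In general the score is pulled toward the lower of the two values, which is why it is a stricter summary than either number alone.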
Publications