PROGNOSIS

The objective of this project is to develop failure prediction models based on failure logs for large-scale parallel systems such as IBM BlueGene/L.

Due to the increased complexity of both hardware and software, today's large-scale clusters (with tens of thousands of processors) experience frequent failures, which are impossible to avoid and can significantly degrade system performance. Hence, the ability to predict failure occurrence is the key to graceful operation under faulty conditions for such systems. Towards this goal, this research expects to make three sets of contributions. The first is the collection and analysis of system events and failure data from an actual BlueGene/L system (today's fastest supercomputer with 128,000 processors) over an extended period of time. The second is a set of models for online analysis and prediction of evolving failure data that exploit correlations between system events over time, across nodes, and with respect to external factors such as imposed workload and operating temperature. The third is a suite of prediction-assisted runtime management strategies that hide the impact of failures from applications.

This is a collaborative project between Rutgers University, Penn State University, IBM T. J. Watson Research Center, Lawrence Livermore National Laboratory, and IBM Almaden Research Center, with a major portion of the research conducted at Rutgers. So far, we have developed a set of filtering tools to compress voluminous failure logs, analysis tools to extract characteristics from the filtered logs, and prediction models that can forecast failure occurrence with reasonable accuracy (an F-measure of 70%). We have also designed a set of runtime fault-tolerance algorithms that take advantage of failure predictions. Our next step is to refine the prediction models and port them onto an actual platform.
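For readers unfamiliar with the metric, the F-measure quoted above is commonly defined as the harmonic mean of precision and recall. The short Python sketch below shows that standard definition only; it is not the project's evaluation code. The function names and the counts in the usage note are hypothetical and chosen purely for illustration.

```python
from typing import Tuple


def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0  # no correct predictions at all
    return 2 * precision * recall / (precision + recall)


def scores_from_counts(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, F-measure) from prediction counts.

    tp: failures that were predicted and did occur
    fp: predicted failures that did not occur (false alarms)
    fn: failures that occurred but were not predicted (misses)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_measure(precision, recall)
```

As an illustration with made-up counts, scores_from_counts(70, 30, 30) gives a precision and a recall of 0.7 each, and therefore an F-measure of about 0.7. Because the harmonic mean is pulled towards the smaller of the two values, a predictor cannot reach a high F-measure by raising many false alarms or by missing many failures.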

Enhancing Failure Analysis for IBM BlueGene/L: A Filtering Perspective
Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo
Submitted to IEEE Transactions on Dependable and Secure Computing.

Prediction in IBM BlueGene/L Event Logs
Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo
To appear in the Proceedings of the 2007 IEEE Conference on Data Mining (ICDM), 2007.

An Adaptive Semantic Filter for BlueGene/L Failure Log Analysis
Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo
Proceedings of the Third International Workshop on System Management Techniques, Processes, and Services (SMTPS), 2007. [.pdf]

BlueGene/L Failure Analysis and Prediction Models
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Moe Jette, and Ramendra Sahoo
Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN '06), pp. 425-434, 2006. [.pdf]

Filtering Failure Logs for a BlueGene/L Prototype
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Ramendra Sahoo, Jose Moreira, and Manish Gupta
Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 476-485, 2005. [.pdf]

Performance Implications of Failures in Large-Scale Cluster Scheduling
Yanyong Zhang, Mark Squillante, Anand Sivasubramaniam, and Ramendra Sahoo
Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, New York, NY, June 2004. [.pdf]