The 1st Workshop on System Management Tools for Large-Scale Parallel Systems

Workshop organizers:

Fabrizio Petrini, LANL
Ramendra Sahoo, IBM Research
Yanyong Zhang, Rutgers

Program Committee:

Ricardo Bianchini, Rutgers
Henri Casanova, UCSD
Dror Feitelson, Hebrew University
Rahul Garg, IBM India
Jose E. Moreira, IBM Research
Manish Parashar, Rutgers
Kyung Ryu, IBM Research
Anand Sivasubramaniam, Penn State
Rajeev Thakur, Argonne
Jeff Vetter, Oak Ridge National Lab.
Andy Yoo, LLNL
Xiaodong Zhang, William & Mary

Technical Program:

Session 1: Keynote Address

"Challenges in Programming and Managing Highly Parallel Systems"
Manish Gupta, IBM Research

Session 2: Network System Management

"Fast Scalable File Distribution Over Infiniband"
Dennis Dalessandro, Pete Wyckoff (Ohio Supercomputer Center)

"Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM"
Abhinav Vishnu, Amith R Mamidala, Hyun Wook Jin, Dhabaleswar K. Panda (Ohio State)

Session 3: Modeling and Scalability

"A Framework Focus on Configuration Modeling and Integration with Transparent Persistence"
Ivan Diaz, Juan Tourino, Jesus Salceda, Ramon Doallo (University of A Coruna)

"A Fixed-Structure Learning Automaton Solution to the Stochastic Mapping Problem"
Geir Horn (SIMULA Research Laboratory), B. John Oommen (Carleton University)

"On the Scalability of Centralized Control"
Dror G. Feitelson (The Hebrew University of Jerusalem)

Session 4: MPI and System Management

"Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters"
Juan Fernandez, Fabrizio Petrini, Eitan Frachtenberg (LANL)

"MPISH: A Parallel Shell for MPI Programs"
Narayan Desai, Andrew Lusk, Rick Bradshaw, Ewing Lusk (Argonne National Lab.)

Session 5: Fault Tolerance

"Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems"
Adam J. Oliner (MIT), Ramendra K. Sahoo, Jose E. Moreira, Manish Gupta (IBM Research)

"Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance"
Jose Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, Song Jiang (LANL)

"Destructive Transaction: Human-Oriented Cluster System Management Mechanism"
Taoying Liu, Zhiwei Xu (Chinese Academy of Sciences)

Session 6: Industrial Presentations

"Installing the XT3: System Management Challenges in Being Bigger, Faster, Better"
Laura McGinnis, J. Ray Scott, Katie Vargo (Pittsburgh Supercomputing Center)