Characterization of Computation and Communication Imbalance in Parallel Applications
Root cause analysis: Load and communication imbalance prevent many codes from taking advantage of the increasing parallelism available in modern large-scale supercomputers. When applications employ complex point-to-point communication patterns, delays on single processes may spread across the entire machine through far-reaching cause-effect chains that are hard to trace back. Extending the Scalasca automatic performance analysis tool, our root cause analysis detects wait states and attributes their cost, in terms of wasted resources, back to their original cause. By replaying prerecorded application event traces in parallel, our approach identifies the processes and call paths responsible for the most severe load imbalances, even for very large numbers of processes. After evaluating different implementation strategies, we implemented a backward-replay approach that offers optimal scalability and low resource consumption. Starting at the endmost wait states, the backward replay propagates wait-state costs backwards through the communication chain to their original root causes. The effectiveness of the root cause analysis was evaluated with different MPI codes on configurations with up to 65,536 processes on the Blue Gene/P supercomputer at Jülich Supercomputing Centre, showing excellent scalability and greatly enhanced insight into wait-state formation in parallel applications.