|Schedule:||View OpenMPCon 2015 Program|
|Abstract:||OpenMP directives remain one of the most popular ways to express shared-memory multi-threaded parallelism. Fork-join work-sharing constructs are simple to use and permit the incremental introduction of data parallelism. As a result, the OpenMP language is widely used in application domains such as scientific computing, computer-aided design and engineering. However, the easy introduction of parallelism does not mean that the user can ignore parallel performance inefficiencies such as load imbalance or work-scheduling overhead.
This talk presents our work in Intel® VTune™ Amplifier XE to support OpenMP performance analysis that reports results in terms of the OpenMP constructs the programmer works with, rather than offering general tuning paradigms that can confuse OpenMP programmers instead of helping them understand the real problems in their code. When a performance tool shows performance data by function or loop in a bottom-up or top-down style, the user can lose the parallel region or parallel loop context of the pragmas they inserted. Without that context, understanding inefficiencies such as imbalance or the overhead of organizing and scheduling parallel work requires complicated additional analysis.
One of the reasons why OpenMP has significant market share is its simple and incremental approach to the introduction of parallelism. Experience shows us that customers have the same expectations of performance tools: “show me how much of the wall time of my application is parallelized, how well it is parallelized, what constructs are inefficient and what I can do to fix them quickly”.
The Intel® VTune™ Amplifier XE performance analysis tool and Intel® OpenMP Runtime teams have been working to answer the customer questions above in terms of OpenMP constructs. The combination of notifications from the OpenMP library and sampling data from the VTune Amplifier XE collectors makes it possible to calculate and show the parallel and serial (outside of any region) time of an application, the difference between the measured and ideal execution of a parallel region, the reasons for that difference, statistics broken down by work-sharing construct inside a parallel region, and per-region CPU utilization histograms. Based on this information, a user can more easily understand where to invest their tuning effort, knowing how much gain in application wall time they can theoretically achieve by fixing each problem. The OpenMP analysis of a compute node in Intel® VTune™ Amplifier XE also scales to the hybrid MPI + OpenMP programming model, with per-rank metrics available.
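To make the metrics named above more concrete, here is a minimal Python sketch of how serial time, the measured-versus-ideal gap of a parallel region, and the share of that gap caused by imbalance might be derived from per-region timing data. It is not how VTune Amplifier XE computes its metrics: the `RegionSample` record, the function names, and the definition of "ideal" as perfectly even division of the measured useful work are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionSample:
    """Hypothetical record for one parallel region (not a VTune data type)."""
    name: str
    elapsed: float            # measured wall time of the region, in seconds
    thread_busy: List[float]  # useful work time per thread, in seconds


def ideal_time(region: RegionSample) -> float:
    """Assumed ideal: the same useful work spread evenly over all threads."""
    return sum(region.thread_busy) / len(region.thread_busy)


def potential_gain(region: RegionSample) -> float:
    """Measured minus ideal wall time: the upper bound on what tuning recovers."""
    return region.elapsed - ideal_time(region)


def imbalance_time(region: RegionSample) -> float:
    """Average time a thread waits at the closing barrier for the slowest one."""
    return max(region.thread_busy) - ideal_time(region)


def other_overhead(region: RegionSample) -> float:
    """Part of the gap not explained by imbalance (scheduling, fork-join, ...)."""
    return potential_gain(region) - imbalance_time(region)


def serial_time(total_wall: float, regions: List[RegionSample]) -> float:
    """Wall time spent outside of any parallel region."""
    return total_wall - sum(r.elapsed for r in regions)


if __name__ == "__main__":
    # Made-up numbers for a 10-second run on 4 threads.
    regions = [
        RegionSample("loop_a", 4.0, [4.0, 2.0, 1.0, 1.0]),  # badly balanced
        RegionSample("loop_b", 3.0, [2.5, 2.5, 2.5, 2.5]),  # balanced, some overhead
    ]
    print(f"serial time: {serial_time(10.0, regions):.2f} s")
    for r in regions:
        print(f"{r.name}: potential gain {potential_gain(r):.2f} s, "
              f"imbalance {imbalance_time(r):.2f} s, "
              f"other overhead {other_overhead(r):.2f} s")
```

Sorting regions by `potential_gain` gives the kind of ranking the abstract describes: the regions where fixing the problem could recover the most wall time come first.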