|Schedule:||View OpenMPCon 2015 Program|
|Other Authors:||Hideki Saito, Serguei V. Preis, Sergey S. Kozhukhov, Intel Corp.|
|Abstract:||The relentless pace of Moore’s Law will lead to modern multi-core processors, coprocessors and GPU designs with extensive on-die integration of SIMD execution units on CPU and GPU cores to achieve better performance and power efficiency. To make efficient use of the underlying SIMD hardware, utilizing its wide vector registers and SIMD instructions such as Xeon Phi™, SIMD vectorization plays a key role of converting plain scalar C/C++/Fortran code into SIMD code that operating on vectors of data each holding one or more elements.Intel® Xeon processors and Xeon Phi™ coprocessors combine abundant thread parallelism with SIMD vector units.
Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel® Xeon and Xeon Phi™.In this paper, we present Intel® compiler framework that supports OpenMP4.0/4.1 SIMD extensions, and also present a set of key vectorization techniques such as function vectorization, masking support, uniformity and linearity propagation, alignment optimization, gather/scatter optimization, remainder and peeling loop vectorization that are implemented inside the Intel® C/C++ and Fortran product compilers for Intel® Xeon processors and Xeon Phi™ coprocessors.
A set of workloads from several application domains is employed to conduct the performance study of our SIMD vectorization techniques. The performance results show that we achieved up to 3x to ~12x performance gain on the Intel® Xeon processors and Xeon Phi™ coprocessors that illustrate how the power of compiler can be harnessed with minimum programmer efforts to enable effective SIMD parallelism. We also demonstrate a speedup ranging from ~100x to ~2000x with the seamless integration of SIMD vectorization and parallelization.