Gromacs 4.5.3: ICC vs. GCC

An earlier post reported significant performance improvements for Gromacs by switching from GCC to Intel's C++ Compiler (ICC). This generated a number of emails and here we update the results using newer codes.

The hardware being used this time is an Intel Core i7-980x (standard clocking) and 12 GiB of 1333 MHz RAM. The software versions are Gromacs 4.5.3, GCC 4.4.5, ICC 11.1, ICC 12.0.0, FFTW 3.2.2, and Fedora 13.

For each configuration, Gromacs executes for one hour (-maxh 1) and the reported hr./ns. is recorded. The values shown here are the average of three runs. This seems fair given that the duration is substantial and that the amount of work performed increases for parallel jobs. Unlike the earlier comparison, a larger protein was used, requiring more work.
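The averaging step can be sketched with awk; the three hr./ns. values below are placeholders to show the arithmetic, not the measured data:

```shell
# Average three per-run hr./ns. readings (placeholder values, not the
# measured numbers) into the single figure reported in the table below.
printf '11.68\n11.70\n11.72\n' | awk '{ s += $1 } END { printf "%.2f\n", s / NR }'
```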

The table below shows the raw results. The default compile flags were used in all cases except those marked with ex. In those cases, the more "extreme" options "-O3 -xHOST -ipo -no-prec-div" were used. Where mkl is indicated, "-mkl=sequential" was added. FFTW was provided by Fedora 13.
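As a concrete illustration, an ICC "extreme" build might be configured along these lines. Gromacs 4.5.x supports CMake (consistent with the generated link.txt files mentioned later), but the exact cache variable names here are assumptions and should be checked against the actual build tree:

```shell
# Hedged sketch of an ICC "extreme" build; variable names such as
# GMX_FFT_LIBRARY should be verified against the Gromacs 4.5.3 CMake cache.
CC=icc CXX=icpc cmake ../gromacs-4.5.3 \
    -DCMAKE_C_FLAGS="-O3 -xHOST -ipo -no-prec-div" \
    -DGMX_FFT_LIBRARY=fftw3        # or: -DGMX_FFT_LIBRARY=mkl
make && make install
```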

Compiler config     hr./ns.   %
gcc-4.4.5-fftw      11.70     100.0%
icc-11.1-ex-fftw    10.98     106.6%
icc-11.1-ex-mkl     11.31     103.4%
icc-11.1-mkl        11.33     103.3%
icc-12-ex-fftw      11.12     105.3%
icc-12-ex-mkl       11.17     104.8%
icc-12-mkl          11.04     106.0%
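The % column is the baseline's hr./ns. divided by each configuration's hr./ns. (fewer hours per nanosecond is faster). For the best case:

```shell
# Relative performance of icc-11.1-ex-fftw vs. the gcc baseline:
# 11.70 hr./ns. (gcc) over 10.98 hr./ns. (icc), as a percentage.
awk 'BEGIN { printf "%.1f%%\n", 100 * 11.70 / 10.98 }'
```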

In some ways the results are a little disappointing. ICC only improves performance by a little more than 6% in the best case. FFTW appears to be generally faster than MKL. The effort required to use "-ipo" hardly seems worth it and may actually degrade performance slightly. (To use -ipo, in addition to adding the compiler flag, the link.txt files must be updated after they are generated to use xiar instead of ar.) However, it is impressive that the open-source GCC and FFTW are so close in performance to the proprietary ICC.
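The link.txt fix-up can be scripted. The path and archiver location below are illustrative assumptions (the first two lines only create a stand-in file so the substitution can be demonstrated); adapt the sed pattern to what the generated files actually contain:

```shell
# Create a stand-in for a CMake-generated link.txt (illustrative path),
# then switch the archiver from ar to Intel's xiar for -ipo builds.
mkdir -p build/CMakeFiles/libgmx.dir
printf '/usr/bin/ar cq libgmx.a a.o b.o\n' > build/CMakeFiles/libgmx.dir/link.txt

# Rewrite every link.txt under the build tree to use xiar.
find build -name link.txt -exec sed -i 's|/usr/bin/ar |/usr/bin/xiar |' {} +
cat build/CMakeFiles/libgmx.dir/link.txt
```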

The scaling results, arbitrarily for the icc-12 mdrun binary, are shown next. For parallel execution the -nt option was used instead of MPI. As expected, scaling efficiency takes a good drop past six threads. Little if any benefit from HyperThreading is observed which is disappointing. It would be interesting to repeat this and measure the contributions of TurboBoost and HyperThreading.
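The thread sweep can be expressed as a simple loop over -nt. The input file name and the particular thread counts here are assumptions, and this sketch is not runnable without a prepared .tpr file:

```shell
# Sketch of the -nt scaling sweep; protein.tpr and the thread counts
# are placeholders, and each run is capped at one hour as before.
for nt in 1 2 4 6 8 12; do
    mdrun -nt ${nt} -s protein.tpr -deffnm nt${nt} -maxh 1
done
```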


Why did the previous results make ICC look so much better? Several parameters, including the software versions, have changed, but a reader pointed out that there is a chance I did not use FFTW. If it was not part of the standard CentOS installation at the time, there is a good chance this is the cause. If true, part of the previous test demonstrated ICC's superior ability to optimize the FFT routines built into Gromacs, in addition to comparing it to MKL, but FFTW should also have been included.

It would be interesting to repeat this benchmarking on Sandy Bridge. Have the MKL routines used by Gromacs already been optimized for AVX? If so, how long before the same optimizations are available in FFTW? While MKL provided little benefit in today's scenario with the i7-980x, does MKL have an early advantage when it comes to new architectures?