Three machines are used for these tests: Machine A, a dual Intel Xeon x5355 (Clovertown); Machine B, a single x5570 (Nehalem-EP); and Machine C, a dual x5670 (Westmere-EP). Hyper-Threading (HT) is enabled on both the x5570 and the x5670. Sequential execution on the x5570 is 1.40x faster than on the x5355. Surprisingly, the x5670 is roughly 10% slower than the x5570 for one, two, and four processes. It would be interesting to see whether experimenting with core affinity could explain this. Each reported time is the average of three runs, which should hopefully minimize any effect of a process's initial core assignment.
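The affinity experiment suggested above could be set up along the following lines. This is only a minimal sketch, assuming a Linux host: `timed_run` and `mean_time` are hypothetical helper names, and the command passed in is whatever benchmark invocation is being measured (no Gromacs command line is assumed here). Pinning uses `os.sched_setaffinity`, which Python exposes only on some platforms, so the sketch falls back to an unpinned run elsewhere.

```python
import os
import statistics
import subprocess
import sys
import time


def timed_run(cmd, cores=None):
    """Run cmd once and return its wall-clock time in seconds.

    If cores (a set of core ids) is given and the platform supports it,
    the child process is pinned to those cores before it starts.
    """
    def pin():
        # Executed in the child just before exec; Linux-only call.
        os.sched_setaffinity(0, cores)

    can_pin = cores is not None and hasattr(os, "sched_setaffinity")
    start = time.perf_counter()
    subprocess.run(cmd, check=True, preexec_fn=pin if can_pin else None)
    return time.perf_counter() - start


def mean_time(cmd, cores=None, runs=3):
    """Average wall-clock time over several runs (three, as in the text)."""
    return statistics.mean(timed_run(cmd, cores) for _ in range(runs))
```

Comparing `mean_time(cmd, cores={0})` against `mean_time(cmd, cores={1})`, and against an unpinned `mean_time(cmd)`, would give a first indication of whether initial core placement matters. The core ids are placeholders and must be among those the process is allowed to use.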
The choice of compiler and compilation options makes a significant difference with Gromacs. The results in the following table were collected on an x5570. Taking GCC with default options as the baseline, simply switching to the Intel C++ Compiler (ICC) reduces the sequential execution time by about 19%. Using ICC with Intel's Math Kernel Library (MKL) reduces the sequential runtime by nearly 42%. These improvements cost no development effort: they come from simply switching the compiler and math library. A few other options were also tried but resulted in little further gain.
Compiler and configuration | Time (% of gcc baseline) |
gcc (GCC 4.4.2) | 100.0 |
icc (Intel C++ Compiler 11.1) | 80.9 |
icc+mkl (Intel Math Kernel Library) | 58.3 |
icc+mkl+lapack+blas | 58.3 |
icc+mkl+lapack+blas+no-soft-sqrt | 58.1 |
icc+mkl+lapack+blas+no-soft-sqrt+prefetch-forces | 57.9 |
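The percentages quoted in the text follow directly from the table. The short sketch below redoes that arithmetic; the numbers are copied from the table, while the function names `reduction_percent` and `speedup_factor` are illustrative only.

```python
# Relative sequential times from the table (gcc = 100.0).
relative_time = {
    "gcc": 100.0,
    "icc": 80.9,
    "icc+mkl": 58.3,
    "icc+mkl+lapack+blas": 58.3,
    "icc+mkl+lapack+blas+no-soft-sqrt": 58.1,
    "icc+mkl+lapack+blas+no-soft-sqrt+prefetch-forces": 57.9,
}


def reduction_percent(time_pct, baseline_pct=100.0):
    """Percentage of runtime saved relative to the baseline."""
    return (baseline_pct - time_pct) / baseline_pct * 100.0


def speedup_factor(time_pct, baseline_pct=100.0):
    """How many times faster than the baseline a configuration runs."""
    return baseline_pct / time_pct


for name, t in relative_time.items():
    print(f"{name:50s} {reduction_percent(t):5.1f}% less time, "
          f"{speedup_factor(t):.2f}x vs gcc")
```

Viewed this way, the options beyond icc+mkl change the relative time by less than half a percentage point, which is why they are described as yielding little gain.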
Next we consider parallel execution. The performance and efficiency results are summarized in the following three graphs, with time reported as a percentage of the execution time of a single process on the x5355. Efficiency remains high on the x5355 despite the slower execution. As expected, efficiency drops once the process count exceeds the physical core count, since processes then compete for the same hardware resources, but performance still improves slightly, which justifies HT. The slope of the efficiency curve changes from one to two processes on the x5570 and x5670, perhaps as a consequence of Turbo Boost, with the processor lowering its frequency as the process count increases. The poor scaling on the x5670 is somewhat disappointing. Scaling starts to degrade at six processes (recall that this is a dual six-core machine). Beyond six processes little performance is gained; efficiency drops sharply at 12 processes, which is surprising since there are 12 physical cores; and performance degrades at 24 processes. More tests are required: this could be a consequence of Gromacs and the simulation, of the dual processors, or of the six-core design, perhaps a cache penalty caused by the two additional cores.
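For readers who want to reproduce the efficiency curves from their own timings, the sketch below uses the usual definitions: speedup is T(1)/T(n), and efficiency is speedup divided by the process count n. It assumes efficiency is computed against each machine's own single-process time. The timings in the example are hypothetical placeholders, not measurements from these tests.

```python
def speedup(t1, tn):
    """Speedup of an n-process run over the single-process run."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of processes."""
    return speedup(t1, tn) / n


# Hypothetical timings (process count -> time in arbitrary units).
# These are placeholders for illustration, NOT data from the graphs.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 25.0}

t1 = timings[1]
for n, tn in sorted(timings.items()):
    print(f"{n:2d} processes: speedup {speedup(t1, tn):5.2f}, "
          f"efficiency {efficiency(t1, tn, n):4.2f}")
```

With placeholder numbers like these, the pattern described above shows up as a speedup that keeps rising slightly while efficiency falls once the process count passes the number of physical cores.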