Last updated: Apr 6th 2009
We ran the WRF Conus Benchmark using different compile-time and runtime options for bluefire.
Conus 12km (low resolution)
WRF V3.0.1.1 was compiled using xlf V11.1, with the default WRF options for AIX (including dmpar, i.e. MPI-only) plus one of the following sets of flags:
-qarch=auto -qtune=auto -qcache=auto
-qarch=auto -qtune=auto -qcache=auto plus -qhot
Note that the WRF developers do not recommend -qhot for
WRF V3.0.1.1 because of reported problems with model results under certain
configurations. The README says "Use at your own risk", but we thought it was
worth investigating its possible performance benefits.
We investigated the following runtime options: processor binding, largepages, and SMT.
Processor binding is shown with a letter b in the legend,
largepages with an l, and SMT with an s. To
keep the legend aligned, we show a letter x when the relevant
option is not in use.
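The legend convention above can be read mechanically. The following is a minimal sketch, assuming each legend entry is a three-character code whose positions are, in order, binding, largepages, and SMT. That ordering and the function name decode_legend are assumptions made for illustration, not something stated on this page.

```python
# Assumed layout: one character per option, in this fixed order.
# The order (binding, largepages, SMT) is a guess for illustration.
LEGEND_POSITIONS = (
    ("b", "processor binding"),
    ("l", "largepages"),
    ("s", "SMT"),
)


def decode_legend(code):
    """Map a legend code such as 'bxs' to {option name: enabled?}."""
    if len(code) != len(LEGEND_POSITIONS):
        raise ValueError("expected %d characters, got %r" % (len(LEGEND_POSITIONS), code))
    options = {}
    for char, (letter, name) in zip(code, LEGEND_POSITIONS):
        if char == letter:
            options[name] = True      # option in use
        elif char == "x":
            options[name] = False     # placeholder: option not in use
        else:
            raise ValueError("unexpected character %r for %s" % (char, name))
    return options


if __name__ == "__main__":
    for code in ("bls", "bxs", "xxx"):
        print(code, decode_legend(code))
```

Under this assumed layout, a code such as bxs would mean binding and SMT in use, without largepages.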
The results for the high resolution case are below. It is clear why processor binding is mandatory: the performance increase is impressive.
Largepages also help performance a little, but the
careful choice of compile-time options and the use of SMT matter more. When it is possible,
-qhot may help a lot (the caveat is that -qhot
might alter the program semantics, and thus produce incorrect results).
The results for the low resolution case are below. They confirm what is shown for the high resolution case, but the differences among the configurations are smaller.
The differences are smaller because the problem is smaller: each node has less data to crunch and proportionally more data to transfer, so this case is more communication-bound than the previous one. For the same reason, it also scales less well than the high resolution case.
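The communication-bound argument can be illustrated with a rough surface-to-volume estimate. The sketch below assumes a near-square 2D domain decomposition and a halo one grid point wide, and it ignores corner cells and vertical levels. The grid sizes are hypothetical round numbers chosen for illustration, not the actual benchmark dimensions, and halo_to_interior_ratio is an illustrative helper, not part of WRF.

```python
import math


def halo_to_interior_ratio(nx, ny, tasks, halo=1):
    """Approximate halo points exchanged per interior point computed, per MPI task.

    Assumes a 2D decomposition as close to square as the task count allows.
    """
    # Largest factor of `tasks` that does not exceed sqrt(tasks).
    px = int(math.sqrt(tasks))
    while tasks % px != 0:
        px -= 1
    py = tasks // px

    tile_x = nx / px                  # tile width in grid points
    tile_y = ny / py                  # tile height in grid points
    interior = tile_x * tile_y        # points each task computes
    halo_points = 2 * halo * (tile_x + tile_y)  # points on the four tile edges
    return halo_points / interior


if __name__ == "__main__":
    tasks = 64
    # Hypothetical grids: the "high" one has 4x the points in each direction.
    low = halo_to_interior_ratio(400, 300, tasks)
    high = halo_to_interior_ratio(1600, 1200, tasks)
    print("low resolution  halo/interior: %.4f" % low)
    print("high resolution halo/interior: %.4f" % high)
```

With the same number of tasks, the smaller grid gives each task a smaller tile, so the ratio of exchanged halo points to computed interior points is higher. That is the sense in which the low resolution case is more communication-bound.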