Using Hardware Performance Monitor (HPM) Toolkit: A primer
last update: 01/07/2009

OpenMP threaded example

This example uses the same exponential sum loop as the simple serial example, but the sum is computed by four OpenMP threads. In the Fortran code below, each thread stores its thread number plus one in its private copy of the variable thdID, and the program calls the threaded libhpm subroutines F_HPMTSTART and F_HPMTSTOP instead of the F_HPMSTART and F_HPMSTOP calls used in the simple serial example.

The output of this run contains four instrumented sections instead of the single section of the simple serial code. The sections are numbered 1 through 4, matching the thdID value that each thread passes to F_HPMTSTART.
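
The following Python sketch illustrates why the run reports four sections with nearly identical counts. It assumes the worksharing loop hands each of the four threads one contiguous, equal-sized chunk of the 10,000,000 iterations; the OpenMP default schedule is implementation-defined, so this split is an assumption, not something the source states. The function names chunk_bounds and partial_sums are hypothetical, and Python computes in double precision, so the sums will not match the Fortran single-precision result exactly.

```python
import math

def chunk_bounds(n, nthreads):
    """Split iterations 1..n into contiguous chunks (assumed static schedule)."""
    base, extra = divmod(n, nthreads)
    bounds = []
    start = 1
    for t in range(nthreads):
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

def partial_sums(n, nthreads):
    """Return {thdID: partial sum}, with thdID = 1 + thread number as in it.F."""
    sums = {}
    for thread_num, (lo, hi) in enumerate(chunk_bounds(n, nthreads)):
        thd_id = 1 + thread_num
        sums[thd_id] = sum(math.exp(1.0e-8 * i) for i in range(lo, hi + 1))
    return sums

if __name__ == "__main__":
    n = 10_000_000  # loop trip count in it.F
    for thd_id, (lo, hi) in enumerate(chunk_bounds(n, 4), start=1):
        print(f"section {thd_id}: iterations {lo}..{hi} ({hi - lo + 1} iterations)")
```

Adding the four partial sums gives the same total as a single-threaded sum, up to floating-point rounding, which is what the reduction(+:sum) clause does in the Fortran code.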

Script for compiling the source code, creating the batch job, and running it:

#! /bin/csh

# Part1: put the OpenMP Fortran source code in file it.F
cat << 'EOF1' > it.F
      program main
      implicit none
#include "/usr/include/f_hpm.h"
      integer thdID, omp_get_thread_num
      integer i
      real sum
! Initialize hpmtoolkit:
      call f_hpminit(100,"main")
      sum=0.0
!$omp parallel private (thdID)
! Start instrumentation around compute loop:
      thdID = 1+omp_get_thread_num()
      call f_hpmtstart(thdID, "Partial Sum EXP")
!$omp do reduction(+:sum)
      do i=1,10000000
         sum=sum+exp(.00000001*i)
      end do
! Stop instrumentation after compute loop:
      call f_hpmtstop(thdID)
!$omp end parallel
! Generate hardware analysis output file:
      call f_hpmterminate(100)
      stop
      end
'EOF1'

# Part2: compile the it.F source code with hpm and pmapi libraries
xlf95_r -I/usr/include -qarch=auto -qsmp=omp -O3 -qstrict -oit it.F \
-L/usr/lib -lhpm_r -lpmapi -lm

# Part3: create the batch job lsf.ompjob
cat << 'EOF2' > lsf.ompjob
#!/bin/csh
#
# LSF script to run an OMP code
#
#BSUB -x                         # exclusive use of node
#BSUB -n 1                       # placeholder; see OMP_NUM_THREADS below
#BSUB -R "span[ptile=1]"         # run 1 task per host
#BSUB -o omplsf.%J.out           # output filename
#BSUB -e omplsf.%J.err           # error filename
#BSUB -J omplsf.test             # job name
#BSUB -P xxxxxxxx                # your valid 8-digit project number
#BSUB -W 0:10                    # hh:mm wall clock time 
#BSUB -q regular                 # queue

setenv OMP_NUM_THREADS 4
mpirun.lsf ./it
exit
'EOF2'

# Part4: submit lsf.ompjob to the batch queue specified by #BSUB -q
bsub < lsf.ompjob

# Part5: cleanup
rm -f it.F it lsf.ompjob
exit

Output on bluefire POWER6:


be1105en.ucar.edu:/blhome/valent/hpmtoolkit/omp>cat main_311962_0100.hpm 
 Total execution time of instrumented code (wall time): 0.152229 seconds

 ########  Resource Usage Statistics  ########  

 Total amount of time in user mode            : 0.599284 seconds
 Total amount of time in system mode          : 0.006747 seconds
 Maximum resident set size                    : 2340 Kbytes
 Average shared memory use in text segment    : 10 Kbytes*sec
 Average unshared memory use in data segment  : 1528 Kbytes*sec
 Number of page faults without I/O activity   : 571
 Number of page faults with I/O activity      : 26
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 25
 Number of involuntary context switches       : 3

 #######  End of Resource Statistics  ########

 Instrumented section: 1 - Label: Partial Sum EXP - process: 100
 file: it.F, lines: 13 <--> 19
  Count: 1
  Wall Clock Time: 0.151727 seconds
  Total time in user mode: 0.151429463435374 seconds

 Set: 1
 Counting duration: 0.151624695 seconds
  PM_FPU_1FLOP (FPU executed one flop instruction )          :        30000005
  PM_FPU_FMA (FPU executed multiply-add instruction)         :        27500000
  PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) :               0
  PM_CYC (Processor cycles)                                  :       712324196
  PM_RUN_INST_CMPL (Run instructions completed)              :       195545509
  PM_RUN_CYC (Run cycles)                                    :       713189616


  Utilization rate                                 :          99.804 %
  Flop                                             :          85.000 Mflop
  Flop rate (flops / WCT)                          :         560.217 Mflop/s
  Flops / user time                                :         561.317 Mflop/s
  FMA percentage                                   :          95.652 %


 Instrumented section: 2 - Label: Partial Sum EXP - process: 100
 file: it.F, lines: 13 <--> 19
  Count: 1
  Wall Clock Time: 0.151707 seconds
  Total time in user mode: 0.151499265943878 seconds

 Set: 1
 Counting duration: 0.151595207 seconds
  PM_FPU_1FLOP (FPU executed one flop instruction )          :        30000001
  PM_FPU_FMA (FPU executed multiply-add instruction)         :        27500000
  PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) :               0
  PM_CYC (Processor cycles)                                  :       712652547
  PM_RUN_INST_CMPL (Run instructions completed)              :       196413574
  PM_RUN_CYC (Run cycles)                                    :       713067052


  Utilization rate                                 :          99.863 %
  Flop                                             :          85.000 Mflop
  Flop rate (flops / WCT)                          :         560.291 Mflop/s
  Flops / user time                                :         561.059 Mflop/s
  FMA percentage                                   :          95.652 %


 Instrumented section: 3 - Label: Partial Sum EXP - process: 100
 file: it.F, lines: 13 <--> 19
  Count: 1
  Wall Clock Time: 0.151684 seconds
  Total time in user mode: 0.151189739158163 seconds

 Set: 1
 Counting duration: 0.151348445 seconds
  PM_FPU_1FLOP (FPU executed one flop instruction )          :        30000001
  PM_FPU_FMA (FPU executed multiply-add instruction)         :        27500000
  PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) :               0
  PM_CYC (Processor cycles)                                  :       711196533
  PM_RUN_INST_CMPL (Run instructions completed)              :       195208782
  PM_RUN_CYC (Run cycles)                                    :       711893378


  Utilization rate                                 :          99.674 %
  Flop                                             :          85.000 Mflop
  Flop rate (flops / WCT)                          :         560.376 Mflop/s
  Flops / user time                                :         562.207 Mflop/s
  FMA percentage                                   :          95.652 %


 Instrumented section: 4 - Label: Partial Sum EXP - process: 100
 file: it.F, lines: 13 <--> 19
  Count: 1
  Wall Clock Time: 0.151675 seconds
  Total time in user mode: 0.151419299319728 seconds

 Set: 1
 Counting duration: 0.151589205 seconds
  PM_FPU_1FLOP (FPU executed one flop instruction )          :        30000001
  PM_FPU_FMA (FPU executed multiply-add instruction)         :        27500000
  PM_FPU_FSQRT_FDIV (FPU executed FSQRT or FDIV instruction) :               0
  PM_CYC (Processor cycles)                                  :       712276384
  PM_RUN_INST_CMPL (Run instructions completed)              :       196614451
  PM_RUN_CYC (Run cycles)                                    :       713002604


  Utilization rate                                 :          99.831 %
  Flop                                             :          85.000 Mflop
  Flop rate (flops / WCT)                          :         560.409 Mflop/s
  Flops / user time                                :         561.355 Mflop/s
  FMA percentage                                   :          95.652 %
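
The derived lines in each section can be recomputed from the raw counters. The Python sketch below does this for the values printed above. The formulas are inferred from the printed numbers and are not taken from the HPM Toolkit documentation: a multiply-add is counted as two flops, and the FMA percentage appears to be the FMA flops divided by the sum of the two instruction counts. The function name derived_metrics is hypothetical.

```python
def derived_metrics(one_flop, fma, wall_s, user_s):
    """Recompute the derived lines of one instrumented section from its counters."""
    flop = one_flop + 2 * fma  # assumption: each multiply-add counts as two flops
    return {
        "utilization_pct": 100.0 * user_s / wall_s,
        "mflop": flop / 1.0e6,
        "mflops_wct": flop / 1.0e6 / wall_s,
        "mflops_user": flop / 1.0e6 / user_s,
        # Inferred from the printed values: FMA flops over (1FLOP + FMA) counts.
        "fma_pct": 100.0 * 2 * fma / (one_flop + fma),
    }

# (PM_FPU_1FLOP, PM_FPU_FMA, wall clock s, user s) for sections 1 to 4,
# copied from the output above.
sections = [
    (30000005, 27500000, 0.151727, 0.151429463435374),
    (30000001, 27500000, 0.151707, 0.151499265943878),
    (30000001, 27500000, 0.151684, 0.151189739158163),
    (30000001, 27500000, 0.151675, 0.151419299319728),
]

for number, counters in enumerate(sections, start=1):
    metrics = derived_metrics(*counters)
    print(f"section {number}: " +
          ", ".join(f"{name}={value:.3f}" for name, value in metrics.items()))

# Whole-program view: flops of all threads over the total instrumented wall time.
total_wall_s = 0.152229
total_user_s = 0.599284
aggregate_mflops = sum(derived_metrics(*c)["mflop"] for c in sections) / total_wall_s
busy_threads = total_user_s / total_wall_s
print(f"aggregate: {aggregate_mflops:.1f} Mflop/s, {busy_threads:.2f} threads busy on average")
```

Each thread sustains about 560 Mflop/s, so the four threads together deliver roughly four times that rate. The ratio of total user time to total wall time is close to 4, which indicates that all four threads were busy for nearly the whole run.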


© Copyright 2003-2009. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.cisl.ucar.edu/docs/ibm/hpm.toolkit/ex.omp.html