Parallel processing programming

Auto-parallelization

The compiler automatically translates serial portions of the input program into equivalent multithreaded code.

  • -parallel
    • Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel.
  • -par-threshold[n]
    • This option sets a threshold for the auto-parallelization of loops, based on the probability that parallel execution of the loop will be profitable. ([n]: 0-100; the default value is 100)
    • If n is 0, loops are always auto-parallelized, regardless of computational work volume. If n is 100, loops are auto-parallelized only when the compiler's analysis predicts a performance gain.
  • -qopt-report[=n] -qopt-report-phase=par (-qopt-report[=n] is omissible.)
    Generates an auto-parallelization report.
    • n=1:Reports which loops were parallelized.
    • n=2:Generates level 1 details, and reports which loops were not parallelized along with a short reason.
    • n=3:Generates level 2 details, and prints the memory locations that are categorized as private, shared, reduction, etc.
    • n=4:For this phase, this is the same as specifying level 3.
    • n=5:Generates level 4 details, and dependency edges that inhibit parallelization.
    • The default value: n=2.
      (*)Option -qopt-report is the replacement option for -opt-report, which is deprecated. (version 15.0)
      (*)The output, by default, comes out to a file with the same name as the object file but with .optrpt extension and is written into the same directory as the object file.
  • ex)
    $ cat mxm.f
    
      1       SUBROUTINE MXM (A,B,C,L,M,N,N1)
      2       DIMENSION A(M,L),B(N1,M),C(N1,L)
      3       DO  K =1,L
      4         DO  I =1,N
      5           C(I,K)=0.
      6         ENDDO
      7       ENDDO
      8       DO K =1,L
      9         DO I =1,N
     10           DO J =1,M
     11             C(I,K)=C(I,K)+B(I,J)*A(J,K)
     12           ENDDO
     13         ENDDO
     14       ENDDO
     15       RETURN
     16       END
    
    $ ifort  -c  -O3  -xAVX -parallel  -qopt-report -qopt-report-phase=par mxm.f
    ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
    
    The auto-parallelization report file is mxm.optrpt. 
You can see that the loop on line 8 was parallelized.
    
    $ ifort  -c  -O3  -xAVX  -parallel  -par-threshold0 -qopt-report -qopt-report-phase=par  mxm.f
    ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
    
    The auto-parallelization report file is mxm.optrpt. 
You can see that the loops on lines 3 and 8 were parallelized.
  • With -par-threshold100 (the default), the initialization loop is not parallelized, because the auto-parallelizer judges that the loop does not contain enough work.
  • With -par-threshold0, loops are always auto-parallelized, regardless of computational work volume.

OpenMP

Enables the parallelizer to generate multi-threaded code based on OpenMP directives.
(*)Option -qopenmp is the replacement option for -openmp, which is deprecated. The -openmp option still works in this version.
(*)The optimization report, by default, comes out to a file with the same name as the object file but with an .optrpt extension

  • -qopenmp
    • Enables the parallelizer to generate multi-threaded code based on OpenMP directives.
  • -qopt-report[=n] -qopt-report-phase=openmp (-qopt-report[=n] is omissible.)
    • Tells the compiler to generate an optimization report.
    • n=1:Reports loops, regions, sections, and tasks successfully parallelized.
    • n=2:Generates level 1 details, and messages indicating successful handling of MASTER constructs, SINGLE constructs, CRITICAL constructs, ORDERED constructs, ATOMIC directives, and so forth.
    • The default value: n=2.
      (*)Option -qopt-report is the replacement option for -opt-report. Option -qopt-report-phase is the replacement option for -opt-report-phase.
  • Options -opt-report and -openmp-report still work in this version, but you will see a deprecation warning message.
  • ex.)
    $ cat mxm2.f
    
      1       SUBROUTINE MXM (A,B,C,L,M,N,N1)
      2       DIMENSION A(M,L),B(N1,M),C(N1,L)
      3 !$OMP PARALLEL DO
      4       DO  K =1,L
      5         DO  I =1,N
      6           C(I,K)=0.
      7         ENDDO
      8       ENDDO
      9 !$OMP PARALLEL DO
     10       DO K =1,L
     11         DO I =1,N
     12           DO J =1,M
     13             C(I,K)=C(I,K)+B(I,J)*A(J,K)
     14           ENDDO
     15         ENDDO
     16       ENDDO
     17       RETURN
     18       END
    
    $ ifort -c  -O3 -xAVX  -qopenmp  -qopt-report -qopt-report-phase=openmp mxm2.f
    ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
    
    The OpenMP report file is mxm2.optrpt.

(*) Setting both the -parallel and -qopenmp options on a command line.
The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel.

  • The -qopenmp option enables the parallelizer to generate multi-threaded code based on OpenMP directives.
  • ex.)
    $ cat mxm3.f
    
      1       SUBROUTINE MXM (A,B,C,L,M,N,N1)
      2       DIMENSION A(M,L),B(N1,M),C(N1,L)
      3       DO  K =1,L
      4         DO  I =1,N
      5           C(I,K)=0.
      6         ENDDO
      7       ENDDO
      8 !$OMP PARALLEL DO
      9       DO K =1,L
     10         DO I =1,N
     11           DO J =1,M
     12             C(I,K)=C(I,K)+B(I,J)*A(J,K)
     13           ENDDO
     14         ENDDO
     15       ENDDO
     16       RETURN
     17       END
    
    $ ifort -c -O3 -xAVX -parallel  -par-threshold0  -qopt-report \
    -qopt-report-phase=par -qopenmp  -qopt-report-phase=openmp  mxm3.f
    ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
    
    The analysis report file is mxm3.optrpt.
  • The loop on line 3 is auto-parallelized, and the loop on line 9 is parallelized by the OpenMP directive on line 8.
  • The environment variable OMP_NESTED=false is set by default, so the parallel region beginning at the loop on line 9 is not nested; i.e., the number of threads for this region is the value of OMP_NUM_THREADS.
  • OMP_NESTED=false disables nested parallel regions; see the sketch below.
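
    For example, a minimal sketch (assuming the csh/tcsh shell used elsewhere on this page) that makes the default explicit and sets the thread count:
    $ setenv OMP_NESTED false
    $ setenv OMP_NUM_THREADS 8
    With these settings, any parallelism generated inside an OpenMP parallel region does not spawn nested threads, and each parallel region runs with 8 threads.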

For OMP_NUM_THREADS, see The number of threads.

The number of threads

  • Set the number of threads in the environment variable OMP_NUM_THREADS.
$ setenv OMP_NUM_THREADS 4
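
To confirm the setting from inside a program, you can query the OpenMP runtime (a minimal sketch; the program and file names are only illustrative):

      PROGRAM NTHREADS
      USE OMP_LIB
C     OMP_GET_MAX_THREADS returns the number of threads available
C     to a parallel region (normally the value of OMP_NUM_THREADS).
      PRINT *, 'max threads =', OMP_GET_MAX_THREADS()
      END

Compile with -qopenmp so that the OpenMP runtime is linked:
$ ifort -qopenmp nthreads.f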

For using batch-job, see Using Batch Job.

Data Distribution

  • The EIC system adopts the "First Touch Policy":
    the first processor to touch a page of memory causes that page to be allocated from its local memory.
    If the initialization loop is executed serially, all pages will be taken from the local memory of a single processor.
    In the parallel loop, multiple processors will then all access that one processor's memory.
    So, perform the initialization in parallel, so that each processor initializes the data it is likely to access later in the computation, as the two examples below illustrate.


  • ex. )
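    Serial initialization: all arrays are first touched by a single processor, so every page is allocated from that processor's local memory.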
       real*8 A(n), B(n), C(n), D(n)
    
        do i=1, n
            A(i) = 0.
            B(i) = i/2
            C(i) = i/3
            D(i) = i/7
        enddo
    !$omp parallel do
        do i=1, n
            A(i) = B(i) + C(i) + D(i)
        enddo

    first_touch1_e.JPG
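
    Parallel initialization: each thread first touches the part of the arrays it will later use in the computation loop, so pages are distributed across the processors' local memories.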

        real*8 A(n), B(n), C(n), D(n)
    !$omp parallel do
        do i=1, n
            A(i) = 0.
            B(i) = i/2
            C(i) = i/3
            D(i) = i/7
        enddo
    !$omp parallel do
        do i=1, n
            A(i) = B(i) + C(i) + D(i)
        enddo
    first_touch2_e.JPG

Stack size

  • The stack size for each user is 4GB (this is not a limit on total memory use). The stack is a region of memory used to store automatic variables. A stack overflow causes a SegmentationFault or AddressError; in that case, the variables should be placed in a COMMON block for static allocation (see the sketch after the example below).
  • The environment variable KMP_STACKSIZE sets the number of bytes that each auto-parallelized/OpenMP thread allocates for its private stack. This does not include global variables or the private variables of the master thread. The default size is 2GB.
  • You can use optional suffixes to specify byte units: B (bytes), K (kilobytes), M (megabytes), G (gigabytes), or T (terabytes).
  • If you encounter a SegmentationFault or AddressError immediately after launching your parallel job (while the serial version completes without error), the value of KMP_STACKSIZE should be increased.


ex. )

$ setenv KMP_STACKSIZE 4g
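
A minimal sketch of the COMMON-block workaround mentioned above (the subroutine and array names are only illustrative):

      SUBROUTINE CALC(N)
      INTEGER N, I
C     A large local array normally lives on the stack and can
C     overflow it.  Placing it in a COMMON block allocates it
C     statically instead of on the stack.
      REAL*8 WORK(10000000)
      COMMON /WORKBUF/ WORK
      DO I = 1, MIN(N, 10000000)
        WORK(I) = 0.0D0
      ENDDO
      RETURN
      END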

MPI processing

  • The -lmpi link option should be specified on the link command line.
    ex.)
    $ ifort  -O3  -xAVX  ex_mpi.f -lmpi

Run-time MPI command is mpiexec_mpt.
For using batch-job, see Using Batch Job.
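
A minimal sketch of what an MPI Fortran source such as ex_mpi.f might contain (this hello-world program is only illustrative):

      PROGRAM EXMPI
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROCS
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
      PRINT *, 'rank', RANK, 'of', NPROCS
      CALL MPI_FINALIZE(IERR)
      END

$ ifort -O3 -xAVX ex_mpi.f -lmpi
$ mpiexec_mpt -np 4 ./a.out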

  • MPI Data Type
    MPI Fortran Type        Fortran Type        Bits
    MPI_DATATYPE_NULL       -                   -
    MPI_INTEGER             INTEGER             32
    MPI_REAL                REAL                32
    MPI_DOUBLE_PRECISION    DOUBLE PRECISION    64
    MPI_COMPLEX             COMPLEX             64
    MPI_DOUBLE_COMPLEX      DOUBLE COMPLEX      128
    MPI_LOGICAL             LOGICAL             32
    MPI_CHARACTER           CHARACTER           8
    MPI_INTEGER1            INTEGER(KIND=1)     8
    MPI_INTEGER2            INTEGER(KIND=2)     16
    MPI_INTEGER4            INTEGER(KIND=4)     32
    MPI_INTEGER8            INTEGER(KIND=8)     64
    MPI_REAL2               REAL(KIND=2)        16
    MPI_REAL8               REAL(KIND=8)        64
    MPI_REAL16              REAL(KIND=16)       128
    MPI_BYTE                BYTE                8

    MPI C Type              C Type              Bits
    MPI_CHAR                char                8
    MPI_SHORT               short               16
    MPI_INT                 int                 32
    MPI_LONG                long                64
    MPI_UNSIGNED_CHAR       unsigned char       8
    MPI_UNSIGNED_SHORT      unsigned short      16
    MPI_UNSIGNED            unsigned int        32
    MPI_UNSIGNED_LONG       unsigned long       64
    MPI_FLOAT               float               32
    MPI_DOUBLE              double              64
    MPI_LONG_DOUBLE         long double         128
    MPI_BYTE                -                   8
    MPI_PACKED              -                   -
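
    When calling MPI routines from Fortran, the MPI datatype must match the declared type of the buffer; a minimal sketch (the subroutine name is only illustrative):

      SUBROUTINE BCASTX(X, N)
      INCLUDE 'mpif.h'
      INTEGER N, IERR
      REAL(KIND=8) X(N)
C     REAL(KIND=8) pairs with MPI_REAL8 (or MPI_DOUBLE_PRECISION).
      CALL MPI_BCAST(X, N, MPI_REAL8, 0, MPI_COMM_WORLD, IERR)
      RETURN
      END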

MPI/OpenMP hybrid processing

To generate a hybrid binary, specify the parallelization options together with the -lmpi link option.

  • ex. )
    $  ifort  -O3  -xAVX -qopenmp  -qopt-report -qopt-report-phase=openmp hybrid.f  -lmpi

    For using batch-job, see Using Batch Job.
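
    A minimal sketch of what a hybrid source such as hybrid.f might contain (this program is only illustrative), in which each MPI process runs an OpenMP-parallel loop:

      PROGRAM HYBRID
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, I
      REAL*8 A(1000)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
C     Each MPI process executes this loop with OMP_NUM_THREADS threads.
!$OMP PARALLEL DO
      DO I = 1, 1000
        A(I) = DBLE(RANK + I)
      ENDDO
      PRINT *, 'rank', RANK, 'A(1) =', A(1)
      CALL MPI_FINALIZE(IERR)
      END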

Performance Analyzing Tool

  • SGI Perfsuite
    Detects which functions, routines, and lines consume run time. Use the psrun command with your program. When your job is done, the results of the psrun command are available in the current directory. Use psprocess to format the result files.
  • Auto-parallelization/OpenMP
  • Launch your job with the psrun command, specifying the -p option. The -p option enables thread support.
  • Auto-parallelized/OpenMP jobs use an extra thread that is inactive, so the number of result files is N+1. The file for the extra thread contains no profiling data.
  • Your programs do not have to be recompiled. If you compile with -g, psrun provides source-line profiling.

    ex.)
    $ setenv OMP_NUM_THREADS 4
    $ dplace -x5 psrun -p ./a.out

    $ ls *.xml
    a.out.3.4032.eich1.xml
    a.out.2.4032.eich1.xml
    a.out.4.4032.eich1.xml
    a.out.1.4032.eich1.xml
    a.out.0.4032.eich1.xml

    The result files of the psrun command are named "prog-name.thread-num.pid.hostname.xml".
    The psprocess command formats a result file produced by psrun.
    $ psprocess  a.out.0.4032.eich1.xml

    (*)The psprocess command cannot accept more than one file.
    $ psprocess  a.out.0.4032.eich1.xml  a.out.2.4032.xml  a.out.3.4032.xml  <--- Not Allowed

    ex. )
    Compilation with -g and auto-parallelizing options.
    $ ifort -g -O3 -xAVX -parallel -par-threshold0 -qopt-report -qopt-report-phase=par test.f
    $ dplace -x5 psrun -p ./a.out
    $ ls *.xml
    a.out.4.16904.eich1.xml
    a.out.3.16904.eich1.xml
    a.out.2.16904.eich1.xml
    a.out.1.16904.eich1.xml
    a.out.0.16904.eich1.xml
    $ psprocess  a.out.0.16904.eich1.xml
    
    ......
    Samples      Self%    Total%  Function  
         7922   87.82%   87.82%  L_jacobi__202__par_region0_2_137  
         1043   11.56%   99.38%  __intel_new_memcpy
           34    0.38%   99.76%  main$himenobmtxp_m_omp_$BLK
            6    0.07%   99.82%  _intel_fast_memcpy.J
    
    Samples     Self%  Total%   Function:File:Line
         1260  13.97% 13.97%   L_jacobi__202__par_:himeno_omp.f:215
         1162  12.88% 26.85%   L_jacobi__202__par_:himeno_omp.f:219
         1043  11.56% 38.41%   __intel_new_memcpy:??:0
          993  11.01% 49.42%   L_jacobi__202__par:himeno_omp.f:216
    .......
  • MPI programs
    To analyze MPI programs, run the psrun command with the -f option.
  • Hybrid programs
    To analyze hybrid programs, run the psrun command with both the -f and -p options.
  • MPInside (provides information about MPI communication time)
    MPInside is an MPI profiling tool.
    The command name is MPInside.

    ex.)
    $ mpiexec_mpt -np 4 dplace -s1 MPInside a.out

    The result file is named "mpinside_stats" and is written to the current directory.
    .....
    >>>> Elapse times in (s) 0 1<<<<
    CPU   Comput   init     waitall  isend    irecv    barrier  bcast    allreduce
    0000 42.8395   0.0001   0.8088   0.9166   0.0817   0.0001   0.0001   0.1077
    0001 42.6846   0.0001   0.8629   0.9188   0.0706   0.0002   0.0161   0.1840 
    0002 42.5846   0.0001   1.0888   0.8851   0.0544   0.0002   0.0161   0.1294
    0003 42.6181   0.0001   0.9546   0.9162   0.0535   0.0002   0.0161   0.1331
    0004 42.4007   0.0001   1.1028   0.9472   0.0466   0.0003   0.0161   0.2360
    ......

    Below is an example of an Excel area chart that can be produced from the mpinside_stats file.
    MPInside_e.jpg
  • The x-axis is the MPI rank; the y-axis is the elapsed time.
  • The elapsed time of each MPI routine is shown in a different color.
  • In the above example, compute time and MPI_Alltoallv time account for a large portion of the whole elapsed time.
  • In this example, the MPI communication time includes both the data-transfer time and the wait time for the communication.
