
Command-line options for mpirun

PPR: Processes Per Resource

The --map-by ppr:N:<resource> syntax places N MPI ranks on each instance of the given hardware resource (node, socket, NUMA node, or core).

 

Examples

  • --map-by ppr:1:socket → 1 MPI rank per CPU socket
  • --map-by ppr:2:numa → 2 MPI ranks per NUMA node
  • --map-by ppr:1:node → 1 MPI rank per compute node
  • --map-by ppr:4:core → 4 MPI ranks per core (⚠️ rarely used; oversubscribes CPUs)
  • --map-by ppr:1:numa:PE=16 → 1 MPI rank per NUMA node, with 16 cores (OpenMP threads) each
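
To decide which resource to map by, it helps to check how many sockets, NUMA nodes, and cores the machine actually exposes. A minimal check with lscpu (the field names may differ slightly between distributions):

# Summarize the CPU topology relevant to --map-by choices
lscpu | grep -E 'Socket\(s\)|NUMA node\(s\)|Core\(s\) per socket|Thread\(s\) per core'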

 

Bind to

The --bind-to option tells Open MPI to pin each MPI process (and, indirectly, its threads) to specific CPU hardware:

 

  • Prevents CPU migration
  • Improves cache locality
  • Reduces NUMA latency
  • Prevents thread/process bouncing across cores
  • --bind-to none → no binding; processes can float across all CPUs (not NUMA friendly)
  • --bind-to core → bind each process to a set of cores (sized by PE or the thread count)
  • --bind-to socket → bind each process to a CPU socket (a group of cores)
  • --bind-to numa → bind each process to a NUMA node
  • --bind-to hwthread → bind to logical CPUs (hyperthreads); rarely used for performance
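
Before launching a real job, the effect of a map/bind combination can be previewed by running a trivial program under the same options; hostname here is just a stand-in for the application:

# Print the computed bindings without starting the real workload
mpirun -np 4 --map-by ppr:1:numa:PE=16 --bind-to core --report-bindings hostname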

 

Example
mpirun -np 4 --map-by ppr:1:numa:PE=16 --bind-to core ...

 

  • 4 MPI ranks
  • Each rank gets 16 cores
  • Each rank is pinned to its 16 cores, so the OS can't migrate it elsewhere
  • OpenMP threads stay within those cores → great NUMA locality
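
A quick sanity check on the numbers: ranks × cores per rank should not exceed what the node offers. A sketch (nproc counts logical CPUs, so it includes SMT threads when they are enabled):

echo $(( 4 * 16 ))   # cores requested: 4 ranks × 16 cores each = 64
nproc                # logical CPUs available on this node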

 

  • --bind-to none → process and its threads can move across all 64 cores; may cause NUMA traffic
  • --bind-to core → process and its threads pinned to specific physical cores; best for cache/NUMA locality
  • --bind-to numa → process bound to a full NUMA node; useful when OpenMP threads < cores per NUMA node
  • --bind-to socket → similar to numa, but not always accurate on AMD CPUs with NPS settings
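
As an illustration of the numa case: if each rank runs only 8 OpenMP threads on a 16-core NUMA node, binding to the whole NUMA node lets those threads use any core within it. A sketch assuming the 64-core, 4-NUMA-node machine used here:

export OMP_NUM_THREADS=8
mpirun -np 4 --map-by ppr:1:numa --bind-to numa --report-bindings \
  ./lmp -sf omp -pk omp 8 -in in.benchmark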

 

Thread binding in OpenMP
export OMP_PLACES=cores
export OMP_PROC_BIND=close

 

  • OpenMP threads stay within the process’s assigned cores
  • Threads are packed onto adjacent cores (maximizing shared L3 cache and keeping them on one NUMA node)
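
Whether the threads really land as intended can be checked with an OpenMP 5.0 runtime, which can print one affinity line per thread at startup. A sketch, where ./a.out stands in for any OpenMP binary (older runtimes may silently ignore the variable):

export OMP_DISPLAY_AFFINITY=TRUE
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=4
./a.out   # each thread reports its id and the CPUs it is bound to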

 

Example
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=close

mpirun -np 4 \
  --map-by ppr:1:numa:PE=16 \
  --bind-to core \
  --report-bindings \
  ./lmp -sf omp -pk omp 16 -in in.benchmark
[dell7875-Precision-7875-Tower:21904] MCW rank 0 bound to NUMA node 0[core 0-15]: [B/B/B/B/./././././././././././.]
[dell7875-Precision-7875-Tower:21905] MCW rank 1 bound to NUMA node 1[core 16-31]: [././././B/B/B/B/B/B/B/B/B/B/B/B]
[dell7875-Precision-7875-Tower:21906] MCW rank 2 bound to NUMA node 2[core 32-47]: [././././././././B/B/B/B/B/B/B/B]
[dell7875-Precision-7875-Tower:21907] MCW rank 3 bound to NUMA node 3[core 48-63]: [././././././././././././B/B/B/B]

 

  • B = core bound to this MPI rank
  • . = not bound
  • [core x-y] = physical cores used
  • NUMA node n = NUMA affinity (inferred from core set)
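
The inferred NUMA affinity can be cross-checked against the actual core-to-node layout, for example with numactl (assuming it is installed; hwloc's lstopo shows the same information):

# List which CPU ids belong to each NUMA node
numactl --hardware | grep cpus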

Example

export OMP_NUM_THREADS=32
mpirun -np 2 \
  --map-by ppr:1:node:PE=32:overload-allowed \
  --bind-to core \
  --report-bindings \
  ./lmp -sf omp -pk omp 32 -in in.benchmark
[dell7875:00000] MCW rank 0 bound to [core 0-31]
[dell7875:00001] MCW rank 1 bound to [core 32-63]
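
The binding of a running job can also be inspected from outside, e.g. with taskset. A sketch; the pgrep pattern is only a guess at matching the LAMMPS processes and may need adjusting:

# Show the CPU affinity list of every running lmp process
for pid in $(pgrep -f 'lmp -sf omp'); do taskset -cp "$pid"; done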
