PPR: Processes Per Resource
Examples
| Option | Meaning |
|---|---|
| --map-by ppr:1:socket | 1 MPI rank per CPU socket |
| --map-by ppr:2:numa | 2 MPI ranks per NUMA node |
| --map-by ppr:1:node | 1 MPI rank per compute node |
| --map-by ppr:4:core | 4 MPI ranks per core (⚠️ rarely used; oversubscribes CPUs) |
| --map-by ppr:1:numa:PE=16 | 1 MPI rank per NUMA node, with 16 cores (OpenMP threads) each |
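Before committing to a long run, the layout produced by a given ppr expression can be checked with a throwaway program. A minimal sketch, assuming the four NUMA domains of the machine used below (hostname is just a stand-in for the real binary, and the rank count of 8 is illustrative):

```bash
# Map 2 ranks per NUMA node (8 ranks on a 4-NUMA-node box) and
# print where each rank lands before launching the real job.
mpirun -np 8 \
    --map-by ppr:2:numa \
    --report-bindings \
    hostname
```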
Bind to
The --bind-to option tells Open MPI to pin each MPI process (and, by inheritance, its threads) to specific CPU hardware:
- Prevents CPU migration
- Improves cache locality
- Reduces NUMA latency
- Prevents thread/process bouncing across cores
| Option | Effect |
|---|---|
| --bind-to none | No binding; processes can float across all CPUs (not NUMA friendly) |
| --bind-to core | Bind each process to a set of cores (based on PE or thread count) |
| --bind-to socket | Bind each process to a CPU socket (group of cores) |
| --bind-to numa | Bind each process to a NUMA node |
| --bind-to hwthread | Bind to logical CPUs (hyperthreads); rarely useful for performance |
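The binding that actually takes effect can also be verified from inside the launched processes, independently of Open MPI's own report. A minimal sketch using the Linux /proc interface; the 4-rank ppr:1:numa layout is just an example, and OMPI_COMM_WORLD_RANK is the rank ID Open MPI exports into each process's environment:

```bash
# Each rank asks the kernel which CPUs it is allowed to run on
# (the sh child inherits the binding applied by Open MPI).
mpirun -np 4 --map-by ppr:1:numa --bind-to numa \
    sh -c 'echo "rank $OMPI_COMM_WORLD_RANK: cpus $(grep Cpus_allowed_list /proc/self/status | cut -f2)"'
```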
Example
```bash
mpirun -np 4 --map-by ppr:1:numa:PE=16 --bind-to core ...
```
- 4 MPI ranks
- Each rank gets 16 cores
- Each rank is pinned to its 16 cores, so the OS cannot migrate it elsewhere
- OpenMP threads stay within those cores → great NUMA locality
| Option | Behavior in this example |
|---|---|
| --bind-to none | MPI process and threads can move across all 64 cores; may cause NUMA traffic |
| --bind-to core | Process and its threads pinned to specific physical cores; best for cache/NUMA |
| --bind-to numa | Process bound to a full NUMA node; useful when OpenMP threads < NUMA cores |
| --bind-to socket | Similar to numa, but not always accurate on AMD with NPS settings |
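Which binding makes sense depends on the actual NUMA layout, and on AMD that layout changes with the NPS (NUMA-per-socket) BIOS setting. Assuming numactl and lscpu are installed, the layout can be dumped directly:

```bash
numactl --hardware      # NUMA nodes, their CPU lists, and per-node memory
lscpu | grep -i numa    # quick summary of node count and CPU ranges
```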
Thread binding in OpenMP
```bash
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```
- OpenMP threads stay within the process’s assigned cores
- Threads are packed onto adjacent cores (maximizing shared L3 locality and avoiding cross-NUMA traffic)
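Whether the OpenMP runtime honors these settings can be confirmed at run time. Runtimes implementing OpenMP 5.0 (libgomp in GCC, libomp in LLVM) support OMP_DISPLAY_AFFINITY, which makes every thread print its binding once at startup; a minimal sketch reusing the LAMMPS binary from below:

```bash
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_DISPLAY_AFFINITY=TRUE            # each thread reports its place once at startup
./lmp -sf omp -pk omp 16 -in in.benchmark   # or any OpenMP-enabled binary
```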
Example
```bash
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=close
mpirun -np 4 \
    --map-by ppr:1:numa:PE=16 \
    --bind-to core \
    --report-bindings \
    ./lmp -sf omp -pk omp 16 -in in.benchmark
```
```
[dell7875-Precision-7875-Tower:21904] MCW rank 0 bound to NUMA node 0[core 0-15]: [B/B/B/B/./././././././././././.]
[dell7875-Precision-7875-Tower:21905] MCW rank 1 bound to NUMA node 1[core 16-31]: [././././B/B/B/B/B/B/B/B/B/B/B/B]
[dell7875-Precision-7875-Tower:21906] MCW rank 2 bound to NUMA node 2[core 32-47]: [././././././././B/B/B/B/B/B/B/B]
[dell7875-Precision-7875-Tower:21907] MCW rank 3 bound to NUMA node 3[core 48-63]: [././././././././././././B/B/B/B]
```
- B = core bound to this MPI rank
- . = not bound
- [core x-y] = physical cores used
- NUMA node n = NUMA affinity (inferred from core set)
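The same information can be cross-checked while the job is running by asking the kernel for each rank's affinity mask, without going through Open MPI at all. A sketch assuming the LAMMPS processes are named lmp and util-linux's taskset is available:

```bash
# From another terminal: print the current CPU affinity of every running lmp rank.
for pid in $(pgrep -x lmp); do
    taskset -cp "$pid"
done
```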
The same run can also be decomposed as 2 ranks with 32 OpenMP threads each:

```bash
export OMP_NUM_THREADS=32
mpirun -np 2 \
    --map-by ppr:1:node:PE=32:overload-allowed \
    --bind-to core \
    --report-bindings \
    ./lmp -sf omp -pk omp 32 -in in.benchmark
```
```
[dell7875:00000] MCW rank 0 bound to [core 0-31]
[dell7875:00001] MCW rank 1 bound to [core 32-63]
```
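With both decompositions in hand (4 ranks x 16 threads vs. 2 ranks x 32 threads), the simplest way to pick one is to time the same input under each; the shell's time builtin is enough, and LAMMPS also prints its own loop time at the end of the log. A sketch reusing the commands above:

```bash
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# 4 ranks x 16 threads, one rank per NUMA node
export OMP_NUM_THREADS=16
time mpirun -np 4 --map-by ppr:1:numa:PE=16 --bind-to core \
    ./lmp -sf omp -pk omp 16 -in in.benchmark

# 2 ranks x 32 threads, half the CPU each
export OMP_NUM_THREADS=32
time mpirun -np 2 --map-by ppr:1:node:PE=32:overload-allowed --bind-to core \
    ./lmp -sf omp -pk omp 32 -in in.benchmark
```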