
OpenMPI

TEST

Here’s a breakdown of that command and what it implies for process placement, threading, and CPU “binding.”


Command

mpirun -x OMP_NUM_THREADS=7 -np 8 \
       --bind-to numa --map-by ppr:4:socket  \
       ./lmp_IGO -in TI.in
  1. -x OMP_NUM_THREADS=7
    Exports the environment variable OMP_NUM_THREADS=7 to every MPI process, so each rank spawns 7 OpenMP threads.
  2. -np 8
    You are launching 8 total MPI processes.
  3. --map-by ppr:4:socket
    • “ppr” = processes per resource
    • ppr:4:socket means “place 4 MPI processes on each socket.”
    • If your node has 2 sockets, that results in 4 processes on socket #0 + 4 processes on socket #1 = 8 processes total (matching -np 8).
  4. --bind-to numa
    • After mapping the processes to sockets, each process is bound (“affinity locked”) to a NUMA domain within that socket.
    • Many dual-socket Xeon systems expose 2 NUMA nodes per socket (e.g., with Sub-NUMA Clustering enabled), for a total of 4 NUMA nodes.
    • Since 4 processes are mapped to each socket (which here has 2 NUMA nodes), you effectively get 2 processes per NUMA node on that socket. The actual mapping and binding can be verified with the command shown after this list.
  5. ./lmp_IGO -in TI.in
    • The actual LAMMPS executable and input file.
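
To check that the mapping and binding actually come out this way, you can ask Open MPI to print the process map and the per-rank bindings at launch. A minimal sketch reusing the command above (--report-bindings and --display-map are standard Open MPI options):

mpirun -x OMP_NUM_THREADS=7 -np 8 \
       --bind-to numa --map-by ppr:4:socket \
       --report-bindings --display-map \
       ./lmp_IGO -in TI.in

Each rank then reports (on stderr) the NUMA domain and cores it is bound to, which makes it easy to confirm the 4-ranks-per-socket, 2-per-NUMA-node layout before committing to a long run.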

What this means in practice

  • Total processes: You have 8 MPI processes.
  • Threads per process: Each MPI process spawns 7 OpenMP threads (OMP_NUM_THREADS=7).
  • Grand total “threads of execution”: 8 MPI ranks × 7 threads each = 56. That’s more than your 48 physical cores (assuming you have 48 cores in total), leading to some oversubscription unless you have hyper-threading enabled.
  • Placement:
    • ppr:4:socket ensures that on each socket, 4 ranks are created.
    • bind-to numa further pins each rank to one NUMA domain. Typically, each socket has 2 NUMA domains, each with 12 cores (on a 24-core socket). Therefore:
      • Socket #0 has 2 NUMA domains → 4 processes distributed across them (2 processes per NUMA domain).
      • Socket #1 similarly.
    • Each MPI process (and its 7 threads) can only run on the cores of its assigned NUMA node. (You can confirm your node’s actual socket/NUMA layout with the commands shown below.)
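
The socket and core counts above are assumptions about the node (2 sockets, 2 NUMA domains per socket, 12 cores per domain). A quick way to confirm the actual topology on your machine, using the standard Linux tools lscpu and numactl (not part of Open MPI):

lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|NUMA node'
numactl --hardware

The “NUMA nodeX CPU(s)” lines show exactly which core IDs belong to each NUMA domain, so you can check whether “2 NUMA nodes per socket, 12 cores each” really matches your hardware.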

Potential side effects

  1. Oversubscription:
    • With 7 threads per process × 2 processes per NUMA node = 14 threads on a 12-core NUMA node, you get 2 extra threads beyond the physical cores in that domain. That can reduce performance if the code is memory-intensive or if thread contention is high.
  2. Memory locality:
    • bind-to numa can improve memory locality because each rank+threads stays within a single NUMA domain, accessing local memory.
    • But if you over-subscribe that NUMA node, you might see diminishing returns.
  3. Load balancing:
    • If the simulation domain decomposition or LAMMPS input is not balanced, some processes might be more heavily loaded than others.
  4. Alternative configurations:
    • If you want to use all 48 cores exactly (no oversubscription), you might run 8 ranks × 6 threads = 48 (i.e., OMP_NUM_THREADS=6) or 4 ranks × 12 threads = 48, etc. (both variants are sketched after this list).
    • Or, if you truly want 7 threads per rank, you could reduce the MPI ranks to 6, giving 6 × 7 = 42 threads (within the 48-core budget).
    • Or enable hyper-threading if you want to match 56 “logical cores.”
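
For reference, here is what the two exact-fit splits could look like on the same node (a sketch only; the executable and input file are taken from the command above):

# 8 ranks x 6 threads = 48: 4 ranks per socket, 2 ranks per NUMA domain, no oversubscription
mpirun -x OMP_NUM_THREADS=6 -np 8 \
       --bind-to numa --map-by ppr:4:socket \
       ./lmp_IGO -in TI.in

# 4 ranks x 12 threads = 48: 2 ranks per socket, i.e. one rank per NUMA domain
mpirun -x OMP_NUM_THREADS=12 -np 4 \
       --bind-to numa --map-by ppr:2:socket \
       ./lmp_IGO -in TI.in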

Bottom line

  • ppr:4:socket + bind-to numa → 4 processes per socket, each pinned to a NUMA domain (2 processes per NUMA domain).
  • Each MPI process uses 7 OpenMP threads → total 56 threads across 8 ranks, which may oversubscribe a 48-core machine.
  • This layout can still work fine, but you should benchmark different MPI–OpenMP splits (and possibly different binding strategies) to find the best performance for your specific LAMMPS simulation; a simple benchmarking loop is sketched below.
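
A minimal way to run that comparison is a small shell loop over a few rank/thread splits (a sketch, assuming the same 2-socket, 48-core node and the executable and input file from above; the chosen combinations are illustrative):

for cfg in "8 6 4" "4 12 2" "16 3 8"; do
    set -- $cfg   # $1 = MPI ranks, $2 = OpenMP threads per rank, $3 = ranks per socket
    echo "=== ${1} ranks x ${2} threads ==="
    /usr/bin/time -p mpirun -x OMP_NUM_THREADS=${2} -np ${1} \
        --bind-to numa --map-by ppr:${3}:socket \
        ./lmp_IGO -in TI.in > log_${1}x${2}.txt
done

Comparing the wall-clock times (and the per-step timing breakdown in the resulting LAMMPS logs) then tells you which split suits this particular simulation best.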
