OpenMPI

In Open MPI (and similarly in other MPI implementations like Intel MPI), the map and bind directives determine:

  1. How MPI processes are mapped (placed) onto hardware resources (sockets, cores, nodes, NUMA domains, etc.).
  2. How (or if) each MPI process is bound (affinity) to those resources so that they do not migrate among them arbitrarily.

This can have a large impact on performance, especially on systems with multiple sockets or multiple NUMA domains, and when running hybrid MPI+OpenMP codes.

Below is a detailed breakdown:


1. “Map By” (--map-by)

“map-by” specifies the scheme for how MPI processes (ranks) get assigned to available hardware resources. Think of “map by” as the placement policy—where do your processes go?

Common map-by directives

  1. --map-by node
    • Spread processes across nodes first. This is more relevant in a cluster setting.
    • On a single node, “map-by node” simply places all processes on that one node, so it’s not particularly interesting on a single machine.
  2. --map-by socket
    • Fill one socket before moving to the next (or spread round-robin across sockets, depending on other options).
    • If you have 2 sockets, it tries to place an equal or balanced set of MPI processes on each socket.
  3. --map-by core
    • Assign one process per core, typically in a round-robin fashion across all cores in the system (or across all allocated nodes/cores in a cluster).
  4. --map-by hwthread
    • Assign processes per “hardware thread” (lowest hardware granularity). For example, if hyper-threading is on, each physical core might appear as two “hardware threads.”
  5. --map-by ppr:X:socket
    • PPR stands for Processes Per Resource.
    • Example: --map-by ppr:4:socket means “on each socket, place 4 processes.”
    • If you have 2 sockets, that yields 8 total MPI processes. If you have 4 sockets, that yields 16, etc.
    • Similarly, ppr:X:node means “on each node, place X processes” (for cluster runs).

Key idea: map-by chooses which resource you start with when laying out ranks. You can specify advanced patterns such as “ppr:2:socket” or “core:SPAN” or “node:SPAN,” but the main concept is the same: how to distribute the total number of MPI processes across sockets, cores, nodes, or other hardware boundaries.
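The ppr arithmetic above can be sketched with a little shell arithmetic. This is only an illustration of the counting logic on a made-up topology (the socket count and the fill order are assumptions, nothing here queries real hardware):

```shell
#!/bin/sh
# Sketch of the ppr:X:socket arithmetic on a hypothetical 2-socket node.
sockets=2
ppr=4                              # "ppr:4:socket" -> 4 ranks per socket
total=$((ppr * sockets))           # total MPI ranks this mapping produces
echo "total ranks: $total"

# If sockets are filled in order, rank r lands on socket r/ppr:
r=0
while [ "$r" -lt "$total" ]; do
  echo "rank $r -> socket $((r / ppr))"
  r=$((r + 1))
done
```

With 2 sockets this prints 8 ranks, 0–3 on socket 0 and 4–7 on socket 1; on a 4-socket node the same directive would yield 16 ranks.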


2. “Bind To” (--bind-to)

“bind-to” specifies the affinity policy for your MPI processes. Once you’ve mapped a process onto (say) a socket or a set of cores, “bind-to” determines how tightly that process (and its threads) stay on those cores during runtime.

Common bind-to directives

  1. --bind-to none
    • No binding at all; the operating system is free to schedule each MPI process (and its threads) on any core. Processes can migrate from core to core.
    • This can be good if your code benefits from dynamic scheduling, or if you want maximum flexibility.
    • It can also lead to NUMA performance issues if memory pages get allocated on a different NUMA node than where the process later migrates.
  2. --bind-to core
    • Each MPI process is pinned to exactly one core.
    • If your code spawns multiple threads (like OpenMP threads), they are all forced onto the same single core. This is usually not what you want for a multi-threaded (hybrid) application but can be okay for pure MPI (1 rank per core).
  3. --bind-to socket
    • Each MPI process is pinned to an entire socket—that is, all cores within that socket.
    • This is good for hybrid MPI+OpenMP if you want each MPI process to freely share the 24 (or however many) cores of its assigned socket.
  4. --bind-to numa
    • Each MPI process is pinned to one NUMA domain.
    • On modern dual-socket Xeon systems with 2 NUMA domains per socket, “bind-to numa” can be more fine-grained than “bind-to socket.” A 24-core socket might be split into two 12-core NUMA domains, for example.
  5. --bind-to hwthread
    • Pins each MPI process to a single hardware thread (lowest granularity). Again, not common for hybrid codes, but can be used in certain HPC contexts.

Key idea: bind-to chooses how many hardware resources each rank is allowed to run on. If you have multiple OpenMP threads per MPI rank, you usually want bind-to socket or bind-to numa so that all threads of that MPI rank can share multiple cores in that region.
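To make the “how many cores per rank” idea concrete, here is a small sketch for a hypothetical 2-socket node with 24 cores per socket and 2 NUMA domains per socket. These numbers are assumptions for illustration; check your own machine’s layout with a tool such as hwloc’s lstopo:

```shell
#!/bin/sh
# Hypothetical topology (assumed, not probed from hardware):
cores_per_socket=24
numa_per_socket=2

# How many cores one rank may run on under each --bind-to policy:
echo "bind-to core    -> 1 core"
echo "bind-to numa    -> $((cores_per_socket / numa_per_socket)) cores"
echo "bind-to socket  -> ${cores_per_socket} cores"
echo "bind-to none    -> all $((2 * cores_per_socket)) cores (OS may migrate)"
```

On this assumed topology, a rank bound to a NUMA domain can use 12 cores, one bound to a socket can use 24, and an unbound rank can wander across all 48.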


3. Combining “Map By” and “Bind To”

You can combine them in one command line, for example:

mpirun -np 8 \
       --map-by ppr:4:socket \
       --bind-to numa \
       ./my_app

Interpretation

  • -np 8: You want 8 MPI processes in total.
  • --map-by ppr:4:socket: Place 4 processes per socket. If you have 2 sockets, that accounts for all 8 processes (4 on each socket).
  • --bind-to numa: Each process is pinned to a NUMA domain within that socket. On many modern Intel CPUs, each socket might have 2 NUMA domains, so you effectively get 2 processes per NUMA domain. Each process can only use the cores in that NUMA domain.
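You don’t have to reason about the resulting layout blind: Open MPI can print the actual binding of each rank at launch with the --report-bindings option (the exact report format varies by Open MPI version). A sketch, with ./my_app as a placeholder:

```shell
# Print each rank's binding mask at startup, then run the application.
mpirun -np 8 \
       --map-by ppr:4:socket \
       --bind-to numa \
       --report-bindings \
       ./my_app
```

This is a quick way to confirm that the map-by/bind-to combination you asked for is what you actually got.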

If you used:

mpirun -np 8 \
       --map-by core \
       --bind-to core \
       ./my_app

  • You’re distributing 8 processes across cores in a round-robin manner (map-by core).
  • Each process is pinned to exactly one core (bind-to core). If your code uses threads, that wouldn’t allow them to spread out—leading to potential underuse of your CPU resources if you have more than 8 cores total.

4. Why does this matter?

  1. Performance:
    • HPC applications often exhibit better performance when pinned near the memory that’s local to them (NUMA awareness) and avoid overhead from context switching or random migration.
    • Hybrid MPI+OpenMP codes often want a certain ratio of MPI ranks to CPU cores.
  2. Memory Locality:
    • Binding ensures that a process mostly allocates and accesses memory in its local NUMA node, reducing expensive cross-socket or cross-NUMA memory traffic.
  3. Load Balancing:
    • Proper mapping can keep processes well balanced. For instance, “ppr:2:socket” ensures each socket runs the same number of processes, so each socket is equally loaded.
  4. Avoiding Over-Subscription:
    • If you have a 48-core system, you typically don’t want to run 48 MPI processes each with 4 OpenMP threads (which would be 192 threads) unless you have some reason to over-subscribe.
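The over-subscription arithmetic is worth spelling out. A sketch with the assumed numbers from above (48 cores, 48 ranks, 4 threads each):

```shell
#!/bin/sh
# Over-subscription check on a hypothetical 48-core node.
cores=48
ranks=48
threads_per_rank=4

total_threads=$((ranks * threads_per_rank))
echo "$total_threads threads on $cores cores"
if [ "$total_threads" -gt "$cores" ]; then
  echo "over-subscribed by a factor of $((total_threads / cores))"
fi

# A balanced hybrid choice keeps ranks * threads_per_rank == cores:
echo "balanced example: $((cores / threads_per_rank)) ranks x $threads_per_rank threads"
```

Here 48 ranks × 4 threads gives 192 threads competing for 48 cores (4× over-subscribed), whereas 12 ranks × 4 threads fills the node exactly once.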

5. Examples and Practical Scenarios

  1. Pure MPI on a 48-core node
    • -np 48 --map-by core --bind-to core
    • 48 MPI ranks, each pinned to 1 core.
  2. Hybrid: 4 MPI ranks x 12 threads
    • That means total 48 cores used.
    • mpirun --map-by ppr:2:socket --bind-to socket -np 4 ...
      • 2 ranks per socket → 4 total ranks on a 2-socket node.
      • Each rank is bound to the entire socket (24 cores), so each rank can spawn up to 24 threads if needed. If you only want 12 threads, that’s fine—just set OMP_NUM_THREADS=12.
  3. Hybrid: 8 MPI ranks x 6 threads
    • -np 8 total ranks, each pinned to half a socket (or a NUMA domain with ~6–12 cores).
    • Example: mpirun -np 8 --map-by ppr:4:socket --bind-to numa ...
      • 4 processes per socket.
      • Each is pinned to a NUMA domain.
      • If each rank sets OMP_NUM_THREADS=6, you get 48 threads total.
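Scenario 3 could be launched in one line. This is a sketch: -x exports an environment variable to all ranks in Open MPI, and ./my_app is a placeholder for your binary:

```shell
# 8 ranks x 6 OpenMP threads = 48 threads on a hypothetical 2-socket, 48-core node.
mpirun -np 8 \
       -x OMP_NUM_THREADS=6 \
       --map-by ppr:4:socket \
       --bind-to numa \
       --report-bindings \
       ./my_app
```

Adding --report-bindings lets you verify that each rank really received its own NUMA domain before the threads start up.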

Bottom Line

  • --map-by: decides how many processes go onto which resource (cores, sockets, nodes, etc.).
  • --bind-to: decides how strictly each process is “affined” (pinned) to that resource.
  • For hybrid MPI+OpenMP, you generally want each MPI process to have multiple cores available—so you often see bind-to socket (or bind-to numa) combined with an appropriate map-by policy (like ppr:<ranks_per_socket>:socket).
  • Test different combinations to see which yields the best performance for your workload. Different HPC codes benefit from different mapping/binding schemes, especially on multi-socket or multi-NUMA systems.
