
OpenMPI

TEST

Here’s a breakdown of that command and what it implies for process placement, threading, and CPU “binding.”


Command

mpirun -x OMP_NUM_THREADS=7 -np 8 \
       --bind-to numa --map-by ppr:4:socket  \
       ./lmp_IGO -in TI.in
  1. -x OMP_NUM_THREADS=7
    Exports the environment variable OMP_NUM_THREADS=7 to every MPI process, so each rank spawns 7 OpenMP threads.
  2. -np 8
    You are launching 8 total MPI processes.
  3. --map-by ppr:4:socket
    • “ppr” = processes per resource
    • ppr:4:socket means “place 4 MPI processes on each socket.”
    • If your node has 2 sockets, that results in 4 processes on socket #0 + 4 processes on socket #1 = 8 processes total (matching -np 8).
  4. --bind-to numa
    • After mapping the processes to sockets, each process is bound (“affinity locked”) to a NUMA domain within that socket.
    • Many dual-socket Xeon systems expose 2 NUMA nodes per socket (e.g., with Sub-NUMA Clustering enabled), for a total of 4 NUMA nodes.
    • Since 4 processes are mapped to each socket (which here has 2 NUMA nodes), you effectively get 2 processes per NUMA node on that socket. The actual mapping and binding can be verified with the command shown after this list.
  5. ./lmp_IGO -in TI.in
    • The actual LAMMPS executable and input file.
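
To check that the mapping and binding actually come out this way, you can ask Open MPI to print the process map and the per-rank bindings at launch. A minimal sketch reusing the command above (--report-bindings and --display-map are standard Open MPI options):

mpirun -x OMP_NUM_THREADS=7 -np 8 \
       --bind-to numa --map-by ppr:4:socket \
       --report-bindings --display-map \
       ./lmp_IGO -in TI.in

Each rank then reports (on stderr) the NUMA domain and cores it is bound to, which makes it easy to confirm the 4-ranks-per-socket, 2-per-NUMA-node layout before committing to a long run.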

What this means in practice

  • Total processes: You have 8 MPI processes.
  • Threads per process: Each MPI process spawns 7 OpenMP threads (OMP_NUM_THREADS=7).
  • Grand total “threads of execution”: 8 MPI ranks × 7 threads each = 56. That’s more than your 48 physical cores (assuming you have 48 cores in total), leading to some oversubscription unless you have hyper-threading enabled.
  • Placement:
    • ppr:4:socket ensures that on each socket, 4 ranks are created.
    • bind-to numa further pins each rank to one NUMA domain. Typically, each socket has 2 NUMA domains, each with 12 cores (on a 24-core socket). Therefore:
      • Socket #0 has 2 NUMA domains → 4 processes distributed across them (2 processes per NUMA domain).
      • Socket #1 similarly.
    • Each MPI process (and its 7 threads) can only run on the cores of its assigned NUMA node. (You can confirm your node’s actual socket/NUMA layout with the commands shown below.)
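
The socket and core counts above are assumptions about the node (2 sockets, 2 NUMA domains per socket, 12 cores per domain). A quick way to confirm the actual topology on your machine, using the standard Linux tools lscpu and numactl (not part of Open MPI):

lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|NUMA node'
numactl --hardware

The “NUMA nodeX CPU(s)” lines show exactly which core IDs belong to each NUMA domain, so you can check whether “2 NUMA nodes per socket, 12 cores each” really matches your hardware.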

Potential side effects

  1. Oversubscription:
    • With 7 threads per process × 2 processes per NUMA node = 14 threads on a 12-core NUMA node, you get 2 extra threads beyond the physical cores in that domain. That can reduce performance if the code is memory-intensive or if thread contention is high.
  2. Memory locality:
    • bind-to numa can improve memory locality because each rank+threads stays within a single NUMA domain, accessing local memory.
    • But if you over-subscribe that NUMA node, you might see diminishing returns.
  3. Load balancing:
    • If the simulation domain decomposition or LAMMPS input is not balanced, some processes might be more heavily loaded than others.
  4. Alternative configurations:
    • If you want to use all 48 cores exactly (no oversubscription), you might run 8 ranks × 6 threads = 48 (i.e., OMP_NUM_THREADS=6) or 4 ranks × 12 threads = 48, etc. (both variants are sketched after this list).
    • Or, if you truly want 7 threads per rank, you could reduce the MPI ranks to 6, giving 6 × 7 = 42 threads (within the 48-core budget).
    • Or enable hyper-threading if you want to match 56 “logical cores.”
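
For reference, here is what the two exact-fit splits could look like on the same node (a sketch only; the executable and input file are taken from the command above):

# 8 ranks x 6 threads = 48: 4 ranks per socket, 2 ranks per NUMA domain, no oversubscription
mpirun -x OMP_NUM_THREADS=6 -np 8 \
       --bind-to numa --map-by ppr:4:socket \
       ./lmp_IGO -in TI.in

# 4 ranks x 12 threads = 48: 2 ranks per socket, i.e. one rank per NUMA domain
mpirun -x OMP_NUM_THREADS=12 -np 4 \
       --bind-to numa --map-by ppr:2:socket \
       ./lmp_IGO -in TI.in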

Bottom line

  • ppr:4:socket + bind-to numa → 4 processes per socket, each pinned to a NUMA domain (2 processes per NUMA domain).
  • Each MPI process uses 7 OpenMP threads → total 56 threads across 8 ranks, which may oversubscribe a 48-core machine.
  • This layout can still work fine, but you should benchmark different MPI–OpenMP splits (and possibly different binding strategies) to find the best performance for your specific LAMMPS simulation; a simple benchmarking loop is sketched below.
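
A minimal way to run that comparison is a small shell loop over a few rank/thread splits (a sketch, assuming the same 2-socket, 48-core node and the executable and input file from above; the chosen combinations are illustrative):

for cfg in "8 6 4" "4 12 2" "16 3 8"; do
    set -- $cfg   # $1 = MPI ranks, $2 = OpenMP threads per rank, $3 = ranks per socket
    echo "=== ${1} ranks x ${2} threads ==="
    /usr/bin/time -p mpirun -x OMP_NUM_THREADS=${2} -np ${1} \
        --bind-to numa --map-by ppr:${3}:socket \
        ./lmp_IGO -in TI.in > log_${1}x${2}.txt
done

Comparing the wall-clock times (and the per-step timing breakdown in the resulting LAMMPS logs) then tells you which split suits this particular simulation best.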
