Here’s an analysis of your hybrid MPI+OpenMP LAMMPS run using:
export I_MPI_PIN=1
export OMP_NUM_THREADS=4
mpirun -np 12 lmp -in in.ST1.MSCDSS -sf intel -pk intel 0 omp 4
1. What the Configuration Does
- -np 12: Launches 12 MPI processes
- OMP_NUM_THREADS=4: Each MPI process spawns 4 OpenMP threads
- Total logical concurrency demand: 12 MPI ranks × 4 OpenMP threads = 48 threads
- I_MPI_PIN=1: Tells Intel MPI to pin processes (and, by inheritance, their threads) to specific CPU cores via CPU affinity
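If you want Intel MPI itself to confirm how each rank was pinned, a minimal variant of the launch is sketched below (assuming a reasonably recent Intel MPI; I_MPI_DEBUG and I_MPI_PIN_DOMAIN are standard Intel MPI variables, but check your version's documentation):
```bash
# Ask Intel MPI to print a rank-to-core pinning table at startup.
# I_MPI_DEBUG>=4 prints process pinning info; I_MPI_PIN_DOMAIN=omp sizes
# each pinning domain to OMP_NUM_THREADS (both per Intel MPI docs).
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=omp
export I_MPI_DEBUG=4
export OMP_NUM_THREADS=4
mpirun -np 12 lmp -in in.ST1.MSCDSS -sf intel -pk intel 0 omp 4
```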
2. Affinity Check: taskset -cp
Each PID (MPI rank) is bound to 4 logical cores, and the core sets do not overlap, which is the behavior you want:
- No thread contention between ranks
- Cache locality (and with it vectorization efficiency) is preserved
Here’s the summary mapping:
| MPI Rank | CPU Affinity (4 logical cores) |
|---|---|
| 0 | 0-3 |
| 1 | 7,8,12,13 |
| 2 | 14,18-20 |
| 3 | 4-6,9 |
| 4 | 10,11,15,16 |
| 5 | 17,21-23 |
| 6 | 24-27 |
| 7 | 31-33,37 |
| 8 | 38,39,43,44 |
| 9 | 28-30,34 |
| 10 | 35,36,40,41 |
| 11 | 42,45-47 |
🔍 Good signs:
- All MPI ranks are affinitized to different cores, matching OMP_NUM_THREADS=4
- There is no core sharing, avoiding resource contention
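A quick way to collect these mappings in one pass (a sketch, assuming the ranks run locally and their command lines match the pattern below):
```bash
# Print the affinity mask of every running lmp rank.
# The pattern 'lmp -in' is an assumption based on the launch line above.
for pid in $(pgrep -f 'lmp -in'); do
    taskset -cp "$pid"
done
```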
3. Thread Verification: ps -L -p
Each PID has 4 LWPs (lightweight processes = threads):
PID 211438 → 4 threads (main + 3 OpenMP)
PID 211439 → 4 threads
...
✅ Confirms that OMP_NUM_THREADS=4 is properly respected
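To also see which CPU each thread last ran on, ps can print the psr column per LWP (a sketch, reusing the PID from the output above):
```bash
# List rank 0's threads with the processor each one last ran on (psr)
# and its CPU usage; PID 211438 is taken from the earlier output.
ps -L -p 211438 -o pid,lwp,psr,pcpu,comm
```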
4. System Optimization Insight
You are using:
- 12 MPI processes
- 4 OpenMP threads per process
- Pinning is working well
- No oversubscription
- NUMA locality and core grouping are probably honored (would need lstopo or numactl -H to confirm precisely)
This is a well-formed hybrid MPI+OMP execution setup.
✅ Summary
- ✅ OpenMP threads per MPI rank = 4: confirmed
- ✅ Core binding per rank = 4 non-overlapping logical cores: confirmed
- ✅ 48 threads mapped onto 48 distinct logical cores (0-47): efficient and clean
- ✅ I_MPI_PIN=1 is effective
- 🔄 Further optimization (if needed) could involve:
- Ensuring NUMA locality with numactl or hwloc
- Using KMP_AFFINITY=granularity=fine,compact if Intel OpenMP is used (to help fine-tune thread affinity)
- Checking lscpu or hwloc-ls to verify CCD/domain boundaries
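As a sketch of the first two bullets (assuming the Intel OpenMP runtime for KMP_AFFINITY; the OMP_* variables are the portable OpenMP 4.0+ equivalents):
```bash
# Keep each rank's 4 threads on distinct cores inside its pinning domain
export KMP_AFFINITY=granularity=fine,compact,1,0
# Portable alternative to KMP_AFFINITY:
#   export OMP_PLACES=cores
#   export OMP_PROC_BIND=close

# Inspect NUMA layout before deciding how many ranks to place per socket
numactl -H
lscpu | grep -i numa
```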
Let me know if you’d like an analysis with lstopo or NUMA distance awareness added.
What Is a TID in This Context?
In Linux, TID refers to the Thread ID, and in most tools (like ps -L) it’s shown as the Lightweight Process (LWP) ID. In your output:
ps -L -p 211438
PID LWP TTY TIME CMD
211438 211438 pts/1 00:02:57 lmp
211438 211486 pts/1 00:00:00 lmp
211438 211488 pts/1 00:00:00 lmp
211438 211491 pts/1 00:00:00 lmp
- PID is the process ID (here, the process of one MPI rank)
- LWP (or TID) is the thread ID
- So MPI rank 0 (PID 211438) is running 4 threads (TIDs 211438, 211486, 211488, 211491)
This shows that each MPI process is spawning OpenMP threads as expected.
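If you want to see where ps -L gets these numbers, the same TIDs appear as directory names under /proc/<PID>/task (using the PID from the output above):
```bash
# Each subdirectory of /proc/<PID>/task is one thread (TID) of that process
ls /proc/211438/task
```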
🧠 What taskset -cp <PID> Does
taskset -cp <PID> reports (or sets) the CPU affinity of the process's main thread. In practice this usually reflects the whole process, because threads inherit the affinity of the thread that created them; to operate explicitly on every thread, taskset offers the -a/--all-tasks option.
So, when you run:
taskset -cp 211438
You get:
pid 211438's current affinity list: 0-3
This tells you that PID 211438, and by inheritance its threads (TIDs 211438, 211486, 211488, 211491), is restricted to cores 0–3.
However, note:
- Plain taskset -cp <PID> does not show per-thread affinity (the -a/--all-tasks option covers every thread of the process)
- Threads inherit the CPU affinity of the thread that created them, unless the OpenMP runtime or a manual override changes it afterwards
To check per-thread affinity, you’d need something like:
# loop over every thread (LWP) of PID 211438 and print its CPU affinity
for tid in $(ps -L -p 211438 | awk 'NR>1 {print $2}'); do
    taskset -cp "$tid"
done
💡 Is It Worth Running taskset -cp <PID>?
Yes — it is absolutely worth running if:
- You’re verifying CPU affinity of your MPI ranks
- You want to ensure OpenMP threads are not competing for the same CPU cores
- You’re checking that Intel MPI pinning via I_MPI_PIN=1 is effective
But it’s not sufficient for per-thread analysis (use hwloc or thread-level taskset for that).
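For a per-thread view without scripting, hwloc's hwloc-ps can list threads together with their bindings; a sketch, assuming hwloc is installed (the -t/--threads flag is documented for hwloc-ps, but verify on your version):
```bash
# List bound processes with their threads and bindings; keep only LAMMPS
hwloc-ps -t | grep lmp
```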
✅ Summary
| Term | Meaning |
|---|---|
| PID | MPI rank process ID |
| TID | Thread ID (LWP), shown under ps -L |
| taskset -cp <PID> | Shows/sets CPU affinity for the whole process (and by default, its threads) |
| Worth running? | ✅ Yes, for checking MPI rank affinity setup |
For deeper optimization (e.g., NUMA-aware mapping or thread-core pinning), tools like hwloc, numactl, or likwid-pin may offer better granularity.
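For illustration only (the NUMA node ID and core list below are hypothetical and not taken from the run above), such launches could look like:
```bash
# Bind a single test process and its memory to NUMA node 0 with numactl
numactl --cpunodebind=0 --membind=0 lmp -in in.ST1.MSCDSS

# Or pin one process and its 4 OpenMP threads to cores 0-3 with likwid-pin
likwid-pin -c 0-3 lmp -in in.ST1.MSCDSS
```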
Let me know if you want to script full rank/thread affinity checks.