

NUMA Pinning in IntelOneAPI

My humble Intel workstation is a dual-socket Xeon Gold machine. Each socket can be split into two Sub-NUMA nodes, and its 24 cores are divided 12 per NUMA node, so the box exposes four NUMA nodes of 12 cores each.

I want to launch LAMMPS with mpirun so that the processes are distributed evenly across the four NUMA nodes.

The total domain decomposition for the MPI run is 4, one rank per NUMA node, and each rank will split its work across 10 OpenMP threads.
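Before fixing the decomposition, it is worth confirming that Sub-NUMA Clustering really exposes four nodes of 12 cores each. A quick optional check (lscpu is assumed to be available; numactl --hardware further below gives the full picture):

$ lscpu | grep -i 'numa node'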

 

Below is the corresponding line in the LAMMPS input script.

processors       2 2 1 numa_nodes 4
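As a side check, the decomposition LAMMPS actually used can be confirmed after a run by searching the log file (assuming the default name log.lammps); it should report a 2 by 2 by 1 MPI processor grid:

$ grep 'MPI processor grid' log.lammps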

 

 

 

LAMMPS is launched with the following Intel MPI settings:

1. I_MPI_PIN=1

Enables process pinning.

2. I_MPI_PIN_DOMAIN=numa

Tells Intel MPI to pin each MPI rank to exactly one NUMA domain (rather than the whole socket or node).

3. Optionally: I_MPI_PIN_ORDER=spread (or scatter)

 By default, Intel MPI tries to place ranks consecutively.

 If you want to ensure ranks are “spread out” across the 4 domains, set I_MPI_PIN_ORDER=spread or I_MPI_PIN_PROCESSOR_LIST=allcores:map=NUMA for more manual control.

mpirun -np 4 \
  -genv I_MPI_PIN 1 \
  -genv I_MPI_PIN_DOMAIN numa \
  -genv I_MPI_PIN_ORDER spread \
  -genv I_MPI_DEBUG 5 \
  ./lmp -var ompTh 10 -in MWE_HP.in

Explanation

 -np 4 → 4 MPI ranks total.

 I_MPI_PIN=1 → Turn on process pinning.

 I_MPI_PIN_DOMAIN=numa → Each rank is confined to exactly one NUMA domain.

 I_MPI_PIN_ORDER=spread → Distribute ranks across the NUMA domains in a scatter‐like fashion (rank 0→node 0, rank 1→node 1, rank 2→node 2, rank 3→node 3).
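The command above leaves the OpenMP thread count entirely to the input script via the ompTh variable. A variant that also fixes the thread count and thread placement through the environment is sketched below; the OMP_* variables are standard OpenMP controls rather than anything Intel-specific, and 10 threads per rank is assumed as planned above:

mpirun -np 4 \
  -genv I_MPI_PIN 1 \
  -genv I_MPI_PIN_DOMAIN numa \
  -genv I_MPI_PIN_ORDER spread \
  -genv OMP_NUM_THREADS 10 \
  -genv OMP_PLACES cores \
  -genv OMP_PROC_BIND close \
  -genv I_MPI_DEBUG 5 \
  ./lmp -var ompTh 10 -in MWE_HP.in

Since each NUMA domain has 12 cores, the 10 threads fit inside it; OMP_PLACES=cores with OMP_PROC_BIND=close keeps each thread on its own core within that domain.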

 

After launching LAMMPS, the pinning can be checked as follows:

:~$  hwloc-ps
2886091	Group0:0		./lmp
2886092	Group0:1		./lmp
2886093	Group0:2		./lmp
2886094	Group0:3		./lmp

:~$ taskset -cp 2886091
pid 2886091's current affinity list: 0-3,7,8,12-14,18-20
:~$ taskset -cp 2886092
pid 2886092's current affinity list: 4-6,9-11,15-17,21-23
:~$ taskset -cp 2886093
pid 2886093's current affinity list: 24-27,31-33,37-39,43,44
:~$ taskset -cp 2886094
pid 2886094's current affinity list: 28-30,34-36,40-42,45-47
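The per-PID checks above can be done in a single pass (a small convenience sketch; it assumes the binary is named exactly lmp):

$ for pid in $(pgrep -x lmp); do taskset -cp "$pid"; done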

:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 7 8 12 13 14 18 19 20
node 0 size: 31893 MB
node 0 free: 28307 MB
node 1 cpus: 4 5 6 9 10 11 15 16 17 21 22 23
node 1 size: 32251 MB
node 1 free: 30688 MB
node 2 cpus: 24 25 26 27 31 32 33 37 38 39 43 44
node 2 size: 32208 MB
node 2 free: 29525 MB
node 3 cpus: 28 29 30 34 35 36 40 41 42 45 46 47
node 3 size: 32248 MB
node 3 free: 30633 MB
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10

 

The ranks are nicely distributed across the four nodes.

 

Memory distribution check

$ hwloc-ps
2886091	Group0:0		./lmp
2886092	Group0:1		./lmp
2886093	Group0:2		./lmp
2886094	Group0:3		./lmp

$ numastat -p 2886091

Per-node process memory usage (in MBs) for PID 2886091 (lmp)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         3.81            0.00            0.00
Stack                        0.09            0.00            0.00
Private                    196.07           17.00           27.12
----------------  --------------- --------------- ---------------
Total                      199.97           17.00           27.12

                           Node 3           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         0.00            3.81
Stack                        0.00            0.09
Private                     11.58          251.77
----------------  --------------- ---------------
Total                       11.58          255.67

$ numastat -p 2886092

Per-node process memory usage (in MBs) for PID 2886092 (lmp)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            3.67            0.00
Stack                        0.00            0.10            0.00
Private                     21.78          187.74           25.25
----------------  --------------- --------------- ---------------
Total                       21.78          191.51           25.25

                           Node 3           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         0.00            3.67
Stack                        0.00            0.10
Private                     12.29          247.06
----------------  --------------- ---------------
Total                       12.29          250.83

$ numastat -p 2886093

Per-node process memory usage (in MBs) for PID 2886093 (lmp)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            3.66
Stack                        0.00            0.00            0.10
Private                     21.26           14.72          213.07
----------------  --------------- --------------- ---------------
Total                       21.26           14.72          216.84

                           Node 3           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         0.00            3.66
Stack                        0.00            0.10
Private                     12.03          261.09
----------------  --------------- ---------------
Total                       12.03          264.86

$ numastat -p 2886094

Per-node process memory usage (in MBs) for PID 2886094 (lmp)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                     20.48           15.97           26.13
----------------  --------------- --------------- ---------------
Total                       20.48           15.97           26.13

                           Node 3           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         3.67            3.67
Stack                        0.10            0.10
Private                    233.12          295.71
----------------  --------------- ---------------
Total                      236.89          299.48
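The same per-node breakdown can be collected for every rank in one loop (again assuming the binary name is exactly lmp):

$ for pid in $(pgrep -x lmp); do numastat -p "$pid"; done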

 

This looks reasonable, but some memory is leaking onto remote NUMA nodes.

 

From these numastat -p <PID> outputs, each LAMMPS process is indeed allocating most of its memory in the NUMA node where it is pinned, but not 100% of the memory is on that local node. For example:

 PID 2886091 (pinned to node 0’s cores) has ~200 MB on node 0 out of ~256 MB total (≈78% local), but also some pages on nodes 1, 2, 3.

 PID 2886092 (node 1’s cores) has ~192 MB on node 1 out of ~251 MB total (≈76% local).

 PID 2886093 (node 2’s cores) has ~217 MB on node 2 out of ~265 MB total (≈82% local).

 PID 2886094 (node 3’s cores) has ~233 MB on node 3 out of ~299 MB total (≈78% local).

 

This is still a good result overall—each process places the majority of its memory on the local NUMA node, which is typically the main goal of “NUMA‐aware” pinning. A few observations and tips:

1. Why Some Memory Ends Up on Other Nodes

 Library / Runtime Allocations: Libraries loaded at program start (MPI libraries, shared libs, C runtime) can allocate pages before the process is fully pinned. These pages then remain on whichever node the OS selected initially.

 Small “Global” Data: Certain small allocations or OS overhead may occur on other nodes.

 Thread Initialization: If some threads allocate or “touch” pages before binding is finalized, those pages can reside in a different NUMA node.

 

None of this typically harms performance significantly, as long as most of the large data arrays (the main HPC data) get allocated (or “touched”) while the process is pinned.
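To see exactly which mappings hold the remote pages, the kernel's per-process numa_maps file can be inspected directly. A quick sketch using rank 0's PID from the hwloc-ps output above; the N0=/N1=/... fields are page counts per NUMA node:

$ grep -E 'heap|stack|anon' /proc/2886091/numa_maps | head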

2. Ensuring Even More Locality

 

If you want to push toward nearly 100% local memory:

1. Launch with numactl --membind=<node>

 For example, if rank 0 is pinned to node 0, also run it under numactl --cpunodebind=0 --membind=0. This forces new allocations to come from node 0.

 In MPI, that typically means wrapping each rank’s launch with something like:

mpirun -np 4 \
  -genv I_MPI_PIN 1 \
  -genv I_MPI_PIN_DOMAIN numa \
  ...
  numactl --cpunodebind=$NODE_ID --membind=$NODE_ID ./lmp ...

(You’d have to script it or rely on MPI to assign ranks to nodes properly; a sketch of such a wrapper script follows after this list.)

 

2. Ensure “First‐Touch”:

 The standard Linux policy is that the first CPU thread that “touches” (writes) a memory page determines which NUMA node it is allocated on.

 If your application reads/writes big arrays after the rank is already pinned, they should land in the local node.

3. Check or reduce early allocations:

 If a lot of memory is allocated or “touched” before MPI rank pinning occurs, that can pollute the distribution. Some HPC setups do a “dummy initialization” after rank pinning to ensure big arrays are forced local.
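As promised under item 1, here is one hedged way to script the per-rank numactl binding: a tiny wrapper that reads the rank from the environment and picks the matching node. It assumes the Hydra launcher exports PMI_RANK (it normally does) and that ranks 0-3 should map straight onto NUMA nodes 0-3; the file name numa_wrap.sh is just an example:

#!/usr/bin/env bash
# numa_wrap.sh - illustrative per-rank NUMA binding wrapper (not from the original run).
# PMI_RANK is set by Intel MPI's Hydra launcher; fall back to 0 if it is absent.
NODE_ID=$(( ${PMI_RANK:-0} % 4 ))
# Bind both the CPUs and the memory of this rank to its node, then exec the real binary.
exec numactl --cpunodebind="$NODE_ID" --membind="$NODE_ID" "$@"

It would be launched as, for example:

mpirun -np 4 ./numa_wrap.sh ./lmp -var ompTh 10 -in MWE_HP.in

(Make the wrapper executable with chmod +x first. With the wrapper doing the binding, you may want to set I_MPI_PIN=0 so the two mechanisms do not fight over affinity.)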

3. Overall Assessment

 You are successfully pinning each MPI process to a distinct set of cores, and most memory is allocated locally. This is already much better than having everything float across sockets/nodes.

 Seeing some remote pages is normal and often only a small fraction of the total. The overhead penalty is typically minimal unless the remote portion is large or performance‐critical.

 If you need absolute minimal remote access, consider explicit numactl memory binding or thoroughly controlling the timing of allocations.

Bottom Line

 

You’re in a good NUMA state: each LAMMPS rank is pinned to a separate node, and the majority of its memory is local to that node. A small portion of remote pages is common. If you want to minimize that further, either launch with numactl --membind=node or ensure that big data arrays are touched after pinning is fully established.

 

 
