## The most important thing
- Buy a Mellanox-compatible cable: QSFP28, 100Gb/s InfiniBand. The model names are:
  - MCP1600-E001E30: 1m, QSFP28, Passive Copper
  - MCP1600-E002E30: 2m, QSFP28, Passive Copper
  - MCP1600-E003E26: 3m, QSFP28, Passive Copper
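1. Installing OpenSM
OpenSM is the subnet manager: on an InfiniBand network it plays the role that DHCP plus a router play on an Ethernet network. It only needs to run on one computer or one switch in the InfiniBand network.
Its roles are:
- Discovering nodes and switches
- Assigning LIDs (the InfiniBand counterpart of an IP address on Ethernet)
- Building the routing tables
- Topology and path resolution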
sudo apt update
sudo apt install opensm
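Installing the package is not enough on its own: the subnet manager has to be running before the ports receive LIDs. The following is a minimal sketch, assuming that Ubuntu's `opensm` package provides an `opensm` systemd service and that `ibstat` from the `infiniband-diags` package is available (neither is mentioned in the original post). `ib_port_summary` is a hypothetical helper name; it only filters `ibstat` text read from stdin.

```shell
#!/bin/sh
# Assumption: run these by hand on the node that hosts the subnet manager.
#   sudo systemctl enable --now opensm   # start OpenSM now and at boot
#   sudo apt install infiniband-diags    # provides ibstat

# ib_port_summary (hypothetical helper): read `ibstat` output on stdin and
# print "state=<State> lid=<Base lid>" for every port found.
ib_port_summary() {
    awk -F': *' '
        /^[ \t]*State:/    { state = $2 }
        /^[ \t]*Base lid:/ { print "state=" state " lid=" $2 }
    '
}

# Usage on a real machine:
#   ibstat | ib_port_summary
```

A port that stays in `Initializing` with LID 0 usually means that no subnet manager has reached it yet.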
2. Installing mst, mlxburn, and flint
Install these on both nodes.
Download the NVIDIA Firmware Tools (MFT) package, a set of firmware management tools, from:
https://network.nvidia.com/products/adapter-software/firmware-tools/
$ tar -xzf mft-4.31.0-149-x86_64-deb.tgz
$ cd mft-4.31.0-149-x86_64-deb
$ sudo ./install.sh
$ which mst
/usr/bin/mst
$ which mlxconfig
/usr/bin/mlxconfig
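The `which` checks above can be wrapped in a small loop that covers every tool named in this section. This is only a sketch: `missing_tools` is a hypothetical helper, and it assumes that MFT installs all four tools on `PATH`.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print each named command that is
# not found on PATH, one per line; print nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in this section; all are expected to come from the MFT package.
missing=$(missing_tools mst mlxconfig mlxburn flint)
if [ -n "$missing" ]; then
    echo "Not installed: $missing"
else
    echo "All MFT tools found"
fi
```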
$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4119_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
/dev/mst/mt4119_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
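The `/dev/mst/...` paths in this listing are what `mlxconfig -d` expects in the next step. As a sketch, they can be pulled out of the `mst status` text instead of being copied by hand. `mst_devices` is a hypothetical helper that only filters text from stdin; it assumes the output format shown above.

```shell
#!/bin/sh
# mst_devices (hypothetical helper): read `mst status` output on stdin and
# print each /dev/mst/... device path, one per line.
mst_devices() {
    grep -o '/dev/mst/[^[:space:]]*'
}

# Usage on a real machine (requires `sudo mst start` first):
#   sudo mst status | mst_devices
```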
3. Setting the port link type to InfiniBand
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P2=1
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf1 set LINK_TYPE_P2=1
Device #1:
----------
Device type: ConnectX5
Name: MCX556M-ECA_Ax_Bx
Description: ConnectX-5 VPI adapter card with Socket Direct supporting dual-socket server; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; 2x PCIe3.0 x8; ROHS R6
Device: /dev/mst/mt4119_pciconf0
Configurations: Next Boot New
LINK_TYPE_P2 IB(1) IB(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Device #1:
----------
Device type: ConnectX5
Name: MCX556M-ECA_Ax_Bx
Description: ConnectX-5 VPI adapter card with Socket Direct supporting dual-socket server; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; 2x PCIe3.0 x8; ROHS R6
Device: /dev/mst/mt4119_pciconf1
Configurations: Next Boot New
LINK_TYPE_P2 IB(1) IB(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
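With more than one device, the same command has to be repeated for each one. The sketch below only prints the commands so they can be reviewed before running. `link_type_cmds` is a hypothetical helper, and the `-y` flag (answer yes to the prompt) is an addition that the original transcript does not use. Note that `LINK_TYPE_P2` configures port 2 only; port 1 has its own `LINK_TYPE_P1` setting.

```shell
#!/bin/sh
# link_type_cmds (hypothetical helper): print, without running it, one
# mlxconfig command per device that sets the given port to InfiniBand.
# LINK_TYPE values on VPI cards: 1 = InfiniBand, 2 = Ethernet.
link_type_cmds() {
    port=$1
    shift
    for dev in "$@"; do
        printf 'mlxconfig -y -d %s set LINK_TYPE_P%s=1\n' "$dev" "$port"
    done
}

# Print the commands for port 2 of both devices from this post.
link_type_cmds 2 /dev/mst/mt4119_pciconf0 /dev/mst/mt4119_pciconf1

# After reviewing the output, run each line with sudo, reboot, then check:
#   sudo mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep LINK_TYPE
```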