NVIDIA ConnectX-5 Direct Dual Connection

## The most important thing
- Buy Mellanox-compatible cables: QSFP28, 100Gb/s InfiniBand. The model numbers are:
  - MCP1600-E001E30: 1m, QSFP28, Passive Copper
  - MCP1600-E002E30: 2m, QSFP28, Passive Copper
  - MCP1600-E003E26: 3m, QSFP28, Passive Copper


1. Installing OpenSM

 

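OpenSM does for an InfiniBand fabric roughly what DHCP plus a router do for an Ethernet network. It only needs to run on one computer or switch in the InfiniBand network; with a direct connection and no switch, that means one of the two nodes.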

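Its roles are:

  1. Discovering nodes and switches
  2. Assigning LIDs (roughly the counterpart of IP addresses on Ethernet)
  3. Building the routing tables
  4. Topology and path resolution

$ sudo apt update
$ sudo apt install opensm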
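
Once the package is installed, the subnet manager has to be running on one node before the ports can become active. The following is a minimal sketch, not part of the original notes. It assumes the Ubuntu `opensm` package ships a systemd unit named `opensm` and that `ibstat` (from the `infiniband-diags` package) is installed. `count_active_ports` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count ports reported as "State: Active".
# Reads ibstat-style text on stdin, so it can be exercised without hardware.
count_active_ports() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status clean in that case.
    grep -cE '^[[:space:]]*State:[[:space:]]*Active' || true
}

# Assumed usage (unit name and ibstat availability are assumptions):
#   sudo systemctl enable --now opensm    # on ONE node only
#   ibstat | count_active_ports           # on either node
```

A count of 0 after the cable is plugged in and the link type is set to InfiniBand usually suggests that no subnet manager is reachable on the fabric.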

 

 

2. Installing mst, mlxburn, and flint

 

Install these on both nodes.

 

https://network.nvidia.com/products/adapter-software/firmware-tools/

 


 

$ tar -xzf mft-4.31.0-149-x86_64-deb.tgz

$ cd mft-4.31.0-149-x86_64-deb

$ sudo ./install.sh

$ which mst
/usr/bin/mst
$ which mlxconfig
/usr/bin/mlxconfig
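
The `which` checks above can be folded into one reusable check that is easy to run on both nodes. This is only a sketch. `missing_tools` is a hypothetical helper name, and the tool list passed to it is whatever you expect the MFT installer to have provided.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named command that is NOT found on PATH.
# Prints nothing when every tool is present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed usage, with the tools named in this section:
#   missing_tools mst mlxconfig mlxburn flint
```

An empty result means all of the listed tools resolved; any name that is printed was not found and the install step is worth repeating on that node.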

$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

$ sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4119_pciconf1         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:02:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
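
The device paths in this listing are what the later `mlxconfig` commands take as their `-d` argument. A small sketch for pulling them out of the `mst status` text, so they do not have to be copied by hand; `list_mst_devices` is a hypothetical helper name and it relies only on the output layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every /dev/mst/... device path found in
# `mst status`-style text read from stdin, one per line.
list_mst_devices() {
    # Device lines start with the path in the first field; the indented
    # detail lines underneath do not, so they are skipped.
    awk '$1 ~ /^\/dev\/mst\// { print $1 }'
}

# Assumed usage:
#   sudo mst status | list_mst_devices
```

Fed the listing above, this prints `/dev/mst/mt4119_pciconf0` and `/dev/mst/mt4119_pciconf1`, the two PCI functions of the adapter.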

 

3. Setting the port link type to InfiniBand

 

$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P2=1
$ sudo mlxconfig -d /dev/mst/mt4119_pciconf1 set LINK_TYPE_P2=1

Device #1:
----------

Device type:        ConnectX5
Name:               MCX556M-ECA_Ax_Bx
Description:        ConnectX-5 VPI adapter card with Socket Direct supporting dual-socket server; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; 2x PCIe3.0 x8; ROHS R6
Device:             /dev/mst/mt4119_pciconf0

Configurations:                                          Next Boot       New
        LINK_TYPE_P2                                IB(1)                IB(1)

 Apply new Configuration? (y/n) [n] :  y
Applying... Done!
-I- Please reboot machine to load new configurations.

Device #1:
----------

Device type:        ConnectX5
Name:               MCX556M-ECA_Ax_Bx
Description:        ConnectX-5 VPI adapter card with Socket Direct supporting dual-socket server; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; 2x PCIe3.0 x8; ROHS R6
Device:             /dev/mst/mt4119_pciconf1

Configurations:                                          Next Boot       New
        LINK_TYPE_P2                                IB(1)                IB(1)

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
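
After the reboot it is worth confirming that the setting took effect on both devices. A minimal sketch, assuming `mlxconfig -d <device> query` prints the parameter name followed by its value in the same column layout as the `set` output above; `link_type_is_ib` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named parameter appears in
# mlxconfig-style text on stdin with the value IB(1) in the first value column.
#   $1: parameter name, e.g. LINK_TYPE_P2
link_type_is_ib() {
    awk -v param="$1" '
        $1 == param { found = 1; ok = ($2 == "IB(1)") }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Assumed usage, run after "sudo mst start" on each node:
#   for dev in /dev/mst/mt4119_pciconf*; do
#       if sudo mlxconfig -d "$dev" query | link_type_is_ib LINK_TYPE_P2; then
#           echo "$dev: LINK_TYPE_P2 is InfiniBand"
#       else
#           echo "$dev: LINK_TYPE_P2 is not InfiniBand (or was not listed)"
#       fi
#   done
```

The helper fails both when the value differs and when the parameter is not listed at all, so a silent parsing miss is not mistaken for success.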

 
