Mellanox ConnectX

Checking the InfiniBand connection

Checking device information

z641@z641:~$ ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				16.35.4030
	node_guid:			0c42:a103:0017:3b2e
	sys_image_guid:			0c42:a103:0017:3b28
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000023
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				16.35.4030
	node_guid:			0c42:a103:0017:3b2f
	sys_image_guid:			0c42:a103:0017:3b28
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000023
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx5_2
	transport:			InfiniBand (0)
	fw_ver:				16.35.4030
	node_guid:			0c42:a103:0017:3b28
	sys_image_guid:			0c42:a103:0017:3b28
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000023
	phys_port_cnt:			1
		port:	1
			state:			PORT_INIT (2)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx5_3
	transport:			InfiniBand (0)
	fw_ver:				16.35.4030
	node_guid:			0c42:a103:0017:3b29
	sys_image_guid:			0c42:a103:0017:3b28
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000023
	phys_port_cnt:			1
		port:	1
			state:			PORT_INIT (2)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand
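
With four devices the full listing is long, so it can help to reduce it to one line per port. The following is a minimal sketch, assuming the `ibv_devinfo` output keeps the `hca_id:` / `port:` / `state:` layout shown above; `port_states` is a hypothetical helper name, not part of rdma-core.

```shell
#!/bin/sh
# Print "<hca_id> port <n>: <state>" for every port found in
# ibv_devinfo-style text read from stdin.
# port_states is a hypothetical helper name used only for this sketch.
port_states() {
    awk '
        $1 == "hca_id:" { hca = $2 }    # remember the current device
        $1 == "port:"   { port = $2 }   # remember the current port number
        $1 == "state:"  { printf "%s port %s: %s\n", hca, port, $2 }
    '
}

# Only query the hardware when ibv_devinfo is actually installed.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | port_states
fi
```

For the listing above this would print `PORT_DOWN` for mlx5_0 and mlx5_1 and `PORT_INIT` for mlx5_2 and mlx5_3.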

 

InfiniBand status

z640@z640:~$ ibstat
CA 'mlx5_0'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.4030
	Hardware version: 0
	Node GUID: 0x0c42a10300173ae6
	System image GUID: 0x0c42a10300173ae0
	Port 1:
		State: Down
		Physical state: LinkUp
		Rate: 100
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa641e848
		Port GUID: 0x0c42a10300173ae6
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.4030
	Hardware version: 0
	Node GUID: 0x0c42a10300173ae7
	System image GUID: 0x0c42a10300173ae0
	Port 1:
		State: Down
		Physical state: LinkUp
		Rate: 100
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa641e848
		Port GUID: 0x0c42a10300173ae7
		Link layer: InfiniBand
CA 'mlx5_2'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.4030
	Hardware version: 0
	Node GUID: 0x0c42a10300173ae0
	System image GUID: 0x0c42a10300173ae0
	Port 1:
		State: Initializing
		Physical state: LinkUp
		Rate: 100
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa651e848
		Port GUID: 0x0c42a10300173ae0
		Link layer: InfiniBand
CA 'mlx5_3'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.4030
	Hardware version: 0
	Node GUID: 0x0c42a10300173ae1
	System image GUID: 0x0c42a10300173ae0
	Port 1:
		State: Initializing
		Physical state: LinkUp
		Rate: 100
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa651e848
		Port GUID: 0x0c42a10300173ae1
		Link layer: InfiniBand
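
The fields that matter for this check are State, Physical state, Base lid and SM lid. The sketch below condenses `ibstat`-style text to those fields, one line per port. It assumes the field labels and their order are exactly as in the output above; `ibstat_summary` is a hypothetical helper name, not part of infiniband-diags.

```shell
#!/bin/sh
# Condense ibstat-style text from stdin to one line per port.
# ibstat_summary is a hypothetical helper name used only for this sketch.
ibstat_summary() {
    # q holds a single quote so the awk program itself needs none.
    awk -v q="'" '
        $1 == "CA" && NF == 2          { ca = $2; gsub(q, "", ca) }
        $1 == "Port" && NF == 2        { port = $2; sub(/:$/, "", port) }
        $1 == "State:"                 { state = $2 }
        $1 == "Physical"               { phys = $3 }
        $1 == "Base" && $2 == "lid:"   { base = $3 }
        $1 == "SM" && $2 == "lid:"     { sm = $3 }
        # "Link layer" is the last line of each port block in this output.
        $1 == "Link" && $2 == "layer:" {
            printf "%s port %s: state=%s phys=%s base_lid=%s sm_lid=%s\n",
                   ca, port, state, phys, base, sm
        }
    '
}

# Only query the hardware when ibstat is actually installed.
if command -v ibstat >/dev/null 2>&1; then
    ibstat | ibstat_summary
fi
```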

 

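SM lid stands for Subnet Manager Local IDentifier.

A Base lid of 65535 is the default value shown when no LID has been assigned to the port.

Base lid is the LID of the port itself, and SM lid is the LID of the port on which the Subnet Manager is running. An SM lid of 0 therefore means that the port has not found a Subnet Manager.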

In an environment where the SM is running normally, the output should look like this:

Port 1:
    State: Active
    Base lid: 3
    SM lid: 1
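
The reading of Base lid and SM lid described here can be written down as a small check. This is a sketch only: `lid_verdict` is a hypothetical helper, and it treats 65535 as "no LID assigned" and an SM lid of 0 as "no Subnet Manager seen", which matches the outputs in this post but is not a complete port-state diagnosis.

```shell
#!/bin/sh
# Interpret a Base lid / SM lid pair as reported by ibstat.
# lid_verdict is a hypothetical helper name used only for this sketch.
lid_verdict() {
    base_lid=$1
    sm_lid=$2
    if [ "$sm_lid" -eq 0 ]; then
        echo "no subnet manager seen (SM lid is 0)"
    elif [ "$base_lid" -eq 65535 ]; then
        echo "port has no LID assigned"
    else
        echo "LID $base_lid assigned by SM at LID $sm_lid"
    fi
}

lid_verdict 65535 0   # prints: no subnet manager seen (SM lid is 0)
lid_verdict 3 1       # prints: LID 3 assigned by SM at LID 1
```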

 

Is OpenSM installed?

z640@z640:~$ which opensm
/usr/sbin/opensm
z640@z640:~$ systemctl status opensm
○ opensm.service - Starts the OpenSM InfiniBand fabric Subnet Managers
     Loaded: loaded (/usr/lib/systemd/system/opensm.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition unmet at Sat 2025-07-05 16:59:32 UTC; 21h ago
             └─ ConditionPathExists=/sys/class/infiniband_mad/abi_version was not met
       Docs: man:opensm(8)

Jul 05 16:59:32 z640 systemd[1]: opensm.service - Starts the OpenSM InfiniBand fabric Subnet Managers was skipped because of an unmet condition check (ConditionPathExists=/sys/class/infiniband_mad/abi_version).
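
The unit was skipped because the path in its `ConditionPathExists=` line is missing, so that path can be tested directly. The sketch below does so; `check_umad` is a hypothetical helper. The hint it prints assumes the path is provided by the `ib_umad` kernel module, as on mainline kernels, which has not been verified on this machine.

```shell
#!/bin/sh
# Report whether the file that opensm.service waits for exists.
# check_umad is a hypothetical helper; its default path is the one named in
# the ConditionPathExists= line of the systemctl output.
check_umad() {
    path=${1:-/sys/class/infiniband_mad/abi_version}
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

if ! check_umad; then
    # Assumption: ib_umad is the module that creates this sysfs entry here.
    echo "try: sudo modprobe ib_umad && sudo systemctl start opensm"
fi
```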

 

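OpenSM is installed and enabled, but systemd skipped starting it because /sys/class/infiniband_mad/abi_version does not exist. This reportedly happens because the Mellanox OFED kernel modules do not support the currently running Linux kernel. When installing Mellanox OFED, an installer option apparently has to be set so that it supports the current Linux kernel.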
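
One way to test this explanation is to look for the module file under the running kernel's module directory. The sketch below assumes the usual `/lib/modules/<release>` layout and a loadable (not built-in) `ib_umad` module; `module_for_kernel` is a hypothetical helper. The MLNX_OFED installer has an `--add-kernel-support` option for rebuilding its packages against the running kernel; whether that is the option meant above is an assumption.

```shell
#!/bin/sh
# Look for a kernel module file built for a given kernel release.
# module_for_kernel is a hypothetical helper name used only for this sketch.
#   $1 module name, $2 kernel release (default: running kernel),
#   $3 module root (default: /lib/modules)
module_for_kernel() {
    module=$1
    release=${2:-$(uname -r)}
    root=${3:-/lib/modules}
    # Matches ib_umad.ko as well as compressed forms such as ib_umad.ko.zst.
    found=$(find "$root/$release" -name "${module}.ko*" 2>/dev/null | head -n 1)
    if [ -n "$found" ]; then
        echo "$found"
    else
        echo "no $module module for kernel $release"
        return 1
    fi
}

# A missing module here would be consistent with the explanation above.
# Possible fix (assumption, run from the MLNX_OFED directory):
#   sudo ./mlnxofedinstall --add-kernel-support
module_for_kernel ib_umad || true
```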

 
