Running distributed PyTorch with OpenMPI across LAN and virtual LAN nodes

2024-7-9

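I have two Ubuntu nodes, each with PyTorch and heterogeneous OpenMPI built from source. Both can reach each other over passwordless SSH, and both mount a shared NFS directory (home/nvidia/shared) containing a simple PyTorch script (distmpi.py) that is meant to be run through OpenMPI.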

Node-1 (XPS): a desktop PC with IP 192.168.201.23 on LAN network interface enp4s0.

Node-2 (hetero): a VM in OpenStack with virtual IP 11.11.11.21 and floating IP 192.168.200.151 on vLAN interface ens3.
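
To make the addressing concrete, here is a minimal sketch using Python's standard `ipaddress` module. The /24 prefix length is an assumption (the post does not state the netmasks), and the dictionary labels are illustrative; the addresses themselves are the ones listed above.

```python
import ipaddress

# Addresses taken from the post; the /24 prefix is an assumption.
ASSUMED_PREFIX = 24
addresses = {
    "xps (enp4s0)": "192.168.201.23",
    "hetero virtual IP (ens3)": "11.11.11.21",
    "hetero floating IP": "192.168.200.151",
}


def network_of(ip, prefix=ASSUMED_PREFIX):
    """Return the network that contains ip for the given prefix length."""
    return ipaddress.ip_interface(f"{ip}/{prefix}").network


networks = {name: network_of(ip) for name, ip in addresses.items()}
for name, net in networks.items():
    print(f"{name}: {net}")
```

Under that assumption the three addresses fall into three different networks, so the address configured on the VM (11.11.11.21) is not on the same network as the desktop's LAN address.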

When I launch mpirun from XPS to run 2 processes (one on 192.168.201.23, the other on 192.168.200.151), I get the following error:

(torch) nvidia@xps:~$ mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 torch/bin/python shared/distmpi.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xps
  Local PID:           7113
  Peer hostname:       192.168.200.151 ([[55343,1],1])
  Source IP of socket: 192.168.200.151
  Known IPs of peer:   
    11.11.11.21
--------------------------------------------------------------------------
[xps][[55343,1],0][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 11.11.11.21 failed: Connection timed out (110)
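
The message above can be read as a simple membership test: the peer advertised one set of addresses, and the connection arrived from an address outside that set. The sketch below is a simplified model of that comparison, not Open MPI's actual code. The function name `is_expected_source` is hypothetical, and the values are copied from the error output.

```python
def is_expected_source(source_ip, known_peer_ips):
    """Return True if the socket's source IP is one the peer advertised."""
    return source_ip in set(known_peer_ips)


# Values copied from the error message.
known_ips_of_peer = ["11.11.11.21"]
source_ip_of_socket = "192.168.200.151"

if not is_expected_source(source_ip_of_socket, known_ips_of_peer):
    print(f"unexpected source {source_ip_of_socket}; known: {known_ips_of_peer}")
```

A plausible reading is that the VM advertises only 11.11.11.21, while xps sees the connection arrive from the floating IP 192.168.200.151. That would be consistent with both halves of the output: the dropped inbound connection and the timed-out connect() to 11.11.11.21.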

For reference, here is the Python script distmpi.py:

#!/usr/bin/env python
import os
import socket
import torch
import torch.distributed as dist


def run(rank, size):
    """Ring exchange: send to the next rank, receive from the previous one."""
    tensor = torch.zeros(size)
    print(f"I am {rank} of {size} with tensor {tensor}")

    # Increment the tensor before sending it on.
    tensor += 1

    # Post a non-blocking send to the next rank (wrapping around to 0) so
    # that every rank can reach its receive; blocking sends issued on all
    # ranks at once could deadlock.
    send_req = dist.isend(tensor=tensor, dst=(rank + 1) % size)

    # Receive from the previous rank into a separate buffer.
    received = torch.zeros(size)
    dist.recv(tensor=received, src=(rank - 1) % size)
    send_req.wait()

    print(f"Rank {rank} has data {received[0]}")


def init_processes(rank, size, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, run, backend='mpi')
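
For comparison with whatever the job prints once the connection problem is resolved, the ring exchange in `run` can be simulated in a single process without MPI. This is only a sketch in plain Python: `simulate_ring` is a hypothetical helper, and lists stand in for tensors.

```python
def simulate_ring(size):
    """Simulate one round of the ring: each rank sends to (rank + 1) % size."""
    # Every rank starts from zeros(size) and adds 1 before sending, as in run().
    outgoing = [[1.0] * size for _ in range(size)]
    # Rank r receives what rank (r - 1) % size sent.
    return [outgoing[(rank - 1) % size] for rank in range(size)]


for rank, data in enumerate(simulate_ring(2)):
    print(f"Rank {rank} has data {data[0]}")
```

With two processes, each rank should end up holding the value 1 received from the other rank.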

Regards.
