Running distributed PyTorch with OpenMPI on LAN and virtual LAN nodes

I have two Ubuntu nodes with distributed PyTorch and heterogeneous OpenMPI installed from source. Both can reach each other over passwordless SSH and share an NFS directory (home/nvidia/shared) that contains a simple PyTorch script (distmpi.py) to be run through OpenMPI.
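For completeness, a quick way to confirm that the from-source PyTorch build on each node actually exposes the MPI backend (an illustrative snippet, not part of distmpi.py):

import torch.distributed as dist

# True only if this PyTorch build was compiled against an MPI installation
print(dist.is_mpi_available())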

Node-1 (xps): a desktop machine with IP 192.168.201.23 on the LAN network interface enp4s0.
Node-2 (heterogeneous): a virtual machine in OpenStack with virtual IP 11.11.11.21 and floating IP 192.168.200.151 on the vLAN interface ens3.

When mpirun is launched from the XPS to run 2 processes (one on 192.168.201.23 and the other on 192.168.200.151), the following error occurs:

(torch) nvidia@xps:~$ mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 torch/bin/python shared/distmpi.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xps
  Local PID:           7113
  Peer hostname:       192.168.200.151 ([[55343,1],1])
  Source IP of socket: 192.168.200.151
  Known IPs of peer:   
    11.11.11.21
--------------------------------------------------------------------------
[xps][[55343,1],0][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 11.11.11.21 failed: Connection timed out (110)
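For reference, Open MPI's TCP components can be restricted to specific interfaces or subnets through the standard MCA parameters oob_tcp_if_include and btl_tcp_if_include. The invocation below is only a sketch of that option; the subnet values are assumptions derived from the addresses above, and whether this actually resolves the floating-IP/virtual-IP mismatch reported here is unclear to me:

mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 \
    --mca oob_tcp_if_include 192.168.200.0/24,192.168.201.0/24 \
    --mca btl_tcp_if_include 192.168.200.0/24,192.168.201.0/24 \
    torch/bin/python shared/distmpi.py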

Please see the Python script, distmpi.py, for reference:

#!/usr/bin/env python
import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"I am {rank} of {size} with tensor {tensor}")

    # increment the tensor before passing it along
    tensor += 1

    # send the tensor to the next rank (the last rank wraps around to rank 0)
    if rank == size - 1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank + 1)

    # receive the tensor from the previous rank (rank 0 receives from the last rank)
    if rank == 0:
        dist.recv(tensor=tensor, src=size - 1)
    else:
        dist.recv(tensor=tensor, src=rank - 1)

    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, run, backend='mpi')
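
As a side note on the script itself: every rank posts its blocking send() before its recv(), which relies on MPI buffering the small message; a variant using the non-blocking isend()/irecv() calls from torch.distributed avoids that assumption. A minimal sketch (the function name run_nonblocking is mine, same MPI backend assumed):

def run_nonblocking(rank, size):
    # same ring exchange as run(), but post the send and the receive together
    send_buf = torch.ones(size)
    recv_buf = torch.zeros(size)
    dst = 0 if rank == size - 1 else rank + 1   # next rank, wrapping to 0
    src = size - 1 if rank == 0 else rank - 1   # previous rank, wrapping to size-1
    send_req = dist.isend(tensor=send_buf, dst=dst)
    recv_req = dist.irecv(tensor=recv_buf, src=src)
    send_req.wait()
    recv_req.wait()
    print(f"Rank {rank} received {recv_buf[0]} from rank {src}")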

Regards.
