LAN および仮想 LAN ノード間で OpenMPI を使用して分散 Pytorch を実行する

2024-7-9 • tag-icon

LAN および仮想 LAN ノード間で OpenMPI を使用して分散 Pytorch を実行する

私は 2 つの Ubuntu ノードを持っており、分散 PyTorch と異種 OpenMPI をソースからインストールしています。どちらもパスワードなしの SSH で相互に接続でき、OpenMPI で実行されるシンプルな PyTorch スクリプト (distmpi.py) を含む共有 NFS ディレクトリ (home/nvidia/shared) があります。

ノード 1 (xpa): LANネットワークインターフェースenp4s0上のIP 192.168.201.23を持つデスクトップPC（IPアドレス） ノード2（ヘテロ）: OpenStack 内の VM で、vLAN インターフェース ens3 上の仮想 IP 11.11.11.21 とフローティング IP 192.168.200.151 (IPアドレス、ifconfig）

mpirun を起動して XPS から 2 つのプロセス (1 つは 192.168.201.23、もう 1 つは 192.168.200.151) を実行すると、次のエラーが発生します。

(torch) nvidia@xps:~$ mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 torch/bin/python shared/distmpi.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xps
  Local PID:           7113
  Peer hostname:       192.168.200.151 ([[55343,1],1])
  Source IP of socket: 192.168.200.151
  Known IPs of peer:   
    11.11.11.21
--------------------------------------------------------------------------
[xps][[55343,1],0][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 11.11.11.21 failed: Connection timed out (110)

参考までに、Python スクリプト (例: distmpi.py) をご覧ください。

#!/usr/bin/env python
import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"I am {rank} of {size} with tensor {tensor}")

    # incrementing the old tensor
    tensor += 1

    # sending tensor to next rank
    if rank == size-1:
       dist.send(tensor=tensor, dst=0)
    else:
       dist.send(tensor=tensor, dst=rank+1)

    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size-1)
    else:
        dist.recv(tensor=tensor, src=rank-1)

    print('Rank ', rank, ' has data ', tensor[0])
    pass


def init_processes(rank, size, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, run, backend='mpi')

よろしくお願いいたします。

関連情報