Run distributed PyTorch with OpenMPI on LAN and virtual LAN nodes

2024-7-9

I have two Ubuntu nodes with distributed PyTorch and heterogeneous OpenMPI, both installed from source. The nodes can reach each other over passwordless SSH and share an NFS directory (home/nvidia/shared) that contains a simple PyTorch script (distmpi.py) to be run under OpenMPI.

Node-1 (xps): a desktop PC with IP 192.168.201.23 on the LAN network interface enp4s0 (ip address)

Node-2 (hetero): a VM in OpenStack with virtual IP 11.11.11.21 on the vLAN interface ens3 and floating IP 192.168.200.151 (ip address, ifconfig)
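
Since hetero has two addresses, it can help to check which of them accepts TCP connections from xps. The following is only a minimal sketch using the Python standard library; `can_connect` is a hypothetical helper, and probing port 22 is an assumption based on SSH already working between the nodes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

Run on xps, `can_connect("192.168.200.151", 22)` and `can_connect("11.11.11.21", 22)` would indicate which of the two addresses is reachable from the desktop.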

The following error occurs when mpirun is launched from xps to run 2 processes (one on 192.168.201.23 and the other on 192.168.200.151):

(torch) nvidia@xps:~$ mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 torch/bin/python shared/distmpi.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xps
  Local PID:           7113
  Peer hostname:       192.168.200.151 ([[55343,1],1])
  Source IP of socket: 192.168.200.151
  Known IPs of peer:   
    11.11.11.21
--------------------------------------------------------------------------
[xps][[55343,1],0][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 11.11.11.21 failed: Connection timed out (110)
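
The message above reports that the peer advertised only 11.11.11.21, while the connection arrived from 192.168.200.151. This is likely because an OpenStack floating IP is usually translated by the virtual router, so the VM itself only sees its fixed address. The sketch below merely illustrates that comparison; `is_expected_source` is a hypothetical name, and this is not Open MPI's actual implementation.

```python
import ipaddress
from typing import Iterable


def is_expected_source(source_ip: str, known_ips: Iterable[str]) -> bool:
    """Return True if the socket's source IP is one of the peer's advertised IPs."""
    src = ipaddress.ip_address(source_ip)
    return any(src == ipaddress.ip_address(ip) for ip in known_ips)


# Values taken from the error message above.
known_ips_of_peer = ["11.11.11.21"]
print(is_expected_source("192.168.200.151", known_ips_of_peer))  # False
print(is_expected_source("11.11.11.21", known_ips_of_peer))      # True
```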

For reference, here is the Python script, distmpi.py:

#!/usr/bin/env python
import os
import socket
import torch
import torch.distributed as dist


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"I am {rank} of {size} with tensor {tensor}")

    # incrementing the old tensor
    tensor += 1

    # sending tensor to next rank
    if rank == size-1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank+1)

    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size-1)
    else:
        dist.recv(tensor=tensor, src=rank-1)

    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, run, backend='mpi')
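
The two if/else branches in run() pick the next and previous rank in a ring. As a side note, the same mapping can be written with modulo arithmetic. `ring_neighbors` is a hypothetical helper for illustration only and is not part of torch.distributed.

```python
from typing import Tuple


def ring_neighbors(rank: int, size: int) -> Tuple[int, int]:
    """Return (next_rank, previous_rank) for a ring of `size` processes."""
    next_rank = (rank + 1) % size
    prev_rank = (rank - 1) % size  # Python's % wraps -1 around to size - 1
    return next_rank, prev_rank


print(ring_neighbors(0, 2))  # (1, 1)
print(ring_neighbors(1, 2))  # (0, 0)
```

With two processes, as in the mpirun command above, each rank sends to and receives from the other one.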

Regards.
