DHCP failover with PXE

I'm trying to set up DHCP failover with PXE booting allowed on one of the DHCP servers. As the DHCP specification requires, I set up separate pools for "normal" DHCP and for PXE booting. My failover setup works fine, but the part of the DHCP configuration that should answer PXE requests no longer works.

Background: I recently upgraded to AlmaLinux 9 (from CentOS 7), which ships ISC DHCP 4.4. In the older setup I had no DHCP failover and allowed PXE booting from the entire pool. Because our site has a history of hardware failures, I want to set up DHCP failover.

For the purposes of this setup, let's call the system that should answer PXE requests the "primary" DHCP server. Here is a fragment of /etc/dhcp/dhcpd.conf from that server. Note that I set up a separate pool just to handle the PXE/BOOTP queries. (Please excuse the didactic tone of the comments; they are reminders for myself as I go about my sysadmin duties.)

authoritative; # Send out acknowledgements to DHCP client queries.

failover peer "dhcp-failover" {
  primary; # declare this to be the primary server
  address 10.4.7.9;
  port 647;
  peer address 10.4.7.210;
  peer port 647;
  # How many seconds to wait before we assume that the other has failed.
  max-response-delay 30;
  # How many BNDUPD messages to send before receiving BNDACK.
  max-unacked-updates 10;
  # How many seconds to wait before disabling load balancing.
  load balance max seconds 3;
  # Maximum Client Lead Time = How long a lease may be renewed
  # without contacting the other DHCP peer.
  mclt 1800;
  # The split between primary and secondary. 128 means a
  # 50% split between peers; 256 means the primary handles
  # everything until it fails.
  split 128;
}

# This is the primary DHCP server. Respond to BOOTP requests.
allow booting;
allow bootp;

option domain-name "company.example.com";
option time-offset -18000; # Eastern Standard Time

# Is this a DHCP query (as opposed to a BOOTP query)?
class "dhcp" {
      match if exists dhcp-message-type;
}
class "pxe" {
      match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
}

subnet 10.4.0.0 netmask 255.255.0.0 {
    default-lease-time 86400; # one day (in seconds)
    option subnet-mask 255.255.0.0;
    option broadcast-address 10.4.255.255;
    option routers 10.4.0.1;
    option domain-name-servers 10.4.7.7, 10.4.7.29; 
    option domain-name "company.example.com";
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 10.4.7.105, 10.4.7.7, 10.4.7.29;

    pool {
         failover peer "dhcp-failover";
         deny dynamic bootp clients;
         deny members of "pxe";
         range 10.4.45.1 10.4.45.250; # DHCP pool on private network
    }
    # A separate pool for BOOTP services.
    pool {
         range dynamic-bootp 10.4.45.251 10.4.45.255; # DHCP pool on private network
         allow dynamic bootp clients;
         deny members of "dhcp";
         allow members of "pxe";
         next-server 10.4.7.9;    # On which system the bootp filename is located.

         if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {
            if substring (option vendor-class-identifier, 15, 5) = "00007" {
               log (info, "UEFI PXE Boot - private network");
               filename "pxelinux/grubx64.efi"; # The file to load for EFI systems.
            } else {
               log (info, "BIOS PXE Boot - private network");
               filename "pxelinux.0"; # The file to load via bootp for BIOS systems.
            }
         }
    }
}
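The `split` comment above can be sanity-checked with a small model. As I read the dhcpd.conf man page, the server hashes each client's identifier into a bucket from 0 to 255, and the primary is responsible for the first `split` buckets, so legal values run from 0 to 256. The sketch below assumes that reading and uses a hypothetical stand-in hash (SHA-256 of the client ID), not the hash dhcpd actually uses, so it only illustrates the bucket counting.

```python
import hashlib


def bucket(client_id: bytes) -> int:
    """Hypothetical stand-in hash: NOT dhcpd's real load-balancing hash."""
    return hashlib.sha256(client_id).digest()[0]  # a value in 0..255


def primary_is_responsible(client_id: bytes, split: int) -> bool:
    """The primary owns buckets 0 .. split-1; the secondary owns the rest."""
    if not 0 <= split <= 256:
        raise ValueError("split must be between 0 and 256")
    return bucket(client_id) < split


def primary_bucket_count(split: int) -> int:
    """How many of the 256 hash buckets the primary answers for."""
    return sum(1 for b in range(256) if b < split)


for split in (0, 128, 255, 256):
    print(split, primary_bucket_count(split))
```

Under this reading, `split 255` still leaves one bucket in 256 to the secondary; only `split 256` hands every client to the primary while both peers are up.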

This is /etc/dhcp/dhcpd.conf on the failover/secondary/non-PXE server:

authoritative; # Send out acknowledgements to DHCP client queries. 

failover peer "dhcp-failover" {
  secondary; # declare this to be the secondary server
  address 10.4.7.210;
  port 647;
  peer address 10.4.7.9;
  peer port 647;
  # How many seconds to wait before we assume that the other has failed.
  max-response-delay 30;
  # How many BNDUPD messages to send before receiving BNDACK.
  max-unacked-updates 10;
  # How many seconds to wait before disabling load balancing.
  load balance max seconds 3;
}

# Make sure that this failover DHCP server does _not_
# respond to bootp.
deny bootp;

option domain-name "company.example.com";
option time-offset -18000; # Eastern Standard Time

# Is this a DHCP query (as opposed to a BOOTP query)?
class "dhcp" {
      match if exists dhcp-message-type;
}
class "pxe" {
      match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
}

subnet 10.4.0.0 netmask 255.255.0.0 {
    default-lease-time 86400; # one day (in seconds)
    option subnet-mask 255.255.0.0;
    option broadcast-address 10.4.255.255;
    option routers 10.4.0.1;
    option domain-name-servers 10.4.7.7, 10.4.7.29; 
    option domain-name "company.example.com";
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 10.4.7.105, 10.4.7.7, 10.4.7.29;

    # Note that there are a few IP addresses in the range of the primary
    # server that are not included here. This is for BOOTP, which is
    # not handled by the secondary server.
    pool {
         failover peer "dhcp-failover";
         deny dynamic bootp clients;     
         deny members of "pxe";
         range 10.4.45.1 10.4.45.250; # DHCP pool on private network
    }
}
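The two `failover peer` blocks have to mirror each other. A minimal Python sketch of that cross-check, using hand-transcribed dictionaries of the blocks above (the dictionary layout and `check_peers` are hypothetical helpers, not part of any dhcpd tooling). It also encodes my reading of the dhcpd.conf man page that `mclt` and `split` belong on the primary only, which the two configurations already follow.

```python
# Hand-transcribed copies of the two "failover peer" blocks (hypothetical layout).
primary = {
    "role": "primary",
    "address": "10.4.7.9", "port": 647,
    "peer address": "10.4.7.210", "peer port": 647,
    "max-response-delay": 30, "max-unacked-updates": 10,
    "load balance max seconds": 3,
    "mclt": 1800, "split": 128,
}
secondary = {
    "role": "secondary",
    "address": "10.4.7.210", "port": 647,
    "peer address": "10.4.7.9", "peer port": 647,
    "max-response-delay": 30, "max-unacked-updates": 10,
    "load balance max seconds": 3,
}


def check_peers(p: dict, s: dict) -> list:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    if (p["role"], s["role"]) != ("primary", "secondary"):
        problems.append("roles must be one primary and one secondary")
    # Each side's own address must be the other side's peer address.
    if p["address"] != s["peer address"] or s["address"] != p["peer address"]:
        problems.append("address / peer address do not mirror each other")
    if p["port"] != s["peer port"] or s["port"] != p["peer port"]:
        problems.append("port / peer port do not mirror each other")
    # mclt and split are set on the primary only.
    for key in ("mclt", "split"):
        if key not in p:
            problems.append(f"{key} missing on primary")
        if key in s:
            problems.append(f"{key} must not be set on secondary")
    return problems


print(check_peers(primary, secondary))
```

For the two blocks as posted, the list of problems is empty, so the failover pairing itself does not look like the cause of the PXE trouble.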

I know I'm overdoing it with the "dhcp" and "pxe" classes. I added them while trying to fix the problem. They had no effect, except that they introduced the "peer holds all free leases" log messages shown below.
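How the two classes interact with the pools can be sketched in Python. This is a simplified model of pool eligibility as I read the dhcpd.conf man page: a client must match at least one entry of a non-empty permit (`allow`) list and no entry of the prohibit (`deny`) list. It ignores failover state, free-address accounting and declaration order, so it cannot confirm or rule out the ordering explanation in the answer below. One assumption worth stating: the PXE ROM sends DHCPDISCOVER (as the logs below show), so it carries a DHCP message type and, in this model, belongs to both "dhcp" and "pxe".

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    is_bootp: bool   # True if the request carries no DHCP message type
    classes: set     # classes the client matched, e.g. {"dhcp", "pxe"}


@dataclass
class Pool:
    name: str
    allow: list = field(default_factory=list)  # permit list
    deny: list = field(default_factory=list)   # prohibit list


def matches(client: Client, permit: str) -> bool:
    # Only the two kinds of permit used in this thread are modelled.
    if permit == "dynamic bootp clients":
        return client.is_bootp
    if permit.startswith("members of "):
        return permit[len("members of "):].strip('"') in client.classes
    raise ValueError(f"unmodelled permit: {permit}")


def eligible(client: Client, pool: Pool) -> bool:
    # Permit list: if present, the client must match at least one entry.
    if pool.allow and not any(matches(client, p) for p in pool.allow):
        return False
    # Prohibit list: the client must match none of the entries.
    return not any(matches(client, p) for p in pool.deny)


# The two pools on the original primary, plus the PXE pool from the answer.
failover_pool = Pool("failover", deny=["dynamic bootp clients", 'members of "pxe"'])
pxe_pool_original = Pool("pxe (original)",
                         allow=["dynamic bootp clients", 'members of "pxe"'],
                         deny=['members of "dhcp"'])
pxe_pool_answer = Pool("pxe (answer)",
                       allow=["dynamic bootp clients", 'members of "pxe"'])

pxe_client = Client(is_bootp=False, classes={"dhcp", "pxe"})
for pool in (failover_pool, pxe_pool_original, pxe_pool_answer):
    print(pool.name, eligible(pxe_client, pool))
```

In this model neither pool on the original primary admits the PXE client: the failover pool denies "pxe", and the PXE pool's `deny members of "dhcp"` also catches it. The PXE pool in the answer, which has no `deny members of "dhcp"` line, does admit it.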

This is what I see in the logs of the "primary" server. Note that 52:54:00:31:f2:7f is the MAC address of a test system that I configured to try booting via PXE before it "gives up" and boots from disk.

Sep  8 14:20:46 dhcpd dhcpd[17922]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:49 dhcpd dhcpd[17922]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:57 dhcpd dhcpd[17922]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:21:13 dhcpd dhcpd[17922]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases

This comes from the log on the "secondary" server. It is consistent with the roughly one-minute delay between the client's first boot attempt, while it tries to find a PXE server, and the point at which it gives up, boots its operating system from disk, and acquires a DHCP address the usual way.

Sep  8 14:20:46 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:46 dhcpdsec dhcpd[67768]: bind update on 10.4.45.183 from dhcp-failover rejected: incoming update is less critical than outgoing update
Sep  8 14:20:49 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:57 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:21:13 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:22:03 dhcpdsec dhcpd[67768]: DHCPREQUEST for 10.4.45.183 from 52:54:00:31:f2:7f via enp7s0
Sep  8 14:22:04 dhcpdsec dhcpd[67768]: DHCPACK on 10.4.45.183 to 52:54:00:31:f2:7f via enp7s0
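The timing described above can be read directly from these log lines. A small Python sketch that extracts the DHCPDISCOVER timestamps for the test MAC, prints the gaps between retries, and prints the time until the first DHCPREQUEST. The helper name `events` is hypothetical, and since syslog omits the year, a dummy year is supplied (only differences matter).

```python
from datetime import datetime

MAC = "52:54:00:31:f2:7f"

# Lines copied from the secondary server's log above.
LOG = """\
Sep  8 14:20:46 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:46 dhcpdsec dhcpd[67768]: bind update on 10.4.45.183 from dhcp-failover rejected: incoming update is less critical than outgoing update
Sep  8 14:20:49 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:20:57 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:21:13 dhcpdsec dhcpd[67768]: DHCPDISCOVER from 52:54:00:31:f2:7f via enp7s0: peer holds all free leases
Sep  8 14:22:03 dhcpdsec dhcpd[67768]: DHCPREQUEST for 10.4.45.183 from 52:54:00:31:f2:7f via enp7s0
Sep  8 14:22:04 dhcpdsec dhcpd[67768]: DHCPACK on 10.4.45.183 to 52:54:00:31:f2:7f via enp7s0
"""


def events(log: str, mac: str, kind: str) -> list:
    """Timestamps of log lines of the given message type for one MAC."""
    stamps = []
    for line in log.splitlines():
        if f"{kind} " in line and mac in line:
            # The first 15 characters are "Mon dd HH:MM:SS"; add a dummy year.
            stamps.append(datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S"))
    return stamps


discovers = events(LOG, MAC, "DHCPDISCOVER")
gaps = [int((b - a).total_seconds()) for a, b in zip(discovers, discovers[1:])]
first_request = events(LOG, MAC, "DHCPREQUEST")[0]
print(gaps)
print(int((first_request - discovers[0]).total_seconds()))
```

For these lines the PXE retries are 3, 8 and 16 seconds apart, and the DHCPREQUEST arrives 77 seconds after the first DHCPDISCOVER, which matches the delay described above.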

After flailing around in earlier tests, I confirmed that the value of substring (option vendor-class-identifier, 0, 9) is indeed PXEClient.
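The offsets used in the configuration can be checked outside dhcpd. A Python sketch that mimics `substring(data, offset, length)` as described in dhcp-eval(5): an offset past the end yields an empty string, and a length that runs off the end is truncated, which is exactly how Python slicing behaves. The vendor class strings are assumed examples of the usual `PXEClient:Arch:xxxxx:UNDI:yyyzzz` layout, not values taken from this thread.

```python
def substring(data: str, offset: int, length: int) -> str:
    """Mimic dhcpd's substring(): slicing already truncates at the end."""
    return data[offset:offset + length]


def boot_filename(vendor_class: str):
    """Mirror the nested if/else in the primary's original PXE pool."""
    if substring(vendor_class, 0, 9) != "PXEClient":
        return None  # not a PXE client: no filename is set
    if substring(vendor_class, 15, 5) == "00007":
        return "pxelinux/grubx64.efi"
    return "pxelinux.0"


UEFI_VC = "PXEClient:Arch:00007:UNDI:003016"  # assumed example (UEFI client)
BIOS_VC = "PXEClient:Arch:00000:UNDI:002001"  # assumed example (BIOS client)

print(substring(UEFI_VC, 0, 9), substring(UEFI_VC, 15, 5))
print(boot_filename(UEFI_VC), boot_filename(BIOS_VC), boot_filename("MSFT 5.0"))
```

"PXEClient:Arch:" is exactly 15 characters long, which is why offset 15 with length 5 lands on the architecture field.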

I've already tried stopping the dhcpd daemon on both machines and manually editing the entries for 52:54:00:31:f2:7f in /var/lib/dhcpd/dhcpd.leases. No change.

Any ideas?

Edit: it occurred to me that it might help to post my previous DHCP configuration, without failover. PXE booting worked fine with it:

subnet 10.4.0.0 netmask 255.255.0.0 {
    range dynamic-bootp 10.4.45.1 10.4.45.254; # DHCP pool on private network
    default-lease-time 86400; # one day (in seconds)
    option subnet-mask 255.255.0.0;
    option broadcast-address 10.4.255.255;
    option routers 10.4.0.1;
    option domain-name-servers 10.4.7.7, 10.4.7.29; 
    option domain-name "nevis.columbia.edu";
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 10.4.7.105, 10.4.7.7, 10.4.7.29;
    next-server 10.4.7.9;    # On which system the bootp filename is located.

    if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {

        if substring (option vendor-class-identifier, 15, 5) = "00007" {
            log (info, "UEFI PXE Boot - private network");
            filename "pxelinux/grubx64.efi"; # The file to load for EFI systems.
        } else {
            log (info, "BIOS PXE Boot - private network");
            filename "pxelinux.0"; # The file to load via bootp for BIOS systems.
        }
    }
}

Answer

I found an answer after a lot of experimentation: it turns out that the order of the access-control statements inside a pool matters.

Here again are the class definitions from my original post:

class "dhcp" {
      match if exists dhcp-message-type;
}
class "pxe" {
      match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
}

Here is the subnet definition that works on my primary DHCP server. The main differences from the configuration in my original post are the placement of the range statements relative to the allow and deny statements, and the fact that I define the "pxe" pool first. The original failover lines are unchanged.

subnet 10.4.0.0 netmask 255.255.0.0 {
    default-lease-time 86400; # one day (in seconds)
    option subnet-mask 255.255.0.0;
    option broadcast-address 10.4.255.255;
    option routers 10.4.0.1;
    option domain-name-servers 10.4.7.7, 10.4.7.29; 
    option domain-name "company.example.com";
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 10.4.7.105, 10.4.7.7, 10.4.7.29;
    next-server 10.4.7.9;    # On which system the bootp filename is located.

    if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {
        if option architecture-type =  00:07 {
           filename "uefi/grubx64.efi"; # The file to load for EFI systems.
        }
        else {
           filename "pxelinux/pxelinux.0"; # The file to load via bootp for BIOS systems.
        }
    }

    # A separate pool for PXE services.
    pool {
         range dynamic-bootp 10.4.45.251 10.4.45.255; # DHCP pool on private network
         allow dynamic bootp clients;
         allow members of "pxe";
    }

    # The "regular" DHCP pool.
    pool {
         failover peer "dhcp-failover";
         range 10.4.45.1 10.4.45.250; # DHCP pool on private network
         deny dynamic bootp clients;
         deny members of "pxe";
    }
}
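The working configuration selects the boot file from `option architecture-type` rather than from the Arch field of the vendor class. As far as I know, `architecture-type` is not one of dhcpd's predefined option names, so the full file presumably declares it for option 93 (commonly as `option architecture-type code 93 = unsigned integer 16;`); that declaration is an assumption, since it is not shown here. A Python sketch of the same decision, treating option 93 as a 16-bit big-endian value (the function names are hypothetical):

```python
import struct
from typing import Optional


def arch_type(option93: bytes) -> int:
    """Decode the first 16-bit big-endian value carried in option 93."""
    (value,) = struct.unpack("!H", option93[:2])
    return value


def filename_for(vendor_class: str, option93: bytes) -> Optional[str]:
    """Mirror the subnet-level if/else in the answer's primary config."""
    if vendor_class[0:9] != "PXEClient":
        return None                      # not a PXE client: no filename is set
    if option93 == bytes([0x00, 0x07]):  # same test as "= 00:07" in dhcpd.conf
        return "uefi/grubx64.efi"
    return "pxelinux/pxelinux.0"         # anything else, including no option 93


print(arch_type(b"\x00\x07"))
print(filename_for("PXEClient:Arch:00007:UNDI:003016", b"\x00\x07"))
print(filename_for("PXEClient:Arch:00000:UNDI:002001", b"\x00\x00"))
print(filename_for("MSFT 5.0", b""))
```

The vendor class strings are assumed examples. Because the `if` block sits at subnet level, it applies to every PXE client in the subnet regardless of which pool the address comes from.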

Here are the revised subnet lines in my secondary DHCP server's configuration, although these changes probably don't matter:

subnet 10.4.0.0 netmask 255.255.0.0 {
    default-lease-time 86400; # one day (in seconds)
    option subnet-mask 255.255.0.0;
    option broadcast-address 10.4.255.255;
    option routers 10.4.0.1;
    option domain-name-servers 10.4.7.7, 10.4.7.29; 
    option domain-name "company.example.com";
    option time-offset -18000; # Eastern Standard Time
    option ntp-servers 10.4.7.105, 10.4.7.7, 10.4.7.29;

    # Note that there are a few IP addresses in the range of the primary
    # server that are not included here. This is for PXE, which is
    # not handled by the secondary server.
    pool {
         failover peer "dhcp-failover";
         deny dynamic bootp clients;     
         range 10.4.45.1 10.4.45.250; # DHCP pool on private network
    }
}
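The comments in both configurations rely on the PXE-only range sitting outside the shared failover range. A quick Python check with the standard `ipaddress` module that the two ranges do not overlap, lie inside 10.4.0.0/16, and avoid the network, broadcast and router addresses (`addr_range` is a hypothetical helper):

```python
import ipaddress

SUBNET = ipaddress.ip_network("10.4.0.0/16")
ROUTER = ipaddress.ip_address("10.4.0.1")


def addr_range(first: str, last: str) -> set:
    """All addresses from first to last inclusive, like a dhcpd range statement."""
    lo = int(ipaddress.ip_address(first))
    hi = int(ipaddress.ip_address(last))
    return {ipaddress.ip_address(i) for i in range(lo, hi + 1)}


failover_range = addr_range("10.4.45.1", "10.4.45.250")  # on both servers
pxe_range = addr_range("10.4.45.251", "10.4.45.255")     # primary only

all_addrs = failover_range | pxe_range
reserved = {SUBNET.network_address, SUBNET.broadcast_address, ROUTER}

print(len(failover_range), len(pxe_range))
print(failover_range.isdisjoint(pxe_range))
print(all(a in SUBNET for a in all_addrs))
print(reserved.isdisjoint(all_addrs))
```

With a /16 netmask, 10.4.45.255 is an ordinary host address (the broadcast address is 10.4.255.255), so the five-address PXE range is usable.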

I now have a setup with DHCP failover and PXE booting for installing or repairing an operating system, which handles both BIOS and EFI systems. I hope someone finds the lines above useful!