
We have a Hadoop cluster with active/standby Resource Manager services. The active Resource Manager runs on the machine master1 and the standby Resource Manager runs on the machine master2.
In our cluster, the YARN service, which includes both Resource Manager services, manages 276 NodeManager components on the worker machines.
In the Ambari Web UI alerts (alerts for Resource Manager) we noticed the following:
Resource Manager Web UI
Connection failed to http://master2.jupiter.com:8088(timed out)
We started debugging the problem using wget against port 8088 and found that the process hangs at: HTTP request sent, awaiting response... No data received
Example from the Resource Manager machine:
wget --debug http://master2.jupiter.com:8088
DEBUG output created by Wget 1.14 on Linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master2.jupiter.com:8088/
Resolving master2.jupiter.com (master2.jupiter.com)... 192.9.201.169
Caching master2.jupiter.com => 192.9.201.169
Connecting to master2.jupiter.com (master2.jupiter.com)|192.9.201.169|:8088... connected.
Created socket 3.
Releasing 0x0000000000a0da00 (new refcount 1).
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master2.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
X-Frame-Options: SAMEORIGIN
Location: http://master1.jupiter.com:8088/
Content-Length: 43
Server: Jetty(6.1.26.hwx)
---response end---
307 TEMPORARY_REDIRECT
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/ [following]
Skipping 43 bytes of body: [This is standby RM. The redirect url is: /
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master1.jupiter.com:8088/
conaddr is: 192.9.201.169
Resolving master1.jupiter.com (master1.jupiter.com)... 192.9.66.14
Caching master1.jupiter.com => 192.9.66.14
Releasing 0x0000000000a0f320 (new refcount 1).
Found master1.jupiter.com in host_name_addresses_map (0xa0f320)
Connecting to master1.jupiter.com (master1.jupiter.com)|192.9.66.14|:8088... connected.
Created socket 4.
Releasing 0x0000000000a0f320 (new refcount 1).
.
.
.
---response end---
302 Found
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/cluster [following]
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:27:07-- http://master1.jupiter.com:8088/cluster
Reusing existing connection to master1.jupiter.com:8088.
Reusing fd 4.
---request begin---
GET /cluster HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master1.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)
---response end---
200 OK
URI content encoding = ‘utf-8’
Length: unspecified [text/html]
Saving to: ‘index.html’
[ <=> ] 1,018,917 --.-K/s in 0.04s
2024-02-21 10:31:31 (24.0 MB/s) - ‘index.html’ saved [1018917]
As the log above shows, wget took a long time to complete, roughly 20 minutes, instead of finishing in a second or two.
We could capture the traffic with tcpdump, for example:
tcpdump -vv -s0 tcp port 8088 -w /tmp/why_8088_hang.pcap
but I would like to know whether there are better and simpler ways to understand why we get "HTTP request sent, awaiting response...", and whether it may be related to the Resource Manager service.
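
One lighter-weight check than a packet capture might be to time each redirect hop on its own, since the wget log shows three separate requests (master2 "/", then master1 "/", then master1 "/cluster") with different timestamps. The following is only a minimal sketch, assuming Python 3 is available on the host and that the UI is served over plain HTTP as in the log. The function name trace_redirects and its timeout and max_hops parameters are hypothetical, made up for illustration. It measures the time until the response headers arrive for each hop and does not follow redirects automatically.

```python
import http.client
import time
from urllib.parse import urljoin, urlsplit


def trace_redirects(url, timeout=60.0, max_hops=5):
    """Request url hop by hop and time each response.

    Returns a list of (url, status, seconds) tuples, where seconds is the
    time until the status line and headers arrived. status is None when the
    hop failed or timed out. (Hypothetical helper, not part of any tool.)
    """
    hops = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Plain HTTP only, as in the wget log; a fresh connection per hop.
        conn = http.client.HTTPConnection(
            parts.hostname, parts.port or 80, timeout=timeout
        )
        status, location = None, None
        start = time.monotonic()
        try:
            conn.request("GET", path)
            resp = conn.getresponse()  # blocks until headers are received
            status = resp.status
            location = resp.getheader("Location")
        except OSError:
            pass  # connection refused, reset, or timed out
        finally:
            elapsed = time.monotonic() - start
            conn.close()
        hops.append((url, status, elapsed))
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # follow the redirect manually
        else:
            break
    return hops
```

Calling trace_redirects("http://master2.jupiter.com:8088/", timeout=1800) and printing the returned tuples should show which host and which path accounts for the wait. In the log above the standby on master2 answers with the 307 within the same second, so the slow hops appear to be the ones served by master1.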