
Recently I noticed some warnings in nagios.log:
[1366060611] Warning: The check of service 'pt-deadlock-logger' on host 'xx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
The critical problem is that after this warning, Nagios no longer executes any checks. As a workaround, I had to set up an event handler that restarts Nagios whenever this warning appears:
localhost.cfg
define service{
    use                     logfile-service
    host_name               localhost
    service_description     nagios_orphaned
    check_command           check_nagios_orphaned
    event_handler           restart_nagios
    contact_groups          admin
}
commands.cfg
define command {
    command_name    check_nagios_orphaned
    command_line    sudo $USER2$/check_logfiles --tag=orphaned --logfile=/usr/local/nagios/var/nagios.log --warningpattern="looks like it was orphaned"
}
define command {
    command_name    restart_nagios
    command_line    $USER1$/eventhandlers/restart_nagios.sh $SERVICESTATE$
}
restart_nagios.sh
#!/bin/bash
case "$1" in
    OK)
        ;;
    WARNING)
        /usr/bin/screen -S nagios -d -m sudo /etc/init.d/nagios restart
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        ;;
esac
exit 0
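Nagios also provides the $SERVICESTATETYPE$ macro (SOFT or HARD). As a variant of the handler above, and only as a sketch, the decision could be moved into a function that takes the state type as a second argument. That would require appending $SERVICESTATETYPE$ to the command_line of restart_nagios. The function name decide_action and the "restart only on a HARD WARNING" policy are assumptions for illustration, not what the setup above does.

```shell
# decide_action: hypothetical helper, not part of Nagios.
# $1 = service state ($SERVICESTATE$), $2 = state type ($SERVICESTATETYPE$).
# Prints "restart" only for a HARD WARNING, otherwise "none".
decide_action() {
    case "$1" in
        WARNING)
            if [ "$2" = "HARD" ]; then
                echo restart
            else
                echo none
            fi
            ;;
        *)
            echo none
            ;;
    esac
}

# Possible use inside restart_nagios.sh (same restart command as above):
#   if [ "$(decide_action "$1" "$2")" = "restart" ]; then
#       /usr/bin/screen -S nagios -d -m sudo /etc/init.d/nagios restart
#   fi
```

Keeping the decision separate from the restart command makes it possible to try the handler logic by hand without actually restarting Nagios.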
I have tried upgrading Nagios to the latest version:
# nagios -V
Nagios Core 3.5.0
Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-15-2013
License: GPL
but I still get this warning.
The first result when searching Google is: http://support.nagios.com/wiki/index.php/Nagios_XI:FAQs#Check_Services_Being_Orphaned
but I'm sure there is only one (main) process running:
# ps -ef | grep '/usr/local/nagios/bin/nagio[s]'
nagios 8956 15155 0 18:08 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 8957 15155 0 18:08 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 15155 1 5 14:09 ? 00:13:47 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
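To make the "only one main process" claim easy to re-check, here is a minimal sketch that counts the Nagios daemons whose parent is PID 1 in `ps -ef` output. The function name count_main_daemons is made up for illustration. It assumes the daemon was started with -d and is parented to PID 1, as in the listing above, where the other two entries have PPID 15155 and are therefore children of the main process.

```shell
# count_main_daemons: hypothetical helper, not part of Nagios.
# Reads `ps -ef` output on stdin and counts "nagios -d" processes
# whose parent PID (column 3) is 1.
count_main_daemons() {
    awk '$3 == 1 && index($0, "/nagios -d ") > 0 { n++ } END { print n + 0 }'
}

# Usage:
#   ps -ef | count_main_daemons
```

A result greater than 1 would point to duplicate daemons, which is the situation the linked FAQ entry describes.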
Also, I don't see the "Resource temporarily unavailable" error in the log file, so ulimit restrictions can be ruled out.
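A small sketch of how that check can be repeated: count the log lines containing each message as a fixed string. The function name count_matches is hypothetical, and the log path in the usage comment is the one from the check_logfiles command above.

```shell
# count_matches: hypothetical helper, not part of Nagios.
# $1 = fixed string to look for, $2 = log file.
# Prints the number of matching lines (0 when there are none).
count_matches() {
    # grep -c exits non-zero when the count is 0, so mask the exit status
    grep -cF -- "$1" "$2" || true
}

# Usage:
#   count_matches 'Resource temporarily unavailable' /usr/local/nagios/var/nagios.log
#   count_matches 'looks like it was orphaned' /usr/local/nagios/var/nagios.log
```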
The embedded Perl interpreter was already disabled:
enable_embedded_perl=0
use_embedded_perl_implicitly=0
Are there any other causes?
PS: I'm running Nagios on a Xen HVM guest:
# virt-what
xen
xen-hvm
UPDATE Tue Apr 16 22:07:09 ICT 2013
I searched for this warning in the source directory and found:
# grep -lr 'looks like it was orphaned' nagios-3.5.0
/nagios-3.5.0/base/checks.o
/nagios-3.5.0/base/nagios
/nagios-3.5.0/base/checks.c
and this is the check_for_orphaned_services function:
/* check for services that never returned from a check... */
void check_for_orphaned_services(void) {
    service *temp_service = NULL;
    time_t current_time = 0L;
    time_t expected_time = 0L;

    log_debug_info(DEBUGL_FUNCTIONS, 0, "check_for_orphaned_services()\n");

    /* get the current time */
    time(&current_time);

    /* check all services... */
    for(temp_service = service_list; temp_service != NULL; temp_service = temp_service->next) {

        /* skip services that are not currently executing */
        if(temp_service->is_executing == FALSE)
            continue;

        /* determine the time at which the check results should have come in (allow 10 minutes slack time) */
        expected_time = (time_t)(temp_service->next_check + temp_service->latency + service_check_timeout + check_reaper_interval + 600);

        /* this service was supposed to have executed a while ago, but for some reason the results haven't come back in... */
        if(expected_time < current_time) {

            /* log a warning */
            logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: The check of service '%s' on host '%s' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...\n", temp_service->description, temp_service->host_name);
            log_debug_info(DEBUGL_CHECKS, 1, "Service '%s' on host '%s' was orphaned, so we're scheduling an immediate check...\n", temp_service->description, temp_service->host_name);

            /* decrement the number of running service checks */
            if(currently_running_service_checks > 0)
                currently_running_service_checks--;

            /* disable the executing flag */
            temp_service->is_executing = FALSE;

            /* schedule an immediate check of the service */
            schedule_service_check(temp_service, current_time, CHECK_OPTION_ORPHAN_CHECK);
        }
    }

    return;
}
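The threshold used by this function can be worked through by hand. The sketch below mirrors the expected_time expression in shell integer arithmetic. The function names orphan_deadline and is_orphaned are hypothetical, the numbers in the usage comment are illustrative values rather than values from my configuration, and latency is assumed to be rounded to whole seconds (in the C code it is a floating-point value that gets truncated by the cast to time_t).

```shell
# orphan_deadline: hypothetical helper mirroring the expected_time formula.
# $1 = next_check (epoch seconds)   $2 = latency (whole seconds, assumed)
# $3 = service_check_timeout        $4 = check_reaper_interval
# The trailing 600 is the 10 minutes of slack from the C code.
orphan_deadline() {
    echo $(( $1 + $2 + $3 + $4 + 600 ))
}

# is_orphaned: succeeds when expected_time ($1) is earlier than current_time ($2),
# which is the condition that triggers the warning.
is_orphaned() {
    [ "$1" -lt "$2" ]
}

# Usage with illustrative values:
#   deadline=$(orphan_deadline 1366294000 0 60 10)   # 1366294670
#   is_orphaned "$deadline" "$(date +%s)" && echo "would be reported as orphaned"
```

Because the comparison is against the wall clock, a check that is still flagged as executing is reported as orphaned as soon as current_time passes that deadline.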
UPDATE Thu Apr 18 22:32:19 ICT 2013
Just to confirm, I edited the source code to add the values of expected_time and current_time to the log file. What I get is:
[1366294608] expected_time: 'Thu Apr 18 21:16:36 2013
', current_time: 'Thu Apr 18 21:16:48 2013
' - Warning: The check of service 'Check_MK' on host 'xx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
Re-reading the log file, I see an important message:
[1366218303] Warning: A system time change of 0d 0h 0m 1s (backwards in time) has been detected. Compensating...
It looks like Xen is the culprit.
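Since the time-change message is easy to miss, here is a sketch for listing both kinds of events from nagios.log in file order, so that orphan warnings can be compared against clock adjustments. The function name list_clock_events is hypothetical, and it assumes every log line starts with a bracketed epoch timestamp, as in the excerpts above.

```shell
# list_clock_events: hypothetical helper, not part of Nagios.
# $1 = log file. Prints "<epoch> <kind>" for each time-change message
# and each orphaned-check warning, in the order they appear.
list_clock_events() {
    awk '
        /A system time change of/    { kind = "time-change" }
        /looks like it was orphaned/ { kind = "orphaned" }
        kind != "" {
            # the first field looks like [1366218303]; drop the brackets
            print substr($1, 2, length($1) - 2), kind
            kind = ""
        }
    ' "$1"
}

# Usage:
#   list_clock_events /usr/local/nagios/var/nagios.log
```

This only lines the events up for inspection. It does not by itself prove that a given time change caused a given orphan warning.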