Nagios: service checks being orphaned?

Recently I noticed some warnings in nagios.log:

[1366060611] Warning: The check of service 'pt-deadlock-logger' on host 'xx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

The critical problem is that after this warning, Nagios stops executing any checks. As a workaround, I have to configure an event handler that restarts Nagios every time this warning appears:

localhost.cfg

define service{
    use                     logfile-service
    host_name               localhost
    service_description     nagios_orphaned
    check_command           check_nagios_orphaned
    event_handler           restart_nagios
    contact_groups          admin
}

commands.cfg

define command {
    command_name    check_nagios_orphaned
    command_line    sudo $USER2$/check_logfiles --tag=orphaned --logfile=/usr/local/nagios/var/nagios.log --warningpattern="looks like it was orphaned"
}

define command {
    command_name    restart_nagios
    command_line    $USER1$/eventhandlers/restart_nagios.sh $SERVICESTATE$
}

restart_nagios.sh

#!/bin/bash

case "$1" in
        OK)
                ;;
        WARNING)
                /usr/bin/screen -S nagios -d -m sudo /etc/init.d/nagios restart
                ;;
        UNKNOWN)
                ;;
        CRITICAL)
                ;;
esac

exit 0

I have tried upgrading Nagios to the latest version:

# nagios -V

Nagios Core 3.5.0
Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-15-2013
License: GPL

but I still get this warning.

The first result when searching Google is: http://support.nagios.com/wiki/index.php/Nagios_XI:FAQs#Check_Services_Being_Orphaned

but I am sure that only one (main) process is running:

# ps -ef | grep '/usr/local/nagios/bin/nagio[s]'
nagios    8956 15155  0 18:08 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    8957 15155  0 18:08 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   15155     1  5 14:09 ?        00:13:47 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Also, I don't see the "Resource temporarily unavailable" error in the log file, so ulimit restrictions can be ruled out.

The embedded Perl interpreter was already disabled:

enable_embedded_perl=0
use_embedded_perl_implicitly=0

Are there any other causes?

PS: I'm running Nagios on a Xen HVM:

# virt-what 
xen
xen-hvm

UPDATE Tue Apr 16 22:07:09 ICT 2013

I searched for this warning in the source directory and found:

# grep -lr 'looks like it was orphaned' nagios-3.5.0
/nagios-3.5.0/base/checks.o
/nagios-3.5.0/base/nagios
/nagios-3.5.0/base/checks.c

and this is the check_for_orphaned_services function:

/* check for services that never returned from a check... */
void check_for_orphaned_services(void) {
    service *temp_service = NULL;
    time_t current_time = 0L;
    time_t expected_time = 0L;


    log_debug_info(DEBUGL_FUNCTIONS, 0, "check_for_orphaned_services()\n");

    /* get the current time */
    time(&current_time);

    /* check all services... */
    for(temp_service = service_list; temp_service != NULL; temp_service = temp_service->next) {

        /* skip services that are not currently executing */
        if(temp_service->is_executing == FALSE)
            continue;

        /* determine the time at which the check results should have come in (allow 10 minutes slack time) */
        expected_time = (time_t)(temp_service->next_check + temp_service->latency + service_check_timeout + check_reaper_interval + 600);

        /* this service was supposed to have executed a while ago, but for some reason the results haven't come back in... */
        if(expected_time < current_time) {

            /* log a warning */
            logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: The check of service '%s' on host '%s' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...\n", temp_service->description, temp_service->host_name);

            log_debug_info(DEBUGL_CHECKS, 1, "Service '%s' on host '%s' was orphaned, so we're scheduling an immediate check...\n", temp_service->description, temp_service->host_name);

            /* decrement the number of running service checks */
            if(currently_running_service_checks > 0)
                currently_running_service_checks--;

            /* disable the executing flag */
            temp_service->is_executing = FALSE;

            /* schedule an immediate check of the service */
            schedule_service_check(temp_service, current_time, CHECK_OPTION_ORPHAN_CHECK);
            }

        }

    return;
    }
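
To make the threshold in that function concrete, here is a minimal shell sketch of the same arithmetic. The function names orphan_deadline and is_orphaned and their positional arguments are hypothetical, and latency is treated as whole seconds (Nagios stores it as a floating-point value), so this only approximates what the C code does:

```shell
#!/bin/bash
# Sketch of the orphan test from check_for_orphaned_services().
# Arguments (all integers, in seconds; names are illustrative only):
#   $1 next_check  $2 latency  $3 service_check_timeout  $4 check_reaper_interval

# Print the time by which results should have come back
# (same sum as in checks.c, including the 600 s of slack).
orphan_deadline() {
    echo $(( $1 + $2 + $3 + $4 + 600 ))
}

# Succeed (exit 0) if the check would be flagged as orphaned at time $5.
is_orphaned() {
    local expected
    expected=$(orphan_deadline "$1" "$2" "$3" "$4")
    [ "$expected" -lt "$5" ]
}
```

For example, with next_check=1000, latency=0, a 60 s timeout and a 10 s reaper interval, `orphan_deadline 1000 0 60 10` prints 1670, so the check is only flagged once the current time is past 1670. A service that is still marked as executing therefore has to be overdue by more than ten minutes before the warning fires.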

UPDATE Thu Apr 18 22:32:19 ICT 2013

Just to confirm, I edited the source code to write the values of expected_time and current_time to the log file. This is what I get:

[1366294608] expected_time: 'Thu Apr 18 21:16:36 2013
', current_time: 'Thu Apr 18 21:16:48 2013
' - Warning: The check of service 'Check_MK' on host 'xx' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...

Rereading the log file, I noticed an important message:

[1366218303] Warning: A system time change of 0d 0h 0m 1s (backwards in time) has been detected. Compensating...

It looks like Xen is the culprit.
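
One way to check whether the orphan warnings line up with clock jumps is to list every time-change message with a readable timestamp. This is a minimal sketch, assuming GNU date (for `-d @epoch`) and log lines that start with `[epoch]` as in the excerpts above; the function name list_time_changes is made up for illustration:

```shell
#!/bin/bash
# Print each "system time change" warning from a Nagios log,
# prefixed with its timestamp converted to UTC.
# Assumes GNU date; list_time_changes is a hypothetical helper name.
list_time_changes() {
    local logfile="$1" line epoch message
    grep 'A system time change of' "$logfile" | while IFS= read -r line; do
        epoch="${line#\[}"       # drop the leading "["
        epoch="${epoch%%\]*}"    # keep only the digits before "]"
        message="${line#*\] }"   # text after "[epoch] "
        printf '%s UTC %s\n' \
            "$(date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S')" "$message"
    done
}
```

It could be run as `list_time_changes /usr/local/nagios/var/nagios.log` and the result compared against the times of the orphan warnings.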
