The most common symptoms of the problem described in this document are:
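This document is broken into 3 sections:
Section I: Root cause analysis
Section II: Impact mitigation for a potential future *.Manager failure
Section III: Monitoring for a potential future *.Manager failure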
Section I: Root cause analysis

Based upon the problem symptoms, the most likely direct causes of the problem are:
The most common causes of that type of problem are:
Testing cron via the checkCron.pl command

There is a command in $BEST1_HOME/bgs/scripts called 'checkCron.pl' that can be executed to test that cron is properly executing scheduled commands. The output looks something like this:

Reviewing the cron logs

Option A: cron didn't execute pcron on that minute

> perform 27711 c Fri Sep 19 10:52:00 2008
If cron didn't execute pcron, then we should be missing an entry from 23:40 for pcron in the /var/cron/log file, since the Manager run that didn't execute is the only pcron event that runs at 23:40:

> CMD: /usr/adm/best1_7.4.10/bgs/bin/perl -w -I /usr/adm/best1_7.4.10/bgs/lib/PERL /usr/adm/best1_7.4.10/bgs/scripts/pcron 2>/dev/null
< perform 27711 c Fri Sep 19 10:52:00 2008
60 30 23 * * * /opt/perform/manager/Jul-30-2008.18.21/Jul-31-2008.00.10-Jul-30-2009.23.55.Manager >/dev/null 2>&1
There isn't a simple way to parse the cron log, since it puts the execution time and the command being run on separate lines in the log. You'll probably just need to do a search in the file. So, you could do this:

grep "23:30" /var/cron/*log
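Because the time and the command sit on separate lines, it can help to join them before searching. Here is a minimal sketch, assuming the log layout shown in this document: a "> CMD:" line followed by a "> user pid c date" line. The function name join_cron_log is made up for illustration and is not part of the product.

```shell
# join_cron_log FILE...
# Prints one line per started job: start time, user, and command, tab-separated.
# Assumes each "> CMD:" line is followed by its "> user pid c date" line.
join_cron_log() {
    awk '
        # A "> CMD:" line carries the command; remember it for the next line.
        $1 == ">" && $2 == "CMD:" {
            cmd = $0
            sub(/^> *CMD: */, "", cmd)
            next
        }
        # The following ">" line carries user, pid, type, and start time.
        $1 == ">" && cmd != "" {
            printf "%s %s %s %s %s\t%s\t%s\n", $5, $6, $7, $8, $9, $2, cmd
            cmd = ""
        }
    ' "$@"
}
```

With that in place, something like `join_cron_log /var/cron/*log | grep ' 23:30:'` would show the time and the command together on each matching line.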
That will search both the cron 'log' and 'olog' files for all the scripts executed at 23:30. If there is an execution gap in the output, that would indicate that cron failed to execute pcron on that minute on the problem day.

Some common reasons that cron is not able to execute the pcron script are:

(A) The TSCO Installation Owner user isn't listed in the cron.allow file and there is a cron.deny file blocking access to users (these files are in /etc on Linux and in /etc/cron.d on Solaris)

(B) The error "You (user) are not allowed to access to (crontab) because of pam configuration." often indicates that the TSCO Installation Owner user account is locked or the password has expired

Option B: cron attempted to execute pcron but it failed for some reason
When cron executes a command, it records the return value of that command execution. So, if cron appears to have properly executed pcron that minute, we should check the return code of that pcron execution and make sure there isn't an error condition reported. When there is an error, the line for the script completion will report an error return code like this:

< perform 3895 c Thu Sep 18 23:40:00 2008 rc=1
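To pull out only the completion lines that report an error, a small filter like the following could be used. This is a sketch, assuming completion lines start with "<" and end with "rc=N" as in the example above. The function name cron_failures is hypothetical.

```shell
# cron_failures FILE...
# Prints cron completion lines ("<" lines) whose last field is a non-zero rc=N.
cron_failures() {
    awk '$1 == "<" && $NF ~ /^rc=[0-9]+$/ && $NF != "rc=0" { print }' "$@"
}
```

For example, `cron_failures /var/cron/*log | grep ' 23:40:'` would narrow the output to failures reported for that minute.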
Option C: pcron was executed but it failed to execute the *.Manager script for some reason

Whenever pcron executes a script, it writes a log entry to the $BEST1_HOME/bgs/log/pcron/[hostname]-[username].log file. The entry looks something like this:

[2008/07/24:23:35:00] executed 3 35 23 * * * /home2/best1data/manager/daily/topgun/May-05-2008.16.21/May-05-2007.00.05-Dec-21-2012.23.59.Manager >/dev/null 2>&1
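A quick way to look for that entry for a given minute is sketched below. It assumes the "[YYYY/MM/DD:HH:MM:SS] executed ..." layout shown above. The function name pcron_executed is made up for illustration, and the function only looks for "executed" entries, since the wording pcron uses for a failed execution isn't shown in this document.

```shell
# pcron_executed LOGFILE YYYY/MM/DD:HH:MM
# Prints the "executed" entries for *.Manager scripts logged during that minute.
# Returns non-zero when no such entry is found.
pcron_executed() {
    grep -F "[$2" "$1" | grep ' executed ' | grep '\.Manager'
}
```

If this prints nothing for the problem minute, the next step would be to read the pcron log around that timestamp by hand.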
If there was an immediate error running the command, pcron will note in the log that it couldn't properly execute the script. So, if cron executed pcron, then we should next check the pcron log file to determine whether pcron says it properly executed the *.Manager script for that minute.

Option D: pcron was executed and successfully executed the *.Manager script but then the script failed soon after execution

This is the hardest problem to identify. At this point we'll have seen in the pcron log that pcron successfully executed the *.Manager script, but we know from the problem symptoms that the script never created the daily collection and processing scripts, didn't schedule them into pcron, and never started the udrCollectMgr processes. That indicates that although pcron was able to successfully execute the *.Manager script in the background via a perl system() call, the *.Manager script itself wasn't able to run to successful completion.

This is usually the result of a system level problem that may be visible in the /var/adm/messages file. In the past I've seen this when all available process slots were in use (generally due to some runaway process spawning children wildly), when a file system was full, or when there was some other system level problem. A system level event is more likely in a scenario where multiple Manager runs didn't execute at the same minute (but pcron was executed by cron) or other applications were having problems on the machine at the same time.

TSCO Gateway Server on Linux 'nprocs' setting

If cron is calling pcron but it appears that the *.XferData script isn't being executed or it isn't running the *.ProcessDay script, the problem could be related to the "max user processes" ulimit setting being reached. When an environment has more than, say, 20 Manager runs and they have been configured with the same Processing Delay, it is possible to saturate the number of available threads for the user under which the Manager runs are scheduled.
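To see how close a user is to that limit, the current thread count can be compared against the soft limit. The following is a rough sketch for Linux, assuming the procps version of ps. The function names are hypothetical, and the check needs to be run as (or for) the user that owns the Manager runs.

```shell
# count_user_threads USER
# Prints the number of threads (LWPs) currently owned by USER (Linux procps ps).
count_user_threads() {
    ps -L -u "$1" -o lwp= | wc -l
}

# nproc_usage_pct USED LIMIT
# Prints how much of the limit is in use, as a whole-number percentage.
nproc_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

For example, `nproc_usage_pct "$(count_user_threads perform)" "$(ulimit -u)"` run from that user's shell during the 11:30 PM to 12:15 AM window would show how much headroom is left. Note that `ulimit -u` can print "unlimited", in which case the arithmetic doesn't apply.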
The 'max user processes' ulimit setting on Linux is, by default, 1024 threads (it's a limit on threads, not processes), and the nightly Manager runs will generate a number of threads, particularly between 11:30 PM and 12:15 AM when the udrCollectMgr processes have been started for the next day's collection and the udrCollectMgr processes from the previous night's collection are still running to complete the data transfer. The recommendation would be to set the 'max user processes' soft ulimit to allow 1024 threads for every 25 Manager runs (so set it to 2048 for 50 runs). The soft limit can be set via the /etc/security/limits.conf file.

The symptoms of this problem are pretty subtle. Usually the symptoms will be that everything appears to be properly configured on the machine, but the [date]-[date].manager.log doesn't show the *.XferData script executing the *.ProcessDay script and there is no other obvious reason why it wouldn't have been run.

RHEL 7 NOTE

On Linux the 'nproc' value is typically updated via the /etc/security/limits.conf file. But on RHEL 7 there is another file where the ulimits can be set:

/etc/security/limits.d/20-nproc.conf

That file is deployed in an out-of-the-box OS install with the nproc 'soft' limit set to 4096 by default:

* soft nproc 4096

In an environment with a very large number of Manager runs it could be necessary to increase this value further. If the 'nproc' value is set in the /etc/security/limits.conf file on RHEL 7, the value in the 20-nproc.conf file (or any other file in the limits.d directory that sets it) will override it.

Section II: Impact mitigation for a potential future *.Manager failure

One of the easiest ways to prevent data loss from a Manager run execution failure is to enable the UDR Collection Manager "days in advance" collection feature (which is enabled by default to register collection requests for 3 days in advance).
This feature instructs the Manager run to pre-register data collection requests on the remote node for up to 3 days in advance. If the Perform console fails for some reason (cron fails, console hardware fails, the network fails, lightning strike, whatever), the collection requests will already be registered on the remote node, so no data collection will be lost if the problem is caught and resolved within 3 days.
This document describes how to enable the COLREQ_DAYS_ADVANCE feature via the $BEST1_HOME/local/setup/collectManager.cfg file in the "The COLREQ_DAYS_ADVANCE parameter" sub-section of "Section I: Recommendations for both the Unix and Windows console".
Section III: Monitoring for a potential future *.Manager failure

Although there isn't a solution within the product itself to monitor for this type of failure, it would be possible to create a basic custom script that monitors pcron to validate that all of the daily scripts were properly scheduled for each active Manager run.
Here is an inelegant way to see if there are any *.Manager scripts scheduled without associated *.XferData or *.Collect scripts:

$BEST1_HOME/bgs/scripts/pcrontab.sh -list | grep -v udrCollectMgr | cut -d " " -f 7 | awk -F\/ '{for(i=1;i < NF;i++) printf("%s/",$i); printf("\n"); }' | sort | uniq -c
The output would look something like this:
3 /net/topgun/home3/paska/manager/operations/Jun-17-2008.13.47/
3 /net/topgun/home3/paska/manager/topsecret/Jun-19-2008.14.19/

That says that there are 3 scripts scheduled in pcron in each of the listed Manager Output Directories. If you ran a command line like that after data processing was completed, when the only daily scripts left in pcron should be the *.Collect and *.XferData scripts for the current day, you could be alerted to missing entries where there was only a *.Manager scheduled. Obviously a better script could be written that monitors more things (like seeing that your *.Manager scripts are still listed in pcron).

This is just a quick idea on the type of monitoring you can do for this type of failure by monitoring pcrontab.sh output.
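As one step up from counting entries, the check could report only the directories that have a *.Manager scheduled with no *.Collect or *.XferData next to it. This is a sketch, assuming the listing layout implied by the command above, where the 7th whitespace-separated field is the script path. The function name manager_only_dirs is hypothetical.

```shell
# manager_only_dirs
# Reads "pcrontab.sh -list" output on stdin and prints each Manager Output
# Directory that has a *.Manager entry but no *.Collect or *.XferData entry.
manager_only_dirs() {
    grep -v udrCollectMgr | awk '
        {
            path = $7                 # assumed: field 7 is the script path
            dir = path
            sub("[^/]*$", "", dir)    # drop the file name, keep the trailing slash
            if (path ~ /\.Manager$/)
                manager[dir] = 1
            else if (path ~ /\.(Collect|XferData)$/)
                daily[dir] = 1
        }
        END {
            for (d in manager)
                if (!(d in daily))
                    print d
        }
    ' | sort
}
```

It would be run as `$BEST1_HOME/bgs/scripts/pcrontab.sh -list | manager_only_dirs`, and any output after data processing has completed would be worth a look.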