On Linux, a Gateway Server Manager run wasn't executed at night but the *.Manager script still exists and is still scheduled in pcron in TrueSight Capacity Optimization (TSCO)

Knowledge Article

Article Number

000229523

Old Article Number

000096499

Article Type

Solutions to a Product Problem

Title

On Linux, a Gateway Server Manager run wasn't executed at night but the *.Manager script still exists and is still scheduled in pcron in TrueSight Capacity Optimization (TSCO)

Summary

On Linux, a TSCO Gateway Server (formerly BPA) Manager run wasn't executed at night but the *.Manager script still exists and is still scheduled in pcron

Product

TrueSight Capacity Optimization

Component

Applies to

TrueSight Capacity Optimization (Gateway Server) 11.3, 11.0, 10.x; BMC Performance Assurance 9.5

Problem

The most common symptoms of the problem described in this document are:

The *.Manager script exists in the Manager Output Directory but the *.ProcessDay, *.XferData, and *.Collect scripts weren't created for the current day.
Manually running the *.Manager script successfully creates and executes the daily processing scripts without error

This document is broken into 3 sections:

Section I: Root cause analysis
Section II: Impact mitigation for a potential future *.Manager failure
Section III: Monitoring for a potential future *.Manager failure

BMC TrueSight Capacity Optimization (Gateway Server) 20.02, 11.5, 11.3, 11.0, 10.7, 10.5, 10.3, 10.0
BMC Performance Assurance 9.5, 9.0, 7.5.10
Linux Manager

Cause

Solution

Section I: Root cause analysis

Based upon the problem symptoms, the most likely direct causes of the problem are:

The *.Manager script for that Manager run wasn't executed for that day -- OR --
The *.Manager script for that Manager run was executed but failed almost immediately

This is based upon the fact that none of the daily scripts (*.Collect, *.XferData, *.ProcessDay) were created for that particular Manager run and re-running the *.Manager script manually worked properly.
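The first symptom can be confirmed with a quick directory check. The following is a minimal sketch, not a product utility: the function name missing_daily_scripts is hypothetical, and it assumes that the daily scripts end in .Collect, .XferData, and .ProcessDay and that "created for the current day" can be approximated by "modified within the last 24 hours".

```shell
# Hypothetical helper: prints each daily script type that has no file
# modified within the last 24 hours in the given Manager Output Directory.
# Assumption: the daily scripts end in .Collect, .XferData and .ProcessDay.
missing_daily_scripts() (
  dir="$1"
  for suffix in Collect XferData ProcessDay; do
    # -mtime -1 matches files modified less than 24 hours ago
    found=$(find "$dir" -maxdepth 1 -type f -name "*.$suffix" -mtime -1 | head -n 1)
    [ -n "$found" ] || echo "$suffix"
  done
)
```

Calling it with the Manager Output Directory as the only argument prints nothing when all three script types are present, and otherwise prints one line per missing type.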

The most common causes of that type of problem are:

cron didn't execute pcron on that minute
cron attempted to execute pcron but it failed for some reason
pcron was executed but it failed to execute the *.Manager script for some reason
pcron was executed and successfully executed the *.Manager script but then the script failed soon after execution

The first three causes should leave a log trail that we can follow. The fourth is more difficult to debug and is usually related to an environmental issue (no more process slots available on the machine, file system full, something like that).

Testing cron via the checkCron.pl command

There is a command in $BEST1_HOME/bgs/scripts called 'checkCron.pl' that can be executed to test that cron is properly executing scheduled commands.

The output looks something like this:

> ./checkCron.pl
Warning: set BEST1_HOME is not a directory ()
Using /usr/adm/best1_10.5.00
Info: Opening log file = /usr/adm/best1_10.5.00/local/manager/log/checkCron.log
INFO: crontab Output:* * * * * /usr/adm/best1_10.5.00/bgs/bin/perl -w -I /usr/adm/best1_10.5.00/bgs/lib/PERL /usr/adm/best1_10.3.00/bgs/scripts/pcron 2>/dev/null

Info : crontab listing OK
Info: adding line to cron 41 12 14 3 * touch /tmp/1457973629.cronTest
INFO: crontab Output:
Info : Sleeping 140(sec) waiting for cron to generate file /tmp/1457973629.cronTest
Info : File found
Info: Your cron scheduler works
INFO: crontab Output:

So, that output indicates that cron is working because the test file was properly created when scheduled in cron.

Reviewing the cron logs

Option A: cron didn't execute pcron on that minute

Every time cron executes a script it records the time the script is executed, the command being run, and the return status of the execution.

In the /var/cron/log file, the execution of pcron should look like this:

>  perform 27711 c Fri Sep 19 10:52:00 2008
>  CMD: /usr/adm/best1_7.4.10/bgs/bin/perl -w -I /usr/adm/best1_7.4.10/bgs/lib/PERL /usr/adm/best1_7.4.10/bgs/scripts/pcron 2>/dev/null
<  perform 27711 c Fri Sep 19 10:52:00 2008
 

If cron didn't execute pcron then we should be missing an entry from 23:30 for pcron in the /var/cron/log file, since the Manager run that didn't execute is the only pcron event that runs at 23:30:

  60 30 23 * * * /opt/perform/manager/Jul-30-2008.18.21/Jul-31-2008.00.10-Jul-30-2009.23.55.Manager >/dev/null 2>&1

There isn't a simple way to parse the cron log since it puts the execution time and the command being run on separate lines in the log. You'll probably just need to do a search in the file.

So, you could do this:

  grep "23:30" /var/cron/*log

That will search both the cron 'log' and 'olog' files for all the scripts executed at 23:30. If there is an execution gap in the output that would indicate that cron failed to execute pcron on that minute on the problem day.
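Because the timestamp and the command sit on separate lines, a plain grep also returns every non-pcron job that started in that minute. The sketch below pairs each start line with the CMD line that follows it. It assumes the log layout shown in the sample above (a ">" start line immediately followed by a ">  CMD:" line); the function name pcron_starts_at is hypothetical.

```shell
# Hypothetical helper: prints the cron start lines for pcron at a given HH:MM.
# Assumption: each ">" start line is immediately followed by its ">  CMD:" line,
# as in the sample log excerpt.
pcron_starts_at() (
  hhmm="$1"; shift
  awk -v t="$hhmm" '
    /^>/ && /CMD:/ {
      # Command line: report the remembered start line if this was pcron
      if (/pcron/ && start ~ (" " t ":")) print start
      start = ""
      next
    }
    /^>/ { start = $0 }   # remember the most recent start line
  ' "$@"
)
```

For example, pcron_starts_at 23:30 /var/cron/*log lists one start line per day on which cron launched pcron at 23:30; a day missing from that list is the execution gap described above.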

Some common reasons that cron is not able to execute the pcron script are:
(A) The TSCO Installation Owner user isn't listed in the /etc/cron.allow file, or is blocked by an /etc/cron.deny file
(B) The error "You (user) are not allowed to access to (crontab) because of pam configuration." often indicates that the TSCO Installation Owner user account is locked or the password has expired

Option B: cron attempted to execute pcron but it failed for some reason

When cron executes a command it records the return value of that command execution. So, if cron appears to have properly executed pcron that minute, we should check the return code of that pcron execution and make sure there isn't an error condition reported.

When there is an error the line for the script completion will report an error return code like this:

<  perform 3895 c Thu Sep 18 23:40:00 2008 rc=1
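A quick way to list every completion line that carries an error return code is sketched below. It assumes the completion-line layout shown in the sample (a line starting with "<" and ending in rc=N); the function name cron_failures is hypothetical.

```shell
# Hypothetical helper: prints cron completion lines that report a non-zero
# return code. Assumption: failures appear as "<" lines ending in rc=N.
cron_failures() {
  grep -E '^<.* rc=[1-9][0-9]*[[:space:]]*$' "$@"
}
```

For example, cron_failures /var/cron/log prints the failing completion lines. The completion line doesn't include the command itself, so the process id on it (3895 in the sample) would then need to be matched against the corresponding start line to confirm that the failing job was pcron.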

Option C: pcron was executed but it failed to execute the *.Manager script for some reason

Whenever pcron executes a script it writes a log entry to the $BEST1_HOME/bgs/log/pcron/[hostname]-[username].log file. The entry looks something like this:

  [2008/07/24:23:35:00] executed 3 35 23 * * * /home2/best1data/manager/daily/topgun/May-05-2008.16.21/May-05-2007.00.05-Dec-21-2012.23.59.Manager >/dev/null 2>&1 .

If there was an immediate error running the command pcron will note in the log that it couldn't properly execute the script. So, if cron executed pcron then we should next check the pcron log file to determine if pcron says it properly executed the *.Manager script for that minute.
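The pcron log check can be sketched as a simple filter. This assumes the entry layout shown in the sample above ("[YYYY/MM/DD:HH:MM:SS] executed ..."); the function name pcron_manager_executions is hypothetical, and the wording of a failure entry isn't shown in this document, so the sketch only looks for successful "executed" entries.

```shell
# Hypothetical helper: prints the pcron log entries showing that a *.Manager
# script was executed on the given day.
# Assumption: entries look like "[2008/07/24:23:35:00] executed ... .Manager ...".
pcron_manager_executions() (
  day="$1"       # date in the log's YYYY/MM/DD form, e.g. 2008/07/24
  logfile="$2"   # $BEST1_HOME/bgs/log/pcron/[hostname]-[username].log
  grep -F "[$day:" "$logfile" | grep ' executed ' | grep -F '.Manager '
)
```

If the function prints nothing for the problem day, pcron has no record of launching the *.Manager script and the rest of that day's log entries are worth reading by hand.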

Option D: pcron was executed and successfully executed the *.Manager script but then the script failed soon after execution

This is the hardest problem to identify. At this point we'll have seen in the pcron log that pcron successfully executed the *.Manager script, but we know from the problem symptoms that the script never created the daily collection and processing scripts, didn't schedule them into pcron, and never started the udrCollectMgr processes. That indicates that although pcron was able to successfully execute the *.Manager script in the background via a perl system() call, the *.Manager script itself wasn't able to run to successful completion. This is usually the result of a system level problem that may be visible in the /var/adm/messages file (/var/log/messages on most Linux distributions). In the past I've seen this when all available process slots were in use (generally due to some runaway process spawning children wildly), when a file system was full, or when there was some other system level problem. A system level event is more likely in a scenario where multiple Manager runs didn't execute at the same minute (but pcron was executed by cron) or other applications were having problems on the machine at the same time.

TSCO Gateway Server on Linux 'nproc' setting

If cron is calling pcron but it appears that the *.XferData script isn't being executed or it isn't running the *.ProcessDay script, the problem could be related to the "max user processes" ulimit setting being reached.

When an environment has more than, say, 20 Manager runs and they have been configured with the same Processing Delay, it is possible to saturate the number of available threads for the user under which the Manager runs are scheduled. The 'max user processes' ulimit setting on Linux is, by default, 1024 threads (it's a limit on threads, not processes) and the nightly Manager runs will generate a number of threads, particularly between 11:30 PM and 12:15 AM when the udrCollectMgr processes have been started for the next day's collection and the udrCollectMgr processes are still running from the previous night's collection to complete the data transfer.

The recommendation would be to set the 'max user processes' soft ulimit to allow 1024 threads for every 25 Manager runs (so set it to 2048 for 50 runs). The soft limit can be set via the /etc/security/limits.conf file.
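The sizing rule can be written as a small calculation. This is a sketch of one reading of the recommendation: it rounds a partial block of 25 runs up to a full block, which is an assumption rather than something stated above, and the function name recommended_nproc is hypothetical.

```shell
# Hypothetical helper: suggested 'max user processes' soft limit for a given
# number of Manager runs, at 1024 threads per 25 runs.
# Assumption: a partial block of 25 runs is rounded up to a full block.
recommended_nproc() (
  runs="$1"
  blocks=$(( (runs + 24) / 25 ))     # integer ceiling of runs / 25
  [ "$blocks" -ge 1 ] || blocks=1    # never suggest less than 1024
  echo $(( blocks * 1024 ))
)
```

For 50 runs this prints 2048, matching the example above. The result can be compared with the current soft limit for the Manager user, which bash reports via ulimit -u.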

The symptoms of this problem are pretty subtle. Usually the symptoms will be that everything appears to be properly configured on the machine but the [date]-[date].manager.log doesn't show the *.XferData script executing the *.ProcessDay script and there is no other obvious reason why it wouldn't have been run.

RHEL 7 NOTE

On Linux the 'nproc' value is typically updated via the /etc/security/limits.conf file. But, on RHEL 7 there is another file where the ulimits can be set:
/etc/security/limits.d/20-nproc.conf

That file is deployed in an out-of-the-box OS install with the nproc 'soft' limit set to 4096 by default:
* soft nproc 4096

In an environment with a very large number of Manager runs it could be necessary to increase this value further. If the 'nproc' value is set in the /etc/security/limits.conf file on RHEL 7, the value in the 20-nproc.conf file (or in any other file in the limits.d directory that sets it) will override it.
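To see which file is supplying the effective value, it can help to list every active 'nproc' entry across limits.conf and the limits.d directory. The following is a minimal sketch; the function name list_nproc_settings is hypothetical, and the optional directory argument exists only so the function can be pointed at a copy of the configuration.

```shell
# Hypothetical helper: prints every uncommented nproc entry, prefixed with the
# file it comes from, for limits.conf first and then the limits.d files.
list_nproc_settings() (
  root="${1:-/etc/security}"
  for f in "$root/limits.conf" "$root"/limits.d/*.conf; do
    [ -f "$f" ] || continue
    # Skip comment lines; -H prefixes each match with its file name
    grep -H -E '^[[:space:]]*[^#[:space:]].*[[:space:]]nproc[[:space:]]' "$f" || true
  done
)
```

Running list_nproc_settings with no argument reads /etc/security. An entry from a limits.d file that appears after a matching limits.conf entry in the output is the one that overrides it.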

Section II: Impact mitigation for a potential future *.Manager failure

One of the easiest ways to prevent data loss from a Manager run execution failure is to enable the UDR Collection Manager "days in advance" collection feature (which is enabled by default to register collection requests for 3 days in advance). This feature instructs the Manager run to pre-register data collection requests on the remote node for up to 3 days in advance. If the Perform console fails for some reason (cron fails, console hardware fails, the network fails, lightning strike, whatever), the collection requests will already be registered on the remote node, so no data collection will be lost if the problem is caught and resolved within 3 days.

The following document describes how to enable the COLREQ_DAYS_ADVANCE feature via the $BEST1_HOME/local/setup/collectManager.cfg file in the "The COLREQ_DAYS_ADVANCE parameter" sub-section of "Section I: Recommendations for both the Unix and Windows console".

000105215: Perform Manager run configuration best practices for the Unix console (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000105215)

Section III: Monitoring for a potential future *.Manager failure

Although there isn't a solution within the product itself to monitor for this type of failure, it would be possible to create a basic custom script that monitored pcron to validate that all of the daily scripts were properly scheduled for each active Manager run.

Here is an inelegant way to see if there are any *.Manager scripts scheduled without associated *.XferData or *.Collect scripts:

$BEST1_HOME/bgs/scripts/pcrontab.sh -list | grep -v udrCollectMgr | cut -d " " -f 7 | awk -F\/ '{for(i=1;i < NF;i++) printf("%s/",$i); printf("\n"); }' | sort | uniq -c
 

The output would look something like this:

   3 /net/topgun/home3/paska/manager/operations/Jun-17-2008.13.47/
   3 /net/topgun/home3/paska/manager/topsecret/Jun-19-2008.14.19/
 

That says that there are 3 scripts scheduled in pcron in each of the listed Manager Output Directories. If you ran that command line after data processing had completed, when the only daily scripts still scheduled in pcron should be the *.Collect and *.XferData scripts for the current day, you could be alerted to directories where only a *.Manager script was scheduled. Obviously a better script could be written that monitors more things (like seeing that your *.Manager scripts are still listed in pcron).

This is just a quick idea on the type of monitoring you can do for this type of failure by monitoring pcrontab.sh output.
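As one possible refinement of the command line above, the sketch below reports only the Manager Output Directories that have a *.Manager script scheduled but no *.Collect or *.XferData script. It assumes, as the cut command above does, that the script path is the seventh space-separated field of each pcrontab.sh -list line; the function name find_orphan_managers is hypothetical.

```shell
# Hypothetical helper: reads "pcrontab.sh -list" output on stdin and prints
# each directory that has a *.Manager entry but no *.Collect or *.XferData entry.
# Assumption: the script path is the 7th whitespace-separated field.
find_orphan_managers() {
  awk '
    /udrCollectMgr/ { next }                 # ignore collector entries
    {
      path = $7
      dir = path
      sub("/[^/]*$", "/", dir)               # strip the file name, keep the directory
      if (path ~ /\.Manager$/) manager[dir] = 1
      else if (path ~ /\.(Collect|XferData)$/) daily[dir]++
    }
    END { for (d in manager) if (!(d in daily)) print d }
  ' | sort
}
```

It would be used as $BEST1_HOME/bgs/scripts/pcrontab.sh -list | find_orphan_managers, for example from a scheduled job that sends an alert whenever the output is not empty.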

Related Products:

TrueSight Capacity Optimization
BMC Performance Assurance for Servers

Legacy ID:KA300179