In TrueSight Capacity Optimization (TSCO), when debugging problems with the import of TSCO Agent data into the TSCO database via the TSCO Gateway Server ETL, it is usually best to start by breaking the problem down into one of four stages:
| Stage | Description |
|---|---|
| Collect | The sending of collection requests from the TSCO Gateway Server to the TSCO Agent and the collection of data on the TSCO Agent server |
| Transfer | The transfer of the UDR data collected on the TSCO Agent server back to the TSCO Gateway Server |
| Process | The processing of the data by the Manager run into an output VIS file |
| Import | The TSCO Gateway Server VIS Parser ETL's detection of the Manager run and VIS file on the TSCO Gateway Server, the transfer of that VIS file to the ETL Engine, the parsing of that VIS file, and the LOAD of the data it contains into the STAGING area of the TSCO database |
A common indicator of which stage of the nightly data flow is failing is how many servers are affected. Looking at the number of servers affected, and whether there is any common relationship between those servers, is often a good starting point for narrowing down which part of the data flow from the TSCO Agent into the TSCO database is experiencing a problem. Here is a table that maps the extent of the problem to the most likely failing stage (and thus the initial focus for debugging):
| Extent | What is affected | Initial debugging focus | Description |
|---|---|---|---|
| Very Small | A single computer | Collect | When the problem is limited to a single computer, the most likely cause is a data collection, transfer, or processing problem related to that machine itself |
| Small | A small number of computers | Collect/Transfer | When problems are limited to a small number of computers, a good initial consideration is whether there is any commonality between them. For example, are all the computers in the same data center, Manager run, or Gateway Server? Are they all running the same OS type or version? Recognizing how the failing computers are related can provide a good pointer for how to begin investigating the problem |
| Medium | A moderate group of computers, for example all computers in a particular data center or Manager run | Transfer/Processing | If a larger number of computers are failing, it is often easier to see commonality between them. They may all be in the same data center or Manager run, or all processing together on the same Gateway Server. The more computers that are failing, the less likely the problem exists on the TSCO Agent side, and the more likely it exists on the TSCO Gateway Server or in the ETL data import |
| Large | A large group of computers, for example all computers managed by a particular Gateway Server | Processing/Import | When a large group of computers is failing, the focus should be on finding a major component of the data processing or the import into the TSCO database whose failure could explain the scope of the problem. That could be a problem affecting the TSCO Gateway Server, the Gateway Server ETL execution, or another component that is common across the data flows |
| Very Large | A very large group of computers, for example all computers | Import | If the problem affects all computers in an environment, particularly across multiple Gateway Servers, then the initial investigation should begin on the TSCO Application Server/ETL Engine side. The problem could be related to a TSCO Datahub outage (affecting all data import, even data import beyond the TSCO Agents) or a problem affecting the ETL Engine Scheduler (so the Gateway Server VIS Parser ETL isn't being executed) |
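As a rough illustration of the triage table above, the mapping from "how many computers are affected" to "where to start debugging" can be sketched as a small lookup. Note that the numeric thresholds below are illustrative assumptions; the table above deliberately describes the extents qualitatively, not as exact percentages.

```python
def initial_debug_focus(affected: int, total: int) -> str:
    """Suggest an initial debugging focus from the number of affected
    computers, following the triage table above.

    The percentage cut-offs are illustrative assumptions only; the
    source table describes the extents qualitatively.
    """
    if affected <= 1:
        return "Collect"                 # Very Small: a single computer
    frac = affected / total
    if frac < 0.05:
        return "Collect/Transfer"        # Small: look for commonality
    if frac < 0.25:
        return "Transfer/Processing"     # Medium: e.g. one data center or run
    if frac < 0.90:
        return "Processing/Import"       # Large: e.g. one Gateway Server
    return "Import"                      # Very Large: start at the ETL/Datahub side
```

For example, one failing computer out of a thousand suggests starting with Collect, while a thousand out of a thousand suggests starting with Import.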
A good resource to determine the extent of the problem and identify a starting point for investigating TSCO Agent data availability problems is the TSCO Gateway Manager UI.
Gateway Manager UI
Navigation: Gateway Operations -> Status/Recover Runs
Provides: A table of each active Manager run, with failure counts broken down by collect, transfer, and processing states.
Things to look for:
- Do you see the Manager run listed there? If the Manager run isn't listed, that indicates a problem on the TSCO Gateway Server.
- What is the status of Collect, Transfer, and Process for the run? Since the Status/Recover Runs screen breaks the status of each run down by Collect, Transfer, and Processing failure percentages, this can be a good starting point for choosing the initial investigation focus area.

Navigation: Gateway Reports -> Node History
Provides: A table containing the collect, transfer, and processing status of each computer in the active Manager runs.
Things to look for:
- Do you see the computers listed there with a current 'Date'? The Node History shows the Collection and Transfer status of every computer being managed by a Manager run on that Gateway Server.
- If so, what is the collection status? If there is a problem on the data collection or data transfer side, it will be visible here and associated with an error code.

Navigation: Gateway Logs -> Manager Run Logs
Provides: Access to the ProcessDay.out and udrCollectMgr log files for the Manager run.
Things to look for: Find the Manager run in the list on the left side and click the 'Run Logs' view button. The logs are broken into two parts: after data processing is complete, the ProcessDay.out output appears at the top (under a "### Manager Log ####" heading), followed by the UCM logs section (under a "##### UCM Log #######" heading). The Manager Log section contains the output of the data processing; the UCM Log section contains the logs related to data collection and data transfer.
Gateway Manager components
| Component | Description |
|---|---|
| pcron | Manages the execution of the scripts associated with the nightly Manager run. pcron is scheduled in the TSCO Gateway Server Installation Owner's crontab and executes every minute. It checks whether any scripts need to be executed that minute and, if so, runs them; otherwise it terminates |
| [begin]-[end].Manager script | Controls the nightly Manager runs. It is executed each night, by default 30 minutes before the start of data collection, starts the udrCollectMgr processes, and creates the scripts used for monitoring data collection and transfer and for processing the data |
| [date]-[date].Collect script | Ensures that there is a udrCollectMgr process running for the Manager run |
| [date]-[date].XferData script | Checks that the Collect and Transfer phases of the Manager run are complete and, when they are, executes the ProcessDay script |
| [date]-[date].ProcessDay script | Manages data processing for the nightly Manager run. Its purpose is to generate the output VIS file, which contains the results of the processing and becomes the input to the TSCO Gateway Server VIS Parser ETL. The ProcessDay script also manages aspects of the Manager run such as data cleanup, and it triggers the Combine process, which combines the interval VIS files into the final full-run VIS file |
| udrCollectMgr process | Manages data collection and data transfer for nightly Manager runs |
| Analyze (bgsanalyze) process | Executed by the ProcessDay script; reads the raw UDR data and converts it into an 'interval' VIS file. Analyze also creates a model (*.md) file, a mathematical representation of the system, which becomes the input to the Predict component |
| Predict (bgspredict) process | Run by the ProcessDay script when response time calculation is enabled in the Manager run configuration. It reads the model (*.md) file created by Analyze and calculates Response Time values for each workload defined in the run |
| GeneralManagerServer | Listens on port 10129 and is how the TSCO Gateway Manager and TSCO Gateway Server VIS Parser ETL communicate with the TSCO Gateway Server. For secure communication, the GeneralManagerServer process may be configured by the TSCO Gateway Server install to be fronted by an Apache web server listening on port 10130, which supports HTTPS communication |
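Since the VIS Parser ETL reaches the Gateway Server through GeneralManagerServer on port 10129 (or port 10130 when fronted by Apache for HTTPS), a quick TCP reachability check can rule out basic network or listener problems. The following is a generic sketch, not a product tool, and the hostname in the comments is hypothetical:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Useful as a first check that GeneralManagerServer (10129) or the
    Apache HTTPS front end (10130) is actually listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage against a Gateway Server host:
# port_reachable("gateway01.example.com", 10129)  # plain GeneralManagerServer
# port_reachable("gateway01.example.com", 10130)  # HTTPS front end, if configured
```

A successful TCP connect only proves something is listening; protocol-level failures (for example, an HTTPS handshake problem on 10130) still need to be checked in the ETL and GeneralManagerServer logs.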
See KA 000304391: ‘TrueSight Capacity Optimization (TSCO) - Overview of the nightly Gateway/BPA Manager Runs data processing process activity’ (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000304391) for a more detailed description of the nightly Manager run workflow and what time different components of the nightly processing are executed.
Command Line Debugging
Debugging can also be performed from the command line of the TSCO Gateway Server on Linux. Note that these commands need to be run as the TSCO Gateway Server Installation Owner (the owner of the running 'udrCollectMgr' processes on the machine).

(1) A good starting command is listManagerRuns.pl, which lists each scheduled Manager run on the TSCO Gateway Server:

    $BEST1_HOME/bgs/scripts/listManagerRuns.pl -p MANAGER_COMMANDS_FILE OUTPUT_DIRECTORY

Sample output:

    #MANAGER_COMMANDS_FILE,OUTPUT_DIRECTORY,
    /usr/adm/best1_workspace/automation/tsco105labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.43_tsco105labs,
    /usr/adm/best1_workspace/automation/tsco107labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.43_tsco107labs,
    /usr/adm/best1_workspace/automation/tsco103labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.44_tsco103labs,
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.37_tsco2002labs,

This command lists each scheduled Manager run and the Manager Output Directory associated with that run. What we're looking for here is that each of the expected Manager runs is listed as running. This is also a useful command for quickly mapping a VCMDS file to its active Manager Output Directory.

(2) Check whether there is a running 'udrCollectMgr' process for the VCMDS file name associated with the Manager run:

    ps -ef | grep udr | grep [vcmds_name]

Where [vcmds_name] is the name of the VCMDS file. If there isn't one, that could indicate a problem with the Manager run (perhaps it isn't scheduled, or it is scheduled but the collection requests are all moving to ABORTED status).

    $ ps -ef | grep udr
    perform 12662 1 0 Aug18 ? 00:00:02 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco105labs.vcmds
    perform 12690 1 0 Aug18 ? 00:00:05 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco107labs.vcmds
    perform 12720 1 0 Aug18 ? 00:00:01 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco103labs.vcmds
    perform 12750 1 0 Aug18 ? 00:00:03 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco2002labs.vcmds

There should be one running udrCollectMgr process per active Manager run. If the udrCollectMgr process isn't running, the problem could be that (a) the Manager run was not properly initialized for the day, or (b) all of the TSCO Agent collection requests in the run transitioned to ABORTED status, causing the udrCollectMgr to terminate. The udrCollectMgr logs for the run are a good starting point for debugging if the udrCollectMgr isn't running.

The udrCollectStat command is a good way to get information about the status of data collection and data transfer for all computers managed by a Gateway Server. Information on the formatting flags for the udrCollectStat command (which let you select the information shown in the table) is available in KA 000202580: 'What are the format flags available for the udrCollectStat command on the TSCO Gateway Server?'
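When working with many runs, the comma-separated output of listManagerRuns.pl shown in step (1) can be turned into a VCMDS-to-output-directory map with a few lines of Python. This is a sketch based only on the sample output above; the exact field layout is an assumption:

```python
import os

def parse_list_manager_runs(text: str) -> dict:
    """Map each VCMDS file name to its active Manager Output Directory.

    Assumes the comma-separated layout shown in the listManagerRuns.pl
    sample output: MANAGER_COMMANDS_FILE,OUTPUT_DIRECTORY per line,
    with a header line prefixed by '#' and a trailing comma per line.
    """
    runs = {}
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if not line or line.startswith("#"):
            continue  # skip blanks and the header line
        vcmds, outdir = line.split(",")[:2]
        runs[os.path.basename(vcmds)] = outdir
    return runs
```

This makes step (7) below quicker: given a run name such as tsco105labs.vcmds, the map gives the directory holding its [date]-[date].ProcessDay.out files.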
(https://community.bmc.com/s/article/Inquira-KA307660)

(3) The following command lists the current day's collection status for all computers in a Manager run:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %gc" | grep [run_name]

Where [run_name] is the name of the Manager VCMDS file, such as 'data.vcmds'. Sample output:

    $ $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %gc" | grep tsco2002labs
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-cs: SENDING_REQUEST, WARNING, 91, Service daemon invalid host name provided (can not find server or DNS error) 0
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-mp: COLLECT_RUNNING, OK, 105, Collect request is already scheduled at the agent 56
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-sa: COLLECT_RUNNING, OK, 105, Collect request is already scheduled at the agent 56

(4) The following command lists the collection and transfer status for yesterday for all computers in a Manager run:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date --date=yesterday +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %th %te %tes %tg %tt" | grep [run_name]

Where [run_name] is the name of the Manager VCMDS file, such as 'data.vcmds'. Sample output:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date --date=yesterday +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %th %te %tes %tg %tt" | grep tsco2002labs
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-cs: ABORTED, ERROR, 91, Service daemon invalid host name provided (can not find server or DNS error) N/A 0 Normal Operation 0 0
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-mp: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 25 1
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-sa: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 23 1
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-db-mp: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 23 2

(5) The following command lists all of the Manager runs that have transitioned to FINISHED state on the TSCO Gateway Server:

    $BEST1_HOME/bgs/bin/GeneralManagerClient -s localhost:10129 listFinishedManagerRuns

This output is useful to verify that runs are being executed and that the [date]-[date].ProcessDay script is running to completion. It is the list of Manager runs that are visible to the TSCO Gateway Server VIS Parser ETL.

(6) If it is suspected that problems are related to data collection or data transfer, those messages will be in the UDR Collection Manager (UCM) logs. The UCM log files for each Manager run can be found in the $BEST1_HOME/local/manager/log directory. The log file names are in the format [gws_hostname]-[run_name]-MM-DD-YYYY-longnumber.log (and .log.bak).

(7) The output from the [date]-[date].ProcessDay script, which manages the nightly data processing once udrCollectMgr has completed the data collection and transfer phase, is written to the [date]-[date].ProcessDay.out files in the Manager Output Directory associated with the Manager run. The output of the earlier listManagerRuns.pl command provides a mapping of Manager run names to their associated Manager Output Directories. Change directory into the listed output directory; there should be files named [date]-[date].ProcessDay.out. Start with the most recent dates where the problem is occurring.
If a new problem has recently started it can also be useful to compare the data processing flow in the current logs with the flow from the last good day of processing to look for differences.
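When a run contains hundreds of computers, scanning the udrCollectStat output from step (3) by eye is tedious. A short parser can pull out just the unhealthy nodes. The field layout below is inferred only from the sample output above (the `%n` node name is the last space-separated field before the colon, followed by comma-separated status, health, and error code), so treat it as an assumption:

```python
def parse_collect_status(line: str):
    """Parse one line of udrCollectStat output formatted with
    '%v %r %d %n: %s, %ch, %ce, %ces %gc' into
    (node, status, health, error_code).

    Layout inferred from the sample output above (an assumption).
    """
    head, _, tail = line.partition(": ")
    node = head.split()[-1]                      # %n: last field before ':'
    parts = [p.strip() for p in tail.split(",", 3)]
    status, health, code = parts[0], parts[1], int(parts[2])
    return node, status, health, code

def unhealthy_nodes(lines):
    """Return (node, status, health, code) tuples whose health flag is not OK."""
    parsed = (parse_collect_status(l) for l in lines if ": " in l)
    return [p for p in parsed if p[2] != "OK"]
```

Applied to the step (3) sample output, this would flag tsco2002-as-cs (WARNING, error code 91) while skipping the two COLLECT_RUNNING/OK nodes.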
Gateway Server Log Files
| Component | Log File | Contents |
|---|---|---|
| pcron | $BEST1_HOME/bgs/log/pcron/[hostname]-[username].log | Output of pcron, which manages the execution of the scripts scheduled in pcron, including the nightly Manager run script execution |
| AutoNodeDiscovery | $BEST1_HOME/bgs/manager/log/[hostname]-AutoNodeDiscovery.log | Output of the AutoNodeDiscovery command, which refreshes the Manager run domain files each night based upon the run's Agent List |
| GeneralManagerServer | $BEST1_HOME/bgs/manager/log/[hostname]-GeneralManager.log | Output of GeneralManagerServer, which is how the TSCO Application Server and ETL Engine communicate with the TSCO Gateway Server |
| UDR Collection Manager (by Manager run) | $BEST1_HOME/bgs/monitor/log/[hostname]-[mgrrun]-MM-DD-YYYY-longnumber.log | Output of udrCollectMgr, which manages data collection and transfer |
| [begin]-[end].Manager | /[Manager Output Directory]/best1manager.log | Output of the Manager script, which is executed each night to initialize the run for that day |
| [date]-[date].XferData | /[Manager Output Directory]/[date]-[date].manager.log | Output of the XferData script, which ensures that the udrCollectMgr binary is running throughout the day and restarts it as necessary |
| [date]-[date].ProcessDay | /[Manager Output Directory]/[date]-[date].ProcessDay.out | Output of the ProcessDay script, which manages the processing of the data, including the creation of the VIS file and data cleanup |
| GeneralManagerLite | $BEST1_HOME/bgs/manager/log/BCO_BPAStatusAndRecoveryManager.log | Output of the General Manager Lite script, which can be scheduled to send a nightly e-mail status update regarding nightly processing success over time |
| managerExceptions | $BEST1_HOME/bgs/manager/log/managerException.log | Output of the managerExceptions command, which is executed as part of the ProcessDay script to summarize the status of the processing |
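With this many log locations, a small helper that scans a log directory for recently modified files containing error markers can speed up the first pass. This is a generic sketch, not a product utility; the error markers to search for (for example "ERROR" or message codes beginning with "E-") are assumptions you should adjust to what your logs actually contain:

```python
import os
import time

def recent_error_logs(log_dir, markers=("ERROR", "E-"), max_age_hours=24.0):
    """Return {path: [matching lines]} for *.log files in log_dir that were
    modified within max_age_hours and contain any of the given markers.

    The default markers are illustrative assumptions, not a definitive
    list of TSCO error strings.
    """
    cutoff = time.time() - max_age_hours * 3600
    hits = {}
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not name.endswith(".log") or os.path.getmtime(path) < cutoff:
            continue  # skip non-log files and stale logs
        with open(path, errors="replace") as fh:
            lines = [l.rstrip("\n") for l in fh if any(m in l for m in markers)]
        if lines:
            hits[path] = lines
    return hits
```

For example, pointing it at $BEST1_HOME/bgs/manager/log after a failed night quickly surfaces which component logged errors, and in which lines.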
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| After a manual cleanup of old files in the Manager Output Directory, none of my Manager runs are active anymore | The [begin]-[end].Manager script is created in the Manager Output Directory when the Manager run is submitted and is never updated afterward. A time-based cleanup (for example, using the 'find' command) may have deleted the *.Manager script from the Manager Output Directory, which stops execution of the run | KA 000210037: 'For the TrueSight Capacity Optimization (TSCO) Gateway Server on Linux, what is the best way to re-initialize Manager runs if the *.Manager script hasn't been getting executed?' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000210037) |
| Gateway Server Manager runs aren't being executed at night, but the Manager script still exists and is scheduled in pcron | Either pcron isn't being executed by cron (if no Manager runs are being executed) or the Linux 'nproc' limit is being reached (if only a subset of runs aren't being executed) | KA 000229523: 'On Linux, a Gateway Server Manager run wasn't executed at night but the *.Manager script still exists and is still scheduled in pcron in TrueSight Capacity Optimization' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000229523) |
| Agent List based Manager runs duplicated after TSCO Gateway Server upgrade by the migrateManagerRuns command | This problem happens when the TSCO Gateway Server version 11.5.00 migrateManagerRuns command is used to migrate runs from an earlier TSCO Gateway Server release to the current version | KA 000263021: 'TrueSight Capacity Optimization (TSCO) - Agent List based Manager runs duplicated after TSCO Gateway Server upgrade by the migrateManagerRuns command' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000263021) |
| The TSCO Gateway Server VIS Parser ETL fails to connect to the TSCO Gateway Server on port 10130 when the Gateway Server is configured to communicate via HTTPS | This problem happens when the httpd server that fronts the GeneralManagerServer process isn't running, so nothing is listening on port 10130 | KA 000364210: 'TrueSight Capacity Optimization (TSCO) - After TSCO Gateway Server report the Gateway Manager and Gateway Server ETL is no longer able to connect via HTTPS port 10130' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000364210) |
| Manager runs have failed after one of the output file systems used by the run filled to 100% on the TSCO Gateway Server | If any of the directories used by the nightly Manager run fill up, that will destabilize the run. The impact could be that the runs fail until the file system utilization problem is corrected, or the Manager run could be corrupted and require resubmission | KA 000320139: 'TrueSight Capacity Optimization (TSCO) - Gateway Server Manager recovery options after the BEST1_HOME file system has filled up on the Linux console' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000320139) |
| Repeating error in the ProcessDay.out file after uploading an updated Analyze Commands File to change the workload characterization of the Manager run: "I-MSGLINE: Line number ###: [line contents] E-CPARSERERR Can't parse [line contents]" | The Analyze Commands File contains Windows new line (^M) characters at the end of each line | KA 000282761: 'Analyze on Linux fails to load an Analyze Commands File (*.an) created on the Windows console with error, "Can't parse [line]" for every line in the file' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000282761) |
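The last issue in the table, Windows line endings (^M) in an Analyze Commands File, is easy to check for before uploading the file. A minimal sketch:

```python
def has_windows_line_endings(path: str) -> bool:
    """Return True if the file contains carriage-return (\\r) characters,
    which show up as ^M on Linux and cause Analyze to fail with
    "Can't parse [line]" for every line of an Analyze Commands File (*.an).
    """
    with open(path, "rb") as fh:          # binary mode so \r is not translated
        return b"\r" in fh.read()
```

If the check is positive, converting the file with the standard dos2unix utility (where available), or stripping the \r characters by any other means, resolves the parse errors.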
Additional Resources
Connect with TrueSight - TSCO: Best Practices Using the Gateway Manager Webinar | This webinar reviews best practices for using the TrueSight Capacity Optimization Gateway Manager, including creating, managing, and recovering 'Agent List' based Manager runs enrolled in automated ETL management.