In TrueSight Capacity Optimization (TSCO), when debugging problems with the import of TSCO Agent data into the TSCO database via the TSCO Gateway Server ETL, it is usually best to start by breaking the problem down into one of four stages:
| Stage | Description |
|---|---|
| Collect | The sending of collection requests from the TSCO Gateway Server to the TSCO Agent and the collection of data on the TSCO Agent server |
| Transfer | The transfer of the UDR data collected on the TSCO Agent server back to the TSCO Gateway Server |
| Process | The processing of the data by the Manager run into an output VIS file |
| Import | The TSCO Gateway Server VIS Parser ETL's detection of the Manager run and VIS file on the TSCO Gateway Server, the transfer of that VIS file to the ETL Engine, the parsing of that VIS file, and the LOAD of the data it contains into the STAGING area of the TSCO database |
A common indicator of which stage of the nightly data flow is failing is how many servers are affected. Looking at the number of servers affected, and whether there is any common relationship between those servers, is often a good starting point for narrowing down which part of the data flow from the TSCO Agent into the TSCO database is experiencing a problem. Here is a table that maps the extent of the problem to the most likely failing stage (and thus the initial focus for debugging):
| Extent | What is affected | Initial debugging focus | Description |
|---|---|---|---|
| Very Small | A single computer | Collect | When the problem is limited to a single computer, the most likely cause is a data collection, transfer, or processing problem related to that machine itself |
| Small | A small number of computers | Collect/Transfer | When problems are limited to a small number of computers, a good initial consideration is whether there is any commonality between them. For example, are all the computers in the same data center, Manager run, or Gateway Server? Are they all running the same OS type or version? Recognizing how the failing computers are related can provide a good pointer for how to begin investigating the problem |
| Medium | A moderate group of computers, for example all computers in a particular data center or Manager run | Transfer/Processing | If a larger number of computers are failing, it is often easier to see commonality between them. They may all be in the same data center or Manager run, or all processing together on the same Gateway Server. The more computers that are failing, the less likely the problem exists on the TSCO Agent side, and the more likely it exists on the TSCO Gateway Server or in the ETL data import |
| Large | A large group of computers, for example all computers managed by a particular Gateway Server | Processing/Import | When a large group of computers is failing, the focus should be on finding a major component of the data processing or the import into the TSCO database whose failure could explain the scope of the problem. That could be a problem affecting the TSCO Gateway Server, the Gateway Server ETL execution, or another component that is common across the data flows |
| Very Large | A very large group of computers, for example all computers | Import | If the problem affects all computers in an environment, particularly across multiple Gateway Servers, then the initial investigation should begin on the TSCO Application Server/ETL Engine side. The problem could be related to a TSCO Datahub outage (affecting all data import, even data import beyond the TSCO Agents) or a problem affecting the ETL Engine Scheduler (so the Gateway Server VIS Parser ETL isn't being executed) |
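As a rough illustration of the triage table above, the mapping from "how many computers are affected" to "where to start debugging" can be sketched as a small lookup. Note that the numeric thresholds below are illustrative assumptions; the table above deliberately describes the extents qualitatively, not as exact percentages.

```python
def initial_debug_focus(affected: int, total: int) -> str:
    """Suggest an initial debugging focus from the number of affected
    computers, following the triage table above.

    The percentage cut-offs are illustrative assumptions only; the
    source table describes the extents qualitatively.
    """
    if affected <= 1:
        return "Collect"                 # Very Small: a single computer
    frac = affected / total
    if frac < 0.05:
        return "Collect/Transfer"        # Small: look for commonality
    if frac < 0.25:
        return "Transfer/Processing"     # Medium: e.g. one data center or run
    if frac < 0.90:
        return "Processing/Import"       # Large: e.g. one Gateway Server
    return "Import"                      # Very Large: start at the ETL/Datahub side
```

For example, one failing computer out of a thousand suggests starting with Collect, while a thousand out of a thousand suggests starting with Import.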
A good resource to determine the extent of the problem and identify a starting point for investigating TSCO Agent data availability problems is the TSCO Gateway Manager UI.
Gateway Manager UI
Navigation: Gateway Operations -> Status/Recover Runs
Provides: A table of each active Manager run, with failure counts broken down by collect, transfer, and processing states.
Things to look for:
- Do you see the Manager run listed there? If the Manager run isn't listed, that indicates a problem on the TSCO Gateway Server.
- What is the status of Collect, Transfer, and Process for the run? Since the Status/Recover Runs screen breaks the status of each run down by Collect, Transfer, and Processing failure percentages, this can be a good starting point for choosing the initial investigation focus area.

Navigation: Gateway Reports -> Node History
Provides: A table containing the collect, transfer, and processing status of each computer in the active Manager runs.
Things to look for:
- Do you see the computers listed there with a current 'Date'? The Node History shows the Collection and Transfer status of every computer being managed by a Manager run on that Gateway Server.
- If so, what is the collection status? If there is a problem on the data collection or data transfer side, it will be visible here and associated with an error code.

Navigation: Gateway Logs -> Manager Run Logs
Provides: Access to the ProcessDay.out and udrCollectMgr log files for the Manager run.
Things to look for: Find the Manager run in the list on the left side and click the 'Run Logs' view button. The logs are broken into two parts: after data processing is complete, the ProcessDay.out output appears at the top (under a "### Manager Log ####" heading), followed by the UCM logs section (under a "##### UCM Log #######" heading). The Manager Log section contains the output of the data processing; the UCM Log section contains the logs related to data collection and data transfer.
Gateway Manager components
| Component | Description |
|---|---|
| pcron | Manages the execution of the scripts associated with the nightly Manager run. pcron is scheduled in the TSCO Gateway Server Installation Owner's crontab and executes every minute. It checks whether any scripts need to be executed that minute and, if so, runs them; otherwise it terminates |
| [begin]-[end].Manager script | Controls the nightly Manager runs. It is executed each night, by default 30 minutes before the start of data collection, starts the udrCollectMgr processes, and creates the scripts used for monitoring data collection and transfer and for processing the data |
| [date]-[date].Collect script | Ensures that there is a udrCollectMgr process running for the Manager run |
| [date]-[date].XferData script | Checks that the Collect and Transfer phases of the Manager run are complete and, when they are, executes the ProcessDay script |
| [date]-[date].ProcessDay script | Manages data processing for the nightly Manager run. Its purpose is to generate the output VIS file, which contains the results of the processing and becomes the input to the TSCO Gateway Server VIS Parser ETL. The ProcessDay script also manages aspects of the Manager run such as data cleanup, and it triggers the Combine process, which combines the interval VIS files into the final full-run VIS file |
| udrCollectMgr process | Manages data collection and data transfer for nightly Manager runs |
| Analyze (bgsanalyze) process | Executed by the ProcessDay script; reads the raw UDR data and converts it into an 'interval' VIS file. Analyze also creates a model (*.md) file, a mathematical representation of the system, which becomes the input to the Predict component |
| Predict (bgspredict) process | Run by the ProcessDay script when response time calculation is enabled in the Manager run configuration. It reads the model (*.md) file created by Analyze and calculates Response Time values for each workload defined in the run |
| GeneralManagerServer | Listens on port 10129 and is how the TSCO Gateway Manager and TSCO Gateway Server VIS Parser ETL communicate with the TSCO Gateway Server. For secure communication, the GeneralManagerServer process may be configured by the TSCO Gateway Server install to be fronted by an Apache web server listening on port 10130, which supports HTTPS communication |
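Since the VIS Parser ETL reaches the Gateway Server through GeneralManagerServer on port 10129 (or port 10130 when fronted by Apache for HTTPS), a quick TCP reachability check can rule out basic network or listener problems. The following is a generic sketch, not a product tool, and the hostname in the comments is hypothetical:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Useful as a first check that GeneralManagerServer (10129) or the
    Apache HTTPS front end (10130) is actually listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage against a Gateway Server host:
# port_reachable("gateway01.example.com", 10129)  # plain GeneralManagerServer
# port_reachable("gateway01.example.com", 10130)  # HTTPS front end, if configured
```

A successful TCP connect only proves something is listening; protocol-level failures (for example, an HTTPS handshake problem on 10130) still need to be checked in the ETL and GeneralManagerServer logs.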
See KA 000304391: ‘TrueSight Capacity Optimization (TSCO) - Overview of the nightly Gateway/BPA Manager Runs data processing process activity’ (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000304391) for a more detailed description of the nightly Manager run workflow and what time different components of the nightly processing are executed.
Command Line Debugging
Debugging can also be performed from the command line of the TSCO Gateway Server on Linux. Note that these commands need to be run as the TSCO Gateway Server Installation Owner (the owner of the running 'udrCollectMgr' processes on the machine).

(1) A good starting command is listManagerRuns.pl, which lists each scheduled Manager run on the TSCO Gateway Server:

    $BEST1_HOME/bgs/scripts/listManagerRuns.pl -p MANAGER_COMMANDS_FILE OUTPUT_DIRECTORY

Sample output:

    #MANAGER_COMMANDS_FILE,OUTPUT_DIRECTORY,
    /usr/adm/best1_workspace/automation/tsco105labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.43_tsco105labs,
    /usr/adm/best1_workspace/automation/tsco107labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.43_tsco107labs,
    /usr/adm/best1_workspace/automation/tsco103labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.44_tsco103labs,
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds,/usr/adm/best1_workspace/results/May-08-2020.19.37_tsco2002labs,

This command lists each scheduled Manager run and the Manager Output Directory associated with that run. What we're looking for here is that each of the expected Manager runs is listed as running. This is also a useful command for quickly mapping a VCMDS file to its active Manager Output Directory.

(2) Check whether there is a running 'udrCollectMgr' process for the VCMDS file name associated with the Manager run:

    ps -ef | grep udr | grep [vcmds_name]

Where [vcmds_name] is the name of the VCMDS file. If there isn't one, that could indicate a problem with the Manager run (perhaps it isn't scheduled, or it is scheduled but the collection requests are all moving to ABORTED status).

    $ ps -ef | grep udr
    perform 12662 1 0 Aug18 ? 00:00:02 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco105labs.vcmds
    perform 12690 1 0 Aug18 ? 00:00:05 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco107labs.vcmds
    perform 12720 1 0 Aug18 ? 00:00:01 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco103labs.vcmds
    perform 12750 1 0 Aug18 ? 00:00:03 /usr/adm/best1_20.02.00/bgs/bin/udrCollectMgr -b /usr/adm/best1_20.02.00 -d 08-19-2021 -v /usr/adm/best1_workspace/automation/tsco2002labs.vcmds

There should be one running udrCollectMgr process per active Manager run. If the udrCollectMgr process isn't running, the problem could be that (a) the Manager run was not properly initialized for the day, or (b) all of the TSCO Agent collection requests in the run transitioned to ABORTED status, causing the udrCollectMgr to terminate. The udrCollectMgr logs for the run are a good starting point for debugging if the udrCollectMgr isn't running.

The udrCollectStat command is a good way to get information about the status of data collection and data transfer for all computers managed by a Gateway Server. Information on the formatting flags for the udrCollectStat command (which let you select the information shown in the table) is available in KA 000202580: 'What are the format flags available for the udrCollectStat command on the TSCO Gateway Server?'
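When working with many runs, the comma-separated output of listManagerRuns.pl shown in step (1) can be turned into a VCMDS-to-output-directory map with a few lines of Python. This is a sketch based only on the sample output above; the exact field layout is an assumption:

```python
import os

def parse_list_manager_runs(text: str) -> dict:
    """Map each VCMDS file name to its active Manager Output Directory.

    Assumes the comma-separated layout shown in the listManagerRuns.pl
    sample output: MANAGER_COMMANDS_FILE,OUTPUT_DIRECTORY per line,
    with a header line prefixed by '#' and a trailing comma per line.
    """
    runs = {}
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if not line or line.startswith("#"):
            continue  # skip blanks and the header line
        vcmds, outdir = line.split(",")[:2]
        runs[os.path.basename(vcmds)] = outdir
    return runs
```

This makes step (7) below quicker: given a run name such as tsco105labs.vcmds, the map gives the directory holding its [date]-[date].ProcessDay.out files.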
(https://community.bmc.com/s/article/Inquira-KA307660)

(3) The following command lists the current day's collection status for all computers in a Manager run:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %gc" | grep [run_name]

Where [run_name] is the name of the Manager VCMDS file, such as 'data.vcmds'. Sample output:

    $ $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %gc" | grep tsco2002labs
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-cs: SENDING_REQUEST, WARNING, 91, Service daemon invalid host name provided (can not find server or DNS error) 0
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-mp: COLLECT_RUNNING, OK, 105, Collect request is already scheduled at the agent 56
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-26-2021 tsco2002-as-sa: COLLECT_RUNNING, OK, 105, Collect request is already scheduled at the agent 56

(4) The following command lists the collection and transfer status for yesterday for all computers in a Manager run:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date --date=yesterday +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %th %te %tes %tg %tt" | grep [run_name]

Where [run_name] is the name of the Manager VCMDS file, such as 'data.vcmds'. Sample output:

    $BEST1_HOME/bgs/bin/udrCollectStat -D -d `date --date=yesterday +%m-%d-%Y` -f "%v %r %d %n: %s, %ch, %ce, %ces %th %te %tes %tg %tt" | grep tsco2002labs
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-cs: ABORTED, ERROR, 91, Service daemon invalid host name provided (can not find server or DNS error) N/A 0 Normal Operation 0 0
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-mp: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 25 1
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-as-sa: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 23 1
    /usr/adm/best1_workspace/automation/tsco2002labs.vcmds tsco2002labs 08-25-2021 tsco2002-db-mp: COMPLETE, WARNING, 98, No data collected for some of the configured metric groups OK 54 Collect Request - Agent data transfer successful 23 2

(5) The following command lists all of the Manager runs that have transitioned to FINISHED state on the TSCO Gateway Server:

    $BEST1_HOME/bgs/bin/GeneralManagerClient -s localhost:10129 listFinishedManagerRuns

This output is useful to verify that runs are being executed and that the [date]-[date].ProcessDay script is running to completion. It is the list of Manager runs that are visible to the TSCO Gateway Server VIS Parser ETL.

(6) If it is suspected that problems are related to data collection or data transfer, those messages will be in the UDR Collection Manager (UCM) logs. The UCM log files for each Manager run can be found in the $BEST1_HOME/local/manager/log directory. The log file names are in the format [gws_hostname]-[run_name]-MM-DD-YYYY-longnumber.log (and .log.bak).

(7) The output from the [date]-[date].ProcessDay script, which manages the nightly data processing once udrCollectMgr has completed the data collection and transfer phase, is written to the [date]-[date].ProcessDay.out files in the Manager Output Directory associated with the Manager run. The output of the earlier listManagerRuns.pl command provides a mapping of Manager run names to their associated Manager Output Directories. Change directory into the listed output directory; there should be files named [date]-[date].ProcessDay.out. Start with the most recent dates where the problem is occurring.
If a new problem has recently started it can also be useful to compare the data processing flow in the current logs with the flow from the last good day of processing to look for differences.
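When a run contains hundreds of computers, scanning the udrCollectStat output from step (3) by eye is tedious. A short parser can pull out just the unhealthy nodes. The field layout below is inferred only from the sample output above (the `%n` node name is the last space-separated field before the colon, followed by comma-separated status, health, and error code), so treat it as an assumption:

```python
def parse_collect_status(line: str):
    """Parse one line of udrCollectStat output formatted with
    '%v %r %d %n: %s, %ch, %ce, %ces %gc' into
    (node, status, health, error_code).

    Layout inferred from the sample output above (an assumption).
    """
    head, _, tail = line.partition(": ")
    node = head.split()[-1]                      # %n: last field before ':'
    parts = [p.strip() for p in tail.split(",", 3)]
    status, health, code = parts[0], parts[1], int(parts[2])
    return node, status, health, code

def unhealthy_nodes(lines):
    """Return (node, status, health, code) tuples whose health flag is not OK."""
    parsed = (parse_collect_status(l) for l in lines if ": " in l)
    return [p for p in parsed if p[2] != "OK"]
```

Applied to the step (3) sample output, this would flag tsco2002-as-cs (WARNING, error code 91) while skipping the two COLLECT_RUNNING/OK nodes.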
Gateway Server Log Files
| Component | Log File | Contents |
|---|---|---|
| pcron | $BEST1_HOME/bgs/log/pcron/[hostname]-[username].log | Output of pcron, which manages the execution of the scripts scheduled in pcron, including the nightly Manager run script execution |
| AutoNodeDiscovery | $BEST1_HOME/bgs/manager/log/[hostname]-AutoNodeDiscovery.log | Output of the AutoNodeDiscovery command, which refreshes the Manager run domain files each night based upon the run's Agent List |
| GeneralManagerServer | $BEST1_HOME/bgs/manager/log/[hostname]-GeneralManager.log | Output of GeneralManagerServer, which is how the TSCO Application Server and ETL Engine communicate with the TSCO Gateway Server |
| UDR Collection Manager (by Manager run) | $BEST1_HOME/bgs/monitor/log/[hostname]-[mgrrun]-MM-DD-YYYY-longnumber.log | Output of udrCollectMgr, which manages data collection and transfer |
| [begin]-[end].Manager | /[Manager Output Directory]/best1manager.log | Output of the Manager script, which is executed each night to initialize the run for that day |
| [date]-[date].XferData | /[Manager Output Directory]/[date]-[date].manager.log | Output of the XferData script, which ensures that the udrCollectMgr binary is running throughout the day and restarts it as necessary |
| [date]-[date].ProcessDay | /[Manager Output Directory]/[date]-[date].ProcessDay.out | Output of the ProcessDay script, which manages the processing of the data, including the creation of the VIS file and data cleanup |
| GeneralManagerLite | $BEST1_HOME/bgs/manager/log/BCO_BPAStatusAndRecoveryManager.log | Output of the General Manager Lite script, which can be scheduled to send a nightly e-mail status update regarding nightly processing success over time |
| managerExceptions | $BEST1_HOME/bgs/manager/log/managerException.log | Output of the managerExceptions command, which is executed as part of the ProcessDay script to summarize the status of the processing |
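With this many log locations, a small helper that scans a log directory for recently modified files containing error markers can speed up the first pass. This is a generic sketch, not a product utility; the error markers to search for (for example "ERROR" or message codes beginning with "E-") are assumptions you should adjust to what your logs actually contain:

```python
import os
import time

def recent_error_logs(log_dir, markers=("ERROR", "E-"), max_age_hours=24.0):
    """Return {path: [matching lines]} for *.log files in log_dir that were
    modified within max_age_hours and contain any of the given markers.

    The default markers are illustrative assumptions, not a definitive
    list of TSCO error strings.
    """
    cutoff = time.time() - max_age_hours * 3600
    hits = {}
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not name.endswith(".log") or os.path.getmtime(path) < cutoff:
            continue  # skip non-log files and stale logs
        with open(path, errors="replace") as fh:
            lines = [l.rstrip("\n") for l in fh if any(m in l for m in markers)]
        if lines:
            hits[path] = lines
    return hits
```

For example, pointing it at $BEST1_HOME/bgs/manager/log after a failed night quickly surfaces which component logged errors, and in which lines.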
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| After a manual cleanup of old files in the Manager Output Directory, none of my Manager runs are active anymore | The [begin]-[end].Manager script is created in the Manager Output Directory when the Manager run is submitted and is never updated afterward. A time-based cleanup (for example, using the 'find' command) may have deleted the *.Manager script from the Manager Output Directory, which stops execution of the run | KA 000210037: 'For the TrueSight Capacity Optimization (TSCO) Gateway Server on Linux, what is the best way to re-initialize Manager runs if the *.Manager script hasn't been getting executed?' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000210037) |
| Gateway Server Manager runs aren't being executed at night, but the Manager script still exists and is scheduled in pcron | Either pcron isn't being executed by cron (if no Manager runs are being executed) or the Linux 'nproc' limit is being reached (if only a subset of runs aren't being executed) | KA 000229523: 'On Linux, a Gateway Server Manager run wasn't executed at night but the *.Manager script still exists and is still scheduled in pcron in TrueSight Capacity Optimization' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000229523) |
| Agent List based Manager runs duplicated after TSCO Gateway Server upgrade by the migrateManagerRuns command | This problem happens when the TSCO Gateway Server version 11.5.00 migrateManagerRuns command is used to migrate runs from an earlier TSCO Gateway Server release to the current version | KA 000263021: 'TrueSight Capacity Optimization (TSCO) - Agent List based Manager runs duplicated after TSCO Gateway Server upgrade by the migrateManagerRuns command' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000263021) |
| The TSCO Gateway Server VIS Parser ETL fails to connect to the TSCO Gateway Server on port 10130 when the Gateway Server is configured to communicate via HTTPS | This problem happens when the httpd server that fronts the GeneralManagerServer process isn't running, so nothing is listening on port 10130 | KA 000364210: 'TrueSight Capacity Optimization (TSCO) - After TSCO Gateway Server report the Gateway Manager and Gateway Server ETL is no longer able to connect via HTTPS port 10130' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000364210) |
| Manager runs have failed after one of the output file systems used by the run filled to 100% on the TSCO Gateway Server | If any of the directories used by the nightly Manager run fill up, that will destabilize the run. The impact could be that the runs fail until the file system utilization problem is corrected, or the Manager run could be corrupted and require resubmission | KA 000320139: 'TrueSight Capacity Optimization (TSCO) - Gateway Server Manager recovery options after the BEST1_HOME file system has filled up on the Linux console' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000320139) |
| Repeating error in the ProcessDay.out file after uploading an updated Analyze Commands File to change the workload characterization of the Manager run: "I-MSGLINE: Line number ###: [line contents] E-CPARSERERR Can't parse [line contents]" | The Analyze Commands File contains Windows new line (^M) characters at the end of each line | KA 000282761: 'Analyze on Linux fails to load an Analyze Commands File (*.an) created on the Windows console with error, "Can't parse [line]" for every line in the file' (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000282761) |
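The last issue in the table, Windows line endings (^M) in an Analyze Commands File, is easy to check for before uploading the file. A minimal sketch:

```python
def has_windows_line_endings(path: str) -> bool:
    """Return True if the file contains carriage-return (\\r) characters,
    which show up as ^M on Linux and cause Analyze to fail with
    "Can't parse [line]" for every line of an Analyze Commands File (*.an).
    """
    with open(path, "rb") as fh:          # binary mode so \r is not translated
        return b"\r" in fh.read()
```

If the check is positive, converting the file with the standard dos2unix utility (where available), or stripping the \r characters by any other means, resolves the parse errors.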
Additional Resources
Connect with TrueSight - TSCO: Best Practices Using the Gateway Manager Webinar | This webinar reviews best practices for using the TrueSight Capacity Optimization Gateway Manager, including creating, managing, and recovering 'Agent List' based Manager runs enrolled in automated ETL management.