In TrueSight Capacity Optimization (TSCO), the most visible symptom of this problem is that the following diagnostic alert is reported for the vCenter Extractor Service:

BCO_ETL_WARN301: Service ## has detected a problem in data extraction. Please verify service status.

Looking at the ETL logs, the last messages from the extractor thread (ignoring the messages from the SAVER thread, which executes once per hour) are:

INFO [service-68-1]- Starting service [68] - execTimeDelay: 0ms
INFO [service-68-1]- [VMWareCollector] Querying VCenter data....

When the ETL is running normally, each time the EXTRACT thread executes it logs how long it took to extract the host/guest data, but at some point that message stops being reported:

> grep "Perf extraction for host" service*.log*
service68.log.2016-08-12-22:2016-08-12 22:03:28,497 INFO [service-68-1]- Perf extraction for host/vm took: 201.196 second(s)
service68.log.2016-08-12-22:2016-08-12 22:18:32,288 INFO [service-68-1]- Perf extraction for host/vm took: 201.355 second(s)
service68.log.2016-08-12-22:2016-08-12 22:33:26,689 INFO [service-68-1]- Perf extraction for host/vm took: 201.25 second(s)
service68.log.2016-08-12-22:2016-08-12 22:48:29,790 INFO [service-68-1]- Perf extraction for host/vm took: 201.355 second(s)
< -- No messages from the extraction thread from this point until the vCenter Extractor Service is restarted -- >

Here is what the beginning of a normal query looks like:

2016-08-12 22:00:00,409 INFO [service-68-1]- Starting service [68] - execTimeDelay: 0ms
2016-08-12 22:00:00,410 INFO [service-68-1]- [VMWareCollector] Querying VCenter data....
2016-08-12 22:00:04,282 INFO [service-68-1]- Updates have been detected. Time taken: 1.784 second(s). #updates: 61
2016-08-12 22:00:05,075 INFO [service-68-1]- Time taken to process delta configuration updates: 0.793 seconds
2016-08-12 22:00:05,075 INFO [service-68-1]- Extracting direct configuration metrics
2016-08-12 22:00:05,132 INFO [service-68-1]- Direct config metric extraction for resource pool's took: 0.056 second(s)
2016-08-12 22:00:07,297 INFO [service-68-1]- Direct config metric extraction for datastore's took: 2.165 second(s)
2016-08-12 22:00:07,301 INFO [service-68-1]- Setting host start time: 13/08/2016 01:24:40 +0000, cluster start time: 13/08/2016 01:25:00 +0000, datacenter start time: 13/08/2016 01:20:00 +0000, current VC server time: 13/08/2016 02:00:07 +0000
2016-08-12 22:00:07,301 INFO [service-68-1]- Collector: collecting host/vm performance information
2016-08-12 22:00:07,343 INFO [service-68-1]- Recovering host/vm perf information for host: vlxdpapesxc03.eur.bnymellon.net[host-6428], start time: 13/08/2016 01:09:40 +0000
2016-08-12 22:03:28,497 INFO [service-68-1]- Perf extraction for host/vm took: 201.196 second(s)
< -- cut -- >

Here is what the beginning of the query where the vCenter Extractor Service stopped working looks like:

2016-08-12 23:00:00,409 INFO [service-68-1]- Starting service [68] - execTimeDelay: 0ms
2016-08-12 23:00:00,410 INFO [service-68-1]- [VMWareCollector] Querying VCenter data....
< -- End of messages from this thread for the full hour -- >

In general, what is needed to identify the problem is to review the ETL logs to determine at what point the vCenter Extractor Service hung during the data extraction thread execution. So in the log, look for messages like this:

INFO [service-##-1]- Starting service [68] - execTimeDelay: 0ms
INFO [service-##-1]- [VMWareCollector] Querying VCenter data....
But without the message that indicates the end of extraction:

INFO [service-##-1]- Finished data extraction

After the extraction thread has hung, you will only see messages from the LOAD thread, which executes once per hour (by default) at 25 minutes past the hour.
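The check described above can be scripted with standard shell tools. The following is a minimal sketch, not a supported utility: it builds a sample log from the excerpts in this article (the file name /tmp/service68.log is only for illustration) and compares the number of extraction cycles that started against the number that finished.

```shell
# Build a sample log resembling the excerpts above
# (file name and log lines are illustrative, taken from this article).
cat > /tmp/service68.log <<'EOF'
2016-08-12 22:00:00,409 INFO [service-68-1]- Starting service [68] - execTimeDelay: 0ms
2016-08-12 22:00:00,410 INFO [service-68-1]- [VMWareCollector] Querying VCenter data....
2016-08-12 22:03:28,497 INFO [service-68-1]- Perf extraction for host/vm took: 201.196 second(s)
2016-08-12 23:00:00,409 INFO [service-68-1]- Starting service [68] - execTimeDelay: 0ms
2016-08-12 23:00:00,410 INFO [service-68-1]- [VMWareCollector] Querying VCenter data....
EOF

# A healthy EXTRACT cycle logs "Querying VCenter data" followed later by
# "Perf extraction for host/vm took"; a hung cycle starts but never finishes.
starts=$(grep -c "Querying VCenter data" /tmp/service68.log)
finished=$(grep -c "Perf extraction for host/vm took" /tmp/service68.log)
if [ "$starts" -gt "$finished" ]; then
  last=$(grep "Starting service" /tmp/service68.log | tail -n 1 | cut -d, -f1)
  echo "extractor appears hung; last extraction started at: $last"
fi
```

Run against this sample, the sketch reports the timestamp of the last extraction cycle that started (2016-08-12 23:00:00), which is where to begin reading the log.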
A code change to handle this problem with the vSphere Web API is tracked by Defect QM002189455 and has been implemented in the vCenter Extractor Service in the latest Cumulative Hot Fix (CHF) packages for TrueSight Capacity Optimization (TSCO) version 10.3 and later. The fix is initially available in:

TSCO 10.5: CHF version 10.5.00.01.C00006
TSCO 10.3: CHF version 10.3.00.01.C00021

The following KA has the download links for the latest TSCO CHF package:

000097159: Cumulative Hot Fixes for TrueSight Capacity Optimization (CO), CO Gateway Server, CO Agent, and CO Perceiver (https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000097159)

Additional Configuration

If the vCenter Extractor Service hangs even when running the required Cumulative Hot Fix (CHF) version, it may be necessary to manually set the ETL parameter that controls the extractor thread timeout period. The following steps describe how to do that:

(1) Under Administration -> ETL & System Tasks -> ETL Tasks, open your vCenter Extractor Service ETL and click the "edit this run configuration" button within the Run Configuration section of the screen.
(2) In the "Edit run configuration Default" screen, click the "this page" link at the bottom of the page in the "You can manually edit ETL properties from this page" text.
(3) In the 'Add new property' section at the bottom of the screen, enter "extract.vmware.timeout.period" and click the 'Add' button. That adds the property to the list with a value of "#INSERT VALUE#".
(4) Change the "#INSERT VALUE#" for the "extract.vmware.timeout.period" property to "1800" and click the Save button at the bottom of the screen.
(5) Click the 'Stop' button to stop the vCenter Extractor Service ETL (this is best done between 0 and 20 minutes past the hour, when the LOADER thread is least likely to be running within the ETL).
(6) Click the 'Start Active Configuration' button to restart the vCenter Extractor Service.

Additional Information

When running the appropriate TSCO patch version, this problem shouldn't occur for the vCenter Extractor Service because the ETL enforces timeouts that stop any vCenter Web API calls that hang. If the problem were to occur anyway, capture the following information and provide it to TSCO Technical Support via a case to debug it further:

(1) While the vCenter Extractor Service is in a hung state, execute a 'kill -3 [PID of scheduler]' command. This sends signal 3 (SIGQUIT) to the scheduler Java process, which causes it to dump its running threads and their current state into the /[TSCO Installation Directory]/scheduler/log/scheduler.out file.
(2) Capture the Log Grabber output from the ETL Engine.
(3) At this point, restart the TSCO Scheduler on the ETL Engine to fix the vCenter Extractor Service collection problem.
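The thread-dump capture in step (1) can be sketched as a small shell helper. This is only a sketch under assumptions: the pgrep pattern, the installation path in the usage comment, and the two-second wait are guesses to adapt to your environment; the JVM behavior (SIGQUIT triggers a full thread dump on the process's stdout) is standard Java.

```shell
# Sketch of step (1): trigger a JVM thread dump while the extractor is hung.
# kill -3 sends SIGQUIT; the JVM responds by writing a full thread dump to
# its stdout, which the scheduler redirects into scheduler.out.
capture_thread_dump() {
  pid=$1   # PID of the scheduler Java process
  out=$2   # file receiving the process's stdout (scheduler.out)
  kill -3 "$pid"
  sleep 2              # give the process a moment to write the dump
  tail -n 200 "$out"   # the dump appears at the end of the file
}

# Usage against a live system (process pattern and path are assumptions):
# capture_thread_dump "$(pgrep -f 'java.*scheduler')" \
#   "/opt/bmc/BCO/scheduler/log/scheduler.out"
```

In the resulting dump, look for the extractor thread (service-##-1) and check which vSphere Web API call it is blocked in; that stack trace is what Technical Support needs from step (1).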