MapReduce Health Checks

MapReduce Active JobTracker Health

This is a MapReduce service-level health check that checks for the presence of an active, running and healthy JobTracker. The check returns "Bad" health if the service is running and a running, active JobTracker cannot be found. In all other cases it returns the health of the running, active JobTracker. A failure of this health check may indicate stopped or unhealthy JobTracker roles, the need to issue a failover command to make some JobTracker active, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and one or more JobTrackers. Check the status of the MapReduce service's JobTracker roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the JobTracker Role Health Check MapReduce service-wide monitoring setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active MapReduce JobTracker before this health check fails, and the JobTracker Activation Startup Tolerance can be used to adjust the amount of time around JobTracker startup that the check allows for a JobTracker to be made active.

Short Name: Active JobTracker Health

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| Active JobTracker Detection Window | The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. | mapreduce_active_jobtracker_detecton_window | 3 | MINUTES |
| JobTracker Activation Startup Tolerance | The amount of time after JobTracker(s) start that the lack of an active JobTracker will be tolerated. This is intended to allow either the auto-failover daemon to make a JobTracker active, or a specifically issued failover command to take effect. | mapreduce_jobtracker_activation_startup_tolerance | 180 | SECONDS |
| JobTracker Role Health Check | When computing the overall MapReduce cluster health, consider the JobTracker's health | mapreduce_jobtracker_health_enabled | true | no unit |
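
The decision rule described for this check can be sketched in a few lines of Python. This is an illustration only, not Cloudera Manager code: the function name, its parameters, and the use of `None` for "no result" are hypothetical. The text does not say how the two tolerance settings combine or what is reported for a stopped service, so the sketch assumes that either window alone holds back a "Bad" result and that a stopped service yields no result.

```python
from typing import Optional

# Defaults copied from the property table for this check; note that
# 3 minutes and 180 seconds are the same length of time.
DETECTION_WINDOW_S = 3 * 60
STARTUP_TOLERANCE_S = 180


def active_jobtracker_health(
    service_running: bool,
    active_jt_health: Optional[str],
    seconds_since_jt_start: float,
    seconds_without_active_jt: float,
    enabled: bool = True,
) -> Optional[str]:
    """Return "Good", "Concerning", or "Bad", or None when no result applies."""
    if not enabled:
        return None  # check switched off via JobTracker Role Health Check
    if active_jt_health is not None:
        return active_jt_health  # mirror the active JobTracker's own health
    if not service_running:
        return None  # assumption: nothing is reported for a stopped service
    # Assumption: either window on its own is enough to hold back "Bad".
    if seconds_since_jt_start < STARTUP_TOLERANCE_S:
        return None
    if seconds_without_active_jt < DETECTION_WINDOW_S:
        return None
    return "Bad"
```

For example, `active_jobtracker_health(True, None, 600, 600)` evaluates to "Bad", while the same call made 60 seconds after JobTracker startup returns no result under the assumptions above.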

MapReduce Job Failure Ratio

This is a MapReduce service-level health check that checks that no more than a configured percentage of recently completed jobs have failed. A failure of this health check may indicate problems with the MapReduce service or with the failing jobs. Check the status of the MapReduce service for more details. This test can be configured using the Job Failure Ratio Thresholds, Job Failure Ratio Minimum Failing Jobs, and Job Failure Ratio Monitoring Period MapReduce service-wide monitoring settings.

Short Name: Job Failure Ratio

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| Job Failure Ratio Minimum Failing Jobs | The minimum number of jobs that must fail during the test time period before the threshold values will be checked. Until this number of jobs have failed in the test time period the health check will continue to return good health. | mapreduce_job_failure_ratio_minimum_jobs | 0 | no unit |
| Job Failure Ratio Monitoring Period | The time period to review when computing job failure ratio. Specified in minutes. | mapreduce_job_failure_ratio_window | 5 | MINUTES |
| Job Failure Ratio Thresholds | The health check thresholds of the number of recently failed jobs. Specified as a percentage of recently completed jobs. See Job Failure Ratio Monitoring Period. | mapreduce_job_failure_ratio_thresholds | critical:never, warning:never | PERCENT |
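
As a rough illustration of how the three settings interact, the following Python sketch evaluates the ratio for one monitoring period. The function and parameter names are hypothetical, `None` stands in for a threshold of "never", and the strict "greater than" comparison is an assumption, since the text does not say whether a ratio exactly at the threshold counts as a failure.

```python
from typing import Optional


def job_failure_ratio_health(
    failed_jobs: int,
    completed_jobs: int,
    minimum_failing_jobs: int = 0,
    warning: Optional[float] = None,   # percent; None models "never"
    critical: Optional[float] = None,  # percent; None models "never"
) -> str:
    """Evaluate the jobs that completed within one monitoring period."""
    # Until enough jobs have failed, the thresholds are not consulted.
    if completed_jobs == 0 or failed_jobs < minimum_failing_jobs:
        return "Good"
    failed_pct = 100.0 * failed_jobs / completed_jobs
    # Assumption: a threshold is crossed only when strictly exceeded.
    if critical is not None and failed_pct > critical:
        return "Bad"
    if warning is not None and failed_pct > warning:
        return "Concerning"
    return "Good"
```

With the defaults (both thresholds set to "never") the function always returns "Good", which matches a check that is effectively switched off until thresholds are configured. With a warning threshold of 20 and a critical threshold of 50, 3 failures out of 10 completed jobs would return "Concerning".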

MapReduce JobTracker Health

This is a MapReduce service-level health check that checks for the presence of a running, healthy JobTracker. The check returns "Bad" health if the service is running and the JobTracker is not running. In all other cases it returns the health of the JobTracker. A failure of this health check indicates a stopped or unhealthy JobTracker. Check the status of the JobTracker for more information. This test can be enabled or disabled using the JobTracker Role Health Check MapReduce service-wide monitoring setting.

Short Name: JobTracker Health

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| JobTracker Role Health Check | When computing the overall MapReduce cluster health, consider the JobTracker's health | mapreduce_jobtracker_health_enabled | true | no unit |

MapReduce Map Task Backlog

This is a MapReduce service-level health check that checks that the number of waiting map tasks in the cluster does not rise above some percentage of the number of total available map slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios, it may be normal to have large numbers of waiting map tasks. In such scenarios, this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Map Task Backlog Thresholds MapReduce service-wide monitoring setting.

Short Name: Map Task Backlog

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| MapReduce Map Task Backlog Thresholds | The health check thresholds of the number of map tasks in the backlog. Specified as a percentage of the total number of map slots. | mapreduce_map_backlog_thresholds | critical:never, warning:never | PERCENT |
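
The backlog percentage is simple arithmetic: waiting map tasks divided by total map slots. The sketch below shows one way to evaluate it. The names are hypothetical, `None` models a threshold of "never", and returning `None` for a cluster with no map slots is an assumption rather than documented behavior.

```python
from typing import Optional


def map_backlog_health(
    waiting_maps: int,
    total_map_slots: int,
    warning: Optional[float] = None,   # percent; None models "never"
    critical: Optional[float] = None,  # percent; None models "never"
) -> Optional[str]:
    """Compare the map backlog, as a percentage of map slots, to thresholds."""
    if total_map_slots <= 0:
        return None  # assumption: no slots means nothing to measure against
    backlog_pct = 100.0 * waiting_maps / total_map_slots
    if critical is not None and backlog_pct > critical:
        return "Bad"
    if warning is not None and backlog_pct > warning:
        return "Concerning"
    return "Good"
```

Because the backlog is measured against slots rather than against running tasks, the percentage can exceed 100. For instance, 150 waiting maps on a cluster with 100 map slots is a backlog of 150 percent, so thresholds above 100 are meaningful.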

MapReduce Map Task Locality

This is a MapReduce service-level health check that checks that no more than a configured percentage of recently completed maps were operating on rack-local or other-local data. The check returns "Concerning" health if the number of rack-local maps has reached a configured minimum number of maps and their percentage of all map tasks is greater than the warning threshold, or if the number of other-local maps has reached a configured minimum number of maps and their percentage of all map tasks is greater than the warning threshold. The test never returns "Bad" health. A failure of this health check may indicate problems with the configuration of the MapReduce service. Consider using the Fair Scheduler and changing its locality delay setting, mapred.fairscheduler.locality.delay. In some scenarios, it may be normal to have a large number of non-local maps. For example, data import maps are always non-local. In such scenarios, consider disabling one or more of the thresholds used by this test. This test can be configured using the Rack-Local Map Task Thresholds, Maps Locality Minimum Rack-Local Maps, Other-Local Map Task Thresholds, Maps Locality Minimum Other-Local Maps, and Map Tasks Locality Monitoring Period MapReduce service-wide monitoring settings.

Short Name: Map Task Locality

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| Map Tasks Locality Monitoring Period | The time period to monitor when computing health test results for map tasks locality. Specified in minutes. | mapreduce_maps_locality_window | 15 | MINUTES |
| Maps Locality Minimum Other-Local Maps | The minimum number of non-local maps that must complete during the test time period before the threshold values will be checked. Until this number of non-local maps have completed in the test time period the health check will continue to return good health. | mapreduce_maps_locality_minimum_other_locality_maps | 0 | no unit |
| Maps Locality Minimum Rack-Local Maps | The minimum number of rack-local maps that must complete during the test time period before the threshold values will be checked. Until this number of rack-local maps have completed in the test time period the health check will continue to return good health. | mapreduce_maps_locality_minimum_rack_local_maps | 0 | no unit |
| Other-Local Map Task Thresholds | The health check thresholds of the number of map tasks using non-local data. Specified as a percentage of other-local map tasks in the total number of map tasks. | mapreduce_other_local_thresholds | critical:never, warning:never | PERCENT |
| Rack-Local Map Task Thresholds | The health check thresholds of the number of map tasks using non-local data. Specified as a percentage of rack-local map tasks in the total number of map tasks. | mapreduce_rack_local_thresholds | critical:never, warning:never | PERCENT |
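
A minimal Python sketch of the locality rule follows, assuming the counts have already been collected for one monitoring period. All names are hypothetical and `None` models a warning threshold of "never". The sketch treats the minimum as "at least this many maps", following the property descriptions, and only consults warning thresholds because the test never returns "Bad".

```python
from typing import Optional


def map_locality_health(
    total_maps: int,
    rack_local_maps: int,
    other_local_maps: int,
    rack_warning: Optional[float] = None,   # percent; None models "never"
    other_warning: Optional[float] = None,  # percent; None models "never"
    min_rack_local: int = 0,
    min_other_local: int = 0,
) -> str:
    """Return "Good" or "Concerning"; this test never returns "Bad"."""
    if total_maps == 0:
        return "Good"  # assumption: no completed maps means nothing to flag

    def exceeds(count: int, minimum: int, warning: Optional[float]) -> bool:
        if warning is None or count < minimum:
            return False
        return 100.0 * count / total_maps > warning

    if exceeds(rack_local_maps, min_rack_local, rack_warning):
        return "Concerning"
    if exceeds(other_local_maps, min_other_local, other_warning):
        return "Concerning"
    return "Good"
```

Raising a minimum is one way to keep short bursts of non-local maps from triggering the check: with 30 rack-local maps out of 100 and a warning threshold of 20, the result is "Concerning", but it stays "Good" if the rack-local minimum is set to 50.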

MapReduce Reduce Task Backlog

This is a MapReduce service-level health check that checks that the number of waiting reduce tasks in the cluster does not rise above some percentage of the number of total available reduce slots. The behavior of this health check depends on how the MapReduce cluster is being used. In some scenarios, it may be normal to have large numbers of waiting reduce tasks. In such scenarios, this test should be disabled. In other scenarios, a failure of this health check may indicate a capacity planning problem or a problem with the MapReduce service. Check the status of the MapReduce service for more details. This test can be configured using the MapReduce Reduce Task Backlog Thresholds MapReduce service-wide monitoring setting.

Short Name: Reduce Task Backlog

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| MapReduce Reduce Task Backlog Thresholds | The health check thresholds for the number of reduce tasks in the backlog. Specified as a percentage of the total number of reduce slots. | mapreduce_reduce_backlog_thresholds | critical:never, warning:never | PERCENT |
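
Threshold defaults in this reference are written as a pair such as "critical:never, warning:never". The sketch below parses that displayed form and applies it to a reduce backlog. It is illustrative only: the helper names are hypothetical, and whether Cloudera Manager stores or accepts thresholds in exactly this string form is an assumption, not something this reference states.

```python
from typing import Dict, Optional


def parse_thresholds(spec: str) -> Dict[str, Optional[float]]:
    """Parse 'critical:X, warning:Y' into a dict; 'never' becomes None."""
    thresholds: Dict[str, Optional[float]] = {}
    for part in spec.split(","):
        level, _, value = part.strip().partition(":")
        value = value.strip()
        thresholds[level.strip()] = None if value == "never" else float(value)
    return thresholds


def reduce_backlog_health(
    waiting_reduces: int,
    total_reduce_slots: int,
    spec: str = "critical:never, warning:never",
) -> Optional[str]:
    """Compare waiting reduces, as a percentage of reduce slots, to thresholds."""
    if total_reduce_slots <= 0:
        return None  # assumption: no slots means nothing to measure against
    limits = parse_thresholds(spec)
    backlog_pct = 100.0 * waiting_reduces / total_reduce_slots
    critical = limits.get("critical")
    warning = limits.get("warning")
    if critical is not None and backlog_pct > critical:
        return "Bad"
    if warning is not None and backlog_pct > warning:
        return "Concerning"
    return "Good"
```

Modeling "never" as `None` keeps the comparison code free of sentinel numbers: a threshold that is absent simply cannot be exceeded.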

MapReduce Standby JobTracker Health

This is a MapReduce service-level health check that checks for a running, healthy JobTracker in standby mode. The check is disabled if the MapReduce service is not configured with multiple JobTrackers. Otherwise, the check returns "Concerning" health if either of two conditions is met: there is no JobTracker running in standby mode, or the running standby JobTracker is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy JobTrackers, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and some or all of the MapReduce JobTrackers. Check the status of the MapReduce service's JobTracker roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby JobTracker Health Check MapReduce service-wide monitoring setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active JobTracker before this health check fails.

Short Name: Standby JobTracker Health

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| Active JobTracker Detection Window | The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. | mapreduce_active_jobtracker_detecton_window | 3 | MINUTES |
| Standby JobTracker Health Check | When computing the overall cluster health, consider the health of the standby JobTracker. | mapreduce_standby_jobtrackers_health_enabled | true | no unit |
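
The standby rule reduces to a short decision function. The Python sketch below uses hypothetical names and represents a missing standby JobTracker as `None`. Returning "Good" when neither failure condition holds is an assumption; the text only spells out the disabled and "Concerning" cases.

```python
from typing import Optional


def standby_jobtracker_health(
    configured_jobtrackers: int,
    standby_health: Optional[str],
    enabled: bool = True,
) -> str:
    """standby_health is None when no JobTracker is running in standby mode."""
    # The check does not apply without multiple JobTrackers, or when the
    # Standby JobTracker Health Check setting is turned off.
    if not enabled or configured_jobtrackers < 2:
        return "Disabled"
    if standby_health is None:
        return "Concerning"  # no JobTracker running in standby mode
    if standby_health != "Good":
        return "Concerning"  # standby is running but in less than "Good" health
    return "Good"  # assumption: a healthy standby yields "Good"
```

Note that a standby JobTracker in "Bad" health still produces only "Concerning" here, since "Concerning" is the only failing result the description mentions for this check.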

MapReduce TaskTrackers Health

This is a MapReduce service-level health check that checks that enough of the TaskTrackers in the cluster are healthy. The check returns "Concerning" health if the number of "Good" TaskTrackers falls below a warning threshold, expressed as a percentage of the total number of TaskTrackers. The check returns "Bad" health if the combined number of "Good" and "Concerning" TaskTrackers falls below a critical threshold, expressed as a percentage of the total number of TaskTrackers. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 TaskTrackers, this check would return "Good" health if 95 or more TaskTrackers have "Good" health. It would return "Concerning" health if fewer than 95 TaskTrackers have "Good" health but at least 90 TaskTrackers have either "Good" or "Concerning" health. If more than 10 TaskTrackers have "Bad" health, this check would return "Bad" health. A failure of this health check indicates unhealthy TaskTrackers. Check the status of the individual TaskTrackers for more information. This test can be configured using the Healthy TaskTracker Monitoring Thresholds MapReduce service-wide monitoring setting.

Short Name: TaskTrackers Health

| Property Name | Description | Template Name | Default Value | Unit |
| --- | --- | --- | --- | --- |
| Healthy TaskTracker Monitoring Thresholds | The health check thresholds of the overall TaskTrackers health. The check returns "Concerning" health if the percentage of "Healthy" TaskTrackers falls below the warning threshold. The check returns "Bad" health if the total percentage of "Healthy" and "Concerning" TaskTrackers falls below the critical threshold. | mapreduce_tasktrackers_healthy_thresholds | critical:90.000000, warning:95.000000 | PERCENT |
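
The 100-TaskTracker example for this check can be reproduced with a small Python function. The function and parameter names are hypothetical, the default thresholds are the documented 95 and 90 percent, and returning `None` for a cluster with no TaskTrackers is an assumption.

```python
from typing import Optional


def tasktrackers_health(
    good: int,
    concerning: int,
    bad: int,
    warning: float = 95.0,   # percent of TaskTrackers that must be "Good"
    critical: float = 90.0,  # percent that must be "Good" or "Concerning"
) -> Optional[str]:
    """Aggregate per-TaskTracker health counts into one service-level result."""
    total = good + concerning + bad
    if total == 0:
        return None  # assumption: no TaskTrackers means no result
    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total
    if good_or_concerning_pct < critical:
        return "Bad"
    if good_pct < warning:
        return "Concerning"
    return "Good"
```

With 100 TaskTrackers, 95 "Good" and 5 "Bad" gives "Good"; 85 "Good", 5 "Concerning", and 10 "Bad" gives "Concerning"; and 85 "Good", 4 "Concerning", and 11 "Bad" gives "Bad", because only 89 percent are in "Good" or "Concerning" health.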