Service, Role, and Host Health
Cloudera Manager monitors the state of the services and roles running on your cluster, as well as the hosts in the cluster. Summary results for services and roles appear under the Services tab, where various health results determine an overall health assessment of the service or role. The Hosts tab shows similar summary results for the hosts.
Cloudera Manager also monitors a number of metrics for HDFS, MapReduce, HBase, ZooKeeper, Flume, Impala, and Cloudera Management service and role instances. These are reflected in the results shown in the Health Tests panel under the Status tab when you have selected an HDFS, MapReduce, HBase, ZooKeeper, Flume, Impala, or Cloudera Management service or role instance to view.
The overall health of a role or service is a roll-up of its health checks: if any health check is Bad, the health of the service or role is Bad. If any health check is Concerning (but none are Bad), the health of the service or role is Concerning.
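The roll-up rule can be illustrated with a minimal Python sketch. This is not Cloudera Manager code: the `Health` enum and the `roll_up` function are hypothetical names, only the three states named in this section are modeled, and treating an empty list of checks as Good is an assumption made for the example.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical ordering of the three health states named in the text."""
    GOOD = 0
    CONCERNING = 1
    BAD = 2


def roll_up(check_results):
    """Return the worst result among the individual health checks.

    Assumption: with no checks at all, the overall health is reported as GOOD.
    """
    return max(check_results, default=Health.GOOD)


if __name__ == "__main__":
    # One Bad check makes the overall health Bad.
    print(roll_up([Health.GOOD, Health.CONCERNING, Health.BAD]).name)
    # No Bad checks, but one Concerning check makes the overall health Concerning.
    print(roll_up([Health.GOOD, Health.CONCERNING]).name)
```

Ordering the states from best to worst lets the roll-up reduce to taking the maximum, which matches the rule that a single Bad result outweighs any number of Good or Concerning results.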
The health check results are presented in the Health Tests panel. For some of these, you can also chart the associated metrics over time. Other metrics are also shown as charts over a time range. See Viewing Service Status and Viewing Status for a Role Instance for more details. See Viewing Detailed Information about Hosts for details of the health of a host.
There are several types of health checks that can be performed, depending on the type of service or role instance:
- Simple pass/fail checks, such as whether a service or role started as expected, whether a DataNode is connected to its NameNode, or whether a TaskTracker is blacklisted. These checks result in the health of that metric being either Good or Bad.
- Metric-type tests, such as the number of file descriptors in use, the amount of disk space used or free, how much time is spent in garbage collection, or how many pages were swapped to disk in the previous 15 minutes. The results of these checks are compared to threshold values that determine whether everything is OK (for example, plenty of disk space is available), whether it is Concerning (disk space is getting low), or whether it is Bad (disk space is critically low).
- HDFS (NameNode), HBase, and ZooKeeper also run a health test known as the "canary" test, which exercises the service lightly to confirm that it is working and responsive.
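The threshold comparison used by metric-type tests can be sketched as follows, using free disk space as the example. The function name, its parameters, and the default threshold values (20% and 10% free) are hypothetical and chosen only for illustration; they are not Cloudera Manager defaults.

```python
def classify_free_space(free_fraction, concerning_below=0.20, bad_below=0.10):
    """Classify a free-disk-space fraction (0.0 to 1.0) against two thresholds.

    Assumption: both threshold defaults are made-up illustrative values.
    """
    if not 0.0 <= free_fraction <= 1.0:
        raise ValueError("free_fraction must be between 0.0 and 1.0")
    if bad_below > concerning_below:
        raise ValueError("bad_below must not exceed concerning_below")

    if free_fraction < bad_below:
        return "Bad"          # critically low amount of disk space
    if free_fraction < concerning_below:
        return "Concerning"   # disk space getting low
    return "Good"             # plenty of disk space available


if __name__ == "__main__":
    for fraction in (0.50, 0.15, 0.05):
        print(f"{fraction:.0%} free -> {classify_free_space(fraction)}")
```

Checking the Bad threshold first means a value that falls below both thresholds is reported with the more severe result.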
By default, most health checks are enabled and (if appropriate) configured with reasonable thresholds. You can modify threshold values by editing the Monitoring properties under the Configuration tab for the service. You can also enable or disable individual or summary health checks, and in some cases specify what should be included in the calculation of overall health for the service or role. See Configuring Monitoring Settings for more information.