HDFS Health Checks
HDFS Active NameNode Health
This is an HDFS service-level health check that checks for the presence of an active, running and healthy NameNode. The check returns "Bad" health if the service is running and a running, active NameNode cannot be found. In all other cases it returns the health of the running, active NameNode. A failure of this health check may indicate stopped or unhealthy NameNode roles, the need to issue a failover command to make some NameNode active, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and one or more NameNodes. Check the status of the HDFS service's NameNode roles and look in the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active HDFS NameNode before this health check fails, and the NameNode Activation Startup Tolerance can be used to adjust the amount of time around NameNode startup that the check allows for a NameNode to be made active.
Short Name: Active NameNode Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Active NameNode Detection Window | The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. | hdfs_active_namenode_detecton_window | 3 | MINUTES |
Active NameNode Role Health Check | When computing the overall HDFS cluster health, consider the active NameNode's health | hdfs_namenode_health_enabled | true | no unit |
NameNode Activation Startup Tolerance | The amount of time after NameNode(s) start that the lack of an active NameNode will be tolerated. This is intended to allow either the auto-failover daemon to make a NameNode active, or a specifically issued failover command to take effect. | hdfs_namenode_activation_startup_tolerance | 180 | SECONDS |
HDFS Canary
This is an HDFS service-level health check that checks that basic client operations are working and are completing in a reasonable amount of time. This check reports the results of a periodic "canary" test that performs the following sequence of operations. First, it creates a file. By default, the path is /tmp/.cloudera_health_monitoring_canary_timestamp. The canary test then writes a small amount of data to that file, reads that data back, and verifies that the data is correct. Lastly, the canary test removes the created file. The check returns "Bad" health if any of the basic operations fail. The check returns "Concerning" health if the canary test runs too slowly. A failure of this health check may indicate that the cluster is failing to satisfy basic client requests correctly or in a timely fashion. Check the status of the NameNode and other service-level health checks. Look in the Service Monitor logs for log messages from the canary test. Look in the NameNode logs for more details about the processing of the canary test requests. This test can be enabled or disabled using the HDFS Canary Health Check HDFS service-wide monitoring setting.
Short Name: HDFS Canary
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
HDFS Canary Health Check | Enables the health check that a client can create, read, write, and delete files | hdfs_canary_health_enabled | true | no unit |
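
The canary sequence described above (create, write, read back, verify, delete) can be sketched as follows. This is a minimal illustration only: it runs against the local filesystem rather than HDFS, and the function name `run_canary` and the `slow_seconds` cutoff are hypothetical, not Cloudera Manager settings.

```python
import os
import tempfile
import time


def run_canary(directory, slow_seconds=5.0):
    """Create, write, read back, verify, and delete a small file.

    Returns "Good", "Concerning" (completed but too slowly), or
    "Bad" (an operation failed or the data did not match).
    slow_seconds is an assumed cutoff for this sketch.
    """
    path = os.path.join(directory, ".canary_%d" % time.time_ns())
    payload = b"canary"
    start = time.monotonic()
    try:
        with open(path, "wb") as f:   # create the file and write data
            f.write(payload)
        with open(path, "rb") as f:   # read the data back
            data = f.read()
        os.remove(path)               # remove the created file
    except OSError:
        return "Bad"
    if data != payload:               # verify the data is correct
        return "Bad"
    elapsed = time.monotonic() - start
    return "Concerning" if elapsed > slow_seconds else "Good"


# Example run against the local temporary directory.
status = run_canary(tempfile.gettempdir())
```

The two failure modes stay separate in this sketch: any failed operation maps to "Bad", while a run that succeeds but exceeds the assumed time limit maps to "Concerning".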
HDFS Corrupt Blocks
This is an HDFS service-level health check that checks that the number of corrupt blocks does not rise above some percentage of the cluster's total blocks. HDFS calls a block corrupt if it has at least one corrupt replica along with at least one live replica. As such, a corrupt block does not indicate unavailable data, but it does indicate an increased chance that data may become unavailable. If none of a block's replicas are live, HDFS calls the block a missing block, not a corrupt block. HDFS automatically fixes corrupt blocks in the background. A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Blocks With Corrupt Replicas Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Corrupt Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Blocks With Corrupt Replicas Monitoring Thresholds | The health check thresholds of the number of blocks that have at least one corrupt replica. Specified as a percentage of the total number of blocks. | hdfs_blocks_with_corrupt_replicas_thresholds | critical:1.000000, warning:0.500000 | PERCENT |
HDFS Corrupt Replicas
This is an HDFS service-level health check that checks that the number of corrupt replicas does not rise above some percentage of the cluster's total blocks. A block in HDFS is usually made up of multiple replicas, so a corrupt replica does not by itself indicate unavailable data. Unavailable data is indicated by missing blocks. Corrupt replicas do indicate an increased chance that data may become unavailable. If none of a block's replicas are live, the block is called a missing block by HDFS. HDFS automatically fixes corrupt replicas in the background. A failure of this health check may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks. This test can be configured using the Corrupt Replicas Monitoring Thresholds HDFS service-wide monitoring setting. Note that the thresholded percentage is the number of corrupt replicas divided by the number of blocks, so it can exceed 100% in some cases.
Short Name: Corrupt Replicas
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Corrupt Replicas Monitoring Thresholds | The health check thresholds of the number of corrupt replicas. Specified as a percentage of the total number of blocks. Note that there are more replicas than blocks, so it is theoretically possible for this to be over one hundred percent. | hdfs_corrupt_blocks_thresholds | critical:1.000000, warning:0.500000 | PERCENT |
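
The percentage used by the corrupt blocks and corrupt replicas checks can be illustrated with a short calculation. The sketch below uses the default thresholds from the tables above (warning 0.5, critical 1.0). The function names are hypothetical, and treating the threshold boundary as exclusive ("rises above") is an assumption based on the wording of the descriptions.

```python
def percent_of_blocks(count, total_blocks):
    """Express a count (corrupt blocks or corrupt replicas) as a
    percentage of the cluster's total number of blocks."""
    if total_blocks == 0:
        return 0.0
    return 100.0 * count / total_blocks


def health_when_high_is_bad(percent, warning=0.5, critical=1.0):
    """Map a percentage to a health value for checks where a higher
    value is worse. Strict comparison is an assumption of this sketch."""
    if percent > critical:
        return "Bad"
    if percent > warning:
        return "Concerning"
    return "Good"


# 6 corrupt replicas in a cluster of 1000 blocks.
small = percent_of_blocks(6, 1000)

# Replicas are divided by blocks, so the result can exceed 100%:
# 3 corrupt replicas in a cluster that holds only 2 blocks.
large = percent_of_blocks(3, 2)
```

Because the numerator counts replicas and the denominator counts blocks, the second example produces a value above 100, which is the case the note in the description refers to.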
HDFS DataNodes Health
This is an HDFS service-level health check that checks that enough of the DataNodes in the cluster are healthy. The check returns "Concerning" health if the number of healthy DataNodes falls below a warning threshold, expressed as a percentage of the total number of DataNodes. The check returns "Bad" health if the number of healthy and "Concerning" DataNodes falls below a critical threshold, expressed as a percentage of the total number of DataNodes. For example, if this check is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 DataNodes, this check would return "Good" health if 95 or more DataNodes have "Good" health. This check would return "Concerning" health if fewer than 95 DataNodes have "Good" health but at least 90 DataNodes have either "Good" or "Concerning" health. If more than 10 DataNodes have "Bad" health, this check would return "Bad" health. A failure of this health check indicates unhealthy DataNodes. Check the status of the individual DataNodes for more information. This test can be configured using the Healthy DataNodes Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: DataNodes Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Healthy DataNodes Monitoring Thresholds | The health check thresholds of the overall DataNodes health. The check returns "Concerning" health if the percentage of "Healthy" DataNodes falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" DataNodes falls below the critical threshold. | hdfs_datanodes_healthy_thresholds | critical:90.000000, warning:95.000000 | PERCENT |
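
The 100-DataNode example for this check can be worked through in code. The sketch below applies the default thresholds (warning 95, critical 90) as described; the function name `datanodes_health` is hypothetical, and returning "Good" for an empty cluster is an assumption made only to keep the function total.

```python
def datanodes_health(good, concerning, bad, warning=95.0, critical=90.0):
    """Evaluate overall DataNode health from per-DataNode health counts.

    "Bad" if the percentage of Good-or-Concerning DataNodes is below
    the critical threshold; "Concerning" if the percentage of Good
    DataNodes is below the warning threshold; otherwise "Good".
    """
    total = good + concerning + bad
    if total == 0:
        return "Good"  # assumption: nothing to evaluate
    pct_good = 100.0 * good / total
    pct_good_or_concerning = 100.0 * (good + concerning) / total
    if pct_good_or_concerning < critical:
        return "Bad"
    if pct_good < warning:
        return "Concerning"
    return "Good"


# The cases from the example, for a cluster of 100 DataNodes.
case_good = datanodes_health(good=95, concerning=0, bad=5)
case_concerning = datanodes_health(good=92, concerning=3, bad=5)
case_bad = datanodes_health(good=85, concerning=4, bad=11)
```

The critical threshold is tested first, so a cluster with more than 10 "Bad" DataNodes out of 100 reports "Bad" regardless of how many of the remainder are "Good".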
HDFS Free Space
This is an HDFS service-level health check that checks that the amount of free space in the HDFS cluster does not fall below some percentage of total configured capacity. A failure of this health check may indicate a capacity planning problem, or a loss of DataNodes. This test can be configured using the HDFS Free Space Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Free Space
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
HDFS Free Space Monitoring Thresholds | The health check thresholds of free space in HDFS. Specified as a percentage of total HDFS capacity. | hdfs_free_space_thresholds | critical:10.000000, warning:20.000000 | PERCENT |
HDFS Missing Blocks
This is an HDFS service-level health check that checks that the number of missing blocks does not rise above some percentage of the cluster's total blocks. A missing block is a block with no live replicas: all of its replicas are either missing or corrupt. This may happen because of corruption or because DataNodes are offline or being decommissioned. A failure of this health check may indicate the loss of several DataNodes at once. If there are files stored in the cluster with a replication factor of 1, you may see missing blocks with the loss or malfunction of a single DataNode. Use the HDFS fsck command to identify which files contain missing blocks. This test can be configured using the Missing Block Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Missing Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Missing Block Monitoring Thresholds | The health check thresholds of the number of missing blocks. Specified as a percentage of the total number of blocks. | hdfs_missing_blocks_thresholds | critical:any, warning:never | PERCENT |
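
The default value `critical:any, warning:never` differs from the numeric thresholds used by the other checks. The sketch below shows one way to read such a value. It assumes that `any` means the level is triggered by any nonzero amount and that `never` means the level is disabled; that reading, and the function names, are assumptions of this example rather than documented behavior.

```python
def parse_thresholds(text):
    """Parse a value such as "critical:1.000000, warning:0.500000"
    or "critical:any, warning:never" into a dict."""
    result = {}
    for part in text.split(","):
        name, _, value = part.strip().partition(":")
        value = value.strip()
        if value in ("any", "never"):
            result[name.strip()] = value
        else:
            result[name.strip()] = float(value)
    return result


def exceeds(percent, threshold):
    """Assumed semantics: "never" is disabled, "any" fires on any
    nonzero value, and a number fires when the value rises above it."""
    if threshold == "never":
        return False
    if threshold == "any":
        return percent > 0
    return percent > threshold


def missing_blocks_health(missing, total_blocks,
                          thresholds_text="critical:any, warning:never"):
    """Evaluate the missing blocks percentage against the thresholds."""
    thresholds = parse_thresholds(thresholds_text)
    percent = 100.0 * missing / total_blocks if total_blocks else 0.0
    if exceeds(percent, thresholds["critical"]):
        return "Bad"
    if exceeds(percent, thresholds["warning"]):
        return "Concerning"
    return "Good"
```

Under these assumptions, a single missing block in a cluster of any size yields "Bad" with the default setting, and the check never reports "Concerning".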
HDFS NameNode Health
This is an HDFS service-level health check that checks for the presence of a running, healthy NameNode. The check returns "Bad" health if the service is running and the NameNode is not running. In all other cases it returns the health of the NameNode. A failure of this health check indicates a stopped or unhealthy NameNode. Check the status of the NameNode for more information. This test can be enabled or disabled using the Active NameNode Role Health Check HDFS service-wide monitoring setting.
Short Name: NameNode Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Active NameNode Role Health Check | When computing the overall HDFS cluster health, consider the active NameNode's health | hdfs_namenode_health_enabled | true | no unit |
HDFS Standby NameNode Health
This is an HDFS service-level health check that checks for a running, healthy NameNode in standby mode. The check is disabled if the HDFS service is not configured with multiple NameNodes. Otherwise, the check returns "Concerning" health if either of two conditions is met: there is no NameNode running in standby mode, or the running standby NameNode is in less than "Good" health. A failure of this health check may indicate one or more stopped or unhealthy NameNodes, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and some or all of the HDFS NameNodes. Check the status of the HDFS service's NameNode roles and the Cloudera Manager Service Monitor's log files for more information when this check fails. This test can be enabled or disabled using the Standby NameNode Health Check HDFS service-wide monitoring setting. In addition, the Active NameNode Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active NameNode before this health check fails.
Short Name: Standby NameNode Health
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Active NameNode Detection Window | The tolerance window that will be used in HDFS service tests that depend on detection of the active NameNode. | hdfs_active_namenode_detecton_window | 3 | MINUTES |
Standby NameNode Health Check | When computing the overall HDFS cluster health, consider the health of the standby NameNode. | hdfs_standby_namenodes_health_enabled | true | no unit |
HDFS Under-Replicated Blocks
This is an HDFS service-level health check that checks that the number of under-replicated blocks does not rise above some percentage of the cluster's total blocks. A failure of this health check may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under-replicated blocks. This test can be configured using the Under-replicated Block Monitoring Thresholds HDFS service-wide monitoring setting.
Short Name: Under-Replicated Blocks
Property Name | Description | Template Name | Default Value | Unit |
---|---|---|---|---|
Under-replicated Block Monitoring Thresholds | The health check thresholds of the number of under-replicated blocks. Specified as a percentage of the total number of blocks. | hdfs_under_replicated_blocks_thresholds | critical:40.000000, warning:10.000000 | PERCENT |