Metric Aggregation
It is often useful to see an aggregated view of the activity on a cluster. For example, you might want to see the average number of bytes read per DataNode, or the maximum number of bytes read by any DataNode. To make this easy, Cloudera Manager pre-aggregates many of these metrics and allows you to access them through charts.
What Metrics Are Aggregated
Cloudera Manager aggregates metrics based on the category of the entity that generated them. The categories map to components in the system such as hosts, disks, RegionServers, and HDFS services. Metrics are aggregated from their generating entity to larger entities of which they are a part. For example, metrics that are generated by disks, network interfaces, and file systems are aggregated to their respective hosts and clusters. Generally, this hierarchy is defined as follows:
- Disk, network interface, file system - host, cluster
- Host - cluster
- Role - service, cluster
- HTables - HBase service, cluster
- Agents - Flume service, cluster
- FlumeChannel, FlumeSource, FlumeSink - Flume service, cluster
Aggregate Types
- Maximum - the largest value for any entity
- Minimum - the smallest value for any entity
- Average - the average value for all entities
- Standard deviation - the standard deviation of the values for all entities
- Sum - the total of the values for all entities
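As an illustration of what each aggregate type means, the sketch below computes all five over a small set of per-entity values. The DataNode names and byte counts are made-up sample data, and the use of the population standard deviation is an assumption, since this page does not say which form Cloudera Manager uses.

```python
import statistics

# Hypothetical sample: bytes read by each DataNode in one sampling interval.
bytes_read = {"datanode-1": 1200.0, "datanode-2": 800.0, "datanode-3": 1000.0}


def aggregate(values):
    """Return the five aggregate types for a collection of per-entity values."""
    values = list(values)
    return {
        "max": max(values),              # largest value for any entity
        "min": min(values),              # smallest value for any entity
        "avg": statistics.mean(values),  # average value for all entities
        # Assumption: population standard deviation. The source does not
        # state whether the population or sample form is used.
        "std_dev": statistics.pstdev(values),
        "sum": sum(values),              # total of the values for all entities
    }


result = aggregate(bytes_read.values())
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Each aggregate is a single number per sampling interval, which is why one chart line is enough to track the whole cluster.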
Example Use Cases
Use Case 1: Compare the maximum, minimum, and average CPU usage across a cluster
- Select the .
- Enter the tsquery statement:
SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg
- Click Search. You should see three charts, each with CPU data.
- Click in the left column. Now you should see all the data on one chart.
Use Case 2: Compare the CPU usage of a single host to the max, min, and average for the cluster
- Follow the instructions from Use Case 1, except in step 2 enter the following statement instead:
SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg, cpu_percent where category=cluster or hostname='MYHOST.COM'
Aggregate Metric Names
An aggregate metric name is made up of three parts:
- The metric being aggregated - for example, cpu_percent or jvm_gc_count
- The category of the entity generating the metric - for example, "host" or "RegionServer"
- The aggregate type - for example, "max" or "avg"
These parts are combined to form an aggregate name such as "cpu_percent_host_max".
The name of the final component, the aggregate type, varies by the type of the metric. Cloudera Manager supports three types of metrics: gauges, weighted gauges, and counters.
Gauges
- maximum - "max"
- minimum - "min"
- average - "avg"
- standard deviation - "std_dev"
- sum - "sum"
Weighted Gauges
A weighted gauge weights a gauge value by the number of events that the gauge measures. Consider the HBase RegionServer metric put_avg_time, which tracks the average put time for each RegionServer. Suppose you have two RegionServers: one did 10,000 puts with an average time of one millisecond per put, and the other did 10 puts with an average time of one second per put. If you simply averaged the two averages, you would conclude that the average across the whole service was about half a second, which does not reflect reality, because almost every put took one millisecond.
Instead, if you weight each RegionServer's average by its number of puts, you get a more accurate number:
Total puts = 10,000 + 10 = 10,010 puts
Total time = (10,000 * 1 ms) + (10 * 1,000 ms) = 20,000 ms
Average time = (20,000 ms) / (10,010 puts) = ~2 ms
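The same arithmetic can be sketched in a few lines of code. This is only an illustration of the weighting idea using the two RegionServers from the example; the function names are hypothetical and are not part of Cloudera Manager.

```python
# Per-RegionServer figures from the example:
# (number of puts, average put time in milliseconds).
region_servers = [(10_000, 1.0), (10, 1_000.0)]


def naive_average(servers):
    """Average of the per-server averages, ignoring how many puts each did."""
    return sum(avg for _, avg in servers) / len(servers)


def weighted_sum(servers):
    """Total time spent on puts: each average multiplied by its put count."""
    return sum(count * avg for count, avg in servers)


def weighted_average(servers):
    """Total time divided by total puts."""
    total_count = sum(count for count, _ in servers)
    return weighted_sum(servers) / total_count


naive = naive_average(region_servers)        # about half a second
total_time = weighted_sum(region_servers)    # the weighted total, in ms
weighted = weighted_average(region_servers)  # roughly 2 ms

print(f"naive: {naive} ms, total: {total_time} ms, weighted: {weighted:.3f} ms")
```

The naive result is dominated by the server that handled only 10 puts, while the weighted result stays close to the one millisecond that nearly every put took.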
- maximum - "max"
- minimum - "min"
- average - "weighted_avg"
- standard deviation - "weighted_std_dev"
- sum - For weighted gauges, sum aggregations represent the weighted total and are not an average. In our example, the value would be 20,000 ms and the name would be put_time_regionserver_sum.
Counters
- maximum - "max_rate"
- minimum - "min_rate"
- average - "avg_rate"
- standard deviation - "std_dev_rate"
- sum - For counters, sum aggregations represent the total number of times an event occurred and are not a rate. In this case, the word "sum" is appended to the end of the name. For example: jvm_gc_count_regionserver_sum.
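The naming rules above can be summarized as a small lookup, sketched here as a hypothetical helper (`aggregate_metric_name` is not a Cloudera Manager function). It assumes that the category is lowercased, which is inferred from examples such as jvm_gc_count_regionserver_sum. For weighted gauges, the base metric name in an aggregate may differ from the raw metric name, as with put_avg_time and put_time_regionserver_sum, so this helper expects the caller to pass the base name that appears in the aggregate.

```python
# Suffix for each aggregate type, keyed by metric type, as listed above.
SUFFIXES = {
    "gauge": {
        "max": "max", "min": "min", "avg": "avg",
        "std_dev": "std_dev", "sum": "sum",
    },
    "weighted_gauge": {
        "max": "max", "min": "min", "avg": "weighted_avg",
        "std_dev": "weighted_std_dev", "sum": "sum",
    },
    "counter": {
        "max": "max_rate", "min": "min_rate", "avg": "avg_rate",
        "std_dev": "std_dev_rate", "sum": "sum",
    },
}


def aggregate_metric_name(metric, category, aggregate, metric_type="gauge"):
    """Join metric, entity category, and type-specific suffix with underscores.

    Assumption: the category is lowercased (for example "RegionServer"
    becomes "regionserver"), based on the example names on this page.
    """
    suffix = SUFFIXES[metric_type][aggregate]
    return f"{metric}_{category.lower()}_{suffix}"


print(aggregate_metric_name("cpu_percent", "host", "max"))
print(aggregate_metric_name("jvm_gc_count", "RegionServer", "sum", "counter"))
print(aggregate_metric_name("put_time", "RegionServer", "sum", "weighted_gauge"))
```

Before relying on a generated name in a tsquery statement, check it against the metric names that your Cloudera Manager version actually exposes.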