Metric Aggregation
It is often useful to see an aggregated view of the activity on a cluster. For example, you might want to see the average number of bytes read per DataNode, or the maximum number of bytes read by any DataNode. To make this easy, Cloudera Manager pre-aggregates many of these metrics and allows you to access them through our charts.
What Metrics Are Aggregated
Cloudera Manager aggregates metrics based on the category of the entity that generated them. The categories map to components in the system such as hosts, disks, RegionServers, and HDFS services.
Metrics are aggregated from their generating entity to larger entities they are a part of. For example, metrics that are generated by disks, network interfaces, and file systems are aggregated to their respective hosts and clusters. Generally, this hierarchy is defined as follows:
- Disks, Network Interfaces, File systems -> Hosts, Clusters
- Hosts -> Clusters
- Roles -> Services, Clusters
- HTables -> HBase Services, Clusters
- Agents -> Flume, Clusters
- FlumeChannel, FlumeSource, FlumeSink -> Flume, Clusters
Aggregate Types
Each aggregate is calculated every minute and takes into account all the metrics logged over the previous minute. For example, the metric cpu_percent_host_max takes into account all cpu_percent metrics logged by all hosts in a cluster in the previous minute. The following aggregate types are calculated:
- Max - the largest value for any entity
- Min - the smallest value for any entity
- Average - the average value for all entities
- Standard deviation - the standard deviation of the values for all entities
- Sum - the sum total of the value for all entities
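The five aggregate types can be illustrated with a minimal Python sketch. It assumes one reading per entity within the one-minute window and uses the population standard deviation; the text does not say which form of standard deviation Cloudera Manager uses, and the host names and values below are hypothetical.

```python
import statistics


def aggregate(values):
    """Summarize one metric's per-entity values for a single one-minute window."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": statistics.fmean(values),
        # Assumption: population standard deviation (the source does not specify).
        "std_dev": statistics.pstdev(values),
        "sum": sum(values),
    }


# Hypothetical cpu_percent readings reported by four hosts in the same minute.
cpu_percent_by_host = {"host1": 12.0, "host2": 48.0, "host3": 30.0, "host4": 30.0}

summary = aggregate(list(cpu_percent_by_host.values()))
for aggregate_type, value in summary.items():
    print(f"{aggregate_type}: {value:.2f}")
```

In this sketch the "max" entry plays the role that cpu_percent_host_max plays for a cluster: the largest cpu_percent value among all hosts in the window.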
Example Use Cases
Use Case 1: Compare the maximum, minimum and average CPU usage across a cluster
- Select the .
- Enter the tsquery statement:
SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg
- Click Search. You should see three charts, each with CPU data.
- Click in the left column. Now you should see all the data on one chart.
Use Case 2: Compare the CPU usage of a single host to the max, min, and average for the cluster
- Follow the instructions from Use Case 1, except in step 2 enter the following statement instead:
SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg, cpu_percent where category=cluster or hostname='MYHOST.COM'
Aggregate Metric Names
An aggregate metric name is built from three parts:
- The metric we're aggregating - for example, cpu_percent or jvm_gc_count
- The category of the entity generating the metric - for example, "host" or "RegionServer"
- The aggregation type - for example, "max" or "avg"
These parts are combined, in that order and separated by underscores, to form a full name such as "cpu_percent_host_max".
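As a rough sketch of how the three parts fit together, the hypothetical helper below joins them with underscores. Lowercasing the category is an assumption, inferred from names such as jvm_gc_count_regionserver_sum; the function is illustrative only and is not part of Cloudera Manager.

```python
def aggregate_metric_name(metric, category, aggregation):
    """Join metric, entity category, and aggregation type into one name.

    Assumption: the category is lowercased, as in jvm_gc_count_regionserver_sum.
    """
    return f"{metric}_{category.lower()}_{aggregation}"


print(aggregate_metric_name("cpu_percent", "host", "max"))
print(aggregate_metric_name("jvm_gc_count", "RegionServer", "sum"))
```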
The naming of the final component, aggregation type, varies by the type of the metric. We support three types of metrics: gauges, weighted gauges, and counters.
Weighted Gauges
These are probably best explained with an example. Consider the HBase RegionServer metric put_avg_time. This metric tracks the average put time for each RegionServer. Now consider the case where you have two RegionServers, one that did 10,000 puts with an average time of one millisecond per put, and another that did 10 puts with an average time of one second per put. If you simply averaged the two averages, you would conclude that the average across the whole service was about half a second, which does not reflect reality. If you instead weight each RegionServer's average by the number of puts it performed, you get a more accurate number:
Total puts = 10,000 + 10 = 10,010 puts
Total time = (10,000 * 1 ms) + (10 * 1,000 ms) = 20,000 ms
Average time = (20,000ms) / (10,010 puts) = ~2 ms
For weighted gauges, the aggregation types are named as follows:
- max -> "max"
- min -> "min"
- average -> "weighted_avg"
- standard deviation -> "weighted_std_dev"
Sum aggregations are a special case. They represent the weighted total, which would be 20,000 ms in our example, and are named accordingly: put_time_regionserver_sum. Note that we remove "avg" from the name of the metric.
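The put_avg_time example can be reproduced with a few lines of Python. This is only an illustration of the arithmetic, using the two RegionServers described above; the variable names are made up for the sketch.

```python
# (number of puts, average put time in ms) for each RegionServer in the example.
region_servers = [(10_000, 1.0), (10, 1_000.0)]

total_puts = sum(puts for puts, _ in region_servers)

# Weighted total: each average is multiplied by the number of puts behind it.
weighted_sum_ms = sum(puts * avg_ms for puts, avg_ms in region_servers)

# Naive approach: average the two averages, ignoring how many puts each represents.
naive_avg_ms = sum(avg_ms for _, avg_ms in region_servers) / len(region_servers)

# Weighted approach: total time divided by total puts.
weighted_avg_ms = weighted_sum_ms / total_puts

print(f"naive average:    {naive_avg_ms:.1f} ms")
print(f"weighted sum:     {weighted_sum_ms:.0f} ms")
print(f"weighted average: {weighted_avg_ms:.3f} ms")
```

The naive average comes out at roughly half a second, while the weighted average is about 2 ms, which matches the calculation above.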
Counters
For counters, every aggregation type except sum is named with a "_rate" suffix:
- max -> "max_rate"
- min -> "min_rate"
- average -> "avg_rate"
- standard deviation -> "std_dev_rate"
As in the weighted gauges case, sum aggregations are a special case. For counters they represent the total number of times an event occurred and are not a rate. In this case we append the word "sum" to the end of the name, just as we would for gauge metrics: jvm_gc_count_regionserver_sum.
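The suffix rules for weighted gauges and counters can be summarized in a small lookup table. The following Python sketch is hypothetical: the function and dictionary names are invented for illustration, the category is assumed to be lowercased, and the rule for dropping "avg" from a weighted gauge's sum name is inferred from the single put_time_regionserver_sum example. Plain gauges are omitted because their suffixes are not listed here.

```python
# Aggregation type -> name suffix, as listed for each metric type.
SUFFIXES = {
    "weighted_gauge": {
        "max": "max",
        "min": "min",
        "average": "weighted_avg",
        "standard deviation": "weighted_std_dev",
        "sum": "sum",
    },
    "counter": {
        "max": "max_rate",
        "min": "min_rate",
        "average": "avg_rate",
        "standard deviation": "std_dev_rate",
        "sum": "sum",
    },
}


def aggregate_name(metric, category, aggregation, metric_type):
    """Build an aggregate metric name from the naming rules (illustrative only)."""
    if metric_type == "weighted_gauge" and aggregation == "sum":
        # Assumption: "avg" is removed as a whole underscore-separated token,
        # so put_avg_time becomes put_time.
        metric = "_".join(part for part in metric.split("_") if part != "avg")
    suffix = SUFFIXES[metric_type][aggregation]
    return f"{metric}_{category.lower()}_{suffix}"


print(aggregate_name("put_avg_time", "RegionServer", "sum", "weighted_gauge"))
print(aggregate_name("jvm_gc_count", "RegionServer", "sum", "counter"))
print(aggregate_name("jvm_gc_count", "RegionServer", "max", "counter"))
```

The first two calls reproduce the sum names given in the text; the third shows what the stated pattern yields for a counter's max aggregation.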