Metric Aggregation

It is often useful to see an aggregated view of the activity on a cluster. For example, you might want to see the average number of bytes read per DataNode, or the maximum number of bytes read by any DataNode. To make this easy, Cloudera Manager pre-aggregates many of these metrics and allows you to access them through its charts.

What Metrics Are Aggregated

Cloudera Manager aggregates metrics based on the category of the entity that generated them. The categories map to components in the system such as hosts, disks, RegionServers, and HDFS services.

Metrics are aggregated from their generating entity to larger entities they are a part of. For example, metrics that are generated by disks, network interfaces, and file systems are aggregated to their respective hosts and clusters. Generally, this hierarchy is defined as follows:

  • Disks, Network Interfaces, File systems -> Hosts, Clusters
  • Hosts -> Clusters
  • Roles -> Services, Clusters
  • HTables -> HBase Services, Clusters
  • Agents -> Flume, Clusters
  • FlumeChannel, FlumeSource, FlumeSink -> Flume, Clusters

Aggregate Types

Each aggregate is calculated every minute and takes into account all the metrics logged over the previous minute. For example, the metric cpu_percent_host_max takes into account all cpu_percent metrics logged by all hosts in a cluster in the previous minute.

Cloudera Manager supports five types of aggregation:
  • Max - the largest value for any entity
  • Min - the smallest value for any entity
  • Average - the average value for all entities
  • Standard deviation - the standard deviation of the values for all entities
  • Sum - the sum total of the value for all entities
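
The five aggregate types can be sketched as follows. This is an illustrative Python snippet, not Cloudera Manager's actual implementation; the host names and values are made up.

```python
import statistics

# Hypothetical cpu_percent samples logged by each host over the last minute.
cpu_percent_by_host = {
    "host1.example.com": 42.0,
    "host2.example.com": 17.5,
    "host3.example.com": 63.2,
}

values = list(cpu_percent_by_host.values())

# The five aggregates computed across all entities (here, hosts in a cluster).
aggregates = {
    "cpu_percent_host_max": max(values),
    "cpu_percent_host_min": min(values),
    "cpu_percent_host_avg": sum(values) / len(values),
    "cpu_percent_host_std_dev": statistics.pstdev(values),
    "cpu_percent_host_sum": sum(values),
}
```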

Example Use Cases

Use Case 1: Compare the maximum, minimum and average CPU usage across a cluster

  1. Select Charts > Search.
  2. Enter the tsquery statement:
    SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg
  3. Click Search. You should see three charts, each with CPU data.
  4. Click Facets > All Combined in the left column. Now you should see all the data on one chart.

Use Case 2: Compare the CPU usage of a single host to the max, min, and average for the cluster

  1. Follow the instructions from Use Case 1, except in step 2 enter the following statement instead:
    SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg, cpu_percent where category=cluster or hostname='MYHOST.COM'

Aggregate Metric Names

To access aggregated metrics it helps to know how they are named. There are three components to the name:
  • The metric we're aggregating - for example, cpu_percent or jvm_gc_count
  • The category of the entity generating the metric - for example, "host" or "RegionServer"
  • The aggregation type - for example, "max" or "avg"

These parts are combined to form a full name such as "cpu_percent_host_max".
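
The naming scheme can be sketched with a small helper. This is a hypothetical illustration of the convention, not part of Cloudera Manager's API:

```python
def aggregate_metric_name(metric: str, category: str, agg_type: str) -> str:
    """Compose an aggregate metric name from its three components:
    the base metric, the category of the generating entity, and the
    aggregation type. Category names are lowercased in the final name."""
    return f"{metric}_{category.lower()}_{agg_type}"

# e.g. aggregate_metric_name("cpu_percent", "host", "max")
#      -> "cpu_percent_host_max"
```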

The naming of the final component, aggregation type, varies by the type of the metric. We support three types of metrics: gauges, weighted gauges, and counters.

Gauges

These are metrics that can go up and down, such as cpu_percent. Gauges have a straightforward naming convention:
  • max -> "max"
  • min -> "min"
  • average -> "avg"
  • standard deviation -> "std_dev"
  • sum -> "sum"

Weighted Gauges

These are probably best explained with an example. Consider the HBase RegionServer metric put_avg_time, which tracks the average put time for each RegionServer. Now consider the case where you have two RegionServers: one that did 10,000 puts with an average time of one millisecond per put, and another that did 10 puts with an average time of one second per put. If you simply averaged the two averages, you would conclude that the average across the whole service was about half a second, which does not accurately reflect reality. Instead, if you weight each RegionServer's average by its number of puts, you get a more accurate number:

Total puts = 10,000 + 10 = 10,010 puts

Total time = (10000 * 1ms) + (10 * 1000ms) = 20,000 ms

Average time = (20,000ms) / (10,010 puts) = ~2 ms
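
The arithmetic above can be sketched in a few lines of Python; the per-RegionServer numbers are the ones from the example:

```python
# Hypothetical per-RegionServer stats: (average put time in ms, number of puts).
regionservers = [
    (1.0, 10_000),   # fast server: 10,000 puts at 1 ms each
    (1000.0, 10),    # slow server: 10 puts at 1,000 ms each
]

# Naive average of averages: overstates the slow server's influence.
naive_avg = sum(avg for avg, _ in regionservers) / len(regionservers)  # 500.5 ms

# Weighted average: total time divided by total puts.
total_puts = sum(count for _, count in regionservers)          # 10,010 puts
total_time = sum(avg * count for avg, count in regionservers)  # 20,000 ms
weighted_avg = total_time / total_puts                         # ~2 ms
```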

To reflect this we have the concept of weighted gauges that perform this calculation. Their aggregates are named as follows:
  • max -> "max"
  • min -> "min"
  • average -> "weighted_avg"
  • standard deviation -> "weighted_std_dev"

Sum aggregations are a special case. They represent the weighted total, which would be 20,000 ms in our example, and are named accordingly - put_time_regionserver_sum. Note that "avg" is removed from the name of the metric.

Counters

These are metrics that track a total count since a process or host started. An example of a counter is jvm_gc_count, which tracks the number of Java garbage collections since a Java process started. Users are usually more interested in the rate of change of counters (for example, how many garbage collections occurred per second over the last five minutes) than in their raw values, so we calculate the aggregates in terms of rates. They are named as follows:
  • max -> "max_rate"
  • min -> "min_rate"
  • average -> "avg_rate"
  • standard deviation -> "std_dev_rate"

As in the weighted gauges case, sum aggregations are a special case. For counters, they represent the total number of times an event occurred and are not a rate. In this case we append "sum" to the end of the name, just as we would for gauge metrics - jvm_gc_count_regionserver_sum.
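
Converting a counter to a rate can be sketched as taking the delta between two samples and dividing by the interval. This is an illustrative snippet with made-up numbers, not Cloudera Manager's implementation:

```python
# Hypothetical jvm_gc_count samples for one process, taken 60 seconds apart.
previous_count = 1_200   # total GCs at the start of the interval
current_count = 1_230    # total GCs at the end of the interval
interval_seconds = 60

# Counters only increase, so the rate is the delta over the interval.
gc_rate = (current_count - previous_count) / interval_seconds  # GCs per second
```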