In this last installment of the pipeline aggregation blog series, we introduce the remaining class of aggregations: the stats, extended stats, and percentiles bucket aggregations. These sibling pipeline aggregations compute summary statistics, such as the average, percentiles, sum of squares, and more, over the buckets produced by another aggregation.

Data Set

Suppose we have monthly sales information for two products, televisions and laptops. Our sample data looks like this:

curl -X PUT "localhost:9200/sales-info" -d '{
  "mappings": {
    "items": {
      "properties": {
        "product":{
          "type": "string",
          "index": "not_analyzed"
        },
        "price":{
          "type": "integer"
        },
        "date": {
          "type":   "date",
          "format": "MM-dd-yyyy"
        }
      }
    }
  }
}'
curl -X PUT "localhost:9200/sales-info/items/1" -d '{
  "product": "television",
  "price": 868,
  "date": "01-01-2015"
}'
curl -X PUT "localhost:9200/sales-info/items/2" -d '{
  "product": "laptop",
  "price": 300,
  "date": "01-01-2015"
}'

In the same way, we index a sales document for each month of 2015 into the index "sales-info".
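Indexing each month with a separate curl call is tedious; a short script can build a payload for Elasticsearch's _bulk API instead. Below is a minimal sketch. The prices are hypothetical placeholders (the television figures are chosen to be consistent with the statistics shown later in this post; the laptop figures are arbitrary):

```python
import json

# Hypothetical monthly prices; real values would come from your sales records.
prices = {"television": [868, 956, 475, 535, 842, 555],
          "laptop": [300, 410, 290, 350, 380, 320]}

lines = []
doc_id = 1
for product, monthly in prices.items():
    for month, price in enumerate(monthly, start=1):
        # Action line, then source line -- the format the _bulk API expects.
        lines.append(json.dumps({"index": {"_index": "sales-info",
                                           "_type": "items", "_id": doc_id}}))
        lines.append(json.dumps({"product": product, "price": price,
                                 "date": "%02d-01-2015" % month}))
        doc_id += 1

payload = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
print(payload)
```

Save the output to a file and send it with curl -XPOST 'localhost:9200/_bulk' --data-binary @payload.json.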

Stats

From the above data, suppose we need summary statistics for the sales of the product "television". We use the "stats_bucket" pipeline aggregation for this purpose.

The first step is to fetch the documents for the product of our choice, in this case television, using a simple term query. The next step is a "date_histogram" aggregation, which gives us one bucket per month. Inside it, a "sum" sub-aggregation calculates the total sales in each monthly bucket.

For the last step, apply the "stats_bucket" aggregation with its "buckets_path" pointing to "monthly_sales>total_sale". Since "stats_bucket" is a sibling aggregation, it operates on all the buckets produced by the sibling "monthly_sales" aggregation. Our final query looks like the following:

curl -XPOST 'localhost:9200/sales-info/_search?pretty' -d '{
  "query": {
    "term": {
      "product": "television"
    }
  },
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_sale": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "stats_television_sales": {
      "stats_bucket": {
        "buckets_path": "monthly_sales>total_sale"
      }
    }
  }
}'

Now in the response, under "stats_television_sales", we can see the following statistics returned by the "stats_bucket" aggregation, with each field explained inline:

"stats_television_sales": {
  "count": 6,               // number of monthly buckets
  "min": 475,               // minimum monthly sales total
  "max": 956,               // maximum monthly sales total
  "avg": 705.1666666666666, // average monthly sales
  "sum": 4231               // total sales across all months
}
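Under the hood, "stats_bucket" simply computes these five values over the per-bucket sums. A minimal re-implementation in Python illustrates this. The list of monthly totals is hypothetical: 555 is a made-up filler chosen so the figures line up with the response above.

```python
def stats_bucket(values):
    """Compute what the stats_bucket aggregation returns for a set of bucket values."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

# Hypothetical monthly totals for television (555 is a made-up filler value).
monthly_totals = [868, 956, 475, 535, 842, 555]
print(stats_bucket(monthly_totals))
```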

Extended Stats

In the previous section, we explained the function and output of the "stats_bucket" aggregation. Sometimes, however, we need a more detailed set of statistics for our data.

The "extended_stats_bucket" pipeline aggregation gives additional statistics on our data, including variance information.


To view the results of the "extended_stats_bucket" aggregation, replace the "stats_television_sales" part of the query in the above section with the following:

"detailed_monthly_stats": {
  "extended_stats_bucket": {
    "buckets_path": "monthly_sales>total_sale"
  }
}

Running the modified query returns a response; the "detailed_monthly_stats" portion is shown below:

"detailed_monthly_stats": {
  "count": 6,
  "min": 475,
  "max": 956,
  "avg": 705.1666666666666,
  "sum": 4231,
  "sum_of_squares": 3196200,
  "variance": 35439.97222222225,
  "std_deviation": 188.25507223504562,
  "std_deviation_bounds": {
    "upper": 1081.6768111367578,
    "lower": 328.6565221965754
  }
}

As you can see from the above data, there are four additional fields: "sum_of_squares", "variance", "std_deviation", and "std_deviation_bounds". These describe how widely the monthly totals are spread around the mean; the bounds are, by default, two standard deviations above and below the average.
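These fields obey the usual population-statistics formulas: the variance is the mean of the squares minus the square of the mean, the standard deviation is its square root, and the bounds are the average plus or minus two standard deviations (the default sigma). We can verify this with the numbers from the response:

```python
import math

# Values taken from the extended_stats_bucket response above
count = 6
avg = 705.1666666666666
sum_of_squares = 3196200

# Population variance: E[x^2] - (E[x])^2
variance = sum_of_squares / count - avg ** 2
std_deviation = math.sqrt(variance)

# std_deviation_bounds with the default sigma of 2
upper = avg + 2 * std_deviation
lower = avg - 2 * std_deviation
```

Each computed value matches the corresponding field in the response to floating-point precision.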

Percentiles

Like the stats and extended stats bucket aggregations, the "percentiles_bucket" aggregation calculates percentiles of the values in the buckets of a sibling aggregation. Apply it by replacing the "stats_television_sales" part of the query from the stats section with the following:

"sum_monthly_sales": {
  "percentiles_bucket": {
    "buckets_path": "monthly_sales>total_sale"
  }
}

This returns the percentile information of the bucket data as below:

"sum_monthly_sales": {
  "values": {
    "1.0": 475,
    "5.0": 475,
    "25.0": 535,
    "50.0": 842,
    "75.0": 868,
    "95.0": 956,
    "99.0": 956
  }
}
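Unlike the standalone percentiles metric aggregation, which uses the approximate TDigest algorithm, "percentiles_bucket" works on the exact list of bucket values and returns the nearest actual value for each requested percentile, without interpolation. A minimal sketch of one such selection rule, using hypothetical monthly totals (555 is a made-up filler; the other values appear in the response above):

```python
def percentile_bucket(values, percent):
    """Return the bucket value nearest the requested percentile (no interpolation)."""
    srt = sorted(values)
    # Index rule used here: floor(p/100 * n), clamped to the last element.
    idx = min(len(srt) - 1, int(percent / 100.0 * len(srt)))
    return srt[idx]

monthly_totals = [868, 956, 475, 535, 842, 555]
for p in (1.0, 5.0, 25.0, 50.0, 75.0, 95.0, 99.0):
    print(p, percentile_bucket(monthly_totals, p))
```

With these totals the sketch reproduces the values in the response, though the exact rounding rule Elasticsearch applies is an implementation detail and may differ.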

Conclusion

In this post, we’ve shown three classes of pipeline aggregations with examples. In the next post of this series, we will explain the moving average and serial differencing aggregations, which are used for trend analysis on time-series data.