In this installment of the pipeline aggregations series, we introduce three important pipeline aggregations: Sum Buckets, Derivative Sum, and Cumulative Sum Aggregations. These aggregations provide us with easy ways to get required results by using the results of other aggregations.

Data Set

The data set we use here is the same which we used in our previous article on pipeline aggregations, but with a small change. Here we add a field "relatedDeaths" to all documents, indicating weather related deaths per month in both the cities.

A sample document with the appropriate change will look like:

curl -X PUT "localhost:9200/weather-data" -d '{
  "mappings": {
    "city": {
      "properties": {
        "city":{
          "type": "string",
          "index": "not_analyzed"
        },
        "date": {
          "type":   "date",
          "format": dd-MM-yyyy"
        },
        "temp":{
          "type": "integer"
        },
        "relatedDeaths":{
          "type": "integer"
        }
      }
    }
  }
}'
curl -X PUT "localhost:9200/weather-data/city/1" -d '{
  "city": "NY",
  "date": "01-01-2015",
  "temp": 38,
  "relatedDeaths": 20
}'

Sum Bucket Aggregation

The field "relatedDeaths" tells us how many people died due to weather related issues. Suppose we want to know the total weather related deaths in the state of New York, within the given data. We can use the "sum_bucket" aggregation for this purpose.

First, query all the documents for New York from the index. Now, apply a date_histogram aggregation with the name temp and then a sibling aggregation of "sum", which gives the sum of the relatedDeath field per month.

Then, use the "sum_bucket" aggregation to get the total of the per-month-sum of the "relatedDeath" field. Do this by setting the parameter "bucket_path" to "temp>monthly""temp" and "monthly" are the names given for the "date_histogram" and "sum" sibling aggregations respectively.

This returns the combined sum of weather related deaths, in 2015, in the state of New York. The query from above looks like:

curl -XPOST 'http://localhost:9200/weather-data/_search?pretty' -d '{
  "query": {
    "match": {
      "city": "NY"
    }
  },
  "aggs": {
    "temp": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "format": "dd-MM-yyyy"
      },
      "aggs": {
        "monthly": {
          "sum": {
            "field": "relatedDeaths"
          }
        }
      }
    },
    "sum_bucket_demo": {
      "sum_bucket": {
        "buckets_path": "temp>monthly"
      }
    }
  }
}'

In the response under the aggregations, you see an aggregation named "sum_bucket_demo" with “value” field equalling the combined total of each month’s “relatedDeath” field average in each bucket of the “date_histogram” aggregation.

Cumulative Sum Aggregation

In the previous section, we saw how to calculate the total number of weather related deaths in the given set of documents. What if we need to find out the cumulative sum of the deaths per month in the state of New York. That is, in simpler terms, the weather related deaths so far per month statistics. Here the cumulative sum pipeline aggregation becomes applicable.

curl -XPOST 'http://localhost:9200/weather-data/_search?pretty' -d '{
  "query": {
    "match": {
      "city": "NY"
    }
  },
  "aggs": {
    "temp": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "format": "dd-MM-yyyy"
      },
      "aggs": {
        "monthly": {
          "sum": {
            "field": "relatedDeaths"
          }
        },
        "cumulative_demo": {
          "cumulative_sum": {
            "buckets_path": "monthly"
          }
        }  
      }
    }
  }
}'

Run the resulting query and you should see the following response:

"aggregations": {
  "temp": {
    "buckets": [
      {
        "key_as_string": "01-01-2015",
        "key": 1420070400000,
        "doc_count": 1,
        "monthly": {
          "value": 12
        },
        "cumulative_demo": {
          "value": 12
        }
      },
      {
        "key_as_string": "01-02-2015",
        "key": 1422748800000,
        "doc_count": 1,
        "monthly": {
          "value": 21
        },
        "cumulative_demo": {
          "value": 33
        }
      },
      ....
    ]
  }
}

In the first bucket, the "relatedDeath" value is 12. This is the value that exists in the first document, because there is no previous document to do the sum. In the second bucket, the value of "relatedDeaths" is 21. The corresponding value in the "cumulative_demo" bucket is the sum of the value in the first bucket (12) and the value in the second bucket (21), resulting in 33. Therefore, it is understood that for the month of February, the total weather related deaths is 33 so far.

Derivative Sum Aggregation

Let us now explore the “derivative” pipeline aggregation. The “derivative” aggregation is a parent pipeline aggregation, which means that,it operates on a metric of a parent histogram aggregation and calculate the derivative. The conditions for the derivative aggregation to function properly are:

  1. The specified metric should be numeric, otherwise finding difference would be not possible.
  2. The parent histogram should have the "min_doc_count" parameter set to zero. If min_doc_count is set to other values, some intervals might be omitted in the results. These results would be used to calculate the derivative, which will result in erroneous output values.

Suppose we need to get an idea of the temperature differences in New York City, from the time range, in our data. Use the derivative aggregation function below:

curl -XPOST 'http://localhost:9200/weather-data/_search?pretty' -d '{
  "query": {
    "match": {
      "city": "NY"
    }
  },
  "aggs": {
    "temp": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "format": "dd-MM-yyyy"
      },
      "aggs": {
        "monthly": {
          "avg": {
            "field": "temp"
          }
        },
        "derivative": {
          "derivative": {
            "buckets_path": "monthly"
          }
        }
      }
    }
  }
}'

We will get the following response for the above query:

"aggregations": {
  "temp": {
    "buckets": [
      {
        "key_as_string": "01-01-2015",
        "key": 1420070400000,
        "doc_count": 1,
        "monthly": {
          "value": 38
        }
      },
      {
        "key_as_string": "01-02-2015",
        "key": 1422748800000,
        "doc_count": 1,
        "monthly": {
          "value": 42
        },
        "derivative": {
          "value": 4
        }
      },
      ......
    ]
  }
}

Here we can see in the aggregations "temp", the first bucket does not have the field derivative. This is because it is the first document and there is no preceding document to calculate the difference. In the second bucket, there is a "derivative" field with value = 4. This is the difference between the temperature of the second document’s temperature value, 42, and the first document’s temperature value, 38.

Conclusion

In this post we showed three variants of pipeline aggregations with use case examples. In the next article, we see pipeline aggregations that make use of scripts for their operations.