Elasticsearch 2.0.0 introduced one of its most anticipated features: pipeline aggregations. This is an important addition to the query DSL. It lets a user operate on the results of other aggregations; previously, this had to be done separately on the client side. This blog series covers, in detail, the types of pipeline aggregations along with examples.

Pipeline Aggregations

As mentioned above, pipeline aggregations are significant because they let a user operate on the output produced by other aggregations, rather than on document sets, and add the results to the output tree. There are many types of pipeline aggregations (the list can be found here), but they can generally be classified into two categories.

Parent

This class of pipeline aggregation is provided with the output of its parent aggregation. It generates its values in the same buckets as the parent aggregation. The generated output may be new values or new buckets, depending on the type of parent pipeline aggregation used.

Sibling

This class of pipeline aggregation is provided with the output of a sibling aggregation. The results are buckets or values at the same level as the sibling aggregation.
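To make the difference between the two classes concrete, here is a minimal Python sketch. It does not call Elasticsearch; it only imitates where each class puts its result. The bucket values, the "metric" field name, and the helper functions are hypothetical. The parent example is modeled loosely on the derivative aggregation and the sibling example on the average bucket aggregation.

```python
# Hypothetical monthly buckets, shaped like the output of a multi-bucket aggregation.
buckets = [
    {"key": "2015-01", "metric": 10.0},
    {"key": "2015-02", "metric": 14.0},
    {"key": "2015-03", "metric": 12.0},
]


def sibling_avg(buckets):
    """Sibling style: produce ONE result that sits next to the bucket list."""
    values = [b["metric"] for b in buckets]
    return {"value": sum(values) / len(values)}


def parent_derivative(buckets):
    """Parent style: write a new value INTO each bucket.

    The first bucket has no predecessor, so it gets no derivative.
    """
    enriched_buckets = []
    previous = None
    for bucket in buckets:
        enriched = dict(bucket)  # copy so the input is left untouched
        if previous is not None:
            enriched["derivative"] = bucket["metric"] - previous
        previous = bucket["metric"]
        enriched_buckets.append(enriched)
    return enriched_buckets


print(sibling_avg(buckets))
print(parent_derivative(buckets))
```

The sibling result is a single object beside the buckets, while the parent result changes the buckets themselves.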

Data Set

You can download the data here. To demonstrate pipeline aggregations, we will use simple weather data for two cities, say New York and California. Here is the mapping for the sample data to be indexed:

curl -X PUT "localhost:9200/weather-data" -d '{
  "mappings": {
    "city": {
      "properties": {
        "city":{
          "type": "string",
          "index": "not_analyzed"
        },
        "date": {
          "type":   "date",
          "format": "dd-MM-yyyy"
        },
        "temp":{
          "type": "integer"
        }
      }
    }
  }
}'

Create some sample data for the above index. The sample documents have the following format:

curl -X PUT "localhost:9200/weather-data/city/1" -d '{
  "city": "NY",
  "date": "01-01-2015",
  "temp": 38
}'
curl -X PUT "localhost:9200/weather-data/city/2" -d '{
  "city": "CL",
  "date": "01-01-2015",
  "temp": 56
}'

As you can see in the above data, the temperatures of the cities New York and California are indexed once per month for the year 2015.
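Indexing 24 documents one curl call at a time gets tedious, so a bulk request is a common alternative. The following is a rough sketch, not part of the original walkthrough, that builds a newline-delimited body for the `_bulk` endpoint from the two sample documents shown above. The helper name `build_bulk_body` and the sequential IDs are assumptions made for illustration.

```python
import json

# The two sample documents shown above; the remaining months would be appended here.
docs = [
    {"city": "NY", "date": "01-01-2015", "temp": 38},
    {"city": "CL", "date": "01-01-2015", "temp": 56},
]


def build_bulk_body(docs, index="weather-data", doc_type="city"):
    """Return a bulk body: one action line followed by one source line per document."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(docs)
print(bulk_body, end="")
```

Saved to a file (say, a hypothetical weather.bulk), the output could then be sent with something like curl -XPOST "localhost:9200/_bulk" --data-binary @weather.bulk.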

In this introductory post on pipeline aggregations, we will use the above data to get familiar with the following pipeline aggregations.

  1. Avg Bucket Aggregations
  2. Max Bucket Aggregations
  3. Min Bucket Aggregations

Average Bucket Aggregation

The average bucket aggregation belongs to the sibling class of pipeline aggregations. It calculates the average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling must be a multi-bucket aggregation, because a mean can be taken only over numeric values and needs a set of buckets to average across.

Suppose that, from the above data, we need to calculate the average temperature per month across both cities. How do we do that?

First, prepare a date_histogram aggregation. Name it "temp" and give it an interval of one month. Each of its buckets holds the documents for one month, in this case 2 per bucket.

Then, write an "avg" aggregation (let us name it "monthly_avg") inside the "temp" aggregation so that it calculates the average of the temperatures of "NY" and "CL" for each month.

Now comes the role of the average bucket aggregation. We write it as a sibling of the "temp" aggregation and name it "avg_monthly_temp". To specify what this aggregation should do, we set its "buckets_path" parameter to "temp>monthly_avg". This instructs the average bucket aggregation to take the value of the "monthly_avg" aggregation from each bucket of "temp" and return the mean of those values.

Following the structure described above, let us build the query and run it against the "weather-data" index we created.

curl -XPOST 'http://localhost:9200/weather-data/_search?pretty' -d '{
  "aggs": {
    "temp": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "format": "dd-MM-yyyy"
      },
      "aggs": {
        "monthly_avg": {
          "avg": {
            "field": "temp"
          }
        }
      }
    },
    "avg_monthly_temp": {
      "avg_bucket": {
        "buckets_path": "temp>monthly_avg"
      }
    }
  }
}'

This query returns the following response:

"aggregations": {
  "temp": {
    "buckets": [
      {
        "key_as_string": "01-01-2015",
        "key": 1420070400000,
        "doc_count": 2,
        "monthly_avg": {
          "value": 47
        }
      },
      ...{
        ...
      }
    ]
  },
  "avg_monthly_temp": {
    "value": 62.208333333333336
  }
}

Analyzing the response, you can see that each date histogram bucket has a field named "monthly_avg", which holds the average of the temperatures of the 2 cities under the key "value". At the same level as the date_histogram aggregation is the "avg_monthly_temp" aggregation, which holds the average of the 12 values returned in the date histogram buckets, that is, the mean of the "monthly_avg" values across all buckets of its sibling "temp".

If we had weather data for all 50 states, we could calculate the average temperature of the US with the same process.
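The two stages can be imitated in a few lines of Python. This is only a sketch of the arithmetic, not of how Elasticsearch implements it. The January bucket (38 and 56 averaging to 47) comes from the sample data above; the February value and the helper names `bucket_values` and `avg_bucket` are made up for illustration, and the path handling covers only the simple two-part form used in this post.

```python
# Stage 1: the "avg" metric inside one date_histogram bucket (January sample docs).
january_temps = [38, 56]
january_avg = sum(january_temps) / len(january_temps)

# Response fragment shaped like the one above. Only January is from the article;
# the February value is a placeholder.
aggregations = {
    "temp": {
        "buckets": [
            {"key_as_string": "01-01-2015", "doc_count": 2, "monthly_avg": {"value": january_avg}},
            {"key_as_string": "01-02-2015", "doc_count": 2, "monthly_avg": {"value": 51.0}},
        ]
    }
}


def bucket_values(aggregations, buckets_path):
    """Follow a two-part path such as "temp>monthly_avg" and collect one value per bucket."""
    multi_bucket_name, metric_name = buckets_path.split(">")
    buckets = aggregations[multi_bucket_name]["buckets"]
    return [bucket[metric_name]["value"] for bucket in buckets]


def avg_bucket(aggregations, buckets_path):
    """Stage 2: the mean of the per-bucket values, reported beside the buckets."""
    values = bucket_values(aggregations, buckets_path)
    return {"value": sum(values) / len(values)}


print(january_avg)
print(avg_bucket(aggregations, "temp>monthly_avg"))
```

Note that the second stage averages the monthly averages, not the raw documents; the two agree here only because every month holds the same number of documents.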

Maximum Bucket Aggregation

Suppose we want to know the maximum of the monthly average temperatures of both cities. For this, use the max bucket aggregation. Replace the "avg_monthly_temp" aggregation with the following aggregation:

"max_monthly_average": {
  "max_bucket": {
    "buckets_path": "temp>monthly_avg"
  }
}

The response now differs as follows:

"max_monthly_average": {
  "value": 74,
  "keys": [
    "01-07-2015",
    "01-08-2015"
  ]
}

The above response says that the maximum monthly average was 74 and that it occurred in two months, which are listed in the "keys" array.
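The tie handling is the interesting part: when several buckets share the maximum, all of their keys are reported. A small Python sketch of that behavior follows. The two 74 values mirror the response above, while the other monthly averages and the helper name `max_bucket` are hypothetical.

```python
# Monthly averages keyed by bucket. Only the two 74s mirror the response above;
# the June and September values are placeholders.
monthly_avg = {
    "01-06-2015": 70.0,
    "01-07-2015": 74.0,
    "01-08-2015": 74.0,
    "01-09-2015": 66.0,
}


def max_bucket(values_by_key):
    """Return the largest value and every bucket key where it occurs."""
    best = max(values_by_key.values())
    keys = [key for key, value in values_by_key.items() if value == best]
    return {"value": best, "keys": keys}


print(max_bucket(monthly_avg))
```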

Minimum Bucket Aggregation

We can calculate the minimum monthly average temperature of both cities by adding the following aggregation to the above query, or by using it in place of the previous one.

"min_monthly_average": {
  "min_bucket": {
    "buckets_path": "temp>monthly_avg"
  }
}
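Since all three aggregations are siblings of "temp" and share the same "buckets_path", they can also be requested together. The sketch below assembles such a request body as a Python dictionary and prints it as JSON, ready to pass to the _search endpoint used earlier. The aggregation names are the ones from this post; combining them in one request is our own arrangement rather than something shown above.

```python
import json

# Every sibling pipeline aggregation reads the same per-bucket metric.
BUCKETS_PATH = "temp>monthly_avg"

body = {
    "aggs": {
        # Multi-bucket aggregation: one bucket per month, each with its average.
        "temp": {
            "date_histogram": {"field": "date", "interval": "month", "format": "dd-MM-yyyy"},
            "aggs": {"monthly_avg": {"avg": {"field": "temp"}}},
        },
        # Sibling pipeline aggregations, placed at the same level as "temp".
        "avg_monthly_temp": {"avg_bucket": {"buckets_path": BUCKETS_PATH}},
        "max_monthly_average": {"max_bucket": {"buckets_path": BUCKETS_PATH}},
        "min_monthly_average": {"min_bucket": {"buckets_path": BUCKETS_PATH}},
    }
}

payload = json.dumps(body, indent=2)
print(payload)
```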

Conclusion

In this post, we introduced pipeline aggregations in Elasticsearch 2.0.0 and tried out three of them: the average, maximum, and minimum bucket aggregations. In the next post of this series, we will share more examples of other types of pipeline aggregations.