So far we have seen the grouping of documents based on various aggregations like terms, date histogram, and more. In this post we are going to look at the limitations of the date_histogram aggregation in certain cases, and how to overcome them by employing Elasticsearch scripting.

Data set

Let us consider a set of documents with dates ranging from the present day to two weeks back. Since this involves a few more documents than our usual pattern, I'm going to list only a single sample document. You can modify the date and time in the "created_at" field of this document to create a few more documents, and then index them using the bulk API.

{"text": "hi","created_at": "2016-02-21T18:30:36.000Z"}

The above document has 21st February as the date and 18:30 GMT as the time.

Now, all you need to do is create a few more documents with random values in the date and time fields as mentioned above.

Index creation

Now that we have the data set, we can create an index. Let us name the index "testindex-stats".

curl -X PUT "http://localhost:9200/testindex-stats"
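
Elasticsearch will usually detect the ISO-8601 timestamp in "created_at" and map it as a date automatically, but you can also be explicit about it. A minimal mapping sketch for the type we use in the next step could look like this:

curl -X PUT "http://localhost:9200/testindex-stats/_mapping/weeklystats" -d '{
  "weeklystats": {
    "properties": {
      "created_at": { "type": "date" }
    }
  }
}'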

Now let us index the document set under this index, under a type named "weeklystats". Since there are a number of documents to index, it is best to employ the bulk API for the process. For this post, I have indexed 24 documents, with each document falling under a different date and a different hour.
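If you are new to the bulk API, here is a minimal sketch of such a request (the second document's values are invented for illustration). Note that each action line is followed by its source document on its own line, the body must end with a newline, and --data-binary is used so curl preserves those newlines:

curl -X POST "http://localhost:9200/testindex-stats/weeklystats/_bulk" --data-binary '{"index": {}}
{"text": "hi", "created_at": "2016-02-21T18:30:36.000Z"}
{"index": {}}
{"text": "hello", "created_at": "2016-02-20T09:15:12.000Z"}
'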

Simple date histogram aggregation

Let us do a simple date histogram aggregation on the documents we have indexed and see what results we get by running the following query in the terminal:

curl -X POST 'http://localhost:9200/testindex-stats/weeklystats/_search?pretty=true&size=3' -d '{
  "aggs": {
    "simpleDateHistogram": {
      "date_histogram": {
        "field": "created_at",
        "interval": "day"
      }
    }
  }
}'

In the results you can see that each bucket has an epoch value as its key. This is fine for general operations, but what if we need to find statistics by weekday or by hour? To be more precise, what if we want the bucket names to be the days of the week, like "Monday" and "Tuesday", or the hours of the day, like "00" and "01"? We will see how to do this in the coming sections.
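
For illustration, a single bucket in that response would look roughly like this (the doc_count is invented):

{
  "key_as_string": "2016-02-21T00:00:00.000Z",
  "key": 1456012800000,
  "doc_count": 2
}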

Viewing documents on a per-weekday basis

We need to get our bucket names in the weekday format so that we have better clarity in the analysis of our data. In order to achieve this we need to make use of the scripting facility in Elasticsearch. A quick recap on the usage of scripts is in our previous blog here. In this post, however, we are writing a separate script file. The usage of separate script files has been discussed in detail in the previous post.

Let us name our script dateConversion.groovy and place it in the scripts folder (config/scripts under the Elasticsearch home directory, from where file scripts are picked up automatically), with the following content:

Date date = new Date(doc[date_field].value); // epoch milliseconds from the document field
date.format(format); // e.g. "EEEEEE" for full weekday names, "HH" for hours

In the above script, we have the "date_field" and "format" parameters, which are passed from the query.

For the weekday classification we can employ the following query:

curl -X GET 'http://localhost:9200/testindex-stats/weeklystats/_search?pretty=true' -d '{
  "aggs": {
    "byDays": {
      "terms": {
        "script": {
          "file": "dateConversion",
          "params": {
            "date_field": "created_at",
            "format": "EEEEEE"
          }
        }
      }
    }
  }
}'

In the above query, the "script" field refers to our script file "dateConversion". In the "params" field we define all the parameters to be passed to the script. The "date_field" parameter takes the name of the field in the document, "created_at", and the "format" parameter decides in what format the date should be returned. In this case, "EEEEEE" returns the full names of the weekdays; if we gave only "EE", it would return just the first three letters of the weekdays.

Upon running the above aggregation, the response would have buckets with key names such as "Monday" and "Tuesday", instead of the epoch values we got when we ran the date histogram aggregation.
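
For illustration, a bucket from this response would look roughly like this (the doc_count is invented; our sample document above, dated 21st February 2016, falls on a Sunday):

{
  "key": "Sunday",
  "doc_count": 4
}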

Viewing documents on a per-hour basis

In the above section we successfully achieved the weekday-based aggregation of the documents. We can get an hourly grouping of documents with a small tweak to the query above: change the "format" parameter to "HH", which converts the time field into the respective hour. This is how the query looks after the modification:

curl -X GET 'http://localhost:9200/testindex-stats/weeklystats/_search?pretty=true' -d '{
  "aggs": {
    "byHours": {
      "terms": {
        "order": {
          "_term": "asc"
        },
        "script": {
          "file": "dateConversion",
          "params": {
            "date_field": "created_at",
            "format": "HH"
          }
        },
        "size": 24
      }
    }
  }
}'

After running this query, we can see the documents arranged on an hourly basis according to the "created_at" time field. We applied an ascending sort on the terms, which makes the results go in the order "00", "01", and so on up to "23".

Per day per hour analysis

There are cases in which we will need an hourly analysis per day. How can we achieve that?
We have seen the per-day analysis and per-hour analysis queries in the previous sections. By combining the former with the latter, we can achieve the per day per hour analysis. To do that, make the per-hour analysis a sub-aggregation of the per-day analysis, like below:

curl -X GET 'http://localhost:9200/testindex-stats/weeklystats/_search?pretty=true' -d '{
  "aggs": {
    "byDays": {
      "terms": {
        "script": {
          "file": "dateConversion",
          "params": {
            "date_field": "created_at",
            "format": "EEEEEE"
          }
        }
      },
      "aggs": {
        "byHours": {
          "terms": {
            "order": {
              "_term": "asc"
            },
            "script": {
              "file": "dateConversion",
              "params": {
                "date_field": "created_at",
                "format": "HH"
              }
            },
            "size": 24
          }
        }
      }
    }
  }
}'

Upon running this query we receive aggregation buckets that clearly show the data being split into two levels: the "day" split, and within each day, the "hour" split.
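
For illustration, a fragment of such a nested response might look like this (the counts are invented):

{
  "key": "Sunday",
  "doc_count": 4,
  "byHours": {
    "buckets": [
      { "key": "09", "doc_count": 1 },
      { "key": "18", "doc_count": 3 }
    ]
  }
}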

Conclusion

In this post we discussed the limitations of the date_histogram aggregation and the methods to overcome them. We grouped the documents on a weekday and an hourly basis by employing scripts. In the next post we will explore advanced analytics by comparing this week's data with the previous week's, and also this day's data with the previous day's.

Later in this series, we will also see how to implement this using Kibana. Questions, comments? Leave us a message below.