Although Elasticsearch can scale almost indefinitely, you should store only the data you actually need. Doing so speeds up searches, shortens response times, and substantially reduces resource utilization.

Elasticsearch uses an inverted index to retrieve the data you are searching for. Although this is one of the best data structures for text search, keeping only the data you need in the index is still the best approach.

In this tutorial, we discuss data retention techniques that you can use in Elasticsearch. How long you keep data obviously depends on the kind of data and on your application, because some datasets need longer retention policies than others.

Imagine an application that deals with finance and money transactions. Such an application needs to keep all of its records forever. But do those records need to live in Elasticsearch forever? Does all of that data need to be quickly searchable?

Logstash provides methods to segregate different events and then store them in standard file storage rather than Elasticsearch for long-term retention.
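
For example, a Logstash pipeline can route events meant for long-term archiving to a file output while everything else continues to flow into Elasticsearch. The snippet below is only a sketch: the event type, archive path, and output host are illustrative assumptions, not values taken from this tutorial.

output {
  if [type] == "transaction" {
    # Hypothetical archive: append events as one JSON document per line.
    file {
      path => "/var/archive/transactions-%{+YYYY.MM.dd}.log"
      codec => json_lines
    }
  } else {
    # Everything else is indexed into Elasticsearch as usual.
    elasticsearch {
      hosts => ["localhost:9200"]
    }
  }
}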

Tutorial

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

Let’s start with the simplest method available: a tool called Curator. Curator helps you manage and automate the task of optimizing your Elasticsearch indices, and it can even delete indices that are no longer required.

Install Elasticsearch-Curator

Elasticsearch-Curator can be installed using the Python package manager pip, as shown below.

#pip install elasticsearch-curator

Note: You need Elasticsearch version 1.0 or later for Curator to work. If you are using a Qbox cluster, you are most likely running the latest stable Elasticsearch version. If you want to try an older version of Curator, you can install it using the command below.

#pip install elasticsearch-curator==4.2.5

You can also install Curator using a standard Linux package manager such as apt. For apt, execute the series of commands below.

wget -O - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-add-repository 'deb [arch=amd64] http://packages.elastic.co/curator/5/debian stable main'
sudo apt-get update && sudo apt-get install elasticsearch-curator

Configuration

Once it is installed, you should have a new command on the system called curator. The latest version of Curator requires two things to work:

  1. A configuration file for Curator itself, where you specify the Elasticsearch endpoint settings, credentials, and other connection details.

  2. An action file, which contains the actions you define. This is where you specify what should be done to your indices to optimize them.

Both the configuration file and the action file use YAML format. The configuration file consists of two primary sections. The first holds client-side settings, where you specify:

  • Elasticsearch URL
  • Username
  • Password
  • Port 
  • Other client-side, connection-specific details

The second section of the configuration file contains settings for logging. An example configuration file is shown below.

---
client:
  hosts:
    - eb835675.qb0x.com
  port: 36563
  use_ssl: True
  ssl_no_validate: True
  http_auth: ec18723587235hds2344:efebd7e1e0
  timeout: 30
  master_only: True
logging:
  loglevel: INFO
  logfile:
  logformat: default

In the example configuration shown above, hosts indicates the hostname/IP address of the Elasticsearch cluster; in our case, we are using a Qbox cluster DNS name. The port number is specified using port. The key ssl_no_validate is set to True so that the SSL certificate is not validated, which is useful when the endpoint uses a self-signed certificate. HTTP credentials are provided using the http_auth key, which uses the format username:password.


The master_only key ensures that Curator runs its operations only when it is connected to the elected master node.

Action File

Next, construct an action file. The action file consists of two parts: the actions and their filters. The filters select which indices an action applies to; for example, a prefix filter with the value logstash- matches Logstash indices, which are named in the logstash-YYYY.MM.DD format.

---
actions:
  1:
    action: delete_indices
    description: >-
      delete indices older than a month.
    options:
      ignore_empty_list: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30

In the example above, we use the delete_indices action, since we are discussing data retention policies: it deletes logstash- indices whose date-stamped names show they are more than 30 days old. Now we can run Curator using the syntax below.

#curator --config /path/to/config-file.yml /path/to/action-file.yml
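
To turn this into an ongoing retention policy, you will probably want to run Curator on a schedule. A sensible first step is a dry run, which logs what would be deleted without touching anything; the file paths below are placeholders, as above.

#curator --config /path/to/config-file.yml --dry-run /path/to/action-file.yml

Once you are happy with the output, a crontab entry along these lines (paths assumed) runs the cleanup every day at 01:00.

0 1 * * * /usr/local/bin/curator --config /path/to/config-file.yml /path/to/action-file.yml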

Curator CLI

Curator also ships with a single-command interface called curator_cli. It can be used as an alternative to the YAML configuration and action files we created above.

#curator_cli --host eb835675.qb0x.com --http_auth username:password show_indices --verbose --header

The above command lists all indices on the Elasticsearch node along with their sizes. Similar to the previous configuration and action file example, you can use curator_cli directly to do the same job, as shown below.

#curator_cli --host eb835675.qb0x.com --port 32543 --http_auth ec18723587235hds2344:efebd7e1e0 --use_ssl --ssl-no-validate delete_indices --filter_list '[{"filtertype":"age","source":"name","direction":"older","unit":"days","unit_count":30,"timestring": "%Y.%m.%d"},{"filtertype":"pattern","kind":"prefix","value":"logstash"}]'

Apart from using Curator, you can also call the Elasticsearch API directly to delete old indices. For example, the command below deletes a single index. Since we know the Logstash index naming pattern, it is quite simple to construct the API call from the date format.

curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017.04.22

Since Elasticsearch API requests support wildcards, we can use the command below to delete all indices from April 2017.

curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017.04*

Or even an entire year, as shown below.

curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017*
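
Because wildcard deletes are irreversible, it is worth previewing what a pattern actually matches before running them. One quick way, using the same cluster endpoint as above, is the _cat/indices API.

curl -XGET 'http://eb835675.qb0x.com:32564/_cat/indices/logstash-2017.04*?v'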

Conclusion

Elasticsearch-Curator is the best way to manage data retention. Apart from that, you can use curl-based scripts to delete old data. Setting a TTL on your data is an outdated approach and is not recommended.


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
