Data Retention Techniques in Logstash via Elasticsearch-Curator
Posted by Sarath Pillai, August 15, 2017

Although Elasticsearch can scale horizontally to very large datasets, you should store only the data you actually need. Keeping indices lean speeds up search operations, improves response times, and can substantially reduce resource utilization.
Elasticsearch uses an "inverted index" to retrieve the data you search for. Although this data structure is one of the best for full-text search, keeping only the data you need in the index is still the best approach.
In this tutorial, we discuss data retention techniques that you can use in Elasticsearch. The right policy depends on the kind of data and your application, because some data needs a longer retention period than other data.
Imagine an application that deals with finance and money transactions. Such applications will need all of the records forever. But, do these records need to always exist in elasticsearch? Does all of this data need to be quickly searchable?
Logstash provides methods to segregate different events and store them in standard file storage, rather than Elasticsearch, for long-term archival.
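For example, a conditional in the Logstash output section can route events that must be kept forever to flat files, while everything else goes to Elasticsearch. Below is a minimal sketch; the [type] value of "transaction" and the archive path are illustrative assumptions, not values from any particular pipeline.

output {
  if [type] == "transaction" {
    # Assumed event type: archive these events to flat files for long-term storage
    file {
      path => "/var/log/archive/transactions-%{+YYYY.MM.dd}.log"
    }
  } else {
    # Everything else stays searchable in Elasticsearch
    elasticsearch {
      hosts => ["eb835675.qb0x.com:36563"]
    }
  }
}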
Tutorial
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”
Let’s start by discussing the easiest method available: a tool called Curator. Curator helps you manage and automate the task of optimizing your Elasticsearch indices, and can even delete indices that are no longer required.
Install Elasticsearch-Curator
Elasticsearch-Curator can be installed using the Python package manager pip, as shown below.
#pip install elasticsearch-curator
Note: You need Elasticsearch version 1.0 or later for Curator to work. If you are using a Qbox cluster, you are most likely on the latest stable Elasticsearch version. If you want to try out an older version of Curator, you can pin the version when installing, as shown below.
#pip install elasticsearch-curator==4.2.5
You can also install curator using standard Linux package managers like apt. For apt, you need to execute the below series of commands.
wget -O - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-add-repository 'deb [arch=amd64] http://packages.elastic.co/curator/5/debian stable main'
sudo apt-get update && sudo apt-get install elasticsearch-curator
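Whichever install method you used, a quick way to confirm that Curator is available is to check its version:

#curator --version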
Configuration
Once it is installed, you should have a new command on the system called curator. The latest version of Curator requires two things to work:
- A configuration file for curator itself, where you specify the elasticsearch endpoint settings, credentials, etc.
- An action file. This file contains the actions you define; it is where you specify what needs to be done to your indices to optimize them.
Both the configuration file and the action file use YAML format. The configuration file consists of two primary sections. The first is for client-side settings, where you specify:
- Elasticsearch URL
- Username
- Password
- Port
- Other client side connection specific details
The second section of the configuration file contains settings for logging. An example configuration file is shown below.
---
client:
  hosts:
    - eb835675.qb0x.com
  port: 36563
  use_ssl: True
  ssl_no_validate: True
  http_auth: ec18723587235hds2344:efebd7e1e0
  timeout: 30
  master_only: True

logging:
  loglevel: INFO
  logfile:
  logformat: default
In the example configuration shown above, hosts indicates the hostname/IP address of the Elasticsearch cluster; in our case, we are using a Qbox cluster DNS name. The port number is specified using port. The key ssl_no_validate is set to True to skip certificate validation, which is necessary when the endpoint uses a self-signed certificate. HTTP credentials are provided using the http_auth key, which takes the format username:password.
The master_only key ensures that Curator runs only when the client is connected to the elected master node of the cluster.
Action File
Next, construct an action file. The action file consists of two sections: an action section and a filter section. The filter section selects indices based on the prefix value given; for example, Logstash indices follow the logstash-DATE naming format.
---
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than a month.
    options:
      ignore_empty_list: True
      disable_action: False
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
In our example above, we use the delete_indices action, since we are discussing data retention policies. Now we can run Curator using the below syntax.
#curator --config /path/to/config-file.yml /path/to/action-file.yml
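Before letting Curator delete anything for real, it is worth doing a trial run. Curator supports a --dry-run flag that logs what each action would do without actually performing it:

#curator --config /path/to/config-file.yml --dry-run /path/to/action-file.yml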
Curator CLI
Curator also ships with a single-command interface called curator_cli. This can be used instead of the YAML files we created, as an alternative way to run Curator.
#curator_cli --host eb835675.qb0x.com --http_auth username:password show_indices --verbose --header
The above will show all indices on the Elasticsearch node along with their sizes. As an alternative to the configuration and action file example shown earlier, you can use curator_cli directly to do the same job, as shown below.
#curator_cli --host eb835675.qb0x.com --port 32543 --http_auth ec18723587235hds2344:efebd7e1e0 --use_ssl --ssl-no-validate delete_indices --filter_list '[{"filtertype":"age","source":"name","direction":"older","unit":"days","unit_count":30,"timestring": "%Y.%m.%d"},{"filtertype":"pattern","kind":"prefix","value":"logstash"}]'
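Retention only helps if it runs regularly, so a typical setup schedules the Curator run from cron. The crontab entry below is a sketch that runs Curator daily at 1 a.m.; the /etc/curator file locations are an assumed layout, not something Curator requires:

0 1 * * * /usr/bin/curator --config /etc/curator/config.yml /etc/curator/action.yml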
Apart from using Curator, you can also use the Elasticsearch API directly to delete old indices. For example, the below will delete one of the indices. We know the Logstash index naming pattern, hence it's quite simple to construct an API call based on the date format.
curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017.04.22
As Elasticsearch API requests support wildcards, we can use the below to delete all indices from April 2017.
curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017.04*
Or even an entire year as shown below.
curl -XDELETE http://eb835675.qb0x.com:32564/logstash-2017*
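Wildcard deletes are unforgiving, so before running them it is sensible to list the indices a pattern matches using the _cat/indices API and confirm it only catches what you expect:

curl -XGET 'http://eb835675.qb0x.com:32564/_cat/indices/logstash-2017*?v'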
Conclusion
Elasticsearch-Curator is the best method to manage data retention. Apart from that, you can use curl-based scripts to delete old data. Setting a TTL on indices is an outdated approach and is no longer recommended.
Other Helpful Tutorials
- REST Calls Made Rustic – RS-ES in Idiomatic Rust
- Searching and Fetching Large Datasets in Elasticsearch Efficiently
- Elasticsearch ElastAlert: Alerting at Scale
- How to Use Elasticsearch, Logstash, and Kibana to Manage Apache Logs
- How to Integrate Slack with Elasticsearch, Logstash, and Kibana
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.