In this guide, we explore the refresh and flush operations in Elasticsearch and clarify the differences between them. We also cover the underlying Lucene mechanisms, reopen and commits, which help in understanding how refresh and flush work.
Refresh and Flush
At first glance, the general purposes of the refresh and flush operations seem identical. Both are used to make documents available for search soon after an index operation. When new documents are added to Elasticsearch, we can call either the _refresh or the _flush operation on the index to make the new documents available for search. To understand how these operations differ, you must be familiar with segments, reopen, and commits in Lucene, which is the underlying search library in Elasticsearch.
Segments in Lucene
In Elasticsearch, the basic unit of data storage is a shard. Viewed through the Lucene lens, however, things look a bit different: each Elasticsearch shard is a Lucene index, and each Lucene index consists of several Lucene segments. A segment is an inverted index, that is, a mapping from terms to the documents that contain those terms.
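To make the idea concrete, here is a minimal Python sketch of a segment's inverted index: a mapping from each term to the IDs of the documents containing it. This is only an illustration of the data structure, not Lucene's actual implementation, and the function name is made up:

```python
from collections import defaultdict

def build_segment(docs):
    """Build a toy inverted index: term -> set of IDs of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

segment = build_segment({
    1: "each shard is a Lucene index",
    2: "a Lucene index consists of several segments",
})

# A term lookup returns the IDs of every document containing that term.
print(sorted(segment["lucene"]))   # [1, 2]
print(sorted(segment["shard"]))    # [1]
```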
This concept of segments, and how it applies to an Elasticsearch index and its shards, is shown in the diagram below:
The idea behind this segmentation is that new documents are always written to new segments; existing segments are never modified. If a document has to be deleted, it is merely flagged as deleted in its original segment, so it never gets physically removed from that segment.
Updates work the same way: the previous version of the document is marked as deleted in its old segment, and the updated version is stored under the same document ID in the current segment.
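The delete-and-rewrite behavior can be sketched as follows. This is a hypothetical simplification (Lucene actually tracks deletions with per-segment bitsets, not a Python set), but it captures the immutability idea:

```python
class Segment:
    """Immutable store of documents plus a set of IDs flagged as deleted."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> document; never modified after creation
        self.deleted = set()     # doc IDs tombstoned in this segment

    def live_docs(self):
        """Documents still visible to search from this segment."""
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

old = Segment({1: {"title": "v1"}})
# An update flags the old version as deleted and writes the new
# version, under the same document ID, into the newest segment.
old.deleted.add(1)
new = Segment({1: {"title": "v2"}})

assert old.live_docs() == {}                       # old version hidden
assert new.live_docs() == {1: {"title": "v2"}}     # new version visible
```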
Reopen in Lucene
When Lucene's reopen is called, all the data accumulated so far becomes available for search. Although the latest data is made searchable, reopen does not guarantee that the data has been persisted to disk. We can call reopen any number of times to make the latest data searchable, but we can never be sure the data has actually reached the disk.
Commits in Lucene
Lucene commits make the data safe. During each commit, data from the different segments is merged and written to disk, making it persistent. Although commits are the proper way to persist data, each commit operation is expensive: it involves its own internal I/O operations and read/write cycles. This is exactly why Lucene-based systems prefer to call the cheaper reopen operation again and again to make new data searchable.
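As a rough illustration of why a commit is expensive, here is a sketch that merges live documents from several segments and forces them to disk with fsync. The names and file format are invented for the example; Lucene's real commit protocol is far more involved:

```python
import json
import os
import tempfile

def commit(segments, path):
    """Merge docs from all segments and fsync the result to disk."""
    merged = {}
    for seg in segments:
        merged.update(seg)           # later segments win for the same doc ID
    with open(path, "w") as f:
        json.dump(merged, f)
        f.flush()
        os.fsync(f.fileno())         # the expensive part: force a physical write
    return merged

path = os.path.join(tempfile.mkdtemp(), "committed.json")
merged = commit([{"1": "old"}, {"1": "new", "2": "other"}], path)
print(merged)   # {'1': 'new', '2': 'other'}
```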
Elasticsearch addresses the persistence issue by taking a different approach: it introduces a translog (transaction log) in every shard. Newly indexed documents are written both to this transaction log and to an in-memory buffer. The process is shown in the figure below:
Refresh in Elasticsearch
In Elasticsearch, the _refresh operation is executed every second by default (this interval is controlled by the index.refresh_interval setting). During this operation, the contents of the in-memory buffer are copied to a newly created segment in memory, as shown in the diagram below. As a result, the new data becomes available for search.
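The refresh cycle can be sketched in a few lines of Python. This is a hypothetical simplification with made-up names, but it shows the key property: refresh makes buffered documents searchable without any disk I/O:

```python
class Shard:
    """Toy shard: an indexing buffer plus searchable in-memory segments."""
    def __init__(self):
        self.buffer = {}      # newly indexed docs, not yet searchable
        self.segments = []    # in-memory segments, searchable

    def index(self, doc_id, doc):
        self.buffer[doc_id] = doc

    def refresh(self):
        """Turn the buffer into a new in-memory segment (no disk write)."""
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}

    def search(self, doc_id):
        for seg in reversed(self.segments):   # newest segment wins
            if doc_id in seg:
                return seg[doc_id]
        return None

shard = Shard()
shard.index(1, "hello")
assert shard.search(1) is None        # not visible before refresh
shard.refresh()
assert shard.search(1) == "hello"     # visible after refresh
```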
Translog and Persistence
But how does the translog solve the persistence problem? The translog exists in each shard and lives on physical disk. It is fsynced and safe, so you get persistence and durability even for documents that have not been committed yet: if something bad happens, the transaction log can be replayed to restore them. The translog is fsynced to disk either at a set interval or upon completion of each successful index, bulk, delete, or update request.
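Conceptually, the translog is an append-only, fsynced log that can be replayed after a crash to recover uncommitted operations. The following sketch uses invented names and a JSON-lines format purely for illustration:

```python
import json
import os
import tempfile

class Translog:
    """Toy append-only transaction log, fsynced on every operation."""
    def __init__(self, path):
        self.path = path

    def append(self, op):
        """Append one operation and fsync, so it survives a crash."""
        with open(self.path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        """Re-read all logged operations, e.g. during shard recovery."""
        with open(self.path) as f:
            return [json.loads(line) for line in f]

log = Translog(os.path.join(tempfile.mkdtemp(), "translog"))
log.append({"op": "index", "id": 1, "doc": "hello"})
log.append({"op": "delete", "id": 1})
print(log.replay())   # both operations recovered, in order
```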
Flush in Elasticsearch
A flush essentially means that all the documents in the in-memory buffer are written to new Lucene segments, as shown in Figure 3 below. These, along with all existing in-memory segments, are committed to disk, and the translog is cleared (see Figure 4). This commit is essentially a Lucene commit.
A flush is triggered either periodically or whenever the translog reaches a certain size (512 MB by default, via the index.translog.flush_threshold_size setting). These thresholds keep the cost of Lucene commits under control.
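Putting the pieces together, the flush sequence can be sketched as follows. The function and file names are hypothetical, but the order of operations mirrors the description above: segments go durably to disk first, and only then is the translog truncated:

```python
import json
import os
import tempfile

def flush(segments, shard_dir, translog_path):
    """Commit in-memory segments to disk, then clear the translog."""
    for i, seg in enumerate(segments):
        seg_path = os.path.join(shard_dir, f"segment_{i}.json")
        with open(seg_path, "w") as f:
            json.dump(seg, f)
            f.flush()
            os.fsync(f.fileno())     # the Lucene commit: segments are now durable
    # Everything is safely on disk, so the translog is no longer needed.
    open(translog_path, "w").close() # truncate it

shard_dir = tempfile.mkdtemp()
translog_path = os.path.join(shard_dir, "translog")
with open(translog_path, "w") as f:
    f.write('{"op": "index", "id": 1}\n')

flush([{"1": "hello"}], shard_dir, translog_path)
assert os.path.getsize(translog_path) == 0                        # translog cleared
assert os.path.exists(os.path.join(shard_dir, "segment_0.json"))  # segment on disk
```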
In this guide we explored two closely related Elasticsearch operations, _refresh and _flush, showing the commonalities and differences between them. We also touched upon the underlying Lucene components, reopen and commits, which help in grasping the gist of the _refresh and _flush operations in Elasticsearch.
_refresh is used to make new documents visible to search. _flush, in turn, is used to persist in-memory segments to the hard disk. _flush does not affect the visibility of documents, because search happens on in-memory segments, whereas _refresh does affect their visibility.
Questions/Comments? Drop us a line below.
Other Helpful Tutorials
- Getting Started with Elasticsearch on Qbox
- How to Use Elasticsearch, Logstash, and Kibana to Manage Logs
- How to Use Elasticsearch, Logstash, and Kibana to Manage NGINX Logs
- The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)
- Using the ELK Stack and Python in Penetration Testing Workflow
Give It a Whirl!
It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.
Questions? Drop us a note, and we’ll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.