In this guide, we explore the Refresh and Flush operations in Elasticsearch and clarify the differences between the two. We also cover the underlying Lucene mechanisms, reopen and commit, which help in understanding how refresh and flush work.

Refresh and Flush

At first glance, the Refresh and Flush operations seem to serve the same purpose: both appear to make documents available for search right after an index operation. When new documents are added to Elasticsearch, we can call either the _refresh or the _flush operation on the index to make them available for search. To understand how these operations differ, you must be familiar with segments, reopen, and commits in Lucene, the search library that underlies Elasticsearch.

Segments in Lucene

In Elasticsearch, the basic unit of data storage is a shard. Seen through the Lucene lens, things look a bit different: each Elasticsearch shard is a Lucene index, and each Lucene index consists of several Lucene segments. A segment is an inverted index that maps terms to the documents containing those terms.
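
To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It is only a toy model: the `build_segment` function and the sample documents are hypothetical, and real Lucene segments store far more than a term-to-document map.

```python
from collections import defaultdict

def build_segment(docs):
    """Build a toy inverted index: term -> sorted list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents, keyed by document id.
segment = build_segment({
    1: "refresh makes documents searchable",
    2: "flush makes documents durable",
})

print(segment["documents"])  # [1, 2]
print(segment["flush"])      # [2]
```

Looking up a term returns the ids of the documents that contain it, which is the lookup a search performs against each segment.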

The diagram below shows how segments relate to an Elasticsearch index and its shards:

lucene1.png#asset:1574

The idea behind segmentation is that whenever new documents are created, they are written to a new segment, so there is no need to modify the previous segments. If a document has to be deleted, it is flagged as deleted in its original segment. This means it never gets physically removed from that segment.

The same applies to updates: the previous version of the document is marked as deleted in its original segment, and the updated version is stored under the same document ID in the current segment.
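
The delete and update behavior described above can be sketched with a small toy model. The `Segment` and `ToyIndex` classes are hypothetical names invented for this illustration. They are not Lucene classes, and the model ignores segment merging.

```python
class Segment:
    """A batch of documents that is never rewritten, plus deletion markers."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> source, untouched after creation
        self.deleted = set()     # ids flagged as deleted in this segment

class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        # An update flags the old copy as deleted and writes a new copy
        # into the new segment under the same document id.
        for seg in self.segments:
            for doc_id in docs:
                if doc_id in seg.docs:
                    seg.deleted.add(doc_id)
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # A delete only sets a flag; the stored document stays in place.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def get(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.docs and doc_id not in seg.deleted:
                return seg.docs[doc_id]
        return None

index = ToyIndex()
index.add_segment({"a": "v1", "b": "v1"})
index.add_segment({"a": "v2"})   # update "a"
index.delete("b")

print(index.get("a"))              # v2
print(index.get("b"))              # None
print(index.segments[0].docs)      # {'a': 'v1', 'b': 'v1'}
```

Note that the first segment still physically contains both original documents; only its deletion markers changed.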

Lucene Reopen

Lucene reopen, when called, makes the accumulated data available for search. Although the latest data becomes searchable, this does not guarantee that it is persisted, that is, written to the disk. We can call reopen any number of times to make the latest data searchable, but we cannot be sure that the data is present on the disk.

Commits in Lucene

Lucene commits make the data safe. On each commit, the data from the different segments is synced to the disk, making it persistent. Although commits are the reliable way to persist data, each commit operation is resource-intensive, with its own internal I/O operations and read/write cycles. This is exactly why Lucene-based systems prefer to call reopen again and again to make new data searchable, and to commit far less often.
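
The trade-off between reopen and commit can be illustrated with a rough sketch. `ToyWriter` is a hypothetical class written for this guide, not the Lucene API; it only tracks which documents are searchable and which would survive a crash.

```python
class ToyWriter:
    def __init__(self):
        self.pending = []      # added, not yet visible to search
        self.searchable = []   # visible after reopen()
        self.committed = []    # durable: survives a crash

    def add(self, doc):
        self.pending.append(doc)

    def reopen(self):
        # Cheap: expose pending documents to search, with no disk sync.
        self.searchable.extend(self.pending)
        self.pending.clear()

    def commit(self):
        # Expensive: make every document added so far durable.
        self.committed = self.searchable + self.pending

    def crash(self):
        # After a restart, only committed documents are left.
        self.pending.clear()
        self.searchable = list(self.committed)

writer = ToyWriter()
writer.add("doc-1")
writer.commit()          # doc-1 is now durable
writer.add("doc-2")
writer.reopen()          # doc-1 and doc-2 are searchable, doc-2 is not durable

visible_before_crash = list(writer.searchable)
writer.crash()
visible_after_crash = list(writer.searchable)

print(visible_before_crash)  # ['doc-1', 'doc-2']
print(visible_after_crash)   # ['doc-1']
```

In this model, doc-2 was searchable but never committed, so it disappears after the simulated crash.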

Translog

Elasticsearch takes a different approach to the persistence problem. It introduces a translog (transaction log) in every shard. Newly indexed documents are written both to this transaction log and to an in-memory buffer. This process is shown in the figure below:


lucene2.png#asset:1575

Refresh in Elasticsearch

In Elasticsearch, the _refresh operation runs every second by default. During this operation, the contents of the in-memory buffer are copied to a newly created in-memory segment, as shown in the diagram below. As a result, new data becomes available for search.

lucene3.png#asset:1576
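
As a hedged illustration, the snippet below builds the HTTP requests for a manual refresh and for changing the `index.refresh_interval` setting, using only the Python standard library. The cluster address `http://localhost:9200` and the index name `my-index` are assumptions made for the example. The requests are only constructed, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"   # assumption: local, unsecured cluster
INDEX = "my-index"                   # assumption: hypothetical index name

def refresh_request(index):
    """POST /<index>/_refresh: make recent writes searchable right away."""
    return urllib.request.Request(f"{BASE_URL}/{index}/_refresh", method="POST")

def refresh_interval_request(index, interval):
    """PUT /<index>/_settings: change how often the automatic refresh runs."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

manual = refresh_request(INDEX)
slower = refresh_interval_request(INDEX, "30s")

print(manual.get_method(), manual.full_url)
print(slower.get_method(), slower.full_url, slower.data.decode())

# To send a request against a running cluster:
# with urllib.request.urlopen(manual) as resp:
#     print(resp.read().decode())
```

Raising the interval trades search freshness for less refresh work, which can help during heavy bulk indexing.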

Translog and Persistence

So how does the translog solve the problem of persistence? A translog exists in each shard and is stored on the physical disk. Because it is fsynced, you obtain persistence and durability even for documents that have not been committed yet. If something bad happens, the shard can be restored by replaying the transaction log. The translog is synced to the disk either at a set interval or upon completion of a successful Index, Bulk, Delete, or Update request.
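
The following sketch shows the general write-ahead-log idea behind this: append each operation to a file, fsync it, and replay the file after a crash. It is a simplified stand-in, not the actual Elasticsearch translog format. The JSON-lines layout, the file name, and the `append_op` and `replay` helpers are all assumptions made for illustration.

```python
import json
import os
import tempfile

def append_op(path, op):
    """Append one operation to the log and fsync before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()              # push Python's buffer to the operating system
        os.fsync(f.fileno())   # ask the operating system to write it to disk

def replay(path):
    """Rebuild the document state by re-applying every logged operation."""
    docs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            op = json.loads(line)
            if op["type"] == "index":
                docs[op["id"]] = op["source"]
            elif op["type"] == "delete":
                docs.pop(op["id"], None)
    return docs

log_path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
append_op(log_path, {"type": "index", "id": "1", "source": {"title": "refresh"}})
append_op(log_path, {"type": "index", "id": "2", "source": {"title": "flush"}})
append_op(log_path, {"type": "delete", "id": "1"})

# Simulated crash: all in-memory state is gone, only the log file remains.
recovered = replay(log_path)
print(recovered)  # {'2': {'title': 'flush'}}
```

Syncing after every operation mirrors the "upon completion of a request" mode described above; syncing on a timer instead would be cheaper but could lose the most recent operations.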

Flush in Elasticsearch

Flush essentially means that all the documents in the in-memory buffer are written to new Lucene segments. These, along with all existing in-memory segments, are committed to the disk, and the translog is then cleared, as shown in the diagram below. This commit is essentially a Lucene commit.

lucene4.png#asset:1577


A flush is triggered either periodically or whenever the translog reaches a specific size. These settings keep the cost of Lucene commits in check.
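
Putting the pieces together, here is a toy model of a shard with an in-memory buffer, in-memory segments, committed segments, and a translog. `ToyShard` and its `max_translog_ops` threshold are hypothetical. Elasticsearch measures the translog threshold in bytes rather than in operations, and the real implementation is far more involved.

```python
class ToyShard:
    def __init__(self, max_translog_ops=3):
        self.buffer = []            # indexed, not yet searchable
        self.memory_segments = []   # searchable, not yet committed
        self.disk_segments = []     # searchable and committed
        self.translog = []          # operations since the last commit
        self.max_translog_ops = max_translog_ops  # hypothetical threshold

    def index(self, doc):
        # Every new document goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)
        if len(self.translog) >= self.max_translog_ops:
            self.flush()            # size-based trigger

    def refresh(self):
        # Copy the buffer into a new in-memory segment.
        if self.buffer:
            self.memory_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        self.refresh()                                   # buffer -> new segment
        self.disk_segments.extend(self.memory_segments)  # stands in for the Lucene commit
        self.memory_segments.clear()
        self.translog.clear()                            # operations are now safe in segments

    def searchable(self):
        segments = self.disk_segments + self.memory_segments
        return [doc for seg in segments for doc in seg]

shard = ToyShard(max_translog_ops=3)

shard.index("doc-1")
shard.refresh()
after_refresh = (shard.searchable(), len(shard.translog), len(shard.disk_segments))

shard.flush()
after_flush = (shard.searchable(), len(shard.translog), len(shard.disk_segments))

for doc in ("doc-2", "doc-3", "doc-4"):
    shard.index(doc)        # the third operation reaches the threshold
after_auto_flush = (len(shard.translog), len(shard.disk_segments))

print(after_refresh)      # (['doc-1'], 1, 0)
print(after_flush)        # (['doc-1'], 0, 1)
print(after_auto_flush)   # (0, 2)
```

After the refresh, the document is searchable but still depends on the translog for durability. After the flush, it sits in a committed segment and the translog is empty.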

Conclusion

In this guide, we explored two closely related Elasticsearch operations, _flush and _refresh, showing the commonalities and differences between them. We also touched upon the underlying Lucene mechanisms, reopen and commits, which help in grasping the gist of the _refresh and _flush operations in Elasticsearch.

In short, _refresh makes new documents visible to search, while _flush persists in-memory segments to the hard disk. Unlike _refresh, _flush does not affect the visibility of documents, because search already runs against the in-memory segments.
