This post is part 3 of a 3-part series about tuning Elasticsearch Indexing. Part 1 can be found here and Part 2 can be found here.

This tutorial series focuses specifically on tuning Elasticsearch to achieve maximum indexing throughput and reduce monitoring and management load.

Elasticsearch provides sharding and replication as the recommended way to scale an index and increase its availability. A little overallocation is good, but a bazillion shards is bad. It is difficult to define what constitutes too many shards, because it depends on their size and how they are used. A hundred shards that are seldom used may be fine, while two shards under very heavy usage could be too many. Monitor your nodes to ensure that they have enough spare capacity to deal with exceptional conditions.
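As a rough illustration of the shard and replica counts discussed above, here is a minimal sketch that builds the JSON body you might send when creating an index. The helper names and the values 3 and 1 are assumptions chosen for the example, not recommendations from the series.

```python
import json


def index_settings_body(primary_shards: int, replicas: int) -> dict:
    """Build a request body for index creation with an explicit shard layout.

    Hypothetical helper: only the setting names number_of_shards and
    number_of_replicas come from Elasticsearch itself.
    """
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary has `replicas` copies, so count primaries plus all replicas."""
    return primary_shards * (1 + replicas)


# Example values are assumptions for illustration only.
body = index_settings_body(primary_shards=3, replicas=1)
print(json.dumps(body, indent=2))
print("shard copies the cluster must host:", total_shard_copies(3, 1))
```

Counting the total shard copies this way is a quick check on how much work a given layout puts on the nodes before you create the index.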

Keep reading

This post is part 2 of a 3-part series about tuning Elasticsearch Indexing. Part 1 can be found here.

This tutorial series focuses specifically on tuning Elasticsearch to achieve maximum indexing throughput and reduce monitoring and management load. Elasticsearch is near real-time, in the sense that when you index a document, you need to wait for the next refresh before that document appears in search.

Refreshing is an expensive operation, which is why it is performed at a regular interval (once per second by default) instead of after each indexing operation. If you plan to index a lot of documents and don’t need the new information to be immediately available for search, you can favor indexing performance over search freshness by decreasing the refresh frequency until you are done indexing.
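To make the idea concrete, here is a small sketch of the settings bodies you might send to an index's `_settings` endpoint before and after a large indexing run. The helper name and the 30-second value are assumptions for illustration; `refresh_interval` is the actual index setting.

```python
import json


def refresh_settings_body(interval: str) -> dict:
    """Build a settings body that changes how often the index refreshes.

    `interval` is an Elasticsearch time value such as "1s" or "30s";
    "-1" disables periodic refreshes entirely.
    """
    return {"index": {"refresh_interval": interval}}


# Before a heavy indexing run: refresh less often (30s is an assumed value).
before_bulk = refresh_settings_body("30s")

# After the run: go back to the one-second default.
after_bulk = refresh_settings_body("1s")

print(json.dumps(before_bulk))
print(json.dumps(after_bulk))
```

Remember to restore the interval once indexing finishes, otherwise new documents will keep appearing in search late.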

Keep reading

This post is part 1 of a 3-part series about tuning Elasticsearch Indexing. This series focuses specifically on tuning Elasticsearch to achieve maximum indexing throughput and reduce monitoring and management load. 

As a starting point, assume that you start Elasticsearch, create an index, and feed it JSON documents without defining a schema. Elasticsearch will then iterate over each field of the JSON document, guess its type, and create a corresponding mapping. While this may seem ideal, the mappings Elasticsearch infers are not always accurate. If, for example, the wrong field type is chosen, indexing errors will occur.
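One way to avoid relying on guessed types is to declare the mapping yourself when the index is created. The sketch below builds such a body, assuming Elasticsearch 7 or later (where mappings are typeless); the field names are hypothetical and the helper is not taken from the series.

```python
import json


def explicit_mapping_body(fields: dict) -> dict:
    """Build an index-creation body with an explicit mapping.

    `fields` maps field name -> Elasticsearch field type. Setting
    "dynamic" to "strict" makes Elasticsearch reject documents that
    contain fields missing from the mapping instead of guessing a type.
    """
    return {
        "mappings": {
            "dynamic": "strict",
            "properties": {name: {"type": ftype} for name, ftype in fields.items()},
        }
    }


# Hypothetical fields, chosen only to show common types.
mapping = explicit_mapping_body(
    {
        "user_id": "keyword",   # exact-match identifier, not analyzed
        "message": "text",      # analyzed for full-text search
        "created_at": "date",
        "retries": "integer",
    }
)
print(json.dumps(mapping, indent=2))
```

With an explicit mapping, a value that would otherwise be mis-typed on first sight (for example, a numeric-looking identifier) is handled as the type you declared.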

Keep reading

Elasticsearch is document oriented, meaning that it stores entire objects or documents. Aside from storing them, it indexes the contents of each document in order to make them searchable. In Elasticsearch you index, search, sort, and filter documents, not rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.

Objects in an application are rarely simple lists of keys and values. More often they are complex data structures that may contain dates, geo locations, other objects, or arrays of values.
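As a small illustration, the sketch below shows what such an object might look like once serialized as a JSON document ready for indexing. Every field name and value here is made up for the example.

```python
import json

# A hypothetical application object with a date, a geo location,
# a nested object, and an array of values.
document = {
    "title": "Quarterly meetup",
    "starts_at": "2024-05-01T18:30:00Z",          # date as an ISO 8601 string
    "location": {"lat": 40.7128, "lon": -74.0060},  # geo location
    "organizer": {"name": "Example Org", "verified": True},  # nested object
    "tags": ["search", "indexing", "tuning"],     # array of values
}

# The whole structure is stored and indexed as one JSON document.
payload = json.dumps(document)
print(payload)
```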

Keep reading

In this post, we walk through index creation in detail, from static creation of simple indices to dynamic templates for creating multiple indices.

Keep reading

Qbox users will often need to import an existing dataset from a primary data source or an external Elasticsearch cluster.

As you might expect, indexing activity is much more intensive during an initial import of such data, and it places a significant load on the system. In this short article, we show you how to change a couple of settings that will greatly improve the efficiency of your bulk data loads into Elasticsearch clusters.

Keep reading