We have already covered “The Authoritative Guide to Elasticsearch Performance Tuning” in a three-part tutorial series that introduces general tips and methods for performance tuning, explaining at each step the most relevant system configuration settings and metrics. The earlier tutorials in the series are:

The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)

The Authoritative Guide to Elasticsearch Performance Tuning (Part 2)

The Authoritative Guide to Elasticsearch Performance Tuning (Part 3)

This tutorial recommends performance tuning techniques and strategies specific to Elasticsearch 5.0 and later.


We discussed the “Langdetect Ingest Plugin” in a previous post. This post focuses on “elasticsearch-langdetect”, the plugin based on Nakatani Shuyo's language detector.

The “elasticsearch-langdetect” plugin offers a mapping type to specify fields where we want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang'. The field can be queried for language codes.

We can use the multi_field mapping type to combine this plugin with the attachment mapper plugin, which enables language detection in base64-encoded binary data. Currently, only UTF-8 text is supported.

The plugin also offers a REST endpoint: post a short UTF-8 text to it, and the plugin responds with a list of recognized languages.
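
As a rough illustration of the two ideas above, the Python sketch below prepares (but does not send) a UTF-8 request body for the REST endpoint and base64-encodes a text for a binary field. The `/_langdetect` path, the `localhost:9200` host, and the `content` field name are assumptions made for illustration; check the plugin's README for the exact endpoint and mapping in your version.

```python
import base64
import json

# Assumed endpoint path and host; verify against the installed plugin version.
LANGDETECT_PATH = "/_langdetect"
DEFAULT_HOST = "http://localhost:9200"


def build_detect_request(text, host=DEFAULT_HOST):
    """Return (url, body_bytes) for posting a short text encoded as UTF-8."""
    return host + LANGDETECT_PATH, text.encode("utf-8")


def encode_attachment(text):
    """Base64-encode UTF-8 text, the form a binary field expects."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


url, body = build_detect_request("Das ist ein Test")

# "content" is a hypothetical field name for the base64-encoded data.
document = json.dumps({"content": encode_attachment("Das ist ein Test")})
```

The request is only constructed here, so the sketch runs without a cluster; sending `body` to `url` with an HTTP client is left to the reader.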



Text segmentation has always been critical to search. It is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial: while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and are not present in all written languages.

Word segmentation is the task of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). The space character alone may not be sufficient in some cases, such as contractions like won't for will not.

However, an equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
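
To make the contrast concrete, here is a minimal Python sketch: splitting on whitespace works for an English sentence but returns a Chinese sentence as a single token. The greedy longest-match segmenter and its tiny vocabulary are illustrative assumptions only, not how Elasticsearch or any particular analyzer segments text.

```python
def whitespace_tokens(text):
    """Split on whitespace, a reasonable first approximation for English."""
    return text.split()


def max_match(text, vocab):
    """Greedy longest-match segmentation against a known vocabulary."""
    longest = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


english = "the space is a good word divider"
chinese = "我喜欢搜索引擎"
# Hypothetical toy vocabulary; real segmenters use large dictionaries or models.
vocab = {"我", "喜欢", "搜索", "引擎", "搜索引擎"}

print(whitespace_tokens(english))  # seven separate words
print(whitespace_tokens(chinese))  # one unsplit token
print(max_match(chinese, vocab))   # ['我', '喜欢', '搜索引擎']
```

Greedy matching is chosen here only because it is short; it can pick the wrong split when a longer dictionary entry overlaps the intended words.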


Elasticsearch is generally used to index data of types such as string, number, and date. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a real-world use case in applications like HCM, ERP, and ecommerce.

