The popularity of Elasticsearch is largely attributable to the ease with which a user can approach and begin using it. Although it’s true that a developer can ramp up quickly to some of the basic skills in Elasticsearch, it can be quite difficult to diagnose and solve problems. We know all too well that there are many common pitfalls that new and intermediate users encounter.

You’ll be glad to know that this article provides you with a number of suggestions, tips, and tricks to help ease your journey and reduce frustration. We consider this information to be elemental for new Elasticsearch users, and we also expect that intermediate users will find much of value here.

Introduction

Because it is where many users commonly find snags, we will focus on textual transformation, more properly known as text analysis. If you’re familiar with conventional databases, it might take you some time to become comfortable with the fact that Elasticsearch will happily index documents without requiring you to transform or structure the data up front. Of course, there is plenty of documentation available that covers these basics.

Get Results to Match Your Expectations

You may sometimes find that a result isn’t what you expect—even if you search for exactly what appears in your document. Maybe you’re including wildcards, yet you see things in your results that are not present in any of the documents. How could this be?

This is often a consequence of text analysis: the analysis transforms your text into index terms that may differ from what you are searching for. For example, if you have the text “enterprise-grade”, the standard analyzer will produce the terms “enterprise” and “grade” for indexing. Those are the only terms that end up in the index; the term “enterprise-grade” itself never does. Any search for that precise term in the index will therefore return nothing.

This illustrates how important it is to ensure that the text analysis performed at search time is compatible with the text analysis performed during indexing. In our example above, this requires tokenizing and lowercasing the search input in the same way. Some query types will perform textual analysis automatically, but most do not, and filters never perform any text analysis.

The match and query_string family of queries will indeed analyze the text. A match query for “enterprise-grade” will return results, but a term query or filter will not: the match query uses the same analyzer and thus searches for “enterprise” and “grade” in the index, while the term query performs an exact search for “enterprise-grade”, a term that is not found in the index.
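To make this concrete, here is a sketch of the two searches (the index and field names, products and description, are made up for illustration):

```json
# Analyzed: the input is tokenized into "enterprise" and "grade",
# so this finds the document.
POST /products/_search
{
    "query": {
        "match": { "description": "enterprise-grade" }
    }
}

# Not analyzed: this looks up the exact term "enterprise-grade",
# which never made it into the index, so it returns no hits.
POST /products/_search
{
    "query": {
        "term": { "description": "enterprise-grade" }
    }
}
```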

Many users try a workaround by including wildcards in their search input, the equivalent of a SQL construct such as WHERE column ILIKE '%query%'. The only guarantee in such cases is that the search will be slow, not that you’ll find anything of value! A wildcard search using our example term *enterprise-grade* would need to comb through all the terms in the index, and would most likely still come up empty.
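Such a wildcard search might look like the sketch below (assuming a hypothetical products index); every term in the field’s dictionary has to be examined:

```json
POST /products/_search
{
    "query": {
        "wildcard": { "description": "*enterprise-grade*" }
    }
}
```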

Elasticsearch users will also frequently encounter problems that are a consequence of the standard analyzer performing stop-word removal on words such as the, is, at, which, and on. This can be especially frustrating, for example, when you index country codes.
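You can inspect exactly which terms an analyzer produces with the analyze API. The request below is a sketch; whether country codes such as NO, IS, IT, and IN survive as tokens depends on your Elasticsearch version and the analyzer’s stop-word configuration:

```json
GET /_analyze?analyzer=standard&text=NO+IS+IT+IN
```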

The standard analyzer is the default, but we don’t recommend that you rely on the default analyzer. Elasticsearch typically does a good job of guessing the types of non-string values, but it can’t possibly know the precise treatment that you require for your text. It cannot discern whether a specific term is a tag, a name, or an element of prose.
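For fields such as tags or country codes, it is usually better to state the mapping explicitly. A minimal sketch (the index, type, and field names are illustrative; not_analyzed tells Elasticsearch to index the string verbatim instead of analyzing it):

```json
PUT /myindex
{
    "mappings": {
        "document": {
            "properties": {
                "country_code": { "type": "string", "index": "not_analyzed" },
                "comment": { "type": "string", "analyzer": "standard" }
            }
        }
    }
}
```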

Conversely, when the mappings are wrong, you may sometimes find that documents that should not match will in fact appear as matches. For example, if the standard analyzer indexes the date "2014-11-12" as a string, then a match query for 2014-12-11 (an entirely different date) will nevertheless match, because the query is an OR of the terms "2014", "12", and "11".

Take Care in Your Mappings

Incorrect text processing is indeed a cause of many common problems. In many of those cases, the root cause of errant text processing is incorrect or incomplete mappings—and reliance on the schema-free nature of dynamic mapping exacerbates the problem further.


Changing mappings requires substantial effort, and you often need to plan extensively to change mappings with little impact on production environments. This is why so many solutions to problems come with the caveat “but don’t change the mapping.” Constantly adjusting searches to accommodate poor mappings is a hike up a rocky path, leading to unnecessarily complex queries and poor performance. We recommend that you take extra care to avoid this.

There is no such thing as a schema-free database that can accommodate textual searches with good performance. Most efforts to construct searches around the structure of the native documents will prove inefficient at best and inaccurate at worst. Bottom line: it’s critically important to index according to the structure of your searches.

You can find a number of resources about text processing and mapping with Elasticsearch. You don’t need deep mapping skills if you’re just starting and simply need to quickly import data into Elasticsearch and perform basic searches.

However, it’s very important for those who plan more extensive use of Elasticsearch to gain some proficiency in the basics of text analysis and mapping.

 

Key-Value Challenges

Another type of problem occurs when a developer attempts to use Elasticsearch as a generic key-value store. Here, again, we see a heavy reliance on the schema-free notion. If the keys are derived from the values, then the mapping that Elasticsearch builds from your data will grow without limit.

Consider an application in which the typical user selects survey questions and the corresponding documents have the form given below. Here, "favorite_author" and "best_quote" are the keys; these and the other keys in the answers section can be anything that the user specifies.

Document 1

{
    "user": "respondent123",
    "survey_id": "321",
    "answers": {
        "favorite_author": "Fyodor Dostoyevsky",
        "best_quote": "Man only likes to count his troubles, but he does not count his joys."
    }
}

 

Document 2

{
    "user": "respondent456",
    "survey_id": "456",
    "answers": {
        "search_experience": "6",
        "common_search_buzzwords": ["search", "engine", "sql", "devops", "cloud computing"]
    }
}

 

Initially, this may seem sensible because you permit users to create custom surveys without enforcing a rigid schema.

There’s a problem, however: every key will end up in the mapping for the index (answers.favorite_author, answers.common_search_buzzwords, and so on). This won’t be a problem for small data sets, but it becomes quite difficult to manage as the mapping grows. A few thousand surveys with custom schemas means thousands of fields, each with its own memory cost, and the ever-larger mapping is part of the cluster state that is distributed to every node.

If you have keys that change according to values, we strongly recommend that you restructure the documents to have fixed keys. We also recommend that you invest time exploring nested documents. Also, have a look at our blog article on Parent-child relationships in Elasticsearch.

You could revise the example given above to use nested documents. The mapping would then only have properties for answers.key, answers.keyword_value, and answers.text_value, and it won’t grow as you add more documents. You can also easily customize the mappings so that keyword_value is indexed as not_analyzed and can be faceted on, while text_value uses a custom analyzer.
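A sketch of what this could look like, using the nested type and the string/not_analyzed mapping syntax (the index and type names, surveys and response, are made up for illustration):

```json
PUT /surveys
{
    "mappings": {
        "response": {
            "properties": {
                "answers": {
                    "type": "nested",
                    "properties": {
                        "key": { "type": "string", "index": "not_analyzed" },
                        "keyword_value": { "type": "string", "index": "not_analyzed" },
                        "text_value": { "type": "string", "analyzer": "standard" }
                    }
                }
            }
        }
    }
}
```

The first survey document above would then be restructured with fixed keys:

```json
{
    "user": "respondent123",
    "survey_id": "321",
    "answers": [
        { "key": "favorite_author", "text_value": "Fyodor Dostoyevsky" },
        { "key": "best_quote", "text_value": "Man only likes to count his troubles, but he does not count his joys." }
    ]
}
```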

Document Scoring and Relevance: Ensure Proper Weighting in your Searches

Relevancy and scoring are complex subjects that are the focus of much research. The score for a document is a combination of textual similarity and metadata-based weighting; signals such as counts, likes, and a document’s location in the information structure all influence the score. There is an art to combining, refining, and maintaining these models, and we will cover this in future articles.

Set explain=true on your search object to examine the numerical weighting and see which terms and properties influence the score for a particular document.
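For example, a request along these lines (the albums index is made up; the query corresponds to the explain tree shown below):

```json
POST /albums/_search?explain=true
{
    "query": {
        "multi_match": {
            "query": "never",
            "fields": ["title^3", "record_label"]
        }
    }
}
```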

Below, we provide an example of a simple explain tree that presents the calculation of scores for a simple multi_match query on title^3 and record_label, where hits in the title are boosted. Explain trees can quickly get quite large for queries that match on many fields and combine various signals using, for example, function score queries. Even so, they are very useful for quickly spotting which factors dominate the scoring calculation. In this example, as you would expect, the match in the title field contributes substantially more to the overall score than the hit in record_label.

We have much more to say on scoring in our recent blog article, Elasticsearch Scripting: Scoring.

"explanation : {
     "value": 0.1755522
     "description": 'sum of:',
     "details" : [ {
 "value": 0.13746893,
     "description": "weight(title:never^3.0 in 0) [PerFieldSimilarity], result of:"
     "details" : [ {
       "value" : 0.13148214,
       "description" : "score(doc=0,freq=1.0 = termFreq=1.0), product of:",
         "details": [ {
           "value" : 0.94145671,
           "description" : 'queryWeight, product of:',
           "details" : [ { 
              "description" : "boost"
              "value" : 3.0,
               ...
  "value" : 0.035216943,
   "description" : "weight(record_label:never in 0) [PerFieldSimilarity], result of:",
   "details" :
   ...

 

Debug your Queries


Searches can quickly become large and consist of many different queries and filters. We recommend that you start by verifying your assumptions about the most-nested queries/filters and work your way outward. Ensure that you include in your sample data some documents that should not match.
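For instance, if a large filtered search misbehaves, you can first run the innermost filter on its own and verify the hit count before recombining it with the rest of the query (the index and field names here are made up):

```json
POST /myindex/_search
{
    "query": {
        "filtered": {
            "filter": {
                "term": { "state": "published" }
            }
        }
    }
}
```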

As your experience with Elasticsearch increases, you’ll gain a better understanding of the types of queries, filters, facets, and searches that place greater stress on your computing resources. You’ll also get better at finding the root causes of your out-of-control mappings. Our approach is to always test a new configuration with realistic amounts of data, while tightly controlling the environment. Small, well-known environments will reveal whether your documents, mappings, and searches are delivering the results and behavior that you expect (although such simple environments do not test performance or scalability).

Step Back and Reflect


We hope that this modest-length article covering some basic troubleshooting topics will motivate you to learn more.

In closing, here are some important questions to consider when you encounter problems similar to those discussed in this article:

  • Does this query or filter analyze the text?
  • What are the exact terms in the index that are the target of my query or filter?
  • Which terms are actually found in the index?
  • Do the nested queries/filters actually function according to my assumptions?
  • What parts of my query contribute most to the score? (set explain=true)
  • Am I in full control of the mapping, or do I use the Elasticsearch defaults?
  • Will my mapping and search configuration handle increasing amounts of data?

We hope that you find this article helpful, and we invite you to make comments below.