Avoiding duplication in your Elasticsearch indexes is always a good thing, and eliminating duplicates brings several concrete benefits: you save disk space, improve search accuracy, and make more efficient use of hardware resources. Perhaps most important, you reduce the fetch time for searches.

Surprisingly, there is little documentation available on this topic, so we offer this tutorial to give you the proper techniques for identifying and managing duplicates in your indexes.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

Example Data

Here are four simple documents, one of which is a duplicate of another. We index these documents into an index named employeeid with the type info.

Document 1

curl -XPOST 'http://localhost:9200/employeeid/info/1' -d '{
 "name": "John",
 "organisation": "Apple",
 "employeeID": "23141A"
}'

Document 2

curl -XPOST 'http://localhost:9200/employeeid/info/2' -d '{
 "name": "Sam",
 "organisation": "Tesla",
 "employeeID": "TE9829"
}'

Document 3

curl -XPOST 'http://localhost:9200/employeeid/info/3' -d '{
 "name":"Sarah",
 "organisation":"Microsoft",
 "employeeID" :"M54667"
 }'

Document 4

curl -XPOST 'http://localhost:9200/employeeid/info/4' -d '{
 "name": "John",
 "organisation": "Apple",
 "employeeID": "23141A"
}'

Look closely, and you’ll see that document 4 is a duplicate of document 1.

Avoiding Duplicate Documents during Indexing

Before we consider how to perform duplication checking in Elasticsearch, let’s take a moment to consider the different types of indexing scenarios.

One scenario is when we have access to the source documents prior to indexing. In such cases, it’s relatively easy to examine the data and find one or more fields containing unique values. That is, each of the distinct values for that field occurs in precisely one document. In such cases, we could set that particular field to be the document id for the Elasticsearch index. Since any duplicate source documents will also have the same document id, Elasticsearch will ensure that these duplicates won’t become part of the index.
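
For example, since the employeeID field holds a unique value for each employee, we could use it as the document id. Here is a minimal sketch using our first example document; the op_type=create parameter on the index API tells Elasticsearch to reject, rather than overwrite, any document whose id already exists:

curl -XPUT 'http://localhost:9200/employeeid/info/23141A?op_type=create' -d '{
 "name": "John",
 "organisation": "Apple",
 "employeeID": "23141A"
}'

Indexing the duplicate document 4 under this same id would then return a conflict error (HTTP 409) instead of silently adding a second copy to the index.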

Upsert

Another scenario is when one or more documents have the same identifier but different content. This often happens when a user edits a document and wants to reindex it using the same document id. The problem is that when the user attempts to reindex, Elasticsearch won’t permit it because document ids must be unique.

The workaround is to use the upsert feature of the update API. Upsert checks for the existence of a particular document and, if it exists, updates that document with the content of the upsert request. If the document does not exist, upsert creates a new document with that content. Either way, the user gets the content update under the same document id.
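
Here is a minimal sketch using the doc_as_upsert option of the update API, reusing the fields from our example documents:

curl -XPOST 'http://localhost:9200/employeeid/info/1/_update' -d '{
 "doc": {
   "name": "John",
   "organisation": "Apple",
   "employeeID": "23141A"
 },
 "doc_as_upsert": true
}'

If document 1 already exists, its fields are updated with the contents of doc; if it doesn’t, a new document with id 1 is created from that same content.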

In the third scenario, we have no access to the data set prior to creation of the index. In these cases, we need to search the index and check for duplicates. This is what we demonstrate in the sections below.

Basic Checks for Duplicates

In each of our example documents, we see three fields: name, organisation, and employeeID. If we begin by assuming that values of the field name should be unique, we can specify that field as the identifier for checking duplicates. If more than one document has the same value for the name field, then those documents are indeed duplicates.

Proceeding with this rationale in mind, we can perform a simple terms aggregation to get document counts for each value of the field name. However, a terms aggregation alone will only return the document count under each value of that field; it won’t show us the documents themselves. To examine the documents in which the duplicate values occur, we also need to apply the top_hits aggregation, a subaggregator that returns the top matching documents for each bucket.

Here is the query we would run against our index of example documents given above:

curl -XGET 'http://localhost:9200/employeeid/info/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
      "field": "name",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}'

Here we set the parameter min_doc_count to 2 so that only the aggregation buckets having a doc_count of two or more will appear in the aggregation (as shown in the results below).

{
  "took": 112,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": [
    ]
  },
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "john",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "4'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                },
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "1'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

It’s important to note that we must set the value of min_doc_count to two. Otherwise, buckets containing only a single document will also appear in the aggregation, and we would have to sift through them to find any duplicates that may exist.
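
Also, if the documents in your index are large, you can keep the response compact by limiting what top_hits returns. Here is a variation of the query above that uses the standard size and _source options of the top_hits aggregation:

curl -XGET 'http://localhost:9200/employeeid/info/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10,
            "_source": {
              "include": ["name", "employeeID"]
            }
          }
        }
      }
    }
  }
}'

Here size raises the number of documents returned per bucket from the default of three, and _source restricts each returned hit to the listed fields.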

De-duping on Values in Multiple Fields

What we did above is a very basic example of identifying duplicate documents according to values in a single field, which isn’t very interesting or very useful. In most cases, checking for duplicates requires examination of multiple fields: we cannot reliably assume that employee documents are duplicates merely because the value “Bill” occurs more than once in the name field. In many real-world cases, it’s necessary to check for duplication across many different fields. Considering our example data set above, we need to check for repetition across all of the fields.

We can extend our approach from the previous section and perform a multi-field terms aggregation together with a top_hits aggregation. We run the terms aggregation over all three fields in the documents of our index, again specifying the min_doc_count parameter to get only buckets having a doc_count of two or more, and again applying a top_hits aggregation to retrieve the matching documents. To accommodate multiple fields, we employ a script that appends the field values for display in the aggregation. (Note that the field names in the script use escaped double quotes so that the shell’s single-quoted request body stays intact, and that dynamic scripting must be enabled on your cluster for script-based aggregations to run.)

curl -XGET 'http://localhost:9200/employeeid/info/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": "doc[\"name\"].values + doc[\"employeeID\"].values + doc[\"organisation\"].values",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}'

As we show below, the results from running this query display a duplicateCount aggregation in which we get a total of three key values, each with a doc_count of two. Under each key value, the duplicateDocuments aggregation contains the documents in which the duplicate values were found, and we can perform cross-checks and verification on those documents.

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": [
    ]
  },
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "23141a",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "4'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                },
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "1'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "apple",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "4'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                },
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "1'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "john",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "4'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                },
                {
                  "_index": "employeeid",
                  "_type": "info",
                  "_id": "1'",
                  "_score": 1,
                  "_source": {
                    "name": "John",
                    "organisation": "Apple",
                    "employeeID": "23141A"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
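
Once you’ve verified which documents are redundant, managing them is straightforward: the simplest option is to delete the duplicate copies by document id using the standard delete API. For example, to remove document 4 while keeping document 1:

curl -XDELETE 'http://localhost:9200/employeeid/info/4'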

That brings us to a wrap on this short tutorial. We welcome your comments below, and we invite you to check out our other informative articles on Elasticsearch.