Elasticsearch is generally used to index data of types like string, number, date, etc. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a common real-world use case in applications like HCM, ERP, and e-commerce.

Is there a mechanism to index files as easily as a string or number?

YES! Elasticsearch caters to this need via a specialized data type called "attachment".

In this post, we’ll see how to index files (attachments) into ES using the “attachment” data type, as well as the different ways to search for them. The examples use ES 2.3. For Elasticsearch 5.x, you'll need to use the Ingest Attachment Processor Plugin, which is not covered in this tutorial.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Tutorial

The first step is to install the Elasticsearch plugin mapper-attachments, which enables ES to recognize the “attachment” data type. The plugin uses Apache Tika for content extraction and supports several file types such as .pdf, .doc, .xls, .rtf, .html, .odt, etc.

The plugin can be installed by running the following on the command line:

$ES_HOME> sudo bin/plugin install mapper-attachments

Once the plugin is installed, restart ES so that the new plugin is loaded. Let's get started by creating a mapping under the index “company”:

curl -X POST "http://localhost:9200/company" -d '{
  "mappings":{
     "employee":{
        "properties":{
           "resume":{
              "type":"attachment"
           },
           "name":{
              "type":"string"
           }
        }
     }
  }
}'

As highlighted in the above mapping, “resume” is of the type “attachment”.

Now that the mapping has been created, let’s index a file into the “company” index under the type “employee”. The file content must be base64-encoded; the encoding below was done with an online utility. The text file contains a single sentence, which appears decoded in the search results later in this post. The encoded content goes under “resume”, and “name” is set to Mark:

curl -X POST "http://localhost:9200/company/employee/1" -d '{
"resume": "UWJveCBtYWtlcyBpdCBlYXN5IGZvciB1cyB0byBwcm92aXNpb24gYW4gRWxhc3RpY3NlYXJjaCBjbHVzdGVyIHdpdGhvdXQgd2FzdGluZyB0aW1lIG9uIGFsbCB0aGUgZGV0YWlscyBvZiBjbHVzdGVyIGNvbmZpZ3VyYXRpb24u",
"name":"Mark"
}'
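
The base64 string above can also be produced locally rather than with an online utility. The following is a minimal sketch, assuming a POSIX shell with the standard base64 and tr utilities. The file names resume.txt and payload.json and the helper names encode_file and build_payload are hypothetical and only serve as illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the base64 encoding of a file on one line.
# Line wraps are stripped because the value is embedded in a JSON string.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: build the JSON body for an "employee" document.
# Assumes the name contains no characters that need JSON escaping.
build_payload() {
  printf '{"resume":"%s","name":"%s"}\n' "$(encode_file "$1")" "$2"
}

# Sample file holding the sentence used in this post (no trailing newline).
printf '%s' 'Qbox makes it easy for us to provision an Elasticsearch cluster without wasting time on all the details of cluster configuration.' > resume.txt

# Write the request body to a file so it can be passed to curl.
build_payload resume.txt Mark > payload.json
```

The resulting file could then be sent with curl -X POST "http://localhost:9200/company/employee/1" -d @payload.json, which avoids pasting a long encoded string on the command line.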

Now, let us search. Since “QBOX” is a word that is present in the file that was indexed, let us search for it.

curl -X POST "http://localhost:9200/company/employee/_search" -d '{
  "query":{
     "query_string":{
        "query":"QBOX"
     }
  }
}'

Search Results:
{
 "took": 394,
 "timed_out": false,
 "_shards": {
   "total": 5,
   "successful": 5,
   "failed": 0
 },
 "hits": {
   "total": 1,
   "max_score": 0.047945753,
   "hits": [
     {
       "_index": "company",
       "_type": "employee",
       "_id": "1",
       "_score": 0.047945753,
       "_source": {
         "resume": "UWJveCBtYWtlcyBpdCBlYXN5IGZvciB1cyB0byBwcm92aXNpb24gYW4gRWxhc3RpY3NlYXJjaCBjbHVzdGVyIHdpdGhvdXQgd2FzdGluZyB0aW1lIG9uIGFsbCB0aGUgZGV0YWlscyBvZiBjbHVzdGVyIGNvbmZpZ3VyYXRpb24u",
         "name": "Mark"
       }
     }
   ]
 }
}

Now that the search is successful, let's see how we can make indexing files more efficient.

Base64-encoding a file increases its size by about 33%. Storing the base64 content in the document therefore makes it bulkier and consumes more space. There is a solution for this.
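
As a quick aside on where the 33% figure comes from: base64 turns every group of up to 3 input bytes into 4 output characters. Here is a small sketch of that arithmetic, using a hypothetical helper named base64_len and assuming padded output without line wraps.

```shell
#!/bin/sh
# Hypothetical helper: number of characters in the padded base64 encoding
# of N input bytes (each group of up to 3 bytes becomes 4 characters).
base64_len() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

# The 129-byte sample sentence from this post encodes to 172 characters.
base64_len 129

# A 3,000,000-byte file grows to 4,000,000 characters, about 33% more.
base64_len 3000000
```

For large attachments this overhead adds up in the stored _source, which is the motivation for the mapping change that follows.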

In Elasticsearch, for “attachment” indexing, the extracted content is mapped under the subfield "content". We can exclude "resume" (the base64-encoded content) from the _source field and store the extracted "content" subfield instead. Because the "company" index already exists, delete it first, recreate it with the mapping below, and then index the document again:

curl -X POST "http://localhost:9200/company" -d '{
  "mappings":{
     "employee":{
        "_source":{
           "excludes":[
              "resume"
           ]
        },
        "properties":{
           "resume":{
              "type":"attachment",
              "fields":{
                 "content":{
                    "type":"string",
                    "store":true
                 }
              }
           },
           "name":{
              "type":"string"
           }
        }
     }
  }
}'

Execute the same search query again and see what the new results return:

Search Results:
{
 "took": 13,
 "timed_out": false,
 "_shards": {
   "total": 5,
   "successful": 5,
   "failed": 0
 },
 "hits": {
   "total": 1,
   "max_score": 0.047945753,
   "hits": [
     {
       "_index": "company",
       "_type": "employee",
       "_id": "1",
       "_score": 0.047945753,
       "_source": {
         "name": "Mark"
       }
     }
   ]
 }
}

The bulky base64-encoded content of "resume" has vanished from the _source field! Wouldn't it be nice to see the actual content of the resume field? The extracted content of the attachment is stored in the subfield "content", which can be accessed using dot notation. Let's modify the search query and execute it as shown below:

curl -X POST "http://localhost:9200/company/employee/_search" -d '{  
  "fields":[  
     "resume.content"
  ],
  "query":{  
     "query_string":{  
        "query":"QBOX"
     }
  }
}'

Search Results:
{
 "took": 16,
 "timed_out": false,
 "_shards": {
   "total": 5,
   "successful": 5,
   "failed": 0
 },
 "hits": {
   "total": 1,
   "max_score": 0.047945753,
   "hits": [
     {
       "_index": "company",
       "_type": "employee",
       "_id": "1",
       "_score": 0.047945753,
       "fields": {
         "resume.content": [
           "Qbox makes it easy for us to provision an Elasticsearch cluster without wasting time on all the details of cluster configuration.\n"
         ]
       }
     }
   ]
 }
}

Now we see the actual content of the file we indexed.

The "attachment" type also provides other subfields like date, title, author, content_type, language, content_length, and keywords. These let you index the file’s metadata, such as the file name, alongside the attachment. Like "content", they can be queried using dot notation, for example resume.title.
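
To illustrate the dot notation on a metadata subfield, here is a hedged sketch of a query against resume.content_type. The file name metadata_query.json and the value text/plain are assumptions; whether a document matches depends on the content type that was actually detected for the indexed file.

```shell
#!/bin/sh
# Hypothetical file name. The query targets the content_type subfield of
# the "resume" attachment via dot notation; "text/plain" is an assumed value.
cat > metadata_query.json <<'EOF'
{
  "query": {
    "match": {
      "resume.content_type": "text/plain"
    }
  }
}
EOF
```

The query body could then be sent with curl -X POST "http://localhost:9200/company/employee/_search" -d @metadata_query.json, in the same way as the earlier searches.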

Conclusion

The mapper-attachments plugin eases the way attachments are indexed into ES. It also provides the ability to detect the language of the content in the files, which means it supports multi-language attachments. 

In the next post on this topic, we'll explore more on "mapper-attachments" and answer questions like how to enable language detection, how to highlight the search results, how to apply custom analyzers for the parsed attachment data, how to populate and query metadata, how the mapper-attachments functionality is taken forward in ES 5.0, and more.