Are you looking for full-text search and highlighting for .pdf, .doc, or .epub files on your system? In this tutorial, we'll show you how to achieve this with the Elasticsearch Mapper Attachment plugin.

This tutorial covers Elasticsearch versions before 5.x. For 5.x and later, see this tutorial on how to index attachments and files to Elasticsearch using the Ingest API.

Mapper Attachment Plugin

The Mapper Attachment plugin lets Elasticsearch index different types of files, such as .pdf, .epub, and .doc. It uses the open source Apache Tika libraries to extract metadata and text.

We are going to use this plugin to index a PDF document and make it searchable. Here is how the document will be indexed in Elasticsearch using this plugin:

[Figure: pdf-to-elasticsearch.png, the flow from PDF document to Elasticsearch index]

As you can see, the PDF document is first converted to base64 format and then passed to the Mapper Attachment plugin. The plugin selects the required parser library and applies it to the document to extract its text and metadata. Once the text and metadata are extracted, they are indexed to Elasticsearch.

Plugin Installation

The plugin can be installed using the command below:

bin/plugin install mapper-attachments

The above command installs the plugin for Elasticsearch 2.3.3. For other versions, refer to the plugin's GitHub repository.
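
After installation, it can be useful to confirm that the plugin is actually present. The sketch below is one way to check from the shell. The helper name has_attachment_plugin is hypothetical, and it assumes the plugin appears under the name "mapper-attachments" in the listing, as it does on Elasticsearch 2.x.

```shell
#!/bin/sh
# Hypothetical helper (not part of the plugin): succeeds if the plugin
# listing read from stdin mentions the mapper-attachments plugin.
has_attachment_plugin() {
  grep -q 'mapper-attachments'
}

# Usage, run from the Elasticsearch home directory (2.x layout assumed):
#   bin/plugin list | has_attachment_plugin && echo "plugin installed"
#   curl -s "localhost:9200/_cat/plugins" | has_attachment_plugin && echo "plugin loaded"
```

The first check inspects the local installation, while the second asks a running node which plugins it has loaded.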

Applying the Mapping

It is not enough to install the plugin and then pass the document to Elasticsearch as base64. We also need to specify a mapping that reflects the contents and metadata of the indexed files:

curl -X PUT "http://$hostname:9200/pdf-test" -d '{
 "mappings": {
   "person": {
     "properties": {
       "file": {
         "type": "attachment",
         "fields": {
           "content": {
             "store": "yes"
           },
           "title": {
             "store": "yes"
           },
           "date": {
             "store": "yes"
           },
           "author": {
             "store": "yes"
           },
           "keywords": {
             "store": "yes"
           },
           "content_type": {
             "store": "yes"
           },
           "content_length": {
             "store": "yes"
           },
           "language": {
             "store": "yes"
           }
         }
       }
     }
   }
 }
}'

In the example above, we defined a mapping for the type "person", which declares a "file" property of type "attachment" and stores various metadata fields for that file.
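
Because every metadata field in the mapping takes the same "store" option, the request body can also be generated instead of typed by hand. The following is a minimal sketch of that idea. The function name build_attachment_mapping is hypothetical, and it assumes the same "person" type and "file" property used in the mapping above.

```shell
#!/bin/sh
# Hypothetical helper: print a create-index body that maps "file" as an
# attachment and stores each metadata field named on the command line.
build_attachment_mapping() {
  fields=""
  for name in "$@"; do
    # Separate entries with a comma after the first one.
    if [ -n "$fields" ]; then
      fields="$fields,"
    fi
    fields="$fields\"$name\":{\"store\":\"yes\"}"
  done
  printf '{"mappings":{"person":{"properties":{"file":{"type":"attachment","fields":{%s}}}}}}\n' "$fields"
}

# Usage (assumes $hostname points at an Elasticsearch 2.x node):
#   build_attachment_mapping content title date author > mapping.json
#   curl -X PUT "http://$hostname:9200/pdf-test" -d @mapping.json
```

Only the field names vary here, so adding or removing a stored field becomes a change to the argument list rather than to the JSON.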

Indexing Files

As we said earlier, documents must be converted to base64 format before they are indexed. You can use any programming language you are familiar with to do this. In the example below, we use a Perl one-liner to convert the document and then index it to Elasticsearch:

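# Slurp the whole file (-0777) and encode it as one base64 line ("" = no line breaks)
encodedPdf=`perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' testDocument.pdf`
json="{\"file\":\"${encodedPdf}\"}"
echo "$json" > json.file
# Index the document into the pdf-test index under the "person" type
curl -X POST "localhost:9200/pdf-test/person/" -d @json.file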
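
If Perl is not available, the same payload can be built with the base64 command-line tool. This is a minimal sketch, not the tutorial's original method. The function name build_attachment_json is hypothetical, and it assumes a base64 utility that reads from standard input, as the GNU and macOS versions do.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON document whose "file" field holds the
# base64-encoded contents of the file given as $1.
build_attachment_json() {
  # tr strips the line wrapping that some base64 implementations add,
  # so the value stays on one line and the JSON remains valid.
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"file":"%s"}\n' "$encoded"
}

# Usage (assumes Elasticsearch on localhost:9200 and the pdf-test index):
#   build_attachment_json testDocument.pdf > json.file
#   curl -X POST "localhost:9200/pdf-test/person/" -d @json.file
```

The base64 alphabet contains no characters that need escaping inside a JSON string, which is why the encoded value can be placed between the quotes directly.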

Searching the PDF

The extracted content is indexed and mapped as a "string" type under "file.content". When querying for data in that file, we should use the same field. A sample query may look as follows:

curl -XPOST http://localhost:9200/pdf-test/_search -d '{
 "from": 0,
 "fields": [
   "file.content"
 ],
 "query": {
   "match": {
     "file.content": "Easy"
   }
 },
 "highlight": {
   "fields": {
     "file.content": {}
   }
 }
}'

Each hit in the response to the above query contains the search keyword (here "Easy") in its "file.content" field. Because the query also requests highlighting, the matching terms are returned wrapped in <em> tags under the "highlight" field of each hit.
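
The search term and the highlight tags can both be parameterized. The sketch below builds the same kind of request body for an arbitrary term and uses the highlighter's "pre_tags" and "post_tags" options to replace the default <em> tags. The function name build_search_body is hypothetical, and the sketch assumes the term contains no double quotes or backslashes, since it does no JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: print a search body that matches term $1 on
# "file.content" and wraps highlighted matches in the tags $2 and $3
# (defaulting to <em> and </em>).
# Assumption: the term needs no JSON escaping.
build_search_body() {
  term=$1
  pre=${2:-"<em>"}
  post=${3:-"</em>"}
  printf '{"fields":["file.content"],"query":{"match":{"file.content":"%s"}},"highlight":{"pre_tags":["%s"],"post_tags":["%s"],"fields":{"file.content":{}}}}\n' "$term" "$pre" "$post"
}

# Usage (assumes Elasticsearch on localhost:9200 and the pdf-test index):
#   build_search_body Easy "<b>" "</b>" > query.json
#   curl -XPOST "http://localhost:9200/pdf-test/_search" -d @query.json
```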

Default Limit in Character Extraction

When we index a large PDF file, indexing can fail because of the limit on the number of characters that can be extracted. By default, at most 100,000 characters are extracted, and exceeding this limit results in an extraction error. We can avoid this by changing the setting, as in the example below:

index.mapping.attachment.indexed_chars : -1

This allows an unlimited number of characters to be extracted.
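
The plugin also accepts the limit per document, through an "_indexed_chars" key placed next to the base64 payload in "_content". The sketch below builds such a document. The function name build_limited_attachment_json is hypothetical, and it assumes the "file" attachment property from the mapping used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: print a document for the "file" attachment field
# with a per-document extraction limit.
#   $1 = path of the file to encode
#   $2 = maximum number of characters to extract (-1 for no limit)
build_limited_attachment_json() {
  # Strip any line wrapping so the base64 value stays on one line.
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"file":{"_indexed_chars":%s,"_content":"%s"}}\n' "$2" "$encoded"
}

# Usage (assumes Elasticsearch on localhost:9200 and the pdf-test index):
#   build_limited_attachment_json testDocument.pdf -1 > json.file
#   curl -X POST "localhost:9200/pdf-test/person/" -d @json.file
```

Setting the limit per document leaves the default in place for everything else, which may be preferable when only a few files are very large.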

Ingest Attachment Plugin

The Mapper Attachment plugin is deprecated in Elasticsearch 5 and above. It is replaced by a similar plugin named the Ingest Attachment Plugin (IAP). The IAP also uses the Apache Tika libraries, and its usage is similar. For more information, refer to its documentation.
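
To give a rough idea of how the newer plugin is used, the sketch below builds an ingest pipeline definition containing an "attachment" processor. It assumes Elasticsearch 5.x with the ingest-attachment plugin installed. The function name build_attachment_pipeline, the pipeline name "attachment", and the field name "data" are illustrative choices, not names taken from this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: print an ingest pipeline definition whose
# attachment processor reads base64 content from the field named in $1.
# "indexed_chars": -1 removes the extraction limit, mirroring the
# setting discussed earlier for the mapper plugin.
build_attachment_pipeline() {
  printf '{"description":"Extract attachment information","processors":[{"attachment":{"field":"%s","indexed_chars":-1}}]}\n' "$1"
}

# Usage (assumes Elasticsearch 5.x on localhost:9200):
#   build_attachment_pipeline data > pipeline.json
#   curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -d @pipeline.json
#   curl -X PUT "localhost:9200/pdf-test/person/1?pipeline=attachment" -d '{"data":"<base64 content>"}'
```

The main difference from the mapper plugin is that extraction happens in a pipeline at ingest time, so no special "attachment" field type is needed in the mapping.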

Conclusion

In this tutorial, we showed how to index commonly used file types (e.g., PDF) in Elasticsearch using the Mapper Attachment plugin. We also demonstrated how to run a full-text search on the indexed documents to return file contents and metadata. This functionality can be extremely helpful for implementing full-text search across various types of documents, leveraging Elasticsearch analyzers and language plugins.