Are you looking for full-text search and highlighting on the .pdf, .doc, or .epub files you have on your system? In this tutorial, we show you how to do it with the mapper attachment plugin.

This tutorial is for pre-5.x Elasticsearch scenarios. For 5.x and later, see this tutorial on how to index attachments and files to Elasticsearch using the Ingest API.

Mapper Attachment Plugin

The mapper attachment plugin is an Elasticsearch plugin for indexing different types of files, such as .pdf, .epub, and .doc. It uses the open source Apache Tika libraries for text extraction.

Let's try out the plugin by indexing a PDF document and making it searchable. Here is how a document is indexed in Elasticsearch using this plugin:

[Figure: pdf-to-elasticsearch.png]

As the figure above shows, the file is first converted to base64 (any language can be used for this) and embedded in a JSON document, which is then passed to Elasticsearch. The plugin selects the appropriate parser library, applies it to the file to extract its text and metadata, and the result is indexed in Elasticsearch.
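
The encoding step can be sketched in a few lines of shell. This is a minimal sketch rather than the plugin's own tooling: it assumes a `base64` command-line utility is available (GNU coreutils or BSD), the helper name `build_attachment_json` is made up for illustration, and the field name `file` is an assumption that has to match the attachment field declared in the mapping.

```shell
# Hypothetical helper: wrap a file's base64 encoding in a JSON document.
# Assumes a `base64` utility is on the PATH; "file" is an assumed field name.
build_attachment_json() {
  # Reading from stdin works with both GNU and BSD base64.
  # tr strips the line wraps some implementations add, because raw
  # newlines are not valid inside a JSON string.
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"file":"%s"}\n' "$encoded"
}

# Assumed usage: build_attachment_json testDocument.pdf > json.file
```

Keeping the encoded payload on a single line matters: a base64 string that contains raw line breaks produces a JSON body that Elasticsearch cannot parse.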

Plugin Installation

The plugin can be installed using the command below:

bin/plugin install elasticsearch/elasticsearch-mapper-attachments/github

The above command installs the plugin for Elasticsearch 2.3.3. For other versions, refer to the plugin's GitHub repo here.

Applying the Mapping

It is not enough to install the plugin and then pass the document to Elasticsearch as base64. We also need to declare in the type mapping that the type holds attached files such as .pdf or .doc. To do this, we apply a mapping to the required type of the index, as below:

curl -X PUT "http://$hostname:9200/pdf-test" -d '{
 "mappings": {
   "person": {
     "properties": {
       "file": {
         "type": "attachment",
         "fields": {
           "content": {
             "store": "yes"
           },
           "title": {
             "store": "yes"
           },
           "date": {
             "store": "yes"
           },
           "author": {
             "store": "yes"
           },
           "keywords": {
             "store": "yes"
           },
           "content_type": {
             "store": "yes"
           },
           "content_length": {
             "store": "yes"
           },
           "language": {
             "store": "yes"
           }
         }
       }
     }
   }
 }
}'

Here we have defined a mapping for the type named "person", in which the "file" field is given the type "attachment".

Indexing Files

As we said earlier, the file to be indexed must first be converted to base64. You can use any familiar language for this. The code below uses a Perl one-liner to do the encoding and then sends the result to the Elasticsearch index:

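# Slurp the whole file (-0777) and encode it as one line ("" disables line wraps).
# Encoding line by line, or leaving newlines in, would corrupt the JSON string.
encodedPdf=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' testDocument.pdf)
json="{\"file\":\"${encodedPdf}\"}"
echo "$json" > json.file
curl -X POST "localhost:9200/pdf-test/person/" -d @json.file

Searching on the PDF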

The extracted text is indexed as a string in the "content" sub-field defined under "fields" of the attachment field, so it is addressed as "file.content". When querying for data present in the file, we should use that same field. A sample query is below:

curl -XPOST "http://localhost:9200/pdf-test/_search" -d '{
 "from": 0,
 "fields": [
   "file.content"
 ],
 "query": {
   "match": {
     "file.content": "Easy"
   }
 },
 "highlight": {
   "fields": {
     "file.content": {}
   }
 }
}'

The response to the above query returns the documents whose "content" field contains the search keyword (here "Easy"). Also, since highlighting is requested in the query, each match is wrapped in an <em> tag under the "highlight" field of the response.
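
To make the shape of the response concrete, the sketch below pulls the highlighted terms out of a response body. The sample response is hand-written and heavily trimmed for illustration, not real output. `extract_highlights` is a hypothetical helper, and it assumes the default `<em>` highlight tags and a `grep` that supports `-o` (GNU and BSD both do).

```shell
# Hypothetical helper: print each highlighted term found in a search
# response read from stdin. Assumes the default <em>...</em> tags.
extract_highlights() {
  grep -o '<em>[^<]*</em>' | sed -e 's/<em>//' -e 's/<\/em>//'
}

# Hand-written, trimmed sample of a response (illustrative only).
sample_response='{"hits":{"hits":[{"_index":"pdf-test","_type":"person","highlight":{"file.content":["Indexing PDFs is <em>Easy</em> with this plugin"]}}]}}'

# Prints: Easy
printf '%s\n' "$sample_response" | extract_highlights
```

A pattern match is enough for a quick check like this one. For anything beyond that, a proper JSON parser is the safer choice.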

Default Limit in Character Extraction

Sometimes when we index a large PDF file, indexing might fail because of the limit on the number of characters that can be extracted. By default, a maximum of 100,000 characters are extracted. Exceeding this limit results in an extraction error. This can be overcome by changing the setting, as below:

index.mapping.attachment.indexed_chars : -1

This allows an unlimited number of characters to be extracted.
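
One way to apply the setting is as an index-level setting when the index is created. The sketch below only builds the request body. It assumes the plugin reads `index.mapping.attachment.indexed_chars` from the index settings, and `indexed_chars_settings` is a hypothetical helper name. The commented `curl` line shows how the body might be sent to a new, hypothetical index called `pdf-test-unlimited`.

```shell
# Hypothetical helper: build a create-index body that sets the extraction limit.
# $1 is the character limit; -1 means unlimited.
indexed_chars_settings() {
  printf '{"settings":{"index.mapping.attachment.indexed_chars":%s}}\n' "$1"
}

indexed_chars_settings -1

# Assumed usage against a new, hypothetical index:
# indexed_chars_settings -1 | curl -X PUT "http://localhost:9200/pdf-test-unlimited" -d @-
```

In practice the attachment mapping would be sent in the same create-index request, alongside the "settings" block.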

Ingest Attachment Plugin

The mapper attachment plugin is deprecated in newer releases (5 and above) of Elasticsearch. It is replaced by a similar plugin named the ingest attachment plugin, which also uses the Apache Tika libraries and is used in a similar way. For more information, you can refer to the documentation here.

Conclusion

In this tutorial, we covered how to index commonly used file types (PDF in this case) in Elasticsearch using the mapper attachment plugin. We also executed a full-text search on the indexed PDF document and highlighted the results.
