Elasticsearch is generally used to index data of types like string, number, and date. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a common real-world requirement in applications like HCM, ERP, and ecommerce.

Ingest nodes are a new type of Elasticsearch node you can use to perform common data transformations and enrichments. Each transformation task is represented by a processor, and processors are configured to form pipelines.

At the time of writing, the ingest node had 20 built-in processors, for example grok, date, gsub, lowercase/uppercase, remove, and rename.

Besides those, there are currently also three Ingest plugins:

  • Ingest Attachment converts binary documents such as PowerPoint presentations, Excel spreadsheets, and PDF documents to text and metadata
  • Ingest Geoip looks up the geographic locations of IP addresses in an internal database
  • Ingest User Agent parses and extracts information from the user agent strings that browsers and other applications send over HTTP

So, is there a mechanism to index files as easily as a string or a number?

Yes! As stated above, Elasticsearch caters to this need via a specialized ingest plugin: the Ingest Attachment Processor plugin.

In this post, we’ll see how to index files (attachments) to ES by making use of the “Ingest Attachment Processor Plugin”. ES 5.x is being used for the following example. For Elasticsearch 2.x, you’ll need to use the data type “attachment”, which is covered in this previous tutorial.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to Provisioning a Qbox Elasticsearch Cluster.

Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika. The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

The ingest attachment plugin can be used as a replacement for the mapper attachment plugin.

The source field must be a base64-encoded binary. If you do not want to incur the overhead of converting back and forth between base64 and binary, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will then skip the base64 decoding.
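To sketch how a file would be prepared for the pipeline, the snippet below base64-encodes a file's bytes using only the Python standard library. The file name sample.txt and its contents are placeholders created just for the demonstration; a real .pdf or .doc is handled the same way.

```python
import base64

# Write a tiny sample file to stand in for a real .pdf or .doc
# (any binary content is encoded the same way).
with open("sample.txt", "wb") as f:
    f.write(b"Qbox enables launching Elasticsearch instantly.")

# Base64-encode the file's bytes; the resulting string is what goes
# into the "data" field of the document sent to the attachment pipeline.
with open("sample.txt", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

print(encoded)
```

The resulting string can then be pasted into (or templated into) the JSON body of the indexing request shown below.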

Installation:

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install ingest-attachment

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Attachment Processor in a Pipeline

Let’s create an attachment pipeline and try to extract encoded information:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Extract attachment information encoded in Base64 with UTF-8 charset",
 "processors" : [
   {
     "attachment" : {
       "field" : "data"
     }
   }
 ]
}'

Let’s now index a test document to see the attachment pipeline in action.

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=attachment&pretty' -H 'Content-Type: application/json' -d '{
 "data": "UWJveCBlbmFibGVzIGxhdW5jaGluZyBzdXBwb3J0ZWQsIGZ1bGx5LW1hbmFnZWQsIFJFU1RmdWwgRWxhc3RpY3NlYXJjaCBTZXJ2aWNlIGluc3RhbnRseS4g"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "data" : "UWJveCBlbmFibGVzIGxhdW5jaGluZyBzdXBwb3J0ZWQsIGZ1bGx5LW1hbmFnZWQsIFJFU1RmdWwgRWxhc3RpY3NlYXJjaCBTZXJ2aWNlIGluc3RhbnRseS4g",
   "attachment" : {
     "content_type" : "text/plain; charset=ISO-8859-1",
     "language" : "et",
     "content" : "Qbox enables launching supported, fully-managed, RESTful Elasticsearch Service instantly.",
     "content_length" : 91
   }
 }
}
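The extracted content field is simply the decoded text of the source. A quick local check with the Python standard library confirms the round trip:

```python
import base64

# The base64 payload indexed above.
data = "UWJveCBlbmFibGVzIGxhdW5jaGluZyBzdXBwb3J0ZWQsIGZ1bGx5LW1hbmFnZWQsIFJFU1RmdWwgRWxhc3RpY3NlYXJjaCBTZXJ2aWNlIGluc3RhbnRseS4g"

# Decoding locally reproduces the text the attachment processor extracted.
decoded = base64.b64decode(data).decode("utf-8")
print(decoded)
```

Note that the reported content_length also counts the trailing whitespace and the newline Tika appends, which is why it is slightly larger than the visible sentence.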

The array of properties to select can include content, title, author, keywords, date, content_type, content_length, and language. Let's specify only some of the fields to be extracted:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Extract attachment information encoded in Base64 with UTF-8 charset",
 "processors" : [
   {
     "attachment" : {
       "field" : "data",
       "properties": [ "content", "content_length", "content_type" ]
     }
   }
 ]
}'

Let’s index the test document again to see the updated pipeline in action.

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=attachment&pretty' -H 'Content-Type: application/json' -d '{
 "data": "VGhlIHNvdXJjZSBmaWVsZCBtdXN0IGJlIGEgYmFzZTY0IGVuY29kZWQgYmluYXJ5Lg=="
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "data" : "VGhlIHNvdXJjZSBmaWVsZCBtdXN0IGJlIGEgYmFzZTY0IGVuY29kZWQgYmluYXJ5Lg==",
   "attachment" : {
     "content_type" : "text/plain; charset=ISO-8859-1",
     "content" : "The source field must be a base64 encoded binary.",
     "content_length" : 50
   }
 }
}

NOTE: Extracting content from binary data is a resource-intensive operation that consumes significant CPU and memory. It is highly recommended to run pipelines that use this processor on dedicated ingest nodes.
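A dedicated ingest node is simply one with the master and data roles disabled. A minimal elasticsearch.yml sketch for ES 5.x might look like the following (adjust the role flags to your own topology):

```yaml
# elasticsearch.yml — dedicated ingest node (ES 5.x role settings)
node.master: false
node.data: false
node.ingest: true
```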

Attachment Processor with Arrays

To use the attachment processor within an array of attachments, the foreach processor is required. It runs the attachment processor on each individual element of the array.

For example, given the following source:

{
  "attachments" : [
    {
      "filename" : "test_1.txt",
      "data" : "VGhlIGluZ2VzdCBwbHVnaW5zIGV4dGVuZCBFbGFzdGljc2VhcmNoIGJ5IHByb3ZpZGluZyBhZGRpdGlvbmFsIGluZ2VzdCBub2RlIGNhcGFiaWxpdGllcy4="
    },
    {
      "filename" : "test_2.txt",
      "data" : "VGhlIGluZ2VzdCBhdHRhY2htZW50IHBsdWdpbiBsZXRzIEVsYXN0aWNzZWFyY2ggZXh0cmFjdCBmaWxlIGF0dGFjaG1lbnRzLg=="
    }
  ]
}
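The data values above are just base64-encoded sentences. As a sketch, the source document can be assembled programmatically from (filename, bytes) pairs with the Python standard library; in practice the bytes would come from real files rather than inline strings.

```python
import base64
import json

# (filename, contents) pairs standing in for real files.
files = [
    ("test_1.txt", b"The ingest plugins extend Elasticsearch by providing additional ingest node capabilities."),
    ("test_2.txt", b"The ingest attachment plugin lets Elasticsearch extract file attachments."),
]

# Build the attachments array expected by the foreach/attachment pipeline:
# each element carries a filename and a base64-encoded "data" field.
doc = {
    "attachments": [
        {"filename": name, "data": base64.b64encode(content).decode("ascii")}
        for name, content in files
    ]
}

print(json.dumps(doc, indent=2))
```

The printed JSON matches the source document shown above and can be sent as the body of the indexing request.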

In this case, we want to process the data field in each element of the attachments array and insert the extracted properties into that element, so the following foreach processor is used:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Extract attachment information from arrays",
 "processors" : [
   {
     "foreach": {
       "field": "attachments",
       "processor": {
         "attachment": {
           "target_field": "_ingest._value.attachment",
           "field": "_ingest._value.data"
         }
       }
     }
   }
 ]
}'

Let’s now index a test document to see the attachment pipeline in action for arrays.

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=attachment&pretty' -H 'Content-Type: application/json' -d '{
 "attachments" : [
   {
      "filename" : "test_1.txt",
      "data" : "VGhlIGluZ2VzdCBwbHVnaW5zIGV4dGVuZCBFbGFzdGljc2VhcmNoIGJ5IHByb3ZpZGluZyBhZGRpdGlvbmFsIGluZ2VzdCBub2RlIGNhcGFiaWxpdGllcy4="
    },
    {
      "filename" : "test_2.txt",
      "data" : "VGhlIGluZ2VzdCBhdHRhY2htZW50IHBsdWdpbiBsZXRzIEVsYXN0aWNzZWFyY2ggZXh0cmFjdCBmaWxlIGF0dGFjaG1lbnRzLg=="
    }
 ]
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "attachments" : [
     {
       "filename" : "test_1.txt",
       "data" : "VGhlIGluZ2VzdCBwbHVnaW5zIGV4dGVuZCBFbGFzdGljc2VhcmNoIGJ5IHByb3ZpZGluZyBhZGRpdGlvbmFsIGluZ2VzdCBub2RlIGNhcGFiaWxpdGllcy4=",
       "attachment" : {
         "content_type" : "text/plain; charset=ISO-8859-1",
         "language" : "en",
         "content" : "The ingest plugins extend Elasticsearch by providing additional ingest node capabilities.",
         "content_length" : 90
       }
     },
     {
       "filename" : "test_2.txt",
       "data" : "VGhlIGluZ2VzdCBhdHRhY2htZW50IHBsdWdpbiBsZXRzIEVsYXN0aWNzZWFyY2ggZXh0cmFjdCBmaWxlIGF0dGFjaG1lbnRzLg==",
       "attachment" : {
         "content_type" : "text/plain; charset=ISO-8859-1",
         "language" : "en",
         "content" : "The ingest attachment plugin lets Elasticsearch extract file attachments.",
         "content_length" : 74
       }
     }
   ]
 }
}

Note that target_field needs to be set; otherwise, the default top-level attachment field is used, and its properties would hold the values of only one attachment. Setting target_field to a path under _ingest._value correctly associates the extracted properties with the corresponding element of the array.

Removal:

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove ingest-attachment

The node must be stopped before removing the plugin.

Give it a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.