Elasticsearch is generally used to index simple data types such as strings, numbers, and dates. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a real-world use case in applications like HCM, ERP, and ecommerce.

Ingest Nodes are a new type of Elasticsearch node that you can use to perform common data transformations and enrichments. Each task is represented by a processor, and processors are chained together to form pipelines.

At the time of writing, the Ingest Node had 20 built-in processors, for example grok, date, gsub, lowercase/uppercase, remove, and rename.

Besides those, there are currently also three Ingest plugins:

  • Ingest Attachment converts binary documents such as PowerPoint presentations, Excel spreadsheets, and PDF documents to text and metadata

  • Ingest Geoip looks up the geographic locations of IP addresses in an internal database

  • Ingest User Agent parses and extracts information from the user agent strings that browsers and other applications send over HTTP

So, is there a mechanism to extract and index details from the User-Agent header value as easily as a string or number?

Yes! As already stated, Elasticsearch caters to this need via a specialized ingest plugin called ingest-user-agent. In this post, we’ll see how to parse user agent strings and index the extracted details in ES by making use of the plugin's user_agent processor.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Ingest User Agent Processor Plugin

The user_agent processor extracts details from the user agent string a browser sends with its web requests. This processor adds this information by default under the user_agent field.

The ingest-user-agent plugin ships by default with the regexes.yaml from uap-core, as made available by uap-java under the Apache 2.0 license.

The uap-core repository contains the core of BrowserScope's original user agent string parser: data collected over the years by Steve Souders and numerous other contributors, extracted into a separate YAML file so as to be reusable as is by implementations in any programming language.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install ingest-user-agent

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
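
One way to double-check that no node was missed is to compare the list of nodes with the output of the cat plugins API. The Python sketch below is a hypothetical helper, not something shipped with Elasticsearch. It assumes the rows were fetched from GET _cat/nodes?h=name&format=json and GET _cat/plugins?format=json and parsed into lists of dictionaries keyed by column name; the node names and version in the sample data are made up.

```python
def nodes_missing_plugin(node_rows, plugin_rows, plugin="ingest-user-agent"):
    """Return the sorted names of nodes that do not report the given plugin.

    node_rows:   parsed JSON from GET _cat/nodes?h=name&format=json (assumed shape)
    plugin_rows: parsed JSON from GET _cat/plugins?format=json (assumed shape)
    """
    all_nodes = {row["name"] for row in node_rows}
    with_plugin = {row["name"] for row in plugin_rows if row["component"] == plugin}
    return sorted(all_nodes - with_plugin)


# Sample data standing in for real API responses (node names are invented).
nodes = [{"name": "node-1"}, {"name": "node-2"}]
plugins = [{"name": "node-1", "component": "ingest-user-agent", "version": "5.3.0"}]
print(nodes_missing_plugin(nodes, plugins))
```

An empty result means every node reports the plugin. Any name that is returned still needs the install command and a restart.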

The user_agent Processor

Let’s create an ingest pipeline named user_agent that reads the agent field and adds the parsed user agent details to the user_agent field:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/user_agent?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Parse and add user agent information",
 "processors" : [
   {
     "user_agent" : {
       "field" : "agent"
     }
   }
 ]
}'
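
Before indexing anything, a pipeline can also be tried out with the simulate pipeline API (POST _ingest/pipeline/<id>/_simulate), which runs sample documents through it and returns the result without storing them. Here is a minimal Python sketch, using only the standard library, that builds such a request. The helper name and the http://localhost:9200 address are assumptions for illustration; nothing is sent until you call urlopen yourself.

```python
import json
import urllib.request


def build_simulate_request(base_url, pipeline_id, agent):
    """Build (but do not send) a simulate request for one sample document."""
    body = {"docs": [{"_source": {"agent": agent}}]}
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_id}/_simulate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_simulate_request(
    "http://localhost:9200",  # assumed address; replace with your ES_HOST:ES_PORT
    "user_agent",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
)
print(req.get_method(), req.full_url)

# To run it against a live cluster:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```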

Let’s now index a test document to see the user agent pipeline in action.

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_index/test_id?pipeline=user_agent&pretty' -H 'Content-Type: application/json' -d '{
 "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_index/test_id?pretty'

The response would be:

{
  "found": true,
  "_index": "test_index",
  "_type": "test_index",
  "_id": "test_id",
  "_version": 1,
  "_source": {
    "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "user_agent": {
      "name": "Chrome",
      "major": "51",
      "minor": "0",
      "patch": "2704",
      "os_name": "Mac OS X",
      "os": "Mac OS X 10.10.5",
      "os_major": "10",
      "os_minor": "10",
      "device": "Other"
    }
  }
}

The field regex_file can be used to specify the name of the file in the config/ingest-user-agent directory containing the regular expressions for parsing the user agent string. Both the directory and the file have to be created before starting Elasticsearch. If not specified, ingest-user-agent will use the regexes.yaml from uap-core.

The properties field controls which properties are added to target_field (user_agent by default). It defaults to the full list: [name, major, minor, patch, build, os, os_name, os_major, os_minor, device]
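
To make these options concrete, the following Python sketch assembles a pipeline body from field, target_field, regex_file, and properties, and rejects property names outside the ten listed here. The function and the "ua" target field name are hypothetical, chosen only for illustration. The resulting dictionary is what you would send as the JSON body of PUT _ingest/pipeline/<id>.

```python
import json

ALLOWED_PROPERTIES = {
    "name", "major", "minor", "patch", "build",
    "os", "os_name", "os_major", "os_minor", "device",
}


def user_agent_pipeline(field, target_field=None, regex_file=None, properties=None):
    """Return a pipeline body containing a single user_agent processor."""
    processor = {"field": field}
    if target_field is not None:
        processor["target_field"] = target_field
    if regex_file is not None:
        processor["regex_file"] = regex_file
    if properties is not None:
        unknown = sorted(set(properties) - ALLOWED_PROPERTIES)
        if unknown:
            raise ValueError(f"unknown user_agent properties: {unknown}")
        processor["properties"] = list(properties)
    return {
        "description": "Parse and add user agent information",
        "processors": [{"user_agent": processor}],
    }


# "ua" is an arbitrary example name for the target field.
body = user_agent_pipeline("agent", target_field="ua", properties=["name", "os"])
print(json.dumps(body, indent=2))
```

Options that are left out are simply omitted from the body, so Elasticsearch applies its own defaults for them.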

Let’s redefine the user_agent pipeline so that it adds only the name, major, minor, and os properties to the target field:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/user_agent?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Parse and add user agent information",
 "processors" : [
   {
     "user_agent" : {
       "field" : "agent",
       "properties": [ "name", "major", "minor", "os" ]
     }
   }
 ]
}'

Let’s now index a test document to see the updated pipeline in action.

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_index/test_id?pipeline=user_agent&pretty' -H 'Content-Type: application/json' -d '{
 "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_index/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_index",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
   "user_agent" : {
     "major" : "51",
     "minor" : "0",
     "os" : "Mac OS X 10.10.5",
     "name" : "Chrome"
   }
 }
}

Custom Regex File

We can use a custom regex file to extract and parse user_agent fields. The file has to be put into the config/ingest-user-agent directory and must have a .yaml filename extension. The file has to be present at node startup and any changes to it or any new files added while the node is running will not have any effect.
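
As a rough illustration of the file placement, the Python sketch below writes a regex file into the ingest-user-agent directory under a given config directory and enforces the .yaml extension. The helper, the file name custom-regexes.yaml, and the ExampleCrawler entry are all hypothetical. Whether a file that contains only a user_agent_parsers section is enough for the plugin is an assumption to verify; starting from a full copy of the default regexes.yaml is the safer route.

```python
import tempfile
from pathlib import Path

# Hypothetical parser entry, written in the same layout as uap-core's regexes.yaml.
CUSTOM_REGEXES = """\
user_agent_parsers:
  # Hypothetical in-house crawler
  - regex: '(ExampleCrawler)/(\\d+)\\.(\\d+)'
    family_replacement: 'ExampleCrawlerBot'
"""


def install_regex_file(es_config_dir, filename, content=CUSTOM_REGEXES):
    """Write a regex file into <config>/ingest-user-agent and return its path."""
    if not filename.endswith(".yaml"):
        raise ValueError("custom regex files must use the .yaml extension")
    target_dir = Path(es_config_dir) / "ingest-user-agent"
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / filename
    path.write_text(content, encoding="utf-8")
    return path


# A temporary directory stands in for the real Elasticsearch config directory.
with tempfile.TemporaryDirectory() as config_dir:
    written = install_regex_file(config_dir, "custom-regexes.yaml")
    print(written.name, written.exists())
```

Once the file is in place on every node and the nodes have been started, a pipeline would refer to it with "regex_file": "custom-regexes.yaml" in the user_agent processor.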

In practice, it will make most sense for any custom regex file to be a variant of the default file, either a more recent version or a customised version.

The default file included in ingest-user-agent is the regexes.yaml from uap-core.

A few of the popular parsers included in uap-core are:

user_agent_parsers:
 #StatusCake
 - regex: '(\(StatusCake\))'
   family_replacement: 'StatusCakeBot'
 # Facebook
 - regex: '(facebookexternalhit)/(\d+)\.(\d+)'
   family_replacement: 'FacebookBot'
 # Google Plus
 - regex: 'Google.*/\+/web/snippet'
   family_replacement: 'GooglePlusBot'
 # Gmail
 - regex: 'via ggpht.com GoogleImageProxy'
   family_replacement: 'GmailImageProxy'
 # Twitter
 - regex: '(Twitterbot)/(\d+)\.(\d+)'
   family_replacement: 'TwitterBot'
 # Bots General matcher 'name/0.0'
 - regex: '(?:\/[A-Za-z0-9\.]+)? *([A-Za-z0-9 \-_\!\[\]:]*(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]*))/(\d+)(?:\.(\d+)(?:\.(\d+))?)?'
 # Bots General matcher 'name 0.0'
 - regex: '(?:\/[A-Za-z0-9\.]+)? *([A-Za-z0-9 _\!\[\]:]*(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]*)) (\d+)(?:\.(\d+)(?:\.(\d+))?)?'
 # Bots containing spider|scrape|bot(but not CUBOT)|Crawl
 - regex: '((?:[A-z0-9]+|[A-z\-]+ ?)?(?: the )?(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[A-Za-z0-9-]*(?:[^C][^Uu])[Bb]ot|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]*)(?:(?:[ /]| v)(\d+)(?:\.(\d+)(?:\.(\d+))?)?)?'
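
To get a feel for how such entries behave, here is a small Python sketch that applies three of the parsers above to sample user agent strings. It is a simplified stand-in, not the plugin's actual implementation: it assumes that parsers are tried in order, that the first match wins, that family_replacement overrides the first capture group, and that the second and third groups are the major and minor versions. The function name is made up for this example.

```python
import re

# (regex, family_replacement) pairs copied from the excerpt above.
PARSERS = [
    (r"(\(StatusCake\))", "StatusCakeBot"),
    (r"(facebookexternalhit)/(\d+)\.(\d+)", "FacebookBot"),
    (r"(Twitterbot)/(\d+)\.(\d+)", "TwitterBot"),
]


def parse_user_agent(agent):
    """Return name/major/minor from the first matching parser, or None."""
    for pattern, family_replacement in PARSERS:
        match = re.search(pattern, agent)
        if match is None:
            continue
        groups = match.groups()
        return {
            "name": family_replacement or groups[0],
            "major": groups[1] if len(groups) > 1 else None,
            "minor": groups[2] if len(groups) > 2 else None,
        }
    return None


print(parse_user_agent("Twitterbot/1.0"))
print(parse_user_agent("facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"))
print(parse_user_agent("curl/7.50.0"))
```

The last call returns None because none of the three sample parsers match. In the real file, later and more general entries would get a chance to match instead.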

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove ingest-user-agent

The node must be stopped before removing the plugin.

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. You can also now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
