Before setting up Elasticsearch to perform entity extraction, it is worth looking at how this became such an easy task. There is a lot of buzz around the new Ingest API shipped with Elasticsearch 5.x.

The Ingest API allows data manipulation and enrichment by defining a pipeline through which every document passes. A pipeline is built from a set of processors, each of which performs a specific task that enriches our data. A typical example is the grok processor, which lets you structure unstructured log lines using pattern matching. Elasticsearch 5 ships with many built-in processors, which you can read about here.
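
For instance, a minimal pipeline that applies the grok processor to a "message" field might look like this (the field name and pattern are illustrative, not part of this tutorial's setup):

POST _ingest/pipeline/log-pipeline
{
  "description": "parse unstructured log lines",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request}"]
      }
    }
  ]
}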

Set Up

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Let us perform some basic entity extraction, and as we go along we will explain how to use the new ingest module. Before creating the pipeline, we need to set up and configure the OpenNLP plugin.

In order to build the plugin, we need Gradle version 2.13 or above. Once Gradle is installed, proceed here and clone the repository to your system. Then run a Gradle clean check, which produces a build in ZIP format, ready to be installed, under build/distributions.
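
Assuming the repository is the OpenNLP ingest plugin on GitHub (the URL below is shown for illustration; use the one linked above), the build steps look roughly like this:

git clone https://github.com/spinscale/elasticsearch-ingest-opennlp.git
cd elasticsearch-ingest-opennlp
gradle clean check
ls build/distributions   # the plugin ZIP is produced here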

From here, you can install the plugin as shown below:

bin/elasticsearch-plugin install file:///home/neil/opennlp/elasticsearch-ingest-opennlp/build/distributions/ingest-opennlp-0.0.1-SNAPSHOT.zip

Make sure to change the file path to match the location on your system.

Next, specify the model files that the plugin uses to find entities. Add the following lines to your elasticsearch.yml:

ingest.opennlp.model.file.names: en-ner-persons.bin
ingest.opennlp.model.file.locations: en-ner-locations.bin
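
The plugin expects these model files in its configuration directory. Pretrained English NER models are available from the OpenNLP SourceForge models page; note that the files there use singular names (en-ner-person.bin, en-ner-location.bin), so the sketch below renames them to match the settings above. The paths are assumptions; adjust them to your install:

cd /path/to/elasticsearch
mkdir -p config/ingest-opennlp
wget http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin -O config/ingest-opennlp/en-ner-persons.bin
wget http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin -O config/ingest-opennlp/en-ner-locations.bin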

Once this is done, start Elasticsearch. You are ready to create the OpenNLP ingest pipeline.

OpenNLP Pipeline

Defining a pipeline is as simple as this:

PUT _ingest/pipeline/YOUR_PIPELINE_NAME
{
  "description": "YOUR PIPELINE DESCRIPTION",
  "processors": [
    {
      "PROCESSOR_NAME" : {
        "field" : "FIELD_NAME"
      }
    }
  ]
}

The pipeline has two major fields associated with it:

  1. A description, a short text that describes what the pipeline does.
  2. An array of processors that make up the pipeline.

Note: the order in which you define the processors is important and is always followed at execution time. In our case, the create-pipeline statement looks like this:

PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "NER pipeline",
  "processors": [
    {
      "opennlp" : {
        "field" : "description_text"
      }
    }
  ]
}
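
Before indexing real data, you can dry-run the pipeline with the simulate API; the sample text below is illustrative:

POST _ingest/pipeline/opennlp-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "description_text": "Theo Walcott was born in London."
      }
    }
  ]
}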

Once the pipeline is created successfully, index some documents. When indexing, make sure to specify the pipeline you created, like this:

PUT /players/epl/1?pipeline=opennlp-pipeline
{
 "description_text" : "Arsenal looks to have the most number of English players in their line up. Jack Wilshere, Alex Oxlade Chamberlain, Theo Walcott and Danny Welbeck played for England already along with the loanee Jack Wilshere. Apart from this, they have Carl Jenkinson, Kieran Gibbs, Calum Chambers, Rob Holding in their team. The London club has a bright english core in their team"
}

This document is passed through our OpenNLP ingest pipeline before getting indexed.

Issue a GET request to retrieve the document:

GET /players/epl/1

You should get the response:

{
 "_index": "players",
 "_type": "epl",
 "_id": "1",
 "_version": 1,
 "found": true,
 "_source": {
   "description_text": "Arsenal looks to have the most number of English players in their line up. Jack Wilshere, Alex Oxlade Chamberlain, Theo Walcott and Danny Welbeck played for England already along with the loanee Jack Wilshere. Apart from this , they have Carl Jenkinson,Kieran Gibbs,Calum Chambers, Rob Holding in their team. The London club has a bright english core in their team",
   "entities": {
     "names": [
       "Danny Welbeck",
       "Rob Holding",
       "Carl Jenkinson",
       "Kieran Gibbs",
       "Calum Chambers",
       "Alex Oxlade Chamberlain",
       "Jack Wilshere"
     ],
     "locations": [
       "London"
     ]
   }
 }
}
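
The extracted entities are stored as regular fields, so you can search on them like any other field. For example, to find players whose descriptions mention London (assuming the default dynamic mapping for the entities object):

GET /players/epl/_search
{
  "query": {
    "match": {
      "entities.locations": "London"
    }
  }
}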

Conclusion

With little effort, we were able to identify the names and locations in this text. You can extract other entity types by supplying a corresponding model file and following the same procedure.


Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
