Elasticsearch is generally used to index data of types like string, number, date, etc. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a real-world use case in applications like HCM, ERP, and ecommerce.

Ingest Nodes are a new type of Elasticsearch node that you can use to perform common data transformations and enrichments. Each task is represented by a processor, and processors are chained together to form pipelines.

At the time of writing the Ingest Node had 20 built-in processors, for example grok, date, gsub, lowercase/uppercase, remove and rename.

Besides those, there are currently also three Ingest plugins:

  • Ingest Attachment converts binary documents like PowerPoint presentations, Excel spreadsheets, and PDF documents to text and metadata

  • Ingest GeoIP looks up the geographic locations of IP addresses in an internal database

  • Ingest User Agent parses and extracts information from the user agent strings used by browsers and other applications when using HTTP
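As a rough illustration of how processors are chained into a pipeline, the sketch below defines a pipeline body that uses two of the built-in processors mentioned above, lowercase and rename. The pipeline id normalize-pipeline and the field names title and headline are made up for this example, and ES_HOST and ES_PORT are assumed to default to a local node.

```shell
# Placeholders for the cluster endpoint; the defaults assume a local node.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# Hypothetical pipeline body: processors run in the order they are listed.
# "title" and "headline" are illustrative field names only.
PIPELINE_BODY='{
  "description": "Lowercase the title, then rename it",
  "processors": [
    { "lowercase": { "field": "title" } },
    { "rename": { "field": "title", "target_field": "headline" } }
  ]
}'

# Stores the pipeline under the hypothetical id "normalize-pipeline".
put_normalize_pipeline() {
  curl -s -XPUT "$ES_HOST:$ES_PORT/_ingest/pipeline/normalize-pipeline?pretty" \
    -H 'Content-Type: application/json' -d "$PIPELINE_BODY"
}
```

Calling put_normalize_pipeline against a running cluster would register the pipeline; nothing is sent until the function is invoked.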

So, is there a mechanism to detect the language of a field’s value and index it as easily as a string or number?

Yes! The Elasticsearch ecosystem caters to this need via a specialized ingest plugin called the “Langdetect Ingest Plugin”. In this post, we’ll see how to detect the language of a field and index the result into Elasticsearch by making use of this plugin.

In addition to the “Langdetect Ingest Plugin”, there is also an “elasticsearch-langdetect” plugin, which performs language detection in Elasticsearch using Nakatani Shuyo's language detector. It uses character 3-grams and a Bayesian filter with various normalizations and feature sampling. The reported precision is over 99% for 53 languages.

The elasticsearch-langdetect plugin offers a mapping type to specify fields where we want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang'. The field can be queried for language codes.

In this post we will focus on the “Langdetect Ingest Plugin” (also called the “Langdetect Ingest Processor”). We will discuss the elasticsearch-langdetect plugin, which is based on Nakatani Shuyo's language detector, in an upcoming tutorial.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Langdetect Ingest Processor

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install https://oss.sonatype.org/content/repositories/releases/de/spinscale/elasticsearch/plugin/ingest-langdetect/5.2.0.1/ingest-langdetect-5.2.0.1.zip

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
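One way to confirm that every node picked up the plugin after the restart is the cat plugins API. The sketch below is a minimal example under a few assumptions: the plugin registers under the name ingest-langdetect (inferred from the zip file name), node names contain no spaces, and ES_HOST and ES_PORT default to a local node. The helper function names are made up for illustration.

```shell
# Placeholders for the cluster endpoint; the defaults assume a local node.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# Fetches one "node-name plugin-name" line per installed plugin.
list_plugins() {
  curl -s "$ES_HOST:$ES_PORT/_cat/plugins?h=name,component"
}

# Reads that listing on stdin and prints the nodes that report the plugin.
# Assumption: the plugin name is "ingest-langdetect".
nodes_with_langdetect() {
  awk '$2 == "ingest-langdetect" { print $1 }'
}
```

Running list_plugins | nodes_with_langdetect should then print the name of each node that has the plugin; a node missing from the output still needs the installation step.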

Let’s send a request that creates a pipeline named langdetect-pipeline containing a langdetect processor:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/langdetect-pipeline?pretty' -H 'Content-Type: application/json' -d '
{
 "description" : "Detect and add language information",
 "processors" : [
   {
     "langdetect" : {
       "field" : "sentence",
       "target_field" : "language"
     }
   }
 ]
}'
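Before indexing real documents, the pipeline can be tried out with the ingest simulate API, which runs sample documents through a stored pipeline without indexing them. The following is a sketch only: the sample sentence and the function name are made up, and ES_HOST and ES_PORT are assumed to default to a local node.

```shell
# Placeholders for the cluster endpoint; the defaults assume a local node.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# Request body for the simulate API. The documents are passed inline and
# are not indexed; the sentence below is illustrative sample input.
SIMULATE_BODY='{
  "docs": [
    {
      "_index": "test_index",
      "_type": "test_index",
      "_id": "simulated_doc",
      "_source": { "sentence": "This is an English sentence used only as sample input." }
    }
  ]
}'

# Runs the sample documents through langdetect-pipeline.
simulate_langdetect() {
  curl -s -XPOST "$ES_HOST:$ES_PORT/_ingest/pipeline/langdetect-pipeline/_simulate?pretty" \
    -H 'Content-Type: application/json' -d "$SIMULATE_BODY"
}
```

Calling simulate_langdetect returns each document as it would look after the pipeline has run, which makes it easy to check the target field before committing to an index.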

Let’s now index a test document to see the langdetect pipeline in action, and then fetch it back:

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_index/test_id_en?pipeline=langdetect-pipeline&pretty' -H 'Content-Type: application/json' -d '{
 "sentence": "Qbox makes it easy for us to provision an Elasticsearch cluster without wasting time on all the details of cluster configuration."
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_index/test_id_en?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_index",
 "_id" : "test_id_en",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "sentence" : "Qbox makes it easy for us to provision an Elasticsearch cluster without wasting time on all the details of cluster configuration.",
   "language" : "en"
 }
}

Let’s check for a sentence in Chinese:

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_index/test_id_cn?pipeline=langdetect-pipeline&pretty' -H 'Content-Type: application/json' -d '{
 "sentence": "Qbox可以轻松地配置弹性搜索集群,而不浪费时间对集群配置的所有细节。"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_index/test_id_cn?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_index",
 "_id" : "test_id_cn",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "sentence" : "Qbox可以轻松地配置弹性搜索集群,而不浪费时间对集群配置的所有细节。",
   "language" : "zh-cn"
 }
}

Let’s check for a sentence in German:

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_index/test_id_de?pipeline=langdetect-pipeline&pretty' -H 'Content-Type: application/json' -d '{
 "sentence": "Qbox macht es uns leicht, einen Elasticsearch-Cluster bereitzustellen, ohne Zeit auf alle Details der Cluster-Konfiguration zu verschwenden."
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_index/test_id_de?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_index",
 "_id" : "test_id_de",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "sentence" : "Qbox macht es uns leicht, einen Elasticsearch-Cluster bereitzustellen, ohne Zeit auf alle Details der Cluster-Konfiguration zu verschwenden.",
   "language" : "de"
 }
}
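Once the language code is stored, it can be used to filter searches. The following is a sketch rather than a feature of the plugin: it assumes test_index uses the default dynamic mapping of Elasticsearch 5.x, where a string field such as language also gets an exact-match language.keyword subfield. The function names are hypothetical, and ES_HOST and ES_PORT are assumed to default to a local node.

```shell
# Placeholders for the cluster endpoint; the defaults assume a local node.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# Builds a term query on the exact (keyword) form of the detected code.
# Assumption: $1 is a plain language code such as "de" or "zh-cn",
# with no characters that would need JSON escaping.
language_query() {
  printf '{ "query": { "term": { "language.keyword": "%s" } } }' "$1"
}

# Searches test_index for documents whose detected language matches $1.
search_by_language() {
  curl -s -XGET "$ES_HOST:$ES_PORT/test_index/_search?pretty" \
    -H 'Content-Type: application/json' -d "$(language_query "$1")"
}
```

For example, search_by_language zh-cn should match the Chinese test document indexed above. The keyword subfield is used because an analyzed text field would split a code like zh-cn into separate tokens.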

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
