We discussed the “Langdetect Ingest Plugin” in a previous post. This post focuses on the “elasticsearch-langdetect” plugin, which is built on Nakatani Shuyo's language detector.

The “elasticsearch-langdetect” plugin offers a mapping type to specify fields where we want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang'. The field can be queried for language codes.

We can use the multi_field mapping type to combine this plugin with the attachment mapper plugin to enable language detection in base64-encoded binary data. Currently, only UTF-8 text is supported.

The plugin also offers a REST endpoint: we can post a short UTF-8 text to it, and the plugin responds with a list of recognized languages.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."

Elasticsearch-Langdetect Plugin

Installation

This plugin can be installed using the plugin manager:

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/5.4.0.2/elasticsearch-langdetect-5.4.0.2-plugin.zip

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.4.4.1/elasticsearch-langdetect-2.4.4.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install langdetect -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/1.6.0.0/elasticsearch-langdetect-1.6.0.0-plugin.zip

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
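
One way to confirm that no node was skipped is to compare the node list with the output of the _cat/plugins API. The Python sketch below works on already-parsed JSON rows (as returned with ?format=json). The helper name nodes_missing_plugin and the component name "langdetect" are assumptions for illustration, so check the name your own cluster reports.

```python
def nodes_missing_plugin(node_names, cat_plugins_rows, plugin_name="langdetect"):
    """Return the sorted names of nodes that do not list the plugin.

    node_names: node names, e.g. from GET _cat/nodes?h=name&format=json
    cat_plugins_rows: parsed rows from GET _cat/plugins?format=json, each a
        dict with at least "name" (node name) and "component" (plugin name).
    plugin_name: assumed component name; verify it against your own output.
    """
    nodes_with_plugin = {
        row["name"]
        for row in cat_plugins_rows
        if row.get("component") == plugin_name
    }
    return sorted(set(node_names) - nodes_with_plugin)


# Hypothetical two-node cluster where only node-1 has the plugin installed.
rows = [{"name": "node-1", "component": "langdetect", "version": "5.4.0.2"}]
missing = nodes_missing_plugin(["node-1", "node-2"], rows)
print(missing)
```

Any node name that is printed still needs the plugin installed and a restart.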

Let's consider a simple language detection example.

We will use Elasticsearch 5.4.0 for this demonstration. We first create an index “test_idx” with a simple language detection field:

curl -XPUT 'ES_HOST:ES_PORT/test_idx' -d '{
   "mappings": {
      "documents": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}'
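
The same request can be prepared from a script. The following Python sketch only builds the PUT request with the standard library and does not send it. The function name build_create_index_request and the base URL are placeholders chosen for illustration, not part of the plugin.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, languages):
    """Build (but do not send) the PUT request that creates the index."""
    body = {
        "mappings": {
            "documents": {
                "properties": {
                    "text": {"type": "langdetect", "languages": list(languages)}
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "http://localhost:9200" stands in for ES_HOST:ES_PORT.
request = build_create_index_request(
    "http://localhost:9200", "test_idx", ["en", "de", "fr"]
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the mapping easy to inspect before it touches the cluster.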

Let’s now put some documents (one in each language) in our test_idx:

curl -XPUT 'ES_HOST:ES_PORT/test_idx/documents/1' -d '{
      "text" : "Qbox is the only hosted Elasticsearch provider that allows you to choose both the location and the cloud platform of your cluster, which lowers response times significantly."
}'
curl -XPUT 'ES_HOST:ES_PORT/test_idx/documents/2' -d '{
      "text" : "Qbox ist der einzige gehostete Elasticsearch-Anbieter, mit dem Sie sowohl den Standort als auch die Cloud-Plattform Ihres Clusters auswählen können, was die Reaktionszeiten erheblich senkt."
}'
curl -XPUT 'ES_HOST:ES_PORT/test_idx/documents/3' -d '{
      "text" : "Qbox est le seul fournisseur Elasticsearch hébergé qui vous permet de choisir à la fois le lieu et la plate-forme cloud de votre cluster, ce qui réduit considérablement les temps de réponse."
}'

Let’s now run a term query for each of our languages:

curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
       "query" : {
           "term" : {
                "text" : "en"
           }
       }
}'

The trimmed response is as follows:

"_source":{"text" : "Qbox is the only hosted Elasticsearch provider that allows you to choose both the location and the cloud platform of your cluster, which lowers response times significantly."}

We can similarly run term queries for the other languages:

curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
       "query" : {
           "term" : {
                "text" : "de"
           }
       }
}'
curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
       "query" : {
           "term" : {
                "text" : "fr"
           }
       }
}'
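
Since the three queries differ only in the language code, a small helper can generate them, and another can pull the matching texts out of a search response. This is a minimal Python sketch. Both function names are hypothetical, and the sample response is a hand-written, shortened stand-in for what the cluster returns.

```python
def language_term_query(language_code, field="text"):
    """Build the body of a term query for one detected language code."""
    return {"query": {"term": {field: language_code}}}


def texts_from_response(search_response, field="text"):
    """Collect the source text of every hit in a parsed search response."""
    return [hit["_source"][field] for hit in search_response["hits"]["hits"]]


# One query body per language used in the mapping above.
queries = {code: language_term_query(code) for code in ("en", "de", "fr")}

# Shortened, hand-written stand-in for a response to the "en" query.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [{"_id": "1", "_source": {"text": "Qbox is the only hosted ..."}}],
    }
}
print(texts_from_response(sample_response))
```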

Indexing Language-Detected Text

Indexing the language code alone is not enough in most cases. The language-detected text should also be passed to an analyzer that applies language-specific analysis. The plugin supports this through the language_to parameter, which maps a detected language code to the field that receives the text.
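
The effect of such a language-to-field mapping can be pictured with a few lines of Python. This is a conceptual illustration only, not the plugin's code. The function name route_detected_text is hypothetical, and the handling of languages without an entry (the text is simply not copied) is an assumption.

```python
def route_detected_text(text, detected_languages, language_to):
    """Return {target_field: text} for each detected language with a mapping.

    detected_languages: language codes reported for the text, e.g. ["en"].
    language_to: dict from language code to target field name.
    Languages without an entry are skipped (an assumption of this sketch).
    """
    routed = {}
    for code in detected_languages:
        target = language_to.get(code)
        if target is not None:
            routed[target] = text
    return routed


language_to = {"de": "german_field", "en": "english_field"}
print(route_detected_text("cloud platform", ["en"], language_to))
print(route_detected_text("cloudplatform", ["nl"], language_to))
```

The complete mapping and a test query follow.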

curl -XDELETE 'ES_HOST:ES_PORT/test_idx'
curl -XPUT 'ES_HOST:ES_PORT/test_idx' -d '{
   "mappings": {
      "documents": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages": [
                  "de",
                  "en",
                  "fr",
                  "nl",
                  "it"
               ],
               "language_to": {
                  "de": "german_field",
                  "en": "english_field"
               }
            },
            "german_field": {
               "analyzer": "german",
               "type": "text"
            },
            "english_field": {
               "analyzer": "english",
               "type": "text"
            }
         }
      }
   }
}'
curl -XPUT 'ES_HOST:ES_PORT/test_idx/documents/1' -d '{
  "text" : "Qbox is the only hosted Elasticsearch provider that allows you to choose both the location and the cloud platform of your cluster, which lowers response times significantly."
}'
curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
   "query" : {
       "match" : {
            "english_field" : "cloud platform"
       }
   }
}'

The trimmed response is as follows:

"_source":{"text" : "Qbox is the only hosted Elasticsearch provider that allows you to choose both the location and the cloud platform of your cluster, which lowers response times significantly."}

Language Code and multi_field

Using multi-fields, it is possible to store the text alongside the detected language(s). Here, we use another short example text that has more than one detected language code.

curl -XDELETE 'ES_HOST:ES_PORT/test_idx'
curl -XPUT 'ES_HOST:ES_PORT/test_idx' -d '{
   "mappings": {
      "documents": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "de",
                        "en",
                        "fr",
                        "nl",
                        "it"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}'
curl -XPUT 'ES_HOST:ES_PORT/test_idx/documents/1' -d '{
    "text" : "Qbox is de enige gastheer van Elasticsearch provider, waarmee u zowel de locatie als het cloudplatform van uw cluster kunt kiezen, waardoor de reactietijden aanzienlijk worden verlaagd."
}'
curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
   "query" : {
       "match" : {
            "text" : "cloudplatform"
       }
   }
}'

The trimmed response is as follows:

"_source":{"text" : "Qbox is de enige gastheer van Elasticsearch provider, waarmee u zowel de locatie als het cloudplatform van uw cluster kunt kiezen, waardoor de reactietijden aanzienlijk worden verlaagd."}

Let's send a search request for "nl" on text.language in a match query:

curl -XPOST 'ES_HOST:ES_PORT/test_idx/_search' -d '{
   "query" : {
       "match" : {
            "text.language" : "nl"
       }
   }
}'

The trimmed response is again as follows:

"_source":{"text" : "Qbox is de enige gastheer van Elasticsearch provider, waarmee u zowel de locatie als het cloudplatform van uw cluster kunt kiezen, waardoor de reactietijden aanzienlijk worden verlaagd."}

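The two searches above can also be combined, so that a full-text match only returns documents detected as a given language. The Python sketch below builds such a request body with a standard bool query. The helper name match_in_language is hypothetical, and the subfield name "language" is taken from the mapping in this section.

```python
import json


def match_in_language(text_field, query_text, language_code, subfield="language"):
    """Build a search body: full-text match restricted to one detected language."""
    return {
        "query": {
            "bool": {
                # Scored full-text match on the analyzed text.
                "must": {"match": {text_field: query_text}},
                # Non-scoring restriction on the langdetect subfield.
                "filter": {"term": {f"{text_field}.{subfield}": language_code}},
            }
        }
    }


body = match_in_language("text", "cloudplatform", "nl")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the _search endpoint with -d, in the same way as the queries above.
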
Language Detection REST API

The plugin also comes with a language detection REST API, which is useful for testing and debugging:

curl -XPOST 'ES_HOST:ES_PORT/_langdetect?pretty' -d '{"text" : "This is a sentence in English"}'

The response is:

{
  "languages" : [
    {
      "language" : "en",
      "probability" : 0.9999985946848577
    }
  ]
}

Let's send another request, this time with German text:

curl -XPOST 'ES_HOST:ES_PORT/_langdetect?pretty' -d '{"text" : "Das ist ein Satz auf Deutsch"}'

The response is:

{
  "languages" : [
    {
      "language" : "de",
      "probability" : 0.9999992779931194
    }
  ]
}

Let's send another request with a text that is detected as both "no" and "de":

curl -XPOST 'ES_HOST:ES_PORT/_langdetect?pretty' -d '{"text" : "Datt isse ne test"}'

The response is:

{
  "languages" : [
    {
      "language" : "no",
      "probability" : 0.7142817164480593
    },
    {
      "language" : "de",
      "probability" : 0.2857139947334582
    }
  ]
}
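
When a response lists several candidates, as in the last example, client code usually has to decide whether the top candidate is trustworthy. The following Python sketch picks the most probable language from a parsed response and rejects it below a threshold. The function name best_language and the default threshold of 0.5 are assumptions chosen for illustration.

```python
def best_language(langdetect_response, min_probability=0.5):
    """Return the most probable language code, or None if it is too uncertain."""
    candidates = langdetect_response.get("languages", [])
    if not candidates:
        return None
    top = max(candidates, key=lambda candidate: candidate["probability"])
    return top["language"] if top["probability"] >= min_probability else None


# The last response shown above, as a parsed Python dict.
mixed = {
    "languages": [
        {"language": "no", "probability": 0.7142817164480593},
        {"language": "de", "probability": 0.2857139947334582},
    ]
}
print(best_language(mixed))       # no
print(best_language(mixed, 0.9))  # None
```

A stricter threshold turns ambiguous inputs such as this one into an explicit "unknown" instead of a guess.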

Give it a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
