Elasticsearch is generally used to index data of types like string, number, date, etc. However, what if you wanted to index a file like a .pdf or a .doc directly and make it searchable? This is a real-world use case in applications like HCM, ERP, and ecommerce.

Ingest Nodes are a new type of Elasticsearch node you can use to perform common data transformation and enrichments. Each task is represented by a processor. Processors are configured to form pipelines.

At the time of writing the Ingest Node had 20 built-in processors, for example grok, date, gsub, lowercase/uppercase, remove and rename.

Besides those, there are currently also three Ingest plugins:

  • Ingest Attachment converts binary documents like Powerpoints, Excel Spreadsheets, and PDF documents to text and metadata

  • Ingest GeoIP looks up the geographic locations of IP addresses in an internal database

  • Ingest user agent parses and extracts information from the user agent strings used by browsers and other applications when using HTTP

So, is there a mechanism to index the geographical location of IP addresses as easily as a string or number?

Yes! As already stated, Elasticsearch caters to this need via a specialized ingest plugin called “GeoIP Processor Plugin”. In this post, we’ll see how to index the geographical location of IP addresses in ES by making use of the “Ingest GeoIP Processor Plugin”.

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

Ingest GeoIP Processor Plugin

The GeoIP processor adds information about the geographical location of IP addresses, based on data from the MaxMind databases. By default, the processor adds this information under the geoip field. The geoip processor can resolve both IPv4 and IPv6 addresses.

The ingest-geoip plugin ships by default with the GeoLite2 City and GeoLite2 Country geoip2 databases from MaxMind, made available under the CCA-ShareAlike 3.0 license.

GeoLite2 databases are free IP geolocation databases comparable to, but less accurate than, MaxMind’s GeoIP2 databases. The GeoLite2 Country and City databases are updated on the first Tuesday of each month. The GeoLite2 ASN database is updated every Tuesday.

IP geolocation is inherently imprecise. Locations are often near the centre of the population. Any location provided by a GeoIP database should not be used to identify a particular address or household. Use the Accuracy Radius as an indication of geolocation accuracy for the latitude and longitude coordinates returned for an IP address. The actual location of the IP address is likely within the area defined by this radius and the latitude and longitude coordinates.

The GeoIP processor can run with other geoip2 databases from MaxMind. The files must be copied into the geoip config directory, and the database_file option should be used to specify the filename of the custom database. Custom database files must be compressed with gzip.

The geoip config directory is located at $ES_HOME/config/ingest/geoip and also holds the shipped databases. The databases can be downloaded from the MaxMind website.
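As a rough sketch of the steps above, a custom database could be compressed and copied into place as follows. The helper name install_geoip_db and the example filename GeoIP2-City.mmdb are illustrative assumptions, not part of the plugin; only the target directory and the gzip requirement come from the description above.

```shell
# Hypothetical helper: gzip a MaxMind .mmdb file into the geoip config directory.
# Usage: install_geoip_db /path/to/GeoIP2-City.mmdb "$ES_HOME"
install_geoip_db() {
  db_file="$1"
  es_home="$2"
  target_dir="$es_home/config/ingest/geoip"

  # The directory normally exists once the plugin is installed; -p makes this a no-op then.
  mkdir -p "$target_dir" || return 1
  # Write a compressed copy; the original file is left untouched.
  gzip -c "$db_file" > "$target_dir/$(basename "$db_file").gz" || return 1
  # Print the filename to use as the processor's database_file option.
  echo "$(basename "$db_file").gz"
}
```

The printed name (for example GeoIP2-City.mmdb.gz) is what would go into database_file in the pipeline definition.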

Installation

This plugin can be installed using the plugin manager:

cd $ES_HOME
sudo bin/elasticsearch-plugin install ingest-geoip

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
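To confirm that the plugin is present on a node, the plugin manager's list command can be checked. This is a minimal sketch, assuming it is run from the Elasticsearch home directory; has_ingest_geoip is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if a plugin listing read from stdin
# contains a line starting with "ingest-geoip".
has_ingest_geoip() {
  grep -q '^ingest-geoip'
}

# Assumed usage on each node, from the Elasticsearch home directory:
#   bin/elasticsearch-plugin list | has_ingest_geoip && echo "ingest-geoip installed"
#
# The cat plugins API gives a cluster-wide view, one row per node and plugin:
#   curl -XGET 'ES_HOST:ES_PORT/_cat/plugins?v'
```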

Let’s send a request that uses the default city database and adds the geographical information to the geoip field based on the ip field:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/geoip?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Add geoip information to the given IP address",
 "processors" : [
   {
     "geoip" : {
       "field" : "ip"
     }
   }
 ]
}'

Let’s index a document to see the geoip pipeline in action:

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=geoip&pretty' -H 'Content-Type: application/json' -d '{
 "ip": "8.8.0.0"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 2,
 "found" : true,
 "_source" : {
   "geoip" : {
     "continent_name" : "North America",
     "country_iso_code" : "US",
     "location" : {
       "lon" : -97.822,
       "lat" : 37.751
     }
   },
   "ip" : "8.8.0.0"
 }
}
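
The pipeline can also be exercised without indexing anything, using the ingest simulate API. The sketch below assumes ES_HOST and ES_PORT are set as shell variables (elsewhere in this post they are literal placeholders); the function names simulate_body and simulate_geoip are illustrative only.

```shell
# Build a simulate request body containing one sample document.
# $1 is the IP address to test.
simulate_body() {
  printf '{ "docs": [ { "_source": { "ip": "%s" } } ] }' "$1"
}

# Dry-run the stored "geoip" pipeline against the sample document.
# Nothing is written to any index.
simulate_geoip() {
  curl -XPOST "$ES_HOST:$ES_PORT/_ingest/pipeline/geoip/_simulate?pretty" \
    -H 'Content-Type: application/json' -d "$(simulate_body "$1")"
}

# Example: simulate_geoip 8.8.0.0
```

The response should show the document as it would look after the geoip processor has run, which makes it a convenient way to try different IP addresses.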

Let’s see an example that uses the default country database and adds the geographical information to the geo field based on the ip field. Note that this database is included in the plugin download.

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/geoip?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Add geoip info using the GeoLite2 country database",
 "processors" : [
   {
     "geoip" : {
       "field" : "ip",
       "target_field" : "geo",
       "database_file" : "GeoLite2-Country.mmdb.gz",
       "properties": [ "country_iso_code", "country_name", "continent_name" ]
     }
   }
 ]
}'

The database_file field specifies the database filename in the geoip config directory. The ingest-geoip plugin ships with the GeoLite2-City.mmdb.gz and GeoLite2-Country.mmdb.gz files.

  • If the GeoLite2 City database is used, then the following fields may be added under the target_field: ip, country_iso_code, country_name, continent_name, region_name, city_name, timezone, latitude, longitude and location.

  • If the GeoLite2 Country database is used, then the following fields may be added under the target_field: ip, country_iso_code, country_name and continent_name.

In either case, the fields actually added depend on what has been found and which properties were configured in properties.
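
To illustrate how properties narrows the output, here is a sketch that builds a pipeline body restricted to a chosen set of fields from the default city database. The helper names geoip_pipeline_body and put_geoip_pipeline are hypothetical, and ES_HOST and ES_PORT are assumed to be set as shell variables.

```shell
# Build a geoip pipeline body limited to the given properties.
# $1 is a comma-separated list of quoted property names.
geoip_pipeline_body() {
  printf '%s\n' \
    '{' \
    '  "description" : "Add selected geoip properties",' \
    '  "processors" : [' \
    "    { \"geoip\" : { \"field\" : \"ip\", \"properties\" : [ $1 ] } }" \
    '  ]' \
    '}'
}

# Store the pipeline under the id "geoip", replacing any existing definition.
put_geoip_pipeline() {
  curl -XPUT "$ES_HOST:$ES_PORT/_ingest/pipeline/geoip?pretty" \
    -H 'Content-Type: application/json' -d "$(geoip_pipeline_body "$1")"
}

# Example: put_geoip_pipeline '"city_name", "location"'
```

With this definition, only city_name and location would be candidates for the geoip field, and each appears only if the database has a value for the IP address.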

Now, let’s index a document to see the geoip pipeline in action using the GeoLite2 Country database:

curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=geoip&pretty' -H 'Content-Type: application/json' -d '{
 "ip": "8.8.0.0"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The response would be:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "geo" : {
     "continent_name" : "North America",
     "country_iso_code" : "US",
     "country_name" : "United States"
   },
   "ip" : "8.8.0.0"
 }
}

Note that not all IP addresses have geo information in the database. When no match is found, no target_field is inserted into the document.

Here is an example of what documents will be indexed as when information for "93.114.46.14" cannot be found:

curl -XPUT 'ES_HOST:ES_PORT/_ingest/pipeline/geoip?pretty' -H 'Content-Type: application/json' -d '{
 "description" : "Add geoip information to the given IP address",
 "processors" : [
   {
     "geoip" : {
       "field" : "ip"
     }
   }
 ]
}'
curl -XPUT 'ES_HOST:ES_PORT/test_index/test_type/test_id?pipeline=geoip&pretty' -H 'Content-Type: application/json' -d '{
 "ip": "93.114.46.14"
}'
curl -XGET 'ES_HOST:ES_PORT/test_index/test_type/test_id?pretty'

The following response is returned:

{
 "_index" : "test_index",
 "_type" : "test_type",
 "_id" : "test_id",
 "_version" : 1,
 "found" : true,
 "_source" : {
   "ip" : "93.114.46.14"
 }
}
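
Since unmatched documents simply lack the target field, one way to find them afterwards is a search that excludes documents where the field exists. This is a sketch only: it assumes the default geoip target field and the test_index used above, that ES_HOST and ES_PORT are set as shell variables, and that an exists query on the object field behaves as expected on your Elasticsearch version. The function names are hypothetical.

```shell
# Search body matching documents that have no value under the "geoip" field.
missing_geoip_query() {
  printf '%s' '{ "query": { "bool": { "must_not": { "exists": { "field": "geoip" } } } } }'
}

# Run the search against test_index.
find_docs_without_geoip() {
  curl -XGET "$ES_HOST:$ES_PORT/test_index/_search?pretty" \
    -H 'Content-Type: application/json' -d "$(missing_geoip_query)"
}
```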

Node Settings

The geoip processor supports the following setting:

ingest.geoip.cache_size – The maximum number of results that should be cached. Defaults to 1000.

This is a node setting and applies to all geoip processors, i.e. there is one cache for all defined geoip processors.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove ingest-geoip

The node must be stopped before removing the plugin.

Give it a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.