This guide shows how to use the Elasticsearch Python client to do useful things with Elasticsearch. The Python client talks to Elasticsearch through its REST interface.

Let's start by installing some dependencies:

# apt-get install python-setuptools
# easy_install pip
# pip install elasticsearch

I'm going to use the Python API to do something useful, from an operations perspective, with data in Elasticsearch. I'm using data from the official Elasticsearch examples repo on GitHub. You will need Logstash and Elasticsearch on the machine. With Elasticsearch started, the commands below download the example files and start Logstash with a configuration (nginx_json_logstash.conf) that indexes the example NGINX logs (nginx_json_logs) into Elasticsearch, using an index template (nginx_json_template.json) to set up the mapping for us. Once Logstash has started indexing, we will have data to search with our Python client.

$ mkdir nginx_json_ELK_Example
$ cd nginx_json_ELK_Example
$ wget https://raw.githubusercontent.com/elastic/examples/master/ELK_NGINX-json/nginx_json_logstash.conf
$ wget https://raw.githubusercontent.com/elastic/examples/master/ELK_NGINX-json/nginx_json_kibana.json
$ wget https://raw.githubusercontent.com/elastic/examples/master/ELK_NGINX-json/nginx_json_template.json
$ wget https://raw.githubusercontent.com/elastic/examples/master/ELK_NGINX-json/nginx_json_logs
$ cat nginx_json_logs | /opt/logstash/bin/logstash -f nginx_json_logstash.conf

You can now view the data in Kibana by adding the index there.

In the Kibana settings tab, add the new index with the name nginx_json_elk_example and make sure to use the correct timestamp settings. Changing the timestamp is done using the clock in the top right of Kibana. Set it to “Absolute” with dates “From: 2015-05-16 23:14:09.260” and “To: 2015-06-05 18:58:54.666”. This will show all logs we indexed earlier. You can now create a visualization with the data (logs) contained between our timestamps.

The log file that we downloaded is an Nginx log in JSON format. Nginx, which has quite a following these days, is a web server written as an Apache2 replacement by a bored Russian system administrator.

I began by creating a data visualization to view the top 10 IPs that made requests to the Nginx web server. I'm going to use this visualization to build the query that I will use with the Elasticsearch Python client.

Make sure you've started both Kibana and Elasticsearch. In Kibana, create a visualization that shows the top ten IPs by request count. One option is a “Bar Chart” with the following aggregations:

  • Y Axis: Aggregation: Count
  • X Axis: Aggregation: Terms
  • Field: remote_ip.raw
  • Order By: metric:Count
  • Order: Descending, Size: 10

Here is a screenshot of exactly what this will look like:

[Screenshot: Kibana Python Elasticsearch Visualization]

While viewing the visualization, you can view the request by clicking the arrow below the visualization. This is what the request looks like:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1431836049260,
                  "lte": 1433548734666,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}
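
The gte and lte values in this request are epoch milliseconds, which is why the range clause carries "format": "epoch_millis". As a quick sanity check, here is a minimal sketch that converts them back to readable UTC timestamps. The helper name epoch_millis_to_utc is my own and is not part of any Elasticsearch API.

```python
from datetime import datetime, timedelta, timezone

# Start of the Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# The two bounds from the request above.
print(epoch_millis_to_utc(1431836049260).isoformat())  # 2015-05-17T04:14:09.260000+00:00
print(epoch_millis_to_utc(1433548734666).isoformat())  # 2015-06-05T23:58:54.666000+00:00
```

These are UTC values. Kibana typically shows times in the browser's time zone, so they can differ from the “From” and “To” fields above by your UTC offset.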

Take note of the absolute time range (the gte and lte values) that restricts the search. We can replace it with a relative time range. For example, if you wanted to search within the last year, the query would look like this:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-12M",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

Our response for the query with the absolute time range would be:

{
  "took": 240,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 51462,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "doc_count_error_upper_bound": 125,
      "sum_other_doc_count": 38958,
      "buckets": [
        {
          "key": "216.46.173.126",
          "doc_count": 2350
        },
        {
          "key": "180.179.174.219",
          "doc_count": 1720
        },
        {
          "key": "204.77.168.241",
          "doc_count": 1439
        },
        {
          "key": "65.39.197.164",
          "doc_count": 1365
        },
        {
          "key": "80.91.33.133",
          "doc_count": 1202
        },
        {
          "key": "84.208.15.12",
          "doc_count": 1120
        },
        {
          "key": "74.125.60.158",
          "doc_count": 1084
        },
        {
          "key": "119.252.76.162",
          "doc_count": 1064
        },
        {
          "key": "79.136.114.202",
          "doc_count": 628
        },
        {
          "key": "54.207.57.55",
          "doc_count": 532
        }
      ]
    }
  }
}
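
One way to read this response: the ten doc_count values plus sum_other_doc_count add up to hits.total. The sketch below checks that arithmetic against the numbers above, using an abridged copy of the response as a plain Python dict. It assumes the response shape shown here, where hits.total is a plain integer, and top_bucket_share is a hypothetical helper name.

```python
# Abridged copy of the response above: only the fields used here.
response = {
    "hits": {"total": 51462},
    "aggregations": {
        "2": {
            "sum_other_doc_count": 38958,
            "buckets": [
                {"key": "216.46.173.126", "doc_count": 2350},
                {"key": "180.179.174.219", "doc_count": 1720},
                {"key": "204.77.168.241", "doc_count": 1439},
                {"key": "65.39.197.164", "doc_count": 1365},
                {"key": "80.91.33.133", "doc_count": 1202},
                {"key": "84.208.15.12", "doc_count": 1120},
                {"key": "74.125.60.158", "doc_count": 1084},
                {"key": "119.252.76.162", "doc_count": 1064},
                {"key": "79.136.114.202", "doc_count": 628},
                {"key": "54.207.57.55", "doc_count": 532},
            ],
        }
    },
}


def top_bucket_share(res, agg_name):
    """Return (docs in the returned buckets, their fraction of all hits)."""
    agg = res["aggregations"][agg_name]
    in_buckets = sum(b["doc_count"] for b in agg["buckets"])
    return in_buckets, in_buckets / res["hits"]["total"]


in_buckets, share = top_bucket_share(response, "2")
other = response["aggregations"]["2"]["sum_other_doc_count"]

print(in_buckets)                # 12504
print(in_buckets + other)        # 51462, the same as hits.total
print("%.1f%%" % (share * 100))  # 24.3%
```

In other words, the ten busiest IPs account for roughly a quarter of all requests in this data set.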

We need to check that the response for the relative time range is equivalent, and we can do this by running our new query with the Python client API. Because the Python client takes the query as a Python dict, the JSON has to be written so that it is also valid Python: check that all strings are enclosed in quotes, and note that a bare JSON literal such as true must be quoted (or written as Python's True).
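
One way to make that comparison concrete is to reduce each response to its list of (IP, count) pairs and compare those, ignoring fields such as took that vary between runs. This is a minimal sketch: bucket_pairs is a hypothetical helper, and the two small hand-written dicts stand in for real responses.

```python
def bucket_pairs(res, agg_name):
    """Reduce a search response to an ordered list of (key, doc_count) pairs."""
    buckets = res["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Stand-ins for the absolute and relative responses (abridged to two buckets).
absolute = {"took": 240, "aggregations": {"2": {"buckets": [
    {"key": "216.46.173.126", "doc_count": 2350},
    {"key": "180.179.174.219", "doc_count": 1720},
]}}}
relative = {"took": 13, "aggregations": {"2": {"buckets": [
    {"key": "216.46.173.126", "doc_count": 2350},
    {"key": "180.179.174.219", "doc_count": 1720},
]}}}

# The "took" values differ, but the buckets are the same.
print(bucket_pairs(absolute, "2") == bucket_pairs(relative, "2"))  # True
```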

To use the example below, you will need to install Sense, which you can do using the instructions here

This is what the query to view the top ten IPs by request count should look like in Sense:

POST /nginx_json_elk_example/_search?pretty=true
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": "true"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-12M",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

Now try doing this with the Elasticsearch Python client. First, we want to print out the results of the query -- nothing complicated yet.

#!/usr/bin/env python
from elasticsearch import Elasticsearch

# TODO: move this into functions later.

# Connect using the client's defaults (a local Elasticsearch node).
es = Elasticsearch()

# Run the same query we used in Sense, passed to the client as a Python dict.
res = es.search(index="nginx_json_elk_example", body={
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": "true"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-12M",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "blah": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 1000,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
})

# Print the raw response.
print(res)

The output might look a little bit confusing, but I just wanted to show with this example that we could fetch valid data from Elasticsearch using a neat client API in Python.
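
If the raw print(res) output is hard to read, the standard library can pretty-print it. This sketch assumes res is a plain dict, which is what older versions of the client return; newer versions may return a wrapper object instead. The res below is a small stand-in, not real output.

```python
import json

# Stand-in for the value returned by es.search(), abridged to one bucket.
res = {
    "took": 13,
    "aggregations": {
        "blah": {"buckets": [{"key": "216.46.173.126", "doc_count": 2350}]}
    },
}

# indent=2 puts every key on its own line; sort_keys keeps the order stable.
pretty = json.dumps(res, indent=2, sort_keys=True)
print(pretty)
```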

Now, for a more interesting use of the client API, we are going to get the top 100 IPs that made an HTTP request to our Nginx server, which was configured to log to a JSON log file. Start from the previous Sense query and rename the key under "aggs"; I changed it from "2" to "blah". The Sense version below keeps "size": 10 so that the response stays short; the final Python code raises it to 100.

POST /nginx_json_elk_example/_search?pretty=true
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": "true"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-12M",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "blah": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

The response would look something like this:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 51462,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "blah": {
      "doc_count_error_upper_bound": 125,
      "sum_other_doc_count": 38958,
      "buckets": [
        {
          "key": "216.46.173.126",
          "doc_count": 2350
        },
        {
          "key": "180.179.174.219",
          "doc_count": 1720
        },
        {
          "key": "204.77.168.241",
          "doc_count": 1439
        },
        {
          "key": "65.39.197.164",
          "doc_count": 1365
        },
        {
          "key": "80.91.33.133",
          "doc_count": 1202
        },
        {
          "key": "84.208.15.12",
          "doc_count": 1120
        },
        {
          "key": "74.125.60.158",
          "doc_count": 1084
        },
        {
          "key": "119.252.76.162",
          "doc_count": 1064
        },
        {
          "key": "79.136.114.202",
          "doc_count": 628
        },
        {
          "key": "54.207.57.55",
          "doc_count": 532
        }
      ]
    }
  }
}

With this structure we can loop over the values individually, for example to show the top 5 IPs and their hit counts in order. Looking at how the response JSON is structured, we need to loop over aggregations --> blah --> buckets.

This is what I mean: we want the values stored under "key" in each bucket.

"aggregations": {
  "blah": {
    "buckets": [
      {
        "key": "119.252.76.162",
        "doc_count": 1064
      },
      {
        "key": "54.207.57.55",
        "doc_count": 532
      }
    ]
  }
}

I used the keyword 'blah' to give the aggregation's return value a name of my own. Otherwise it would keep the numeric name from the Kibana request ("2"), which makes the code that loops over the JSON values harder to follow.
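
To make the "top 5 IPs and hits" idea concrete, here is a small sketch that walks aggregations --> blah --> buckets and formats a ranked list. It runs against an abridged copy of the response above, and top_n is a hypothetical helper name. The buckets already arrive sorted by count because the query orders by _count descending.

```python
def top_n(res, agg_name, n):
    """Return the first n buckets as 'rank. ip (hits)' strings."""
    buckets = res["aggregations"][agg_name]["buckets"]
    return [
        "%d. %s (%d hits)" % (rank, b["key"], b["doc_count"])
        for rank, b in enumerate(buckets[:n], start=1)
    ]


# Abridged copy of the response above (first six buckets only).
res = {"aggregations": {"blah": {"buckets": [
    {"key": "216.46.173.126", "doc_count": 2350},
    {"key": "180.179.174.219", "doc_count": 1720},
    {"key": "204.77.168.241", "doc_count": 1439},
    {"key": "65.39.197.164", "doc_count": 1365},
    {"key": "80.91.33.133", "doc_count": 1202},
    {"key": "84.208.15.12", "doc_count": 1120},
]}}}

for line in top_n(res, "blah", 5):
    print(line)
```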

Here is our final Python code, which loops over the top IPs by request count (up to 100, as set by "size") and prints each one.

#!/usr/bin/env python
from elasticsearch import Elasticsearch

# TODO: move this into functions later.

# Connect using the client's defaults (a local Elasticsearch node).
es = Elasticsearch()

# Ask for the top 100 IPs by request count over the last 12 months.
res = es.search(index="nginx_json_elk_example", body={
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": "true"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-12M",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "blah": {
      "terms": {
        "field": "remote_ip.raw",
        "size": 100,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
})

# Walk aggregations --> blah --> buckets and print each IP.
for bucket in res['aggregations']['blah']['buckets']:
    print('Request from IP %s' % bucket['key'])
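
The script above can also be wrapped in a function, so that the index, time window, and number of IPs become parameters. The sketch below is one possible shape, not the only one. The names build_body and top_ips are hypothetical, and top_ips accepts any object with a search(index=..., body=...) method, which is how the script above calls the client.

```python
def build_body(since, size, agg_name="top_ips"):
    """Build the same search body as the script above, with the time
    window, bucket count, and aggregation name as parameters."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {"query": "*", "analyze_wildcard": "true"}
                },
                "filter": {
                    "bool": {
                        "must": [
                            {"range": {"@timestamp": {
                                "gte": since,
                                "lte": "now",
                                "format": "epoch_millis",
                            }}}
                        ],
                        "must_not": [],
                    }
                },
            }
        },
        "size": 0,
        "aggs": {
            agg_name: {
                "terms": {
                    "field": "remote_ip.raw",
                    "size": size,
                    "order": {"_count": "desc"},
                }
            }
        },
    }


def top_ips(es, index, since="now-12M", size=100, agg_name="top_ips"):
    """Run the search and return a list of (ip, hits) tuples, busiest first."""
    res = es.search(index=index, body=build_body(since, size, agg_name))
    buckets = res["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Usage, assuming Elasticsearch is running locally with the index loaded:
#   from elasticsearch import Elasticsearch
#   for ip, hits in top_ips(Elasticsearch(), "nginx_json_elk_example"):
#       print("Request from IP %s (%d hits)" % (ip, hits))
```

Keeping the body in its own function makes it easy to inspect the query without a running cluster, and passing the client in as an argument means the function does not decide how to connect.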

Conclusion

In this article, you've learned how to use the Elasticsearch Python client API to do something useful from an operations perspective. This is only a simple query, so imagine the power that you have at your disposal with more complex queries using this Python API. 
