We recently announced that Qbox hosted ElastAlert -- the superb open-source alerting tool built by the team at Yelp Engineering -- is now available on all new Elasticsearch clusters on AWS.

Many organizations use the ELK Stack to manage their ever-increasing volume of data and logs. Kibana is great for visualizing and querying data, but it needs a companion tool like ElastAlert to alert on inconsistencies, anomalies, spikes, or other patterns of interest in Elasticsearch data.

Overview

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster." 

ElastAlert was designed with the following principles in mind:

  • It should be easy to understand and human readable. Rules are written in a YAML format with intuitive option names.

  • It should be resilient to outages. ElastAlert records every query it makes and can pick up exactly where it left off when it restarts.

  • It should be modular. The major components (rule types, enhancements, and alerts) can all be imported or customized by implementing a base class.

ElastAlert is designed to be reliable, highly modular, and easy to set up and configure.

It works by combining Elasticsearch with two types of components: rule types and alerts. Elasticsearch is periodically queried, and the data is passed to the rule type, which determines when a match is found. When a match occurs, it is passed to one or more alerts, which take action based on the match.

This is configured by a set of rules, each of which defines a query, a rule type, and a set of alerts.

Several rule types with common monitoring paradigms are included with ElastAlert:

  • “Match where there are X events in Y time” (frequency type)

  • “Match when the rate of events increases or decreases” (spike type)

  • “Match when there are less than X events in Y time” (flatline type)

  • “Match when a certain field matches a blacklist/whitelist” (blacklist and whitelist type)

  • “Match on any event matching a given filter” (any type)

  • “Match when a field has two different values within some time” (change type)
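
Each rule lives in its own YAML file inside a rules folder that ElastAlert scans on every run, while ElastAlert itself is driven by a separate global config.yaml. Below is a minimal sketch of that global config; the hostname and folder name are placeholders to adapt to your own Qbox cluster:

# Minimal global config.yaml (hostname and folder name are placeholders)
rules_folder: rules                 # directory containing the per-rule YAML files
run_every:                          # how often ElastAlert queries Elasticsearch
  minutes: 1
buffer_time:                        # window of recent data re-queried to catch late events
  minutes: 15
es_host: your-cluster.qbox.io       # placeholder Elasticsearch endpoint
es_port: 9200
writeback_index: elastalert_status  # index where ElastAlert stores its own state
alert_time_limit:                   # how long to keep retrying a failed alert
  days: 2

With a configuration like this in place, ElastAlert is started with python -m elastalert.elastalert --verbose, and a single rule file can be checked beforehand with the elastalert-test-rule tool.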

Creating Rules

Each rule defines a query to perform, parameters on what triggers a match, and a list of alerts to fire for each match. We will walk through the frequency, change, and spike rule types as examples.

Type: frequency - This rule matches when there are at least a certain number of events in a given time frame. This may be counted on a per-query_key basis:

# Send an email to elastalert@example.com when there are at least 50 documents
# with some_field == some_value within a 1 hour period
# Type FREQUENCY
name: Example rule
type: frequency
index: logstash-*
num_events: 50
timeframe:
    hours: 1
filter:
- term:
    some_field: "some_value"
alert:
- "email"
email:
- "elastalert@example.com"

The fields that have been configured are as follows:

  • es_host and es_port point to the Elasticsearch cluster we want to query. They can be set per rule or, as in this example, inherited from the global config.yaml.

  • name is the unique name for this rule. ElastAlert will not start if two rules share the same name.

  • type: Each rule has a different type which may take different parameters. The frequency type means “Alert when more than num_events occur within timeframe.”

  • index: The name of the index(es) to query. If you are using Logstash, by default the indexes will match "logstash-*".

  • num_events: This parameter is specific to frequency type and is the threshold for when an alert is triggered.

  • timeframe is the time period in which num_events must occur.

  • filter is a list of Elasticsearch filters that are used to filter results. Here we have a single term filter for documents with some_field matching some_value. If no filters are desired, it should be specified as an empty list: filter: []

  • alert is a list of alerts to run on each match. The email alert requires an SMTP server for sending mail. By default, it will attempt to use localhost. This can be changed with the smtp_host option.

  • email is a list of addresses to which alerts will be sent.

All documents must have a timestamp field. ElastAlert will try to use @timestamp by default, but this can be changed with the timestamp_field option. By default, ElastAlert uses ISO8601 timestamps, though unix timestamps are supported by setting timestamp_type.
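
Beyond the required fields, a handful of optional settings are commonly added to a rule like the one above. The fragment below is a sketch; the values (username, smtp.example.com) are placeholders:

# Optional additions to a rule (values are placeholders)
query_key: username            # count events separately for each username
timestamp_field: "@timestamp"  # field holding the event time (this is the default)
timestamp_type: iso            # or "unix" for epoch-based timestamps
smtp_host: smtp.example.com    # SMTP server used by the email alert instead of localhost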

Type: change - This rule will monitor a certain field and match if that field changes. The field must change with respect to the last event with the same query_key.

# Alert when some field changes between documents
# This rule would alert on documents similar to the following:
# {'username': 'john', 'country_code': 'JPN', '@timestamp': '2016-11-17T09:00:00'}
# {'username': 'john', 'country_code': 'UK', '@timestamp': '2016-11-17T07:00:00'}
# Because the user (query_key) john logged in from different countries (compare_key) in the same day (timeframe)
# Type CHANGE
name: New country login
type: change
index: logstash-*
compare_key: country_code
ignore_null: true
query_key: username
timeframe:
  days: 1
filter:
- query:
    query_string:
      query: "document_type: login"
alert:
- "email"
email:
- "elastalert@example.com"

In addition to the fields used in the frequency example, the change type requires three additional options:

  • compare_key: The name of the field to monitor for changes.

  • ignore_null: If true, events without a compare_key field will not count as changed.

  • query_key: This rule is applied on a per-query_key basis. This field must be present in all of the events that are checked.

There is also an optional field:

  • timeframe: The maximum time between changes. After this time period, ElastAlert will forget the old value of the compare_key field.

Type: spike - This rule matches when the volume of events during a given time period is spike_height times larger or smaller than during the previous time period. It uses two sliding windows to compare the current and reference frequency of events. We will call these two windows “reference” and “current”.

# Alert when there is a sudden spike in the volume of events,
# i.e. the number of matching docs at least triples within an hour.
# The minimum number of events in the current window that will trigger an alert is 5.
# Type SPIKE
name: Event spike
type: spike
index: logstash-*
threshold_cur: 5
timeframe:
  hours: 1
spike_height: 3
spike_type: "up"
filter:
- query:
    query_string:
      query: "field: value"
- type:
    value: "some_doc_type"
alert:
- "email"
email:
- "elastalert@example.com"

In addition to the fields used in the frequency example, the following fields have been configured:

  • spike_height: The ratio of the number of events in the current window to the number in the reference window that will trigger an alert when reached.

  • spike_type: Either ‘up’, ‘down’ or ‘both’. ‘Up’ means the rule will only match when the number of events is spike_height times higher than the reference; ‘down’ means the reference count is spike_height times higher than the current count; ‘both’ will match either.

  • timeframe: The rule will average out the rate of events over this time period. For example, hours: 1 means that the ‘current’ window will span from present to one hour ago, and the ‘reference’ window will span from one hour ago to two hours ago.

There is also an optional field:

  • threshold_cur: The minimum number of events that must exist in the current window for an alert to trigger. For example, if spike_height: 3 and threshold_cur: 60, then an alert will occur if the current window has more than 60 events and the reference window has less than a third as many.

Configuring Filters for Rules

Filters are a list of Elasticsearch query DSL filters that are used to query Elasticsearch. ElastAlert will query Elasticsearch using the format {'filter': {'bool': {'must': [config.filter]}}} with an additional timestamp range filter. All of the results of querying with these filters are passed to the RuleType for analysis. This section describes how to create a filter section for your rule config file.
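
As a concrete illustration of that format, the single term filter from the frequency example above would be wrapped roughly as follows, shown here in the same YAML notation used for filters (the timestamp bounds are placeholders for the current query window):

# Sketch of the effective query body for the earlier term filter
filter:
  bool:
    must:
    - term:
        some_field: "some_value"
    - range:
        "@timestamp":
          gt: "2016-11-17T08:00:00"   # placeholder start of the query window
          lte: "2016-11-17T09:00:00"  # placeholder end of the query window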

The filters used in rules are part of the Elasticsearch query DSL. Here is a small subset of particularly useful filters.

The filter section is passed to Elasticsearch exactly as follows:

filter:
  and:
    filters:
      - [filters from rule.yaml]

Every result that matches these filters will be passed to the rule for processing.

Common Filter Types:

query_string - The query_string type follows the Lucene query format and can be used for partial or full matches to multiple fields.

filter:
- query:
    query_string:
      query: "username: john"
- query:
    query_string:
      query: "_type: auth_logs"
- query:
    query_string:
      query: "field: value OR otherfield: othervalue"
- query:
    query_string:
      query: "this: that AND these: those"

term - The term type allows for exact field matches:

filter:
- term:
    name_field: "john"
- term:
    _type: "auth_logs"

terms - Terms allows for easy combination of multiple term filters:

filter:
- terms:
    field: ["value1", "value2", "value3"]

Using the minimum_should_match option, you can define a set of term filters of which a certain number must match:

- terms:
    fieldX: ["value1", "value2"]
    fieldY: ["something", "something_else"]
    fieldZ: ["foo", "bar", "baz"]
    minimum_should_match: 2


range - For ranges on fields:

filter:
- range:
    status_code:
      from: 100
      to: 199

negation, and, or - Any of the filters can be embedded in not, and, and or:

filter:
- or:
    - term:
        field: "value"
    - wildcard:
        field: "foo*bar"
    - and:
        - not:
            term:
              field: "value"
        - not:
            term:
              _type: "some_type"

Alerting and Monitoring Examples

In the following alerting and monitoring examples, we will omit the alerting configuration and focus on the pattern and rule type configuration. The alert itself can be customized in a number of ways to suit our needs.

Authentication: SSH and Other Logins

Let’s start off with a simple rule to alert us whenever a failed SSH authentication occurs. We will assume that the log _type is ssh_logs and that each document contains a response field (success or failure) and a username.

# Alert when any SSH failures occur
filter:
  - term:
      _type: ssh_logs
  - term:
      response: failure
type: any

The any type will alert on any document which matches the filter. We can change the type to frequency to only alert if multiple SSH failures occur. We can refine it further by using query_key to group the events by username. This will allow us to alert only if a certain number of failures have occurred for a given user.

# Group by username and alert when 5 SSH failures occur for a single user in 1 hour
filter:
  - term:
      _type: ssh_logs
  - term:
      response: failure
type: frequency
num_events: 5
timeframe:
  hours: 1
query_key: username

This will alert you if someone is trying to brute-force an SSH login, but what if an attacker has already obtained valid credentials? There are a few other things we could look at, such as whether the same user has connected from multiple IP addresses.

# Alert when a single user has successful connections from 2 IPs within a day timeframe
filter:
  - term:
      _type: ssh_logs
  - term:
      response: success
type: cardinality
max_cardinality: 1
cardinality_field: ip_address
timeframe:
  days: 1
query_key: username

In this example, we are alerting if the cardinality of the ip_address field, grouped by username, is greater than one within a day. This specific alert could also be accomplished using the change rule type, but cardinality gives you more flexibility and could be used for a variety of other purposes.
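
Cardinality can also catch the opposite problem: diversity dropping rather than rising. As a sketch (the app_logs type, hostname field, and threshold are illustrative), min_cardinality could flag when fewer than 10 distinct hosts have shipped logs within an hour:

# Alert when fewer than 10 distinct hosts have sent logs in the last hour (illustrative values)
filter:
  - term:
      _type: app_logs
type: cardinality
cardinality_field: hostname
min_cardinality: 10
timeframe:
  hours: 1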

Logging: Success and Error Logs

We’ll assume that error messages are parsed in such a way that they contain an error_code. You could use type: any and alert on every single error message. However, this might not scale well. Setting an explicit threshold is often hard, as the normal baseline may vary over time. We can use the spike type to handle this:

# Alert if number of errors doubles in an hour
filter:
  - term:
      _type: error_logs
type: spike
spike_height: 2
spike_type: up
threshold_ref: 50
timeframe:
  hours: 1
top_count_keys:
  - error_code

With this rule, we are comparing the number of errors in the last hour with the hour before that. If the current hour contains more than 2x the previous hour, and the previous hour contains at least 50 events (threshold_ref sets this minimum for the reference window), it will alert. If this is too sensitive, increasing the timeframe will smooth out the average error rate. By setting top_count_keys, the alert will contain a breakdown of the most common error_code values that occurred within that spike.

This is all well and good if we don’t mind common errors generating alerts, but critical error messages could still slip by. Another approach is to send an alert only when a new, never-before-seen error code occurs.

# A new error type occurred
filter:
  - term:
      _type: error_logs
query_key: error_code
type: new_term
include:
  - traceback

These are just a few possible use cases. With custom rule types or alerts, anything is possible: if you can get it into Elasticsearch, you can monitor and alert on it. For a full list of features, as well as a tutorial for getting started, check out the ElastAlert documentation and source code on GitHub.
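
For instance, a custom alerter packaged as a Python class can be referenced from a rule file by its module path; the module and class names below are hypothetical:

# Reference a hypothetical custom alerter by module path
alert: "elastalert_modules.my_alerts.PagerNotifier"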

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
