Elasticsearch ElastAlert: Alerting at Scale
Posted by Adam Vanderbush, May 15, 2017
We recently announced Qbox hosted ElastAlert -- the superb open-source alerting tool built by the team at Yelp Engineering -- now available on all new Elasticsearch clusters on AWS.
Most organizations use the ELK Stack for managing their ever increasing amount of data and logs. Kibana is great for visualizing and querying data, but it needs a companion tool like ElastAlert for alerting on inconsistencies, anomalies, spikes, or other patterns of interest from data in Elasticsearch.
Overview
For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click "Get Started" in the header navigation. If you need help setting up, refer to "Provisioning a Qbox Elasticsearch Cluster."
ElastAlert was designed with the following principles in mind:
- It should be easy to understand and human readable. For this, the Yelp team chose a YAML format with intuitive option names.
- It should be resilient to outages. It records every query it makes and can pick up exactly where it left off when it turns back on.
- It should be modular. The major components -- rule types, enhancements, and alerts -- can all be imported or customized by implementing a base class.
ElastAlert is designed to be reliable, highly modular, and easy to set up and configure.
It works by combining Elasticsearch with two types of components, rule types and alerts. Elasticsearch is periodically queried and the data is passed to the rule type, which determines when a match is found. When a match occurs, it is given to one or more alerts, which take action based on the match.
This is configured by a set of rules, each of which defines a query, a rule type, and a set of alerts.
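ElastAlert reads its global settings from a config.yaml file and loads each rule from its own YAML file in a rules folder. As a rough sketch (the hostname and values below are illustrative placeholders, not taken from this post), a minimal config.yaml might look something like this:

# Minimal global config.yaml (illustrative values only)
es_host: elasticsearch.example.com   # hypothetical cluster endpoint
es_port: 9200
rules_folder: rules                  # directory containing the per-rule YAML files
run_every:
  minutes: 1                         # how often ElastAlert queries Elasticsearch
buffer_time:
  minutes: 15                        # window of recent data queried on each run
writeback_index: elastalert_status   # index where ElastAlert keeps its own state
alert_time_limit:
  days: 2                            # how long to keep retrying failed alerts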
Several rule types with common monitoring paradigms are included with ElastAlert:
- “Match where there are X events in Y time” (frequency type)
- “Match when the rate of events increases or decreases” (spike type)
- “Match when there are less than X events in Y time” (flatline type)
- “Match when a certain field matches a blacklist/whitelist” (blacklist and whitelist type)
- “Match on any event matching a given filter” (any type)
- “Match when a field has two different values within some time” (change type)
Creating Rules
Each rule defines a query to perform, parameters on what triggers a match, and a list of alerts to fire for each match. We will walk through the frequency, change, and spike rule types as examples:
Type: frequency
- This rule matches when there are at least a certain number of events in a given time frame. This may be counted on a per-query_key basis:
# Send an email to elastalert@example.com when there are more than 50 documents
# with some_field == some_value within a 1 hour period
# Type FREQUENCY
name: Example rule
type: frequency
index: logstash-*
num_events: 50
timeframe:
  hours: 1
filter:
- term:
    some_field: "some_value"
alert:
- "email"
email:
- "elastalert@example.com"
The fields that have been configured are as follows:
- es_host and es_port should point to the Elasticsearch cluster we want to query (these are usually set in the global config.yaml rather than in each rule).
- name is the unique name for this rule. ElastAlert will not start if two rules share the same name.
- type: Each rule has a different type which may take different parameters. The frequency type means “Alert when more than num_events occur within timeframe.”
- index: The name of the index(es) to query. If you are using Logstash, by default the indexes will match "logstash-*".
- num_events: This parameter is specific to the frequency type and is the threshold for when an alert is triggered.
- timeframe is the time period in which num_events must occur.
- filter is a list of Elasticsearch filters that are used to filter results. Here we have a single term filter for documents with some_field matching some_value. If no filters are desired, it should be specified as an empty list: filter: []
- alert is a list of alerts to run on each match. The email alert requires an SMTP server for sending mail. By default, it will attempt to use localhost. This can be changed with the smtp_host option.
- email is a list of addresses to which alerts will be sent.
All documents must have a timestamp field. ElastAlert will try to use @timestamp by default, but this can be changed with the timestamp_field option. By default, ElastAlert uses ISO8601 timestamps, though unix timestamps are supported by setting timestamp_type.
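For example (a hedged sketch; the SMTP host and sender address are placeholders), those defaults could be overridden in a rule like so:

# Optional overrides (illustrative values only)
smtp_host: smtp.example.com        # SMTP server used by the email alert (default: localhost)
from_addr: elastalert@example.com  # sender address for alert emails
timestamp_field: "@timestamp"      # field ElastAlert reads event times from (this is the default)
timestamp_type: iso                # use "unix" instead for epoch timestamps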
Type: change
- This rule will monitor a certain field and match if that field changes. The field must change with respect to the last event with the same query_key.
# Alert when some field changes between documents
# This rule would alert on documents similar to the following:
# {'username': 'john', 'country_code': 'JPN', '@timestamp': '2016-11-17T09:00:00'}
# {'username': 'john', 'country_code': 'UK', '@timestamp': '2016-11-17T07:00:00'}
# Because the user (query_key) john logged in from different countries (compare_key)
# in the same day (timeframe)
# Type CHANGE
name: New country login
type: change
index: logstash-*
compare_key: country_code
ignore_null: true
query_key: username
timeframe:
  days: 1
filter:
- query:
    query_string:
      query: "document_type: login"
alert:
- "email"
email:
- "elastalert@example.com"
In addition to the fields used by the frequency type, this rule requires three additional options:
- compare_key: The name of the field to monitor for changes.
- ignore_null: If true, events without a compare_key field will not count as changed.
- query_key: This rule is applied on a per-query_key basis. This field must be present in all of the events that are checked.
There is also an optional field:
- timeframe: The maximum time between changes. After this time period, ElastAlert will forget the old value of the compare_key field.
Type: spike
- This rule matches when the volume of events during a given time period is spike_height
times larger or smaller than during the previous time period. It uses two sliding windows to compare the current and reference frequency of events. We will call these two windows “reference” and “current”.
# Alert when there is a sudden spike in the volume of events,
# or if the number of matching docs triples in an hour.
# The minimum number of events that will trigger an alert is set to 5.
# Type SPIKE
name: Event spike
type: spike
index: logstash-*
threshold_cur: 5
timeframe:
  hours: 1
spike_height: 3
spike_type: "up"
filter:
- query:
    query_string:
      query: "field: value"
- type:
    value: "some_doc_type"
alert:
- "email"
email:
- "elastalert@example.com"
In addition to the fields used by the frequency type, the following important fields have been configured:
- spike_height: The ratio of the number of events in the last timeframe to that in the previous timeframe which, when reached, will trigger an alert.
- spike_type: Either ‘up’, ‘down’, or ‘both’. ‘Up’ means the rule only matches when the number of events is spike_height times higher than the reference; ‘down’ means the reference number is spike_height times higher than the current number; ‘both’ matches either.
- timeframe: The rule will average out the rate of events over this time period. For example, hours: 1 means that the ‘current’ window will span from the present to one hour ago, and the ‘reference’ window will span from one hour ago to two hours ago.
There is also an optional field:
- threshold_cur: The minimum number of events that must exist in the current window for an alert to trigger. For example, if spike_height: 3 and threshold_cur: 60, then an alert will occur only if the current window has more than 60 events and the reference window has less than a third as many.
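To make that interplay concrete, here is a hedged fragment (not a complete rule) combining those two settings:

# Fires only when the current one-hour window holds more than 60 events
# and at least three times as many as the previous hour
type: spike
spike_height: 3
spike_type: "up"
threshold_cur: 60
timeframe:
  hours: 1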
Configuring Filters for Rules
Filters are a list of Elasticsearch query DSL filters that are used to query Elasticsearch. ElastAlert will query Elasticsearch using the format {'filter': {'bool': {'must': [config.filter]}}} with an additional timestamp range filter. All of the results of querying with these filters are passed to the RuleType for analysis. This section describes how to create a filter section for your rule config file.
The filters used in rules are part of the Elasticsearch query DSL. We shall list here a small subset of particularly useful filters.
The filter section is passed to Elasticsearch exactly as follows:
filter:
  and:
    filters:
      - [filters from rule.yaml]
Every result that matches these filters will be passed to the rule for processing.
Common Filter Types:
query_string
- The query_string type follows the Lucene query format and can be used for partial or full matches to multiple fields.
filter:
- query:
    query_string:
      query: "username: john"
- query:
    query_string:
      query: "_type: auth_logs"
- query:
    query_string:
      query: "field: value OR otherfield: othervalue"
- query:
    query_string:
      query: "this: that AND these: those"
term
- The term type allows for exact field matches:
filter:
- term:
    name_field: "john"
- term:
    _type: "auth_logs"
terms
- Terms allows for easy combination of multiple term filters:
filter:
- terms:
    field: ["value1", "value2", "value3"]
Using the minimum_should_match
option, you can define a set of term filters of which a certain number must match:
- terms:
    fieldX: ["value1", "value2"]
    fieldY: ["something", "something_else"]
    fieldZ: ["foo", "bar", "baz"]
  minimum_should_match: 2
range
- For ranges on fields:
filter:
- range:
    status_code:
      from: 100
      to: 199
negation, and, or
- Any of the filters can be embedded in not, and, and or:
filter:
- or:
    - term:
        field: "value"
    - wildcard:
        field: "foo*bar"
    - and:
        - not:
            term:
              field: "value"
        - not:
            term:
              _type: "some_type"
Alerting with Monitoring
In the following alerting and monitoring examples, we will omit the alerting configuration and focus on the pattern and rule type configuration. The alert itself can be customized in a number of ways to suit our needs.
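As a brief, hedged sketch of what that customization could look like (the webhook URL, address, and subject text below are placeholders), an alert section might be extended along these lines:

# Illustrative alert section -- omitted from the examples that follow
alert:
- "email"
- "slack"
email:
- "oncall@example.com"                 # placeholder address
slack_webhook_url: "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook
alert_subject: "ElastAlert: suspicious activity for {0}"
alert_subject_args:
- username                             # assumes matched documents carry a username field
realert:
  minutes: 30                          # throttle repeat alerts for the same rule and query_key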
Authentication: SSH and Other Logins
Let’s start off with a simple rule to alert us whenever a failed SSH authentication occurs. We will assume that the log _type is ssh_logs and that it also contains a response, success or failure, and a username.
# Alert when any SSH failures occur
filter:
- term:
    _type: ssh_logs
- term:
    response: failure
type: any
The any
type will alert on any document which matches the filter. We can change the type to frequency
to only alert if multiple SSH failures occur. We can refine it further by using query_key
to group the events by username. This will allow us to alert only if a certain number of failures have occurred for a given user.
# Group by username and alert when 5 SSH failures occur for a single user in 1 hour
filter:
- term:
    _type: ssh_logs
- term:
    response: failure
type: frequency
num_events: 5
timeframe:
  hours: 1
query_key: username
This will alert you if someone is trying to brute-force an SSH login, but what if an attacker has already taken your credentials? There are a few other things we could look at, such as whether the same user has connected from multiple IP addresses.
# Alert when a single user has successful connections from 2 IPs within a day timeframe
filter:
- term:
    _type: ssh_logs
- term:
    response: success
type: cardinality
max_cardinality: 1
cardinality_field: ip_address
timeframe:
  days: 1
query_key: username
In this example, we are alerting if the cardinality of the ip_address
field, grouped by username
, is greater than one within a day. This specific alert could also be accomplished using the change rule type, but cardinality gives you more flexibility and could be used for a variety of other purposes.
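For comparison, a hedged sketch of that change-type variant might look like the following (it reuses the field names assumed above):

# Sketch: alert when a user's successful login arrives from a different IP than the last one
name: Login from new IP (change variant)
type: change
index: logstash-*
compare_key: ip_address
ignore_null: true
query_key: username
timeframe:
  days: 1
filter:
- term:
    _type: ssh_logs
- term:
    response: success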
Logging: Success and Error Logs
We’ll assume that error messages are parsed in such a way that they contain an error_code
. You could use type: any
and alert on every single error message. However, this might not scale well. Setting an explicit threshold is often hard, as the normal baseline may vary over time. We can use the spike
type to handle this:
# Alert if number of errors doubles in an hour
filter:
- term:
    _type: error_logs
type: spike
spike_height: 2
spike_type: up
threshold_ref: 50
timeframe:
  hours: 1
top_count_keys:
- error_code
With this rule, we are comparing the number of errors in the last hour with the hour before that. If the current hour contains more than 2x the previous hour, and the previous hour contains at least 50 events, it will alert. If this is too sensitive, increasing the timeframe
would effectively smooth out the average error rate. By setting top_count_keys
, the alert will contain a breakdown of the most common types which occurred within that spike.
This is all well and good if we don’t care about common errors sending us alerts, but critical error messages could still sneak by. Another approach we could take is to send an alert only when a new, never seen before, error code occurs.
# A new error type occurred
filter:
- term:
    _type: error_logs
query_key: error_code
type: new_term
include:
- traceback
These are just a few possible use cases. With custom rule types or alerts, anything is possible. If you can get it into Elasticsearch, you can monitor and alert on it. For a full list of features, as well as a tutorial for getting started, check out the documentation and source on GitHub.
Other Helpful Tutorials
- Getting Started with Elasticsearch on Qbox
- How to Use Elasticsearch, Logstash, and Kibana to Manage Logs
- How to Use Elasticsearch, Logstash, and Kibana to Manage NGINX Logs
- The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)
- Using the ELK Stack and Python in Penetration Testing Workflow
Give It a Whirl!
It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, or Amazon data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.
Questions? Drop us a note, and we'll get you a prompt response.
Not yet enjoying the benefits of a hosted ELK-stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.