Logstash is a data pipeline that helps us process logs and other event data from a variety of sources. With 200 plugins and counting, it can connect to many different inputs and stream data at scale to a central analytics system. One of the best solutions for managing and analyzing logs and events is the ELK Stack (Elasticsearch, Logstash, and Kibana).

The ability to efficiently analyze and query the data shipped to the ELK Stack depends on the readability and quality of the data. This means that if unstructured data (e.g., plain-text logs) is being ingested into the system, it must be translated into a structured form enriched with valuable fields. Regardless of the data source, pulling the logs and performing some magic to format, transform, and enrich them is necessary to ensure that they are parsed correctly before being shipped to Elasticsearch.

Data transformation and normalization in Logstash are performed using filter plugins. This article focuses on one of the most popular and useful filter plugins: the Logstash Grok filter, which is used to parse unstructured data into structured data, making it ready for aggregation and analysis in the ELK Stack. This allows us to use advanced features like statistical analysis on value fields, faceted search, filters, and more. If we couldn't classify and break data down into separate fields, all searches would be full text, which would not allow us to take full advantage of Elasticsearch and Kibana search. The Grok tool is perfect for syslog logs, Apache and other web server logs, MySQL logs, and, in general, any log format that is written for humans and includes plain text.

The Grok filter ships with a variety of regular expressions and patterns for common data types and expressions found in logs (e.g., IP addresses, usernames, email addresses, hostnames, etc.). When Logstash reads through the logs, it can use these patterns to find the semantic elements of a log message that we want to turn into structured fields.

Thus, the Grok filter works by combining text patterns into something that matches your logs. You can tell Grok what data to search for by defining a Grok pattern: %{SYNTAX:SEMANTIC}

The SYNTAX is the name of the pattern that will match your text. For example, the NUMBER pattern can match 4.55, 4, 8, or any other number, and the IP pattern can match 55.3.244.1 or 174.49.99.1, etc.

The SEMANTIC is the identifier given to the matched text. You can think of this identifier as the key in the key-value pair created by the Grok filter, with the value being the text matched by the pattern. Using the example above, 4.55, 4, or 8 could be the duration of some event, and 55.3.244.1 could be the client making a request.

We can express this quite simply with the Grok patterns %{NUMBER:duration} and %{IP:client}, and then refer to them in the filter definition:

filter {
  grok {
    match => { "message" => "%{IP:client} %{NUMBER:duration}" }
  }
}

As we've mentioned, Logstash already ships with lots of predefined patterns. A pattern consists of a label and a regex, e.g.: USERNAME [a-zA-Z0-9._-]+

Let's take a look at some other available patterns (you can find the full list here).

# Basic Identifiers
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
# Networking
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
# paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (/([\w_%!$@:.,+~-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+
# Months: January, Feb, 3, 03, 12, December
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]) 
# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
# Log Levels
LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
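
To see how these building blocks are used, here is a minimal sketch of a filter that pulls the log level and the remaining text out of a line such as "ERROR Connection refused" (the sample line and the field names level and msg are hypothetical, not from the original example):

filter {
  grok {
    match => { "message" => "%{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}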

A great feature is that patterns can contain other patterns, e.g.: SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}

By default, all semantics (e.g., duration or client) are saved as strings. Optionally, we can add a data type conversion to our Grok pattern. For example, %{NUMBER:num:int} converts the num semantic from a string to an integer. Currently, the only supported conversions are int and float.
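
For instance, reusing the client/duration example from above, a minimal sketch of a filter that stores the duration as a floating-point number rather than a string could look like this:

filter {
  grok {
    match => { "message" => "%{IP:client} %{NUMBER:duration:float}" }
  }
}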

Let's take a look at a more realistic example to illustrate how the Grok filter works. Let's assume we have an HTTP log message like this:

55.3.244.1 GET /index.html 15824 0.043

Many such log messages are stored in /var/log/http.log, so we can use the Logstash file input, which tails the log file and emits an event whenever a new log message is added. In the filter part of the configuration, we define syntax-semantic pairs that sequentially match the available Grok patterns to the specific elements of the log message.

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}

In the example above, we represented the log message as:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

This will add a few extra fields (e.g., "client", "method") to the event alongside the original "message" field, and the enriched event will be sent to Elasticsearch.

Let's verify this by running Logstash with the above configuration. First, save the log message above in /var/log/http.log (or any other file you prefer, adjusting the path accordingly), then start Logstash. The Grok filter will extract the following fields from the message:

  • client: 55.3.244.1

  • method: GET

  • request: /index.html

  • bytes: 15824

  • duration: 0.043
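
While testing, it can be handy to print the parsed events to the console instead of (or in addition to) sending them to Elasticsearch. A minimal sketch of such an output section, using the standard stdout plugin with the rubydebug codec, would be:

output {
  stdout { codec => rubydebug }
}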

Target Variables

A pattern can store the matched value in a new field. Specify the field name in the grok filter: 

filter {
  grok {
    match => [ "message", "%{USERNAME:user}" ]
  }
}

This would match a username and store it in a field called 'user'.

Casting

Grok’ed fields are strings by default.  Numeric fields (int and float) can be declared in the pattern:

filter {
  grok {
    match => [ "message", "%{NUMBER:bytes:int}" ]
  }
}

Note that this is just a hint that Logstash will pass along to Elasticsearch when it tries to insert the event. If the field already exists in the index with a different type, this won't change the mapping in Elasticsearch until a new index is created.
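
If you prefer to keep the grok pattern itself free of type hints, the conversion can also be done in a separate mutate filter. A minimal sketch, assuming the bytes field from the earlier example has already been extracted:

filter {
  mutate {
    convert => { "bytes" => "integer" }
  }
}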

Custom Patterns 

Sometimes Logstash doesn't have the pattern we need. For this, we have a few options.

First, we can use the Oniguruma syntax for named captures, which lets you match a piece of text and save it as a field:

(?<field_name>the pattern here)

For example, postfix logs have a queue id that is a 10- or 11-character hexadecimal value. We can capture that easily like this:

(?<queue_id>[0-9A-F]{10,11})
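
Used inside a grok filter, such a named capture behaves just like a predefined pattern. A minimal sketch that extracts nothing but the queue id would be:

filter {
  grok {
    match => { "message" => "(?<queue_id>[0-9A-F]{10,11})" }
  }
}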

Alternatively, you can create a custom patterns file.

  • Create a directory called patterns with a file in it called extra (the file name doesn’t matter, but name it meaningfully for yourself)

  • In that file, write the pattern you need as the pattern name, a space, then the regexp for that pattern. 

For example, here is the postfix queue id example from above:

# contents of ./patterns/postfix:
POSTFIX_QUEUEID [0-9A-F]{10,11}

Then use the patterns_dir setting in the grok filter to tell Logstash where our custom patterns directory is. Here's a full example with a sample log line:

Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The corresponding grok filter configuration will be: 

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
  }
}

The above will match and result in the following fields:

  • timestamp: Jan 1 06:25:43

  • logsource: mailserver14

  • program: postfix/cleanup

  • pid: 21403

  • queue_id: BEF25A72965

  • syslog_message: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The timestamp, logsource, program, and pid fields come from the SYSLOGBASE pattern, which is itself defined in terms of other patterns. If the input doesn't match the pattern, a "_grokparsefailure" tag will be added to the event.
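
If you want to keep unparsable lines out of Elasticsearch entirely, one optional approach (shown here only as a sketch) is to drop events carrying this tag in the filter section:

filter {
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}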

Common Examples

Syslog:

grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
}

Nginx:

grok {
   match => [ "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
   overwrite => [ "message" ]
}
 

Apache:

grok {
   match => [
         "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}",
         "message" , "%{COMMONAPACHELOG}+%{GREEDYDATA:extra_fields}"
   ]
   overwrite => [ "message" ]
}

MySQL:

grok {
    match => [ 'message', "(?m)^%{NUMBER:date} *%{NOTSPACE:time} %{GREEDYDATA:message}"
    ]
    overwrite => [ 'message' ]
    add_field => { "mysql_time" => "%{date} %{time}" }
}

Elasticsearch:

grok {
    match => [
        "message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\]%{SPACE}\[%{DATA:index}\] %{NOTSPACE} \[%{DATA:updated-type}\]",
        "message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\] (\[%{NOTSPACE:Index}\]\[%{NUMBER:shards}\])?%{GREEDYDATA}"
    ]
}

Custom Application Log:

Let's consider the following application log:

2015-04-17 16:32:03.805 ERROR [grok-pattern-demo-app,BDS567TNP,2424PLI34934934KNS67,true] 54345 --- [nio-8080-exec-1] org.qbox.logstash.GrokApplication : this is a sample message

We have the following grok pattern configured for the above application log:

match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:minQId},%{DATA:maxQId},%{DATA:debug}] %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}" }

For input data that matches this pattern, Logstash creates a JSON record as shown below.

{
        "minQId" => "BDS567TNP",
         "debug" => "true",
         "level" => "ERROR",
           "log" => "this is a sample message",
           "pid" => "54345",
        "thread" => "nio-8080-exec-1",
          "tags" => [],
        "maxQId" => "2424PLI34934934KNS67",
    "@timestamp" => 2015-04-17 17:02:03.301,
   "application" => "grok-pattern-demo-app",
      "@version" => "1",
         "class" => "org.qbox.logstash.GrokApplicarion",
     "timestamp" => "2015-04-17 16:32:03.805"
}
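
Note that @timestamp holds the time at which Logstash processed the event, while the timestamp field holds the time taken from the log line itself. If you want @timestamp to reflect the log's own time, a date filter can be added after the grok filter; a minimal sketch for this timestamp format (assuming the log time is in the pipeline's default time zone) would be:

filter {
  date {
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss.SSS" ]
  }
}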

Debugging

There is an online grok debugger available for building and testing patterns.

It offers three fields:

  1. The first field accepts one or more log lines

  2. The second field accepts the grok pattern

  3. The third field shows the result of applying the pattern from the second field to the log lines from the first

Demonstration of the custom application log example using the Grok Debugger:

[Screenshot: c4.gif]

Dissect Filter

The Grok filter gets the job done, but it can suffer from performance issues, especially if the pattern doesn't match. An alternative is to use the dissect filter instead, which is based on separators rather than regular expressions. Unfortunately, there is no debugging app for it, but it's much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{minQId},%{maxQId},%{debug}]\n
%{pid} %{}[%{thread}] %{class}:%{log}

There are slight differences when moving from a regex-based filter to a separator-based one. Some strings end up padded with spaces. There are two ways to handle that:

  • change the logging pattern in the application, which might make direct log reading harder

  • strip additional spaces with Logstash

Using the second option, the final filter configuration is:

filter {
  dissect {
    mapping => { "message" => ... }
  }
  mutate {
    strip => [ "log", "class" ]
  }
}

Conclusion 

Grok is a library of expressions that make it easy to extract data from our logs. You can select from hundreds of available grok patterns. There are many built-in patterns that are supported out of the box by Logstash for filtering items such as words, numbers, and dates (the full list of supported patterns can be found here). If you cannot find the pattern you need, you can write your own custom pattern.

When it comes to structuring data, the grok filter is powerful and used by many. However, depending on the specific log format to parse, writing the filter expression might be quite a complex task. The dissect filter, based on separators, is an alternative that makes it much easier, at the price of some additional handling. It is also an option to consider in case of performance issues.

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.