Logstash is a data pipeline that helps us process logs and other event data from a variety of systems. With over 200 plugins, Logstash can connect to a variety of sources and stream data at scale to a central analytics system. One of the most popular analytics stacks built around Logstash is the ELK Stack (Elasticsearch, Logstash, and Kibana).

The ability to efficiently analyze and query the data being shipped into the ELK Stack depends on the information being readable. This means that as unstructured data is ingested into the system, it must be translated into structured message lines. Regardless of the data source, pulling the logs and performing some magic to beautify them is necessary to ensure that they are parsed correctly before being shipped to Elasticsearch.

Data manipulation in Logstash is performed using filter plugins. This article focuses on one of the most popular and useful filter plugins: the Logstash grok filter, which is used to parse unstructured data into structured data. Grok is currently the best way in Logstash to turn messy, unstructured log data into something structured and queryable. It is perfect for syslog logs, Apache and other web server logs, MySQL logs, and in general any log format that is written for humans rather than for computer consumption.

The grok filter and its use of patterns is the truly powerful part of Logstash. Grok allows us to turn unstructured log text into structured data. The grok filter attempts to match a field against a pattern. Think of a pattern as a named regular expression. Patterns allow for increased readability and reuse. If the pattern matches, Logstash can create additional fields (similar to regex capture groups).

The Logstash grok filter can parse many types of data for us automatically, including Apache, Nginx, JSON, and more. This allows us to use advanced features like statistical analysis on value fields, faceted search, filters, and more. Even if automated parsing isn’t available for our log type, we can still log and do full-text search over our logs. But if we couldn’t classify and break down our data into separate fields, all searches would be full text, which would not allow us to take full advantage of Elasticsearch and Kibana.

We can set up Logstash to do custom parsing of our logs and then send the output to Elasticsearch. Logstash parses logs using grok filters, which is useful when the log format is not one of the automatically parsed formats. Parsing allows us to use advanced features like statistical analysis on value fields, faceted search, filters, and more.

Grok Basics

Grok works by combining text patterns into something that matches your logs. The syntax for a grok pattern is %{SYNTAX:SEMANTIC}

The SYNTAX is the name of the pattern that will match your text. For example, 3.44 will be matched by the NUMBER pattern and 55.3.244.1 will be matched by the IP pattern. The syntax is how you match. 

The SEMANTIC is the identifier you give to the piece of text being matched. For example, 3.44 could be the duration of an event, so you could call it simply duration. Further, a string 55.3.244.1 might identify the client making a request. 

For the above example, the grok filter would look something like this: %{NUMBER:duration} %{IP:client}
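
As a quick illustration, those two patterns could be dropped into a grok filter together (a minimal sketch; duration and client are simply the semantics chosen above):

filter {
  grok {
    # matches a line such as "0.043 55.3.244.1"
    match => { "message" => "%{NUMBER:duration} %{IP:client}" }
  }
}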

Logstash already ships with lots of predefined patterns. Patterns consist of a label and a regex, e.g.: USERNAME [a-zA-Z0-9._-]+

In the grok filter, we would refer to this as %{USERNAME}:

filter {
  grok {
    match => [ "message", "%{USERNAME}" ]
  }
}

These are some of the predefined Logstash grok patterns:

# Basic Identifiers
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
# Networking
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
# paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (/([\w_%!$@:.,+~-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+
# Months: January, Feb, 3, 03, 12, December
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]) 
# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
# Log Levels
LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)

Patterns can contain other patterns, e.g.: SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}

Optionally, we can add a data type conversion to our grok pattern. By default, all semantics are saved as strings. To convert a semantic’s data type, for example to change a string to an integer, suffix it with the target data type. For example, %{NUMBER:num:int} converts the num semantic from a string to an integer. Currently the only supported conversions are int and float.

Let’s consider a request log like:

55.3.244.1 GET /index.html 15824 0.043

The pattern for this could be:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

For a more realistic example, let’s read these logs from a file:

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}

After the grok filter, the event will have a few extra fields in it: 

  • client: 55.3.244.1

  • method: GET

  • request: /index.html

  • bytes: 15824

  • duration: 0.043
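
To inspect these fields while developing the pipeline, a stdout output with the rubydebug codec can be added alongside (a minimal sketch; in production this would typically be an elasticsearch output instead):

output {
  # print each parsed event to the console for inspection
  stdout { codec => rubydebug }
}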

Target Variables

A pattern can store the matched value in a new field. Specify the field name in the grok filter: 

filter {
  grok {
    match => [ "message", "%{USERNAME:user}" ]
  }
}

When using a raw regexp, a new field can be created with Oniguruma named-capture syntax:

filter {
  grok {
    match => [ "message", "(?<myField>[a-z]{3})" ]
  }
}

This would find three lower case letters and create a field called ‘myField’.

Casting

Grok’ed fields are strings by default.  Numeric fields (int and float) can be declared in the pattern:

filter {
  grok {
    match => [ "message", "%{USERNAME:user:int}" ]
  }
}

Note that this is just a hint that Logstash passes along to Elasticsearch when it tries to insert the event. If the field already exists in the index with a different type, this won’t change the mapping in Elasticsearch until a new index is created.
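
For fields that are already in the event as strings, the mutate filter's convert option is another way to cast them (a sketch, assuming a bytes field captured by an earlier grok match):

filter {
  mutate {
    # cast the bytes field from a string to an integer within the event itself
    convert => { "bytes" => "integer" }
  }
}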

Custom Patterns 

Sometimes logstash doesn’t have a pattern we need. For this, we have a few options.

First, we can use the Oniguruma syntax for named capture, which lets us match a piece of text and save it as a field:

(?<field_name>the pattern here)

For example, postfix logs have a queue id that is a 10- or 11-character hexadecimal value. We can capture that easily like this:

(?<queue_id>[0-9A-F]{10,11})
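
Dropped into a filter, that inline capture could look like this (a minimal sketch, assuming the queue id directly follows the standard syslog prefix, as in the postfix sample further below):

filter {
  grok {
    # inline Oniguruma named capture; no custom pattern file required
    match => { "message" => "%{SYSLOGBASE} (?<queue_id>[0-9A-F]{10,11}):" }
  }
}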

Alternately, you can create a custom patterns file.

  • Create a directory called patterns with a file in it called extra (the file name doesn’t matter, but name it meaningfully for yourself)

  • In that file, write the pattern you need as the pattern name, a space, then the regexp for that pattern. 

For example, here is the postfix queue id example from above, in a file named postfix inside the patterns directory:

# contents of ./patterns/postfix:
POSTFIX_QUEUEID [0-9A-F]{10,11}

Then use the patterns_dir setting of the grok filter to tell Logstash where our custom patterns directory is. Here’s a full example with a sample log:

Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The corresponding grok filter configuration will be: 

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
  }
}

The above will match and result in the following fields:

  • timestamp: Jan 1 06:25:43

  • logsource: mailserver14

  • program: postfix/cleanup

  • pid: 21403

  • queue_id: BEF25A72965

  • syslog_message: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The timestamp, logsource, program, and pid fields come from the SYSLOGBASE pattern, which is itself built from other patterns. If the input doesn’t match the pattern, Logstash adds a _grokparsefailure tag to the event.
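
That tag can then be used in a conditional to handle unmatched events, for example to drop them (a sketch; routing them to a separate output is another option):

filter {
  if "_grokparsefailure" in [tags] {
    # events that didn't match any grok pattern end up here
    drop { }
  }
}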

Common Examples

Syslog:

grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
}
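
The syslog_timestamp field captured above is often paired with a date filter so the event's @timestamp reflects the original log time (a sketch, covering the two layouts syslog commonly emits):

date {
  # parse timestamps like "Jan  1 06:25:43" into @timestamp
  match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
}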

Nginx:

grok {
   match => [ "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
   overwrite => [ "message" ]
}
 

Apache:

grok {
   match => [
         "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}",
         "message" , "%{COMMONAPACHELOG}+%{GREEDYDATA:extra_fields}"
   ]
   overwrite => [ "message" ]
}
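
COMBINEDAPACHELOG and COMMONAPACHELOG capture the request time into a timestamp field, which can feed a date filter in the same way (a sketch):

date {
  # Apache access-log timestamps look like "10/Oct/2000:13:55:36 -0700"
  match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}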

MySQL:

grok {
    match => [ 'message', "(?m)^%{NUMBER:date} *%{NOTSPACE:time} %{GREEDYDATA:message}"
    ]
    overwrite => [ 'message' ]
    add_field => { "mysql_time" => "%{date} %{time}" }
}

Elasticsearch:

grok {
    match => ["message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\]%{SPACE}\[%{DATA:index}\] %{NOTSPACE} \[%{DATA:updated-type}\]", "message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\] (\[%{NOTSPACE:Index}\]\[%{NUMBER:shards}\])?%{GREEDYDATA}"]
}

Custom Application Log:

Let's consider the following application log: 

2015-04-17 16:32:03.805 ERROR [grok-pattern-demo-app,BDS567TNP,2424PLI34934934KNS67,true] 54345 --- [nio-8080-exec-1] org.qbox.logstash.GrokApplicarion : this is a sample message

We have the following grok pattern configured for the above application log: 

match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:minQId},%{DATA:maxQId},%{DATA:debug}] %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}" }

For input data that matches this pattern, Logstash creates a JSON record as shown below.

{
        "minQId" => "BDS567TNP",
         "debug" => "true",
         "level" => "ERROR",
           "log" => "this is a sample message",
           "pid" => "54345",
        "thread" => "nio-8080-exec-1",
          "tags" => [],
        "maxQId" => "2424PLI34934934KNS67",
    "@timestamp" => 2015-04-17 17:02:03.301,
   "application" => "grok-pattern-demo-app",
      "@version" => "1",
         "class" => "org.qbox.logstash.GrokApplicarion",
     "timestamp" => "2015-04-17 16:32:03.805"
}

Debugging

There is an online grok debugger available for building and testing patterns.

It offers three fields:

  1. The first field accepts one (or more) log line(s)

  2. The second accepts the grok pattern

  3. The third shows the result of applying the pattern from the second field to the line(s) in the first

Demonstration of Custom Application Log using Grok Debugger: 

[Animated screenshot: the custom application log parsed in the Grok Debugger]

Dissect Filter

The grok filter gets the job done, but it can suffer from performance issues, especially if the pattern doesn’t match. An alternative is the dissect filter, which is based on separators instead of regular expressions. Unfortunately, there’s no online debugger for dissect yet, but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{minQId},%{maxQId},%{debug}]\n
%{pid} %{}[%{thread}] %{class}:%{log}

There are slight differences when moving from a regex-based filter to a separator-based one. Some strings end up padded with spaces. There are two ways to handle that:

  • change the logging pattern in the application - which might make direct log reading harder

  • strip additional spaces with Logstash

Using the second option, the final filter configuration is: 

filter {
  dissect {
    mapping => { "message" => ... }
  }
  mutate {
    strip => [ "log", "class" ]
  }
}

Conclusion 

Grok is a library of expressions that makes it easy to extract data from our logs. You can select from hundreds of available grok patterns. There are many built-in patterns supported out of the box by Logstash for filtering items such as words, numbers, and dates (the full list of supported patterns ships with Logstash). If you cannot find the pattern you need, you can write your own custom pattern.

The grok filter is powerful and widely used to structure data. However, depending on the specific log format to parse, writing the filter expression can be quite a complex task. The dissect filter, based on separators, is an alternative that makes this much easier, at the price of some additional handling. It is also an option to consider in case of performance issues.

Give It a Whirl!

It's easy to spin up a standard hosted Elasticsearch cluster in any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we'll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.
