Logstash is a data pipeline that helps us process logs and other event data from a variety of sources.

With over 200 plugins, Logstash can connect to a variety of sources and stream data at scale to a central analytics system. It’s also an important part of one of the best solutions for the management and analysis of logs and events: the ELK stack (Elasticsearch, Logstash, and Kibana).

The ability to efficiently analyze and query the data shipped to the ELK Stack depends on the readability and quality of data. This implies that if unstructured data (e.g., plain text logs) is being ingested into the system, it must be translated into structured form enriched with valuable fields. Regardless of the data source, pulling the logs and performing some magic to format, transform, and enrich them is necessary to ensure that they are parsed correctly before being shipped to Elasticsearch.

Data transformation and normalization in Logstash are performed using filter plugins. This article focuses on one of the most popular and useful filter plugins, the Logstash Grok Filter, which is used to parse unstructured data into structured data and make it ready for aggregation and analysis in the ELK Stack. This lets us use advanced features such as statistical analysis on value fields, faceted search, filters, and more. If we can't classify and break down data into separate fields, every search would be full text, and we would not be able to take full advantage of Elasticsearch and Kibana. The Grok filter is widely used to process syslog logs, web server logs (e.g., Apache, NGINX), MySQL logs, and, in general, any log format that is written for humans and includes plain text.

The Grok filter ships with a variety of regular expressions and patterns for data types and expressions commonly found in logs (e.g., IP address, username, email, hostname). When Logstash reads through the logs, it can use these patterns to find the semantic elements of the log message that we want to turn into structured fields.

Thus, the Grok filter acts on text patterns to create a meaningful representation of your logs. You can tell Grok what data to search for by defining a Grok pattern: %{SYNTAX:SEMANTIC}

The SYNTAX is the name of the pattern that will match your text. For example, the NUMBER pattern can match 4.55, 4, 8, or any other number, and the IP pattern can match 55.3.244.1 or 174.49.99.1, etc.

The SEMANTIC is the identifier given to a matched text. You can think of this identifier as the key in the key-value pair created by the Grok filter, with the value being the text matched by the pattern. Using the example above, 4.55 could be the duration of some event, and 55.3.244.1 could be the client making a request.

We can express this quite simply with the Grok patterns %{NUMBER:duration} and %{IP:client} and then refer to them in the filter definition:

filter {
  grok {
    match => { "message" => "%{IP:client} %{NUMBER:duration}" }
  }
}

If it is applied to a log message, this filter will create a document with two custom fields. For example:

client: 55.3.244.1
duration: 0.043

As we’ve mentioned, Logstash ships with lots of predefined patterns. Patterns consist of a label and a regex, e.g.: USERNAME [a-zA-Z0-9._-]+

Let’s take a look at some other available patterns. (You can find a full list here.)

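To give a flavor, here are a few entries from the core pattern set (the exact definitions may vary slightly between Logstash versions):

INT (?:[+-]?(?:[0-9]+))
WORD \b\w+\b
DATA .*?
GREEDYDATA .*
IPORHOST (?:%{IP}|%{HOSTNAME})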

A great feature is that patterns can contain other patterns, e.g.: SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}

By default, all semantics (e.g., duration or client) are saved as strings. Optionally, we can add a data type conversion to our Grok pattern. For example, %{NUMBER:num:int} converts the num semantic from a string to an integer. The only conversions currently supported are int and float.

Let’s take a look at a more realistic example to illustrate how the Grok filter works. Let’s assume we have a log message like this:

2017-03-11T19:23:34.000+00:00 WARNING [App.AnomalyDetector]:Suspicious transaction activity in session -4jsdf94jsdf29msdf92

Our Grok pattern should be able to parse this log message into separate fields: “timestamp”, “log-level”, “issuer”, and “message”. This can be accomplished by the following pattern:

grok {
  match => {
    "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level} \[%{DATA:issuer}\]:%{GREEDYDATA:message}"
  }
}

Here, we define syntax-semantic pairs that sequentially match each element of the log message against a pattern available in the Grok filter.

The full definition of the filter along with the input source and output can look something like this:

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level} \[%{DATA:issuer}\]:%{GREEDYDATA:message}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}

After the filter is applied to a log message, it gets parsed into the following fields:

{
  "timestamp" => "2017-03-11T19:23:34.000+00:00",
  "log-level" => "WARNING",
  "message"   => "Suspicious transaction activity in session -4jsdf94jsdf29msdf92",
  "issuer"    => "App.AnomalyDetector"
}

The Grok filter also supports a number of common options that let you manipulate the log event after it has been parsed. For example, you can use the add_field option to add custom fields to log events, and the custom field can reference fields parsed by the Grok filter. For instance, we could enrich the previous filter with the following configuration:

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level} \[%{DATA:issuer}\]:%{GREEDYDATA:message}" }
    add_field => { "notification" => "%{issuer} detected a log event of type %{log-level}" }
  }
}
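
Applied to the sample message above, this adds a field along the lines of:

"notification" => "App.AnomalyDetector detected a log event of type WARNING"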

Similarly, you can add and remove tags and fields using add_tag, remove_tag, and remove_field options. For the full list of supported options, see the Grok Filter documentation.
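
As a quick sketch (the tag name and the choice of field to drop are arbitrary here), the following configuration tags events that Grok parsed successfully and removes a field we no longer need; these options are only applied when the match succeeds:

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level} \[%{DATA:issuer}\]:%{GREEDYDATA:message}" }
    add_tag      => [ "parsed" ]
    remove_field => [ "timestamp" ]
  }
}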

Casting

Groked fields are strings by default. Numeric fields (int and float) can be declared in the pattern:

filter {
  grok {
    match => { "message" => "%{IP:client} %{NUMBER:duration:float}" }
  }
}

Note that this conversion only changes the type of the value inside the event that Logstash sends to Elasticsearch. If the field already exists in the index with a different type, this won't change the mapping in Elasticsearch until a new index is created.
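
If a cast is needed outside of the Grok pattern itself, one option (sketched here for the duration field from the earlier example) is the mutate filter's convert option:

filter {
  mutate {
    convert => { "duration" => "float" }
  }
}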

Custom Patterns

In some cases, Logstash and the Grok filter do not ship with a built-in pattern that matches our log messages. In this situation, we can use the Oniguruma syntax for named captures or create a custom patterns file.

The general template for the custom pattern looks like this:

(?<field_name>the pattern here)

For example, if you have a message ID with 12 or 13 hexadecimal characters, the custom pattern can be defined as follows:

(?<message_id>[0-9A-F]{12,13})
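
Such a named capture can be used directly inside a match definition, alongside the built-in patterns. For example, assuming the message ID is followed by a colon and the message body (as in the pattern-file example below):

filter {
  grok {
    match => { "message" => "(?<message_id>[0-9A-F]{12,13}): %{GREEDYDATA:message_body}" }
  }
}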

Another option is to create a custom patterns file (e.g., ./patterns/messages), put the custom pattern there, and refer to it using the patterns_dir option of the Grok filter.

Each line in the file contains the pattern name, followed by a space and the regular expression:

# contents of ./patterns/messages: 
MESSAGE_ID [0-9A-F]{12,13}

Finally, reference the pattern in the Grok filter configuration, and you are good to go:

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{MESSAGE_ID:message_id}: %{GREEDYDATA:message_body}" }
  }
}
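
For a hypothetical line such as 4B1F0A9C2D3E: user login accepted, this filter would produce fields along the lines of:

"message_id"   => "4B1F0A9C2D3E",
"message_body" => "user login accepted"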

Common Examples

Here are some common examples of Grok filters for the most popular log issuers.

Syslog:

grok {
  match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
}

NGINX:

grok {
  match => { "message" => "%{IPORHOST:remote_addr} %{USERNAME:remote_user} \[%{HTTPDATE:time_local}\] \"%{DATA:request}\" %{INT:status} %{NUMBER:bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\"" }
}
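
As an illustration, a hypothetical access-log line in this format, such as 203.0.113.12 alice [19/Nov/2018:13:13:27 +0100] "GET /index.html HTTP/1.1" 200 3700 "-" "Mozilla/5.0", would be parsed into something like:

"remote_addr"     => "203.0.113.12",
"remote_user"     => "alice",
"time_local"      => "19/Nov/2018:13:13:27 +0100",
"request"         => "GET /index.html HTTP/1.1",
"status"          => "200",
"bytes_sent"      => "3700",
"http_referer"    => "-",
"http_user_agent" => "Mozilla/5.0"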

Apache:

grok {
  match => [
    "message", "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}",
    "message", "%{COMMONAPACHELOG}+%{GREEDYDATA:extra_fields}"
  ]
  overwrite => [ "message" ]
}

MySQL:

grok {
  match => { "message" => "(?m)^%{NUMBER:date} *%{NOTSPACE:time} %{GREEDYDATA:message}" }
}

Custom Application Log

Now, let’s create a more complex example of a Grok filter for a custom log generated by the Qbox application. Let’s consider the following application log:

2019-04-17 16:32:03.805 ERROR [grok-pattern-demo-app,BDS567TNP,2424PLI34934934KNS67,true] 54345 --- [nio-8080-exec-1] org.qbox.logstash.GrokApplication : this is a sample message

We have the following Grok pattern configured for the above application logs:

match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:minQId},%{DATA:maxQId},%{DATA:debug}\] %{DATA:pid} --- *\[%{DATA:thread}\] %{JAVACLASS:class} *: %{GREEDYDATA:log}" }

For input data that matches this pattern, Logstash creates a JSON record as shown below.

{
  "minQId"      => "BDS567TNP",
  "debug"       => "true",
  "level"       => "ERROR",
  "log"         => "this is a sample message",
  "pid"         => "54345",
  "thread"      => "nio-8080-exec-1",
  "tags"        => [],
  "maxQId"      => "2424PLI34934934KNS67",
  "@timestamp"  => 2019-04-17 17:02:03.301,
  "application" => "grok-pattern-demo-app",
  "@version"    => "1",
  "class"       => "org.qbox.logstash.GrokApplication",
  "timestamp"   => "2019-04-17 16:32:03.805"
}

Debugging

If you try to create a filter for a lengthy and complex log message, things can get very messy very quickly, so it is useful to debug the configuration one step at a time as you construct the filter. For that purpose, there is an online Grok Debugger available for building and testing patterns. It offers three fields:

  • The first field accepts one (or more) log line(s)
  • The second accepts the Grok pattern
  • The third is the result of filtering the first by the second

Using the Grok Debugger we can test the filter step by step as we add new patterns. Let’s say we want to test the filter for the following syslog log:

2018-11-19T13:13:27+01:00 router.lan pppd[12566]: local IP address 1.2.3.4

We enter the log line into the first field and build the Grok pattern in the second field one syntax/semantic pair at a time, checking the structured output after each addition.
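
For the syslog line above, one possible progression (using only standard Grok patterns) looks like this, with the last pattern capturing every element of the message:

%{TIMESTAMP_ISO8601:timestamp}
%{TIMESTAMP_ISO8601:timestamp} %{SYSLOGHOST:hostname}
%{TIMESTAMP_ISO8601:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}\[%{POSINT:pid}\]: %{GREEDYDATA:message}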

As you see, this online Grok debugger makes it easy to test filters in a WYSIWYG manner.

Dissect Filter

The Grok filter gets the job done, but it can suffer from performance issues, especially if the pattern doesn't match. An alternative is to use the dissect filter instead, which is based on separators. Unfortunately, there's no debugging app for it, but it's much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the Grok pattern above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{minQId},%{maxQId},%{debug}] %{pid} %{}[%{thread}] %{class}:%{log}

There are slight differences when moving from a regex-based filter to a separator-based filter. Some strings end up padded with spaces, and there are two ways to handle that:

  • Change the logging pattern in the application, which might make direct log reading harder
  • Strip additional spaces with Logstash

Using the second option, the final filter configuration is:

filter {
  dissect {
    mapping => { "message" => "%{timestamp} %{+timestamp} %{level}[%{application},%{minQId},%{maxQId},%{debug}] %{pid} %{}[%{thread}] %{class}:%{log}" }
  }
  mutate {
    strip => [ "log", "class" ]
  }
}

Conclusion

Grok is a library of expressions that make it easy to extract data from your logs. You can select from hundreds of available Grok patterns. There are many built-in patterns that are supported out-of-the-box by Logstash for filtering items such as words, numbers, and dates (see the full list of supported patterns here). If you cannot find the pattern you need, you can write your own custom pattern.

The Grok filter is powerful and used by many to structure data. However, depending on the specific log format to parse, writing the filter expression might be quite a complex task. The dissect filter, based on separators, is an alternative that makes it much easier — at the price of some additional handling. It also is an option to consider in case of performance issues.

Give It a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on our Qbox data centers. Note, too, that you can provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment with Qbox.