Parsing log messages with the syslog-ng Python parser

The Python parser of syslog-ng not only enables you to parse any type of log message, but you can also use it to enrich messages. From this blog you will learn how to extract information from a specially formatted log message, and how to create new name-value pairs by consulting external databases about data contained in your log messages. I will show these using two simple parsers: one resolves host names from IP addresses, the other one uses regular expressions to parse synthetic log messages from the loggen utility.


Before you begin

On your syslog-ng machine you need a recent syslog-ng release with Python support enabled. I recommend 3.17.2 or later, but with minimal modifications earlier versions might also work. My code was only tested with Python 3, but most likely works with Python 2 with small modifications.

For the first example I used Suricata logs, but you do not have to set up an IDS (Intrusion Detection System) and parse its logs to understand and try the example Python code. For example, you can use a network source and the HOSTIP macro instead.

There are two ways to store the Python source code for your parser. For short code you can store the Python code inline in your syslog-ng configuration. In this blog I use this simple method. For more complex code it is better to store the Python code outside of the syslog-ng configuration. You can learn how to do that from my Python destination blog at https://www.syslog-ng.com/community/b/blog/posts/python-destination-getting-into-details

First example: resolving IP addresses

When it comes to security, I often hear that only the IP address is reliable information and that host names cannot be trusted. Many tools, like Suricata, only store the IP address. Still, it is convenient to see host names in logs, even if you know that host names can be easily changed with malicious intent. The code below resolves IP addresses found inside the log messages, not the ones in the headers. Here is a sample log message:

{"timestamp":"2018-10-03T17:52:36.000428+0200","flow_id":269616113399740,"event_type":"flow","src_ip":"192.168.123.123","src_port":40852,"dest_ip":"123.123.123.123","dest_port":443,"proto":"TCP","app_proto":"tls","flow":{"pkts_toserver":18,"pkts_toclient":19,"bytes_toserver":2306,"bytes_toclient":5920,"start":"2018-10-03T17:47:54.145340+0200","end":"2018-10-03T17:50:53.166301+0200","age":179,"bypass":"local","state":"bypassed","reason":"timeout","alerted":false},"tcp":{"tcp_flags":"7f","tcp_flags_ts":"3f","tcp_flags_tc":"7e","syn":true,"fin":true,"rst":true,"psh":true,"ack":true,"urg":true,"ecn":true,"state":"established"}}

The following syslog-ng configuration snippet shows you a minimalist Python parser having only mandatory options. It is part of a configuration which receives Suricata logs in JSON format, parses them with the JSON parser, resolves destination IP addresses to host names using the Python parser and saves the logs to a JSON file.

Note that this code uses blocking calls for reverse lookups, so it can slow down syslog-ng considerably.

Configuration / code

source s_suricata {
    tcp(ip("0.0.0.0") port("514") flags(no-parse));
};

parser p_json {
    json-parser (prefix("suricata."));
};

destination d_suricata {
    file("/var/log/suricata.log" template("$(format-json --key suricata* --key hostname* --key ISODATE)\n"));
};

python {

"""
simple syslog-ng Python parser example
resolves IP to hostname
value pair names are hard-coded
"""

import socket

class SngResolver(object):
    def parse(self, log_message):
        """
        Resolves IP to hostname
        """

        ipaddr_b = log_message['suricata.dest_ip']
        ipaddr = ipaddr_b.decode('utf-8')

        # try to resolve the IP address
        try:
            resolved = socket.gethostbyaddr(ipaddr)
            hostname = resolved[0]
            log_message['hostname.dest'] = hostname
        except Exception:
            # lookup failed: keep the message without a host name
            pass

        # return True, otherwise the message is dropped
        return True

};

parser p_resolver {
    python(
        class("SngResolver")
    );
};

log {
    source(s_suricata);
    parser(p_json);
    parser(p_resolver);
    destination(d_suricata);
};

How it works

Let's look at the above configuration. As usual, the heart of the configuration is the log statement. This is the part which connects all the building blocks together. First, a source (s_suricata) is configured to receive Suricata log messages. Next, logs are parsed by a JSON parser (p_json) to create name-value pairs.

Once you have name-value pairs, you can resolve the destination IP addresses to host names with the Python parser (p_resolver).

The Python parser has a single mandatory option, the class name containing the method to parse your log message. In this case the parser is called p_resolver and the class is called SngResolver.

The actual Python code is enclosed in a

python {}

block. Half of the block is filled with comment lines, which are not strictly necessary but usually make your life a lot easier, especially if you have to modify your code a few weeks later. This time the Python block contains a single class, SngResolver, and a single mandatory method: parse().

The parse() method receives all the name-value pairs from syslog-ng as part of an object. In our case the object is called log_message. You can read or create name-value pairs by referring to them by their names. For example:

ipaddr_b = log_message['suricata.dest_ip']

reads the macro containing the Suricata destination IP address, and:

log_message['hostname.dest'] = hostname

creates a new name-value pair containing the resolved host name. This is only created if the DNS server accessible from your syslog-ng machine can resolve the IP in the messages.
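
To try this logic without a running syslog-ng, a plain dict can stand in for the log message object. The following is only a sketch under assumptions: the dict stand-in, the lookup parameter and fake_lookup() are hypothetical helpers and not part of syslog-ng. Values are stored as bytes because the parser above decodes them.

```python
import socket


class SngResolver(object):
    """Same logic as the parser above, with a replaceable lookup function
    so it can be exercised without DNS (an assumption of this sketch)."""

    def __init__(self, lookup=socket.gethostbyaddr):
        self.lookup = lookup

    def parse(self, log_message):
        ipaddr = log_message['suricata.dest_ip'].decode('utf-8')
        try:
            log_message['hostname.dest'] = self.lookup(ipaddr)[0]
        except OSError:
            # unresolvable address: keep the message, add nothing
            pass
        return True


def fake_lookup(ipaddr):
    """Hypothetical resolver that knows a single address."""
    if ipaddr == '123.123.123.123':
        return ('example.test', [], [ipaddr])
    raise socket.herror(1, 'Unknown host')


# a dict of bytes values stands in for the syslog-ng message object
message = {'suricata.dest_ip': b'123.123.123.123'}
kept = SngResolver(lookup=fake_lookup).parse(message)
print(kept, message.get('hostname.dest'))

# an address the fake resolver does not know
unknown = {'suricata.dest_ip': b'10.0.0.1'}
SngResolver(lookup=fake_lookup).parse(unknown)
print('hostname.dest' in unknown)
```

In both cases parse() returns True, so the message would be kept. Only the first one gains a hostname.dest name-value pair.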

Note: If the parse() method returns True, the message is kept. If it returns False (or does not return anything), the message is dropped. This is why the method ends with:

return True
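
Because the reverse lookups are blocking, caching repeated addresses is one way to reduce their cost. The following is a minimal sketch, assuming that an in-process cache is acceptable and that stale entries do not matter for your use case. CachingResolver, the lookup parameter and max_entries are illustrative names, not syslog-ng options.

```python
import socket


class CachingResolver(object):
    """Resolver parser variant that remembers earlier lookups."""

    def __init__(self, lookup=socket.gethostbyaddr, max_entries=10000):
        self.lookup = lookup
        self.max_entries = max_entries
        self.cache = {}

    def resolve(self, ipaddr):
        if ipaddr in self.cache:
            return self.cache[ipaddr]
        try:
            hostname = self.lookup(ipaddr)[0]
        except OSError:
            hostname = None  # failed lookups are remembered as well
        if len(self.cache) >= self.max_entries:
            self.cache.clear()  # crude eviction to bound memory use
        self.cache[ipaddr] = hostname
        return hostname

    def parse(self, log_message):
        ipaddr = log_message['suricata.dest_ip'].decode('utf-8')
        hostname = self.resolve(ipaddr)
        if hostname is not None:
            log_message['hostname.dest'] = hostname
        return True


calls = []


def counting_lookup(ipaddr):
    """Hypothetical lookup that records how often it is called."""
    calls.append(ipaddr)
    return ('example.test', [], [ipaddr])


resolver = CachingResolver(lookup=counting_lookup)
for _ in range(3):
    msg = {'suricata.dest_ip': b'123.123.123.123'}
    resolver.parse(msg)
print(len(calls))
```

The first lookup of each address still blocks, so this only helps when the same addresses show up repeatedly.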

Second example: parsing loggen output

The second example is based on the documentation, but I ported it from Python 2 to Python 3 to get it running in my environment. In contrast to the first example, here you can see all the optional features of the Python parser in action. You do not need any complex setup here, as this parser processes logs from loggen, a utility bundled with syslog-ng.

Here is a sample message from loggen:

<38>2018-10-03T18:00:17 localhost prg00000[1234]: seq: 0000001451, thread: 0000, runid: 1538582416, stamp: 2018-10-03T18:00:17 PADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADDPADD

Configuration / code

python {
"""
Regex parser sample code for syslog-ng
"""

import re

class SngRegexParser(object):
    """
    Parses the MESSAGE from loggen using regex
    """

    def init(self, options):
        """
        Initializes the parser
        """
        pattern = options["regex"]
        self.regex = re.compile(pattern)
        self.counter = 0
        return True

    def deinit(self):
        """
        Deinitializes the parser, often empty
        """
        pass

    def parse(self, log_message):
        """
        Parses the log message and returns results
        """
        decoded_msg = log_message['MESSAGE'].decode('utf-8')
        match = self.regex.match(decoded_msg)
        if match:
            for key, value in match.groupdict().items():
                log_message[key] = value
            log_message['MY_COUNTER'] = str(self.counter)
            self.counter += 1
            return True
        return False
};

parser my_python_parser{
    python(
        class("SngRegexParser")
        options("regex", "seq: (?P<seq>\\d+), thread: (?P<thread>\\d+), runid: (?P<runid>\\d+), stamp: (?P<stamp>[^ ]+) (?P<padding>.*$)")
    );
};

log {
    source { tcp(port(5555)); };
    parser(my_python_parser);
    destination {
        file("/tmp/regexparser.log.txt" template("seq: $seq thread: $thread runid: $runid stamp: $stamp my_counter: $MY_COUNTER\n"));
    };
};

How it works

Let's look at the above configuration. Once again, the heart of the configuration is the log statement. First, a source is configured to collect log messages from loggen. Next, logs are parsed by the Python parser (my_python_parser) to create name-value pairs from the different fields in the log messages. Finally, logs are saved to a text file using a template which stores all name-value pairs created by the Python parser.

Next, take a look at the parser block of the configuration. As in the first example it starts with a reference to the class name in the Python code. It does not end here, but also contains an optional feature: options. Using options you can pass additional parameters from the syslog-ng configuration to the Python code: in this case the regular expression describing the structure of a log message generated by loggen.
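
The regular expression passed through options can be checked on its own before it goes into the configuration. Below is a small sketch using the MESSAGE part of the sample loggen line, with the padding shortened. The configuration above writes the backslashes doubled inside its quoted string, while a Python raw string needs only one.

```python
import re

# same expression as in the options() line, written as a raw string
pattern = (r"seq: (?P<seq>\d+), thread: (?P<thread>\d+), "
           r"runid: (?P<runid>\d+), stamp: (?P<stamp>[^ ]+) (?P<padding>.*$)")
regex = re.compile(pattern)

# MESSAGE part of the loggen sample, padding shortened for readability
sample = ("seq: 0000001451, thread: 0000, runid: 1538582416, "
          "stamp: 2018-10-03T18:00:17 PADDPADDPADD")

match = regex.match(sample)
fields = match.groupdict() if match else {}
for key in ("seq", "thread", "runid", "stamp"):
    print(key, fields.get(key))
```

Each named group becomes a key in groupdict(), which is what the parse() method copies into the log message as name-value pairs.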

Looking at the Python code you can see that it has three methods:

  • init(): initializes the parser.
  • deinit(): de-initializes the parser, often left empty.
  • parse(): parses the message. Also works as a filter: message is dropped if the method does not return True.

This code also allows you to test another feature of the Python parser: internal variables are kept as long as syslog-ng is not reloaded or restarted. This is how self.counter in the above code can count incoming log messages.
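
The lifetime of self.counter can be imitated outside syslog-ng. This sketch assumes that a reload amounts to creating a fresh instance and calling init() again. CountingParser is a trimmed-down, hypothetical stand-in for the class above, and plain dicts stand in for log messages.

```python
class CountingParser(object):
    """Only the counter logic of the regex parser above."""

    def init(self, options):
        self.counter = 0
        return True

    def parse(self, log_message):
        log_message['MY_COUNTER'] = str(self.counter)
        self.counter += 1
        return True


def run(parser, count):
    """Feed `count` empty messages and collect the counter values."""
    seen = []
    for _ in range(count):
        message = {}
        parser.parse(message)
        seen.append(message['MY_COUNTER'])
    return seen


parser = CountingParser()
parser.init({})
before = run(parser, 3)

# imitate a reload: new instance, init() called again
parser = CountingParser()
parser.init({})
after = run(parser, 2)

print(before, after)
```

The counter keeps growing while the same instance handles messages, and starts from zero again after the imitated reload.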

Testing

Append the above configuration snippet to syslog-ng.conf, or save it to a new file under /etc/syslog-ng/conf.d/ if syslog-ng in your Linux distribution is configured to use an include directory.

You need two terminal connections to the machine running syslog-ng. On the first one start loggen to generate some log messages:

loggen -i -S localhost 5555

Use the second terminal to reload syslog-ng a couple of times:

syslog-ng-ctl reload

Now check the results in /tmp/regexparser.log.txt. You should see that while the number in the second column is increasing continuously (the sequence number coming from loggen), the number in the last column is restarting periodically. This column shows the value in the counter implemented in Python: it restarts counting when syslog-ng is reloaded:

seq: 0000002842 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 2842
seq: 0000002843 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 2843
seq: 0000002844 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 2844
seq: 0000002845 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 0
seq: 0000002846 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 1
seq: 0000002847 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 2
seq: 0000002848 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 3
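
You can also find the reload points with a few lines of Python instead of reading the file by eye. This is a sketch that assumes the output format of the template above. find_resets() is a hypothetical helper, not something shipped with syslog-ng.

```python
import re

# matches the lines written by the file destination template above
LINE = re.compile(r"seq: (?P<seq>\d+) .* my_counter: (?P<counter>\d+)$")


def find_resets(lines):
    """Return the seq values at which my_counter went backwards."""
    resets = []
    previous = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the template
        counter = int(match.group("counter"))
        if previous is not None and counter < previous:
            resets.append(match.group("seq"))
        previous = counter
    return resets


sample = [
    "seq: 0000002844 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 2844",
    "seq: 0000002845 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 0",
    "seq: 0000002846 thread: 0000 runid: 1538557132 stamp: 2018-10-03T10:58:54 my_counter: 1",
]
resets = find_resets(sample)
print(resets)
```

To run it on the real output, pass an open file object for /tmp/regexparser.log.txt to find_resets() instead of the sample list.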

If you have questions or comments related to syslog-ng, do not hesitate to contact us. You can reach us by email or you can even chat with us. For a list of possibilities, check our GitHub page under the “Community” section at https://github.com/balabit/syslog-ng. On Twitter, I am available as @PCzanik.
