Sequence – making PatternDB creation for syslog-ng easier

We are well into the 21st century, but most log messages still arrive in an unstructured format. For well over a decade, syslog-ng has had a solution for turning unstructured messages into name-value pairs, called PatternDB. However, creating a pattern database for PatternDB from scratch is a source of major pain. Or rather, it was: sequence-rtg, a fork of the sequence log analyzer, provides a new hope! It can easily create ready-to-use patterns for your most frequent log messages.

Sequence-rtg is still in beta and therefore a bit rough around the edges. However, once you get past the initial struggle of creating the database, it works just fine, especially if you have lots of log messages. In my experience, the more log messages I had and the larger the batch size, the better the generated patterns were.

Before you begin

Getting started with sequence is luckily not rocket science. All you need to get started is syslog-ng, a Go compiler and logs. Lots of logs.

On the syslog-ng side, you need syslog-ng with JSON template-function support. The version in EPEL 7 might be a bit too old (not tested), but anything from the past five years should be OK. If the JSON template-function is not supported by the Linux distro of your choice, check the following resource for up-to-date third-party packages:

https://www.syslog-ng.com/products/open-source-log-management/3rd-party-binaries.aspx

Sequence-rtg (I will use “sequence” from now on) is available only in source code form: there are no binary packages, so you need to compile it yourself. For that, you need a Go compiler for your Linux distribution of choice, and git to check out the source code. It is available at https://github.com/ccin2p3/sequence-RTG.

Did I already mention that you need lots of logs? Of course, you can download huge amounts of sample log messages from the Internet, but testing sequence with real log messages is a much more life-like experience. My original test environment had 20k log messages a day. I had to drastically reduce the batch size. Of course, it still worked, but once the most frequent log messages were covered by patterns, I could not collect enough further log messages in a single day. My next test environment had about a thousand messages a second, producing much better results.

Installing sequence

Compiling sequence is easy. The following commands are based on the instructions in the GitHub readme, but I changed them slightly to be in sync with the configurations I will show later.

git clone https://github.com/ccin2p3/sequence-RTG sequence
cd sequence/
go build
cd cmd/sequence_db/
go build

Once you have finished running these commands, you should see a file called sequence_db in the current directory. This is the binary that does all the log analysis and PatternDB generation, based on its command line arguments. As all documentation refers to it as sequence, I copied the file to the /usr/local/bin directory using this name:

cp sequence_db /usr/local/bin/sequence

Creating the initial database

As I mentioned, sequence still has some rough edges. One of them is initial database creation. If you try the createdatabase option, sequence will die a horrible death, throwing all kinds of interesting error messages. In the end, I found a workaround by accident. You can create an initial database when your working directory is the root of the directory structure checked out from git:

cd ~/sequence
sequence createdatabase -l /dev/stderr -n info --conn sequence.sdb --type sqlite3
cp sequence.sdb /etc/syslog-ng/conf.d/

The last command copies the database to the syslog-ng directory for user configurations. If syslog-ng in your Linux distribution of choice does not support such a directory, you should still create /etc/syslog-ng/conf.d/ and store the database there. All configuration and command line examples I show you assume this location.

Configuring sequence

Create a new directory – /etc/sequence – and copy sequence.toml from the git root directory there. Now edit the file and make sure that connectioninfo points to the freshly created database file:

connectioninfo = "/etc/syslog-ng/conf.d/sequence.sdb"

You can check the rest of the file, but you should leave it alone (at least for initial testing), as it is not documented. As far as I can see, the settings here give sequence hints about what various fields in log messages mean, and make field names more useful. This is why field names in generic Linux syslog messages need almost no editing at all. Patterns generated for logs from other sources (like a VMware cluster in my case) had much more generic field names, like string1, string2, integer3, etc.

Configuring syslog-ng

As mentioned above, I tend to save my configurations in the syslog-ng directory for user configurations. If your distro of choice supports it, create a new configuration file here. Otherwise append it to syslog-ng.conf:

parser p_sequence_patterndb {
    db-parser(
        file(
            '/var/lib/syslog-ng/sequence.xml'
        ),
        drop-unmatched(
            yes
        )
    );
};

destination d_sequence {
    program(
        '/usr/local/bin/sequence analyzebyservice -b 100000 -i - -k json --config /etc/sequence/sequence.toml -l /var/log/sequence.log -n debug',
        template(
            "$(format-json -p service=$PROGRAM message=$MESSAGE)\n"
        ),
        flush-lines(
            100000
        )
    );
};

destination d_parsed {
  file("/var/log/parsed.json" template("$(format-json --scope rfc5424 --scope nv_pairs)\n\n"));
};

source s_net {
  udp(port("514") flags(store-raw-message,syslog-protocol));
};

log {
  source(s_local);
  source(s_net);
  if {
    parser(p_sequence_patterndb);
    destination(d_parsed);
  } else {
    destination(d_sequence);
  };
};

This is only good for testing, as part of this configuration is only used for debugging. You will need to tailor it to your environment. Let us go over this configuration snippet in detail!

When it comes to understanding syslog-ng configurations, you should start with the log statements, which are usually at the end of the configuration, just like in this example. The log statement starts with two sources: one for local log messages, and the other for collecting logs from the network. The if statement uses PatternDB: if a log message matches a pattern, the results of message parsing are stored in a file in JSON format. Otherwise, the message is fed to sequence for analysis.

The name of the local log source varies from configuration to configuration. If you want to parse your local log messages, make sure that the source name here matches your actual configuration.
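If you are not sure what your local source is called, a quick look at the configuration settles it. The following is only a minimal sketch: it assumes that each source is declared on a line starting with the source keyword, and list_sources is a helper name made up for this example, not something shipped with syslog-ng.

```shell
# list_sources is a hypothetical helper, not part of syslog-ng.
# It prints the name of every "source <name> {" declaration in a config file.
# References such as "source(s_local);" inside log statements are skipped,
# because they have no whitespace between the keyword and the name.
list_sources() {
    sed -n 's/^[[:space:]]*source[[:space:]][[:space:]]*\([A-Za-z0-9_.-][A-Za-z0-9_.-]*\).*/\1/p' "$1"
}

# Example call (the path is an assumption, adjust it to your distro):
# list_sources /etc/syslog-ng/syslog-ng.conf
```

Whatever name it prints for your local source is the one to use in place of s_local in the log statement.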

The network source in my configuration looks a bit strange. This is just a simple test host, so we opted to use UDP. However, log messages looked a bit strange. I got suspicious and enabled storing raw log messages using the store-raw-message flag. And it turned out that even if we used UDP, my colleagues opted to use RFC5424 formatting. Adding the syslog-protocol flag resolved my issues.

The p_sequence_patterndb parser uses the patterndb database created by sequence. If a log message does not match any of the patterns, the message is dropped. This way you can save / forward only parsed logs to a specific location. Before starting it for the first time, you should create an empty file:

touch /var/lib/syslog-ng/sequence.xml

The default template for writing log messages to files does not include any of the name-value pairs. You can work around this by creating your own template. The file in the d_parsed destination uses the format-json template function, and saves basic syslog fields and all the parsed name-value pairs in JSON format.

Finally, if a log message is not yet known by PatternDB, it is forwarded to sequence for analysis. Sequence is started from a program() destination in syslog-ng. Messages are forwarded to sequence in batches of 100000 lines, using JSON formatting. If you want to experiment with different batch sizes, you must change both the -b command line parameter of sequence and the flush-lines() parameter of syslog-ng.

Testing

You are now ready to start syslog-ng. As usual, SELinux, a firewall or a wrong source name can prevent syslog-ng from running. None of these problems are specific to this blog, so I just leave these here as a hint where to look if syslog-ng does not start or does not work as expected.

As mentioned earlier: you should have many log messages. Depending on your message rate, it might take either seconds or several hours before sequence receives the first batch of log messages and starts analyzing them. You can use “syslog-ng-ctl stats” to check whether log messages are arriving as expected. Once the amount of log messages you configured is collected, sequence analyzes them.
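To get a feeling for the waiting time, you can divide the batch size by your message rate. The sketch below does this with plain shell arithmetic. The helper name batch_fill_seconds is made up for this example, and the calculation assumes a steady message rate, which real log traffic rarely has.

```shell
# batch_fill_seconds is a hypothetical helper for back-of-the-envelope planning.
# Arguments: batch size (in lines), average number of messages per day
#            that reach sequence.
# Prints the approximate number of seconds needed to fill one batch.
batch_fill_seconds() {
    echo $(( $1 * 86400 / $2 ))
}

# A 100k batch at 20k messages a day:
# batch_fill_seconds 100000 20000
# A 100k batch at roughly 1k messages a second (86400000 a day):
# batch_fill_seconds 100000 86400000
```

With the figures of my two test environments, this gives 432000 seconds (five days) for the first one and 100 seconds for the second. Keep in mind that only messages not yet matched by PatternDB are forwarded to sequence, so the effective rate drops as more patterns are generated.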

You can check what is happening in /var/log/sequence.log where you can find similar log messages:

{"id":100,"level":"info","msg":"Read in 500000 records successfully, starting analysis..","time":"2021-10-22T14:13:21+02:00","version":"beta"}
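If you do not want to read through the whole log, you can count how many batches were read so far. This is a minimal sketch that simply relies on the “starting analysis” wording of the sample message above; count_analyzed_batches is a helper name made up for this example, and the wording may differ in other sequence versions.

```shell
# count_analyzed_batches is a hypothetical helper, not part of sequence.
# It counts the log lines announcing that a batch was read and that
# analysis is about to start.
count_analyzed_batches() {
    grep -c 'starting analysis' "$1"
}

# Example call, using the log file configured earlier:
# count_analyzed_batches /var/log/sequence.log
```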

Whenever you see that sequence analyzed another batch of log messages, you can create a new PatternDB database based on the database created by sequence. You can also put it in cron to do it automatically, even without checking the logs. Use the following command line to create a PatternDB database from the sequence database:

/usr/local/bin/sequence exportpatterns -s patterndb --config /etc/sequence/sequence.toml -o /var/lib/syslog-ng/sequence -l /var/log/sequence.log -n debug -f yaml,xml -c 0.5

I used the patterns unmodified. First of all, I wanted to see the quality of patterns that sequence creates. And I am also well aware that proper naming of fields in PatternDB opens up another huge can of worms, as there is no common naming scheme for fields extracted from log messages. Instead, there is a different one for each major vendor.

You can examine the generated patterns by opening /var/lib/syslog-ng/sequence.xml in your favorite editor. One with syntax highlighting helps a lot :-)

Starting over

If you want to start your experiment over, use a different batch size, or fine-tune other parameters, make sure that you reset both the PatternDB database (/var/lib/syslog-ng/sequence.xml) and the SQLite database used by sequence (/etc/syslog-ng/conf.d/sequence.sdb). Otherwise, you will have some unexpected results…
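Resetting can be as simple as emptying one file and deleting the other. The following is only a sketch of how I would do it: reset_sequence_state is a helper name made up for this post, and stopping syslog-ng before the reset is my assumption about the safest order of steps.

```shell
# reset_sequence_state is a hypothetical helper written for this post.
# Arguments: path to the PatternDB XML file, path to the SQLite database.
reset_sequence_state() {
    : > "$1"      # empty the PatternDB file, but keep the file itself in place
    rm -f "$2"    # delete the SQLite database used by sequence
}

# Stop syslog-ng first, so that nothing uses the files in the meantime, then:
# reset_sequence_state /var/lib/syslog-ng/sequence.xml /etc/syslog-ng/conf.d/sequence.sdb
```

Afterwards, recreate the SQLite database as described in “Creating the initial database”, and start syslog-ng again.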

My experiences

Your mileage may vary, as results depend hugely on the amount and composition of the log messages you have. Fabien Wernli, one of the authors of sequence-rtg, reported that in their environment, it took a couple of months to generate patterns covering 85% of their log messages. I had two test environments. The first one had 20k messages a day, most of them coming from sshd. I had to change the batch size to 10k. Sequence created relatively good patterns for sshd, but nothing else, as not even a full day was enough to collect 10k messages again from the rest of the logs.

The second test environment received about 1k messages a second. Most of them arrived from VMware servers, some from Linux hosts. There, I could use the default 100k batch size, and I could even test with a 500k batch size.

In my experience, the larger the batch size, the better quality the generated PatternDB database will be. However, there is a catch: after a while, it will take a very long time to gather enough log messages to fill a batch, and there will be a higher chance of missing some less frequent log messages.

As mentioned earlier, generic Linux log messages had much more meaningful field names, due to the magic included in the configuration. I did not try, but most likely you can fine-tune this magic to better understand your own log messages.

There are a number of parameters you can fine-tune. I only played with the batch size. But even that alone could cause a huge difference: with a 100k batch size, it took slightly over a day to achieve 85% efficiency with the logs I had for testing. With 500k, it was just half a day. Note that unlike in Fabien’s environment, most of my logs available for testing were homogeneous (coming from a VMware server).

What is next?

Once you have tested sequence, share your experiences! Both Fabien and I are very happy to hear how it works for you. You can reach Fabien and the syslog-ng community on the mailing list (https://lists.balabit.hu/mailman/listinfo/syslog-ng), or if you prefer, by chatting on Gitter (https://gitter.im/syslog-ng/syslog-ng). Happy hacking!
