Big Data: save all or save costs?

When starting a new project, Big Data vendors usually recommend a “save all” and “save raw” approach, as you never know what data might come in handy later and in what format. Companies starting those projects often take the same approach, as their infrastructure is still under heavy development. Both work under the assumption that storage is practically free compared to the value of data. But is it really so?

In real life nothing is free. If you use quality hardware and storage, it costs a considerable amount of money. If you use a Big Data vendor instead of building your system from open source components yourself, you also have to pay for software licenses. The complete system can become expensive quickly, not only for SMBs but, as I learned from talking to our users, also for financial institutions and government agencies. This is especially painful when you know that the vast majority of the stored data might never be used at all.

The people I talked to at Red Hat Summit last week and at other conferences defined three major goals:

  • Reduce the amount of saved data as much as possible (but not more)
  • Bring data to a common format, so it requires less processing later
  • Process incoming data in real time when it is produced, as storing and batch-processing is more expensive

All of this can be achieved using syslog-ng, an application designed for central log collection. Before learning how syslog-ng can save you costs on your Big Data infrastructure, let me give you a quick overview of the syslog-ng application.


Four roles of syslog-ng

There are four major roles of syslog-ng: collecting, processing, filtering and storing log messages.

1. Collecting messages

The syslog-ng application can collect from a wide variety of platform-specific sources, like /dev/log, Journal or Sun Streams. Obviously, as a central log collector, it speaks both the legacy (RFC 3164 or BSD) and the new (RFC 5424 or IETF) syslog protocols and all their variants over UDP, TCP and encrypted connections. You can also collect log messages (or any kind of text data) from pipes, sockets, files and even application output.

2. Processing log messages

The possibilities here are almost endless. You can classify, normalize and structure log messages with built-in parsers. You can even write your own parser in Python if none of the available ones suits your needs. You can also enrich messages with geo-location data or additional fields based on the message content. Log messages can be reformatted to suit the requirements of the application processing the logs. You can also rewrite log messages. This does not mean falsifying them, but rather anonymizing them as required by many compliance regulations.
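To make the anonymization idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept, not syslog-ng configuration or its API: the function name, the regular expression and the choice of zeroing the last octet are all assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch: dotted-quad IPv4 addresses only.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def anonymize_ipv4(message):
    """Replace the last octet of every IPv4 address in the message with 0."""
    return IPV4.sub(r"\1.\2.\3.0", message)


line = "Accepted password for alice from 192.168.1.57 port 51234 ssh2"
print(anonymize_ipv4(line))
```

The rest of the message stays untouched, so the log remains useful for troubleshooting while the exact client address is no longer stored.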

3. Filtering logs

Filtering has two main uses. First of all, it is used to discard surplus log messages, like debug level messages, to save on storage. The other use is log routing, making sure that the right logs reach the right destinations. For example, all authentication-related messages are forwarded to a SIEM system.
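The two uses of filtering can be sketched in a few lines of plain Python. This is an illustration of the logic only, not how syslog-ng is configured; the destination names "archive" and "siem" and the event layout are hypothetical.

```python
# Syslog severity levels: lower numbers are more severe, 7 is debug.
DEBUG = 7
AUTH_FACILITIES = {"auth", "authpriv"}


def route(event):
    """Return the destinations for an event; an empty list means it is discarded."""
    if event["severity"] >= DEBUG:
        return []                      # drop debug messages to save on storage
    destinations = ["archive"]         # everything else is kept
    if event["facility"] in AUTH_FACILITIES:
        destinations.append("siem")    # authentication logs also go to the SIEM
    return destinations


print(route({"facility": "daemon", "severity": 7}))
print(route({"facility": "authpriv", "severity": 6}))
```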

4. Storing log messages

Log messages have to be stored somewhere. Traditionally, log messages are saved locally to flat files, or sent to a central syslog server where they are still saved to flat files. Over the years this has changed: SQL databases are supported, and in the past few years Big Data destinations were also added to syslog-ng, including HDFS, Kafka, MongoDB and Elasticsearch.


Message formats

When you look at your log messages under the /var/log directory, you will see that most are in the form of a date + host name + application name + an almost complete English sentence. Of course, each application event is described by a different sentence. Creating a report based on this data is quite a painful job.

The solution to this mess is the use of structured logging. In this case, events are represented as name-value pairs instead of free-form log messages. For example, an SSH login can be described with the following parameters: application name, source IP address, user name, authentication method, and so on.

For your own log messages, you can take a structured approach right from the beginning. When working with legacy log messages, you can use the different parsers in syslog-ng to turn unstructured and some of the structured message formats into name-value pairs. Once you have your logs available as name-value pairs, reporting, alerting or simply just finding the information you are looking for becomes a lot easier.
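As a rough illustration of what such parsing does, the following Python sketch turns a typical OpenSSH "Accepted ..." message into name-value pairs. It is not one of the syslog-ng parsers; the regular expression and the field names (app, method, user, source_ip, port) are assumptions chosen for this example.

```python
import re

# Hypothetical pattern for a successful OpenSSH login message.
SSH_LOGIN = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<source_ip>\S+) port (?P<port>\d+)"
)


def parse_ssh_login(program, message):
    """Return the login as name-value pairs, or None if the message does not match."""
    match = SSH_LOGIN.search(message)
    if program != "sshd" or match is None:
        return None
    event = {"app": program}
    event.update(match.groupdict())
    return event


print(parse_ssh_login("sshd", "Accepted publickey for bob from 10.0.0.5 port 40022 ssh2"))
```

Once the message is a dictionary of fields, a report such as "logins per user" is a simple count over the user field instead of a text-mining exercise.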


How do these apply to Big Data and saving costs?

We should go back to the goals we have defined at the beginning:

  • Reduce the amount of saved data as much as possible (but not more)
  • Bring data to a common format, so it requires less processing later
  • Process incoming data in real time when it is produced, as storing and batch-processing is a lot more expensive

The use of parsing and enrichment turns data into name-value pairs. These greatly facilitate filtering: instead of a long free-form text, each name-value pair contains a single piece of information that is easy to act on.

The use of filtering can greatly reduce the amount of data stored. Of course, filtering also works without extensive parsing, but parsing enables a lot more precise filtering. You can filter out more data while lowering the risk of discarding useful data.
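The difference in precision can be shown with a small Python sketch. The two sample events and both filter functions are made up for this illustration and do not come from syslog-ng.

```python
# Two hypothetical events that have already been parsed into name-value pairs.
events = [
    {"user": "root", "message": "Accepted password for root from 10.0.0.5"},
    {"user": "alice", "message": "alice opened /root/notes.txt"},
]


def text_filter(items):
    """Filter on the raw text: matches any message that mentions 'root'."""
    return [e for e in items if "root" in e["message"]]


def field_filter(items):
    """Filter on the parsed field: matches only events where the user is root."""
    return [e for e in items if e["user"] == "root"]


print(len(text_filter(events)))
print(len(field_filter(events)))
```

The text filter keeps both events, even though only one of them is a root login. The field filter keeps exactly the one that matters, so more can be discarded safely.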

The use of templates when storing messages helps to bring them to a common format. It can be JSON, or you can easily create your own custom format. This can save you lots of processing later, as data is already in a ready-to-use format or requires just a minimal amount of processing before it is used.
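The effect of such a template can be sketched in Python as follows. This is not syslog-ng template syntax, only an illustration of the outcome; the default field list and the function name are assumptions made for this example.

```python
import json


def to_json_line(event, fields=("timestamp", "host", "app", "user", "source_ip")):
    """Serialize only the selected fields, in a stable key order, as one JSON line."""
    record = {name: event[name] for name in fields if name in event}
    return json.dumps(record, sort_keys=True)


event = {"host": "web1", "app": "sshd", "user": "bob", "raw": "original text"}
print(to_json_line(event))
```

Every record then arrives at the destination with the same shape, and fields that were not selected (like the raw text here) are simply not stored.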

And all of this is done by syslog-ng in real time.

While syslog-ng was originally a sysadmin tool focusing on central log message collection, the very same features make it suitable for generic data collection and processing. The syslog-ng application is not limited to log messages but can collect and process several types of text data. It is not just a sysadmin tool any more, but also belongs in the tool set of data engineers.

If you have questions or comments related to syslog-ng, do not hesitate to contact us. You can reach us by email or you can even chat with us. For a list of possibilities, check our GitHub page under the “Community” section at https://github.com/balabit/syslog-ng. On Twitter, I am available as @PCzanik.
