Testing the performance of log streaming to HDFS with syslog-ng

Hadoop and Kafka are part of modern high-speed data management, and syslog-ng, as an integral part of a versatile central log management solution, has supported these destinations for some time now. Seamless integration is great, but what about performance? One of syslog-ng’s main advantages is its high performance, and writing logs to these destinations is no exception. In this two-part blog post, we will share some details on how syslog-ng performs in these environments.

Our test environment

We performed our tests with syslog-ng Premium Edition 6.0.3 running on a server with two Intel Xeon E5-2620 v3 2.40 GHz CPUs, 16 GB of RAM, a 10 Gbps Ethernet interface, a 500 GB SSD, and Ubuntu Trusty Tahr. The HDFS server ran on VMware ESX with a four-core 2.6 GHz CPU, 8 GB of RAM, a 1 Gbps Ethernet interface, SSD-based storage, Ubuntu Trusty Tahr, and HDFS 2.7.3.

In all our tests, syslog-ng processed real logs originating from the Windows Event Log (sent by the Windows agent of syslog-ng Premium Edition). The average message size was 400 bytes (ranging from 137 to 2133 bytes), and syslog-ng received the logs from 10 parallel TCP connections.

Hadoop performance

First, we ran a test with one HDFS data node and one HDFS destination, so in this case syslog-ng wrote a single file to HDFS. The result was 140,000 logs per second, which corresponds to a throughput of 53 MB/s. syslog-ng consumed approximately 1.5 GB of memory and 30% of the available CPU resources.

But what if syslog-ng writes to multiple files on the same HDFS data node?

Based on our tests, the overall speed remains almost the same. For example, if syslog-ng writes two files to the same HDFS data node, the total throughput is still around 140,000 logs per second, roughly 70,000 logs per second to each file.
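For illustration, here is a minimal sketch of such a two-file setup. The host names, ports, and file paths are placeholders rather than our exact test values, and splitting the incoming traffic over two source ports is just one simple way to feed the two files:

 # two TCP sources, so the test traffic can be split between the two files
 source s_tcp_a { network(ip(0.0.0.0) port(5514)); };
 source s_tcp_b { network(ip(0.0.0.0) port(5515)); };
 # two HDFS destinations that differ only in the file they write
 destination d_hdfs_a {
 java(class_name(org.syslog_ng.hdfs.HdfsDestination)
 class_path("/opt/syslog-ng/lib/syslog-ng/java-modules/*.jar:/opt/hdfs-libs/lib/*.jar")
 option("hdfs_uri", "hdfs://namenode.example.com:8020")
 option("hdfs_file", "/logs/stream-a.txt"));
 };
 destination d_hdfs_b {
 java(class_name(org.syslog_ng.hdfs.HdfsDestination)
 class_path("/opt/syslog-ng/lib/syslog-ng/java-modules/*.jar:/opt/hdfs-libs/lib/*.jar")
 option("hdfs_uri", "hdfs://namenode.example.com:8020")
 option("hdfs_file", "/logs/stream-b.txt"));
 };
 # one log path per file, with flow control enabled
 log { source(s_tcp_a); destination(d_hdfs_a); flags(flow-control); };
 log { source(s_tcp_b); destination(d_hdfs_b); flags(flow-control); };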

Of course, if syslog-ng writes many files to the same node, the speed can decrease significantly due to disk latency. The same happens in non-HDFS environments when a large number of files are written to the same hard disk simultaneously.

What if HDFS is configured to use several data nodes?

If syslog-ng writes only to one file, there is no change in performance compared to the single-node scenario.
If syslog-ng is configured to write multiple files to HDFS (for example, three HDFS destinations on three log paths, with three HDFS data nodes in the cluster), then the overall performance increases: each syslog-ng HDFS destination can communicate with a different data node, so the overall speed depends on the performance of the individual data nodes.
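That layout can be sketched as follows, under the same assumptions as the previous example (placeholder names, ports, and paths; a reusable configuration block is just one way to avoid repeating the java() options). Note that hdfs_uri always points at the NameNode: HDFS itself decides which data nodes store the blocks of each file.

 # sources splitting the incoming test traffic three ways
 source s_tcp_1 { network(ip(0.0.0.0) port(5514)); };
 source s_tcp_2 { network(ip(0.0.0.0) port(5515)); };
 source s_tcp_3 { network(ip(0.0.0.0) port(5516)); };
 # reusable block: the three destinations differ only in the target file
 block destination hdfs_file(path("/logs/default.txt")) {
 java(class_name(org.syslog_ng.hdfs.HdfsDestination)
 class_path("/opt/syslog-ng/lib/syslog-ng/java-modules/*.jar:/opt/hdfs-libs/lib/*.jar")
 option("hdfs_uri", "hdfs://namenode.example.com:8020")
 option("hdfs_file", "`path`"));
 };
 destination d_hdfs_1 { hdfs_file(path("/logs/stream-1.txt")); };
 destination d_hdfs_2 { hdfs_file(path("/logs/stream-2.txt")); };
 destination d_hdfs_3 { hdfs_file(path("/logs/stream-3.txt")); };
 # three log paths, one per destination
 log { source(s_tcp_1); destination(d_hdfs_1); flags(flow-control); };
 log { source(s_tcp_2); destination(d_hdfs_2); flags(flow-control); };
 log { source(s_tcp_3); destination(d_hdfs_3); flags(flow-control); };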

Scaling is not linear, but if the data node was the bottleneck, adding more data nodes and more syslog-ng HDFS destinations can increase throughput significantly. Of course, if the bottleneck is elsewhere, the speed stays the same or can even get worse.

In our next post, we will cover the performance of the Kafka destination and share some practical tips and tricks that can be useful if you are already using syslog-ng to deliver log data to these destinations. Stay tuned!

syslog-ng PE test configuration

HDFS
 
@version: 6.0
@module "mod-java"

options {
    keep_hostname(yes);
    keep_timestamp(no);
    stats_level(2);
    use_dns(no);
};

# TCP source receiving the test traffic sent by the Windows agents
source s_network_15c0fe6c0365441f9882b20d237f9114 {
    network(ip(0.0.0.0)
        log_fetch_limit(1000)
        log_iw_size(100000)
        max_connections(100)
        port(514));
};

# HDFS destination using the java() driver
destination d_java_eb43d63566364dfb9256007d0587efab {
    java(class_name(org.syslog_ng.hdfs.HdfsDestination)
        class_path('/opt/syslog-ng/lib/syslog-ng/java-modules/*.jar:/var/testdb_working_dir/dab1bb99-bfc1-4394-ac6a-507562967c9c/build/distributions/hdfs-libs/lib/*.jar')
        log_fifo_size(200000)
        option("hdfs_uri", "hdfs://hdp2.syslog-ng.balabit:8020")
        option("hdfs_file", "/var/testdb_working_dir/e0f5aa2a-19bc-4c45-b08f-631173b96031.txt"));
};

# flow-control throttles the source if the destination cannot keep up
log {
    source(s_network_15c0fe6c0365441f9882b20d237f9114);
    destination(d_java_eb43d63566364dfb9256007d0587efab);
    flags(flow-control);
};
