Analyzing Apache HTTPD logs in syslog-ng

Recently, I started my own blog, and as Google Analytics seems to miss a good portion of visitors, I wanted to analyze my web server logs myself. I use syslog-ng to read the Apache logs, process them, and store them in Elasticsearch. Along the way, I resolve client IP addresses using a Python parser, analyze the Agent field of the logs, and use GeoIP to locate users on the map.

From this blog, you can learn how I built my configuration. Note that once it was ready, I realized that my configuration is not GDPR compliant, so I also show you which parts to remove from the final configuration :-).

Before you begin

I run Apache HTTPD with default logging settings, so syslog-ng can parse its logs out of the box. On the destination side, I use Elasticsearch 7; however, version 8 should work too, just like OpenSearch by Amazon. On the syslog-ng side, I used the latest syslog-ng version available (version 3.36); however, version 3.23 or later would most likely also work. If only an older version is available in your environment, check the syslog-ng third-party package page for pointers to up-to-date packages: https://www.syslog-ng.com/products/open-source-log-management/3rd-party-binaries.aspx

You need http, GeoIP, Python and JSON support enabled in syslog-ng. If you install packages on Linux, most distributions keep these features in separate sub-packages. On FreeBSD, you have to compile syslog-ng from source, as GeoIP and Python support are not available in the pre-compiled package.
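You can check whether your syslog-ng binary has these features compiled in by listing the available modules. A rough example (the exact output and module names vary between versions and distributions):

~# syslog-ng -V | grep Available-Modules
Available-Modules: affile,...,geoip2-plugin,http,json-plugin,mod-python,...

If geoip2-plugin, http, json-plugin, or mod-python is missing from the list, install the corresponding sub-package or rebuild syslog-ng with those features enabled.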

Basic configuration

When I start to build a new configuration for syslog-ng, I always start with something (relatively) simple and add more settings later. You can find my initial configuration below. You can append it to syslog-ng.conf or place it as a separate .conf file under the /etc/syslog-ng/conf.d/ directory, if syslog-ng on your host is configured to use it. It reads the Apache log file, parses it, and stores it in a JSON-formatted file. Why? Because this way, you can see the values parsed from the incoming log messages, which helps you further extend your configuration.

~# cat apache_basic.conf
# source: apache access log file
source s_access {
  file("/var/log/httpd/access.log" flags(no-parse));
};
 
# parser for apache access log
parser p_access {
  apache-accesslog-parser(
    prefix("apache.")
  );
};
 
# destination: JSON format with same content as to Elasticsearch
destination d_json {
  file("/var/log/test.json"
    template("$(format-json --scope rfc5424 --scope nv-pairs --exclude DATE --key ISODATE)\n\n"));
};
 
# magic happens here: all building blocks connected together
log {
  # read my blog's access log
  source(s_access);
  # turn it into name-value pairs
  parser(p_access);
  # send logs to a JSON file mimicking Elasticsearch
  destination(d_json);
};

Let’s take a closer look at the various building blocks:

  • At the beginning, there is a file source reading the Apache access log file. As the log file is in the Apache Combined Log Format rather than syslog, we disable parsing here; otherwise, syslog-ng would try to interpret it as a syslog-formatted file.

  • The parser then creates name-value pairs from the various fields of the access log file. We change the prefix here because it starts with a dot by default, which can cause problems in Elasticsearch.

  • The logs are then saved to a JSON-formatted text file. Note that without JSON formatting, syslog-ng stores just the basic log message, without the parsed name-value pairs. Although storing the syslog headers is not strictly necessary in this case (as we are focusing on the Apache name-value pairs), I still tend to save them for consistency. As Kibana prefers the ISO date format, we exclude the original DATE macro and include the ISO-formatted ISODATE instead.

  • Finally, the log statement connects all of these building blocks together.
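Before activating it, you can ask syslog-ng to validate the configuration, and then reload the service. The service name may differ on your distribution:

~# syslog-ng --syntax-only
~# systemctl reload syslog-ng

The first command only parses the configuration and reports errors; the second makes the running instance re-read it.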

Once we reload syslog-ng and the configuration is live, the test.json file should have some logs in it. If the file does not exist, make sure that your web server gets some traffic! Here is an example log message:

{"apache":{"verb":"GET","timestamp":"28/Apr/2022:16:36:31 +0200","response":"200","request":"/","referrer":"-","rawrequest":"GET / HTTP/2.0","ident":"-","httpversion":"2.0","clientip":"1.2.3.4","bytes":"9075","auth":"-","agent":"Lynx/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"},"SOURCE":"s_access","PRIORITY":"notice","MESSAGE":"1.2.3.4 - - [28/Apr/2022:16:36:31 +0200] \"GET / HTTP/2.0\" 200 9075 \"-\" \"Lynx/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0\"","ISODATE":"2022-04-28T16:36:31+02:00","HOST_FROM":"www","HOST":"www","FILE_NAME":"/var/log/httpd/access.log","FACILITY":"user","1":"2.0","0":"HTTP/2.0"}

Now that we have verified that syslog-ng can read and parse the Apache logs, we can enhance the configuration further.

Full configuration

Parts of the following configuration will look familiar from the previous example. However, there are some new building blocks, and the log statement has become a lot longer.

# source: apache access log file
source s_access {
  file("/var/log/apache2/peter.czanik.hu-access.log" flags(no-parse));
};
 
# destination: JSON format with same content as to Elasticsearch
destination d_json {
  file("/var/log/test.json"
    template("$(format-json --scope rfc5424 --scope nv-pairs --exclude DATE --key ISODATE)\n\n"));
};
 
# parser for apache access log
parser p_access {
  apache-accesslog-parser(
    prefix("apache.")
  );
};
 
python {

"""
very simple syslog-ng Python parser example
resolves IP to hostname
value pair names are hard-coded
"""

import socket

class SngResolver(object):
    def parse(self, log_message):
        """
        Resolves IP to hostname
        """

        # syslog-ng passes the name-value pairs as bytes, decode to str
        ipaddr_b = log_message['apache.clientip']
        ipaddr = ipaddr_b.decode('utf-8')

        # try to resolve the IP address
        try:
            resolved = socket.gethostbyaddr(ipaddr)
            hostname = resolved[0]
            log_message['hostname.client'] = hostname
        except OSError:
            # resolution failed: keep the message, just without a host name
            pass

        # return True, otherwise the message is dropped
        return True

};

parser p_resolver {
    python(
        class("SngResolver")
    );
};

parser p_geoip2 {
    geoip2(
        "${apache.clientip}",
        prefix("geoip2.")
        database("/usr/share/GeoIP/GeoLite2-City.mmdb")
    );
};
 
rewrite r_geoip2 {
    set(
        "${geoip2.location.latitude},${geoip2.location.longitude}",
        value( "geoip2.location2" ),
        condition(not "${geoip2.location.latitude}" == "")
    );
};

destination d_elastic {
    elasticsearch-http(
        index("syslog-ng")
        type("")
        url("http://localhost:9200/_bulk")
        template("$(format-json --scope rfc5424 --scope dot-nv-pairs
        --rekey .* --shift 1 --scope nv-pairs
        --exclude DATE @timestamp=${ISODATE})")
    );
};

# magic happens here: all building blocks connected together
log {
  # read my blog's access log
  source(s_access);
  # turn it into name-value pairs
  parser(p_access);
  # resolve IPs to DNS names
  parser(p_resolver);
  # find geo-locations and make them digestible by Kibana
  parser(p_geoip2);
  rewrite(r_geoip2);
  # find rss readers
  if  (match("agent fetcher" value("apache.agent"))) { rewrite { set("RSS", value("category")); }; };
  if  (match("miniflux" value("apache.agent"))) { rewrite { set("RSS", value("category")); }; };
  if  (match("tt-rss" value("apache.agent"))) { rewrite { set("RSS", value("category")); }; };
  if  (match("simplepie" value("apache.agent"))) { rewrite { set("RSS", value("category")); }; };
  # find monitoring
  if  (match("alertalligator.com" value("apache.agent"))) { rewrite { set("monitor", value("category")); }; };
  if  (match("censys" value("apache.agent"))) { rewrite { set("monitor", value("category")); }; };
  # find search engines
  if  (match("Hatena" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("bingbot" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("Googlebot" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("Qwantify" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("duckduckgo" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("baidu" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  if  (match("yacybot" value("apache.agent"))) { rewrite { set("searchengine", value("category")); }; };
  # etc bot
  if  (match("ahrefs" value("apache.agent"))) { rewrite { set("bot", value("category")); }; };
  if  (match("linkfluence" value("apache.agent"))) { rewrite { set("bot", value("category")); }; };
  if  (match("anderspink" value("apache.agent"))) { rewrite { set("bot", value("category")); }; };
  if  (match("mj12bot" value("apache.agent"))) { rewrite { set("bot", value("category")); }; };
  if  (match("dotbot" value("apache.agent"))) { rewrite { set("bot", value("category")); }; };
  # send both to json file and to Elasticsearch
  # destination(d_json);
  destination(d_elastic);
};

This configuration is a lot longer due to the Python code and the long list of filters introduced in the log statement. In the following explanation, I skip the parts that are unchanged from the basic configuration.

  • The Python block includes code that resolves IP addresses to host names. The code is slow and blocking, but that is not a problem on my low-traffic web server. However, you should avoid using it if your server handles more than a few dozen requests a second. If you have longer code, you should store it in a separate file (see the sketch after this list), but if it is just a few lines, it is easier to include the script in the syslog-ng configuration itself. When enabled, it is really fun to see host names of Chinese telecommunication companies and universities, some of which even mention in the host name that they are doing network vulnerability scanning...

  • The previously mentioned Python block only stores the source code. The Python parser itself is the next block; it refers to the class name in the Python code.

  • In the initial configuration, we stored log messages in a JSON-formatted text file to see the name-value pairs. The GeoIP parser that we define here uses one of the name-value pairs seen in that JSON output: apache.clientip. Using the city database, the parser stores geographical information in name-value pairs with a “geoip2.” prefix.

  • The rewrite makes sure that the latitude/longitude information is stored in the format expected by Elasticsearch/Kibana: a single “latitude,longitude” value. This name-value pair is created only if the values are not empty.

  • The template for the Elasticsearch destination has a few extra elements compared to the earlier JSON file template. It makes sure that the leading dots are removed (they are not necessary here), and instead of inserting the ISODATE macro directly, it renames it to @timestamp, the field name expected by Kibana.
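As mentioned above, longer Python code is better kept in a separate file. A rough sketch of how that could look, assuming the resolver class is saved as resolver.py in a directory on syslog-ng's Python search path (the exact directory depends on your installation, so treat the file name and location as placeholders):

# resolver.py contains the SngResolver class shown earlier
parser p_resolver {
    python(
        class("resolver.SngResolver")
    );
};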

The log statement is a lot longer than it was in the initial configuration for two reasons. Firstly, there are a lot more building blocks to connect. Secondly, after all the parsing is done, but before the log messages are stored, there are many “if” statements. They form an imperfect list of various bots, monitoring services, and search engines, and help to categorize clients based on the content of the apache.agent name-value pair.

At first, I planned to use the in-list() filter for this, but that only allows full matches. However, with version numbers and other constantly changing information in user agent strings, full matches are not useful here. Another option could be the recently introduced regexp-parser(). It can handle multiple patterns, so with it, a single if statement per category would be enough in this configuration.
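As an illustration, one category could be handled like this with regexp-parser(). This is an untested sketch based on my reading of the documentation, not part of my running configuration:

# hypothetical: a single if statement covering all RSS readers
if {
  parser {
    regexp-parser(
      patterns("agent fetcher", "miniflux", "tt-rss", "simplepie")
      template("${apache.agent}")
    );
  };
  rewrite { set("RSS", value("category")); };
};

When none of the patterns match, the parser fails, and the message simply continues along the log path without a category.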

I used the file destination only at the beginning for debugging. It is commented out in the log statement above.

Once syslog-ng is reloaded, you should be able to see your incoming web server logs in Kibana.
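You can also verify on the Elasticsearch side by querying the document count of the index, assuming the index name from the configuration above and a local Elasticsearch instance:

~# curl 'http://localhost:9200/syslog-ng/_count?pretty'

If the count grows as your site receives visitors, the pipeline works end to end.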

Making the configuration GDPR compliant

I was really happy with the results of my syslog-ng configuration, but then I realized that it is not GDPR compliant, as it stores both the client IP address and the resolved host name. Storing just the domain name and stripping the rest of the data would most likely be OK. Still, to solve the problem, I simply removed the Python parser that resolves IP addresses to host names. Storing the IP address is not allowed either, which requires two additional --exclude options in the Elasticsearch destination: one for the MESSAGE macro, which also includes the client IP, and the other for apache.clientip.
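Based on the destination shown earlier, the GDPR-friendly version could look something like this; only the two extra --exclude options are new:

destination d_elastic {
    elasticsearch-http(
        index("syslog-ng")
        type("")
        url("http://localhost:9200/_bulk")
        template("$(format-json --scope rfc5424 --scope dot-nv-pairs
        --rekey .* --shift 1 --scope nv-pairs
        --exclude DATE --exclude MESSAGE --exclude apache.clientip
        @timestamp=${ISODATE})")
    );
};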

If we really wanted to be picky, we would notice that the Apache access log on disk also includes the client IP address. I am not that picky though, as logs are rotated regularly, and I only check Kibana anyway. This problem could most likely be worked around by forwarding logs through a pipe or socket to syslog-ng. This way, the IP address is never written to disk, but can still be used by the GeoIP parser to locate the user on the map.
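One possible sketch of this approach: Apache writes the access log to a named pipe, and syslog-ng reads it with a pipe() source. The path and the Apache directive are assumptions here, not a setup I have tested:

~# mkfifo /var/run/apache-access.pipe

# in the Apache configuration:
#   CustomLog /var/run/apache-access.pipe combined

# in the syslog-ng configuration:
source s_access {
  pipe("/var/run/apache-access.pipe" flags(no-parse));
};

Note that syslog-ng has to open the pipe for reading before Apache starts writing to it, otherwise Apache may block.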

What is next?

This was just a simple syslog-ng configuration I prepared to suit my own needs. You might want to do something completely different, such as sending an alert to Telegram when a specific user logs in, or when your website is accessed from China (typically bad news: a lot of traffic comes from there, and over 95% of it is searching for vulnerabilities). Elasticsearch is my favorite destination, but there are many more viable alternatives. The possibilities are endless.
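To give you an idea, the Telegram alert could be sketched along these lines. The telegram() destination is part of syslog-ng, but the bot-id and chat-id values are placeholders you get from Telegram, and the country check assumes the geoip2 name-value pairs from the configuration above:

destination d_telegram {
  telegram(
    bot-id("<your-bot-id>")
    chat-id("<your-chat-id>")
    template("Visit from China: ${apache.request}")
  );
};

log {
  source(s_access);
  parser(p_access);
  parser(p_geoip2);
  filter { "${geoip2.country.iso_code}" eq "CN" };
  destination(d_telegram);
};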

-

If you have questions or comments related to syslog-ng, do not hesitate to contact us. You can reach us by email or even chat with us. For a list of possibilities, check our GitHub page under the “Community” section at https://github.com/syslog-ng/syslog-ng. On Twitter, I am available as @PCzanik.
