normalizing Apache Access logs to JSON, XML and syslog

I like to make my mind up based on examples, especially for complex issues. For a discussion we had on the CEE editorial board, I’d like to have some real-world example of a log file with many empty fields. An easy to grasp, well understood and easy to parse example of such is the Apache access log. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I also had a few research samples at hand.

Apache common log format is structured data. So there is no point in running it through a free-text normalization engine like liblognorm. Of course it could process the data, but why use that complex technology. Instead, the decoder is now part of libee and receives a simple string describing which fields are present. It’s called like this:

$ ./convert  -exml -dapache -D “host identity user date request status size f1 useragent” < apache.org > apache.xml

Options specify encoder and decoder, and the string after -D tells the convert field names and order. But now let’s speak the input and output for itself:

The converter works by calling the decoder, which creates an in-memory representation of the log format in a CEE-like object model. Then, that object model and the called-for encoder is used to write the actual output format. That way, the conversion tool can convert between any structured log format, as long as the necessary encoders and decoders are available. This greatly facilitates log processing in heterogeneous environments.

Note that liblognorm works similar and, from libee’s point of view, can be viewed at as an decoder for unstructured text data (or, more precisely, for text data where the structure is well-hidden ;)).

log normalization – first results

At the beginning of this week I was pretty confident, that I would not make my self-set deadline of one month to implement a first rough proof of concept of liblognorm, a log normalizing library. Fortunately, I made extremely good progress the past two days and I am now happy to say that I have such a proof of concept available. All of this can be seen by pulling from Adiscon’s public git server: you need libestr, libee and liblognorm to make it work.

Right now, I’d like to provide a glimpse at how things work. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I got a couple of examples to play with (but I am still interested in more lag samples, especially from Cisco devices). Out of the many, I took a random messages.log file written by sysklogd. This is my input file and can be seen here.

To normalize events, liblognorm needs to know which fields are present at which positions of the input file. It learns this via so-called “samples”. Samples are very similar to the patterns used by virus scanners: like virus patterns describe how a specific virus looks, log samples describe how a specific log line looks. Other than virus patters, I have crafted a format hopefully easy (enough) to understand by sysadmins, so that everyone can add relevant samples himself. To support this, samples look relatively similar to actual log lines, and this is the reason I have termed them “log samples”. Like log files, samples are stored in simple text files. For the initial test, I used a a very small set of samples, available here. A production system will have many more samples, and I envision systems that have many (ten?-) thousand of samples loaded at the same time. If you look at the samples, take special care about entities enclosed in ‘%’ – these are field definitions, the rest is literal text.

The actual normalization is performed by the libraries engine, which parses log lines, based on the samples, into fields. This creates an in-memory representation of the event, which can than be processed by the driving application or be written to some other media or the network.

Liblognorm will come with a small toll called “the normalizer”. It is a minimal library user: it loads a sample database and reads log lines from standard input, creates the event in-memory representation and then writes this representation to standard output in a standardized format. So far, it supports formats as they are expected for the upcoming CEE standard.

The result of a normalizer run on my test input file based on the provided sample base can be seen here. The output is actually a bit more verbose than described above, because it lists the to-be-normalized line as well. If you look at the properties I extracted, you’ll probably notice that some do not make too much sense (maybe…). Also, a classification of the message is missing. Don’t care about these aspects right now: it’s a proof of concept and these things will be addressed by future development (the classification, for example, will be based on CEE taxonomy via tags).

I hope I was able to convey some of the power that is available with liblognorm. Of course, a “little bit” of more work and time will be required to get it production-ready. Unfortunately, I will be unavailable for larger parts of the next two weeks (other work now pressing plus a long-awaited seminar ;)), but I will try to get liblognorm as quickly as possible into the best shape possible. In the meantime, if you like, feel free to have a look at its code or play with it. All of what I wrote can actually be done with the versions available in git.

Call for Log Samples

My log normalization effort made good progress and I have a very rough first proof of concept available. It will take a log sample database, and transform input log files to a CEE-like output format.
Now I am looking at ways to practice-test it. So I’d appreciate if you could point me to some sources of log files. It mustn’t be terabytes, but they should be anonymized and be usable in the public Internet. For obvious reasons, it would be good if they are from widely deployed devices.
I would use a subset of these samples to extract usable sample database entries and see how the run through the normalizer.
Thanks,
Rainer

libee status update

Yesterday I reached some important milestones:
  • defined an internal simple event representation format “int” to create test cases
     http://doc.libee.org/classee__int.html
  • wrote a decoder “int” -> CEE
  • wrote a syslog encoder CEE->syslog
    (with enough escapes to be testable, but without all details)
  • wrote a small program (int2syslog) that ties these pieces togetherand can be used to visualize test cases in syslog representation
 All is available from the libee git at http://git.adiscon.com/?p=libee.git;a=summary

introducing libestr

I tried to avoid this, but it looks like I need to start a new project. I need a simple counted-byte string “class” for the CEE effort. All of that is already present in rsyslog, but using it’s runtime library is overkill. I could have copied the string code, but I really don’t like to have multiple copies of the same code around. So I’ll now create a new small library just for that purpose. Well, while I am at it, I’ll probably also add basic hashtable support, as this can help keep libee small for cases when no XML-specific functionality is desired. And I don’t want to start yet another library for that (we already have lib inflation… ;)).

I named it libestr, bare of a better name. I thought a while about the name, but could not find anything really decent. “estr” means “essentials for string handling” and is probably descriptive enough. Quite honestly, I really like to gain some speed in coding again instead of just creating skeletons and thinking about names…

logging and the C NUL problem

Again, I ran into the “C NUL Problem”, that is the way C strings are terminated. Unfortunately, the creators of C have represented strings as variable arrays of char without an explicitely-stated size. Instead of a size property, a C string is terminated by an US-ASCII NUL character (”). This works well enough in most cases, but has one serious drawback: the NUL character is a reserved character that cannot be part of any C string. So, for example strlen(“AB”) equals one and not three as one would usually expect.

CERT has a good presentation of some of the more important problems associated with the standard C string handling functions. I do not intend to reproduce this here or elaborate on any further details except that we get into big trouble if NUL characters are used e.g. in logging data sets. We had this problem in the IETF syslog WG, where we permited NUL to be part of the syslog message text, but permitted a receiver to escape it. This is far from being an ideal solution, but we considered it good enough, especially as it permits to keep compatible with existing toolset libraries.

Now, in CEE, we face the same challenge: the problem is if the in-memory representation of event fields should permit NUL characters. The correct technical answer to this question is “yes, of course”, but unfortunately it has a serious drawback that can affect adoption: if NULs are permited, none of the string handling functions of the C runtime library can be used. This is, as said above, because the C runtime library is not able to handle NULs inside “standard” C strings. A potential solution would be to escape NULs when they enter the system. However, that would require an additional reserved character, to do the escaping. So in any case, we’ll end up with a string that is different from what the “usual” runtime library routines expect.

Of course, this problem is not new, and many solutions already have been proved. The obvious solution is to replace the standard C string handling functions with octet-counting functions that do not require any reserved characters.

A short, non- weighted list of string replacement string libraries is:

Note that some of them try to mimic standard C strings as part of their API. I consider this highly dangerous, because it provides a false sense of security. While the library now can handle strings with included NUL characters (like “AB”), all parts of the string after the first NUL will be discarded if passed to a “regular” C runtime library string function (like printf!). So IMO this is a mis-feature. A replacement library must explicitely try to avoid compatibility to the C runtime library in order to safe the user from subtle issues, many of them resulting in security problems (think: information hiding).

Unfortunately, I could not identify any globally-accepted string replacement library that is in widespread use.. Despite its deficits, C programmers’  tend to use the plain old string functions present in the standard C runtime library.

So we are back to the original issue:

If CEE supports NUL characters inside strings, the C standard string library can not be used, and there are also problems with a potentially large number of other toolsets. This can lead to low acceptance rate.

But if CEE forbids NUL characters, data must be carefully asserted when it enters the system. Most importantly, a string value like “AB” must NOT be accepted when it is put in via an API. Experience tells that implementors sometimes simply overlook such restrictions. So this mode opens up a number of very subtle bug (security) issues.

I am very undicided which route is best. Obviously, a sound technical solution is what we want. However, the best technical solution is irrelevant if nobody actually uses it. In that light, the second best solution might be better. Comments, anyone?

libee – first peek preview available

I have just published a first preview of the libee library API. This work is obviously far from being finished, but I think the preview provides at least some idea of how it will materialize.

To tell a bit about the status in general: I have completed a first round of evaluation of CEE objects and concepts, based on the draft standard. Note that the draft standard is far from being finished, and will probably undergo serious changes. As such, I have not literally taken object definitions from there. Instead, I have mostly pulled what I am currently interested in and also have added some additions here and there, as I need them to do what I intend to do.

My primary goal is to get some proof of concept working first, then re-evaluate everything and probably do a rewrite of some parts. For the same reason, performance is currently not on my radar (but of course I always think about it ;)).

I would also like to note that the libee git is also available — and a good way to follow libee development for those interested. Comments are always appreciated and especially useful in this early phase.

Sagan and rsyslog [Guest Posting]

Hi everyone,

I am happy to feature a guest post written by Champ Clark III, the author of Sagan, a real time, “snort like” event and system log sniffing tool. Champ will explain a bit about it and how he made it work with rsyslog.  I think it is a very interesting project and I am glad it now has full rsyslog support.

But enough of my words, enjoy the real thing ;)

Rainer


I admit it, I’m a recent convert to rsyslog. I’ve known about rsyslog for years, but have only recently started using rsyslog in production environments. The primary reason for looking into rsyslog is users of Sagan are requesting support for it. I’m very glad they pushed me in that direction. I knew how popular rsyslog was,
but the ‘hassles’ of changing our core logging facilities seemed like a pain.

I can tell you, it was easy and seamless. Also, after reading Rainer Gerhards’ excellent, “rsyslog: going up from 40K messages per second to 250K“, I knew that I liked this project.

So I bit the bullet, and started working with Sagan and rsyslog. I haven’t looked back since.

I work in the network & computer security field. I’ve known for years the importance of log management. One thing that I had noticed was a lack of open source log & packet levelcorrelation engines. This is essentially what Sagan does. One common comparison of Sagan is Cisco’s MARS. Sagan reads in your logs and attempts to correlate the information with the intrusion detection/prevention system’s packet level information.

At Softwink, Inc, my place of employment, we monitor security events for various clients. At the
packet-level inspection for ‘bad events’ (security related), we use Snort. Snort ‘watches’ the network connectionsand sends out an ‘alert’ when it sees nefarious traffic. We configure Snort to send the ‘alert’ to a MySQL database for further analysis. We can then monitoring these Snort sensors for ‘bad events/attacks’ in real time.

However, we found that we were missing the ‘bigger picture’ without logs. This is where rsyslog and Sagan come into play. Essentially, we take all machines and equipment on a network and forward it to a
centralized server. Rsyslog is the ‘receiver’, and sometimes the sender of these log messages. In many cases, we find that centralized secure logging is a requirement for clients. With rsyslog, we
have the ability to store log information into a MySQL database for archive purposes. We can then give the client access to the log information via Loganalyzer for easy, simple retrieval.

How does Sagan fit into this picture? For security analysis, we only want key, typically security related, events from the logs. Manually searching databases for ‘security related’ events is prone to error. It is easy to ‘miss’ key events. Sagan is the ‘eyes on the logs’ watching for security related events in real time. First, Sagan has to have access to the logs coming into the network. This is very simple with Rsyslog:

# rsyslog.conf file.
#
# As rsyslog receives logs from remote systems, we put them into a format
# that Sagan can understand:
#

$template
sagan,"%fromhost-ip%|%syslogfacility-text%|%syslogpriority-text%|%syslogseverity-text%|%syslogtag%|%timegenerated:1:10:date-rfc3339%|%timegenerated:12:19:date-rfc3339%|%programname%|%msg%n"

# We now take the logs, in the above format, and send them to a 'named pipe'
# or FIFO.

*.* |/var/run/sagan.fifo;sagan

Sagan can now ‘read’ the logs as they come into rsyslog from the /var/run/sagan.fifo (named pipe/FIFO) in real time. rsyslog actually performs double duty for us; logging to our MySQLdatabase for archival purposes and handing Sagan log information for analysis.

Over all, there is nothing really new about this concept. However, Sagan does something a bit different than other log analysis engines. When Sagan sees a ‘bad event’, Sagan will log that to your Snort IDS/IPS MySQL/PostgreSQL database. What does this mean? Packet level security events and log events reside in
the same database for correlation. There are several advantages; for one, we can now have a single, unified console for log and IDS/IPS events! Second, we can now take advantage of Snort front-end
software to view log events. For example, if you use BASE or Snorby to view packet level IDS/IPS events, you can use the same software to view log level Sagan events. Maybe your shop uses report generation
tools that query the Snort database to show ‘bad packet events’ in your network. Guess what. You can use those same reporting tools for your log information as well. I’ve posted some example screen shots of Snort & Sagan working together here. The idea is that we take advantage of the Snort community’s work on consoles.

Correlation with Sagan and Snort, at the engine level, works several different ways. First, Sagan can in some cases pull network information directly from the log message and use that for correlation in the SQL database. For example, let’s say an attacker is probing your network and is attempting to get information on the SMTP port. The attacker sends your SMTP server ‘expn root’. Your IDS/IPS engine will ‘detect’ this traffic and
store it. It’ll record the source IP, destination IP, packet dump, time stamp, etc. Sagan will do the same at the log level. Sagan will ‘extract’ as much of the information from the log message for further correlation with the packet level.

Recently, Rainer announced liblognorm (early liblognorm website). This is an exciting project. The idea is to “normalize” log information to a nice, standard usable format. I plan on putting as much support and effort as I can into this project, because it’s an important step. For Sagan, it means we will be able to better
correlate information. In my time to ponder about it since its recent announcement, I can see liblognorm being extremely useful for many different projects.

Sagan also shares another feature with Snort; it uses the same rule sets. Sagan rules sets are very much ‘Snort like’. Here is an example rule (this is a single line, broken just for readability):

alert tcp $EXTERNAL_NET any -> $HOME_NET 22 (msg:"[OPENSSH] Invalid or illegal user";
pcre: "/invalid user|illegal user/i"; classtype: attempted-user;
program: sshd; parse_ip_simple; parse_port_simple; threshold:type limit,
track by_src, count 5, seconds 300; reference:
url,wiki.softwink.com/bin/view/Main/5000022; sid:5000022; rev:4;)

If you’re already a Snort user, then the format and syntax should be very simple to understand. We use ‘pcre’ (regular expressions) to ‘look’ for a message from the program ‘sshd’ that contains the term ‘invalid user’ or ‘illegal user’ (case insensitive). We set the classifications, just as Snort does (for further correlation). We can ‘threshold’ the rule, so we don’t get flooded with events.

Sagan uses this format for a variety of reasons. For one, its a well know format in the security field. Second, we can now take advantage of Snort rule maintenance software! For example ‘oinkmaster’ or ‘pulled pork’. The idea is that with Sagan, you don’t need to ‘re-tool’ your network in order for it to work.

Using Sagan with your Snort based IDS/IPS system is just a feature of Sagan. Sagan can operate independently from Snort databases, and offers the normal bells/whistlers you’d expect in a SEIM (e-mailing alerts, etc).

To tie all this together, it means we can simply monitor packet level threats and log level events from a unified console. We can monitor just about everything in a network from the log level standpoint. We can monitor Cisco gear, Fortigate firewalls, Linux/*nix servers, wireless access points, etc.

Sagan is a relatively new project and still under development. Like rsyslog, Sagan is built from the ground up with performance in mind. Sagan is multi-threaded and written in C with the thought that it should be as efficient with memory and the CPU(s) as possible. Rsyslog seems to follow the same philosophy, yet another reason I made the switch.

The more information you know about a threat to your network/system, the better off you’ll be. That is what the mix of rsyslog and Sagan offers. Throw in IDS/IPS (Snort) monitoring, and you can get a complete view about ‘bad things’ happening in your environment.

For more information about Sagan, please see http://sagan.softwink.com.

CEE library will be named libee

After some discussions, we have finally decided to name the CEE part of the original liblognorm project “libee“. Note the missing “c” ;) We originally thought that libcee would be a smart name for a library implementing an API on top of the upcoming CEE standard. However, there were some complexities associated with that because “libcee” sounds much like *the* official library. While I hope I can finish a decent implementation of the full CEE standard, I think it is a bit bold to try to write the standard lib when the standard is not even finished.

So the compromise is to drop the “c”. The “ee” in libee means “Event Expression”, much as CEE means “Common Event Expression”. If CEE evolves as we all hope and if I manage to actually create a decent implementation of it, we may reconsider calling the library libcee. But that’s not important now and definitely something that will involve a lot of people.

In the mean time, I wanted to make sure you all understand what I mean when I talk about “libee”. I hope to populate its public git with at least a skeleton soon (liblognorm reaches a point where I need some CEE definitions, so this becomes more important).