log normalization - first results

At the beginning of this week I was pretty confident that I would not meet my self-set deadline of one month to implement a first rough proof of concept of liblognorm, a log normalizing library. Fortunately, I made extremely good progress over the past two days, and I am now happy to say that I have such a proof of concept available. All of this can be seen by pulling from Adiscon’s public git server: you need libestr, libee and liblognorm to make it work.

Right now, I’d like to provide a glimpse at how things work. Thanks to Anton Chuvakin and his Public Security Log Sharing Site, I got a couple of examples to play with (but I am still interested in more log samples, especially from Cisco devices). Out of the many, I picked a random messages.log file written by sysklogd. This is my input file and can be seen here.

To normalize events, liblognorm needs to know which fields are present at which positions of the input file. It learns this via so-called “samples”. Samples are very similar to the patterns used by virus scanners: just as virus patterns describe what a specific virus looks like, log samples describe what a specific log line looks like. Unlike virus patterns, samples use a format that I hope is easy enough for sysadmins to understand, so that everyone can add relevant samples themselves. To support this, samples look relatively similar to actual log lines, and this is the reason I have termed them “log samples”. Like log files, samples are stored in simple text files. For the initial test, I used a very small set of samples, available here. A production system will have many more samples, and I envision systems that have many thousands (perhaps tens of thousands) of samples loaded at the same time. If you look at the samples, pay special attention to the entities enclosed in ‘%’: these are field definitions; the rest is literal text.
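
To make the split between field definitions and literal text concrete, here is a minimal sketch in Python. It assumes a simplified, hypothetical sample syntax in which a field is just a name between two ‘%’ characters; the real sample files may carry more information per field, and the name split_sample is made up for this illustration.

```python
import re

# Assumption: a field definition is simply %name%; everything else is literal text.
# This is a simplified stand-in, not liblognorm's actual sample syntax.
FIELD = re.compile(r"%([^%]+)%")


def split_sample(sample):
    """Return the sample as a list of ("literal", text) and ("field", name) parts."""
    parts = []
    pos = 0
    for match in FIELD.finditer(sample):
        if match.start() > pos:
            # Text between the previous field and this one is literal.
            parts.append(("literal", sample[pos:match.start()]))
        parts.append(("field", match.group(1)))
        pos = match.end()
    if pos < len(sample):
        # Trailing literal text after the last field.
        parts.append(("literal", sample[pos:]))
    return parts
```

Given a hypothetical sample such as "Accepted password for %user% from %ip%", this yields two literal parts and the two fields user and ip, in the order in which they appear in the line.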

The actual normalization is performed by the library’s engine, which parses log lines into fields based on the samples. This creates an in-memory representation of the event, which can then be processed by the driving application or written to some other medium or to the network.
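
The following Python sketch illustrates the idea of parsing a log line into fields based on samples. It is not how the library’s engine is implemented: it assumes the simplified, hypothetical %name% field syntax, treats every field as a run of non-whitespace characters, and simply translates each sample into a regular expression. The names compile_sample and normalize are invented for this example, and the returned dictionary stands in for the in-memory event representation.

```python
import re

# Assumption: fields look like %name%, where name is a valid identifier.
FIELD = re.compile(r"%(\w+)%")


def compile_sample(sample):
    """Translate one sample into a regular expression with named groups."""
    pattern = ""
    pos = 0
    for match in FIELD.finditer(sample):
        # Literal text must match exactly, so escape it.
        pattern += re.escape(sample[pos:match.start()])
        # Assumption: a field value is a run of non-whitespace characters.
        pattern += "(?P<%s>\\S+)" % match.group(1)
        pos = match.end()
    pattern += re.escape(sample[pos:])
    return re.compile(pattern)


def normalize(line, compiled_samples):
    """Return the fields of the first sample matching the whole line, or None."""
    for regex in compiled_samples:
        match = regex.fullmatch(line)
        if match:
            return match.groupdict()
    return None
```

With the hypothetical sample "Accepted password for %user% from %ip% port %port%", the line "Accepted password for alice from 10.0.0.1 port 22" is turned into a dictionary with the keys user, ip and port. Trying every sample in turn is fine for a handful of samples, but would obviously not scale to the thousands of samples mentioned above.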

Liblognorm will come with a small tool called “the normalizer”. It is a minimal library user: it loads a sample database, reads log lines from standard input, creates the in-memory event representation and then writes this representation to standard output in a standardized format. So far, it supports the formats expected for the upcoming CEE standard.
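
The overall flow of such a tool (load samples, read lines, emit one structured record per line) can be sketched in Python as follows. This is an illustration only and not the normalizer’s code: the function names are hypothetical, the sample syntax is the simplified %name% form, unmatched lines are silently skipped, and JSON is used purely as a stand-in output format rather than any CEE format.

```python
import json
import re

# Assumption: fields look like %name%; this is a simplified stand-in syntax.
FIELD = re.compile(r"%(\w+)%")


def load_samples(text):
    """Compile one sample per non-empty line of a sample database text."""
    compiled = []
    for sample in text.splitlines():
        if not sample.strip():
            continue
        # Splitting on a pattern with one group alternates literal text
        # (even indexes) and field names (odd indexes).
        parts = FIELD.split(sample)
        pattern = "".join(
            re.escape(part) if index % 2 == 0 else "(?P<%s>\\S+)" % part
            for index, part in enumerate(parts)
        )
        compiled.append(re.compile(pattern))
    return compiled


def normalize_lines(samples, lines):
    """Yield one JSON record per input line that matches a sample."""
    for line in lines:
        line = line.rstrip("\n")
        for regex in samples:
            match = regex.fullmatch(line)
            if match:
                # JSON is only a placeholder for a standardized output format.
                yield json.dumps(match.groupdict(), sort_keys=True)
                break
```

A driver would read the sample database from a file, pass sys.stdin as the lines argument and print each yielded record to standard output.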

The result of a normalizer run on my test input file, based on the provided sample base, can be seen here. The output is actually a bit more verbose than described above, because it lists the to-be-normalized line as well. If you look at the properties I extracted, you’ll probably notice that some do not make much sense (maybe…). Also, a classification of the message is missing. Don’t worry about these aspects right now: it’s a proof of concept, and these things will be addressed by future development (the classification, for example, will be based on the CEE taxonomy via tags).

I hope I was able to convey some of the power that is available with liblognorm. Of course, a “little bit” more work and time will be required to get it production-ready. Unfortunately, I will be unavailable for large parts of the next two weeks (other work is now pressing, plus a long-awaited seminar ;)), but I will try to get liblognorm into the best shape possible as quickly as possible. In the meantime, if you like, feel free to have a look at its code or play with it. Everything I described can actually be done with the versions available in git.