message classification with liblognorm sample code

I have just enhanced liblognorm’s normalizer tool to support the -t option. If it is given, only messages with the specified tag are output. Currently, only a single tag can be specified. The main purpose of this change is to provide example code showing how to use the message classification API, so that other developers can integrate it into their solutions more easily.

In essence, the whole logic is contained in normalizer.c, lines 122 and 123. The application needs to keep the “wanted” tag inside an es_str_t. Then, it needs to call the ee_EventHasTag() API to find out whether the normalizer (or, more precisely, its rule) associated the tag with a given message. That’s it…
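The filtering idea can be sketched independently of the C API. The following Python snippet is an illustration only: the event dictionaries, the "tags" key and the filter_by_tag helper are hypothetical and do not mirror libee's data structures. It merely shows the "remember the wanted tag, test each event for it" logic that the -t option implements.

```python
def filter_by_tag(events, wanted_tag):
    """Return only those events that carry wanted_tag (hypothetical helper)."""
    return [ev for ev in events if wanted_tag in ev.get("tags", ())]


# Hypothetical normalized events; real libee events are C structures.
events = [
    {"tags": {"ssh", "login", "fail"}, "user": "guest"},
    {"tags": {"firewall", "denied"}, "src-ip": "192.0.2.7"},
    {"tags": set(), "msg": "untagged"},
]

# Print the user of every event tagged "ssh".
for ev in filter_by_tag(events, "ssh"):
    print(ev["user"])
```

Events without any tags are simply never selected, which matches the behaviour described above for messages the rule did not classify.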

Please note that we may implement a more powerful API in the future — if this makes sense. If you think API additions would be useful, please suggest them together with a description of the benefits.

log classification with liblognorm

Today, I have added support for so-called “tags” to liblognorm (and its base library libee). This new capability permits very easy classification of syslog messages and log records in general. So you can not only extract data from your various log sources, you can also classify events, for example, as being a “login”, a “logout” or a firewall “denied access”. This makes it very easy to look at specific subsets of messages and process them in ways specific to the information being conveyed.

To see how it works, let’s first define what a tag is: A tag is a simple alphanumeric string that identifies a specific type of object, action, status, etc. For example, we can have object tags for firewalls and servers. For simplicity, let’s call them “firewall” and “server”. Then, we can have action tags like “login”, “logout” and “connectionOpen”. Status tags could include “success” or “fail”, among others. The idea of tags is based on early CEE concepts. I will try to stay consistent with whatever direction CEE takes. Tags form a flat space: there is no inherent relationship between them (but this may be added later on top of the current implementation). Think of tags like the tag cloud in a blogging system. Tags can be defined for any reason and need (though obviously we must strive to get to a standard set, something I hope CEE will provide in the not too distant future). A single event can be associated with as many tags as required.

Assigning tags to messages is simple. A rule contains both the sample of the message (including the extracted fields) as well as -now- the tags. Have a look at this sample, taken from liblognorm 0.2.0:

rule=:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Here, we have a rule that shows an invalid ssh login request. The various fields are used to extract information into a well-defined structure. Have you ever wondered why every rule starts with a colon? Now, here is the answer: the colon separates the tag part from the actual sample part. Starting with liblognorm 0.3.0, you can create a rule like this:

rule=ssh,user,login,fail:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Note the “ssh,user,login,fail” part in front of the colon. These are the four tags the user has decided to assign to this event. What now happens is that the normalizer does not only extract the information from the message if it finds a match, but it also adds the tags as metadata. Once normalization is done, one can not only query the individual fields, but also query if a specific tag is associated with this event. For example, to find all ssh-related events (provided the rules are built that way), you can normalize a large log and select only that subset of the normalized log that contains the tag “ssh”.

Note that liblognorm 0.2.0 simply ignores the tag part, so old versions of the library are capable of working with new rule bases.
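To make the role of the colon concrete, here is a small Python sketch that splits a rule line into its tag part and its sample part. It is not liblognorm's parser: the split_rule helper is hypothetical, and it assumes that tags never contain a colon (which holds for the alphanumeric tags described above).

```python
def split_rule(line):
    """Split a 'rule=' line into (tags, sample) at the first colon."""
    prefix = "rule="
    if not line.startswith(prefix):
        raise ValueError("not a rule line")
    # Everything before the first colon is the (possibly empty) tag part.
    tag_part, sep, sample = line[len(prefix):].partition(":")
    if not sep:
        raise ValueError("rule has no colon separating tags from sample")
    tags = [t.strip() for t in tag_part.split(",") if t.strip()]
    return tags, sample


old_rule = "rule=:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%"
new_rule = ("rule=ssh,user,login,fail:"
            "sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%")

for rule in (old_rule, new_rule):
    tags, sample = split_rule(rule)
    print(tags, "|", sample)
```

Both rule lines yield the same sample part; they differ only in the tag list, which is empty for the 0.2.0-style rule.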

This is pretty cool and has ample potential. Just think about creating firewall reports: if you have different firewalls, you only need to have different rule bases to normalize these events all into the same format. Even better, you can now process the logs based on the classification assigned during the normalization process. For example, a “failed connection request” report may ignore everything that is not tagged as “connection, fail”.

That probably sounds pretty good to you, but how do you actually use it? Right now, the core functionality is available inside the libraries (more precisely in the git version; I will do an official release very soon but wanted to spread the word). That means developers have the necessary API to integrate with their programs. End-user tools do not yet exist (which is not too surprising for a library). Integration of the new functionality is very easy. Classification is available without the need to change anything in existing applications. A single new simple API, ee_EventHasTag(), has been added, which needs to be called to see if an event is associated with the given tag. [side-note: the current API is NOT guaranteed to be stable, even though I try not to break things without need]

I hope that developers will play with the new functionality, so that it will be available in end-user tools soon as well. I myself plan to enhance the normalizer tool very soon to support selecting subsets based on tags (this can also serve as an example for other developers). Also, I plan to add classification support to rsyslog very, very soon. So stay tuned to what’s coming up, it’s exciting ;)

log normalization: how to share rulebases?

Rulebases play a crucial role in log normalization. While the log normalizer itself needs to be of high quality and speed, it is the rulebase that really helps to detect which message the one in question is. I myself have so far concentrated on the code and not created any larger rulebase. Champ Clark III has created many more for his use inside Sagan. But this means everything is in its infancy. What we really need is community involvement to create a large number of easy-to-access rulebases for almost all devices.

This brings up the question of how to manage and share such a repository. One method may be to place it on a web site, together with some submission tool. An alternative approach would be to put everything into a public git repository. This latter approach has some beauty, because git is universally available and well known. Even if a user does not know git, only a minimal set of commands is required to pull the rulebase. So maybe this is the way to go?

I would be very interested in suggestions on how we should manage rulebases and spread the word. What do we need to support a great community? Whom can we talk to? If you have any ideas, concerns, questions or even an idle rant, be sure to let me know. Ideally, send mail to the lognorm mailing list, so we can broadcast this to other interested folks.

New Mailing List for Log Normalization

Thankfully, the interest in log normalization and the related libraries liblognorm and libee has increased. Up until now, I have handled discussions on these topics via the rsyslog mailing list. As conversations increase, this may become an unnecessary burden for those only interested in rsyslog. So I have created a new mailing list named lognorm. I used this somewhat generic name, as I intend to use it for both libraries. This saves me some overhead, and I strongly assume that anyone interested in liblognorm will also be interested in libee (but to a lesser extent in the reverse direction).

Please subscribe to the new list. This is currently a very exciting phase in log normalization development, so getting involved is a great way to shape things in the way you need them!

log normalization with rsyslog

I just wanted to give you a quick heads-up on my current development efforts: I have begun to work heavily on a message modification module for rsyslog which will support liblognorm-style normalization inside rsyslog. In git there already is a branch “lognorm”, which I will hopefully complete and merge into master soon. It provides some very interesting shortcuts for pulling specific information out of syslog messages. I’ll probably promote it some more when it is available. IMHO it’s the coolest and potentially most valuable feature I have added in the past three years. Once I have enabled tags in liblognorm/libee, you can even very easily classify log messages based on their content.

log normalization – first results

At the beginning of this week I was pretty confident that I would not make my self-set deadline of one month to implement a first rough proof of concept of liblognorm, a log normalizing library. Fortunately, I made extremely good progress over the past two days and I am now happy to say that I have such a proof of concept available. All of this can be seen by pulling from Adiscon’s public git server: you need libestr, libee and liblognorm to make it work.

Right now, I’d like to provide a glimpse at how things work. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I got a couple of examples to play with (but I am still interested in more log samples, especially from Cisco devices). Out of the many, I took a random messages.log file written by sysklogd. This is my input file and can be seen here.

To normalize events, liblognorm needs to know which fields are present at which positions of the input file. It learns this via so-called “samples”. Samples are very similar to the patterns used by virus scanners: just as virus patterns describe how a specific virus looks, log samples describe how a specific log line looks. Unlike virus patterns, I have crafted a format that is hopefully easy (enough) for sysadmins to understand, so that everyone can add relevant samples themselves. To support this, samples look relatively similar to actual log lines, and this is the reason I have termed them “log samples”. Like log files, samples are stored in simple text files. For the initial test, I used a very small set of samples, available here. A production system will have many more samples, and I envision systems that have many thousands (tens of thousands?) of samples loaded at the same time. If you look at the samples, pay special attention to entities enclosed in ‘%’: these are field definitions, the rest is literal text.
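The split between literal text and ‘%’-enclosed field definitions can be illustrated with a few lines of Python. This is a regex-based stand-in, not the algorithm liblognorm uses internally. The helper names and the patterns chosen for the field types “number”, “word” and “ipv4” are assumptions made for this sketch, and the real field parsers may accept different input. The sketch also searches anywhere in the line so that a leading syslog header is skipped, which is another simplification.

```python
import re

# Assumed patterns for three field types; liblognorm's parsers may differ.
FIELD_PATTERNS = {
    "number": r"\d+",
    "word": r"\S+",
    "ipv4": r"\d{1,3}(?:\.\d{1,3}){3}",
}

# A field definition looks like %name:type%.
FIELD_RE = re.compile(r"%([^:%]+):([^:%]+)%")


def sample_to_regex(sample):
    """Translate a log sample into a compiled regex plus the field names."""
    parts, names, pos = [], [], 0
    for m in FIELD_RE.finditer(sample):
        parts.append(re.escape(sample[pos:m.start()]))  # literal text
        names.append(m.group(1))
        parts.append("(" + FIELD_PATTERNS[m.group(2)] + ")")
        pos = m.end()
    parts.append(re.escape(sample[pos:]))  # trailing literal text
    return re.compile("".join(parts)), names


def normalize(sample, line):
    """Return a dict of extracted fields, or None if the line does not match."""
    regex, names = sample_to_regex(sample)
    m = regex.search(line)
    if m is None:
        return None
    return dict(zip(names, m.groups()))


sample = "sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%"
line = "Oct  1 12:00:00 host sshd[4242]: Invalid user guest from 192.0.2.7"
print(normalize(sample, line))
```

Field names such as src-ip are kept in a separate list because they are not valid regex group names. A regex per sample would not scale to thousands of samples, so this is only meant to show how a sample is read, not how matching should be implemented.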

The actual normalization is performed by the library’s engine, which parses log lines, based on the samples, into fields. This creates an in-memory representation of the event, which can then be processed by the driving application or be written to some other media or the network.

Liblognorm will come with a small tool called “the normalizer”. It is a minimal library user: it loads a sample database, reads log lines from standard input, creates the in-memory representation of each event and then writes this representation to standard output in a standardized format. So far, it supports formats as they are expected for the upcoming CEE standard.

The result of a normalizer run on my test input file based on the provided sample base can be seen here. The output is actually a bit more verbose than described above, because it lists the to-be-normalized line as well. If you look at the properties I extracted, you’ll probably notice that some do not make too much sense (maybe…). Also, a classification of the message is missing. Don’t worry about these aspects right now: it’s a proof of concept and these things will be addressed by future development (the classification, for example, will be based on CEE taxonomy via tags).

I hope I was able to convey some of the power that is available with liblognorm. Of course, a “little bit” more work and time will be required to get it production-ready. Unfortunately, I will be unavailable for large parts of the next two weeks (other work now pressing plus a long-awaited seminar ;)), but I will try to get liblognorm into the best shape possible as quickly as I can. In the meantime, if you like, feel free to have a look at its code or play with it. All of what I wrote can actually be done with the versions available in git.

Call for Log Samples

My log normalization effort has made good progress and I have a very rough first proof of concept available. It takes a log sample database and transforms input log files into a CEE-like output format.
Now I am looking for ways to test it in practice. So I’d appreciate it if you could point me to some sources of log files. They needn’t be terabytes in size, but they should be anonymized and usable on the public Internet. For obvious reasons, it would be good if they came from widely deployed devices.
I would use a subset of these logs to extract usable sample database entries and see how they run through the normalizer.
Thanks,
Rainer

liblognorm site comes online

While there is not yet much content, the liblognorm site has been put online today. Over time, this will become an important place to both learn about liblognorm AND share log samples. It will most probably also contain the area that you can use to download new log samples (much like you download virus patterns for a scanner). But for now, I just wanted to share the good news.

liblognorm will use passive Unicode mode (UTF-8)

I thought for a while about how to support Unicode in liblognorm. The final decision is to use passive mode, which is a very popular option under Linux. A core driver behind this decision is the ability to save lots of space (and thus also cache space and, in turn, processing time), as the majority of log content is written in US-ASCII. This is even the case in Asian countries, where large parts of the log message are usually ASCII but contain a few select fields in the local language (like names). Even if the message itself is in the local language, it contains a lot of punctuation and numbers, so I think the overall result will not use notably more space than a UTF-16 implementation. I18N-wise, it must also be noted that a fixed 16-bit representation covers only a small (but important) subset of full Unicode, so using UTF-8 gives us the ability to encode full 32-bit UCS-4 characters should there be need to do so.
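The space argument is easy to check for any given log line. The short Python sketch below compares the encoded size of two made-up example lines (they are not taken from real logs) in UTF-8 and in UTF-16 without a byte-order mark. The encoded_sizes helper is just a name chosen for this illustration.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Made-up example lines: one pure ASCII, one with a name in local script.
ascii_line = "sshd[4242]: Invalid user guest from 192.0.2.7"
mixed_line = "login ok user=山田太郎 src=192.0.2.7"

for line in (ascii_line, mixed_line):
    utf8, utf16 = encoded_sizes(line)
    print(f"{utf8:3d} bytes UTF-8, {utf16:3d} bytes UTF-16: {line}")
```

For pure ASCII content, UTF-8 needs exactly half the space of UTF-16. In the mixed line, the four CJK characters cost three bytes each in UTF-8 instead of two, but the surrounding ASCII text still keeps the UTF-8 form smaller overall.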

The same decision will apply to the CEE library (whatever it will be named). This is also nicely in line with libxml2, which I intend to use for XML parsing.