Quick update on log normalization

I just wanted to give a heads up on the status of log normalization. We have just released updated versions of libee and liblognorm. These provide important new features, like the capability to annotate events based on classification. This, among others, brings the libraries more back inline with recent CEE developments. Also, the log normalizer tool is nearly ready for prime time. The “only” thing that is missing is a decent set of rule bases. Thankfully, sagan already has a couple. I guess besides programming obtaining rule bases is a major thing to target.

As soon as I find time (I hope soon!), I’ll finalize some lose ends on the software side and get doc online on how to use the normalizer tool. I think with that a great tool for everyday use in log analysis will become available. Feedback and collaboration is always appreciated!

Paper on LogNormalization

I wanted to make all of your aware that I have posted a paper on log normalization . This was originally done in regard to CEE, but I noticed that the classification of different log sources and the way to handle them is of broader use. I hope you find the paper useful.

log classification with liblognorm

Today, I have added support for so-called “tags” to liblognorm (and it’s base library libee). This new capabilities permits very easy classification of syslog message and log records in general. So you can not only extract data from your various log source, you can also classify events, for example, as being a “login”, a “logout” or a firewall “denied access”. This makes it very easy to look at specific subsets of messages and process them in ways specific to the information being conveyed.

To see how it works, let’s first define what a tag is: A tag is a simple alphanumeric string that identifies a specific type of object, action, status, etc. For example, we can have object tags for firewalls and servers. For simplicity, let’s call them “firewall” and “server”. Then, we can have action tags like “login”, “logout” and “connectionOpen”. Status tags could include “success” or “fail”, among others. The idea of tags is based on early CEE concepts. I will try to keep consistent with whatever CEE heads to. Tags form a flat space, there is no inherent relationship between then (but this may be added later on top of the current implementation). Think of tags like the tag cloud in a blogging system. Tags can be defined for any reason and need (though obviously we must strive to get to a standard set, something I hope CEE will provide in the not too distant future). A single event can be associated with as many tags as required.

Assigning tags to messages is simple. A rule contains both the sample of the message (including the extracted fields) as well as -now- the tags. Have a look at this sample, taken from liblognorm 0.2.0:

rule=:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Here, we have a rule that shows an invalid ssh login request. The various field are used to extract information into a well-defined structure. Have you ever wondered why every rule starts with a colon? Now, here is the answer: the colon separates the tag part from the actual sample part. Starting with liblognorm 0.3.0, you can create a rule like this:

rule=ssh,user,login,fail:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Note the “ssh,user,login,fail” part in front of the colon. These are the four tags the user has decided to assign to this event. What now happens is that the normalizer does not only extract the information from the message if it finds a match, but it also adds the tags as metadata. Once normalization is done, one can not only query the individual fields, but also query if a specific tag is associated with this event. For example, to find all ssh-related events (provided the rules are built that way), you can normalize a large log and select only that subset of the normalized log that contains the tag “ssh”.

Note that versions of liblognorm 0.2.0 simply ignore the tag part, so old versions of the library are capable of working with new rule bases.

This is pretty cool and has ample potential. Just think about creating firewall reports: if you have different firewalls, you only need to have different rule bases to normalize these events all into the same format. Even more now, you can process the logs based on the classification assigned during the normalization process. For example, a “failed connection request” report may ignore everything that is not tagged as “connection, fail”.

That probably sounds pretty good to you, but how to actually use it? Right now, the core functionality is available inside the libraries (more precisely in the git version, I will do an official release very soon but wanted to spread word). That means developers have the necessary API to integrate with their programs. End user tools do not yet exist (what is not too surprisingly for a library). Integration of the new functionality is very easy. Classification is available without need to change anything in existing applications. A single new simple API ee_EventHasTag() has been added, which needs to be called to see if an event is associated with the given tag. [side-note: the current API is NOT guaranteed to be stable, even though I try not to break things without need]

In hope that developers will play with the new functionality, so that it will be available in end-user tools soon as well. I myself plan to enhance the normalizer tool very soon to support selecting subsets based on tags (this can also serve as an example for other developers). Also, I plan to add classification support to rsyslog very, very soon. So stay tuned to what’s coming up — it’s exciting ;)

New Mailing List for Log Normalization

Thankfully, the interest in log normalization and the related libraries liblognorm and libee has increased. Up until now, I have handled discussions on this topics via the rsyslog mailing list. As conversations increase, this may be come an unnecessary burden for those only interested in rsyslog. So I have created a new mailing list named lognorm. I used this somewhat generic name, as I intend to use it for both libraries. This saves me some overhead, and I strongly assume that anyone interested in liblognorm will also be interested in libee (but to a lesser extent in the reverse direction).

Please subscribe to the new lists. Currently, it is a very exciting phase in log normalization development, so getting involved is a great way to shape things in the way you need it!

normalizing Apache Access logs to JSON, XML and syslog

I like to make my mind up based on examples, especially for complex issues. For a discussion we had on the CEE editorial board, I’d like to have some real-world example of a log file with many empty fields. An easy to grasp, well understood and easy to parse example of such is the Apache access log. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I also had a few research samples at hand.

Apache common log format is structured data. So there is no point in running it through a free-text normalization engine like liblognorm. Of course it could process the data, but why use that complex technology. Instead, the decoder is now part of libee and receives a simple string describing which fields are present. It’s called like this:

$ ./convert  -exml -dapache -D “host identity user date request status size f1 useragent” < apache.org > apache.xml

Options specify encoder and decoder, and the string after -D tells the convert field names and order. But now let’s speak the input and output for itself:

The converter works by calling the decoder, which creates an in-memory representation of the log format in a CEE-like object model. Then, that object model and the called-for encoder is used to write the actual output format. That way, the conversion tool can convert between any structured log format, as long as the necessary encoders and decoders are available. This greatly facilitates log processing in heterogeneous environments.

Note that liblognorm works similar and, from libee’s point of view, can be viewed at as an decoder for unstructured text data (or, more precisely, for text data where the structure is well-hidden ;)).

libee status update

Yesterday I reached some important milestones:
  • defined an internal simple event representation format “int” to create test cases
     http://doc.libee.org/classee__int.html
  • wrote a decoder “int” -> CEE
  • wrote a syslog encoder CEE->syslog
    (with enough escapes to be testable, but without all details)
  • wrote a small program (int2syslog) that ties these pieces togetherand can be used to visualize test cases in syslog representation
 All is available from the libee git at http://git.adiscon.com/?p=libee.git;a=summary

introducing libestr

I tried to avoid this, but it looks like I need to start a new project. I need a simple counted-byte string “class” for the CEE effort. All of that is already present in rsyslog, but using it’s runtime library is overkill. I could have copied the string code, but I really don’t like to have multiple copies of the same code around. So I’ll now create a new small library just for that purpose. Well, while I am at it, I’ll probably also add basic hashtable support, as this can help keep libee small for cases when no XML-specific functionality is desired. And I don’t want to start yet another library for that (we already have lib inflation… ;)).

I named it libestr, bare of a better name. I thought a while about the name, but could not find anything really decent. “estr” means “essentials for string handling” and is probably descriptive enough. Quite honestly, I really like to gain some speed in coding again instead of just creating skeletons and thinking about names…

logging and the C NUL problem

Again, I ran into the “C NUL Problem”, that is the way C strings are terminated. Unfortunately, the creators of C have represented strings as variable arrays of char without an explicitely-stated size. Instead of a size property, a C string is terminated by an US-ASCII NUL character (”). This works well enough in most cases, but has one serious drawback: the NUL character is a reserved character that cannot be part of any C string. So, for example strlen(“AB”) equals one and not three as one would usually expect.

CERT has a good presentation of some of the more important problems associated with the standard C string handling functions. I do not intend to reproduce this here or elaborate on any further details except that we get into big trouble if NUL characters are used e.g. in logging data sets. We had this problem in the IETF syslog WG, where we permited NUL to be part of the syslog message text, but permitted a receiver to escape it. This is far from being an ideal solution, but we considered it good enough, especially as it permits to keep compatible with existing toolset libraries.

Now, in CEE, we face the same challenge: the problem is if the in-memory representation of event fields should permit NUL characters. The correct technical answer to this question is “yes, of course”, but unfortunately it has a serious drawback that can affect adoption: if NULs are permited, none of the string handling functions of the C runtime library can be used. This is, as said above, because the C runtime library is not able to handle NULs inside “standard” C strings. A potential solution would be to escape NULs when they enter the system. However, that would require an additional reserved character, to do the escaping. So in any case, we’ll end up with a string that is different from what the “usual” runtime library routines expect.

Of course, this problem is not new, and many solutions already have been proved. The obvious solution is to replace the standard C string handling functions with octet-counting functions that do not require any reserved characters.

A short, non- weighted list of string replacement string libraries is:

Note that some of them try to mimic standard C strings as part of their API. I consider this highly dangerous, because it provides a false sense of security. While the library now can handle strings with included NUL characters (like “AB”), all parts of the string after the first NUL will be discarded if passed to a “regular” C runtime library string function (like printf!). So IMO this is a mis-feature. A replacement library must explicitely try to avoid compatibility to the C runtime library in order to safe the user from subtle issues, many of them resulting in security problems (think: information hiding).

Unfortunately, I could not identify any globally-accepted string replacement library that is in widespread use.. Despite its deficits, C programmers’  tend to use the plain old string functions present in the standard C runtime library.

So we are back to the original issue:

If CEE supports NUL characters inside strings, the C standard string library can not be used, and there are also problems with a potentially large number of other toolsets. This can lead to low acceptance rate.

But if CEE forbids NUL characters, data must be carefully asserted when it enters the system. Most importantly, a string value like “AB” must NOT be accepted when it is put in via an API. Experience tells that implementors sometimes simply overlook such restrictions. So this mode opens up a number of very subtle bug (security) issues.

I am very undicided which route is best. Obviously, a sound technical solution is what we want. However, the best technical solution is irrelevant if nobody actually uses it. In that light, the second best solution might be better. Comments, anyone?

libee – first peek preview available

I have just published a first preview of the libee library API. This work is obviously far from being finished, but I think the preview provides at least some idea of how it will materialize.

To tell a bit about the status in general: I have completed a first round of evaluation of CEE objects and concepts, based on the draft standard. Note that the draft standard is far from being finished, and will probably undergo serious changes. As such, I have not literally taken object definitions from there. Instead, I have mostly pulled what I am currently interested in and also have added some additions here and there, as I need them to do what I intend to do.

My primary goal is to get some proof of concept working first, then re-evaluate everything and probably do a rewrite of some parts. For the same reason, performance is currently not on my radar (but of course I always think about it ;)).

I would also like to note that the libee git is also available — and a good way to follow libee development for those interested. Comments are always appreciated and especially useful in this early phase.

CEE library will be named libee

After some discussions, we have finally decided to name the CEE part of the original liblognorm project “libee“. Note the missing “c” ;) We originally thought that libcee would be a smart name for a library implementing an API on top of the upcoming CEE standard. However, there were some complexities associated with that because “libcee” sounds much like *the* official library. While I hope I can finish a decent implementation of the full CEE standard, I think it is a bit bold to try to write the standard lib when the standard is not even finished.

So the compromise is to drop the “c”. The “ee” in libee means “Event Expression”, much as CEE means “Common Event Expression”. If CEE evolves as we all hope and if I manage to actually create a decent implementation of it, we may reconsider calling the library libcee. But that’s not important now and definitely something that will involve a lot of people.

In the mean time, I wanted to make sure you all understand what I mean when I talk about “libee”. I hope to populate its public git with at least a skeleton soon (liblognorm reaches a point where I need some CEE definitions, so this becomes more important).