log normalization: how to share rulebases?

Rulebases play a crucial role in log normalization. While the log normalizer itself needs to be of high quality and speed, it is the rulebase that really helps to detect which message the one in question is. I myself have so far concentrated on the code and not created any larger rulebase. Champ Clarck III has created many more for his use inside Sagan. But this means everything is in its infancy. What we really need is community involvement to create a large number of easy to access rulebases for almost all devices.

This brings up the question of how to manage and share such a repository. One method may be to place it on a web site, together with some submission tool. An alternate approach would be to put everything into a public git. This latter approach has some beauty, because git is universally available and well know. Even if a user does not know git, only a minimal set of commands is required to pull the rulebase. So maybe this is the way to go?

I would be very interested in suggestions on how we shall manage rulebases and spread the word. What do we need to support a great community? Whom can we talk to? If you have any ideas, concerns, questions or even an idle rant, be sure to let me know. At best, send mail to the lognorm mailing list, so we can broadcast this to other folks interested.

normalizing Apache Access logs to JSON, XML and syslog

I like to make my mind up based on examples, especially for complex issues. For a discussion we had on the CEE editorial board, I’d like to have some real-world example of a log file with many empty fields. An easy to grasp, well understood and easy to parse example of such is the Apache access log. Thanks to Anton Chuvakin and his Public Security Log Sharing Site I also had a few research samples at hand.

Apache common log format is structured data. So there is no point in running it through a free-text normalization engine like liblognorm. Of course it could process the data, but why use that complex technology. Instead, the decoder is now part of libee and receives a simple string describing which fields are present. It’s called like this:

$ ./convert  -exml -dapache -D “host identity user date request status size f1 useragent” < apache.org > apache.xml

Options specify encoder and decoder, and the string after -D tells the convert field names and order. But now let’s speak the input and output for itself:

The converter works by calling the decoder, which creates an in-memory representation of the log format in a CEE-like object model. Then, that object model and the called-for encoder is used to write the actual output format. That way, the conversion tool can convert between any structured log format, as long as the necessary encoders and decoders are available. This greatly facilitates log processing in heterogeneous environments.

Note that liblognorm works similar and, from libee’s point of view, can be viewed at as an decoder for unstructured text data (or, more precisely, for text data where the structure is well-hidden ;)).

logging and the C NUL problem

Again, I ran into the “C NUL Problem”, that is the way C strings are terminated. Unfortunately, the creators of C have represented strings as variable arrays of char without an explicitely-stated size. Instead of a size property, a C string is terminated by an US-ASCII NUL character (”). This works well enough in most cases, but has one serious drawback: the NUL character is a reserved character that cannot be part of any C string. So, for example strlen(“AB”) equals one and not three as one would usually expect.

CERT has a good presentation of some of the more important problems associated with the standard C string handling functions. I do not intend to reproduce this here or elaborate on any further details except that we get into big trouble if NUL characters are used e.g. in logging data sets. We had this problem in the IETF syslog WG, where we permited NUL to be part of the syslog message text, but permitted a receiver to escape it. This is far from being an ideal solution, but we considered it good enough, especially as it permits to keep compatible with existing toolset libraries.

Now, in CEE, we face the same challenge: the problem is if the in-memory representation of event fields should permit NUL characters. The correct technical answer to this question is “yes, of course”, but unfortunately it has a serious drawback that can affect adoption: if NULs are permited, none of the string handling functions of the C runtime library can be used. This is, as said above, because the C runtime library is not able to handle NULs inside “standard” C strings. A potential solution would be to escape NULs when they enter the system. However, that would require an additional reserved character, to do the escaping. So in any case, we’ll end up with a string that is different from what the “usual” runtime library routines expect.

Of course, this problem is not new, and many solutions already have been proved. The obvious solution is to replace the standard C string handling functions with octet-counting functions that do not require any reserved characters.

A short, non- weighted list of string replacement string libraries is:

Note that some of them try to mimic standard C strings as part of their API. I consider this highly dangerous, because it provides a false sense of security. While the library now can handle strings with included NUL characters (like “AB”), all parts of the string after the first NUL will be discarded if passed to a “regular” C runtime library string function (like printf!). So IMO this is a mis-feature. A replacement library must explicitely try to avoid compatibility to the C runtime library in order to safe the user from subtle issues, many of them resulting in security problems (think: information hiding).

Unfortunately, I could not identify any globally-accepted string replacement library that is in widespread use.. Despite its deficits, C programmers’  tend to use the plain old string functions present in the standard C runtime library.

So we are back to the original issue:

If CEE supports NUL characters inside strings, the C standard string library can not be used, and there are also problems with a potentially large number of other toolsets. This can lead to low acceptance rate.

But if CEE forbids NUL characters, data must be carefully asserted when it enters the system. Most importantly, a string value like “AB” must NOT be accepted when it is put in via an API. Experience tells that implementors sometimes simply overlook such restrictions. So this mode opens up a number of very subtle bug (security) issues.

I am very undicided which route is best. Obviously, a sound technical solution is what we want. However, the best technical solution is irrelevant if nobody actually uses it. In that light, the second best solution might be better. Comments, anyone?

Carnival of Logging!

This is a proposal to start a “Carnival of Logging”. Huh… what? I guess by now everybody is participating (or at least has written) in one or more blog carnivals. But a Carnival of Logging – am I mad? Who the heck is interested in such a thing?

Yes, I know… Logging is not sexy. Indeed, most folks don’t even know there is something like logging (except, of course, if they burn wood ;)). However, some folks blog about (computer) logging. Really. Every now and then I find a few interesting posts about logging and things related to it – like what role does logging play in compliance? What about forensics? What about security, … So there is some potential interest.

I think, there is a place for a “Carnival of Logging” and I would like to organize it :) I guess we could argue weeks if it make sense to actually do it – or we can simply try. I prefer the later (… and may be proven wrong “the hard way”).

This is what I propose:

  • If you are blogging about logging or areas that are closely related to logging (so that you have something log-related at least every know and then), please email me. Let me know which blog post you would like to see highlighted in the first carnival. Please also let me know if you would be interested in hosting future carnivals.
  • I will compile the first carnival of logging out of the messages I receive. It will be hosted on this site. I have no definite schedule for it yet, because I do not know the volume of posts coming in. My goal, however, is to have this done by February, 11th.
  • After the initial post has been done, I’ll email all those that submitted carnival entries and ask for new posts ;) I’ll pick a host from the list of those that have opted to host and forward entries I have received to that host. The host will then compile a new carnival and post it on his blog.
  • Once a carnival is out, all participants should link to it from their blog.
  • I’ll keep an archive of all carnival posts here on this site.
  • The carnival of logging should be on a fixed schedule. That probably depends a bit on the volume, but I suggest that the carnival is written at least bi-weekly.

Well – and now it is up to you! Please send in your carnival entries! I think a Carnival of Logging would be a useful addition to the logging world and hope that others agree. Also, please spread the word, so that we get more exposure and more participants! Also, feel free to post any questions and comments you have.