Will my CEE library be named libcee?

After my thoughts on splitting off a CEE library from the liblognorm project, Michael Biebl suggested that the CEE part of it be named “libcee”. This sounds very natural and fitting. However, CEE is a trademark of the MITRE Corporation, which helps establish the standard. I have asked the folks there whether it is possible to use this name. Not surprisingly, the answer will take a little while.

I need to make progress while waiting, so I have decided to name the library libcee for now and rename it if MITRE decides this is not possible. That should not be much of a problem, as that decision will probably be made much more quickly than I can write the first releasable version ;)

splitting up the normalization library

I have dug into the design of my upcoming event/log normalization library. As it will be based on CEE, I intend to pull in CEE definitions for the types defined there, like tags or field types. I have also thought about what the library should output. An obvious choice for many use cases is an in-memory object model describing the normalized form of the event that was passed in. This is probably most convenient for applications that want to do further processing on the event.

However, it also seems useful to have the ability to serialize this data in the form of a text string. That string could be stored in a file for later reference, for forensics, or to feed some other tool capable of understanding the file format. And as the in-memory object model will be CEE-based, and CEE defines such serialization formats, it seems obvious that the library should be able to generate serializations based on the CEE-defined and supported formats (note that this does not necessarily mean XML; it may just as well be JSON or syslog structured data).

Looking at all this, the normalization library seems to consist of two largely independent (but co-operating) parts:

  • the parser engine itself, the part that actually normalizes the input string according to the provided sample base and CEE definitions
  • a CEE support library, which provides the plumbing for everything that is defined in CEE (like tags, field types and serialization formats)

Now consider that I intended to create the normalization feature as a separate library instead of an rsyslog plugin because I hope that other projects can reuse it. Looking at the above bullet points, it seems equally natural to split the core parser from the CEE functionality. Again, there seems to be a broader potential user base for generic CEE functionality than for normalization. For example, a CEE support library could also be used by projects that natively support CEE. It would hopefully save them the hassle of coding all CEE functionality just to do some basic things. Think, for example, of an application that would “just” like to emit a well-formed CEE log record (a very common case, I guess). With such a library, it could generate a proper in-memory representation of the event and then have the library process it. The library could then also check if the event is syntactically correct and contains all the necessary fields to conform to a specific CEE profile.
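To illustrate the kind of plumbing I have in mind, here is a minimal sketch of how an application might use such a CEE support library. Note that all type and function names (cee_event_new, cee_event_addField and so on) are purely hypothetical at this point; nothing of this API exists yet.

#include <stdio.h>
#include <stdlib.h>
#include <libcee.h>   /* hypothetical header of the future CEE support library */

int main(void)
{
    /* build an in-memory representation of the event */
    cee_event_t *event = cee_event_new();
    cee_event_addTag(event, "login");
    cee_event_addField(event, "user", "rgerhards");
    cee_event_addField(event, "sourceIP", "192.0.2.1");

    /* let the library check conformance to a CEE profile ... */
    if (cee_event_validate(event, CEE_PROFILE_BASE) != 0) {
        fprintf(stderr, "event does not conform to the profile\n");
        return 1;
    }

    /* ... and emit one of the CEE-defined serializations */
    char *rendered = cee_event_serialize(event, CEE_FORMAT_JSON);
    puts(rendered);

    free(rendered);
    cee_event_destroy(event);
    return 0;
}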

The more I think about it, the more I think it is useful. So I’ll probably split the core normalization library from the CEE part. This is not much effort, but it opens up additional uses. I’ll then call the normalization part liblognorm (or libeventnorm) and the CEE part libcee — or something along these lines. In this light, liblognorm may actually be a better name, because the parser part is more concerned with logs and log files than with generic events (which often come in other formats).

Again, feedback is appreciated!

liblognorm or libeventnorm?

Names are always important. While thinking about liblognorm’s design, it occurred to me that the name may be “too limited”. In fact, the library will make heavy use of CEE, the common event expression standard. In CEE, the term “log” was deliberately avoided in favor of the broader term “event”. And thinking about this, my library is much more about normalizing all types of event representations than it is about normalizing “just” logs. So libeventnorm would probably be a better name, and for now I think I will use it.

Any feedback on this issue is appreciated. I am also open to new name suggestions. But I think I will populate a public git repository early next week, so we should have settled on a name by then ;)

Introducing liblognorm

Hi folks,

With this posting, I introduce a new project of mine: liblognorm. This library shall help to make sense out of syslog data, or, actually, any event data that is present in text form.

In short, one will be able to throw arbitrary log messages at liblognorm, one at a time, and for each message it will output well-defined name-value pairs and a set of tags describing the message.

So, for example, if you have traffic logs from three different firewalls, liblognorm will be able to “normalize” the events into generic ones. Among other things, it will extract source and destination IP addresses and ports and make them available via well-defined fields. As the end result, a common log analysis application will be able to work on that common set, and so this backend will be independent of the actual firewalls feeding it. Even better, once we have a well-understood interim format, it is also easy to convert it into any other vendor-specific format, so that you can use that vendor’s analysis tool.

So liblognorm will be able to provide interoperability between systems that were never designed to be interoperable.
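To give a rough idea of what this could look like from an application’s point of view, here is a sketch with a purely hypothetical C interface — none of these functions (ln_initCtx, ln_loadSamples, ln_normalize, …) exist yet, and the real API may look quite different. What matters is the flow: load a sample base, feed in raw messages one by one, and get back normalized fields and tags.

#include <stdio.h>
#include <liblognorm.h>   /* hypothetical header of the future library */

int main(void)
{
    /* load the (community-maintained) sample base */
    ln_ctx *ctx = ln_initCtx();
    ln_loadSamples(ctx, "firewall.sb");

    const char *msg =
        "AAA credentials rejected: reason = bad password: "
        "server = 192.0.2.10: user = jdoe";

    /* normalize one message; the result is an event object
     * holding the tags and the well-defined fields */
    ln_event *event = NULL;
    if (ln_normalize(ctx, msg, &event) == 0) {
        printf("user    : %s\n", ln_getField(event, "user"));
        printf("sourceIP: %s\n", ln_getField(event, "sourceIP"));
        ln_freeEvent(event);
    }

    ln_exitCtx(ctx);
    return 0;
}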

This sounds bold, so why do I think I can do this?

Well, I have been working in this field for quite some years and have accumulated a lot of experience, including a sufficient number of implementations where I failed in one way or another. You can read about this in my previous blog post on “syslog normalization”. So the know-how is available. The core technology is actually not that complex. I hope to code the core parts of the lib within one month (but it may take some more real-world time, as I cannot yet work on it full time). Also, very importantly, there is the Common Event Expression (CEE) standard coming up. Its aim is nothing less than to provide the plumbing that is needed for log normalization. CEE was initiated by the same folks who brought successful standards like CVE to life — so it is something we can hope will succeed. Thankfully, I have recently become a member of the editorial board, so I have good insight into what it takes to write a library that utilizes CEE.

This all sounds good, but there is one ingredient missing: liblognorm will not be able to magically know the format of each message. We must teach it the formats. This involves some community effort, as a single person cannot compile all the message formats this IT world has to offer. Not even a small group can. I hope that we can get sufficient log samples from the rsyslog community and later from a growing liblognorm community. I have thought long about how this can be done. A core finding is that it must be dead simple to do. This is how it is intended to work:

When you feed liblognorm a message that it cannot recognize, it will spit that message back out. Let’s assume we have the following message:

AAA credentials rejected: reason = reason: server = server_IP_address: user = user

Then the user just needs to tell liblognorm which fields the message contains, using a simple syntax. A hypothetical sample could be:

reject,AAA:AAA credentials rejected: reason = %reason_msg%: server = %sourceIP%: user = %user%

The strings “reject” and “AAA” are tags. Tags are placed in comma-delimited format in front of the actual message sample and are terminated by a colon. Everything after the colon is actual message text. The fields between percent signs reflect well-known properties (which are taken from the CEE base definitions and/or are custom-defined). The syntax will be taken from a data dictionary, so the user does not need to bother with it in most cases. So creating a message sample out of an unknown message type should be fairly easy.
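As a purely made-up illustration of the firewall scenario from above, two different vendor formats could be mapped onto the same well-known fields with samples roughly along these lines (the tag and field names are invented for this example):

accept,traffic:Connection from %sourceIP%:%sourcePort% to %destinationIP%:%destinationPort% permitted
deny,traffic:%sourceIP% -> %destinationIP% blocked on port %destinationPort%

Both samples carry the “traffic” tag, so an analysis application can treat the resulting events uniformly, no matter which firewall produced them.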

The idea now is that we gather these one-line message samples and compile them into a central repository. Then, all users can download fresh versions of the “sample base” and use them to normalize their logs — much like a virus scanner works. Of course, there are a number of questions with regard to how trustworthy samples are. But this is definitely something we can solve. To get started, we could simply do some manual review first, much like code integration. At later stages, some trust level policy could be applied. Well-known technology…

None of this is cast in stone, as I am at a very early stage of this project. Your feedback definitely makes a difference, so be sure to provide plenty of it ;)

That’s it for today, but I plan to do a series of blog posts on the new system within the next couple of days. Please provide feedback whenever you have it. It’s definitely something very useful to have. I’ll also make sure that I set up a new list for this effort, but initially I will be abusing the rsyslog mailing list, as I assume folks on that list will definitely be interested in liblognorm as well. And, of course, once the library is ready I’ll utilize it in various ways inside rsyslog.

rsyslog now supports Hadoop’s HDFS

I will be releasing rsyslog 5.7.1 today, a member of the v5-devel branch. With this version, omhdfs debuts. This is a specially-crafted output module to support Hadoop’s HDFS file system. The new module was a sponsored project and is useful for folks expecting enormous amounts of data or having high processing-time needs for analysis of their logs.

The module has undergone basic testing and is considered (almost) production-ready. However, I myself have limited test equipment and limited needs for and know-how of Hadoop, so it will probably be interesting to see how real-world users perceive this module. I am looking forward to any experiences, be they good or bad!

One thing that is a bit bad at the moment is the way omhdfs is built: Hadoop is Java-based, and so is HDFS. There is a C library, libhdfs, available to handle the details, but it uses JNI. That seems to result in a lot of dependencies on environment variables. Not knowing better, I ask the user to set these before ./configure is called. If someone has a better way, I would really appreciate hearing about it.
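For the curious, the kind of calls involved look roughly like this — a minimal, heavily simplified sketch against libhdfs’ C API, not the actual omhdfs code, and with error handling mostly omitted:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <hdfs.h>   /* libhdfs, the JNI-based C interface to HDFS */

int main(void)
{
    const char *line = "<14>Oct 20 10:00:00 host app: test message\n";

    /* connect to the default (as configured) HDFS instance */
    hdfsFS fs = hdfsConnect("default", 0);
    if (fs == NULL) {
        fprintf(stderr, "hdfsConnect failed\n");
        return 1;
    }

    /* open a log file for writing and write one record */
    hdfsFile fh = hdfsOpenFile(fs, "/rsyslog/test.log", O_WRONLY, 0, 0, 0);
    if (fh != NULL) {
        hdfsWrite(fs, fh, line, (tSize)strlen(line));
        hdfsFlush(fs, fh);
        hdfsCloseFile(fs, fh);
    }

    hdfsDisconnect(fs);
    return 0;
}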

Please also note that it was almost impossible to check omhdfs under valgrind: the Java VM created an enormous number of memory violations under the debugger and made the output unusable. So far I have not bothered to create a suppression file, but I may try this in the future.

All in all, I am very happy that we now have native output capability for HDFS as well. Adding the module also proved how useful the idea of an rsyslog core paired with relatively lightweight output/action modules is.

comments on this blog are re-enabled

Finally, Google has created some useful tools to help fight comment spam. As a result, I was able to quickly delete the remaining 100 (or so) spam comments, and so I can now show comments on this blog again. This is useful, as they contain a lot of insight. Also, I hope that my readers will now comment again!

Some results from LinuxKongress

I had a great time both attending the talks at Linux Kongress 2010 in Nuremberg and giving my presentation on rsyslog performance enhancements. Even more, the social events helped drive things forward. Among others, I met with Lennart Poettering, who currently works on systemd. He had a couple of good suggestions for rsyslog and I hope to implement at least some of them. On the list for the next v5-devel release are definitely improvements for imuxsock, like the ability to obtain the process id of the process that actually emitted the log message (via SCM_CREDENTIALS). Once this is known, we can do a couple of “interesting things” with it.
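For readers unfamiliar with the mechanism: on Linux, a unix domain socket can be asked to attach the sender’s credentials to each datagram it receives. The sketch below shows just the bare system calls of the receiving side (it is not imuxsock code); the socket fd is assumed to be an already bound AF_UNIX datagram socket.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>

/* receive one datagram and print the pid/uid/gid the kernel attached to it */
static void recv_with_creds(int fd)
{
    char data[2048];
    char ctrl[CMSG_SPACE(sizeof(struct ucred))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    /* ask the kernel to pass SCM_CREDENTIALS ancillary data */
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));

    if (recvmsg(fd, &msg, 0) < 0)
        return;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            struct ucred *uc = (struct ucred *)CMSG_DATA(c);
            printf("sender pid=%d uid=%d gid=%d\n",
                   (int)uc->pid, (int)uc->uid, (int)uc->gid);
        }
    }
}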

rsyslog queues, reliability and forwarding

I would like to explain a little bit how rsyslog queues and forwarding interact. First of all, let me state that the queue engine and the action part of rsyslog, where the forwarding happens, are strictly separated. There is no interface between the two that permits an action to affect queue behaviour. For example, I was asked if a forwarding action could initiate disk queue mode when forwarding fails. The root reason for this request was that messages should not be endangered while a remote server fails.

This is not possible with the current design and would involve a far-from-trivial design change. However, I do not think that the functionality is actually needed. When talking about reliability, rsyslog works on the importance of messages and not on the importance of actions.

So in rsyslog we can configure the level of message loss that is acceptable for a given use case. This can be done on an action-by-action basis (or once at the ruleset/main queue level for all actions — usually not a good idea, but from time to time it may be). The extreme ends are a) no message loss at all permitted and b) message loss of any magnitude is acceptable. For a), this means we need to configure a disk-only queue with full sync of queue files and management information after each message processed (with message input batches of one). For b), this means we do not need to place any restrictions. Obviously, a) is rather slow while b) is extremely fast. As the extremes are not usually what is needed, there are many configuration possibilities between them. For example, one may define a queue that goes to disk if more than 1,000 messages are kept in main memory, but only then, and that syncs queue files only every 10 messages (a big performance saver). That means that at any instant, at most 1,010 messages are at risk and can potentially be lost. The queue then monitors these predicates and switches to disk mode only when required, which saves a lot of performance.
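In (legacy) configuration statements, such a middle-ground setup for a forwarding action could look roughly like the sketch below. The directive names are real, but the values are just examples, not recommendations, and the checkpoint interval is only an approximation of the “sync every 10 messages” idea:

# queue parameters for the forwarding action that follows:
# stay in memory, but go to disk once more than 1,000 messages accumulate
$ActionQueueType LinkedList
# setting a file name enables disk-assisted mode
$ActionQueueFileName fwdq
# start writing to disk above this number of queued messages
$ActionQueueHighWatermark 1000
# write queue bookkeeping (checkpoint) information every 10 messages
$ActionQueueCheckpointInterval 10
# persist whatever is still in memory when rsyslog shuts down
$ActionQueueSaveOnShutdown on
# keep retrying forever if the remote server is down
$ActionResumeRetryCount -1
*.* @@central.example.net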

Now let’s switch the perception of things a bit: let’s go with our starting example and say you want to go to disk only when the remote system is down. But what if the remote system is up, but cannot accept messages quickly enough? Let’s assume a backlog of 10,000 messages builds up. Is it then acceptable to keep these in main memory, just because the remote system is accepting data? If this risk is acceptable, why would it be unacceptable if the remote system is not currently accessible? If we say one case is acceptable but the other not, we somewhat contradict ourselves: it is almost random whether the remote system is accepting messages, so why does it make a difference in the risk we permit?

This contradiction is the core reason why rsyslog does not look at external events or action failure causes but rather works on the basis of “acceptable risk”. Let’s say it is acceptable to lose 1,000 messages. Then, it is irrelevant whether we lose these while the remote system is accepting or not. Consequently, rsyslog enforces disk mode if the remote system is down and there are more than 1,000 messages inside the queue. But it does not enforce this if there are only 500 messages waiting to be sent. Why should it? After all, the user has specified that a loss of 1,000 messages is acceptable, and so we do not try to guard these messages more than required by this policy. Note that if rsyslog is terminated in such a situation, a disk queue with the 500 messages is of course created. We do not voluntarily lose messages, and if we terminate, we can no longer hold them in main memory. Consequently, they must be written out (again, depending on configuration). So the in-memory queue is retained across rsyslog restarts. But it is important to point out that unexpected aborts – like sudden loss of power – can cause message loss in such scenarios. This is no different from sudden loss of power with an accepting remote system and a queue of 500. If such a risk is unacceptable, we need what I initially described as scenario a).

As a side note, rsyslog queues provide very high reliability. Every message is removed from the queue only after the action acknowledges that it has been processed. This kind of reliability is used in some very demanding audit-grade environments (which I am, as usual, not permitted to name…).

To sum up, rsyslog protects message integrity not by external events but by user-configurable “acceptable risk” policies.

We consider this a superior approach, as external events are somewhat unreliable when it comes to protecting against traffic bursts. Relying on external events has a number of anomalies, as hopefully explained above.

getting serious with rsyslog v6

I have just created a new v5-devel branch based on the current master AND have then merged the newconf branch into master. This “officially” means the beginning of rsyslog v6. So I thought that justifies a short blog post.

As already elaborated, v6 will focus on a better configuration language. The current version already has scoping for actions, but no documentation for it yet. I will try to add the doc, so that I can hopefully officially release the devel version this week. I’d also like to work a bit more on imptcp, so that I have some common base functionality in all versions that support it.