log normalization Archives - Rainer Gerhards

2025-09-232025-10-09

Revisiting old style Windows Log Schema Mapping

Windows logs provide a wealth of information that must be made usable for Observability. As you may know, I work on normalizing these logs for quite some while, I even created liblognorm for that purpose. Ingesting them properly is important for schema mapping, e.g. to Elastic Cloud.

2015-08-262018-05-30

moving towards liblognorm v2

The initial version of liblognorm v2 is almost ready. It offers many new features, like custom data types, much easier rule description langugage, and potentially even greater performance (we have not yet verfied this). As some of you know, I have worked very hard on liblognorm during the past weeks. I have now reached a very important milestone and will switch the git master branch to use the new version. If things go smooth enough, the initial release of liblognorm v2 will go along with the next rsyslog release. Daily build will have it very soon.

Liblognorm v2 also contains the full v1 engine and thus is fully compatible with the previous versions, as far as rulebases are concerned. For more on the compatibility, please read the compatiblity document. In fact, by default the v1 engine is used. To opt in for the new features, you need to add a line

version=2

to the top of your rulebases Then it is when you need to really check the compatibility document. This is also what brings you the enhancements.

A couple of notes are due: while we are approaching the initial release, not all design goals have been met yet. Most importantly, we are feeding back user comments into the development process. As such, the v2 feature set is not 100% finalized yet. This means that we cannot yet fully guarantee that all constructs you use will remain compatible with versions released later. But those that know us also know that this risk is minimal and, if it happens, will be easy to fix. The core concepts are ready and unlikely to change. Note that I will continue to actively work on v2 and more features will be upcoming in the next weeks.

The online doc should be updated in two days at latest (actual update date depends on when I can switch the git branch and how this interacts with the automatic doc generating scripts). I invite you to use the new version and am sure it will be much easier to use and powerful.

Note to developers: the v2 and v1 engine are very different. V2 is a complete rewrite of the core components. Nevertheless, v1 and v2 share some of the same file names. For many reasons, this means I need to rewrite the git master branch with the new version.

Also note that as of now, no new development happens to v1, this version is essentially dead. Very important fixes to the v1 engine will be applied to the v1 subsystem of v2.

Feedback on v2 is appreciated, please post issues or feature requests directly to liblognorm’s github trackers (if possible).

2015-04-282018-05-30

liblognorm’s “rest” parser now more useful

The liblognorm “rest” parser was introduced some time ago, to handle cases where someone just wants to parse a partial message and keep all the “rest of it” into another field. I never was a big fan of this type of parser, but I accepted it because so many people asked. Practice, however, showed that my concerns were right: the “rest” parser has a very broad match and those that used it often got very surprising results.

A key cause of this issue was that the rest parser had the same priority as other parsers, and most importantly a higher priority than a simple character match. so it was actually impossible to match some constant text that was at the same location than the “rest” parser.

I have now changed this so that the rest parser is always called last, if no other thing matches – neither any parser nor any constant text. This will make it work much more like you expect. Still, I caution against using this parser as it continues to provide a very broad match.

Note that the way I have implemented this is not totally clean from a software engineering point of view, but very solid. A cleaner solution will occur during the scheduled rewrite of the algorithm (later in spring/summer).

Note that existing rulebases using “rest” may behave differently with the new algorithm. However, previously the result was more or less random, so any other change to the rulebase could also have caused different behaviour. So this is no compatibility break as there really is no compatibility to retain.

This will be released with 1.1.2, probably in early may. If you need it urgently, you can use a daily build.

2015-03-302018-05-30

Call for Log Samples

There is one big problem in research for better logging methods: no good logging sample repositories exist. Well, not even bad ones… I am currently doing some preliminary steps towards a new, better log normalization system. Among others, it will contain a structure analyzer which will remove much of the manual burden of creating normalization rules. But, guess what: while the project looks very promising, lack of log samples is a real big problem!

To solve that problem, I have setup a public log ingestor that you can simply send logs to. The system is reachable as follows:

hostname: logsubmit.rsyslog.com

port: 514

protocol: any flavor of syslog or other text data

If you run rsyslog, you can use add this snippet to /etc/rsyslog.conf:

*.* @logsubmit.rsyslog.com

How did this idea materialize? During my talk at the German Unix User’s Group FFG 2015 conference last week in Stuttgart, I mentioned that problem and Dirk Wetter had the idea to provide a log receiver that makes it very easy for people to contribute. There were some concerns that this may open up my server for DoS, and that of course is true. Nevertheless, I liked the idea and so we setup a machine today. It may be DDoS’ed and other bad things may happen, but then we got more experience. It’s split from the main systems, so that shouldn’t cause much harm.

For log contributors, please keep on your mind that you send data to a public service and so this is probably not a great idea to do this for sensitive systems. But if we get enough data from uncritical systems, we can still gain a lot from that, most importantly it helps us gain insight into structural log mining methods — which will also lead to above-mentioned tool. All logs gathered by this method will be placed in the research log repository, which currently is hosted on github. It is licensed under BSD 2-clause in the hope that a sufficiently large and diverse data set is also of great value for other researchers (did I mention it is ultra-hard to find any log sample data sets?). If you are interested in cotributing logs, but would want to do so under NDA, that’s of course also possible. In that case, please just drop me an email to see how to best go forward with that.

2015-02-272018-05-30

looking for Java stacktrace samples

I am currently working on log normalization as well as improvements for rsyslog’s imfile. Among the things that regularly come up on the rsyslog mailing list is support for multi-line logs and Java stack traces in general.

I would like to see what I can do to improve processing of these. To do so, I need a set of samples of such logs. As such, I look for people who would like to contribute log records for my research.

Please contact me at rgerhards@adiscon.com (or any other way you prefer) to contribute log samples. Please let me know if it is OK if I put them into the public log repository for research or you would like me to keep them private.

All types of multi-line logs are appreciated, this is not limited to java stacktraces.

Your support is greatly appreciated.

2013-01-252018-06-08

performance of liblognorm/rsyslog parse tree

We are currently considering a new table lookup feature for rsyslog, and as usual, performance is a priority. As such, we intend to base the implementation on the liblognorm parse tree, which is highly efficient. In fact, the adaption (maybe a partial rewrite) I have in mind for rsyslog actually has O(1) time complexity. This drew some discussion, as parse trees are usually considered to be O(log n). So I thought I quickly lay out why constant time complexity is possible.

To set some ground, let’s define the use case where this applies.We are talking about string matches, not regular expressions. In the case of liblognorm, we are also talking about exact string matches, and let me assume this for rsyslog’s lookup table functionality as well (at least in “string” mode). Note that this may be extended, and should be no problem, but I have yet to properly evaluate the effect on the parse tree. So let’s assume exact matches in this blog post. So basically, we have a set of strings and would like to see if there is an exact match (and pick up some value associated with that match). For example, we may have the strings “Berlin”, “Boston”, “Bogota” and “London”. That example is obviously extremely simplified, yet it will help us understand.

Whenever we talk about time complexity, we need to make sure what exact n we are talking about. When it comes to matching values, n is usually the number of different values. In our example, this n would be four. Of course, one could also look at string lengths, as the length obviously influences the complexity of the comparison operation. Let’s call this m, then in the above sample m would be six (length of longest string). If we take the unusual step to include that m, a linear list would have time complexity O(n*m), a binary tree O(log n*m) and rsyslog’s parse tree O(m). HOWEVER, there usually is an upper bound on string length, and for that reason m is taken as a constant in almost all cases. By definition, it will then no longer be mention in O notation, and so we end up with the usual view that a linear list is O(n), a binary tree O(log n) … and rsyslog’s parse tree O(1).

This is possible because rsyslog’s parse tree is based on the radix tree idea and kind of “compresses” the strings rather than including them in every node. The conceptual in-memory representation of a tree based on above values is as follows:

[root] - B --- O --- GOTA
  |      |     + --- STON
  |      +---- ERLIN
  + ---- LONDON

As you can see, the tree is actually based on single characters that are different between the strings. So we do not have each string at the tree level, but only the rest of what is different. On this works a simple state machine. On a conceptual level, the search string is processed character by character, so each character represents a state transition. For obvious reasons, there can be only as many state transitions as there are characters inside the longest string. This is what we called m above. However, the number of different strings is totally irrelevant for the search, because it does not influence the number of state transitions required. That number is what we called n. So in conclusion, the time complexity is only depending on m, which is kind of constant. Consequently, we have O(1) complexity.

Just to stress the point once again: we do not iterate over the individual strings in a tree-like manner, instead we iterate over the index string characters themselves. This creates a very broad and shallow tree, even for a large number of index values.

On real hardware, the number of strings of course has some slight effect, especially when it comes to cache hit performance. While a small tree may fit into the available cache, a much larger tree obviously will not. However, from a theoretical point of view, this does not matter, we still have constant time for the lookup. From a practical perspective, this also does not matter so much: any larger tree will have a larger memory requirement and thus a lower cache hit rate. In fact, rsyslog’s parse tree will probably perform better because of its compressed structure, which makes cache presence of at least top-level elements more likely.

For rsyslog, that’s all I need to tell. For liblognorm, the story is a bit longer. It is only “almost” O(1). While the string matches themselves can be carried out in constant time, liblognorm also supports parsers to pull out specific parts of a log message. The current parsers are all O(1) as well, so this does not change the picture. However, state transitions involving a parser are no longer simple operations where the transition is immediately clear. If, for a given prefix, multiple parsers may apply, liblognorm must iterate through them until a matching one is found. This is no longer O(1), but depends on the number of parsers. Of course, we can claim that the number of parsers is still a (even small) constant, and so we can still say all of this is O(1) without violating the definition. While this is true, there unfortunately comes another complexity with parsers: With constant string, backtracking is never required, as we immediately know the exact state transition. With parsers, we have dynamic elements, and later text matches may turn out to be a non-match. In that case, we need to go back to parser detection and see if another parser also fits. If so, we need to give it a try and need to re-evaluate the parse tree from that point on. This process is called backtracking and can ruin the runtime to become as worse as O(n^2), just like regular expressions. Obviously, the problem exists primarily of very generic parsers are used, like “word”. If, for example, parsers as specific as IPv4 are used, there is no chance for more than one parser matching. So the problem cannot occur. This is also why I caution against using too-generic parsers. Thankfully, very generic parsers are not used too frequently, and so we usually do not see this problem in practice. This is why I call liblognorm to be “almost” O(1), even though it may degrade to O(n^2) in extreme cases. Heuristically said, these extreme cases will not occur in practice.

The rsyslog parse tree does not use parsers inside the parse tree, just the fixed state transitions. As such, it actually is O(1). Oh, and yes: we should probably call the rsyslog “parse tree” better a “lookup tree” or “search tree” but for historical reasons it currently is named the way it is…

2012-02-21

parsing JSON-enhanced syslog

Strucuted logging is cool. A couple of month ago, I added support for log normalization and the 0.5 draft CEE standard to rsyslog. At last weeks Fedora Developer’s Conference, there was a huge agreement that CEE-like JSON is a great way to enhance syslog logging. To follow up on this concept, I have integrated a JSON decoder into libee, so that it can now decode JSON with a single method call. It’s a proof of concept, and for serious use performance optimization needs to be done. Besides that, it’s already quite solid.

Also, I just added the mmjsonparse message modification module to rsyslog (available now in git master branch!). It checks if the message contains an “@JSON: ” cookie and, if so, tries to parse the resulting string as JSON. If that succeeds, we obviously have a JSON-enhanced message and the individual name/value pairs are stored and can be used both in filters and output templates. This provides some really great opportunities when it comes to processing the structured data. Just think about RESTful interfaces and such!

Right now, everything is at proof of concept level, but works well enough for you to try it. I’ll smoothen some edges but will release the versions rather soon. Probably the biggest drawback is that the JSON processor currently flattens the event, with structure being conveyed via field names. That means if you have a JSON object “SUPER” containing a number of fields “field1” to “fieldn”, the current implementation will be a single level and the names are “SUPER.field1”,… I did this in order to have a quick solution and one that fits into the existing framework. I’ll work on creating real structure soon. It’s not really hard, but I probably do some other PoCs first ;)

I considered several approaches, among them moving over to libcollection (part of ding-libs) or a pure JSON parser. The more I worked with the code, the more it turned out that libee already has a lot of the necessary plumbing and could simply been enhanced/modified under the hood. The big plus in that approach is that is immediately plugs in into rsyslog and the other solutions that already built on it. This even enables to use the new functionality in the v6 context (I originally thought I’d need to move on to rsyslog v7 for the name-value pair changes). Now that I have written mmjsonparse, this really seems to work out. No engine change was required, and I expect little need for change even for the final version. As such, I’ll proceed in that direction. Actually, what I now use is kind of a hybrid approch: I use a lot of philosophy of libcollection, which showed me the right route to take. Then, I use cJSON, which is a really nice JSON parser. In the proof of concept, I use both cJSON’s object model and libee’s own. I expect to merge them, actually tightly integrating cJSON. The reason is that CEE has evolved quite a bit in the mean time, and many complex constructs are no longer required. As such, I can streamline the library as well, what not only reduces complexity but speeds up the whole process.

I would like to express my sincere thank to Dmitri Pal, Keith Robertson and Bill Heinbockel, which provided great advise and excellent discussion. And the best is that this part of the effort is just the beginning… Stay tuned for more!

2012-01-09

rsyslog v6 stable released

Happy new year! After being back from vacation, I started the new year with finally releasing the first stable rsyslog v6. While it was ready for a while, I was hesitant to release it when there was so much going on and time left for any quirks that may show up (I know I all too often overlook a thing or two with such a release and, if so, it is good to react fast). So today was finally the day. Let’s see and wait if I failed somewhere.

The new release is a very important one for me. First of all, because it helps me to get down to a decent set of versions I need to support. Doing all that v4..6 stuff in parallel with multiple branches really became time-consuming (but manageable thanks to git!). V4 has now been retired, and the number of branches is reduced to a minimum.

Very important is the new log classification capability. While it is in the beta (and devel) for quite a while, I think people were hesitant to try it out. At the same time, this prevented us from developing it any further. With it now being part of a stable release, I hope that it will get momentum.

Besides software availability, rule bases for common syslog messages are also important. Adiscon will assist with creating these rule bases upon request. I hope that people will use that facility. All they need to do is provide sufficient information about what they want to normalize and some sample messages for us to work with. We will than see how to create the required rulebase. This whole effort is based on liblognorm, so the technology can also be used by other projects (sagan is a good example that actually currently has more interest in that feature that rsyslog has).

2011-11-16

log annotation with liblognorm

I have recently written about the concept of event (log) annotation and liblognorm. During the past days I have made my mind up and have begun implementing some stuff today. In essence, rule bases will receive a new rule type “annotate”, which contains the annotation part. Here is a sample from my lab environment:



rule=logon:<%-:number%>1 %timestamp:date-rfc5424% %src-id:word% ...
annotate=logon:+action="login"

Note the text in red. This is a liblognorm tag (not to confuse with a CEE tag!). This rule base tells the normalizer to append, according to the target format, the fields that are given in the annotate statement to any events that have the tag in question (“logon” in our case).

Today, I am extending the rule base parser to support the annotate rule. Within the next days, I’ll update the rest of the system. When this is done, I’ll probably release that version so that you can try out the new functionality in your own environment.

2011-11-08

filler fields in log normalization

When looking at some real-world rule bases for liblognorm, I noticed that it is often required to check for the presence of a specific field, but the value is actually not needed. This leads to fields named e.g. “filler”, “dummy”, “dummy<n>” with n being an increasing number. This is both clumsy and requires unnecessary processing power. For that reason, I have introduced “-” (dash) as field name. When this special name is used, the field as parsed as usual, but immediately discarded after the successful parse. So while we need to parse and extract in order to get the parse logic right, we save the effort to keep a copy of this unneeded data. This also means that output log records produced by the normalizer tool are cleaned up. I hope this is a useful addition.