liblognorm Archives - Rainer Gerhards

2015-12-232018-05-30

rsyslog and liblognorm will switch to libfastjson as replacement for json-c

We have been using json-c for quite a while now and had good success with it. However, recent problem reports and analysis indicate that we need to replace it in the future. Don’t get me wrong: json-c is a solid piece of software, but we most probably use it much more intensely as the json-c developers ever anticipated. That’s probably the actual root cause why we need to switch.

A main problem spot is performance: various experiments, profiler runs, code review and experimental implementations have prooven that json-c is a severe bottleneck. For example, in the evaluation of liblognorm v2 performance, we found out that json-c calls dominated processing time by more than 90%. Once we saw this, we dug down into the profiler and saw that the hashtable hash calculation as well as memory allocations took a large amount of overall processing time. We have submitted an initial performance enhancement PR to json-c which also got merged. That already removed processing time considerably. We continued on that path, resulting in a quite large second enhancement PR, which I withdrew due to disagreement with the json-c development lead.

A major problem for our application of json-c is that the hash table implementation beneath it is not a good match to our needs. We have been able to considerably speed it up by providing a new hash function (based on perl’s hash function), but to really get the performance we need, we would need to replace that system. However, that is not possible, because json-c considers the hash tables part of its API. Actually, json-c considers each function, even internal ones, as part of the API, so it is very hard to make any changes at all.

Json-c also aims at becoming fully JSON compliant. It currently is not due to improper handling of NUL bytes, but the longer-term plan is to support NUL bytes this. While this is a good thing to do for a general json library, it is a performance killer for our use case. I know, because I faced that same problems with the libee implementation years ago, where we ditched it later in accordance with the CEE standards body board. I admit I also have some doubts if that change in json-c will actually happen, as it IMHO requires a total revamp of the API.

Also, the json-c project releases far to infrequently (have a look at recent json-c releases, the last one was April, 2014). And then, it takes the usual additonal timelag for distros to pick up the new version. So even if we could successfully submit further performance-enhancing PRs to json-c, it would take an awful lot of time before we could actually use them. I would definitely not like to create private packages for rsyslog users, as this could break other parts of a system.

Finally, json-c contains a real bad race bug in reference counting, which can cause rsyslog to segfault under some conditions. A proposed fix was unfortunately not accepted by the json-c development lead, so this is an open issue. Even if it were, it would probably take a long time until the release of the fixed version and its availability in standard packages.

In conclusion and after a lot of thinking, we decided that it is best to fork json-c, which we than did. The new project is named libfastjson. As the name suggests, it’s focus is on performance. It will not try to be 100% JSON compliant. We will not support NUL characters in a standards-conformant way. I expect this not to be a big deal, as json-c also never did this, and the number of complaints seem to be very low. So libfastjson will not aim to be general purpose json library, but one that offers high performance at some functionality cost and works well in highly threaded applications. Note that we will also reduce the number of API functions and especially remove those that we do not need and that cost performance. Also, the data store will probably be changed from the current hashtable-only system to something more appropriate to our tasks.

Libfastjson already includes many performance enhancement changes and a solid fix for the reference counting bug. Up until that bug, we planned to release in the Feb..April 2016 time frame, together with liblognorm v2. Now this has changed, and we actually did a kind of emergency release (0.99.0) because of the race bug. The source tarball is already available. We are working on packages in the rsyslog repositories (Ubuntu is already done). Rsyslog packages are not yet build against it, but we may do an refresh after the holiday period.

Rsyslog 8.15.0 optionally builds against libfastjson (it is preferred if available). Due to the race bug, we have decided that rsyslog 8.16.0 will require libfastjson.

A side-note is due: we have been thinking about a replacement for the variable subsystem since summer or so. We envision that there are capabilities even beyond of what libfastjson can do. So we still consider this project and think it is useful. In regard to liblognorm, however, we need to provide a more generic interface, and libfastjson is a good match here. Also, we do not know how long it will take until we replace the variable system. We don’t even know if we actually can do it time-wise.

2015-08-262018-05-30

moving towards liblognorm v2

The initial version of liblognorm v2 is almost ready. It offers many new features, like custom data types, much easier rule description langugage, and potentially even greater performance (we have not yet verfied this). As some of you know, I have worked very hard on liblognorm during the past weeks. I have now reached a very important milestone and will switch the git master branch to use the new version. If things go smooth enough, the initial release of liblognorm v2 will go along with the next rsyslog release. Daily build will have it very soon.

Liblognorm v2 also contains the full v1 engine and thus is fully compatible with the previous versions, as far as rulebases are concerned. For more on the compatibility, please read the compatiblity document. In fact, by default the v1 engine is used. To opt in for the new features, you need to add a line

version=2

to the top of your rulebases Then it is when you need to really check the compatibility document. This is also what brings you the enhancements.

A couple of notes are due: while we are approaching the initial release, not all design goals have been met yet. Most importantly, we are feeding back user comments into the development process. As such, the v2 feature set is not 100% finalized yet. This means that we cannot yet fully guarantee that all constructs you use will remain compatible with versions released later. But those that know us also know that this risk is minimal and, if it happens, will be easy to fix. The core concepts are ready and unlikely to change. Note that I will continue to actively work on v2 and more features will be upcoming in the next weeks.

The online doc should be updated in two days at latest (actual update date depends on when I can switch the git branch and how this interacts with the automatic doc generating scripts). I invite you to use the new version and am sure it will be much easier to use and powerful.

Note to developers: the v2 and v1 engine are very different. V2 is a complete rewrite of the core components. Nevertheless, v1 and v2 share some of the same file names. For many reasons, this means I need to rewrite the git master branch with the new version.

Also note that as of now, no new development happens to v1, this version is essentially dead. Very important fixes to the v1 engine will be applied to the v1 subsystem of v2.

Feedback on v2 is appreciated, please post issues or feature requests directly to liblognorm’s github trackers (if possible).

2015-04-282018-05-30

liblognorm’s “rest” parser now more useful

The liblognorm “rest” parser was introduced some time ago, to handle cases where someone just wants to parse a partial message and keep all the “rest of it” into another field. I never was a big fan of this type of parser, but I accepted it because so many people asked. Practice, however, showed that my concerns were right: the “rest” parser has a very broad match and those that used it often got very surprising results.

A key cause of this issue was that the rest parser had the same priority as other parsers, and most importantly a higher priority than a simple character match. so it was actually impossible to match some constant text that was at the same location than the “rest” parser.

I have now changed this so that the rest parser is always called last, if no other thing matches – neither any parser nor any constant text. This will make it work much more like you expect. Still, I caution against using this parser as it continues to provide a very broad match.

Note that the way I have implemented this is not totally clean from a software engineering point of view, but very solid. A cleaner solution will occur during the scheduled rewrite of the algorithm (later in spring/summer).

Note that existing rulebases using “rest” may behave differently with the new algorithm. However, previously the result was more or less random, so any other change to the rulebase could also have caused different behaviour. So this is no compatibility break as there really is no compatibility to retain.

This will be released with 1.1.2, probably in early may. If you need it urgently, you can use a daily build.

2015-03-302018-05-30

Call for Log Samples

There is one big problem in research for better logging methods: no good logging sample repositories exist. Well, not even bad ones… I am currently doing some preliminary steps towards a new, better log normalization system. Among others, it will contain a structure analyzer which will remove much of the manual burden of creating normalization rules. But, guess what: while the project looks very promising, lack of log samples is a real big problem!

To solve that problem, I have setup a public log ingestor that you can simply send logs to. The system is reachable as follows:

hostname: logsubmit.rsyslog.com

port: 514

protocol: any flavor of syslog or other text data

If you run rsyslog, you can use add this snippet to /etc/rsyslog.conf:

*.* @logsubmit.rsyslog.com

How did this idea materialize? During my talk at the German Unix User’s Group FFG 2015 conference last week in Stuttgart, I mentioned that problem and Dirk Wetter had the idea to provide a log receiver that makes it very easy for people to contribute. There were some concerns that this may open up my server for DoS, and that of course is true. Nevertheless, I liked the idea and so we setup a machine today. It may be DDoS’ed and other bad things may happen, but then we got more experience. It’s split from the main systems, so that shouldn’t cause much harm.

For log contributors, please keep on your mind that you send data to a public service and so this is probably not a great idea to do this for sensitive systems. But if we get enough data from uncritical systems, we can still gain a lot from that, most importantly it helps us gain insight into structural log mining methods — which will also lead to above-mentioned tool. All logs gathered by this method will be placed in the research log repository, which currently is hosted on github. It is licensed under BSD 2-clause in the hope that a sufficiently large and diverse data set is also of great value for other researchers (did I mention it is ultra-hard to find any log sample data sets?). If you are interested in cotributing logs, but would want to do so under NDA, that’s of course also possible. In that case, please just drop me an email to see how to best go forward with that.

2013-11-28

On liblognorm1

Liblognorm is a fast-samples based normalization library. It’s brand new version 1.0.0 will be released today. It is a major improvement over previous versions, but unfortunately we needed to change the API. So some notes are due.

Liblognorm evolves since several years and was intially meant to be used primarily with the Mitre CEE effort. Consequently, the initial version of liblognorm (0.x) uses the libee CEE support library in its API.

As time evolved, the initial CEE schema underwent considerable change. Even worse, Mitre lost funding for CEE. While the CEE ideas survived as part of Red Hat-driven “Project Lumberjack”, the data structures became greatly simplified and JSON based. That effectively made libee obsolete (and also in parts libestr, which was specifically written to support CEE’s initial requirement of embedded NUL chars in strings).

Recently, Pavel Levshin converted liblognorm to native JSON, which helped improve performance and simplicity for many client applications. Unfortunately, this change broke interface compatibility (and there was no way to avoid that, obviously…).

The current library is the result of that effort. Application developers are encouraged to switch to this version, as it provides the benefit of a simpler API. This version is now being tracked by the git master branch.

However, if you need to stick to the old API, there is a git branch liblognorm0, which contains the previous version of the library. This branch is also maintained for important bug fixes, so it is safe to use.

We recommend that packagers create packages both for liblognorm0 and liblognorm1. Note that liblognorm’s development packages cannot coexist on the same system as the PKGCONFIG system would get into trouble. Adiscon’s own packages follow this schema.

Note that rsyslog will soon begin to enjoy the benefits of liblognorm1. This results in a notable performance improvement for mmnormalize. Support will initially become available in v8.

2013-01-252018-06-08

performance of liblognorm/rsyslog parse tree

We are currently considering a new table lookup feature for rsyslog, and as usual, performance is a priority. As such, we intend to base the implementation on the liblognorm parse tree, which is highly efficient. In fact, the adaption (maybe a partial rewrite) I have in mind for rsyslog actually has O(1) time complexity. This drew some discussion, as parse trees are usually considered to be O(log n). So I thought I quickly lay out why constant time complexity is possible.

To set some ground, let’s define the use case where this applies.We are talking about string matches, not regular expressions. In the case of liblognorm, we are also talking about exact string matches, and let me assume this for rsyslog’s lookup table functionality as well (at least in “string” mode). Note that this may be extended, and should be no problem, but I have yet to properly evaluate the effect on the parse tree. So let’s assume exact matches in this blog post. So basically, we have a set of strings and would like to see if there is an exact match (and pick up some value associated with that match). For example, we may have the strings “Berlin”, “Boston”, “Bogota” and “London”. That example is obviously extremely simplified, yet it will help us understand.

Whenever we talk about time complexity, we need to make sure what exact n we are talking about. When it comes to matching values, n is usually the number of different values. In our example, this n would be four. Of course, one could also look at string lengths, as the length obviously influences the complexity of the comparison operation. Let’s call this m, then in the above sample m would be six (length of longest string). If we take the unusual step to include that m, a linear list would have time complexity O(n*m), a binary tree O(log n*m) and rsyslog’s parse tree O(m). HOWEVER, there usually is an upper bound on string length, and for that reason m is taken as a constant in almost all cases. By definition, it will then no longer be mention in O notation, and so we end up with the usual view that a linear list is O(n), a binary tree O(log n) … and rsyslog’s parse tree O(1).

This is possible because rsyslog’s parse tree is based on the radix tree idea and kind of “compresses” the strings rather than including them in every node. The conceptual in-memory representation of a tree based on above values is as follows:

[root] - B --- O --- GOTA
  |      |     + --- STON
  |      +---- ERLIN
  + ---- LONDON

As you can see, the tree is actually based on single characters that are different between the strings. So we do not have each string at the tree level, but only the rest of what is different. On this works a simple state machine. On a conceptual level, the search string is processed character by character, so each character represents a state transition. For obvious reasons, there can be only as many state transitions as there are characters inside the longest string. This is what we called m above. However, the number of different strings is totally irrelevant for the search, because it does not influence the number of state transitions required. That number is what we called n. So in conclusion, the time complexity is only depending on m, which is kind of constant. Consequently, we have O(1) complexity.

Just to stress the point once again: we do not iterate over the individual strings in a tree-like manner, instead we iterate over the index string characters themselves. This creates a very broad and shallow tree, even for a large number of index values.

On real hardware, the number of strings of course has some slight effect, especially when it comes to cache hit performance. While a small tree may fit into the available cache, a much larger tree obviously will not. However, from a theoretical point of view, this does not matter, we still have constant time for the lookup. From a practical perspective, this also does not matter so much: any larger tree will have a larger memory requirement and thus a lower cache hit rate. In fact, rsyslog’s parse tree will probably perform better because of its compressed structure, which makes cache presence of at least top-level elements more likely.

For rsyslog, that’s all I need to tell. For liblognorm, the story is a bit longer. It is only “almost” O(1). While the string matches themselves can be carried out in constant time, liblognorm also supports parsers to pull out specific parts of a log message. The current parsers are all O(1) as well, so this does not change the picture. However, state transitions involving a parser are no longer simple operations where the transition is immediately clear. If, for a given prefix, multiple parsers may apply, liblognorm must iterate through them until a matching one is found. This is no longer O(1), but depends on the number of parsers. Of course, we can claim that the number of parsers is still a (even small) constant, and so we can still say all of this is O(1) without violating the definition. While this is true, there unfortunately comes another complexity with parsers: With constant string, backtracking is never required, as we immediately know the exact state transition. With parsers, we have dynamic elements, and later text matches may turn out to be a non-match. In that case, we need to go back to parser detection and see if another parser also fits. If so, we need to give it a try and need to re-evaluate the parse tree from that point on. This process is called backtracking and can ruin the runtime to become as worse as O(n^2), just like regular expressions. Obviously, the problem exists primarily of very generic parsers are used, like “word”. If, for example, parsers as specific as IPv4 are used, there is no chance for more than one parser matching. So the problem cannot occur. This is also why I caution against using too-generic parsers. Thankfully, very generic parsers are not used too frequently, and so we usually do not see this problem in practice. This is why I call liblognorm to be “almost” O(1), even though it may degrade to O(n^2) in extreme cases. Heuristically said, these extreme cases will not occur in practice.

The rsyslog parse tree does not use parsers inside the parse tree, just the fixed state transitions. As such, it actually is O(1). Oh, and yes: we should probably call the rsyslog “parse tree” better a “lookup tree” or “search tree” but for historical reasons it currently is named the way it is…

2011-11-25

Quick update on log normalization

I just wanted to give a heads up on the status of log normalization. We have just released updated versions of libee and liblognorm. These provide important new features, like the capability to annotate events based on classification. This, among others, brings the libraries more back inline with recent CEE developments. Also, the log normalizer tool is nearly ready for prime time. The “only” thing that is missing is a decent set of rule bases. Thankfully, sagan already has a couple. I guess besides programming obtaining rule bases is a major thing to target.

As soon as I find time (I hope soon!), I’ll finalize some lose ends on the software side and get doc online on how to use the normalizer tool. I think with that a great tool for everyday use in log analysis will become available. Feedback and collaboration is always appreciated!

2011-11-16

log annotation with liblognorm

I have recently written about the concept of event (log) annotation and liblognorm. During the past days I have made my mind up and have begun implementing some stuff today. In essence, rule bases will receive a new rule type “annotate”, which contains the annotation part. Here is a sample from my lab environment:



rule=logon:<%-:number%>1 %timestamp:date-rfc5424% %src-id:word% ...
annotate=logon:+action="login"

Note the text in red. This is a liblognorm tag (not to confuse with a CEE tag!). This rule base tells the normalizer to append, according to the target format, the fields that are given in the annotate statement to any events that have the tag in question (“logon” in our case).

Today, I am extending the rule base parser to support the annotate rule. Within the next days, I’ll update the rest of the system. When this is done, I’ll probably release that version so that you can try out the new functionality in your own environment.

2011-11-09

liblognorm event annotation … and CEE

As you probably know, CEE is an effort driven by MITRE to support a common event expression format. Liblognorm is a log normalizer library (aka “network event normalizer”). One of its primary target formats is CEE.

For pure normalizing needs, liblognorm extracts data fields from semi-structured log message. The extracted fields are available inside a (basically) name/value property list. Liblognorm also permits to classify messages, e.g. as being a logon or logoff message. For this classification, liblognorm provides so-called “tags”. These are simple words (strings of characters) which can be specified by the user. Tags reside in a special property called “tags”, but otherwise occupy a flat space (tags can easily be structured via punctuation).

CEE takes a slightly different approach: while it shares the tag concept (actually liblognorm inherited tags from an earlier version of CEE), CEE classifies tags into different tag types. For example, “logon” may be a tag, but can only be used to describe an action(-field). As such, “logon” can not be present by itself in a CEE log record, it must be given as value of the action field (‘action=”logon”‘). Also, CEE requires some other fields which may not be present explicitly from the original message even though the information may implicitly be present inside it. To express such information entities (and tags in the CEE way), liblognorm needs the capability to add additional fields to an extracted event. Let call these set of fields the “annotation” for easier future reference. Liblognorm needs to annotate the event so that the target format’s (CEE) requirements are met. While I was talking about CEE so far, I assume (and know from previous experience) that other formats may have similar requirements, albeit different fields that need to be annotated.

The question is now: how to implement this in liblognorm? The initial idea was to include the annotation inside the normalization rule itself. That has a major drawback: If a rule base is to be used for CEE and some other format, the annotation may be different, and thus the same rule base cannot be used. These two rule bases would differ in just the annotation. So it seems more natural, and easier to maintain, to split the recognition rule from the annotation rule. In that setting, the message is recognized and classified by recognition rules and the annotation is based on (different) annotation rules. So only one set of recognition rules can be used by multiple annotation rules. Only the latter need to be redefined for different target formats (or systems).

This split-rule method is the way I intend to head to. In essence, the current “rule=” rule and its format will remain untouched. It will be augmented by “annotate=” rules, which contain the full annotation. The binding between these two will be done via classification (liblognorm tags): in the first step, the message is recognized, data extracted and tags assigned, just like it is currently done. Then a second step will be added. It traverses through the tags and adds all annotation that are defined for the message’s tag set. So the binding is on the tag set. Finally, it is probably necessary to add a third step that can remove unwanted fields. This step is probably target-format specific. For example, this step could eliminate the liblognorm tag set from an event if CEE compliance is desired, because CEE does not support, not even permit, an extra tag set.

Feedback on this approach is appreciated. It is my hope to be able to implement this in the near future.

2011-11-08

filler fields in log normalization

When looking at some real-world rule bases for liblognorm, I noticed that it is often required to check for the presence of a specific field, but the value is actually not needed. This leads to fields named e.g. “filler”, “dummy”, “dummy<n>” with n being an increasing number. This is both clumsy and requires unnecessary processing power. For that reason, I have introduced “-” (dash) as field name. When this special name is used, the field as parsed as usual, but immediately discarded after the successful parse. So while we need to parse and extract in order to get the parse logic right, we save the effort to keep a copy of this unneeded data. This also means that output log records produced by the normalizer tool are cleaned up. I hope this is a useful addition.