simplifying rsyslog JSON generation

With RESTful APIs, like for example ElasticSearch, you need to generate JSON strings. Rsyslog will soon do this in a very easy to use way. The current method is not hard either, but often looks a bit clumsy. The new way of doing things will most probably be part of the 8.33 release.

You now can define a template as follows:

template(name="outfmt" type="list" option.jsonf="on") {
property(outname="@timestamp"
name="timereported"
dateFormat="rfc3339" format="jsonf")
property(outname="host"
name="hostname" format="jsonf")
property(outname="severity"
name="syslogseverity-text" caseConversion="upper" format="jsonf")
property(outname="facility"
name="syslogfacility-text" format="jsonf")
property(outname="syslog-tag"
name="syslogtag" format="jsonf")
property(outname="source"
name="app-name" format="jsonf")
property(outname="message"
name="msg" format="jsonf")

}

This will generate JSON. Here is a pretty-printed version of the generated output:

{
"@timestamp": "2018-03-01T01:00:00+00:00",
"host": "172.20.245.8",
"severity": "DEBUG",
"facility": "local4",
"syslog-tag": "app[1666]",
"source": "app",
"message": " this is my syslog message"
}

Note: the actual output will be compact on a single “line”, as this is most useful with RESTful APIs.

Future versions of rsyslog may see additional simplifications in generating the JSON. For example, I currently think about removing the need to give format=”jsonf” for each property.
The functionality described here is being added via this pull request.

rsyslog and liblognorm will switch to libfastjson as replacement for json-c

We have been using json-c for quite a while now and had good success with it. However, recent problem reports and analysis indicate that we need to replace it in the future. Don’t get me wrong: json-c is a solid piece of software, but we most probably use it much more intensely as the json-c developers ever anticipated. That’s probably the actual root cause why we need to switch.

A main problem spot is performance: various experiments, profiler runs, code review and experimental implementations have prooven that json-c is a severe bottleneck. For example, in the evaluation of liblognorm v2 performance, we found out that json-c calls dominated processing time by more than 90%. Once we saw this, we dug down into the profiler and saw that the hashtable hash calculation as well as memory allocations took a large amount of overall processing time. We have submitted an initial performance enhancement PR to json-c which also got merged. That already removed processing time considerably. We continued on that path, resulting in a quite large second enhancement PR, which I withdrew due to disagreement with the json-c development lead.

A major problem for our application of json-c is that the hash table implementation beneath it is not a good match to our needs. We have been able to considerably speed it up by providing a new hash function (based on perl’s hash function), but to really get the performance we need, we would need to replace that system. However, that is not possible, because json-c considers the hash tables part of its API. Actually, json-c considers each function, even internal ones, as part of the API, so it is very hard to make any changes at all.

Json-c also aims at becoming fully JSON compliant. It currently is not due to improper handling of NUL bytes, but the longer-term plan is to support NUL bytes this. While this is a good thing to do for a general json library, it is a performance killer for our use case. I know, because I faced that same problems with the libee implementation years ago, where we ditched it later in accordance with the CEE standards body board. I admit I also have some doubts if that change in json-c will actually happen, as it IMHO requires a total revamp of the API.

Also, the json-c project releases far to infrequently (have a look at recent json-c releases, the last one was April, 2014). And then, it takes the usual additonal timelag for distros to pick up the new version. So even if we could successfully submit further performance-enhancing PRs to json-c, it would take an awful lot of time before we could actually use them. I would definitely not like to create private packages for rsyslog users, as this could break other parts of a system.

Finally, json-c contains a real bad race bug in reference counting, which can cause rsyslog to segfault under some conditions. A proposed fix was unfortunately not accepted by the json-c development lead, so this is an open issue. Even if it were, it would probably take a long time until the release of the fixed version and its availability in standard packages.

In conclusion and after a lot of thinking, we decided that it is best to fork json-c, which we than did. The new project is named libfastjson. As the name suggests, it’s focus is on performance. It will not try to be 100% JSON compliant. We will not support NUL characters in a standards-conformant way. I expect this not to be a big deal, as json-c also never did this, and the number of complaints seem to be very low. So libfastjson will not aim to be  general purpose json library, but one that offers high performance at some functionality cost and works well in highly threaded applications. Note that we will also reduce the number of API functions and especially remove those that we do not need and that cost performance. Also, the data store will probably be changed from the current hashtable-only system to something more appropriate to our tasks.

Libfastjson already includes many performance enhancement changes and a solid fix for the reference counting bug. Up until that bug, we planned to release in the Feb..April 2016 time frame, together with liblognorm v2. Now this has changed, and we actually did a kind of emergency release (0.99.0) because of the race bug. The source tarball is already available. We are working on packages in the rsyslog repositories (Ubuntu is already done). Rsyslog packages are not yet build against it, but we may do an refresh after the holiday period.

Rsyslog 8.15.0 optionally builds against libfastjson (it is preferred if available). Due to the race bug, we have decided that rsyslog 8.16.0 will require libfastjson.

A side-note is due: we have been thinking about a replacement for the variable subsystem since summer or so. We envision that there are capabilities even beyond of what libfastjson can do. So we still consider this project and think it is useful. In regard to liblognorm, however, we need to provide a more generic interface, and libfastjson is a good match here. Also, we do not know how long it will take until we replace the variable system. We don’t even know if we actually can do it time-wise.

how to use mmjsonparse only for select messages

Rsyslog’s mmjsonparse module permits to parse JSON base data (actually expecting CEE-format). This message modification module is implemented via the output plugin interface, which provides some nice flexibility in using it. Most importantly, you can trigger parsing only for a select set of messages.

Note that the module checks for the presence of the cee cookie. Only if it is present, json parsing will happen. Otherwise, the message is left alone. As the cee cookie was specifically designed to signify the presence of JSON data, this is a sufficient check to make sure only valid data is processed.

However, you may want to avoid the (small) checking overhead for non-json messages (note, however, that the check is *really fast*, so using a filter just to spare it does not gain you too much). Another reason for using only a select set might be that you have different types of cee-based messages but want to parse (and specifically process just some of them).

With mmjsonparse being implemented via the output module interface, it can be used like a regular action. So you could for example do this:

if ($programname == ‘rsyslogd-pstats’) then {
      action(type=”mmjsonparse”)
      action(type=”omfwd” target=”target.example.net” template=”…” …)

}

As with any regular action, mmjsonparse will only be called when the filter evaluates to true. Note, however, that the modification mmjsonparse makes (most importantly creating the structured data) will be kept after the closing if-block. So any other action below that if (in the config file) will also be able to see it.

rsyslog templates & json

I today added a simpler method to specify JSON inside rsyslog templates. The new method simplifies specifying JSON-encoded fields. It still looks a bit ugly, but if you look closely enough, you’ll quickly notice that it no longer needs “quoting magic” and thus is far easier to work with.

Previously, you had to define a template like this:

$template tpl, “{“message”:”%msg:::json%”,”fromhost”:”%HOSTNAME:::json%”,”facility”:”%syslogfacility-text%”,”priority”:”%syslogpriority-text%”,”timereported”:”%timereported:::date-rfc3339%”,”timegenerated”:”%timegenerated:::date-rfc3339%”}”

The template given above is the default for ElasticSearch. With the new code, this can be replaced by:

$template tpl,”{%msg:::jsonf:message%,%HOSTNAME:::jsonf:fromhost%,%syslogfacility-text:::jsonf:facility%,%syslogpriority-text:::jsonf:priority%,%timereported:::date-rfc3339,jsonf%,%timegenerated:::date-rfc3339,jsonf%}”

It’s a bit shorter, but most importantly the JSON field is now generated by the property itself. This is triggered by the “jsonf” (json field) option. If it is given, a

“fieldname”=”value”

is automatically generated. The fieldname is either the forth parameter (see “message” in the msg field in the example above) or, if not given, defaults to the property name. If I hadn’t insisted on specific field names, I could have written the sample as such:

$template tpl,”{%msg:::json%,%HOSTNAME:::jsonf%,%syslogfacility-text:::jsonf%,%syslogpriority-text:::jsonf%,%timereported:::date-rfc3339,jsonf%,%timegenerated:::date-rfc3339,jsonf%}”

Note that the commas to separate the various JSON fields must still be included inside the template as literal text.

This new system works pretty well within the current template system. The config format will probably become a bit more intuitive when moving it over to the new config language. But the exact how’s of that are still to be decided. In fact, I think this string-based way of specifying templates is not so bad. In any case, I am also considering a template option which would take a template of pure field lists and generate proper JSON out of that. But that’s work for another day (or so ;)).

This is to be released with 6.3.9 and already available via the git master branch.

JSON and rsyslog templates

Rsyslog already supports JSON parsing and formatting (for all cee properties). However, the way formatting currently is done is unsatisfactory to me. Right now, we just take the cee properties as they are and format them into JSON format. In this mode, we do not have any way to specify which fields to use and we also do not have a way to modify the field contents (e.g. pick substrings or do case conversions). Exactly these are the use cases rsyslog invented templates for.

One way to handle the situation is to have the user write the JSON code inside the template and just inject the data field where desired. This almost works (and I know Brian Knox tries to explore that route). IT just works “almost” as there is currently no property replacer option to ensure proper JSON escaping. Adding this option is not hard. However, I don’t feel this approach is the right route to take: making the admin craft the JSON string is error-prone and very user-unfriendly.

So I wonder what would be a good way to specify fields that shall go into a JSON format. As a limiting factor, the method should be possible within the limits of the current template system – otherwise it will probably take too long to implement it. The same question also arises for outputs like MongoDB: how best to specify the fields (and structure!) to be passed to the output module?

Of course, both questions are closely related. One approach would be to solve the JSON encoding and say that to outputs like MongoDB JSON is passed. Unfortunately, this has strong performance implications. In a nutshell, it would mean formatting the data to JSON, and then re-parsing it inside the plugin. This process could be be somewhat simplified by passing the data structure (the underlaying tree) itself rather than the JSON encoding. However, this would still mean, that a data structure specific for this use would need to be created. That obviously involves a lot of data-copying. So it would probably be useful to have a capability to specify fields (and replacement options) that are just passed down to the module for its use (that would probably limit the required amount of data copying, at least in common cases). Question again: what would be a decent syntax to specify this?

Suggestions are highly welcome. I need to find at least an interim solution urgently, as this is an important building block for the MongoDB driver and all work that will depend on it. So please provide feedback (note that I may try out a couple of things to finally settle on one – so any idea is highly welcome ;)).

parsing JSON-enhanced syslog

Strucuted logging is cool. A couple of month ago, I added support for log normalization and the 0.5 draft CEE standard to rsyslog. At last weeks Fedora Developer’s Conference, there was a huge agreement that CEE-like JSON is a great way to enhance syslog logging. To follow up on this concept, I have integrated a JSON decoder into libee, so that it can now decode JSON with a single method call. It’s a proof of concept, and for serious use performance optimization needs to be done. Besides that, it’s already quite solid.

Also, I just added the mmjsonparse message modification module to rsyslog (available now in git master branch!). It checks if the message contains an “@JSON: ” cookie and, if so, tries to parse the resulting string as JSON. If that succeeds, we obviously have a JSON-enhanced message and the individual name/value pairs are stored and can be used both in filters and output templates. This provides some really great opportunities when it comes to processing the structured data. Just think about RESTful interfaces and such!

Right now, everything is at proof of concept level, but works well enough for you to try it. I’ll smoothen some edges but will release the versions rather soon. Probably the biggest drawback is that the JSON processor currently flattens the event, with structure being conveyed via field names. That means if you have a JSON object “SUPER” containing a number of fields “field1” to “fieldn”, the current implementation will be a single level and the names are “SUPER.field1”,… I did this in order to have a quick solution and one that fits into the existing framework. I’ll work on creating real structure soon. It’s not really hard, but I probably do some other PoCs first ;)

I considered several approaches, among them moving over to libcollection (part of ding-libs) or a pure JSON parser. The more I worked with the code, the more it turned out that libee already has a lot of the necessary plumbing and could simply been enhanced/modified under the hood. The big plus in that approach is that is immediately plugs in into rsyslog and the other solutions that already built on it. This even enables to use the new functionality in the v6 context (I originally thought I’d need to move on to rsyslog v7 for the name-value pair changes). Now that I have written mmjsonparse, this really seems to work out. No engine change was required, and I expect little need for change even for the final version. As such, I’ll proceed in that direction. Actually, what I now use is kind of a hybrid approch: I use a lot of philosophy of libcollection, which showed me the right route to take. Then, I use cJSON, which is a really nice JSON parser. In the proof of concept, I use both cJSON’s object model and libee’s own. I expect to merge them, actually tightly integrating cJSON. The reason is that CEE has evolved quite a bit in the mean time, and many complex constructs are no longer required. As such, I can streamline the library as well, what not only reduces complexity but speeds up the whole process.

I would like to express my sincere thank to Dmitri Pal, Keith Robertson and Bill Heinbockel, which provided great advise and excellent discussion. And the best is that this part of the effort is just the beginning… Stay tuned for more!