how to use mmjsonparse only for select messages

Rsyslog’s mmjsonparse module parses JSON-based data (actually, it expects CEE format). This message modification module is implemented via the output plugin interface, which provides some nice flexibility in using it. Most importantly, you can trigger parsing for only a select set of messages.

Note that the module checks for the presence of the cee cookie. JSON parsing happens only if the cookie is present; otherwise, the message is left alone. As the cee cookie was specifically designed to signify the presence of JSON data, this is a sufficient check to make sure only valid data is processed.

However, you may want to avoid the (small) checking overhead for non-JSON messages (note, however, that the check is *really fast*, so using a filter just to spare it does not gain you much). Another reason for using only a select set might be that you have different types of cee-based messages but want to parse (and specifically process) just some of them.

With mmjsonparse being implemented via the output module interface, it can be used like a regular action. So you could for example do this:

if ($programname == 'rsyslogd-pstats') then {
      action(type="mmjsonparse")
      action(type="omfwd" target="target.example.net" template="…" …)
}

As with any regular action, mmjsonparse will only be called when the filter evaluates to true. Note, however, that the modifications mmjsonparse makes (most importantly, creating the structured data) are kept after the if-block is closed. So any other action below that if (in the config file) will also be able to see them.

rsyslog / CEE base schema mapping

I am taking a first shot at a mapping of the CEE base schema (as currently described in project lumberjack, NOT on the CEE site!) to rsyslog properties. The core idea is to use this mapping as the default for ommongodb. Then, rsyslog shall be able to write this schema, while logtools (and others) can rely on it. For obvious reasons, “rely” is not to be taken literally, as the whole thing currently is a moving target.

So I would deeply appreciate feedback for improving this mapping.

In the following mapping, the cee field name is first, the rsyslog property second.

Fields we can always map:

  • srchost -> hostname
  • time -> timestamp (rsyslog currently populates subseconds, which do not seem to be supported in lumberjack)
  • msg -> msg (initially used rawmsg, but decided against this)
  • pid -> procid (may not actually be a Linux process ID)
  • proc -> app-name
  • level -> generated based on syslog severity (value mapping see below)

Translation of syslog severity to lumberjack level (not bijective; syslog severity first, with its numerical value in parentheses):

  • emergency(0) -> FATAL
  • alert(1), critical(2), error(3) -> ERROR
  • warning(4) -> WARN
  • notice(5), informational(6) -> INFO
  • debug(7) -> DEBUG
  • (never mapped) -> TRACE
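
The two lists above can be sketched as a small lookup in Python. This is only an illustration of the mapping as it stands in this post, not code from rsyslog or ommongodb. The function names `lumberjack_level` and `to_base_event` are hypothetical, and so is the idea of passing the rsyslog properties as a plain dict (with the numeric severity assumed to be available under the key `syslogseverity`).

```python
# Syslog severity (numeric value) -> lumberjack level, as listed above.
# TRACE never appears because no syslog severity maps to it.
SEVERITY_TO_LEVEL = {
    0: "FATAL",  # emergency
    1: "ERROR",  # alert
    2: "ERROR",  # critical
    3: "ERROR",  # error
    4: "WARN",   # warning
    5: "INFO",   # notice
    6: "INFO",   # informational
    7: "DEBUG",  # debug
}


def lumberjack_level(severity: int) -> str:
    """Translate a numeric syslog severity (0-7) to a lumberjack level."""
    if severity not in SEVERITY_TO_LEVEL:
        raise ValueError(f"not a syslog severity: {severity!r}")
    return SEVERITY_TO_LEVEL[severity]


def to_base_event(props: dict) -> dict:
    """Build the always-mappable base fields from rsyslog-style properties.

    `props` is a hypothetical dict keyed by rsyslog property names; the
    numeric severity is assumed to be stored under "syslogseverity".
    """
    return {
        "srchost": props["hostname"],
        "time": props["timestamp"],
        "msg": props["msg"],
        "pid": props["procid"],
        "proc": props["app-name"],
        "level": lumberjack_level(props["syslogseverity"]),
    }
```

Because the translation is not bijective, there is no way to recover the original syslog severity from the level alone; a consumer that needs it would have to store the severity separately.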

Fields we do not currently provide, but could in some cases:
Note that these fields may or may not be present inside a JSON/BSON document.

  • ppid -> parent process ID (SCM_CREDENTIALS, local only?)
  • uid ->  (SCM_CREDENTIALS, local only?)
  • gid ->  (SCM_CREDENTIALS, local only?)
  • tid -> thread ID (questionable; probably cannot be provided with the current logging API)

As I said, feedback and suggestions are highly welcome. This list is a work in progress and can change at any moment. I’ll provide notification when the interface has stabilized. Do not expect this soon.

next steps for ommongodb

I just wanted to give you a heads-up on my work on ommongodb. During the past couple of days I have converted it to libmongo-client, which gives us a much more solid basis. I have also refactored it to some degree and adapted it to the new v6 config interface. Also, ommongodb will not be supported on pre-v6 platforms. This enables me to use the v6-exclusive features I am building now, especially great JSON and CEE support. Right now, ommongodb uses a very limited field set, and this set is hardcoded (so you can change it, but that means you need to change code).

My next step is to make ommongodb support the base event (as currently being discussed in project lumberjack). I will also provide a capability to add “extra” information from the cee field set. That’s probably not a perfect solution, but the goal is to get ready for some command line tools that are able to extract data from MongoDB and thus make the system mimic a traditional flat-file syslog format. I have also asked Andre, the lead behind Adiscon LogAnalyzer, to consider adding support for MongoDB to LogAnalyzer. I have not yet heard back from him and don’t know exactly about his schedule, but I hope we will be able to make this happen very soon.

Only after that – somewhat hardcoded – work is done I’ll go back and look at JSON and templates in a more native way (very probably also looking at the contributed JSON string generator in more depth).

JSON and rsyslog templates

Rsyslog already supports JSON parsing and formatting (for all cee properties). However, the way formatting is currently done is unsatisfactory to me. Right now, we just take the cee properties as they are and format them as JSON. In this mode, we do not have any way to specify which fields to use, and we also do not have a way to modify the field contents (e.g. pick substrings or do case conversions). These are exactly the use cases rsyslog invented templates for.

One way to handle the situation is to have the user write the JSON code inside the template and just inject the data fields where desired. This almost works (and I know Brian Knox is trying to explore that route). It only “almost” works because there is currently no property replacer option to ensure proper JSON escaping. Adding this option is not hard. However, I don’t feel this approach is the right route to take: making the admin craft the JSON string is error-prone and very user-unfriendly.
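
To illustrate why the escaping matters, here is a minimal Python sketch (not rsyslog code). It contrasts a hand-crafted JSON string that is filled in by plain substitution with one where the value goes through a JSON encoder first. The template text, the sample message and the helper `is_valid_json` are made up for this example.

```python
import json

# A hand-written "template": JSON text with a placeholder for the message.
TEMPLATE = '{"msg": "%s"}'

# Sample message containing characters that are special inside JSON strings:
# double quotes, a backslash and a newline.
msg = 'user "bob" opened C:\\temp\nsecond line'

# Naive substitution: the special characters end up unescaped in the output.
naive = TEMPLATE % msg

# Escaped substitution: json.dumps() returns a quoted and escaped JSON string,
# so its surrounding quotes are stripped before it is inserted.
escaped = TEMPLATE % json.dumps(msg)[1:-1]


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False
```

With this sample message, `naive` is rejected by a JSON parser, while `escaped` parses and yields the original message again.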

So I wonder what would be a good way to specify fields that shall go into a JSON format. As a limiting factor, the method should be possible within the limits of the current template system – otherwise it will probably take too long to implement it. The same question also arises for outputs like MongoDB: how best to specify the fields (and structure!) to be passed to the output module?

Of course, both questions are closely related. One approach would be to solve the JSON encoding and say that JSON is passed to outputs like MongoDB. Unfortunately, this has strong performance implications. In a nutshell, it would mean formatting the data to JSON and then re-parsing it inside the plugin. This process could be somewhat simplified by passing the data structure (the underlying tree) itself rather than the JSON encoding. However, this would still mean that a data structure specific to this use would need to be created. That obviously involves a lot of data copying. So it would probably be useful to have a capability to specify fields (and replacement options) that are just passed down to the module for its use (that would probably limit the required amount of data copying, at least in common cases). Question again: what would be a decent syntax to specify this?

Suggestions are highly welcome. I need to find at least an interim solution urgently, as this is an important building block for the MongoDB driver and all work that will depend on it. So please provide feedback (note that I may try out a couple of things to finally settle on one – so any idea is highly welcome ;)).

finally… rsyslog agent for windows released

It’s done! We have finally released the rsyslog Agent for Windows, a nice piece of software that enables easy integration of Windows Event Logs into an rsyslog backend system. Ideas for this tool floated around for roughly four to five months, and we had lots of internal discussions. It is important to note that we at Adiscon already have the necessary technology as part of our Windows products (actually, we invented this whole event-log-to-syslog type of software…), so it was just a matter of fine-tuning the code and selecting some useful default settings and policies.

The release is important because it makes clear that there actually is a Windows component (while I tried to convey that several times, people most often did not realize it due to name differences; something with “rsyslog” in the name was expected). It is also important for me internally at Adiscon: the rsyslog Agent is a commercial product, and license sales will make clear that this business is driven by rsyslog. And, obviously, I hope that this will help fund the project without the need to resort to other things like premium plugins. This is also why the Agent is so important for the rsyslog project as a whole: it will hopefully help to stabilize the funding situation even more.

EDIT: I should probably mention that the Windows Agent was more or less ready when I held back its release in order to integrate support for cee-enhanced syslog into it. I am glad I could convince my folks at Adiscon, so that we now have this exciting feature actually available.

CEE-enhanced syslog defined

CEE-enhanced syslog is an upcoming standard for expressing structured data inside syslog messages. It is a cross-platform effort that aims at making log analysis (and log processing in general) much easier for both log producers and consumers. The idea was originally born as part of MITRE’s CEE effort. It has been adopted by a larger set of logging stakeholders in an initiative that was named “project lumberjack”. Under this project, cee-enhanced syslog, and a framework to make full use of it, is being openly advanced. It is hoped (and planned) that the outcome will flow back to the CEE standard.

In a nutshell, cee-enhanced syslog is very simple and powerful: inside the syslog message, a special cookie (“@cee:”) is followed by a JSON representation of the data. The cookie tells processors that the format is actually cee-enhanced. If you are interested in more technical coverage, have a look at my cee-enhanced syslog howto presentation.
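
On the producer side, the format described above could be generated with a few lines of Python. This is a minimal sketch, not taken from any of the products mentioned here; the function name `make_cee_message` and the sample field names are hypothetical, and the single space after the cookie is an assumption.

```python
import json

CEE_COOKIE = "@cee:"


def make_cee_message(fields: dict) -> str:
    """Return the message part of a cee-enhanced syslog message.

    The cookie comes first, followed by the JSON representation of the
    structured data. A space after the cookie is assumed here.
    """
    return f"{CEE_COOKIE} {json.dumps(fields)}"


# Example: a made-up logon event with two name/value pairs.
example = make_cee_message({"msg": "user logged on", "user": "alice"})
```

The resulting string would then be handed to whatever syslog API the application already uses, in place of the free-text message.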

Adiscon is one of the main supporters of project lumberjack and CEE enhanced syslog. Since February 2012, Adiscon products offer basic support for cee-enhanced syslog, being among the first tools to do so.

What is CEE-enhanced syslog?

I just did a quick presentation on what cee-enhanced syslog actually is and how it works. I suggest having at least a peek, as this format will probably become very important in the future. But why say more… just get the full story in 5 minutes ;)

cee-enhanced event log to syslog forwarding

As many know, we at Adiscon also work hard on Windows Event Log to syslog forwarding software. During the past days we have taken the time to implement the cee-enhanced syslog format inside these products as well. It is currently at the proof-of-concept stage, but mostly because the relevant specs are also at PoC. This effort integrates nicely with the new project lumberjack, which aims at providing structured logging. New versions of the relevant Windows products (EventReporter and MonitorWare Agent) will be released very soon. With these releases, we are again the first-ever folks to release something never seen before, this time CEE support for Windows logging ;)

But how does it work? Basically, it is a message format option of the “format syslog” option. If you select cee-enhanced syslog, messages will be emitted in that format. Most importantly, they will include nice name/value pairs of the Windows events (if Windows provided names; otherwise the previous “Paramn” replacement names will be used). For example, a security event is described as follows:

@cee: {"source": "machine.local", "nteventlogtype": "Security", "sourceproc": "Microsoft-Windows-Security-Auditing", "id": "4648", "categoryid": "12544", "category": "12544", "keywordid": "0x8020000000000000", "user": "N\A", "SubjectUserSid": "S-1-5-11-222222222-333333333-4444444444-5555", "SubjectUserName": "User", "SubjectDomainName": "DOMAIN", "SubjectLogonId": "0x5efdd", "LogonGuid": "{00000000-0000-0000-0000-000000000000}", "TargetUserName": "Administrator", "TargetDomainName": " DOMAIN ", "TargetLogonGuid": "{00000000-0000-0000-0000-000000000000}", "TargetServerName": "servername", "TargetInfo": " servername ", "ProcessId": "0x76c", "ProcessName": "C:\Windows\System32\spoolsv.exe", "IpAddress": "-", "IpPort": "-", "catname": "Logon", "keyword": "Audit Success", "level": "Information"}

Note that we currently focus on cee-enhanced syslog format. We did not yet try to map the Windows field names to the CEE dictionary/profile terms. Probably the most important reason for this focus is that we do not yet have any definite spec to write to. Obviously, once the spec is out, it is fairly easy to upgrade the implementation to support these other names.

A co-worker is right now doing some more testing with rsyslog, which is able to understand that new format. I’ll update you with the findings, and procedures, once they are ready.

Announcing Project Lumberjack

Two weeks ago, alongside the Fedora Developer’s Conference in Brno, Czech Republic, a couple of logging and auditing folks from Red Hat, Balabit (syslog-ng), the MITRE Corporation, and Adiscon (me) put their heads together to talk about the future of structured logging. It quickly became clear that extending syslog in the CEE spirit is the right thing to do.

We observed that almost all the technology needed to provide a rich framework for structured logging is already present. Actually, both syslog-ng and rsyslog have provided the necessary plumbing for a long time (for example, as part of the RFC5424 effort), but that functionality is relatively seldom used by other developers. A core problem in that regard is that most applications rely on the good old syslog() API, which does not provide structured logging by itself. Also, there is no common log storage database available that tools could be based on.

In order to evolve syslog, we defined a three-layer architecture, with applications and logging libraries/APIs being the top layer, the syslogd the middle layer and the datastore the bottom layer. Multiple APIs must be supported, as no one can expect projects to change their existing logging infrastructure. Also, existing frameworks like log4j and even glibc’s syslog() will stay around for a while longer. New libraries (like ELAPI) will probably become more dominant for new applications. So how do we tie these different libraries to the syslogd subsystem (the second layer)?

The solution is rather simple: we use what we already achieved in CEE and support cee-enhanced syslog on the system log socket. The core idea is very simple: we use the regular syslog message part, but include JSON-encoded structured data with it. To signify to the syslog system that this is actually cee-enhanced, a cookie string (“@cee:”) is used in front of the JSON data. It is then easy for the syslogd to decide which message format it is dealing with: if the cookie is present and the rest of the message is a valid JSON representation, the message is cee-enhanced. If one of the two conditions fails, it is traditional syslog. As both conditions are checked together, it is highly unlikely that a legacy syslog message will ever meet these criteria (and if it really does, nothing is lost: after all, the syslogd has correctly understood that format). It must be noted that the necessary parsing and internal plumbing is available in both syslog-ng and rsyslog (yesterday I committed the missing JSON parser, which I had held back while awaiting a more final CEE standard).
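
The two-condition check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual parser in rsyslog or syslog-ng: the function name `parse_cee` is hypothetical, leading whitespace before the cookie is tolerated, and only a JSON object is accepted as structured data.

```python
import json
from typing import Optional

CEE_COOKIE = "@cee:"


def parse_cee(msg: str) -> Optional[dict]:
    """Return the structured data if `msg` is cee-enhanced, else None."""
    # Assumption: whitespace in front of the cookie is ignored.
    text = msg.lstrip()
    if not text.startswith(CEE_COOKIE):
        return None  # condition 1 failed: no cookie, traditional syslog
    try:
        data = json.loads(text[len(CEE_COOKIE):])
    except ValueError:
        return None  # condition 2 failed: rest is not valid JSON
    # Assumption: only a JSON object counts as structured event data.
    return data if isinstance(data, dict) else None
```

A caller would treat a `None` result as traditional syslog and leave the message untouched, which mirrors the "nothing is lost" argument: a message is only handled as cee-enhanced when both conditions hold.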

The interface to the log database layer is currently not as well defined and needs to be worked on. Note that both syslog-ng and rsyslog support multiple datastores, so solutions already exist. The group as a whole was of the opinion that some unified API for a log data store would be useful and something that should be looked at as a longer-term target.

After reaching this rough consensus, we were delighted to see that most of the base technology is already in place and just needs to be tied together correctly. It is more an effort of doing detail implementations and documenting the various pieces (and how exactly they work together) than creating a totally new system (aka “can be quickly done”). We agreed that it probably is best to reach for the low-hanging fruit first: get structured logging integrated first, then do the other steps. So an initial milestone will be making sure cee-enhanced syslog is supported by all of the subsystems, and only after this is done will we reach for the other things.

One of these next things definitely is a dictionary of field names (and exact structure) to be used to describe events in a standard way (for example, a logon event). While the whole effort is highly inspired by CEE, it probably is best to try out initial efforts outside of the formal CEE framework. That will enable rapid development, discussion and the capability to check what works in practice. The experience gained in such a PoC can then be fed back to the formal CEE process (along the lines of the old IETF mantra “running code and rough consensus first”).

We agreed that such an effort is best done in a transparent and flexible open source process. With that, project lumberjack was born: an effort to provide better structured logging for Linux, supported by many major players in that arena. We agreed that it would be a good idea if Red Hat provided some of the project infrastructure. This is why you now find project lumberjack at fedorahosted.org (note that the project will probably contain mostly specs and less code, which is kept in the individual projects’ repositories).

parsing JSON-enhanced syslog

Structured logging is cool. A couple of months ago, I added support for log normalization and the 0.5 draft CEE standard to rsyslog. At last week’s Fedora Developer’s Conference, there was broad agreement that CEE-like JSON is a great way to enhance syslog logging. To follow up on this concept, I have integrated a JSON decoder into libee, so that it can now decode JSON with a single method call. It’s a proof of concept, and for serious use performance optimization needs to be done. Besides that, it’s already quite solid.

Also, I just added the mmjsonparse message modification module to rsyslog (available now in git master branch!). It checks if the message contains an “@JSON: ” cookie and, if so, tries to parse the resulting string as JSON. If that succeeds, we obviously have a JSON-enhanced message and the individual name/value pairs are stored and can be used both in filters and output templates. This provides some really great opportunities when it comes to processing the structured data. Just think about RESTful interfaces and such!

Right now, everything is at proof of concept level, but works well enough for you to try it. I’ll smooth some edges but will release the versions rather soon. Probably the biggest drawback is that the JSON processor currently flattens the event, with structure being conveyed via field names. That means if you have a JSON object “SUPER” containing a number of fields “field1” to “fieldn”, the current implementation produces a single level in which the names are “SUPER.field1”,… I did this in order to have a quick solution and one that fits into the existing framework. I’ll work on creating real structure soon. It’s not really hard, but I will probably do some other PoCs first ;)
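
The flattening described above can be illustrated with a short Python sketch. It is not the libee or mmjsonparse code, just a minimal model of the naming scheme; the function name `flatten` is hypothetical, and leaving non-object values such as arrays untouched is an assumption.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested JSON objects into a single level.

    Structure is conveyed via dotted field names, e.g. the field "field1"
    inside the object "SUPER" becomes "SUPER.field1".
    """
    flat = {}
    for name, value in obj.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Descend into the nested object, extending the name prefix.
            flat.update(flatten(value, full_name))
        else:
            # Assumption: scalars and arrays are stored as they are.
            flat[full_name] = value
    return flat


# Example: the "SUPER" object from the text with two fields.
event = {"SUPER": {"field1": "a", "field2": "b"}, "msg": "test"}
flat_event = flatten(event)
```

One consequence of this scheme is that the original nesting cannot be told apart from a top-level field whose name happens to contain a dot, which is part of why real structure is the better long-term solution.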

I considered several approaches, among them moving over to libcollection (part of ding-libs) or a pure JSON parser. The more I worked with the code, the more it turned out that libee already has a lot of the necessary plumbing and could simply be enhanced/modified under the hood. The big plus of that approach is that it immediately plugs into rsyslog and the other solutions that are already built on it. This even makes it possible to use the new functionality in the v6 context (I originally thought I’d need to move on to rsyslog v7 for the name-value pair changes). Now that I have written mmjsonparse, this really seems to work out. No engine change was required, and I expect little need for change even for the final version. As such, I’ll proceed in that direction. Actually, what I now use is kind of a hybrid approach: I use a lot of the philosophy of libcollection, which showed me the right route to take. Then, I use cJSON, which is a really nice JSON parser. In the proof of concept, I use both cJSON’s object model and libee’s own. I expect to merge them, actually tightly integrating cJSON. The reason is that CEE has evolved quite a bit in the meantime, and many complex constructs are no longer required. As such, I can streamline the library as well, which not only reduces complexity but also speeds up the whole process.

I would like to express my sincere thanks to Dmitri Pal, Keith Robertson and Bill Heinbockel, who provided great advice and excellent discussion. And the best is that this part of the effort is just the beginning… Stay tuned for more!