Using ElasticSearch Bulk Mode with rsyslog

Rsyslog’s omelasticsearch plugin now supports bulk mode. With bulk mode, message processing is much faster, especially if large loads are to be processed.

Bulk mode works with rsyslog’s batching capabilities. So it probably is a good idea to refresh some of the batching concepts. The core idea is that while we would like to process many messages at once, we do NOT want to wait hold processing messages “just” because they are too few. So with batching, you set an upper limit on the batch size (number of messages inside a batch). Let’s say the batch size is set to 32. When a new batch is to be processed, the queue worker tries to pull 32 messages off the queue. If there are 32 or more present, this is nice and all 32 are taken from the queue. But now let’s assume there are only 10 messages at all present inside the queue. In that case, the queue worker does not try to guess when the next 22 messages will arrive and wait for that (if the time is short enough). Instead, it just pulls the 10 already-present messages off the queue and these form the batch. When new messages arrive, they will be part of the next batch.

Now let’s look at the startup of a busy system. Lot’s of messages come in. Let’s assume they are submitted one-by-one (many sources submit multiple messages, but let’s focus on those cases that do not). If so, the first message is submitted and the queue worker is activated. Assuming this happens immediately and before any other message is submitted (actually unlikely!), it will initially create a batch of exactly one message and process that. In the mean time, more messages arrive and the queue fills. So when the first batch is completed, there are ample messages inside the queue. As such, the queue worker will pull the next set of 32 messages off the queue and form a new batch out of them. This continues as long as there are sufficient messages. Note that in practice the first batch will usually be larger than one and often be the max batch size, thanks to various inner workings I would not like to elaborate on in detail in this article. Large batch sizes with more than 1024 messages are not bad at all and may even help improve performance. When looking at a system performing with such large batches, you will likely see that partial batches are being created, simply for the reason that the queue does not contain more messages. This is not an indicator for a problem but shows that everything works perfectly!

The max batch size can be configured via

$ActionQueueDequeueBatchSize
and
$MainMsgQueueDequeueBatchSize

Note that the default sizes are very conservative (read: low), so you probably want to adjust them to some higher value. The best value depends on your workload, but 256 is probably a good starting point. If the action queue runs asynchronously (e.g. linkedlist mode, everything non-direct), the action queue batch size specifies the upper limit for the elasticsearch bulk submission.

To activate bulk mode, use

*.*     action(type=”omelasticsearch”
           … other params …
           bulkmode=”on”)

The default is the more conservative “off”. Note that the action can of course be used with any type of filter, not just the catch-all “*.*”. This is only used as a sample.

omudpspoof, threads, and libnet..

I today ran into a problem with omudpspoof (rsyslog’s UDP spoofing output module) that only occured when multiple output threads were used. If that perquisite was given and a larger traffic volume was generated, rsyslog relatively quickly aborted – often even during startup, otherwise very quickly thereafter. Diging down to the potential troublespot, it turned out that libnet is not threadsafe (and I have to admit they do not explicitely claim that). That was a very quick hint, and I now protect libnet calls via a mutex, ensuring that only one thread at a time will ever call into libnet. Of course, this somewhat limits concurrency, but for obvious reasons stability is far more important. There was also recently some discussion on different-kind problems with omudpspoof. While I do not think that this was releated to multithreading (data was malformed, what I could not reproduce), I can not totally outrule that. Will try to follow up with that on the mailing list as well.

The problem probably took quite some while to surface, as people usually have a single UDP spoofing action AND do not run it on multiple threads (given the i/o-boundness of this action, you probably can not gain much be using more than the standard single instance thread).

If you are interested in patching older versions, the patch is available via git (use the raw link to obtain a file usuable with patch). New releases are due soon.

RFC5424 or cee-enhanced syslog?

I  recently got a question if it would be better to implement RFC5424 or cee-enhanced syslog in a new project. This sounds like a though question, but looking at the details helps to ease it a bit.

First of all, RFC5424 has a far broader scope than cee-enhanced syslog. RFC5424 is the base RFC of a new series of syslog RFCs that solve many problems, among them (this list for sure will not be complete):

  • finally a standardized syslog format (no standard exists before, just a weak understanding)
  • supports multiple transports, e.g. UDP and TLS (all in their own RFCs)
  • solves some long-standing nits, like the date format (RFC5424 supports high precision and demands a year as part of the timestamp)
  • increases the syslog message size and permits operators to set an (almost) arbitrary upper limit for message sizes
  • supports structured logging via so-called “structured data”

In contrast, cee-enhanced syslog only cares about the structured logging feature – none of the others is addressed with it. This is the case because its intent is different: the goal here is to have a format that can work with existing technology and still provide the best possible structured logging solution. Depending on circumstances, “best possible” can be quite different. For example, in a scenario where it needs to work with crufty old syslogds, it can mean that message content can not be longer than 1k, a serious restriction. Even more, it is envisioned that cee-enhanced syslog will also work over non-syslog transport, and may even be transported by simply copying (and merging) files.

CEE has long thought about using RFC5424 structured data format for its structured logging needs. However, that would have limited its ability to work with all these older (and deployed) syslog implementations. Thus, the decision was made to use text that purely resides inside the message part and, without transformation, can be used in legacy systems (with given restrictions, of course).

So now back to the question: what to implement? With this facts, I think it boils down to “both”. To use the benefits of modern-day syslog (as described above), you definitely need to implement RFC5424 and its helper RFCs (like 5425 for TLS). HOWEVER, the question remains if one should use structured data or cee-enhanced syslog. This, I have to admit, is indeed a though question.

My personal opinion, which can be totally wrong and is solely based on my experience (aka “gut feeling” ;)), cee-enhanced syslog sounds more promising for structured data. The main reason for this is that the format has received very good responses from major players and a lot of work is currently being done by a lot of folks driving important projects in the logging environment. There is also a lot of supporting libraries upcoming AND some of them already available. Most importantly, this effort is thightly integrated with Mitre and  it probably is not to far-fetched to assume that cee-enhanced syslog will appear on some purchasing checklists in the not so distant future. RFC5424 structured data does not have this broad backing. So while cee-enhanced syslog is a very fresh project, my personal assumption is that it will take off.

But again, that does not mean that RFC5424 shall not be implemented. I think RFC5424 technology is vital for a number of reasons. Of course, things like TLS can also be obtained without RFC5424, but not in a standards-compliant way. So to ensure interoperability, going the 5424 way is a good idea.

MongoDB, BSON and templates…

After I have improved the template system yesterday, I have begun to think again about the integration of custom templates (actually field lists) with ommongodb (rsyslog’s mongodb output plugin). A “problem” with the mongo interface is that it does not support native JSON but rather BSON, its binary equivalent. So what needs to be done is convert the textual JSON representation to BSON before it can be stored in MongoDB. Given the fact that the JSON representation must be build with the property replacer, this looks like a was of coding/enconding. Assuming that I would take JSON and tansfrom it to BSON (all this in ommongodb), the workflow would be as follows:

text properties -> encode JSON -> decode JSON -> generate BSON

The “encode JSON” step would happen inside the template processr, the “decode JSON” part in ommongodb. In essence, this looks like a quite flexible, but rather slow approach. After all, it would serialize to JSON just for interim needs. What I am actually looking for is this workflow:

text properties -> generate BSON

In that we would replace the JSON format with some internal format. That internal format in a kind already exists, in array passing mode. In this mode, the property text is passed in via an array. As a side-note, some transformations are necessary and desired even in internal format, as the property replacer permits to use not only the raw properties themselves but substrings, case conversions, regexes and the like. The problem with array passing mode is that it provides just the plain values. However, for BSON (and MongoDB) we also need to know the field name – and type information. The latter is probably easy, as rsyslog usually deals with text, only, and so we could stick to strings except maybe for dates. The field name is available since yesterday inside the template structure. However, there currently is no way for a plugin to access this information.

So it looks like the decent thing is to create a new interface that passes in a (description,value) pair to the plugin. The description most probably could be the template structure (or some abstraction if we feel bad about tying things too deeply together). That will prevent the detour via JSON, but still provide otherwise full capabilities. The bad thing, however, is that some complex interface gets yet another option (maybe it is time for a general cleanup?).

Feedback on this issue is appreciated.

rsyslog templates & json

I today added a simpler method to specify JSON inside rsyslog templates. The new method simplifies specifying JSON-encoded fields. It still looks a bit ugly, but if you look closely enough, you’ll quickly notice that it no longer needs “quoting magic” and thus is far easier to work with.

Previously, you had to define a template like this:

$template tpl, “{“message”:”%msg:::json%”,”fromhost”:”%HOSTNAME:::json%”,”facility”:”%syslogfacility-text%”,”priority”:”%syslogpriority-text%”,”timereported”:”%timereported:::date-rfc3339%”,”timegenerated”:”%timegenerated:::date-rfc3339%”}”

The template given above is the default for ElasticSearch. With the new code, this can be replaced by:

$template tpl,”{%msg:::jsonf:message%,%HOSTNAME:::jsonf:fromhost%,%syslogfacility-text:::jsonf:facility%,%syslogpriority-text:::jsonf:priority%,%timereported:::date-rfc3339,jsonf%,%timegenerated:::date-rfc3339,jsonf%}”

It’s a bit shorter, but most importantly the JSON field is now generated by the property itself. This is triggered by the “jsonf” (json field) option. If it is given, a

“fieldname”=”value”

is automatically generated. The fieldname is either the forth parameter (see “message” in the msg field in the example above) or, if not given, defaults to the property name. If I hadn’t insisted on specific field names, I could have written the sample as such:

$template tpl,”{%msg:::json%,%HOSTNAME:::jsonf%,%syslogfacility-text:::jsonf%,%syslogpriority-text:::jsonf%,%timereported:::date-rfc3339,jsonf%,%timegenerated:::date-rfc3339,jsonf%}”

Note that the commas to separate the various JSON fields must still be included inside the template as literal text.

This new system works pretty well within the current template system. The config format will probably become a bit more intuitive when moving it over to the new config language. But the exact how’s of that are still to be decided. In fact, I think this string-based way of specifying templates is not so bad. In any case, I am also considering a template option which would take a template of pure field lists and generate proper JSON out of that. But that’s work for another day (or so ;)).

This is to be released with 6.3.9 and already available via the git master branch.

rsyslog & elasticsearch: async replication and timeout

Today I have added the capability to specify asynchronous replication support and timeout settings to omelasticsearch. Code-wise it’s a small change – it just passed the required parameters to ElasticSearch via the proper REST URL parameters. By default, both parameters are not set, that means default timeout and synchronous replication.

To set parameters, use

*.*     action(type=”omelasticsearch”
           … other params …
           asyncrepl=”on” timeout=”1m”)

If you leave “asyncrepl” out or set it to “off”, synchronous replication is used. For greatest flexiblity, the value of the “timeout” parameter is not checked. While this enables you to use anything ElasticSearch supports, invalid values can not be detected by omelasticsearch and thus will cause all inserts to fail (or misbehave). Note that some basic integrity checking is done, but we do not go great length here. So use with care.

rsyslog & ElasticSearch: using dynamic index and type

Today, I have added support for dynamically taking search index and type from the message itself. It works much like dynafiles do. If dynamic mode is selected (via config params), the index name and type given in the configuration are template names instead of  literal ones. That means the rsyslog core will generate the actual name at runtime, based on actual message content. These dynamically generated values will then be used inside the request.

Here is a quick howto:  you need to define the templates first, like this:

$template srchidx,”%hostname%”
$template srchtype,”%programname%”

In this example, the hostname (from the message) is used as name for the search index, and the programname (usually part of the syslog tag) is used as searchtype. The full power of templates, including all properties, can be used, so this is highly flexible.

To actually use these dynamic values, omelasticsearch must be called like this (line break do not matter):

*.*     action(type=”omelasticsearch”
           searchIndex=”srchidx” dynSearchIndex=”on”
           searchType=”srchtype” dynSearchType=”on”)

The “dynSearch…” setting tells the engine if it is a dynamic name or not. If set to on, like above, a template name is given and the actual name will be dynamically generated. If set to “off”, the literal value will be used (“off” is the default, so the parameters do not need to be specified if “off” is desired). For example:

*.*     action(type=”omelasticsearch”
           searchIndex=”srchidx”   searchType=”srchtype”)

will use the index “srchidx” and type “srchtype” literally. No template-based name extension will happen in this case. Please note that it is ok for only one parameter to by dynamic. It’s not “all or nothing”.

I hope this is a useful addition for omelasticsearch. Several folks have indicated that this method can increase throughput in a couple of use cases, for example to split of indexes based on host names, host functions and the like.

new rsyslog v5-beta released

I just released rsyslog 5.9.6, now a beta and no longer a development version. I have done a big merge today of bug fixes and small enhancements from the v5 branch and combined that all into 5.9.6. The plan is to mature this version as quickly as possible, which could really mean “rather quick”. For one, 5.9.5 is released for three month now and seem to have not broken quite a bit. Secondly, there are only very few new features and those that have been added already got their practice drill in some large environments (and such seem to be save for wider consumption).


I also sincerely hope that this will conclude the v5-devel branch for the foreseeable future. Having two active development branches always is a big burden, and merges are both time-consuming and error-prone. So if at all possible, I’ll try to refrain from doing any larger-scale development work in v5. It needs to be seen if I can hold to that when it comes to paid custom work. On the other hand, the v6 engine has also gotten both rather stable AND much feature enhanced. So it probably is a good idea to being to ask people to move up to v6. But that’s another story, to be written in the not so distant future ;)

rsyslog group on Facebook

I just created a rsyslog group on facebook. I hope it is useful. It’s a very interesting experiment for me. Are the tech folks available there as well? Will it work better than e.g. the rsyslog forum itself? I guess time will (quickly) tell. At best, this could become a nice new channel to get news out and motivate people to try out new versions.

Using different templates with rsyslog’s ElasticSearch plugin

Recently, an experimental ElasticSearch plugin has been added to rsyslog, omelasticsearch. Like all other output plugins, it comes with a canned template, which specifies a default “schema”. However, the template engine provides capabilities to use a completely different set of fields. In this blog post, I’ll briefly describe how this is done.

Note: all work is based on the current (April, 10th 2012) implementation of both omelasticsearch as well as the rsyslog core. Future implementation changes are expected in an effort to make things more intuitive. This may even break what I describe here. So if you come to this blog post at a later time, it probably is best to check if things have changed by “now” (especially if the procedure does not work ;)).

The current implementation ties into the template system. The default template looks as follows (all on one line, broken for readability):

$template JSONDefault, “{“message”:”%msg:::json%”,”fromhost”:”%HOSTNAME:::json%”,”facility”:”%syslogfacility-text%”,”priority”:”%syslogpriority-text%”,”timereported”:”%timereported:::date-rfc3339%”,”timegenerated”:”%timegenerated:::date-rfc3339%”}”

The ‘”‘ sequence is needed to represent a quote character inside the template. To format JSON, this is pretty ugly, but that’s the way the template processor currently works (and that is one reason why it is under review). As you can see, the JSON is actually “hand-crafted”, with the “json” option specifying that property text needs to be properly escaped to be well-formed JSON. If that option is specified, the template processor does the necessary escaping. Note that not all properties have the “json” option. This is purely for performance reasons. For example, time stamps do never include characters that need to be escaped. Consequently, the “json” option is not used there (but could be, e.g. “date-rfc3339,json”).

So now let’s define a different set of fields to be used. Let’s say we just want to have the date from the syslog message and the MSG part of the message itself. That would be as follows:

$template miniSchema, “{“message”:”%msg:::json%”,”timereported”:”%timereported:::date-rfc3339,json%”}”

Note: I requested JSON formatting in this example just to prove the point – don’t use it in a real deployment, as it is nonsense that costs CPU cycles ;)

Now I need to use the template within the omelasticsearch action. This is done as follows:

*.*     action(type=”omelasticsearch” template=”miniSchema”)

Note: the “all” (“*.*”) filter of course can be replaced with a different type of filter.

That’s all that needs to be done. Please note that you can add several templates and use these in several different elasticsearch output actions. Just exactly the same thing used in other actions. Also keep in mind that the property replacer permits to access a wide range of message properties. Most importantly, normalized properties or cee-enhanced syslog properties can be accessed via the CEE family of property names (essentially “$!cee-name” style). Just be sure that you include the “json” option into any property that may contain unsafe characters (which means almost all of the fields). This is not done automatically by the current engine and invalid characters can lead to strange problems, even aborts of ElasticSearch itself!

In the future, a more intuitive syntax is planned for JSON template definitions. Nevertheless, the current code permits full customization but requires taking care of the details.