release window for rsyslog 7.6

There have been a couple of questions when rsyslog’s next stable release (7.6) will be released.

Originally, the idea was to have this done by the end of this year, which essentially means end of November. Unfortunately, this needs to be pushed into early January 2014. The reason is code quality.

In October, we discovered a larger regression and inconsistency in regard to the new variable support in rsyslog. As it turned out, the only good choice was to seriously refactor the code handling variables and message properties. Unfortunately, such a larger refactoring also means bug potential, so I don’t feel well to release 7.6 stable with just two weeks of testing on this refactored code. Looking forward, a release around mid-December sounds doable, but this means we’ll hit the holiday break with a brand-new stable release. This is something that I definitely want to avoid. As such, the current plan is to do the release shortly after the holiday break.

In order to keep this release window, I will also avoid larger modifications to the code base. That doesn’t mean things like better support for impstats support in omelasticsearch, but it may mean that the updated imfile code will probably not be merged into v7, at least not now.

Depending on how things work out with rsyslog v8, I will do another 7.7 devel series for less intrusive enhancements or move directly to v8 only. This will be also be decided around January 2014, when we hopefully got enough feedback on v8.

Samples for v8 module conversion

This is just a  blog post on where to find sample of converting modules to the v8 output module interface. Additional information will be upcoming within the next days. Stay tuned.

Please bear in mind that the v8 output module interface is not stable at this time. It will very likely change within the next weeks.

Right now, the rsyslog v8 compatibility doc has some information on the new interface. It is expected that it will remain the best source to get at least an overview of the required changes.

Actual samples are best found in git. As I have various levels of updates (from very minimal to full), you’ll probably find something for your liking. Very minimal change is required for message modification modules (assuming they were thread-safe in the first place), e.g. mmfields:

http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=1c5a17241131129e9a1c184688500ad5b1383c28

It’s actually just adding the new plumbing. That’s all that’s usually required for message modification modules. Still, note the fundamental difference that the modules are run concurrently in v8, and as such must be reentrant and thread-safe (this requirement now applies to all output modules). Usually this is the case, but may be different depending on the processing done by the module in question (e.g. lookups to backend systems, libraries, global data – to name some of potential trouble spots).
omrelp is an example of a conversion to v8 that makes use of the new features:

If you browse git, the commit comments should help find more examples on a module-by-module basis.

On to the Lucky Number…

Do you remember last October? The rsyslog mailing list got very busy with things like output load balancing, global variables, and a lot of technical details of the rsyslog engine. I don’t intend to reproduce all of this here. The interested reader may simply review the rsyslog mailing list archives, starting at October 2013.

Most of these discussions focused on the evolution of the rsyslog engine. Essentially, the v7 engine brought many new things, but was constraint due to a lot of legacy, for example the output module interface which originally (roughly 8 years ago) was designed to keep it very simple to use, especially for developers who did not know about threading (which was uncommon to know at that time). Also, the engine gained lot’s of speeds by a special SIMD-like execution mode (details in mailing list archive), but that mode was counter-productive for some of the newer features. An ultimate problem were global variables, which were very hard to implement with the v7 engine. I even proposed a complete shadow variable system to make them fit into the framework, but this was -thankfully- not well received by the community at large.

Note that I wanted to re-write at least part of the core engine for quite some while, but it sounded like a task so large that I deferred it one time after another. However, seeing all the time going into the discussion, potential work-arounds (like shadow variables) and the then still-nonoptimal state I very strongly considered if it may be time for a radical change. Better spend the time on cool new things than investing that time into some “work-around” type of implementations.

With that on my mind, I had a couple of days of drafting and some talks with my peers. It turned out that we wanted to at least give this a try and see how (and if;)) things would evolve. So I did a week of test implementation, mostly focused on a new, highly concurrent output module interface. That did work out rather well, so we finally decided to go for the full length. I started rewriting the complete core rule execution engine.

And this brings us to the lucky number: let’s welcome rsyslog v8, as the new version will be named. I thought for a short moment to stick with v7, but the changes are really, really big and quite a bit of legacy and upward compatibility has been lost. For example, all output modules need to be rewritten. So while v8 may not be a great marketing tool to get it quickly into the distros, it correctly flags that some bigger changes are to be anticipated (but as usual we were very conservative with breaking existing configurations ;)).

I did not announce this effort any sooner as until now I wasn’t sure if it would really work out. Obviously, I now think thinks look sufficiently well ;-). It’s not yet deployable, but we are getting closer.

Expect a 8.1.0 release within the next couple of days. We are working on the final finishing touches, many of which involve testing. Note that 8.1.0 will be highly experimental and really is for folks who would like to get an impression of its new capabilities … and those trying to help find bugs.

There are some restrictions on supported modules and features, and some upgrades may run into trouble with that (details will be available together with the release). Also note that the rewrite is not yet 100% complete. Some of the finer details still need to be advanced, e.g. a new, higher performance and easier to use interface for output transactions. Or some aspects of message error processing … and so on. This follows our usual pattern that in a new development branch we bring in new features gradually, until we have reached the final goal. In any case, we are now pretty confident that we will reach this goal. If all goes exceptionally well, a v8-stable may come up in March 2014 (and free support for v7 will be available for another month or two so that there is time to migrate).

To facilitate this move, git branches will change. Yesterday’s master branch will become available as v7-devel (now in place) and the master branch will hold the new v8 code. Follow this blog post and the rsyslog mailing list to be sure to see exciting new versions when they become available.

Version 7-devel will get kind of a beta status, where mostly bug fixes are applied. However, some non-bugfix changes will also happen, but only after careful consideration. The focus obviously is now on v8 and this is where we will do the “cool” stuff in the future.

Note that the average user will possibly not notice much difference between v7 and v8 initially. The new engine will be most useful for power users and installations with complex rules and/or a high log volume.

I am very excited about this new branch of development and hope you’ll also be excited as it unrolls. Also my sincere thanks to everyone who waits patiently for v7 related work. I was too busy to look a that the past few weeks and will probably be occupied with v8 for a quite a bit longer.

On Executing a Ruleset in non-SIMD Mode

WARNING – WORK IN PROGRESS
This is probably inconsistent, not thought out, maybe even wrong. Use this information with care.

I wanted to share a complexity of executing a rsyslog ruleset in non-SIMD mode, which means one message is processed after each other. This posting is kind of a think tank and may also later be useful for others to understand design and design decisions. Note, however, that I do not guarantee that anything will be implemented as described here.

As you probably know rsyslog supports transactions and, among other things, does this to gain speed from it. This is supported by many modules, e.g. database modules or the tcp forwarder. In essence, what happens is
  • transaction is begun
  • messages are feed to the action
  • transaction is stopped

For most operations, success/failure information is only available after step 3 (because step 2 mainly buffers to-be-processed messages).

This plays very well with the current SIMD engine, which advances processing one statement after the other, and then processed the batch as whole. That means all three steps are done at one instance when the relevant statement is processed. This permits the engine to read the final result before advancing state to the next action.

Now envision a non-SIMD engine. It needs to start the transaction as soon as the first message is ready for processing, and the submit messages as they are processed. However, in non-SIMD mode, each action will only carry out step 2, and then immediately advance to the next statement. This happens for each message. Only after ruleset processing is completed, the final commit can occur.
This means that we do not have the actual outcome of operation until ruleset processing has finished. Most importantly, the result of statement n is no longer available in statements s with s > n. Some features (like failover processing — execIfPreviousIsSuspended) depend on the previous statement’s result, so this is a problem. Even more so, retrying an action is far from trivial, as we do no longer necessarily have the exact same information that was gathered during processing available when processing is done (later statements may have changed message state).
There are some solutions to these problems
  1. if execIfPreviousIsSuspended is specified, the statement in front of it must be forced to commit after each message (which costs quite a lot of performance, but at least only if the users turns on that feature)
  2. To mitigate the performance loss of those auto-commits, we could add a new syntax which explicitly permits to specify failover actions. These could be linked to the primary statement and NOT be executed during regular rule engine execution. We would still need to buffer (message pointers) for the case they are to be executed, but that’s probably better. Most importantly, there would be no other conditional logic supported, which makes processing them rather quickly.
  3. The message object must support a kind of “copy on write” (which would even be very useful with the v7 engine, which also permits more updates than any pre-v8 engine ever was designed for…). This could be done by splitting the (traditional) immutable part of the message structure from things like variables. Message modification modules modifying the “immutable” part would need to do a full copy, others not (the usual case). Of course, this copy on update can make variable operations rather costly.
  4. Output modules could be forced to perform a kind of “retry commit” — but this is a bad option because it a) puts (repetitive) burden on the output (in essence, the output faces exactly the same problems like the core engine, EXCEPT probably that it knows better which exact data items it needs — easy for traditional template based interface). b) it removes from the engine the ability to re-try parts of the transaction. So this is not very appealing.
  5. In any case, the actual “action retry handling” should probably be applied to the commit interface, far less than the usual submit interface.

What even more complicates things is that we do not know what modules that use the message passing interface actually do with the messages. In current code, most of them are message modification modules. This means in order for them to work correctly, they actually need to execute on the message “right in time”. And, of course, there is even more complexity in that each output may do partial commits at any time. The most popular case probably is when it runs out of some buffer space.

To solve these issues, a two-step execution inside the rule system seems desireable:

  • execution phase
  • commit phase

Note that this two-phase approach is already very similar to what action queues do. However, this also means that action queues in pre-v8 can be victim to race conditions if variables are heavily used.

In any case, using action queues to perform these two steps seems very natural and desirable. Unfortunately, there is still considerate overhead attached to this (mutex operations, required context switches), which makes this very unattractive. The end result if taking this path probably would be a reduced overall processing speed, something we really don’t like to see. Also, failover processing would not work if following that path.

Execution Phase

  1. advance message state – message modification (mm) modules must be called immediately
  2. “shuffle” msgs to actions – the main concern here is to make sure that the action sees an immutable action object, at least in regard to what it needs from it (we may need to add an entry point to ask the action to extract whatever it actually needs and return a pointer to that – not necessary for simple strings, for a prominent example).
    Note that doAction is never called for non mm-modules.

At the end of execution phase, for each action we have an array of to-be-processed data.

Commit Phase
For each action with data, we submit the data to its action, performing all three steps. This way, we can easily keep track of the state advancement and action errors. It would be easy to implement dedicated failover processing at this stage (but this probably requires larger state info if the failover action is different from the primary one).

This two-phase approach somewhat resembles the original batching/SIMD idea of the pre-v8 engine. So it looks like this design was well up to the point of what we need to do. I am still a bit undecided if doing these engine changes are really worth it, but so far code clarity seem to be much better. Performance possibly as well, as the SIMD needed to carry a lot of state around as well.

I will now probably do a test implementation of the two-phase approach, albeit only for the traditional string interface.

Some ideas/results from the test implementation:

  •  The structure used to store messages could -LATER- be made the structure that is actually queued in action queues, enabling for faster performance (in-memory) and unified code.
  • Note: an advantage on storing the result string template vs. the message object between phases is of advantage as we do not need to keep the message immutable for this reason. It needs to be seen, though, if that really is an advantage from the overall picture (the question is can we get to a point where we actually do NOT need to do copy-on-write — obviously this would be the case if one string templates are used).

A Proposal for Rsyslog State Variables

As was discussed in great lenghth on the rsyslog mailing list in October 2013, global variables as implemented in 7.5.4 and 7.5.5 were no valid solution and have been removed from rsyslog. At least in this post I do not want to summarize all of this – so for the details please read the mailing list archive.

Bottom line: we need some new facility to handle global state and this will be done via state variables. This posting contains a proposal which is meant as basis for discussion. At the time of this writing, nothing has been finalized and the ultimate solution may be totally different. I just keep it as blog posting so that, if necessary and useful, I can update the spec towards the final solution.

Base Assumptions
There are a couple of things that we assume for this design:

  • writing state variables will be a  very infrequent operation during config execution
  • the total number of state variables inside a ruleset is very low
  • reading occurs more often, but still is not a high number (except for read-only ones)

Syntax
State variables look like the previous global variables (e.g. “$/var” or “$/p!var”), but have different semantics based on the special restrictions given below.

Restrictions
Restrictions provide the correctness predicate under which state vars are to be evaluated. Due to the way the rsyslog engine works (SIMD-like execution, multiple threads), we need to place such restrictions in order to gain good performance and to avoid a prohibitively amount of development work to be done. Note that we could remove some of these restriction, but that would require a lot of development effort and require considerable sponsorship.

  1. state variables are read-only after they have been accessed
    This restriction applies on a source statement level. For example,
    set $/v = $/v + 1;
    is considered as one access, because both the read and the write is done on the same source statement (but it is not atomic, see restriction #3). However,
    if $/v == 10 then
        set $/v = 0;
    is not considered as one access, because there are two statements involved (the if-condition checker and the assignment).
    This rule requires some understanding of the underlying RainerScript grammar and as such may be a bit hard to interpret. As a general rule of thumb, one can set that using a state variable on the left-hand side of a set statement is the only safe time when this is to be considered “one access”.
  2. state variables do not support subtree access
    This means you can only access individual variables, not full trees. For example, accessing $/p!var is possible, but accessing $/p will lead to a runtime error.
  3. state variables are racy between diffrent threads except if special update functions are use
    State variable operations are not done atomic by default. Do not mistake the one-time source level access with atomicity:
    set $/v = $/v + 1;
    is one access, but it is NOT done atomically. For example, if two threads are running and $/v has the value 1 before this statement, the outcome after executing both threads can be either 2 or 3. If this is unacceptable, special functions which guarantee atomic updates are required (those are spec’ed under “new functions” below).
  4. Read-only state variables can only be set once during rsyslog lifetimeThey exist as an optimization to support the common  usecase of setting some config values via state vars (like server or email addresses).

Basic Implementation Idea
State variable are implemented by both a global state as well as message-local shadow variables. The global state holds all state variables, whereas the shadow variables contain only those variables that were already accessed for the message in question. Shadow variables, if they exist, always take precedency in variable access and are automatically created on first access.

As a fine detail, note that we do not implement state vars via the “usual” JSON method, as this would both require more complex code and may cause performance problems. This is the reason for the “no-subtree” restriction.

Data Structure
State vars will be held in a hash table. They need some decoration that is not required for the other vars. Roughly, a state var will be described by

  • name
  • value
  • attributes
    • read-only
    • modifyable
    • updated (temporary work flag, may be solved differently)

Implementation
This is pseudo-code for the rule engine. The engine executes statement by statement.

for each statement:
   for each state var read:
      if is read-only:
         return var from global pool
     else:
         if is in shadow:
             return var from shadow pool
        else:
             read from global pool
             add to shadow pool, flag as modifiable
             return value
    for each state var written:
        if is read-only or already modified:
           emit error message, done
        if is not in shadow:
            read from global pool
            add to shadow pool, flag as modifiable
        if not modifieable:
           emit error message, done
        modify shadow value, flag as modified

at end of each statement:
   scan shadow var pool for updated ones:
       propagate update to global var space
       reset modified flag, set to non-modifiable
  for all shadow vars:
       reset modifiable flag

Note: the “scanning” can probably done much easier if we assume that the var can only be updated once per statement. But this may not be the case due to new (atomic) functions, so I just kept the generic idea, which may be refined in the actual implementation (probably a single pointer access may be sufficient to do the “scan”).

The lifetime of the shadow variable pool equals the message lifetime. It is destrcuted only when the message is destructed.

New Statements
In order to provide some optimizations, new statements are useful. They are not strictly required if the functionality is not to be implemented (initially).

  • setonce $/v = value;
    sets read-only variable; fails if var already exists.
  • eval expr;
    evaluates an expression without the need to assign the result value. This is useful to make atomic functions (see “New Functions”) more performant (as they don’t necessarily need to store their result).

New Functions
Again, these offer desirable new functionality, but can be left out without breaking the rest of the system. But they are necessary to avoid cross-thread races.

  • atomic_add(var, value)
    atomically adds “value” to “var”. Works with state variables only (as it makes no sense for any others).

This concludes the my current state of thinking. Comments, both here or on the rsyslog mailing list, are appreciated. This posting will be updated as need arises and no history kept (except where vitally important).

Special thanks to Pavel Levshin and David Lang for their contributions to this work!

rsyslog’s imudp now multithreaded

Rsyslog is heavily threaded to fully utilize modern multi-core processors. However, the imudp module did so far work on a single thread. We always considered this appropriate and no problem, because the module basically pulls data off the OS receive buffers and injects them into rsyslog’s internal queues. However, some folks expressed the desire to have multiple receiver threads and there were also some reports that imudp ran close to 100% cpu in some installations.

So starting with 7.5.5, imudp itself supports multiple receiver threads. The default is to use a single thread as usual, but via the “threads” module parameter, up to 32 receiver threads can be configured. We introduced this limit to prevent naive users from totally overruning their system capability – spawning a myriad of threads usually is quite counter-productive (especially when they outnumber the available processor cores). For the same reason, I would strongly suggest that the number of threads is only increased if there is some evidence for this to be useful — which usually means the imudp thread should require considerable CPU time. In order to aid the decision, I have also added new rsyslog statistics counters which permit monitoring of the worker thread activity.

We will now evaluate practical feedback from the new feature. One of the goals of this new enhancement is to limit the risk of UDP message loss due to buffer overrun, which we hope we have improved even without the need to select realtime priority.

Please note that 7.5.5 is at the time of this writing not yet released, so for the next couple of days the new feature is only available via building from the git master branch.

New Queue Defaults in rsyslog 7.5

As regular readers of my blog know, we are moving towards preferring enterprise needs vs. low-end system needs in rsyslog. This is part of the changes in the logging world induced by systemd journal (the full story can be found here).

Many of the main queue and ruleset queue default parameters were a compromise, and much more in favor of low-end systems than enterprises. Most importantly, the queue sizes were very small, done so in an approach to save virtual memory space. With the 7.5.4 release, this will change. While the default size was 10,000 msgs so far, it has been increased 10-fold to 100,000. The main reason is that the inputs nowadays batch together quite some messages, which gives us very good performance on busy systems. It is not uncommon that e.g. the tcp input submits 2,000 messages as once. With the previous defaults, that meant the main queue could hold 5 such submission. Now we got much more head room.

Note that some other changes were made alongside. The dequeue batch size has been increased to 256 from the previous value of 32. The max number of worker threads has been increased to two, removing our previous conservative setting of one. At the same time, we now require at least 40,000 messages to be inside the queue before the second worker is activated. So this will only happen on very busy systems. Note that the previous value of 100 messages was really an artifact of long gone-away times and usually meant immediate activate of the maximum number of workers, what was quite contrary to the intention of that parameter.

Of course, these are just changed defaults. They can always be overridden by explicit settings. For those configs that already did this, nothing changes at all.

imfile multi-line messages

As most of you know, rsyslog permits to pull multiple lines from a text file and combine these into a single message. This is done with the imfile module. Up until version 7.5.3, this lead to a message which always had the LF characters embedded. That usually posed little problem when the same rsyslog instance wrote the message immediately to another file or database, but caused trouble with a number of other actions. The most important example of the latter is plain tcp syslog.

That industry standard protocol uses LF as a frame delimiter. This means a syslog message is considered finished when a LF is seen and everything after the LF is a new message. Unfortunately, the protocol does not provide a special escape mechanism for embedded LFs. This makes it simply impossible to correctly transmit messages with embedded LFs via plain tcp syslog (for more information, see RFC6587, section 3.4).

To solve this situation, rsyslog provides so-called “octet counted” framing, which permits transmission of any characters. While this is a great solution for rsyslog-to-rsyslog transmission, there are few other programs capable of working in that mode. So interoperability is limited.

Even worse, most log processing tools (primarily those working on files) do not expect multi-line messages. Usually they get very confused if LFs are included.

In short, embedded LFs are evil in the logging world. It was probably not a great idea to generate them when imfile processes multi-line messages.

Starting with rsyslog version 7.5.3, this problem has now been solved. Now, imfile escapes LF to the four-character sequence “#012”, which is rsyslog’s standard (octal) control character escape sequence. With this escaping in place, there will neither be problems at the protocol layer nor with other log processing applications. If for some reason embedded LF are needed, there is a new imfile input() parameter called “escapeLF”. If set to “off”, embedded LFs will generated. We assume that when a users does this, he also knows what he does and how to handled those embedded LFs.

This behaviour could obviously break existing configurations. So we have decided not to turn on LF escaping for file monitors defined via legacy statements. These are most probably those that do not want it and also probably long have dealt with the resulting problems.

As always, it is highly suggested that new configurations use the much easier to handle input() statement, which also has LF escaping turned on by default. Note that you cannot use LF escaping together with imfile legacy config statements. In that case, you must switch to the new style.

So this construct:

$InputFileName /tmp/imfile.in
$InputFileTag imfile.in
$InputFileStateFile imfile.in
$InputFileReadMode 1
$InputRunFileMonitor

Needs to become that one:

input(type=”imfile” file=”/tmp/imfile.in”
      statefile=”imfile.in” readMode=”2″ tag=”imfile.in”)

Again, keep in mind that in new style LF escaping is turned on by default, so the above config statement is equivalent to:

input(type=”imfile” file=”/tmp/imfile.in” escapelf=”on”
      statefile=”imfile.in” readMode=”2″ tag=”imfile.in”)

This later sample is obviously also correct.
To turn off LF escaping in new style, use:

input(type=”imfile” file=”/tmp/imfile.in” escapelf=”off”
      statefile=”imfile.in” readMode=”2″ tag=”imfile.in”)

I hope this clarifies reasons, usefulness and how to handle the new imfile LF escaping modes.

rsyslog: why disk-assisted queues keep a file open

From time to time, someone asks why rsyslog disk-assisted queues keep one file open until shutdown. So it probably is time to elaborate a bit about it.

Let’s start with explaining what can be seen: if a disk-assisted queue is configured, rsyslog will normally use the in-memory queue. It will open a disk queue only if there is good reason to do so (because it severely hurts performance). The prime reason to go to disk is when the in memory queue’s configured size has been exhausted. Once this happens, rsyslog begins to create spool files, numbered consequtively. This should normally happen in cases where e.g. a remote target is temporarily not reachable or a database engine is responding extremely slow.  Once the situation has been cleared, rsyslog clears out the spool files. However, one spool file always is kept until you shut rsyslog down. Note well that when the disk queue is idle, all messages are processed even though the the physical spool file still contains some already-processed data (impstats will show you the exact details).

This is expected behaviour, and there is good reason for it. This is what happens technically:

A DA queue is actually “two queues in one”: It is a regular in-memory queue, which has a (second) helper disk queue in case it neeeds it. As long as operations run smoothly, the disk queue is never used. When the system starts up, only the in-memory queue is started. Startup of the disk queue is deferred until it is actually needed (as in most cases it will never be needed).

When the in-memory queue runs out of space, it starts that Disk queue, which than allocates its queue file. In order to reclaim space, not a single file is written but a series of files, where old files are deleted when they are processed and new files are created on an as-needed basis. Initially, there is only one file, which is read and written. And if the queue is empty, this single file still exists, because it is the representation of a disk queue (like the in-memory mapping tables for memory queues always exist, which just cannot be seen by the user).

So what happens in the above case is that the disk queue is created, put into action (the part where it writes and deletes files) and is then becoming idle and empty. At that stage, it will keep its single queue file, which holds the queue’s necessary mappings.

One may now ask “why not shut down the disk queue if no longer needed”? The short answer is that we anticpiate that it will be re-used and thus we do not make the effort to shut it down and restart when the need again arises. Let me elaborate: experience tells that when a system needs the disk queue once, it is highly likely to need it again in the future. The reason for disk queues to kick into action are often cyclic, like schedule restarts of remote systems or database engines (as part of a backup process, for example). So we assume if we used it once, we will most probably need it again and so keep it ready. This also helps reduce potential message loss during the switchover process to disk – in extreme cases this can happen if there is high traffic load and slim in-memory queues (remember that starting up a disk queue needs comparativley long).

The longer answer is that in the past rsyslog tried to shut down disk queues to “save” ressources. Experience told us that this “saving” often resulted in resource wasting. Also, the need to synchronize disk queue shutdown with the main queue was highly complex (just think about cases where we shut it down at the same time the in-memory queue actually begins to need it again!). This caused quite some overhead even when the disk queue was not used (again, this is the case most of the time – if properly sized). An indication of this effect probably is that we could remove roughly 20% of rsyslog’s queue code when we removed the “disk queue shutdown” feature!

Bottom line: keeping the disk queue active once it has been activated is highly desirable and as such seeing a queue file around until shutdown is absolutely correct. Many users will even never see this spool file, because there are no exceptional circumstances that force the queue to go to disk. So they may be surprised if it happens every now and then.

Side-Note: if disk queue files are created without a traget going offline, one should consider increasing the in-memory queue size, as disk queues are obviouly much less efficient than memory queues.