On Executing a Ruleset in non-SIMD Mode

This is probably inconsistent, not thought out, maybe even wrong. Use this information with care.

I wanted to share a complexity of executing a rsyslog ruleset in non-SIMD mode, which means one message is processed after each other. This posting is kind of a think tank and may also later be useful for others to understand design and design decisions. Note, however, that I do not guarantee that anything will be implemented as described here.

As you probably know rsyslog supports transactions and, among other things, does this to gain speed from it. This is supported by many modules, e.g. database modules or the tcp forwarder. In essence, what happens is
  • transaction is begun
  • messages are feed to the action
  • transaction is stopped

For most operations, success/failure information is only available after step 3 (because step 2 mainly buffers to-be-processed messages).

This plays very well with the current SIMD engine, which advances processing one statement after the other, and then processed the batch as whole. That means all three steps are done at one instance when the relevant statement is processed. This permits the engine to read the final result before advancing state to the next action.

Now envision a non-SIMD engine. It needs to start the transaction as soon as the first message is ready for processing, and the submit messages as they are processed. However, in non-SIMD mode, each action will only carry out step 2, and then immediately advance to the next statement. This happens for each message. Only after ruleset processing is completed, the final commit can occur.
This means that we do not have the actual outcome of operation until ruleset processing has finished. Most importantly, the result of statement n is no longer available in statements s with s > n. Some features (like failover processing — execIfPreviousIsSuspended) depend on the previous statement’s result, so this is a problem. Even more so, retrying an action is far from trivial, as we do no longer necessarily have the exact same information that was gathered during processing available when processing is done (later statements may have changed message state).
There are some solutions to these problems
  1. if execIfPreviousIsSuspended is specified, the statement in front of it must be forced to commit after each message (which costs quite a lot of performance, but at least only if the users turns on that feature)
  2. To mitigate the performance loss of those auto-commits, we could add a new syntax which explicitly permits to specify failover actions. These could be linked to the primary statement and NOT be executed during regular rule engine execution. We would still need to buffer (message pointers) for the case they are to be executed, but that’s probably better. Most importantly, there would be no other conditional logic supported, which makes processing them rather quickly.
  3. The message object must support a kind of “copy on write” (which would even be very useful with the v7 engine, which also permits more updates than any pre-v8 engine ever was designed for…). This could be done by splitting the (traditional) immutable part of the message structure from things like variables. Message modification modules modifying the “immutable” part would need to do a full copy, others not (the usual case). Of course, this copy on update can make variable operations rather costly.
  4. Output modules could be forced to perform a kind of “retry commit” — but this is a bad option because it a) puts (repetitive) burden on the output (in essence, the output faces exactly the same problems like the core engine, EXCEPT probably that it knows better which exact data items it needs — easy for traditional template based interface). b) it removes from the engine the ability to re-try parts of the transaction. So this is not very appealing.
  5. In any case, the actual “action retry handling” should probably be applied to the commit interface, far less than the usual submit interface.

What even more complicates things is that we do not know what modules that use the message passing interface actually do with the messages. In current code, most of them are message modification modules. This means in order for them to work correctly, they actually need to execute on the message “right in time”. And, of course, there is even more complexity in that each output may do partial commits at any time. The most popular case probably is when it runs out of some buffer space.

To solve these issues, a two-step execution inside the rule system seems desireable:

  • execution phase
  • commit phase

Note that this two-phase approach is already very similar to what action queues do. However, this also means that action queues in pre-v8 can be victim to race conditions if variables are heavily used.

In any case, using action queues to perform these two steps seems very natural and desirable. Unfortunately, there is still considerate overhead attached to this (mutex operations, required context switches), which makes this very unattractive. The end result if taking this path probably would be a reduced overall processing speed, something we really don’t like to see. Also, failover processing would not work if following that path.

Execution Phase

  1. advance message state – message modification (mm) modules must be called immediately
  2. “shuffle” msgs to actions – the main concern here is to make sure that the action sees an immutable action object, at least in regard to what it needs from it (we may need to add an entry point to ask the action to extract whatever it actually needs and return a pointer to that – not necessary for simple strings, for a prominent example).
    Note that doAction is never called for non mm-modules.

At the end of execution phase, for each action we have an array of to-be-processed data.

Commit Phase
For each action with data, we submit the data to its action, performing all three steps. This way, we can easily keep track of the state advancement and action errors. It would be easy to implement dedicated failover processing at this stage (but this probably requires larger state info if the failover action is different from the primary one).

This two-phase approach somewhat resembles the original batching/SIMD idea of the pre-v8 engine. So it looks like this design was well up to the point of what we need to do. I am still a bit undecided if doing these engine changes are really worth it, but so far code clarity seem to be much better. Performance possibly as well, as the SIMD needed to carry a lot of state around as well.

I will now probably do a test implementation of the two-phase approach, albeit only for the traditional string interface.

Some ideas/results from the test implementation:

  • ┬áThe structure used to store messages could -LATER- be made the structure that is actually queued in action queues, enabling for faster performance (in-memory) and unified code.
  • Note: an advantage on storing the result string template vs. the message object between phases is of advantage as we do not need to keep the message immutable for this reason. It needs to be seen, though, if that really is an advantage from the overall picture (the question is can we get to a point where we actually do NOT need to do copy-on-write — obviously this would be the case if one string templates are used).

what are actions and action instance data?

On the rsyslog mailing list, the question about what actions are in in which way they are kept single-threaded from the POV of the output module came up again. I try to summarize the most important points and term here.

David Lang gave the following example configuration:

*.* file1
*.* file2
*.* @ip1
*.* @ip2
*.* @@ip3
*.* @@ip4

and asked how many different actions/entities that were. Here is my answer:

An *action* is a specific instance of some desired output. The actual processing carried out is NOT termed “action”, even though one could easily do so. I have to admit I have not defined any term for that. So let’s call this processing. That actual processing is carried out by the output module (and the really bad thing is that the entry point is named “doAction”, which somewhat implies that the output module is called the action, what is not the case).

Each action can use the service of exactly one output module. Each output module can provide services to many actions. So we have a N:1 relationship between actions and output modules.

In the above samples, 3 output modules are involved, where each output module is used by two actions. We have 6 actions, and so we have 6 action locks.

So the output module interface does not serialize access to the output module, but rather to the action instance. All action-specific data is kept in a separate, per-action data structure and passed into the output module at the time the doAction call is made. The output module can modify all of this instance data as if it were running on a single thread. HOWEVER, any global data items (in short: everything not inside the action instance data) is *not* synchronized by the rsyslog core. The output module must take care itself of synchronization if it desires to have concurrent access to such data items. All current output modules do NOT access global data other than for config parsing (which is serial and single-threaded by nature).

Note that the consistency of the action instance data is guarded by the rsyslog core by actually running the output module processing on a single thread *for that action*. But the output module code itself may be called concurrently if more than one action uses the same output module. That is a typical case. If so, each of the concurrently running instances receives its private instance data pointer but shares everything else.

next generation rsyslog design, worker pools, parellelism and future hardware

Based on my last post on rsyslog multithreading, an email conversation arose. When I noticed I elaborated about a lot of things possibly of general interest, I decided to turn this into a blog post. Here it is.

The question that all this started was whether or not a worker thread pool is good or evil for a syslogd.

Yes, the worker-thread pool has a lot of pros and cons with syslog. I know this quite well, because I am also the main design lead behind WinSyslog, which was the first syslogd available natively on Windows (a commercial app). It is heavily multi-threaded. I opposed a worker pool for a long time but have accepted it lately. In some high-volume scenarios (unusual cases) it is quite valuable. But of course you lose order of messages. For WinSyslog, we made the compromise to have the worker pool configurable and set it to 1 worker if order of events is very important. I designed WinSyslog in 1995 and released the first version in 1996 – so I know quite well what I am talking about (but to be honest the initial multi threading engine got in somewhat later;)).

HOWEVER, especially in high-volume environments and with network senders, you are somewhat playing Russian roulette if you strongly believe that order in which events come in is exactly the same order in which they were sent. Just think routers, switches, congestions, etc… For small volumes, that’s fair enough. But once the volume goes up, you do not get it 100% right. This is one of the reasons I am working quite hard it the IETF to get a better timestamp in into syslog (rsyslog already has it, of course as an option). The right thing to do message sequencing is by looking at a high-precision timestamp, not by looking at time of reception.

For rsyslog, I am not yet sure if it will ever receive a worker pool. From today’s view, it does not look necessary. But if I think about future developments in hardware, the only key in getting more performance is by using the enhanced parallelism the hardware will provide. The time of fast single cores is over. We will have relatively slow (as fast as today ;-]) massively parallel hardware. This is a challenge for all of software engineering. Many folks have not yet realized it. I think it has more problem potential than the last “big software crisis”. As you can see in the sequencing discussion, parallelism doesn’t only mean mutexes and threads – it means you need to re-think on how you do things (e.g. use timestamps for correlation instead of reception/processing order, because the later is no stable concept in a massively parallel program). With the design I am now doing, I will definitely have separate threads for each output action (like file or database writer). I need this, because rsyslog will provide a way to queue messages on disk when a destination is not available. You could also call this “store-and-forward syslog”, just like SMTP is queued while in transit. You can not do this concept with a single thread (at least not in a reasonable complex way). I also do not think that multiple processes would be the right solution. First off, they are too slow. Secondly, multiple processes for this use are much more complicated than threads (I know, I’ve written similar multi-process based things back in the 80s and 90s).

Rsyslog will also most probably have more than one thread for the input. Input will become modular, too. If I use one thread per input, each input can use whatever threading model that it likes. That makes writing input plugins easy (one of my goals). Also, in high volume environments, having the ability to run inputs on multiple CPUs *is* *very* beneficial. One of the first plugins will be RFC 3195 syslog, which is currently run as an external process (communicating via unix local domain socket, which is quite a hack). At the core of 3195 is liblogging, a slim and fast select-server I have written. However, integrating it into a single-threaded application is still a challenge (you need to merge the select() calls and provide an API for that). With multiple threads, you can run that select server on its own thread, which is quite reasonable (after all, why should both UDP and 3195 reception run on the same thread?). The same is true for tcp based processing. Then think about the native ssl support that is coming. Without hardware crypto acceleration, doesn’t it make much sense to run the receiver on its own thread? Doesn’t it even make sense to run the sender on its own thread?

So the threading model of the next major release of rsyslog (called 3.x for reasons not yet elaborated about) will most probably be:

  1. multiple input threads, at least one for each input module (module may decide about more threads if they like to)
  2. all those input threads serialize messages to a single incoming queue
  3. the queue will most probably be processed by a single worker thread working on filter conditions and passing messages to the relevant outputs. There may be a worker pool, but if so its size can be configured (and set to 1, if needed)
  4. multiple output threads, at least one for each output. Again, it is the output’s decision if it runs on more than one thread
  5. there may be a number of housekeeping threads, e.g. for DNS cache maintenance

This design will provide superb performance, is oriented on logical needs, allows for easy intrgration of plugins AND will be reasonable easy to manage – at least I hope so.

But, wait, why have I written all this? OK, one reason is that I wanted to document the upcoming threading model for quite a while and now was a good time for doing so. But I think it also shows that you can not say “multithreading is good/bad” without thinking about the rest of the system design. Multi-threading is inherently complex. Humans think sequentially, at least when we think about something consciously. So parallel programming doesn’t match human nature. On the other hand, nature is doing things itself massively parallel. Just think about our human vision: in tech terms, it is massively parallel signal processing. This is where technology is also heading to. As we have learned to program at all, we will also learn to deal with parallel programming. A key thing is that we should never think about parallelism as a feature – it’s actually a tool, that can be used right or wrong. So let’s hope the rsyslog design will do it right. Of course, comments are very much appreciated!