rsyslog queues, reliability and forwarding

I would like to explain a little bit how rsyslog queues and forwarding interact. First of all, let me state that the queue engine and the action part of rsyslog, where the forwarding happens, are strictly separated. There is no interface between the two that permits an action to affect queue behaviour. For example, I was asked if a forwarding action could initiate disk queue mode when forwarding fails. The root reason for this request was that messages should not be endangered while a remote server fails.

This is not possible with the current design and would involve a far-from-trivial design change. However, I do not think that the functionality is actually needed. When talking about reliability, rsyslog works on the importance of messages, not on the importance of actions.

So in rsyslog we can configure the level of message loss that is acceptable for a given use case. This can be done on an action-by-action basis (or once at the ruleset/main queue level for all actions; usually not a good idea, but from time to time it may be). The extreme ends are a) no message loss at all permitted and b) message loss of any magnitude is acceptable. For a), we need to configure a disk-only queue with full sync of queue files and management information after each message processed (with message input batches of one). For b), we do not need to place any restrictions. Obviously, a) is rather slow while b) is extremely fast. As extremes are usually not what is needed, there are many configuration possibilities between these two. For example, one may define a queue that goes to disk if more than 1,000 messages are kept in main memory, but only then, and that fsyncs queue files every 10 messages. That means that at any instant, at most 1,010 messages are at risk and can potentially be lost. The queue then monitors these predicates and switches to disk mode only when required, which is a very big performance saver.
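To make this concrete, such an in-between policy could look roughly as follows in rsyslog.conf. This is a minimal sketch based on the v5 queue directives; the queue file name and the forwarding target are made up, and the exact semantics of each directive should be checked against the queue documentation:

$ActionQueueType LinkedList         # in-memory queue...
$ActionQueueFileName fwdq           # ...made disk-assisted by giving it a file name
$ActionQueueHighWatermark 1000      # go to disk only above 1,000 messages
$ActionQueueCheckpointInterval 10   # persist queue bookkeeping every 10 messages
$ActionQueueSaveOnShutdown on       # drain the memory part to disk on termination
*.* @@central.example.net           # the guarded forwarding action

Providing a file name is what makes the in-memory queue disk-assisted; without it, the high watermark would have nothing to spill to.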

Now let’s shift perspective a bit: let’s go back to our starting example and say you want to go to disk only when the remote system is down. But what if the remote system is up, yet cannot accept messages quickly enough? Let’s assume a backlog of 10,000 messages builds up. Is it then acceptable to keep these in main memory, just because the remote system is accepting data? If this risk is acceptable, why would it be unacceptable if the remote system is not currently accessible? If we say one case is acceptable but the other is not, we somewhat contradict ourselves: it is almost random whether the remote system is accepting messages at any given moment, so why does it make a difference in the risk we permit?

This contradiction is the core reason why rsyslog does not look at external events or action failure causes but rather works on the basis of “acceptable risk”. Let’s say it is acceptable to lose 1,000 messages. Then it is irrelevant whether we lose them while the remote system is accepting or not. Consequently, rsyslog enforces disk mode if the remote system is down and there are more than 1,000 messages inside the queue. But it does not enforce this if there are only 500 messages waiting to be sent. Why should it? After all, the user has specified that a loss of 1,000 messages is acceptable, and so we do not try to guard these messages more than required by this policy. Note that if rsyslog is terminated in such a situation, a disk queue with those 500 messages is of course created (again depending on configuration). We do not voluntarily lose messages, and if we terminate, we can no longer hold them in main memory, so they must be written out. The in-memory queue is thus retained across rsyslog restarts. But it is important to point out that unexpected aborts, like a sudden loss of power, can cause message loss in such scenarios. This is no different from a sudden loss of power with an accepting remote system and a queue of 500. If such a risk is unacceptable, we are back at what I initially described as scenario a).

As a side note, rsyslog queues provide very high reliability. Every message is removed from the queue only after the action acknowledges that it has been processed. This kind of reliability is used in some very demanding audit-grade environments (which I am, as usual, not permitted to name…).

To sum up, rsyslog protects message integrity not by reacting to external events but by user-configurable “acceptable risk” policies.

We consider this a superior approach, as external events are somewhat unreliable indicators when it comes to protecting messages during traffic bursts. Relying on them leads to a number of anomalies, as hopefully explained above.

getting serious with rsyslog v6

I have just created a new v5-devel branch based on the current master AND have then merged the newconf branch into master. This “officially” means the beginning of rsyslog v6. So I thought that justifies a short blog post.

As already elaborated, v6 will focus on a better configuration language. The current version already has scoping for actions, but no doc for it yet. I will try to add the doc, so that I can hopefully officially release the devel version this week. I’d also like to work a bit more on imptcp, so that I have some common base functionality in all versions that support it.

Starting new v5-beta branch

I will start a new v5-beta branch soon. I have seen that a number of things have changed since the last stable release. Most importantly, some of these changes fix bugs that are hard to fix in the current stable version. This is because v5 had not seen that much real-world exposure so far, but the adoption rate seems to be increasing, and so we have been able to iron out some issues while refactoring the latest development release.

I have thought about backporting the bug fixes. However, this doesn’t seem to make too much sense: it is a lot of work, and, when done, the current v5-stable would have code very close to the development branch. So I decided to accept the bugs for the time being and instead see that we can get a new 5.6.0 stable release as soon as possible. The first step on that route is to create a new beta. Based on past experience, I think we can promote it to stable in September (as we have already gotten some good feedback on it).

Anybody with problems in the current v5-stable should simply update to the beta as soon as it is available.

coding new config format begun

After a long discussion about potential new config formats for rsyslog, we came to the conclusion that the current format is not as bad as I thought and just needs scoping (OK, the whole story is longer, but I can’t repeat that lengthy discussion in a single blog post, see the mailing list archive for details).

After some thinking, I finally started coding today.

I have begun to implement action scoping. A snapshot with partial functionality is available at

http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/newconf

It does NOT yet include the necessary output plugin interface change (or updated plugins), but it implements:

$Begin action
$End action
$StrictScoping [on/off] (default: off)

So if you want to play a bit with it, feel free to do so. Note that it disallows global statements within action scope and overall has somewhat better detection of user errors (these are easier to detect when scoping is used).

Note that scoping is purely optional: if not enabled, it will not be used. So any current rsyslog.conf remains valid.
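For a first impression, a scoped action might look like the following. This is only my sketch pieced together from the directives above; the exact syntax may well change before the experimental release:

$StrictScoping on
$Begin action
$ActionQueueType LinkedList
*.* /var/log/messages
$End action

With strict scoping enabled, a global statement (say, $UMASK) placed between $Begin action and $End action would be flagged as a user error instead of being silently applied.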

I will see that I change the project’s output plugins next week, and will possibly then make an experimental release.

Why I think Lua is no solution for rsyslog

During our discussion about the new rsyslog config format, it was suggested a couple of times to use Lua inside rsyslog. So far, I have rejected this idea for a couple of reasons, and I thought it is time to write them up.

But first of all let me explain that I do not dislike Lua. I think it is a very good tool, and it can be extremely useful to embed it. What I question is whether Lua is the right thing for rsyslog.

Let’s first think about what rsyslog is: it is a very high-speed, very scalable message processor that handles message processing via fine-tuned algorithms and with a lot of concurrency. As one important detail, the underlying engine can be seen as a specialized SIMD (single instruction multiple data) computer, that is, a machine that is able to execute the same instruction on multiple data elements concurrently. Speaking in rsyslog terms, this means that a single rule will process multiple messages “at once” in a concurrent operation (to be precise, the level of concurrency depends on a large number of factors, but let’s stick with the simplified view). Also, rsyslog is able to execute various parts of a ruleset concurrently to other parts. Some of these work units can be delayed for a very long time.

So an rsyslog configuration is a partially ordered set of filters and actions, which are executed in parallel based on this partial ordering (thus the concurrency). Each of the parallel execution threads then works in a SIMD manner, with the notable fact that we have a variable number of data elements. All communication between the various parts is via a message passing mechanism, which provides very high speed, very high reliability and different storage drivers (like memory and disk). Finally, speed is gained by state-of-the-art non-blocking synchronization and proper partitioning (quite honestly, not much of that exists yet, but this work is scheduled).
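To illustrate the SIMD analogy with a deliberately simplified C sketch (illustrative only, not actual rsyslog code): one filter, the “instruction”, is applied to a whole batch of messages, the “data elements”:

#include <stddef.h>

typedef struct {
    const char *text;   /* the raw message */
    int matched;        /* result of the filter for this message */
} msg_t;

/* one "instruction" (the filter) applied to many "data elements"
 * (the messages of a batch) in a single pass */
void apply_filter(msg_t batch[], size_t n, int (*filter)(const char *))
{
    for (size_t i = 0; i < n; i++)
        batch[i].matched = filter(batch[i].text);
}

The point is that the engine works batch-wise on such steps and can run different steps for different parts of the ruleset concurrently, instead of pushing each message individually through the whole ruleset.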

Then, there is Lua. Reading about the implementation of Lua 5.0, I learn that Lua employs a virtual machine and all code is interpreted. Also, other than rsyslog’s engine, Lua’s VM is a VM for a programming language, not a message processor (not surprisingly…). Thus, its optimized statements are control-of-flow instructions. I don’t see anything that permits SIMD execution. That alone, based on rsyslog experience, means a 400 to 800% drop in performance (just look at the rsyslog v3 engine, which did not have this mode). Note that this is the difference between the ability to process e.g. 200,000 messages per second (mps) vs. 50,000 mps. That in turn is already argument enough for me to reject the idea of Lua inside rsyslog.

But it gets even worse: not surprisingly for something that claims to be simple and easy, Lua’s threading is limited, which makes it somewhat hard to integrate with threaded code (which would not be an issue here, as the core part of rsyslog would be replaced by Lua if I were to use it). So coroutines would probably be the way to go. Reading the coroutines tutorial, I learn:

“Coroutines in Lua are not operating system threads or processes. Coroutines are blocks of Lua code which are created within Lua, and have their own flow of control like threads. Only one coroutine ever runs at a time,”

This is really bad news for rsyslog, which tries to fully utilize highly parallel hardware.

“and it runs until it activates another coroutine, or yields (returns to the coroutine that invoked it). Coroutines are a way to express multiple cooperating threads of control in a convenient and natural way, but do not execute in parallel, and thus gain no performance benefit from multiple CPU’s.”

That actually does not need a comment ;)

“However, since coroutines switch much faster than operating system threads and do not typically require complex and sometimes expensive locking mechanisms, using coroutines is typically faster than the equivalent program using full OS threads.”

Well, with a lot of effort going into threading, rsyslog is much faster on multiple CPUs than on a single one.

So weighing this, we are back to how the rsyslog v1 engine worked. I don’t even have performance numbers for that at hand, it was too long ago.

Looking at the Lua doc, I do not find any reference to mapping a partial order of execution onto something that gains concurrency effects (and given the single-threadedness, there is no point in doing so…).

I have not dug deeper, but I am sure I would also find no concept of queues, message passing and so on. If anywhere, I would expect such concepts in an object-oriented language, which Lua claims not primarily to be.

So using Lua inside rsyslog means that I would remove almost all of those things that helped me build a high performance syslogd, and it would also remove lots of abilities from the product. The only solution then would be to heavily modify Lua. And even if I did, I wonder if its maintainers would like the direction I would need to take, as this would add a lot of extra code to Lua, code that supposedly is not needed for the typical simple (read: not highly parallel) applications that embed Lua.

This is why I reject Lua, as well as other off-the-shelf script interpreters, for rsyslog. They do not reflect the needs of a high-speed, high-concurrency message processor.

Of course, I’d like to be proven wrong. If you can, please do. But please do not state generics; rather, tell me how exactly I can gain SIMD capability and lock-free multi-threaded synchronization with the embedded language of your choice. Note that I am not trying to discourage you. The problem is that I have so often received suggestions that “this and that” is such a great replacement for the config, invested quite some time in the evaluation, and always ran into the same problem. So before I do that again, I’d like to have some evidence that the time to evaluate the solution is well spent. And support for rsyslog’s concurrency requirements is definitely something we need.

I would also like to add some notes from one (of the many) mailing list post about Lua. These comments, I hope, provide some additional information about the concerns I have:

David Lang stated:
> If speed or security are not major issues, having a config language be a
> snippet of code is definitely convenient and lets the person do a huge
> number of things that the program author never thought of (see simple
> event correlator for an example of this), but in rsyslog speed is a
> significant issue (processing multiple hundreds of thousands of logs per
> second doesn’t leave much time) and I don’t think that an interpreter is
> up to the task. Interpreted languages also usually don’t support
> multi-threaded operation well.

And I replied:

This is a *very* important point. And it is the single reason why I re-thought about RainerScript and tend not to use it. While (in design) it can do anything I ever need, the interpretation is too slow, at least as far as the current implementation is concerned. I have read up on Lua, and there seem to be large similarities between how Lua works and how RainerScript actually (in filters!) works. Let me assume that Lua is far more optimized than RainerScript. Even then, it is a generic engine, and running that engine to actually process syslog data is simply too slow for the data rates we aim for.

Using my test lab as an example, we are currently at ~250,000 mps. The goal for my next performance tuning step will be to double that value (I don’t know yet when I will start with that work). Overall, the design shall be that rsyslog scales almost linearly with the number of CPUs (and network cards) added. I’ve made a couple of design errors in that regard in the past, but now I am through with that, have done a lot of research and think that I can achieve this nearly-linear speedup in the future. That means there will no longer be an actual upper limit on the number of messages per second rsyslog can process. Of course, even on a single processor, we need *excellent* performance.

For the single processor, this means we need highly optimized, up-to-the-task algorithms that don’t do many things generically.

For the multi-processor, this means we need to run as many of these tasks as possible truly concurrently.

For example, in the last performance tuning step, I radically changed the way rules are processed. Rather than thinking in terms of messages and steps to be done on them, I now have an implementation that works, semi-parallel, on the batch as a whole and (logically) computes sub-steps of message processing concurrently to each other (to truly elaborate on this would take a day or more to write, so I spare you the details).

I don’t think any general language can provide the functionality I need to do these sorts of things. This was also an important reason that led to RainerScript, a language where I could define the level of granularity myself. The idea is still not dead, but the implementation effort went in the wrong direction. And I have become skeptical about whether a language is the right approach at all.

Also note the difference between the config and the runtime engine. Whatever library/script/format/language we use for the config will, for the reasons given above, NOT be used during execution. It can only be used as a meta-language to specify what the actual engine will do.

So if we go for Lua (for example), we could use Lua to build the rsyslog config objects. But during actual execution, we would definitely not use Lua. So we would need a way to express rsyslog control flow in Lua, which would probably stretch the spec too far. Note that a Lua “if then” would not be something that the engine uses, but would rather be used to build a config object. So we still have the issue of how to specify an “rsyslog engine if then” inside a Lua script. Except, of course, if you think that Lua can do the regular processing, which I ruled out with the argument above.

And a bit later I added:
The order of execution of the tasks inside a given configuration can be viewed as a partially ordered set. Some of the tasks need to be preceded by others, but a (large) number of tasks have no relationship. To gain speed and scalability, the rsyslog engine tries to identify this partial order and tries to run those tasks in parallel that have no dependency on each other. Also, one must note that a config file is written with the assumption of a single message traversing the engine, which is a gross simplification. In practice, we now have batches (multiple messages at once) traversing the engine, where a lot of things are done concurrently and far differently from what one would expect when looking at the config file (but in a functionally equivalent way). It is this transformation from the in-sequence, single-message view to a partial-execution-order, parallel view that provides the necessary speedup to be able to serve demanding environments.

rsyslog performance improvement rather impressive

I (think I ;)) have finished this round of rsyslog performance tuning. The result is rather impressive:

On my virtualized 4-core development environment (not exactly a high-end environment), I went from approx. 70,000 messages per second (mps) to approx. 280,000 mps. Note that these numbers do not necessarily represent a real-world configuration, but I think they show the magnitude of the optimization. Also note that some complex configurations see far lower gains, because some features (like executing an action only n times within m seconds, or “message repeated n times” processing) require serialization of the message flow, and there is little we can gain in that case.

I plan to do an “official release” in the not so distant future. Next, I will first see which patches I have in my queue and then I’ll focus on the config language enhancement. That’s much more complex than just the format — I’ll blog the details hopefully either later today or tomorrow morning.

Getting Top Performance out of rsyslog

Rsyslog is lightning fast. However, the configuration influences speed very much. This blog post tells you what offers optimal performance for the most recent v5 version.

I will update this blog post whenever there is news to share (at least this is the plan). This information will also hopefully flow into the rsyslog doc at some time.

  • do not use more than one ruleset within a single queue
  • do not use rate limiting unless absolutely necessary
  • use array-based queue modes
  • do not use
  • send data from different inputs to multiple queues
  • use “$ActionWriteAllMarkMessages on” for all actions where you can afford it (it really makes a difference!)
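To show how some of these points translate into a configuration, here is a minimal sketch (the ruleset name, file path and port are made up; check directive availability for your version):

$ModLoad imtcp
$MainMsgQueueType FixedArray        # array-based queue mode
$ActionWriteAllMarkMessages on      # really makes a difference

# give the busy tcp input its own ruleset with its own queue
$RuleSet tcpin
$RulesetCreateMainQueue on
*.* /var/log/tcp.log
$InputTCPServerBindRuleset tcpin
$InputTCPServerRun 10514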

The following blog post also has some solid information on performance-influencing parameters: rsyslog evaluation. Note that it talks about a somewhat older rsyslog release; while it already quotes 250,000 messages per second, rsyslog 5.5.6 is quite a bit faster.

what are actions and action instance data?

On the rsyslog mailing list, the question of what actions are and in which way they are kept single-threaded from the POV of the output module came up again. I try to summarize the most important points and terms here.

David Lang gave the following example configuration:

*.* file1
*.* file2
*.* @ip1
*.* @ip2
*.* @@ip3
*.* @@ip4

and asked how many different actions/entities that were. Here is my answer:

An *action* is a specific instance of some desired output. The actual processing carried out is NOT termed “action”, even though one could easily do so. I have to admit I have not defined any term for that, so let’s simply call it processing. That actual processing is carried out by the output module (and the really bad thing is that the entry point is named “doAction”, which somewhat implies that the output module is the action, which is not the case).

Each action can use the service of exactly one output module. Each output module can provide services to many actions. So we have a N:1 relationship between actions and output modules.

In the above sample, 3 output modules are involved, where each output module is used by two actions. We have 6 actions, and so we have 6 action locks.

So the output module interface does not serialize access to the output module, but rather to the action instance. All action-specific data is kept in a separate, per-action data structure and passed into the output module at the time the doAction call is made. The output module can modify all of this instance data as if it were running on a single thread. HOWEVER, any global data items (in short: everything not inside the action instance data) are *not* synchronized by the rsyslog core. The output module must itself take care of synchronization if it desires concurrent access to such data items. All current output modules do NOT access global data other than for config parsing (which is serial and single-threaded by nature).

Note that the consistency of the action instance data is guarded by the rsyslog core by actually running the output module processing on a single thread *for that action*. But the output module code itself may be called concurrently if more than one action uses the same output module. That is a typical case. If so, each of the concurrently running instances receives its private instance data pointer but shares everything else.
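To make this contract concrete, here is a loose C sketch. This is NOT the real rsyslog plugin interface (which is macro-based); the names and signatures are made up purely to illustrate which data the core serializes and which data the module must guard itself:

/* illustrative sketch only; not the actual rsyslog plugin API */
#include <pthread.h>
#include <stdio.h>

typedef struct instanceData {
    /* per-action data: the core guarantees that only one thread
     * at a time executes doAction for a given action instance */
    FILE *fp;
} instanceData;

/* module-global data: shared by ALL actions using this module,
 * so the module itself must synchronize access to it */
static unsigned long msgsTotal;
static pthread_mutex_t statsMut = PTHREAD_MUTEX_INITIALIZER;

int doAction(instanceData *pData, const char *msg)
{
    /* safe without locking: serialized per action by the core */
    fputs(msg, pData->fp);

    /* NOT safe without locking: other actions run concurrently */
    pthread_mutex_lock(&statsMut);
    msgsTotal++;
    pthread_mutex_unlock(&statsMut);
    return 0;
}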

further improving tcp input performance

As one of the next things, I will be further improving rsyslog‘s tcp syslog input performance. As you know, rsyslog already has excellent performance (some sources, for example, quote 250,000 msgs per second). But, of course, there is room for improvement.

One such area is imtcp, the tcp syslog input module. It uses a single polling loop to obtain data from all senders. It is worth noting that the actual input module does NOT do very much, but hands the majority of work off to queue worker threads. However, it pulls the data from operating system buffers into our user space and also fills in some basic properties (like time of reception, remote peer and so on). Then, the message is pushed to the message queue, and on the other side of the queue the majority of processing happens (including such things as parsing the message, which some would assume to happen inside the receiving thread).

As can be seen in practice, this design scales pretty well in most cases. However, on a highly parallel system, it obviously limits the process of pulling data “off the wire” to a single CPU. If the rule set is then not very complex (and thus fast to process), the single-threadedness of the initial receiver becomes a bottleneck. On a couple of high performance systems, we have seen this to be the case, and I am now trying to address it.

Right now, I am looking for a good solution. There are two obvious ones:

a) start up a single thread for each connection
b) do a hybrid approach of what we currently do and a)

Even with 64-bit machines and NPTL, approach a) probably does not work well for a very large number of active sessions. Even worse, receiving messages from two different hosts would then require at least one context switch, and do so repeatedly. Context switches are quite expensive in terms of performance, and so are better avoided. Note that the current approach needs no context switch at all (for the part it does). On a system with many connections, I would be close to betting that the runtime required for the context switching of approach a) alone is more than what we need to do the processing with our current approach. So that seems to be a dead end.

So it looks like b) is the route to take, combining a (rather limited) number of threads with an event-driven reception loop. But how to best do that? A naive approach is to have one thread running the epoll() loop and a pool of worker threads that actually pull the data off the wire. The epoll loop would essentially just dispense to-be-processed file descriptors to the workers. HOWEVER, that also implies one context switch during processing, namely when the epoll loop thread activates a worker. Note that this situation is by far not as bad as in a): as we have a limited number of workers, and they are activated by the epoll thread, and that thread blocks when no workers are available, we have limited the level of concurrency. Note that limiting the concurrency level to roughly the number of CPUs available makes a lot of sense from a performance point of view (but not necessarily from a program simplicity and starvation-avoidance point of view; these concerns will be looked at, but for now I have a focused problem to solve).
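To illustrate this b) approach, here is a deliberately naive C sketch of such an epoll dispatcher with a fixed worker pool. It is not rsyslog code; queue-full handling, EPOLLONESHOT re-arming, error handling and shutdown are all omitted:

#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

#define QSIZE    1024
#define NWORKERS 4

static int readyfds[QSIZE];            /* tiny fd hand-off ring */
static unsigned head, tail;
static pthread_mutex_t mut  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    char buf[4096];
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mut);
        while (head == tail)           /* wait for a ready fd */
            pthread_cond_wait(&cond, &mut);
        int fd = readyfds[head++ % QSIZE];
        pthread_mutex_unlock(&mut);
        /* pull data off the wire; framing the syslog messages and
         * submitting them to the main queue is omitted here */
        if (read(fd, buf, sizeof(buf)) <= 0)
            close(fd);
    }
    return NULL;
}

void run_tcp_input(int epfd)           /* the single dispatcher thread */
{
    struct epoll_event ev[64];
    for (int i = 0; i < NWORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
    }
    for (;;) {
        int n = epoll_wait(epfd, ev, 64, -1);
        pthread_mutex_lock(&mut);
        for (int i = 0; i < n; i++)    /* the hand-off to a worker is */
            readyfds[tail++ % QSIZE] = ev[i].data.fd; /* the one context switch */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&mut);
    }
}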

One approach to this problem could be that I further reduce the amount of work done in imtcp: if it no longer pulls data off the wire, but just places the file descriptor into a “message” object and submits that to the overall queue, modified queue processing could then take care of the rest. However, there are many subtle issues, including how to handle system shutdown and restart as well as disk queues. In short: that probably requires a full redesign, or at least considerable change. Anything less than that would probably result in another processing stage in front of the rule engine, as outlined initially (and thus require additional context switches).

So I focused back on the optimal way to partition this problem. One (simple) approach is to partition by tcp listener. It would be fairly easy to run multiple listeners concurrently, where each listener has its own (epoll/take data off the wire)-loop running on the listener’s single thread. In essence, this would be much like running two or more rsyslog instances, using the current code, concurrently. That approach obviously causes no additional context switches. But it has a major drawback: if the workload is spread unevenly between listeners, it may not provide sufficient parallelism to keep all CPU cores busy. However, if the workload is spread evenly enough, the approach can prevent starvation between listeners, though not between sessions of one listener. That problem is not addressed by the current code either, and there has never been any user complaint about it (or its potential effects). So one may conclude starvation is not an issue.

It looks like the usefulness of this approach strongly depends on the spread of workload between different listeners. Looking at a busy system, we need to focus on the number of highly active listeners in relation to the number of expectedly idle CPU cores i. That number i obviously must take into consideration any other processing requirements, both from rsyslog (parsing, rule processing, …) as well as all other processes the system is intended to run. So, in general, the number i is probably (much) lower than the total number of cores inside the system. If we now have a number l of listeners, we must look closely: if, among all listeners, l_h is the number of high-activity listeners, then it is sufficient to have i equal l_h: a few occasional wakeups from low-activity listeners do not really matter. However, if l_h is lower than i, or even just one, then we cannot fully utilize the system hardware. In that case, we would need to provide partitioning based on sessions, and there we see a similar scheme based on the view of low- and high-activity sessions.

But the real question is whether we can assume that most busy systems have a sufficient number of high-activity listeners, so that per-listener concurrency is sufficient to fully utilize the hardware. If that is the case, we can drastically improve potential message processing rates and still keep the code simple. Even more concretely, the question is whether we are sufficiently sure this approach works well enough to justify implementing it. Doing so could save considerable development effort, which could be put to better uses (like speeding up queue processing). BUT that development effort is wasted if, for a large enough number of systems, we see no benefit. And note that single-listener systems are not uncommon, a case where we would gain NO benefit at all.

I am actually somewhat undecided and would appreciate feedback on that matter.

Thanks in advance to all who provide it.
Rainer

Update: there is a long and very insightful discussion about this post on the rsyslog mailing list. All interested parties are strongly advised to read through it, it will definitely enhance your understanding. Please also note that based on that discussion the development focus shifted a bit.

rsyslog string generators … done :)

An rsyslog string generator is what I had previously called a “template module”: in essence a facility to generate a template string with some custom native C code. I have decided to name it a bit differently, because at some later stage there may be other uses for these types of modules as well. Specifically, I am thinking about adding custom name-value pairs to the message object, and then a string generator (or strgen for short) could be used to generate such a value as well.

Implementation went smoothly. I implemented both the interface as well as a small set of important core strgens, those that are frequently used to write files or forward messages to remote machines. I did not touch any others, as that is probably not really necessary, and it could easily be done any time the need arises.

The new interface also provides a capability to third parties that enables them to create their own high-speed string generators. The performance impact can be dramatic: just think about cases where multiple regular expression calls can be replaced by a single call and some C logic.
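To give a flavor of why this is fast (a loose sketch; the real strgen interface is macro-based and the names here are made up): a strgen boils down to a C function that builds the complete output line for a message in a single pass, instead of evaluating a template with multiple property lookups or regex calls:

#include <stdio.h>

/* stand-in for rsyslog's message object; names are illustrative */
typedef struct {
    const char *timestamp;
    const char *hostname;
    const char *tag;
    const char *msg;
} smsg_t;

/* builds the whole output line in one pass */
int myFormatStrgen(const smsg_t *m, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s %s %s%s\n",
                     m->timestamp, m->hostname, m->tag, m->msg);
    return (n < 0 || (size_t)n >= len) ? -1 : 0; /* -1 = truncated */
}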

Finally, these modules may even provide a way to fund rsyslog development. Adiscon could probably sell them for some small amount (I guess around $500 based on what typically needs to be done, in some cases maybe less, in some a bit more). I guess that would be attractive for anyone who needs both high speed and a custom format and runs rsyslog for profit. Getting into all the details to develop such a thing oneself probably costs more than our whole implementation effort. I hope we will get some orders for these, and I hope that folks will contribute the strgens back to the project. Their plus is then that we maintain the code for free, and the plus for the community is that, in the long term, we will hopefully get a large directory of ready-to-use custom strgens (OK, that sidesteps the funding process a bit, but… ;)).

I have also managed to write some basic doc on the new capability, to be seen here:

What is missing now is some feedback from the field, including from someone who actually uses this to create a custom format.

The code has been merged into v5-devel (the master branch) and will most probably be released early next week. Then, it will undergo the usual devel/beta cycle, so that availability in a stable v5 release can be expected towards the end of summer 2010. Special thanks go to David Lang, who provided good advice that helped me create the new functionality.