Why I think Lua is no solution for rsyslog

During our discussion about the new rsyslog config format, a couple of times it was suggested to use Lua inside rsyslog. So far, I reject this idea for a couple of reasons, and I thought it is time to write up of them.

But first of all let me explain that I do not dislike Lua. I think it is a very good tool, and it can be extremely useful to embed it. What I question is if Lua is the right thing for rsyslog.

Let’s first think about what rsyslog is: it is a very high-speed, very scalable message processor that handles message processing via fine-tuned algorithms and with a lot of concurrency. As one important detail, the underlying engine can be seen as a specialized SIMD (single instruction multiple data) computer. That is a machine that is able to execute the same instruction on multiple data elements at the same time concurrently. Speaking in rsyslog terms, this means that a single rule will process multiple messages “at once” in a concurrent operation (to be precise, the level of concurrency depends on a large number of factors, but let’s stick with the simplified view). Also, rsyslog is able to execute various parts of a ruleset concurrently to other parts. Some of of these work units can be delayed for very long time.

So a rsyslog configuration is a partially ordered set of filters and actions, which are executed in parallel based on this partial ordering (thus the concurrency). Each of the parallel execution threads then works in a SIMD manner, with the notable fact that we have a variable number of data elements. All communication between the various parts is via a message passing mechanism, which provides very high speed, very high reliability and different storage drivers (like memory and disk). Finally, speed is gained by state-of-the-art non-blocking synchronization and proper partitioning (quite honestly, not much now, but this work is scheduled).

Then, there is Lua. Reading about the implementation of Lua 5.0, I learn that Lua employs a virtual machine and all code is interpreted. Also, other than rsyslog’s engine, Lua’s VM is a VM for a programming language, not a message processor (not surprisingly…). Thus, it’s optimized statements are for control-of-flow instructions. I don’t see anything that permits SIMD execution. That alone, based on rsyslog experience, means a 400 to 800% drop in performance — just use a rsyslog v3 engine which did not have this mode. Not that this is the difference between the ability to process e.g. 200,000 messages per second (mps) vs. 50,000 mps. That in turn is already argument enough for me to reject the idea of Lua inside rsyslog.

But it comes even worse: nor surprisingly for something that claims to be simple and easy, Lua’s threading is limited and makes it somewhat hard to integrate with threaded code (which would not be an issue, as the core part of rsyslog would be replaced by Lua if I’d use it). So coroutines would probably be the way to go. Reading the coroutines tutorial, I learn

“Coroutines in Lua are not operating system threads or processes. Coroutines are blocks of Lua code which are created within Lua, and have their own flow of control like threads. Only one coroutine ever runs at a time,”

Which is really bad news for rsyslog,which tries to fully utilize highly parallel hardware.

“and it runs until it activates another coroutine, or yields (returns to the coroutine that invoked it). Coroutines are a way to express multiple cooperating threads of control in a convenient and natural way, but do not execute in parallel, and thus gain no performance benefit from multiple CPU’s.

That actually does not need a comment ;)

“However, since coroutines switch much faster than operating system threads and do not typically require complex and sometimes expensive locking mechanisms, using coroutines is typically faster than the equivalent program using full OS threads.”

Well, with a lot of effort going into threading, rsyslog is much faster on multiple CPUs than on a single one.

So weighing this, we are back to how the rsyslog v1 engine worked. I don’t even have performance numbers for that at hand, it was too long ago.

Looking at the Lua doc, I do not find any reference to trying to match the partial order of execution into something that gains concurrency effects (and given the single-threadedness, there is no point in doing so…).

I have not dug deeper, but I am sure I will also find no concept of queues, message passing etc. If at all, I would expect such concepts in an object oriented language, which Lua claims not to be primarily.

So using Lua inside rsyslog means that I will remove almost all of those things that helped me make up a high performance syslogd, and it will also remove lot’s of abilities from the product. The only solution then would be to heavily modify Lua. And even if I did, I wonder if it’s maintainers would like the direction I would need to take, as this would add a lot of extra code to Lua. Which supposedly is not needed for the typical simple (read: non highly parallel) applications that use Lua.

This is why I reject Lua, as well as other off the shelf script interpreters for rsyslog. They do not reflect the needs a high-speed, high-concurrency message processor has.

Of course, I’d like to be proven wrong. If you can, please do. But please do not state generics but rather tell me how exactly I can gain SIMD capability and look-free multi-threaded synchronization with the embedded language of your choice. Note that I am not trying to discourage you. The problem is that I received so often suggestions that “this and that” is such a great replacement for the config, invested quite some time in the evaluation and always saw it is the same problem. So before I do that again, I’d like to have some evidence that the time to evaluate the solution is well spent. And support for rsyslog’s concurrency requirements is definitely something we need.

I would also like to add some notes from one (of the many) mailing list post about Lua. These comments, I hope, provide some additional information about the concerns I have:

David Lang stated:
> If speed or security are not major issues, having a config language be
> a
> snippet of code is definantly convienient and lets the person do a huge
> number of things that the program author never thought of (see simple
> event correltator for an example of this), but in rsyslog speed is a
> significant issue (processing multiple hundreds of thousands of logs
> per
> second doesn’t leave much time) and I don’t think that an interpreter
> is
> up to the task. Interpreted languages also usually don’t support
> multi-threaded operation well.

And I replied:

This is a *very* important point. And it is the single reason why I
re-thought about RainerScript and tend not to use it. While (in design) it
can do anything I ever need, the interpretation is too slow — at least as
far as the current implementation is concerned. I have read up on Lua, and
there seem to be large similarities between how Lua works and how
RainerScript actually (in filters!) works. Let met assume that Lua is far
more optimized than RainerScript. Even then, it is a generic engine and
running that engine to actually process syslog data is simply too slow.

In order to gain the high data rates we have. Using my test lab as an
example, we are currently at ~250,000 mps. The goal for my next performance
tuning step will be to double that value (I don’t know yet when I will start
with that work). Overall, the design shall be that rsyslog almost linearly
scales with the number of CPUs (and network cards) added. I’ve done a couple
of design errors with that in the past, but now I am through with that, have
done a lot of research and think that I can achieve this nearly-linear
speedup in the future. That means there will no longer be an actual upper
limit on the number of messages per second rsyslog can process. Of course,
even on a single processor, we need *excellent* performance.

For the single-processor, this means we need highly optimized, up to the task
algorithms that don’t do many things generically.

For the multi-processor, that means we need to run as many of these tasks
truly concurrently.

For example, in the last performance tuning step, I radically changed the way
rules are processed. Rather than thinking in terms of messages and steps to
be done on these, I now have an implementation that works, semi-parallel, on
the batch as whole and (logically) computes sub-steps of message processing
concurrently to each other (to truly elaborate on this would take a day or
more to write, thus I spare the details).

I don’t think any general language can provide the functionality I need to do
these sorts of things. This was also an important reason that lead to
RainerScript, a language where I could define the level of granularity
myself. The idea is still not dead, but the implementation effort was done
wrongly. But I have become skeptic if a language at all is the right
approach.

Also note the difference between config and runtime engine. Whatver library /
script/ format/ language we use for the config will, for the reasons given
above, NOT be used during execution. It can only be used as a meta-language
to specify what the actual engine will do.

So if we go for Lua (for example), we could use Lua to build the rsyslog
config objects. But during actual execution, we will definitely not use Lua.
So we would need a way to express rsyslog control flow in Lua, what probably
would stretch the spec too far. Note that a Lua “if then” would not be
something that the engine uses, but rather be used to build a config object.
So we still have the issue how to specify an “rsyslog engine if then” inside
a Lua script”. Except, of course, if you think that Lua can do regular
processing, which I ruled out with argument above.

And a bit later I added:
The order of execution of the task inside a given configuration
can be viewed as a partially ordered set. Some of the tasks need to be
preceded by others, but a (large) number of tasks have no relationship. To
gain speed and scalability, the rsyslog engine tries to identify this partial
order and tries to run those task in parallel that have no dependency on each
other. Also, one must note that a config file is written with the assumption
of a single message traversing the engine, which is a gross simplification.
In practice, we now have batches (multiple messages at once) traversing
through the engine, where a lot of things are done concurrently and far
different from what one would expect when looking at the config file (but in
a functionally equivalent way). It is this transformation of in-sequence,
single-message view to partial execution order, parallel view that provides
the necessary speedup to be able to serve demanding environments.