Maybe a historical moment, but let those who know far more elaborate:
Starting new v5-beta branch
I will start a new v5-beta branch soon. I have seen that a number of things have changed since the last stable release. Most importantly, some of these changes fix bugs, bugs that are hard to fix in the current stable version. This is because v5 had not been exposed to reality very much, but the adoption rate seems to be increasing, and so we have been able to iron out some issues while doing refactoring on the latest development release.
I have thought about backporting the bug fixes. However, this doesn’t seem to make too much sense: it is a lot of work, and, when done, the current v5-stable would have code very close to the development branch. So I decided to accept the bugs for the time being and instead see that we can get a new 5.6.0 stable release as soon as possible. The first step on that route is to create a new beta. Based on past experience, I think we can promote that to stable in September (as we have already gotten some good feedback on it).
Anybody with problems in the current v5-stable should simply update to the beta as soon as it is available.
coding new config format begun
After a long discussion about potential new config formats for rsyslog, we came to the conclusion that the current format is not as bad as I thought and just needs scoping (OK, the whole story is longer, but I can’t repeat that lengthy discussion in a single blog post; see the mailing list archive for details).
After some thinking, I finally started coding today.
I have begun to implement action scoping. A snapshot with partial functionality is available at
http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/newconf
It does NOT yet include the necessary output plugin interface change (or updated plugins), but it implements
$Begin action
$End action
$StrictScoping [on/off] (default: off)
So if you want to play a bit with it, feel free to do so. Note that it disallows global statements within action scope and overall has somewhat better detection of user errors (these are easier to detect when scoping is used).
Note that scoping is purely optional: if not enabled, it will not be used. So any current rsyslog.conf remains valid.
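To make this concrete, here is a minimal sketch of what a scoped action block might look like, using only the directives named above (this is my illustration; the exact set of statements permitted inside a scope may differ in the experimental code):

$StrictScoping on
$Begin action
# action-level statements and the action itself go inside the scope;
# global statements are disallowed here
$ActionQueueType LinkedList
*.* /var/log/messages
$End action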
I will see that I change the project’s output plugins next week, and will possibly make an experimental release then.
Why I think Lua is no solution for rsyslog
During our discussion about the new rsyslog config format, it was suggested a couple of times to use Lua inside rsyslog. So far, I have rejected this idea for a couple of reasons, and I thought it is time to write them up.
But first of all let me explain that I do not dislike Lua. I think it is a very good tool, and it can be extremely useful to embed it. What I question is whether Lua is the right thing for rsyslog.
Let’s first think about what rsyslog is: it is a very high-speed, very scalable message processor that handles message processing via fine-tuned algorithms and with a lot of concurrency. As one important detail, the underlying engine can be seen as a specialized SIMD (single instruction, multiple data) computer, that is, a machine that is able to execute the same instruction on multiple data elements concurrently. Speaking in rsyslog terms, this means that a single rule will process multiple messages “at once” in a concurrent operation (to be precise, the level of concurrency depends on a large number of factors, but let’s stick with the simplified view). Also, rsyslog is able to execute various parts of a ruleset concurrently to other parts. Some of these work units can be delayed for a very long time.
So a rsyslog configuration is a partially ordered set of filters and actions, which are executed in parallel based on this partial ordering (thus the concurrency). Each of the parallel execution threads then works in a SIMD manner, with the notable fact that we have a variable number of data elements. All communication between the various parts is via a message passing mechanism, which provides very high speed, very high reliability and different storage drivers (like memory and disk). Finally, speed is gained by state-of-the-art non-blocking synchronization and proper partitioning (quite honestly, not much now, but this work is scheduled).
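To illustrate the SIMD analogy, here is a small, self-contained C sketch (invented names, not actual rsyslog engine code): one processing step, the filter, runs across a whole batch of messages, and then the next step, the action, runs across the same batch, instead of the whole ruleset being executed once per message:

#include <stdio.h>

#define BATCHSIZE 4

typedef struct { const char *text; int facility; } msg_t;

/* one "instruction" (the filter) applied to many "data elements" */
static void filterBatch(const msg_t *batch, int n, int *active)
{
    for (int i = 0; i < n; ++i)
        active[i] = (batch[i].facility == 0); /* e.g. a kern.* filter */
}

/* the next "instruction" (the action), again across the whole batch */
static void actionBatch(const msg_t *batch, int n, const int *active)
{
    for (int i = 0; i < n; ++i)
        if (active[i])
            printf("%s\n", batch[i].text);
}

int main(void)
{
    msg_t batch[BATCHSIZE] = {
        { "kernel: device up", 0 }, { "cron: job started", 9 },
        { "kernel: oops", 0 },      { "sshd: login", 4 }
    };
    int active[BATCHSIZE];

    filterBatch(batch, BATCHSIZE, active);
    actionBatch(batch, BATCHSIZE, active);
    return 0;
}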
Then, there is Lua. Reading about the implementation of Lua 5.0, I learn that Lua employs a virtual machine and all code is interpreted. Also, unlike rsyslog’s engine, Lua’s VM is a VM for a programming language, not a message processor (not surprisingly…). Thus, its optimized statements are control-of-flow instructions. I don’t see anything that permits SIMD execution. That alone, based on rsyslog experience, means a 400 to 800% drop in performance; just look at the rsyslog v3 engine, which did not have this mode. Note that this is the difference between the ability to process e.g. 200,000 messages per second (mps) vs. 50,000 mps. That in turn is already argument enough for me to reject the idea of Lua inside rsyslog.
But it gets even worse: not surprisingly for something that claims to be simple and easy, Lua’s threading is limited and makes it somewhat hard to integrate with threaded code (which would not be an issue, as the core part of rsyslog would be replaced by Lua if I were to use it). So coroutines would probably be the way to go. Reading the coroutines tutorial, I learn:
“Coroutines in Lua are not operating system threads or processes. Coroutines are blocks of Lua code which are created within Lua, and have their own flow of control like threads. Only one coroutine ever runs at a time,”
Which is really bad news for rsyslog, which tries to fully utilize highly parallel hardware.
“and it runs until it activates another coroutine, or yields (returns to the coroutine that invoked it). Coroutines are a way to express multiple cooperating threads of control in a convenient and natural way, but do not execute in parallel, and thus gain no performance benefit from multiple CPU’s.”
That actually does not need a comment ;)
“However, since coroutines switch much faster than operating system threads and do not typically require complex and sometimes expensive locking mechanisms, using coroutines is typically faster than the equivalent program using full OS threads.”
Well, with a lot of effort going into threading, rsyslog is much faster on multiple CPUs than on a single one.
So weighing all this, we are back to how the rsyslog v1 engine worked. I don’t even have performance numbers for that at hand; it was too long ago.
Looking at the Lua doc, I do not find any reference to an attempt to map the partial order of execution into something that gains concurrency effects (and given the single-threadedness, there is no point in doing so…).
I have not dug deeper, but I am sure I will also find no concept of queues, message passing, etc. If at all, I would expect such concepts in an object-oriented language, which Lua claims not primarily to be.
So using Lua inside rsyslog means that I would remove almost all of the things that helped me build a high-performance syslogd, and it would also remove lots of abilities from the product. The only solution then would be to heavily modify Lua. And even if I did, I wonder if its maintainers would like the direction I would need to take, as this would add a lot of extra code to Lua, which supposedly is not needed for the typical simple (read: not highly parallel) applications that use Lua.
This is why I reject Lua, as well as other off-the-shelf script interpreters, for rsyslog. They do not reflect the needs of a high-speed, high-concurrency message processor.
Of course, I’d like to be proven wrong. If you can, please do. But please do not state generalities; rather, tell me how exactly I can gain SIMD capability and lock-free multi-threaded synchronization with the embedded language of your choice. Note that I am not trying to discourage you. The problem is that I have so often received suggestions that “this and that” is such a great replacement for the config, invested quite some time in the evaluation, and always ran into the same problem. So before I do that again, I’d like to have some evidence that the time to evaluate the solution is well spent. And support for rsyslog’s concurrency requirements is definitely something we need.
I would also like to add some notes from one (of the many) mailing list posts about Lua. These comments, I hope, provide some additional information about the concerns I have:
David Lang stated:
> If speed or security are not major issues, having a config language be
> a snippet of code is definitely convenient and lets the person do a huge
> number of things that the program author never thought of (see simple
> event correlator for an example of this), but in rsyslog speed is a
> significant issue (processing multiple hundreds of thousands of logs per
> second doesn’t leave much time) and I don’t think that an interpreter is
> up to the task. Interpreted languages also usually don’t support
> multi-threaded operation well.
And I replied:
This is a *very* important point. And it is the single reason why I re-thought RainerScript and tend not to use it. While (in design) it can do anything I ever need, the interpretation is too slow, at least as far as the current implementation is concerned. I have read up on Lua, and there seem to be large similarities between how Lua works and how RainerScript actually (in filters!) works. Let me assume that Lua is far more optimized than RainerScript. Even then, it is a generic engine, and running that engine to actually process syslog data is simply too slow to sustain the high data rates we aim for. Using my test lab as an example, we are currently at ~250,000 mps. The goal for my next performance tuning step will be to double that value (I don’t know yet when I will start with that work). Overall, the design shall be that rsyslog scales almost linearly with the number of CPUs (and network cards) added. I have made a couple of design errors in that regard in the past, but now I am through with that, have done a lot of research and think that I can achieve this nearly-linear speedup in the future. That means there will no longer be an actual upper limit on the number of messages per second rsyslog can process. Of course, even on a single processor, we need *excellent* performance.
For the single processor, this means we need highly optimized, up-to-the-task algorithms that don’t do many things generically.
For the multi-processor, this means we need to run as many of these tasks truly concurrently as possible.
For example, in the last performance tuning step, I radically changed the way rules are processed. Rather than thinking in terms of messages and steps to be done on them, I now have an implementation that works, semi-parallel, on the batch as a whole and (logically) computes sub-steps of message processing concurrently to each other (to truly elaborate on this would take a day or more to write, so I’ll spare the details).
I don’t think any general language can provide the functionality I need to do these sorts of things. This was also an important reason that led to RainerScript, a language where I could define the level of granularity myself. The idea is still not dead, but the implementation effort was done wrongly. But I have become skeptical whether a language is the right approach at all.
Also note the difference between the config and the runtime engine. Whatever library/script/format/language we use for the config will, for the reasons given above, NOT be used during execution. It can only be used as a meta-language to specify what the actual engine will do.
So if we go for Lua (for example), we could use Lua to build the rsyslog config objects. But during actual execution, we will definitely not use Lua. So we would need a way to express rsyslog control flow in Lua, which would probably stretch the spec too far. Note that a Lua “if then” would not be something that the engine uses, but rather would be used to build a config object. So we still have the issue of how to specify an “rsyslog engine if then” inside a Lua script. Except, of course, if you think that Lua can do the regular processing, which I ruled out with the arguments above.
And a bit later I added:
The order of execution of the tasks inside a given configuration can be viewed as a partially ordered set. Some of the tasks need to be preceded by others, but a (large) number of tasks have no relationship. To gain speed and scalability, the rsyslog engine tries to identify this partial order and tries to run those tasks in parallel that have no dependency on each other. Also, one must note that a config file is written with the assumption of a single message traversing the engine, which is a gross simplification. In practice, we now have batches (multiple messages at once) traversing the engine, where a lot of things are done concurrently and far differently from what one would expect when looking at the config file (but in a functionally equivalent way). It is this transformation of the in-sequence, single-message view into a partial-execution-order, parallel view that provides the necessary speedup to be able to serve demanding environments.
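As a tiny illustration of that transformation (my example, not from the list post): in the following config, the two actions have no ordering dependency on each other, so the engine is free to process a batch through both of them concurrently, even though the file suggests strict top-to-bottom execution:

mail.* /var/log/mail.log
kern.* /var/log/kern.log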
rsyslog performance improvement rather impressive
I (think I ;)) have finished this round of rsyslog performance tuning. The result is rather impressive:
On my virtualized 4-core development environment (not exactly a high-end environment), I went from approx. 70,000 messages per second (mps) to approx. 280,000 mps. Note that these numbers do not necessarily represent a real-world configuration, but I think they show the magnitude of the optimization. Also note that some complex configurations see far lower gains, because some things (like executing an action only n times within m seconds, or “message repeated n times” processing) require serialization of the message flow, and there is little we can gain in that case.
I plan to do an “official release” in the not-so-distant future. Next, I will first see which patches I have in my queue, and then I’ll focus on the config language enhancement. That’s much more complex than just the format; I’ll blog the details hopefully either later today or tomorrow morning.
Getting Top Performance out of rsyslog
Rsyslog is lightning fast. However, the configuration influences speed very much. This blog post tells what provides optimal performance with the most recent v5 version.
I will update this blog post whenever there is news to share (at least this is the plan). This information will also hopefully flow into the rsyslog doc at some time.
- do not use more than one ruleset within a single queue
- do not use rate limiting unless absolutely necessary
- use array-based queue modes
- do not use
- send data from different inputs to multiple queues
- use “$ActionWriteAllMarkMessages on” for all actions where you can afford it (it really makes a difference!)
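For illustration, here is a hedged config sketch applying two of the tips above; the directive names are standard v5 legacy syntax, but the concrete values are examples only:

# array-based mode for the main message queue
$MainMsgQueueType FixedArray
# write mark messages unconditionally (cheaper than suppression logic)
$ActionWriteAllMarkMessages on
*.* /var/log/all.log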
The following blog post also has some solid information on performance-influencing parameters: rsyslog evaluation. Note that it talks about a somewhat older rsyslog release. While that one already quotes 250,000 messages per second, rsyslog 5.5.6 is quite a bit faster.
what are actions and action instance data?
On the rsyslog mailing list, the question of what actions are and in which way they are kept single-threaded from the POV of the output module came up again. I will try to summarize the most important points and terms here.
David Lang gave the following example configuration:
*.* file1
*.* file2
*.* @ip1
*.* @ip2
*.* @@ip3
*.* @@ip4
and asked how many different actions/entities that were. Here is my answer:
An *action* is a specific instance of some desired output. The actual processing carried out is NOT termed “action”, even though one could easily do so. I have to admit I have not defined any term for that, so let’s call it “processing”. That actual processing is carried out by the output module (and the really bad thing is that the entry point is named “doAction”, which somewhat implies that the output module is the action, which is not the case).
Each action can use the service of exactly one output module. Each output module can provide services to many actions. So we have an N:1 relationship between actions and output modules.
In the above sample, 3 output modules are involved, where each output module is used by two actions. We have 6 actions, and so we have 6 action locks.
So the output module interface does not serialize access to the output module, but rather to the action instance. All action-specific data is kept in a separate, per-action data structure and passed into the output module at the time the doAction call is made. The output module can modify all of this instance data as if it were running on a single thread. HOWEVER, any global data item (in short: everything not inside the action instance data) is *not* synchronized by the rsyslog core. The output module must itself take care of synchronization if it desires concurrent access to such data items. All current output modules do NOT access global data, other than for config parsing (which is serial and single-threaded by nature).
Note that the consistency of the action instance data is guarded by the rsyslog core by actually running the output module processing on a single thread *for that action*. But the output module code itself may be called concurrently if more than one action uses the same output module. That is a typical case. If so, each of the concurrently running instances receives its private instance data pointer but shares everything else.
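To make that contract concrete, here is a hedged C sketch (invented names and signatures; rsyslog’s real output module interface is macro-based and differs): the core serializes calls per action, so instance data may be used as if single-threaded, while module-global data must be synchronized by the module itself:

#include <pthread.h>
#include <stdio.h>

typedef struct instanceData {
    FILE *fp;               /* e.g. the output file of THIS action */
    unsigned long nWritten; /* per-action counter: safe without locking */
} instanceData;

static unsigned long nTotal;  /* module-global: NOT guarded by the core */
static pthread_mutex_t totMut = PTHREAD_MUTEX_INITIALIZER;

/* the core calls this with the per-action lock for pData already held */
static void doAction(const char *line, instanceData *pData)
{
    fprintf(pData->fp, "%s\n", line); /* instance data: no locking needed */
    pData->nWritten++;

    pthread_mutex_lock(&totMut);      /* global data: module must lock */
    nTotal++;
    pthread_mutex_unlock(&totMut);
}

int main(void)
{
    /* two actions sharing the same output module code */
    instanceData file1 = { stdout, 0 };
    instanceData file2 = { stderr, 0 };

    doAction("to file1", &file1);
    doAction("to file2", &file2);
    printf("total lines written: %lu\n", nTotal);
    return 0;
}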
further improving tcp input performance
As one of the next things, I will be further improving rsyslog’s tcp syslog input performance. As you know, rsyslog already has excellent performance (some sources, for example, quote 250,000 msgs per second). But, of course, there is room for improvement.
One such area is imtcp, the tcp syslog input module. It uses a single polling loop to obtain data from all senders. It is worth noting that the actual input module does NOT do very much, but hands the majority of work off to queue worker threads. However, it pulls the data from operating system buffers into our user space and also fills in some basic properties (like time of reception, remote peer, and so on). Then, the message is pushed to the message queue, and on the other side of the queue the majority of processing happens (including things like parsing the message, which some would assume happens inside the receiving thread).
As can be seen in practice, this design scales pretty well in most cases. However, on a highly parallel system, it obviously limits the process of pulling data “off the wire” to a single CPU. If the rule set is then not very complex (and thus fast to process), the single-threadedness of the initial receiver becomes a bottleneck. On a couple of high-performance systems, we have seen this become the bottleneck, and I am now trying to address it.
Right now, I am looking for a good solution. There are two obvious ones:
a) start up a single thread for each connection
b) do a hybrid approach of what we currently do and a)
Even with 64-bit machines and NPTL, approach a) probably does not work well for a very large number of active sessions. Even worse, receiving messages from two different hosts would then require at least one context switch, and do so repeatedly. Context switches are quite expensive in terms of performance and so are better avoided. Note that the current approach needs no context switch at all (for the part it does). On a system with many connections, I would be close to betting that the runtime required by the context switching of approach a) alone is probably more than what we need to do the processing with our current approach. So that seems to be a dead end.
So it looks like b) is the route to take, combining a (rather limited) number of threads with a reception-event-driven loop. But how to best do that? A naive approach is to have one thread running the epoll() loop and a pool of worker threads that actually pull the data off the wire. So the epoll loop would essentially just dispense to-be-processed file descriptors to the workers. HOWEVER, that also implies one context switch during processing, namely when the epoll loop thread activates a worker. Note that this situation is by far not as bad as in a): as we have a limited number of workers, and they are activated by the epoll thread, and that thread blocks when no workers are available, we have limited the level of concurrency. Note that limiting the concurrency level roughly to the number of CPUs available makes a lot of sense from a performance point of view (but not necessarily from a program-simplicity and starvation-avoidance point of view; these concerns will be looked at, but for now I have a focused problem to solve).
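To make approach b) more tangible, here is a minimal, hedged C sketch of that dispatch pattern (Linux-only, error handling omitted; all names are invented for this post, this is not rsyslog’s actual imtcp code). One epoll thread accepts connections and hands readable descriptors to a small worker pool over a pipe; EPOLLONESHOT ensures no two workers ever read the same connection at once:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 4

static int epfd;     /* shared epoll instance */
static int qpipe[2]; /* pipe used as a tiny fd work queue */

static void *worker(void *arg)
{
    char buf[4096];
    int fd;

    (void)arg;
    for (;;) {
        /* blocking pop: reads of sizeof(int) from a pipe are atomic */
        if (read(qpipe[0], &fd, sizeof(fd)) != sizeof(fd))
            break;
        ssize_t n = read(fd, buf, sizeof(buf)); /* pull data off the wire */
        if (n <= 0) {
            close(fd); /* peer closed, or error */
            continue;
        }
        /* ... fill basic properties, submit to the main message queue ... */
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev); /* re-arm the connection */
    }
    return NULL;
}

int main(void)
{
    int lsd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_port = htons(10514), /* example port */
                              .sin_addr.s_addr = htonl(INADDR_ANY) };
    int one = 1;

    setsockopt(lsd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lsd, (struct sockaddr *)&sa, sizeof(sa));
    listen(lsd, 128);

    epfd = epoll_create1(0);
    pipe(qpipe);
    for (int i = 0; i < NWORKERS; ++i) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lsd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lsd, &ev);

    for (;;) {
        struct epoll_event evs[64];
        int n = epoll_wait(epfd, evs, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = evs[i].data.fd;
            if (fd == lsd) { /* new connection: add with ONESHOT armed */
                int cfd = accept(lsd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN | EPOLLONESHOT,
                                           .data.fd = cfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
            } else {
                /* hand off to a worker; blocks when the pipe is full,
                 * which naturally limits the level of concurrency */
                write(qpipe[1], &fd, sizeof(fd));
            }
        }
    }
}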
One approach to this problem could be to further reduce the amount of work done in imtcp: if it no longer pulled data off the wire, but just placed the file descriptor into a “message” object and submitted that to the overall queue, modified queue processing could then take care of the rest. However, there are many subtle issues, including how to handle system shutdown and restart as well as disk queues. In short: that probably requires a full redesign, or at least considerable change. Anything less than that would probably result in another processing stage in front of the rule engine, as outlined above (and thus require additional context switches).
So I refocused on the optimal way to partition this problem. One (simple) approach is to partition by tcp listener. It would be fairly easy to run multiple listeners concurrently, where each listener has its own (epoll/take data off the wire) loop running on the listener’s single thread. So in essence, it would be much like running two or more rsyslog instances, using the current code, concurrently. That approach obviously causes no additional context switches. But it has a major drawback: if the workload is spread unevenly between listeners, it may not provide sufficient parallelism to keep all CPU cores busy. However, if the workload is spread evenly enough, the approach can prevent starvation between listeners, but not between sessions of one listener. That problem is also not addressed by the current code, and there has never been any user complaint about it (or its potential effects). So one may conclude starvation is not an issue.
It looks like the usefulness of this approach strongly depends on the spread of workload between the different listeners. Looking at a busy system, we need to focus on the number of highly active listeners in relation to the number i of expectedly idle CPU cores. That number i obviously must take into consideration any other processing requirements, both from rsyslog (parsing, rule processing, …) as well as from all other processes the system is intended to run. So, in general, the number i is probably (much) lower than the total number of cores inside the system. If we now have a number l of listeners, we must look closely: if, among all listeners, l_h is the number of high-activity listeners, then it is sufficient to have i equal l_h: a few occasional wakeups from low-activity listeners do not really matter. However, if l_h is lower than i, or even just one, then we cannot fully utilize the system hardware. In that case, we would need to provide partitioning based on sessions, and there we see a similar scheme based on the view of low- and high-activity sessions.
But the real question is whether we can assume that most busy systems have a sufficient number of high-activity listeners, so that per-listener concurrency is sufficient to fully utilize the hardware. If that is the case, we can drastically improve potential message processing rates and still keep the code simple. Even more concretely, the question is whether we are sufficiently sure this approach works well enough to justify implementing it. Doing so could save considerable development effort, which could be put to better use (like speeding up queue processing). BUT that development effort is wasted if, for a large enough number of systems, we see no benefit. And note that single-listener systems are not uncommon, a case where we would gain NO benefit at all.
I am actually somewhat undecided and would appreciate feedback on that matter.
Thanks in advance to all who provide it.
Rainer
Update: there is a long and very insightful discussion about this post on the rsyslog mailing list. All interested parties are strongly advised to read through it; it will definitely enhance your understanding. Please also note that, based on that discussion, the development focus has shifted a bit.
rsyslog string generators … done :)
A rsyslog string generator is what I had previously called a “template module”: in essence, a facility to generate a template string with some custom native C code. I have decided to name it a bit differently because at some later stage there may be other uses for these types of modules as well. Specifically, I am thinking about adding custom name-value pairs to the message object, and then a string generator (or “strgen” for short) could be used to generate such a value as well.
Implementation went smoothly. I implemented both the interface as well as a small set of important core strgens, namely those that are frequently used to write files or forward messages to remote machines. I did not touch any others, as that is probably not really necessary, and it could easily be done at any time if the need arises.
The new interface also provides third parties with the capability to create their own high-speed parsers. The performance impact can be dramatic; just think about cases where multiple regular expression calls can be replaced by a single call and some C logic.
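As a hedged illustration of the idea (invented names; the real strgen module interface is macro-based and differs), a string generator is essentially straight C code that builds the output line directly, instead of having the template engine interpret a format description:

#include <stdio.h>

/* conceptual message object with pre-parsed properties */
typedef struct msg {
    const char *timestamp;
    const char *hostname;
    const char *tag;
    const char *rawmsg;
} msg_t;

/* generate a classic file-format line: "TIMESTAMP HOST TAGMSG\n" */
static size_t strgenFileFormat(const msg_t *m, char *buf, size_t len)
{
    return (size_t)snprintf(buf, len, "%s %s %s%s\n",
                            m->timestamp, m->hostname, m->tag, m->rawmsg);
}

int main(void)
{
    msg_t m = { "Jun  1 12:00:00", "host1", "app:", " an example message" };
    char line[2048];

    strgenFileFormat(&m, line, sizeof(line));
    fputs(line, stdout);
    return 0;
}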
Finally, these modules may even provide a way to fund rsyslog development. Adiscon can probably sell them for some small amount (I guess around $500, based on what needs to be done; in some cases maybe less, in some maybe a bit more). I guess that would be attractive for anyone who needs both high speed and a custom format and runs rsyslog for profit. Getting into all the details to develop such a thing oneself probably costs more than our whole implementation effort. I hope we will get some orders for these, and I hope that folks will contribute their strgens back to the project. Their plus is then that we maintain them for free, and the plus for the community is that, in the long term, we will hopefully get a large directory of ready-to-use custom strgens (OK, that sidesteps the funding process a bit, but… ;)).
I have also managed to write some basic doc on the new capability, to be seen here:
What is now missing is some feedback from the field, including from someone who actually uses this to create a custom format.
The code has been merged into v5-devel (the master branch) and will most probably be released early next week. Then, it will undergo the usual devel/beta cycle, so availability in a stable v5 release can be expected towards the end of summer 2010. Special thanks go to David Lang, who provided good advice that helped me create the new functionality.
rsyslog template plugins
As I wrote yesterday, I am evaluating the use of “template modules” in rsyslog.
In that post, I mentioned that I’d expect a 5% speedup as proof that the new plugin type was worth considering. As it turns out, this method seems to provide a speedup of 5 to 6 percent, so it seems to be useful in its own right.
After I had written yesterday’s post, I checked what it would take to create a test environment. It turned out that it was not too hard to change the engine so that I could hardcode one of the default templates AND provide a vehicle to activate that code via the configuration file. Of course, we do not yet have fully loadable modules, but I was able to create a proof of concept in a couple of hours and do some (mild) performance testing on it. The current code provides a vehicle to use a C-function-based template generator. It is activated by saying
$template tpl,=generator
where the equal sign indicates the use of a C generator instead of the usual template string. The name that follows the equal sign will probably later become the actual module name, but it is irrelevant right now. I then implemented a generator for the default file format in a very crude way, but I would expect that a real loadable module will not take considerably more processing time (just a very small amount of calling overhead after the initial config parsing stage). So with that experimental code, I could switch between the template-based default file format and the generator-based format, with the outcome being exactly the same.
Having that capability, I ran a couple of performance tests. I have to admit I did not go to a real test environment, but rather used my (virtualized) standard development machine. Also, I ran the load generator inside the same box. So there were a lot of factors that influenced the performance, and this surely was not a fully valid test. To make up for that, I ran several incarnations of the same test, with 1 to 10 million test messages. The results quite consistently reported a speedup between 5 and 6 percent achieved by the C template generator. Even though the test was crude, this consistently seen speedup is sufficient proof for me that native template generators actually have value. I have to admit that I had expected improvements in the 1 to 2 percent area, so 5 and more percent is considerable.
I committed the experimental branch to git, so everyone is free to review and test it for themselves.
Now that I am convinced this is a useful addition, my next step will be to add proper code for template plugins (and, along the way, decide if they will actually be called template plugins; I guess library plugins could be used as well, with somewhat less effort and greater flexibility). Then, I will convert the canned templates into such generators and include them statically inside rsyslog (just like omfile and a couple of other modules are statically included inside rsyslog). I hope that in practice we will also see this potential speedup.
Another benefit is that any third party can write new generator functions. Of course, there is some code duplication inside such functions. But that should not be a big issue, especially as generator functions are usually expected to be rather small (but of course need not be so). If someone intends to write a set of complex generator functions, these can be written with a common core module whose utility functions are accessed by each of the generators. But this is not of concern to me as of now.
Note that I will probably use very simple list data structures to keep track of the available generators. The reason is that after the initial config file parsing, access to these structures is no longer required and so there is no point in using a more advanced method.
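For illustration, such a registry might look like the following hedged sketch (invented names): a singly linked list that is walked only while the configuration is parsed, so the O(n) lookup cost is irrelevant:

#include <stdio.h>
#include <string.h>

typedef size_t (*strgen_func_t)(char *buf, size_t len);

struct strgenEntry {
    const char *name;      /* the name used after the equal sign */
    strgen_func_t func;    /* the generator implementation */
    struct strgenEntry *next;
};

static struct strgenEntry *strgenRoot = NULL;

/* called only during config file parsing */
static strgen_func_t strgenLookup(const char *name)
{
    for (struct strgenEntry *p = strgenRoot; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p->func;
    return NULL;
}

static size_t dummyGen(char *buf, size_t len)
{
    return (size_t)snprintf(buf, len, "example output\n");
}

int main(void)
{
    struct strgenEntry e = { "generator", dummyGen, NULL };
    strgenRoot = &e; /* registration, e.g. at module load time */

    strgen_func_t f = strgenLookup("generator");
    char buf[64];
    if (f != NULL) {
        f(buf, sizeof(buf));
        fputs(buf, stdout);
    }
    return 0;
}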
I expect my effort to take a couple of days at most, but beware that Thursday is a public holiday over here in Germany and I may not work on the project on Thursday and Friday (depending, I have to admit, a little bit on the weather ;)).