rsyslog performance improvement rather impressive

I (think I ;)) have finished this round of rsyslog performance tuning. The result is rather impressive:

On my virtualized 4-core development environment (not exactly a high-end environment), I went from approx. 70,000 messages per second (mps) to approx. 280,000 mps. Note that these numbers do not necessarily represent a practical configuration, but I think they show the magnitude of the optimization. Also note that some complex configurations see a far lower gain, because some things (like executing an action only n times within m seconds, or “message repeated n times” processing) require serialization of the message flow, and there is little we can gain in that case.

I plan to do an “official release” in the not so distant future. Next, I will first see which patches I have in my queue and then I’ll focus on the config language enhancement. That’s much more complex than just the format — I’ll blog the details hopefully either later today or tomorrow morning.

Getting Top Performance out of rsyslog

Rsyslog is lightning fast. However, the configuration greatly influences speed. This blog post describes what offers optimal performance for the most recent v5 version.

I will update this blog post whenever there is news to share (at least this is the plan). This information will also hopefully flow into the rsyslog doc at some time.

  • do not use more than one ruleset within a single queue
  • do not use rate limiting unless absolutely necessary
  • use array-based queue modes
  • do not use
  • send data from different inputs to multiple queues
  • use “$ActionWriteAllMarkMessages on” for all actions where you can afford it (it really makes a difference!)
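
For illustration, here is a minimal legacy-format configuration sketch that applies several of the tips above (the array-based FixedArray queue mode, a dedicated per-input queue via a ruleset, and $ActionWriteAllMarkMessages). Directive names are those used by recent v5 releases; the paths and port are just examples, and exact directive availability depends on your version:

$ModLoad imudp
$ModLoad imtcp

# array-based ("FixedArray") main message queue
$MainMsgQueueType FixedArray

# dedicated ruleset (with its own queue) for the tcp input,
# so different inputs feed different queues
$Ruleset tcpin
$RulesetCreateMainQueue on
*.* /var/log/tcp-messages

$InputTCPServerBindRuleset tcpin
$InputTCPServerRun 10514

# the udp input stays on the default ruleset / main message queue
$Ruleset RSYSLOG_DefaultRuleset
$InputUDPServerRun 514

$ActionWriteAllMarkMessages on
*.* /var/log/messages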

The following blog post also has some solid information on performance-influencing parameters: rsyslog evaluation. Note that it talks about a somewhat older rsyslog release; while it already quotes 250,000 messages per second, rsyslog 5.5.6 is quite a bit faster.

what are actions and action instance data?

On the rsyslog mailing list, the question of what actions are and in which way they are kept single-threaded from the POV of the output module came up again. I will try to summarize the most important points and terms here.

David Lang gave the following example configuration:

*.* file1
*.* file2
*.* @ip1
*.* @ip2
*.* @@ip3
*.* @@ip4

and asked how many different actions/entities those were. Here is my answer:

An *action* is a specific instance of some desired output. The actual processing carried out is NOT termed “action”, even though one could easily do so. I have to admit I have not defined any term for that, so let’s call it “processing”. That actual processing is carried out by the output module (and the really bad thing is that the entry point is named “doAction”, which somewhat implies that the output module is called the action, which is not the case).

Each action can use the service of exactly one output module. Each output module can provide services to many actions. So we have an N:1 relationship between actions and output modules.

In the above sample, 3 output modules are involved, where each output module is used by two actions. We have 6 actions, and so we have 6 action locks.

So the output module interface does not serialize access to the output module, but rather to the action instance. All action-specific data is kept in a separate, per-action data structure and passed into the output module at the time the doAction call is made. The output module can modify all of this instance data as if it were running on a single thread. HOWEVER, any global data items (in short: everything not inside the action instance data) are *not* synchronized by the rsyslog core. The output module must itself take care of synchronization if it desires concurrent access to such data items. All current output modules do NOT access global data other than for config parsing (which is serial and single-threaded by nature).

Note that the consistency of the action instance data is guarded by the rsyslog core by actually running the output module processing on a single thread *for that action*. But the output module code itself may be called concurrently if more than one action uses the same output module. That is a typical case. If so, each of the concurrently running instances receives its private instance data pointer but shares everything else.
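
To make this contract concrete, here is a minimal C sketch. It is not the actual rsyslog output module interface; the names and the doAction signature are simplified stand-ins that only illustrate the locking rules described above:

#include <pthread.h>
#include <stdio.h>

/* per-action instance data: guarded by the action lock, so doAction()
 * may modify it as if it were running single-threaded */
typedef struct instanceData {
    FILE *fh;               /* e.g. the file this particular action writes to */
    unsigned long nWritten; /* per-action counter, no extra locking needed */
} instanceData;

/* module-global data: shared by ALL actions that use this module,
 * so the module itself must synchronize access */
static unsigned long nTotalWritten = 0;
static pthread_mutex_t totalMut = PTHREAD_MUTEX_INITIALIZER;

/* called by the core with the lock of *this* action held */
static int doAction(instanceData *pData, const char *msg)
{
    fputs(msg, pData->fh);
    pData->nWritten++;               /* safe: per-action instance data */

    pthread_mutex_lock(&totalMut);   /* NOT safe without the module's own lock */
    nTotalWritten++;
    pthread_mutex_unlock(&totalMut);
    return 0;
}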

further improving tcp input performance

As one of the next things, I will be further improving rsyslog’s tcp syslog input performance. As you know, rsyslog already has excellent performance (some sources, for example, quote 250,000 msgs per second). But, of course, there is room for improvement.

One such area is imtcp, the tcp syslog input module. It uses a single polling loop to obtain data from all senders. It is worth noting that the actual input module does NOT do very much, but hands the majority of work off to queue worker threads. However, it pulls the data from operating system buffers into our user space and also fills in some basic properties (like time of reception, remote peer and so on). Then, the message is pushed to the message queue, and on the other side of the queue the majority of processing happens (including things like parsing the message, which some would assume to happen inside the receiving thread).

As can be seen in practice, this design scales pretty well in most cases. However, on a highly parallel system, it obviously limits the process of pulling data “off the wire” to a single CPU. If the rule set is then not very complex (and thus fast to process), the single-threadedness of the initial receiver becomes a bottleneck. On a couple of high-performance systems we have seen this become the bottleneck, and I am now trying to address it.

Right now, I am looking for a good solution. There are two obvious ones:

a) start up a single thread for each connection
b) do a hybrid approach of what we currently do and a)

Even with 64-bit machines and NPTL, approach a) probably does not work well for a very large number of active sessions. Even worse, receiving messages from two different hosts would then require at least one context switch, and do so repeatedly. Context switches are quite expensive in terms of performance, and so are better avoided. Note that the current approach needs no context switch at all (for the part it does). On a system with many connections, I would be close to betting that the runtime required by the context switching of approach a) alone is probably more than what we need to do the processing with our current approach. So that seems to be a dead end.

So it looks like b) is a route to take, combining a (rather limited) number of threads with a reception-event driven loop. But how to best do that? A naive approach is to have one thread running the epoll() loop and a pool of worker threads that actually pull the data off the wire. So the epoll loop would essentially just dispense to-be-processed file descriptors to the workers. HOWEVER, that also implies one context switch during processing, namely when the epoll loop thread activates a worker. Note that this situation is by far not as bad as in a): as we have a limited number of workers, and they are activated by the epoll thread, and that thread blocks when no workers are available, we have limited the level of concurrency. Note that limiting the concurrency level roughly to the number of CPUs available makes a lot of sense from a performance point of view (but not necessarily from a program simplicity and starvation-avoidance point of view – these concerns will be looked at, but for now I have a focused problem to solve).
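
To make the naive variant of b) a bit more tangible, here is a compact, self-contained C sketch of that pattern: one epoll thread dispenses ready descriptors to a small worker pool and blocks while no worker is free. It is illustrative only (not rsyslog code), omits all error handling, and the port number and pool size are arbitrary:

/* one epoll thread dispatches ready fds; a fixed worker pool pulls the data */
#include <pthread.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define NWORKERS 4

static int ep;                          /* epoll instance, shared with workers */
static int ready_fd = -1;               /* one-slot hand-over "queue" */
static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cnd_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t cnd_empty = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    char buf[4096];
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mut);
        while (ready_fd == -1)
            pthread_cond_wait(&cnd_nonempty, &mut);
        int fd = ready_fd;
        ready_fd = -1;
        pthread_cond_signal(&cnd_empty);          /* epoll thread may dispatch again */
        pthread_mutex_unlock(&mut);

        ssize_t n = recv(fd, buf, sizeof(buf), 0); /* pull data off the wire */
        if (n <= 0) {                              /* peer closed or error */
            epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
            close(fd);
        } else {
            /* ...here the data would be parsed into a message object and
             * submitted to the main queue; then re-arm the descriptor */
            struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                      .data.fd = fd };
            epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
        }
    }
    return NULL;
}

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_port = htons(10514),
                              .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lsock, (struct sockaddr *)&sa, sizeof(sa));
    listen(lsock, 128);

    ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lsock };
    epoll_ctl(ep, EPOLL_CTL_ADD, lsock, &ev);

    pthread_t thr[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&thr[i], NULL, worker, NULL);

    for (;;) {
        struct epoll_event evs[16];
        int n = epoll_wait(ep, evs, 16, -1);
        for (int i = 0; i < n; i++) {
            if (evs[i].data.fd == lsock) {    /* new connection: ONESHOT ensures
                                               * only one worker handles it at a time */
                int c = accept(lsock, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN | EPOLLONESHOT,
                                           .data.fd = c };
                epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
            } else {                          /* data ready: hand fd to a worker,
                                               * blocking while none is free */
                pthread_mutex_lock(&mut);
                while (ready_fd != -1)
                    pthread_cond_wait(&cnd_empty, &mut);
                ready_fd = evs[i].data.fd;
                pthread_cond_signal(&cnd_nonempty);
                pthread_mutex_unlock(&mut);
            }
        }
    }
}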

One approach to this problem could be that I further reduce the amount of work done in imtcp: if it no longer pulls data off the wire, but just places the file descriptor into a “message” object and submits that to the overall queue, modified queue processing could then take care of the rest. However, there are many subtle issues, including how to handle system shutdown and restart as well as disk queues. In short: that probably requires a full redesign, or at least considerable change. Anything less than that would probably result in another processing stage in front of the rule engine, as outlined initially (and thus require additional context switches).

So I focused back on the optimal way to partition this problem. One (simple) approach is to partition the problem by tcp listener. It would be fairly easy to run multiple listeners concurrently, where each listener has its own (epoll/take data off the wire) loop that runs on the listener’s single thread. So in essence, it would be much like running two or more rsyslog instances, using the current code, concurrently. That approach obviously causes no additional context switches. But it has a major drawback: if the workload is spread unevenly between listeners, it may not provide sufficient parallelism to keep all CPU cores busy. However, if the workload is spread evenly enough, the approach can prevent starvation between listeners – but not between sessions of one listener. That problem is also not addressed by the current code, and there has never been any user complaint about it (or its potential effects). So one may conclude starvation is not an issue.

It looks like the usefulness of this approach strongly depends on the spread of workload between different listeners. Looking at a busy system, we need to focus on the number of highly active listeners in relation to the number of expectedly idle CPU cores i. That number i obviously must take into consideration any other processing requirements, both from rsyslog (parsing, rule processing, …) as well as all other processes the system is intended to run. So, in general, the number i is probably (much) lower than the total number of cores inside the system. If we now have a number l of listeners, we must look closely: if, among all listeners, l_h is the number of high-activity listeners, then it is sufficient to have i equal l_h: a few occasional wakeups from low-activity listeners do not really matter. However, if l_h is lower than i, or even just one, then we cannot fully utilize the system hardware. In that case, we would need to provide partitioning based on sessions, and there we see a similar scheme based on the view of low- and high-activity sessions.

But the real question is whether we can assume that most busy systems have a sufficient number of high-activity listeners, so that per-listener concurrency is sufficient to fully utilize the hardware. If that is the case, we can drastically improve potential message processing rates and still keep the code simple. Even more concretely, the question is whether we are sufficiently sure this approach works well enough that we should implement it. Doing so could save considerable development effort, which could be put to better use (like speeding up queue processing). BUT that development effort is wasted if, for a large enough number of systems, we cannot see a benefit. And note that single-listener systems are not uncommon, a case where we would gain NO benefit at all.

I am actually somewhat undecided and would appreciate feedback on that matter.

Thanks in advance to all who provide it.
Rainer

Update: there is a long and very insightful discussion about this post on the rsyslog mailing list. All interested parties are strongly advised to read through it; it will definitely enhance your understanding. Please also note that, based on that discussion, the development focus shifted a bit.

rsyslog string generators … done :)

An rsyslog string generator is what I had previously called a “template module” – in essence a facility to generate a template string with some custom native C code. I have decided to name it a bit differently, because at some later stage there may be other uses for these types of modules as well. Specifically, I am thinking about adding custom name-value pairs to the message object, and then a string generator (or strgen for short) could be used to generate such a value as well.

Implementation went smoothly. I implemented both the interface as well as a small set of important core strgens, those that are frequently used to write files or forward messages to remote machines. I did not touch any others, as that is probably not really necessary — and could easily be done any time the need arises.
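
As a rough illustration of the idea (this is not the actual strgen interface, just a hypothetical stand-in with simplified types), a string generator builds the output string in a single pass of C code instead of interpreting a template string:

#include <stdio.h>

struct logmsg {                 /* stand-in for the rsyslog message object */
    const char *timestamp;
    const char *hostname;
    const char *tag;
    const char *msg;
};

/* roughly what a template like "%timestamp% %hostname% %syslogtag%%msg%\n"
 * would produce, but built in one pass of native C code */
static int my_file_format(const struct logmsg *m, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "%s %s %s%s\n",
                     m->timestamp, m->hostname, m->tag, m->msg);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;   /* -1 on truncation */
}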

The new interface also provides a capability to third parties that enables them to create their own high-speed parsers. The performance impact can be dramatic; just think about cases where multiple regular expression calls can be replaced by a single call and some C logic.

Finally, these modules may even provide a way to fund rsyslog development. Adiscon can probably sell them for some small amount (I guess around $500 based on what needs to be done, in some cases maybe less, in some maybe a bit more). I guess that would be attractive for anyone who needs both high speed and a custom format and runs rsyslog for profit. Getting into all the details to develop such a thing oneself probably costs more than our whole implementation effort. I hope we will get some orders for these, and I hope that folks will contribute the strgens back to the project. Their plus is then that we maintain them for free, and the plus for the community is that, in the long term, we will hopefully get a large directory of ready-to-use custom strgens (OK, that sidesteps the funding process a bit, but… ;)).

I have also managed to write some basic doc on the new capability, to be seen here:

What now is missing is some feedback from the field, including from someone who actually uses this to create a custom format.

The code has been merged into v5-devel (the master branch) and will most probably be released early next week. Then, it will undergo the usual devel/beta cycle, so that availability in a stable v5 release can be expected towards the end of summer 2010. Special thanks go to David Lang, who provided good advice that helped me create the new functionality.

rsyslog template plugins

As I have written yesterday, I am evaluating the use of “template modules” in rsyslog.

In that post, I mentioned that I’d expect a 5% speedup as proof that the new plugin type was worth considering. As it turns out, this method seems to provide a speedup of 5 to 6 percent, so it seems to be useful in its own right.

After I had written yesterday’s post, I checked what it would take to create a test environment. It turned out that it was not too hard to change the engine so that I could hardcode one of the default templates AND provide a vehicle to activate that code via the configuration file. Of course, we do not yet have full loadable modules, but I was able to create a proof of concept in a couple of hours and do some (mild) performance testing on it. The current code provides a vehicle to use a C-function based template generator. It is activated by saying

$template tpl,=generator

where the equal sign indicates that a C generator is to be used instead of the usual template string. The name that follows the equal sign will probably later become the actual module name, but is irrelevant right now. I then implemented a generator for the default file format in a very crude way, but I would expect that a real loadable module will not take considerably more processing time (just a very small amount of calling overhead after the initial config parsing stage). So with that experimental code, I could switch between the template-based default file format and the generator-based format, with the outcome being exactly the same.
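
For completeness, binding such a generator-based template to an action works the same way as with any other template, e.g. (using the experimental syntax shown above; the file path is just an example):

$template outfmt,=generator
*.* /var/log/messages;outfmt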

Having that capability, I ran a couple of performance tests. I have to admit I did not go to a real test environment, but rather used my (virtualized) standard development machine. Also, I ran the load generator inside the same box. So there were a lot of factors that influenced the performance, and this for sure was no totally valid test. To make up for that, I ran several incarnations of the same test, with 1 to 10 million test messages. The results quite consistently reported a speedup between 5 and 6 percent achieved by the C template generator. Even though the test was crude, this consistently seen speedup is sufficient proof for me that native template generators actually have value in them. I have to admit that I had expected improvements in the 1 to 2 percent area, so 5 percent and more is considerable.

I committed the experimental branch to git, so everyone is free to review and test it themselves.

Now that I am convinced this is a useful addition, my next step will be to add proper code for template plugins (and, along the way, decide if they will actually be called template plugins — I guess library plugins could be used as well, with somewhat less effort and greater flexibility). Then, I will convert the canned templates into such generators and include them statically inside rsyslog (just like omfile and a couple of other modules are statically included inside rsyslog). I hope that in practice we will also see this potential speedup.

Another benefit is that any third party can write new generator functions. Of course, there is some code duplication inside such functions. But that should not be a big issue, especially as generator functions are usually expected to be rather small (but of course need not be so). If someone intends to write a set of complex generator functions, these can be written with a common core module whose utility functions are accessed by each of the generators. But this is not one of my concerns as of now.

Note that I will probably use very simple list data structures to keep track of the available generators. The reason is that after the initial config file parsing, access to these structures is no longer required and so there is no point in using a more advanced method.
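
For illustration, such a registry could look like the following minimal C sketch (names and types are hypothetical, not the actual rsyslog code); a plain linked list is perfectly adequate because lookups only happen while the configuration file is parsed:

#include <stdlib.h>
#include <string.h>

typedef int (*strgen_fn)(const void *msg, char **ppBuf);

struct strgen_entry {
    const char *name;
    strgen_fn fn;
    struct strgen_entry *next;
};

static struct strgen_entry *strgen_list = NULL;

/* called when a generator module registers itself (no error handling here) */
static void strgen_register(const char *name, strgen_fn fn)
{
    struct strgen_entry *e = malloc(sizeof(*e));
    e->name = name;
    e->fn = fn;
    e->next = strgen_list;   /* push to front; order does not matter */
    strgen_list = e;
}

/* called from config parsing when e.g. "$template x,=name" is seen;
 * a linear search is fine because it only runs at startup */
static strgen_fn strgen_lookup(const char *name)
{
    for (struct strgen_entry *e = strgen_list; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->fn;
    return NULL;
}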

I expect my effort to take a couple of days at most, but beware that Thursday is a public holiday over here in Germany and I may not work on the project on Thursday and Friday (depending, I have to admit, a little bit on the weather ;)).

rsyslog speed & output formatting

I’d like to reproduce a mailing list post here, both because I would like to retain it for easy reference and because I consider it important enough to deserve better visibility. It originated from a question about output formatting options, but took us down to important roadmap questions for rsyslog.

For simplicity, I simply copy over the text of my relevant posting:

> On Mon, 31 May 2010, david@lang.hm wrote:
>
> > On Mon, 31 May 2010, Rainer Gerhards wrote:
> >
> >>> I agree that doing it in the output would be far better in many ways, but
> >>> since there isn’t a way to do a plugin there (at least not as far as I
> >>> know, it would be good to get confirmation or a better idea)
> >>
> >> David, can you tell me what you have on your mind for this functionality? I
> >> have thought a bit about it, and I probably have one approach myself. But I
> >> would prefer to hear your idea before I push you into a direction.
> >
> >
> > two options
> >
> > 1. something that would work similar to the existing format
> > string, but would call a C subroutine that could read the existing
> > properties and would create the output string in a buffer
> >
> > 2. something that could also modify the existing properties (more
> > powerful, but also more dangerous and could involve locking to prevent
> > other things from trying to read properties at the same time)
> >
> > we haven’t gone too far down the road of researching the output
> > performance (since the input and queue locking has dominated so far), but
> > it is clear that the output currently takes significantly more CPU time
> > than input, it may be that being able to use C to define the output format
> > instead of interpreting the format string may be a noticeable improvement.
> > Is there a relatively easy way to test this? (say, hard-code a format or
> > two and test writes to file and network with the hard-coded format vs a
> > format string that produces the same output?)
>
> for the traditional output formats the difference may not be that much,
> but if there is extensive parsing involved (as the initial poster is
> doing, or what I would expect is common for specific log types into a
> database) the difference can be much more significant since it can replace
> multiple regex statements with a much faster single pass that looks for
> word breaks and inserts standard filler in those spots.
>
> With the new syslog format where the data is ‘supposed to be’ in a
> series of name=value tuples, something like this would be a pretty
> efficient way of extracting particular portions of the data to be output
> (although the properties could be extended to do this sort of thing by
> providing something similar to a perl hash)

You are looking in the same direction I am, and I think this is good news ;)

The current engine supports functions coded in C, but not yet as real plugins
nor in an easy to see way. It is done via a crude function interface library
module, and only within the script engine. My original plan (over a year, or
even two, ago) was to generalize these library plugins, so that it is easy to
add new code and load them as plugins. Actually, making them available as
plugins should not be too much work given the already existing
infrastructure. There already exist a handful of “function modules”, the
control structure is just statically created during compile time, much as
some of the output plugins are statically linked.

Then the original plan was to enable templates to call scripts and enable
scripts to define templates (kind of). Unfortunately, I got distracted by
more important things before I could complete all of this.

HOWEVER, at this time performance was not a major concern. With what has
evolved in the mean time, I do not like the original approach that much any
longer. At least the script engine must become much faster before I can take
a real look at that capability. Right now, scripts generate interim code
that then is interpreted by a (kind of) virtual machine. A script invocation
inside a template would mean that a VM must be instantiated, the script
interpreted and the resulting string be used as template contents. Clearly,
this is not for high-performance use. Still, however, it may be useful to
have that capability for those cases, where performance is not the #1
consideration. But given that everything would need to be implemented, it
does make limited sense to look into something known to be too slow in the
long run. BTW, this is one reason that I have not yet continued to work on
the script engine, knowing that some larger redesign is due to fit it into
the now much tighter runtime constraints.

On the performance of the output system: I think the system in general is
quite fast and efficient, with only ONE important exception: that is, if
multiple replacements need to happen. Still, the algorithm is quite
efficient, but it is generic and needs to run through a number of steps. Of
course, it is definitely faster to permit a C plugin to look at the message
and then format, in an “atomic” way, the resulting custom string. Thus, you
need to write multiple pieces of C code instead of using a generic engine,
but can do so in a much higher performance way. I would assume, however,
that this approach cannot beat the simple templates we usually use by much
(maybe by less than 5% and, of course, there may be cases where this matters).

As you know, my current focus is speed, together with some functional
enhancements. I was looking at queue operations improvements, but the
potential output speed improvements may be more interesting than the queue
mode improvements (and apply to more use cases). So it may make sense to look
into these, first. My challenge here is to find something that is

a) generic enough to be useful in various (usual) cases
b) specific enough to be rather fast

and it should also be possible to implement it within a few weeks at most,
because I can probably not spend much more time on a single feature/refactoring.

One solution may be to create “template modules”. I could envision a template
module to be something that generates the template string *as a whole* from
the input message.

That is, we would have

$template current-style,”%msg%n”

but also (**)

$modload tplcustom
$template custom,tplcustom

where tplcustom generates the template string.

While this sounds promising, we have some issues. One immediately pops up in my
mind: we will probably be able to use the same template for file writing or
forwarding, but for file writing we need a LF at the end, while for
forwarding we do not need it.

So the most natural way would be to have the ability to embed a “custom
template” into a regular template, like suggested by this syntax:

$template both,”%=tplcustom%n”

however, this brings us down to the slippery slope of the original design. As
a next thing to be requested, I could ask for using not the msg object (with
its fixed unmodified properties), but rather a transformation of the
message object. So we would end up with something like this:

$template cmplx,”%=tplcustom(syslogtag & msg)%”

This would require much more complex logic working behind the scenes.

Of course, depending on the format used, the engine could select different
processing algorithms. Doing this on the fly seems possible, but requires
more work than I can commit in one sequence.

Also, it would be useful to have the ability to persist already-generated
properties with the message while it continues to be processed in the rule
engine. So far, we do not have this ability, and the reason is processing
time (plus, as usual, implementation effort): for that, we would need to
maintain a list (or hash, …) of name/value pairs, store them to disk for
disk queues and shuffle them through the rule engine as processing is carried
out. As I said, quite doable, but another big addition.

So I am somewhat stuck with things that sound interesting, but are a bit
interdependent. Doing them all together is too big to be useful, and it will
probably fail because I can probably not keep focus on all of them for the
next, say, 9 to 12 months that it would require to complete everything.

So I am again down to picking what is most useful. Out of this discussion, it
looks like the idea I marked with (**), the plain C template generator could
be a useful route to take. I am saying this under the assumption that it
would be relatively easy to implement and cause at least some speedup in
standard cases (contrary to what I expect, I have to admit…). But that
approach is highly specialized, requiring a C module for each custom format.
So does it really serve the rsyslog community well – or just some very
isolated use cases?

Thinking more about it, it would probably be useful if it is both

a) relatively easy to implement and
b) causes some speedup in standard cases

But b) cannot be proven without actually implementing the interface. So, in
practice, the question boils down to what we *expect* about the usefulness
of this utility.

Having said that, I’d appreciate feedback, both on the concrete question of
the usefulness of this feature as well as any and all comments on the
situation at large. I am trying to put my development resources, which
thankfully have been somewhat increased nowadays :) to the area where they
provide greatest benefit.

rsyslog now available on Solaris

Rsyslog has become the de-facto standard on modern Linux operating systems. Its high-performance log processing, database integration, modularity and support for multiple logging protocols make it the sysadmin’s logging daemon of choice. The project was started in 2004 and has since then evolved rapidly.

Starting today, rsyslog is not only available on Linux and BSD, but also on Sun Solaris. Both Intel and Sparc machines are fully supported under Solaris. Depending on operator need, rsyslog can replace the stock Solaris syslogd or be used in conjunction with it. The latter case provides enhanced rsyslog functionality without the need to change the system infrastructure.

Solaris is now a tier-one target platform. That means that all testing for major releases will be carried out on Solaris as well as on other platforms. The Solaris port was done very carefully, taking into account Sun’s somewhat specific syslogd handling via door files and preserving the full power of rsyslog. So rsyslog not only compiles and runs on Solaris, it is also a good citizen in the Solaris environment.

As per usual rsyslog project policy, the project does not make installation packages other than the source distribution available. However, we work closely together with the Solaris community to be able to provide them. We expect additional announcements soon.

The versions with initial solid Solaris support are 4.7.2 and 5.5.4. Rsyslog’s Solaris port was made possible by a generous contribution of hardware and some development funding by a sponsor who preferred to remain anonymous. We from the rsyslog project would like to express our sincere appreciation. Contributions of any kind are always very welcome.

Comment moderation turned on

Unfortunately, I had several waves of spam comments on my blog. Turning Word Verification on did not help, nor did requiring a Google account. So I was unfortunately forced to turn on comment moderation. In short, that means your comments no longer appear instantaneously, but only after I have been able to review them. I know this is a bad thing, but I think it is better than turning off comments altogether.

I have also temporarily hidden all comments. This is so that I have some time to clean up the mess. Once this is done, I will re-enable all valid comments.

Thanks,
Rainer

v4-devel is back again

I intended to focus on new development for rsyslog v5 exclusively, but as it turns out, I am announcing a new v4-devel build today: version 4.7.0.

So what is happening? Do I no longer believe in v5? Not at all! But we have thankfully received a grant to make rsyslog available on Solaris, and I wanted to have this killer feature for v4 as well. It turned out to be impractical and somewhat confusing (for end users) to do that in v4-stable. After all, some code areas need to be touched, and the Solaris-specific code is obviously brand new and far from being stable.

So I concluded that the best approach is to revive v4-devel, but in a controlled way. I will not put any new request into that version. Most new feature requests will go into v5 only. However, there may be some exceptions. Already, with 4.7.0, I have included some things that I consider either very useful or easy to integrate. Note that much of this easiness is related to the fact that in autumn 2009 the original plan was to release 4.7.0, and thus some new features were already integrated. I have not removed them, and now v4 users can gain the benefits. Note, however, that I did not maintain these changes, so there may be bugs inside them that I have already fixed in v5. Quite honestly, I did not do any extra testing to rule that out. However, the testbench runs without any problems on 4.7.0.

Note that the new v4-devel branch will bring some select enhancements in the future. One thing I have on my mind is a small, but potentially troublesome, enhancement to the background log writer that can potentially boost performance quite a bit.

However, the majority of new work will go into v5, exclusively. But you will not see that much happen there in the next couple of days/weeks as my next milestone is strongly focused on getting full stable Solaris support into both v4 and v5 (note that v5-devel, the master branch, also already supports Solaris).