next round of performance enhancement in rsyslog

Today, I made a very important change to rsyslog: rulesets now can have their own “main” queue. This doesn’t sound too exciting, but can offer dramatic performance improvements.

When rsyslog was initially created, it followed the idea that messages must be processed in the order they were received. To facilitate that, all inputs submitted messages to a single main message queue, from which the processing took place. So messages stayed in reception order. … Well, actually they stayed in “enqueued order”, because it depended on the OS scheduler whether input modules could really enqueue messages in the order they received them. If, for example, input A received two messages but was preempted by module B’s message reception, B’s data could hit the queue earlier than A’s. As rsyslog supported more and more concurrency, the order of messages became ever less important. The real cure for ordered delivery is to look at high-precision timestamps and build the sort order based on them (in the external log analyzer/viewer).

So, in essence, reception order never has worked well, and the requirement to try to keep it has long been dropped. That also removed one important reason for the single main message queue. Still, it is convenient to have a single queue, as its parameters can be set once and for all.

But a single queue limits concurrency. In the parallel processing world, we try to partition the input data as much as possible so that the processing elements can independently work on the data partitions. All data received by a single input is a natural data partition. But the single main queue merged all these partitions again, and caused performance bottlenecks via lock contention. “Lock contention”, in simple words, means that threads needed to wait for exclusive access to the queue.
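As a purely illustrative sketch of where that contention happens (this is not rsyslog code, just a single-mutex queue next to per-input queues; a real demonstration would of course spawn competing threads):

    #include <pthread.h>
    #include <stdio.h>

    /* one mutex protects the queue, so every enqueue serializes on it */
    typedef struct {
        pthread_mutex_t mut;
        int count;              /* stand-in for the real message storage */
    } queue_t;

    static void enqueue(queue_t *q)
    {
        pthread_mutex_lock(&q->mut);   /* the contention point */
        q->count++;                    /* ... add the message ... */
        pthread_mutex_unlock(&q->mut);
    }

    int main(void)
    {
        queue_t shared    = { PTHREAD_MUTEX_INITIALIZER, 0 };  /* all inputs use this */
        queue_t perInputA = { PTHREAD_MUTEX_INITIALIZER, 0 };  /* input A only */
        queue_t perInputB = { PTHREAD_MUTEX_INITIALIZER, 0 };  /* input B only */

        /* with the single queue, inputs A and B both lock shared.mut and
         * wait for each other; with one queue per input they lock
         * different mutexes and proceed in parallel */
        enqueue(&shared);
        enqueue(&perInputA);
        enqueue(&perInputB);

        printf("%d %d %d\n", shared.count, perInputA.count, perInputB.count);
        return 0;
    }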

This has now been solved: rsyslog now supports ruleset-specific queues. In rsyslog, the user decides which inputs are bound to which ruleset. For a highly parallel setup, each input should have its own ruleset, and each ruleset should define its own “main” queue. In that setting, inputs no longer block each other during queue access. On a busy system with many inputs, the results can be dramatic. And, more as a side-effect, each ruleset is now processed by its own dedicated rule processing thread, totally independent of the others.
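To make this concrete, here is a minimal sketch of such a setup in the current (legacy) config language, assuming the directive names $Ruleset, $RulesetCreateMainQueue and $InputTCPServerBindRuleset (check your version’s documentation; the ruleset names, ports and file paths are just placeholders):

    $ModLoad imtcp

    # ruleset for the first TCP listener, with its own "main" queue
    $Ruleset tcp10514
    $RulesetCreateMainQueue on
    *.* /var/log/tcp10514.log

    # a second ruleset, again with a dedicated queue
    $Ruleset tcp10515
    $RulesetCreateMainQueue on
    *.* /var/log/tcp10515.log

    # bind each listener to its ruleset, then start it
    $InputTCPServerBindRuleset tcp10514
    $InputTCPServerRun 10514
    $InputTCPServerBindRuleset tcp10515
    $InputTCPServerRun 10515

With a setup along these lines, each listener enqueues into its own queue and is processed by its own rule processing thread.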

This design offers a lot of flexibility. But that is not enough. The next step I plan is to create the ability to submit a message to a different ruleset during processing. That way, hierarchies of rulesets can be created, and these rulesets can even be executed via separate thread pools, with different queue parameters and in full concurrency. And the best part is that I currently think it will not be very hard to create the missing glue.

The only really bad thing is that the current configuration language is really not well-suited to handle that complexity (“really not” is not a typo for “not really”…). But I have no alternative but to take this route again, until I finally find time to create a new config language. The only good thing is that I am getting a better and better understanding of what this new language must be able to do, and it looks like my initial thoughts were not up to what is now required…

Canonical Paper on RSyslog

I just found out that Canonical (the company behind Ubuntu) did a nice paper on rsyslog, which also explains why Ubuntu chose rsyslog as its default syslogd.

It is interesting to see that the paper is well-written and well-researched, but rsyslog has also evolved while the paper was being written. So in fact, rsyslog now offers even more features than described in the paper.

And, obviously, I am glad to see Ubuntu move to rsyslog as well.

Will Microsoft remove the Windows Software RAID?

These days, hardware RAIDs are quite inexpensive. So everybody is moving towards them. However, all mainstream operating systems still support software RAIDs, maybe even for a good reason: an OS-controlled software RAID may be a bit easier to optimize under some circumstances. Anyhow, Microsoft seems to be moving away from that feature set:

As you probably know, Adiscon provides premier Windows event log processing solutions. Some of our customers use the products, for example, to monitor whether their RAIDs break. And some of them use software RAIDs. So we wrote a nice article on how to monitor RAID health using the Windows Event Log.

Since the days of Windows NT 3.1 (or was it 3.5?), Windows has logged an error message if a RAID failed. Actually, I’d consider this a necessary functionality for any working RAID solution. Why? Well, if the RAID solution works, you will not notice that a disk has died. So if nobody tells you, you’ll continue to use the system as usual, not suspecting anything bad. So guess what – at some time the next disk fails, and then (assuming the usual setup) you’ll be “notified” by the disk system, with those nice unrecoverable I/O errors. So without any health alerts, a RAID system is virtually useless.

We learned that Windows Server 2008’s RAID system no longer issues these alerts! (aka “is useless” ;)). So a while ago, we reported this to Microsoft. The bug went through several stages of escalation. A few minutes ago, my co-worker got a call from the frontline Microsoft tech, who told him that, regrettably, Microsoft won’t fix this issue. According to his words, Microsoft has confirmed this to be a bug, and the group responsible for ftdisk has confirmed that it should be fixed, but someone more powerful up in the hierarchy has opted not to do that. Boom. The tech tried to persuade us to switch to a hardware RAID, but actually that was not the point of the support call ;)

What does that mean? To me, it looks like Microsoft is actually moving away from providing software RAID. How else can one explain that there is no interest in providing any error message at all when something goes wrong with the RAID? Given the wide availability of hardware RAIDs (which, btw, provide proper alerting), this step does not look illogical. But do they really want to leave Linux as the only widely deployed mainstream operating system that provides software RAID? Or do they intend to keep it on the feature sheet, but provide a dysfunctional solution like in Windows Server 2008?

Let’s stay tuned and see what the future brings…

Another note on hard-to-find bugs…

Before I began to write this blog post, I realized how long it had been since I last wrote anything! I promise to write more regularly again, but the past weeks were merely a consolidation phase, ironing out bugs from the new releases.

I’d like to elaborate on one of these, one that really drove me crazy the past days. The problem was that omfile’s directory creation mode was sometimes set to zero (well, almost always in some configurations). What began as a minor nit turned into a real nightmare ;)

The issue was that the variable fDirCreateMode was always set to zero, except when it was written to at the start of module initialization, or when it was simply displayed there. That sounded strange, but even stranger was that moving the variable definition around in the source code (and thus presumably changing its memory location) changed nothing. So I came to a point where I used this code as a patch:

omfile.c from rsyslog git

Look at line 769. With that seemingly unrelated write, the variable stayed as expected. However, if I moved the write into a function, nothing worked again. Strange… After committing the patch, testing showed that the directory permissions now worked well, BUT the file create mode now misbehaved in exactly the same way.

I was stunned – and speechless. What followed were intense debugging sessions. I ended up finding the commit that introduced the problem, but still could not see why that commit would affect anything. After hours of debugging, I ended up with a stripped-down and almost codeless omfile, which still had the same problem. And it appeared and disappeared almost at random as code lines were moved in and out.

I once again checked the git history, and then I noticed that a few commits down the line, I had introduced another config variable for the I/O buffer size. Now I finally got the idea. The size-type config directives were introduced for file size restrictions, so a regular 32-bit integer is not large enough for them. Consequently, their handlers expect a pointer to a 64-bit integer! But, of course, I had provided only a pointer to a 32-bit int, so the config handler overwrote another 32 bits that happened to be adjacent to the address I provided.
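To illustrate the class of bug (a hypothetical, stripped-down sketch, not the actual rsyslog code; the handler setSize() and the variable name iIOBufSize are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>

    /* a "size"-type config handler: it stores a 64-bit value through
     * the pointer it is given, because file sizes may exceed 32 bits */
    static void setSize(void *pVal, int64_t value)
    {
        *(int64_t*) pVal = value;      /* writes 8 bytes */
    }

    int main(void)
    {
        int32_t iIOBufSize = 0;        /* registered with setSize(): the bug */
        int32_t fDirCreateMode = 0700; /* may end up adjacent in memory */

        /* only 4 bytes were reserved, but 8 are written: the extra 32 bits
         * land in whatever the compiler placed next to iIOBufSize; which
         * variable that is depends on optimization and layout, which is
         * exactly why the symptom moved around as the code changed */
        setSize(&iIOBufSize, 65536);

        printf("fDirCreateMode is now %o\n", (unsigned) fDirCreateMode);
        return 0;
    }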

This was clearly an error. But could it explain the behavior I saw? Not really… But the problem went away once I had corrected the issue. So I began to suspect that the compiler had re-ordered the variables’ memory layout in order to optimize access to them (maybe to get a better cache hit rate, or whatever else). But hadn’t I compiled with -O0, so that no optimization should take place? I checked, and realized that due to a glitch in my lab setup, optimization actually was on, not turned off! So now I think I can explain the behavior, and theory and practice go hand in hand.

Really? What about the write and the debug print that made everything work? I guess these changes influenced the compiler’s optimization, so the memory layout changed and the “extra” 32 bits pointed to some other variable. That also explains why the file creation mode was affected by my change, as well as why the bug reacted quite randomly to my test code changes.

So it looks like I finally resolved this issue.

Lessons learned? Re-check your lab environment, even if it has always worked well before. Be careful with assumptions about memory layout, as the optimizer seems to heavily reorder variables, and even single statements and statement sequences seem to make a big difference. I knew the compiler reorders things, but I did not have this clearly enough in mind to become skeptical about my lab setup.

And, as always, some assumptions limit your ability to really diagnose what is going on… May this be a reminder not only for me (I wonder how long it will last) but for others as well (thus I thought a blog post makes sense ;)).