What is rsyslog auto-backgrounding?

Rsyslog, by default, auto-backgrounds itself after startup. That simply means that the instance that is started by the user (or script) more or less does nothing but fork a new instance detached from the current terminal session and execute it. The originally started instance exits after a short timeout. This behavior was carried over from sysklogd.

Note that auto-backgrounding is problematic (aka “makes things more complicated than the need to be”) in debug sessions, lab environments and so on. So command line switch “-n” can be used to turn off auto-backgrounding. In that case, the first instance started will actually carried out the work to be done (as most would expect in the first place).

It is strongly recommended to use “-n” option for lab testing.

log normalization: how to share rulebases?

Rulebases play a crucial role in log normalization. While the log normalizer itself needs to be of high quality and speed, it is the rulebase that really helps to detect which message the one in question is. I myself have so far concentrated on the code and not created any larger rulebase. Champ Clarck III has created many more for his use inside Sagan. But this means everything is in its infancy. What we really need is community involvement to create a large number of easy to access rulebases for almost all devices.

This brings up the question of how to manage and share such a repository. One method may be to place it on a web site, together with some submission tool. An alternate approach would be to put everything into a public git. This latter approach has some beauty, because git is universally available and well know. Even if a user does not know git, only a minimal set of commands is required to pull the rulebase. So maybe this is the way to go?

I would be very interested in suggestions on how we shall manage rulebases and spread the word. What do we need to support a great community? Whom can we talk to? If you have any ideas, concerns, questions or even an idle rant, be sure to let me know. At best, send mail to the lognorm mailing list, so we can broadcast this to other folks interested.

why does the rsyslog testbench sometimes fail?

Rsyslog contains a set of automatted tests, the so-called “testbench”. It is invoked via the standard method of “make check” and “make distcheck”. Since its introduction in version 3, the testbench has been continously enhanced and extended. It now contains around 150 individual tests, which sum up to around 80 tests from the autoconf point of view (some autoconf tests run a couple of subtests, thus the difference in number). The testbench has been proven to be very useful and caught numerous problems before new code was released.

But the testbench is not perfect, and it may sometimes fail without any actual problem. There are two reasons for this. One is that the test require a very specific environment. For example, some parser based tests assume that the system the test is run on is configured to be named “localhost.localdomain” (the default for many test deployments). This needs to be the case because there currently is no way in rsyslog to overwrite the local hostname. Some parser tests use malformed messages, in which case (as of the RFC), the local system name must be used. As such, we need to have a specific system name set in order to prove the results. In the long term, I’ll add the capability to overwrite system name inside rsyslog, but it does not make sense to create a dirty trick just for testbench use. So this needs to wait until we get to it as part of regular development. Note that a similar issues may exist at other places. An obvious one is the database tests, where we need pre-created users, databases, tables etc in order to run the tests.

The other issue is a bit more subtle. The syslog protocol is simply, without App-Layer acknowledgments. This makes it hard to know when rsyslog has received a while bunch of test data. That in turn makes it hard to definitely say when all test data has arrived and an instance can be shut down. So the whole process is a bit racy. To “solve” this, I use some wait periods in tests affected by this problem. However, longer wait periods mean longer test bench runtime and this reduces my development productivity. So I use wait time that usually does the job, but may fail under some circumstances (most notably when –enable-debug is set). This can affect a couple of TCP-based tests (like imtcp_conndrop.sh and similar ones). I have not yet a good idea what a clean solution to this problem is, where “clean” means that it a) always works and b) does no introduce unnecessary code complexity under non-testbench runs.

Given these problems, some care must be taken interpreting testbench results. Most importantly, a fail does not necessarily mean that things are actually broken. It merely means that one needs to look at the actual test and check a) why it fails and b) if it fails repeatedly. Especially the “racy” test tend to occasionally fail without any real problem. I’ve also seen them to fail consistenly on some platforms, simply because my timing assumptions are not valid there (Solaris was one example where I needed to adjust my overall wait periods).

So testbench results need to be taken with a grain of salt, and require interpretation. I know this is inconvenient for occasional users, but it is the best compromise I currently can offer.

using failover and asynchronous actions in rsyslog

I wanted to point out that failover actions and asynchronous processing does not work well in rsyslog — at least if a simple approach is used. The reason is that there is a conceptual problem with them working both together: async actions will – by design – always return an “everything went OK” status (because we don’t know otherwise, as things are async!). So this means failover processing will never see an error.

Now look the following config:

$modload imuxsock
$ActionQueueType LinkedList
$RepeatedMsgReduction on
*.* @@10.48.20.19:10514
$ActionExecOnlyWhenPreviousIsSuspended on
& @@10.48.20.18:10514
& -/home/logfile2
$ActionExecOnlyWhenPreviousIsSuspended off

If the Action Queue Type is set to linked list (and thus the action executed asynchronously), the other two actions will never be executed – because the async action always “succeeds”.

There are two ways to solve this situation:
1) do run the action synchronously — depending on your needs, this may be a solution or not
2) if you need to run it async, you need to define a new ruleset, which includes the config WITHOUT the async processing. Then, use omruleset to execute the newly defined ruleset as whole asynchronously. This is obviously a bit more complex, but will do what you need.

I hope this resolves some confusion about the failover functionality (and, yes, a better config language would make this less painful — hopefully we will finally be able to write one ;)).

new rsyslog/systemd work going on

This is just a quick note that systemd is doing really well in providing logging right from the system startup, even when no syslogd is running. The magic is that messages are put into the kernel log, where the (later started) syslogd can pull them from. Unfortunately, there are some downsides from this mode, and we are currently working to solve them. For more details, please follow this thread:

http://lists.freedesktop.org/archives/systemd-devel/2011-March/001558.html

I guess the remaining issues will be settled soon, as almost everything is in place and I just need to add some additional parsing logic to rsyslog. Due to the conference next week, I may need one extra week to complete that. All of this work will be part of the newly opened v5-devel (and above).

we are nearing a new rsyslog v5-stable

I have just released rsyslog v5.7.9. It will possibly be the last v5-beta versions of the 5.7. branch. I’ve ironed out a lot of bugs during the past two to three weeks. Right now, some patches are in 5.7.9 and not in the current stable, because I wait for some more feedback on the patches.There are still some bugs open in bugzilla, but all of this bugs are mostly concerned with rather exotic environments AND are present in the current v5-stable as well. So there is little argument to hold the new v5-stable branch just for that reason.

The plan is to release a last 5.6.6 version, ending that branch. Shortly after that, I’ll release 5.8.0. That way, conservative operators receive the latest round of bug fixes and can probably wait quite relaxed until 5.8.1 arrives ;)

It should be noted that 5.8.0 will be the first stable version with full support for systemd.

refactored imfile module now publically available

Hi all,

the refactoring and enhancement of imfile done by Nikolaidis Fotis has reached a first public-visible result. I have just created a new experimental git branch with his work:

http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/v5-stable-imfile-refactor

The code is probably not 100% finished, but it provides a preview plus the ability to suggest further improvements.

Please note that the code base has evolved since the refactoring begun. So in order to integrate into the main code line, we must see that we can adopt to the recent changes to imfile. Most importantly, you should try to merge this branch with v5-devel, which gives me a lot of issues when I try to do it. We will look at this once the new modules codebase has matured (and Nikolaidis gives me a “go” for that activity).
Feedback on the new module is appreciated (please direct it to the rsyslog mailing list).

rsyslog hit the one million message per second mark

Thanks to a user report, I just found out the the recent v5 versions of rsyslog have hit and exceeded the one million message per second (mps) mark! In my lab, with commodity hardware I reach around 500,000 mps. In practice this seems to be at least doubled :)

This strengthens even further that rsyslog is the syslogd of choice for demanding high-traffic environments :)

The full report is here:

http://kb.monitorware.com/000-000-character-messages-sec-with-rsyslog-t10740.html

a new rsyslog v5 beta – and focus on v6!

I have just generated a new v5-beta. It is being processed for release right now and will be soon. From the announcement:

This release both offers a set of new features and, at the same time, turns 5.7.3 into beta state. At first, this sounds a bit contradictory, but we do this for two reasons:

a) the new features introduced are non-intrusive in regard to the existing feature set, so no “bad surprises” are expected
b) other than that, primarily bugfixing went into this release, with
only few remaining issues being open

In order to move towards a new v5-stable, we consider it useful to begin with a new v5-beta stage. Note that this time the beta phase may be rather quick, because of the argument b) above. Note that we did not have any serious bug reports (except for one open issue) since December. Once the remaining issue is solved, we plan to do a short “proof in practice” and then move on to a new stable version.

In regard to new features, this release offers imfile multi-line capability, realtime UDP reception capability and better configurability for ommysql as recently announced for 6.1.3.

With this release, I also plan to conclude moving new features into v5 and try to focus on v6. This may not be practical in all cases, but I will try to stick to this plan as much as possible. During the past few weeks I have had considerable work to do just to integrate the various new features introduced in v4 and v5 into v6. While this was not really hard to do, it requires more than a little effort and very careful handling of the changes. The primary reason is that the code base diverged quite a bit and merging isn’t so much “fun” with that. More than once I even screwed up on some minor details. I hope that with a focus on v6 (for new features), I can spare a lot of time which than can go into new features.

And please do not misunderstand me: I focus on v6 for new features. This means I can focus even more on v5 in regard to correctness (bug-freeness). As usual, I prefer to fix issue in the oldest affected (and supported) release, and traditionally the version before the most current version branch has been very attractive to users because it has a near-complete feature set and a very strong focus on correctness.

Multi-Threading rsyslog’s TCP input

A form thread made me aware that there seems to be an issue with rsyslog performance if TLS is used. The past two weeks, I have worked on a paper which looks in-depth at rsyslog performance an I came across a paper [1] that promotes writing servers in that “traditional” multi-threaded way (with a single thread per connection). It addressed some of my concerns, and I thought it is worth actually trying out this approach (I outruled it for several years and never again looked at it). As a result, I created an experimental module imttcp, which works in this mode. I put this to test, especially as that would also lead to a much simpler programming paradigm. Unfortuantely, the performance results are devastive: while there is a very slight speedup with  a low connection number (close to the number of cores on the system), there is a dramatic negative speedup if running with many threads. Even at only 50 connections, rsyslog is dramatically slower (80 seconds for the same workload which was processed in 60 seconds with traditional imtcp or when running on a single connection). At 1,000 connections, the run was *extremely* slow. So this is definitely a dead-end. To be honest, Behren, condit and Brewer (the authors of [1]) claim that the problem lies in the current implementation of thread libraries. As one cure, they propose user-level threads. However, as far as I could find out, User-Level threads seem not to be much faster under Linux than Kernel-Level threads (which I used in my approach).

Even more convincing is, from the rsyslog PoV, that there are clear reasons why the highly threaded input must be slower:

  • batch sizes are smaller, leading to much more overhead
  • many more context switches are needed to switch between the various i/o handlers
  • more OS API calls are required because in this model we get more   frequent wakeups on new incoming data, so we have less data available to read at each instant
  • more lock contention because many more threads compete on the main queue mutex

All in all, this means that the approach is not the right one, at least not for rsyslog (it may work better if the input can be processed totally independent, but I have note evaluated this). So I will look into an enhanced event-based model with a small set of input workers pulling off data (I assume this is useful for e.g. TLS, as TLS transport is much more computebound than other inputs, and this computation becomes a limiting factor for the overall processing speed under some circumstances – see [2]).

As a side-note: for obvious reasons, I will not try to finish imttcp. However, I have decided to leave it included in the source tree, so that a) someone else can build on it, if he sees value in that b) I may use it for some other tests in the future.

[1] R. Von Behren, J. Condit, and E. Brewer. Why events are a bad idea
      (for high-concurrency servers). In Proceedings of the 9th conference on Hot
     Topics in Operating Systems-Volume 9, page 4. USENIX Association, 2003.