update on rsyslog changes

I have been a bit lazy in posting my recent rsyslog changes. This is because there was a lot of fine-tuning as well as design work. I also fear that I probably have forgotten to log a few less important changes. Nevertheless, the log should still provide you with a good idea of what’s going on. If I do not get any more last minute nits, the 1.20.0 release can hopefully be released either tomorrow or on Friday.

Here are the changes now:

2007-11-27
– got new dev environment on Fedora 8 x64 up and running
– reviewed calls to MsgDestruct() – re-confirmed that there are no
double frees.
– moved syslog PRI code tables to msg.c as this is the only place where
they are still needed

2007-12-03
– adding sur5r’s postgres module – many thanks for providing it!
– added Bartosz Kuzma patches for better compatibility with NetBSD – many
thanks for them!
– disabled GSS-API in configure.ac, as support for it is not yet complete
(Peter Vrabec told me to hold it until a further patch)
2007-12-04
– released 1.20.0 preview (via mailing list only, waiting on feedback)
2007-12-05
– applied patch from Bartosz Kuzma and Michael Biebl to fix compilation
problem on NetBSD (it is missing the -ldl lib, even though it provides
dlopen()).

rsyslog now does postgres SQL!

I have just finished the integration of a new postgres sql output module for rsyslog. The code was provided by sur5r. I am more than glad that rsyslog now supports the popular postgres sql engine. I am sure this is a very nice addition for the community.

Sur5r’s able-coded code also provided some feedback about the output module interface in general. The good news is that the amount of code needed in the actual output module was as low as initially thought.

But it also proved that module design still is less than perfect: there was a new date format needed to support postgres sql. This could not be done in the output module. Instead, sur5r had to modify the core files. He did this in an excellent way, but the point is that it is not a good thing this was needed at all. It is a result of the old, non-modular syslogd design which we inherited from sysklogd and have not yet changed. In fact, we are at the very beginning of modularization. The next thing on the list is a loadable function interface. With that, such modifications will no longer be necessary, a custom (loadable) function can than be provided.

This enhancement requires a number of changes in the core, most importantly in the template subsystem. This is most probably the next area of work I am heading to.

But for the time being, let’s celebrate that rsyslog now has postgres sql support! It will be released in 1.20.0, hopefully around the end of the week. A preview will be announced on the mailing list somewhat earlier. And if you can’t wait, simply pull it out of anonymous CVS – it’s in there as of this writing.

rsyslog now part of Fedora 8

Oops… I think I forgot to mention that rsyslog was recently included in Fedora 8. This is an important milestone for the rsyslog project, because it is the first time ever that it appears as the default syslogd in a wide-spread distribution.

Thankfully, a lot of other package maintainers have been found for other distros (I really appreciate your hard work, guys!), but everywhere else rsyslog is “just” an extra. In Fedora 8, it is the default logging daemon. I’d also like to add that Red Hat folks contributed quite a lot to rsyslog, including great testing.

Thanks to everyone who helped make rsyslog this capable. I think its appearance in Fedora 8 is rewarding for all of us!

rsyslog 1.19.12 released

I just released rsyslog 1.19.12. For the minor change I made in 1.19.11, I created a branch based on 1.19.10. I thought .11 would be the last release in that branch. Obviously, I was wrong. Working on two branches is definitely a lot of not really useful extra work. I’ll try to get thing together this week, then again only working with one branch. That will be the feature-enhanced 1.20.0 release.

Did I find the rsyslog bug?

I have a new leading theory on the rsyslog segfault bug. Before I restructured everything to get rid of the alarm() calls, I did some more research in respect to the best threading model. More or less by accident, I found a nice note on glibc, rentrancy and the _REENTRANT preprocessor macro. That lead me to the “-pthread” compiler option…

May it be that we “just” have a compiler option problem? So far, we only call it with “-lpthreads”, which only effects the linker. -pthread, if I understood it right, defines _REENTRANT, which in turn causes some reentrant versions of standard library functions to be used. If that isn’t outdated information, it could pretty much be our problem.

I am not sure which functions of the run time library are affected by the _REENTRANT macro. So I do not know if I call them. However, non-reentrant run time library functions would be a good explanation for very seldom segfaults that only occur under heavy load and when running in multithreading mode. It would also explain why so many in-depth code reviews did not find anything…

This compiler option finding looks very promising. I’ll probably do a special 1.19.11 release with just that patch and then we’ll see what happens. In the mean time, I wait if somebody comes up with some more diagnostic information. But I finally have found a good explanation for what I see – at least in theory…

Is alarm() the culprit?

I have once again reviewed the threading. Now I have a weak hope. To handle mark messages, there is an alarm() call inside rsyslog. While alarm() typically does not play well with pthreads, rsyslog uses it in a very limited scope. Also, the alarm handler is activated only in one specific thread. But… “alarm() does not play well with pthreads”… In absence of any better solution, may this be the actual cause for this hard to hunt bug?

I will now see what is required to get rid of the alarm(). Actually, this is harder than it initially looks. For alarm() to go away, I need to set up a real background thread that does the timer ticks and that causes the mark messages. That, in turn, means that I have two concurrent messages sources, which involves quite a bit of synchronization. All of that is currently not needed as the alarm signal simply interrupts the select, which in turn leads to execution of nice sequential code. Oh, and yes: the alarm signal handler of course does (contrary to sysklogd) nothing then to set a global variable flag. So I’ll have a look at all that…

found another part of the code … that is not the problem…

I’d like to reproduce part of my conversation with Peter Vrabec over here. Together with others, Peter is very actively helping track down the nasty rsyslog segfault bug. He had quite a good idea of what could have caused it, resulting in me doing another focussed code inspection. I just explained to him why we have not yet found the problem. And there is a lot of detail in that description. Detail, I think, that others can benefit from, too. So here it comes:

> *** glibc detected *** rsyslogd: double free or corruption (!prev):
> 0x09cbd588
> ***
> (gdb) print (uchar *) 0x09cbd588
> $49 = (unsigned char *) 0x9cbd588 “”
>
> is it possible to call free on pszRcvFrom when it points at “” ?
> because I can see:
> msg.c:
> if(pM->pszRcvFrom == NULL)
> return “”;
> else
> return (char*) pM->pszRcvFrom;
>
> same with getRawMsg/getUxTradMsg/…, places where mudflap
> screams in MsgDestruct().

Sorry – if I have not totally screwed up, it’s impossible that this is the problem. I’ve done (yet another) thorough review. All of these functions are either called to supply another Set…() function (which simply copies over the “”, but does not free it) or from MsgGetProp(). MsgGetProp() uses a variable (pbMustBeFreed) to track whether or not a buffer must be freed. In general, unmodified properties are never freed – this happens only at message destruction (MsgDestruct()). MsgDestruct() however, just checks the pointers and frees if they are non-NULL. So MsgDestruct() is not interested in getRawMsg() or another get() function returning an empty string except a NULL pointer.

I have also checked if MsgGetProp() and its helpers correctly handle pbMustBeFreed – and they do it right. So a buffer is only freed when it was dynamically allocated. Message properties are only created during message construction and are free()ed when the message is destroyed.

So I guess we have found another part of the code that does not cause the problem.

What I would like to re-iterate, though, is that the segfault problem seems to disappear under all circumstances if compiled with –disable-pthreads. IF SO (and this is my current assumption), it can not be a general logic error like a double free but must be a synchronization problem. Let me re-iterate: I have yet to find a single installation that has a segfault when running on a single thread! All folks with segfaults that compiled in single threading mode did no longer experience any problems. This is strong evidence to me.

But while I almost think I know it is threading related, I can not find anything wrong in that area, too. The threading model is very simple and construction/destruction of elements is spread in an easy way over the two threads. Basically, the listener thread creates the message and its properties, while the action thread destructs them. There are a few exceptions where properties are derived, but nothing really complex. Yet, I still think it is in the threading area – why does it work in single thread mode otherwise?

Hunting for the segfault…

Do you remember? We are still hunting for a segfault in rsyslog that is very hard to find. The biggest problem is that most people will never experience it. I do not experience it in lab, nor does anybody else who is currently working on the project. And without the ability to reproduce it, there is a lot of guesswork involved.

This is why we are asking for the help of our users (that means: you!). If you run rsyslog and experience a segfault, we would very much appreciate if you could run a specially instrumented version. Peter Vrabec created it and it contains debugging support as well mudflap support, which is a tool to track down nasty memory management errors.

The RPM can be found at:

http://people.redhat.com/pvrabec/rpms/rsyslog/rsyslog-1.19.10-2.mudflap.src.rpm

Please install it. You will notice that CPU usage is higher than normal, but in most cases this is harmless. If you are concerned, drop me a line with specifics and I’ll happily address them.

If a segfault happens when you run this version, please send us

  • binary,
  • coredump and
  • /var/log/rsyslog.mudflap
  • hardware and OS information (which distro? multiprocessor? …?)

Our sincere hope is that we receive enough reports to find something in common between them. So, please contribute your segfault info if you happen to have it. This is a great way to contribute to the project!

recent rsyslog changes

I am back to my routine of posting rsyslog changes. You may also imply that this means I am actually developing some things (and not just writing about it ;)). After I had a somewhat slow start today, things evolved quite nicely this afternoon. If I did not overlook anything important, I even managed to complete the “clean unload process” for loadable modules. That also brought me back to good working knowledge of the code. Actually, I am at least a day ahead of my schedule. But, of course, I’ll check if I overlooked something – but that’ll be tomorrow.

So on to the promised change log (it also covers some past days where I had not reported):

2007-11-19
– applied gssapi patch from varmojfekoj – gss-api is now supported
– added some debug message to ommysql
2007-11-20
– added user doc for gssapi patch from varmojfekoj – thanks!
– bumped version number to 1.20.0, because of new gss api functionality
2007-11-21
– begun to look at dynamic module unloading – this is currently a hack
and works with the mysql module only (which is the only one, so there
is no problem in practice. But it would be good to begin to do it right ;)
– added new modExit() entry point to loadable module interface
– added an identifier to command handler table – need to identify which
command handler entries need to be removed when module is unloaded
– added support so that linkedlist key can be used for owner handle
– enhanced llExecFunc to support deletion of list elements (on behalf of
user function being called, slight interface change)
– enhanced linkedlist class so that list elements can now be deleted based
on the key value they have
– created entry point so that CfSysLine handlers are removed on modExit()
– some cleanup
– modules are now correctly unloaded and de-initialized

going back to rsyslog coding…

As promised, I have started to look at the rsyslog code this morning again and done so in an effort to enhance it. My first target is unloading loadable modules in a “well done” way. So far, this is a hack that does only work because ommysql (and probably a postgres module basing on it soon) does not use some of the interface functionality. Namely cfSysLineHandlers do not work with the current code.

So what I am set now for is doing it right and making sure that a loadable module can be cleanly unloaded under all circumstances. That, of course, requires some interface changes, but nothing major (keep in mind the interface is not yet finalized!). This work provides the basis for upcoming work, which will utilize many more loadable modules for other functionality, too. So it is a critical task.

I have to admit, though, that I think I need another day or so to get fully re-acquainted to the code. There was really a toll from my absence and I begin to notice it ;)