I just released rsyslog 1.19.12. For the minor change I made in 1.19.11, I created a branch based on 1.19.10. I thought .11 would be the last release in that branch. Obviously, I was wrong. Working on two branches is definitely a lot of not really useful extra work. I’ll try to get thing together this week, then again only working with one branch. That will be the feature-enhanced 1.20.0 release.
Did I find the rsyslog bug?
I have a new leading theory on the rsyslog segfault bug. Before I restructured everything to get rid of the alarm() calls, I did some more research in respect to the best threading model. More or less by accident, I found a nice note on glibc, rentrancy and the _REENTRANT preprocessor macro. That lead me to the “-pthread” compiler option…
May it be that we “just” have a compiler option problem? So far, we only call it with “-lpthreads”, which only effects the linker. -pthread, if I understood it right, defines _REENTRANT, which in turn causes some reentrant versions of standard library functions to be used. If that isn’t outdated information, it could pretty much be our problem.
I am not sure which functions of the run time library are affected by the _REENTRANT macro. So I do not know if I call them. However, non-reentrant run time library functions would be a good explanation for very seldom segfaults that only occur under heavy load and when running in multithreading mode. It would also explain why so many in-depth code reviews did not find anything…
This compiler option finding looks very promising. I’ll probably do a special 1.19.11 release with just that patch and then we’ll see what happens. In the mean time, I wait if somebody comes up with some more diagnostic information. But I finally have found a good explanation for what I see – at least in theory…
Is alarm() the culprit?
I have once again reviewed the threading. Now I have a weak hope. To handle mark messages, there is an alarm() call inside rsyslog. While alarm() typically does not play well with pthreads, rsyslog uses it in a very limited scope. Also, the alarm handler is activated only in one specific thread. But… “alarm() does not play well with pthreads”… In absence of any better solution, may this be the actual cause for this hard to hunt bug?
I will now see what is required to get rid of the alarm(). Actually, this is harder than it initially looks. For alarm() to go away, I need to set up a real background thread that does the timer ticks and that causes the mark messages. That, in turn, means that I have two concurrent messages sources, which involves quite a bit of synchronization. All of that is currently not needed as the alarm signal simply interrupts the select, which in turn leads to execution of nice sequential code. Oh, and yes: the alarm signal handler of course does (contrary to sysklogd) nothing then to set a global variable flag. So I’ll have a look at all that…
found another part of the code … that is not the problem…
I’d like to reproduce part of my conversation with Peter Vrabec over here. Together with others, Peter is very actively helping track down the nasty rsyslog segfault bug. He had quite a good idea of what could have caused it, resulting in me doing another focussed code inspection. I just explained to him why we have not yet found the problem. And there is a lot of detail in that description. Detail, I think, that others can benefit from, too. So here it comes:
> *** glibc detected *** rsyslogd: double free or corruption (!prev):
> 0x09cbd588
> ***
> (gdb) print (uchar *) 0x09cbd588
> $49 = (unsigned char *) 0x9cbd588 “”
>
> is it possible to call free on pszRcvFrom when it points at “” ?
> because I can see:
> msg.c:
> if(pM->pszRcvFrom == NULL)
> return “”;
> else
> return (char*) pM->pszRcvFrom;
>
> same with getRawMsg/getUxTradMsg/…, places where mudflap
> screams in MsgDestruct().
Sorry – if I have not totally screwed up, it’s impossible that this is the problem. I’ve done (yet another) thorough review. All of these functions are either called to supply another Set…() function (which simply copies over the “”, but does not free it) or from MsgGetProp(). MsgGetProp() uses a variable (pbMustBeFreed) to track whether or not a buffer must be freed. In general, unmodified properties are never freed – this happens only at message destruction (MsgDestruct()). MsgDestruct() however, just checks the pointers and frees if they are non-NULL. So MsgDestruct() is not interested in getRawMsg() or another get() function returning an empty string except a NULL pointer.
I have also checked if MsgGetProp() and its helpers correctly handle pbMustBeFreed – and they do it right. So a buffer is only freed when it was dynamically allocated. Message properties are only created during message construction and are free()ed when the message is destroyed.
So I guess we have found another part of the code that does not cause the problem.
What I would like to re-iterate, though, is that the segfault problem seems to disappear under all circumstances if compiled with –disable-pthreads. IF SO (and this is my current assumption), it can not be a general logic error like a double free but must be a synchronization problem. Let me re-iterate: I have yet to find a single installation that has a segfault when running on a single thread! All folks with segfaults that compiled in single threading mode did no longer experience any problems. This is strong evidence to me.
But while I almost think I know it is threading related, I can not find anything wrong in that area, too. The threading model is very simple and construction/destruction of elements is spread in an easy way over the two threads. Basically, the listener thread creates the message and its properties, while the action thread destructs them. There are a few exceptions where properties are derived, but nothing really complex. Yet, I still think it is in the threading area – why does it work in single thread mode otherwise?
Hunting for the segfault…
Do you remember? We are still hunting for a segfault in rsyslog that is very hard to find. The biggest problem is that most people will never experience it. I do not experience it in lab, nor does anybody else who is currently working on the project. And without the ability to reproduce it, there is a lot of guesswork involved.
This is why we are asking for the help of our users (that means: you!). If you run rsyslog and experience a segfault, we would very much appreciate if you could run a specially instrumented version. Peter Vrabec created it and it contains debugging support as well mudflap support, which is a tool to track down nasty memory management errors.
The RPM can be found at:
http://people.redhat.com/pvrabec/rpms/rsyslog/rsyslog-1.19.10-2.mudflap.src.rpm
Please install it. You will notice that CPU usage is higher than normal, but in most cases this is harmless. If you are concerned, drop me a line with specifics and I’ll happily address them.
If a segfault happens when you run this version, please send us
- binary,
- coredump and
- /var/log/rsyslog.mudflap
- hardware and OS information (which distro? multiprocessor? …?)
Our sincere hope is that we receive enough reports to find something in common between them. So, please contribute your segfault info if you happen to have it. This is a great way to contribute to the project!
recent rsyslog changes
I am back to my routine of posting rsyslog changes. You may also imply that this means I am actually developing some things (and not just writing about it ;)). After I had a somewhat slow start today, things evolved quite nicely this afternoon. If I did not overlook anything important, I even managed to complete the “clean unload process” for loadable modules. That also brought me back to good working knowledge of the code. Actually, I am at least a day ahead of my schedule. But, of course, I’ll check if I overlooked something – but that’ll be tomorrow.
So on to the promised change log (it also covers some past days where I had not reported):
2007-11-19
– applied gssapi patch from varmojfekoj – gss-api is now supported
– added some debug message to ommysql
2007-11-20
– added user doc for gssapi patch from varmojfekoj – thanks!
– bumped version number to 1.20.0, because of new gss api functionality
2007-11-21
– begun to look at dynamic module unloading – this is currently a hack
and works with the mysql module only (which is the only one, so there
is no problem in practice. But it would be good to begin to do it right ;)
– added new modExit() entry point to loadable module interface
– added an identifier to command handler table – need to identify which
command handler entries need to be removed when module is unloaded
– added support so that linkedlist key can be used for owner handle
– enhanced llExecFunc to support deletion of list elements (on behalf of
user function being called, slight interface change)
– enhanced linkedlist class so that list elements can now be deleted based
on the key value they have
– created entry point so that CfSysLine handlers are removed on modExit()
– some cleanup
– modules are now correctly unloaded and de-initialized
going back to rsyslog coding…
As promised, I have started to look at the rsyslog code this morning again and done so in an effort to enhance it. My first target is unloading loadable modules in a “well done” way. So far, this is a hack that does only work because ommysql (and probably a postgres module basing on it soon) does not use some of the interface functionality. Namely cfSysLineHandlers do not work with the current code.
So what I am set now for is doing it right and making sure that a loadable module can be cleanly unloaded under all circumstances. That, of course, requires some interface changes, but nothing major (keep in mind the interface is not yet finalized!). This work provides the basis for upcoming work, which will utilize many more loadable modules for other functionality, too. So it is a critical task.
I have to admit, though, that I think I need another day or so to get fully re-acquainted to the code. There was really a toll from my absence and I begin to notice it ;)
rsyslog- what’s next?
I posted an outline of my next actions on the rsyslog mailing list and would like to share it here as well:
I have thought about setting up a full lab for GSS-API before carrying on. For now, I have decided to NOT do that. I am sure that the contributors have tested it quite well and the code that I have reviewed looks excellent.
So I will pull it in as is and wait for some feedback from the field (with the assumption “no feedback” equals “OK”).
I will then begin to look at the loadable module de-initialization. This is not really clean in the current release, but that’s no problem because modules never get unloaded. However, in the long term we need this to be clean.
The mysterious segfault issue is still dangling. I was hesitant to do any larger-scale new development without fixing it. But given the fact that it is extremely hard to find, and obviously happens very seldom, I’ll continue developing. I am right now looking into upgrading the dev machine to an x64 OS, where most of the problems happened. My hope is that I will see a segfault during further development work and then hopefully be able to tackle it. I still think that the segfault must be well understood and fixed before I go into some serious multithreading redesign. As such, unfortunately, this issue still holds some of the work scheduled for the next *major* version.
I thought I give you an update here in my end (will also post this to the blog for the others). Any feedback/suggestion is highly welcome.
syslog GSS-API usage notes
I copy over some usage notes from Peter Vrabec’s intial announcement of the patch:
It adds a new commandline option ‘-g’ to listen for a connection wrapped with gss-api and few new configuration directives:
for server:
$gsslistenservicename <service name>
for client:
$gssforwardservicename <service name>
$gssmode <encryption|integrity|none>With gssmode set to “encryption” or “integrity” all tcp selectors will be forwarding messages via gss-api.
That’s probably useful while I am getting up some real documentation.
GSS-API for Rsyslog
I just reviewed and integrated a patch from varmojfekoj into rsyslog – it provided GSS-API support. I have to admit that I have not yet fully understood what it does from a user’s point of view. But I begin to have the feeling that this patch will probably be the most important addition to rsyslog in the later half of this year.
I also have not yet evaluated how this patch relates to the syslog-sec IETF syslog security effort, namely syslog-transport-tls. I guess they are related and the patch probably not only provided non-standard functionality but may even make it harder to implement the standard. HOWEVER, if we look at how slow moving that IETF WG is, I do not bother about any compliance problems. They can be dealt with later. What I find much more important is that we have a real-world answer to real-world security question and we do have this now. So what could be more important? ;)
I keep you updated on the progress.