Helping to find major issues in rsyslog…

Rsyslog has a very rapid development process, complex capabilities and now gradually gets more and more exposure. While we are happy about this, it also has some bad effects: some deployment scenarios have probably never been tested and it may be impossible to test them for the development team because of resources needed. So while we try to avoid this, one may see a serious problem during deployments in demanding, non-standard, environments (hopefully not with a stable version, but chances are good you’ll run into troubles with the development versions).

Active support from the user base is very important to help us track down those things. Most often, serious problems are the result of some memory misadressing. During development, we routinely use valgrind, a very well and capable memory debugger. This helps us to create pretty clean code. But valgrind can not detect anything, most importantly not code pathes that are never executed. So of most use for us is information about aborts and abort locations.

Unforutnately, faults rooted in adressing errors typically show up only later, so the actual abort location is in an unrelated spot. To help track down the original spot, libc later than 5.4.23 offers support for finding, and possible temporary relief from it, by means of the MALLOC_CHECK_ environment variable. Setting it to 2 is a useful troubleshooting aid for us. It will make the program abort as soon as the check routines detect anything suspicious (unfortunately, this may still not be the root cause, but hopefully closer to it). Setting it to 0 may even make some problems disappear (but it will NOT fix them!). With functionality comes cost, and so exporting MALLOC_CHECK_ without need comes at a performance penalty. However, we strongly recommend adding this instrumentation to your test environment should you see any serious problems. Chances are good it will help us interpret a dump better, and thus be able to quicker craft a fix.

In order to get useful information, we need some backtrace of the abort. First, you need to make sure that a core file is created. Under Fedora, for example, that means you need to have an “ulimit -c unlimited” in place.

Now let’s assume you got a core file (e.g. in /core.1234). So what to do next? Sending a core file to us is most often pointless – we need to have the exact same system configuration in order to interpret it correctly. Obviously, chances are extremely slim for this to be. So we would appreciate if you could extract the most important information. This is done as follows:

  • $gdb /path/to/rsyslogd
  • $info thread
  • you’ll see a number of threads (in the range 0 to n with n being
    the highest number). For each of them, do the following (let’s assume
    that i is the thread number):

    • $ thread i (e.g. thread 0, thread 1, …)
    • $bt

  • then you can quit gdb with “$q”

Then please send all information that gdb spit out to the development team. It is best to first ask on the forum or mailing list on how to do that. The developers will keep in contact with you and, I fear, will probably ask for other things as well ;)

Note that we strive for highest reliability of the engine even in unusual deployment scenarios. Unfortunately, this is hard to achieve, especially with limited resources. So we are depending on cooperation from users. This is your chance to make a big contribution to the project without the need to program or do anything else except get a problem solved ;)

Tools to detect stack adressing Problems?

Since I have begun to use the valgrind memory debugger routinely in rsyslog development (some two years ago), the quality of the source has much increased. Unfortunately, however, valgrind is not able to detect problems related to misaddressing variables on the stack. The 5.3.6 bug I was hunting for almost a week is a good example of this. Valgrind also provides only limited support for global data, as far as I know (and see from testing results).

This becomes an even more important restriction as I moved a lot of former heap memory use to the stack for performance reasons. I remember at least one more major bug hunting effort that was hard to find because it affected only stack space.

So I am currently looking for tools that could complement valgrind by providing good stack checking capabilities. As one tool, mudflap was suggested to me. It sounds interesting, but gives me a very hard time [very hard to read debug output (no symbolic names for dlloade’ed modules, (false?) reports for areas where I can not see anything wrong as well as frequent (threading-related?) crashes when running under instrumentation). Maybe I am just misinterpreting the output…

In short: I would highly appreciate suggestions for tools that can help with debugging stack memory access (global data would be a plus) – and/or instructions on how to interpret mudflap, if that is considered to be *the* tool for that use case.

Thanks,
Rainer