Is alarm() the culprit?

I have once again reviewed the threading. Now I have a weak hope. To handle mark messages, there is an alarm() call inside rsyslog. While alarm() typically does not play well with pthreads, rsyslog uses it in a very limited scope. Also, the alarm handler is activated only in one specific thread. But… “alarm() does not play well with pthreads”… In the absence of any better explanation, could this be the actual cause of this hard-to-hunt bug?
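For illustration, here is roughly what using alarm() "in a very limited scope" can look like in a threaded program. This is a minimal POSIX sketch, not rsyslog's code, assuming that one listener thread waits in select() and no other thread may ever receive SIGALRM: the signal is blocked before any other thread is created (new threads inherit the mask), only the listener unblocks it, and the handler does nothing but set a volatile sig_atomic_t flag. All names (markDue, onAlarm, workerThread, runAlarmDemo) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* The only shared state the handler touches. volatile sig_atomic_t is the
 * type POSIX guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t markDue = 0;

static void onAlarm(int sig)
{
    (void)sig;
    markDue = 1;            /* no malloc, no locks, no stdio in here */
}

/* Hypothetical stand-in for another thread (e.g. the action thread). It
 * inherits its creator's signal mask; reports whether SIGALRM is blocked. */
static void *workerThread(void *arg)
{
    sigset_t cur;
    pthread_sigmask(SIG_BLOCK, NULL, &cur);     /* query only */
    *(int *)arg = sigismember(&cur, SIGALRM);
    return NULL;
}

/* Listener side: returns 1 if select() was interrupted by the alarm. */
static int runAlarmDemo(int *workerHasAlarmBlocked)
{
    sigset_t alrm;
    sigemptyset(&alrm);
    sigaddset(&alrm, SIGALRM);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = onAlarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    /* 1. Block SIGALRM before creating any thread: new threads inherit it. */
    pthread_sigmask(SIG_BLOCK, &alrm, NULL);
    pthread_t tid;
    if (pthread_create(&tid, NULL, workerThread, workerHasAlarmBlocked) != 0)
        return 0;

    /* 2. Unblock it in this thread only, so the alarm can only land here. */
    pthread_sigmask(SIG_UNBLOCK, &alrm, NULL);

    alarm(1);
    int interrupted = 0;
    while (!markDue) {
        /* The timeout is only a fallback in case the signal beats select(). */
        struct timeval tv = { 3, 0 };
        if (select(0, NULL, NULL, NULL, &tv) < 0 && errno == EINTR)
            interrupted = 1;
    }
    markDue = 0;            /* 3. back in sequential code: emit the mark here */
    pthread_join(tid, NULL);
    return interrupted;
}
```

Calling runAlarmDemo(&blocked) should return 1 after about one second, with blocked set to 1. The ordering matters because POSIX delivers a process-directed signal such as SIGALRM to any one thread that does not block it; a thread created while SIGALRM is still unblocked would be an equally valid target, which is one classic way alarm() and pthreads "do not play well".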

I will now see what is required to get rid of the alarm(). Actually, this is harder than it initially looks. For alarm() to go away, I need to set up a real background thread that does the timer ticks and that causes the mark messages. That, in turn, means that I have two concurrent message sources, which involves quite a bit of synchronization. None of that is currently needed, as the alarm signal simply interrupts the select(), which in turn leads to execution of nice sequential code. Oh, and yes: the alarm signal handler of course does (contrary to sysklogd) nothing other than set a global flag variable. So I’ll have a look at all that…
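The alarm-free alternative described above can be sketched as follows, assuming POSIX threads: a dedicated timer thread sleeps in pthread_cond_timedwait() until the next absolute deadline and then produces a mark under the shared mutex; setting a shutdown flag and signalling the condition variable stops it early. markTimer_t, enqueueMark() and runMarkTimer() are hypothetical names, and enqueueMark() only counts ticks where a real implementation would hand a mark message to the main queue. That hand-off is exactly the extra synchronization a second message source drags in.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical state shared by the mark timer thread and its owner. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t wake;        /* tells the timer thread to shut down */
    pthread_cond_t ticked;      /* tells waiters a mark was produced */
    int shutdown;
    int ticks;
    long intervalMs;
} markTimer_t;

/* Hypothetical stand-in for submitting a mark message to the main queue.
 * Called with t->lock held: this is the extra synchronization a second,
 * concurrent message source brings with it. */
static void enqueueMark(markTimer_t *t)
{
    t->ticks++;
    pthread_cond_broadcast(&t->ticked);
}

static void addMs(struct timespec *ts, long ms)
{
    ts->tv_sec += ms / 1000;
    ts->tv_nsec += (ms % 1000) * 1000000L;
    if (ts->tv_nsec >= 1000000000L) {
        ts->tv_sec++;
        ts->tv_nsec -= 1000000000L;
    }
}

static void *markThread(void *arg)
{
    markTimer_t *t = arg;
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* default condvar clock */

    pthread_mutex_lock(&t->lock);
    while (!t->shutdown) {
        addMs(&deadline, t->intervalMs);
        int rc = 0;
        /* Sleep until the deadline; the loop absorbs spurious wakeups. */
        while (!t->shutdown && rc != ETIMEDOUT)
            rc = pthread_cond_timedwait(&t->wake, &t->lock, &deadline);
        if (!t->shutdown)
            enqueueMark(t);
    }
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Runs the timer until it produced at least `wanted` marks, then stops it. */
static int runMarkTimer(long intervalMs, int wanted)
{
    markTimer_t t = { .shutdown = 0, .ticks = 0, .intervalMs = intervalMs };
    pthread_mutex_init(&t.lock, NULL);
    pthread_cond_init(&t.wake, NULL);
    pthread_cond_init(&t.ticked, NULL);

    pthread_t tid;
    if (pthread_create(&tid, NULL, markThread, &t) != 0)
        return -1;

    pthread_mutex_lock(&t.lock);
    while (t.ticks < wanted)
        pthread_cond_wait(&t.ticked, &t.lock);
    t.shutdown = 1;
    pthread_cond_signal(&t.wake);   /* wake the timer if it is sleeping */
    pthread_mutex_unlock(&t.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&t.ticked);
    pthread_cond_destroy(&t.wake);
    pthread_mutex_destroy(&t.lock);
    return t.ticks;
}
```

Advancing the deadline by one interval per tick, instead of recomputing it from the current time, keeps the tick rate from drifting when the thread is woken late. runMarkTimer(20, 3) returns after roughly 60 ms with at least three ticks counted.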

found another part of the code … that is not the problem…

I’d like to reproduce part of my conversation with Peter Vrabec over here. Together with others, Peter is very actively helping track down the nasty rsyslog segfault bug. He had quite a good idea of what could have caused it, resulting in me doing another focussed code inspection. I just explained to him why we have not yet found the problem. And there is a lot of detail in that description. Detail, I think, that others can benefit from, too. So here it comes:

> *** glibc detected *** rsyslogd: double free or corruption (!prev):
> 0x09cbd588
> ***
> (gdb) print (uchar *) 0x09cbd588
> $49 = (unsigned char *) 0x9cbd588 ""
>
> is it possible to call free on pszRcvFrom when it points at "" ?
> because I can see:
> msg.c:
> if(pM->pszRcvFrom == NULL)
>         return "";
> else
>         return (char*) pM->pszRcvFrom;
>
> same with getRawMsg/getUxTradMsg/…, places where mudflap
> screams in MsgDestruct().

Sorry, but if I have not totally screwed up, it’s impossible that this is the problem. I’ve done (yet another) thorough review. All of these functions are either called to supply a value to another Set…() function (which simply copies over the “”, but does not free it) or from MsgGetProp(). MsgGetProp() uses a variable (pbMustBeFreed) to track whether or not a buffer must be freed. In general, unmodified properties are never freed; this happens only at message destruction (MsgDestruct()). MsgDestruct(), however, just checks the member pointers and frees them if they are non-NULL. So MsgDestruct() does not care whether getRawMsg() or any other get() function returns an empty string instead of a NULL pointer: a property that was never set is still NULL inside the message object, so nothing is freed for it.

I have also checked if MsgGetProp() and its helpers correctly handle pbMustBeFreed – and they do it right. So a buffer is only freed when it was dynamically allocated. Message properties are only created during message construction and are free()ed when the message is destroyed.
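The ownership rules described above can be condensed into a small sketch. It is a deliberately simplified, hypothetical model (sketchMsg_t, msgGetPropSketch() and msgDestructSketch() are made-up names with made-up signatures, not rsyslog's msg.c): getters report an unset member as "" without assigning anything, the property lookup sets pbMustBeFreed only when it hands out a freshly allocated buffer, and destruction frees only the non-NULL members. Under these assumptions the literal "" can never reach free().

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char uchar;

/* Hypothetical, heavily simplified message object (not rsyslog's msg_t). */
typedef struct {
    uchar *pszRawMsg;   /* NULL until set */
    uchar *pszRcvFrom;  /* NULL until set */
} sketchMsg_t;

static sketchMsg_t *msgNew(void)
{
    return calloc(1, sizeof(sketchMsg_t));   /* all members start as NULL */
}

/* Setter: the message owns a private heap copy. */
static int msgSetRawMsg(sketchMsg_t *pM, const char *psz)
{
    size_t len = strlen(psz) + 1;
    uchar *pCopy = malloc(len);
    if (pCopy == NULL)
        return -1;
    memcpy(pCopy, psz, len);
    free(pM->pszRawMsg);
    pM->pszRawMsg = pCopy;
    return 0;
}

/* Getters in the style quoted in the mail: an unset member is reported as
 * "" (a string literal), but the member itself stays NULL. */
static char *getRawMsg(sketchMsg_t *pM)
{
    return pM->pszRawMsg == NULL ? "" : (char *)pM->pszRawMsg;
}

static char *getRcvFrom(sketchMsg_t *pM)
{
    return pM->pszRcvFrom == NULL ? "" : (char *)pM->pszRcvFrom;
}

/* Simplified property lookup. Unmodified values are borrowed
 * (*pbMustBeFreed = 0); a derived value (here: upper-cased, a made-up
 * option) is a fresh heap buffer the caller must free (*pbMustBeFreed = 1). */
static char *msgGetPropSketch(sketchMsg_t *pM, const char *pszName,
                              int bUpper, int *pbMustBeFreed)
{
    char *pVal = strcmp(pszName, "rawmsg") == 0 ? getRawMsg(pM)
               : strcmp(pszName, "fromhost") == 0 ? getRcvFrom(pM) : "";
    *pbMustBeFreed = 0;
    if (!bUpper)
        return pVal;

    size_t len = strlen(pVal);
    char *pBuf = malloc(len + 1);
    if (pBuf == NULL)
        return "";              /* borrowed literal: still must not be freed */
    for (size_t i = 0; i <= len; i++)   /* includes the terminating NUL */
        pBuf[i] = (char)toupper((unsigned char)pVal[i]);
    *pbMustBeFreed = 1;
    return pBuf;
}

/* Destruction looks only at the members and frees the non-NULL ones; it
 * never sees what a getter returned, so "" cannot reach free() here. */
static void msgDestructSketch(sketchMsg_t *pM)
{
    if (pM->pszRawMsg != NULL)
        free(pM->pszRawMsg);
    if (pM->pszRcvFrom != NULL)
        free(pM->pszRcvFrom);
    free(pM);
}
```

On the caller side the rule is simply: free the returned pointer if and only if pbMustBeFreed was set.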

So I guess we have found another part of the code that does not cause the problem.

What I would like to reiterate, though, is that the segfault problem seems to disappear under all circumstances if compiled with --disable-pthreads. IF SO (and this is my current assumption), it cannot be a general logic error like a double free but must be a synchronization problem. Let me reiterate: I have yet to find a single installation that has a segfault when running on a single thread! All folks with segfaults who compiled in single-threaded mode no longer experienced any problems. This is strong evidence to me.

But while I am almost certain it is threading related, I cannot find anything wrong in that area either. The threading model is very simple, and construction/destruction of elements is split in a straightforward way between the two threads. Basically, the listener thread creates the message and its properties, while the action thread destructs them. There are a few exceptions where properties are derived, but nothing really complex. Yet, I still think it is in the threading area; why else would it work in single-thread mode?
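To make the described split concrete, here is a minimal sketch of a listener/action hand-off, assuming a FIFO protected by a mutex and a condition variable between the two threads. rsyslog's real queue is not shown here; qMsg_t, msgQueue_t, qEnqueue(), qDequeue() and runPipeline() are hypothetical. The listener allocates each message and gives up ownership when it enqueues it; the action thread dequeues and frees it.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical message: created by the listener, destroyed by the action thread. */
typedef struct qMsg {
    char *pszText;
    struct qMsg *pNext;
} qMsg_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t notEmpty;
    qMsg_t *pHead, *pTail;
    int bDone;          /* listener will not enqueue anything else */
    long nDestroyed;    /* written only by the action thread */
} msgQueue_t;

/* Listener side. After this call the listener must not touch pMsg again:
 * ownership has moved to the action thread. */
static void qEnqueue(msgQueue_t *q, qMsg_t *pMsg)
{
    pMsg->pNext = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->pTail != NULL)
        q->pTail->pNext = pMsg;
    else
        q->pHead = pMsg;
    q->pTail = pMsg;
    pthread_cond_signal(&q->notEmpty);
    pthread_mutex_unlock(&q->lock);
}

/* Action side: blocks for the next message; NULL once drained and done. */
static qMsg_t *qDequeue(msgQueue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->pHead == NULL && !q->bDone)
        pthread_cond_wait(&q->notEmpty, &q->lock);
    qMsg_t *pMsg = q->pHead;
    if (pMsg != NULL) {
        q->pHead = pMsg->pNext;
        if (q->pHead == NULL)
            q->pTail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return pMsg;
}

static void *actionThread(void *arg)
{
    msgQueue_t *q = arg;
    qMsg_t *pMsg;
    while ((pMsg = qDequeue(q)) != NULL) {
        /* ... run the configured actions on pMsg->pszText ... */
        free(pMsg->pszText);    /* destruction happens on this thread only */
        free(pMsg);
        q->nDestroyed++;
    }
    return NULL;
}

/* Listener side: creates n messages, hands them over, and returns how
 * many the action thread destroyed. */
static long runPipeline(int n)
{
    msgQueue_t q = { .pHead = NULL, .pTail = NULL, .bDone = 0, .nDestroyed = 0 };
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.notEmpty, NULL);

    pthread_t tid;
    if (pthread_create(&tid, NULL, actionThread, &q) != 0)
        return -1;
    for (int i = 0; i < n; i++) {
        qMsg_t *pMsg = malloc(sizeof *pMsg);
        char *pszText = malloc(32);
        if (pMsg == NULL || pszText == NULL) {
            free(pMsg);
            free(pszText);
            break;
        }
        snprintf(pszText, 32, "message %d", i);
        pMsg->pszText = pszText;
        qEnqueue(&q, pMsg);
    }
    pthread_mutex_lock(&q.lock);
    q.bDone = 1;
    pthread_cond_broadcast(&q.notEmpty);
    pthread_mutex_unlock(&q.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&q.notEmpty);
    pthread_mutex_destroy(&q.lock);
    return q.nDestroyed;
}
```

The invariant that matters for the bug hunt is visible in qEnqueue(): once a message is in the queue, the creating thread must never read or write it again. A derived property computed by the listener on an already-enqueued message would break that rule and could produce exactly the kind of race that only shows up in multi-threaded builds.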

Hunting for the segfault…

Do you remember? We are still hunting for a segfault in rsyslog that is very hard to find. The biggest problem is that most people will never experience it. I do not experience it in the lab, nor does anybody else who is currently working on the project. And without the ability to reproduce it, there is a lot of guesswork involved.

This is why we are asking for the help of our users (that means: you!). If you run rsyslog and experience a segfault, we would very much appreciate it if you could run a specially instrumented version. Peter Vrabec created it, and it contains debugging support as well as mudflap support, which is a tool to track down nasty memory management errors.

The RPM can be found at:

http://people.redhat.com/pvrabec/rpms/rsyslog/rsyslog-1.19.10-2.mudflap.src.rpm

Please install it. You will notice that CPU usage is higher than normal, but in most cases this is harmless. If you are concerned, drop me a line with specifics and I’ll happily address them.

If a segfault happens when you run this version, please send us:

  • the binary,
  • the core dump,
  • /var/log/rsyslog.mudflap, and
  • hardware and OS information (which distro? multiprocessor? …?)

Our sincere hope is that we receive enough reports to find something in common between them. So, please contribute your segfault info if you happen to have it. This is a great way to contribute to the project!

Space Shuttle

I thought it’s time to write a bit about the space shuttle itself. As many say, it is the most complex machine ever built.

The space shuttle made its maiden flight on April 12th, 1981 and will, based on current plans, be retired in 2010 after completing the construction of the International Space Station, its current primary task.

The space shuttle was developed as a replacement for the Apollo spacecraft. Unlike Apollo, it can only reach low Earth orbit; it is not capable of going to the Moon.

The space shuttle’s major components are the boosters, external tank and the orbiter. The orbiter is the airplane-shaped white “ship” that is commonly called “the shuttle”, though it is only part of it. The reddish external tank contains ascent fuel. And the white booster rockets on the sides of the external tank provide the main propulsion for the initial flight phase after launch.

Its main feature was the reusability of most parts. Only the external tank is lost on launch; the boosters descend back to Earth on parachutes after separation from the craft. The initial design promised huge savings from that fact, something the space shuttle could not live up to. Some sources quote that NASA expected as many as one flight per week and the shuttle to replace all other launch vehicles. In practice, only a few launches per year were achievable, and each of them was much more expensive than initially thought.

The space shuttle program was compromised by budget cuts in its early design phase. Initially, it was planned to have the actual orbiter sitting on top of the external tank and boosters. The main engines would obviously have been designed differently in this concept, too. The configuration eventually chosen, with the orbiter mounted to the side of these components, remains a source of trouble to this day. It exposes the shuttle to launch debris, for example parts of the external tank’s foam insulation that fall off during launch.

Launch debris is very hard to avoid. On launch, each spacecraft is shaken quite a bit, so chances are high that something will come off. With all designs but the space shuttle, this poses no problem, because no vital system can be hit by such debris. If you look at Apollo-era Saturn V launches, you will see lots of ice falling off, but the crew capsule and its support systems sat well protected above the debris source. Consequently, NASA’s new Constellation moon program designs an Apollo-like craft with the vital systems again sitting on top of the launch propulsion system.

In my personal opinion, the space shuttle is a good example of why budget constraints should not overrule engineering decisions. NASA paid dearly for the initial savings…

Besides that problem, the space shuttle is an incredible and fascinating machine. Among its many great achievements are the delivery and continued servicing of the Hubble Space Telescope. Also, construction of the International Space Station (ISS) depends on space shuttles doing the heavy hauling. The space shuttle is also the only spacecraft ever capable of capturing massive satellites in orbit and returning them to Earth.

The space shuttle is also very inspiring. Viewing a space shuttle launch is a special experience.


Astronauts also praise the space shuttle for its roominess and the smooth ascent and descent, which puts very low G-forces on the crew.

NASA’s future Constellation space program borrows heavily from both the space shuttle and the Apollo programs. It is expected to combine the best of both worlds. For example, Ares rockets will fly modified space shuttle boosters.

So while the space shuttle has some weaknesses, it is a very successful craft that not only contributed significantly to science, but will also help pave the way to the Moon, Mars and beyond. In my personal opinion, even the weaknesses were a kind of success: they showed which things needed to be done differently. And, of course, a lot of issues were already fixed during the lifetime of the space shuttle program.

Currently, the shuttle fleet is set to retire in 2010. This is a political decision not backed by hard technical facts. Personally, I would like to see the space shuttle flying at least once a year until the Ares I and Orion vehicles are ready to launch. Of course, I do not know exactly what this requires, but I am a bit hesitant to leave access to the International Space Station to the Russians alone. I also doubt that the gap in carrying humans into space will really be “just” six years; some of the Constellation program’s schedules are already slipping. And with an endeavor as complex as Constellation, it would be wise to count on more schedule slips. I wouldn’t be surprised if the first manned Ares flight does not happen before 2018…

The space shuttle has received numerous fixes both in procedures and technology. It is more capable than ever before. It is safer than ever before. Wouldn’t it be wise to count on it as long as its successor is not ready?

November 24th ISS spacewalk a success

Astronauts Peggy Whitson and Dan Tani conducted a successful spacewalk yesterday. It was the last in a series of important construction spacewalks that readied the International Space Station (ISS) to receive the European Columbus module.

Columbus will be delivered by space shuttle Atlantis’ STS-122 mission, set to launch from Kennedy Space Center on December 6th, 2007. As such, the success of the spacewalk was also important for STS-122; without it, a launch would not have been possible.

Now, with the successful spacewalk and everything going very smoothly in the processing of Atlantis, it looks like weather is becoming the only constraint for the launch attempt. This is good news, because STS-122 has a very short launch window of just one week, so there is not much room for delays.

Here is some more detailed information from the NASA homepage:

Spacewalkers Peggy Whitson and Dan Tani completed Saturday’s spacewalk at 11:54 a.m. EST. The 7 hour and 4 minute excursion started an hour and 10 minutes early. They completed their main tasks well ahead of the timeline then moved on to perform some get-ahead work.

The two spacewalkers moved the 300-pound, 18.5 foot Loop B fluid tray from the station’s main truss to the port side of Destiny and completed fluid and electrical connections.

Tani did an inspection of a Solar Alpha Rotary Joint that had previously shown increased power consumption and vibration while rotating as it followed the Sun. Whitson deployed and mated cables to be used as part of the Station to Shuttle Power Transfer System, or SSPTS. A portable foot restraint was also installed on Node 2 for upcoming spacewalks when the European Columbus laboratory is installed on the STS-122 mission.

If you like even more details, you can find them on an additional NASA page devoted to Saturday’s spacewalk.

Spaceports: Obama Would Delay Moon Return

I have some time to review other space blogs right now (an advantage of doing business with the US – if it is a holiday over there, I’ve some spare time, too ;)). I read this interesting report on potential cuts into NASA’s budget:

Spaceports: Obama Would Delay Moon Return

I do not like the idea at all. What needs to be known is that the money fueling NASA’s Constellation moon program is already taken from the regular budget. There was no budget increase that came with the plan to return to the Moon. NASA’s science program is already suffering very badly.

If additional funds are now taken from NASA’s budget, that would IMHO severely compromise NASA’s ability to do useful missions. Not to mention the fact that NASA would then depend on Russia for all its manned space flight activities for at least a decade.

Even though I am German, I do not like this idea at all. But, granted, it’s the same problem everywhere: Germany cut its space budget so much that even though we have a number of slots in current ESA and NASA missions, we do not have any funding left to use them :( …

Apollo Mission in Pictures…

I just found a nice link I’d like to share: a quick look at NASA’s Apollo missions in pictures. I personally think that the new “moon race” carried out now is at least as interesting as the Apollo missions. And I find it very interesting that NASA’s Constellation program is building on so many Apollo concepts.

Give the link a try, the pictures are really inspiring. BTW: does anybody have a recording of the old moon TV coverage? Having a few samples online would be really great…

Unlike Apollo, Orion will touch down on land

I just read an interesting article. With the Constellation program re-using so many of the clever Apollo-era concepts, I was under the impression that Orion capsules would splash down into the Pacific, too. But I was (probably) wrong.

NASA engineers plan to land Orion on solid ground, just like the Russian Soyuz capsules. While a land touchdown is challenging, it has a number of advantages. The Orion capsule is reusable (planned for up to ten missions). A splashdown in salt water means a lot of corrosion potential and thus a number of problems. It also requires an expensive fleet of recovery ships. So NASA has its preferences. In the picture, you can see testing of airbags that should absorb some of the remaining energy after the descent (though it is not expected to be much, as Orion descends on parachutes).

The landing area is supposed to be in the western United States. That comes as no surprise; a lightly populated area is definitely a plus for such an endeavor.

An ocean splashdown, however, has not yet been fully ruled out. NASA keeps this option open in case it is needed.

If you’d like to dig into all the details, I recommend this Scientific American article.

Happy Thanksgiving!

I wish a happy Thanksgiving to all my US readers! As it looks, most NASA folks will also enjoy a nice four-day weekend. Processing of space shuttle Atlantis for the STS-122 mission seems to be going so smoothly that only very limited work is scheduled from now until Sunday evening. It is really nice to see that the engineers worked so well that this is possible.

As far as I am concerned, I do not have a holiday over here, but obviously there will not be much to report. So do not expect too much STS-122 related news, except, of course, on the upcoming ISS spacewalk (November 24th), which is critical for an on-time launch of Atlantis.

Enjoy the holiday!

recent rsyslog changes

I am back to my routine of posting rsyslog changes. You may also infer that this means I am actually developing some things (and not just writing about it ;)). After a somewhat slow start today, things evolved quite nicely this afternoon. If I did not overlook anything important, I even managed to complete the “clean unload process” for loadable modules. That also brought me back to good working knowledge of the code. Actually, I am at least a day ahead of my schedule. Of course, I’ll check if I overlooked something, but that’ll be tomorrow.

So on to the promised change log (it also covers some past days where I had not reported):

2007-11-19
– applied gssapi patch from varmojfekoj – gss-api is now supported
– added some debug messages to ommysql
2007-11-20
– added user doc for gssapi patch from varmojfekoj – thanks!
– bumped version number to 1.20.0, because of new gss api functionality
2007-11-21
– begun to look at dynamic module unloading; this is currently a hack
and works with the mysql module only (which is the only one, so there
is no problem in practice). But it would be good to begin to do it right ;)
– added new modExit() entry point to loadable module interface
– added an identifier to command handler table – need to identify which
command handler entries need to be removed when module is unloaded
– added support so that linkedlist key can be used for owner handle
– enhanced llExecFunc to support deletion of list elements (on behalf of
user function being called, slight interface change)
– enhanced linkedlist class so that list elements can now be deleted based
on the key value they have
– created entry point so that CfSysLine handlers are removed on modExit()
– some cleanup
– modules are now correctly unloaded and de-initialized
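The unload-related entries above fit together roughly as follows. This is a minimal sketch under assumptions, not the actual linkedlist or cfsysline code: listExec() stands in for llExecFunc, with a callback that may return LL_DELETE to have the current element unlinked and destroyed during iteration, and each element's key holds an owner handle. addCfgHandler() and removeCfgHandlersOf() are hypothetical names for registering config-line handlers and for what a module's modExit() path could call to remove exactly the handlers it registered. The changelog lists callback-driven deletion and key-based deletion separately; for brevity the sketch builds the latter on top of the former.

```c
#include <stdlib.h>
#include <string.h>

/* What a per-element callback asks the iterator to do (hypothetical codes). */
typedef enum { LL_KEEP = 0, LL_DELETE = 1 } llAction_t;

typedef struct llElt {
    void *pKey;                 /* here: the owning module's handle */
    void *pData;
    struct llElt *pNext;
} llElt_t;

typedef struct {
    llElt_t *pRoot;
    void (*pDestructData)(void *pData);
} list_t;

static int listAdd(list_t *pList, void *pKey, void *pData)
{
    llElt_t *pElt = malloc(sizeof *pElt);
    if (pElt == NULL)
        return -1;
    pElt->pKey = pKey;
    pElt->pData = pData;
    pElt->pNext = pList->pRoot;     /* prepend; order does not matter here */
    pList->pRoot = pElt;
    return 0;
}

/* Calls fn on every element; an element whose callback returns LL_DELETE
 * is unlinked and destroyed on the spot, and iteration continues. */
static void listExec(list_t *pList,
                     llAction_t (*fn)(void *pKey, void *pData, void *pParam),
                     void *pParam)
{
    llElt_t **ppLink = &pList->pRoot;
    while (*ppLink != NULL) {
        llElt_t *pElt = *ppLink;
        if (fn(pElt->pKey, pElt->pData, pParam) == LL_DELETE) {
            *ppLink = pElt->pNext;          /* unlink before freeing */
            if (pList->pDestructData != NULL)
                pList->pDestructData(pElt->pData);
            free(pElt);
        } else {
            ppLink = &pElt->pNext;
        }
    }
}

static llAction_t countElt(void *pKey, void *pData, void *pParam)
{
    (void)pKey;
    (void)pData;
    ++*(int *)pParam;
    return LL_KEEP;
}

static int listCount(list_t *pList)
{
    int n = 0;
    listExec(pList, countElt, &n);
    return n;
}

/* Hypothetical registry of config-line handlers, keyed by owner module. */
static list_t cfgHandlers = { NULL, free };

static int addCfgHandler(const char *pszDirective, void *pOwner)
{
    size_t len = strlen(pszDirective) + 1;
    char *pszCopy = malloc(len);
    if (pszCopy != NULL
        && listAdd(&cfgHandlers, pOwner, memcpy(pszCopy, pszDirective, len)) == 0)
        return 0;
    free(pszCopy);
    return -1;
}

struct ownerMatch { void *pOwner; int nDeleted; };

static llAction_t deleteIfOwned(void *pKey, void *pData, void *pParam)
{
    struct ownerMatch *pMatch = pParam;
    (void)pData;
    if (pKey != pMatch->pOwner)
        return LL_KEEP;
    pMatch->nDeleted++;
    return LL_DELETE;
}

/* What a module's modExit() path could call: drop everything it owns. */
static int removeCfgHandlersOf(void *pOwner)
{
    struct ownerMatch match = { pOwner, 0 };
    listExec(&cfgHandlers, deleteIfOwned, &match);
    return match.nDeleted;
}
```

Walking the list through a pointer-to-pointer lets the iterator unlink the current element without tracking a separate predecessor, which is what makes deletion requested from inside the callback safe.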