Heat Shield Trouble looks worse…

The more I read about space shuttle Discovery’s heat shield trouble, the worse it looks. News that is not yet generally available points to a potentially serious issue, which could actually make a launch in the October/November time frame impossible. Thankfully, nothing is finalized yet and there is still hope that the situation is not as bad as it currently looks.

As far as my travel plans are concerned: while it is too early to panic, I have begun to think about fallback scenarios. I am currently checking what I can cancel and at what cost. From the cost perspective, it looks frightening, too. Is this really supposed to be the end of my launch viewing trip, at least for this launch window…?

I’ll post updates as soon as I get more news…

Trouble with space shuttle heat shield?

RCC Panels on Space Shuttle Discovery and an up-close view of them. They are an essential part of the shuttle's thermal protection system.
There has been some rumor about trouble with Discovery’s heat shield for a while now. Nothing is confirmed yet, but at the public raumfahrer.net forum a message popped up that there may actually be trouble.

This is the first public posting on the problem. If it really exists, that would probably be extremely bad: if some of the so-called RCC (reinforced carbon-carbon) elements were damaged, the repair would probably be very time-consuming. That could cause not just a slight delay but, in the worst case, make a launch of STS-120 in the set launch window impossible. Of course, this would be extremely bad news for me personally, too.

The RCC (reinforced carbon-carbon) panels are a vital part of the space shuttle’s thermal protection system. They gained notoriety as the cause of the Columbia disaster. During Columbia’s STS-107 launch, an RCC panel was damaged by foam debris falling from the external tank. On re-entry, that led to ultra-hot gases entering the orbiter’s interior, which in turn led to melting and the break-up of the spacecraft. All crew members lost their lives in this accident.

So problems with the RCC panels are to be taken seriously. And I am sure NASA will take them seriously. What gives me hope is that so far no official word from NASA is out. So, hopefully, these rumors are exactly that: rumors. I hope to get better information in the next few hours. Actually, I am very eager for any news: after all, it would more than depress me if I needed to cancel my launch viewing trip at this stage – especially as everything has gone exceptionally well so far…

malloc/free anomaly cleared

Peter Vrabec provided very helpful information on the anomaly I experienced with malloc/free under mudflap instrumentation. See his report:


$ gcc -lmudflapth -lpthread -fmudflapth mud.c
.........----------
mud.c: In function ‘main’:
mud.c:27: warning: return type of ‘main’ is not ‘int’
$ ./a.out
alloc p in thread: 0ea72530
alloc p in main thread: 0ea72460
freeing p from thread: 0ea72530
free done!
freeing p from main thread: 0ea72460
free done!
main thread exiting

$ gcc -lpthread -fmudflapth mud.c -lmudflapth
................................----------
mud.c: In function ‘main’:
mud.c:27: warning: return type of ‘main’ is not ‘int’
$ ./a.out
alloc p in thread: 1bffe3f0
alloc p in main thread: 1bffe440
freeing p from thread: 1bffe3f0
free done!
freeing p from main thread: 1bffe440
*** glibc detected *** ./a.out: double free or corruption (out):
0x000000001bffe440 ***
======= Backtrace: =========
/lib64/libc.so.6[0x32bde6e8a0]
/lib64/libc.so.6(cfree+0x8c)[0x32bde71fbc]
./a.out(__wrap_main+0x174)[0x400924]
/lib64/libpthread.so.0[0x32bea061b5]
/lib64/libc.so.6(clone+0x6d)[0x32bdecd39d]
======= Memory map: ========
bla bla bla

Note the position of the -lmudflapth argument: apparently it makes the difference between a clean run and a double-free abort. So, as it looks, the problem was really one of the instrumentation and not of rsyslogd itself. Bad as that is, we are still back to hunting a bug that is hiding well. But hopefully we’ll get somewhat closer now that mudflap is actually active… I’ll post news as soon as I have some.

Discovery Crew arrives at Kennedy Space Center

Space Shuttle Discovery's crew arrives at Kennedy Space Center (STS-120 mission)
On Sunday, space shuttle Discovery’s crew arrived at Kennedy Space Center. This is a major step in launch preparations. This week, the crew and technicians will participate in the so-called terminal countdown demonstration test (TCDT), in which launch and emergency procedures are practiced.

launch “on time” stats

I was pointed to this interesting article today:

http://cbs4.com/topstories/topstories_story_219064717.html

According to it, only 40% of space shuttle launches are on time. Interestingly, the number one reason for delays is technical issues, which are to blame for about half of the launch delays. The weather, which I thought would be number one, is actually only the second-most common reason: about a third of all launches are delayed due to bad weather.

work-around for malloc/free issue created

I have now created a work-around for the malloc/free threading issue: it creates an extra thread for the mainloop, and the startup thread then simply waits for that thread to finish. As a result, no memory is ever allocated in the “main” thread. At least for me, it now works with mudflap. I am still in doubt whether that was the segfault issue (or just a bug in mudflap), but at least we can give it a try.
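
To illustrate the idea, here is a minimal sketch of that pattern (the names are made up for illustration and are not the actual rsyslogd symbols): the start-up thread allocates nothing itself, it merely spawns a thread that runs the mainloop and then waits for it to terminate.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* hypothetical stand-in for the real mainloop: all dynamic memory is
 * allocated and freed here (or in threads created from here), never in
 * the start-up thread */
static void *mainloopThread(void *arg)
{
	char *buf = malloc(256);
	if (buf != NULL) {
		snprintf(buf, 256, "processing messages...");
		printf("%s\n", buf);
		free(buf);
	}
	return NULL;
}

int main(void)
{
	pthread_t mainloop;

	/* start-up thread: just create the mainloop thread... */
	if (pthread_create(&mainloop, NULL, mainloopThread, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}
	/* ...and simply wait for it to finish */
	pthread_join(mainloop, NULL);
	return 0;
}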

I will now see that I get some feedback. The next thing is to change the packaging back to a single source tarball (by popular request ;)).

A fire alert near the VAB

According to many news sources, a fire alert happened last Friday on Kennedy Space Center property. The Launch Control Center and possibly also the VAB were evacuated. NASA sources say the alert was not a drill, but it turned out to be a false alarm. So it was real, but a fire sensor didn’t work correctly and detected fire where there was none.

To the best of my knowledge, work at the launch pad was not (seriously) affected by the fire alert. However, all traffic, including tour buses, was stopped.

But now think about launch day: if such an alert happens then, the launch will be scrubbed for sure. I hope this will not happen while I am on my launch viewing trip.

pad flow still on track

From what I have read over the weekend, the launch pad processing of space shuttle Discovery is still on track. Fuels have been loaded onto the shuttle; they are used to power all sorts of auxiliary systems. A countdown test with the crew on site is scheduled for the middle of this week. That countdown test also seems to be a chance to do the APU hotfire test, which is still being considered.

NASA may skip APU hotfire test, crawler headed back…

The crawler transporter has now left launch pad 39A at Kennedy Space Center.
Still no official news from NASA. But from what I have gathered on several forums on the Internet, Discovery is still on schedule. A tight schedule – as said yesterday, there seem to be only a few hours of contingency left.

In spite of this, NASA is considering dropping the so-called APU hotfire test. The APUs provide power, e.g. for the hydraulics during re-entry. The APU test was originally scheduled for shortly after arrival at the pad. However, it could not be carried out due to bad weather. It is now targeted for next Thursday – if it is not canceled altogether. Canceling seems attractive to NASA, as it would probably save around one day of contingency. And the APU test is not considered absolutely vital – it is recommended only after a major overhaul of the orbiter (which, however, seems to have happened).

As a side note, the crawler transporter has now left the pad. Look at the picture above – not too long ago, this was where the crawler was parked. Driving it back seems a bit quick, as it usually stays until shortly before launch. Maybe this is another indication that NASA is very serious about launching as quickly as possible.

Could I really reproduce the bug…?

Today, I was able to actually test and debug rsyslog. Not just looking at code and how it may work – no, real interaction and real crashes.

Things went well, but then I got stuck. Somehow, the segfault didn’t make much sense. I found something that is related to the segfault users are seeing. But is it really the actual segfault or just a side-effect of the instrumentation?

With mudflap active, rsyslog crashes when freeing the message structure in the worker thread. The structure was allocated (malloc) in another thread, actually the “main” thread, that is the one rsyslog starts up in. Of course, I first assumed I had messed up the structure. But further analysis showed that I have not. So a bad feeling crept in … that there may be some thread safety issues with malloc/free. On the other hand, rsyslog is far from being my first multi-threaded program (though the first on a modern flavor of Linux, I have to admit). I have used dynamic memory allocation in multithreaded apps for years now, without any problems. After all, dynamic memory often saves trouble when multithreading.

Then I wrote a minimalistic program to check out the threading functionality. Here it is:


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static char *pthrd; /* allocated in the worker thread */
static char *pmain; /* allocated in the main thread */

/* allocates a buffer from inside a worker thread */
static void *singleWorker1(void *arg)
{
	pthrd = malloc(32);
	printf("alloc p in thread: %8.8x\n", pthrd);
	pthread_exit(0);
}

/* frees both buffers from inside a second worker thread */
static void *singleWorker2(void *arg)
{
	printf("freeing p from thread: %8.8x\n", pthrd);
	free(pthrd);
	printf("free done!\n");
	printf("freeing p from main thread: %8.8x\n", pmain);
	free(pmain);
	printf("free done!\n");
	pthread_exit(0);
}

void main()
{
	int i;
	pthread_t thrdWorker;

	i = pthread_create(&thrdWorker, NULL, singleWorker1, NULL);
	pthread_join(thrdWorker, NULL);
	pmain = malloc(32);
	printf("alloc p in main thread: %8.8x\n", pmain);
	i = pthread_create(&thrdWorker, NULL, singleWorker2, NULL);
	pthread_join(thrdWorker, NULL);
	printf("main thread exiting\n");
}

Note: the code originally contained sleep(1) instead of the pthread_join()s now found in it. I was initially too lazy to do it right in this tester. I’ve been told this is bad (a sleep only hopes the other thread has finished, while pthread_join guarantees it), so I fixed it. The result, however, is unchanged.

… and now look at the output:

cc -O1 -fmudflapth threadtest.c -lpthread -lmudflapth
threadtest.c: In function ‘main’:
threadtest.c:27: warning: return type of ‘main’ is not ‘int’
[root@localhost rsyslog]# ./a.out
malloc: using debugging hooks
alloc p in thread: 095586d0
alloc p in main thread: 095587f8
freeing p from thread: 095586d0
free done!
freeing p from main thread: 095587f8
*** glibc detected *** ./a.out: free(): invalid pointer: 0x095587f8 ***

free done!
main thread exiting
*******
mudflap stats:
calls to __mf_check: 0
__mf_register: 5179 [524294B, 32B, 20981024B, 0B, 2365B]
__mf_unregister: 0 [0B]
__mf_violation: [0, 0, 0, 0, 0]
calls with reentrancy: 5132
lock contention: 0
lookup cache slots used: 0 unused: 1024 peak-reuse: 0
number of live objects: 5179
zombie objects: 0

As you can see, the free of the memory malloc’ed in the thread I created manually works fine. But freeing the memory malloc’ed in the main thread fails miserably (I’ve set MALLOC_CHECK_=1, for the record).

I am both stunned and puzzled. If that is really a problem, it is clear why rsyslog aborts.

… but can that really be? I have to admit I now suspect a problem with mudflap – when the test program is compiled without it, everything works. But this applies only to the test program. Rsyslog doesn’t abort as quickly when compiled without mudflap, but it aborts in any case. So can there really be a problem with the way dynamic memory management is done across threads?

If you can contribute to the solution, please do. I really need every helping hand; this is probably one of the strangest situations I’ve ever seen [and, of course, everything will clear up once I see where I have failed – as always ;)].

Feedback appreciated!