begun working on rsyslog v3

I reproduce here a note that I sent out to the mailing list this morning. In the meantime, I have done most of the work in CVS.

As you know, I am looking at the way threading is supposed to work in future releases and, most importantly, looking at the inputs (like mark message generation).

Around summer, I wrote that I would probably need to release new major versions when we go into the multithreading redesign. It looks like we have reached this stage. I tried to keep a single code base that still supports both single- and multi-threaded operation. I have looked into this over the past days and I have to say that it creates a lot of complexity and hard-to-understand code.

For this reason, I think it is finally time to branch the code base and release some new versions.

Soon, I will create a branch for the current 1.20.1 code base. That branch will only receive bug fixes, but no new development (except, I guess, GSSAPI support, which is about to be contributed by Red Hat). When we are confident the last changes worked well and introduced no new bugs, there will be a version 2.0.0 stable release based on that code base.

CVS head, however, will then be rsyslog version 3. It will receive the new input module interface. It requires pthreads, because there is no way input modules and many of the other desired new features can be implemented without them. Consequently, I will remove all single-threading code from it, resulting in an easier-to-understand code base. Please note that I expect this code to change dramatically while it is being made more modular (much like it did when I introduced modular outputs in summer). Please also note that I will apply any non-bugfix patches to this code base only.

I have a somewhat bad feeling about going ahead with a more sophisticated and more parallel multi-threading model while we still have an issue with the segfault. However, I think by now we have done everything imaginable to capture that rare bug. I have come to the conclusion that the best chance to find it is to go ahead and implement the more sophisticated design. That will lead to a review, and rewrite, of much of the code in question, uncovering things we didn’t think about before. The recently discovered race condition is an excellent example.

One thing about the license: rsyslog 2 will stay with the “GPL v2 and above” license, but rsyslog v3 will be licensed under “GPL v3 and above”. I already wrote about that change. It is my firm belief that GPL v3 benefits our freedom to use digital goods. I am a strong opponent of digital restrictions management (DRM) and software patents, and I do not like the idea that rsyslog benefits anyone who encourages these things. I hope for your understanding.

I will set the stage for these changes now and will do a web announcement soon. Please don’t be surprised that rsyslog v3 will be available before v2; you now know the reason.

ISS Spacewalk on Tuesday

The International Space Station is viewed from space shuttle Discovery after undocking during the STS-120 mission.
The international space station (ISS) crew will put the time until the next space shuttle visits the orbiting complex to good use. A spacewalk is scheduled for next Tuesday. It is part of the ongoing troubleshooting of the solar array rotary joint (SARJ) problem that has been troubling the station for some weeks now.

The SARJ issue reduces power generation from the solar array. This is currently not a problem, but when more modules are added, it becomes a constraint. The Columbus module, to be delivered by Atlantis whenever STS-122 is ready to launch, can operate with the currently available power. However, the Kibo module, to be rocketed into space on STS-123, will probably exhaust the current power availability. As such, it is vital to solve the issue with the rotary joints.

The International Space Station’s solar array rotary joint (SARJ), shown in a NASA presentation.
Previous spacewalks found some material on the race ring, a result of abrasion. There is a backup race ring available, but it will not be activated until the root cause of the problem is understood.

And now let me quote the NASA ISS home page:

Station Commander Peggy Whitson and Flight Engineer Dan Tani will perform the 100th spacewalk in support of International Space Station assembly on Tuesday, Dec. 18. The spacewalk will focus on the starboard solar arrays. Whitson and Tani will examine the starboard Solar Alpha Rotary Joint (SARJ) and return a trundle assembly to the station’s interior.

Whitson and Tani also will examine the Beta Gimbal Assembly (BGA). It tilts solar wings for optimal power generation. The starboard BGA has been locked since some power feeds to it were interrupted last Saturday.

While spacewalk preparations are under way, the docked Progress 26 cargo ship is being loaded with discarded items and readied for undocking on Dec. 21. Progress 27 will arrive at the station with supplies on Dec. 26.

How large is an Orion Capsule?

Have you ever wondered how much space there is inside an Orion capsule? NASA tells us there is more than in Apollo, and it will carry a crew of up to six. Damaris B. Sarria, who wants to become an astronaut, has found some really nice pictures. Here is one of them; for the others – and some great reading – please visit Damaris’ blog.

A mockup Orion crew module.

It looks really tiny, doesn’t it? Compare it to the man in front of it. I wonder what it will be like to stay in there all the way to the moon. Obviously, the comfortable days of the space shuttle will not be seen again any time soon…

rsyslog changes up to 2007-12-12

It looks like I have become too lazy in reporting my changes. I’ll try to be quicker again in the future. Here is the part of the work log that was missing. Please note that it does not always mention my hard thinking about the new threading model ;)

2007-12-07
– applied patch from Michael Biebl to finally fix the -ldl cross-platform
issue
– fixed some type conversion warnings that appeared on 64 bit machines – these were in
debug statements, so indicated no real problem
– some code cleanup
– released 1.20.0 (finally ;))

2007-12-11
– When a hostname ACL was provided and DNS resolution for that name failed,
ACL processing was stopped at that point. Thanks to mildew for the patch.
Fedora Bugzilla: http://bugzilla.redhat.com/show_bug.cgi?id=395911
– fixed a small memory leak that happened when PostgreSQL date formatting
was used
– corrected a debug setting that survived release. Caused TCP connections
to be retried unnecessarily often.
– added expr.c, which has some thoughts on expression implementation
– fixed a potential race condition, see link for details:
http://rgerhards.blogspot.com/2007/12/rsyslog-race-condition.html
– added synchronization class to handle mutex-operations in the most
portable way.

2007-12-12
– handled selector flushing on termination (and hup) correctly. Could lose
some information before.
– done some more hard thinking on the threading model for upcoming
enhancements
– released 1.20.1

Shuttle Troubleshooting continues…

Just a short note, a longer one will follow (maybe tomorrow, it’s late over here…). I have just listened to the news teleconference with Mr. Wayne Hale. I am very glad to say that he is determined to fix the ECO sensor issue and sees good chances of doing so. To that end, a tanking test will be conducted next week.

Probably the following quote describes the whole situation: “The primary goal is to troubleshoot the system as it is and restore its functionality. We would only consider other measures if we fail with this.”

I hope they will succeed with that and we will have a great space shuttle launch in early January.

Shuttle Manager Hale’s Teleconference Statements

These are the notes I took during the December 11th teleconference. I am posting them as I have taken them. They are largely unedited, but IMHO speak for themselves. Direct quotes are within quotation marks, the rest are my observations. I penned this down during the teleconference, so I am pretty sure it is the exact wording.

“We set up 2 investigation teams. One is a near term team working on the current vehicle, second is a longer term team with experts from all around the agency.”

“do instrumented tanking test next Tuesday. Some instrumentation that we can put on some appropriate places at the circuits … We can capture the location in circuit in TDR, a commercially available technology … we have a high degree of confidence in pinpointing the location and once we know the location we can put together a fix.”

“STS-122 launch could definitely be a bit later than January, 2nd”

“I’ve been committed to fixing it. An intermittent electrical problem is really hard to fix. We thought we had fixed it, but we didn’t. So I might say we are RE-committed to fixing it”

Reporter: How concerned are you about the whole ECO sensor system (touching on the email where Mr. Hale considered retiring this system as unreliable)?

“This low level cutoff capability is a safety system that has never been used in flight… like a seat belt. If you don’t have anything bad, you probably don’t need it. If you needed it and it didn’t work it could be really bad. We would like to have this system functional and we would like to restore functionality of this system”

“There are other tests, bench tests of the equipment at the manufacturers in parallel to the tanking test. The tanking test is a hazardous operation.”

Reporter: if STS-122 does not launch on January 2nd, when then? “I’ll let you know when we have our tanking test done and we have some data”

“The problem only occurs when we have cryo conditions present. It all works well at normal temperatures. We are going to find out where the problem is.”

If the LCC criteria (number of working ECO sensors) would be changed: “Our point is to try to fix the problem and then go back to the previous LCC. Rather than speculate, let us wait and try to fix this problem.”

“TDR is not flight equipment, it’s not qualified. It’s a ground system only.”

Once again on the email about retiring the ECO sensors: “Our thinking has evolved from Friday when I wrote that little note.”

“We hope it repeats one more time on our test next week when we have the instrumentation”.

“Until we come to the bottom of this mystery, we are in no better shape launching any other orbiter” when asked about swapping assembly flights or orbiters to get off the ground.

“a single circuit is around a hundred feet of wire from the PSB to the sensing element.”

Test details:
“We have to physically cut the wires; we are talking about the ECO and the 5% sensors, which are in the same pass-through and connector. We have to have people present to run that equipment. We send the red crew out during the stable replenishment phase. And if the problem is as in the past, it will stay with us during the stable replenish phase. We cannot have people present during tanking for safety concerns”

On a Christmas break: “We are thinking about taking a few days off to allow our folks to have a few days with their families. We’ll make that decision shortly after the tanking test.”

“The splicing of wires in the aft compartment is a standard procedure. We have identified a place where we can access the circuits that are readily accessible.”

“And then restore the circuits together when we are done with the tanking test.”

“Take the sensors’ wire harness and the LH2 pass-through connector and put them in a facility with cryo fluids and monitor how these things respond in a lab setting. Either liquid helium, warmer, or liquid hydrogen (LH2). We look at how all these piece parts work. We did piece-part testing before; now we are doing the integrated circuit from the LH2 pass-through to the sensors themselves.”

“The liquid helium low level cutoff ability (LLCO) was present from the beginning, I am pretty sure it was STS-1 (99% positive). It is not a new system that has changed dramatically in design or manufacturing. The FPR is not significantly different today than it was over the past 10/15 years. Just enough reserves to protect it from normal engine and operating procedures. The voltage indication system has been on STS-118 and 120 only, which were flawless.”

“The primary goal is to troubleshoot the system as it is and restore its functionality. We would only consider other measures if we fail with this.”

Next Tuesday full tanking test? “We need to fill the tank fully because, as I said, it is our safety requirement that to have people on the pad we need to be in stable replenish. It is being reviewed which aux systems need to be operating. The launch team is finalizing the procedures.”

“I cannot put my finger on anything that is especially difficult, it’s just normal operations in an unusual environment.”

ISS Stage EVA next week, Thursday afternoon, 2ET, possibly next Tuesday, SARJ

rsyslog race condition fixed

There is a race condition when running rsyslog 1.20.0 or any previous release in multi-threaded mode. The probability of it happening is very remote, but there definitely is a design flaw. Quick communication revealed, unfortunately, that this flaw cannot be responsible for the hard-to-track segfault bug. The segfault occurs in a situation that does not match what I have found. I discovered this problem when I worked on the multi-threading redesign and focused on input modules. Maybe my decision to hold off the redesign until after the segfault bug had been found was wrong. The redesign forces me to look at more places from a very different angle, and that may reveal even more (at least I hope so).

Now here is the story on the “mark” race condition I discovered today:

fprintlog() duplicates the message object when we have a “last message repeated n times”. It does this by saving the pointer to the message object in a temporary buffer, carrying out its work and then checking if it needs to restore the saved pointer. This works well in single threading as well as in almost all cases when running multi-threaded. However, when we reach a mark interval, domark() calls fprintlog() potentially concurrently with a call that is already in progress. What can happen is:

  1. domark() calls fprintlog() on an action
  2. fprintlog() begins execution and saves the previous message to the buffer
  3. fprintlog() is preempted
  4. the worker thread now calls into fprintlog() with the exact same message
  5. fprintlog() processes the message and finishes (deletes it)
  6. now processing of that thread ends and our first thread is resumed
  7. fprintlog() performs its actions and restores the already freed message object pointer

Now, the action holds an invalid pointer. Depending on what happens next, it either aborts (if the buffer has been overwritten) or continues to function but does a double-free.
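To make the sequence above a bit more concrete, here is a minimal C sketch of the unsynchronized save/restore pattern. The structure and function names (struct action, pMsg, doOutput, …) are made up for illustration and do not match the real rsyslog source:

#include <stddef.h>

/* Minimal sketch of the unsynchronized save/restore pattern described
 * above. The names only approximate rsyslog's real structures; this is
 * not the actual source code. */

struct msg { int dummy; };          /* stand-in for the message object      */
struct action {                     /* stand-in for the action/selector     */
    struct msg *pMsg;               /* last message seen by this action     */
    int repeatCount;                /* "last message repeated n times"      */
};

static void doOutput(struct action *a, struct msg *m) { (void)a; (void)m; }

static void fprintlog_sketch(struct action *a, struct msg *newMsg)
{
    struct msg *saved = a->pMsg;    /* step 2: save previous message ptr    */

    /* Steps 3-6: if this thread is preempted here, a concurrent call on
     * the same action can process and free the very message we just saved,
     * because access to the action object is not synchronized. */

    doOutput(a, newMsg);            /* do the actual work                   */

    if (a->repeatCount > 0)
        a->pMsg = saved;            /* step 7: may restore a freed pointer  */
}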

The root cause is that access to the action object is not synchronized. This was deemed unnecessary, because no concurrent write operations could be in place. The domark() processing, however, had been overlooked.

This analysis is still preliminary, but it points in a good direction. It needs to be said, though, that the probability of this failure scenario is remote. I have confirmed that this is a race condition.

If you think about it, the mark() processing as a whole does not make much sense if we have a full queue. It is awfully flawed. I don’t like mark(): in the original sysklogd code, there was a similar issue: mark() was called by an alarm() timer and executed the full syslogd code during its processing. That led to serious problems if some other message was being processed at the same time. I solved that issue by setting just a flag in the alarm() handler. Actual mark() processing was then started in the mainloop(). For single-threading mode that works, because no other thread can be in the action processing at that time.
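For reference, here is a rough sketch of that flag-based approach, in the spirit of the sysklogd code but with illustrative names only:

#include <signal.h>
#include <unistd.h>

/* Sketch of the "set only a flag in the alarm() handler" approach
 * described above; names and details are illustrative, not the actual
 * sysklogd/rsyslog code. The handler is assumed to be installed with
 * signal(SIGALRM, sigalrm_handler) at startup. */

static volatile sig_atomic_t bRequestDoMark = 0;

static void sigalrm_handler(int sig)
{
    (void)sig;
    bRequestDoMark = 1;     /* just record the request, do no real work   */
    alarm(30);              /* re-arm the timer (30 s interval in sysklogd) */
}

/* somewhere inside the mainloop: */
static void mainloop_iteration(void)
{
    if (bRequestDoMark) {
        bRequestDoMark = 0;
        /* domark();  -- actual mark processing happens here, outside the
         * signal handler. In single-threaded mode this is safe because no
         * other code can be inside action processing at the same time. */
    }
    /* ... wait for and process the next message ... */
}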

In multi-threaded mode, however, the mainloop() runs on a thread different from the worker thread. So in fact, domark() can once again conflict with action processing. And if the queue is full, it does totally wrong things: it uses whatever message is currently being processed as the basis for emitting mark messages. This is seriously flawed! The root cause is that mark() processing does not go through the producer/consumer queue/gateway. This is what I now need to fix.

What mark() does is first inject the “--mark--” message. That is no problem, because it is injected via the regular producer interface logmsgInternal(). But then, it calls domarkActions() on each action, which in turn calls fprintlog(). It also accesses the message’s then-current f_prevcount, which, with a full queue, has nothing to do with the last element seen at that time.

The more I look at the code, the more I wonder what exactly the useful feature is. I just checked the current sysklogd source and, yes, it is still there (even the fact that domark() is called from an alarm() handler is still there…). Interestingly, in both sysklogd and rsyslogd the “last message repeated n times” periodic display is turned off when mark messages are turned off. Is this intentional? I doubt it…

So what did the original sysklogd author think when he wrote that code? I guess he wanted to have an indication that messages had been dropped – and that not only when the next message with different text arrived, but after a set period (30 seconds with the current defines, both in rsyslog and sysklogd). So message compression should indicate at least every 30 seconds that messages arrived but were compressed. OK, that gives me something to think about.

Obviously, there is no point in emitting the “last message repeated n times” message if we have, let’s say, 100 identical messages sitting in the queue followed by at least one non-identical message. In that case, the queue will be processed as fast as possible and, upon arrival at the non-identical message, the “repeated message” will be issued. No need to say anything in between. If, however, there is no such non-identical message, rsyslogd is left in a somewhat open state. The queue is depleted, but still no output is written (even though “n” messages have not been reported). Only in this scenario is it appropriate to start a timeout timer that will ultimately lead to the “repeated message” when no further non-identical message arrives in the allocated time window.

To be more precise, it is not actually a question of messages being in the queue. As we have sophisticated filtering in rsyslog, the question actually is whether a message was processed (e.g. written to a file) by an action within the “repeated message” time window (RMTW). When the last message was processed can be quite different from action to action.

One way to track this is to record when each action was last successfully called. If the queue is reasonably full and the action is supplied with data to be processed at a steady rate, that time should never fall outside of the RMTW. And if it does, isn’t that an indication that it is time to write a “repeated message” out, so that the operator sees at least one indication in every RMTW? Of course it is! So we can store this timestamp with each action and use it as a base check.

Now, we can periodically wake up and check each action object: did it last process something outside of its RMTW, and has it received any repeated messages? If so, it is time to emit the “repeated message” message. The fine difference to the existing code is that we use the newly constructed timer here. Also, the action object must be locked, so that this background process and the worker thread(s) do not access the same vital data structures at the same time. The scenario from the top of this post would otherwise still apply. Finally, this processing should be de-coupled from the mark processing, because IMHO these are two totally separate things.
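A minimal sketch of what I have in mind, with purely illustrative names (action_t, the RMTW value and so on – not the final rsyslog code), might look like this:

#include <pthread.h>
#include <time.h>

/* Sketch of the per-action "repeated message" check described above. */

#define RMTW 30                     /* repeated message time window, seconds */

typedef struct action {
    pthread_mutex_t mut;            /* guards all fields of the action object */
    time_t lastProcessed;           /* when the action last processed a msg   */
    int    repeatCount;             /* number of compressed ("repeated") msgs */
} action_t;

/* called periodically by a background thread, once per action */
static void checkRepeatedMsg(action_t *pAction)
{
    pthread_mutex_lock(&pAction->mut);
    if (pAction->repeatCount > 0 &&
        time(NULL) - pAction->lastProcessed > RMTW) {
        /* emitRepeatedMsg(pAction);  -- write "last message repeated n times" */
        pAction->repeatCount = 0;
        pAction->lastProcessed = time(NULL);
    }
    pthread_mutex_unlock(&pAction->mut);
}

/* The worker thread takes the same mutex before it touches lastProcessed or
 * repeatCount, so the race described at the top of this post cannot occur. */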

I will now go ahead and apply these changes and then we shall see where this brings us.

I have now done a first implementation. Actually, the code does not look that much different compared to before the change. The difference is that I have handled the timestamp thing a bit more transparently and, the biggie, I have used synchronization macros to guard the action object against the race condition. This code will become part of 1.20.1.

A design problem…

Folks, I am facing a design problem – and it looks so simple that I am pulling out all my hair ;)

I am currently preparing the next steps in the modular rsyslog redesign. I am not sure yet what to do first; there are a couple of candidates. One is to add a real expression capability, another is to add threaded inputs (which would be quite useful). In support of these enhancements, a number of things need to be changed in the current code. Remember, we are still running on large parts of the original sysklogd code, which was never meant to do all these advanced things (plus, it is quite old and shows its age). A cleanup of the core, however, requires some knowledge of what shall be done with it in the future.

My trouble is about a small detail. A detail I thought should be easy to solve by a little bit of web searching or a forum post or two. But… not only did I not find the relevant information, I did not even find an appropriate place to post. Maybe I am too dumb (well possible).

OK, enough said. Now what is the problem? I don’t know how to terminate a long-running “socket call” in a proper way under *nix. Remember, I have done most of my multithreading programming in the past ten years or so under Windows.

What I want to do: Rsyslog will support loadable input modules in the future. In essence, an input module is something that gets data from a data source (e.g. syslog/udp, syslog/tcp, kernel log, text file, whatever …), parses it and constructs a message object out of it and injects that message object into the processing queue. Each input module will run on its own thread. Fairly easy and well-understood. The problem happens when it comes to termination (or config reload). At that instant, I need to stop all of these input module threads in a graceful way. The problem is that they are probably still in a long-lasting read call. So how to handle this right?

Under Windows, I have the WSACancelBlockingCall() API. Whenever I call that method, all threads magically wake up and their read and write calls return an error state. Sweet. I know that I can use signal() under Linux to do much the same. However, from what I read on the web, I have the impression that this is not the right thing to do. First of all, it seems to interfere with the pthreads library in a somewhat unexpected way, and secondly there is only a very limited set of signals available … and none left for me?

The next approach would be to have each blocking call time out after a relatively short period, e.g. 10 seconds. But that feels even worse to me. Performance-wise, it looks bad. Design-wise, it looks just plain ugly, much like a work-around. It feels like doing something without knowing what the right thing is (which, as it turns out, is a pretty accurate description of the situation at the time being ;)).
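Just to illustrate what I mean, a timeout-driven input loop would look roughly like the following sketch. The names are hypothetical (e.g. the bTerminate flag) and this is the work-around I just described, not a final design:

#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Sketch of the "short timeout" approach: instead of blocking forever in
 * recv(), the input thread waits with a timeout and re-checks a
 * termination flag. Illustrative only. */

extern volatile int bTerminate;         /* set by the main thread at shutdown */

static void inputLoop(int sock)
{
    while (!bTerminate) {
        fd_set rds;
        struct timeval tv = { 10, 0 };  /* wake up at least every 10 seconds */

        FD_ZERO(&rds);
        FD_SET(sock, &rds);
        if (select(sock + 1, &rds, NULL, NULL, &tv) > 0) {
            char buf[2048];
            ssize_t len = recv(sock, buf, sizeof(buf), 0);
            if (len > 0) {
                /* parse buf, build a message object, enqueue it ... */
            }
        }
        /* on timeout we simply loop and re-test bTerminate */
    }
}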

To make matters worse, I have a similar problem not only with the read and write calls but with other constructs as well. For example, I’d like to have a couple (well, one to get started) of background threads that handle periodic activity (mark messages immediately come to my mind). Again, I would need a way to wake them immediately when it comes to termination time.

And, of course, I would prefer to have one mechanism to wake any sleeping thread. Granted, I can’t do that under Windows either, so I may need to use different constructs here, too.
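For the periodic background threads, one candidate I am looking at is pthread_cond_timedwait(), which sleeps for an interval but can be woken immediately by signaling the condition at shutdown. Again, this is just a sketch with made-up names, not actual rsyslog code:

#include <pthread.h>
#include <time.h>

/* Sketch of a periodic background thread (e.g. for mark messages) that can
 * be woken immediately at termination time. Purely illustrative. */

static pthread_mutex_t mut  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int bShutdown = 0;

static void *periodicThread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mut);
    while (!bShutdown) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 30;                    /* periodic interval (sample value) */
        /* returns early if the condition is signaled, e.g. at shutdown */
        pthread_cond_timedwait(&cond, &mut, &ts);
        if (!bShutdown) {
            /* doPeriodicWork();  e.g. emit mark messages */
        }
    }
    pthread_mutex_unlock(&mut);
    return NULL;
}

static void requestShutdown(void)
{
    pthread_mutex_lock(&mut);
    bShutdown = 1;
    pthread_cond_signal(&cond);             /* wake the thread right now */
    pthread_mutex_unlock(&mut);
}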

This is the current state of affairs. There is still enough work to do before the question MUST be answered in order to proceed. But that point in time approaches quickly. I would deeply appreciate any help on this issue. Be it advice on how to actually design that part of the code, or advice on where to ask for a solution! Really, a big problem is that I did not find an appropriate place to ask. Either the forum is not deeply technical enough, or there are mailing lists where the topic is something really different. If you know where to ask, please tell me!

[update] In the meantime, I have found a place to ask. Believe it or not, I had forgotten to check for a dedicated newsgroup. And, of course, there is one ;) The discussion there is quite fruitful.

Space Shuttle ECO Sensors: an in-depth View

Space Shuttle ECO Sensor during Testing.
After the scrub of space shuttle Atlantis’ December 2007 launch window, everyone is interested in the ECO sensors. That shuttle component is responsible for the scrub. Unfortunately, detailed information about it is hard to find.

However, I was able to obtain some good information. Most helpful was NASA’s “STS-114 Engine Cut-off Sensor Anomaly Technical Consultation Report”. I also used other NASA sources for my writeup, including information conveyed at the post-scrub press conferences.

Let’s start with an interesting fact that space shuttle program manager Wayne Hale provided in a press conference. According to him, the ECO sensors are an Apollo heritage. Their design dates back to the 1960s. Consequently, they are analog “computer systems”, which look quite strange compared to today’s technology.

I could not find any indication of sensor malfunction prior to STS-114, the “return to flight” mission. However, I have been told that pre-STS-114 flights did not have the same rigorous checks in the flight procedures as exist today. So it may very well be that there always were problems with the sensors, but they were “just” never detected.

It should also be noted that there has never been a space shuttle main engine cutoff due to an ECO sensor (I need to correct this a bit later – but let’s keep it this way for the time being). It is believed, however, that on some flights the cutoff happened just a second or so before the ECO sensors would have triggered one. The amount of fuel left in the tank cannot be analyzed post-flight, as the external tank is the only non-reusable component of the shuttle stack and is lost after being separated from the orbiter.

But now let’s dig down into some hard technical facts: a good starting point are the graphics that NASA posted on the space shuttle home page. I’ll reproduce them here, but due to the blog theme, they are a bit small. Click on each image for a high-res version. It will open up in a new window, so that you can read along.

There is a drawing that puts together all the pieces. It is an excellent starting point:

Space Shuttle ECO Sensors: Overview
A brief word of caution, though: the picture is titled “LH2 ECO Sensor Locations” for a good reason. It is about the liquid hydrogen (LH2) tank sensors. There are also others, as we will see below. Let’s stick with the LH2 ones for the time being. As far as I know, the LH2 sensors were also the only trouble source in recent shuttle launch attempts.

This is also where I need to correct myself. There actually have been main engine cutoffs due to ECO sensors, but none of them happened due to the liquid hydrogen sensors. As far as I know, there were three missions where it happened, among them STS-51F and STS-93.

The image shows that the ECO sensors are located right at the bottom of the tank – which makes an awful lot of sense, as they should indicate depletion. There are four of them, mounted in a single row on the shock mount. Each of them has a housing containing the actual sensing element. Even though this is not shown on the above overview, let me add that there are a lot of additional components that make up the so-called “ECO sensor”. That can be nicely seen in this schematic:

Space Shuttle ECO Sensors: Overall Schematic
The actual sensing element of the space shuttle’s ECO sensor system.
First of all, you’ll probably notice that it is more appropriate to speak of a “sensor system” than just of a “sensor”. If we talk about sensors, most of us simply think about the actual sensing element, seen to the right here. Obviously, that falls far too short. You must think about the whole system to understand the problem. So think sensor element, electronics and electrical connections. All of this makes up what we call the “ECO Sensor”. In my personal opinion, there is a lot of misleading information and discussion on the public Internet these days. Part of this misunderstanding IMHO stems from the “sensor” vs. “sensor system” issue. Many folks express that they don’t understand why “such a simple sensor issue” cannot be fixed. I guess that was even the motivation to write this post, but, hey, I am getting off-topic. On with the technical facts.

Next, you’ll notice that the ECO sensors are just a few of the many sensors that provide the tank level information (the “point sensors”). All of these sensors are the same. The ECOs are in no way special, except for their name. ECO stems from “Engine Cut Off” and is attributed to the fact that these sensors are an emergency line of defense to shut down the engines if other things have already gone wrong (if all goes right, the ECOs are never used, but it is the ECOs that ultimately detect that something went wrong…).

If you count, you’ll find twelve sensors: the four ECO sensors, one 5%, two 98%, one 100% minus, two 100%, one 100% plus and one overfill point sensor. Note that there are sensors in both the liquid hydrogen (LH2) and liquid oxygen (LOX) tanks. Each has twelve, so there is a total of 24.

A notable difference is the location of the ECO sensors: for LH2, they are at the bottom of the external tank, while for LOX they are in the feedline inside the orbiter. In plain words, that means the LOX ECO sensors report very late, while the LH2 sensors report early in the process of tank draining. This can be attributed to the fact that a fuel(LH2)-rich engine shutdown is required. I also assume that the risk of fuel pump overspeed and explosion is by far higher for the LH2 part of the system (but that is just my guess; I found no hard facts backing it).

The number of sensors at each position tells you something about their importance: it for sure is no accident that most positions are covered by one sensor, while the 98% and 100% locations have two and the depletion location has four! Obviously, depletion is a major concern.

Which brings us to the point: why four? Let’s spell it out if it is not clear yet: it’s “just” for redundancy and backup. If there were just one sensor, a single-sensor failure could be fatal. If it failed dry, it would cause an unnecessary (and comparatively risky) launch abort; if it failed wet and something else went wrong, it could lead to vehicle destruction. Either way is not really desired, though obviously one case is better than the other.

To mitigate that risk, there are four sensors. But how are these put to use? A simplistic approach could be that a poll is taken and the majority wins. So if we have one sensor reading dry and three reading wet, we would go with wet. Obviously, there would be a problem with a 2 dry/2 wet state. So our simplistic model is too simplistic. But I hope it conveyed the idea. What the system really does is a bit different:

First of all, there is a construct called “arming mass”. Keep in mind that the ECO sensors themselves are “just” a backup system to handle the case when other things have gone wrong before. Space shuttle computers continuously monitor engine performance and calculate fuel used. So there is a rough idea of how much fuel is left in the tank at any given moment. However, these calculations may not be 100% perfect and may not detect some malfunctions, thus it is risky to rely on them alone. To mitigate that risk, the ECO sensor system has been put in place.

Now let’s take an extreme example. Let’s say an ECO sensor switches to dry just one second after launch. Would you trust it and assume the tank is already drained? I hope not. There are some points in flight where both logic and physics tell us that the tank cannot be depleted. In fact, during most of the ascent it cannot. But when we come close to main engine cutoff, then fuel may actually be used up. Only at that stage is it useful to look at the ECO sensors. This is what “arming mass” is all about. The shuttle’s computers continuously compute the estimated fuel left, and only when the estimate comes within the last 8 to 12 seconds of fuel depletion are the ECO sensors armed.

This has a bonus, too. If an ECO sensor indicates “dry” before we reach arming mass, we can assume the sensor has failed. So that sensor will no longer be able to cast its vote when it later comes to aborting the launch. Please note, however, that it is not possible to detect a “failed wet” sensor in the same way. Sensors are expected to be “wet” during ascent, and reading “wet” obviously does not disqualify a sensor.

The ECO sensor mountpoint inside the space shuttle’s external tank. As can be seen, they are mounted close to each other.
Once the arming mass has been reached, shuttle computers look at those sensors with a healthy status. If a single sensor indicates “dry”, the computers initially assume a sensor failure. Remember: all sensors are mounted at the same location (see picture to the right), so they should theoretically indicate “dry” all at the same instant. However, that sensor is not disqualified. When any second healthy sensor joins the first one in reading “dry”, shuttle computers assume an actual tank depletion.

They do not wait for the remaining qualified sensors, in that case assuming those have failed “wet”. So whenever two qualified ECO sensors indicate “dry” after the space shuttle has reached “arming mass”, an abort will most probably be initiated. That means the space shuttle main engines will be cut off in a controlled and non-destructive way (which means a fuel-rich shutdown). Depending on when and how exactly this happens, it may lead either to an abort to a transatlantic landing (TAL) site or to an abort to orbit (ATO). I guess it may even be possible to reach the desired orbit with the help of the orbital maneuvering system if the engine cutoff happens very shortly before its originally scheduled time.
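To summarize my understanding of the logic, here is a toy model in C. This is of course not flight software and the names are mine; it merely illustrates the “disqualify early-dry sensors, then require two qualified dry readings” idea:

#include <stdbool.h>

/* Toy model of the arming/voting logic as described above. Illustrative
 * only; not actual shuttle flight software. */

#define NUM_ECO 4

typedef struct {
    bool qualified;     /* false once the sensor is assumed failed       */
    bool readsDry;      /* current reading: true = "dry", false = "wet"  */
} eco_sensor_t;

/* before arming mass: a "dry" reading disqualifies the sensor */
static void preArmingCheck(eco_sensor_t s[NUM_ECO])
{
    for (int i = 0; i < NUM_ECO; i++)
        if (s[i].readsDry)
            s[i].qualified = false;
}

/* after arming mass: two qualified sensors reading "dry" trigger cutoff */
static bool cutoffRequired(const eco_sensor_t s[NUM_ECO])
{
    int dryVotes = 0;
    for (int i = 0; i < NUM_ECO; i++)
        if (s[i].qualified && s[i].readsDry)
            dryVotes++;
    return dryVotes >= 2;
}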

Please let me add that the actual procedure for tank depletion must be even more complicated than briefly outlined here. For example, what happens if three of the ECO sensors disqualify themselves by indicating “dry” early in the ascent? Will the remaining single sensor then decide about a launch abort? Also, what happens if all four fail early? I don’t like to speculate here; if somebody has the answer, please provide it ;) In any case, you hopefully have gotten some understanding now that the ECO sensor system, and putting it to use, is not as simple as is often written on the web these days…

Now let’s look a little bit at where the sensors are located. If you paid attention to the above drawing, you have noticed the black lines which separate parts in the tank from parts in the orbiter (and those from the parts at the mission control center on the ground).

The best picture of the actual ECO sensor housing I could find is this one:

Space Shuttle ECO Sensors during a test procedure
Please note that it shows the ECO sensors during a test, in a test configuration. The mount is different from the actual one in the external tank.

The computers driving the sensors are located in the orbiter’s avionics bay:

Space Shuttle ECO Sensors: Orbiter Avionics Bays
This and the following drawings mention the “point sensor box”, PSB for short. Remember that the sensors together are the “point sensors” and the ECO sensors are just point sensors with a special name and function. NASA also lets us know where exactly the point sensor box is located in the shuttle’s aft:

Space Shuttle ECO Sensors: Orbiter Aft Avionics Bays
And finally, we have some more information on the point sensor box itself:

Space Shuttle ECO Sensors: Functional Block Diagram of Point Sensor Box
The point sensor box interprets the sensor readings. The sensor elements provide a voltage. A certain low voltage level means “dry”, while certain higher voltage levels are interpreted as “wet”. However, somewhat above the “wet” levels, the reading becomes “dry” again. That level is reached when there is an open circuit.

NASA also provided an exploded view of the point sensor box:

Space Shuttle ECO Sensors: Exploded View of Point Sensor Box
To me, it just looks like a box for electronics and I do not get any further insight from looking at the drawing. But anyway – it’s nice to know…

I could not find pictures of the not-yet-mentioned parts of the sensor system: the connectors and cables. Somehow the in-tank sensors and the on-board point sensor box must be connected to each other. This is done via cables and connectors. Those must also be looked at when thinking about the system as a whole, especially as the failure readings we see point to an open circuit. I have read that some of the cables are below the external tank foam, so it is not easy to get to them.

I have heard that cryogenic temperatures are probably part of the trouble, because failure readings seem to happen only when the tank is filled (and thus very cold). One could assume that the shrinking of ultra-cold material is part of the problem, but again, I have not found any credible references for this – or any other thermal results.

So it is now probably time to go right to the source. Below, find reproduced the deeply technical description from the STS-114 paper quoted right at the start of this posting (quoted text in italics):

The MPS ECO system consists of point-sensors installed in the ET liquid hydrogen (LH2) tank and the Orbiter’s liquid oxygen (LO2) feedline. Point sensor electronics are designed to condition signals and to provide appropriate stimulation of the sensors and associated wiring and connectors.

Space Shuttle ECO Sensors: Overall Schematic

The point sensor electronics interprets a low resistance at a sensor as the presence of cryogenic liquid, which provides a “wet” indication to the Multiplexer/De-Multiplexer (MDM) for use by on-board General Purpose Computers (GPCs) and the ground Launch Processing System (LPS). Conversely, a high resistance is interpreted as a “dry” indication. The point sensor electronics include circuitry suitable for pre-flight verification of circuit function and are designed to fail “wet”. For example, an open circuit in the sensor, or an open or short in the signal path, will provide a “wet” indication to the MDM. The system is then activated and checked out during launch countdown and remains active during ascent.

The actual sensing element of the space shuttle’s ECO sensor system.
An ECO sensor is depicted in the next Figure. The sensor consists of a platinum wire sensing element mounted on an alumina Printed Wiring Board (PWB) and is encased in an aluminum housing. The sensing element acts as a variable resistance which changes on exposure to cryogenic liquid. This resistance variation is detected by post-sensor (signal conditioning) electronics and is used to generate either a “wet” or “dry” indication as noted above.

Space Shuttle ECO Sensors: System Overview

The ECO system is designed to protect the Space Shuttle Main Engines (SSMEs) from catastrophic failure due to propellant depletion. Flight software is coded to check for the presence of “wet” indications from the sensors within 8 to 12 seconds of SSME shutdown. The software rejects the first “dry” indication observed from any of the ECO sensors, but the presence of at least two more “dry” indications will result in a command to shutdown the SSMEs (i.e., “dry” indications from two of four “good” sensors are required for SSME shutdown). Early SSME shutdown would probably lead to a contingency Trans-Atlantic (TAL) abort. A failed “wet” indication cannot be detected. The system is designed so that LO2 depletion should occur first. However, a failure “wet” indication of three of the four LH2 sensors, coupled with an SSME problem that results in early LH2 depletion, could result in catastrophic failure of a SSME. Failure probability is considered remote, but would almost certainly be catastrophic to the flight vehicle. The system architecture addresses redundancy with one point sensor box containing four groups of sensor conditioner circuit cards. Each card can accommodate one hydrogen and one oxygen sensor. Each card group has its own power converter and one sensor conditioner card from each group services a pair of ECO sensors (again, one hydrogen and one oxygen). Wiring for each of the eight ECO sensors is split into one of two groups of sensors which are routed through separate Orbiter / ET monoball connections.

Let’s wrap up: I hope you got a more in-depth view of the ECO sensor system by reading this post. At least, I think I have by doing the research and writing it. Remember that I am no expert in this area, so I may be wrong. If you spot something that needs to be corrected, just drop me a note, for example in the form of a comment.

In regard to recent (STS-122…) developments, the question now is: what should be done if the root cause of the ECO sensor system failure cannot be found? I don’t know; I am missing too many facts and my understanding is limited. But my guess is that if a rationale can be found to fly without it, that is probably the best option to carry out. Hopefully, though, the tanking test will show where it is flawed and a solution can be applied. Either way, I trust those wizards at NASA (and its contractors, of course). They have the training, they have the insight and they have the excellence. What else could one ask for?

Astronauts have left Kennedy Space Center

The astronauts have left Kennedy Space Center, but not without a big thank you to the launch support guys:

“We want to thank everyone who worked so hard to get us into space this launch window,” the astronauts said in a statement. “We had support teams working around the clock at KSC, JSC, and numerous sites in Europe. We were ready to fly, but understand that these types of technical challenges are part of the space program. We hope everyone gets some well-deserved rest, and we will be back to try again when the vehicle is ready to fly.”

They are now back in Houston, where they will continue their training in support of a space shuttle Atlantis launch in January 2008. The launch is scheduled for no earlier than January 2nd. The date is obviously affected by the results of the ECO sensor troubleshooting that is currently being conducted. First news on that troubleshooting effort is expected on Tuesday.