Rainer Gerhards

2009-05-08

Can “more reliable” actually mean “less reliable”?

On the rsyslog mailing list, we currently have a discussion about how reliable rsyslog should be. It revolves around a small potential window of message loss in the case of sudden power failure. Rsyslog can be configured to put all messages into a disk queue (instead of main memory), so these messages survive a power failure. However, messages that were already dequeued and scheduled for processing when the power fails may still be lost.

I now consider a case where we have bursty UDP traffic and rsyslog is configured to use a disk-only queue (which obviously is much slower than an in-memory queue). Looking at processing speeds, the maximum burst rate is limited by using an ultra-reliable queue. To avoid losing UDP messages, a second instance could be run that uses an in-memory queue and forwards received messages to the one in ultra-reliable mode (that is, with the disk-only queue). That second instance queues in memory until the (slower) reliable rsyslogd can accept the message and put it into the reliable queue. Let’s say that you have a burst of r messages and that from this burst only r/2 can be enqueued (because the ultra-reliable queue is so slow). So you lose r/2 messages.

Now consider the case that you run rsyslog with just a reliable queue, one that is kept in memory and is thus not able to cover the power failure scenario. Obviously, all messages in that queue are lost when power fails (or almost all, to be precise). However, that system has a much higher throughput. So with it, there would never have been r messages inside the queue, because that system has a much higher sustained message rate (and thus the burst causes much less trouble). Let’s say the system is just twice as fast in this setup (I guess it usually would be *much* faster). Then it would be able to process all r records.

In that scenario, the ultra-reliable system loses r/2 messages, whereas the somewhat more “unreliable” system loses none – by virtue of being able to process messages as they arrive.
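The comparison can be played through with a toy model. The sketch below is not rsyslog code: the function name `burst_loss` and all rates are made up for illustration. It assumes the burst arrives over a fixed interval, the queue accepts at most a constant number of messages per second, and everything beyond that is dropped, as happens with UDP.

```python
def burst_loss(burst_size, burst_seconds, enqueue_rate):
    """Return how many messages of a burst are lost.

    burst_size:    messages in the burst (r in the text)
    burst_seconds: time over which the burst arrives
    enqueue_rate:  messages per second the queue can accept (assumed constant)
    """
    capacity = int(enqueue_rate * burst_seconds)
    accepted = min(burst_size, capacity)
    return burst_size - accepted


r = 10_000
# Hypothetical numbers: the disk-only queue accepts 5,000 msg/s,
# the in-memory queue is "just twice as fast".
disk_lost = burst_loss(r, burst_seconds=1, enqueue_rate=5_000)
memory_lost = burst_loss(r, burst_seconds=1, enqueue_rate=10_000)
print(disk_lost, memory_lost)
```

With these assumed rates, the disk-only setup drops half of the burst while the in-memory setup drops nothing, which is exactly the r/2 versus 0 outcome described above.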

Now extend that picture to messages residing inside the OS buffers or even those that are still queued in their sources because a stream transport blocked sending them.

I know that each detail of this picture can be argued about at length.

However, my opinion is that there is no “ultra-reliable” system in life, only various probabilities of losing messages. These probabilities often depend on each other, which makes calculating them very hard to impossible. Still, to a first approximation, the probability that a message makes it through the system at large is just the product of the probabilities that it makes it through each of its components. And the probability of message loss is just the complement of that reliability.
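One way to make this concrete, under the simplifying assumption that the components are independent and arranged in series: a message survives only if every component passes it on, so the per-component survival probabilities multiply, and the overall loss probability is one minus that product. The function name and all numbers below are hypothetical.

```python
from math import prod


def end_to_end_loss(component_loss_probs):
    """Loss probability of a chain of independent components in series."""
    survival = prod(1.0 - p for p in component_loss_probs)
    return 1.0 - survival


# Hypothetical per-component loss probabilities:
# [network, OS receive buffer, queue during a power failure]
ultra_reliable_queue = end_to_end_loss([0.001, 0.20, 0.0])
fast_memory_queue = end_to_end_loss([0.001, 0.0, 0.01])
print(f"{ultra_reliable_queue:.4f} {fast_memory_queue:.4f}")
```

In this made-up example the queue that can never lose a message still yields the higher end-to-end loss, because it pushes loss into the OS buffers in front of it.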

This is where *I* conclude that it can make sense to permit a system to lose some messages under certain circumstances, if that influences the overall probability calculation towards the desired end result. In that sense, I tend to think that a fast, memory-queuing rsyslogd instance can be much more reliable than one that is configured as being ultra-reliable but badly affects the rest of the system at large (the scenario above).

However, I also know that for regulatory requirements, you often seem to need to prove that a system may not lose messages once it has received them, even at the cost of an overall increased probability of message loss.

My view of reliability is much the same as my view of security: there is no such thing as “being totally secure”, you can just reduce the probability that something bad happens. The worst thing in security is someone who thinks he is “totally secure” and as such is no longer actively looking at potential issues.

I see the same for reliability. There is no such thing as “being totally reliable” and it is a really bad idea to think you could ever be. Knowing this, one may begin to think about how to decrease the overall probability of message loss AND think about what rate is acceptable (and what to do with these cases, e.g. “how can they hurt”).

2009-04-30

A batch output handling algorithm

With this post, I’d like to reproduce a posting from David Lang on the rsyslog mailing list. I consider this to be important information and would like to have it available for easy reference.

Here we go:

the company that I work for has decided to sponsor multi-message queue
output capability, they have chosen to remain anonymous (I am posting from
my personal account)

there are two parts to this.

1. the interaction between the output module and the queue

2. the configuration of the output module for its interaction with the
database

for the first part (how the output module interacts with the queue), the
criteria are that

1. it needs to be able to maintain guaranteed delivery (even in the face
of crashes, assuming rsyslog is configured appropriately)

2. at low-volume times it must not wait for ‘enough’ messages to
accumulate, messages should be processed with as little latency as
possible

to meet these criteria, what is being proposed is the following

a configuration option to define the max number of messages to be
processed at once.

the output module goes through the following loop

X=max_messages

if (messages in queue)
    mark that it is going to process the next X messages
    grab the messages
    format them for output
    attempt to deliver the messages
    if (messages delivered successfully)
        mark messages in the queue as delivered
        X=max_messages (reset X in case it was reduced due to delivery errors)
    else (delivering this batch failed, reset and try to deliver the first half)
        unmark the messages that it tried to deliver (putting them back into the status where no delivery has been attempted)
        X=int(# messages attempted / 2)
        if (X=0)
            unable to deliver a single message, do existing message error process

this approach is more complex than a simple ‘wait for X messages, then
insert them all’, but it has some significant advantages

1. no waiting for ‘enough’ things to happen before something gets written

2. if you have one bad message, it will transmit all the good messages
before the bad one, then error out only on the bad one before picking up
with the ones after the bad one.

3. nothing is marked as delivered before delivery is confirmed.
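The loop described above can be sketched in Python. This is only an illustration, not rsyslog's implementation: `drain_queue`, `deliver` and `on_error` are hypothetical names, "marking" is modelled by leaving messages in the queue until delivery is confirmed, and resetting X after a single failed message has been handed to error processing is an assumption the post does not spell out.

```python
from collections import deque


def drain_queue(queue, deliver, on_error, max_messages):
    """Deliver everything in `queue` in batches of at most `max_messages`.

    deliver(batch) -> bool is a hypothetical callback that returns True
    only if the whole batch was delivered. Messages stay in the queue
    until delivery is confirmed.
    """
    x = max_messages
    while queue:
        batch = [queue[i] for i in range(min(x, len(queue)))]
        if deliver(batch):
            for _ in batch:          # confirmed: now remove from the queue
                queue.popleft()
            x = max_messages         # reset after a success
        else:
            x = len(batch) // 2      # retry with the first half
            if x == 0:
                # a single message failed: hand it to error processing
                on_error(queue.popleft())
                x = max_messages     # assumption: start over with full batches


# Usage: message 7 is "bad" and makes any batch containing it fail.
delivered, failed = [], []


def deliver(batch):
    if 7 in batch:
        return False
    delivered.extend(batch)
    return True


drain_queue(deque(range(1, 11)), deliver, failed.append, max_messages=4)
print(delivered, failed)
```

In the usage example, all messages before and after the bad one are delivered in order, and only message 7 ends up in error processing, which matches advantage 2 above.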

an example of how this would work

max_messages=15

messages arrive 1/sec

it takes 2+(# messages/2) seconds to process each batch of messages (in
reality the time to insert things into a database is more like
10 + (# messages / 100) or even more drastic)

with the traditional rsyslog output, this would require multiple output
threads to keep up (processing a single message takes 2.5 seconds with
messages arriving 1/sec)

with the new approach and a cold start you would see

message arrives (Q=1) at T=0
om starts processing message at T=0 (expected to take 2.5)
message arrives (Q=2) at T=1
message arrives (Q=3) at T=2
om finishes processing message (Q=2) at T=2.5
om starts processing 2 messages at T=2.5 (expected to take 3)
message arrives (Q=4) at T=3
message arrives (Q=5) at T=4
message arrives (Q=6) at T=5
om finishes processing 2 messages (Q=4) at T=5.5
om starts processing 4 messages at T=5.5 (expected to take 4)
message arrives (Q=5) at T=6
message arrives (Q=6) at T=7
message arrives (Q=7) at T=8
message arrives (Q=8) at T=9
om finishes processing 4 messages (Q=4) at T=9.5
om starts processing 4 messages at T=9.5 (expected to take 4)

the system is now in a steady state

message arrives (Q=5) at T=10
message arrives (Q=6) at T=11
message arrives (Q=7) at T=12
message arrives (Q=8) at T=13
om finishes processing 4 messages (Q=4) at T=13.5
om starts processing 4 messages at T=13.5 (expected to take 4)

if a burst of 10 extra messages arrived at time 13.5 this last item would
become

10 messages arrive (Q=14) at T=13.5
om starts processing 14 messages at T=13.5 (expected to take 9)
message arrives (Q=15) at T=14
message arrives (Q=16) at T=15
message arrives (Q=17) at T=16
message arrives (Q=18) at T=17
message arrives (Q=19) at T=18
message arrives (Q=20) at T=19
message arrives (Q=21) at T=20
message arrives (Q=22) at T=21
message arrives (Q=23) at T=22
om finishes processing 14 messages (Q=9) at T=22.5
om starts processing 9 messages at T=22.5 (expected to take 6.5)
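The cost model behind this example can be checked with a few lines of Python. The sketch below only evaluates the 2 + (# messages / 2) formula and the steady state it implies; it does not replay the message-by-message timeline. The function names are made up for illustration.

```python
def batch_time(n):
    """Seconds to process a batch of n messages in the example: 2 + n/2."""
    return 2 + n / 2


def steady_state_batch(arrivals_per_second=1.0, start=1.0, iterations=60):
    """Iterate 'next batch = what arrived while the current batch ran'."""
    n = start
    for _ in range(iterations):
        n = arrivals_per_second * batch_time(n)
    return n


for size in (1, 2, 4, 14, 9):
    print(size, batch_time(size))
print(steady_state_batch())
```

With one message arriving per second, the batch size settles where n = 2 + n/2, that is at 4 messages every 4 seconds, and a batch of 14 takes 9 seconds, as in the timeline above.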

2009-04-27

Levels of reliability

We had a good discussion about reliability in rsyslog this morning. On the mailing list, it started with a question about the dynafile cache, but quickly morphed into something else. As the mailing list thread is rather long, I’ll try to do a quick excerpt of those things that I consider vital.

First a note on RELP, which is a reliable transport protocol. This was the relevant thought from the discussion:

I’ve got relp set up for transfer – but apparently I discovered
that relp doesn’t take care of a “disk full” situation on the receiver
end? I would have expected my old entries to come in once I had cleared the disk space, but no… I’m not complaining btw – just remarking that this was an unexpected behaviour for me.

That has nothing to do with RELP. The issue here is that the file output writer (in v3) uses the sysklogd concept of “if I can’t write it, I’ll throw it away”. This is another issue that was “fixed” in v4 (not really a fix, but a conceptual change).

If RELP gets an ack from the receiver, the message is delivered from the RELP POV. The receiving end acks, so everything is done for RELP. Same thing if you queue at the receiver and for some reason lose the queue.

RELP is reliable transport, but not more than that. However, if you need reliable end-to-end delivery, you can get that by running the receiver totally synchronously, that is, with all queues (including the main message queue!) in direct mode. You’ll have awful performance and will lose messages if you use anything other than RELP for message reception (well, plain tcp works mostly correctly, too), but you’ll have synchronous end-to-end delivery. Usually, reliable queuing is sufficient, but then the sender does NOT know when the message was actually processed (just that the receiver enqueued it, think about the difference!).

This explanation triggered further questions about the difference in end-to-end reliability between direct queue mode versus disk based queues:

The core idea is that a disk-based queue should provide sufficient reliability for most use cases. One may even question if there is a reliability difference at all. However, there is a subtle difference:

If you don’t use direct mode, then processing is no longer synchronous. Think about the street analogy:

http://www.rsyslog.com/doc-queues_analogy.html

For synchronous, you need the u-turn like structure.

If you use a disk-based queue, I’d say it is sufficiently reliable, but it is no longer an end-to-end acknowledgement. If I had this scenario, I’d go for the disk queue, but it is not the same level of reliability.

Wild sample: sender and receiver at two different geographical locations. Receiver writes to database, database is down.

Direct queue case: sender blocks because it does not receive ultimate ack (until database is back online and records are committed).

Disk queue case: sender spools to receiver disk, then considers records committed. Receiver ensures that records are actually committed when database is back up again. You use ultra-reliable hardware for the disk queues.

Level of reliability is the same under almost all circumstances (and I’d expect “good enough” for almost all cases). But now consider we have a disaster at the receiver’s side (let’s say a flood) that causes physical loss of receiver.

Now, in the disk queue case, messages are lost without the sender knowing. In direct queue case we have no message loss.
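The difference between the two cases can be shown with a toy model. Nothing below is rsyslog or RELP code; the class `Receiver`, its `mode` values and `send_all` are illustrative names. The assumptions are that the database is down, that a direct-mode receiver acknowledges only after the record is committed, and that a disk-queue receiver acknowledges as soon as the message is spooled.

```python
class Receiver:
    """Toy receiver; `mode` is 'direct' or 'disk' (illustrative names)."""

    def __init__(self, mode):
        self.mode = mode
        self.database_up = False
        self.disk_queue = []   # spooled but not yet committed
        self.database = []     # committed records

    def receive(self, msg):
        """Return True if the sender gets an acknowledgement."""
        if self.mode == "direct":
            if not self.database_up:
                return False           # no ack: sender keeps the message
            self.database.append(msg)
            return True
        self.disk_queue.append(msg)    # disk mode: ack once spooled
        return True


def send_all(receiver, messages):
    """Return the messages the sender still holds (never acknowledged)."""
    return [m for m in messages if not receiver.receive(m)]


msgs = ["a", "b", "c"]
direct, disk = Receiver("direct"), Receiver("disk")
held_direct = send_all(direct, msgs)   # database is down
held_disk = send_all(disk, msgs)

# Disaster at the receiver: everything stored there is gone, so a
# message is lost if the sender no longer holds it.
lost_direct = [m for m in msgs if m not in held_direct]
lost_disk = [m for m in msgs if m not in held_disk]
print(lost_direct, lost_disk)
```

In direct mode the sender still holds every message when the receiver is destroyed, while in disk-queue mode all three messages were acknowledged and are lost without the sender knowing.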

And then David Lang provided a perfect explanation (with which I fully agree) of why, in practice, a disk-based queue can be considered almost as reliable as direct mode:

> Level of reliability is the same under almost all circumstances (and I’d
> expect “good enough” for almost all cases). But now consider we have a
> disaster at the receiver’s side (let’s say a flood) that causes physical loss
> of receiver.

no worse than a disaster on the sender side that causes physical loss of the sender.

you are just picking which end to have the vulnerability on, not picking if you will have the vulnerability or not (although it’s probably cheaper to put reliable hardware on the upstream receiver than it is to do so on all senders)

> Now, in the disk queue case, messages are lost without sender knowing. In
> direct queue case we have no message loss.

true, but you then also need to have the sender wait until all hops have been completed. that can add a _lot_ of delay without necessarily adding noticeably to the reliability. the difference between getting the message stored in a disk-based queue (assuming it’s on redundant disks with fsync) one hop away vs the message going a couple more hops and then being stored in its final destination (again assuming it’s on redundant disks with fsync) is really not much in terms of reliability, but it can be a huge difference in terms of latency (and unless you have configured many worker threads to allow you to have the messages in flight at the same time, throughput also drops)

besides which, this would also assume that the ultimate destination is somehow less likely to be affected by the disaster on the receiving side than the rsyslog box. this can be the case, but usually isn’t.

That leaves me with nothing more to say ;)

2009-04-08

what is “nextmaster” good for?

People who looked at rsyslog’s git may have wondered what the branch “nextmaster” is good for. This actually is an indication that the next rsyslog stable/beta/devel rollover will happen soon. With it, the current beta becomes the next v3-stable. At the same time, the current (v4) devel becomes the next beta (which means there won’t be any beta any longer in v3). In order to facilitate this, I have branched off “nextmaster”, which I will work on for now. The “master” branch will no longer be touched and will soon become beta. Then, I will merge “nextmaster” back into the “master” branch and continue to work with it.

The bottom line is that you currently need to pull nextmaster if you would like to keep current on the edge of development. Sorry for any inconvenience this causes, but this is the best approach I see to go through the migration (and I’ve done the same in the past with good success, just that then nobody noticed it ;)).

2009-04-01

rsyslog going to outer space

Rsyslog was designed to be a flexible and ultra-reliable platform for demanding applications. Among others, it is designed to work very well in occasionally connected systems.

There are some systems that are inherently occasionally connected – space ships. And while we are still a bit away from the Star Trek way of doing things, current space technology needs a “captain’s star log”. Even for spacecraft, it is important when and why systems were powered up, over- or under-utilized or malfunction (for example, due to “attack” not of a Klingon, but a cosmic ray). And all of this information needs to be communicated back to earth, where it can be reviewed and analyzed. For all of this, systems capable of reliable transmission in a disconnected environment are needed.

Inspired by NASA’s needs, the Internet Research Task Force (the research branch of the IETF) is working on a protocol named DTN, usually called the interplanetary Internet.

As we probably all know, Microsoft Windows flies on the Space Shuttle. And, more importantly, Linux also did. With the growing robustness of Open Source, future space missions will most probably contain more Linux components.

This overall trend will also be present in NASA’s and ESA’s future Jupiter mission. There is a lot of information technology on the upcoming spacecraft, and so there are a lot of things worth logging. While specialized software is usually required for spacecraft operations, it is being considered that rsyslog, as the leading provider of reliable occasionally connected logging infrastructures, may extend its range into the solar system. It only sounds logical to use all the technology we already have in place for reliable logging even under strange conditions (see “reliable forwarding”). Also of importance are rsyslog’s speed and robustness.

As a consequence, we have today begun to implement the DTN protocol for the interplanetary Internet. That will be “omdtn” and is available as part of the rsyslog spaceship git branch. This branch is available as of now from the public git repository.

We could also envision that mission controllers will utilize phpLogCon to help analyze spacecraft malfunctions. A very interesting feature is also rsyslog’s modular architecture, which could be used to radiate a new communication plugin up to the spaceship, in case this is required to support some alien format. This also enables the rsyslog team to provide an upgrade to the Interstellar Internet, should this finally be standardized in the IETF. If so, and provided the probe has enough consumables, it may be in the best spot to work as a stellar relay between us and whoever else.

2009-03-27

is freshmeat now dead?

I used freshmeat.net – both as a user and a project author – for several years and liked the clean and efficient interface. Now, they have revamped the whole thing and I have to admit I personally think they screwed up while doing so.

First of all, a project has a structure that consists of various branches, each of them coming in different versions (see my post on the rsyslog family tree). In the old interface, you had branches and versions, and everyone could clearly see what belonged where. In the new interface (as I understand it), you have a bunch of links that you can label. So I now have to deal with a flat structure and labels. This is NOT how software grows. And as this no longer is a real-world abstraction, it has become quite complicated to assign meaningful values. Not to mention that the big bunch of links is probably quite confusing to users.

I’ll probably deal with that by removing all but the development branches. Better to have consistent information than to have everything…

I also miss the statistics counters. They provided some good insight into what users were interested in and what effect releases had. Very valuable for me as an author, but also valuable for me as a user, for example, when I want to judge how active a project is. Freshmeat promised (on March 15th) to bring back statistics “in a few days”, but today (March 27th), they are still missing. And if they eventually appear and follow the rest of the design paradigm, I am skeptical whether there is really value in them.

All in all, I am very dissatisfied. I am sad to have lost a valuable open source resource. So what to do now? Sourceforge again – don’t like that either. Ohloh? Maybe. Probably it’s best to concentrate on our own web resources… But first of all, I’ll wait a couple of days/weeks and hope that freshmeat will become usable again. But please don’t expect too many announcements on freshmeat from me for the time being.

There is also an interesting discussion thread on the new freshmeat design; I suggest giving it a read (you’ll also find that others like it!)

2009-03-23

rsyslog “family tree”

I have created a rsyslog “family tree” showcasing how the various branches and versions go together. It is a condensed graph of the git DAG and shows a few feature branches as an example. I personally think it provides a good overview of how rsyslog work progresses (click picture for larger version).

In red is the git master branch, blue are currently supported stable branches. Branch head “v1-stable” is dotted, because it is no longer officially supported. Dashed nodes are versions on feature branches, solid nodes are versions on main branches. Solid lines are direct ancestors, dashed lines indicate that there are some versions in between. Lots of feature branches have not been shown. Bug fixes are typically applied to the oldest code experiencing the problem and then merged into the more recent versions, thus the code flow for bug fixes is kind of reverse. This bug fixing code flow is not shown inside the graph.

Note that you can use gitk to create the full DAG from the git archive. The purpose of my effort is to show the relationships that are well-hidden in gitk’s detailed view.

I have written a much more elaborate post about the “evolution of software“, unfortunately, it is available currently only in German (with very questionable results by Google Translate).

2009-03-17 (updated 2018-06-11)

Why is there still PRI in a syslog message?

This is the first of a couple of blog posts I intend to do in response to Raffy’s post on syslog-protocol. I am very late, but better late than never. Raffy raised some good points. With some I agree, with some I do not, and for some others it is probably interesting to see why things are as they are.

The bottom line is that this standard – as probably every standard – is a compromise of what could be agreed on by a larger group of people and corporate interests. Reading the IETF mailing list archives will educate much about this process, but I will dig out those interesting entry points into the mass of posts for you.

I originally thought I would reply to Raffy with a single blog post. However, this turns out not to be doable – every time I intend to start, something bigger and more important gets in my way. So I am now resorting to more granular answers – hopefully this works.

Enough said, on to the meat. Raffy said:

Syslog message facility: Why still keeping this? The only reason that I see people using the facility is to filter messages. There are better ways to do that. Some of the pre-assigned groups are fairly arbitrary and not even really implemented in most OSs. UUCP subsystem? Who is still using that? I guess the reason for keeping it is backwards compatibility? If possible, I would really like this to be gone.

Priority calculation: The whole priority field is funky. The priority does not really have any meaning. The order does not imply importance. Why having this at all?

And I couldn’t agree more with this. In my personal view, keeping with the old-style facility is a large debt, but it was necessary to make the standard happen. Over time, I have to admit, I even tend to think it was a good idea to stick with this format, it actually eases transition.

Syslog-protocol has a long history. We thought several times we were done, and the first time this happened was in November, 2005. Everything was finalized and then there was a quite unfortunate (or fortunate, as you may say now ;)) IETF meeting. I couldn’t attend (too much effort to travel around the world for a 30-minute meeting…) and many other WG participants also could not.

It took us by surprise that the meeting agreed the standard was far from ready for publishing (read the meeting minutes). The objection triggered a very long (and productive, I need to admit) WG mailing list discussion. To really understand the spirit of what happened later, it would be useful to read the mailing list archives starting with November 14th.

However, this is lots of stuff, so let me pick out some posts that I find important. The most important fact is that backward compatibility became the WG charter’s top priority (one more post to prove the point). Among others, it was strongly suggested that both the PRI as well as the RFC 3164 timestamp be preserved. Thankfully, I was able to prove that there was no common understanding of the date part among different syslog servers (actually, the research showed that nothing but PRI is common among syslogds…). So we went down and decided that PRI must be kept as is – to favor compatibility.

As I said, I did not like the decision at that time and I still do not like the very limited number of facilities that it provides to us (actually, I think facility is mostly useless). However, I have accepted that there is wisdom in trying to remain compatible with existing receivers – we will stick with them for a long time.

So I have to admit that I think it was a good decision to demand that PRI be compatible. With structured data and the other header fields, we do have ways of specifying different “facilities”, that is, originating processes. Take this approach: look at facility as a down-level filtering capability. If you have a new syslogd (or write one!), make sure you can filter on all the other rich properties and not just facility.
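For reference, the PRI value discussed here packs facility and severity into a single number, facility * 8 + severity, written in angle brackets at the start of the message (this is the encoding used by both RFC 3164 and RFC 5424). A minimal sketch follows; the helper names `encode_pri` and `decode_pri` are made up for illustration.

```python
import re


def encode_pri(facility, severity):
    """PRI = facility * 8 + severity (facility 0-23, severity 0-7)."""
    if not (0 <= facility <= 23 and 0 <= severity <= 7):
        raise ValueError("facility must be 0-23 and severity 0-7")
    return facility * 8 + severity


def decode_pri(message):
    """Extract (facility, severity) from a message starting with '<PRI>'."""
    match = re.match(r"<(\d{1,3})>", message)
    if match is None:
        raise ValueError("no PRI found")
    return divmod(int(match.group(1)), 8)


print(encode_pri(4, 2))
print(decode_pri("<34>Oct 11 22:14:15 mymachine su: test"))
```

Because the facility is limited to 24 values, a receiver that can only filter on PRI has very little to work with, which is why filtering on the other properties matters.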

In essence, I think this is the story why, in 2009, we still have the old-style PRI inside syslog messages…

2009-03-12

How Software gets stable…

I have received a couple of questions over the past days about whether this or that rsyslog feature can be introduced into the stable branch soon. So I thought it is time to blog about what makes software stable – and what does not…

But let me first start with something apparently unrelated: let me confess that, from time to time, I like to enjoy some good wine (Californian Merlot and Cabernet especially – ask me for my mailing address if you would like to contribute some! ;)). And on some special occasions, I spend way too much money just to get the “old stuff”: those nice wines that have aged in oak barriques. To cut a long story short, those wines are kept in barrels not only for storage, but because the exposure to the oak, as well as some properties of the storage container, interact with the wine and make it taste better. Wikipedia has the full story, and also this interesting quote:

The length of time that a wine spends in the barrel is dependent on the varietal and style of wine that the winemaker wishes to make. The majority of oak flavoring is imparted in the first few months that the wine is in contact with oak but a longer term exposure can affect the wine through the light aeration that the barrel allows which helps to precipitate the phenolic compounds and quickens the aging process of the wine. New World Pinot noir may spend less than a year in oak. Premium Cabernet Sauvignon may spend two years. The very tannic Nebbiolo grape may spend four or more years in oak. High end Rioja producers will sometimes age their wines up to ten years in American oak to get a desired earthy, vanilla character.

Read it again: “High end Rioja producers will sometimes age their wines up to ten years in American oak to get a desired earthy, vanilla character.“

So what would the Riojan winemaker probably say if you asked him for a great 2008 wine (we are in early 2009 currently, just for the record)? How about “Be patient, my friend – wait another 9 years, and you can enjoy it!” And what if you begged him that you need it now, immediately? “I am sorry, but I can’t accelerate time…”. And if you told him you really, really need it because otherwise you cannot close an important business deal? Maybe he says “Listen my friend. Some things simply need time. You can’t hurry them. But if you need to have something that can’t really exist, I can get you a bottle of that wine and label it as ‘Famous Riojan 10-year aged Wine from 2008’ – but we both know what is in the bottle!”. Technically speaking, the winemaker is not even cheating – he claims that the wine is from 2008, so how can it be aged 10 years? If anyone buys that (today), the buyer is probably very much at fault.

As a side-note, all too often our society works in that way: someone requests something that is impossible to do, someone begs long enough until someone else cheats, everybody knows – and we all are happy (at least up to the point where the cheat gets us into real trouble… – pick your favorite economic crisis to elaborate).

The moral of the story? Some things need time. And you can’t replace time with anything else. If you want to have the real taste of a wine aged 10 years in oak… you need 10 years.

By now you probably wonder what all of this has to do with software. A lot! Have you ever thought what makes software stable? In closed source, you hopefully have a large testing department that helps you nail down bugs. In open source, you usually do not have many of these folks, but you have something much better: a community of loyal users eager to break their systems with the latest and greatest of what you happen to have thrown together ;)

In either case, you start with a relatively unstable program and with each bug report (assuming you fix it), the software gets more stable. While fixing bugs, however, you may introduce new instabilities. The larger the fix, the larger the risk. So the more you change, the larger the need to re-test and the larger the probability that while one issue is fixed one (or more!) issues have been newly created. For very large fixes, you may even end with a much worse version of the software than you had before.

Thankfully, a patch to fix a bug is usually much smaller than what was fixed. Often, it is just a few lines of code, so the risk to worsen things is low. Why is the patch usually just a few lines long? Simply because you fix some larger thing that usually works quite well. So you need to change some details which were not properly thought out and thus resulted in wrong behavior (if you made a design error, that’s a different story…).

So the more bug reports you get, and the more of them you fix, the more stable a software gets. You may have seen some formal verifications in computer science, but in practice, for most applications, this is the simple truth on how things work.

Now to new features: features are usually the opposite of a bugfix: introducing a new feature tends to be a larger effort, touching much more code and adding code where code never has been ;) If you add new features, chances are great that you introduce new bugs. So with each feature added, you should expect that the stability of your code decreases (and, oh boy, it does!). So how to iron out these newly introduced bugs? Simply wait for bug reports, fix them, wait for more – until you have reached at least a decent level of stability (aka “no new/serious bug reports received for a period of n days”, whatever you have defined n to be).

And what if you then introduce a new feature? I guess by now you know: that’ll decrease stability so you need to iterate through the bugfixing process … and so on.

But, hey, we are doing open source. I *love* to add features every day! Umm… I guess my program will never reach a decent level of stability. Bad…

What to do? Taking a long vacation (tempting…) is not a real solution. Who will fix bugs while I am away (shame on me for mentioning this…)? But a pattern appears if you follow this thought: what you need to do to make a program stable is fix bugs for a period of time but refrain from adding new features!

Thanks to git, this can easily be done: you simply create one code branch for a version that shall become stable, and create another branch for the version where you create new features (the development branch). With a bit of git voodoo, you can even import fixes from your stabilizing branch to the development branch. Once you are happy with the stability of your code (in the stabilizing branch), you are ready to declare it to be stable! For that, you’ll probably have a separate branch. Then, you can start the game again: copy the state of your development branch to the stabilizing branch, do not touch that branch except for bug fixes and continue adding new features to the development branch. Iterate this as long as you are interested in your project.

This, in short form, is how rsyslog is created. Currently, there are four main branches, plus a number of utility branches that aid the development of specific features (let’s ignore them in this context): we have the development (also called “master”) branch which equates to the … yes… development branch from the sample above ;). The stabilizing branch is called “beta” in rsyslog terms. Then, we have a v2-stable and a v3-stable branch. Both are actually stable, but v2-stable is probably even more stable because it has – except for bug fixes – not been touched for many more months. It also has the fewest features, so it is probably the best choice if you are primarily interested in stability and do not need any of the new features. As rsyslog is further developed, we will add extra stable branches (e.g. there will probably be a v4- and v5-stable branch – but we may also no longer maintain v2-stable at this point because nobody uses it any longer [just like dinosaurs are no longer maintained ;)]).

Did you read carefully? Did you get the message? So let me ask:
What makes software stable?

Bug fixes? Testing? Money (yes, yes, please throw it at me!)?

REALLY? Let me repeat:
WHAT MAKES SOFTWARE STABLE?

There is only one real ingredient and that is: TIME! Just like good wine, software needs to age. Thankfully, age, for software, is measured in the number of different test cases it has been exposed to. So money can accelerate the aging of software (as some chemistry guru may be able to do for wine, probably with the same side-effects…). But for the typical open source project, stability simply goes along with the rate at which the community adopts new releases, tests them AND submits bug reports, so that the authors can work on fixing broken things.

And what is the moral of the story? Finally, I am coming back to the opening questions: there is nothing but time that makes rsyslog stable. So if you ask me to add a feature today, and I do, you cannot expect it to be immediately stable – simply because this is not how things work (thanks, btw, for trusting so much in my programming abilities ;)). The new feature needs to go through all the stages, that is, it must first be applied to the current development build (otherwise we would de-stabilize the current beta, which is not desirable). Then, it is migrated to the beta build over time, where it can finally fully stabilize and, whenever the bug rate seems to justify this, it can move on to the stable build. For rsyslog, this typically means that three to four months, sometimes more, are needed before a new feature hits the stable branches. And there is little you can do about that.

“But… hey, I need a stable version of that cool feature now! My manager demands it. Hey, I’ll also pay you for it…” Guess what? I can do the same thing the winemaker did. Of course, and if you ask really nicely, I can create a v3-stable-cool version for you, which is a version with the cool feature that I have declared immediately stable (btw, it’s mostly the same thing that all others just call “the beta”). If that satisfies your boss, I’ll be happy to do so. But we both know what you have gotten… ;)

Of course, I am exaggerating a bit here: in software, we can somewhat increase the speed of stabilizing by adding testers. Money (and even more so motivation) can do that. We can also backport single new features to so-far stable branches (note the fine print!). This reduces the stability a bit, but obviously not as much as for the development version. However, this requires effort (read: time and/or money) and it may be impractical for many features. Some features simply rely on others that were newly introduced in that development version, and if you backport the whole bunch of them, you’ll have something that has changed as much as the development version, but in an environment where the component integration is not as well tested and understood. Of course, some company policies (seem to) force you to do that. If so, the end result is that you have a system that is much less stable than the development version, but has a seemingly “stable” label. Wow, how cool! As common sense says: “everyone gets what one asks for” ;)

So what is the bottom line? Good software and good wine have something in common: they need time to ripen! Think about this the next time you ask me to offer a new feature as part of a stable branch. It’s simply impossible. But, of course, you can bribe me to stick that “stable” label onto a mangled-with version…

2009-03-12

ISS under debris hit threat!

In case you have not yet heard it on the twittersphere, here is something you should really look into: there is a so-called “red” threat that the ISS may be hit by debris. The ISS crew is currently closing hatches and preparing to move to the attached Soyuz return vehicle, in case this should be required. The full story is at nasaspaceflight.com. I also strongly recommend dialing in to NASA mission audio. The critical window is 5 minutes around 11:39am CDT.

I found the following two interesting links to track the debris and the International Space Station.

Thankfully, the event is now over and nothing happened (no news is good news :-)).

Here is a picture of the two satellite trackers around the time of the close encounter. Have a look at latitude, longitude and elevation in the trackers.