Why is writing good user doc such a problem…?

… for me, I should add. Today, I ran across a post on the rsyslog mailing list where a user (rightfully!) complained that the rsyslog documentation is confusing.

I really don’t like the idea that users are having a hard time because they cannot get pretty basic things done. Unfortunately, there are a number of reasons for this: one, of course, is lack of time. I am rather busy developing new functionality, and besides rsyslog I also have other chores at Adiscon, like helping with the next and really great release of Adiscon LogAnalyzer, a free and open source solution for searching, analyzing and reporting on network event data and syslog (yeah, and creating buzzwords, of course…). But there is a more subtle issue:

I have been doing logging and syslog for over 10 years now (close to 15, if I remember correctly). I have seen so much in the logging world that I can hardly think back to the time when I did not know what PRI or TAG or even MSG was, what the (dis)advantages of simplex vs. duplex communication modes are, and what makes RFC 3195 better (or worse) than 3164 or 5424 ;)

In short: it is pretty hard for me to go back to the roots and envision what somebody new to syslog needs to know AND in what order! I am trying my best, but writing basic-level articles (and documentation) requires considerable effort. A good, well thought-out article (like a 4-page journal article) can easily take 4 to 5 days to create. Even then, I need help from other folks when I write for entry-level folks (and there is nothing wrong with being entry-level: everybody is at some point in time). Here, of course, the time resource problem hits again: I usually cannot afford this effort “just” to create doc.

With the rsyslog cookbook I started another approach: there I focus on very specific environments. I don’t really like this idea, because it does not teach people what exactly they are doing. Still, the past weeks have proven it to be a useful approach. But I also notice that the cookbook is only useful if the configuration matches exactly what the user wants – otherwise users are lost. I guess that’s due to them not really understanding what happens. The good thing about the cookbook is that it requires relatively little effort. Most samples were created within an hour, which seems acceptable for something that can be reused.
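To give a flavor of what such a recipe looks like, here is a minimal, hypothetical cookbook-style sample in rsyslog’s legacy configuration format (host name and port are placeholders): keep a local copy of everything and also forward all messages to a central server.

  # store everything locally
  *.*   /var/log/messages
  # also forward everything to a central server
  # (@@ selects plain TCP, a single @ would select UDP)
  *.*   @@central.example.com:514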

The ultimate solution would be for users to write content themselves. The rsyslog knowledge base (or forum, as you may call it) is most successful in this regard. But it is hard to navigate and hard to find a solution in – you often need to wade through various posts before you get to the (often simple) solution. The rsyslog mailing list is another excellent resource, especially as other folks actively help support rsyslog. This is very important to me and the project, and I appreciate it very much. Unfortunately, the con again is that the mailing list makes it hard for new users to find already existing solutions (that it is mirrored to various aggregators helps a bit, but only so much…).

The ultimate solution, I thought, was the rsyslog wiki, and we see some very nice articles inside it. Unfortunately, very few users contribute to the wiki. Just think what an enormous knowledge reservoir this could be if only every 5th user who got help would take a few minutes of his time to craft a quick wiki article describing what he does, why, and how it works. Unfortunately, most users seem to not have this time. I can understand that; I guess they have pressing schedules as well. And those schedules may already be stressed by the extra time they needed to find the solution for an obviously simple thing…

So this is not a good situation, but I currently cannot do much more than keep working on the cookbook and ask everyone to contribute documentation. For the long-term success, I think it is vital for rsyslog to make its power available to all users. Good doc is one necessity, a better config format another one (but I won’t elaborate on this in today’s post again ;)).

is a third-level domain suspect to google?

If you follow this blog, you’ve probably already heard that we are doing a name change for phpLogCon: it will soon be known under the name Adiscon LogAnalyzer (with the Adiscon in front of the “real” name to ease potential legal issues).

Among other things, that means we need to change the web site. Not surprisingly, no second-level domain with loganalyzer in it was available at the time we searched. Most of them, of course, had been taken by domain spammers. So we settled for the loganalyzer.adiscon.com name. As I found out yesterday in Google Webmaster Tools, this may cause some trouble. Google provides a “Change of Address” tool that is meant to be used in this situation.

However, I discovered that this tool does not work with third-level domains. All I see when I try to use that tool is the message “Setting is restricted to root level domains, only” (as a slight technical side-note, it should say “second level domain” as I don’t think it works for com, net, org, … only ;)).

While browsing the Google help forum, I found that others seem to have similar problems. For example, people in the UK, where practically everything is a third-level domain (for example, .co.uk is what .com is for the international Internet).

Given this stance, I wonder if google punishes third-level domain sites in any other way. If so, our decision to move to the new site may not be a good one. I have posted a question in the Google help forums. I guess I will not get a definite response, but maybe one can read between the lines.

I will keep you posted, also on the overall progress of the name/site switch. We have now entered the “hot phase”, meaning that we actually intend to roll over to the new site within the next couple of days. Stay tuned for more news and more features.

converting NetApp Filer to syslog – and MS API Changes…

One of the things that is done with EventReporter and MonitorWare Agent is forwarding NetApp Filer event logs via syslog. There are essentially two ways this can be done: either via backup event log files (*.evt), which NetApp writes in a Windows-compatible format, or (a newer approach) via an Event Log API emulation inside the NetApp box. If you’d like to know the details, you can find them in our NetApp EventLog to Syslog forwarding paper.

Unfortunately, changes in recent Windows versions cause some trouble with the way this forwarding works: Microsoft seems to have changed the on-disk format of backup event log files. That in itself is OK and something that can happen. What I find strange is that Microsoft no longer supplies code inside Windows (and its APIs!) to read downlevel event logs. So, for example, on Windows 2008 it is no longer possible to read a backup event log from an older release. This includes the NetApp .evt files, as they are written in the older Windows format.
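For illustration, this is roughly how such a backup file is opened via the classic event log API (a minimal sketch, error handling trimmed; the file path is a placeholder). On Windows 2008, the open call simply fails for old-format files:

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      /* open a backup event log file via the legacy API */
      HANDLE hLog = OpenBackupEventLog(NULL, "C:\\logs\\netapp-backup.evt");
      if (hLog == NULL) {
          /* on Windows 2008, this is where old-format .evt files fail */
          fprintf(stderr, "OpenBackupEventLog failed, error %lu\n",
                  (unsigned long) GetLastError());
          return 1;
      }
      /* ... a ReadEventLog() loop would process the records here ... */
      CloseEventLog(hLog);
      return 0;
  }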

I do not understand the Microsoft decision. It would not have been hard to preserve backward compatibility – a header flag inside the file plus very little code inside the operating system would have been sufficient. But without that, the NetApp .evt files cannot be accessed by the Windows Event Viewer and, consequently, not yet by our eventlog to syslog tools. Thankfully, though, we support all modes NetApp provides, and so the work-around is to use the NetApp Event Log API emulation, which will also get the necessary information out to syslog. But, again, I do not understand how Microsoft can break backward compatibility in such an unnecessary way.

Anyhow, things are as they are ;) We are also looking at ways to process the NetApp backup event log files even under these new constraints. And as you know, we are already full of ideas. Of course, I also recommended opening a support ticket with Microsoft – I am only too eager to learn the official response to this situation (and, maybe, a solution). I’ve been told we’ll open the ticket today, so let’s see what comes out of all that…

syslog data modeling capabilities

As part of the IETF discussions on a common logging format for SIP, I explained some syslog concepts to the sip-clf working group.

Traditionally, syslog messages contain free-form text only, aimed at human observers. Of course, today most of the logging information is processed automatically, and the free-form text creates ample problems in that regard.

The recent syslog RFC series has gone to great lengths to improve the situation. Most importantly, it introduced a concept called “structured data”, which permits expressing information in a well-structured way. Actually, it provides a dual-layer approach, with a coarse designator at the upper layer and name/value pairs at the lower layer.
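To illustrate, here is the structured data sample message from RFC5424 itself: the SD-ID “exampleSDID@32473” is the coarse upper-layer designator, and the name/value pairs form the lower layer.

  <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47
    [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"]
    An application event log entry...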

However, the syslog RFCs do NOT provide any data/information modeling capabilities to go with these structured data elements. Their syntax and semantics are to be defined in separate RFCs. So far, only a few examples exist. One of them is the base RFC5424 itself, which describes some common properties that can be contained in any syslog message. Other than that, there are RFC5674, which describes a mapping to the Alarm MIB and ITU perceived severities, and RFC5675, which describes a mapping to SNMP traps. All of them are rather small. The IHE community, to the best of my knowledge, is currently considering using syslog structured data as an information container, but has not yet reached any conclusion.

Clearly, it would be of advantage to have more advanced data modeling capabilities inside the syslog base RFCs, at least some basic syntax definitions. So why is that not present?

One needs to remember that the syslog standardization effort was a very hard one. There were many different views, “thanks” to the broad variety of legacy syslog, and it was extremely hard to reach consensus (thus it took some years to complete the work…). Next, one needs to remember that there is such an immense variety in message content and objects that it is a much larger effort to try to define generic syntaxes and semantics (I don’t say it cannot be done, but it is far from easy). In order to get the basics done, the syslog WG decided not to dig down into these dirty details but rather lay out the foundation so that we can build on it in the future.

I still think this is a good compromise. It would be good if we could complement this foundation with some already existing technology. SNMP MIB encoding is not the right way to go, because it follows a different paradigm (syslog is still meant to be primarily clear text). One interesting alternative I have seen, and am now evaluating, is the IPFIX data modeling approach. Ideally, we could reuse it inside structured data, saving us the work of defining a syslog-specific model for doing so.

The most important task, however, is to think about, and specify, some common “information building blocks”. By these, I mean standard properties, like source and destination ID, mail message ID, bytes sent and received, and so on. These, together with some standard syntaxes, can greatly reduce the problems we face when consolidating and analyzing logs. Obviously, this is an area that I will be looking into in the near future as well.
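As a purely hypothetical sketch, such a building block could take the shape of a standardized structured data element. The SD-ID “traffic@32473” and the parameter names below are invented for illustration (32473 is the enterprise number reserved for examples) and are not part of any RFC:

  [traffic@32473 srcID="relay1.example.net" dstID="mail.example.com"
   bytesSent="15342" bytesReceived="512"]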

It may be worth noting that I wrote a paper about syslog parsing back in 2004. It was, and has remained, work in progress. However, Adiscon did implement the concept in MonitorWare Console, which unfortunately never got wider exposure. Thinking about it, that work would benefit greatly from the availability of standardized syslog data models.

new phplogcon site

Today, I received a first link to a more or less complete version of what will become the new phpLogCon site. The site is not yet live, but it already provides a preview of some of the new features.

If you look at it, you’ll probably notice a couple of things. First of all, the name “phpLogCon” is no longer spelled out. The reason is that we considered it a bit bulky and meaningless. “LogAnalyzer” is exactly what the tool is about. But, of course, there are a myriad of (trademark) problems related to that name. So we try to avoid all confusion by calling it “Adiscon LogAnalyzer”, hoping that the company name as the dominant part of the product name will rule out all problems. For that very reason, you’ll also see me refer to Adiscon LogAnalyzer in the future. If you wondered why I stress the “Adiscon” part, you now know why.

Secondly, you will notice the fresh design. While I am not a visual guy, I have to say that I like it very much. I think it removes much of the clutter and makes it easier to find the information you need quickly. We have also changed the content management system in the background. The new site uses WordPress, which seems to be highly appropriate for what the site needs. Of course, the wiki and forum will remain as they are – they have proven to work quite well as they are.

If you look more closely, you will also note that Adiscon LogAnalyzer gets an important new component: a reporting module. I managed to convince my peers at Adiscon to move some of our MonitorWare Console closed-source technology into Adiscon LogAnalyzer. My long-term vision is that reporting capabilities will greatly enhance the utility of this tool. In order for Adiscon to get something back, we will begin to develop some enhanced reports, which will be non-free for commercial users. However, the base product, as well as some base reports, will always remain free!

I hope you consider this to be good news, just as I do! Thanks to everyone who made this possible.

Some thoughts on reliability…

When talking syslog, we often talk about audit or other important data. A frequent question I get is whether syslog (and rsyslog specifically) can provide a reliable transport.

When this happens, I first need to ask what level of reliability is needed. There are several flavors of reliability, and usually some loss of messages is acceptable at some level.

For example, let’s assume the process writes log messages to a text file. Under (almost?) all modern operating systems, and by default, this means the OS accepts the information to write, acks it, does NOT persist it to storage, and lets the application continue. The actual data block is usually written a short while later. Obviously, this is not reliable: you can lose log data if an unrecoverable I/O error happens or something else goes fatally wrong.

This can be solved by instructing the operating system to actually persist the information to durable storage before returning from the API. You have to pay a big performance toll for that. This is also a frequent question for syslog data, and many operators do NOT sync, accepting a small message-loss risk to save themselves from needing roughly ten times the servers they have now.
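A minimal C sketch of the difference (the file name is a placeholder): the write() alone only hands the data to the OS, while the fsync() is what forces it onto durable storage – and that is where the performance toll comes from. It also reflects the point below: if reliable logging is really required, a failed write or sync leaves no honest option but to stop.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char *msg = "Oct 11 22:14:15 host app: something happened\n";
      int fd = open("/var/log/app.log", O_WRONLY | O_CREAT | O_APPEND, 0640);
      if (fd < 0) { perror("open"); return 1; }

      /* this only hands the data to the OS buffer cache ... */
      if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

      /* ... and only this forces it onto durable storage (the expensive part) */
      if (fsync(fd) < 0) { perror("fsync"); return 1; }

      close(fd);
      return 0;
  }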

But even if writes are synchronous, how does the application react? For example: what shall the application do if log data cannot be written? If one really needs reliable logging, the only choice is to shut down the application when it can no longer log. I know of very few systems that actually do that, even though “reliability” is highly demanded. Here, the cost of shutting down the application may be so high (or even fatal) that the limited risk of log data loss is accepted.

There are a myriad of things to consider when thinking about reliability. So I think it is important to define the level of reliability that is required by the solution, and to do that in detail. To the best of my knowledge, this is also important for operators who are required by law to do “reliable” logging. If they have a risk matrix, they can define where it is “impossible” (for technical or financial reasons) to achieve full reliability, and to my understanding this is the kind of information auditors are looking for.

So for all cases, I strongly recommend thinking about which level of reliability is needed. But to provide an answer for the rsyslog case: it can provide very high reliability and will most probably fulfil all needs you may have. But, as said above, there is a toll in both performance and system uptime for going to “full” reliability.
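As a sketch of what “very high reliability” can look like in rsyslog’s legacy configuration format (the host name, port and queue name are placeholders): a disk-assisted action queue that buffers messages and retries indefinitely when the central server is down.

  # forward via plain TCP, with a disk-assisted queue for reliability
  $ActionQueueType LinkedList        # in-memory queue ...
  $ActionQueueFileName fwdq          # ... that spills to disk when needed
  $ActionQueueSaveOnShutdown on      # persist queued messages on shutdown
  $ActionResumeRetryCount -1         # retry forever if the target is down
  *.* @@central.example.com:514      # @@ selects plain TCP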

The typical logging problem as viewed from syslog

I run into different syslog use cases from time to time. So I thought it would be a good idea to express what I think the typical logging problem is. Since they address this typical problem, syslog (and WinSyslog and rsyslog specifically) cover most needs very well. What they leave out is the analysis and correlation part, but other members of the family (like our Adiscon LogAnalyzer) and third parties cover that well.

So the typical logging problem, as seen from the syslog perspective, is the following (a concrete example follows the list):

  1. there exist events that need to be logged
  2. a single “higher-level” event E may consist of a number of fine-grained lower-level events e_i
  3. each of the e_i’s may be on different systems / proxies
  4. each e_i consists of a subset of properties p_j from a set of all possible common properties P
  5. in order to gain higher-level knowledge, the high-level event E must be reconstructed from e_i’s obtained from *various* sources
  6. a transport mechanism must exist to move event e_i records from one system to another, e.g., to a central correlator
  7. systems from many different suppliers may be involved, resulting in different syntax and semantics of the higher-level objects
  8. there is potentially a massive amount of events
  9. events potentially need to be stored for an extended period of time
  10. quick review of at least the current event data (today, past week) is often desired
  11. there is a lot of noise data
  12. the data needs to be fed into backend processes, like billing systems
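As a concrete, entirely hypothetical illustration: take the delivery of a single mail message as the higher-level event E. The lower-level events e_i might then be syslog lines from different subsystems, tied together by a shared queue ID (all host names, IDs and addresses below are invented):

  relay1 postfix/smtpd[2345]: 4F2A1: client=client.example.net[192.0.2.10]
  relay1 postfix/qmgr[2346]:  4F2A1: from=<alice@example.net>, size=15342
  relay1 postfix/smtp[2350]:  4F2A1: to=<bob@example.com>, status=sent

Reconstructing E means collecting all e_i that share the queue ID 4F2A1 – across processes and, in larger setups, across systems.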

on leap seconds and syslog

I was recently asked how syslog handles leap seconds. I thought it would be useful to reproduce my thoughts, initially expressed via private mail, here in the blog.

RFC5424 specifically forbids leap seconds, as during our discussions we found many cases where leap seconds caused grief. I also think that TAI is considering abandoning the use of leap seconds for this reason as well. To the best of my knowledge, GPS also does not use leap seconds. The ultimate reason to abandon UTC leap seconds in syslog was that we failed to identify an operating system that would expose leap seconds to a user process. So a syslogd, or any other syslog sender, would not even be able to see that one was introduced. From the syslog perspective, a leap second is just like any other second, but time flows “somewhat slower”. I guess we are in the same boat as many operating systems with this perspective.

In RFC5424 we didn’t explicitly state what time stamp should be written during a leap second – because we thought it could actually never happen (why? explained above!). But to me, “Leap seconds MUST NOT be used” means that it should be expressed as the 59th second of said minute. But even if you bump the minute and use the 0th second, I cannot see how this would be problematic. On a single system, time should still evolve serially. For correlating events from multiple systems, the timestamp alone is insufficient in any case. You cannot synchronize the different real-time clocks closely enough. So you need a different mechanism (like Lamport clocks) for this in any case.
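To make that concrete, here is how the leap second at the end of 2008 would look in RFC5424 timestamps under this reading: the :60 notation is simply never emitted.

  2008-12-31T23:59:59.500Z   <- a normal instant in second 59
  2008-12-31T23:59:60.000Z   <- forbidden by RFC5424
  2008-12-31T23:59:59.999Z   <- the leap second, expressed within second 59
  2009-01-01T00:00:00.000Z   <- the next minute continues as usual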

Helping to find major issues in rsyslog…

Rsyslog has a very rapid development process, complex capabilities, and now gradually gets more and more exposure. While we are happy about this, it also has some bad effects: some deployment scenarios have probably never been tested, and it may be impossible for the development team to test them because of the resources that would be required. So while we try to avoid it, you may see a serious problem during deployments in demanding, non-standard environments (hopefully not with a stable version, but chances are good you’ll run into trouble with the development versions).

Active support from the user base is very important to help us track down those things. Most often, serious problems are the result of some memory misaddressing. During development, we routinely use valgrind, a very capable memory debugger. This helps us create pretty clean code. But valgrind cannot detect everything – most importantly, not problems in code paths that are never executed. So what is of most use to us is information about aborts and abort locations.
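For reference, this is roughly how rsyslog can be run under valgrind (a sketch; the config path is a placeholder, and -n keeps rsyslogd in the foreground):

  valgrind --leak-check=full --num-callers=20 \
           rsyslogd -n -f /path/to/test.conf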

Unfortunately, faults rooted in addressing errors typically show up only later, so the actual abort location is in an unrelated spot. To help track down the original spot, libc later than 5.4.23 offers support for finding such errors, and possibly temporary relief from them, by means of the MALLOC_CHECK_ environment variable. Setting it to 2 is a useful troubleshooting aid for us: it makes the program abort as soon as the check routines detect anything suspicious (unfortunately, this may still not be the root cause, but hopefully closer to it). Setting it to 0 may even make some problems disappear (but it will NOT fix them!). With functionality comes cost, and so exporting MALLOC_CHECK_ without need comes at a performance penalty. However, we strongly recommend adding this instrumentation to your test environment should you see any serious problems. Chances are good it will help us interpret a dump better, and thus craft a fix more quickly.
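In practice, this is as simple as the following sketch (the rsyslogd options are just an example):

  # make malloc abort as soon as heap corruption is detected
  export MALLOC_CHECK_=2
  rsyslogd -n -f /etc/rsyslog.conf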

In order to get useful information, we need some backtrace of the abort. First, you need to make sure that a core file is created. Under Fedora, for example, that means you need to have an “ulimit -c unlimited” in place.

Now let’s assume you got a core file (e.g. in /core.1234). So what to do next? Sending a core file to us is most often pointless – we would need to have the exact same system configuration in order to interpret it correctly, and chances are extremely slim of that being the case. So we would appreciate it if you could extract the most important information yourself. This is done as follows:

  • $ gdb /path/to/rsyslogd /core.1234
  • (gdb) info threads
  • you’ll see a number of threads (numbered 1 to n, with n being
    the highest number). For each of them, do the following (let’s assume
    that i is the thread number):

    • (gdb) thread i (e.g. thread 1, thread 2, …)
    • (gdb) bt

  • then you can quit gdb with “(gdb) quit”
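If you have a reasonably recent gdb, the whole per-thread loop can also be done non-interactively (a sketch; paths are placeholders):

  gdb --batch -ex "thread apply all bt" /path/to/rsyslogd /core.1234 > backtrace.txt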

Then please send all the information that gdb spits out to the development team. It is best to first ask on the forum or mailing list how to do that. The developers will keep in contact with you and, I fear, will probably ask for other things as well ;)

Note that we strive for the highest reliability of the engine, even in unusual deployment scenarios. Unfortunately, this is hard to achieve, especially with limited resources. So we depend on cooperation from users. This is your chance to make a big contribution to the project without the need to program or do anything else except get a problem solved ;)

Tools to detect stack addressing problems?

Since I began to use the valgrind memory debugger routinely in rsyslog development (some two years ago), the quality of the source has increased considerably. Unfortunately, however, valgrind is not able to detect problems related to misaddressing variables on the stack. The 5.3.6 bug I hunted for almost a week is a good example of this. Valgrind also provides only limited support for global data, as far as I know (and as I see from testing results).

This restriction becomes even more important as I have moved a lot of former heap memory use to the stack for performance reasons. I remember at least one more major bug-hunting effort where the bug was hard to find because it affected only stack space.

So I am currently looking for tools that could complement valgrind by providing good stack checking capabilities. One tool that was suggested to me is mudflap. It sounds interesting, but gives me a very hard time: very hard to read debug output (no symbolic names for dlopen()ed modules), (false?) reports for areas where I cannot see anything wrong, as well as frequent (threading-related?) crashes when running under instrumentation. Maybe I am just misinterpreting the output…
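For reference, this is roughly how I have been trying it (a sketch using GCC 4.x mudflap; the file name is a placeholder and the MUDFLAP_OPTIONS values are just examples):

  # compile and link with mudflap instrumentation (threaded variant)
  gcc -g -fmudflapth foo.c -o foo -lmudflapth
  # run with leak reports, aborting on the first detected violation
  MUDFLAP_OPTIONS="-print-leaks -viol-abort" ./foo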

In short: I would highly appreciate suggestions for tools that can help with debugging stack memory access (support for global data would be a plus) – and/or instructions on how to interpret mudflap’s output, if that is considered to be *the* tool for this use case.

Thanks,
Rainer