rsyslog queues, reliability and forwarding

I would like to explain a little bit how rsyslog queues and forwarding interact. First of all, let me state that the queue engine and the action part of rsyslog, where the forwarding happens, are strictly separated. There is no interface between the two that permits an action to affect queue behaviour. For example, I was asked if a forwarding action could initiate disk queue mode when forwarding fails. The root reason for this request was that messages should not be endangered while a remote server is down.

This is not possible with the current design and would involve a far-from-trivial design change. However, I do not think that the functionality is actually needed. When talking about reliability, rsyslog works on the importance of messages, not on the importance of actions.

So in rsyslog we can configure the level of message loss that is acceptable for a given use case. This can be done on an action-by-action basis (or once at the ruleset/main queue level for all actions, which is usually not a good idea, but from time to time it may be). The extreme ends are a) no message loss at all is permitted and b) message loss of any magnitude is acceptable. For a), this means we need to configure a disk-only queue with full sync of queue files and management information after each message processed (with message input batches of one). For b), this means we do not need to place any restrictions. Obviously, a) is rather slow while b) is extremely fast. As the extremes are seldom what is actually needed, there are many configuration possibilities between them. For example, one may define a queue that goes to disk if more than 1,000 messages are kept in main memory, but only then, and that fsyncs queue files only every 10 messages (a big performance saver). That means that at any instant, at most 1,010 messages are at risk and can potentially be lost. The queue then monitors these predicates and switches to disk mode only when actually required, which saves a lot of performance.
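To make the in-between example concrete, here is a minimal configuration sketch in current rsyslog (RainerScript) syntax. The queue parameters used are the documented ones, but the target host, port and queue file name are placeholders, and the numbers simply mirror the example above:

    # forward to a remote server; keep the queue in memory, but start
    # writing to disk once more than 1,000 messages accumulate
    action(type="omfwd" target="central.example.net" port="514" protocol="tcp"
           queue.type="LinkedList"        # in-memory queue...
           queue.filename="fwd_q"         # ...made disk-assisted by giving it a file name
           queue.highwatermark="1000"     # switch to disk mode above 1,000 messages
           queue.checkpointinterval="10"  # update queue housekeeping every 10 messages
           queue.syncqueuefiles="on")     # fsync queue files at each checkpoint

With these settings, at most the 1,000 in-memory messages plus up to 10 not-yet-synced ones are at risk at any instant, which is where the 1,010 figure above comes from.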

Now let’s shift the perspective a bit. Let’s go back to our starting example and say you want to go to disk only when the remote system is down. But what if the remote system is up, but cannot accept messages quickly enough? Let’s assume a backlog of 10,000 messages builds up. Is it then acceptable to keep these in main memory, just because the remote system is accepting data? If this risk is acceptable, why would it be unacceptable if the remote system were not accessible at all? If we say one case is acceptable but the other is not, we somewhat contradict ourselves: whether the remote system is currently accepting messages is almost random, so why should it make a difference in the risk we permit?

This contradiction is the core reason why rsyslog does not look at external events or action failure causes but rather works on the basis of “acceptable risk”. Let’s say it is acceptable to lose 1,000 messages. Then it is irrelevant whether we lose them while the remote system is accepting or not. Consequently, rsyslog enforces disk mode if the remote system is down and there are more than 1,000 messages inside the queue. But it does not enforce this if there are only 500 messages waiting to be sent. Why should it? After all, the user has specified that a loss of 1,000 messages is acceptable, and so we do not try to guard these messages more than required by this policy. Note, of course, that if rsyslog is terminated in such a situation, a disk queue with 500 messages is created. We do not voluntarily lose messages, and if we terminate, we can no longer hold them in main memory. Consequently, they must be written out (again depending on configuration, of course). So the in-memory queue is retained across rsyslog restarts. But it is important to point out that unexpected aborts (like a sudden loss of power) can cause message loss in such scenarios. This is no different from a sudden loss of power with an accepting remote system and a queue of 500 messages. If such a risk is unacceptable, we end up with what I initially described in scenario a).
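The shutdown behaviour just described maps to a single additional queue parameter. As a sketch, added to the forwarding action configured earlier:

    queue.saveonshutdown="on"   # write messages still in memory to the disk queue when rsyslog stops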

As a side note, rsyslog queues provide very high reliability: every message is removed from the queue only after the action acknowledges that it has been processed. This kind of reliability is used in some very demanding audit-grade environments (which I am, as usual, not permitted to name…).
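Such audit-grade setups are essentially extreme a) from above expressed in configuration. Again a sketch only, with placeholder target and file names:

    # disk-only queue, synced after every message, nothing ever discarded
    action(type="omfwd" target="audit.example.net" port="514" protocol="tcp"
           queue.type="Disk"                # every message is written to disk, always
           queue.filename="audit_q"
           queue.checkpointinterval="1"     # update housekeeping data after each message
           queue.syncqueuefiles="on"        # fsync queue files at each checkpoint
           queue.dequeuebatchsize="1"       # process messages strictly one at a time
           action.resumeretrycount="-1")    # retry forever instead of ever dropping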

To sum up, rsyslog protects message integrity not by external events but by user-configurable “acceptable risk” policies.

We consider this a superior approach, as external events are a somewhat unreliable guide when it comes to protecting against traffic bursts. Relying on them leads to a number of anomalies, as hopefully explained above.

Some thoughts on reliability…

When talking about syslog, we often talk about audit or other important data. A frequent question I get is whether syslog (and rsyslog in particular) can provide a reliable transport.

When this happens, I first need to ask what level of reliability is actually needed. There are several flavors of reliability, and usually some level of message loss is acceptable.

For example, let’s assume a process writes log messages to a text file. Under (almost?) all modern operating systems, this by default means the OS accepts the information to write, acknowledges it, does NOT persist it to storage, and lets the application continue. The actual data block is usually written a short while later. Obviously, this is not reliable: you can lose log data if an unrecoverable I/O error happens or something else goes fatally wrong.

This can be solved by instructing the operating system to actually persist the information to durable storage before returning from the API. You have to pay a big performance toll for that. This is also a frequent question for syslog data, and many operators do NOT sync, accepting a small risk of message loss to save themselves from needing roughly ten times the servers they currently run.
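In rsyslog itself this trade-off is exposed as the omfile “sync” setting, which is off by default. A minimal sketch, with placeholder file names:

    # default: buffered writes, fast, but with a small loss window
    action(type="omfile" file="/var/log/messages.log" sync="off")

    # sync after each write transaction: much safer, but with the big performance toll
    action(type="omfile" file="/var/log/audit.log" sync="on")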

But even if writes are synchronous, how does the application react? For example: what should the application do if log data cannot be written? If one really needs reliable logging, the only choice is to shut down the application when it can no longer log. I know of very few systems that actually do that, even though “reliability” is highly demanded. Here, the cost of shutting down the application may be so high (or even fatal) that the limited risk of log data loss is accepted.

There are myriad things to consider when thinking about reliability. So I think it is important to define, in detail, the level of reliability that a solution requires. To the best of my knowledge, this is also important for operators who are required by law to do “reliable” logging. If they have a risk matrix, they can define where it is “impossible” (for technical or financial reasons) to achieve full reliability, and to my understanding this is exactly the information auditors are looking for.

So for all cases, I strongly recommend thinking about which level of reliability is needed. But to provide an answer for the rsyslog case: it can provide very high reliability and will most probably fulfil all needs you may have. But, as said above, there is a toll in both performance and system uptime for going to “full” reliability.