getting a bit more reliability from plain tcp syslog

As I have described in my recent post, plain TCP syslog is unreliable by its very nature. However, Martin Schütte had a good idea that helps improve the reliability, at least under some circumstances.

Martin showed that by doing a recv() call in front of the send(), reliability improves. This is because the local TCP stack is then forced to evaluate whatever the remote peer sent to it. Without that recv() call, the local stack may not evaluate the response until later. Of course, this is implementation specific, but seems to be common. From traces, I have seen that there seems to be good reason to do so: the number of packets sent is slightly lower when not doing the recv(), so it looks like the buffering is improved. On the other hand, this costs reliability as a remote shutdown is not immediately detected. I have not verified the buffering behaviour by looking at the tcp stack sources, but the traces are evidence enough. At least for practical purposes.

The recv() suggested by Martin forces the stack to evaluate packets received. If the remote end has closed the connection, that makes the local stack detect the condition. Consequently, loss of connection can be detected at this stage. Without the recv(), this needs at least one other send() call and thus causes the loss of one message. So we increase the reliability of plain tcp syslog by being able to detect the error state earlier.

However, this approach is racy. For example, if the remote peer initiated the shutdown after we did the recv() (but before the send()), this will not help. Also, it does not help if there is high traffic. In this case, the remote peer may be forced to shutdown the connection in the middle of a message. Keep in mind that a syslog server has no way of telling the client it intends to shut down. So if it must, it has no other chance than to terminate the connection forcefully (maybe after a short timeout, depending on the needs).

So, in short, the recv() trick often works if we have:

a network with likely no packet loss
a network with low latency
low network traffic

Fortunately, this is the situation we experience in many entry-level local networks. So for them, doing the recv() provides an improvement over not doing it. Most importantly, plain tcp syslog will deterministically lose one message in all environments without the recv(). With it, there is a chance the message will not be lost.

The larger the environment, the more slim this chance is. Unfortunately, I assume that in enterprise environments that really need reliability, the recv() trick does not contribute much to improving reliability (because they do not have the perquisites outlined above). So for enterprises, plain tcp syslog is still not an option when there is need for reliable logging (e.g. in an auditing environment). Protocols with app-level acknowledgement are required there (e.g. RFC 3195, RELP or SETP).

I have thought long how this affects rsyslog. I have now concluded I will give the recv() trick a try. I think it can improve the situation, at least in some practical scenarios. The problem is that it is kind of a hack. I need to find a well enough way to implement it into my netstream driver layer that doesn’t force the code to do something where it is illogical while at the same time keeping the driver clean to be used with more sophisticated protocols. So it is not a matter of adding a simple API call. I’ll probably introduce a driver call which may do the recv(), only if it makes sense for that type of driver, and does a dummy return for those drivers that do not (need) to support it.

For completeness, I reproduce some of my comments on Martin’s blog just so that I have a stable reference within a domain a manage (things too often disappear too quickly ;)).

Rainer Gerhards says:
20080530 11:55Martin, I have read your update. This use case isn’t fixed either, at least not reliably. It is racy. You get the same results be just doing the send() without the recv() – just give it a try. ;) — Rainer
Rainer Gerhards says:
20080530 14:34OK, I am not just ignorant ;) I’ve given it another try, while racy it looks like it indeed solves the problem in the majority of simple cases. I just wonder what happens under the hood. I’ll have a look :) — Rainer
Rainer Gerhards says:
20080530 15:02Interesting. I’ve now done a wireshark trace. As it looks, the IP stack does NOT process the incoming connection request until after it actually sends the message. If you do the recv(), this forces this evaluation. I have not verified this with the tcp sources and I don’t think it makes much sense to do so. It’s probably a bit platform-specific, depending on what the IP stack does. Anyhow, it seems to greatly reduce the problem in non-failure cases and does not cause much performance pain. So it is worth adding it. I’ll now apply it once again to the real rsyslog and see how it performs there. I keep you posted. Rainer
Rainer Gerhards says:
20080602 15:40OK, I have done some further stress testing. Unfortunately, the solution does not work as soon as the traffic gets higher. In client.c, you do a sleep. This sleep makes sure that the remote end has time to do its close and notify the client before the next message is sent. I modified you examples so that the sleep() is no longer used and 1000 messages are sent by the client. Then, the disconnect happens 1 byte into message 423 (stream was partly received). The next message I see is 701. The client also notices the problem at 701. So I have lost around 280 messages without anybody noticing it. Of course, on a non-busy server that does not happen, but it proves the point that even in a nice clean close scenario without packet loss you can lose data on a server restart. There is no reliable cure. I’ll now evaluate if it makes sense to add your suggested recv() logic for those cases where it helps. I am just unsure if these are a large enough sample to justify this code change. Anyhow… Please let me know any further findings you may have (did I overlook something)?

Rainer