Writing syslog messages to a RAM buffer

I just found an interesting article about ramlog, which makes it possible to write syslog messages to a RAM disk. This sounds silly at first (after all, don’t we want the data secured in case of a failure?). But it makes an awful lot of sense in some scenarios, like laptops. But read for yourself:

http://www.linux.com/feature/141231

Thinking about it, this is actually a buffer that performs a delayed write, but one where the data is already accessible to the user while it waits. Delayed writing as such may be useful in its own right. Given that rsyslog already has queues, it may be worthwhile to extend them with a mode where data is kept in the queue until either a configured number of messages has accumulated or a configured period of time has elapsed. That could be used to generate more efficient, bursty disk writes. For example, we could say that log messages are flushed only once at least ten are queued or the oldest one is more than a minute old.
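To make the idea concrete, here is a minimal sketch of such a flush policy in C. All names (queue_t, flush_due, MAX_BATCH, MAX_AGE) are made up for illustration and are not actual rsyslog queue code:

    #include <stdbool.h>
    #include <time.h>

    #define MAX_BATCH 10  /* flush once this many messages are queued ... */
    #define MAX_AGE   60  /* ... or once the oldest one is this old (seconds) */

    typedef struct {
        size_t count;     /* messages currently held in the queue */
        time_t oldest;    /* enqueue time of the oldest held message */
    } queue_t;

    /* Decide whether the buffered messages should be written out now,
     * producing one efficient burst write instead of many small ones. */
    static bool flush_due(const queue_t *q, time_t now)
    {
        if (q->count == 0)
            return false;
        if (q->count >= MAX_BATCH)
            return true;
        return (now - q->oldest) >= MAX_AGE;
    }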

Note that this is a use case quite different from what ramlog addresses. Actually, they could even be combined. If you like this idea, please tell me, so that I can consider implementing it.

rsyslog work log / what’s going on / script engine…

Hi folks,

I have not posted any work log for a long time. Even worse, I have not fully kept the log up to date. So what’s going on? Is the rsyslog development pace slowing down?

No, definitely not! I am just currently occupied with helping users, doing some labs myself, working a bit on phpLogCon and thinking about how best to do the scripting engine.

Oh yes, the scripting engine… I keep promising and promising it. The reason it takes so long is that I want a perfect solution on the first shot (well, as close to perfect as possible…). And I want to keep it backwards-compatible with what we currently have. This involves a lot of things under the hood, and it requires very careful planning to do it right. In fact, I am even re-evaluating some theoretical concepts for the scripting language; maybe it makes sense to adopt a few more ideas. But, even though you do not see it, I am still very focused on it. It is the next big thing to happen.

Just to give you an example of the subtle problems: how to *best* handle the “*.*” PRI selectors that sit at the start of almost all filter rules in current configs. If that is not mapped properly, it can cause an enormous performance hit. To support it well, I may need a constant folding phase in my interpreted language, something I so far had not considered necessary. And I am still trying to avoid it (it offers little benefit in almost all other cases).
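To sketch what constant folding buys here (purely illustrative; the AST layout and names are hypothetical, not rsyslog internals): a “*.*” selector compiles to a filter expression that is always true, so a folding pass at config-load time can remove the per-message evaluation entirely:

    /* Hypothetical AST for compiled filter expressions. */
    typedef enum { N_CONST, N_PRI_MATCH, N_AND } node_type_t;

    typedef struct node {
        node_type_t  type;
        int          val;     /* N_CONST: 0/1; N_PRI_MATCH: severity mask */
        struct node *l, *r;   /* operands of N_AND */
    } node_t;

    #define PRI_MASK_ALL 0xFF /* "*.*": every facility at every severity */

    /* One folding pass, run once when the config is loaded. */
    static node_t *fold(node_t *n)
    {
        if (n == NULL)
            return NULL;
        if (n->type == N_PRI_MATCH && n->val == PRI_MASK_ALL) {
            n->type = N_CONST;  /* "*.*" always matches ... */
            n->val  = 1;        /* ... so no per-message check is needed */
        } else if (n->type == N_AND) {
            n->l = fold(n->l);
            n->r = fold(n->r);
            if (n->l->type == N_CONST)
                return n->l->val ? n->r : n->l;  /* true AND x => x */
            if (n->r->type == N_CONST)
                return n->r->val ? n->l : n->r;
        }
        return n;
    }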

With that said, on to the (partial) work log:

Past day’s rsyslog work log:
2008-06-13
– created autoconf support for a rsyslog testbench
– created the first very rough beginnings of a RainerScript testbench
2008-06-23
– disabled warnings caused by third-party libraries
2008-06-24
– enhanced gtls driver to support fully async operation (there was
a bug before if gtls issued EAGAIN without getting any data)
2008-06-25
– bugfix: comments after actions were not properly treated. For some
actions (e.g. forwarding), this could also lead to invalid configuration;
applied to beta branch

… something missing, see git log ;) …

2008-07-01
– released 3.19.8

rsyslog error message repository

Starting with 3.19.8, rsyslog finally offers specific error codes as part of the syslog tag. For example, rsyslogd-2040 means that a file could not be found. I have added these tags both to facilitate log parsing and to ease troubleshooting.
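As an illustration, the numeric part of a tag like “rsyslogd-2040:” can be pulled out with a few lines of C (the tag format follows the example above; everything else here is an assumed sample, not rsyslog code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Extract the numeric error code from a tag such as "rsyslogd-2040:".
     * Returns -1 when the tag carries no code. */
    static int error_code(const char *tag)
    {
        const char *p = strchr(tag, '-');
        return p ? atoi(p + 1) : -1;
    }

    int main(void)
    {
        printf("%d\n", error_code("rsyslogd-2040:"));  /* prints 2040 */
        return 0;
    }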

But a tag is only as good as the information that it helps to find. Consequently, I have started to describe error cases inside the knowledge base’s event repository:

http://kb.monitorware.com/kbeventdb-list-1-Adiscon-rsyslog.html

So far, only a limited set of messages is available (to phrase it politely ;)), but I plan to grow it over time. Note that there is an interactive feature through which questions about a message can be posted directly to the forum. I hope this is useful.

If you run into an error message that is not yet described, let us know and we’ll add an entry. In the long term, the new knowledge base section should be able to solve most problems.

rsyslog work log

Past day’s rsyslog work log:
2008-06-09
– somewhat improved plain tcp syslog reliability by doing a connection
check before sending. Credits to Martin Schuette for providing the
idea.
– made rsyslog tickless in the (usual and default) case that repeated
message reduction is turned off. More info:
http://blog.gerhards.net/2008/06/coding-to-save-environment.html
2008-06-11
– applied some patches from Michael Biebl – thanks!
– released 3.19.7
2008-06-12
– changed default for $KlogSymbolLookup to “off” (in beta)

coding to save the environment…

Each one of us has ways to help save the environment. Interestingly, there are even some options exclusively available to coders ;)

Michael Biebl made me aware some time ago that the Linux community is trying to save power by disabling unnecessary wakeups inside software. For example, the Linux kernel has become mostly tickless, that is, it no longer needs to wake up at a fixed frequency. This is important because modern CPUs have power-save modes: simply put, the less is computed, the less power is needed. An idle CPU uses the least power. However, when an application wakes up, the CPU resumes modes that need more power. Of course, the CPU does not know whether the work it is asked to execute is actually useful or could be done more cleverly. You can find more details at http://www.lesswatts.org/.

But why am I telling you this? The reason is that rsyslog, too, had (and unfortunately still has) some wakeups which are not strictly necessary. For example, the current imfile implementation needs to poll input files. This is the most portable way to do things, but it is not power-efficient. An update to a more power-efficient solution (on the platforms that support it) is scheduled. Next, the –MARK– message generator runs periodically. This one can probably never be avoided, because the actual functionality is the periodic activity. Any user should consider whether that functionality is needed and, if not, leave this plugin disabled.
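For the curious: on Linux, the more power-efficient alternative to polling would be something like inotify, where the process sleeps until the kernel reports a change. A minimal sketch (the path is a placeholder; this is not rsyslog’s actual imfile code):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    int main(void)
    {
        char buf[4096];
        int fd = inotify_init();
        if (fd < 0 || inotify_add_watch(fd, "/var/log/app.log", IN_MODIFY) < 0)
            return 1;
        for (;;) {
            /* read() blocks: zero wakeups while the file is idle */
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;
            puts("file modified");  /* a real reader would fetch the new lines */
        }
        close(fd);
        return 0;
    }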

Finally, the rsyslog core awoke every 30 seconds to check whether repeated messages needed to be flushed. However, repeated message reduction is no longer the default, and I assume that most people do not use it; the default is to have it turned off. But even when it was turned off, rsyslog still awoke to do the (unnecessary) check. I have changed this: rsyslog now only initiates the wakeups when they are actually needed. In many (if not most) environments this means rsyslog will never again wake up the system if there is no need to. It’s a small saving, but every little bit counts toward the ultimate goal. And given the potentially large number of systems running rsyslog, the saving may actually be larger than one initially assumes.
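In thread terms, the change amounts to replacing a timed wait with an untimed one whenever no timer-driven work is pending. A rough sketch with hypothetical names (rsyslog’s actual queue code is structured differently):

    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;
    static int need_flush_timer = 0;  /* only set when message reduction is on */

    /* Wait until there is work to do. With the timer disabled, the thread
     * blocks indefinitely, so an idle daemon causes no periodic wakeups. */
    static void wait_for_work(void)
    {
        pthread_mutex_lock(&mut);
        if (need_flush_timer) {
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_sec += 30;                          /* the old 30s tick */
            pthread_cond_timedwait(&cnd, &mut, &ts);
        } else {
            pthread_cond_wait(&cnd, &mut);            /* tickless sleep */
        }
        pthread_mutex_unlock(&mut);
    }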

In the future, repeated message reduction will probably be implemented based on a different paradigm, one that enables us to avoid running on a tick even when the feature is turned on.

In any case, I hope the new feature will contribute to – you name it – slightly longer notebook battery life and less environmental pollution. And, ultimately, it’s an interesting fact that a software developer can affect environmental health these days ;)

The new functionality will be available starting with rsyslog 3.19.7.

reliable plain tcp syslog – once again…

As said in my last blog post on plain tcp syslog reliability, I have now implemented Martin Schütte’s idea in rsyslog’s plain tcp sender. I created an interface inside the netstream driver layer specifically for this method. It is not used by any other transport, so it stays cleanly contained and does not affect any protocol other than the one that has the problem.

I have done some initial testing. It confirms the findings from my last post: the method works well in local, low-traffic environments, but fails in others. However, even in the failure cases, message loss is somewhat lower than without the work-around. So I think it is useful to have it inside rsyslog.

One thing that it affects is the current retry logic. Rsyslog used the work-around of resending the last message, which was most probably lost, after a successful reconnect. As connection loss is now detected much more reliably in some cases, the probability of message duplication greatly increases in those cases. In other cases, however, it may still be desirable to resend the last message. To (somewhat) solve this issue, I have added a new config directive ($ActionSendResendLastMsgOnReconnect on/off) so that the user can state a preference. The default now is to NOT resend these messages on reconnect.
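In config terms, using the new directive would look roughly like this (the target host is a made-up example; note that $Action… directives apply to the actions that follow them):

    # do not resend the last message after a reconnect (the new default)
    $ActionSendResendLastMsgOnReconnect off
    # forward everything via plain tcp
    *.* @@logs.example.net:514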

This functionality will be part of rsyslog 3.19.7.

rsyslog work

Past day’s rsyslog work log:
2008-06-06
– released 3.19.6
– added doc on suggested TLS deployment (rough picture, configuration
sample still missing).
– added new property replacer option “date-subseconds” that makes it
possible to query just the subsecond part of a high-precision timestamp
2008-06-07
– fixed a bug with the new property replacer option in timereported
property – thanks to Elizabeth for reporting it

getting a bit more reliability from plain tcp syslog

As I have described in my recent post, plain TCP syslog is unreliable by its very nature. However, Martin Schütte had a good idea that helps improve the reliability, at least under some circumstances.

Martin showed that by doing a recv() call right before the send(), reliability improves. This is because the local TCP stack is then forced to evaluate whatever the remote peer sent to it. Without that recv() call, the local stack may not evaluate the response until later. Of course, this is implementation-specific, but it seems to be common. From traces, I have seen that there appears to be good reason for this behavior: the number of packets sent is slightly lower when not doing the recv(), so it looks like the buffering is improved. On the other hand, this costs reliability, as a remote shutdown is not immediately detected. I have not verified the buffering behaviour by looking at the TCP stack sources, but the traces are evidence enough, at least for practical purposes.

The recv() suggested by Martin forces the stack to evaluate packets received. If the remote end has closed the connection, that makes the local stack detect the condition. Consequently, loss of connection can be detected at this stage. Without the recv(), detecting it needs at least one more send() call and thus causes the loss of one message. So we increase the reliability of plain tcp syslog by being able to detect the error state earlier.
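In code, the trick boils down to a non-blocking recv() issued right before the send(). A minimal sketch of the technique (error handling trimmed; this is not the actual rsyslog driver code):

    #include <errno.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Send one message over plain tcp, probing the connection first.
     * Returns 0 on success, -1 if the connection is known to be dead. */
    static int probe_and_send(int sock, const char *msg)
    {
        char junk[64];
        ssize_t r = recv(sock, junk, sizeof(junk), MSG_DONTWAIT);

        if (r == 0)
            return -1;  /* peer closed the connection in an orderly way */
        if (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;  /* hard error, e.g. connection reset */

        /* EAGAIN is the usual case: no data, connection looks fine. Note
         * the race: the peer may still close between here and send(). */
        if (send(sock, msg, strlen(msg), 0) < 0)
            return -1;
        return 0;
    }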

However, this approach is racy. For example, if the remote peer initiates the shutdown after we did the recv() (but before the send()), the trick does not help. Nor does it help under high traffic: in that case, the remote peer may be forced to shut down the connection in the middle of a message. Keep in mind that a syslog server has no way of telling the client that it intends to shut down. So if it must shut down, it has no choice but to terminate the connection forcefully (maybe after a short timeout, depending on its needs).

So, in short, the recv() trick often works if we have:

  • a network with likely no packet loss
  • a network with low latency
  • low network traffic

Fortunately, this is the situation in many entry-level local networks, so for them, doing the recv() provides an improvement over not doing it. Most importantly, without the recv(), plain tcp syslog deterministically loses one message whenever the connection breaks, in all environments. With it, there is a chance the message will not be lost.

The larger the environment, the slimmer this chance becomes. Unfortunately, I assume that in enterprise environments that really need reliability, the recv() trick does not contribute much (because they do not have the prerequisites outlined above). So for enterprises, plain tcp syslog is still not an option when there is a need for reliable logging (e.g. in an auditing environment). Protocols with app-level acknowledgment are required there (e.g. RFC 3195, RELP or SETP).

I have thought long about how this affects rsyslog. I have now concluded I will give the recv() trick a try. I think it can improve the situation, at least in some practical scenarios. The problem is that it is kind of a hack. I need to find a good enough way to implement it in my netstream driver layer, one that does not force the code into doing something illogical while still keeping the driver clean for use with more sophisticated protocols. So it is not a matter of adding a simple API call. I’ll probably introduce a driver call that does the recv() only where it makes sense for that driver type, and does a dummy return for those drivers that do not (need to) support it.
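The shape of such a driver call might look roughly like this (purely illustrative; rsyslog’s real netstream interface differs):

    #include <errno.h>
    #include <sys/socket.h>

    /* Hypothetical netstream driver vtable: the connection check is a
     * per-driver callback, so the hack stays contained in one driver. */
    typedef struct netstrm_drvr {
        const char *name;
        int (*chk_connection)(int sock);  /* 0 = assume alive, -1 = dead */
    } netstrm_drvr_t;

    /* Dummy return for drivers that need no probe (e.g. TLS). */
    static int chk_noop(int sock) { (void)sock; return 0; }

    /* The recv() probe, for the plain tcp driver only. */
    static int chk_ptcp(int sock)
    {
        char junk[64];
        ssize_t r = recv(sock, junk, sizeof(junk), MSG_DONTWAIT);
        if (r == 0)
            return -1;  /* orderly close detected */
        if (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;  /* hard error */
        return 0;
    }

    static const netstrm_drvr_t drvr_tls  = { "gtls", chk_noop };
    static const netstrm_drvr_t drvr_ptcp = { "ptcp", chk_ptcp };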


For completeness, I reproduce some of my comments on Martin’s blog, just so that I have a stable reference within a domain I manage (things too often disappear too quickly ;)).

  1. Rainer Gerhards says:
    2008-05-30, 11:55: Martin, I have read your update. This use case isn’t fixed either, at least not reliably. It is racy. You get the same results by just doing the send() without the recv() – just give it a try. ;) — Rainer
  2. Rainer Gerhards says:
    2008-05-30, 14:34: OK, I am not just ignorant ;) I’ve given it another try; while racy, it looks like it indeed solves the problem in the majority of simple cases. I just wonder what happens under the hood. I’ll have a look :) — Rainer
  3. Rainer Gerhards says:
    2008-05-30, 15:02: Interesting. I’ve now done a wireshark trace. As it looks, the IP stack does NOT process the incoming connection close until after it actually sends the message. Doing the recv() forces this evaluation. I have not verified this against the TCP sources, and I don’t think it makes much sense to do so. It’s probably a bit platform-specific, depending on what the IP stack does. Anyhow, it seems to greatly reduce the problem in non-failure cases and does not cause much performance pain. So it is worth adding. I’ll now apply it once again to the real rsyslog and see how it performs there. I’ll keep you posted. — Rainer
  4. Rainer Gerhards says:
    2008-06-02, 15:40: OK, I have done some further stress testing. Unfortunately, the solution does not work as soon as the traffic gets higher. In client.c, you do a sleep. This sleep makes sure that the remote end has time to do its close and notify the client before the next message is sent. I modified your examples so that the sleep() is no longer used and 1000 messages are sent by the client. Then, the disconnect happens 1 byte into message 423 (the stream was partly received). The next message I see is 701. The client also notices the problem at 701. So I have lost around 280 messages without anybody noticing it. Of course, on a non-busy server that does not happen, but it proves the point that even in a nice clean close scenario without packet loss you can lose data on a server restart. There is no reliable cure. I’ll now evaluate whether it makes sense to add your suggested recv() logic for those cases where it helps. I am just unsure whether these cases are common enough to justify the code change. Anyhow… please let me know any further findings you may have (did I overlook something?). — Rainer

rsyslog work log 2

As you can see below, I am currently more involved in non-coding activities, like getting the most reliability out of plain tcp syslog, IETF work and pondering, with the phpLogCon team, all those nice little things that are hard to implement ;)

Past day’s rsyslog work log:
2008-05-28
– released 3.17.3
– (finally switched to Fedora 9 on main dev machine;))
2008-05-29
– enabled POSIX ERE expressions inside the property replacer
(previously only BRE was permitted)
– provided ability to specify that a regular expression submatch shall
be used inside the property replacer
– Worked on Martin’s suggestion for a more reliable plain tcp transport
2008-05-30
– implemented in property replacer: if a regular expression does not match,
it can now either return “**NO MATCH**” (default, as before), a blank
property, or the full original property text
– Worked on Martin’s new suggestion for a more reliable plain tcp transport
This one looks promising, at least for simple cases
2008-06-02
– worked on plain tcp syslog reliability
2008-06-03
– begun working on TLS doc (finally ;))
– bugfix: part of permittedPeer structure was not correctly initialized
thanks to varmojfekoj for spotting this
2008-06-04
– bugfix: off-by-one bug during certificate check
– bugfix: removed some memory leaks in TLS code
– enhanced property replacer to support multiple regex matches
2008-06-06
– IETF work, syslog-transport-tls
– begun doc on unreliable tcp syslog and a work-around based on Martin’s
idea

why you can’t build a reliable TCP protocol without app-level acks…

So again we find ourselves in the area of reliable plain TCP transport. Some weeks ago, I noted that plain tcp syslog is unreliable by design. Yesterday, Martin Schütte came up with his blog post “Reliable TCP Reconnect made Easy”. In it, he describes how he thinks one can get around the limitations.

I was more than interested. Could there be a solution I had overlooked? Martin’s idea is to query the TCP stack whether the connection is alive before sending any data. He outlined two approaches, the latter being a non-blocking recv() right in front of the send(). The idea is that the recv() should detect a broken connection.

After I thought a bit about this approach, I had a concern that it may be racy. But in any case, new thoughts are worth evaluating. And a solution would be most welcome. So I quickly implemented some logic in my existing gtls stream driver. To keep matters simple, I just did the recv() and sent the return status and errno to debug output (but did not try any reconnects based on it). And then the ugly thing happened: I always got the same EAGAIN return state (which is not an error), no matter what state the connection was in. I was even able to pull the receiver’s network cable, and the sender didn’t care.

So this approach doesn’t work. And if you think a bit more about it, that comes as no surprise.

Consider the case of the pulled network cable. When I plugged it in again a quarter of an hour later, TCP happily delivered the “in-transit” messages (which were sitting in the client buffer) after a short while. This is how TCP is supposed to work! The whole point is that it is designed to survive even serious network failures. This is why the client buffers messages in the first place.

What should the poor client do in the “pulled network cable” case? Assume the data is lost just because it cannot be sent immediately? To make it worse, let’s assume the data had already left the system and successfully reached the destination machine. Now it is sitting in the destination’s receive buffer. What if the server application (for example, due to a bug) neither pulls this data nor aborts or closes the file descriptor? How should TCP detect these failures at all? The simple truth is that it cannot, and it is not supposed to.

The real problem is the missing application-level acknowledgment. The transport-level ACK is for use by TCP; it shall not imply anything for the app layer. So if we depend on the TCP ACK in an app-level protocol, we abuse TCP, IMHO. Of course, abusing something may be OK, but we shouldn’t wonder when it doesn’t work as we expect.

Back to the proposed solution: the problem is not that the send() call fails to report an error even though the stack knows the connection is broken. The real problem is that the TCP stack does not know it is broken! Thus, it permits us to send data on a broken connection – it still assumes the connection is OK (and, as the plugged-cable case shows, this assumption is often true).

As such, we can NOT cure the problem by querying the TCP stack whether the connection is broken before we send, because the stack will tell us it is fine in exactly those cases where the actual problem occurs. So we gain nothing from the additional system call. It merely reflects the same unreliable status information that the send() call is also working from.

And now let’s dig a bit deeper. Let’s assume we had a magic system call that told us “yes, this connection is fine and alive”. Let’s call it isalive() for easy reference. So we would be tempted to use this logic:

if (isalive()) then
    send()
else
    recover()

But, as thread programmers know, this code is racy. What if the connection breaks after the isalive() call but before the send()? Of course, the very same problem occurs! And, believe me, it would occur often enough ;).

What we would need is an atomic send_if_alive() call that checks whether the connection is alive and only then submits the data. Of course, this atomicity must be preserved over the network. This is exactly why databases invented two-phase commits. It requires an awful lot of well-thought-out code … and a network protocol that supports it. To ensure this, you need to shuffle at least two network packets between the peers. To handle it fully correctly (including the case where the client dies), you need four – or a protocol that works with delayed commits (as a side-note, RELP works along those lines).
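To make the contrast concrete: an app-level ack in its simplest form means the sender does not consider a message delivered until the receiving application explicitly confirms it. A toy sketch of the principle (not the RELP or RFC 3195 wire format, which use windowing and delayed commits):

    #include <string.h>
    #include <sys/socket.h>

    /* The message only counts as delivered once the peer application
     * confirms it - unlike a TCP-level ACK, which proves nothing about
     * the application having processed the data. */
    static int send_with_ack(int sock, const char *msg)
    {
        char ack[4];

        if (send(sock, msg, strlen(msg), 0) < 0)
            return -1;
        if (recv(sock, ack, sizeof(ack), MSG_WAITALL) != (ssize_t)sizeof(ack))
            return -1;
        return memcmp(ack, "ACK\n", 4) == 0 ? 0 : -1;
    }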

Coming back to our syslog problem: the only way to solve the unreliability problem without specifying app-layer acks inside syslog is to define a whole new protocol that does these acks (quite costly and complex) out of band. Of course, this is not acceptable.

Looking at all of this evidence, I come to the conclusion that my former statement unfortunately is still correct: one cannot implement a reliable syslog transport without introducing app-level acks. It simply is impossible. The only cure, in syslog terms, is to use a transport with acks, like RFC 3195 (the unbeloved standard) or RELP (my non-standard logging protocol). There are no other choices.

The discussion, however, was good insofar as we have now generally established that it is impossible to implement a reliable TCP-based protocol without any application-layer acknowledgment. So we do not need to waste any more effort on trying.

As a funny side-note, I just discovered that I described the problem we currently see in IETF’s transport-tls document back then, on June 16, 2006, nearly two years ago:

http://www.ietf.org/mail-archive/web/syslog/current/msg00994.html

Tom Petch also voiced some other concerns, which still exist in the current draft:

http://www.ietf.org/mail-archive/web/syslog/current/msg00989.html

As you can see, he also mentions the problem of TCP failures. The idea of introducing some indication of a successful connection was quickly abandoned, as it was considered too complex for the time being. But as it looks, plain technical facts do not go away by simply ignoring them ;)

UPDATE, half an hour later…

People tend to be lazy, and so am I. So I postponed doing “the right thing” until now: reading RFC 793, the core TCP RFC, which still has not been obsoleted (it is updated by RFC 3168, which I only glanced at because it seems not to be relevant to our case). In RFC 793, read at least sections 3.4 and 3.7. There you will see that the local TCP stack is not permitted to consider a connection broken until it receives a reset from the remote end. This is the ultimate evidence that you cannot build a reliable syslog infrastructure on top of TCP alone (without app-layer acks).