Writing syslog messages to a ram buffer

I just found an interesting article about ramlog, which enables writing syslog messages to a RAM disk. This sounds silly at first (after all, don't we want the data secured in case of a failure?), but it makes an awful lot of sense in some scenarios, like laptops. But read for yourself:

http://www.linux.com/feature/141231

Thinking about it, this is actually a buffer that performs a delayed write, but with the data already accessible to the user. Delayed writing as such may be useful in its own right. Given that rsyslog already has queues, it may be worth extending them with a mode where data is kept in the queue until either a configured number of messages has accumulated or a configured period of time has elapsed. That could be used to generate more efficient, bursted disk writes. For example, we could say that log messages are flushed only when at least 10 are queued or at least one of them is older than one minute.
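To make the idea concrete, here is a minimal sketch of such a flush condition. This is not rsyslog code; the names and thresholds are purely illustrative.

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    #define FLUSH_MIN_MSGS 10   /* flush once this many messages are buffered... */
    #define FLUSH_MAX_AGE  60   /* ...or once the oldest buffered message is this old (seconds) */

    struct buf_state {
        size_t n_msgs;           /* messages currently buffered */
        time_t oldest_enqueued;  /* enqueue time of the oldest buffered message */
    };

    /* Returns true when the buffered messages should be written to disk. */
    static bool shouldFlush(const struct buf_state *b, time_t now)
    {
        if (b->n_msgs == 0)
            return false;
        if (b->n_msgs >= FLUSH_MIN_MSGS)
            return true;
        return (now - b->oldest_enqueued) >= FLUSH_MAX_AGE;
    }

A periodic timer (or simply the next enqueue operation) would evaluate this check and trigger the burst write.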

Note that this is a use case quite different from what ramlog addresses. Actually, they could even be combined. If you like this idea, please tell me, so that I can consider implementing it.

more on syslog TLS, policies and IETF efforts…

I am still working hard on TLS support, though for the moment at the design level. Here, I'd like to reproduce a message I sent to the IETF syslog WG's mailing list. It outlines a number of important points when it comes to the practical use of TLS with syslog.

I am quite happy that syslog-transport-tls got new momentum right at the time when I finished my TLS implementation in rsyslog and turned to fine-tuning it. The IETF discussion on authentication and policies actually touches right on those places where it really hurts in practice. For the initial TLS implementation, I decided to let rsyslog work in anonymous mode only (it was clear that -transport-tls section 4, as in version 11, would not survive – just as we have now seen).

The next steps in rsyslog are to enable certificate-based access policies, and this is exactly what the IETF discussion is focusing on. Of course, I try to finish the design and to influence the standard in a positive way, so that the rsyslog implementation can be both standards-compliant and useful in practice.

And now – have an interesting read of my mailing list post. Feedback is highly appreciated.

Rainer


Hi all,

I agree with Robert: policy decisions need to be separated. I CC Pasi because my comment is directly related to the IESG requirements, which IMHO cannot be delivered by *any* syslog TLS document without compromise [comments directly related to the IESG come somewhat later, I need to level the ground first].

Let me tell the story from my implementor’s POV. This is necessarily tied to rsyslog, but I still think there is a lot of general truth in it. So I consider it useful as an example.

I took some time yesterday to incorporate the rules laid out in 4.2 into the rsyslog design. I quickly came to the conclusion that 4.2 is talking about at least two things:

a) low-level handshake validation
b) access control

In a) we deal with the session setup. Here, I see certificate exchange and basic certificate validation (for example checking the validity dates). In my current POV, this phase ends when the remote peer can positively be identified.

Once we have positive identification, b) kicks in. In that phase, we need to check (via ACLs) whether the remote peer is permitted to talk to us (or we are permitted to talk to it). Please note that from an architectural POV, this can be abstracted to a higher layer (and in rsyslog it probably will be). For that layer, it is quite irrelevant whether the remote peer's identity was obtained via a certificate (in the case of transport-tls), a simple reverse lookup (UDP syslog), SASL (RFC 3195) or whatever. What matters is that the ACL engine gets a trusted identity from the transport layer and verifies that identity [the level of trust varies, obviously]. Most policy decisions happen on that level.
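As an illustration of that separation of layers, here is a minimal sketch of an ACL check that consumes a peer identity without caring which transport produced it. All names are illustrative, not an existing rsyslog API.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    enum id_source { ID_FROM_TLS_CERT, ID_FROM_DNS_PTR, ID_FROM_SASL };

    struct peer_identity {
        enum id_source source;  /* how the name was obtained; trust level varies */
        char name[256];         /* e.g. "client7.example.net" */
    };

    /* The ACL engine sees only the identity; handshake details stay below it. */
    static bool acl_permits(const struct peer_identity *peer,
                            const char *const *allowed, size_t n_allowed)
    {
        for (size_t i = 0; i < n_allowed; ++i)
            if (strcmp(peer->name, allowed[i]) == 0)
                return true;
        return false;
    }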

There is some gray area between a) and b). For example, I can envision that, given a syslog.conf rule like the following (forward everything to server.example.net)

*.* @@server.example.net

the certificate name check for server.example.net (using the dNSName extension) could probably be part of a) – others may think it is part of b).
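For the curious, such a name check is straightforward with a TLS library. Below is a hedged sketch using GnuTLS as an example library; it is not rsyslog code, just an illustration of where the dNSName comparison would live.

    #include <gnutls/gnutls.h>
    #include <gnutls/x509.h>

    /* Returns non-zero if the peer's certificate matches the expected name
     * (dNSName subjectAltName entries, with fallback to the CN). */
    static int peerNameMatches(gnutls_session_t session, const char *expected)
    {
        unsigned int ncerts = 0;
        const gnutls_datum_t *certs = gnutls_certificate_get_peers(session, &ncerts);
        if (certs == NULL || ncerts == 0)
            return 0;                         /* no certificate presented */

        gnutls_x509_crt_t crt;
        if (gnutls_x509_crt_init(&crt) < 0)
            return 0;

        int ok = 0;
        if (gnutls_x509_crt_import(crt, &certs[0], GNUTLS_X509_FMT_DER) == 0)
            ok = gnutls_x509_crt_check_hostname(crt, expected);
        gnutls_x509_crt_deinit(crt);
        return ok;
    }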

Also, even doing a) places some burden onto the system, like the need to have trust anchors configured in order to do the validation. This hints at at least one more sub-layer.

I think it would be useful to spell out these different entities in the draft.

Coming back to policy decisions, one must keep in mind that the IESG explicitly asked for them inside the document. This was done based on the -correct- assumption that today's Internet is no longer a friendly place. So the IESG would like to see a default policy implemented that provides at least a minimum acceptable security standard. Unfortunately, this is not easy to do in the world of syslog. For home users, we cannot rely on any ability to configure something. For the enterprise folks, we need defaults that do not get in the way of how they do things [aka "can be easily turned off"]. There is obviously much in between these poles, so it largely depends on the use case. I have begun a wiki page with use cases and hope people will contribute to it. It could lead us to a much better understanding of the needs (and of the design decisions that need to be made to deliver on them). It is available at

http://wiki.rsyslog.com/index.php/TLS_for_syslog_use_cases

After close consideration, I think the draft currently fails to address the two use cases defined above properly. Partly it fails because, under the current IESG requirement, it is not possible to be safe by default. We cannot be fully safe by default without configuration, so whatever we specify will fail for the home user.

A compromise may be to provide "good enough" security in the default policy. I see two ways of doing that: one is to NOT address the Masquerade and Modification threats in the default policy, just the Disclosure threat. That leads us to unauthenticated syslog being the default (contrary to what is currently implemented) [Disclosure is addressed in this scenario as long as the client configs are not compromised, which I find sufficient – someone who can compromise the client config can find other ways to get hold of the syslog message content].

An alternative is to use the way HTTPS works: we only authenticate the server. To authenticate, we need a trusted certificate on the server. As we can see with HTTPS, this doesn't really require a PKI. It is sufficient to have the server cert signed by one of a few globally trusted CAs and have these root certificates distributed with all client installations as part of their setup procedure. This is quite doable. In that scenario, a client can verify a server's identity, and the above sample (*.* @server.example.net) could be verified with sufficient trust. The client, however, is still not authenticated. Still, the threats we intended to address are almost all addressed, except for the access control issue, which is defined as part of the Masquerade threat (and which I think is really a different beast and deserves its own threat definition, now that I think about it). In short, we just have an access control issue in that scenario. Nothing else.

The problem, however, is that the server still needs a certificate, and now even one that, for a home user, is prohibitively expensive. The end result will be that people turn off TLS, because they neither know how to obtain the certificate nor are willing to trade in a weekend vacation for one ;) In the end, even that mode will be less useful than anonymous authentication.

The fingerprint idea is probably a smart solution to the problem. It depends on the ability to auto-generate a certificate [I expressed that I don’t like that idea yesterday, but my thinking has evolved ;)] OR to ship every device/syslogd with a unique certificate. In this case, only minimal interaction is required. The idea obviously is like with SSH: if the remote peer is unknown, the user is queried if the connection request is permitted and if the certificate should be accepted in the future. If so, it is added permanently to the valid certificate store and used in the future to authenticate requests from the same peer. This limits the security weakness to the first session. HOWEVER, the problem with syslog is that the user typically cannot be prompted when the initial connection happens (everything is background activity). So the request must actually be logged and an interface be developed that provides for user notification and the ability to authorize the request.

This requires some kind of "unapproved certificate store" plus a management interface for it. Done well, this may indeed enable a home user to gain protection from all three threats without even knowing what he really does. It "just" requires some care in approving new fingerprints, but that's a general problem with human nature that we may tackle by good user interface design but can't solve from a protocol point of view.
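A sketch of what the fingerprint handling could look like, again with GnuTLS as an example library; the two store helpers are assumed, hypothetical functions, not an existing API.

    #include <gnutls/gnutls.h>
    #include <gnutls/x509.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* assumed helpers: lookups into the "known" and "unapproved" stores */
    bool fp_is_known(const unsigned char *fp, size_t len);
    void fp_queue_for_approval(const unsigned char *fp, size_t len);

    static bool acceptPeerByFingerprint(gnutls_x509_crt_t crt)
    {
        unsigned char fp[20];          /* SHA-1 fingerprint */
        size_t fplen = sizeof(fp);

        if (gnutls_x509_crt_get_fingerprint(crt, GNUTLS_DIG_SHA1, fp, &fplen) < 0)
            return false;

        if (fp_is_known(fp, fplen))
            return true;               /* previously approved: accept silently */

        /* Unknown peer: syslog runs in the background, so we cannot prompt.
         * Log the event and park the fingerprint for later admin approval. */
        fp_queue_for_approval(fp, fplen);
        fprintf(stderr, "connection rejected: unapproved certificate fingerprint\n");
        return false;
    }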

The bad thing is that it requires much more change to existing syslogd technology. That, I fear, reduces the acceptance rate. Keep in mind that we already have a technically good solution (RFC 3195) which failed miserably in practice because it required too much change.

If I look at *nix implementations, syslogd implementers are probably tempted to "just" log a message saying "could not accept remote connection due to invalid fingerprint xx:xx:…" and leave it to the user to add it to syslog.conf. However, I fear that for most home setups even that would be too much. So in the end, in order to avoid user hassle, most vendors would probably fall back to UDP syslog by default and enable TLS only on user request.

From my practical perspective, this even sounds reasonable (given the needs and imperfections of the real world…). If that assessment is true, we would probably be better off using anonymous TLS as the default policy, with the next priority on fingerprint authentication as laid out above. A single big switch could change between these two in actual implementations. Those users who "just want to get it running" would never find that switch but still be somewhat protected, while the (little) more technically aware can turn it to fingerprint authentication and then will hopefully be able to do the remaining few configuration steps. Another policy is the certificate-chain-based policy, where using public CAs would make sense to me.

To wrap it up:

1. I propose to lower the default level of security, for the reasons given. My humble view is that lower default security will result in higher overall security.

2. We should split authentication policies from the protocol itself, just as suggested by Robert and John. We should define a core set of policies (I think I described the most relevant simple cases above, Robert described some complex ones) and leave it to others to define additional policies based on their demand.

Policies should go either into their own section OR into their own documents. I strongly favor putting them into their own documents if that enables us to finally finish/publish -transport-tls and the new syslog RFC series. If that is not an option, I'd prefer to spend some more work on -transport-tls, even if it delays things further, instead of producing something that does not meet the needs found in practice.

Rainer

> —–Original Message—–
> From: syslog-bounces@ietf.org [mailto:syslog-bounces@ietf.org] On
> Behalf Of robert.horn@agfa.com
> Sent: Thursday, May 08, 2008 5:53 PM
> To: Joseph Salowey (jsalowey); syslog@ietf.org
> Subject: Re: [Syslog] I-D Action:draft-ietf-syslog-transport-tls-12.txt
>
> Section 4.2 is better, but it still needs work to separate the policy
> decisions from the protocol definition. Policy decisions are driven by
> risk analysis of the assets, threats, and environment (among other
> things). These are not uniform over all uses of syslog. That makes it
> important to separate the policy from the protocol, in both the
> specifications and in the products.
>
> In the healthcare environment we use TLS to protect many of our
> connections. This is both an authentication protection and a
> confidentiality protection. The policy decisions regarding key
> management
> and verification will be very similar for a healthcare use of syslog.
> Some
> healthcare sites would reach the same policy decision as is in 4.2, but
> here are three other policy decisions that are also appropriate:
>
> Policy A:
> The clients are provided with their private keys and the public
> certificates for their authorized servers by means of physical media,
> delivered by hand from the security office to the client machine
> support
> staff. (The media is often CD-R because it’s cheap, easy to create,
> easy
> to destroy, and easy to use.) During TLS establishment the clients use
> their assigned private key and the server confirms that the connection
> is
> from a machine with one of the assigned private keys. The client
> confirms
> that the server matches one of the provided public certificates by
> direct
> matching. This is similar to the fingerprint method, but not the same.
> My
> most recent experience was with an installation using this method. We
> had
> two hours to install over 100 systems, including the network
> facilities.
> This can only be done by removing as many installation schedule
> dependencies as possible. The media method removed the certificate
> management dependencies.
>
> Policy B:
> These client systems require safety and functional certification
> before
> they are made operational. This is done by inspection by an acceptance
> team. The acceptance team has a “CA on a laptop”. After accepting
> safety
> and function, they establish a direct isolated physical connection
> between
> the client and the laptop. Then using standard key management tools,
> the
> client generates a private key and has the corresponding public
> certificate generated and signed by the laptop. The client is also
> provided with a public certificate for the CA that must sign the certs
> for
> all incoming connections.
>
> During a connection setup the client confirms that the server key has
> been
> signed by that CA. This is similar to a trusted anchor, but not the
> same.
> There is no chain of trust permitted. The key must have been directly
> signed by the CA. During connection setup the server confirms that the
> client cert was signed by the “CA on a laptop”. Again, no chain of
> trust
> is permitted. This policy is incorporating the extra aspect of “has
> been
> inspected by the acceptance team” as part of the authentication
> meaning.
> They decided on a policy-risk basis that there was not a need to
> confirm
> re-inspection, but the “CA on a laptop” did have a revocation server
> that
> was kept available to the servers, so that the acceptance team could
> revoke at will.
>
> Policy C:
> This system was for a server that accepted connections from several
> independent organizations. Each organization managed certificates
> differently, but ensured that the organization-CA had signed all certs
> used for external communications by that organization. All of the
> client
> machines were provided with the certs for the shared servers (by a
> method
> similar to the fingerprint method). During TLS connection the clients
> confirmed that the server cert matched one of the certs on their list.
> The
> server confirmed that the client cert had been signed by the CA
> responsible for that IP subnet. The server was configured with a list
> of
> organization CA certs and their corresponding IP subnets.
>
> I do not expect any single policy choice to be appropriate for all
> syslog
> uses. I think it will be better to encourage a separation of function
> in
> products. There is more likely to be a commonality of configuration
> needs
> for all users of TLS on a particular system than to find a commonality
> of
> needs for all users of syslog. The policy decisions implicit in
> section
> 4.2 make good sense for many uses. They are not a complete set. So a
> phrasing that explains the kinds of maintenance and verification needs
> that are likely is more appropriate. The mandatory verifications can
> be
> separated from the key management system and kept as part of the
> protocol
> definition. The policy decisions should be left as important examples.
>
> Kind Regards,
>
> Robert Horn | Agfa HealthCare
> Research Scientist | HE/Technology Office
> T +1 978 897 4860
>
> Agfa HealthCare Corporation, 100 Challenger Road, Ridgefield Park, NJ,
> 07660-2199, United States
> http://www.agfa.com/healthcare/

On the (un)reliability of plain tcp syslog…

Did you ever use TCP to transfer syslog reliably? And do you think that makes you immune against message loss? If so, it’s time to think again.

There is a subtle yet important problem with plain tcp syslog. It is a very simple protocol and it works without any app-level acknowledgment. Well, "why care," you may say, "we have the TCP ack." That, of course, is right, but that low-level ack won't help you in all cases. The TCP ack is fine while everything goes well. But if the connection breaks, there is no mechanism in TCP that tells the sender immediately. In fact, "not immediately" actually means "after 2 hours" (and only if you have keep-alive active, else it is even later). This is not a flaw in TCP but desired behavior. To be honest, when the server aborts and is restarted, the client will see an error soon after the server is restarted. On a busy system, that may still be many messages later. So when the client receives the error status, it has no clue at all which messages were lost and which finally made it to the server.

Why is that? The problem here is that (by design!) the TCP send() API returns success as soon as the data is buffered locally, before it has been transmitted to the remote host. This behavior is fine – it enables TCP to be used reliably even when a network path is temporarily down. The downside is that the sender never knows if and when the data was actually received by the remote peer. Of course, there is an upper limit on the buffer size. It depends on TCP window negotiations, and if I remember correctly the ultimate upper limit is around 240K. I haven't checked, but the lower limit is probably around 1.5K, the size of an Ethernet packet.

If we now assume a generous syslog message size of 150 bytes (many are much smaller), we can have between 10 and 1,600 messages sitting in this buffer! So we may lose up to 1,600 messages if the server goes down – without even noticing it. Frightening, isn’t it?
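If you want to see this effect yourself, here is a small, self-contained toy sender (not rsyslog code; host, port and message size are arbitrary). Run it against any TCP listener, then break the connection mid-run (kill the listener or, better, block the traffic with a firewall rule) and watch the "accepted" counter keep climbing before the first send() error is ever reported.

    #include <arpa/inet.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        signal(SIGPIPE, SIG_IGN);          /* we want EPIPE, not a silent death */

        struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(5140) };
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect");
            return 1;
        }

        char msg[150];                      /* our "generous" syslog message */
        memset(msg, 'x', sizeof(msg) - 1);
        msg[sizeof(msg) - 1] = '\n';

        long accepted = 0;
        for (;;) {
            if (send(fd, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg)) {
                perror("send");             /* the error arrives many messages late */
                break;
            }
            ++accepted;                     /* "accepted" only means: buffered locally */
            printf("\rmessages accepted by send(): %ld", accepted);
            fflush(stdout);
            usleep(10000);
        }
        printf("\nan unknown tail of those messages never arrived\n");
        close(fd);
        return 0;
    }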

I have been aware of this problem since the early days of plain tcp syslog. I have even documented it as part of the SELP (simple event logging protocol) discussion on the loganalysis mailing list: see the selp spec, section 2.4. I have to admit that I never took it too seriously. After all, it only happens when the server goes down … and there were so many other occasions on which syslog messages could get lost. The most prominent one is probably running out of buffer space on the local syslogd, or simply restarting it.

Then I began to implement more and more reliability into rsyslog (my *nix syslogd implementation, for those not in the know). And the more reliable that engine got, the more the reliability problems of plain tcp syslog surfaced. A typical sample of TCP unreliability is included in this post, which also was the thing that finally reminded me of the root cause of all this (if you search the forum, you'll find some other examples of "the tcp problem").

Once the core engine was reliable, the unreliability of TCP, so far well hidden by the unreliability of the rest of the system, quickly surfaced. No advanced flow control, no tricks, no nothing helped. We cannot build a reliable solution out of plain tcp syslog. It's simply a no-go. At this point, it is probably good to add that this is not an rsyslog problem. It's a protocol issue, and as such all software implementing plain TCP syslog has the same shortcoming!

My quest for a highly reliable and tamper-proof logging system got shaky at that point. So what to do? Of course, a different, more reliable protocol was needed. It must support app-level acks, so that the client (sender) knows what was processed by the server and what was not. If so, it can resend the right messages in case of a connection failure. My initial thought was to use liblogging and thus RFC 3195, which is based on BEEP (RFC 3080/3081). However, after some thinking, I came to the conclusion that RFC 3195 is far too little accepted and quite a bit too complicated for the job. I ended up designing a new, reliable and extensible protocol which I called RELP, in honor of the initial SELP work. RELP stands for reliable event logging protocol. That implies two things: it's reliable and it's NOT only about syslog. RELP can carry a variety of payloads and will probably do so in the future. [You can find some more of my thought process in a previous blog post.]

I have now finished the initial alpha implementation of librelp, the RELP core library. I have also released the first version of rsyslog (3.15.0) with RELP support. There are still a few nits with RELP. Most importantly, it may duplicate some messages in extreme network failure cases (documented inside rsyslog's imrelp module). But, barring a bug, RELP will never again lose any message … nearly. Now the ball is back in rsyslog's court. If the network connection breaks, rsyslog had sent some yet-unacknowledged messages just before the break, and rsyslog is terminated in that state … then we may indeed still lose a handful of messages. The bad thing is that it happens; the good thing is that there is a cure, it "just" needs to be implemented. So the reliability quest is not over. It has just gone into the next round. But, again, the probability of message loss is now very much lower than with plain tcp syslog.

So just let me repeat: plain tcp syslog is *not* a reliable solution. It works well as long as everything is working well, but it screws up when the network or the server breaks… And, yes, you don’t really need to care about that until you try to make the rest of the system fully reliable.

UPDATE:

I have received a good comment via mail and provided an in-depth answer, probably interesting to others, too. So I reproduce it here (but snip much of the original mail):

> There is very expensive software out there that can do guaranteed
> packet delivery over tcp…

Well… RELP can *really* do the trick. It needs some help in the rsyslog engine to work in all cases, namely when the relp stack is terminated (when the client rsyslog is shut down). But that isn't rocket science and has just been pushed back by more urgent work (securing the channel).

So far, relp does single acks, that is, the server acks every packet back to the client. The client discards packets only after the ack. So when a session breaks, we know what the server has processed. If, however, some of the acks get lost, we resend packets that the server has already processed, resulting in some mild message *duplication* (we currently have a 128-packet app-layer max window and typically acks come in rather quickly). To work around this, the client must ack the server's acks (double-acking). That sounds scary, but is not. It's also not performance-intense, and the protocol is modeled so that those with ultra-slim bandwidth can turn it off and live with the message duplication scenario. If double-acks are active, relp client and server will go into a recovery phase after reconnect negotiation, in which mutually acked packet numbers are exchanged. To do so, the client must provide a session cookie back to the server. This is a potential attack vector, and thus I'd like to have secure transport in place before I do it. Once the recovery negotiation is done, client and server each know what the other peer has processed, and have discarded it. The remaining packets are re-sent, but under the same reliability settings. So if the connection breaks again at this point (or during the renegotiation), we'll simply go into another recovery phase, until we finally succeed. It is of course important that session caches be persisted to disk when an engine stops – that will be part of rsyslog. The recovery procedure itself is in librelp; rsyslog needs to add just a slim persistence capability.
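To make the bookkeeping behind this a bit more tangible, here is a rough sketch of the per-session state such a recovery phase would need. This is a thought model only, not the librelp data structures or wire format.

    #include <stdbool.h>
    #include <stdint.h>

    struct relp_session_state {
        uint64_t highest_sent;       /* highest packet number handed to the wire */
        uint64_t highest_srv_acked;  /* highest packet the server has acked to us */
        uint64_t highest_cli_acked;  /* highest server ack we acked back (double ack) */
        bool     double_ack_enabled; /* slim-bandwidth peers may turn this off */
        /* a session cookie would also live here, so a reconnecting client can
         * prove ownership of this state; that is the attack vector mentioned above */
    };

    /* After reconnect and exchange of the mutually acked numbers, only the gap
     * [highest_srv_acked + 1 .. highest_sent] needs to be re-sent. */
    static uint64_t firstPacketToResend(const struct relp_session_state *s)
    {
        return s->highest_srv_acked + 1;
    }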

Once this is done, and with proper rsyslog queue settings, sufficiently stable hardware and sufficient disk space, I guarantee (not in a lawyer's sense, though ;)) that you'll never lose a message nor get a duplicate.

This is when I think we have achieved our reliability goal. If all goes well… summer? ;)

ANOTHER UPDATE (2008-05-29)

An idea was brought up about how one might be able to circumvent this problem. I tried it out, saw it fail, and could also prove that it is simply impossible to have a reliable plain tcp syslog protocol without application-level acknowledgment.

GSS-API for Rsyslog

I just reviewed and integrated a patch from varmojfekoj into rsyslog – it provides GSS-API support. I have to admit that I have not yet fully understood what it does from a user's point of view. But I begin to have the feeling that this patch will probably be the most important addition to rsyslog in the latter half of this year.

I also have not yet evaluated how this patch relates to the IETF syslog security effort, namely syslog-transport-tls. I guess they are related, and the patch probably not only provides non-standard functionality but may even make it harder to implement the standard. HOWEVER, if we look at how slowly that IETF WG is moving, I do not worry about any compliance problems. They can be dealt with later. What I find much more important is that we have a real-world answer to a real-world security question, and we have it now. So what could be more important? ;)

I'll keep you updated on the progress.

where is liblogging heading to?

I just received a question about whether or not I plan to change the build system for liblogging to a modern one, just as I have done with rsyslog recently.

As a reminder, liblogging is the library that provides RFC 3195 functionality to rsyslog. Of course, others can integrate it, too. Writing liblogging was a considerable effort. When I did it, I had hoped that RFC 3195 would become much more dominant in the logging space. Unfortunately, the current version turned out to be a failure. This is also one reason that rsyslog supports receiving messages via 3195 but not sending them – the interest was too low (to phrase it politely) and other features received priority.

Liblogging development is currently "on hold". I was tempted to completely abandon it, but then Cisco moved RFC 3195 support into IOS. That was one thing that made me hope that the standard may become accepted in the long term. Recently, while designing rsyslog v3, I noticed that RFC 3195 (or, more precisely, BEEP, RFC 3080) indeed offers a very good basis for protocol extensions. I am currently very seriously thinking about re-vitalizing liblogging and making it the foundation of rsyslog-to-rsyslog communication.

But there are also a number of other subtleties. Most importantly, the current RFC 3195 is insufficient for practical purposes. For example, it limits message size to 1K. It also does not support new syslog headers. The IETF is considering a new version of 3195, which will fix these shortcomings. Of course, I can go ahead and use from 3195 what makes sense to me and extend the rest on my own. But that will cause trouble in the long term and I do not really like that.

There are also some funding issues. As you possibly know, we also fund our projects via closed-source Windows software (e.g. MonitorWare Agent or EventReporter). The current liblogging has a BSD-style license. That was chosen in the hope it would help make it a standard. However, it also enables our closed-source competitors to pull in our work without contributing anything back.

So, to be honest, if we go ahead and make a new substantial effort on liblogging, we will probably release it under GPLv3. That enables everybody who does open source to use it, but prevents our closed-source competitors from integrating the library. Of course, we need to retain the right to use it ourselves in our commercial products, but I think that can be arranged in a way that is compatible with GPLv3 – at least I have seen a number of projects doing that.

To summarize briefly: liblogging is currently on hold and will be for at least some more weeks. We are deciding whether we will extend it. If so, that will go along with rsyslog development.

on reliability and the need to discard messages…

I am a bit involved in creating the next-generation on-the-wire syslog protocol with the IETF. Unfortunately, that activity had to starve a little while I was busy with getting rsyslog to its current state. However, I have not lost sight of it. And in fact, the rsyslog implementation effort helped me sharpen my mind about what to suggest in the standardization area.

A recap: the IETF has been trying to standardize the syslog protocol for quite a while now. There were changing thoughts about what was needed, but this spring we agreed on bringing a draft up to the IESG. From there, we received a number of comments; some easy ones, some of more substance. In July 2007, more or less one topic was left: what to do when a syslog sender blocks. I quote the relevant part of the IESG reply here (emphasis mine):

> First, this starts as an issue with TLS over TCP and the
> syslog base protocol.
> It can also arise theoretically for UDP, but as I understand
> not in practice for today's usage. When you are using TCP, in
> case the syslog sender generates events at a rate that is
> higher than available rate over the path used there will be
> build up of a queue. So I would like to have some words
> somewhere saying what you do when you build up a queue of
> messages that should be transmitted, but the queue simply
> keeps growing. What do I do? To me this situation implies
> that one starts discarding syslog messages and starts
> with the least important ones. So I would like to have
> a paragraph on this.
>
> I also included UDP in this in the case that you actually
> have reserved or determined that syslog is allowed to use a
> particular amount of bandwidth, but not more. In this case it
> could be possible that one implements a rate limiter
> and run into exactly the same issue as for TCP.
>
> Please do understand that if syslog was designed from scratch
> today it wouldn’t get away without a congestion control that
> guarantees that it isn't misbehaving in any situation. But
> being an “old” protocol we are accepting less than that.
> However, we do require it to contain the limitations and
> assumptions that it can be safely operated with. Using it as
> it currently is used is not an issue because the networks it
> is used in has many magnitudes more bandwidth that syslog
> generates. However, what would happen if some one starts
> using syslog in low-powered, low-bitrate sensor network for
> carrying some information. In that situation syslog becomes a
> potential threat to the stability of the network as it
> doesn’t react at all to congestion if run over UDP. Network
> failures are also situations that are problematic to ensure
> that the intended resources are available. Thus we do
> like to protect the utility of what resources do exist.

And this sentence covers it all:

> Please seriously consider adding a paragraph about
> how one can thin a queue of syslog messages in
> the base protocol. This as I think it potentially applies
> to any transport.

This basic requirement caused some confusion in the working group (WG). I have to admit, it confused me at that time, too.

In the meantime, I kept working on rsyslogd. I didn't even realize that I had written code in 2005 (or so) to take care of what was requested by the IESG: if rsyslog runs in single-threaded mode, forwards messages via TCP, and the send call would block – then we have a problem. Because if we wait, we will potentially hang the whole system. This is unacceptable. So I coded rsyslog in such a way that it discards the message in that situation.

When I moved rsyslog to a multi-threaded design, that restriction was removed. Now there was a queue between the receiver and the (potentially blocking) sender. And, as we all know, queues are endless, so we'll never run into problems ;-]. Of course, that simplistic point of view survived only as long as we did not deliberately bash rsyslogd with lots of traffic. What then happens is pretty much the same as in the single-threaded case: rsyslog freezes. Mildew, a frequent contributor, discovered this problem and also sent a good fix: he throws away messages if the queue is full. Sounds familiar? Yes, there seems to be a real-world problem, and there is also a well-known cure for it.

In essence, this is what the IESG is asking for – at least as I understand it. But the IESG, obviously speaking with a lot of experience, goes a step further. What they recommend is not to discard newly arrived messages. They recommend a way to "thin out the queue". So we would not just blindly drop whatever comes in when the queue is full. No, we should intelligently decide what to drop, based on the severity of the message. It sounds only fair that debugging messages get dropped in favor of emergency messages. And, yes, syslog has a field "severity", which is the right thing to use here.

I thought about implementing a "high watermark" in the rsyslog queuing engine. Once that much of the queue was full, a filter would be applied to new messages. From then on, only messages of at least a given severity would be queued, while all others would be dropped. Of course, those already in the queue would be sent, no matter how unimportant they are. So far, I have not implemented that algorithm. And I think I am now unlikely to do it that way. It would be much better to follow the advice and, once the queue becomes full, begin to drop messages based on their severity.

I can even envision the algorithm: go through the queue starting at the tail and look for messages with the least importance. Once you have found one, drop it. Then enqueue the (new) higher-severity message. When this is done, do not drop any more messages, because it may not be necessary. We can always drop again if a newer, higher-priority message arrives. If, during our queue traversal, we could not find any message with the lowest severity, retry with the next-lowest severity. Run this algorithm until either a message to drop is found or the drop severity has reached the severity of the to-be-queued message, which in that case is the youngest and thus the drop target. Of course, the algorithm needs to be tweaked with regard to performance, but basically it should work fine. I even think a derivative of it will make it into an rsyslog feature request.
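A minimal sketch of that thinning step on a plain array-backed queue (illustrative only, not rsyslog code; severities use the usual syslog numbering, 0 = emergency … 7 = debug):

    #include <stdbool.h>
    #include <stddef.h>

    #define SEV_DEBUG 7          /* numerically highest severity value = least important */

    struct qmsg  { int severity; /* 0 (emerg) .. 7 (debug) */ /* ... payload ... */ };
    struct queue { struct qmsg *msgs; size_t len; size_t cap; };

    /* Try to make room for a message of severity newSev.  Starting from the
     * tail, look first for debug messages, then info, notice, and so on,
     * stopping once the candidate severity is no less important than newSev. */
    static bool thinQueue(struct queue *q, int newSev)
    {
        for (int dropSev = SEV_DEBUG; dropSev > newSev; --dropSev) {
            for (size_t i = q->len; i-- > 0; ) {
                if (q->msgs[i].severity == dropSev) {
                    for (size_t j = i; j + 1 < q->len; ++j)  /* close the gap */
                        q->msgs[j] = q->msgs[j + 1];
                    --q->len;
                    return true;   /* room made: caller may enqueue the new message */
                }
            }
        }
        return false;  /* nothing less important found: drop the new message itself */
    }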

… and now back to the IETF. We still need wording for our upcoming standard (called an I-D). Now that I know what I am looking for, shouldn't it be easy? Well, to be honest, I don't think it's that easy. First of all, I am not sure if, in a protocol specification, I can demand such queue processing. It looks somewhat too algorithmic, too implementation-specific to me. So I need to find a better way to phrase it. And of course, peer review will show whether it makes sense or not (maybe we even get a better suggestion).

My first try at it is as follows:

In any implementation, a situation can arise in which an originator or relay would need to block sending messages. A common case is when an internal queue is full. This might happen due to rate-limiting or slow performance of the syslog application. In such cases, it is RECOMMENDED that the originator or relay drop messages of lower severity in favor of higher-severity messages. Messages with a numerically lower SEVERITY value have a higher severity than those with a numerically higher one. To-be-dropped messages SHOULD simply be discarded. The syslog application may notify a collector or relay about the fact that it drops messages. If the underlying protocol supports congestion control, discarding of messages SHOULD be avoided.

I have to admit that this text is probably too lengthy and imprecise. But hopefully it is a starting point. I’ll post it to the syslog IETF WG and we’ll see…

Still, there is one important message to learn:
no matter how hard we try, there is always a chance that we must discard messages. It may be highly unlikely, and it may be totally unexpected. But we cannot absolutely avoid it. Not even an on-disk queue is large enough to buffer any imaginable message volume. So loss of messages is inherent in any application, including syslog and auditing. The question is how we deal with it smartly and how we preserve as much evidence in our logs as possible (and keep the log valid in those cases!). This all sounds pretty basic and easy, yet we always need to remind ourselves of this fact…

next generation rsyslog design, worker pools, parallelism and future hardware

Based on my last post on rsyslog multithreading, an email conversation arose. When I noticed that I was elaborating on a lot of things possibly of general interest, I decided to turn it into a blog post. Here it is.

The question that started all this was whether or not a worker thread pool is good or evil for a syslogd.

Yes, the worker-thread pool has a lot of pros and cons with syslog. I know this quite well, because I am also the main design lead behind WinSyslog, which was the first syslogd available natively on Windows (a commercial app). It is heavily multi-threaded. I opposed a worker pool for a long time but have accepted it lately. In some high-volume scenarios (unusual cases) it is quite valuable. But of course you lose the order of messages. For WinSyslog, we made the compromise of having the worker pool configurable, and set it to 1 worker if the order of events is very important. I designed WinSyslog in 1995 and released the first version in 1996 – so I know quite well what I am talking about (though, to be honest, the initial multithreading engine came in somewhat later ;)).

HOWEVER, especially in high-volume environments and with network senders, you are somewhat playing Russian roulette if you strongly believe that the order in which events come in is exactly the order in which they were sent. Just think of routers, switches, congestion, etc… For small volumes, that's fair enough. But once the volume goes up, you do not get it 100% right. This is one of the reasons I am working quite hard in the IETF to get a better timestamp into syslog (rsyslog already has it, of course as an option). The right way to do message sequencing is by looking at a high-precision timestamp, not at the time of reception.

For rsyslog, I am not yet sure if it will ever receive a worker pool. From today's view, it does not look necessary. But if I think about future developments in hardware, the only way to get more performance is to use the enhanced parallelism the hardware will provide. The time of fast single cores is over. We will have relatively slow (as fast as today ;-]) massively parallel hardware. This is a challenge for all of software engineering. Many folks have not yet realized it. I think it has more problem potential than the last "big software crisis". As you can see in the sequencing discussion, parallelism doesn't only mean mutexes and threads – it means you need to re-think how you do things (e.g. use timestamps for correlation instead of reception/processing order, because the latter is not a stable concept in a massively parallel program). With the design I am now doing, I will definitely have separate threads for each output action (like the file or database writers). I need this, because rsyslog will provide a way to queue messages on disk when a destination is not available. You could also call this "store-and-forward syslog", just like SMTP is queued while in transit. You cannot do this with a single thread (at least not at reasonable complexity). I also do not think that multiple processes would be the right solution. First off, they are too slow. Secondly, multiple processes for this use are much more complicated than threads (I know, I've written similar multi-process based things back in the 80s and 90s).

Rsyslog will also most probably have more than one thread for input. Input will become modular, too. If I use one thread per input, each input can use whatever threading model it likes. That makes writing input plugins easy (one of my goals). Also, in high-volume environments, having the ability to run inputs on multiple CPUs *is* *very* beneficial. One of the first plugins will be RFC 3195 syslog, which is currently run as an external process (communicating via a unix local domain socket, which is quite a hack). At the core of 3195 is liblogging, a slim and fast select-server I have written. However, integrating it into a single-threaded application is still a challenge (you need to merge the select() calls and provide an API for that). With multiple threads, you can run that select server on its own thread, which is quite reasonable (after all, why should UDP and 3195 reception run on the same thread?). The same is true for tcp-based processing. Then think about the native ssl support that is coming. Without hardware crypto acceleration, doesn't it make much sense to run the receiver on its own thread? Doesn't it even make sense to run the sender on its own thread?

So the threading model of the next major release of rsyslog (called 3.x for reasons not yet elaborated on) will most probably be:

  1. multiple input threads, at least one for each input module (modules may decide to use more threads if they like)
  2. all those input threads serialize messages to a single incoming queue
  3. the queue will most probably be processed by a single worker thread working on filter conditions and passing messages to the relevant outputs. There may be a worker pool, but if so its size can be configured (and set to 1, if needed)
  4. multiple output threads, at least one for each output. Again, it is the output’s decision if it runs on more than one thread
  5. there may be a number of housekeeping threads, e.g. for DNS cache maintenance

This design will provide superb performance, is oriented on logical needs, allows for easy integration of plugins AND will be reasonably easy to manage – at least I hope so.
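For those who like to see plumbing, here is a bare-bones sketch of the central piece of that model: several input threads feeding one serialized queue, with a single worker dequeuing and dispatching. It is illustrative pthread code, not the actual rsyslog engine, and it simply discards messages when the queue is full (a real engine would block or thin the queue instead).

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QSIZE 1024

    static char *queue[QSIZE];
    static size_t qhead, qtail, qcount;
    static pthread_mutex_t qmut  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    /* Called by any input thread (UDP, TCP, RFC 3195, ...). */
    static void enqueueMsg(char *msg)
    {
        pthread_mutex_lock(&qmut);
        if (qcount < QSIZE) {
            queue[qtail] = msg;
            qtail = (qtail + 1) % QSIZE;
            ++qcount;
            pthread_cond_signal(&qcond);
        } else {
            free(msg);                 /* queue full: message is discarded */
        }
        pthread_mutex_unlock(&qmut);
    }

    /* The single filter/dispatch worker: pulls messages, hands them to outputs. */
    static void *queueWorker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qmut);
            while (qcount == 0)
                pthread_cond_wait(&qcond, &qmut);
            char *msg = queue[qhead];
            qhead = (qhead + 1) % QSIZE;
            --qcount;
            pthread_mutex_unlock(&qmut);

            printf("dispatching: %s\n", msg);  /* stands in for filtering + outputs */
            free(msg);
        }
        return NULL;
    }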

But wait, why have I written all this? OK, one reason is that I have wanted to document the upcoming threading model for quite a while, and now was a good time for doing so. But I think it also shows that you cannot say "multithreading is good/bad" without thinking about the rest of the system design. Multi-threading is inherently complex. Humans think sequentially, at least when we think about something consciously. So parallel programming doesn't match human nature. On the other hand, nature itself does things massively in parallel. Just think about human vision: in tech terms, it is massively parallel signal processing. This is where technology is heading, too. Just as we learned to program at all, we will also learn to deal with parallel programming. A key thing is that we should never think about parallelism as a feature – it's actually a tool that can be used right or wrong. So let's hope the rsyslog design will do it right. Of course, comments are very much appreciated!

on the syslogd -h option

While I work on rsyslog modularization, I also re-visit a lot of code. Please remember that rsyslog is rooted in the sysklogd package (and we have always tried to keep it quite compatible with it). When I finished moving out references to the selector_t (struct filed) entries in the modules, I came across a place in the forwarding driver where the message element is accessed. You can look up that code in cvs (omfwd.c, line 597 and below).

This code implements the -h option, which stops forwarding messages when they did not originate from the local host. The intention of that option probably is to avoid a death spiral, which could be caused by two systems sending syslog messages back and forth (this scenario is actually even covered in RFC 3164, so it seems to happen from time to time…).

However, the code in sysklogd relies on hostnames to prevent that behaviour. If the hostname is different from the current hostname, then we have a remotely received message. I question whether that check is always reliable (besides, it is not working right at the moment ;)). If that functionality is actually needed, it would be way better to check the message's target IP address against the local addresses (probably a lot of work, but definitely doable).
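As a rough illustration of such an address-based check (a sketch under assumptions, not rsyslog code), one could compare the address in question against the machine's own interface addresses, e.g. via getifaddrs():

    #include <ifaddrs.h>
    #include <netinet/in.h>
    #include <stdbool.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Returns true if the given address belongs to one of the local interfaces. */
    static bool addrIsLocal(const struct sockaddr *addr)
    {
        struct ifaddrs *ifs, *ifa;
        bool local = false;

        if (getifaddrs(&ifs) != 0)
            return false;

        for (ifa = ifs; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != addr->sa_family)
                continue;
            if (addr->sa_family == AF_INET) {
                const struct sockaddr_in *a = (const struct sockaddr_in *)ifa->ifa_addr;
                const struct sockaddr_in *b = (const struct sockaddr_in *)addr;
                if (a->sin_addr.s_addr == b->sin_addr.s_addr) { local = true; break; }
            } else if (addr->sa_family == AF_INET6) {
                const struct sockaddr_in6 *a = (const struct sockaddr_in6 *)ifa->ifa_addr;
                const struct sockaddr_in6 *b = (const struct sockaddr_in6 *)addr;
                if (memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(a->sin6_addr)) == 0) {
                    local = true;
                    break;
                }
            }
        }
        freeifaddrs(ifs);
        return local;
    }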

The question is whether such a feature is actually needed – and whether it belongs in the output driver. To me, it sounds like a natural filter condition ("selector does not apply if host == non-localhost"). If that feature is required, it would probably be better to build it into filtering than into a (single) output module.

But again, the question is: do we really need to provide this functionality? Or is it an artifact of times long gone?

Feedback is appreciated (you may also use the rsyslog forum, if you like).

rsyslog and fedora 8

It looks like more and more sites are picking up the news. A quite good review of it, together with the other F8 features, can be found at this site:

http://linuxupdate.blogspot.com/2007/07/proposed-fedora-8-features.html

I wonder if traffic on the rsyslog site will go up – this has not happened yet. The root question behind that question is whether people really care about logging. I guess most people couldn't care less which syslogd is included in a distro. But I may be proved wrong.

In the meantime, I hope that the additional exposure will lead to more ideas, more testing and, in the end, an even better rsyslog.