WinSyslog German Site

We have been selling a Windows Syslog daemon (WinSyslog) for many, many years now (since 1995, if I remember correctly). The “language issue” is an interesting one. Back in the late 90s, we had English and German pages for that product. Some time later, we dropped the German pages because almost nobody ever accessed them (funny, ain’t it?).

Now we are giving it another shot. In a recent conversation, some peers claimed there is more demand for German in IT security today than there was 10 years ago. Really? If so, I have to admit I am surprised. I thought that the IT world speaks English and the IT security/auditing world even more so. Anyhow, I always like to experiment. So we at Adiscon agreed to translate some important content of the WinSyslog pages into German and see what happens.

As a side note, the discussion with my peers started another experiment which did not require discussions inside the company. Rsyslog got a German language support forum in October 2008. Guess what? There is only a single user post in it, and that post says the poster thinks it is unnecessary to have a German language forum. So far, it looks like I was right – but let’s see what a product site brings ;) (It sounds somewhat logical that an open source support forum has different metrics than a commercial software product site, so I think there really can be different results.)

Use of application-level acks in RELP

I received a very well-crafted question about RELP reliability via the rsyslog mailing list this morning. I think it makes perfect sense to highlight this question here in the blog instead of letting it die unread and hard to find in the mailing list archives. Before reading this post, it would be useful to read my rant “On the unreliability of plain tcp syslog” if you have not already done so. It will greatly help you understand the fine details of what the message talks about.

Here we go, with the original poster’s text in italics and my replies in between:

In my research of rsyslog to determine its suitability for a
particular situation I have some questions left unanswered. I need
relatively-guaranteed delivery. I will continue to review the
available info including source code to see if I can answer the
questions, but I hope it may be productive to ask questions here.

In the documentation, you describe the situation where syslog silently
loses tcp messages, not because the tcp protocol permits it but
because the send function returns after delivering the message to a
local buffer before it is actually delivered.

But there is a more-fundamental reason an application-level ack is
required. An application can fail (someone trips over the power cord)
between when the application receives the data and when it records it.

1. Does rsyslog send the ack in the RELP protocol occur after the
message has been safely recorded in whatever queue has been configured
or forwarded on so its delivery status is as safe as it will get (of
course how safe depends upon options chosen), or was it only intended
to solve the case of TCP buffering-based unreliability?


RELP is designed to provide end-to-end reliability. The TCP buffering issue is just highlighted because it is so subtle that most people tend to overlook it. An application abort seems to be more obvious and RELP handles that.

HOWEVER, that does not mean messages are necessarily recorded when the ACK is sent. It depends on the configuration. In RELP, the acknowledgment is sent after the reception callback has been called. This can be seen in the relevant RELP module. For rsyslog’s imrelp, this means the callback returns after the message has been enqueued in the main message queue.

It now depends on how that queue is configured. By default, messages are buffered in main memory. So when rsyslog aborts for some reason (or is terminated by user request) before this message has been processed, it is lost – while the sender still got a positive ACK. This is how things are done by default, and it is useful for many scenarios. Of course, it does not provide the audit-grade reliability that RELP aims for. But the default config needs to take care of the usual use case, and this is not audit-grade reliability (just think of the numerous home systems that run rsyslog and should do so in the least intrusive way).
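To make the timing concrete, here is a minimal toy model in Python. It is not rsyslog or librelp code: the `Receiver` class, its method names and the message names are all made up for illustration. It only shows the ordering described above, where the ACK goes out once the message sits in an in-memory queue, so a crash before processing loses messages the sender believes are safe.

```python
from collections import deque


class Receiver:
    """Toy receiver (hypothetical): acks as soon as a message is enqueued."""

    def __init__(self):
        self.queue = deque()  # in-memory main queue, gone after a crash
        self.written = []     # stands in for the log file on disk

    def receive(self, msg):
        self.queue.append(msg)
        return "ACK"          # sent after enqueue, before processing

    def process_one(self):
        # Move one message from the queue to its final destination.
        self.written.append(self.queue.popleft())

    def crash(self):
        # An abort wipes everything that only lived in main memory.
        self.queue.clear()


sent = ["m1", "m2", "m3"]
receiver = Receiver()
acks = [receiver.receive(m) for m in sent]

receiver.process_one()  # only m1 reaches the "disk"
receiver.crash()        # m2 and m3 were acked but never written

lost = [m for m in sent if m not in receiver.written]
print("acked:", len(acks), "written:", receiver.written, "lost:", lost)
```

With a queue that persists messages before `receive` returns, the same crash would not lose anything that was acked. That is the kind of trade-off the queue configuration decides.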

If you are serious about your logs, you need to configure the engine to be fully reliable. The most important thing is a good understanding of the queue engine. You need to read and understand the rsyslog queue docs, as they form the basis on which reliability can be built.

The other thing you need to know is your exact requirements. Asking for reliability is easy, implementing it is not. The closer you get to 100% reliability (which you will never reach for one reason or another), the more complex the scenarios get. I am sure the original poster knows quite well what he wants, but I am often approached by people who just want to have it “totally reliable” … but don’t want to spend the fortune it requires (really – ever thought about the redundant data centers, power plants, satellite and sea links et al. you need for that?). So it is absolutely vital to have good requirements, which also includes stating when loss is acceptable, and at what cost this comes.

Once you have these requirements, a rsyslog configuration that matches them can be designed.

At this point, I’d like to note that it may also be useful to consider rsyslog professional services as it provides valuable aid during design and probably deployment of a solution (I can’t go into the full depth of enterprise requirements here).

To go back to the original question: RELP has almost everything that is needed, but configuring the whole system in an audit-grade way requires (ample) work.

2. Presumably there is a client API that speaks RELP. Can it be
configured to return an error to the client if there is no ACK (i.e.
if the log it sent did not make it into the configured safe location
which could be on a disk-based queue), or does it only retry? Where is
this API?


The API is in librelp. But actually this is not what you are looking for. In rsyslog, an output module (here: omrelp) provides the status back to the caller. Then, configuration decides what happens. Messages may be discarded, sent to a different destination or retried.

With omrelp, I think we have some hardcoded ways to preserve the message, but I have not yet had time to look this up in detail. In any case, RELP will not lose messages but may duplicate a few of them (within the current unacked window) if the remote peer simply dies. Again, this requires proper configuration of the rsyslog components.
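The “no loss, but possibly a few duplicates” behavior can be pictured with a small Python sketch. This is a deliberately simplified model and not the RELP wire protocol: the `transmit` function, the fixed window size and the single simulated crash are assumptions made for illustration. The sender keeps a window of unacked messages and resends the whole window if the peer dies before acknowledging it.

```python
def transmit(messages, window, crash_in_round):
    """Return everything the peer stored, given one crash before an ack.

    Simplified model: messages go out in batches of `window`; the peer
    stores a batch on receipt and acks it afterwards. In the round given
    by `crash_in_round` the peer dies after storing but before acking.
    """
    stored = []
    pending = list(messages)
    round_no = 0
    while pending:
        batch = pending[:window]    # the current unacked window
        stored.extend(batch)        # peer received and stored the batch
        round_no += 1
        if round_no == crash_in_round:
            continue                # no ack arrived: resend the same batch
        pending = pending[window:]  # acks arrived: slide the window
    return stored


messages = ["m1", "m2", "m3", "m4", "m5"]
stored = transmit(messages, window=2, crash_in_round=2)

duplicates = sorted({m for m in stored if stored.count(m) > 1})
print("stored:", stored)
print("duplicates:", duplicates)
```

Every message arrives at least once, and only the messages inside the window that was open at the time of the crash show up twice.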

Even with that, you may lose messages if the local rsyslogd dies (not terminates, but dies for some unexpected reason, e.g. a segfault, kill -9 or whatever) but still has messages in a non-persisted queue. Again, this can be mitigated by proper configuration, but that must be designed. Also, it is very costly in terms of performance. A good read on the subtleties can be found in the rsyslog mailing list archive. I suggest having a look at it.

Certainly the TCP caching case you mention in your pages is one a user
is more likely to be able to reproduce, but that is all the more
reason for me to be concerned that the less-reproducible situations
that could cause a message to occasionally become lost are handled
correctly.


I don’t think app-abort is less reproducible – kill -9 `cat /var/run/rsyslog.pid` will do nicely. Actually, from feedback I received, many users seem to understand the implications of a program/system abort. But far fewer understand the issues inherent in TCP. That is why I focus so much on the latter. But of course, everything needs to be considered. Read the thread about the reliable queue (really!). It goes to great lengths, but still does not offer a full solution. Getting things reliable (or secure) is very, very challenging and requires in-depth knowledge.

So I am glad you asked and provided an opportunity for this to be written :)

Rainer

Strong passwords? Forbidden!

American Express, as a bank and card issuer, should be a fairly security-sensitive company. Right? Well, it looks like they have not yet learned their lesson. Occasionally, I log in to my AmEx account to gain access to membership rewards (these nice gimmicks that shall trick you into charging to AmEx as much as possible). I tend not to have my credentials at hand when doing so, but thankfully AmEx has a quite secure system to recover your credentials.

What really bugs me is their password requirement. A password can have a maximum of 8 characters and consist only of letters and numbers! Ouch… what about strong passwords? They are simply forbidden by AmEx. The funny thing is that the web site doesn’t even complain when you enter a too-strong (aka longer or non-alphanumeric) password. It simply ignores the extra characters. Some time last year this drove me crazy as I could not log in after changing my password. Guess what, I used a too-strong one and of course it didn’t match what the system generated. I called customer service and also complained about being forced to use insecure passwords. That was several months ago.
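To put a number on how much such a rule costs, here is a small back-of-the-envelope sketch in Python. The assumptions are mine: passwords are treated as case-sensitive and uniformly random, and `silently_truncate` is a hypothetical stand-in for “ignore the extra characters”, not a description of how AmEx actually implements it.

```python
import math
import string


def entropy_bits(alphabet_size, length):
    """Bits of entropy of a uniformly random password."""
    return length * math.log2(alphabet_size)


def silently_truncate(password, max_len=8):
    """Hypothetical model of a site that drops everything past max_len."""
    return password[:max_len]


# 26 lowercase + 26 uppercase + 10 digits
alnum = len(string.ascii_letters + string.digits)
# the same plus ASCII punctuation
printable = alnum + len(string.punctuation)

capped = entropy_bits(alnum, 8)       # what the 8-char rule allows at best
longer = entropy_bits(printable, 12)  # a 12-char password with symbols

print(f"8 alphanumeric chars: {capped:.1f} bits")
print(f"12 printable chars:   {longer:.1f} bits")

# Two different long passwords become the same short one after truncation.
same = silently_truncate("correcthorse42") == silently_truncate("correcthXYZ")
print("collide after truncation:", same)
```

The second part also hints at why silent truncation is confusing: what the user typed and what the system keeps are no longer the same string.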

New year, new try – old problem… Nothing learned, still 8 chars max and only letters and numbers. Frankly, AmEx, who is advising you on security? I really wonder if under US law AmEx is responsible if someone breaks into my account. I think they should be…

new syslog RFC will not advance…

I thought that after 5 years of IETF syslog WG work, I’d already be through all potential sources of frustration. Wrong! Last Friday, I was notified that the otherwise ready-to-be-published RFC could not be published because the IETF requires a copyright assignment to the trust, in which the author must grant all rights. Of course, he must also do so on behalf of every contributor.

Quite frankly, this is more or less impossible for this draft. Among the many reasons is that it would be extremely time-consuming to verify each and every (small) contribution from the past 5 years. Also, I copied (perfectly legally, back then) text from other RFCs, and I do not know who contributed to those. There are more reasons, but I do not want to elaborate on all of them again.

The bottom line is that the syslog RFC series is again stalled. The IETF has at least identified that there is a problem, and a work-around is being discussed. A decision will be made on February 15th. Let’s hope for the best, otherwise all the work would probably be lost.

This IMHO is also a good example of what a too-far stretched system of intellectual property rights can do. This, and also software patents, are a threat to our freedom. They stop progress and unnecessarily limit our capabilities. Hopefully more and more people will realize that and tell their governments to stop this abusive system.

Thailand is going syslog…

I found an interesting read in “The Nation”, one of Thailand’s largest business dailies. They talk about the economic crisis and the way Thailand plans to reduce negative effects. There is a 5-point initiative in place. Of interest for us is the fifth and final point:

Finally, the association will focus on security, which promises to be this year’s main technology trend. It will urge software companies to become more familiar with Syslog, which is a standard for forwarding log messages in an IP network, but is also typically used for computer system management and security auditing.

So, as it looks, Thailand is betting on security. This is obviously a good move. Interestingly, they seem to have identified logging, and syslog in particular, as a major building block in this endeavor. That’s a bit surprising, given the typical weaknesses of syslog. But they’ve probably identified the broad potential this protocol has. Maybe I should look a bit more towards Asia with rsyslog and phpLogCon as well as the Windows product line.

Joined the Security Bloggers Network

Yesterday I joined the Security Bloggers Network. I discovered this interesting network by accident and thought it might be worth contributing to it. While I am writing mostly about logging and log analysis, this definitely has a lot to do with security. So I asked Alan, the person in charge, if he’d consider adding my blog – man, was he quick. I got a positive response soon after my request and have also already been added to the site. Great. So welcome, all new readers.

The good thing is that this motivates me to write a bit more about the security aspects, which I think is a good thing for existing readers, too.

Drop me a comment if you have any opinion on me appearing in the network – especially if you have an idea of what you would like to read about logging.

SyslogAppliance Keyboard

This morning, Google alerts brought up a nice blog post on how to reconfigure SyslogAppliance’s keyboard settings.

I think some users already asked about keyboard reconfiguration. However, the blog post suggests that it is much more painful to do than I thought (I agree with the author that “dpkg-reconfigure console-setup” should have done the job). I’ll see that I finally include this functionality in the automated setup. It should not be too hard. After all, I have a “run once” script already in place.

Back from the break ;)

Hi folks,

I am just back from my extended xmas break. Well, actually I left right after xmas and was away – even mostly without email – for two weeks.

I was delighted to see that the rsyslog community was quite active during this time (which usually sees little activity due to the holidays). On the sad side, that also means I have a couple of bug reports outstanding. One I managed to fix today, the others will need a little more time. Obviously, there are quite a lot of things going on, and these need to be taken care of, too. I need to make a few changes to our internal infrastructure, and then I’ll look into rsyslog, phpLogCon and the appliances.

One of the major undertakings I hope to finish in the not so distant future is fixing a stability bug that seems to occur on 4+ core machines only. I got another bug report over the holidays and hope that I get enough momentum to finally track it down – we’ll see…

That’s it for now, I just wanted to keep you updated.

Oh – and did I mention happy new year ;) [it’s not too late right now…]

Rainer

root cause of security issue in rsyslog

If you have followed the rsyslog mailing list, you have noticed that we had a small, but still noteworthy, security issue in rsyslog recently. In short, the $AllowedSender directive was accepted but no longer honored, potentially giving any remote system a chance to send messages to the instance in question (it’s a minor issue because most people rightfully tend to use firewalls to carry out that kind of access control).

Now that this is settled, I sat back, relaxed and meditated a bit about the root cause of the issue. Actually, I didn’t need to think very hard. The problem was introduced when I implemented the netstream driver class. During that implementation, I shuffled a lot of code to the now-modular interfaces. Among them were the access control lists, whose roots were kept in global variables at this time.

I screwed up the first time when I allowed them to remain global variables. We all know that global variables are evil, especially when made publicly accessible. Now that we moved to a proper interface, I should have replaced them with a function call. Doing that in the first place would have prevented the problem. Why? Because I just initialized the now interface-specific “representative” of the global variable with its value at the time of interface creation, which meant NULL in all cases. So whoever used the interface always got an empty list, which meant no access control was configured.

Any user configuration still hit the global variable, which caused the ACLs to be created, but no part of the code accessed it any longer. One may argue whether that is a simple coding error, and there is some truth in it, but I’d still say it’s primarily a design issue (bad design promises to provide the quick solution, but it seldom does…).
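The mechanics of this bug are easy to reproduce in a few lines. The following Python sketch is a made-up analogue, not rsyslog’s C code: `allowed_senders`, `NetInterface`, `FixedNetInterface` and `configure_allowed_sender` are hypothetical names, and `None` plays the role of the NULL list root. It contrasts copying the global at interface creation with fetching it through a function call at the time of use.

```python
# Module-level "global" holding the root of the allowed-sender list.
allowed_senders = None


def get_allowed_senders():
    """Accessor: always returns the current list root."""
    return allowed_senders


def configure_allowed_sender(sender):
    """What processing a config directive does: build the ACL."""
    global allowed_senders
    if allowed_senders is None:
        allowed_senders = []
    allowed_senders.append(sender)


class NetInterface:
    """Buggy variant: snapshots the global when the interface is created."""

    def __init__(self):
        self.allowed_senders = allowed_senders  # still None at this point

    def is_allowed(self, sender):
        if self.allowed_senders is None:        # "no ACL configured"
            return True
        return sender in self.allowed_senders


class FixedNetInterface:
    """Fixed variant: asks for the list via a function call on each use."""

    def is_allowed(self, sender):
        acl = get_allowed_senders()
        return acl is None or sender in acl


# Interfaces are created first, the configuration is processed afterwards.
buggy = NetInterface()
fixed = FixedNetInterface()
configure_allowed_sender("192.0.2.10")

print("buggy lets stranger in:", buggy.is_allowed("203.0.113.5"))
print("fixed lets stranger in:", fixed.is_allowed("203.0.113.5"))
```

The configuration is accepted in both cases; only the variant that looks the list up at the time of use actually honors it.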

And as it always takes at least two faults to really screw up, the next major issue wasn’t too far away. Rsyslog did not have – and still does not have! – a formal test suite that you can simply run each time the code changes. I have begun to employ some limited test cases via “make check”, but they primarily cover exotic aspects and do not yet contain any serious test case that involves actually running rsyslogd against any serious number of messages. One of the reasons is that I had no good tool for doing so, or that I considered building the test suite to be too expensive (in comparison to what else needed to be done). As a small excuse, I would like to mention that some others have encouraged this view. But I always knew it is a lame excuse…

So exactly what usually happens in such cases happened: the test case vital to discovering this problem was not present in the series of tests I ran against the new code. As usual, the programmer himself tests whatever he thinks needs testing. And, also as usual, this means that the programmer doesn’t test those things that he cannot imagine being wrong. Usually, these are the real problems, because if the programmer did not think of a potential problem, he did not implement, or at least carefully check, for it. This is just another example of why external testers are needed.

In open source, users adopting the devel and beta releases are often considered to be these testers. Quite frankly, I could not afford a full testing lab and continue developing the project. I think this is true for most open source projects. “Free testing” by early adopters is a major advantage over closed source. But this time, this failed, too. Probably the (small) club of early adopters also did not think about this issue. Maybe that’s because the more knowledgeable folks prefer to solve this problem with a firewall, which is the better approach to use for various reasons (not to be outlined here, see security advisory for details).

Finally, the issue came up in the form of a bug report. Unfortunately, that was quite late, months after the initial release. But it was reported, and so I could fix it as quickly as possible once I knew.

The important lesson to learn is that it usually takes more than one error to cause real problems. But these things happen!

I think the case also strengthens the need for good, systematic testing. Some time ago, I began to look into the DejaGnu testing suite and asked the mailing list if somebody had some experience with it. Unfortunately, nobody showed up. I’ll now give it another shot. Too often, there have been small problems that were rooted in things not being consistently tested. Most often, these were only really small issues, like missing files, or some variables not defined in some conditional path. Since I improved my “make distcheck” settings, many of these small items no longer appear. Even the small set of current exotic tests reveals a problem from time to time.

So I think it would be wise to try to expand the test cases that rsyslog runs on a regular basis. Frankly, I will not be able to create a full suite from the ground up. But the idea is that once I manage to get DejaGnu – or something similar – up and running, and acquire the necessary knowledge, I could gradually add tests as I go along. So over time, the number of tests would increase and we could finally verify much better, and automatically, that existing functionality is not broken by new features.

I will try to focus my next release steps on DejaGnu. Obviously, any help in doing so is appreciated.

security…

No system is totally secure. Few systems are totally insecure. Most systems are between these two extremes. But what does “more secure” mean? We had an interesting discussion on the rsyslog mailing list on the use of root jails. I’d like to reproduce one of my posts here, not only because it is mine, but because it can guide us a bit towards the security goals for rsyslog.

Let me think of security as a probability of security breach. S_curr is the security of the reference system without a root jail. S_total is the security of a hypothetical system that is “totally secure” (knowing well that no such system exists). In other words, the probability S_total equals 0.

I think the common ground is that a root jail does not worsen security. Note that I do not say it improves security, only that it does not reduce a system’s security. S_jail is the security of a system that is otherwise identical to the reference system, but with a root jail. Then S_jail <= S_curr, because we assume that the security of the system is not reduced.

I think it is also common ground that the probability of a security breach is reduced if the number of attack vectors is reduced, without any new attack vectors being added. [There is one generic “attack vector”, the “thought of being secure and thus becoming careless” which always increases as risk is reduced – I will not include that vector in my thoughts]

We seem to be in agreement that a root jail is able to prevent some attacks from being successful. I can’t enumerate them and it is probably useless to try to do so (because attackers invent new attacks each day), but there exist some attacks which can be prevented by a root jail. I do not try to weigh them by their importance.

For obvious reasons, there exist other attacks which are not affected by the root jail. Some of them have been mentioned, like the class of in-memory based attacks, code injection and many more.

I tend to think that the set of attack vectors that can be prevented by a root jail is much smaller than the set of those which cannot. I also tend to think that the latter class contains the more serious attack vectors.

But even then, a root jail seems to remove a subset of the attack vectors that otherwise exist, and so it reduces the probability of a security breach. So it benefits security. We can only argue that it does not benefit security if we can show that in all cases we can think of (and those we cannot), security is not improved. However, some cases have been shown where it improves, so it cannot be that security is not improved in all cases. As such, a root jail improves security, or more precisely, the probability of a security breach is

0 < S_jail < S_curr

The benefit we gain is the difference between the reference system’s probability of a security breach and that of the system with the jail. Let S_impr be this improvement, then

S_impr = S_curr – S_jail

Now the root jail is just one potential security measure. We could now try to calculate S_impr for all kinds of security measures, for example a privilege drop. I find it hard to do the actual probability calculations, but I would guess that S_impr_privdrop > S_impr_jail.
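For readers who prefer numbers, here is a tiny worked example in Python. Everything in it is an assumption for illustration: the attack vector names, their probabilities, and the simplification that vectors are independent, so that the breach probability is 1 minus the product of the individual “no breach” probabilities. It is not a real risk model, just the formulas above with made-up values plugged in.

```python
from math import prod


def breach_probability(vectors):
    """Probability that at least one (independent) attack vector succeeds."""
    return 1 - prod(1 - p for p in vectors.values())


# Hypothetical per-vector breach probabilities of the reference system.
vectors = {
    "file_system_escape": 0.02,  # assumed to be prevented by a root jail
    "code_injection": 0.05,      # assumed to be unaffected by the jail
    "in_memory_attack": 0.04,    # assumed to be unaffected by the jail
}
blocked_by_jail = {"file_system_escape"}

s_curr = breach_probability(vectors)
s_jail = breach_probability(
    {name: p for name, p in vectors.items() if name not in blocked_by_jail}
)
s_impr = s_curr - s_jail

print(f"S_curr = {s_curr:.4f}")
print(f"S_jail = {s_jail:.4f}")
print(f"S_impr = {s_impr:.4f}")
```

Repeating the calculation with the set of vectors a privilege drop would remove gives S_impr_privdrop, which can then be compared to S_impr_jail in the same way.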

Based on the improvements, one may finally decide what to implement first (either at the code or admin level), all of this of course weighted with the importance of the numbers.

In any case, I think I have shown that both are correct:

  • the root jail is a security improvement
  • there exist numerous other improvements, many of them probably more efficient than the jail