How we found and fixed a CVE in librelp

This is a joint blog post, from Adiscon and Semmle, about the finding and fixing of CVE-2018-1000140, a security vulnerability in librelp. This was a collaborative effort by:

  • Kevin Backhouse, Semmle, Security Researcher.
  • Rainer Gerhards, Adiscon, Founder and President.
  • Bas van Schaik, Semmle, Head of Product.

We have published this post both on Rainer’s blog and on the LGTM blog.

Bas originally found the vulnerability (using lgtm.com) and Rainer fixed it. Kev developed the proof-of-concept exploit.

In this blog post, we explain the cause of the bug, which is related to a subtle gotcha in the behavior of snprintf, and how it was found by a default query on https://lgtm.com/. We also demonstrate a working exploit (in a docker container, so that you can safely download it and try it for yourself). As a bonus, we give a short tutorial on how to set up rsyslog with TLS for secure communication between the client and server.

rsyslog performance: main and action queue workers

Rsyslog has both “main” message queues and action queues. [Actually, “main message queues” are queues created for a ruleset; “main message” is an old term that was preserved even though it is no longer accurate.]

By default, both queues are set to a maximum of one worker. The reason is that this is sufficient for many systems and it cannot lead to message reordering. If multiple workers are active concurrently, messages can be reordered, because the order then depends on, among other things, thread scheduling.
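To make the reordering effect concrete, here is a small simulation sketch. It is not rsyslog code: the function name and the per-message processing costs are made up for illustration, and it only models workers pulling messages from one FIFO queue.

```python
import heapq

def completion_order(costs, workers):
    """Simulate `workers` threads pulling messages from one FIFO queue.

    costs[i] is the (hypothetical) processing time of message i.
    Returns the message indices in the order their processing finishes.
    """
    # Each heap entry is (time at which the worker becomes free, worker id).
    free_at = [(0, w) for w in range(workers)]
    heapq.heapify(free_at)
    finished = []
    for msg, cost in enumerate(costs):
        start, w = heapq.heappop(free_at)  # next free worker dequeues msg
        done = start + cost
        finished.append((done, msg))
        heapq.heappush(free_at, (done, w))
    # Sort by finish time; ties are broken by message index.
    return [msg for _, msg in sorted(finished)]

costs = [5, 1, 1, 1]  # message 0 is slow, e.g. it hits an expensive filter
print(completion_order(costs, workers=1))  # -> [0, 1, 2, 3]
print(completion_order(costs, workers=2))  # -> [1, 2, 3, 0]
```

With one worker the original order is preserved. With two workers the slow first message is overtaken by the three cheap ones that follow it.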

So for now let’s assume that you want to utilize a multi-core machine. Then you most probably want to increase the maximum number of main message queue workers. The reason is that main queue workers process all filters for all rules inside the ruleset, as well as the full action processing for all actions that are not run on an asynchronous (action) queue. In typical setups, this offers ample opportunity for concurrency. Intense tests on the v5 engine have shown near-linear scalability up to 8 cores, with still good improvements for higher numbers of cores (but increasing overhead). Later engines most probably do even better, but have not been rigorously performance tested (doing it right is a big effort in itself).

Action queues have a much more limited concurrency potential because they do only a subset of the work (no filtering, for example, just generating the strings and executing the actual plugin action). The output module interface traditionally permits only one thread at a time to be present within the actual doAction() call of the plugin. This simplifies writing output plugins, but would be needed in any case, as most cannot properly handle truly concurrent operations (think for example about writing to sequential files or a TCP stream…). For most plugins, the doAction() part is what takes most processing time. However, multiple threads on action queues can build strings concurrently, which can be a time-consuming operation (especially when regexes are involved). But even then it is hard to envision that more than two action queue worker threads make much sense.
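The split between concurrent string building and a serialized doAction() can be sketched as follows. This is an illustration only, not rsyslog's output module interface: the class SerializedAction and the functions build_string, worker and run are hypothetical names, and the "template" is a trivial stand-in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SerializedAction:
    """Hypothetical stand-in for an output plugin; not rsyslog's real API."""

    def __init__(self):
        self._lock = threading.Lock()  # only one thread inside do_action()
        self._inside = 0
        self.max_inside = 0            # highest concurrency ever observed
        self.written = []

    def do_action(self, line):
        with self._lock:
            self._inside += 1
            self.max_inside = max(self.max_inside, self._inside)
            self.written.append(line)  # e.g. append to a sequential file
            self._inside -= 1

def build_string(msg):
    # Template processing; may run in several worker threads at once.
    return "<{}> {}".format(msg["pri"], msg["text"].upper())

def worker(action, msg):
    line = build_string(msg)  # concurrent part
    action.do_action(line)    # serialized part

def run(messages, workers=2):
    action = SerializedAction()
    # Leaving the "with" block waits for all submitted work to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for msg in messages:
            pool.submit(worker, action, msg)
    return action
```

However many workers are used, max_inside never exceeds 1, so extra threads only help to the extent that build_string is expensive compared to do_action.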

So the bottom line is that you need to increase the main queue worker threads to get much better performance. If you need to go further, you can fine-tune action queue worker threads, but that’s just that: fine-tuning.

Note that putting “fast” actions (like omfile) on an async action queue just to be able to specify action threads is the wrong thing to do in almost all situations: there is some inherent overhead in scheduling the action queue, and that overhead will most probably eat up any additional performance gain you get from the action queue (in fact, I’d expect that it will usually slow things down).

Action queues are meant to be used with slow (database, network) and/or unreliable outputs.

For all other types of actions, even long-running ones, increasing the number of main queue worker threads makes much more sense, because this is where most concurrency is possible. So for “fast” actions, use direct action queues (aka “no queue”) and increase the main queue worker threads.
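As a rough illustration of these recommendations, a configuration might look like the fragment below. It uses the newer RainerScript syntax, which postdates the v5 engine mentioned above (older versions use legacy directives such as $MainMsgQueueWorkerThreads). The worker counts, file name and target host are placeholder assumptions, not tuned values.

```
# Main queue: more workers for filtering and for direct (non-queued) actions.
main_queue(queue.workerThreads="4")

# Fast action: leave it on the default direct queue (no queue.* parameters).
action(type="omfile" file="/var/log/all.log")

# Slow or unreliable action: give it its own asynchronous action queue.
action(type="omfwd" target="logs.example.net" port="514" protocol="tcp"
       queue.type="linkedList" queue.workerThreads="2")
```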

Finally, a special note on low-concurrency rulesets. Such rulesets have limited inherent concurrency. A typical example is a ruleset that consists of a single action. For obvious reasons, the number of things that can be done concurrently is very limited. If it is a fast action, and there is little effort involved in producing the strings (most importantly no regex), it is very hard to gain extra concurrency, especially as high overhead is involved with such fine-grained concurrency. In some cases, the output plugin may help. For example, omfile can do background writes, which will definitely help in such situations.
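For the omfile case, background writes are enabled per action. A minimal sketch in the newer RainerScript syntax, with a placeholder file name:

```
# Single fast action; omfile writes in the background.
action(type="omfile" file="/var/log/all.log" asyncWriting="on")
```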

You are in somewhat better shape if the string builder is CPU-intensive, e.g. because it contains many complex regexes. Remember that strings can be built in parallel to executing an action (if run on multiple threads). So in this case, it makes sense to increase the maximum number of threads. It makes even more sense to increase the default batch size. That is because the strings for the whole batch are built first, and then the action plugin is called. So the larger the batch, the larger these two partitions of work are. For a busy system, a batch size of 10,000 messages does not sound unreasonable in such cases. When it comes to which worker threads to increase, again increase the main queue workers, not the action ones. It must be reiterated that this gives the rsyslog core more concurrency to work with (larger chunks of CPU-bound activity) and avoids the extra overhead (though relatively small) of an async action queue.
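In the newer RainerScript syntax, the batch size and the worker count can be set together on the main queue. The fragment below is only a sketch: the thread count is a placeholder, and the queue size is an arbitrary value chosen to be comfortably larger than the batch.

```
# Main queue: several workers, each dequeuing large batches.
main_queue(queue.workerThreads="4"
           queue.dequeueBatchSize="10000"
           queue.size="100000")
```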

I hope this clarifies worker thread settings a bit more.

rsyslog performance

Thanks to David Lang, I have been able to gather some performance data on rsyslog. More importantly, I have been able to improve rsyslog’s performance dramatically while working with David. He not only dispenses good advice, he also has a great test environment, which I lack. If you would like to see how things evolve, be sure to follow this (lengthy ;) thread: http://kb.monitorware.com/rsyslog-performance-t8691.html.

But you are probably interested in actual numbers.
The current v3-stable (3.18.x) manages to process around 22,000 messages per second (mps) with DNS name resolution turned on and about double that value without. That’s not bad, but obviously there is room for improvement.

Thanks to our combined effort, we have reached a state where we can process more than 100,000 mps, and there is an experimental version (applying some lock-free algorithms) that goes well beyond 200,000 mps. I am not yet sure if we will pursue the lock-free algorithms. There are plenty of additional ideas available, and I am quite positive we can push the limit even further.

All numbers were measured with a minimal configuration (one UDP input, one file output) on a capable multi-core machine. The numbers above are for sustained traffic rates. More messages can be accepted (and buffered) during bursts.