A form thread made me aware that there seems to be an issue with rsyslog performance if TLS is used. The past two weeks, I have worked on a paper which looks in-depth at rsyslog performance an I came across a paper [1] that promotes writing servers in that “traditional” multi-threaded way (with a single thread per connection). It addressed some of my concerns, and I thought it is worth actually trying out this approach (I outruled it for several years and never again looked at it). As a result, I created an experimental module imttcp, which works in this mode. I put this to test, especially as that would also lead to a much simpler programming paradigm. Unfortuantely, the performance results are devastive: while there is a very slight speedup with a low connection number (close to the number of cores on the system), there is a dramatic negative speedup if running with many threads. Even at only 50 connections, rsyslog is dramatically slower (80 seconds for the same workload which was processed in 60 seconds with traditional imtcp or when running on a single connection). At 1,000 connections, the run was *extremely* slow. So this is definitely a dead-end. To be honest, Behren, condit and Brewer (the authors of [1]) claim that the problem lies in the current implementation of thread libraries. As one cure, they propose user-level threads. However, as far as I could find out, User-Level threads seem not to be much faster under Linux than Kernel-Level threads (which I used in my approach).
Even more convincing is, from the rsyslog PoV, that there are clear reasons why the highly threaded input must be slower:
- batch sizes are smaller, leading to much more overhead
- many more context switches are needed to switch between the various i/o handlers
- more OS API calls are required because in this model we get more frequent wakeups on new incoming data, so we have less data available to read at each instant
- more lock contention because many more threads compete on the main queue mutex
All in all, this means that the approach is not the right one, at least not for rsyslog (it may work better if the input can be processed totally independent, but I have note evaluated this). So I will look into an enhanced event-based model with a small set of input workers pulling off data (I assume this is useful for e.g. TLS, as TLS transport is much more computebound than other inputs, and this computation becomes a limiting factor for the overall processing speed under some circumstances – see [2]).
As a side-note: for obvious reasons, I will not try to finish imttcp. However, I have decided to leave it included in the source tree, so that a) someone else can build on it, if he sees value in that b) I may use it for some other tests in the future.
[1] R. Von Behren, J. Condit, and E. Brewer. Why events are a bad idea
(for high-concurrency servers). In Proceedings of the 9th conference on Hot
Topics in Operating Systems-Volume 9, page 4. USENIX Association, 2003.