Today’s release of rsyslog 8.1901.0 contains a small but important feature: the ability to specify a minimum batch size. It is much needed for some outputs, with ElasticSearch (and ClickHouse) being prime examples. While I am happy I finally implemented it, I am also a bit ashamed that it took me almost three and a half years since Radu Gheorghe proposed the feature in 2015.
Quick reminder on how rsyslog batches work: we receive messages and put them into queues. From these queues, we pull so-called batches (sets of messages) and have them processed by output modules. A batch can contain up to a given maximum number of messages (by default, depending on the case, around 1024 or fewer). If there are that many messages inside the queue, a full batch is extracted and processed. If the queue does not contain that many, whatever it currently has is taken and forms the batch. As such, a batch may contain as few as one message.
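To make the dequeue rule concrete, here is a minimal sketch in Python. It is not rsyslog's actual code (rsyslog is written in C); the function name take_batch and the fixed limit of 1024 are assumptions chosen purely for illustration.

```python
from collections import deque


def take_batch(pending, max_size=1024):
    """Remove and return up to max_size messages from the front of the queue.

    Hypothetical helper, not an rsyslog API: it only illustrates the rule
    "take a full batch if available, otherwise take whatever is there".
    """
    batch = []
    while pending and len(batch) < max_size:
        batch.append(pending.popleft())
    return batch


if __name__ == "__main__":
    busy = deque(f"msg {i}" for i in range(3000))
    idle = deque(["single msg"])
    print(len(take_batch(busy)))  # queue is full enough: a full batch
    print(len(take_batch(idle)))  # nearly idle queue: a one-message batch
```

Note that nothing in this logic ever waits: the batch size is driven purely by how many messages happen to be queued at the moment of the dequeue.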
The idea behind that logic is that when the system has to process many messages, queues are sufficiently full and so full batches are extracted and processed. But if the system is more idle, fewer messages are processed per batch so as not to introduce latency, which may be bad for real-time security analysis. We trade the overhead of small batches for responsiveness. A bit more overhead doesn’t matter if the system is lightly utilized anyhow. If the utilization grows, everything becomes slower, queues begin to fill up and batch sizes naturally increase.
This auto-tuning works very well with the majority of outputs. However, if the output actually transfers data to a different system and that system’s performance depends on the reception of bulk data … then things no longer work that perfectly. From the rsyslog perspective, auto-tuning still works. However, the back-end (remote) system will probably experience a higher-than-desirable load. That is still not a big problem if the back-end system is dedicated to logging, but it definitely is a problem if other traffic hits it as well. Then, the overall performance overhead caused by small rsyslog batches affects those other applications and may even do so severely (depending on circumstances, of course).
The natural solution to the problem is to permit a minimum batch size requirement. While it is not reached, rsyslog processing stops and waits for more messages. Obviously, this is dangerous in periods where few messages hit rsyslog: the wait time may be prohibitively long. The solution is to have the capability to time out the wait, and to let the user specify the timeout period. So, again, when a timeout occurs, a batch is submitted for processing, no matter how small it is. This permits each user to specify his or her own latency/performance trade-off.
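The wait-with-timeout idea can be sketched as follows. This is an illustrative Python model under stated assumptions, not the rsyslog implementation: the function name take_batch_with_minimum and its parameters (min_size, max_size, timeout_s) are hypothetical and exist only in this example.

```python
import queue
import time


def take_batch_with_minimum(q, min_size, max_size, timeout_s):
    """Return a batch of at most max_size messages.

    Waits up to timeout_s seconds for min_size messages to arrive. When the
    timeout expires, whatever has been collected is returned, however small.
    Hypothetical helper for illustration only.
    """
    batch = []
    deadline = time.monotonic() + timeout_s

    # Phase 1: block, bounded by the deadline, until the minimum is reached.
    while len(batch) < min_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # timed out while waiting for more messages

    # Phase 2: top up with messages that are already queued, without waiting.
    while len(batch) < max_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(3):
        q.put(f"msg {i}")
    # Only 3 messages are queued but 5 are requested: the call waits for
    # about half a second and then hands over the small batch anyway.
    print(take_batch_with_minimum(q, min_size=5, max_size=8, timeout_s=0.5))
```

With min_size set to 0 the first phase is skipped entirely, so the function degenerates to the take-what-is-there behavior. A larger timeout favors bigger batches for the back end, a smaller one favors low latency.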
That is exactly the kind of processing I have now added to rsyslog. It’s a small but fundamental change. All tests passed, though some risk remains. If someone runs into issues, I am well prepared to help get this going, so it is definitely worth trying. In any case, I don’t expect much trouble.
A reminder: rsyslog is very conservative in that new versions should only introduce new defaults if there is an ultra-high-very-strong reason for it. For this feature we don’t think it’s really a “must”. So users need to explicitly opt into it via the new queue.minDequeueBatchSize parameter.
I am happy we have it now. Big kudos to Radu who always pointed out that it is an important feature.