I thought I would post a few thoughts about how far the rsyslog queue enhancements have come.
We started with the goal of increasing performance, especially for database outputs. As part of that endeavor, we designed and implemented message batches as the new processing entity. This approach was suggested by David Lang, who also offered very valuable feedback, suggestions and review of the relevant papers (not to mention actual testing) during the process. Then we came to the conclusion that we need a truly ultra-reliable queue: one that does not lose messages even in case of a sudden fatal failure (like a power failure without a UPS, or a failing UPS!). That led to a further redesign and a lot of design work. All of this is very exciting.
Since last Friday, I have been working on the actual code. I now have updated code for the queue, the queue storage drivers and action processing. Most importantly, the rsyslog testbench once again runs successfully, even in DA (disk-assisted) queue mode. There are still a couple of things that need to be looked at, but I think the bulk of the work is done. What follows now is a careful look at the open issues plus a LOT more testing.
The testbench has improved a lot in the past three months, but it is still far from covering even the most important code areas. The various queueing scenarios in particular are not well covered, mostly because testing them is rather complex. Anyhow, I will now try to do fewer ad-hoc manual tests and instead create more automated tests. While this is a lot more work, even the current testbench has proven extremely valuable during this major code change effort (which, let me reiterate, is far from being fully completed). Without it, it would have been much harder to find the bugs that came up during the testbench runs. I think the time I invested in it has already paid off.
Let me end with a list of things I need to look at. That will at least help me keep focused and let you know what is extremely weak right now:
- more tests
- so far, the last batch is not freed until at least one more message comes in (permit DeleteProcessedBatch() to be called de-coupled)
- cancel processing cleanup, and a decision on whether we should still support cancel processing entry points
- configured discarding of messages on queue-full condition [at least add extra nElem counter]
- make output actions support message-permanent failures (at least PostgreSQL output plugin) [also think about test cases for this]
- double-check of action and action unit state processing
- persisting of messages from memory queues during shutdown (testing)
- think about a new way of handling iDeqSlowdown (maybe during batch processing?)
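
To make the batch-deletion item a bit more concrete, here is a minimal toy sketch in Python. It is not rsyslog code, and every name in it (BatchQueue, dequeue_batch, delete_processed_batch, run_worker) is hypothetical. It only illustrates the general idea that a batch stays in the queue while it is being processed, and that the delete step can be called on its own right after processing instead of waiting for the next message to arrive.

```python
from collections import deque


class BatchQueue:
    """Toy in-memory queue that hands out messages in batches.

    Hypothetical illustration only; this is not how rsyslog stores messages.
    """

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self._items = deque()   # messages that have not been deleted yet
        self._in_flight = 0     # leading messages handed out but not yet deleted

    def enqueue(self, msg):
        self._items.append(msg)

    def dequeue_batch(self):
        """Return the next batch WITHOUT removing it from the queue."""
        if self._in_flight:
            raise RuntimeError("previous batch has not been deleted yet")
        batch = list(self._items)[: self.batch_size]
        self._in_flight = len(batch)
        return batch

    def delete_processed_batch(self):
        """Remove the in-flight batch; can be called independently of dequeue."""
        for _ in range(self._in_flight):
            self._items.popleft()
        self._in_flight = 0

    def __len__(self):
        return len(self._items)


def run_worker(queue, process):
    """Process all batches, deleting each one as soon as it is done."""
    while True:
        batch = queue.dequeue_batch()
        if not batch:
            break
        process(batch)
        # De-coupled delete: the last batch is freed here, without
        # waiting for another message to trigger the next dequeue.
        queue.delete_processed_batch()
```

In this sketch, a crash between dequeue_batch() and delete_processed_batch() would leave the messages in the queue, which is the property a reliable queue wants; whether and how the real code achieves this is exactly what the open items above are about.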