rsyslog – what’s next?

I’ve not blogged much in recent weeks. I have had my nose deep down inside the rsyslog code, adding new features (like an automatic zip file writer or the ability to spoof/forge UDP sender addresses) and enhancing performance (where I think I scored some major points ;)).

So it is time for an update. Where is rsyslog heading? With the many changes I made in the past two to three months, I think it is very important to let the code base stabilize. So I would prefer not to touch too much existing code for a while. Also, it is summer time and my summer vacation is not far away – another good argument that this is probably not the best time for big code changes (or do I like to break things before I go away…? ;)).

So looking at what to do next, I would like to focus on improving the tool set that helps create rsyslog. That doesn’t mean direct improvements to the actual syslogd, but rather to the tools that help build and maintain it. The first major effort in this regard was adding an automated testbench. If you look at v3, I think it has around four automated tests (previous versions had none). With v5, we have over 20 subtests, each of which tests multiple cases, so in total we currently have around one hundred test cases automatically covered.
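
To give an idea of what such a test does: most follow an “inject and verify” pattern – feed a known sequence of messages into a test instance of rsyslogd, then check the output file. The actual testbench is implemented quite differently, and the port, file name and message format below are made up for this sketch, but a toy version of the idea could look like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.OutputStream;
    import java.net.Socket;

    // Toy illustration of the "inject and verify" test pattern: send
    // numbered messages to a (separately started) rsyslogd test instance
    // that writes them to TEST_FILE, then check that every message
    // arrived and in the right order. Port and file name are invented.
    public class InjectVerifyTest {
        static final int NUM_MSGS = 1000;
        static final String TEST_FILE = "/tmp/rsyslog-test.log";

        public static void main(String[] args) throws Exception {
            // inject: plain TCP syslog, newline-delimited framing
            try (Socket s = new Socket("127.0.0.1", 13514);
                 OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < NUM_MSGS; i++)
                    out.write(("<167>Mar  1 01:00:00 host tag: msgnum:"
                               + i + "\n").getBytes("US-ASCII"));
            }
            Thread.sleep(2000); // crude: let rsyslogd drain its queues

            // verify: each message must appear exactly once, in order
            try (BufferedReader r = new BufferedReader(new FileReader(TEST_FILE))) {
                for (int i = 0; i < NUM_MSGS; i++) {
                    String line = r.readLine();
                    if (line == null || !line.endsWith("msgnum:" + i)) {
                        System.err.println("FAIL at message " + i + ": " + line);
                        System.exit(1);
                    }
                }
            }
            System.out.println("PASS: " + NUM_MSGS + " messages, correct order");
        }
    }

The real tests also take care of starting and stopping the daemon and of waiting properly instead of just sleeping, but the pattern is the same.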

When I started with this in v3, it was a major effort, even though the number of covered cases was rather small. But getting started with a testbench meant I needed to evaluate ways to automate the tests and create them in the way that works best for rsyslog (which also means they must be convenient to use during development). At that point, I tried a lot of things and finally came up with the current set of tests. The initial testbench covered only a very limited set of use cases.

Since then, it has greatly improved, but there are still a lot of uncovered areas. I now regularly add new tests, most often when I implement new features, change existing ones or hunt bugs. The process is now well understood and many tests can be added with relative ease (but not all: I have some test cases in the queue that require notable extensions to my current system, plus a bit … of the different tool set I will be talking about soon…).

Initially, I was rather skeptical about whether the testbench would really pay off, especially after I saw the initial effort required (which I had underestimated by far). But in the meantime I am convinced. Especially the past couple of months have shown that the automated tests increase both development productivity (by reducing the number of manual tests that need to be done and by spotting regressions early) and code quality (by detecting regressions that would otherwise have been overlooked).

Now I am in a similar situation in regard to performance testing as I was in regard to correctness testing: everything is done manually and with very low-level tools. Still, I was able to make good progress without tools. But I hope that tools will be as useful for performance testing as they were for bug hunting. Most importantly, my current performance testing covers only limited (though highly relevant) scenarios: those where getting sufficiently reliable numbers is possible with the limited capabilities I have. In practice, this means that almost all testing so far has been done with plain TCP syslog. While this still enables me to check the core engine’s performance, it does not offer a clear view of e.g. UDP performance (which I really do not have right now). Also, the test cases are artificial, and it would be useful to get more of a real-world performance benchmark.
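
To make “plain TCP syslog testing” concrete: in essence, it boils down to blasting newline-delimited messages at rsyslogd over a TCP connection and watching the rates. A rough sketch of such a load generator follows – host, port and message format are assumptions, and a real measurement must also watch the receiving side, because the sender’s rate alone says nothing about when messages actually hit the output files:

    import java.io.OutputStream;
    import java.net.Socket;

    // Rough sketch of a plain-TCP syslog load generator: blast
    // newline-delimited messages at rsyslogd over one connection and
    // report the raw send rate. Host and port are invented for this sketch.
    public class TcpSyslogFlood {
        public static void main(String[] args) throws Exception {
            final int numMsgs = 1_000_000;
            // one pre-built message, so formatting does not skew the timing
            byte[] msg = "<167>Mar  1 01:00:00 host tag: test message\n"
                             .getBytes("US-ASCII");
            long start = System.nanoTime();
            try (Socket s = new Socket("127.0.0.1", 13514);
                 OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < numMsgs; i++)
                    out.write(msg);
                out.flush();
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("sent %d msgs in %.2fs (%.0f msgs/sec)%n",
                              numMsgs, secs, numMsgs / secs);
        }
    }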

Finally, performance benchmarks stress the engine, especially its multi-threading capabilities. So performance testing is also a good way to uncover those nasty threading bugs that one otherwise detects only when systems fail in production (and nobody then knows why…). So I consider decent performance tests also to be a plus for code quality. I even consider them very important for stabilizing e.g. the v5 engine, which so far has received only limited attention in practice. It looks like almost nobody ever tried it. I know because the initial v5 release had such a big memory leak that any serious tester would have noticed and complained very quickly. A lack of test deployments makes it harder to mature the engine. I think that good stress tests (which all have a performance connotation) will help to somewhat mitigate this problem. As a side-note, I uncovered many of the bugs I fixed during my manual performance testing. This seems to prove the point.
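
The threading angle is also why a stress tool should open many connections at once: a single sender barely exercises the concurrent input workers and the queue engine. A sketch of that variation (again, connection count and port are arbitrary assumptions):

    import java.io.OutputStream;
    import java.net.Socket;

    // Sketch of a concurrency stress run: several connections hammering
    // rsyslogd at once exercise the TCP input workers and the queue
    // engine's worker threads far more than a single sender does.
    public class ConcurrentStress {
        public static void main(String[] args) throws Exception {
            final int conns = 10, msgsPerConn = 100_000;
            Thread[] senders = new Thread[conns];
            for (int c = 0; c < conns; c++) {
                final int id = c;
                senders[c] = new Thread(() -> {
                    try (Socket s = new Socket("127.0.0.1", 13514);
                         OutputStream out = s.getOutputStream()) {
                        for (int i = 0; i < msgsPerConn; i++)
                            out.write(("<167>Mar  1 01:00:00 host tag" + id
                                       + ": msgnum:" + i + "\n").getBytes("US-ASCII"));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                senders[c].start();
            }
            for (Thread t : senders)
                t.join(); // then verify per-connection ordering in the output
        }
    }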

So I am more or less decided (if nothing more urgent shows up) to spend some time implementing performance tools and tests for rsyslog. I would also like to include a somewhat older idea of a “diagnostic front-end” that would be able to pull (and maybe modify) some of rsyslog’s parameters. I’d expect that, as a side activity, I’ll also gather performance data (at a minimum) or actually improve performance (preferably) in a couple of areas (UDP reception performance is at the top of the list). But improvements will only come after the basic tools have been written.
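
To be clear, no such diagnostic interface exists in rsyslog today – what the engine should expose, and how, is exactly the design work ahead. Purely as a thought experiment, the front-end’s polling side could be as simple as the following; the port number, the “getcounters” command and the “.” end marker are pure invention for this sketch:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    // Thought experiment only: rsyslog has no such interface today.
    // This imagines a simple line-based diagnostics channel the
    // front-end could poll; protocol details here are made up.
    public class DiagProbe {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("127.0.0.1", 44514);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                out.println("getcounters"); // e.g. queue sizes, msgs processed
                String line;
                while ((line = in.readLine()) != null && !line.equals("."))
                    System.out.println(line); // a GUI would chart these instead
            }
        }
    }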

As with the testbench, that will mean that new features and enhancements will probably stall a bit in the coming weeks. All the more so as I do not intend to write the front-end in C (I personally do not consider C to be the language of choice for non-performance-critical interactive programs, especially looking at some of the portability issues – but YMMV…). So I will try to approach this with a Java app. I have to admit that I learned Java some 8 to 10 years ago and never programmed much in it, so that will probably mean I’ll need to re-learn the language. But as I don’t consider this GUI to be something extremely critical, I don’t see any issues with me as a Java freshman doing it.

As a side-note, I should probably mention that I am also involved in the phpLogCon project. So far, I am only part of the design team, but I have a number of really cool visualization features on my personal wish-list. If I ever get time to work on them (I hope for next year), I will probably need to do that in Java, so it doesn’t hurt to practice on a less demanding project. In that sense, I also hope to set the stage for some future cool technology while working on a current need ;) It would also be interesting to visualize some of the performance counters, but that’s another story ;)

All in all, an interactive troubleshooting and analysis front-end has big potential, not only for testing but also for deployments and for finding configuration bugs (which become more and more of an issue as configuration complexity grows). One could also envision that it includes a graphical configuration builder … as well as tools for setting up all those TLS certs. I don’t think I can do all of this now or in the next quarter. But I think now is the right time to begin working on a foundation that offers yet more big potential. Especially as it also helps with the urgent need to get better testing for the engine, plus the desire to further improve its performance (my goal is no less than to provide the by-far fastest AND most reliable syslogd on this planet ;)).

Well, that’s it for now. I hope you like the idea of an additional performance-centric toolset (which of course also requires engine enhancements) and a GUI as much as I do. If you have comments or concerns, please let me know. I sincerely hope to begin a new round of capability enhancements with this move.