refactored imfile module now publicly available

Hi all,

The refactoring and enhancement of imfile done by Nikolaidis Fotis has reached a first publicly visible result. I have just created a new experimental git branch with his work:

http://git.adiscon.com/?p=rsyslog.git;a=shortlog;h=refs/heads/v5-stable-imfile-refactor

The code is probably not 100% finished, but it provides a preview plus the ability to suggest further improvements.

Please note that the code base has evolved since the refactoring began. So in order to integrate it into the main code line, we must see how we can adapt to the recent changes to imfile. Most importantly, you should try to merge this branch with v5-devel; that merge gives me a lot of issues when I try to do it. We will look at this once the new module's codebase has matured (and Nikolaidis gives me a "go" for that activity).
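If you would like to try that merge yourself, the steps would look roughly like the following. This is only a sketch: the clone URL is my assumption based on the gitweb link above, and the branch names are taken from this announcement.

# clone the public repository (clone URL assumed from the gitweb link above)
git clone git://git.adiscon.com/git/rsyslog.git
cd rsyslog
# check out the experimental refactoring branch and try the merge; expect conflicts
git checkout -b imfile-refactor origin/v5-stable-imfile-refactor
git merge origin/v5-devel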
Feedback on the new module is appreciated (please direct it to the rsyslog mailing list).

rsyslog hit the one million message per second mark

Thanks to a user report, I just found out that the recent v5 versions of rsyslog have hit and exceeded the one million messages per second (mps) mark! In my lab, with commodity hardware, I reach around 500,000 mps. In practice this seems to be at least doubled :)

This further strengthens the case that rsyslog is the syslogd of choice for demanding high-traffic environments :)

The full report is here:

http://kb.monitorware.com/000-000-character-messages-sec-with-rsyslog-t10740.html

a new rsyslog v5 beta – and focus on v6!

I have just generated a new v5-beta. It is being processed for release right now and will be available soon. From the announcement:

This release both offers a set of new features and, at the same time, turns 5.7.3 into beta state. At first, this sounds a bit contradictory, but we do this for two reasons:

a) the new features introduced are non-intrusive in regard to the existing feature set, so no "bad surprises" are expected
b) other than that, primarily bugfixing went into this release, with only a few remaining issues being open

In order to move towards a new v5-stable, we consider it useful to begin a new v5-beta stage. Note that this time the beta phase may be rather quick, because of argument b) above. Note also that we have not had any serious bug reports (except for one open issue) since December. Once the remaining issue is solved, we plan to do a short "proof in practice" and then move on to a new stable version.

In regard to new features, this release offers imfile multi-line capability, realtime UDP reception, and better configurability for ommysql, as recently announced for 6.1.3.
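For those who want to try the imfile multi-line capability, a minimal legacy-style configuration might look like the sketch below. Treat this as an assumption-laden example rather than documentation: in particular, I am assuming the multi-line behavior is selected via $InputFileReadMode, so please check the imfile documentation of your version for the exact directive and mode values.

# load the file input module and monitor one file (legacy-style directives)
$ModLoad imfile
$InputFileName /var/log/myapp/app.log
$InputFileTag myapp:
$InputFileStateFile stat-myapp
# assumption: a non-zero read mode selects one of the multi-line modes (0 stays line-by-line)
$InputFileReadMode 1
$InputRunFileMonitor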

With this release, I also plan to conclude moving new features into v5 and try to focus on v6. This may not be practical in all cases, but I will try to stick to this plan as much as possible. During the past few weeks I have had considerable work to do just to integrate the various new features introduced in v4 and v5 into v6. While this was not really hard to do, it required more than a little effort and very careful handling of the changes. The primary reason is that the code bases have diverged quite a bit, and merging isn't much "fun" with that. More than once I even screwed up on some minor details. I hope that by focusing on v6 for new features, I can save a lot of time which can then go into new features.

And please do not misunderstand me: I focus on v6 for new features. This means I can focus even more on v5 in regard to correctness (bug-freeness). As usual, I prefer to fix issues in the oldest affected (and supported) release, and traditionally the version branch before the most current one has been very attractive to users, because it has a near-complete feature set and a very strong focus on correctness.

Multi-Threading rsyslog’s TCP input

A forum thread made me aware that there seems to be an issue with rsyslog performance when TLS is used. For the past two weeks, I have worked on a paper which looks in depth at rsyslog performance, and I came across a paper [1] that promotes writing servers in the "traditional" multi-threaded way (with a single thread per connection). It addressed some of my concerns, and I thought it was worth actually trying out this approach (I had ruled it out several years ago and never looked at it again). As a result, I created an experimental module, imttcp, which works in this mode. I put it to the test, especially as it would also lead to a much simpler programming paradigm. Unfortunately, the performance results are devastating: while there is a very slight speedup with a low connection count (close to the number of cores on the system), there is a dramatic slowdown when running with many connections. Even at only 50 connections, rsyslog is dramatically slower (80 seconds for the same workload which was processed in 60 seconds with traditional imtcp or when running on a single connection). At 1,000 connections, the run was *extremely* slow. So this is definitely a dead end. To be fair, von Behren, Condit and Brewer (the authors of [1]) claim that the problem lies in the current implementation of thread libraries. As one cure, they propose user-level threads. However, as far as I could find out, user-level threads do not seem to be much faster under Linux than kernel-level threads (which I used in my approach).

Even more convincing, from the rsyslog PoV, is that there are clear reasons why the highly threaded input must be slower:

  • batch sizes are smaller, leading to much more overhead
  • many more context switches are needed to switch between the various i/o handlers
  • more OS API calls are required because in this model we get more frequent wakeups on new incoming data, so we have less data available to read at each instant
  • more lock contention because many more threads compete on the main queue mutex

All in all, this means that the approach is not the right one, at least not for rsyslog (it may work better if the input can be processed totally independently, but I have not evaluated this). So I will look into an enhanced event-based model with a small set of input workers pulling data off the wire (I assume this is useful for e.g. TLS, as TLS transport is much more compute-bound than other inputs, and this computation becomes a limiting factor for the overall processing speed under some circumstances – see [2]).

As a side note: for obvious reasons, I will not try to finish imttcp. However, I have decided to leave it in the source tree, so that a) someone else can build on it, if they see value in that, and b) I may use it for some other tests in the future.

[1] R. von Behren, J. Condit, and E. Brewer. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems (Volume 9), page 4. USENIX Association, 2003.

New Mailing List for Log Normalization

Thankfully, the interest in log normalization and the related libraries liblognorm and libee has increased. Up until now, I have handled discussions on these topics via the rsyslog mailing list. As conversations increase, this may become an unnecessary burden for those only interested in rsyslog. So I have created a new mailing list named lognorm. I used this somewhat generic name because I intend to use the list for both libraries. This saves me some overhead, and I strongly assume that anyone interested in liblognorm will also be interested in libee (but to a lesser extent in the reverse direction).

Please subscribe to the new list. This is currently a very exciting phase in log normalization development, so getting involved is a great way to shape things in the way you need them!

log normalization with rsyslog

I just wanted to give you a quick heads-up on my current development efforts: I have begun to work heavily on a message modification module for rsyslog which will support liblognorm-style normalization inside rsyslog. In git there already is a branch "lognorm", which I will hopefully complete and merge into master soon. It provides some very interesting shortcuts for pulling specific information out of syslog messages. I'll probably promote it some more when it is available. IMHO it's the coolest and potentially most valuable feature I have added in the past three years. Once I have enabled tags in liblognorm/libee, you will even be able to very easily classify log messages based on their content.

calling for log samples!

Now I join the mass of people who are asking for log samples. But I do so for a good reason :) Also, I do not need a lot; a single log message works well for my needs. I need them to improve rsyslog so that the parser can handle exotic message formats even better. So the short story is: if you have a syslog message, please provide it to me.

And here is the long story:

One of the strengths of rsyslog is that it is very much focused on standards. That also means it tries to parse syslog messages according to the relevant RFCs. Unfortunately, syslog has been standardized only recently and so, in practice, there is no real standard for what to expect inside the header. So rsyslog's strength is also its weakness: if messages are ill-formed, results are often suboptimal.
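For context, a well-formed legacy syslog message, taken from the example in RFC 3164 itself, looks like this on the wire; it is the deviations from this layout (in the PRI part, the timestamp or the hostname) that force the parser to guess:

<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8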

I am working around this by doing smart guesswork inside the legacy syslog parser. However, every now and then some folks pop up with problems. And, more importantly, some others do not even ask. On my twitter account, I recently saw one such frustration. In that case, timestamps were duplicated. I guess that was caused by something unexpected inside the timestamp. However, I was not able to get down to the real problem, because I did not have access to the raw message. That's an important point: I need the raw message content, not what usually ends up in the logfile. The latter is already parsed, processed and recombined, so it does not tell me what the actual message is. But I need the actual message to improve the parser.

What I would like to do is create a very broad test suite with a vast amount of real-life syslog formats. The message text itself is actually not so important to me at this stage; it is the header format that matters. Once I have this, I'd like to analyze the different ways in which the format is malformed and then try to find ways to handle them inside the parser. If I find out that I cannot detect the right format automatically in all cases, I may provide ways to configure the different formats. The end result, I hope, will be far more plug-and-play message detection, something that should be of great benefit to all users.

Please contribute your logs! I need logs from many different devices, with many different versions. But I need only a few lines from each one, so for each individual contributor there is not a lot of effort required. Even a single log line would be great (ten or so would be even better). Just please don't mangle the logs: provide me with raw log messages. That's probably the hardest part. One way to do it is to sniff them off the wire, for example with Wireshark. Another way is to use rsyslog itself. All you need is a special template and an output file using it:

# write the raw message exactly as received, one message per line
$template rawmsg,"%rawmsg%\n"
*.* /path/to/raw-file.log;rawmsg

Add this to your rsyslog.conf, restart rsyslog, make the device emit a few lines and mail the result to me at rgerhards@gmail.com. You may also simply post the log sample to the sample log thread on the rsyslog forum – whatever you prefer. After you have done that, you can remove the lines from rsyslog.conf again. Before you mail me, it is a good idea to check whether there is any sensitive information inside the log file. Feel free to delete any lines you need to, but I would appreciate it if you do not modify the content of the remaining lines. Also, it would be useful for me if you let me know which device, vendor and version produced the log.
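If you prefer the wire-capture route mentioned above, a capture command along the following lines should also do the trick (this assumes plain UDP syslog on the standard port 514; adjust interface and port to your setup):

tcpdump -i eth0 -s 0 -w syslog-sample.pcap udp port 514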

I hope that you can help me improve the rsyslog parser even more. Besides, it will probably be a very interesting experiment to see how different syslog messages really are.

Thanks in advance for all contributions. Please let them flow!

Rainer

normalizing Apache Access logs to JSON, XML and syslog

I like to make up my mind based on examples, especially for complex issues. For a discussion we had on the CEE editorial board, I wanted to have a real-world example of a log file with many empty fields. An easy-to-grasp, well-understood and easy-to-parse example of such a file is the Apache access log. Thanks to Anton Chuvakin and his Public Security Log Sharing Site, I also had a few research samples at hand.
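For illustration (this is the canonical example from the Apache documentation, not one of the research samples), a typical common-log-format line looks like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

The "-" in the second field is a good example of an empty field: it stands for data that is simply not available for that request.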

Apache common log format is structured data, so there is no point in running it through a free-text normalization engine like liblognorm. Of course it could process the data, but why use such complex technology? Instead, the decoder is now part of libee and receives a simple string describing which fields are present. It is called like this:

$ ./convert -exml -dapache -D "host identity user date request status size f1 useragent" < apache.org > apache.xml

The options specify the encoder and decoder, and the string after -D tells the converter the field names and their order. But now, let the input and output speak for themselves:

The converter works by calling the decoder, which creates an in-memory representation of the log data in a CEE-like object model. Then, that object model and the requested encoder are used to write the actual output format. That way, the conversion tool can convert between any structured log formats, as long as the necessary encoders and decoders are available. This greatly facilitates log processing in heterogeneous environments.
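For instance, assuming the JSON and syslog encoders follow the same option pattern as the XML example above (an assumption on my part; check the convert tool's help output), switching the output format would just be a matter of the -e option:

$ ./convert -ejson -dapache -D "host identity user date request status size f1 useragent" < apache.org > apache.json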

Note that liblognorm works similarly and, from libee's point of view, can be viewed as a decoder for unstructured text data (or, more precisely, for text data where the structure is well hidden ;)).

log normalization – first results

At the beginning of this week I was pretty confident that I would not make my self-set deadline of one month to implement a first rough proof of concept of liblognorm, a log normalizing library. Fortunately, I made extremely good progress over the past two days and I am now happy to say that I have such a proof of concept available. All of this can be seen by pulling from Adiscon's public git server: you need libestr, libee and liblognorm to make it work.

Right now, I'd like to provide a glimpse at how things work. Thanks to Anton Chuvakin and his Public Security Log Sharing Site, I got a couple of examples to play with (but I am still interested in more log samples, especially from Cisco devices). Out of the many, I took a random messages.log file written by sysklogd. This is my input file and can be seen here.

To normalize events, liblognorm needs to know which fields are present at which positions of the input file. It learns this via so-called "samples". Samples are very similar to the patterns used by virus scanners: just as virus patterns describe how a specific virus looks, log samples describe how a specific log line looks. Unlike virus patterns, however, I have crafted a format that is hopefully easy (enough) for sysadmins to understand, so that everyone can add relevant samples themselves. To support this, samples look relatively similar to actual log lines, and this is the reason I have termed them "log samples". Like log files, samples are stored in simple text files. For the initial test, I used a very small set of samples, available here. A production system will have many more samples, and I envision systems that have many (tens of?) thousands of samples loaded at the same time. If you look at the samples, take special care about the entities enclosed in '%' – these are field definitions; the rest is literal text.
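To give a rough idea of what such a sample might look like, here is a purely hypothetical one for an sshd login line; the field names, field types and exact syntax are my illustration only, not necessarily what the sample files in git use:

Accepted password for %user:word% from %src-ip:ipv4% port %src-port:number% ssh2

Everything outside the '%' pairs must match literally; each '%name:type%' entity tells the engine to parse a field of the given type and store it under the given name.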

The actual normalization is performed by the library's engine, which parses log lines, based on the samples, into fields. This creates an in-memory representation of the event, which can then be processed by the driving application or be written to some other media or the network.

Liblognorm will come with a small tool called "the normalizer". It is a minimal library user: it loads a sample database, reads log lines from standard input, creates the in-memory event representation and then writes this representation to standard output in a standardized format. So far, it supports formats as they are expected for the upcoming CEE standard.
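In practice, a run might look roughly like this. The stdin/stdout behavior is as described above; the option name for pointing the tool at the sample database, as well as the file names, are purely my assumptions:

$ normalizer -r messages.sampdb < messages.log > messages.normalized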

The result of a normalizer run on my test input file, based on the provided sample base, can be seen here. The output is actually a bit more verbose than described above, because it also lists the to-be-normalized line. If you look at the properties I extracted, you'll probably notice that some do not make too much sense (maybe…). Also, a classification of the message is missing. Don't worry about these aspects right now: it's a proof of concept, and these things will be addressed by future development (the classification, for example, will be based on the CEE taxonomy via tags).

I hope I was able to convey some of the power that is available with liblognorm. Of course, a "little bit" more work and time will be required to get it production-ready. Unfortunately, I will be unavailable for larger parts of the next two weeks (other work is pressing, plus a long-awaited seminar ;)), but I will try to get liblognorm into the best shape possible as quickly as I can. In the meantime, if you like, feel free to have a look at its code or play with it. All of what I wrote can actually be done with the versions available in git.

Call for Log Samples

My log normalization effort has made good progress and I have a very rough first proof of concept available. It takes a log sample database and transforms input log files to a CEE-like output format.
Now I am looking at ways to test it in practice. So I'd appreciate it if you could point me to some sources of log files. It needn't be terabytes, but the files should be anonymized and usable on the public Internet. For obvious reasons, it would be good if they came from widely deployed devices.
I would use a subset of these samples to extract usable sample database entries and see how they run through the normalizer.
Thanks,
Rainer