creating an error log file for elasticsearch data errors

With the work I have just completed in the beta-jsonrt branch of rsyslog, omelasticsearch now supports logging records with data errors to a local log file. This, for example, handles cases where the data type does not match an expected case. These type of errors are not recoverable, so omelasticsearch does not busy the rsyslog core with handling them – especially as it knows precisely which records caused the error. Instead, as of now (will be “officially” available as part of  7.3.2) the user can configure a local error log file, where all requests with errors are written to. That file will receive information about the request and elasticsearch reply. Log records are formatted in JSON and as such are easy to handle. Note that the request will be reported as the post data, a string, NOT a full JSON representation. This is done for two reasons: first of all, the post data for a bulk mode request (as defined by ES) is NOT proper JSON, so we would need to mangle with that data – something we don’t like, as it may hide the real request problem (better use “the real thing”). Secondly (and closely related), there may actually be errors in that string, so that we may not even be able to parse it. As such, it makes a lot more sense to supply the actual string. The reply part is JSON representation as we received it from ES (we assume that it could be parsed – else something would be awfully wrong). Note that in bulk mode, the error record may also contain records that have been processed correctly (ES status OK). Again, we include them so that the whole request/reply process can be evaluated. If data needs to be re-processed, users scripts must ignore such OK status records.

The core idea behind that representation is that it should be as simple as possible to create a custom cleanup script capable of processing the error information – while still providing full feedback on the error cause.

Once the error file is opened, it is never closed until rsyslog is terminated OR a HUP is sent. So proper processing of the error file in a custom script is to first rename the file and then send SIGHUP to rsyslog. This will cause it to release the file and re-open it *when needed*. Note that the file is only created if actually errors occurred. Among others, this is done so that the presence of this file can be used as simple trigger to decide if error handling must be started or not (should be easy to integrate with custom monitoring solutions).

The error file functionality is activated by specifying the “errorfile” action parameter. If it is not given, errors are simply ignored (as there is nothing we can do to handle them in any case).

A quick sample on how this can be done is here:

*.*     action(type=”omelasticsearch”
searchType=”mytype” bulkmode=”on”

Error file contents will look like this:
“request”:    {
“url”:    “http://localhost:9200/_bulk?replication=async&timeout=1m&”,
“postdata”:    “{“index”:{“_index”: “testindex”,”_type”:”mytype”}}n{“message”:” test 1″,”fromhost”:”hostname”,”facility”:”kern”,”priority”:”notice”,”timereported”:”2012-07-30T12:38:38.737354+02:00″,”timegenerated”:”2012-07-30T12:38:38.737354+02:00″}n”
“reply”:    {
“took”:    2,
“items”:    [{
“create”:    {
“_index”:    “testindex”,
“_type”:    “mytype”,
“_id”:    “g8XJzt-TRd6WOu3pRyzrcg”,
“error”:    “MapperParsingException[Failed to parse [timegenerated]]; nested: NumberFormatException[For input string: “2012-07-30T12:38:38.737354+02:00″]; ”

rsyslog action queue

In pre-v6, action queues are quite hard to understand, because the syntax is very counter-intuitive. Based on a recent conversation, I thought I share a little bit of advise on how queues work.

Each action has a dedicated queue. This queue can be in-memory, it can be on disk and it can be a combination of both. There is also direct mode, which means by design there is a queue but the actual driver does forward messages to the action without action queueing.

When you define actions, the queue parameters belong to the next action (in v6 this is far easier to understand and see as the queue params are specified within the action). In the snippet below, we have two actions, and each of them is configured with different queue settings. This results with two different sets of queue files being written. It is important that the queue file names are different (as they are), but otherwise the queues are independent. There can be as many disk (or DA) queues as there are actions. Note that if there is only a single action queue and n actions, this means only one action uses the disk queue, whereas n-1 actions do not use it (if nothing else is set, they run in direct mode).

$ActionQueueType LinkedList # use asynchronous processing
$ActionQueueFileName nfsq    # set file name, also enables disk mode
*.* /var/log/log1

$ActionQueueType LinkedList # use asynchronous processing
$ActionQueueFileName diskq    # set file name, also enables disk mode
*.* /var/log/log2

*.* /var/log/log3 # DIRECT mode here!!!!

In v6, queue parameters are set directly within the action:

*.* action(type="omfile" 
queue.filename="nfsq" queue.type="linkedlist"
*.* action(type="omfile" 
queue.filename="diskq" queue.type="linkedlist"

*.* action(type="omfile" file="/var/log/log3")

The v6 format hopefully make more clear what belongs together.