Revisiting Old-Style Windows Log Schema Mapping

Windows logs provide a wealth of information that must be made usable for observability. As you may know, I have worked on normalizing these logs for quite a while; I even created liblognorm for that purpose. Ingesting them properly is important for schema mapping, e.g. when sending data to Elastic Cloud.

Some may think that ingesting structured Windows logs is no big deal nowadays. Indeed, it is not, if you are free to place agents. For example, the rsyslog Windows Agent extracts rich metadata from the Windows Event Log and provides it nicely as JSON. But in reality, many installations still use other formats, like the Snare format — especially the older variant without real structure that is available in free-to-use versions.
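For illustration, a classic Snare-over-syslog event is a single tab-delimited line. Here is a minimal parsing sketch; note that the exact field list is an assumption for one common layout — real deployments differ by agent version and configuration:

```python
# Minimal sketch: split an old-style Snare event into named fields.
# ASSUMPTION: this field order matches one common Snare layout; real
# agents vary by version and configuration, so treat it as illustrative.
SNARE_FIELDS = [
    "Criticality", "EventLogSource", "SnareCounter", "SubmitTime",
    "EventID", "SourceName", "UserName", "SIDType",
    "EventLogType", "ComputerName", "CategoryString", "DataString",
]

def parse_snare(payload: str) -> dict:
    """Map the tab-separated fields after the 'MSWinEventLog' marker."""
    _, _, rest = payload.partition("MSWinEventLog\t")
    if not rest:
        raise ValueError("not a Snare-format payload")
    # Pair each known field name with its positional value.
    return dict(zip(SNARE_FIELDS, rest.split("\t")))

msg = ("MSWinEventLog\t1\tSecurity\t100\tFri Jan 10 10:00:00 2025\t4624\t"
       "Microsoft-Windows-Security-Auditing\tN/A\tN/A\tSuccess Audit\t"
       "HOST1\tLogon\tAn account was successfully logged on.")
print(parse_snare(msg)["EventID"])  # → 4624
```

The positional split works only as long as the message body itself contains no stray tabs — one of the reasons the old format needs guardrails rather than naive splitting.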

We have lots of experience, but when we hit a new project where normalizing Snare format and schema-mapping Windows events in rsyslog became a hot topic, we decided to revisit the subject. With the advance of AI we have additional options, and we also have even more experience now than we had ten or twenty years ago (remember that I invented the Windows-to-syslog technology and pipeline in the late 1990s).

We are currently trying four approaches to refine our effort:

  1. a classic transformation process (no AI) which uses rules to extract fields. It will most probably be a heuristic, but if the success probability is high, that is good enough (we can add guardrails against mis-detection). Today, we can use AI to complement our deep Event Log knowledge by letting it wade through large log samples and verify our human-crafted assumptions.
  2. a clustering-based approach, with the goal of identifying constant text and variable data elements inside the messages. The aim here is to either automatically generate liblognorm rulebases or at least considerably facilitate their creation by humans. This would probably involve machine learning, but no LLMs.
  3. running an LLM natively on a larger set of event log messages, again with the goal to cluster and detect – much like option two, but with no need to generate clustering scripts.
  4. f
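To make the clustering idea in option two concrete, here is a toy sketch (not our actual pipeline): mask tokens that look variable, then group messages that share the remaining constant skeleton. The surviving literal tokens are candidates for rulebase text, the masked positions for extracted fields:

```python
import re
from collections import defaultdict

# Toy sketch of option two: tokens that look variable (decimal numbers,
# hex values) are replaced by a wildcard, and messages are clustered by
# the resulting template. A real pipeline would use richer token classes
# and similarity metrics, but the constant/variable split is the same idea.
VARIABLE = re.compile(r"^(\d+|0x[0-9a-fA-F]+)$")

def template(msg: str) -> str:
    """Replace variable-looking tokens with '%' to get the constant skeleton."""
    return " ".join("%" if VARIABLE.match(tok) else tok for tok in msg.split())

def cluster(messages):
    """Group messages by their shared template."""
    groups = defaultdict(list)
    for m in messages:
        groups[template(m)].append(m)
    return groups

sample = [
    "Logon failure for user 1001 from 0x3e7",
    "Logon failure for user 1002 from 0x3e8",
    "Service started with pid 4711",
]
for tmpl, msgs in cluster(sample).items():
    print(f"{len(msgs):2d}x {tmpl}")
```

On the sample above this yields two templates: "Logon failure for user % from %" (twice) and "Service started with pid %" (once) — exactly the kind of constant/variable separation a liblognorm rulebase encodes.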

If possible, option one is our prime goal – it is fast and requires no pipeline for rule creation.

In any case, it is interesting to revisit a known problem after more than ten years of working on custom solutions, and to re-think it in a more generic sense — now with 20+ years of experience and a totally new technology stack.

I am positive this effort can succeed and make rsyslog a very valuable data pipeline component in projects that migrate from Snare-format-based logging to modern observability front-ends. Remember that rsyslog has done schema mapping for 20 years now. We already did it when the term was not even known ;-)