Refactoring omhttp for containers: what, why, and when?

This week—and likely into next—my primary focus (together with Andre and our AI Agents) is a substantial refactor of contrib/omhttp. We’re tracking the work in #5957 and will link the PR from there once it’s open.

Symbol image for “Engineering Rationale” type of postings. (Image: Rainer Gerhards via AI)

Prerequisite: core fix that unlocked this work

Before touching omhttp, we fixed a correctness issue in rsyslog core around transaction suspension/resume. That repair makes core-native retry reliable for HTTP actions and removes the historical rationale for complex module-local retry paths. With the core semantics in place, the refactor below becomes both feasible and worthwhile.

Why refactor `omhttp`—focused on container realities

Running rsyslog in Docker/Kubernetes stresses HTTP outputs: ephemeral restarts, transient ingress/DNS issues, bursty backpressure, rolling updates, and tight resource budgets. In this environment we need:

Predictable retry behavior (no stalls when an auxiliary queue fills).
Clear HTTP status semantics (no accidental retries on 4xx).
Accurate per-record outcomes for partial batch successes.
Graceful recovery from brief outages without duplicate storms.

omhttp is already a solid contributed module with a few known (mostly historic) issues. We appreciate the original contribution and aim to polish it for modern, container-heavy deployments.

What will change (concise, technical)

1) Native transactions (`commitTransaction()`)

Migrate from the older begin/do/end path to commitTransaction().
Use core batch visibility; map per-record outcomes precisely (partial success ⇒ selective retry).
Return suspension only for transient failures so the core handles backoff uniformly.

Benefit: cleaner correctness under failure, fewer duplicates, simpler tuning at scale.
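To make the "partial success ⇒ selective retry" idea concrete, here is a minimal sketch in Python rather than the module's C code. The names `Outcome`, `BatchDecision`, and `decide` are hypothetical and exist only for illustration; the real interface is whatever rsyslog core exposes to `commitTransaction()`.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    OK = "ok"                # record accepted by the backend
    PERMANENT = "permanent"  # record rejected; retrying will not help
    TRANSIENT = "transient"  # record not processed; worth retrying


@dataclass
class BatchDecision:
    committed: list[int]  # indexes that are finished (accepted or dropped)
    retry: list[int]      # indexes that still need to be resubmitted
    suspend: bool         # True => ask the core to back off and retry


def decide(outcomes: list[Outcome]) -> BatchDecision:
    """Map per-record outcomes of one batch to a single batch decision."""
    committed = [i for i, o in enumerate(outcomes) if o is not Outcome.TRANSIENT]
    retry = [i for i, o in enumerate(outcomes) if o is Outcome.TRANSIENT]
    # Suspension is requested only when something transient remains.
    return BatchDecision(committed, retry, suspend=bool(retry))
```

The point of the sketch is that permanently rejected records count as finished: only transient failures keep the action suspended, so one bad record cannot cause the whole batch to be resent.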

2) Retry defaults: core-native; retryRuleset remains optional

Default: suspend on retriable errors and let core retry.
Keep retryRuleset as an optional path for advanced/exotic flows (e.g., special enrichment/routing on failure). We’ll document queue-pressure risks and when it makes sense.

Benefit: safe defaults for most users; power path for specialists.

3) HTTP status policy (explicit and predictable)

1xx/2xx ⇒ success
3xx ⇒ failure (non-retriable) for now (no implicit follow-redirect)
4xx ⇒ permanent failure (non-retriable) by default
5xx / transport failure (status 0) ⇒ retriable (suspend so core retries)

Overrides:

httpretrycodes adds retriable codes (doesn’t convert failures to success).
httpignorablecodes can explicitly mark certain non-2xx as processed (applied after the base policy).

Benefit: matches real backend behavior; avoids retry storms on 4xx; robust on 5xx.
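The policy above can be read as a small decision function. The following Python sketch is an illustration, not the module's code: `classify` and its result strings are hypothetical, and the order in which the two override lists are applied relative to each other is an assumption (the post only states that `httpignorablecodes` is applied after the base policy).

```python
def classify(status: int,
             retry_codes: frozenset = frozenset(),
             ignorable_codes: frozenset = frozenset()) -> str:
    """Return "success", "fail" (permanent) or "retry" for one HTTP status."""
    # Base policy: 0 stands for a transport failure (no HTTP response at all).
    if status == 0 or 500 <= status <= 599:
        result = "retry"
    elif 100 <= status <= 299:
        result = "success"
    else:
        result = "fail"  # 3xx and 4xx are non-retriable by default

    # httpretrycodes: adds retriable codes, never produces "success".
    if status in retry_codes:
        result = "retry"

    # httpignorablecodes: marks selected non-2xx codes as processed.
    # Assumption: this list wins if a code appears in both lists.
    if status in ignorable_codes and not 200 <= status <= 299:
        result = "success"
    return result
```

For example, `classify(429, retry_codes=frozenset({429}))` yields `"retry"`, while a plain `classify(429)` yields `"fail"`.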

4) Batching as a thin serializer

Keep newline/jsonarray/kafkarest/lokirest as formatters over core batches (not a parallel transaction system).
Unify gzip and header lifecycle; ensure partial acceptance maps to per-record results.

Benefit: correctness + performance without ambiguity about “who owns the batch”.
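As a rough picture of "formatter over a core batch", the Python sketch below turns an already-rendered list of records into one request body and applies gzip in a single place. The function names are hypothetical, only the `newline` and `jsonarray` formats are shown, and it assumes each record is a JSON document produced by the template.

```python
import gzip
import json


def format_newline(records: list[str]) -> bytes:
    """Join records with a newline; the records are passed through as-is."""
    return "\n".join(records).encode("utf-8")


def format_jsonarray(records: list[str]) -> bytes:
    """Wrap records in one JSON array (assumes each record is valid JSON)."""
    parsed = [json.loads(r) for r in records]
    return json.dumps(parsed, separators=(",", ":")).encode("utf-8")


FORMATTERS = {"newline": format_newline, "jsonarray": format_jsonarray}


def build_request(records: list[str], batch_format: str,
                  compress: bool = False) -> tuple[dict, bytes]:
    """Serialize one batch; compression and headers are handled in one spot."""
    body = FORMATTERS[batch_format](records)
    headers = {}
    if compress:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return headers, body
```

The serializer keeps no state of its own: the batch is whatever the core hands over, which is the "who owns the batch" distinction made above.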

5) Loki: from partial to first-class

batch.format=lokirest exists; we’ll verify templates (timestamps/labels), recommended restpath, headers, compression, and container-friendly defaults (label cardinality, batch sizing).
Ensure partial failures are handled per record and document copy-paste examples.

Benefit: straightforward, reliable Loki pipelines in containers.
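For orientation, this is the JSON shape that Loki's push endpoint (`/loki/api/v1/push`) accepts: streams keyed by a label set, each carrying `[timestamp-in-nanoseconds-as-string, line]` pairs. The Python sketch below only illustrates that shape; `loki_payload` is a hypothetical helper and not how omhttp assembles `lokirest` batches.

```python
import json
from collections import defaultdict


def loki_payload(records):
    """Group (labels, ts_ns, line) tuples into a Loki push payload.

    labels: dict of label name -> value; ts_ns: Unix time in nanoseconds.
    """
    streams = defaultdict(list)
    for labels, ts_ns, line in records:
        # One stream per distinct label set; few labels keep cardinality low.
        key = tuple(sorted(labels.items()))
        streams[key].append([str(ts_ns), line])
    return {
        "streams": [
            {"stream": dict(key), "values": values}
            for key, values in streams.items()
        ]
    }


if __name__ == "__main__":
    demo = loki_payload([
        ({"app": "web"}, 1700000000000000000, "GET /"),
        ({"app": "db"}, 1700000000000000001, "slow query"),
    ])
    print(json.dumps(demo, indent=2))
```

Every distinct label set becomes its own stream, which is why label cardinality matters as much as batch size when choosing container-friendly defaults.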

What the PR will contain (overview)

We’ll link the PR from #5957. High-level contents:

Transaction migration to commitTransaction() with precise per-record outcomes.
Retry changes: core-native by default; retryRuleset optional and documented.
HTTP semantics enforcement and coherent use of httpretrycodes/httpignorablecodes.
Batching rework: serializers over core batches; unified gzip/headers.
Loki specifics: validated payload/labels; container-ready examples.
Quality & safety: review the code for defects and fix them; audit unsafe code; improve the doxygen comments.
Docs & migration: clear migration notes; parameter tables; decision chart for status handling.
Tests: status matrix (1xx..5xx + transport failure), partial success, suspend/resume, Loki conformance, perf counters, and queue-pressure scenarios.

Team and “AI First”

This effort is led by me, Andre, and our AI Agents—which we consider part of the team. In our responsible “AI First” approach:

Agents propose code diffs, run PR checks (lint/style/invariants), and draft doc updates.
They act as independent reviewers; humans remain in the loop for design and merges.

Timeline

Focus window: this week, possibly next week.
Follow progress and discussion in #5957; the PR will be linked there when ready.

If you run rsyslog in containers—especially with Loki—and have edge cases we should test, please comment on the issue. Your input helps us set the right defaults.
