Skip to content

Rules Guide

Adan edited this page Oct 6, 2025 · 1 revision

Automation Rule Playbooks

LogForge's Alert Engine automation service applies rule definitions to container signals and executes guarded remediation. Use the playbooks below as reference implementations. They map directly onto the Rule Builder panels (Trigger Condition, Rule Scope, Actions Chain, and the real-time Rule Preview) and call out when you should visit the global Advanced Settings modal for guardrails.

How the rule builder is organised

  • Trigger Condition: choose the signal (Keyword in Logs, Container Event, or Metric Threshold for Pro).
  • Timeline windows: use the Count and Minutes fields to require N events within M minutes before the trigger fires; leave them blank for immediate alerts.
  • Rule Scope: pick whether the rule inspects every container or specific containers/groups that you already maintain inside LogForge Core. The Alert Engine only lists groups that still contain monitored containers.
  • Actions Chain: order remediation steps (Restart, Stop, Kill, Start, Run script, Send notification) and add per-step delays. Script actions are automatically disabled when scope is global.
  • Notifier delivery: every Send notification step posts the alert payload to the Notifier service, which fans it out to the destinations configured in LogForge (email, chat, PagerDuty, webhooks, etc.).
  • Advanced Settings: open from the Alert Rules toolbar to adjust action cooldowns, backoff, per-hour limits, verification delay, and keyword defaults (case sensitivity, ANY/ALL matching, ignore patterns) for the entire engine.

Rule 1: Restart Loop Stopper (Template: Restart Loop Detection)

  • Template: Open the Templates tab, activate Restart Loop Detection, and set the rule name/scope if prompted. No other fields need to change for default behaviour.
  • Trigger Condition: Container event start with Count >= 3 inside a 5 minute window.
  • Rule Scope: In the activation modal, choose All containers or target one of the LogForge Core groups (for example Payments Services); only groups with monitored containers appear. You can revisit scope later from the rule detail.
  • Actions Chain:
    1. Stop container (fires immediately on the offending container).
    2. Send notification (Notifier pushes the alert to your configured destinations) with a 30 second delay so the stop action completes first.
  • Advanced Settings: The template inherits global guardrails (default stop cooldown 3 minutes, per-rule max actions 3/hour). Override those system-wide only if you need more frequent remediation.
  • Operator notes: For most teams the template defaults suffice-activate it, set the scope, and watch for duplicate-rule warnings before cloning.

Rule 2: Crash Loop Kill Switch (Template: Crash Loop Detection)

  • Template: Activate Crash Loop Detection from the Templates tab; adjust only the scope. This template kills instead of stops when a container keeps failing to start.
  • Trigger Condition: Container event stop with Count >= 5 within 10 minutes.
  • Rule Scope: Point the rule at the LogForge Core group that owns the fragile workload (for example Core APIs); the picker only shows groups with monitored containers.
  • Actions Chain:
    1. Kill container immediately to break the crash storm.
    2. Send notification (Notifier delivers the kill context to the configured destinations) after a 60 second delay.
  • Advanced Settings: Defaults enforce a 5 minute kill cooldown and a 30 second verification delay. Increase the cooldown globally if the same app repeatedly relaunches under investigation.
  • Operator notes: To tweak timelines or swap the kill action for a stop, convert the rule to a custom rule (Pro) or build a fresh custom rule in the builder.

Rule 3: Error Flood Alert (Custom rule - Keyword trigger)

  • Trigger Condition: Select Keyword in Logs, enter keywords such as ERROR and panic, and set Count >= 5 in 2 minutes. Use the Keyword Settings tab in Advanced Settings to flip to case-insensitive matching or add ignore patterns for noisy stack traces.
  • Rule Scope: Target the specific containers known for noisy logs (multi-select from the monitored container inventory) or one of your LogForge Core groups such as Frontend Services.
  • Actions Chain:
    1. Send notification (Notifier pushes the incident payload to the destinations you configured).
    2. Optional Restart container with a 60 second delay if rebooting clears the condition.
  • Advanced Settings: Keep the default restart cooldown (5 minutes) and verification delay (30 seconds); widen them globally if restart storms occur.
  • Operator notes: Start as notify-only, watch the Alerts dashboard, then layer the restart step with a delay.

Rule 4: Security Keyword Escalation (Custom rule - Keyword trigger)

  • Trigger Condition: Keyword in Logs, enable case-insensitive matching, add unauthorized, failed login, AWS_SECRET_ACCESS_KEY, and require Count >= 2 within 60 seconds.
  • Rule Scope: Select the relevant ingress/authentication group that you already maintain in LogForge Core (for example Edge Gateways); the Alert Engine surfaces those groups so you can include them while excluding internal tooling.
  • Actions Chain:
    1. Send notification (Notifier delivers to the security destinations configured in LogForge Core / Notifier).
    2. Send notification via Webhook (Notifier calls the configured webhook endpoint with the formatted alert message).
    3. Optional Stop container as a follow-up action after human review.
  • Advanced Settings: Use the Keywords tab to add ignore patterns for known scanners. Tighten the stop cooldown if security remediations should run only once per hour.
  • Operator notes: Document the response procedure inside the rule description, note which LogForge Core groups the rule covers, and enable Require approval before restart in the Notifier workflow if you automate container stops.

Rule 5: Post-Start Warmup (Custom rule - Container event trigger)

  • Trigger Condition: Container event set to start with no additional thresholds.
  • Rule Scope: Select the release group that needs priming (e.g., Frontend Services) from the Core-managed list. Script actions are only available when the scope is container or group.
  • Actions Chain:
    1. Delay 45 seconds using the step-level delay field before remediation starts.
    2. Run script to execute the first executable .sh in /logforge-scripts/ inside the triggering container.
    3. Send notification (Notifier shares the captured {{ action.output }}) so deployers see warmup results.
  • Advanced Settings: Ensure run_script cooldown (default 10 minutes) aligns with your rollout cadence. The gatekeeper also enforces max actions per container/hour.
  • Operator notes: Provision /logforge-scripts/ with numbered scripts (for example 01_warmup.sh) and check script validity via the Validate scripts API before enabling the rule.

Building reliable automation

  • Prefer templates when they exist; they already include safe defaults and duplicate detection. Adjust scope first, then customise if needed.
  • Keep remediation narrow by scoping rules to the container groups you curate in LogForge Core. The builder prevents wildcard targeting but it is still easy to over-select.
  • Use the global Advanced Settings modal to align cooldowns, exponential backoff, per-rule/per-container limits, verification delays, and keyword defaults with your operational runbooks.
  • Layer remediation gradually: run notification-only rules, review the Alerts dashboard, then append restart/stop/script actions with deliberate delays.
  • When you add script actions, confirm /bin/sh exists, scripts are executable, and the gatekeeper's verification delay is long enough for the script to finish.
  • Keep the Notifier service destinations current so every Send notification step lands in the right inboxes, chat channels, or webhooks.