Lading in Python#1907
Conversation
…gstatsd-py Adds a Python port of lading under lading_py/ that uses the dogstatsd-py library for all metric emission, enabling direct testing of that client under realistic load patterns. All supporting lading capabilities are preserved: Prometheus and expvar telemetry collection from a running Datadog Agent, JSONL/Parquet capture output, passive Prometheus exporter, HTTP blackhole, /proc observer, and the full lading YAML config schema. The Dockerfile is updated from the Rust multi-stage build to a two-stage Python build, dropping build time from several minutes to ~30 seconds. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a995ae4fe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| fn main() -> Result<(), Error> { | ||
| panic!("Rust is forbidden."); |
There was a problem hiding this comment.
Remove the unconditional panic from the Rust entry point
When anyone runs the existing Rust executable, including cargo run --bin lading, installed lading, or smoke tests that still exercise the Rust binary, main now panics before parsing the CLI or running any workload. The crate still builds, so this becomes a runtime break of the primary lading binary rather than a compile-time failure.
Useful? React with 👍 / 👎.
| if kind == "counter": | ||
| self._registry.increment(name, int(value), merged) |
There was a problem hiding this comment.
Record Prometheus counters as deltas, not absolute totals
For Prometheus target metrics, counters are cumulative values reported by the target, but this code adds each scraped absolute value into the registry. Since Accumulator later differences registry counters between flushes, a stable counter value like http_requests_total 100 will be emitted as another 100 every scrape/flush instead of 0, and increasing counters are overcounted by repeatedly re-adding their full total. This affects configs using target_metrics[].prometheus with any counter metric.
Useful? React with 👍 / 👎.
| self._gauges: dict[str, Gauge] = {} | ||
|
|
||
| def _sync(self) -> None: | ||
| counters, gauges, _ = self._lading_registry.snapshot() |
There was a problem hiding this comment.
Export counters from the Prometheus telemetry endpoint
When telemetry.prometheus.addr is configured, the exporter snapshots both counters and gauges but only registers gauges, so core lading metrics such as bytes_written and packets_sent never appear on /metrics. This leaves the advertised passive Prometheus exporter without the main throughput counters users need to monitor a run in real time.
Useful? React with 👍 / 👎.
Replaces --config (required) with --config-path (default: /etc/lading/lading.yaml) and adds all flags the Rust binary exposes: run/config-check subcommands, --no-target, --target-pid, --capture-path/format/flush-seconds, --prometheus-addr, --experiment-duration-seconds/infinite, --warmup-duration-seconds, --global-labels, and compat stubs for unsupported flags. LADING_CONFIG env var is also honoured. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fc947b95a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| async def run(self, signals) -> None: | ||
| while not signals.shutdown.is_set(): | ||
| await asyncio.sleep(self._flush_seconds) |
There was a problem hiding this comment.
Wake the accumulator on shutdown
When capture output is enabled and shutdown is requested just after this sleep starts, inner_main sets the shutdown event and then awaits the accumulator task, but the task cannot observe the event until the full flush_seconds delay has elapsed. With the default 60-second flush interval (or any larger configured value), short runs and SIGTERM shutdowns hang for that extra interval before the final flush, effectively ignoring the configured shutdown delay.
Useful? React with 👍 / 👎.
| if config.telemetry and config.telemetry.prometheus_addr: | ||
| exp = PrometheusExporter(registry, config.telemetry.prometheus_addr) | ||
| tasks.append(asyncio.create_task(exp.run(signals), name="prometheus_exporter")) |
There was a problem hiding this comment.
Start the exporter for prometheus socket configs
When users configure telemetry.prometheus_socket.path or pass --prometheus-path, the override code stores that value, but this startup path only checks prometheus_addr, so no Prometheus exporter task is created for socket-only configurations. That makes the advertised --prometheus-path endpoint silently unavailable unless a TCP address is also configured.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fc947b95a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| async def run(self, signals) -> None: | ||
| while not signals.shutdown.is_set(): | ||
| await asyncio.sleep(self._flush_seconds) |
There was a problem hiding this comment.
Wake the accumulator on shutdown
When capture output is enabled and shutdown is requested just after this sleep starts, inner_main sets the shutdown event and then awaits the accumulator task, but the task cannot observe the event until the full flush_seconds delay has elapsed. With the default 60-second flush interval (or any larger configured value), short runs and SIGTERM shutdowns hang for that extra interval before the final flush, effectively ignoring the configured shutdown delay.
Useful? React with 👍 / 👎.
| if config.telemetry and config.telemetry.prometheus_addr: | ||
| exp = PrometheusExporter(registry, config.telemetry.prometheus_addr) | ||
| tasks.append(asyncio.create_task(exp.run(signals), name="prometheus_exporter")) |
There was a problem hiding this comment.
Start the exporter for prometheus socket configs
When users configure telemetry.prometheus_socket.path or pass --prometheus-path, the override code stores that value, but this startup path only checks prometheus_addr, so no Prometheus exporter task is created for socket-only configurations. That makes the advertised --prometheus-path endpoint silently unavailable unless a TCP address is also configured.
Useful? React with 👍 / 👎.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be339ac27d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| class ConfRange(BaseModel): | ||
| inclusive: InclusiveRange | None = None | ||
| exclusive: ExclusiveRange | None = None |
There was a problem hiding this comment.
Preserve existing constant ranges
Configs that use the existing lading ConfRange form constant (for example for contexts, tags_per_msg, or multivalue_count) are accepted by Pydantic as extra data but leave both inclusive and exclusive unset; the generator then crashes when lo/hi dereference self.exclusive. This breaks compatibility for valid lading YAMLs that rely on constant ranges instead of spelling identical inclusive min/max values.
Useful? React with 👍 / 👎.
| now = time.monotonic() | ||
| elapsed = now - self._last | ||
| self._tokens = min(self._rate, self._tokens + elapsed * self._rate) | ||
| self._last = now | ||
| if self._tokens >= n: | ||
| self._tokens -= n | ||
| return | ||
| wait = (n - self._tokens) / self._rate |
There was a problem hiding this comment.
Allow buckets to accumulate one full block
When bytes_per_second is lower than the estimated size of a generated block, acquire() can never finish because tokens are capped at self._rate, which is always less than n; the send thread then sleeps forever and does not observe shutdown. This affects low-rate configs or large multi-value/event blocks, causing the run to hang instead of emitting at the configured slow rate.
Useful? React with 👍 / 👎.
| def _sync(self) -> None: | ||
| counters, gauges, _ = self._lading_registry.snapshot() |
There was a problem hiding this comment.
Avoid draining histograms from the exporter
Serving the passive Prometheus endpoint calls Registry.snapshot(), and that method clears histogram samples, but the exporter ignores the returned histograms. With telemetry.prometheus enabled alongside capture or histogram/summary target metrics, each /metrics scrape can silently discard those samples before the accumulator writes them, leaving gaps in captured histogram data.
Useful? React with 👍 / 👎.
aiohttp fails at parse time when receiving zstd-compressed requests without this backport of Python 3.14's compression.zstd module. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d38228daee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if config.experiment_duration_secs > 0: | ||
| await asyncio.sleep(config.experiment_duration_secs) |
There was a problem hiding this comment.
Wake fixed-duration runs on shutdown
When a SIGTERM/SIGINT arrives during a configured experiment duration, the signal handler only sets signals.shutdown, but this branch is still awaiting asyncio.sleep(config.experiment_duration_secs). As a result, a normal 60-second (or longer) run keeps the main task asleep until the full duration elapses before it starts draining tasks, so shutdown requests are not honored promptly and --max-shutdown-delay cannot take effect for fixed-duration runs.
Useful? React with 👍 / 👎.
What does this PR do?
A brief description of the change being made with this pull request.
Motivation
What inspired you to submit this pull request?
Related issues
A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.
Additional Notes
Anything else we should know when reviewing?