feat(monitoring): add prometheus metrics collection and grafana dashboard #100
KalpanaBhaskar wants to merge 1 commit into fuzziecoder:main
Conversation
@KalpanaBhaskar is attempting to deploy a commit to the Revon Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough
This PR introduces Prometheus monitoring to the pipeline execution system. It adds metric definitions for execution tracking, instruments the executor with timing measurements, and exposes metrics via a new `/metrics` endpoint.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Executor
    participant Metrics
    participant Prometheus
    participant Grafana
    Client->>Executor: Start Pipeline Execution
    Executor->>Metrics: Increment pipeline_active_executions
    Executor->>Metrics: Record pipeline start_time
    rect rgba(100, 150, 200, 0.5)
        Note over Executor,Metrics: Pipeline Processing
        Executor->>Executor: Execute Stages
        Executor->>Metrics: Record stage start_time
        Executor->>Metrics: Observe stage_execution_duration_seconds
    end
    alt Pipeline Success
        Executor->>Metrics: Increment pipeline_executions_total (status=success)
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    else Pipeline Failure
        Executor->>Metrics: Increment pipeline_executions_total (status=failed)
        Executor->>Metrics: Increment pipeline_failures_total
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    end
    Client->>Prometheus: GET /metrics
    Prometheus->>Metrics: Collect Metrics
    Prometheus-->>Client: Return Prometheus Format
    Grafana->>Prometheus: Query pipeline_executions_total, pipeline_active_executions, etc.
    Prometheus-->>Grafana: Metric Data
    Grafana->>Grafana: Render Dashboard Panels
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
```python
# Record failure metrics
duration = time.time() - start_time
pipeline_executions_total.labels(pipeline_id=pipeline.id, status='failed').inc()
pipeline_failures_total.labels(pipeline_id=pipeline.id).inc()
pipeline_execution_duration_seconds.labels(pipeline_id=pipeline.id).observe(duration)
pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()
```
🔴 pipeline_active_executions gauge never decremented if metrics recording in except block raises
The pipeline_active_executions gauge is incremented at backend/core/executor.py:48 (before the try block), but it is only decremented inside the try success path (line 79) or inside the except block (line 98). If any of the metrics calls preceding the .dec() in the except block (lines 95–97) raise an exception, the decrement on line 98 is never reached and the exception propagates out. This causes the gauge to permanently drift upward, reporting phantom active executions. The .dec() call should be placed in a finally block to guarantee it always executes regardless of what happens during metrics recording.
Prompt for agents
In backend/core/executor.py, refactor the execute() method (lines 47-99) so that the pipeline_active_executions gauge decrement is guaranteed to run. Move `pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()` out of both the try and except blocks and into a finally block. The structure should be:

```python
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
start_time = time.time()
try:
    ...  # success path: record success metrics but WITHOUT the .dec() call
except Exception as e:
    ...  # failure path: record failure metrics but WITHOUT the .dec() call
finally:
    pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()
```

Remove the .dec() calls from both the try block (currently line 79) and the except block (currently line 98).
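The try/except/finally shape described above can be exercised in isolation. This is a minimal sketch, not the project's executor: the `Gauge` class is an illustrative stand-in for `prometheus_client.Gauge`, tracking only a value so the invariant is easy to check.

```python
class Gauge:
    """Illustrative stand-in for prometheus_client.Gauge (value tracking only)."""
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


pipeline_active_executions = Gauge()


def execute(pipeline_fn):
    """Run a pipeline function; the gauge is decremented no matter how it ends."""
    pipeline_active_executions.inc()
    try:
        pipeline_fn()
        # ... record success metrics here, WITHOUT the .dec() call ...
    except Exception:
        # ... record failure metrics here; even if this block itself raises,
        # the finally clause below still runs before the exception propagates ...
        raise
    finally:
        # Guaranteed cleanup: runs on success, failure, and re-raise alike.
        pipeline_active_executions.dec()
```

With this shape the gauge returns to its prior value even when the failure path raises, so no phantom active executions accumulate.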
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/core/executor.py`:
- Around line 47-49: Replace uses of the wall-clock timer time.time() with a
monotonic high-resolution timer time.perf_counter() for all elapsed-time
measurements in backend/core/executor.py: change the start_time assignments
(e.g. the start_time set next to
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()) and any
subsequent elapsed = time.time() - start_time calculations to use
time.perf_counter() instead, and ensure imports/reference use the perf_counter
symbol; apply the same replacement at the other occurrences noted (the other
start_time/elapsed sites around lines 49, 76, 94, 111, 137, 153) so
histogram.observe calls use perf_counter-based durations.
In `@backend/main.py`:
- Around line 75-79: The /metrics endpoint (prometheus_metrics) is currently
exposed anonymously; when settings.AUTH_ENABLED is true, protect it by adding
the same authentication dependency (auth_dependencies) to the app.get decorator
or route registration and hide it from the public OpenAPI (set
include_in_schema=False) so it requires auth and is not in the public schema;
conditionalize the decorator or route registration on settings.AUTH_ENABLED (or
always attach auth_dependencies when enabled) to ensure prometheus_metrics is
only accessible to authenticated/internal callers.
In `@backend/monitoring/metrics.py`:
- Around line 39-44: The metric uses an unbounded label 'stage_name'
(stage_execution_duration_seconds) which will create infinite Prometheus series;
change the metric to use a bounded dimension such as 'stage_type' or
'stage_index' instead and update all call sites that set labels (references in
executor where stage_execution_duration_seconds.labels(...) is called) to
provide the chosen bounded value (e.g., Stage.type or a numeric stage index).
Leave Stage.name unchanged for logs/traces—emit the free-form name to
logging/tracing code but stop passing it into the metric labels. Ensure you
update the metric declaration (remove 'stage_name' label) and all uses in the
executor (the label-setting calls around the stage timing at the locations that
currently pass stage.name) to match the new label keys.
In `@docs/monitoring/grafana_dashboard.json`:
- Around line 92-94: The legend text is incorrect: the PromQL expression using
histogram_quantile(0.95, ...) produces a 95th percentile (p95), not a 95%
confidence interval; update the legendFormat value (currently "{{pipeline_id}}
(95% CI)") to a correct label such as "{{pipeline_id}} (p95)" or
"{{pipeline_id}} (95th percentile)" so the dashboard accurately reflects the
metric returned by histogram_quantile.
- Around line 24-28: The success-rate PromQL expression for the target using the
metric pipeline_executions_total should be changed to handle idle 5m windows:
replace the current expr string with the pattern that performs the division only
when the denominator sum(rate(pipeline_executions_total[5m])) > 0 and otherwise
returns 0 (use the "and on() (... > 0) or on() vector(0)" post-filter approach).
Locate the "targets" -> "expr" entry containing the current
sum(rate(pipeline_executions_total...)) division and update it to the
recommended guarded expression so the panel yields 0% instead of NaN/N/A when no
executions occur.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9ee64cdf-d109-47b9-bae1-2b24a89d032d
⛔ Files ignored due to path filters (4)
- backend/__pycache__/main.cpython-311.pyc is excluded by !**/*.pyc
- backend/api/__pycache__/schemas.cpython-311.pyc is excluded by !**/*.pyc
- backend/core/__pycache__/executor.cpython-311.pyc is excluded by !**/*.pyc
- backend/db/__pycache__/crud.cpython-311.pyc is excluded by !**/*.pyc
📒 Files selected for processing (4)
- backend/core/executor.py
- backend/main.py
- backend/monitoring/metrics.py
- docs/monitoring/grafana_dashboard.json
```python
# Track active pipeline start
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
start_time = time.time()
```
🧩 Analysis chain
🌐 Web query:
In Python, what clock is recommended for measuring elapsed durations: time.time() or time.perf_counter()/time.monotonic()?
💡 Result:
For measuring elapsed durations, prefer a monotonic clock:
- Use time.perf_counter() (or time.perf_counter_ns()) for timing/benchmarking elapsed time, because it's monotonic and uses the highest available resolution clock. [1], [2]
- Use time.monotonic() (or time.monotonic_ns()) when you specifically want a monotonic elapsed-time clock (also immune to system clock changes), typically with slightly less "benchmark-focused" intent than perf_counter(). [1], [3]
- Avoid time.time() for durations because it's "wall clock" time and can jump forwards/backwards if the system time is adjusted (NTP, manual changes), which can break elapsed-time measurements. [1], [3]

Rule of thumb: durations → perf_counter() (most common); timestamps/"seconds since epoch" → time.time(). [1], [3]
Sources:
[1] PEP 418 (monotonic/perf_counter rationale and use) (peps.python.org)
[2] Python time module docs (perf_counter) (docs.python.org)
[3] Python time module docs (clock semantics; monotonic vs realtime) (docs.python.org)
Citations:
- 1: https://peps.python.org/pep-0418/?utm_source=openai
- 2: https://docs.python.org/3/library/time.html?utm_source=openai
- 3: https://docs.python.org/3/library/time.html?utm_source=openai
🏁 Script executed:

```shell
find . -name "executor.py" -path "*/core/*" | head -5
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 131

🏁 Script executed:

```shell
head -160 backend/core/executor.py | cat -n
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 7403

🏁 Script executed:

```shell
tail -n +160 backend/core/executor.py | grep -n "time\." | head -20
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 51
Use a monotonic clock for elapsed-time metrics.
time.time() tracks wall time. NTP or manual clock changes can move it backward or forward and skew these histogram observations. time.perf_counter() is the recommended approach for measuring elapsed durations.
⏱️ Proposed refactor

```diff
- start_time = time.time()
+ start_time = time.perf_counter()
  ...
- duration = time.time() - start_time
+ duration = time.perf_counter() - start_time
  ...
- duration = time.time() - start_time
+ duration = time.perf_counter() - start_time
  ...
- stage_start_time = time.time()
+ stage_start_time = time.perf_counter()
  ...
- stage_duration = time.time() - stage_start_time
+ stage_duration = time.perf_counter() - stage_start_time
  ...
- stage_duration = time.time() - stage_start_time
+ stage_duration = time.perf_counter() - stage_start_time
```

Also applies to: 49, 76, 94, 111, 137, 153
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/core/executor.py` around lines 47 - 49, Replace uses of the
wall-clock timer time.time() with a monotonic high-resolution timer
time.perf_counter() for all elapsed-time measurements in
backend/core/executor.py: change the start_time assignments (e.g. the start_time
set next to pipeline_active_executions.labels(pipeline_id=pipeline.id).inc())
and any subsequent elapsed = time.time() - start_time calculations to use
time.perf_counter() instead, and ensure imports/reference use the perf_counter
symbol; apply the same replacement at the other occurrences noted (the other
start_time/elapsed sites around lines 49, 76, 94, 111, 137, 153) so
histogram.observe calls use perf_counter-based durations.
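The substitution the prompt describes can be seen in isolation. This sketch is not the project's executor, just a minimal timing helper showing the recommended clock:

```python
import time


def timed(fn):
    """Return fn's elapsed duration measured with a monotonic clock.

    time.perf_counter() never jumps backward, so the difference of two
    readings is always a valid non-negative duration; time.time() offers
    no such guarantee when NTP or an operator adjusts the system clock.
    """
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


elapsed = timed(lambda: sum(range(100_000)))
```

Note that perf_counter values are only meaningful as differences within one process; time.time() remains the right choice when an absolute timestamp is needed.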
```python
# Prometheus Metrics scraping endpoint
@app.get("/metrics", tags=["monitoring"])
async def prometheus_metrics():
    """Prometheus metrics endpoint"""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
Protect /metrics; it currently bypasses auth.
When settings.AUTH_ENABLED is true, only the routers on Lines 97-110 get auth_dependencies; this endpoint stays anonymous. That exposes operational data and any user-derived label values to anyone who can reach the service. Gate it behind the same auth policy or an internal-only ingress, and keep it out of the public schema.
🔒 Proposed fix

```diff
+# Define this before the route declarations.
+metrics_dependencies = [Depends(get_current_auth_context)] if settings.AUTH_ENABLED else []
+
-@app.get("/metrics", tags=["monitoring"])
+@app.get("/metrics", tags=["monitoring"], include_in_schema=False, dependencies=metrics_dependencies)
 async def prometheus_metrics():
     """Prometheus metrics endpoint"""
     return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/main.py` around lines 75 - 79, The /metrics endpoint
(prometheus_metrics) is currently exposed anonymously; when
settings.AUTH_ENABLED is true, protect it by adding the same authentication
dependency (auth_dependencies) to the app.get decorator or route registration
and hide it from the public OpenAPI (set include_in_schema=False) so it requires
auth and is not in the public schema; conditionalize the decorator or route
registration on settings.AUTH_ENABLED (or always attach auth_dependencies when
enabled) to ensure prometheus_metrics is only accessible to
authenticated/internal callers.
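One way to express the reviewer's suggestion is to compute the route's keyword arguments up front. This sketch only models the decision logic; `get_current_auth_context` and the auth flag are assumed names, and in real FastAPI code each dependency would be wrapped in `Depends(...)`:

```python
def metrics_route_kwargs(auth_enabled: bool) -> dict:
    """Build keyword arguments for app.get('/metrics', **kwargs).

    When auth is enabled, the route shares the same auth dependency as the
    routers; either way it is hidden from the public OpenAPI schema.
    """
    def get_current_auth_context():
        # Placeholder for the project's real auth dependency (assumed name);
        # in FastAPI it would be attached as Depends(get_current_auth_context).
        raise NotImplementedError

    dependencies = [get_current_auth_context] if auth_enabled else []
    return {"include_in_schema": False, "dependencies": dependencies}
```

Computing the kwargs once keeps the conditional out of the decorator line and makes the policy easy to unit-test.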
```python
stage_execution_duration_seconds = Histogram(
    'stage_execution_duration_seconds',
    'Stage execution duration in seconds',
    ['pipeline_id', 'stage_name', 'status'],
    buckets=(0.1, 0.5, 1.0, 5.0, 15.0, 30.0, 60.0, 120.0, 300.0, float('inf'))
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify that `Stage.name` is unconstrained and flows directly into the metric label.
rg -n -C2 'class Stage|name:\s*str|stage_execution_duration_seconds|stage_name\s*=\s*stage\.name' backend
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 8075
Avoid unbounded stage_name labels.
Stage.name is defined as a free-form str in backend/models/pipeline.py and passed directly into the stage_execution_duration_seconds metric label at backend/core/executor.py lines 140 and 156. Every distinct stage name creates a new time-series in Prometheus, which will eventually exhaust memory and degrade query latency. Use a bounded dimension (e.g., stage type or index) instead, and keep the free-form name in logs or traces.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/monitoring/metrics.py` around lines 39 - 44, The metric uses an
unbounded label 'stage_name' (stage_execution_duration_seconds) which will
create infinite Prometheus series; change the metric to use a bounded dimension
such as 'stage_type' or 'stage_index' instead and update all call sites that set
labels (references in executor where
stage_execution_duration_seconds.labels(...) is called) to provide the chosen
bounded value (e.g., Stage.type or a numeric stage index). Leave Stage.name
unchanged for logs/traces—emit the free-form name to logging/tracing code but
stop passing it into the metric labels. Ensure you update the metric declaration
(remove 'stage_name' label) and all uses in the executor (the label-setting
calls around the stage timing at the locations that currently pass stage.name)
to match the new label keys.
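The cardinality concern can be made concrete with a toy model. The `Histogram` here is an illustrative stand-in for `prometheus_client.Histogram`, and the stage names/types are made-up examples:

```python
from collections import defaultdict


class _Series:
    """One time series: the observations for a single label-value combination."""
    def __init__(self, bucket):
        self._bucket = bucket

    def observe(self, value):
        self._bucket.append(value)


class Histogram:
    """Toy stand-in for prometheus_client.Histogram: one series per label combo."""
    def __init__(self, label_names):
        self.label_names = tuple(label_names)
        self.series = defaultdict(list)  # label-value tuple -> observations

    def labels(self, **label_values):
        key = tuple(label_values[n] for n in self.label_names)
        return _Series(self.series[key])


unbounded = Histogram(['pipeline_id', 'stage_name'])
bounded = Histogram(['pipeline_id', 'stage_type'])

# Two free-form stage names of the same kind (hypothetical examples):
stages = [('deploy-2024-01-01', 'deploy'), ('deploy-2024-01-02', 'deploy')]
for name, stage_type in stages:
    unbounded.labels(pipeline_id='p1', stage_name=name).observe(1.0)   # new series per name
    bounded.labels(pipeline_id='p1', stage_type=stage_type).observe(1.0)  # one shared series
```

Every distinct stage_name value mints a new series, growing without bound as names change over time; stage_type keeps the series count fixed while the free-form name stays available for logs and traces.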
```json
"targets": [
  {
    "expr": "sum(rate(pipeline_executions_total{status=\"success\"}[5m])) / sum(rate(pipeline_executions_total[5m])) * 100",
    "refId": "A"
  }
```
🧩 Analysis chain
🌐 Web query:
In PromQL, how should a success / total * 100 query be written so Grafana shows 0 instead of NaN/N/A when the denominator is zero or the series is absent?
💡 Result:
Use a filter to drop the divide-by-zero series, then fall back to 0 when the result is empty:

```promql
(
  100 * success / total
)
and on(<your_grouping_labels>)
(
  total > 0
)
or on() vector(0)
```

- total > 0 (without bool) removes series where the denominator is 0, so you don't get NaN.
- ... or on() vector(0) makes Grafana get a 0 when the whole expression returns no series (missing/absent data). [1]

If you're aggregating (common case), apply the same idea to the aggregated vectors:

```promql
(
  100 *
  sum by (job) (success)
  /
  sum by (job) (total)
)
and on(job)
(
  sum by (job) (total) > 0
)
or on() vector(0)
```
Sources: [1] https://grafana.com/docs/grafana/latest/alerting/guides/missing-data/ (Prometheus example: your_metric_query OR on() vector(0)), [2] https://stackoverflow.com/questions/47056557/how-to-gracefully-avoid-divide-by-zero-in-prometheus
🏁 Script executed:

```shell
find . -name "grafana_dashboard.json" -type f
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 109

🏁 Script executed:

```shell
# Once we find the file, let's examine it
cat -n docs/monitoring/grafana_dashboard.json | head -100
```

Repository: fuzziecoder/Flexi-Roaster
Length of output: 3701
Handle idle windows in the success-rate query.
When no executions occur in a 5-minute window, both sum(rate(...)) expressions return empty series. The denominator becomes 0 or absent, causing the division to produce NaN/N/A instead of 0%.
The proposed fix using or vector(0) and clamp_min() will work. However, PromQL documentation recommends a cleaner pattern that filters before calculating:
📊 Alternative approach

```promql
(
  100 * sum(rate(pipeline_executions_total{status="success"}[5m]))
  /
  sum(rate(pipeline_executions_total[5m]))
)
and on() (sum(rate(pipeline_executions_total[5m])) > 0)
or on() vector(0)
```
This filters out the division when the denominator is 0, then falls back to 0 when the entire expression returns no series.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/monitoring/grafana_dashboard.json` around lines 24 - 28, The
success-rate PromQL expression for the target using the metric
pipeline_executions_total should be changed to handle idle 5m windows: replace
the current expr string with the pattern that performs the division only when
the denominator sum(rate(pipeline_executions_total[5m])) > 0 and otherwise
returns 0 (use the "and on() (... > 0) or on() vector(0)" post-filter approach).
Locate the "targets" -> "expr" entry containing the current
sum(rate(pipeline_executions_total...)) division and update it to the
recommended guarded expression so the panel yields 0% instead of NaN/N/A when no
executions occur.
```json
"expr": "histogram_quantile(0.95, sum(rate(pipeline_execution_duration_seconds_bucket[5m])) by (le, pipeline_id))",
"legendFormat": "{{pipeline_id}} (95% CI)",
"refId": "A"
```
Rename the legend; this is a percentile, not a confidence interval.
histogram_quantile(0.95, ...) returns a p95 estimate. Labeling it 95% CI is statistically incorrect and will mislead dashboard readers.
📝 Proposed fix

```diff
- "legendFormat": "{{pipeline_id}} (95% CI)",
+ "legendFormat": "{{pipeline_id}} p95",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/monitoring/grafana_dashboard.json` around lines 92 - 94, The legend text
is incorrect: the PromQL expression using histogram_quantile(0.95, ...) produces
a 95th percentile (p95), not a 95% confidence interval; update the legendFormat
value (currently "{{pipeline_id}} (95% CI)") to a correct label such as
"{{pipeline_id}} (p95)" or "{{pipeline_id}} (95th percentile)" so the dashboard
accurately reflects the metric returned by histogram_quantile.
What:
Implemented comprehensive Prometheus-style metrics collection for pipeline and stage performance monitoring, and exported a pre-configured Grafana Dashboard JSON for visualization.
Why:
Resolves the issue requesting real-time visibility into pipeline throughput, success/failure rates, active workloads, and execution duration trends.
How:
Added prometheus_client Counters (total runs, failures), Histograms (duration distributions), and Gauges (active threads). Exposed GET /metrics in backend/main.py.
Testing:
Verified the GET /metrics text payload formatting in the browser.
Addressing Issue #30