
feat(monitoring): add prometheus metrics collection and grafana dashboard#100

Open
KalpanaBhaskar wants to merge 1 commit into fuzziecoder:main from KalpanaBhaskar:feature/prometheus-metrics

Conversation

Contributor

@KalpanaBhaskar KalpanaBhaskar commented Mar 7, 2026

What:

Implemented comprehensive Prometheus-style metrics collection for pipeline and stage performance monitoring, and exported a pre-configured Grafana Dashboard JSON for visualization.

Why:

Resolves the issue requesting real-time visibility into pipeline throughput, success/failure rates, active workloads, and execution duration trends.

How:

  • Created the core metrics registry in [backend/monitoring/metrics.py], defining prometheus_client Counters (total runs, failures), Histograms (duration distributions), and Gauges (active executions).
  • Instrumented the core backend execution engine in [backend/core/executor.py] to record metrics on pipeline/stage start, success, and failure.
  • Added the stateless GET /metrics endpoint in [backend/main.py].
  • Generated [docs/monitoring/grafana_dashboard.json], pre-configured with 4 live panels tracking Success Rates, Durations, Active Pipelines, and Global Throughput.
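For reference, the registry described above might look like the following sketch. The metric names come from this PR's walkthrough; the label sets and histogram buckets are illustrative assumptions, not the exact committed code.

```python
# Hypothetical sketch of backend/monitoring/metrics.py.
# Only the metric names are taken from this PR; labels and buckets are assumptions.
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()

pipeline_executions_total = Counter(
    'pipeline_executions_total',
    'Total pipeline executions',
    ['pipeline_id', 'status'],
    registry=registry,
)

pipeline_failures_total = Counter(
    'pipeline_failures_total',
    'Total pipeline failures',
    ['pipeline_id'],
    registry=registry,
)

pipeline_execution_duration_seconds = Histogram(
    'pipeline_execution_duration_seconds',
    'Pipeline execution duration in seconds',
    ['pipeline_id'],
    buckets=(0.5, 1.0, 5.0, 15.0, 60.0, 300.0, float('inf')),
    registry=registry,
)

pipeline_active_executions = Gauge(
    'pipeline_active_executions',
    'Currently running pipeline executions',
    ['pipeline_id'],
    registry=registry,
)
```

The GET /metrics endpoint can then serve `generate_latest(registry)` in the Prometheus text exposition format.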

Testing:

  • Verified GET /metrics text payload formatting in the browser.
  • Pushed test workloads via Swagger to confirm the counters and gauges increment and decrement accurately without crashing the engine.

Addressing Issue #30



Summary by CodeRabbit

Release Notes

  • New Features
    • Added comprehensive pipeline execution metrics tracking, including success/failure rates, execution duration, active pipeline counts, and throughput
    • Introduced a monitoring dashboard for real-time visibility into pipeline performance and health
    • Metrics now exposed via a dedicated endpoint for integration with external monitoring systems

@vercel

vercel bot commented Mar 7, 2026

@KalpanaBhaskar is attempting to deploy a commit to the Revon Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai bot commented Mar 7, 2026

📝 Walkthrough


This PR introduces Prometheus monitoring to the pipeline execution system. It adds metric definitions for execution tracking, instruments the executor with timing measurements, exposes metrics via a new /metrics endpoint, and provides a Grafana dashboard configuration for visualization.

Changes

  • Metrics Definitions (backend/monitoring/metrics.py): Defines five Prometheus metrics: pipeline_executions_total (counter), pipeline_failures_total (counter), pipeline_execution_duration_seconds (histogram), stage_execution_duration_seconds (histogram), and pipeline_active_executions (gauge), with appropriate labels and buckets.
  • Executor Instrumentation (backend/core/executor.py): Records timing and emits metrics on pipeline start, completion, and failure; tracks stage execution duration and status; increments counters, observes histograms, and manages the gauge for active executions.
  • Metrics Exposure (backend/main.py): Adds a /metrics endpoint that returns Prometheus-formatted metrics with appropriate HTTP response headers.
  • Dashboard Configuration (docs/monitoring/grafana_dashboard.json): Defines a Grafana dashboard with four panels: pipeline success rate, active executions, execution duration (95th percentile), and global throughput, all querying the Prometheus data source.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Executor
    participant Metrics
    participant Prometheus
    participant Grafana

    Client->>Executor: Start Pipeline Execution
    Executor->>Metrics: Increment pipeline_active_executions
    Executor->>Metrics: Record pipeline start_time
    
    rect rgba(100, 150, 200, 0.5)
        Note over Executor,Metrics: Pipeline Processing
        Executor->>Executor: Execute Stages
        Executor->>Metrics: Record stage start_time
        Executor->>Metrics: Observe stage_execution_duration_seconds
    end
    
    alt Pipeline Success
        Executor->>Metrics: Increment pipeline_executions_total (status=success)
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    else Pipeline Failure
        Executor->>Metrics: Increment pipeline_executions_total (status=failed)
        Executor->>Metrics: Increment pipeline_failures_total
        Executor->>Metrics: Observe pipeline_execution_duration_seconds
        Executor->>Metrics: Decrement pipeline_active_executions
    end
    
    Client->>Prometheus: GET /metrics
    Prometheus->>Metrics: Collect Metrics
    Prometheus-->>Client: Return Prometheus Format

    Grafana->>Prometheus: Query pipeline_executions_total, pipeline_active_executions, etc.
    Prometheus-->>Grafana: Metric Data
    Grafana->>Grafana: Render Dashboard Panels

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

apertre3.0, hard

Poem

🐰 Hop hop, metrics flow,
Prometheus glows bright and true,
Dashboards now show,
Pipelines dance through,
Execution tales in numbers anew! 📊

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately and concisely describes the main changes: adding Prometheus metrics collection and a Grafana dashboard for monitoring pipeline executions.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.




@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.


Comment on lines +92 to +98

# Record failure metrics
duration = time.time() - start_time
pipeline_executions_total.labels(pipeline_id=pipeline.id, status='failed').inc()
pipeline_failures_total.labels(pipeline_id=pipeline.id).inc()
pipeline_execution_duration_seconds.labels(pipeline_id=pipeline.id).observe(duration)
pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()


🔴 pipeline_active_executions gauge never decremented if metrics recording in except block raises

The pipeline_active_executions gauge is incremented at backend/core/executor.py:48 (before the try block), but it is only decremented inside the try success path (line 79) or inside the except block (line 98). If any of the metrics calls preceding the .dec() in the except block (lines 95–97) raise an exception, the decrement on line 98 is never reached and the exception propagates out. This causes the gauge to permanently drift upward, reporting phantom active executions. The .dec() call should be placed in a finally block to guarantee it always executes regardless of what happens during metrics recording.

Prompt for agents
In backend/core/executor.py, refactor the execute() method (lines 47-99) so that the pipeline_active_executions gauge decrement is guaranteed to run. Move `pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()` out of both the try and except blocks and into a finally block. The structure should be:

  pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
  start_time = time.time()
  try:
      ... (success path, record success metrics but WITHOUT the .dec() call)
  except Exception as e:
      ... (failure path, record failure metrics but WITHOUT the .dec() call)
  finally:
      pipeline_active_executions.labels(pipeline_id=pipeline.id).dec()

Remove the .dec() calls from both the try block (currently line 79) and the except block (currently line 98).
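The reviewer's point can be demonstrated with a minimal stand-in gauge (plain Python, no prometheus_client; all names here are illustrative): even when recording the failure metrics itself raises, the finally block still restores the gauge.

```python
class FakeGauge:
    """Minimal stand-in for a Prometheus gauge."""
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
    def dec(self):
        self.value -= 1

active = FakeGauge()

def execute(work, record_failure_metrics):
    active.inc()
    try:
        return work()
    except Exception:
        # Even if this metrics call raises, 'finally' still runs.
        record_failure_metrics()
        raise
    finally:
        active.dec()  # guaranteed: the gauge cannot drift upward

def broken_work():
    raise RuntimeError("pipeline failed")

def broken_metrics():
    raise ValueError("metrics backend down")

try:
    execute(broken_work, broken_metrics)
except ValueError:
    pass  # the metrics exception propagates, but...

print(active.value)  # → 0: the gauge was still decremented
```

With the .dec() in the except body instead of finally, the same scenario would leave the gauge stuck at 1, reporting a phantom active execution.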



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/core/executor.py`:
- Around line 47-49: Replace uses of the wall-clock timer time.time() with a
monotonic high-resolution timer time.perf_counter() for all elapsed-time
measurements in backend/core/executor.py: change the start_time assignments
(e.g. the start_time set next to
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()) and any
subsequent elapsed = time.time() - start_time calculations to use
time.perf_counter() instead, and ensure imports/reference use the perf_counter
symbol; apply the same replacement at the other occurrences noted (the other
start_time/elapsed sites around lines 49, 76, 94, 111, 137, 153) so
histogram.observe calls use perf_counter-based durations.

In `@backend/main.py`:
- Around line 75-79: The /metrics endpoint (prometheus_metrics) is currently
exposed anonymously; when settings.AUTH_ENABLED is true, protect it by adding
the same authentication dependency (auth_dependencies) to the app.get decorator
or route registration and hide it from the public OpenAPI (set
include_in_schema=False) so it requires auth and is not in the public schema;
conditionalize the decorator or route registration on settings.AUTH_ENABLED (or
always attach auth_dependencies when enabled) to ensure prometheus_metrics is
only accessible to authenticated/internal callers.

In `@backend/monitoring/metrics.py`:
- Around line 39-44: The metric uses an unbounded label 'stage_name'
(stage_execution_duration_seconds) which will create infinite Prometheus series;
change the metric to use a bounded dimension such as 'stage_type' or
'stage_index' instead and update all call sites that set labels (references in
executor where stage_execution_duration_seconds.labels(...) is called) to
provide the chosen bounded value (e.g., Stage.type or a numeric stage index).
Leave Stage.name unchanged for logs/traces—emit the free-form name to
logging/tracing code but stop passing it into the metric labels. Ensure you
update the metric declaration (remove 'stage_name' label) and all uses in the
executor (the label-setting calls around the stage timing at the locations that
currently pass stage.name) to match the new label keys.

In `@docs/monitoring/grafana_dashboard.json`:
- Around line 92-94: The legend text is incorrect: the PromQL expression using
histogram_quantile(0.95, ...) produces a 95th percentile (p95), not a 95%
confidence interval; update the legendFormat value (currently "{{pipeline_id}}
(95% CI)") to a correct label such as "{{pipeline_id}} (p95)" or
"{{pipeline_id}} (95th percentile)" so the dashboard accurately reflects the
metric returned by histogram_quantile.
- Around line 24-28: The success-rate PromQL expression for the target using the
metric pipeline_executions_total should be changed to handle idle 5m windows:
replace the current expr string with the pattern that performs the division only
when the denominator sum(rate(pipeline_executions_total[5m])) > 0 and otherwise
returns 0 (use the "and on() (... > 0) or on() vector(0)" post-filter approach).
Locate the "targets" -> "expr" entry containing the current
sum(rate(pipeline_executions_total...)) division and update it to the
recommended guarded expression so the panel yields 0% instead of NaN/N/A when no
executions occur.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ee64cdf-d109-47b9-bae1-2b24a89d032d

📥 Commits

Reviewing files that changed from the base of the PR and between fad909a and e7a42c9.

⛔ Files ignored due to path filters (4)
  • backend/__pycache__/main.cpython-311.pyc is excluded by !**/*.pyc
  • backend/api/__pycache__/schemas.cpython-311.pyc is excluded by !**/*.pyc
  • backend/core/__pycache__/executor.cpython-311.pyc is excluded by !**/*.pyc
  • backend/db/__pycache__/crud.cpython-311.pyc is excluded by !**/*.pyc
📒 Files selected for processing (4)
  • backend/core/executor.py
  • backend/main.py
  • backend/monitoring/metrics.py
  • docs/monitoring/grafana_dashboard.json

Comment on lines +47 to +49
# Track active pipeline start
pipeline_active_executions.labels(pipeline_id=pipeline.id).inc()
start_time = time.time()


⚠️ Potential issue | 🟡 Minor



Use a monotonic clock for elapsed-time metrics.

time.time() tracks wall time. NTP or manual clock changes can move it backward or forward and skew these histogram observations. time.perf_counter() is the recommended approach for measuring elapsed durations.

⏱️ Proposed refactor
-        start_time = time.time()
+        start_time = time.perf_counter()
...
-            duration = time.time() - start_time
+            duration = time.perf_counter() - start_time
...
-            duration = time.time() - start_time
+            duration = time.perf_counter() - start_time
...
-        stage_start_time = time.time()
+        stage_start_time = time.perf_counter()
...
-            stage_duration = time.time() - stage_start_time
+            stage_duration = time.perf_counter() - stage_start_time
...
-            stage_duration = time.time() - stage_start_time
+            stage_duration = time.perf_counter() - stage_start_time

Also applies to: 49, 76, 94, 111, 137, 153
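A minimal illustration of the recommended pattern (stdlib only): time.perf_counter() is monotonic and high-resolution, so it stays safe for durations even if the wall clock is adjusted mid-measurement.

```python
import time

start = time.perf_counter()
time.sleep(0.01)  # stand-in for real pipeline work
duration = time.perf_counter() - start

# perf_counter is monotonic: it never goes backwards, so the observed
# duration cannot be negative. time.time() cannot guarantee this when
# NTP or a manual adjustment moves the wall clock during the measurement.
assert duration > 0
print(f"observed duration: {duration:.4f}s")
```

The same substitution applies unchanged at every start_time/elapsed site listed above.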


Comment on lines +75 to +79
# Prometheus Metrics scraping endpoint
@app.get("/metrics", tags=["monitoring"])
async def prometheus_metrics():
"""Prometheus metrics endpoint"""
return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)


⚠️ Potential issue | 🟠 Major

Protect /metrics; it currently bypasses auth.

When settings.AUTH_ENABLED is true, only the routers on Lines 97-110 get auth_dependencies; this endpoint stays anonymous. That exposes operational data and any user-derived label values to anyone who can reach the service. Gate it behind the same auth policy or an internal-only ingress, and keep it out of the public schema.

🔒 Proposed fix
+# Define this before the route declarations.
+metrics_dependencies = [Depends(get_current_auth_context)] if settings.AUTH_ENABLED else []
+
-@app.get("/metrics", tags=["monitoring"])
+@app.get("/metrics", tags=["monitoring"], include_in_schema=False, dependencies=metrics_dependencies)
 async def prometheus_metrics():
     """Prometheus metrics endpoint"""
     return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Comment on lines +39 to +44
stage_execution_duration_seconds = Histogram(
'stage_execution_duration_seconds',
'Stage execution duration in seconds',
['pipeline_id', 'stage_name', 'status'],
buckets=(0.1, 0.5, 1.0, 5.0, 15.0, 30.0, 60.0, 120.0, 300.0, float('inf'))
)


⚠️ Potential issue | 🟠 Major



Avoid unbounded stage_name labels.

Stage.name is defined as a free-form str in backend/models/pipeline.py and passed directly into the stage_execution_duration_seconds metric label at backend/core/executor.py lines 140 and 156. Every distinct stage name creates a new time-series in Prometheus, which will eventually exhaust memory and degrade query latency. Use a bounded dimension (e.g., stage type or index) instead, and keep the free-form name in logs or traces.
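The cardinality concern can be illustrated with a small simulation (plain Python; in Prometheus, each distinct label combination is its own time-series):

```python
# Each distinct label tuple becomes one time-series in Prometheus.
unbounded_series = set()
bounded_series = set()

# Suppose users define pipelines with arbitrary, free-form stage names.
for run in range(500):
    stage_name = f"user-stage-{run}"     # unbounded: grows without limit
    stage_index = run % 5                # bounded: at most 5 values
    for status in ("success", "failed"):
        unbounded_series.add(("p1", stage_name, status))
        bounded_series.add(("p1", stage_index, status))

print(len(unbounded_series))  # → 1000: one pair of series per unique name
print(len(bounded_series))    # → 10: capped by the bounded dimension
```

Since the Prometheus server keeps every series it has ever scraped in memory for the retention window, the unbounded variant grows without limit while the bounded one stays fixed regardless of how many pipelines run.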


Comment on lines +24 to +28
"targets": [
{
"expr": "sum(rate(pipeline_executions_total{status=\"success\"}[5m])) / sum(rate(pipeline_executions_total[5m])) * 100",
"refId": "A"
}


⚠️ Potential issue | 🟡 Minor



Handle idle windows in the success-rate query.

When no executions occur in a 5-minute window, both sum(rate(...)) expressions return empty series. The denominator becomes 0 or absent, causing the division to produce NaN/N/A instead of 0%.

The proposed fix using or vector(0) and clamp_min() will work. However, PromQL documentation recommends a cleaner pattern that filters before calculating:

📊 Alternative approach
(
  100 * sum(rate(pipeline_executions_total{status="success"}[5m]))
  /
  sum(rate(pipeline_executions_total[5m]))
)
and on() (sum(rate(pipeline_executions_total[5m])) > 0)
or on() vector(0)

This filters out the division when the denominator is 0, then falls back to 0 when the entire expression returns no series.


Comment on lines +92 to +94
"expr": "histogram_quantile(0.95, sum(rate(pipeline_execution_duration_seconds_bucket[5m])) by (le, pipeline_id))",
"legendFormat": "{{pipeline_id}} (95% CI)",
"refId": "A"


⚠️ Potential issue | 🟡 Minor

Rename the legend; this is a percentile, not a confidence interval.

histogram_quantile(0.95, ...) returns a p95 estimate. Labeling it 95% CI is statistically incorrect and will mislead dashboard readers.

📝 Proposed fix
-                    "legendFormat": "{{pipeline_id}} (95% CI)",
+                    "legendFormat": "{{pipeline_id}} p95",
