Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
1 change: 1 addition & 0 deletions backend/lined/docs/README.md
@@ -17,6 +17,7 @@ changes.
| HPA Resource Scenarios | Local kind variants for backend resource requests/limits, fixed replicas, and CPU-based HPA behavior. | `hpa-resource-scenarios.md` | Use before comparing deployment/runtime trade-offs under k6 workload traffic. | Resource pressure, fixed replica comparison, HPA prerequisites, scenario cleanup. |
| Runtime Scenario Summaries | Scenario-runner seam for producing sanitized runtime-summary artifacts from one scenario and workload. | `runtime-scenario-summaries.md` | Use before generating collector-ready runtime evidence for local scenario comparisons. | Scenario runner CLI, k6 summary export, Kubernetes state summaries, provenance manifest. |
| Runtime Fitness Extension | Runtime-aware fitness metric contract and optional collector input shape. | `runtime-fitness-extension.md` | Use before adding runtime-aware scoring or attaching runtime summaries to metrics documents. | Structural/runtime score separation, runtime metric schema, normalization, compatibility. |
| Runtime-Aware Scoring | Versioned scalar runtime score that compares current runtime summaries against baseline evidence. | `runtime-aware-scoring.md` | Use before running or interpreting runtime-aware fitness scoring. | Local baseline input, score fields, SLO classification, missing metrics, metrics-store fallback. |
| Prometheus Telemetry Pipeline | Local Prometheus deployment and scrape workflow for kind backend metrics. | `prometheus-telemetry-pipeline.md` | Use before collecting persistent-enough Prometheus samples from local scenario runs. | Prometheus pod discovery, Actuator scrape verification, PromQL checks, telemetry cleanup. |
| SLO Constraint Thresholds | Initial runtime SLO and constraint thresholds for classifying local experiment variants. | `slo-constraint-thresholds.md` | Use before interpreting runtime-summary evidence or adding runtime-aware scoring. | Latency, error-rate, availability, restart, readiness, and resource-pressure constraints. |
| LLM Support Service | Plan for a separate advisory service that proposes candidate fitness rules and trade-off explanations. | `llm-support-service.md` | Use before designing or implementing LLM-assisted rule synthesis for the experiment. | Service boundary, serverless/manual triggers, input/output contracts, review workflow. |
2 changes: 1 addition & 1 deletion backend/lined/docs/experiment-tasks.md
@@ -18,7 +18,7 @@ scientific experiment work.
| `experiment/scenario-fixture-discipline` | Task | Runtime evidence | Yes | Scenario fixture discipline | Define explicit workload/context profiles and repeatable input setup for Lined experiment scenario runs. | Deployment/runtime comparisons use stable fixtures instead of manual setup. |
| `experiment/slo-constraint-thresholds` | Task | Runtime evidence | Yes | SLO and constraint thresholds | Define initial latency, error-rate, availability, restart, readiness, and resource-efficiency thresholds for classifying valid experiment variants. | Runtime evidence can be evaluated against explicit constraints instead of ad hoc interpretation. |
| `experiment/fitness-runtime-extension` | Task | Runtime scoring | Yes | Runtime fitness extension | Extend experiment documentation and/or collector design to include telemetry metrics. | Fixed CI fitness can be compared with runtime-aware adaptive fitness. |
| `experiment/runtime-aware-scoring` | Task | Runtime scoring | No | Runtime-aware scoring | Add a versioned runtime fitness score that uses summarized runtime metrics while preserving the existing structural `fitnessScore`. | Runtime-aware scalar fitness can be computed without changing historical CI fitness semantics. |
| `experiment/runtime-aware-scoring` | Task | Runtime scoring | Yes | Runtime-aware scoring | Add a versioned runtime fitness score that uses summarized runtime metrics while preserving the existing structural `fitnessScore`. | Runtime-aware scalar fitness can be computed without changing historical CI fitness semantics. |
| `experiment/adaptive-weighted-fitness` | Task | Runtime scoring | No | Adaptive weighted fitness | Implement context-sensitive weighting over structural and runtime signals for workload, SLO, or resource-pressure contexts. | Fixed structural fitness can be compared with an adaptive scalar fitness baseline. |
| `experiment/pareto-optimization-baseline` | Task | Runtime scoring | No | Pareto optimization baseline | Add a small NSGA-II or equivalent Pareto-based optimizer over the current deployment scenario set and runtime objectives. | Deployment variants can be compared as multi-objective trade-offs rather than a single weighted score. |
| `experiment/decision-usefulness-reporting` | Task | Runtime scoring | No | Decision-usefulness reporting | Extend experiment reporting to compare whether Pareto/NSGA-II gives more actionable trade-off alternatives than fixed-weight scalar scoring. | Results explain decision usefulness in addition to numeric objective values. |
173 changes: 173 additions & 0 deletions backend/lined/docs/runtime-aware-scoring.md
@@ -0,0 +1,173 @@
# Runtime-Aware Scoring

This guide describes the runtime-aware scoring contract for
`experiment/runtime-aware-scoring`.

Runtime-aware scoring is additive. It keeps the existing top-level
`fitnessScore` as the structural CI score and adds a separate versioned score
from summarized runtime evidence.

## Scope

This task provides:

- a versioned scalar runtime score named `runtimeFitnessScore`;
- local scoring from explicit current and baseline runtime summary files;
- optional persisted baseline lookup through the collector metrics-store seam;
- SLO constraint classification from `slo-thresholds-v1.json`;
- optional local output when Cosmos DB or another metrics database is not
configured.

This task does not add adaptive weighting, Pareto optimization, new backend
API behavior, production SLOs, dashboarding, or live telemetry scraping inside
the collector.

## Collector Inputs

The collector accepts current runtime evidence through the existing input:

```text
RUNTIME_METRICS_JSON=/absolute/path/to/runtime-summary.json
```

For local/offline scoring, pass an explicit baseline summary:

```text
RUNTIME_BASELINE_METRICS_JSON=/absolute/path/to/baseline-runtime-summary.json
```

When a metrics store is configured and no explicit baseline file is provided,
the collector can look for the latest persisted `main` runtime summary matching
the configured baseline scenario and current workload:

```text
RUNTIME_BASELINE_SCENARIO=fixed-medium
```
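The baseline selection order described above can be sketched as follows. This is a minimal illustration, not the collector's actual code: `resolveBaselineSource`, the `BaselineSource` type, and the `storeConfigured` flag are hypothetical names, and the sketch assumes that `RUNTIME_BASELINE_SCENARIO` must be set explicitly for a store lookup to happen.

```typescript
// Hypothetical sketch of baseline selection; all names are illustrative.
type BaselineSource =
  | { kind: "file"; path: string }
  | { kind: "store"; branch: "main"; scenario: string; workload: string }
  | { kind: "none" };

function resolveBaselineSource(
  env: Record<string, string | undefined>,
  storeConfigured: boolean,
  currentWorkload: string,
): BaselineSource {
  // An explicit baseline file takes precedence over any persisted summary.
  const explicitPath = env["RUNTIME_BASELINE_METRICS_JSON"];
  if (explicitPath) {
    return { kind: "file", path: explicitPath };
  }
  // Otherwise fall back to the latest persisted `main` summary that matches
  // the configured baseline scenario and the current workload.
  const scenario = env["RUNTIME_BASELINE_SCENARIO"];
  if (storeConfigured && scenario) {
    return { kind: "store", branch: "main", scenario, workload: currentWorkload };
  }
  // Assumption: with neither input there is no baseline to compare against.
  return { kind: "none" };
}
```

Returning a tagged union keeps the caller's handling of the "no baseline" case explicit instead of relying on a missing file path.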

The default threshold artifact is:

```text
SLO_THRESHOLDS_JSON=../backend/lined/load-tests/runtime-scenarios/slo-thresholds-v1.json
```

When no database is configured, write the final document locally:

```text
METRICS_OUTPUT_JSON=/absolute/path/to/metrics-document.json
```

For a runtime-only local smoke check without structural CI reports or
SonarCloud access, use:

```text
RUNTIME_ONLY=true
```

The default collector path still reads Checkstyle, SpotBugs, JaCoCo, and
SonarCloud evidence. `RUNTIME_ONLY=true` is only for local runtime scoring
experiments where the structural CI score is not being recomputed.

## Output Contract

The stored or local metrics document preserves the structural score:

```json
{
  "fitnessScore": 0.1234,
  "runtimeFitnessScore": 0.2185,
  "runtimeFitnessScoreVersion": "runtime-aware-v1",
  "runtimeFitness": {
    "current": {
      "scenario": "replicas-2",
      "workload": "baseline",
      "source": "local-kind"
    },
    "baseline": {
      "scenario": "fixed-medium",
      "workload": "baseline",
      "source": "local-kind"
    },
    "eligibleForStableComparison": false
  }
}
```

`fitnessScore` remains the CI-only structural score. Runtime evidence is
attached under `metrics.runtime_metrics`; runtime score metadata is attached
under `runtimeFitness`.

When `RUNTIME_ONLY=true`, the output document may contain
`fitnessScore: null` because no structural CI evidence was collected. That
does not redefine the field; it records that the runtime-only smoke path did
not compute the structural score.
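For consumers of the metrics document, the score-related fields could be typed roughly as below. This is a sketch under assumptions: the interface names and the `hasStructuralScore` helper are hypothetical, and treating the runtime fields as optional is a guess about runs that have no comparable runtime evidence, not a documented rule.

```typescript
// Hypothetical typing of the score-related fields in the metrics document.
interface RuntimeRunRef {
  scenario: string;
  workload: string;
  source: string;
}

interface ScoredMetricsDocument {
  // Structural CI score; null on the RUNTIME_ONLY=true smoke path.
  fitnessScore: number | null;
  // Assumption: runtime fields are absent when nothing could be scored.
  runtimeFitnessScore?: number;
  runtimeFitnessScoreVersion?: string;
  runtimeFitness?: {
    current: RuntimeRunRef;
    baseline: RuntimeRunRef;
    eligibleForStableComparison: boolean;
  };
}

// Type guard: true only when the structural score was actually computed.
function hasStructuralScore(
  doc: ScoredMetricsDocument,
): doc is ScoredMetricsDocument & { fitnessScore: number } {
  return typeof doc.fitnessScore === "number";
}
```

A guard like this lets reporting code skip runtime-only documents when it aggregates structural scores, rather than treating `null` as `0`.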

## Runtime-Aware v1 Formula

The score compares current runtime summary metrics against a baseline runtime
summary. Each metric is normalized to `[-1, 1]` before weighting.

Lower-is-better metrics use:

```text
(baseline - current) / baseline
```

Higher-is-better metrics use:

```text
(current - baseline) / baseline
```

If baseline and current are both zero, the normalized delta is `0`. If baseline
is zero and current is non-zero, the ratio is undefined, so the delta is set to
`1` when the movement is beneficial and `-1` when it is harmful. Missing metrics
are omitted from the score and the remaining active weights are re-normalized.
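The per-metric normalization above can be sketched as a single function. The name `normalizedDelta` and the `Direction` type are hypothetical. Two points are assumptions rather than documented behavior: metric values are taken to be non-negative, and the ratio is clamped so that the result stays inside `[-1, 1]` even when the current value is more than double the baseline.

```typescript
// Hypothetical sketch of the v1 per-metric normalization.
type Direction = "lower" | "higher";

function normalizedDelta(
  baseline: number,
  current: number,
  direction: Direction,
): number {
  if (baseline === 0) {
    // Both zero: no movement at all.
    if (current === 0) return 0;
    // Zero baseline: the ratio is undefined, so only the sign is kept.
    // Assumes non-negative metrics, so a non-zero current is an increase.
    const increased = current > 0;
    const beneficial = direction === "higher" ? increased : !increased;
    return beneficial ? 1 : -1;
  }
  const raw =
    direction === "lower"
      ? (baseline - current) / baseline
      : (current - baseline) / baseline;
  // Assumption: clamp so extreme regressions cannot exceed the stated range.
  return Math.max(-1, Math.min(1, raw));
}
```

For example, a p95 latency that drops from 200 ms to 150 ms gives `(200 - 150) / 200 = 0.25`, while an error rate that rises from `0` to any positive value gives `-1`.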

| Metric | Direction | Weight |
|--------|-----------|--------|
| `latency_p95_ms` | lower is better | `0.20` |
| `latency_p99_ms` | lower is better | `0.15` |
| `error_rate` | lower is better | `0.20` |
| `throughput_rps` | higher is better | `0.15` |
| `availability` | higher is better | `0.15` |
| `restart_count` | lower is better | `0.10` |
| `cpu_utilization` | lower is better | `0.025` |
| `memory_utilization` | lower is better | `0.025` |

`hpa_current_replicas` and `hpa_desired_replicas` remain contextual evidence
and are not scored directly in v1.
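The weighting step, including re-normalization over the metrics that are actually present, might look like the sketch below. The weights are copied from the table above. The names `V1_WEIGHTS` and `weightedRuntimeScore` are hypothetical, and returning `null` when no scored metric is available is an assumption.

```typescript
// Weights copied from the v1 table; identifiers are illustrative only.
const V1_WEIGHTS: Record<string, number> = {
  latency_p95_ms: 0.2,
  latency_p99_ms: 0.15,
  error_rate: 0.2,
  throughput_rps: 0.15,
  availability: 0.15,
  restart_count: 0.1,
  cpu_utilization: 0.025,
  memory_utilization: 0.025,
};

// `deltas` holds already-normalized values in [-1, 1]; a missing metric is
// represented by an absent key or an undefined value.
function weightedRuntimeScore(
  deltas: Record<string, number | undefined>,
): number | null {
  let weightedSum = 0;
  let activeWeight = 0;
  for (const [metric, weight] of Object.entries(V1_WEIGHTS)) {
    const delta = deltas[metric];
    if (delta === undefined) continue; // omitted from the score
    weightedSum += weight * delta;
    activeWeight += weight;
  }
  // Dividing by the active weight re-normalizes the remaining weights.
  // Assumption: no scored metrics at all means no score is emitted.
  return activeWeight > 0 ? weightedSum / activeWeight : null;
}
```

If only `latency_p95_ms` and `error_rate` are present, their weights of `0.20` each re-normalize to `0.5` each, so the score is the plain average of the two deltas. Metrics outside the table, such as `hpa_current_replicas`, are ignored by the loop.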

## SLO Classification

The collector classifies current runtime evidence against
`slo-thresholds-v1.json` and records per-constraint results:

- `valid` when evidence exists and satisfies the constraint;
- `warning` when evidence exists and crosses a warning threshold;
- `invalid` when evidence exists and violates a hard constraint;
- `unknown` when required evidence is missing.

`runtimeFitness.eligibleForStableComparison` is `false` when any hard
constraint is `invalid` or `unknown`. The numeric runtime score may still be
emitted when comparable current and baseline metrics exist, but eligibility
keeps incomplete or unstable runs out of article-ready comparisons.

Readiness remains external evidence. It is classified as `unknown` unless a
future runtime summary contract adds a summarized readiness source.
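
The classification and eligibility rules in this section can be sketched as follows. The `Threshold` shape is hypothetical and does not reflect the actual layout of `slo-thresholds-v1.json`; the sketch only assumes that each constraint has a hard limit, an optional warning limit, and a direction.

```typescript
// Hypothetical sketch; the real threshold artifact may be shaped differently.
type SloStatus = "valid" | "warning" | "invalid" | "unknown";

interface Threshold {
  bound: "max" | "min"; // "max": value must not exceed; "min": must not fall below
  hard: number;
  warning?: number;
}

function classify(value: number | undefined, t: Threshold): SloStatus {
  // Missing evidence is never treated as passing.
  if (value === undefined) return "unknown";
  const v: number = value;
  const crosses = (limit: number): boolean =>
    t.bound === "max" ? v > limit : v < limit;
  if (crosses(t.hard)) return "invalid";
  if (t.warning !== undefined && crosses(t.warning)) return "warning";
  return "valid";
}

// A run is eligible only if no hard constraint is invalid or unknown.
function eligibleForStableComparison(hardResults: SloStatus[]): boolean {
  return hardResults.every((s) => s !== "invalid" && s !== "unknown");
}
```

Under this sketch a `warning` result does not block eligibility, which matches the rule that only `invalid` or `unknown` hard constraints set `eligibleForStableComparison` to `false`.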

## Local Example

```bash
cd fitness-metrics-collector
npm run build
RUNTIME_ONLY=true \
RUNTIME_METRICS_JSON=/absolute/path/current/runtime-summary.json \
RUNTIME_BASELINE_METRICS_JSON=/absolute/path/baseline/runtime-summary.json \
METRICS_OUTPUT_JSON=/absolute/path/output/metrics-document.json \
npm run metrics
```

If `COSMOS_DB_CONNECTION_STRING` is absent, the collector writes the local
output document when `METRICS_OUTPUT_JSON` is set and skips database
persistence. Omit `RUNTIME_ONLY=true` when structural reports and `SONAR_TOKEN`
are available and the run should also compute the structural `fitnessScore`.
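
The output decisions described above can be summarized in a small planning function. This is an illustrative sketch only: `planOutputs` and `OutputPlan` are hypothetical names, and writing the local file whenever `METRICS_OUTPUT_JSON` is set, even with a database configured, is an assumption that the text does not confirm.

```typescript
// Hypothetical sketch of how the environment maps to collector outputs.
interface OutputPlan {
  persistToDatabase: boolean;
  localOutputPath: string | null;
  computeStructuralScore: boolean;
}

function planOutputs(env: Record<string, string | undefined>): OutputPlan {
  return {
    // Database persistence is skipped when no connection string is present.
    persistToDatabase: Boolean(env["COSMOS_DB_CONNECTION_STRING"]),
    // Assumption: the local document is written whenever a path is given.
    localOutputPath: env["METRICS_OUTPUT_JSON"] || null,
    // RUNTIME_ONLY=true skips the structural CI score.
    computeStructuralScore: env["RUNTIME_ONLY"] !== "true",
  };
}
```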