
Commit 657babf

Author: Kristopher Turner

docs: six-phase documentation redesign
**Phase 1 — accuracy fixes:**
- Remove placeholder badges and stale status in 4 tool overviews (fio, iperf, hammerdb, stress-ng)
- Update introduction.md tool table: mark fio, iPerf3, HammerDB, stress-ng as Implemented
- Fix File Locations tables with correct paths and add Quick Start + Documentation sections

**Phase 2 — tool sub-pages (20 new pages):**
- fio: installation, workload-profiles, monitoring, reporting, troubleshooting
- iPerf3: installation, workload-profiles, monitoring, reporting, troubleshooting
- HammerDB: installation, workload-profiles, monitoring, reporting, troubleshooting
- stress-ng: installation, workload-profiles, monitoring, reporting, troubleshooting

**Phase 3 — architecture section:**
- Add docs/architecture/overview.md (five-layer stack, component table, workflow lifecycle)
- Add docs/architecture/tool-selection.md (Mermaid decision flowchart, comparison matrix)
- Add docs/architecture/data-flow.md (config, results, and monitoring data paths)
- Rewrite getting-started/architecture.md as summary pointer page

**Phase 4 — operations and reference expansion:**
- Add docs/operations/runner-setup.md (self-hosted runner install, SSH/WinRM validation)
- Add docs/operations/troubleshooting.md (cross-tool: SSH, WinRM, Key Vault, PSScriptAnalyzer, Pester, log correlation)
- Add docs/reference/tool-comparison.md (capability matrix, output formats, monitoring coverage)
- Add docs/roadmap.md (milestone tracker M1-M5 with status)

**Phase 5 — index.md rewrite:**
- Add Quick Start 3-command block
- Add Architecture at a Glance section
- Enhance tool table with Target OS column and status badges

**Phase 6 — mkdocs.yml nav update:**
- Add top-level Architecture section (overview, tool-selection, data-flow)
- Expand fio/iPerf3/HammerDB/stress-ng from flat links to sub-nav sections
- Add runner-setup and troubleshooting to Operations
- Add tool-comparison to Reference
- Add Roadmap top-level entry
1 parent 58ee58c commit 657babf

35 files changed

Lines changed: 3488 additions & 59 deletions

docs/architecture/data-flow.md

Lines changed: 159 additions & 0 deletions
# Data Flow

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

This page traces the three primary data flows through the framework: configuration, results, and monitoring. Understanding these paths helps when diagnosing failures, extending the framework, or integrating with external systems.
---

## Configuration Data Flow

```
config/variables.yml
          │
          ▼
ConfigManager.psm1
┌────────────────────────┐
│ 1. Load master YAML    │
│ 2. Filter by solution  │
│ 3. Validate vs schema  │
│ 4. Apply overrides     │
└────────────────────────┘
          │
          ├──► config/json/fio.json
          ├──► config/json/iperf.json
          ├──► config/json/hammerdb.json
          ├──► config/json/stress-ng.json
          └──► config/json/vmfleet.json
                      │
                      ▼
            scripts/Start-*.ps1
       (consumes only generated JSON)
```
### Key Rules

- Downstream scripts **never read** `variables.yml` directly; they consume only the generated JSON files.
- Variables are tagged by solution name in the master YAML (`solutions: [fio, iperf]`). `ConfigManager` emits only variables tagged for the target solution.
- Override chain (later sources take precedence): master YAML → environment variable → `-Variables` parameter → profile YAML
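
The tagging rule above can be illustrated with a minimal `variables.yml` sketch. The variable names and layout here are hypothetical, for illustration only — consult the shipped `config/variables.yml` for the real schema:

```yaml
# Hypothetical excerpt — illustrates solution tagging, not the framework's actual schema
variables:
  - name: block_size
    value: "4k"
    solutions: [fio]              # emitted only into config/json/fio.json
  - name: parallel_streams
    value: 8
    solutions: [iperf]            # emitted only into config/json/iperf.json
  - name: log_level
    value: "info"
    solutions: [fio, iperf, hammerdb, stress-ng, vmfleet]   # shared by all tools
```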
---
## Results Data Flow

```
Target Nodes (Linux / Windows)
          │
          │ (SSH/WinRM — batch execution)
          ▼
scripts/Start-<Tool>.ps1
          │
          │ Tool runs on node: writes raw output to /tmp or C:\
          ▼
Raw results on node:
  /tmp/fio-results/<RunId>/<node>-<job>.json            (fio)
  /tmp/iperf-results/<RunId>/<client>-to-<server>.json  (iPerf3)
  /tmp/stress-ng-results/<RunId>/stress-ng-results.yml  (stress-ng)
  C:\hammerdb-results\<RunId>\hammerdb-output.log       (HammerDB)
          │
          │ (SCP / WinRM copy)
          ▼
scripts/Collect-<Tool>.ps1
┌─────────────────────────────────────┐
│ 1. Copy raw files from all nodes    │
│ 2. Parse tool-specific format       │
│ 3. Normalise metric fields          │
│ 4. Compute aggregate statistics     │
│ 5. Write aggregate + per-node JSON  │
└─────────────────────────────────────┘
          │
          ▼
logs\<tool>\<RunId>\
  ├── <RunId>-aggregate.json       ← Primary report input
  ├── <RunId>-per-<node|job>.json
  └── <node>-raw-output.*          (preserved for audit)
          │
          ▼
scripts/New-LoadReport.ps1
┌──────────────────────────────────────────────┐
│ 1. Read aggregate JSON                       │
│ 2. Populate reports/templates/<tool>-*.adoc  │
│ 3. Invoke asciidoctor-pdf / pandoc           │
│ 4. Write PDF / DOCX / XLSX to reports/       │
└──────────────────────────────────────────────┘
          │
          ▼
reports/<RunId>.<pdf|docx|xlsx>
```
### Aggregate JSON Contract

Every tool's `Collect-*.ps1` writes a JSON file conforming to the same top-level envelope:

```json
{
  "run_id": "string",
  "tool": "string",
  "profile": "string",
  "node_count": int,
  "<tool_specific_metrics>": { ... },
  "collected_at": "ISO 8601 UTC"
}
```

Report templates rely on this envelope structure; adding a new tool requires a corresponding template that maps its specific metric fields.
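
A quick sanity check of the envelope can be scripted before report generation. The helper below is a minimal sketch, not part of the framework, and the sample metric fields are hypothetical:

```python
import json

# Top-level keys every aggregate file must carry, per the envelope contract
REQUIRED_KEYS = {"run_id", "tool", "profile", "node_count", "collected_at"}

def check_envelope(text: str) -> bool:
    """Return True if the aggregate JSON carries the required envelope keys."""
    doc = json.loads(text)
    return REQUIRED_KEYS.issubset(doc.keys())

# Hypothetical fio aggregate — metric field names are illustrative only
sample = json.dumps({
    "run_id": "run-20240101-0001",
    "tool": "fio",
    "profile": "random-read",
    "node_count": 4,
    "fio_metrics": {"iops_mean": 51200, "lat_p99_ms": 1.8},
    "collected_at": "2024-01-01T00:10:00Z",
})
print(check_envelope(sample))  # → True
```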
---
## Monitoring Data Flow

```
Target Nodes
          │
          │ (WMI / WinRM)
          ▼
MonitoringManager.psm1
┌────────────────────────────────────────┐
│ Runs in parallel with Start-<Tool>.ps1 │
│ 1. Read monitoring/<tool>/alert-rules  │
│ 2. Sample PerfMon counters every N sec │
│ 3. Evaluate each rule condition        │
│ 4. On trigger: log alert + send        │
└────────────────────────────────────────┘
          │
          ├──► logs\<tool>\<RunId>\monitor-<node>.jsonl  (all samples)
          ├──► logs\<tool>\<RunId>\alerts-<node>.jsonl   (triggered alerts only)
          └──► Azure Monitor (if configured)
                      │
                      ▼
              Grafana Dashboard
      (reads from Azure Monitor workspace)
```
### Alert Rule Evaluation

Alert rules are defined in `monitoring/<tool>/alert-rules.yml`. Each rule specifies:

| Field | Description |
|-------|-------------|
| `counter` | Windows Performance Counter path |
| `condition` | `<`, `>`, or `==` |
| `threshold` | Numeric value |
| `cooldown_seconds` | Minimum seconds between repeated alerts for the same rule |
| `severity` | `warning` or `critical` |
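
Put together, a rule entry might look like the following sketch. The rule name and counter path are illustrative — consult the shipped `alert-rules.yml` files for the real definitions:

```yaml
# Hypothetical rule — field names follow the table above
rules:
  - name: high_disk_read_latency
    counter: '\PhysicalDisk(_Total)\Avg. Disk sec/Read'
    condition: '>'
    threshold: 0.02          # 20 ms average read latency
    cooldown_seconds: 60
    severity: warning
```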
When a rule fires, `MonitoringManager` appends a structured JSON line to `alerts-<node>.jsonl` with the rule name, counter value, node name, and UTC timestamp. The `Collect-*.ps1` scripts include a threshold violation review step that surfaces any alerts recorded during the run.
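
An appended alert line would then carry those fields — for example (field names here are a plausible sketch, not the module's exact schema):

```json
{"rule": "high_disk_read_latency", "counter_value": 0.034, "node": "node-01", "severity": "warning", "timestamp": "2024-01-01T00:07:42Z", "correlation_id": "run-20240101-0001"}
```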
---
## Correlation IDs

Every log line written by the `Logger` module includes a `correlation_id` field set to the `RunId` passed to `Start-*.ps1`. This allows correlating entries across:

- `monitor-<node>.jsonl` — PerfMon samples
- `alerts-<node>.jsonl` — Alert triggers
- `<RunId>-aggregate.json` — Parsed results
- `state/<RunId>.json` — Checkpoint state

When investigating a failed or anomalous run, filter all log files by `"correlation_id": "<RunId>"` to reconstruct the full timeline.
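
That filtering step can be sketched in a few lines. This is a hypothetical helper, assuming only the JSON-lines layout described above:

```python
import json

def filter_by_run(jsonl_text: str, run_id: str) -> list[dict]:
    """Return all JSON-lines records whose correlation_id matches run_id."""
    matches = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("correlation_id") == run_id:
            matches.append(record)
    return matches

# Two sample records from different runs
log = (
    '{"correlation_id": "run-A", "counter": "cpu", "value": 91}\n'
    '{"correlation_id": "run-B", "counter": "cpu", "value": 12}\n'
)
print(len(filter_by_run(log, "run-A")))  # → 1
```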

docs/architecture/overview.md

Lines changed: 124 additions & 0 deletions
# Architecture Overview

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

The Azure Local Load Testing Framework is organised into five layers that take a declarative configuration file as input and produce validated, auditable test reports as output. Each layer has a clear responsibility boundary and communicates with adjacent layers through well-defined file contracts and PowerShell module APIs.
---

## Five-Layer Stack

```
┌──────────────────────────────────────────────────────────────┐
│ 5. Reporting Layer                                           │
│    AsciiDoc templates → PDF / DOCX / XLSX reports            │
├──────────────────────────────────────────────────────────────┤
│ 4. Monitoring Layer                                          │
│    PerfMon counters, Azure Monitor, real-time alerts         │
├──────────────────────────────────────────────────────────────┤
│ 3. Execution Layer                                           │
│    fio · iPerf3 · HammerDB · stress-ng · VMFleet             │
├──────────────────────────────────────────────────────────────┤
│ 2. Automation Layer                                          │
│    PowerShell orchestrators, Ansible roles, modules          │
├──────────────────────────────────────────────────────────────┤
│ 1. Configuration Layer                                       │
│    Master YAML → ConfigManager → solution JSON               │
└──────────────────────────────────────────────────────────────┘
```
---

## Layer Responsibilities

### Layer 1 — Configuration

The configuration layer holds all cluster-specific, credential, and workload parameters. A single `variables.yml` file acts as the source of truth.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| Master variables file | `config/variables.yml` | All environment parameters, tagged by solution |
| Workload profiles | `config/profiles/<tool>/` | Per-tool YAML profile definitions |
| ConfigManager module | `src/common/modules/ConfigManager/` | Filters, validates, and emits solution-scoped JSON |
| Schema validation | `config/schema/` | JSON Schema files that gate `ConfigManager` output |
### Layer 2 — Automation

The automation layer orchestrates pre-checks, installation, and execution across cluster nodes. All scripts consume only the ConfigManager-emitted JSON — never the raw YAML.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| Orchestrator scripts | `scripts/*.ps1` | Top-level `Start-*` / `Collect-*` / `Install-*` entry points |
| Logger module | `src/common/modules/Logger/` | Structured JSON-lines logging with correlation IDs |
| StateManager module | `src/common/modules/StateManager/` | Checkpoint-based resume-after-failure |
| CredentialManager module | `src/common/modules/CredentialManager/` | Key Vault, interactive, or parameter credential retrieval |
| Ansible roles | `src/ansible/roles/<tool>/` | Linux-target deployment (fio) |
### Layer 3 — Execution

The execution layer comprises the load-testing tools themselves, running on cluster nodes or inside guest VMs.

| Tool | Target OS | Install Method |
|------|-----------|----------------|
| fio | Linux VMs | Ansible (`Install-Fio.ps1`) |
| iPerf3 | Linux nodes | `apt` / `dnf` (manual) |
| HammerDB | Windows nodes | PowerShell remoting (`Install-HammerDB.ps1`) |
| stress-ng | Linux nodes | `apt` / `dnf` (manual) |
| VMFleet | Windows (HCI host) | `Install-VMFleet.ps1` |
### Layer 4 — Monitoring

The monitoring layer runs in parallel with execution, capturing Windows Performance Counter data and evaluating alert rules.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| MonitoringManager module | `src/common/modules/MonitoringManager/` | PerfMon collection, Azure Monitor push |
| Alert rules files | `monitoring/<tool>/alert-rules.yml` | Per-tool alert definitions |
| Grafana dashboards | `monitoring/dashboards/` | Real-time visualisation |
### Layer 5 — Reporting

The reporting layer aggregates raw JSON results, populates AsciiDoc templates, and renders final reports.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| ReportGenerator module | `src/common/modules/ReportGenerator/` | Template population, `asciidoctor-pdf` invocation |
| Report templates | `reports/templates/` | Per-tool AsciiDoc templates |
| Generated reports | `reports/` | Output PDF / DOCX / XLSX files |

---
## Common Workflow Pattern

Every tool follows the same eight-phase lifecycle:

| Phase | Script Action | StateManager Checkpoint |
|-------|---------------|-------------------------|
| Pre-Check | Validate connectivity, prerequisites | `pre-check-complete` |
| Install | Deploy tool binaries on target nodes | `install-complete` |
| Deploy | Configure test environment | `deploy-complete` |
| Test | Execute workload profile | `test-complete` |
| Monitor | Collect PerfMon counters (parallel) | `monitor-complete` |
| Collect | Parse and aggregate results | `collect-complete` |
| Report | Render PDF/DOCX/XLSX | `report-complete` |
| Cleanup | (Optional) Remove test artefacts | `cleanup-complete` |

Any phase can be resumed from its checkpoint if the run is interrupted.
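
The resume behaviour can be sketched abstractly — a hypothetical illustration of checkpoint gating, not the actual `StateManager` API:

```python
PHASES = ["pre-check", "install", "deploy", "test",
          "monitor", "collect", "report", "cleanup"]

def phases_to_run(completed_checkpoints: set[str]) -> list[str]:
    """Skip every phase whose '<phase>-complete' checkpoint already exists."""
    return [p for p in PHASES if f"{p}-complete" not in completed_checkpoints]

# A run interrupted during the Test phase resumes at Test, not from scratch
state = {"pre-check-complete", "install-complete", "deploy-complete"}
print(phases_to_run(state))
# → ['test', 'monitor', 'collect', 'report', 'cleanup']
```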
---

## Security Model

- Credentials are **never hardcoded** in scripts or configuration files
- Three retrieval modes: Azure Key Vault (production), interactive prompt (development), parameter injection (CI/CD)
- All credential access is logged with values masked
- CI/CD pipelines use GitHub Secrets / Azure DevOps Service Connections

See [Credential Management](../operations/credential-management.md) for implementation details.
---

## Further Reading

- [Tool Selection Guide](tool-selection.md) — Choosing the right tool for your workload
- [Data Flow](data-flow.md) — End-to-end config, results, and monitoring data paths
docs/architecture/tool-selection.md

Lines changed: 90 additions & 0 deletions
# Tool Selection Guide

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

Use this guide to select the right tool for your Azure Local performance validation scenario. The decision flowchart covers the most common questions; the comparison table below it covers all dimensions.
7+
---
8+
9+
## Decision Flowchart
10+
11+
```mermaid
12+
flowchart TD
13+
A([What are you testing?]) --> B{Is it network throughput\nor latency?}
14+
B -->|Yes| C([iPerf3])
15+
B -->|No| D{Is it storage\nI/O performance?}
16+
D -->|Yes - block device\nbenchmark| E([fio])
17+
D -->|Yes - application\nI/O patterns| F{Is it a SQL\nworkload?}
18+
F -->|Yes| G([HammerDB])
19+
F -->|No| H([stress-ng io-stress])
20+
D -->|No| I{Is it CPU or\nmemory stress?}
21+
I -->|Yes| J([stress-ng cpu/memory])
22+
I -->|No| K{Is it full VM\nworkload simulation?}
23+
K -->|Yes| L([VMFleet])
24+
K -->|No| M([Consult team])
25+
```
26+
27+
---
## Tool Comparison Matrix

| Dimension | fio | iPerf3 | HammerDB | stress-ng | VMFleet |
|-----------|-----|--------|----------|-----------|---------|
| **Primary purpose** | Block device I/O benchmarking | Network throughput & latency | SQL database benchmarking | OS-level stress (CPU/memory/I/O) | Full VM workload simulation |
| **Target OS** | Linux | Linux / Windows | Windows | Linux | Windows (HCI host) |
| **Protocol** | POSIX file I/O | TCP / UDP | TDS (SQL Server), libpq (PostgreSQL) | POSIX / kernel syscalls | Hyper-V VM workload |
| **Output format** | JSON | JSON | HammerDB log + parsed JSON | YAML + parsed JSON | CSV / JSON |
| **Profile count** | 5 | 3 | 2 | 3 | N/A (config-driven) |
| **Install method** | Ansible (`Install-Fio.ps1`) | `apt` / `dnf` | PowerShell remoting (`Install-HammerDB.ps1`) | `apt` / `dnf` | `Install-VMFleet.ps1` |
| **CI/CD ready** | Yes | Yes | Yes | Yes | Yes |
| **Monitoring alerts** | 7 rules | 6 rules | 7 rules | 6 rules | (shared PerfMon) |
| **Key metric** | IOPS, throughput MB/s, latency P99 | Throughput MB/s, jitter ms | NOPM, TPM | bogo-ops/sec | VM IOPS, CPU% |
| **Parallelises across nodes** | Yes (all nodes simultaneously) | Yes (pairs / mesh) | Yes (per-node DB instance) | Yes (all nodes simultaneously) | Yes (VM fleet distributes load) |

---
## Scenario Examples

### "We want to know if our RDMA storage network is healthy after hardware replacement."

**→ Use iPerf3 (mesh profile)**

Mesh throughput between all node pairs will expose any link degraded from 10GbE to 1GbE, misconfigured MTU, or failed SFP.

### "We need to validate that our NVMe SSDs meet the IOPS requirement for a new VM workload."

**→ Use fio (random-read + random-write profiles)**

fio directly benchmarks the block device from inside a VM, giving you raw IOPS and P99 latency that you can compare directly against the SSD datasheet and SLA requirement.

### "We're deploying SQL Server on Azure Local and want to verify it can handle our TPC-C equivalent load."

**→ Use HammerDB (tpc-c profile)**

HammerDB simulates OLTP database operations and reports NOPM (New Orders Per Minute), the standard TPC-C throughput metric.

### "We want to confirm that our AX nodes can sustain CPU load for 5 minutes without thermal throttling."

**→ Use stress-ng (cpu-stress profile)**

stress-ng saturates all logical CPUs and reports bogo-ops/sec; the `stressng_cpu_throttling` alert fires if frequency drops below 80%.

### "We want to understand how many VMs our cluster can host before storage IOPS degrades."

**→ Use VMFleet**

VMFleet deploys a configurable fleet of VMs with DISKSPD workload simulation and measures cluster-wide storage throughput as VM density increases.

---
## Combining Tools

For production validation, run the tools in sequence:

1. **iPerf3 mesh** — Confirm network fabric health before any other testing
2. **stress-ng cpu-stress** — Verify thermal and BIOS power profile settings
3. **fio sequential + random** — Baseline storage performance per node
4. **HammerDB tpc-c** — Validate SQL workload under realistic OLTP load
5. **VMFleet** — Full-system capacity proving

Each tool writes results to its own `logs\<tool>\<RunId>\` directory; reports can be generated independently after each phase completes.
