
Commit 657babf

Author: Kristopher Turner

docs: six-phase documentation redesign
**Phase 1 — accuracy fixes:**
- Remove placeholder badges and stale status in 4 tool overviews (fio, iperf, hammerdb, stress-ng)
- Update introduction.md tool table: mark fio, iPerf3, HammerDB, stress-ng as Implemented
- Fix File Locations tables with correct paths and add Quick Start + Documentation sections

**Phase 2 — tool sub-pages (20 new pages):**
- fio: installation, workload-profiles, monitoring, reporting, troubleshooting
- iPerf3: installation, workload-profiles, monitoring, reporting, troubleshooting
- HammerDB: installation, workload-profiles, monitoring, reporting, troubleshooting
- stress-ng: installation, workload-profiles, monitoring, reporting, troubleshooting

**Phase 3 — architecture section:**
- Add docs/architecture/overview.md (five-layer stack, component table, workflow lifecycle)
- Add docs/architecture/tool-selection.md (Mermaid decision flowchart, comparison matrix)
- Add docs/architecture/data-flow.md (config, results, and monitoring data paths)
- Rewrite getting-started/architecture.md as summary pointer page

**Phase 4 — operations and reference expansion:**
- Add docs/operations/runner-setup.md (self-hosted runner install, SSH/WinRM validation)
- Add docs/operations/troubleshooting.md (cross-tool: SSH, WinRM, Key Vault, PSScriptAnalyzer, Pester, log correlation)
- Add docs/reference/tool-comparison.md (capability matrix, output formats, monitoring coverage)
- Add docs/roadmap.md (milestone tracker M1-M5 with status)

**Phase 5 — index.md rewrite:**
- Add Quick Start 3-command block
- Add Architecture at a Glance section
- Enhance tool table with Target OS column and status badges

**Phase 6 — mkdocs.yml nav update:**
- Add top-level Architecture section (overview, tool-selection, data-flow)
- Expand fio/iPerf3/HammerDB/stress-ng from flat links to sub-nav sections
- Add runner-setup and troubleshooting to Operations
- Add tool-comparison to Reference
- Add Roadmap top-level entry
1 parent 58ee58c commit 657babf

35 files changed

Lines changed: 3488 additions & 59 deletions

docs/architecture/data-flow.md

Lines changed: 159 additions & 0 deletions
# Data Flow

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

This page traces the three primary data flows through the framework: configuration, results, and monitoring. Understanding these paths helps when diagnosing failures, extending the framework, or integrating with external systems.
---

## Configuration Data Flow

```
config/variables.yml
          │
          ▼
ConfigManager.psm1
┌────────────────────────┐
│ 1. Load master YAML    │
│ 2. Filter by solution  │
│ 3. Validate vs schema  │
│ 4. Apply overrides     │
└────────────────────────┘
          │
          ├──► config/json/fio.json
          ├──► config/json/iperf.json
          ├──► config/json/hammerdb.json
          ├──► config/json/stress-ng.json
          └──► config/json/vmfleet.json
                      │
                      ▼
            scripts/Start-*.ps1
       (consumes only generated JSON)
```
### Key Rules

- Downstream scripts **never read** `variables.yml` directly; they consume only the generated JSON files.
- Variables are tagged by solution name in the master YAML (`solutions: [fio, iperf]`). `ConfigManager` emits only variables tagged for the target solution.
- Override chain (later sources take precedence): master YAML → environment variable → `-Variables` parameter → profile YAML
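
The tagging rule above can be illustrated with a minimal `variables.yml` sketch. The variable names and layout here are hypothetical, for illustration only — consult the shipped `config/variables.yml` for the real schema:

```yaml
# Hypothetical excerpt — illustrates solution tagging, not the framework's actual schema
variables:
  - name: block_size
    value: "4k"
    solutions: [fio]              # emitted only into config/json/fio.json
  - name: parallel_streams
    value: 8
    solutions: [iperf]            # emitted only into config/json/iperf.json
  - name: log_level
    value: "info"
    solutions: [fio, iperf, hammerdb, stress-ng, vmfleet]   # shared by all tools
```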
---
## Results Data Flow

```
Target Nodes (Linux / Windows)
          │
          │ (SSH/WinRM — batch execution)
          ▼
scripts/Start-<Tool>.ps1
          │
          │ Tool runs on node: writes raw output to /tmp or C:\
          ▼
Raw results on node:
  /tmp/fio-results/<RunId>/<node>-<job>.json            (fio)
  /tmp/iperf-results/<RunId>/<client>-to-<server>.json  (iPerf3)
  /tmp/stress-ng-results/<RunId>/stress-ng-results.yml  (stress-ng)
  C:\hammerdb-results\<RunId>\hammerdb-output.log       (HammerDB)
          │
          │ (SCP / WinRM copy)
          ▼
scripts/Collect-<Tool>.ps1
┌─────────────────────────────────────┐
│ 1. Copy raw files from all nodes    │
│ 2. Parse tool-specific format       │
│ 3. Normalise metric fields          │
│ 4. Compute aggregate statistics     │
│ 5. Write aggregate + per-node JSON  │
└─────────────────────────────────────┘
          │
          ▼
logs\<tool>\<RunId>\
  ├── <RunId>-aggregate.json       ← Primary report input
  ├── <RunId>-per-<node|job>.json
  └── <node>-raw-output.*          (preserved for audit)
          │
          ▼
scripts/New-LoadReport.ps1
┌──────────────────────────────────────────────┐
│ 1. Read aggregate JSON                       │
│ 2. Populate reports/templates/<tool>-*.adoc  │
│ 3. Invoke asciidoctor-pdf / pandoc           │
│ 4. Write PDF / DOCX / XLSX to reports/       │
└──────────────────────────────────────────────┘
          │
          ▼
reports/<RunId>.<pdf|docx|xlsx>
```
### Aggregate JSON Contract

Every tool's `Collect-*.ps1` writes a JSON file conforming to the same top-level envelope:

```json
{
  "run_id": "string",
  "tool": "string",
  "profile": "string",
  "node_count": int,
  "<tool_specific_metrics>": { ... },
  "collected_at": "ISO 8601 UTC"
}
```

Report templates rely on this envelope structure; adding a new tool requires a corresponding template that maps its specific metric fields.
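
A quick sanity check of the envelope can be scripted before report generation. The helper below is a minimal sketch, not part of the framework, and the sample metric fields are hypothetical:

```python
import json

# Top-level keys every aggregate file must carry, per the envelope contract
REQUIRED_KEYS = {"run_id", "tool", "profile", "node_count", "collected_at"}

def check_envelope(text: str) -> bool:
    """Return True if the aggregate JSON carries the required envelope keys."""
    doc = json.loads(text)
    return REQUIRED_KEYS.issubset(doc.keys())

# Hypothetical fio aggregate — metric field names are illustrative only
sample = json.dumps({
    "run_id": "run-20240101-0001",
    "tool": "fio",
    "profile": "random-read",
    "node_count": 4,
    "fio_metrics": {"iops_mean": 51200, "lat_p99_ms": 1.8},
    "collected_at": "2024-01-01T00:10:00Z",
})
print(check_envelope(sample))  # → True
```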
---
## Monitoring Data Flow

```
Target Nodes
          │
          │ (WMI / WinRM)
          ▼
MonitoringManager.psm1
┌────────────────────────────────────────┐
│ Runs in parallel with Start-<Tool>.ps1 │
│ 1. Read monitoring/<tool>/alert-rules  │
│ 2. Sample PerfMon counters every N sec │
│ 3. Evaluate each rule condition        │
│ 4. On trigger: log alert + send        │
└────────────────────────────────────────┘
          │
          ├──► logs\<tool>\<RunId>\monitor-<node>.jsonl  (all samples)
          ├──► logs\<tool>\<RunId>\alerts-<node>.jsonl   (triggered alerts only)
          └──► Azure Monitor (if configured)
                      │
                      ▼
              Grafana Dashboard
      (reads from Azure Monitor workspace)
```
### Alert Rule Evaluation

Alert rules are defined in `monitoring/<tool>/alert-rules.yml`. Each rule specifies:

| Field | Description |
|-------|-------------|
| `counter` | Windows Performance Counter path |
| `condition` | `<`, `>`, or `==` |
| `threshold` | Numeric value |
| `cooldown_seconds` | Minimum seconds between repeated alerts for the same rule |
| `severity` | `warning` or `critical` |
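
Put together, a rule entry might look like the following sketch. The rule name and counter path are illustrative — consult the shipped `alert-rules.yml` files for the real definitions:

```yaml
# Hypothetical rule — field names follow the table above
rules:
  - name: high_disk_read_latency
    counter: '\PhysicalDisk(_Total)\Avg. Disk sec/Read'
    condition: '>'
    threshold: 0.02          # 20 ms average read latency
    cooldown_seconds: 60
    severity: warning
```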
When a rule fires, `MonitoringManager` appends a structured JSON line to `alerts-<node>.jsonl` with the rule name, counter value, node name, and UTC timestamp. The `Collect-*.ps1` scripts include a threshold violation review step that surfaces any alerts recorded during the run.
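
An appended alert line would then carry those fields — for example (field names here are a plausible sketch, not the module's exact schema):

```json
{"rule": "high_disk_read_latency", "counter_value": 0.034, "node": "node-01", "severity": "warning", "timestamp": "2024-01-01T00:07:42Z", "correlation_id": "run-20240101-0001"}
```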
---
## Correlation IDs

Every log line written by the `Logger` module includes a `correlation_id` field set to the `RunId` passed to `Start-*.ps1`. This allows correlating entries across:

- `monitor-<node>.jsonl` — PerfMon samples
- `alerts-<node>.jsonl` — Alert triggers
- `<RunId>-aggregate.json` — Parsed results
- `state/<RunId>.json` — Checkpoint state

When investigating a failed or anomalous run, filter all log files by `"correlation_id": "<RunId>"` to reconstruct the full timeline.
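
That filtering step can be sketched in a few lines. This is a hypothetical helper, assuming only the JSON-lines layout described above:

```python
import json

def filter_by_run(jsonl_text: str, run_id: str) -> list[dict]:
    """Return all JSON-lines records whose correlation_id matches run_id."""
    matches = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("correlation_id") == run_id:
            matches.append(record)
    return matches

# Two sample records from different runs
log = (
    '{"correlation_id": "run-A", "counter": "cpu", "value": 91}\n'
    '{"correlation_id": "run-B", "counter": "cpu", "value": 12}\n'
)
print(len(filter_by_run(log, "run-A")))  # → 1
```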

docs/architecture/overview.md

Lines changed: 124 additions & 0 deletions
# Architecture Overview

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

The Azure Local Load Testing Framework is organised into five layers that take a declarative configuration file as input and produce validated, auditable test reports as output. Each layer has a clear responsibility boundary and communicates with adjacent layers through well-defined file contracts and PowerShell module APIs.
---

## Five-Layer Stack

```
┌──────────────────────────────────────────────────────────────┐
│ 5. Reporting Layer                                           │
│    AsciiDoc templates → PDF / DOCX / XLSX reports            │
├──────────────────────────────────────────────────────────────┤
│ 4. Monitoring Layer                                          │
│    PerfMon counters, Azure Monitor, real-time alerts         │
├──────────────────────────────────────────────────────────────┤
│ 3. Execution Layer                                           │
│    fio · iPerf3 · HammerDB · stress-ng · VMFleet             │
├──────────────────────────────────────────────────────────────┤
│ 2. Automation Layer                                          │
│    PowerShell orchestrators, Ansible roles, modules          │
├──────────────────────────────────────────────────────────────┤
│ 1. Configuration Layer                                       │
│    Master YAML → ConfigManager → solution JSON               │
└──────────────────────────────────────────────────────────────┘
```
---

## Layer Responsibilities

### Layer 1 — Configuration

The configuration layer holds all cluster-specific, credential, and workload parameters. A single `variables.yml` file acts as the source of truth.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| Master variables file | `config/variables.yml` | All environment parameters, tagged by solution |
| Workload profiles | `config/profiles/<tool>/` | Per-tool YAML profile definitions |
| ConfigManager module | `src/common/modules/ConfigManager/` | Filters, validates, and emits solution-scoped JSON |
| Schema validation | `config/schema/` | JSON Schema files that gate `ConfigManager` output |
### Layer 2 — Automation

The automation layer orchestrates pre-checks, installation, and execution across cluster nodes. All scripts consume only the ConfigManager-emitted JSON — never the raw YAML.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| Orchestrator scripts | `scripts/*.ps1` | Top-level `Start-*` / `Collect-*` / `Install-*` entry points |
| Logger module | `src/common/modules/Logger/` | Structured JSON-lines logging with correlation IDs |
| StateManager module | `src/common/modules/StateManager/` | Checkpoint-based resume-after-failure |
| CredentialManager module | `src/common/modules/CredentialManager/` | Key Vault, interactive, or parameter credential retrieval |
| Ansible roles | `src/ansible/roles/<tool>/` | Linux-target deployment (fio) |
### Layer 3 — Execution

The execution layer comprises the load-testing tools themselves, running on cluster nodes or inside guest VMs.

| Tool | Target OS | Install Method |
|------|-----------|----------------|
| fio | Linux VMs | Ansible (`Install-Fio.ps1`) |
| iPerf3 | Linux nodes | `apt` / `dnf` (manual) |
| HammerDB | Windows nodes | PowerShell remoting (`Install-HammerDB.ps1`) |
| stress-ng | Linux nodes | `apt` / `dnf` (manual) |
| VMFleet | Windows (HCI host) | `Install-VMFleet.ps1` |
### Layer 4 — Monitoring

The monitoring layer runs in parallel with execution, capturing Windows Performance Counter data and evaluating alert rules.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| MonitoringManager module | `src/common/modules/MonitoringManager/` | PerfMon collection, Azure Monitor push |
| Alert rules files | `monitoring/<tool>/alert-rules.yml` | Per-tool alert definitions |
| Grafana dashboards | `monitoring/dashboards/` | Real-time visualisation |
### Layer 5 — Reporting

The reporting layer aggregates raw JSON results, populates AsciiDoc templates, and renders final reports.

| Component | Location | Responsibility |
|-----------|----------|----------------|
| ReportGenerator module | `src/common/modules/ReportGenerator/` | Template population, `asciidoctor-pdf` invocation |
| Report templates | `reports/templates/` | Per-tool AsciiDoc templates |
| Generated reports | `reports/` | Output PDF / DOCX / XLSX files |

---
## Common Workflow Pattern

Every tool follows the same eight-phase lifecycle:

| Phase | Script Action | StateManager Checkpoint |
|-------|---------------|-------------------------|
| Pre-Check | Validate connectivity, prerequisites | `pre-check-complete` |
| Install | Deploy tool binaries on target nodes | `install-complete` |
| Deploy | Configure test environment | `deploy-complete` |
| Test | Execute workload profile | `test-complete` |
| Monitor | Collect PerfMon counters (parallel) | `monitor-complete` |
| Collect | Parse and aggregate results | `collect-complete` |
| Report | Render PDF/DOCX/XLSX | `report-complete` |
| Cleanup | (Optional) Remove test artefacts | `cleanup-complete` |

Any phase can be resumed from its checkpoint if the run is interrupted.
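
The resume behaviour can be sketched abstractly — a hypothetical illustration of checkpoint gating, not the actual `StateManager` API:

```python
PHASES = ["pre-check", "install", "deploy", "test",
          "monitor", "collect", "report", "cleanup"]

def phases_to_run(completed_checkpoints: set[str]) -> list[str]:
    """Skip every phase whose '<phase>-complete' checkpoint already exists."""
    return [p for p in PHASES if f"{p}-complete" not in completed_checkpoints]

# A run interrupted during the Test phase resumes at Test, not from scratch
state = {"pre-check-complete", "install-complete", "deploy-complete"}
print(phases_to_run(state))
# → ['test', 'monitor', 'collect', 'report', 'cleanup']
```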
---

## Security Model

- Credentials are **never hardcoded** in scripts or configuration files
- Three retrieval modes: Azure Key Vault (production), interactive prompt (development), parameter injection (CI/CD)
- All credential access is logged with values masked
- CI/CD pipelines use GitHub Secrets / Azure DevOps Service Connections

See [Credential Management](../operations/credential-management.md) for implementation details.
---

## Further Reading

- [Tool Selection Guide](tool-selection.md) — Choosing the right tool for your workload
- [Data Flow](data-flow.md) — End-to-end config, results, and monitoring data paths
docs/architecture/tool-selection.md

Lines changed: 90 additions & 0 deletions
# Tool Selection Guide

![Category: Architecture](https://img.shields.io/badge/Category-Architecture-8E44AD?style=flat-square)

Use this guide to select the right tool for your Azure Local performance validation scenario. The decision flowchart covers the most common questions; the comparison table below it covers all dimensions.
7+
---
8+
9+
## Decision Flowchart
10+
11+
```mermaid
12+
flowchart TD
13+
A([What are you testing?]) --> B{Is it network throughput\nor latency?}
14+
B -->|Yes| C([iPerf3])
15+
B -->|No| D{Is it storage\nI/O performance?}
16+
D -->|Yes - block device\nbenchmark| E([fio])
17+
D -->|Yes - application\nI/O patterns| F{Is it a SQL\nworkload?}
18+
F -->|Yes| G([HammerDB])
19+
F -->|No| H([stress-ng io-stress])
20+
D -->|No| I{Is it CPU or\nmemory stress?}
21+
I -->|Yes| J([stress-ng cpu/memory])
22+
I -->|No| K{Is it full VM\nworkload simulation?}
23+
K -->|Yes| L([VMFleet])
24+
K -->|No| M([Consult team])
25+
```
26+
27+
---
## Tool Comparison Matrix

| Dimension | fio | iPerf3 | HammerDB | stress-ng | VMFleet |
|-----------|-----|--------|----------|-----------|---------|
| **Primary purpose** | Block device I/O benchmarking | Network throughput & latency | SQL database benchmarking | OS-level stress (CPU/memory/I/O) | Full VM workload simulation |
| **Target OS** | Linux | Linux / Windows | Windows | Linux | Windows (HCI host) |
| **Protocol** | POSIX file I/O | TCP / UDP | TDS (SQL Server), libpq (PostgreSQL) | POSIX / kernel syscalls | Hyper-V VM workload |
| **Output format** | JSON | JSON | HammerDB log + parsed JSON | YAML + parsed JSON | CSV / JSON |
| **Profile count** | 5 | 3 | 2 | 3 | N/A (config-driven) |
| **Install method** | Ansible (`Install-Fio.ps1`) | `apt` / `dnf` | PowerShell remoting (`Install-HammerDB.ps1`) | `apt` / `dnf` | `Install-VMFleet.ps1` |
| **CI/CD ready** | Yes | Yes | Yes | Yes | Yes |
| **Monitoring alerts** | 7 rules | 6 rules | 7 rules | 6 rules | (shared PerfMon) |
| **Key metric** | IOPS, throughput MB/s, latency P99 | Throughput MB/s, jitter ms | NOPM, TPM | bogo-ops/sec | VM IOPS, CPU% |
| **Parallelises across nodes** | Yes (all nodes simultaneously) | Yes (pairs / mesh) | Yes (per-node DB instance) | Yes (all nodes simultaneously) | Yes (VM fleet distributes load) |

---
## Scenario Examples

### "We want to know if our RDMA storage network is healthy after hardware replacement."

**→ Use iPerf3 (mesh profile)**

Mesh throughput between all node pairs will expose any link degraded from 10GbE to 1GbE, misconfigured MTU, or failed SFP.

### "We need to validate that our NVMe SSDs meet the IOPS requirement for a new VM workload."

**→ Use fio (random-read + random-write profiles)**

fio directly benchmarks the block device from inside a VM, giving you raw IOPS and P99 latency that you can compare directly against the SSD datasheet and SLA requirement.

### "We're deploying SQL Server on Azure Local and want to verify it can handle our TPC-C equivalent load."

**→ Use HammerDB (tpc-c profile)**

HammerDB simulates OLTP database operations and reports NOPM (New Orders Per Minute), the standard TPC-C throughput metric.

### "We want to confirm that our AX nodes can sustain CPU load for 5 minutes without thermal throttling."

**→ Use stress-ng (cpu-stress profile)**

stress-ng saturates all logical CPUs and reports bogo-ops/sec; the `stressng_cpu_throttling` alert fires if frequency drops below 80%.

### "We want to understand how many VMs our cluster can host before storage IOPS degrades."

**→ Use VMFleet**

VMFleet deploys a configurable fleet of VMs with DISKSPD workload simulation and measures cluster-wide storage throughput as VM density increases.

---
## Combining Tools

For production validation, run the tools in sequence:

1. **iPerf3 mesh** — Confirm network fabric health before any other testing
2. **stress-ng cpu-stress** — Verify thermal and BIOS power profile settings
3. **fio sequential + random** — Baseline storage performance per node
4. **HammerDB tpc-c** — Validate SQL workload under realistic OLTP load
5. **VMFleet** — Full-system capacity proving

Each tool writes results to its own `logs\<tool>\<RunId>\` directory; reports can be generated independently after each phase completes.
