# Draft Issue: Automated Verification of AI Training Dataset Freshness

**Status:** Draft
**Priority:** Low
**Category:** Developer Experience / Testing
**Related:** Issue #339 - AI Training Resources

## Problem

The AI training dataset in `docs/ai-training/dataset/` requires manual regeneration whenever:

- Input examples in `dataset/environment-configs/` change
- Template files in `templates/` are modified
- Rendering code changes in a way that affects output

Currently, developers must manually run `scripts/generate-ai-training-outputs.sh` to regenerate the `rendered-templates/` directory.

**Risk:** The rendered outputs can become stale without anyone noticing, leading to:

- Outdated training data for AI models
- Inconsistencies between documentation and actual behavior
- Confusion for users examining the examples

## Current Workaround

Manual regeneration by running:

```bash
./scripts/generate-ai-training-outputs.sh
```

## Challenge

Rendered outputs contain dynamic timestamps that make direct directory comparison difficult:

```text
# Generated: 2026-02-13T10:02:22Z
```

Any comparison mechanism would need to:

- Strip/normalize timestamps before comparison, OR
- Use timestamp-agnostic comparison methods, OR
- Include deterministic timestamp generation for testing

## Potential Solutions

### Option 1: CI Workflow with Timestamp Normalization

Create a GitHub Actions workflow that:

1. Strips timestamp lines matching the pattern `# Generated: YYYY-MM-DDTHH:MM:SSZ`
2. Compares normalized outputs with the existing rendered templates
3. Fails if differences exist (beyond timestamps)

**Pros:** Catches stale outputs automatically
**Cons:** Requires maintaining comparison logic, may have false positives
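
A minimal sketch of the normalization step, demonstrated on two throwaway files that differ only in their timestamp line (the `strip_timestamps` helper name is illustrative; in CI the same filter would run over both directory trees before a recursive `diff`):

```shell
#!/usr/bin/env bash
# Sketch: two files that differ only in "# Generated: ..." compare
# equal after the timestamp lines are stripped.
set -euo pipefail

strip_timestamps() {
  # Delete lines matching "# Generated: YYYY-MM-DDTHH:MM:SSZ"
  sed '/^# Generated: [0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9:]\{8\}Z$/d' "$1"
}

a=$(mktemp); b=$(mktemp)
printf '# Generated: 2026-02-13T10:02:22Z\nkey = value\n' > "$a"
printf '# Generated: 2026-02-14T08:00:00Z\nkey = value\n' > "$b"

# Identical after normalization -> rendered outputs count as fresh.
if diff <(strip_timestamps "$a") <(strip_timestamps "$b") >/dev/null; then
  echo "fresh"
else
  echo "stale"
fi
rm -f "$a" "$b"
```

In the real workflow, `diff -r` over two normalized copies of the tree would replace the single-file comparison above.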

### Option 2: Metadata File with Input Hashes

Store a `.metadata.json` file alongside rendered outputs:

```json
{
  "generated_at": "2026-02-13T10:02:22Z",
  "input_hashes": {
    "environment_configs": "sha256:abc123...",
    "templates": "sha256:def456...",
    "rendering_code": "sha256:ghi789..."
  },
  "tool_version": "0.1.0"
}
```

CI recomputes the hashes of the relevant paths and fails if any differ from the stored values.

**Pros:** Fast, deterministic, no false positives
**Cons:** Requires metadata management; misses changes outside the hashed paths (e.g. dependency updates)
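
A sketch of how the per-tree digests could be computed (the `hash_tree` helper is hypothetical; only the hashing idea matters). Hashing file paths together with file contents means edits, additions, and renames all change the digest:

```shell
#!/usr/bin/env bash
# Sketch: one stable digest per input tree for .metadata.json. The
# hash_tree helper is hypothetical; in CI it would run over
# environment-configs/, templates/, and the rendering code.
set -euo pipefail

hash_tree() {
  # sha256sum each file (its output includes the path) in a
  # deterministic order, then hash that listing so any change
  # alters the result.
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum) \
    | sha256sum | awk '{print $1}'
}

dir=$(mktemp -d)
echo 'config-a' > "$dir/a.json"
before=$(hash_tree "$dir")
echo 'config-a changed' > "$dir/a.json"
after=$(hash_tree "$dir")

if [ "$before" != "$after" ]; then
  echo "input change detected"
fi
rm -rf "$dir"
```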

### Option 3: Deterministic Timestamp Flag

Add a `--mock-timestamp` or `--fixed-timestamp` flag to the render command:

```bash
torrust-tracker-deployer render ... --fixed-timestamp "2024-01-01T00:00:00Z"
```

Use this flag for test/CI renders to enable exact, byte-for-byte comparison.

**Pros:** Simple, enables deterministic testing
**Cons:** CI outputs would carry fake timestamps (acceptable for testing)
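
The idea can be illustrated with a stand-in render function (the `render` function below is a placeholder, not the real tool, which does not yet accept such a flag): once the timestamp comes from a parameter instead of the clock, two runs produce byte-identical output and a plain `diff` suffices.

```shell
#!/usr/bin/env bash
# Sketch of why a fixed timestamp makes output directly comparable.
set -euo pipefail

render() {
  # Use the supplied timestamp if given, otherwise the current time.
  local ts="${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  printf '# Generated: %s\nkey = value\n' "$ts"
}

a=$(render "2024-01-01T00:00:00Z")
b=$(render "2024-01-01T00:00:00Z")
if [ "$a" = "$b" ]; then
  echo "byte-identical"
fi
```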

### Option 4: Git-Based Staleness Check

Simple bash script that checks:

```bash
# Get the newest modification time among the rendered outputs
rendered_mtime=$(find docs/ai-training/dataset/rendered-templates -type f -printf '%T@\n' | sort -n | tail -1)

# Get the newest modification time among the inputs, templates, and rendering code
inputs_mtime=$(find docs/ai-training/dataset/environment-configs templates src/infrastructure/templating -type f -printf '%T@\n' | sort -n | tail -1)

# %T@ prints fractional seconds, which bash (( )) arithmetic cannot
# compare, so delegate the float comparison to awk
if awk -v a="$inputs_mtime" -v b="$rendered_mtime" 'BEGIN { exit !(a > b) }'; then
  echo "⚠️ Rendered outputs may be stale. Run: ./scripts/generate-ai-training-outputs.sh"
  exit 1
fi
```

**Pros:** Very simple, no changes to application code
**Cons:** Doesn't catch all cases (e.g., code logic changes without a file mtime change; fresh CI checkouts reset mtimes)

### Option 5: Pre-commit Hook (Auto-regeneration)

Add regeneration to pre-commit checks:

```bash
./scripts/generate-ai-training-outputs.sh
git add docs/ai-training/dataset/rendered-templates
```

**Pros:** Always up-to-date automatically
**Cons:** Slow pre-commit (renders 15 examples), may be disruptive

## Recommendation

**Short-term (easiest):** Option 4 (Git-based staleness check)

- Minimal implementation effort
- Good enough for catching most cases
- Can be added to pre-commit or CI immediately

**Long-term (best):** Option 3 (Deterministic timestamp flag)

- Enables proper testing and comparison
- Clean separation of test vs production behavior
- More maintainable and robust

## Implementation Estimate

- **Option 4:** ~30 minutes (bash script + CI integration)
- **Option 3:** ~2-3 hours (add flag, update templates, tests)
- **Option 2:** ~4-5 hours (metadata generation, comparison logic)
- **Option 1:** ~5-6 hours (normalization logic, CI workflow, edge cases)

## Related Files

- Script: `scripts/generate-ai-training-outputs.sh`
- Inputs: `docs/ai-training/dataset/environment-configs/*.json`
- Outputs: `docs/ai-training/dataset/rendered-templates/*/`
- Templates: `templates/**/*.tera`
- Rendering code: `src/infrastructure/templating/**`

## Notes

- Not urgent since the dataset is primarily for AI training, not runtime behavior
- Manual regeneration is acceptable for now (done periodically)
- Worth revisiting when/if we automate the AI model training pipeline