Commit 0bd9547

docs: [#339] add draft issue for AI dataset freshness verification

Document the problem of stale rendered templates in the AI training dataset and explore potential solutions, including timestamp mocking for deterministic testing. Deferred for future implementation when needed.

1 parent fa12000 · commit 0bd9547

1 file changed

Lines changed: 158 additions & 0 deletions

File tree

Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
# Draft Issue: Automated Verification of AI Training Dataset Freshness

**Status:** Draft
**Priority:** Low
**Category:** Developer Experience / Testing
**Related:** Issue #339 - AI Training Resources

## Problem

The AI training dataset in `docs/ai-training/dataset/` requires manual regeneration whenever:

- Input examples in `dataset/environment-configs/` change
- Template files in `templates/` are modified
- Rendering code changes in a way that affects output

Currently, developers must manually run `scripts/generate-ai-training-outputs.sh` to regenerate the `rendered-templates/` directory.

**Risk:** The rendered outputs can become stale without anyone noticing, leading to:

- Outdated training data for AI models
- Inconsistencies between documentation and actual behavior
- Confusion for users examining the examples

## Current Workaround

Regenerate manually by running:

```bash
./scripts/generate-ai-training-outputs.sh
```

## Challenge

Templates contain dynamic timestamps that make direct directory comparison difficult:

```text
# Generated: 2026-02-13T10:02:22Z
```

Any comparison mechanism would need to do one of the following:

- Strip/normalize timestamps before comparison
- Use a timestamp-agnostic comparison method
- Generate deterministic timestamps for testing
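
For example, stripping the timestamp lines before a diff is a one-liner (illustrative; the file path is hypothetical):

```bash
# Drop the "# Generated: ..." line so two renders can be compared directly.
sed '/^# Generated: /d' rendered-templates/example/docker-compose.yml
```
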
## Potential Solutions

### Option 1: CI Workflow with Timestamp Normalization

Create a GitHub Actions workflow that:

1. Strips timestamp lines matching the pattern `# Generated: YYYY-MM-DDTHH:MM:SSZ`
2. Compares the normalized outputs with the existing rendered templates
3. Fails if any differences remain (beyond timestamps)

**Pros:** Catches stale outputs automatically
**Cons:** Requires maintaining comparison logic; may produce false positives
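
A minimal sketch of the comparison step, assuming the regeneration script can be pointed at a scratch directory via an `OUTPUT_DIR` variable (that variable is an assumption, not an existing feature):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Regenerate into a temp dir (OUTPUT_DIR support is hypothetical).
fresh=$(mktemp -d)
OUTPUT_DIR="${fresh}" ./scripts/generate-ai-training-outputs.sh

# Copy a tree while dropping "# Generated: ..." lines, so the
# comparison ignores timestamps.
normalize() {
  local src=$1 dst=$2
  (cd "${src}" && find . -type f) | while read -r f; do
    mkdir -p "${dst}/$(dirname "${f}")"
    sed '/^# Generated: /d' "${src}/${f}" > "${dst}/${f}"
  done
}

norm_committed=$(mktemp -d)
norm_fresh=$(mktemp -d)
normalize docs/ai-training/dataset/rendered-templates "${norm_committed}"
normalize "${fresh}" "${norm_fresh}"

# Fail the job if anything besides timestamps differs.
diff -r "${norm_committed}" "${norm_fresh}" || {
  echo "Rendered templates are stale. Run: ./scripts/generate-ai-training-outputs.sh"
  exit 1
}
```
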

### Option 2: Metadata File with Input Hashes

Store a `.metadata.json` file alongside the rendered outputs:

```json
{
  "generated_at": "2026-02-13T10:02:22Z",
  "input_hashes": {
    "environment_configs": "sha256:abc123...",
    "templates": "sha256:def456...",
    "rendering_code": "sha256:ghi789..."
  },
  "tool_version": "0.1.0"
}
```

CI recomputes the hashes of the relevant paths and fails if they no longer match the stored values.

**Pros:** Fast, deterministic, no false positives
**Cons:** Requires metadata management; doesn't detect logic changes outside the hashed paths (e.g., dependency updates)
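
Computing a stable hash for each input directory could look like this (a sketch; the `hash_dir` helper is illustrative):

```bash
# Hash a directory deterministically: list files in sorted order, hash each
# file, then hash the combined listing.
hash_dir() {
  (cd "$1" && find . -type f | sort | xargs sha256sum) | sha256sum | cut -d' ' -f1
}

hash_dir docs/ai-training/dataset/environment-configs   # -> environment_configs
hash_dir templates                                       # -> templates
hash_dir src/infrastructure/templating                   # -> rendering_code
```
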

### Option 3: Deterministic Timestamp Flag

Add a `--mock-timestamp` or `--fixed-timestamp` flag to the render command:

```bash
torrust-tracker-deployer render ... --fixed-timestamp "2024-01-01T00:00:00Z"
```

Use this flag for test/CI renders so outputs can be compared byte for byte.

**Pros:** Simple; enables deterministic testing
**Cons:** CI outputs would carry fake timestamps (acceptable for testing purposes)
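
With such a flag, the CI check collapses to a plain recursive diff (a sketch; the flag, the `FIXED_TIMESTAMP` plumbing, and `OUTPUT_DIR` support are all hypothetical). Note that this only works if the committed dataset is also generated with the same pinned timestamp:

```bash
# Regenerate with a pinned timestamp, then compare byte for byte.
fresh=$(mktemp -d)
FIXED_TIMESTAMP="2024-01-01T00:00:00Z" OUTPUT_DIR="${fresh}" \
  ./scripts/generate-ai-training-outputs.sh
diff -r docs/ai-training/dataset/rendered-templates "${fresh}"
```
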

### Option 4: Git-Based Staleness Check

A simple bash script that compares modification times:

```bash
# Newest modification time among the rendered outputs.
# GNU find's %T@ prints epoch seconds with a fractional part; truncate to
# whole seconds (cut -d. -f1) because bash arithmetic cannot compare floats.
rendered_mtime=$(find docs/ai-training/dataset/rendered-templates -type f -printf '%T@\n' | sort -n | tail -1 | cut -d. -f1)

# Newest modification time among the inputs, templates, and rendering code
inputs_mtime=$(find docs/ai-training/dataset/environment-configs templates src/infrastructure/templating -type f -printf '%T@\n' | sort -n | tail -1 | cut -d. -f1)

if (( inputs_mtime > rendered_mtime )); then
  echo "⚠️ Rendered outputs may be stale. Run: ./scripts/generate-ai-training-outputs.sh"
  exit 1
fi
```

**Pros:** Very simple; no changes to application code
**Cons:** Doesn't catch all cases; in particular, `git clone`/checkout resets file mtimes, so the comparison is only meaningful in a working copy where the files were actually edited

### Option 5: Pre-commit Hook (Auto-regeneration)

Add regeneration to the pre-commit checks:

```bash
./scripts/generate-ai-training-outputs.sh
git add docs/ai-training/dataset/rendered-templates
```

**Pros:** Always up to date automatically
**Cons:** Slows down pre-commit (renders 15 examples) and may be disruptive
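
To keep the hook fast, regeneration could be skipped unless staged changes touch the relevant paths (a sketch of the hook body):

```bash
# Regenerate only when staged changes touch inputs, templates, or rendering code.
if git diff --cached --quiet -- \
    docs/ai-training/dataset/environment-configs \
    templates \
    src/infrastructure/templating; then
  exit 0  # nothing relevant changed
fi
./scripts/generate-ai-training-outputs.sh
git add docs/ai-training/dataset/rendered-templates
```
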

## Recommendation

**Short-term (easiest):** Option 4 (git-based staleness check)

- Minimal implementation effort
- Good enough to catch most cases
- Can be added to pre-commit or CI immediately

**Long-term (best):** Option 3 (deterministic timestamp flag)

- Enables proper testing and comparison
- Clean separation of test vs. production behavior
- More maintainable and robust

## Implementation Estimate

- **Option 4:** ~30 minutes (bash script + CI integration)
- **Option 3:** ~2-3 hours (add flag, update templates, tests)
- **Option 2:** ~4-5 hours (metadata generation, comparison logic)
- **Option 1:** ~5-6 hours (normalization logic, CI workflow, edge cases)

## Related Files

- Script: `scripts/generate-ai-training-outputs.sh`
- Inputs: `docs/ai-training/dataset/environment-configs/*.json`
- Outputs: `docs/ai-training/dataset/rendered-templates/*/`
- Templates: `templates/**/*.tera`
- Rendering code: `src/infrastructure/templating/**`

## Notes

- Not urgent, since the dataset is primarily for AI training, not runtime behavior
- Manual regeneration is acceptable for now (done periodically)
- Worth revisiting when/if the AI model training pipeline is automated
