Summary
The #873 fix in 82bd92c correctly wires _verify_sha256 into api_fw_binary and refuses serve with HTTP 502 on mismatch. But the poisoned cache file is left in place — every subsequent request recomputes the bad SHA, fails again, and 502s. The operator must manually clear %APPDATA%\SlyLED\firmware\<board>\ to recover.
This contradicts the design pattern established by #870, where the 2 MB size-guard auto-deletes poisoned cache files so the next request triggers a fresh fetch. SHA mismatch should self-heal the same way.
Evidence (HEAD = 82bd92c)
SHA-mismatch refuse, no cleanup:
# desktop/shared/parent_server.py:18432-18445
if expected_sha:
verdict = _verify_sha256(bin_path, expected_sha)
if verdict is False:
log.error("OTA SHA mismatch for %s: expected %s",
board, expected_sha[:12])
return jsonify(ok=False,
err=f"otaSha256 mismatch for {board}"), 502
The cached file at bin_path stays on disk untouched. No os.unlink(bin_path), no rename to .bad, no log indicating the cache is poisoned.
Existing self-heal pattern (#870, for size mismatch):
# desktop/shared/parent_server.py:18355-18362
if _ota_cache_is_poisoned(board, cand):
log.warning("Removing poisoned OTA cache: %s", cand)
try:
os.unlink(cand)
except OSError:
pass
The two failure modes (size mismatch → auto-clean; SHA mismatch → manual cleanup) are inconsistent.
Smaller diagnostic gaps in the same area
While the cache-cleanup is the headline, two smaller observability gaps were noted during the audit:
- Mismatch log doesn't include actual computed hash — only expected (truncated). Diagnosing "is the cache corrupt or is the registry pin wrong?" currently requires manual re-hashing.
- Missing
otaSha256 field is silent — if expected_sha: skips the check with no warn-log. If build_release.ps1 regresses and stops writing the pin, no signal in the orchestrator logs would surface it.
Proposed fix sketch (operator's call)
if expected_sha:
verdict = _verify_sha256(bin_path, expected_sha)
if verdict is False:
# Compute actual hash for diagnostics
try:
with open(bin_path, "rb") as f:
actual = hashlib.sha256(f.read()).hexdigest()
except OSError:
actual = "<unreadable>"
log.error("OTA SHA mismatch for %s: expected %s, got %s — "
"removing poisoned cache",
board, expected_sha[:12], actual[:12])
try:
os.unlink(bin_path)
except OSError:
pass
return jsonify(ok=False,
err=f"otaSha256 mismatch for {board}"), 502
elif entry:
log.warning("OTA registry entry for %s lacks otaSha256 — "
"serving without integrity check", board)
The unlink mirrors _ota_cache_is_poisoned's pattern. Operator's next request retries, GitHub fetch repopulates the cache, and if the CDN is now serving correct bytes the system self-heals without manual intervention.
Acceptance
Related
Summary
The #873 fix in
82bd92ccorrectly wires_verify_sha256intoapi_fw_binaryand refuses serve with HTTP 502 on mismatch. But the poisoned cache file is left in place — every subsequent request recomputes the bad SHA, fails again, and 502s. The operator must manually clear%APPDATA%\SlyLED\firmware\<board>\to recover.This contradicts the design pattern established by #870, where the 2 MB size-guard auto-deletes poisoned cache files so the next request triggers a fresh fetch. SHA mismatch should self-heal the same way.
Evidence (HEAD = 82bd92c)
SHA-mismatch refuse, no cleanup:
The cached file at
bin_pathstays on disk untouched. Noos.unlink(bin_path), no rename to.bad, no log indicating the cache is poisoned.Existing self-heal pattern (#870, for size mismatch):
The two failure modes (size mismatch → auto-clean; SHA mismatch → manual cleanup) are inconsistent.
Smaller diagnostic gaps in the same area
While the cache-cleanup is the headline, two smaller observability gaps were noted during the audit:
otaSha256field is silent —if expected_sha:skips the check with no warn-log. Ifbuild_release.ps1regresses and stops writing the pin, no signal in the orchestrator logs would surface it.Proposed fix sketch (operator's call)
The unlink mirrors
_ota_cache_is_poisoned's pattern. Operator's next request retries, GitHub fetch repopulates the cache, and if the CDN is now serving correct bytes the system self-heals without manual intervention.Acceptance
api_fw_binarydeletes (or.bad-renames) the poisonedbin_pathbefore returning 502.entry["otaSha256"]is missing/None, a warn-log fires ("serving without integrity check"or similar).tests/test_870_ota_app_only.py: pre-stage cached binary with wrong SHA, request → 502, andos.path.exists(cache_path) is False(or renamed).otaSha256field triggers a warn log entry (capture viacaplog).otaSha256registry pin is unverified —api_fw_binarynever calls_verify_sha256#873 (test_873_sha_mismatch_refuses_serve_with_502,test_873_sha_match_serves_200,test_873_no_sha_pin_falls_through_for_back_compat,test_873_d1mini_uses_legacy_sha256_pin) still pass — adjust the mismatch test to also assert cache deletion.Related
_ota_cache_is_poisoned.otaSha256registry pin is unverified —api_fw_binarynever calls_verify_sha256#873 (closed) — wired_verify_sha256intoapi_fw_binary. This issue is the audit-found follow-up: integrity check landed but didn't extend the self-heal contract to the new failure mode.