Skip to content

OTA follow-up to #873: SHA mismatch refuses serve but doesn't auto-clean the poisoned cache (breaks #870 self-heal contract) #875

@SlyWombat

Description

@SlyWombat

Summary

The #873 fix in 82bd92c correctly wires _verify_sha256 into api_fw_binary and refuses serve with HTTP 502 on mismatch. But the poisoned cache file is left in place — every subsequent request recomputes the bad SHA, fails again, and 502s. The operator must manually clear %APPDATA%\SlyLED\firmware\<board>\ to recover.

This contradicts the design pattern established by #870, where the 2 MB size-guard auto-deletes poisoned cache files so the next request triggers a fresh fetch. SHA mismatch should self-heal the same way.

Evidence (HEAD = 82bd92c)

SHA-mismatch refuse, no cleanup:

# desktop/shared/parent_server.py:18432-18445
if expected_sha:
    verdict = _verify_sha256(bin_path, expected_sha)
    if verdict is False:
        log.error("OTA SHA mismatch for %s: expected %s",
                  board, expected_sha[:12])
        return jsonify(ok=False,
                       err=f"otaSha256 mismatch for {board}"), 502

The cached file at bin_path stays on disk untouched. No os.unlink(bin_path), no rename to .bad, no log indicating the cache is poisoned.

Existing self-heal pattern (#870, for size mismatch):

# desktop/shared/parent_server.py:18355-18362
if _ota_cache_is_poisoned(board, cand):
    log.warning("Removing poisoned OTA cache: %s", cand)
    try:
        os.unlink(cand)
    except OSError:
        pass

The two failure modes (size mismatch → auto-clean; SHA mismatch → manual cleanup) are inconsistent.

Smaller diagnostic gaps in the same area

While the cache-cleanup is the headline, two smaller observability gaps were noted during the audit:

  1. Mismatch log doesn't include actual computed hash — only expected (truncated). Diagnosing "is the cache corrupt or is the registry pin wrong?" currently requires manual re-hashing.
  2. Missing otaSha256 field is silentif expected_sha: skips the check with no warn-log. If build_release.ps1 regresses and stops writing the pin, no signal in the orchestrator logs would surface it.

Proposed fix sketch (operator's call)

if expected_sha:
    verdict = _verify_sha256(bin_path, expected_sha)
    if verdict is False:
        # Compute actual hash for diagnostics
        try:
            with open(bin_path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            actual = "<unreadable>"
        log.error("OTA SHA mismatch for %s: expected %s, got %s — "
                  "removing poisoned cache",
                  board, expected_sha[:12], actual[:12])
        try:
            os.unlink(bin_path)
        except OSError:
            pass
        return jsonify(ok=False,
                       err=f"otaSha256 mismatch for {board}"), 502
elif entry:
    log.warning("OTA registry entry for %s lacks otaSha256 — "
                "serving without integrity check", board)

The unlink mirrors _ota_cache_is_poisoned's pattern. Operator's next request retries, GitHub fetch repopulates the cache, and if the CDN is now serving correct bytes the system self-heals without manual intervention.

Acceptance

  • On SHA mismatch, api_fw_binary deletes (or .bad-renames) the poisoned bin_path before returning 502.
  • Mismatch log includes actual computed hash (truncated) alongside expected.
  • When entry["otaSha256"] is missing/None, a warn-log fires ("serving without integrity check" or similar).
  • New test in tests/test_870_ota_app_only.py: pre-stage cached binary with wrong SHA, request → 502, and os.path.exists(cache_path) is False (or renamed).
  • New test: missing otaSha256 field triggers a warn log entry (capture via caplog).
  • Existing 4 tests added by OTA follow-up to #870: otaSha256 registry pin is unverified — api_fw_binary never calls _verify_sha256 #873 (test_873_sha_mismatch_refuses_serve_with_502, test_873_sha_match_serves_200, test_873_no_sha_pin_falls_through_for_back_compat, test_873_d1mini_uses_legacy_sha256_pin) still pass — adjust the mismatch test to also assert cache deletion.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions