Phase 1: critical security — bind anon DNA tasks, fix DEBUG default, lock task whitelist by vanajmoorthy · Pull Request #106 · vanajmoorthy/bibliotype

vanajmoorthy · 2026-05-14T16:08:15Z

Summary

Phase 1 of the multi-agent triage remediation. Closes the three Critical-severity security findings from the triage report:

Anonymous DNA hijack via guessable task_id (US-001, US-002, US-003) — anyone who learned a task_id (Referer leak, shared link, log) could pull another user's DNA into their own session, or sign up with the stolen task_id and inherit the whole library.
DEBUG defaulting to True (US-004) — a missing/typo'd env var silently exposed Silk profiler and disabled secure cookies.
run_management_command_task had no in-task whitelist (US-005) — admin had a whitelist but a malicious caller publishing directly to the broker could bypass it.

Also lands the Ralph harness + the 8-phase remediation PRD; subsequent PRs (Phase 2..8) will work off the same branch.

What changed

Critical security fixes (US-001 → US-005)

US-001 (011ab73) — anonymous branch of upload_view now binds task_owner_<task_id> in Redis to the visitor's session_key at upload time (TTL 3600s). Session is .save()d first if session_key is None.
US-002 (07abc13) — get_task_result_view checks task_owner_<task_id> and rejects mismatched callers with 403. Gated behind a new ENFORCE_TASK_OWNERSHIP env flag (defaults False) so in-flight pre-US-001 task IDs aren't immediately broken. Logs task_owner check skipped warnings while the flag is off so operators can drain legacy traffic.
US-003 (ff16c64) — signup_view validates task_id_to_claim against the session's own anonymous_task_id BEFORE creating the user (and before login() rotates session_key). The pre-login session_key is captured into a local and passed as a new third arg to claim_anonymous_dna_task.delay(...). The task refuses any claim where task_owner_<task_id> doesn't match the supplied key.
US-004 (e978af4) — DEBUG = os.environ.get(\"DEBUG\", \"False\") == \"True\" (was \"True\" default); ImproperlyConfigured raised at startup if DJANGO_ENV=production and DEBUG=True. Also adds explicit DEBUG=True + ALLOWED_HOSTS=localhost,127.0.0.1 to docker-compose.local.yml because dev was implicitly relying on the old default.
US-005 (44b8a55) — ADMIN_COMMANDS / ALLOWED_COMMANDS carved into new core/management_command_registry.py (third module — sidesteps the tasks.py → admin.py cycle that would have been created if we imported the other direction). run_management_command_task now raises ValueError(f\"command not allowed: {command_name}\") as its first statement.

Required deploy actions

⚠️ Set ENFORCE_TASK_OWNERSHIP=True in the production env ≥1 hour after this deploy lands. Until then, US-002's strict check is off (legacy task IDs predating US-001 are warn-and-allow). Grep production logs for task_owner check skipped — once it drops to zero for ≥1 hour, flip the flag.
⚠️ Set DEBUG=False (or unset it entirely) and DJANGO_ENV=production in the production env. With the new default, a missing DEBUG env var is safe; with the new assert, an accidentally-set DEBUG=True in production will refuse to start.
⚠️ Local dev now requires DEBUG=True to be set explicitly. The compose file at docker-compose.local.yml is updated; if you run Django outside Docker (honcho start), ensure your .env has DEBUG=True.

Scaffolding (not user-facing)

tasks/prd-codebase-triage-fixes.md — full 69-story remediation PRD (8 phases, ~1250 lines).
scripts/ralph/ — Ralph harness (ralph.sh, prompt, prd.json work list, PRD-to-JSON converter).
AGENTS.md — project conventions + settings invariants + Redis key registry (referenced by future stories).
progress.txt — append-only chronological log; populated as iterations land.

Test plan

All five stories shipped with regression tests. New / updated tests live in:

core/tests/test_views_e2e.py — anon upload-binds-owner, owner-positive read, foreign-session strict 403, foreign-session legacy warn, signup-cross-session reject, signup-post-rotation positive.
core/tests/test_tasks_integration.py — claim_anonymous_dna_task direct-invocation mismatch rejection.
core/tests/test_tasks_unit.py — run_management_command_task whitelist (blocked-name + empty-string negatives, whitelisted positive).

Pre-merge checklist

CI passes (full Django test suite).
Confirm ENFORCE_TASK_OWNERSHIP env var is in the production env (default False).
Confirm prod DEBUG env is unset OR explicitly False; DJANGO_ENV=production set.
Manual smoke test of: anonymous upload → dashboard → recommendations, authenticated upload → dashboard, signup with anon-DNA claim, login, password reset.
Plan the ENFORCE_TASK_OWNERSHIP=True flip for 1h+ post-deploy.

- scripts/ralph/ralph.sh — customized harness (uses RALPH_PROMPT.md to avoid CLAUDE.md collision; runs claude with project-root cwd) - scripts/ralph/RALPH_PROMPT.md — agent prompt with Bibliotype-specific guardrails, blocked-prereq handling, and stop condition - scripts/ralph/convert_prd.py — PRD markdown -> prd.json converter - scripts/ralph/prd.json — 70 stories (US-000 marked done at scaffold) - progress.txt + AGENTS.md — initialized at project root per US-000 - tasks/prd-codebase-triage-fixes.md — copied from origin work tree [US-000] - Create AGENTS.md and progress.txt scaffolding

@patch

Findings from independent agent review + spot-checks: BLOCKERS: - US-007: premise false (no .prof files tracked); converted to verify-only - US-008a-i/ii: false 'broken-commit' premise — tests don't reference csv/ paths at runtime (they use SimpleUploadedFile inline). Collapsed into single atomic US-008a. - US-014: safe_cache_delete is intentionally re-exported (consumed by test_cache_refactor.py); story now targets only Avg + Author. - US-022: dropping Book.objects.get() refetch produced stale global_read_count and broke niche-book stats; replaced with book.refresh_from_db(fields=['global_read_count']). - US-024c: cache key was 'recs_<id>' but actual is 'user_recommendations_<id>'; also documented anon mirror. - US-031a: explicit AC bullet for @patch decorator targets (test_cache_refactor.py:273 was a hidden footgun). - US-033: deletion range was wrong (would have deleted hot-path community- averages refresh, not the legacy backfills). Narrowed scope with identifying characteristics. - US-034: removed entirely — both 'unreachable' guards have explicit tests (test_currently_reading:22-25, test_sanitize_review:50). Simplification not worth test-update churn. Marked passes=true at scaffold. HIGH: - US-027: rate_limit='30/m' is NOT present on enrich_book_task today; AC must ADD it (was 'confirm'). - US-005: import existing ALLOWED_COMMANDS from core/admin instead of redefining as second source of truth. - US-041: rephrased 'add' → 'promote to module level' for BATCH_LIMIT / AGE_THRESHOLD_DAYS (already exist function-local). - US-042: replaced specific line numbers with grep guidance. - US-001: explicit scope to anonymous branch of upload_view. PRD-WIDE: - §4 conventions: explicit note that line numbers are advisory; re-anchor on symbols (codebase has shifted since triage). prd.json regenerated from updated markdown. US-000 and US-034 are pre-passed (Ralph skips them). 69 stories total.

…oad time Anonymous branch of upload_view now ensures session_key exists, then writes task_owner_<task_id> to Redis (TTL 3600s) so US-002/US-003 can refuse cross-session task lookups. Adds positive e2e test covering both the session binding and the cache write.

…to caller's session Gate strict ownership behind ENFORCE_TASK_OWNERSHIP env flag (default False); legacy traffic still passes but each unowned lookup logs a WARNING so operators can drain pre-US-001 task IDs before flipping the kill-switch in production.

…n session_key signup_view now (a) validates POSTed task_id_to_claim matches session["anonymous_task_id"] before account creation and (b) captures pre_login_session_key before login() rotates it. The captured key is passed to claim_anonymous_dna_task, which verifies it against task_owner_<task_id> in cache and aborts the claim on mismatch. Closes the signup-time DNA hijack hole: a stolen task_id can no longer be used to inherit a stranger's library because the session_key bound at upload (US-001) must match the key the signup view captures.

…oduction

…t_command_task Extract ADMIN_COMMANDS + ALLOWED_COMMANDS out of core/admin.py into a new core/management_command_registry.py module. core/admin.py imports them from there (same frozenset object), and run_management_command_task now guards its first line with `if command_name not in ALLOWED_COMMANDS: raise ValueError(f"command not allowed: {command_name}")`. This is defence-in-depth on top of the existing command_run_api check in core/admin.py: a caller who publishes directly to the Celery broker (bypassing the HTTP layer) can no longer invoke an arbitrary management command. Avoids a circular import by putting the registry in its own module — core/admin.py already does a deferred `from .tasks import run_management_command_task` inside command_run_api, so importing ALLOWED_COMMANDS from admin into tasks would invert that at startup.

Addresses all 4 BLOCKER + several MEDIUM findings from the post-Phase-1 security review. The original PR's default config didn't actually close the anonymous-DNA-hijack vulnerability — this commit closes it for real. Blocker fixes ============= #1 — `get_task_result_view` no longer writes DNA into the session on ownership mismatch (regardless of `ENFORCE_TASK_OWNERSHIP`). The old warn-and-allow path wrote `session["dna_data"]` even when the caller wasn't the owner, which combined with signup_view's `if "dna_data" in request.session:` branch to re-open the hijack. Now: cache miss or mismatch → 403 + log, no session writes. `ENFORCE_TASK_OWNERSHIP` is retained for deploy-doc compat but no longer affects the response (documented in settings.py). #2 — `claim_anonymous_dna_task(session_key)` is now a REQUIRED positional arg (was `session_key: str = None`). Previously, omitting it silently bypassed the ownership check. Direct broker publishers can no longer skip the check by leaving the arg off. #3 — Both the view and the task now fail closed on cache miss. The 1hr TTL on `task_owner_<id>` used to be a security hole: after expiry the check silently passed because `owner_session_key is None`. Now treated as "task unowned, reject" — the user re-uploads if their TTL ran out (the cached DNA at `dna_result_<id>` has the same TTL anyway). #4 — Session keys are session bearer credentials. They are no longer logged raw. Both views and tasks now hash with sha256[:12] before emitting any log line. Forensic value preserved; credential leak eliminated. Medium fixes ============ #6 — `signup_view` now ALSO checks `task_owner_<task_id>` against the caller's session_key before accepting `task_id_to_claim`. Layered with the existing `session["anonymous_task_id"]` consistency check for defense in depth. #7 — DEBUG production assert widened. Was `DJANGO_ENV == "production"` exact match (so `DJANGO_ENV=staging` silently allowed DEBUG=True). Now uses a whitelist of permitted DEBUG envs (`development`, `test`, `ci`); anything else with DEBUG=True raises ImproperlyConfigured. Also adds a `_env_bool` helper that fails loud on unrecognized values — typoed `ENFORCE_TASK_OWNERSHIP=true` (lowercase) no longer silently parses as False. #12 — On any rejection in `claim_anonymous_dna_task` (missing key, cache miss, or mismatch), `pending_dna_task_id` is cleared so the rightful user's dashboard doesn't poll forever. Polish ====== #10 — Removed hardcoded absolute username path from RALPH_PROMPT.md. #11 — Added `scripts/ralph/`, `scripts/regression/`, `tasks/`, `progress.txt`, `AGENTS.md`, `.claude/` to .dockerignore so the internal triage scaffolding doesn't ship in production images. Also added `.claude/worktrees/` and `scripts/ralph/.last-branch` to .gitignore. Tests ===== Updated `test_task_result_warns_but_allows_when_not_enforced` to `test_task_result_rejects_foreign_session_even_when_not_enforced` — the original test was happy-path-shaped and would have passed against the buggy fix. New assertions verify the attacker's session has NO leaked DNA data after the 403 response (finding #9). New negative tests: - `test_claim_anonymous_dna_task_fails_closed_on_cache_miss` - `test_claim_anonymous_dna_task_fails_closed_on_missing_session_key` Updated positive `test_claim_anonymous_dna_task` to pass a session_key arg (was 2-arg, would have failed after #2's required-arg change). Full suite: 359 tests, all green.

vanajmoorthy · 2026-05-15T11:03:42Z

Review fixes pushed (`e80c734`)

All 4 BLOCKER findings + bonus MEDIUM/LOW polish from the post-merge security review are now addressed. Full suite remains green (359 tests).

Blockers closed:

Features #1 — get_task_result_view no longer writes session DNA on ownership mismatch (regardless of ENFORCE_TASK_OWNERSHIP). The original warn-and-allow path leaked into session["dna_data"], which combined with signup_view's session-DNA branch to re-open the hijack.
feat: set up basic posthog event #2 — claim_anonymous_dna_task(session_key) is now a required positional arg. Direct broker publishers can no longer omit it to skip the check.
css #3 — Both view and task fail closed on cache miss (TTL expiry was a silent bypass).
Update README.md #4 — Session keys hashed with sha256[:12] before logging (they're bearer credentials).

Medium fixes:

update meter #6 — signup_view now ALSO checks task_owner_<id> cache against caller session_key, layered with existing session["anonymous_task_id"] consistency check.
Update README.md #7 — DEBUG production assert uses a whitelist of allowed envs (development, test, ci); staging is no longer silently permissive. New _env_bool helper fails loud on typos like ENFORCE_TASK_OWNERSHIP=true (lowercase) — was previously silently coerced to False.
use google books and add mgmt command to fix genres and update mappings #12 — pending_dna_task_id is cleared on any claim-task rejection, so the rightful user's dashboard doesn't poll a doomed task forever.

Polish (#10, #11): RALPH_PROMPT.md no longer ships with hardcoded username path; scripts/ralph/, scripts/regression/, tasks/, progress.txt, AGENTS.md, .claude/ added to .dockerignore so internal triage scaffolding doesn't end up in the production image.

Test hardening (#9): test_task_result_warns_but_allows_when_not_enforced (which would have passed against the buggy fix) replaced with test_task_result_rejects_foreign_session_even_when_not_enforced that asserts the attacker's session has NO leaked dna_data after the 403. Plus two new fail-closed negative tests for the claim task (cache-miss + missing-key paths).

Deploy notes update: the ENFORCE_TASK_OWNERSHIP env var is now vestigial — fail-closed semantics apply regardless. Kept for forward compat with operational docs.

Ready for re-review.

vanajmoorthy added 8 commits May 10, 2026 00:04

feat: [US-004] - DEBUG defaults to False and add startup assert in pr…

e978af4

…oduction

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 1: critical security — bind anon DNA tasks, fix DEBUG default, lock task whitelist#106

Phase 1: critical security — bind anon DNA tasks, fix DEBUG default, lock task whitelist#106
vanajmoorthy wants to merge 8 commits into
mainfrom
triage/codebase-fixes

vanajmoorthy commented May 14, 2026

Uh oh!

vanajmoorthy commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vanajmoorthy commented May 14, 2026

Summary

What changed

Critical security fixes (US-001 → US-005)

Required deploy actions

Scaffolding (not user-facing)

Test plan

Pre-merge checklist

Related

Uh oh!

vanajmoorthy commented May 15, 2026

Review fixes pushed (e80c734)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Review fixes pushed (`e80c734`)