Skip to content

Phase 1: critical security β€” bind anon DNA tasks, fix DEBUG default, lock task whitelist#106

Open
vanajmoorthy wants to merge 8 commits into
mainfrom
triage/codebase-fixes
Open

Phase 1: critical security β€” bind anon DNA tasks, fix DEBUG default, lock task whitelist#106
vanajmoorthy wants to merge 8 commits into
mainfrom
triage/codebase-fixes

Conversation

@vanajmoorthy
Copy link
Copy Markdown
Owner

Summary

Phase 1 of the multi-agent triage remediation. Closes the three Critical-severity security findings from the triage report:

  1. Anonymous DNA hijack via guessable task_id (US-001, US-002, US-003) β€” anyone who learned a task_id (Referer leak, shared link, log) could pull another user's DNA into their own session, or sign up with the stolen task_id and inherit the whole library.
  2. DEBUG defaulting to True (US-004) β€” a missing/typo'd env var silently exposed Silk profiler and disabled secure cookies.
  3. run_management_command_task had no in-task whitelist (US-005) β€” admin had a whitelist but a malicious caller publishing directly to the broker could bypass it.

Also lands the Ralph harness + the 8-phase remediation PRD; subsequent PRs (Phase 2..8) will work off the same branch.

What changed

Critical security fixes (US-001 β†’ US-005)

  • US-001 (011ab73) β€” anonymous branch of upload_view now binds task_owner_<task_id> in Redis to the visitor's session_key at upload time (TTL 3600s). Session is .save()d first if session_key is None.
  • US-002 (07abc13) β€” get_task_result_view checks task_owner_<task_id> and rejects mismatched callers with 403. Gated behind a new ENFORCE_TASK_OWNERSHIP env flag (defaults False) so in-flight pre-US-001 task IDs aren't immediately broken. Logs task_owner check skipped warnings while the flag is off so operators can drain legacy traffic.
  • US-003 (ff16c64) β€” signup_view validates task_id_to_claim against the session's own anonymous_task_id BEFORE creating the user (and before login() rotates session_key). The pre-login session_key is captured into a local and passed as a new third arg to claim_anonymous_dna_task.delay(...). The task refuses any claim where task_owner_<task_id> doesn't match the supplied key.
  • US-004 (e978af4) β€” DEBUG = os.environ.get(\"DEBUG\", \"False\") == \"True\" (was \"True\" default); ImproperlyConfigured raised at startup if DJANGO_ENV=production and DEBUG=True. Also adds explicit DEBUG=True + ALLOWED_HOSTS=localhost,127.0.0.1 to docker-compose.local.yml because dev was implicitly relying on the old default.
  • US-005 (44b8a55) β€” ADMIN_COMMANDS / ALLOWED_COMMANDS carved into new core/management_command_registry.py (third module β€” sidesteps the tasks.py β†’ admin.py cycle that would have been created if we imported the other direction). run_management_command_task now raises ValueError(f\"command not allowed: {command_name}\") as its first statement.

Required deploy actions

  • ⚠️ Set ENFORCE_TASK_OWNERSHIP=True in the production env β‰₯1 hour after this deploy lands. Until then, US-002's strict check is off (legacy task IDs predating US-001 are warn-and-allow). Grep production logs for task_owner check skipped β€” once it drops to zero for β‰₯1 hour, flip the flag.
  • ⚠️ Set DEBUG=False (or unset it entirely) and DJANGO_ENV=production in the production env. With the new default, a missing DEBUG env var is safe; with the new assert, an accidentally-set DEBUG=True in production will refuse to start.
  • ⚠️ Local dev now requires DEBUG=True to be set explicitly. The compose file at docker-compose.local.yml is updated; if you run Django outside Docker (honcho start), ensure your .env has DEBUG=True.

Scaffolding (not user-facing)

  • tasks/prd-codebase-triage-fixes.md β€” full 69-story remediation PRD (8 phases, ~1250 lines).
  • scripts/ralph/ β€” Ralph harness (ralph.sh, prompt, prd.json work list, PRD-to-JSON converter).
  • AGENTS.md β€” project conventions + settings invariants + Redis key registry (referenced by future stories).
  • progress.txt β€” append-only chronological log; populated as iterations land.

Test plan

All five stories shipped with regression tests. New / updated tests live in:

  • core/tests/test_views_e2e.py β€” anon upload-binds-owner, owner-positive read, foreign-session strict 403, foreign-session legacy warn, signup-cross-session reject, signup-post-rotation positive.
  • core/tests/test_tasks_integration.py β€” claim_anonymous_dna_task direct-invocation mismatch rejection.
  • core/tests/test_tasks_unit.py β€” run_management_command_task whitelist (blocked-name + empty-string negatives, whitelisted positive).

Pre-merge checklist

  • CI passes (full Django test suite).
  • Confirm ENFORCE_TASK_OWNERSHIP env var is in the production env (default False).
  • Confirm prod DEBUG env is unset OR explicitly False; DJANGO_ENV=production set.
  • Manual smoke test of: anonymous upload β†’ dashboard β†’ recommendations, authenticated upload β†’ dashboard, signup with anon-DNA claim, login, password reset.
  • Plan the ENFORCE_TASK_OWNERSHIP=True flip for 1h+ post-deploy.

Related

  • PRD: tasks/prd-codebase-triage-fixes.md
  • Phase 2 target (next PR): chore: remove dead code, scripts, profiles, personal CSVs, test fixtures (US-006..US-014).

- scripts/ralph/ralph.sh β€” customized harness (uses RALPH_PROMPT.md
  to avoid CLAUDE.md collision; runs claude with project-root cwd)
- scripts/ralph/RALPH_PROMPT.md β€” agent prompt with Bibliotype-specific
  guardrails, blocked-prereq handling, and stop condition
- scripts/ralph/convert_prd.py β€” PRD markdown -> prd.json converter
- scripts/ralph/prd.json β€” 70 stories (US-000 marked done at scaffold)
- progress.txt + AGENTS.md β€” initialized at project root per US-000
- tasks/prd-codebase-triage-fixes.md β€” copied from origin work tree

[US-000] - Create AGENTS.md and progress.txt scaffolding
Findings from independent agent review + spot-checks:

BLOCKERS:
- US-007: premise false (no .prof files tracked); converted to verify-only
- US-008a-i/ii: false 'broken-commit' premise β€” tests don't reference csv/
  paths at runtime (they use SimpleUploadedFile inline). Collapsed into
  single atomic US-008a.
- US-014: safe_cache_delete is intentionally re-exported (consumed by
  test_cache_refactor.py); story now targets only Avg + Author.
- US-022: dropping Book.objects.get() refetch produced stale
  global_read_count and broke niche-book stats; replaced with
  book.refresh_from_db(fields=['global_read_count']).
- US-024c: cache key was 'recs_<id>' but actual is 'user_recommendations_<id>';
  also documented anon mirror.
- US-031a: explicit AC bullet for @patch decorator targets
  (test_cache_refactor.py:273 was a hidden footgun).
- US-033: deletion range was wrong (would have deleted hot-path community-
  averages refresh, not the legacy backfills). Narrowed scope with
  identifying characteristics.
- US-034: removed entirely β€” both 'unreachable' guards have explicit tests
  (test_currently_reading:22-25, test_sanitize_review:50). Simplification
  not worth test-update churn. Marked passes=true at scaffold.

HIGH:
- US-027: rate_limit='30/m' is NOT present on enrich_book_task today; AC
  must ADD it (was 'confirm').
- US-005: import existing ALLOWED_COMMANDS from core/admin instead of
  redefining as second source of truth.
- US-041: rephrased 'add' β†’ 'promote to module level' for BATCH_LIMIT /
  AGE_THRESHOLD_DAYS (already exist function-local).
- US-042: replaced specific line numbers with grep guidance.
- US-001: explicit scope to anonymous branch of upload_view.

PRD-WIDE:
- Β§4 conventions: explicit note that line numbers are advisory; re-anchor
  on symbols (codebase has shifted since triage).

prd.json regenerated from updated markdown. US-000 and US-034 are
pre-passed (Ralph skips them). 69 stories total.
…oad time

Anonymous branch of upload_view now ensures session_key exists, then writes
task_owner_<task_id> to Redis (TTL 3600s) so US-002/US-003 can refuse
cross-session task lookups. Adds positive e2e test covering both the session
binding and the cache write.
…to caller's session

Gate strict ownership behind ENFORCE_TASK_OWNERSHIP env flag (default
False); legacy traffic still passes but each unowned lookup logs a
WARNING so operators can drain pre-US-001 task IDs before flipping the
kill-switch in production.
…n session_key

signup_view now (a) validates POSTed task_id_to_claim matches
session["anonymous_task_id"] before account creation and (b) captures
pre_login_session_key before login() rotates it. The captured key is
passed to claim_anonymous_dna_task, which verifies it against
task_owner_<task_id> in cache and aborts the claim on mismatch.

Closes the signup-time DNA hijack hole: a stolen task_id can no longer
be used to inherit a stranger's library because the session_key bound
at upload (US-001) must match the key the signup view captures.
…t_command_task

Extract ADMIN_COMMANDS + ALLOWED_COMMANDS out of core/admin.py into a new
core/management_command_registry.py module. core/admin.py imports them
from there (same frozenset object), and run_management_command_task now
guards its first line with `if command_name not in ALLOWED_COMMANDS:
raise ValueError(f"command not allowed: {command_name}")`.

This is defence-in-depth on top of the existing command_run_api check
in core/admin.py: a caller who publishes directly to the Celery broker
(bypassing the HTTP layer) can no longer invoke an arbitrary management
command.

Avoids a circular import by putting the registry in its own module β€”
core/admin.py already does a deferred `from .tasks import
run_management_command_task` inside command_run_api, so importing
ALLOWED_COMMANDS from admin into tasks would invert that at startup.
Addresses all 4 BLOCKER + several MEDIUM findings from the post-Phase-1
security review. The original PR's default config didn't actually close the
anonymous-DNA-hijack vulnerability β€” this commit closes it for real.

Blocker fixes
=============

#1 β€” `get_task_result_view` no longer writes DNA into the session on
ownership mismatch (regardless of `ENFORCE_TASK_OWNERSHIP`). The old
warn-and-allow path wrote `session["dna_data"]` even when the caller
wasn't the owner, which combined with signup_view's
`if "dna_data" in request.session:` branch to re-open the hijack.
Now: cache miss or mismatch β†’ 403 + log, no session writes.
`ENFORCE_TASK_OWNERSHIP` is retained for deploy-doc compat but no longer
affects the response (documented in settings.py).

#2 β€” `claim_anonymous_dna_task(session_key)` is now a REQUIRED positional
arg (was `session_key: str = None`). Previously, omitting it silently
bypassed the ownership check. Direct broker publishers can no longer skip
the check by leaving the arg off.

#3 β€” Both the view and the task now fail closed on cache miss. The 1hr
TTL on `task_owner_<id>` used to be a security hole: after expiry the
check silently passed because `owner_session_key is None`. Now treated
as "task unowned, reject" β€” the user re-uploads if their TTL ran out
(the cached DNA at `dna_result_<id>` has the same TTL anyway).

#4 β€” Session keys are session bearer credentials. They are no longer
logged raw. Both views and tasks now hash with sha256[:12] before
emitting any log line. Forensic value preserved; credential leak eliminated.

Medium fixes
============

#6 β€” `signup_view` now ALSO checks `task_owner_<task_id>` against the
caller's session_key before accepting `task_id_to_claim`. Layered with
the existing `session["anonymous_task_id"]` consistency check for defense
in depth.

#7 β€” DEBUG production assert widened. Was `DJANGO_ENV == "production"`
exact match (so `DJANGO_ENV=staging` silently allowed DEBUG=True). Now
uses a whitelist of permitted DEBUG envs (`development`, `test`, `ci`);
anything else with DEBUG=True raises ImproperlyConfigured. Also adds a
`_env_bool` helper that fails loud on unrecognized values β€” typoed
`ENFORCE_TASK_OWNERSHIP=true` (lowercase) no longer silently parses as
False.

#12 β€” On any rejection in `claim_anonymous_dna_task` (missing key, cache
miss, or mismatch), `pending_dna_task_id` is cleared so the rightful
user's dashboard doesn't poll forever.

Polish
======

#10 β€” Removed hardcoded absolute username path from RALPH_PROMPT.md.

#11 β€” Added `scripts/ralph/`, `scripts/regression/`, `tasks/`,
`progress.txt`, `AGENTS.md`, `.claude/` to .dockerignore so the
internal triage scaffolding doesn't ship in production images. Also
added `.claude/worktrees/` and `scripts/ralph/.last-branch` to
.gitignore.

Tests
=====

Updated `test_task_result_warns_but_allows_when_not_enforced` to
`test_task_result_rejects_foreign_session_even_when_not_enforced` β€” the
original test was happy-path-shaped and would have passed against the
buggy fix. New assertions verify the attacker's session has NO leaked
DNA data after the 403 response (finding #9).

New negative tests:
- `test_claim_anonymous_dna_task_fails_closed_on_cache_miss`
- `test_claim_anonymous_dna_task_fails_closed_on_missing_session_key`

Updated positive `test_claim_anonymous_dna_task` to pass a session_key
arg (was 2-arg, would have failed after #2's required-arg change).

Full suite: 359 tests, all green.
@vanajmoorthy
Copy link
Copy Markdown
Owner Author

Review fixes pushed (e80c734)

All 4 BLOCKER findings + bonus MEDIUM/LOW polish from the post-merge security review are now addressed. Full suite remains green (359 tests).

Blockers closed:

  • FeaturesΒ #1 β€” get_task_result_view no longer writes session DNA on ownership mismatch (regardless of ENFORCE_TASK_OWNERSHIP). The original warn-and-allow path leaked into session["dna_data"], which combined with signup_view's session-DNA branch to re-open the hijack.
  • feat: set up basic posthog eventΒ #2 β€” claim_anonymous_dna_task(session_key) is now a required positional arg. Direct broker publishers can no longer omit it to skip the check.
  • cssΒ #3 β€” Both view and task fail closed on cache miss (TTL expiry was a silent bypass).
  • Update README.mdΒ #4 β€” Session keys hashed with sha256[:12] before logging (they're bearer credentials).

Medium fixes:

  • update meterΒ #6 β€” signup_view now ALSO checks task_owner_<id> cache against caller session_key, layered with existing session["anonymous_task_id"] consistency check.
  • Update README.mdΒ #7 β€” DEBUG production assert uses a whitelist of allowed envs (development, test, ci); staging is no longer silently permissive. New _env_bool helper fails loud on typos like ENFORCE_TASK_OWNERSHIP=true (lowercase) β€” was previously silently coerced to False.
  • use google books and add mgmt command to fix genres and update mappingsΒ #12 β€” pending_dna_task_id is cleared on any claim-task rejection, so the rightful user's dashboard doesn't poll a doomed task forever.

Polish (#10, #11): RALPH_PROMPT.md no longer ships with hardcoded username path; scripts/ralph/, scripts/regression/, tasks/, progress.txt, AGENTS.md, .claude/ added to .dockerignore so internal triage scaffolding doesn't end up in the production image.

Test hardening (#9): test_task_result_warns_but_allows_when_not_enforced (which would have passed against the buggy fix) replaced with test_task_result_rejects_foreign_session_even_when_not_enforced that asserts the attacker's session has NO leaked dna_data after the 403. Plus two new fail-closed negative tests for the claim task (cache-miss + missing-key paths).

Deploy notes update: the ENFORCE_TASK_OWNERSHIP env var is now vestigial β€” fail-closed semantics apply regardless. Kept for forward compat with operational docs.

Ready for re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant