
Fix flaky timing-dependent security tests #207

Open
Clawdy-ast wants to merge 10 commits into RichardAtCT:main from Clawdy-ast:fix/flaky-security-test-timing

@Clawdy-ast

Summary

Three security tests in tests/unit/test_security/ were flaky because they
depended on microsecond-precision timestamp ordering that isn't guaranteed.
All fixes are test-side only — production risk-assessment and session logic
are unchanged.

  • test_log_command_risk_assessment and test_log_file_access_risk_assessment
    (test_audit.py): when two audit events are logged within the same
    microsecond, InMemoryAuditStorage.get_events (a stable sort by timestamp,
    newest-first) falls back to insertion order for the tied events, so the
    first-inserted high-risk event stays at events[0], where the tests expected
    the most recent low-risk one. The assertions now locate each event by
    filtering on its details content (command name / file path) rather than by
    list position.

  • test_session_management (test_auth.py): refresh_session could run
    in the same microsecond as authenticate_user, leaving
    last_activity == created_at, so the assertion last_activity > old_activity
    failed. The test now backdates the session by 1 second before refreshing,
    making the comparison deterministic.
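The tie-breaking behaviour behind the two audit tests can be reproduced in isolation. The sketch below is illustrative only: `AuditEvent`, the module-level `get_events` function, and the `"rm"` / `"ls"` command names are hypothetical stand-ins, not the project's actual classes or test data. It assumes, as the description above states, that events are returned by a stable newest-first sort on timestamp.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    # Hypothetical stand-in for the project's audit event type.
    timestamp: datetime
    risk_level: str
    details: dict = field(default_factory=dict)


def get_events(stored):
    # Stable sort, newest first. Python's sorted() keeps the original
    # relative order of items whose keys compare equal, even with reverse=True.
    return sorted(stored, key=lambda e: e.timestamp, reverse=True)


# Two events that land on exactly the same microsecond.
same_instant = datetime(2026, 1, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)
stored = [
    AuditEvent(same_instant, "high", {"command": "rm"}),  # logged first
    AuditEvent(same_instant, "low", {"command": "ls"}),   # logged second
]
events = get_events(stored)

# Positional lookup: the tie falls back to insertion order, so index 0 is
# the first-inserted high-risk event, not the most recently logged one.
first_by_position = events[0]

# Content-based lookup: does not depend on how ties are ordered.
low = next(e for e in events if e.details.get("command") == "ls")
high = next(e for e in events if e.details.get("command") == "rm")
```

Filtering on `details` keeps the assertion tied to the event under test, so it holds whether or not the two timestamps happen to differ.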

Test plan

  • .venv\Scripts\python.exe -m pytest tests/unit/test_security/ -q → 85 passed
  • The three previously flaky tests pass when selected explicitly
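The session fix can also be exercised in isolation to see why backdating removes the dependence on clock resolution. This is a minimal sketch under stated assumptions: the `Session` class and its `refresh` method are hypothetical stand-ins for the project's session object and `refresh_session`, and it assumes the field being backdated is `last_activity` (the description only says the session is backdated by 1 second).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Session:
    # Hypothetical stand-in; the real session class may differ.
    created_at: datetime
    last_activity: datetime

    def refresh(self) -> None:
        # Mirrors the idea of refresh_session: stamp the current time.
        self.last_activity = datetime.now(timezone.utc)


def refreshed_after_backdating() -> bool:
    now = datetime.now(timezone.utc)
    session = Session(created_at=now, last_activity=now)
    # Backdate by 1 second so the refresh cannot tie with the old value,
    # even if it runs within the same microsecond as session creation.
    session.last_activity -= timedelta(seconds=1)
    old_activity = session.last_activity
    session.refresh()
    return session.last_activity > old_activity


# Repeat many times; without backdating, a run like this could
# occasionally hit an equal timestamp and fail the strict comparison.
results = [refreshed_after_backdating() for _ in range(1000)]
```

The strict `>` comparison now has a full second of margin, so it should hold on every iteration unless the system clock is stepped backwards during the run.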

Talla and others added 10 commits April 27, 2026 15:27
Add /model command with inline keyboard UI for switching between
Opus/Sonnet/Haiku models and effort levels (low/medium/high/max)
at runtime. Model changes force a new session since the CLI doesn't
support model switching on resumed sessions.

- Effort levels are model-aware: Haiku has none, Sonnet excludes
  "max", Opus supports all including "max"
- Override is per-user via context.user_data (in-memory, resets on
  bot restart)
- Threaded through all run_command call sites (orchestrator, classic
  message handler) into the SDK layer
- Registered in both agentic and classic handler modes
- Added to bot command menu and /help text
- 17 new tests covering keyboard display, model/effort selection,
  label formatting, and effort-per-model configuration

Closes RichardAtCT#138

Instead of just "Default", show "Default (claude-sonnet-4-6)" or
"Default (CLI default)" so users can verify what model is active
after resetting.

…ases

- callback.py: replace shared _model_effort_handler closure with two
  explicit lambdas (model:/effort:) — eliminates outer-scope capture;
  move import to module level
- command.py: drop _MODELS dict with hardcoded version IDs; use
  _MODEL_FAMILIES list of short CLI aliases ("opus"/"sonnet"/"haiku")
  which the CLI resolves to current latest automatically
- command.py: add CallbackQuery + ContextTypes type annotations to
  _handle_model_selection (fixes mypy strict mode)
- command.py: simplify _current_model_label (no reverse-map needed)
- command.py: add PR RichardAtCT#165 compatibility comment on force_new_session
- tests: update imports/assertions for _MODEL_FAMILIES; add 3 tests:
  closure regression guard, effort: prefix isolation, force_new_session

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cumulative session cost tracking with warnings at $5 / $10 / $20
(configurable via SESSION_COST_TIERS env var). Fires once per tier
per session, resets on /new or model swap.

Also logs the actual model returned by Claude at turn complete —
useful for verifying /model swaps on this branch.

Smoke-tested 2026-05-07 with lowered thresholds; all 3 tiers fired
correctly in sequence.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three security tests relied on microsecond-precision timestamp ordering
that is not guaranteed:

- test_log_command_risk_assessment / test_log_file_access_risk_assessment:
  two audit events logged in the same microsecond fall back to insertion
  order in the stable sort, so the high-risk event stayed at events[0].
  Now assert by filtering events on their details content, not position.

- test_session_management: refresh_session could run in the same
  microsecond as authenticate_user, making last_activity == created_at and
  the `last_activity > old_activity` assertion fail. Backdate the session
  by 1s before refreshing so the comparison is deterministic.

Production risk-assessment and session logic are unchanged; these were
test-side timing bugs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants