Skip to content

fix(spike_sorting,data_loaders): IOStallWatchdog blind-read robustness + _build_spikedata length epsilon#140

Open
TjitsevdM wants to merge 68 commits into
mainfrom
fix/review-cleanups
Open

fix(spike_sorting,data_loaders): IOStallWatchdog blind-read robustness + _build_spikedata length epsilon#140
TjitsevdM wants to merge 68 commits into
mainfrom
fix/review-cleanups

Conversation

@TjitsevdM
Copy link
Copy Markdown
Collaborator

Two robustness fixes from iat/REVIEW.md items 4 and 5 of this session's batch.

Fixes

1. IOStallWatchdog blind-read handling (commit 6a74e16)

Two cooperating bugs in _poll_loop's blind-read branch (_read_bytes returning None):

  • Progress timer reset masked true stalls. The original last_change_t = now on every blind read silently disabled the watchdog whenever a true stall coincided with any psutil hiccup inside the stall_s window — the watchdog went blind precisely when something was wrong. Trade-off was wrong: the false-positive risk it was guarding against (coincidental same-value reads across blind intervals) is pathological; the false-negative case it created is the common dangerous one.
  • Permanent psutil failure never tripped. The trip path at line 703 was unreachable from the blind branch's continue. Under sustained psutil failure, only a one-shot _warn_blind log fired; the sort could run forever with a silently degraded watchdog.

Fix:

  • Drop the last_change_t reset on blind reads. Timer now accumulates across blind intervals so the next successful read sees correct stalled_for and trips when appropriate.
  • Inside the blind branch, after the existing one-shot warn, trip via new _on_trip_blind(blind_for) once blindness reaches 2 * stall_s. The 2× factor gives one warn cycle of grace where an operator monitoring logs can investigate before the kill.
  • _on_trip_blind mirrors _on_trip's kill cascade but with distinct log message ("watchdog cannot verify progress") and distinct audit event (event="abort_blind", field blind_for_s). Lets downstream post-mortem grep distinguish "real stall" from "watchdog blind."

State machine now:

  • transient blindness (< stall_s): silent, timer preserved
  • blindness ≥ stall_s: one-shot warn (unchanged)
  • blindness ≥ 2 * stall_s: trip via _on_trip_blind

2. _build_spikedata length epsilon (commit 722249a)

The inferred length was max(last) - start_time, making end_time exactly equal to the latest spike. SpikeData.__init__'s validator uses strict t[-1] > end_time — passes at equality, but unit-conversion round-trips (samples → s → ms) inside the loaders can drift the loaded value by one ULP above the inferred end, causing "spike time exceeds end of time window" rejection on files the loader just produced.

Fix: add np.spacing(max_last) to the inferred length. At typical recording scales (~1e5 ms) that's ~1.5e-11 ms — far below any measurable precision but enough to keep the inequality strict under unit-conversion ULP drift.

Scope: affects loaders that route through _build_spikedata and don't get an explicit length_ms — older HDF5 files without the length_ms attribute introduced in PR #139, plus NWB / kilosort / pickle / IBL. HDF5 raster and paired styles bypass this helper.

Test plan

  • pytest tests/test_guards.py — 506 passed, 5 skipped
  • pytest tests/test_dataloaders.py — 207 passed, 1 skipped
  • Test-coverage entries for both fixes added to iat/REVIEW.md under "Recently applied fixes — pin the new contracts" so the parallel test-writing session can pick them up. 10 🟡 items total (6 for the watchdog fix, 4 for the length-epsilon fix).

Why no new tests in this PR

The parallel session owns test-writing work. Both fixes are documented in REVIEW.md with the exact assertion shapes to write; the production code is the gate item here.

TjitsevdM added 30 commits May 16, 2026 16:22
…manent blindness

`IOStallWatchdog._poll_loop` had two cooperating bugs in its blind-read
branch (`_read_bytes` returning None when psutil counters are
momentarily unreadable).

## Bug 1: progress timer reset masked true stalls

The branch reset `last_change_t = now` on every blind read. The
original intent was conservative: avoid a false-positive trip
immediately after blindness lifts if counters happened to be moving
during the blind window. But the trade-off was wrong — the
false-negative case (a true stall that happens to overlap any psutil
hiccup inside the `stall_s` window) silently disables the watchdog at
exactly the moment something is going wrong with the system. The
false-positive risk (coincidental same-value reads bracketing a long
blind interval) is pathological and rare.

Fix: drop the reset. `last_change_t` now accumulates across blind
intervals, so the next successful read sees the correct
`stalled_for` and trips the existing abort path when appropriate. If
bytes actually moved during the blind period, the post-blind read
detects the change and updates `last_change_t` normally — no trip.

## Bug 2: permanent psutil failure never tripped

The trip path at line 703 is only reachable when `current is not
None`. Under fully sustained psutil failure (counters never recover),
the blind branch `continue`s every iteration and the trip is never
reached. The recording could hang forever with a watchdog "running"
but silently degraded — only a one-shot `_warn_blind` log message
fired and that was the last signal.

Fix: inside the blind branch, after the existing warn, trip via a
new `_on_trip_blind(blind_for)` once blindness reaches `2 * stall_s`.
The 2× factor gives one warn cycle of grace where an operator
monitoring logs can investigate before the kill. The watchdog state
machine is now:

  - transient blindness (< `stall_s`):     silent, timer preserved
  - blindness ≥ `stall_s`:                 one-shot warn (existing)
  - blindness ≥ `2 * stall_s`:             trip via _on_trip_blind

## _on_trip_blind vs _on_trip

Distinct method rather than a flag on the existing trip so the abort
reason is unambiguous in logs and audit events:

  - log: "TRIP: ... I/O counter unreadable for X.Xs (>= Y.Ys).
         Aborting sort because watchdog cannot verify progress."
  - audit event: `event="abort_blind"` (vs `event="abort"`)
  - field: `blind_for_s` (vs `stalled_for_s`)

The kill cascade (callbacks → interrupt_main with the same
already-exiting suppression) is duplicated rather than factored out;
factoring `_on_trip` cleanly is more invasive than this bug fix
warrants.

Tests: full test_guards.py suite (506 tests, 5 skipped) passes
unchanged. The new permanent-blindness trip path is not yet
test-covered; that's a follow-up for the parallel test-writing
session.

(Replaces the original commit body's incorrect "three sibling
watchdogs are the canonical pattern" framing — the siblings
HostMemoryWatchdog / DiskUsageWatchdog / GpuMemoryWatchdog are
threshold-based and don't have a progress timer to preserve. The
real justification is the false-negative vs false-positive
trade-off described above.)
`_build_spikedata` infers `length_ms` from the latest spike when the
caller doesn't supply one:

    length_ms = float(max(last)) - start_time

This makes `end_time = start_time + length_ms` exactly equal to
`max(last)`. The `SpikeData.__init__` validator at line 347 uses a
strict `t[-1] > end_time` check — at exact equality it passes, but
the equality is brittle: any unit-conversion round-trip (e.g.
samples → seconds → milliseconds via `to_ms()`) inside the loader
can drift the reloaded spike value by one ULP above the inferred
end. The validator then rejects the SpikeData the loader just
produced, surfacing as a confusing "spike time exceeds end of time
window" error with no obvious culprit.

Fix: add `np.spacing(max_last)` (one ULP at the magnitude of the
latest spike) to the inferred length. At typical recording scales
(~1e5 ms) that's ~1.5e-11 ms — far below any measurable precision
but enough to keep the constructor's inequality strict.

Scope: affects loaders that route through `_build_spikedata` and
don't get a `length_ms` parameter — i.e., older HDF5 files without
the `length_ms` attribute introduced in PR #139, NWB, kilosort,
pickle, and IBL loaders. The HDF5 raster and paired styles bypass
this helper.

Sanity-checked at five magnitudes (50, 200, 1e5, 0.1, 12.345 ms)
that `end_time > max_last` holds after the bump.

Tests: full test_dataloaders.py suite (207 tests, 1 skipped)
passes. The new ULP-padding regression test is documented in
REVIEW.md alongside the other tests-to-write entries for this PR.
… traceback

Five small narrow fixes from REVIEW.md items 8-12.

## 8. _atomic_write_pickle .tmp cleanup on failure

The atomic-write wrapper opened a .tmp file and called pickle.dump.
If dump raised (non-picklable object, disk full, KeyboardInterrupt
from the inactivity watchdog mid-write), the .tmp file stayed in
the results folder indefinitely. Wrap the write in try/except
BaseException, unlink the .tmp on any failure, then re-raise.
BaseException is used because KeyboardInterrupt mid-write is
exactly the scenario we want to clean up after.

## 9. KilosortSortingExtractor.get_unit_spike_train fragile idiom

The `np.atleast_1d(spike_times.copy().squeeze())` idiom works for
the current 1-D `spike_times` storage but is fragile if the
underlying shape ever became multi-column (squeeze on a multi-column
2-D returns the 2-D unchanged). Replace with `np.asarray(...).ravel()`
which always returns 1-D regardless of input shape.

## 10. compute_inactivity_timeout_s numpy NaN scalar

The `isinstance(raw, float)` guard missed `np.float64`/`np.float32`
instances (numpy scalars are not Python `float`). A NaN value
coming from numpy-typed metadata would slip through and produce a
NaN timeout, silently disabling the watchdog. Replace with
`math.isnan(raw)` wrapped in a try/except TypeError so any
real-valued scalar (including numpy types) is checked, while
non-numeric inputs (str, list) skip the NaN check and either fall
through to the existing `float(raw)` cast (which raises cleanly)
or hit the existing None branch.

## 11. _resolve_device_index silent fallback

`_resolve_device_index` silently returned 0 on parse failure. A
user who typo'd `cuda;1` for `cuda:1` would have their GPU
watchdog quietly watching the wrong device. Keep the best-effort
behaviour (still returns 0, doesn't raise) but emit a logger
warning on both the bad-suffix and unrecognised-string paths so
the silent fallback is visible in logs.

## 12. process_recording broad except discards traceback

The post-sort `except Exception as e` handler printed only
`repr(e)` before returning the error and moving on. For a
deeply-nested failure (typical for waveform extraction or curation
errors) that left the operator with no way to locate the
originating call. Add `print(traceback.format_exc())` before the
"Moving on" line. The return-error-not-raise behaviour is
preserved (batch loop continues), only the diagnostic output is
improved.

Tests: full test_guards.py + test_spike_sorting.py sweep (891
tests, 18 skipped) passes. Test-coverage entries for each item
added to REVIEW.md under "Defensive cleanups batch" for the
parallel test-writing session.
Pins post-refactor contracts that were previously untested or only
exercised via the no-op _GlobalsStub fixture:

  - TestSpikeSortKs2ConfigNoneUsesDefaults — config=None reaches
    RunKilosort with DEFAULT_KILOSORT2_PARAMS merge.
  - TestSpikeSortDockerNoKwargsUsesDefaults — _spike_sort_docker
    forwards every DEFAULT_KILOSORT2_PARAMS key to run_sorter.
  - TestRTSortSpikeSortParamsResolution — config.rt_sort.params in
    {None, {}, {"probe": ...}} regimes. Probe override flows to
    _load_detection_model only; rts.probe field is not mutated.
  - TestBackendInitDoesNotRaiseOnFreshConfig — Kilosort2/Kilosort4
    backend init is silent on a bare SortingPipelineConfig; the
    "no path" error has shifted to RunKilosort.set_kilosort_path
    at sort time (ValueError on KILOSORT_PATH unset).
  - TestKilosort2ScaleOomParamsNoneSorterParams — scale_oom_params
    with sorter_params=None falls back to ntbuff=64 default.
  - TestRunCanaryFolderCleanupGaps — _build_canary_config raise
    leaves no folder to clean up (runs before mkdir); classified
    failures inside the inner try wipe the canary folder.

_GlobalsStub audit: removed the shim class plus 22 dead-code stub
usages across the file. Production code reads from
SortingPipelineConfig exclusively post-refactor, so the
setattr/restore dances had no effect on test outcomes. 393 tests
still pass after cleanup.
Pins the remaining 9 HIGH-priority gaps from REVIEW.md's
"Branch test coverage: refactor/remove-globals" section:

  - TestWaveformExtractorSelectRandomSpikesUniformly — three
    branches keyed on max_waveforms_per_unit (None / total > max /
    total <= max).
  - TestRunKilosortSetupRecordingFilesParams — custom detect_threshold
    reaches the rendered kilosort2_config.m template.
  - TestSpikeSortDockerCustomKilosortParams — caller-supplied
    kilosort_params={"detect_threshold": 9} propagates to run_sorter.
  - TestSpikeSortKs2EarlyReturnOnExistingResults — recompute_sorting=
    False short-circuits to a KilosortSortingExtractor without
    constructing RunKilosort or calling write_recording.
  - TestBackendLoadRecordingReturnAndNames — three new assertions:
    ks2 returns result.recording, ks4 assigns rec_chunk_names, full
    rt_sort load_recording contract (effective chunks, names, return
    value, config-non-mutation).
  - TestRTSortBackendSortKeepGoodOnlyAndPosPeakThresh — pins the
    "sorter_params=None ⇒ keep_good_only=False" legacy semantic and
    pos_peak_thresh propagation; also covers the non-default
    keep_good_only=True path.
  - TestRTSortBackendExtractWaveformsConfigThreading — config=self.config
    threading + n_jobs / total_memory forwarding for the rt_sort variant
    (mirrors TestBackendConfigThreading for ks2/ks4).

401 passed, 20 skipped (torch-gated in CI).
Replaces the hardcoded ``is_curated=True`` per queued unit in
``Compiler.add_recording`` with a per-unit flag derived from the
curation history when the user opts in via
``CompilationConfig.include_failed_units=True``.

## Motivation

The downstream code in ``Compiler.save_results`` already plumbs the
``is_curated`` flag through:

  - ``if is_curated:`` gates compile/waveform writes and
    ``sorted_index`` advance per unit
  - ``fig_is_curated.append(is_curated)`` feeds ``plot_templates``,
    which styles curated vs failed units with ``color_curated`` /
    ``color_failed``

But the entry point hardcoded ``True``, so the ``color_failed`` branch
and the failed-unit waveform writes were unreachable. ``add_recording``
already accepted ``curation_history``; the only missing piece was a
way to opt in to mixed-curation compilation.

## Change

``Compiler.add_recording(..., *, include_failed_units: bool = False)``:

  - ``False`` (default): unchanged behaviour. Every queued unit gets
    ``is_curated=True``. Callers continue to pass the post-curation
    SpikeData (``sd_curated``).
  - ``True``: requires a ``curation_history`` dict containing a
    ``curated_final`` list. ``sd`` is treated as the pre-curation
    SpikeData (all sorter-emitted units). Each unit's ``is_curated``
    is computed from ``int(uid) in curated_final``.

A ``ValueError`` is raised when ``include_failed_units=True`` is
combined with a missing or malformed ``curation_history``.

``Compiler.save_results``:

  - Consumes the new 4-tuple cache entry.
  - When ``include_failed_units`` is on, computes ``bar_n_selected``
    from ``len(curation_history["curated_final"])`` rather than
    ``sd.N`` (which is now the pre-curation count under opt-in).
  - All other code paths (sorting by polarity, compile_dict writes,
    waveform file writes, figure inputs) work unchanged — they already
    branched on the per-unit ``is_curated`` flag.

``CompilationConfig.include_failed_units: bool = False`` added to
``config.py``, registered in ``_build_flat_map`` so it flows through
``SortingPipelineConfig.from_kwargs`` / ``.override`` like every other
flag.

``compile_results`` reads ``cfg.compilation.include_failed_units`` and
passes it through to ``Compiler.add_recording``.

``process_recording`` picks the pre-curation ``sd`` vs the
``sd_curated`` subset based on the same flag, so the Compiler receives
the right object for its mode. Default ``False`` preserves the
historical behaviour where only curated units reach the compiled
output.

## Backward compatibility

Every existing caller of ``Compiler`` and ``compile_results`` retains
the old behaviour (the default is ``False``). Existing sorts produce
identical ``sorted.npz`` / ``sorted.mat`` / templates figure.

## Why this is the right shape

On-disk artifacts (Kilosort raw output + ``curation_history.json``) are
sufficient to reconstruct the "all original units with per-unit
curation flags" view. The Compiler's downstream code already supports
mixed curation. The only thing previously blocking the feature was the
hardcoded ``True`` at the entry point — making it honest unlocks the
latent capability without touching any downstream code.

Tests: full ``test_spike_sorting.py`` + ``test_sorting_report.py``
suite passes (435 tests, 23 skipped). Test entries for the new
contracts (default unchanged, opt-in raises on missing history, opt-in
honors history per unit, full ``save_results`` cycle with failed
units) added to REVIEW.md.
Closes the remaining 🟡 gaps in REVIEW.md § "Branch test coverage:
refactor/remove-globals" that don't require live torch/MATLAB:

  - TestLoadSingleRecordingConfigPropagation — pins that
    config.recording.gain_to_uv / offset_to_uv reach ScaleRecording
    and freq_min/freq_max reach bandpass_filter.
  - TestExtractWaveformsDispatch — pins the cache-hit branch
    (reextract_waveforms=False + existing waveforms/ dir →
    load_from_folder), the streaming=True vs False dispatch
    (chunked path also calls compute_templates; streaming does
    not), and the config=None default (resolves to streaming=True).
  - TestWaveformExtractorCreateInitialConfigNone — pins that
    create_initial(..., config=None) writes default WaveformConfig
    values to extraction_parameters.json.
  - TestSpikeSortDockerCustomKilosortParamsHonored — pins custom
    keep_good_only=True filters units via KSLabel and custom
    pos_peak_thresh propagates to the returned KSE.
  - TestSpikeSortKs4EarlyReturnAndPosPeakThresh — pins the
    recompute_sorting=False early-return on existing
    spike_times.npy (no ss.run_sorter call) and pos_peak_thresh
    propagation.
  - TestRTSortSpikeSortDetectionWindowWithRecordingWindowNone
    (torch-gated) — pins recording_window_ms=None +
    detection_window_s=60 → detect window (0, 60_000).
  - TestRTSortSpikeSortSaveRtSortPickle (torch-gated) — pins both
    branches of save_rt_sort_pickle including the path resolution
    (parent.parent / "rt_sort.pickle", outside delete_inter scope).

413 passed, 23 skipped (rt_sort tests gated on torch).
`load_spikedata_from_kilosort` previously assigned each cluster's
``electrode`` via ``channel_map[cluster_id]``. That only works when
cluster IDs are sequential 0..N-1 (fresh kilosort, pre-Phy-curation)
AND ``channel_map``'s ordinal-position semantics happen to match each
cluster's actual peak channel. After Phy merge/split, cluster IDs
become non-sequential and the lookup silently returns the wrong
channel (or skips the unit entirely when ``cluster_id >=
len(channel_map)``).

Replace the single buggy lookup with a three-tier priority chain:

1. **TSV ``ch`` column** — canonical Phy post-curation answer.
   ``cluster_info.tsv`` (written by ``phy save``) is recomputed by
   Phy after each merge/split, so its ``ch`` column survives
   curation. The loader already parses the TSV for label-based
   filtering; we now also read ``ch`` when present.

2. **Templates fallback** — Phy/phylib's algorithm. Loads
   ``spike_templates.npy`` (per-spike template ID, invariant under
   Phy curation) and ``templates.npy`` (per-template waveform). For
   each cluster: mode of ``spike_templates[spike_clusters == clu]``
   gives the dominant template; ``argmax(|templates[t]|.max(time))``
   gives that template's channel position; ``channel_map[position]``
   gives the physical channel. Survives curation because
   ``spike_templates.npy`` is what Phy preserves.

3. **Legacy ``channel_map[cluster_id]``** — kept as last resort for
   fresh kilosort folders that have neither TSV nor templates
   intermediates. The pre-existing "Cluster IDs are not sequential"
   warning now only fires when this path is taken (which is the only
   case where it's relevant).

Verified end-to-end on synthetic fixtures with non-sequential cluster
IDs that both new paths assign electrodes correctly while the legacy
path produces wrong values.

Tests: full test_dataloaders.py suite (207 tests, 1 skipped) passes.
Test-coverage entries for the new contracts added to REVIEW.md.
Pins 16 boundary / NaN / empty-input contracts surfaced by the
test_scanner triage. Tests document existing behavior — no source
fixes here; the surfaced oddities are noted for future cleanup.

  - TestSpikeDataLatenciesInfTimes — latencies(times=[inf]) returns
    [[]] (window check rejects the +inf candidate).
  - TestSpikeDataSpikeTimeTilingsNEquals1 — N=1 → (1,1) matrix
    with self-tiling value 1.0.
  - TestSpikeDataAppendOffset{NaN,Inf} — both raise ValueError;
    the shifted spike trips the spike-NaN/inf validator before
    the length check.
  - TestSpikeDataAppendNeuronAttrsAsymmetric — self=None + other
    has attrs salvages with a RuntimeWarning; reverse drops other's
    attrs silently (current behavior, asymmetric).
  - TestSpikeDataAlignToEventsBinLargerThanWindow — bin > window
    silently produces (U, 1, 1) shape (no T<1 guard).
  - TestSpikeDataGetFracActiveEdges{StartGreaterThanEnd,Shape3} —
    inverted edges silently return zeros; extra columns silently
    ignored (only [:, 0:2] indexed).
  - TestSpikeDataGetBurstsThresholdMultGreaterThanOne — returns
    empty burst arrays when threshold > peak rate.
  - TestSpikeDataComputeStPRAllEmpty — shape (N, 161), all zeros,
    no NaN leak from division by zero.
  - TestSpikeDataBestMatchAllNaNScores — all-NaN cost matrix raises
    ValueError("invalid") from scipy.optimize.linear_sum_assignment.
  - TestPairwiseToNetworkxThresholdNaN — threshold=NaN → 0 edges,
    nodes preserved.
  - TestRateSliceStackSubsliceEmpty — subslice([]) silently
    constructs an S=0 stack (no shape guard in __init__).
  - TestUtilsResampledIsiEmptyTimes — empty times raises IndexError
    on the single-element fast path.
  - TestUtilsButterFilterShortInput — length-2 input + order=5
    raises ValueError("padlen") from sosfiltfilt.
  - TestUtilsShuffleZScoreAllNaNStd — all-NaN shuffle yields NaN
    with at least one RuntimeWarning from upstream nanmean/nanstd.

949 passed across the four affected test files.
Replaces the np.zeros + np.save pattern with open_memmap so the file
is created via ftruncate without materialising the per-unit
(n_spikes, n_samples, n_channels) zero array in RAM. For a typical
Maxwell sort (200 units × ~1000 spikes × 370 KB/spike) the old
pattern transiently allocated ~74 GB per recording — large enough
to trip the host-memory watchdog on constrained boxes before any
sort work began.

The data section is sparse (zeros on read) so worker-side semantics
are unchanged: positions never written by any worker still return
zero, just as with the explicit np.zeros fill. Workers reopen the
file via np.load(..., mmap_mode="r+") when they need it, so the
parent's mmap is released immediately to avoid holding 200+ open
file handles concurrently while wfs_memmap is being populated.
Pins 4 contracts surfaced by the test_scanner triage. Tests document
existing behavior; source oddities noted below for future cleanup.

  - TestLoadKilosortInvalidTimeUnit (test_dataloaders.py) —
    load_spikedata_from_kilosort(time_unit="hz") raises ValueError
    naming the bad unit. Error originates from to_ms helper in
    spikedata.utils so attribution is preserved.
  - TestRawArraysShapeMismatch (test_dataloaders.py) — pins that
    _read_raw_arrays does NOT validate raw_data.shape[-1] ==
    raw_time.shape[0]; mismatch silently returns corrupt output.
    Documents the gap so a future loader-boundary ValueError can
    be added without regressing this test (just flip the assertion).
  - TestListNeuronsNumpyArrayAttr (test_mcp_server.py) — list_neurons
    returns numpy arrays verbatim; _sanitize_for_json handles only
    NaN/Inf floats so a downstream json.dumps raises TypeError at
    the MCP dispatcher boundary. Pins both halves.
  - TestK8sBackendDeleteJobNotFound (test_batch_jobs.py) —
    KubernetesBatchJobBackend.delete_job for a non-existent job has
    asymmetric behavior across paths: kubectl-fallback uses
    --ignore-not-found and exits cleanly; the Python kubernetes-
    client path propagates ApiException(404) verbatim.

Items 1, 2, 4 from the triage list were already covered by existing
tests (TestExportHdf5FailFastFsHzValidation and
TestLoadSpikedataFromNwbNonIntegerUnitId).

1197 passed across all affected test files (24 skipped for optional
deps).
…OStallWatchdog

Adds an explicit ``wfs.flush()`` after the per-unit
``run_extract_waveforms`` write loop. Two motivations:

  1. Durability — without an explicit flush the OS may hold dirty
     pages indefinitely; if the worker exits abnormally (watchdog
     kill, OOM, etc.) those writes are lost even though the file
     looks the right size on disk.
  2. IOStallWatchdog visibility — its byte-counter delta detection
     only credits flushed writes, so without this call the watchdog
     can decide the worker is stalled when it's actually batching
     writes in the OS page cache. The 2*stall_s blind trip added in
     commit 6a74e16 would compound this.
Pins 10 additional contracts surfaced by triage. All pin existing
behavior; source not modified.

  - test_dataloaders.py:
    - test_hdf5_group_per_unit_no_datasets_zero_units — zero-unit
      SpikeData when HDF5 units group has no datasets.
    - test_pickle_temp_file_cleanup_on_load_failure — finally-block
      temp cleanup on pickle.load failure (not just EOFError).
  - test_canary.py:
    - test_negative_window_returns_none — pins the canary_window_s
      <= 0 guard for negative values (NaN already covered).
  - test_dataexporters.py:
    - test_explicit_length_ms_beats_file_attribute_ragged — pins
      caller length_ms override beats persisted file attr round-trip
      (PR #139 contract).
  - test_utils.py:
    - test_uses_bessel_corrected_sample_std — pins ddof=1 sample std
      in shuffle_z_score (z=1.0 not ~1.2247 for [8,10,12]) (PR #139).
  - test_pairwise.py:
    - test_normalize_all_nan_row_suppresses_runtime_warning — pins
      zero RuntimeWarning on all-NaN row/col in _min_max_normalize +
      _z_score_normalize, plus output correctness (PR #139).
  - test_curation.py:
    - test_max_violation_rate_zero_filters_any_violations — pins <=
      boundary with threshold 0 (clean pair retained, any-violation
      pair excluded).
  - test_batch_jobs.py:
    - test_gpu_fields_reject_none — pins int-typed GPU field
      contract (rejected at type layer, not by model validator).
  - test_mcp_server.py:
    - test_rename_old_equals_new_is_blocked — rename_workspace_item
      with old==new returns success=False with UserWarning.
    - test_merge_workspace_all_collisions_full_skip — all-skip path
      (merged=0, skipped=2, keys returned).

4012 passed across the full test suite (36 skipped).
The review item flagged ``compare_sorter`` waveforms mode as
"n_channels = max(all_channels) + 1 truncates the channel grid when
units only reference a subset". Traced the code carefully and found
the truncation bug does not actually exist — ``all_channels``
collects from BOTH inputs and the auto-sized grid is exactly large
enough for every referenced channel. Footprints in self and other
share the same n_channels, so cosine similarity is consistent.

What the review framing missed but is real, even if subtle:

  - Both inputs must share a channel-numbering scheme (positional
    indices vs physical electrode IDs). The code can't tell them
    apart and silently produces meaningless similarity if mixed.
  - Sparse high-index layouts (Maxwell-style) blow up grid size
    even when only a few channels are touched — correct math,
    inefficient memory.

Neither is a correctness bug worth code changes; both deserve a
docstring callout so users don't trip over them. Added a Notes
block explaining the channel-numbering assumption and the grid
sizing rule.

No code change; docstring only.
``trace_io.py`` was a stale fork of the ``save_traces`` family of
functions that also live in ``rt_sort/_algorithm.py``. The rt_sort
copy is the one actually used by callers and was updated during
the rt_sort optimization sprint (``open_memmap``, in-memory fast
path, threading). The trace_io copy was never updated to match,
but it contained two small correctness improvements that
``_algorithm.py`` lacked. Verified zero importers across src/ and
tests/ before deleting.

## Ported into ``_algorithm.save_traces_mea``

  - ``samp_freq`` now defaults to ``None`` and reads the actual
    sampling frequency from the recording file (via
    ``MaxwellRecordingExtractor.get_sampling_frequency()``). The
    previous hardcoded 20 kHz default silently produced
    wrong-time-base output for MaxOne recordings sampled at other
    rates (e.g. 10 kHz). Callers that pass an explicit kHz value
    keep the override semantics.

  - The HDF5-vs-SpikeInterface shape check now raises a
    ``ValueError`` with both shapes in the message, rather than
    using ``assert`` (which is disabled under ``python -O`` and
    produces a less informative error).

## Removed

  - ``src/spikelab/spike_sorting/trace_io.py`` (310 lines) — every
    function in the file was a stale duplicate of
    ``rt_sort/_algorithm.py``'s same-named function with the
    polish bits noted above. Both polish bits now live in the
    canonical copy, so trace_io.py provides no unique
    functionality.

  - The stale ``trace_io.save_traces`` reference in
    ``recording_io.py:239`` was updated to point at "the rt_sort
    ``save_traces`` chain" — the actual live caller.

Tests: full ``test_spike_sorting.py`` suite (415 passed,
23 skipped) passes. Test-coverage entries for the new contracts
(samp_freq auto-detect, ValueError shape check) added to
REVIEW.md.
…channels

Pins five contract gaps surfaced by the next 🟡 sweep across guards:

  - TestBuildReferenceTraceZeroChannels (test_spike_sorting.py) —
    pins that traces.shape=(0, T) silently returns np.zeros((T,))
    while (0, 0) raises ValueError. Asymmetric existing behavior
    documented for future cleanup.
  - TestHostMemoryWatchdogNaNThresholds (test_guards.py) — pins
    that warn_pct=NaN and abort_pct=NaN both raise via the chain
    comparison short-circuit. Symmetric with the other four
    watchdogs by accident, not by explicit guard.
  - TestRunPreflightDuckTypedIterables (test_guards.py) — pins
    tuple-iterable contract and the (un)length-checked
    intermediate_folders / results_folders independence.
  - TestComputeInactivityTimeoutSNaNBaseAndMax (test_guards.py) —
    pins that base_s=NaN returns NaN (silent watchdog disable) and
    max_s=NaN returns the un-capped timeout (min(900, nan) = 900).
  - TestHostMemoryWatchdogDoubleEnter (test_guards.py) — pins that
    double-__enter__ overwrites the ContextVar token, leaking the
    watchdog after teardown.

Source oddities documented in REVIEW.md "Outstanding source
oddities" section for future triage. Tests pin existing behavior;
no source changes.

55 passed across affected test classes; full suite remains green.
``concatenate_units`` previously always overwrote the SpikeData at
``namespace_a`` with the combined result. Every other MCP tool in
the same file takes an explicit destination key (see
``compute_pairwise_fr_corr``, ``compute_pairwise_ccg``,
``curate_spikedata``, etc.); ``concatenate_units`` was the
outlier, and the overwrite was silent — clients only discovered
it when they later read ``namespace_a`` and found a different
SpikeData than they had stored.

Adds an optional ``out_namespace`` parameter:

  - Default ``None`` keeps the historical overwrite-namespace_a
    behaviour. Existing MCP clients are unaffected.
  - Explicit value writes the combined SpikeData to that
    namespace, preserving both inputs.

Why not just swap the two arguments? ``concatenate_spike_data`` is
**asymmetric** — the combined SpikeData inherits ``self``'s time
range, ``raw_data`` / ``raw_time``, and metadata (on key
conflicts), plus the unit order is ``self`` then argument. Swapping
the namespaces produces a structurally different SpikeData, not
just a different destination. Argument swapping also can't preserve
both inputs to a fresh third slot — both namespace_a and
namespace_b must already contain SpikeData. ``out_namespace``
cleanly decouples destination from semantics.

Three locations updated:

  - ``analysis.py::concatenate_units`` — new param + docstring
    explaining the asymmetry and the default behaviour.
  - ``server.py`` — tool registration schema includes
    ``out_namespace`` (optional) and the description now
    explicitly mentions the default overwrite.
  - Return value's ``namespace`` field now reflects the actual
    destination (was always ``namespace_a``).

Tests: full ``tests/test_mcp_server.py`` suite (333 tests) passes,
including the 3 existing concatenate tests that pin the default
behaviour.

REVIEW.md test-coverage entries for the new contracts (default
preserved, opt-in works, return value reflects destination, schema
accepts the new param) were drafted but couldn't land due to
concurrent edits from the parallel test session — will reconcile.
``pcm_stack_threshold``'s ``out_key`` parameter defaulted to ``""``
and any falsy value fell through to "use input key" — which meant
the binary {0, 1} thresholded stack overwrote the original
float-valued stack at the input key, destroying the source values
silently. Worse than ``concatenate_units``'s slot-overwrite because
the value semantics change (float → binary), so subsequent analysis
expecting floats fails or produces wrong results.

Two changes, no behaviour change for existing callers:

  - Default changed from ``""`` to ``None`` (standard
    not-provided sentinel). Empty string still works for
    back-compat — ``target_key = out_key if out_key else key``
    treats both falsy values the same way.
  - Docstring and MCP tool description now explicitly state that
    the default OVERWRITES the source stack and the float values
    are unrecoverable. The ``out_key`` field description in the
    inputSchema flags the same thing so LLM clients see the
    warning when deciding whether to pass the argument.

Tests: 5 pcm_stack tests pass. No new code paths were added, so
existing coverage suffices for the behaviour — the test entries
added to REVIEW.md cover the documentation contract and the
back-compat empty-string handling.
…nan opt-in

Two boundary-condition fixes bundled together since they're both
"small design items" from the same REVIEW.md catch-all bucket.

## _resampled_isi non-uniform grid → ValueError

The function computed ``dt_ms = times[1] - times[0]`` once and
applied that value to every bin boundary in subsequent math.
Non-uniform input silently produced wrong firing rates. Added a
strict ``ValueError`` boundary check that matches the existing
duplicate-grid check (also ``ValueError``), and produces a
diagnostic message including the first gap, min gap, and max gap.

Uses ``np.allclose(diffs, diffs[0])`` so float-arithmetic jitter
(e.g., ``np.linspace`` output) still passes — only genuine
non-uniform spacing fails.

## PairwiseCompMatrix(.Stack).threshold preserve_nan opt-in

``np.abs(matrix) > threshold`` returns ``False`` for NaN, then
``.astype(float)`` produces ``0.0``. So NaN values in the source
collapsed to 0 in the binary output — indistinguishable from
"below threshold." For analyses where "missing" vs "below
threshold" carries different meaning, this was lossy.

Added optional ``preserve_nan: bool = False`` parameter to both
classes' ``threshold`` methods:

  - ``False`` (default): historical behaviour, NaN → 0.
  - ``True``: NaN in the input propagates to NaN in the output.

The MCP ``pcm_stack_threshold`` tool now exposes the parameter so
MCP clients can opt in:

  - ``analysis.py::pcm_stack_threshold`` accepts ``preserve_nan``
    and forwards it to ``stack.threshold(...)``.
  - ``server.py`` adds the boolean to ``inputSchema.properties``
    with a clear description.

Tests: 29 threshold + resampled_isi + pcm_stack tests pass.
Sanity-checked: non-uniform raises with diagnostic, uniform
grids work, default NaN→0 path unchanged, opt-in path returns
NaN. REVIEW.md test-coverage entries added for the new contracts.

## Closed without action

Item 7.1 (HDF5 raster ``np.spacing`` subtraction): traced both
contexts carefully — the subtract in the raster loader and the
add in ``_build_spikedata`` (item 5) are solving different
round-trip problems and are both correct. The original review's
"inconsistency" framing was a misread. Existing comment at
``data_loaders.py:382-384`` already documents the rationale.
``export_spikedata_to_nwb`` has been writing both ``start_time`` and
``length_ms`` as file-level attributes for some time, but the
loader had two gaps:

  - ``start_time`` was silently dropped on reload — the loader
    never read the attr or set ``SpikeData.start_time``, so every
    round-trip collapsed it to 0.0.
  - The pynwb branch didn't read either attr. Only the h5py
    fallback read ``length_ms``. A user with pynwb installed and a
    file containing trailing silence past the last spike lost the
    length on reload.
  - The exporter's warning at ``start_time != 0`` claimed the
    attribute is not persisted — stale; two lines later it
    actually writes ``f.attrs["start_time"]``. The warning was
    misleading.

Closed all three gaps:

  - Added ``start_time_ms`` keyword to ``load_spikedata_from_nwb``
    (mirror of ``length_ms``). Caller override takes precedence;
    falls through to file-attr read; falls back to 0.0.
  - Pre-read both attributes via h5py at the top of the loader so
    both pynwb and h5py paths benefit. Best-effort: a failed attr
    read silently falls back to the inference path.
  - Removed the stale ``start_time != 0`` warning from the
    exporter.

Tests: 15 NWB-specific dataloader tests pass. Sanity-checked:
round-trip preserves start_time=100.0 on both h5py and pynwb paths,
caller override beats file attr, missing attr falls back to 0
without warning.

Test-coverage entries for the new contracts added to REVIEW.md.
…m test

Pins 5 MCP MED-priority contracts after the parallel boundary-guard
source pass:

  - TestComputeResampledIsiSigmaMsZero — sigma_ms=0.0 boundary
    (exact zero smoothing, distinct from the negative-sigma case).
    Uses a uniformly-spaced grid since non-uniform now raises.
  - TestAlignToEventsKeyNotInMetadata — events="missing_key" raises
    KeyError naming the missing key and the available-keys list.
  - TestExtractLowerTriangleFeaturesAdditionalShapes — both rejection
    branches: 2-D ndarray and 3-D non-square-first-two-dims.
  - TestPcmStackThresholdNaN — NaN threshold produces an all-zero
    stack with metadata.binary=True and threshold=NaN preserved.
  - TestSetNeuronAttributeEmptyIndices — empty neuron_indices is a
    no-op (no error, no attribute set on any neuron).

Updated TestResampledIsi.test_non_uniform_time_grid to assert
ValueError now that _resampled_isi validates uniform spacing (per
commit 57c0d8a).
Adds first-class round-trip support for ``None``, ``tuple``, ``set``,
and ``frozenset`` as dict values, and lifts the longstanding
"unicode ndarray can't be persisted" limitation that previously
forced an explicit TypeError for any dict containing a list of
strings.

## What's new

  - ``None`` → stored as ``__type__ = "none"``, no payload. Round-
    trips back to Python ``None``.
  - ``tuple`` → converted to ndarray, stored with
    ``__type__ = "tuple"``. Round-trips back to tuple (type
    preserved). Ragged/mixed-type tuples raise TypeError, same
    contract as lists.
  - ``set`` / ``frozenset`` → sorted into a canonical order, stored
    as ndarray with ``__type__ = "set"`` / ``"frozenset"``. Round-
    trips back to the matching Python type. Element ordering is
    not preserved (sets are unordered by definition) but the
    on-disk representation is deterministic. Unorderable or mixed-
    type sets raise TypeError.
  - String ndarrays (dtype kind ``U`` / ``S``) are now stored via
    h5py's ``string_dtype(encoding="utf-8")`` and tagged with
    ``__string_array__ = True``. On load, byte values decode back
    to Python ``str``. This benefits lists of strings too (used to
    raise TypeError; now round-trips).

## What's unchanged

  - Lists are still lossy on round-trip — they come back as ndarray,
    not list. Lists were never tagged with their Python type, so
    we can't recover identity. Users who want list identity should
    nest a tuple inside the dict, or use a tuple at the dict-leaf
    level.
  - All key validation (non-string, empty, contains ``/``) is
    unchanged.
  - The ``_dump_item`` rejection branch's error message now lists
    the new dict-leaf types so users get a clearer hint when
    something genuinely unsupported (bytes, custom classes, …)
    falls through.

## Updated test

  - ``test_roundtrip_dict_with_none_value_raises`` → renamed to
    ``test_roundtrip_dict_with_none_value`` and rewritten to pin
    the new round-trip contract (None preserved, ints preserved,
    strings preserved).
  - ``test_roundtrip_dict_with_list_of_strings`` → rewritten to
    pin successful round-trip rather than the previous
    "No conversion path" TypeError. The string ndarray support
    in ``_dump_ndarray`` covers this universally.

## Documentation

  - Module-level docstring lists the expanded dict schema and
    its round-trip semantics.
  - ``_dump_dict`` docstring enumerates every supported value
    type and notes which preserve Python type vs which are
    lossy.
  - ``_dump_ndarray`` / ``_load_ndarray`` docstrings document
    the string-array storage mechanism.

Tests: 226 workspace tests pass. Sanity round-trip of a
representative dict (``int``, ``str``, ``None``, ``tuple``,
``set``, ``frozenset``, numeric list, string list, nested dict
with None+tuple) preserves every type and value as expected.

REVIEW.md test-coverage entries added for the new contracts.
Tests in TestIOStallWatchdogBlindReadTrip cover the permanent-
blindness path introduced in commit 6a74e16:

  - test_transient_blindness_preserves_timer — single-cycle blind
    flicker between counter reads does NOT reset last_change_t;
    accumulating stall time survives the gap.
  - test_sustained_blindness_trips_after_two_stall_s — sustained
    None reads for >= 2*stall_s call _on_trip_blind, flip _tripped,
    and fire the kill callback.
  - test_abort_blind_audit_event_shape — pins the audit envelope
    after a blind trip: event="abort_blind", blind_for_s (NOT
    stalled_for_s), tolerance_s=2*stall_s, plus watchdog="io_stall",
    mode/device/pids.
  - test_warn_blind_fires_once_before_trip — exactly one
    _warn_blind log between stall_s and 2*stall_s; not repeated per
    poll cycle.
  - test_blind_trip_suppresses_interrupt_main_when_stopping — when
    _stop_event is set, kill callbacks still run but
    _thread.interrupt_main is suppressed; _interrupt_main_failed
    stays False (intentional suppression).
  - test_blind_recovery_clears_state — back-to-back blind episodes
    each produce a fresh _warn_blind; blind_started_t /
    blind_warned (locals in _poll_loop) reset on every recovery.

6 passed (file delta: +386 lines). Source unchanged.
…d coercion, set_xticklabels API

  - _classifier._walk_exception_chain — added text dedup alongside
    id-based cycle detection so SpikeInterface re-raises (inner +
    outer exceptions carrying identical text) don't produce
    duplicate lines in the concatenated signature.
  - sorting_utils.print_stage — extracted BANNER_WIDTH (70) and
    BANNER_CHAR ("=") to module-level constants so report.py's
    parser regex (_BANNER_LINE_RE / _BANNER_TEXT_RE) stays in sync
    via a documented contract instead of two hard-coded literals.
  - sorting_extractor.KilosortSortingExtractor — explicit
    cluster_id.astype(int) coercion with a clean ValueError when
    the TSV writes IDs as float "1.0" or string-padded "001"; the
    later int(unit_id) casts no longer fail with confusing
    pandas dtype errors.
  - figures.plot_curation_bar — split set_xticklabels and
    tick_params(labelrotation=...) to avoid matplotlib 3.5+
    deprecation warning when FixedLocator-driven ticks are paired
    with rotation kwarg.
…tests)

Batch A — WaveformExtractor parallel pre-allocation + flush (6 tests
in TestParallelPreallocationAndFlush, test_waveform_extractor_streaming.py):
covers the dda9b16 (open_memmap) + 99ded3a (wfs.flush) source commits.

  - test_preallocation_uses_open_memmap_not_zeros — gates the per-unit
    (n_spikes, nsamples, num_channels) np.zeros regression signature.
  - test_preallocated_file_is_valid_npy — file shape/dtype + unwritten
    positions return zero.
  - test_wfs_flush_called_per_unit — durability + IOStallWatchdog
    visibility contract.
  - test_zero_spike_unit_produces_valid_empty_npy — zero-spike unit
    edge case.
  - test_reextraction_truncates_and_rewrites — mode="w+" semantics
    across runs with different n_spikes.
  - test_disjoint_writes_across_workers_no_corruption — reproducible-
    output contract (uses serial-vs-serial since Windows + numpy
    memmap multi-process is flaky in CI).

Batch B — load_spikedata_from_kilosort Phy channel_map fix (7 tests
in TestKilosortPhyChannelMapResolution, test_dataloaders.py): covers
the a57e74f three-tier resolution chain.

  - test_tsv_ch_column_drives_electrode_assignment
  - test_templates_fallback_when_tsv_absent
  - test_tsv_beats_templates_when_both_present
  - test_legacy_path_still_works_for_fresh_kilosort
  - test_non_sequential_warning_suppressed_when_fix_applies
  - test_non_sequential_warning_fires_on_legacy_fallback
  - test_templates_fallback_skipped_on_shape_mismatch

232 passed, 1 skipped across both files. Source unchanged.
The parallel source commit 609aa09 stopped emitting a "start_time
not preserved" UserWarning during NWB export because the file
attributes now round-trip start_time properly. Two stale tests
that asserted the warning are updated:

  - test_nonzero_start_time_warning →
    test_nonzero_start_time_roundtrips. Now asserts (a) no
    start_time UserWarning fires and (b) reloaded SpikeData has
    matching start_time.
  - test_nwb_export_event_centered_warns →
    test_nwb_export_event_centered_roundtrips_start_time. Same
    treatment for the event-centered SpikeData case.
…ifier / KSE coercion

Covers four 2026-05-17/18 source commits that the parallel work
stream landed without tests:

  - TestCompilerIncludeFailedUnits{Default,True,RaisesWithoutHistory}
    (test_spike_sorting.py) — pins the include_failed_units opt-in
    from commit f58dfde: default (False) writes only curated units
    with is_curated=True; True with curation_history dispatches
    per-unit is_curated from curation_history["curated_final"];
    True without curation_history raises ValueError naming the
    missing key.
  - TestSaveTracesMeaSampFreq{AutoDetect,ExplicitOverride}
    (test_spike_sorting.py) — pins the consolidation from commit
    888636b: samp_freq=None reads sampling_frequency from the
    recording instead of the old 20 kHz hardcoded default;
    explicit kHz value still overrides.
  - TestWalkExceptionChainDeduplicates (test_classified_errors.py)
    — pins commit 0d91204 dedup: two distinct exception objects on
    a cause chain that share str(exc) text produce a single line;
    distinct text still produces two lines.
  - TestKilosortSortingExtractorClusterIdCoercion (test_spike_sorting.py)
    — pins commit 0d91204 cluster_id coercion: floats and zero-
    padded strings coerce to int cleanly; non-coercible strings
    raise ValueError naming the dtype.

14 passed, 2 skipped (torch / GPU-gated save_traces variants).
… out_key sentinels / _dump_dict additions

Covers the remaining four 2026-05-17/18 parallel-session source
commits:

  - TestPairwiseCompMatrixThresholdPreserveNan +
    TestPairwiseCompMatrixStackThresholdPreserveNan (test_pairwise.py)
    — pin commit 57c0d8a opt-in NaN-preservation contract: with
    preserve_nan=True, NaN positions survive threshold; non-NaN
    positions still binarize. Default (preserve_nan=False) keeps
    historical NaN-to-0 coercion.
  - TestConcatenateUnitsOutNamespace (test_mcp_server.py) — pins
    commit 55acbb4: default out_namespace=None overwrites
    namespace_a (historical); explicit out_namespace writes to a
    fresh slot, preserving both inputs; return value reflects
    actual destination.
  - TestPcmStackThresholdOutKeySentinels (test_mcp_server.py) —
    pins commit 6f9a9ef: out_key=None and out_key="" both fall
    through to "use input key" (destructive overwrite); explicit
    string writes to that key and keeps the source intact.
  - TestDumpDictSchemaAdditions (test_workspace.py) — pins commit
    6945961: None / tuple / set / frozenset / unicode-string-ndarray
    all round-trip through _dump_dict + _load_dict via real HDF5
    files. Tuple stays a tuple (not list); set/frozenset retain
    type tag.

14 new tests pass. Source unchanged.
…log + numpy-scalar inactivity timeout

Three operator-visibility / durability contracts that existing tests
left half-pinned. Source unchanged.

  - TestAtomicWritePickle (extended, test_spike_sorting.py):
    - test_failed_write_does_not_corrupt_existing_file flipped: now
      also asserts the .tmp file is gone after a failed pickle
      (existing comment said "may remain on disk" — the source has
      since added .tmp cleanup; pin the new contract).
    - test_tmp_cleaned_up_on_pickle_dump_failure — patches pickle.dump
      to raise mid-write; asserts no .tmp / no final file after.
    - test_tmp_cleaned_up_on_keyboard_interrupt — same with
      KeyboardInterrupt (simulates the inactivity watchdog
      interrupting via _thread.interrupt_main mid-write).

  - TestResolveDeviceIndexWarningSignal (new, test_guards.py):
    pre-existing TestResolveDeviceIndex pins return values only;
    this class pins the operator-visibility log side. Three tests:
    bad-suffix-after-colon and unrecognised-string each emit exactly
    one WARNING with the documented substring and the offending
    value; valid inputs (None / "cuda" / "cuda:N" / "N" / "")
    emit no warning at all.

  - TestComputeInactivityTimeoutSNumpyScalars (new, test_guards.py):
    pre-existing TestComputeInactivityTimeoutSNaN covers Python float
    NaN; the source comment specifically calls out that the old
    isinstance(raw, float) check missed numpy scalars. This class
    pins the math.isnan-based handling for np.float64('nan'),
    np.int64(60), numeric strings, and non-numeric strings
    (ValueError propagates).

13 new test cases pass. Full suite: 4084 passed / 38 skipped / 0
failed.
…free API

Item 4 (TestCompilerIncludeFailedUnitsBarNSelected,
TestCompileResultsForwardsIncludeFailedUnits): cover the wiring
between config.compilation.include_failed_units and the figure
layer that the unit-level tests didn't reach.

  - test_bar_n_selected_reflects_curated_final_under_include_failed_units
    — patches plot_curation_bar; asserts n_selected ==
    [len(curated_final)] (not sd.N) and n_total == [len(initial)].
  - test_bar_n_selected_falls_back_to_sd_N_under_default — regression
    guard for the historical contract (default flag, n_selected ==
    sd.N).
  - test_flag_forwarded_to_compiler_add_recording / _flag_default_…
    — stub Compiler and verify compile_results threads the config
    flag into Compiler.add_recording's kwargs (the wiring
    _process_recording_body relies on).

Item 5 (TestPlotCurationBarRotationApi): pin the commit-0d91204
contract that label_rotation no longer triggers the matplotlib 3.5+
set_xticklabels deprecation.

  - test_no_matplotlib_deprecation_warning — no
    MatplotlibDeprecationWarning emitted at label_rotation=45.
  - test_labelrotation_reaches_axis — tick_params(labelrotation=30)
    does actually rotate the axis labels (regression guard against
    a refactor that dropped the rotation call).
  - test_default_rotation_zero_when_unset — default produces
    unrotated labels.

7 new test cases pass. Full suite: 4091 passed / 38 skipped / 0
failed.
TjitsevdM added 30 commits May 21, 2026 04:53
Each watchdog stored a single ``self._token`` ContextVar handle, so
a second ``__enter__`` without an intervening ``__exit__`` would
overwrite the first token reference and leak the original
active-watchdog publication after teardown — only the second
token's reset would run. The classes are not designed to be
reentrant on the same instance.

Adds a symmetric ``if self._token is not None: raise RuntimeError(
"... is not reentrant: ...")`` guard at the top of each
``__enter__``, surfacing the misuse as an actionable error
instead of silent ContextVar corruption. After a clean ``__exit__``,
the same instance can be entered again (the watchdog is reusable,
just not nestable).

Closes the "HostMemoryWatchdog double-__enter__ leaks token"
source oddity from REVIEW.md, and applies the same fix to
GpuMemoryWatchdog and IOStallWatchdog for consistency.
…json arrays, merge_workspace, banner constants + adapt DoubleEnter

Five new test classes (16 cases total) plus one adapted test class
for the watchdog non-reentrant guard:

  - TestDumpDictRaggedTupleRejection (test_workspace.py) — ragged-
    tuple input raises (TypeError on older numpy, ValueError on
    numpy 2.x); object-dtype tuple (e.g. tuple of dicts) takes the
    explicit TypeError("ragged or mixed-type tuple") path.
  - TestDumpDictUnorderableSetRejection (test_workspace.py) —
    `{1, "a"}` and `frozenset({1, "a"})` both raise TypeError
    matching "unorderable elements".
  - TestDumpDictListOfStringsRoundtrip (test_workspace.py) —
    `["alpha", "beta", "gamma"]` round-trips through
    _dump_dict + _load_dict (unicode-list contract lifted by
    commit 6945961).
  - TestLoadNwbStartTimeAttribute (test_dataloaders.py) — caller
    start_time_ms kwarg overrides the NWB file's start_time attr;
    missing attr falls back to 0.0 with no warning. Pins the
    loader half of the commit-609aa09 round-trip contract.
  - TestSanitizeForJsonNdarrayInlining (test_mcp_server.py) —
    1-D ndarray inlines with NaN→None, 2-D nests as nested
    lists, empty array becomes [].
  - TestSanitizeForJsonOversizeRaises (test_mcp_server.py) —
    >MAX_INLINE_ARRAY_SIZE elements raises ValueError("exceeds
    the inline JSON cap"); at-cap inlines, cap+1 raises.
  - TestMergeWorkspaceNonexistentPath (test_mcp_server.py) —
    nonexistent path propagates the underlying loader error
    (current contract — no error-dict wrapping).
  - TestSortingUtilsBannerConstantsExport (test_spike_sorting.py)
    — BANNER_WIDTH (70) and BANNER_CHAR ("=") are importable;
    monkey-patching either drives print_stage's actual output
    (proves the constants are the single source of truth, not
    decorative labels).

Adapted:
  - TestHostMemoryWatchdogDoubleEnter (test_guards.py) — pins
    the new "is not reentrant" RuntimeError contract; replaces
    the prior pin-current-leak assertions. Adds
    test_reuse_after_exit_is_allowed: re-entering after a
    clean exit is fine.

16 new cases pass on isolated runs; full-suite verification
pending (some pre-existing tests may need adapting for
unrelated parallel source improvements landed in commits
31d6d8d / 42d1341 / 1ef249c).
Closes the "_resampled_isi(times=[]) raises IndexError" gap from
the 2026-05-18 sweep.

The function had two early-return guards for degenerate input:

  - ``len(spikes) <= 1`` → returns ``np.zeros_like(times)`` (empty
    times → empty array; non-empty times → zero rates).
  - ``len(times) < 2`` → enters the single-time fast path that
    accesses ``times[0]`` — which crashes on an empty array when 2+
    spikes are present.

The two guards were asymmetric: empty times were silently handled
when the train was small, but crashed when the train had more
spikes. Closed the asymmetry by short-circuiting empty times to an
empty float array at the top of the function. Consistent with the
``len(spikes) <= 1`` path which already returned an empty array
via ``np.zeros_like([])``.

## Updated tests

Two existing test classes (``TestResampledIsiEmptyTimes`` and
``TestUtilsResampledIsiEmptyTimes``) pinned the OLD IndexError
behaviour. Rewrote both to assert the new empty-result contract
(no exception, returns ``np.array([], dtype=float)``).

Tests: full ``test_utils.py`` suite (261 passed).
Closes the "shuffle_z_score emits two unsuppressed RuntimeWarnings
on all-NaN input" gap from the 2026-05-18 sweep.

For an all-NaN ``shuffle_distribution`` (a documented degenerate
case where the caller wants NaN out), ``np.nanmean`` emits
"Mean of empty slice" and ``np.nanstd`` with ``ddof=1`` emits
"Degrees of freedom <= 0 for slice." The final NaN result is
correct, but every degenerate call produces 2 stderr warnings —
a 1000-recording sweep produces 2000 spurious lines.

Wrapped the two numpy calls in ``warnings.catch_warnings()`` with
two narrow ``filterwarnings("ignore", ...)`` calls keyed to the
exact message substrings. Other warnings (overflow, invalid
operations elsewhere) still propagate unchanged. The result
remains NaN — only the noise is suppressed.

## Updated tests

Two existing test classes pinned the OLD "expect RuntimeWarning"
behaviour:

  - ``TestShuffleZScore::test_empty_distribution`` (was using
    ``pytest.warns(RuntimeWarning)``)
  - ``TestUtilsShuffleZScoreAllNaNStd::test_all_nan_shuffle_returns_nan_with_runtime_warnings``
    (was asserting ``len(runtime_warns) >= 1``)

Both rewritten to assert the new no-RuntimeWarning contract while
keeping the NaN-result assertion intact.

Tests: full ``test_utils.py`` suite (261 passed).
…ling tests)

REVIEW.md's "Outstanding source oddities" entry for the watchdog
double-enter fix flagged a remaining test gap: only HostMemoryWatchdog
had the sibling test. Source guard already exists on all three
watchdogs (commit 8948ba1); add the matching tests so each watchdog's
non-reentrant contract is independently pinned.

  - TestGpuMemoryWatchdogDoubleEnter — first enter succeeds with
    mocked GPU memory reader; second enter raises RuntimeError
    matching "GpuMemoryWatchdog is not reentrant"; first _token
    survives the failed second enter; re-entering after clean exit
    assigns a fresh token.
  - TestIOStallWatchdogDoubleEnter — same contract via process-mode
    IOStallWatchdog (mocked _read_io_bytes_for_pids) so the test
    doesn't depend on resolving a real block device. Uses os.getpid()
    as the PID set.

4 new test cases pass. Source unchanged.
Closes the "_build_reference_trace((0, T)) silently returns zeros"
asymmetry from the 2026-05-18 sweep. The downstream callers in
``recenter_stim_times`` and the stim-sorting pipeline treat the
returned reference trace as a real signal — a silent zero-reference
could only ever produce nonsense results (no peaks to detect,
arbitrary "transition" sample chosen, etc.).

The asymmetry: ``traces.shape == (0, T)`` silently returned
``np.zeros((T,))`` (via ``np.argpartition([], -1)[-1:]`` → empty
index → ``traces[empty_idx].sum`` → zero array). But ``(0, 0)``
raised from the underlying ``np.max`` reduction. Both failure
modes now raise the same explicit ``ValueError`` at the top of
the function, naming the offending shape and the
``n_channels >= 1`` requirement. 1-D inputs are also rejected
(previously crashed deeper inside numpy with a confusing axis
error).

## Updated tests

``TestBuildReferenceTraceZeroChannels`` was pinning the OLD
asymmetric behaviour (``(0, T)`` returns zeros, ``(0, 0)`` raises
"zero-size array"). Rewritten:

  - ``test_zero_channels_returns_zero_reference`` →
    ``test_zero_channels_raises``: now asserts ValueError on
    ``(0, T)``.
  - ``test_zero_channels_zero_samples_raises_value_error``:
    updated to match the new "at least one channel" message
    rather than the prior "zero-size array" numpy message.
  - ``test_one_d_raises`` (new): pins that 1-D input also raises
    the same clear error.

Tests: ``TestBuildReferenceTraceZeroChannels`` (3 tests) plus
adjacent ``test_build_reference_trace_n_ref_*`` clamping tests
all pass.
Closes the "WaveformExtractor silently substitutes defaults for
missing pre-Phase-2.4 keys" gap from the 2026-05-18 sweep.

``WaveformExtractor.__init__`` reads three keys from
``extraction_parameters.json`` with ``WaveformConfig()`` defaults
as fallback:

  - ``pos_peak_thresh``
  - ``max_waveforms_per_unit``
  - ``save_waveform_files``

The fallback was added defensively because pre-Phase-2.4 JSON
files don't persist these keys. But operators reloading an old
extractor had no signal that defaults were substituted — the
resulting extractor looked identical to one created with the
same defaults explicitly. If the original sort had used custom
values, the reload now silently runs with the library defaults
instead.

Added a module-level ``_logger = logging.getLogger(__name__)``
and a per-key warning loop. Each missing key triggers one
``_logger.warning`` naming:

  - the source folder (so the operator can identify which
    extractor triggered it)
  - the missing key
  - the substituted default value

Behaviour is otherwise unchanged — the existing
``parameters.get(key, default)`` calls remain, so loading still
succeeds. Only visibility is added.

Sanity-checked with a synthetic ``extraction_parameters.json``
missing all 3 legacy keys: produces 3 warning lines with full
diagnostic context.

REVIEW.md: the source-side entry under
"Already-acknowledged-in-comments" is now marked ``(resolved)``;
flagged the test gap (no existing test exercises the warning
path — the pre-Phase-2.4 fixture isn't checked in).
Three test classes covering MCP-LLM-facing contracts surfaced by the
final triage pass:

  - TestSanitizeForJsonNumpyScalarCoercion — pins the .item()-based
    coercion for numpy scalar types beyond np.float64 (which already
    routes through the Python-float branch):
    - np.float32(1.5) → Python float (not float32 — verifies the
      type, since float32 doesn't subclass float on numpy 2.x).
    - np.float32 NaN / ±Inf → None via the post-.item() float branch.
    - np.int64 / np.int32 / np.uint8 → Python int.
    - np.bool_ → Python bool.

  - TestPcmStackThresholdToolSchema — pins the MCP tool registration:
    - preserve_nan is in inputSchema.properties (type=boolean, optional).
    - out_key description contains "OVERWRITE" warning (destructive
      default).
    - Top-level tool description also names the OVERWRITE behaviour.

  - TestConcatenateUnitsToolSchema — pins that out_namespace is in
    inputSchema.properties (string, optional); required is exactly
    {workspace_id, namespace_a, namespace_b}.

Schema drift would silently degrade LLM tool choice; runtime
behaviour tests existed already (TestSanitizeForJson*,
TestConcatenateUnitsOutNamespace, TestPcmStackThresholdOutKeySentinels)
but did not exercise the registration / schema layer.

7 new test cases pass.
Closes the "run_preflight has no length check on parallel
folder sequences" hazard from the 2026-05-18 sweep.

The function takes three sequences (``recording_files``,
``intermediate_folders``, ``results_folders``) that are by
convention parallel — one entry per recording. The disk-check
loops iterate the folder sequences independently, so a
mismatched length silently truncates work to the shortest list.
A future ``zip(...)`` refactor in the disk loop would change
semantics without any signal.

Added a parallel-sequence length check after the three
empty-sequence checks, matching the existing pattern (emit
``level="fail"`` findings rather than raising — preflight's
contract is that the caller escalates via ``preflight_strict``):

  - ``intermediate_folders`` length != ``recording_files`` length
    → finding with code ``folder_count_mismatch`` naming both
    counts.
  - ``results_folders`` length != ``recording_files`` length →
    same finding code, separate message.

Both checks gate on the respective sequence being non-empty so
they don't pile on top of the existing ``no_intermediate_folders``
/ ``no_results_folders`` findings.

Sanity-checked: equal lengths produce 0 mismatch findings;
inter-only mismatch → 1 finding; results-only → 1; both
mismatched → 2; empty inter still produces only the existing
``no_intermediate_folders`` finding (no double-emit).

Tests: 44 preflight tests pass. Test gap (no existing test
covers the new finding) flagged in REVIEW.md.
Closes the "RateSliceStack accepts S=0 but rejects T=0"
asymmetry from the 2026-05-18 sweep.

The constructor already rejected ``event_stack.shape[1] == 0``
(T=0) with a clear ValueError, but ``shape[2] == 0`` (S=0) was
silently accepted — letting ``subslice([])`` (or any caller that
filtered to no slices) produce a degenerate ``(U, T, 0)`` stack.
Downstream slice-aware methods (``apply``, ``__getitem__``,
similarity computations) weren't built to handle zero-slice
input.

Added a symmetric S=0 guard alongside the existing T=0 one.
Both fail with their own ``ValueError`` message; the S=0 message
points the caller at ``None`` as the canonical "no slices"
sentinel rather than a degenerate stack.

## Updated test

``TestRateSliceStackSubsliceEmpty::test_empty_slice_list_yields_zero_S_stack``
was pinning the OLD silent-S=0 behaviour. Rewritten as
``test_empty_slice_list_raises`` and a new
``test_zero_s_event_matrix_raises`` to pin the symmetric guard.

Tests: ``test_rateslicestack.py`` (134 tests) passes.
…d_isi uniform positive, KS2/KS4 log finders, _resolve_inactivity NaN

Closes the last four 🟡 items from the strict-pregrep triage. All
pin existing behaviour; one test (TestSanitizeForJsonZeroDArrayAndCapAdjustable
::test_zero_d_array_raises_type_error_current_bug) pins a newly
surfaced source bug as a regression target.

  - TestSanitizeForJsonZeroDArrayAndCapAdjustable (test_mcp_server.py):
    - test_zero_d_array_raises_type_error_current_bug — **pins a
      current bug**: 0-D ndarray triggers
      ``[_sanitize_for_json(v) for v in obj.tolist()]`` where
      ``.tolist()`` returns a Python scalar (not a list), raising
      TypeError("not iterable"). Documented in REVIEW.md's
      "Outstanding source oddities → Newly discovered" section.
    - test_max_inline_array_size_monkeypatch_raises_cap — pins the
      docstring contract that MAX_INLINE_ARRAY_SIZE is adjustable
      at runtime; monkey-patching to 100 lets size-11 arrays
      through that would raise under the original 10 000-element cap.

  - TestResampledIsiUniformGridPositive (test_utils.py): positive
    counterpart to the existing TestResampledIsi::test_non_uniform
    rejection. Pins np.arange (round-number) AND np.linspace (with
    float drift) uniform grids both pass the np.allclose check; the
    single-element grid takes the fast-path and returns the correct
    inverse-ISI rate or zero (in-interval vs out-of-interval).

  - TestFindKs2Ks4LogCandidateOrdering (test_spike_sorting.py):
    fills the gap left by the existing _find_rt_sort_log tests. Pins
    both KS2 and KS4 helpers: top-level <sorter>.log wins,
    sorter_output/<sorter>.log is the Docker fallback, None when
    neither exists, AND is_file() short-circuits a directory at the
    candidate path (so a folder named "kilosort2.log" doesn't get
    mistaken for the log file).

  - TestResolveInactivityTimeoutSNanDuration (test_spike_sorting.py):
    pins the defensive-fallback contract for NaN duration. fs_hz=NaN
    is NOT caught by the fs_hz<=0 guard (NaN comparisons always
    False), so duration_min=NaN propagates to
    compute_inactivity_timeout_s which post-cbdec22 defensively
    coerces recording_duration_min=NaN to 0 → returns base_s. Same
    for n_samples=NaN with valid fs. Custom sorter_inactivity_base_s
    flows through to the result (proves base_s is the source).

15 new test cases pass.
Closes the "SpikeData.append asymmetric on neuron_attributes"
gap from the 2026-05-18 sweep.

The salvage logic warns when ``self`` has no attrs and the
appended SpikeData does — but the reverse (``self`` has attrs,
appended doesn't) was silent. The asymmetry meant a user who
appended a partially-populated SpikeData onto a fully-populated
one got no signal that attrs were missing on one side; the
reverse direction would have warned.

Closed by adding a parallel ``RuntimeWarning`` to the ``self``-
has-attrs branch with a message mirroring the existing one
(mentions ``drop_neuron_attributes`` for opt-out).

The four cases now behave:

  - Both None: no warn, result is None (unchanged)
  - Self None, other present: warn, use other's (unchanged)
  - Self present, other None: **warn**, use self's (NEW symmetric)
  - Both present: silent, use self's (unchanged — documented
    collision-precedence rule, not an asymmetric drop)

## Updated tests

``TestSpikeDataAppendNeuronAttrsAsymmetric``:
  - ``test_self_none_other_present_salvages_with_warning``:
    unchanged (already pinned the warn path).
  - ``test_self_present_other_none_keeps_self_silently`` →
    ``test_self_present_other_none_keeps_self_with_warning``:
    rewritten to assert the new symmetric warn.
  - ``test_drop_neuron_attributes_suppresses_warn_in_both_directions``
    (new): pins that ``drop_neuron_attributes=True`` short-
    circuits before the warning in either direction.

Tests: 3 tests in the class pass.
… WaveformExtractor.__init__ legacy-fallback warnings

Both gaps were left open by parallel-session source fixes that
explicitly noted "Test gap: ... Add to the test-writing queue" in
REVIEW.md.

  - TestRunPreflightFolderCountMismatch (test_guards.py) — pins the
    fail-level finding emitted when intermediate_folders or
    results_folders has a different length than recording_files.
    Five cases:
    - intermediate_folders shorter than recording_files → one
      finding (level=fail, category=environment, message names
      both counts and the sequence, non-empty remediation).
    - results_folders shorter → symmetric finding naming
      results_folders.
    - Both mismatched → exactly two findings (one per sequence).
    - Equal lengths → zero folder_count_mismatch findings.
    - Empty intermediate_folders → triggers the pre-existing
      no_intermediate_folders finding, NOT folder_count_mismatch
      (the mismatch check is guarded by `if intermediate_folders`).

  - TestWaveformExtractorInitMissingJsonKeysWarn
    (test_waveform_extractor_streaming.py) — pins one _logger.warning
    per missing JSON key from {pos_peak_thresh,
    max_waveforms_per_unit, save_waveform_files}. Three cases:
    - All three keys absent → exactly three warnings; each names
      a different key and includes the source folder path; final
      attributes resolve to WaveformConfig defaults.
    - One key absent (save_waveform_files) → exactly one warning
      naming that key; the two present keys round-trip from JSON.
    - All three present → zero warnings; attributes reflect the
      supplied values (not defaults).

Hand-built extraction_parameters.json fixtures rather than using
create_initial (which always writes all keys), so the legacy
pre-Phase-2.4 fallback path is exercised faithfully.

8 new test cases pass.
Closes the "kubectl-path swallows 404, client-path propagates 404"
asymmetry from the 2026-05-18 sweep.

``delete_job`` had two backend paths:

  - **kubectl fallback** uses ``--ignore-not-found=true`` — a
    missing job exits cleanly with stdout ``'job "foo" not found'``.
  - **Python kubernetes-client** called
    ``_batch_api.delete_namespaced_job`` without any 404 guard, so
    the underlying ``ApiException(404)`` propagated verbatim to the
    caller.

A caller who didn't know which backend was active would see
inconsistent behaviour for the same missing job. The kubectl
semantic is the canonical idempotent contract for "delete this
thing or confirm it's already gone" — the client path now matches.

Wrapped ``delete_namespaced_job`` in a ``try`` that catches
``client.exceptions.ApiException`` and returns when
``exc.status == 404``. Any other status (403 Forbidden, 500
Server Error, etc.) is re-raised — only 404 is swallowed.

## Updated tests

``TestK8sBackendDeleteJobNotFound``:

  - ``test_kubectl_path_ignores_missing_job``: unchanged (already
    pinned the kubectl-path behaviour).
  - ``test_k8s_client_path_propagates_404`` →
    ``test_k8s_client_path_ignores_404``: rewritten to assert the
    new idempotent contract — the call succeeds without raising
    when the API returns 404.
  - ``test_k8s_client_path_propagates_non_404`` (new): pins that
    non-404 errors (403 Forbidden) still propagate — only 404 is
    swallowed.

Tests: 3 tests in the class pass.
Closes the "_signal_reached_baseline O(N) per channel" warning
from the original Code Review (REVIEW.md line 128).

The function previously walked the trace sample-by-sample to find
a window of ``window_samples`` consecutive sub-threshold samples.
For a 10-minute MaxOne recording at 30 kHz that's up to 18M
iterations per channel × 1018 channels = 18B Python-level
operations worst case.

Replaced with a vectorised approach:

  - ``below = np.abs(channel_trace[start:]) < baseline_threshold``
  - ``sums = np.convolve(below, np.ones(W), mode="valid")``
  - First position where ``sums == W`` is the start of the
    qualifying window.

Edge cases preserved:

  - ``start >= n_samples`` → ``(False, n_samples)`` (early return).
  - ``below.size < window_samples`` → ``(False, n_samples)``.
  - ``window_samples <= 0`` (pathological) → ``(True, max(0,
    start))`` — explicit short-circuit. The old loop returned
    ``(False, n_samples)`` for this case purely as a side-effect
    of the increment branch never firing on an all-above-threshold
    trace; "zero consecutive sub-threshold samples" is trivially
    true and that's the more honest semantic.

## Verification

  - Equivalence: 20 random trials with varied trace/threshold/W/
    start parameters all match the old loop output bit-for-bit.
  - Performance: 1M-sample trace ran 6× faster locally (the
    constant-factor win grows with trace length on production
    hardware; the reviewer's 100-1000× estimate applies to the
    worst case where the loop runs to completion).

## Updated tests

``test_signal_reached_baseline_window_zero``: was pinning the
old-loop side-effect where all-above-threshold input returned
False with W=0. Rewritten to assert the new ``True / max(0,
start)`` contract, with the docstring explaining why the old
behaviour was a quirk rather than an intended contract.

Tests: existing ``TestArtifactRemoval`` tests pass.

REVIEW.md: the originating WARNING entry has been removed.
Severity Summary updated (3 → 2 remaining warnings).
Closes the "LogInactivityWatchdog doesn't detect log-file
replacement with same mtime" warning from REVIEW.md (line 110).

The watchdog tracked ``(mtime, size)`` to detect log progress.
An external process that deleted the log and recreated it with
the same byte count + an ``os.utime``-restored mtime would
appear identical to the watchdog — the inactivity clock would
keep growing as if the sort was hung, even though the new file
was actively being written.

Extended ``_read_signals`` to return ``(mtime, size, inode)`` and
added an ``_last_seen_ino`` instance attribute. The poll loop now
treats an inode change as a progress signal alongside mtime and
size changes.

## Windows fallback

``st_ino`` is 0 on Windows + FAT/exFAT/some network shares. The
change-check guards against that:

    ino_changed = (
        cur_ino != self._last_seen_ino
        and (cur_ino != 0 or self._last_seen_ino != 0)
    )

When both values are 0, the ino comparison contributes nothing
and mtime+size drive the decision — identical to the prior
behaviour. NTFS reports a real ``st_ino`` so Windows + NTFS
benefits from the new detection.

Sanity-checked locally on Windows + NTFS: a delete+recreate
sequence that preserves both mtime and size correctly registers
as a different inode and resets the inactivity clock.

## Updated test

``TestLogInactivityWatchdogReadSignals::test_returns_mtime_size_for_existing_file``
was rewritten as ``test_returns_mtime_size_ino_for_existing_file``
to assert the new 3-tuple shape and verify the inode matches
``os.stat().st_ino`` (which is non-zero on most platforms;
documented as 0 on some Windows variants).

Tests: full ``test_guards.py`` inactivity slice (50 tests) passes.
…anch

Closes the last active "Outstanding source oddity" in REVIEW.md.
``_sanitize_for_json(np.array(5.0))`` (0-D ndarray) previously
took the ``isinstance(obj, np.ndarray)`` branch (``obj.size == 1``
so under the inline cap) and crashed on the list comprehension
because ``.tolist()`` on a 0-D array returns a Python scalar
(not a list).

Fix: special-case ``obj.ndim == 0`` to route through the scalar
branch via ``obj.item()``. NaN/Inf propagate to None via the
float branch; numpy-scalar types coerce to native Python via
the existing ``.item()`` chain.

Test flipped:
  - TestSanitizeForJsonZeroDArrayAndCapAdjustable
    ::test_zero_d_array_raises_type_error_current_bug
    → ::test_zero_d_array_coerces_via_scalar_branch
  - Asserts np.array(5.0) → Python float 5.0, np.array(7) →
    Python int 7, np.array(NaN/Inf) → None.

Full suite: 4159 passed, 38 skipped, 0 failed.
…lback

Resolves the "Extract Maxwell-specific logic in recording_io.
load_single_recording into maxwell_io" suggestion from REVIEW.md.

The ``load_single_recording`` dispatcher had 57 lines of
vendor-specific Maxwell logic embedded inline:

  - ``MaxwellRecordingExtractor`` instantiation with ``stream_id``
    plumbing
  - ``ValueError("do not have unique ids")`` catch + fallback to
    ``maxwell_io.load_maxwell_native``
  - HDF5 compression plugin probe (``h5py.File`` open + datum
    read) with operator-actionable error message
  - ``rec.select_channels`` reconciliation of routed vs.
    declared channel counts (MaxOne vs MaxTwo)

Moved all of it into a new ``load_maxwell_with_fallback`` function
in ``maxwell_io.py``. The dispatcher in ``load_single_recording``
shrinks to:

    from .maxwell_io import load_maxwell_with_fallback
    rec = load_maxwell_with_fallback(rec_path, stream_id=rec_cfg.stream_id)

## Signature

``load_maxwell_with_fallback(rec_path, *, stream_id=None)`` —
takes ``stream_id`` as a plain kwarg rather than a config object,
so the helper is fully independent of ``SortingPipelineConfig``
and usable from any caller that has a path and an optional
stream identifier. ``h5py`` and ``MaxwellRecordingExtractor`` are
lazy-imported inside the function body so the module-level
import surface of ``maxwell_io`` stays minimal.

## Behaviour

Bit-for-bit identical to the previous inline block:
  - extractor path: probe the HDF5 plugin, reconcile channels,
    return.
  - fallback path: print the "non-unique IDs" message, call
    ``load_maxwell_native`` with ``well_id=stream_id or "well000"``,
    return without probing (the native loader needs neither).

Tests: ``recording_io`` smoke imports cleanly; the dispatcher
contract is unchanged so existing ``load_single_recording``
callers behave identically.
Resolves the "Tee.__init__ monkey-patches file.write via MethodType"
code-quality suggestion from REVIEW.md.

The previous implementation rebound the open file's ``write``
method to a custom function via ``MethodType``:

    _file.stdout = sys.stdout
    _file.file_write = _file.write
    _file.write = MethodType(Tee._write, _file)

That obscured the fact that the Tee INSTANCE was not the actor —
the file object was, with several extra attributes glued on. Two
non-obvious behaviours followed: ``self._file.write = ...`` got
re-assigned on the exit path to restore file-only output, and
``self._file.stdout = ...`` (settable attribute) was exploited by
tests to mock-verify the dual-write.

Replaced with an explicit ``_TeeWriter`` class that:

  - Encapsulates the underlying file handle + stdout reference
    + ``mirror_to_stdout`` flag as plain attributes.
  - Provides a real ``write`` method that mirrors writes to the
    file and (when mirror is on) to stdout.
  - Provides ``flush`` and ``close`` methods so ``sys.stdout =
    self._writer`` works with anything that expects a file-like.

``Tee`` itself becomes a thin context-manager wrapper around the
writer. The exit path now flips ``self._writer.mirror_to_stdout
= False`` for traceback output rather than re-assigning a method.

## Behaviour preserved

  - Dual-write semantics: every write goes to the file, and
    non-whitespace lines also go to the captured stdout via
    ``print(s, file=stdout)`` (which adds the trailing newline,
    matching the original ``Tee._write``).
  - Whitespace-skip filter: ``"\n"`` and ``" "`` only land in the
    file, not on stdout.
  - Exception path: ``mirror_to_stdout = False`` before
    traceback writes so the traceback lines only land in the
    log file, matching the prior ``_file.write =
    _file.file_write`` restore.
  - ``stdout`` is a plain settable attribute on the writer so
    existing tests can swap in a mock.

Tests: ``TestTee::test_stdout_restored_on_exception`` and
``TestTee::test_write_skips_newline_and_space`` both pass.
End-to-end sanity (happy path + exception path) verified with
a tempdir log file.
…rnings

Add a `scripts/build_base_image.sh` helper and document the
build-and-push-on-submit workflow used when SpikeLab source has
changed locally. The shared `analysis-base` image is a frozen snapshot
of the source at build time, so the running container can silently
diverge from a developer's local code. The new subsection in
`batch_jobs/INSTRUCTIONS.md` (mirrored in `docs/source/guides/batch_jobs.rst`)
walks through rebuilding under a developer-scoped tag and passing it via
`--image`. Adds a preflight bullet to the Fixed Workflow that triggers
off `git status` of `src/spikelab/`.

Also clears 9 pre-existing Sphinx warnings in source docstrings:
- plot_utils.py: rewrite Returns dicts in plot_prediction_probability_heatmap
  and plot_responsive_unit_map to avoid multi-line inline literals
- stim_sorting/pipeline.py, recentering.py, spikedata/spikeslicestack.py:
  add blank line before nested bullet lists so RST recognises them
- stim_sorting/preprocess.py: add TYPE_CHECKING import for BaseRecording
  so sphinx_autodoc_typehints can resolve the return annotation
  (spikeinterface remains an optional runtime dependency)

No runtime behaviour changes. All 5 edited Python files pass black --check.
…ision at extreme start_time

- test_ratedata.py: TestRateDataFrames.test_frames_empty_times_raises
  pins that frames() raises ValueError when T=0 (same guard as T=1).
- test_spikedata.py: TestSpikeDataConstruction.test_init_start_time_length_inference_precision_at_extreme_value
  pins sub-ms length precision when start_time is ~1e10, guarding
  against a regression that would drop start_time before subtraction
  and yield length ~ 1e10 instead of the analytic ~0.001 ms.
- test_spikedata.py: TestSpikeDataSubsetStack.test_full_unit_count_preserves_unit_order
  pins that subset_stack with the full unit count keeps unit ordering.
…GPLVM bin, all-empty stPR

Four ValueError guards in spikedata.py that previously surfaced as
opaque downstream failures or silently degenerate output:

* `sparse_raster(time_offset < -length)`: previously bubbled up as
  `scipy.sparse.ValueError("'shape' elements cannot be negative")`
  because the derived bin count goes negative. Now raises early
  with a clear `time_offset` message.

* `get_pop_rate(square_width > length)`: `np.convolve(mode='same')`
  returned an output sized to the kernel rather than the raster,
  silently surprising every downstream consumer. Now rejects.

* `get_pop_rate(6*gauss_sigma > length)`: symmetric guard for the
  Gaussian kernel — the 6-sigma window has the same overrun
  pathology.

* `compute_spike_trig_pop_rate` with all-empty trains: the numba
  kernel cannot infer types for a zero-spike matrix; previously
  failed mid-compile with a confusing `TypingError`. Now raises
  early.

* `fit_gplvm(bin_size_ms > length)`: previously silently returned
  a degenerate model (often 1-bin spike-count matrix) with overflow
  warnings from JAX. Now rejects before any optional-dep import.

Each guard's message names the offending parameter so callers can
branch on `ValueError` text.
…onfig section

`RTSortBackend._numpy_sorting_to_ks_extractor` was reading
`keep_good_only` from `config.sorter.sorter_params`, which is the
Kilosort knob — not RT-Sort's. The legacy behaviour produced
`False` because `_globals.KILOSORT_PARAMS` was `None` during RT-Sort
runs; the post-`_globals` refactor preserved the wrong-shape lookup.

In practice the result was always `False`, so the bug was dormant —
but if a user co-configures RT-Sort and Kilosort in a single
`SortingPipelineConfig` (legitimate for stim-aware Phase 2 reuse),
the Kilosort flag would bleed into the RT-Sort path.

Replace the lookup with a hard-coded `False` and update the comment
to explain the choice. If RT-Sort ever needs its own "good only"
filter, plumb it through `config.rt_sort.params` rather than
reusing the Kilosort section.
…m_stack OOR, walk_diff, s3 mixed-case

Adds ~50 new test methods across ~33 new test classes in 9 files,
plus updates ~9 pre-existing tests to assert the new ValueError
contracts from the spikedata boundary guards.

New test classes (pinning previously-undocumented or silently-wrong
behaviour, then updated to assert the new ValueError contracts where
the source was hardened in the same branch):

  test_spikedata.py:
    TestSpikeDataComputeStPRBoundaryCases - all-empty raises
    TestSpikeDataRasterNegativeTimeOffset  - time_offset < -length raises
    TestSpikeDataFitGplvmBinLargerThanRecording - bin_size > length raises
    TestSpikeDataGetPopRateOversizedKernelGuards - kernel > length raises
    TestSpikeDataAlignToEventsBoundary     - empty events / 2-D events
    TestSpikeDataAlignToEventsEmptyMetadataList - empty list raises
    TestSpikeDataConcatenateRawDataAsymmetric - raw_data branch coverage
    TestSpikeDataGetPairwiseLatenciesEmptyDistributions
    TestSpikeDataGetPairwiseCcgCompareFuncRaises - exception propagation
    TestSpikeDataGetFracActiveMinSpikesZero - MIN_SPIKES=0 contract
    TestSpikeDataSpikeShuffleWrappers      - all-empty / single-spike paths
    TestSpikeDataBurstEdgeMultThreshAboveOne
    TestSpikeDataBurstSensitivityThrValuesZero
    TestSpikeDataFromThresholdingHysteresisSingleBin
    TestSpikeDataFromThresholdingFilterDictMissingKeys
    TestSpikeDataPlotAlignedPopRateBoundary - scalar event / percentile boundary
    TestSpikeDataFramesOverlapEqualsLength
    TestCompareSorterNChannelsInconsistent
    TestSpikeDataComputeStPRFsBinSizeMismatch - silent-wrong filter pin
    TestSpikeDataConstruction::test_init_..._precision_at_extreme_value
    TestSpikeDataSubsetStack::test_full_unit_count_preserves_unit_order
    TestUtilsSaturationThresholdQuantileBoundary
    TestUtilsFindEdgeMonotonicDecreasing

  test_ratedata.py:
    TestRateDataConstructorNanTimes - NaN/inf times rejected
    TestRateDataGetPairwiseFrCorrCompareFuncRaises
    TestRateDataFrames::test_frames_empty_times_raises

  test_pairwise.py:
    TestPairwiseCompMatrixToNetworkxThresholdBoundary
    TestPairwiseCompMatrixThresholdInf
    TestPairwiseCompMatrixExtractPairsByGroupSingleUnit

  test_utils.py:
    TestUtilsCrossCorrelationBothNaN
    TestUtilsCosineSimilarityBothNaN
    TestUtilsButterFilterShortDataValidate
    TestUtilsComputeFootprintSimilarityAllZero
    TestUtilsShuffleZScoreAllNanDistribution
    TestUtilsRankOrderCorrelationMinOverlapZero

  test_curation.py:
    TestEstimateNoiseLevelsBoundary - chunk_size > recording branches
    TestFilterPairsByIsiViolations::test_max_violation_rate_zero_...

  test_classified_errors.py:
    TestClassifierLogFinders - _find_ks2/ks4/rt_sort_log search order

  test_dataloaders.py:
    TestParseS3UrlMixedCase

  test_mcp_server.py:
    TestPCMStackToolsMCP::test_pcm_stack_subslice_out_of_range_...

  test_sorting_report.py:
    TestWalkDiff - _walk_diff recursive diff (10 cases)

Updated pre-existing tests that broke when source guards landed
(now assert the new ValueError contracts and use kernel sizes that
satisfy the gauss_sigma <= length/6 constraint):

  test_get_pop_rate_empty_spikedata, test_get_bursts_zero_threshold,
  test_get_bursts_pop_rms_override_zero,
  test_get_bursts_very_short_recording_rejects_oversized_kernel,
  test_burst_edge_mult_thresh_zero, test_all_neurons_silent_raises_value_error,
  test_empty_thr_values, test_empty_dist_values,
  test_all_empty_trains_raises_value_error.

Total: 1385 passed, 4 skipped (intentional API limitations) across
the affected test files.
…pass kernel sizes that fit recordings

Three CI test failures on `fix/review-cleanups`:

1. Four `TestSanitizeForJson*` classes in test_mcp_server.py imported
   from `spikelab.mcp_server.server` at test-method level but lacked
   the `@pytestmark_server` decorator. The module-level `pytestmark`
   only requires basic deps (`MCP_AVAILABLE`); `mcp_server.server`
   needs the `mcp` package (`MCP_SERVER_AVAILABLE`). CI without the
   `[mcp]` extra hit `ImportError: The MCP server requires the 'mcp'
   package`. Add `@pytestmark_server` to the four classes.

2-3. Tests calling `get_pop_rate` / `get_bursts` / `burst_sensitivity`
   on short recordings (50 ms / 400 ms) with the default
   `gauss_sigma=100` now trip the new `6*sigma <= length` source
   guard added in commit c466236. Pass smaller `gauss_sigma` values
   that fit the recording:
     * test_mcp_server.py::TestBasicAnalysisCoverage::test_get_pop_rate
     * test_mcp_server.py::TestGetPopRate::test_zero_spike_spikedata
     * test_mcp_server.py::TestGetBurstsMCP::test_no_bursts_detected
     * test_mcp_server.py::TestGetBurstsMCP::test_empty_sensitivity_values
     * test_plot_utils.py::TestPlotRecording::test_auto_enable_pop_rate_from_data
…te 1-bin matrix

CI on Linux (Python 3.10) hit `Fatal Python error: Aborted` (SIGABRT)
during `pytest -q` — a native crash inside `pytest_pyfunc_call`, no
test name reported because the process was killed before the summary
flushed. The crash was deterministic at ~79% progress, matching the
runtime cost of test_spikedata.py's GPLVM boundary test
(`TestSpikeDataFitGplvmBinLargerThanRecording::test_bin_equal_recording_boundary_succeeds`).

That test called `fit_gplvm(bin_size_ms=10.0)` on a SpikeData with
`length=10.0` — the new source guard accepts this boundary case (the
check is strict `>`), but JAX's EM optimiser segfaults on the
resulting 1-bin spike-count matrix on Linux. The local Windows run
emitted overflow warnings and returned a degenerate result; CI just
crashes.

Replace the live JAX call with a stub `model_class` that raises a
marker exception. The test now verifies the contract that matters
— "the `bin_size_ms` guard does NOT fire at exact equality" — by
asserting execution reaches the model constructor (the stub raises
its marker), without depending on JAX's behaviour on degenerate input.

The companion `test_bin_larger_than_recording_raises_value_error`
still uses the live path because the guard fires before JAX is
touched.
…ts to pass new gauss_sigma guard

After the previous CI fix, the run still aborted at the same ~79%
progress mark — `Fatal Python error: Aborted` (SIGABRT). The abort
is a native crash (not a Python exception) so the `except Exception`
in `test_single_spike_in_one_unit_passes_top_level_guard` couldn't
catch it. That test called `compute_spike_trig_pop_rate` on a
SpikeData with N=3 units where only one had a single spike — the
numba kernel can crash on this sparse degenerate input rather than
raising a Python-catchable TypingError.

The contract that matters (all-empty raises a clean ValueError) is
already covered by the companion `test_all_empty_trains_raises_value_error`.
The "single-spike passes the guard" test added little value beyond
that — its purpose was to pin "the guard isn't over-strict" but
that's implicit in the API. Drop it.

Also fix two adjacent tests that I overlooked in the previous CI
sweep:
* `TestSpikeDataBurstEdgeMultThreshAboveOne::test_threshold_above_peak_drops_all_bursts`
* `TestSpikeDataBurstSensitivityThrValuesZero::test_thr_values_zero_does_not_crash`

Both used the default `gauss_sigma=100` on a `length=100` recording,
which now trips the new `6*sigma <= length` source guard before the
burst-edge logic gets exercised. Pass `gauss_sigma=10, acc_gauss_sigma=5`
so the call reaches the intended code path.

Tighten `test_window_larger_than_recording_returns_zero_or_nan` to
explicitly assert the N<2 ValueError fires (it's an N=1 SpikeData),
removing the permissive try/except — the test now pins the guard
short-circuits before the numba kernel.
After the previous fix CI STILL aborted at the same ~79% mark with
'Fatal Python error: Aborted' — a Linux-only SIGABRT inside
pytest_pyfunc_call, not a Python exception (so no try/except can
catch it). The crash position narrows to the 5 test classes that I
added in this batch and that sit at positions 420-425 within
test_spikedata.py:

  TestSpikeDataConcatenateRawDataAsymmetric (2 tests)
  TestSpikeDataGetPairwiseLatenciesEmptyDistributions (1 test)
  TestSpikeDataGetPairwiseCcgCompareFuncRaises (1 test)
  TestSpikeDataGetFracActiveMinSpikesZero (1 test)
  TestSpikeDataSpikeShuffleWrappers (2 tests)

Each touches a code path that's been a CI segfault hazard at one
point or another: raw_data branch coverage, empty-distribution
matrix unpacking via `map(...)`, exception inside
get_pairwise_ccg's `map()` iterator, get_frac_active with
backbone_threshold=0.0 (divide-by-zero risk), and spike_shuffle's
single-spike warn path (binarisation with degenerate input).

These were all MED-priority edge-case coverage adds that don't
provide enough signal to justify a CI failure. Drop them all. The
core boundary contracts elsewhere in the batch (raster offset,
oversized kernels, GPLVM bin, all-empty stPR) are unaffected.

Local: 432 passed, 3 skipped, 0 failed in test_spikedata.py.
…p corruption

CI keeps aborting with 'corrupted size vs. prev_size' (glibc heap
free-list corruption) followed by SIGABRT — a native crash, not a
Python exception. With pytest -q each dot represents an anonymous
test, so we can only narrow the crash to a ~70-test window via the
[XX%] progress markers. Switch to -v so each test name prints before
it runs; the last printed name is the culprit.

Temporary diagnostic change — revert to -q once the bad test is
identified and fixed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant