Close #21/#22: k_sem corruption root-caused to the k_thread zombie defect; object-lifetime principle documented #39

Merged: swoisz merged 3 commits into main from investigate/k-sem-lifetime on Jun 7, 2026

@swoisz swoisz commented Jun 6, 2026

Summary

Closes #21. Closes #22.

The April 2026 "stack-local k_sem + K_FOREVER + high-priority give" scheduler corruption is root-caused and already fixed: it was the pre-#18 k_thread lifecycle defect, not a k_sem bug and not an Xtensa register-window bug. Parked tasks whose TCBs lived in caller stack memory left dangling xStateListItem nodes in kernel lists; stack-frame reuse poisoned them; subsequent kernel list operations wrote through the poison into live frames. PR #18 removed the zombie source. StaticSemaphore_t is exonerated — the sem-blocking path was merely the operation that walked the poisoned lists.
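
To make the failure chain above concrete, here is a toy model in plain C. It is not FreeRTOS code: it assumes a byte arena standing in for stack memory and stores list links as arena offsets, so the stray write stays well-defined. Every name in it (toy_node_t, toy_remove, toy_scenario, the three offsets) is hypothetical. toy_remove only mirrors the general shape of a doubly linked list unlink, which writes through the removed node's own next and prev links.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_BYTES 256u

/* Toy list node. Links are arena offsets rather than pointers, so a
 * poisoned link can be followed without undefined behaviour. */
typedef struct {
    uint32_t next;
    uint32_t prev;
} toy_node_t;

/* Hypothetical layout: a kernel list head, a dead frame that still holds a
 * linked (zombie) node, and a live frame that must not be touched. */
enum { HEAD_OFF = 0, ZOMBIE_OFF = 64, LIVE_OFF = 128 };

static uint8_t arena[ARENA_BYTES];

static toy_node_t load(uint32_t off)
{
    toy_node_t n;
    assert(off <= ARENA_BYTES - sizeof n);
    memcpy(&n, &arena[off], sizeof n);
    return n;
}

static void store(uint32_t off, toy_node_t n)
{
    assert(off <= ARENA_BYTES - sizeof n);
    memcpy(&arena[off], &n, sizeof n);
}

/* Unlink the node at `off` by writing through its own links. */
void toy_remove(uint32_t off)
{
    toy_node_t n = load(off);

    toy_node_t nx = load(n.next);   /* node->next->prev = node->prev */
    nx.prev = n.prev;
    store(n.next, nx);

    toy_node_t pv = load(n.prev);   /* node->prev->next = node->next */
    pv.next = n.next;
    store(n.prev, pv);
}

/* Returns 1 if the unlink modified the live frame, 0 if it did not. */
int toy_scenario(int reuse_frame)
{
    uint8_t before[sizeof(toy_node_t)];

    memset(arena, 0, sizeof arena);
    store(HEAD_OFF, (toy_node_t){ ZOMBIE_OFF, ZOMBIE_OFF });
    store(ZOMBIE_OFF, (toy_node_t){ HEAD_OFF, HEAD_OFF });
    memset(&arena[LIVE_OFF], 0x11, sizeof(toy_node_t));
    memcpy(before, &arena[LIVE_OFF], sizeof before);

    if (reuse_frame) {
        /* A new object initialised in the reused frame overwrites the
         * zombie's links with values that point into live stack. */
        store(ZOMBIE_OFF, (toy_node_t){ LIVE_OFF, LIVE_OFF });
    }

    toy_remove(ZOMBIE_OFF);
    return memcmp(before, &arena[LIVE_OFF], sizeof before) != 0;
}
```

In this model toy_scenario(0) leaves the live region untouched (the dangling node is latent), while toy_scenario(1) reports that the unlink wrote into it. That matches the protected-region control and the frame-reuse run in spirit only.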

Evidence

| Experiment | Result |
| --- | --- |
| Faithful April crash shapes (stack {k_timer; k_sem} struct, matched call depth, one-shot + periodic-repark) with the exact original giver (task-context ESP_TIMER_TASK prio ~22 via CONFIG_K_TIMER_DISPATCH_ISR=n) on the same IDF v5.4 silicon that crashed in April | clean |
| Zombie injection on linux (pre-#18 dangling-node state recreated with raw FreeRTOS calls + frame reuse) | kernel faults on demand: uxListRemove (list.c:209) writing through the node's poisoned pxNext; lldb memory forensics caught a stack-local sem's vListInitialise pattern overwritten across the zombie TCB |
| Same injection, dead-frame region protected from reuse | node stays intact and harmless: corruption requires reuse; dangling nodes are latent |
| Post-#18 primitives, both targets | all clean |

The injection experiment is preserved at the snapshot commit (1eae64f) and removed from the merged suite (it deliberately wedges the kernel when the hypothesis holds, and its stack padding proved compiler-fragile — clang shrank a volatile uint8_t pad[8192] to two bytes until given an escaping pointer).
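
For readers wondering what "given an escaping pointer" can look like, here is a minimal sketch. It is not the code from the snapshot commit: the function name, the sink variable and the 0xA5 fill are all assumptions. The idea is that once the pad's address is stored through a volatile pointer and the pad is filled, the compiler can no longer treat the array as unused. This is still not a guarantee about any particular compiler or optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAD_BYTES 8192u

/* Hypothetical sink. Publishing the pad's address through a volatile
 * pointer makes the address escape the function. */
static uint8_t *volatile g_pad_sink;

/* Reserve and touch PAD_BYTES of stack, then return a small checksum read
 * back through the escaped pointer. */
unsigned reserve_stack_pad(void)
{
    uint8_t pad[PAD_BYTES];

    g_pad_sink = pad;                  /* escape the address */
    memset(pad, 0xA5, sizeof pad);     /* touch the whole region */

    uint8_t *p = g_pad_sink;           /* volatile read of the pointer */
    assert(p == pad);
    unsigned sum = (unsigned)p[0] + (unsigned)p[PAD_BYTES - 1u];

    g_pad_sink = NULL;                 /* do not leave a dangling pointer */
    return sum;
}
```

The sketch uses a plain array plus an escaping pointer rather than a volatile array, because passing a volatile object to memset through a cast would itself be undefined behaviour.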

Changes

  • test/main/test_k_sem.c: five permanent regression tests — three stack-local-sem K_FOREVER giver shapes (same-prio / prio-22 / timer-expiry) and the two faithful April shapes, valid under either k_timer dispatch mode, placed early so the rest of the suite detects delayed corruption.
  • components/zkernel/README.md: adds the #22 deliverable (issue title: "Design audit: FreeRTOS-backed objects do not inherit Zephyr's object-lifetime model"), a section titled "Design principle: object lifetime must be Zephyr-shaped". It states that every API ending an object's kernel involvement severs all kernel references before returning, and it includes per-object status and the stack-allocation caveat for objects signaled from other contexts.
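
The principle in the README bullet can be illustrated with a small intrusive-list sketch. This is a hypothetical model, not zkernel or FreeRTOS code: toy_item_t, toy_list_t, toy_thread_t and all functions below are invented for illustration. The point it shows is that the call ending the object's life (toy_thread_abort here) unlinks every list node before it returns, so the caller's storage can be reused safely afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Intrusive doubly linked list with a sentinel head. */
typedef struct toy_item {
    struct toy_item *next;
    struct toy_item *prev;
    bool linked;
} toy_item_t;

typedef struct {
    toy_item_t head;
} toy_list_t;

/* A thread-like object whose control block lives in caller storage. */
typedef struct {
    toy_item_t state_item;
    int id;
} toy_thread_t;

void toy_list_init(toy_list_t *l)
{
    l->head.next = &l->head;
    l->head.prev = &l->head;
    l->head.linked = false;
}

void toy_list_insert(toy_list_t *l, toy_item_t *it)
{
    it->next = &l->head;
    it->prev = l->head.prev;
    l->head.prev->next = it;
    l->head.prev = it;
    it->linked = true;
}

/* End of life: sever all list references before returning. */
void toy_thread_abort(toy_thread_t *t)
{
    toy_item_t *it = &t->state_item;
    if (it->linked) {
        it->next->prev = it->prev;
        it->prev->next = it->next;
        it->next = NULL;
        it->prev = NULL;
        it->linked = false;
    }
}

/* True if any node reachable from the list is `it`. */
bool toy_list_references(const toy_list_t *l, const toy_item_t *it)
{
    for (const toy_item_t *p = l->head.next; p != &l->head; p = p->next) {
        if (p == it) {
            return true;
        }
    }
    return false;
}

/* Returns true if the list referenced the thread before abort and no
 * longer references it afterwards. */
bool toy_abort_severs_references(void)
{
    toy_list_t ready;
    toy_thread_t t = { .id = 1 };   /* stack-local control block */

    toy_list_init(&ready);
    toy_list_insert(&ready, &t.state_item);
    bool before = toy_list_references(&ready, &t.state_item);

    toy_thread_abort(&t);
    bool after = toy_list_references(&ready, &t.state_item);

    return before && !after;
}
```

An API that returned without the unlink step would leave `ready` pointing into a frame that is about to die, which is the latent state the investigation describes.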

Also unblocks #28 (k_timer_status_sync sem-based rewrite): the April shapes are precisely its trigger pattern and are green.

Validation

| Target | Result |
| --- | --- |
| linux (host) | 186 Tests 0 Failures |
| esp32s3 (on hardware, IDF v5.4) | 210 Tests 0 Failures |

Plus the experiment runs above (200/0 with the investigation config on hardware). Every build started from a clean rm -rf build sdkconfig regeneration, with the Kconfig state verified in sdkconfig and by symbol/test-count probes.

🤖 Generated with Claude Code

swoisz and others added 2 commits June 6, 2026 17:27
Investigation artifacts, preserved in history before cleanup:

- Stack-local sem + K_FOREVER repro harness (same-prio / prio-22 /
  timer-expiry givers, frame scribbling between cycles)
- EXPERIMENT A: faithful reconstructions of the April 2026 crash
  shapes (embedded-sem timer struct, matched call depth, one-shot and
  periodic-repark variants). Run on ESP32-S3 with
  CONFIG_K_TIMER_DISPATCH_ISR=n (task-context ESP_TIMER_TASK giver,
  the original trigger context): 200/0 -- does not reproduce on the
  same IDF 5.4 silicon that crashed in April.
- EXPERIMENT B (linux-only): zombie injection. Recreates the pre-#18
  dangling-task-node state with raw FreeRTOS calls. Frame-reuse runs
  wedged the kernel at uxListRemove (list.c:209) writing through the
  poisoned pxNext -- lldb forensics showed a stack-local k_sem's
  vListInitialise pattern overwritten across the zombie TCB. With the
  region protected (compiler-proofed 8K pad; clang shrank a volatile
  pad to 2 bytes until given an escaping pointer + memset), the node
  stays intact and harmless: corruption requires reuse.

Conclusion: the April corruption was the pre-#18 k_thread zombie
defect (dangling xStateListItem in dead frames + reuse), fixed by
PR #18. k_sem / StaticSemaphore_t exonerated. Closes the evidence for
issue #21.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Strips the investigation-only zombie-injection experiment (preserved at
1eae64f -- it deliberately wedges the kernel when the hypothesis holds
and its stack padding is compiler-fragile) and rewords the remaining
tests as permanent regressions: three stack-local-sem K_FOREVER giver
shapes plus the two faithful April 2026 corruption shapes, valid under
either k_timer dispatch mode. CONFIG_K_TIMER_DISPATCH_ISR restored to y.

Adds the #22 deliverable to the zkernel README: "Design principle:
object lifetime must be Zephyr-shaped" -- every API ending an object's
kernel involvement severs all kernel references before returning, with
per-object status and the stack-allocation caveat for objects signaled
from other contexts.

Closes #21: the April corruption was the pre-#18 k_thread zombie defect
(dangling task list nodes in dead frames poisoned by reuse), fixed by
PR #18. k_sem exonerated. Evidence: faithful April shapes clean on the
same IDF 5.4 silicon with the original task-context ESP_TIMER_TASK
giver; zombie injection on linux reproduced the kernel fault on demand
(uxListRemove writing through a poisoned dangling node) with lldb
memory forensics; the protected-region control showed dangling nodes
are latent until frame reuse.

Closes #22

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds a focused regression suite and documentation to confirm the April 2026 k_sem corruption reports were caused by the pre-#18 k_thread lifecycle/zombie defect, and to codify the project’s object-lifetime design principle for FreeRTOS-backed “Zephyr-shaped” objects.

Changes:

  • Adds early-running regression tests covering stack-local k_sem + K_FOREVER + cross-context/cross-priority give patterns (including the “April shapes”).
  • Documents the “object lifetime must be Zephyr-shaped” principle in components/zkernel/README.md, including guidance on stack allocation and cross-context signaling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/main/test_k_sem.c | Adds new regression tests to catch delayed corruption from stack-local semaphore usage patterns. |
| components/zkernel/README.md | Documents the object-lifetime principle and current status/notes for kernel primitives. |

Comment thread test/main/test_k_sem.c Outdated
Comment thread test/main/test_k_sem.c
Comment thread test/main/test_k_sem.c
Comment thread components/zkernel/README.md Outdated

- Drop the <string.h> include orphaned by the zombie-test removal.
- Pass the target sem to the giver thread via p1 instead of a shared
  global.
- README precision: the control-block updates that wake a waiter
  complete before the waiter runs, but the giving context may be
  preempted by the woken waiter before its call returns and can still
  touch the control block afterwards (especially under SMP) -- which
  is exactly why the stack-allocation caveat exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
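
The two test-side points in this commit (target passed as a thread argument, and the giver possibly still touching the control block after the waiter wakes) can be sketched with a POSIX analog. This is not the zkernel test code: it assumes pthreads and POSIX semaphores, with sem_wait standing in for a K_FOREVER take, sem_post for the give, and the void * thread argument for p1. The function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Giver: receives its target semaphore through the thread argument
 * (the analog of p1) instead of reading a shared global. */
static void *giver(void *p1)
{
    sem_t *target = p1;
    assert(target != NULL);
    sem_post(target);
    return NULL;
}

/* Waiter with a stack-local semaphore. Returns 0 on success. */
int take_from_stack_local_sem(void)
{
    sem_t sem;          /* lives in this frame */
    pthread_t th;
    int rc;

    if (sem_init(&sem, 0, 0) != 0) {
        return -1;
    }
    if (pthread_create(&th, NULL, giver, &sem) != 0) {
        sem_destroy(&sem);
        return -1;
    }

    /* Block with no timeout, retrying if interrupted by a signal. */
    do {
        rc = sem_wait(&sem);
    } while (rc != 0 && errno == EINTR);

    /* The giver may still be inside sem_post after the waiter wakes, so
     * the frame must stay alive until the giver has finished. */
    pthread_join(th, NULL);

    sem_destroy(&sem);
    return rc;
}
```

Here pthread_join is the lifetime guard: the frame holding `sem` is not released until the giving thread can no longer touch it. How the zkernel tests provide the equivalent guarantee is not shown in this PR text.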
@swoisz swoisz merged commit d7a1c1e into main Jun 7, 2026
5 checks passed
@swoisz swoisz deleted the investigate/k-sem-lifetime branch June 7, 2026 01:35