Close #21/#22: k_sem corruption root-caused to the k_thread zombie defect; object-lifetime principle documented #39
Merged
Investigation artifacts, preserved in history before cleanup:

- Stack-local sem + K_FOREVER repro harness (same-prio / prio-22 / timer-expiry givers, frame scribbling between cycles)
- EXPERIMENT A: faithful reconstructions of the April 2026 crash shapes (embedded-sem timer struct, matched call depth, one-shot and periodic-repark variants). Run on ESP32-S3 with CONFIG_K_TIMER_DISPATCH_ISR=n (task-context ESP_TIMER_TASK giver, the original trigger context): 200/0 -- does not reproduce on the same IDF 5.4 silicon that crashed in April.
- EXPERIMENT B (linux-only): zombie injection. Recreates the pre-#18 dangling-task-node state with raw FreeRTOS calls. Frame-reuse runs wedged the kernel at uxListRemove (list.c:209) writing through the poisoned pxNext -- lldb forensics showed a stack-local k_sem's vListInitialise pattern overwritten across the zombie TCB. With the region protected (compiler-proofed 8K pad; clang shrank a volatile pad to 2 bytes until given an escaping pointer + memset), the node stays intact and harmless: corruption requires reuse.

Conclusion: the April corruption was the pre-#18 k_thread zombie defect (dangling xStateListItem in dead frames + reuse), fixed by PR #18. k_sem / StaticSemaphore_t exonerated.

Closes the evidence for issue #21.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
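For readers wondering what "compiler-proofed" can look like, the following is a minimal sketch and not the code preserved in the snapshot commit. The names `g_pad_sink`, `reserve_stack_pad`, `PAD_BYTES`, and `PAD_FILL` are hypothetical. It assumes that publishing the pad's address through a volatile global and filling it with `memset` is enough to stop the optimizer from shrinking the array, which matches the commit's description but is not guaranteed for every compiler or optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAD_BYTES 8192u
#define PAD_FILL  0xA5u

/* Hypothetical sink: storing the pad's address in a volatile global makes the
 * pointer escape, so the compiler has to treat the array as observable. */
static void *volatile g_pad_sink;

/* Reserve PAD_BYTES in the current frame, fill them, and return how many
 * bytes still hold the fill pattern when read back via the escaped pointer. */
static size_t reserve_stack_pad(void)
{
    uint8_t pad[PAD_BYTES];

    assert(g_pad_sink == NULL);          /* sketch is not re-entrant */

    g_pad_sink = pad;                    /* escape the pointer */
    memset(pad, PAD_FILL, sizeof pad);   /* touch every byte of the pad */

    /* Re-load the pointer through the volatile global so the reads below
     * cannot be folded away as "known values". */
    const volatile uint8_t *p = (const volatile uint8_t *)g_pad_sink;
    size_t intact = 0;
    for (size_t i = 0; i < sizeof pad; i++) {
        if (p[i] == PAD_FILL) {
            intact++;
        }
    }

    g_pad_sink = NULL;                   /* never leave a pointer into a dead frame */
    return intact;
}
```

A caller would compare the return value against `PAD_BYTES`; checking the emitted frame size in the disassembly remains the more direct confirmation that the pad was not shrunk.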
Strips the investigation-only zombie-injection experiment (preserved at 1eae64f -- it deliberately wedges the kernel when the hypothesis holds and its stack padding is compiler-fragile) and rewords the remaining tests as permanent regressions: three stack-local-sem K_FOREVER giver shapes plus the two faithful April 2026 corruption shapes, valid under either k_timer dispatch mode. CONFIG_K_TIMER_DISPATCH_ISR restored to y.

Adds the #22 deliverable to the zkernel README: "Design principle: object lifetime must be Zephyr-shaped" -- every API ending an object's kernel involvement severs all kernel references before returning, with per-object status and the stack-allocation caveat for objects signaled from other contexts.

Closes #21: the April corruption was the pre-#18 k_thread zombie defect (dangling task list nodes in dead frames poisoned by reuse), fixed by PR #18. k_sem exonerated.

Evidence: faithful April shapes clean on the same IDF 5.4 silicon with the original task-context ESP_TIMER_TASK giver; zombie injection on linux reproduced the kernel fault on demand (uxListRemove writing through a poisoned dangling node) with lldb memory forensics; the protected-region control showed dangling nodes are latent until frame reuse.

Closes #22

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a focused regression suite and documentation to confirm the April 2026 k_sem corruption reports were caused by the pre-#18 k_thread lifecycle/zombie defect, and to codify the project’s object-lifetime design principle for FreeRTOS-backed “Zephyr-shaped” objects.
Changes:
- Adds early-running regression tests covering stack-local `k_sem` + `K_FOREVER` + cross-context/cross-priority give patterns (including the “April shapes”).
- Documents the “object lifetime must be Zephyr-shaped” principle in `components/zkernel/README.md`, including guidance on stack allocation and cross-context signaling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `test/main/test_k_sem.c` | Adds new regression tests to catch delayed corruption from stack-local semaphore usage patterns. |
| `components/zkernel/README.md` | Documents the object-lifetime principle and current status/notes for kernel primitives. |
- Drop the <string.h> include orphaned by the zombie-test removal.
- Pass the target sem to the giver thread via p1 instead of a shared global.
- README precision: the control-block updates that wake a waiter complete before the waiter runs, but the giving context may be preempted by the woken waiter before its call returns and can still touch the control block afterwards (especially under SMP) -- which is exactly why the stack-allocation caveat exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Closes #21. Closes #22.
The April 2026 "stack-local `k_sem` + `K_FOREVER` + high-priority give" scheduler corruption is root-caused and already fixed: it was the pre-#18 `k_thread` lifecycle defect, not a `k_sem` bug and not an Xtensa register-window bug. Parked tasks whose TCBs lived in caller stack memory left dangling `xStateListItem` nodes in kernel lists; stack-frame reuse poisoned them; subsequent kernel list operations wrote through the poison into live frames. PR #18 removed the zombie source. `StaticSemaphore_t` is exonerated: the sem-blocking path was merely the operation that walked the poisoned lists.

Evidence

- Faithful April shapes (embedded `{k_timer; k_sem}` struct, matched call depth, one-shot + periodic-repark) with the exact original giver (task-context `ESP_TIMER_TASK` prio ~22 via `CONFIG_K_TIMER_DISPATCH_ISR=n`) on the same IDF v5.4 silicon that crashed in April: 200/0, does not reproduce.
- Zombie injection (linux-only): frame-reuse runs wedged the kernel at `uxListRemove` (`list.c:209`) writing through the node's poisoned `pxNext`; lldb memory forensics caught a stack-local sem's `vListInitialise` pattern overwritten across the zombie TCB.

The injection experiment is preserved at the snapshot commit (`1eae64f`) and removed from the merged suite (it deliberately wedges the kernel when the hypothesis holds, and its stack padding proved compiler-fragile: clang shrank a `volatile uint8_t pad[8192]` to two bytes until given an escaping pointer).

Changes

- `test/main/test_k_sem.c`: five permanent regression tests. Three are stack-local-sem `K_FOREVER` giver shapes (same-prio / prio-22 / timer-expiry) and two are the faithful April shapes. All are valid under either k_timer dispatch mode and are placed early so the rest of the suite detects delayed corruption.
- `components/zkernel/README.md`: the #22 ("Design audit: FreeRTOS-backed objects do not inherit Zephyr's object-lifetime model") deliverable, "Design principle: object lifetime must be Zephyr-shaped". Every API ending an object's kernel involvement severs all kernel references before returning, with per-object status and the stack-allocation caveat for objects signaled from other contexts.

Also unblocks #28 (`k_timer_status_sync` sem-based rewrite): the April shapes are precisely its trigger pattern and are green.

Validation

Plus the experiment runs above (200/0 with the investigation config on hardware). All builds via clean `rm -rf build sdkconfig` regeneration with the Kconfig state verified in `sdkconfig` and by symbol/test-count probes.

🤖 Generated with Claude Code
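The mechanism described in the Summary (dangling node, frame reuse, then a list operation through the poison) and the "sever before returning" principle can be illustrated with a small host-side model. This is a sketch under stated assumptions, not FreeRTOS or zkernel code: `sim_t`, `node_t`, and every `sim_*` function are hypothetical, the list uses array indices instead of pointers, and the model detects a poisoned link instead of writing through it the way `uxListRemove` would.

```c
#include <assert.h>
#include <string.h>

#define SLOTS 4
#define NIL   (-1)

/* Hypothetical stand-in for a kernel list item embedded in a stack frame. */
typedef struct { int prev, next; } node_t;

typedef struct {
    node_t slot[SLOTS];  /* models stack frames that can hold a list node */
    int    head;         /* index of the first linked slot, or NIL */
} sim_t;

static void sim_init(sim_t *s)
{
    s->head = NIL;
    for (int i = 0; i < SLOTS; i++) {
        s->slot[i].prev = s->slot[i].next = NIL;
    }
}

/* Link slot i at the front of the list (a task parking on a kernel list). */
static void sim_link(sim_t *s, int i)
{
    assert(i >= 0 && i < SLOTS);
    s->slot[i].prev = NIL;
    s->slot[i].next = s->head;
    if (s->head != NIL) {
        s->slot[s->head].prev = i;
    }
    s->head = i;
}

/* Sever every list reference to slot i; the principle requires this to
 * happen before the owning frame is allowed to die. */
static void sim_unlink(sim_t *s, int i)
{
    assert(i >= 0 && i < SLOTS);
    node_t *n = &s->slot[i];
    if (n->prev != NIL) s->slot[n->prev].next = n->next; else s->head = n->next;
    if (n->next != NIL) s->slot[n->next].prev = n->prev;
    n->prev = n->next = NIL;
}

/* Model a later call overwriting the dead frame with its own locals. */
static void sim_reuse_frame(sim_t *s, int i)
{
    memset(&s->slot[i], 0x5A, sizeof s->slot[i]);
}

/* Walk the list; return 1 if every link is in range and consistent. */
static int sim_list_ok(const sim_t *s)
{
    int prev = NIL;
    int cur = s->head;
    for (int steps = 0; cur != NIL; steps++) {
        if (steps >= SLOTS || cur < 0 || cur >= SLOTS) return 0;
        if (s->slot[cur].prev != prev) return 0;
        prev = cur;
        cur = s->slot[cur].next;
    }
    return 1;
}

/* Frame 1 dies after parking; optionally sever first, optionally reuse. */
static int run_scenario(int sever_before_death, int reuse_frame)
{
    sim_t s;
    sim_init(&s);
    sim_link(&s, 0);
    sim_link(&s, 1);
    if (sever_before_death) sim_unlink(&s, 1);
    if (reuse_frame)        sim_reuse_frame(&s, 1);
    return sim_list_ok(&s);
}
```

In this model `run_scenario(0, 1)` is the zombie case and reports a corrupt list, `run_scenario(0, 0)` is the protected-region control where the dangling node stays latent, and `run_scenario(1, 1)` shows that severing first makes frame reuse harmless.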