Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions libdd-shared-runtime/src/shared_runtime/fork_safe.rs
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,17 @@ impl ForkSafeRuntime {
/// exceeded.
pub fn shutdown(&self, timeout: Option<std::time::Duration>) -> Result<(), SharedRuntimeError> {
debug!(?timeout, "Shutting down ForkSafeRuntime");
// block_on calls context::enter() which accesses Tokio's CONTEXT thread-local.
// During CPython interpreter finalization, TLS is destroyed before atexit handlers
// fire, causing a panic. Detect this via try_current() and bail out early —
// the OS will clean up remaining threads on process exit.
if matches!(
tokio::runtime::Handle::try_current(),
Err(ref e) if e.is_thread_local_destroyed()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry this catches the symptom, but doesn't actually solve the problem, and we're going to play whack-a-mole with global state being destroyed in a non-deterministic order.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally created this PR to solve this ticket: https://datadoghq.atlassian.net/browse/APMS-19910. I'm not sure what the root fix would be. Do you have any alternatives in mind?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this would only happen on uwsgi. If TLS is destroyed by CPython we should have the same panic on every python app calling shutdown

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this solution means we are skipping the flush on shutdown and losing data. I think we should be able to start a new current thread runtime to block on it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think we're having the same issue as unbit/uwsgi#859 which explains why it's specific to uwsgi. Maybe we could patch uwsgi to call the python handlers before cleaning up TLS.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I'm on Python ERE this week for SDK so I'll tackle it from the dd-trace-py angle for this 🤔

@wantsui wantsui Jul 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without fixing it in uwsgi specifically (bc this can happen in other setups), I have a draft attempt here: DataDog/dd-trace-py#18853 .

If folks think we should wrap uwsgi to fix this I can change approach too.

Edit: Actually Claude wasn't able to fully fix this yet, so the approach will change.

PR: DataDog/dd-trace-py#18853 will just focus on uwsgi

) {
debug!("Tokio TLS destroyed during interpreter finalization, skipping shutdown");
return Ok(());
}
match self.runtime.lock_or_panic().take() {
Some(runtime) => {
if let Some(timeout) = timeout {
Expand Down Expand Up @@ -421,4 +432,24 @@ mod tests {
"worker should not run or shut down after fork in child when restart_on_fork is false"
);
}

#[test]
fn test_shutdown_is_idempotent() {
// Calling shutdown() twice must not panic or error. The second call hits the
// None-guard (runtime already taken). This covers the same early-return path as
// the TLS-destroyed guard added for CPython atexit finalization ordering.
let shared_runtime = ForkSafeRuntime::new().unwrap();
let (worker, receiver) = make_test_worker();

let _ = shared_runtime.spawn_worker(worker, true).unwrap();
receiver
.recv_timeout(Duration::from_secs(1))
.expect("worker did not run");

assert!(shared_runtime.shutdown(None).is_ok());
assert!(
shared_runtime.shutdown(None).is_ok(),
"second shutdown must not panic"
);
}
}
Loading