Skip to content

[fix] Fix race condition in BlameCollector: skip hang dump when testhost hasn't launched yet#16065

Open
nohwnd wants to merge 9 commits into
mainfrom
fix/issue-15588-hang-dump-race-condition-5bee74e4b58b61b6
Open

[fix] Fix race condition in BlameCollector: skip hang dump when testhost hasn't launched yet#16065
nohwnd wants to merge 9 commits into
mainfrom
fix/issue-15588-hang-dump-race-condition-5bee74e4b58b61b6

Conversation

@nohwnd
Copy link
Copy Markdown
Member

@nohwnd nohwnd commented May 25, 2026

Fix #15588

The inactivity timer in BlameCollector starts during Initialize, but _testHostProcessId is only set later when TestHostLaunchedHandler fires. If the timer fires before testhost launches (short timeout, slow start, or in unit tests with TimeSpan.Zero), we try to dump PID 0 — the Windows Idle process.

The fix: don't start the timer until testhost actually launches. Initialize creates the timer, TestHostLaunchedHandler sets the PID and then starts it. The timer can never fire before we know who to dump.

  • Remove ResetInactivityTimer() from Initialize
  • Swap order in TestHostLaunchedHandler: set PID before starting timer
  • Add integration test that poisons runtimeconfig.json with net9999.0 so testhost fails to start, then runs with CollectHangDump — verifies vstest exits cleanly without hanging or dumping PID 0
  • Update unit tests to reflect that the timer starts on TestHostLaunched, not Initialize

When the inactivity timer fires before the testhost process has launched,
_testHostProcessId is 0 (default int). Previously this caused
ProcessDumpUtility.StartHangBasedProcessDump to attempt to dump PID 0
(the Idle process on Windows / Swapper on Linux), resulting in an empty
or incorrect dump file.

The fix adds an early-return guard in CollectDumpAndAbortTesthost:
if _testHostProcessId == 0, log a warning and skip the dump/kill.

Also updates three existing hang dump unit tests to properly simulate
the happy-path scenario (testhost launches before the timer fires) by:
- Using a 50 ms timeout instead of 0 ms so the TestHostLaunched event
  can be raised before the timer callback runs
- Raising TestHostLaunched with PID 1234 before the timer fires

Adds a new test that verifies StartHangBasedProcessDump is NOT called
when the timer fires before TestHostLaunched.

Fixes #15588

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 25, 2026 14:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in BlameCollector hang-dump collection where the inactivity timer can fire before TestHostLaunched sets the testhost PID, causing the collector to attempt a dump/kill against PID 0. The PR adds a guard to skip hang-dump collection when the testhost hasn’t launched yet, and updates unit tests/localized resources accordingly.

Changes:

  • Add early-return in CollectDumpAndAbortTesthost when _testHostProcessId == 0, with warning logs.
  • Make hang-dump unit tests deterministic by avoiding TimeSpan.Zero in several tests and explicitly raising TestHostLaunched.
  • Add localized resource string TestHostNotLaunchedCannotCollectHangDump across .resx, designer, and all 13 .xlf files.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests/BlameCollectorTests.cs Updates hang-dump timer tests and adds a regression test for “timer fires before testhost launched”.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/BlameCollector.cs Skips hang-dump collection when testhost PID is not yet known (PID 0).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.resx Adds new resource string for the “testhost not launched” warning.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs Adds strongly-typed accessor for the new resource string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.cs.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.de.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.es.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.fr.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.it.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ja.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ko.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pl.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pt-BR.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ru.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.tr.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hans.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hant.xlf Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new).
Files not reviewed (1)
  • src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported

Comment on lines +302 to +304
// Do NOT raise TestHostLaunched — _testHostProcessId stays 0.
warningLogged.Wait(1000, TestContext.CancellationToken);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moot — reworked the fix to not start the timer until testhost launches. The PID==0 guard and warning log are gone; the test now verifies the timer never fires when testhost hasn't launched (no signal to wait for).


_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 0),
GetDumpConfigurationElement(false, false, true, 50),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moot — with the timer starting only after TestHostLaunched, there's no race between the timer and the PID assignment. The 50ms timeout is fine.


_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 0),
GetDumpConfigurationElement(false, false, true, 50),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — no race with the new approach.


_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 0),
GetDumpConfigurationElement(false, false, true, 50),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — no race with the new approach.

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector

Dimensions activated: Crash & Hang Dump Reliability · Error Reporting & Diagnostic Clarity · Parallel Execution & Scheduling Safety

Summary

The fix is correct and well-placed. Setting _inactivityTimerAlreadyFired = true before the guard ensures the timer is permanently defused regardless of the early-return path, and the test coverage faithfully exercises the new code path with TimeSpan.Zero.

One finding: _testHostProcessId is not volatile. The new guard makes this field load-bearing across a thread boundary for the first time — the timer thread now returns early based on reading 0. On x86/x64 this is unlikely to matter in practice, but volatile int is a one-liner fix that makes the cross-thread contract explicit and covers ARM64. See the inline comment for details.

Description alignment

PR description matches the diff exactly. Localization coverage (13 XLF files + resx + Resources.Designer.cs) is complete.


🧠 Reviewed by Expert Code Reviewer

🧠 Reviewed by Expert Code Reviewer 🧠

}

// If testhost has not launched yet, we cannot dump or kill it.
if (_testHostProcessId == 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Crash & Hang Dump Reliability / Parallel Execution & Scheduling Safety]_testHostProcessId is not volatile

_testHostProcessId is written on the data-collection events thread in TestHostLaunchedHandler and now load-bearing read here on the timer-callback thread. The C# memory model does not guarantee that a plain int write on one thread is visible as a non-stale read on another without a memory barrier.

Without volatile (or Interlocked), the JIT is allowed to cache the 0 in a register and satisfy this guard with a stale value, causing the guard to fire and silently skip the hang dump even when testhost has already launched. On x86/x64 (TSO) this is unlikely to manifest, but it is technically a data race and can bite on ARM64 CI agents.

The field was already read without synchronization at lines 261, 271, 319, and 627 (pre-existing), but this new guard is the first place where the value of 0 has a meaningful semantic effect (early return vs. proceeding). Marking the field volatile would make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive:

// before
private int _testHostProcessId;

// after
private volatile int _testHostProcessId;

volatile int is safe here — the field is only ever assigned once (in TestHostLaunchedHandler) and only read elsewhere, so no compare-and-swap semantics are needed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — marked _testHostProcessId as volatile to make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive on weakly-ordered CPUs (ARM64).

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would use interlocked instead as that is more common, and less scary.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went a step further — eliminated the cross-thread concern entirely. Timer now starts only after TestHostLaunchedHandler sets the PID, so the field is always written before any timer callback reads it. Reverted to plain int.

Ensures the hang-dump guard added in the parent commit sees a fresh
value on the timer-callback thread (ARM64 / weakly-ordered CPUs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 25, 2026

Commit pushed: 16a2275

🔧 Iterated by PR Iteration Agent 🔧

Instead of guarding against PID 0 in the timer callback, don't start
the inactivity timer until TestHostLaunched fires. This eliminates
the race at the source — the timer can never fire before we know which
process to dump.

- Remove ResetInactivityTimer() call from Initialize; timer is created
  but not started until TestHostLaunchedHandler sets the PID and calls
  ResetInactivityTimer.
- Swap order in TestHostLaunchedHandler: set PID before starting timer.
- Remove PID==0 guard and resource string (no longer needed).
- Revert volatile — no cross-thread visibility concern with proper ordering.
- Update tests to reflect that timer starts on TestHostLaunched, not Initialize.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 26, 2026 11:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Poisons runtimeconfig.json with net9999.0 so testhost can't launch,
then runs with CollectHangDump. Verifies vstest exits cleanly without
hanging or producing a dump against PID 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 26, 2026 13:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 16 comments.

Comment on lines 200 to 205
if (_collectProcessDumpOnHang)
{
_inactivityTimer ??= new InactivityTimer(CollectDumpAndAbortTesthost);
ResetInactivityTimer();
// Don't start the timer here — wait until TestHostLaunched so we know
// which process to dump. ResetInactivityTimer is called from TestHostLaunchedHandler.
}
Comment on lines +242 to +266
[NetCoreTargetFrameworkDataSource]
public void HangDumpShouldNotHangWhenTestHostFailsToStart(RunnerInfo runnerInfo)
{
// When testhost can't start (e.g. wrong runtime version), the inactivity timer
// must not try to dump PID 0. It should exit cleanly because the timer only starts
// after TestHostLaunched, which never fires here.
SetTestEnvironment(_testEnvironment, runnerInfo);
const string testAssetProjectName = "SimpleTestProjectMessedUpTargetFramework";
var assemblyPath = GetTestDllForFramework(testAssetProjectName + ".dll", Core11TargetFramework);

// Make testhost fail immediately by targeting a non-existent runtime.
var runtimeConfigJson = Path.Combine(Path.GetDirectoryName(assemblyPath)!, testAssetProjectName + ".runtimeconfig.json");
var fileContent = File.ReadAllText(runtimeConfigJson);
var updatedContent = fileContent.Replace("\"version\": \"11.0.0", "\"version\": \"9999.0.0");
File.WriteAllText(runtimeConfigJson, updatedContent);

var arguments = PrepareArguments(assemblyPath, GetTestAdapterPath(), string.Empty, string.Empty, runnerInfo.InIsolationValue);
arguments = string.Concat(arguments, $" /ResultsDirectory:{TempDirectory.Path}");
arguments = string.Concat(arguments, $@" /Blame:""CollectHangDump;HangDumpType=mini;TestTimeout=30s""");
InvokeVsTest(arguments);

// vstest should exit with failure (testhost didn't start), but not hang and not crash.
ExitCodeEquals(1);
Assert.DoesNotContain(".dmp", StdOut, "no dump should be collected when testhost never launched");
}
Comment on lines +304 to +306
// Do NOT raise TestHostLaunched — timer should never start.
// Wait long enough that a started timer with timeout 0 would have fired.
Thread.Sleep(100);
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (rework)

Dimensions checked: Crash & Hang Dump Reliability · Parallel Execution & Scheduling Safety · Algorithmic Correctness · Resource Management

Summary

The rework is a clear improvement over the previous iteration. The structural fix — don't start the timer until TestHostLaunchedHandler fires — completely eliminates the data-race concern that motivated the earlier volatile suggestion. Once the timer can only fire after _testHostProcessId is written (on the same thread), the System.Threading.Timer ThreadPool dispatch provides an implicit memory barrier, so plain int is correct here.

Specific code changes:

  • Initialize: Timer is created (??=) but no longer started — right fix, right place.
  • TestHostLaunchedHandler: PID assigned before ResetInactivityTimer() — the invariant is now enforced by construction.
  • CollectDumpAndAbortTesthost: TPDebug.Assert(_testHostProcessId != 0, ...) documents the invariant in debug builds. Any path that somehow bypassed the architectural guarantee would still hit the try/catch around StartHangBasedProcessDump in release, so the fallback handling is adequate.
  • XLF files: The 13 xlf changes are a BOM/XML-declaration normalisation — no new resource strings were added, consistent with the absence of a .resx diff.

Previous finding — resolved ✅

The volatile int _testHostProcessId finding from the prior review is fully resolved. The architectural fix is the right call — Interlocked/volatile would have been patch-over-the-symptom; not starting the timer early removes the cross-thread read at the source.

Description alignment

PR description matches the diff exactly. Localization coverage (13 XLF files, BOM-only change with no new strings) is consistent with the resx being unmodified.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

The test mutates SimpleTestProjectMessedUpTargetFramework's
runtimeconfig.json, which is shared with ProcessesInteractionTests.
Wrap in try/finally to restore the original content.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (follow-up — test commits)

Dimensions checked: Crash & Hang Dump Reliability · Algorithmic Correctness · Resource Management

New commits since last review (14:02 UTC)

Two test-only commits were pushed after the previous review:

  1. Unit test additions (InitializeWithDumpForHangShouldNotStartTimerIfTestHostHasNotLaunchedYet): Directly verifies the new invariant — timer does not fire when TestHostLaunched is never raised. Timeout changed from 050ms in the three existing timing-sensitive tests; this provides sufficient margin for Thread.Sleep(100) to distinguish a correctly dormant timer from one that actually fires. ✅

  2. Integration test (HangDumpShouldNotHangWhenTestHostFailsToStart): Poisons runtimeconfig.json with "9999.0.0" to prevent testhost from launching, then asserts vstest exits with code 1 and emits no .dmp file. The final commit wraps the file mutation in try/finally to restore the original content — correct and necessary since SimpleTestProjectMessedUpTargetFramework is shared with ProcessesInteractionTests. The Assert.DoesNotContain(".dmp", StdOut, ...) parameter order is correct (needle first). ✅

Previously reviewed findings — still resolved ✅

The architectural fix (timer only starts in TestHostLaunchedHandler after PID is set) and the TPDebug.Assert(_testHostProcessId != 0, ...) invariant guard were both confirmed in the prior review. Nothing in the new commits regresses those findings.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 27, 2026

macOS leg timed out after 1h (infrastructure). Windows, Ubuntu, and Source-Build all passed. Ready to retry or merge — the fix is Windows-specific (hang dump with PID 0).

Reviewer preference: use Interlocked for cross-thread visibility instead
of plain int assignment, as it is more common and explicit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 28, 2026

Commit pushed: b179a7d

🔧 Iterated by PR Iteration Agent 🔧

Copilot AI review requested due to automatic review settings May 28, 2026 07:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Comment on lines 200 to 205
if (_collectProcessDumpOnHang)
{
_inactivityTimer ??= new InactivityTimer(CollectDumpAndAbortTesthost);
ResetInactivityTimer();
// Don't start the timer here — wait until TestHostLaunched so we know
// which process to dump. ResetInactivityTimer is called from TestHostLaunchedHandler.
}
Comment on lines +248 to +251
// If testhost has not launched yet, the timer should not have been started,
// so this callback should not fire. Assert to catch unexpected paths.
TPDebug.Assert(_testHostProcessId != 0, "CollectDumpAndAbortTesthost called but testhost has not launched yet.");

Comment on lines +241 to +243
[TestMethod]
[NetCoreTargetFrameworkDataSource]
public void HangDumpShouldNotHangWhenTestHostFailsToStart(RunnerInfo runnerInfo)
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="utf-8"?>
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (follow-up — Interlocked.Exchange commit)

Dimensions checked: Parallel Execution & Scheduling Safety · Algorithmic Correctness

New commit since last review (2026-05-26T14:37Z)

Interlocked.Exchange(ref _testHostProcessId, args.TestHostProcessId) — replaces the plain int assignment in TestHostLaunchedHandler.

This is correct and clean. The architectural guarantee (timer cannot fire until ResetInactivityTimer() is called, which happens on the next line after the Exchange) already ensures the timer thread will always observe a non-zero PID. Interlocked.Exchange makes the full memory fence explicit and is a reasonable stylistic choice for cross-thread writes, consistent with the reviewer's preference.

No new issues introduced. All previously reviewed findings remain resolved. ✅


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

var originalContent = File.ReadAllText(runtimeConfigJson);
try
{
var updatedContent = originalContent.Replace("\"version\": \"11.0.0", "\"version\": \"9999.0.0");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex, so we don't have to update every time we update runtime.

Make the test do-not-parallelize, just so we avoid running in conflict with other possible test that edits the project.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — the test now uses Regex.Replace to match any "version": "x.y.z" pattern (so the test won't need updating as the runtime version changes), and the [DoNotParallelize] attribute was added to avoid conflicts with other tests that might read the same runtimeconfig.json file on disk.

🔧 Iterated by PR Iteration Agent 🔧

InvokeVsTest(arguments);

// vstest should exit with failure (testhost didn't start), but not hang and not crash.
ExitCodeEquals(1);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try ensuring harder the crash was because we did not find the runtime.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added Assert.MatchesRegex after ExitCodeEquals(1) that checks the combined stdout+stderr for the runtime-not-found message (matches "framework.*9999.0.0", "9999.0.0.*not found", or "compatible framework version"). This verifies the failure was specifically due to the missing runtime, not some other error.

🔧 Iterated by PR Iteration Agent 🔧

… runtime-not-found error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 29, 2026

Commit pushed: a30e0c2

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (follow-up — test hardening commit)

Dimensions checked: Algorithmic Correctness · Crash & Hang Dump Reliability · Resource Management

New commit since last review (2026-05-28T07:19Z)

test: use regex for version replacement, add DoNotParallelize, assert runtime-not-found error (a30e0c2) — test-only change to the integration test.

Three improvements to HangDumpShouldNotHangWhenTestHostFailsToStart:

  1. Regex.Replace for version patching — replaces the prior literal string substitution with Regex.Replace(content, @"""version""\s*:\s*""[\d.]+""", ...). More robust against whitespace variations in the JSON. ✅

  2. [DoNotParallelize] — ensures the runtimeconfig.json mutation is not interleaved with other tests that use the same SimpleTestProjectMessedUpTargetFramework asset. ✅

  3. Runtime-not-found assertionAssert.MatchesRegex(new Regex(@"framework.*9999\.0\.0|...", IgnoreCase), StdOut + StdErr, ...) verifies the failure is specifically "runtime not found", not some other exit-code-1 scenario. Parameter order (pattern first, haystack second) is correct. ✅

Assert.DoesNotContain(".dmp", StdOut, ...) parameter order (needle first) remains correct from the prior commit. ✅

Previously reviewed findings — still resolved ✅

No regressions to prior findings. All architecture invariants (timer only starts in TestHostLaunchedHandler, Interlocked.Exchange for PID write) remain intact.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Use the same assertion approach as ProcessesInteractionTests which is
proven to work on Windows. The previous Assert.MatchesRegex on combined
StdOut+StdErr may fail due to multiline regex behavior differences.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 31, 2026

Commit pushed: 5895ebd

🔧 Iterated by PR Iteration Agent 🔧

@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 31, 2026

Fixed the CI assertion in HangDumpShouldNotHangWhenTestHostFailsToStart.

The previous Assert.MatchesRegex on combined StdOut + StdErr was replaced with StdErrorRegexIsMatch("9999\\.0\\.0") — the same approach used by ProcessesInteractionTests.WhenTestHostProcessExitsBecauseTheTargetedRuntimeIsNoFound... which is proven to work on Windows.

The multiline hostfxr error message (containing Framework: '...', version '9999.0.0') may not match a non-Singleline . regex across StdOut + StdErr, whereas checking StdErr directly for the version string 9999.0.0 is reliable.

🔧 Iterated by PR Iteration Agent 🔧

Copilot AI review requested due to automatic review settings May 31, 2026 06:14
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (follow-up — StdError assertion commit)

Dimensions checked: Algorithmic Correctness · Crash & Hang Dump Reliability

New commit since last review (2026-05-29T15:59Z)

test: use StdErrorRegexIsMatch for runtime-not-found assertion (5895ebd) — test-only, one-line change.

Replaces Assert.MatchesRegex(new Regex(@"framework.*9999\.0\.0|...", IgnoreCase), StdOut + StdErr, ...) with StdErrorRegexIsMatch("9999\\.0\\.0"). This aligns with the pattern already used in ProcessesInteractionTests, which is proven to work on Windows. The multiline regex over a combined string was a potential source of cross-platform regex behavior differences; the targeted StdErrorRegexIsMatch helper is cleaner and more reliable. ✅

Previously reviewed findings — still resolved ✅

All prior findings (architectural timer fix, Interlocked.Exchange, try/finally restore, [DoNotParallelize], Regex.Replace version patching) remain intact and unaffected.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comment on lines +248 to +251
// If testhost has not launched yet, the timer should not have been started,
// so this callback should not fire. Assert to catch unexpected paths.
TPDebug.Assert(_testHostProcessId != 0, "CollectDumpAndAbortTesthost called but testhost has not launched yet.");

Comment on lines +258 to +260
var updatedContent = Regex.Replace(originalContent, @"""version""\s*:\s*""[\d.]+""", @"""version"": ""9999.0.0""");
File.WriteAllText(runtimeConfigJson, updatedContent);

Comment on lines +203 to 205
// Don't start the timer here — wait until TestHostLaunched so we know
// which process to dump. ResetInactivityTimer is called from TestHostLaunchedHandler.
}
The .NET runtime-not-found error message can appear in either stdout or
stderr depending on the platform and vstest host configuration. Check
both streams to make the assertion robust across environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Jun 1, 2026

Commit pushed: ecb07c8

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix race condition in BlameCollector (follow-up — check both stdout and stderr)

Dimensions checked: Algorithmic Correctness · Crash & Hang Dump Reliability

New commit since last review (2026-05-31T06:17Z)

test: check both stdout and stderr for runtime-not-found assertion (ecb07c8) — test-only, replaces StdErrorRegexIsMatch("9999\\.0\\.0") with:

Assert.IsTrue(
    Regex.IsMatch(StdErr, "9999\\.0\\.0") || Regex.IsMatch(StdOut, "9999\\.0\\.0"),
    $"Expected '9999.0.0' in output.\nStdErr: {StdErr}\nStdOut: {StdOut}");

This is correct and a genuine improvement for cross-platform robustness. On some platforms/host configurations the .NET runtime-not-found error surfaces in stdout rather than stderr. The || guards both cases. The regex pattern is properly escaped (\\.), the parameter orders to Regex.IsMatch are correct (pattern second), and the diagnostic message includes both streams for debugging failed CI runs. ✅

Previously reviewed findings — still resolved ✅

All prior findings (architectural timer fix, Interlocked.Exchange, try/finally restore, [DoNotParallelize], Regex.Replace version patching) remain intact and unaffected.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HangDumpChildProcesses very flaky

2 participants