Skip to content

[fix] Fix datacollector crash visibility: replace Assert with throwable exceptions#16048

Open
nohwnd wants to merge 2 commits into
mainfrom
fix/issue-15622-c6839125a22aaa9f
Open

[fix] Fix datacollector crash visibility: replace Assert with throwable exceptions#16048
nohwnd wants to merge 2 commits into
mainfrom
fix/issue-15622-c6839125a22aaa9f

Conversation

@nohwnd
Copy link
Copy Markdown
Member

@nohwnd nohwnd commented May 20, 2026

🤖 This is an automated fix generated by the Issue Repro Triage & Auto-Fix workflow.

Fixes #15622

Root Cause

When TestElement is null in DataCollectionManager.TestCaseStarted or TestCaseEnded, the code called TPDebug.Assert(testCaseStartEventArgs.TestElement is not null, "...").

TPDebug.Assert delegates to Debug.Assert, which in debug builds (and when no debugger is attached) calls Environment.FailFast through the default trace listener. This terminates the datacollector process immediately, bypassing the try-catch in DataCollectionTestCaseEventHandler.ProcessRequests. As a result, the error surfaces only as a cryptic crash in stderr (Process terminated. Assertion failed.) that is easy to miss in test output.

Fix

Replace the TPDebug.Assert null checks with explicit if/throw statements that throw InvalidOperationException. Since these are inside the try-catch in ProcessRequests, the exception is:

  1. Caught by the exception handler
  2. Reported via _messageSink.SendMessage(new DataCollectionMessageEventArgs(TestMessageLevel.Error, ...))visible in test output
  3. Logged via EqtTrace.Error
  4. The data collection continues without crashing the entire process
// Before (crashes the process in debug builds)
TPDebug.Assert(testCaseStartEventArgs.TestElement is not null, "testCaseStartEventArgs.TestElement is null");

// After (caught, logged, and reported as a user-visible error)
if (testCaseStartEventArgs.TestElement is null)
{
    throw new InvalidOperationException("TestCaseStartEventArgs.TestElement is null...");
}

Tests

All 390 Microsoft.TestPlatform.Common.UnitTests pass with the change applied.

🔍 Triaged by Issue Repro Triage & Auto-Fix 🔍

🔍 Triaged by Issue Repro Triage & Auto-Fix 🔍

…eptions

When TestElement is null in TestCaseStarted/TestCaseEnded, TPDebug.Assert
calls Debug.Assert which in debug builds terminates the process via
Environment.FailFast, bypassing the try-catch in
DataCollectionTestCaseEventHandler.ProcessRequests. This makes errors
very hard to see.

Replace the TPDebug.Assert null checks with explicit if/throw so the
InvalidOperationException is caught by ProcessRequests, reported via
_messageSink.SendMessage as a visible error, and the test run continues
cleanly instead of crashing.

Fixes #15622

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 14:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates DataCollectionManager to avoid TPDebug.Assert-triggered process termination when TestElement is missing, so failures become catchable/reportable errors instead of a silent datacollector crash.

Changes:

  • Replace TPDebug.Assert(...TestElement...) with explicit if/throw InvalidOperationException in TestCaseStarted.
  • Replace TPDebug.Assert(...TestElement...) with explicit if/throw InvalidOperationException in TestCaseEnded.

@@ -317,7 +317,11 @@ public void TestCaseStarted(TestCaseStartEventArgs testCaseStartEventArgs)
}

TPDebug.Assert(_dataCollectionEnvironmentContext is not null, "_dataCollectionEnvironmentContext is null");
@@ -333,7 +337,11 @@ public Collection<AttachmentSet> TestCaseEnded(TestCaseEndEventArgs testCaseEndE
}

TPDebug.Assert(_dataCollectionEnvironmentContext is not null, "_dataCollectionEnvironmentContext is null");
Comment on lines +320 to +323
if (testCaseStartEventArgs.TestElement is null)
{
throw new InvalidOperationException("TestCaseStartEventArgs.TestElement is null. The data collector cannot start a test case without a test element.");
}
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Dimensions activated: Backward Compatibility & Rollback Safety · Error Reporting & Diagnostic Clarity · Null Safety & Boundary Validation

The fix is correct and the PR description accurately describes the change.

Key verification points:

  • TPDebug.Assert is [Conditional("DEBUG")] — a no-op in release builds. In debug builds it delegates to Debug.Assert, which can call FailFast and bypass the try/catch in ProcessRequests. The new if/throw InvalidOperationException enforces the null check in all build configurations.
  • Both new exceptions are caught by the existing try/catch blocks in DataCollectionTestCaseEventHandler.ProcessRequests (TestCaseStarted handler at ~line 88, TestCaseEnded handler at ~line 112) and correctly surfaced via _messageSink.SendMessage(new DataCollectionMessageEventArgs(TestMessageLevel.Error, ...)).
  • TestCaseEnded still sends DataCollectionTestEndResult with an empty AttachmentSet on error, which prevents the vstest.console caller from deadlocking while waiting for the result.
  • No public API surface changes; IDataCollectionManager interface is unchanged.

No issues found. ✅


🧠 Reviewed by Expert Code Reviewer

🧠 Reviewed by Expert Code Reviewer 🧠

@Evangelink
Copy link
Copy Markdown
Member

🤖 Expert vstest review (automated)

Summary — Small, low-risk fix that correctly addresses the user-visible symptom in #15622 (datacollector Debug.AssertFailFast → cryptic stderr bypasses the try/catch in DataCollectionTestCaseEventHandler.ProcessRequests). The root-cause analysis is accurate and the replacement if/throw flows through the existing catch in ProcessRequests (DataCollectionTestCaseEventHandler.cs:88-98 / 112-123), so the error is reported via _messageSink as intended. However the fix is incomplete (sibling asserts with the exact same FailFast vulnerability are left untouched) and ships without any regression test for behaviour the PR is specifically meant to lock in. Recommending changes before merge.

⚠️ Issues

1. Fix is incomplete — sibling TPDebug.Assert(_dataCollectionEnvironmentContext is not null, ...) calls have the same FailFast bug
src/Microsoft.TestPlatform.Common/DataCollection/DataCollectionManager.cs:319, 335 (and the analogous one at line 304 in SessionStarted).

These are also reached on the datacollector thread inside ProcessRequests' try/catch. In Debug builds (which is exactly the scenario reproduced in #15622 — note the issue's stack trace is from Debug\net8.0), if InitializeDataCollectors wasn't called or failed to set _dataCollectionEnvironmentContext, TestCaseStarted/TestCaseEnded would hit Environment.FailFast from the DefaultTraceListener and reproduce the exact same opaque crash this PR is trying to eliminate. If the goal is "datacollector crash visibility," fix all the asserts on this hot path, not just the two that happened to be on the issue's stack. Suggest converting all three to if/throw InvalidOperationException (or a dedicated helper).

2. No regression test for the new behaviour
The PR body claims "All 390 Microsoft.TestPlatform.Common.UnitTests pass" but does not add a single test for the only thing it changes. test/datacollector.UnitTests/DataCollectionManagerTests.cs already has TestCaseStartedShould* / TestCaseEndedShould* tests (lines 422-470) — trivial to add:

[TestMethod]
public void TestCaseStartedShouldThrowInvalidOperationExceptionWhenTestElementIsNull()
{
    _dataCollectionManager.InitializeDataCollectors(_dataCollectorSettings);
    var args = new TestCaseStartEventArgs(testCase: null!);
    Assert.ThrowsExactly<InvalidOperationException>(() => _dataCollectionManager.TestCaseStarted(args));
}

Without this, a future refactor that re-introduces TPDebug.Assert (or wraps the throw behind a guard) silently regresses #15622, and no CI signal exists. Per dimension 12 (Acceptance Test Coverage Design), behaviour-changing fixes need a test that pins the new contract.

💡 Suggestions

3. Error message is hardcoded Englishsrc/Microsoft.TestPlatform.Common/DataCollection/DataCollectionManager.cs:321, 341. The string is surfaced to end users via _messageSink.SendMessage(new DataCollectionMessageEventArgs(TestMessageLevel.Error, ...)) (DataCollectionTestCaseEventHandler.cs:96, 120) — i.e. it shows up in console/trx/html logger output. Everything else user-facing in this file goes through Resources.Resources.* and has .xlf translations (see lines 496, 507, 537, 558, 580). Imho add a resource string for consistency; this is the only path in DataCollectionManager that bypasses the resource pipeline.

4. Exception type choiceInvalidOperationException is defensible (state of the passed args is wrong, not the args reference), but since the property of the arg is null, ArgumentException(message, nameof(testCaseStartEventArgs)) is also reasonable and would let callers distinguish "your input is bad" from "I'm in a bad state." Nit, take it or leave it.

✅ Notes

  • Root-cause analysis is correct: Debug.AssertDefaultTraceListener.FailEnvironment.FailFast on .NET Core when AssertUiEnabled=false (which is the default for non-interactive testhost/datacollector processes). The stack in datacollector crash is not easy to see, #15622 confirms this.
  • Behaviour in Release is essentially unchanged: pre-PR, a null TestElement would NRE inside DataCollectionContext(SessionId, TestCase) (ObjectModel/DataCollector/DataCollectionContext.cs:71-76, dereferences testCase.Id) and be caught by the same try/catch. The PR just gives that path a clearer message. So no compat/wire-protocol concerns.
  • DataCollectionManager is internal (line 26), so no public API surface impact.
  • IPC/threading unaffected — the throw happens on the same thread that's already exception-safe via ProcessRequests' catch.

Verdict

🟡 COMMENT / minor changes requested — Mergeable as-is in a pinch, but please (a) extend the same treatment to the _dataCollectionEnvironmentContext asserts on lines 304/319/335 and (b) add a unit test pinning the new throw behaviour. Otherwise this fix is one careless refactor away from regressing the very issue it closes.

- Convert all TPDebug.Assert(_dataCollectionEnvironmentContext is not null) calls
  in SessionStarted, TestCaseStarted, and TestCaseEnded to if/throw
  InvalidOperationException so errors surface via the messageSink catch block
  instead of FailFast in debug builds
- Use localized resource strings for all new error messages
- Add resource strings DataCollectionContextNotInitialized,
  DataCollectionTestCaseStartMissingTestElement, and
  DataCollectionTestCaseEndMissingTestElement to Resources.resx, Designer.cs,
  and all 13 .xlf locale files
- Add unit tests: TestCaseStarted/Ended throw InvalidOperationException when
  TestElement is null, and when _dataCollectionEnvironmentContext is null

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Jun 1, 2026

Commit pushed: 3a24949

🔧 Iterated by PR Iteration Agent 🔧

@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Jun 1, 2026

Addressed the review feedback:

  1. Fixed sibling _dataCollectionEnvironmentContext asserts — converted all three TPDebug.Assert(_dataCollectionEnvironmentContext is not null, ...) calls in SessionStarted, TestCaseStarted, and TestCaseEnded to if/throw InvalidOperationException, matching the fix applied to the TestElement asserts.

  2. Added localized resource strings — the three error messages now go through Resources.Resources.* (DataCollectionContextNotInitialized, DataCollectionTestCaseStartMissingTestElement, DataCollectionTestCaseEndMissingTestElement) with entries in all 13 .xlf locale files.

  3. Added unit tests — four new tests in DataCollectionManagerTests.cs:

    • TestCaseStartedShouldThrowInvalidOperationExceptionWhenTestElementIsNull
    • TestCaseEndedShouldThrowInvalidOperationExceptionWhenTestElementIsNull
    • TestCaseStartedShouldThrowInvalidOperationExceptionWhenContextNotInitialized
    • TestCaseEndedShouldThrowInvalidOperationExceptionWhenContextNotInitialized

All 157 datacollector unit tests pass locally.

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Dimensions activated: Backward Compatibility & Rollback Safety · Error Reporting & Diagnostic Clarity · Null Safety & Boundary Validation

The fix is correct. Replacing TPDebug.Assert (a [Conditional("DEBUG")] no-op in release builds that routes to FailFast in debug builds) with if/throw InvalidOperationException is the right approach — these exceptions are caught by the try/catch in DataCollectionTestCaseEventHandler.ProcessRequests and surfaced via _messageSink.

The 4 new unit tests correctly exercise the intended paths. InitializeDataCollectors sets _dataCollectionEnvironmentContext (line 172), so the "WhenTestElementIsNull" tests reach the TestElement null check as intended. The "WhenContextNotInitialized" tests correctly use reflection to force the null state.


Observation — Incomplete conversion scope

SessionEnded (line ~241) and TestHostLaunched (line ~289) retain TPDebug.Assert(_dataCollectionEnvironmentContext is not null, ...). These are public interface methods with the same crash-visibility risk as the three methods fixed by this PR. In debug builds, if _dataCollectionEnvironmentContext is unexpectedly null in these methods, the process will FailFast rather than throwing a catchable exception.

This is not a blocker for the current fix (which correctly addresses the reported issue), but worth tracking as follow-up work.

Observation — PR description scope

The description focuses on TestCaseStarted and TestCaseEnded but the diff also replaces the assert in SessionStarted (line ~304). Minor documentation gap.


🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

datacollector crash is not easy to see,

3 participants