Skip to content

Fix try script hang scripting#349

Merged
tkmcmaster merged 6 commits into0.6.0-scripting-devfrom
fix-try-script-hang-scripting
Jan 8, 2026
Merged

Fix try script hang scripting#349
tkmcmaster merged 6 commits into0.6.0-scripting-devfrom
fix-try-script-hang-scripting

Conversation

@tkmcmaster
Copy link
Copy Markdown
Collaborator

tkmcmaster and others added 4 commits January 6, 2026 08:36
* Fix for #123

1. lib/channel/src/lib.rs - Fixed OnDemandReceiver polling logic:
  - Simplified poll_next() by removing the infinite loop
  - Kept listener: None in new() (this was critical for endpoints with both on_demand and peak_load)
  - Added logic to create listener if None before polling
2. src/lib.rs - Fixed try mode endpoint selection:
  - Removed self-referencing provider filter (line 990-997)
  - Only force on_demand=true for endpoints without peak_load (line 1006-1007)
  - Added is_on_demand tracking through the build process
  - Added start streams for non-on_demand endpoints in try mode (line 176)
3. examples/test_examples_try.sh - New test script for try mode

* Added testing examples to the PR flow

* Changed test_examples.sh to prompt to run in pr.sh

* Attempt to fix providers::tests::literals_provider_works

  Fixed: Intermittent literals_provider_works Test Timeout

  Root Cause:

  The test creates list providers with repeat: true (infinite streams). When collecting only 100 items with .take(100), the test would:
  1. Drop the receiver after 100 items
  2. BUT the Provider struct still held a tx reference
  3. The spawned background task had tx2.clone() and tried to send infinitely
  4. If the channel buffer (size 5) was full, the task would block forever
  5. The tokio runtime would hang during shutdown waiting for the blocked task

  The Fix:

  File: src/providers.rs

  1. Explicitly drop tx before collecting (lines 388-389, 413-414):
  let Provider { rx, tx, .. } = p;
  drop(tx);  // Let the background task detect channel closure
  2. Add small delay after collection (lines 394, 419):
  time::sleep(Duration::from_millis(10)).await;
  2. This ensures the spawned task has time to detect the closed channel and complete before the next test starts.

* Fixed cleanup script to not stop on error

* Fixed clippy warning

* Sped up test_examples.sh by running in parallel

* Increased the timeout for test_examples. Some tests take 30 seconds

* Added similar fix for literals_provider_works for range_provider_works

---------

Co-authored-by: Trevor McMaster <tkmcmaster@users.noreply.github.com>
* Partial fix for #36

  Problem: When running PORT=8085 pewpew run ProviderEndsEarly.yaml, the test exited early with no explanation.

  Root cause: The code was always sending TestEndReason::Completed, even when the test ended early because providers ran out.

  Fix (src/lib.rs:1193-1222): Modified create_load_test_future to:
  1. Track whether the timeout expired with a timeout_expired flag
  2. Poll the timeout first instead of the endpoints
  3. When endpoints complete, check if timeout_expired is false → send TestEndReason::ProviderEnded
  4. When timeout expires → send TestEndReason::Completed

  Result: Now shows the yellow message:
  Test ended early because one or more providers ended

* Fixes for ending early when providers empty

- We had a couple issues. Even when providers would be empty and it should end, if --watch was passed it would still log provider stats but would stop making API calls until the duration ended. Fixed so even with --watch it will exit when providers end
- We were logging that providers were empty even in cases where the test ended on time due to race conditions on the order of shutdown of endpoints vs. providers. Added some timers to check if we're near the duration end. > 90% complete will be treated as done.

* Updated the gitignore and cleanup scripts

* Major change to return codes

- create_run now returns a Result with the reason why the test ended.
  - Test end reasons can be Completed, ProviderEnded, KilledByLogger, CtrlC
  - Depending on the Result Reason, different exit codes are used.
- Updated guide, README, and agent/pewpewtest.ts to handle the new new return codes
- Added new integration tests to handle the new exit codes
- Updated test_examples.sh for the new exit codes and added log_kill.yaml to examples

* Updated README for changes

* Fixed error message so agent test will pass

* More attempts to fix flaky tests

- For integration tests, we were hitting a race condition with the test server PORT and environment variables. To fix this, we initially used a mutex, but that slowed the tests way down. Instead we'll have the exported create_run() get the environment variables rather than the internal _create_run(). This will let the integration tests pass in the PORT directly via create_run_with_env()
- For the second issue of the range and literals providers taking more than 60 seconds (hanging):
Root Cause

  1. Infinite streams: Providers with repeat: true create infinite streams that cycle forever
  2. spawn_blocking thread pool: These streams are processed using spawn_blocking (for blocking I/O operations)
  3. Global resource: The spawn_blocking thread pool is global across all tests and persists between test runs
  4. Thread pool exhaustion: When tests run in rapid succession, tasks from run N are still active when run N+1 starts, exhausting the thread pool
  5. Result: New tests hang waiting for thread pool capacity

  The Solution

  1. Increased cleanup times (src/providers.rs):
  // During repeat sections - changed from 50ms to 200ms
  time::sleep(Duration::from_millis(200)).await;

  // At end of test - added 100ms final cleanup
  time::sleep(Duration::from_millis(100)).await;

  This gives the spawn_blocking tasks enough time to:
  - Complete their work
  - Exit and free their threads
  - Drain the thread pool before the runtime shuts down

  2. Run tests serially in CI (.github/workflows/pr-rust.yml):
  - run: cargo test --all -- --skip providers::tests::range_provider_works --skip providers::tests::literals_provider_works
  - run: cargo test --lib providers::tests::range_provider_works -- --test-threads=1
  - run: cargo test --lib providers::tests::literals_provider_works -- --test-threads=1

  Running with --test-threads=1 prevents parallel execution that would compete for the same thread pool.

  Why This Was Necessary

  Without sufficient cleanup time:
  - Test runs would leave zombie tasks in the spawn_blocking thread pool
  - Subsequent test runs would find no available threads
  - Tests would hang indefinitely or timeout

  The cleanup times ensure the global thread pool is fully drained between test runs, preventing resource contention.

* Fixed rust format

* Added new tests to createtest for exit codes

- Added tests for logger and provider ends early tests
- Added validation to the Stop test to check for Ctrl-C
- Updated the binary used for tests to an alpha build

* Changed Splunk log clean-up to optional

- New environment variable SPLUNK_LOGS_CLEANUP (default false) controls whether the stdout and stderr files are deleted during cleanup
- Updated tests to validate log cleanup based on SPLUNK_LOGS_CLEANUP

* Fixed the cleanup script for integration tests

* Fixed missing integration test

---------

Co-authored-by: Trevor McMaster <tkmcmaster@users.noreply.github.com>
@tkmcmaster tkmcmaster self-assigned this Jan 7, 2026
@tkmcmaster tkmcmaster marked this pull request as ready for review January 8, 2026 16:18
@tkmcmaster tkmcmaster merged commit 336206f into 0.6.0-scripting-dev Jan 8, 2026
68 of 70 checks passed
@tkmcmaster tkmcmaster deleted the fix-try-script-hang-scripting branch January 8, 2026 16:18
tkmcmaster added a commit that referenced this pull request Jan 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant