Fix try script hang scripting by tkmcmaster · Pull Request #349 · FamilySearch/pewpew

tkmcmaster · 2026-01-07T23:04:44Z

* Fix for #123 1. lib/channel/src/lib.rs - Fixed OnDemandReceiver polling logic: - Simplified poll_next() by removing the infinite loop - Kept listener: None in new() (this was critical for endpoints with both on_demand and peak_load) - Added logic to create listener if None before polling 2. src/lib.rs - Fixed try mode endpoint selection: - Removed self-referencing provider filter (line 990-997) - Only force on_demand=true for endpoints without peak_load (line 1006-1007) - Added is_on_demand tracking through the build process - Added start streams for non-on_demand endpoints in try mode (line 176) 3. examples/test_examples_try.sh - New test script for try mode * Added testing examples to the PR flow * Changed test_examples.sh to prompt to run in pr.sh * Attempt to fix providers::tests::literals_provider_works Fixed: Intermittent literals_provider_works Test Timeout Root Cause: The test creates list providers with repeat: true (infinite streams). When collecting only 100 items with .take(100), the test would: 1. Drop the receiver after 100 items 2. BUT the Provider struct still held a tx reference 3. The spawned background task had tx2.clone() and tried to send infinitely 4. If the channel buffer (size 5) was full, the task would block forever 5. The tokio runtime would hang during shutdown waiting for the blocked task The Fix: File: src/providers.rs 1. Explicitly drop tx before collecting (lines 388-389, 413-414): let Provider { rx, tx, .. } = p; drop(tx); // Let the background task detect channel closure 2. Add small delay after collection (lines 394, 419): time::sleep(Duration::from_millis(10)).await; 2. This ensures the spawned task has time to detect the closed channel and complete before the next test starts. * Fixed cleanup script to not stop on error * Fixed clippy warning * Sped up test_examples.sh by running in parallel * Increased the timeout for test_examples. Some tests take 30 seconds * Added similar fix for literals_provider_works for range_provider_works --------- Co-authored-by: Trevor McMaster <tkmcmaster@users.noreply.github.com>

* Partial fix for #36 Problem: When running PORT=8085 pewpew run ProviderEndsEarly.yaml, the test exited early with no explanation. Root cause: The code was always sending TestEndReason::Completed, even when the test ended early because providers ran out. Fix (src/lib.rs:1193-1222): Modified create_load_test_future to: 1. Track whether the timeout expired with a timeout_expired flag 2. Poll the timeout first instead of the endpoints 3. When endpoints complete, check if timeout_expired is false → send TestEndReason::ProviderEnded 4. When timeout expires → send TestEndReason::Completed Result: Now shows the yellow message: Test ended early because one or more providers ended * Fixes for ending early when providers empty - We had a couple issues. Even when providers would be empty and it should end, if --watch was passed it would still log provider stats but would stop making API calls until the duration ended. Fixed so even with --watch it will exit when providers end - We were logging that providers were empty even in cases where the test ended on time due to race conditions on the order of shutdown of endpoints vs. providers. Added some timers to check if we're near the duration end. > 90% complete will be treated as done. * Updated the gitignore and cleanup scripts * Major change to return codes - create_run now returns a Result with the reason why the test ended. - Test end reasons can be Completed, ProviderEnded, KilledByLogger, CtrlC - Depending on the Result Reason, different exit codes are used. - Updated guide, README, and agent/pewpewtest.ts to handle the new new return codes - Added new integration tests to handle the new exit codes - Updated test_examples.sh for the new exit codes and added log_kill.yaml to examples * Updated README for changes * Fixed error message so agent test will pass * More attempts to fix flaky tests - For integration tests, we were hitting a race condition with the test server PORT and environment variables. To fix this, we initially used a mutex, but that slowed the tests way down. Instead we'll have the exported create_run() get the environment variables rather than the internal _create_run(). This will let the integration tests pass in the PORT directly via create_run_with_env() - For the second issue of the range and literals providers taking more than 60 seconds (hanging): Root Cause 1. Infinite streams: Providers with repeat: true create infinite streams that cycle forever 2. spawn_blocking thread pool: These streams are processed using spawn_blocking (for blocking I/O operations) 3. Global resource: The spawn_blocking thread pool is global across all tests and persists between test runs 4. Thread pool exhaustion: When tests run in rapid succession, tasks from run N are still active when run N+1 starts, exhausting the thread pool 5. Result: New tests hang waiting for thread pool capacity The Solution 1. Increased cleanup times (src/providers.rs): // During repeat sections - changed from 50ms to 200ms time::sleep(Duration::from_millis(200)).await; // At end of test - added 100ms final cleanup time::sleep(Duration::from_millis(100)).await; This gives the spawn_blocking tasks enough time to: - Complete their work - Exit and free their threads - Drain the thread pool before the runtime shuts down 2. Run tests serially in CI (.github/workflows/pr-rust.yml): - run: cargo test --all -- --skip providers::tests::range_provider_works --skip providers::tests::literals_provider_works - run: cargo test --lib providers::tests::range_provider_works -- --test-threads=1 - run: cargo test --lib providers::tests::literals_provider_works -- --test-threads=1 Running with --test-threads=1 prevents parallel execution that would compete for the same thread pool. Why This Was Necessary Without sufficient cleanup time: - Test runs would leave zombie tasks in the spawn_blocking thread pool - Subsequent test runs would find no available threads - Tests would hang indefinitely or timeout The cleanup times ensure the global thread pool is fully drained between test runs, preventing resource contention. * Fixed rust format * Added new tests to createtest for exit codes - Added tests for logger and provider ends early tests - Added validation to the Stop test to check for Ctrl-C - Updated the binary used for tests to an alpha build * Changed Splunk log clean-up to optional - New environment variable SPLUNK_LOGS_CLEANUP (default false) controls whether the stdout and stderr files are deleted during cleanup - Updated tests to validate log cleanup based on SPLUNK_LOGS_CLEANUP * Fixed the cleanup script for integration tests * Fixed missing integration test --------- Co-authored-by: Trevor McMaster <tkmcmaster@users.noreply.github.com>

- #347 - #348

- Fixed merge conflicts after #349

tkmcmaster and others added 4 commits January 6, 2026 08:36

Merge branch 'master' into fix-try-script-hang-scripting

0da0043

- #347 - #348

Updated the new tests and examples for scripting

769a687

tkmcmaster self-assigned this Jan 7, 2026

tkmcmaster added 2 commits January 7, 2026 23:33

More fixes after merge from master

841f51b

Added code to skip examples with declare in try mode

5b55b83

tkmcmaster marked this pull request as ready for review January 8, 2026 16:18

tkmcmaster merged commit 336206f into 0.6.0-scripting-dev Jan 8, 2026
68 of 70 checks passed

tkmcmaster deleted the fix-try-script-hang-scripting branch January 8, 2026 16:18

tkmcmaster added a commit that referenced this pull request Jan 8, 2026

Merge branch 'master' into 0.6.0-scripting-dev

588ab24

- Fixed merge conflicts after #349

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix try script hang scripting#349

Fix try script hang scripting#349
tkmcmaster merged 6 commits into0.6.0-scripting-devfrom
fix-try-script-hang-scripting

tkmcmaster commented Jan 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tkmcmaster commented Jan 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant