perf(flagd): run all 3 e2e resolver modes concurrently via @TestFactory #1753

Draft
aepfli wants to merge 3 commits into feat/speed-up-flagd-e2e-tests from feat/parameterized-e2e-suite

Member

@aepfli aepfli commented Apr 1, 2026

Summary

Stacked on top of #1752. Replaces three sequential @Suite runner classes with a single RunE2ETests class using Jupiter @TestFactory methods, cutting wall-clock time from ~3:20 to ~2:00.

Changes

1. RunE2ETests — three concurrent @TestFactory methods

Each factory (rpc(), inProcess(), file()) launches a full Cucumber engine for its resolver mode. With @Execution(CONCURRENT) and junit.jupiter.execution.parallel.enabled=true, all three engines run simultaneously:

Before: RPC → InProcess → File (sequential)  ≈ 3:20
After:  RPC ───────┐
        InProcess ─┼── all run in parallel ──→ ≈ 2:00
        File ──────┘
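
The timing claim above (wall-clock time close to the slowest engine rather than the sum of all three) can be illustrated with a dependency-free sketch. It is not the PR's code: `ParallelModesSketch` and `fakeEngine` are hypothetical names, the sleeps stand in for Cucumber engine runs, and a plain `ExecutorService` stands in for Jupiter's parallel executor.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelModesSketch {

    /** Hypothetical stand-in for one resolver-mode run: sleeps instead of launching Cucumber. */
    static Callable<String> fakeEngine(String mode, long millis) {
        return () -> {
            Thread.sleep(millis);
            return mode;
        };
    }

    /** Runs every engine on its own thread and returns the elapsed wall-clock time in ms. */
    static long runConcurrently(List<Callable<String>> engines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(engines.size());
        try {
            long start = System.nanoTime();
            // invokeAll blocks until all tasks have completed
            for (Future<String> result : pool.invokeAll(engines)) {
                result.get(); // rethrows if an engine failed
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> engines = List.of(
                fakeEngine("rpc", 200),
                fakeEngine("inProcess", 300),
                fakeEngine("file", 100));
        // Expected to land near the slowest task (300 ms), not the 600 ms sum
        System.out.println("elapsed ms: " + runConcurrently(engines));
    }
}
```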

Each factory returns a Stream<DynamicNode> mirroring the Cucumber TestPlan (engine → feature → scenario), so IDEs show the full expandable tree with accurate pass/fail/skip per scenario — individual scenarios can be re-run from IntelliJ.

2. CucumberResultListener — full TestExecutionListener

Captures every lifecycle event (started, finished, skipped, dynamic) from the Cucumber engine and maps them to DynamicTest results. The listener tracks both started and finished state to correctly report scenarios that started but did not complete.
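
A minimal sketch of the started/finished bookkeeping described above, under stated assumptions: `ScenarioTracker`, its `Outcome` enum, and the string IDs are hypothetical, and the real listener receives JUnit Platform `TestIdentifier` and `TestExecutionResult` objects rather than plain strings and booleans. The concurrent collections mirror the thread-safety concern raised later in review.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

final class ScenarioTracker {
    enum Outcome { PASSED, FAILED, SKIPPED }

    // Concurrent collections because events may arrive from several threads
    private final Set<String> started = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Outcome> finished = new ConcurrentHashMap<>();

    void executionStarted(String id) {
        started.add(id);
    }

    void executionFinished(String id, boolean successful) {
        finished.put(id, successful ? Outcome.PASSED : Outcome.FAILED);
    }

    void executionSkipped(String id) {
        finished.put(id, Outcome.SKIPPED);
    }

    /** Scenarios that reported a start event but never a terminal event. */
    Set<String> incomplete() {
        Set<String> result = new TreeSet<>(started);
        result.removeAll(finished.keySet());
        return result;
    }

    /** Recorded outcome, or null if the scenario has not finished. */
    Outcome outcomeOf(String id) {
        return finished.get(id);
    }
}
```

A dynamic test built from this state could fail with a descriptive message for any ID returned by `incomplete()` instead of silently reporting nothing.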

3. ContainerPool — JVM shutdown hook + restart semaphore

  • Lazy ensureInitialized() via AtomicBoolean — pool starts once on first acquire(), shared across all three concurrent engines
  • JVM shutdown hook replaces the previous ref-counted initialize()/shutdown() — no lifecycle calls needed from test classes
  • Semaphore(1) serialises disruptive container operations (stop/restart) across parallel engines to prevent cascading init timeouts
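
The three bullets above can be sketched roughly as follows. This is an illustration rather than the actual `ContainerPool`: `PoolLifecycleSketch`, `startCount`, `stopAll`, and `withRestartPermit` are hypothetical names, and container start/stop is reduced to a counter and an empty method.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolLifecycleSketch {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);
    // A single permit: at most one disruptive operation at a time
    private static final Semaphore restartPermit = new Semaphore(1);
    static final AtomicInteger startCount = new AtomicInteger();

    /** Starts the pool on first use; every later call is a no-op. */
    static synchronized void ensureInitialized() {
        if (initialized.get()) {
            return;
        }
        startCount.incrementAndGet(); // stand-in for starting the containers
        // Cleanup is tied to JVM exit, so test classes need no shutdown call
        Runtime.getRuntime().addShutdownHook(new Thread(PoolLifecycleSketch::stopAll));
        initialized.set(true);
    }

    /** Stand-in for stopping all containers. */
    static void stopAll() {
    }

    /** Runs a disruptive action (stop/restart) while holding the single permit. */
    static void withRestartPermit(Runnable action) throws InterruptedException {
        restartPermit.acquire();
        try {
            action.run();
        } finally {
            restartPermit.release();
        }
    }
}
```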

4. Per-resolver glue packages

RpcSetup, InProcessSetup, FileSetup — each a simple @Before hook that sets state.resolverType, allowing all three engines to share the same step definitions with isolated per-scenario state.
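
The shape of this glue can be sketched without Cucumber on the classpath. Everything here is an assumption for illustration: `GlueSketch`, `ResolverType`, `State`, and `describeProvider` are hypothetical, and in the real glue classes the `setup()` method would carry Cucumber's `@Before` annotation, with `State` supplied per scenario by the test framework.

```java
final class GlueSketch {
    enum ResolverType { RPC, IN_PROCESS, FILE }

    /** Per-scenario state; a fresh instance is assumed for every scenario. */
    static final class State {
        ResolverType resolverType;
    }

    /** Lives in the RPC glue package; setup() would be the @Before hook. */
    static final class RpcSetup {
        private final State state;

        RpcSetup(State state) {
            this.state = state;
        }

        void setup() {
            state.resolverType = ResolverType.RPC;
        }
    }

    /** Lives in the file glue package; identical apart from the resolver type. */
    static final class FileSetup {
        private final State state;

        FileSetup(State state) {
            this.state = state;
        }

        void setup() {
            state.resolverType = ResolverType.FILE;
        }
    }

    /** Shared step definition: its behaviour depends only on the state it is given. */
    static String describeProvider(State state) {
        return "provider for " + state.resolverType;
    }
}
```

Because each engine loads only its own setup class alongside the shared steps, the same step code resolves to a different provider per engine without any global switch.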

5. Deleted RunRpcTest, RunInProcessTest, RunFileTest

No longer needed — RunE2ETests replaces all three.

6. Envoy cluster improvements (test-harness submodule)

Added connect_timeout: 0.25s and active TCP health checks (interval: 1s) to both flagd clusters in envoy config. Envoy now detects and recovers from upstream flagd restarts within one health-check cycle.

Known limitations

  • @targetURI @in-process scenarios (7) are excluded from the parallel run. Root cause: retryBackoffMaxMs controls both the initial-connection throttle in SyncStreamQueueSource (when getMetadata() fails because envoy's upstream isn't ready yet) and the post-disconnect reconnect backoff. These cannot be tuned independently — reducing the backoff for fast initial connection breaks the reconnect event timing tests. Tracked in flagd#1584 — once getMetadata() is removed, these can be re-enabled by removing "targetURI" from the inProcess() excludeTags.
  • file()[4][1-3] — FILE resolver lacks flag-set metadata support (SDK limitation, pre-existing)
  • inProcess()[3][2], [6][1-3], [8][2] — contextEnrichment failures (pre-existing, also fail on base branch)

Run from repo root

./mvnw -pl providers/flagd -Pe2e test

aepfli and others added 2 commits March 31, 2026 15:57
Replace three sequential @Suite runner classes (RunRpcTest, RunInProcessTest,
RunFileTest) with a single RunE2ETests class using Jupiter @TestFactory methods.
Each factory (rpc, inProcess, file) runs its Cucumber engine concurrently via
@Execution(CONCURRENT), giving wall-clock time ≈ max(RPC, InProcess, File)
instead of their sum (~2:00 vs ~3:20 in Maven/CI).

Full scenario tree is preserved in IDEs: each factory returns Stream<DynamicNode>
mirroring the Cucumber TestPlan via CucumberResultListener.

Container pool uses JVM shutdown hook for lifecycle (no explicit init/shutdown
needed from test classes) and a Semaphore to serialize disruptive container
operations across parallel engines.

Envoy clusters now use connect_timeout=0.25s and active TCP health checks
(interval=1s) so upstream reconnection after flagd restart is detected within
one health-check cycle rather than waiting for the next client connection.

Known parallel-load failures (also present in base branch sequentially):
- file()[4][1-3]: FILE resolver lacks flag-set metadata support (SDK limitation)
- inProcess()[3][2], [6][1-3], [8][2]: contextEnrichment pre-existing failures
- inProcess()[2][5-7]: TargetUri scenarios sensitive to shared container pool
  contention; all 3 engines share 4 containers so gRPC init occasionally hits
  the 2s deadline under peak load. Pass reliably in sequential mode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Simon Schrottner <simon.schrottner@dynatrace.com>
The retryBackoffMaxMs option controls both the initial-connection throttle
in SyncStreamQueueSource (when getMetadata() fails) and the post-disconnect
reconnect backoff. Under parallel load, envoy's upstream gRPC connection to
flagd may not be established when the first getMetadata() call fires. The
call times out after deadline=1000ms, shouldThrottle is set, and the retry
waits retryBackoffMaxMs=2000ms — beyond the waitForInitialization window of
deadline*2=2000ms. Reducing retryBackoffMaxMs breaks the reconnect event
tests that need a slow-enough backoff for error events to fire.

Exclude @targetURI from the inProcess @TestFactory until flagd issue #1584
is resolved (removing getMetadata() entirely), at which point the throttle
timing problem disappears and these scenarios can be re-enabled.

RPC @targetURI scenarios are unaffected (different code path, no metadata call).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Simon Schrottner <simon.schrottner@dynatrace.com>
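
The timing argument in this commit message can be checked with a small worked calculation. The sketch assumes the backoff wait begins only after the first `getMetadata()` call has used its full deadline; `InitTimingSketch` and its method names are hypothetical, while the numbers are the ones quoted above.

```java
final class InitTimingSketch {

    /** Earliest time (ms after start) at which the second connection attempt begins. */
    static long retryStartsAtMs(long deadlineMs, long retryBackoffMaxMs) {
        // first attempt times out at deadlineMs, then the throttle waits retryBackoffMaxMs
        return deadlineMs + retryBackoffMaxMs;
    }

    /** waitForInitialization window as described in the commit message: deadline * 2. */
    static long initWindowMs(long deadlineMs) {
        return deadlineMs * 2;
    }

    /** True if the retry can at least start before initialization gives up. */
    static boolean retryFitsInWindow(long deadlineMs, long retryBackoffMaxMs) {
        return retryStartsAtMs(deadlineMs, retryBackoffMaxMs) < initWindowMs(deadlineMs);
    }

    public static void main(String[] args) {
        long deadlineMs = 1000;
        long retryBackoffMaxMs = 2000;
        System.out.println("retry starts at ms: " + retryStartsAtMs(deadlineMs, retryBackoffMaxMs));
        System.out.println("init window ms:     " + initWindowMs(deadlineMs));
        System.out.println("retry fits:         " + retryFitsInWindow(deadlineMs, retryBackoffMaxMs));
    }
}
```

With deadline=1000ms and retryBackoffMaxMs=2000ms the retry cannot begin before 3000ms, which is past the 2000ms window, so initialization fails whenever the first call times out.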
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the flagd E2E test suite to support concurrent execution of RPC, in-process, and file resolver modes using JUnit 5 @TestFactory. It introduces a lazy-initialized ContainerPool with a JVM shutdown hook, a semaphore for managing container restarts, and improved synchronization for gRPC and file availability. Feedback identifies a thread-safety issue in CucumberResultListener and suggests optimizing ContainerPool initialization to reduce lock contention.

- CucumberResultListener: replace LinkedHashSet/LinkedHashMap with
  thread-safe ConcurrentHashMap equivalents. Cucumber runs scenarios in
  parallel via PARALLEL_EXECUTION_ENABLED_PROPERTY_NAME, so the listener
  collections are written from multiple threads during launcher.execute().

- ContainerPool.ensureInitialized: replace synchronized method with
  double-checked locking (fast-path initialized.get() before entering
  synchronized block). After the pool is warmed up, concurrent acquire()
  calls skip the lock entirely and go straight to pool.take().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Simon Schrottner <simon.schrottner@dynatrace.com>
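
The double-checked variant of `ensureInitialized` described in this commit might look roughly like the sketch below. `DoubleCheckedInitSketch`, `lock`, and `initRuns` are hypothetical names, and the pool warm-up is reduced to a counter so the single-initialization property is easy to observe.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class DoubleCheckedInitSketch {
    private static final AtomicBoolean initialized = new AtomicBoolean(false);
    private static final Object lock = new Object();
    static final AtomicInteger initRuns = new AtomicInteger();

    static void ensureInitialized() {
        // Fast path: once warmed up, callers never touch the lock
        if (initialized.get()) {
            return;
        }
        synchronized (lock) {
            // Re-check: another thread may have finished while this one waited
            if (initialized.get()) {
                return;
            }
            initRuns.incrementAndGet(); // stand-in for warming the pool
            initialized.set(true);      // published only after init completes
        }
    }
}
```

Setting the flag last matters: a thread taking the fast path should only see `true` after the pool is fully usable.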
5 participants