Skip to content

Metadata connection (broker -1) receive timeout causes test hangs #578

@thomhurst

Description

@thomhurst

Summary

Integration tests intermittently hang when the metadata/bootstrap connection (broker -1) times out with Receive timeout after 30000ms. This prevents the producer from resolving partition leaders for newly created topics, causing tests to hang until the 5-minute test timeout.

Symptoms

  • [Error] Dekaf.Networking.KafkaConnection: Receive timeout after 30000ms on broker -1 - marking connection as failed
  • [Information] Dekaf.Networking.ConnectionPool: All connections closed appearing during test setup
  • Tests that create multi-partition topics and produce to them hang indefinitely
  • Affected tests: MultiPartition_SeekOnSpecificPartition_WorksCorrectly, MultiPartition_DifferentKeys_DistributeAcrossPartitions, others in the Messaging category

Context

Observed in CI runs on fix/flaky-tests-round2 branch after the producer batch lifecycle bugs were fixed. The "No response for partition" topic mismatch is now resolved — this is a separate issue.

Possible causes

  • Bootstrap connection not being re-established after failure
  • Connection pool disposal from a parallel test affecting metadata lookups
  • Kafka container under load from parallel tests causing slow metadata responses
  • Missing retry/reconnect logic for broker -1 metadata connections

CI references

  • Run 23491170091 (commit 720e76d): 1/51 failed — MultiPartition_SeekOnSpecificPartition_WorksCorrectly
  • Run 23490015213 (commit 0030eca): 1/51 failed — MultiPartition_DifferentKeys_DistributeAcrossPartitions

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High prioritybugSomething isn't workingtestingTest coverage improvements

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions