
Improve FlakyStrategyDefinition error messages with specific details #4676

Draft
ianhi wants to merge 7 commits into HypothesisWorks:master from ianhi:flaky-feedback

Conversation


@ianhi ianhi commented Mar 10, 2026

Moved into a draft; see #4676 (comment).


🤖 I used Claude extensively for this PR, but have personally reviewed every line to the best of my ability.

Fix for #4673

This PR implements four tightly related changes to the output of Hypothesis when there is a flaky failure:

  1. Ensure the seed is printed.

  2. Print the actual differing choices in the strategy.

The error now says what was different (different constraints, a different type, fewer/more draws) instead of just "data generation was inconsistent".

  3. For stateful tests, print the replay steps and info on how to enable observability.

  4. Fix duplicate FlakyStrategyDefinition errors.

When a mismatch was detected during a draw, a second FlakyStrategyDefinition could be raised from conclude_test if the mismatch also resulted in fewer draws. Now the observer has a flaky flag to prevent this redundant second raise.
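The flaky-flag guard can be sketched roughly like this. This is an illustrative standalone sketch, not Hypothesis's actual internals; the class and method names here are made up:

```python
# Hypothetical sketch of the "flaky flag" guard described above.
class FlakyStrategyDefinition(Exception):
    pass


class DrawObserver:
    def __init__(self):
        # Set once a mismatch has already been reported, so later checks
        # (like the fewer-draws check at test conclusion) stay quiet.
        self.flaky = False

    def mismatch_during_draw(self, detail):
        self.flaky = True
        raise FlakyStrategyDefinition(f"inconsistent draw: {detail}")

    def conclude_test(self, expected_draws, actual_draws):
        if self.flaky:
            # A mismatch was already raised mid-draw; don't raise again.
            return
        if actual_draws < expected_draws:
            raise FlakyStrategyDefinition(
                "the second run drew less data than the first run"
            )
```

Without the `self.flaky` check in `conclude_test`, a mid-draw mismatch that also shortened the run would trip both raises and produce the duplicate error this PR removes.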

One of the tricky things is that a FlakyStrategyDefinition error can be thrown with or without a real test failure. When it accompanies a real failure, you would get messy output with nested errors ("During handling of the above exception, another exception occurred...") which made it hard to notice the actual failure with so much text on the screen. Now the flaky error is temporarily suppressed and reported in the Hypothesis output, keeping the test failure more visible. (see final example below)
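The suppress-and-report behavior could be sketched like this. This is a hypothetical standalone helper (the function name, and the use of a notes list, are assumptions for illustration), not the actual change in Hypothesis's engine:

```python
# Illustrative sketch: when both a real test failure and a flaky strategy
# error exist, suppress the flaky error into the notes output instead of
# chaining it into the traceback.
def choose_error(test_failure, flaky_error, notes):
    """Return the single exception to raise; record suppressed errors in notes."""
    if test_failure is not None and flaky_error is not None:
        notes.append(
            "WARNING: a flaky strategy definition error was detected and "
            f"suppressed in favor of the real failure above: {flaky_error}"
        )
        return test_failure
    # Only one of the two (or neither) is present; raise whichever exists.
    return test_failure if test_failure is not None else flaky_error
```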

I cooked up this demo script with Claude to test out the various combinations of failure modes and output (e.g. observability), which I found quite helpful. Here is the script:

demo_flaky.py
import hypothesis.strategies as st
from hypothesis import given, settings
from hypothesis.stateful import Bundle, RuleBasedStateMachine, initialize, rule

_catalog_size = [5]

class FlakyConstraints(RuleBasedStateMachine):
    items = Bundle("items")

    @initialize()
    def create_cart(self):
        self.cart = []

    @rule(target=items, name=st.text(min_size=1, max_size=10))
    def add_item(self, name):
        self.cart.append(name)
        return name

    @rule(
        item=items,
        price=st.integers(1, 100).flatmap(
            lambda p: st.integers(p, p + _catalog_size[0])
        ),
    )
    def set_price(self, item, price):
        _catalog_size[0] += 3

TestConstraint = FlakyConstraints.TestCase
TestConstraint.settings = settings(max_examples=200, database=None, stateful_step_count=10)

_call_count = [0]

@settings(max_examples=200, database=None)
@given(data=st.data())
def test_type_mismatch(data):
    _call_count[0] += 1
    if _call_count[0] % 2 == 0:
        data.draw(st.booleans())
    else:
        data.draw(st.integers(0, 10))

_upper = [10]

@settings(max_examples=200, database=None)
@given(data=st.data())
def test_plain_mismatch(data):
    data.draw(st.integers(0, _upper[0]))
    _upper[0] += 10

_more_count = [0]

@settings(max_examples=200, database=None)
@given(data=st.data())
def test_more_draws(data):
    data.draw(st.integers(0, 10))
    _more_count[0] += 1
    if _more_count[0] > 1:
        data.draw(st.integers(0, 10))
    assert False

With these outputs (formatted by Claude to elide portions of the error and focus on what's relevant for this PR):

1. Stateful test — constraint mismatch (observability off)

python -m pytest demo_flaky.py -s -k constraint

Before (v6.151.9):
FlakyStrategyDefinition: Inconsistent data
generation! Data generation behaved differently
between different runs. Is your data generation
depending on external state?
while generating 'price' from integers(...)
  .flatmap(...) for rule set_price

During handling of the above exception,
another exception occurred:

  ...long chained traceback...

FlakyStrategyDefinition: Inconsistent data
generation! ...
while selecting a rule to run. This is usually
caused by a flaky precondition, or a bundle
that was unexpectedly empty.

After:

FlakyStrategyDefinition: Inconsistent data
generation! Data generation behaved differently
between different runs. Is your data generation
depending on external state?

The second run drew integer with different
constraints than the first run.
  first run:  {'min_value': 99, 'max_value': 278,
               'weights': ...}
  second run: {'min_value': 99, 'max_value': 290,
               'weights': ...}

while generating 'price' from integers(...)
  .flatmap(...) for rule set_price
This error occurred while selecting a rule to
run. This is usually caused by a flaky
precondition, a bundle that was unexpectedly
empty, or a rule that depends on external state
such as time or a global variable.
---------- Hypothesis ----------
Tip: to see which steps led to this error,
  re-run with
  HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY=1
You can add @seed(...) to this test or run
  pytest with --hypothesis-seed=... to reproduce
  this failure.

2. Stateful test — constraint mismatch (observability on)

HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY=1 python -m pytest demo_flaky.py -s -k constraint

Before (v6.151.9):
  ...same duplicate chained traceback as above...

FlakyStrategyDefinition: Inconsistent data
generation! ...
while selecting a rule to run. This is usually
caused by a flaky precondition, or a bundle
that was unexpectedly empty.

After:

FlakyStrategyDefinition: ...

The second run drew integer with different
constraints than the first run.
  first run:  {'min_value': 29, 'max_value': 220,
               'weights': ...}
  second run: {'min_value': 29, 'max_value': 238,
               'weights': ...}

while generating 'price' from integers(...)
  .flatmap(...) for rule set_price
This error occurred while selecting a rule ...
---------- Hypothesis ----------
Steps leading up to this error:
  state = FlakyConstraints()
  state.create_cart()
  items_0 = state.add_item(name='...')
  state.teardown()
You can add @seed(...) to this test or run
  pytest with --hypothesis-seed=... to reproduce
  this failure.

3. Non-stateful — type mismatch

python -m pytest demo_flaky.py -s -k type_mismatch

Before (v6.151.9):
FlakyFailure: Inconsistent results from
replaying a test case!
  last: VALID from None
  this: INTERESTING from
    FlakyStrategyDefinition at datatree.py:1053
  (1 sub-exception)
+-+---------------- 1 ----------------
  |   ...long traceback...
  | FlakyStrategyDefinition: Inconsistent data
  |   generation! ...
  | while generating 'Draw 1' from booleans()
  +------------------------------------

After:

FlakyStrategyDefinition: Inconsistent data
generation! Data generation behaved differently
between different runs. Is your data generation
depending on external state?

The second run drew a different type of value
than the first run.
  first run:  integer
  second run: boolean

while generating 'Draw 1' from booleans()
---------- Hypothesis ----------
You can add @seed(...) to this test or run
  pytest with --hypothesis-seed=... to reproduce
  this failure.

4. Non-stateful — constraint mismatch

python -m pytest demo_flaky.py -s -k plain

Before (v6.151.9):
FlakyFailure: Inconsistent results from
replaying a test case!
  last: VALID from None
  this: INTERESTING from
    FlakyStrategyDefinition at datatree.py:1053
  (1 sub-exception)
+-+---------------- 1 ----------------
  |   ...long traceback...
  | FlakyStrategyDefinition: Inconsistent data
  |   generation! ...
  | while generating 'Draw 1' from
  |   integers(min_value=0, max_value=20)
  +------------------------------------

After:

FlakyStrategyDefinition: Inconsistent data
generation! Data generation behaved differently
between different runs. Is your data generation
depending on external state?

The second run drew integer with different
constraints than the first run.
  first run:  {'min_value': 0, 'max_value': 10,
               'weights': None,
               'shrink_towards': 0}
  second run: {'min_value': 0, 'max_value': 20,
               'weights': None,
               'shrink_towards': 0}

while generating 'Draw 1' from
  integers(min_value=0, max_value=20)
---------- Hypothesis ----------
You can add @seed(...) to this test or run
  pytest with --hypothesis-seed=... to reproduce
  this failure.

5. Real bug + suppressed flaky error

python -m pytest demo_flaky.py -s -k more

Before (v6.151.9):
FlakyFailure: Inconsistent results from
replaying a test case!
  last: INTERESTING from AssertionError ...
  this: INTERESTING from
    FlakyStrategyDefinition at datatree.py:1106
  (2 sub-exceptions)
+-+---------------- 1 ----------------
  |   ...
  |     assert False
  | AssertionError: assert False
  +---------------- 2 ----------------
  |   ...long traceback...
  | FlakyStrategyDefinition: Inconsistent data
  |   generation! ...
  | while generating 'Draw 2' from
  |   integers(min_value=0, max_value=10)
  +------------------------------------

After:

FlakyFailure: ...An example failed on the
  first run but now succeeds ...
Falsifying example: test_more_draws(
    data=data(...),
)
Draw 1: 0
+-+---------------- 1 ----------------
  |   ...
  |     assert False
  | AssertionError: assert False
  +------------------------------------
---------- Hypothesis ----------
WARNING: a flaky strategy definition error was
  detected during shrinking and suppressed in
  favor of the real failure above.
  Inconsistent data generation! ...

The second run drew more data than the first run.

You can add @seed(...) to this test or run
  pytest with --hypothesis-seed=... to reproduce
  this failure.

FlakyStrategyDefinition errors now describe what changed between runs
(type mismatch, constraint mismatch, forced value difference, more/fewer
draws) instead of a generic "inconsistent data generation" message.
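A hypothetical rendering of the constraint comparison, in the spirit of the output above (the helper name and signature are made up for illustration, not the actual Hypothesis implementation):

```python
# Illustrative sketch: format a message describing how the constraints of a
# draw differed between the first and second run.
def describe_constraint_mismatch(choice_type, first, second):
    """Return a mismatch message, or None if the constraints agree."""
    if first == second:
        return None
    return (
        f"The second run drew {choice_type} with different constraints "
        "than the first run.\n"
        f"  first run:  {first}\n"
        f"  second run: {second}"
    )
```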
@ianhi ianhi requested review from Liam-DeVoe and Zac-HD as code owners March 10, 2026 23:02
@ianhi ianhi marked this pull request as draft March 12, 2026 05:19

ianhi commented Mar 12, 2026

I moved this back to draft - looking back I was pushing through some hunger when I submitted this - riding the thrill of trying to get it to the end. And in retrospect I need to spend some more time looking over the tests as carefully as I did the code/behavior. Which I suppose is ironic given this is a testing library - but alas.
