CockroachDB's disable_synchronization_unsafe in tests causes unsynchronized writes #10085

@davepacheco

I had a test that failed with the usual:

WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/dangerzone/omicron_tmp/.tmpFf8lhP"
        If you would like to access the database for debugging, run the following:

        # Run the database
        cargo xtask db-dev run --no-populate --store-dir "/dangerzone/omicron_tmp/.tmpFf8lhP/data"
        # Access the database. Note the port may change if you run multiple databases.
        cockroach sql --host=localhost:32221 --insecure

When I loaded up the database, its contents were inconsistent with what I expected based on the logging. More precisely:

  • my test loads up 13 blueprints and then deletes 4
  • the logging was pretty conclusive that it did delete the 4 before failing
  • when I loaded up the database, all 13 blueprints were still there

I really couldn't see how this could happen, so I added a tokio::time::sleep right before the blown assertion to give me long enough to connect to the live database. Sure enough, the 4 blueprints were gone.
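
For anyone who wants to repeat this, here is a minimal sketch of an opt-in pause that can be dropped in just before the assertion under investigation. The helper names and the idea of driving it from an environment variable are assumptions for illustration, not anything in omicron. It uses std::thread::sleep to stay dependency-free; in an async test, `tokio::time::sleep(pause).await` (what I actually used) is the non-blocking equivalent.

```rust
use std::time::Duration;

/// Hypothetical helper: turn an optional string (for example, the value of an
/// environment variable) into a pause length in seconds.
/// Returns None when the value is missing, unparsable, or zero.
fn inspection_pause(raw: Option<&str>) -> Option<Duration> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .map(Duration::from_secs)
}

/// Hypothetical helper: call this right before the failing assertion.
/// If `env_var` is set to a positive number of seconds, the test blocks for
/// that long so the still-running database can be inspected by hand.
fn pause_before_assert(env_var: &str) {
    let raw = std::env::var(env_var).ok();
    if let Some(pause) = inspection_pause(raw.as_deref()) {
        eprintln!("pausing {:?} so the live database can be inspected", pause);
        std::thread::sleep(pause);
    }
}
```

With something like `pause_before_assert("TEST_PAUSE_SECS")` in place (the variable name is made up), the test behaves normally unless the variable is set, in which case there is time to run `cockroach sql` against the live instance.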

In chat, @jmpesp reported having run into this before and said that it's a result of #8275. It sounds like he routinely patches that out.

I had thought that #8275 was just disabling fsync. That would be alright, because there's no host OS crash involved here. However, that's not the only thing this option does:

"This not only disables fsync, but also disables flushing writes to the OS buffer."

That would explain things: if cockroach goes down ungracefully, it may never have written this data to files at all, let alone fsync'd it.
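
To make the distinction concrete, here is a small sketch using Rust's std BufWriter as an analogy. It is not how CockroachDB implements this option; it only illustrates why data that was never flushed out of the process is lost on an ungraceful exit even when the OS itself stays up. The function name is hypothetical, and leaking the writer with std::mem::forget stands in for the process dying before its buffers are written out.

```rust
use std::fs::{self, File};
use std::io::{BufWriter, Write};
use std::path::Path;

/// Hypothetical demo: write `payload` through a user-space buffer, optionally
/// flush it to the OS, then simulate an ungraceful exit by leaking the writer
/// so that its Drop-time flush never runs.
/// Returns how many bytes are actually present in the file afterward.
/// Assumes `payload` is smaller than the 64 KiB buffer.
fn bytes_visible_after_crash(path: &Path, payload: &[u8], flush: bool) -> std::io::Result<u64> {
    let file = File::create(path)?;
    let mut writer = BufWriter::with_capacity(64 * 1024, file);

    // The payload fits in the buffer, so this only copies it into process memory.
    writer.write_all(payload)?;

    if flush {
        // Hands the bytes to the OS. They are now visible to other readers of
        // the file, though not necessarily durable on disk (that would need
        // File::sync_all, the fsync equivalent).
        writer.flush()?;
    }

    // Simulated crash: the buffer is abandoned without being written out.
    std::mem::forget(writer);

    Ok(fs::metadata(path)?.len())
}
```

Called with `flush = false`, the file ends up empty; with `flush = true`, the whole payload is there even though nothing was fsync'd. The first case matches what I saw: the deletes were visible through the live process but absent from the files it left behind.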


We discussed this a bit in #8275, and I think we should revisit it. The Helios CI already runs on ZFS with sync=disabled, so I expect that removing this option shouldn't affect Helios CI time much. It would affect:

  • local tests
  • Linux CI

But aren't we putting this stuff in TMPDIR, which is usually in-memory anyway? What do you think @smklein @sunshowers?
