Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions src/cashet/hashing.py
Original file line number Diff line number Diff line change
Expand Up @@ -347,7 +347,7 @@ def _stable_repr_to(
_visited.add(obj_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Switching from repr to str as sort key breaks determinism for mixed-type collections where different elements share the same str() representation (e.g., 1 and '1'). Python's sorted preserves original iteration order when keys compare equal, but for unordered containers like sets that iteration order is arbitrary and can vary between runs. Identical inputs may then produce different canonical representations and different hashes, causing cache misses or false hits. Either keep repr for a guaranteed total order, or—if mixed types are not supported—clearly document the restriction and add a check or test to detect the case.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Switching from repr to str for sorting changes the canonical representation of any collection that contains string keys or elements, because str('hello') is 'hello' while repr('hello') is "'hello'". This will produce a different hash for every cached entry that involves strings, effectively invalidating all existing caches on upgrade. The PR description incorrectly states that str and repr return identical strings for all affected built-in types. Either treat this as an intentional breaking hash change (and require a cache clear) or preserve repr for strings while using str for the remaining types.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 The PR description asserts that str() and repr() return identical strings for the built-in types that dominate real-world args. That is not true for strings, which are among the most common keys and elements. Correct the description to avoid misleading future readers and maintainers.

buf.write("{")
first = True
for item in sorted(obj, key=repr):
for item in sorted(obj, key=str):
if not first:
buf.write(", ")
first = False
Expand All @@ -362,7 +362,7 @@ def _stable_repr_to(
_visited.add(obj_id)
buf.write("frozenset({")
first = True
for item in sorted(obj, key=repr):
for item in sorted(obj, key=str):
if not first:
buf.write(", ")
first = False
Expand All @@ -377,7 +377,7 @@ def _stable_repr_to(
_visited.add(obj_id)
buf.write("{")
first = True
for key, val in sorted(obj.items(), key=lambda p: repr(p[0])):
for key, val in sorted(obj.items(), key=lambda p: str(p[0])):
if not first:
buf.write(", ")
first = False
Expand Down
6 changes: 3 additions & 3 deletions src/cashet/store.py
Original file line number Diff line number Diff line change
Expand Up @@ -306,20 +306,20 @@ def _put_commit_row(
commit.task_def.args_hash,
commit.task_def.args_snapshot,
commit.task_def.func_source,
json.dumps(commit.task_def.dep_versions),
json.dumps(commit.task_def.dep_versions, sort_keys=True),
int(commit.task_def.cache),
commit.task_def.retries,
int(commit.task_def.force),
timeout_seconds,
commit.task_def.ttl.total_seconds() if commit.task_def.ttl else None,
json.dumps([r.hash for r in commit.input_refs]),
json.dumps([r.hash for r in commit.input_refs], sort_keys=True),
output_hash,
output_size,
output_tier,
commit.parent_hash,
commit.status.value,
commit.error,
json.dumps(commit.tags),
json.dumps(commit.tags, sort_keys=True),
commit.created_at.isoformat(),
commit.claimed_at.isoformat(),
accessed_at,
Expand Down
20 changes: 20 additions & 0 deletions tests/test_hashing.py
Original file line number Diff line number Diff line change
Expand Up @@ -257,6 +257,26 @@ def identity(data: Any) -> Any:
ref2 = client.submit(identity, {1: 2})
assert ref1.hash != ref2.hash

def test_mixed_key_dict_ordering_deterministic(self, client: Client) -> None:
def identity(data: Any) -> Any:
return data

d1 = {"x": 1, "y": 2}
d2 = {"y": 2, "x": 1}
ref1 = client.submit(identity, d1)
ref2 = client.submit(identity, d2)
assert ref1.hash == ref2.hash

def test_mixed_element_set_ordering_deterministic(self, client: Client) -> None:
def identity(data: Any) -> Any:
return data

s1 = {1, 2, 3}
s2 = {3, 2, 1}
ref1 = client.submit(identity, s1)
ref2 = client.submit(identity, s2)
assert ref1.hash == ref2.hash


class TestRecursiveStructures:
def test_recursive_list_does_not_crash(self, client: Client) -> None:
Expand Down
Loading