Skip to content

Conversation

@bharos
Copy link

@bharos bharos commented Jan 19, 2026

Rationale for this change

HMS-specific table properties (e.g., table_category from data contracts) were lost during PyIceberg commits because commit_table() replaced all HMS parameters instead of merging them.

Fixes: #2926

Are these changes tested?

Yes. Added test_hive_preserves_hms_specific_properties() that:

  1. Sets HMS properties via Hive client (simulating external systems)
  2. Performs PyIceberg commit
  3. Verifies HMS properties are preserved

Are there any user-facing changes?

Bug fix only - HMS properties set outside PyIceberg are now preserved during commits. No API changes.

@bharos bharos force-pushed the fix/hive-preserve-hms-properties branch 2 times, most recently from 96b6185 to 46ee7f7 Compare January 20, 2026 00:08
This change fixes HMS-specific table properties being lost during
commits by merging parameters instead of replacing them.

Fixes: apache#2926
@bharos bharos force-pushed the fix/hive-preserve-hms-properties branch from 46ee7f7 to be4e2a7 Compare January 20, 2026 00:28
@nssalian
Copy link
Contributor

Thanks for the fix. A few things to check for HMS properties that I can remember:

  1. If an external tool happens to set a property with the same name as an Iceberg-managed one, we need to make sure Iceberg's value should take predence.
  2. If someone removes a property through Iceberg, make sure it actually gets deleted. Right now it might come back from the old HMS state after the merge.
  3. Things like EXTERNAL, table_type, metadata_location should always come from Iceberg, not be carried over from old HMS state.
  4. Worth expanding the tests to test to check for removing a property and checking it is removed, and Iceberg value preservation.
    CC: @kevinjqliu @Fokko

@bharos bharos force-pushed the fix/hive-preserve-hms-properties branch from a9fbb1c to ae8a024 Compare January 20, 2026 00:54
@bharos
Copy link
Author

bharos commented Jan 20, 2026

Thanks for the fix. A few things to check for HMS properties that I can remember:

  1. If an external tool happens to set a property with the same name as an Iceberg-managed one, we need to make sure Iceberg's value should take predence.
  2. If someone removes a property through Iceberg, make sure it actually gets deleted. Right now it might come back from the old HMS state after the merge.
  3. Things like EXTERNAL, table_type, metadata_location should always come from Iceberg, not be carried over from old HMS state.
  4. Worth expanding the tests to test to check for removing a property and checking it is removed, and Iceberg value preservation.
    CC: @kevinjqliu @Fokko

Makes sense. Added more tests to check these cases, also the code gives precedence to iceberg properties, PTAL and let me know if you find any issues. Thanks!

@bharos bharos marked this pull request as ready for review January 20, 2026 01:11
# Merge HMS parameters: preserve HMS-only properties (e.g., table_category) while
# ensuring Iceberg controls its metadata. Start with HMS params, remove deleted
# Iceberg properties, then update with new Iceberg values (which take precedence).
updated_parameters = {k: v for k, v in hive_table.parameters.items() if k not in removed_keys}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this looks fine, I think it would be cleaner if we started with the existing, removed and then updated to have the final params. Something like this:

merged_params = dict(hive_table.parameters)
for key in removed_keys:
    merged_params.pop(key, None)
merged_params.update(new_parameters)
hive_table.parameters = merged_params

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

assert hive_table.parameters.get("abc") is None


@pytest.mark.integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a test for property deletion to ensure it's not picked up from the old state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])
def test_hive_iceberg_property_takes_precedence(catalog: Catalog) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be named better. It's more like Iceberg's metadata is the source of truth. CC: @kevinjqliu to see what the right expected behavior is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good call, this should be phrased as iceberg metadata is the source of truth, and HMS parameters are a projection of Iceberg state, plus any HMS-only properties that external tools may have set.

Looking at the Java implementation, they follow this order to sync properties:

  1. We start our with the existing HMS params
  2. Push all Iceberg properties into HMS
  3. Remove old props
  4. Then set all the iceberg properties

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# Merge HMS parameters: preserve HMS-only properties (e.g., table_category) while
# ensuring Iceberg controls its metadata. Start with HMS params, remove deleted
# Iceberg properties, then update with new Iceberg values (which take precedence).
updated_parameters = {k: v for k, v in hive_table.parameters.items() if k not in removed_keys}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking as this solves the issue, but one suggestion would be to extract the merge logic into a small helper function like _merge_hms_parameters(). That way we could unit test without mocking the full commit flow, and document the semantics in one spot.

)

# Detect properties that were removed from Iceberg metadata
old_iceberg_keys = set(current_table.properties.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can simplify to current_table.properties.keys() - updated_staged_table.properties.keys()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

@bharos bharos force-pushed the fix/hive-preserve-hms-properties branch from 5b2939b to 291311f Compare January 20, 2026 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PyIceberg commit not preserving Hive table properties

3 participants