Chardata plus encoded datasets #6898

Draft
pp-mo wants to merge 48 commits into SciTools:main from pp-mo:chardata_plus_encoded_datasets
pp-mo (Member) commented Jan 19, 2026

Closes #6309 + various

Successor to #6850, now incorporating #6851.

Also now integrated with netcdf load and save, so that both use encoded datasets.

pp-mo added 28 commits January 19, 2026 11:49
Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; addin parts of old investigation; add temporary notes.
@pp-mo pp-mo mentioned this pull request Jan 20, 2026
string_width: int # string lengths when viewing as strings (i.e. "Uxx")

def __init__(self, cf_var):
    """Get all the info from an netCDF4 variable (or similar wrapper object).
pp-mo (Member Author) commented Jan 28, 2026

Suggested change
- """Get all the info from an netCDF4 variable (or similar wrapper object).
+ """Get all the info from a netCDF4 variable.

It actually must be "at least" a threadsafe wrapped variable (or a real netCDF4.Variable) and not an EncodedVariable, since we inspect its '.dtype' etc.

Comment on lines +120 to +123
read_encoding: str # *always* a valid encoding from the codecs package
write_encoding: str # *always* a valid encoding from the codecs package
n_chars_dim: int # length of associated character dimension
string_width: int # string lengths when viewing as strings (i.e. "Uxx")
pp-mo (Member Author) commented:

These are now only set if "is_chardata" -- see init code
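
The "*always* a valid encoding from the codecs package" guarantee in the annotations above could be checked with the standard library's `codecs.lookup`. The following is a minimal sketch only: `normalise_encoding` is a hypothetical helper invented for illustration, and nothing here is taken from the PR's actual init code.

```python
import codecs


def normalise_encoding(name: str) -> str:
    """Return the canonical codec name, or raise ValueError if it is unknown.

    Hypothetical helper for illustration: not part of the PR.
    """
    try:
        # codecs.lookup raises LookupError for names the codecs package
        # does not recognise, and otherwise returns a CodecInfo object.
        return codecs.lookup(name).name
    except LookupError as err:
        raise ValueError(f"unknown encoding: {name!r}") from err


# Aliases resolve to one canonical spelling.
print(normalise_encoding("UTF8"))   # utf-8
print(normalise_encoding("ascii"))  # ascii
```

Storing the canonical name means later string tests on the encoding (such as checking for "utf-16") see a consistent spelling.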

Comment on lines +240 to +242
DECODE_TO_STRINGS_ON_READ = NetcdfStringDecodeSetting()
DEFAULT_READ_ENCODING = "utf-8"
DEFAULT_WRITE_ENCODING = "ascii"
pp-mo (Member Author) commented:

These should be made available in the public API, probably by importing them in iris.fileformats.netcdf and including them in its __all__?

ukmo-ccbunney (Contributor) left a comment

Just one comment at this time.

encoding = self.read_encoding
if "utf-16" in encoding:
    # Each char needs at least 2 bytes -- including a terminator char
    strlen = (strlen // 2) - 1
ukmo-ccbunney (Contributor) commented:

Do we really need to account for a terminating char on "utf-32" and "utf-16" encodings?
When writing to a netCDF file, surely the terminator isn't written? This is just something that is used when storing strings in memory, is it not?

ukmo-ccbunney (Contributor) commented:

OK - this looks to be the case. Certainly encoding a byte string to "utf-16" or "utf-32" does appear to add an extra null terminator...

pp-mo (Member Author) commented:

> OK - this looks to be the case. Certainly encoding a byte string to "utf-16" or "utf-32" does appear to add an extra null terminator...

And, from my experiments, omitting the extra byte breaks a reverse 'decode' operation.
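
For reference, the byte counts discussed in this thread can be reproduced with plain Python, independent of Iris or netCDF4. This is a minimal sketch, assuming only BMP characters (one 2-byte unit each in UTF-16). In CPython the extra code unit produced by the generic "utf-16" and "utf-32" codecs is a byte order mark (BOM) written at the start of the output, and the explicit-endian variants such as "utf-16-le" do not add it.

```python
import codecs

text = "abc"

# The generic codecs prepend a byte order mark (BOM) to the encoded bytes.
utf16 = text.encode("utf-16")
utf32 = text.encode("utf-32")
assert utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert utf32[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# 2-byte BOM + 3 chars * 2 bytes = 8 bytes in total.
n_bytes = len(utf16)
# Same arithmetic as the snippet under review: one code unit is overhead.
strlen = (n_bytes // 2) - 1
assert strlen == len(text)

# The explicit-endian codec writes no BOM, so there is no overhead unit.
assert len(text.encode("utf-16-le")) == 6

# The BOM-prefixed bytes round-trip through the generic codec.
assert utf16.decode("utf-16") == text
```

If the stored encoding name can be an explicit-endian variant, a test such as `"utf-16" in encoding` would also match it, so the subtraction of one unit may need to depend on whether a BOM is actually present.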

pp-mo force-pushed the chardata_plus_encoded_datasets branch from 274fae4 to 31884e9 on March 6, 2026 10:37
pp-mo (Member Author) commented Mar 6, 2026

Update

merged from main to unblock CI testing

pp-mo force-pushed the chardata_plus_encoded_datasets branch 2 times, most recently from e328f94 to 2800dc1 on March 6, 2026 12:31
pp-mo force-pushed the chardata_plus_encoded_datasets branch from c4a60d5 to 0bb70e1 on March 6, 2026 17:18
Development

Successfully merging this pull request may close these issues.

Fix iris handling of netcdf character array variables

2 participants