libarchive: Handle UTF-8 filenames without locale dependency #3559

Merged
cgwalters merged 1 commit into ostreedev:main from cgwalters:archive-utf8 on Jan 30, 2026

@cgwalters
Member

When importing archives (including OCI container layers), libarchive attempts to convert filenames from UTF-8 to the current locale's charset. In the POSIX/C locale (which uses ASCII), this conversion fails for any filename containing non-ASCII characters, and archive_read_next_header() returns ARCHIVE_WARN.

This is triggered by Python 3.14, which creates a "𝜋thon" symlink in venvs. It affects bootc installations in environments where LANG is not set, so the locale defaults to POSIX.

Fix this by:

  1. Using archive_entry_pathname_utf8() and archive_entry_symlink_utf8() which return UTF-8 directly without locale conversion

  2. Falling back to the regular accessors with explicit UTF-8 validation when the _utf8 variants return NULL

  3. Accepting ARCHIVE_WARN from archive_read_next_header() since we now validate UTF-8 ourselves rather than relying on libarchive charset conversion

This matches the behavior of GNU tar, which treats filenames as opaque bytes and performs no charset conversion.

Closes: #3431
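
For illustration, the fallback described in steps 1 and 2 can be sketched without libarchive at all. This is a minimal, dependency-free sketch and not the code from this PR: `is_valid_utf8()` and `pick_utf8_name()` are hypothetical names, and the two string arguments stand in for what `archive_entry_pathname_utf8()` and the regular `archive_entry_pathname()` accessor would return. The actual change may well validate with a library routine such as GLib's `g_utf8_validate()` instead of a hand-written check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if the NUL-terminated string s is
 * well-formed UTF-8. Rejects overlong encodings, UTF-16 surrogates,
 * code points above U+10FFFF and truncated sequences. */
bool
is_valid_utf8 (const char *s)
{
  assert (s != NULL);
  const unsigned char *p = (const unsigned char *) s;

  while (*p != 0)
    {
      unsigned int cp;   /* decoded code point */
      unsigned int min;  /* smallest code point allowed for this length */
      int extra;         /* number of continuation bytes expected */

      if (*p < 0x80)
        {
          p++;           /* plain ASCII byte */
          continue;
        }
      else if ((*p & 0xE0) == 0xC0)
        { cp = *p & 0x1F; extra = 1; min = 0x80; }
      else if ((*p & 0xF0) == 0xE0)
        { cp = *p & 0x0F; extra = 2; min = 0x800; }
      else if ((*p & 0xF8) == 0xF0)
        { cp = *p & 0x07; extra = 3; min = 0x10000; }
      else
        return false;    /* stray continuation byte or invalid lead byte */

      p++;
      for (int i = 0; i < extra; i++, p++)
        {
          /* A NUL here fails the check too, so we never read past the end. */
          if ((*p & 0xC0) != 0x80)
            return false;
          cp = (cp << 6) | (*p & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    }
  return true;
}

/* Hypothetical helper modelling the fallback: prefer the value from the
 * _utf8 accessor; if it is NULL, accept the raw value only when it is
 * already valid UTF-8. Returns NULL when no usable name exists. */
const char *
pick_utf8_name (const char *utf8_name, const char *raw_name)
{
  if (utf8_name != NULL)
    return utf8_name;
  if (raw_name != NULL && is_valid_utf8 (raw_name))
    return raw_name;
  return NULL;
}
```

With this shape, a caller would treat a NULL result as a hard error for that entry, while an ARCHIVE_WARN status alone would no longer abort the import.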

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request correctly addresses an issue with handling non-ASCII UTF-8 filenames in libarchive when the locale is POSIX/C. The strategy of using _utf8 function variants with a fallback to manual validation is a solid approach. The new test case is also a valuable addition, ensuring the fix is effective.

I have identified a couple of critical issues in the implementation that could lead to crashes or re-introduce the bug this PR aims to fix. I've also pointed out a minor inefficiency. My review comments include specific code suggestions to resolve these problems. After addressing these points, the pull request will be much stronger.

Comment thread src/libostree/ostree-repo-libarchive.c Outdated
Comment thread src/libostree/ostree-repo-libarchive.c
@cgwalters marked this pull request as draft on January 20, 2026 16:16
@cgwalters requested a review from jmarrero on January 22, 2026 16:13
@cgwalters marked this pull request as ready for review on January 22, 2026 16:15
Member

@jmarrero jmarrero left a comment

lgtm

Development

Successfully merging this pull request may close these issues.

error: Pathname can't be converted from UTF-8 to current locale