Status
Resolved on our side — fixed by (1) making ffindex.py/hhsuitedb.py use binary I/O with latin-1 and no name truncation, and (2) normalizing a few problematic, overly long A3M basenames.
Original Symptoms
UnicodeDecodeError in ffindex.read_index() when hhsuitedb.py reads .ffindex.
Successful _a3m/_hhm but missing _cs219 files; later hhsearch -d <db> fails with could not open ..._cs219.ffdata.
Sporadic failures tied to very long/complex basenames (multiple double underscores, long family names).
Root Causes
- Scripts treated .ffindex/.ffdata as UTF-8 text; they’re effectively binary tab tables (non-UTF-8 bytes possible).
- Index writing truncated names with "{name:.64}", causing downstream mismatches.
- cstranslate --ffindex appears sensitive to very long basenames.
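The first two causes can be reproduced in isolation. The snippet below is only an illustration: the index line and the 80-character name are made up, not taken from our database.

```python
# Made-up index line: name, offset, length separated by tabs.
# 0xE9 followed by an ASCII byte is not a valid UTF-8 sequence.
raw_line = b"fam\xe9ly__long\t0\t128\n"

try:
    raw_line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # this is the failure mode seen when reading .ffindex as text

# latin-1 maps each byte 0x00-0xFF to exactly one code point, so decoding
# cannot fail and encoding gives back the identical bytes.
text = raw_line.decode("latin-1")
round_trips = text.encode("latin-1") == raw_line

# A precision of .64 in a format spec silently cuts a string to 64 characters.
long_name = "F" * 80  # hypothetical overly long family name
truncated = "{name:.64}".format(name=long_name)
name_changed = truncated != long_name
```

Once the written name differs from the real one, any later lookup by the full name misses, which matches the downstream mismatches described above.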
What We Changed (local patches)
ffindex.py
Read/write binary; decode/encode lines with latin-1 (1:1 byte mapping).
Removed 64-char truncation; write full names.
Actually sort entries (entries.sort(...)); use mmap.ACCESS_READ.
Use splitlines(); open output index in binary.
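A minimal sketch of the ffindex.py changes, not the actual patch. The names FFindexEntry, write_index and read_entry_data are placeholders, and sorting by entry name is an assumption, since the real sort key is not shown above.

```python
import mmap
from collections import namedtuple

# Placeholder record type for one index row (name, byte offset, byte length).
FFindexEntry = namedtuple("FFindexEntry", ["name", "offset", "length"])

def write_index(entries, index_path):
    """Write a sorted index in binary mode with full, untruncated names."""
    # Assumption: entries are ordered by name. For latin-1 decoded names,
    # code point order is the same as byte order.
    ordered = sorted(entries, key=lambda e: e.name)
    with open(index_path, "wb") as fh:
        for e in ordered:
            # No ".64" precision here: the whole name is written.
            line = "{}\t{}\t{}\n".format(e.name, e.offset, e.length)
            fh.write(line.encode("latin-1"))

def read_entry_data(entry, data_path):
    """Return the raw bytes of one entry from a read-only memory map."""
    with open(data_path, "rb") as fh:
        # Note: mmap raises ValueError on an empty file.
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm[entry.offset:entry.offset + entry.length]
        finally:
            mm.close()
```

Because the data is returned as bytes and never decoded, non-UTF-8 content in .ffdata cannot raise a decode error at this layer.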
hhsuitedb.py
New robust read_ffindex(path) (binary + latin-1; ignore malformed short lines).
write_subset_index(...) writes exact, untruncated names in latin-1 with \n.
This stabilizes subset builds used by ffindex_apply_mpi and cstranslate.
After these changes, _a3m/_hhm build reliably and _cs219 builds for almost all families.
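The hhsuitedb.py side can be sketched roughly as follows. This is a simplified illustration rather than the patch itself: the tuple layout and the parameters of write_subset_index (entries, wanted_names, out_path) are assumptions made for the example.

```python
def read_ffindex(path):
    """Read an .ffindex file as bytes and return (name, offset, length) tuples."""
    entries = []
    with open(path, "rb") as fh:
        for raw in fh.read().splitlines():
            fields = raw.decode("latin-1").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed short lines
            entries.append((fields[0], int(fields[1]), int(fields[2])))
    return entries

def write_subset_index(entries, wanted_names, out_path):
    """Write only the wanted entries, with exact names and '\\n' line endings."""
    wanted = set(wanted_names)
    with open(out_path, "wb") as fh:
        for name, offset, length in entries:
            if name in wanted:
                line = "{}\t{}\t{}\n".format(name, offset, length)
                fh.write(line.encode("latin-1"))
```

A subset index written this way carries the same name bytes as the source index, so tools that look up entries by name see consistent keys. Lines with three fields but non-numeric offsets would still raise ValueError in this sketch.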
Filename Fix
Simplifying the input .a3m filenames (shortening very long basenames and collapsing repeated double underscores) can help in some cases.
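One possible way to normalize basenames is sketched below. Everything in it is an assumption for illustration: the function name simplify_basename, the 48-character cap (not an HH-suite limit), the allowed character set, and the hash suffix used to keep shortened names distinct.

```python
import hashlib
import re

MAX_LEN = 48  # arbitrary assumed cap for the stem, not a documented limit

def simplify_basename(name, max_len=MAX_LEN):
    """Return a shorter, plainer file name while keeping the extension."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Replace anything outside a conservative character set, then
    # collapse runs of underscores into a single one.
    clean = re.sub(r"[^A-Za-z0-9_-]", "_", stem)
    clean = re.sub(r"_{2,}", "_", clean).strip("_")
    if len(clean) > max_len:
        # Append a short digest of the original stem so that two long
        # names sharing a prefix are very unlikely to collide.
        digest = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:8]
        clean = clean[:max_len - 9].rstrip("_") + "_" + digest
    return clean + dot + ext
```

For example, simplify_basename("PF00001__very__long.a3m") gives "PF00001_very_long.a3m". The rename has to happen before the database is built so that index names and file names stay in step.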