Development Gotchas & Pitfalls

This document tracks non-obvious behaviors, common pitfalls, and architectural "gotchas" in the libmagic-rs codebase to assist future maintainers and contributors.

1. Build Script Boundary

1.1 error.rs is Shared with build.rs

src/error.rs is compiled by both the library and build.rs. It cannot reference lib-only types like crate::io::IoError.

  • Workaround: FileError(String) wraps structured I/O errors as strings.
  • Rule: Use ParseError::IoError for I/O errors in parser code, not ParseError::invalid_syntax.

1.2 Serialization Code Sharing

Serialization functions live in src/parser/codegen.rs, shared by both build.rs (via #[path] include) and src/build_helpers.rs (via crate::parser::codegen). Only format_parse_error remains duplicated because ParseError has different import paths in each context.

1.3 Generated Imports Must Stay in Sync

generate_builtin_rules() emits a use crate::parser::ast::{...} import line in generated code. When adding new AST types (e.g., PStringLengthWidth), this import must be updated or the build pipeline breaks on any rule that uses the new type.

  • Symptom: Build fails with E0433 ("cannot find type") in the generated builtin_rules.rs.
  • Fix: Update the import in generate_builtin_rules() in src/parser/codegen.rs.

2. Adding Enum Variants

2.1 TypeKind Exhaustive Matches

Adding a variant to TypeKind requires updating exhaustive matches in 10+ files:

  • ast, grammar, types
  • codegen (serialize_type_kind) -- easy to forget; build.rs is a separate compilation unit, so the error surfaces there first
  • strength, property_tests
  • evaluator/types/mod.rs (read_typed_value, coerce_value_to_type, bytes_consumed) -- variable-width variants must be matched explicitly or relative-offset anchors will silently corrupt
  • output/mod.rs (2 length matches), output/json.rs (format_value_as_hex)
  • grammar/tests.rs (stale assertions)

For string-family variants (anything that semantically holds a byte sequence rather than a fixed-width number), also add the variant to is_string_family_type in parser/grammar/mod.rs::parse_magic_rule so bareword (unquoted) value parsing falls back through the string parser instead of failing strict parse_value -- String16 was caught missing from this helper during the PR #233 review (see docs/solutions/design-patterns/multi-agent-review-surfaces-cross-cutting-consistency-gaps-2026-04-25.md, Gap 5 bonus fix).

Note: coerce_value_to_type, the output matches, and bytes_consumed use catch-all _ => arms, so they compile without changes but may need semantic updates -- bytes_consumed fires a debug_assert in test/dev builds for unhandled variable-width variants.

Meta variants (TypeKind::Meta(MetaType::...)) are structurally different. They do not read bytes, so read_typed_value, coerce_value_to_type, and bytes_consumed treat them as no-ops (catch-all or zero-width arms).

The live dispatch for control-flow directives happens in evaluator/engine/mod.rs::evaluate_rules -- the loop body, not evaluate_single_rule_with_anchor. MetaType::Use, MetaType::Default, MetaType::Clear, and MetaType::Indirect are all handled inline there so subroutine / control-flow effects can be spliced into the caller's match vector; MetaType::Name is hoisted out of the rule list at parse time by parser::name_table::extract_name_table and should normally never reach the evaluator at all.

The evaluator's leaked-Name arm uses debug! (not debug_assert!) because prop_arbitrary_rule_evaluation_never_panics synthesizes arbitrary TypeKind values, and a panic there would break the never-panics invariant. This applies to every inline-dispatched meta variant that has a "not configured" fallback (currently Use and Indirect for the no-RuleEnvironment case): property tests run with rule_env = None, so any debug_assert! on that path will fire.

When adding a new MetaType variant, decide whether it needs inline loop-level dispatch (add to evaluate_rules) or is a silent no-op (add to the _ arm of the Meta(_) match in evaluate_single_rule_with_anchor). See docs/solutions/integration-issues/meta-type-subroutine-dispatch-architecture.md for the full pattern.

2.2 Operator Exhaustive Matches

Adding a variant to Operator requires updating: ast, grammar, codegen, strength, property_tests, evaluator/operators/.

2.3 Value Exhaustive Matches

Adding a variant to Value requires updating: ast, codegen, strength, property_tests, output/mod.rs (2 length matches), output/json.rs (format_value_as_hex), evaluator/operators/comparison.rs (compare_values). Bitwise/equality operators use catch-all _ => so they are safe.

  • Note: Value no longer derives Eq (removed when Value::Float(f64) was added) -- no production code depends on Value: Eq.
  • Cross-type String/Bytes equality is policy, not accident. apply_equal deliberately compares Value::String(s) against Value::Bytes(b) by underlying byte sequence (s.as_bytes() == b.as_slice()) -- this is libmagic-compatible behavior. The parser produces Value::Bytes for backslash-escape patterns like \177ELF (via parse_mixed_hex_ascii), while read_string_exact returns Value::String; both must compare equal when their bytes match or rules with hex/octal escapes silently fail to match. Any new Value variant that can hold the same byte sequence as Value::String should extend this cross-equality, not lock into strict-variant comparison. See docs/solutions/logic-errors/magic-string-rule-matching-3-bug-fix-2026-04-25.md for the worked example.
  • Cross-type ordering must follow the same policy or trichotomy breaks. compare_values in evaluator/operators/comparison.rs mirrors the cross-type String/Bytes equality with Some(s.as_bytes().cmp(b.as_slice())) (and the symmetric Bytes/String arm). Without this, apply_equal(Bytes, String) returns true for byte-equal pairs but apply_less_than and apply_greater_than both return false — neither <, ==, nor > is true, breaking the trichotomy invariant. magic(5) rules using > or < against a Value::Bytes literal silently never fire even when equality with the same literal succeeds. When adding a new Value variant that participates in equality, audit compare_values for the matching arms; an asymmetric policy is worse than no policy. See docs/solutions/design-patterns/multi-agent-review-surfaces-cross-cutting-consistency-gaps-2026-04-25.md Gap 1 for the worked example. Regression test: cross_type_string_bytes_ordering_is_byte_sequence in the comparison module pins the trichotomy.
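The symmetric cross-type policy can be sketched as follows. This is an illustrative stand-in, not the real compare_values: the actual Value enum has more variants, and the real code lives in evaluator/operators/comparison.rs.

```rust
use std::cmp::Ordering;

// Hypothetical minimal Value subset for illustration only.
#[derive(Debug)]
enum Value {
    String(String),
    Bytes(Vec<u8>),
}

// Sketch of the cross-type policy: String and Bytes compare by the
// underlying byte sequence, for both equality and ordering, so the
// trichotomy invariant (exactly one of <, ==, > holds) is preserved.
fn compare_values(a: &Value, b: &Value) -> Option<Ordering> {
    match (a, b) {
        (Value::String(x), Value::String(y)) => Some(x.cmp(y)),
        (Value::Bytes(x), Value::Bytes(y)) => Some(x.cmp(y)),
        // Cross-type arms MUST be symmetric: an asymmetric policy is
        // worse than no policy.
        (Value::String(s), Value::Bytes(b)) => Some(s.as_bytes().cmp(b.as_slice())),
        (Value::Bytes(b), Value::String(s)) => Some(b.as_slice().cmp(s.as_bytes())),
    }
}

// Equality derived from the same ordering, so == and </> can never disagree.
fn apply_equal(a: &Value, b: &Value) -> bool {
    compare_values(a, b) == Some(Ordering::Equal)
}
```

Deriving equality from compare_values (rather than a separate match) is one way to make the "audit compare_values for matching arms" rule structurally unnecessary.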

2.4 Pattern-Bearing Types Bypass apply_operator in the Engine

TypeKind::Regex and TypeKind::Search are evaluated by logical match in evaluate_single_rule_with_anchor (src/evaluator/engine/mod.rs), not by string equality against rule.value. The engine calls types::read_pattern_match, which returns Result<Option<Value>, _>: Some(v) means the pattern matched (possibly zero-width) and None means it did not. The engine translates that Option directly into Equal/NotEqual. Comparing matched text to the pattern literal via apply_operator would fail for any regex with metacharacters (e.g., matched "123" vs pattern "[0-9]+"). Non-equality operators on pattern-bearing types are rejected as TypeReadError::UnsupportedType — an earlier revision fell through to apply_operator and silently produced lexicographic ordering comparisons against the pattern source text. If you add a new pattern-bearing TypeKind variant, add its arm to both read_pattern_match and bytes_consumed_with_pattern; the engine's special-case match is keyed on the Regex | Search pair so you must add new variants there too.

2.5 Zero-Width Regex Matches vs Misses

read_regex returns Ok(Some(Value::String(""))) for a legitimate zero-width match (^, a*, .{0}) and Ok(None) for a genuine miss. (The Rust regex crate does not support look-around assertions such as lookaheads or lookbehinds -- they are excluded to preserve linear-time matching guarantees.) An earlier revision collapsed both cases to Value::String(String::new()) and distinguished them by is_empty(), which broke every pattern that legitimately matches zero bytes. The structured Option is the invariant — do not re-flatten it. read_typed_value_with_pattern does collapse None to Value::String(String::new()) for back-compat with its single-Value return shape, but the engine does not go through that function for pattern types; it calls read_pattern_match directly.

2.6 Search Anchor Advance Is Match-End, Not Window-End

search_bytes_consumed returns match_idx + pattern.len() — the byte just past the matched pattern — not range (the window size). This matches GNU file semantics: src/softmagic.c FILE_SEARCH in moffset() computes o = ms->search.offset + vlen - offset where ms->search.offset has already been advanced by idx (the match index inside the window) in magiccheck, and vlen = m->vallen (the pattern length). An earlier revision returned the full window size, which silently corrupted relative-offset children of every successful search rule (e.g., search/256 "MAGIC" at index 4 advanced the anchor by 256 instead of by 9). The fix threaded the pattern through bytes_consumed_with_pattern for TypeKind::Search so the scan can be re-run at anchor-advance time. Search does not currently support a /s-style start-offset flag; if one is added, match-end can become match-start.
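A minimal sketch of the match-end rule, with an assumed signature (the real function threads the pattern through bytes_consumed_with_pattern):

```rust
// Scan a window for the pattern and return the byte index just past the
// match (match-end), NOT the window size -- matching GNU file's moffset()
// semantics for FILE_SEARCH.
fn search_bytes_consumed(window: &[u8], pattern: &[u8]) -> Option<usize> {
    if pattern.is_empty() || pattern.len() > window.len() {
        return None;
    }
    window
        .windows(pattern.len())
        .position(|w| w == pattern)
        // Anchor advance = match index + pattern length.
        .map(|match_idx| match_idx + pattern.len())
}
```

For the example in the text: a 256-byte window with "MAGIC" at index 4 advances the anchor by 4 + 5 = 9, not by 256.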

2.7 Regex /l Is Encoded by RegexCount::Lines, Not a Flag

The /l suffix is not a RegexFlags field — it lives on the RegexCount::Lines variant of TypeKind::Regex::count, alongside RegexCount::Default and RegexCount::Bytes(n). This collapses the previously-possible "line-mode without count" degenerate state into the type-enforced RegexCount::Lines(None) (the regex/l shorthand) and makes "byte-count scan" and "line-count scan" mutually exclusive at the type level. compute_window in src/evaluator/types/regex.rs dispatches on the RegexCount variant: Lines(Some(n)) walks line terminators (LF, CRLF, or bare CR — each counting as one terminator) until the Nth is found, and Lines(None) / Default both return the full byte-capped window. The /l encoding does not toggle regex multi-line matching — libmagic always compiles with REG_NEWLINE (unconditional at src/softmagic.c::alloc_regex line 2123), so ^ and $ match at line boundaries for every regex rule regardless of which RegexCount variant is chosen. An earlier revision of this crate wrapped line-based patterns in ^(?:...) and only set multi_line(true) when /l was set; that was wrong on both counts and has been removed. build_regex now unconditionally sets multi_line(true) and dot_matches_new_line(false) for all patterns.
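The line-terminator walk for Lines(Some(n)) can be sketched like this; it is assumed to mirror compute_window's behavior, not copied from it:

```rust
// Walk LF, CRLF, or bare CR -- each counting as ONE terminator -- and
// return the window end just past the Nth terminator, bounded by the
// (already byte-capped) slice.
fn lines_window_end(buf: &[u8], n: usize) -> usize {
    let mut seen = 0;
    let mut i = 0;
    while i < buf.len() && seen < n {
        match buf[i] {
            b'\r' => {
                // CRLF is a single terminator: consume the LF too.
                i += if buf.get(i + 1) == Some(&b'\n') { 2 } else { 1 };
                seen += 1;
            }
            b'\n' => {
                i += 1;
                seen += 1;
            }
            _ => i += 1,
        }
    }
    i // fewer than n terminators: stop at the slice boundary
}
```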

2.8 Regex Scan Window Is Always Capped at 8192 Bytes

Every regex rule is subject to the REGEX_MAX_BYTES (8192) hard cap, matching GNU file's FILE_REGEX_MAX (src/file.h:522). This applies to every RegexCount variant:

  • RegexCount::Default (plain regex) → scan the full 8192-byte window (or less if the buffer is shorter).
  • RegexCount::Bytes(n) → min(n, 8192, remaining) — explicit counts larger than 8192 are clamped.
  • RegexCount::Lines(Some(n)) → walks line terminators to the Nth, but the walk is bounded by the byte-capped slice, so a buffer with fewer than N terminators in the first 8192 bytes stops at the 8192-byte boundary regardless of n.
  • RegexCount::Lines(None) → same as Default: full 8192-byte capped window.

The cap is a DoS mitigation: without it, a malicious regex against a multi-GB buffer combined with EvaluationConfig::default() (no timeout — see S13.1) can hang the evaluator. REGEX_MAX_BYTES itself lives in src/evaluator/types/regex.rs (not in parser::ast) because it is runtime evaluation policy rather than AST shape — putting it in ast.rs would couple the build-script compilation unit to an evaluator-only concern per S1.1. Do not add a path that bypasses the cap, even for "trusted" rules — the cap is also what makes the regex evaluator's worst-case runtime bounded.
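The clamp itself is a one-liner; a sketch under the constant name the text gives:

```rust
const REGEX_MAX_BYTES: usize = 8192; // mirrors GNU file's FILE_REGEX_MAX

// Sketch of the RegexCount::Bytes(n) window clamp: the explicit count,
// the hard cap, and the remaining buffer all bound the scan window.
fn clamp_window(explicit: usize, remaining: usize) -> usize {
    explicit.min(REGEX_MAX_BYTES).min(remaining)
}
```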

2.9 Regex /s Flag Affects Anchor Advance Only, Not Match Result

RegexFlags::start_offset (the /s suffix) controls only regex_bytes_consumed: when set, the anchor advance is m.start() (match-start) instead of m.end() (match-end). The match result (whether a pattern matches, and what matched text is returned) is unchanged. This matches libmagic's REGEX_OFFSET_START flag, which zeros the rm_len contribution in moffset() but does not alter the regex scan itself. Tests for /s must exercise regex_bytes_consumed directly or check the resolved offset of a Relative(N) child rule; checking read_regex alone won't detect a broken /s implementation.
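The anchor-advance split can be sketched with a plain match span; the field name start_offset follows the text, but the struct shape here is illustrative:

```rust
// The /s suffix flips anchor advance from match-end to match-start.
// The match RESULT is unchanged -- only the consumed-bytes figure differs.
struct RegexFlags {
    start_offset: bool, // the /s suffix
}

fn regex_bytes_consumed(match_start: usize, match_end: usize, flags: &RegexFlags) -> usize {
    if flags.start_offset {
        match_start // /s: children anchor at the start of the match
    } else {
        match_end // default: children anchor just past the match
    }
}
```

This is also why tests must probe regex_bytes_consumed (or a Relative(N) child's resolved offset) rather than read_regex: the match span is identical either way.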

2.10 RegexCount::Lines(None) Is Behaviorally Equivalent to RegexCount::Default

Both RegexCount::Default (plain regex) and RegexCount::Lines(None) (the regex/l shorthand with no explicit count) produce the full 8192-byte capped window. compute_window handles them with a shared match arm, and calculate_default_strength gives them the same strength score (20, no "constrained scan" bonus). The two variants are kept distinct at the AST level because the magic-file surface syntax distinguishes them — regex and regex/l parse to different RegexCount variants and round-trip through codegen as different Rust expressions — but no runtime path treats them differently. If you write a test that passes Lines(None) expecting different behavior from Default, the test is wrong, not the implementation. See test_read_regex_lines_none_is_equivalent_to_default_on_buffer_with_terminators in src/evaluator/types/regex.rs for the regression guard that pins this equivalence.

2.11 MetaType Exhaustive Matches

Adding a variant to the MetaType enum (nested inside TypeKind::Meta) requires updates in: ast.rs (variant definition + 3 test fixtures that iterate MetaType variants: test_meta_type_variants_debug_clone_eq, test_meta_type_serde_roundtrip, test_type_kind_meta_bit_width_is_none), parser/types.rs (parse_type_keyword tag + type_keyword_to_kind match arm + test_roundtrip_all_keywords array), parser/codegen.rs (serialize_type_kind -- the inner TypeKind::Meta(meta) arm), and tests/property_tests.rs (arb_type_kind prop_oneof branch).

Evaluator decision rule. The evaluator has two possible locations for a MetaType arm and the choice is load-bearing:

  • Inline loop-level dispatch in evaluate_rules (in src/evaluator/engine/mod.rs) — use this when the variant has side-effects that the single-rule path cannot express. MetaType::Default (needs the per-level sibling_matched flag), MetaType::Clear (mutates sibling_matched), MetaType::Use (splices subroutine matches into the caller's match vector from RuleEnvironment::name_table), and MetaType::Indirect (recurses into RuleEnvironment::root_rules with a reset anchor and a sliced buffer) all live here. These variants are dispatched before the loop body calls evaluate_single_rule_with_anchor.
  • Silent no-op via the Meta(_) wildcard arm in evaluate_single_rule_with_anchor — use this for variants that produce no match, do not mutate sibling_matched, and do not recurse. MetaType::Name falls here (it is also defensively scrubbed at load time by parser::name_table::extract_name_table, so the evaluator arm is a debug!-and-return safety net rather than a normal code path).

Similarly, strength.rs::calculate_default_strength and types/mod.rs::bytes_consumed_with_pattern treat all Meta(_) variants uniformly (zero strength contribution, zero bytes consumed) via catch-alls, so they do not need per-variant arms.

3. Parser Architecture

3.1 Type Keyword Parsing Split

Type keyword parsing lives in src/parser/types.rs (parse_type_keyword + type_keyword_to_kind); the grammar module in src/parser/grammar/mod.rs orchestrates and handles type-specific suffixes (e.g., pstring /B, /H, /L).

  • Gotcha: parse_type_keyword only consumes the base keyword (e.g., "pstring"), leaving suffixes for the grammar layer.
  • Gotcha: parse_type_and_operator consumes trailing whitespace after parsing type+suffix -- test expectations for remaining input must account for this.

3.2 parse_number Return Type

parse_number returns i64 and is shared by offset + value parsing. Never widen its return type. Use the separate parse_unsigned_number (u64) path only in parse_numeric_value.

3.3 nom tuple Combinator

The nom tuple combinator is deprecated. Use bare tuple syntax (a, b, c) directly as Parser is implemented for tuples.

3.4 type_keyword_to_kind Line Length

type_keyword_to_kind has #[allow(clippy::too_many_lines)] because it exceeds 100 lines with all date keywords.

3.5 parse_number Does Not Handle + Prefix

parse_number handles - signs but not +. When parsing syntax like +4 (e.g., indirect offset adjustments), consume the + character manually before calling parse_number.
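A sketch of the caller-side rule. The stand-in below mimics parse_number's sign behavior (accepts '-', rejects '+') but is not the real nom parser:

```rust
// Stand-in for parse_number: handles a leading '-' but NOT '+'.
fn parse_number_standin(input: &str) -> Option<i64> {
    let (neg, digits) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None; // a leading '+' lands here
    }
    let mag: i64 = digits.parse().ok()?;
    Some(if neg { -mag } else { mag })
}

// Caller-side rule: consume an optional '+' manually before delegating.
fn parse_adjustment(input: &str) -> Option<i64> {
    parse_number_standin(input.strip_prefix('+').unwrap_or(input))
}
```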

3.6 parse_value Requires Quoted Strings (But parse_magic_rule Has a Bare-Word Fallback)

parse_value() itself does not accept bare unquoted strings -- parse_value("xyz") still returns Err (see test_parse_value_invalid_input in grammar/tests/mod.rs, which pins this behavior). However, parse_magic_rule adds a bare-word fallback (parse_bare_string_value) when the rule's type is string-family (String, PString, Regex, Search), so 0 string TEST and >0 search/12 ABC parse successfully without quotes -- matching libmagic magic(5) surface syntax and allowing real-world fixtures like third_party/tests/searchbug.magic to load. For non-string-family types (byte/short/long/etc.), bare words still fail; integration tests exercising parse_value directly (as opposed to going through parse_magic_rule) must still quote string literals.

3.7 Indirect Offset Pointer Specifiers Follow GNU file Semantics

Lowercase pointer specifiers (.s, .l, .q) map to little-endian, not native endian. Uppercase (.S, .L, .Q) map to big-endian. All numeric pointer types are signed by default (per S6.3).

Adjustment is accepted in two forms:

  • Inside the parens (canonical magic(5)): (base.type+N), (base.type-N), (base.type*N), (base.type/N), (base.type%N), (base.type&N), (base.type|N), (base.type^N). The full magic(5) operator set is supported here.
  • After the closing paren (legacy/alternate): (base.type)+N, (base.type)-N. Only +/- are accepted in this form.

Only one form may be used per rule; combinations like (19.b-1)+2 are not permitted. The operator selection is stored on OffsetSpec::Indirect.adjustment_op (IndirectAdjustmentOp enum). Subtraction is folded into IndirectAdjustmentOp::Add with a negative adjustment so the evaluator does not need a dedicated Sub variant.

The evaluator (evaluator/offset/indirect.rs::apply_adjustment) treats the operand as i64 for Add (signed addition) and reinterprets it as u64 for Mul/Div/Mod/And/Or/Xor to match libmagic's apprentice.c::do_offset raw-machine-word semantics. Div/Mod reject a zero operand with EvaluationError::InvalidOffset; Mul rejects integer overflow the same way. IndirectAdjustmentOp derives Default = Add so a missing field round-trips correctly through serde for older AST snapshots.
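The operand-interpretation split can be sketched as below. Names and error shapes are illustrative (the real code returns EvaluationError::InvalidOffset); the i64-for-Add, u64-for-everything-else split and the zero/overflow rejections follow the text:

```rust
// Illustrative mirror of IndirectAdjustmentOp; subtraction is folded
// into Add with a negative operand, so there is no Sub variant.
#[derive(Clone, Copy, Default)]
enum AdjOp {
    #[default]
    Add,
    Mul,
    Div,
    Mod,
    And,
    Or,
    Xor,
}

fn apply_adjustment(base: i64, op: AdjOp, operand: i64) -> Result<i64, String> {
    match op {
        // Signed addition (covers subtraction via negative operand).
        AdjOp::Add => base.checked_add(operand).ok_or_else(|| "overflow".into()),
        // Raw-machine-word semantics: reinterpret both sides as u64.
        AdjOp::Mul => (base as u64)
            .checked_mul(operand as u64)
            .map(|v| v as i64)
            .ok_or_else(|| "overflow".into()),
        AdjOp::Div if operand == 0 => Err("divide by zero".into()),
        AdjOp::Div => Ok(((base as u64) / (operand as u64)) as i64),
        AdjOp::Mod if operand == 0 => Err("mod by zero".into()),
        AdjOp::Mod => Ok(((base as u64) % (operand as u64)) as i64),
        AdjOp::And => Ok(((base as u64) & (operand as u64)) as i64),
        AdjOp::Or => Ok(((base as u64) | (operand as u64)) as i64),
        AdjOp::Xor => Ok(((base as u64) ^ (operand as u64)) as i64),
    }
}
```

For example, the legacy form (19.b-1) becomes AdjOp::Add with operand -1.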

3.8 Relative Offsets: Global Anchor, No Save/Restore

OffsetSpec::Relative(N) resolves against EvaluationContext::last_match_end(), which is updated after every successful match in evaluate_rules and is never saved/restored across child recursion. This is intentional and matches GNU file: a sibling rule sees the anchor wherever the deepest descendant of the previous sibling left it. The anchor is global/shared rather than stack-scoped, but its numeric value is not guaranteed to be non-decreasing -- a successful Relative(-N) rule (or any later rule that matches at a lower absolute position) can move it earlier. Do not wrap recursion in a save/restore pair "for safety" -- it would silently break sibling-after-nested chains. The recursion-depth pattern in the same loop is save/restore, and the asymmetry is correct.

The load-bearing invariant is that the anchor is updated before recursing into children (so children and their followers see the new anchor). The current code also happens to set the anchor before matches.push(...), but the push-ordering relative to set_last_match_end is incidental for anchor correctness -- only the ordering before the evaluate_rules recursion call matters. (Future code that reads the anchor while iterating matches would make this ordering load-bearing, so do not "optimize" the order without checking call sites first.) bytes_consumed() (in evaluator/types/mod.rs) is the source of truth for advance distance; for variable-width types it re-derives consumption from the buffer rather than trusting Value::String.len() (which can drift from the original byte length via from_utf8_lossy). Pascal-string consumption is also clamped against the remaining buffer to prevent attacker-controlled length prefixes from poisoning the anchor to usize::MAX.

Continuation-sibling exception (issue #42): Inside a child recursion (recursion_depth > 0), the anchor IS reset to the sibling list's entry value between iterations. Continuation rules at the same indentation level (>>&0 ubyte ...; >>&0 offset ...) all resolve &N against the parent-level anchor rather than chaining off each other. This matches libmagic's ms->c.li[cont_level] continuation-level model and is required for the GNU file searchbug.magic fixture to produce at_offset 11 (not at_offset 12). Top-level siblings (recursion_depth == 0) keep the chaining behavior described above. The reset_anchor_between_siblings flag in evaluate_rules gates this. Tests that assert top-level sibling chaining (e.g. relative_anchor_can_decrease_when_later_sibling_matches_at_lower_position) must stay at depth 0; tests that assert continuation-sibling parent-anchor resolution (e.g. test_offset_does_not_advance_anchor_for_continuation_siblings) must go through a parent rule's children. The two semantics coexist deliberately; do not "unify" them without reading the entire GNU file searchbug test chain.

3.9 Subroutine Base Offset for use Bodies

Inside a MetaType::Use subroutine body, OffsetSpec::Absolute(n) with n >= 0 resolves to base_offset + n, where base_offset is the use-site offset. EvaluationContext::base_offset tracks this; evaluate_use_rule saves and restores it around the subroutine call, alongside last_match_end. This matches magic(5) / libmagic semantics: a subroutine written as >0 search/12 ABC invoked via >>64 use part2 scans starting at file position 64, not 0. Negative absolute offsets (FromEnd-style), Indirect pointer reads, and Relative(&N) offsets are unaffected -- they already have well-defined reference frames. Callers of offset::resolve_offset_with_base pass context.base_offset(); the base_offset field is gated behind pub(crate) accessors so external consumers cannot inject arbitrary offset bias.

3.11 parse_text_magic_file is Fail-Fast, Not Skip-on-Error

build_rule_hierarchy propagates any parse_magic_rule_line error immediately, so a single unparseable rule (e.g., a child using unsupported &+N relative-offset syntax or an unquoted $VAR string value -- see S3.6) causes the entire file load to fail with ParseError::InvalidSyntax. There is no skip-and-continue mode. When writing corpus tests against third_party .magic files that mix supported and unsupported syntax, bypass the parser and build the equivalent MagicRule tree programmatically via the AST; the runtime evaluator can still be exercised end-to-end against the real testfile buffer. See tests/evaluator_tests.rs::test_regex_eol_corpus for a worked example.

3.12 parse_text_magic_file Returns ParsedMagic, Not Vec<MagicRule>

parse_text_magic_file, load_magic_file, and load_magic_directory all return Result<ParsedMagic { rules: Vec<MagicRule>, name_table: NameTable }, ParseError>. Top-level Meta(Name(id)) rules are hoisted out of the flat rule list at parse time by parser::name_table::extract_name_table and placed into the name_table field keyed by identifier. Duplicate names keep the first definition and emit a warn!; nested Name rules (not well-defined in magic(5)) are scrubbed with a warning during extraction.

  • Callers must destructure at the boundary. Codegen consumers (build.rs, src/build_helpers.rs) use parsed.rules and discard the table. Runtime consumers (MagicDatabase::load_from_file_with_config) pattern-match let ParsedMagic { rules, name_table } = ...; and wrap both in Arcs before handing them to the database.
  • Directory loads merge name tables with first-wins semantics. load_magic_directory merges per-file name tables alphabetically (same ordering as the rule merge); duplicate-name warnings during merge are distinct from per-file duplicate-name warnings during extraction.
  • Evaluator consumption is via RuleEnvironment (pub(crate) in evaluator/mod.rs), threaded as an optional Arc<RuleEnvironment> on EvaluationContext. Low-level callers (evaluate_rules, evaluate_rules_with_config) leave this as None and Use rules no-op silently -- this preserves the low-level API for property tests and fuzz harnesses that construct rule trees by hand. MagicDatabase::evaluate_buffer_internal attaches the environment before dispatching.
  • Each subroutine in the name table must be strength-sorted the same way top-level rules are (via sort_rules_by_strength_recursive), otherwise use-site evaluation is non-deterministic with respect to source order inside the name block. MagicDatabase::load_from_file_with_config iterates name_table.values_mut() to do this after the top-level sort.

See docs/solutions/integration-issues/meta-type-subroutine-dispatch-architecture.md for the full three-layer pattern (parse-time hoist, ParsedMagic return type, optional RuleEnvironment).

4. Module Visibility & Re-exports

4.1 Private Engine Module

evaluator::engine is private. Integration tests must import libmagic_rs::evaluator::evaluate_rules (re-exported), not evaluator::engine::evaluate_rules.

4.2 MatchResult Name Collision

evaluator::MatchResult was renamed to evaluator::RuleMatch (issue #60). output::MatchResult is the public-facing type. Do not confuse the two.

4.3 Glob Re-exports

Prefer pub mod over pub use module::* for submodules -- glob re-exports expand public API surface unintentionally.

4.4 Parallel EvaluationResult / EvaluationMetadata Types

There are two distinct EvaluationResult and EvaluationMetadata types with overlapping names but different fields:

  • src/lib.rs: the top-level library API types. EvaluationResult carries description, mime_type, confidence, matches: Vec<evaluator::RuleMatch>, metadata. EvaluationMetadata carries file_size: u64, evaluation_time_ms: f64, rules_evaluated: usize, magic_file: Option<PathBuf>, timed_out: bool.
  • src/output/mod.rs: the output-formatting types. EvaluationResult carries filename: PathBuf, matches: Vec<MatchResult>, metadata, error: Option<String>. EvaluationMetadata carries file_size: u64, evaluation_time_ms: f64, rules_evaluated: u32, rules_matched: u32.

The two pairs are not interchangeable -- notice that rules_evaluated is usize in one and u32 in the other, and timed_out/magic_file exist only on the lib.rs variant while rules_matched/filename/error exist only on the output variant. A one-way conversion exists: output::EvaluationResult::from_library_result(...) converts from the library type to the output type; there is no reverse conversion. When importing, qualify the path (libmagic_rs::EvaluationResult vs libmagic_rs::output::EvaluationResult) to make intent explicit. Do not "unify" them without coordinated API changes across the CLI, integration tests, and any downstream consumers.

5. Numeric Types & Arithmetic

5.1 usize::from(u32) Does Not Compile

There is no From<u32> for usize (fails on 32-bit targets). Use as usize for widening conversions from u32.

5.2 Checked Arithmetic for Offsets

Buffer offset calculations must use checked_add to prevent overflow panics from malicious/malformed input. Always test with usize::MAX -- otherwise regressions go undetected.
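A minimal sketch of the pattern (hypothetical helper, not the actual resolver):

```rust
// Reject rather than wrap on offset overflow from untrusted input.
fn resolve_offset(base: usize, delta: usize, buf_len: usize) -> Option<usize> {
    let off = base.checked_add(delta)?; // None instead of a wrap or panic
    (off < buf_len).then_some(off)      // must also land inside the buffer
}
```

The usize::MAX case in the assertions below is exactly the regression the text warns about.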

5.3 Float Epsilon Equality

inf - inf = NaN, so (a - b).abs() <= epsilon fails for infinities. Handle is_infinite() before the epsilon check. Use #[allow(clippy::float_cmp)] on the exact infinity comparison.

  • Note: apply_equal/apply_not_equal use epsilon-aware comparison for Value::Float; ordering operators (<, >, <=, >=) use direct partial_cmp -- these are deliberately different semantics.
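The infinity-first ordering can be sketched as:

```rust
// Handle infinities exactly BEFORE the epsilon test: inf - inf is NaN,
// so (a - b).abs() <= epsilon would wrongly report inequality for
// two equal infinities.
fn float_eq(a: f64, b: f64, epsilon: f64) -> bool {
    if a.is_infinite() || b.is_infinite() {
        #[allow(clippy::float_cmp)] // exact comparison is correct for +/-inf
        return a == b;
    }
    (a - b).abs() <= epsilon
}
```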

6. String & PString Types

6.1 Multi-Byte PString Length Prefixes

Uppercase pstring suffix letters indicate big-endian byte order, lowercase indicate little-endian: /H (2-byte BE), /h (2-byte LE), /L (4-byte BE), /l (4-byte LE). The /J flag means the stored length includes the length field itself (JPEG convention). Flags are combinable (e.g., pstring/HJ). Test data must use the correct byte order for the variant (e.g., \x00\x05 for TwoByteBE length 5, \x05\x00 for TwoByteLE length 5).
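Length-prefix decoding can be sketched as below; the function shape is illustrative (the real code dispatches on a PString length-width enum), but the endianness and /J arithmetic follow the text:

```rust
// Decode a pstring length prefix: suffix case selects endianness, and
// /J (includes_self) subtracts the length field's own width from the
// stored count, per the JPEG convention.
fn pstring_len(prefix: &[u8], big_endian: bool, includes_self: bool) -> Option<usize> {
    let width = prefix.len(); // 1, 2, or 4 bytes
    let raw = match (width, big_endian) {
        (1, _) => u32::from(prefix[0]),
        (2, true) => u32::from(u16::from_be_bytes([prefix[0], prefix[1]])),
        (2, false) => u32::from(u16::from_le_bytes([prefix[0], prefix[1]])),
        (4, true) => u32::from_be_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]),
        (4, false) => u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]),
        _ => return None,
    } as usize;
    if includes_self {
        raw.checked_sub(width) // /J: reject counts smaller than the prefix itself
    } else {
        Some(raw)
    }
}
```

Note how the same logical length 5 needs \x00\x05 for the BE variant but \x05\x00 for the LE variant.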

6.2 medate/meldate Not Supported

Middle-endian date keywords are NOT supported. They were removed until real middle-endian decoding is implemented end-to-end.

6.3 Signed-by-Default Types

libmagic types are signed by default (byte, short, long, quad). Unsigned variants use u prefix (ubyte, ushort, ulong, uquad, etc.).

6.4 TypeKind::String Has Two Read Modes -- Pick Consciously

src/evaluator/types/string.rs exposes two read functions and the dispatcher in src/evaluator/types/mod.rs::read_typed_value_with_pattern picks between them based on (max_length, pattern):

  • read_string_exact(buffer, offset, length) reads exactly length bytes with NO NUL truncation. Used for libmagic-compatible string PATTERN byte-for-byte comparison whenever a comparison value is supplied -- including patterns that legitimately contain embedded NULs (e.g. 0 string PNCIHISK\0 Norton Utilities disc image data). Selecting this path is critical: a 9-byte pattern ending in \0 compared against a 9-byte buffer ending in \0 must read 9 bytes, not stop at 8.
  • read_string(buffer, offset, max_length) reads until the first NUL or end of buffer, capped by max_length if set. Used for the x (any-value) operator path where the read is for printable-prefix output rather than equality comparison. Adding a new code path that needs string bytes must pick consciously: pattern comparison -> read_string_exact; printable scan -> read_string.
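A side-by-side sketch of the two modes (assumed shapes, not the real signatures -- the real functions return Value and thread errors):

```rust
// Exact-length read for pattern comparison: NO NUL truncation, because
// embedded \0 bytes are part of the pattern.
fn read_string_exact(buffer: &[u8], offset: usize, length: usize) -> Option<&[u8]> {
    buffer.get(offset..offset.checked_add(length)?)
}

// NUL-terminated read for printable output: stop at the first NUL or
// end of buffer, capped by max_length if set.
fn read_string(buffer: &[u8], offset: usize, max_length: Option<usize>) -> Option<&[u8]> {
    let rest = buffer.get(offset..)?;
    let capped = match max_length {
        Some(n) => &rest[..n.min(rest.len())],
        None => rest,
    };
    let end = capped.iter().position(|&b| b == 0).unwrap_or(capped.len());
    Some(&capped[..end])
}
```

The assertions below show why the PNCIHISK\0 pattern needs the exact-length path: the NUL-terminated read stops one byte short.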

Historical note (NUL-free buffer behavior): when read_string runs with max_length: None on a NUL-free buffer (raw ASCII text, JSON, log lines), it reads the entire remaining buffer, which historically broke equality comparison against short target values. The pattern-aware dispatcher now routes those comparisons through read_string_exact so that failure mode no longer applies; programmatic rules constructed by hand without going through the dispatcher should still set max_length: Some(target_len) explicitly when targeting NUL-free buffers.

The full backstory of why both functions exist (and why the previous single-function design silently broke 0 string PNCIHISK\0 ... and similar rules) is documented in docs/solutions/logic-errors/magic-string-rule-matching-3-bug-fix-2026-04-25.md.

7. Testing

7.1 Doctest Import Paths

Doctests compile as external crates. Use libmagic_rs:: imports, never crate::.

7.2 Heterogeneous Byte-Array Test Tables

Table-driven tests with byte-array buffers of different sizes need &[(&[u8], ...)] slice type annotation. Bare arrays can't hold heterogeneous-length byte literals in a [T; N].
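A minimal illustration (hypothetical test table):

```rust
// b"ELF" is [u8; 3] and b"MAGIC" is [u8; 5] -- distinct types, so a bare
// array [T; N] cannot hold both. The &[(&[u8], ...)] annotation coerces
// each literal to an unsized &[u8] slice.
fn demo_table() -> usize {
    let cases: &[(&[u8], usize)] = &[
        (b"ELF", 3),
        (b"MAGIC", 5), // different length: fine behind &[u8]
    ];
    cases.iter().filter(|(buf, len)| buf.len() == *len).count()
}
```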

7.3 Doc Examples and Pattern Matching

cargo test --doc verifies doc examples. Ensure example strings don't accidentally match multiple patterns.

7.4 clap Args in Tests

Adding fields to the clap Args struct requires updating ALL manual Args { ... } constructions in unit tests (search for Args { in src/main.rs). process_file and run_analysis are called from unit tests with mocked stdin -- signature changes require updating all test call sites.

7.5 Reserved-for-Future Enum Variants

Reserved-for-future enum variants (e.g., TypeReadError::UnsupportedType) need explicit doc comments explaining intent -- otherwise every reviewer flags them as dead code.

8. Clippy & Formatting

8.1 Enum Variant Naming

Enum variants with a shared suffix (e.g., OneByte, TwoByte, FourByte) trigger clippy::enum_variant_names. Add #[allow(clippy::enum_variant_names)] when renaming would hurt readability.
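A small sketch of the pattern (the enum here is illustrative): the allow attribute sits on the enum and records that the shared suffix is deliberate.

```rust
// clippy::enum_variant_names fires on the repeated "Byte" suffix; the allow
// documents that renaming (One, Two, Four) would hurt readability.
#[allow(clippy::enum_variant_names)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum WidthKind {
    OneByte,
    TwoByte,
    FourByte,
}

fn width(kind: WidthKind) -> usize {
    match kind {
        WidthKind::OneByte => 1,
        WidthKind::TwoByte => 2,
        WidthKind::FourByte => 4,
    }
}

fn main() {
    assert_eq!(width(WidthKind::FourByte), 4);
}
```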

8.2 unsafe_code = "forbid" is Absolute

unsafe_code = "forbid" in Cargo.toml workspace lints cannot be overridden with #[allow(unsafe_code)]. Use vetted crates (e.g., chrono for timezone) instead of libc FFI or subprocess calls.

8.3 Pre-commit Hook Reformats

Pre-commit hook runs cargo fmt. First commit attempt may fail if fmt modifies files -- just re-stage and retry.

9. Error Handling Patterns

9.1 Error Return Path Cleanup

On error return paths, use let _ = cleanup(), not cleanup()? -- the ? masks the original error with cleanup failures. Bind the original error with an e @ pattern instead of reconstructing it after state changes, which avoids losing error detail.
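Both patterns together, in a self-contained sketch (the error enum and functions are illustrative, not from the codebase):

```rust
#[derive(Debug, PartialEq)]
enum StepError {
    Parse(String),
    Io(String),
}

// Simulate a cleanup that itself fails on the error path.
fn cleanup() -> Result<(), StepError> {
    Err(StepError::Io("cleanup also failed".into()))
}

fn run(step: Result<u32, StepError>) -> Result<u32, StepError> {
    match step {
        Ok(v) => Ok(v),
        // e @ binds the whole original error while matching its shape,
        // so it is returned as-is rather than reconstructed.
        Err(e @ StepError::Parse(_)) | Err(e @ StepError::Io(_)) => {
            // let _ =: a cleanup()? here would replace the original error
            // with the cleanup failure.
            let _ = cleanup();
            Err(e)
        }
    }
}

fn main() {
    let result = run(Err(StepError::Parse("bad offset".into())));
    // The original error survives, not the cleanup failure.
    assert_eq!(result, Err(StepError::Parse("bad offset".into())));
}
```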

9.2 Use Correct Error Constructors

  • Use ParseError::IoError for I/O errors in parser code, not ParseError::invalid_syntax
  • Use LibmagicError::ConfigError for config validation, not ParseError::invalid_syntax

10. Documentation

10.1 Extracting Public Functions

When extracting public functions to new modules, verify that # Examples, # Arguments, and # Returns doc sections are preserved -- they can silently disappear during the move.

10.2 # Returns and # Security Accuracy

When refactoring docs, verify # Returns and # Security sections match actual behavior. For example, from_utf8_lossy never errors but docs may claim it does.
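The from_utf8_lossy case, concretely (the wrapper function is illustrative): the call is total, so a # Returns or # Errors section claiming fallibility would be wrong.

```rust
fn printable_prefix(bytes: &[u8]) -> String {
    // No Result here: from_utf8_lossy cannot fail on any input.
    // Invalid byte sequences become U+FFFD (the replacement character).
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    assert_eq!(printable_prefix(b"magic"), "magic");
    // Invalid UTF-8 is replaced, not rejected.
    assert_eq!(printable_prefix(&[0x66, 0xff, 0x6f]), "f\u{fffd}o");
}
```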

10.3 Public Enum Variant Docs

All public enum variants need # Examples rustdoc sections. Clippy enforces this.

11. Date/Timestamp Types

11.1 Date Value Representation

read_date/read_qdate return Value::Uint(raw_secs) for correct numeric comparisons. Use format_timestamp_value(secs, utc) at output time for GNU file-compatible display. There is no special date coercion in coerce_value_to_type.
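The representation choice in miniature. This is a toy model, not the real Value type or format_timestamp_value: a plain u64 stands in for Value::Uint, comparison runs on the raw seconds, and rendering is deferred to output time.

```rust
// Numeric comparison happens on the raw seconds value, exactly as stored.
fn date_greater_than(raw_secs: u64, threshold: u64) -> bool {
    raw_secs > threshold
}

// Toy placeholder for output-time rendering; the real code produces a
// GNU file-compatible calendar date via format_timestamp_value.
fn toy_render(raw_secs: u64) -> String {
    format!("{raw_secs} seconds since epoch")
}

fn main() {
    let stamp = 1_700_000_000_u64;
    assert!(date_greater_than(stamp, 1_600_000_000));
    assert_eq!(toy_render(stamp), "1700000000 seconds since epoch");
}
```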

11.2 Timezone Handling

chrono crate (no default features, std + clock) is a vetted dependency for timezone offset computation in evaluator/types/date.rs. Do not use libc FFI or subprocess calls (see S8.2).

12. Git & Release

12.1 Repository Identity

Repository is EvilBit-Labs/libmagic-rs. NEVER guess repo/org names -- always run git remote -v to verify.

12.2 Tag Signing

All tags and commits MUST be signed -- use git tag -s and git commit -s -S. If signing fails, STOP and troubleshoot. Never push unsigned tags/commits. Never delete or force-push tags without explicit user permission.

12.3 Auto-Generated Release Workflow

.github/workflows/release.yml is auto-generated by cargo-dist -- never modify manually.

13. Evaluation Configuration

13.1 EvaluationConfig::default() Has No Timeout

EvaluationConfig::default() sets timeout_ms: None, meaning library consumers who construct a config via Default::default() get unbounded evaluation time. On crafted or adversarial input, rules with expensive operations (deep nesting, large buffers, pathological backtracking in future regex/search types) can run indefinitely and expose callers to a denial-of-service vector.

  • Rule: Library consumers embedding libmagic-rs in services or untrusted-input pipelines should not use EvaluationConfig::default(). Use EvaluationConfig::performance() (which sets timeout_ms: Some(1000)) as the safe preset, or construct a config explicitly with a non-None timeout sized for your workload.
  • Validation: timeout_ms is clamped to MAX_SAFE_TIMEOUT_MS (5 minutes) by config validation and must be > 0 if specified -- see the validation logic in src/config.rs.
  • Note: Default cannot be changed to set a timeout without breaking API expectations of callers who deliberately want no timeout (e.g., CLI one-shot invocations). The gotcha is that the unsafe default is the ergonomic choice; document the tradeoff prominently in any new consumer-facing docs.
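The semantics above can be modeled in a standalone mock. This is not the real EvaluationConfig API -- MockConfig, its validate method, and the preset shown here only mirror the behavior described in this section (None default, Some(1000) performance preset, clamping to a five-minute ceiling, rejecting zero).

```rust
const MAX_SAFE_TIMEOUT_MS: u64 = 5 * 60 * 1000; // five minutes

#[derive(Debug, Default, PartialEq)]
struct MockConfig {
    timeout_ms: Option<u64>, // Default::default() => None (unbounded!)
}

impl MockConfig {
    fn performance() -> Self {
        Self { timeout_ms: Some(1000) } // the bounded, service-safe preset
    }

    fn validate(self) -> Result<Self, String> {
        match self.timeout_ms {
            Some(0) => Err("timeout_ms must be > 0 when specified".into()),
            Some(ms) => Ok(Self { timeout_ms: Some(ms.min(MAX_SAFE_TIMEOUT_MS)) }),
            None => Ok(self), // allowed, but unbounded -- the gotcha above
        }
    }
}

fn main() {
    assert_eq!(MockConfig::default().timeout_ms, None);
    assert_eq!(MockConfig::performance().timeout_ms, Some(1000));
    let clamped = MockConfig { timeout_ms: Some(u64::MAX) }.validate().unwrap();
    assert_eq!(clamped.timeout_ms, Some(MAX_SAFE_TIMEOUT_MS));
}
```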

13.2 EvaluationConfig::default() Stops at First Top-Level Match

EvaluationConfig::default() sets stop_at_first_match: true, so once any top-level rule produces a match the evaluator halts further sibling iteration. Integration tests that rely on later siblings running (e.g., default/clear/default chains, use-followed-by-sibling continuation, anything exercising the per-level sibling_matched flag across the full chain) must use MagicDatabase::load_from_file_with_config with EvaluationConfig::default().with_stop_at_first_match(false). Unit tests that go through evaluate_rules directly hit the same issue -- override the config at construction time. The existing Use tests (test_use_child_rules_evaluated_after_subroutine, test_use_stop_at_first_match_short_circuits_siblings) document both sides of this contract.
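A toy model of the sibling-iteration contract (not the real evaluator): with stop_at_first_match set, later top-level rules never run once one matches, which is exactly why continuation-dependent tests must flip the flag.

```rust
// Each rule is (name, whether it would match this buffer).
fn matching_rules(rules: &[(&str, bool)], stop_at_first_match: bool) -> Vec<String> {
    let mut out = Vec::new();
    for (name, matches) in rules {
        if *matches {
            out.push((*name).to_string());
            if stop_at_first_match {
                break; // later siblings never run
            }
        }
    }
    out
}

fn main() {
    let rules = [("elf", true), ("png", true)];
    // Default-like behavior: the second matching sibling is never reached.
    assert_eq!(matching_rules(&rules, true), vec!["elf"]);
    // With the flag disabled, the full chain runs.
    assert_eq!(matching_rules(&rules, false), vec!["elf", "png"]);
}
```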

14. Output Formatting

14.1 \b (Backspace) Prefix in Rule Messages Suppresses Leading Space

MagicDatabase::build_result concatenates rule messages with a space separator, except when a message starts with \u{0008} (backspace / \b), in which case the backspace is stripped and no leading space is inserted. This mirrors GNU file's description formatting (used by rules like >&1 regex/1l ... \b, version %s to produce Ansible Vault text, version 1.1 instead of Ansible Vault text , version 1.1). Tests that manually simulate the concatenation path (e.g., corpus tests that bypass load_from_file -- see S3.11) must honor this convention or their assertions will diverge from the real evaluator output.
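Tests simulating the concatenation path need to reproduce exactly this convention. A minimal sketch of it (not the actual MagicDatabase::build_result code):

```rust
/// Join rule messages with a single space, except that a message starting
/// with backspace (\u{0008}) has the backspace stripped and suppresses the
/// separator -- mirroring GNU file's description formatting.
fn concat_messages(messages: &[&str]) -> String {
    let mut out = String::new();
    for msg in messages {
        if let Some(rest) = msg.strip_prefix('\u{0008}') {
            out.push_str(rest); // no leading space
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(msg);
        }
    }
    out
}

fn main() {
    // The Ansible Vault example from above: the \b-prefixed continuation
    // glues directly onto the preceding message.
    let parts = ["Ansible Vault text", "\u{0008}, version 1.1"];
    assert_eq!(concat_messages(&parts), "Ansible Vault text, version 1.1");
}
```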

14.2 Printf-Style Format Specifiers Are Substituted by format_magic_message

Magic rule messages like at_offset %lld or followed_by 0x%02x are substituted with the rule's RuleMatch.value at description-assembly time, via src/output/format.rs::format_magic_message. The supported subset covers %d, %i, %u, %x, %X, %o, %s, %c, and %%, plus width/padding modifiers (%05d, %-5d) and length modifiers (l, ll, h, hh, j, z, t), which are parsed but ignored (all numeric rendering uses u64/i64 width).

Hex specifiers mask the value to the natural width of the rule's TypeKind -- a signed byte carrying -1 renders as ff, not ffffffffffffffff. Unknown specifiers pass through literally with a debug! log, matching the evaluator's graceful-skip discipline.
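The width-masking rule for hex specifiers reduces to this (a minimal sketch, not the actual format_magic_message code):

```rust
/// Mask a signed value to the natural byte width of the rule's type before
/// hex rendering, so a signed byte of -1 prints as "ff" rather than the
/// sign-extended 64-bit pattern.
fn hex_masked(value: i64, type_width_bytes: u32) -> String {
    let mask = if type_width_bytes >= 8 {
        u64::MAX
    } else {
        (1u64 << (type_width_bytes * 8)) - 1
    };
    format!("{:x}", (value as u64) & mask)
}

fn main() {
    assert_eq!(hex_masked(-1, 1), "ff");                 // signed byte
    assert_eq!(hex_masked(-1, 8), "ffffffffffffffff");   // full 64-bit type
    assert_eq!(hex_masked(0x1234, 2), "1234");           // 16-bit type
}
```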

Substitution runs BEFORE the backspace check in S14.1, so a rule emitting \b, version %s correctly composes with the preceding match without an intervening space after the value is substituted.

Tests that manually simulate the concatenation path must run their message strings through format_magic_message or construct RuleMatch::message strings that contain no % metacharacters. A literal % in user-facing data should be escaped as %%.