pmarreck/validate

validate

CI built with garnix

Data silently rots.

If you aren't actively validating, you likely already have corrupt files that are being quietly re-copied to the cloud or your NAS as "good" backups. Family photos, legal documents, old projects, and cherished media are exactly the kind of files that get silently damaged and then preserved in that damaged state.

Drive failures are obvious. Silent sector failures, copy errors, and transmission errors are not. That's why validate exists: deterministic, byte-level validation across a wide range of file formats (100+, see FORMAT_VERIFICATIONS.md).

Why some formats resist corruption detection

Corruption is not equally detectable in every format. Some formats include checksums (PNG, FLAC, ZIP) that make corruption trivially provable: a single flipped bit anywhere in the checksummed data will be caught. Others have no integrity mechanism at all (WAV, TIFF, raw images) and can only be validated structurally.
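To make the contrast concrete, here is a minimal Python sketch (not validate's own code, which is written in Zig). It uses CRC-32 from the standard zlib module as a stand-in for a format-level checksum, and a made-up magic prefix (TOY1, purely hypothetical) as a stand-in for a structure-only check.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


payload = b"example chunk payload" * 4
stored_crc = zlib.crc32(payload)  # what a checksummed format keeps next to the data

# Try every possible single-bit flip and collect the ones the CRC misses.
undetected = [
    i
    for i in range(len(payload) * 8)
    if zlib.crc32(flip_bit(payload, i)) == stored_crc
]
print(len(undetected))  # 0: CRC-32 catches every single-bit error

# A structure-only check (hypothetical magic prefix, not a real format).
MAGIC = b"TOY1"


def structural_check(blob: bytes) -> bool:
    """Pass if the header looks right; the payload is never inspected."""
    return blob.startswith(MAGIC)


blob = MAGIC + payload
damaged = flip_bit(blob, 8 * len(MAGIC) + 5)  # flip a bit inside the payload
print(structural_check(damaged))  # True: the damage is invisible to this check
```

The checksum case fails loudly on any single flipped bit in the covered data, while the structure-only case keeps passing as long as the damage lands outside the fields it inspects.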

The most insidious case is entropy-coded formats like HEIC, JPEG, and H.264 video. These formats use arithmetic or Huffman coding, where essentially every bit pattern in the coded payload decodes to a valid output. A corrupted HEIC file doesn't crash the decoder; it silently produces a slightly wrong image. The decoder has no invalid bitstream states to catch, because the encoding is designed to use the entire code space efficiently. This is the fundamental tradeoff of high-efficiency compression: the same property that makes it compress well (no wasted bit patterns) also makes it corruption-opaque.
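The "no wasted bit patterns" property can be illustrated with a toy prefix code. The sketch below is an illustration only: the four-symbol code is invented for this example and is far simpler than the CABAC or Huffman stages of real codecs. Because the code is complete (its codeword lengths fill the whole code space), the decoder has no way to reach an "invalid codeword" state.

```python
# A complete prefix code: 1/2 + 1/4 + 1/8 + 1/8 = 1, so no bit pattern is wasted.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}


def encode(message: str) -> str:
    """Encode a string of symbols into a string of '0'/'1' characters."""
    return "".join(CODE[sym] for sym in message)


def decode(bits: str) -> tuple[str, str]:
    """Decode greedily; return (symbols, leftover bits of an unfinished codeword)."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    # There is no error branch: every 1-, 2-, or 3-bit prefix is either a
    # codeword or the start of one, so decoding can never get stuck.
    return "".join(out), current


original = "ABACADABBA"
bits = encode(original)

results = []
for i in range(len(bits)):
    flipped = bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]
    results.append(decode(flipped))

# Every corrupted stream decodes to symbols from the alphabet...
all_decoded = all(set(symbols) <= set(CODE) for symbols, _ in results)
# ...and every one of them decodes to the wrong message.
all_wrong = all(symbols != original for symbols, _ in results)
print(all_decoded, all_wrong)
```

The only trace a flip can leave in this toy is one or two leftover bits at the end of the stream. In real formats that use padding, trailing bits like these are normal, so they may not be usable as a corruption signal.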

HEIC is arguably the worst case here because it is the default photo format on modern iPhones. Billions of photos worldwide are stored in a format where a single bit flip in the CABAC-encoded data is mathematically undetectable without the original file to compare against. Even a full decode that parses every arithmetic-coded symbol cannot distinguish corruption from valid data, because corruption simply produces a different valid decode.

validate reports these realities honestly: formats are classified as "fully validated" only when every byte is covered by a checksum, decompression, or decode that would fail on corruption. Formats where corruption can hide in opaque payload data are reported as "structural" validation depth, regardless of how much parsing we perform. See FORMAT_VERIFICATIONS.md for measured detection rates per format.

Components

  • Zig library (core validation)
  • C FFI (stable enough for integration, but not yet 1.0)
  • C CLI wrapper: validate

Status

The C FFI mirrors the current Zig validation API for ease of integration. It is expected to evolve before a 1.0 release.

Build

./build

The build script runs ./test first. When DEBUG is unset or 0, dependencies build in ReleaseFast mode and ./build defaults to -Doptimize=ReleaseFast.

CLI

./zig-out/bin/validate <path> [--jobs N]

Options and environment variables:

  • --jobs 0 (default): uses all available cores (logical CPU count).
  • MAX_FILES: limits the number of files scanned when validating a directory.
  • MAX_VIDEO_SIZE: limits deep video validation to files under N MB (unset = no limit).
  • MEM_TELEMETRY=1: logs per-file RSS memory samples. Use MEM_TELEMETRY_PATH to log to a file and MEM_TELEMETRY_EVERY=N to sample every N files.
  • UNKNOWN_OUT=/path: writes UNKNOWN entries to that path instead of stdout (supports /dev/null, /dev/fd/1, /dev/fd/2).
  • ZIP_TELEMETRY=1: logs slow ZIP entry validation details to stderr (adjust the threshold with ZIP_SLOW_SECONDS).
  • PDF_TELEMETRY=1: logs slow PDF deep-validation breakdowns to stderr (adjust the threshold with PDF_SLOW_SECONDS).
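For scripted runs, the CLI and the environment variables described above can be wrapped in a small helper. The Python sketch below is one possible approach and is not part of the project: the helper names build_invocation and run_validate and the example directory /srv/photos are hypothetical. Only the binary path, the --jobs flag, and the variable names come from this README. It makes no assumptions about validate's output format or exit codes.

```python
import os
import subprocess

VALIDATE_BIN = "./zig-out/bin/validate"  # path from the CLI synopsis in this README


def build_invocation(path, jobs=0, max_files=None, max_video_mb=None, unknown_out=None):
    """Build (argv, env) for one validate run without executing anything."""
    argv = [VALIDATE_BIN, path, "--jobs", str(jobs)]
    env = dict(os.environ)  # start from the caller's environment
    if max_files is not None:
        env["MAX_FILES"] = str(max_files)
    if max_video_mb is not None:
        env["MAX_VIDEO_SIZE"] = str(max_video_mb)
    if unknown_out is not None:
        env["UNKNOWN_OUT"] = unknown_out
    return argv, env


def run_validate(path, **options):
    """Run validate and return the CompletedProcess; interpreting it is up to the caller."""
    argv, env = build_invocation(path, **options)
    return subprocess.run(argv, env=env, check=False)


# Example: scan at most 1000 files with 4 jobs and discard UNKNOWN entries.
argv, env = build_invocation(
    "/srv/photos", jobs=4, max_files=1000, unknown_out="/dev/null"
)
print(argv)
```

Keeping command construction separate from execution makes it possible to log or test the exact invocation before running a long scan.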

Tests

./test

Windows Tests

./test-windows

On Linux (x86_64): Uses Wine from the Nix flake devShell (automatically provided).

On macOS: Uses CrossOver. Requires a bottle named windows-dev-test (or set CROSSOVER_BOTTLE).

About

A full binary file format validator for around 190 different filetypes (originally over 100), written in Zig with frontier AI assistance.
