Generate documents with Zephyr #19

Open
alxmrs wants to merge 19 commits into main from u/alxmrs/zyph-v1-contacts

alxmrs commented May 20, 2026

Update exp1 to generate documents on Zephyr and Iris at scale. Fixes #5.

We can verify the output document quality here:
gs://marin-us-east5/protein-structure/MarinFold/exp1/corpus_v1-{shard:05d}-of-{total:05d}.parquet.
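
The `{shard:05d}` and `{total:05d}` fields in that path read as Python format specs (zero-padded to five digits). Below is a minimal sketch of how such a pattern expands into concrete paths. The helper name `shard_paths`, the zero-based shard numbering, and the shard count of 8 are all assumptions for illustration; the real count is not stated in this PR.

```python
# Pattern copied from the PR description; everything else here is illustrative.
PATTERN = (
    "gs://marin-us-east5/protein-structure/MarinFold/exp1/"
    "corpus_v1-{shard:05d}-of-{total:05d}.parquet"
)


def shard_paths(pattern: str, total: int) -> list[str]:
    """Expand a shard pattern into one path per shard (hypothetical helper).

    Assumes shards are numbered from 0, which the PR does not confirm.
    """
    return [pattern.format(shard=i, total=total) for i in range(total)]


# total=8 is a made-up example value, not the actual number of shards.
paths = shard_paths(PATTERN, total=8)
print(paths[0])
```

With paths enumerated this way, a handful of shards can be opened for spot checks without listing the bucket.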

alxmrs force-pushed the u/alxmrs/zyph-v1-contacts branch from db370ec to 16efdb7 Compare May 26, 2026 19:13
[tool.uv.sources]
marinfold = { path = "../../marinfold", editable = true }
marinfold = { git = "https://github.com/Open-Athena/MarinFold.git", subdirectory = "marinfold" }
marin-zephyr = { git = "https://github.com/marin-community/marin.git", subdirectory = "lib/zephyr", branch = "alxmrs/stamp-iris-build-date" }

This branch is needed to avoid an Iris client issue.

alxmrs commented May 27, 2026

@codex may I have your review?

alxmrs requested a review from timodonnell May 27, 2026 22:48
alxmrs marked this pull request as ready for review May 27, 2026 22:48
alxmrs and others added 19 commits May 27, 2026 16:18
Make --num-docs limit the number of input files processed (one doc per
structure, ≈ N docs) rather than post-filtering emitted docs. This
restores the original generate_documents semantics and bounds parsing to
the first N files. Sharding is driven by the --out pattern: a {shard}
placeholder writes one parquet per input, otherwise a single merged file.

- parse.list_structure_files(): brace-expanding glob resolver with a limit
- cli: from_iterable(first N files) when --num-docs is set, else from_files
- README: corrected single-line Iris command + arg/prefix/cancel guidance
- tests: --num-docs now counts inputs (unparseable inputs consume budget)
  and spreads one-per-file under a {shard} --out

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a second input modality so --input can point at the
timodonnell/afdb-1.6M dataset (parquet shards whose cif_content column
holds raw mmCIF text), in addition to .cif/.pdb files. Dispatched on the
--input extension:

- .parquet  -> from_files(glob).load_parquet([cif_column, entry_id]) and
  parse the text in memory via parse.parse_cif_content. entry_id is loaded
  because generate_one seeds its RNG from it.
- otherwise -> the existing file-path + gemmi-read path.

parse.parse_structure is refactored to share _build_parsed_structure with
the new parse_cif_content / try_parse_cif_content. Adds a --cif-column
flag (default cif_content) and huggingface_hub as a base dep for hf://.
README documents the hf:// smoke-test (single shard) and full-run globs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
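
The extension dispatch and the `entry_id`-based seeding described in this commit can be sketched as follows. This is an illustration under assumptions, not the PR's code: `input_modality` and `rng_for_entry` are hypothetical names, the Zephyr calls are deliberately left out, and deriving the seed from a SHA-256 digest is a choice made here (Python's built-in `hash()` on strings varies between processes). How `generate_one` actually derives its seed is not shown in this PR.

```python
import hashlib
import random


def input_modality(input_glob: str) -> str:
    """Pick the input path from the --input extension (hypothetical helper)."""
    if input_glob.endswith(".parquet"):
        # Parquet shards: mmCIF text is read from a column and parsed in memory.
        return "parquet"
    # Anything else: treat as .cif/.pdb file paths read from storage.
    return "structure-files"


def rng_for_entry(entry_id: str) -> random.Random:
    """Build a reproducible RNG from an entry id (assumed seeding scheme)."""
    digest = hashlib.sha256(entry_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


# Example paths and the entry id below are made up for illustration.
print(input_modality("shards/part-*.parquet"))
print(input_modality("structures/*.cif"))
same = rng_for_entry("example-entry").random() == rng_for_entry("example-entry").random()
print(same)
```

Seeding from a stable id rather than from processing order means a document does not depend on which worker or shard handled its structure, which is why the parquet path has to load `entry_id` alongside the mmCIF column.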
alxmrs force-pushed the u/alxmrs/zyph-v1-contacts branch from 176baca to a8c014d Compare May 27, 2026 23:32
Development

Successfully merging this pull request may close these issues.

implement data generation on zephyr
