Generate documents with Zephyr by alxmrs · Pull Request #19 · Open-Athena/MarinFold

alxmrs · 2026-05-20T20:40:14Z

Update exp1 to generate documents on Zephyr and Iris at scale. Fixes #5.

We can verify the output document quality here:
gs://marin-us-east5/protein-structure/MarinFold/exp1/corpus_v1-{shard:05d}-of-{total:05d}.parquet.
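For anyone spot-checking the output, the path above is a Python format pattern rather than a literal file name. A minimal sketch of how it might expand into concrete shard paths follows; the helper name `shard_paths`, the 0-based shard index, and the shard count of 3 are illustrative assumptions, not values taken from the run.

```python
# Pattern copied from the PR description. Everything else here is illustrative.
PATTERN = (
    "gs://marin-us-east5/protein-structure/MarinFold/exp1/"
    "corpus_v1-{shard:05d}-of-{total:05d}.parquet"
)


def shard_paths(pattern: str, total: int) -> list[str]:
    """Expand a {shard}/{total} pattern into one path per shard.

    Assumes shards are numbered from 0; the real job may number them differently.
    """
    return [pattern.format(shard=i, total=total) for i in range(total)]


# Hypothetical shard count, chosen only to keep the example short.
paths = shard_paths(PATTERN, total=3)
for p in paths:
    print(p)
```

Each resulting path can then be opened with any parquet reader that understands `gs://` URLs.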

alxmrs · 2026-05-27T22:45:07Z

 [tool.uv.sources]
-marinfold = { path = "../../marinfold", editable = true }
+marinfold = { git = "https://github.com/Open-Athena/MarinFold.git", subdirectory = "marinfold" }
+marin-zephyr = { git = "https://github.com/marin-community/marin.git", subdirectory = "lib/zephyr", branch = "alxmrs/stamp-iris-build-date" }


This branch is necessary to avoid hitting an Iris client issue.

alxmrs · 2026-05-27T22:46:43Z

@codex may I have your review?

Make --num-docs limit the number of input files processed (one doc per structure, ≈ N docs) rather than post-filtering emitted docs. This restores the original generate_documents semantics and bounds parsing to the first N files. Sharding is driven by the --out pattern: a {shard} placeholder writes one parquet per input, otherwise a single merged file.

- parse.list_structure_files(): brace-expanding glob resolver with a limit
- cli: from_iterable(first N files) when --num-docs is set, else from_files
- README: corrected single-line Iris command + arg/prefix/cancel guidance
- tests: --num-docs now counts inputs (unparseable inputs consume budget) and spreads one-per-file under a {shard} --out

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add a second input modality so --input can point at the timodonnell/afdb-1.6M dataset (parquet shards whose cif_content column holds raw mmCIF text), in addition to .cif/.pdb files. Dispatched on the --input extension:

- .parquet -> from_files(glob).load_parquet([cif_column, entry_id]) and parse the text in memory via parse.parse_cif_content. entry_id is loaded because generate_one seeds its RNG from it.
- otherwise -> the existing file-path + gemmi-read path.

parse.parse_structure is refactored to share _build_parsed_structure with the new parse_cif_content / try_parse_cif_content. Adds a --cif-column flag (default cif_content) and huggingface_hub as a base dep for hf://. README documents the hf:// smoke-test (single shard) and full-run globs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
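To illustrate the dispatch described in this commit, here is a small sketch of choosing an input mode from the --input extension and deriving a reproducible RNG from an entry ID. The names `input_mode` and `rng_for_entry` are hypothetical, and the SHA-256 seeding scheme is an assumption; the commit only says that generate_one seeds its RNG from entry_id, not how.

```python
import hashlib
import random


def input_mode(input_glob: str) -> str:
    """Pick the reading path from the --input extension (illustrative labels)."""
    if input_glob.endswith(".parquet"):
        return "parquet"  # read mmCIF text from a column and parse it in memory
    return "structure_files"  # read .cif/.pdb files from their paths


def rng_for_entry(entry_id: str) -> random.Random:
    """Build a per-entry RNG whose seed depends only on the entry ID.

    Assumed scheme: the first 8 bytes of SHA-256. Python's built-in hash() is
    avoided because string hashes are salted per process by default.
    """
    digest = hashlib.sha256(entry_id.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# The same made-up entry ID always yields the same random stream.
first = rng_for_entry("example-entry-0001").random()
second = rng_for_entry("example-entry-0001").random()
print(first == second)
```

A seed tied to the entry ID keeps each document reproducible regardless of which worker or shard processes it, which would explain why entry_id has to be loaded alongside the mmCIF column.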

alxmrs force-pushed the u/alxmrs/zyph-v1-contacts branch from db370ec to 16efdb7 Compare May 26, 2026 19:13

alxmrs requested a review from timodonnell May 27, 2026 22:48

alxmrs marked this pull request as ready for review May 27, 2026 22:48

alxmrs and others added 19 commits May 27, 2026 16:18

ignore idea files.

ec494da

First pass at fixing #5.

37bfa0c

Wrote tests and squashed bugs.

ea0d300

Cleaner start to fluent cmds.

0ee395e

more succinct comment.

2ee83d2

I spotted a bug! I fixed it and Claude wrote a test for it.

bfe0824

update lock after merge.

b6f6e92

More efficient/cloud-friendly parsing of data.

f4abf84

WIP documenting how to run this.

adb71d5

A uv lock file that may be a bust

ea06617

rm comment, fix filter.

75b0241

Getting closer.

bdada69

Proper deps

24a3feb

Getting ready for a real run (fixed sharding: one doc per input).

50b7b21

rm verbose comment.

b00eec5

Proper output path.

b0d19bf

Based on log analysis, here are better-tuned resources for this job.

a8c014d

alxmrs force-pushed the u/alxmrs/zyph-v1-contacts branch from 176baca to a8c014d Compare May 27, 2026 23:32

alxmrs wants to merge 19 commits into main from u/alxmrs/zyph-v1-contacts
