docs: clarify PAIBench-C reproduction seed and prompt format #211

Open
Muneerali199 wants to merge 1 commit into NVIDIA:main from Muneerali199:patch-transfer-readme

Muneerali199 commented Jun 13, 2026

Closes NVIDIA/cosmos-framework#14

Adds a clarifying note to the transfer cookbook README addressing the two remaining questions from bhack on the PAIBench-C reproducibility issue:

  1. Seed: All clips use --seed 2026 as the canonical reference seed
  2. Prompt format: Prompts follow the structured prompt.json format shown in assets/*/

Remaining reproducibility gaps (prompt conversion, SAM2 determinism, evaluation) are tracked at #219.

Signed-off-by: Muneerali199 <alimuneerali245@gmail.com>

bhack commented Jun 13, 2026

But how are the structured prompts generated for the dataset?

Also, the problem is not strictly about reproducibility in general: if you compare against the official PAIBench-C precomputed segmentation GT, the results are not reproducible. Did you recompute the source segmentation for your paper/model card?

lfengad (Collaborator) commented Jun 15, 2026

@trungtpham, could you review? Thanks!

Muneerali199 (Author) commented

Thanks for looking at this. The structured prompts follow the format in assets/*/prompt.json: load that template and fill in the scene parameters per clip. The generation code is in cookbooks/cosmos3/generator/transfer/.
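To make "load that template and fill in the scene parameters" concrete, here is a minimal sketch. The actual prompt.json schema is not shown in this thread, so the function name `fill_prompt_template` and the keys used in the usage note (`subject`, `environment`) are hypothetical placeholders, not the real schema or the code under cookbooks/cosmos3/generator/transfer/.

```python
import json
from pathlib import Path


def fill_prompt_template(template_path, scene_params):
    """Load a prompt.json template and override its top-level fields.

    Hypothetical helper: the real schema and generation code may differ.
    """
    prompt = json.loads(Path(template_path).read_text(encoding="utf-8"))

    # Refuse keys that the template does not define, so a typo in a field
    # name fails loudly instead of silently adding a new field.
    unknown = sorted(set(scene_params) - set(prompt))
    if unknown:
        raise KeyError(f"keys not present in template: {unknown}")

    prompt.update(scene_params)
    return prompt
```

For example, `fill_prompt_template("prompt.json", {"subject": "a robot arm"})` would return the template with only `subject` replaced, assuming the template defines that key.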

About the source segmentation: I haven't compared against the official PAIBench-C precomputed GT yet. I'll add a note in the cookbook saying that's still pending and link to the non-determinism tracker (#7) for now. Will follow up once I've done the validation.

bhack commented Jun 15, 2026

I think we are still quite far from being able to reproduce the model card.

The remaining blocker for PAIBench-C reproduction is the prompt artifact.

PAIBench-C public prompts are natural-language captions in metadata.csv / captions/*.json, while the Cosmos3 cookbook uses a structured prompt.json schema.

Could you clarify exactly how the PAIBench-C captions were converted into Cosmos3 structured prompt.json files for Table 16?

In particular:

  1. Were the public PAIBench-C captions used directly, or converted into structured Cosmos3 prompt.json?
  2. If converted, was the input metadata.csv caption_text, captions/{task_id}.json, the source video, or some combination?
  3. Is the conversion script / system prompt / model available?
  4. Are the per-clip structured prompt.json files used for the 600 PAIBench-C examples available?
  5. Did the reported Table 16 segmentation result use those structured prompts plus official HF sam2_vids/sam2_pkls, or were source segmentations recomputed?

Without those prompt files or the conversion recipe, the released specs/seed/control settings define the inference shape, but not an exact reproduction of the PAIBench-C table, because the prompt conditioning differs from the public PAIBench-C dataset.
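As a rough illustration of the gap in question 4, the sketch below lists which benchmark task IDs have no structured prompt file. It assumes metadata.csv has a `task_id` column and that per-clip prompts would live at `<prompts_dir>/<task_id>/prompt.json`; both the column name and the directory layout are assumptions made for illustration, not a confirmed layout of either repository.

```python
import csv
from pathlib import Path


def missing_prompt_files(metadata_csv, prompts_dir, id_column="task_id"):
    """Return task IDs from metadata.csv that lack a structured prompt.json.

    Assumed (hypothetical) layout: <prompts_dir>/<task_id>/prompt.json
    """
    prompts_dir = Path(prompts_dir)
    missing = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            task_id = row[id_column]
            if not (prompts_dir / task_id / "prompt.json").is_file():
                missing.append(task_id)
    return missing
```

An empty result over all 600 examples would be the minimum needed before the prompt conditioning can be compared against the public captions.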

Muneerali199 (Author) commented

Fair questions, though this PR is intentionally limited in scope. It just documents the seed (2026) and the prompt.json format that's already in the repo (under assets/*/), since those were the two open items from the original issue that could be clarified right away.

The deeper pipeline questions (how PAIBench-C captions were converted into structured prompts, what the per-clip files look like, whether the conversion script exists) are outside what this PR covers. Those would need a separate follow-up with the right context.

For now, this PR gets the basic recipe documented so someone can at least run with the same seed and prompt schema. The rest (conversion script, per-clip files, segmentation validation) is still TBD. Happy to open a tracking issue for that if it helps.

bhack commented Jun 16, 2026

OK. Since you know the internal process that produced Table 16, can you please open a tracking issue? Even after this PR is merged, we will still have essentially zero reproducibility for the conditioned metrics, especially if the table in the model card was computed with structured prompts like the one in the example rather than the official "flat" benchmark prompts.

Also, your seed doesn't fix the SAM2 random point sampling; it only covers the WAN/DIT-derived Cosmos generator.
See https://github.com/SHI-Labs/physical-ai-bench/blob/main/conditional_generation/models/grounded_sam_v2.py#L44C1-L83C18

So we need to know whether you wrote your own custom evaluation/metrics code, as we don't have any public evidence of this.

Because if you used the original official benchmark code, it is not deterministic, so your model card results were affected as well.
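For illustration only, here is one generic way random point sampling from a mask can be made repeatable: draw from a local NumPy generator whose seed is derived from a global seed plus the clip ID. This is not the code in grounded_sam_v2.py and not a patch for it; the function name `sample_mask_points` and its parameters are hypothetical.

```python
import hashlib

import numpy as np


def sample_mask_points(mask, num_points, seed, clip_id):
    """Pick up to num_points (row, col) pixels from a binary mask, reproducibly."""
    # Derive a per-clip seed so the result does not depend on the order in
    # which clips are processed or on any global RNG state.
    digest = hashlib.sha256(f"{seed}:{clip_id}".encode("utf-8")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))

    coords = np.argwhere(mask)  # (N, 2) array of foreground pixel coordinates
    if len(coords) == 0:
        return np.empty((0, 2), dtype=np.int64)

    k = min(num_points, len(coords))
    idx = rng.choice(len(coords), size=k, replace=False)
    return coords[idx]
```

Calling this twice with the same mask, seed, and clip ID returns the same points. Whether that alone would make the full evaluation deterministic is a separate question, since other steps (GPU kernels, for instance) can also introduce variation.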

Muneerali199 (Author) commented

Good points, especially about SAM2: you're right that the seed only covers the generator side. I've opened #219 to track the remaining gaps (prompt conversion, per-clip files, SAM2 determinism, evaluation pipeline). That way we can close this docs PR and continue the conversation there.

Development

Successfully merging this pull request may close these issues.

Need released-code recipe to reproduce Cosmos3 PAIBench-C transfer results
