docs: clarify PAIBench-C reproduction seed and prompt format #211
Muneerali199 wants to merge 1 commit into
Signed-off-by: Muneerali199 <alimuneerali245@gmail.com>
|
But how are the structured prompts generated for the dataset? Also, the problem is not only about reproducibility in the strict sense: if you compare against the official PAIBench-C precomputed dataset segmentation GT, the results are not reproducible. Have you recomputed the source segmentation for your paper/model card?
|
@trungtpham for review? THX! |
|
Thanks for looking at this. The structured prompts follow the format in [...]. About the source segmentation: I haven't compared against the official PAIBench-C precomputed GT yet. I'll add a note in the cookbook saying that's still pending and link to the non-determinism tracker (#7) for now. Will follow up once I've done the validation.
|
I think we are still quite far from reproducing the model card. The remaining blocker for PAIBench-C reproduction is the prompt artifact. The public PAIBench-C prompts are natural-language captions in metadata.csv / captions/*.json, while the Cosmos3 cookbook uses a structured prompt.json schema. Could you clarify exactly how the PAIBench-C captions were converted into Cosmos3 structured [...]? In particular: [...]
Without those prompt files or the conversion recipe, the released specs/seed/control settings define the inference shape, but not an exact reproduction of the PAIBench-C table, because the prompt conditioning differs from the public PAIBench-C dataset. |
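As a sketch of what pinning the prompt artifact could look like: hash every per-clip prompt.json, so anyone reproducing the table can check they are conditioning on byte-identical prompts. The helper name `prompt_manifest` and the one-prompt.json-per-clip-directory layout are assumptions made for illustration, not something taken from the cookbook.

```python
import hashlib
from pathlib import Path


def prompt_manifest(root: Path) -> dict[str, str]:
    """Map each prompt.json under `root` to the SHA-256 of its raw bytes.

    Hypothetical helper: assumes one prompt.json per clip directory.
    """
    manifest = {}
    # Sort so the manifest order is stable across filesystems.
    for path in sorted(root.rglob("prompt.json")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest[path.relative_to(root).as_posix()] = digest
    return manifest
```

Publishing such a manifest next to the specs would let a reader confirm whether their prompts match the ones used for the table, even before the conversion recipe itself is available.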
|
Fair questions, though this PR is intentionally limited in scope. It just documents the seed ([...]). The deeper pipeline questions (how PAIBench-C captions were converted into structured prompts, what the per-clip files look like, whether the conversion script exists) are outside what this PR covers. Those would need a separate follow-up with the right context. For now, this PR gets the basic recipe documented so someone can at least run with the same seed and prompt schema. The rest (conversion script, per-clip files, segmentation validation) is still TBD. Happy to open a tracking issue for that if it helps.
|
OK. Since you know the internal process that produced Table 16, could you please open a tracking issue? Even after this PR is merged we still have zero reproducibility for the conditioned metrics, especially if the table in the model card was computed with the structured prompt from the example rather than the official "flat" benchmark prompts. Also, your seed doesn't fix the SAM2 random point sampling, only the WAN/DIT-derived Cosmos generator. So we need to know whether you wrote your own custom evaluation/metrics code, as we have no public evidence of this. If you used the original official benchmark code, it is not deterministic, so your model card was affected as well.
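A minimal sketch of the seed issue, using Python's stdlib `random` as a stand-in for both the generator and the point sampler. None of these function names come from the Cosmos, SAM2, or PAIBench-C code. They are hypothetical and only illustrate that seeding one random stream leaves an independent stream untouched.

```python
import random


def generate(seed: int) -> float:
    """Stand-in for the seeded video generator (hypothetical)."""
    gen_rng = random.Random(seed)  # private stream, fixed by the seed
    return gen_rng.random()


def sample_points(n: int, rng: random.Random) -> list[int]:
    """Stand-in for random point-prompt sampling in the evaluator (hypothetical)."""
    return [rng.randrange(1000) for _ in range(n)]


# Same generator seed gives the same "generation" on every call.
assert generate(2026) == generate(2026)

# The evaluator draws from its own stream, which the generator seed never touches.
unseeded = random.Random()  # seeded from OS entropy
points_a = sample_points(8, unseeded)
points_b = sample_points(8, unseeded)  # very likely differs from points_a

# Determinism needs the evaluator stream to be seeded explicitly as well.
assert sample_points(8, random.Random(0)) == sample_points(8, random.Random(0))
```

So a documented generator seed is necessary but not sufficient: the metric side needs its own fixed seed, or fixed precomputed points, before the conditioned numbers can be expected to match.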
|
Good points, especially about SAM2. You're right that the seed only covers the generator side. I've opened #219 to track the remaining gaps (prompt conversion, per-clip files, SAM2 determinism, evaluation pipeline). That way we can close this docs PR and continue the conversation there.
Closes NVIDIA/cosmos-framework#14
Adds a clarifying note to the transfer cookbook README addressing the two remaining questions from bhack on the PAIBench-C reproducibility issue:
- `--seed 2026` as the canonical reference seed
- `prompt.json` format shown in `assets/*/`

Remaining reproducibility gaps (prompt conversion, SAM2 determinism, evaluation) are tracked at #219.