Context
#237 lands the infrastructure for PaliGemma-style location tokens — a coordinate ↔ <locNNNN> codec, an ensure_loc_tokens utility that handles both PaliGemma (promote-existing-IDs) and Gemma 3 (extend-vocab-and-resize), and the policy wiring for π0.5 and π0.6. It deliberately does NOT ship a concrete grounding dataset.
The reason is design: a class-per-source (PixMo-points, RefCOCO, OpenImages, …) approach scales poorly. Each new grounding source would otherwise require a new Python file even though the only differences from the existing class are (a) the HF dataset name, (b) the field names for the image / coords / label, and (c) the response format (points vs. xyxy vs. xywh).
We want the user to add a grounding source by editing config, not by writing a new dataset class.
Goal
A single generic grounding dataset class (or one for points + one for boxes — TBD) reads its source name, format, prompt template, and field mapping from DatasetConfig, applies the codec from src/opentau/datasets/grounding/loc_codec.py, and emits prompt / postfix strings that flow through the existing response_ce_loss path unchanged.
Expected user-facing config (one possible shape — final design left to the implementer):
{
  "vqa": "grounding_points",
  "vqa_kwargs": {
    "source": "allenai/pixmo-points",
    "prompt_template": "point to {label}",
    "label_field": "label",
    "points_field": "points",
    "image_field": "image_url",
    "max_points": 8
  }
}

{
  "vqa": "grounding_bbox",
  "vqa_kwargs": {
    "source": "lmms-lab/RefCOCO",
    "prompt_template": "detect {sentence}",
    "bbox_field": "bbox",
    "bbox_format": "xywh",
    "sentences_field": "sentences",
    "image_field": "image"
  }
}
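To make the config-driven idea concrete, here is a minimal sketch of how a points sample could be built from the first config above. Everything in it is illustrative: `PointsKwargs`, `build_points_sample`, and `coord_to_loc_token` are hypothetical names, not the API of loc_codec.py, and the row layout (points as dicts with `x` / `y` already normalized to [0, 1]), the 1024-bin quantization rule, and the "y then x, then label, `;` between items" response layout are assumptions that the real source and codec may not match.

```python
from dataclasses import dataclass
from typing import Any, Mapping

NUM_BINS = 1024  # PaliGemma-style vocab: <loc0000> .. <loc1023>


def coord_to_loc_token(value: float) -> str:
    """Quantize a coordinate normalized to [0, 1] into a <locNNNN> token.

    Stand-in for the real codec; its exact rounding rule may differ.
    """
    clamped = min(max(value, 0.0), 1.0)
    index = min(int(clamped * NUM_BINS), NUM_BINS - 1)
    return f"<loc{index:04d}>"


@dataclass(frozen=True)
class PointsKwargs:
    """Hypothetical typed view of the `vqa_kwargs` dict for point sources."""

    source: str
    prompt_template: str
    label_field: str
    points_field: str
    image_field: str
    max_points: int = 8


def build_points_sample(row: Mapping[str, Any], cfg: PointsKwargs) -> dict[str, str]:
    """Turn one source row into prompt / postfix strings using only config."""
    label = row[cfg.label_field]
    prompt = cfg.prompt_template.format(label=label)
    # Assumption: each point is {"x": float, "y": float}, normalized to [0, 1].
    points = row[cfg.points_field][: cfg.max_points]
    # Assumption: y before x, then the label, with " ; " between points.
    postfix = " ; ".join(
        f"{coord_to_loc_token(p['y'])}{coord_to_loc_token(p['x'])} {label}"
        for p in points
    )
    return {"prompt": prompt, "postfix": postfix}


cfg = PointsKwargs(
    source="allenai/pixmo-points",
    prompt_template="point to {label}",
    label_field="label",
    points_field="points",
    image_field="image_url",
    max_points=8,
)
row = {"label": "mug", "points": [{"x": 0.5, "y": 0.25}], "image_url": "unused-here"}
print(build_points_sample(row, cfg))
# {'prompt': 'point to mug', 'postfix': '<loc0256><loc0512> mug'}
```

Adding a second points source would then mean supplying a different `vqa_kwargs` dict, with no new class, which is the property this issue is after.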
Open questions to decide as part of this work:
- Where do the kwargs live on DatasetConfig? Options: (a) free-form vqa_kwargs: dict | None, (b) draccus subclass-config pattern (GroundingDatasetConfig as a typed union), (c) a class-level constants pattern with thin per-source subclasses (SOURCE = ..., PROMPT_TEMPLATE = ...).
- One generic class or two (points + bbox)? Field shapes and response formats differ enough that one class is awkward; two might be cleaner.
- Should the existing vqa: str / repo_id: str two-source XOR validator extend to a three-way XOR with grounding? Or is grounding just another vqa value?
- PixMo-points migration path: the broken vqa/pixmo.py was deleted in #237. The first dataset to land via the new infra is the natural replacement.
- RefCOCO box format: xywh is the COCO-native format; the codec already has both xyxy_to_loc_tokens and xywh_to_loc_tokens.
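For the box-format question, the difference between xywh and xyxy is a small conversion before quantization. The sketch below is a stand-in, not the codec's xywh_to_loc_tokens: the function names, the pixel-space inputs, and the 1024-bin rounding rule are assumptions. The (y_min, x_min, y_max, x_max) token order follows the PaliGemma convention.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(box: Box) -> Box:
    """Convert a COCO-style (x, y, width, height) box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def box_to_loc_string(box_xyxy: Box, image_w: float, image_h: float, num_bins: int = 1024) -> str:
    """Encode a pixel-space xyxy box as four <locNNNN> tokens.

    Tokens are emitted in (y_min, x_min, y_max, x_max) order. The rounding
    rule here is an assumption; the real codec may quantize differently.
    """
    x_min, y_min, x_max, y_max = box_xyxy
    normalized = (y_min / image_h, x_min / image_w, y_max / image_h, x_max / image_w)
    tokens = []
    for value in normalized:
        clamped = min(max(value, 0.0), 1.0)
        index = min(int(clamped * num_bins), num_bins - 1)
        tokens.append(f"<loc{index:04d}>")
    return "".join(tokens)


# A 128x96 box at (64, 32) in a 256x128 image.
xyxy = xywh_to_xyxy((64, 32, 128, 96))
print(box_to_loc_string(xyxy, image_w=256, image_h=128))
# <loc0256><loc0256><loc1023><loc0768>
```

With a `bbox_format` key in the config, a generic bbox class would only need to pick the conversion step; the quantization and token order stay the same for every source.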
In scope
- DatasetConfig plumbing for per-dataset format kwargs (whichever shape is picked above).
- Generic grounding dataset class(es) under src/opentau/datasets/grounding/.
- PixMo-points config replacement using the new infra.
- RefCOCO support.
- Tests covering the configurable response formatting on each new source (sample loads, response strings match <loc\d{4}> regex, no JSON characters slip through).
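The response-format checks in the last item could look roughly like the helper below. It is a sketch only: `response_problems` is a hypothetical name, the set of "JSON characters" is a guess at what the tests should reject, and the tokens-per-item rule (2 for points, 4 for boxes) is an assumption about the final response layout.

```python
import re
from typing import List

LOC_TOKEN = re.compile(r"<loc\d{4}>")
# Assumption: these are the characters that would indicate leaked raw JSON.
JSON_CHARS = set('{}[]"')


def response_problems(response: str, tokens_per_item: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the response passes."""
    problems = []
    tokens = LOC_TOKEN.findall(response)
    if not tokens:
        problems.append("no <locNNNN> tokens found")
    elif len(tokens) % tokens_per_item != 0:
        problems.append(
            f"{len(tokens)} loc tokens is not a multiple of {tokens_per_item}"
        )
    leaked = sorted(JSON_CHARS & set(response))
    if leaked:
        problems.append(f"JSON characters leaked into response: {leaked}")
    return problems


print(response_problems("<loc0256><loc0512> mug", tokens_per_item=2))  # []
print(response_problems('{"x": 50.0, "y": 25.0}', tokens_per_item=2))
```

A per-source test would load a handful of samples and assert that `response_problems` returns an empty list for each postfix string.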
Out of scope (separate follow-ups)
- Eval-time decoding of <locNNNN> strings to bounding boxes for IoU / mAP. The codec already has loc_tokens_to_xyxy / loc_tokens_to_points for this; an eval/regression test that closes the round-trip is a separate task.
- OpenImages-detect (multi-object scenes, longer responses; likely needs a response_max_length bump).
- Defensive ensure_loc_tokens calls in π0 / π0.5_mem / π0.7-paligemma. These all share the PaliGemma backbone, where the call is a no-op for vocab size, but they currently rely on the bare tokenizer, which fragments <loc0000> into seven pieces. Adding the call once we have a real grounding consumer would prevent silent failures.
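For whoever picks up the eval-time decoding follow-up, the round-trip direction could be sketched as below. This is not the codec's loc_tokens_to_xyxy: `decode_boxes` is a hypothetical stand-in, and it assumes responses of the form "four loc tokens, then a label, with `;` between objects", 1024 bins, and decoding to the lower edge of each bin.

```python
import re
from typing import List, Tuple

LOC_TOKEN = re.compile(r"<loc(\d{4})>")


def decode_boxes(
    response: str, image_w: float, image_h: float, num_bins: int = 1024
) -> List[Tuple[float, float, float, float, str]]:
    """Parse a response into (x_min, y_min, x_max, y_max, label) tuples in pixels.

    Segments that do not contain exactly four loc tokens are skipped.
    """
    boxes = []
    for segment in response.split(";"):
        bins = [int(digits) for digits in LOC_TOKEN.findall(segment)]
        if len(bins) != 4:
            continue
        label = LOC_TOKEN.sub("", segment).strip()
        # Tokens arrive in (y_min, x_min, y_max, x_max) order.
        y_min, x_min, y_max, x_max = (b / num_bins for b in bins)
        boxes.append(
            (x_min * image_w, y_min * image_h, x_max * image_w, y_max * image_h, label)
        )
    return boxes


print(decode_boxes("<loc0256><loc0256><loc0512><loc0768> cat", image_w=256, image_h=128))
# [(64.0, 32.0, 192.0, 64.0, 'cat')]
```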
References
- (y_min, x_min, y_max, x_max) order, then label, then ; separator.
- src/opentau/datasets/vqa/pixmo.py in #237 for reference.