Configurable response formatting for grounding/VQA datasets #238

## Context

[#237](https://github.com/TensorAuto/OpenTau/pull/237) lands the infrastructure for PaliGemma-style location tokens — a coordinate ↔ `<locNNNN>` codec, an `ensure_loc_tokens` utility that handles both PaliGemma (promote-existing-IDs) and Gemma 3 (extend-vocab-and-resize), and the policy wiring for π0.5 and π0.6. It deliberately does NOT ship a concrete grounding dataset.

The reason is design: a class-per-source approach (PixMo-points, RefCOCO, OpenImages, …) scales poorly. Under that approach, each new grounding source requires a new Python file, even though the only differences from an existing class are (a) the HF dataset name, (b) the field names for the image / coords / label, and (c) the response format (`points` vs. `xyxy` vs. `xywh`).

We want the user to add a grounding source by editing config, not by writing a new dataset class.

## Goal

A single generic grounding dataset class (or one for points + one for boxes — TBD) reads its source name, format, prompt template, and field mapping from `DatasetConfig`, applies the codec from [src/opentau/datasets/grounding/loc_codec.py](src/opentau/datasets/grounding/loc_codec.py), and emits `prompt` / `postfix` strings that flow through the existing `response_ce_loss` path unchanged.

Expected user-facing config (one possible shape — final design left to the implementer):

```json
{
  "vqa": "grounding_points",
  "vqa_kwargs": {
    "source": "allenai/pixmo-points",
    "prompt_template": "point to {label}",
    "label_field": "label",
    "points_field": "points",
    "image_field": "image_url",
    "max_points": 8
  }
}
```

```json
{
  "vqa": "grounding_bbox",
  "vqa_kwargs": {
    "source": "lmms-lab/RefCOCO",
    "prompt_template": "detect {sentence}",
    "bbox_field": "bbox",
    "bbox_format": "xywh",
    "sentences_field": "sentences",
    "image_field": "image"
  }
}
```
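
To make the field mapping concrete, here is a minimal sketch of how the `grounding_points` block above could be parsed into a typed object and used to render a prompt. `PointsKwargs` and `render_prompt` are hypothetical names invented for this illustration, not existing OpenTau APIs, and the defaults are assumptions; the final shape depends on the open questions below.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class PointsKwargs:
    """Hypothetical typed view of a `grounding_points` `vqa_kwargs` block."""

    source: str
    prompt_template: str
    label_field: str = "label"
    points_field: str = "points"
    image_field: str = "image"
    max_points: int = 8

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "PointsKwargs":
        # Reject typos early instead of silently ignoring unknown keys.
        unknown = set(raw) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"unknown vqa_kwargs keys: {sorted(unknown)}")
        return cls(**raw)


def render_prompt(cfg: PointsKwargs, row: dict[str, Any]) -> str:
    """Fill the prompt template from the row's configured label field."""
    return cfg.prompt_template.format(label=row[cfg.label_field])


cfg = PointsKwargs.from_dict(
    {
        "source": "allenai/pixmo-points",
        "prompt_template": "point to {label}",
        "label_field": "label",
        "points_field": "points",
        "image_field": "image_url",
        "max_points": 8,
    }
)
prompt = render_prompt(cfg, {"label": "the red mug", "points": []})
```

Validating keys up front is one way to keep a free-form `vqa_kwargs` dict from hiding misspelled field names; a draccus-typed config would give the same guarantee by construction.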

Open questions to decide as part of this work:

- **Where do the kwargs live on `DatasetConfig`?** Options: (a) free-form `vqa_kwargs: dict | None`, (b) draccus subclass-config pattern (`GroundingDatasetConfig` as a typed union), (c) a class-level constants pattern with thin per-source subclasses (`SOURCE = ...`, `PROMPT_TEMPLATE = ...`).
- **One generic class or two (points + bbox)?** Field shapes and response formats differ enough that one class is awkward; two might be cleaner.
- **Should the existing `vqa: str` / `repo_id: str` two-source XOR validator extend to a three-way XOR with `grounding`?** Or is grounding just another `vqa` value?
- **PixMo-points migration path:** the broken `vqa/pixmo.py` was deleted in [#237](https://github.com/TensorAuto/OpenTau/pull/237). The first dataset to land via the new infra is the natural replacement.
- **RefCOCO box format:** `xywh` is COCO's native box format; the codec already has both `xyxy_to_loc_tokens` and `xywh_to_loc_tokens`.
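
As an illustration of the `xywh` case, the sketch below converts a COCO-style box into a PaliGemma-style response string with the `(y_min, x_min, y_max, x_max)` token order. It is a standalone approximation, not the code in `loc_codec.py`: `to_loc_token` and `xywh_to_response` are hypothetical helpers, and the 1024-bin floor quantization and the single space before the label are assumptions.

```python
NUM_BINS = 1024  # assumption: <loc0000> .. <loc1023>


def to_loc_token(value: float, size: float) -> str:
    """Quantize a pixel coordinate into one of NUM_BINS location tokens."""
    bin_index = int(value / size * NUM_BINS)
    bin_index = min(NUM_BINS - 1, max(0, bin_index))  # clamp to the valid range
    return f"<loc{bin_index:04d}>"


def xywh_to_response(
    box: tuple[float, float, float, float],
    label: str,
    width: int,
    height: int,
) -> str:
    """Turn an (x, y, w, h) pixel box into four loc tokens followed by the label."""
    x, y, w, h = box
    x_min, y_min, x_max, y_max = x, y, x + w, y + h
    tokens = (
        to_loc_token(y_min, height),
        to_loc_token(x_min, width),
        to_loc_token(y_max, height),
        to_loc_token(x_max, width),
    )
    return "".join(tokens) + f" {label}"


response = xywh_to_response((160, 120, 320, 240), "mug", width=640, height=480)
# response == "<loc0256><loc0256><loc0768><loc0768> mug"
```

With a `bbox_format` kwarg, a generic bbox class would only need to pick between this conversion and the `xyxy` one before emitting the `postfix`.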

## In scope

1. `DatasetConfig` plumbing for per-dataset format kwargs (whichever shape is picked above).
2. Generic grounding dataset class(es) under `src/opentau/datasets/grounding/`.
3. PixMo-points config replacement using the new infra.
4. RefCOCO support.
5. Tests covering the configurable response formatting on each new source (sample loads, response strings match `<loc\d{4}>` regex, no JSON characters slip through).
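
The checks in item 5 could look roughly like the sketch below. `response_problems` is a hypothetical helper, and the set of "JSON characters" it rejects is an assumption that the real tests may widen or narrow.

```python
import re

LOC_TOKEN = re.compile(r"<loc\d{4}>")
JSON_CHARS = set('{}[]"')  # assumption: characters that signal leaked JSON


def response_problems(response: str, expected_tokens: int) -> list[str]:
    """Return a list of human-readable problems; empty means the response is fine."""
    problems = []
    well_formed = LOC_TOKEN.findall(response)
    if len(well_formed) != expected_tokens:
        problems.append(
            f"expected {expected_tokens} loc tokens, found {len(well_formed)}"
        )
    # Every "<loc" prefix should belong to a well-formed four-digit token.
    if response.count("<loc") != len(well_formed):
        problems.append("malformed loc token")
    leaked = sorted(JSON_CHARS & set(response))
    if leaked:
        problems.append(f"JSON characters in response: {leaked}")
    return problems
```

A pytest case per source could then load one sample and assert that `response_problems(sample_postfix, n)` is empty, where `n` is 2 per point or 4 per box.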

## Out of scope (separate follow-ups)

- Eval-time decoding of `<locNNNN>` strings to bounding boxes for IoU / mAP. The codec already has `loc_tokens_to_xyxy` / `loc_tokens_to_points` for this; an eval/regression test that closes the round-trip is a separate task.
- OpenImages-detect (multi-object scenes, longer responses — likely needs `response_max_length` bump).
- Defensive `ensure_loc_tokens` calls in π0 / π0.5_mem / π0.7-paligemma. These all share the PaliGemma backbone, where the call is a no-op for vocab size, but they currently use the bare tokenizer, which fragments `<loc0000>` into seven pieces. Adding the call once we have a real grounding consumer would prevent silent failures.
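
For the eval-time decoding follow-up listed first above, a rough sketch of the round-trip is shown here for orientation only. `decode_xyxy` and `iou` are hypothetical standalone helpers rather than the codec's `loc_tokens_to_xyxy`; the 1024-bin scale and decoding to the lower edge of each bin are assumptions.

```python
import re

NUM_BINS = 1024  # assumption: same bin count as the encoder
LOC_TOKEN = re.compile(r"<loc(\d{4})>")

Box = tuple[float, float, float, float]


def decode_xyxy(response: str, width: int, height: int) -> Box:
    """Decode the first four loc tokens (y_min, x_min, y_max, x_max) to pixel xyxy."""
    bins = [int(m) for m in LOC_TOKEN.findall(response)[:4]]
    if len(bins) != 4:
        raise ValueError(f"expected 4 loc tokens, found {len(bins)}")
    y_min, x_min, y_max, x_max = bins
    return (
        x_min / NUM_BINS * width,
        y_min / NUM_BINS * height,
        x_max / NUM_BINS * width,
        y_max / NUM_BINS * height,
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two xyxy boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```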

## References

- Codec: [src/opentau/datasets/grounding/loc_codec.py](src/opentau/datasets/grounding/loc_codec.py) (after [#237](https://github.com/TensorAuto/OpenTau/pull/237) merges).
- Tokenizer utility: [src/opentau/datasets/grounding/tokenizer_utils.py](src/opentau/datasets/grounding/tokenizer_utils.py).
- PaliGemma grounding format: paper §IV.C / Fig. 3-4 — bbox encoded as four loc tokens in `(y_min, x_min, y_max, x_max)` order, then label, then `; ` separator.
- Existing class-per-source pattern that this work replaces: see the deleted `src/opentau/datasets/vqa/pixmo.py` in [#237](https://github.com/TensorAuto/OpenTau/pull/237) for reference.
