Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,4 @@

# Plugins
/plugins/data-designer-template/ @NVIDIA-NeMo/data_designer_reviewers
/plugins/data-designer-visual-search/ eric.tramel@gmail.com
86 changes: 86 additions & 0 deletions docs/plugins/data-designer-visual-search/examples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Practical Examples

## Branch From an Earlier Crop

The image workspace is tree-shaped. A model can create one crop, inspect it,
then operate on the original image again:

1. `open_image()` returns `img_0000`.
2. `crop_image(image_id="img_0000", x=0, y=0, width=50, height=50, unit="percent")`
returns `img_0001`.
3. `edit_color(image_id="img_0001", contrast=1.5)` returns `img_0002`.
4. `crop_image(image_id="img_0000", x=50, y=50, width=50, height=50, unit="percent")`
returns `img_0003`.

The resulting history preserves both branches:

```text
img_0000 open_image
|-- img_0001 crop_image
| `-- img_0002 edit_color
`-- img_0003 crop_image
```

This is useful when the model needs to compare multiple areas or recover from a
crop that turned out to be unhelpful.

## Read Small Text

```python
builder.add_column(
name="label_text",
column_type="visual-search",
image_column="product_photo",
prompt=(
"Find the ingredients label. Crop tightly around it, increase contrast "
"if needed, and return the text you can read."
),
model_alias="vision",
max_tool_call_turns=5,
)
```

Expected model behavior:

- Inspect the original image.
- Crop the label region.
- Optionally increase contrast or convert to grayscale.
- Answer using the attached edited crop.

## Compare Two Regions

```python
builder.add_column(
name="comparison",
column_type="visual-search",
image_column="shelf_image",
prompt=(
"Compare the price tags on the left and right sides of the shelf. "
"Use separate crops and report which price is lower."
),
model_alias="vision",
max_tool_call_turns=6,
)
```

The model can crop the left tag from `img_0000`, crop the right tag from
`img_0000`, inspect both resulting IDs, and answer from the evidence.

## Data URI Input

The `image_column` can contain base64 data or a full data URI instead of a file
path:

```python
builder.add_column(
name="base64_answer",
column_type="visual-search",
image_column="image_data_uri",
prompt="Crop the center of the image and describe what is visible.",
model_alias="vision",
)
```

If values are raw base64 and the format cannot be detected reliably, set
`image_data_type="base64"` and `image_format="png"` or another supported image
format.
94 changes: 94 additions & 0 deletions docs/plugins/data-designer-visual-search/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# data-designer-visual-search

`data-designer-visual-search` adds a `visual-search` column type for
image-grounded visual search workflows. It is intended for cases where a VLM
needs to inspect an image, crop into regions, transform the view, adjust color,
and then continue reasoning over the resulting image.

The plugin owns the extra plumbing that ordinary model tool calling does not
handle: each local image operation returns an `image_id`, the new image is held
in memory, and the generated image is attached back into the next model turn as
multimodal context.

## What It Provides

- A `VisualSearchColumnConfig` registered as column type `visual-search`.
- A row-scoped in-memory image workspace.
- Local tools for opening images, listing image IDs, inspecting image metadata,
cropping, transforming, and editing color.
- Tree-shaped image history, so the model can branch from any previous
`image_id` instead of following a single linear edit chain.
- A default side-effect column named `{column_name}__image_history` that records
image IDs, parent IDs, child IDs, operations, dimensions, and operation
metadata.
- Optional model trace and reasoning-content side-effect columns that match the
conventions used by Data Designer LLM columns.

## Column Interface

| Field | Required | Description |
| --- | --- | --- |
| `name` | Yes | Output column name. |
| `column_type` | Yes | Must be `visual-search`. |
| `image_column` | Yes | Existing column containing a local image path, URL, base64 image, or image data URI. |
| `prompt` | Yes | Jinja2 prompt template for the visual search task. |
| `model_alias` | Yes | Alias of a vision-capable chat model in the Data Designer config. |
| `system_prompt` | No | Optional Jinja2 system prompt appended to the built-in visual search instructions. |
| `image_data_type` | No | Optional explicit image data type, such as `url` or `base64`. Leave unset for auto-detection. |
| `image_format` | Conditional | Required when `image_data_type` is explicitly `base64`. |
| `image_placeholder` | No | Optional text token to include next to every image attachment for endpoints that require one. |
| `max_tool_call_turns` | No | Maximum tool-calling turns per row. Defaults to `6`. |
| `allowed_tools` | No | Optional allowlist of built-in visual tools. Defaults to all tools. |
| `attach_images_after_tool_calls` | No | Whether to attach tool-created images into the next model turn. Defaults to `True`. |
| `include_image_history` | No | Whether to write `{name}__image_history`. Defaults to `True`. |
| `with_trace` | No | Optional trace capture mode. Defaults to `none`. |
| `extract_reasoning_content` | No | Whether to write `{name}__reasoning_content`. Defaults to `False`. |
| `use_default_system_prompt` | No | Whether to prepend built-in image-tool instructions. Defaults to `True`. |

## Built-In Tools

| Tool | Purpose |
| --- | --- |
| `open_image` | Opens the configured row image and returns the root `image_id`. |
| `get_image_info` | Returns dimensions, parent ID, children IDs, operation name, and metadata for an `image_id`. |
| `list_images` | Lists every image currently held in the row workspace. |
| `crop_image` | Crops an existing image by pixel or percent coordinates and returns a new `image_id`. |
| `transform_image` | Rotates, flips, or resizes an existing image and returns a new `image_id`. |
| `edit_color` | Adjusts brightness, contrast, saturation, sharpness, grayscale, or inversion and returns a new `image_id`. |

Tool results are ordinary tool messages containing JSON metadata. When a tool
creates an image, the plugin also attaches that image to the next user turn so
the model can inspect it visually.

## Image History

Every image node has stable metadata:

```json
{
"image_id": "img_0001",
"parent_image_id": "img_0000",
"children_image_ids": [],
"operation": "crop_image",
"width": 512,
"height": 384,
"metadata": {
"box": {"left": 0, "top": 0, "right": 512, "bottom": 384},
"unit": "pixels"
}
}
```

Because the model controls the `image_id` argument, it can crop from the root
image, transform that crop, rewind to the root, and crop a different region.
The workspace keeps the whole tree for the duration of that row.

## When To Use It

Use `visual-search` when the model needs iterative visual operations before it
can answer reliably. Good examples include reading small labels, comparing
regions, checking color after contrast adjustment, or zooming into a specific
part of a larger image.

For a single prompt over an image with no iterative image manipulation, a
standard Data Designer LLM column with multimodal context may be simpler.
125 changes: 125 additions & 0 deletions docs/plugins/data-designer-visual-search/usage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Usage

This example starts with a dataframe column containing image paths and adds a
`visual-search` column. The model can call image tools while answering the
prompt, and the plugin will pass each resulting crop or edited image back to the
model automatically.

```python
import pandas as pd

from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.config.models import ChatCompletionInferenceParams, ModelConfig, ModelProvider
from data_designer.config.seed_source_dataframe import DataFrameSeedSource
from data_designer.interface.data_designer import DataDesigner

seed_df = pd.DataFrame(
{
"image_path": ["/path/to/store-shelf.png"],
"target": ["the nutrition label on the cereal box"],
}
)

provider = ModelProvider(
name="nvidia",
endpoint="https://integrate.api.nvidia.com/v1",
api_key="NVIDIA_API_KEY",
provider_type="openai",
)

vision_model = ModelConfig(
alias="vision",
model="qwen/qwen3.5-122b-a10b",
provider="nvidia",
inference_parameters=ChatCompletionInferenceParams(
temperature=0,
max_tokens=512,
timeout=60,
),
)

builder = DataDesignerConfigBuilder(model_configs=[vision_model])
builder.with_seed_dataset(DataFrameSeedSource(df=seed_df))
builder.add_column(
name="visual_answer",
column_type="visual-search",
image_column="image_path",
prompt=(
"Find {{ target }}. Use crop_image or edit_color if that helps. "
"Return the text you can read and explain which image_id you used."
),
model_alias="vision",
max_tool_call_turns=4,
)

result = DataDesigner(
artifact_path="artifacts",
model_providers=[provider],
).preview(builder, num_records=1)
```

The generated dataset includes:

- `visual_answer`: the model's final answer.
- `visual_answer__image_history`: the image operation tree produced while
answering the row.

## Restricting Tools

Use `allowed_tools` when you want the model to perform only a narrower set of
operations:

```python
builder.add_column(
name="crop_only_answer",
column_type="visual-search",
image_column="image_path",
prompt="Crop the upper-right quadrant and describe the dominant color.",
model_alias="vision",
allowed_tools=["open_image", "get_image_info", "crop_image"],
max_tool_call_turns=2,
)
```

## Endpoint Image Tokens

Most OpenAI-compatible multimodal endpoints accept image content blocks directly.
Some model servers also require a model-specific image token in the text for
each attached image. Set `image_placeholder` for those endpoints:

```python
builder.add_column(
name="answer",
column_type="visual-search",
image_column="image_path",
prompt="Inspect the attached image and answer the question.",
model_alias="vision",
image_placeholder="<image>",
)
```

The plugin prepends the placeholder to the initial image turn and to every later
turn that attaches a tool-created image.

## Capturing Trace Output

The column supports the same trace side-effect pattern as other LLM-backed Data
Designer columns:

```python
from data_designer.config.utils.trace_type import TraceType

builder.add_column(
name="answer_with_trace",
column_type="visual-search",
image_column="image_path",
prompt="Zoom into the serial number and read it.",
model_alias="vision",
with_trace=TraceType.ALL_MESSAGES,
extract_reasoning_content=True,
)
```

This adds `answer_with_trace__trace` and
`answer_with_trace__reasoning_content` when the selected model provides
reasoning content.
11 changes: 11 additions & 0 deletions docs/plugins/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,15 @@ Browse available Data Designer plugins by what they add to your data generation
<span class="plugin-doc-card__chips"><span class="plugin-doc-chip">text-transform</span></span>
</span>
</a>
<a class="plugin-doc-card" href="data-designer-visual-search/" aria-label="Open data-designer-visual-search documentation">
<span class="plugin-doc-card__header">
<span class="plugin-doc-card__title">data-designer-visual-search</span>
<span class="plugin-doc-card__version">v0.1.0</span>
</span>
<span class="plugin-doc-card__description">Visual search column with local image crop, transform, and color-edit tools</span>
<span class="plugin-doc-card__section">
<span class="plugin-doc-card__label">Column types</span>
<span class="plugin-doc-card__chips"><span class="plugin-doc-chip">visual-search</span></span>
</span>
</a>
</div>
3 changes: 3 additions & 0 deletions plugins/data-designer-visual-search/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Owner(s) of this plugin — used to generate the root CODEOWNERS file.
# GitHub accepts @username, @org/team, or email format.
* eric.tramel@gmail.com
61 changes: 61 additions & 0 deletions plugins/data-designer-visual-search/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# data-designer-visual-search

Data Designer plugin for VLM-driven visual search over image columns, with
local image crop, transform, and color-edit tools.

The `visual-search` column runs a vision-capable chat model with built-in
image-operation tools:

- `open_image`
- `get_image_info`
- `list_images`
- `crop_image`
- `transform_image`
- `edit_color`

Each operation returns an `image_id`. The column keeps intermediate images in
memory and re-attaches tool-produced images to the following model turn, so the
model can inspect a crop or transformed image before deciding what to do next.
Because IDs remain addressable, the model can branch from an earlier image
rather than being forced through a linear edit chain.

## Installation

```bash
pip install data-designer-visual-search
```

## Usage

Once installed, the `visual-search` column type is automatically discovered by
[NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner).

```python
import pandas as pd
from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.config.seed_source_dataframe import DataFrameSeedSource
from data_designer.interface.data_designer import DataDesigner

seed_df = pd.DataFrame({"image_path": ["/path/to/scene.png"]})

builder = DataDesignerConfigBuilder()
builder.with_seed_dataset(DataFrameSeedSource(df=seed_df))
builder.add_column(
name="visual_answer",
column_type="visual-search",
image_column="image_path",
prompt="Find the red object. Crop or transform the image if that helps.",
model_alias="nvidia-vision",
# Optional: set a model-specific image token here if your endpoint requires
# one in the text for every attached image.
# image_placeholder="<image>",
)

result = DataDesigner(artifact_path="artifacts").preview(builder, num_records=1)
```

The main output column contains the model's final answer. By default the plugin
also writes `{column_name}__image_history`, a compact tree of image IDs, parent
IDs, operations, and dimensions.

See `docs/` for the full interface reference and practical examples.
Loading
Loading