Add GitHub repository seed reader plugin #4

Open
eric-tramel wants to merge 3 commits into main from codex/data-designer-github

@eric-tramel eric-tramel commented May 5, 2026

What

Adds data-designer-github, a new Data Designer seed-reader plugin that turns GitHub and local git repositories into seed datasets. The plugin registers a github seed source and emits one row per selected repository file with repository metadata, path metadata, language hints, file size, content SHA-256, and hydrated file content.

Why

Code repositories are a useful seed surface for Data Designer workflows: review, labeling, transformation, synthetic instruction generation, quality analysis, and repository-scale experimentation all start from the same need to reliably get source files into a tabular seed dataset.

The earlier ad hoc workflow cloned repositories and extracted narrow function-level examples. This plugin generalizes the useful part into reusable NDD tooling: read public GitHub repos or local git repos, preserve provenance, and let downstream Data Designer columns decide what to analyze, transform, filter, or synthesize.

Usage

Read files from a public GitHub repository:

from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.interface.data_designer import DataDesigner
from data_designer_github.config import GitHubSeedSource

builder = DataDesignerConfigBuilder()
builder.with_seed_dataset(
    GitHubSeedSource(
        repositories=["NVIDIA-NeMo/DataDesigner"],
        file_pattern="*.py",
        recursive=True,
    )
)

preview = DataDesigner().preview(builder, num_records=5)
print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])

Read local git repositories instead:

builder.with_seed_dataset(
    GitHubSeedSource(
        path="/path/to/repos",
        repository_paths=["/path/to/one/repo"],
        file_pattern="*.py",
        include_extensions=[".py", ".toml", ".md"],
    )
)

The emitted seed columns include:

  • repo_id, repo_url, commit_sha, source_kind
  • repository_path, source_path, relative_path, file_name, file_extension
  • code_lang, size_bytes, content_sha256, content
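
To illustrate how the per-file columns relate to one another, here is a minimal sketch that derives the path, size, and hash fields for a single file. The helper `describe_file` and the `LANG_HINTS` map are hypothetical names, not the plugin's API, and the sketch assumes `size_bytes` and `content_sha256` are computed over the raw bytes, which this description does not state.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical extension-to-language hints; the plugin's real mapping may differ.
LANG_HINTS = {".py": "python", ".md": "markdown", ".toml": "toml"}


def describe_file(repo_root: Path, file_path: Path, encoding: str = "utf-8") -> dict:
    """Build one illustrative seed row for a file inside a repository root."""
    raw = file_path.read_bytes()
    return {
        "repository_path": str(repo_root),
        "source_path": str(file_path),
        "relative_path": file_path.relative_to(repo_root).as_posix(),
        "file_name": file_path.name,
        "file_extension": file_path.suffix,
        "code_lang": LANG_HINTS.get(file_path.suffix),
        "size_bytes": len(raw),
        # Assumption: the hash covers the raw bytes, not the decoded text.
        "content_sha256": hashlib.sha256(raw).hexdigest(),
        "content": raw.decode(encoding),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    sample = root / "pkg" / "hello.py"
    sample.write_bytes(b"print('hi')\n")  # bytes, so no newline translation
    row = describe_file(root, sample)
    print(row["relative_path"], row["size_bytes"])
```

Repository-level columns such as `repo_id`, `repo_url`, `commit_sha`, and `source_kind` are left out here because they come from the clone or local repository rather than from the file itself.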

How

The implementation follows the existing Data Designer filesystem seed-reader architecture instead of building a separate ingestion path. GitHubSeedSource extends FileSystemSeedSource and registers through the data_designer.plugins entry point as PluginType.SEED_READER. GitHubSeedReader prepares local repository roots during filesystem context creation, then lets the base FileSystemSeedReader machinery handle manifest sampling, batching, hydration, and schema validation.
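
The split between manifest building and hydration can be pictured with a small standalone sketch. It does not use the real `FileSystemSeedReader` base class; `ManifestEntry`, `build_manifest`, and `hydrate` are hypothetical names that only show the two-phase idea, where file bodies are read for selected rows only.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ManifestEntry:
    """Cheap per-file record: path metadata only, no file body."""

    relative_path: str
    size_bytes: int


def build_manifest(root: Path, pattern: str = "*.py") -> list[ManifestEntry]:
    # Phase 1: walk the tree and stat files; no file content is read.
    return [
        ManifestEntry(p.relative_to(root).as_posix(), p.stat().st_size)
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    ]


def hydrate(root: Path, entries: Iterable[ManifestEntry]) -> Iterator[dict]:
    # Phase 2: read bodies only for the entries that were actually selected.
    for entry in entries:
        content = (root / entry.relative_path).read_text(encoding="utf-8")
        yield {"relative_path": entry.relative_path, "content": content}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.py", "b.py", "c.py"):
        (root / name).write_text(f"# {name}\n", encoding="utf-8")
    manifest = build_manifest(root)
    # A preview of two records reads two files, not three.
    preview_rows = list(hydrate(root, manifest[:2]))
```

In the plugin itself, sampling, batching, and schema validation are handled by the base reader; the sketch only covers the ordering of the two phases.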

Important behavior:

  • GitHub inputs accept owner/name, https://github.com/owner/name, or .git URLs and clone into an attachment-scoped temporary directory.
  • Local inputs accept explicit git repository paths or a parent directory whose immediate children are git repositories.
  • File inclusion is controlled by file_pattern, recursion, extension/name allowlists, exclude globs, max file size, and encoding.
  • Branches and tags use git clone --branch; commit SHA refs are checked out after clone.
  • Hydration is deferred until selected manifest rows are read, so preview and sampling do not eagerly load every file body.
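
The input normalization and ref handling described above might look roughly like the following sketch. `normalize_repository` and `plan_git_commands` are hypothetical helpers, not functions from the plugin, and treating any 7 to 40 character hex string as a commit SHA is an assumption made for illustration (a branch with a hex-looking name would be misclassified). The sketch only builds the git argument lists and does not run them.

```python
import re
from typing import Optional

# Assumption: a ref made of 7-40 hex characters is treated as a commit SHA.
_SHA_RE = re.compile(r"^[0-9a-fA-F]{7,40}$")
_GITHUB_RE = re.compile(
    r"^(?:https://github\.com/)?([\w.-]+)/([\w.-]+?)(?:\.git)?/?$"
)


def normalize_repository(spec: str) -> tuple[str, str]:
    """Return (repo_id, clone_url) for owner/name, an https URL, or a .git URL."""
    match = _GITHUB_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"Unrecognized GitHub repository: {spec!r}")
    owner, name = match.groups()
    return f"{owner}/{name}", f"https://github.com/{owner}/{name}.git"


def plan_git_commands(
    clone_url: str, dest: str, ref: Optional[str] = None
) -> list[list[str]]:
    """Build the git invocations for a default, branch/tag, or commit SHA ref."""
    if ref is None:
        return [["git", "clone", clone_url, dest]]
    if _SHA_RE.match(ref):
        # Commit SHAs cannot be passed to --branch, so check out after cloning.
        return [
            ["git", "clone", clone_url, dest],
            ["git", "-C", dest, "checkout", ref],
        ]
    # Branches and tags can be selected at clone time.
    return [["git", "clone", "--branch", ref, clone_url, dest]]


repo_id, url = normalize_repository("https://github.com/pallets/markupsafe.git")
for command in plan_git_commands(url, "/tmp/markupsafe", ref="main"):
    print(" ".join(command))
```

Each returned list could be passed to `subprocess.run(command, check=True)`; whether the plugin also uses shallow clones or other flags is not stated in this description.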

Validation

  • make all
  • make build-plugin PLUGIN=data-designer-github
  • Live smoke test with Data Designer preview against pallets/markupsafe: cloned from GitHub, found 12 Python seed rows, and previewed 2 hydrated rows (bench.py, docs/conf.py).

@eric-tramel eric-tramel changed the title from "[codex] add github seed reader plugin" to "Add GitHub repository seed reader plugin" May 5, 2026
@eric-tramel eric-tramel marked this pull request as ready for review May 5, 2026 16:22
@eric-tramel eric-tramel requested a review from a team as a code owner May 5, 2026 16:22
@eric-tramel eric-tramel self-assigned this May 5, 2026