diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index eb06565..394d44b 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -7,4 +7,5 @@
 /.github/ @NVIDIA-NeMo/data_designer_reviewers
 
 # Plugins
+/plugins/data-designer-github/ @eric-tramel
 /plugins/data-designer-template/ @NVIDIA-NeMo/data_designer_reviewers
diff --git a/docs/plugins/data-designer-github/index.md b/docs/plugins/data-designer-github/index.md
new file mode 100644
index 0000000..1243d43
--- /dev/null
+++ b/docs/plugins/data-designer-github/index.md
@@ -0,0 +1,79 @@
+# data-designer-github
+
+`data-designer-github` is a Data Designer seed reader for repository files. It
+turns GitHub repositories or local git repositories into seed rows that carry
+file content, path metadata, repository provenance, and commit identifiers.
+
+Use it when a workflow needs code repository data as the starting point for
+generation, review, transformation, or indexing tasks. The reader is intentionally
+file-oriented: each matching text file becomes one seed row, and downstream Data
+Designer columns decide how to summarize, critique, rewrite, label, or enrich
+that row.
+
+## Installation
+
+```bash
+uv add data-designer data-designer-github
+```
+
+The plugin is discovered through the `data_designer.plugins` entry point once it
+is installed in the same environment as Data Designer.
+
+## Seed source
+
+Use the `github` seed source when the seed dataset should come from one or more
+repositories.
+
+| Field | Required | Description |
+| --- | --- | --- |
+| `path` | No | A local git repository path, or a directory whose immediate children are git repositories. |
+| `repositories` | No | GitHub repositories to clone. Entries may be `owner/name`, `https://github.com/owner/name`, or `https://github.com/owner/name.git`. |
+| `repository_paths` | No | Additional explicit local git repository paths to read. |
+| `ref` | No | Branch, tag, or commit to check out for cloned GitHub repositories. |
+| `clone_depth` | No | Shallow clone depth for GitHub repositories. Defaults to `1`; set to `None` for a full clone. |
+| `clone_timeout_seconds` | No | Timeout for each clone or checkout operation. Defaults to `300`. |
+| `file_pattern` | No | Inherited file glob from Data Designer's filesystem seed source. For example, `*.py`. |
+| `recursive` | No | Whether `file_pattern` is applied recursively. |
+| `include_extensions` | No | File extensions to include after the glob match. Defaults to common code and documentation extensions. Set to `None` to allow every extension. |
+| `include_file_names` | No | Extensionless file names to include, such as `Dockerfile` and `Makefile`. |
+| `exclude_patterns` | No | Relative path glob patterns to skip, including `.git`, cache, build, virtualenv, and dependency directories by default. |
+| `max_file_size_bytes` | No | Maximum file size to hydrate into `content`. Defaults to `1_000_000`. |
+| `encoding` | No | Text encoding used when reading file contents. Defaults to `utf-8`. |
+
+At least one of `path`, `repositories`, or `repository_paths` is required.
+
+## Output columns
+
+| Column | Description |
+| --- | --- |
+| `repo_id` | Repository identifier. GitHub repositories use `owner/name`; local repositories use their GitHub remote when available, otherwise the directory name. |
+| `repo_url` | Remote origin URL when available. |
+| `commit_sha` | Checked-out commit SHA for the repository. |
+| `source_kind` | `github` for cloned repositories, or `git_repository` for local repositories. |
+| `repository_path` | Local path used by the reader. GitHub repositories are cloned into a temporary runtime directory. |
+| `source_path` | Absolute path to the file that produced the seed row. |
+| `relative_path` | File path relative to the repository root. |
+| `file_name` | Basename of the file. |
+| `file_extension` | Lowercase file extension. |
+| `code_lang` | Language hint inferred from the file name or extension. |
+| `size_bytes` | File size at manifest time. |
+| `content_sha256` | SHA-256 hash of the hydrated file bytes. |
+| `content` | Decoded text content. |
+
+## Behavior
+
+When the reader is attached, it resolves local repository roots, clones any
+configured GitHub repositories, records the checked-out commit, and builds a
+manifest of matching files. File content is read during row hydration, so Data
+Designer can batch and sample repository content using the same seed reader
+interfaces as other filesystem-backed datasets.
+
+The plugin reads repository files only. It does not parse code into functions,
+classes, symbols, dependency graphs, or AST nodes. If a workflow needs those
+structures, use this reader to collect stable file-level inputs and add
+downstream columns that perform the language-specific analysis.
+
+The plugin shells out to `git` for repository operations and does not manage
+GitHub API tokens. Public repositories work directly. Private repositories
+require the execution environment's git credential configuration to already have
+access.
diff --git a/docs/plugins/data-designer-github/usage.md b/docs/plugins/data-designer-github/usage.md
new file mode 100644
index 0000000..7f96e8b
--- /dev/null
+++ b/docs/plugins/data-designer-github/usage.md
@@ -0,0 +1,165 @@
+# Usage
+
+This tutorial walks through the common patterns for turning repositories into
+Data Designer seed rows. The examples use the Python builder API, but the same
+configuration fields apply when a workflow is built from serialized config.
+
+## Read a GitHub repository
+
+Start with a small repository and a narrow file pattern. This keeps previews
+fast and makes it clear which rows are entering the workflow.
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+from data_designer.interface.data_designer import DataDesigner
+from data_designer_github.config import GitHubSeedSource
+
+builder = DataDesignerConfigBuilder()
+builder.with_seed_dataset(
+    GitHubSeedSource(
+        repositories=["pallets/markupsafe"],
+        file_pattern="*.py",
+        recursive=True,
+    )
+)
+
+builder.add_column(
+    name="_row_id",
+    column_type="sampler",
+    sampler_type="uuid",
+    params={},
+)
+
+preview = DataDesigner().preview(builder, num_records=5)
+print(preview.dataset[["repo_id", "relative_path", "code_lang", "content"]])
+```
+
+The seed rows contain repository provenance and file text. Downstream columns can
+then ask questions such as "summarize this file", "identify risky APIs", "write
+a short module description", or "extract candidate test scenarios" using the
+`content`, `relative_path`, `code_lang`, and `commit_sha` columns.
+
+## Pin a branch, tag, or commit
+
+Use `ref` when the dataset must be reproducible against a specific branch, tag,
+or commit. Branches and tags are passed to `git clone --branch`; commit SHAs are
+checked out after cloning.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    ref="v0.5.7",
+    clone_depth=1,
+    file_pattern="*.py",
+    recursive=True,
+)
+```
+
+For arbitrary commit SHAs, set `clone_depth=None` if the commit may not be
+reachable from the shallow default clone.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    ref="0123456789abcdef0123456789abcdef01234567",
+    clone_depth=None,
+    file_pattern="*.py",
+    recursive=True,
+)
+```
+
+## Read local repositories
+
+Local repositories are useful for private code, local experiments, or a checked
+out monorepo that already exists on disk.
+
+```python
+source = GitHubSeedSource(
+    repository_paths=[
+        "/workspace/services/api",
+        "/workspace/libraries/shared",
+    ],
+    file_pattern="*.py",
+    recursive=True,
+)
+```
+
+If `path` points at a git repository, that repository is read. If `path` points
+at a directory whose immediate children are git repositories, each child
+repository is discovered and read.
+
+```python
+source = GitHubSeedSource(
+    path="/workspace/repos",
+    file_pattern="*.ts",
+    recursive=True,
+)
+```
+
+## Control which files become rows
+
+The reader first applies `file_pattern` and `recursive`, then filters by
+extension, file name, exclude pattern, and file size.
+
+```python
+source = GitHubSeedSource(
+    repositories=["NVIDIA-NeMo/DataDesigner"],
+    file_pattern="*",
+    recursive=True,
+    include_extensions=["py", "toml", "md"],
+    include_file_names=["Dockerfile", "Makefile"],
+    exclude_patterns=[
+        ".git/**",
+        "**/__pycache__/**",
+        "**/build/**",
+        "**/dist/**",
+        "docs/generated/**",
+    ],
+    max_file_size_bytes=250_000,
+)
+```
+
+Use `include_extensions=None` for broad repository inventory tasks where the
+glob and exclude patterns should decide the candidate set.
+
+```python
+source = GitHubSeedSource(
+    repositories=["owner/repo"],
+    file_pattern="LICENSE*",
+    recursive=False,
+    include_extensions=None,
+)
+```
+
+## Typical workflows
+
+`data-designer-github` works best as the seed layer for file-level code
+workflows:
+
+- Repository QA: score files for risky dependencies, missing license headers, or
+  stale implementation notes.
+- Documentation generation: turn source files into module summaries, migration
+  notes, or API reference drafts.
+- Test ideation: derive test scenarios from implementation files and route them
+  to a code-generation column.
+- Code search preparation: create embeddings or labels from stable file content
+  and repository metadata.
+- Dataset construction: sample representative code files from several projects
+  while preserving `repo_id`, `relative_path`, and `commit_sha` provenance.
+
+Because the reader emits full file content, prompts should account for file
+length and language. A common pattern is to filter or sample seed rows first,
+then generate focused columns that reference only the metadata and content each
+task needs.
+
+## Operational notes
+
+The plugin requires `git` on `PATH`. GitHub repositories are cloned into a
+temporary runtime directory for the reader attachment and local repositories are
+read in place. Files that exceed `max_file_size_bytes` are skipped before
+hydration. Files that cannot be decoded with `encoding` are skipped with a
+warning rather than producing partial text.
+
+The reader does not call the GitHub API, manage credentials, or expand GitHub
+issues and pull requests. It is scoped to repository file content so workflows
+can compose repository-aware seed data with the rest of Data Designer.
diff --git a/docs/plugins/index.md b/docs/plugins/index.md
index 4e54e2e..d488636 100644
--- a/docs/plugins/index.md
+++ b/docs/plugins/index.md
@@ -5,6 +5,17 @@ Browse available Data Designer plugins by what they add to your data generation workflow.