Thank you for your interest in contributing to Code Graph RAG! We welcome contributions from the community.
- Browse Issues: Check out our GitHub Issues to find tasks that need work
- Pick an Issue: Choose an issue that interests you and matches your skill level
- Comment on the Issue: Let us know you're working on it to avoid duplicate effort
- Fork the Repository: Create your own fork to work on
- Create a Branch: Use a descriptive branch name like `feat/add-feature` or `fix/bug-description`
- Set up Development Environment:

  ```bash
  git clone https://github.com/YOUR-USERNAME/code-graph-rag.git
  cd code-graph-rag
  uv sync --extra treesitter-full --extra test --extra dev
  ```
- Install Pre-commit Hooks (mandatory):

  ```bash
  pre-commit install
  ```

  All commits must pass pre-commit checks. Do not skip hooks with `--no-verify`.

- Make Your Changes:
- Follow the existing code style and patterns
- Add tests for new functionality
- Update documentation if needed
- Do not add inline comments (see Comment Policy below)
- Test Your Changes:
- Run the existing tests to ensure nothing is broken
- Test your new functionality thoroughly
- Run `uv run ruff check` and `uv run ruff format --check` before committing
- Submit a Pull Request:
- Push your branch to your fork
- Create a pull request against the main repository
- Reference the issue number in your PR description
- Provide a clear description of what you've changed and why
- Keep PRs focused on a single issue or feature
- Write clear, descriptive commit messages
- Include tests for new functionality
- Update documentation when necessary
- Be responsive to feedback during code review
This project uses automated code review bots (Greptile and Gemini Code Assist) to provide initial feedback on PRs. Before requesting a human review:
- Address all bot comments: Every comment from Greptile and Gemini Code Assist must be resolved
- Accept or push back: For each bot suggestion, either:
- Accept: Implement the suggestion and resolve the comment
- Push back: Reply inline with a clear justification for why the suggestion doesn't apply
- Iterate as needed: Continue addressing new bot comments through multiple review rounds until all are resolved
- Then request human review: Only after all bot comments are cleared, assign the PR to core maintainers for human review
This process ensures that human reviewers focus on high-level design and logic rather than style and common issues that bots can catch.
- PydanticAI Only: This project uses PydanticAI as the official agentic framework. Do not introduce other frameworks like LangChain, CrewAI, or AutoGen.
- Heavy Pydantic Usage: Use Pydantic models extensively for data validation, serialization, and configuration
- Package Management: Use `uv` for all dependency management and virtual environments
- Code Quality: Use `ruff` for linting and formatting - run `ruff check` and `ruff format` before submitting
- Type Safety: Use type hints everywhere and run `uv run ty check` for type checking
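To illustrate the heavy-Pydantic guideline above, here is a minimal sketch of a validated config model (the model and field names are hypothetical, not from this codebase):

```python
from pydantic import BaseModel, Field


class IngestRequest(BaseModel):
    repo_path: str
    max_files: int = Field(default=100, gt=0)  # validated: must be positive


req = IngestRequest(repo_path="/tmp/repo")
```

Constructing the model validates and coerces fields; `req.model_dump_json()` gives serialization for free.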
- uv: Package manager and dependency resolver
- ruff: Code linting and formatting (replaces flake8, black, isort)
- ty: Static type checking (from Astral)
- pytest: Testing framework
- ripgrep (`rg`): Required for shell command text searching (install via `brew install ripgrep` on macOS or `apt install ripgrep` on Linux)
This project uses pre-commit to automatically run checks before each commit, ensuring code quality and consistency.
To get started, first make sure you have the development dependencies installed:
```bash
uv sync --extra treesitter-full --extra test --extra dev
```

Then, install the git hooks:

```bash
pre-commit install
pre-commit autoupdate --repo https://github.com/pre-commit/pre-commit-hooks
```

Now, pre-commit will run automatically on `git commit`.
All tooling is from Astral:
| Tool | Purpose | Command |
|---|---|---|
| uv | Package management | `uv sync`, `uv add`, `uv run` |
| ty | Type checking | `uv run ty check` |
| ruff | Linting and formatting | `uv run ruff check`, `uv run ruff format` |
```bash
# Sync dependencies
uv sync --extra dev --extra test

# Upgrade a package
uv sync --upgrade-package <pkg>

# Type check
uv run ty check codebase_rag/

# Lint and format
uv run ruff check --fix .
uv run ruff format .
```

| Structure | Use Case |
|---|---|
| StrEnum | Constrained string constants used in comparisons, defaults, assignments |
| NamedTuple | Immutable records with named fields (lightweight, hashable) |
| TypedDict | Dict shapes for function return types or JSON-like data |
| dataclass | Mutable class instances with behavior/methods |
| Pydantic BaseModel | Configs needing validation, serialization, or schema generation |
```python
from dataclasses import dataclass
from enum import StrEnum
from typing import NamedTuple, TypedDict

# StrEnum - string constants
class Status(StrEnum):
    PENDING = "pending"
    DONE = "done"

# NamedTuple - immutable record
class Point(NamedTuple):
    x: float
    y: float

# TypedDict - dict shape
class Result(TypedDict):
    success: bool
    data: str

# dataclass - mutable with behavior
@dataclass
class User:
    name: str

    def greet(self) -> str:
        return f"Hello, {self.name}"
```

- Use `Literal` types for constrained string values used only as type hints
- Use `StrEnum` when values need defaults or are used in code (not just type hints)
- Never use loose dict types like `dict[str, Any]` or `dict[str, str | int | None]` - use TypedDict instead
- Use explicit TypedDict constructors instead of plain dict literals
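The `Literal` guideline above can be sketched as follows (the function and values are hypothetical illustrations):

```python
from typing import Literal

# Literal - constrained values used only as a type hint, never needed at runtime
Align = Literal["left", "center", "right"]


def render(text: str, align: Align = "left") -> str:
    return f'<div style="text-align: {align}">{text}</div>'
```

If the values later need a runtime default object or comparisons against enum members, promote the `Literal` to a `StrEnum`.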
Forward references are type hints wrapped in quotes like `"ASTNode"`. These are NOT allowed.
How to identify forward references:
- Type hints with quotes: `def foo(x: "SomeClass") -> "Result"`
- These appear when a type is used before it's defined or to avoid circular imports
How to fix forward references:
- Add `from __future__ import annotations` at the top of the file
- Remove the quotes from the type hints
IMPORTANT: Only add `from __future__ import annotations` to files that HAVE forward references. Do NOT add it to files that don't need it.
```python
# Bad - forward reference with quotes (THIS IS NOT ALLOWED)
def process(node: "ASTNode") -> "Result": ...

# Good - add future import and remove quotes
from __future__ import annotations

def process(node: ASTNode) -> Result: ...
```

```python
# Bad - loose dict type
def process(args: dict[str, str | int | None]) -> dict[str, Any]: ...

# Good - TypedDict with known shape
class ProcessArgs(TypedDict):
    name: str
    count: int

def process(args: ProcessArgs) -> Result: ...
```

```python
# Bad - dict literal
return {"success": True, "data": data}

# Good - TypedDict constructor
return Result(success=True, data=data)
```

In Protocols and mixin classes, use regular method definitions instead of Callable attributes. Callables are not bound (they don't receive `self` implicitly) and descriptors are not invoked.
```python
from abc import abstractmethod
from collections.abc import Callable
from typing import Protocol

# Bad - Callable attribute (not bound, not recommended)
class MyMixin:
    process: Callable[[str], int]

class MyProtocol(Protocol):
    handler: Callable[[str, int], bool]

# Good - regular method definition
# Mixin classes: use @abstractmethod for method stubs
class MyMixin:
    @abstractmethod
    def process(self, data: str) -> int: ...

# Protocols: no decorator needed (structural typing)
class MyProtocol(Protocol):
    def handler(self, name: str, count: int) -> bool: ...
```

Only use Callable attributes when reusing complex callable types is necessary.
Standard files in each module:
- `types_defs.py` - Type aliases, TypedDicts, NamedTuples (immutable structural types)
- `models.py` - Dataclasses only (runtime data structures with behavior)
- `constants.py` - StrEnums, string literals, and application constants
- `config.py` - Pydantic settings, environment config, and runtime configuration instances
- `schemas.py` - All Pydantic BaseModel classes (data transfer objects, results, responses)
- `logs.py` - Log message templates for logger calls (info, debug, warning, error, success)
- `tool_errors.py` - Error messages returned by tools to the LLM/user
- `exceptions.py` - Exception classes and their error message templates (for raise statements)
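As a sketch of how content splits across these files (the module contents and constant names below are hypothetical illustrations, not actual project code):

```python
# logs.py - log message templates for logger calls
PROCESSING_FILE = "Processing file: {path}"

# exceptions.py - exception classes and raise-time message templates
PARSE_FAILED = "Failed to parse '{path}': {reason}"


class ParseError(Exception):
    pass
```

Call sites then stay string-free: `logger.info(logs.PROCESSING_FILE.format(path=path))` and `raise ParseError(PARSE_FAILED.format(path=path, reason=reason))`.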
- Soft rule: keep files under 700 lines (after linting); split larger files into submodules
- Group related functionality into submodules (e.g., `stem_ops/`, `tools/`, `srg_parser/`)
- Use descriptive file names that reflect purpose (e.g., `editor.py`, `factory.py`, `loader.py`)
- Each submodule can have its own `__init__.py` to expose public API
- Import from the module's public API, not internal files:
```python
# Bad
from policy_digitization_tasks.srg_parser.editor import apply_edits

# Good
from policy_digitization_tasks.srg_parser import apply_edits
```

- Group imports: stdlib, third-party, local (separated by blank lines)
- Use explicit imports, avoid `from module import *`
When importing 5+ items from a module, use module-level import with a 2-letter alias:
```python
# Bad - many lines of imports
from .constants import (
    CLI_ERR_CONFIG,
    CLI_MSG_DONE,
    Color,
    Provider,
    # ... 20 more items
)

# Good - 1 line with 2-letter alias
from . import constants as cs
from . import exceptions as ex
from . import tool_errors as te
from . import logs as ls

# Usage
logger.info(ls.PROCESSING_FILE.format(path=path))
raise ex.LLMGenerationError(ex.CONFIG.format(error=e))
```

Define constants, patterns, and types once. Import everywhere.
Use StrEnum when string values are used in code (defaults, comparisons, assignments):
```python
from enum import StrEnum

# Bad - hardcoded strings scattered in code
def process(mode: str = "fast"): ...
if status == "pending": ...

# Good - centralized StrEnum
class Mode(StrEnum):
    FAST = "fast"
    SLOW = "slow"

def process(mode: Mode = Mode.FAST): ...
if status == Status.PENDING: ...
```

Use an Enum with `__call__` for parameterized error messages:
```python
from enum import Enum

class Error(str, Enum):
    NOT_FOUND = "Item '{id}' not found"
    INVALID = "Invalid value"

    def __call__(self, **kwargs) -> str:
        return self.value.format(**kwargs) if kwargs else self.value

# Usage
raise ValueError(Error.NOT_FOUND(id="abc"))
```

Use StrEnum for error type names passed to exception classes:

```python
class ErrorCode(StrEnum):
    VALIDATION = "ValidationError"
    NOT_FOUND = "NotFoundError"

raise CustomError(ErrorCode.VALIDATION, Error.INVALID())
```

Use loguru for all output instead of print:
```python
# Bad
print(f"Processing: {file}")
print(f"Error: {e}", file=sys.stderr)

# Good
from loguru import logger

logger.info(f"Processing: {file}")
logger.error(f"Error: {e}")
logger.success("Done!")
```

Use typer for CLI argument parsing instead of argparse:
```python
# Bad
parser = argparse.ArgumentParser()
parser.add_argument("name", type=str)
parser.add_argument("--count", type=int, default=1)
args = parser.parse_args()

# Good
from typing import Annotated

import typer

def main(
    name: Annotated[str, typer.Argument(help="Name")],
    count: Annotated[int, typer.Option(help="Count")] = 1,
) -> None:
    ...

typer.run(main)
```

Use click with `@click.group()` for nested subcommand groups that integrate with a typer main app:
- Typer's `add_typer()` requires more boilerplate for this pattern
- Bridge typer → click via `ctx.args` and `standalone_mode=False`
- Use `click.echo()`/`click.secho()` for user-facing CLI output (not logging)
- Add `loguru` for actual error logging in exception handlers
```python
# subcommands.py - click subcommand group
@click.group(help="Manage resources")
def cli() -> None:
    pass

@cli.command(help="Add a new resource.")
@click.argument("name")
def add(name: str) -> None:
    try:
        do_add(name)
        click.echo(f"Added {name}")
    except Exception as e:
        logger.error(f"Failed to add: {e}")  # loguru for logging
        click.secho(f"Error: {e}", fg="red")  # click for user output

# main.py - typer main app bridges to click
from .subcommands import cli as subcommand_cli

@app.command(
    name="resource",
    context_settings={"allow_extra_args": True, "allow_interspersed_args": False},
)
def resource_command(ctx: typer.Context) -> None:
    subcommand_cli(ctx.args, standalone_mode=False)
```

Use dataclasses with methods instead of lambdas stored in dicts:
```python
# Bad
HANDLERS = {
    "create": lambda x: {"action": "create", "id": x.id},
}

# Good
@dataclass
class Handler:
    action: str
    template: str

    def build(self, x) -> ActionDict:
        return ActionDict(action=self.action, id=x.id)

HANDLERS = {"create": Handler(action="create", template="...")}
```

Prefer match statements over if/elif chains on the same value:

```python
# Bad
if action == "create":
    return handle_create(data)
elif action == "update":
    return handle_update(data)
else:
    return handle_default(action, data)

# Good
match action:
    case "create":
        return handle_create(data)
    case "update":
        return handle_update(data)
    case other:
        return handle_default(other, data)
```

When the if body does nothing (`pass`) and all logic is in the else clause, invert the condition and remove the empty else:
```python
# Bad - empty if body with logic in else
if location == OUTSIDE:
    pass
else:
    take_off_hat()

# Good - inverted condition, no empty else
if location != OUTSIDE:
    take_off_hat()
```

This also applies when the if body is a guard condition that allows dropping the else entirely.
Use named expressions (`:=`) to merge assignment followed by a conditional check:

```python
# Bad - separate assignment and condition
env_base = os.environ.get("PYTHONUSERBASE", None)
if env_base:
    return env_base

chunk = file.read(8192)
while chunk:
    process(chunk)
    chunk = file.read(8192)

# Good - named expression
if env_base := os.environ.get("PYTHONUSERBASE", None):
    return env_base

while chunk := file.read(8192):
    process(chunk)
```

Named expressions can also simplify nested conditions:
```python
# Bad - nested if statements
if self._is_special:
    ans = self._check_nans(context=context)
    if ans:
        return ans

# Good - merged with named expression
if self._is_special and (ans := self._check_nans(context=context)):
    return ans
```

If a helper function is trivial and used once, inline it.
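A minimal sketch of inlining a trivial single-use helper (names hypothetical):

```python
# Bad - trivial helper used exactly once
def _double(x: int) -> int:
    return x * 2


def scale_bad(values: list[int]) -> list[int]:
    return [_double(v) for v in values]


# Good - the helper is inlined at its only call site
def scale(values: list[int]) -> list[int]:
    return [v * 2 for v in values]
```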
Declare variables as close to their usage as possible to minimize cognitive load and prevent stranded variables:
```python
# Bad - assignment far from usage
cubes = []
function_unrelated_to_cubes()
if another_unrelated_condition():
    more_unrelated_logic()
for i in range(20):
    cubes.append(i**3)

# Good - assignment immediately before usage
function_unrelated_to_cubes()
if another_unrelated_condition():
    more_unrelated_logic()

cubes = []
for i in range(20):
    cubes.append(i**3)
```

Extract repeated code blocks into helpers:
```python
# Bad - same 4 lines repeated 3 times (12 lines)
def save_user(user):
    conn = db.connect()
    conn.execute(SQL_USER, user.dict())
    conn.commit()
    conn.close()

def save_order(order):
    conn = db.connect()
    conn.execute(SQL_ORDER, order.dict())
    conn.commit()
    conn.close()

def save_item(item):
    conn = db.connect()
    conn.execute(SQL_ITEM, item.dict())
    conn.commit()
    conn.close()

# Good - helper + 3 one-liners (7 lines)
def _save(sql: str, data: dict) -> None:
    conn = db.connect()
    conn.execute(sql, data)
    conn.commit()
    conn.close()

def save_user(user):
    _save(SQL_USER, user.dict())

def save_order(order):
    _save(SQL_ORDER, order.dict())

def save_item(item):
    _save(SQL_ITEM, item.dict())
```

Code should be self-documenting. Exception: comments prefixed with (H) are allowed.
Never use `# type: ignore` comments, `cast()`, the `Any` type, or `object` as a type hint. These provide no useful type information. Fix the underlying type issue using proper typing, type narrowing, specific union types (e.g., `str | int | bool | None`), or TypedDict for dict values.
All repeated string literals should be constants or StrEnum members:
```python
# Bad
if node.type == "predicate_definition": ...
artifact_type = "srg_v1"

# Good
if node.type == ElementType.PREDICATE: ...
artifact_type = ARTIFACT_SRG
```

Files that are NOT config.py, models.py, constants.py, logs.py, or CLI modules should have almost no string literals. Move all strings to:
- `logs.py` - all log messages (info, debug, warning, error, success)
- `constants.py` - non-log constants, StrEnums, format strings
- `tool_descriptions.py` - tool/function descriptions (for tools modules)
- `config.py` - configuration defaults
```python
# Bad - strings in service/tool files
logger.info(f"Processing file: {path}")
description="Reads file content from disk."

# Good - import from dedicated modules
from .. import logs
from . import tool_descriptions as td

logger.info(logs.PROCESSING_FILE.format(path=path))
description=td.FILE_READER
```

Use StrEnum types in function signatures, not str:
```python
# Bad
def extract(guideline_type: str, outcome: str = "Approve"): ...

# Good
def extract(guideline_type: GuidelineType, outcome: OutcomeType = OutcomeType.APPROVE): ...
```

Use Pydantic `model_validator` for cross-field validation:
```python
from pydantic import BaseModel, model_validator

class Config(BaseModel):
    options: list[str] | None = None
    default: str = "fallback"

    @model_validator(mode="after")
    def validate_default_in_options(self):
        if self.options and self.default not in self.options:
            raise ValueError(f"default '{self.default}' must be in options")
        return self
```

- Conventional Commits format
- One-liner only
- No emoji
- No attribution or Co-Authored-By
Uses Conventional Commits format with this regex pattern:
```
^(build|chore|ci|docs|feat|fix|perf|p?refactor|revert|style|test)(\([a-zA-Z0-9_-]+\))?!?: .+$
```
Allowed prefixes:
| Prefix | Purpose |
|---|---|
| `build` | Build system or external dependencies |
| `chore` | Routine tasks, maintenance |
| `ci` | CI configuration changes |
| `docs` | Documentation only |
| `feat` | New feature |
| `fix` | Bug fix |
| `perf` | Performance improvement |
| `refactor` or `prefactor` | Code refactoring |
| `revert` | Reverting changes |
| `style` | Formatting, whitespace, etc. |
| `test` | Adding or modifying tests |
Format:
```
<type>[(<scope>)][!]: <description>
```
Examples:
- `feat: add user authentication`
- `fix(api): resolve null pointer exception`
- `chore(deps): update dependencies`
- `feat!: breaking change to API` (the `!` indicates a breaking change)
- `refactor(core): simplify validation logic`
The scope (in parentheses) is optional and can contain alphanumeric characters, underscores, and hyphens.
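The commit-message pattern above can be sanity-checked with a small script (the helper name is hypothetical; the regex is copied verbatim from this guide):

```python
import re

# Conventional Commits pattern from this guide
COMMIT_RE = re.compile(
    r"^(build|chore|ci|docs|feat|fix|perf|p?refactor|revert|style|test)"
    r"(\([a-zA-Z0-9_-]+\))?!?: .+$"
)


def is_valid_commit(message: str) -> bool:
    return COMMIT_RE.match(message) is not None
```

Messages like `feat: add user authentication` pass, while a bare `update stuff` or a missing space after the colon fails.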
No inline comments are allowed unless they meet one of these criteria:
- Top-of-file comments: Comments that appear before any code (including imports) are allowed
- `(H)` marker: Comments containing `(H)` are allowed - this stands for "Human" and indicates an intentional, human-written comment
- Type annotations: Comments containing `type:`, `noqa`, `pyright`, or `ty:` are allowed
Why this rule exists: AI tools (like code assistants and LLMs) tend to generate redundant, obvious comments that clutter the codebase. Comments like `# Loop through items` or `# Return the result` add no value. This policy prevents AI-generated comment slop from polluting the code.
If you need to add a comment, prefix it with (H):
```python
# (H) This algorithm uses memoization because the recursive solution times out on large inputs
```

The pre-commit hook `no-inline-comments` enforces this rule automatically.
If you have questions about contributing, feel free to:
- Open a discussion on GitHub
- Comment on the relevant issue
- Reach out to the maintainers
We appreciate your contributions!
This project uses a Makefile for streamlined development workflow:
```bash
# Set up complete development environment (recommended for new contributors)
make dev

# Run all tests
make test

# Run tests in parallel for faster execution
make test-parallel

# Clean up build artifacts and cache
make clean

# View all available commands
make help
```