Thank you for your interest in contributing to DataBeak! This guide will help you get started with contributing to the project.
- Code of Conduct
- Getting Started
- Development Setup
- Development Workflow
- Code Standards
- Testing
- Documentation
- Submitting Changes
- Release Process
By participating in this project, you agree to abide by our Code of Conduct:
- Be respectful and inclusive
- Welcome newcomers and help them get started
- Focus on constructive criticism
- Accept feedback gracefully
- Put the project's best interests first
1. Fork the repository on GitHub

2. Clone your fork locally:

   ```bash
   git clone https://github.com/<your-username>/databeak.git
   cd databeak
   ```

3. Add upstream remote:

   ```bash
   git remote add upstream https://github.com/jonpspri/databeak.git
   ```
- Python 3.12 or higher
- Git
- uv - fast Python package manager (required)
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or on macOS: brew install uv
# Or with pip: pip install uv
```

```bash
# Install all dependencies
uv sync --all-extras

# Install pre-commit hooks
uv run pre-commit install
```
That's it! You're ready to go.

Alternatively, set up with pip:

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,test,docs]"

# Install pre-commit hooks
pre-commit install
```

Note: We standardize on uv for all development. It's significantly faster and handles everything pip does plus more.
```bash
# All commands use uv
uv run databeak --help
uv run pytest -n auto
uv run ruff check
uv run mypy src/databeak/
```

🚨 IMPORTANT: Direct commits to main are prohibited. Pre-commit hooks enforce branch-based development.
```bash
# Update main branch
git checkout main
git pull origin main

# Create descriptive feature branch
git checkout -b feature/your-feature-name
# OR use other prefixes: fix/, docs/, test/, refactor/, chore/
```

Follow these guidelines:
- Branch-based development only - Never commit directly to main
- One feature per PR - Keep pull requests focused
- Write tests - All new features must have tests
- Update docs - Update README and docstrings as needed
- Follow style guide - Use Ruff and MyPy
- Conventional commits - Use conventional commit format (enforced by hooks)
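Conventional commit messages take the form `type(scope): description`. As a rough sketch of the check a commit-message hook might apply (the regex and type list below reflect the common convention, not DataBeak's exact hook configuration):

```python
import re

# Common conventional-commit types; the enforced set may differ.
TYPES = ("feat", "fix", "docs", "test", "refactor", "chore")

PATTERN = re.compile(r"^(" + "|".join(TYPES) + r")(\([\w-]+\))?(!)?: .+")

def is_conventional(message: str) -> bool:
    """Return True if the first line matches type(scope): description."""
    return bool(PATTERN.match(message.splitlines()[0]))

print(is_conventional("feat: Add new data filtering tool"))  # True
print(is_conventional("fix(io): handle empty CSV files"))    # True
print(is_conventional("updated some stuff"))                 # False
```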
```bash
# All commands use uv for speed and consistency
uv run ruff format         # Format code with Ruff
uv run ruff check          # Lint with Ruff
uv run mypy src/databeak/  # Type check with MyPy
```

DataBeak uses a three-tier testing structure. Run tests appropriate to your changes:
```bash
# Run tests by category
uv run pytest -n auto tests/unit/         # Fast unit tests (run frequently)
uv run pytest -n auto tests/integration/  # Integration tests
uv run pytest -n auto tests/e2e/          # End-to-end tests
uv run pytest -n auto                     # All tests

# Run with coverage
uv run pytest -n auto --cov=src/databeak --cov-report=term-missing

# Run specific tests
uv run pytest -n auto tests/unit/servers/  # Test a specific module
uv run pytest -k "test_filter"             # Run tests matching a pattern (no parallelism)
uv run pytest -n auto -x                   # Stop on first failure
```

Testing Requirements:
- New features must have unit tests in `tests/unit/`
- Bug fixes must include regression tests
- Maintain 80%+ code coverage
- See Testing Guide for details
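A regression test should pin the exact input that triggered the bug, so the bug cannot silently return. A self-contained sketch (the function and the bug are illustrative, not a real DataBeak issue):

```python
def parse_row_count(value: str) -> int:
    """Parse a row count, treating blank input as zero.

    An earlier version raised ValueError on empty strings;
    the regression test below pins the fixed behavior.
    """
    stripped = value.strip()
    return int(stripped) if stripped else 0

def test_parse_row_count_empty_string_regression():
    # Regression: empty or whitespace-only input used to raise ValueError.
    assert parse_row_count("") == 0
    assert parse_row_count("  ") == 0
    # Normal inputs still parse correctly.
    assert parse_row_count("42") == 42
```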
```bash
# Push feature branch
git push -u origin feature/your-feature-name

# Create PR using GitHub CLI
gh pr create --title "feat: Add new data filtering tool" --body "Description of changes..."
# OR create via GitHub web interface
```

Pull Request Requirements:
- Descriptive title with conventional commit prefix (feat:, fix:, docs:, etc.)
- Clear description explaining what changes and why
- Link related issues with "Closes #123" syntax
- All checks must pass (tests, linting, type checking)
- Review and approval required before merge
We use modern Python tooling for code quality:
- Ruff for formatting and linting (line length: 100; replaces flake8, isort, and more)
- MyPy for type checking
- Pre-commit for automated checks
1. Type Hints: All functions must have type hints

   ```python
   async def process_data(
       session_id: str,
       options: Dict[str, Any],
       ctx: Optional[Context] = None,
   ) -> Dict[str, Any]:
       """Process data with given options."""
       ...
   ```

2. Docstrings: Use Google-style docstrings

   ```python
   def analyze_data(df: pd.DataFrame) -> Dict[str, Any]:
       """Analyze DataFrame and return statistics.

       Args:
           df: Input DataFrame to analyze

       Returns:
           Dictionary containing analysis results

       Raises:
           ValueError: If DataFrame is empty
       """
   ```

3. Error Handling: Use specific exceptions

   ```python
   if not session:
       raise ValueError(f"Session {session_id} not found")
   ```

4. Async/Await: Use async for all tool functions

   ```python
   @mcp.tool
   async def my_tool(param: str, ctx: Context) -> Dict[str, Any]:
       result = await async_operation(param)
       return {"success": True, "data": result}
   ```

5. Logging: Use appropriate log levels

   ```python
   logger.debug("Processing row %d", row_num)
   logger.info("Session %s created", session_id)
   logger.warning("Large dataset: %d rows", row_count)
   logger.error("Failed to load file: %s", error)
   ```
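The "specific exceptions" guideline can also be met with project-defined exception types, which callers can catch precisely instead of a broad `ValueError`. A sketch (the class and function names here are illustrative, not DataBeak's actual exception hierarchy):

```python
class SessionNotFoundError(Exception):
    """Raised when a session_id does not map to a live session."""

    def __init__(self, session_id: str) -> None:
        super().__init__(f"Session {session_id} not found")
        self.session_id = session_id

def get_session(sessions: dict, session_id: str):
    """Look up a session, raising a specific error when it is missing."""
    if session_id not in sessions:
        raise SessionNotFoundError(session_id)
    return sessions[session_id]
```

Callers can then write `except SessionNotFoundError:` and handle the missing-session case without accidentally swallowing unrelated errors.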
```
src/databeak/
├── __init__.py        # Package initialization
├── server.py          # Main server entry point
├── models/            # Data models and schemas
│   ├── __init__.py
│   ├── csv_session.py # Session management
│   └── data_models.py # Pydantic models
├── servers/           # MCP server implementations
│   ├── __init__.py
│   ├── io_operations.py
│   ├── transformations.py
│   ├── analytics.py
│   └── validation.py
├── resources/         # MCP resources
├── prompts/           # MCP prompts
└── utils/             # Utility functions
```
```
tests/
├── unit/              # Unit tests
│   ├── test_models.py
│   ├── test_transformations.py
│   └── test_analytics.py
├── integration/       # Integration tests
│   ├── test_server.py
│   └── test_workflows.py
├── benchmark/         # Performance tests
│   └── test_performance.py
└── fixtures/          # Test data
    └── sample_data.csv
```
1. Use pytest fixtures:

   ```python
   import uuid

   @pytest.fixture
   def session_with_data():
       """Create a session with sample data."""
       manager = get_session_manager()
       session_id = str(uuid.uuid4())
       session = manager.get_or_create_session(session_id)
       # ... setup
       yield session_id
       # ... cleanup
       manager.remove_session(session_id)
   ```

2. Test async functions:

   ```python
   @pytest.mark.asyncio
   async def test_filter_rows(session_with_data):
       result = await filter_rows(
           session_id=session_with_data,
           conditions=[{"column": "age", "operator": ">", "value": 18}],
       )
       assert result["success"]
       assert result["rows_after"] < result["rows_before"]
   ```

3. Use parametrize for multiple cases:

   ```python
   @pytest.mark.parametrize(
       "dtype,expected",
       [
           ("int", True),
           ("float", True),
           ("str", False),
       ],
   )
   def test_is_numeric(dtype, expected):
       assert is_numeric_dtype(dtype) == expected
   ```
- Minimum coverage: 80%
- New features must have >90% coverage
- Run coverage:

  ```bash
  uv run pytest -n auto --cov
  ```
All public functions, classes, and modules must have docstrings:
"""Module description.
This module provides functionality for X, Y, and Z.
"""
class DataProcessor:
"""Process CSV data with various transformations.
Attributes:
session_id: Unique session identifier
df: Pandas DataFrame containing the data
"""
def transform(self, operation: str, **kwargs: Any) -> pd.DataFrame:
"""Apply transformation to data.
Args:
operation: Name of the transformation
**kwargs: Additional parameters for the operation
Returns:
Transformed DataFrame
Raises:
ValueError: If operation is not supported
Examples:
>>> processor.transform("normalize", columns=["price"])
>>> processor.transform("fill_missing", strategy="mean")
"""- README.md: Update for new features or breaking changes
- API Docs: Ensure docstrings are complete
- Examples: Add examples for new features
- Changelog: Update CHANGELOG.md
1. Update your branch:

   ```bash
   git fetch upstream
   git rebase upstream/main
   ```

2. Push to your fork:

   ```bash
   git push origin feature/your-feature-name
   ```

3. Create Pull Request:

   - Go to GitHub and create a PR from your fork
   - Use a clear, descriptive title
   - Fill out the PR template
   - Link related issues
```markdown
## Description

Brief description of changes

## Type of Change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing

- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Coverage maintained/improved

## Checklist

- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings
```

Review process:

- Automated checks must pass
- Code review by at least one maintainer
- Address feedback promptly
- Squash commits if requested
We follow Semantic Versioning:
- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes
- Update version in `pyproject.toml`
- Update CHANGELOG.md
- Create release PR
- Tag release after merge
- Publish to PyPI (automated)
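The version bump in the first step is a one-line edit in `pyproject.toml` (the version number below is illustrative):

```toml
[project]
name = "databeak"
version = "1.2.3"  # bump per SemVer: MAJOR.MINOR.PATCH
```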
DataBeak documentation maintains professional, factual tone:
Prohibited terms:
- "exceptional", "perfect", "amazing", "outstanding", "superior"
- "revolutionary", "cutting-edge", "world-class", "best-in-class"
- "unparalleled", "state-of-the-art", "industry-leading", "premium", "elite"
- "ultimate", "maximum", "optimal", "flawless"
Use factual alternatives:
- "exceptional standards" → "strict standards"
- "perfect compliance" → "clean compliance"
- "comprehensive coverage" → "high coverage"
- "API design excellence" → "clear API design"
- "security best practices" → "defensive practices"
Acceptable (measurable):
- "Zero ruff violations" (verifiable metric)
- "100% mypy compliance" (measurable result)
- "1100+ unit tests" (concrete count)
Prohibited (subjective claims):
- "production quality" (marketing speak)
- "advanced analytics" (vague superlative)
- "sophisticated architecture" (self-congratulatory)
Use measured, technical language:
- "provides" not "delivers amazing"
- "supports" not "offers comprehensive"
- "implements" not "features advanced"
- "handles" not "excels at"
- Issues: Use GitHub Issues for bugs and features
- Discussions: Use GitHub Discussions for questions
- Discord: Join our Discord server (link in README)
Contributors are recognized in:
- AUTHORS.md file
- Release notes
- Project README
Thank you for contributing to DataBeak!