All notable changes to ChunkHound will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
4.0.0 - 2025-11-12
- Map-reduce synthesis for dramatically improved research accuracy - clusters related files and synthesizes them separately before combining insights
- Compact numbered citation system `[1][2][3]` replacing verbose `file.py:123` references for better readability
- Automatic query expansion with intelligent deduplication to find more relevant results
- Structured JSON output support for LLM providers enabling programmatic research workflows
- Tree progress display with event system for visual research feedback
- `chunkhound research <query>` command for direct code research without starting MCP server
- `chunkhound index --simulate [--json]` - Dry-run mode showing which files would be indexed without making changes
- `chunkhound diagnose [--json]` - Troubleshooting command comparing ChunkHound's decisions vs git's ignore rules
- `chunkhound calibrate` - Automatic batch size performance tuning for Qwen3 reranker
- `--show-sizes` flag for file size reporting during indexing
- Swift language support with tree-sitter parsing for classes, protocols, functions, and properties (`.swift`, `.swiftinterface`)
- Objective-C support with content detection to disambiguate from MATLAB (`.m` files)
- Zig language support with comprehensive tree-sitter parsing
- Haskell language support for functions, types, classes, and modules (`.hs`, `.lhs`, `.hs-boot`, `.hsig`, `.hsc`)
- HCL (HashiCorp Configuration Language) support for Terraform with nested object parsing (`.hcl`, `.tf`, `.tfvars`)
- Vue.js Single File Component (SFC) support with specialized parsing for template, script, and style sections
- Vue cross-reference tracking between template elements and script definitions for enhanced semantic understanding
- PHP language support with comprehensive parsing for classes, interfaces, traits, functions, methods, namespaces, and PHPDoc comments
- RapidYAML parser using native bindings (10-100x faster than tree-sitter for large YAML files)
- Helm template sanitizer for Go template syntax in Kubernetes manifests
- Automatic fallback to tree-sitter parser when RapidYAML encounters issues
- Benchmark harness comparing PyYAML, universal, and RapidYAML performance (`scripts/bench_yaml.py`)
- Repo-aware ignore engine respecting repository boundaries and preventing rule leakage between sibling repos
- Workspace overlay mode collecting .gitignore rules from root and nested files with correct anchoring
- Combined exclusion modes: `indexing.exclude_mode` supports `"combined"`, `"config_only"`, or `"gitignore_only"`
- Wildcard directory segment matching for patterns like `**/.venv*/` and `**/*.phar/`
- Git pathspec capping with fallback to prevent pathspec explosion (default: 128, env: `CHUNKHOUND_INDEXING__GIT_PATHSPEC_CAP`)
- Real-time telemetry for git pathspec usage and exclusion sources
- TEI (Text Embeddings Inference) reranking format support alongside Cohere format
- Automatic reranker format detection from response field names (Cohere vs TEI)
- Thread-safe format caching for consistent reranker behavior across requests
- Authorization header support for TEI endpoints with `--api-key` flag
- Qwen3 reranker with automatic batch size calibration for optimal performance
- Async regex search methods for concurrent search operations
- Claude Code CLI provider with direct integration (`claude-code-cli`)
- Codex CLI provider for synthesis workflows
- AWS Anthropic Bedrock provider using official Anthropic SDK
- Provider-specific synthesis concurrency limits: OpenAI (3), Bedrock (5), Claude CLI (1)
- Smart change detection using checksums for verification when mtime/size differ
- Content hash support in both DuckDB and LanceDB providers
- DuckDB schema migration with `files.content_hash` column (idempotent via `ALTER TABLE IF NOT EXISTS`)
- LanceDB execute_query adapter for lightweight batch SELECT operations
- In-memory database mode for simulate on fresh workspaces (no .chunkhound/ directory created)
- Checkpointing and recovery for more robust indexing coordinator
- Per-file timeout controls: `indexing.per_file_timeout_seconds`, `indexing.per_file_timeout_min_size_kb`
- Configurable host parameter for HTTP MCP server (`--host` for binding to specific interfaces)
- Size-based filtering threshold for structured config files (JSON/YAML/TOML)
- Environment variable override for DB executor timeout: `CHUNKHOUND_DB_EXECUTE_TIMEOUT`
- Comprehensive test suites for Swift, Objective-C, Zig, Java, C#, Python, PHP, Vue, HCL
- Test fixtures for refactored research modules with fake providers and better mocks
- Native git bindings for gitignore exclusions replacing Python-based pattern matching (10-100x faster indexing)
- Parallel directory discovery with auto-scaling for enterprise monorepos
- Concurrent file parsing using ProcessPoolExecutor across CPU cores
- Lazy parser instantiation reducing startup time
- Single-file fast path using in-process handling (no ProcessPool overhead)
- Single-read checksum verification eliminating redundant file I/O
- Provider-aware embedding concurrency: OpenAI (8 concurrent batches), VoyageAI (40 concurrent batches)
- Automatic retry logic for VoyageAI embedding provider
- Real-time embedding pass: dedicated "embed" phase after quick parse/store for new chunks
- Removed redundant reranking passes from deep research pipeline
- xxHash3-64 replacing SHA-256 for faster file change detection
- Git pathspec capping preventing pathspec explosion (configurable via env)
- In-memory DuckDB for simulate mode on fresh workspaces
- Automatic parser worker auto-scaling to CPU count when timeouts enabled (capped at 32)
- Split progress reporting: "Parsing files" vs "Handling files" with live cumulative info
- Better error messages and truncation detection for LLM responses
- Non-TTY progress fallback properly working in CI environments
- Improved diagnostics for parse/store errors with clearer failure messages
- Post-run prompt to add timed-out files to `indexing.exclude` when interactive
- Skipped file counts broken out into "Unchanged" and "Filtered" buckets
- Raw markdown output from code_research tool for better formatting in Claude
- Lazy imports for MCP-safe stdio operation
- Proper JSON-RPC handshake reliability
- Test-mode patches for Codex CLI integration (env-gated, no production impact)
- Increased startup wait time for Mac CI stability (3s → 5s)
- TEI reranking format comprehensive guide in CLAUDE.md
- Test coverage documentation with refactoring progress
- README improvements with startup profile CAP notes and exclusions section updates
- Benchmark instructions for YAML parser performance testing
- MCP setup improvements with multi-client support and `--show-setup` flag
- BREAKING: Removed `depth` parameter from `code_research` MCP tool - system now auto-scales synthesis budgets based on repository size
- BREAKING: Checksum algorithm switched from SHA-256 to xxHash3-64 for faster file change detection - all files will be reindexed on first run after upgrade
- BREAKING: Default exclusion behavior changed - providing `indexing.exclude` list no longer disables .gitignore (use `exclude_mode: "config_only"` for legacy behavior)
- BREAKING: RapidYAML is now the default YAML parser (set `CHUNKHOUND_YAML_ENGINE=tree` to revert to tree-sitter)
- BREAKING: LanceDB provider now requires `content_hash` column in files schema
- Default per-file timeout enabled: `indexing.per_file_timeout_seconds=3.0` (previously `0`, disabled)
- Parser workers auto-scale to CPU count when timeouts enabled (capped at 32)
- Combined exclusion mode is now default: overlays gitignore + config excludes instead of replacing
- Model defaults updated to Haiku 4.5 for claude-code-cli and bedrock providers
- Deep research service refactored into specialized modules: question_generator, synthesis_engine, budget_calculator, citation_manager, quality_validator
- Search service refactored into strategies: context_retriever, single_hop_strategy, multi_hop_strategy, result_enhancer
- Extracted research pipeline modules: unified_search, query_expander, file_reader, context_manager
- Fixed double "**/" prefix preventing root file matches in default excludes
- Fixed real-time indexing for newly added languages
- Fixed file diversity collapse in deep research using proper reranking
- Fixed TOML parser to extract only matched node content instead of entire file
- Fixed tree-sitter language names for C# and Makefile parsers
- Fixed .gitignore pattern handling and error logging
- Fixed symbol validation inconsistency in Chunk.from_dict()
- Fixed Config.init to respect target_dir kwarg in tests
- Fixed DuckDB `get_file_by_path(as_model=True)` to return correct mtime and size_bytes for accurate skip checks
- Fixed registry provider instance handling (was storing lambda instead of provider)
- Fixed orphaned embeddings cleanup with proper per-call db_path configuration
- Fixed LanceDB optimize() API usage for 0.21.0+ (cleanup_older_than parameter)
- Fixed single-file indexing to use in-process path and call on_batch for immediate storage
- Fixed missing sources in synthesis by using correct chunk.content field (was chunk.code)
- Fixed flaky multi-hop semantic chain test
- Fixed reranker single-batch top_k filtering for consistency across backends
- Fixed concurrent rerank calls using aiohttp (replaced custom socket-based HTTP)
- Fixed MCP stdio flow for code_research end-to-end reliability
- Fixed non-TTY progress manager regression (added minimal Progress shim for CI)
- Fixed exception classes to allow traceback assignment (removed frozen dataclass)
- Fixed Windows path separator issues in gitignore pattern generation and matching
- Fixed ProcessPoolExecutor segfault on Linux by forcing spawn multiprocessing
- Fixed flaky QA test with file processing completion polling
- Fixed real-time indexing flakiness with proper timeout handling and task cleanup
- Removed AWS Bedrock provider (consolidated to Anthropic SDK-based Bedrock provider)
- Removed research tools setup section from CONTRIBUTING.md (obsolete)
- Removed obsolete tests incompatible with refactored modular architecture
- Removed embedded API key from `.chunkhound.json` - use environment variables instead (e.g., `CHUNKHOUND_EMBEDDING__API_KEY`)
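
For anyone upgrading to 4.0.0, the three `indexing.exclude_mode` values can be pictured with a small sketch. This is only an illustration of the behavior described in this release, not ChunkHound's implementation: the helper name `effective_excludes` and its arguments are hypothetical.

```python
from typing import Sequence

# Mode names as listed in this release's notes.
VALID_MODES = ("combined", "config_only", "gitignore_only")


def effective_excludes(
    mode: str,
    config_excludes: Sequence[str],
    gitignore_rules: Sequence[str],
) -> list[str]:
    """Hypothetical helper: choose which exclusion sources apply for a mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown exclude mode: {mode!r}")
    patterns: list[str] = []
    # "combined" (the new default) overlays both sources.
    if mode in ("combined", "gitignore_only"):
        patterns.extend(gitignore_rules)
    if mode in ("combined", "config_only"):
        patterns.extend(config_excludes)
    return patterns
```

Under this reading, `"config_only"` reproduces the pre-4.0.0 behavior in which an explicit `indexing.exclude` list replaced the .gitignore rules.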
3.3.1 - 2025-09-25
- Dependency updates to latest stable versions for improved stability and performance
- Test infrastructure reliability with better provider detection and error handling
- Tree-sitter 0.25.x API compatibility ensuring parsing works with latest language parsers
- Code formatting and import organization for cleaner, more maintainable codebase
3.3.0 - 2025-09-21
- Official Windows support with full CI testing across Windows, macOS, and Ubuntu
- Command-line search functionality (`chunkhound search`) for semantic and regex queries without starting MCP
- CONTRIBUTING.md guidelines
- Setup wizard when `.chunkhound.json` isn't found in the directory
- File exclude patterns (/tmp/) on Linux systems
- Regex search path resolution across platforms
3.2.0 - 2025-08-24
- Semantic search upgraded from two-hop to dynamic multi-hop expansion with intelligent stopping criteria, delivering more comprehensive and contextually relevant results while avoiding search explosion
3.1.0 - 2025-08-21
- PDF document parsing and indexing with full text extraction using PyMuPDF integration
- Language support expanded to 29 languages with comprehensive documentation breakdown
- JSON file parsing now extracts specific node content instead of entire file content, improving search precision and reducing noise
3.0.1 - 2025-08-21
- Documentation site improved with cross-linking between pages and hero image for better navigation
- OpenAI-compatible endpoint flexibility increased by making API keys optional for local deployments
- Test infrastructure reliability improved with comprehensive CI fixes and timeout handling
- JSON file parsing now handles empty chunks correctly, eliminating indexing failures on common JSON patterns
- Test suite stability enhanced with proper background task cleanup and configuration isolation
- GitHub Actions workflow simplified and made more reliable by removing redundant processes
3.0.0 - 2025-08-20
- VoyageAI embedding provider with advanced two-hop semantic search and reranking capabilities
- GitHub Pages documentation site with interactive examples and improved navigation
- Intelligent file exclusion system with .gitignore support and JSON size filtering
- Advanced makefile parsing with dependency analysis for better code comprehension
- Comprehensive test suite for database consistency and integration testing
- Real-time filesystem indexing with MCP integration for live code monitoring
- Parsing system completely rebuilt with cAST (Code AST) algorithm for universal language support
- Configuration system dramatically simplified with fewer user-facing options for easier setup
- OpenAI provider unified to handle both standard and custom OpenAI-compatible endpoints
- MCP server reliability improved with proper initialization sequencing and watchdog coordination
- Test infrastructure enhanced with Ollama compatibility and extended timeouts
- Directory indexing consolidated between CLI and MCP with shared service architecture
- MCP server initialization blocking resolved - no more startup deadlocks during directory scanning
- Custom OpenAI endpoint configuration now properly recognized and applied
- Real-time indexing now generates missing embeddings for unchanged code chunks
- SSL verification disabled for custom OpenAI-compatible endpoints to support local deployments
- Watchdog filesystem monitoring no longer blocks MCP server startup process
- MCP server properly respects target directory path arguments across all operations
- TEI (Text Embeddings Inference) provider support - simplified provider ecosystem
- BGE provider support - consolidated to core providers for better maintenance
- Legacy parsing system replaced with modern cAST algorithm
- Obsolete configuration documentation and setup files cleaned up
2.8.1 - 2025-07-20
- Architecture documentation significantly improved for better LLM comprehension and AI-assisted development workflows
- Type annotation syntax errors that could cause import failures in Python 3.10+ environments
- Enhanced smoke tests now detect forward reference type annotation issues early
2.8.0 - 2025-07-20
- MCP HTTP transport support alongside stdio transport for flexible deployment options
- Configuration system unified across CLI and MCP components for consistent behavior
- File change processing reliability improved in MCP servers with better debouncing and coordination
- Database portability enhanced with relative path storage
- MCP server initialization deadlocks and startup crashes resolved with proper async coordination
- File deletion handling improved using IndexingCoordinator for better reliability
- MCP server tool discovery enhanced with fallback logic for better error recovery
- File path resolution improved in DuckDB provider for cross-platform consistency
2.7.0 - 2025-07-12
- MCP server now uses configured embedding model instead of hardcoded text-embedding-3-small default, ensuring semantic search works with any configured model
- MCP test environment improvements with comprehensive test data and configuration files
2.6.3 - 2025-07-10
- Configuration merge precedence now correctly preserves environment variables over JSON config values
- MCP server semantic search now works properly when running from different directories
- Removed obsolete Ubuntu 20 Dockerfile as issue was resolved in configuration system
2.6.2 - 2025-07-10
- MCP server now properly loads embedding provider configuration from target directory
2.6.1 - 2025-07-10
- MCP server now properly respects CLI-provided project root directory for configuration loading
- Configuration files (.chunkhound.json) are now correctly loaded when running MCP server from different directories
2.6.0 - 2025-07-10
- MCP server crashes on Ubuntu and Linux systems when running from different directories by fixing database path resolution and process coordination
- Enhanced TaskGroup error reporting to show underlying causes instead of generic wrapper errors
- Configuration file loading in MCP server now properly respects .chunkhound.json files in target directories
- Database lock conflicts between multiple MCP instances resolved with proper process detection
- Docker test infrastructure for MCP server validation to prevent future regressions
- Improved error messages for debugging MCP server issues with detailed analysis
2.5.4 - 2025-07-10
- MCP server reliability on Ubuntu and other Linux distributions when running from different directories
- Database path resolution consistency across all MCP server components
2.5.3 - 2025-07-10
- MCP server communication reliability improved by removing debug logging that interfered with JSON-RPC protocol
2.5.2 - 2025-07-10
- Automatic database optimization during embedding generation to maintain performance with large datasets (every 1000 batches, configurable via `CHUNKHOUND_EMBEDDING_OPTIMIZATION_BATCH_FREQUENCY`)
- MCP server compatibility on Ubuntu and other strict platforms by preserving virtual environment context in subprocesses
- OpenAI embedding provider crash on Ubuntu due to async resource creation outside event loop context
2.5.1 - 2025-01-09
- Project detection now properly respects CHUNKHOUND_PROJECT_ROOT environment variable, ensuring MCP command works correctly when launched from any directory
- Removed duplicate MCP parser function that could cause confusion
2.5.0 - 2025-01-09
- MCP positional path argument now controls complete project scope - database location, config file search, and watch paths are all set to the specified directory instead of just watch paths
- MCP launcher import path resolution when running from different directories, eliminating TaskGroup errors on Ubuntu and other strict platforms
2.4.4 - 2025-01-09
- Ubuntu TaskGroup crash fixed by removing problematic directory change in MCP launcher
2.4.3 - 2025-01-09
- MCP server now works correctly when launched from any directory, not just the project root
- Fixed path resolution inconsistencies that caused TaskGroup errors on Ubuntu deployments
2.4.2 - 2025-01-09
- MCP command now accepts optional path argument to specify directory for indexing and watching (defaults to current directory)
- Parser architecture inconsistencies resolved across C, Bash, and Makefile parsers for consistent search functionality
- MCP server database duplication eliminated through proper async task isolation
- LanceDB storage growth controlled with automatic optimization during quiet periods
- MCP server reliability improved with corrected import structure and dependency resolution
- Python parser behavior now consistent between CLI and MCP modes
- Search operation freezes after file deletion resolved with proper thread safety
2.4.1 - 2025-01-09
- Package structure consolidated under chunkhound/ directory for improved import reliability and Python packaging best practices
2.4.0 - 2025-01-09
- LanceDB storage growth issue resolved with automatic database optimization during quiet periods
- Configuration system project root detection for .chunkhound.json files improved
- Enhanced database provider architecture with capability detection and activity tracking
- Modernized configuration system by removing legacy registry config building
2.3.1 - 2025-07-09
- MCP server communication reliability improved by preventing stderr output from corrupting JSON-RPC messages
- Enhanced configuration documentation with automatic .chunkhound.json detection examples
2.3.0 - 2025-07-08
- BREAKING: Configuration system completely refactored with centralized management and clear precedence hierarchy
- BREAKING: Automatic configuration file loading removed - config files now only load with explicit `--config` flag
- BREAKING: Environment variables standardized to `CHUNKHOUND_*` prefix with `__` delimiters (e.g., `CHUNKHOUND_EMBEDDING__API_KEY`)
- BREAKING: Legacy `OPENAI_API_KEY` and `OPENAI_BASE_URL` environment variables no longer supported
- Complete CLI argument coverage for all configuration options
- Centralized configuration precedence: CLI args → Config file → Environment variables → Defaults
- Comprehensive migration guide for updating existing configurations
- Database file gitignore pattern for Lance database files
- MCP server database duplication caused by shared transaction state across async tasks
- Parser architecture inconsistencies for C, Bash, and Makefile language parsers
- Configuration auto-detection issues that caused deployment complexity
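
The `CHUNKHOUND_*` naming scheme introduced in 2.3.0 can be read as a path into nested configuration, with `__` separating levels. The sketch below shows one plausible way to interpret such names; it is an assumption for illustration, and `env_to_config` is a hypothetical function rather than part of ChunkHound.

```python
from typing import Any, Mapping

PREFIX = "CHUNKHOUND_"
DELIM = "__"


def env_to_config(environ: Mapping[str, str]) -> dict[str, Any]:
    """Hypothetical: map CHUNKHOUND_A__B=value to {"a": {"b": value}}."""
    config: dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # legacy names such as OPENAI_API_KEY are ignored
        parts = name[len(PREFIX):].lower().split(DELIM)
        node = config
        # Walk or create intermediate sections, then set the leaf value.
        for key in parts[:-1]:
            node = node.setdefault(key, {})
        node[parts[-1]] = value
    return config
```

Passing `os.environ` would apply the same mapping to the real environment. The sketch does not handle a name that is both a leaf and a section (for example `CHUNKHOUND_EMBEDDING` alongside `CHUNKHOUND_EMBEDDING__API_KEY`).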
2.2.0 - 2025-01-07
- Database freezing during concurrent file operations through proper async/sync boundary handling
- Thread safety issues in DuckDB provider with synchronized WAL cleanup and operation timeouts
- LanceDB duplicate file entries through atomic merge operations and path normalization
- File deletion operations now properly handle async contexts without blocking the event loop
- Aligned LanceDB provider with serial executor pattern for consistency with DuckDB
- Improved path normalization to handle symlinks and different path representations
- Enhanced database operation reliability with proper thread isolation
- Support for complete configuration storage including API keys in .chunkhound.json files
- Consolidated embedding provider creation system for consistent behavior across CLI and config files
2.1.4 - 2025-07-03
- CLI argument defaults no longer override config file values
- Updated dependencies via uv.lock
2.1.3 - 2025-07-03
- Consolidated embedding provider creation to use single factory pattern for consistency
- Reduced embedding provider log verbosity for cleaner output
2.1.2 - 2025-07-03
- API key configuration loading from .chunkhound.json files
- Configuration precedence documentation to match actual behavior
- Complete configuration examples with API key and security guidance
2.1.1 - 2025-07-03
- Centralized version management system for consistent versioning across all components
- Simplified version updates through automated scripts
- Enhanced installation and development documentation
- Code formatting improvements and linting cleanup
- Version consistency across CLI, MCP server, and package initialization
- Import statement in package `__init__.py` for better module exposure
2.1.0 - 2025-07-02
- Database duplication in MCP server by implementing single-threaded executor pattern
- WAL corruption handling during DuckDB catalog replay
- Parser architecture inconsistencies for C, Bash, and Makefile parsers
- DuckDB foreign key constraint transaction limitations
- Python parser CLI/MCP divergence through unified factory pattern
- Connection management architectural violations
- Consolidated database operations through DuckDBProvider executor pattern
- Simplified ConnectionManager to handle only connection lifecycle
- Updated file discovery patterns to include all 16 supported languages
- Removed deprecated connection methods and schema fields
- Enhanced transaction handling with contextvars for task isolation
- Automatic database migration system for schema updates
- Enhanced parser functionality for C pointer functions and Bash function bodies
- Task-local transaction state management
- Comprehensive executor methods for database operations
2.0.0 - 2025-06-26
- 10 new language parsers: Rust, Go, C++, C, Kotlin, Groovy, Bash, TOML, Makefile, Matlab
- Search pagination with response size limits
- Registry-based parser architecture
- MCP search task coordinator
- Test coverage for file modification tracking
- Comment and docstring indexing for all language parsers
- Background periodic indexing for better performance
- Path filtering support for targeted searches
- HNSW index WAL recovery with enhanced checkpoints
- Embedding cache optimization with CRC32-based content tracking
- BREAKING: 'run' command renamed to 'index' with current directory default
- BREAKING: Parser system refactored to registry pattern
- Centralized language support in Language enum
- Optimized embedding performance with token-aware batching
- Enhanced PyInstaller compatibility
- Improved cross-platform build support (Windows, Ubuntu Docker)
- Enhanced MCP server JSON-RPC communication with logging suppression
- Parser error handling and registry integration
- OpenAI token limit handling
- PyInstaller module path resolution
- Database WAL corruption issues on server exit
- File watcher cancellation responsiveness
- Signal handler safety by removing unsafe database operations
- Windows PyInstaller and MATLAB dependency issues
- Build workflow reliability across platforms
1.2.3 - 2025-06-23
- Default database location changed to current directory for better persistence
- OpenAI token limit exceeded error with dynamic batching for large embedding requests
- Empty chunk filtering to reduce noise in search results
- Python parser validation for empty symbol names
- Windows build support with comprehensive GitHub Actions workflow
- macOS Intel build issues with UV package manager installation
- Cross-platform build workflow reliability
- Windows build support with automated testing
- Enhanced debugging for build processes across platforms
1.2.2 - 2024-12-15
- File watching CLI for real-time code monitoring
- Unified JavaScript and TypeScript parsers
- Default database location to current directory
- Empty symbol validation in Python parser
1.2.1 - 2024-11-28
- Ubuntu 20.04 build support
- Token limit management for MCP search
- Duplicate chunks after file edits
- File modification detection race conditions
1.2.0 - 2024-11-15
- C# language support
- JSON, YAML, and plain text file support
- File watching with real-time indexing
- File deletion handling
- Database connection issues
1.1.0 - 2025-06-12
- Multi-language support: TypeScript, JavaScript, C#, Java, and Markdown
- Comprehensive CLI interface
- Binary distribution with faster startup
- Improved CLI startup performance (90% faster)
- Binary startup performance (16x faster)
- Version display consistency
- Cross-platform build issues
1.0.1 - 2025-06-11
- Python 3.10+ compatibility
- PyPI publishing
- Standalone executable support
- MCP server integration
- Dependency conflicts
- OpenAI model parameter handling
- Binary compilation issues
1.0.0 - 2025-06-10
- Initial release of ChunkHound
- Python parsing with tree-sitter
- DuckDB backend for storage and search
- OpenAI embeddings for semantic search
- CLI interface for indexing and searching
- MCP server for AI assistant integration
- File watching for real-time indexing
- Regex search capabilities
For more information, visit: https://github.com/chunkhound/chunkhound