# feat: add configurable cache strategies for dynamic enums (#7)

dragon-ai-agent merged 4 commits into `main`
## Conversation
Implements two caching strategies for dynamic enum validation:

- **Progressive (default)**: Validates terms lazily: checks the cache first, then queries the ontology directly for closure membership. Valid terms are added to the cache incrementally as they are encountered. Scales well for large ontologies (e.g. SNOMED, with 100k+ terms).
- **Greedy**: Expands the entire enum upfront and caches all terms. Good for smaller, frequently validated enums, and for CI/CD where determinism matters.

Changes:

- Add `CacheStrategy` enum (`PROGRESSIVE`, `GREEDY`) to `models.py`
- Change the enum cache format from JSON to CSV for git-friendliness
- Add `is_value_in_enum()` for progressive validation via closure lookup
- Add `_add_to_enum_cache()` for append-only progressive caching
- Add a `--cache-strategy` option to the `validate-data` CLI command
- Support `cache_strategy` in `oak_config.yaml` and the Python API

Documentation:

- Update `docs/caching.md` with an enum caching strategies section
- Update `docs/configuration.md` with `cache_strategy` options
- Update `docs/cli-reference.md` with the `--cache-strategy` option
- Add a caching strategies tutorial to `notebooks/02_advanced_usage.ipynb`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
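The progressive lookup order described above (cache hit first, then a direct closure query, then an incremental cache append) can be sketched as a minimal, self-contained function. The function name and the closure callback here are hypothetical, not the package's actual API:

```python
from typing import Callable

def check_term_progressive(
    value: str,
    cache: set[str],
    in_closure: Callable[[str], bool],
) -> bool:
    """Validate one CURIE lazily: cache hit, else ontology closure query.

    Valid terms are added to the cache so repeat lookups skip the
    (potentially expensive) ontology query entirely.
    """
    if value in cache:        # 1. cheap cache hit
        return True
    if in_closure(value):     # 2. direct closure membership query
        cache.add(value)      # 3. grow the cache incrementally
        return True
    return False
```

With a 100k-term ontology like SNOMED, only the terms actually seen in data ever reach the cache, which is the scaling argument the description makes for progressive mode.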
---
Claude finished @cmungall's task - View job

### Pull Request Code Review

I've completed a comprehensive review of this PR, which adds configurable caching strategies for dynamic enums. Here's my detailed analysis:

#### 🔍 Overall Assessment

This is a well-architected and thoughtfully implemented feature that addresses important performance and usability concerns. The code quality is high, with good separation of concerns, comprehensive testing, and excellent documentation.

#### ✅ Strengths

- Architecture & Design
- Code Quality
- Testing

#### 🚨 Issues Found

**1. Security - File Path Validation (Medium Priority)**

Location: the cache file path generation uses user input without sufficient validation:

```python
prefix_dir = self.config.cache_dir / prefix.lower()  # base.py:140
safe_name = re.sub(r"[^\w\-]", "_", enum_name.lower())  # base.py:338
```

Issue: there is no validation that the result is a safe path component, so a crafted prefix could escape the cache directory.

Recommendation: add path traversal protection:

```python
def _get_cache_file(self, prefix: str) -> Path:
    # Sanitize prefix to prevent path traversal
    safe_prefix = re.sub(r'[^\w\-]', '_', prefix.lower())
    if not safe_prefix or safe_prefix in ('.', '..'):
        raise ValueError(f"Invalid prefix for caching: {prefix}")
    prefix_dir = self.config.cache_dir / safe_prefix
```

**2. Bug - Cache File Corruption Risk (Medium Priority)**

Location: the progressive caching uses append mode without file locking:

```python
with open(cache_file, "a", newline="") as f:  # base.py:407
    writer = csv.DictWriter(f, fieldnames=["curie"])
    if not file_exists:
        writer.writeheader()
    writer.writerow({"curie": value})
```

Issue: concurrent processes could corrupt the CSV file or write duplicate headers.

Recommendation: use file locking or atomic writes:

```python
import fcntl  # Unix only; use an appropriate cross-platform solution

with open(cache_file, "a", newline="") as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    # ... write operations
```

**3. Performance - Inefficient Cache Loading (Low Priority)**

Location: the label caching reloads and rewrites the entire cache file for single additions:

```python
def _save_to_cache(self, prefix: str, curie: str, label: str) -> None:
    existing = self._load_cache(prefix)  # Loads entire file
    existing[curie] = label
    # Writes entire cache back
```

Recommendation: for progressive mode, consider append-only operations or batch updates.

#### 📊 Performance Analysis

- Progressive Mode ✅
- Greedy Mode
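The locking recommendation in issue 2 can be fleshed out into a self-contained helper. This is only a sketch, under the review's own caveat that `fcntl` is POSIX-only (Windows would need `msvcrt.locking` or a portable lock); checking for the header only after the lock is held also closes the duplicate-header race:

```python
import csv
import fcntl  # POSIX only; not available on Windows
import os
from pathlib import Path

def append_to_enum_cache(cache_file: Path, curie: str) -> None:
    """Append one CURIE under an exclusive lock so concurrent writers
    cannot interleave rows or emit duplicate headers.
    """
    with open(cache_file, "a", newline="") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # released when f closes
        # Decide whether to write the header only after the lock is
        # held, so another process cannot have written it in between.
        write_header = os.fstat(f.fileno()).st_size == 0
        writer = csv.DictWriter(f, fieldnames=["curie"])
        if write_header:
            writer.writeheader()
        writer.writerow({"curie": curie})
```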
---
### Pull request overview
This PR adds configurable caching strategies for dynamic enum validation, addressing scalability concerns for large ontologies. The implementation introduces progressive (default, lazy validation) and greedy (upfront expansion) modes, with progressive scaling well for large ontologies like SNOMED (100k+ terms) while greedy provides deterministic behavior suitable for CI/CD. The cache format was also changed from JSON to CSV for better git-friendliness and human readability.
Key changes:
- Added `CacheStrategy` enum with `PROGRESSIVE` and `GREEDY` modes
- Implemented lazy validation in progressive mode (checks cache → ontology → adds to cache)
- Added strict mode validation to catch fabricated/non-existent term IDs
- Changed enum cache format from JSON to CSV
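Because the cache is now a one-column CSV with a `curie` header, it reads back with the stdlib `csv` module alone (the function name here is hypothetical, not the package's actual API):

```python
import csv
from pathlib import Path

def load_enum_cache(cache_file: Path) -> set[str]:
    """Load cached CURIEs from a one-column CSV with a `curie` header."""
    if not cache_file.exists():
        return set()
    with open(cache_file, newline="") as f:
        return {row["curie"] for row in csv.DictReader(f)}
```

The same file also loads directly with `pandas.read_csv`, which is the pandas-compatibility point made in the PR summary below.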
### Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_plugins.py | Extensive test coverage for strict/lenient modes, cache strategies, and dynamic enum validation with bindings |
| tests/test_cli.py | Added test verifying --lenient option appears in CLI help |
| src/linkml_term_validator/plugins/dynamic_enum_plugin.py | Refactored to support both progressive and greedy caching strategies for dynamic enum validation |
| src/linkml_term_validator/plugins/binding_plugin.py | Added strict mode, dynamic enum closure validation, and term existence checking |
| src/linkml_term_validator/plugins/base.py | Added enum caching infrastructure, progressive validation, and dynamic enum expansion methods |
| src/linkml_term_validator/models.py | Added CacheStrategy enum and cache_strategy field to ValidationConfig |
| src/linkml_term_validator/cli.py | Added --cache-strategy and --lenient CLI options |
| notebooks/02_advanced_usage.ipynb | Added tutorial section demonstrating cache strategies |
| mkdocs.yml | Added new how-to guide for binding validation |
| docs/howto-bindings.md | New comprehensive guide for validating bindings with dynamic enums |
| docs/configuration.md | Updated with cache strategy configuration examples |
| docs/cli-reference.md | Added documentation for new CLI options |
| docs/caching.md | Expanded with detailed cache strategy explanations and recommendations |
| docs/binding-validation.md | Updated with strict mode documentation |
| docs/anti-hallucination.md | Minor formatting improvements |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
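The strict/lenient distinction noted above can be illustrated with a toy checker; the names are hypothetical, not `binding_plugin.py`'s real API. The point of strict mode is to distinguish a term that exists but falls outside the enum from an ID that does not exist in the ontology at all (a likely fabrication):

```python
from enum import Enum

class Mode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"

def check_binding(curie: str, ontology_terms: set[str],
                  enum_members: set[str], mode: Mode) -> list[str]:
    """Return validation messages for one bound CURIE (toy sketch)."""
    if curie in enum_members:
        return []
    # Strict mode additionally checks that the term exists at all.
    if mode is Mode.STRICT and curie not in ontology_terms:
        return [f"{curie} does not exist in the ontology (possible fabricated ID)"]
    return [f"{curie} is not a member of the dynamic enum"]
```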
---
Claude finished @cmungall's task - View job

### Pull Request Code Review (second run)

A second automated review repeated the overall assessment, strengths, performance analysis, and issues 1-3 from the review above, adding one further finding:

**4. Minor - Misleading Comment (Low Priority)**

Location: as noted by copilot-pull-request-reviewer, this comment is misleading:

```python
return False  # Term doesn't exist
```

---
### Summary

- Adds a `--cache-strategy` CLI option and `cache_strategy` config support

### Details

**Progressive Caching (Default)**

**Greedy Caching**
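The greedy strategy can be sketched as: expand the full closure in one ontology query, write every member to the CSV cache upfront, then validate by plain set membership. The names and the closure callback are hypothetical, not the package's actual API:

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

def write_greedy_cache(cache_file: Path,
                       fetch_closure: Callable[[], Iterable[str]]) -> set[str]:
    """Greedy strategy sketch: one ontology query expands the whole
    enum, so later validation is a deterministic O(1) membership test
    (the determinism argued for CI/CD use above).
    """
    members = sorted(set(fetch_closure()))  # sorted rows => stable git diffs
    with open(cache_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["curie"])
        writer.writeheader()
        for curie in members:
            writer.writerow({"curie": curie})
    return set(members)
```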
**Cache Format**

CSV files with a `curie` header - pandas-compatible, grep-able, git-diff-friendly:

```csv
curie
GO:0008150
GO:0007049
```

### Test plan
🤖 Generated with Claude Code