# feat: add configurable cache strategies for dynamic enums (#7)

dragon-ai-agent merged 4 commits into `main`
## Conversation
Implements two caching strategies for dynamic enum validation:

- **Progressive (default)**: Validates terms lazily: checks the cache first, then queries the ontology directly for closure membership. Valid terms are added to the cache incrementally as they are encountered. Scales well for large ontologies (e.g. SNOMED, with 100k+ terms).
- **Greedy**: Expands the entire enum upfront and caches all terms. Good for smaller, frequently validated enums, and for CI/CD where determinism matters.

Changes:

- Add `CacheStrategy` enum (`PROGRESSIVE`, `GREEDY`) to `models.py`
- Change the enum cache format from JSON to CSV for git-friendliness
- Add `is_value_in_enum()` for progressive validation via closure lookup
- Add `_add_to_enum_cache()` for append-only progressive caching
- Add a `--cache-strategy` option to the `validate-data` CLI command
- Support `cache_strategy` in `oak_config.yaml` and the Python API

Documentation:

- Update `docs/caching.md` with an enum caching strategies section
- Update `docs/configuration.md` with `cache_strategy` options
- Update `docs/cli-reference.md` with the `--cache-strategy` option
- Add a caching strategies tutorial to `notebooks/02_advanced_usage.ipynb`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
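The progressive lookup order described above (cache hit first, then a direct closure query, then an incremental cache append) can be sketched as a minimal, self-contained function. The function name and the closure callback here are hypothetical, not the package's actual API:

```python
from typing import Callable

def check_term_progressive(
    value: str,
    cache: set[str],
    in_closure: Callable[[str], bool],
) -> bool:
    """Validate one CURIE lazily: cache hit, else ontology closure query.

    Valid terms are added to the cache so repeat lookups skip the
    (potentially expensive) ontology query entirely.
    """
    if value in cache:        # 1. cheap cache hit
        return True
    if in_closure(value):     # 2. direct closure membership query
        cache.add(value)      # 3. grow the cache incrementally
        return True
    return False
```

With a 100k-term ontology like SNOMED, only the terms actually seen in data ever reach the cache, which is the scaling argument the description makes for progressive mode.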
---
Claude finished @cmungall's task - View job

### Pull Request Code Review

I've completed a comprehensive review of this PR, which adds configurable caching strategies for dynamic enums. Here's my detailed analysis:

#### 🔍 Overall Assessment

This is a well-architected and thoughtfully implemented feature that addresses important performance and usability concerns. The code quality is high, with good separation of concerns, comprehensive testing, and excellent documentation.

#### ✅ Strengths

- Architecture & Design
- Code Quality
- Testing

#### 🚨 Issues Found

**1. Security - File Path Validation (Medium Priority)**

Location: the cache file path generation uses user input without sufficient validation:

```python
prefix_dir = self.config.cache_dir / prefix.lower()  # base.py:140
safe_name = re.sub(r"[^\w\-]", "_", enum_name.lower())  # base.py:338
```

Issue: there is no validation that the result is a safe path component, so a crafted prefix could escape the cache directory.

Recommendation: add path traversal protection:

```python
def _get_cache_file(self, prefix: str) -> Path:
    # Sanitize prefix to prevent path traversal
    safe_prefix = re.sub(r'[^\w\-]', '_', prefix.lower())
    if not safe_prefix or safe_prefix in ('.', '..'):
        raise ValueError(f"Invalid prefix for caching: {prefix}")
    prefix_dir = self.config.cache_dir / safe_prefix
```

**2. Bug - Cache File Corruption Risk (Medium Priority)**

Location: the progressive caching uses append mode without file locking:

```python
with open(cache_file, "a", newline="") as f:  # base.py:407
    writer = csv.DictWriter(f, fieldnames=["curie"])
    if not file_exists:
        writer.writeheader()
    writer.writerow({"curie": value})
```

Issue: concurrent processes could corrupt the CSV file or write duplicate headers.

Recommendation: use file locking or atomic writes:

```python
import fcntl  # Unix only; use an appropriate cross-platform solution

with open(cache_file, "a", newline="") as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    # ... write operations
```

**3. Performance - Inefficient Cache Loading (Low Priority)**

Location: the label caching reloads and rewrites the entire cache file for single additions:

```python
def _save_to_cache(self, prefix: str, curie: str, label: str) -> None:
    existing = self._load_cache(prefix)  # Loads entire file
    existing[curie] = label
    # Writes entire cache back
```

Recommendation: for progressive mode, consider append-only operations or batch updates.

#### 📊 Performance Analysis

- Progressive Mode ✅
- Greedy Mode
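The locking recommendation in issue 2 can be fleshed out into a self-contained helper. This is only a sketch, under the review's own caveat that `fcntl` is POSIX-only (Windows would need `msvcrt.locking` or a portable lock); checking for the header only after the lock is held also closes the duplicate-header race:

```python
import csv
import fcntl  # POSIX only; not available on Windows
import os
from pathlib import Path

def append_to_enum_cache(cache_file: Path, curie: str) -> None:
    """Append one CURIE under an exclusive lock so concurrent writers
    cannot interleave rows or emit duplicate headers.
    """
    with open(cache_file, "a", newline="") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # released when f closes
        # Decide whether to write the header only after the lock is
        # held, so another process cannot have written it in between.
        write_header = os.fstat(f.fileno()).st_size == 0
        writer = csv.DictWriter(f, fieldnames=["curie"])
        if write_header:
            writer.writeheader()
        writer.writerow({"curie": curie})
```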
---
### Pull request overview
This PR adds configurable caching strategies for dynamic enum validation, addressing scalability concerns for large ontologies. The implementation introduces progressive (default, lazy validation) and greedy (upfront expansion) modes, with progressive scaling well for large ontologies like SNOMED (100k+ terms) while greedy provides deterministic behavior suitable for CI/CD. The cache format was also changed from JSON to CSV for better git-friendliness and human readability.
Key changes:
- Added `CacheStrategy` enum with `PROGRESSIVE` and `GREEDY` modes
- Implemented lazy validation in progressive mode (checks cache → ontology → adds to cache)
- Added strict mode validation to catch fabricated/non-existent term IDs
- Changed enum cache format from JSON to CSV
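Because the cache is now a one-column CSV with a `curie` header, it reads back with the stdlib `csv` module alone (the function name here is hypothetical, not the package's actual API):

```python
import csv
from pathlib import Path

def load_enum_cache(cache_file: Path) -> set[str]:
    """Load cached CURIEs from a one-column CSV with a `curie` header."""
    if not cache_file.exists():
        return set()
    with open(cache_file, newline="") as f:
        return {row["curie"] for row in csv.DictReader(f)}
```

The same file also loads directly with `pandas.read_csv`, which is the pandas-compatibility point made in the PR summary below.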
### Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_plugins.py | Extensive test coverage for strict/lenient modes, cache strategies, and dynamic enum validation with bindings |
| tests/test_cli.py | Added test verifying --lenient option appears in CLI help |
| src/linkml_term_validator/plugins/dynamic_enum_plugin.py | Refactored to support both progressive and greedy caching strategies for dynamic enum validation |
| src/linkml_term_validator/plugins/binding_plugin.py | Added strict mode, dynamic enum closure validation, and term existence checking |
| src/linkml_term_validator/plugins/base.py | Added enum caching infrastructure, progressive validation, and dynamic enum expansion methods |
| src/linkml_term_validator/models.py | Added CacheStrategy enum and cache_strategy field to ValidationConfig |
| src/linkml_term_validator/cli.py | Added --cache-strategy and --lenient CLI options |
| notebooks/02_advanced_usage.ipynb | Added tutorial section demonstrating cache strategies |
| mkdocs.yml | Added new how-to guide for binding validation |
| docs/howto-bindings.md | New comprehensive guide for validating bindings with dynamic enums |
| docs/configuration.md | Updated with cache strategy configuration examples |
| docs/cli-reference.md | Added documentation for new CLI options |
| docs/caching.md | Expanded with detailed cache strategy explanations and recommendations |
| docs/binding-validation.md | Updated with strict mode documentation |
| docs/anti-hallucination.md | Minor formatting improvements |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
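The strict/lenient distinction noted above can be illustrated with a toy checker; the names are hypothetical, not `binding_plugin.py`'s real API. The point of strict mode is to distinguish a term that exists but falls outside the enum from an ID that does not exist in the ontology at all (a likely fabrication):

```python
from enum import Enum

class Mode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"

def check_binding(curie: str, ontology_terms: set[str],
                  enum_members: set[str], mode: Mode) -> list[str]:
    """Return validation messages for one bound CURIE (toy sketch)."""
    if curie in enum_members:
        return []
    # Strict mode additionally checks that the term exists at all.
    if mode is Mode.STRICT and curie not in ontology_terms:
        return [f"{curie} does not exist in the ontology (possible fabricated ID)"]
    return [f"{curie} is not a member of the dynamic enum"]
```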
---
Claude finished @cmungall's task - View job

### Pull Request Code Review (second run)

A second automated review repeated the overall assessment, strengths, performance analysis, and issues 1-3 from the review above, adding one further finding:

**4. Minor - Misleading Comment (Low Priority)**

Location: as noted by copilot-pull-request-reviewer, this comment is misleading:

```python
return False  # Term doesn't exist
```

---
### Summary

- Adds a `--cache-strategy` CLI option and `cache_strategy` config support

### Details

**Progressive Caching (Default)**

**Greedy Caching**
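The greedy strategy can be sketched as: expand the full closure in one ontology query, write every member to the CSV cache upfront, then validate by plain set membership. The names and the closure callback are hypothetical, not the package's actual API:

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

def write_greedy_cache(cache_file: Path,
                       fetch_closure: Callable[[], Iterable[str]]) -> set[str]:
    """Greedy strategy sketch: one ontology query expands the whole
    enum, so later validation is a deterministic O(1) membership test
    (the determinism argued for CI/CD use above).
    """
    members = sorted(set(fetch_closure()))  # sorted rows => stable git diffs
    with open(cache_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["curie"])
        writer.writeheader()
        for curie in members:
            writer.writerow({"curie": curie})
    return set(members)
```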
**Cache Format**

CSV files with a `curie` header - pandas-compatible, grep-able, git-diff-friendly:

```csv
curie
GO:0008150
GO:0007049
```

### Test plan
🤖 Generated with Claude Code