Skip to content

feat: add automated cache corruption detection and recovery #67

@senomorf

Description

@senomorf

Summary

Implement robust cache validation and automatic rebuilding mechanisms to handle corrupted state files and ensure reliable cache operation.

Priority: Medium
Origin: Claude PR review recommendations from PR #58 - "Error Handling Enhancement"

Problem Statement

The current state management caching system (introduced in PR #58) lacks comprehensive corruption detection and recovery mechanisms, which could lead to:

  • Silent failures when cache files are corrupted
  • Inconsistent state between cache and actual Oracle Cloud resources
  • Manual intervention required for cache issues
  • Reduced reliability of the optimization system

Proposed Solution

Core Features

1. Cache Integrity Validation

Implement comprehensive validation for cache files:

validate_and_repair_cache() {
    local state_file="$1"
    
    if ! validate_state_file "$state_file"; then
        log_warning "Cache corrupted, rebuilding from OCI API"
        rebuild_cache_from_api
        return $?
    fi
    
    return 0
}

2. Automated Recovery System

  • Detect corruption during cache load operations
  • Automatically rebuild cache from Oracle Cloud API
  • Graceful fallback to existing instance creation logic
  • Preserve cache history for debugging purposes

3. Enhanced Validation Checks

  • JSON schema validation for state files
  • Timestamp consistency verification
  • OCID format validation
  • File integrity checksums

4. Error Logging and Monitoring

  • Detailed corruption event logging
  • Integration with existing notification system
  • Recovery attempt tracking
  • Performance impact monitoring

Technical Implementation

File Structure

scripts/
├── state-manager.sh          # Enhanced with validation functions
├── cache-validator.sh        # New: Cache validation utilities
└── utils.sh                 # Enhanced with recovery logging

Key Functions

# Validate cache file integrity
validate_state_file() {
    local state_file="$1"
    local checksum_file="$1.checksum"
    
    # Check file existence and permissions
    # Validate JSON structure
    # Verify timestamps and OCIDs
    # Check file integrity with checksums
}

# Rebuild cache from Oracle API
rebuild_cache_from_api() {
    log_info "Rebuilding cache from Oracle Cloud API..."
    
    # Query current instance state from OCI
    # Reconstruct cache file with current data
    # Update timestamps and metadata
    # Generate new integrity checksums
}

# Create cache backup before operations
backup_cache_state() {
    local state_file="$1"
    local backup_dir=".cache/oci-state/backups"
    
    # Create timestamped backup
    # Maintain limited backup history
    # Clean up old backups based on retention policy
}

# Restore from backup if available
restore_cache_from_backup() {
    local state_file="$1"
    local backup_dir=".cache/oci-state/backups"
    
    # Find most recent valid backup
    # Validate backup integrity
    # Restore to main cache location
}

Validation Schema

# JSON schema for state file validation
CACHE_SCHEMA='{
    "type": "object",
    "required": ["version", "timestamp", "region_hash", "instances"],
    "properties": {
        "version": {"type": "string", "pattern": "^v[0-9]+$"},
        "timestamp": {"type": "integer", "minimum": 0},
        "region_hash": {"type": "string", "minLength": 8, "maxLength": 8},
        "instances": {"type": "object"}
    }
}'

Integration Points

1. State Manager Integration

Enhance existing state-manager.sh functions:

  • Add validation calls before cache operations
  • Implement recovery triggers on validation failures
  • Update cache save operations with checksums

2. Workflow Integration

Update GitHub Actions workflow:

  • Add cache validation step after restoration
  • Include recovery status in workflow outputs
  • Implement cache backup creation before operations

3. Error Handling Enhancement

Extend existing error classification:

CACHE_CORRUPTION: "corruption|invalid.*json|checksum.*mismatch"
CACHE_RECOVERY: "rebuild|restore|repair"

Acceptance Criteria

Must Have

  • Cache file validation on every load operation
  • Automatic rebuilding when corruption detected
  • JSON schema validation for state files
  • Integration with existing error handling patterns
  • Preservation of existing performance optimizations

Should Have

  • Cache backup system with configurable retention
  • Integrity checksums for corruption detection
  • Recovery attempt tracking and logging
  • Integration with Telegram notification system
  • Configurable validation strictness levels

Could Have

  • Multiple recovery strategies (backup, API rebuild, manual)
  • Cache corruption pattern analysis
  • Preventive validation during workflow execution
  • Recovery performance metrics

Implementation Guidelines

Follow Existing Patterns

  • Maintain consistency with CLAUDE.md performance requirements
  • Use established logging patterns from utils.sh
  • Follow error classification conventions
  • Preserve 55-second timeout protection

Security Considerations

  • Validate all JSON input before processing
  • Secure backup file permissions (600)
  • No sensitive data in validation logs
  • Input sanitization for all external data

Testing Requirements

  • Unit tests for validation functions
  • Integration tests with corrupted cache scenarios
  • Recovery mechanism validation
  • Performance impact measurement (<2% overhead)

Success Metrics

  • Corruption Detection Rate: 100% of corrupted files detected
  • Recovery Success Rate: >95% automatic recovery success
  • Performance Impact: <2% additional overhead
  • Manual Intervention Rate: <1% requiring manual intervention

Related Issues/PRs

Research Sources

Based on:

  • Claude's specific recommendation for cache corruption recovery
  • Industry best practices for cache integrity management
  • GitHub Actions cache reliability patterns
  • Oracle Cloud API error handling standards

Implementation Notes

This addresses Claude's specific suggestion:

"Consider adding cache corruption recovery: validate_and_repair_cache() function"

Key benefits:

  • Increased reliability of the caching optimization
  • Reduced manual intervention requirements
  • Better error visibility and debugging capabilities
  • Maintained performance while adding robustness

Testing Strategy

  1. Corruption Simulation: Create tests with various corruption scenarios
  2. Recovery Validation: Verify automatic rebuilding from OCI API
  3. Performance Testing: Ensure validation overhead stays minimal
  4. Integration Testing: Test with existing state management workflows

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions