
Measure and optimize validation performance #58

@Sam-Bolling

Description


Problem

The validation system (geojson-validator.ts, swe-validator.ts, constraint-validator.ts, and sensorml-validator.ts; roughly 1,384 lines covered by 152 tests across three validation systems) has no performance benchmarking despite its functional test coverage. This means:

  • No validation overhead data: Unknown cost of enabling validation (options.validate = true)
  • No validator comparison: Unknown which validator is fastest/slowest (GeoJSON vs SWE vs SensorML)
  • No constraint cost: Unknown overhead of deep constraint validation (intervals, patterns, significant figures)
  • No throughput data: Unknown how many features can be validated per second
  • No scaling data: Unknown how validation performance scales with collection size or nesting depth
  • No optimization data: Cannot make informed decisions about validation strategies

Real-World Impact:

  • Request validation: Should validation always be enabled? Performance cost unknown
  • Batch processing: Validating 1,000+ features - acceptable latency unknown
  • Strict vs permissive mode: Performance difference unknown
  • Embedded devices: CPU overhead must stay within limits
  • Server-side: Validation throughput affects scalability and cost
  • Constraint validation: Deep validation cost unknown (intervals, patterns, significant figures)

Context

This issue was identified during the comprehensive validation conducted January 27-28, 2026.

Related Validation Issues: #12 (GeoJSON Validation), #14 (SWE Common Validation), #15 (SensorML Validation)

Work Item ID: 35 from Remaining Work Items

Repository: https://github.com/OS4CSAPI/ogc-client-CSAPI

Validated Commit: a71706b9592cad7a5ad06e6cf8ddc41fa5387732

Detailed Findings

1. No Performance Benchmarks Exist

Evidence from validation issues:

Issue #12 (GeoJSON): 61 tests, 40.95% coverage (claimed 97.4%)

Issue #14 (SWE Common): 78 tests (50 swe-validator + 28 constraint-validator), 73.68% coverage (claimed 100%)

Issue #15 (SensorML): 13 tests, coverage data not available

Total: 152 tests across 3 validation systems

Current Situation:

  • ✅ Excellent functional tests (152 tests total)
  • ✅ Comprehensive validation logic
  • ❌ ZERO performance measurements (no ops/sec, latency, or overhead data)
  • ❌ No throughput benchmarks
  • ❌ No validation overhead analysis
  • ❌ No constraint validation cost data

2. Three Validation Systems (Performance Patterns Unknown)

From Issue #12, #14, #15 validation reports:

GeoJSON Validator (376 lines, 61 tests, 40.95% coverage):

  • Pattern: Manual property checking, no schema validation
  • Features: 7 feature type validators, collection validators (0% coverage)
  • Complexity: Simple type checking + required property validation
  • Unknown: Overhead per feature, collection validation performance

Feature validation example:

export function validateSystemFeature(data: unknown): ValidationResult {
  const errors: string[] = [];
  
  if (!isFeature(data)) {
    errors.push('Object is not a valid GeoJSON Feature');
    return { valid: false, errors };
  }
  
  if (!hasCSAPIProperties(data.properties)) {
    errors.push('Missing required CSAPI properties (featureType, uid)');
    return { valid: false, errors };
  }
  
  const props = data.properties as any;
  if (props.featureType !== 'System') {
    errors.push(`Expected featureType 'System', got '${props.featureType}'`);
  }
  
  return { valid: errors.length === 0, errors: errors.length > 0 ? errors : undefined };
}

Performance Questions:

  • How many features/sec can be validated?
  • Does collection validation scale linearly with feature count (currently untested)?
  • What's the cost of each property check?

SWE Common Validator (357 lines swe-validator + 312 lines constraint-validator, 78 tests, 73.68% coverage):

  • Pattern: Component-specific validation + optional deep constraint validation
  • Features: 9 component validators, 6 constraint validators
  • Complexity: Type checking + UoM validation + interval checking + pattern matching + significant figures
  • Unknown: Constraint validation overhead, recursive validation cost

Component validation example:

export function validateQuantity(data: unknown, validateConstraints = true): ValidationResult {
  const errors: ValidationError[] = [];
  
  if (!hasDataComponentProperties(data)) {
    errors.push({ message: 'Missing required DataComponent properties' });
    return { valid: false, errors };
  }
  
  const component = data as any;
  
  if (component.type !== 'Quantity') {
    errors.push({ message: `Expected type 'Quantity', got '${component.type}'` });
  }
  
  if (!component.uom) {
    errors.push({ message: 'Missing required property: uom' });
  }
  
  // Perform deep constraint validation if value is present
  if (validateConstraints && component.value !== undefined && component.value !== null && errors.length === 0) {
    const constraintResult = validateQuantityConstraint(component as QuantityComponent, component.value);
    if (!constraintResult.valid && constraintResult.errors) {
      errors.push(...constraintResult.errors);
    }
  }
  
  return { valid: errors.length === 0, errors: errors.length > 0 ? errors : undefined };
}

Constraint Validation Example:

export function validateQuantityConstraint(
  component: QuantityComponent | QuantityRangeComponent,
  value: number
): ValidationResult {
  if (!component.constraint) {
    return { valid: true };
  }
  
  const errors: ValidationError[] = [];
  const { intervals, values: allowedValues, significantFigures } = component.constraint;
  
  // Check interval constraints
  if (intervals && intervals.length > 0) {
    const inAnyInterval = intervals.some(([min, max]) => value >= min && value <= max);
    if (!inAnyInterval) {
      errors.push({
        path: 'value',
        message: `Value ${value} is outside allowed intervals: ${JSON.stringify(intervals)}`,
      });
    }
  }
  
  // Check significant figures constraint
  if (significantFigures !== undefined && significantFigures > 0) {
    const actualSigFigs = getSignificantFigures(value);
    if (actualSigFigs > significantFigures) {
      errors.push({
        path: 'value',
        message: `Value ${value} has ${actualSigFigs} significant figures, maximum allowed is ${significantFigures}`,
      });
    }
  }
  
  return errors.length > 0 ? { valid: false, errors } : { valid: true };
}

Performance Questions:

  • What's the cost of constraint validation? 10%? 50%? 100% overhead?
  • Is interval checking expensive (array iteration)?
  • Is significant figures calculation expensive (string manipulation)?
  • Is pattern/regex validation expensive?
  • Should validateConstraints default to true or false?

SensorML Validator (339 lines, 13 tests, coverage N/A):

  • Pattern: Hierarchical validation (type-specific → AbstractProcess → DescribedObject)
  • Features: 4 process type validators, deployment validator, derived property validator
  • Complexity: Deep nesting (PhysicalSystem → AbstractPhysicalProcess → AbstractProcess → DescribedObject)
  • Unknown: Hierarchical validation overhead, async overhead

Process validation example:

export async function validateSensorMLProcess(
  process: SensorMLProcess
): Promise<ValidationResult> {
  const errors: string[] = [];
  const warnings: string[] = [];
  
  try {
    if (!process.type) {
      errors.push('Missing required property: type');
    }
    
    switch (process.type) {
      case 'PhysicalSystem':
        validatePhysicalSystem(process as any, errors, warnings);
        break;
      case 'PhysicalComponent':
        validatePhysicalComponent(process as any, errors, warnings);
        break;
      case 'SimpleProcess':
        validateSimpleProcess(process as any, errors, warnings);
        break;
      case 'AggregateProcess':
        validateAggregateProcess(process as any, errors, warnings);
        break;
      default:
        errors.push(`Unknown process type: ${(process as any).type}`);
    }
    
    validateDescribedObject(process, errors, warnings);
  } catch (error) {
    errors.push(`Validation error: ${error}`);
  }
  
  return {
    valid: errors.length === 0,
    errors: errors.length > 0 ? errors : undefined,
    warnings: warnings.length > 0 ? warnings : undefined,
  };
}

Hierarchical Validation Example:

function validatePhysicalSystem(system: any, errors: string[], warnings: string[]): void {
  validateAbstractPhysicalProcess(system, errors, warnings);  // Parent validator
  
  if (system.components && !Array.isArray(system.components)) {
    errors.push('components must be an array');
  }
  
  if (system.connections && !Array.isArray(system.connections)) {
    errors.push('connections must be an array');
  }
  
  if (system.components && system.components.length === 0) {
    warnings.push('PhysicalSystem has no components');
  }
}

Performance Questions:

  • What's the cost of hierarchical validation (4 levels deep)?
  • Is async overhead significant (all functions return Promises)?
  • How many property checks per validation?
  • Should synchronous validation be offered for performance?

3. Unknown Validation Strategy Performance

From parser integration (Issue #10):

Optional validation during parsing:

parse(data: unknown, options: ParserOptions = {}): ParseResult<T> {
  // ... parsing logic ...
  
  // Validate if requested
  if (options.validate) {
    const validationResult = this.validate(parsed, format.format);
    if (!validationResult.valid) {
      errors.push(...(validationResult.errors || []));
      if (options.strict) {
        throw new CSAPIParseError(`Validation failed: ${errors.join(', ')}`, format.format);
      }
    }
    warnings.push(...(validationResult.warnings || []));
  }
  
  return { data: parsed, format, errors, warnings };
}

Three Validation Strategies:

  1. No validation: options.validate = false (default for performance?)
  2. Permissive validation: options.validate = true, options.strict = false (collect errors, don't throw)
  3. Strict validation: options.validate = true, options.strict = true (throw on first error)
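
A minimal sketch of how these strategies map onto the parser options shown above (the ParserOptions shape here is an assumption for illustration):

// Hypothetical sketch: selecting a validation strategy via ParserOptions.
interface ParserOptions {
  validate?: boolean;
  strict?: boolean;
}

const noValidation: ParserOptions = {};                          // 1. skip validation entirely
const permissive: ParserOptions = { validate: true };            // 2. validate, collect errors
const strictMode: ParserOptions = { validate: true, strict: true }; // 3. validate, throw on first failure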

Performance Questions:

  • What's the overhead of each strategy?
  • Is strict mode faster (early return)?
  • Should validation default to enabled or disabled?
  • When should users enable validation (dev vs prod)?

4. Unknown Collection Validation Performance

From Issue #12:

Collection validators exist but have 0% coverage:

  • validateSystemFeatureCollection() - 0 calls
  • validateDeploymentFeatureCollection() - 0 calls
  • All 7 collection validators - 0 invocations

Collection Validation Pattern:

export function validateSystemFeatureCollection(
  data: unknown
): ValidationResult {
  const errors: string[] = [];
  
  if (!isFeatureCollection(data)) {
    errors.push('Object is not a valid GeoJSON FeatureCollection');
    return { valid: false, errors };
  }
  
  const collection = data as FeatureCollection;
  const features = collection.features || [];
  
  features.forEach((feature: unknown, index: number) => {
    const result = validateSystemFeature(feature);
    if (!result.valid) {
      errors.push(`Feature at index ${index}: ${result.errors?.join(', ')}`);
    }
  });
  
  return { valid: errors.length === 0, errors: errors.length > 0 ? errors : undefined };
}

Performance Questions:

  • Does validation scale linearly with collection size?
  • At what collection size does it become slow?
  • Should large collections be validated in chunks?
  • What's the memory overhead (error accumulation)?

5. Unknown Constraint Validation Cost

From Issue #14:

6 constraint validators implemented:

  • validateQuantityConstraint: Interval checking, discrete values, significant figures
  • validateCountConstraint: Integer intervals, discrete values
  • validateTextConstraint: Pattern/regex matching, token lists
  • validateCategoryConstraint: Token list matching
  • validateTimeConstraint: Temporal intervals, ISO 8601 parsing
  • validateRangeConstraint: Range endpoint validation, min ≤ max checking

Significant Figures Algorithm:

function getSignificantFigures(value: number): number {
  if (value === 0) return 1;
  if (!isFinite(value)) return Infinity;
  
  // Convert to string and remove leading zeros and decimal point
  const str = Math.abs(value).toString();
  const normalized = str.replace(/^0+\.?0*/, '').replace('.', '');
  
  return normalized.length;
}

Pattern/Regex Validation:

if (pattern && typeof pattern === 'string') {
  try {
    const regex = new RegExp(pattern);
    if (!regex.test(value)) {
      errors.push({
        path: 'value',
        message: `Text value '${value}' does not match required pattern: ${pattern}`,
      });
    }
  } catch (e) {
    errors.push({
      path: 'constraint.pattern',
      message: `Invalid regex pattern: ${pattern}`,
    });
  }
}

Performance Questions:

  • How expensive is significant figures calculation (string manipulation)?
  • How expensive is regex compilation and matching?
  • Should regex patterns be compiled once and cached?
  • What's the cost of interval checking (array iteration)?
  • What's the cost of ISO 8601 datetime parsing?

6. Unknown Recursive Validation Performance

From Issue #14:

Minimal aggregate validation:

  • DataRecord: Checks fields array exists, doesn't validate nested components
  • DataArray: Checks elementCount/elementType exist, doesn't validate elementType structure
  • No automatic recursive validation

However, tests show nested validation works when called manually:

it('should recursively parse deeply nested structures', () => {
  const nested = {
    type: 'DataRecord',
    fields: [
      {
        name: 'innerRecord',
        component: {
          type: 'DataRecord',
          fields: [
            {
              name: 'quantity',
              component: {
                type: 'Quantity',
                uom: { code: 'Cel' },
              },
            },
          ],
        },
      },
    ],
  };
  
  const result = parseDataRecordComponent(nested);  // Recursive parsing + validation
  // ...
});

Performance Questions:

  • How deep can nesting go before performance degrades?
  • What's the overhead of recursive validation calls?
  • Should there be a maximum depth limit?
  • How does depth affect memory usage (call stack)?

7. No Optimization History

No Baseline Data:

  • Cannot track validation performance regressions when adding features
  • Cannot validate optimization attempts
  • Cannot compare validation strategies
  • Cannot document validation overhead for users
  • Cannot decide when to enable/disable validation

8. Validation System Context

From Issues #12, #14, #15:

GeoJSON Validator (376 lines, 61 tests, 40.95% coverage):

  • ✅ 7 feature type validators (System, Deployment, Procedure, SamplingFeature, Property, Datastream, ControlStream)
  • ✅ Collection validators (0% coverage but exist)
  • ❌ No geometry validation (claimed but not implemented)
  • ❌ No link validation (claimed but not implemented)
  • ❌ No temporal validation (claimed but not implemented)
  • Validation approach: Manual property checking

SWE Common Validator (669 lines, 78 tests, 73.68% coverage):

  • ✅ 9 component validators (Quantity, Count, Text, Category, Time, RangeComponent, DataRecord, DataArray, ObservationResult)
  • ✅ 6 constraint validators (intervals, patterns, significant figures, tokens)
  • ❌ 8 claimed validators don't exist (including Boolean, Vector, Matrix, DataStream, DataChoice, and Geometry)
  • ❌ No automatic nested validation (requires manual recursive calls)
  • Validation approach: Type checking + optional deep constraint validation

SensorML Validator (339 lines, 13 tests, coverage N/A):

  • ✅ 4 process type validators (PhysicalSystem, PhysicalComponent, SimpleProcess, AggregateProcess)
  • ✅ Deployment validator, DerivedProperty validator
  • ✅ Hierarchical validation (4 levels deep)
  • ⚠️ Ajv configured but not used (structural validation instead)
  • Validation approach: Hierarchical type checking

Total: ~1,384 lines of validation code, 152 tests

Proposed Solution

1. Establish Benchmark Infrastructure (DEPENDS ON #55)

PREREQUISITE: This work item REQUIRES the benchmark infrastructure from work item #32 (Issue #55) to be completed first.

Once benchmark infrastructure exists:

2. Create Comprehensive Validation Benchmarks

Create benchmarks/validation.bench.ts (~800-1,200 lines) with the following suites (a minimal Tinybench sketch follows these lists):

GeoJSON Validation Benchmarks:

  • All 7 feature types (System, Deployment, Procedure, SamplingFeature, Property, Datastream, ControlStream)
  • Single feature validation (baseline)
  • Collection validation (10, 100, 1,000 features)
  • Invalid feature validation (error path)
  • Property checking overhead

SWE Common Validation Benchmarks:

  • Simple components (Quantity, Count, Text, Category, Time)
  • With constraints vs without constraints
  • Interval checking (1 interval, 5 intervals, 10 intervals)
  • Pattern/regex validation (simple, complex patterns)
  • Significant figures calculation (various precisions)
  • Nested DataRecord validation (1 level, 2 levels, 3 levels deep)
  • DataArray validation

SensorML Validation Benchmarks:

  • All 4 process types (PhysicalSystem, PhysicalComponent, SimpleProcess, AggregateProcess)
  • Hierarchical validation overhead (4 levels deep)
  • Async vs sync overhead (measure Promise overhead)
  • Deployment validation
  • DerivedProperty validation (URI validation)

Validation Strategy Benchmarks:

  • No validation (baseline)
  • Permissive validation (collect errors)
  • Strict validation (throw on error)
  • Compare overhead for each strategy

Constraint Validation Benchmarks:

  • Quantity with no constraints (baseline)
  • Quantity with intervals
  • Quantity with significant figures
  • Text with pattern matching
  • Text with token list
  • Time with temporal intervals

Collection Scaling Benchmarks:

  • Single feature (baseline)
  • 10 features
  • 100 features
  • 1,000 features
  • 10,000 features
  • Test all three validators: GeoJSON, SWE, SensorML
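
A minimal Tinybench sketch of how one of these suites could be structured, assuming the Tinybench setup from #55 and the GeoJSON validator exports referenced above (fixture values and the relative import path are illustrative):

// Illustrative sketch only: throughput benchmark shape for validation.
import { Bench } from 'tinybench';
import {
  validateSystemFeature,
  validateSystemFeatureCollection,
} from '../src/ogc-api/csapi/validation/geojson-validator';

const systemFeature = {
  type: 'Feature',
  geometry: { type: 'Point', coordinates: [0, 0] },
  properties: { featureType: 'System', uid: 'urn:example:system:1' },
};
const collection100 = {
  type: 'FeatureCollection',
  features: Array.from({ length: 100 }, () => systemFeature),
};

const bench = new Bench({ time: 500 });
bench
  .add('GeoJSON: single System feature', () => validateSystemFeature(systemFeature))
  .add('GeoJSON: 100-feature collection', () => validateSystemFeatureCollection(collection100));

await bench.run();
console.table(bench.table()); // ops/sec, margin, and sample counts per task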

3. Create Memory Usage Benchmarks

Create benchmarks/validation-memory.bench.ts (~200-300 lines) with the following measurements (a minimal sketch follows these lists):

Memory per Validation:

  • Single GeoJSON feature
  • Single SWE Quantity (simple)
  • Single SWE DataRecord (nested, 3 levels)
  • Single SensorML PhysicalSystem

Memory Scaling:

  • 100 features: total memory, average per feature
  • 1,000 features: total memory, GC pressure
  • 10,000 features: total memory, heap usage

Error Accumulation Memory:

  • Validation with 0 errors (baseline)
  • Validation with 10 errors
  • Validation with 100 errors
  • Validation with 1,000 errors (collection)
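
A minimal sketch of the memory measurement approach, using Node's process.memoryUsage() as listed in the technology stack (helper and fixture names are assumptions; run with node --expose-gc for more stable numbers):

// Illustrative sketch only: average heap growth per validation call.
import { validateSystemFeature } from '../src/ogc-api/csapi/validation/geojson-validator';

const feature = {
  type: 'Feature',
  geometry: { type: 'Point', coordinates: [0, 0] },
  properties: { featureType: 'System', uid: 'urn:example:system:1' },
};

function heapPerCall(iterations: number, fn: () => void): number {
  (globalThis as { gc?: () => void }).gc?.(); // settle the heap first (requires --expose-gc)
  const before = process.memoryUsage().heapUsed;
  for (let i = 0; i < iterations; i++) fn();
  const after = process.memoryUsage().heapUsed;
  return (after - before) / iterations;       // average bytes per call
}

const bytes = heapPerCall(10_000, () => validateSystemFeature(feature));
console.log(`GeoJSON System feature: ~${(bytes / 1024).toFixed(2)} KB per validation`);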

4. Analyze Benchmark Results

Create benchmarks/validation-analysis.ts (~150-250 lines) with the following analyses (a minimal sketch follows these lists):

Performance Comparison:

  • Validator comparison: GeoJSON vs SWE vs SensorML (fastest vs slowest)
  • Strategy comparison: No validation vs permissive vs strict
  • Constraint comparison: Simple validation vs constraint validation
  • Collection scaling: throughput vs count

Identify Bottlenecks:

  • Operations taking >20% of validation time
  • Operations with >1ms latency per feature
  • Operations whose cost scales worse than linearly with collection size
  • Memory-intensive operations

Generate Recommendations:

  • When to enable/disable validation (dev vs prod)
  • Which validation strategy to use (permissive vs strict)
  • When to enable constraint validation (always vs selective)
  • Maximum practical collection sizes
  • Optimal nesting depth limits
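
A minimal sketch of what the analysis pass could look like, mapping measured latency onto the performance targets proposed later in this issue (the result shape is an assumption about the benchmark output, not an existing format):

// Illustrative sketch only: classify each benchmark against the proposed targets.
interface BenchResult {
  name: string;
  opsPerSec: number;
}

function classify(results: BenchResult[]): void {
  for (const r of results) {
    const msPerOp = 1000 / r.opsPerSec;
    const verdict =
      msPerOp < 0.05 ? 'good' : msPerOp < 0.1 ? 'acceptable' : 'needs investigation';
    console.log(`${r.name}: ${r.opsPerSec.toFixed(0)} ops/sec (${msPerOp.toFixed(3)} ms) -> ${verdict}`);
  }
}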

5. Implement Targeted Optimizations (If Needed)

ONLY if benchmarks identify issues:

Optimization Candidates (benchmark-driven):

  • If property checking slow: Cache type guards
  • If constraint validation expensive: Lazy constraint evaluation
  • If regex slow: Compile and cache regex patterns
  • If collection validation slow: Parallel validation (Web Workers)
  • If hierarchical validation expensive: Flatten validation hierarchy
  • If async overhead significant: Offer synchronous validators

Optimization Guidelines:

  • Only optimize proven bottlenecks (>20% validation overhead or <5,000 validations/sec)
  • Measure before and after (verify improvement)
  • Document tradeoffs (code complexity vs speed gain)
  • Add regression tests (ensure optimization doesn't break functionality)

6. Document Performance Characteristics

Update README.md with new "Validation Performance" section (~150-250 lines):

Performance Overview:

  • Typical validation overhead: X% (by validator)
  • Typical throughput: X validations/sec (by validator)
  • Memory usage: X KB per validation (by validator)

Validator Performance Comparison:

GeoJSON:   ~XX,XXX validations/sec (simplest, fastest)
SWE:       ~XX,XXX validations/sec (YY% slower due to constraint validation)
SensorML:  ~XX,XXX validations/sec (ZZ% slower due to hierarchical validation)

Validation Strategy Overhead:

No validation:    0% overhead (baseline)
Permissive:       XX% overhead (collect errors)
Strict:           XX% overhead (throw on error)

Constraint Validation Overhead:

No constraints:       XX,XXX ops/sec (baseline)
With intervals:       XX,XXX ops/sec (YY% overhead)
With patterns:        XX,XXX ops/sec (ZZ% overhead)
With sig figures:     XX,XXX ops/sec (AA% overhead)

Best Practices:

  • Development: Enable validation with options.validate = true to catch errors early
  • Production: Disable validation for trusted data to maximize performance
  • Constraint validation: Enable only when data quality enforcement is required
  • Collections: Consider chunked validation for >X,XXX features
  • Nesting: Limit SWE DataRecord nesting to X levels for optimal performance

Performance Targets:

  • Good: <5% validation overhead (<0.05ms per feature)
  • Acceptable: <10% validation overhead (<0.1ms per feature)
  • Poor: >20% validation overhead (>0.2ms per feature) - needs optimization

7. Integrate with CI/CD

Add to .github/workflows/benchmarks.yml (coordinate with #55):

Benchmark Execution:

- name: Run validation benchmarks
  run: npm run bench:validation

- name: Run validation memory benchmarks
  run: npm run bench:validation:memory

Performance Regression Detection:

  • Compare against baseline (main branch)
  • Alert if any benchmark >10% slower
  • Alert if memory usage >20% higher
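
A hedged sketch of how the regression check could be scripted (file names and result shape are assumptions; the step would be wired into the workflow from #55):

// Illustrative sketch only: fail CI when any benchmark is >10% slower than the main-branch baseline.
import { readFileSync } from 'node:fs';

interface BenchResult {
  name: string;
  opsPerSec: number;
}

const baseline: BenchResult[] = JSON.parse(readFileSync('baseline.json', 'utf8'));
const current: BenchResult[] = JSON.parse(readFileSync('current.json', 'utf8'));

const regressions = current.filter((c) => {
  const base = baseline.find((b) => b.name === c.name);
  return base !== undefined && c.opsPerSec < base.opsPerSec * 0.9; // >10% slower
});

if (regressions.length > 0) {
  console.error('Performance regressions detected:', regressions.map((r) => r.name).join(', '));
  process.exit(1);
}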

PR Comments:

  • Post benchmark results to PRs
  • Show comparison with base branch
  • Highlight regressions and improvements

Acceptance Criteria

Benchmark Infrastructure (4 items)

  • ✅ Benchmark infrastructure from Add comprehensive performance benchmarking #55 is complete and available
  • Created benchmarks/validation.bench.ts with comprehensive validation benchmarks (~800-1,200 lines)
  • Created benchmarks/validation-memory.bench.ts with memory usage benchmarks (~200-300 lines)
  • Created benchmarks/validation-analysis.ts with results analysis (~150-250 lines)

GeoJSON Validation Benchmarks (5 items)

  • Benchmarked all 7 feature types (System, Deployment, Procedure, SamplingFeature, Property, Datastream, ControlStream)
  • Benchmarked single feature vs collection validation (10, 100, 1,000 features)
  • Benchmarked valid vs invalid feature validation (error paths)
  • Documented throughput for each feature type (validations/sec)
  • Identified fastest and slowest feature types

SWE Common Validation Benchmarks (7 items)

  • Benchmarked simple components (Quantity, Count, Text, Category, Time)
  • Benchmarked validation with vs without constraints
  • Benchmarked interval checking (1, 5, 10 intervals)
  • Benchmarked pattern/regex validation (simple, complex)
  • Benchmarked significant figures calculation
  • Benchmarked nested DataRecord validation (1, 2, 3 levels)
  • Documented constraint validation overhead (% and absolute time)

SensorML Validation Benchmarks (5 items)

  • Benchmarked all 4 process types (PhysicalSystem, PhysicalComponent, SimpleProcess, AggregateProcess)
  • Benchmarked hierarchical validation overhead (4 levels deep)
  • Measured async overhead (Promise creation/resolution)
  • Benchmarked Deployment and DerivedProperty validation
  • Documented hierarchical validation cost

Validation Strategy Benchmarks (4 items)

  • Benchmarked no validation (baseline)
  • Benchmarked permissive validation (collect errors)
  • Benchmarked strict validation (throw on error)
  • Documented strategy overhead (% and absolute time)

Collection Scaling Benchmarks (5 items)

  • Benchmarked collection validation: 10, 100, 1,000, 10,000 features
  • Tested all three validators at each scale
  • Documented scaling characteristics: linear/sublinear/superlinear
  • Identified performance inflection points (when it becomes slow)
  • Documented maximum practical collection size

Constraint Validation Benchmarks (6 items)

  • Benchmarked Quantity: no constraints, intervals, significant figures
  • Benchmarked Text: no constraints, pattern, token list
  • Benchmarked Time: no constraints, temporal intervals
  • Benchmarked Count: no constraints, integer intervals
  • Benchmarked Category: no constraints, token list
  • Documented constraint validation overhead per type

Memory Benchmarks (5 items)

  • Measured memory per validation (GeoJSON, SWE simple, SWE nested, SensorML)
  • Measured memory scaling (100, 1,000, 10,000 validations)
  • Measured error accumulation memory (0, 10, 100, 1,000 errors)
  • Measured GC pressure for large collections
  • Documented memory recommendations

Performance Analysis (5 items)

  • Analyzed all benchmark results
  • Identified bottlenecks (operations >20% of validation time or >1ms per feature)
  • Generated performance comparison report (validator, strategy, constraint)
  • Created recommendations document (when to enable/disable, strategy choice)
  • Documented current performance characteristics

Optimization (if needed) (4 items)

  • Identified optimization opportunities from benchmark data
  • Implemented targeted optimizations ONLY for proven bottlenecks
  • Re-benchmarked after optimization (verified improvement)
  • Added regression tests to prevent optimization from breaking functionality

Documentation (7 items)

  • Added "Validation Performance" section to README.md (~150-250 lines)
  • Documented typical validation overhead (% by validator)
  • Documented typical throughput (validations/sec by validator)
  • Documented memory usage (KB per validation by validator)
  • Documented validator performance comparison (GeoJSON vs SWE vs SensorML)
  • Documented validation strategy overhead (no validation vs permissive vs strict)
  • Documented best practices (when to enable, strategy choice, collection sizes)

CI/CD Integration (4 items)

  • Added validation benchmarks to .github/workflows/benchmarks.yml
  • Configured performance regression detection (>10% slower = fail)
  • Added PR comment with benchmark results and comparison
  • Verified benchmarks run on every PR and main branch commit

Implementation Notes

Files to Create

Benchmark Files (~1,150-1,750 lines total):

  1. benchmarks/validation.bench.ts (~800-1,200 lines)

    • GeoJSON validation benchmarks (7 feature types × 3 scenarios)
    • SWE Common validation benchmarks (5 component types × constraint variations)
    • SensorML validation benchmarks (4 process types × hierarchical levels)
    • Validation strategy benchmarks (3 strategies)
    • Constraint validation benchmarks (6 constraint types)
    • Collection scaling benchmarks (5 sizes × 3 validators)
  2. benchmarks/validation-memory.bench.ts (~200-300 lines)

    • Memory per validation (4 validator types)
    • Memory scaling (3 sizes)
    • Error accumulation memory (4 error counts)
    • GC pressure analysis
  3. benchmarks/validation-analysis.ts (~150-250 lines)

    • Performance comparison logic
    • Bottleneck identification
    • Recommendation generation
    • Results formatting

Files to Modify

README.md (~150-250 lines added):

  • New "Validation Performance" section with:
    • Performance overview
    • Validator comparison table
    • Strategy overhead table
    • Constraint overhead table
    • Best practices
    • Performance targets

package.json (~10 lines):

{
  "scripts": {
    "bench:validation": "tsx benchmarks/validation.bench.ts",
    "bench:validation:memory": "tsx benchmarks/validation-memory.bench.ts",
    "bench:validation:analyze": "tsx benchmarks/validation-analysis.ts"
  }
}

.github/workflows/benchmarks.yml (coordinate with #55):

  • Add validation benchmark execution
  • Add memory benchmark execution
  • Add regression detection
  • Add PR comment generation

Files to Reference

Validator Source Files (for accurate benchmarking):

  • src/ogc-api/csapi/validation/geojson-validator.ts (376 lines, 61 tests, 40.95% coverage)
  • src/ogc-api/csapi/validation/swe-validator.ts (357 lines, 50 tests, 73.68% coverage)
  • src/ogc-api/csapi/validation/constraint-validator.ts (312 lines, 28 tests)
  • src/ogc-api/csapi/validation/sensorml-validator.ts (339 lines, 13 tests)

Test Fixtures (reuse existing test data):

  • src/ogc-api/csapi/validation/geojson-validator.spec.ts (has sample GeoJSON features)
  • src/ogc-api/csapi/validation/swe-validator.spec.ts (has sample SWE components)
  • src/ogc-api/csapi/validation/constraint-validator.spec.ts (has sample constraints)
  • src/ogc-api/csapi/validation/sensorml-validator.spec.ts (has sample SensorML processes)

Technology Stack

Benchmarking Framework (from #55):

  • Tinybench (statistical benchmarking)
  • Node.js process.memoryUsage() for memory tracking
  • Node.js performance.now() for timing

Benchmark Priorities:

  • High: Validation strategy overhead, GeoJSON validation, constraint validation cost
  • Medium: SWE component validation, collection scaling, SensorML validation
  • Low: Async overhead, extreme nesting (>3 levels), extreme scaling (>10,000)

Performance Targets (Hypothetical - Measure to Confirm)

Validation Overhead:

  • Good: <5% overhead (<0.05ms per feature)
  • Acceptable: <10% overhead (<0.1ms per feature)
  • Poor: >20% overhead (>0.2ms per feature)

Throughput:

  • Good: >20,000 validations/sec (<0.05ms per validation)
  • Acceptable: >10,000 validations/sec (<0.1ms per validation)
  • Poor: <5,000 validations/sec (>0.2ms per validation)

Constraint Validation Overhead:

  • Good: <10% overhead vs no constraints
  • Acceptable: <25% overhead vs no constraints
  • Poor: >50% overhead vs no constraints

Memory:

  • Good: <1 KB per validation
  • Acceptable: <5 KB per validation
  • Poor: >10 KB per validation

Optimization Guidelines

ONLY optimize if benchmarks prove need:

  • Validation overhead >20%
  • Throughput <5,000 validations/sec
  • Memory >10 KB per validation
  • Constraint validation overhead >50%

Optimization Approach:

  1. Identify bottleneck from benchmark data
  2. Profile with Chrome DevTools or Node.js profiler
  3. Implement targeted optimization
  4. Re-benchmark to verify improvement (>20% faster)
  5. Add regression tests
  6. Document tradeoffs

Common Optimizations:

  • Cache type guards and compiled regex patterns
  • Use Set instead of Array for token list checking
  • Early return on invalid data (strict mode)
  • Parallel validation for large collections
  • Synchronous validators (avoid Promise overhead)
  • Lazy constraint evaluation (only when needed)
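
As an example of the first two candidates, a cached-regex helper and Set-based token lookup could look roughly like this, if benchmarks show pattern matching or token checking matters (a sketch, not existing code):

// Illustrative sketch only: compile each constraint pattern once and reuse it.
const regexCache = new Map<string, RegExp>();

function getCachedRegex(pattern: string): RegExp {
  let regex = regexCache.get(pattern);
  if (!regex) {
    regex = new RegExp(pattern); // still throws on invalid patterns, as today
    regexCache.set(pattern, regex);
  }
  return regex;
}

// Token lists: Set.has() is O(1) vs. Array.includes() at O(n).
const allowedTokens = new Set(['air_temperature', 'relative_humidity']);
const isAllowed = allowedTokens.has('air_temperature');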

Dependencies

CRITICAL DEPENDENCY: Add comprehensive performance benchmarking #55 (benchmark infrastructure) must be complete before this work can start.

Why This Dependency Matters: this work item reuses the Tinybench setup, result reporting, regression detection, and CI/CD workflow established by #55; duplicating that infrastructure here would waste effort and fragment the tooling.

Testing Requirements

Benchmark Validation:

  • All benchmarks must run without errors
  • All benchmarks must complete in <60 seconds total
  • All benchmarks must produce consistent results (variance <10%)
  • Memory benchmarks must not cause out-of-memory errors

Regression Tests:

  • Add tests to verify optimizations don't break functionality
  • Rerun all 152 existing validation tests after any optimization
  • Verify coverage does not drop below current levels for any validator

Caveats

Performance is Environment-Dependent:

  • Benchmarks run on specific hardware (document specs)
  • Results vary by Node.js version, CPU, memory
  • Production performance may differ from benchmark environment
  • Document benchmark environment in README

Optimization Tradeoffs:

  • Faster code may be more complex
  • Cached regex patterns increase memory usage
  • Parallel validation adds API complexity
  • Synchronous validators lose flexibility
  • Document all tradeoffs in optimization PRs

Validation Performance Context:

  • GeoJSON likely fastest (simple property checking)
  • SWE likely slowest (constraint validation + nesting)
  • SensorML medium (hierarchical validation)
  • Validation overhead is expected to be under 10% of total parse time
  • Network latency will typically dominate validation cost in end-to-end requests

Priority Justification

Priority: Low

Why Low Priority:

  1. No Known Performance Issues: No user complaints about slow validation
  2. Functional Excellence: Validators work correctly with comprehensive tests (152 tests total)
  3. Not Critical Path: Validation is optional (options.validate), defaults to disabled
  4. Depends on Infrastructure: Cannot start until Add comprehensive performance benchmarking #55 (benchmark infrastructure) is complete
  5. Educational Value: Primarily for documentation and validation strategy guidance

Why Still Important:

  1. Strategy Guidance: Users need to know when to enable/disable validation (dev vs prod)
  2. Regression Prevention: Establish baseline to detect future validation performance degradation
  3. Optimization Guidance: Data-driven decisions about what (if anything) to optimize
  4. Constraint Validation: Understand cost of deep constraint validation
  5. Scaling Guidance: Help users estimate validation overhead for large collections

Impact if Not Addressed:

  • ⚠️ Unknown validation overhead (users can't estimate cost)
  • ⚠️ No baseline for regression detection (can't track performance over time)
  • ⚠️ No optimization guidance (can't prioritize improvements)
  • ⚠️ Unknown constraint validation cost (users don't know if it's worth enabling)
  • ✅ Validators still work correctly (functional quality not affected)
  • ✅ No known performance bottlenecks (no urgency)

Effort Estimate: 10-15 hours (after #55 complete)

  • Benchmark creation: 6-9 hours
  • Memory analysis: 1-2 hours
  • Results analysis: 1-2 hours
  • Documentation: 1-2 hours
  • CI/CD integration: 0.5-1 hour (reuse from Add comprehensive performance benchmarking #55)
  • Optimization (optional, if needed): 2-4 hours

When to Prioritize Higher:

  • If users report slow validation
  • If adding real-time validation features (need performance baseline)
  • If optimizing for embedded/mobile (need overhead data)
  • If validation becomes mandatory (need to minimize overhead)
