@javihern98 (Contributor) commented Jan 23, 2026

Summary

This PR implements tolerance-based comparisons for floating-point numbers in VTL, addressing issue #457, and adds environment variable documentation for issue #458.

Float64 Precision Rationale

IEEE 754 float64 (double precision) has 52 mantissa bits (53 effective with implicit leading 1):

| Property | Value | Meaning |
|---|---|---|
| log10(2^53) | ≈ 15.95 | Maximum significant decimal digits |
| DBL_DIG | 15 | Guaranteed decimal digits for round-trip (decimal → float64 → decimal) |
| DBL_DECIMAL_DIG | 17 | Digits needed for exact float64 → decimal → float64 round-trip |

The valid range for significant digits is 6–15, where:

  • 6 is the minimum practical precision (coarse tolerance)
  • 15 is the maximum guaranteed precision for float64 — beyond this, representation noise appears
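
The figures in the table can be checked from Python's standard library. This is only an illustrative sketch, not code from the PR; the variable names are made up for the example.

```python
import math
import sys

# Effective mantissa width of a float64, including the implicit leading 1.
mantissa_bits = sys.float_info.mant_dig
# Decimal digits guaranteed to survive decimal -> float64 -> decimal (DBL_DIG).
guaranteed_digits = sys.float_info.dig
# Upper bound on the significant decimal digits that 53 bits can carry.
max_digits = math.log10(2 ** mantissa_bits)

# 17 significant digits (DBL_DECIMAL_DIG) are enough to round-trip any float64.
value = 0.1 + 0.2
round_tripped = float(f"{value:.17g}")

print(mantissa_bits, guaranteed_digits, round(max_digits, 2))
# At 15 digits the representation noise is hidden; at 17 it becomes visible.
print(f"{value:.15g}", f"{value:.17g}", round_tripped == value)
```

Formatting `0.1 + 0.2` with 15 significant digits hides the trailing noise, while 17 digits expose it. That is the motivation for capping the range at 15.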

Changes

New Environment Variables

Two new environment variables control the behavior:

  1. COMPARISON_ABSOLUTE_THRESHOLD - Controls tolerance for comparison operators

    • Default: 15 (significant digits, the maximum guaranteed precision for float64)
    • Range: 6-15
    • Set to -1 to disable tolerance (exact comparison; floating-point artifacts may then cause spurious comparison failures or unneeded extra decimals)
  2. OUTPUT_NUMBER_SIGNIFICANT_DIGITS - Controls CSV output formatting

    • Same values as COMPARISON_ABSOLUTE_THRESHOLD (default 15, range 6-15, -1 to disable)
    • Controls float_format parameter in pandas to_csv
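
As a rough sketch of how such a variable could be read, the hypothetical helper below applies the default, the 6-15 range, and the -1 switch described above. The function name `read_significant_digits` and the choice to raise `ValueError` on invalid input are assumptions for illustration; the PR's `_number_config.py` module may behave differently.

```python
import os
from typing import Optional

DEFAULT_SIGNIFICANT_DIGITS = 15
MIN_SIGNIFICANT_DIGITS = 6
MAX_SIGNIFICANT_DIGITS = 15
DISABLED = -1


def read_significant_digits(var_name: str) -> Optional[int]:
    """Return the configured digits, or None when the feature is disabled.

    Hypothetical helper: rejecting invalid values with ValueError is an
    assumption, not confirmed behavior of the PR.
    """
    raw = os.environ.get(var_name)
    if raw is None or raw.strip() == "":
        # Variable not defined: fall back to the default.
        return DEFAULT_SIGNIFICANT_DIGITS
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{var_name} must be an integer, got {raw!r}") from exc
    if value == DISABLED:
        # -1 switches the feature off entirely.
        return None
    if not MIN_SIGNIFICANT_DIGITS <= value <= MAX_SIGNIFICANT_DIGITS:
        raise ValueError(
            f"{var_name} must be {DISABLED} or between "
            f"{MIN_SIGNIFICANT_DIGITS} and {MAX_SIGNIFICANT_DIGITS}, got {value}"
        )
    return value
```

Returning `None` for the disabled case lets callers distinguish "no tolerance" from any valid digit count without a second flag.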

Tolerance Algorithm

Relative tolerance is calculated as: 0.5 * 10^(-(N-1)) where N = significant digits

For the default of 15 significant digits:

  • Relative tolerance = 5e-15
  • Absolute tolerance = relative_tolerance × max(|a|, |b|)

This is the most conservative setting that still filters floating-point precision artifacts, using the full guaranteed precision of float64.
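
The formula above can be sketched in a few lines of Python. This is a minimal illustration under the stated formula, not the PR's implementation; `relative_tolerance` is a hypothetical helper name, and the early `a == b` shortcut is an assumption added so that identical values (including infinities) compare equal.

```python
from typing import Union

Number = Union[int, float]


def relative_tolerance(significant_digits: int) -> float:
    # 0.5 * 10^(-(N-1)), where N is the number of significant digits.
    return 0.5 * 10.0 ** (-(significant_digits - 1))


def numbers_are_equal(a: Number, b: Number, significant_digits: int = 15) -> bool:
    # Assumed shortcut: identical values are equal without any arithmetic.
    if a == b:
        return True
    # Scale the relative tolerance by the larger magnitude.
    abs_tol = relative_tolerance(significant_digits) * max(abs(a), abs(b))
    return abs(a - b) <= abs_tol


print(0.1 + 0.2 == 0.3)                    # exact comparison
print(numbers_are_equal(0.1 + 0.2, 0.3))   # tolerance-based comparison
print(numbers_are_equal(1.0, 1.0 + 1e-12)) # a real difference at 15 digits
```

With 15 digits, the rounding noise in `0.1 + 0.2` is absorbed, while a difference of `1e-12` on a value near 1 is still reported as unequal. Lowering the setting to 10 digits would treat that second pair as equal too.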

Modified Operators

Standard Comparison Operators (Comparison.py):

  • Equal (=)
  • NotEqual (<>)
  • GreaterEqual (>=) — equality checked before strict >
  • LessEqual (<=) — equality checked before strict <
  • Between

Hierarchical Ruleset Operators (HROperators.py):

  • HREqual
  • HRGreaterEqual — equality checked before strict >
  • HRLessEqual — equality checked before strict <

Output Formatting

CSV output now uses float_format="%.{N}g" to limit floating-point precision in output files.
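
To show what a `%.{N}g` format does to individual values, the sketch below applies the same printf-style format string directly in Python, without pandas. `format_number` is a hypothetical helper used only for illustration; in the PR the format string is passed to pandas `to_csv` through its `float_format` parameter.

```python
def format_number(value: float, significant_digits: int = 15) -> str:
    # Build the same format string that would be handed to float_format.
    float_format = f"%.{significant_digits}g"
    return float_format % value


print(format_number(0.1 + 0.2))   # rounding noise beyond 15 digits is dropped
print(format_number(1 / 3))       # limited to 15 significant digits
print(format_number(1 / 3, 6))    # coarser setting
print(format_number(2.5))         # short values are not padded
```

The `g` conversion drops trailing zeros, so values that are already short are written unchanged, and only digits beyond the configured precision are cut.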

Documentation (Issue #458)

New docs/environment_variables.rst page documenting:

  • Number handling variables (COMPARISON_ABSOLUTE_THRESHOLD, OUTPUT_NUMBER_SIGNIFICANT_DIGITS)
  • S3/AWS configuration variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, AWS_ENDPOINT_URL)
  • Usage examples for each scenario
  • Float64 precision rationale

Added to the docs toctree in docs/index.rst.

Claude Code Instructions

New CLAUDE.md file with project-specific instructions for Claude Code, derived from .github/copilot-instructions.md.

Test Coverage

  • Number handling tests refactored to pytest with @pytest.mark.parametrize
  • S3 mock tests updated to expect float_format parameter in to_csv calls
  • All 3463 tests passing
  • Updated test_DEMO1 expected output: now returns 4 rows with real imbalances (filters 35 floating-point artifacts)
  • Strict typing (Union[int, float]) on _numbers_less_equal and _numbers_greater_equal

Breaking Changes

This change affects comparison results for floating-point numbers. Users can adjust the behavior with the following settings:

| Goal | Setting |
|---|---|
| Disable tolerance (exact comparisons) | COMPARISON_ABSOLUTE_THRESHOLD=-1 |
| More lenient tolerance | COMPARISON_ABSOLUTE_THRESHOLD=10 (tolerance ~5e-10) |
| Strictest tolerance (default) | COMPARISON_ABSOLUTE_THRESHOLD=15 (~5e-15, full float64 precision) |

Closes #457
Closes #458

…tput formatting

Implements tolerance-based comparison for Number values in equality operators
and configurable output formatting with significant digits.

Changes:
- Add _number_config.py utility module for reading environment variables
- Modify comparison operators (=, >=, <=, between) to use significant digits
  tolerance for Number comparisons
- Update CSV output to use float_format with configurable significant digits
- Add comprehensive tests for all new functionality

Environment variables:
- COMPARISON_ABSOLUTE_THRESHOLD: Controls comparison tolerance (default: 10)
- OUTPUT_NUMBER_SIGNIFICANT_DIGITS: Controls output formatting (default: 10)

Values:
- None/not defined: Uses default value of 10 significant digits
- 6 to 14: Uses specified number of significant digits
- -1: Disables the feature (uses Python's default behavior)

Closes #457
- Add tolerance-based equality checks to HREqual, HRGreaterEqual, HRLessEqual
- Update test expected output for DEMO1 to reflect new tolerance behavior
  (filtering out floating-point precision errors in check_hierarchy results)
- More conservative tolerance (5e-14 instead of 5e-10)
- DEMO1 test now expects 4 real imbalance rows (filters 35 floating-point artifacts)
- Updated test for numbers_are_equal to use smaller difference
- Add --unsafe-fixes flag to ruff check
- Add mandatory step 3 with all quality checks before creating PR
- Require: ruff format, ruff check --fix --unsafe-fixes, mypy, pytest
javihern98 and others added 3 commits January 23, 2026 17:47
IEEE 754 float64 guarantees 15 significant decimal digits (DBL_DIG=15).
Updated DEFAULT_SIGNIFICANT_DIGITS and MAX_SIGNIFICANT_DIGITS from 14 to 15
to use the full guaranteed precision of double-precision floating point.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The S3 mock tests now expect float_format="%.15g" in to_csv calls,
matching the output formatting behavior added for Number type handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@javihern98 javihern98 marked this pull request as ready for review January 23, 2026 18:06
@javihern98 javihern98 requested review from a team and albertohernandez1995 January 23, 2026 18:06
New docs/environment_variables.rst documenting:
- COMPARISON_ABSOLUTE_THRESHOLD (Number comparison tolerance)
- OUTPUT_NUMBER_SIGNIFICANT_DIGITS (CSV output formatting)
- AWS/S3 environment variables
- Usage examples for each scenario

Includes float64 precision rationale (DBL_DIG=15) explaining
the valid range of 6-15 significant digits.

Closes #458

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@javihern98 javihern98 changed the title from "Fix #457: Handle VTL Number type correctly with tolerance-based comparisons" to "Handle VTL Number type correctly with tolerance-based comparisons" Jan 23, 2026
@javihern98 javihern98 changed the title from "Handle VTL Number type correctly with tolerance-based comparisons" to "Handle VTL Number type correctly with tolerance-based comparisons. Docs updates" Jan 23, 2026
@javihern98
Contributor Author

Do not merge this branch automatically. I will merge it once the Suite team has also checked the changes.

javihern98 and others added 3 commits January 23, 2026 21:16
Ensure tolerance-based equality is evaluated before strict < or >
comparison in _numbers_less_equal and _numbers_greater_equal. Also
tighten parameter types from Any to Union[int, float].

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Inline isinstance checks so mypy can narrow types in the Between
operator. Function signatures were already formatted correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert TestCase classes to plain pytest functions with
@pytest.mark.parametrize for cleaner, more concise test definitions.
Add Claude Code instructions based on copilot-instructions.md.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

  • Add documentation page for Environment Variables
  • Handle VTL Number type correctly in comparison operators and output formatting
