fix(api): validate input before sanitization for security #19

Open
arunsanna wants to merge 4 commits into GenAI-Security-Project:main from arunsanna:fix/validation-order-issue-14

@arunsanna

Summary

Problem

The code was sanitizing input (with html.escape()) BEFORE validating it. This is a security anti-pattern because:

  1. Sanitization can transform malicious input into something that bypasses validation
  2. Example: <script>org/model</script> → &lt;script&gt;org/model&lt;/script&gt; could slip through
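
To make the bypass concrete, here is a minimal sketch. It assumes a deliberately naive validator that only rejects angle brackets; `naive_validator` is a hypothetical stand-in for illustration, not the project's `is_valid_hf_input()`. Only `html.escape()` is taken from the code under review.

```python
import html


def naive_validator(value: str) -> bool:
    """Hypothetical check: reject any value that contains angle brackets."""
    return "<" not in value and ">" not in value


raw = "<script>org/model</script>"
escaped = html.escape(raw)  # '<' and '>' become '&lt;' and '&gt;'

print(escaped)
print("raw passes validation:    ", naive_validator(raw))
print("escaped passes validation:", naive_validator(escaped))
```

Under this assumption the raw string is rejected while the escaped string is accepted, so running the validator on escaped input hides exactly the characters it was meant to catch.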

Solution

Validate the raw user input first, then sanitize after validation:

# BEFORE (wrong order):
sanitized_model_id = html.escape(model_id)
if not is_valid_hf_input(sanitized_model_id):  # Validating sanitized input
    return error_response(...)

# AFTER (correct order):
if not is_valid_hf_input(model_id):            # Validate raw input first
    sanitized_for_display = html.escape(model_id)
    return error_response(...)
sanitized_model_id = html.escape(model_id)     # Then sanitize
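
The corrected order can be sketched as a self-contained function. Everything here except `html.escape()` is an assumption made for illustration: the regex-based `is_valid_hf_input` below is a simplified stand-in for the real validator, and `handle_generate` with its dict return value is hypothetical, not the actual endpoint.

```python
import html
import re

# Assumed pattern for an "org/model" identifier; the real rules may differ.
_MODEL_ID = re.compile(r"[A-Za-z0-9][\w.-]*/[\w.-]+")


def is_valid_hf_input(value: str) -> bool:
    """Simplified stand-in validator that inspects the raw, unescaped string."""
    return _MODEL_ID.fullmatch(value) is not None


def handle_generate(model_id: str) -> dict:
    """Hypothetical handler: validate raw input first, escape only for output."""
    if not is_valid_hf_input(model_id):
        # Escape only because the rejected value is echoed back to the user.
        sanitized_for_display = html.escape(model_id)
        return {"status": 400, "error": f"Invalid model ID: {sanitized_for_display}"}
    # Validation passed: keep the raw value for processing, escaped for display.
    return {"status": 200, "model_id": model_id, "display": html.escape(model_id)}


print(handle_generate("meta-llama/Llama-3.1-8B"))
print(handle_generate("<script>alert(1)</script>"))
```

Keeping the raw and escaped values in separate fields makes it harder to pass display-only text to processing code by accident.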

Test Plan

  • Valid model IDs still generate AIBOM correctly
  • Invalid inputs (e.g., <script>alert(1)</script>) are rejected
  • Server logs confirm validation catches invalid input before sanitization
  • Docker build passes
  • API syntax verified via import test

Fixes GenAI-Security-Project#14 - Input validation order bug

The validation was happening AFTER sanitization, which is a security
issue because sanitization could transform malicious input into
something that passes validation. This commit swaps the order to:

1. Validate the raw user input first (catches malicious patterns)
2. Sanitize after validation (for safe display/processing)

Test results:
- Valid model IDs: Successfully generates AIBOM
- Invalid inputs (e.g., <script>alert(1)</script>): Correctly rejected
- Server logs confirm validation catches invalid input before sanitize
Copilot AI review requested due to automatic review settings January 15, 2026 03:59
@arunsanna
Author

Test Results

Test 1: Valid Model ID

$ curl -X POST http://localhost:7860/generate -d "model_id=meta-llama/Llama-3.1-8B"
# Result: AIBOM generated successfully
# Completeness score: 85/100

Test 2: Invalid Model ID (XSS attempt)

$ curl -X POST http://localhost:7860/generate -d "model_id=<script>alert(1)</script>"
# Result: Error page returned (input rejected)
# Server log: "Invalid model input format received: <script>alert(1)</script>"

Test 3: Server Logs Confirm Validation Works

2026-01-15 03:58:53,306 - src.aibom_generator.api - WARNING - Invalid model input format received: <script>alert(1)</script>

Test 4: API Import Test

$ docker run --rm --entrypoint python aibom-test -c "from src.aibom_generator.api import app; print('api.py imports successfully')"
# Result: api.py imports successfully

All tests confirm the validation now correctly runs BEFORE sanitization, catching malicious input patterns early.

Copilot AI left a comment

Pull request overview

This PR addresses a security anti-pattern by reordering input validation and sanitization in the /generate endpoint. The fix ensures that raw user input is validated first, then sanitized only for display purposes in error responses, preventing potential bypass scenarios where sanitized input might slip through validation.

Changes:

  • Moved validation of model_id to occur before sanitization with html.escape()
  • Introduced a separate sanitized_for_display variable for safe error message rendering
  • Updated comment documentation to clarify the security-driven ordering
Comments suppressed due to low confidence (4)

HF_files/aibom-generator/src/aibom-generator/api.py:867

  • This endpoint (/api/generate) still follows the old pattern of sanitizing before validation. For consistency with the security fix applied to the /generate endpoint, validation should occur before sanitization here as well. This prevents potential edge cases where sanitization could transform malicious input into something that bypasses validation.
        sanitized_model_id = html.escape(request.model_id)
        if not is_valid_hf_input(sanitized_model_id):

HF_files/aibom-generator/src/aibom-generator/api.py:941

  • This endpoint (/api/generate-with-report) still follows the old pattern of sanitizing before validation. For consistency with the security fix applied to the /generate endpoint, validation should occur before sanitization here as well.
        sanitized_model_id = html.escape(request.model_id)
        if not is_valid_hf_input(sanitized_model_id):

HF_files/aibom-generator/src/aibom-generator/api.py:1053

  • This endpoint (/api/models/{model_id:path}/score) still follows the old pattern of sanitizing before validation. For consistency with the security fix applied to the /generate endpoint, validation should occur before sanitization here as well.
        sanitized_model_id = html.escape(model_id)
        if not is_valid_hf_input(sanitized_model_id):

HF_files/aibom-generator/src/aibom-generator/api.py:1137

  • This endpoint (/api/batch) still follows the old pattern of sanitizing before validation. For consistency with the security fix applied to the /generate endpoint, validation should occur before sanitization here as well.
            sanitized_id = html.escape(model_id)
            if is_valid_hf_input(sanitized_id):

Address Copilot review feedback - apply the same security pattern
to 4 additional endpoints that were missed in the original fix:

- /api/generate
- /api/generate-with-report
- /api/models/{model_id}/score
- /api/batch

All endpoints now validate raw input FIRST, then sanitize only
after validation passes. This ensures consistent security posture
across the entire API surface.
@arunsanna
Author

Copilot Review Feedback Addressed ✅

Fixed all 4 additional endpoints identified by Copilot:

| Endpoint | Status |
| --- | --- |
| /api/generate | ✅ Fixed |
| /api/generate-with-report | ✅ Fixed |
| /api/models/{model_id}/score | ✅ Fixed |
| /api/batch | ✅ Fixed |

All endpoints now follow the same security pattern:

  1. Validate raw input FIRST
  2. Sanitize only AFTER validation passes
  3. Use sanitized value only for display/logging
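
One way to keep the three steps above consistent across several endpoints is a single shared helper, so the order cannot drift between handlers. The sketch below is an assumption about how that could look, not the code in this PR: `check_model_id`, `check_batch`, `CheckedInput`, `InvalidModelId`, and the regex are all hypothetical names introduced for illustration.

```python
import html
import re
from typing import NamedTuple

# Assumed "org/model" pattern; the project's validator may accept more forms.
_MODEL_ID = re.compile(r"[A-Za-z0-9][\w.-]*/[\w.-]+")


class CheckedInput(NamedTuple):
    raw: str      # validated value, used for processing
    display: str  # HTML-escaped value, used only for display/logging


class InvalidModelId(ValueError):
    """Raised with an already-escaped message that is safe to echo back."""


def check_model_id(raw: str) -> CheckedInput:
    # 1. Validate the raw input first.
    if not _MODEL_ID.fullmatch(raw):
        raise InvalidModelId(html.escape(raw))
    # 2. Sanitize only after validation passes.
    return CheckedInput(raw=raw, display=html.escape(raw))


def check_batch(raw_ids: list[str]) -> tuple[list[CheckedInput], list[str]]:
    """Split a batch into accepted inputs and escaped rejected values."""
    accepted, rejected = [], []
    for raw in raw_ids:
        try:
            accepted.append(check_model_id(raw))
        except InvalidModelId as exc:
            rejected.append(str(exc))
    return accepted, rejected


accepted, rejected = check_batch(["openai/whisper-tiny", "<script>alert(1)</script>"])
print(accepted)
print(rejected)
```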

Address Copilot review: _normalise_model_id() should receive raw
validated input (not HTML-escaped) since it needs to parse URLs
with special characters like / and :.

- Normalize raw validated model_id first
- Sanitize only when passing to HTML templates for display
- Consistent pattern across all template responses
@arunsanna
Author

Additional Copilot Review Feedback Addressed ✅

Fixed normalization order issue:

| File | Line | Fix |
| --- | --- | --- |
| api.py | 601 | _normalise_model_id() now receives raw validated input instead of HTML-escaped |

Change: Normalization needs to parse URLs with special chars (/, :), so it should operate on raw validated input. HTML sanitization is now done only when passing to templates for display.
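
As a rough illustration of why normalization should see the raw value: `html.escape()` leaves `/` and `:` untouched, but it rewrites `&`, `<`, `>`, and quotes, so a URL with a query string is no longer the URL the user submitted. The sketch below assumes a simplified normalizer; `normalise_model_id_sketch` and the example URL's query parameters are hypothetical and do not reproduce the project's `_normalise_model_id()`.

```python
import html
from urllib.parse import urlparse


def normalise_model_id_sketch(value: str) -> str:
    """Hypothetical normalizer: reduce a model URL to its 'org/model' part."""
    if value.startswith(("http://", "https://")):
        parts = [p for p in urlparse(value).path.split("/") if p]
        return "/".join(parts[:2])
    return value


# Illustrative URL; the query parameters are made up for the example.
raw_url = "https://huggingface.co/openai/whisper-tiny?a=1&b=2"
escaped_url = html.escape(raw_url)

print(escaped_url)                    # '&' has become '&amp;'
print(html.escape("org/model:tag"))   # '/' and ':' are left unchanged

model_id = normalise_model_id_sketch(raw_url)  # normalize the raw value
display_id = html.escape(model_id)             # escape only for the template
print(model_id, display_id)
```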

Pattern applied consistently to all template responses in the /generate endpoint.

@arunsanna
Author

⚠️ Testing Found Bug

Test Space: https://megamind1-aibom-pr19-validation-order.hf.space

Test Results

| Test | Result |
| --- | --- |
| Invalid input rejection (<script>alert(1)</script>) | ✅ Correctly rejected |
| Valid input processing (openai/whisper-tiny) | Error |

Bug Details

Error: name 'sanitized_model_id' is not defined

The validation order change is correct (validates before sanitizing), but there's a variable reference issue - sanitized_model_id is used somewhere in the code path before it's assigned.

Suggested Fix

Check that sanitized_model_id = html.escape(model_id) is called before any code that references it in the success path.

Needs fix before merge.

The /generate form endpoint was missing the sanitized_model_id
assignment after validation passes. This caused a NameError when
the variable was referenced later in the code.
@arunsanna
Author

✅ Bug Fixed and Re-Tested

Commit: 877a650 - fix: add missing sanitized_model_id assignment in form endpoint

Issue

The /generate form endpoint was missing sanitized_model_id = html.escape(model_id) after validation passes, causing a NameError.

Fix

Added the missing assignment at line 601:

# Sanitize for safe display/logging after validation passes
sanitized_model_id = html.escape(model_id)
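
The failure mode can be reproduced in isolation. This is a minimal sketch, assuming the assignment was dropped entirely from the success path during the reorder; `render_buggy` and `render_fixed` are hypothetical functions, not the endpoint code.

```python
import html


def render_buggy(model_id: str) -> str:
    # The assignment was lost in the reorder, so this name is never bound.
    return f"AIBOM for {sanitized_model_id}"  # noqa: F821


def render_fixed(model_id: str) -> str:
    # Sanitize for safe display/logging after validation passes.
    sanitized_model_id = html.escape(model_id)
    return f"AIBOM for {sanitized_model_id}"


try:
    render_buggy("openai/whisper-tiny")
except NameError as exc:
    print(f"NameError: {exc}")

print(render_fixed("openai/whisper-tiny"))
```

Because Python resolves names only at call time, the module imports without error. That would explain why an import test can pass while a request with valid input still fails.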

Re-Test Results

| Test | Result |
| --- | --- |
| Valid input (openai/whisper-tiny) | ✅ AIBOM generated successfully |
| Invalid input (<script>alert(1)</script>) | ✅ Correctly rejected |
| Security validation order | ✅ Validates BEFORE sanitization |

Test Space: https://megamind1-aibom-pr19-validation-order.hf.space

Ready for merge.

arunsanna added a commit to arunsanna/aibom-generator that referenced this pull request Feb 3, 2026
Reapply of PR GenAI-Security-Project#19 fix for v0.2 architecture.

Security improvement: Validate model_id BEFORE html.escape() sanitization
to prevent potential bypass attacks where malicious input could be
transformed into something that passes validation.

Example: <script>org/model</script> → &lt;script&gt;org/model&lt;/script&gt;
could slip through if sanitization occurs first.
@arunsanna
Author

Status Update: Reapplied to v0.2

This security fix (validate before sanitize) has been reapplied to the v0.2 branch in PR #36.

Testing completed:

  • ✅ Local pytest (5/5 tests pass)
  • ✅ Local server tested with malicious input
  • ✅ HF Space aibom-generator-test deployed and tested
  • ✅ HF Space aibom-pr19-validation-order deployed and tested

The fix correctly rejects XSS attempts like <script>alert('xss')</script>/model with "Invalid model ID format."

This PR can be closed in favor of PR #36 which targets v0.2.

Development

Successfully merging this pull request may close these issues.

Bug: Input validation occurs after sanitization

1 participant