@shimrxn shimrxn commented Sep 21, 2025

File Upload Service Enhancements

Overview

This PR introduces major enhancements to the File Upload Service (streamlitdw_fe_mt.py), extending functionality beyond simple dataset uploads. The updated features are live and running on the production VM Streamlit server, where they have been tested successfully with real uploads.

These changes improve usability (previews, validation, real-time feedback), governance (tagging, provenance logging), and lifecycle management (search, deletion) for the Redback Data Warehouse team.


Key Features

  1. Data Validation & Preview

    • Automatic previews for CSV, XLSX, and JSON files (first 8 rows).

    • Real-time validation:

      • Detects empty or duplicate column headers.
      • Flags completely empty columns.

    • Warns users when file size exceeds ~50 MB.

    • Helps analysts identify issues before uploading.

  2. Tagging Support

    • Tags can be applied at:

      • Bulk level (across all uploads).
      • Per-file level (specific to each file).

    • Tags are captured in provenance logs for governance and searchability.

    • Enables better organization (e.g. training-data, images, videos).

  3. Enhanced Provenance Logging

    • Provenance JSON now includes:

      • tags
      • file_type (MIME type)
      • Upload metadata: timestamp, user, source, preprocessing step.
      • Integrity signatures (SHA-256).

    • Provides complete traceability for audit and compliance.

  4. Search & Filtering

    • Quick filename filter in Bronze/Silver views.
    • Tag-based search across provenance logs.
    • Makes it faster to locate project files in production.

  5. ZIP File Handling

    • Previews contents of ZIP files before upload.
    • Automatically unpacks and uploads datasets and media files.
    • Previews tabular files (CSV/JSON/XLSX) inside ZIP archives.

  6. Real-Time Feedback

    • Progress bar tracks multi-file and ZIP uploads.
    • Inline previews for images and videos directly in Streamlit.
    • Clear confirmation messages for each action.

  7. Deletion & Bulk Actions

    • New delete functionality for files in Bronze/Silver buckets.
    • Bulk delete option for provenance logs.
    • Supports better lifecycle and cleanup management.
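
For reviewers who want to see what the validation checks in feature 1 amount to, here is a minimal sketch for the CSV case only. It is not the code in streamlitdw_fe_mt.py: the function name `validate_csv`, the constants, and the warning wording are all assumptions made for illustration, and it uses only the standard library rather than whatever the service uses for parsing.

```python
import csv
import io
from collections import Counter

SIZE_WARN_BYTES = 50 * 1024 * 1024  # assumed threshold, taken from the "~50 MB" figure
PREVIEW_ROWS = 8                    # preview length stated in the feature list


def validate_csv(raw, size_bytes=None):
    """Return (header, preview_rows, warnings) for a CSV payload.

    Hypothetical helper: names and messages are illustrative only.
    """
    warnings = []
    size = len(raw) if size_bytes is None else size_bytes
    if size > SIZE_WARN_BYTES:
        warnings.append(f"File is large ({size / 1_048_576:.1f} MB).")

    rows = list(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
    if not rows:
        return [], [], warnings + ["File has no rows."]
    header, body = rows[0], rows[1:]

    # Header cells that are empty or whitespace-only.
    empty = [i for i, name in enumerate(header) if not name.strip()]
    if empty:
        warnings.append(f"Empty column header at position(s): {empty}")

    # Non-empty header names that occur more than once.
    counts = Counter(name.strip() for name in header)
    dupes = sorted(name for name, n in counts.items() if n > 1 and name)
    if dupes:
        warnings.append(f"Duplicate column header(s): {dupes}")

    # Columns in which every data cell is blank (short rows count as blank).
    for i, name in enumerate(header):
        if body and all(i >= len(row) or not row[i].strip() for row in body):
            warnings.append(f"Column {name!r} is completely empty.")

    return header, body[:PREVIEW_ROWS], warnings
```

In a Streamlit page, the returned warnings would typically be shown with `st.warning` and the preview rows with `st.dataframe`; XLSX and JSON would need their own readers before the same checks apply.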

Screenshots

[Six screenshots, captured 2025-09-21: 181222, 181314, 181324, 181458, 181759, 182033]

Testing Performed (on Production VM)

  • Uploaded CSV, XLSX, JSON → previews and validation warnings triggered.
  • Large CSV upload (~60 MB) → size warning displayed.
  • ZIP with CSV + PNG → contents previewed, unpacked, and uploaded.
  • Applied tags (bulk + per-file) → confirmed searchable in provenance logs.
  • Bulk provenance log delete tested → logs removed as expected.
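
The tag test above could also be reproduced offline against exported provenance records. The sketch below shows one plausible way to combine bulk and per-file tags and then filter records by tag. Only the `tags` field is named in this PR; `merge_tags`, `find_by_tag`, and the `filename` key are hypothetical, and the lower-casing rule is an assumption rather than confirmed behaviour of the service.

```python
def merge_tags(bulk_tags, file_tags):
    """Combine bulk and per-file tags: trimmed, lower-cased, de-duplicated, order kept."""
    seen = set()
    merged = []
    for tag in list(bulk_tags) + list(file_tags):
        norm = tag.strip().lower()
        if norm and norm not in seen:
            seen.add(norm)
            merged.append(norm)
    return merged


def find_by_tag(records, tag):
    """Return the provenance records whose 'tags' list contains the given tag."""
    wanted = tag.strip().lower()
    return [
        rec for rec in records
        if wanted in (t.strip().lower() for t in rec.get("tags", []))
    ]


# Example records; the 'filename' key is an assumed field for illustration.
records = [
    {"filename": "sensors.csv", "tags": merge_tags(["training-data"], ["Sensors"])},
    {"filename": "clip.mp4", "tags": merge_tags(["training-data"], ["videos"])},
    {"filename": "notes.json", "tags": []},
]
matches = find_by_tag(records, "videos")
```

Normalising tags at write time keeps the search a simple membership test; if the service stores tags verbatim, the normalisation would belong in the search step only.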

Notes

This PR builds on the earlier provenance feature, evolving the File Upload Service into a production-ready, auditable system that supports validation, tagging, previews, real-time feedback, and lifecycle management.
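
As an illustration of the provenance record described in this PR, the following sketch assembles the listed fields and a SHA-256 digest for one uploaded file. The exact schema is not shown in this conversation, so every key other than `tags` and `file_type` is an assumed name, and `build_provenance` is a hypothetical helper rather than the service's function.

```python
import hashlib
import json
import mimetypes
from datetime import datetime, timezone


def build_provenance(filename, data, user, source, preprocessing, tags):
    """Build a JSON-serialisable provenance record for one file (illustrative schema)."""
    mime, _ = mimetypes.guess_type(filename)
    return {
        "filename": filename,                               # assumed key
        "file_type": mime or "application/octet-stream",    # MIME type, with a generic fallback
        "tags": sorted(set(tags)),
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed key, UTC ISO 8601
        "user": user,                                       # assumed key
        "source": source,                                   # assumed key
        "preprocessing": preprocessing,                     # assumed key
        "sha256": hashlib.sha256(data).hexdigest(),         # integrity signature of the raw bytes
    }


record = build_provenance(
    "readings.csv", b"id,value\n1,42\n",
    user="analyst", source="upload-ui", preprocessing="none",
    tags=["training-data", "sensors"],
)
serialised = json.dumps(record, indent=2)
```

Hashing the raw bytes before any preprocessing means the digest can later be recomputed from the stored object to confirm it has not changed.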

These enhancements are already deployed on the production VM, providing immediate value to the Redback team.


@github-actions
🔒 Security Scan Results
=========================

Bandit Scan Results:
-------------------
Run started:2025-09-21 09:53:21.670998

Test results:
>> Issue: [B104:hardcoded_bind_all_interfaces] Possible binding to all interfaces.
   Severity: Medium   Confidence: Medium
   CWE: CWE-605 (https://cwe.mitre.org/data/definitions/605.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b104_hardcoded_bind_all_interfaces.html
   Location: ./Core DW Infrastructure/dremio-api/api.py:100:17
99	    port = int(os.getenv('FLASK_RUN_PORT', 5000))
100	    app.run(host='0.0.0.0', port=port)

--------------------------------------------------
>> Issue: [B104:hardcoded_bind_all_interfaces] Possible binding to all interfaces.
   Severity: Medium   Confidence: Medium
   CWE: CWE-605 (https://cwe.mitre.org/data/definitions/605.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b104_hardcoded_bind_all_interfaces.html
   Location: ./Core DW Infrastructure/flask/flaskapi_dw.py:86:17
85	if __name__ == '__main__':
86	    app.run(host='0.0.0.0', port=5000)  # Running on port 5000 IMPORTANT

--------------------------------------------------
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
   Severity: Medium   Confidence: Low
   CWE: CWE-89 (https://cwe.mitre.org/data/definitions/89.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b608_hardcoded_sql_expressions.html
   Location: ./File Upload Service/app/streamlitdw_fe_mt.py:247:17
246	    except S3Error as e:
247	        st.error(f"Failed to delete {object_name} from {bucket}: {e}")
248	

--------------------------------------------------
>> Issue: [B104:hardcoded_bind_all_interfaces] Possible binding to all interfaces.
   Severity: Medium   Confidence: Medium
   CWE: CWE-605 (https://cwe.mitre.org/data/definitions/605.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b104_hardcoded_bind_all_interfaces.html
   Location: ./File Upload Service/flask/flaskapi_dw.py:86:17
85	if __name__ == '__main__':
86	    app.run(host='0.0.0.0', port=5000)  # Running on port 5000 IMPORTANT

--------------------------------------------------
>> Issue: [B104:hardcoded_bind_all_interfaces] Possible binding to all interfaces.
   Severity: Medium   Confidence: Medium
   CWE: CWE-605 (https://cwe.mitre.org/data/definitions/605.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b104_hardcoded_bind_all_interfaces.html
   Location: ./MongoDB_Connection/Project1/main.py:12:35
11	    debug_mode = os.environ.get('FLASK_DEBUG', 'False').lower() == 'true'
12	    app.run(debug=debug_mode, host='0.0.0.0')

--------------------------------------------------
>> Issue: [B104:hardcoded_bind_all_interfaces] Possible binding to all interfaces.
   Severity: Medium   Confidence: Medium
   CWE: CWE-605 (https://cwe.mitre.org/data/definitions/605.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b104_hardcoded_bind_all_interfaces.html
   Location: ./Structured Dremio Solution/Flask-api/api.py:100:17
99	    port = int(os.getenv('FLASK_RUN_PORT', 5000))
100	    app.run(host='0.0.0.0', port=port)

--------------------------------------------------
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
   Severity: Medium   Confidence: Low
   CWE: CWE-89 (https://cwe.mitre.org/data/definitions/89.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b608_hardcoded_sql_expressions.html
   Location: ./Structured Dremio Solution/Script/pipeline.py:168:12
167	    placeholders = ', '.join(['?' for _ in data[0]])
168	    query = f"INSERT INTO {table_name} VALUES ({placeholders})"
169	    cursor = conn.cursor()

--------------------------------------------------
>> Issue: [B108:hardcoded_tmp_directory] Probable insecure usage of temp file/directory.
   Severity: Medium   Confidence: Medium
   CWE: CWE-377 (https://cwe.mitre.org/data/definitions/377.html)
   More Info: https://bandit.readthedocs.io/en/1.8.6/plugins/b108_hardcoded_tmp_directory.html
   Location: ./pre-processing/pre-processing.py:177:29
176	
177	            temp_file_path = f'/tmp/{obj.object_name}'
178	

--------------------------------------------------

Code scanned:
	Total lines of code: 2774
	Total lines skipped (#nosec): 0
	Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0
		Low: 15
		Medium: 8
		High: 0
	Total issues (by confidence):
		Undefined: 0
		Low: 2
		Medium: 6
		High: 15
Files skipped (0):

Dependency Check Results:
-----------------------

No critical security issues detected.

The code has passed all critical security checks.
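
The B108 finding above flags a path built as `/tmp/{obj.object_name}`. One common way to address that pattern, sketched here under assumptions, is to work inside a private directory from `tempfile` and keep only the final path component of the object name so that names containing `../` cannot escape it. The helper `safe_temp_path` and the example object name are hypothetical; this is not a patch for pre-processing.py.

```python
import os
import tempfile


def safe_temp_path(object_name, workdir):
    """Map an object name to a file path directly inside workdir.

    Only the last path component is kept, so directory parts and
    traversal sequences in the object name are discarded.
    """
    leaf = os.path.basename(object_name.replace("\\", "/"))
    if not leaf or leaf in {".", ".."}:
        raise ValueError(f"Unusable object name: {object_name!r}")
    return os.path.join(workdir, leaf)


# TemporaryDirectory creates a uniquely named directory readable only by the
# current user and removes it, with its contents, when the block exits.
with tempfile.TemporaryDirectory(prefix="preproc_") as workdir:
    temp_file_path = safe_temp_path("raw/2025/readings.csv", workdir)
    with open(temp_file_path, "wb") as fh:
        fh.write(b"id,value\n1,42\n")
    written = os.path.getsize(temp_file_path)
```

Keeping only the leaf name can cause collisions when two objects share a file name under different prefixes; processing one object per temporary directory avoids that.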

@lperry022 lperry022 self-assigned this Sep 22, 2025

@lperry022 lperry022 left a comment

LGTM. Great work.

@lperry022 lperry022 merged commit 2faf6c4 into Redback-Operations:main Sep 22, 2025
4 checks passed