Document Masking/Unmasking REST API - Implementation Plan

Overview

A REST API built with Node.js and TypeScript for masking and unmasking keywords in documents. The API accepts file uploads, replaces the specified keywords with a fixed mask, and returns a recovery key that can later be used to restore them.

Core Requirements

  • File-based processing: Upload/download files instead of string content
  • Multiple format support: Plain text (TXT), Markdown, JSON, XML, and CSV files
  • Flexible keyword input: Space/comma separated with quoted phrases
  • Secure recovery: UUID v4 keys without embedded keyword information
  • Minimal storage: Redis database for keyword mappings only
  • Case handling: Case-insensitive masking, uppercase restoration

API Endpoints

1. POST /api/v1/mask

Purpose: Upload document, mask keywords, return masked file with recovery key

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • File Field: document (file to mask)
  • Form Field: keywords (keyword string)

Example Keywords Input:

Hello world "Boston Red Sox", 'Pepperoni Pizza', 'Cheese Pizza', beer

Parsed Keywords:

  • Hello
  • world
  • beer
  • Boston Red Sox
  • Pepperoni Pizza
  • Cheese Pizza

Response:

  • Content-Type: application/octet-stream (download file)
  • Headers:
    • X-Recovery-Key: [UUID v4]
    • Content-Disposition: attachment; filename="masked_[original_filename]"
  • Body: Masked document file (keywords → XXXXX)
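
As an illustration of the response described above, the headers could be assembled by a small pure function. This is a minimal sketch: buildMaskResponseHeaders and sanitizeFilename are hypothetical names, and the filename sanitization rule (ASCII letters, digits, dot, underscore, hyphen) is an assumption the plan does not specify.

```typescript
// Hypothetical helper: keeps only the last path segment and replaces
// characters outside a conservative ASCII set (an assumed rule).
function sanitizeFilename(filename: string): string {
  const parts = filename.split(/[\\/]/);
  const base = parts[parts.length - 1] ?? "";
  const safe = base.replace(/[^A-Za-z0-9._-]/g, "_");
  return safe.length > 0 ? safe : "document";
}

// Builds the headers listed for POST /api/v1/mask.
function buildMaskResponseHeaders(
  originalFilename: string,
  recoveryKey: string
): Record<string, string> {
  return {
    "Content-Type": "application/octet-stream",
    "Content-Disposition": `attachment; filename="masked_${sanitizeFilename(originalFilename)}"`,
    "X-Recovery-Key": recoveryKey,
  };
}
```

Sanitizing the client-supplied filename avoids echoing path separators or quotes back into the Content-Disposition header. Non-ASCII filenames would be flattened to underscores under this assumed rule.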

2. POST /api/v1/unmask

Purpose: Upload masked document, restore original content using recovery key

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • File Field: maskedDocument (masked file)
  • Form Field: recoveryKey (UUID v4 from masking)

Response:

  • Content-Type: application/octet-stream (download file)
  • Headers:
    • Content-Disposition: attachment; filename="original_[original_filename]"
  • Body: Restored original document (XXXXX → original keywords in UPPERCASE)

Database Schema (Redis)

Key Structure: keyword_map:{recovery_key}

Value Format: JSON array of mappings

[
  [1, "Hello"],
  [1, "world"],
  [3, "Boston Red Sox"],
  [5, "Pepperoni Pizza"],
  [7, "Cheese Pizza"],
  [10, "beer"]
]

Array Format: [lineNumber, originalText]

  • Index 0: Line number (integer)
  • Index 1: Original keyword text (string)

TTL: Optional expiration (Redis deletes the key automatically after 1 year)
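
The schema above can be expressed as a small set of types and helpers. This is a sketch under assumptions: the names KeywordMapping, redisKeyFor, serializeMappings, parseMappings, and ONE_YEAR_SECONDS are hypothetical, and the helpers only build and validate the stored value. The actual write would go through the Redis client, for example with the Redis SET command and its EX option.

```typescript
// One stored entry: [lineNumber, originalText], as described in the schema.
type KeywordMapping = [lineNumber: number, originalText: string];

// Assumed TTL value for the one-year expiration, in seconds.
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

// Builds the Redis key "keyword_map:{recovery_key}".
function redisKeyFor(recoveryKey: string): string {
  return `keyword_map:${recoveryKey}`;
}

// Serializes mappings to the JSON array stored as the Redis value.
function serializeMappings(mappings: KeywordMapping[]): string {
  return JSON.stringify(mappings);
}

// Parses and validates a stored value; throws on any malformed entry.
function parseMappings(raw: string): KeywordMapping[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("Expected a JSON array of mappings");
  }
  return data.map((entry, index) => {
    if (
      !Array.isArray(entry) ||
      entry.length !== 2 ||
      !Number.isInteger(entry[0]) ||
      typeof entry[1] !== "string"
    ) {
      throw new Error(`Invalid mapping at index ${index}`);
    }
    return [entry[0], entry[1]] as KeywordMapping;
  });
}
```

Validating on read is a defensive choice: the value comes from an external store, so a malformed entry fails early instead of producing a corrupted restored file.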

Processing Logic

Keyword Parsing

  1. Split by spaces and commas
  2. Extract quoted phrases (single and double quotes)
  3. Case-insensitive matching
  4. Preserve quoted phrases as single keywords
  5. Nested single or double quotes are not allowed
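
The parsing rules above can be sketched with a single tokenizing regular expression. This is one possible approach, not necessarily the final implementation: parseKeywords is a hypothetical name, and dropping case-insensitive duplicates is an assumption. Keywords are returned in input order.

```typescript
// Hypothetical helper implementing the parsing rules with one regex.
function parseKeywords(input: string): string[] {
  const keywords: string[] = [];
  const seen = new Set<string>();
  // Alternatives, tried in order at each position:
  //   1. a double-quoted phrase, 2. a single-quoted phrase,
  //   3. a bare run of characters without whitespace, commas, or quotes.
  const token = /"([^"]*)"|'([^']*)'|([^\s,"']+)/g;
  let match: RegExpExecArray | null;
  while ((match = token.exec(input)) !== null) {
    const value = (match[1] ?? match[2] ?? match[3] ?? "").trim();
    const key = value.toLowerCase();
    // Assumption: skip empty phrases and case-insensitive duplicates.
    if (value.length > 0 && !seen.has(key)) {
      seen.add(key);
      keywords.push(value);
    }
  }
  return keywords;
}
```

For the example input from the mask endpoint, this yields Hello, world, Boston Red Sox, Pepperoni Pizza, Cheese Pizza, beer. An unterminated quote character is skipped and the text after it is parsed as bare keywords, which is one way to treat input the rules do not cover.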

Masking Process

  1. Read uploaded file line by line
  2. For each line, find keyword matches (case-insensitive)
  3. Replace matches with "XXXXX"
  4. Store mapping: [lineNumber, originalText] in Redis
  5. Write each masked line to the output file immediately, so the output file is written line by line while the input file is read line by line
  6. Return file with recovery key in headers
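
Steps 2 to 4 of the masking process could look like the sketch below for a single line. The names buildKeywordPattern and maskLine are hypothetical. Two assumptions are made that the plan does not state: keywords match as substrings (no word-boundary check), and longer keywords take precedence over shorter ones.

```typescript
type KeywordMapping = [lineNumber: number, originalText: string];

const MASK = "XXXXX";

// Escapes regex metacharacters so keywords are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds one case-insensitive, global pattern for all keywords.
function buildKeywordPattern(keywords: string[]): RegExp {
  if (keywords.length === 0) {
    throw new Error("At least one keyword is required");
  }
  // Longest first, so "Cheese Pizza" wins over a shorter keyword like "Cheese".
  const alternatives = keywords
    .slice()
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp);
  return new RegExp(alternatives.join("|"), "gi");
}

// Masks one line and appends [lineNumber, matchedText] for every match,
// in left-to-right order, to the mappings array.
function maskLine(
  line: string,
  lineNumber: number,
  pattern: RegExp,
  mappings: KeywordMapping[]
): string {
  return line.replace(pattern, (matched) => {
    mappings.push([lineNumber, matched]);
    return MASK;
  });
}
```

Recording matches in left-to-right order matters because unmasking relies on the order of XXXXX occurrences within each line.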

Unmasking Process

  1. Validate recovery key exists in Redis
  2. Read masked file line by line
  3. Count XXXXX occurrences per line
  4. Replace each XXXXX with original text from Redis (UPPERCASE)
  5. Write restored line to output file
  6. Return restored file
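
Steps 3 and 4 of the unmasking process can be sketched as follows. groupByLine and unmaskLine are hypothetical names. The sketch assumes the stored mappings for a line are in the same left-to-right order as the XXXXX occurrences, and it leaves any extra XXXXX (one with no stored mapping) unchanged.

```typescript
type KeywordMapping = [lineNumber: number, originalText: string];

// Groups stored mappings by line number, preserving their order.
function groupByLine(mappings: KeywordMapping[]): Map<number, string[]> {
  const byLine = new Map<number, string[]>();
  for (const [lineNumber, originalText] of mappings) {
    const existing = byLine.get(lineNumber);
    if (existing === undefined) {
      byLine.set(lineNumber, [originalText]);
    } else {
      existing.push(originalText);
    }
  }
  return byLine;
}

// Replaces each XXXXX on the line with the next stored original, uppercased.
function unmaskLine(
  line: string,
  lineNumber: number,
  byLine: Map<number, string[]>
): string {
  const originals = byLine.get(lineNumber) ?? [];
  let next = 0;
  return line.replace(/XXXXX/g, (mask) => {
    const original = originals[next];
    next += 1;
    return original === undefined ? mask : original.toUpperCase();
  });
}
```

One limitation of this scheme is worth noting: if the original document already contained the literal text XXXXX, those occurrences are indistinguishable from masks and would shift the replacements on that line.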

Security Considerations

Recovery Key Generation

  • UUID v4 (128-bit value with 122 random bits)
  • No keyword information embedded
  • Secure random generation
  • Format: xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
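
A minimal sketch of key generation and format validation follows. It uses Node's built-in crypto.randomUUID() in place of the uuid package listed in the stack, purely to keep the example dependency-free. generateRecoveryKey and isRecoveryKey are hypothetical names.

```typescript
import { randomUUID } from "node:crypto";

// Matches the format xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx,
// where y is one of 8, 9, a, or b (the RFC 4122 variant).
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Generates a random version 4 UUID from a cryptographically secure source.
function generateRecoveryKey(): string {
  return randomUUID();
}

// Checks the format only; it does not check that the key exists in Redis.
function isRecoveryKey(value: string): boolean {
  return UUID_V4.test(value);
}
```

Checking the format before querying Redis lets the API answer malformed keys with 400 and reserve 404 for well-formed keys that are not found.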

File Handling

  • Temporary file storage during processing (the /tmp folder is writable in the AWS Lambda environment)
  • Read files line by line instead of loading the whole file into memory
  • Automatic cleanup of temp files
  • Input validation and sanitization
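
The line-by-line approach can be sketched with Node's readline and stream modules. transformLines is a hypothetical name. The sketch works on generic streams so it can be exercised without touching the disk, and it assumes that writing every line with a trailing "\n" is acceptable (CRLF endings and a missing final newline are not preserved).

```typescript
import { once } from "node:events";
import { createInterface } from "node:readline";
import { Readable, Writable } from "node:stream";

type LineTransform = (line: string, lineNumber: number) => string;

// Reads the input line by line, applies the transform, and writes each result
// immediately. Only the current line is held by this function.
// Returns the number of lines processed.
async function transformLines(
  input: Readable,
  output: Writable,
  transform: LineTransform
): Promise<number> {
  const reader = createInterface({ input, crlfDelay: Infinity });
  let lineNumber = 0;
  for await (const line of reader) {
    lineNumber += 1;
    const canContinue = output.write(transform(line, lineNumber) + "\n");
    if (!canContinue) {
      // Respect backpressure: wait until the writer has drained its buffer.
      await once(output, "drain");
    }
  }
  // Register the listener before calling end() so "finish" cannot be missed.
  const finished = once(output, "finish");
  output.end();
  await finished;
  return lineNumber;
}
```

In the API, the input and output would likely be createReadStream and createWriteStream handles on files under the temp directory, with the masking or unmasking function passed as the transform.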

Database Security

  • Redis with authentication
  • Key with 1 year expiration
  • No storage of original documents
  • Minimal data footprint

Technical Stack

Backend

  • Runtime: Node.js version 24
  • Language: TypeScript version 5
  • Framework: Express.js version 5
  • Cloud Provider: AWS Lambda (zip deployment package)
  • File Upload: Multer version 2
  • Database: Redis, hosted on Aiven or redis.io
  • UUID Generation: uuid package (v4(), commonly imported as uuidv4)

Dependencies

{
  "express": "^5.x",
  "multer": "^2.x",
  "redis": "^5.x",
  "uuid": "^13.x",
  "typescript": "^5.x"
}

File Format Support

Processing Approach

  • Line-by-line processing for all formats
  • Keyword matching preserves format structure
  • Masking maintains original file formatting
  • Restoration maintains original file structure
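
Because only plain-text formats are processed, an upload could be screened before masking. The sketch below is one possible heuristic, not a specified requirement: hasAllowedExtension and looksLikePlainText are hypothetical names, the extension list is taken from the formats named in this plan, and the NUL-byte check assumes UTF-8 or a single-byte encoding (UTF-16 text would be rejected).

```typescript
// Extensions taken from the formats named in this plan (an assumed allow-list).
const ALLOWED_EXTENSIONS = ["txt", "md", "json", "xml", "csv"];

// Checks the file extension case-insensitively.
function hasAllowedExtension(filename: string): boolean {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) {
    return false;
  }
  const extension = filename.slice(dot + 1).toLowerCase();
  return ALLOWED_EXTENSIONS.indexOf(extension) !== -1;
}

// Heuristic: treat a sample as binary if it contains a NUL byte.
function looksLikePlainText(sample: Uint8Array): boolean {
  for (let i = 0; i < sample.length; i++) {
    if (sample[i] === 0) {
      return false;
    }
  }
  return true;
}
```

Checking only the first few kilobytes of the upload keeps this consistent with the line-by-line processing goal, since the whole file never needs to be loaded.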

Error Handling

Validation Errors

  • 400 Bad Request: Invalid file format
  • 400 Bad Request: Missing keywords
  • 400 Bad Request: Invalid recovery key format

Processing Errors

  • 422 Unprocessable Entity: Unsupported file format
  • 500 Internal Server Error: Processing failures
  • 404 Not Found: Recovery key not found in Redis (unknown or expired)
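
The status codes above could be centralized in one mapping so handlers only throw typed errors. This is a sketch: ApiError, ApiErrorCode, and toErrorResponse are hypothetical names, and the JSON body shape is an assumption.

```typescript
type ApiErrorCode =
  | "INVALID_FILE_FORMAT"
  | "MISSING_KEYWORDS"
  | "INVALID_RECOVERY_KEY_FORMAT"
  | "UNSUPPORTED_FILE_FORMAT"
  | "RECOVERY_KEY_NOT_FOUND"
  | "PROCESSING_FAILURE";

// HTTP status for each error listed in this section.
const STATUS_BY_CODE: Record<ApiErrorCode, number> = {
  INVALID_FILE_FORMAT: 400,
  MISSING_KEYWORDS: 400,
  INVALID_RECOVERY_KEY_FORMAT: 400,
  UNSUPPORTED_FILE_FORMAT: 422,
  RECOVERY_KEY_NOT_FOUND: 404,
  PROCESSING_FAILURE: 500,
};

class ApiError extends Error {
  readonly code: ApiErrorCode;

  constructor(code: ApiErrorCode, message: string) {
    super(message);
    this.name = "ApiError";
    this.code = code;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, ApiError.prototype);
  }
}

interface ErrorResponse {
  status: number;
  body: { error: ApiErrorCode; message: string };
}

// Converts any thrown value into a response. Unknown errors become a generic
// 500 so that internal details are not leaked to the client.
function toErrorResponse(error: unknown): ErrorResponse {
  if (error instanceof ApiError) {
    return {
      status: STATUS_BY_CODE[error.code],
      body: { error: error.code, message: error.message },
    };
  }
  return {
    status: 500,
    body: { error: "PROCESSING_FAILURE", message: "Internal server error" },
  };
}
```

In an Express application this function would typically be called from a single error-handling middleware.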

Rate Limiting

  • Optional: Basic rate limiting for abuse prevention
  • Configurable limits per IP address
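
A basic per-IP limiter could use a fixed time window, as sketched below. FixedWindowRateLimiter is a hypothetical name and the fixed-window algorithm is only one option. The state lives in process memory, so on AWS Lambda each instance would count separately. A shared counter (for example in Redis) would be needed for a global limit.

```typescript
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and counts it.
  // The clock is injectable to keep the limiter testable.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (current === undefined || now - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

A rejected request would normally be answered with 429 Too Many Requests.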

Implementation Phases

Phase 1: Core API

  1. Set up project structure
  2. Implement file upload/download endpoints
  3. Basic keyword parsing logic
  4. File processing and masking
  5. Recovery key generation

Phase 2: Database Integration

  1. Redis connection and configuration
  2. Keyword mapping storage
  3. Recovery key validation
  4. Data cleanup and TTL

Phase 3: Advanced Features

  1. Multiple file format support
  2. Enhanced error handling
  3. Logging and monitoring
  4. Performance optimization

Phase 4: Production Ready

  1. Security hardening
  2. Documentation
  3. Testing suite
  4. Deployment configuration

Testing Strategy

Unit Tests

  • Keyword parsing logic
  • File processing functions
  • Recovery key generation
  • Database operations

Integration Tests

  • End-to-end masking workflow
  • End-to-end unmasking workflow
  • Error scenarios
  • File format compatibility (binary files and any other non-plain-text formats are not allowed)

Performance Tests

  • Large file processing
  • Concurrent requests
  • Memory usage optimization

Deployment Considerations

Environment Variables

REDIS_URL="redis://default:$REDIS_PARIS_PASSWORD@redis-11686.crce282.eu-west-3-1.ec2.cloud.redislabs.com:11686"
PORT=3000
TEMP_DIR=/tmp
FILE_SIZE_LIMIT=5MB # kept within the AWS Lambda payload limits where this API will be deployed
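
These variables could be read and validated once at startup. The sketch below is an assumption about how that might look: AppConfig, parseSize, and loadConfig are hypothetical names, sizes are interpreted with 1024-based units, and the defaults mirror the values shown above.

```typescript
interface AppConfig {
  redisUrl: string;
  port: number;
  tempDir: string;
  fileSizeLimitBytes: number;
}

// Parses values such as "5MB" or "512KB" into bytes (1024-based units).
function parseSize(value: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)?$/i.exec(value.trim());
  if (match === null) {
    throw new Error(`Invalid size: ${value}`);
  }
  const units = { B: 1, KB: 1024, MB: 1024 * 1024, GB: 1024 * 1024 * 1024 };
  const unit = (match[2] ?? "B").toUpperCase() as keyof typeof units;
  return Math.round(Number(match[1]) * units[unit]);
}

// Takes the environment as a parameter so it can be tested without process.env.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const redisUrl = env.REDIS_URL;
  if (redisUrl === undefined || redisUrl.length === 0) {
    throw new Error("REDIS_URL is required");
  }
  const port = Number(env.PORT ?? "3000");
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return {
    redisUrl,
    port,
    tempDir: env.TEMP_DIR ?? "/tmp",
    fileSizeLimitBytes: parseSize(env.FILE_SIZE_LIMIT ?? "5MB"),
  };
}
```

At startup this would be called as loadConfig(process.env), and fileSizeLimitBytes could then be passed to the upload middleware as its file size limit.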

Docker Configuration

  • Multi-stage build for production
  • Redis service dependency
  • Health check endpoints
  • Environment-based configuration

Monitoring

  • Request/response logging
  • Error tracking
  • Performance metrics
  • Redis connection monitoring

Usage Examples

Masking a Document in AWS Paris region

curl -X POST \
  https://some-id.lambda-url.eu-west-3.on.aws/api/v1/mask \
  -F "document=@document.txt" \
  -F "keywords=Hello world \"Boston Red Sox\", 'Pepperoni Pizza', beer"

Unmasking a Document

curl -X POST \
  https://some-id.lambda-url.eu-west-3.on.aws/api/v1/unmask \
  -F "maskedDocument=@masked_document.txt" \
  -F "recoveryKey=550e8400-e29b-41d4-a716-446655440000"

Success Criteria

  • ✅ File-based upload/download functionality
  • ✅ Support for plain-text document formats (xml, txt, md, json, csv, etc.)
  • ✅ Flexible keyword input parsing
  • ✅ Secure UUID v4 recovery keys
  • ✅ Case-insensitive masking with uppercase restoration
  • ✅ Minimal Redis database storage
  • ✅ Proper error handling and validation
  • ✅ Performance with large files
  • ✅ Security best practices
  • ✅ Comprehensive testing coverage