
S3 alternative for local database #228

@kien-ship-it


Design Document: Local S3 Migration

Overview

This design introduces SeaweedFS as a self-hosted S3-compatible storage backend for the PDR AI platform, alongside the existing Vercel Blob and UploadThing cloud providers. The core change is a unified Storage Adapter (src/lib/storage.ts) that abstracts all storage operations behind a single interface, routing to the correct backend based on the NEXT_PUBLIC_STORAGE_PROVIDER environment variable.

The architecture is designed around three parallel workstreams:

  1. Infrastructure & Configuration — Docker/SeaweedFS setup, env validation, dependency management
  2. Backend Services — S3 client, Storage Adapter, presigned URL API, bootstrap API extension
  3. Frontend Integration — Upload component dual-mode, document retrieval dual-mode

Key Design Decisions

  • Strategy Pattern for Storage: Rather than scattering if (provider === 'local') checks throughout the codebase, we use a strategy pattern where each provider implements a common interface. The adapter selects the strategy at initialization time.
  • Presigned URLs for Local Uploads: Browser uploads go directly to SeaweedFS via presigned URLs, avoiding server-side proxying and keeping the upload flow consistent with cloud providers.
  • fetchBlob Abstraction Extension: The existing fetchBlob function in vercel-blob.ts is the de facto fetch wrapper for all document retrieval. We extend the Storage Adapter to provide a unified fetchFile that handles SeaweedFS URLs alongside Vercel Blob URLs.
  • No Schema Migration Required: The fileUploads.storageProvider column is varchar(64), so adding seaweedfs as a value requires no DDL changes.
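The strategy selection described above can be sketched as follows. This is illustrative only, not the final src/lib/storage.ts implementation; the provider objects and names are hypothetical stand-ins.

```typescript
// Illustrative sketch of the strategy pattern: each provider implements a
// common interface, and the adapter picks one at initialization time.
interface StorageStrategy {
  upload(key: string, data: Uint8Array): Promise<string>; // returns the stored URL
}

const cloudStrategy: StorageStrategy = {
  async upload(key) {
    // The real adapter would delegate to Vercel Blob / UploadThing here.
    return `https://blob.example.com/${key}`;
  },
};

const localStrategy: StorageStrategy = {
  async upload(key) {
    // The real adapter would PUT to SeaweedFS via the S3 client here.
    return `http://localhost:8333/pdr-bucket/${key}`;
  },
};

// Selected once, at initialization time, instead of branching at every call site.
function selectStrategy(provider: string | undefined): StorageStrategy {
  return provider === "local" ? localStrategy : cloudStrategy;
}
```

Call sites only ever see the `StorageStrategy` interface, which is what keeps `if (provider === 'local')` checks out of the rest of the codebase.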

Architecture

graph TB
    subgraph Frontend
        UC[Upload Component]
        DC[Document Viewer]
    end

    subgraph API Layer
        BA[Bootstrap API<br>/api/employer/upload/bootstrap]
        PS[Presign API<br>/api/storage/presign]
        UL[Upload-Local API<br>/api/upload-local]
        UD[Upload Document API<br>/api/uploadDocument]
    end

    subgraph Storage Adapter Layer
        SA[Storage Adapter<br>src/lib/storage.ts]
        CP[Cloud Provider<br>Vercel Blob / UploadThing]
        LP[Local Provider<br>S3 Client → SeaweedFS]
    end

    subgraph Infrastructure
        SW[SeaweedFS Container<br>S3 Gateway :8333]
        PG[(PostgreSQL<br>pdr_ai_v2 + seaweedfs DBs)]
        VOL[Docker Volume<br>seaweedfs_data]
    end

    UC -->|cloud mode| CP
    UC -->|local mode: get presigned URL| PS
    UC -->|local mode: PUT to S3| SW
    UC -->|both modes: trigger OCR| UD
    DC -->|fetch| SA
    BA -->|storageProvider config| UC
    PS -->|generate presigned URL| LP
    UL -->|upload via adapter| SA
    SA -->|cloud| CP
    SA -->|local| LP
    LP --> SW
    SW --> PG
    SW --> VOL


Request Flow: Local Upload

sequenceDiagram
    participant Browser
    participant BootstrapAPI
    participant PresignAPI
    participant SeaweedFS
    participant UploadDocAPI
    participant DB

    Browser->>BootstrapAPI: GET /api/employer/upload/bootstrap
    BootstrapAPI-->>Browser: { storageProvider: "local", s3Endpoint: "..." }
    Browser->>PresignAPI: POST /api/storage/presign { filename, contentType }
    PresignAPI-->>Browser: { presignedUrl, objectKey, bucket }
    Browser->>SeaweedFS: PUT presignedUrl (file binary)
    SeaweedFS-->>Browser: 200 OK
    Browser->>UploadDocAPI: POST /api/uploadDocument { documentUrl, storageType: "local" }
    UploadDocAPI->>DB: Insert fileUploads (storageProvider: "seaweedfs")
    UploadDocAPI-->>Browser: { jobId, success: true }

Components and Interfaces

1. S3 Client Module (src/server/storage/s3-client.ts)

Singleton S3 client configured for SeaweedFS. Only instantiated when NEXT_PUBLIC_STORAGE_PROVIDER === 'local'.

// src/server/storage/s3-client.ts
import { S3Client, PutObjectCommand, GetObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

export function getS3Client(): S3Client;
export function getS3BucketName(): string;

export async function putObject(key: string, body: Buffer, contentType?: string): Promise<void>;
export function getObjectUrl(key: string): string;
export async function deleteObject(key: string): Promise<void>;
export async function getPresignedUploadUrl(key: string, contentType: string, expiresIn?: number): Promise<string>;
export async function getPresignedDownloadUrl(key: string, expiresIn?: number): Promise<string>;

2. Storage Adapter (src/lib/storage.ts)

Unified interface that delegates to the correct provider.

// src/lib/storage.ts
export interface UploadInput {
  filename: string;
  data: Buffer | ArrayBuffer | Uint8Array;
  contentType?: string;
  userId: string;
}

export interface UploadResult {
  url: string;
  pathname: string;
  contentType?: string;
  provider: "uploadthing" | "vercel_blob" | "seaweedfs";
}

export async function uploadFile(input: UploadInput): Promise<UploadResult>;
export async function getFileUrl(key: string, provider?: string): Promise<string>;
export async function deleteFile(key: string, provider?: string): Promise<void>;
export async function fetchFile(url: string, init?: RequestInit): Promise<Response>;

export function getStorageProvider(): "cloud" | "local";
export function isLocalStorage(): boolean;

3. Presigned URL API Route (src/app/api/storage/presign/route.ts)

// POST /api/storage/presign
// Request: { filename: string; contentType: string }
// Response: { presignedUrl: string; objectKey: string; bucket: string }
// Auth: Clerk (401 if unauthenticated)
// Returns 400 if storage provider is "cloud"
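The route's auth and mode checks can be factored as a pure decision function, sketched below. Names and the body-validation rules are illustrative assumptions; in the real route the `isAuthenticated` flag would come from Clerk and the 200 branch would call the presigner.

```typescript
// Hypothetical sketch of the presign route's decision logic.
type PresignBody = { filename?: string; contentType?: string };
type PresignOutcome =
  | { status: 401; error: string }
  | { status: 400; error: string }
  | { status: 200; objectKey: string };

function decidePresign(
  isAuthenticated: boolean, // from Clerk's auth() in the real route
  storageProvider: "cloud" | "local",
  body: PresignBody,
): PresignOutcome {
  if (!isAuthenticated) {
    return { status: 401, error: "Authentication required" };
  }
  if (storageProvider !== "local") {
    return { status: 400, error: "Presigned URLs are not applicable for cloud storage" };
  }
  if (!body.filename || !body.contentType) {
    return { status: 400, error: "filename and contentType are required" };
  }
  // The real route would now generate the presigned URL and respond with
  // { presignedUrl, objectKey, bucket }.
  return { status: 200, objectKey: `documents/${body.filename}` };
}
```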

4. Bootstrap API Extension (src/app/api/employer/upload/bootstrap/route.ts)

Extends the existing UploadBootstrapResponse type:

type UploadBootstrapResponse = {
  categories: BootstrapCategory[];
  company: BootstrapCompany;
  isUploadThingConfigured: boolean;
  availableProviders: { azure: boolean; datalab: boolean; landingAI: boolean };
  // NEW fields
  storageProvider: "cloud" | "local";
  s3Endpoint?: string; // only present when storageProvider === "local"
};

5. Upload Component Changes (src/app/employer/upload/UploadForm.tsx)

The existing UploadForm already has a storageMethod concept ("cloud" vs "database"). We extend this to support a third mode: "local" (S3/SeaweedFS). When storageProvider === "local" from bootstrap:

  • The component fetches a presigned URL from /api/storage/presign
  • Uploads the file directly to SeaweedFS via fetch(presignedUrl, { method: 'PUT', body: file })
  • Calls /api/uploadDocument with the S3 URL to trigger the OCR pipeline
  • Progress tracking uses XMLHttpRequest for upload progress events

6. Document Retrieval Extension

The fetchFile function in the Storage Adapter replaces direct fetchBlob calls. For SeaweedFS documents, it constructs the URL from NEXT_PUBLIC_S3_ENDPOINT + storagePathname. Existing fetchBlob calls in OCR adapters and ingestion router are updated to use fetchFile.
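The per-provider URL resolution inside fetchFile might look like the sketch below. The function name is hypothetical, and whether the bucket lives in the endpoint or the stored pathname is a configuration detail this sketch glosses over.

```typescript
// Sketch of provider-based URL resolution (Property 9). The s3Endpoint
// argument stands in for NEXT_PUBLIC_S3_ENDPOINT.
function resolveDocumentUrl(
  storageProvider: string,
  storageUrlOrPathname: string,
  s3Endpoint: string,
): string {
  if (storageProvider === "seaweedfs") {
    // SeaweedFS rows store an object key; prepend the configured endpoint.
    const base = s3Endpoint.replace(/\/$/, "");
    const key = storageUrlOrPathname.replace(/^\//, "");
    return `${base}/${key}`;
  }
  // Cloud providers (vercel_blob, uploadthing) already store a full URL.
  return storageUrlOrPathname;
}
```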

7. Docker Infrastructure

New additions to docker-compose.yml:

services:
  seaweedfs:
    image: chrislusf/seaweedfs:latest
    container_name: pdr_ai_v2-seaweedfs
    command: server -s3 -s3.config=/etc/seaweedfs/s3-config.json -filer -dir=/data -volume.max=0 -master.volumeSizeLimitMB=1024 -volume.port=8080
    ports:
      - "8333:8333"
    depends_on:
      db:
        condition: service_healthy
    volumes:
      - seaweedfs_data:/data
      - ./docker/filer.toml:/etc/seaweedfs/filer.toml:ro
      - ./docker/s3-config.json:/etc/seaweedfs/s3-config.json:ro
    profiles:
      - local-storage

volumes:
  seaweedfs_data:

Key command flags:

  • -dir=/data — explicit volume storage directory (required, otherwise volumes can't be allocated)
  • -volume.max=0 — unlimited volume count (low values like 5 get exhausted by internal metadata volumes)
  • -master.volumeSizeLimitMB=1024 — 1GB volumes instead of default 30GB (dev-friendly)
  • -s3.config=/etc/seaweedfs/s3-config.json — S3 credential config (required, newer SeaweedFS denies anonymous access by default)
  • -volume.port=8080 — explicit volume server port to avoid conflicts

New files:

  • docker/filer.toml — SeaweedFS filer config pointing to PostgreSQL
  • docker/s3-config.json — S3 identity/credential config for SeaweedFS (default: accessKey pdr_local_key, secretKey pdr_local_secret)
  • docker/init-db.sql — Extended to create seaweedfs database with filer schema tables

Data Models

Environment Variables (Storage_Config)

Variable                      Scope            Required When  Default
NEXT_PUBLIC_STORAGE_PROVIDER  Client + Server  Always         cloud
NEXT_PUBLIC_S3_ENDPOINT       Client + Server  local
S3_REGION                     Server           local          us-east-1
S3_ACCESS_KEY                 Server           local
S3_SECRET_KEY                 Server           local
S3_BUCKET_NAME                Server           local

Zod Schema Extension (src/env.ts)

Server schema additions:

NEXT_PUBLIC_STORAGE_PROVIDER: z.enum(["cloud", "local"]).default("cloud"),
NEXT_PUBLIC_S3_ENDPOINT: optionalString(),
S3_REGION: optionalString(),
S3_ACCESS_KEY: optionalString(),
S3_SECRET_KEY: optionalString(),
S3_BUCKET_NAME: optionalString(),

Client schema additions:

NEXT_PUBLIC_STORAGE_PROVIDER: z.enum(["cloud", "local"]).default("cloud"),
NEXT_PUBLIC_S3_ENDPOINT: optionalString(),

A superRefine is added to validate that when NEXT_PUBLIC_STORAGE_PROVIDER === "local", the S3 variables are all present.
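The conditional requirement can be illustrated as a plain function (the real check lives inside the Zod schema via superRefine; this standalone sketch only mirrors its logic):

```typescript
// Variables that must be present when the provider is "local".
const S3_REQUIRED_VARS = [
  "NEXT_PUBLIC_S3_ENDPOINT",
  "S3_REGION",
  "S3_ACCESS_KEY",
  "S3_SECRET_KEY",
  "S3_BUCKET_NAME",
] as const;

// Returns the names of missing/empty S3 variables; an empty array means valid.
function missingS3Vars(env: Record<string, string | undefined>): string[] {
  if (env.NEXT_PUBLIC_STORAGE_PROVIDER !== "local") return []; // cloud mode: nothing extra required
  return S3_REQUIRED_VARS.filter((name) => !env[name]); // treats "" and undefined as missing
}
```

In the schema, superRefine would call the equivalent of this check and emit one issue listing every missing variable, so operators see the full list on first failure.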

fileUploads Table (Existing — No Migration)

Column           Type           Notes
storageProvider  varchar(64)    Existing values: database, vercel_blob. New value: seaweedfs
storageUrl       varchar(1024)  Full S3 URL: http://<endpoint>:8333/<bucket>/<key>
storagePathname  varchar(1024)  S3 object key: documents/<uuid>-<filename>

S3 Object Key Format

documents/{uuid}-{sanitized_filename}

Same pattern as the existing Vercel Blob key format in putFile(), ensuring consistency.
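A minimal key-builder sketch, assuming the helper name and the exact sanitization rules (keep alphanumerics, dots, dashes, underscores; replace everything else) — both are illustrative, not the confirmed putFile() behavior:

```typescript
import { randomUUID } from "node:crypto";

// Builds documents/{uuid}-{sanitized_filename}. Sanitization charset is an
// assumption for this sketch.
function makeObjectKey(filename: string): string {
  const sanitized = filename.replace(/[^a-zA-Z0-9._-]/g, "_");
  return `documents/${randomUUID()}-${sanitized}`;
}
```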

Docker: SeaweedFS Filer Database (init-db.sql)

The init-db.sql script creates the seaweedfs database and the filemeta table required by the [postgres] filer store driver.

Note: The [postgres2] driver (originally planned) has a known SQL formatting bug in current SeaweedFS versions (tested: 3.68 and latest/4.15) where it generates invalid SQL (%!(EXTRA string=filemeta)) during table initialization. We use the [postgres] driver instead, which requires the filemeta table to be pre-created.

CREATE DATABASE seaweedfs;

\c seaweedfs
CREATE TABLE IF NOT EXISTS filemeta (
  dirhash   BIGINT,
  name      VARCHAR(65535),
  directory VARCHAR(65535),
  meta      bytea,
  PRIMARY KEY (dirhash, name)
);

This only affects SeaweedFS's own metadata storage — the app's pdr_ai_v2 database and its Drizzle ORM connection are completely unaffected.

Docker: SeaweedFS Filer Configuration (docker/filer.toml)

The filer config uses the [postgres] section. (The [postgres2] driver is the newer recommended driver but has a SQL formatting bug in current SeaweedFS releases — see init-db.sql note above.)

[postgres]
enabled = true
hostname = "db"
port = 5432
database = "seaweedfs"
username = "postgres"
password = "password"
sslmode = "disable"
connection_max_idle = 2
connection_max_open = 100

Important: The password field must match the POSTGRES_PASSWORD env var in docker-compose.yml (default: password). If you change the PostgreSQL password, update filer.toml to match — it does not support env var interpolation.

Docker: SeaweedFS S3 Credentials (docker/s3-config.json)

S3 access credentials for the SeaweedFS S3 gateway. Newer SeaweedFS versions deny anonymous write access by default, so this config is required.

{
  "identities": [
    {
      "name": "pdr_admin",
      "credentials": [
        {
          "accessKey": "pdr_local_key",
          "secretKey": "pdr_local_secret"
        }
      ],
      "actions": ["Admin", "Read", "List", "Tagging", "Write", "WriteAcp", "ReadAcp"]
    },
    {
      "name": "anonymous",
      "actions": ["Read"]
    }
  ]
}

The S3_ACCESS_KEY and S3_SECRET_KEY env vars used by the app's S3 client must match the credentials defined here.

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Storage provider enum validation

For any string value assigned to NEXT_PUBLIC_STORAGE_PROVIDER, the Zod schema should accept only "cloud" and "local", and reject all other strings. When the variable is absent, it should default to "cloud".

Validates: Requirements 2.1

Property 2: Conditional S3 variable requirement

For any environment configuration where NEXT_PUBLIC_STORAGE_PROVIDER is "local", if any of NEXT_PUBLIC_S3_ENDPOINT, S3_REGION, S3_ACCESS_KEY, S3_SECRET_KEY, or S3_BUCKET_NAME is missing or empty, the Zod validation should fail. When provider is "cloud", these variables should not be required.

Validates: Requirements 2.2, 2.4

Property 3: Presigned URL structure

For any valid S3 object key and content type, the generated presigned upload URL should contain the configured endpoint, bucket name, object key, and a signature query parameter (X-Amz-Signature).

Validates: Requirements 3.3

Property 4: S3 client connection error descriptiveness

For any endpoint string, when the S3 client fails to connect, the thrown error message should contain the endpoint URL so operators can diagnose the issue.

Validates: Requirements 3.4

Property 5: Upload result shape and persistence consistency

For any valid upload input (filename, data, contentType) and for any active storage provider, the uploadFile function should return an UploadResult containing non-empty url, pathname, and provider fields, and the corresponding fileUploads DB record should have storageProvider matching the active provider, storageUrl matching the returned url, and storagePathname matching the returned pathname.

Validates: Requirements 4.4, 4.5, 10.2

Property 6: Upload error propagation

For any error thrown by any storage provider during upload, the error propagated by the Storage Adapter should contain both the provider name (e.g., "seaweedfs", "vercel_blob") and the original error message.

Validates: Requirements 4.6

Property 7: Presign endpoint authentication enforcement

For any request to POST /api/storage/presign without a valid Clerk authentication token, the endpoint should return a 401 status code regardless of the request body content.

Validates: Requirements 5.2

Property 8: Presign response completeness

For any valid presign request (with valid filename and contentType) when storage provider is "local", the response body should contain presignedUrl (non-empty string), objectKey (non-empty string), and bucket (non-empty string).

Validates: Requirements 5.5

Property 9: Mixed-provider document retrieval

For any list of documents with mixed storageProvider values (seaweedfs, vercel_blob, uploadthing), the fetchFile function should correctly resolve each document's URL based on its provider: SeaweedFS documents use NEXT_PUBLIC_S3_ENDPOINT + storagePathname, while cloud documents use their existing retrieval logic.

Validates: Requirements 7.1, 7.3

Property 10: SeaweedFS retrieval error descriptiveness

For any document with storageProvider === "seaweedfs", when the SeaweedFS service is unreachable, the error should indicate that the local storage service is unavailable and include the configured endpoint.

Validates: Requirements 7.4

Property 11: Bootstrap API storage provider reporting

For any value of NEXT_PUBLIC_STORAGE_PROVIDER (including unset), the bootstrap API response should include a storageProvider field that equals the configured value or defaults to "cloud", and should always include the isUploadThingConfigured field for backward compatibility.

Validates: Requirements 9.1, 9.3

Error Handling

Storage Adapter Errors

Scenario                               Behavior
SeaweedFS unreachable on upload        Throw StorageError with provider name, endpoint, and original connection error
SeaweedFS unreachable on retrieval     Return descriptive error: "Local storage service unavailable at {endpoint}"
Presigned URL generation fails         Throw with S3 client error details and configured endpoint
Invalid storage provider in env        Zod validation fails at startup with clear message about valid values
Missing S3 variables for local mode    Zod superRefine fails with list of missing required variables
Cloud provider upload fails            Propagate original UploadThing/Vercel Blob error with provider name prefix
Presign endpoint called in cloud mode  Return 400 with message: "Presigned URLs are not applicable for cloud storage"
Unauthenticated presign request        Return 401 with message: "Authentication required"

Error Class

export class StorageError extends Error {
  constructor(
    message: string,
    public readonly provider: string,
    public readonly cause?: Error,
  ) {
    super(`[${provider}] ${message}`);
    this.name = "StorageError";
  }
}
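A usage sketch for the error-propagation property (Property 6), repeating the class so the snippet is self-contained; the wrapper function name is hypothetical:

```typescript
class StorageError extends Error {
  constructor(
    message: string,
    public readonly provider: string,
    public readonly cause?: Error,
  ) {
    super(`[${provider}] ${message}`);
    this.name = "StorageError";
  }
}

// Wraps a provider failure so both the provider name and the original
// message survive to the caller.
function wrapUploadError(provider: string, original: Error): StorageError {
  return new StorageError(original.message, provider, original);
}
```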

Graceful Degradation

  • If NEXT_PUBLIC_STORAGE_PROVIDER is not set, the system defaults to "cloud" mode — no breaking change for existing deployments.
  • The SeaweedFS Docker service uses the local-storage profile, so it only starts when explicitly requested (docker compose --profile local-storage up).
  • The fetchFile function checks the document's storageProvider field to determine retrieval strategy, so mixed-provider libraries work without configuration changes.

Testing Strategy

Property-Based Testing

Library: fast-check (already in devDependencies)

Each correctness property maps to a single property-based test with a minimum of 100 iterations. Tests are tagged with the format:

Feature: local-s3-migration, Property {N}: {property_text}

Property tests to implement:

  1. Env validation properties (Properties 1, 2): Generate random strings for NEXT_PUBLIC_STORAGE_PROVIDER and random presence/absence of S3 variables. Assert Zod schema accepts/rejects correctly.
  2. Presigned URL structure (Property 3): Generate random object keys and content types. Assert URL contains expected components.
  3. S3 client error descriptiveness (Property 4): Generate random endpoint strings. Mock connection failure. Assert error contains endpoint.
  4. Upload result consistency (Property 5): Generate random filenames, data buffers, content types. Mock both providers. Assert result shape and DB record correctness.
  5. Error propagation (Property 6): Generate random error messages and provider names. Assert propagated error contains both.
  6. Auth enforcement (Property 7): Generate random request bodies. Assert 401 without auth.
  7. Presign response completeness (Property 8): Generate random filenames and content types. Assert response contains all required fields.
  8. Mixed-provider retrieval (Property 9): Generate random document lists with mixed providers. Assert correct URL resolution per provider.
  9. Retrieval error descriptiveness (Property 10): Generate random document keys. Mock unreachable SeaweedFS. Assert error message content.
  10. Bootstrap reporting (Property 11): Generate random provider values. Assert response shape.

Unit Tests

Unit tests complement property tests for specific examples and edge cases:

  • Bootstrap API returns s3Endpoint only when provider is "local" (Req 9.2)
  • Presign endpoint returns 400 when provider is "cloud" (Req 5.4)
  • Presigned URL expiry is set to 300 seconds (Req 5.3)
  • Cloud mode delegates to Vercel Blob putFile (Req 4.2)
  • Local mode delegates to S3 client (Req 4.3)
  • Upload component calls /api/uploadDocument after successful upload in both modes (Req 6.5)
  • fileUploads table accepts "seaweedfs" as storageProvider value (Req 10.1)

Integration Tests (Manual / CI)

  • Docker Compose stack starts with --profile local-storage and SeaweedFS is reachable on port 8333
  • End-to-end upload flow: presign → PUT to SeaweedFS → trigger OCR pipeline
  • Mixed-provider document library: upload via cloud, upload via local, retrieve both

Parallel Workstream Testing

The three workstreams can be tested independently:

  1. Infrastructure: Docker Compose validation, SeaweedFS connectivity
  2. Backend: Storage Adapter unit/property tests with mocked providers, API route tests
  3. Frontend: Component tests with mocked API responses
