Rhombus-AI-Home-Test

Single-host Django + React application for browsing CSV and Excel files in Amazon S3, inferring Pandas data types, previewing the processed result with optional column overrides, and experimenting with Redis/Celery background processing plus PySpark CSV comparison.

Public deployment: https://rhombus-ai-home-test.onrender.com/

What it does

Connects to S3 using runtime AWS credentials supplied by the user.
Lists supported .csv, .xls, and .xlsx objects from a bucket or prefix.
Profiles columns with conservative inference rules for integers, floats, booleans, dates, datetimes, categories, and complex numbers.
Lets the user override inferred types before reprocessing.
Pages through the processed dataset from the backend so reviewers can inspect full files instead of a capped in-memory sample.
Prefers background processing jobs through Celery and Redis while the workbench keeps the last successful preview visible and polls recent run status updates.
Offers an experimental PySpark comparison mode for completed CSV Pandas runs so users can compare runtime, row counts, schema mapping, and preview output.
Stores sanitized processing metadata in Django without persisting AWS secrets.
Exposes a local CLI via infer_data_types.py for quick local-file smoke testing.

Inference and performance approach

CSV files are staged from S3 onto local disk, then profiled in chunks so larger datasets do not depend on a single long-lived streaming response.
Repeated preview-page requests can reuse a small bounded cache of staged S3 files to avoid re-downloading the same CSV on every page change or override.
Excel files are supported, but they still load the selected sheet into memory and are capped at 20 MB in this MVP.
Type inference is intentionally conservative: ambiguous short dates stay as text unless overridden, and manual overrides are validated to avoid lossy coercion.
Date and DateTime are both backed by pandas datetime storage internally, but the preview renders them differently so date-only columns stay calendar-shaped.
Category inference is strict for larger datasets and slightly softer for very small repeated-label samples such as grade-like columns.
Redis and Celery now power the normal workbench processing flow on this branch, while the original synchronous Pandas path remains as an automatic fallback when queue infrastructure is unavailable.
PySpark is scoped intentionally as an experimental CSV comparison mode that runs after a completed Pandas result exists. It compares row counts, Spark-native schema mapping, and preview slices without replacing the current Pandas inference engine.

Stack

Backend: Django 5, Django REST Framework, Pandas, boto3, Celery, Redis, PySpark (experimental)
Frontend: React 19, TypeScript, Vite, Vitest
Local orchestration: Docker Compose for web, Celery worker, and Redis
Deployment shape: single host serving the built frontend from Django

Project structure

backend/: Django project and the data_processing app
frontend/: React + TypeScript frontend
docs/brief/: assignment brief and supporting project notes
examples/: sample datasets for local smoke testing
infer_data_types.py: local CLI wrapper around the shared processing service
Dockerfile: production-oriented single-container deployment
docker-compose.yml: local async development stack for Django, Celery, and Redis

Requirements

Python 3.12 recommended for local work to match the Docker runtime
Node.js 22 or newer recommended
npm 11 or newer
Docker Desktop for container verification and deployment builds
Java 17 or newer for the experimental local PySpark comparison path
Redis for the optional local async-processing stack, unless you use Docker Compose

The current dependency set also installs and runs on Python 3.14, but keeping local development on Python 3.12 reduces drift from the containerized runtime. If you only want the stable synchronous flow, Redis and Java are optional. They are needed only for the experimental async and Spark comparison features in this branch.

Local setup

1. Create the backend environment

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt

2. Install the frontend dependencies

cd frontend
npm install
cd ..

3. Apply migrations

python manage.py migrate

4. Optional: start Redis and a Celery worker for the async path

The new async endpoints require Redis plus a running Celery worker. The quickest local path is Docker Compose:

docker compose up --build

If you prefer to run the pieces manually, start Redis first, then launch the worker in a second terminal:

python -m celery -A rhombus_home_test worker --loglevel=info

Running the app locally

Backend plus built frontend

Build the frontend once, then let Django serve it:

cd frontend
npm run build
cd ..
python manage.py runserver

Open http://127.0.0.1:8000.

Async and experimental stack via Docker Compose

This branch adds a local multi-service setup for the new Celery and Redis path:

docker compose up --build

That starts:

web: Django + the built frontend
worker: Celery background worker
redis: broker/result backend for queued jobs

The Compose setup mounts a shared SQLite volume for both web and worker so background tasks can read and update the same ProcessingRun rows that the web container creates.

Split development mode

Run Django for the API:

python manage.py runserver

In a second terminal, run Vite for the frontend:

cd frontend
npm run dev

Open http://127.0.0.1:5173.

Environment variables

The app works locally without extra configuration, but these variables are supported for deployment:

DJANGO_SECRET_KEY: Django secret key
DJANGO_DEBUG: True or False
DJANGO_ALLOWED_HOSTS: comma-separated hostnames
DJANGO_CSRF_TRUSTED_ORIGINS: comma-separated trusted origins
DJANGO_SQLITE_PATH: optional SQLite database path
DJANGO_FRONTEND_BUILD_DIR: optional override for the built frontend location
WEB_CONCURRENCY: Gunicorn worker count for the container runtime
GUNICORN_TIMEOUT: Gunicorn request timeout in seconds
CSV_CHUNK_SIZE: chunk size used for CSV profiling and preview paging
STAGED_FILE_CACHE_MAX_ITEMS: max number of staged S3 files kept on local disk for reuse (0 disables the cache)
STAGED_FILE_CACHE_TTL_SECONDS: cache lifetime for staged S3 files
REDIS_URL: default Redis connection string
CELERY_BROKER_URL: Celery broker URL, typically Redis
CELERY_RESULT_BACKEND: Celery result backend URL, typically Redis
CELERY_TASK_TIME_LIMIT: hard time limit for background tasks
CELERY_TASK_SOFT_TIME_LIMIT: soft time limit for background tasks
CELERY_TASK_ALWAYS_EAGER: optional local debugging switch to execute async tasks inline
PORT: port used by the container startup command

On Render, RENDER_EXTERNAL_HOSTNAME and RENDER_EXTERNAL_URL are injected automatically and merged into Django's trusted host and origin lists. You only need to set DJANGO_ALLOWED_HOSTS or DJANGO_CSRF_TRUSTED_ORIGINS manually if you later add a custom domain.

See .env.example for a starter local or container configuration.

CLI usage

The local CLI uses the same inference service as the web application:

python infer_data_types.py examples/sample_data.csv --preview-rows 5

Optional Excel sheet selection:

python infer_data_types.py path\to\workbook.xlsx --sheet-name Sheet1

API summary

`POST /api/s3/files`

Lists supported S3 objects for a bucket or prefix.

Request body:

{
  "access_key_id": "AKIA...",
  "secret_access_key": "secret",
  "session_token": "",
  "region": "ap-southeast-2",
  "bucket": "demo-bucket",
  "prefix": "incoming/"
}

Project brief alignment

Pandas inference and conversion: the shared backend service profiles and converts CSV/XLS/XLSX data loaded into pandas DataFrames, with explicit handling for object-like mixed columns, dates, numerics, categories, booleans, and complex values.
Large-file handling and tuning: S3 CSVs are staged locally, profiled in chunks, and paged from the backend. Render tuning knobs such as WEB_CONCURRENCY, GUNICORN_TIMEOUT, CSV_CHUNK_SIZE, and staged-file cache settings are documented below.
Django backend: the API exposes S3 browsing, file processing, preview pagination, health checks, and persisted sanitized run metadata.
React frontend: the UI supports S3 connection, file browsing, file processing, schema review, manual overrides, reset/reprocess actions, and processed preview pagination.
Documentation and testing: the repo includes setup/deploy instructions, backend tests, frontend tests, and a local CLI for smoke testing with sample datasets.

`POST /api/data/process`

Processes the selected S3 object, stores the inferred schema, and returns the first preview page.

Request body:

{
  "access_key_id": "AKIA...",
  "secret_access_key": "secret",
  "session_token": "",
  "region": "ap-southeast-2",
  "bucket": "demo-bucket",
  "prefix": "incoming/",
  "object_key": "incoming/sample.csv",
  "sheet_name": "",
  "preview_row_limit": 100,
  "overrides": [
    { "column": "Score", "target_type": "float" }
  ]
}

Response highlights:

runId: stored processing run identifier used for later preview-page requests and display
schema: inferred or overridden schema metadata
previewRows: first processed page of rows
previewPage: page metadata including page, pageSize, totalRows, and totalPages

`POST /api/data/process-async`

Queues the existing Pandas processing flow as a Celery background task. The enhancement-branch workbench tries this endpoint first and falls back to the synchronous process endpoint only if queue infrastructure is unavailable.

Response highlights:

runId: processing run identifier used for polling
taskId: Celery task identifier
status: initial queued state
engine: currently pandas

`GET /api/data/runs/<id>`

Polls the current run lifecycle state.

While queued or processing, the response returns:

status
engine
progressStage
progressPercent
errorMessage

Once completed, it also returns the same processed payload shape used by POST /api/data/process, including:

rowCount
schema
previewColumns
previewRows
previewPage
warnings
processingMetadata

Completed Spark comparison runs return the shared lifecycle fields plus a sparkComparison payload instead of the main Pandas schema payload.

`GET /api/data/runs`

Lists recent processing and comparison runs for the current workbench file so the frontend can render the jobs tray.

Query params:

object_key: optional S3 object key filter
limit: optional max item count, default 5

`POST /api/data/preview`

Returns a specific processed page using the current schema and file context. The API will reuse the stored ProcessingRun when it is available, but the request can also supply the preview context directly so paging still works in stateless or ephemeral deployment environments.

Request body:

{
  "access_key_id": "AKIA...",
  "secret_access_key": "secret",
  "session_token": "",
  "region": "ap-southeast-2",
  "bucket": "demo-bucket",
  "prefix": "incoming/",
  "run_id": 12,
  "object_key": "incoming/sample.csv",
  "file_type": "csv",
  "selected_sheet": "",
  "row_count": 14941,
  "schema": [
    {
      "column": "Score",
      "inferred_type": "float",
      "storage_type": "Float64",
      "display_type": "Float",
      "nullable": true,
      "confidence": 0.97,
      "warnings": [],
      "null_token_count": 0,
      "sample_values": ["90", "75"],
      "allowed_overrides": ["text", "integer", "float", "boolean", "date", "datetime", "category", "complex"]
    }
  ],
  "preview_columns": ["Score"],
  "page": 3,
  "page_size": 50
}

`POST /api/data/spark-compare`

Queues an experimental CSV-only PySpark comparison against a completed Pandas processing run. This mode is intentionally educational and does not replace the main Pandas inference pipeline.

Request body highlights:

source_run_id: required completed Pandas run identifier for the CSV result being compared
page and page_size: requested Spark preview slice for the comparison payload

Response highlights:

runId: queued Spark comparison run
taskId: Celery task identifier
runType: spark_compare
sourceRunId: completed Pandas run that the comparison is anchored to

Async architecture flow

The enhancement path introduced on this branch follows this lifecycle:

Django request -> Celery task -> Redis broker/result backend -> Celery worker
-> ProcessingRun updates in Django -> frontend polling via GET /api/data/runs/<id>

That keeps Redis focused on transient queueing and task state, while Django remains the durable source of truth for user-visible processing runs.

Testing

Backend:

python manage.py test
python manage.py check

Frontend:

cd frontend
npm test
npm run build

CLI smoke test:

python infer_data_types.py examples/sample_data.csv --preview-rows 5

Docker verification

Build the production image:

docker build -t rhombus-home-test .

Run the container locally:

docker run --rm -p 8000:8000 `
  -e DJANGO_SECRET_KEY=replace-me `
  -e DJANGO_DEBUG=False `
  -e DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1 `
  rhombus-home-test

Then open http://127.0.0.1:8000.

Run the local async stack:

docker compose up --build

Render deployment

Render is the recommended public host for this project because it matches the app's single-container architecture and gives you one public URL for both the Django API and the React frontend.

This branch keeps the current synchronous Pandas flow as the stable deployment default. The new Redis/Celery and PySpark additions are intended to be validated locally or in a separate enhancement/demo environment first rather than forcing them into the existing submission deployment immediately.

Create the service

Sign in to Render and create a new Web Service.
Connect the GitHub repository BrownAssassin/Rhombus-AI-Home-Test.
Select the main branch.
Choose Docker as the runtime so Render builds from the repository Dockerfile.
Set the health check path to /api/health/.

Configure environment variables

Add these values in the Render dashboard before the first deploy:

DJANGO_SECRET_KEY: required secret value for Django
DJANGO_DEBUG=False

Optional settings:

DJANGO_ALLOWED_HOSTS: only needed if you later add a custom domain
DJANGO_CSRF_TRUSTED_ORIGINS: only needed if you later add a custom domain
DJANGO_SQLITE_PATH=/app/data/db.sqlite3: only needed if you later attach a persistent disk and want the SQLite file to survive redeploys
WEB_CONCURRENCY=1: recommended for the starter Render instance size
GUNICORN_TIMEOUT=180: gives longer CSV processing requests room to finish
CSV_CHUNK_SIZE=500: keeps CSV memory use conservative during profiling and paging
STAGED_FILE_CACHE_MAX_ITEMS=2: lets repeated paging and reprocessing reuse the same staged S3 file
STAGED_FILE_CACHE_TTL_SECONDS=900: expires staged files after 15 minutes to keep disk use bounded

Render automatically provides PORT, RENDER_EXTERNAL_HOSTNAME, and RENDER_EXTERNAL_URL, and the app is configured to trust those values without any extra setup.

Deploy and verify

Trigger the initial deploy from Render.
Open the generated onrender.com URL after the health check passes.
Confirm GET /api/health/ returns a healthy response.
Run one end-to-end flow through the UI with a real or demo S3 bucket.

Notes and limitations

AWS credentials are accepted at runtime and are intentionally not stored in the database.
CSV handling is chunked after staging the S3 object to a temp file, and repeat requests can reuse a small bounded local cache of staged files to avoid unnecessary re-downloads. If you set STAGED_FILE_CACHE_MAX_ITEMS=0, staged files are cleaned up after each request instead of being reused.
Excel handling is capped at 20 MB in this MVP, and each preview-page request reloads the selected sheet because pandas does not offer the same chunked read path as CSV.
Type inference is intentionally conservative. Ambiguous date columns stay as text unless the user overrides them.
Small repeated-label columns can infer as Category, but high-cardinality or mostly-unique string columns intentionally stay as Text.
The app supports paginated preview browsing across the processed dataset, but it does not export a full transformed file in this MVP.
A Render deployment that keeps SQLite in the container filesystem is suitable for demos, but ProcessingRun history resets whenever the service is rebuilt or restarted.
Redis and Celery are the real production-style enhancement in this branch. They move longer processing jobs out of the request-response cycle while keeping Django as the source of truth for run status.
The PySpark feature in this branch is experimental and CSV-only. It is intentionally framed as a comparison/learning tool, not as a drop-in replacement for the main Pandas inference engine.
The polished main branch remains the stable submission snapshot. This branch is a post-submission enhancement that demonstrates how Redis/Celery and PySpark could be applied to the app thoughtfully.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
backend		backend
docker		docker
docs/brief		docs/brief
examples		examples
frontend		frontend
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
infer_data_types.py		infer_data_types.py
manage.py		manage.py
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Rhombus-AI-Home-Test

What it does

Inference and performance approach

Stack

Project structure

Requirements

Local setup

1. Create the backend environment

2. Install the frontend dependencies

3. Apply migrations

4. Optional: start Redis and a Celery worker for the async path

Running the app locally

Backend plus built frontend

Async and experimental stack via Docker Compose

Split development mode

Environment variables

CLI usage

API summary

POST /api/s3/files

Project brief alignment

POST /api/data/process

POST /api/data/process-async

GET /api/data/runs/<id>

GET /api/data/runs

POST /api/data/preview

POST /api/data/spark-compare

Async architecture flow

Testing

Docker verification

Render deployment

Create the service

Configure environment variables

Deploy and verify

Notes and limitations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages

`POST /api/s3/files`

`POST /api/data/process`

`POST /api/data/process-async`

`GET /api/data/runs/<id>`

`GET /api/data/runs`

`POST /api/data/preview`

`POST /api/data/spark-compare`