Single-host Django + React application for browsing CSV and Excel files in Amazon S3, inferring Pandas data types, previewing the processed result with optional column overrides, and experimenting with Redis/Celery background processing plus PySpark CSV comparison.
Public deployment: https://rhombus-ai-home-test.onrender.com/
- Connects to S3 using runtime AWS credentials supplied by the user.
- Lists supported .csv, .xls, and .xlsx objects from a bucket or prefix.
- Profiles columns with conservative inference rules for integers, floats, booleans, dates, datetimes, categories, and complex numbers.
- Lets the user override inferred types before reprocessing.
- Pages through the processed dataset from the backend so reviewers can inspect full files instead of a capped in-memory sample.
- Prefers background processing jobs through Celery and Redis while the workbench keeps the last successful preview visible and polls recent run status updates.
- Offers an experimental PySpark comparison mode for completed CSV Pandas runs so users can compare runtime, row counts, schema mapping, and preview output.
- Stores sanitized processing metadata in Django without persisting AWS secrets.
- Exposes a local CLI via infer_data_types.py for quick local-file smoke testing.
- CSV files are staged from S3 onto local disk, then profiled in chunks so larger datasets do not depend on a single long-lived streaming response.
- Repeated preview-page requests can reuse a small bounded cache of staged S3 files to avoid re-downloading the same CSV on every page change or override.
- Excel files are supported, but they still load the selected sheet into memory and are capped at 20 MB in this MVP.
- Type inference is intentionally conservative: ambiguous short dates stay as text unless overridden, and manual overrides are validated to avoid lossy coercion.
- Date and DateTime are both backed by pandas datetime storage internally, but the preview renders them differently so date-only columns stay calendar-shaped.
- Category inference is strict for larger datasets and slightly softer for very small repeated-label samples such as grade-like columns.
- Redis and Celery now power the normal workbench processing flow on this branch, while the original synchronous Pandas path remains as an automatic fallback when queue infrastructure is unavailable.
- PySpark is scoped intentionally as an experimental CSV comparison mode that runs after a completed Pandas result exists. It compares row counts, Spark-native schema mapping, and preview slices without replacing the current Pandas inference engine.
- Backend: Django 5, Django REST Framework, Pandas, boto3, Celery, Redis, PySpark (experimental)
- Frontend: React 19, TypeScript, Vite, Vitest
- Local orchestration: Docker Compose for web, Celery worker, and Redis
- Deployment shape: single host serving the built frontend from Django
- backend/: Django project and the data_processing app
- frontend/: React + TypeScript frontend
- docs/brief/: assignment brief and supporting project notes
- examples/: sample datasets for local smoke testing
- infer_data_types.py: local CLI wrapper around the shared processing service
- Dockerfile: production-oriented single-container deployment
- docker-compose.yml: local async development stack for Django, Celery, and Redis
- Python 3.12 recommended for local work to match the Docker runtime
- Node.js 22 or newer recommended
- npm 11 or newer
- Docker Desktop for container verification and deployment builds
- Java 17 or newer for the experimental local PySpark comparison path
- Redis for the optional local async-processing stack, unless you use Docker Compose
The current dependency set also installs and runs on Python 3.14, but keeping local development on Python 3.12 reduces drift from the containerized runtime. If you only want the stable synchronous flow, Redis and Java are optional. They are needed only for the experimental async and Spark comparison features in this branch.
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
cd frontend
npm install
cd ..
python manage.py migrate
The new async endpoints require Redis plus a running Celery worker. The quickest local path is Docker Compose:
docker compose up --build
If you prefer to run the pieces manually, start Redis first, then launch the worker in a second terminal:
python -m celery -A rhombus_home_test worker --loglevel=info
Build the frontend once, then let Django serve it:
cd frontend
npm run build
cd ..
python manage.py runserver
Open http://127.0.0.1:8000.
This branch adds a local multi-service setup for the new Celery and Redis path:
docker compose up --build
That starts:
- web: Django + the built frontend
- worker: Celery background worker
- redis: broker/result backend for queued jobs
The Compose setup mounts a shared SQLite volume for both web and worker so background tasks can read and update the same ProcessingRun rows that the web container creates.
Run Django for the API:
python manage.py runserver
In a second terminal, run Vite for the frontend:
cd frontend
npm run dev
Open http://127.0.0.1:5173.
The app works locally without extra configuration, but these variables are supported for deployment:
- DJANGO_SECRET_KEY: Django secret key
- DJANGO_DEBUG: True or False
- DJANGO_ALLOWED_HOSTS: comma-separated hostnames
- DJANGO_CSRF_TRUSTED_ORIGINS: comma-separated trusted origins
- DJANGO_SQLITE_PATH: optional SQLite database path
- DJANGO_FRONTEND_BUILD_DIR: optional override for the built frontend location
- WEB_CONCURRENCY: Gunicorn worker count for the container runtime
- GUNICORN_TIMEOUT: Gunicorn request timeout in seconds
- CSV_CHUNK_SIZE: chunk size used for CSV profiling and preview paging
- STAGED_FILE_CACHE_MAX_ITEMS: max number of staged S3 files kept on local disk for reuse (0 disables the cache)
- STAGED_FILE_CACHE_TTL_SECONDS: cache lifetime for staged S3 files
- REDIS_URL: default Redis connection string
- CELERY_BROKER_URL: Celery broker URL, typically Redis
- CELERY_RESULT_BACKEND: Celery result backend URL, typically Redis
- CELERY_TASK_TIME_LIMIT: hard time limit for background tasks
- CELERY_TASK_SOFT_TIME_LIMIT: soft time limit for background tasks
- CELERY_TASK_ALWAYS_EAGER: optional local debugging switch to execute async tasks inline
- PORT: port used by the container startup command
On Render, RENDER_EXTERNAL_HOSTNAME and RENDER_EXTERNAL_URL are injected automatically and merged into Django's trusted host and origin lists. You only need to set DJANGO_ALLOWED_HOSTS or DJANGO_CSRF_TRUSTED_ORIGINS manually if you later add a custom domain.
See .env.example for a starter local or container configuration.
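As a rough illustration of how the Render-injected values could be merged with the manual settings, here is a minimal sketch. The helper names split_csv and build_trusted_lists are hypothetical and are not taken from the project's settings module; only the environment variable names come from this README.

```python
from typing import Mapping
from urllib.parse import urlsplit


def split_csv(value: str) -> list[str]:
    """Split a comma-separated setting, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_trusted_lists(env: Mapping[str, str]) -> tuple[list[str], list[str]]:
    """Return (allowed_hosts, csrf_trusted_origins) built from the environment.

    Hypothetical helper: the real settings code may merge these differently.
    """
    hosts = split_csv(env.get("DJANGO_ALLOWED_HOSTS", ""))
    origins = split_csv(env.get("DJANGO_CSRF_TRUSTED_ORIGINS", ""))

    # Render injects the bare hostname, which is the shape ALLOWED_HOSTS expects.
    render_host = env.get("RENDER_EXTERNAL_HOSTNAME", "").strip()
    if render_host and render_host not in hosts:
        hosts.append(render_host)

    # CSRF trusted origins need a scheme, so keep only scheme + host from the URL.
    render_url = env.get("RENDER_EXTERNAL_URL", "").strip()
    if render_url:
        parts = urlsplit(render_url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in origins:
            origins.append(origin)

    return hosts, origins
```

In a settings file this would typically be called with os.environ, and the two returned lists assigned to ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS.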
The local CLI uses the same inference service as the web application:
python infer_data_types.py examples/sample_data.csv --preview-rows 5
Optional Excel sheet selection:
python infer_data_types.py path\to\workbook.xlsx --sheet-name Sheet1
Lists supported S3 objects for a bucket or prefix.
Request body:
{
"access_key_id": "AKIA...",
"secret_access_key": "secret",
"session_token": "",
"region": "ap-southeast-2",
"bucket": "demo-bucket",
"prefix": "incoming/"
}
- Pandas inference and conversion: the shared backend service profiles and converts CSV/XLS/XLSX data loaded into pandas DataFrames, with explicit handling for object-like mixed columns, dates, numerics, categories, booleans, and complex values.
- Large-file handling and tuning: S3 CSVs are staged locally, profiled in chunks, and paged from the backend. Render tuning knobs such as WEB_CONCURRENCY, GUNICORN_TIMEOUT, CSV_CHUNK_SIZE, and staged-file cache settings are documented below.
- Django backend: the API exposes S3 browsing, file processing, preview pagination, health checks, and persisted sanitized run metadata.
- React frontend: the UI supports S3 connection, file browsing, file processing, schema review, manual overrides, reset/reprocess actions, and processed preview pagination.
- Documentation and testing: the repo includes setup/deploy instructions, backend tests, frontend tests, and a local CLI for smoke testing with sample datasets.
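To make the idea of chunked profiling concrete, the sketch below reads a CSV a fixed number of rows at a time and keeps only small per-column flags between chunks. It uses the standard-library csv module as a stand-in for pandas chunked reads so that it runs without dependencies, and its three-way integer/float/text rule is a deliberately simplified placeholder, not the app's actual inference logic. The names iter_chunks and profile_csv are hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Callable, Iterator, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size row dicts from a CSV text stream."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def _parses(value: str, caster: Callable[[str], object]) -> bool:
    """Return True when caster accepts the value without raising ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def profile_csv(handle: TextIO, chunk_size: int = 500) -> dict[str, str]:
    """Label each column 'integer', 'float', or 'text', one chunk at a time."""
    state: dict[str, dict[str, bool]] = {}
    for chunk in iter_chunks(handle, chunk_size):
        for row in chunk:
            for column, value in row.items():
                if column is None:
                    continue  # extra cells beyond the header are ignored here
                flags = state.setdefault(
                    column, {"int": True, "float": True, "seen": False}
                )
                if not isinstance(value, str) or value.strip() == "":
                    continue  # blanks count as nulls and rule nothing out
                flags["seen"] = True
                flags["int"] = flags["int"] and _parses(value, int)
                flags["float"] = flags["float"] and _parses(value, float)

    result = {}
    for column, flags in state.items():
        if not flags["seen"]:
            result[column] = "text"  # nothing but nulls: stay conservative
        elif flags["int"]:
            result[column] = "integer"
        elif flags["float"]:
            result[column] = "float"
        else:
            result[column] = "text"
    return result


sample = io.StringIO("id,score,name\n1,90,Ann\n2,75.5,Bob\n3,,Cy\n")
print(profile_csv(sample, chunk_size=2))
```

Because only the per-column flags survive from one chunk to the next, memory use is bounded by the chunk size rather than by the file size, which is the property the chunked approach is after.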
Processes the selected S3 object, stores the inferred schema, and returns the first preview page.
Request body:
{
"access_key_id": "AKIA...",
"secret_access_key": "secret",
"session_token": "",
"region": "ap-southeast-2",
"bucket": "demo-bucket",
"prefix": "incoming/",
"object_key": "incoming/sample.csv",
"sheet_name": "",
"preview_row_limit": 100,
"overrides": [
{ "column": "Score", "target_type": "float" }
]
}
Response highlights:
- runId: stored processing run identifier used for later preview-page requests and display
- schema: inferred or overridden schema metadata
- previewRows: first processed page of rows
- previewPage: page metadata including page, pageSize, totalRows, and totalPages
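The previewPage fields are related by simple arithmetic. The sketch below shows one plausible way to derive them for a 1-based page number. The function name preview_page, the rowStart/rowStop helper fields, and the clamping of out-of-range pages are assumptions for illustration, and the row counts in the usage note are illustrative numbers only.

```python
import math


def preview_page(total_rows: int, page: int, page_size: int) -> dict:
    """Compute page metadata and slice bounds for a 1-based page number."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # Assumption: an empty dataset still reports a single (empty) page.
    total_pages = max(1, math.ceil(total_rows / page_size))
    # Assumption: out-of-range requests are clamped to the nearest valid page.
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    stop = min(start + page_size, total_rows)
    return {
        "page": page,
        "pageSize": page_size,
        "totalRows": total_rows,
        "totalPages": total_pages,
        "rowStart": start,  # hypothetical helper field, not part of the API
        "rowStop": stop,    # hypothetical helper field, not part of the API
    }
```

For example, 14941 rows at 50 rows per page gives 299 pages, and page 3 covers zero-based rows 100 up to (but not including) 150.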
Queues the existing Pandas processing flow as a Celery background task. The enhancement-branch workbench tries this endpoint first and falls back to the synchronous process endpoint only if queue infrastructure is unavailable.
Response highlights:
- runId: processing run identifier used for polling
- taskId: Celery task identifier
- status: initial queued state
- engine: currently pandas
Polls the current run lifecycle state.
While queued or processing, the response returns:
- status
- engine
- progressStage
- progressPercent
- errorMessage
Once completed, it also returns the same processed payload shape used by POST /api/data/process, including:
- rowCount
- schema
- previewColumns
- previewRows
- previewPage
- warnings
- processingMetadata
Completed Spark comparison runs return the shared lifecycle fields plus a sparkComparison payload instead of the main Pandas schema payload.
Lists recent processing and comparison runs for the current workbench file so the frontend can render the jobs tray.
Query params:
- object_key: optional S3 object key filter
- limit: optional max item count, default 5
Returns a specific processed page using the current schema and file context. The API will reuse the stored ProcessingRun
when it is available, but the request can also supply the preview context directly so paging still works in stateless or
ephemeral deployment environments.
Request body:
{
"access_key_id": "AKIA...",
"secret_access_key": "secret",
"session_token": "",
"region": "ap-southeast-2",
"bucket": "demo-bucket",
"prefix": "incoming/",
"run_id": 12,
"object_key": "incoming/sample.csv",
"file_type": "csv",
"selected_sheet": "",
"row_count": 14941,
"schema": [
{
"column": "Score",
"inferred_type": "float",
"storage_type": "Float64",
"display_type": "Float",
"nullable": true,
"confidence": 0.97,
"warnings": [],
"null_token_count": 0,
"sample_values": ["90", "75"],
"allowed_overrides": ["text", "integer", "float", "boolean", "date", "datetime", "category", "complex"]
}
],
"preview_columns": ["Score"],
"page": 3,
"page_size": 50
}
Queues an experimental CSV-only PySpark comparison against a completed Pandas processing run. This mode is intentionally educational and does not replace the main Pandas inference pipeline.
Request body highlights:
- source_run_id: required completed Pandas run identifier for the CSV result being compared
- page and page_size: requested Spark preview slice for the comparison payload
Response highlights:
- runId: queued Spark comparison run
- taskId: Celery task identifier
- runType: spark_compare
- sourceRunId: completed Pandas run that the comparison is anchored to
The enhancement path introduced on this branch follows this lifecycle:
Django request -> Celery task -> Redis broker/result backend -> Celery worker
-> ProcessingRun updates in Django -> frontend polling via GET /api/data/runs/<id>
That keeps Redis focused on transient queueing and task state, while Django remains the durable source of truth for user-visible processing runs.
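The polling end of that lifecycle can be sketched from a script client's point of view. This is a minimal sketch, not the workbench's actual polling code: poll_run is a hypothetical helper, the fetch function is injected so the loop does not depend on any particular HTTP library, and any status outside queued, processing, and completed is treated as terminal by assumption.

```python
import time
from typing import Callable

# "queued" and "processing" come from the run-status description in this README;
# treating every other status as terminal is an assumption of this sketch.
IN_FLIGHT = {"queued", "processing"}


def poll_run(
    fetch_run: Callable[[int], dict],
    run_id: int,
    interval_seconds: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_run(run_id) until the run leaves the in-flight states."""
    for attempt in range(max_attempts):
        payload = fetch_run(run_id)
        if payload.get("status") not in IN_FLIGHT:
            return payload
        # Skip the final sleep so a timeout is reported without extra delay.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"run {run_id} still in flight after {max_attempts} polls")
```

Here fetch_run would wrap a GET to /api/data/runs/<id> and return the decoded JSON body. Passing it in as a parameter also makes the loop easy to exercise with canned responses.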
Backend:
python manage.py test
python manage.py check
Frontend:
cd frontend
npm test
npm run build
CLI smoke test:
python infer_data_types.py examples/sample_data.csv --preview-rows 5
Build the production image:
docker build -t rhombus-home-test .
Run the container locally:
docker run --rm -p 8000:8000 `
-e DJANGO_SECRET_KEY=replace-me `
-e DJANGO_DEBUG=False `
-e DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1 `
rhombus-home-test
Then open http://127.0.0.1:8000.
Run the local async stack:
docker compose up --build
Render is the recommended public host for this project because it matches the app's single-container architecture and gives you one public URL for both the Django API and the React frontend.
This branch keeps the current synchronous Pandas flow as the stable deployment default. The new Redis/Celery and PySpark additions are intended to be validated locally or in a separate enhancement/demo environment first rather than forcing them into the existing submission deployment immediately.
- Sign in to Render and create a new Web Service.
- Connect the GitHub repository BrownAssassin/Rhombus-AI-Home-Test.
- Select the main branch.
- Choose Docker as the runtime so Render builds from the repository Dockerfile.
- Set the health check path to /api/health/.
Add these values in the Render dashboard before the first deploy:
- DJANGO_SECRET_KEY: required secret value for Django
- DJANGO_DEBUG=False
Optional settings:
- DJANGO_ALLOWED_HOSTS: only needed if you later add a custom domain
- DJANGO_CSRF_TRUSTED_ORIGINS: only needed if you later add a custom domain
- DJANGO_SQLITE_PATH=/app/data/db.sqlite3: only needed if you later attach a persistent disk and want the SQLite file to survive redeploys
- WEB_CONCURRENCY=1: recommended for the starter Render instance size
- GUNICORN_TIMEOUT=180: gives longer CSV processing requests room to finish
- CSV_CHUNK_SIZE=500: keeps CSV memory use conservative during profiling and paging
- STAGED_FILE_CACHE_MAX_ITEMS=2: lets repeated paging and reprocessing reuse the same staged S3 file
- STAGED_FILE_CACHE_TTL_SECONDS=900: expires staged files after 15 minutes to keep disk use bounded
Render automatically provides PORT, RENDER_EXTERNAL_HOSTNAME, and RENDER_EXTERNAL_URL, and the app is configured to trust those values without any extra setup.
- Trigger the initial deploy from Render.
- Open the generated onrender.com URL after the health check passes.
- Confirm GET /api/health/ returns a healthy response.
- Run one end-to-end flow through the UI with a real or demo S3 bucket.
- AWS credentials are accepted at runtime and are intentionally not stored in the database.
- CSV handling is chunked after staging the S3 object to a temp file, and repeat requests can reuse a small bounded local cache of staged files to avoid unnecessary re-downloads. If you set STAGED_FILE_CACHE_MAX_ITEMS=0, staged files are cleaned up after each request instead of being reused.
- Excel handling is capped at 20 MB in this MVP, and each preview-page request reloads the selected sheet because pandas does not offer the same chunked read path as CSV.
- Type inference is intentionally conservative. Ambiguous date columns stay as text unless the user overrides them.
- Small repeated-label columns can infer as Category, but high-cardinality or mostly-unique string columns intentionally stay as Text.
- The app supports paginated preview browsing across the processed dataset, but it does not export a full transformed file in this MVP.
- A Render deployment that keeps SQLite in the container filesystem is suitable for demos, but ProcessingRun history resets whenever the service is rebuilt or restarted.
- Redis and Celery are the real production-style enhancement in this branch. They move longer processing jobs out of the request-response cycle while keeping Django as the source of truth for run status.
- The PySpark feature in this branch is experimental and CSV-only. It is intentionally framed as a comparison/learning tool, not as a drop-in replacement for the main Pandas inference engine.
- The polished main branch remains the stable submission snapshot. This branch is a post-submission enhancement that demonstrates how Redis/Celery and PySpark could be applied to the app thoughtfully.
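For readers curious how the staged-file cache limits noted above might behave, here is a minimal sketch of a bounded cache with a time-to-live. The class name StagedFileCache and its methods are hypothetical and do not describe the project's actual implementation. It only tracks paths, so deleting the evicted files is left to the caller.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class StagedFileCache:
    """Bounded TTL cache mapping an S3 object key to a local staged path."""

    def __init__(
        self,
        max_items: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # key -> (path, stored_at); insertion order doubles as recency order.
        self._entries = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        """Return the staged path for key, or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        path, stored_at = entry
        if self._clock() - stored_at > self.ttl_seconds:
            # Expired: forget it (a fuller version would also delete the file).
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return path

    def put(self, key: str, path: str) -> list[str]:
        """Remember a staged path and return paths that fell out of the cache."""
        if self.max_items <= 0:
            return [path]  # caching disabled: nothing is kept, caller cleans up
        evicted = []
        previous = self._entries.pop(key, None)
        if previous is not None and previous[0] != path:
            evicted.append(previous[0])
        self._entries[key] = (path, self._clock())
        while len(self._entries) > self.max_items:
            _, (old_path, _) = self._entries.popitem(last=False)
            evicted.append(old_path)
        return evicted
```

With max_items=2 and ttl_seconds=900 this mirrors the suggested Render values: a third distinct file evicts the least recently used one, and an entry older than 15 minutes is treated as a miss.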