How to set up your local machine.

Prerequisites:
- Python >= 3.11
- Node.js
- Yarn
- uv (recommended) or pip
uv is faster and provides reproducible builds via a lockfile.
```bash
uv sync                        # Creates .venv and installs all dependencies
uv run data_formulator         # Run app (opens browser automatically)
uv run data_formulator --dev   # Run backend only (for frontend development)
```

Which command to use:

- End users / testing the full app: `uv run data_formulator` starts the server and opens the browser to http://localhost:5567
- Frontend development: `uv run data_formulator --dev` starts the backend server only; then run `yarn start` separately for the Vite dev server on http://localhost:5173
Alternatively, set up with pip:

- Create a Virtual Environment:

  ```bash
  python -m venv venv
  source venv/bin/activate     # Unix
  # or
  .\venv\Scripts\activate      # Windows
  ```
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables (optional):
  - Copy `.env.template` to `.env` and fill in your values:
    - API keys: set `{PROVIDER}_ENABLED=true`, `{PROVIDER}_API_KEY=...`, and `{PROVIDER}_MODELS=...` for each LLM provider you want to use. See the LiteLLM setup guide for provider-specific fields.
    - Server settings: `DISABLE_DISPLAY_KEYS`, `SANDBOX`, etc.
    - Azure Blob workspace (optional): see Azure Blob Storage Workspace below.
  - This lets Data Formulator automatically load API keys at startup so you don't need to enter them in the UI.
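  For illustration, here is a hypothetical sketch of how settings following this `{PROVIDER}_*` pattern could be collected from the environment at startup (the provider name `OPENAI` is just an example, and this is not the app's actual loader):

  ```python
  # Hypothetical sketch: gather provider settings that follow the
  # {PROVIDER}_ENABLED / {PROVIDER}_API_KEY / {PROVIDER}_MODELS convention.
  import os

  def enabled_providers() -> dict[str, dict[str, str]]:
      providers = {}
      for key, value in os.environ.items():
          if key.endswith("_ENABLED") and value.lower() == "true":
              name = key[: -len("_ENABLED")]  # e.g. "OPENAI" (example provider name)
              providers[name] = {
                  "api_key": os.environ.get(f"{name}_API_KEY", ""),
                  "models": os.environ.get(f"{name}_MODELS", ""),
              }
      return providers

  print(enabled_providers())
  ```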
- Run the app:

  ```bash
  # Unix
  ./local_server.sh

  # Windows
  .\local_server.bat

  # Or directly
  data_formulator        # Opens browser automatically
  data_formulator --dev  # Backend only (for frontend development)
  ```
- Install NPM packages:

  ```bash
  yarn
  ```
- Development mode

  First, start the backend server (in a separate terminal):

  ```bash
  uv run data_formulator --dev   # or ./local_server.sh
  ```

  Then, run the frontend in development mode with hot reloading:

  ```bash
  yarn start
  ```

  Open http://localhost:5173 to view it in the browser. The page will reload if you make edits. You will also see any lint errors in the console.
- Build the frontend and then the backend

  Compile the TypeScript files and bundle the project:

  ```bash
  yarn build
  ```

  This builds the app for production to the `py-src/data_formulator/dist` folder. Then, build the Python package:

  ```bash
  # With uv
  uv build

  # Or with pip
  pip install build
  python -m build
  ```

  This creates a Python wheel in the `dist/` folder, named `data_formulator-<version>-py3-none-any.whl`.
- Test the artifact

  You can then install the built wheel (testing in a virtual environment is recommended):

  ```bash
  # Replace <version> with the actual build version.
  pip install dist/data_formulator-<version>-py3-none-any.whl
  ```

  Once installed, you can run Data Formulator with:

  ```bash
  data_formulator
  ```

  or

  ```bash
  python -m data_formulator
  ```

  Open http://localhost:5567 to view it in the browser.
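As a quick sanity check once the server is running, you can probe it from Python (a minimal sketch; it only assumes the app is listening on http://localhost:5567 as described above):

```python
# Minimal smoke test: start `data_formulator` in another terminal first,
# then run this script to confirm the server answers on port 5567.
import urllib.request

resp = urllib.request.urlopen("http://localhost:5567", timeout=10)
print("status:", resp.status)                             # expect 200 when the app is up
print("content-type:", resp.headers.get("Content-Type"))
```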
AI-generated Python code runs inside a sandbox to isolate it from the main server process. Two backends are available:
| Backend | Flag | How it works | Overhead |
|---|---|---|---|
| local (default) | `--sandbox local` | Persistent warm subprocess with pre-imported pandas/numpy/duckdb. Audit hooks block file writes and dangerous operations (subprocess, shutil, etc.). | ~1 ms |
| docker | `--sandbox docker` | Each execution runs in a disposable `docker run --rm` container. Workspace is mounted read-only; output is returned via a bind-mounted parquet file. Memory/CPU/PID limits are enforced. | ~700 ms |
```bash
# Use the default local sandbox
python -m data_formulator

# Use Docker sandbox (requires Docker daemon)
python -m data_formulator --sandbox docker
```

The Docker sandbox image is built from `py-src/data_formulator/sandbox/Dockerfile.sandbox`:

```bash
docker build -t data-formulator-sandbox -f py-src/data_formulator/sandbox/Dockerfile.sandbox .
```

Source: `py-src/data_formulator/sandbox/`
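To give a feel for the audit-hook approach used by the local backend, here is a small standalone sketch (not the project's actual sandbox code) that uses Python's `sys.addaudithook` to reject file writes and subprocess launches:

```python
# Standalone illustration of blocking writes and subprocesses with an audit hook.
# Simplified sketch only; the real sandbox enforces a broader policy.
import sys

def guard(event, args):
    if event == "open":
        path, mode = args[0], args[1]
        # Block opening files for writing, appending, or creating.
        if mode and any(flag in mode for flag in ("w", "a", "x", "+")):
            raise PermissionError(f"file write blocked: {path}")
    if event in ("subprocess.Popen", "os.system"):
        raise PermissionError(f"blocked audit event: {event}")

sys.addaudithook(guard)

try:
    open("/tmp/sandbox_demo.txt", "w")
except PermissionError as exc:
    print("blocked:", exc)
```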
By default, workspace data (uploaded files, parquet tables, metadata) is stored on the local filesystem under `~/.data_formulator/workspaces/`. For cloud deployments you can switch to Azure Blob Storage so that all workspace data lives in a blob container instead.
- Install extra dependencies:

  ```bash
  pip install azure-storage-blob
  # or with uv: uv pip install azure-storage-blob
  ```

- Create a storage account & container (one-time setup):

  ```bash
  az storage account create -n <account> -g <resource-group> -l eastus --sku Standard_LRS
  az storage container create -n data-formulator --account-name <account>
  ```
- Get the connection string:

  ```bash
  az storage account show-connection-string -n <account> -g <resource-group> -o tsv
  ```
- Add to `.env`:

  ```bash
  WORKSPACE_BACKEND=azure_blob
  AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=...
  # AZURE_BLOB_CONTAINER=data-formulator   # default, change if needed
  ```
- Run normally:

  ```bash
  uv run data_formulator --dev
  ```

  Or pass the settings as CLI flags:

  ```bash
  data_formulator --workspace-backend azure_blob \
      --azure-blob-connection-string "DefaultEndpointsProtocol=https;AccountName=..."
  ```
In production (Azure App Service, AKS, etc.) you can authenticate the app to blob storage via Managed Identity instead of a connection string. This eliminates secrets entirely.
- Install extra dependencies:

  ```bash
  pip install azure-storage-blob azure-identity
  ```
- Assign a role to the app's Managed Identity:

  ```bash
  # Get the App Service's principal ID
  PRINCIPAL_ID=$(az webapp identity show -n <app-name> -g <rg> --query principalId -o tsv)

  # Grant it "Storage Blob Data Contributor" on the storage account
  az role assignment create \
      --assignee "$PRINCIPAL_ID" \
      --role "Storage Blob Data Contributor" \
      --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
  ```
- Set environment variables (no secrets needed):

  ```bash
  WORKSPACE_BACKEND=azure_blob
  AZURE_BLOB_ACCOUNT_URL=https://<account>.blob.core.windows.net
  # AZURE_BLOB_CONTAINER=data-formulator
  ```

  The app uses `DefaultAzureCredential`, which automatically picks up the Managed Identity.
- For local development with the same Entra ID path, log in with the Azure CLI:

  ```bash
  az login

  # Grant your user the same "Storage Blob Data Contributor" role
  az role assignment create \
      --assignee "<your-email@example.com>" \
      --role "Storage Blob Data Contributor" \
      --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
  ```

  Then set:

  ```bash
  WORKSPACE_BACKEND=azure_blob
  AZURE_BLOB_ACCOUNT_URL=https://<account>.blob.core.windows.net
  ```

  `DefaultAzureCredential` will use your `az login` session.
| Method | Env var | When to use |
|---|---|---|
| Connection string | `AZURE_BLOB_CONNECTION_STRING` | Local dev, quick tests |
| Entra ID (Managed Identity) | `AZURE_BLOB_ACCOUNT_URL` | Azure App Service, AKS (no secrets) |
| Entra ID (az login) | `AZURE_BLOB_ACCOUNT_URL` | Local dev without secrets |
| Entra ID (service principal) | `AZURE_BLOB_ACCOUNT_URL` + `AZURE_CLIENT_ID` / `AZURE_TENANT_ID` / `AZURE_CLIENT_SECRET` | CI/CD pipelines |
If both `AZURE_BLOB_CONNECTION_STRING` and `AZURE_BLOB_ACCOUNT_URL` are set, the connection string takes precedence.
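To make the precedence concrete, a blob client could be constructed along these lines (a simplified sketch using the `azure-storage-blob` and `azure-identity` packages, not the app's actual initialization code):

```python
# Sketch: choose credentials following the precedence rule described above.
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def make_blob_service_client() -> BlobServiceClient:
    conn_str = os.getenv("AZURE_BLOB_CONNECTION_STRING")
    if conn_str:
        # Shared-key connection string wins when both settings are present.
        return BlobServiceClient.from_connection_string(conn_str)
    account_url = os.getenv("AZURE_BLOB_ACCOUNT_URL")
    # DefaultAzureCredential covers Managed Identity, `az login`, and service principals.
    return BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())

container = make_blob_service_client().get_container_client(
    os.getenv("AZURE_BLOB_CONTAINER", "data-formulator")
)
```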
All workspace data is stored under `<datalake_root>/<sanitized_identity_id>/` inside the container:

```
data-formulator/                  ← container
  workspaces/                     ← datalake_root (default)
    browser_550e8400.../          ← anonymous user workspace
      workspace.yaml
      sales_data.parquet
    user_alice_example_com/       ← authenticated user workspace
      workspace.yaml
      quarterly_report.parquet
```
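The exact sanitization rule is not spelled out here, but judging from the example above (`user:alice@example.com` stored under `user_alice_example_com/`), it behaves roughly like this illustrative sketch:

```python
# Illustrative only: derive a storage-safe folder name from an identity key
# by replacing anything that is not a letter, digit, or underscore.
# The project's actual sanitization rules may differ.
import re

def sanitize_identity_id(identity_id: str) -> str:
    return re.sub(r"[^A-Za-z0-9_]", "_", identity_id)

print(sanitize_identity_id("user:alice@example.com"))  # user_alice_example_com
print(sanitize_identity_id("browser:550e8400-e29b"))   # browser_550e8400_e29b
```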
| Flag | Env var | Default | Description |
|---|---|---|---|
--workspace-backend |
WORKSPACE_BACKEND |
local |
local or azure_blob |
--azure-blob-connection-string |
AZURE_BLOB_CONNECTION_STRING |
β | Shared-key connection string |
--azure-blob-account-url |
AZURE_BLOB_ACCOUNT_URL |
β | Account URL for Entra ID auth |
--azure-blob-container |
AZURE_BLOB_CONTAINER |
data-formulator |
Blob container name |
When deploying Data Formulator to production, please be aware of the following security considerations:
- Workspace and table data: Table data is stored in per-identity workspaces (e.g. parquet files). DuckDB is used only in-memory per request when needed (e.g. for SQL mode); no persistent DuckDB database files are created by the app.
- Identity Management:
  - Each user's data is isolated by a namespaced identity key (e.g., `user:alice@example.com` or `browser:550e8400-...`)
  - Anonymous users get a browser-based UUID stored in localStorage
  - Authenticated users get their verified user ID from the auth provider
- Data persistence: User data may be written to workspace storage (e.g. parquet) on the server. In multi-tenant deployments, ensure workspace directories are isolated and access-controlled.
For production deployment, consider:
- Use the `--disable-database` flag to disable table-connector routes when you do not need external or uploaded table support
- Implement proper authentication, authorization, and other security measures as needed for your specific use case, for example:
  - User authentication (OAuth, JWT tokens, etc.)
  - Role-based access control
  - API rate limiting
  - HTTPS/TLS encryption
  - Input validation and sanitization
```bash
# For stateless deployment (recommended for public hosting)
python -m data_formulator.app --disable-database
```

Data Formulator uses a hybrid identity system that supports both anonymous and authenticated users.
```
────────────────────────────────────────────────────────────────────────
 Frontend Request
────────────────────────────────────────────────────────────────────────
 Headers:
   X-Identity-Id: "browser:550e8400-..."   (namespace sent by client)
   Authorization: Bearer <jwt>             (if custom auth implemented)
   (Azure also adds X-MS-CLIENT-PRINCIPAL-ID automatically)
                                   │
                                   ▼
────────────────────────────────────────────────────────────────────────
 Backend Identity Resolution  (auth.py: get_identity_id)
────────────────────────────────────────────────────────────────────────
 Priority 1: Azure X-MS-CLIENT-PRINCIPAL-ID     →  "user:<azure_id>"
 Priority 2: JWT Bearer token (if implemented)  →  "user:<jwt_sub>"
 Priority 3: X-Identity-Id header               →  ALWAYS "browser:<id>"
             (client-provided namespace is IGNORED for security)
                                   │
                                   ▼
────────────────────────────────────────────────────────────────────────
 Storage Isolation
────────────────────────────────────────────────────────────────────────
 "user:alice@example.com"   →  alice's DuckDB file (ONLY via auth)
 "browser:550e8400-..."     →  anonymous user's DuckDB file
────────────────────────────────────────────────────────────────────────
```
Critical Security Rule: The backend NEVER trusts the namespace prefix from the client-provided `X-Identity-Id` header. Even if a client sends `X-Identity-Id: "user:alice@..."`, the backend strips the prefix and forces `browser:alice@...`. Only verified authentication (Azure headers or JWT) can result in a `user:`-prefixed identity.
The key security principle is namespaced isolation with forced prefixing:
| Scenario | X-Identity-Id Sent | Backend Resolution | Storage Key |
|---|---|---|---|
| Anonymous user | `browser:550e8400-...` | Strips prefix, forces `browser:` | `browser:550e8400-...` |
| Azure logged-in user | `browser:550e8400-...` | Uses Azure header (priority 1) | `user:alice@...` |
| Attacker spoofing | `user:alice@...` (forged) | No valid auth, strips & forces `browser:` | `browser:alice@...` |
Why this is secure: An attacker sending `X-Identity-Id: user:alice@...` gets `browser:alice@...` as their storage key, which is completely separate from the real `user:alice@...` that only authenticated Alice can access.
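A simplified sketch of that resolution order (not the actual `get_identity_id()` implementation; the header names come from the description above) looks roughly like this:

```python
# Simplified illustration of the priority order and the forced browser: prefix.
# Not the project's actual code; JWT verification is stubbed out.
def resolve_identity(headers: dict[str, str]) -> str:
    # Priority 1: identity verified by Azure EasyAuth.
    azure_id = headers.get("X-MS-CLIENT-PRINCIPAL-ID")
    if azure_id:
        return f"user:{azure_id}"

    # Priority 2: a verified JWT subject would be handled here (omitted in this sketch).

    # Priority 3: client-supplied ID; any namespace prefix the client sent is
    # stripped and the identity is forced into the browser: namespace.
    raw = headers.get("X-Identity-Id", "")
    bare_id = raw.split(":", 1)[-1]
    return f"browser:{bare_id}" if bare_id else "browser:anonymous"

print(resolve_identity({"X-Identity-Id": "user:alice@example.com"}))
# -> browser:alice@example.com  (the spoofed user: prefix is ignored)
```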
To add JWT-based authentication:
- Backend (`tables_routes.py`): Uncomment and configure the JWT verification code in `get_identity_id()` (see the sketch after this list)
- Frontend (`utils.tsx`): Implement `getAuthToken()` to retrieve the JWT from your auth context
- Add the JWT secret to the Flask config: `current_app.config['JWT_SECRET']`
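For reference, the verification step might look something like this minimal sketch, assuming the PyJWT package and an HS256 shared secret (adapt it to whatever the commented-out code actually uses):

```python
# Minimal JWT verification sketch (assumes PyJWT; not the project's exact code).
import jwt  # pip install PyJWT

def identity_from_bearer(auth_header: str, secret: str) -> str | None:
    if not auth_header.startswith("Bearer "):
        return None
    token = auth_header[len("Bearer "):]
    try:
        claims = jwt.decode(token, secret, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return None
    return f"user:{claims['sub']}"  # only a verified token yields a user: identity
```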
When deployed to Azure with EasyAuth enabled:
- Azure automatically adds the `X-MS-CLIENT-PRINCIPAL-ID` header to authenticated requests
- The backend reads this header first (highest priority)
- No frontend changes needed; Azure handles the auth flow
The frontend (`src/app/identity.ts`) manages identity as follows:

```typescript
// Identity is always initialized with browser ID
identity: { type: 'browser', id: getBrowserId() }

// If user logs in (e.g., via Azure), it's updated to:
identity: { type: 'user', id: userInfo.userId }

// All API requests send namespaced identity:
// X-Identity-Id: "browser:550e8400-..." or "user:alice@..."
```

This ensures:
- Anonymous users: Work immediately with a localStorage-based browser ID
- Logged-in users: Get their verified user ID from the auth provider
- Cross-tab consistency: Browser ID is shared via localStorage across all tabs
See the Usage section in README.md.