55 commits
35a1f83
upgrade to python 3.11 and with uv support
Chenglong-MS Jan 31, 2026
a65798d
make import more clear
Chenglong-MS Jan 31, 2026
fe60de0
clean up data loader to prep for new design
Chenglong-MS Feb 1, 2026
fd5a233
udpate auth workflow to prep for new workspace manage
Chenglong-MS Feb 1, 2026
dd9ebd3
wip on new workspace design
Chenglong-MS Feb 4, 2026
f671cc1
updates to unify data execution method
Chenglong-MS Feb 4, 2026
bf5a609
unify computation approach for both in memory and datalake tables
Chenglong-MS Feb 6, 2026
1a3ec07
after unification, remove separate agents
Chenglong-MS Feb 6, 2026
8e6ea7c
temp
Chenglong-MS Feb 6, 2026
d0a1371
semantic enhanced chart assemble
Chenglong-MS Feb 6, 2026
9d6a34e
some cleaning
Chenglong-MS Feb 6, 2026
4c83824
an algorithm to calculate layout dynamically, might be a good intervi…
Chenglong-MS Feb 7, 2026
b14e242
useless? ui design
Chenglong-MS Feb 7, 2026
08a5840
refactor design into design tokens
Chenglong-MS Feb 8, 2026
10526c0
optimize data loader
Chenglong-MS Feb 8, 2026
463d541
fix
Chenglong-MS Feb 8, 2026
9d522fa
Add projection types and projection centers for the map
BAIGUANGMEI Feb 8, 2026
48f0c33
Add map support prompts
BAIGUANGMEI Feb 8, 2026
6086035
Clean import
BAIGUANGMEI Feb 8, 2026
b4e9ed4
Merge pull request #1 from microsoft/main
BAIGUANGMEI Feb 8, 2026
f98309f
Update src/app/utils.tsx
Chenglong-MS Feb 8, 2026
824a404
Update py-src/data_formulator/agents/agent_data_rec.py
Chenglong-MS Feb 8, 2026
86b2dc5
Update src/components/ChartTemplates.tsx
Chenglong-MS Feb 8, 2026
fd7aa53
Apply suggestions from code review
Chenglong-MS Feb 8, 2026
4ad7219
Merge pull request #232 from BAIGUANGMEI/feature/map-projection-support
Chenglong-MS Feb 8, 2026
c8307ff
minor
Chenglong-MS Feb 8, 2026
b0da0d2
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
Chenglong-MS Feb 8, 2026
61d826c
update config
Chenglong-MS Feb 8, 2026
7f172ed
udpate rendering workflow
Chenglong-MS Feb 8, 2026
1d2623a
fix scale bug
Chenglong-MS Feb 8, 2026
c1b1677
fix data sync with workspace
Chenglong-MS Feb 9, 2026
9aec036
fix auto refresh performance bug
Chenglong-MS Feb 9, 2026
f88a277
fix performance bug -- it's message snack
Chenglong-MS Feb 10, 2026
ce78f92
design fixes
Chenglong-MS Feb 10, 2026
245045f
small stuff
Chenglong-MS Feb 10, 2026
157adba
nit
Chenglong-MS Feb 10, 2026
286187e
minor
Chenglong-MS Feb 10, 2026
ea1ef9e
fix flow
Chenglong-MS Feb 10, 2026
3342878
fix
Chenglong-MS Feb 10, 2026
e0ba6c9
session management
Chenglong-MS Feb 11, 2026
51c5c4c
fix some issues with dataloaders
Chenglong-MS Feb 11, 2026
3364731
fix cross-platform file locking for Windows compatibility
BAIGUANGMEI Feb 11, 2026
370df8e
Merge pull request #233 from BAIGUANGMEI/fix/windows-fcntl-compat
Chenglong-MS Feb 11, 2026
15a1960
some fixes
Chenglong-MS Feb 11, 2026
225b0e6
Merge remote-tracking branch 'refs/remotes/origin/dev' into dev
Chenglong-MS Feb 11, 2026
763c7c1
bug fixes
Chenglong-MS Feb 12, 2026
10a9ce2
some work around editor
Chenglong-MS Feb 12, 2026
644d1df
workflow check
Chenglong-MS Feb 12, 2026
107e557
update workflow
Chenglong-MS Feb 12, 2026
61d3f5f
workflow ix 2
Chenglong-MS Feb 12, 2026
f3b01b3
minor
Chenglong-MS Feb 12, 2026
098198c
rename
Chenglong-MS Feb 12, 2026
1fbc34e
temp revert to name-based approach
Chenglong-MS Feb 12, 2026
b60d727
update requirements
Chenglong-MS Feb 12, 2026
4012dc6
update to use uv in build
Chenglong-MS Feb 12, 2026
21 changes: 11 additions & 10 deletions .devcontainer/devcontainer.json
@@ -1,23 +1,24 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
"name": "Python 3",
"name": "Data Formulator Dev",
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
"image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",
"image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye",

// Features to add to the dev container. More info: https://containers.dev/features.
"features": {
"ghcr.io/devcontainers/features/node:1": {
"version": "18"
},
"ghcr.io/devcontainers/features/azure-cli:1": {}
},
"features": {
"ghcr.io/devcontainers/features/node:1": {
"version": "18"
},
"ghcr.io/devcontainers/features/azure-cli:1": {},
"ghcr.io/astral-sh/uv:1": {}
},

// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
"forwardPorts": [5000, 5173],

// Use 'postCreateCommand' to run commands after the container is created.
"postCreateCommand": "cd /workspaces/data-formulator && npm install && npm run build && python3 -m venv /workspaces/data-formulator/venv && . /workspaces/data-formulator/venv/bin/activate && pip install -e /workspaces/data-formulator --verbose && data_formulator"
"postCreateCommand": "cd /workspaces/data-formulator && npm install && npm run build && uv sync && uv run data_formulator"

// Configure tool-specific properties.
// "customizations": {},
4 changes: 1 addition & 3 deletions .env.template
@@ -3,6 +3,4 @@
# python -m data_formulator -p 5000 --exec-python-in-subprocess true --disable-display-keys true

DISABLE_DISPLAY_KEYS=false # if true, the display keys will not be shown in the frontend
EXEC_PYTHON_IN_SUBPROCESS=false # if true, the python code will be executed in a subprocess to avoid crashing the main app, but it will increase the time of response

LOCAL_DB_DIR= # the directory to store the local database, if not provided, the app will use the temp directory
EXEC_PYTHON_IN_SUBPROCESS=false # if true, the python code will be executed in a subprocess to avoid crashing the main app, but it will increase the time of response
52 changes: 26 additions & 26 deletions .github/workflows/python-build.yml
@@ -9,6 +9,11 @@ on:
pull_request:
branches: [ "main" ]

# Global permissions required for OIDC authentication
permissions:
id-token: write
contents: read

jobs:
build:
runs-on: ubuntu-latest
@@ -21,42 +26,37 @@ jobs:
with:
node-version: 20
cache: 'yarn'
- name: Set up Python 3.12
uses: actions/setup-python@v5
- name: Set up uv
uses: astral-sh/setup-uv@v4
with:
python-version: 3.12
- name: Install node dependencies
run: yarn install
- name: Install python dependencies
run: |
python -m pip install --upgrade pip
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
python -m pip install build
- name: Build frontend
run: yarn build
- name: Build python artifact
run: python -m build
run: uv build
- name: Archive production artifacts
uses: actions/upload-artifact@v4
with:
name: release-dist
path: dist

# pypi-publish:
# runs-on: ubuntu-latest
# needs:
# - build
# if: github.event_name == 'push' && contains(github.event.head_commit.message, '[deploy]')
# environment:
# name: pypi
# url: https://pypi.org/p/data-formulator
# permissions:
# id-token: write
# steps:
# - name: Retrieve release distributions
# uses: actions/download-artifact@v4
# with:
# name: release-dist
# path: dist/
# - name: Publish package distributions to PyPI
# uses: pypa/gh-action-pypi-publish@release/v1
pypi-publish:
runs-on: ubuntu-latest
needs:
- build
if: github.event_name == 'push' && contains(github.event.head_commit.message, '[deploy]')
environment:
name: pypi
url: https://pypi.org/p/data-formulator
permissions:
id-token: write
steps:
- name: Retrieve release distributions
uses: actions/download-artifact@v4
with:
name: release-dist
path: dist/
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@

*env
.venv/
*api-keys.env
**/*.ipynb_checkpoints/
.DS_Store
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11
149 changes: 131 additions & 18 deletions DEVELOPMENT.md
@@ -2,16 +2,34 @@
How to set up your local machine.

## Prerequisites
* Python > 3.11
* Python >= 3.11
* Node.js
* Yarn
* [uv](https://docs.astral.sh/uv/) (recommended) or pip

## Backend (Python)

### Option 1: With uv (recommended)

uv is faster than pip and provides reproducible builds via a lockfile.

```bash
uv sync # Creates .venv and installs all dependencies
uv run data_formulator # Run app (opens browser automatically)
uv run data_formulator --dev # Run backend only (for frontend development)
```

**Which command to use:**
- **End users / testing the full app**: `uv run data_formulator` - starts server and opens browser to http://localhost:5000
- **Frontend development**: `uv run data_formulator --dev` - starts backend server only, then run `yarn start` separately for the Vite dev server on http://localhost:5173

### Option 2: With pip (fallback)

- **Create a Virtual Environment**
```bash
python -m venv venv
.\venv\Scripts\activate
source venv/bin/activate # Unix
# or .\venv\Scripts\activate # Windows
```

- **Install Dependencies**
@@ -29,7 +47,6 @@ How to set up your local machine.
- configure settings as needed:
- DISABLE_DISPLAY_KEYS: if true, API keys will not be shown in the frontend
- EXEC_PYTHON_IN_SUBPROCESS: if true, Python code runs in a subprocess (safer but slower). Consider enabling it when you host Data Formulator for others.
- LOCAL_DB_DIR: directory to store the local database (uses temp directory if not set)
- External database settings (when USE_EXTERNAL_DB=true):
- DB_NAME: name to refer to this database connection
- DB_TYPE: mysql or postgresql (currently only these two are supported)
@@ -41,14 +58,16 @@ How to set up your local machine.


- **Run the app**
- **Windows**
```bash
.\local_server.bat
```

- **Unix-based**
```bash
# Unix
./local_server.sh

# Windows
.\local_server.bat

# Or directly
data_formulator # Opens browser automatically
data_formulator --dev # Backend only (for frontend development)
```

## Frontend (TypeScript)
@@ -61,7 +80,12 @@ How to set up your local machine.

- **Development mode**

Run the front-end in development mode using, allowing real-time edits and previews:
First, start the backend server (in a separate terminal):
```bash
uv run data_formulator --dev # or ./local_server.sh
```

Then, run the frontend in development mode with hot reloading:
```bash
yarn start
```
@@ -81,6 +105,10 @@ How to set up your local machine.
Then, build python package:

```bash
# With uv
uv build

# Or with pip
pip install build
python -m build
```
Expand Down Expand Up @@ -112,23 +140,23 @@ How to set up your local machine.

When deploying Data Formulator to production, please be aware of the following security considerations:

### Database Storage Security
### Database and Data Storage Security

1. **Local DuckDB Files**: When database functionality is enabled (default), Data Formulator stores DuckDB database files locally on the server. These files contain user data and are stored in the system's temporary directory or a configured `LOCAL_DB_DIR`.
1. **Workspace and table data**: Table data is stored in per-identity workspaces (e.g. parquet files). DuckDB is used only in-memory per request when needed (e.g. for SQL mode); no persistent DuckDB database files are created by the app.

2. **Session Management**:
- When database is **enabled**: Session IDs are stored in Flask sessions (cookies) and linked to local DuckDB files
- When database is **disabled**: No persistent storage is used, and no cookies are set. Session IDs are generated per request for API consistency
2. **Identity Management**:
- Each user's data is isolated by a namespaced identity key (e.g., `user:alice@example.com` or `browser:550e8400-...`)
- Anonymous users get a browser-based UUID stored in localStorage
- Authenticated users get their verified user ID from the auth provider

3. **Data Persistence**: User data processed through Data Formulator may be temporarily stored in these local DuckDB files, which could be a security risk in multi-tenant environments.
3. **Data persistence**: User data may be written to workspace storage (e.g. parquet) on the server. In multi-tenant deployments, ensure workspace directories are isolated and access-controlled.
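
One way to keep workspace directories isolated is to derive each directory name from the identity key rather than using the key as a path. The sketch below is illustrative only: the `workspace_dir` helper, the `root/<namespace>/<digest>` layout, and the choice of SHA-256 are assumptions, not the layout Data Formulator actually uses.

```python
import hashlib
from pathlib import Path

# Namespaces taken from the identity keys described above.
ALLOWED_NAMESPACES = {"user", "browser"}


def workspace_dir(root: Path, identity_key: str) -> Path:
    """Map an identity key such as "user:alice@example.com" to a directory.

    Hypothetical helper: the key is hashed so that characters like "/" or
    ".." in a client-supplied ID can never influence the resulting path.
    """
    namespace, sep, _ = identity_key.partition(":")
    if not sep or namespace not in ALLOWED_NAMESPACES:
        raise ValueError(f"unexpected identity key: {identity_key!r}")
    # Hash the full key, so "user:x" and "browser:x" never collide.
    digest = hashlib.sha256(identity_key.encode("utf-8")).hexdigest()
    return root / namespace / digest[:32]


if __name__ == "__main__":
    root = Path("/srv/workspaces")  # assumed location, for illustration
    print(workspace_dir(root, "user:alice@example.com"))
    print(workspace_dir(root, "browser:../../etc"))  # stays under root/browser
```

Hashing gives fixed-length, filesystem-safe names but makes directories harder to inspect by hand. Access control (for example, restrictive directory permissions) still has to be applied separately.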

### Recommended Security Measures

For production deployment, consider:

1. **Use `--disable-database` flag** for stateless deployments where no data persistence is needed
1. **Use `--disable-database` flag** to disable table-connector routes when you do not need external or uploaded table support
2. **Implement proper authentication, authorization, and other security measures** as needed for your specific use case, for example:
- Store DuckDB file in a database
- User authentication (OAuth, JWT tokens, etc.)
- Role-based access control
- API rate limiting
@@ -142,5 +170,90 @@ For production deployment, consider:
python -m data_formulator.app --disable-database
```

## Authentication Architecture

Data Formulator uses a **hybrid identity system** that supports both anonymous and authenticated users.

### Identity Flow Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│ Frontend Request │
├─────────────────────────────────────────────────────────────────────┤
│ Headers: │
│ X-Identity-Id: "browser:550e8400-..." (namespace sent by client) │
│ Authorization: Bearer <jwt> (if custom auth implemented) │
│ (Azure also adds X-MS-CLIENT-PRINCIPAL-ID automatically) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Backend Identity Resolution │
│ (auth.py: get_identity_id) │
├─────────────────────────────────────────────────────────────────────┤
│ Priority 1: Azure X-MS-CLIENT-PRINCIPAL-ID → "user:<azure_id>" │
│ Priority 2: JWT Bearer token (if implemented) → "user:<jwt_sub>" │
│ Priority 3: X-Identity-Id header → ALWAYS "browser:<id>" │
│ (client-provided namespace is IGNORED for security) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Storage Isolation │
├─────────────────────────────────────────────────────────────────────┤
│ "user:alice@example.com" → alice's workspace (ONLY via auth) │
│ "browser:550e8400-..." → anonymous user's workspace │
└─────────────────────────────────────────────────────────────────────┘
```

### Security Model

**Critical Security Rule:** The backend NEVER trusts the namespace prefix from the client-provided `X-Identity-Id` header. Even if a client sends `X-Identity-Id: "user:alice@..."`, the backend strips the prefix and forces `browser:alice@...`. Only verified authentication (Azure headers or JWT) can result in a `user:` prefixed identity.

The key security principle is **namespaced isolation with forced prefixing**:

| Scenario | X-Identity-Id Sent | Backend Resolution | Storage Key |
|----------|-------------------|-------------------|-------------|
| Anonymous user | `browser:550e8400-...` | Strips prefix, forces `browser:` | `browser:550e8400-...` |
| Azure logged-in user | `browser:550e8400-...` | Uses Azure header (priority 1) | `user:alice@...` |
| Attacker spoofing | `user:alice@...` (forged) | No valid auth, strips & forces `browser:` | `browser:alice@...` |

**Why this is secure:** An attacker sending `X-Identity-Id: user:alice@...` gets `browser:alice@...` as their storage key, which is completely separate from the real `user:alice@...` that only authenticated Alice can access.
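
The resolution order and forced prefixing can be sketched as a small function. This is a minimal illustration, not the actual `get_identity_id` implementation: the `resolve_identity` name, the plain-mapping `headers` argument, and the `verified_jwt_sub` parameter are assumptions. It also assumes the deployment guarantees that clients cannot set `X-MS-CLIENT-PRINCIPAL-ID` themselves.

```python
from typing import Mapping, Optional


def resolve_identity(
    headers: Mapping[str, str],
    verified_jwt_sub: Optional[str] = None,
) -> str:
    """Return the namespaced storage key for a request (illustrative only).

    `headers` is a plain mapping with exact-case keys here; a real web
    framework would normally give case-insensitive header access.
    """
    # Priority 1: header added by Azure App Service authentication.
    azure_id = headers.get("X-MS-CLIENT-PRINCIPAL-ID")
    if azure_id:
        return f"user:{azure_id}"

    # Priority 2: subject of a JWT that the caller has ALREADY verified.
    if verified_jwt_sub:
        return f"user:{verified_jwt_sub}"

    # Priority 3: client-supplied ID. Discard whatever namespace the client
    # claimed (everything up to the first ":") and force "browser:".
    raw = headers.get("X-Identity-Id", "")
    _, sep, rest = raw.partition(":")
    bare_id = rest if sep else raw
    if not bare_id:
        raise ValueError("request carries no usable identity")
    return f"browser:{bare_id}"


if __name__ == "__main__":
    # A forged "user:" prefix without valid auth lands in the browser namespace.
    print(resolve_identity({"X-Identity-Id": "user:alice@example.com"}))
```

Because the `user:` prefix is only ever produced by the first two branches, the attacker row in the table above cannot reach an authenticated user's storage key.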

### Implementing Custom Authentication

To add JWT-based authentication:

1. **Backend** (`auth.py`): Uncomment and configure the JWT verification code in `get_identity_id()`
2. **Frontend** (`utils.tsx`): Implement `getAuthToken()` to retrieve the JWT from your auth context
3. **Add JWT secret** to Flask config: `current_app.config['JWT_SECRET']`
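
For orientation, the verification step might look roughly like the sketch below. It is not the commented-out code in the repository: it assumes HS256-signed tokens, a shared secret (the value you would store under `JWT_SECRET`), and that the `sub` claim carries the user ID. The `verified_subject` and `sign_hs256` names are hypothetical. It uses only the standard library so it can run as-is; a maintained JWT library is usually the better choice in production.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a signed token. Included only so the example can be exercised."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{_b64url_encode(mac.digest())}"


def verified_subject(token: str, secret: str) -> Optional[str]:
    """Return the `sub` claim of a valid, unexpired HS256 token, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        # Reject any algorithm other than the one we expect (including "none").
        if header.get("alg") != "HS256":
            return None
        mac = hmac.new(
            secret.encode(), f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        )
        # Constant-time comparison of the expected and presented signatures.
        if not hmac.compare_digest(mac.digest(), _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except (ValueError, TypeError, AttributeError):
        return None  # malformed token

    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    if exp is not None and (not isinstance(exp, (int, float)) or exp < time.time()):
        return None  # expired or invalid expiry
    sub = claims.get("sub")
    return sub if isinstance(sub, str) and sub else None


if __name__ == "__main__":
    token = sign_hs256(
        {"sub": "alice@example.com", "exp": int(time.time()) + 60}, "demo-secret"
    )
    print(verified_subject(token, "demo-secret"))
```

A `None` result means the request falls through to the `X-Identity-Id` path and is treated as an anonymous `browser:` identity.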

### Azure App Service Authentication

When deployed to Azure with EasyAuth enabled:
- Azure automatically adds `X-MS-CLIENT-PRINCIPAL-ID` header to authenticated requests
- The backend reads this header first (highest priority)
- No frontend changes needed - Azure handles the auth flow

### Frontend Identity Management

The frontend (`src/app/identity.ts`) manages identity as follows:

```typescript
// Identity is always initialized with browser ID
identity: { type: 'browser', id: getBrowserId() }

// If user logs in (e.g., via Azure), it's updated to:
identity: { type: 'user', id: userInfo.userId }

// All API requests send namespaced identity:
// X-Identity-Id: "browser:550e8400-..." or "user:alice@..."
```

This ensures:
1. **Anonymous users**: Work immediately with localStorage-based browser ID
2. **Logged-in users**: Get their verified user ID from the auth provider
3. **Cross-tab consistency**: Browser ID is shared via localStorage across all tabs

## Usage
See the [Usage section on the README.md page](README.md#usage).