Commit 2b49f45

First package version

13 files changed

Lines changed: 1371 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 49 additions & 0 deletions

```yaml
name: CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  lint-and-coverage:
    name: Lint and Coverage
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .
          pip install black flake8 coverage

      - name: Black
        run: black --check src

      - name: Flake8
        run: flake8 src

      - name: Coverage
        run: |
          coverage run -m spine_db.cli --help
          coverage report -m --include="src/*"
          coverage xml

      - name: Upload coverage to Codecov
        if: matrix.python-version == '3.11'
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml
          fail_ci_if_error: false
          token: ${{ secrets.CODECOV_TOKEN }}
```

.gitignore

Lines changed: 26 additions & 0 deletions

```
# Python
__pycache__/
*.py[cod]
*.egg-info/

# Virtualenv
.venv/
venv/

# Build artifacts
build/
dist/

# Test/coverage
.pytest_cache/
.coverage
coverage.xml

# Tooling
.mypy_cache/
.ruff_cache/

# Editors/OS
.vscode/
.idea/
.DS_Store
```

.pre-commit-config.yaml

Lines changed: 26 additions & 0 deletions

```yaml
# Pre-commit configuration for spine-db
# Run: pip install pre-commit && pre-commit install
# Automatically runs code quality checks before each commit

repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
        language_version: python3
        args: [--line-length=88]
        files: ^src/.*\.py$

  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: [--profile=black, --line-length=88]
        files: ^src/.*\.py$

  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:
      - id: flake8
        args: [--max-line-length=88, "--extend-ignore=E203,W503,E501"]
        files: ^src/.*\.py$
```

README.md

Lines changed: 281 additions & 0 deletions

# SPINE Database (spine-db)

Standalone metadata indexing and browsing system for SPINE production runs.

**Key Feature**: Works with cloud databases (e.g., Supabase) for **multi-site deployments** - index files from S3DF, NERSC, or anywhere, all writing to one central database.

## Quick Start

### 1. Install

```bash
# From this repo root
pip install -e .

# Or from PyPI (once published)
# pip install spine-db
```

### 2. Set Up Database

**Option A: Cloud Database (Recommended for Multi-Site)**

For production at multiple sites (S3DF, NERSC, etc.):

```bash
# Sign up at supabase.com (free, no credit card)
# Create project → get connection string
DB_URL="postgresql://user:pass@db.xyz.supabase.co:5432/postgres"

# Works from anywhere - S3DF, NERSC, your laptop
```

**Option B: Local Testing with SQLite**

```bash
# Database created automatically on first use
DB_URL="sqlite:///spine_files.db"
```

**Option C: Local PostgreSQL at S3DF/NERSC**

```bash
# Install postgres
conda install postgresql
initdb ~/postgres_data
pg_ctl -D ~/postgres_data start

# Create database
createdb spine_db
DB_URL="postgresql:///spine_db"
```

Initialize the schema once:

```bash
spine-db setup --db $DB_URL
```
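
To confirm the setup worked (and that the URL is reachable from your node), here is a quick sketch using SQLAlchemy, which the package already builds its models on; expect `spine_files` in the output:

```python
from sqlalchemy import create_engine, inspect

# Point at the same URL you passed to `spine-db setup`.
engine = create_engine("sqlite:///spine_files.db")
print(inspect(engine).get_table_names())  # e.g. ['spine_files']
```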

### 3. Index Your Files

```bash
# Index from glob pattern
spine-db inject --db $DB_URL --source /path/to/output/*.h5

# Index from file list
spine-db inject --db $DB_URL --source-list file_list.txt

# Re-index existing files
spine-db inject --db $DB_URL --source output/*.h5 --no-skip-existing
```
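
If you need to generate `file_list.txt` programmatically, a short sketch; this assumes the file-list format is one path per line, which is not spelled out above:

```python
from pathlib import Path

# Collect every HDF5 output under jobs/*/output/ and write one absolute path per line.
paths = sorted(Path("jobs").glob("*/output/*.h5"))
Path("file_list.txt").write_text("".join(f"{p.resolve()}\n" for p in paths))
```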

### 4. Launch Web UI

```bash
spine-db app --db $DB_URL --port 8050
```

Open your browser to http://localhost:8050

## Features

**Current**:

- **Database Schema**: `spine_files` table tracking file_path, spine_version, model_name, dataset_name, created_at
- **Multi-Site Support**: Works with cloud databases - S3DF and NERSC can write to the same DB
- **Metadata Extractor**: Reads HDF5 root attributes, infers from file paths
- **Indexer**: CLI tool with glob patterns, file lists, skip/re-index options
- **Web UI**:
  - Filter by version, model, dataset
  - Sort by creation date (newest first)
  - Adjustable result limit
  - Full file path tooltips
  - Total and filtered counts

**Future Enhancements**:

- Server-side pagination for large result sets
- Semantic version parsing (major.minor.patch) for better filtering
- Advanced analytics and histograms
- Export filtered results to CSV/file lists
- REST API for programmatic access
- Detailed metadata panel on row click

## Architecture

```
src/spine_db/
├── __init__.py    # Package metadata
├── cli.py         # CLI entry point
├── schema.py      # SQLAlchemy models
├── extractor.py   # HDF5 metadata extraction
├── indexer.py     # Indexer logic
├── setup.py       # Database setup helper
├── app.py         # Dash web UI
└── README.md      # This file
```
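
The commands above map onto three subcommands in `cli.py`. A hypothetical argparse skeleton, built only from the flags this README documents (the actual wiring may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="spine-db")
    sub = parser.add_subparsers(dest="command", required=True)

    setup = sub.add_parser("setup", help="initialize the database schema")
    setup.add_argument("--db", required=True, help="SQLAlchemy database URL")

    inject = sub.add_parser("inject", help="index files into the database")
    inject.add_argument("--db", required=True)
    inject.add_argument("--source", nargs="+", help="file paths (shell-expanded globs)")
    inject.add_argument("--source-list", help="text file listing paths to index")
    inject.add_argument("--no-skip-existing", action="store_true",
                        help="re-index files already present in the database")

    app = sub.add_parser("app", help="launch the Dash web UI")
    app.add_argument("--db", required=True)
    app.add_argument("--host", default="127.0.0.1")
    app.add_argument("--port", type=int, default=8050)
    app.add_argument("--debug", action="store_true")
    return parser
```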

## Usage Examples

### Testing Locally

```bash
# Index some files
spine-db inject \
    --db sqlite:///test_spine_files.db \
    --source jobs/*/output/*.h5

# Launch UI
spine-db app \
    --db sqlite:///test_spine_files.db \
    --debug
```

### Production Setup

**Multi-Site with Cloud Database** (S3DF + NERSC → Supabase):

```bash
# 1. Create free Supabase project at supabase.com
#    Get connection string from project settings

# 2. Index from S3DF
ssh s3df.slac.stanford.edu
spine-db inject \
    --db postgresql://user:pass@db.xyz.supabase.co:5432/postgres \
    --source-list /sdf/data/neutrino/spine_outputs.txt

# 3. Index from NERSC (same database!)
ssh perlmutter-p1.nersc.gov
spine-db inject \
    --db postgresql://user:pass@db.xyz.supabase.co:5432/postgres \
    --source /global/cfs/cdirs/dune/www/data/*.h5

# 4. Run UI anywhere (or deploy to Render/Railway for free)
spine-db app \
    --db postgresql://user:pass@db.xyz.supabase.co:5432/postgres \
    --host 0.0.0.0 \
    --port 8050
```
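
Since both sites write to the same database, it is worth a quick connectivity check from each login node before a long indexing campaign. A minimal sketch with SQLAlchemy (a PostgreSQL driver such as psycopg2 must be installed):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@db.xyz.supabase.co:5432/postgres")
with engine.connect() as conn:
    # Getting 1 back means the URL is reachable and the credentials work.
    print(conn.execute(text("SELECT 1")).scalar())
```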

**Local PostgreSQL at S3DF**:

```bash
# 1. Set up PostgreSQL
conda install postgresql
initdb ~/postgres_data
pg_ctl -D ~/postgres_data start
createdb spine_db

# 2. Index production outputs
spine-db inject \
    --db postgresql:///spine_db \
    --source-list /path/to/production_files.txt

# 3. Run UI (access via SSH tunnel)
spine-db app \
    --db postgresql:///spine_db \
    --host 0.0.0.0 \
    --port 8050

# From your laptop:
#   ssh -L 8050:localhost:8050 user@s3df.slac.stanford.edu
# Then open http://localhost:8050
```

### Extracting Metadata

```bash
# Test metadata extraction on a file
python -m spine_db.extractor /path/to/output.h5
```
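
At its core the extractor reads the HDF5 root attributes, falling back to inference from the file path. A sketch of the attribute-reading half (h5py assumed; the attribute names shown are illustrative, not the module's actual keys):

```python
import h5py

def read_root_attrs(path: str) -> dict:
    """Return the HDF5 root attributes as plain Python values."""
    with h5py.File(path, "r") as f:
        return {
            key: value.decode() if isinstance(value, bytes) else value
            for key, value in f.attrs.items()
        }

# Hypothetical output: {'spine_version': '0.4.1', 'model_name': 'icarus', ...}
```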

## Database Schema

```sql
CREATE TABLE spine_files (
    id SERIAL PRIMARY KEY,
    file_path VARCHAR UNIQUE NOT NULL,
    spine_version VARCHAR,
    spine_prod_version VARCHAR,
    model_name VARCHAR,
    dataset_name VARCHAR,
    run INTEGER,
    subrun INTEGER,
    event_min INTEGER,
    event_max INTEGER,
    num_events INTEGER,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_spine_files_file_path ON spine_files(file_path);
CREATE INDEX idx_spine_files_created_at ON spine_files(created_at);
CREATE INDEX idx_spine_files_model_name ON spine_files(model_name);
CREATE INDEX idx_spine_files_dataset_name ON spine_files(dataset_name);
CREATE INDEX idx_spine_files_run ON spine_files(run);
CREATE INDEX idx_spine_files_subrun ON spine_files(subrun);
```
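
The same table expressed as a SQLAlchemy declarative model, a sketch of what `schema.py` plausibly contains; it is derived column-for-column from the SQL above, but the class and attribute names are assumptions:

```python
from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SpineFile(Base):
    __tablename__ = "spine_files"

    id = Column(Integer, primary_key=True)
    file_path = Column(String, unique=True, nullable=False, index=True)
    spine_version = Column(String)
    spine_prod_version = Column(String)
    model_name = Column(String, index=True)
    dataset_name = Column(String, index=True)
    run = Column(Integer, index=True)
    subrun = Column(Integer, index=True)
    event_min = Column(Integer)
    event_max = Column(Integer)
    num_events = Column(Integer)
    created_at = Column(DateTime, nullable=False, server_default=func.now())
```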

## Database Size Estimates

Each row stores roughly 350 bytes (file path, version strings, model, dataset, timestamps, plus index overhead).

**Examples**:

- 1,000 files = ~350 KB
- 10,000 files = ~3.5 MB
- 100,000 files = ~35 MB
- 1,000,000 files = ~350 MB

The **Supabase free tier (500 MB)** therefore holds roughly 1.4 million files, enough for several years of heavy production.
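
The arithmetic behind that capacity figure, as a one-liner:

```python
ROW_BYTES = 350          # rough per-row footprint from the estimate above
FREE_TIER = 500_000_000  # Supabase free tier, 500 MB
print(FREE_TIER // ROW_BYTES)  # 1428571 -> ~1.4 million files
```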

## Integration with spine-prod

### Automatic Indexing After Job Completion

Add indexing to your submit.py workflow:

```python
import glob
import os
import subprocess

# After job completion
if not dry_run and job_ids:
    # Index every HDF5 output the jobs produced
    output_files = glob.glob(f"{job_dir}/output/*.h5")
    if output_files:
        subprocess.run([
            "spine-db", "inject",
            "--db", os.environ.get("SPINE_DB_URL", "sqlite:///spine_files.db"),
            "--source", *output_files,
        ])
```

### Future: Pipeline Integration

In pipelines, add an indexing stage:

```yaml
stages:
  - name: reconstruction
    config: infer/icarus/latest.cfg
    files: data/*.root

  - name: index
    depends_on: [reconstruction]
    # Custom indexer stage
    script: |
      spine-db inject \
        --db $SPINE_DB_URL \
        --source {{ reconstruction.output }}
```

## Contributing

This is Phase 1 - a minimal working system. Future phases will:

1. Add robust pagination and versioning (Phase 2)
2. Implement security and deployment (Phase 3)
3. Extract to a separate repo with migrations (Phase 4)
4. Add advanced features (Phase 5)

Feedback and contributions welcome!

## License

Same as SPINE and spine-prod.
