The developer toolkit for the Open Science Archive — define metadata schemas, write validation hooks, build ingesters, and deploy conventions.
Pre-release — APIs will change without notice.
pip install osa-pyA convention defines what data your archive accepts. Here's an example that archives protein structures and runs analysis on the deposited files:
from datetime import date
import httpx
from pydantic import BaseModel
from osa import (
Schema, Field, Record, Reject, hook, convention,
IngesterContext, IngesterRecord,
)
# 1. Define the metadata schema
class PDBStructure(Schema):
__schema_id__ = "pdb-structure"
pdb_id: str
title: str
method: str
resolution: float | None = Field(default=None, unit="Å")
deposition_date: date
molecular_weight: float = Field(unit="kDa")
chain_count: int
# 2. Validate incoming records
@hook
def validate_structure(record: Record[PDBStructure]) -> None:
if record.metadata.resolution and record.metadata.resolution > 4.0:
raise Reject("Resolution too low for reliable analysis")
if not record.files.glob("*.cif"):
raise Reject("At least one CIF file is required")
# 3. Extract features from the data
class Pocket(BaseModel):
pocket_id: int
score: float
volume: float
@hook
def find_pockets(record: Record[PDBStructure]) -> list[Pocket]:
cif = record.files["structure.cif"]
size_kb = cif.size / 1024
return [Pocket(pocket_id=0, score=round(size_kb / 100, 2), volume=size_kb)]
# 4. Pull records from an external source
class PDBIngester:
name = "pdb-rcsb"
schedule = None
initial_run = None
max_file_mb = 50.0
async def pull(self, *, ctx: IngesterContext, limit=None, **kwargs):
for pdb_id in ["4TOS", "1TIM", "6LU7"][:limit]:
entry = httpx.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}").json()
file_ref = await ctx.add_file(
pdb_id, "structure.cif",
url=f"https://files.rcsb.org/download/{pdb_id}.cif",
)
yield IngesterRecord(
source_id=pdb_id,
metadata={"pdb_id": pdb_id, "title": entry["struct"]["title"], ...},
files=[file_ref],
)
# 5. Register the convention
convention(
title="Protein Structures",
version="1.0.0",
schema=PDBStructure,
ingester=PDBIngester,
hooks=[validate_structure, find_pockets],
files={"accepted_types": [".cif", ".pdb"], "max_count": 5},
)Register the convention as a setuptools entry point in pyproject.toml:
[project.entry-points."osa.conventions"]
my_convention = "my_package"Test the full convention pipeline end-to-end — the ingester pulls real data, then every hook runs against each record:
osa test --limit 2Running ingester pdb-rcsb...
Fetched 2 record(s) (4TOS, 1TIM)
Running hooks...
4TOS
✓ validate_structure
✓ find_pockets → 1 Pocket(s)
1TIM
✓ validate_structure
✓ find_pockets → 1 Pocket(s)
2 record(s), 2 accepted, 0 rejected
You can also test individual hooks in-process with synthetic data:
from osa.testing import run_hook
run_hook(validate_structure, meta={"pdb_id": "TEST", ...}, files=tmp_path)Run a full OSA stack locally with Docker:
osa init my-archive
cd my-archive
osa startThis scaffolds a project directory with osa.yaml, .env, and a Docker Compose stack (Postgres, OSA server, docker-socket-proxy). Authentication is handled automatically — osa start mints a dev JWT so osa deploy and osa ingestion start work immediately.
Deploy conventions to a running archive:
osa deployThis builds OCI images for your hooks and ingesters, pushes them to the server's registry, and registers the convention.
| Command | Description |
|---|---|
osa init [dir] |
Scaffold a new project (osa.yaml, .env, .gitignore) |
osa start |
Start the local OSA stack |
osa stop |
Stop the local stack |
osa logs [-f] [service] |
View container logs |
osa status |
Show running containers |
| Command | Description |
|---|---|
osa test [--limit N] |
Test the full pipeline: ingester → hooks |
osa deploy |
Build OCI images and register conventions |
osa meta |
Print the convention manifest as JSON |
osa ingestion start |
Trigger an ingestion run |
| Command | Description |
|---|---|
osa login |
Authenticate via ORCID device flow |
osa logout |
Remove stored credentials |
osa link --server <url> |
Link project to a remote archive |
Schema — a Pydantic model defining typed metadata fields. Each field can carry units and constraints. The schema is the contract between depositors, hooks, and the archive.
Hook — a pure function decorated with @hook that receives a Record[T] and returns structured results. Hooks run as OCI containers — the SDK builds the image, the server orchestrates execution.
Convention — a bundle of a schema, hooks, file requirements, and an optional ingester. Conventions are the unit of deployment.
Ingester — an async generator that pulls records from external systems (APIs, databases, file servers) into the archive on a schedule.
Record[T] — generic container binding a schema type T to its metadata, files, and SRN (Scientific Resource Name).