This repo contains a `docker-compose.yml` file which configures the containers and their interaction.
To run the containers:
- Users and passwords (adjust env variables as needed and set new passwords): `cp env.template .env`
  Optionally set the following env variables for Postgres, OpenSearch, and/or FastAPI (not needed for local dev):
  `POSTGRES_ADDRESS` (default `"postgres"`) and `POSTGRES_PORT` (default `5432`); `OPENSEARCH_ADDRESS` (default `"opensearch"`) and `OPENSEARCH_PORT` (default `9200`); `FASTAPI_ADDRESS` (default `"127.0.0.1"`) and `FASTAPI_PORT` (default `8080`)
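As a sketch of how such defaults typically behave (the actual lookup code in this repo may differ), a service would resolve these variables with fallbacks to the documented defaults:

```python
import os

# Sketch only: resolve service addresses/ports from env variables,
# falling back to the defaults documented above when a variable is unset.
def pg_config() -> tuple[str, int]:
    host = os.environ.get("POSTGRES_ADDRESS", "postgres")
    port = int(os.environ.get("POSTGRES_PORT", "5432"))
    return host, port

def opensearch_config() -> tuple[str, int]:
    host = os.environ.get("OPENSEARCH_ADDRESS", "opensearch")
    port = int(os.environ.get("OPENSEARCH_PORT", "9200"))
    return host, port

def fastapi_config() -> tuple[str, int]:
    host = os.environ.get("FASTAPI_ADDRESS", "127.0.0.1")
    port = int(os.environ.get("FASTAPI_PORT", "8080"))
    return host, port
```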
- API keys for the search API server: `cp keys.env.template keys.env`
- Dev config for the docker containers: `cp docker-compose.override.yml.template docker-compose.override.yml`
- `docker compose up -d`
- Create the PostgreSQL table structure, see below.
- Create the OpenSearch index, see below.
- Run the transformation process, see below.
- When using pgAdmin, register a new server with host name `"postgres"` (the container name in the docker network) and port `"5432"`. Provide the credentials as defined in `.env`.
- `cd scripts`
- Install uv and run `uv sync`
- `cd scripts/postgres_data`
- Create the table structure and repo config as defined in `scripts/postgres_data/create_sql` (to start from scratch, you have to remove the tables first with `DROP`): `uv run create_db.py`
- Load XML data from `scripts/postgres_data/data` (populates table `harvest_events`): `uv run import_data.py`
- Transform data from `scripts/postgres_data/data` to a local dir (to test the transformation, as an alternative to using the Celery process): `uv run transform.py -i harvests_{repo_suffix} -o {repo_suffix}_json -s JSON_schema_file [-n]`. If the `-n` flag is provided, the JSON data is also normalized and validated against the JSON schema file `utils/schema.json`.
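To illustrate the idea behind the validation step that `-n` enables, here is a stdlib-only sketch; the actual pipeline presumably uses a full JSON Schema validator against `utils/schema.json`, and the miniature schema below is purely hypothetical:

```python
# Minimal sketch of schema validation: check required keys and the
# declared types of properties. Not the repo's actual validator.
TYPE_MAP = {"string": str, "integer": int, "object": dict, "array": list}

def is_valid(record, schema):
    """Return True if the record satisfies the (tiny) schema."""
    if not isinstance(record, dict):
        return False
    if any(key not in record for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if key in record and expected and not isinstance(record[key], expected):
            return False
    return True

# Hypothetical miniature schema standing in for utils/schema.json.
SCHEMA = {
    "required": ["title"],
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
}
```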
- `cd scripts/opensearch_data`
- Create the `test_datacite` index (deletes any existing `test_datacite` index): `uv run create_index.py`
- For sample OpenSearch queries, see `open_search_queries`
- To test queries requiring vector embeddings, run `uv run query_index.py`
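As a sketch of what such a vector query body can look like with OpenSearch's k-NN plugin (the field name `embedding` and the sample vector are hypothetical; the real field names live in the index created by `create_index.py`):

```python
import json

def knn_query(vector, k=10, field="embedding"):
    """Build an OpenSearch k-NN query body (field name is hypothetical)."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }

# Example body that could be POSTed to an index's _search endpoint.
body = knn_query([0.1, 0.2, 0.3], k=5)
print(json.dumps(body))
```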
The transformer container provides an API to start the transformation and indexing process.
A transformation requires a `harvest_run_id`.
When running the script `import_data.py` (`scripts/postgres_data/data`),
a harvest run is created for each endpoint, the individual OAI-PMH records are registered as harvest events,
and the harvest run is then closed. Note that a transformation can only be performed for a closed harvest run.
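The lifecycle described above can be sketched as follows (the class and method names are illustrative, not the repo's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class HarvestRun:
    """Illustrative model: a run collects OAI-PMH records as harvest
    events and must be closed before it may be transformed."""
    endpoint: str
    closed: bool = False
    events: list = field(default_factory=list)

    def register(self, record_id: str) -> None:
        if self.closed:
            raise RuntimeError("cannot add events to a closed harvest run")
        self.events.append(record_id)

    def close(self) -> None:
        self.closed = True

    def transformable(self) -> bool:
        # A transformation can only be performed for a closed harvest run.
        return self.closed
```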
- Check if the transformer container is up and running: `http://127.0.0.1:8080/health`
- Obtain a harvest run id and status for a given endpoint (e.g. https://dabar.srce.hr/oai): `http://127.0.0.1:8080/harvest_run?harvest_url=https%3A%2F%2Fdabar.srce.hr%2Foai`
- Start the transformation process: `http://127.0.0.1:8080/index?harvest_run_id=xyz`
- See the transformation task results in Flower: `http://127.0.0.1:5555/tasks`
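The harvest_run call above percent-encodes the endpoint URL; a sketch of building that request URL from Python (issue it with e.g. `urllib.request.urlopen` once the transformer container is running):

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:8080"

def harvest_run_url(harvest_url: str) -> str:
    """Build the harvest_run query URL, percent-encoding the endpoint."""
    return f"{BASE}/harvest_run?harvest_url={quote(harvest_url, safe='')}"

url = harvest_run_url("https://dabar.srce.hr/oai")
print(url)
# → http://127.0.0.1:8080/harvest_run?harvest_url=https%3A%2F%2Fdabar.srce.hr%2Foai
```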
After starting the stack with `docker compose up`, you can run the harvester for a given repository URL, e.g.:
`docker compose run harvester https://lifesciences.datastations.nl/oai`

Before running the e2e tests locally, the env var `POSTGRES_DB` needs to be set to `"testdb"`
since the e2e tests and the API have to use the same DB in order for the tests to work.
Note that the e2e tests re-initialize "testdb" on each run. Since "testdb" is hardcoded in the e2e tests,
the production db "dataset" won't be overwritten by running them.
To run the e2e tests: `uv run pytest -s e2e`