Metadata Warehouse

Docker Compose Setup

This repo contains a docker-compose.yml file that configures the containers and their interaction. To run the containers, complete the following steps.

Set up users and passwords (adjust the env variables as needed and set new passwords):
```
cp env.template .env
```
Optionally add the following env variables for Postgres, OpenSearch, and/or FastAPI (not needed for local dev):
- POSTGRES_ADDRESS (default "postgres") and POSTGRES_PORT (default 5432)
- OPENSEARCH_ADDRESS (default "opensearch") and OPENSEARCH_PORT (default 9200)
- FASTAPI_ADDRESS (default "127.0.0.1") and FASTAPI_PORT (default 8080)
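To illustrate how these optional variables and their defaults fit together, here is a minimal Python sketch. The helper name `resolve_endpoints` is hypothetical and not part of this repo; only the variable names and default values are taken from the list above.

```python
import os

# Defaults as listed above; unset variables fall back to these values.
DEFAULTS = {
    "POSTGRES_ADDRESS": "postgres",
    "POSTGRES_PORT": "5432",
    "OPENSEARCH_ADDRESS": "opensearch",
    "OPENSEARCH_PORT": "9200",
    "FASTAPI_ADDRESS": "127.0.0.1",
    "FASTAPI_PORT": "8080",
}


def resolve_endpoints(env=None):
    """Return {service: (address, port)} from env values or the documented defaults.

    Hypothetical helper for illustration only; the repo may read these differently.
    """
    env = os.environ if env is None else env
    endpoints = {}
    for service in ("POSTGRES", "OPENSEARCH", "FASTAPI"):
        address = env.get(f"{service}_ADDRESS", DEFAULTS[f"{service}_ADDRESS"])
        port = int(env.get(f"{service}_PORT", DEFAULTS[f"{service}_PORT"]))
        endpoints[service.lower()] = (address, port)
    return endpoints


if __name__ == "__main__":
    for name, (address, port) in resolve_endpoints().items():
        print(f"{name}: {address}:{port}")
```

Running it with none of the variables set prints the local-dev defaults, which is why they can be left out of .env for local development.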
Set up the API keys for the search API server:
```
cp keys.env.template keys.env
```

Set up the dev config for the Docker containers:
```
cp docker-compose.override.yml.template docker-compose.override.yml
```
Start the containers:
```
docker compose up -d
```
Once the containers are running, create the PostgreSQL table structure, create the OpenSearch index, and run the transformation process, as described in the sections below.

pgAdmin

When using pgAdmin, register a new server with the host name "postgres" (the container name in the Docker network) and port "5432".
Provide the credentials defined in .env.

Basic Setup

```
cd scripts
```
Install uv, then install the project dependencies:
```
uv sync
```

Create Postgres DB and Load and Transform Data

```
cd scripts/postgres_data
```
Create the table structure and repo config as defined in scripts/postgres_data/create_sql (to start from scratch, first remove the existing tables with DROP):
```
uv run create_db.py
```
Load the XML data from scripts/postgres_data/data (this populates the table harvest_events):
```
uv run import_data.py
```
Transform the data from scripts/postgres_data/data to a local dir (useful for testing the transformation as an alternative to the Celery process):
```
uv run transform.py -i harvests_{repo_suffix} -o {repo_suffix}_json -s JSON_schema_file [-n]
```
If the -n flag is provided, the JSON data will also be normalized and validated against the JSON schema file utils/schema.json.
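When the transformation has to be run for several repositories, the command pattern above can be assembled programmatically. The following Python sketch is only an illustration: `transform_command` is a hypothetical helper, "xyz" is a placeholder repo suffix, and passing utils/schema.json to -s is an assumption based on the schema file named above.

```python
import shlex


def transform_command(repo_suffix, schema_file, normalize=False):
    """Build the argument list for transform.py following the pattern above.

    Hypothetical helper; it only assembles the command, it does not run it.
    """
    args = [
        "uv", "run", "transform.py",
        "-i", f"harvests_{repo_suffix}",  # input directory
        "-o", f"{repo_suffix}_json",      # output directory
        "-s", schema_file,                # JSON schema file
    ]
    if normalize:
        args.append("-n")  # also normalize and validate the JSON data
    return args


if __name__ == "__main__":
    # "xyz" is a placeholder; replace it with a real repo suffix.
    print(shlex.join(transform_command("xyz", "utils/schema.json", normalize=True)))
```

The resulting list can be passed to `subprocess.run` from within scripts/postgres_data.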

Create OpenSearch Index

```
cd scripts/opensearch_data
```
Create the test_datacite index (this deletes any existing test_datacite index):
```
uv run create_index.py
```
For sample OpenSearch queries, see open_search_queries.
To test queries requiring vector embeddings, run:
```
uv run query_index.py
```

Run Transformation Process

The transformer container provides an API to start the transformation and indexing process.

A transformation requires a harvest_run_id. When the script import_data.py is run (on the data in scripts/postgres_data/data), a harvest run is created for each endpoint, the individual OAI-PMH records are registered as harvest events, and the harvest run is then closed. Note that a transformation can only be performed for a closed harvest run.

Check that the transformer container is up and running:
```
http://127.0.0.1:8080/health
```
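The health check can also be scripted, for example to wait for the container before starting a transformation. This is a minimal sketch using only the Python standard library; `is_up` is a hypothetical helper, and treating HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import urllib.error
import urllib.request


def is_up(url="http://127.0.0.1:8080/health", timeout=5.0,
          opener=urllib.request.urlopen):
    """Return True if the health endpoint answers with HTTP 200, else False.

    Assumption: a healthy transformer container responds with status 200.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("transformer is up" if is_up() else "transformer is not reachable")
```

The `opener` parameter exists only so the function can be exercised without a running container.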
To obtain the harvest run ID and status for a given endpoint (here https://dabar.srce.hr/oai):
```
http://127.0.0.1:8080/harvest_run?harvest_url=https%3A%2F%2Fdabar.srce.hr%2Foai
```
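Note that the endpoint URL is percent-encoded in the query string. A short Python sketch of how such a request URL can be built for any endpoint follows; `harvest_run_url` is a hypothetical helper name, and the base URL is the local default used above.

```python
from urllib.parse import urlencode


def harvest_run_url(harvest_url, base="http://127.0.0.1:8080"):
    """Build the /harvest_run request URL with a percent-encoded harvest_url."""
    return f"{base}/harvest_run?{urlencode({'harvest_url': harvest_url})}"


if __name__ == "__main__":
    print(harvest_run_url("https://dabar.srce.hr/oai"))
    # prints http://127.0.0.1:8080/harvest_run?harvest_url=https%3A%2F%2Fdabar.srce.hr%2Foai
```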

Start the transformation process:
```
http://127.0.0.1:8080/index?harvest_run_id=xyz
```
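The two requests can be chained: look up the harvest run, confirm it is closed, then start indexing. The Python sketch below shows that flow under several assumptions that are not confirmed by this README: both endpoints accept GET and answer with JSON, /harvest_run returns a single object, and the field names `id` and `status` as well as the value `closed` are hypothetical. Check the actual responses and adjust them.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE = "http://127.0.0.1:8080"


def fetch_json(url):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def start_indexing(harvest_url, fetch=fetch_json, id_field="id",
                   status_field="status", closed_value="closed"):
    """Start indexing for an endpoint's harvest run if that run is closed.

    ASSUMPTIONS: id_field, status_field and closed_value are hypothetical;
    they must be adapted to the real /harvest_run response.
    """
    run = fetch(f"{BASE}/harvest_run?{urlencode({'harvest_url': harvest_url})}")
    if run.get(status_field) != closed_value:
        # A transformation can only be performed for a closed harvest run.
        raise RuntimeError(f"harvest run is not closed: {run!r}")
    return fetch(f"{BASE}/index?{urlencode({'harvest_run_id': run[id_field]})}")
```

Checking the status first avoids starting a transformation for a harvest run that is still open.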

See the transformation task results in Flower:
```
http://127.0.0.1:5555/tasks
```

After starting the stack with docker compose up, you can run the harvester for a given repository URL, e.g.:
```
docker compose run harvester https://lifesciences.datastations.nl/oai
```

Run E2E Tests

Before running the e2e tests locally, set the env var POSTGRES_DB to "testdb": the e2e tests and the API must use the same DB for the tests to work. Note that the e2e tests reinitialize "testdb" on each run. Because "testdb" is hardcoded in the e2e tests, running them will not overwrite the production DB "dataset".

To run the e2e tests:
```
uv run pytest -s e2e
```
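As a sketch of the environment requirement, the following Python snippet forces POSTGRES_DB to "testdb" for the test process before invoking pytest. The helpers `e2e_env` and `run_e2e` are hypothetical. The snippet only covers the test process itself; the API still has to be configured separately to use the same DB.

```python
import os
import subprocess


def e2e_env(base_env=None):
    """Return a copy of the environment with POSTGRES_DB forced to "testdb"."""
    env = dict(os.environ if base_env is None else base_env)
    env["POSTGRES_DB"] = "testdb"
    return env


def run_e2e():
    """Run the e2e suite with the adjusted environment; return pytest's exit code."""
    result = subprocess.run(["uv", "run", "pytest", "-s", "e2e"], env=e2e_env())
    return result.returncode
```

Calling `run_e2e()` is then equivalent to running the command above with POSTGRES_DB=testdb set for that one invocation.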
