Commit 84d01f2

Merge pull request #29 from dataforgoodfr/performances
Pipeline mesure performances (performance measurement pipeline)
2 parents c3fa1d6 + 150e958 commit 84d01f2

12 files changed

Lines changed: 803 additions & 512 deletions

.env.template

Lines changed: 5 additions & 4 deletions
@@ -29,12 +29,13 @@ DATABASE_URL=postgresql://eu_fact_force:eu_fact_force@localhost:5432/eu_fact_for
 # -----------------------------------------------------------------------------
 
 # If AWS_STORAGE_BUCKET_NAME is set, Django uses S3 for default file storage.
-# For local dev with LocalStack, set USE_LOCAL_STACK=1 and the endpoint below.
+# For local dev with RustFS (docker-compose): set AWS_S3_ENDPOINT_URL and use the same credentials.
+# RustFS Web Console: http://localhost:9001 | S3 API: http://localhost:9000
 # USE_LOCAL_STACK=1
-# AWS_ACCESS_KEY_ID=test
-# AWS_SECRET_ACCESS_KEY=test
+# AWS_ACCESS_KEY_ID=minioadmin
+# AWS_SECRET_ACCESS_KEY=minioadmin
 # AWS_STORAGE_BUCKET_NAME=eu-fact-force-files
 # AWS_S3_REGION_NAME=eu-west-1
-# AWS_S3_ENDPOINT_URL=http://localhost:4566
+# AWS_S3_ENDPOINT_URL=http://localhost:9000
 
 # In production: set real AWS credentials and do not set AWS_S3_ENDPOINT_URL.
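To catch a mistyped endpoint before Django starts, the RustFS defaults above can be sanity-checked with a small script. A minimal sketch: the `parse_env` helper and the inline sample are illustrative, not part of the repo.

```python
# Sample lines mirroring the RustFS defaults from .env.template above.
SAMPLE_ENV = """\
# Local RustFS defaults
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_STORAGE_BUCKET_NAME=eu-fact-force-files
AWS_S3_ENDPOINT_URL=http://localhost:9000
"""


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


env = parse_env(SAMPLE_ENV)
print(env["AWS_S3_ENDPOINT_URL"])  # → http://localhost:9000
```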

.gitignore

Lines changed: 3 additions & 3 deletions
@@ -174,7 +174,7 @@ eu_fact_force/ingestion/parsing/output/analysis/
 s3/
 eu_fact_force/exploration/
 annotated_pdf/
-
-# docker volumes
+eu_fact_force/exploration/docling/results/annotated_pdf/
 postgres_data/
-rustfs_data
+rustfs_data/
+data/

README.md

Lines changed: 44 additions & 9 deletions
@@ -112,13 +112,13 @@ uv run pytest
 
 ### Deploying the application
 
-The application consists of a Django server, a PostgreSQL database (with pgvector), and LocalStack for S3 storage.
+The application consists of a Django server, a PostgreSQL database (with pgvector), and **RustFS** for S3 storage (AWS-compatible), with a web interface for uploading files manually.
 To deploy and use the application locally:
 
 **1. Prerequisites**
 
 - [Python 3.12+](https://www.python.org/) and [uv](https://docs.astral.sh/uv/)
-- [Docker](https://www.docker.com/) and Docker Compose (for Postgres and LocalStack)
+- [Docker](https://www.docker.com/) and Docker Compose (for Postgres and RustFS)
 
 **2. Environment variables**
 
@@ -130,15 +130,15 @@ cp .env.template .env
 
 For local use with the Docker services, the defaults from `.env.template` (notably `DATABASE_URL=postgresql://eu_fact_force:eu_fact_force@localhost:5432/eu_fact_force`) are fine.
 
-**3. Start the services (Postgres and LocalStack)**
+**3. Start the services (Postgres and RustFS)**
 
 At the project root:
 
 ```bash
 docker compose up -d
 ```
 
-This starts PostgreSQL (port 5432) and LocalStack S3 (port 4566). The configured bucket is created automatically when LocalStack starts.
+This starts PostgreSQL (port 5432) and RustFS (S3 API on port 9000). The configured bucket is created automatically on first startup. **RustFS web interface**: [http://localhost:9001](http://localhost:9001) — S3 credentials (Access Key / Secret Key): the ones defined in `.env` (default `minioadmin`). There you can create buckets and folders and upload files manually.
 
 **4. Install the dependencies and apply the migrations**
 
@@ -165,14 +165,49 @@ The application is then available at [http://127.0.0.1:8000/](http://127.0.0.1
 
 **Using S3 storage locally**
 
-For Django to use LocalStack for file storage, uncomment and fill in the S3 variables in `.env` (see `.env.template`), for example:
+With `docker compose`, the app is configured to use RustFS. To run Django on the host (without the app container) and point it at RustFS, uncomment the S3 variables in `.env` (see `.env.template`) and set, for example:
 
 ```bash
-USE_LOCAL_STACK=1
-AWS_ACCESS_KEY_ID=test
-AWS_SECRET_ACCESS_KEY=test
+AWS_S3_ENDPOINT_URL=http://localhost:9000
+AWS_ACCESS_KEY_ID=minioadmin
+AWS_SECRET_ACCESS_KEY=minioadmin
 AWS_STORAGE_BUCKET_NAME=eu-fact-force-files
 AWS_S3_REGION_NAME=eu-west-1
 ```
 
-Without these variables, the application falls back to local file storage by default.
+Without these variables, the application falls back to local file storage by default.
+
+## Performance testing
+
+The project provides a set of documents on the links between vaccines and autism.
+These documents make it possible to test the pipeline end to end:
+- PDF parsing,
+- chunk extraction,
+- chunk vectorization,
+- the search mechanism.
+
+Since not all the documents are necessarily easy to access via the APIs, the documents and their metadata are bundled into an archive (then into an S3 bucket in a second step).
+The archive contains:
+- the list of the most relevant paragraphs to extract, in `vaccins_annotated.json`,
+- the PDF files,
+- one JSON file per PDF containing its metadata.
+
+Each metadata JSON file has the following structure:
+
+```json
+{
+  "tags_pubmed": [
+    "tag1",
+    "tag2",
+    "tag3"
+  ],
+  "title": "Title",
+  "category": "category",
+  "type": "type",
+  "journal": "journal",
+  "authors": ["first author", "second author"],
+  "year": 2022,
+  "url": "http",
+  "doi": "test_doi"
+}
+```
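A loader can verify this structure before ingestion. A minimal sketch, assuming the field names shown in the example above; the `validate_metadata` helper and its rules are illustrative, not part of the repo.

```python
import json

# Field names taken from the metadata example above.
REQUIRED_KEYS = {"tags_pubmed", "title", "category", "type",
                 "journal", "authors", "year", "url", "doi"}


def validate_metadata(raw):
    """Check that a per-PDF metadata JSON carries the documented fields."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in ("tags_pubmed", "authors"):
        if not isinstance(data[key], list):
            raise TypeError(f"{key} must be a list")
    return data


sample = json.dumps({
    "tags_pubmed": ["tag1"], "title": "Title", "category": "category",
    "type": "type", "journal": "journal", "authors": ["first author"],
    "year": 2022, "url": "http", "doi": "test_doi",
})
print(validate_metadata(sample)["doi"])  # → test_doi
```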

docker-compose.yml

Lines changed: 33 additions & 19 deletions
@@ -12,15 +12,15 @@ services:
       DEBUG: ${DEBUG:-0}
       SECRET_KEY: ${SECRET_KEY:-dev-secret-key-change-in-production}
       DATABASE_URL: ${DATABASE_URL:-postgresql://eu_fact_force:eu_fact_force@postgres:5432/eu_fact_force}
-      AWS_ENDPOINT_URL: ${AWS_ENDPOINT_URL:-http://localstack:4566}
-      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-test}
-      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-test}
+      AWS_S3_ENDPOINT_URL: ${AWS_S3_ENDPOINT_URL:-http://rustfs:9000}
+      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-minioadmin}
+      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
       AWS_S3_REGION_NAME: ${AWS_S3_REGION_NAME:-eu-west-1}
       AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
     depends_on:
       postgres:
         condition: service_healthy
-      localstack:
+      rustfs:
         condition: service_started
     labels:
       - traefik.enable=true
@@ -30,6 +30,7 @@ services:
       - traefik.docker.network=d4g-internal
       - traefik.http.services.eu-fact-force.loadbalancer.server.port=8000
 
+  # PostgreSQL 18+: mount on /var/lib/postgresql (the image manages the version subdirectory).
   postgres:
     image: pgvector/pgvector:pg18-trixie
     environment:
@@ -39,29 +40,42 @@ services:
     ports:
      - 5432
     volumes:
-      - postgres_data:/var/lib/postgresql
+      - ./postgres_data:/var/lib/postgresql
       - ./docker/postgres/init:/docker-entrypoint-initdb.d:ro
     healthcheck:
       test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-eu_fact_force} -d ${POSTGRES_DB:-eu_fact_force}"]
       interval: 5s
       timeout: 5s
       retries: 5
 
-  localstack:
-    image: localstack/localstack:latest
+  # RustFS: S3-compatible object storage with web console (Apache 2.0).
+  # Console UI: http://localhost:9001 | S3 API: http://localhost:9000
+  rustfs:
+    image: rustfs/rustfs:latest
+    restart: unless-stopped
     ports:
-      - 4566
+      - "9000:9000"
+      - "9001:9001"
     environment:
-      SERVICES: s3
-      PERSISTENCE: 1
-      AWS_DEFAULT_REGION: ${AWS_S3_REGION_NAME:-eu-west-1}
-      AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
-      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-test}
-      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-test}
-      DEBUG: ${DEBUG:-0}
+      RUSTFS_ACCESS_KEY: ${AWS_ACCESS_KEY_ID:-minioadmin}
+      RUSTFS_SECRET_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
+      RUSTFS_CONSOLE_ENABLE: "true"
+    command: ["--console-enable", "/data"]
     volumes:
-      - ./s3:/var/lib/localstack
-      - ./docker/localstack/init-ready.d:/etc/localstack/init/ready.d:ro
+      - ./rustfs_data:/data
 
-volumes:
-  postgres_data:
+  # Create default S3 bucket on first run (depends on RustFS).
+  rustfs-init:
+    image: amazon/aws-cli:latest
+    depends_on:
+      - rustfs
+    entrypoint: ["/bin/sh", "-c"]
+    command:
+      - |
+        sleep 5
+        aws s3 mb s3://$${AWS_STORAGE_BUCKET_NAME} --endpoint-url http://rustfs:9000 2>/dev/null || true
+        echo "Bucket ready."
+    environment:
+      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-minioadmin}
+      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
+      AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
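The `rustfs-init` one-shot container above papers over startup ordering with a fixed `sleep 5` before creating the bucket. The same idempotent bootstrap can be sketched as an explicit retry loop (illustrative: `ensure_bucket` and the stub clients are not part of the repo; in real use you would pass a `boto3` client pointed at `http://rustfs:9000` and catch botocore's `BucketAlreadyOwnedByYou`):

```python
import time


class BucketExists(Exception):
    """Stand-in for botocore's BucketAlreadyOwnedByYou error."""


def ensure_bucket(client, name, attempts=5, delay=0.0):
    """Idempotently create a bucket, retrying while the endpoint comes up."""
    for i in range(attempts):
        try:
            client.create_bucket(Bucket=name)
            return "created"
        except BucketExists:
            return "exists"
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)


class FlakyClient:
    """Fake S3 client: fails twice (endpoint not up yet), then succeeds."""

    def __init__(self):
        self.calls = 0

    def create_bucket(self, Bucket):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("endpoint not ready")


print(ensure_bucket(FlakyClient(), "eu-fact-force-files"))  # → created
```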
Lines changed: 12 additions & 8 deletions
@@ -1,10 +1,12 @@
 #!/usr/bin/env python3
-"""Create the default S3 bucket when LocalStack is ready."""
+"""Create the default S3 bucket and the performances bucket when LocalStack is ready."""
 
 import os
 
 import boto3
 
+PERFORMANCES_BUCKET_NAME = "performances"
+
 bucket = os.environ.get("AWS_STORAGE_BUCKET_NAME", "eu-fact-force")
 region = os.environ.get("AWS_S3_REGION_NAME", "eu-west-1")
 
@@ -15,10 +17,12 @@
     aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "test"),
     region_name=region,
 )
-try:
-    client.create_bucket(Bucket=bucket)
-    print(f"Created bucket: {bucket}")
-except client.exceptions.BucketAlreadyOwnedByYou:
-    print(f"Bucket already exists: {bucket}")
-except Exception as e:
-    print(f"Bucket creation skipped: {e}")
+
+for name in (bucket, PERFORMANCES_BUCKET_NAME):
+    try:
+        client.create_bucket(Bucket=name)
+        print(f"Created bucket: {name}")
+    except client.exceptions.BucketAlreadyOwnedByYou:
+        print(f"Bucket already exists: {name}")
+    except Exception as e:
+        print(f"Bucket creation skipped for {name}: {e}")

eu_fact_force/app/settings.py

Lines changed: 27 additions & 14 deletions
@@ -11,6 +11,7 @@
 """
 
 import os
+import sys
 from pathlib import Path
 from urllib.parse import urlparse
 
@@ -25,6 +26,13 @@
 if _load_env.exists():
     load_dotenv(_load_env)
 
+# Under pytest: force local S3 (RustFS) to avoid InvalidAccessKeyId with keys from .env
+_run_by_pytest = "pytest" in sys.argv[0] or "pytest" in str(sys.argv)
+if _run_by_pytest:
+    os.environ["AWS_S3_ENDPOINT_URL"] = "http://localhost:9000"
+    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
+    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
+    os.environ["AWS_STORAGE_BUCKET_NAME"] = "eu-fact-force-files"
 
 # Quick-start development settings - unsuitable for production
 # See https://docs.djangoproject.com/en/6.0/howto/deployment/checklist/
@@ -106,21 +114,15 @@ def _get_databases():
             "PASSWORD": parsed.password,
             "HOST": parsed.hostname,
             "PORT": parsed.port or "5432",
+            "TEST": {
+                "NAME": f"test_{parsed.path.lstrip('/')}",
+            },
         }
     }
 
 
 DATABASES = _get_databases()
 
-# Use a dedicated test database name so tests do not overwrite dev/prod data
-if (
-    "default" in DATABASES
-    and DATABASES["default"]["ENGINE"] == "django.db.backends.postgresql"
-):
-    _db_name = DATABASES["default"].get("NAME", "eu_fact_force")
-    DATABASES["default"].setdefault("TEST", {})["NAME"] = f"test_{_db_name}"
-
-
 # Password validation
 # https://docs.djangoproject.com/en/6.0/ref/settings/#auth-password-validators
 
@@ -159,17 +161,28 @@ def _get_databases():
 # Must be an absolute filesystem path (string) for collectstatic / StaticFilesStorage
 STATIC_ROOT = str((BASE_DIR.parent / "staticfiles").resolve())
 
-# S3 / LocalStack storage (switch via AWS_S3_ENDPOINT_URL or USE_LOCAL_STACK)
+# S3 / MinIO / LocalStack storage (switch via AWS_S3_ENDPOINT_URL or USE_LOCAL_STACK)
 # django-storages reads AWS_S3_ENDPOINT_URL from this module
-AWS_S3_ENDPOINT_URL = os.environ.get("AWS_S3_ENDPOINT_URL") or (
-    "http://localhost:4566" if os.environ.get("USE_LOCAL_STACK") else None
+# Defaults: local RustFS (9000), or LocalStack (4566) if USE_LOCAL_STACK=1
+AWS_S3_ENDPOINT_URL = os.environ.get("AWS_S3_ENDPOINT_URL") or "http://localhost:9000"
+if AWS_S3_ENDPOINT_URL and (
+    "localhost" in AWS_S3_ENDPOINT_URL or "127.0.0.1" in AWS_S3_ENDPOINT_URL
+):
+    os.environ.setdefault("AWS_ACCESS_KEY_ID", "minioadmin")
+    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "minioadmin")
+# Must match eu_fact_force.ingestion.s3.save_file_to_s3 / get_default_bucket(): uploads use
+# boto3 with this default bucket even when AWS_STORAGE_BUCKET_NAME is unset, so default_storage
+# must use the same bucket or opens fall back to FileSystemStorage and FileNotFoundError.
+_DEFAULT_FILES_BUCKET = "eu-fact-force-files"
+_AWS_STORAGE_BUCKET_NAME = (
+    os.environ.get("AWS_STORAGE_BUCKET_NAME") or _DEFAULT_FILES_BUCKET
 )
-if os.environ.get("AWS_STORAGE_BUCKET_NAME"):
+if _AWS_STORAGE_BUCKET_NAME:
     STORAGES = {
         "default": {
             "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
             "OPTIONS": {
-                "bucket_name": os.environ.get("AWS_STORAGE_BUCKET_NAME"),
+                "bucket_name": _AWS_STORAGE_BUCKET_NAME,
                 "region_name": os.environ.get("AWS_S3_REGION_NAME", "eu-west-1"),
                 "custom_domain": False,
             },
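The fallback logic added to `settings.py` (default endpoint, localhost-only default credentials, default bucket) can be isolated as a pure function, which makes the precedence rules easy to unit-test. A sketch under the assumption that the same precedence applies; `resolve_s3_config` is illustrative, not part of the repo:

```python
def resolve_s3_config(env):
    """Mirror the settings.py defaults: local RustFS endpoint, minioadmin
    credentials for localhost endpoints, and a fixed default bucket."""
    endpoint = env.get("AWS_S3_ENDPOINT_URL") or "http://localhost:9000"
    creds = {}
    if "localhost" in endpoint or "127.0.0.1" in endpoint:
        creds = {
            "AWS_ACCESS_KEY_ID": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
            "AWS_SECRET_ACCESS_KEY": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        }
    bucket = env.get("AWS_STORAGE_BUCKET_NAME") or "eu-fact-force-files"
    return endpoint, bucket, creds


endpoint, bucket, creds = resolve_s3_config({})
print(endpoint, bucket)  # → http://localhost:9000 eu-fact-force-files
```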
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+import json
+import logging
+from pathlib import Path
+
+from django.core.management.base import BaseCommand
+
+from eu_fact_force.ingestion.embedding import add_embeddings
+from eu_fact_force.ingestion.parsing import parse_file
+from eu_fact_force.ingestion.services import save_chunks, save_to_s3_and_postgres
+
+logger = logging.getLogger(__name__)
+
+PERFORMANCES_BUCKET_NAME = "performances"
+VACCINS_ANNOTATED_KEY = "vaccins_annotated.json"
+PDF_PREFIX = "pdf"
+
+
+class Command(BaseCommand):
+    help = (
+        "Read vaccins_annotated.json from S3 bucket performances, "
+        "download PDF + JSON per entry, run full ingestion pipeline."
+    )
+
+    def handle(self, *args, **options):
+        performance_dir = Path(__file__).resolve().parents[4] / "data" / "vaccine_perfs"
+        pdfs = list(performance_dir.glob("*.pdf"))
+        for pdf_path in pdfs:
+            logger.info(f"Processing {pdf_path.stem}")
+            key = pdf_path.stem
+            metadata = json.load(pdf_path.with_suffix(".json").open())
+            source_file = save_to_s3_and_postgres(
+                pdf_path,
+                tags_pubmed=metadata.get("tags_pubmed", []),
+                doi=key,
+            )
+            document_parts = parse_file(source_file)
+            chunks = save_chunks(source_file, document_parts)
+            add_embeddings(chunks)
+
+        self.stdout.write(self.style.SUCCESS(f"Done. Processed: {len(pdfs)}"))
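The `handle` method above assumes every PDF in `data/vaccine_perfs` has a matching sidecar JSON and raises `FileNotFoundError` otherwise. A slightly more defensive pairing helper could look like this (a sketch; `iter_pdf_metadata` is not part of the repo):

```python
import json
from pathlib import Path


def iter_pdf_metadata(directory):
    """Yield (pdf_path, metadata) pairs, skipping PDFs without a sidecar JSON."""
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        sidecar = pdf_path.with_suffix(".json")
        if not sidecar.exists():
            continue
        with sidecar.open() as fh:
            yield pdf_path, json.load(fh)
```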
