Changes from all commits (53 commits)
0a13464
Add prototype creation script
philippesaade-wmde Mar 11, 2025
4eb77d2
Fix tqdm
philippesaade-wmde Mar 11, 2025
c9b360b
updated docker-compose.yml, docker7 run.py, wikidataItemDB, wikidataR…
exowanderer Mar 11, 2025
34d9bb4
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
76a5f31
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
481a95b
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
d1dd406
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
d84dade
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
3783b0d
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
03f6f14
added docstrings to wikidataEmbed
exowanderer Mar 11, 2025
ed87ce4
moved mega dict to json and fixed junior eng file open bugs
exowanderer Mar 11, 2025
90b013c
added data dir exist checks
exowanderer Mar 11, 2025
ffedc9a
modified test dir exists and create
exowanderer Mar 11, 2025
8c44bbe
Handle network errors from Jina's side
philippesaade-wmde Mar 12, 2025
bba9b7f
Merge pull request #3 from philippesaade-wmde/main
exowanderer Mar 12, 2025
7712660
change cache database to base64 for the vectors
philippesaade-wmde Mar 12, 2025
0d425af
Merge pull request #2 from exowanderer/main
exowanderer Mar 13, 2025
75e477c
Refactored docker/1_Data_Processing_save_labels_descriptions/run.py b…
exowanderer Mar 13, 2025
d1a256a
removed unnecessary commented code at top of file
exowanderer Mar 13, 2025
cf20d46
Added PYTHONPATH env variable to Docekr file to bypass need for sys.p…
exowanderer Mar 13, 2025
e78cd40
Refactored docker/2_Data_Processing_save_items_per_lang/run.py based …
exowanderer Mar 13, 2025
c0d50bd
Refactored docker/3_Add_Wikidata_to_AstraDB/run.py based on flake8;…
exowanderer Mar 13, 2025
15921fa
Added ENV PYTHONPATH="${PYTHONPATH}:/src" to all Dockerfiles in docke…
exowanderer Mar 13, 2025
e420e7b
Refactored docker/4_Run_Retrieval/run.py based on output from flake8
exowanderer Mar 13, 2025
efc1e5b
Refactors docker/5_Run_Rerank/run.py based on flake8 output
exowanderer Mar 13, 2025
c05ed26
Refactors docker/6_Push_Huggingface/run.py based on flake8 output
exowanderer Mar 13, 2025
3fd2820
Refactored docker/7_Create_Prototype/run.py based on flake8 output
exowanderer Mar 13, 2025
3f9f52e
Refactors src/JinaAI based on output from flake8
exowanderer Mar 13, 2025
3bc4454
Refactors src/__init__.py based on output from flake8
exowanderer Mar 13, 2025
19ada14
Refactored src/experimental_functions/word_embeding.py based on outpu…
exowanderer Mar 13, 2025
b42436b
Refactored src/wikidataCache.py based on output from flake8
exowanderer Mar 13, 2025
3e49a74
Handle Jina and DataStax API errors
philippesaade-wmde Mar 13, 2025
7db1df9
Handle Jina and DataStax API errors
philippesaade-wmde Mar 13, 2025
b349ff1
tested new PYTHONPATH; Changed ENV PYTHONPATH=':/src' to ENV PYTHONPA…
exowanderer Mar 13, 2025
e248427
Fix duplicate ids error
philippesaade-wmde Mar 14, 2025
7492186
Changed ENV PYTHONPATH=:/src to ENV PYTHONPATH=:/ in Docker/3*/Docker…
exowanderer Mar 14, 2025
b221351
Fix bulk caching of embeddings
philippesaade-wmde Mar 15, 2025
58b0a88
Fix bulk caching of embeddings
philippesaade-wmde Mar 15, 2025
b6aa205
Include merge cache script
philippesaade-wmde Mar 17, 2025
30c3e3b
Vacuuming the database when migrating
philippesaade-wmde Mar 17, 2025
daaf7f4
Merge branch 'code_review' into updated_merge_conflicts
exowanderer Mar 17, 2025
0a60706
Merge pull request #5 from philippesaade-wmde/updated_merge_conflicts
exowanderer Mar 17, 2025
d6ed4aa
added more todos for docker 7 run.py
exowanderer Mar 17, 2025
3d1c9ae
added blank line at bottom of all Dockerfile files
exowanderer Mar 17, 2025
4d0e55a
added default setting for lang_in_wp to avoid collision
exowanderer Mar 17, 2025
2eea238
improved embedded conditional statemetns in 2_docker run.py
exowanderer Mar 17, 2025
3828ba3
refactor open to with open in docker3 run.py
exowanderer Mar 17, 2025
1f7d4ac
refactor open to with open in docker4 run.py
exowanderer Mar 17, 2025
b560f9f
added space at bottom of json file in docker7
exowanderer Mar 17, 2025
72e74a7
added space at bottom of json file in run_exp.sh
exowanderer Mar 17, 2025
2d5c450
added bash forloop example to run_exp.sh
exowanderer Mar 17, 2025
b8376f1
added space at bottom of json file in run_exp.sh
exowanderer Mar 17, 2025
d101f07
fixed too long lline in wikidatRetriever
exowanderer Mar 17, 2025
73 changes: 57 additions & 16 deletions docker-compose.yml
@@ -1,8 +1,8 @@
services:
data_processing_save_ids:
data_processing_save_labels_descriptions:
build:
context: .
dockerfile: ./docker/1_Data_Processing_save_ids/Dockerfile
dockerfile: ./docker/1_Data_Processing_save_labels_descriptions/Dockerfile
volumes:
- ./data:/data # Mount the ./data folder from the host to /data in the container
tty: true
@@ -12,10 +12,10 @@ services:
LANGUAGE: "de"
OFFSET: 0

data_processing_save_entities:
data_processing_save_items_per_lang:
build:
context: .
dockerfile: ./docker/2_Data_Processing_save_entities/Dockerfile
dockerfile: ./docker/2_Data_Processing_save_items_per_lang/Dockerfile
volumes:
- ./data:/data # Mount the ./data folder from the host to /data in the container
tty: true
@@ -43,12 +43,16 @@ services:
PYTHONUNBUFFERED: 1
MODEL: "jina"
SAMPLE: "true"
API_KEY: "datastax_wikidata_nvidia.json"
API_KEY: "datastax_wikidata2.json"
EMBED_BATCH_SIZE: 8
QUERY_BATCH_SIZE: 1000
OFFSET: 2560000
COLLECTION_NAME: "wikidata_test_v1"
LANGUAGE: 'ar'
QUERY_BATCH_SIZE: 100
OFFSET: 120000
COLLECTION_NAME: "wikidatav1"
LANGUAGE: 'en'
TEXTIFIER_LANGUAGE: 'en'
ELASTICSEARCH_URL: "http://localhost:9200"
ELASTICSEARCH: "false"
network_mode: "host"

run_retrieval:
build:
@@ -70,17 +74,16 @@ services:
environment:
PYTHONUNBUFFERED: 1
MODEL: "jina"
API_KEY: "datastax_wikidata_nvidia.json"
COLLECTION_NAME: "wikidata_test_v1"
API_KEY: "datastax_wikidata.json"
COLLECTION_NAME: "wikidata_texttest"
BATCH_SIZE: 100
EVALUATION_PATH: "Mintaka/processed_dataframe_langtest.pkl"
EVALUATION_PATH: "Mintaka/processed_dataframe.pkl"
# COMPARATIVE: "true"
# COMPARATIVE_COLS: "Correct QID,Wrong QID"
QUERY_COL: "Question"
# QUERY_LANGUAGE: "ar"
# DB_LANGUAGE: "en,ar"
QUERY_LANGUAGE: "en"
# DB_LANGUAGE: "en"
PREFIX: ""
ELASTICSEARCH_URL: "http://localhost:9200"
network_mode: "host"

run_rerank:
@@ -107,4 +110,42 @@ services:
BATCH_SIZE: 1
QUERY_COL: "Question"
LANGUAGE: "de"
network_mode: "host"
network_mode: "host"

push_huggingface:
build:
context: .
dockerfile: ./docker/6_Push_Huggingface/Dockerfile
volumes:
- ./data:/data
tty: true
container_name: push_huggingface
environment:
PYTHONUNBUFFERED: 1
QUEUE_SIZE: 5000
NUM_PROCESSES: 4
SKIPLINES: 0
ITERATION: 36

create_prototype:
build:
context: .
dockerfile: ./docker/7_Create_Prototype/Dockerfile
volumes:
- ./data:/data
- ~/.cache/huggingface:/root/.cache/huggingface
tty: true
container_name: create_prototype
environment:
PYTHONUNBUFFERED: 1
MODEL: "jinaapi"
API_KEY: "datastax_wikidata.json"
EMBED_BATCH_SIZE: 100
# QUEUE_SIZE: 5000
NUM_PROCESSES: 23
OFFSET: 0
COLLECTION_NAME: "wikidata_prototype"
LANGUAGE: 'en'
TEXTIFIER_LANGUAGE: 'en'
# CHUNK_NUM: 5
network_mode: "host"
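The `environment:` blocks above are how each pipeline stage is configured. A minimal sketch of how a stage script might read them, mirroring the `os.getenv` pattern in the run scripts — the variable names come from the compose file, but the fallback values here are illustrative assumptions, not the scripts' real defaults:

```python
import os

# Names mirror the docker-compose `environment:` block for create_prototype;
# the defaults are illustrative, not necessarily what the real scripts use.
EMBED_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", 100))
OFFSET = int(os.getenv("OFFSET", 0))
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "wikidata_prototype")
LANGUAGE = os.getenv("LANGUAGE", "en")
TEXTIFIER_LANGUAGE = os.getenv("TEXTIFIER_LANGUAGE", LANGUAGE)
# Container env vars are always strings, so booleans need an explicit parse.
SAMPLE = os.getenv("SAMPLE", "false").lower() == "true"
```

Everything crossing the container boundary arrives as a string, hence the explicit `int(...)` and `== "true"` conversions.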
41 changes: 0 additions & 41 deletions docker/1_Data_Processing_save_ids/run.py

This file was deleted.

@@ -10,17 +10,19 @@ LABEL maintainer="philippe.saade@wikimedia.de"
WORKDIR /app

# Copy the requirements file into the container
COPY ./docker/1_Data_Processing_save_ids/requirements.txt requirements.txt
COPY ./docker/1_Data_Processing_save_labels_descriptions/requirements.txt requirements.txt

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY ./docker/1_Data_Processing_save_ids /app
COPY ./docker/1_Data_Processing_save_labels_descriptions /app
COPY ./src /src

# Set up the volume for the data folder
VOLUME [ "/data" ]

ENV PYTHONPATH="${PYTHONPATH}:/"

# Run the Python script
CMD ["python", "run.py"]
CMD ["python", "run.py"]
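The `ENV PYTHONPATH="${PYTHONPATH}:/"` line added across the Dockerfiles is what lets the copied scripts write `from src.wikidataItemDB import ...` without `sys.path` hacks: `/src` sits at the container root, so putting `/` on the path makes `src` importable as a package. A self-contained sketch of the same mechanism — the temporary directory and the `ANSWER` constant are invented for illustration:

```python
import os
import sys
import tempfile

# Build a throwaway package layout: <root>/src/__init__.py, analogous to
# the container's /src directory.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "src")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("ANSWER = 42\n")

# Equivalent to appending <root> to PYTHONPATH before the interpreter starts.
sys.path.insert(0, root)

import src  # now resolves against the throwaway directory

print(src.ANSWER)
```

Setting the variable in the Dockerfile rather than in each script keeps the import fix in one place per image.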
69 changes: 69 additions & 0 deletions docker/1_Data_Processing_save_labels_descriptions/run.py
@@ -0,0 +1,69 @@
from multiprocessing import Manager
import os
import time
import json

from src.wikidataDumpReader import WikidataDumpReader
from src.wikidataItemDB import WikidataItem

FILEPATH = os.getenv("FILEPATH", '../data/Wikidata/latest-all.json.bz2')
PUSH_SIZE = int(os.getenv("PUSH_SIZE", 20000))
QUEUE_SIZE = int(os.getenv("QUEUE_SIZE", 15000))
NUM_PROCESSES = int(os.getenv("NUM_PROCESSES", 4))
SKIPLINES = int(os.getenv("SKIPLINES", 0))
LANGUAGE = os.getenv("LANGUAGE", 'en')


def save_items_to_sqlite(item, data_batch, sqlitDBlock):
if item is not None:
labels = WikidataItem.clean_label_description(item['labels'])
descriptions = WikidataItem.clean_label_description(
item['descriptions']
)
labels = json.dumps(labels, separators=(',', ':'))
descriptions = json.dumps(descriptions, separators=(',', ':'))
in_wikipedia = WikidataItem.is_in_wikipedia(item)
data_batch.append({
'id': item['id'],
'labels': labels,
'descriptions': descriptions,
'in_wikipedia': in_wikipedia,
})

with sqlitDBlock:
if len(data_batch) > PUSH_SIZE:
worked = WikidataItem.add_bulk_items(list(
data_batch[:PUSH_SIZE]
))
if worked:
del data_batch[:PUSH_SIZE]


if __name__ == "__main__":
multiprocess_manager = Manager()
sqlitDBlock = multiprocess_manager.Lock()
data_batch = multiprocess_manager.list()

wikidata = WikidataDumpReader(
FILEPATH,
num_processes=NUM_PROCESSES,
queue_size=QUEUE_SIZE,
skiplines=SKIPLINES
)

wikidata.run(
lambda item: save_items_to_sqlite(
item,
data_batch,
sqlitDBlock
),
max_iterations=None,
verbose=True
)

while len(data_batch) > 0:
worked = WikidataItem.add_bulk_items(list(data_batch))
if worked:
del data_batch[:PUSH_SIZE]
else:
time.sleep(1)
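Both run scripts share the same producer/flush pattern: worker processes append to a `Manager`-backed list, whoever holds the lock flushes a `PUSH_SIZE` chunk, and a final drain loop pushes the remainder after the dump reader finishes. A single-process sketch of that pattern — `FakeDB` is an invented stand-in for `WikidataItem.add_bulk_items`:

```python
import threading

PUSH_SIZE = 3


class FakeDB:
    """Invented stand-in for WikidataItem.add_bulk_items; collects rows."""
    rows = []

    @classmethod
    def add_bulk_items(cls, items):
        cls.rows.extend(items)
        return True  # the real insert reports success so the batch can shrink


data_batch = []
lock = threading.Lock()


def save_item(item):
    if item is None:
        return
    data_batch.append(item)
    # Flush a fixed-size chunk under the lock once the buffer is large enough.
    with lock:
        if len(data_batch) > PUSH_SIZE:
            if FakeDB.add_bulk_items(data_batch[:PUSH_SIZE]):
                del data_batch[:PUSH_SIZE]


for i in range(10):
    save_item({"id": f"Q{i}"})

# Final drain, mirroring the `while len(data_batch) > 0` loop in run.py.
while data_batch:
    if FakeDB.add_bulk_items(list(data_batch)):
        data_batch.clear()

print(len(FakeDB.rows))  # all 10 items eventually reach the store
```

Flushing fixed-size slices (rather than the whole list) keeps each insert bounded even when producers outpace the database.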
41 changes: 0 additions & 41 deletions docker/2_Data_Processing_save_entities/run.py

This file was deleted.

@@ -22,5 +22,7 @@ COPY ./src /src
# Set up the volume for the data folder
VOLUME [ "/data" ]

ENV PYTHONPATH="${PYTHONPATH}:/"

# Run the Python script
CMD ["python", "run.py"]
CMD ["python", "run.py"]
73 changes: 73 additions & 0 deletions docker/2_Data_Processing_save_items_per_lang/run.py
@@ -0,0 +1,73 @@
from multiprocessing import Manager
import os
import time

from src.wikidataDumpReader import WikidataDumpReader
from src.wikidataLangDB import WikidataLang

FILEPATH = os.getenv("FILEPATH", '../data/Wikidata/latest-all.json.bz2')
PUSH_SIZE = int(os.getenv("PUSH_SIZE", 2000))
QUEUE_SIZE = int(os.getenv("QUEUE_SIZE", 1500))
NUM_PROCESSES = int(os.getenv("NUM_PROCESSES", 8))
SKIPLINES = int(os.getenv("SKIPLINES", 0))
LANGUAGE = os.getenv("LANGUAGE", 'en')


def save_entities_to_sqlite(item, data_batch, sqliteDBlock):
"""Filter, normalise, and queue a Wikidata entity for bulk insertion.

Entities that are missing, or that have no sitelink to the configured
language's Wikipedia, are skipped. Once the shared batch grows beyond
PUSH_SIZE, a chunk is flushed to SQLite while holding the lock.

Args:
item (dict | None): Raw entity parsed from the Wikidata dump.
data_batch (ListProxy): Shared buffer of normalised entities.
sqliteDBlock (Lock): Inter-process lock guarding the bulk insert.
"""
if item is None:
# Skip entries that failed to parse from the dump
return

lang_in_wp = WikidataLang.is_in_wikipedia(item, language=LANGUAGE)
if not lang_in_wp:
# If the entity is not in the specified language Wikipedia, skip
return

item = WikidataLang.normalise_item(item, language=LANGUAGE)
data_batch.append(item)

with sqliteDBlock:
if len(data_batch) > PUSH_SIZE:
worked = WikidataLang.add_bulk_entities(list(
data_batch[:PUSH_SIZE]
))
if worked:
del data_batch[:PUSH_SIZE]


if __name__ == "__main__":
multiprocess_manager = Manager()
sqliteDBlock = multiprocess_manager.Lock()
data_batch = multiprocess_manager.list()

wikidata = WikidataDumpReader(
FILEPATH,
num_processes=NUM_PROCESSES,
queue_size=QUEUE_SIZE,
skiplines=SKIPLINES
)

wikidata.run(
lambda item: save_entities_to_sqlite(
item,
data_batch,
sqliteDBlock
),
max_iterations=None,
verbose=True
)

while len(data_batch) > 0:
worked = WikidataLang.add_bulk_entities(list(data_batch))
if worked:
del data_batch[:PUSH_SIZE]
else:
time.sleep(1)
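`WikidataLang.is_in_wikipedia(item, language=...)` presumably checks the entity's sitelinks for a `{language}wiki` entry — that is how a Wikidata JSON record encodes a link to a given language's Wikipedia. A hypothetical reimplementation, with an invented fixture shaped like a dump record:

```python
def is_in_wikipedia(item, language="en"):
    """Hypothetical stand-in for WikidataLang.is_in_wikipedia: an entity is
    on a language's Wikipedia when its sitelinks hold the '<lang>wiki' key."""
    return f"{language}wiki" in item.get("sitelinks", {})


# Invented fixture shaped like a Wikidata dump record.
douglas_adams = {
    "id": "Q42",
    "sitelinks": {"enwiki": {"title": "Douglas Adams"}},
}

print(is_in_wikipedia(douglas_adams, "en"))  # True
print(is_in_wikipedia(douglas_adams, "de"))  # False
```

Filtering on sitelinks early keeps the per-language SQLite database to the subset of entities that actually have an article in that language.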
4 changes: 3 additions & 1 deletion docker/3_Add_Wikidata_to_AstraDB/Dockerfile
@@ -28,5 +28,7 @@ COPY ./API_tokens /API_tokens
# Set up the volume for the data folder
VOLUME [ "/data" ]

ENV PYTHONPATH="${PYTHONPATH}:/"

# Run the Python script
CMD ["python", "run.py"]
CMD ["python", "run.py"]
3 changes: 1 addition & 2 deletions docker/3_Add_Wikidata_to_AstraDB/requirements.txt
@@ -17,5 +17,4 @@ langchain_experimental
ragstack-ai-langchain[knowledge-store]==1.3.0
langchain-astradb
astrapy
elasticsearch
mediawikiapi
elasticsearch