Commit 68050a9

Merge pull request #24 from TogetherCrew/fix/mediawiki-activities-wrong-arg
feat: trying to fix the no doc_ref_id error on loading documents!
2 parents d3fe5b9 + 9f2e06a commit 68050a9

2 files changed: 5 additions and 2 deletions

hivemind_etl/mediawiki/etl.py

Lines changed: 4 additions & 1 deletion
@@ -42,9 +42,11 @@ def transform(self) -> list[Document]:
         documents: list[Document] = []
         for page in pages:
             try:
+                # Generate a ref_doc_id if needed for newer llama-index versions
+                doc_id = page.page_id
                 documents.append(
                     Document(
-                        doc_id=page.page_id,
+                        doc_id=doc_id,
                         text=page.revision.text,
                         metadata={
                             "title": page.title,
@@ -57,6 +59,7 @@ def transform(self) -> list[Document]:
                             "contributor_user_id": page.revision.contributor.user_id,
                             "sha1": page.revision.sha1,
                             "model": page.revision.model,
+                            "ref_doc_id": doc_id,  # Add ref_doc_id to metadata
                         },
                         excluded_embed_metadata_keys=[
                             "namespace",
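
The effect of this hunk can be sketched without llama-index installed. The example below is a minimal illustration only: `Page`, `Revision`, and `Document` are hypothetical stand-in dataclasses, not the project's or llama-index's real classes, and the metadata is trimmed to a few of the keys visible in the diff. It shows the page id being stored once in `doc_id` and mirrored under the `ref_doc_id` metadata key.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    """Hypothetical stand-in for a parsed MediaWiki revision."""
    text: str
    sha1: str
    model: str = "wikitext"


@dataclass
class Page:
    """Hypothetical stand-in for a parsed MediaWiki page."""
    page_id: str
    title: str
    revision: Revision


@dataclass
class Document:
    """Simplified stand-in, NOT the llama-index Document class."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def transform(pages: list[Page]) -> list[Document]:
    """Build one Document per page, mirroring the id into metadata."""
    documents: list[Document] = []
    for page in pages:
        # Reuse the page id as the document id, as the commit does.
        doc_id = page.page_id
        documents.append(
            Document(
                doc_id=doc_id,
                text=page.revision.text,
                metadata={
                    "title": page.title,
                    "sha1": page.revision.sha1,
                    "model": page.revision.model,
                    # Mirror the id so it is also readable from metadata.
                    "ref_doc_id": doc_id,
                },
            )
        )
    return documents


pages = [Page("101", "Main Page", Revision("Hello wiki", "abc123"))]
docs = transform(pages)
print(docs[0].doc_id, docs[0].metadata["ref_doc_id"])
```

Whether the extra metadata key resolves the missing-id error depends on how the downstream loader looks the id up, which this diff does not show; the commit title itself describes the change as an attempt.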

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 python-dotenv>=1.0.0, <2.0.0
-tc-hivemind-backend==1.4.0
+tc-hivemind-backend==1.4.2.post2
 llama-index-storage-docstore-redis==0.1.2
 llama-index-storage-docstore-mongodb==0.1.3
 crawlee[playwright]==0.3.8
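
After a pin bump like this one, it can help to confirm that the environment actually picked up the new version. The snippet below is a small sketch using only the standard library; the helper name `installed_matches_pin` is made up for illustration, and the comparison is a plain string match that assumes the pin is written in normalized form (as `1.4.2.post2` is).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_matches_pin(package: str, pinned: str) -> bool:
    """Return True only if `package` is installed at exactly `pinned`."""
    try:
        return version(package) == pinned
    except PackageNotFoundError:
        # Not installed at all counts as a mismatch.
        return False


# Prints True only in an environment installed from the updated requirements.txt.
print(installed_matches_pin("tc-hivemind-backend", "1.4.2.post2"))
```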
