feat: Add batch uploader and document re-parsing tools to Python SDK #38
base: main
# Conflicts:
#   .gitignore
#   api/apps/sdk/doc.py
#   sdk/python/ragflow_sdk/modules/dataset.py
…ailed_documents.py
keyang.lk seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Pull request overview
This PR adds comprehensive batch processing tools to the RAGFlow Python SDK, including a batch uploader for efficient bulk document uploads and a tool for re-parsing failed documents. The implementation includes extensive test coverage, documentation, and example scripts.
Key Changes:
- New tools module with BatchUploader, DocumentExtractor, FieldMapper, FileReader, and FailedDocumentReparser
- Batch document upload API endpoint with metadata support
- Optimized document parsing with batch processing
- Comprehensive test suites with unit tests
- Enhanced error handling and response parsing in SDK core
Reviewed changes
Copilot reviewed 22 out of 24 changed files in this pull request and generated 30 comments.
| File | Description |
|---|---|
| sdk/python/ragflow_sdk/tools/batch_uploader.py | Implements batch upload with snapshot-based resume support and retry logic |
| sdk/python/ragflow_sdk/tools/reparse_failed_documents.py | Tool for identifying and re-parsing failed documents with pagination |
| sdk/python/ragflow_sdk/tools/document_extractor.py | Iterator-based document extraction from various file formats |
| sdk/python/ragflow_sdk/tools/field_mapper.py | Flexible field mapping with auto-detection capabilities |
| sdk/python/ragflow_sdk/tools/file_reader.py | Multi-format file reader supporting JSON, CSV, Excel, etc. |
| sdk/python/ragflow_sdk/tools/models.py | Data models for tools including Snapshot, FileCursor, Document |
| sdk/python/test/test_tools/ | Comprehensive unit tests for all new tools |
| api/apps/sdk/doc.py | New batch upload API endpoint and optimized document parsing |
| sdk/python/ragflow_sdk/ragflow.py | Enhanced response parsing with proper error handling |
| sdk/python/ragflow_sdk/modules/dataset.py | New upload_documents_with_meta method with batch support |
| sdk/python/pyproject.toml | Added test dependencies (pandas, pytest-cov, etc.) |
| sdk/python/examples/ | Example scripts for batch upload and document reparsing |
| web/.env | PORT changed from 9222 to 9223 |
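For readers unfamiliar with the "snapshot-based resume support and retry logic" mentioned for batch_uploader.py, the following is a minimal sketch of the general technique, not the PR's implementation. Every name here (`load_cursor`, `save_cursor`, `upload_in_batches`, the `next_batch` key, and the `upload_batch` callback) is hypothetical, and the snapshot layout is an assumption rather than the format used by the SDK's Snapshot or FileCursor models.

```python
import json
import os
import tempfile
import time


def load_cursor(snapshot_path):
    """Return the index of the next batch to upload (0 when no snapshot exists)."""
    if not os.path.exists(snapshot_path):
        return 0
    with open(snapshot_path, "r", encoding="utf-8") as fh:
        return json.load(fh)["next_batch"]


def save_cursor(snapshot_path, next_batch):
    """Write the snapshot to a temp file, then swap it in with os.replace."""
    directory = os.path.dirname(os.path.abspath(snapshot_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_batch": next_batch}, fh)
    os.replace(tmp_path, snapshot_path)


def upload_in_batches(documents, upload_batch, snapshot_path,
                      batch_size=2, max_retries=3, backoff_seconds=0.0):
    """Upload documents in batches, resuming from the snapshot if one exists."""
    batches = [documents[i:i + batch_size]
               for i in range(0, len(documents), batch_size)]
    for index in range(load_cursor(snapshot_path), len(batches)):
        for attempt in range(1, max_retries + 1):
            try:
                upload_batch(batches[index])  # hypothetical upload callback
                break
            except Exception:
                if attempt == max_retries:
                    raise  # the snapshot still points at this batch
                time.sleep(backoff_seconds * attempt)  # linear backoff
        # Record progress only after the batch has been accepted.
        save_cursor(snapshot_path, index + 1)
    return len(batches)
```

Because the cursor advances only after a batch succeeds, rerunning the same call after a failure skips the batches already uploaded and retries the one that failed.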
        snapshot_file=snapshot_file,
        file_extension=file_extension
    ):
        current_file_path = file_path
Copilot AI, Jan 9, 2026
Variable current_file_path is not used.
Suggested change:
-         current_file_path = file_path
"""Disable beartype by monkey patching beartype_this_package to do nothing."""
try:
    import beartype.claw
    original_beartype_this_package = beartype.claw.beartype_this_package
Copilot AI, Jan 9, 2026
Variable original_beartype_this_package is not used.
Suggested change:
-     original_beartype_this_package = beartype.claw.beartype_this_package
    return res.json()
except Exception as e:
    error_url = url or res.url if hasattr(res, 'url') else 'unknown'
    raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
Copilot AI, Jan 9, 2026
Variable error_url is not used.
Suggested change:
-     raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
+     raise Exception(
+         f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
+         f"{str(e)}. Response text: {res.text[:500]}"
+     )
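One detail worth noting when adopting the suggestion above: in Python, `url or res.url if hasattr(res, 'url') else 'unknown'` parses as `(url or res.url) if hasattr(res, 'url') else 'unknown'`, so an explicitly passed `url` is discarded whenever the response object has no `url` attribute. A possible way to write the fallback chain is sketched below. `ResponseParseError`, `parse_json_response`, and `FakeResponse` are hypothetical names used only for illustration, and the sketch assumes `res.json()` raises a `ValueError` subclass on malformed input, as `json.JSONDecodeError` does.

```python
import json


class ResponseParseError(Exception):
    """Hypothetical error type; the quoted code raises a bare Exception."""


def parse_json_response(res, url=None):
    """Return res.json(), or raise an error naming the status, URL and body."""
    try:
        return res.json()
    except ValueError as exc:
        # Explicit fallback chain: the url argument, then res.url, then 'unknown'.
        error_url = url or getattr(res, "url", None) or "unknown"
        raise ResponseParseError(
            f"Failed to parse JSON response (status {res.status_code}, "
            f"URL: {error_url}): {exc}. Response text: {res.text[:500]}"
        ) from exc


class FakeResponse:
    """Stand-in for an HTTP response object, used only to exercise the helper."""
    status_code = 502
    text = "<html>Bad Gateway</html>"

    def json(self):
        return json.loads(self.text)
```

Calling `parse_json_response(FakeResponse(), url="http://example.invalid/api")` raises `ResponseParseError` with the explicit URL in the message, even though `FakeResponse` has no `url` attribute.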
# Create test file
filename = f"test.{extension}"
create_method = getattr(self, create_func)
filepath = create_method(temp_dir, filename, test_data)
Copilot AI, Jan 9, 2026
Variable filepath is not used.
Suggested change:
- filepath = create_method(temp_dir, filename, test_data)
+ _ = create_method(temp_dir, filename, test_data)
    filepath = create_method(temp_dir, filename, test_data)
else:  # xlsx or xls
    filepath = self._create_excel_file(temp_dir, filename, test_data)
Copilot AI, Jan 9, 2026
Variable filepath is not used.
Suggested change:
-     filepath = create_method(temp_dir, filename, test_data)
- else:  # xlsx or xls
-     filepath = self._create_excel_file(temp_dir, filename, test_data)
+     create_method(temp_dir, filename, test_data)
+ else:  # xlsx or xls
+     self._create_excel_file(temp_dir, filename, test_data)
import os
import tempfile
import pytest
from unittest.mock import Mock, MagicMock, patch, call
Copilot AI, Jan 9, 2026
Import of 'MagicMock' is not used.
Import of 'patch' is not used.
Import of 'call' is not used.
Suggested change:
- from unittest.mock import Mock, MagicMock, patch, call
+ from unittest.mock import Mock
import tempfile
import pytest
from unittest.mock import Mock, MagicMock, patch, call
from pathlib import Path
Copilot AI, Jan 9, 2026
Import of 'Path' is not used.
Suggested change:
- from pathlib import Path
from pathlib import Path

from ragflow_sdk.tools import BatchUploader, DocumentExtractor, FieldMapper
from ragflow_sdk.tools.models import Snapshot, FileCursor
Copilot AI, Jan 9, 2026
Import of 'Snapshot' is not used.
Suggested change:
- from ragflow_sdk.tools.models import Snapshot, FileCursor
+ from ragflow_sdk.tools.models import FileCursor
#

import pytest
from unittest.mock import Mock, MagicMock, patch, call
Copilot AI, Jan 9, 2026
Import of 'MagicMock' is not used.
Import of 'patch' is not used.
Suggested change:
- from unittest.mock import Mock, MagicMock, patch, call
+ from unittest.mock import Mock
    beartype.claw.beartype_this_package = noop_beartype_this_package
    os.environ['BEARTYPE_DISABLE'] = '1'
except ImportError:
Copilot AI, Jan 9, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
- except ImportError:
+ except ImportError:
+     # beartype is an optional dependency in tests; if it's not installed, just skip the monkey-patch.
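The two beartype comments (the unused `original_beartype_this_package` and the silent `except ImportError`) could also be addressed together by a small helper that documents the skip and hands back the original attribute so a test can restore it. The sketch below is one possible shape under that assumption; `patch_optional` is a hypothetical name and is not part of the PR or of beartype.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def patch_optional(module_name, attribute, replacement):
    """Replace module.attribute when the module is importable.

    Returns the original attribute so the caller can restore it later,
    or None when the optional module is not installed.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The dependency is optional; with nothing to patch, skip quietly.
        logger.debug("%s is not installed; skipping patch", module_name)
        return None
    original = getattr(module, attribute)
    setattr(module, attribute, replacement)
    return original
```

A test fixture could call `patch_optional("beartype.claw", "beartype_this_package", lambda *args, **kwargs: None)` and, if the return value is not None, set the attribute back to it during teardown.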
Summary
This PR extends the RAGFlow Python SDK with powerful batch processing tools for document management, including a batch uploader and a tool for re-parsing failed documents.
Key Features
- BatchUploader class for efficient bulk document uploads with progress tracking and error handling
Changes
- ragflow_sdk/tools/batch_uploader.py - Batch upload functionality
- ragflow_sdk/tools/reparse_failed_documents.py - Document re-parsing tool
- ragflow_sdk/tools/document_extractor.py - Document extraction utilities
- ragflow_sdk/tools/field_mapper.py - Field mapping system
- ragflow_sdk/tools/file_reader.py - Multi-format file reader
- ragflow_sdk/tools/models.py - Data models for tools
- ragflow_sdk/modules/dataset.py with batch operation support
Solution Description