Collaborator

@CHLK CHLK commented Jan 9, 2026

Summary

This PR extends the RAGFlow Python SDK with batch processing tools for document management: a batch uploader and a tool for re-parsing failed documents.

Key Features

  • Batch Uploader: New BatchUploader class for efficient bulk document uploads with progress tracking and error handling
  • Document Extractor: Utility for extracting documents from various sources (files, directories, URLs)
  • Field Mapper: Flexible field mapping system for document metadata transformation
  • File Reader: Support for reading multiple file formats (PDF, DOCX, TXT, etc.)
  • Reparse Tool: Tool for identifying and re-parsing failed documents in datasets
  • Comprehensive Documentation: Extensive README with usage examples and best practices
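
For readers unfamiliar with the pattern, here is a minimal sketch of batching with progress tracking and error handling. It is not the PR's `BatchUploader` API, which is not shown in this conversation; `chunked`, `upload_in_batches`, and the `upload_batch` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def upload_in_batches(
    docs: Iterable[T],
    upload_batch: Callable[[List[T]], None],  # hypothetical upload callable
    batch_size: int = 2,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Upload docs batch by batch; record failed batches instead of aborting."""
    uploaded = 0
    failures: List[Tuple[int, str]] = []
    for index, batch in enumerate(chunked(docs, batch_size)):
        try:
            upload_batch(batch)
            uploaded += len(batch)
        except Exception as exc:
            # Keep going so one bad batch does not stop the whole run.
            failures.append((index, str(exc)))
        print(f"batch {index}: {uploaded} uploaded, {len(failures)} failed batch(es)")
    return uploaded, failures
```

Collecting failures per batch, rather than raising on the first error, is one way to let a long upload finish and be retried selectively afterwards.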

Changes

  • Added ragflow_sdk/tools/batch_uploader.py - Batch upload functionality
  • Added ragflow_sdk/tools/reparse_failed_documents.py - Document re-parsing tool
  • Added ragflow_sdk/tools/document_extractor.py - Document extraction utilities
  • Added ragflow_sdk/tools/field_mapper.py - Field mapping system
  • Added ragflow_sdk/tools/file_reader.py - Multi-format file reader
  • Added ragflow_sdk/tools/models.py - Data models for tools
  • Updated ragflow_sdk/modules/dataset.py with batch operation support
  • Added comprehensive test suites for all new tools
  • Updated SDK documentation with examples

Solution Description

@CHLK CHLK self-assigned this Jan 9, 2026
@CHLK CHLK added the enhancement New feature or request label Jan 9, 2026
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


keyang.lk does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive batch processing tools to the RAGFlow Python SDK, including a batch uploader for efficient bulk document uploads and a tool for re-parsing failed documents. The implementation includes extensive test coverage, documentation, and example scripts.

Key Changes:

  • New tools module with BatchUploader, DocumentExtractor, FieldMapper, FileReader, and FailedDocumentReparser
  • Batch document upload API endpoint with metadata support
  • Optimized document parsing with batch processing
  • Comprehensive test suites with unit tests
  • Enhanced error handling and response parsing in SDK core

Reviewed changes

Copilot reviewed 22 out of 24 changed files in this pull request and generated 30 comments.

| File | Description |
| --- | --- |
| sdk/python/ragflow_sdk/tools/batch_uploader.py | Implements batch upload with snapshot-based resume support and retry logic |
| sdk/python/ragflow_sdk/tools/reparse_failed_documents.py | Tool for identifying and re-parsing failed documents with pagination |
| sdk/python/ragflow_sdk/tools/document_extractor.py | Iterator-based document extraction from various file formats |
| sdk/python/ragflow_sdk/tools/field_mapper.py | Flexible field mapping with auto-detection capabilities |
| sdk/python/ragflow_sdk/tools/file_reader.py | Multi-format file reader supporting JSON, CSV, Excel, etc. |
| sdk/python/ragflow_sdk/tools/models.py | Data models for tools including Snapshot, FileCursor, Document |
| sdk/python/test/test_tools/ | Comprehensive unit tests for all new tools |
| api/apps/sdk/doc.py | New batch upload API endpoint and optimized document parsing |
| sdk/python/ragflow_sdk/ragflow.py | Enhanced response parsing with proper error handling |
| sdk/python/ragflow_sdk/modules/dataset.py | New upload_documents_with_meta method with batch support |
| sdk/python/pyproject.toml | Added test dependencies (pandas, pytest-cov, etc.) |
| sdk/python/examples/ | Example scripts for batch upload and document reparsing |
| web/.env | PORT changed from 9222 to 9223 |
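
To make "snapshot-based resume" concrete, the sketch below shows one way such a mechanism could work: the index of the next record to upload is persisted after every success, so a rerun picks up where the previous run stopped. This is an assumption-laden illustration, not the PR's implementation; the fields of its `Snapshot` and `FileCursor` models are not shown here, and `load_offset`, `save_offset`, `resumable_upload`, and the `next_index` key are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_offset(snapshot_file: str) -> int:
    """Return the index of the next record to upload (0 if no snapshot exists)."""
    if not os.path.exists(snapshot_file):
        return 0
    with open(snapshot_file, "r", encoding="utf-8") as fh:
        return int(json.load(fh).get("next_index", 0))


def save_offset(snapshot_file: str, next_index: int) -> None:
    """Persist progress by writing a temp file and renaming it over the snapshot."""
    directory = os.path.dirname(os.path.abspath(snapshot_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp_path, snapshot_file)


def resumable_upload(
    records: List[str],
    upload_one: Callable[[str], None],  # hypothetical single-record uploader
    snapshot_file: str,
) -> int:
    """Upload records starting at the saved offset; return how many were attempted."""
    start = load_offset(snapshot_file)
    for index in range(start, len(records)):
        upload_one(records[index])
        # Only advance the snapshot once the record has been uploaded.
        save_offset(snapshot_file, index + 1)
    return len(records) - start
```

If `upload_one` raises partway through, the snapshot still points at the failed record, so calling `resumable_upload` again retries from there without re-sending earlier records.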

snapshot_file=snapshot_file,
file_extension=file_extension
):
current_file_path = file_path
Copilot AI Jan 9, 2026

Variable current_file_path is not used.

Suggested change
- current_file_path = file_path
"""Disable beartype by monkey patching beartype_this_package to do nothing."""
try:
import beartype.claw
original_beartype_this_package = beartype.claw.beartype_this_package
Copilot AI Jan 9, 2026

Variable original_beartype_this_package is not used.

Suggested change
- original_beartype_this_package = beartype.claw.beartype_this_package
return res.json()
except Exception as e:
error_url = url or res.url if hasattr(res, 'url') else 'unknown'
raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
Copilot AI Jan 9, 2026

Variable error_url is not used.

Suggested change
- raise Exception(f"Failed to parse JSON response (status {res.status_code}): {str(e)}. Response text: {res.text[:500]}")
+ raise Exception(
+     f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
+     f"{str(e)}. Response text: {res.text[:500]}"
+ )
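
As a self-contained illustration of the suggestion above, the sketch below wraps JSON parsing in a helper that includes the URL in the error message. `parse_json_response` and `FakeResponse` are hypothetical stand-ins, not the SDK's code. One detail worth noting: Python parses `url or res.url if hasattr(res, 'url') else 'unknown'` as `(url or res.url) if hasattr(res, 'url') else 'unknown'`, so an explicitly passed `url` would be ignored whenever `res` has no `url` attribute; the sketch uses `getattr` with a default to sidestep that.

```python
from typing import Any, Optional


def parse_json_response(res: Any, url: Optional[str] = None) -> Any:
    """Return res.json(), or raise with status, URL, and a body excerpt on failure."""
    try:
        return res.json()
    except Exception as e:
        # Prefer the explicit URL, then the response's own URL, then a placeholder.
        error_url = url or getattr(res, "url", None) or "unknown"
        raise Exception(
            f"Failed to parse JSON response (status {res.status_code}, URL: {error_url}): "
            f"{e}. Response text: {res.text[:500]}"
        ) from e


class FakeResponse:
    """Stand-in for an HTTP response whose body is not valid JSON."""

    status_code = 502
    text = "<html>Bad Gateway</html>"
    url = "http://example.invalid/api"

    def json(self) -> Any:
        raise ValueError("not JSON")
```

Calling `parse_json_response(FakeResponse())` raises an exception whose message names the status code and URL, which is the information the review comment asks to surface.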
# Create test file
filename = f"test.{extension}"
create_method = getattr(self, create_func)
filepath = create_method(temp_dir, filename, test_data)
Copilot AI Jan 9, 2026

Variable filepath is not used.

Suggested change
- filepath = create_method(temp_dir, filename, test_data)
+ _ = create_method(temp_dir, filename, test_data)
Comment on lines +293 to +295
filepath = create_method(temp_dir, filename, test_data)
else: # xlsx or xls
filepath = self._create_excel_file(temp_dir, filename, test_data)
Copilot AI Jan 9, 2026

Variable filepath is not used.

Suggested change
-     filepath = create_method(temp_dir, filename, test_data)
- else: # xlsx or xls
-     filepath = self._create_excel_file(temp_dir, filename, test_data)
+     create_method(temp_dir, filename, test_data)
+ else: # xlsx or xls
+     self._create_excel_file(temp_dir, filename, test_data)
import os
import tempfile
import pytest
from unittest.mock import Mock, MagicMock, patch, call
Copilot AI Jan 9, 2026

Import of 'MagicMock' is not used.
Import of 'patch' is not used.
Import of 'call' is not used.

Suggested change
- from unittest.mock import Mock, MagicMock, patch, call
+ from unittest.mock import Mock
import tempfile
import pytest
from unittest.mock import Mock, MagicMock, patch, call
from pathlib import Path
Copilot AI Jan 9, 2026

Import of 'Path' is not used.

Suggested change
- from pathlib import Path
from pathlib import Path

from ragflow_sdk.tools import BatchUploader, DocumentExtractor, FieldMapper
from ragflow_sdk.tools.models import Snapshot, FileCursor
Copilot AI Jan 9, 2026

Import of 'Snapshot' is not used.

Suggested change
- from ragflow_sdk.tools.models import Snapshot, FileCursor
+ from ragflow_sdk.tools.models import FileCursor
#

import pytest
from unittest.mock import Mock, MagicMock, patch, call
Copilot AI Jan 9, 2026

Import of 'MagicMock' is not used.
Import of 'patch' is not used.

Suggested change
- from unittest.mock import Mock, MagicMock, patch, call
+ from unittest.mock import Mock

beartype.claw.beartype_this_package = noop_beartype_this_package
os.environ['BEARTYPE_DISABLE'] = '1'
except ImportError:
Copilot AI Jan 9, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except ImportError:
+ except ImportError:
+     # beartype is an optional dependency in tests; if it's not installed, just skip the monkey-patch.