PDF to Markdown Conversion for Knowledge Graph Construction

This directory provides tools for converting PDF documents to Markdown format, which can then be processed into JSON format suitable for knowledge graph construction with AutoSchemaKG.

Creator: swgj

Overview

The PDF-to-Markdown transformation pipeline enables you to:

- Convert PDF files to clean Markdown format
- Extract images and generate descriptions (optional)
- Convert Markdown to JSON format for KG construction
- Process single files or entire directories

Prerequisites

Due to the version requirements of marker-pdf, we recommend creating a separate conda environment for PDF-to-Markdown transformation.

Installation

1. Clone the PDF Transform Repository

```
git clone https://github.com/Swgj/pdf_process
cd pdf_process
```

2. Create Conda Environment

```
conda create --name pdf-marker pip python=3.10
conda activate pdf-marker
```

3. Install Dependencies

```
pip install 'marker-pdf[full]'
pip install google-genai
```

Configuration

Edit the `config.yaml` File

The configuration file controls all aspects of the PDF processing pipeline:

```
processing_config:
  llm_service: "marker.services.azure_openai.AzureOpenAIService" # Azure OpenAI Service
  # To use default Gemini server, comment out the line above

  other_config:
    use_llm: true
    extract_images: false  # false: use LLM for text description; true: extract images without descriptions
    page_range: null  # null: process all pages, or use List[int] format like [9, 10, 11, 12]
    max_concurrency: 2 # maximum number of concurrent processes

    # Azure OpenAI API configuration
    azure_endpoint: <your endpoint>
    azure_api_version: "2024-10-21"
    deployment_name: "gpt-4o"

# API configuration
api:
  # api_key_env: "GEMINI_API_KEY"  # Uncomment for Gemini API key
  api_key_env: "AZURE_API_KEY"      # Use for Azure OpenAI

# Input path configuration - can be a file or folder path
input:
  # Supports relative and absolute paths
  path: "test_data"  # Can be a single file path or folder path
  # path: "data/Apple_Environmental_Progress_Report_2024.pdf"  # Example of a single file

  # If it's a folder, you can set file filtering conditions
  file_filters:
    extensions: [".pdf"]  # Only process PDF files
    recursive: true       # Whether to recursively process subfolders
    exclude_patterns:     # Exclude files that match these patterns
      - "*temp*"
      - "*~*"

# Output configuration
output:
  base_dir: "md_output"     # Output directory
  create_subdirs: true      # Whether to create a subdirectory for each input file
  format: "md"              # Output format (md, txt)

# Logging configuration
logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR
  show_progress: true
```
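
Before running the pipeline, it can help to sanity-check the values in the configuration. The following is a minimal sketch, not part of pdf_process: `validate_config` is a hypothetical helper that takes the already-parsed configuration as a dictionary (for example, the result of loading `config.yaml` with a YAML parser) and checks only the constraints stated in the comments above. The fallback defaults used when a key is missing are assumptions.

```python
VALID_FORMATS = ("md", "txt")
VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def validate_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration; the real tool may validate differently.
    """
    problems = []
    other = cfg.get("processing_config", {}).get("other_config", {})

    # page_range: null (None) means all pages, otherwise a list of integers
    page_range = other.get("page_range")
    if page_range is not None:
        if not isinstance(page_range, list) or not all(
            isinstance(p, int) for p in page_range
        ):
            problems.append("page_range must be null or a list of integers")

    # max_concurrency: a positive integer (default of 1 is an assumption)
    max_concurrency = other.get("max_concurrency", 1)
    if not isinstance(max_concurrency, int) or max_concurrency < 1:
        problems.append("max_concurrency must be a positive integer")

    # output.format: md or txt (default of "md" is an assumption)
    if cfg.get("output", {}).get("format", "md") not in VALID_FORMATS:
        problems.append("output.format must be 'md' or 'txt'")

    # logging.level: one of the four documented levels
    if cfg.get("logging", {}).get("level", "INFO") not in VALID_LEVELS:
        problems.append("logging.level must be DEBUG, INFO, WARNING, or ERROR")

    return problems
```

Calling `validate_config(cfg)` on the example configuration above should return an empty list; each violated constraint adds one message.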

Key Configuration Options

LLM Service Options

- Azure OpenAI: Set `llm_service: "marker.services.azure_openai.AzureOpenAIService"`
- Gemini: Comment out the `llm_service` line to use default Gemini
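
The sketch below illustrates how these two settings could be interpreted: a missing `llm_service` means the default Gemini service, and `api_key_env` holds the name of an environment variable rather than the key itself. Both `llm_backend` and `resolve_api_key` are hypothetical helpers written for this example; the actual pdf_process code may resolve these values differently.

```python
import os


def llm_backend(config: dict) -> str:
    """Name the backend implied by the config (hypothetical helper)."""
    service = config.get("processing_config", {}).get("llm_service")
    if service is None:
        # llm_service commented out: fall back to the default Gemini service
        return "gemini"
    if service.endswith("AzureOpenAIService"):
        return "azure_openai"
    return service


def resolve_api_key(config: dict, environ=None) -> str:
    """Read the API key from the environment variable named in the config."""
    environ = os.environ if environ is None else environ
    env_name = config["api"]["api_key_env"]  # e.g. "AZURE_API_KEY"
    key = environ.get(env_name)
    if not key:
        raise KeyError(f"environment variable {env_name!r} is not set")
    return key
```

Keeping only the variable name in `config.yaml` means the file can be shared or committed without exposing the key.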

Processing Options

- `use_llm`: Enable LLM-based text extraction and enhancement
- `extract_images`:
  - `false`: Use LLM to generate text descriptions of images
  - `true`: Extract images as files without descriptions
- `page_range`: Specify pages to process (e.g., `[9, 10, 11, 12]`) or `null` for all pages
- `max_concurrency`: Number of concurrent processes (adjust based on your system)
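
To make the effect of `max_concurrency` concrete, here is a minimal sketch of bounded parallel conversion. It is an illustration only: `convert_all` and its `convert_one` callback are hypothetical names, and the example uses a thread pool for simplicity, whereas the configuration comment speaks of concurrent processes.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(pdf_paths, convert_one, max_concurrency=2):
    """Apply convert_one to every path using at most max_concurrency workers.

    Hypothetical helper; convert_one stands in for the real per-file conversion.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(convert_one, pdf_paths))
```

A higher value processes more files at once but also multiplies memory use and LLM API request rates, which is why the setting should be adjusted to your system.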

Input Options

- `path`: Single file or directory path
- `file_filters.extensions`: File types to process (default: `[".pdf"]`)
- `file_filters.recursive`: Process subdirectories
- `file_filters.exclude_patterns`: Patterns to exclude from processing
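
The input options above can be read as a simple file-selection rule. The following sketch shows one plausible interpretation using shell-style pattern matching on file names; `is_selected` and `find_inputs` are hypothetical helpers, and the real tool may match patterns against full paths or treat letter case differently.

```python
from fnmatch import fnmatch
from pathlib import Path


def is_selected(path, extensions=(".pdf",), exclude_patterns=()):
    """Decide whether one file passes the extension and exclusion filters."""
    path = Path(path)
    if path.suffix.lower() not in extensions:
        return False
    # Assumption: exclude patterns are matched against the file name only
    return not any(fnmatch(path.name, pattern) for pattern in exclude_patterns)


def find_inputs(path, extensions=(".pdf",), recursive=True, exclude_patterns=()):
    """Expand a file or directory path into a sorted list of files to process."""
    root = Path(path)
    if root.is_file():
        return [root]
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and is_selected(p, extensions, exclude_patterns)
    )
```

With the example patterns `"*temp*"` and `"*~*"`, a file named `report_temp.pdf` or `~report.pdf` would be skipped, while `report.pdf` would be processed.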

Output Options

- `base_dir`: Output directory for Markdown files
- `create_subdirs`: Create separate subdirectories for each input file
- `format`: Output format (`md` or `txt`)
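
As a rough illustration of how these three options combine, the sketch below computes where the output for a given PDF might be written. The exact layout is an assumption (a subdirectory named after the input file's stem), and `output_path` is a hypothetical helper, so check the actual `md_output` directory after a run.

```python
from pathlib import Path


def output_path(pdf_path, base_dir="md_output", create_subdirs=True, fmt="md"):
    """Return the assumed output location for one input PDF."""
    stem = Path(pdf_path).stem
    # Assumption: with create_subdirs, each input gets a folder named after it
    out_dir = Path(base_dir) / stem if create_subdirs else Path(base_dir)
    return out_dir / f"{stem}.{fmt}"
```

Under these assumptions, `data/report.pdf` maps to `md_output/report/report.md` with the default settings and to `md_output/report.txt` with `create_subdirs: false` and `format: "txt"`.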

Usage Workflow

Step 1: PDF to Markdown Conversion

1. Place your PDF files in the input directory specified in `config.yaml`.
2. Run the conversion script:
   ```
   bash run.sh
   ```
3. Output: You'll find Markdown files in the `md_output` directory (or your specified `base_dir`).

Step 2: Markdown to JSON Conversion

After obtaining Markdown files, convert them to JSON format for AutoSchemaKG:

```
# Return to the AutoSchemaKG parent directory
cd /path/to/AutoSchemaKG

# Convert Markdown to JSON
python -m atlas_rag.kg_construction.utils.md_processing.markdown_to_json \
    --input example_data/md_data \
    --output example_data
```

Parameters:

- `--input`: Path to the directory containing Markdown files
- `--output`: Path where JSON files will be saved
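
Conceptually, this step wraps each Markdown document in a JSON record. The sketch below shows the general idea only and is not the `markdown_to_json` module: the field names `id`, `text`, and `metadata`, the one-file-per-document output, and the helper names are all assumptions made for illustration. Inspect the files the real command produces for the actual schema.

```python
import json
from pathlib import Path


def markdown_to_record(doc_id, text, source):
    """Wrap one Markdown document in a record (field names are assumptions)."""
    return {
        "id": str(doc_id),
        "text": text.strip(),
        "metadata": {"source": source},
    }


def convert_directory(input_dir, output_dir):
    """Write one JSON file per Markdown file found under input_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, md_file in enumerate(sorted(Path(input_dir).rglob("*.md")), start=1):
        text = md_file.read_text(encoding="utf-8")
        record = markdown_to_record(i, text, md_file.name)
        target = out / f"{md_file.stem}.json"
        # ensure_ascii=False keeps non-ASCII text readable in the output
        target.write_text(
            json.dumps([record], ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        written.append(target)
    return written
```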

Step 3: Knowledge Graph Construction

Use the generated JSON files with AutoSchemaKG for knowledge graph construction:

```
from atlas_rag.kg_construction.triple_extraction import KnowledgeGraphExtractor
from atlas_rag.kg_construction.triple_config import ProcessingConfig
from atlas_rag.llm_generator import LLMGenerator

# Your KG construction code here
# See main README for full example
```

Example: Complete Workflow

```
# 1. Set up environment
conda activate pdf-marker
cd pdf_process

# 2. Configure config.yaml with your settings
# Edit: input path, Azure endpoint, API keys, etc.

# 3. Convert PDF to Markdown
bash run.sh

# 4. Return to AutoSchemaKG directory
cd /path/to/AutoSchemaKG

# 5. Convert Markdown to JSON
python -m atlas_rag.kg_construction.utils.md_processing.markdown_to_json \
    --input pdf_process/md_output \
    --output example/example_data

# 6. Run KG construction
# Use the JSON files in your KG construction pipeline
```

Credits

PDF-to-Markdown conversion tool developed by swgj.
