POP Translation Pipeline

A notebook-based pipeline for translating agricultural POP (Package of Practices) document pages directly from PDF into English while preserving structure as closely as possible.

This project uses a PDF-page translation workflow with Gemini. In testing, this method produced stronger overall results and better structure preservation, especially when paired with the Gemini Pro model for difficult pages containing complex tables.

Overview

The pipeline is designed to process a source POP document in a controlled, page-range-based workflow.

Final workflow

Split the source POP document into page-wise PDF files
Send each page PDF directly to Gemini for English translation to get output in HTML file
Extract images from the original PDF and then inject them back into the Translated HTML File
Convert translated HTML files into page-wise PDF files
Merge all translated page PDFs into one final translated PDF

Repository Structure

POP-Translation/
├── notebooks/
│   ├── 01_document_to_pagewise_pdf.ipynb
│   ├── 02_page_pdf_to_translated_html.ipynb
│   ├── 03_extract_and_inject_images.ipynb
│   ├── 04_final_html_to_page_pdf.ipynb
│   └── 05_merge_page_pdfs.ipynb
├── prompts/
│   └── page_to_pdf.txt
├── .gitignore
└── README.md

Notebook Pipeline

01_document_to_pagewise_pdf.ipynb Splits the source POP document into page-wise PDF files for a selected page range and document.
02_page_pdf_to_translated_html.ipynb Reads each page-wise PDF and sends it directly to Gemini for English translation while attempting to preserve structure by converting it into an HTML file.
03_extract_and_inject_images.ipynb Extracts the images from the original page-wise PDF and injects them back into the translated HTML File.
04_final_html_to_page_pdf.ipynb Converts the HTML files into PDF page-wise files.
05_merge_page_pdfs.ipynb Merges all the page-wise PDF files into one final translated PDF file.

Environment Setup

Create a virtual environment and install the required packages:

cd ~/POP-Translation
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install google-genai pymupdf beautifulsoup4 weasyprint ipykernel

Register the environment as a Jupyter kernel:

python -m ipykernel install --user \
  --name  gemini_html_output \
  --display-name "Python (gemini_html_output)"

In Jupyter, switch the notebook kernel to:

Python (gemini_html_output)

Required Dependencies

This project uses:

google-genai
pymupdf
beautifulsoup4
weasyprint
ipykernel

Input Requirements

Place the source POP document PDF locally before running the notebooks. Example source file:

Hindi/POP.pdf

The notebooks are designed so that the source PDF path, page range and model configuration can be changed from the config cell.

API Key Setup

Set your Gemini API key before running the translation notebook. Recommended method: environment variable

export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"

The translation notebook reads the key from the environment.

Output Structure

A typical local working structure looks like this:

Workdir/
└── Hindi/
    └── source/
        ├── page_001/
        │   ├── page_001.pdf
        │   ├── translated.html
        │   ├── images/
        │   │   ├── image_1.png
        │   │   └── image_2.png
        │   ├── final_with_images.html
        │   ├── final_with_images_preview.html
        │   └── final_page.pdf
        ├── page_002/
        │   ├── page_002.pdf
        │   ├── translated.html
        │   ├── images/
        │   ├── final_with_images.html
        │   ├── final_with_images_preview.html
        │   └── final_page.pdf
        ├── page_003/
        │   └── ...
        └── final_output/
             └── source_translated_pages_001_to_227.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

POP Translation Pipeline

Overview

Final workflow

Repository Structure

Notebook Pipeline

Environment Setup

Required Dependencies

Input Requirements

API Key Setup

Output Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
notebooks		notebooks
prompts		prompts
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

POP Translation Pipeline

Overview

Final workflow

Repository Structure

Notebook Pipeline

Environment Setup

Required Dependencies

Input Requirements

API Key Setup

Output Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages