Last Updated: May 2025
GrGoogleOCR is a C# Windows Forms application demonstrating how to use Google Cloud Document AI for Optical Character Recognition (OCR) on PDF and image files. It removes the original text layer, if any. It extracts text and layout information from the OCR results and rebuilds the PDF with an embedded, searchable text layer using two different PDF libraries provided in separate branches.
This project tackles the common problem of non-searchable, scanned PDFs. It provides a C# implementation for:
- Processing PDF files with Google's Document AI OCR service. You will bring your own ProcessorId and ProjectId.
- Parsing the rich OCR results (text, layout, optional styles).
- Embedding this data back into a PDF as a searchable text layer.
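As a rough illustration of the parsing step: Document AI returns one full text string per document, and each layout element points into it through text segments (start and end offsets) and carries a bounding polygon with vertices normalized to the 0..1 range. The sketch below uses plain tuples instead of the client library's types, and the helper names `TextFor` and `ToPdfPoint` are hypothetical, not taken from this project. It also assumes a PDF coordinate system with a bottom-left origin; whether the Y flip is needed depends on the PDF library you draw with.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Concatenate the substrings of the full document text referenced by a layout's
// text segments (start inclusive, end exclusive). The API uses 64-bit offsets;
// int is used here only to keep the sketch short.
static string TextFor(string fullText, IEnumerable<(int Start, int End)> segments)
{
    var sb = new StringBuilder();
    foreach (var (start, end) in segments)
        sb.Append(fullText, start, end - start);
    return sb.ToString();
}

// Convert a normalized vertex (0..1, origin top-left) to PDF points,
// assuming a bottom-left origin on the PDF side.
static (double X, double Y) ToPdfPoint(double nx, double ny, double pageWidthPt, double pageHeightPt)
    => (nx * pageWidthPt, (1.0 - ny) * pageHeightPt);

string text = "Hello world\nSecond line\n";
Console.WriteLine(TextFor(text, new[] { (6, 11) }));   // prints the substring at offsets 6..11

var (x, y) = ToPdfPoint(0.5, 0.25, 612, 792);          // US Letter page size in points
Console.WriteLine($"{x}, {y}");
```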
To showcase different trade-offs between features and licensing, this repository offers two main branches:
- PdfSharpCore Branch
  - Goal: Provide a permissively licensed (MIT) solution that works "out of the box".
  - Library: Uses the open-source PdfSharpCore library.
  - Features: Creates a basic, searchable but visible text layer.
  - Limitations: Lacks advanced PDF features like precise character spacing to fit OCR boxes and true invisible text layers. The visual fidelity might not be perfect, but it enables search.
  - Best For: Users needing a free/permissive solution, a basic searchable PDF, or a starting point for Google AI integration.
- SyncfusionPdf Branch
  - Goal: Demonstrate the full potential with a feature-rich commercial library.
  - Library: Uses the Syncfusion.Pdf library.
  - Features: Creates high-fidelity PDF text layers and supports advanced features like better text placement (filling the box width by adjusting character spacing) and invisible text.
  - Limitations: Requires a Syncfusion license for production/commercial use. However, it runs without a license for evaluation, adding a warning banner to the generated PDFs. Syncfusion also offers a free Community License and a free license for Open Source Projects; check their terms to see if you qualify.
  - Best For: Users evaluating Syncfusion, those who own a license, qualifying community/OSS users, or anyone wanting to see the best possible PDF output.
To switch branches, run git checkout syncfusionpdf or git checkout pdfsharpcore, or download the code from either branch.
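The "filling box width by character spacing" idea mentioned for the SyncfusionPdf branch comes down to simple arithmetic: measure the text at its natural width, then spread the difference to the OCR box width across the gaps between characters. The following is a minimal sketch of that calculation only; `CharacterSpacingToFit` is a hypothetical helper, not this project's code, and the result would be passed to whatever character-spacing setting the PDF library exposes.

```csharp
using System;

// Extra space to add between characters so that the text spans the OCR box.
// naturalWidth is the width of the text as measured with the chosen font and size.
// A negative result means the text must be tightened to fit.
static double CharacterSpacingToFit(double boxWidth, double naturalWidth, int charCount)
{
    if (charCount < 2) return 0.0;   // no gaps to distribute the difference over
    return (boxWidth - naturalWidth) / (charCount - 1);
}

// 11 characters measured at 100 pt that must fill a 120 pt wide box:
// 20 pt spread over 10 gaps.
Console.WriteLine(CharacterSpacingToFit(boxWidth: 120.0, naturalWidth: 100.0, charCount: 11));
```

Dividing by the number of gaps rather than the number of characters is a design choice: spacing applied after the last glyph does not change the visible extent of the text.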
- Google Document AI Integration: Leverages Google's powerful cloud-based OCR engine.
- PDF Processing: Processes existing PDF files page by page.
- Configurable OCR: Allows selection of OCR modes (Lines, Tokens, Symbols), language hints, and style info requests.
- JSON Caching: Saves Google AI results locally to avoid reprocessing.
- Searchable PDF Output: Creates a new PDF file with the OCR text embedded.
- Configurable Settings: Provides a UI to manage Google Cloud details, OCR options, etc.
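The JSON caching feature can be pictured as a small "read if present, otherwise fetch and store" helper. The sketch below is one possible shape, not the project's actual implementation: the helper name `GetOrCreateCachedJson` and the choice of storing the result next to the PDF with a .json extension are assumptions made for illustration, and the fetch delegate stands in for the real Document AI call.

```csharp
using System;
using System.IO;

// Return the cached JSON for a PDF if it exists; otherwise call fetchJson and store its result.
// Assumption for this sketch: the cache file is "<pdf name>.json" in the PDF's directory.
static string GetOrCreateCachedJson(string pdfPath, Func<string> fetchJson)
{
    string cachePath = Path.ChangeExtension(pdfPath, ".json");
    if (File.Exists(cachePath))
        return File.ReadAllText(cachePath);

    string json = fetchJson();            // the expensive (billable) call happens only here
    File.WriteAllText(cachePath, json);
    return json;
}

// Demo with a fake fetch: the second call is served from the cache file.
string demoPdf = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".pdf");
int calls = 0;
string first = GetOrCreateCachedJson(demoPdf, () => { calls++; return "{\"text\":\"demo\"}"; });
string second = GetOrCreateCachedJson(demoPdf, () => { calls++; return "{}"; });
Console.WriteLine($"fetches: {calls}, same result: {first == second}");
File.Delete(Path.ChangeExtension(demoPdf, ".json"));   // clean up the demo cache file
```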
- Language: C#
- Framework: .NET 8
- UI: Windows Forms (WinForms)
- Google Cloud Client: Google.Cloud.DocumentAI.V1
- JSON Handling: System.Text.Json
- PDF Manipulation: PdfSharpCore (main branch) / Syncfusion.Pdf (syncfusion branch)
- Windows Operating System
- .NET SDK (Match your project's target framework)
- Visual Studio (Recommended for building/debugging)
This application requires a Google Cloud Platform (GCP) account and Document AI setup:
- Create a GCP Project: Google Cloud Console.
- Enable Billing: Required for Document AI. There are generous allowances for new accounts.
- Enable Document AI API: Search for it under "APIs & Services" > "Library" and enable it.
- Create a Processor: Create a "Document OCR" or "Form Parser" processor in the Document AI section. Note its Location and Processor ID.
- Create a Service Account & Key: Go to "IAM & Admin" > "Service Accounts", create an account, grant it the "Document AI User" role, then create and download a key for it.
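To show how the pieces from the steps above fit together, here is a minimal sketch of a synchronous Document AI call using the Google.Cloud.DocumentAI.V1 package. It is not this project's code: `BuildRequest` is a hypothetical helper, and the sketch assumes credentials are picked up through Application Default Credentials, for example by pointing the GOOGLE_APPLICATION_CREDENTIALS environment variable at the downloaded service account key file.

```csharp
using System;
using System.IO;
using Google.Cloud.DocumentAI.V1;
using Google.Protobuf;

// Build a synchronous process request for one PDF.
static ProcessRequest BuildRequest(string projectId, string location, string processorId, byte[] pdfBytes)
    => new ProcessRequest
    {
        // Full resource name: projects/{project}/locations/{location}/processors/{processor}
        Name = ProcessorName.FromProjectLocationProcessor(projectId, location, processorId).ToString(),
        RawDocument = new RawDocument
        {
            Content = ByteString.CopyFrom(pdfBytes),
            MimeType = "application/pdf"
        }
    };

// Usage: <projectId> <location> <processorId> <pdfPath>
if (args.Length == 4)
{
    // Processors are served from a regional endpoint that matches their location,
    // e.g. "eu-documentai.googleapis.com" for location "eu".
    var client = new DocumentProcessorServiceClientBuilder
    {
        Endpoint = $"{args[1]}-documentai.googleapis.com"
    }.Build();

    ProcessResponse response = client.ProcessDocument(
        BuildRequest(args[0], args[1], args[2], File.ReadAllBytes(args[3])));
    Console.WriteLine(response.Document.Text);   // the full recognized text
}
```

The Location noted when creating the processor matters twice here: it is part of the processor's resource name and it selects the endpoint.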
- Clone this repository: git clone [Your_Repo_URL]
- Navigate to the directory: cd GrGoogleOCR
- Choose your branch:
  - For the basic version: git checkout main (or stay if it's the default)
  - For the full-featured demo: git checkout syncfusionpdf
- Open the solution (.sln) file in Visual Studio.
- Restore NuGet Packages: Right-click the solution and choose "Restore NuGet Packages".
  - Note for the syncfusion branch: You might need to configure the Syncfusion NuGet feed in Visual Studio if the packages don't restore automatically.
- Build the solution (Build > Build Solution).
- Run GrGoogleOCR.exe.
- Configure the settings in the Property Grid (GCP Project, Location, Processor, PDF File Path). TextIsVisible = false takes effect in the SyncfusionPdf branch only.
  - Note on Font/Style Info (IsStyleInfoWanted): Be aware that enabling IsStyleInfoWanted = true asks Google Document AI for detailed font and style information. While this can improve PDF output (especially with libraries like Syncfusion), it engages a more advanced processing model that significantly increases the cost (potentially by 5x or more per page; always check Google's current pricing). Only enable this if you specifically need font data and are aware of the cost implications. You may need font info if your text contains, for example, words in italics and you need to preserve this in the text layer. For a simple searchable PDF, you don't need italics in the text layer, which is invisible anyway. If you extract text to HTML, for instance, you may need style info where it conveys significant meaning.
- Click the "Go" button.
- Output files (.json, .pdf) will appear in the input PDF's directory. The Syncfusion version will also show the results in the application.
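For readers wondering how settings such as symbol-level output, language hints, and IsStyleInfoWanted might translate into a Document AI request, the sketch below builds the corresponding process options with the Google.Cloud.DocumentAI.V1 package. The mapping from this application's settings to these fields is an assumption, and `BuildOptions` is a hypothetical helper. Style info is requested only when asked for, since it is the option with cost implications.

```csharp
using System;
using Google.Cloud.DocumentAI.V1;

// Translate a few OCR settings into Document AI process options.
static ProcessOptions BuildOptions(bool wantSymbols, bool wantStyleInfo, params string[] languageHints)
{
    var ocr = new OcrConfig
    {
        EnableSymbol = wantSymbols,               // include character-level (symbol) output
        Hints = new OcrConfig.Types.Hints()
    };
    ocr.Hints.LanguageHints.AddRange(languageHints);   // BCP-47 codes such as "en"

    // Request font/style information only on demand.
    if (wantStyleInfo)
        ocr.PremiumFeatures = new OcrConfig.Types.PremiumFeatures { ComputeStyleInfo = true };

    return new ProcessOptions { OcrConfig = ocr };
}

ProcessOptions options = BuildOptions(false, false, "en");
Console.WriteLine(options);   // prints the options message in JSON form
```

The resulting object would be assigned to the ProcessOptions property of a ProcessRequest before sending it.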
Google Document AI has rate limits. In my experience, launching 4-5 page OCR requests in parallel per second is usually accepted. Requesting character-level info results in a much bigger JSON so you may want to slow down.
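One simple way to stay near that rate is to space out the launches of the per-page requests while still letting them run concurrently. The sketch below shows the idea with a stand-in job; `RunThrottledAsync` and `FakeOcrPageAsync` are hypothetical names, and a real job would call Document AI instead of just waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Start at most startsPerSecond jobs per second, then wait for all of them to finish.
static async Task<T[]> RunThrottledAsync<T>(IEnumerable<Func<Task<T>>> work, int startsPerSecond)
{
    var gap = TimeSpan.FromSeconds(1.0 / startsPerSecond);
    var running = new List<Task<T>>();
    foreach (var job in work)
    {
        running.Add(job());        // launch without awaiting, so requests overlap
        await Task.Delay(gap);     // space out the launches
    }
    return await Task.WhenAll(running);   // results come back in launch order
}

// Stand-in for a per-page OCR call (hypothetical).
static async Task<int> FakeOcrPageAsync(int page)
{
    await Task.Delay(50);
    return page;
}

var pageJobs = Enumerable.Range(1, 8).Select(p => (Func<Task<int>>)(() => FakeOcrPageAsync(p)));
int[] donePages = await RunThrottledAsync(pageJobs, startsPerSecond: 4);
Console.WriteLine(string.Join(", ", donePages));
```

This limits only the launch rate, not the number of requests in flight; if responses are large (for example with character-level output), lowering startsPerSecond is the simplest adjustment.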
- The PdfSharpCore branch uses the permissively licensed (MIT) PdfSharpCore library.
- The SyncfusionPdf branch uses Syncfusion's PDF library.
- It will run without a license for evaluation, but it will add a warning banner to generated PDFs.
- To remove the banner and for production/commercial use, a Syncfusion license is required.
- Syncfusion offers a free Community License and a free Open Source Project License. We highly recommend checking if you qualify, as it provides access to the full-featured version at no cost for eligible users. Visit the Syncfusion Licensing page for details.
- This project itself has its own license (Apache 2.0) to comply with Google sample code licensing terms.
Contributions are welcome! Feel free to fork the repository, create feature branches, and submit pull requests. Please specify which branch your changes apply to.
