PdfToTxt is a simple package for converting a PDF file into TXT with PHP
composer require raphaelramosds/pdf-to-txt

Converts file.pdf into file.txt and saves it in the path/to/txt directory
$ptt = new PdfToTxt('path/to/file.pdf', 'path/to/txt', 'file');
$ptt->convert();

You can also use it as a web application: clone the repository, build the Docker image, and run the container with Docker Compose.
Web interface for converting PDF files to TXT
Unfortunately, this package can only be used in a Linux environment. Additionally, you will need to install the following dependencies
# Install Tesseract OCR and its Portuguese (PT-BR) language data
sudo apt install tesseract-ocr tesseract-ocr-por
# Install ImageMagick and the PHP imagick extension
sudo apt install imagemagick php-imagick
# Enable imagick extension
sudo phpenmod imagick
# (Optional) Check if it is enabled
php -m | grep imagick

It uses ImageMagick to convert every PDF page to JPG, extracts the text of each page with Tesseract OCR, and compiles the results into a single TXT file.
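
The same three steps can be approximated from the command line, which may help when checking the dependencies outside PHP. This is only a rough sketch, not the package's actual implementation: `pdf_to_txt` is a hypothetical helper name, the 300 DPI density is an assumed value, and the `por` language code simply matches the `tesseract-ocr-por` data installed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the package): PDF -> JPG pages -> OCR -> one TXT.
# The body runs in a subshell "( ... )" so variables and the trap stay local.
pdf_to_txt() (
  pdf=$1 out_dir=$2 name=$3
  tmp=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp"' EXIT

  # ImageMagick renders each page to page-000.jpg, page-001.jpg, ...
  # -density 300 is an assumed resolution; ImageMagick 7 names the command "magick".
  convert -density 300 "$pdf" "$tmp/page-%03d.jpg" || exit 1

  mkdir -p "$out_dir" || exit 1
  : > "$out_dir/$name.txt"

  # Zero-padded names keep the glob in page order.
  for page in "$tmp"/page-*.jpg; do
    # "stdout" as the output base makes Tesseract print the recognized text.
    tesseract "$page" stdout -l por >> "$out_dir/$name.txt" || exit 1
  done
)
```

After sourcing the block, `pdf_to_txt path/to/file.pdf path/to/txt file` would write path/to/txt/file.txt, mirroring the arguments of the PHP constructor.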
While some PDF files use standard fonts that can be easily mapped to text, others rely on custom fonts which often store characters as vector graphics. In such cases, OCR becomes necessary to extract readable content. Therefore, in the future, I plan to add Ghostscript support to this package as an alternative method for handling these PDFs without relying solely on OCR.
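
One way a Ghostscript path could sit alongside OCR is to try Ghostscript's `txtwrite` device first and fall back to OCR only when it yields no text. The sketch below is an assumption about how that might look, not a planned design of this package; `gs_extract` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: try Ghostscript's txtwrite device and report whether it
# produced any non-whitespace text. A non-zero status suggests falling back to OCR.
gs_extract() (
  pdf=$1 txt=$2
  # -q suppresses the banner; -o sets the output file and implies -dBATCH -dNOPAUSE.
  gs -q -sDEVICE=txtwrite -o "$txt" "$pdf" || exit 1
  # Succeed only if the output contains at least one non-whitespace character.
  grep -q '[^[:space:]]' "$txt"
)
```

For example, `gs_extract file.pdf file.txt || echo "no usable text, use OCR instead"` would signal when the OCR route is still needed.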
You can use the following Ghostscript command to convert a PDF into a plain text file
gs -sDEVICE=txtwrite -o file.txt file.pdf

Before using this approach, it's recommended to check which fonts are used in the PDF. You can do that with the following command
gs -DPDFINFO file.pdf

Unit tests were written with PHPUnit
./vendor/bin/phpunit tests