fix: install JAI Image I/O JPEG2000 plugin for PDFBox #8

Open
ferblape wants to merge 1 commit into main from ferblape/install-jai-jpeg2000

@ferblape
Member

Summary

  • PDFBox was logging "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed" and skipping OCR on PDF pages whose images use JPEG2000 (JPX) streams, which are common in Spanish, Catalan, and Basque administrative scans.
  • Drop jai-imageio-core and jai-imageio-jpeg2000 (v1.4.0) into /tika-extras/, which the apache/tika:latest-full entrypoint already includes on the server classpath, so J2KImageReaderSpi is auto-discovered via the ImageIO ServiceLoader.
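
The ServiceLoader discovery described in this summary can be checked from inside the JVM. The following is a minimal sketch and not part of this PR: the class name Jpeg2000ReaderCheck and the helper readersFor are hypothetical, and the "JPEG2000" format name is an assumption about what the plugin registers and what PDFBox looks up. It uses only the standard javax.imageio API, and the result is meaningful only when it runs with the same classpath as the Tika server (including /tika-extras/*).

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of the PR.
final class Jpeg2000ReaderCheck {

    /** Returns the class names of all ImageIO readers registered for a format name. */
    static List<String> readersFor(String formatName) {
        // Re-scan the classpath so newly added plugin JARs are picked up.
        ImageIO.scanForPlugins();
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: the JPEG2000 plugin registers under the name "JPEG2000".
        List<String> readers = readersFor("JPEG2000");
        if (readers.isEmpty()) {
            System.out.println("No JPEG2000 reader found on the classpath");
        } else {
            readers.forEach(name -> System.out.println("JPEG2000 reader: " + name));
        }
    }
}
```

An empty list would suggest the JARs are not on the classpath that was used to run the check, which is the same condition that leads PDFBox to log the "not installed" message.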

Test plan

  • docker build succeeds against apache/tika:latest-full
  • docker inspect confirms /tika-extras/* is on the entrypoint -cp
  • JAR ships META-INF/services/javax.imageio.spi.ImageReaderSpi registering com.github.jaiimageio.jpeg2000.impl.J2KImageReaderSpi
  • Tika server starts cleanly; POSTing a JP2 to /detect and /meta returns image/jp2 with no MissingImageReaderException in logs
  • Re-run a previously-failing production PDF and confirm OCR text is extracted from JPEG2000 pages
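
The third item in the test plan (the JAR shipping a service registration file) could be verified with a small program like the one below. This is a sketch under assumptions: the class ServiceEntryCheck and its methods are hypothetical names, and the path to the JAR is supplied by the caller. The service file path and the provider class name are the ones quoted in the test plan.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of the PR.
final class ServiceEntryCheck {
    static final String SERVICE_FILE =
            "META-INF/services/javax.imageio.spi.ImageReaderSpi";
    static final String EXPECTED_SPI =
            "com.github.jaiimageio.jpeg2000.impl.J2KImageReaderSpi";

    /** Parses a provider-configuration file: one class name per line, '#' starts a comment. */
    static List<String> parseProviders(String content) {
        List<String> providers = new ArrayList<>();
        for (String line : content.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    /** Reads the ImageReaderSpi service file from a JAR; returns an empty list if it is absent. */
    static List<String> providersIn(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            JarEntry entry = jar.getJarEntry(SERVICE_FILE);
            if (entry == null) {
                return new ArrayList<>();
            }
            try (InputStream in = jar.getInputStream(entry)) {
                return parseProviders(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ServiceEntryCheck.java <path-to-jar>");
            System.exit(2);
        }
        List<String> providers = providersIn(args[0]);
        providers.forEach(System.out::println);
        // Exit status 0 only when the expected reader SPI is registered.
        System.exit(providers.contains(EXPECTED_SPI) ? 0 : 1);
    }
}
```

Run it against the jai-imageio-jpeg2000 JAR copied into the image; a non-zero exit status would mean the registration the ServiceLoader relies on is missing from that JAR.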

PDFBox failed to rasterize PDF pages with JPEG2000-encoded image streams,
logging "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image
I/O Tools are not installed" and skipping OCR on those pages. Drop the
jai-imageio-core and jai-imageio-jpeg2000 JARs into /tika-extras/, which
the apache/tika base image already includes on the server classpath, so
the J2KImageReaderSpi is auto-discovered via the ImageIO ServiceLoader.