See the post on TDS.

Further reading:
- Extracting Semi-Structured Data from PDFs on a large scale
- Extracting Tabular Data from PDFs