Extracting text from PDFs is hard

Most PDFs do not contain structured text. Each letter is placed on the page on its own and bears no relation to the one before it, so there is no built-in notion of words, lines or columns. This makes extracting text from PDFs accurately pretty much impossible. This post is a summary of how far I got using off-the-shelf libraries and tools on macOS.

Since PDF Differ is a Mac app, we can use Apple’s PDFKit, which can attempt to extract text. However, its default text extraction is less than ideal. It can handle simple documents that contain a single column of text, but it struggles with multiple columns and with documents that are not in English. I occasionally found that the extracted text contained duplicated characters and additional whitespace characters. These additions were not consistent across a range of documents, so I can't justify adding a post-processing step to remove them.
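For reference, here is a minimal sketch of what default PDFKit extraction looks like. This is not PDF Differ's actual code, and the function name `extractPageText` is made up for illustration. It only uses `PDFDocument(url:)`, `pageCount`, `page(at:)` and the page's `string` property.

```swift
import Foundation
import PDFKit

/// Returns the text PDFKit extracts from each page, in page order.
/// Returns nil if the file cannot be opened as a PDF.
/// `extractPageText` is a hypothetical helper name, not a PDFKit API.
func extractPageText(from url: URL) -> [String]? {
    guard let document = PDFDocument(url: url) else { return nil }

    var pages: [String] = []
    for index in 0..<document.pageCount {
        // `string` is optional, so a page without extractable text
        // is recorded as an empty string.
        let text = document.page(at: index)?.string ?? ""
        pages.append(text)
    }
    return pages
}
```

Calling `extractPageText(from:)` on a single-column document should give you something usable per page. Multi-column and non-English documents are where the output tends to go wrong, as described above.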

This got me thinking: how far can we get using the optical character recognition (OCR) features in Apple's Vision framework? The approach I used is to rasterize each PDF page and then run Vision's text recognition on the resulting image. Sadly, the results haven't been much better. Vision's OCR tools can recognise columns of text, but the results start to fall apart when script fonts are used. I didn't want to abandon this experiment, so I added a preprocessing step to the images to make the text more prominent. I soon realised that this step would need to be customised by the user for different documents, which would increase support requests.
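The rasterize-then-recognise approach can be sketched roughly as follows. Treat it as an illustration under assumptions rather than what PDF Differ ships: the helper names `rasterize` and `recognizeText`, the 2x scale factor and the white background are my own choices here, and the sketch assumes unrotated pages whose media box starts at the origin. It leaves out the image preprocessing step entirely.

```swift
import Foundation
import PDFKit
import Vision

/// Draws a PDF page into a bitmap. `rasterize` and the default scale
/// are illustrative choices, not part of PDFKit.
/// Assumes an unrotated page whose media box starts at (0, 0).
func rasterize(_ page: PDFPage, scale: CGFloat = 2.0) -> CGImage? {
    let bounds = page.bounds(for: .mediaBox)
    let width = Int((bounds.width * scale).rounded())
    let height = Int((bounds.height * scale).rounded())
    guard width > 0, height > 0,
          let context = CGContext(
              data: nil, width: width, height: height,
              bitsPerComponent: 8, bytesPerRow: 0,
              space: CGColorSpaceCreateDeviceRGB(),
              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }

    // Fill with white first so transparent pages do not come out black.
    context.setFillColor(red: 1, green: 1, blue: 1, alpha: 1)
    context.fill(CGRect(x: 0, y: 0, width: width, height: height))

    // Scale up so small type has enough pixels for recognition.
    context.scaleBy(x: scale, y: scale)
    page.draw(with: .mediaBox, to: context)
    return context.makeImage()
}

/// Runs Vision text recognition on an image and returns the best
/// candidate string for each piece of text it finds.
func recognizeText(in image: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let observations = request.results ?? []
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
```

To process a whole document you would loop over its pages, call `rasterize` on each one and pass the image to `recognizeText`. The scale factor is the first thing I would tune, since a larger bitmap costs memory and time but gives Vision more to work with.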

Due to these limitations, I will be dropping the text comparison mode. I don't want to support a feature that only partly works for certain documents. My plan is to focus more on visual testing and to allow users not only to compare spreads against pages, but also to compare any number of pages on a single printing plate against a single-page document. It will also support rotated documents. I have no regrets about going down the Vision framework rabbit hole, since I may be able to use it to help with template matching documents.

Even though Apple's Vision framework didn't work for my needs, it is still incredible. If you're working on a project where you need to extract text from a PDF such as an annual report, I still recommend checking it out. In my experience, it is far superior to Tesseract and other open-source alternatives.