Why the tool exists and how it works
I started PDF Differ after talking to designers and art directors who faced problems similar to those I was trying to solve at Google Fonts. Adobe applications covered producing documents, but testing the output of these applications was an issue. A common complaint I heard concerned the limitations of the comparison mode in Adobe's Acrobat Pro. Several art directors had resorted to comparing PDFs side by side.
Although PDF comparison tools already exist, I knew there was a gap in the market for a tool that would significantly speed up the process for design professionals. I spent a few weeks researching the existing tools. Most of them compared either textual content or pages as bitmaps. It was rare to encounter a tool that could do both. None of them could compare documents with different paper sizes.
I naively assumed I could write a new tool in a weekend (as one does). I have just completed the first public beta release, and I have spent over six hundred hours getting everything in place. I hope the trials and tribulations below give insight into how a seemingly simple brief, "a PDF comparison tool for professional designers," can take a lot of effort.
The majority of designers and the publishing industry work on Macs, and this isn't going to change any time soon, so it makes sense to target only this platform for the first release. Apple's developer tools also make developing the application more efficient because they include PDFKit, which, as the name suggests, is a library for working with PDFs. Unfortunately, this library doesn’t support PDFs with transparent images (PDF version 1.4 added support for transparency in 2001). This is why most designers use Acrobat DC or Pro. For the app to support transparency and more modern PDF features, I’ve had to use an external PDF library. Adobe already offers such a library, but the licensing costs are hefty. One free alternative is Ghostscript, but this also has a high price tag for commercial use. After some research, I decided to use PDFium since it's Apache licensed, maintained by Google, and free for commercial use. Another option is Mozilla's PDF.js, but that would mean using JavaScript and probably releasing the tool as a web app.
To match spreads against pages and to skip printer's marks, I've used a technique called template matching. Computer vision libraries such as OpenCV already include functions to do this. However, I’ve written my own implementation to keep the application’s file size to a minimum. It uses a technique called image pyramiding to speed up the matching process: the search first runs on heavily downscaled copies of the images, and the result is then refined at progressively higher resolutions. To gain further speed, the application template matches multiple pages concurrently. PDF Differ still takes a considerable amount of time to template match documents with hundreds of pages, so I hope to improve this in the future. One option is to template match using a fast Fourier transform; another is to use Apple’s Vision framework to match features in the two documents.
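To make the pyramid idea concrete, here is a minimal sketch in Python. It is not PDF Differ's code: the function names, the sum-of-absolute-differences score, the 2x2 averaging, and the search window of two pixels are all assumptions chosen to keep the example short. Images are plain lists of rows of grayscale integers.

```python
def downsample(img):
    """Halve an image by averaging each 2x2 block (one pyramid step)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def sad(img, tmpl, ox, oy):
    """Sum of absolute differences with the template's top-left at (ox, oy)."""
    return sum(abs(img[oy + y][ox + x] - value)
               for y, row in enumerate(tmpl)
               for x, value in enumerate(row))


def best_offset(img, tmpl, xs, ys):
    """Return the in-bounds (x, y) from the candidate ranges with the lowest score."""
    max_x = len(img[0]) - len(tmpl[0])
    max_y = len(img) - len(tmpl)
    candidates = [(x, y) for y in ys for x in xs
                  if 0 <= x <= max_x and 0 <= y <= max_y]
    return min(candidates, key=lambda p: sad(img, tmpl, p[0], p[1]))


def pyramid_match(img, tmpl, levels=2):
    """Locate tmpl inside img, searching coarse-to-fine (hypothetical helper)."""
    # Build both pyramids; index 0 is full resolution, the last is the coarsest.
    imgs, tmpls = [img], [tmpl]
    for _ in range(levels):
        imgs.append(downsample(imgs[-1]))
        tmpls.append(downsample(tmpls[-1]))

    # Exhaustive search is only done on the smallest images.
    top_img, top_tmpl = imgs[-1], tmpls[-1]
    x, y = best_offset(top_img, top_tmpl,
                       range(len(top_img[0])), range(len(top_img)))

    # At each finer level, double the estimate and search a small window around it.
    for level in range(levels - 1, -1, -1):
        x, y = 2 * x, 2 * y
        x, y = best_offset(imgs[level], tmpls[level],
                           range(x - 2, x + 3), range(y - 2, y + 3))
    return x, y


# Usage: hide an 8x8 gradient patch in a blank 32x32 page and find it again.
page = [[0] * 32 for _ in range(32)]
patch = [[100 + 10 * y + x for x in range(8)] for y in range(8)]
for y in range(8):
    for x in range(8):
        page[8 + y][12 + x] = patch[y][x]
print(pyramid_match(page, patch))
```

The saving comes from doing the exhaustive search only at the coarsest level, where the images hold a fraction of the pixels. The trade-off is that a wrong guess at the coarse level cannot be recovered later, so the number of levels and the size of the refinement window need tuning for real pages.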
I’ve decided to release the text comparison mode in an unfinished state. Extracting textual information from PDFs is a very difficult problem. The PDF format doesn’t store textual structure such as paragraphs or sentences. It simply paints each individual character onto the page. This makes it very challenging to extract text successfully from documents with complex layouts. Currently, I am relying on PDFKit’s own text extraction methods. If this tool is successful, I will investigate this subject further. Once the text is extracted, I use my own implementation of the Myers diff algorithm to compare the text in both documents. Programmers commonly use this algorithm to compare changes in source code. To display the results, the left portion of the screen shows the old PDF, while the right side shows the new PDF. Clicking a sentence makes the viewport jump to the correct PDF page and highlights the selected text. I hope this layout will help designers check whether documents are reflowing correctly and help copywriters compare text revisions quickly.
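For readers unfamiliar with the Myers algorithm, the following is a minimal Python sketch of its basic greedy form. It is an illustration rather than PDF Differ's implementation: the function name `myers_diff` and the `('=', '-', '+')` output format are assumptions, and it works on any two sequences, so the items could be characters, words, or extracted sentences.

```python
def myers_diff(a, b):
    """Return a shortest edit script as (op, item) pairs.

    op is '=' (unchanged), '-' (only in a), or '+' (only in b).
    """
    n, m = len(a), len(b)
    v = {1: 0}   # v[k] = furthest x reached on diagonal k, where k = x - y
    trace = []   # snapshot of v taken before each round, used for backtracking

    for d in range(n + m + 1):           # d = number of edits tried so far
        trace.append(dict(v))
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]             # step down: take an item from b
            else:
                x = v[k - 1] + 1         # step right: drop an item from a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1      # follow the run of matching items
            v[k] = x
            if x >= n and y >= m:
                return _backtrack(a, b, trace, d)
    return []


def _backtrack(a, b, trace, d_final):
    """Walk the recorded rounds backwards to recover the edit script."""
    x, y = len(a), len(b)
    ops = []
    for d in range(d_final, -1, -1):
        v = trace[d]
        k = x - y
        if k == -d or (k != d and v[k - 1] < v[k + 1]):
            prev_k = k + 1
        else:
            prev_k = k - 1
        prev_x = v[prev_k]
        prev_y = prev_x - prev_k
        while x > prev_x and y > prev_y:
            x, y = x - 1, y - 1
            ops.append(('=', a[x]))
        if d > 0:
            if x == prev_x:
                ops.append(('+', b[prev_y]))
            else:
                ops.append(('-', a[prev_x]))
        x, y = prev_x, prev_y
    ops.reverse()
    return ops


# Usage: compare two revisions sentence by sentence.
old = ["The quick fox.", "It jumps.", "The end."]
new = ["The quick fox.", "It leaps.", "The end."]
for op, sentence in myers_diff(old, new):
    print(op, sentence)
```

Diffing whole sentences rather than characters keeps the output aligned with what a reader would click on, at the cost of reporting a sentence with one changed word as a full removal plus insertion.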
I’ve decided not to distribute PDF Differ on the Apple App Store since it doesn’t allow developers to offer free trials. I strongly believe it is important to let anyone try out software, free of charge, no questions asked, before committing to a purchase.
We have something special which should help designers save time and avoid costly printing mistakes. The beta is open to anyone who is interested. Feel free to download the application and try it out for 30 days. If you need longer to evaluate, I will provide you with a temporary licence key.