PDF File Analysis

From: https://trailofbits.github.io/ctf/forensics/

PDF is an extremely complicated document file format, with enough tricks and hiding places to write about for years. That complexity also makes it popular for CTF forensics challenges. In 2008 the NSA published a guide to these hiding places titled “Hidden Data and Metadata in Adobe PDF Files: Publication Risks and Countermeasures.” It is no longer available at its original URL, but archived copies can still be found online. Ange Albertini also keeps a wiki on GitHub of PDF file format tricks.

The PDF format is partially plain-text, like HTML, but with many binary “objects” in the contents. Didier Stevens has written good introductory material about the format. The binary objects can hold compressed or even encrypted data, and can carry active content such as JavaScript or embedded Flash. To display the structure of a PDF, you can either browse it with a text editor or open it with a PDF-aware file-format editor like Origami.
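Because compressed objects are unreadable in a text editor, it can help to inflate them first. The following is a minimal sketch, assuming the streams of interest use zlib/Flate compression and are not encrypted; the function name inflate_streams and the regular expression are illustrative choices, not part of any PDF tool, and a real parser would honor each object's /Filter and /Length entries instead of pattern matching.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords. The lookbehind
# keeps the "stream" inside "endstream" from being treated as an opener.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the inflated body of every stream that zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Uncompressed, encrypted, or using another filter: skip it.
            continue
    return results


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for index, body in enumerate(inflate_streams(handle.read())):
            print(f"--- stream {index}: {len(body)} bytes ---")
            print(body[:200])
```

Run it with the path of the PDF as the only argument and skim the output for readable text, script fragments, or embedded file signatures.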

qpdf is a useful tool for exploring a PDF and for transforming or extracting information from it. Origami, a framework written in Ruby, is another.
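One common first step with qpdf is to rewrite the file in its "QDF" form, which uncompresses streams and lays the objects out so they can be read in a text editor. The sketch below wraps that invocation from Python, assuming qpdf is installed and on the PATH; the helper names qdf_command and to_qdf are hypothetical, while --qdf and --object-streams=disable are qpdf's own options.

```python
import shutil
import subprocess


def qdf_command(src: str, dst: str) -> list[str]:
    """Build the qpdf command line that writes an editor-friendly copy."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def to_qdf(src: str, dst: str) -> subprocess.CompletedProcess:
    """Run qpdf and return the completed process for inspection."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    # No check=True: qpdf can exit nonzero for warnings on a damaged file
    # and still write usable output, so the caller should read stderr.
    return subprocess.run(qdf_command(src, dst), capture_output=True, text=True)


if __name__ == "__main__":
    import sys

    result = to_qdf(sys.argv[1], sys.argv[2])
    print("exit status:", result.returncode)
    print(result.stderr)
```

The rewritten copy is a normal PDF, so it can still be opened in a viewer while also being searchable with grep or a text editor.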

When exploring PDF content for hidden data, some of the hiding places to check include:

  • non-visible layers
  • Adobe’s metadata format “XMP”
  • the “incremental update” feature of PDF, wherein a previous version is retained in the file but not visible to the user
  • white text on a white background
  • text behind images
  • an image behind an overlapping image
  • non-displayed comments
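Two of the hiding places above can be checked with very little code. Edits saved on top of an earlier version are appended to the file, so every prefix that ends at a %%EOF marker is a candidate earlier revision; XMP metadata is an XML packet that is often stored uncompressed. The sketch below assumes both conditions hold, and the names split_revisions and extract_xmp are illustrative. Note that a linearized PDF can contain more than one %%EOF marker without having any hidden revision, so treat the count as a hint rather than proof.

```python
import re

EOF_MARKER = b"%%EOF"
# An XMP packet is wrapped in an <x:xmpmeta> element.
XMP_RE = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def split_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one prefix of the file per %%EOF marker, oldest first."""
    revisions = []
    position = 0
    while True:
        index = pdf_bytes.find(EOF_MARKER, position)
        if index == -1:
            break
        end = index + len(EOF_MARKER)
        # Everything up to this marker is what the file looked like
        # before the next update was appended.
        revisions.append(pdf_bytes[:end])
        position = end
    return revisions


def extract_xmp(pdf_bytes: bytes) -> list[str]:
    """Return every uncompressed XMP packet found in the file as text."""
    return [
        match.group(0).decode("utf-8", errors="replace")
        for match in XMP_RE.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        data = handle.read()
    for number, revision in enumerate(split_revisions(data), start=1):
        with open(f"revision_{number}.pdf", "wb") as out:
            out.write(revision)
    for packet in extract_xmp(data):
        print(packet)
```

Opening each carved revision_N.pdf in a viewer shows whether an earlier version of the document displays different content from the final one.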

There are also several Python packages for working with the PDF file format, like PeepDF, that enable you to write your own parsing scripts.