Everything Has a Fingerprint: The Case of Blank Paper
This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy.
Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like hand geometry, as well as “behavioral biometrics” like voice, handwriting, typing patterns, and even gait analysis. Many techniques for deanonymization, the principal topic of this blog, work by “fingerprinting” people’s preferences, habits, or style.
But this article is not about biometrics, nor is it about fingerprinting of content or complex systems such as a web browser in conjunction with the OS and the user. I will instead discuss one of the most surprising domains of fingerprinting — blank paper.
This is what paper looks like up close — far from being smooth, it has a rich natural structure. Even so, the state-of-the-art study on fingerprinting of physical documents, by Will Clarkson and colleagues at Princeton, achieves something remarkable: they show how to extract fingerprints from paper using just commodity scanners, with no microscopic technology. The fingerprint survives when the document is printed on, written or scribbled on, or even soaked in water.
A small (10mm tall) region of paper scanned from two different angles — top-to-bottom and left-to-right
The image above, taken from the Princeton paper, shows what the output of a scanner looks like. Not quite the resolution of the microscopic image, but a lot of structure is still visible. The key technique: by scanning the paper at different orientations and comparing the images, the height of the surface at each point can be estimated, and from these estimates a 3-D map of the not-so-flat surface of the paper is constructed.
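As a rough illustration of the idea — my own toy simplification, not the authors' exact shading model — the slope of the surface along each axis can be read off from the brightness difference between opposing scan directions, and the slopes can then be integrated into a height map:

```python
import numpy as np

def estimate_height(scan_ltr, scan_rtl, scan_ttb, scan_btt):
    """Toy reconstruction of a height map from four scans of one region.

    Assumes a simplified shading model (hypothetical): the brightness
    difference between opposing scan directions is roughly proportional
    to the surface slope along that axis.
    """
    grad_x = (scan_ltr - scan_rtl) / 2.0  # slope along the x axis
    grad_y = (scan_ttb - scan_btt) / 2.0  # slope along the y axis
    # Naive integration: accumulate the slopes from the top-left corner.
    return np.cumsum(grad_x, axis=1) + np.cumsum(grad_y, axis=0)

# Demo on synthetic "scans" standing in for real scanner output.
rng = np.random.default_rng(0)
scans = [rng.normal(size=(32, 32)) for _ in range(4)]
height_map = estimate_height(*scans)
print(height_map.shape)  # (32, 32)
```

The real system recovers surface normals much more carefully; the point here is only that multiple scan orientations contain enough shading information to recover 3-D structure.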
These 3-D maps could serve as fingerprints directly, but for efficiency the authors use the maps of only about 100 randomly picked small “patches” on the paper. To further compress the extracted information, they apply a “dimensionality reduction,” resulting in a 400-byte “feature vector” for each piece of paper, which is the fingerprint.
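A minimal sketch of that compression step, with two stand-in assumptions of mine: a random projection takes the place of the learned principal-component basis, and the projected values are turned into bits by sign-thresholding (a detail not spelled out above):

```python
import numpy as np

rng = np.random.default_rng(0)

def fingerprint(patches, basis):
    """Compress patch height maps into a bit-vector fingerprint.

    patches: (n_patches, patch_pixels) array of per-patch height maps.
    basis: (n_bits, n_patches * patch_pixels) projection matrix — in the
        real system, principal components learned from many documents;
        here a random projection stands in for the learned basis.
    """
    flat = patches.reshape(-1)               # concatenate all patch maps
    projected = basis @ flat                 # dimensionality reduction
    return (projected > 0).astype(np.uint8)  # one bit per component

patches = rng.normal(size=(100, 8 * 8))    # ~100 small patch height maps
basis = rng.normal(size=(3200, 100 * 64))  # stand-in for a PCA basis
fp = fingerprint(patches, basis)
print(fp.size)  # 3200 bits = 400 bytes
```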
To verify or compare an observed fingerprint against a stored one, they simply look at the Hamming distance between the two bit-vectors. Why does this simple comparison technique succeed? Comparison of two human fingerprints is a lot more difficult, after all. It’s because a rectangular piece of paper has a nice property that human skin doesn’t: when the objects being fingerprinted have a precise, fixed geometry, fingerprint verification is easy — it is just a pointwise comparison of the corresponding features.
The result of such comparisons is this: two fingerprints from different pieces of paper match in roughly 50% of the bits, almost always in the 45%–55% range. Two fingerprints from the same piece of paper, on the other hand, differ in less than 5% of the bits, and occasionally in up to 20% if the paper has been handled particularly badly, such as by soaking. Therefore it is straightforward to infer whether or not two fingerprints came from the same piece of paper.
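That inference takes only a few lines. The 0.32 threshold below is my own choice, picked to sit comfortably between the worst same-paper case (~20%) and the best different-paper case (~45%):

```python
import random

def hamming_fraction(fp_a, fp_b):
    """Fraction of bit positions at which two fingerprints differ."""
    assert len(fp_a) == len(fp_b)
    return sum(a != b for a, b in zip(fp_a, fp_b)) / len(fp_a)

def same_paper(fp_a, fp_b, threshold=0.32):
    # Any threshold between the ~20% and ~45% bands separates the cases.
    return hamming_fraction(fp_a, fp_b) < threshold

# Toy demo: a fingerprint, a noisy rescan of the same paper, and an
# unrelated fingerprint.
random.seed(1)
original = [random.randint(0, 1) for _ in range(3200)]
rescan = [b ^ (random.random() < 0.05) for b in original]  # ~5% bit flips
other = [random.randint(0, 1) for _ in range(3200)]        # unrelated paper
print(same_paper(original, rescan), same_paper(original, other))  # True False
```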
Readers familiar with the “33 bits of entropy” concept might notice that the fingerprint here is 400 bytes long, or 3,200 bits, which is ridiculously high. There are surely fewer than 2^50 pieces of paper in the world — that’s over 100,000 for every person — which means that these fingerprints should easily be able to uniquely identify every piece of paper in the world. The authors estimate that the chance of an error is no more than 1 in 10^148. In other words, they achieve perfect accuracy.
What are the implications? As the authors point out, document identification “has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods.” On the negative side, it “could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.”
This is often referred to as device fingerprinting, but I find that a poor choice of terminology and will reserve that term for a different concept in this series.
 It is hard to estimate entropy exactly in cases like this, but the feature vector is obtained via Principal Component Analysis, which makes it likely that the entropy is close to the maximum value of 3200 bits.
Thanks to Will Clarkson for reviewing a draft of this post.