Everything Has a Fingerprint: The Case of Blank Paper

September 13, 2011 at 10:41 am 1 comment

This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy.

Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like hand geometry, as well as “behavioral biometrics” like voice, handwriting, typing patterns, and even gait analysis. Many techniques for deanonymization, the principal topic of this blog, work by “fingerprinting” people’s preferences, habits, or style.

But this article is not about biometrics, nor is it about fingerprinting of content or complex systems such as a web browser in conjunction with the OS and the user.[1] I will instead discuss one of the most surprising domains of fingerprinting — blank paper.

This is what paper looks like up close — far from being smooth, it has a rich natural structure. Even considering this, the state-of-the-art study on fingerprinting of physical documents, by Will Clarkson and colleagues at Princeton, achieves something remarkable: they show how to extract fingerprints from paper using just commodity scanners, and no microscopic technology. The fingerprint survives when the document/paper is printed on, written or scribbled on, or even soaked in water.

A small (10mm tall) region of paper scanned from two different angles — top-to-bottom and left-to-right

The image above, taken from the Princeton paper, shows what the output of a scanner looks like. Not quite the resolution of the microscopic image, but a lot of structure is still visible. The key technique is: by scanning the paper at different orientations and comparing the images, the height at each point is estimated from which a 3-D map of the not-so-flat surface of the paper is constructed.

These 3-D maps can be used as fingerprints, but for efficiency they look at the maps of only about 100 randomly picked small “patches” on the paper. To further compress the extracted information, they do a “dimensionality reduction,” resulting in a 400 byte “feature vector” for each piece of paper, which is the fingerprint.

To verify or compare an observed fingerprint against a stored one, they simply look at the Hamming distance between the two bit-vectors. Why does this simple comparison technique succeed? Comparison of two human fingerprints is a lot more difficult, after all. It’s because a rectangular piece of paper has a nice property that human skin doesn’t: when the objects being fingerprinted have a precise, fixed geometry, fingerprint verification is easy — it is just a pointwise comparison of the corresponding features.

The result of such comparisons is this: two fingerprints from different pieces of paper match in roughly 50% of the bits, almost always in the 45%–55% range. Two fingerprints from the same piece of paper, on the other hand, differ in less than 5% of the bits, and occasionally up to 20% of bits if it has been handled particularly badly, such as by soaking. Therefore it is straightforward to infer whether or not two fingerprints came from the same piece of paper.

Readers familiar with the “33 bits of entropy” concept might notice that the fingerprint here is 400 bytes long, or 3200 bits, which is ridiculously high. There are surely less than 250 pieces of paper in the world — that’s a million for every person — which means that these fingerprints should easily be able to uniquely identify every piece of paper in the world. [2] The authors estimate that the chance of an error is no more than 1 in 10148. In other words, they achieve perfect accuracy.

What are the implications? As the authors point out, document identification “has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods.” On the negative side, it “could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.”

[1] This is often referred to as device fingerprinting, but I find that a poor choice of terminology and will use reserve that term for a different concept in this series.

[2] It is hard to estimate entropy exactly in cases like this, but the feature vector is obtained via Principal Component Analysis, which makes it likely that the entropy is close to the maximum value of 3200 bits.

Thanks to Will Clarkson for reviewing a draft of this post.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.

Entry filed under: Uncategorized. Tags: .

Google+ and Privacy: A Roundup No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise

1 Comment Add your own

  • 1. Rahul  |  September 22, 2011 at 4:07 am

    I am able to appreciate your concern over “fingerprinting” of documents. Since the context revolves around biometrics, I can’t stop but ask about how do you view the UIDAI project back in India.
    Under UIDAI govt. is going to collect fingerprints of all 10 fingers + the iris scan. Some questions are being asked about it – the threat to privacy all this data is going to pose.
    Do you think such questions have any basis or these are at best just flimsy? [ I don’t think the govt. could have gone for a better method than this to give identity to 1.2bn people]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed

About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.


Be notified when there's a new post — subscribe to the feed, follow me on Google+ or twitter or use the email subscription box below.

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 245 other followers