Everything Has a Fingerprint — Don’t Forget Scanners and Printers
Previous articles in this series looked at fingerprinting of blank paper, digital cameras and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.
To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array as well as idiosyncracies of the algorithms used in the post-processing pipeline. The former is device-specific whereas the latter is make/model specific.
There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.
A group at Purdue has been very active in this area, as well as in printer identification, which I will discuss later in this article. These two papers are very relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.
To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.
Dealing with the other problem, the lack of synchronicity, is trickier. There are broadly two approaches: 1. try to synchronize the image by trying various alignments 2. extract fingerprints using statistical features of the image that are robust against desynchronization. The authors use the latter approach, mainly moment-based features of the noise vector.
Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.
The authors claim that a correlation based approach — searching for the right alignment between two images, and then directly comparing the noise vectors — won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracies and be able to distinguish between a much larger number of devices.
The privacy implications of scanner fingerprinting are of an analogous nature to digital camera fingerprinting: a whistleblower exposing scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.
The Purdue group have also worked on printer identification, both laser and inkjet. In laser printers, one prominent type of observable signature arising from printer artifacts is banding — alternating light and dark horizontal bands. The bands are subtle and not noticeable to the human eye. But they are easily algorithmically detectable, constituting a 1–2% deviation from average intensity.
Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the Fourier Transform of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.
Detecting banding in printed text is difficult because the power of the signal dominates the power of the noise. Instead the authors classify individual letters. By extracting a set of statistical features and applying an SVM classifier, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.
Needless to say, by combining the classification results from all the ‘e’s in a typical document, they were able to match documents to printers 100% of the time in their tests. Presumably the same method would apply for all other characters, but wasn’t tested due to the additional manual effort required for different shapes.
Inkjet printers seem to be even more variable than laser printers; an example is shown in the picture taken from this paper. I found it a bit hard to discern exactly what the state of the art is, but I’m guessing that if it isn’t already possible to detect different printer models with essentially perfect accuracy, it will soon be.
The privacy implications of printer identification, in the context of a whistleblower who wishes to print and mail some documents anonymously, would seem to be minimal. If you’re printing from the office, printer logs (that record a history of print jobs along with user information) would probably be a more realistic threat. If you’re using a home printer, there is typically no known set of documents that came from your printer to compare against, unless law enforcement has physical possession of your printer.