<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>33 Bits of Entropy &#187; Uncategorized</title>
	<atom:link href="http://33bits.org/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://33bits.org</link>
	<description>The End of Anonymized Data and What to Do About It</description>
	<lastBuildDate>Wed, 02 May 2012 17:37:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='33bits.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>33 Bits of Entropy &#187; Uncategorized</title>
		<link>http://33bits.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://33bits.org/osd.xml" title="33 Bits of Entropy" />
	<atom:link rel='hub' href='http://33bits.org/?pushpress=hub'/>
		<item>
		<title>Selfish Reasons to do Peer Review, and Other Program Committee Observations</title>
		<link>http://33bits.org/2012/05/02/selfish-reasons-to-do-peer-review-and-other-program-committee-observations/</link>
		<comments>http://33bits.org/2012/05/02/selfish-reasons-to-do-peer-review-and-other-program-committee-observations/#comments</comments>
		<pubDate>Wed, 02 May 2012 17:37:10 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[peer review]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=1039</guid>
		<description><![CDATA[I’ve been on several program committees in the last year and a half. As I’ve written earlier, getting a behind-the-scenes look at how things work significantly improved my perception of research and academia. This post is a more elaborate set of observations based on my experience. It is targeted both at my colleagues with the [...]]]></description>
			<content:encoded><![CDATA[<p>I’ve been on several program committees in the last year and a half. As I’ve <a href="http://33bits.org/2012/02/07/an-update-on-career-plans-and-some-observations-on-the-nature-of-research/">written earlier</a>, getting a behind-the-scenes look at how things work significantly improved my perception of research and academia. This post is a more elaborate set of observations based on my experience. It is aimed both at my colleagues, in the hope of starting a discussion, and at outsiders, as a continuation of my series explaining how the scientific community functions (which began with the post linked above).</p>
<p><strong>Benefits of doing peer review.</strong> Peer review is often considered a burden that one grudgingly accepts in order to keep the system working. But in my experience, especially for a junior researcher, the effort is well worth the time.</p>
<p>The most obvious advantage of being on a PC is that it forces you to read papers. Now if you’re the type that never needs external motivation to get things accomplished, this wouldn’t matter to you — you’d do literature study on a regular basis anyway. But many of us aren’t that disciplined; I’m certainly not.</p>
<p>There are also insights you get that you <em>can’t</em> reproduce by having perfect self-discipline. PC work gives you a raw, unfiltered look into the research that people have chosen to work on. This is a 6-month-or-so head start for getting on top of emerging trends compared to only reading published papers. You also get a better idea of common pitfalls to avoid.</p>
<p>Finally, peer review is one of the rare opportunities to read papers critically (it is harder with published work because it doesn’t have as many loopholes). This is not a natural skill for most people — our cognitive biases predispose us to confuse good rhetoric with sound logic.</p>
<p><strong>Which type of meeting?</strong> I&#8217;ve been on PCs with all three types of discussions: physical meetings, phone meetings and online. I think it&#8217;s important to have a meeting, whether physical or phone. I learn a lot, and the outcome feels fairer. Besides, quite often one reviewer is able to point out something the others have missed. Chairs of online-only PCs do try to elicit some interaction between reviewers, but for hard-to-explain but easy-to-understand reasons, the bandwidth in an interactive meeting tends to be much higher.</p>
<p>Phone meetings are suitable for smaller conferences and workshops. In my experience, members mostly tend to go on mute and tune out except when the papers they reviewed are being discussed. I don’t necessarily see a problem with this.</p>
<p>In physical meetings, I’ve found that members often make comments or voice opinions on papers they haven&#8217;t really read. I don&#8217;t think this is in the best interest of fair reviewing (although I’ve heard a contrary opinion). I wonder if a strategy involving smaller breakout groups would be more effective.</p>
<p>The one advantage of not having a meeting is of course that it saves time. I’ve found that the time commitment for the meeting is about a third of the reviewing time (for both physical and phone meetings), which I don’t consider to be too much of a burden given the improved outcomes.</p>
<p>Overall, my experience from these meetings is that members act professionally for the most part without egos or emotions getting in the way. While there is inevitably some randomness in the process, I believe that the horror stories of careless reviewers — everyone has at least one to narrate — are exaggerated. One possible reason for this misunderstanding is that there is a <em>lot</em> that&#8217;s discussed at meetings after the reviews are written, and often this feedback doesn&#8217;t make it into the reviews.</p>
<p><strong>Problem areas.</strong> Finally, here are some aspects of PCs that I think could be improved. I have deliberately omitted the most common problems (such as an untenable number of submissions and low acceptance rates) that everybody knows and talks about. Instead, these are less frequently discussed but (IMO) fairly important issues.</p>
<p><em>Lost reviews.</em> Since reviewers aren’t perfect, sometimes bad papers with persistent authors manage to get published by being resubmitted to other venues until they hit a relatively sloppy panel of reviewers. The reason this works (when it does) is that past reviews of a recycled paper are “lost”. This is a shame; it wastes reviewer effort and lowers the overall quality of publications.</p>
<p><em>Community boundaries.</em> As a reviewer I’ve started to realize how difficult it is to publish in other communities’ venues. As an example, at security conferences we often see papers by outsiders that have something useful to say, but are unfortunately inadequately familiar with the “central dogma” of crypto/security research, namely adversarial thinking. [1] While I can see the temptation to reject these papers with a cursory note, I think we should be patient with these people, explain how we do things and if possible offer to work with them to improve the paper.</p>
<p><em>Unfruitful directions.</em> Sometimes research directions don’t pan out, either because the world has moved on and the underlying assumptions are no longer true, or because the technical challenges are too hard. But researchers naturally resist having to change their research area, and so there are lots of papers written on topics that stopped being relevant years ago. The reason these papers keep getting published is that they are assigned for review to other people working in the same area. I’ve seen program chairs make an effort to push back on this, but the current situation is far from optimal.</p>
<p>In conclusion, my opinion is that peer review in my community is a relatively well-functioning process, albeit with a lot of scope for improvement. I believe this improvement can be accomplished in an evolutionary way without having to change anything too radically.</p>
<p>[1] The crypto/security community essentially derives its identity from adversarial thinking. Incidentally, I feel that it is <a href="http://33bits.org/2010/11/08/adversarial-thinking-considered-harmful-sometimes/">not always suitable</a> for privacy, which is why I believe computer scientists who study privacy should stop viewing ourselves as a subset of the security community.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2012/05/02/selfish-reasons-to-do-peer-review-and-other-program-committee-observations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>A Critical Look at Decentralized Personal Data Architectures</title>
		<link>http://33bits.org/2012/02/21/a-critical-look-at-decentralized-personal-data-architectures/</link>
		<comments>http://33bits.org/2012/02/21/a-critical-look-at-decentralized-personal-data-architectures/#comments</comments>
		<pubDate>Tue, 21 Feb 2012 16:27:54 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[distributed social networks]]></category>
		<category><![CDATA[economics]]></category>
		<category><![CDATA[personal data stores]]></category>
		<category><![CDATA[policy]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=1033</guid>
		<description><![CDATA[I have a new paper with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, Helen Nissenbaum and Dan Boneh (the Adnostic gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized [...]]]></description>
			<content:encoded><![CDATA[<p>I have a <a href="http://randomwalker.info/publications/critical-look-at-decentralization-v1.pdf">new paper</a> with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, <a href="http://www.nyu.edu/projects/nissenbaum/">Helen Nissenbaum</a> and <a href="http://crypto.stanford.edu/~dabo/">Dan Boneh</a> (the <a href="http://crypto.stanford.edu/adnostic/">Adnostic</a> gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized personal data architectures” — aren’t quite the panacea that they’ve been made out to be.</p>
<p>The paper is only a synopsis of our work so far — in our notes we have over 80 projects, papers and proposals that we’ve studied, so we intend to follow up with a more complete analysis. For now, our goal is to kick off a discussion and give the community something to think about. The paper was a lot of fun to write, and we hope you will enjoy reading it. We recognize that many of our views and conclusions may be controversial, and we welcome comments.</p>
<p><strong>Abstract</strong>:</p>
<p>While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals.</p>
<p>Developers, regulators, and consumer advocates have looked to alternative decentralized architectures as the natural response to threats posed by these centralized services.  The result has been a great variety of solutions that include personal data stores (PDS), infomediaries, Vendor Relationship Management (VRM) systems, and federated and distributed social networks.  And yet, for all these efforts, decentralized personal data architectures have seen little adoption.</p>
<p>This position paper attempts to account for these failures, challenging the accepted wisdom in the web community on the feasibility and desirability of these approaches. We start with a historical discussion of the development of various categories of decentralized personal data architectures. Then we survey the main ideas to illustrate the common themes among these efforts. We tease apart the design characteristics of these systems from the social values that they (are intended to) promote. We use this understanding to point out numerous drawbacks of the decentralization paradigm, some inherent and others incidental. We end with recommendations for designers of these systems for working towards goals that are achievable, but perhaps more limited in scope and ambition.</p>
<hr />
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or follow me on <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2012/02/21/a-critical-look-at-decentralized-personal-data-architectures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>Is Writing Style Sufficient to Deanonymize Material Posted Online?</title>
		<link>http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/</link>
		<comments>http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/#comments</comments>
		<pubDate>Mon, 20 Feb 2012 17:40:20 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[de-anonymization]]></category>
		<category><![CDATA[free speech]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[stylometry]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=1023</guid>
		<description><![CDATA[I have a new paper appearing at IEEE S&#38;P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written [...]]]></description>
			<content:encoded><![CDATA[<p>I have a <a href="http://randomwalker.info/publications/author-identification-draft.pdf">new paper</a> appearing at <a href="http://www.ieee-security.org/TC/SP2012/">IEEE S&amp;P</a> with Hristo Paskov, Neil Gong, <a href="http://www.cs.berkeley.edu/~bethenco/">John Bethencourt</a>, <a href="http://www.emilstefanov.net/">Emil Stefanov</a>, Richard Shin and <a href="http://www.cs.berkeley.edu/~dawnsong/">Dawn Song</a> on Internet-scale authorship identification based on <em>stylometry</em>, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my <a href="http://33bits.org/2009/01/15/de-anonymizing-the-internet/">previous post</a> speculating on the possibility of Internet-scale authorship identification:</p>
<p style="padding-left:30px;" dir="ltr">Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.</p>
<p>The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace <a href="http://books.google.com/books/about/Inference_and_disputed_authorship.html?id=KKKFAAAAMAAJ">determined the authorship</a> of the disputed Federalist Papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.</p>
<p><strong>Impact</strong>. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.[1]</p>
<p>Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?</p>
<p><strong>Experiments</strong>. Our experimental methodology is set up to directly address this question. Our primary data source was the <a href="http://www.icwsm.org/2009/data/">ICWSM 2009 Spinn3r Blog Dataset</a>, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove <em>k</em> random posts (typically 3) from the corresponding blog, treat those posts as anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the <em>same context</em>. We call this <em>post-to-blog matching</em>.</p>
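<p>As a rough illustration of the post-to-blog setup, here is a minimal sketch of the hold-out step. The function name <code>split_blog</code>, the fixed seed, and the toy posts are assumptions made for this example; the paper's actual sampling code is not shown here.</p>

```python
import random

def split_blog(posts, k=3, seed=0):
    """Hold out k random posts as 'anonymous'; the rest stay labeled."""
    if k >= len(posts):
        raise ValueError("need more than k posts to keep some labeled text")
    rng = random.Random(seed)  # fixed seed so the toy run is repeatable
    held_out = set(rng.sample(range(len(posts)), k))
    anonymous = [p for i, p in enumerate(posts) if i in held_out]
    labeled = [p for i, p in enumerate(posts) if i not in held_out]
    return labeled, anonymous

# Hypothetical blog with ten posts.
posts = ["post %d" % i for i in range(10)]
labeled, anonymous = split_blog(posts, k=3)
print(len(labeled), len(anonymous))  # 7 3
```

<p>The classifier is then trained on the labeled posts of every blog and asked which blog the three held-out posts belong to.</p>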
<p>In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist Papers: each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from <em>different contexts</em>. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.</p>
<p>We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to <a href="http://planete.inrialpes.fr/~perito/">Daniele Perito</a> for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this <em>blog-to-blog matching</em>. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.</p>
<p><strong>Results</strong>. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.</p>
<p>But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog <a href="http://en.wikipedia.org/wiki/Jessica_Cutler#Washingtonienne">Washingtonienne</a> we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.</p>
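<p>To make the "top guess" and "top 20" numbers concrete, here is a small sketch of how top-<em>k</em> accuracy can be computed from ranked scores. The author names and scores are invented for illustration and are not from our experiments.</p>

```python
def rank_of_true_author(scores, true_author):
    """1-based rank of the true author when candidates are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_author) + 1

def top_k_accuracy(trials, k):
    """Fraction of trials whose true author is among the k best-scoring candidates."""
    hits = sum(1 for scores, truth in trials
               if rank_of_true_author(scores, truth) <= k)
    return hits / len(trials)

# Each trial: (score per candidate author, true author). Toy values only.
trials = [
    ({"alice": 0.9, "bob": 0.4, "carol": 0.1}, "alice"),
    ({"alice": 0.2, "bob": 0.7, "carol": 0.5}, "carol"),
    ({"alice": 0.3, "bob": 0.1, "carol": 0.8}, "bob"),
    ({"alice": 0.6, "bob": 0.5, "carol": 0.2}, "alice"),
]
print(top_k_accuracy(trials, 1))  # 0.5
print(top_k_accuracy(trials, 2))  # 0.75
```

<p>In the experiments above, the analogous quantities are 20% for <em>k</em> = 1 and 35% for <em>k</em> = 20, over 100,000 candidate authors.</p>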
<p>We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.</p>
<p>An important capability is <em>confidence estimation</em>, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of <a href="http://en.wikipedia.org/wiki/Precision_and_recall">precision and recall</a>. We find that <strong>we can improve precision from 20% to over 80%</strong> with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.</p>
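<p>The trade-off can be sketched with a simple threshold on the confidence score: the algorithm only commits to a guess when its confidence is high enough. The function below and its toy data are assumptions for illustration; recall is measured against all authors, matching the way the numbers are used above.</p>

```python
def precision_recall(predictions, threshold):
    """predictions: list of (confidence, is_correct) for each author's top guess."""
    attempted = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(attempted)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(predictions)
    return precision, recall

# Toy data: 10 authors, 2 correct top guesses (a 20% baseline).
predictions = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, False),
               (0.4, False), (0.3, False), (0.2, False), (0.1, False), (0.05, False)]
print(precision_recall(predictions, 0.0))   # (0.2, 0.2): always guess
print(precision_recall(predictions, 0.85))  # (1.0, 0.1): guess only when confident

# The arithmetic from the text: 10% of 100,000 authors.
print(100_000 * 10 // 100)  # 10000
```

<p>With no threshold, precision and recall both equal the baseline accuracy. Raising the threshold trades recall for precision.</p>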
<p>We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation also works well in this setting and boosts accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.</p>
<p><em>We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition</em>. Having cracked the scale barrier, we expect accuracy improvements to come easier in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.</p>
<p><strong>Techniques</strong>. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features are the frequency of uppercase words and the number of words that occur exactly once in the text. While we mostly used the same set of features as the authors of the <a href="http://dl.acm.org/citation.cfm?id=1344413">Writeprints paper</a>, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.</p>
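<p>Here is a minimal sketch of the two example features just mentioned. The tokenizer and the definition of an "uppercase word" (all caps, at least two letters) are assumptions for this example; the exact definitions are in the paper.</p>

```python
import re
from collections import Counter

def example_features(text):
    """Compute two toy stylometric features for one post."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return {"uppercase_word_freq": 0.0, "hapax_count": 0}
    # Assumption: an "uppercase word" is all caps with at least two letters.
    uppercase = [w for w in words if len(w) >= 2 and w.isupper()]
    # Words that occur exactly once, ignoring case.
    counts = Counter(w.lower() for w in words)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"uppercase_word_freq": len(uppercase) / len(words),
            "hapax_count": hapax}

print(example_features("I REALLY like tea, and I like NASA."))
# {'uppercase_word_freq': 0.25, 'hapax_count': 4}
```

<p>A full feature vector would concatenate roughly 1,200 such numbers per post.</p>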
<p>An important component of feature extraction is to ensure that our analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.</p>
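<p>A rough sketch of both steps follows. The short <code>FUNCTION_WORDS</code> list is purely illustrative (the real list is in the paper's appendix), and the preprocessing here only strips HTML-like tags; signature removal is not attempted.</p>

```python
import re

# Illustrative only: the actual function-word list is in the paper's appendix.
FUNCTION_WORDS = ["the", "of", "and", "to", "since", "because", "but", "which"]

def strip_markup(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def function_word_features(text):
    """Relative frequency of each function word; topic words are ignored."""
    words = re.findall(r"[a-z']+", strip_markup(text).lower())
    total = len(words) or 1  # avoid dividing by zero on empty posts
    return [words.count(w) / total for w in FUNCTION_WORDS]

features = function_word_features("<p>I stayed because of the rain, and the cold.</p>")
print(dict(zip(FUNCTION_WORDS, features)))
```

<p>Because only function words are counted, "rain" and "cold" contribute to the word total but not to any feature, which is the sense in which the features are topic-free.</p>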
<p>In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “<a href="http://en.wikipedia.org/wiki/Support_vector_machine">hyperplane</a>” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.</p>
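<p>The hyperplane idea can be sketched with a one-vs-rest linear classifier. To keep the example self-contained it uses a plain perceptron rather than a support vector machine, and two-dimensional toy vectors rather than real features, so it illustrates the general scheme and not the classifiers used in the paper.</p>

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_one_vs_rest(vectors, labels, epochs=100):
    """Learn one hyperplane (w, b) per author: that author versus everyone else."""
    dim = len(vectors[0])
    models = {}
    for author in sorted(set(labels)):
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, label in zip(vectors, labels):
                y = 1 if label == author else -1
                if y * (dot(w, x) + b) <= 0:  # wrong side: nudge the hyperplane
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    b += y
        models[author] = (w, b)
    return models

def predict(models, unknown_vectors):
    """Pick the author whose hyperplane puts the unknown posts farthest on its positive side."""
    def avg_distance(model):
        w, b = model
        norm = math.sqrt(dot(w, w)) or 1.0
        return sum((dot(w, x) + b) / norm for x in unknown_vectors) / len(unknown_vectors)
    return max(models, key=lambda a: avg_distance(models[a]))

# Toy feature vectors for three hypothetical authors.
vectors = [(1.0, 0.1), (0.9, 0.0), (0.1, 1.0), (0.0, 0.9), (-1.0, -0.9), (-0.9, -1.0)]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_rest(vectors, labels)
print(predict(models, [(0.95, 0.05)]))
print(predict(models, [(0.05, 0.95)]))
```

<p>Averaging the signed distance over all of the unknown author's posts is one simple way to combine several feature vectors into a single decision.</p>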
<p>We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the <a href="http://33bits.org/about/netflix-paper-home-page/">Netflix paper</a>. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.</p>
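<p>For readers unfamiliar with eccentricity: in the Netflix paper it measures how far the best match stands out from the runner-up, in units of the standard deviation of all the scores. The sketch below follows that definition; the measure used in this paper is described only as "very similar", so treat this as an approximation, with toy scores.</p>

```python
import statistics

def eccentricity(scores):
    """Gap between the best and second-best score, divided by the spread of all scores."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ordered = sorted(scores, reverse=True)
    spread = statistics.pstdev(scores)  # population standard deviation
    if spread == 0:
        return 0.0  # all candidates tie: no basis for confidence
    return (ordered[0] - ordered[1]) / spread

clear_winner = [0.9, 0.2, 0.1, 0.2]
close_call = [0.5, 0.48, 0.1, 0.2]
print(eccentricity(clear_winner))
print(eccentricity(close_call))
```

<p>A match would be reported only when this value exceeds a chosen threshold, which is what lets precision be traded against recall.</p>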
<p>In an <a href="http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/">earlier article</a>, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.</p>
<p><strong>Concluding thoughts</strong>. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.</p>
<p>The good news for authors who would like to protect themselves against deanonymization is that <a href="https://www.cs.drexel.edu/~greenie/brennan_paper.pdf">manually changing one’s style</a> appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, few people are aware of the existence of these attacks and defenses; all the sensitive text that has already been anonymously written is also at risk of deanonymization.</p>
<p>[1] A team from Israel have <a href="http://dl.acm.org/citation.cfm?id=1961137">studied</a> authorship recognition with 10,000 authors. While this is interesting and impressive work, and bears some similarities with ours, they do not restrict themselves to stylistic analysis, and therefore the method is comparatively limited in scope. Incidentally, they have been in the <a href="http://www.msnbc.msn.com/id/44905911/ns/technology_and_science-science/">news</a> recently for some related work.</p>
<p>[2] Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple blogs. We did not use the full number that was available.</p>
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or follow me on <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>An Update on Career Plans and Some Observations on the Nature of Research</title>
		<link>http://33bits.org/2012/02/07/an-update-on-career-plans-and-some-observations-on-the-nature-of-research/</link>
		<comments>http://33bits.org/2012/02/07/an-update-on-career-plans-and-some-observations-on-the-nature-of-research/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 19:05:56 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[academia]]></category>
		<category><![CDATA[job]]></category>
		<category><![CDATA[peer review]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=1021</guid>
		<description><![CDATA[I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey. My Ph.D. years [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=1021&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey.</p>
<p class="c0">My Ph.D. years in Austin were productive and blissful. When I finished and came West, I knew I enjoyed research tremendously, but there were many aspects of research culture that made me worry about whether I’d fit in. I hoped my postdoc would give me some clarity.</p>
<p class="c0">Happily, that’s exactly what happened, especially after I started being an active participant in program committees and other community activities. It’s been an enlightening and humbling experience. I’ve come to realize that in many cases, there are perfectly good reasons why frequently-criticized aspects of the culture are just the way they are. Certainly there are still facets that are far from ideal, but my overall view of the culture of scientific research and the value of research to society is dramatically more positive than it was when I graduated.</p>
<p class="c0">Let me illustrate. One of my major complaints when I was in grad school was that almost nobody does interdisciplinary research (which is true — the percentage of research papers that span different disciplines is tiny). Then I actually tried doing it, and came to the obvious-in-retrospect realization that collaborating with people who don’t speak your language is <em><span class="c5">hard</span></em>.</p>
<p class="c0">Make no mistake, I’m as committed to cross-disciplinary research as I ever was (I just finished writing a grant proposal with Profs. <span class="c1"><a class="c3" href="http://www.nyu.edu/projects/nissenbaum/">Helen Nissenbaum</a></span> and <span class="c1"><a class="c3" href="http://www.law.berkeley.edu/php-programs/faculty/facultyProfile.php?facID=1018">Deirdre Mulligan</a></span>). I’ve gradually been getting better at it and I expect to do a lot of it in my career. But if a researcher makes a decision to stick to their sub-discipline, I can’t really fault them for that.</p>
<p class="c0">As another example, consider the lack of a &#8220;publish-then-filter&#8221; model for research papers, a whole two decades after the Web made it technologically straightforward. Many people find this incomprehensibly backward and inefficient. Academia.edu founder Richard Price wrote an <span class="c1"><a class="c3" href="http://techcrunch.com/2012/02/05/the-future-of-peer-review/">article</a></span> two days ago arguing that the future of peer review will look like a mix of Pagerank and Twitter. Three years ago, that could have been me talking. Today my view is very different.</p>
<p class="c0">Science is not a popularity contest; PageRank is irrelevant as a peer-review mechanism. Basically, scientific peer review is the only process that exists for systematically separating truths from untruths. Like democracy, it has its problems, but at least it works. Social media is probably the worst analogy: it seems to be better at amplifying falsehoods than facts. Wikipedia-style crowdsourcing has its strengths, but it can be hit-or-miss.</p>
<p class="c0">To be clear, I think peer review is probably going to change; I would like it to be done in public, for one. But even this simple change is fraught with difficulty: how would you ensure that reviewers aren’t influenced by each other’s reviews? This is an important factor in the current system. During my program committee meetings, I came to realize just how many of these little procedures for minimizing bias are built into the system and how seriously people take the spirit of this process. Revamping peer review while keeping what works is going to be slow and challenging.</p>
<p class="c0">Moving on, some of my other concerns have been disappearing due to recent events. Restrictive publisher copyrights are a perfect example. I have more of a problem with this than most researchers do — I did my Master’s in India, which means I’ve been on the other side of the paywall. But it looks like that pot may finally have <span class="c1"><a class="c3" href="http://news.sciencemag.org/scienceinsider/2012/02/thousands-of-scientists-vow-to-b.html">boiled over</a></span>. I think it’s only a matter of time now before open access becomes the norm in all disciplines.</p>
<p class="c0">There are certainly areas where the status quo is not great and not getting any better. Today if a researcher makes a discovery that’s not significant enough to write a paper about, they choose not to share that discovery at all. Unfortunately, this is the rational behavior for a self-interested researcher, because there is no way to get credit for anything other than published papers. Michael Nielsen’s <span class="c1"><a class="c3" href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902">excellent book</a></span> exploring the future of networked science gives me some hope that change may be on the horizon.</p>
<p class="c0">I hope this post has given you a more nuanced appreciation of the nature of scientific research. Misconceptions about research and especially about academia seem to be widespread among the people I talk to both online and offline; I harbored a few myself during my Ph.D., as I said earlier. So I’m thinking of doing posts like this one on a semi-regular basis on this blog or on Google+. But that will probably have to wait until after my job search is done.</p>
<p class="c0">To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or follow me on <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2012/02/07/an-update-on-career-plans-and-some-observations-on-the-nature-of-research/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>Printer Dots, Pervasive Tracking and the Transparent Society</title>
		<link>http://33bits.org/2011/10/18/printer-dotspervasive-tracking-and-the-transparent-society/</link>
		<comments>http://33bits.org/2011/10/18/printer-dotspervasive-tracking-and-the-transparent-society/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 19:35:51 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[civil liberties]]></category>
		<category><![CDATA[fingerprinting]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[surveillance]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=1005</guid>
		<description><![CDATA[So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [1, 2, 3, 4], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance. Here’s a fascinating example of tracking that’s all around you but that you’re probably [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=1005&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><em>So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [<a href="http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/">1</a>, <a href="http://33bits.org/2011/09/19/digital-camera-fingerprinting/">2</a>, <a href="http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/">3</a>, <a href="http://33bits.org/2011/10/11/everything-has-a-fingerprint-%e2%80%94-dont-forget-scanners-and-printers/">4</a>], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance.</em></p>
<p>Here’s a fascinating example of tracking that’s all around you but that you’re probably unaware of:</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">Color laser printers and photocopiers print small yellow dots on every page for tracking purposes.</div>
<p>My source for this is the EFF’s <a href="http://en.wikipedia.org/wiki/Seth_Schoen">Seth Schoen</a>, who has made his <a href="https://www.eff.org/files/filenode/printers/ccc.pdf">presentation</a> on the subject available.</p>
<p><a href="http://33bits.files.wordpress.com/2011/10/yellowtrackingdots.png"><img class="aligncenter size-medium wp-image-1007" title="yellowtrackingdots" src="http://33bits.files.wordpress.com/2011/10/yellowtrackingdots.png?w=300&h=222" alt="" width="300" height="222" /></a></p>
<p>The dots are not normally visible, but can be seen by a variety of methods such as shining a blue LED flashlight, magnification under a microscope or scanning the document with a commodity scanner. The pattern of dots typically encodes the device serial number and a timestamp; some parts of the code are as yet unidentified. There are interesting differences between the codes used by different manufacturers. [1] Some examples are shown in the pictures. There’s a lot more information in the presentation.</p>
<div id="attachment_1006" class="wp-caption aligncenter" style="width: 465px"><a href="http://33bits.files.wordpress.com/2011/10/patterns.png"><img class="size-full wp-image-1006" title="patterns" src="http://33bits.files.wordpress.com/2011/10/patterns.png?w=455&h=106" alt="" width="455" height="106" /></a><p class="wp-caption-text">Pattern of dots from three different printers: Epson, HP LaserJet and Canon.</p></div>
<p>Schoen says the dots could have been the result of the Secret Service pressuring printer manufacturers to cooperate, going back as far as the 1980s. The EFF’s Freedom of Information Act request on the matter from 2005 has been “mired in bureaucracy.”</p>
<p>The EFF as well as the <a href="http://seeingyellow.com/">Seeing Yellow project</a> would like to see these dots gone. The EFF has consistently argued against pervasive tracking. In <a href="https://www.eff.org/wp/biometrics-whos-watching-you">this article</a> on biometric surveillance, they say:</p>
<blockquote><p>EFF believes that perfect tracking is inimical to a free society. A society in which everyone&#8217;s actions are tracked is not, in principle, free. It may be a livable society, but would not be our society.</p></blockquote>
<p>Eloquently stated. You don’t have to be a privacy advocate to see that there are problems with mass surveillance, especially by the State. But I’d like to ask the question: can we really hope to stave off a surveillance society forever, or are efforts like the Seeing Yellow project just buying time?</p>
<p>My opinion is that it is impossible to put the genie back into the bottle: the cost of tracking every person, object and activity will continue to drop exponentially. I hope the present series of articles has convinced you that even if privacy advocates are successful in preventing the deployment of <em>explicit</em> tracking mechanisms, just about everything around you is <em>inherently</em> trackable. [2]</p>
<p>And even if we can prevent the State from setting up a surveillance infrastructure, there are undeniable commercial benefits in tracking everything that’s trackable, which means that private actors will deploy this infrastructure, as they’ve done with online tracking. If history is any indication, most people will happily allow themselves to be tracked in exchange for free or discounted services. From there it’s a simple step for the government to obtain the records of any person of interest.</p>
<p>If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is <strong>symmetrically distributed</strong>, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.</p>
<p>Another important principle is for those who do conduct tracking to be required to be <strong>transparent</strong> about it, to have social and legal processes in place to determine what uses are acceptable, and to allow opting out in contexts where that makes sense. Because ultimately what matters in terms of societal freedom is not surveillance itself, but how surveillance affects the balance of power. To be sure, the society I describe — pervasive but transparent tracking, accessible to everyone, and with limited opt-outs — would be different from ours, and would take some adjusting to, but that doesn’t make it <em>worse</em> than ours.</p>
<p>I am hardly the first to make this argument. A similar position was first prominently articulated by David Brin in his 1999 book <a href="http://www.amazon.com/Transparent-Society-Technology-Between-Privacy/dp/0738201448">Transparent Society</a>. What the last decade has shown is just how inevitable pervasive tracking is. For example, Brin focused too much on cameras and assumed that tracking people indoors would always be infeasible. That view seems almost quaint today.</p>
<p>Let me be clear: I have absolutely no beef with efforts to oppose pervasive tracking. Even if being watched all of the time is our eventual destiny, society won’t be ready for it any time soon; these changes take decades if not generations. The pace at which the industry wants to make us switch to “living in public” is far faster than we’re capable of. Buying time is therefore extremely valuable.</p>
<p>That said, embracing the Transparent Society view has important consequences for civil libertarians. It suggests working toward an achievable if sub-optimal goal instead of an ideal but impossible one. It also suggests that the “democratization of surveillance” should be <em>encouraged</em> rather than feared.</p>
<p>Here are some currently hot privacy and civil-liberties issues that I think will have a significant impact on the distribution of power in a ubiquitous-surveillance society: <a href="http://www.aclu.org/blog/free-speech/it-legal-photograph-or-videotape-police">the right to videotape on-duty police officers and other public officials</a>, transparent government initiatives including <a href="http://en.wikipedia.org/wiki/Freedom_of_Information_Act_(United_States)">FOIA</a> requests, and closer to my own interests, the <a href="http://donottrack.us">Do Not Track</a> opt-out mechanism, and tools like <a href="http://fourthparty.info/">FourthParty</a> which have helped illuminate the dark world of online tracking.</p>
<p>Let me close by calling out one battle in particular. Throughout this series, we’ve seen that fingerprinting techniques have security-enhancing applications (such as forensics), as well as privacy-infringing ones, but that most research papers on fingerprinting consider only the former question. I believe the primary reason is that <em>funding</em> is for the most part available only for the former type of research and not for the latter. However, we need a culture of research into privacy-infringing technologies, whether funded by federal grants or otherwise, in order to achieve the goals of symmetry and transparency in tracking.</p>
<p>[1] Note that this is just an encoding and not encryption. The current system allows anyone to read the dots; public-key encryption would allow at least nominally restricting the decoding ability to only law-enforcement personnel, but there is no evidence that this is being done.</p>
<p>[2] This is analogous to the cookies-vs-fingerprinting issue in online tracking, and why cookie-blocking alone is not sufficient to escape tracking.</p>
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or <a href="http://twitter.com/random_walker">follow me on Twitter</a> or <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/10/18/printer-dotspervasive-tracking-and-the-transparent-society/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/yellowtrackingdots.png?w=300" medium="image">
			<media:title type="html">yellowtrackingdots</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/patterns.png" medium="image">
			<media:title type="html">patterns</media:title>
		</media:content>
	</item>
		<item>
		<title>Everything Has a Fingerprint — Don&#8217;t Forget Scanners and Printers</title>
		<link>http://33bits.org/2011/10/11/everything-has-a-fingerprint-%e2%80%94-dont-forget-scanners-and-printers/</link>
		<comments>http://33bits.org/2011/10/11/everything-has-a-fingerprint-%e2%80%94-dont-forget-scanners-and-printers/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 18:02:25 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fingerprinting]]></category>
		<category><![CDATA[forensics]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=994</guid>
		<description><![CDATA[Previous articles in this series looked at fingerprinting of blank paper, digital cameras and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting. To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and this can be used to match an image to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=994&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<div>
<p><em>Previous articles in this series looked at fingerprinting of <a href="http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/">blank paper</a>, <a href="http://33bits.org/2011/09/19/digital-camera-fingerprinting/">digital cameras</a> and <a href="http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/">RFID chips</a>. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.</em></p>
<p>To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array as well as idiosyncrasies of the algorithms used in the post-processing pipeline. The former is device-specific whereas the latter is make/model specific.</p>
<p>There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.</p>
<p>A <a href="http://cobweb.ecn.purdue.edu/~prints/">group at Purdue</a> has been very active in this area, as well as in printer identification, which I will discuss later in this article. These <a href="http://cobweb.ecn.purdue.edu/~prints/public/papers/ei07-nitin2.pdf">two</a> <a href="http://cobweb.ecn.purdue.edu/~prints/public/papers/iwcf08_khanna.pdf">papers</a> are very relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.</p>
<p>To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.</p>
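<p>The collapsing step can be sketched in a few lines. This is a rough illustration rather than the authors’ pipeline: the 3x3 mean filter is a hypothetical stand-in for whatever denoiser they actually use, and the function names and the tiny synthetic “scan” are invented.</p>
<pre>
```python
def denoise(image):
    """Crude denoiser: 3x3 mean filter, clamped at the edges (a stand-in only)."""
    height, width = len(image), len(image[0])
    smooth = []
    for r in range(height):
        row = []
        for c in range(width):
            rows = range(max(r - 1, 0), min(r + 2, height))
            cols = range(max(c - 1, 0), min(c + 2, width))
            values = [image[i][j] for i in rows for j in cols]
            row.append(sum(values) / len(values))
        smooth.append(row)
    return smooth

def noise_2d(image):
    """2-dimensional noise: the image minus its denoised version."""
    smooth = denoise(image)
    return [[p - s for p, s in zip(r1, r2)] for r1, r2 in zip(image, smooth)]

def collapse_rows(noise):
    """Average all rows into a single row: entry c is the mean of column c."""
    return [sum(column) / len(noise) for column in zip(*noise)]

# Synthetic scan of a flat page: sensor element 1 reads high, element 3 reads low.
scan = [[10.0, 12.0, 10.0, 8.0, 10.0] for _ in range(4)]
pattern = collapse_rows(noise_2d(scan))
print(pattern)
```
</pre>
<p>Because a one-dimensional sensor sweeps down the page, each column of the scan comes from the same sensor element, which is why averaging over rows reinforces the sensor’s imperfections while averaging away the page content.</p>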
<p>Dealing with the other problem, the lack of synchronicity, is trickier. There are broadly two approaches: (1) try to synchronize the image by trying various alignments, or (2) extract fingerprints using statistical features of the image that are robust against desynchronization. The authors use the latter approach, mainly <a href="http://en.wikipedia.org/wiki/Standardized_moment">moment</a>-based features of the noise vector.</p>
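<p>To see why moment-based features tolerate desynchronization, here is a small sketch. The exact feature set the authors use is not spelled out here, so the four features below (mean, standard deviation, skewness, kurtosis) are my assumption, and modeling a misplaced page as a circular shift of the noise vector is a deliberate oversimplification.</p>
<pre>
```python
import math

def moment_features(noise):
    """Mean, standard deviation, skewness and kurtosis of a 1-D noise vector."""
    n = len(noise)
    mean = sum(noise) / n
    centered = [v - mean for v in noise]
    std = math.sqrt(sum(d ** 2 for d in centered) / n)
    skewness = sum(d ** 3 for d in centered) / n / std ** 3
    kurtosis = sum(d ** 4 for d in centered) / n / std ** 4
    return (mean, std, skewness, kurtosis)

# Invented noise vector, and the same vector as seen after the page is shifted.
noise = [0.5, -1.0, 0.25, 2.0, -0.75, -1.0]
shifted = noise[2:] + noise[:2]

features = moment_features(noise)
shifted_features = moment_features(shifted)
print(features)
print(shifted_features)
```
</pre>
<p>The two feature tuples agree because moments depend only on which values occur, not on where they occur. A pixel-wise comparison of <code>noise</code> and <code>shifted</code>, by contrast, would see two unrelated vectors.</p>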
<p>Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.</p>
<p>The authors claim that a correlation-based approach (searching for the right alignment between two images, and then directly comparing the noise vectors) won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracies and be able to distinguish between a much larger number of devices.</p>
<p>The privacy implications of scanner fingerprinting are of an analogous nature to digital camera fingerprinting: a whistleblower exposing scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.</p>
<p>The Purdue group have also worked on <a href="http://cobweb.ecn.purdue.edu/~prints/public/papers/nip05-mikkilineni.pdf">printer identification</a>, both laser and inkjet. In laser printers, one prominent type of observable signature arising from printer artifacts is <em>banding</em> — alternating light and dark horizontal bands. The bands are subtle and not noticeable to the human eye. But they are easily algorithmically detectable, constituting a 1–2% deviation from average intensity.</p>
<div id="attachment_995" class="wp-caption aligncenter" style="width: 465px"><a href="http://33bits.files.wordpress.com/2011/10/printer-fingerprint.png"><img class="size-full wp-image-995 " title="Laser printer signature" src="http://33bits.files.wordpress.com/2011/10/printer-fingerprint.png?w=455&h=364" alt="" width="455" height="364" /></a><p class="wp-caption-text">Fourier Transform of greyscale amplitudes of a background fill (printed with an HP LaserJet)</p></div>
<p>Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the <a href="http://en.wikipedia.org/wiki/Fourier_transform">Fourier Transform</a> of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.</p>
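<p>The measurement is easy to simulate. In the sketch below, the row-wise average intensities are synthetic: I assume a 600 dpi scan of one inch of grey fill and inject two faint sinusoids at the peak frequencies quoted above, with invented amplitudes in the 1–2% range. A plain discrete Fourier transform (written out by hand to avoid dependencies) then recovers the two peaks.</p>
<pre>
```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of the DFT for bins 0 .. n/2 - 1, after removing the mean."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    magnitudes = []
    for k in range(n // 2):
        total = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        magnitudes.append(abs(total) / n)
    return magnitudes

DPI = 600  # assumed scan resolution; one inch of rows, so bin k = k cycles/inch

# Synthetic row averages of a mid-grey fill with two faint banding components.
row_means = [
    0.5 * (1
           + 0.015 * math.sin(2 * math.pi * 132 * t / DPI)
           + 0.010 * math.sin(2 * math.pi * 150 * t / DPI))
    for t in range(DPI)
]

magnitudes = dft_magnitudes(row_means)
strongest = sorted(range(len(magnitudes)), key=magnitudes.__getitem__)[-2:]
peaks = sorted(strongest)
print(peaks)
```
</pre>
<p>With a real scan the spectrum would be much noisier, and you would want a proper FFT library and some windowing, but the signature is read off the same way: as the locations of the dominant peaks.</p>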
<p>Detecting banding in printed text is difficult because the power of the signal (the text itself) dominates the power of the noise (the banding). Instead the authors classify <em>individual letters</em>. By extracting a set of statistical features and applying an <a href="http://en.wikipedia.org/wiki/Support_vector_machine">SVM classifier</a>, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.</p>
<p>Needless to say, by combining the classification results from all the ‘e’s in a typical document, they were able to match documents to printers 100% of the time in their tests. Presumably the same method would apply to all other characters, but this wasn’t tested due to the additional manual effort required for different shapes.</p>
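<p>To see why pooling many noisy per-letter decisions gives a near-perfect document-level result, here is a small sketch. The combining rule is not spelled out above, so simple majority voting is an assumption, as are the independence of per-letter errors, the figure of 101 letters per document and the printer labels. Requiring a strict majority of correct votes is also stricter than necessary with 10 candidate printers, so the computed probability is a lower bound under these assumptions.</p>

```python
from collections import Counter
from math import comb

def vote(predictions):
    """Return the printer label predicted most often across the letters."""
    return Counter(predictions).most_common(1)[0][0]

def majority_correct_probability(n_letters, per_letter_accuracy):
    """P(more than half of n independent per-letter votes are correct)."""
    p = per_letter_accuracy
    need = n_letters // 2 + 1
    return sum(comb(n_letters, k) * p**k * (1 - p)**(n_letters - k)
               for k in range(need, n_letters + 1))

# Hypothetical classifier outputs for five instances of the letter 'e'.
per_letter = ["printer_3", "printer_3", "printer_7", "printer_3", "printer_3"]
print(vote(per_letter))
print(majority_correct_probability(101, 0.9))
```

<p>With 90% per-letter accuracy, a wrong majority would need dozens of independent mistakes at once, which is why the chance of misattributing a whole document becomes negligible.</p>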
<div id="attachment_996" class="wp-caption aligncenter" style="width: 424px"><a href="http://33bits.files.wordpress.com/2011/10/printer_vertical_lines.png"><img class="size-full wp-image-996" title="Inkjet printers: vertical lines" src="http://33bits.files.wordpress.com/2011/10/printer_vertical_lines.png?w=455" alt=""   /></a><p class="wp-caption-text">Vertical lines printed by three different inkjet printers</p></div>
<p>Inkjet printers seem to be even more variable than laser printers; an example is shown in the picture taken from <a href="http://cobweb.ecn.purdue.edu/~prints/public/papers/sp_article_09_chiang.pdf">this paper</a>. I found it a bit hard to discern exactly what the state of the art is, but I’m guessing that if it isn’t already possible to detect different printer models with essentially perfect accuracy, it will soon be.</p>
<p>The privacy implications of printer identification, in the context of a whistleblower who wishes to print and mail some documents anonymously, would seem to be minimal. If you’re printing from the office, printer logs (that record a history of print jobs along with user information) would probably be a more realistic threat. If you’re using a home printer, there is typically no known set of documents that came from your printer to compare against, unless law enforcement has physical possession of your printer.</p>
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or <a href="http://twitter.com/random_walker">follow me on Twitter</a> or <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/10/11/everything-has-a-fingerprint-%e2%80%94-dont-forget-scanners-and-printers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/printer-fingerprint.png" medium="image">
			<media:title type="html">Laser printer signature</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/printer_vertical_lines.png" medium="image">
			<media:title type="html">Inkjet printers: vertical lines</media:title>
		</media:content>
	</item>
		<item>
		<title>Fingerprinting of RFID Tags and High-Tech Stalking</title>
		<link>http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/</link>
		<comments>http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/#comments</comments>
		<pubDate>Tue, 04 Oct 2011 21:20:19 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fingerprinting]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=989</guid>
		<description><![CDATA[Previous articles in this series looked at fingerprinting of blank paper and digital cameras. This article is about fingerprinting of RFID, a domain where research has directly investigated the privacy threat, namely tracking people in public. The principle behind RFID fingerprinting is the same as with digital cameras: Microscopic physical irregularities due to natural structure and/or manufacturing defects [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=989&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><em>Previous articles in this series looked at fingerprinting of <a href="http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/">blank paper</a> and <a href="http://33bits.org/2011/09/19/digital-camera-fingerprinting/">digital cameras</a>. This article is about fingerprinting of RFID, a domain where research has directly investigated the privacy threat, namely tracking people in public.</em></p>
<p>The principle behind RFID fingerprinting is the same as with digital cameras:</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">Microscopic physical irregularities due to natural structure and/or manufacturing defects cause observable, albeit tiny, behavioral differences.</div>
<p><strong>The basics.</strong> First let’s get the obvious question out of the way: why are we talking about devious methods of identifying RFID chips, when the primary raison d&#8217;être of RFID is to enable unique identification? Why not just use them in the normal way?</p>
<p>The answer is that fingerprinting, which exploits the physical properties of RFID chips rather than their logical behavior, allows identifying them in unintended ways and in unintended contexts, and this is powerful. RFID applications, for example in e-passports or smart cards, can often be <a href="http://www.schneier.com/blog/archives/2008/08/hacking_mifare.html">cloned</a> at the logical level, either because there is no authentication or because authentication is broken. Fingerprinting can make the system (more) secure, since fingerprints arise from microscopic randomness and there is no known way to create a tag with a given fingerprint.</p>
<p>If sensor patterns in digital cameras are a relatively clean example of fingerprinting, RF (and anything to do with the electromagnetic spectrum in general) is the opposite. First, the data is an arbitrary waveform instead of a fixed-size sequence of bits. This means that a simple point-by-point comparison won’t work for fingerprint verification; the task is conceptually more similar to algorithmically comparing two faces. Second, the probe signal itself is variable. RFID chips are passive: they respond to the signal produced by the reader (and draw power from it).[1] This means that the fingerprinting system is in full control of what kind of signal to interrogate the chip with. It’s a bit like being given a blank canvas to paint on.</p>
<p><strong>Techniques.</strong> A <a href="http://www.syssec.ethz.ch/research/identification">group at ETH Zurich</a> has done some impressive work in this area. In their <a href="http://www.syssec.ethz.ch/research/usenixsec09_phyid_rfid.pdf">2009 paper</a>, they report being able to compare an RFID card with a stored fingerprint and determine if they are the same, with an error rate of 2.5%–4.5% depending on settings.[2] They use two types of signals to probe the chip with — “burst” and “sweep” — and extract features from the response based on the <a href="http://en.wikipedia.org/wiki/Frequency_spectrum">spectrum</a>.</p>
<div id="attachment_990" class="wp-caption aligncenter" style="width: 465px"><a href="http://33bits.files.wordpress.com/2011/10/rfid.png"><img class="size-full wp-image-990" title="rfid" src="http://33bits.files.wordpress.com/2011/10/rfid.png?w=455&h=161" alt="" width="455" height="161" /></a><p class="wp-caption-text">Chip response to different signals. Fingerprints are extracted from characteristic features of these responses.</p></div>
<p>Other papers have demonstrated different ways to generate signals/extract features. A University of Arkansas team <a href="http://comp.uark.edu/~drt/pubs/2010/Fingerprinting_RFID_Tags2010.pdf">exploited</a> the minimum power required to get a response from the tag at various frequencies. The authors achieved a 94% true-positive rate using 50 identical tags, with only a 0.1% false-positive rate. (About 6% of the time, the algorithm didn’t produce an output.)</p>
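<p>A rough sketch of how a minimum-power fingerprint could be matched, including the option of producing no output at all. This is not the Arkansas team's classifier: the nearest-neighbour rule, the acceptance radius, the tag names and the power readings (imagined as dBm values at four probe frequencies) are all hypothetical and chosen only to illustrate the idea.</p>

```python
import math

def distance(a, b):
    """Euclidean distance between two minimum-power profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(measured, enrolled, accept_radius):
    """Return the closest enrolled tag id, or None if nothing is close enough."""
    best_id = min(enrolled, key=lambda tag_id: distance(measured, enrolled[tag_id]))
    if distance(measured, enrolled[best_id]) > accept_radius:
        return None   # abstain rather than risk a false positive
    return best_id

# Hypothetical enrolled fingerprints: minimum power needed to get a
# response at each of four probe frequencies.
enrolled = {
    "tag_A": [12.1, 13.4, 15.0, 14.2],
    "tag_B": [12.9, 13.0, 14.1, 15.0],
}

print(identify([12.0, 13.5, 15.1, 14.1], enrolled, accept_radius=0.5))
print(identify([20.0, 20.0, 20.0, 20.0], enrolled, accept_radius=0.5))
```

<p>Tightening the acceptance radius trades answers for reliability: more measurements are left unclassified, but fewer are assigned to the wrong tag.</p>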
<p>Yet other features, namely the energy and <a href="http://en.wikipedia.org/wiki/Q_factor">Q factor</a> of higher harmonics, were studied in a <a href="http://www.nist.gov/pml/electromagnetics/rf_electronics/upload/RFID_counter_TMTT.pdf">couple</a> of <a href="http://www.nist.gov/pml/electromagnetics/rf_electronics/upload/RFID-resonance.pdf">papers</a> out of NIST. In the latter work, the authors experimented with 20 cards: 4 batches of 5 ‘identical’ cards each. The overall identification accuracy was 96%.</p>
<p>It seems safe to say that RFID fingerprinting techniques are still in their infancy, and there is much room for improvement by considering new categories of features, by combining different types of features, or by using different classification algorithms on the extracted features.</p>
<p><strong>Privacy.</strong> RF fingerprinting, like other types of fingerprinting, shows a duality between security-enhancing and privacy-infringing applications, but in a less direct way. There are two types of RFID systems: “near-field” based on inductive coupling, used in contactless smartcards and the like, and “far-field” based on backscatter, used in vehicle identification, inventory control, etc. <em>The papers discussed so far pertain to near-field systems.</em> There are no real privacy-infringing applications of near-field RF fingerprinting, because you can’t get close enough to extract a fingerprint without the owner of the tag knowing about it. Far-field systems, to which we will now turn, are ideally suited to high-tech stalking.</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">Fingerprinting provides the ability to enhance the security of near-field RFID systems and to infringe privacy in the context of far-field RFID chips.</div>
<p>In a recent <a href="http://www.syssec.ethz.ch/research/zanetti_pets11_CR.pdf">paper</a>, the Zurich team mentioned earlier investigated the possibility of tracking people in a shopping mall using strategically placed sensors, assuming that shoppers have several (far-field) RFID tags on them. The point is that it is possible to design chips that prevent tracking at the logical level by authenticating the reader, but this is impossible at the physical level.</p>
<p>Why would people have RFID tags on them? Tags used for inventory control in stores and not deactivated at the point of sale are one <a href="http://online.wsj.com/article/SB10001424052748704421304575383213061198090.html">increasingly common possibility</a>: they would end up in shopping bags (or even on clothes being worn, although that’s less likely). RFID tags in wallets and medical devices are another source; these are tags that the user <em>wants</em> to be present and functional.</p>
<p>What makes the tracking device the authors built powerful is that it is low-cost and can be operated surreptitiously at some distance from the victim: up to 2.75 meters, or 9 feet. They show that 5.4 bits of entropy can be extracted from a single tag, which means that 5 tags on a person gives 22 bits, easily enough to distinguish everyone who might be in a particular mall.</p>
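<p>To put the 22-bit figure in perspective, here is a quick piece of arithmetic. It takes the figure quoted above as given; the mall population of 100,000 is a made-up number for illustration, and the function names are mine.</p>

```python
import math

def distinguishable(bits):
    """Number of distinct values that a fingerprint of this many bits can take."""
    return 2 ** bits

def bits_needed(population):
    """Bits of entropy required to single out one member of a population."""
    return math.log2(population)

print(distinguishable(22))    # 4194304
print(bits_needed(100_000))   # about 16.6
```

<p>So 22 bits can in principle tell apart about 4.2 million tag sets, while picking one shopper out of a hypothetical 100,000 needs fewer than 17 bits. The remaining bits are headroom against measurement error and collisions.</p>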
<p>To assess the practical privacy risk, technological feasibility is only one dimension. We also need to ask who the adversary is and what the incentives are. Tracking people, especially shoppers, in physical space has the strongest incentive of all: selling products. While online tracking is pervasive, the majority of shopping dollars are still spent offline, and there’s still no good way to automatically identify people when they are in the vicinity in order to target offers to them. Facial recognition technology is highly error-prone and creeps people out, and that’s where RF fingerprinting comes in.</p>
<p>That said, RF fingerprinting is only one of the many ways of passively tracking people <em>en masse</em> in physical space — unintentional leaks of identifiers from smartphones and logical-layer identification of RFID tags seem more likely — but it’s probably the hardest to defend against. It is possible to disable RFID tags, but this is usually irreversible and it’s difficult to be sure you haven’t missed any. RFID jammers are another option but they are far from easy to use and are probably <a href="http://consumerist.com/2007/01/protect-your-rfid-credit-card-with-a-rf-jammer.html">illegal in the U.S</a>. One of the ETH Zurich researchers <a href="http://www.mics.org/Workshop2011/Slides/Zanetti_WS11_FingerprintingRFIDTags.pdf">suggests</a> tinfoil wrapping when going out shopping :-)</p>
<p style="text-align:center;"><img class="aligncenter" title="tinfoil" src="http://33bits.files.wordpress.com/2011/10/tinfoil.png?w=162&h=338" alt="" width="162" height="338" /></p>
<p>[1] Active RFID chips exist but most commercial systems use passive ones, and that’s what the fingerprinting research has focused on.</p>
<p>[2] They used a population of 50 tags, but this number is largely irrelevant since the experiment was one of binary classification rather than 1-out-of-n identification.</p>
<p>&nbsp;</p>
<p><em>Thanks to Vincent Toubiana for comments on a draft.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/rfid.png" medium="image">
			<media:title type="html">rfid</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/10/tinfoil.png" medium="image">
			<media:title type="html">tinfoil</media:title>
		</media:content>
	</item>
		<item>
		<title>No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise</title>
		<link>http://33bits.org/2011/09/19/digital-camera-fingerprinting/</link>
		<comments>http://33bits.org/2011/09/19/digital-camera-fingerprinting/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 17:25:56 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[de-anonymization]]></category>
		<category><![CDATA[fingerprinting]]></category>
		<category><![CDATA[signal processing]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=980</guid>
		<description><![CDATA[The previous article looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme to another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale. For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=980&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p class="c0"><em><span class="c4">The </span><span class="c6 c4"><a class="c2" href="http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/">previous article</a></span><span class="c4"> looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme to another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale.</span></em></p>
<p class="c0">For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that come out of a manufacturing plant behave slightly differently in characteristic ways, and can therefore be distinguished based on their output or behavior. How could this be? The unifying principle is this:</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">Microscopic physical irregularities due to natural structure and/or manufacturing defects cause observable, albeit tiny, behavioral differences.</div>
<p class="c0">Digital camera identification belongs to a class of techniques that exploits ‘pattern noise’ in the ‘sensor arrays’ that capture images. The same techniques can be used to fingerprint a scanner by analyzing pixel-level patterns in the images scanned by it, but that’ll be the focus of a later article.</p>
<p class="c0 c9" style="text-align:center;"><a href="http://33bits.files.wordpress.com/2011/09/imageds.jpg"><img class="aligncenter size-full wp-image-981" title="Dark signal" src="http://33bits.files.wordpress.com/2011/09/imageds.jpg?w=455&h=303" alt="" width="455" height="303" /></a></p>
<p class="c0"><strong>A long-exposure dark frame [<span class="c6"><a class="c2" href="http://www.cameralabs.com/forum/viewtopic.php?t=1094">source</a></span>]. Click image to see full size. Three ‘hot pixels’ and some other sensor noise can be seen.</strong></p>
<p class="c0">A photo taken in the absence of any light doesn’t look completely black; a variety of factors introduce noise. There is random noise that varies in every image, but there is also ‘pattern noise’ due to inherent structural defects or irregularities in the physical sensor array. The key property of the latter kind of noise is that it manifests the same way in every image taken by the camera.[1] Thus, the total noise vector produced by a camera is not identical between images, nor is it completely independent.</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">The pixel-level noise components in images taken by the same camera are correlated with each other.</div>
<p class="c0">Nevertheless, separating the pattern noise from random noise and the image itself — after all, a good camera will seek to minimize the strength or ‘power’ of the noise in relation to the image — is a very difficult task, and is the primary technical challenge that camera fingerprinting techniques must address.</p>
<p class="c0"><strong><span class="c3">Security vs. privacy.</span></strong> A quick note about the applications of camera fingerprinting. We saw in the <span class="c6"><a class="c2" href="http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/">previous article</a></span> that there are security-enhancing and privacy-infringing applications of document fingerprinting. In fact, this is almost <em><span class="c4">always</span></em> the case with fingerprinting techniques. [2]</p>
<p class="c0">Camera fingerprinting can be used on the one hand for detecting forgeries (e.g., photoshopped images), and to <span class="c6"><a class="c2" href="http://www.physorg.com/news64638499.html">aid criminal investigations</a></span> by determining who (or rather, which camera) might have taken a picture. On the other hand, it could potentially also be used for unmasking individuals who wish to disseminate photos anonymously online.</p>
<p class="c0">Sadly, most papers studying fingerprinting study only the former type of application, which is why we’ll have to speculate a bit on the privacy impact, even though the underlying math of fingerprinting is the same.</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">Most fingerprinting techniques have both security-enhancing and privacy-infringing applications. The underlying principles are the same but they are applied slightly differently.</div>
<p class="c0">Another point to note is that because of the focus on forensics, <em><span class="c4">most of the work in this area so far has studied distinguishing different camera models</span></em>. But there are some preliminary results on distinguishing ‘identical’ cameras, and it appears that the same techniques will work.</p>
<p class="c0"><strong><span class="c3">In more detail.</span></strong> Let’s look at what I think is the most well-known <span class="c6"><a class="c2" href="http://www.ws.binghamton.edu/fridrich/Research/double.pdf">paper</a></span> on sensor pattern noise fingerprinting, by Binghamton University researchers Jan Lukáš, <span class="c6"><a class="c2" href="http://en.wikipedia.org/wiki/Jessica_Fridrich">Jessica Fridrich</a></span>, and Miroslav Goljan.[3] Here’s how it works: the first step is to build a reference pattern for each camera from multiple known images taken with it, so that later an unsourced image can be compared against these reference patterns. The authors suggest using at least 50 images, but for good measure, they use 320 in their experiments. In the forensics context, the investigator probably has physical possession of the camera and therefore can generate an unlimited number of images. We’ll discuss what this requirement means in the privacy-breach context later.</p>
<p class="c0">There are two steps to build the reference pattern. First, for each image, a <span class="c6"><a class="c2" href="http://en.wikipedia.org/wiki/Noise_reduction#In_images">denoising filter</a></span> is applied, and the denoised image is subtracted from the original to leave only the noise. Next, the noise is averaged across all the reference images — this way the random noise cancels out and leaves the pattern noise.</p>
<p class="c0">Comparing a new image to a reference pattern, to test if it came from that camera, is easy: extract the noise from the test image, and compare this noise pixel-by-pixel with the reference noise. The noise from the test image includes random noise, so the match won’t be close to perfect, but nevertheless the <em><span class="c4">correlation</span></em> between the two noise patterns will be roughly equal to the contribution of pattern noise towards the total noise in the test image. On the other hand, if the test image didn’t come from the same camera, the correlation will be close to zero.</p>
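<p class="c0">The reference-pattern and matching steps can be sketched on synthetic data. This is a toy model under loud assumptions, not the Binghamton implementation: a 3x3 mean filter stands in for their wavelet denoiser, the ‘sensor’ is a hypothetical 24x24 grid, scenes are flat, and pattern noise is modelled as purely additive (the multiplicative component mentioned in footnote [1] is ignored). All names and noise levels are invented for the example.</p>

```python
import random

SIZE = 24  # hypothetical tiny sensor, SIZE x SIZE pixels

def box_blur(img):
    """3x3 mean filter with clamped borders; a crude stand-in denoiser."""
    out = []
    for r in range(SIZE):
        row = []
        for c in range(SIZE):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), SIZE - 1)
                    cc = min(max(c + dc, 0), SIZE - 1)
                    total += img[rr][cc]
            row.append(total / 9)
        out.append(row)
    return out

def noise_residual(img):
    """Original minus denoised image, flattened to a vector."""
    smooth = box_blur(img)
    return [img[r][c] - smooth[r][c] for r in range(SIZE) for c in range(SIZE)]

def reference_pattern(images):
    """Average the residuals so random noise cancels and pattern noise remains."""
    residuals = [noise_residual(img) for img in images]
    return [sum(vals) / len(vals) for vals in zip(*residuals)]

def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def make_camera(rng):
    """A camera is modelled only by its fixed additive pattern noise."""
    return [[rng.gauss(0, 2) for _ in range(SIZE)] for _ in range(SIZE)]

def shoot(camera, rng):
    """Flat scene of random brightness + pattern noise + fresh random noise."""
    brightness = rng.uniform(50, 200)
    return [[brightness + camera[r][c] + rng.gauss(0, 2) for c in range(SIZE)]
            for r in range(SIZE)]

rng = random.Random(1)
cam_a, cam_b = make_camera(rng), make_camera(rng)
ref_a = reference_pattern([shoot(cam_a, rng) for _ in range(50)])
same = correlation(ref_a, noise_residual(shoot(cam_a, rng)))
other = correlation(ref_a, noise_residual(shoot(cam_b, rng)))
print(round(same, 2), round(other, 2))
```

<p class="c0">In this toy setup pattern noise and random noise have equal power, so the same-camera correlation should land well above zero but clearly short of 1, while the other-camera correlation should hover near zero. Real images are much harder because scene content leaks into the residual.</p>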
<p class="c0">The authors experimented with nine cameras, of which two were from the same brand and model (Olympus Camedia C765). In addition, two other cameras had the same type of sensor. There was not a single error in their 2,700 tests, including those involving the two ‘identical’ cameras; in each case, the algorithm correctly identified which of the nine cameras a given image came from. By extrapolating the correlation curves, they conservatively estimate that for a False Accept Rate of 10<sup>-3</sup>, their method achieves a False Reject Rate of anywhere between 10<sup>-2</sup> and 10<sup>-10</sup> or even less depending on the camera model and camera settings.</p>
<p class="c0">The takeaway from this seems to be that distinguishing between cameras of different models can be performed with essentially perfect accuracy. Distinguishing between cameras of the same model also seems to have very high accuracy, but it is hard to generalize because of the small sample size.</p>
<p class="c0"><strong><span class="c3">Improvements.</span></strong> Impressive as the above numbers are, there are at least two major ways in which this result can be, and has been, improved. First, the Binghamton paper is focused on a specific signal, sensor noise. But there are several stages in the image acquisition and processing pipeline in the camera, each of which could leave idiosyncratic effects on the image. <span class="c6"><a class="c2" href="http://www.busim.ee.boun.edu.tr/~sankur/SankurFolder/IEEE_IFS_Cellphon_Camera.pdf">This paper</a></span> out of Turkey incorporates many such effects by considering all patterns of certain types that occur in the lower order (least significant) bits of the image, which seems like a rather powerful technique.</p>
<p class="c0">The effects other than sensor noise seem to help more with identifying the camera model than the specific device, but to the extent that the former is a component of the latter, it is useful. They achieve 97.5% accuracy among 16 test cameras, and these were cellphone cameras producing pictures at a resolution of just 640&#215;480.</p>
<p class="c0">Second is the effect of the scene itself on the noise. Denoising transformations are not perfect — sharp boundaries look like noise. The Binghamton researchers picked their denoising filter (a wavelet transform) to minimize this problem, but a recent <span class="c6"><a class="c2" href="http://wrap.warwick.ac.uk/3318/1/WRAP_Li_Source_Camera.pdf">paper</a></span> by Chang-Tsun Li claims to do it better, and shows even better numerical results: with 6 cameras (all different models), accurate (over 99%) identification for image fragments cropped to just 256 x 512.</p>
<p class="c0"><strong><span class="c3">What does this mean for privacy?</span></strong> I said earlier that there is a duality between security and privacy, but let’s examine the relationship in more detail. In privacy-infringing applications like mass surveillance, the algorithm need not always produce an answer, and it can occasionally be wrong when it does. The penalty for errors is much lower. On the other hand, the matching algorithm in surveillance-like applications needs to handle a far larger number of candidate cameras. The key point is:</p>
<div style="background-color:#eef;border:1px dashed #bcc;margin-left:20px;margin-bottom:15px;padding:5px;">The parameters of fingerprinting algorithms can usually be tweaked to handle a larger number of classes (i.e., devices) at the expense of accuracy.</div>
<p class="c0">My intuition is that state-of-the-art techniques, configured slightly differently, should allow probabilistic deanonymization from among tens of thousands of different cameras. A Flickr or Picasa profile with a few dozen images should suffice to fingerprint a camera.[4] Combined with metadata such as location, this puts us within striking distance of Internet-scale source-camera identification from anonymous images. I really hope there will be some serious research on this question.</p>
<p class="c0">Finally, a word on defenses. If you find yourself in a position where you wish to anonymously publicize a sensitive photograph you took, but your camera is publicly tied to your identity because you’ve previously shared pictures on social networks (and who hasn’t), how do you protect yourself?</p>
<p class="c0">Compressing the image is one possibility, because that destroys the &#8216;lower-order&#8217; bits that fingerprinting crucially depends on. However, it would have to be way more aggressive than most camera defaults (JPEG quality factor ~60% according to one of the studies, whereas defaults are ~95%). A different strategy is rotating the image slightly in order to ‘desynchronize’ it, throwing off the fingerprint matching. An attack that defeats this will have to be much more sophisticated and will have a far higher error rate.</p>
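<p class="c0">The desynchronization idea, and the extra work it forces on an attacker, can be illustrated with a one-dimensional toy. Here a cyclic shift stands in for a slight rotation, which is an assumption made to keep the example short: a real rotation needs interpolation and a much larger search space. The noise vector, the shift of 3 samples and the search window are likewise hypothetical.</p>

```python
import random

def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def shift(signal, k):
    """Cyclically shift a 1-D noise vector by k samples."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]

rng = random.Random(7)
pattern = [rng.gauss(0, 1) for _ in range(4096)]   # camera's reference noise

# Noise extracted from a published photo that was shifted by 3 samples.
test = [p + rng.gauss(0, 0.5) for p in shift(pattern, 3)]

# Direct sample-by-sample comparison: the match disappears.
naive = correlation(pattern, test)

# An attacker must instead try every candidate alignment.
best = max(range(-8, 9), key=lambda k: correlation(shift(pattern, k), test))
print(round(naive, 3), best)
```

<p class="c0">Every extra alignment tried is another chance for an unrelated camera to match by accident, which is one way to see why an attack that defeats desynchronization should be expected to have a higher error rate.</p>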
<p class="c0">The deanonymization threat here is analogous to <span class="c6"><a class="c2" href="http://33bits.org/2009/01/15/de-anonymizing-the-internet/">writing-style fingerprinting</a></span>: there are simple defenses, albeit not foolproof, but sadly most users are unaware of the problem, let alone solutions.</p>
<p class="c0">[1] That was a bit simplified; mathematically, there is an additive component (dark signal nonuniformity) and a multiplicative component (photoresponse nonuniformity). The former is easy to correct for, and higher-end cameras do, but the latter isn’t.</p>
<p class="c0">[2] Much has been said about the tension between security and privacy at a social/legal/political level, but I’m making a relatively uncontroversial technical statement here.</p>
<p class="c0">[3] Fridrich is incidentally one of the pioneers of <span class="c6"><a class="c2" href="http://en.wikipedia.org/wiki/Speedcubing">speedcubing</a></span>, i.e., speed-solving the Rubik’s cube.</p>
<p class="c0">[4] The Binghamton paper uses 320 images per camera for building a fingerprint (and recommends at least 50); the Turkey paper uses 100, and Li’s paper 50. I suspect that if more than one image taken from the unknown camera is available, then the number of reference images can be brought down by a corresponding factor.</p>
<p class="c0">To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or <a href="http://twitter.com/random_walker">follow me on Twitter</a> or <a href="https://plus.google.com/u/0/110908828231461227679">Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/09/19/digital-camera-fingerprinting/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/09/imageds.jpg" medium="image">
			<media:title type="html">Dark signal</media:title>
		</media:content>
	</item>
		<item>
		<title>Everything Has a Fingerprint: The Case of Blank Paper</title>
		<link>http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/</link>
		<comments>http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/#comments</comments>
		<pubDate>Tue, 13 Sep 2011 18:41:56 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fingerprinting]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=963</guid>
		<description><![CDATA[This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy. Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like hand geometry, as well [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=963&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<div>
<p style="padding-left:30px;"><em>This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy.</em></p>
<p>Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like <a href="http://en.wikipedia.org/wiki/Hand_geometry">hand geometry</a>, as well as “behavioral biometrics” like voice, handwriting, typing patterns, and even <a href="http://www.springerlink.com/content/9k91axk7lx5h6jxx/">gait analysis</a>. Many techniques for deanonymization, the principal topic of this blog, work by “fingerprinting” people’s preferences, habits, or style.</p>
<p>But this article is not about biometrics, nor is it about fingerprinting of <a href="http://en.wikipedia.org/wiki/Acoustic_fingerprint">content</a> or complex systems such as <a href="http://panopticlick.eff.org/">a web browser in conjunction with the OS and the user</a>.[1] I will instead discuss one of the most surprising domains of fingerprinting — blank paper.</p>
<p><a href="http://33bits.files.wordpress.com/2011/09/blankpaperupclose.jpeg"><img class="aligncenter size-full wp-image-964" title="Blank paper under the microscope" src="http://33bits.files.wordpress.com/2011/09/blankpaperupclose.jpeg?w=455" alt=""   /></a></p>
<p>This is what paper looks like up close — far from being smooth, it has a rich natural structure. Even considering this, the state-of-the-art <a href="http://citp.princeton.edu/pub/paper09oak.pdf">study</a> on fingerprinting of physical documents, by <a href="http://www.cs.princeton.edu/~wclarkso/">Will Clarkson</a> and colleagues at Princeton, achieves something remarkable: they show how to extract fingerprints from paper using just commodity scanners, and no microscopic technology. The fingerprint survives when the document/paper is printed on, written or scribbled on, or even soaked in water.</p>
<div class="mceTemp mceIEcenter">
<dl class="wp-caption aligncenter">
<dt class="wp-caption-dt"><a href="http://33bits.files.wordpress.com/2011/09/scannedpaper.png"><img class="size-full wp-image-965" title="Scanned paper" src="http://33bits.files.wordpress.com/2011/09/scannedpaper.png?w=455&h=336" alt="" width="455" height="336" /></a></dt>
</dl>
<h5 class="wp-caption-dd">A small (10mm tall) region of paper scanned from two different angles — top-to-bottom and left-to-right</h5>
</div>
<p>The image above, taken from the Princeton paper, shows what the output of a scanner looks like. It is not quite the resolution of the microscopic image, but a lot of structure is still visible. The key technique is this: by scanning the paper at different orientations and comparing the images, the authors estimate the height at each point, and from these estimates they construct a 3-D map of the not-so-flat surface of the paper.</p>
<p>These 3-D maps can be used as fingerprints, but for efficiency the authors look at the maps of only about 100 randomly picked small “patches” on the paper. To further compress the extracted information, they apply “dimensionality reduction,” resulting in a 400-byte “feature vector” for each piece of paper, which is the fingerprint.</p>
<p>To verify or compare an observed fingerprint against a stored one, they simply look at the Hamming distance between the two bit-vectors. Why does this simple comparison technique succeed? Comparison of two human fingerprints is a lot more difficult, after all. It’s because a rectangular piece of paper has a nice property that human skin doesn’t: <em>when the objects being fingerprinted have a precise, fixed geometry, fingerprint verification is easy — it is just a pointwise comparison of the corresponding features.</em></p>
<p>The result of such comparisons is this: two fingerprints from different pieces of paper match in roughly 50% of the bits, almost always in the 45%–55% range. Two fingerprints from the same piece of paper, on the other hand, differ in less than 5% of the bits, and occasionally up to 20% of bits if it has been handled particularly badly, such as by soaking. Therefore it is straightforward to infer whether or not two fingerprints came from the same piece of paper.</p>
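<p>As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the authors’ code: the fingerprints are simulated random bit-vectors, the 30% decision threshold is an assumed value chosen to sit between the two ranges quoted above, and the helper names are made up for this example.</p>

```python
import random

FINGERPRINT_BITS = 3200  # 400 bytes, as in the post
# Hypothetical decision threshold: any value between the ~20% worst case for
# the same sheet and the ~45% floor for different sheets would separate them.
MATCH_THRESHOLD = 0.30


def hamming_fraction(a: int, b: int, nbits: int = FINGERPRINT_BITS) -> float:
    """Fraction of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1") / nbits


def same_sheet(a: int, b: int) -> bool:
    """Declare a match when the fingerprints differ in few enough bits."""
    return MATCH_THRESHOLD > hamming_fraction(a, b)


def rescan(fp: int, flip_rate: float, rng: random.Random) -> int:
    """Simulate a noisy re-scan: flip each bit with probability flip_rate."""
    noise_bits = "".join(
        "1" if flip_rate > rng.random() else "0"
        for _ in range(FINGERPRINT_BITS)
    )
    return fp ^ int(noise_bits, 2)


rng = random.Random(0)
sheet_a = rng.getrandbits(FINGERPRINT_BITS)  # stand-in for one sheet
sheet_b = rng.getrandbits(FINGERPRINT_BITS)  # stand-in for another sheet

print(hamming_fraction(sheet_a, rescan(sheet_a, 0.05, rng)))  # near 0.05
print(hamming_fraction(sheet_a, rescan(sheet_a, 0.20, rng)))  # near 0.20
print(hamming_fraction(sheet_a, sheet_b))                     # near 0.50
```

<p>Because the two ranges are so far apart relative to the spread you get from 3200 bits, the exact threshold hardly matters in this simulation; that gap is what makes the pointwise comparison work.</p>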
<p>Readers familiar with the “<a href="http://33bits.org/about/">33 bits of entropy</a>” concept might notice that the fingerprint here is 400 bytes long, or 3200 bits, which is ridiculously high. There are surely fewer than 2<sup>50</sup> pieces of paper in the world (that’s well over a hundred thousand for every person), which means that these fingerprints should easily be able to uniquely identify every piece of paper in the world. [2] The authors estimate that the chance of an error is no more than 1 in 10<sup>148</sup>. In other words, they achieve what is, for all practical purposes, perfect accuracy.</p>
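<p>The bit-counting in the paragraph above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the world population figure is an assumed round number, and the variable names are made up for this example.</p>

```python
import math

fingerprint_bits = 400 * 8        # the 400-byte feature vector from the post
paper_bound_bits = 50             # the post's bound of 2**50 sheets of paper
world_population = 7_000_000_000  # assumption: rough 2011 figure

# Bits needed to single out one person, hence "33 bits of entropy".
bits_per_person = math.log2(world_population)

# Bits left over after distinguishing every sheet under the 2**50 bound.
spare_bits = fingerprint_bits - paper_bound_bits

print(fingerprint_bits)           # 3200
print(round(bits_per_person, 1))  # 32.7
print(spare_bits)                 # 3150
```

<p>Even if the true entropy of the feature vector were only a small fraction of 3200 bits, it would still comfortably exceed the roughly 50 bits needed here.</p>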
<p>What are the implications? As the authors point out, document identification “has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods.” On the negative side, it “could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.”</p>
<p>[1] This is often referred to as <a href="http://en.wikipedia.org/wiki/Device_fingerprint">device fingerprinting</a>, but I find that a poor choice of terminology and will reserve that term for a different concept in this series.</p>
<p>[2] It is hard to estimate entropy exactly in cases like this, but the feature vector is obtained via <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a>, which makes it likely that the entropy is close to the maximum value of 3200 bits.</p>
<p><em>Thanks to Will Clarkson for reviewing a draft of this post.</em></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/09/13/everything-has-a-fingerprint-the-case-of-blank-paper/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/09/blankpaperupclose.jpeg" medium="image">
			<media:title type="html">Blank paper under the microscope</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/09/scannedpaper.png" medium="image">
			<media:title type="html">Scanned paper</media:title>
		</media:content>
	</item>
		<item>
		<title>Google+ and Privacy: A Roundup</title>
		<link>http://33bits.org/2011/07/03/google-and-privacy-a-roundup/</link>
		<comments>http://33bits.org/2011/07/03/google-and-privacy-a-roundup/#comments</comments>
		<pubDate>Sun, 03 Jul 2011 19:04:52 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[privacy]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=938</guid>
		<description><![CDATA[By all accounts, Google has done a great job with Plus, both on privacy and on the closely related goal of better capturing real-life social nuances. [1] This article will summarize the privacy discussions I’ve had in the first few days of using the service and the news I’ve come across. The origin of Circles [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=33bits.org&#038;blog=5017838&#038;post=938&#038;subd=33bits&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>By all accounts, Google has done a great job with <a href="https://plus.google.com/">Plus</a>, both on privacy and on the closely related goal of better capturing real-life social nuances. [1] This article will summarize the privacy discussions I’ve had in the first few days of using the service and the news I’ve come across.</p>
<p><strong>The origin of Circles</strong></p>
<p>“Circles,” as you’re probably aware, is the big privacy-enhancing feature. A presentation titled “<a href="http://www.slideshare.net/padday/the-real-life-social-network-v2">The Real-Life Social Network</a>” by user-experience designer <a href="http://twitter.com/padday">Paul Adams</a> almost exactly a year ago went viral in the tech community; it looks <a href="http://www.readwriteweb.com/archives/google_to_launch_major_new_social_network_called_c.php">likely</a> this was the genesis, or at least a crystallization, of the Circles concept.</p>
<p>But Adams defected to Facebook a few months later, which led to <a href="http://techcrunch.com/2010/12/20/paul-adams-googler-whose-presentation-foretold-facebook-groups-heads-to-facebook/">speculation</a> that it was the end of whatever plans Google may have had for the concept. Little did the world know at the time that Plus was a company-wide, bet-the-farm initiative involving <a href="http://www.wired.com/epicenter/2011/06/inside-google-plus-social/all/1">30 product teams</a> and hundreds of engineers, and that the departure of one made no difference.</p>
<p>Meanwhile, Facebook introduced a <a href="http://www.facebook.com/help/?page=768">friend-lists feature</a> but it was DOA. When you’re staring at a giant list of several hundred “friends” — Facebook doesn’t do a good job of discouraging indiscriminate friending — categorizing them all is intimidating to say the least. My guess is that Facebook was merely playing the <a href="http://preibusch.de/publications/social_networks/privacy_jungle_dataset.htm">privacy communication game</a>.</p>
<p><strong>Why are circles effective?</strong></p>
<p>I did an informal poll to see if people are taking advantage of Circles to organize their friend groups. Admittedly, I was looking at a tech-savvy, privacy-conscious group of users, but the response was overwhelming, and it was enough to convince me that Circles will be a success. There’s a lot of excitement among the early user community as they collectively figure out the technology as well as the norms and best practices for Circles. For example, this <a href="https://plus.google.com/u/0/111661289724043424828/posts/SqaoG4Jc9rc">tip on how to copy a circle</a> has been shared over 400 times as I write this.</p>
<p>One obvious explanation is that Circles captures real-life boundaries, and this is what users have been waiting for all along. That’s no doubt true, but I think there’s more to it than that. Multiple people have pointed out how the exemplary user interface for creating circles encouraged them to explore the feature. It is gratifying to see that Google has finally learned the importance of interface and interaction design in getting social right.</p>
<p>There are several other UI features that contribute to the success of Circles. When friending someone, you’re <em>forced</em> to pick one or more circles, instead of being allowed to drop them into a generic bucket and categorize them later. But in spite of this, the UI is so good that I find it no harder than friending on Facebook.</p>
<p>In addition, you have to pick circles to share each post with (but again the interface makes it really easy). Finally, each post has a little snippet that shows who can see it, which has the effect of constantly reminding you to mind the information flow. In short, it is nearly impossible to ignore the Circles paradigm.</p>
<p><strong>The resharing bug</strong></p>
<p>Google+ tries to balance privacy with Twitter-like resharing, which is always going to be tricky. Amusing inconsistencies result if you share a post with a circle that doesn’t include the original poster. A more serious issue, pointed out by many people including an <a href="http://blogs.ft.com/fttechhub/2011/06/google-plus-privacy-flaw">FT blogger</a>, is that “limited” posts could be publicly reshared. To their credit, Google engineers acknowledged it and quickly disabled the feature.</p>
<p>Meanwhile, some have opined that this issue is “<a href="http://www.techdirt.com/articles/20110701/00262714929/first-totally-bogus-privacy-issue-over-google-raised.shtml">totally bogus</a>” and that this is <a href="http://www.buzzmachine.com/2011/06/30/social-is-for-sharing-not-hiding/">how life works</a> and how email works, in that when you tell someone a secret, they could share it with others. I strongly disagree, for two reasons.</p>
<p>First, this is <em>not</em> how the real world (or even email) works. Someone can repeat a secret you told them in real life, or forward an email, but they typically won’t <em>broadcast it to the whole world</em>. We’re talking about making something <em>public</em> here, something that will be forever associated with your real name and could very well come up in a web search.</p>
<p>Second, user-interface hints are an important and well-established way of nudging privacy-impacting behaviors. If there’s a ‘share’ button with a ‘public’ setting, many users will assume that it is OK to do just that. Twitter used to allow public retweets of protected tweets, and a <a href="http://w2spconf.com/2010/papers/p28.pdf">study</a> found that this had been done millions of times. In response, Twitter removed this ability. The <a href="http://privicons.org/">privicons</a> project seeks to embed similar hints in emails.</p>
<p>In other words, the privacy skeptics are missing the point: the goal of the feature is not to try to technologically <em>prevent</em> leakage of protected information, but to better <em>communicate</em> to users what’s OK to share and what isn’t. And in this case, the simplest way to do that is to remove the 1-click ability to share protected content publicly, and instead let users copy-paste if they really want to do that. It would also make sense to remind users to be careful when they’re sharing a limited post with their circles, which, I’m happy to see, is <a href="https://plus.google.com/u/0/103541694080221120019/posts/htTdkLezSjP">exactly what Google is doing</a>.</p>
<div id="attachment_939" class="wp-caption aligncenter" style="width: 352px"><img class="size-full wp-image-939 " title="sharingreminder" src="http://33bits.files.wordpress.com/2011/07/sharingreminder.png?w=455" alt=""   /><p class="wp-caption-text">The tip you now see when you share a limited post (with another limited group). This is my favorite Google+ feature.</p></div>
<p><strong>A window into your circles</strong></p>
<p>Paul Ohm <a href="https://plus.google.com/u/0/117949726855391305467/posts/Ykc3irss45D">points out</a> that if someone shares content with a set of circles that includes you, you get to see 21 users who are part of those circles, apparently picked at random. [2] This means that if you look at these lists of 21 over time you can figure out a lot about someone&#8217;s circles, and possibly decipher them completely. Note that by default your profile shows a list of users in your circles, but not who&#8217;s in <em>which</em> circle, which for most people is <a href="http://twitter.com/mrgunn/statuses/86531372822441984">significantly more sensitive</a>.</p>
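<p>To get a feel for how quickly such samples add up, here is a small Python simulation. It is a sketch under a simplifying assumption: each post exposes a fresh, uniformly random sample of 21 members of the poster’s circles, which may not match what Google+ actually does. The function name and the circle sizes are made up for this example.</p>

```python
import random


def posts_until_revealed(circle_size: int, sample_size: int = 21,
                         seed: int = 0) -> int:
    """Count the posts an observer sees before every member has appeared."""
    rng = random.Random(seed)
    members = list(range(circle_size))
    seen = set()
    posts = 0
    while circle_size > len(seen):
        # Assumption: every post shows an independent uniform random sample.
        seen.update(rng.sample(members, min(sample_size, circle_size)))
        posts += 1
    return posts


for size in (21, 50, 200):
    print(size, posts_until_revealed(size))
```

<p>A circle of 21 or fewer is revealed by a single post in this model; larger ones take more posts, but the observer’s picture only ever grows. If the real sampling is biased toward certain users, the remaining members would take longer to surface.</p>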
<p>In my view, this is an interesting finding, but not anything Google needs to fix; the feature is very useful (and arguably privacy-<em>enhancing</em>) and the information leakage is an inevitable tradeoff. But it’s definitely something that users would do well to be aware of: the secrecy of your circles is far from bulletproof.</p>
<p>Speaking of which, the network visibility of different users on their profile page confused me terribly, until I realized Google+ is A/B testing that privacy setting! These are the two possibilities you could see when you edit your profile and click the circles area in the left sidebar: <a href="http://dl.dropbox.com/u/131764/web/graphprefs1.png">A</a>, <a href="http://dl.dropbox.com/u/131764/web/graphprefs2.png">B</a>. This is very interesting and unusual. At any rate, very few users seem to have changed the defaults so far, based on a random sample of a few dozen profiles.</p>
<p><strong>Identity and distributed social networking</strong></p>
<p>Some people are peeved that Google+ discourages you from participating pseudonymously. I don’t think a social network that wants to target the mainstream and wants to capture real-world relationships has any real choice about this. In fact, I want it to go further. Right now, Google+ often suggests I add someone I’ve already added, which turns out to be because I’ve corresponded with multiple email addresses belonging to that person. Such user confusion could be minimized if the system did some graph-mining to automatically figure out which identities belong to the same person. [3]</p>
<p>A related question is what this will mean for distributed social networking, which was <a href="http://www.wired.com/epicenter/2010/05/facebook-rogue/">hailed</a> a year ago as the savior of privacy and user control. My guess is that Google+ will take the wind out of it — <a href="https://www.google.com/takeout/">Google takeout</a> gives you a significant degree of control over your data. Further, due to the <a href="http://allthingsd.com/20110607/whats-twitters-identity-now-that-its-apples-identity-provider/">Apple-Twitter integration</a> and the success of Android, the threat of Facebook monopolizing identities has been obliterated; there are at least three strong players now.</p>
<p>Another reason why Google+ competes with distributed social networks: for people worried about the social networking service provider (or the Government) reading their posts, client-side encryption on top of Google+ could work. The Circles feature is exactly what is needed to make encrypted posts viable, because you can make a circle of those who are using a compatible encryption/decryption plugin. At least a half-dozen such plugins have been created over the years (examples: <a href="http://www.bbc.co.uk/news/technology-12215921">1</a>, <a href="https://uprotect.it/index">2</a>), but it doesn’t make much sense to use these over Facebook or Twitter. Once the <a href="http://news.cnet.com/8301-19882_3-20075974-250/developer-api-for-google-its-coming/">Google+ developer API</a> rolls out, I’m sure we’ll see yet another avatar of the encrypted status message idea, and perhaps the n-th time will be the charm.</p>
<p><strong>Concluding thoughts</strong></p>
<p>Two years ago, I <a href="http://33bits.org/2009/09/09/livejournal-done-right-the-case-for-a-social-network-with-built-in-privacy/">wrote</a> that there’s a market case for a privacy-respecting social network to fill Livejournal’s shoes. Google+ seems poised to fulfill most of what I anticipated in that essay; the asymmetric nature of relationships and the ability to present different facets of one’s life to different people are two important characteristics that the two social networks have in common. [4]</p>
<p>Many have speculated on whether, and to what extent, Google+ is a threat to Facebook. One recurring comparison is Facebook as “ghetto” compared to Plus, such as in <a href="http://i.imgur.com/OJiZu.png">this image</a> making the rounds on Reddit, reminiscent of Facebook vs. Myspace a few years ago. This perception of “coolness” and “class” is the single biggest thing Google+ has got going for it, more than any technological feature.</p>
<p>It’s funny how people see different things in Google+. While I’m planning to use Google+ as a Livejournal replacement for protected posts, since that’s what fits my needs, the majority of the commentary has compared it to Facebook. A few think it could <a href="http://venturebeat.com/2011/06/30/google-could-make-twitter-the-next-myspace/">replace Twitter</a>, generalizing from their own corner of the Google+ network where people haven’t been using the privacy options. Forbes, being a business publication, thinks <a href="http://blogs.forbes.com/quentinhardy/2011/06/29/google-other-targets/">LinkedIn is the target</a>. I’ve seen a couple of commenters saying they might use it instead of Yammer, another business tool. According to yet other articles, <a href="http://www.pixiq.com/article/google-may-not-kill-facebook-but-flickr-should-be-worried">Flickr</a>, <a href="http://gigaom.com/2011/06/28/why-google-plus-wont-hurt-facebook-but-skype-will-hate-it/">Skype</a> and various other Internet companies should be shaking in their boots. Have you heard the parable of the <a href="http://www.noogenesis.com/pineapple/blind_men_elephant.html">blind men and the elephant</a>?</p>
<p>In short, Google+ is whatever you want it to be, and probably a better version of it. It’s remarkable that they’ve pulled this off without making it a confusing, bloated mess. Myspace founder Tom Anderson seems to have the <a href="https://plus.google.com/112063946124358686266/posts/SrQrSSXeViq">most sensible view</a> so far: Google+ is simply a better … <em>Google</em>, in that the company now has a smoother, more integrated set of services. You’d think people would have figured it out from the name!</p>
<p>[1] I will use the term “privacy” in this article to encompass both senses.</p>
<p>[2] It’s actually 22 users, including yourself and the poster. It’s not clear just how random the list is; in my perusal, mutual friends seem to be preferentially picked.</p>
<p>[3] I am <em>not</em> suggesting that Google+ should prevent users from having multiple accounts, although Circles makes it much less useful/necessary to have multiple accounts.</p>
<p>[4] On the other hand, when it comes to third party data collection, I <a href="http://33bits.org/2011/03/18/privacy-and-the-market-for-lemons-or-how-websites-are-like-used-cars/">do not believe</a> that the market can fix itself.</p>
<p>I’m grateful to <a href="http://josephhall.org/">Joe Hall</a>, <a href="http://stanford.edu/~jmayer/">Jonathan Mayer</a>, and many, many others with whom I had interesting discussions, mostly via Google+ itself, on the topics that led to this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/07/03/google-and-privacy-a-roundup/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2011/07/sharingreminder.png" medium="image">
			<media:title type="html">sharingreminder</media:title>
		</media:content>
	</item>
	</channel>
</rss>
