<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>33 Bits of Entropy &#187; re-identification</title>
	<atom:link href="http://33bits.org/tag/re-identification/feed/" rel="self" type="application/rss+xml" />
	<link>http://33bits.org</link>
	<description>The End of Anonymized Data and What to Do About It</description>
	<lastBuildDate>Mon, 30 Jan 2012 06:39:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='33bits.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>33 Bits of Entropy &#187; re-identification</title>
		<link>http://33bits.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://33bits.org/osd.xml" title="33 Bits of Entropy" />
	<atom:link rel='hub' href='http://33bits.org/?pushpress=hub'/>
		<item>
		<title>Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge</title>
		<link>http://33bits.org/2011/03/09/link-prediction-by-de-anonymization-how-we-won-the-kaggle-social-network-challenge/</link>
		<comments>http://33bits.org/2011/03/09/link-prediction-by-de-anonymization-how-we-won-the-kaggle-social-network-challenge/#comments</comments>
		<pubDate>Wed, 09 Mar 2011 12:30:42 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[contest]]></category>
		<category><![CDATA[de-anonymization]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[re-identification]]></category>
		<category><![CDATA[social networks]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=697</guid>
		<description><![CDATA[The title of this post is also the title of a new paper of mine with Elaine Shi and Ben Rubinstein. You can grab a PDF or a web-friendly HTML version generated using my Project Luther software. A brief de-anonymization history. As early as the first version of my Netflix de-anonymization paper with Vitaly Shmatikov [...]]]></description>
			<content:encoded><![CDATA[<div>
<p>The title of this post is also the title of a <a href="http://arxiv.org/abs/1102.4374">new paper</a> of mine with <a href="http://www2.parc.com/csl/members/eshi/elaine.htm">Elaine Shi</a> and <a href="http://www.cs.berkeley.edu/~benr/">Ben Rubinstein</a>. You can grab a <a href="http://arxiv.org/pdf/1102.4374v1">PDF</a> or a web-friendly <a href="http://randomwalker.info/luther/kaggle-deanonymization/">HTML version</a> generated using my <a href="http://projectluther.org">Project Luther</a> software.</p>
<p><strong>A brief de-anonymization history.</strong> As early as the first version of my <a href="http://33bits.org/about/netflix-paper-home-page/">Netflix de-anonymization paper</a> with <a href="http://www.cs.utexas.edu/~shmat/">Vitaly Shmatikov</a> back in 2006, a colleague suggested that de-anonymization can in fact be used to game machine-learning contests by simply “looking up” the attributes of de-anonymized users instead of predicting them. We off-handedly threw a paragraph into our paper discussing this possibility, and a New Scientist writer seized on it as an angle for her <a href="http://www.cs.utexas.edu/~shmat/newsci-netflix.html">article</a>.[1] Nothing came of it, of course; we had no interest in gaming the Netflix Prize.</p>
<p>During the years 2007-2009, Shmatikov and I worked on de-anonymizing social networks. The <a href="http://33bits.org/2009/03/19/de-anonymizing-social-networks/">paper that resulted</a> (<a href="http://www.cs.utexas.edu/~shmat/shmat_oak09.pdf">PDF</a>, <a href="http://randomwalker.info/social-networks/">HTML</a>) showed how to take two graphs representing social networks and map the nodes to each other based on the <em>graph structure alone</em>—no usernames, no nothing. As you might imagine, this was a phenomenally harder technical challenge than our Netflix work. (<a href="http://www.cs.cornell.edu/~lars/">Backstrom</a>, <a href="http://research.microsoft.com/en-us/people/dwork/">Dwork</a> and <a href="http://www.cs.cornell.edu/home/kleinber/">Kleinberg</a> had previously published a <a href="http://portal.acm.org/citation.cfm?id=1242598">paper</a> on social network de-anonymization; the crucial difference was that we showed how to put two social network graphs together rather than search for a small piece of graph-structured auxiliary information in a large graph.)</p>
<p>The context for these two papers is that data mining on social networks—whether online social networks, telephone call networks, or any type of <a href="http://33bits.org/2009/02/15/social-network-analysis-can-quantity-substitute-for-quality/">network of links between individuals</a>—can be very lucrative. Social networking websites would benefit from outsourcing “anonymized” graphs to advertisers and such; we showed that the privacy guarantees are questionable-to-nonexistent since the anonymization can be reversed. No major social network has gone down this path (as far as I know), quite possibly in part because of the two papers, although smaller players often fly under the radar.</p>
<p><strong>The Kaggle contest.</strong> <a href="http://www.kaggle.com/">Kaggle</a> is a platform for machine learning competitions. They ran the <a href="http://www.kaggle.com/socialNetwork">IJCNN social network challenge</a> to promote research on <a href="http://www.cs.cornell.edu/home/kleinber/link-pred.pdf">link prediction</a>. The contest dataset was created by crawling an online social network—which was later revealed to be Flickr—and partitioning the obtained edge set into a large training set and a smaller test set of edges augmented with an equal number of fake edges. The challenge was to predict which edges were real and which were fake. Node identities in the released data were obfuscated.</p>
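<p>As a rough illustration of how such a dataset can be assembled, here is a minimal Python sketch. It is not the organizers' code: the function name <code>make_contest_split</code>, the 10% test fraction, and the choice to draw fake edges uniformly at random from non-adjacent node pairs are all assumptions made for the example.</p>

```python
import random


def make_contest_split(edges, test_fraction=0.1, seed=0):
    """Split edges into a training set and a labeled test set.

    The test set holds the withheld real edges (label 1) plus an equal
    number of fake edges (label 0), i.e. node pairs that are not edges
    in the crawled graph. Edges are treated as ordered pairs (assumption).
    """
    rng = random.Random(seed)
    edges = sorted(set(edges))
    existing = set(edges)
    rng.shuffle(edges)
    n_test = max(1, int(len(edges) * test_fraction))
    real_test, train = edges[:n_test], edges[n_test:]

    nodes = sorted({node for edge in edges for node in edge})
    fake_test = set()
    # Rejection sampling; assumes the graph is sparse enough to have
    # at least n_test non-adjacent ordered pairs.
    while len(fake_test) != n_test:
        u, v = rng.sample(nodes, 2)
        if (u, v) not in existing:
            fake_test.add((u, v))

    test = [(e, 1) for e in real_test] + [(e, 0) for e in sorted(fake_test)]
    rng.shuffle(test)
    return train, test
```

<p>A contestant sees <code>train</code> and the unlabeled pairs from <code>test</code>; the labels are what must be predicted.</p>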
<p>There are many, many anonymized databases out there; I come across a new one every other week. I pick a de-anonymization project if it will advance the art significantly (yes, de-anonymization is still <a href="http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/">partly an art</a>), or if it is <a href="http://33bits.org/2008/11/12/57/">fun</a>. The Kaggle contest was a bit of both, and so when my collaborators invited me to join them, it was too juicy to pass up.</p>
<p>The Kaggle contest is actually much better suited to gaming through de-anonymization than the Netflix Prize would have been. As we explain in the paper:</p>
<blockquote><p>One factor that greatly affects both [the privacy risk and the risk of gaming]—in opposite directions—is whether the underlying data is already publicly available. If it is, then there is likely no privacy risk; however, it furnishes a ready source of high-quality data to game the contest.</p></blockquote>
<p>The first step was to do <a href="http://randomwalker.info/luther/kaggle-deanonymization/Background.html">our own crawl of Flickr</a>; this turned out to be relatively easy. The two graphs (the Kaggle graph and our Flickr crawl) were 95% similar, as we were later able to determine. The difference is primarily due to Flickr users adding and deleting contacts between Kaggle’s crawl and ours. Armed with the auxiliary data, we set about the task of matching up the two graphs based on the structure. To clarify: our goal was to map the nodes in the Kaggle training and test dataset to real Flickr nodes. That would allow us to simply look up the pairs of nodes in the test set in the Flickr graph to see whether or not the edge exists.</p>
<p><strong>De-anonymization.</strong> Our effort validated the broad strategy in my paper with Shmatikov, which consists of two steps: “seed finding” and “propagation.” In the former step we somehow de-anonymize a small number of nodes; in the latter step we use these as “anchors” to propagate the de-anonymization to more and more nodes. In the propagation step the algorithm feeds on its own output.</p>
<p>Let me first describe propagation because it is simpler.[2] As the algorithm progresses, it maintains a (partial) mapping between the nodes in the true Flickr graph and the Kaggle graph. We iteratively try to extend the mapping as follows: pick an arbitrary as-yet-unmapped node in the Kaggle graph, find the “most similar” node in the Flickr graph, and if they are “sufficiently similar,” they get mapped to each other.</p>
<p>Similarity between a Kaggle node and a Flickr node is defined as <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> between the already-mapped neighbors of the Kaggle node and the already-mapped neighbors of the Flickr node (nodes mapped to each other are treated as identical for the purpose of cosine comparison).</p>
<p style="text-align:left;"><img class="aligncenter" title="Propagation" src="http://randomwalker.info/luther/kaggle-deanonymization/propagation.png" alt="" width="282" height="174" />In the diagram, the blue nodes have already been mapped. The similarity between A and B is 2 / (√3·√3) = ⅔. Whether or not edges exist between A and A’ or B and B’ is irrelevant.</p>
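<p>The similarity measure and a single propagation pass can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the contest code: graphs are plain dictionaries from a node to its set of neighbors with edge direction ignored, the names <code>cosine_similarity</code> and <code>propagate_once</code> are invented here, and a single <code>threshold</code> stands in for the “sufficiently similar” heuristics.</p>

```python
import math


def cosine_similarity(kaggle_nbrs, flickr_nbrs, mapping):
    """Cosine similarity between already-mapped neighbor sets.

    mapping is the partial mapping, a dict from Kaggle node to Flickr node.
    Kaggle neighbors are translated to Flickr identities before comparing,
    and unmapped neighbors on either side are ignored.
    """
    mapped_flickr = set(mapping.values())
    x = {mapping[n] for n in kaggle_nbrs if n in mapping}
    y = {n for n in flickr_nbrs if n in mapped_flickr}
    if not x or not y:
        return 0.0
    return len(x.intersection(y)) / math.sqrt(len(x) * len(y))


def propagate_once(kaggle_graph, flickr_graph, mapping, threshold=0.5):
    """Try to map each unmapped Kaggle node; return how many were added."""
    added = 0
    for k_node in sorted(kaggle_graph):
        if k_node in mapping:
            continue
        taken = set(mapping.values())
        candidates = [f for f in sorted(flickr_graph) if f not in taken]
        if not candidates:
            break

        def score_of(f):
            return cosine_similarity(kaggle_graph[k_node], flickr_graph[f], mapping)

        best = max(candidates, key=score_of)
        if score_of(best) >= threshold:
            mapping[k_node] = best  # later nodes can use this new anchor
            added += 1
    return added


# The situation in the diagram: A and B each have three mapped neighbors,
# two of which correspond to each other (node labels are hypothetical).
mapping = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
score = cosine_similarity({"a1", "a2", "a3"}, {"b1", "b2", "b4"}, mapping)
print(round(score, 4))  # 0.6667
```

<p>Calling <code>propagate_once</code> repeatedly until it returns 0 gives the “feeds on its own output” behavior, since every accepted match becomes an anchor for later comparisons.</p>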
<p>There are many heuristics that go into the “sufficiently similar” criterion, which are <a href="http://randomwalker.info/luther/kaggle-deanonymization/De_anonymization.html">described in our paper</a>. Due to the high percentage of common edges between the graphs, we were able to use a relatively pure form of the propagation algorithm; the one in my paper with Shmatikov, in contrast, was filled with many more messy heuristics.</p>
<p><strong>Those elusive seeds.</strong> Seed identification was far more challenging. In the earlier paper, we didn’t do seed identification on real graphs; we only showed it possible under certain models for error in auxiliary information. We used a “pattern-search” technique, as did the Backstrom et al. paper. It wasn’t clear whether this method would work, for reasons I won’t go into.</p>
<p>So we developed a new technique based on “combinatorial optimization.” At a high level, this means that instead of finding seeds one by one, we try to find them all at once! The first step is to find a set of k (we used k=20) nodes in the Kaggle graph and k nodes in our Flickr graph that are likely to correspond to each other (in some order); the next step is to find this correspondence.</p>
<p>The latter step is the hard one, and basically involves solving an NP-hard problem of finding a permutation that minimizes a certain weighting function. During the contest I basically stared at <a href="http://randomwalker.info/misc/kaggle/cosines.reverse.txt">this page of numbers</a> for a couple of hours, and then wrote down the mapping, which to my great relief turned out to be correct! But later we were able to show how to solve it in an automated and scalable fashion <a href="http://randomwalker.info/luther/kaggle-deanonymization/Graph_Matching_via_Simulate.html">using simulated annealing</a>, a well-known technique to approximately solve NP-hard problems for small enough problem sizes. This method is one of the main research contributions in our paper.</p>
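<p>To make the annealing idea concrete, here is a small Python sketch of simulated annealing over permutations. It is not the implementation from the paper. In particular, the cost function is an assumption: it takes one k×k matrix of pairwise similarities per graph and measures how badly a candidate permutation aligns them. The function names, the swap move, and the geometric cooling schedule are likewise choices made only for this example.</p>

```python
import math
import random


def mismatch(perm, sim_a, sim_b):
    """Sum of |sim_a[i][j] - sim_b[perm[i]][perm[j]]| over all pairs i, j."""
    k = len(perm)
    return sum(
        abs(sim_a[i][j] - sim_b[perm[i]][perm[j]])
        for i in range(k)
        for j in range(i + 1, k)
    )


def anneal_permutation(sim_a, sim_b, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for a low-mismatch permutation with simulated annealing."""
    rng = random.Random(seed)
    k = len(sim_a)
    perm = list(range(k))
    rng.shuffle(perm)
    cost = mismatch(perm, sim_a, sim_b)
    best_perm, best_cost = perm[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(k), 2)
        perm[i], perm[j] = perm[j], perm[i]  # propose swapping two positions
        new_cost = mismatch(perm, sim_a, sim_b)
        delta = new_cost - cost
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / temp), otherwise undo the swap.
        if delta > 0 and rng.random() >= math.exp(-delta / temp):
            perm[i], perm[j] = perm[j], perm[i]
        else:
            cost = new_cost
            if best_cost > cost:
                best_perm, best_cost = perm[:], cost
    return best_perm, best_cost
```

<p>With k=20 there are 20! possible orderings, so exhaustive search is out of the question; annealing only ever evaluates one swap at a time, and a returned cost near zero suggests (but does not prove) that the right correspondence was found.</p>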
<p>After carrying out seed identification, and then propagation, we had de-anonymized about 65% of the edges in the contest test set and the accuracy was about 95%. The main reason we didn’t succeed on the other third of the edges was that one or both of the nodes had a very small number of contacts/friends, resulting in too little information to de-anonymize. Our task was far from over: combining de-anonymization with regular link prediction also involved nontrivial research insights, for which I will again refer you to the <a href="http://randomwalker.info/luther/kaggle-deanonymization/Link_Prediction.html">relevant section</a> of the paper.</p>
<p><strong>Lessons</strong>. The main question that our work raises is where this leaves us with respect to future machine-learning contests. One necessary step that would help a lot is to amend contest rules to prohibit de-anonymization and to require source code submission for human verification, but as we <a href="http://randomwalker.info/luther/kaggle-deanonymization/Discussion.html">explain</a> in the paper:</p>
<blockquote><p>The loophole in this approach is the possibility of <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a>. While source-code verification would undoubtedly catch a contestant who achieved their results using de-anonymization alone, the more realistic threat is that of de-anonymization being used to bridge a small gap. In this scenario, a machine learning algorithm would be trained on the test set, the correct results having been obtained via de-anonymization. Since successful [machine learning] solutions are composites of numerous algorithms, and consequently have a huge number of parameters, it should be possible to conceal a significant amount of overfitting in this manner.</p></blockquote>
<p>As with the privacy question, there are no easy answers. It has been over a decade since Latanya Sweeney’s work provided the first <a href="https://www.eff.org/deeplinks/2009/09/what-information-personally-identifiable">dramatic demonstration</a> of the privacy problems with data anonymization; we still aren’t close to fixing things. I foresee a rocky road ahead for machine-learning contests as well. I expect I will have more to say about this topic on this blog; stay tuned.</p>
<p>[1] Amusingly, it was a whole year after that before anyone paid any attention to the <em>privacy</em> claims in that paper.</p>
<p>[2] The description is from <a href="http://www.kaggle.com/index.php?option=com_ccboard&amp;view=postlist&amp;forum=26&amp;topic=257&amp;Itemid=&amp;task_id=2464">my post on the Kaggle forum</a> which also contains a few additional details.</p>
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or <a href="http://twitter.com/random_walker">follow me on Twitter</a>.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2011/03/09/link-prediction-by-de-anonymization-how-we-won-the-kaggle-social-network-challenge/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://randomwalker.info/luther/kaggle-deanonymization/propagation.png" medium="image">
			<media:title type="html">Propagation</media:title>
		</media:content>
	</item>
		<item>
		<title>Myths and Fallacies of &#8220;Personally Identifiable Information&#8221;</title>
		<link>http://33bits.org/2010/06/21/myths-and-fallacies-of-personally-identifiable-information/</link>
		<comments>http://33bits.org/2010/06/21/myths-and-fallacies-of-personally-identifiable-information/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 20:12:44 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[HIPAA]]></category>
		<category><![CDATA[law]]></category>
		<category><![CDATA[PII]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=511</guid>
		<description><![CDATA[I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of &#8220;personally identifiable information&#8221; (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful [...]]]></description>
			<content:encoded><![CDATA[<p>I have a new paper (<a href="http://www.cs.utexas.edu/users/shmat/shmat_cacm10.pdf">PDF</a>) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of &#8220;personally identifiable information&#8221; (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:</p>
<p>The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.</p>
<p>On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub &#8220;PII&#8221; in a way that <strong>prevents the possibility of re-identification</strong>. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why &#8220;PII&#8221; has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.</p>
<p>What we are gunning for is the get-out-of-jail-free card, a.k.a. &#8220;safe harbor,&#8221; particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic &#8220;de-identification&#8221; of the data (although this isn&#8217;t the spirit of the law). Even your genome is not considered identifying!</p>
<p>Meaningful privacy protection is possible if account is taken of the specific types of computations that will be performed on the data (e.g., collaborative filtering, fraud detection, etc.). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.</p>
<p>We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn&#8217;t going to be fixed overnight. However, the current situation is a farce and needs to change.</p>
<p>Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.</p>
<p><em>Thanks to Vitaly Shmatikov for comments on a draft of this post.</em></p>
<p>To stay on top of future posts, <a href="http://33bits.org/feed/">subscribe</a> to the RSS feed or <a href="http://twitter.com/random_walker">follow me on Twitter</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2010/06/21/myths-and-fallacies-of-personally-identifiable-information/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>The Secret Life of Data</title>
		<link>http://33bits.org/2010/02/06/the-secret-life-of-data/</link>
		<comments>http://33bits.org/2010/02/06/the-secret-life-of-data/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 20:48:12 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aggregation]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[re-identification]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=327</guid>
		<description><![CDATA[Some people claim that re-identification attacks don&#8217;t matter, the reasoning being: &#8220;I&#8217;m not important enough for anyone to want to invest time on learning private facts about me.&#8221; At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and [...]]]></description>
			<content:encoded><![CDATA[<p>Some people claim that re-identification attacks don&#8217;t matter, the reasoning being: &#8220;<em>I&#8217;m not important enough for anyone to want to invest time on learning private facts about me.</em>&#8221; At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.</p>
<p>The argument is nonetheless fallacious, because re-identification typically doesn&#8217;t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on &#8220;large-scale&#8221; or &#8220;en masse&#8221;). On the other hand, the <em>harm</em> that occurs from re-identification affects individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.</p>
<p>In today&#8217;s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like <a id="n8qy" title="ChoicePoint" href="http://en.wikipedia.org/wiki/ChoicePoint">ChoicePoint</a> that sell intelligence on individuals, whereas the party interested in <em>using</em> the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.</p>
<p>Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell &#8220;anonymized&#8221; customer data: for example <a id="vag4" title="practice fusion" href="http://www.practicefusion.com/pages/news_mentions.html">Practice Fusion</a> &#8220;subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.&#8221; On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.</p>
<p>Another piece of the puzzle is what happens when a company goes bankrupt. <a id="xt:y" title="Decode genetics" href="http://www.nytimes.com/2009/11/18/business/18gene.html">Decode genetics</a> recently did, which is particularly interesting because they are sitting on a ton of <a id="am92" title="genetic data" href="http://scienceblogs.com/geneticfuture/2009/11/decode_genetics_finally_goes_u.php">genetic data</a>. There are privacy assurances in place in their original Terms of Service with their customers, but will that bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.</p>
<p>At the recent <a id="dhyg" title="FTC privacy roundtable" href="http://33bits.org/2010/01/31/in-which-i-come-out-notes-from-the-ftc-privacy-roundtable/">FTC privacy roundtable</a>, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I&#8217;m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?)  Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — &#8220;data laundering,&#8221; perhaps?</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2010/02/06/the-secret-life-of-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>De-anonymization is not X: The Need for Re-identification Science</title>
		<link>http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/</link>
		<comments>http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 21:42:15 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[author recognition]]></category>
		<category><![CDATA[k-anonymity]]></category>
		<category><![CDATA[re-identification]]></category>
		<category><![CDATA[stylometry]]></category>
		<category><![CDATA[theory]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=269</guid>
		<description><![CDATA[In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the record linkage problem in statistics and census studies, [...]]]></description>
			<content:encoded><![CDATA[<p>In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the <strong><a id="r2sa" title="record linkage" href="http://en.wikipedia.org/wiki/Record_linkage">record linkage</a></strong> problem in statistics and census studies, the <strong>search</strong> problem in information retrieval, the <strong>classification</strong> problem in <a id="uxkl" title="machine learning" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a>, and finally, <strong>biometric</strong><strong> identification</strong>. Noticing inter-disciplinary connections is often very illuminating and sometimes leads to breakthroughs, but I fear that in the case of re-identification, these connections have done more harm than good.</p>
<p><strong>Record linkage and k-anonymity.</strong> <a id="j8i." title="Sweeney" href="http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html">Sweeney</a>&#8216;s well-known experiment with health records was essentially an exercise in record linkage. The re-identification technique used was the simplest possible &#8212; a database JOIN. The unfortunate consequence was that for many years, the anonymization problem was overgeneralized based on that single experiment. In particular, it led to the development of two related and heavily flawed notions: <em>k-anonymity</em> and <em>quasi-identifier</em>.</p>
<p>The main problem with k-anonymity is that it attempts to avoid privacy breaches via purely syntactic manipulations of the data, without any model for reasoning about the &#8216;adversary&#8217; or attacker. A future post will analyze the limitations of k-anonymity in more detail. &#8216;Quasi-identifier&#8217; is a notion that arises from attempting to see some attributes (such as ZIP code) but not others (such as tastes and behavior) as contributing to re-identifiability. However, the major lesson from the re-identification papers of the last few years has been that any information at all about a person can be potentially used to aid re-identification.</p>
<p><strong>Movie ratings and noise.</strong> Let&#8217;s move on to other connections that turned out to be red herrings. Prior to our <a id="cxt_" title="Netflix paper" href="http://33bits.org/about/netflix-paper-home-page/">Netflix paper</a>, Frankowski et al. <a id="q.:n" title="attempted to de-anonymize movie ratings" href="http://www.grouplens.org/node/118">studied de-anonymization of users via movie ratings</a> collected as part of the GroupLens research project. Their algorithm achieved some success, but failed when noise was added to the auxiliary information. I believe this to be because the authors modeled re-identification as a search problem (I have no way to know if that was their mental model, but the algorithms they came up with seem inspired by the search literature.)</p>
<p>What does it mean to view re-identification as a search problem? A user&#8217;s anonymized movie preference record is treated as the collection of words on a web page, and the auxiliary information (another record of movie preferences, from a different database) is treated as a list of search terms. The reason this approach fails is that in the movie context, users typically enter distinct, albeit overlapping, sets of information into different sites or sources. This leads to a great deal of &#8216;noise&#8217; that the algorithm must deal with. While noise in web pages is of course an issue for web search, noise in the search terms themselves is not. That explains why search algorithms come up short when applied to re-identification.</p>
<p>The robustness against noise was the key distinguishing element that made the re-identification attack in the Netflix paper stand out from most previous work. Any re-identification attack that goes beyond Sweeney-style demographic attributes must incorporate this as a key feature. &#8216;Fuzzy&#8217; matching is tricky, and there is no universal algorithm that can be used. Rather, it needs to be tailored to the type of dataset based on an understanding of human behavior.</p>
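<p>To make the role of noise tolerance concrete, the following is a minimal sketch in Python of a fuzzy scoring rule. It is an illustration only and not the algorithm from the Netflix paper: the toy ratings, the one-star tolerance, the function names, and the rarity weight <code>1 / log(1 + support)</code> are all assumptions made for the example.</p>

```python
import math

# Toy anonymized database: record id -> {movie: rating}. All values are made up.
database = {
    "r1": {"Popular Movie": 5, "Obscure Film A": 4, "Obscure Film B": 2},
    "r2": {"Popular Movie": 5, "Obscure Film C": 3},
    "r3": {"Popular Movie": 4, "Obscure Film A": 1},
}

# Noisy auxiliary information about one person, e.g. from another site.
aux = {"Popular Movie": 4, "Obscure Film A": 5, "Obscure Film B": 3, "Never Rated Here": 2}


def rarity_weight(support):
    """Weight an attribute by how few records contain it (assumed formula)."""
    return 1.0 / math.log(1 + support)


def similarity(aux_value, record_value, tolerance):
    """1.0 if the two ratings agree to within `tolerance`, otherwise 0.0."""
    return 0.0 if abs(aux_value - record_value) > tolerance else 1.0


def best_matches(aux, database, tolerance=1):
    """Return (record id, score) pairs, best match first."""
    # Count how many records contain each attribute.
    support = {}
    for record in database.values():
        for item in record:
            support[item] = support.get(item, 0) + 1

    scores = {}
    for record_id, record in database.items():
        total = 0.0
        for item, aux_value in aux.items():
            if item in record:  # attributes missing from the record are skipped
                total += rarity_weight(support[item]) * similarity(
                    aux_value, record[item], tolerance
                )
        scores[record_id] = total
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


fuzzy_ranking = best_matches(aux, database, tolerance=1)
exact_ranking = best_matches(aux, database, tolerance=0)
```

<p>On this toy data the fuzzy ranking puts <code>r1</code> first, because it agrees with the auxiliary record on two rare films to within one star. With exact matching (<code>tolerance=0</code>) the same auxiliary record points at <code>r3</code> instead, which is the kind of failure that noise causes. The right tolerance and weighting would depend on the dataset.</p>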
<p><strong>Hope for authorship recognition</strong>. Now for my final example. I&#8217;m collaborating with other researchers, including <a href="http://www.cs.berkeley.edu/~bethenco/">John Bethencourt</a> and <a href="http://www.emilstefanov.net/">Emil Stefanov</a>, on some (currently exploratory) investigations into authorship recognition (see my post on <a id="l:uk" title="De-anonymizing the Internet" href="http://33bits.org/2009/01/15/de-anonymizing-the-internet/">De-anonymizing the Internet</a>). We&#8217;ve been wondering why progress in existing papers seems to hit a wall at around 100 authors, and how we can break past this limit and carry out de-anonymization on a truly Internet scale. My conjecture is that most previous papers hit the wall because they framed authorship recognition as a classification problem, which is probably the right model for forensics applications. For breaking Internet anonymity, however, this model is not appropriate.</p>
<p>In a de-anonymization problem, if you only succeed for some fraction of the authors, but you do so in a verifiable way, i.e., your algorithm either says &#8220;Here is the identity of X&#8221; or &#8220;I am unable to de-anonymize X&#8221;, that&#8217;s great. In a classification problem, that&#8217;s not acceptable. Further, in de-anonymization, if we can reduce the set of candidate identities for X from a million to (say) 10, that&#8217;s fantastic. In a classification problem, that&#8217;s a 90% error rate, since picking one of the 10 at random is wrong nine times out of ten.</p>
<p>These may seem like minor differences, but they radically affect the variety of features that we are able to use. We can throw in a whole lot of features that only work for some authors but not for others. This is why I believe that Internet-scale text de-anonymization is fundamentally possible, although it will only work for a subset of users that cannot be predicted beforehand.</p>
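<p>The difference between the two framings can be sketched as a decision rule that is allowed to abstain. The Python below is a hypothetical illustration, not code from any of the papers mentioned: the function name, the score values, and the "best score must lead the runner-up by 1.5 standard deviations" threshold are assumptions chosen for the example.</p>

```python
import statistics


def decide(scores, threshold=1.5, shortlist_size=10):
    """Name a candidate only if it clearly stands out; otherwise abstain.

    scores: dict mapping candidate identity -> match score (higher is better).
    Returns ("identified", name) or ("unable", shortlist of top candidates).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not ranked[1:]:
        raise ValueError("need at least two candidates to judge whether one stands out")

    best = scores[ranked[0]]
    runner_up = scores[ranked[1]]
    spread = statistics.pstdev(list(scores.values()))

    # Commit to an answer only when the lead over the runner-up is large
    # relative to the overall spread of scores.
    if spread > 0 and (best - runner_up) / spread >= threshold:
        return ("identified", ranked[0])

    # Abstain, but still hand back a reduced candidate set.
    return ("unable", ranked[:shortlist_size])


clear_case = {"alice": 9.0, "bob": 2.0, "carol": 1.5, "dave": 1.0}
murky_case = {"alice": 3.0, "bob": 2.9, "carol": 2.8, "dave": 1.0}

clear_result = decide(clear_case)
murky_result = decide(murky_case, shortlist_size=2)
```

<p>In the clear case the sketch names <code>alice</code>. In the murky case it abstains and returns the shortlist <code>["alice", "bob"]</code>, which a classifier would have to count as an error but which still shrinks the candidate set.</p>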
<p><strong>Re-identification science.</strong> Paul Ohm refers to what I and other researchers do as &#8220;re-identification science.&#8221; While this is flattering, I don&#8217;t think we&#8217;ve done enough to deserve the badge. But we need to change that, because efforts to understand re-identification algorithms by reducing them to known paradigms have been unsuccessful, as I have shown in this post. Among other things, we need to better understand the theoretical limits of anonymization and to extract the common principles underlying the more complex re-identification techniques developed in recent years.</p>
<p><em>Thanks to Vitaly Shmatikov for reviewing an earlier draft of this post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/10/14/de-anonymization-is-not-x-the-need-for-re-identification-science/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>Oklahoma Abortion Law: Bloggers get it Wrong</title>
		<link>http://33bits.org/2009/10/09/oklahoma-abortion-law-the-bloggers-get-it-wrong/</link>
		<comments>http://33bits.org/2009/10/09/oklahoma-abortion-law-the-bloggers-get-it-wrong/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 18:24:11 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[law]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=254</guid>
		<description><![CDATA[The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of [...]]]></description>
			<content:encoded><![CDATA[<p>The State of Oklahoma just passed <a id="m-ql" title="legislation" href="http://www.sos.state.ok.us/documents/Legislation/52nd/2009/1R/HB/1595.pdf">legislation</a> requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law <a id="j_c2" title="controversy" href="http://www.huffingtonpost.com/2009/10/08/oklahoma-abortion-law-det_n_313779.html">gained steam rapidly</a> after bloggers <a id="e:q3" title="revealed" href="http://feministsforchoice.com/new-oklahoma-abortion-law-being-challenged.htm">revealed</a> that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.</p>
<p>Because I am a computer scientist studying re-identification, this was brought to my attention. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the <a id="qqpj" title="text of the law" href="http://www.sos.state.ok.us/documents/Legislation/52nd/2009/1R/HB/1595.pdf">text of the law</a> a more careful reading, and realized that the bloggers had been misinterpreting the law all along.</p>
<p>While it is true that the law requires submitting a detailed form to the Department of Health, the only information that is made <em>public</em> consists of annual reports with statistical tallies of the number of abortions performed under very broad categories, which present a negligible to non-existent re-identification risk.</p>
<p>I&#8217;m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a <a id="clj1" title="lawsuit" href="http://www.courthousenews.com/2009/10/01/New_Abortion_Law_Challenged_in_Oklahoma.htm">lawsuit</a> aimed at stopping the law from going into effect. The <a id="c8oc" title="complaint" href="http://lawprofessors.typepad.com/files/oklaabort.pdf">text</a> of this complaint, as <a href="http://paulohm.com/">Paul Ohm</a> notes, does <em>not</em> raise the &#8220;public posting&#8221; claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.</p>
<p>But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington Post, which have the credibility of blogs but a readership approaching that of traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I&#8217;m reminded of <a id="ahuz" title="the affair of the gay sheep" href="http://www.nytimes.com/2007/01/25/science/25sheep.html">the affair of the gay sheep</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/10/09/oklahoma-abortion-law-the-bloggers-get-it-wrong/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs</title>
		<link>http://33bits.org/2009/05/13/your-morning-commute-is-unique-on-the-anonymity-of-homework-location-pairs/</link>
		<comments>http://33bits.org/2009/05/13/your-morning-commute-is-unique-on-the-anonymity-of-homework-location-pairs/#comments</comments>
		<pubDate>Wed, 13 May 2009 06:42:11 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=176</guid>
		<description><![CDATA[Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations &#8212; to a block level &#8212; identifies them uniquely. Even if we look at the much [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignright" title="Map" src="http://farm4.static.flickr.com/3544/3526741531_a8caf22c7d_m.jpg" alt="" width="240" height="205" />Philippe Golle and Kurt Partridge of PARC have a cute paper (<a href="http://crypto.stanford.edu/~pgolle/papers/commute.pdf">pdf</a>) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations &#8212; to a block level &#8212; identifies them <em>uniquely</em>.</p>
<p>Even if we look at the much coarser granularity of a <a href="http://en.wikipedia.org/wiki/Census_tract">census tract</a> &#8212; tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract &#8212; for the average person, there are only around 20 other people who share the same home and work location. There&#8217;s more: 5% of people are uniquely identified by their home and work locations <em>even if it is known only at the census tract level</em>. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.</p>
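<p>The quantity being measured here is the size of each person's anonymity set: how many people share the same (home, work) pair. A minimal sketch of that computation in Python follows. The tract labels and the six-person dataset are invented for illustration and have nothing to do with the Census data used in the paper.</p>

```python
from collections import Counter

# One (home_tract, work_tract) pair per person. Hypothetical toy data.
commutes = [
    ("T1", "T1"),
    ("T1", "T1"),
    ("T1", "T1"),
    ("T1", "T2"),
    ("T1", "T2"),
    ("T2", "T9"),  # lives and works in very different areas
]


def anonymity_set_sizes(pairs):
    """For each person, count the people (including themselves) sharing their pair."""
    counts = Counter(pairs)
    return [counts[pair] for pair in pairs]


def fraction_unique(pairs):
    """Fraction of people whose (home, work) pair belongs to nobody else."""
    sizes = anonymity_set_sizes(pairs)
    return sum(1 for size in sizes if size == 1) / len(sizes)


sizes = anonymity_set_sizes(commutes)
unique_share = fraction_unique(commutes)
```

<p>Note that the sizes in this sketch include the person themselves, whereas the "20 other people" figure above does not. Making the areas finer (blocks instead of tracts) can only split the groups further, so anonymity sets shrink as location data gets more precise.</p>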
<p>The paper is timely, because Location Based Services are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:</p>
<ol>
<li>Who has access to anonymized location data?</li>
<li>How can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?</li>
</ol>
<p>The authors don&#8217;t say much about these questions, but that&#8217;s probably because there are too many possibilities to list! In this post I will examine a few.</p>
<p><img class="alignright" title="GPS" src="http://www.blogcdn.com/www.engadget.com/media/2007/03/nyctaxi.jpg" alt="" width="220" height="108" /><strong>GPS navigation.</strong> This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as <a href="http://www.engadget.com/2007/03/09/nyc-cab-drivers-say-no-thanks-to-gps-installation/">NYC cab drivers threatening to strike</a>. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider &#8212; unlike in the context of <a href="http://33bits.org/2009/03/19/de-anonymizing-social-networks/">social networks</a>, people often don&#8217;t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn&#8217;t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.</p>
<p>The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there&#8217;s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.</p>
<p><a href="http://www.google.com/latitude/intro.html"><img class="size-full wp-image-189 alignright" title="Screenshot" src="http://33bits.files.wordpress.com/2009/05/screenshot1.png?w=455" alt="Screenshot"   /></a><strong>Location based social networking.</strong> Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. <a href="http://www.google.com/latitude/intro.html">Google Latitude</a> and <a href="http://loopt.com/">Loopt</a> are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don&#8217;t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a <em>pair</em> of locations is uniquely identifying, a <em>trail</em> is emphatically so.</p>
<p>The same threat also applies to data being subpoena&#8217;d, so data retention policies need to take into consideration the uselessness of anonymizing location data.</p>
<p>I don&#8217;t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?</p>
<p><strong>Plain old web browsing.</strong> Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of <a href="http://www.google.com/search?q=ip+address+geolocation+database">free or paid databases</a> for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.</p>
<p>A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. <a href="http://www.skyhookwireless.com/howitworks/wps.php">Skyhook, Inc.</a> has spent years <a href="http://en.wikipedia.org/wiki/Wardriving">wardriving</a> the country (and <a href="http://www.skyhookwireless.com/careers/drivers.php">abroad</a>) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook&#8217;s database and determine the user&#8217;s current location. Note that you don&#8217;t have to be browsing wirelessly &#8212; all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has <a href="http://www.intomobile.com/2009/03/27/opera-working-on-w3c-standardized-geolocation-api-partners-with-skyhook.html">teamed up with Skyhook</a> and is &#8220;looking forward to a future where geolocation data is as assumed part of the browsing experience.&#8221; The protocol by which the browser communicates geolocation to the website is being <a href="http://dev.w3.org/geo/api/spec-source.html">standardized by the W3C</a>.</p>
<p>The good news from the privacy standpoint is that the accurate geolocation technologies like the Skyhook plug-in (and a <a href="http://blog.programmableweb.com/2008/10/22/google-gears-geolocation-api-gets-wifi/">competing offering</a> that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.</p>
<p><a href="http://gizmodo.com/389268/eye+fi-announces-explore-share-and-home-models"><img class="alignright" title="Eye-Fi SD card" src="http://gizmodo.com/assets/resources/2008/05/eyeexplore.jpg" alt="" width="120" height="150" /></a><strong>Image metadata.</strong> An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.</p>
<p>If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.</p>
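<p>Finding those clusters does not take anything sophisticated. As a rough sketch in Python, one can snap each geotag to a coarse grid cell and count how often each cell occurs. The coordinates below are invented, and rounding to two decimal places (cells on the order of a kilometer) is an arbitrary assumption for illustration.</p>

```python
from collections import Counter

# (latitude, longitude) of geotagged photos from one pseudonymous account.
# Hypothetical values for illustration only.
photos = [
    (37.4271, -122.1693),
    (37.4268, -122.1702),
    (37.4273, -122.1688),
    (37.7793, -122.4192),
    (37.7789, -122.4201),
    (36.6002, -121.8947),  # a one-off trip
]


def frequent_places(geotags, precision=2, top=2):
    """Return the `top` most common grid cells, most frequent first."""
    cells = Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in geotags
    )
    return [cell for cell, _count in cells.most_common(top)]


likely_home_and_work = frequent_places(photos)
```

<p>The two most frequent cells are candidate home and work locations. A fixed grid can split a cluster that straddles a cell boundary, so a real analysis would probably want a proper clustering method, but the point is that the signal is easy to extract.</p>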
<p>Now let us turn to the other major question: what are the <strong>sources of auxiliary data</strong> that might link location pairs to identities? The easiest approach is probably to buy data from <a href="http://acxiom.com">Acxiom</a>, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a &#8220;join,&#8221; i.e., find the (hopefully) unique common individual. This should be easy with Acxiom, which lets you filter the list by &#8220;DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.&#8221;</p>
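<p>The join described above amounts to intersecting two lists on a person identifier. The Python sketch below assumes a hypothetical record layout (a <code>person_id</code> and a <code>name</code> field) and made-up entries; it does not reflect the format of any real marketing list.</p>

```python
# People associated with the home neighborhood (hypothetical records).
home_neighborhood_list = [
    {"person_id": 101, "name": "A. Example"},
    {"person_id": 102, "name": "B. Example"},
    {"person_id": 103, "name": "C. Example"},
]

# People associated with the work neighborhood (hypothetical records).
work_neighborhood_list = [
    {"person_id": 207, "name": "D. Example"},
    {"person_id": 102, "name": "B. Example"},
    {"person_id": 311, "name": "E. Example"},
]


def common_individuals(home_records, work_records):
    """Return the work-list records whose person also appears in the home list."""
    home_ids = {record["person_id"] for record in home_records}
    return [record for record in work_records if record["person_id"] in home_ids]


matches = common_individuals(home_neighborhood_list, work_neighborhood_list)
```

<p>If exactly one record comes back, the location pair has been tied to an identity. If several come back, the attacker is left with a small candidate set that other information may narrow down.</p>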
<p>Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary &#8212; such as when you&#8217;re trying to browse anonymously &#8212; the adversary already has access to the auxiliary data. Google&#8217;s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.</p>
<p>Finally, while I&#8217;ve talked about image data being the <em>target</em> of de-anonymization, it may equally well be used as the <em>auxiliary information</em> that links a location pair to an identity &#8212; a <em>non-anonymous</em> Flickr account with sufficiently many geotagged photos probably reveals an identifiable user&#8217;s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people&#8217;s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)</p>
<p><img class="alignright size-thumbnail wp-image-201" title="devices" src="http://33bits.files.wordpress.com/2009/05/devices1.png?w=150&#038;h=121" alt="devices" width="150" height="121" /><strong>Summary.</strong> A number of devices in our daily lives transmit our physical location to service providers whom we don&#8217;t necessarily trust, and who might keep this data around or transmit it to third parties we don&#8217;t know about. The average user simply doesn&#8217;t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/05/13/your-morning-commute-is-unique-on-the-anonymity-of-homework-location-pairs/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://farm4.static.flickr.com/3544/3526741531_a8caf22c7d_m.jpg" medium="image">
			<media:title type="html">Map</media:title>
		</media:content>

		<media:content url="http://www.blogcdn.com/www.engadget.com/media/2007/03/nyctaxi.jpg" medium="image">
			<media:title type="html">GPS</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2009/05/screenshot1.png" medium="image">
			<media:title type="html">Screenshot</media:title>
		</media:content>

		<media:content url="http://gizmodo.com/assets/resources/2008/05/eyeexplore.jpg" medium="image">
			<media:title type="html">Eye-Fi SD card</media:title>
		</media:content>

		<media:content url="http://33bits.files.wordpress.com/2009/05/devices1.png?w=150" medium="image">
			<media:title type="html">devices</media:title>
		</media:content>
	</item>
		<item>
		<title>De-anonymizing Social Networks</title>
		<link>http://33bits.org/2009/03/19/de-anonymizing-social-networks/</link>
		<comments>http://33bits.org/2009/03/19/de-anonymizing-social-networks/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 11:09:34 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>
		<category><![CDATA[social networks]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=161</guid>
		<description><![CDATA[Our social networks paper is finally officially out! It will be appearing at this year&#8217;s IEEE S&#38;P (Oakland). Download: PDF &#124; PS &#124; HTML Please read the FAQ about the paper. Abstract: Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. [...]]]></description>
			<content:encoded><![CDATA[<p>Our social networks paper is finally officially out! It will be appearing at this year&#8217;s IEEE S&amp;P (Oakland).</p>
<p>Download: <a href="http://www.cs.utexas.edu/~shmat/shmat_oak09.pdf">PDF</a> | <a href="http://www.cs.utexas.edu/~shmat/shmat_oak09.ps">PS</a> | <a href="http://randomwalker.info/social-networks/">HTML</a></p>
<p>Please read the <strong><a href="http://www.cs.utexas.edu/~shmat/socialnetworks-faq.html">FAQ about the paper</a></strong>.</p>
<p><strong>Abstract:</strong></p>
<blockquote><p>Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, <span class="textit">i.e.</span>, removing names, addresses, <span class="textit">etc.</span></p>
<p>We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.</p>
<p>Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy &#8220;sybil&#8221; nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary&#8217;s auxiliary information is small.</p></blockquote>
<p>The HTML version was produced using my <a href="http://randomwalker.info/projectluther/">Project Luther</a> software, which in my opinion produces much prettier output than anything else (especially math formulas). Another big benefit is the handling of citations: it automatically searches various bibliographic databases and adds abstract/bibtex/download links and even finds and adds links to author homepages in the bib entries.</p>
<p>I have never formally announced or released Luther; it needs more work before it can be generally usable, and my time is limited. Drop me a line if you&#8217;re interested in using it.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/03/19/de-anonymizing-social-networks/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>
	</item>
		<item>
		<title>Anonymous Data Collection: Lessons from the A-Rod Affair</title>
		<link>http://33bits.org/2009/02/19/anonymous-data-collection-lessons-from-the-a-rod-affair/</link>
		<comments>http://33bits.org/2009/02/19/anonymous-data-collection-lessons-from-the-a-rod-affair/#comments</comments>
		<pubDate>Thu, 19 Feb 2009 02:24:10 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[a-rod]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=138</guid>
		<description><![CDATA[Recently, the Alex Rodriguez steroid controversy has been in the news. The aspect that interests me is the manner in which it came to attention: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Alex_Rodriguez_Talking.jpg/180px-Alex_Rodriguez_Talking.jpg" alt="" style="float:right;">Recently, the Alex Rodriguez steroid controversy has been <a id="ewu1" title="in the news" href="http://news.google.com/news?q=alex+rodriguez+steroids&amp;btnG=Search+News">in the news</a>. The aspect that interests me is the <a id="fu_z" title="manner in which it came to attention" href="http://www.nytimes.com/2009/02/16/technology/16link.html">manner in which it came to attention</a>: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players were using banned substances. When Federal agents came calling, the sample turned out to be not so anonymous after all.</p>
<p>The <a id="d.8r" title="failure of anonymity" href="http://www.nytimes.com/2009/02/11/sports/baseball/11orza.html?ref=sports">failure of anonymity</a> here was total&#8211;the testing lab simply failed to destroy the samples or even take the labels off them, and the Players&#8217; Union, which conducted the survey, failed to call the lab and ask them to do so during the more than one-week window that they had before the subpoena was issued.</p>
<p>However, there are a number of ways in which things could have gone wrong even if one or more of the parties had followed proper procedure. None of the scenarios below result in as straightforward an association between player and steroid use as we have seen. On the other hand, they can be just as damaging in the court of public opinion.</p>
<ul>
<li>If the samples were not destroyed, but simply de-identified, <a id="no:4" title="DNA can be recovered even after years" href="http://www.google.com/search?q=stored+urine+sample+dna">DNA can be recovered even after years</a>, and the DNA can be used to match the player to the sample. You might argue the feds can&#8217;t easily get hold of players&#8217; DNA to run such a match, but once the association between drug test result and DNA has been made, it is a sword of Damocles hanging over the player&#8217;s head (note that A-Rod&#8217;s drug test happened six years ago). The trend in recent years has been toward increased DNA profiling and bigger and bigger databases, and unlabeled samples therefore pose a clear danger.</li>
<li>If the samples are destroyed, and the test results are stored in de-identified form, anonymity could still be compromised. A drug test measures the concentrations of a bunch of different chemicals in the urine. It is likely that this results in a &#8220;profile&#8221; that is characteristic of a person&#8211;just like a variety of other biometric characteristics. If the same player, having stopped the use of banned substances, provides another urine sample, it is possible that this profile can be matched to the old one based on the fact that most of the urine chemicals have not changed in concentration. It is an interesting research question to see how stable the &#8220;profiles&#8221; are, and what their discriminatory power is.</li>
<li>Even more sophisticated attacks are possible. Let&#8217;s say that participant names are known, but other than that the only thing that&#8217;s released is a single statistic: the percentage of players that tested positive. Now, if the survey is performed on a regular basis, and a certain player (who happens to use steroids) participates only some of the time, the overall statistic is going to be slightly higher whenever that player participates. In spite of confounding factors, such as the fact that other players might also drop in and out, statistical techniques can be used to tease out this correlation. 
<p>This might sound like a tall order at first, but it is a proven attack strategy. The technique was used recently in a <a id="fupu" title="PLoS Genetics paper" href="http://spittoon.23andme.com/2008/08/28/faces-in-a-crowd-new-dna-technique-can-pick-one-persons-dna-signature-out-of-hundreds/">PLoS Genetics paper</a> to identify if an individual had contributed DNA to an aggregate sample of hundreds of individuals. </p>
<p>I performed a quick experiment, assuming that there are 1,000 players in the sample, of which 100 participate half the time (the rest participate all the time). 5% of the players dope, and each player either dopes throughout the study period or not at all. Testing is done every 3 months; the list of participants in each wave of the survey is known, as well as the percentage of players who tested positive in each wave. I found that after 3 years, there is enough information to identify 80% of the cheating players who participate irregularly. (Players who participate regularly are clearly safe.) </p>
<p><em>[Technical note: that's an equal error rate of 20%; i.e., 20% of the cheating players are not accused, and 20% of the accused are innocent. There is a trade-off between the two numbers, as always; if a higher accuracy is required, say only 10% of accused players are innocent, then 65% of the cheating players can be identified.]</em></p></li>
<li>When applicable, a combination of the above techniques such as matching de-identified profiles across different time-periods of a survey (or different surveys) can greatly increase the attacker&#8217;s potential.</li>
</ul>
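<p>For readers who want to play with the aggregate-statistic attack, here is a minimal Python sketch. It is not the code behind the numbers quoted above; it is a simplified illustration under stated assumptions. The function names, the coin-flip attendance model for irregular players, and the scoring rule (mean positive count in waves a player attended, minus the mean in waves they missed) are all my illustrative choices. The positive count per wave is taken as known, since it follows from the published percentage and the known participant list.</p>

```python
import random
from statistics import mean

def simulate(n_regular=900, n_irregular=100, dope_rate=0.05, waves=12, seed=0):
    """Simulate the survey. Returns (attendance, positives, irregular_dopers)."""
    rng = random.Random(seed)
    # Regular players attend every wave, so their dopers add a constant.
    regular_dopers = sum(rng.random() < dope_rate for _ in range(n_regular))
    irregular_dopers = {j for j in range(n_irregular) if rng.random() < dope_rate}
    attendance, positives = [], []
    for _ in range(waves):
        # Assumption: each irregular player shows up with probability 1/2.
        present = {j for j in range(n_irregular) if rng.random() < 0.5}
        attendance.append(present)
        positives.append(regular_dopers + len(present & irregular_dopers))
    return attendance, positives, irregular_dopers

def score_players(attendance, positives, n_irregular):
    """Score each irregular player: mean positives when present minus when absent."""
    scores = []
    for j in range(n_irregular):
        there = [p for a, p in zip(attendance, positives) if j in a]
        away = [p for a, p in zip(attendance, positives) if j not in a]
        scores.append(mean(there) - mean(away) if there and away else 0.0)
    return scores

# Rank irregular players by score; the highest scores are the prime suspects.
attendance, positives, dopers = simulate()
scores = score_players(attendance, positives, 100)
suspects = sorted(range(100), key=lambda j: scores[j], reverse=True)[:5]
```

<p>A player whose presence tends to coincide with a higher positive count gets a higher score. A more careful attack would fit all players jointly rather than scoring them one at a time, and would choose the accusation threshold to trade off the two error rates.</p>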
<p>The point of the above scenarios is to convince you that you can never, ever be certain that the connection between a person and their data has been definitively severed. Regular readers of this blog will know that this is a recurring theme of my research. The quantity of data being collected today and the computational power available have destroyed the traditional and ingrained assumptions about anonymity. <a id="z::g" title="Well-established procedures" href="http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html">Well-established procedures</a> have been shown to be <a id="s:cs" title="completely inadequate" href="http://arxiv.org/abs/0803.0032">completely inadequate</a>, and it is far from clear that things can be fixed. Anyone who cares about their privacy must be vigilant against giving up their data under false promises of anonymity.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/02/19/anonymous-data-collection-lessons-from-the-a-rod-affair/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Alex_Rodriguez_Talking.jpg/180px-Alex_Rodriguez_Talking.jpg" medium="image" />
	</item>
		<item>
		<title>De-anonymizing the Internet</title>
		<link>http://33bits.org/2009/01/15/de-anonymizing-the-internet/</link>
		<comments>http://33bits.org/2009/01/15/de-anonymizing-the-internet/#comments</comments>
		<pubDate>Thu, 15 Jan 2009 03:16:45 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[author recognition]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[re-identification]]></category>
		<category><![CDATA[stylometry]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=108</guid>
		<description><![CDATA[I&#8217;ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I&#8217;ve discussed this with many privacy researchers but until recently never written anything down. When someone asked essentially [...]]]></description>
			<content:encoded><![CDATA[<p><img style="width:106px;height:118px;float:right;margin:0 0 0 1em;" src="https://docs.google.com/File?id=dgwzqjjp_183djt7n3hb_b" alt="" />I&#8217;ve been thinking about this problem for quite a while: <strong><em>is it possible to de-anonymize text that is posted</em><em> anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known?</em></strong> I&#8217;ve discussed this with many privacy researchers but until recently never written anything down. When someone asked <a id="nhoi" title="essentially the same question" href="http://news.ycombinator.com/item?id=413730">essentially the same question</a> on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.</p>
<p>Each one of us has a writing style that is idiosyncratic enough to have a unique &#8220;fingerprint&#8221;. However, it is an open question whether it can be efficiently extracted.</p>
<p>The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say &#8216;since&#8217; and &#8216;because&#8217;. Different people use the two words in differing proportions. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these &#8216;markers&#8217;, you can construct a profile.</p>
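<p>To make the marker idea concrete, here is a minimal Python sketch. The marker pairs other than since/because, the add-one smoothing, and the Euclidean distance are illustrative assumptions on my part, not an established stylometry method; a real system would pick its markers empirically and use many more of them.</p>

```python
import math
import re
from collections import Counter

# Hypothetical marker pairs; only since/because comes from the text above.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"), ("although", "though")]

def fingerprint(text, pairs=MARKER_PAIRS):
    """For each pair, return the smoothed share of uses that go to the first word."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    # Add-one smoothing keeps the ratio defined when neither word appears.
    return [(counts[a] + 1) / (counts[a] + counts[b] + 2) for a, b in pairs]

def distance(fp1, fp2):
    """Euclidean distance between two fingerprints; smaller means more alike."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fp1, fp2)))
```

<p>To attribute an anonymous text, you would compute its fingerprint and look for the known author whose fingerprint is nearest. With only three markers this tells you very little; the point is that each extra marker adds a little more information.</p>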
<p>The beginning of modern, rigorous research in this field was by <a id="x49h" title="Mosteller and Wallace" href="http://www.press.uchicago.edu/presssite/metadata.epl?mode=synopsis&amp;bookkey=256524">Mosteller and Wallace</a> in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other &#8220;coups&#8221; for writing-style de-anonymization are the <a id="wcjr" title="identification of the author of Primary Colors" href="http://en.wikipedia.org/wiki/Primary_Colors#Unmasking_of_anonymous">identification of the author of Primary Colors</a>, as well as the Unabomber (<a id="m080" title="his brother recognized his style" href="http://en.wikipedia.org/wiki/Unabomber#Search">his brother recognized his style</a>; it wasn&#8217;t done by statistical/computational means).</p>
<p>The current state of the art is summarized in this <a id="syzl" title="bibliography" href="http://www.stat.rutgers.edu/%7Emadigan/AUTHORID/bibliography.html">bibliography</a>. Now, that list stops at 2005, but I&#8217;m assuming there haven&#8217;t been earth-shattering changes since then. I&#8217;m familiar with the results from those papers; the curious thing is that they stop at corpora of a couple hundred authors or so &#8212; i.e., identifying one anonymous poster out of, say, 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.</p>
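<p>The logarithmic scaling is easy to check with a few lines of Python. The <code>markers_needed</code> helper and its figure of half a bit per marker are purely illustrative assumptions (the only thing stated above is that a marker is typically worth under 1 bit), and it optimistically treats markers as independent.</p>

```python
import math

def bits_needed(candidates):
    """Bits of identifying information needed to single out one of `candidates`."""
    return math.log2(candidates)

def markers_needed(candidates, bits_per_marker=0.5):
    """Markers required, assuming independent markers of equal (hypothetical) value."""
    return math.ceil(bits_needed(candidates) / bits_per_marker)

for n in (200, 20_000, 1_000_000):
    print(f"{n:,} authors: {bits_needed(n):.1f} bits")
```

<p>Going from 200 authors to a million multiplies the candidate pool by 5,000 but raises the information needed only from roughly 7.6 bits to roughly 19.9 bits, which is why scaling up looks plausible.</p>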
<p>The other interesting thing is that the papers are fixated on &#8216;topic-free&#8217; identification, where the texts aren&#8217;t about a particular topic, making the problem harder. The good news is that when you&#8217;re doing this at Internet scale, nobody is stopping you from using topic information, making it a lot easier.</p>
<p>So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you&#8217;d need fairly long texts, perhaps a page or two. It&#8217;s doubtful that anything can be done with a single average-length email.</p>
<p>Another potential de-anonymization strategy is to use typing pattern fingerprinting (<a id="h5e0" title="keystroke dynamics" href="http://en.wikipedia.org/wiki/Keystroke_dynamics">keystroke dynamics</a>), i.e., analyzing the timing between our keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it&#8217;s very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you when you were logged in, they could use the same JavaScript to identify you on any web page where you type in text even if you don&#8217;t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics&#8230;</p>
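<p>As a rough illustration of what a keystroke fingerprint could look like, here is a Python sketch that builds a profile of average delays between consecutive key presses and compares two profiles. The event format (key, press time in milliseconds), the function names, and the distance measure are assumptions made for this example; commercial products presumably use richer features.</p>

```python
from statistics import mean

def digraph_profile(events):
    """events: list of (key, press_time_ms). Returns mean latency per key pair."""
    latencies = {}
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        latencies.setdefault(k1 + k2, []).append(t2 - t1)
    return {pair: mean(ts) for pair, ts in latencies.items()}

def profile_distance(p1, p2):
    """Mean absolute latency difference over the key pairs both profiles share."""
    shared = p1.keys() & p2.keys()
    if not shared:
        return float("inf")  # nothing in common, so no basis for comparison
    return mean(abs(p1[d] - p2[d]) for d in shared)
```

<p>A profile collected while you are logged in could then be compared against typing captured elsewhere; a small distance suggests the same typist.</p>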
<p>By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The <a id="s_0q" title="great firewall of China" href="http://en.wikipedia.org/wiki/Golden_Shield_Project">Great Firewall of China</a> could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.</p>
<p>I think it&#8217;s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however. First, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts &#8212; rants about employers, accounts from cheating spouses, political dissent, etc. &#8212; that have <em>already</em> been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.</p>
<p>If you&#8217;re interested in working on this problem&#8211;either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity&#8211;drop me a line.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2009/01/15/de-anonymizing-the-internet/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="https://docs.google.com/File?id=dgwzqjjp_183djt7n3hb_b" medium="image" />
	</item>
		<item>
		<title>The Fallacy of Anonymous Institutions</title>
		<link>http://33bits.org/2008/12/15/the-fallacy-of-anonymous-institutions/</link>
		<comments>http://33bits.org/2008/12/15/the-fallacy-of-anonymous-institutions/#comments</comments>
		<pubDate>Mon, 15 Dec 2008 10:48:03 +0000</pubDate>
		<dc:creator>Arvind Narayanan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[anonymity]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[re-identification]]></category>
		<category><![CDATA[social networks]]></category>

		<guid isPermaLink="false">http://33bits.org/?p=97</guid>
		<description><![CDATA[The graph below is from the paper &#8220;Chains of affection: The structure of adolescent romantic and sexual networks.&#8221; The name of the school that the data was collected from is not revealed, and is given the working name &#8220;Jefferson High.&#8221; It is part of the National Longitudinal Study of Adolescent Health, containing very detailed health [...]]]></description>
			<content:encoded><![CDATA[<p>The graph below is from the paper &#8220;<a id="dejc" title="The structure of adolescent romantic and sexual networks" href="http://www.citeulike.org/user/eegilbert/article/3390957">Chains of affection: The structure of adolescent romantic and sexual networks</a>.&#8221; The name of the school that the data was collected from is not revealed, and is given the working name &#8220;Jefferson High.&#8221; It is part of the <a id="fkof" title="National Longitudinal Study of Adolescent Health" href="http://www.cpc.unc.edu/projects/addhealth">National Longitudinal Study of Adolescent Health</a>, containing very detailed health information on 100,000 high school students in 140 schools. In 12 of the schools, the entire sexual network was mapped out.</p>
<div id="bl" style="text-align:center;padding:1em 0;"><a href="http://docs.google.com/File?id=dgwzqjjp_162gd7qkhfh_b" target="_blank"><img style="width:400px;height:327.5px;" src="http://docs.google.com/File?id=dgwzqjjp_162gd7qkhfh_b" alt="" /></a></div>
<p>Clearly, the authors felt that concealing the identity of the school is important for protecting the privacy of the participants. It&#8217;s not hard to see why: first, the aggregate information presented in the study could by itself be unpleasant, especially facts about the prevalence of adolescent sexual activity in a conservative rural town (see below). Second, and more importantly, knowing the identity of the school can lead to further de-anonymization of the individuals in the network.</p>
<p>The graph above is rich enough that a few individuals can identify themselves purely based on the local information available to them, and thus learn things about their neighbors in the graph. A group of individuals getting together will have an even easier time of it. Furthermore, the actual paper provides a <strong>richer, temporally ordered version</strong> of the graph above.</p>
<p>But even strangers may benefit: depending on how well the temporal information in the sexual graph correlates with other temporal information that may be available, say from Facebook, de-anonymization might be possible with little or no co-operation from the subjects themselves. Soon, I will have more to say about research results on de-anonymizing graphs with loosely correlated external/auxiliary data.</p>
<p>Having established the privacy risk, let&#8217;s see how easy it is to re-identify Jefferson High. The authors give us these helpful clues:</p>
<blockquote><p>“Jefferson High School” is an almost all-white high school of roughly 1000 students located in a mid-sized mid-western town. Jefferson High is the only public high school in the town. The town, “Jefferson City” is over an hour away by car from the nearest large city. Jefferson City is surrounded by beautiful countryside, home to many agricultural enterprises. The town itself is working class, although there remain some vestiges of better times. At one period, the town served as a resort for city dwellers, drawing an annual influx of summer visitors. This is no longer the case, and many of the old resort properties show signs of decay. The community is densely settled. At the time of our fieldwork, students were reacting to the deaths of two girls killed in an automobile accident.</p></blockquote>
<p>Some further facts presented have high amusement value, and are equally useful for re-identification:</p>
<blockquote><p>Jefferson students earn lower grades, are suspended more, feel less attached to school, and come from poorer families than those at comparable schools. They are more likely than students in other high schools to have trouble paying attention, have lower self-esteem, pray more, have fewer expectations about college, and are more likely to have a permanent tattoo.  Compared to other students in large disproportionately white schools, adolescents in Jefferson High are more likely to drink until they are drunk. In schools of comparable race and size, on average 30% of 10th-12th grade students smoke cigarettes regularly, whereas in Jefferson, 36% of all 10th to 12th graders smoke. Drug use is moderate, comparable to national norms.  Somewhat more than half of all students report having had sex, a rate comparable to the national average, and only slightly higher than observed for schools similar with respect to race and size.  Nevertheless, if Jefferson is not Middletown, it looks like an awful lot like it. The adolescents at Jefferson High are pretty normal. In describing the events of the past year, many students report that there is absolutely nothing to do in Jefferson. For fun, students like to drive to the outskirts of town and get drunk. Jefferson is a close-knit insular predominantly working-class community which offers few activities for its youth.</p></blockquote>
<p>A database of public schools in the U.S. is <a id="b8l9" title="available for sale" href="http://www.odditysoftware.com/page-datasales39.htm">available for sale</a> for $75, containing very detailed information about each school. I&#8217;m quite confident that the information in there is sufficient to re-identify Jefferson High.</p>
<p><em>The thesis of this blog is that the amount of entropy required to de-anonymize an individual &#8212; 33 bits &#8212; is low enough that it doesn&#8217;t offer meaningful protection in most circumstances. Obviously, the argument applies even more strongly to the anonymity of a well-defined group of people.</em><em><br />
</em><br />
Let&#8217;s be clear: the paper is from 1994; who slept with whom in high school is not a huge deal a decade and a half later. However, the problem is systemic, and <em>IRBs (Institutional Review Boards) keep blithely approving releases of data with such nominal de-identification applied</em>. The re-identification of the institutional affiliation of an entire population of a study is of more concern from the privacy perspective than the de-anonymization of individual identities: it needs to be done only once, and affects hundreds or thousands of individuals.</p>
<p>Recently, a group of researchers from the Berkman Center released a <a href="http://dvn.iq.harvard.edu/dvn/dv/t3" target="_blank">dataset</a> of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.” It was promptly <a id="jox8" title="de-anonymized" href="http://michaelzimmer.org/2008/10/03/more-on-the-anonymity-of-the-facebook-dataset-its-harvard-college/">de-anonymized</a> by Michael Zimmer, who revealed that it was Harvard College:</p>
<blockquote><p>As I noted <a href="http://michaelzimmer.org/2008/09/30/on-the-anonymity-of-the-facebook-dataset/" target="_blank">here</a>, the <a href="http://cyber.law.harvard.edu/node/4682" target="_blank">press release</a> and the public <a href="http://dvn.iq.harvard.edu/dvn/dv/t3/faces/study/StudyPage.jsp?studyId=36598&amp;tab=files" target="_blank">codebook</a> for the dataset provided many clues to where the data came from: we know it is a northeastern US university, it is private, co-ed, and whose class of 2009 initially had 1640 students in it. A <a href="http://collegesearch.collegeboard.com/search/adv_typeofschool.jsp" target="_blank">quick search for schools</a> reveals there are <strong>only 7</strong> private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.</p>
<p>[...]</p>
<p>Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all universities offer the common majors of “History”, “Chemistry” or “Economics”, one only needs to search for the more uniquely phrased majors to discover a shared home institution.</p></blockquote>
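<p>The narrowing procedure in that excerpt is just successive filtering of a candidate list. Here is a minimal Python sketch of the idea; the school records are entirely made up for illustration, and the <code>narrow</code> helper is a hypothetical name, not anything used in the original analysis.</p>

```python
# Entirely made-up records for illustration; these are not real schools.
SCHOOLS = [
    {"name": "School A", "private": True, "coed": True, "undergrads": 6700},
    {"name": "School B", "private": True, "coed": True, "undergrads": 2100},
    {"name": "School C", "private": False, "coed": True, "undergrads": 6900},
    {"name": "School D", "private": True, "coed": False, "undergrads": 5400},
    {"name": "School E", "private": True, "coed": True, "undergrads": 5200},
]

def narrow(candidates, clues):
    """Apply each clue (a label and a predicate) in turn, tracking survivors."""
    trail = []
    for label, keep in clues:
        candidates = [s for s in candidates if keep(s)]
        trail.append((label, len(candidates)))
    return candidates, trail

clues = [
    ("private", lambda s: s["private"]),
    ("co-ed", lambda s: s["coed"]),
    ("5000 to 7500 undergraduates", lambda s: 5000 <= s["undergrads"] <= 7500),
]
remaining, trail = narrow(SCHOOLS, clues)
```

<p>Each clue on its own looks harmless, but <code>trail</code> shows the candidate set shrinking with every one applied. One further distinctive attribute, such as an unusually phrased major, is often enough to leave a single survivor.</p>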
<p>Another amusing example is a <a id="t7qc" title="paper on mobile phone call graphs" href="http://arxiv.org/abs/physics/0610104v1">paper on mobile phone call graphs</a> which attempts to keep the identity of an entire country secret. I found that the approximate population of the country reported in the paper together with the mobile phone penetration rate is sufficient to uniquely identify it.</p>
<p>Suppressing the identity of your study population has some privacy benefits: at least, it won&#8217;t show up in Google searches. But relying on it for any kind of serious privacy protection would be foolish. Scrubbing an entire dataset or research paper of clues about the study population can be hard or impossible; further, a single study participant corroborating the published results or methodology might be sufficient for de-anonymization of the group. The only solution is therefore to assume that the identity of the study population will be discovered, and to try to ensure that individual identities will still be safe from re-identification.</p>
]]></content:encoded>
			<wfw:commentRss>http://33bits.org/2008/12/15/the-fallacy-of-anonymous-institutions/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/aa438b63ff1e9b75693aeabbeddae5eb?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">randomwalker</media:title>
		</media:content>

		<media:content url="http://docs.google.com/File?id=dgwzqjjp_162gd7qkhfh_b" medium="image" />
	</item>
	</channel>
</rss>
