Reporting Fail: The Reidentification of Personal Genome Project Participants

Last week, a Forbes article by Adam Tanner announced that a research team led by Latanya Sweeney had re-identified “more than 40% of a sample of anonymous participants” in Harvard’s Personal Genome Project. Sweeney is a progenitor of demonstration attack research. Her research was extremely influential during the design of HIPAA, and I have both praised and criticized her work before.

Right off the bat, Tanner’s article is misleading. From the headline, a reader would assume that research participants were re-identified using their genetic sequence. And the “40% of a sample” line suggests that Sweeney had re-identified 40% of a random sample. Neither of these assumptions is correct. Even using the words “re-identified” and “anonymous” is improvident. Yet the misinformation has proliferated, with later reports rounding the figure up to “nearly half” or even “97%.”

Here’s what actually happened: Sweeney’s research team scraped data on 1,130 random (presumably) volunteers in the Personal Genome Project database. Of those 1,130 volunteers, 579 had voluntarily provided their zip code, full date of birth, and gender. (Note that if the data had been de-identified using the HIPAA standard, zip code and date of birth would have been truncated.) From this special subset, 115 research participants had uploaded files to the Personal Genome Project website with filenames containing their names. (Or the number might be 103—there are several discrepancies in the report’s text and discrimination matrix which frustrate any precise description.) Another 126 of the subgroup sample could be matched to likely identities found in voter registration records and other (unidentified) public records, for a total of 241 re-identifications.

So, from the subset of 579 research participants who provided birth date, zip code, and gender, Sweeney’s team was able to provide a guess for 241 of them—about 42%. Sweeney’s research team submitted these 241 names to the PGP and learned that almost all of them (97%) were correct, allowing for nicknames.

A few things are noteworthy here. First, the 42% figure includes research participants who were “re-identified” using their names. This may be a useful demonstration to remind participants to think about the files they upload to the PGP website. Or it might not; if the files also contained the participants’ names, the participants may have proceeded with the conscious presumption that users of the PGP website were unlikely to harass them. In any case, the embedded names approach is not relevant to an assessment of re-identification risk because the participants were not de-identified. Including these participants in the re-identification number inflates both the re-identification risk and the accuracy rate.

However, if these participants hadn’t uploaded records containing their names, some of them nevertheless would have been re-identifiable through the other routes. Sweeney’s team reports that 35 of those 115 participants with embedded names were also linkable to voter lists or other public records, and 80 weren’t. So, taking out the 80 who could not be linked using public records and voter registers (and assuming that the name was not used to inform and improve the re-identification process for these other 35), Sweeney’s team could claim to have reidentified 161 of the 579 participants who had provided their birthdates, zip codes, and gender. Even if we assume that all of the matches are accurate, the team provided a guess based on public records and voter registration data for only 28% of the sample who had provided their birth dates, zip codes, and genders.
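The arithmetic behind the 42% and 28% figures is easy to check. Here is a minimal sketch that uses only the counts quoted above; the variable names are mine, not the report's, and it takes the 115 figure at face value even though the report's text is inconsistent on that number.

```python
# Counts as quoted in this post. Variable names are illustrative only.
provided_demographics = 579   # gave zip code, full birth date, and gender
named_in_files = 115          # uploaded files with their own names in the filenames
matched_public_records = 126  # others matched via voter rolls and public records
named_also_linkable = 35      # of the 115, also linkable through public records

# Headline figure: every participant for whom the team offered a name.
headline_total = named_in_files + matched_public_records

# Participants found only because their names were embedded in filenames.
named_only = named_in_files - named_also_linkable

# Figure that counts only matches achievable through public records.
records_only_total = headline_total - named_only

headline_rate = headline_total / provided_demographics
records_only_rate = records_only_total / provided_demographics

print(f"headline: {headline_total} of {provided_demographics} ({headline_rate:.1%})")
print(f"public records only: {records_only_total} of {provided_demographics} ({records_only_rate:.1%})")
```

Both rates use 579 as the denominator, so neither one says anything about the full 1,130 scraped profiles.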

In the context of the reidentification risk debate today, 28% is actually quite low. After all, Sweeney has said that “87% of the U.S. population are uniquely identified” by the combination of these three pieces of information. The claim has been repeated so many times that it has achieved nearly axiomatic status.

If anything, the findings from this exercise illustrate the chasm between uniqueness and re-identifiability. The latter requires effort (triangulation between multiple sources), even when the linkable information is basic demographics. Sweeney’s team acknowledges this, reframing the 87% figure as an “upper bound” based on estimates of uniqueness that do not guarantee identifiability. The press has not grasped that this study shows that reidentification risk is lower than many would have expected. Unfortunately, reidentification risk is just technical enough to confuse the uninitiated. For a breathtaking misunderstanding of how Sweeney’s results here relate to her earlier 87% estimate, check out MIT’s Technology Review.

When there is a match, the question is whether the zip, birth date and sex uniquely identify an individual. Sweeney has argued in the past that it does with an accuracy of up to 87 per cent, depending on factors such as the density of people living in the zip code in question.

These results seem to prove her right.

Oh my god, no MIT Tech Review. That is not correct.
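To see why uniqueness and re-identification are different quantities, consider a toy example. Everything in it is synthetic and hypothetical: the records, the names, and the stand-in "voter list" are invented for illustration, and uniqueness is measured within the toy dataset rather than within a real population. It is a sketch of the distinction, not of the method Sweeney's team used.

```python
from collections import Counter

# Entirely synthetic records: (zip code, birth date, sex). No real data.
study = [
    ("97401", "1970-01-15", "F"),
    ("97401", "1970-01-15", "F"),  # shares its combination with the row above
    ("97402", "1985-06-30", "M"),
    ("97403", "1992-11-02", "F"),
    ("97404", "1958-03-09", "M"),
]

# Hypothetical identified source keyed on the same three fields.
voter_list = {
    ("97402", "1985-06-30", "M"): ["A. Example"],
    ("97403", "1992-11-02", "F"): ["B. Sample", "C. Sample"],  # two candidates
    # No entry for the 97404 record: unique in the study, absent from the list.
}

# A record is unique if no other study record shares its combination.
counts = Counter(study)
unique = [r for r in study if counts[r] == 1]

# A record is linked only if the outside source offers exactly one name for it.
linked = [r for r in unique if len(voter_list.get(r, [])) == 1]

uniqueness_rate = len(unique) / len(study)
linkage_rate = len(linked) / len(study)

print(f"unique combinations: {len(unique)} of {len(study)}")
print(f"linked to exactly one name: {len(linked)} of {len(study)}")
```

In this toy, the linked records are a subset of the unique ones, so the linkage rate can never exceed the uniqueness rate. That is the sense in which a uniqueness estimate is an upper bound.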

Though Sweeney’s study has some lapses in critical detail, it is much more careful and much less misleading than the reporting on it. I am especially disappointed by Tanner’s Forbes article. Since Tanner is a colleague and collaborator of Sweeney’s and is able to digest her results, I am disturbed by the gap between Tanner’s reporting and Sweeney’s findings. The fact that many participants were re-identified using their actual names is significant, and it should not have escaped his notice. His decision to exclude this fact contributes to the fearmongering so common in this area.

Smoke If You Got ‘Em

I’m here in rainy, lovely Eugene, Oregon watching the Oregon Law Review symposium, A Step Forward: Creating a Just Drug Policy for the United States. (You can watch it live.) Jane is presenting her paper Defending the Dog – here’s the conclusion:

The narcotics dog doesn’t deserve the bad reputation it has received among scholars. The dog is the first generation of police tools that can usher a dramatic shift away from human criminal investigation and the attendant biases and conflicts of interests. Moreover, the reaction to the narcotics dog, as compared to the cadaver-sniffing dog, reveals an unsettling tendency to exploit criminal procedure when we are not enthusiastic about the underlying substantive criminal law. The natural instinct to do so may be counterproductive because drug enforcement will persist, with uneven results, and without a critical mass of public outrage.

Drug policy is a little far afield from my usual interests, but given the overwhelming use of Title III warrants (about 85% in 2011) to combat drug trafficking, and pending bills such as CISPA (which allows sharing for national security purposes – trafficking has long qualified), it seems well worth a Friday to learn more. (And, Jane’s empirical work brings some helpful rigor to the issue.) Updates as events warrant…

Privacy Law in Sixty Seconds (or so)

I am occasionally struck by my good fortune to write in an area that has such a supportive community. Much credit is due to the influence, ingenuity, and incessant hard work of Paul Schwartz and Dan Solove. Invariably, every privacy scholar has benefited from Dan’s and Paul’s support. This promotional video for their informal treatise Privacy Law Fundamentals nicely captures the combination of thoughtfulness and goofiness that Paul and Dan have fostered. There are some jokes at the expense of celebrities and Europe, which is always good fun. (Privacy Law Fundamentals also happens to be the book I recommend to people who are new to privacy law.)

Is Data Speech?

Jane Yakowitz Bambauer has a new article, “Is Data Speech?”, forthcoming in 66 Stanford Law Review __ (2014). Here’s the abstract:

Privacy laws rely on the unexamined assumption that the collection of data is not speech. That assumption is incorrect. Privacy scholars, recognizing an imminent clash between this long-held assumption and First Amendment protections of information, argue that data is different from the sort of speech the Constitution intended to protect. But they fail to articulate a meaningful distinction between data and other, more traditional forms of expression. Meanwhile, First Amendment scholars have not paid sufficient attention to new technologies that automatically capture data. These technologies reopen challenging questions about what “speech” is.
This Article makes two bold and overdue contributions to the First Amendment literature. First, it argues that when the scope of First Amendment coverage is ambiguous, courts should analyze the government’s motive for regulating. Second, it highlights and strengthens the strands of First Amendment theory that protect the right to create knowledge. Whenever the state regulates in order to interfere with knowledge, that regulation should draw First Amendment scrutiny.
In combination, these theories show clearly why data must receive First Amendment protection. When the collection or distribution of data troubles lawmakers, it does so because data has the potential to inform, and to inspire new opinions. Data privacy laws regulate minds, not technology. Thus, for all practical purposes, and in every context relevant to the privacy debates, data is speech.

Cyberwar and Cyberespionage

My paper “Ghost in the Network” is available from SSRN. It’s forthcoming in the University of Pennsylvania Law Review. I’m appending the abstract and (weirdly, but I hope it will become apparent why) the conclusion below. Comments welcomed.


Cyberattacks are inevitable and widespread. Existing scholarship on cyberespionage and cyberwar is undermined by its futile obsession with preventing attacks. This Article draws on research in normal accident theory and complex system design to argue that successful attacks are unavoidable. Cybersecurity must focus on mitigating breaches rather than preventing them. First, the Article analyzes cybersecurity’s market failures and information asymmetries. It argues that these economic and structural factors necessitate greater regulation, particularly given the abject failures of alternative approaches. Second, the Article divides cyber-threats into two categories: known and unknown. To reduce the impact of known threats with identified fixes, the federal government should combine funding and legal mandates to push firms to redesign their computer systems. Redesign should follow two principles: disaggregation, dispersing data across many locations; and heterogeneity, running those disaggregated components on variegated software and hardware. For unknown threats – “zero-day” attacks – regulation should seek to increase the government’s access to markets for these exploits. Regulation cannot exorcise the ghost in the network, but it can contain the damage it causes.


Something terrible is going to happen in cyberspace. That may help.

The U.S. suffers serious but less visible cyberattacks daily. Complex technology, mixed with victims’ reluctance to disclose the scale of harms, leads to underappreciation of cyber-risks. This disjunction generates the ongoing puzzle of cybersecurity: the gap between dramatic assessments of risks the U.S. faces and minimalist measures the country has taken to address them. America’s predictions do not match its bets. One of those positions is wrong. But the economic and structural factors that impede regulation suggest reform will not occur without a dramatic focusing event.[1] The U.S. did not address its educational deficiencies in math and science until the Soviets launched Sputnik into orbit.[2] Until the near-meltdown at Three Mile Island, America was complacent about nuclear energy safety.[3] And it required the attacks of 9/11 for the country to address the rise in international terrorism, the gaps in its intelligence systems, and the weaknesses in aviation security.[4] This Article’s role is to sit on the shelf, awaiting with dread that focusing event. When it occurs, regulators will need a model for a response. This Article offers one.

Cybersecurity offers copious challenges for future research. Two are particularly relevant for this Article. First, data integrity is a difficult puzzle. Restoring data after attacks is unhelpful if one cannot tell good information from bad – we must be able to distinguish authorized updates from unauthorized ones. This seemingly technical puzzle has important implications for provenance in other areas, from rules of evidence to intellectual property, which struggle with similar authentication problems. Second, nation-states are now engaged in the long twilight struggle of espionage and hacking in cyberspace. At present, there are neither formal rules nor tacit norms that govern conduct. Eventually, though, countries must arrive at accommodations. Spying[5], assassination[6], and armed combat[7] all benefited from shared rules, even during the Cold War. Lawyers can raise awareness of these benefits and help shape the system that emerges. Future research can contribute to both these inquiries.

For now, ghosts roam the network. They cannot be driven out. We must lessen the effects of their touch.

The Illegal Process and Orwell’s Metaphors

James Grimmelmann and David Post have responses to Orwell’s Armchair up at the University of Chicago Law Review’s Dialogue site. I’m grateful and flattered to have them as partners in the discussion, and I am very excited to read their articles!

Privacy, Security, and Cybercrime

In a forthcoming paper, I argue that security and privacy issues differ in important ways that are typically neglected by both scholars and courts. If you’re in Chicago at the end of the week, you can hear me drone on about the piece on a panel on cybercrime at a symposium at Northwestern University School of Law run by the Journal of Criminal Law and Criminology. (It runs from 10:30am – 4:30pm, with a reception afterwards. NwU is at 375 E. Chicago Avenue, and the symposium is in Lincoln Hall, Levy Mayer 104.) Hope to see you there! Here’s the abstract:

Legal scholarship tends to conflate privacy and security. However, security and privacy can, and should, be treated as distinct concerns. Privacy discourse involves difficult normative decisions about competing claims to legitimate access to, use of, and alteration of information. It is about selecting among different philosophies, and choosing how various rights and entitlements ought to be ordered. Security implements those choices – it intermediates between information and privacy selections. This Article argues separating privacy from security has important practical consequences. Security failings should be penalized more readily, and more heavily, than privacy ones, because there are no competing moral claims to resolve, and because security flaws make all parties worse off. Currently, security flaws are penalized too rarely, and privacy ones too readily. The Article closes with a set of policy questions highlighted by the privacy versus security distinction that deserve further research.

Beating Revenge Porn with Copyright

The lawsuit against a scumbag Web site has generated attention to the problem of revenge porn, and to the paucity of legal remedies available to victims of it. Danielle Citron has two excellent posts over at Concurring Opinions analyzing the relevant statutory block, 47 U.S.C. 230, and the few cases that cut through its immunity. (I disagree with Danielle on the statutory interpretation point in the first post – in my view, the courts are right to interpret the language of 230 and not its purpose. I’m not a strict textualist, but judges have to be limited by something, and the language of the statute seems like the right boundary.)

I have a draft article that proposes a solution to sexting, revenge porn, and the like. I’ll put up an excerpt after submission season ends and I can give the piece the attention it deserves. But, for the moment, here’s a different proposal: why don’t all revenge porn victims submit takedown notifications under Title II of the DMCA (17 U.S.C. 512(c)(3))? Doing so puts the site on the horns of a dilemma: remove the content, or face liability under the Copyright Act. (I suspect a jury would be all too ready to find infringement, and since damages are up to the jury, the award could be sizeable. Even defending such a suit would be costly for the site’s proprietors.)

There are two objections to my plan, both potentially significant. First: the victim is not the copyright owner. Second: 512(c)(3)(A)(vi) requires certification under penalty of perjury that the complaining party is authorized to act on behalf of the copyright owner / owner of an exclusive right, and no one wants to be prosecuted for perjury. (Not even Roger Clemens.) But: I have responses to both.

Photography is a challenging area for copyright law. Some photographs will not even be eligible for copyright: those that lack the requisite originality. Some photos will merely capture the natural world with no input from the photographer – think of an accidental iPhone snap, or just pointing your high-speed camera at a parade and holding down the shutter button. And in some cases, the person pressing the camera button will not be the photographer. All the creative work has been done by someone else – someone who created or set up the tableau which the photograph records. (See, for example, Bridgeman Art Library v. Corel Corp., 36 F. Supp. 2d 191 (S.D.N.Y. 1999).) That means that it is the person who created the scene who could obtain copyright. The photographer is a mere amanuensis (probably a terrific Scrabble word). (Cf. Thomson v. Larson, 147 F.3d 195 (2d Cir. 1998).) So, a crude approximation of the rule for authorship in photos would be this: the source of the original, creative elements of the photo is an author.

For revenge porn, I think there is a defensible position that the subject – the victim – of the image or video is at least a joint author. Why do people look at these images? (A good question there, full stop.) Because of the subject – not because of the lighting, the use of unusual color or angle, the excellent development of the print, or any other contribution by the photographer. Put it this way: imagine that the victim is replaced by a dummy, or Felix the Cat. No one is even going to glance at the photo: there’s nothing expressive or original about it.

I think that means that a victim, and her attorney, can often take a legally defensible position that she is an author of the photo. That means she can, under 512(c)(3), send a take-down notification to the site. This raises the second objection: you have to certify, under penalty of perjury, that you are authorized to act on behalf of the owner of an exclusive right. I think that the victim is such an owner – indeed, an author. One would also hope that, in close cases, a prosecutor might decline to pursue perjury charges against the victim / her attorney. I don’t think it would be easy to prove perjury beyond a reasonable doubt. Heck, they couldn’t get Clemens! Or Barry Bonds! (True, they had expensive lawyers.) I suspect a jury would be sympathetic to the dilemma the victim faces. And I would hope that a prosecutor would either see a better use of her limited resources, or would feel constrained by the likely public reaction to an attempt to prosecute someone who already had been harmed so greatly.

Theoretically, a site owner could mount a counter-suit against the victim under 512(f), which covers knowing misrepresentations in takedown notices. But: they’re not going to get much money out of it. (See Lenz v. Universal.) And I can only think of two successful 512(f) cases, both with egregious sets of facts – Lenz, and Online Policy Group v. Diebold. As I tell my students, 512(f) is a bit like the reverse doctrine of equivalents in patent law: it exists in theory, but not in practice.

This tactic pushes the edge. But there aren’t many options for victims of revenge porn, and this may be a gambit worth trying.