Over the course of the last fifteen or so years, the belief that “de-identification” of personally identifiable information preserves the anonymity of those individuals has been repeatedly called up short by scholars and journalists. It would be difficult to overstate the importance, for privacy law and policy, of the early work of “re-identification scholars,” as I’ll call them. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released data on individual hospital visits by state employees in order to aid important research. As Massachusetts Governor Bill Weld assured employees, their data had been “anonymized,” with all obvious identifiers, such as name, address, and Social Security number, removed. But Latanya Sweeney, then an MIT graduate student, wasn’t buying it. When, in 1996, Weld collapsed at a local event and was admitted to the hospital, she set out to show that she could re-identify his GIC entry. For twenty dollars, she purchased the full roll of Cambridge voter-registration records, and by linking the two data sets, which individually were innocuous enough, she was able to re-identify his GIC entry. As privacy law scholar Paul Ohm put it, “In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.”
Sweeney’s demonstration led to important changes in privacy law, especially under HIPAA. But that demonstration was just the beginning. In 2006, the New York Times was able to re-identify one individual (and only one individual) in a publicly available research dataset of the three-month AOL search history of over 600,000 users. The Times demonstration led to a class-action lawsuit (which settled out of court), an FTC complaint, and soul-searching in Congress. That same year, Netflix began a three-year contest, offering a $1 million prize to whomever could most improve the algorithm by which the company predicts how much a particular user will enjoy a particular movie. To enable the contest, Netflix made publicly available a dataset of the movie ratings of 500,000 of its customers, whose names it replaced with numerical identifiers. In a 2008 paper, Arvind Narayanan, then a graduate student at UT-Austin, along with his advisor, showed that by linking the “anonymized” Netflix prize dataset to the Internet Movie Database (IMDb), in which viewers review movies, often under their own names, many Netflix users could be re-identified, revealing information that was suggestive of their political preferences and other potentially sensitive information. (Remarkably, notwithstanding the re-identification demonstration, after awarding the prize in 2009 to a team from AT&T, in 2010, Netflix announced plans for a second contest, which it cancelled only after tussling with a class-action lawsuit (again, settled out of court) and the FTC.) Earlier this year, Yaniv Erlich and colleagues, using a novel technique involving surnames and the Y chromosome, re-identified five men who had participated in the 1000 Genomes Project — an international consortium to place, in an open online database, the sequenced genomes of (as it turns out, 2500) “unidentified” people — who had also participated in a study of Mormon families in Utah.
Most recently, Sweeney and colleagues re-identified participants in Harvard’s Personal Genome Project (PGP), who are warned of this risk, using the same technique she used to re-identify Weld in 1997. As a scholar of research ethics and regulation — and also a PGP participant — this latest demonstration piqued my interest. Although much has been said about the appropriate legal and policy responses to these demonstrations (my own thoughts are here), there has been very little discussion about the legal and ethical aspects of the demonstrations themselves. As a modest step in filling that gap, I’m pleased to announce an online symposium, to take place here at the Bill of Health the week of May 20th, that will address both the scientific and policy value of these demonstrations and the legal and ethical issues they raise. Participants fill diverse stakeholder roles (data holder, data provider — i.e., research participant, re-identification researcher, privacy scholar, research ethicist) and will, I expect, have a range of perspectives on these questions:
I hope readers will join us on May 20.