Reporting Fail: The Reidentification of Personal Genome Project Participants

Last week, a Forbes article by Adam Tanner announced that a research team led by Latanya Sweeney had re-identified “more than 40% of a sample of anonymous participants” in Harvard’s Personal Genome Project. Sweeney is a progenitor of demonstration attack research. Her research was extremely influential during the design of HIPAA, and I have both praised and criticized her work before.

Right off the bat, Tanner’s article is misleading. From the headline, a reader would assume that research participants were re-identified using their genetic sequences. And the “40% of a sample” line suggests that Sweeney had re-identified 40% of a random sample. Neither assumption is correct. Even the words “re-identified” and “anonymous” are ill-chosen here. Yet the misinformation has proliferated, with some outlets rounding the figure up to “nearly half” or even “97%.”

Here’s what actually happened: Sweeney’s research team scraped data on 1,130 (presumably random) volunteers in the Personal Genome Project database. Of those 1,130 volunteers, 579 had voluntarily provided their zip code, full date of birth, and gender. (Note that if the data had been de-identified using the HIPAA standard, the zip code and date of birth would have been truncated.) From this subset, 115 research participants had uploaded files to the Personal Genome Project website with filenames containing their own names. (Or the number might be 103; there are several discrepancies between the report’s text and its discrimination matrix that frustrate any precise description.) Another 126 of the subset could be matched to likely identities found in voter registration records and other (unidentified) public records, for a total of 241 re-identifications.

So, from the subset of 579 research participants who provided birth date, zip code, and gender, Sweeney’s team was able to offer a guess for 241 of them, about 42%. Sweeney’s research team submitted these 241 names to the PGP and learned that almost all of them (97%) were correct, allowing for nicknames.
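
To make the arithmetic concrete, here is a quick check in Python; the counts are those reported above, and the variable names are mine:

    subset = 579          # provided zip code, full date of birth, and gender
    embedded = 115        # names embedded in uploaded filenames
    public_records = 126  # matched via voter registration and other public records

    guesses = embedded + public_records  # 241 names submitted to the PGP
    print(f"{guesses} of {subset} = {guesses / subset:.1%}")  # 241 of 579 = 41.6%
    print(f"confirmed correct: ~{round(0.97 * guesses)}")     # 97% of 241 is ~234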

A few things are noteworthy here. First, the 42% figure includes research participants who were “re-identified” using their own names. This may be a useful demonstration to remind participants to think about the files they upload to the PGP website. Or it might not; if the files contained the participants’ names, the participants may have proceeded on the conscious assumption that users of the PGP website were unlikely to harass them. In any case, the embedded-names approach is not relevant to an assessment of re-identification risk, because those participants were never de-identified in the first place. Including them in the re-identification count inflates both the apparent re-identification risk and the accuracy rate.

However, if these participants hadn’t uploaded records containing their names, some of them would nevertheless have been re-identifiable through the other routes. Sweeney’s team reports that 35 of the 115 participants with embedded names were also linkable to voter lists or other public records, and 80 were not. So, removing the 80 who could not be linked through public records and voter registries (and assuming that the embedded names were not used to inform and improve the re-identification of the other 35), Sweeney’s team could claim to have re-identified 161 of the 579 participants who had provided their birth dates, zip codes, and gender. Even if we assume that all of the matches are accurate, the team offered a guess based on public records and voter registration data for only about 28% of that subset.
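
Again, a quick check of the adjusted figure, under the same caveats as the sketch above:

    named = 115                         # embedded-name re-identifications
    also_linkable = 35                  # of those, also matched in public records
    named_only = named - also_linkable  # 80: no route besides the filename
    adjusted = 241 - named_only         # 161 re-identifications remain
    print(f"{adjusted} of 579 = {adjusted / 579:.1%}")  # 161 of 579 = 27.8%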

In the context of today’s re-identification risk debate, 28% is actually quite low. After all, Sweeney has said that “87% of the U.S. population are uniquely identified” by the combination of these three pieces of information. The claim has been repeated so often that it has achieved nearly axiomatic status.
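
To be clear about terms, “uniquely identified” here means that no one else shares the same zip code, birth date, and gender. A toy Python sketch with invented records illustrates what such a uniqueness count measures; note that it says nothing about whether a name can actually be attached to anyone:

    from collections import Counter

    # Invented records of (zip, date of birth, gender); not real data.
    people = [
        ("02138", "1975-03-01", "F"),
        ("02138", "1975-03-01", "F"),  # a shared triple: neither is unique
        ("02139", "1980-07-14", "M"),
        ("90210", "1962-11-30", "M"),
    ]
    counts = Counter(people)
    unique = sum(1 for p in people if counts[p] == 1)
    print(f"{unique / len(people):.0%} unique on the triple")  # 50%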

If anything, the findings from this exercise illustrate the chasm between uniqueness and re-identifiability. The latter requires effort (triangulation across multiple sources), even when the linkable information is basic demographics. Sweeney’s team acknowledges this, reframing the 87% figure as an “upper bound” based on estimates of uniqueness that do not guarantee identifiability. The press has not grasped that this study shows re-identification risk to be lower than many would have expected. Unfortunately, re-identification risk is just technical enough to confuse the uninitiated. For a breathtaking misunderstanding of how Sweeney’s results here relate to her earlier 87% estimate, look at MIT’s Technology Review:

    When there is a match, the question is whether the zip, birth date and sex uniquely identify an individual. Sweeney has argued in the past that it does with an accuracy of up to 87 per cent, depending on factors such as the density of people living in the zip code in question.

    These results seem to prove her right.

Oh my god, no, MIT Tech Review. That is not correct.

Though Sweeney’s study has some lapses in critical detail, it is much more careful and much less misleading than the reporting on it. I am especially disappointed by Tanner’s Forbes article. Since Tanner is a colleague and collaborator of Sweeney’s and is able to digest her results, I am disturbed by the gap between Tanner’s reporting and Sweeney’s findings. The fact that many participants were “re-identified” using their actual names should not have escaped his notice. His decision to exclude it contributes to the fearmongering so common in this area.

3 Responses to “Reporting Fail: The Reidentification of Personal Genome Project Participants”

  1. Thanks for reviewing the paper and highlighting the work. Let me offer a couple of clarifications and encourage people to read the paper for themselves at . There are many lessons learned in this experiment.

    The blogger mentioned our use of hidden names, so let me clarify. We consider profiles having hidden names as re-identifications because the names appear irregularly and are not explicit or obvious. Here is how it happens. Some participants previously uploaded auxiliary files that contained genetic information, as received from other sources. The public appearance of an auxiliary file is a compressed file that does not reveal the participant’s name. Only after the file is downloaded and uncompressed might you see a new filename that includes the name of the participant. This raises the question: did the participant know his name was there? If the public profiles had a name field where a participant could enter his name, then we would know the participant intended for his name to be present. But the profiles have no such name field. In ad hoc discussions, participants voiced that these hidden names resulted from how they received the compressed file, implying a lack of action on their part to insert their names and no technical option to remove them (if they saw them). So, detecting hidden names is one of the re-identification strategies reported in the work.
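
    To see how such a name surfaces, note that a zip archive’s member names can be read without extracting the contents; a toy sketch in Python, with invented filenames:

        import zipfile

        # The outer filename reveals nothing about the participant.
        with zipfile.ZipFile("auxiliary_upload.zip") as zf:
            for member in zf.namelist():
                print(member)  # e.g. "genome_JaneDoe.txt", a hidden name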

    Here are a few more points. Regardless of whether the profile had a hidden name, we also put names to profiles based on demographics (date of birth, gender, and ZIP), when present. The PGP staff scored our results for correctness. To help, we offer a service for participants to learn how unique their demographics may be, as well as technical solutions for changing their profiles; see the website above. Anyone can test how unique she may be from her demographics using our server at .

    As for misconceptions commonly voiced about re-identification, this work demonstrates that a significant number of people can be re-identified by basic demographics (date of birth, gender, and ZIP). Two undergraduate students working on term projects in two different courses did some of the original re-identifications, so these re-identifications do not require advanced science. Adam Tanner, the noted reporter, was not a collaborator; he had access to our paper only after we publicly released it (same as the blogger). He reports having independently re-identified and interviewed some participants himself. There are many lessons to be learned from the experiment. Feel free to follow @LatanyaSweeney for updates.

  2. Relevant links were removed when posted. Here they are again (second attempt):

    paper and tools at dataprivacylab.org

    test demographics at aboutmyride.org

  3. AARGH… test demographics at aboutmyinfo.org