Last week, a Forbes article by Adam Tanner announced that a research team led by Latanya Sweeney had re-identified “more than 40% of a sample of anonymous participants” in Harvard’s Personal Genome Project. Sweeney is a progenitor of demonstration attack research. Her research was extremely influential during the design of HIPAA, and I have both praised and criticized her work before.
Right off the bat, Tanner’s article is misleading. From the headline, a reader would assume that research participants were re-identified using their genetic sequences. And the “40% of a sample” line suggests that Sweeney had re-identified 40% of a random sample. Neither of these assumptions is correct. Even using the words “re-identified” and “anonymous” is improvident. Yet the misinformation has proliferated, with the figure rounded up to “nearly half” or conflated with the “97%” accuracy rate.
Here’s what actually happened: Sweeney’s research team scraped data on 1,130 (presumably random) volunteers in the Personal Genome Project database. Of those 1,130 volunteers, 579 had voluntarily provided their zip code, full date of birth, and gender. (Note that if the data had been de-identified under the HIPAA standard, the zip code and date of birth would have been truncated.) From this special subset, 115 research participants had uploaded files to the Personal Genome Project website with filenames containing their names. (Or the number might be 103—there are several discrepancies between the report’s text and its discrimination matrix that frustrate any precise description.) Another 126 of the subgroup could be matched to likely identities found in voter registration records and other (unidentified) public records, for a total of 241 re-identifications.
So, from the subset of 579 research participants who provided birth date, zip code, and gender, Sweeney’s team was able to provide a guess for 241 of them—about 42%. Sweeney’s research team submitted these 241 names to the PGP and learned that almost all of them (97%) were correct, allowing for nicknames.
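The arithmetic behind the headline figures can be checked directly. A minimal sketch, using the counts from the report as described above:

```python
# Counts as reported in Sweeney's study (per the description above)
total_scraped = 1130      # profiles scraped from the PGP database
with_demographics = 579   # provided zip code, full date of birth, and gender
embedded_names = 115      # uploaded files whose names contained their own names
linked_records = 126      # matched via voter rolls and other public records

guesses = embedded_names + linked_records   # 241 names submitted to the PGP
guess_rate = guesses / with_demographics    # ~0.416 -- the "42%" figure
confirmed = round(guesses * 0.97)           # ~234 confirmed correct (97%)

print(guesses, round(guess_rate * 100))     # 241 42
```

Note that the 42% denominator is the 579 participants who supplied all three demographic fields, not the full 1,130 scraped profiles.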
A few things are noteworthy here. First, the 42% figure includes research participants who were “re-identified” using their names. This may be a useful demonstration to remind participants to think about the files they upload to the PGP website. Or it might not; if the files also contained the participants’ names, the participants may have proceeded with the conscious presumption that users of the PGP website were unlikely to harass them. In any case, the embedded names approach is not relevant to an assessment of re-identification risk because the participants were not de-identified. Including these participants in the re-identification number inflates both the re-identification risk and the accuracy rate.
However, if these participants hadn’t uploaded records containing their names, some of them nevertheless would have been re-identifiable through the other routes. Sweeney’s team reports that 35 of those 115 participants with embedded names were also linkable to voter lists or other public records, and 80 weren’t. So, taking out the 80 who could not be linked using public records and voter registers (and assuming that the names were not used to inform and improve the re-identification process for the other 35), Sweeney’s team could claim to have re-identified 161 of the 579 participants who had provided their birth dates, zip codes, and gender. Even if we assume that all of the matches are accurate, the team provided a guess based on public records and voter registration data for only 28% of that sample.
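Stripping out the embedded-name matches, the same arithmetic yields the lower figure. Again a sketch, using the report’s counts:

```python
guesses = 241            # total names submitted to the PGP
embedded_names = 115     # re-identified via their own uploaded filenames
also_linkable = 35       # of those, also matchable in public records

name_only = embedded_names - also_linkable   # 80 reachable only via filenames
demographic_matches = guesses - name_only    # 161 matched via demographics alone
rate = demographic_matches / 579             # ~0.278 -- the "28%" figure

print(demographic_matches, round(rate * 100))   # 161 28
```

This treats the 35 overlap cases as genuine demographic matches, which is generous: if their names helped guide the linkage, the true demographics-only rate would be lower still.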
In the context of today’s re-identification risk debate, 28% is actually quite low. After all, Sweeney has said that “87% of the U.S. population are uniquely identified” by the combination of these three pieces of information. The claim has been repeated so often that it has achieved nearly axiomatic status.
If anything, the findings from this exercise illustrate the chasm between uniqueness and re-identifiability. The latter requires effort (triangulation among multiple sources), even when the linkable information is basic demographics. Sweeney’s team acknowledges this, reframing the 87% figure as an “upper bound” based on estimates of uniqueness that do not guarantee identifiability. The press has not grasped that this study shows re-identification risk to be lower than many would have expected. Unfortunately, re-identification risk is just technical enough to confuse the uninitiated. For a breathtaking misunderstanding of how Sweeney’s results here relate to her earlier 87% estimate, check out MIT’s Technology Review:
“When there is a match, the question is whether the zip, birth date and sex uniquely identify an individual. Sweeney has argued in the past that it does with an accuracy of up to 87 per cent, depending on factors such as the density of people living in the zip code in question.

“These results seem to prove her right.”
Oh my god, no, MIT Tech Review. That is not correct.
Though Sweeney’s study has some lapses in critical detail, it is much more careful and much less misleading than the reporting on it. I am especially disappointed by Tanner’s Forbes article. Since Tanner is a colleague and collaborator of Sweeney’s and is able to digest her results, I am disturbed by the gap between Tanner’s reporting and Sweeney’s findings. The fact that many participants were re-identified using their actual names is significant and should not have escaped his notice. His decision to exclude it contributes to the fearmongering so common in this area.