Vioxx, the non-steroidal anti-inflammatory drug once prescribed for arthritis, was on the market for over five years before it was withdrawn in 2004. Though a group of small-scale studies had found a correlation between Vioxx and increased risk of heart attack, the FDA did not have convincing evidence until it completed its own analysis of 1.4 million Kaiser Permanente HMO members. By the time Vioxx was pulled, it had caused between 88,000 and 139,000 unnecessary heart attacks and between 27,000 and 55,000 avoidable deaths.
The Vioxx debacle is a haunting illustration of the importance of large-scale data research. Dr. Richard Platt, one of the FDA’s drug risk researchers, described a series of “what if” scenarios in 2007 FDA testimony. (Barbara Evans describes the study here.) If researchers had had access to 7 million longitudinal patient records, a statistically significant relationship between Vioxx and heart attack would have been revealed in under three years. If researchers had had access to 100 million longitudinal patient records, the relationship would have been discovered in just three months. Of course, if public health researchers did post-market studies that looked for everything all the time, many of the results that look significant would be the product of random noise. But even if it took six months or one year to become confident in the results from a nation-wide health research database, tens of thousands of deaths might have been averted.
These are the consequences of HIPAA’s overcautious privacy rules. HIPAA allows health providers and insurers to release patient health information for research use only if the researcher enters into contractual agreements with each individual data-holder or if the data complies with HIPAA’s deidentification standards. These research exceptions are much too narrow to harness the full potential of the data. The contractual exception (“limited use” datasets) is practical only for work on small or medium scales, and only when the data holder agrees to work with the researchers. The de-identification exception allows research data to be shared freely, but the data producer is required to remove a lot of information that may be critical to the research and to building longitudinal data files. The HIPAA privacy rules were designed to avoid the risk that health research data could be used to re-identify a patient, but reidentification is not as easy as we have been told.
Latanya Sweeney’s reidentification of William Weld, Governor of Massachusetts at the time, using a pre-HIPAA research database of hospital records is the quintessential reidentification attack. Paul Ohm describes the famous attack nicely:
At the time [the research data was released], William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Latanya Sweeney used census data to estimate that 87% of the population has a unique combination of 5-digit zip code, birth date, and gender, and implied that the same sort of attack could be carried out against most people using voter registration records or other public files. Philippe Golle’s replication corrected the figure to 63%, though that’s hardly comforting. But these uniqueness statistics are rather misleading. There is an important difference between distinguishability and identifiability. Distinguishability is a necessary condition to conduct the sort of matching attack that Ohm describes, but it is not sufficient. Latanya Sweeney conflated the two when she suggested that a unique individual can be identified by linking the unique combination of attributes to public records such as voter registration records. But public records are never complete. We know, for example, that a significant portion of the population is not registered to vote. How was Sweeney so sure that there was not another man who shared Gov. Weld’s birth date and zip code who was not registered to vote?
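The gap between distinguishability and identifiability can be made concrete with a toy matching attack. The sketch below is only an illustration: every name, zip code, and birth date in it is invented, and the `Person` type and `matches` helper are hypothetical, not anything Sweeney or Barth-Jones used. It shows a target who is unique in the voter roll but not unique in the full population, because a non-registrant shares the same three attributes.

```python
from typing import NamedTuple


class Person(NamedTuple):
    name: str
    zip_code: str
    birth_date: str
    sex: str
    registered: bool  # whether the person appears in the voter roll


# Invented population for one small area (illustrative values only).
POPULATION = [
    Person("Resident A", "00001", "1950-01-01", "M", True),
    Person("Resident B", "00001", "1950-01-01", "M", False),  # not on the roll
    Person("Resident C", "00001", "1950-06-15", "M", True),
    Person("Resident D", "00002", "1950-01-01", "F", True),
]


def matches(people, zip_code, birth_date, sex):
    """Return the names of everyone sharing the three quasi-identifiers."""
    key = (zip_code, birth_date, sex)
    return [p.name for p in people if (p.zip_code, p.birth_date, p.sex) == key]


# The attacker only sees registered voters.
voter_roll = [p for p in POPULATION if p.registered]

# Quasi-identifiers taken from a "de-identified" research record.
target = ("00001", "1950-01-01", "M")

roll_hits = matches(voter_roll, *target)   # what the attacker finds
true_hits = matches(POPULATION, *target)   # who could actually own the record

print("Matches in voter roll:", roll_hits)
print("Matches in full population:", true_hits)
```

In this toy data the voter roll yields exactly one candidate, so the record looks identified, yet two residents could in fact be its owner. The attacker has no way to see the second one.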
Daniel Barth-Jones has recently uploaded a fascinating new article that revisits the famous Gov. Weld reidentification. To start with, Sweeney’s estimate of the Cambridge population is way off. There were nearly 100,000 people living in Cambridge at the time of the William Weld attack. This should have been the first hint that Sweeney’s methodology was overly simple. She reported a population of 54,000 because that is the number of Cambridge residents who were registered to vote. Sweeney used these records as if they described the entire population.
By comparing Sweeney’s count of Cambridge voter registrants with U.S. Census records, Barth-Jones confirmed that many voting-age adults in Cambridge (about 35%) were not registered to vote. These non-registrants are obviously immune from the record-matching attack that Sweeney performed, but they also provide unwitting protection to people who are registered to vote. (Hey! Non-voters DO perform a civic function!) In William Weld’s case, the census data show that approximately 174 men living in Weld’s zip code were Weld’s age. We don’t know their precise birth dates, but we can calculate that the chance another man living in Weld’s zip code shared his birth date was about 35%. This is quite important all on its own to illustrate the difference between identifiability and distinguishability. Most of those 174 men had a unique combination of birth date, gender, and zip code, but each one of them was quite likely (35% likely) to be non-unique.
Sweeney presumably used the voter registration records to rule out the possibility that some of these 174 Cambridge men shared Gov. Weld’s birth date. But even if Sweeney did indeed confirm that no other registered voter shared Weld’s gender, zip, and birth date, she could not have been sure about the 50 or so Cambridge residents who were Weld’s age and were not registered to vote. Thus, at best, Weld’s chance of having a unique birth date, zip code, and gender combination is 87%. Put differently, the chance that Latanya Sweeney’s matching attack would have been wrong using these three variables alone was 13%, much worse than the traditional 5% threshold for statistical significance.
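The 87% figure can be reproduced with a short back-of-the-envelope calculation. This is a sketch under simplifying assumptions that are mine, not necessarily Barth-Jones’: the 50 or so unregistered men were born in the same year as Weld, birth dates are spread uniformly over 365 days, and each man’s birth date is independent of the others. The function name `prob_no_shared_birth_date` is hypothetical.

```python
def prob_no_shared_birth_date(n_others: int, days_in_year: int = 365) -> float:
    """Chance that none of n_others people shares one specific birth date.

    Assumes uniform, independent birth dates within a single birth year.
    """
    return ((days_in_year - 1) / days_in_year) ** n_others


# About 50 men of Weld's age in his zip code were not registered to vote.
unregistered_same_age = 50

p_unique = prob_no_shared_birth_date(unregistered_same_age)
p_wrong = 1 - p_unique

print(f"Chance Weld's combination is truly unique: {p_unique:.0%}")
print(f"Chance the voter-roll match is wrong:      {p_wrong:.0%}")
```

Under these assumptions the script prints roughly 87% and 13%, matching the figures above. Real birth dates are not perfectly uniform, so the numbers should be read as approximations.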
Barth-Jones’ study nicely illustrates why a matching attack using voter registration records would not have been sufficient to re-identify William Weld. As it turns out, the voter registry also wasn’t necessary. William Weld suffered a public collapse while giving a commencement speech, and local television and newspapers had covered the dates and details of his treatment at Deaconess Waltham Hospital. The details from these news reports made Weld identifiable with certainty without matching anything to the voter records. Thus, the attack may say something about the vulnerability of celebrities and of compulsive live-bloggers, but it never demonstrated the proposition for which it has come to be known. Individuals cannot be reidentified “with astonishing ease.”
In Barth-Jones’ words,
The Weld reidentification ushered in an era of undeserved skepticism about the effectiveness of anonymization. The Department of Health and Human Services, the federal agency that gets to set the privacy rules under HIPAA, was overly impressed by Latanya Sweeney’s study despite its obvious flaws. The Notice of Proposed Rulemaking leading up to the passage of the current HIPAA regulations stated the following:
A 1997 MIT study showed that, because of the public availability of the Cambridge, Massachusetts voting list, 97 percent of the individuals in Cambridge whose data appeared in a data base which contained only their nine digit ZIP code and birth date could be identified with certainty.
This statement cannot be true. At least one third were not registered to vote.
Barth-Jones concludes that despite the misinformation, HHS managed to come away with a de-identification rule that appropriately strikes the balance between privacy and utility in research databases. But the Vioxx “What-If” study should give us pause. We labor under the inertia of significant status quo bias when we continue to accept existing HIPAA regulations. Reidentification risk is speculative. Attacks do not happen in practice. Meanwhile, the opportunity costs of HIPAA’s research regulations include a body count.