Death by HIPAA

Vioxx, the non-steroidal anti-inflammatory drug once prescribed for arthritis, was on the market for over five years before it was withdrawn in 2004. Though a group of small-scale studies had found a correlation between Vioxx and increased risk of heart attack, the FDA did not have convincing evidence until it completed its own analysis of 1.4 million Kaiser Permanente HMO members. By the time Vioxx was pulled, it had caused an estimated 88,000 to 139,000 unnecessary heart attacks and 27,000 to 55,000 avoidable deaths.

The Vioxx debacle is a haunting illustration of the importance of large-scale data research. Dr. Richard Platt, one of the FDA’s drug risk researchers, described a series of “what if” scenarios in 2007 FDA testimony. (Barbara Evans describes the study here.) If researchers had had access to 7 million longitudinal patient records, a statistically significant relationship between Vioxx and heart attack would have been revealed in under three years. If researchers had had access to 100 million longitudinal patient records, the relationship would have been discovered in just three months. Of course, if public health researchers ran post-market studies that looked for everything all the time, many results that looked significant would be the product of random noise. But even if it took six months or a year to become confident in the results from a nationwide health research database, tens of thousands of deaths might have been averted.

These are the consequences of HIPAA’s overcautious privacy rules. HIPAA allows health providers and insurers to release patient health information for research use only if the researcher enters into contractual agreements with each individual data-holder or if the data complies with HIPAA’s de-identification standards. These research exceptions are much too narrow to harness the full potential of the data. The contractual exception (“limited use” datasets) is practical only for work on small or medium scales, and only when the data holder agrees to work with the researchers. The de-identification exception allows research data to be shared freely, but it requires the data producer to remove much of the information that may be critical to the research and to building longitudinal data files. The HIPAA privacy rules were designed to avoid the risk that health research data could be used to re-identify a patient, but re-identification is not as easy as we have been told.

Latanya Sweeney’s reidentification of William Weld, Governor of Massachusetts at the time, using a pre-HIPAA research database of hospital records is the quintessential reidentification attack. Paul Ohm describes the famous attack nicely:

At the time [the research data was released], William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.

Latanya Sweeney used census data to estimate that 87% of the population has a unique combination of 5-digit zip code, birth date, and gender, and implied that the same sort of attack could be replicated using voter registration records or other public files. Philippe Golle’s replication corrected the figure to 63%, though that’s hardly comforting. But these uniqueness statistics are rather misleading. There is an important difference between distinguishability and identifiability. Distinguishability is a necessary condition for the sort of matching attack that Ohm describes, but it is not sufficient. Sweeney conflated the two when she suggested that a unique individual can be identified by linking the unique combination of attributes to public records, such as voter registration records. But public records are never complete. We know, for example, that a significant portion of the population is not registered to vote. How was Sweeney so sure that there was not another man who shared Gov. Weld’s birth date and zip code but was not registered to vote?

Daniel Barth-Jones has recently uploaded a fascinating new article that revisits the famous Gov. Weld reidentification. To start with, Sweeney’s estimate of the Cambridge population is way off: nearly 100,000 people lived in Cambridge at the time of the attack, not 54,000. This should have been the first hint that her methodology was overly simple. She reported a population of 54,000 because that is the number of Cambridge residents who were registered to vote, and she used those records as if they described the entire population.

By comparing Sweeney’s count of Cambridge voter registrants with U.S. Census records, Barth-Jones confirmed that many voting-age adults in Cambridge (about 35%) were not registered to vote. These non-registrants are obviously immune from the record-matching attack that Sweeney performed, but they also provide unwitting protection to people who are registered to vote. (Hey! Non-voters DO perform a civic function!) In William Weld’s case, the census data show that approximately 174 men living in Weld’s zip code were Weld’s age. We don’t know their precise birth dates, but we can calculate that the chance another man living in Weld’s zip code shared his birth date was about 35%. This is quite important all on its own to illustrate the difference between identifiability and distinguishability. Most of those 174 men had a unique combination of birth date, gender, and zip code, but each one of them had a roughly 35% chance of being non-unique.
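The gap between looking unique in the voter rolls and being unique in the population can be illustrated with a toy Monte Carlo sketch. This is not from Barth-Jones’ paper: it assumes birth dates are independent and uniform over a 365-day year, and it borrows the 174-man group size and 35% non-registration rate from the figures above. The uniform-birthday simplification lands near, though not exactly on, Barth-Jones’ 35% collision figure.

```python
import random

random.seed(1)

GROUP = 174       # men of the target's age in his zip code (figure from the post)
UNREG = 0.35      # share of voting-age adults not on the voter rolls (from the post)
TRIALS = 20_000

shares_bday = 0   # target shares a birth date with at least one other man
looks_unique = 0  # target appears unique in the voter rolls
false_unique = 0  # ...unique in the rolls, but a non-voter shares his birth date

for _ in range(TRIALS):
    bdays = [random.randrange(365) for _ in range(GROUP)]
    registered = [random.random() >= UNREG for _ in range(GROUP)]
    # Index 0 is the target; like Weld, he is assumed to be on the rolls himself.
    target, others = bdays[0], bdays[1:]
    if target in others:
        shares_bday += 1
    # An attacker armed only with the voter rolls sees just the registered men.
    reg_others = [b for b, r in zip(others, registered[1:]) if r]
    if target not in reg_others:
        looks_unique += 1
        unreg_others = [b for b, r in zip(others, registered[1:]) if not r]
        if target in unreg_others:
            false_unique += 1

print(f"shares a birth date with someone: {shares_bday / TRIALS:.0%}")
print(f"unique in rolls but not in population: {false_unique / looks_unique:.0%}")
```

Under these assumptions the shared-birth-date chance comes out near 38% (close to the 35% computed from actual demographics), and roughly one in six men who look unique in the voter rolls in fact collide with an unregistered neighbor — exactly the false confidence the matching attack cannot detect.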

Sweeney presumably used the voter registration records to rule out the possibility that some of these 174 Cambridge men shared Gov. Weld’s birth date. But even if Sweeney did confirm that no other registered voter shared Weld’s gender, zip code, and birth date, she could not have been sure about the 50 or so men who were Weld’s age, lived in his zip code, and were not registered to vote. Thus, at best, Weld’s chance of having a unique birth date, zip code, and gender combination was 87%. Put differently, the chance that Sweeney’s matching attack would have been wrong, using these three variables alone, was 13%, far worse than the traditional 5% statistical confidence threshold.
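The 87%/13% figures fall out of one line of back-of-the-envelope arithmetic, assuming the roughly 50 unverifiable men have independent birth dates distributed uniformly over a 365-day year:

```python
# Chance that none of ~50 unregistered men of Weld's age shares his birth
# date, assuming independent birth dates uniform over a 365-day year.
p_unique = (364 / 365) ** 50
print(round(p_unique, 2))    # 0.87, leaving a ~13% chance the match was wrong
```

In other words, even a perfect sweep of the voter rolls leaves a residual collision risk of about 13% from the people the rolls never covered.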

Barth-Jones’ study nicely illustrates why a matching attack using voter registration records would not have been sufficient to re-identify William Weld. As it turns out, the voter registry also wasn’t necessary. Weld suffered a public collapse while giving a commencement speech, and local television and newspapers had covered the dates and details of his treatment at Deaconess Waltham Hospital. The details from these news reports made Weld identifiable with certainty, without matching anything to the voter records. Thus, the attack may say something about the vulnerability of celebrities and of compulsive live-bloggers, but it never demonstrated the proposition for which it has come to be known: that individuals can be re-identified “with astonishing ease.”

In Barth-Jones’ words,

It’s difficult to overstate the influence of the Weld/Cambridge voter list attack on health privacy policy in the United States.

The Weld reidentification ushered in an era of undeserved skepticism about the effectiveness of anonymization. The Department of Health and Human Services, the federal agency that gets to set the privacy rules under HIPAA, was overly impressed by Latanya Sweeney’s study despite its obvious flaws. The Notice of Proposed Rulemaking leading up to the passage of the current HIPAA regulations stated the following:

A 1997 MIT study showed that, because of the public availability of the Cambridge, Massachusetts voting list, 97 percent of the individuals in Cambridge whose data appeared in a data base which contained only their nine digit ZIP code and birth date could be identified with certainty.

This statement cannot be true. At least a third of Cambridge’s voting-age residents did not appear on the voter list at all, so their records could not possibly have been identified through it, much less “with certainty.”

Barth-Jones concludes that despite the misinformation, HHS managed to come away with a de-identification rule that appropriately strikes the balance between privacy and utility in research databases. But the Vioxx “What-If” study should give us pause. We labor under the inertia of significant status quo bias when we continue to accept existing HIPAA regulations. Reidentification risk is speculative. Attacks do not happen in practice. Meanwhile, the opportunity costs of HIPAA’s research regulations include a body count.

6 Responses to “Death by HIPAA”

  1. OK… but how confident in the match would a health insurer need to be to deny coverage to an applicant who is 50% likely to be an individual with an undesirable profile their data mining software identified? That’s what health records anonymity is trying to stop.

  2. The one issue I struggle with in this article is the premise that the only options were a limited data set or de-identified data. Isn’t 45 CFR 164.512 of the HIPAA regulations designed for exactly this kind of public health disclosure? It offers an exception to the nondisclosure requirement for sharing protected health information with an organization like the FDA.

    § 164.512 Uses and disclosures for which an authorization or opportunity to agree or object is not required.
    A covered entity may use or disclose protected health information without the written authorization of the individual, as described in §164.508, or the opportunity for the individual to agree or object as described in §164.510, in the situations covered by this section, subject to the applicable requirements of this section. When the covered entity is required by this section to inform the individual of, or when the individual may agree to, a use or disclosure permitted by this section, the covered entity’s information and the individual’s agreement may be given orally.
    (a) Standard: Uses and disclosures required by law.
    (1) A covered entity may use or disclose protected health information to the extent that such use or disclosure is required by law and the use or disclosure complies with and is limited to the relevant requirements of such law.
    (2) A covered entity must meet the requirements described in paragraph (c), (e), or (f) of this section for uses or disclosures required by law.
    (b) Standard: Uses and disclosures for public health activities.
    (1) Permitted disclosures. A covered entity may disclose protected health information for the public health activities and purposes described in this paragraph to:
    (i) A public health authority that is authorized by law to collect or receive such information for the purpose of preventing or controlling disease, injury, or disability, including, but not limited to, the reporting of disease, injury, vital events such as birth or death, and the conduct of public health surveillance, public health investigations, and public health interventions; or, at the direction of a public health authority, to an official of a foreign government agency that is acting in collaboration with a public health authority;
    (ii) A public health authority or other appropriate government authority authorized by law to receive reports of child abuse or neglect;
    (iii) A person subject to the jurisdiction of the Food and Drug Administration (FDA) with respect to an FDA-regulated product or activity for which that person has responsibility, for the purpose of activities related to the quality, safety or effectiveness of such FDA-regulated product or activity. Such purposes include:
    (A) To collect or report adverse events (or similar activities with respect to food or dietary supplements), product defects or problems (including problems with the use or labeling of a product), or biological product deviations;
    (B) To track FDA-regulated products;
    (C) To enable product recalls, repairs, or replacement, or lookback (including locating and notifying individuals who have received products that have been recalled, withdrawn, or are the subject of lookback); or
    (D) To conduct post marketing surveillance;
    (iv) A person who may have been exposed to a communicable disease or may otherwise be at risk of contracting or spreading a disease or condition, if the covered entity or public health authority is authorized by law to notify such person as necessary in the conduct of a public health intervention or investigation . . .
    — HIPAA Administrative Simplification Regulation Text, March 2006

  3. HIPAA has nothing to do with this. If drugmakers had provided adequate after-market reporting of adverse events to the FDA, Vioxx would have been pulled earlier.

  4. [...] comes the news that HIPAA’s privacy requirements may have hampered research efforts that could have prevented an estimated 90,000 unnecessary heart attacks and 25,000 deaths. Vioxx, the non-steroidal anti-inflammatory drug once prescribed for arthritis, was on the market [...]

  5. Can you give an example of data that would have been useful for a longitudinal study of Vioxx but that HIPAA’s de-identification requirements would have removed?

  6. You state the following:

    “HIPAA allows health providers and insurers to release patient health information for research use only if the researcher enters into contractual agreements with each individual data-holder or if the data complies with HIPAA’s deidentification standards.”

    This is not true. A health plan or integrated delivery system could do a study with its own fully identified data for quality control purposes as part of health care operations. A bona fide researcher could access fully identified data with a waiver of authorization from an IRB. Such waivers are routinely granted for retrospective record reviews so long as appropriate data security safeguards are in place.

    The article is inflammatory and inaccurate.