Earlier this year, the journal Science published a study called “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata” by Yves-Alexandre de Montjoye et al. The article has reinvigorated claims that deidentified research data can be reidentified easily. These claims are not new, but their recitation in a vaunted science journal led to a new round of panic in the popular press.
The particulars of the actual study are neither objectionable nor enlightening. The authors demonstrate that in high-dimensional databases (that is, those with many variables that can each take many different values), each person in the database is distinguishable from the others. Indeed, each person looks distinguishable from the others based on just a small subset of details about them. This will not surprise anybody who actually uses research data, because the whole point of accessing individual-level data is to make use of the unique combinations of factors that the people represented in the database possess. Otherwise, aggregated tables would do. What is surprising, however, is the authors’ bold conclusions that their study somehow proves that data anonymization is an “inadequate” concept and that “the open sharing of raw deidentified metadata data sets is not the future.” How Science permitted this sweeping condemnation of open data based on such thin evidence is itself a study in the fear and ideology that drive policy and scientific discourse around privacy.
What the de Montjoye Study Actually Demonstrated
The credit card metadata study used a database consisting of three months of credit card records for 1.1 million clients in an unspecified OECD country. The bank removed names, addresses, and other direct identifiers, but did nothing else to mask the data. The authors used this database to evaluate the chance that any given person is unique among the clients in the database based on a given number of purchase transactions. So, using an example from the paper, if Scott was the only person who made a purchase at a particular bakery on September 23rd and at a particular restaurant on September 24th, he would be unique within the database based on just two transactions. The authors used these “tuples” (place-date combinations) to estimate the chance that a person in the database looks unique compared to the other data subjects. They found that 90% of the data subjects were unique in the database based on just four place-date tuples, and the rate of uniqueness increased if approximate price information was added to each tuple.
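To make the uniqueness test concrete, here is a minimal sketch in Python on synthetic data. Everything here is invented for illustration (the client count, transaction counts, and the is_unique helper are all my assumptions); the paper’s actual estimation procedure is more involved:

    import random

    # Synthetic stand-in for the credit card database: each client is a set
    # of (place, date) tuples. All sizes here are invented for illustration.
    random.seed(0)
    N_CLIENTS, N_PLACES, N_DAYS = 1000, 200, 90
    clients = [
        {(random.randrange(N_PLACES), random.randrange(N_DAYS)) for _ in range(30)}
        for _ in range(N_CLIENTS)
    ]

    def is_unique(client_id, k):
        """Pick k known place-date tuples for one client and check whether
        any other client in the database also matches all k of them."""
        known = random.sample(sorted(clients[client_id]), k)
        others = (c for c in range(N_CLIENTS) if c != client_id)
        return not any(all(t in clients[c] for t in known) for c in others)

    k = 4
    rate = sum(is_unique(c, k) for c in range(N_CLIENTS)) / N_CLIENTS
    print(f"Fraction of clients unique on {k} tuples: {rate:.0%}")

On data this sparse, nearly every client comes out unique on four tuples, which is exactly the high-dimensionality point: uniqueness within the sample is cheap to achieve and is not, by itself, evidence of anything more.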
The authors treat database uniqueness and reidentifiability as one and the same. That is, they treat the chance that a person is unique in the dataset based on a given number of tuples as the chance that the person can be reidentified.
I am sympathetic to the authors’ goal of finding a concrete, quantifiable measure of privacy risk. But database uniqueness should not be that measure. Measures of sample uniqueness systematically exaggerate the risk of reidentification. Consequently, any research and data-sharing policy that relies only on sample uniqueness as the measure of reidentification risk will strike the balance between privacy and data utility in the wrong place.
Problem 1: Sample Uniqueness Is Not Reidentification. (It’s Not Even Actual Uniqueness.)
The greatest defect in the Science article is treating uniqueness within a sample database as equivalent to “reidentification,” which the authors do several times. For example, the authors state that 90% of individuals can be “uniquely reidentified” with just four place-date tuples. I suspect that most readers interpreted the article and its subsequent coverage in the popular media to mean that if you know just four pieces of place-date purchase information about a person, you are 90% likely to be able to figure out who they are in the deidentified research database. But the authors did not come close to proving that.
The problem is that uniqueness in a deidentified research database cannot tell us whether a data subject is actually unique in the general population. The research database describes only a sample of the population, and it may be missing a lot of information about each of its data subjects. Inferring actual uniqueness from database uniqueness requires additional information and modeling: what proportion of the population the sample covers, and how complete the data about each subject is.
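To see why the sampling fraction matters, consider a toy Bayesian calculation (the model, the prior, and the p_population_unique helper are all my own illustrative assumptions, not the paper’s):

    def p_population_unique(p, prior):
        """Crude model: each member of the population lands in the database
        independently with probability p (the sampling fraction), and a given
        combination of tuples is shared by m people in the full population.
        prior maps m -> prior probability.
        Returns P(m == 1 | the record is unique in the sample)."""
        # P(sample-unique | m) = (1 - p) ** (m - 1): the other m - 1
        # matching people must all fall outside the sample.
        weight = {m: q * (1 - p) ** (m - 1) for m, q in prior.items()}
        return weight[1] / sum(weight.values())

    # Toy prior: a tuple combination is equally likely to be shared
    # by 1, 2, 3, 4, or 5 people in the population.
    prior = {m: 0.2 for m in range(1, 6)}
    for p in (0.01, 0.1, 0.5, 1.0):
        print(f"sampling fraction {p:4}: "
              f"P(truly unique | sample-unique) = {p_population_unique(p, prior):.2f}")

Under this toy model, when the database covers only 1% of the population, a sample-unique record is barely more likely to be population-unique than the prior alone would suggest; only as the sampling fraction approaches 1 does sample uniqueness become strong evidence of actual uniqueness.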
To give an extreme example, let’s go back to “Scott,” the credit card holder who went to a bakery on September 23rd and a restaurant on September 24th. Suppose that his data was part of a research dataset that included the purchase histories of just ten credit card customers. Using this database of ten people, could we reliably say anything about whether Scott was the only person in his city to go to the bakery and the restaurant? Of course not. We may have a hunch that the city’s inhabitants are unlikely to have gone to that bakery and that restaurant on the same days that Scott did, but we would be using our intuitions rather than the research data to draw our conclusions about uniqueness.