Guest post by Daniel Barth-Jones
For anyone who follows the increasingly critical topic of data privacy closely, it would have been impossible to miss the remarkable chain reaction that followed the New York TLC’s (Taxi and Limousine Commission) recent release of data on more than 173 million taxi rides in response to a FOIL (Freedom of Information Law) request by urbanist and self-described “Data Junkie” Chris Whong. Not long after the data went public, the sharp eyes and keen wit of software engineer Vijay Pandurangan detected that taxi drivers’ license numbers and taxi plate (or medallion) numbers hadn’t been anonymized properly and could be decoded because the hashing process meant to obscure them was flawed.
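To see why an unsalted MD5 hash offers so little protection for identifiers like these, consider the following minimal sketch. It assumes a purely hypothetical four-character plate format (digit, letter, digit, digit) chosen for illustration, not the TLC's actual formats, and the names `md5_hex` and `build_lookup` are my own. Because the set of possible values is tiny, every candidate can be hashed in advance and the published hashes simply looked up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the unsalted MD5 hex digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every plate in a HYPOTHETICAL digit-letter-digit-digit format.

    The format is an assumption made for illustration only. It yields
    10 * 26 * 10 * 10 = 26,000 candidates, which takes well under a
    second to enumerate on an ordinary laptop.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        plate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plate)] = plate
    return table


lookup = build_lookup()

# A "hashed" identifier as it might appear in a released file.
published_hash = md5_hex("7B82")

# Reversing it is a dictionary lookup, not a cryptographic attack.
recovered = lookup[published_hash]
print(recovered)
```

A secret salt or a keyed hash would defeat this particular lookup, since an outsider could no longer precompute the table without knowing the secret.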
Soon after Pandurangan’s revelation of the botched unsalted MD5 cryptographic hash in the TLC data, Anthony Tockar, working on a summer Data Science internship with Neustar, posted his blog post “Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset” with the aim of introducing the concept of “differential privacy” and announcing Neustar’s expertise in this area. (It’s well worth checking out both Tockar’s short, but informative, tutorial on differential privacy and his application of the method to the maps of the TLC taxi data, as his smartly designed graphics allow you to interactively adjust differential privacy’s “epsilon” parameter and see its impact on the results.)
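For readers who have not met the “epsilon” parameter before, here is a minimal sketch of the Laplace mechanism, one standard way of achieving differential privacy for a count. This is not necessarily the method used in Tockar’s post; the function names, the sensitivity of 1, and the example count of 40 drop-offs are all assumptions made for illustration. Smaller epsilon means more noise and stronger privacy, while larger epsilon means less noise and more accurate results.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with rate
    1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count using the Laplace mechanism.

    sensitivity=1.0 assumes that adding or removing one trip changes
    the count by at most 1 (an assumption for this illustration).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, rng, true_count=40, trials=2000):
    """Average distance between the noisy and true counts."""
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


rng = random.Random(0)  # fixed seed so the sketch is repeatable
strong_privacy_error = mean_abs_error(0.1, rng)   # small epsilon, heavy noise
weak_privacy_error = mean_abs_error(10.0, rng)    # large epsilon, light noise
print(strong_privacy_error > weak_privacy_error)
```

The trade-off shown here is the same one the interactive maps expose: turning epsilon down blurs the released counts further, which protects individual riders at the cost of accuracy.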
To illustrate possible rider privacy risks for the TLC taxi data, Tockar, armed with some celebrity paparazzi photos and some clever insights as to when, where and how to find potential vulnerabilities, produced a blog post replete with attention-grabbing tales of miserly celebrities who stiffed drivers on their tips and of the cyber-stalking of strip club patrons, which quickly went viral. And so as to up the fear, uncertainty, and dread (FUD) factors surrounding his attacks, Tockar further gravely warned us all in his post that:
Equipped with this [TLC Taxi] dataset, and just a little auxiliary information about you, it would be quite trivial for someone to follow your movements, collecting data on your whereabouts and habits, while you remain blissfully unaware. A stalker could find out where you live and work. Your partner may spy on you. A thief could work out when you’re away from home, based on your habits.
However, as I’ll explain in more detail, these quite concerning claims need to be sorted out in a rational fashion if we are to make complex decisions about the possible trade-offs between Freedom of Information and open government principles on the one hand and data privacy concerns on the other. Doing so requires that we move beyond mere citation of anecdotes (or worse, collections of anecdotes in which carefully targeted and especially vulnerable, non-representative cases have been repackaged as “anecdata”). Instead, we must ground our risk assessments in systematic investigation, appropriately founded on the principles of scientific study design and statistically representative samples. Regrettably, that wasn’t the case here, and it has quite often not been the case for the many headline-snatching re-identification attacks that have made the news in recent years.