PCLinuxOS Magazine

PCLinuxOS

Debunking The Myth Of "Anonymous" Data

by Paige Collings
Electronic Frontier Foundation
Reprinted under Creative Commons License

Today, almost everything about our lives is digitally recorded and stored somewhere. Each credit card purchase, personal medical diagnosis, and preference about music and books is recorded and then used to predict what we like and dislike, and—ultimately—who we are.

This often happens without our knowledge or consent. Personal information that corporations collect from our online behaviors sells for astonishing profits and incentivizes online actors to collect as much as possible. Every mouse click and screen swipe can be tracked and then sold to ad-tech companies and the data brokers that service them.

In an attempt to justify this pervasive surveillance ecosystem, corporations often claim to de-identify our data. This supposedly removes all personal information (such as a person's name) from the data point (such as the fact that an unnamed person bought a particular medicine at a particular time and place). Personal data can also be aggregated, whereby data about multiple people is combined with the intention of removing personal identifying information and thereby protecting user privacy.

Sometimes companies say our personal data is "anonymized," implying a one-way ratchet where it can never be dis-aggregated and re-identified. But this is not possible—anonymous data rarely stays this way. As Professor Matt Blaze, an expert in the field of cryptography and data privacy, succinctly summarized: "something that seems anonymous, more often than not, is not anonymous, even if it's designed with the best intentions."

Anonymization…and Re-Identification?

Personal data can be considered on a spectrum of identifiability. At the top is data that can directly identify people, such as a name or state identity number, which can be referred to as "direct identifiers." Next is information indirectly linked to individuals, like personal phone numbers and email addresses, which some call "indirect identifiers." After this comes data connected to multiple people, such as a favorite restaurant or movie. The other end of this spectrum is information that cannot be linked to any specific person—such as aggregated census data, and data that is not directly related to individuals at all like weather reports.

Data anonymization is often undertaken in two ways. First, some personal identifiers like our names and social security numbers might be deleted. Second, other categories of personal information might be modified—such as obscuring our bank account numbers. For example, the Safe Harbor provision contained with the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that only the first three digits of a zip code can be reported in scrubbed data.

However, in practice, any attempt at de-identification requires removal not only of your identifiable information, but also of information that can identify you when considered in combination with other information known about you. Here's an example:

First, think about the number of people that share your specific ZIP or postal code.

Next, think about how many of those people also share your birthday.

Now, think about how many people share your exact birthday, ZIP code, and gender.

According to one landmark study, these three characteristics are enough to uniquely identify 87% of the U.S. population. A different study showed that 63% of the U.S. population can be uniquely identified from these three facts.

We cannot trust corporations to self-regulate. The financial benefit and business usefulness of our personal data often outweighs our privacy and anonymity. In re-obtaining the real identity of the person involved (direct identifier) alongside a person's preferences (indirect identifier), corporations are able to continue profiting from our most sensitive information. For instance, a website that asks supposedly "anonymous" users for seemingly trivial information about themselves may be able to use that information to make a unique profile for an individual.

Location Surveillance

To understand this system in practice, we can look at location data. This includes the data collected by apps on your mobile device about your whereabouts: from the weekly trips to your local supermarket to your last appointment at a health center, an immigration clinic, or a protest planning meeting. The collection of this location data on our devices is sufficiently precise for law enforcement to place suspects at the scene of a crime, and for juries to convict people on the basis of that evidence. What's more, whatever personal data is collected by the government can be misused by its employees, stolen by criminals or foreign governments, and used in unpredictable ways by agency leaders for nefarious new purposes. And all too often, such high tech surveillance disparately burdens people of color.

Practically speaking, there is no way to de-identify individual location data since these data points serve as unique personal identifiers of their own. And even when location data is said to have been anonymized, re-identification can be achieved by correlating de-identified data with other publicly available data like voter rolls or information that's sold by data brokers. One study from 2013 found that researchers could uniquely identify 50% of people using only two randomly chosen time and location data points.

Done right, aggregating location data can work towards preserving our personal rights to privacy by producing non-individualized counts of behaviors instead of detailed timelines of individual location history. For instance, an aggregation might tell you how many people's phones reported their location as being in a certain city within the last month, but not the exact phone number and other data points that would connect this directly and personally to you. However, there's often pressure on the experts doing the aggregation to generate granular aggregate data sets that might be more meaningful to a particular decision-maker but which simultaneously expose individuals to an erosion of their personal privacy.

Moreover, most third-party location tracking is designed to build profiles of real people. This means that every time a tracker collects a piece of information, it needs something to tie that information to a particular person. This can happen indirectly by correlating collected data with a particular device or browser, which might later correlate to one person or a group of people, such as a household. Trackers can also use artificial identifiers, like mobile ad IDs and cookies to reach users with targeted messaging. And "anonymous" profiles of personal information can nearly always be linked back to real people—including where they live, what they read, and what they buy.

For data brokers dealing in our personal information, our data can either be useful for their profit-making or truly anonymous, but not both. EFF has long opposed location surveillance programs that can turn our lives into open books for scrutiny by police, surveillance-based advertisers, identity thieves, and stalkers. We've also long blown the whistle on phony anonymization.

As a matter of public policy, it is critical that user privacy is not sacrificed in favor of filling the pockets of corporations. And for any data sharing plan, consent is critical: did each person consent to the method of data collection, and did they consent to the particular use? Consent must be specific, informed, opt-in, and voluntary.

Previous Page Top Next Page