Cautionary Tales: Learning from the Frontlines of Data Privacy and Security - Part 3 of 3
Is your data really anonymized?
So far in this series, we've covered the ways data privacy techniques can strengthen cybersecurity programs, highlighting how security-only defenses often miss the mark against insider threats.
Part 1 introduced the value of collaboration between privacy and security teams in creating a unified data protection strategy. In that post, we focused on how data minimization and anonymization can protect against insider data breaches.
Part 2 focused specifically on data anonymization. We discussed the privacy, security, and compliance benefits of anonymous data, the obstacles to anonymizing data at scale, and how synthetic data can be used to overcome these obstacles. Subsalt’s platform is built to unlock anonymous synthetic data at scale to protect sensitive data.
Part 3 is about the real-world challenges of data anonymization. We’ll walk through a well-known case in which traditional anonymization techniques failed, leading to a costly privacy breach and litigation.
The Netflix Prize: Linking Data to Break Anonymization
The Netflix Prize, launched in 2006, was a competition aimed at improving the accuracy of Netflix's movie recommendation algorithm. Netflix released a dataset containing 100 million movie ratings from roughly 500,000 “anonymized” users, challenging data scientists and enthusiasts worldwide to create an algorithm that could beat Netflix's existing recommendation system.
Netflix planned to award $1 million to the winning team, sparking widespread interest and participation. This initiative highlighted the potential of collaborative innovation in improving algorithmic predictions, but it also unintentionally revealed the challenges and risks associated with anonymizing personal data for public use.
Following the release of the dataset, two University of Texas researchers, Arvind Narayanan and Vitaly Shmatikov, demonstrated that despite Netflix’s efforts to anonymize the data, most individual users in the Netflix dataset could be re-identified by cross-referencing publicly available movie ratings on the Internet Movie Database (IMDb).
By comparing the overlapping ratings and timestamps between the two datasets, the researchers were able to identify individual Netflix users, exposing not only their viewing habits but, potentially, other sensitive information, such as their political preferences.
This privacy failure led to a class-action lawsuit against Netflix and the cancellation of a planned second Netflix Prize. It serves as a cautionary tale about the limits of legacy approaches to data anonymization and the ease with which seemingly protected data can be re-identified.
The Tests for Anonymization
Regulators in the EU have defined three criteria for evaluating whether data has been truly anonymized or still constitutes personal information for privacy and security purposes:
- Singling Out
- Linkability
- Inference
The re-identification attack constructed against the Netflix Prize data set demonstrates all three criteria. True data anonymization eliminates all three risks.
“Singling Out” Risks
The "singling out" test measures whether an individual's record is sufficiently unique that it can be uniquely identified in an anonymized data set. In the Netflix Prize, the specific combination of movies rated and reviewed by each user in the Netflix and IMDB datasets was highly unique. The two researchers who conducted the re-identification attack on the dataset determined that with only eight reviews and review dates, 99% of users in the data set could be singled out.
Traditional noise injection and data fuzzing provided virtually no protection against this singling-out attack. The researchers could still uniquely re-identify 99% of individual users given six real reviews and two totally fake ones, and randomizing review dates within a 14-day window caused no significant reduction in the attack's accuracy.
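To make the singling-out test concrete, here is a minimal Python sketch under invented assumptions: the ratings table, user IDs, and the number of known ratings are all made up, and this shows the general shape of such a check rather than the researchers' actual code. It asks: if an attacker knows a few of a user's (movie, approximate date) pairs, how many records in the released data set are consistent with that knowledge?

```python
import random
from datetime import date

# Toy "anonymized" ratings table: user_id -> {movie_id: rating_date}.
ratings = {
    "anon_1": {"m1": date(2006, 1, 3), "m2": date(2006, 2, 9), "m3": date(2006, 3, 1)},
    "anon_2": {"m1": date(2006, 1, 5), "m4": date(2006, 2, 2)},
    "anon_3": {"m2": date(2006, 2, 10), "m3": date(2006, 3, 4), "m5": date(2006, 4, 1)},
}

FUZZ_DAYS = 14  # tolerate +/- 14 days of date noise, as in the original attack

def consistent(record, clues):
    """A record matches if every clue movie appears with a date inside the window."""
    return all(
        movie in record and abs((record[movie] - when).days) <= FUZZ_DAYS
        for movie, when in clues.items()
    )

def singled_out(target_id, k=3):
    """Sample k of the target's ratings as the attacker's side knowledge and
    check whether they pin down exactly one record in the release."""
    record = ratings[target_id]
    clue_movies = random.sample(sorted(record), min(k, len(record)))
    clues = {m: record[m] for m in clue_movies}
    matches = [uid for uid, rec in ratings.items() if consistent(rec, clues)]
    return len(matches) == 1

# Estimate the fraction of users who can be singled out with k known ratings.
hits = sum(singled_out(uid) for uid in ratings)
print(f"{hits}/{len(ratings)} users singled out")
```

On real data, the attack's power comes from scale: with tens of thousands of titles, even a handful of (movie, date) pairs is almost always unique, which is why eight reviews sufficed to single out 99% of users.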
Similar singling-out attacks have been conducted on other data sets, including a study of “anonymized” credit card transactions in which researchers at MIT determined that 90% of cardholders could be re-identified based on only four credit card transactions.
“Linkability” Risks
“Linkability” measures the ability to connect at least two records concerning the same data subject or group of data subjects, whether within a single database or across two different databases. During the Netflix Prize, users were re-identified by linking unique patterns in their Netflix movie reviews to the IMDb database, which was not anonymized.
This demonstrates how singling out and linkability compound each other to create re-identification risk: because rating patterns in Netflix’s data were so unique, they could be matched to the IMDb data set with very high accuracy.
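The linkage step can be sketched in the same spirit. In this hypothetical Python snippet (all names and data invented), each record in an "anonymized" release is scored against a public profile, and a match is accepted only if it clearly beats the runner-up, a crude stand-in for the statistical "eccentricity" check described in the published attack.

```python
from datetime import date

DATE_WINDOW = 14  # days of slack when comparing review dates

# Hypothetical "anonymized" Netflix-style records: user -> {movie: rating_date}.
netflix = {
    "anon_17": {"m1": date(2006, 1, 3), "m2": date(2006, 2, 9), "m3": date(2006, 3, 1)},
    "anon_42": {"m1": date(2006, 1, 5), "m4": date(2006, 2, 2)},
}

# A public IMDb-style profile posted under a real name (hypothetical data).
imdb_profile = {"m2": date(2006, 2, 12), "m3": date(2006, 3, 6)}

def link_score(netflix_record, imdb_record):
    """Count movies reviewed in both records with dates inside the window."""
    return sum(
        1
        for movie, when in imdb_record.items()
        if movie in netflix_record
        and abs((netflix_record[movie] - when).days) <= DATE_WINDOW
    )

def best_match(imdb_record, candidates, margin=1):
    """Return the candidate whose record best matches the public profile,
    but only if it beats the runner-up by a margin."""
    scored = sorted(
        ((link_score(rec, imdb_record), uid) for uid, rec in candidates.items()),
        reverse=True,
    )
    (top, uid), (runner_up, _) = scored[0], scored[1]
    return uid if top - runner_up >= margin else None

print(best_match(imdb_profile, netflix))  # -> "anon_17"
```

The margin check matters: without it, the attacker would confidently "identify" someone even when several candidates score almost equally well.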
This illustrates an important general pattern: the more unique records are, the greater both the singling-out and linkability risks, which makes anonymizing rich user data difficult. Said another way: the more data you collect about each person, the harder anonymization becomes, because distinctive records are easier to match against external data sources.
This has important implications for organizations that want to anonymize their data. As they collect higher-quality and more specific user data, anonymization becomes more difficult, and the risk of data breaches from seemingly innocuous data sets increases.
Anonymization solutions need to account for this risk, especially because unique behavioral patterns are hard to disguise through noise injection or data fuzzing alone.
“Inference” Risks
The inference test assesses whether a data set can be used to deduce attributes that do not appear in the data set itself. These inferable attributes can, in turn, compound singling-out and linkability risks.
In the Netflix Prize, researchers found that users’ viewing histories could be used to infer political and sexual preferences. This discloses more private information than was in the original data set, increasing the potential for harm to viewers whose data was released.
It also created new re-identification risks by allowing researchers to single out and link data about individual users based on these newly inferred attributes. This expands the set of attributes available for re-identification attacks beyond the data found explicitly in the original data set.
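As a toy illustration of inference risk (every title, label, and user here is fabricated), the sketch below copies a sensitive label from the most similar profile in a small labeled sample the attacker already holds, showing how attributes absent from a release can still be predicted from it.

```python
# Hypothetical data: titles each user in the released data set rated.
watched = {
    "anon_1": {"documentary_a", "drama_b"},
    "anon_2": {"campaign_film_c", "documentary_a"},
    "anon_3": {"drama_b", "action_d"},
}

# Attacker's side knowledge: a small labeled sample pairing viewing sets
# with a sensitive attribute (a made-up political-leaning flag).
labeled_sample = [
    ({"campaign_film_c", "documentary_a"}, "leaning_x"),
    ({"action_d", "drama_b"}, "leaning_y"),
]

def jaccard(a, b):
    """Similarity of two title sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def infer_label(titles):
    """1-nearest-neighbour inference: copy the label of the most similar
    labeled viewing profile."""
    _, label = max(labeled_sample, key=lambda pair: jaccard(titles, pair[0]))
    return label

for uid, titles in watched.items():
    print(uid, "->", infer_label(titles))
```

The released data set never contains the label, yet the attacker recovers a plausible value for it; this is exactly the kind of deduction the inference test treats as a privacy failure.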
To learn more about the implications of these criteria, see the Article 29 Data Protection Working Party's Opinion 05/2014 on Anonymisation Techniques (PDF).
The Path Forward
These examples, far from mere academic exercises, serve as reminders of the ongoing battle to prevent data breaches caused by negligent and malicious insiders with access to sensitive data. They highlight the need for a shift towards more rigorous anonymization techniques.
Synthetic data platforms, like Subsalt, are built to meet this need.
Introducing Subsalt
Subsalt is revolutionizing data anonymization by bringing synthetic data into the enterprise, moving beyond legacy manual methods. Its goal is to meet the strict legal standards for anonymization, greatly reducing the risk that individuals can be re-identified from released data.
Subsalt's query engine makes it easy to create and manage synthetic data that effectively addresses the "singling out" and "linkability" tests described above, supporting compliance with laws like HIPAA, CCPA, and GDPR.
Embracing the Future with Optimism
The privacy failures described above are instructive, underscoring the challenges and opportunities within data anonymization practices. They call for a balanced approach that honors both the utility of data and the imperative of privacy.
With tools like Subsalt, we stand on the brink of a new era in data security, ready to tackle the complexities with confidence and at scale.
Elevate Your Data Protection Strategy Through Anonymization
Is anonymization a cornerstone of your approach to safeguarding data?
Let's connect. Discover how Subsalt can integrate advanced anonymization into your cybersecurity toolkit.