Defending from Within: The Critical Role of Anonymization in Data Security - Part 2 of 3

Is the biggest cybersecurity risk inside your company?

Insider threat incidents surged by 47% over two years, now taking businesses an average of 85 days to contain and costing them $15 million annually.

An impenetrable perimeter can’t stop that – so what can you do about it?

Addressing the Unseen Danger: Strategies to Combat Insider Threats

In our first piece, we showed the critical yet overlooked link between security and privacy teams. Perimeter-based security measures are great at fortifying digital boundaries against external threats. But they are often insufficient against the risks posed by insiders, whether negligent or malicious.

This gap presents an opportunity to incorporate privacy strategies, like data minimization and anonymization, into security programs. These approaches reduce the risk of data breaches and protect the organization and its users.

The growth in insider threats, combined with increasingly strict privacy and security laws and regulations, has made these techniques key components of a data protection strategy.

In this piece, we explain anonymization, defining its purpose and reviewing its value in a robust enterprise security program. We’ll also confront the challenges organizations face in adopting it and explore how modern anonymization methods can be woven into security programs.

Personal Data vs Anonymized Data

To understand the key role of anonymization, it helps to know what kind of data accounts for the most harmful and expensive data breaches: personal data.

Personal data is information about a specific person. Laws like HIPAA, GDPR, and CCPA explicitly protect personal data and penalize its improper use or inadequate protection.

A breach of personal data has serious legal and regulatory implications and is expensive to remediate.

While necessary for many tasks like customer support or processing payroll, personal data is unnecessary for many others, like training AI models.

Anonymized data is not personal data. It is freed from the rules and consent requirements that govern personal data. 

A breach involving anonymized data does not provoke the same regulatory repercussions. 

Anonymization enables organizations to securely lock away sensitive personal information and instead grant access to data that poses a much lower risk to the company. 

This process boosts an organization's security posture. It also keeps data flowing to the teams that derive valuable insights for the business.

Anonymization Demystified


Let’s quickly define anonymization: a process that renders data untraceable back to an individual—a one-way transformation. This is the definition under laws like GDPR and CCPA.

Sounds simple. But it is quite hard at scale, for a couple of reasons.

Challenge 1: Re-identification

The risk of re-identification is not just in PII, like names and email addresses. It also lies in quasi-identifiers - things like zip codes, ages, spending patterns, and racial backgrounds. When combined, these can uniquely identify a person even without any PII.
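
To make this concrete, here's a minimal sketch (Python with pandas, using made-up column names) of how an analyst might check what share of records can be singled out by quasi-identifiers alone, with no PII in sight:

```python
import pandas as pd

# Hypothetical table with no direct PII, only quasi-identifiers.
df = pd.DataFrame({
    "zip_code":     ["30301", "30301", "94105", "94105", "94105"],
    "age":          [34, 34, 51, 29, 29],
    "spend_bucket": ["high", "low", "high", "low", "low"],
})

quasi_identifiers = ["zip_code", "age", "spend_bucket"]

# How many records share each combination of quasi-identifier values?
group_sizes = df.groupby(quasi_identifiers).size()

# Any record whose combination appears exactly once can be singled out.
unique_records = int((group_sizes == 1).sum())
print(f"{unique_records / len(df):.0%} of records are unique on quasi-identifiers alone")
```

In this toy example, three of the five records are already unique, even though no name or email ever appears in the table.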

The obvious solution might seem to be to remove all quasi-identifiers. But…

Challenge 2: Quality Loss 

Removing ever more information for privacy greatly reduces the data's usefulness. Imagine if you masked every sensitive column in a table. You wouldn’t be able to learn anything!

These two competing forces illustrate the tension between maintaining adequate data protection and keeping utility high enough for analysis. The goal is to minimize lost utility while getting meaningful reductions in re-identification risk.
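
To illustrate that tension, here's a hedged sketch (again pandas, with illustrative column names) of one classic technique, generalization: truncating zip codes and bucketing ages lowers uniqueness, but the coarser data can no longer support analyses that needed the exact values.

```python
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["30301", "30318", "94105", "94107", "94110"],
    "age":      [34, 37, 51, 29, 31],
})

def unique_share(frame, cols):
    """Share of records whose quasi-identifier combination is unique."""
    sizes = frame.groupby(cols).size()
    return float((sizes == 1).sum()) / len(frame)

# Generalize: keep only the 3-digit zip prefix and 10-year age bands.
coarse = df.assign(
    zip_code=df["zip_code"].str[:3],
    age=(df["age"] // 10) * 10,
)

print("uniqueness before:", unique_share(df, ["zip_code", "age"]))      # 1.0
print("uniqueness after: ", unique_share(coarse, ["zip_code", "age"]))  # lower

# Privacy improves, but any analysis that needed exact zip codes or ages
# can no longer be run on the generalized data.
```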

Unraveling the Complexity of Anonymization in Organizations

If anonymization holds such significant value, why do many organizations struggle to implement it effectively? 

The answer: it’s really hard to adequately anonymize data at scale.

As previously outlined, finding the perfect balance between data protection and utility is tricky. 

You must minimize information to achieve anonymity, but you have to preserve enough data to be useful. Balancing these requirements is hard without a manual, fine-grained review of every unique analysis.

So traditional approaches to anonymization depend on context and take a long time (weeks or months!), requiring a team effort among legal, data, and engineering teams.

These infrequent collaborators must navigate the many decisions about data transformations. They must choose whether to redact, obfuscate, or truncate information to meet both legal duties and data science goals. This process can last for months, posing a significant administrative burden and delaying time to value.
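
To picture what that negotiation produces, here is a hypothetical sketch (Python/pandas, with invented column names and rules) of the kind of per-column transformation policy such a review typically hand-crafts for a single analysis:

```python
import pandas as pd

# Hypothetical outcome of one legal/data/engineering review: every column
# gets a hand-negotiated treatment before the dataset can be shared.
POLICY = {
    "name":      "redact",      # direct identifier: drop entirely
    "email":     "redact",
    "zip_code":  "truncate_3",  # quasi-identifier: keep 3-digit prefix
    "age":       "bucket_10",   # quasi-identifier: 10-year bands
    "diagnosis": "keep",        # needed for this analysis, kept as-is
}

def apply_policy(df, policy):
    out = df.copy()
    for col, rule in policy.items():
        if rule == "redact":
            out = out.drop(columns=[col])
        elif rule == "truncate_3":
            out[col] = out[col].astype(str).str[:3]
        elif rule == "bucket_10":
            out[col] = (out[col] // 10) * 10
        # "keep" leaves the column untouched
    return out
```

Every new analysis restarts this negotiation from scratch.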

It’s definitionally unscalable.

The Promise and Pitfalls of Synthetic Data for Anonymization

New methods are emerging to unlock scalable anonymization. They offer innovative solutions to the enduring privacy versus utility tradeoff. Among these, synthetic data is uniquely promising. 

This technique involves crafting computer-generated data from sensitive source material. The generated data preserves the structure and statistical properties of the original but not its privacy and security risks.
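
As a rough intuition for the idea (not a production technique), here's a deliberately naive sketch using NumPy: fit simple statistics of the real data, then sample brand-new records from the fitted distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for two sensitive numeric columns (say, age and annual spend).
real = rng.multivariate_normal(
    mean=[45, 3200], cov=[[90, 1500], [1500, 250000]], size=1000
)

# Naive "synthesizer": learn the mean and covariance of the real data,
# then draw entirely new records from that fitted distribution.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# No synthetic row is a copy of a real person, yet aggregate statistics
# (means, correlations) closely track the source data.
print(np.corrcoef(real[:, 0], real[:, 1])[0, 1])
print(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1])
```

Real synthetic data generators are far more sophisticated, handling categorical data, skewed distributions, and formal privacy guarantees, but the core bargain is the same: keep the statistical shape, drop the individual records.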

But like any solution that promises to comprehensively solve a security or privacy problem through novel technology, we should investigate the risks carefully. 

  • Risk of Inadequate Anonymization: Despite appearing untraceable on the surface, synthetic data may not fully anonymize the underlying personal data. Synthetic datasets can still reveal private information about individuals or groups through statistical re-identification attacks.

If left unprotected against these attacks, synthetic data should still be considered personal data, and often is by regulators.

  • Data Quality Issues: Static synthetic datasets contain ‘noise’ to achieve the required privacy gains. As a result, they lose fidelity to the source data in the noisiest areas. This makes a single synthetic dataset inadequate for all potential analyses.

The key to using synthetic data well is scalably balancing these trade-offs. Data teams need to trust the synthetic data for analytics, AI/ML, or research projects. 

Legal and Security teams need assurances on the anonymization.
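
One simple family of checks behind such assurances compares each synthetic record to its nearest real record: synthetic rows that nearly duplicate a real row are a red flag. The sketch below (NumPy, with illustrative data and thresholds, not Subsalt's method) shows the idea; real assessments go much further, e.g. membership inference testing.

```python
import numpy as np

def min_distance_to_real(synthetic, real):
    """For each synthetic record, the distance to its closest real record."""
    # Pairwise Euclidean distances; fine for small, standardized numeric data.
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 4))        # stand-in for the sensitive source data
synthetic = rng.normal(size=(500, 4))   # stand-in for a generated dataset

closest = min_distance_to_real(synthetic, real)
# Illustrative threshold: rows this close may be leaking a real individual.
print("suspiciously close synthetic rows:", int((closest < 0.05).sum()))
```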

For companies that can manage these challenges, synthetic data is a strong anonymization tool. 

Introducing Subsalt: Enterprise Data Anonymization

Subsalt brings the promise of synthetic data for anonymization to the enterprise, replacing legacy manual processes and techniques.

Subsalt's primary function is to meet the strict legal and technical requirements for anonymization. This entails ensuring the synthetic data carries minimal risk of re-identification and providing the assurances needed to meet applicable data protection standards.

Subsalt's query engine generates synthetic data tailored to specific use cases on demand. This unique query-time automation ensures that users get the best data for their needs without any synthetic data technical know-how.

Up Next: Avoiding Disaster

As we continue our exploration, the next post will walk through real-world scenarios where a good data anonymization strategy could have averted disaster.

Redefining Your Data Protection Strategy with Anonymization

Have you established anonymization as a standard element of your data protection arsenal? 

Reach out — we're here to help explore how Subsalt can add anonymization to your cybersecurity capabilities.