What is the difference between PII and personal data?

PII originates from the U.S. framework (NIST SP 800-122) and focuses on data that can directly identify an individual. 'Personal data' is the GDPR's term, which is broader: it covers any information relating to an identifiable person, even indirectly. In practice, all PII qualifies as personal data under the GDPR. However, the GDPR protects data that NIST would not consider PII in a strict sense, such as cookie identifiers.

Is a work email address PII?

Yes. A corporate email like firstname.lastname@company.com directly identifies a natural person. Even generic addresses like info@company.com can contain PII in the email metadata (sender name, signature, reply-to fields, etc.).

Is an IP address PII?

It depends on the jurisdiction. Under the GDPR, yes — the Court of Justice of the European Union ruled in Breyer v. Germany (2016) that a dynamic IP address can constitute personal data. In the U.S., IP addresses are generally not considered PII under NIST, but the CCPA classifies them as personal information.

What happens if I properly anonymize PII?

If anonymization is irreversible and effective, the resulting data ceases to be personal data under the GDPR and falls outside the CCPA's scope as de-identified data. This means it can be processed without a legal basis, without obligation to respond to access requests, and without breach notification requirements. However, if re-identification remains reasonably likely (the test set out in GDPR Recital 26), the data is still personal data and the original regulatory framework continues to apply.

How much does a PII breach cost?

According to the IBM Cost of a Data Breach 2025 report, the average cost per compromised customer PII record is $160. The global average cost of a complete incident reaches $4.44 million. In the United States, the figure rises to $10.22 million per incident.

PII (Personally Identifiable Information): Types, Risks & Protection

Most PII breaches don't originate in production systems. They happen in unaudited staging environments, CSV files copied to local machines, or API logs that capture full payloads into unencrypted storage buckets.

Personally identifiable information (PII) appears in 53% of all data breaches reported globally, according to the IBM Cost of a Data Breach 2025 report. Each compromised customer PII record costs an average of $160. For employee PII, the figure reaches $168. In an organization managing tens of thousands of records, the financial impact of a single breach escalates rapidly.

This article covers what PII is, how it's classified, where it typically goes unprotected in enterprise infrastructure, and which techniques reduce exposure.

What is PII exactly?

PII (Personally Identifiable Information) is any data that, alone or combined with other data, can be used to identify a specific individual. The term has a formal definition in NIST SP 800-122, and it maps closely — though not identically — to what the GDPR calls "personal data" and what the CCPA refers to as "personal information."

The distinction matters. NIST defines PII around the ability to distinguish or trace an individual's identity. The GDPR expands the scope to any information relating to an identified or identifiable natural person, even indirectly. The CCPA takes a broader approach still, explicitly covering household-level data and inferences drawn from consumer behavior.

For a company operating across multiple jurisdictions, the practical implication is this: the definition of what constitutes PII varies depending on the regulatory framework in play. A device identifier may not qualify as PII under NIST's strict reading, but it is personal data under the GDPR and personal information under the CCPA. Security teams that limit their protection scope to obvious identifiers — SSNs, names, emails — leave significant surface area unaddressed.

Types of PII: direct and indirect identifiers

Not all PII carries the same risk level. The most operationally useful classification divides PII into two categories based on identification capability.

Direct PII (explicit identifiers)

Each of these data points can identify an individual without ambiguity, on its own:

Full name

Social Security Number (SSN)

Passport or driver's license number

Personal email address

Phone number

Bank account or credit card number

Biometric data: fingerprints, facial recognition, iris scans

Facial photograph

Medical record number

Indirect PII (quasi-identifiers)

In isolation, none of these data points identify an individual. Combined with each other or cross-referenced against external sources, identification becomes possible. This is the category most organizations underestimate:

Date of birth

ZIP code

Gender

Job title + company name

IP address

Cookie or device identifiers

Geolocation data

Purchase or browsing history

A widely cited study by Latanya Sweeney (now at Harvard) estimated that 87% of the U.S. population could be uniquely identified using just three quasi-identifiers: 5-digit ZIP code, date of birth, and gender. These are three fields that most organizations do not classify as sensitive in their data protection policies.
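To see how this applies to your own data, you can measure how many records carry a quasi-identifier combination that no other record shares. The following is a minimal sketch, assuming records are plain dictionaries; the field names `zip_code`, `date_of_birth`, and `gender` are hypothetical and should be mapped to your own schema.

```python
from collections import Counter

# Hypothetical column names for the three quasi-identifiers discussed above.
QUASI_IDENTIFIERS = ("zip_code", "date_of_birth", "gender")


def unique_fraction(records, keys=QUASI_IDENTIFIERS):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Illustrative data only.
records = [
    {"zip_code": "02139", "date_of_birth": "1985-03-12", "gender": "F"},
    {"zip_code": "02139", "date_of_birth": "1985-03-12", "gender": "F"},
    {"zip_code": "02139", "date_of_birth": "1990-07-01", "gender": "M"},
    {"zip_code": "94105", "date_of_birth": "1978-11-23", "gender": "F"},
]
print(unique_fraction(records))  # 0.5
```

A high fraction suggests that the dataset is re-identifiable even after names and emails are removed.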

Where PII hides in your infrastructure

The primary problem with PII is rarely a lack of protection on known systems. It's not knowing every location where PII is stored.

These are the points where security teams most frequently discover exposed PII:

Production databases cloned to lower environments. This is the most common scenario. Someone clones production to staging or development to debug an issue, and real customer data remains there — unmasked, accessible to the entire engineering team. Data masking should be a mandatory step before any copy reaches a non-production environment.

Application logs and monitoring systems. APIs that log full request/response bodies often dump names, emails, tokens, and addresses in plain text. If your logs flow into Elasticsearch, Splunk, or CloudWatch without field-level filtering, you have PII exposed in a system that is probably outside the scope of your compliance audit.

Flat files in cloud storage. CSVs, Excel exports, database dumps. Many end up in S3 buckets, Google Cloud Storage, or Azure Blob containers with overly permissive access controls. Point-in-time exports created for ad hoc analysis often remain stored, with real personal data, for months or years without oversight.

BI and analytics tools. Dashboards in Metabase, Tableau, or Looker that query production directly without an anonymization layer in between. Analysts see PII in the clear every time they open a report.

Backups and snapshots. An encrypted-at-rest backup still contains PII. If someone restores that backup to an uncontrolled environment, the data is fully exposed. Protection must be applied before the backup, not after.
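For the application-log scenario above, field-level filtering can be approximated in the application itself before anything reaches the log pipeline. The sketch below uses Python's standard `logging` module. The sensitive key names and the email pattern are assumptions for illustration, not an exhaustive PII definition.

```python
import logging
import re

# Hypothetical field names; extend to match your own payloads.
SENSITIVE_KEYS = {"name", "email", "token", "address"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_payload(payload):
    """Return a copy of a (possibly nested) payload with sensitive fields replaced."""
    if isinstance(payload, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact_payload(v)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact_payload(item) for item in payload]
    if isinstance(payload, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", payload)
    return payload


class PiiRedactionFilter(logging.Filter):
    """Safety net: scrubs email-like strings from the formatted message."""

    def filter(self, record):
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", record.getMessage())
        record.args = None  # the message is already formatted
        return True


logger = logging.getLogger("api")
logger.addFilter(PiiRedactionFilter())

body = {"user": {"name": "Ada Lovelace", "email": "ada@example.com"}, "plan": "pro"}
logger.warning("request body: %s", redact_payload(body))
```

Redacting by key name catches structured fields, while the filter acts as a second line of defense for free-text messages. Neither replaces filtering at the log aggregator.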

The first step in protecting PII is identifying where it lives. Data classification tools automate this discovery process, scanning databases, files, and cloud storage to locate fields containing PII proactively.

Why exposed PII is so expensive

The figures from the IBM Cost of a Data Breach 2025 report are instructive:

The global average cost of a data breach reached $4.44 million.

In the United States, the figure rises to $10.22 million per incident.

Customer PII is the most frequently compromised data type, present in 53% of breaches.

20% of breaches studied involved shadow AI — AI tools adopted by employees without security team oversight — adding up to $670,000 in extra cost per incident.

Beyond direct incident costs, regulatory penalties add another layer. Under the GDPR, a PII breach can result in fines of up to 20 million euros or 4% of global annual turnover, whichever is higher. HIPAA violations in the U.S. carry penalties up to $2.13 million per violation category per year. Under the CCPA, statutory damages range from $100 to $750 per consumer per incident, numbers that scale quickly in class action suits.
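The arithmetic behind these ceilings is simple enough to sketch. The example below assumes the GDPR cap is the higher of the two figures (as Article 83(5) states) and treats the CCPA range as a straight per-consumer multiplication. The function names and input figures are hypothetical. Actual fines and damages are set by regulators and courts and are usually well below these maximums.

```python
def gdpr_fine_ceiling_eur(global_annual_turnover_eur):
    """Upper bound: the higher of EUR 20 million or 4% of global annual turnover."""
    return max(20_000_000, global_annual_turnover_eur * 4 / 100)


def ccpa_statutory_range_usd(affected_consumers):
    """(low, high) statutory damages at $100 to $750 per consumer per incident."""
    return affected_consumers * 100, affected_consumers * 750


# Hypothetical company: EUR 1.2 billion turnover, 50,000 affected California consumers.
print(gdpr_fine_ceiling_eur(1_200_000_000))  # 48000000.0
print(ccpa_statutory_range_usd(50_000))      # (5000000, 37500000)
```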

At the federal level, the FTC has increasingly used its Section 5 authority to pursue companies with inadequate data security practices, resulting in consent decrees that impose ongoing compliance obligations for years.

Reputational damage compounds the financial impact. A customer whose PII is leaked is unlikely to renew. This is the hardest cost to quantify, but the one with the longest tail.

How to protect PII: techniques and approaches

PII protection requires combining multiple techniques. The right choice depends on how the data will be used and the acceptable level of residual risk.

Automated discovery and classification

Automated discovery scans databases, files, cloud storage, and logs to identify where PII exists and what type it is: SSNs, emails, bank account numbers, phone numbers, and so on. Data classification tools use pattern recognition engines (regex, NLP, fingerprinting) to detect these identifiers systematically.

The alternative — relying on teams to manually document which tables contain PII — rarely works at scale.
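As a rough illustration of the regex part of such an engine, the sketch below samples rows and flags columns where a pattern matches often enough. The three patterns (email, U.S. SSN format, U.S. phone format) and the `min_ratio` threshold are simplifying assumptions. Real classifiers add validation, context, and NLP to reduce false positives.

```python
import re

# Deliberately simple, illustrative patterns; not production-grade detectors.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def classify_columns(rows, min_ratio=0.5):
    """Map column -> PII types matched in at least min_ratio of its non-empty values."""
    hits = {}    # (column, pii_type) -> number of matching values
    totals = {}  # column -> number of non-empty values
    for row in rows:
        for col, value in row.items():
            if value in (None, ""):
                continue
            totals[col] = totals.get(col, 0) + 1
            for pii_type, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    hits[(col, pii_type)] = hits.get((col, pii_type), 0) + 1
    findings = {}
    for (col, pii_type), count in hits.items():
        if count / totals[col] >= min_ratio:
            findings.setdefault(col, set()).add(pii_type)
    return findings


# Illustrative sample rows with hypothetical column names.
rows = [
    {"id": 1, "contact": "ada@example.com", "tax_ref": "123-45-6789", "notes": "call 555-123-4567"},
    {"id": 2, "contact": "grace@example.org", "tax_ref": "987-65-4321", "notes": "vip"},
]
print(classify_columns(rows))
# {'contact': {'email'}, 'tax_ref': {'ssn'}, 'notes': {'us_phone'}}
```

Note that the free-text `notes` column is flagged too, which is exactly the kind of location manual documentation tends to miss.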

Data masking

Data masking replaces real PII values with fictitious but structurally consistent values. A real name becomes another realistic name. A real bank account number transforms into one with valid format that doesn't belong to any actual account.

There are two primary variants:

Static Data Masking (SDM): applied to a copy of the data. The destination environment (staging, QA, analytics) receives data that is already masked. Original production data remains untouched.

Dynamic Data Masking (DDM): applied in real time based on the profile of the user running the query. A DBA sees full data; an analyst sees PII fields masked.

For development and testing environments, static masking is the standard approach. Gigantics applies masking with referential integrity, which ensures that relationships between tables remain intact after masking. Without this capability, masked data can break JOIN queries and invalidate QA processes.
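One common way to keep relationships intact is deterministic masking: the same input always maps to the same masked output, so a key masked in one table still matches the same key masked in another. The sketch below illustrates the idea with a keyed hash (HMAC). It is an assumption-laden example, not a description of how any particular product implements it. The key, table shapes, and fake-name lists are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical; store outside the masked environment
FAKE_FIRST = ["Alex", "Sam", "Jordan", "Casey", "Riley"]
FAKE_LAST = ["Doe", "Smith", "Lee", "Garcia", "Khan"]


def _digest(value):
    """Keyed hash of the input, reduced to an integer."""
    mac = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:8], "big")


def mask_email(email):
    """Same email in, same fake email out, so foreign keys still line up."""
    return f"user{_digest(email.lower()) % 10**10:010d}@example.invalid"


def mask_name(name):
    n = _digest(name)
    first = FAKE_FIRST[n % len(FAKE_FIRST)]
    last = FAKE_LAST[(n // len(FAKE_FIRST)) % len(FAKE_LAST)]
    return f"{first} {last}"


customers = [{"email": "ada@example.com", "name": "Ada Lovelace"}]
orders = [
    {"order_id": 1, "customer_email": "ada@example.com"},
    {"order_id": 2, "customer_email": "ada@example.com"},
]

masked_customers = [
    {"email": mask_email(c["email"]), "name": mask_name(c["name"])} for c in customers
]
masked_orders = [
    {**o, "customer_email": mask_email(o["customer_email"])} for o in orders
]

# The join key survives masking: every order still points at a masked customer.
known = {c["email"] for c in masked_customers}
print(all(o["customer_email"] in known for o in masked_orders))  # True
```

Because the mapping depends on a key, anyone holding that key could test guesses against masked values. Truncating the hash to ten digits also makes collisions possible in very large tables, so a production implementation would need to handle both.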

Anonymization

Data anonymization is irreversible. There is no key or mapping that allows recovery of the original value. This irreversibility carries a significant regulatory advantage: under the GDPR, properly anonymized data ceases to be personal data. Under the CCPA, de-identified data that meets specific technical and administrative safeguards falls outside the statute's scope. This reduces compliance obligations and the overall risk profile.

Common anonymization techniques include generalization (converting an exact age into a range), suppression (removing the field entirely), perturbation (adding statistical noise), and k-anonymity (ensuring each record is indistinguishable from at least k-1 other records).
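Generalization, suppression, and k-anonymity can be shown together in a few lines. The sketch below assumes hypothetical fields (`name`, `age`, `zip_code`, `gender`) and arbitrary generalization choices (ten-year age bands, three-digit ZIP prefixes). Choosing the right granularity for real data requires measuring the resulting k against your risk tolerance.

```python
from collections import Counter


def generalize(record):
    """Generalize age and ZIP code; suppress the name by leaving it out."""
    decade = (record["age"] // 10) * 10
    return {
        "age_range": f"{decade}-{decade + 9}",
        "zip_prefix": record["zip_code"][:3] + "**",
        "gender": record["gender"],
    }


def k_anonymity(records, keys):
    """Size of the smallest group of records sharing the same values for keys."""
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    return min(groups.values()) if groups else 0


# Illustrative data only.
raw = [
    {"name": "A", "age": 34, "zip_code": "02139", "gender": "F"},
    {"name": "B", "age": 37, "zip_code": "02141", "gender": "F"},
    {"name": "C", "age": 52, "zip_code": "94105", "gender": "M"},
    {"name": "D", "age": 58, "zip_code": "94107", "gender": "M"},
]
generalized = [generalize(r) for r in raw]

print(k_anonymity(raw, ("age", "zip_code", "gender")))                  # 1
print(k_anonymity(generalized, ("age_range", "zip_prefix", "gender")))  # 2
```

Before generalization every record is unique (k = 1). Afterwards each record shares its quasi-identifiers with at least one other (k = 2).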

Pseudonymization

Pseudonymization replaces direct identifiers with tokens or aliases. Unlike anonymization, the process is reversible if access to the mapping table is available. The GDPR recognizes it as a valid security measure (Article 32), but pseudonymized data remains personal data under the regulation.

It is useful when re-identification is needed in specific scenarios — for example, to respond to a data subject access request — but you want to minimize risk in day-to-day operations.
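A minimal sketch of the token-and-mapping-table idea follows. The `TokenVault` class and its method names are hypothetical, and the mapping lives in memory here purely for illustration. In practice the mapping table would sit in a separately secured store with its own access controls.

```python
import secrets


class TokenVault:
    """In-memory mapping table between real identifiers and random tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, value):
        """Return a stable random token for value, creating one if needed."""
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def reidentify(self, token):
        """Reverse lookup; only callers with vault access can do this."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.pseudonymize("ada@example.com")

# Day-to-day systems store and process only the token.
event = {"user": token, "action": "login"}

# Re-identification, for example to answer an access request, goes through the vault.
print(vault.reidentify(event["user"]))  # ada@example.com
```

Because the tokens are random rather than derived from the value, they reveal nothing without the mapping table, which is why access to that table is the control that matters.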

Access controls and least privilege

No data protection technique replaces a well-defined access control framework. The principle of least privilege means that every person and every system accesses only the PII required for their function. Combined with masking or pseudonymization, it establishes a defense-in-depth approach.
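The combination of least privilege and masking can be sketched as a per-role allowlist of fields that may be shown in the clear, with everything else masked by default. The roles and field names below are hypothetical. Real deployments usually enforce this in the database or a query proxy rather than in application code.

```python
# Hypothetical roles and the fields each may see unmasked.
ROLE_CLEAR_FIELDS = {
    "dba": {"id", "name", "email", "ssn", "plan"},
    "analyst": {"id", "plan"},
}

MASK = "***"


def view_for_role(row, role):
    """Return the row with every field outside the role's allowlist masked."""
    allowed = ROLE_CLEAR_FIELDS.get(role, set())  # unknown roles see nothing in the clear
    return {k: (v if k in allowed else MASK) for k, v in row.items()}


row = {"id": 7, "name": "Ada Lovelace", "email": "ada@example.com",
       "ssn": "123-45-6789", "plan": "pro"}

print(view_for_role(row, "analyst"))
# {'id': 7, 'name': '***', 'email': '***', 'ssn': '***', 'plan': 'pro'}
```

Defaulting to masked for unknown roles and unlisted fields means a newly added column is hidden until someone explicitly grants access to it.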

PII across U.S. regulatory frameworks

Unlike the EU, the United States does not have a single, comprehensive federal privacy law. PII protection is governed by a patchwork of sector-specific statutes and state laws. These are the frameworks your compliance team should have on their radar:

NIST SP 800-122 provides the foundational definition of PII for federal agencies and government contractors. It recommends scaling protection to the PII's confidentiality impact level: low, moderate, or high. While not a law, NIST standards inform most federal data handling requirements and serve as a reference across the private sector.

HIPAA (Health Insurance Portability and Accountability Act) defines 18 specific identifiers that constitute Protected Health Information (PHI) when linked to health data. Covered entities and business associates must implement administrative, physical, and technical safeguards. HIPAA's Privacy Rule provides two de-identification methods — Safe Harbor (removing all 18 identifiers) and Expert Determination (statistical verification that re-identification risk is minimal). Techniques like data masking and anonymization map directly to these requirements.

CCPA/CPRA (California Consumer Privacy Act, amended by the California Privacy Rights Act) gives California residents the right to know, delete, and opt out of the sale of their personal information. The law defines personal information broadly, including inferences drawn from consumer data to create profiles.

State privacy laws continue to expand. Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), and over a dozen other states have enacted their own privacy legislation. Each has slightly different definitions, thresholds, and requirements, making PII management more complex for organizations operating nationally.

The practical takeaway: data minimization — collecting and retaining only the PII strictly necessary for a declared purpose — is a principle shared across virtually every framework. If your database stores 120 fields and you need 15, those extra 105 fields represent unnecessary attack surface.
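In code, data minimization often comes down to an explicit allowlist applied at the point of collection or export. The sketch below assumes a hypothetical set of required fields. The point is that anything not listed is dropped rather than stored by default.

```python
# Hypothetical allowlist: the only fields the declared purpose actually needs.
REQUIRED_FIELDS = frozenset({"customer_id", "email", "plan"})


def minimize(record, required=REQUIRED_FIELDS):
    """Keep only allowlisted fields; everything else is discarded."""
    return {k: v for k, v in record.items() if k in required}


incoming = {
    "customer_id": 42,
    "email": "ada@example.com",
    "plan": "pro",
    "date_of_birth": "1985-03-12",
    "home_address": "1 Example Street",
}

stored = minimize(incoming)
print(sorted(stored))                # ['customer_id', 'email', 'plan']
print(len(incoming) - len(stored))   # 2
```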

Checklist: PII protection in your organization

To assess the current state of PII protection in your organization, these are the most relevant control points:

Do you maintain an up-to-date inventory of where PII is stored across your infrastructure?
Do development, testing, and staging environments use masked data, or do they run on direct copies of production?
Do application logs filter PII fields before persisting them?
Do flat files and exports in cloud storage have retention policies and encryption in place?
Do BI dashboards query anonymized data or production in the clear?
Does your team have a defined process for responding to data subject access or deletion requests?
Are AI tools used by employees (shadow AI) inventoried and controlled?

If the answer is "no" to more than two of these questions, the exposure surface is significant. Addressing these gaps before an incident occurs is considerably less costly than responding to one after the fact.
