In database environments, data masking is not limited to hiding isolated fields. The real challenge is preserving behavior: constraints, formats, uniqueness, and relationships between entities. Without a design that accounts for these properties, the result is often a "masked" dataset that breaks validations, degrades queries, or introduces inconsistencies that are difficult to trace.



This article covers masking techniques and patterns for databases, focusing on practical decisions: which technique fits a given data type, the required level of consistency, and the provisioning or access mode.




Factors Conditioning Data Masking in a Relational Model



Referential Integrity (PK/FK) and Table Mappings



If relationships exist, the goal is not to "mask a column" but to preserve mappings. A value in a Primary Key (PK) cannot be transformed independently of its associated Foreign Keys (FKs). The strategy must define:



  • Which entities fall within the scope.

  • Which columns participate in relationships.

  • How consistency is maintained across all occurrences of the same attribute.
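The idea of preserving mappings can be sketched as follows: build one mapping per logical entity and apply it to the primary key and every foreign key that references it. The table and column names (`customers`, `orders`, `customer_id`) and the trivial transform are illustrative, not a real masking function.

```python
# Sketch: preserve PK/FK mappings by sharing one mapping per logical entity.
# Table/column names and the transform are illustrative only.

def build_mapping(values, transform):
    """Assign each distinct original value exactly one masked value."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = transform(v)
    return mapping

customers = [{"id": 101, "name": "Ana"}, {"id": 102, "name": "Luis"}]
orders = [{"order_id": 1, "customer_id": 101},
          {"order_id": 2, "customer_id": 101}]

# One mapping for the customer entity, applied to the PK and every FK.
cust_map = build_mapping([c["id"] for c in customers], lambda v: v + 9000)

for c in customers:
    c["id"] = cust_map[c["id"]]
for o in orders:
    o["customer_id"] = cust_map[o["customer_id"]]
```

Because both tables go through the same mapping, joins between `customers` and `orders` still resolve after masking.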


Uniqueness, Alternate Keys, and "Lookup" Fields



Fields such as email, username, customer number, or internal identifiers are usually subject to uniqueness (constraints or functional requirements). Masking that generates duplicates introduces downstream failures and exceptions. Additionally, attributes used for "lookups" (searches) may require consistency to avoid degrading internal workflows.



Formats, Validations, and Constraints



Beyond length, many systems apply validations: patterns, checks, ranges, allowed domains, or application-level validations. Masking must produce values compatible with:


  • Regex/formats.

  • Temporal and numerical ranges.

  • Checksums where they exist.

  • Valid domains (e.g., codes or enumerations).
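A minimal sketch of this kind of check, assuming illustrative rules (a five-digit zip, a plausible birth-date range, a small country enumeration): each masked row is run through the same predicates the application would enforce.

```python
import re
from datetime import date

# Sketch: run masked values through the same rules the application enforces.
# The rules below (zip regex, date range, country codes) are illustrative.

RULES = {
    "zip": lambda v: re.fullmatch(r"\d{5}", v) is not None,
    "birth_date": lambda v: date(1900, 1, 1) <= v <= date.today(),
    "country": lambda v: v in {"ES", "FR", "PT"},
}

def validate_row(row):
    """Return the list of columns whose masked value fails its rule."""
    return [col for col, ok in RULES.items() if not ok(row[col])]

row = {"zip": "28001", "birth_date": date(1985, 3, 2), "country": "ES"}
# validate_row(row) returns [] when every masked value passes
```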



Consistency Across Systems and the Need for Determinism



If the same attribute appears in multiple tables or services, the masked value must be consistent across occurrences to allow for joins, reconciliations, or internal traceability. In these cases, consistent transformations by domain or logical key are prioritized.
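One common way to obtain this determinism is a keyed hash: the same input always produces the same masked output, without any shared state between systems. This is a sketch with an illustrative key and output length; in practice the key would be managed by the masking process and never shipped with the data.

```python
import hashlib
import hmac

# Sketch of deterministic masking: identical inputs yield identical outputs,
# so joins across tables/systems still line up. KEY is illustrative.

KEY = b"masking-key-v1"  # keep outside the dataset in a real setup

def deterministic_mask(value: str, length: int = 12) -> str:
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]

# Same attribute, masked in two different tables, stays joinable:
a = deterministic_mask("client-4711")
b = deterministic_mask("client-4711")
# a == b
```

The trade-off versus stateful tokenization is that determinism here depends entirely on key governance: anyone with the key can reproduce the mapping.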




Most Common Data Masking Techniques and When to Apply Them



Substitution (Replacement) with Dictionaries and Domain Rules



Replaces values with other plausible ones from the same domain (dictionaries, controlled lists, or synthetic generation).


  • Typical Application: Names, addresses, cities, descriptive texts.

  • Critical Point: Coherence between tables (e.g., repeating the same name in all its occurrences) and alignment with languages/countries when the domain requires it.
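The coherence requirement above can be met by choosing the dictionary entry deterministically, so every occurrence of a name maps to the same replacement. The name list is illustrative; real dictionaries should match the locale and domain of the data.

```python
import hashlib

# Sketch: substitution from a controlled dictionary, chosen deterministically
# so every occurrence of a value maps to the same replacement.
# FIRST_NAMES is an illustrative dictionary.

FIRST_NAMES = ["Alma", "Bruno", "Carla", "Dario", "Elena"]

def substitute_name(original: str) -> str:
    h = int(hashlib.sha256(original.encode()).hexdigest(), 16)
    return FIRST_NAMES[h % len(FIRST_NAMES)]

# "María" masks to the same dictionary entry in every table it appears in.
```

Note that hashing into a small dictionary produces collisions by design (several originals share a replacement), which is usually acceptable for descriptive fields but not for unique identifiers.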



Format-Preserving Masking



Maintains the pattern and length; in demanding environments, it must also produce values that pass the system's real validations.


  • Typical Application: Formatted identifiers, phone numbers, zip codes, patterned internal IDs.

  • Critical Point: "Same format" does not imply "value accepted by the application." It is advisable to validate against the same rules used by the system.
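A character-class sketch of the idea: each digit becomes a digit and each letter a letter of the same case, so length and separators survive. As the critical point warns, this does not make checksummed values (IBAN, Luhn-checked card numbers) valid; those need dedicated generation logic. The key is illustrative.

```python
import hashlib
import random
import string

# Sketch of format-preserving masking: digits map to digits, letters to
# letters, separators stay in place. Deterministic via a keyed hash seed.
# Does NOT guarantee checksum validity (IBAN/Luhn need dedicated logic).

def mask_preserving_format(value: str, key: str = "demo-key") -> str:
    seed = hashlib.sha256((key + value).encode()).digest()
    rng = random.Random(seed)  # same input + key -> same output
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators as-is
    return "".join(out)

masked = mask_preserving_format("+34 600-123-456")
# Same pattern: "+NN NNN-NNN-NNN"
```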



Tokenization for Consistency Across Tables and Systems



Replaces values with consistent tokens to preserve joins and internal correlation without exposing the real data.


  • Typical Application: Operational identifiers reused across multiple entities (customer, employee, account).

  • Critical Point: Governance of the tokenization scheme (access, rotation, traceability) and controls over the token generation process.
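A token vault in miniature, assuming an illustrative in-memory dict: the real value maps to an opaque token that is stored once and reused, so every table receives the same token. In practice the vault lives in a governed store with access control, rotation, and audit, not a Python dict.

```python
import secrets

# Sketch of a token vault: real value -> opaque token, stored once and
# reused so every occurrence gets the same token. The in-memory dict is
# illustrative; real vaults need access control and traceability.

_vault = {}

def tokenize(value: str) -> str:
    if value not in _vault:
        _vault[value] = "tok_" + secrets.token_hex(8)
    return _vault[value]

# The same account number tokenizes identically everywhere:
t1 = tokenize("ES91-2100-0418-4502")
t2 = tokenize("ES91-2100-0418-4502")
# t1 == t2, and the real value never appears in the masked dataset
```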



Permutation (Shuffling) to Preserve Distribution



Reorders values within a set to maintain distribution without preserving the original mapping.


  • Typical Application: Categories, segments, and attributes where global distribution matters.

  • Critical Point: Small domains or correlated attributes (risk of inference if combined with other columns).
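The mechanics are simple to sketch: collect the column's values, shuffle them, and write them back, so the global distribution is identical while the row-level mapping is lost. The fixed seed below makes the run reproducible for testing; in production a fixed public seed would weaken the masking.

```python
import random

# Sketch of permutation: values are shuffled within the column, preserving
# the overall distribution but not the row-level mapping. The fixed seed
# is for reproducibility in this example only.

def shuffle_column(rows, column, seed=42):
    values = [r[column] for r in rows]
    random.Random(seed).shuffle(values)
    for r, v in zip(rows, values):
        r[column] = v

rows = [{"segment": "A"}, {"segment": "A"}, {"segment": "B"}, {"segment": "C"}]
shuffle_column(rows, "segment")
# The multiset {A, A, B, C} is preserved; the row assignment is not.
```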



Controlled Variation in Dates and Numerical Values



Applies controlled perturbations (offsets/ranges) while preserving properties like relative order or boundaries.


  • Typical Application: Dates, amounts, metrics where plausibility is needed without exact precision.

  • Critical Point: Impact on aggregations, thresholds, and business rules; tolerance must be defined per use case.
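One way to keep relative order intact is to derive a single offset per entity and shift all of that entity's dates by the same amount; intervals between its events are then preserved exactly. The key and the ±90-day window are illustrative assumptions.

```python
import hashlib
from datetime import date, timedelta

# Sketch: one offset per entity (here within +/-90 days, derived from a
# keyed hash) shifts all of that entity's dates equally, so relative order
# and intervals are preserved. KEY and window are illustrative.

KEY = "date-jitter-v1"

def entity_offset(entity_id: str, max_days: int = 90) -> timedelta:
    h = int(hashlib.sha256((KEY + entity_id).encode()).hexdigest(), 16)
    return timedelta(days=h % (2 * max_days + 1) - max_days)

def mask_date(entity_id: str, d: date) -> date:
    return d + entity_offset(entity_id)

signup = mask_date("cust-7", date(2023, 1, 10))
upgrade = mask_date("cust-7", date(2023, 3, 10))
# The 59-day gap between the two events is unchanged.
```

Note the trade-off flagged above: aggregations grouped by month or quarter can still shift across bucket boundaries, so the tolerable window must be agreed per use case.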




How to Preserve Referential Integrity During Masking



Entity-Based Design and Application Order



A maintainable design starts with entities (customer, order, invoice) and their relationships. Generally:


  1. Identify keys and relationships.
  2. Define rules per domain (which column is transformed and how).
  3. Apply transformations while guaranteeing PK/FK consistency.
  4. Execute integrity validations before distributing the dataset.


Strategies for Keys: Surrogate Keys vs. Natural Keys



  • Surrogate Keys (Technical IDs): In many cases, these are tokenized or generated consistently to maintain relationships.

  • Natural Keys (Emails, documents, account numbers): These require treatment based on domain, uniqueness, and compatibility with validations.


The choice is usually dictated by what the system uses for joins and searches, rather than the "PK" label in the schema.



Collision Control and Consistent Generation



When masking unique fields, mechanisms must be established to:


  • Avoid collisions.

  • Control output domains.

  • Maintain consistency across occurrences.


In large datasets, this point often determines the quality of the result: masking without collision control leads to operational errors and rework.
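A minimal collision-control sketch for a unique field, with an illustrative output domain (`@example.com`): try a deterministic candidate first, then append a counter until the value is free.

```python
import hashlib

# Sketch of collision control for unique fields: deterministic candidate
# first, then a numbered fallback until the value is free. The domain and
# format are illustrative.

def mask_email_unique(original: str, used: set) -> str:
    base = "user-" + hashlib.sha256(original.encode()).hexdigest()[:8]
    candidate, n = base, 0
    while candidate + "@example.com" in used:
        n += 1
        candidate = f"{base}-{n}"
    masked = candidate + "@example.com"
    used.add(masked)
    return masked

used = set()
emails = [mask_email_unique(e, used) for e in ["a@x.com", "b@x.com", "a@x.com"]]
# All three outputs are distinct, satisfying the UNIQUE constraint.
```

As written, re-masking the same original yields a new value (the third call falls back to a numbered variant); where per-entity consistency is also required, this must be combined with a mapping or token table as described earlier.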




Technical Validations After Applying Data Masking



  • Integrity Checks (Orphans, Constraints): Checking for FKs and orphans in critical relationships, verifying table/column level constraints, and detecting inconsistencies in bridge tables.

  • Uniqueness and Duplicates in Critical Identifiers: Checking for duplicates in emails/usernames/IDs, unexpected nulls in keys, and inconsistencies in columns used for searches or reconciliation.

  • Format Quality and Expected Ranges: Regex/lengths per domain, plausible temporal ranges, and permitted domains (enumerations, countries, types).


These validations should be automated: they reduce diagnostic costs and prevent the distribution of defective datasets.
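Two of these checks, orphan foreign keys and duplicate unique identifiers, reduce to simple queries. The sketch below uses an in-memory SQLite database with an illustrative two-table schema; the same queries translate directly to any relational engine.

```python
import sqlite3

# Sketch of automated post-masking checks on an illustrative schema:
# orphan FKs and duplicate unique identifiers, each as a simple query.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'u1@example.com'), (2, 'u2@example.com');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);  -- 99 is an orphan
""")

# Orders whose customer_id matches no customer (broken FK after masking):
orphans = con.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE c.id IS NULL
""").fetchall()

# Masked emails that violate uniqueness:
dupes = con.execute("""
    SELECT email, COUNT(*) FROM customers
    GROUP BY email HAVING COUNT(*) > 1
""").fetchall()

# orphans reports order 12; dupes is empty in this dataset
```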




Common Errors in Database Data Masking



Superficial Masking and Small Domains



If the output domain is small or predictable, the risk of inference increases. It is preferable to expand domains and apply coherent rules per entity.



Unforeseen Correlations Between Columns



Masking columns independently can preserve correlations that allow for inference (e.g., a combination of zip code + date + attribute). It is advisable to review correlated columns and apply coordinated strategies.



Execution Drift Due to Lack of Versioning



Without versioned policies and change traceability, it is common for masking to "drift" between executions, creating inconsistencies between environments or breaks in internal integrations.




Conclusion



In databases, the value of data masking is determined by consistency, referential integrity, uniqueness, and validation. The specific technique must be chosen based on the domain and model constraints; the result is considered acceptable only when it meets constraints and preserves mappings between entities according to internal use case requirements.