Copying production data into non-production environments creates avoidable exposure of sensitive information. Yet overly simplified dummy datasets often fail to reproduce real-world edge cases, reducing test coverage and confidence in releases.



In MySQL, effective data masking is not just about hiding PII. Done properly, it must preserve cross-table consistency, data formats, and application behavior so test environments remain reliable. This guide walks through a practical approach to identifying sensitive fields, applying masking techniques that remain consistent across tables, and operationalizing the process for repeatable environment refreshes.




1. Schema Scanning and Sensitive Data Detection



Before applying masking rules, start by establishing a complete picture of your schema and sensitive data surface area:


  • Identify primary keys, foreign keys, constraints, and table relationships.

  • Classify columns containing PII, financial, health, credentials/secrets, or other confidential data.

  • Extract schema definitions and metadata:


  • SHOW CREATE TABLE … for per-table DDL.

  • mysqldump --no-data … for a schema-only dump.

  • information_schema views (e.g., KEY_COLUMN_USAGE, COLUMNS) to detect keys and constraints.
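
As a concrete starting point, the foreign-key relationships can be pulled from information_schema (replace 'app_db' with your schema name):


SELECT
  TABLE_NAME,
  COLUMN_NAME,
  REFERENCED_TABLE_NAME,
  REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = 'app_db'
  AND REFERENCED_TABLE_NAME IS NOT NULL;


Each returned row is one column-level foreign-key reference, which is exactly the dependency information needed later for consistent cross-table masking.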


This baseline allows you to define masking rules by data type and apply transformations while keeping relationships consistent across dependent tables.




2. Applying Masking Techniques



a) Format-Preserving Substitution (Deterministic)



Use deterministic substitutions when you need the same input to always produce the same output (useful for repeatability and for maintaining relationships).



Example (simple substitution for email format):


UPDATE users
SET email = CONCAT(first_name, '@example.com');


For fields that must remain unique or stable across refreshes, use a deterministic function based on a stable identifier (e.g., user ID). For example:



UPDATE users
SET email = CONCAT(SHA2(CONCAT(id, 'static_salt'), 256), '@example.com');
-- Note: SHA2(..., 256) returns 64 hex characters; wrap it in LEFT() if the
-- target column is narrower.


b) Controlled Random Value Generation



Randomization is useful for numeric fields or standardized formats—but ensure the output range and format match real constraints.



UPDATE customers
SET phone = CONCAT('+34 ', 600000000 + FLOOR(RAND() * 100000000));


If you need reproducibility, avoid RAND() and switch to a deterministic approach (e.g., hashing an ID into a valid range).
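
For example, the phone update above can be made deterministic with CRC32 over a stable key (an `id` column is assumed here):


UPDATE customers
SET phone = CONCAT('+34 ', 600000000 + (CRC32(CONCAT(id, 'static_salt')) % 100000000));


The same id and salt always yield the same phone number across refreshes, while the output still lands in the valid +34 6xx range.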



c) Structured Replacement for Sensitive Identifiers



When replacing identifiers, keep format constraints intact. For example, generating a 16-digit value:


UPDATE credit_cards
SET number = CONCAT('4', LPAD(FLOOR(RAND() * 1000000000000000), 15, '0'));
-- A fixed leading '4' plus 15 zero-padded random digits always yields 16 digits.


Note: This preserves length but not necessarily checksum rules (e.g., Luhn). If your application validates checksums, use a Luhn-compliant generator instead.
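
If checksum validity matters, one option is a small stored function that computes the Luhn check digit for a numeric base string and appends it. A sketch (luhn_check_digit is an illustrative name, not a MySQL built-in):


DELIMITER //
CREATE FUNCTION luhn_check_digit(base VARCHAR(32)) RETURNS CHAR(1) DETERMINISTIC
BEGIN
  DECLARE i INT DEFAULT CHAR_LENGTH(base);
  DECLARE s INT DEFAULT 0;
  DECLARE d INT;
  DECLARE dbl BOOLEAN DEFAULT TRUE;  -- rightmost digit of the base is doubled
  WHILE i >= 1 DO
    SET d = CAST(SUBSTRING(base, i, 1) AS UNSIGNED);
    IF dbl THEN
      SET d = d * 2;
      IF d > 9 THEN SET d = d - 9; END IF;
    END IF;
    SET s = s + d;
    SET dbl = NOT dbl;
    SET i = i - 1;
  END WHILE;
  RETURN CAST((10 - (s MOD 10)) MOD 10 AS CHAR);
END//
DELIMITER ;

-- Sanity check with the classic Luhn example:
SELECT luhn_check_digit('7992739871');  -- returns '3' (79927398713 is Luhn-valid)


Generate a random base one digit shorter than the target length, then CONCAT the base with its check digit.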



d) Preserving Relationships Between Tables (Mapping)



For foreign keys and cross-table references, use a mapping table so related records remain consistent.


-- Auxiliary mapping (assumes the client identifier is stored in a string
-- column wide enough to hold a UUID)
CREATE TEMPORARY TABLE id_map AS
SELECT id AS original_id, UUID() AS fake_id
FROM clients;

-- Update the parent table, then every dependent table.
-- With enforced foreign keys, disable checks first (SET FOREIGN_KEY_CHECKS = 0)
-- or define the constraint as ON UPDATE CASCADE.
UPDATE clients c
JOIN id_map m ON c.id = m.original_id
SET c.id = m.fake_id;

UPDATE orders o
JOIN id_map m ON o.client_id = m.original_id
SET o.client_id = m.fake_id;


Best practices for relational masking:



  • Create mappings once, then apply them consistently across all dependent tables.

  • Apply masking in a controlled order (parents first, then children) to avoid constraint issues.

  • Run updates inside a transaction where possible (or take a backup/snapshot first).
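
The ordering and transaction advice above can be combined into a wrapper around the masking statements; note that SET FOREIGN_KEY_CHECKS is a session setting and is not rolled back with the transaction:


SET FOREIGN_KEY_CHECKS = 0;
START TRANSACTION;

-- mapping-based UPDATE statements for parent and child tables go here

COMMIT;
SET FOREIGN_KEY_CHECKS = 1;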




3. Workflow Automation



In CI/CD and environment refresh workflows, masking should be repeatable, versioned, and easy to run:


  • Integrate masking steps into GitLab CI, Jenkins, or GitHub Actions.

  • Keep masking rules in external configuration (per environment) to avoid hardcoding.

  • Track schema versions and masking execution parameters so refreshes are reproducible.

  • Use a dedicated tool (e.g., Gigantics) when you need governance, consistency, and centralized rule management at scale.



4. Traceability and Regulatory Requirements



A mature masking program should be able to answer, at minimum:


  • Which fields were modified?

  • What method/rule was applied to each field?

  • When did the job run, and against which schema version?


This helps support privacy and security controls in non-production environments, including the traceability expectations found in regulations and directives such as the GDPR and NIS2.
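
One lightweight way to capture these answers is an audit table written to by the masking job itself. A minimal sketch (table and column names are illustrative):


CREATE TABLE IF NOT EXISTS masking_audit (
  run_id         CHAR(36)     NOT NULL,
  table_name     VARCHAR(64)  NOT NULL,
  column_name    VARCHAR(64)  NOT NULL,
  rule_applied   VARCHAR(128) NOT NULL,
  schema_version VARCHAR(32)  NOT NULL,
  executed_at    DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Recorded once per masked column, per run
INSERT INTO masking_audit (run_id, table_name, column_name, rule_applied, schema_version)
VALUES (UUID(), 'users', 'email', 'deterministic_sha2_substitution', '2024-05');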




Common Errors When Masking Data in MySQL



  • Breaking Relationships Between Tables: Masking without consistent mappings across related tables leads to test failures and unreliable environments.

  • Not Preserving Business Rules: Ignoring formats, constraints, encodings, and validations can create failures unrelated to real defects.

  • Lack of Traceability: Not recording what was masked, how, and when increases operational risk and complicates compliance.

  • Inconsistent Results Across Runs: Ad-hoc scripts often produce different outputs each execution, hurting reproducibility and debugging.

  • Skipping Data Discovery: Masking without prior identification/classification leaves blind spots and can result in unmasked sensitive fields.




How Gigantics Solves This



At Gigantics, we work directly on your MySQL databases to:


  • Scan the complete schema and detect sensitive fields.

  • Classify them with configurable labels (PII, confidential, etc.).

  • Apply masking rules without breaking relationships or structures.

  • Generate automatic audits for traceability and compliance with GDPR, NIS2, and other regulations.