Masking sensitive data is essential whenever you move structured files like CSV and JSON outside of secure production environments. Whether it’s for development, QA, analytics, or sharing datasets with third-party partners, these files often contain Personally Identifiable Information (PII) that, if left unprotected, exposes your company to compliance risks (GDPR, SOC 2) and security breaches.
In this guide, you’ll learn how to apply format-preserving masking techniques using Python, so your data remains functional for testing and analysis while sensitive values like names, emails, and credit card numbers stay hidden.
Masking Sensitive Data in CSV Files
CSV files are the standard for data portability. However, they are often handled on local machines or shared via Slack or email for troubleshooting, both of which are highly exposed environments. Masking must modify the content without breaking the column structure or delimiters.
Common targets
- name, email, phone_number
- ssn, credit_card, iban
- address, zip_code, internal_id
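For targets such as phone_number, credit_card, or iban, the layout of the value often matters to downstream tools. The sketch below illustrates one way to keep that layout: every digit is swapped for a random digit and every letter for a random letter, while separators and length stay untouched. The function name `mask_preserving_format` and its `keep_last` parameter are hypothetical, introduced only for this illustration.

```python
import random
import string


def mask_preserving_format(value, keep_last=0, rng=random):
    """Replace letters and digits with random ones, keeping punctuation and length.

    keep_last leaves that many trailing characters unchanged (an assumption
    for this sketch, e.g. to keep the last four digits of a card number).
    """
    text = str(value)
    cutoff = len(text) - keep_last
    masked = []
    for i, ch in enumerate(text):
        if i >= cutoff:
            masked.append(ch)  # trailing characters the caller asked to keep
        elif ch.isdigit():
            masked.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            masked.append(rng.choice(pool))
        else:
            masked.append(ch)  # separators such as "-", " " or "+" are preserved
    return "".join(masked)


# Output varies per run because the replacement characters are random
print(mask_preserving_format("555-867-5309"))
print(mask_preserving_format("4111-1111-1111-1234", keep_last=4))
```

Because the replacement is random, a character can occasionally come out unchanged, and the same input maps to different outputs on each run. If joins across files must still work after masking, a deterministic mapping would be needed instead.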
Techniques
The following examples are Python snippets that use pandas and the standard random module. They can be adapted to other data structures and formats.
Substitution
df['name'] = ['User_' + str(i) for i in df.index]
Truncation
# Cast to str first so the slice also works when pandas parsed the column as numbers
df['credit_card'] = 'XXXX-XXXX-XXXX-' + df['credit_card'].astype(str).str[-4:]
Randomization
df['zip_code'] = [random.randint(10000, 99999) for _ in df.index]
Full CSV masking script
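import pandas as pd
import random

# Read the SSN column as text so slicing works and leading zeros survive
df = pd.read_csv("input.csv", dtype={"ssn": str})
# Substitution: replace real names with sequential placeholders
df['name'] = ['User_' + str(i) for i in df.index]
# Truncation: keep only the last four characters of the SSN
df['ssn'] = "***-**-" + df['ssn'].str[-4:]
# Randomization: replace each ZIP code with a random five-digit number
df['zip_code'] = [random.randint(10000, 99999) for _ in df.index]
# index=False avoids adding an extra index column to the output
df.to_csv("masked_output.csv", index=False)
Always verify that: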
- Headers and delimiters are preserved
- Masked file remains readable by your tools
- Output doesn’t break any downstream processing logic
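The first two checks above can be automated. The following is a minimal sketch, assuming a comma-delimited file with a header row and a file small enough to load into memory. The helper name `check_csv_structure` is hypothetical. It compares the header, the row count, and the number of fields per row between the original and the masked file.

```python
import csv


def check_csv_structure(original_path, masked_path, delimiter=","):
    """Return a list of problems found; an empty list means the shapes match."""
    with open(original_path, newline="") as f_orig, open(masked_path, newline="") as f_mask:
        orig_rows = list(csv.reader(f_orig, delimiter=delimiter))
        mask_rows = list(csv.reader(f_mask, delimiter=delimiter))

    if not orig_rows or not mask_rows:
        return ["one of the files is empty"]

    problems = []
    if orig_rows[0] != mask_rows[0]:
        problems.append("header row differs")
    if len(orig_rows) != len(mask_rows):
        problems.append(f"row count differs: {len(orig_rows)} vs {len(mask_rows)}")

    # Every masked data row should have as many fields as the original header
    width = len(orig_rows[0])
    for row_no, row in enumerate(mask_rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {row_no} has {len(row)} fields, expected {width}")
    return problems
```

Calling `check_csv_structure("input.csv", "masked_output.csv")` after running the masking script should return an empty list. It does not replace a test run through the actual downstream tools, which is still needed for the third check.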
Masking Sensitive Data in JSON Files
JSON is the backbone of modern APIs and NoSQL databases. Masking nested JSON structures is critical for staging environments and API integration tests, where developers need realistic payloads without seeing real customer tokens or secrets.
Typical keys to mask:
- email, ssn, card.number
- user_id, auth_token, address.zip
- Any nested field containing sensitive or financial data
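Keys such as card.number and address.zip sit below the top level of a record, so a flat loop over record keys will not reach them. A minimal sketch of one way to handle this is a recursive walk that builds a dotted path for each key and applies a masking function when the path matches. The `MASKERS` table and the `mask_nested` helper are hypothetical names for this illustration, and the paths are assumed to start at the root of each record.

```python
import random

# Hypothetical table: dotted key path (from the record root) -> masking function
MASKERS = {
    "email": lambda v: "user_" + str(random.randint(1000, 9999)) + "@demo.com",
    "ssn": lambda v: "***-**-" + str(v)[-4:],
    "card.number": lambda v: "XXXX-XXXX-XXXX-" + str(v)[-4:],
    "auth_token": lambda v: None,
    "address.zip": lambda v: str(random.randint(10000, 99999)),
}


def mask_nested(node, maskers, path=""):
    """Mask values in place; items of a list share the path of the list itself."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if child_path in maskers:
                # Reassigning an existing key is safe while iterating over the dict
                node[key] = maskers[child_path](value)
            else:
                mask_nested(value, maskers, child_path)
    elif isinstance(node, list):
        for item in node:
            mask_nested(item, maskers, path)
    return node


record = {
    "email": "jane@corp.example",
    "card": {"number": "4111-1111-1111-1234"},
    "address": {"zip": "90210", "city": "Beverly Hills"},
}
print(mask_nested(record, MASKERS))
```

Fields that are not listed in the table, such as address.city here, pass through unchanged, so the table needs to cover every sensitive path in the payload.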
Techniques
Email (randomized)
def mask_email(email):
    return "user_" + str(random.randint(1000,9999)) + "@demo.com"
SSN (truncated)
record['ssn'] = "***-**-" + record['ssn'][-4:]
Token (nulling)
record['accessToken'] = None
Full JSON masking script
import json
import random

def mask_email(email):
    # The original address is ignored; a random placeholder is returned instead
    return "user_" + str(random.randint(1000,9999)) + "@demo.com"

with open("input.json", "r") as f:
    data = json.load(f)

# Expects the file to contain a top-level list of record objects
for record in data:
    if 'email' in record:
        record['email'] = mask_email(record['email'])
    if 'ssn' in record:
        record['ssn'] = "***-**-" + record['ssn'][-4:]
    if 'accessToken' in record:
        # Keep the key but null the value so required fields are not removed
        record['accessToken'] = None

with open("masked_output.json", "w") as f:
    json.dump(data, f, indent=2)
Ensure:
- The JSON structure remains valid
- No required fields are removed
- Output passes schema validation or test parsing
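The first two points can be checked with a short script. The sketch below is an assumption-laden illustration, not a substitute for real schema validation. It parses both files, which fails loudly if the masked output is not valid JSON, and then reports any key path that exists in the original but is missing from the masked file. The helper names `key_paths` and `check_json_structure` are hypothetical.

```python
import json


def key_paths(node, path=""):
    """Collect dotted key paths in a JSON value; list items are marked with []."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            paths.add(child)
            paths |= key_paths(value, child)
    elif isinstance(node, list):
        for item in node:
            paths |= key_paths(item, path + "[]")
    return paths


def check_json_structure(original_path, masked_path):
    """Return key paths present in the original but missing from the masked file."""
    with open(original_path) as f:
        original = json.load(f)
    with open(masked_path) as f:
        # json.load raises json.JSONDecodeError if the masked file is not valid JSON
        masked = json.load(f)
    return sorted(key_paths(original) - key_paths(masked))
```

Running `check_json_structure("input.json", "masked_output.json")` should return an empty list when masking only changed values. A nulled field such as accessToken still counts as present, since the key is kept.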

