You can't protect data you can't find. Before masking, anonymizing, or applying any security policy, an organization has to answer a prior question: where exactly does sensitive data live across its systems? Data discovery answers it. It scans databases and data stores, locates personal and sensitive information, and maps where it sits so it can be protected.
It's the first stage of any data security and compliance program: every control that follows depends on knowing what data you hold and where.
What is data discovery?
Data discovery is the process of identifying and locating sensitive information spread across an organization's data sources: SQL and NoSQL databases, production and non-production environments, backups, and exports. When the process is automated, an engine inspects schemas and data samples, recognizes the fields that hold personal information, and records their location, without a column-by-column manual review.
The distinction from manual work is one of feasibility rather than convenience. A mid-sized company holds thousands of columns across dozens of systems; reviewing them by hand doesn't scale, and the inventory goes stale the moment a schema changes. Automation turns it into a repeatable process.
The term also names a branch of business intelligence focused on exploring data for insights. In a security and privacy context, it means locating sensitive and personal data in order to protect it.
Data discovery vs. data classification
These are two linked stages that are easy to conflate. Discovery answers where sensitive data is: it locates the fields that contain personal data. Classification answers what it is and how protected it must be: it tags those fields and assigns sensitivity levels that drive how they are handled.
In practice they run in sequence: the data is located first and then labeled by sensitivity. How those labels and levels are assigned depends on the classification tool in use.
How data discovery works
Effective data discovery doesn't rely on column names alone, because they are often unreliable. To locate personally identifiable information (PII), the tools in this category combine several detection techniques.
The most basic layer reads metadata: column names and data types. Because a column called field_07 can hold a national ID, that analysis is paired with content inspection — samples of real values matched against regular expressions and dictionaries of known formats, such as the pattern of an IBAN or an email address. More advanced tools add machine learning models that recognize patterns a fixed rule misses, and express each finding as a probability rather than a yes-or-no answer.
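As a rough illustration of how the metadata and content layers combine, here is a minimal Python sketch. The hint table, the two patterns, and the `detect_column` function are assumptions made for this example, not the rule set of any particular tool, and the IBAN pattern checks only the general shape of the value, not its checksum.

```python
import re

# Hypothetical name hints and value patterns; real tools ship far larger rule sets.
NAME_HINTS = {"mail": "email", "iban": "iban", "account": "iban"}
VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Shape only: country code, two check digits, 11 to 30 alphanumerics.
    "iban": re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}"),
}

def detect_column(name, samples):
    """Combine a column-name hint with the share of sample values matching each pattern."""
    # Drop empty samples and strip spaces so grouped IBANs still match.
    values = [str(v).replace(" ", "") for v in samples if v not in (None, "")]
    result = {}
    for label, pattern in VALUE_PATTERNS.items():
        matched = sum(1 for v in values if pattern.fullmatch(v))
        ratio = matched / len(values) if values else 0.0
        hinted = any(
            hint in name.lower()
            for hint, hinted_label in NAME_HINTS.items()
            if hinted_label == label
        )
        if ratio > 0 or hinted:
            result[label] = {"match_ratio": ratio, "name_hint": hinted}
    return result

# A column with an uninformative name is still flagged through its contents.
report = detect_column("field_07", ["ana@example.com", "bob@example.org", None, "n/a"])
```

The match ratio here is a simple proportion of matching samples. It stands in for the confidence a real engine would report and is not a calibrated probability.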
A risk score is then applied to those findings, which makes prioritization possible: a column holding health data is not comparable to one holding an internal code. The output is an inventory of where sensitive information sits and how critical each location is.
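One simple way to picture that scoring step is to weight each finding's category by its detection confidence. The sketch below assumes a made-up weight table and a made-up formula; actual products use their own scoring models.

```python
# Hypothetical category weights on a 0-1 scale; assumptions for illustration only.
CATEGORY_WEIGHT = {
    "health": 1.0,
    "national_id": 0.9,
    "email": 0.6,
    "internal_code": 0.1,
}

def risk_score(category, confidence):
    """Category weight scaled by detection confidence (both between 0 and 1)."""
    return round(CATEGORY_WEIGHT.get(category, 0.0) * confidence, 3)

def prioritize(findings):
    """Attach a risk score to each finding and sort from highest to lowest risk."""
    scored = [
        {**f, "risk": risk_score(f["category"], f["confidence"])} for f in findings
    ]
    return sorted(scored, key=lambda f: f["risk"], reverse=True)

findings = [
    {"column": "orders.ref", "category": "internal_code", "confidence": 0.99},
    {"column": "patients.diagnosis", "category": "health", "confidence": 0.8},
]
ranked = prioritize(findings)
```

With these assumed weights, the health column ranks first even though the internal code was detected with higher confidence. That is the kind of prioritization the inventory is meant to support.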
What sensitive data discovery finds
A discovery engine looks beyond the obvious and identifies several layers of sensitive information:
- Direct identifiers (name, national ID, email, phone, IBAN) that point to a person on their own.
- Quasi-identifiers (postal code, date of birth, occupation) that don't identify anyone in isolation but do when combined. They are a frequent cause of re-identification and the ones a manual review tends to miss.
- Special categories under GDPR (health, biometric, or ethnic-origin data) that require reinforced protection and, therefore, reliable detection.
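The effect of combining quasi-identifiers can be measured directly. The following sketch, with hypothetical column names and toy rows, computes the share of records that become unique once several columns are taken together.

```python
from collections import Counter

def unique_share(rows, columns):
    """Share of rows whose combination of values in `columns` appears exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)

# Toy data; the column names are assumptions for this example.
rows = [
    {"postal_code": "28001", "birth_year": 1985, "occupation": "nurse"},
    {"postal_code": "28001", "birth_year": 1985, "occupation": "teacher"},
    {"postal_code": "28001", "birth_year": 1990, "occupation": "nurse"},
    {"postal_code": "08001", "birth_year": 1985, "occupation": "nurse"},
]

alone = unique_share(rows, ["postal_code"])
combined = unique_share(rows, ["postal_code", "birth_year", "occupation"])
```

In this toy set, postal code alone singles out one row in four, while the three columns together single out every row.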
Sensitive data also lives outside neatly named columns. It shows up in free-text fields, comments, and attached documents, where only pattern-based detection — unstructured data discovery — can surface it.
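For free text, the same kind of pattern is searched inside the value instead of matched against the whole field. This is a minimal sketch with two assumed patterns; the `scan_text` helper is hypothetical and returns offsets, so a report can point at the finding without copying the sensitive value itself.

```python
import re

# Assumed patterns for illustration; the IBAN pattern checks shape only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def scan_text(text):
    """Return (label, start, end) for every pattern hit, ordered by position."""
    hits = [
        (label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda hit: hit[1])

comment = "Customer asked to reply at ana@example.com, refund to ES9121000418450200051332."
hits = scan_text(comment)
```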
Why data discovery matters for compliance
GDPR requires organizations to know what personal data they process and where. That knowledge underpins the record of processing activities (Article 30) and the accountability principle. Without a reliable inventory, that record rests on guesswork; data discovery turns it into an accurate, maintained map. The same need drives CCPA and HIPAA programs, which also depend on knowing where regulated data resides.
The impact is concrete. After a breach, knowing what data the affected system held is what makes accurate, on-time notification possible under GDPR and the NIS2 directive. And that inventory shouldn't stop at production: non-production environments accumulate copies that are rarely tracked and become a blind spot. Running discovery there periodically keeps personal data from spreading out of control.
What to look for in data discovery tools
Data discovery tools don't all perform the same, and a few criteria separate them. The first is coverage: a solid solution scans SQL and NoSQL databases, structured and unstructured data, and cloud as well as on-premises stores, because sensitive information lives across all of them. The second is the detection method. Tools that only look for columns named "email" or "ssn" miss anything with non-standard naming; pattern- and dictionary-based detection, with a confidence score, is what actually locates relevant data.
Beyond that, the engine should score the risk of each field to prioritize the critical ones, and run on a schedule rather than once, since schemas evolve and a static inventory loses value. Finally, when handling sensitive data, the analysis should run inside the organization's own infrastructure, without sending data to third parties, and record every finding for audit purposes.
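A scheduled run does not have to rescan everything. One possible approach, sketched here under the assumption that each run stores a schema snapshot as a nested dictionary, is to compare the latest snapshot with the previous one and rescan only what changed. The `schema_drift` function and the snapshot layout are hypothetical.

```python
def schema_drift(previous, current):
    """Compare two {table: {column: type}} snapshots and report column-level changes."""
    added, changed, removed = [], [], []
    for table, columns in current.items():
        old_columns = previous.get(table, {})
        for column, col_type in columns.items():
            if column not in old_columns:
                added.append((table, column))
            elif old_columns[column] != col_type:
                changed.append((table, column))
    for table, columns in previous.items():
        for column in columns:
            if column not in current.get(table, {}):
                removed.append((table, column))
    return {"added": added, "changed": changed, "removed": removed}

previous = {"users": {"id": "int", "email": "varchar"}}
current = {
    "users": {"id": "int", "email": "text", "notes": "text"},
    "exports": {"payload": "json"},
}
drift = schema_drift(previous, current)
```

Added and changed columns are candidates for a new scan, and removed ones can be retired from the inventory.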
How Gigantics does it
Gigantics runs data discovery automatically and local-first. It connects to your sources, examines column names, types, and sample values, and applies machine learning models alongside dictionaries to identify PII. Each field gets labels and a sensitivity level with a confidence score, plus a risk heat map of the schema.
The process includes a confirmation step to review detections and produces audit-ready reports. From there, those same labels feed the rules for anonymization and synthetic data generation, so discovery and protection run as a single flow.

