

Data Discovery: How to Find Sensitive Data Across Your Systems

What data discovery is, how it locates sensitive and personal data across your systems, and what to look for in data discovery tools.


Sara Codarlupo

Marketing Specialist @Gigantics

You can't protect data you can't find. Before masking, anonymizing, or applying any security policy, an organization has to answer a prior question: where exactly does sensitive data live across its systems? Data discovery answers it. It scans databases and data stores, locates personal and sensitive information, and maps where it sits so it can be protected.



It's the first stage of any data security and compliance program: every control that follows depends on knowing what data you hold and where.




What is data discovery?



Data discovery is the process of identifying and locating sensitive information spread across an organization's data sources: SQL and NoSQL databases, production and non-production environments, backups, and exports. When the process is automated, an engine inspects schemas and data samples, recognizes the fields that hold personal information, and records their location, without a column-by-column manual review.
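
As a rough illustration of what "inspecting schemas and data samples" can mean, here is a minimal sketch using Python's built-in sqlite3 module. The sample_columns helper, the customers table, and the field_07 column are hypothetical names chosen for the example; a real engine would do the same against each supported source and would sample more carefully.

```python
import sqlite3

def sample_columns(conn, sample_size=5):
    """Return {(table, column): [sample values]} for every user table.

    Hypothetical helper: identifiers come from the database catalog itself,
    so they are interpolated as quoted names rather than bound parameters.
    """
    samples = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = conn.execute(
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?',
                (sample_size,),
            )
            samples[(table, column)] = [r[0] for r in rows]
    return samples

# Illustrative in-memory database with an uninformative column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, field_07 TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ana@example.com"), (2, "luis@example.org")],
)
inventory = sample_columns(conn)
```

The result maps every column to a handful of real values, which is the raw material the detection step works on.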



The distinction from manual work is one of feasibility rather than convenience. A mid-sized company holds thousands of columns across dozens of systems; reviewing them by hand doesn't scale, and the inventory goes stale the moment a schema changes. Automation turns it into a repeatable process.



The term also names a branch of business intelligence focused on exploring data for insights. In a security and privacy context, it means locating sensitive and personal data in order to protect it.




Data discovery vs. data classification



These are two linked stages that are easy to conflate. Discovery answers where sensitive data is: it locates the fields that contain personal data. Classification answers what it is and how strongly it must be protected: it tags those fields and assigns sensitivity levels that drive how they are handled.



In practice they run in sequence: the data is located first, then labeled by sensitivity. How those labels and levels are assigned depends on the classification tool in use.
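
One way to picture the split is as two data structures: a discovery finding that records a location, and a classification policy that turns the detected type into a sensitivity level. The sketch below is only an illustration; the Finding class, the level names, and the SENSITIVITY mapping are assumptions, not any particular tool's model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """Discovery output: where a sensitive field is (hypothetical structure)."""
    table: str
    column: str
    detected_type: str

# Hypothetical classification policy: detected type -> sensitivity level.
SENSITIVITY = {
    "health": "restricted",
    "national_id": "confidential",
    "email": "confidential",
}

def classify(finding, default="internal"):
    """Classification step: label an already-located field by sensitivity."""
    return SENSITIVITY.get(finding.detected_type, default)

finding = Finding(table="patients", column="dx_code", detected_type="health")
level = classify(finding)
```

Discovery produces the Finding; classification only decides what level it gets, which is why the two stages can be run and tuned separately.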




How data discovery works



Effective data discovery doesn't rely on column names alone, which are often unreliable. To locate personal data (PII), the tools in this category combine several detection techniques.



The most basic layer reads metadata: column names and data types. Because a column called field_07 can hold a national ID, that analysis is paired with content inspection — samples of real values matched against regular expressions and dictionaries of known formats, such as the pattern of an IBAN or an email address. More advanced tools add machine learning models that recognize patterns a fixed rule misses, and express each finding as a probability rather than a yes-or-no answer.
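
The metadata and content-inspection layers can be sketched in a few lines. This is a simplified example under stated assumptions: the regular expressions are deliberately loose, the name hints and the 0.1 boost are arbitrary illustrative choices, and the score is a plain match ratio rather than the calibrated probability a machine learning model would produce. The IBAN check uses the standard mod-97 checksum.

```python
import re

# Deliberately simple patterns for illustration; production rules are stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IBAN_RE = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")

def is_iban(value):
    """Format check plus the mod-97 checksum used by IBANs."""
    compact = value.replace(" ", "").upper()
    if not IBAN_RE.match(compact):
        return False
    # Move the country code and check digits to the end, map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

DETECTORS = {"email": lambda v: bool(EMAIL_RE.match(v)), "iban": is_iban}
NAME_HINTS = {"email": ("mail",), "iban": ("iban", "account")}  # assumed hints

def detect(column_name, samples):
    """Return {type: score in [0, 1]} from name hints and sample values."""
    values = [str(v).strip() for v in samples if v is not None]
    results = {}
    for kind, matches in DETECTORS.items():
        ratio = sum(matches(v) for v in values) / len(values) if values else 0.0
        hinted = any(h in column_name.lower() for h in NAME_HINTS[kind])
        # Assumed weighting: content dominates, the name adds a small boost.
        score = min(1.0, ratio + (0.1 if hinted else 0.0))
        if score > 0:
            results[kind] = round(score, 2)
    return results

found = detect("field_07", [
    "GB82 WEST 1234 5698 7654 32",
    "DE89 3704 0044 0532 0130 00",
    "n/a",
])
```

Here a column named field_07 is flagged as holding IBANs because two of its three sample values pass the check, even though the name gives nothing away.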



A risk score is then applied to those findings, which makes prioritization possible: a column holding health data is not comparable to one holding an internal code. The output is an inventory of where sensitive information sits and how critical each location is.
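
A risk score can be as simple as a category weight multiplied by the detection confidence. The weights, table names, and formula below are assumptions made for the example; real tools use their own scoring models.

```python
# Hypothetical weights: how critical each category is, from 0 to 1.
CATEGORY_WEIGHT = {
    "health": 1.0,
    "national_id": 0.9,
    "iban": 0.8,
    "email": 0.5,
    "internal_code": 0.1,
}

def risk_score(category, confidence):
    """Assumed formula: category weight times detection confidence."""
    return round(CATEGORY_WEIGHT.get(category, 0.0) * confidence, 3)

# (table, column, detected category, confidence), all illustrative.
findings = [
    ("orders", "ref_code", "internal_code", 0.95),
    ("patients", "notes", "health", 0.70),
    ("users", "mail", "email", 0.90),
]

# Highest risk first, so the critical locations are handled first.
ranked = sorted(
    ((table, column, risk_score(category, confidence))
     for table, column, category, confidence in findings),
    key=lambda item: item[2],
    reverse=True,
)
```

With these weights, a health column detected at 70% confidence still ranks above an internal code detected at 95%, which is the prioritization the inventory is meant to support.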




What sensitive data discovery finds



A discovery engine looks beyond the obvious and identifies several layers of sensitive information:


  • Direct identifiers (name, national ID, email, phone, IBAN) that point to a person on their own.

  • Quasi-identifiers (postal code, date of birth, occupation) that don't identify in isolation but do when combined. They are the main cause of re-identification and the ones a manual review tends to miss.

  • Special categories under GDPR (health, biometric, or ethnic-origin data) that require reinforced protection and, therefore, reliable detection.
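
To see why quasi-identifiers matter, it helps to measure how many records become unique as columns are combined. The sketch below uses made-up rows and a hypothetical uniqueness helper, with birth year standing in for date of birth to keep the data small.

```python
from collections import Counter

def uniqueness(rows, quasi_identifiers):
    """Fraction of rows that are unique on the given combination of columns."""
    keys = [tuple(row[c] for c in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows) if rows else 0.0

# Illustrative records; no real data.
rows = [
    {"postal_code": "28001", "birth_year": 1980, "occupation": "nurse"},
    {"postal_code": "28001", "birth_year": 1980, "occupation": "teacher"},
    {"postal_code": "28001", "birth_year": 1991, "occupation": "nurse"},
    {"postal_code": "08002", "birth_year": 1991, "occupation": "nurse"},
]

one = uniqueness(rows, ["postal_code"])
two = uniqueness(rows, ["postal_code", "birth_year"])
three = uniqueness(rows, ["postal_code", "birth_year", "occupation"])
```

In this toy dataset, postal code alone singles out a quarter of the rows, adding birth year raises that to half, and all three columns together make every row unique.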



Sensitive data also lives outside neatly named columns. It shows up in free-text fields, comments, and attached documents, where only pattern-based detection — unstructured data discovery — can surface it.
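
For free text, the same patterns are searched anywhere inside the value instead of being matched against the whole field. A minimal sketch, assuming two simplified patterns (an email and an international phone number written in groups of three digits) and a hypothetical scan_text helper:

```python
import re

# Simplified, illustrative patterns; real detectors cover many more formats.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+\d{1,3}(?:[ -]?\d{3}){3}"),
}

def scan_text(text):
    """Return (kind, value, start offset) for each hit, in order of appearance."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])

comment = "Customer called from +34 600 123 456, reply to maria.g@example.com."
hits = scan_text(comment)
```

Recording the offset along with the value lets a later masking step replace only the sensitive span and leave the rest of the comment intact.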




Why data discovery matters for compliance



GDPR requires organizations to know what personal data they process and where. That knowledge underpins the record of processing activities (Article 30) and the accountability principle. Without a reliable inventory, that record rests on guesswork; data discovery turns it into an accurate, maintained map. The same need drives CCPA and HIPAA programs, which equally depend on knowing where regulated data resides.



The impact is concrete. After a breach, knowing what data the affected system held is what makes accurate, on-time notification possible under GDPR and the NIS2 directive. The inventory also shouldn't stop at production: non-production environments accumulate copies that are rarely tracked and become a blind spot. Running discovery there periodically keeps personal data from spreading out of control.




What to look for in data discovery tools



Data discovery tools don't all perform the same, and a few criteria separate them. The first is coverage: a solid solution scans SQL and NoSQL databases, structured and unstructured data, and cloud as well as on-premises stores, because sensitive information lives across all of them. The second is the detection method. Tools that only look for columns named "email" or "ssn" miss anything with non-standard naming; pattern- and dictionary-based detection, with a confidence score, is what actually locates relevant data.



Beyond that, the engine should score the risk of each field to prioritize the critical ones, and run on a schedule rather than once, since schemas evolve and a static inventory loses value. Finally, when handling sensitive data, the analysis should run inside the organization's own infrastructure, without sending data to third parties, and record every finding for audit purposes.
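
The reason a one-off scan loses value can be shown with a simple schema comparison between two runs. This is a sketch under assumptions: each snapshot is a plain dictionary mapping (table, column) to a declared type, and the helper name is hypothetical.

```python
def new_or_changed_columns(previous, current):
    """Columns that are new, or whose type changed, since the last snapshot."""
    return sorted(col for col, typ in current.items() if previous.get(col) != typ)

# Illustrative snapshots taken by two scheduled runs.
last_week = {
    ("users", "id"): "INTEGER",
    ("users", "email"): "TEXT",
}
today = {
    ("users", "id"): "INTEGER",
    ("users", "email"): "TEXT",
    ("users", "field_07"): "TEXT",       # added since the last run
    ("orders", "customer_ref"): "TEXT",  # new table
}

to_rescan = new_or_changed_columns(last_week, today)
```

Only the two new columns need fresh inspection here. A scheduled run can use a comparison like this to focus on what changed, while still rescanning everything from time to time, because the contents of existing columns can change too.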




How Gigantics does it



Gigantics runs data discovery automatically and local-first. It connects to your sources, examines column names, types, and sample values, and applies machine learning models alongside dictionaries to identify PII. Each field gets labels and a sensitivity level with a confidence score, plus a risk heat map of the schema.



The process includes a confirmation step to review detections and produces audit-ready reports. From there, those same labels feed the rules for anonymization and synthetic data generation, so discovery and protection run as a single flow.


Turn data discovery into a continuous process.

Gigantics scans your SQL and NoSQL databases, identifies PII with AI, and scores its risk, without the data leaving your infrastructure. The starting point for masking, anonymizing, or synthesizing with confidence.


AI-driven discovery • SQL & NoSQL • On-premise or in your VPC • Audit-ready evidence



Frequently asked questions



What is data discovery?



It's the process of scanning an organization's data sources to automatically locate and identify the fields that contain sensitive or personal information, without manual review. In a data security context, its goal is to find that data so it can be protected.



What is the difference between data discovery and data classification?



Discovery locates where sensitive data is; classification labels it and assigns a sensitivity level. Location comes first, categorization second.



Why does data discovery matter for GDPR?



Because you can't protect or inventory what you haven't located. Discovery provides the inventory of personal data that the record of processing activities (Article 30), breach response, and audits all rely on.



What data does data discovery find?



Direct identifiers (name, national ID, phone, email), quasi-identifiers (fields such as postal code or date of birth that allow re-identification when combined), and special categories under GDPR such as health or biometric data. A solid solution covers structured and unstructured data across SQL and NoSQL sources.



Can data discovery be done manually?



It's viable in small databases, but it doesn't scale: it's slow and the inventory goes stale as soon as the schema changes. Automation makes it a repeatable process.