Configure Sensitive Data Discovery (Public Preview)
Note
Earlier documentation refers to identifiers as classifiers. The term is being updated to identifier to be more accurate and to avoid confusion with the Immuta data classification and frameworks feature.
Sensitive data discovery (SDD) comprises two major elements: identifiers and templates.
Identifier
The identifier is the basic building block of SDD. Essentially, an identifier includes a pattern (e.g., a regex or a list of values) and a list of tags to apply to data that matches that pattern. For example, if a column sample matches a regex defined in an identifier, then all the tags in that identifier will be applied to that column.
SDD applies identifiers to samples of data to assess the likelihood that columns contain data that fits the pattern specified in the identifier (such as a social security number). SDD then scores columns by the percentage of values that match the pattern defined. This score determines whether or not the configured tags will be applied to a column.
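The scoring step described above can be sketched in a few lines of Python. This is only an illustration, not Immuta's implementation: the function names, the `Example.SSN` tag, the use of a full-string match, and the expression of the threshold as a fraction between 0 and 1 are all assumptions made for the example.

```python
import re

def match_score(values, pattern):
    """Return the fraction of sampled values that fully match the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    if not values:
        return 0.0
    hits = sum(1 for v in values if regex.fullmatch(str(v)))
    return hits / len(values)

def tags_for_column(values, pattern, tags, min_confidence):
    """Apply every tag of the identifier only if the score meets the threshold."""
    if match_score(values, pattern) >= min_confidence:
        return list(tags)
    return []

# Hypothetical column sample: three of four values look like an SSN.
sample = ["123-45-6789", "987-65-4321", "n/a", "111-22-3333"]
ssn_pattern = r"\d{3}-\d{2}-\d{4}"

score = match_score(sample, ssn_pattern)
applied = tags_for_column(sample, ssn_pattern, ["Example.SSN"], 0.7)
not_applied = tags_for_column(sample, ssn_pattern, ["Example.SSN"], 0.8)
```

With this sample, the column is tagged at a 0.7 threshold but not at 0.8, because only three of the four values match.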
There are two types of identifiers:
- Built-in identifier: These identifiers are included with Immuta, detect common categories of data (such as social security numbers, zip codes, and routing numbers), and cannot be modified. Users can list built-in identifiers through the Immuta API or view the Built-in identifiers reference page.
- Custom identifier: Custom identifiers allow data governors to create their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.
By default, all identifiers are matched against data sources when SDD is triggered, unless a template (defined below) is applied to a data source.
Custom identifier types
The three types of custom identifiers are described in the table below.
Custom Identifier Type | Definition | Use Case |
---|---|---|
Regex identifier | This identifier contains a case-insensitive regular expression that allows users to match a custom regex against column values. | If the built-in identifiers do not contain a regex that could match against values within your data sources, use this identifier to create your own regex. See Create a regex identifier for a specific use case example. |
Column name regex identifier | This identifier includes a case-insensitive regular expression that is only matched against column names, not against the values in the column. | If a column name clearly denotes that it contains a type of data, you could create this identifier to match the regex against the name of columns instead of the column values. See Create a column name regex identifier for a specific use case example. |
Dictionary identifier | This identifier contains a list of words and phrases to match against column values. | Create a dictionary identifier if there are words or phrases included in your datasets that you want recognized and tagged, but will not be detected by the built-in identifiers. See Create a dictionary identifier for a specific use case example. |
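
To make the difference between the three custom identifier types concrete, here is a minimal Python sketch of how each kind of match might behave. The function names and sample data are hypothetical. Matching whole values rather than substrings, and treating dictionary entries as case-insensitive, are assumptions for the example rather than documented Immuta behavior.

```python
import re

def regex_score(pattern, values):
    """Regex identifier: case-insensitive regex matched against column values."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for v in values if rx.fullmatch(v)) / len(values) if values else 0.0

def column_name_matches(pattern, column_name):
    """Column name regex identifier: looks only at the column name, never the values."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None

def dictionary_score(words, values):
    """Dictionary identifier: values compared against a list of words and phrases."""
    lookup = {w.casefold() for w in words}  # case-insensitivity is an assumption
    return sum(1 for v in values if v.casefold() in lookup) / len(values) if values else 0.0

# Hypothetical data for each identifier type.
account_score = regex_score(r"acct-\d{4}", ["ACCT-0001", "acct-0042", "unknown"])
name_hit = column_name_matches(r"email", "CUSTOMER_EMAIL")
name_miss = column_name_matches(r"email", "CUSTOMER_PHONE")
color_score = dictionary_score(["red", "blue"], ["Red", "green", "BLUE", "blue"])
```

The column name check returns a yes or no answer, while the two value-based checks produce a score that can be compared against a confidence threshold.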
Templates
A template is a collection of identifiers and settings that drive the configuration of SDD runs. The settings users can apply through templates include the following:
- classifiers (identifiers) are applied to data sources in the SDD run.
- minConfidence is an optional override for the minConfidence established in the identifier(s). When the detection confidence is at least the percentage defined in minConfidence, tags are applied.
- tags is an optional override for the tags applied by the identifiers.
- sample size is an optional override for how many records to sample from the data source.
Users may apply a template globally or to a specific set of data sources. When SDD is triggered on a data source, it will use the identifiers and settings in its configured template to run the detection job. If no template has been configured, SDD will use the global settings, described below. By default, the global settings will use all identifiers in the system to run the detection.
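The fallback from a data source template to the global settings, together with the optional overrides, can be sketched as follows. This is an illustrative model only: the dictionary layout, the `sampleSize` key, the `Example.*` tags, and the use of fractions for minConfidence are assumptions, and the real template schema may differ.

```python
def resolve_run_config(source_template, global_template, all_identifiers,
                       global_sample_size=1000):
    """Work out which identifiers and settings a detection run would use."""
    # Prefer the template on the data source, then the global template.
    template = source_template or global_template or {}
    names = template.get("classifiers")  # None means "use every identifier"
    selected = [i for i in all_identifiers if names is None or i["name"] in names]
    identifiers = [
        {
            "name": i["name"],
            # Template values, when present, override the identifier's own.
            "minConfidence": template.get("minConfidence", i["minConfidence"]),
            "tags": template.get("tags", i["tags"]),
        }
        for i in selected
    ]
    return {
        "identifiers": identifiers,
        "sampleSize": template.get("sampleSize", global_sample_size),
    }

# Hypothetical identifiers and template.
identifiers = [
    {"name": "SSN", "minConfidence": 0.8, "tags": ["Example.SSN"]},
    {"name": "EMAIL", "minConfidence": 0.6, "tags": ["Example.Email"]},
]
owner_template = {"classifiers": ["SSN"], "minConfidence": 0.9, "sampleSize": 500}

with_template = resolve_run_config(owner_template, None, identifiers)
without_template = resolve_run_config(None, None, identifiers)
```

In this sketch, a data source with `owner_template` runs only the SSN identifier with the stricter threshold and a smaller sample, while a data source with no template falls back to every identifier and the global sample size.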
SDD global settings
Global template
When SDD is triggered on a data source that has a template applied, the identifiers in that template run the detection job. Data sources without a template use the identifiers or template defined in the global settings. By default, the global settings use all identifiers in the system to run the detection. However, a system administrator can configure Immuta to use a global template to run the detection instead. While a template is set as the global template, users cannot delete it.
Sample size
SDD applies an identifier to a sample of data to assess the likelihood that a column contains data that fits the pattern specified in the identifier.
By default, SDD samples 1000 records (the sample size) during this process. However, administrators can configure the sample size taken by SDD on the Immuta app settings page. In general, increasing the sample size increases the accuracy of SDD predictions, but decreasing the number of records sampled during SDD may be necessary to meet some organizations' compliance requirements.
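As a rough illustration of what a capped sample means, the sketch below draws at most a fixed number of records before any identifier is evaluated. How Immuta actually selects records is not described here, so the uniform random sampling and the `sample_records` helper are assumptions made purely for the example.

```python
import random

def sample_records(records, sample_size=1000, seed=None):
    """Return at most sample_size records, chosen uniformly at random."""
    if len(records) <= sample_size:
        return list(records)
    rng = random.Random(seed)  # a seed makes the sketch reproducible
    return rng.sample(records, sample_size)

# Hypothetical tables: one larger and one smaller than the sample size.
large_sample = sample_records(list(range(5000)))
small_sample = sample_records(["a", "b", "c"])
reduced_sample = sample_records(list(range(5000)), sample_size=100, seed=7)
```

A smaller sample reads fewer records from the data source, at the cost of a match percentage that is computed from less evidence.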
Running SDD
SDD runs automatically when users create a new data source or when a new column is detected through schema monitoring, but users can also trigger SDD in the Immuta UI, through the Immuta CLI, or through the API.
Dry run
Users can also configure SDD to do a dryRun, which allows them to see what tags would be applied to a data source without actually applying them. See the Run sensitive data discovery on data sources page for details.
Tag mutability
When SDD is triggered by a data owner, all column tags that were previously applied by SDD are removed and the tags prescribed by the latest run are applied. However, if SDD is triggered because schema monitoring detects a new column, no tags are modified on existing columns.
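The dry run and tag mutability rules above can be modeled with a short Python sketch. The data layout (separate sets for SDD-applied tags and other tags), the trigger names, and the `apply_sdd_run` function are hypothetical and serve only to show the behavior described in the text.

```python
import copy

def apply_sdd_run(current, detected, trigger, new_columns=(), dry_run=False):
    """Return the resulting column tags; modify `current` unless dry_run is set."""
    planned = copy.deepcopy(current)
    if trigger == "data_owner":
        # Every column is re-evaluated: old SDD tags go, new ones are applied.
        targets = set(planned) | set(detected)
    elif trigger == "schema_monitoring":
        # Only the newly detected columns are touched.
        targets = set(new_columns)
    else:
        raise ValueError(f"unknown trigger: {trigger}")
    for column in targets:
        entry = planned.setdefault(column, {"sdd": set(), "other": set()})
        entry["sdd"] = set(detected.get(column, ()))
    if not dry_run:
        current.clear()
        current.update(planned)
    return planned

# Hypothetical starting state and detection results.
tags = {
    "ssn": {"sdd": {"Example.SSN"}, "other": {"Reviewed"}},
    "notes": {"sdd": {"Example.Email"}, "other": set()},
}
detected = {"ssn": {"Example.SSN"}, "phone": {"Example.Phone"}}

# Dry run of an owner-triggered job: shows the outcome without changing tags.
preview = apply_sdd_run(tags, detected, "data_owner", dry_run=True)
notes_after_dry_run = set(tags["notes"]["sdd"])

# Schema monitoring found a new column: existing columns keep their tags.
apply_sdd_run(tags, detected, "schema_monitoring", new_columns=["phone"])
```

In the preview, the stale SDD tag on `notes` would be removed, yet the stored tags are unchanged. After the schema monitoring run, only the new `phone` column gains a tag.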
SDD workflow
Two common workflows for using SDD are outlined below. The first illustrates how to apply a single global template to all data sources, while the second outlines how users can create and apply templates to data sources they own.
Workflow 1: Apply a global template to all data sources
- Data governor creates a template using one or more built-in or custom identifiers.
- System administrator adds this template to the global settings so that it applies to all data sources.
- Users trigger SDD on data sources.
Workflow 2: Apply a template to a specific data source
- Data governor creates one or more custom identifiers.
- Data owner creates a template containing one or more identifiers.
- Data owner applies their template to one or more data sources.
- Data owner triggers SDD on one or more data sources, and tags are applied to columns where identifiers were detected.