Configure Sensitive Data Discovery (Public Preview)

Note

In previous documentation, identifiers were referred to as classifiers. The term is being updated to identifier to be more accurate and to avoid confusion with the Immuta data classification and frameworks feature.

Sensitive data discovery (SDD) comprises two major elements: identifiers and templates.

Identifier

The identifier is the basic building block of SDD. Essentially, an identifier includes a pattern (e.g., a regex or a list of values) and a list of tags to apply to data that matches that pattern. For example, if a column sample matches a regex defined in an identifier, then all the tags in that identifier will be applied to that column.

SDD applies identifiers to samples of data to assess the likelihood that columns contain data that fits the pattern specified in the identifier (such as a social security number). SDD then scores columns by the percentage of values that match the pattern defined. This score determines whether or not the configured tags will be applied to a column.
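The scoring step described above can be sketched in a few lines of Python. This is only an illustration of the idea (score = share of sampled values that match the pattern), not Immuta's implementation. The function names, the `min_confidence` parameter, the use of a full-string match, and the `Example.SSN` tag are all hypothetical.

```python
import re
from typing import Iterable, List


def match_score(sample: Iterable[str], pattern: str) -> float:
    """Return the fraction (0.0 to 1.0) of sampled values that match the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    values = list(sample)
    if not values:
        return 0.0
    # Assumption: a value counts as a hit only when the whole value matches.
    hits = sum(1 for value in values if regex.fullmatch(value))
    return hits / len(values)


def tags_for_column(sample: Iterable[str], pattern: str,
                    tags: List[str], min_confidence: float) -> List[str]:
    """Return the identifier's tags when the score reaches the threshold, else nothing."""
    if match_score(sample, pattern) >= min_confidence:
        return list(tags)
    return []


ssn_pattern = r"\d{3}-\d{2}-\d{4}"
column_sample = ["123-45-6789", "987-65-4321", "n/a", "555-12-3456"]
score = match_score(column_sample, ssn_pattern)  # 3 of the 4 values match
applied = tags_for_column(column_sample, ssn_pattern, ["Example.SSN"], 0.7)
```

With this sample, three of four values match, so the column is tagged at a threshold of 0.7 but would not be tagged at 0.8.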

There are two types of identifiers:

Built-in identifier: These identifiers are included with Immuta, detect common categories of data (such as social security numbers, zip codes, and routing numbers), and cannot be modified. Users can list built-in identifiers through the Immuta API or view them on the Built-in identifiers reference page.
Custom identifier: Custom identifiers allow data governors to create their own regular expressions, dictionaries, and tags that SDD will use to discover and tag data.

By default, all identifiers are matched against data sources when SDD is triggered, unless a template (defined below) is applied to a data source.

Custom identifier types

The three types of custom identifiers are described in the table below.

Custom Identifier Type	Definition	Use Case
Regex identifier	This identifier contains a case-insensitive regular expression that allows users to match a custom regex against column values.	If the built-in identifiers do not contain a regex that could match against values within your data sources, use this identifier to create your own regex. See Create a regex identifier for a specific use case example.
Column name regex identifier	This identifier includes a case-insensitive regular expression that is only matched against column names, not against the values in the column.	If a column name clearly denotes that it contains a type of data, you could create this identifier to match the regex against the name of columns instead of the column values. See Create a column name regex identifier for a specific use case example.
Dictionary identifier	This identifier contains a list of words and phrases to match against column values.	Create a dictionary identifier if there are words or phrases included in your datasets that you want recognized and tagged, but will not be detected by the built-in identifiers. See Create a dictionary identifier for a specific use case example.
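To make the difference between the three custom identifier types concrete, here is a minimal Python sketch of what each one is compared against. It is an assumption-laden illustration rather than Immuta's matching logic: the function names are hypothetical, the choice of full-match for values and substring search for column names is an assumption, and the case-insensitive dictionary comparison is also an assumption (the table above only states case-insensitivity for the two regex types).

```python
import re
from typing import Iterable


def regex_identifier_matches(pattern: str, value: str) -> bool:
    """Regex identifier: case-insensitive regex tested against a column VALUE."""
    return re.fullmatch(pattern, value, re.IGNORECASE) is not None


def column_name_identifier_matches(pattern: str, column_name: str) -> bool:
    """Column name regex identifier: case-insensitive regex tested against the column NAME only."""
    return re.search(pattern, column_name, re.IGNORECASE) is not None


def dictionary_identifier_matches(entries: Iterable[str], value: str) -> bool:
    """Dictionary identifier: the value is looked up in a list of words and phrases."""
    # Assumption: comparison ignores case and surrounding whitespace.
    normalized = {entry.strip().casefold() for entry in entries}
    return value.strip().casefold() in normalized


# Hypothetical inputs for illustration.
is_account_id = regex_identifier_matches(r"acct-\d{6}", "ACCT-123456")
is_email_column = column_name_identifier_matches(r"e[-_]?mail", "Customer_Email")
is_known_vendor = dictionary_identifier_matches(["Acme Corp", "Globex"], "acme corp")
```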

Templates

A template is a collection of identifiers and settings that drive the configuration of SDD runs. The settings users can apply through templates include the following:

classifiers (identifiers) specifies the identifiers that are applied to data sources in the SDD run.
minConfidence is an optional override for the minConfidence established in the identifier(s). When the detection confidence is at least the percentage defined in minConfidence, tags are applied.
tags is an optional override for the tags applied by the identifiers.
sample size is an optional override for how many records to sample from the data source.
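The override behavior of these settings can be sketched as follows. This is a hypothetical in-memory model, not the Immuta API payload: the class names, field names, and the `effective_settings` helper are invented for illustration. The only facts taken from this page are that each override is optional and that the default sample size is 1000 records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identifier:
    name: str
    min_confidence: float  # threshold defined on the identifier itself
    tags: List[str]        # tags the identifier applies on a match


@dataclass
class Template:
    identifiers: List[Identifier]
    min_confidence: Optional[float] = None  # optional override
    tags: Optional[List[str]] = None        # optional override
    sample_size: Optional[int] = None       # optional override


def effective_settings(template: Template, identifier: Identifier,
                       default_sample_size: int = 1000) -> dict:
    """Template overrides win when set; otherwise fall back to the identifier or the default."""
    return {
        "min_confidence": (template.min_confidence
                           if template.min_confidence is not None
                           else identifier.min_confidence),
        "tags": template.tags if template.tags is not None else identifier.tags,
        "sample_size": (template.sample_size
                        if template.sample_size is not None
                        else default_sample_size),
    }


ssn = Identifier(name="ssn", min_confidence=0.8, tags=["Example.SSN"])
template = Template(identifiers=[ssn], min_confidence=0.6)
settings = effective_settings(template, ssn)
```

Here the template lowers the confidence threshold to 0.6 but leaves the tags and sample size untouched, so those fall back to the identifier's tags and the default sample size.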

Users may apply a template globally or to a specific set of data sources. When SDD is triggered on a data source, it will use the identifiers and settings in its configured template to run the detection job. If no template has been configured, SDD will use the global settings, described below. By default, the global settings will use all identifiers in the system to run the detection.

SDD global settings

Global template

When SDD is triggered on a data source that has a template applied, the identifiers in that template run the detection job. Data sources without a template use the identifiers or template defined in the global settings. By default, the global settings use all identifiers in the system to run the detection. However, a system administrator can configure Immuta to use a global template instead. While a template is the active global template, users cannot delete it.
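The fallback order described here (data source template, then global template, then all identifiers) can be expressed as a small resolution function. This is a sketch under the assumption that templates and identifiers are plain lists of identifier names; the function name and arguments are hypothetical.

```python
from typing import List, Optional


def resolve_identifiers(source_template: Optional[List[str]],
                        global_template: Optional[List[str]],
                        all_identifiers: List[str]) -> List[str]:
    """Pick the identifiers that run for one data source."""
    if source_template is not None:
        return source_template   # template applied to the data source
    if global_template is not None:
        return global_template   # administrator-configured global template
    return all_identifiers       # default: every identifier in the system


everything = ["ssn", "zip_code", "routing_number", "custom_account_id"]
with_source_template = resolve_identifiers(["ssn"], ["zip_code"], everything)
with_global_only = resolve_identifiers(None, ["zip_code"], everything)
with_defaults = resolve_identifiers(None, None, everything)
```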

Sample size

SDD applies an identifier to a sample of data to assess the likelihood that a column contains data that fits the pattern specified in the identifier.

By default, SDD samples 1000 records (the sample size) during this process. However, administrators can configure the sample size taken by SDD on the Immuta app settings page. In general, increasing the sample size increases the accuracy of SDD predictions, but decreasing the number of records sampled during SDD may be necessary to meet some organizations' compliance requirements.
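As a rough illustration of why a larger sample tends to give a more reliable score, the snippet below computes the standard error of an estimated match percentage. It assumes simple random sampling, which this page does not specify, and the function name is hypothetical.

```python
import math


def match_rate_standard_error(true_rate: float, sample_size: int) -> float:
    """Standard error of a sampled proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(true_rate * (1.0 - true_rate) / sample_size)


# Uncertainty of the estimated match rate when half of the column's values match.
error_small_sample = match_rate_standard_error(0.5, 100)
error_default_sample = match_rate_standard_error(0.5, 1000)
```

Under that assumption, going from 100 to 1000 sampled records shrinks the uncertainty by a factor of about 3.2 (the square root of 10), which is the trade-off to weigh against any requirement to sample fewer records.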

Running SDD

SDD runs automatically when users create a new data source or when a new column is detected through schema monitoring, but users can also trigger SDD in the Immuta UI, through the Immuta CLI, or through the API.

Dry run

Users can also configure SDD to do a dryRun, which allows them to see what tags would be applied to a data source without actually applying them. See the Run sensitive data discovery on data sources page for details.
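Conceptually, a dry run computes the same result as a normal run but skips the write. The sketch below models that with an in-memory dictionary of column tags. It is not the Immuta API; the function name, the `dry_run` argument, and the tag names are hypothetical.

```python
from typing import Dict, List


def run_detection(detected: Dict[str, List[str]],
                  column_tags: Dict[str, List[str]],
                  dry_run: bool = False) -> Dict[str, List[str]]:
    """Return the tags this run would add per column; write them only when dry_run is False."""
    proposed: Dict[str, List[str]] = {}
    for column, tags in detected.items():
        current = column_tags.get(column, [])
        new_tags = [tag for tag in tags if tag not in current]
        if new_tags:
            proposed[column] = new_tags
    if not dry_run:
        for column, new_tags in proposed.items():
            column_tags.setdefault(column, []).extend(new_tags)
    return proposed


existing = {"email": [], "notes": []}
detected = {"email": ["Example.Email"]}
preview = run_detection(detected, existing, dry_run=True)  # existing is left unchanged
```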

Tag mutability

When SDD is triggered by a data owner, all column tags that were previously applied by SDD are removed, and the tags prescribed by the latest run are applied. However, if SDD is triggered because schema monitoring detected a new column, no tags are modified on existing columns.
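The two behaviors can be contrasted with a small model. This sketch assumes that SDD-applied tags are tracked separately from manually applied tags so that only the former are replaced; that bookkeeping, the trigger names, and the function name are all hypothetical.

```python
from typing import Dict, Iterable, Set

ColumnTags = Dict[str, Dict[str, Set[str]]]  # column -> {"sdd": {...}, "manual": {...}}


def apply_run(columns: ColumnTags, detected: Dict[str, Set[str]],
              trigger: str, new_columns: Iterable[str] = ()) -> ColumnTags:
    """Update SDD-applied tags according to what triggered the run."""
    if trigger == "data_owner":
        # Replace SDD tags on every column with the latest results.
        for name in set(columns) | set(detected):
            entry = columns.setdefault(name, {"sdd": set(), "manual": set()})
            entry["sdd"] = set(detected.get(name, set()))
    elif trigger == "schema_monitoring":
        # Only newly detected columns are tagged; existing columns are untouched.
        for name in new_columns:
            entry = columns.setdefault(name, {"sdd": set(), "manual": set()})
            entry["sdd"] = set(detected.get(name, set()))
    else:
        raise ValueError(f"unknown trigger: {trigger}")
    return columns


table = {"ssn": {"sdd": {"Example.Old"}, "manual": {"Example.Reviewed"}}}
apply_run(table, {"phone": {"Example.Phone"}}, "schema_monitoring", new_columns=["phone"])
# "ssn" keeps Example.Old because existing columns are not modified on this trigger.
apply_run(table, {"ssn": {"Example.SSN"}, "phone": {"Example.Phone"}}, "data_owner")
# "ssn" now carries Example.SSN as its only SDD tag; the manual tag is kept.
```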

SDD workflow

Two common workflows for using SDD are outlined below. The first illustrates how to apply a single global template to all data sources, while the second outlines how users can create and apply templates to data sources they own.

Workflow 1: Apply a global template to all data sources

Data governor creates a template using one or more built-in or custom identifiers.
System administrator adds this template to the global settings so that it applies to all data sources.
Users trigger SDD on data sources.

Workflow 2: Apply a template to a specific data source

Data governor creates one or more custom identifiers.
Data owner creates a template containing one or more identifiers.
Data owner applies their template to one or more data sources.
Data owner triggers SDD on one or more data sources, and tags are applied to columns where identifiers were detected.