Schema Monitoring and Automatic Sensitive Data Discovery

Prerequisite: Before using this walkthrough, please ensure that you’ve done Parts 1-3 in the POV Data Setup walkthrough.

Overview

Immuta considers itself a “live” metadata aggregator, collecting metadata not only about your data but also about your users. For data specifically, “live” means that Immuta monitors your database for schema changes and reflects those changes in your Immuta instance. This allows you to register your databases with Immuta without having to worry about registering individual tables, today or in the future.
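
To make the idea of monitoring for schema changes concrete, here is a minimal sketch that compares two snapshots of a database schema and reports what is new. It is an illustration only, not how Immuta implements monitoring: the `diff_schema` function, the snapshot shape (table name mapped to a set of column names), and the table and column names are all assumptions made for this example.

```python
def diff_schema(previous, current):
    """Compare two schema snapshots of the form {table_name: set_of_column_names}."""
    new_tables = sorted(set(current) - set(previous))
    dropped_tables = sorted(set(previous) - set(current))

    # For tables present in both snapshots, report columns that were added.
    new_columns = {}
    for table in sorted(set(previous) & set(current)):
        added = sorted(current[table] - previous[table])
        if added:
            new_columns[table] = added

    return {
        "new_tables": new_tables,
        "dropped_tables": dropped_tables,
        "new_columns": new_columns,
    }


# Hypothetical snapshots taken before and after users changed the database.
previous = {"immuta_fake_hr_data": {"first_name", "last_name"}}
current = {
    "immuta_fake_hr_data": {"first_name", "last_name", "salary"},
    "immuta_fake_credit_card_transactions": {"transaction_country"},
}

changes = diff_schema(previous, current)
```

Here `changes["new_tables"]` holds the newly created table and `changes["new_columns"]` holds the column added to the existing table, which is the information a monitoring job would need in order to register them.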

Additionally, when tables are discovered through the registration process, Immuta inspects the table data for sensitive information and tags it as such. These tags are critical for scaling tag-based policies, which you will learn about in subsequent walkthroughs. Sensitive data discovery works by inspecting samples of your data and using algorithms to decide what the data most likely contains. You can edit the resulting tags, or curate and add your own custom tags.
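
The sampling approach described above can be sketched as follows. This is a deliberately simple, pattern-based stand-in rather than Immuta's actual discovery algorithms; the tag names, the regular expressions, and the 80% match threshold are hypothetical.

```python
import re

# Hypothetical tag names and patterns, chosen only for illustration.
PATTERNS = {
    "Example.Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Example.SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def suggest_tags(sample_values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the non-empty samples."""
    values = [str(v) for v in sample_values if v not in (None, "")]
    if not values:
        return []

    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags


# A sampled column of email-like values, and a column of free text.
email_tags = suggest_tags(["a@example.com", "b@example.org", None, "c@example.net"])
free_text_tags = suggest_tags(["hello", "world"])
```

A threshold below 100% tolerates a few dirty values in the sample, at the cost of occasional false positives, which is one reason the resulting tags need to be editable.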

It is also possible to add tags curated or discovered in other systems or catalogs. While this is not specifically covered in this walkthrough, it’s important to understand.

Business Value

Both monitoring for new data and discovering and tagging sensitive data align with the Scalability and Evolvability theme by removing redundant and arduous work. As users create new tables or columns in your database, those tables and columns will be automatically registered in Immuta and automatically tagged. Why does this matter? Because once they are registered and tagged, policies can be applied immediately. This means humans can be completely removed from the process by creating tag-based policies that dynamically attach themselves to new tables. (We’ll walk through tag-based policies shortly.)
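
As a rough sketch of why tagging makes policy attachment automatic, the example below selects policies for a table purely from the tags on its columns. The `TagPolicy` class, the `policies_for_table` function, the policy name, and the column names are hypothetical and do not reflect Immuta's policy model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagPolicy:
    name: str            # hypothetical policy name
    applies_to_tag: str  # the policy attaches wherever this tag appears


def policies_for_table(column_tags, policies):
    """Return the names of policies whose tag appears on any column of the table.

    `column_tags` maps column name -> list of tags on that column.
    """
    table_tags = set()
    for tags in column_tags.values():
        table_tags.update(tags)
    return sorted(p.name for p in policies if p.applies_to_tag in table_tags)


policies = [TagPolicy("Mask person names", "Discovered.Entity.Person Name")]

# A newly discovered table: nobody wrote a policy for it specifically.
new_table = {"first_name": ["Discovered.Entity.Person Name"], "hire_date": []}
attached = policies_for_table(new_table, policies)
```

Because the policy refers to a tag rather than to a table, the new table picks it up as soon as discovery tags the `first_name` column, with no manual step in between.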

Because of this, the business reaps the following benefits:

Increased revenue: accelerated data access and time to data, because the location of sensitive data is well understood.
Decreased cost: efficient, agile operation at scale.
Decreased risk: sensitive data is discovered immediately.

Registering Metadata

Assumptions: Your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):

CREATE_DATA_SOURCE: in order to register the data with Immuta
GOVERNANCE: to create a custom tag in Immuta

Custom Tag

We are going to create a custom tag to apply to the data. This will:

Help differentiate your real data from this fake POV data.
Help build global policies across these tables from multiple compute/warehouses, if you have more than one.

To create a custom tag:

Click the Governance icon in the left sidebar of the Immuta console.
Click on the Tags tab.
Click + Add Tags.
Name the tag Immuta POV. You can delete the nested tag placeholder in order to save.
Click Save.

Register a Schema

Let’s walk through registering a schema to monitor (you do not need the GOVERNANCE permission for this step, only CREATE_DATA_SOURCE):

From the Data Source page, click the + New Data Source button.
Storage Technology: Choose the Storage Technology of interest. This should align to where you loaded the data in the POV Data Setup walkthrough, but of course could be your own data as well. Note that if you are using the same Databricks workspace for Databricks and SQL Analytics, you only need to load it once.
Connection Information: This is the account Immuta will use to monitor your database and query its metadata. This account should have read access to the data that you need to register. For simplicity, you may want to use the same account you used to load the data in the POV Data Setup walkthrough, but it’s best to use an Admin account for registering the data and a separate user account for querying it (which we’ll do later). The connection should also point to the data you loaded in the POV Data Setup walkthrough, which should be the immuta_pov database unless you named the database something else or placed the data somewhere else.
Virtual Population: There are several options here for how you want Immuta to monitor your database to automatically populate metadata. In our case we want to choose the first option: Create sources for all tables in this database and monitor for changes.
Basic Information: This section allows you to apply a convention to how the tables are named. If you have multiple data warehouses/compute and you’ve already registered these tables once and are now registering them from a second (or additional) warehouse/compute, you will have to change the naming convention for the Immuta data source name and schema project so you can tell them apart in the Immuta UI. This will NOT impact what they are named in the native database.
Advanced (Optional):
1. Note that Sensitive Data Discovery is enabled.
2. We are going to add the Immuta POV tag we created above by going to the last section, “Data Source Tags”.
3. Click Edit.
4. Enter Immuta POV and click Add.
5. This will add that tag to any table that is discovered now or in the future.
6. You can leave the defaults for the rest.
Click Create to kick off the job.
Repeat these steps for each warehouse/compute you have (not to be confused with a Snowflake warehouse; we mean other data warehouses, like Redshift, Databricks SQL analytics, etc.).

You will be taken to a screen that shows the progress of your monitoring job. You’ll also see a spinning gear in the upper right corner of the screen, which indicates that jobs are running. One of those jobs is the “fingerprint,” which gathers statistics about your tables and runs Sensitive Data Discovery.

Once the tables are registered and the gear stops spinning, click into the Immuta POV Immuta Fake Hr Data table. Once there, click on the Data Dictionary tab. In there you will see your columns as well as the Sensitive Data that was discovered. Also note that because we found a specific entity (such as Discovered.Entity.Person Name), we also tag that column with other derivative tags (such as Discovered.Identifier Indirect). This hierarchy will become important in the Hierarchical Tag-Based Policy Definitions walkthrough.
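
The relationship between a specific entity tag, its parents in the dotted hierarchy, and its derivative tags can be sketched like this. The helper names are hypothetical, and the `DERIVED_TAGS` mapping contains only the one pair mentioned above; how Immuta actually stores or resolves these relationships is not shown here.

```python
# Assumed mapping for illustration; only this one pair comes from the walkthrough.
DERIVED_TAGS = {
    "Discovered.Entity.Person Name": ["Discovered.Identifier Indirect"],
}


def ancestors(tag):
    """Return the tag and every parent in its dot-separated hierarchy, most specific first."""
    parts = tag.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def expand(tag):
    """Return the tag followed by any derivative tags configured for it."""
    return [tag] + DERIVED_TAGS.get(tag, [])


person_name_hierarchy = ancestors("Discovered.Entity.Person Name")
person_name_tags = expand("Discovered.Entity.Person Name")
```

In this sketch `person_name_hierarchy` walks from `Discovered.Entity.Person Name` up through `Discovered.Entity` to `Discovered`, which is the kind of structure that lets a policy target a broad or a narrow level of the hierarchy.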

Also visit the Data Dictionary in the Immuta POV Immuta Fake Credit Card Transactions table. If you scroll to the last column, transaction_country, you’ll notice that it was incorrectly tagged as Discovered.Entity.State; go ahead and remove that tag. Notice that the tag is disabled rather than deleted, so that when monitoring runs again the column will not be re-tagged with the incorrect Discovered.Entity.State tag.
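
One way to picture why a removed tag is kept in a disabled state: if the tag were forgotten entirely, the next discovery run would simply add it back. The sketch below models that behavior under assumed names (`apply_discovery`, `active_tags`, and the "enabled"/"disabled" status strings are hypothetical, not Immuta's data model).

```python
def apply_discovery(existing, discovered):
    """Merge newly discovered tags into a column's tag state.

    `existing` maps tag name -> "enabled" or "disabled". Tags already present,
    including disabled ones, keep their current status.
    """
    merged = dict(existing)
    for tag in discovered:
        merged.setdefault(tag, "enabled")
    return merged


def active_tags(state):
    """Return the tags that are currently in effect on the column."""
    return sorted(tag for tag, status in state.items() if status == "enabled")


# The user removed the incorrect tag from transaction_country, leaving it disabled.
column_state = {"Discovered.Entity.State": "disabled"}

# Monitoring runs again and the discovery step makes the same mistake.
column_state = apply_discovery(column_state, ["Discovered.Entity.State"])
still_active = active_tags(column_state)
```

After the second run `still_active` is empty, because the disabled entry acts as a record of the user's correction.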

It is worth mentioning that, under our default policy, the table is completely protected as soon as it is discovered. We’ll learn more about this in subsequent sections.

Anti-Patterns

The anti-pattern here is fairly obvious: instead of automatically detecting schema or database changes, you manage them manually, and instead of automatically detecting sensitive data, you manage that manually as well.

The manual approach is not just a time sink; it also complicates the process, because not only must you notice when a new table is present, but you must then remember to tag it and potentially protect it appropriately. This leaves you exposed to data leaks as new data is created across your organization, almost daily.

Next Steps

If you came to this walkthrough from the POV Data Setup, please make sure to complete the final Part 5 there!

Otherwise, feel free to return to the POV Guide to move on to your next topic.