Data Sources in Immuta

Data owners expose their data across their organization to other users by registering that data in Immuta as a data source.

By default, data owners can register data in Immuta without affecting existing policies on those tables in their remote system, so users who had access to a table before it was registered can still access that data without interruption. If this default behavior is disabled on the App Settings page, a subscription policy that requires data owners to manually add subscribers to data sources will automatically apply to new data sources (unless a global policy you create applies), blocking access to those tables.

For information about this default subscription policy and how to manage it, see the default subscription policy page.

Registering Data Sources with the dbt Cloud Integration

The dbt Cloud integration connects Immuta to your dbt Cloud jobs so that updates run through dbt populate in Immuta. Once dbt and Immuta are connected, any job that runs to update your database automatically applies that update to your Immuta instance. This is similar to Schema Monitoring in that data sources are created, updated, and deleted when prompted by the dbt jobs. It differs in that the dbt Cloud integration can also sync tags, column descriptions, and data source descriptions from your data sources into Immuta.

For a tutorial, see Connect Data Sources Using dbt Cloud Integration.

Limitations

You cannot update a dbt Cloud API key or delete the dbt Cloud integration from the UI. Use the API instead:
- To update the dbt Cloud API key:

  PUT /dbt/{accountId}/{projectId}/{environmentId} (with a payload similar to {apiKey: <newKey>})

  Example Immuta CLI command:
```
immuta api /dbt/{accountId}/{projectId}/{environmentId} -X PUT \
  -P accountId=1 -P projectId=10 -P environmentId=100 -d apiKey=<newKey>
```
- To delete the dbt Cloud integration (this also deletes all data sources created with the integration):

  DELETE /dbt/{accountId}/{projectId}/{environmentId}

There are no distinguishing features on dbt data sources within Immuta. The dbt integration functions as a catalog, but Immuta does not link the data source to the catalog. As a result, data source users can remove tags in the UI; however, those tags will be re-added the next time the dbt job runs.
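
For readers who script the two API calls listed in these limitations, the following is a minimal Python sketch that builds the PUT and DELETE requests without sending them. The base URL, the bearer-token Authorization header, and the function names are assumptions made for illustration and are not taken from this page; check your instance's API documentation for the authentication scheme it actually expects.

```python
import json
import urllib.request

# Assumption: placeholder host; replace with your Immuta instance URL.
BASE_URL = "https://immuta.example.com"


def dbt_endpoint(account_id: int, project_id: int, environment_id: int) -> str:
    """Fill in the /dbt/{accountId}/{projectId}/{environmentId} path."""
    return f"{BASE_URL}/dbt/{account_id}/{project_id}/{environment_id}"


def build_update_key_request(account_id, project_id, environment_id, new_key, token):
    """Build (but do not send) the PUT request that replaces the dbt Cloud API key."""
    body = json.dumps({"apiKey": new_key}).encode("utf-8")
    return urllib.request.Request(
        dbt_endpoint(account_id, project_id, environment_id),
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; your instance may differ.
            "Authorization": f"Bearer {token}",
        },
    )


def build_delete_request(account_id, project_id, environment_id, token):
    """Build (but do not send) the DELETE request that removes the integration.

    Per the text above, deleting the integration also deletes every data
    source that was created with it.
    """
    return urllib.request.Request(
        dbt_endpoint(account_id, project_id, environment_id),
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    req = build_update_key_request(1, 10, 100, "<newKey>", "<token>")
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the method, URL, and payload before running a destructive call such as the DELETE.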

Data Sources With Nested Columns

When data sources support nested columns, these columns get parsed into a nested Data Dictionary. Below is a list of data sources that support nested columns:

- S3
- Azure Blob
- Databricks sources with complex data types enabled
  - When complex types are enabled, Databricks data sources can have columns that are arrays, maps, or structs, and these can be nested.
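
To make the idea of a nested Data Dictionary concrete, here is a small Python sketch that turns a nested schema into nested dictionary entries. The `DictionaryEntry` shape, the input format (a plain dict where a nested dict stands in for a struct), and the dotted-path listing are all hypothetical and chosen for illustration; they are not Immuta's internal representation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DictionaryEntry:
    """One Data Dictionary column; `children` holds its nested columns."""
    name: str
    data_type: str
    children: List["DictionaryEntry"] = field(default_factory=list)


def build_entries(schema: Dict[str, Any]) -> List[DictionaryEntry]:
    """Recursively convert {column: type-or-nested-dict} into nested entries."""
    entries = []
    for name, spec in schema.items():
        if isinstance(spec, dict):
            # A nested dict stands in for a struct column with its own fields.
            entries.append(DictionaryEntry(name, "struct", build_entries(spec)))
        else:
            entries.append(DictionaryEntry(name, str(spec)))
    return entries


def flatten(entries: List[DictionaryEntry], prefix: str = "") -> List[str]:
    """List a dotted path for every column, parents before children."""
    paths = []
    for entry in entries:
        path = f"{prefix}{entry.name}"
        paths.append(path)
        paths.extend(flatten(entry.children, prefix=path + "."))
    return paths


if __name__ == "__main__":
    schema = {
        "id": "bigint",
        "address": {"city": "string", "geo": {"lat": "double", "lon": "double"}},
    }
    for path in flatten(build_entries(schema)):
        print(path)
```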

Data Source Health

When an Immuta data source is created, background jobs are submitted to compute the row count and high cardinality column for the data source. These jobs use the connection information provided at data source creation time. A data source initially has a health status of “Healthy” because the initial health check is a simple SQL query against the source that confirms the source can be queried at all. After the background jobs for the row count and high cardinality column computation complete, the health status is updated. If one or both of those jobs fail, the health status changes to “Unhealthy.”

These background jobs can be disabled during data source creation by adding a specific tag that prevents automatic table statistics. A System Administrator can set this tag on the App Settings page. The data source will still show as healthy, but there are trade-offs. Disabling statistics collection prevents the Immuta Query Engine cost-based optimizer from correctly estimating query plan costs, which could have a significant negative performance impact on any queries executed through the Query Engine against that data source. Additionally, with automatic table statistics disabled, these policies will be unavailable until the Data Source Owner manually generates the fingerprint:

- Masking with format preserving masking
- Masking with K-Anonymization
- Masking using randomized response

Follow these instructions for more details on Data Source Health Checks.

Unhealthy Databricks Data Sources

Databricks data sources may fail their row count queries, and therefore show as unhealthy, if those queries run against a cluster that has the Databricks query watchdog enabled.

Limitations

Data sources with more than 1600 columns will not have health checks run but will still appear as healthy. For these data sources, the health check cannot be run automatically or manually.
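
The health rules described in this section can be summarized in a short Python sketch. This is only a model of the behavior as documented here, not Immuta's implementation; the `HealthInputs` fields and the function name are hypothetical.

```python
from dataclasses import dataclass

# Limit stated in the text: above this column count no health checks run.
MAX_COLUMNS_FOR_HEALTH_CHECK = 1600


@dataclass
class HealthInputs:
    """Hypothetical inputs describing one data source."""
    column_count: int
    stats_disabled: bool       # statistics-prevention tag applied at creation
    row_count_ok: bool         # background row count job succeeded
    high_cardinality_ok: bool  # background high cardinality job succeeded


def health_status(inputs: HealthInputs) -> str:
    """Return the status shown once any background jobs have finished."""
    if inputs.column_count > MAX_COLUMNS_FOR_HEALTH_CHECK:
        # No health checks run, yet the source still appears healthy.
        return "Healthy"
    if inputs.stats_disabled:
        # Background jobs are skipped, so nothing can mark the source unhealthy.
        return "Healthy"
    if inputs.row_count_ok and inputs.high_cardinality_ok:
        return "Healthy"
    # One or both background jobs failed.
    return "Unhealthy"


if __name__ == "__main__":
    print(health_status(HealthInputs(20, False, True, False)))
```

Note that in this model a "Healthy" status on a wide table or a table with statistics disabled only means that no failing job was recorded, not that the row count or high cardinality column was computed.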

Data Source User Roles

There are various roles users and groups can play relating to each data source. These roles are managed through the Members tab of the data source. They include:

- Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They can also ingest data into their data sources and add ingest users (if their data source is object-backed).
- Subscribers: Those who have access to the data source data. With the appropriate data accesses and attributes, these users/groups can view files, run SQL queries, and generate analytics against the data source data. All users/groups granted access to a data source (except for those with the ingest role) have subscriber status.
- Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are responsible for managing the data source's documentation and the Data Dictionary.
- Ingest: Those who are responsible for ingesting data for the data source. This role only applies to object-backed data sources (since query-backed data sources are ingested automatically). Ingest users cannot access any data once it's inside Immuta, but they can verify whether their data was successfully ingested.

See Manage Data Sources for a tutorial on modifying user roles.
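
As a quick reference, the role rules described in this section can be modeled in a few lines of Python. The enum and helper names are hypothetical, and the sketch ignores policies and attributes, which also affect what a subscriber can actually see.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    SUBSCRIBER = "subscriber"
    EXPERT = "expert"
    INGEST = "ingest"


def can_access_data(role: Role) -> bool:
    """Everyone granted access except ingest users has subscriber status."""
    return role is not Role.INGEST


def can_manage_documentation(role: Role) -> bool:
    """Owners and experts manage documentation and the Data Dictionary."""
    return role in (Role.OWNER, Role.EXPERT)


def role_applies(role: Role, object_backed: bool) -> bool:
    """The ingest role exists only on object-backed data sources."""
    return object_backed or role is not Role.INGEST


if __name__ == "__main__":
    for role in Role:
        print(role.value, can_access_data(role), can_manage_documentation(role))
```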

Data Dictionary

The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.

Dictionary columns are automatically generated when the data source is created if the remote storage technology supports SQL. Otherwise, Data Owners or Experts can create the entries for the Data Dictionary manually.