Spark Direct File Reads

Audience: Databricks and EMR Users

Content Summary: This page describes how to read from a file path in Spark. Direct file reads are supported in Databricks and EMR environments.

Overview

In addition to supporting direct file reads through workspace and scratch paths, Immuta supports direct file reads in Spark for paths that back Immuta data sources. As a result, users who prefer to work with file paths, or who have existing workflows built around them, can keep using those workflows without rewriting their queries for Immuta.

When reading from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If a matching data source is found, the query plan is mapped to that Immuta data source and follows the existing code paths for policy enforcement.
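The lookup described above can be pictured as a prefix match between the requested path and the storage locations of the user's data sources. The sketch below is purely illustrative: `DataSource`, `find_data_source`, and the sample locations are hypothetical names, and it does not reflect the plugin's actual implementation or the Immuta Web Service API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DataSource:
    """Hypothetical stand-in for a data source visible to the current user."""
    name: str
    location: str  # root path of the backing table (assumed field)


def _segments(path: str) -> List[str]:
    # Compare whole path segments so ".../table" does not match ".../table_v2".
    return [segment for segment in path.split("/") if segment]


def find_data_source(path: str, sources: Iterable[DataSource]) -> Optional[DataSource]:
    """Return the first data source whose location contains `path`, else None."""
    wanted = _segments(path)
    for source in sources:
        root = _segments(source.location)
        if wanted[:len(root)] == root:
            return source
    return None
```

Returning the first match mirrors the behavior noted under Limitations: when several data sources cover the same path, the result depends on lookup order.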

Read Data

Spark Direct File Reads in EMR

EMR uses the same integration as Databricks, but you will need to use the immuta SparkSession just as you normally would to interact with Immuta data sources.

For example, instead of spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet"), use immuta.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet").

Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). The sections below show examples of each method.


Read Data from an Individual Parquet File

To read from an individual file, load a partition file from a sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")

Read Partitioned Data from a Sub-Directory

To read partitioned data, load a parquet partition from its sub-directory:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table/partition_column=01")

Alternatively, load a parquet partition using a where predicate:

spark.read.format("parquet").load("s3://my_bucket/path/to/my_parquet_table").where("partition_column=01")
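
The two forms above address the same data: the partition directory encodes `column=value` in the path, and the where predicate expresses the same filter against the table root. As a minimal sketch, the hypothetical helper `split_partition_path` below (not part of Spark or Immuta) separates a partition path into the table root and an equivalent predicate string.

```python
from typing import List, Tuple


def split_partition_path(path: str) -> Tuple[str, str]:
    """Split a partition path into (table_root, predicate).

    Trailing `column=value` directory segments become a predicate joined
    with AND; values are copied verbatim from the path.
    """
    parts = path.rstrip("/").split("/")
    filters: List[str] = []
    # Walk backwards while the last segment looks like `column=value`.
    while parts and "=" in parts[-1]:
        column, value = parts.pop().split("=", 1)
        filters.insert(0, f"{column}={value}")
    return "/".join(parts), " AND ".join(filters)


root, predicate = split_partition_path(
    "s3://my_bucket/path/to/my_parquet_table/partition_column=01"
)
# root      -> "s3://my_bucket/path/to/my_parquet_table"
# predicate -> "partition_column=01"
```

The resulting pair could then be passed to `load(root)` and `where(predicate)`. String-valued partitions would need quoting in the predicate; the sketch leaves that to the caller.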

Object-Backed Data Sources

Direct file reads in Spark are also supported for object-backed Immuta data sources (such as S3 or Azure Blob data sources) using the is3a file system:

spark.read.format("parquet").load("is3a://immuta/test/path")

Limitations

Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
On Databricks, multiple input paths are supported as long as they belong to the same data source. On EMR, however, only a single input path is supported.
CSV-backed tables are not currently supported.

Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:

# Not recommended by Spark and not supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table/partition_column=01")

# Recommended by Spark and supported in Immuta
spark.read.format("delta").load("s3://my_bucket/path/to/my_delta_table").where("partition_column=01")
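
Several of the limitations above can be checked before a read is attempted. The following is a rough sketch of such a pre-flight check; `check_read`, its parameters, and the message strings are hypothetical and are not part of the Immuta plugin, which enforces its own rules independently.

```python
from typing import List, Sequence


def check_read(fmt: str, paths: Sequence[str], platform: str) -> List[str]:
    """Return human-readable problems for a planned direct file read.

    `platform` is assumed to be "databricks" or "emr". An empty list means
    none of the limitations checked here were triggered.
    """
    problems: List[str] = []
    fmt = fmt.lower()

    if fmt == "csv":
        problems.append("CSV-backed tables are not currently supported.")

    if platform.lower() == "emr" and len(paths) > 1:
        problems.append("EMR supports only a single input path.")

    if fmt == "delta":
        for path in paths:
            last_segment = path.rstrip("/").rsplit("/", 1)[-1]
            # A trailing `column=value` segment means a partition sub-directory.
            if "=" in last_segment:
                problems.append(
                    f"Load the Delta table root and use a where predicate instead of {path}"
                )
    return problems
```

The check works from the path strings alone, so it cannot tell whether multiple Databricks input paths belong to the same data source or whether more than one data source exists for a path.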