Spark Direct File Reads
Audience: Databricks and EMR Users
Content Summary: This page describes how to read from a file path in Spark. Direct file reads are supported in Databricks and EMR environments.
Overview
In addition to supporting direct file reads through workspace and scratch paths, Immuta allows direct file reads in Spark by file path. As a result, users who prefer to work with file paths, or who have existing workflows built around them, can keep those workflows without rewriting their queries for Immuta.
When reading from a path in Spark, the Immuta Databricks plugin queries the Immuta Web Service to find Databricks data sources for the current user that are backed by data from the specified path. If found, the query plan maps to the Immuta data source and follows existing code paths for policy enforcement.
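The lookup described above can be pictured with a small sketch. This is not Immuta's implementation: the DataSource class, the find_data_source function, and the prefix-matching rule are all assumptions made for illustration, and the real matching is performed by the Immuta Web Service.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DataSource:
    """Hypothetical stand-in for a registered data source (not an Immuta class)."""
    name: str
    location: str  # storage path assumed to back the data source


def find_data_source(path: str, sources: Sequence[DataSource]) -> Optional[DataSource]:
    """Return the first source whose location contains `path`, else None."""
    wanted = path.rstrip("/")
    for source in sources:
        root = source.location.rstrip("/")
        # Assumed rule: match the table root itself or anything nested beneath it.
        if wanted == root or wanted.startswith(root + "/"):
            return source
    return None
```

Because the sketch returns the first match, it also illustrates why registering several data sources for one path makes the outcome depend on lookup order.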
Read Data
Spark Direct File Reads in EMR
EMR uses the same integration as Databricks, but you must use the immuta SparkSession, just as you normally would to interact with Immuta data sources. For example, instead of
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
use
immuta.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
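If the same job needs to run in both environments, one option is to pass the session object in rather than hard-coding it. The sketch below is illustrative only: load_parquet is a hypothetical helper, and it assumes that spark (on Databricks) or immuta (on EMR) is already defined in your session.

```python
def load_parquet(session, path: str):
    """Load a parquet path through whichever session object is passed in.

    `session` is expected to be `spark` on Databricks or `immuta` on EMR;
    both are assumed to expose the standard `read.format(...).load(...)` chain.
    """
    return session.read.format("parquet").load(path)
```

On EMR this would be called as load_parquet(immuta, path), and on Databricks as load_parquet(spark, path).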
Users can read data from individual parquet files in a sub-directory and partitioned data from a sub-directory (or by using a where predicate). The examples below show each of these methods.
Read Data from an Individual Parquet File
To read from an individual file, load a partition file from a sub-directory:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01/my_file.parquet")
Read Partitioned Data from a Sub-Directory
To read partitioned data from a sub-directory, load a parquet partition from a sub-directory:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table/partition_column=01")
Alternatively, load a parquet partition using a where predicate:
spark.read.format("parquet").load("s3:/my_bucket/path/to/my_parquet_table").where("partition_column=01")
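The two forms above differ only in whether the partition appears as a trailing directory or as a predicate. A minimal sketch of converting one into the other follows; split_partition_path is a hypothetical helper, not part of Spark or Immuta, and it assumes partitions are written as column=value directories at the end of the path.

```python
from typing import List, Tuple


def split_partition_path(path: str) -> Tuple[str, List[str]]:
    """Split a path into its table root and trailing `column=value` segments."""
    segments = path.rstrip("/").split("/")
    partitions: List[str] = []
    # Peel partition-style directories off the end of the path.
    while segments and "=" in segments[-1]:
        partitions.insert(0, segments.pop())
    return "/".join(segments), partitions
```

The returned root can be passed to load() and the segments joined with " AND " for where(). A path that ends in a file name is returned unchanged with no partitions, and string-valued partitions would need quoting before being used as a predicate.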
Object-Backed Data Sources
Direct file reads in Spark are also supported for object-backed Immuta data sources (such as S3 or Azure Blob data sources) using the is3a file system:
spark.read.format("parquet").load("is3a://immuta/test/path")
Limitations
- Direct file reads for Immuta data sources only apply to table-backed Immuta data sources, not data sources created from views or queries.
- If more than one data source has been created for a path, Immuta will use the first valid data source it finds. It is therefore not recommended to use this integration when more than one data source has been created for a path.
- On Databricks, multiple input paths are supported as long as they belong to the same data source. However, for EMR only a single input path is supported.
- CSV-backed tables are not currently supported.
- Loading a delta partition from a sub-directory is not recommended by Spark and is not supported in Immuta. Instead, use a where predicate:
  # Not recommended by Spark and not supported in Immuta
  spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table/partition_column=01")
  # Recommended by Spark and supported in Immuta
  spark.read.format("delta").load("s3:/my_bucket/path/to/my_delta_table").where("partition_column=01")
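Some of the limitations above can be checked before a read is issued. The following is a rough sketch under stated assumptions: direct_read_problems is a hypothetical helper, the platform strings are invented for illustration, and the checks cover only what can be seen from the format and paths. Whether paths belong to the same data source, or whether several data sources share a path, can only be determined by Immuta.

```python
from typing import List, Sequence


def direct_read_problems(fmt: str, paths: Sequence[str], platform: str) -> List[str]:
    """Return human-readable reasons a read falls outside the listed limits."""
    problems: List[str] = []
    if fmt.lower() == "csv":
        problems.append("CSV-backed tables are not currently supported")
    if platform.lower() == "emr" and len(paths) > 1:
        problems.append("EMR supports only a single input path")
    if fmt.lower() == "delta":
        for path in paths:
            # A trailing `column=value` directory indicates a partition load.
            last = path.rstrip("/").rsplit("/", 1)[-1]
            if "=" in last:
                problems.append(
                    f"load the Delta table root and filter with where(): {path}"
                )
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that Immuta will resolve the path to a data source.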