Using the Immuta SparkSession (Spark 2)
Audience: Data Users
Content Summary: This page outlines how to use the Immuta SparkSession with spark-submit, spark-shell, and pyspark.
Immuta SparkSession Background: For Spark 2, the Immuta SparkSession must be used to access Immuta data sources. Once the Immuta Spark Installation has been completed on your Spark cluster, you can use the special Immuta Spark interfaces detailed below. For data storage technologies that support batch processing workloads, the Immuta SparkSession allows users to query data sources the same way they query Hive tables with Spark SQL.
When querying metastore-backed data sources, such as Hive and Impala, the Immuta SparkSession accesses the data directly in HDFS; other data source types pass through the Query Engine. To take advantage of the performance gains of acting directly on the files in HDFS in your Spark jobs, you must create Immuta data sources from metastore-backed tables that are persisted in HDFS.
For guidance on querying data sources across multiple clusters and/or remote databases, see Leveraging Data on Other Clusters and Databases.
Using the Immuta SparkSession
With spark-submit
Launch the special immuta-spark-submit interface, and submit jobs just like you would with spark-submit:
immuta-spark-submit <job>
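For example, a minimal sketch of a submission might look like the following. The application jar and main class are hypothetical placeholders, and the example assumes that immuta-spark-submit passes standard spark-submit arguments through unchanged:
# Hypothetical jar and main class; flags mirror standard spark-submit options
immuta-spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyImmutaJob \
  my-immuta-job.jar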
With spark-shell
First, launch the special immuta-spark-shell interface:
immuta-spark-shell
Then, use the immuta variable just like you would spark:
immuta.catalog.listTables().show()
val df = immuta.table("my_immuta_datasource")
df.show()
val df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()
Next, use the immuta format to specify partition information:
val df3 = immuta.read.format("immuta")
.option("dbtable", "my_immuta_datasource")
.option("partitionColumn", "id")
.option("lowerBound", "0")
.option("upperBound", "300")
.option("numPartitions", "3")
.load()
df3.show()
The immuta format also supports query pushdown:
val df4 = immuta.read.format("immuta")
.option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
.load()
df4.show()
Finally, specify the fetch size:
val df5 = immuta.read.format("immuta")
.option("dbtable", "my_immuta_datasource")
.option("fetchsize", "500").load()
df5.show()
With pyspark
First, launch the special immuta-pyspark interface:
immuta-pyspark
Then, use the immuta variable just like you would spark:
immuta.catalog.listTables()
df = immuta.table("my_immuta_datasource")
df.show()
df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()
Next, use the immuta format to specify partition information:
df3 = immuta.read.format("immuta")
.option("dbtable", "my_immuta_datasource")
.option("partitionColumn", "id")
.option("lowerBound", "0")
.option("upperBound", "300")
.option("numPartitions", "3")
.load()
df3.show()
The immuta format also supports query pushdown:
df4 = immuta.read.format("immuta")
.option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
.load()
df4.show()
Finally, specify the fetch size:
df5 = immuta.read.format("immuta")
.option("dbtable", "my_immuta_datasource")
.option("fetchsize", "500").load()
df5.show()