May 13, 2023

Differences between SparkContext & SparkSession

 



Spark Core (Spark Context)

Spark SQL (Spark Session)

Spark 1.x had three entry points: SparkContext, SQLContext and HiveContext.

  • To use Hive functionality, use HiveContext

  • To use SQL functionality, use SQLContext


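import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// sparkConf is an existing SparkConf; each context wraps the same SparkContext
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val hiveContext = new HiveContext(sc)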

From Spark 2.0, the entry point is SparkSession.

SparkSession replaces both HiveContext and SQLContext.

Additionally, it gives developers immediate access to the underlying SparkContext.



// Access the SparkContext from a SparkSession through the public sparkContext member.
// (The private _sc attribute exists only in PySpark; Scala has no sparkSession._sc.)

val sparkContext = sparkSession.sparkContext
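To make the Spark 2.x entry point concrete, here is a minimal sketch of building a SparkSession and reaching the SparkContext through it. The object name, the app name "session-demo" and the local[*] master are assumptions chosen for illustration, not something from the original notes.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object SessionDemo {
  // Builds (or reuses) a local session. App name and master are illustrative.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("session-demo")
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()

    // The session creates the SparkContext and exposes it as a public member.
    val sc: SparkContext = spark.sparkContext

    // RDD API through the context, SQL API through the session.
    val total = sc.parallelize(1 to 10).reduce(_ + _)
    val rows = spark.sql("SELECT 1 AS one").count()

    println(s"app=${sc.appName} total=$total rows=$rows")
    spark.stop()
  }
}
```

If Hive support is needed, the builder also offers enableHiveSupport(), which requires the Hive classes on the classpath.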

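Spark Core: reads files mainly as plain text (for example Txt/CSV lines), with no built-in parsing of structured formats

Spark SQL: can process Text, CSV, Parquet, ORC, Avro and JSON files, data stored on S3, and databases such as MySQL, Oracle, HBase and Cassandra (some of these need an extra connector)

Spark SQL mostly processes structured & semi-structured data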


Spark Core: RDD, the core representation of data in Spark Core

An RDD holds data only, with no schema

Spark SQL: DataFrame, the high-level representation of data in Spark SQL

DataFrame = RDD (data) + schema
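The "RDD + schema" idea can be sketched by turning an RDD of tuples into a DataFrame. The object name, the sample rows and the column names "name" and "age" are made up for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddVsDataFrame {
  // Attaches column names to an RDD; the column types are inferred from the tuple.
  def toDataFrame(spark: SparkSession, rdd: RDD[(String, Int)]): DataFrame = {
    import spark.implicits._
    rdd.toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe")
      .master("local[*]")
      .getOrCreate()

    // An RDD is just a distributed collection of objects: data, no schema.
    val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 45)))

    // The DataFrame carries the same data plus named, typed columns.
    val df = toDataFrame(spark, rdd)
    df.printSchema()

    spark.stop()
  }
}
```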

Spark Core: no Data Source API

Spark SQL: Data Source API (universal API)

Input/Output module of Spark SQL

Universal API - reads data from any supported source and writes data to any supported destination through one interface
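As a rough illustration of the Data Source API, the sketch below writes a small DataFrame as JSON, reads it back, and rewrites it as Parquet using the same read/write interface. The object name, the temporary directory and the sample rows are assumptions for the example only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object DataSourceRoundTrip {
  // Returns the number of rows read back from the Parquet copy.
  def roundTrip(spark: SparkSession): Long = {
    import spark.implicits._

    // Scratch directory for the demo files.
    val base = Files.createTempDirectory("datasource-demo").toString
    val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")

    // Same API for every format: only the format name changes.
    people.write.format("json").mode("overwrite").save(s"$base/people_json")
    val fromJson = spark.read.format("json").load(s"$base/people_json")

    fromJson.write.format("parquet").mode("overwrite").save(s"$base/people_parquet")
    spark.read.format("parquet").load(s"$base/people_parquet").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datasource-round-trip")
      .master("local[*]")
      .getOrCreate()
    println(s"rows=${roundTrip(spark)}")
    spark.stop()
  }
}
```

Other sources follow the same pattern, for example format("jdbc") with connection options for a relational database.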


Spark Catalyst Optimizer

Spark SQL automatically optimizes queries with Catalyst before running them
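One way to see the optimizer at work is to print a query's plans with explain. This is a small sketch; the object name, sample data and column names are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CatalystExplain {
  // Builds a simple filter + projection query over in-memory sample rows.
  def olderThan40(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")
    people.filter($"age" > 40).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-explain")
      .master("local[*]")
      .getOrCreate()

    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    olderThan40(spark).explain(true)

    spark.stop()
  }
}
```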


Tungsten (memory management)

Spark Core: slow

Spark SQL: fast

  • Spark SQL runs faster due to the Spark Catalyst Optimizer (query optimizer) & Tungsten (memory management)

Both RDDs (Spark Core) and DataFrames (Spark SQL) are:

  • Immutable

  • Cached (on request)

  • Lazy
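The lazy and cached behaviour can be observed with a small sketch that counts how often a map function actually runs. The object name and the accumulator name "calls" are assumptions, and the counts described in the comments assume a local run with no task retries and enough memory to hold the cached data.

```scala
import org.apache.spark.sql.SparkSession

object LazyAndCached {
  // Returns the accumulator value (before any action, after first action, after second action).
  def run(spark: SparkSession): (Long, Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("calls")

    // map is a transformation: it returns a new RDD and runs nothing yet.
    val squares = sc.parallelize(1 to 5).map { x => calls.add(1); x * x }
    squares.cache() // marks the RDD for caching; still nothing has run

    val before = calls.value.longValue
    squares.count() // first action: computes the RDD and fills the cache
    val afterFirst = calls.value.longValue
    squares.count() // second action: served from the cache, map is not re-run
    val afterSecond = calls.value.longValue

    (before, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-and-cached")
      .master("local[*]")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Immutability shows up in the same code: map does not change the original RDD, it produces a new one.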
