May 18, 2020

Spark 1 vs Spark 2

 
| Spark 1.x | Spark 2.x |
| --- | --- |
| SparkContext is the entry point. | SparkSession is the entry point. |
| A SQLContext and a HiveContext must be created separately on top of the SparkContext. | A single SparkSession is enough; it subsumes both. |
| Query execution makes many virtual function calls per row, wasting CPU cycles on interpretation overhead. | The performance-enhanced Tungsten engine with whole-stage code generation. |
| Baseline performance. | Roughly 10x faster than Spark 1.x on some workloads. |
| Spark Streaming (the DStream API, built on micro-batches of RDDs). | Structured Streaming (built on the DataFrame/Dataset APIs). |
| Separate DataFrame and (experimental) Dataset APIs. | Unified Dataset and DataFrame APIs; the Dataset side adds compile-time type safety (not available in Python), and DataFrame is now just an alias for Dataset[Row]. |
| Fewer algorithms in the DataFrame-based ML API. | Algorithms such as the Gaussian Mixture Model, Bisecting K-Means clustering and the MaxAbsScaler feature transformer were added to the DataFrame-based API, and many algorithms were also added to PySpark and SparkR. |
| The RDD-based API is going into maintenance mode. | The DataFrame-based API is now the primary API. |

The sketches below illustrate each of these differences with minimal code.
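First, the entry-point change. A minimal sketch (the app name, local master and Hive support are illustrative choices, not requirements):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x: one SparkSession replaces SparkContext, SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("entry-point-sketch")   // hypothetical app name
  .master("local[*]")              // assumption: local mode for illustration
  .enableHiveSupport()             // covers what a separate HiveContext used to provide
  .getOrCreate()

// The underlying SparkContext is still reachable when RDD-level access is needed.
val sc = spark.sparkContext

// Spark 1.x, for contrast: each context had to be created by hand.
// val conf = new org.apache.spark.SparkConf().setAppName("spark1-style")
// val sc = new org.apache.spark.SparkContext(conf)
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
```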
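The Tungsten speed-up is easiest to see in the physical plan. A small sketch (the query and numbers are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tungsten-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Whole-stage code generation (part of the Tungsten work in 2.x) fuses a chain of
// operators into one generated function instead of virtual function calls per row.
val df = spark.range(0, 1000000)
  .filter($"id" % 2 === 0)
  .selectExpr("sum(id)")

// Fused stages appear as WholeStageCodegen (starred operators) in the plan.
df.explain()
```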
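For the streaming row, a sketch modeled on the standard socket word-count example; the host and port assume a local test source such as `nc -lk 9999`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// The stream is just an unbounded DataFrame; no RDD micro-batch handling in user code.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // assumption: local netcat source
  .option("port", 9999)
  .load()

// Split lines into words and keep a running count per word.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```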
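The Dataset/DataFrame unification in one sketch (the `Person` record type is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)   // hypothetical record type

val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A typed Dataset: field names and types are checked at compile time.
val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bo", 29)).toDS()

// DataFrame is just Dataset[Row]: the untyped view of the same data.
val df: DataFrame = people.toDF()

people.filter(_.age > 30).show()   // `_.age` is checked by the compiler
df.filter($"age" > 30).show()      // column name is only checked at runtime
```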
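One of the ML additions named above, Bisecting K-Means, in the DataFrame-based `spark.ml` API; the data points are made up for illustration:

```scala
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-sketch").master("local[*]").getOrCreate()

// spark.ml estimators consume a DataFrame with a vector "features" column.
val data = spark.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")

// Fit two clusters and print their centers.
val model = new BisectingKMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```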
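Finally, the API shift itself: the same word count in the maintenance-mode RDD style and in the now-primary DataFrame style (the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-vs-df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Old primary API: hand-written RDD transformations, opaque to the optimizer.
val rddCounts = spark.sparkContext
  .textFile("input.txt")            // hypothetical input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// New primary API: the same job on a Dataset, planned by Catalyst and run on Tungsten.
val dfCounts = spark.read.textFile("input.txt")
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

dfCounts.show()
```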

 
