May 18, 2020

Spark 1 vs Spark 2

 
| Spark 1.x | Spark 2.x |
| --- | --- |
| SparkContext is the entry point. | SparkSession is the entry point. |
| A SQLContext and a HiveContext must be created separately on top of the SparkContext. | A single SparkSession is enough; it subsumes both. |
| Query execution makes many virtual function calls per row, wasting CPU cycles on interpretation overhead. | The performance-enhanced Tungsten engine with whole-stage code generation. |
| Baseline performance. | Roughly 10x faster than Spark 1.x on some workloads. |
| Spark Streaming (the DStream API, built on micro-batches of RDDs). | Structured Streaming (built on the DataFrame/Dataset APIs). |
| Separate DataFrame and (experimental) Dataset APIs. | Unified Dataset and DataFrame APIs; the Dataset side adds compile-time type safety (not available in Python), and DataFrame is now just an alias for Dataset[Row]. |
| Fewer algorithms in the DataFrame-based ML API. | Algorithms such as the Gaussian Mixture Model, Bisecting K-Means clustering and the MaxAbsScaler feature transformer were added to the DataFrame-based API, and many algorithms were also added to PySpark and SparkR. |
| The RDD-based API is going into maintenance mode. | The DataFrame-based API is now the primary API. |

The sketches below illustrate each of these differences with minimal code.
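First, the entry-point change. A minimal sketch (the app name, local master and Hive support are illustrative choices, not requirements):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x: one SparkSession replaces SparkContext, SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("entry-point-sketch")   // hypothetical app name
  .master("local[*]")              // assumption: local mode for illustration
  .enableHiveSupport()             // covers what a separate HiveContext used to provide
  .getOrCreate()

// The underlying SparkContext is still reachable when RDD-level access is needed.
val sc = spark.sparkContext

// Spark 1.x, for contrast: each context had to be created by hand.
// val conf = new org.apache.spark.SparkConf().setAppName("spark1-style")
// val sc = new org.apache.spark.SparkContext(conf)
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
```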
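The Tungsten speed-up is easiest to see in the physical plan. A small sketch (the query and numbers are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tungsten-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Whole-stage code generation (part of the Tungsten work in 2.x) fuses a chain of
// operators into one generated function instead of virtual function calls per row.
val df = spark.range(0, 1000000)
  .filter($"id" % 2 === 0)
  .selectExpr("sum(id)")

// Fused stages appear as WholeStageCodegen (starred operators) in the plan.
df.explain()
```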
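For the streaming row, a sketch modeled on the standard socket word-count example; the host and port assume a local test source such as `nc -lk 9999`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// The stream is just an unbounded DataFrame; no RDD micro-batch handling in user code.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // assumption: local netcat source
  .option("port", 9999)
  .load()

// Split lines into words and keep a running count per word.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```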
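The Dataset/DataFrame unification in one sketch (the `Person` record type is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)   // hypothetical record type

val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A typed Dataset: field names and types are checked at compile time.
val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bo", 29)).toDS()

// DataFrame is just Dataset[Row]: the untyped view of the same data.
val df: DataFrame = people.toDF()

people.filter(_.age > 30).show()   // `_.age` is checked by the compiler
df.filter($"age" > 30).show()      // column name is only checked at runtime
```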
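One of the ML additions named above, Bisecting K-Means, in the DataFrame-based `spark.ml` API; the data points are made up for illustration:

```scala
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-sketch").master("local[*]").getOrCreate()

// spark.ml estimators consume a DataFrame with a vector "features" column.
val data = spark.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 0.0)),
  (1, Vectors.dense(0.1, 0.1)),
  (2, Vectors.dense(9.0, 9.0)),
  (3, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")

// Fit two clusters and print their centers.
val model = new BisectingKMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```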
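Finally, the API shift itself: the same word count in the maintenance-mode RDD style and in the now-primary DataFrame style (the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-vs-df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Old primary API: hand-written RDD transformations, opaque to the optimizer.
val rddCounts = spark.sparkContext
  .textFile("input.txt")            // hypothetical input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// New primary API: the same job on a Dataset, planned by Catalyst and run on Tungsten.
val dfCounts = spark.read.textFile("input.txt")
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

dfCounts.show()
```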

 
