| Spark 1.x | Spark 2.x |
| --- | --- |
| SparkContext is the entry point. | SparkSession is the entry point. |
| With only a SparkContext, a separate SQLContext or HiveContext must be created for SQL and Hive work. | A single SparkSession is enough; it covers all of these. |
| Query execution makes many virtual function calls per row, wasting CPU cycles on unnecessary work. | The performance-enhanced Tungsten engine with whole-stage code generation reduces this overhead. |
| Baseline performance (1x). | Up to 10x faster than Spark 1.x on some workloads. |
| Spark Streaming (DStreams, built on RDD micro-batches). | Structured Streaming (built on the DataFrame/Dataset APIs). |
| DataFrame and Dataset are separate APIs. | Unified Dataset and DataFrame APIs: DataFrame is now just an alias for Dataset of Row. Dataset adds compile-time type safety, which is not available in Python. |
| | New machine learning algorithms such as Gaussian Mixture Model, MaxAbsScaler, and Bisecting K-Means clustering were added to the DataFrame-based API, and many ML algorithms were also added to PySpark and SparkR. |
| The RDD-based MLlib API is going into maintenance mode. | The DataFrame-based API is now the primary ML API. |