| Spark Streaming | Structured Streaming |
|---|---|
| Available since Spark 1.x | Introduced in Spark 2.x |
| A separate Spark library for processing continuously flowing streaming data | Built on the Spark SQL library |
| Uses the DStream API, powered by Spark RDDs; works on micro-batches, where each batch is an RDD (see the first sketch below) | Based on the DataFrame and Dataset APIs; there is no explicit batch concept |
| DStreams deliver the data received from the streaming source in chunks (RDDs) and output batches of processed data | Incoming stream data is continuously appended to a DataFrame modeled as an unbounded table |
| SQL-style queries are not easy to apply | SQL queries and Scala operations can be applied directly to streaming data (see the second sketch below) |
| Each processed micro-batch is simply emitted; there is no output-mode concept | The result derived from the unbounded table/DataFrame depends on the output mode of the query: Complete, Append, or Update |
| Works on RDDs | DataFrames/Datasets are more optimized, less time-consuming, easier to understand, and make aggregations straightforward |
| No event-time support; only the timestamp at which the data is received is used. Based on this ingestion timestamp, Spark Streaming assigns data to a batch even if the event was generated earlier and belonged to an earlier batch, which can yield less accurate results, effectively amounting to data loss | With event-time windowing and handling of late data, Structured Streaming outweighs Spark Streaming (see the third sketch below) |
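For reference, here is a minimal DStream sketch of the micro-batch model, assuming a hypothetical socket source on localhost:9999: each batch interval of input arrives as one RDD and is processed batch by batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Each 5-second interval of input becomes one micro-batch, i.e. one RDD
    val ssc = new StreamingContext(conf, Seconds(5))

    // Lines received from the (hypothetical) socket source
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // outputs one batch of processed data per interval

    ssc.start()
    ssc.awaitTermination()
  }
}
```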
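Next, a minimal Structured Streaming sketch of the unbounded-table model, using the same hypothetical socket source. The word count is expressed as ordinary DataFrame operations on the stream, and the output mode (Complete here, since the query maintains a running aggregate) decides what is written on each trigger.

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // The stream is exposed as an unbounded DataFrame; new rows keep being appended
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Plain DataFrame/SQL-style operations apply directly to the stream
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Output mode controls what is written each trigger:
    // "complete" = full result table, "update" = changed rows, "append" = new rows only
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```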
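Finally, a sketch of event-time windowing with a watermark. The built-in rate source is used here only to have a runnable stream; in a real job the eventTime column would come from the event payload itself. The watermark lets records up to 5 minutes late still be counted in the event-time window they belong to, rather than in whatever batch they happened to arrive.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object EventTimeWindowing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("EventTimeWindowing")
      .master("local[2]")
      .getOrCreate()

    // The "rate" test source emits (timestamp, value) rows; the timestamp
    // stands in for an event-generated time in this sketch
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumnRenamed("timestamp", "eventTime")

    // Group by 10-minute event-time windows and tolerate data up to 5 minutes
    // late; each record is counted in the window its event time falls into,
    // not the micro-batch in which it was ingested
    val windowed = events
      .withWatermark("eventTime", "5 minutes")
      .groupBy(window(col("eventTime"), "10 minutes"))
      .count()

    val query = windowed.writeStream
      .outputMode("update") // emit only windows whose counts changed
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```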