May 18, 2020

PySpark Streaming vs. Structured Streaming

| Spark Streaming | Structured Streaming |
| --- | --- |
| Introduced in Spark 1.x | Introduced in Spark 2.x |
| A separate library in Spark for processing continuously flowing streaming data | Built on the Spark SQL library |
| Uses the DStreams API, powered by Spark RDDs. It works on micro-batches, where each batch is represented as an RDD (see the DStreams sketch below) | Based on the DataFrame and Dataset APIs; there is no batch concept here |
| DStreams deliver the data received from the streaming source divided into chunks as RDDs, and output batches of processed data | Incoming stream data keeps getting appended to a DataFrame, i.e. an unbounded table (see the Structured Streaming sketch below) |
| SQL queries are not easy to apply | We can easily apply SQL queries or Scala/Python operations on streaming data |
| | The result produced from the unbounded table/DataFrame depends on the output mode of your operations: Complete, Append, or Update |
| Based on RDDs | DataFrames/Datasets are more optimized, faster, and easier to understand; aggregations are easy to apply |
| No notion of event time; it works only with the timestamp at which the data is received. Based on this ingestion timestamp, Spark Streaming puts data into a batch even if the event was generated earlier and belongs to an earlier batch, which can produce less accurate results, effectively equivalent to data loss | With event-time windowing and handling of late data, Structured Streaming outweighs Spark Streaming (see the windowing sketch below) |
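To make the micro-batch model concrete, here is a minimal DStreams word-count sketch in PySpark. The host, port, app name, and 5-second batch interval are assumptions for illustration; it expects a plain text server on localhost:9999 (for example, started with `nc -lk 9999`). Every batch interval, the lines received in that window arrive as one RDD.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # each micro-batch covers 5 seconds

# A DStream is a sequence of RDDs, one per batch interval
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # prints the counts for each micro-batch separately

ssc.start()
ssc.awaitTermination()
```

Note that each batch is counted in isolation: the output is a fresh word count per 5-second RDD, not a running total.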
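The same word count in Structured Streaming, sketched below under the same localhost:9999 assumption, reads the socket as an unbounded DataFrame, so ordinary DataFrame/SQL operations apply directly. The `outputMode` call is where Complete, Append, and Update come in.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# The socket source appears as an unbounded table with a single 'value' column
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()  # a running aggregation over the whole stream

# outputMode controls how the unbounded result table is emitted:
# "complete" = whole table, "append" = new rows only, "update" = changed rows
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```

Unlike the DStreams version, the aggregation here is maintained across the entire stream, and the output mode decides how much of that result table is written out each trigger.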
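Finally, a sketch of the event-time windowing the last table row refers to. The socket source's `includeTimestamp` option attaches an arrival timestamp as a stand-in for a real event-time column (in production the timestamp would come from the data itself), and `withWatermark` tells Spark how long to wait for late data before finalizing a window. Window and watermark durations below are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("EventTimeWindowedCount").getOrCreate()

# Socket source with a timestamp attached to each line: columns (value, timestamp)
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .option("includeTimestamp", True)
              .load())

words = lines.select(
    explode(split(lines.value, " ")).alias("word"),
    lines.timestamp,
)

windowed_counts = (
    words
        # tolerate data up to 10 minutes behind the max observed event time
        .withWatermark("timestamp", "10 minutes")
        # count words per 10-minute window, sliding every 5 minutes
        .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word)
        .count()
)

query = (windowed_counts.writeStream
                        .outputMode("update")
                        .format("console")
                        .option("truncate", False)
                        .start())
query.awaitTermination()
```

A late-arriving event is placed into the window matching its event time rather than its arrival time, which is exactly what DStreams cannot do.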

