May 18, 2020

PySpark Streaming vs. Structured Streaming

| Spark Streaming | Structured Streaming |
| --- | --- |
| Introduced in Spark 1.x | Introduced in Spark 2.x |
| A separate library in Spark for processing continuously flowing streaming data | Built on the Spark SQL library |
| Uses the DStreams API, powered by Spark RDDs. It works on micro-batches, where each batch is represented as an RDD (see the DStreams sketch below) | Based on the DataFrame and Dataset APIs; there is no batch concept here |
| DStreams deliver the data received from the streaming source divided into chunks as RDDs, and output batches of processed data | Incoming stream data keeps getting appended to a DataFrame, i.e. an unbounded table (see the Structured Streaming sketch below) |
| SQL queries are not easy to apply | We can easily apply SQL queries or Scala/Python operations on streaming data |
| | The result produced from the unbounded table/DataFrame depends on the output mode of your operations: Complete, Append, or Update |
| Based on RDDs | DataFrames/Datasets are more optimized, faster, and easier to understand; aggregations are easy to apply |
| No notion of event time; it works only with the timestamp at which the data is received. Based on this ingestion timestamp, Spark Streaming puts data into a batch even if the event was generated earlier and belongs to an earlier batch, which can produce less accurate results, effectively equivalent to data loss | With event-time windowing and handling of late data, Structured Streaming outweighs Spark Streaming (see the windowing sketch below) |
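To make the micro-batch model concrete, here is a minimal DStreams word-count sketch in PySpark. The host, port, app name, and 5-second batch interval are assumptions for illustration; it expects a plain text server on localhost:9999 (for example, started with `nc -lk 9999`). Every batch interval, the lines received in that window arrive as one RDD.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # each micro-batch covers 5 seconds

# A DStream is a sequence of RDDs, one per batch interval
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # prints the counts for each micro-batch separately

ssc.start()
ssc.awaitTermination()
```

Note that each batch is counted in isolation: the output is a fresh word count per 5-second RDD, not a running total.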
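The same word count in Structured Streaming, sketched below under the same localhost:9999 assumption, reads the socket as an unbounded DataFrame, so ordinary DataFrame/SQL operations apply directly. The `outputMode` call is where Complete, Append, and Update come in.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# The socket source appears as an unbounded table with a single 'value' column
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()  # a running aggregation over the whole stream

# outputMode controls how the unbounded result table is emitted:
# "complete" = whole table, "append" = new rows only, "update" = changed rows
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```

Unlike the DStreams version, the aggregation here is maintained across the entire stream, and the output mode decides how much of that result table is written out each trigger.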
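Finally, a sketch of the event-time windowing the last table row refers to. The socket source's `includeTimestamp` option attaches an arrival timestamp as a stand-in for a real event-time column (in production the timestamp would come from the data itself), and `withWatermark` tells Spark how long to wait for late data before finalizing a window. Window and watermark durations below are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("EventTimeWindowedCount").getOrCreate()

# Socket source with a timestamp attached to each line: columns (value, timestamp)
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .option("includeTimestamp", True)
              .load())

words = lines.select(
    explode(split(lines.value, " ")).alias("word"),
    lines.timestamp,
)

windowed_counts = (
    words
        # tolerate data up to 10 minutes behind the max observed event time
        .withWatermark("timestamp", "10 minutes")
        # count words per 10-minute window, sliding every 5 minutes
        .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word)
        .count()
)

query = (windowed_counts.writeStream
                        .outputMode("update")
                        .format("console")
                        .option("truncate", False)
                        .start())
query.awaitTermination()
```

A late-arriving event is placed into the window matching its event time rather than its arrival time, which is exactly what DStreams cannot do.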

