Avoid shuffling wherever possible. If a shuffle is necessary, use an operation that applies a combiner
groupByKey does not use a combiner, so it should be avoided for aggregations (not optimized)
Every key-value pair is sent over the network, so the amount of data shuffled is large
reduceByKey uses a combiner (map-side combine), so it is optimized
Only one partially combined value per key per partition is sent over the network, so the amount of data shuffled is much smaller
Combiner:
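It computes intermediate values for each key within each partition before the shuffle. This does not avoid the shuffle, but it reduces the number of records that have to be shuffled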
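
The effect of a combiner can be sketched in plain Python. This is only a simulation of the idea, not Spark itself: the partitions, the function names, and the sample data are all hypothetical, and a real shuffle also involves partitioning by key and serialization, which are left out here.

```python
# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def shuffle_without_combiner(parts):
    """groupByKey-style: every record is sent across the network unchanged."""
    return [pair for part in parts for pair in part]


def shuffle_with_combiner(parts, combine):
    """reduceByKey-style: combine values per key inside each partition first."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        # Only one record per key leaves this partition.
        shuffled.extend(local.items())
    return shuffled


def reduce_side(records, combine):
    """Final merge of the shuffled records on the receiving side."""
    result = {}
    for key, value in records:
        result[key] = combine(result[key], value) if key in result else value
    return result


def add(x, y):
    return x + y


plain = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions, add)

print("records shuffled without combiner:", len(plain))
print("records shuffled with combiner:", len(combined))
print("same final result:", reduce_side(plain, add) == reduce_side(combined, add))
```

Both paths produce the same totals, but the combiner path shuffles one record per key per partition instead of one record per input pair. This only works when the combining function is associative and commutative, as addition is here.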