Role of SparkContext?
SparkContext acts as the bridge between the cluster (execution environment) and the Driver
Sets how much memory each executor gets
Sets the number of cores per executor
The SparkContext is used by the Driver process of the Spark application to establish communication with the cluster and the resource manager, and to coordinate and execute jobs.
SparkContext also gives access to two other contexts, namely SQLContext and HiveContext
Differences between SparkContext & SparkSession
SparkContext is the entry point for the RDD API; one per JVM
SparkSession (Spark 2.0+) is the unified entry point that wraps SparkContext, SQLContext and HiveContext
spark.sparkContext gives you the underlying SparkContext
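As a minimal sketch (the app name and master are placeholders for local testing), a SparkSession can be built and the underlying SparkContext obtained from it:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession (Spark 2.0+); "local[*]" is only for local testing
val spark = SparkSession.builder()
  .appName("demo-app")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// The underlying SparkContext is still available for the RDD API
val sc = spark.sparkContext
```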
Spark Driver Vs Executor
The Driver ships the application logic (jar) to the workers with the help of the SparkContext
The Driver only coordinates; it does not execute tasks itself
Executors, which run on the worker nodes, do the actual processing
Executor Vs Executor core
Executors, running on slave/worker nodes, do the processing
Executor cores are the number of CPU cores/threads tied to an Executor
E.g., 2 cores means 2 tasks can run in parallel inside that executor
Yarn client Vs cluster mode?
In client mode, the Driver runs on the machine that submitted the job; good for interactive use (spark-shell)
In cluster mode, the Driver runs inside the Application Master container on the cluster; good for production jobs
makeRDD Vs parallelize
Both behave the same; makeRDD internally calls parallelize (an overload of makeRDD additionally lets you give preferred locations per partition)
val data = List(1,2,3,4)
val rdd = sc.parallelize(data)
or
val rdd = sc.makeRDD(data)
rdd.toDebugString
val data = List(1,2,3,4)
val rdd = sc.makeRDD(data)
val rdd1 = rdd.map(x => x+1)
val rdd2 = rdd1.map(x => x*2)
val rdd3 = rdd2.map(x => x+2)
rdd3.toDebugString
Yarn Vs Spark Fault Tolerance
Yarn and Spark have two different roles
Spark can run on Yarn / Mesos / Standalone
Yarn is a resource manager
Yarn fault tolerance includes restarting the AM (Application Master) & containers
Spark is an execution framework
Spark fault tolerance includes recomputing lost partitions when there is a failure
Using lineage, Spark recomputes the lost partitions
Transformation Vs Action
A transformation results in a new (intermediate) RDD and is lazy; nothing executes until an action is called
map on an RDD results in another RDD
Examples: map, filter, flatMap
An action triggers actual execution and returns a result to the driver (or writes output)
Examples: show, collect, count, take
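A minimal sketch of this laziness (assuming a SparkContext named sc is in scope): the transformations only build up lineage, and count is what triggers execution:

```scala
val nums = sc.parallelize(List(1, 2, 3, 4))

// Transformations: lazy, they just record the lineage
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ > 4)

// Action: triggers the actual computation
val n = evens.count()   // 6 and 8 pass the filter, so n = 2
```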
Map Vs FlatMap
Both are transformations
Map -> one to one
FlatMap -> one to many
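A small sketch (assuming sc is in scope) showing the difference on lines of text:

```scala
val lines = sc.parallelize(List("hello world", "hi"))

// map: exactly one output element per input element
val wordArrays = lines.map(_.split(" "))   // RDD[Array[String]], 2 elements

// flatMap: each input element can produce many output elements
val words = lines.flatMap(_.split(" "))    // RDD[String], 3 elements: hello, world, hi
```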
map() Vs mapPartitions()
map()
To process every row in the RDD, one at a time
mapPartitions()
To process a whole partition at a time
With map(), any per-row setup is repeated for every row
E.g., searching a value in a database for every row
means connecting to the DB & closing the connection for every row
So instead of that, we use mapPartitions(): per partition, we call the DB only once, not for every row
This gives a huge performance boost
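A sketch of the idea, assuming an RDD[String] named rdd; the helpers openConnection/lookup and the DbConn type are hypothetical stand-ins for a real DB client, not a real API:

```scala
// Hypothetical placeholders for a real DB client:
trait DbConn { def close(): Unit }
def openConnection(): DbConn = ???
def lookup(conn: DbConn, row: String): String = ???

val enriched = rdd.mapPartitions { rows =>
  val conn = openConnection()                          // open once per partition, not per row
  val result = rows.map(r => lookup(conn, r)).toList   // reuse the connection for every row
  conn.close()                                         // close once per partition
  result.iterator
}
```

Materializing with toList before close() is deliberate: iterators are lazy, so without it the rows could be pulled after the connection is already closed.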
Narrow Vs Wide Transformations
Narrow: each partition of the parent RDD is used by at most one partition of the child; no shuffle (map, filter, flatMap)
Wide: a child partition may depend on many parent partitions; needs a shuffle (reduceByKey, groupByKey, join)
A wide transformation introduces a stage boundary in the DAG
Lineage Vs DAG
Lineage
Lineage is the logical plan which tells how an RDD is created by applying multiple transformations on its parent RDD
E.g., if you derive RDDs from a parent RDD:
rdd -> rdd1 -> rdd2 -> rdd3 (only transformations)
rdd3.toDebugString
DAG
The DAG is the physical plan
When an action is called, the logical plan (lineage) is submitted to the DAG scheduler
The DAG scheduler splits the plan into multiple stages of tasks at shuffle boundaries
The DAG has more info:
How stages depend on each other
Which stages can run in parallel
DAG
The Driver reads the code & prepares the DAG
Then it creates the stages
Then, for each stage, it performs the data processing using tasks
Each task processes one partition (for HDFS input, one block, e.g. 64 MB or 128 MB depending on the block size)
E.g., word count:
Stage-0
Map (key-value pairs)
Narrow transformations (textFile, flatMap, map)
Stage-1
Shuffle & reduce
reduceByKey is a wide transformation, so it starts a new stage
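The two stages above can be sketched as a word count (assuming sc is in scope; the HDFS path is a placeholder):

```scala
// Stage-0: narrow transformations, pipelined together in one stage
val words = sc.textFile("hdfs:///tmp/input.txt")   // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Stage-1: reduceByKey is wide -> shuffle, so it begins a new stage
val counts = words.reduceByKey(_ + _)

counts.collect()   // action: triggers both stages
```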
Cache Vs Persist
cache() is persist with MEMORY_ONLY
Limited flexibility as it is always MEMORY_ONLY
persist() lets you choose the storage level
More flexible in choosing the mode:
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
MEMORY_ONLY
…
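A short sketch (assuming an existing RDD named rdd):

```scala
import org.apache.spark.storage.StorageLevel

rdd.cache()                                 // same as rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.unpersist()

rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk if they don't fit in memory
```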
AggregateByKey Vs CombineByKey
combineByKey is the most general per-key aggregation: you supply createCombiner, mergeValue and mergeCombiners functions
aggregateByKey takes a zero value plus seqOp/combOp functions, and is implemented on top of combineByKey
Both aggregate on the map side before the shuffle, unlike groupByKey
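A sketch computing a per-key average both ways (assuming sc is in scope):

```scala
val pairs = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2)))

// aggregateByKey: zero value + within-partition op + cross-partition op
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge partial (sum, count) pairs
)

// combineByKey: the general form that aggregateByKey is built on
val sumCount2 = pairs.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
)

val avg = sumCount.mapValues { case (s, c) => s.toDouble / c } // ("a", 2.0), ("b", 2.0)
```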
DataFrame Vs DataSet
A DataFrame is a Dataset[Row]: untyped, schema errors surface only at runtime
A Dataset[T] is typed: field errors are caught at compile time (Scala/Java only)
RDD Vs DataFrame Vs DataSet
RDD: low-level, typed, no Catalyst optimization; you control how the work is done
DataFrame: declarative, optimized by Catalyst/Tungsten, untyped rows
Dataset: Catalyst optimization plus compile-time type safety
Repartition Vs Coalesce
repartition(n) does a full shuffle and can increase or decrease the number of partitions
coalesce(n) avoids a full shuffle by merging existing partitions, so by default it can only decrease them
Use coalesce to shrink partitions cheaply, e.g. before writing output
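A short sketch (assuming sc is in scope):

```scala
val rdd = sc.parallelize(1 to 100, 8)   // start with 8 partitions

val more  = rdd.repartition(16)         // full shuffle: 8 -> 16 partitions
val fewer = rdd.coalesce(2)             // merges partitions, no full shuffle: 8 -> 2

println(fewer.getNumPartitions)         // 2
```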
Spark window functions
Needed when we have to calculate a value for each row, considering some group of rows
E.g., an avg salary column, not across all rows but per dept
Types:
Ranking: rank, dense_rank, row_number
Analytical: lead, lag
Aggregate: sum, avg, min, max over a window
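The per-dept average can be sketched like this (assumes a SparkSession and a DataFrame df with dept and salary columns; these names are placeholders):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Average salary per dept, attached to every row of that dept
val byDept  = Window.partitionBy("dept")
val withAvg = df.withColumn("avg_dept_salary", avg("salary").over(byDept))
```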