May 13, 2023

Spark Interview Questions

 Role of Spark context?

  • SparkContext acts as a bridge between the Cluster (execution environment) and the Driver

  • Sets how much memory each executor gets (see the sketch below)

  • Sets the number of cores

  • The SparkContext is used by the Driver process of the Spark application to establish communication with the cluster and the resource manager, and to coordinate and execute jobs.

  • SparkContext also enables access to the other two contexts, namely SQLContext and HiveContext
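A minimal sketch (Spark 1.x style) of how the Driver creates a SparkContext and passes the memory/core settings through a SparkConf; the app name, master and values below are placeholders, not anything prescribed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-app")            // hypothetical app name
  .setMaster("yarn")                    // or "local[*]" for local testing
  .set("spark.executor.memory", "4g")   // how much memory per executor
  .set("spark.executor.cores", "2")     // how many cores per executor

val sc = new SparkContext(conf)         // the bridge between the Driver and the cluster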


Differences between SparkContext & SparkSession


Spark Core (Spark Context)

  • In Spark 1.x, three entry points were introduced: SparkContext, SQLContext and HiveContext

    • To use Hive functionality, use HiveContext

    • To use SQL functionality, use SQLContext

val sc = new SparkContext(sparkConf)

val sqlContext = new SQLContext(sc)

val hiveContext = new HiveContext(sc)

  • Can only process Text/CSV files

Spark SQL (Spark Session)

  • From Spark 2.0 the entry point is SparkSession

  • SparkSession replaced HiveContext and SQLContext

  • Additionally, it gives developers direct access to the SparkContext

// Access the SparkContext from a SparkSession (sparkContext is the public API;
// the _sc seen in some examples is an internal attribute and should be avoided)
val spark_context = sparkSession.sparkContext

  • Can process Text, CSV, Parquet, ORC, AVRO, JSON, S3, MySQL, ORACLE, HBASE, CASSANDRA

  • Mostly processes structured & semi-structured data
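A minimal sketch of how the sparkSession used above would typically be created in Spark 2.x; enableHiveSupport() takes over what HiveContext used to do (the app name and master are placeholders):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("example-app")      // hypothetical app name
  .master("local[*]")          // or "yarn" on a cluster
  .enableHiveSupport()         // replaces the old HiveContext
  .getOrCreate()

val sc = sparkSession.sparkContext   // the underlying SparkContext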

Spark Context

  • RDD - the high level representation of data in Spark Core

  • RDD = only data

  • No Data Source API

  • Slow

  • RDDs are Immutable, Cacheable and Lazy

Spark Session

  • DataFrames - the high level representation of data in Spark SQL

  • DataFrame = RDD (data) + Schema

  • Data Source API (universal API) - the input/output module of Spark SQL; reads data from any file system and writes the data to any file system

  • Spark Catalyst Optimizer - Spark SQL automatically optimizes big queries

  • Tungsten - memory management

  • Fast - Spark SQL runs faster due to the Catalyst Optimizer (query optimizer) & Tungsten (memory management)

  • DataFrames are Immutable, Cacheable and Lazy
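A minimal sketch of the Data Source API mentioned above (read from one source, write to another); the paths and formats are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("datasource-demo").getOrCreate()

// Read from one source...
val df = spark.read
  .format("json")                      // could be csv, parquet, orc, avro, jdbc, ...
  .load("/tmp/input/people.json")      // hypothetical path

// ...and write to a completely different one
df.write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/output/people_parquet")  // hypothetical path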

Spark Driver Vs Executor

  • The Driver ships the application logic (jar) to the workers with the help of the SparkContext

    • The Driver only coordinates; it does not execute tasks itself

  • Executors run on the worker nodes and do the actual processing


Executor Vs Executor core

  • Executors run on the slave/worker nodes and do the processing

  • Executor cores are the number of CPU cores/threads assigned to an executor

    • E.g., 2 cores means 2 tasks can run in parallel inside that executor (see the sketch below)
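A hedged sketch of how executors and executor cores are typically configured; the numbers are purely illustrative:

import org.apache.spark.sql.SparkSession

// Illustrative sizing: 4 executors x 2 cores each = up to 8 tasks running in parallel
val spark = SparkSession.builder()
  .appName("executor-sizing-demo")
  .config("spark.executor.instances", "4")   // number of executors
  .config("spark.executor.cores", "2")       // cores (parallel tasks) per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()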

Yarn client Vs cluster mode?


Client Mode

  • Driver runs on the Edge node (i.e., outside the cluster)

  • Driver (on the Edge node) + Containers/Executors (in the cluster)

  • As the Driver (master) runs on the Edge node, the Resource Manager creates an (empty) Application Master container on one of the nodes in the cluster (not necessarily where the data is)

  • The Node Managers then create executor containers on the data nodes where the data is present

  • Executors are the slaves that do the processing of the container with jar + data

  • Less safe: if the Edge node goes down, the job can't continue

  • Communication between the Master (Driver, outside the cluster) and the Slaves (executors, inside the cluster) goes from outside the cluster to inside the cluster

  • Used for short-running jobs, e.g., testing or running SQL queries and seeing the results immediately

Cluster Mode

  • Driver runs inside the cluster, within the Application Master container on one of the nodes (not necessarily where the data is)

  • Driver (in the cluster) + Containers/Executors (in the cluster)

  • The Node Managers create executor containers on the data nodes where the data is present (same as client mode)

  • Executors are the slaves that do the processing (same as client mode)

  • More safe & highly available: if the node running the Application Master/Driver goes down, YARN (the Resource Manager) restarts the Application Master, and with it the Driver, on another node

  • Communication between the Master (Driver) and the Slaves (executors) stays within the cluster

  • Used for long-running jobs
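The deploy mode itself is chosen when the job is submitted (e.g., with spark-submit's --deploy-mode flag). A small hedged sketch showing how an application can check which mode it ended up in:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deploy-mode-check").getOrCreate()

// "spark.submit.deployMode" is set to "client" or "cluster" by spark-submit
val mode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
println(s"Driver is running in $mode mode")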

makeRDD Vs parallelize

  • Both are the same - makeRDD internally calls parallelize

val data = List(1, 2, 3, 4)

val rdd = sc.parallelize(data)

// or

val rdd = sc.makeRDD(data)


rdd.toDebugString

val data = List(1, 2, 3, 4)

val rdd = sc.makeRDD(data)

val rdd1 = rdd.map(x => x + 1)

val rdd2 = rdd1.map(x => x * 2)

val rdd3 = rdd2.map(x => x + 2)

// prints the lineage of rdd3 (all the transformations applied on the parent RDD)
rdd3.toDebugString


Yarn Vs Spark Fault Tolerance

  • Yarn and Spark have two different roles

  • Spark can run on Yarn / Mesos / Standalone

  • Yarn is a resource manager

    • Yarn fault tolerance includes restarting the AM (Application Master) & containers

  • Spark is an execution framework

    • Spark fault tolerance includes recovering partitions when there is a failure

    • Using lineage, Spark recovers the lost partitions

Transformation Vs Action

  • Transformations are lazy and result in an intermediate RDD

    • map on an RDD results in another RDD

    • E.g., map, filter, flatMap

  • Actions trigger the actual execution and return a result to the driver (see the sketch below)

    • E.g., show, collect, count, take
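A minimal sketch of the difference, assuming sc is an existing SparkContext as in the earlier snippets; nothing is computed until the action is called:

val rdd = sc.parallelize(List(1, 2, 3, 4))

val doubled = rdd.map(_ * 2)        // transformation: lazy, just builds the lineage
val bigOnes = doubled.filter(_ > 4) // transformation: still nothing has run

val result = bigOnes.collect()      // action: triggers the job and returns Array(6, 8)
println(result.mkString(", "))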

Map Vs FlatMap

  • Both are transformations

  • map -> one to one (each input element produces exactly one output element)

  • flatMap -> one to many (each input element can produce zero or more output elements); see the sketch below
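A minimal sketch, assuming sc is an existing SparkContext:

val lines = sc.parallelize(List("hello world", "spark is fun"))

// map: one output element per input element (an RDD of 2 word arrays)
val arrays = lines.map(line => line.split(" "))

// flatMap: the inner collections are flattened into one RDD of 5 words
val words = lines.flatMap(line => line.split(" "))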


map() Vs mapPartitions()

  • map()

    • Processes every row in the RDD, one at a time

  • mapPartitions()

    • With map(), any setup work is repeated for every row

      • E.g., looking up a value in a database for every row means connecting to the DB & closing the connection once per row

    • With mapPartitions(), the function is called once per partition, so we connect to the DB only once per partition, not once per row (see the sketch below)

  • This gives a huge performance boost
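A hedged sketch of the pattern, assuming sc is an existing SparkContext; DbConnection and lookup are hypothetical stand-ins for whatever client library is actually used:

// Hypothetical DB client, used only for illustration
class DbConnection {
  def lookup(id: Int): String = s"value-for-$id"
  def close(): Unit = ()
}

val ids = sc.parallelize(1 to 1000, numSlices = 4)

// map would create (and close) one connection per row - expensive.
// mapPartitions: one connection per partition, reused for all rows in it.
val enriched = ids.mapPartitions { rows =>
  val conn = new DbConnection()                         // once per partition
  val out = rows.map(id => (id, conn.lookup(id))).toList // materialize before closing
  conn.close()
  out.iterator
}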


Narrow Vs Wide Transformations

Narrow Transformation

  • Won't cause shuffling

  • E.g., map, filter, flatMap, mapPartitions, union, sample, coalesce

Wide Transformation

  • Causes shuffling

  • E.g., reduceByKey, groupByKey, distinct, repartition, intersection, cartesian




Lineage Vs DAG

  • Lineage

    • Lineage is the logical plan which tells how an RDD can be (re)created by applying multiple transformations on its parent RDD

    • i.e., how the child RDDs are derived from the parent RDD

    • rdd -> rdd1 -> rdd2 -> rdd3 (only transformations)

    • rdd3.toDebugString

  • DAG

    • DAG is the physical plan

    • The DAG Scheduler translates the logical plan (lineage) into a physical plan split into multiple stages, based on the transformations and the action (for DataFrames/Spark SQL, the logical plan first goes through the Catalyst Optimizer)

    • The DAG has more info

      • How the stages depend on each other

      • Which stages can run in parallel

    • Flow

      • The Driver reads the code & prepares the DAG

      • Then it creates the stages

      • Then, for each stage, it performs the data processing using tasks

      • Each task processes one partition of the data (for HDFS input, typically one block, e.g., 64/128 MB)

      • Stage-0

        • Map (key-value pairs)

        • Narrow transformations (textFile, flatMap, map)

      • Stage-1

        • Shuffle & reduce

        • reduceByKey is a wide transformation, so it starts a new stage (see the word count sketch below)
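A minimal word count sketch, assuming sc is an existing SparkContext and the input path is a placeholder, showing where the stage boundary falls:

// Stage-0: narrow transformations - no shuffle
val lines = sc.textFile("/tmp/input.txt")     // hypothetical path
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Stage-1: reduceByKey is wide - the shuffle creates a new stage
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)             // action: triggers both stages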


Cache Vs Persist

  • cache() is MEMORY_ONLY

    • Limited flexibility as it is always MEMORY_ONLY

  • persist() lets you choose the storage level (see the sketch below)

    • More flexible in choosing the mode

    • MEMORY_AND_DISK

    • MEMORY_AND_DISK_SER

    • MEMORY_ONLY

    • ….
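A minimal sketch, assuming sc is an existing SparkContext:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd.cache()

// persist() lets you pick any storage level instead, e.g.:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk when memory is full
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // store serialized to save memory
// (an RDD's storage level can only be set once, hence the alternatives are commented out)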

AggregateByKey Vs CombineByKey
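Both are wide transformations on pair RDDs, and aggregateByKey is essentially a specialization of combineByKey: aggregateByKey starts from a zero value, while combineByKey starts from a createCombiner function applied to the first value of each key. A hedged sketch computing a per-key average with each, assuming sc is an existing SparkContext:

val sales = sc.parallelize(Seq(("a", 10.0), ("a", 20.0), ("b", 5.0)))

// aggregateByKey: zero value, then a within-partition and a cross-partition merge function
val avgA = sales
  .aggregateByKey((0.0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),      // merge a value into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2))      // merge two accumulators
  .mapValues { case (sum, count) => sum / count }

// combineByKey: createCombiner for the first value of a key, then the same two merges
val avgC = sales
  .combineByKey(
    (v: Double) => (v, 1),
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }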

DataFrame Vs DataSet


DataFrame

  • Introduced in Spark 1.3

  • No type safety - errors such as referring to a non-existent column only show up at runtime

  • It can run without a schema being defined up front as well

  • Consumes more memory than DataSets

  • Available in Java, Scala & Python

  • DataFrame = Dataset[Row]

DataSet

  • Introduced in Spark 1.6

  • Type safety - errors are caught at compile time

  • Schema is an integral part of a DataSet (a schema is a must)

  • Direct operations with user defined functions/lambdas

  • Combines the RDD's functional nature (map, filter, lambda functions) with the DataFrame's optimizations

  • More optimized, consumes less memory

  • Available only in Java & Scala (a single API for both)

  • A collection of typed objects, e.g., Dataset[Person]

  • Faster & more optimal

// df is an existing DataFrame with name/age columns;
// on a DataFrame the rows are untyped, so this does not compile
df.filter(x => x.age > 50).show()
// Error: value age is not a member of org.apache.spark.sql.Row

// On a DataSet the rows are typed, so the same lambda works
case class Person(name: String, age: Int)

import spark.implicits._   // spark is the active SparkSession
val ds = df.as[Person]

ds.filter(x => x.age > 50).show()
// works




RDD Vs DataFrame Vs DataSet


RDD

  • Type safe

  • Developer has to take care of optimization

  • Performance not as good as DataSets

  • Not memory efficient

DataFrame

  • Not type safe

  • Auto optimization using the Catalyst Optimizer

  • Performance not as good as DataSets

  • Not memory efficient

DataSet

  • Type safe

  • Auto optimized (using the Catalyst Optimizer)

  • Better performance

  • More memory efficient


Repartition Vs Coalesce

Repartition

  • Can increase or decrease the number of partitions

  • Involves a full shuffle

  • Costly operation (because of the shuffle), so it also takes more time

  • Used when the data needs to be redistributed, e.g., when the data is huge

  • The fundamental benefit of repartition is uniform data distribution across partitions

Coalesce

  • Can only decrease the number of partitions

  • Avoids a full shuffle (it merges existing partitions instead); see the sketch below

  • Not a costly operation, since there is no shuffle

  • Used when reducing the number of partitions, e.g., when the data is small

  • Coalesce can't guarantee uniform distribution - since it stitches partitions together, the output partitions might be of uneven size, which may or may not cause problems later, depending on how skewed they are
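A minimal sketch, assuming df is an existing DataFrame; the partition counts are illustrative:

// Repartition: full shuffle, can go up or down, gives evenly sized partitions
val wide = df.repartition(200)

// Coalesce: no full shuffle, can only go down, partitions may end up uneven
val narrow = df.coalesce(10)

println(wide.rdd.getNumPartitions)    // 200
println(narrow.rdd.getNumPartitions)  // 10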


Spark window functions

  • Used when we need to calculate a value for each row, considering some group of rows around it

    • E.g., an average salary column, not across all rows but per dept (see the sketch below)

  • Types:

    • Ranking

    • Analytical

    • Aggregate
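A minimal sketch of the "average salary per dept" example above, assuming spark is the active SparkSession; the column names and values are placeholders:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._

val employees = Seq(
  ("alice", "hr", 5000), ("bob", "hr", 7000), ("carol", "it", 9000)
).toDF("name", "dept", "salary")

// One value per row, computed over the rows of the same dept
val byDept = Window.partitionBy("dept")

employees
  .withColumn("avg_dept_salary", avg("salary").over(byDept))
  .show()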


Rank Vs Dense_rank
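The short answer: after a tie, rank() leaves a gap in the numbering (1, 1, 3) while dense_rank() does not (1, 1, 2). A minimal sketch, reusing the hypothetical employees DataFrame from the window-function sketch above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, dense_rank, desc}

val w = Window.partitionBy("dept").orderBy(desc("salary"))

employees
  .withColumn("rank", rank().over(w))             // ties: 1, 1, 3
  .withColumn("dense_rank", dense_rank().over(w)) // ties: 1, 1, 2
  .show()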



Catalyst Optimizer
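As noted earlier, Catalyst is Spark SQL's query optimizer that automatically optimizes DataFrame/SQL queries. One practical way to see what it produces is explain(), which prints the logical plans before and after optimization plus the physical plan; a minimal sketch, reusing the hypothetical employees DataFrame from above:

import org.apache.spark.sql.functions.col

// Prints the parsed/analyzed/optimized logical plans and the physical plan
employees.filter(col("salary") > 6000).explain(true)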

