May 13, 2023

Spark Interview Questions

 Role of Spark context?

  • SparkContext acts as a bridge between the Cluster (execution environment) and the Driver

  • Sets how much memory each executor gets (see the sketch below)

  • Sets the number of cores

  • The SparkContext is used by the Driver process of the Spark application to establish communication with the cluster and the resource manager, and to coordinate and execute jobs.

  • SparkContext also enables access to the other two contexts, namely SQLContext and HiveContext
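A minimal sketch (Spark 1.x style) of how the Driver creates a SparkContext and passes the memory/core settings through a SparkConf; the app name, master and values below are placeholders, not anything prescribed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-app")            // hypothetical app name
  .setMaster("yarn")                    // or "local[*]" for local testing
  .set("spark.executor.memory", "4g")   // how much memory per executor
  .set("spark.executor.cores", "2")     // how many cores per executor

val sc = new SparkContext(conf)         // the bridge between the Driver and the cluster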


Differences between SparkContext & SparkSession


Spark Core (Spark Context)

  • In Spark 1.x, three entry points were introduced: SparkContext, SQLContext and HiveContext

    • To use Hive functionality, use HiveContext

    • To use SQL functionality, use SQLContext

val sc = new SparkContext(sparkConf)

val sqlContext = new SQLContext(sc)

val hiveContext = new HiveContext(sc)

  • Can only process Text/CSV files

Spark SQL (Spark Session)

  • From Spark 2.0 the entry point is SparkSession

  • SparkSession replaced HiveContext and SQLContext

  • Additionally, it gives developers direct access to the SparkContext

// Access the SparkContext from a SparkSession (sparkContext is the public API;
// the _sc seen in some examples is an internal attribute and should be avoided)
val spark_context = sparkSession.sparkContext

  • Can process Text, CSV, Parquet, ORC, AVRO, JSON, S3, MySQL, ORACLE, HBASE, CASSANDRA

  • Mostly processes structured & semi-structured data
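A minimal sketch of how the sparkSession used above would typically be created in Spark 2.x; enableHiveSupport() takes over what HiveContext used to do (the app name and master are placeholders):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("example-app")      // hypothetical app name
  .master("local[*]")          // or "yarn" on a cluster
  .enableHiveSupport()         // replaces the old HiveContext
  .getOrCreate()

val sc = sparkSession.sparkContext   // the underlying SparkContext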

Spark Context

  • RDD - the high level representation of data in Spark Core

  • RDD = only data

  • No Data Source API

  • Slow

  • RDDs are Immutable, Cacheable and Lazy

Spark Session

  • DataFrames - the high level representation of data in Spark SQL

  • DataFrame = RDD (data) + Schema

  • Data Source API (universal API) - the input/output module of Spark SQL; reads data from any file system and writes the data to any file system

  • Spark Catalyst Optimizer - Spark SQL automatically optimizes big queries

  • Tungsten - memory management

  • Fast - Spark SQL runs faster due to the Catalyst Optimizer (query optimizer) & Tungsten (memory management)

  • DataFrames are Immutable, Cacheable and Lazy
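A minimal sketch of the Data Source API mentioned above (read from one source, write to another); the paths and formats are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("datasource-demo").getOrCreate()

// Read from one source...
val df = spark.read
  .format("json")                      // could be csv, parquet, orc, avro, jdbc, ...
  .load("/tmp/input/people.json")      // hypothetical path

// ...and write to a completely different one
df.write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/output/people_parquet")  // hypothetical path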

Spark Driver Vs Executor

  • The Driver ships the application logic (jar) to the workers with the help of the SparkContext

    • The Driver only coordinates; it does not execute tasks itself

  • Executors run on the worker nodes and do the actual processing


Executor Vs Executor core

  • Executors run on the slave/worker nodes and do the processing

  • Executor cores are the number of CPU cores/threads assigned to an executor

    • E.g., 2 cores means 2 tasks can run in parallel inside that executor (see the sketch below)
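A hedged sketch of how executors and executor cores are typically configured; the numbers are purely illustrative:

import org.apache.spark.sql.SparkSession

// Illustrative sizing: 4 executors x 2 cores each = up to 8 tasks running in parallel
val spark = SparkSession.builder()
  .appName("executor-sizing-demo")
  .config("spark.executor.instances", "4")   // number of executors
  .config("spark.executor.cores", "2")       // cores (parallel tasks) per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()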

Yarn client Vs cluster mode?


Client Mode

  • Driver runs on the Edge node (i.e., outside the cluster)

  • Driver (on the Edge node) + Containers/Executors (in the cluster)

  • As the Driver (master) runs on the Edge node, the Resource Manager creates an (empty) Application Master container on one of the nodes in the cluster (not necessarily where the data is)

  • The Node Managers then create executor containers on the data nodes where the data is present

  • Executors are the slaves that do the processing of the container with jar + data

  • Less safe: if the Edge node goes down, the job can't continue

  • Communication between the Master (Driver, outside the cluster) and the Slaves (executors, inside the cluster) goes from outside the cluster to inside the cluster

  • Used for short-running jobs, e.g., testing or running SQL queries and seeing the results immediately

Cluster Mode

  • Driver runs inside the cluster, within the Application Master container on one of the nodes (not necessarily where the data is)

  • Driver (in the cluster) + Containers/Executors (in the cluster)

  • The Node Managers create executor containers on the data nodes where the data is present (same as client mode)

  • Executors are the slaves that do the processing (same as client mode)

  • More safe & highly available: if the node running the Application Master/Driver goes down, YARN (the Resource Manager) restarts the Application Master, and with it the Driver, on another node

  • Communication between the Master (Driver) and the Slaves (executors) stays within the cluster

  • Used for long-running jobs
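The deploy mode itself is chosen when the job is submitted (e.g., with spark-submit's --deploy-mode flag). A small hedged sketch showing how an application can check which mode it ended up in:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deploy-mode-check").getOrCreate()

// "spark.submit.deployMode" is set to "client" or "cluster" by spark-submit
val mode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
println(s"Driver is running in $mode mode")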

makeRDD Vs parallelize

  • Both are the same - makeRDD internally calls parallelize

val data = List(1, 2, 3, 4)

val rdd = sc.parallelize(data)

// or

val rdd = sc.makeRDD(data)


rdd.toDebugString

val data = List(1, 2, 3, 4)

val rdd = sc.makeRDD(data)

val rdd1 = rdd.map(x => x + 1)

val rdd2 = rdd1.map(x => x * 2)

val rdd3 = rdd2.map(x => x + 2)

// prints the lineage of rdd3 (all the transformations applied on the parent RDD)
rdd3.toDebugString


Yarn Vs Spark Fault Tolerance

  • Yarn and Spark have two different roles

  • Spark can run on Yarn / Mesos / Standalone

  • Yarn is a resource manager

    • Yarn fault tolerance includes restarting the AM (Application Master) & containers

  • Spark is an execution framework

    • Spark fault tolerance includes recovering partitions when there is a failure

    • Using lineage, Spark recovers the lost partitions

Transformation Vs Action

  • Transformations are lazy and result in an intermediate RDD

    • map on an RDD results in another RDD

    • E.g., map, filter, flatMap

  • Actions trigger the actual execution and return a result to the driver (see the sketch below)

    • E.g., show, collect, count, take
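A minimal sketch of the difference, assuming sc is an existing SparkContext as in the earlier snippets; nothing is computed until the action is called:

val rdd = sc.parallelize(List(1, 2, 3, 4))

val doubled = rdd.map(_ * 2)        // transformation: lazy, just builds the lineage
val bigOnes = doubled.filter(_ > 4) // transformation: still nothing has run

val result = bigOnes.collect()      // action: triggers the job and returns Array(6, 8)
println(result.mkString(", "))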

Map Vs FlatMap

  • Both are transformations

  • map -> one to one (each input element produces exactly one output element)

  • flatMap -> one to many (each input element can produce zero or more output elements); see the sketch below
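A minimal sketch, assuming sc is an existing SparkContext:

val lines = sc.parallelize(List("hello world", "spark is fun"))

// map: one output element per input element (an RDD of 2 word arrays)
val arrays = lines.map(line => line.split(" "))

// flatMap: the inner collections are flattened into one RDD of 5 words
val words = lines.flatMap(line => line.split(" "))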


map() Vs mapPartitions()

  • map()

    • Processes every row in the RDD, one at a time

  • mapPartitions()

    • With map(), any setup work is repeated for every row

      • E.g., looking up a value in a database for every row means connecting to the DB & closing the connection once per row

    • With mapPartitions(), the function is called once per partition, so we connect to the DB only once per partition, not once per row (see the sketch below)

  • This gives a huge performance boost
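A hedged sketch of the pattern, assuming sc is an existing SparkContext; DbConnection and lookup are hypothetical stand-ins for whatever client library is actually used:

// Hypothetical DB client, used only for illustration
class DbConnection {
  def lookup(id: Int): String = s"value-for-$id"
  def close(): Unit = ()
}

val ids = sc.parallelize(1 to 1000, numSlices = 4)

// map would create (and close) one connection per row - expensive.
// mapPartitions: one connection per partition, reused for all rows in it.
val enriched = ids.mapPartitions { rows =>
  val conn = new DbConnection()                         // once per partition
  val out = rows.map(id => (id, conn.lookup(id))).toList // materialize before closing
  conn.close()
  out.iterator
}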


Narrow Vs Wide Transformations

Narrow Transformation

  • Won't cause shuffling

  • E.g., map, filter, flatMap, mapPartitions, union, sample, coalesce

Wide Transformation

  • Causes shuffling

  • E.g., reduceByKey, groupByKey, distinct, repartition, intersection, cartesian




Lineage Vs DAG

  • Lineage

    • Lineage is the logical plan which tells how an RDD can be (re)created by applying multiple transformations on its parent RDD

    • i.e., how the child RDDs are derived from the parent RDD

    • rdd -> rdd1 -> rdd2 -> rdd3 (only transformations)

    • rdd3.toDebugString

  • DAG

    • DAG is the physical plan

    • The DAG Scheduler translates the logical plan (lineage) into a physical plan split into multiple stages, based on the transformations and the action (for DataFrames/Spark SQL, the logical plan first goes through the Catalyst Optimizer)

    • The DAG has more info

      • How the stages depend on each other

      • Which stages can run in parallel

    • Flow

      • The Driver reads the code & prepares the DAG

      • Then it creates the stages

      • Then, for each stage, it performs the data processing using tasks

      • Each task processes one partition of the data (for HDFS input, typically one block, e.g., 64/128 MB)

      • Stage-0

        • Map (key-value pairs)

        • Narrow transformations (textFile, flatMap, map)

      • Stage-1

        • Shuffle & reduce

        • reduceByKey is a wide transformation, so it starts a new stage (see the word count sketch below)
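A minimal word count sketch, assuming sc is an existing SparkContext and the input path is a placeholder, showing where the stage boundary falls:

// Stage-0: narrow transformations - no shuffle
val lines = sc.textFile("/tmp/input.txt")     // hypothetical path
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Stage-1: reduceByKey is wide - the shuffle creates a new stage
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)             // action: triggers both stages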


Cache Vs Persist

  • cache() is MEMORY_ONLY

    • Limited flexibility as it is always MEMORY_ONLY

  • persist() lets you choose the storage level (see the sketch below)

    • More flexible in choosing the mode

    • MEMORY_AND_DISK

    • MEMORY_AND_DISK_SER

    • MEMORY_ONLY

    • ….
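A minimal sketch, assuming sc is an existing SparkContext:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd.cache()

// persist() lets you pick any storage level instead, e.g.:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk when memory is full
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // store serialized to save memory
// (an RDD's storage level can only be set once, hence the alternatives are commented out)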

AggregateByKey Vs CombineByKey
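Both are wide transformations on pair RDDs, and aggregateByKey is essentially a specialization of combineByKey: aggregateByKey starts from a zero value, while combineByKey starts from a createCombiner function applied to the first value of each key. A hedged sketch computing a per-key average with each, assuming sc is an existing SparkContext:

val sales = sc.parallelize(Seq(("a", 10.0), ("a", 20.0), ("b", 5.0)))

// aggregateByKey: zero value, then a within-partition and a cross-partition merge function
val avgA = sales
  .aggregateByKey((0.0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),      // merge a value into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2))      // merge two accumulators
  .mapValues { case (sum, count) => sum / count }

// combineByKey: createCombiner for the first value of a key, then the same two merges
val avgC = sales
  .combineByKey(
    (v: Double) => (v, 1),
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }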

DataFrame Vs DataSet


DataFrame

  • Introduced in Spark 1.3

  • No type safety - errors such as referring to a non-existent column only show up at runtime

  • It can run without a schema being defined up front as well

  • Consumes more memory than DataSets

  • Available in Java, Scala & Python

  • DataFrame = Dataset[Row]

DataSet

  • Introduced in Spark 1.6

  • Type safety - errors are caught at compile time

  • Schema is an integral part of a DataSet (a schema is a must)

  • Direct operations with user defined functions/lambdas

  • Combines the RDD's functional nature (map, filter, lambda functions) with the DataFrame's optimizations

  • More optimized, consumes less memory

  • Available only in Java & Scala (a single API for both)

  • A collection of typed objects, e.g., Dataset[Person]

  • Faster & more optimal

// df is an existing DataFrame with name/age columns;
// on a DataFrame the rows are untyped, so this does not compile
df.filter(x => x.age > 50).show()
// Error: value age is not a member of org.apache.spark.sql.Row

// On a DataSet the rows are typed, so the same lambda works
case class Person(name: String, age: Int)

import spark.implicits._   // spark is the active SparkSession
val ds = df.as[Person]

ds.filter(x => x.age > 50).show()
// works




RDD Vs DataFrame Vs DataSet


RDD

  • Type safe

  • Developer has to take care of optimization

  • Performance not as good as DataSets

  • Not memory efficient

DataFrame

  • Not type safe

  • Auto optimization using the Catalyst Optimizer

  • Performance not as good as DataSets

  • Not memory efficient

DataSet

  • Type safe

  • Auto optimized (using the Catalyst Optimizer)

  • Better performance

  • More memory efficient


Repartition Vs Coalesce

Repartition

  • Can increase or decrease the number of partitions

  • Involves a full shuffle

  • Costly operation (because of the shuffle), so it also takes more time

  • Used when the data needs to be redistributed, e.g., when the data is huge

  • The fundamental benefit of repartition is uniform data distribution across partitions

Coalesce

  • Can only decrease the number of partitions

  • Avoids a full shuffle (it merges existing partitions instead); see the sketch below

  • Not a costly operation, since there is no shuffle

  • Used when reducing the number of partitions, e.g., when the data is small

  • Coalesce can't guarantee uniform distribution - since it stitches partitions together, the output partitions might be of uneven size, which may or may not cause problems later, depending on how skewed they are
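A minimal sketch, assuming df is an existing DataFrame; the partition counts are illustrative:

// Repartition: full shuffle, can go up or down, gives evenly sized partitions
val wide = df.repartition(200)

// Coalesce: no full shuffle, can only go down, partitions may end up uneven
val narrow = df.coalesce(10)

println(wide.rdd.getNumPartitions)    // 200
println(narrow.rdd.getNumPartitions)  // 10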


Spark window functions

  • Used when we need to calculate a value for each row, considering some group of rows around it

    • E.g., an average salary column, not across all rows but per dept (see the sketch below)

  • Types:

    • Ranking

    • Analytical

    • Aggregate
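A minimal sketch of the "average salary per dept" example above, assuming spark is the active SparkSession; the column names and values are placeholders:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._

val employees = Seq(
  ("alice", "hr", 5000), ("bob", "hr", 7000), ("carol", "it", 9000)
).toDF("name", "dept", "salary")

// One value per row, computed over the rows of the same dept
val byDept = Window.partitionBy("dept")

employees
  .withColumn("avg_dept_salary", avg("salary").over(byDept))
  .show()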


Rank Vs Dense_rank
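The short answer: after a tie, rank() leaves a gap in the numbering (1, 1, 3) while dense_rank() does not (1, 1, 2). A minimal sketch, reusing the hypothetical employees DataFrame from the window-function sketch above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, dense_rank, desc}

val w = Window.partitionBy("dept").orderBy(desc("salary"))

employees
  .withColumn("rank", rank().over(w))             // ties: 1, 1, 3
  .withColumn("dense_rank", dense_rank().over(w)) // ties: 1, 1, 2
  .show()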



Catalyst Optimizer
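As noted earlier, Catalyst is Spark SQL's query optimizer that automatically optimizes DataFrame/SQL queries. One practical way to see what it produces is explain(), which prints the logical plans before and after optimization plus the physical plan; a minimal sketch, reusing the hypothetical employees DataFrame from above:

import org.apache.spark.sql.functions.col

// Prints the parsed/analyzed/optimized logical plans and the physical plan
employees.filter(col("salary") > 6000).explain(true)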

