These are the components responsible for planning, execution and scheduling.
SparkContext : Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Cluster manager : An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node : Any node that can run application code in the cluster.
Executor : A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
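To make these pieces concrete, here is a minimal sketch of a driver program that creates a SparkContext and uses it to build an RDD, a broadcast variable and an accumulator. The local[*] master, app name and sample numbers are assumptions for illustration only; on a real cluster the master URL would point at your cluster manager.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ComponentsDemo {
  def main(args: Array[String]): Unit = {
    // The SparkContext is the entry point; "local[*]" runs executors in-process for testing.
    val conf = new SparkConf().setAppName("ComponentsDemo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // RDD created through the SparkContext; its partitions are processed by executors.
    val numbers = sc.parallelize(1 to 100, 4)

    // Broadcast variable: a read-only value shipped once to every executor.
    val factor = sc.broadcast(10)

    // Accumulator: executors add to it, only the driver reads the final value.
    val evenCount = sc.longAccumulator("evenCount")

    val scaled = numbers.map { n =>
      if (n % 2 == 0) evenCount.add(1)
      n * factor.value
    }

    // sum() is an action, so it triggers the actual computation on the executors.
    println(s"sum = ${scaled.sum()}, evens seen = ${evenCount.value}")
    sc.stop()
  }
}
```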
Execution in Spark :-
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Once you create the SparkContext (sc), each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Next, the driver sends your application code (defined by JAR or Python files passed to the SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run. The driver can send them commands like reduceByKey, flatMap, etc. When the driver quits, it shuts down the executors too. A new process is started on each worker when the SparkContext is constructed.
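The sketch below illustrates this flow under the same assumptions as above (local[*] master, made-up input lines). The code in main runs in the driver JVM; the closures passed to flatMap, map and reduceByKey are serialized and sent to the executors, which run them as tasks on their partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Driver program: constructing the SparkContext launches executors on the workers.
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCountDriver").setMaster("local[*]"))

    val lines  = sc.parallelize(Seq("spark runs on a cluster", "the driver schedules tasks"))
    val counts = lines
      .flatMap(_.split("\\s+"))        // narrow transformation, no shuffle
      .map(word => (word, 1))
      .reduceByKey(_ + _)              // wide transformation, shuffle boundary

    counts.collect().foreach(println)  // action: driver submits tasks to the executors

    sc.stop()                          // stopping the driver shuts down the executors too
  }
}
```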
Summary :
An application in Spark is executed in three steps:
1. Create the RDD graph, i.e. a DAG representing the entire computation.
2. Create the stage graph, i.e. a DAG of stages that is a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
3. Based on this plan, schedule and execute tasks on the workers.
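You can see steps 1 and 2 reflected in an RDD's lineage with toDebugString, and step 3 happens when an action is called. Again, the local[*] master and sample lines below are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageInspection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StageInspection").setMaster("local[*]"))

    // Step 1: transformations build the RDD graph (DAG) lazily; nothing runs yet.
    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Step 2: the lineage shows where the graph is cut into stages; the indented
    // "+-" entry in the output marks the shuffle introduced by reduceByKey.
    println(counts.toDebugString)

    // Step 3: an action makes the scheduler submit the stages as task sets
    // to the executors on the workers.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```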