Sunday 1 May 2016

Spark Workflow


These are the components responsible for planning, scheduling, and executing a Spark application.


Driver program : The process running the main() function of the application and creating the SparkContext (a minimal sketch follows this list).

SparkContext : The main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.

Cluster manager : An external service for acquiring resources on the cluster (e.g. the standalone manager, Mesos, YARN).

Worker node : Any node that can run application code in the cluster.

Executor : A process launched for an application on a worker node, which runs tasks and keeps data in memory or on disk across them. Each application has its own executors.
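
For concreteness, here is a minimal sketch of a driver program, assuming a hypothetical application name and a standalone master URL (neither comes from this post): main() builds a SparkConf, creates the SparkContext, and stops it on the way out.

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("minimal-driver")        // hypothetical application name
      .setMaster("spark://master:7077")    // e.g. a standalone cluster manager URL
    val sc = new SparkContext(conf)        // connection to the cluster; entry point for RDDs

    // ... create RDDs, accumulators and broadcast variables here ...

    sc.stop()                              // shuts down this application's executors
  }
}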


   



Execution in Spark :-

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Once you create the SparkContext (sc), each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program.
Next, the driver sends your application code (defined by the JAR or Python files passed to the SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run. The driver can send them commands such as reduceByKey, flatMap, etc.
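
For example, the driver might build a pipeline like the following sketch; the input path and the word-count logic are only illustrative, not from this post:

val counts = sc.textFile("hdfs:///tmp/input.txt")    // hypothetical input path
  .flatMap(line => line.split("\\s+"))               // narrow transformation, runs as tasks on executors
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                // wide transformation: needs a shuffle between executors

counts.take(10).foreach(println)                     // action: the driver schedules the tasks and gathers results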

When the driver quits, it also shuts down the executors. A new executor process is started on each worker when the SparkContext is constructed.

Summary : 
An application in Spark is executed in three steps:
1. Create the RDD graph, i.e. a DAG representing the entire computation.
2. Create the stage graph, i.e. a DAG of stages that forms a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries (see the lineage sketch below).
3. Based on this plan, schedule and execute tasks on the workers.
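
As a rough illustration of steps 1 and 2, reusing the hypothetical counts RDD from the word-count sketch above, toDebugString prints the RDD graph; the indentation break at the ShuffledRDD marks the shuffle boundary where the graph is cut into two stages.

println(counts.toDebugString)
// The lineage looks roughly like this (two stages, split at reduceByKey's shuffle):
// (2) ShuffledRDD[4] at reduceByKey ...        <- second stage, after the shuffle
//  +-(2) MapPartitionsRDD[3] at map ...        <- first stage: narrow transformations
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  MapPartitionsRDD[1] at textFile ...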
