Spark Components
Spark Core
Spark Core contains
the basic functionality of Spark, required for running jobs and needed by the
other components. It includes components for task scheduling, memory
management, fault recovery, and interacting with storage systems.
Spark Core is also
home to the API that defines resilient distributed datasets (RDDs), which are
Spark’s main programming abstraction. RDDs represent a collection of items
distributed across many compute nodes that can be manipulated in parallel.
Spark Core contains
logic for accessing various file systems, such as HDFS, GlusterFS, Amazon S3,
and so on.
Spark SQL
Spark SQL is Spark’s
package for working with structured data. It allows querying data via SQL as
well as the Apache Hive variant of SQL—called the Hive Query Language (HQL).
Spark SQL can also be
used for reading and writing data to and from various structured formats and
data sources, such as JSON, Parquet files, relational databases, and Hive.
Spark SQL also
includes a Thrift server, which can be used by external systems, such as
business intelligence tools, to query data through Spark SQL using the classic
JDBC and ODBC protocols.
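A hedged sketch of querying structured data through Spark SQL, again assuming a local PySpark installation; the table name `people` and the two rows are illustrative:

```python
# Spark SQL sketch: register structured data, then query it with plain SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("sql-sketch")
         .getOrCreate())

df = spark.createDataFrame([("Alice", 34), ("Bob", 23)], ["name", "age"])

# Registering a temporary view makes the DataFrame queryable via SQL.
df.createOrReplaceTempView("people")

rows = spark.sql("SELECT name FROM people WHERE age > 30").collect()
names = [r["name"] for r in rows]
print(names)   # ['Alice']
spark.stop()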
Spark Streaming
Spark Streaming is a
framework for ingesting and processing real-time streaming data from various
sources. Examples of data streams include log files generated by production web
servers, or queues of messages containing status updates posted by users of a
web service.
The supported
streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom
ones.
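Conceptually, Spark Streaming discretizes a live stream into small time-based batches and runs a Spark job on each batch. A plain-Python sketch of that micro-batch idea (no Spark required; the log-level data and the batch size are illustrative):

```python
# Conceptual micro-batch sketch (plain Python, not the Spark Streaming API):
# a stream is chopped into small batches, each processed as one small job.
from collections import Counter

stream = ["error", "ok", "ok", "error", "ok", "error"]  # e.g. log levels
batch_size = 2

counts = Counter()
for i in range(0, len(stream), batch_size):
    batch = stream[i:i + batch_size]   # one "micro-batch" of records
    counts.update(batch)               # the per-batch processing job

print(counts["error"], counts["ok"])   # 3 3
```

In real Spark Streaming, each micro-batch is an RDD, so the same transformations used on static data apply to streaming data.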
MLlib
Spark comes with a
library containing common machine learning (ML) functionality, called MLlib.
MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as
supporting functionality such as model evaluation and data import.
GraphX
GraphX is a library
for manipulating graphs (e.g., a social network’s friend graph) and performing
graph-parallel computations. Graphs are data structures comprising vertices
and the edges connecting them.
GraphX provides
functions for building graphs, represented as "graph RDDs": EdgeRDD
and VertexRDD. GraphX contains
implementations of the most important graph algorithms, such as PageRank,
connected components, shortest paths, SVD++, and others.
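GraphX itself is a Scala API, so to keep these illustrations in Python, here is a plain-Python sketch of the graph-parallel style it uses for connected components: every vertex repeatedly adopts the smallest label seen among itself and its neighbors until nothing changes. The edge list is illustrative:

```python
# Plain-Python sketch of a graph-parallel connected-components pass, in the
# spirit of GraphX's vertex/edge model (GraphX itself is a Scala library).
edges = [(1, 2), (2, 3), (4, 5)]   # two components: {1, 2, 3} and {4, 5}
vertices = {v for e in edges for v in e}

# Each vertex starts labeled with its own id.
label = {v: v for v in vertices}

changed = True
while changed:
    changed = False
    for a, b in edges:
        # Each edge propagates the smaller label to both endpoints.
        low = min(label[a], label[b])
        if label[a] != low or label[b] != low:
            label[a] = label[b] = low
            changed = True

# Component {1, 2, 3} converges to label 1; component {4, 5} to label 4.
print(label)
```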
Cluster Managers
To maximize
flexibility, Spark can run on a variety of cluster managers, including Hadoop
YARN, Apache Mesos, and a simple cluster manager included in Spark itself,
called the Standalone Scheduler.
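The cluster manager is typically chosen at submission time through the --master flag of spark-submit. A hedged sketch, where app.py and the host URLs are placeholders for your own application and cluster endpoints:

```shell
# Illustrative spark-submit invocations; app.py and the master URLs
# are placeholders, not real endpoints.
spark-submit --master local[4] app.py           # local threads, no cluster
spark-submit --master spark://host:7077 app.py  # Standalone Scheduler
spark-submit --master yarn app.py               # Hadoop YARN
spark-submit --master mesos://host:5050 app.py  # Apache Mesos
```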