Saturday, 30 April 2016

Spark's Components

Spark Core
Spark Core contains the basic functionality of Spark that is required for running jobs and needed by the other components. This includes components for task scheduling, memory management, fault recovery, and interacting with storage systems.

Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.

Spark Core also contains the logic for accessing various file systems, such as HDFS, GlusterFS, Amazon S3, and so on.
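As a minimal sketch of the RDD abstraction (assuming a local Spark installation; the app name and data are made up), an RDD can be built from a local collection and then transformed in parallel:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with all cores; a real cluster would use a proper master URL
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Distribute a local collection as an RDD and manipulate it in parallel
val nums = sc.parallelize(1 to 5)
val squares = nums.map(n => n * n)
println(squares.collect().mkString(", "))  // 1, 4, 9, 16, 25

sc.stop()
```

The `map` transformation runs on the partitions in parallel across the cluster; `collect()` brings the results back to the driver.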

Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL).
Spark SQL can also be used for reading and writing data to and from various structured formats and data sources, such as JSON, Parquet files, relational databases and Hive.
Spark SQL also includes a Thrift server, which can be used by external systems, such as business intelligence tools, to query data through Spark SQL using the classic JDBC and ODBC protocols.
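A rough sketch of querying a structured source through Spark SQL (Spark 1.x style, matching this post's era; `people.json` is a hypothetical input file):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Load a structured file into a DataFrame
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

// Query it with plain SQL
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

sc.stop()
```

The same `read`/`write` API handles the other formats mentioned above (Parquet, JDBC sources, Hive tables, and so on).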

Spark Streaming
Spark Streaming is a framework for ingesting and processing real-time streaming data from various sources. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service. The supported streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom sources.
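As a sketch of the micro-batch model (assuming a text source on a local socket, e.g. netcat on port 9999, purely for illustration):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one receives the stream, one processes it
val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")

// Process the stream in 10-second micro-batches
val ssc = new StreamingContext(conf, Seconds(10))

// Count words arriving on the socket within each batch
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

Swapping the source for Kafka, Flume, or another supported input changes only how the initial DStream is created; the transformations stay the same.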

MLlib
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
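As an illustration of one of these algorithms, a clustering sketch with k-means on toy data (the points and parameters are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mllib-demo").setMaster("local[*]"))

// Two obvious clusters of 2-D points
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// Train k-means with 2 clusters and up to 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)

sc.stop()
```

Because the training data is an RDD, the same code scales from this toy example to a dataset distributed across a cluster.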

GraphX
GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations. Graphs are data structures comprising vertices and the edges connecting them.
GraphX provides functions for building graphs, represented as "graph RDDs": EdgeRDD and VertexRDD. GraphX also contains implementations of the most important graph algorithms, such as PageRank, connected components, shortest paths, SVD++, and others.
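A small sketch of building a graph from a vertex RDD and an edge RDD and running PageRank on it (the names and edges are made up):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

// A tiny "follows" graph: (id, name) vertices plus directed edges
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)

// Run PageRank until the scores converge within the given tolerance
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)

sc.stop()
```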

Cluster Managers
To maximize flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself, called the Standalone Scheduler.
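In practice, the cluster manager is selected through the master URL handed to Spark, either in code or via spark-submit (the host names and ports below are placeholders):

```scala
import org.apache.spark.SparkConf

new SparkConf().setMaster("local[*]")           // no cluster manager: run locally
new SparkConf().setMaster("spark://host:7077")  // Standalone Scheduler
new SparkConf().setMaster("mesos://host:5050")  // Apache Mesos
new SparkConf().setMaster("yarn")               // Hadoop YARN
```

The application code itself does not change; only the master URL (and the deployment configuration) differs between cluster managers.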


