Saturday 30 April 2016

Spark's Components

Spark Components
Spark Core
Spark Core contains the basic functionality of Spark that is required for running jobs and needed by the other components. This includes components for task scheduling, memory management, fault recovery, and interacting with storage systems.

Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.

Spark Core also contains the logic for accessing various file systems, such as HDFS, GlusterFS, Amazon S3, and so on.
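
A minimal sketch of the RDD abstraction using the Scala API (the application name, the local master URL, and the numbers being summed are placeholder values chosen only for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across the available nodes as an RDD
    val numbers = sc.parallelize(1 to 100)

    // Transformations are lazy; the reduce action triggers the parallel computation
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}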

Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL).
Spark SQL can also be used for reading and writing data to and from various structured formats and data sources, such as JSON, Parquet files, relational databases and Hive.
Spark SQL also includes a Thrift server, which can be used by external systems, such as business intelligence tools, to query data through Spark SQL using the classic JDBC and ODBC protocols.
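
To make this concrete, here is a small sketch using the Spark 1.x SQLContext API; the people.json path and the name/age columns are assumed sample data, not anything from a real dataset:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSqlExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Read a JSON file into a DataFrame (one JSON record per line)
    val people = sqlContext.read.json("people.json")

    // Register the DataFrame as a temporary table and query it with plain SQL
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    sc.stop()
  }
}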

Spark Streaming
Spark Streaming is a framework for ingesting and processing real-time streaming data from various sources. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.

The supported streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom ones.
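
For example, a minimal word count over a socket stream might look like the sketch below; the host, port, and 10-second batch interval are illustrative values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
    // Process the incoming stream in 10-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(10))

    // Ingest lines of text from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count the words in each batch and print the counts
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}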

MLlib
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
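
As a rough sketch, clustering a handful of points with MLlib's K-means looks like this; the points and the choice of two clusters are made up purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLlibExample").setMaster("local[*]"))

    // A tiny in-memory dataset of 2-dimensional points
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 9.1)
    ))

    // Cluster the points into 2 groups, running at most 20 iterations
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}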

GraphX
GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations. Graphs are data structures comprising vertices and the edges connecting them.
GraphX provides functions for building graphs, represented as "graph RDDs": EdgeRDD and VertexRDD. GraphX contains implementations of the most important algorithms of graph theory, such as PageRank, connected components, shortest paths, SVD++, and others.
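
A small sketch of building a graph and running PageRank with GraphX; the three users and the "follows" edges are invented sample data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))

    // Vertices are (id, property) pairs; edges connect source and destination ids
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    // Build the graph and run PageRank until the ranks converge within the tolerance
    val graph = Graph(vertices, edges)
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach(println)

    sc.stop()
  }
}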

Cluster Managers
To maximize flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
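
The cluster manager is selected through the master URL passed to Spark. A short sketch of the common options; the host names and ports are placeholder values:

import org.apache.spark.SparkConf

// Each master URL points Spark at a different cluster manager
val standalone = new SparkConf().setMaster("spark://master-host:7077") // Standalone Scheduler
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-client")              // Hadoop YARN
val local      = new SparkConf().setMaster("local[*]")                 // single machine, for testing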



What is Apache Spark

Spark:


Apache Spark is a general-purpose cluster computing system that processes big data at very high speed compared with traditional MapReduce. Spark's design allows it to keep large amounts of data in memory, so Spark programs can be up to 100 times faster than their MapReduce counterparts.

Compared to the MapReduce model, Spark supports more types of computations, including interactive queries and stream processing. The biggest difference between MapReduce and Spark is the use of memory (RAM).
In MapReduce, memory is used primarily for the actual computation, which makes the overall process slow. In Spark, memory is used both to compute and to store objects.
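
To illustrate, caching keeps an RDD in memory so later actions reuse it instead of re-reading from disk; the log file path and the ERROR/timeout filters below are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingExample").setMaster("local[*]"))

    val errors = sc.textFile("events.log").filter(_.contains("ERROR"))

    // Keep the filtered RDD in memory so repeated queries avoid re-reading the file
    errors.cache()

    println("total errors: " + errors.count())                            // first action reads the file and fills the cache
    println("timeouts: " + errors.filter(_.contains("timeout")).count())  // later actions reuse the cached data

    sc.stop()
  }
}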

Spark is like a shopping mall that has every store in one place: it is designed to cover all the types of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. Spark makes it easy and inexpensive to combine these different processing types, which is often necessary in production data analysis pipelines.

Spark is written in Scala and is designed to be highly accessible, offering APIs in Python, Java, Scala, and SQL.


Some applications are not suitable for Spark because of its distributed architecture. Spark's overhead is negligible when handling large amounts of data, but if you have a small amount of data that fits on a single machine, it is more beneficial to use another framework. Spark is also not made for OLTP applications, which involve many fast, small transactions; it is better suited to OLAP workloads, such as batch jobs and data mining.