Spark vs MapReduce
🚀🔥🤔 What is Spark? Is it a substitute for MapReduce or Hadoop? 🙄
✍️ Let's understand what Spark is:-
========================
📝General Purpose
📝In Memory
📝 Computation Engine
📢 Let's understand each of them in depth:-
================================
a) Computation Engine:-
=================
✏️ Spark is an alternative to MapReduce; it is not an alternative to Hadoop.
✏️ We should compare the computation engine (MapReduce) with the computation engine (Spark).
✏️ Spark says: I am a plug-and-play computation engine, but I need two things, a storage layer from which I can read data and a resource manager which manages the work.
✏️ Spark says: I am not bound to any specific storage, you just tell me the storage and I will pick up the data and work on it.
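📌 A minimal PySpark sketch of this pluggability: the resource manager is chosen via the master setting and the storage is chosen simply by the path you read from. The app name, master value, and file paths below are illustrative placeholders, not real endpoints.

```python
from pyspark.sql import SparkSession

# Spark itself is only the compute engine; the resource manager and the
# storage layer are picked at configuration / read time.
spark = (
    SparkSession.builder
    .appName("pluggable-spark")   # hypothetical app name
    .master("yarn")               # resource manager: could also be "local[*]" or a Kubernetes URL
    .getOrCreate()
)

# The same read API works against whatever storage you point it at;
# these paths are placeholders, not real datasets.
df_hdfs  = spark.read.csv("hdfs:///data/sales.csv", header=True)   # HDFS
df_s3    = spark.read.parquet("s3a://my-bucket/events/")           # Amazon S3
df_local = spark.read.json("file:///tmp/logs.json")                # local disk
```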
b) In Memory:-
==========
📢 Let's compare with MapReduce to understand Spark better.
✏️ Let's consider we have 5 MapReduce jobs (mr1, mr2, mr3, mr4, mr5).
✏️ Where does a MapReduce job take its data from? Obviously it will take the data from HDFS (which means a disk access is required to read the data from HDFS).
✏️ After getting the data it will process it, and the output of each MapReduce job is written back to HDFS (which means another disk access is required to write the output).
✏️ It means for each MapReduce job we need "2" disk I/O accesses.
✏️ It means if there are 100 MapReduce jobs, the number of disk I/O accesses will be "200".
✏️ Disk access is painful; it takes a lot of time.
Then what about Spark? 🤔
===================
Let's consider there are 5 variables V1, V2, V3, V4, V5 in memory.
✏️ Initially V1 will take data from HDFS (which means a disk access is required to read data from HDFS).
✏️ It processes the data and gives it to V2, which is in memory.
✏️ V2 processes and gives it to V3.
✏️ V3 processes and gives it to V4.
✏️ V4 processes and gives it to V5.
✏️ V5 processes and the final output is written to HDFS (which means a disk access is required to write the output).
✍️ It means that in "Spark", however many variables or jobs there might be, Spark needs only "2" disk I/O accesses:
1) For reading the input data
2) For writing the output
✍️ Spark is 10 to 100 times faster than MapReduce.
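📌 The same V1 → V5 chain as a small PySpark sketch: only the first read and the final write touch disk, everything in between is chained transformations kept in memory. The input path, column names, and output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

# 1st (and only) disk read: load the input once from storage.
v1 = spark.read.csv("hdfs:///input/orders.csv", header=True, inferSchema=True)  # hypothetical path

# v2..v5 are chained transformations; nothing is written to disk
# between these steps -- the intermediate results stay in memory.
v2 = v1.filter(F.col("amount") > 0)                       # drop bad rows
v3 = v2.withColumn("year", F.year("order_date"))          # derive a column
v4 = v3.groupBy("year").agg(F.sum("amount").alias("total"))
v5 = v4.orderBy("year")

# 2nd (and only) disk write: persist the final result.
v5.write.mode("overwrite").parquet("hdfs:///output/yearly_totals")
```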
c) General Purpose:-
===========
✏️ In Hadoop, to clean data -- we use Pig
✏️ In Hadoop, for querying data -- we need Hive
✏️ In Hadoop, for data ingestion -- we need Sqoop
🚀 We know that to do the above things there is no common style of code where we can make the necessary add-ons and work.
🚀 To work on the above in Hadoop we need to learn each and every tool separately.
But Spark says:-
=============
✏️ Spark says: I am a general purpose compute engine.
✏️ Just learn one style of writing code, and by making the necessary additions and deletions we can perform cleaning of data, querying of data, and ingestion of data.
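📌 A rough sketch of "one style of code" covering all three tasks. The JDBC URL, table name, credentials, and column names are placeholders, and this is only an illustration of the idea, not a production job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-api-for-everything").getOrCreate()

# Ingestion (the Sqoop role in Hadoop): pull a table from a relational
# database over JDBC. URL, table, and credentials are placeholders.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "customers")
    .option("user", "report_user")
    .option("password", "secret")
    .load()
)

# Cleaning (the Pig role): same DataFrame API, just different calls.
clean = customers.dropDuplicates(["customer_id"]).na.drop(subset=["email"])

# Querying (the Hive role): register a view and run plain SQL on it.
clean.createOrReplaceTempView("customers")
result = spark.sql("SELECT country, COUNT(*) AS cnt FROM customers GROUP BY country")
result.show()
```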