Spark vs MapReduce

 

🚀🔥🤔What is Spark? Is it a substitute for MapReduce, or for Hadoop?🙄

✍️Let's understand what Spark is:-
      ========================
     📝General purpose

     📝In-memory

     📝Computation engine

📢Let's understand each of them in depth:-
================================
a) Computation engine:-
     =================
    ✏️Spark is an alternative to MapReduce; it is not an alternative to Hadoop.

    ✏️We should compare a computation engine (MapReduce) with a computation engine (Spark).

    ✏️Spark says: I am a plug-and-play computation engine, but I need two things, a storage system from which I can take data and a resource manager which manages the work.

    ✏️Spark says: I am not bound to any specific storage; you just tell me the storage and I will pick up the data and work on it (see the sketch below).
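
✍️A minimal PySpark sketch of this plug-and-play idea (the app name and all file paths below are hypothetical placeholders, not from the post): the master URL decides which resource manager runs the work, and the same read API works against whatever storage you point it at.

```python
from pyspark.sql import SparkSession

# Spark itself is just the compute engine; the master URL picks the
# resource manager ("local[*]" here, "yarn" on a Hadoop cluster).
spark = (SparkSession.builder
         .appName("storage-agnostic-demo")
         .master("local[*]")
         .getOrCreate())

# The same read API works against different storage systems --
# only the path scheme changes (all paths below are made up):
df_hdfs  = spark.read.csv("hdfs:///data/sales.csv", header=True)
df_local = spark.read.csv("file:///tmp/sales.csv", header=True)
df_s3    = spark.read.csv("s3a://my-bucket/sales.csv", header=True)
```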

b) In-memory:-
     ==========
  📢Let's compare with MapReduce to understand Spark in a better manner.

  ✏️Let's consider we have 5 MapReduce jobs (mr1, mr2, mr3, mr4, mr5).

  ✏️From where does a MapReduce job take its data? Obviously, it takes data from HDFS (it means a disk access is required to read the data from HDFS).

  ✏️After getting the data it processes it, and the output of each MapReduce job is fed back to HDFS (it means another disk access is required to write the output).

  ✏️It means for each MapReduce job we need "2" disk I/O accesses.

  ✏️It means if there are 100 MapReduce jobs, the number of disk I/O accesses will be "200" (see the toy count below).

  ✏️Disk access is painful; it takes a lot of time.
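
✍️To make the counting concrete, here is a toy sketch in plain Python (this is just the arithmetic, not real MapReduce code):

```python
# Toy model: each MapReduce job in a chain reads its input from HDFS
# and writes its output back to HDFS, so it costs 2 disk I/Os.
def mapreduce_disk_ios(num_jobs: int) -> int:
    return 2 * num_jobs  # one read + one write per job

print(mapreduce_disk_ios(5))    # 10 disk accesses for mr1..mr5
print(mapreduce_disk_ios(100))  # 200 disk accesses for 100 jobs
```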

Then what about Spark?🤔
===================
Let's consider there are 5 variables V1, V2, V3, V4, V5 in memory.

✏️Initially V1 takes data from HDFS (which means a disk access is required to read data from HDFS).

✏️V1 processes and gives it to V2, which is in memory.

✏️V2 processes and gives it to V3.

✏️V3 processes and gives it to V4.

✏️V4 processes and gives it to V5.

✏️V5 processes and the final output is fed to HDFS (it means a disk access is required to write the output).

✍️It means in "Spark", however many variables or jobs there may be, Spark needs only "2" disk I/O accesses (see the sketch after this list):
  1) for reading the input data
  2) for writing the output
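
✍️A minimal PySpark sketch of that V1→V5 chain (the HDFS paths and column names are hypothetical): the intermediate steps are transformations that stay in memory, and only the first read and the final write touch disk.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

# Disk I/O #1: read the input from HDFS.
v1 = spark.read.csv("hdfs:///input/orders.csv", header=True, inferSchema=True)

# V2..V5 are chained transformations; no intermediate result
# is written to disk between these steps.
v2 = v1.filter(F.col("amount") > 0)                  # drop bad rows
v3 = v2.withColumn("amount", F.col("amount") * 1.1)  # apply a markup
v4 = v3.groupBy("customer_id").sum("amount")
v5 = v4.orderBy(F.desc("sum(amount)"))

# Disk I/O #2: write the final output back to HDFS.
v5.write.csv("hdfs:///output/top_customers", header=True)
```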

✍️Spark is 10 to 100 times faster than MapReduce.

c) General purpose:-
     ===========
  ✏️In Hadoop, to clean data -- we use Pig.

✏️In Hadoop, for querying data -- we need Hive.

✏️In Hadoop, for data ingestion -- we need Sqoop.

🚀We know that to do the above things there is no common style of code where we can make the necessary add-ons and work.

🚀To do all of the above in Hadoop, we need to learn each and every tool separately.

But Spark says:-
=============
✏️Spark says: I am a general-purpose compute engine.

✏️Just learn one style of writing the code, and by making the necessary add-ons and deletions we can perform all of it: cleaning of data, querying, and ingestion of data (see the sketch below).
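
✍️A minimal PySpark sketch of that "one style of code" (the JDBC connection details, table, and column names are hypothetical placeholders): ingestion, cleaning, and querying all happen in one program with one API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("general-purpose").getOrCreate()

# Ingestion (what Sqoop does in Hadoop): pull a table from a database.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/shop")  # made-up DB
          .option("dbtable", "orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())

# Cleaning (what Pig does in Hadoop): same DataFrame API, no new tool.
clean = orders.dropna().filter(F.col("amount") > 0)

# Querying (what Hive does in Hadoop): plain SQL over the same data.
clean.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
""").show()
```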
