Spark vs MapReduce
🚀🔥🤔 What is Spark? Is it a substitute for MapReduce or Hadoop? 🙄
✍️ Let's understand what Spark is:-
========================
📝General Purpose
📝In Memory
📝 Computation Engine
📢 Let's understand each of them in depth:-
================================
a) Computation Engine:-
=================
✏️ Spark is an alternative to MapReduce; it is not an alternative to Hadoop.
✏️ We should compare the computation engine (MapReduce) with the computation engine (Spark).
✏️ Spark says: I am a plug-and-play computation engine, but I need two things, a storage layer from which I can read data and a resource manager which manages the work.
✏️ Spark says: I am not bound to any specific storage, you just tell me the storage and I will pick up the data and work on it.
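📌 A minimal PySpark sketch of this pluggability: the resource manager is chosen via the master setting and the storage is chosen simply by the path you read from. The app name, master value, and file paths below are illustrative placeholders, not real endpoints.

```python
from pyspark.sql import SparkSession

# Spark itself is only the compute engine; the resource manager and the
# storage layer are picked at configuration / read time.
spark = (
    SparkSession.builder
    .appName("pluggable-spark")   # hypothetical app name
    .master("yarn")               # resource manager: could also be "local[*]" or a Kubernetes URL
    .getOrCreate()
)

# The same read API works against whatever storage you point it at;
# these paths are placeholders, not real datasets.
df_hdfs  = spark.read.csv("hdfs:///data/sales.csv", header=True)   # HDFS
df_s3    = spark.read.parquet("s3a://my-bucket/events/")           # Amazon S3
df_local = spark.read.json("file:///tmp/logs.json")                # local disk
```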
b) In Memory:-
==========
📢 Let's compare with MapReduce to understand Spark better.
✏️ Let's consider we have 5 MapReduce jobs (mr1, mr2, mr3, mr4, mr5).
✏️ Where does a MapReduce job take its data from? Obviously it will take the data from HDFS (which means a disk access is required to read the data from HDFS).
✏️ After getting the data it will process it, and the output of each MapReduce job is written back to HDFS (which means another disk access is required to write the output).
✏️ It means for each MapReduce job we need "2" disk I/O accesses.
✏️ It means if there are 100 MapReduce jobs, the number of disk I/O accesses will be "200".
✏️ Disk access is painful; it takes a lot of time.
Then what about Spark? 🤔
===================
Let's consider there are 5 variables V1, V2, V3, V4, V5 in memory.
✏️ Initially V1 will take data from HDFS (which means a disk access is required to read data from HDFS).
✏️ It processes the data and gives it to V2, which is in memory.
✏️ V2 processes and gives it to V3.
✏️ V3 processes and gives it to V4.
✏️ V4 processes and gives it to V5.
✏️ V5 processes and the final output is written to HDFS (which means a disk access is required to write the output).
✍️ It means that in "Spark", however many variables or jobs there might be, Spark needs only "2" disk I/O accesses:
1) For reading the input data
2) For writing the output
✍️ Spark is 10 to 100 times faster than MapReduce.
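📌 The same V1 → V5 chain as a small PySpark sketch: only the first read and the final write touch disk, everything in between is chained transformations kept in memory. The input path, column names, and output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

# 1st (and only) disk read: load the input once from storage.
v1 = spark.read.csv("hdfs:///input/orders.csv", header=True, inferSchema=True)  # hypothetical path

# v2..v5 are chained transformations; nothing is written to disk
# between these steps -- the intermediate results stay in memory.
v2 = v1.filter(F.col("amount") > 0)                       # drop bad rows
v3 = v2.withColumn("year", F.year("order_date"))          # derive a column
v4 = v3.groupBy("year").agg(F.sum("amount").alias("total"))
v5 = v4.orderBy("year")

# 2nd (and only) disk write: persist the final result.
v5.write.mode("overwrite").parquet("hdfs:///output/yearly_totals")
```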
c) General Purpose:-
===========
✏️ In Hadoop, to clean data -- we use Pig
✏️ In Hadoop, for querying data -- we need Hive
✏️ In Hadoop, for data ingestion -- we need Sqoop
🚀 We know that to do the above things there is no common style of code where we can make the necessary add-ons and work.
🚀 To work on the above in Hadoop we need to learn each and every tool separately.
But Spark says:-
=============
✏️ Spark says: I am a general purpose compute engine.
✏️ Just learn one style of writing code, and by making the necessary additions and deletions we can perform cleaning of data, querying of data, and ingestion of data.
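📌 A rough sketch of "one style of code" covering all three tasks. The JDBC URL, table name, credentials, and column names are placeholders, and this is only an illustration of the idea, not a production job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-api-for-everything").getOrCreate()

# Ingestion (the Sqoop role in Hadoop): pull a table from a relational
# database over JDBC. URL, table, and credentials are placeholders.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "customers")
    .option("user", "report_user")
    .option("password", "secret")
    .load()
)

# Cleaning (the Pig role): same DataFrame API, just different calls.
clean = customers.dropDuplicates(["customer_id"]).na.drop(subset=["email"])

# Querying (the Hive role): register a view and run plain SQL on it.
clean.createOrReplaceTempView("customers")
result = spark.sql("SELECT country, COUNT(*) AS cnt FROM customers GROUP BY country")
result.show()
```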