How reduceByKey Works Internally in Spark
What is #reduceByKey in Spark, and how does it work internally?
Let's understand it in depth:
=======================
✏️reduceByKey is a wide transformation.
✏️Because it is a wide transformation, it always involves shuffling.
✍️What is shuffling?
==============
✏️Moving data across the network from one machine to another, so that all records with the same key end up together, is called shuffling.
✅Now let's consider an example where we have 3 machines: M1, M2, M3.
✏️Use case: we have to calculate the total amount spent by each user. The tuples below are stored on the machines, where the key is the user ID and the value is the amount.
Machine    Data stored
M1         (33,200)
           (33,600)
M2         (33,300)
M3         (33,400)
✏️reduceByKey groups records that share the same key and combines their values.
✏️reduceByKey works on two values at a time.
✏️reduceByKey works on the principle of #datalocality: before any data is moved, each machine first aggregates the records with matching keys that it already holds (a map-side combine).
✏️Now let's see how the aggregation happens on each machine.
Let's consider M1
===============
(33,200)
(33,600)
✏️This is our data on M1.
✏️reduceByKey((x,y)=>x+y) works on two values of the same key at a time, which means it takes
x=200
y=600
=========
x+y = 200+600
✏️Here we can see that reduceByKey is doing #local aggregation, because it found matching keys on the same machine.
✏️Note that #x and #y are the values 200 and 600, not the whole tuples (33,200) and (33,600): reduceByKey groups the records by the key 33 and passes only the values to the function.
✅x+y = 800
✅M1 final result is (33,800)
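The local aggregation on M1 can be sketched with plain Scala collections, without a Spark cluster. This is only an illustration of the idea: `combineLocally` is a hypothetical helper, not a Spark API, and Spark's real implementation works on partitions rather than on a `Seq`.

```scala
// Records held by M1 in the example: (userId, amount).
val m1: Seq[(Int, Int)] = Seq((33, 200), (33, 600))

// Hypothetical helper (not a Spark API): combine the values of equal keys
// on one machine, two values at a time, using the function f.
def combineLocally(records: Seq[(Int, Int)], f: (Int, Int) => Int): Map[Int, Int] =
  records
    .groupBy { case (key, _) => key }   // key 33 -> both of M1's records
    .map { case (key, group) => key -> group.map(_._2).reduce(f) }

val m1Result = combineLocally(m1, (x, y) => x + y)
println(m1Result) // Map(33 -> 800)
```

The function passed in only ever sees two amounts, never the user ID, which matches how x and y behave in reduceByKey.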
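✏️M2
===
(33,300)
✏️M3
====
(33,400)
✏️M2 and M3 each hold only one record for key 33, so there is nothing to combine locally and their data stays as it is.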
M1           M2           M3
(33,800)     (33,300)     (33,400)
                |
                |
                |
                |
==========================
M4
====
(33,800)
(33,300)
(33,400)
===========================
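✅Above we can see that the data is moved from M1, M2, M3 to M4; this is known as #Shuffling. Here M4 stands for whichever machine holds the partition that key 33 is assigned to; in a real cluster it can also be one of M1, M2, M3.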
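Why do all three records land on the same machine? A minimal sketch, assuming hash partitioning with 4 target partitions (the number 4 is an illustrative assumption, the example above does not specify one). `partitionFor` is a hypothetical helper; Spark's default HashPartitioner follows the same idea of taking the key's hash code modulo the number of partitions.

```scala
// Illustrative assumption: 4 partitions on the receiving side of the shuffle.
val targetPartitions = 4

// Hypothetical helper: non-negative modulo of the key's hash code.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Locally combined results on each machine before the shuffle.
val beforeShuffle = Seq(("M1", (33, 800)), ("M2", (33, 300)), ("M3", (33, 400)))

// Every record with key 33 maps to the same target partition.
val targets = beforeShuffle.map { case (machine, (key, _)) =>
  (machine, partitionFor(key, targetPartitions))
}
println(targets) // List((M1,1), (M2,1), (M3,1))
```

Because the target depends only on the key, records for user 33 always meet in one place, no matter which machine they started on.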
✏️Now let's see how the overall aggregation happens on M4:
===========================
✏️M4
==
(33,800)
(33,300)
(33,400)
✏️Now reduceByKey is applied again for the final aggregation. The shuffle has brought every record for key 33 onto one machine, so it starts combining them.
x=800
y=300
=======
✏️x+y = 800+300
✏️x+y = 1100, so (33,800) and (33,300) combine into (33,1100)
✏️Now this intermediate result (33,1100) is combined with the one remaining tuple (33,400).
x=1100
y=400
==========
x+y = 1100+400
===========
✏️x+y = 1500
✏️The final result is (33,1500)
✏️This is how reduceByKey works.
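The two steps on M4 can be traced with plain Scala. This is a sketch of the arithmetic only; the arrival order 800, 300, 400 is taken from the walkthrough above and is not something Spark guarantees.

```scala
// Values for key 33 that arrived on M4 after the shuffle.
val arrived = List(800, 300, 400)

// scanLeft keeps every intermediate sum, mirroring the steps in the text.
val runningTotals = arrived.tail.scanLeft(arrived.head)((x, y) => x + y)
println(runningTotals) // List(800, 1100, 1500)

// reduce keeps only the last value, which is the result for the key.
val total = arrived.reduce((x, y) => x + y)
println((33, total)) // (33,1500)
```

Since the order in which values meet is not fixed, the function given to reduceByKey should be associative and commutative, as addition is here.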
M1           M2           M3
(33,200)     (33,300)     (33,400)
(33,600)
============================
M4: (33,1500)
============================
This is how reduceByKey works internally.
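Putting the three stages together, here is a small end-to-end simulation in plain Scala. It is a sketch under stated assumptions: `reducePerKey` is a hypothetical helper, the "shuffle" simply gathers everything in one collection that stands in for M4, and no Spark code is involved. On a real pair RDD the whole flow is triggered by a single call such as `rdd.reduceByKey((x, y) => x + y)`.

```scala
// Input from the example: machine -> records (userId, amount).
val cluster: Map[String, Seq[(Int, Int)]] = Map(
  "M1" -> Seq((33, 200), (33, 600)),
  "M2" -> Seq((33, 300)),
  "M3" -> Seq((33, 400))
)

// Hypothetical helper (not a Spark API): merge the values of each key with f.
def reducePerKey(records: Seq[(Int, Int)], f: (Int, Int) => Int): Map[Int, Int] =
  records
    .groupBy(_._1)
    .map { case (key, group) => (key, group.map(_._2).reduce(f)) }

val add = (x: Int, y: Int) => x + y

// Step 1: local aggregation on each machine (M1 becomes (33,800)).
val localResults: Seq[(Int, Int)] =
  cluster.values.toSeq.flatMap(records => reducePerKey(records, add).toSeq)

// Step 2: shuffle. All partial results are gathered in one place, standing in for M4.
val onM4 = localResults

// Step 3: final aggregation on the receiving machine.
val finalResult = reducePerKey(onM4, add)
println(finalResult) // Map(33 -> 1500)
```

Only three records cross the network instead of four, because M1 combined its two records before the shuffle. With many records per key on each machine, this saving is what makes the local step worthwhile.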