How does reduceByKey work internally in Spark?

 

What is #reduceByKey in Spark, and how does it work internally?



Let's understand it in depth:

=======================

✏️reduceByKey is a wide transformation.

✏️Because it is a wide transformation, shuffling is always involved.

✍️What is Shuffling?

      ==============

📌Moving data from one machine to another is termed shuffling.

✅Now let's consider an example where we have 3 machines: M1, M2 and M3.


✏️Use case: we have to calculate the total amount spent by each user. The tuples below are stored across the machines, where the key is a user id and the value is an amount.

   Machine                          Data stored

   M1                               (33,200)
                                    (33,600)

   M2                               (33,300)

   M3                               (33,400)
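
To make the setup concrete, here is a plain-Python stand-in for the three machines (the list names and layout are illustrative, not Spark code):

```python
# Key = user id, value = amount spent; one list per machine (illustrative layout).
m1 = [(33, 200), (33, 600)]
m2 = [(33, 300)]
m3 = [(33, 400)]

# In Spark (Scala) the whole computation is a single call, roughly:
#   rdd.reduceByKey((x, y) => x + y)
# The function receives two values for the same key at a time.
all_rows = m1 + m2 + m3
total = sum(v for _, v in all_rows)
print((33, total))  # (33, 1500)
```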


✏️reduceByKey groups all matching keys together and aggregates their values.

✏️reduceByKey works on two rows at a time.

✏️reduceByKey works on the principle of #datalocality, which means it first tries to aggregate on the machine where the data already resides if it finds matching keys there.
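
This data-locality point is what Spark calls map-side combining: each partition is reduced on its own machine first, and only the partial results are shuffled. A plain-Python sketch of that local step (not Spark's actual implementation):

```python
def local_combine(partition, f):
    # Combine values for duplicate keys within one machine/partition,
    # before any data moves across the network.
    partial = {}
    for k, v in partition:
        partial[k] = f(partial[k], v) if k in partial else v
    return list(partial.items())

# M1 holds two rows for key 33; they collapse locally into one partial result.
print(local_combine([(33, 200), (33, 600)], lambda x, y: x + y))  # [(33, 800)]
```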

✏️Now let's understand how aggregation happens on each machine.

Let's consider M1
===============

  (33,200)
  (33,600)

✏️This is our data on M1

✏️reduceByKey((x,y)=>x+y) works on two rows at a time, which means:

It considers

x=(33,200)
y=(33,600)
=========
x+y = (33,200+600)

✏️Here we can see reduceByKey doing #local aggregation, since it found matching keys on the same machine.

✏️Although x is written as (33,200) and y as (33,600), what actually happens is that the matching keys are grouped and the function is applied only to the values.

✅x+y =(33,800)

✅M1 final result is  (33,800)
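
The two-rows-at-a-time behaviour on M1 is just a pairwise fold over the values for the key; a quick sketch:

```python
from functools import reduce

# Values for key 33 on M1; reduce applies the lambda to two values at a time.
total = reduce(lambda x, y: x + y, [200, 600])
print((33, total))  # (33, 800)
```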

🚀M2
    ===

(33,300)

🚀M3
   ====

(33,400)

M1                    M2                    M3

(33,800)              (33,300)              (33,400)
  |
  |
  |
  |
==========================
     M4
     ====
  (33,800)
  (33,300)
  (33,400)
===========================

✅Above we can see that data is moved from M1, M2 and M3 to M4; this is known as #shuffling.
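
The shuffle can be pictured as each machine sending its partial result for the key to a single target machine (M4 here is just an illustrative label):

```python
# Partial results left on each machine after local aggregation.
partials = {"M1": [(33, 800)], "M2": [(33, 300)], "M3": [(33, 400)]}

# Shuffle: every tuple for key 33 lands on one machine, called M4 here.
m4 = [row for machine in ("M1", "M2", "M3") for row in partials[machine]]
print(m4)  # [(33, 800), (33, 300), (33, 400)]
```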

🚀Now let's see how the overall aggregation happens on M4:
===========================

🚀M4
      ==

      (33,800)
      (33,300)
      (33,400)

🚀Now reduceByKey applies the aggregation on M4: the shuffle has brought every tuple for the key onto one machine, so it starts combining the partial results.

x=(33,800)
y=(33,300)
=======

🚀x+y = (33,800+300)

🚀x+y = (33,1100), the intermediate result from combining (33,800) and (33,300).

🚀Now the intermediate result (33,1100) and the one remaining tuple (33,400) are aggregated.

x=(33,1100)
y= (33,400)
==========
  x+y = (33,1100+400)
===========

🚀x+y = (33,1500)

🚀The final result we get is (33,1500).
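
The final merge on M4 is the same pairwise fold, now applied to the shuffled partial results:

```python
from functools import reduce

m4 = [(33, 800), (33, 300), (33, 400)]
# Fold the values two at a time: 800+300 = 1100, then 1100+400 = 1500.
key = m4[0][0]
total = reduce(lambda x, y: x + y, (v for _, v in m4))
print((key, total))  # (33, 1500)
```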

M1                    M2                    M3

(33,200)              (33,300)              (33,400)
(33,600)

============================
M4: (33,1500)
============================

This is how reduceByKey works internally.
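
Putting the three stages (local combine, shuffle, final reduce) together, the whole flow can be simulated in a few lines of plain Python; the machine names and the `combine` helper are illustrative stand-ins for Spark's machinery, not its real API:

```python
machines = {
    "M1": [(33, 200), (33, 600)],
    "M2": [(33, 300)],
    "M3": [(33, 400)],
}

def combine(rows, f):
    # Group matching keys and fold their values two at a time.
    out = {}
    for k, v in rows:
        out[k] = f(out[k], v) if k in out else v
    return list(out.items())

add = lambda x, y: x + y

# Stage 1: local aggregation on each machine (data locality).
partials = [row for part in machines.values() for row in combine(part, add)]
# Stage 2: shuffle: `partials` now stands for the rows gathered on M4.
# Stage 3: final aggregation on M4.
result = combine(partials, add)
print(result)  # [(33, 1500)]
```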
