How does reduceByKey work internally in Spark?

 

What is #reduceByKey in Spark, and how does it work internally?



Let's understand it in depth:

=======================

✏️reduceByKey is a wide transformation.

✏️Because it is a wide transformation, shuffling is always involved.

✍️What is Shuffling?

      ==============

📌Moving data from one machine to another is termed shuffling.

✅Now let's consider an example where we have 3 machines: M1, M2 and M3.


✏️Use case: we have to calculate the total amount spent by each user. The tuples below are stored across the machines, where the key is a user id and the value is an amount.

   Machine                          Data stored

   M1                               (33,200)
                                    (33,600)

   M2                               (33,300)

   M3                               (33,400)
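
To make the setup concrete, here is a plain-Python stand-in for the three machines (the list names and layout are illustrative, not Spark code):

```python
# Key = user id, value = amount spent; one list per machine (illustrative layout).
m1 = [(33, 200), (33, 600)]
m2 = [(33, 300)]
m3 = [(33, 400)]

# In Spark (Scala) the whole computation is a single call, roughly:
#   rdd.reduceByKey((x, y) => x + y)
# The function receives two values for the same key at a time.
all_rows = m1 + m2 + m3
total = sum(v for _, v in all_rows)
print((33, total))  # (33, 1500)
```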


✏️reduceByKey groups all matching keys together and aggregates their values.

✏️reduceByKey works on two rows at a time.

✏️reduceByKey works on the principle of #datalocality, which means it first tries to aggregate on the machine where the data already resides if it finds matching keys there.
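
This data-locality point is what Spark calls map-side combining: each partition is reduced on its own machine first, and only the partial results are shuffled. A plain-Python sketch of that local step (not Spark's actual implementation):

```python
def local_combine(partition, f):
    # Combine values for duplicate keys within one machine/partition,
    # before any data moves across the network.
    partial = {}
    for k, v in partition:
        partial[k] = f(partial[k], v) if k in partial else v
    return list(partial.items())

# M1 holds two rows for key 33; they collapse locally into one partial result.
print(local_combine([(33, 200), (33, 600)], lambda x, y: x + y))  # [(33, 800)]
```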

✏️Now let's understand how aggregation happens on each machine.

Let's consider M1
===============

  (33,200)
  (33,600)

✏️This is our data on M1

✏️reduceByKey((x,y)=>x+y) works on two rows at a time, which means:

It considers

x=(33,200)
y=(33,600)
=========
x+y = (33,200+600)

✏️Here we can see reduceByKey doing #local aggregation, since it found matching keys on the same machine.

✏️Although x is written as (33,200) and y as (33,600), what actually happens is that the matching keys are grouped and the function is applied only to the values.

✅x+y =(33,800)

✅M1 final result is  (33,800)
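
The two-rows-at-a-time behaviour on M1 is just a pairwise fold over the values for the key; a quick sketch:

```python
from functools import reduce

# Values for key 33 on M1; reduce applies the lambda to two values at a time.
total = reduce(lambda x, y: x + y, [200, 600])
print((33, total))  # (33, 800)
```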

🚀M2
    ===

(33,300)

🚀M3
   ====

(33,400)

M1                    M2                    M3

(33,800)              (33,300)              (33,400)
  |
  |
  |
  |
==========================
     M4
     ====
  (33,800)
  (33,300)
  (33,400)
===========================

✅Above we can see that data is moved from M1, M2 and M3 to M4; this is known as #shuffling.
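
The shuffle can be pictured as each machine sending its partial result for the key to a single target machine (M4 here is just an illustrative label):

```python
# Partial results left on each machine after local aggregation.
partials = {"M1": [(33, 800)], "M2": [(33, 300)], "M3": [(33, 400)]}

# Shuffle: every tuple for key 33 lands on one machine, called M4 here.
m4 = [row for machine in ("M1", "M2", "M3") for row in partials[machine]]
print(m4)  # [(33, 800), (33, 300), (33, 400)]
```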

🚀Now let's see how the overall aggregation happens on M4:
===========================

🚀M4
      ==

      (33,800)
      (33,300)
      (33,400)

🚀Now reduceByKey applies the aggregation on M4: the shuffle has brought every tuple for the key onto one machine, so it starts combining the partial results.

x=(33,800)
y=(33,300)
=======

🚀x+y = (33,800+300)

🚀x+y = (33,1100), the intermediate result from combining (33,800) and (33,300).

🚀Now the intermediate result (33,1100) and the one remaining tuple (33,400) are aggregated.

x=(33,1100)
y= (33,400)
==========
  x+y = (33,1100+400)
===========

🚀x+y = (33,1500)

🚀The final result we get is (33,1500).
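
The final merge on M4 is the same pairwise fold, now applied to the shuffled partial results:

```python
from functools import reduce

m4 = [(33, 800), (33, 300), (33, 400)]
# Fold the values two at a time: 800+300 = 1100, then 1100+400 = 1500.
key = m4[0][0]
total = reduce(lambda x, y: x + y, (v for _, v in m4))
print((key, total))  # (33, 1500)
```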

M1                    M2                    M3

(33,200)              (33,300)              (33,400)
(33,600)

============================
M4: (33,1500)
============================

This is how reduceByKey works internally.
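
Putting the three stages (local combine, shuffle, final reduce) together, the whole flow can be simulated in a few lines of plain Python; the machine names and the `combine` helper are illustrative stand-ins for Spark's machinery, not its real API:

```python
machines = {
    "M1": [(33, 200), (33, 600)],
    "M2": [(33, 300)],
    "M3": [(33, 400)],
}

def combine(rows, f):
    # Group matching keys and fold their values two at a time.
    out = {}
    for k, v in rows:
        out[k] = f(out[k], v) if k in out else v
    return list(out.items())

add = lambda x, y: x + y

# Stage 1: local aggregation on each machine (data locality).
partials = [row for part in machines.values() for row in combine(part, add)]
# Stage 2: shuffle: `partials` now stands for the rows gathered on M4.
# Stage 3: final aggregation on M4.
result = combine(partials, add)
print(result)  # [(33, 1500)]
```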
