RDDs in Spark
Do you know how data is stored in Spark? Do you know what #RDDs are in Spark?
By Karthik Kondpak
✍️Let's understand how data is stored :-
============================
✏️The basic unit which holds data in Spark is called an "#RDD"
✏️RDD stands for "#Resilient Distributed Dataset"
✍️Let's understand what "Resilient" and "Distributed Dataset" mean in depth:-
==============================
a) Distributed Dataset:-
=================
✏️We know a #List from traditional languages; it is stored on one machine.
✏️But let's say the list is split across 4 machines, distributing the data.
✏️Consider we have lists L1, L2, L3, L4 on machines M1, M2, M3, M4.
✏️Now the RDD is nothing but
rdd = L1 + L2 + L3 + L4
✏️All the pieces together are considered as one "#RDD"
✏️In other words, an RDD is nothing but an "in-memory distributed collection"
✏️RDDs are distributed across the cluster's memory.
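✏️Here is a minimal PySpark sketch of the same idea (the data, the app name and the partition count are just placeholders for illustration):-

from pyspark.sql import SparkSession

# Start a local Spark session (the app name is only an example)
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# A plain Python list lives on one machine (the driver)
data = [1, 2, 3, 4, 5, 6, 7, 8]

# parallelize() splits the list into partitions that are distributed
# across the executors -- all the pieces together form one RDD
rdd = sc.parallelize(data, numSlices=4)

print(rdd.getNumPartitions())   # 4
print(rdd.glom().collect())     # shows which elements sit in which partition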
b) Resilient :-
=========
✏️Resilient means the ability to quickly recover from failures.
✏️RDDs are resilient to failures; if something fails, they can be recovered.
✏️From Hadoop we know about the replication factor; replication on HDFS (i.e. on disk) works well, but keeping replicas in memory does not.
✏️Memory is very costly, and we cannot keep 3 replicas in memory.
✍️Then how is resilient behaviour achieved in RDDs:-
=========================
✏️RDDs provide fault tolerance through a lineage graph
📝What is a lineage graph?
==================
✏️A lineage graph keeps track of the #transformations to be executed once an action is called.
✏️In simple terms, the lineage graph keeps information like
a) which RDD this RDD depends on, #where it comes #from, and by performing what operation it was generated.
✏️Instead of holding data, it holds information about dependencies.
📝How and when is the lineage graph created?
================================
✏️When an action is encountered, a path is created which defines the order of execution of the operations.
✏️This information is remembered in the form of a "lineage graph"
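✏️As a rough illustration, Spark lets us peek at this graph with toDebugString(); continuing the sketch above:-

# Build a small chain of transformations on the rdd from the earlier sketch
mapped = rdd.map(lambda x: x * 2)
filtered = mapped.filter(lambda x: x > 4)

# toDebugString() prints the lineage: which parent RDDs this RDD was
# derived from and through which operations
print(filtered.toDebugString().decode("utf-8"))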
📝How does the lineage graph help to recover?
=============================
✏️Let's consider the following code:-
rdd1 = sc.textFile("hdfs://...")       # load data from HDFS
rdd2 = rdd1.map(lambda line: ...)      # some map transformation
rdd3 = rdd2.filter(lambda line: ...)   # a filter applied on rdd2
✏️Let's say we have lost rdd3; how does it get recovered?
✏️When rdd3 is lost, we look up its parent RDD, which is rdd2, using the lineage graph.
✏️We see which operation generated rdd3 from rdd2, apply the same operation once more, and we get back our rdd3.
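✏️A rough way to see this recomputation (the data and functions below are made up, and unpersist() only stands in for really losing the in-memory copy):-

rdd1 = sc.parallelize(range(100), 4)
rdd2 = rdd1.map(lambda x: x * 2)
rdd3 = rdd2.filter(lambda x: x % 3 == 0)

rdd3.cache()       # keep rdd3's partitions in memory
rdd3.count()       # action: materializes and caches rdd3

rdd3.unpersist()   # drop the in-memory copy
rdd3.count()       # Spark rebuilds rdd3 from rdd2 using the lineage graph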
✏️The immutable nature of RDDs is what makes this resilient behaviour possible.
✏️If RDDs were mutable, we could not achieve resilient behaviour, because re-applying the same operation on a changed parent would not give back the same data.
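✏️A tiny sketch of that immutability (illustrative values only):-

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)   # map() returns a NEW RDD; nums is untouched

print(nums.collect())      # [1, 2, 3, 4]  -- the original is unchanged
print(doubled.collect())   # [2, 4, 6, 8]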
🤛Stay tuned 🥳 we will understand what happens if RDDs are mutable.🤛
Like, share and comment