RDDs in Spark

 

Do you know how data is stored in Spark? Do you know what #RDDs are in Spark?


By Karthik Kondpak






✍️Let's understand how data is stored:-
============================
✏️The basic unit that holds data in Spark is called an "#RDD".

✏️RDD stands for "#Resilient Distributed Dataset"

✍️Let's understand what "Resilient" and "Distributed Dataset" mean in depth:-
==============================
a) Distributed Dataset:-
    =================
✏️We know the #List from traditional languages; it is stored on one machine.

✏️But let's say the list is stored across 4 machines, distributing the data.

✏️Consider the pieces of the list as L1, L2, L3, L4 (lists) on machines M1, M2, M3, M4.

✏️Now the rdd is nothing but
  rdd = L1 + L2 + L3 + L4

✏️All of these lists together are considered one "#RDD".

✏️In other words, an rdd is nothing but an "in-memory distributed collection".

✏️RDDs are distributed in memory across the machines.
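✏️The idea rdd = L1 + L2 + L3 + L4 can be sketched in plain Python. This is only a toy model, not the Spark API: the dictionary `partitions`, the helpers `collect` and `map_partitions`, and the numbers in the lists are all made up for illustration.

```python
# Toy model only: plain Python lists stand in for the data held on each machine.
# Machine names M1..M4 follow the text; the list contents are invented.
partitions = {
    "M1": [1, 2, 3],     # L1
    "M2": [4, 5, 6],     # L2
    "M3": [7, 8, 9],     # L3
    "M4": [10, 11, 12],  # L4
}


def collect(parts):
    """Gather every partition into one list, in machine-name order."""
    result = []
    for machine in sorted(parts):
        result.extend(parts[machine])
    return result


def map_partitions(parts, fn):
    """Apply fn to each element, one partition at a time, and return NEW partitions."""
    return {machine: [fn(x) for x in data] for machine, data in parts.items()}


rdd_view = collect(partitions)                      # the "rdd" seen as one logical list
doubled = map_partitions(partitions, lambda x: x * 2)

print(rdd_view)
print(collect(doubled))
```

Each machine works only on its own piece, yet the pieces together behave like a single list.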

b) Resilient :-
    =========

✏️Resilient means the ability to quickly recover from failures.

✏️RDDs are resilient to failures: if something fails, the lost data can be recovered.

✏️From Hadoop we know about the replication factor. Keeping replicas works well for HDFS (on disk), but keeping replicas in memory is not a good idea.

✏️Memory is very costly, so we cannot afford to keep "3" replicas in memory.

✍️Then how is resilient behaviour achieved in RDDs:-
=========================

✏️RDDs provide fault tolerance through the lineage graph.

📝What is a lineage graph?
==================

✏️The lineage graph keeps track of the #transformations to be executed once an action is called.

✏️In simple terms, the lineage graph keeps information like:
a) what an rdd depends on, where it comes from, and which operation generates it.

✏️Instead of holding data, it holds information about dependencies.
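✏️A minimal sketch of "dependencies instead of data", in plain Python. The class `LineageNode` and its fields are hypothetical names for illustration, and Spark's real bookkeeping is richer than this.

```python
class LineageNode:
    """Hypothetical stand-in for an rdd: it stores no data, only how it is derived."""

    def __init__(self, name, parent=None, op=None):
        self.name = name      # label of this rdd
        self.parent = parent  # what it depends on (None for the source)
        self.op = op          # description of the operation that generates it

    def lineage(self):
        """Walk back through the parents and describe each step."""
        steps = []
        node = self
        while node is not None:
            steps.append(f"{node.name} <- {node.op}")
            node = node.parent
        return steps


rdd1 = LineageNode("rdd1", op="load from HDFS")
rdd2 = LineageNode("rdd2", parent=rdd1, op="map")
rdd3 = LineageNode("rdd3", parent=rdd2, op="filter")

for step in rdd3.lineage():
    print(step)
```

In real PySpark, `rdd.toDebugString()` returns a description of an RDD's lineage.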

📝How and when is the lineage graph created?
================================

✏️Each transformation is recorded as it is written, but nothing is executed yet. When an action is encountered, a path is created that defines the order in which the operations are executed.

✏️This information is remembered in the form of the "Lineage graph".
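✏️This "record now, run on action" behaviour can be imitated in plain Python. The class `LazyList` below is a toy with made-up names, not Spark: `map` and `filter` only remember the step, and `collect` plays the role of the action.

```python
class LazyList:
    """Toy model: transformations are only recorded; collect() acts as the action."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # recorded (kind, function) pairs, in order

    def map(self, fn):
        # Return a new object with one more recorded step; nothing runs here.
        return LazyList(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyList(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # The action: replay the recorded steps in the order they were defined.
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records every element the map function actually touches


def traced_double(x):
    calls.append(x)
    return x * 2


pipeline = LazyList([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
ran_before_action = len(calls)  # still 0: no transformation has executed yet
result = pipeline.collect()     # the action triggers the whole chain

print(ran_before_action, result)
```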

📝How does the lineage graph help to recover?
=============================

✏️Let's consider this code:-

       rdd1 = load data from HDFS
       rdd2 = rdd1.map(" ---")
       rdd3 = rdd2.filter("----")

✏️Let's say we have lost rdd3. How does it recover?

✏️When rdd3 is lost, we use the lineage graph to quickly look up its parent rdd, which is rdd2.

✏️We see which function was applied to generate rdd3, apply the same operation to rdd2 once more, and get our rdd3 back.

✏️The immutable nature of RDDs is what makes this resilient behaviour possible: because the parent never changes, re-applying the same operation gives back the same result.

✏️If RDDs were mutable, we could not achieve resilient behaviour.
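✏️The recovery steps can be sketched as a toy in plain Python. Everything here is an assumption for illustration: the class `RecoverableRDD`, the sample numbers, and the map and filter functions (the original code leaves them blank). Spark itself recovers lost partitions, not a whole Python object.

```python
class RecoverableRDD:
    """Toy model: keeps the recipe (parent + function) so lost data can be rebuilt."""

    def __init__(self, compute, parent=None):
        self.compute = compute  # function: parent's data -> this rdd's data
        self.parent = parent
        self.cached = None      # materialized data; None means lost or not yet computed

    def data(self):
        if self.cached is None:
            parent_data = self.parent.data() if self.parent else None
            # Stored as a tuple so the result cannot be modified afterwards.
            self.cached = tuple(self.compute(parent_data))
        return self.cached


# Made-up stand-ins for "load data from HDFS", map and filter.
rdd1 = RecoverableRDD(lambda _: [1, 2, 3, 4, 5, 6])
rdd2 = RecoverableRDD(lambda d: [x * 10 for x in d], parent=rdd1)
rdd3 = RecoverableRDD(lambda d: [x for x in d if x > 30], parent=rdd2)

before = rdd3.data()
rdd3.cached = None     # simulate losing rdd3
after = rdd3.data()    # rebuilt from rdd2 by re-applying the same filter

print(before, after)
```

Because rdd2's data is an unchangeable tuple, the rebuilt rdd3 is guaranteed to match the one that was lost.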

🤛Stay tuned 🥳 Next we will understand what happens if RDDs are mutable.🤛
