RDDs in Spark
Do you know how data is stored in Spark? Do you know what #RDDs are in Spark?
By Karthik Kondpak
✍️Let's understand how data is stored :-
============================
✏️The basic unit which holds data in Spark is called an "#RDD"
✏️RDD stands for "#Resilient Distributed Dataset"
✍️Let's understand what "Resilient" and "Distributed Dataset" mean in depth:-
==============================
a) Distributed Dataset:-
=================
✏️We know a #List from traditional languages; it is stored on one machine.
✏️But let's say the list is split across 4 machines, distributing the data.
✏️Consider we have lists L1, L2, L3, L4 on machines M1, M2, M3, M4.
✏️Now the RDD is nothing but
rdd = L1 + L2 + L3 + L4
✏️All the pieces together are considered as one "#RDD"
✏️In other words, an RDD is nothing but an "in-memory distributed collection"
✏️RDDs are distributed across the cluster's memory.
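✏️Here is a minimal PySpark sketch of the same idea (the data, the app name and the partition count are just placeholders for illustration):-

from pyspark.sql import SparkSession

# Start a local Spark session (the app name is only an example)
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# A plain Python list lives on one machine (the driver)
data = [1, 2, 3, 4, 5, 6, 7, 8]

# parallelize() splits the list into partitions that are distributed
# across the executors -- all the pieces together form one RDD
rdd = sc.parallelize(data, numSlices=4)

print(rdd.getNumPartitions())   # 4
print(rdd.glom().collect())     # shows which elements sit in which partition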
b) Resilient :-
=========
✏️Resilient means the ability to quickly recover from failures.
✏️RDDs are resilient to failures; if something fails, they can be recovered.
✏️From Hadoop we know about the replication factor; replication on HDFS (i.e. on disk) works well, but keeping replicas in memory does not.
✏️Memory is very costly, and we cannot keep 3 replicas in memory.
✍️Then how is resilient behaviour achieved in RDDs:-
=========================
✏️RDDs provide fault tolerance through a lineage graph
📝What is a lineage graph?
==================
✏️A lineage graph keeps track of the #transformations to be executed once an action is called.
✏️In simple terms, the lineage graph keeps information like
a) which RDD this RDD depends on, #where it comes #from, and by performing what operation it was generated.
✏️Instead of holding data, it holds information about dependencies.
📝How and when is the lineage graph created?
================================
✏️When an action is encountered, a path is created which defines the order of execution of the operations.
✏️This information is remembered in the form of a "lineage graph"
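✏️As a rough illustration, Spark lets us peek at this graph with toDebugString(); continuing the sketch above:-

# Build a small chain of transformations on the rdd from the earlier sketch
mapped = rdd.map(lambda x: x * 2)
filtered = mapped.filter(lambda x: x > 4)

# toDebugString() prints the lineage: which parent RDDs this RDD was
# derived from and through which operations
print(filtered.toDebugString().decode("utf-8"))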
📝How does the lineage graph help to recover?
=============================
✏️Let's consider the following code:-
rdd1 = sc.textFile("hdfs://...")       # load data from HDFS
rdd2 = rdd1.map(lambda line: ...)      # some map transformation
rdd3 = rdd2.filter(lambda line: ...)   # a filter applied on rdd2
✏️Let's say we have lost rdd3; how does it get recovered?
✏️When rdd3 is lost, we look up its parent RDD, which is rdd2, using the lineage graph.
✏️We see which operation generated rdd3 from rdd2, apply the same operation once more, and we get back our rdd3.
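✏️A rough way to see this recomputation (the data and functions below are made up, and unpersist() only stands in for really losing the in-memory copy):-

rdd1 = sc.parallelize(range(100), 4)
rdd2 = rdd1.map(lambda x: x * 2)
rdd3 = rdd2.filter(lambda x: x % 3 == 0)

rdd3.cache()       # keep rdd3's partitions in memory
rdd3.count()       # action: materializes and caches rdd3

rdd3.unpersist()   # drop the in-memory copy
rdd3.count()       # Spark rebuilds rdd3 from rdd2 using the lineage graph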
✏️The immutable nature of RDDs is what makes this resilient behaviour possible.
✏️If RDDs were mutable, we could not achieve resilient behaviour, because re-applying the same operation on a changed parent would not give back the same data.
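✏️A tiny sketch of that immutability (illustrative values only):-

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)   # map() returns a NEW RDD; nums is untouched

print(nums.collect())      # [1, 2, 3, 4]  -- the original is unchanged
print(doubled.collect())   # [2, 4, 6, 8]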
🤛Stay tuned 🥳 we will understand what happens if RDDs are mutable.🤛
Like, share and comment