Introduction to HBase

What is HBase?

  • HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase is a database while HDFS is a file system.

Ex:-

Screenshot (311).png

  • HBase is used for real-time querying, and it is primarily used to store and process the large, loosely structured data kept in a Hadoop data lake. We use HBase when we need a column-family database.
  • HBase is used in cases where we need random read and write operations, and it can sustain a large number of operations per second on large data sets. HBase gives strong data consistency and can handle very large tables with billions of rows and columns on top of a cluster of commodity hardware.
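To make the column-family idea a bit more concrete, here is a minimal Python sketch of the data model as a nested map. It is only a toy and not the HBase client API: the `put` and `get` helpers, the row keys and the `personal`/`office` families are all made up for illustration.

```python
from collections import defaultdict

# Toy model only: row key -> column family -> column qualifier -> value.
# Real HBase also versions each cell by timestamp, which is left out here.
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value):
    """Store one cell; any qualifier can be added to a family at write time."""
    table[row][family][qualifier] = value

def get(row, family, qualifier):
    """Random read of a single cell by its full coordinates."""
    return table[row][family].get(qualifier)

# Made-up sample rows
put("emp001", "personal", "name", "Asha")
put("emp001", "personal", "age", "31")
put("emp002", "personal", "name", "Ravi")
put("emp002", "office", "branch", "Pune")  # emp001 has no 'office' cells at all

print(get("emp002", "office", "branch"))  # Pune
print(get("emp001", "office", "branch"))  # None
```

Rows don't need to have the same columns, which is the main difference from a relational table.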

Why does HBase work on top of HDFS?

  • HDFS files are write-once: data can be appended, but existing data can't be changed in place.

  • So HDFS on its own doesn't support updating or looking up individual records.

  • HBase allows such changes and can be utilized for standalone applications.

  • HBase is ideally suited for random reads and writes of data stored in HDFS.

HBase Architecture

1111-7.png

Components of HBase:-

  • Zookeeper
  • HMaster
  • Region Server
  • Region
  • Column Family

Region

Apache-HBase-Data-Model-Explanation.jpg

  • A table is divided into many regions, which are then stored on region servers (on the same region server or on different ones).

  • Regions live on region servers, and a region server holds multiple regions. The same region server can also hold regions of different tables.

  • A table can have any number of regions, since it keeps being divided as it grows.

  • In HBase, a table's data is divided into regions by row-key range. A region is split in two once it grows past a configured maximum size (hbase.hregion.max.filesize). Older releases defaulted to 256 MB, which is the figure used in this article; recent releases default to 10 GB.

  • Ex :- In the above table, we can see the sub-columns Name, Age, Bname, etc. Let's say we have 512 MB of data. With a 256 MB limit, the data ends up in two regions of roughly 256 MB each.

  • When data is first written, it all goes into a single region (say R1). Once R1 grows past the 256 MB limit, it is split by row key into two regions (say R1 and R2), each holding about half of the rows. So in this simplified picture, with 512 MB of data, R1 and R2 hold roughly 256 MB each.
  • Now, what if the actual data that we have to store is more than 512 MB? R1 and R2 together would only hold 512 MB, so the regions keep splitting as they fill up and the table gets as many regions as it needs.
  • Inside a region, there is a 'memstore' and 'hfiles' (one set per column family), and reads are served with the help of a 'blockcache'. Strictly speaking, the blockcache is shared by all the regions of a region server.

Screenshot (313).png
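How a table ends up in several regions can be sketched in a few lines of Python. This is a toy under loud assumptions: HBase splits on size in bytes, while this sketch counts rows to stay short, and `split_region` / `split_until_fits` are hypothetical helper names, not HBase functions. It assumes `max_rows` is at least 1.

```python
def split_region(rows, max_rows):
    """Split a sorted list of row keys in two at the middle key once it
    grows past max_rows. (Assumption: a row count stands in for bytes.)"""
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    return [rows[:mid], rows[mid:]]

def split_until_fits(rows, max_rows):
    """Keep splitting until every region is within the limit."""
    regions = []
    for part in split_region(sorted(rows), max_rows):
        if len(part) > max_rows:
            regions.extend(split_until_fits(part, max_rows))
        else:
            regions.append(part)
    return regions

row_keys = [f"row{i:03d}" for i in range(10)]  # made-up row keys
regions = split_until_fits(row_keys, max_rows=4)
for r in regions:
    print(r[0], "..", r[-1], f"({len(r)} rows)")
```

Each region covers one contiguous range of row keys, which is what later makes it possible to find the right region for a given key.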

  • To understand these, we need to understand the read and write operations.

Write Operation Diagram.jpg

(i) Write Operation

  • Below is the diagram to understand how write operation works!

Untitled Diagram-Page-1.jpg

  • Whenever data is written into HBase, it is stored in the HLog/WAL (Write Ahead Log) and in the memstore.

  • The WAL (Write Ahead Log) is a file that every region server maintains. It acts as a backup of recent writes. For ex:- If a region server crashes before the data held in its memstore has been written to disk, that data can be recovered by replaying the HLog/WAL.

  • The memstore is also known as the write buffer. It lives in RAM, and data is held there before it is written to disk. The memstore has a size limit (128 MB by default in current releases, set by hbase.hregion.memstore.flush.size), so once it reaches that limit it flushes its contents down to the disk.

  • Each flush writes the memstore's contents, sorted by row key, to disk as a new hfile. Over time this produces many hfiles.

  • At last the data is stored in hfiles. So the hfiles contain the actual data.
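The write path above can be sketched as a small Python class. This is a toy, not HBase code: `ToyRegionStore` is a hypothetical name, the flush threshold counts cells instead of bytes, and the WAL here is never trimmed after a flush.

```python
class ToyRegionStore:
    """Toy write path: append to a WAL, buffer in a memstore, flush to hfiles."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # cells, not bytes, in this sketch
        self.wal = []        # append-only log of every edit
        self.memstore = {}   # in-memory write buffer
        self.hfiles = []     # immutable sorted files "on disk"

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log first, for crash recovery
        self.memstore[row] = value      # 2. then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the buffer out as one new sorted, immutable hfile
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

store = ToyRegionStore(flush_threshold=3)
for row in ["r5", "r1", "r3", "r2"]:
    store.put(row, "v-" + row)

print(store.hfiles)    # one hfile holding r1, r3, r5 in sorted order
print(store.memstore)  # r2 is still waiting in memory
```

If the process died at this point, `r2` would exist only in `store.wal`, which is why the log is written before the memstore.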

(ii) Read Operation

  • Just like the write operation has the memstore, the read operation has the blockcache.
  • The blockcache holds the data that we read frequently. So when a request to read that data comes in, it can be served straight from the blockcache instead of searching through the disk, which saves time. The data that was least recently used gets removed from the blockcache.

  • The data in the blockcache is kept in RAM. That is why the least recently used data is removed, i.e. to free up RAM.
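The least-recently-used behaviour can be sketched with an `OrderedDict`. This is only an illustration: `ToyBlockCache` is a hypothetical class, it caches single values where HBase caches whole blocks of an hfile, and a plain dict stands in for the disk.

```python
from collections import OrderedDict

class ToyBlockCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk            # stands in for hfiles on disk
        self.cache = OrderedDict()
        self.disk_reads = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1                # cache miss: go to "disk"
        value = self.disk[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used
        return value

disk = {"a": 1, "b": 2, "c": 3}
cache = ToyBlockCache(capacity=2, disk=disk)
for key in ["a", "b", "a", "c", "a", "b"]:
    cache.read(key)

print(list(cache.cache))  # keys currently cached
print(cache.disk_reads)   # number of reads that had to hit disk
```

Six reads need only four trips to the "disk" here, because the two repeated reads of `a` are served from memory.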

Region Server

  • A region server handles multiple regions. The same region server can also store regions of more than one table.
  • For example:- A region server can hold regions of several different tables, like regions of the employee table, customer table, college table, order table, etc.

HMaster

  • HMaster manages all the region servers.

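  • Its operations are:- (i) Create, update and delete operations on tables. (ii) Region assignment. (iii) Reassigning regions for load balancing. (iv) Managing region server failures. Reads and writes of the data itself go to the region servers, not through the HMaster.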
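Region assignment and failure handling can be sketched as follows. This is a deliberately naive stand-in for what the master does: round-robin assignment and "give it to the least loaded server" are assumptions chosen for brevity, and `assign_regions`, `handle_failure` and the region names are hypothetical.

```python
def assign_regions(regions, servers):
    """Round-robin assignment: a simple stand-in for the master's balancer."""
    assignment = {s: [] for s in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment

def handle_failure(assignment, dead_server):
    """Return a new assignment with the dead server's regions moved
    onto whichever surviving server currently holds the fewest regions."""
    survivors = {s: list(r) for s, r in assignment.items() if s != dead_server}
    for region in assignment[dead_server]:
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(region)
    return survivors

regions = ["employee-1", "employee-2", "customer-1", "order-1", "college-1"]
before = assign_regions(regions, ["rs1", "rs2", "rs3"])
after = handle_failure(before, "rs1")

print(before)
print(after)
```

No region is lost when `rs1` goes away; its regions are simply served from the remaining servers.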

Zookeeper

  • Zookeeper coordinates the entire cluster. It keeps track of which servers are alive and which HMaster is active.

  • The HMaster and all the Region Servers continuously send heartbeat signals to Zookeeper so that it knows they are active.

  • What if the HMaster dies? --> In that case, an inactive (standby) HMaster takes over and becomes the active one. [See the fig. below]

Untitled Diagram.jpg
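The heartbeat and failover idea can be sketched like this. It is a simplification and not how Zookeeper is implemented: `ToyCoordinator` is a hypothetical class, the master names are made up, and time is passed in by hand so the example stays deterministic.

```python
class ToyCoordinator:
    """Tracks heartbeats and promotes a standby when the active master is silent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # master name -> time of last heartbeat
        self.active = None

    def heartbeat(self, name, now):
        self.last_seen[name] = now
        if self.active is None:
            self.active = name   # first master to report becomes active

    def check(self, now):
        """Return the active master, replacing it if its heartbeat timed out."""
        if self.active is not None and now - self.last_seen[self.active] > self.timeout:
            del self.last_seen[self.active]
            live = [m for m, t in self.last_seen.items() if now - t <= self.timeout]
            self.active = live[0] if live else None
        return self.active

zk = ToyCoordinator(timeout=3)
zk.heartbeat("master-a", now=0)
zk.heartbeat("master-b", now=0)
first = zk.check(now=1)          # master-a is still within the timeout

zk.heartbeat("master-b", now=4)  # master-a has gone silent
second = zk.check(now=4)

print(first, "->", second)
```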

  • Zookeeper also keeps track of where the Root Table & Meta Table are located.

Root Table and Meta Table

  • In HBase, to handle read and write operations, there are two types of catalog table - root and meta.

  • Zookeeper keeps track of where the root table is located, which is the starting point for every lookup.

  • In the entire cluster, there's only one root table, whereas the meta table can be split into more than one region.

  • Both these tables are stored on Region Servers. The root table records where the meta regions are. The meta table holds the details of the regions, i.e. which region holds which range of row keys and which region server hosts which region.

  • So, whenever we want to read data, these tables are consulted and they give the location of the region. Then, inside that region, the memstore, blockcache and hfiles are read.

Note - The root table exists only in older HBase versions. From HBase 0.96 onwards it was removed, and Zookeeper stores the location of the meta table (hbase:meta) directly.
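Finding the region for a row key is essentially a search over sorted start keys. A minimal Python sketch, assuming a made-up meta table held as a list of tuples (the region names, server names and the `locate` helper are all hypothetical):

```python
import bisect

# Hypothetical meta entries: (region start key, region name, region server),
# sorted by start key. The first region starts at the empty key.
meta = [
    ("",  "employee,region1", "rs1"),
    ("g", "employee,region2", "rs2"),
    ("p", "employee,region3", "rs1"),
]
start_keys = [entry[0] for entry in meta]

def locate(row_key):
    """Return the meta entry of the region whose key range contains row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return meta[index]

print(locate("alice"))
print(locate("greg"))
print(locate("zoe"))
```

A region's range starts at its start key (inclusive) and runs up to the next region's start key, so `"g"` itself belongs to the second region.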

Compactions

  • When data is stored in HBase, every memstore flush creates a new hfile, so over time the data ends up spread across many small hfiles. A read may then have to check many files, which makes reading or updating a record slow and painful.

  • So in order to manage this problem, compaction was introduced.

  • There are two types of compaction:-

  1. Minor Compaction
  2. Major Compaction

  1. Minor Compaction - In this, a few of the smaller hfiles are combined into fewer, larger hfiles.

Minor Compaction.jpg

  2. Major Compaction - In this, all the hfiles of a column family in a region are combined into a single hfile. Deleted and expired data is also removed at this point.

Major Compaction.jpg
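Since every hfile is already sorted, compaction is basically a merge of sorted files. Here is a minimal Python sketch using `heapq.merge`. The assumptions are loud ones: each cell is a `(row, timestamp, value)` tuple, a value of `None` marks a delete, only the newest value per row is kept, and the `compact` function with its `drop_deletes` flag is hypothetical rather than anything from HBase.

```python
import heapq

def compact(hfiles, drop_deletes):
    """Merge sorted hfiles into one sorted hfile, keeping the newest cell
    per row. Cells are (row, timestamp, value); value None marks a delete."""
    latest = {}
    # Merge by (row, timestamp) so the newest cell of each row arrives last.
    for row, ts, value in heapq.merge(*hfiles, key=lambda cell: cell[:2]):
        latest[row] = (ts, value)
    return [(row, ts, value) for row, (ts, value) in latest.items()
            if not (drop_deletes and value is None)]

hfile1 = [("r1", 1, "a"), ("r3", 1, "c")]
hfile2 = [("r1", 2, "a2"), ("r2", 2, "b")]
hfile3 = [("r3", 3, None)]   # r3 was deleted later

minor = compact([hfile1, hfile2], drop_deletes=False)        # some files
major = compact([hfile1, hfile2, hfile3], drop_deletes=True)  # all files

print(minor)
print(major)
```

The minor run still contains `r3`, because the file holding its delete was not part of that merge. Only the major run, which sees every file, can safely drop it.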

Summary of HBase Architecture

1111-7.png

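  • Zookeeper coordinates the entire cluster. It tracks which servers are alive and promotes an inactive HMaster when the active one fails.

  • HMaster manages the region servers. It also handles table create, update and delete operations, region assignment and region server failures.

  • A Region Server maintains regions.

  • Regions reside in region servers. They buffer writes in the memstore, persist data in hfiles and serve frequent reads from the blockcache.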

This whole article is based on my learnings. A few things could be wrong or not fully described, so feel free to let me know!