With the advent of Big Data, there is a lot of interest in HBase, because many applications need to persist and retrieve billions of data records in real time, and HBase provides exactly that. Because of this interest, people hold different views about what HBase is. In this blog post, we explain what HBase actually is and the architecture behind this massive system.
Suppose there was a database that allowed columns to be added and deleted dynamically and stored multiple values in a cell, distinguished by timestamp. Further, suppose this database is highly fault-tolerant and available, and provides fast, real-time access to petabytes of data. This is HBase. Yes, HBase is a distributed, column-oriented database built on top of HDFS. Hmm, quite a lot of terms in just one sentence. Let's look at it closely.
Basically, HBase is a collection of labeled tables. Each table has rows and columns, but the columns are grouped into column families, and every column name carries its column family as a prefix (family:qualifier). Every cell is uniquely identified by (row key, column, cell version) and can be thought of as an uninterpreted array of bytes versioned by a timestamp. This means there can be multiple versions of a cell for the same row and column. A big advantage of HBase tables is that every row is uniquely identified by a row key (the primary key), and the row key is just an array of bytes, so any value that can be serialized to bytes can serve as the primary key for the table. Further, operations on a single row are atomic.
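To make the cell addressing concrete, here is a minimal in-memory sketch in Python. It is a toy model of the data model described above, not the HBase client API: the ToyTable class, its put/get methods, and the sample row key and column are all made up for illustration.

```python
import time
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of the HBase data model (hypothetical, not the real API)."""

    def __init__(self, families):
        self.families = set(families)
        # (row key, "family:qualifier") -> {timestamp: value bytes}
        self.cells = defaultdict(dict)

    def put(self, row, column, value, ts=None):
        # The part before the first ":" is the column family prefix.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if ts is None:
            ts = int(time.time() * 1000)  # default to "now" in milliseconds
        self.cells[(row, column)][ts] = value

    def get(self, row, column, ts=None):
        # Return the requested version, or the newest one when ts is omitted.
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if ts is None:
            ts = max(versions)
        return versions.get(ts)


table = ToyTable(families=["info"])
table.put(b"user#42", "info:email", b"old@example.com", ts=1)
table.put(b"user#42", "info:email", b"new@example.com", ts=2)
print(table.get(b"user#42", "info:email"))        # newest version
print(table.get(b"user#42", "info:email", ts=1))  # older version by timestamp
```

Both writes land in the same row and column, and the timestamp is what keeps them apart.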
HBase tables are flexible because new columns can be added to a column family whenever they are needed, without changing the table schema. Column families themselves are declared in the table schema and change less often. All columns in a column family are stored together on disk. This enables tuning and storage settings at the column-family level and facilitates adding or removing columns on demand.
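As a rough illustration of what storing a column family's columns together means, the sketch below groups cells into one sorted store per column family. The group_by_family function and the sample cells are hypothetical, and the real on-disk format is considerably more involved.

```python
from collections import defaultdict


def group_by_family(cells):
    """Group (row, 'family:qualifier', value) cells into one sorted store per family."""
    stores = defaultdict(list)
    for row, column, value in cells:
        family, qualifier = column.split(":", 1)
        stores[family].append((row, qualifier, value))
    # Within a family, keep cells ordered by row key, then by qualifier.
    return {family: sorted(entries) for family, entries in stores.items()}


cells = [
    (b"row2", "info:name", b"Bob"),
    (b"row1", "stats:visits", b"7"),
    (b"row1", "info:name", b"Alice"),
    (b"row1", "info:email", b"alice@example.com"),
]
stores = group_by_family(cells)
for family, entries in stores.items():
    print(family, entries)
```

Because each family ends up in its own store, a setting applied to one family does not affect the others.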
Ok, so how is HBase distributed?
Well, as a table grows beyond a configured size limit, its rows are grouped into regions, each holding a contiguous, sorted range of row keys. These regions can be distributed over different nodes (Region Server Slaves) in the HBase cluster. So a database client only needs to access the regions that hold the rows it wants, not the whole table, which is a big advantage with huge data.
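The splitting idea can be sketched in a few lines of Python. This is a simplified illustration only: the ToyRegions class is hypothetical, it splits on a row count (max_rows) rather than on anything HBase actually measures, and it ignores which server hosts each region.

```python
import bisect


class ToyRegions:
    """Splits a sorted key space into regions once a region exceeds max_rows."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.start_keys = [b""]  # first row key covered by each region, sorted
        self.rows = [[]]         # sorted row keys held by each region

    def locate(self, row):
        # Index of the last region whose start key is <= row.
        return bisect.bisect_right(self.start_keys, row) - 1

    def insert(self, row):
        i = self.locate(row)
        region = self.rows[i]
        pos = bisect.bisect_left(region, row)
        if pos < len(region) and region[pos] == row:
            return  # row key already present
        region.insert(pos, row)
        if len(region) > self.max_rows:
            # Split the region in half; the upper half becomes a new region.
            mid = len(region) // 2
            self.rows[i:i + 1] = [region[:mid], region[mid:]]
            self.start_keys.insert(i + 1, region[mid])


regions = ToyRegions(max_rows=4)
for n in range(10):
    regions.insert(b"row%02d" % n)

print(regions.start_keys)
print(regions.locate(b"row05"))
```

A lookup only has to binary-search the region start keys and then touch a single region, which is the point made above about not scanning the whole table.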
HBase components in Action:
The question is: how does such a huge distributed database manage client interactions? Well, the answer is the HBase Master and ZooKeeper.
1. There is one active HBase Master, responsible for keeping track of and managing the regions of tables stored on different nodes, called Region Server Slaves.
2. Every Region Server Slave stores regions of different tables and is responsible for reporting to the HBase Master the status of the regions it stores. All the data is persisted via the Hadoop Distributed File System (HDFS).
3. ZooKeeper is used by the HBase Master to keep an active watch over every Region Server Slave: each Region Server Slave registers a znode entry in ZooKeeper, and the Master watches these entries to notice when a server goes down.
4. Special metadata tables called Catalog Tables store the locations of the different regions on the different Region Server Slaves. The -ROOT- table contains the addresses of the .META. regions, and each .META. entry records which Region Server Slave stores a given region.
5. So when a client needs some data, it contacts the ZooKeeper service to find out the location of -ROOT-. The client finds out the location of .META. from -ROOT-. Finally, from .META., the client learns the address of the Region Server Slave that stores its desired data. The client can then access that Region Server Slave directly and read/write data in the region of interest. Please see the interaction flow shown below:
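The lookup steps in the list above can also be traced in a short Python sketch. Everything in it is an assumption made for illustration: the server addresses, the dictionary layout, and the function names are invented, and the real catalog tables key their entries differently. It only mirrors the ZooKeeper, then -ROOT-, then .META. chain.

```python
import bisect

# Hypothetical cluster state; every name and address here is made up.
ZOOKEEPER = {"root-location": "rs1:60020"}

# -ROOT- maps the start key of each .META. region to the server hosting it.
ROOT = {"rs1:60020": [(b"", "rs2:60020")]}

# .META. maps the start key of each user-table region to the server hosting it.
META = {"rs2:60020": [(b"", "rs3:60020"), (b"m", "rs4:60020")]}


def find_entry(entries, row):
    """Return the server of the last entry whose start key is <= row."""
    keys = [start for start, _ in entries]
    return entries[bisect.bisect_right(keys, row) - 1][1]


def locate_region_server(row):
    root_server = ZOOKEEPER["root-location"]          # step 1: ask ZooKeeper for -ROOT-
    meta_server = find_entry(ROOT[root_server], row)  # step 2: read -ROOT- to find .META.
    return find_entry(META[meta_server], row)         # step 3: read .META. to find the region


print(locate_region_server(b"apple"))
print(locate_region_server(b"zebra"))
```

Once the client has the final address, it talks to that Region Server Slave directly, so the Master is not on the read/write path.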
HBase is a strong fit for applications processing big data. It does not support SQL, but it addresses the scalability limits of traditional relational databases. To scale HBase, we just need to add a few more nodes to the cluster, and they will be able to host more regions of the tables. HBase also offers fast, real-time access: ZooKeeper handles coordination, and clients talk directly to the Region Server Slaves that hold their data.
However, as per the official HBase docs, HBase isn't suitable for every problem. If you have small data sets, limited hardware, and a need for RDBMS-like features, then HBase is definitely not the best choice. But in an era of enormous volumes of unstructured data generated at blazing speed, HBase is here to deliver on its promise.