Publish Date: 2022-09-14
 
                    
In columnar Databases, a table is a collection of rows. but what makes a difference is that the row is a key value pair where the key is the primary identifier for the record. while the value is one or more of column families, which is collection of columns that contain one or more columns or cells.
in this second part of our article related to HBase, we will demonstrate some HBase operations that cover create, read, insert, update and delete of data using HBase shell.
if you are unfamiliar with HBase , check out my previous article Columnar Databases: HBase Data Overview to get fast understanding and an overview of data models in HBase.
lets start with namespaces level.
hbase(main):001:0> list_namespace
NAMESPACE
default
hbase
2 row(s) in 0.3970 seconds
if you want create a namespace, use create_namespace command followed by a namespace name in single quotes :
hbase(main):002:0> create_namespace 'class1'
0 row(s) in 0.1640 seconds
check and list namespaces again t see if it succeeds :
hbase(main):003:0> list_namespace
NAMESPACE
class1
default
hbase
3 row(s) in 0.0460 seconds
as this article is a continuous of previous article , we will use the same table presented there. and for the record, this is a table we are using through the rest of our demo :
| info | grade | |||||
|---|---|---|---|---|---|---|
| name | gender | age | math | physics | science | |
| 20221 | ahmed | male | 22 | 17 | 18 | 16 | 
| 20222 | sara | female | 20 | 14 | 16 | 15 | 
| 20223 | omar | male | 25 | 18 | 17 | 16 | 
| 20224 | layla | female | 18 | 15 | 14 | 13 | 
to create a table Use the create command . You must specify the table name and the ColumnFamily(s) name. Use the next syntax to do just that:
hbase(main):004:0> create 'namespace:table', 'column_family'
if you don’t specify namespace it will use the default namespace. For our demo, lets create a students table in with 2 column families info and grade :
hbase(main):005:0> create 'students' , 'info', 'grade'
0 row(s) in 1.2520 seconds
=> Hbase::Table - students
you can use 'list' command to confirm that our students table is created successfully:
hbase(main):006:0> list
TABLE
students
1 row(s) in 0.0150 seconds
=> ["students"]
Now use the describe (you can use shortened version of it: desc ) command to see table details, including default configurations values :
hbase(main):007:0> describe 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.1510 seconds
now that we created our table, it is time to put some data to work with, you can use many approaches especially in bulk and continues insert, but for our case we will simply use shell as our data is relatively small.
To put data into your table, use the put command with this syntax:
hbase(main):000:0> put 'table_name', 'row_id', 'column_family:column', 'value'
the syntax seems confusing , and to make it clear, let us in the next command insert three values, one at a time, all corresponds to one row.
The first insert is 20221, column info:name, with a value of ahmed . Columns in HBase are comprised of a column family prefix, info in our example, followed by a colon and then a column qualifier suffix, name in our case:
hbase(main):026:0> put 'students','20221', 'info:name','ahmed'
0 row(s) in 0.2950 seconds
hbase(main):027:0> put 'students','20221', 'info:gender','male'
0 row(s) in 0.0120 seconds
hbase(main):028:0> put 'students','20221', 'info:age','22'
0 row(s) in 0.0160 seconds
in the same way, you can add more cell values for grade column family to the same student 20221:
hbase(main):029:0> put 'students','20221', 'grade:math','17'
0 row(s) in 0.0280 seconds
hbase(main):030:0> put 'students','20221', 'grade:physics','18'
0 row(s) in 0.0170 seconds
hbase(main):031:0> put 'students','20221', 'grade:science','16'
0 row(s) in 0.0220 seconds
if you want to select the data we just insert, we can use 'scan' command:
hbase(main):033:0> scan 'students'  # to show table data
ROW                          COLUMN+CELL
20221                       column=grade:math, timestamp=1662386270882, value=17
20221                       column=grade:physics, timestamp=1662386270967, value=18
20221                       column=grade:science, timestamp=1662386271032, value=16
20221                       column=info:age, timestamp=1662385287949, value=22
20221                       column=info:gender, timestamp=1662385287894, value=male
20221                       column=info:name, timestamp=1662385287819, value=ahmed
1 row(s) in 0.0710 seconds
the data returned shows the values correspond to each column field for the row 20221 with timestamp as well.
Now lets add more rows to get some meaningful queries. Here we will add 3 more students with 2 column families:
put 'students','20222', 'info:name','sara'
put 'students','20222', 'info:gender','female'
put 'students','20222', 'info:age','20'
put 'students','20223', 'info:name','omar'
put 'students','20223', 'info:gender','male'
put 'students','20223', 'info:age','25'
put 'students','20224', 'info:name','layla'
put 'students','20224', 'info:gender','female'
put 'students','20224', 'info:age','18'
put 'students','20222', 'grade:math','14'
put 'students','20222', 'grade:physics','16'
put 'students','20222', 'grade:science','15'
put 'students','20223', 'grade:math','18'
put 'students','20223', 'grade:physics','17'
put 'students','20223', 'grade:science','16'
put 'students','20224', 'grade:math','15'
put 'students','20224', 'grade:physics','14'
put 'students','20224', 'grade:science','13'
you can count the number of records in the students table to verify:
hbase(main):056:0> count 'students'
4 row(s) in 0.0320 seconds
=> 4
use scan command again to see how data look like, but be careful and don’t use it against large tables as it can cause performance issue ,because all the data will get fetched.
Another way query table data, is to scan only the data in the specific column family, lets for a moment select info column family, we can use next syntax to achieve that:
hbase(main):058:0> scan 'students',{COLUMNS=>'info'}
ROW                          COLUMN+CELL
20221                       column=info:age, timestamp=1662385287949, value=22
20221                       column=info:gender, timestamp=1662385287894, value=male
20221                       column=info:name, timestamp=1662385287819, value=ahmed
20222                       column=info:age, timestamp=1662387271716, value=20
20222                       column=info:gender, timestamp=1662387271626, value=female
20222                       column=info:name, timestamp=1662387271561, value=sara
20223                       column=info:age, timestamp=1662387271976, value=25
20223                       column=info:gender, timestamp=1662387271888, value=male
20223                       column=info:name, timestamp=1662387271840, value=omar
20224                       column=info:age, timestamp=1662387272219, value=18
20224                       column=info:gender, timestamp=1662387272084, value=female
20224                       column=info:name, timestamp=1662387272023, value=layla
4 row(s) in 0.0860 seconds
we can also query only the math mark in the grade column family:
hbase(main):060:0> scan 'students',{COLUMNS=>'grade:math'}
ROW                          COLUMN+CELL
20221                       column=grade:math, timestamp=1662387264427, value=17
20222                       column=grade:math, timestamp=1662387264583, value=14
20223                       column=grade:math, timestamp=1662387264795, value=18
20224                       column=grade:math, timestamp=1662387264965, value=15
4 row(s) in 0.0580 seconds
In Get mode, data is queried based on the row key. For example if we want to get related information for student '20224' :
hbase(main):061:0> get 'students','20224'
COLUMN             CELL
grade:math        timestamp=1662387264965, value=15
grade:physics     timestamp=1662387265005, value=14
grade:science     timestamp=1662387265054, value=13
info:age          timestamp=1662387272219, value=18
info:gender       timestamp=1662387272084, value=female
info:name         timestamp=1662387272023, value=layla
6 row(s) in 0.0580 seconds
we can also fetch for specific column for student '20224':
hbase(main):062:0> get 'students','20224','info:name'
COLUMN             CELL
info:name         timestamp=1662387272023, value=layla
1 row(s) in 0.0200 seconds
we can also query data by specifying some criteria. Lets get only student names , but limit the result to only 2 rows:
hbase(main):064:0> scan 'students',{STARTROW=>'20221', 'LIMIT'=>2, COLUMNS=>'info:name'}
ROW                          COLUMN+CELL
20221                       column=info:name, timestamp=1662385287819, value=ahmed
20222                       column=info:name, timestamp=1662387271561, value=sara
2 row(s) in 0.0360 seconds
HBase can store multiples versions for the same row. In our case we will put (update) the student '20221' name, to have multiples versions:
hbase(main):065:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0160 seconds
hbase(main):066:0> put 'students','20221','info:name','Ahmed Sghir'
0 row(s) in 0.0130 seconds
hbase(main):067:0> put 'students','20221','info:name','Ahmed Wald Nabil'
0 row(s) in 0.0150 seconds
if you query the student '20221' name:
hbase(main):069:0> get 'students','20221','info:name'
COLUMN                       CELL
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
1 row(s) in 0.0110 seconds
as you can see, you only get the latest update, the same behavior as in relational databases. however, this is not what we want to !? we want to get all the previous updates (called versions in HBase) as well and this is what HBase is god on it.
You can set VERSIONS to specify the number of versions (updates to be queried:
hbase(main):070:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
3 row(s) in 0.0190 seconds
The version is specified during the search, but the last record is still displayed. Although VERSIONS is added, only one record is returned after the get operation. This is because the default value of VERSIONS is 1 during table creation. To verify that execute the next command and check VERSIONS property:
hbase(main):071:0> desc 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0810 seconds
To view data of multiple versions, run the following statement to change the value of VERSIONS of the table ( or you can specify the value when creating the table as well):
hbase(main):072:0> alter 'students', {NAME=>'info', 'VERSIONS'=>5}
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.1770 seconds
Then, insert multiple data record updates to student '20221' names:
hbase(main):073:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0230 seconds
hbase(main):074:0> put 'students','20221','info:name','Ahmed Ben Ali'
0 row(s) in 0.0180 seconds
hbase(main):075:0> put 'students','20221','info:name','Ahmed Wald Ali'
0 row(s) in 0.0240 seconds
now to get the last 5 names versions for student '20221', we just run get command as in the next command:
hbase(main):076:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662391886964, value=Ahmed Wald Ali
info:name                   timestamp=1662391885641, value=Ahmed Ben Ali
info:name                   timestamp=1662391882289, value=Ahmed Ali
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
info:name                   timestamp=1662385287819, value=ahmed
7 row(s) in 0.0350 seconds
if you want, you can delete specific single row field, for example lets check the age of student '20221':
hbase(main):078:0> get 'students','20221', 'info:age'
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
1 row(s) in 0.0460 seconds
let’s delete this age value, that correspond to 'Ahmed' student:
hbase(main):079:0> delete 'students','20221','info:age'
0 row(s) in 0.0590 seconds
if we check again, the age value is gone, and it is no longer in the database:
hbase(main):080:0> get 'students','20221', 'info:age'
COLUMN                       CELL
0 row(s) in 0.0110 seconds
hbase(main):081:0> get 'students','20221'
COLUMN                       CELL
grade:math                  timestamp=1662387264427, value=17
grade:physics               timestamp=1662387264488, value=18
grade:science               timestamp=1662387264533, value=16
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662391886964, value=Ahmed Wald Ali
5 row(s) in 0.0310 seconds
we can also delete all records of a specific row as well, using 'deleteall' command:
hbase(main):082:0> deleteall 'students','20221'
0 row(s) in 0.0150 seconds
If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the disable command :
hbase(main):083:0> disable 'students'
0 row(s) in 2.4950 seconds
N.B: you can use 'enable' command in the same way if just change some settings.
To drop (delete) a table, use the drop command:
hbase(main):084:0> drop 'students'
0 row(s) in 1.2880 seconds
in the same way, you can drop a namesapace if you are no longer using it:
This command drop_namespace is used to drop the namespace . You can drop only empty namespace, so you must drop all the tables created in that namespace before attempting to drop it.
as our namespace 'class1' is empty, we can run the next command to do so:
hbase(main):085:0> drop_namespace 'class1'
0 row(s) in 1.3680 seconds
by the end of this demo, you now can use different commands to manipulate data in HBae, including create, put, get, scan, delete drop and many others.