Hands-on Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

Session Number 1687

Piotr Pruski, IBM, [email protected] (@ppruski)
Benjamin Leonhardi, IBM


Description

This is the hands-on-lab document I created to accompany my presentation at the Information On Demand 2013 conference, Session Number 1687 - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL. *Contact me for data files*

This lab has 3 independent parts:

Part I - Creating Big SQL Tables and Loading Data (explores different ways to create and load HBase tables with Big SQL; includes an optional section on HBase access via JAQL)

Part II - Query Handling (how to query HBase tables with Big SQL)

Part III - Connecting to Big SQL Server via JDBC (uses BIRT, a business intelligence and reporting tool, to run a simple report on a TPC-H orders table, showcasing the Big SQL JDBC driver)




Table of Contents

Lab Setup
Getting Started
Administering the Big SQL and HBase Servers
Part I – Creating Big SQL Tables and Loading Data
    Background
    One-to-one Mapping
        Adding New JDBC Drivers
    One-to-one Mapping with UNIQUE Clause
    Many-to-one Mapping (Composite Keys and Dense Columns)
        Why do we need many-to-one mapping?
        Data Collation Problem
    Many-to-one Mapping with Binary Encoding
    Many-to-one Mapping with HBase Pre-created Regions and External Tables
    Load Data: Error Handling
    [OPTIONAL] HBase Access via JAQL
PART II – A – Query Handling
    The Data
    Projection Pushdown
    Predicate Pushdown
        Point Scan
        Partial Row Scan
        Range Scan
        Full Table Scan
    Automatic Index Usage
    Pushing Down Filters into HBase
    Table Access Hints
        Accessmode
PART II – B – Connecting to Big SQL Server via JDBC
    Business Intelligence and Reporting via BIRT
Communities
Thank You!
Acknowledgements and Disclaimers


Lab Setup

This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick Start Edition uses a non-warranted program license and is not for production use; its purpose is to let you experiment with the features of InfoSphere BigInsights while using real data and running real applications. The Quick Start Edition puts no data limit on the cluster, and there is no time limit on the license. The following table outlines the users and passwords that are pre-configured on the image:

username     password
--------     --------
root         password
biadmin      biadmin
db2inst1     password

Getting Started

To prepare for the contents of this lab, you must go through the following process to start all of the Hadoop components.

1. Start the VMware image by clicking the "Power on this virtual machine" button in VMware Workstation if the VM is not already on.

2. Log into the VMware virtual machine using the following information:
   - user: biadmin
   - password: biadmin

3. Double-click on the BigInsights Shell folder icon on the desktop of the Quick Start VM. This folder provides quick links to the following tools, which will be used throughout the course of this exercise:
   - Big SQL Shell
   - HBase Shell
   - Jaql Shell
   - Linux gnome-terminal


4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons).

Linux Terminal
start-all.sh

Note: This command may take a few minutes to finish.

[INFO] Progress - 100%
[INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive, hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms

Once all components have started successfully, as shown above, you may move on to the next section.
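Optionally, you can also confirm that the daemons are up with the JDK's jps tool, which lists the running Java processes. This is a generic check, not specific to BigInsights; on a healthy single-node image you would expect to see processes such as NameNode, DataNode, HMaster, and HRegionServer among the output.

Linux Terminal
jps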

Administering the Big SQL and HBase Servers

BigInsights provides both command-line tools and a user interface to manage the Big SQL and HBase servers. In this section, we briefly go over the user interface, which is part of the BigInsights Web Console.

1. Bring up the BigInsights web console by double-clicking the BigInsights WebConsole icon on the desktop of the VM, and open the Cluster Status tab. Select HBase to view the status of the HBase master and region servers.

2. Similarly, click on Big SQL from the same tab to view its status.



3. Use the hbase-master and hbase-regionserver web interfaces to visualize tables, regions, and other metrics. Go to the BigInsights Welcome tab and select "Access Secure Cluster Servers." You may need to enable pop-ups from the site when prompted.

Alternatively, point a browser directly at the two web-interface URLs; on this image the HBase master UI is served at http://bivm:60010/ (used again later in this lab), and the region server UI is typically on port 60030.

Some interesting information available from the web interfaces:

- HBase root directory
  - This can be used to find the size of an HBase table.
- List of tables, with descriptions.
- For each table, the list of regions with start and end keys.
  - This information can be used to compact or split tables as needed.
- Metrics for each region server.
  - These can be used to determine whether there are hot regions serving the majority of requests to a table; such regions can be split. They also help determine the effects and effectiveness of the block cache, bloom filters, and memory settings.
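Much of the same information is also available from the command line. As an optional sketch, the following standard HBase shell commands report cluster status (including per-region-server load and region counts) and list the user tables:

HBase Shell
status 'detailed'
list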

4. Perform a health check of HBase and Big SQL, which is different from the status checks done above: it verifies the health of the functionality. From the Linux gnome-terminal, issue the following commands.

Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh hbase
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check hbase
[INFO] Deployer - Try to start hbase if hbase service is stopped...
[INFO] Deployer - Double check whether hbase is started successfully...
[INFO] @bivm - hbase-master(active) started, pid 6627
[INFO] @bivm - hbase-regionserver started, pid 6745
[INFO] Deployer - hbase service started
[INFO] Deployer - hbase service is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes : 26335ms

Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check bigsql
[INFO] @bivm - bigsql-server already running, pid 6949
[INFO] Deployer - Ping Check Success: bivm/192.168.230.137:7052
[INFO] @bivm - bigsql is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : 1121ms

Part I – Creating Big SQL Tables and Loading Data

In this part of the lab, our main goal is to demonstrate the migration of a table from a relational database to BigInsights using Big SQL over HBase. We will see how HBase handles row keys and some pitfalls users may encounter when moving data from a relational database to HBase tables. We will also try some useful options, such as pre-creating regions, to see how they can help with data loading and queries, and we will explore various ways to load data.

Background


In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW): SLS_SALES_FACT. The details of the table, along with its primary-key columns, are shown below.

SLS_SALES_FACT

PK  ORDER_DAY_KEY
PK  ORGANIZATION_KEY
PK  EMPLOYEE_KEY
PK  RETAILER_KEY
PK  RETAILER_SITE_KEY
PK  PRODUCT_KEY
PK  PROMOTION_KEY
PK  ORDER_METHOD_KEY

    SALES_ORDER_KEY
    SHIP_DAY_KEY
    CLOSE_DAY_KEY
    QUANTITY
    UNIT_COST
    UNIT_PRICE
    UNIT_SALE_PRICE
    GROSS_MARGIN
    SALE_TOTAL
    GROSS_PROFIT

This image contains a DB2 instance holding this table, with data already loaded, which we will use as the source of our migration. From the Linux gnome-terminal, switch to the DB2 instance user as shown below.

Linux Terminal
su - db2inst1

Note: The password for db2inst1 is password. Enter it when prompted.

As db2inst1, connect to the pre-created database, gosales.

Linux Terminal
db2 CONNECT TO gosales

Upon successful connection, you should see the following output on the terminal.

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.0
 SQL authorization ID   = DB2INST1
 Local database alias   = GOSALES

Issue the following command to list all of the tables contained in this database.


Linux Terminal

db2 LIST TABLES

Table/View Schema Type Creation time

------------------------------- --------------- ----- --------------------------

SLS_SALES_FACT DB2INST1 T 2013-08-22-14.51.27.228148

SLS_SALES_FACT_10P DB2INST1 T 2013-08-22-14.54.01.622569

SLS_SALES_FACT_25P DB2INST1 T 2013-08-22-14.55.46.416787

3 record(s) selected.
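Optionally, you can confirm the primary-key definition from DB2 itself using the standard describe indexes command. Note that, depending on how these sample tables were created, the *_10P copies may not carry the original warehouse table's primary key:

Linux Terminal
db2 "DESCRIBE INDEXES FOR TABLE sls_sales_fact SHOW DETAIL"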

Note: Here you will see three tables. Each one is essentially the same, with one key difference: the amount of data contained within them. The remaining instructions in this lab exercise use the SLS_SALES_FACT_10P table, simply because it holds a smaller amount of data and is faster to work with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do so, but remember to change the names appropriately.

Examine how many rows we have in this table, to ensure later that everything is migrated properly. Issue the following select statement.

Linux Terminal
db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"

1
-----------
      44603

  1 record(s) selected.

You should expect 44603 rows in this table.

Use the following describe command to view all of the columns and data types contained within this table.

Linux Terminal
db2 "DESCRIBE TABLE sls_sales_fact_10p"

                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
ORDER_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORGANIZATION_KEY                SYSIBM    INTEGER                      4     0 Yes
EMPLOYEE_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_SITE_KEY               SYSIBM    INTEGER                      4     0 Yes
PRODUCT_KEY                     SYSIBM    INTEGER                      4     0 Yes
PROMOTION_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORDER_METHOD_KEY                SYSIBM    INTEGER                      4     0 Yes
SALES_ORDER_KEY                 SYSIBM    INTEGER                      4     0 Yes
SHIP_DAY_KEY                    SYSIBM    INTEGER                      4     0 Yes
CLOSE_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
QUANTITY                        SYSIBM    INTEGER                      4     0 Yes
UNIT_COST                       SYSIBM    DECIMAL                     19     2 Yes
UNIT_PRICE                      SYSIBM    DECIMAL                     19     2 Yes
UNIT_SALE_PRICE                 SYSIBM    DECIMAL                     19     2 Yes
GROSS_MARGIN                    SYSIBM    DOUBLE                       8     0 Yes
SALE_TOTAL                      SYSIBM    DECIMAL                     19     2 Yes
GROSS_PROFIT                    SYSIBM    DECIMAL                     19     2 Yes

  18 record(s) selected.


One-to-one Mapping

In this section, we will use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach; the goal of this exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping. Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In the following example, the HBase row key is mapped to the SQL column id; similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, and so on.
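As a minimal sketch of that generic example (hypothetical schema and table name, following the same COLUMN MAPPING syntax used throughout this lab), a one-to-one mapping might look like:

BigSQL Shell
CREATE HBASE TABLE demo.customers
(
id int,
name varchar(30)
)
COLUMN MAPPING
(
key mapped by (id),
cf_data:cq_name mapped by (name)
);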

To begin, first create a schema to keep our tables organized. Open the BigSQL Shell from the BigInsights Shell folder on the desktop and use the create schema command to create a schema named gosalesdw.

BigSQL Shell


CREATE SCHEMA gosalesdw;


Issue the following command in the same BigSQL shell. This DDL statement creates the SQL table with a one-to-one mapping of what we have in our relational DB2 source. Notice that all the column names and data types match. The column mapping section requires a mapping for the row key; HBase columns are identified using family:qualifier.

BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
(
ORDER_DAY_KEY int,
ORGANIZATION_KEY int,
EMPLOYEE_KEY int,
RETAILER_KEY int,
RETAILER_SITE_KEY int,
PRODUCT_KEY int,
PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int,
SHIP_DAY_KEY int,
CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2),
UNIT_PRICE decimal(19,2),
UNIT_SALE_PRICE decimal(19,2),
GROSS_MARGIN double,
SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY),
cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY),
cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY),
cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY),
cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY),
cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY),
cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY),
cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY),
cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY),
cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_UNIT_COST mapped by (UNIT_COST),
cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE),
cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE),
cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN),
cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL),
cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT)
);
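Once the create statement completes, you can optionally inspect the underlying HBase table from the HBase shell with the standard describe command; the cf_data column family shown comes from the mapping above:

HBase Shell
describe 'gosalesdw.sls_sales_fact'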


Big SQL supports a load from source command that can be used to load data from warehouse sources, which we'll use first. It also supports loading data from delimited files using a load hbase command, which we will use later.

Adding New JDBC Drivers

The load from source command uses Sqoop internally to perform the load. Therefore, before using the load command from a BigSQL shell, we must first add the driver for the JDBC source to 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory. From a Linux gnome-terminal, issue the following command (as biadmin) to add the JDBC driver JAR file for the database to the $SQOOP_HOME/lib directory.

Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib

From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.

BigSQL Shell
\drivers

Terminate the BigSQL shell with the quit command.

BigSQL Shell
quit

Copy the same DB2 driver to the JSQSH share directory with the following command.

Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/

Whenever new drivers are added, the Big SQL server must be restarted. You can do this either from the web console or with the following command from the Linux gnome-terminal.

Linux Terminal
stop.sh bigsql && start.sh bigsql

Open the BigSQL Shell from the BigInsights Shell folder on the desktop once again (it was closed earlier with the quit command) and check that the driver was in fact loaded into JSQSH.

BigSQL Shell
\drivers
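You can also sanity-check from a Linux gnome-terminal that the JAR landed in both locations, using plain shell commands (nothing BigSQL-specific):

Linux Terminal
ls $SQOOP_HOME/lib | grep db2jcc
ls $BIGINSIGHTS_HOME/bigsql/jsqsh/share/ | grep db2jcc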


Now that the drivers have been set up, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table. Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we defined in Big SQL.

BigSQL Shell
LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1', password = 'password')
FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;

44603 rows affected (total: 1m37.74s)

You should expect to load 44603 rows, the same number that the select count statement on the original DB2 table returned earlier.

Try to verify this with a select count statement as shown.

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)

Notice there is a discrepancy between the result of the load operation and the select count statement.

Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on the desktop and issue the following count command.

HBase Shell
count 'gosalesdw.sls_sales_fact'
33 row(s) in 0.7000 seconds

The Big SQL statement and the HBase command agree with one another. However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of rows retrieved when we query the table. The load (and insert, examined later) command behaves like an upsert: if a row with the same row key already exists, HBase writes the new value as a new version of that column/cell, and when the table is queried, only the latest version is returned by Big SQL. This behaviour can be confusing. In our case, we loaded data with repeating row-key values from a DB2 table with 44603 rows; the load reported 44603 rows affected, but the select count(*) showed far fewer rows, 33 to be exact (one per distinct ORDER_DAY_KEY). No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as we did. Now that we understand that the rows are actually versioned in HBase, we can examine a way to retrieve all versions of a particular row.


First, from the BigSQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.

BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;
+------------------+
| organization_key |
+------------------+
| 11171            |
+------------------+

As expected, we retrieve only one row: the latest (newest) version of the row inserted into HBase with the specified order day key.

Using the HBase shell, we can retrieve previous versions for a row key. Use the following get command to fetch the top 4 versions of the row with row key 20070720.

HBase Shell
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
COLUMN                       CELL
 cf_data:cq_ORGANIZATION_KEY timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds

Since the previous command specified only 4 versions (VERSIONS => 4), we retrieve only 4 rows in the output.

Optionally, try the same command again with a larger version count, for example VERSIONS => 100. Either way, this is most likely not the behaviour users expect when performing such a migration; they probably wanted to get all the data into the HBase table without versioned cells. There are a couple of solutions. One is to define the table with a composite row key to enforce uniqueness, which is explored later in this lab. The other, outlined in the next section, is to force each row key to be unique by appending a UUID.

One-to-one Mapping with UNIQUE Clause


Another option when performing such a migration is to use the force key unique option when creating the table with Big SQL syntax. This option forces the load to add a UUID to the row key, which prevents versioning of cells. However, the method is quite inefficient: it stores more data and also makes queries slower. Issue the following command in the BigSQL shell. This statement creates the SQL table with the same one-to-one mapping of our relational DB2 source; the DDL is almost identical to the previous section's, with one exception: the force key unique clause is specified in the column mapping of the row key.

BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE
(
ORDER_DAY_KEY int,
ORGANIZATION_KEY int,
EMPLOYEE_KEY int,
RETAILER_KEY int,
RETAILER_SITE_KEY int,
PRODUCT_KEY int,
PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int,
SHIP_DAY_KEY int,
CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2),
UNIT_PRICE decimal(19,2),
UNIT_SALE_PRICE decimal(19,2),
GROSS_MARGIN double,
SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY) force key unique,
cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY),
cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY),
cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY),
cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY),
cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY),
cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY),
cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY),
cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY),
cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_UNIT_COST mapped by (UNIT_COST),
cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE),
cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE),
cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN),
cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL),
cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT)
);


In the previous section, we used the load from source command to get the data from our table on the DB2 source into HBase. This may not always be feasible, which is why in this section we explore another loading statement, load hbase. This loads data into HBase from flat files, which might, for example, be an export of the data from the relational source. Issue the following statement, which loads data from a file into an InfoSphere BigInsights HBase table.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE gosalesdw.sls_sales_fact_unique;

44603 rows affected (total: 26.95s)

Note: The load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.

Once again, you should expect to load 44603 rows, the same number of rows that the select count statement on the original DB2 table verified.

Verify the number of rows loaded with a select count statement as shown.

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)

This time there is no discrepancy between the result of the load operation and the select count statement.

Issue the same count from the HBase shell to be sure.

HBase Shell
count 'gosalesdw.sls_sales_fact_unique'
...
44603 row(s) in 6.8490 seconds

The counts are consistent across the load, select, and count commands.

As in the previous section, from the BigSQL shell, issue the following select query with a predicate on the order day key.

BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = 20070720;
1405 rows in results(first row: 0.47s; total: 0.58s)


In the previous section, only one row was returned for the specified date. This time we see 1405 rows, since the row keys are now forced to be unique by the clause in our create statement, and therefore no versioning is applied.

Once again, as in the previous section, we can check from the HBase shell whether there are multiple versions of the cells. Issue the following get statement to attempt to retrieve the top 4 versions of the row with row key 20070720.

HBase Shell
get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
COLUMN                       CELL
0 row(s) in 0.0850 seconds

Zero rows are returned because the row key 20070720 doesn't exist: we appended a UUID to each row key (20070720 + UUID).

Therefore, issue the following HBase command instead to do a scan rather than a get. This scans the table using the first part of the row key; we also pass scanner specifications with start and stop row values so that only the results we are interested in are returned.

HBase Shell
scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'}
1405 row(s) in 12.1350 seconds

Notice there are no discrepancies between the results of the Big SQL select and the HBase scan.

Many-to-one Mapping (Composite Keys and Dense Columns)

This section is dedicated to the other option for enforcing uniqueness of the cells: defining the table with a composite row key (also known as many-to-one mapping). In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (a row key or a column). Two terms are used frequently here: composite key and dense column. A composite key is an HBase row key that is mapped to multiple SQL columns. A dense column is an HBase column that is mapped to multiple SQL columns. In the following example, the row key contains two parts, userid and account number, each corresponding to a SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that we can have a mix; for example, a composite key, a dense column, and a non-dense column, or any combination of these.

[Figure: example many-to-one mapping. The SQL columns userid and acc_no form the HBase row key (e.g., 11111_ac11); first_name and last_name are packed into cf_data:cq_names (fname1_lname1); and the account columns balance, interest, and min_bal are packed into cf_data:cq_acct (11111#11#0.25).]

Issue the following DDL statement from the BigSQL shell, which represents all entities from our relational table using a many-to-one mapping. Take notice of the column mapping section, where multiple columns are mapped to a single family:qualifier.

BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);

Why do we need many-to-one mapping?

HBase stores a lot of information for each value: along with every value, a key consisting of the row key, column family name, column qualifier, and timestamp is also stored. This means a lot of duplicate information is kept. HBase is very verbose, and it is primarily intended for sparse data; in most cases, data in the relational world is not sparse. If we were to store each SQL column individually in HBase, as in our previous two sections, the required storage space would grow dramatically. When querying the data back, the query also returns the entire key (the row key, column family, and column qualifier) for each value. As an example, after loading data into this table we will examine the storage space used by each of the three tables created thus far.


As in the previous section, issue the following statement, which loads data from a file into the InfoSphere BigInsights HBase table.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE gosalesdw.sls_sales_fact_dense;

44603 rows affected (total: 3.42s)

Notice that the number of rows loaded into a table with many-to-one mapping remains the same, even though we are storing less data! The statement also executes much faster than the previous load for this exact reason.

Issue the same statements and commands from both the BigSQL and HBase shells as in the previous two sections, to verify that the number of rows matches the original dataset. All of the results should be the same as in the previous section.

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)

BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)

HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
1405 row(s) in 4.3830 seconds

As noted earlier, a one-to-one mapping uses much more storage space than the same data mapped using composite keys or dense columns, where the HBase row key or HBase column(s) are made up of multiple relational table columns. This is because HBase repeats the row key, column family name, column name, and timestamp for each column value; for relational data, which is usually dense, this causes an explosion in the required storage space. Issue the following command as biadmin from a Linux gnome-terminal to check the directory sizes for the three tables we have created thus far.


Linux Terminal
hadoop fs -du /hbase/
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…

Notice that the dense table is significantly smaller than the others. The table in which we forced uniqueness is the largest, since it appends a UUID to each row key.

Data Collation Problem

All data represented thus far has been stored as strings; that is the default encoding for HBase tables created by Big SQL. As a result, numeric data is not collated correctly: HBase uses lexicographic byte ordering, so you may run into cases where a query returns wrong results. The following scenario walks through a situation where data is not collated correctly. Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table we previously defined and loaded. Notice that the value we specify for the ORDER_DAY_KEY column (which has data type int) is a larger numerical value that does not conform to any date standard, since it contains an extra digit.

BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY)
VALUES (200707201, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Note: The insert command is available for HBase tables; however, it is not a supported feature.

Issue a scan on the table with the following start and stop criteria.

HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}


200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
1406 row(s) in 4.2400 seconds

Take notice of the last three rows/cells returned by this scan: the newly added row shows up even though its integer value is not between 20070720 and 20070721.

Now insert another row into the table with the following command. This time we conform to the YYYYMMDD date format, incrementing the day by 1 from the last value in the table, i.e., 20070721.

BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY)
VALUES (20070721, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Keep in mind to increase the stoprow criterion by one day.

HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070722'}
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692480966, value=
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692480966, value=
1407 row(s) in 2.8840 seconds

Now notice that the newly added row is included in the result set, and that the row with ORDER_DAY_KEY 200707201 comes before the row with ORDER_DAY_KEY 20070721, even though it is numerically larger. This is numeric data not collating properly: the rows are stored in byte-lexicographic order rather than the numerical order one might expect.
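To see byte-lexicographic ordering in isolation, you can reproduce it with a small throwaway table in the HBase shell (t_sort is a hypothetical name; these are standard shell commands):

HBase Shell
create 't_sort', 'cf'
put 't_sort', '20070721', 'cf:v', 'a'
put 't_sort', '200707201', 'cf:v', 'b'
# the scan returns 200707201 before 20070721: string order, not numeric order
scan 't_sort'
disable 't_sort'
drop 't_sort'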


Many-to-one Mapping with Binary Encoding

Big SQL supports two data encodings: string and binary. Each HBase entity can have its own encoding; for example, the row key can be encoded as a string while one HBase column is encoded as binary and another as string. String is the default encoding for Big SQL HBase tables: the value is converted to a string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit the data. The default separator is the null byte; as it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly. Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids the problem of separators occurring within the data by escaping them. Issue the following DDL statement from the BigSQL shell to create a dense table as in the previous section, this time overriding the default encoding to binary.

BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time we add the DISABLE WAL clause. Disabling the WAL (write-ahead log) speeds up writes into HBase; however, it is not a safe option, because turning off the WAL can result in data loss if a region server crashes. Another possible way to speed up a load is to increase the write buffer size.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;


44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.

BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary go -m discard;
44603 rows in results(first row: 0.35s; total: 2.89s)

Note: The "go -m discard" option is used so that the results of the command are not displayed in the terminal.

Issue another select statement on the earlier table that has string encoding, sls_sales_fact_dense.

BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)

(The two extra rows are the ones we inserted in the Data Collation Problem section.) The main point here is that the query against the binary-encoded table can return faster; numeric types are also collated properly.

Note: You will probably not see much, if any, performance difference in this lab exercise, since we are working with such a small dataset.

No custom serialization/deserialization logic is required for string encoding, which makes it portable if another application needs to read the data in the HBase tables. A main use case for string encoding is mapping existing data: delimited data is a very common storage format and can easily be mapped using Big SQL string encoding. However, parsing strings is expensive, queries over string-encoded data are slow, and, as we saw, numeric data is not collated correctly. Queries over binary-encoded data have faster response times, and numeric data, including negative numbers, collates correctly. The downside is that the data is encoded by Big SQL's own logic and may not be portable as-is.
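As a quick sketch (not part of the original lab), you can also push a numeric range predicate through Big SQL against the binary-encoded table; with binary encoding, the range is honoured correctly even though the key bytes are no longer human-readable strings:

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense_binary
WHERE order_day_key >= 20070720 AND order_day_key < 20070721;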


Many-to-one Mapping with HBase Pre-created Regions and External Tables

HBase automatically splits regions when they reach a configured size limit. In some scenarios, such as bulk loading, it is more efficient to pre-create regions so that the load operation can proceed in parallel. The sales data covers 4 months, April through July 2007, so we can pre-create regions by specifying splits in the create table command.

In this section, we will create a table within the HBase shell with pre-defined splits, using no Big SQL features at first. Then we will show how users can map existing data in HBase into Big SQL, which can prove to be a very common practice; this is made possible by creating what are called external tables. Start by issuing the following statement in the HBase shell. It creates the sls_sales_fact_dense_split table with pre-defined region splits for April through July 2007.

HBase Shell
create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data',
REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE',
ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0',
DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE',
TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'},
{SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command in the HBase shell to verify the newly created table.

HBase Shell
list

Note that if we were to list the tables from the Big SQL shell, we would not see this table, because no association with Big SQL has been made yet. Open a browser and point it to the following URL: http://bivm:60010/. Scroll down and click on the table we just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split.


Examine the pre-created regions for this table, as we defined them when creating it.

Execute the following create external hbase command to map the existing HBase table into Big SQL. Some things to note about the command:

- The create table statement allows specifying a different name for the SQL table through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table; for example, one table can map a few columns and another table a different set of columns (see the sketch after the DDL below).
- The column mapping section of the create table statement allows specifying a different separator for each column and for the row key.
- External tables can also be used to map tables created using the Hive HBase storage handler, which cannot be read directly using the Big SQL storage handler.

BigSQL Shell


CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY) SEPARATOR '-',
cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY) SEPARATOR '/',
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also, secondary indexes cannot be created via Big SQL on external tables.

The data in external tables is not validated at creation time. For example, if a column in the external table contains data whose separators are defined incorrectly, query results will be unpredictable.
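As mentioned above, external tables let you expose several SQL views of the same HBase table. A minimal sketch (hypothetical table name, reusing the mapping just defined) that surfaces only the key columns and the quantity might look like:

BigSQL Shell
CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_QUANTITY_ONLY
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
QUANTITY int
)
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-',
cf_data:cq_QUANTITY mapped by (QUANTITY)
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';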

Use the following command to load the external table we have defined.

BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split;

44603 rows affected (total: 1m57.2s)

Verify that the number of rows loaded is also the number of rows returned by querying the external SQL table.

BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;


+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell, directly on the underlying HBase table.

HBase Shell
count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds

Issue a get command from the HBase shell, specifying the row key as follows. Notice that the separator between the parts of the row key is "-", which is what we defined when originally creating the external table.

HBase Shell
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'
COLUMN                    CELL
 cf_data:cq_DOLLAR_VALUES timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS    timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY      timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds

In the output, you can also see the other separators we defined for the external table: "|" for cq_DOLLAR_VALUES, and "/" for cq_OTHER_KEYS.

Of course, in Big SQL we don't need to specify separators such as "-" when querying the table, as the command below shows.

BigSQL Shell
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE
ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY = 4428
AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = 30263
AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;

Load Data: Error Handling

In this final section of this part of the lab, we examine how to handle errors during the load operation. The load hbase command has an option to continue past errors. The LOG ERROR ROWS IN FILE clause can be used to specify a file name to log any rows that could not be loaded


because of errors. Common errors include invalid numeric types and, for string encoding, a separator appearing within the data. (A separator appearing within the data is only an issue with string encoding.)

First examine the input file, which is known to contain bad rows.

Linux Terminal
hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a    11171  … … … … … … … … … … … … … … … …
b0070720    11171  … … … … … … … … … … … … … … … …
2007-07-20  11171  … … … … … … … … … … … … … … … …
20070720    11-71  … … … … … … … … … … … … … … … …
20070721    11171  … … … … … … … … … … … … … … … …

Knowing there are errors in the input data, proceed to issue the following load command, specifying a directory and file in which to put the "bad" rows.

BigSQL Shell
LOAD HBASE DATA INPATH
'/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS
TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE
'/tmp/SLS_SALES_FACT_load.err';

1 row affected (total: 2.74s)

In this example, 4 rows did not get loaded because of errors, and only the one valid row passed through; note that the load reports only the rows that passed through it.

Examine the file specified in the load command to view the rows that were not loaded.

Linux Terminal
hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"

[OPTIONAL] HBase Access via JAQL

Jaql has an HBase module that can be used to create and insert data into HBase tables and query them efficiently using multiple modes: a local mode that accesses HBase directly, as well as a MapReduce mode. It allows specifying query-optimization options similar to those available in the HBase shell. The ability to transparently use MapReduce jobs makes it work well with bigger tables. At the same time, users can force local mode when they run point or


range queries. It allows use of a SQL language subset termed Jaql SQL, which provides the ability to join, group, and perform other aggregations on tables. It also provides access to data from different sources, such as relational DBMSs, and in different formats, such as delimited files, Avro, and anything else supported by Jaql. The results of a query can be written in different formats to HDFS and read by other BigInsights applications, such as BigSheets, for further analysis. In this section, we'll first pull information from our relational DBMS and then go over the Jaql HBase module, specifically the additional features it provides. Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used for Big SQL by adding the "--jaql" option as shown below. This is a much better environment to work in than the standard Jaql shell, as it lets you recall the previous command with the up arrow key and traverse the command line with the left/right arrow keys.

Linux Terminal
/opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql

Once in the JSQSH shell with the Jaql option, load the dbms::jdbc module with the following command.

BigSQL/JAQL Shell
import dbms::jdbc;

Add the JDBC driver JAR file to the classpath.

BigSQL/JAQL Shell
addRelativeClassPath(getSystemSearchPath(), '/opt/ibm/db2/V10.5/java/db2jcc.jar');

Supply the connection information.

BigSQL/JAQL Shell
db := jdbc::connect(
  driver = 'com.ibm.db2.jcc.DB2Driver',
  url = 'jdbc:db2://localhost:50000/gosales',
  properties = {user: "db2inst1", password: "password"}
);

Specify the rows to be retrieved with a SQL select statement.

BigSQL/JAQL Shell
DESC := jdbc::prepare( db, query = "SELECT * FROM db2inst1.sls_sales_fact_10p");

In the many-to-one mapping section, we went over the creation of a composite key. In the next few steps, we will use Jaql to load the same data using a composite key and dense columns: we'll pack all the columns that make up the primary key of the relational table into the HBase row key, and pack the other columns into dense HBase columns. Define a variable to read the original data from the relational JDBC source; this converts each tuple of the table into a JSON record.


BigSQL/JAQL Shell
ssf = localRead(DESC);

Transform the records into the required format. Essentially, we are doing the same thing as when we defined the many-to-one mapping in the previous sections. For the first element, which we will use as the HBase row key, concatenate the values of the columns that form the primary key of the sales fact table using a "-" separator. Pack the remaining columns into the dense HBase columns cq_OTHER_KEYS (using a "/" separator), cq_QUANTITY, and cq_DOLLAR_VALUES (using a "|" separator).

BigSQL/JAQL Shell
ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY",
$."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY",
$."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY",
$."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY",
$."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN",
$."SALE_TOTAL", $."GROSS_PROFIT"] -> transform
{
  key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"-",$[6],"-",$[7]),
  cf_data: {
    cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]),
    cq_QUANTITY: strcat($[11]),
    cq_DOLLAR_VALUES: strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17])
  }
};

Verify the data is in the correct format by querying the first record.

BigSQL/JAQL Shell
ssft -> top 1;
{
  "key": "20070418-11114-4415-7314-5794-30124-5501-605",
  "cf_data": {
    "cq_OTHER_KEYS": "254121/20070423/20070423",
    "cq_QUANTITY": "60",
    "cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m"
  }
}
(1 row in 2.40s)

Now the data is ready to be written into HBase. First import the hbase module, which prepares Jaql by loading the required JARs and setting up the environment using the HBase configuration files.

BigSQL/JAQL Shell
import hbase(*);

Use hbaseString to define a schema for the HBase table. The HBase table does not get created until something is written into it. An array of records that match the specified schema


should be used to write into the HBase table. The data types correspond to how Jaql will interpret the data.

BigSQL/JAQL Shell
SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*: string}},
create=true, replace=true, rowBatchSize=10000, colBatchSize=200 );

Note: As this could be a big table, we specify rowBatchSize and colBatchSize, which are used for scanner caching and as the column batch size by the internal HBase scan object. The column batch size is useful when rows have a huge number of columns.

Write to the table using the previously created ssft array, which matches the specified schema.

BigSQL/JAQL Shell
ssft -> write(SSFHT);

The write operation creates the HBase table and populates it with the input data. To confirm, use the HBase shell to count (or scan) the table and verify that the data was written with the right number of rows.

HBase Shell
count 'sales_fact2'
44603 row(s) in 3.6230 seconds

To read the contents of the HBase table using Jaql, use read on the hbaseString. In the following command, we also pass the read directly into a count function to verify the right number of rows.

BigSQL/JAQL Shell
count(read(SSFHT));
44603

To query for rows matching a particular order day key, 20070720, use setKeyRange for a partial range query. Use localRead for point and range queries, as Jaql is tuned for local execution and performs them efficiently.

BigSQL/JAQL Shell
localRead(SSFHT -> setKeyRange('20070720', '20070721'));

Perform the same query using the HBase shell; both complete in a similar amount of time.

HBase Shell
scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721', CACHE => 10000}


To query for a row when we have the values for all primary key columns, we can construct the entire row key and perform a point query.

BigSQL/JAQL Shell

For comparison, this is what the equivalent statement looks like from the HBase shell.

HBase Shell

To use a filter from Jaql, use the setFilter function along with addFilter. In the case below, the predicate is a prefix match on the sales order key, which is the leading part of the dense column cq_OTHER_KEYS and hence can be used in a filter.

BigSQL/JAQL Shell

PART II – A – Query Handling

Efficiently querying HBase requires pushing as much work to the server(s) as possible. This includes projection pushdown, that is, fetching only the minimal set of columns required by the query. It also includes pushing query predicates down into the server as scan limits, filters, index lookups, and so on.

Setting scan limits is extremely powerful as it can help narrow down the regions we need to scan. With a full row key, HBase can quickly pinpoint the region and the row. With partial keys and key ranges (upper limits, lower limits, or both), HBase can narrow down regions or eliminate regions that fall outside the range. Indexes leverage this key lookup, but they use two tables to achieve it. Filters cannot eliminate regions, but some have the capability to skip within a region; they help narrow down the data set returned to the client. With limited metadata/statistics about HBase tables, supporting a variety of hints helps improve query efficiency.

The Data

This section describes the schema which the sample data will use to demonstrate the effects of pushdown from Big SQL.

localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501-605'));

get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605'

read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter,
        HBaseKeyArrayToBinary(["481896/"]),
        compareOp.equal,
        comparators.BinaryPrefixComparator,
        "cf_data",
        "cq_OTHER_KEYS",
        true)
]) );


We will use a TPC-H table: the orders table, with 150,000 rows, defined using the mapping shown below. Issue the following command from a Big SQL shell to create the orders table. Notice that this table has a many-to-one mapping, meaning it has a composite key and dense columns.

BigSQL Shell

Load the sample data into the newly created table by issuing the following command.

BigSQL Shell

In the next set of sections, we examine output from the Big SQL log files to point out what you can check to confirm pushdown from Big SQL. To view these log messages, you may first have to change the logging levels using the commands below.

BigSQL Shell

BigSQL Shell

Note that columns are pushed down at the HBase level. So in many-to-one mappings, if the query requires only one part of a dense column with many parts, the entire value of the dense column will be returned. Therefore it is efficient to pack together columns that are usually queried together.

CREATE HBASE TABLE ORDERS
(
    O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
    O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
    O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
column mapping
(
    key mapped by (O_CUSTKEY, O_ORDERKEY),
    cf:d mapped by
        (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
    cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
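To help decode the scan logs shown later in this part, here is a sketch of how one ORDERS row is laid out in HBase under this mapping. The byte-level details are our interpretation, inferred from the startRow/stopRow values printed in the log excerpts below, not an official specification of the binary encoding:

row key : \x01 <O_CUSTKEY as 4-byte integer, sign bit flipped> \x01 <O_ORDERKEY as 8-byte integer, sign bit flipped>
cf:d    : O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT packed as one dense value
cf:od   : O_ORDERDATE

For example, O_CUSTKEY=4 and O_ORDERKEY=5612065 (hex 0x55A221) would yield the row key \x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00\x55\xA2\x21, which matches the startRow printed in the Point Scan log below (the bytes \x55 and \x21 print as “U” and “!”).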

Note: As in Part I, there are three sample data sets provided for you. Each one is essentially the same, with one key difference: the amount of data contained within it. The remaining instructions in this lab exercise use the orders.10p.tbl dataset simply because it contains a smaller amount of data and will be faster to work with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do so, but remember to change the names appropriately.

LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS;

150000 rows affected (total: 21.52s)

log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;

log com.ibm.jaql.modules.hcat.hbase info;


Use the following command to tail the Big SQL log file. Keep this open in a terminal throughout this entire part of the lab; we will refer to it often to see what is going on behind the scenes when running certain commands.

Linux Terminal

Projection Pushdown

The first query here does a SELECT * and requests all HBase columns used in the table mapping. The original HBase table could have many more columns; we may have defined an external table mapping to just a few of them. In such cases, only the HBase columns used in the mapping will be retrieved.

BigSQL Shell

In the Big SQL log file, we can see that data was returned from both HBase columns (cf:d and cf:od).

BigSQL Log

The second query requests only one HBase column:

BigSQL Shell

Notice that the query returns much faster since we are returning much less data. Verify from the log file that this query executed against only one column.

BigSQL Log

The third query also requests only one HBase column.

tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log

SELECT * FROM orders go -m discard;

150000 rows in results(first row: 1.73s; total: 10.69s)

…HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=, totalColumns=2, …}

SELECT o_totalprice FROM orders go -m discard;

150000 rows in results(first row: 0.27s; total: 2.83s)

…HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1, …}


BigSQL Shell

Although this query returns less data, it has a higher response time because serialization/deserialization of the TIMESTAMP type is expensive.

BigSQL Log

Predicate Pushdown

Point Scan

Identifying and using point scans is the most effective optimization for queries into HBase. To convert a query to a point scan, we need predicate values covering the full row key. These can come in as multiple predicates, since Big SQL supports composite keys. The query analyzer in Big SQL is capable of combining multiple predicates to identify a full row scan. Currently, this analysis happens at run time in the storage handler; at that point, the decision of whether or not to use MapReduce has already been made. To bypass MapReduce, a user currently has to provide explicit local-mode access hints. In the example below, the command “set force local on” makes sure that no query executing in the session uses MapReduce.

BigSQL Shell

Issue the following select statement, which provides predicates for the columns that comprise the full row key: custkey and orderkey.

BigSQL Shell

Checking the logs, you can see that Big SQL took both of the specified predicates and combined them into a row scan using all parts of the composite key.

SELECT o_orderdate FROM orders go -m discard;

150000 rows in results(first row: 0.37s; total: 4.5s)

…HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1, …}
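To round out these projection examples, here is a hypothetical variation (not part of the original lab) that touches columns in both column families. Based on the log pattern above, we would expect it to fetch from both families:

SELECT o_totalprice, o_orderdate FROM orders go -m discard;
-- Expected log line under this assumption: families={cf=[d, od]}, totalColumns=2,
-- since o_totalprice lives in cf:d and o_orderdate lives in cf:od.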

set force local on;

select o_orderkey, o_totalprice from orders where o_custkey=4 and o_orderkey=5612065;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)


BigSQL Log

Partial Row Scan

This section shows the capability of the Big SQL server to process predicates on leading parts of the row key, and not necessarily the full row key as in the previous section. Issue the following example query, which provides a predicate for the first part of the row key, custkey.

BigSQL Shell

Checking the logs, you can see the predicate on the first part of the row key is converted to a range scan. The stop row of the scan is non-inclusive, so it is internally appended with the highest possible byte value (the \xFF visible in the stopRow below) to cover the partial range.

BigSQL Log

Range Scan

When there are range predicates, we can set the start row, the stop row, or both. In our example query below we have a ‘less than’ predicate; therefore we only know the stop row. However, even setting this helps eliminate regions with row keys that fall above the stop row. Issue the following command.

BigSQL Shell

… Found a row scan by combining all composite key parts.

… Found a row scan from row key parts

… HBase filter list created using AND.

… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):

[PrefixFilter \x01\x80\x00\x00\x04], …,

stopRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, startRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, totalColumns=1, …}

select o_orderkey, o_totalprice from orders where o_custkey=4;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
+------------+--------------+
2 rows in results(first row: 0.19s; total: 0.19s)

… Found a row scan that uses the first 1 part(s) of composite key.

… Found a row scan from row key parts

… HBase filter list created using AND.

… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):

[PrefixFilter \x01\x80\x00\x00\x04], …, stopRow=\x01\x80\x00\x00\x04\xFF,

startRow=\x01\x80\x00\x00\x01, totalColumns=1, …}

select o_orderkey, o_totalprice from orders where o_custkey < 15;


Notice in the log file that, similar to the previous section, we are only using the first part of the composite key, since we specify custkey in the predicate. However, in this case we only know the stop row (less than 15), so there is no value for the start row portion of the scan.

BigSQL Log

Full Table Scan

This section simply shows an example of what happens when none of the predicates can be pushed down to HBase. In this example query, the predicate (orderkey) is on a non-leading part of the row key and therefore is not pushed down. Issue the command to see that this results in a full table scan.

BigSQL Shell

As can be determined by examining the logs, in cases where none of the predicates can be pushed to HBase, a full table scan is required, meaning there are no specified values for either the start or stop row.

BigSQL Log

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
|    5805349 | 255145.51562 |
|    5987111 |  97765.57812 |
|    5692738 | 143292.53125 |
|    5885190 | 125285.42969 |
|    5693440 | 117319.15625 |
|    5880160 | 198773.68750 |
|    5414466 | 149205.60938 |
|    5534435 | 136184.51562 |
|    5566567 |  56285.71094 |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)

… Found a row scan that uses the first 1 part(s) of composite key.

… Found a row scan from row key parts

… HBase scan details:{…, families={cf=[d]}, …, stopRow=\x01\x80\x00\x00\x0F,

startRow=, totalColumns=1, …}
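The range scan above sets only a stop row. As a hypothetical extension (not a step in the original lab), a query that bounds custkey on both sides should allow Big SQL to set both scan limits:

select o_orderkey, o_totalprice from orders where o_custkey >= 4 and o_custkey < 15;
-- Expectation based on the log patterns above: startRow is derived from the
-- '>= 4' predicate and stopRow from the '< 15' predicate, eliminating regions
-- on both sides of the range.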

select o_orderkey, o_totalprice from orders where o_orderkey=5612065;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)


Automatic Index Usage

This section demonstrates the benefits of an index lookup. Before creating an index, let’s first execute a query that invokes a full table scan, so that we can later compare and see the performance benefit of creating an index on particular column(s). Notice we are specifying a predicate on the clerk column, which is a middle part of a dense column.

BigSQL Shell

As you can see below in the log file, there is no usage of an index.

BigSQL Log

Issue the following command to create the index on the clerk column, which is a middle part of a dense column in the table. This creates a new table to store the index data. The index table stores the column value and the row key in which it appears.

BigSQL Shell

Re-issue the exact same command as we did earlier.

BigSQL Shell

… HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …}

SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard;

154 rows in results(first row: 2.40s; total: 4.32s)

… indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined,

minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:

[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],

indexScanCandidateInfo: [hasIndexScanCandidate: false]]

CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';

Note: The create index statement creates the new index table, which uses <base_table_name>_<index_name> as its name; it deploys the coprocessor and populates the index table using a MapReduce index builder. The “as 'hbase'” clause indicates the type of index handler to use; for HBase, there is a separate index handler.

0 rows affected (total: 1m17.47s)

SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard;


After creating the index and issuing the same select statement, Big SQL automatically takes advantage of the index that was created and avoids a full table scan, which results in a much faster response time.

You can verify in the log file that Big SQL used the index. In this case the index table is scanned for all rows that start with the value of the clerk predicate, here Clerk#000000999. From the matching row(s), the row key(s) of the base table are extracted, and get requests are batched and sent to the data table.

BigSQL Log

If there were no index, the predicate could not be pushed down, as it is a non-leading part of a dense column. In such cases, a full table scan is required, as seen at the beginning of this section.

Pushing Down Filters into HBase

Though HBase filters do not avoid a full table scan, they limit the rows and data returned to the client. HBase filters have a skip facility which lets them skip over certain portions of data. Many of the built-in filters implement this and thus prove more efficient than a raw table scan. There are also filters that can limit the data within a row. For example, when we only need the columns in the key part of a row, filters like FirstKeyOnlyFilter and KeyOnlyFilter can be applied to get only a single instance of the row key part of the data. The sample query below demonstrates a case where Big SQL pushes down a row scan and a column filter.

BigSQL Shell

154 rows in results(first row: 0.73s; total: 0.74s)

… indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails:

JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns":

[{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator":

"%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1,

columns: [Ljava.lang.String;@3ced3ced, startValue: \x01Clerk#000000999\x00,

stopValue: \x01Clerk#000000999\x00]], valuesInfo: [minValue: [B@4b834b83,

minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:

[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],

indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate:

IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator:

=,isVariableLength: false,type: null,encoding: BINARY]]]

… Found an index scan from index scan candidates. Details:

… Index name: ix_clerk

… Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999,

stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:] … Index query successful.

Note: For a composite index where multiple columns are used to define an index, predicates are handled and pushed down similar to what is done for composite row keys.
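As an illustration of that note, a hypothetical composite index might look like the following. This is a sketch only: it assumes the single-column CREATE INDEX syntax above generalizes to multiple columns, and ix_clerk_status is our own example name, not an object created in this lab.

CREATE INDEX ix_clerk_status ON TABLE orders (o_clerk, o_orderstatus) AS 'hbase';
-- With equality predicates on both indexed columns, we would expect the index
-- scan to combine them the same way composite row key parts are combined in
-- the Point Scan section:
SELECT * FROM orders WHERE o_clerk='Clerk#000000999' AND o_orderstatus='F' go -m discard;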


Notice that the predicate on the custkey column triggers the row scan. The column filter, SingleColumnValueFilter, is triggered because there is a predicate on the leading part of a dense column (cf:d).

BigSQL Log

This way Big SQL can automatically convert predicates into many of these filters and thus handle queries more efficiently.

Table Access Hints

Access hints affect the strategy that is used to read the table, identify the source of the data, and optimize a query. For example, the strategy can affect such behaviour as whether MapReduce is employed to implement a join or whether a memory (hash) join is employed. These hints can also control how data is accessed from specific sources. The table access hint that we will explore here is accessmode.

Accessmode

The accessmode hint is very important for HBase: it avoids MapReduce overhead. Combined with point queries, it ensures sub-second response times that are unaffected by the total data size. There are multiple ways to specify the accessmode hint: as a query hint or at the session level. Note that session-level hints take precedence. If “set force local off;” is run in a session, all subsequent queries will always use MapReduce even if an explicit accessmode='local' hint is specified on the query. You can check the state of accessmode, if it was explicitly set on the session, with the following command in the Big SQL shell.

BigSQL Shell

If you kept the same shell open throughout this part of the lab, you will see the following output. This is because we used “set force local on” earlier in one of the previous sections.

SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND o_orderstatus='P' go -m discard;

1278 rows in results(first row: 0.37s; total: 0.38s)

… Found a row scan that uses the first 1 part(s) of composite key.

… Found a row scan from row key parts

… HBase filter list created using AND.

… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):

[SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], …, stopRow=, startRow=\x01\x80\x01\x86\xA1, totalColumns=1, …}
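By contrast, consider a hypothetical predicate on o_totalprice, which is a non-leading part of the dense column cf:d. Based on the leading-part rule described above (and with no index on the column), we would not expect a column filter to be pushed down for it; this expectation is our inference, not output captured from the lab:

SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND o_totalprice>400000 go -m discard;
-- Under this assumption, the custkey predicate still sets startRow as before,
-- but the o_totalprice predicate is evaluated only after the dense value in
-- cf:d has been decoded, not as an HBase filter.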

set;


To change the setting back to the default, you can change the value to automatic with the following command.

BigSQL Shell

Issue the following select query.

BigSQL Shell

Notice how long the query takes.

Issue the same query with an accessmode hint this time.

BigSQL Shell

Notice how the query responds much faster with the results. This is because of the local accessmode: no MapReduce job is employed.

PART II – B – Connecting to Big SQL Server via JDBC

Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools, and perhaps reuse portions of existing applications, can be quite appealing to organizations new to Hadoop.

+--------------------+-------+
| key                | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)

set force local auto;

select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;

+------------+

| o_orderkey |

+------------+

| 5612065 |

+------------+

1 row in results(first row: 7.2s; total: 7.2s)

select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=4 and o_orderkey=5612065;

+------------+

| o_orderkey |

+------------+

| 5612065 |

+------------+

1 row in results(first row: 0.32s; total: 0.32s)
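As noted at the start of this section, session-level hints take precedence over query hints. Here is a quick sketch of that behaviour, following directly from the prose above rather than from a step in the original lab:

set force local off;
-- The explicit hint below is overridden by the session-level setting,
-- so this query still runs through MapReduce:
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=4 and o_orderkey=5612065;
-- Restore the default behaviour before continuing:
set force local auto;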


Therefore, Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or a 64-bit ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications.) In this part of the lab, we will explore how to use Big SQL’s JDBC driver with BIRT, an open source business intelligence and reporting tool that plugs into Eclipse. We will use this tool to run some very simple reports, using SQL queries, on data stored in HBase in our Hadoop environment.

Business Intelligence and Reporting via BIRT

To start, open Eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon. When prompted to do so, leave the default workspace as is. Once Eclipse has loaded, switch to the 'Report Design' perspective so that we can work with BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other.... Then click on: Report Design -> OK as shown below.

Once in the Report Design perspective, double-click on Orders.rptdesign from the Navigator pane (on the bottom left-hand side) to open the pre-created report.


Expand 'Data Sets' in the Data Explorer. You will notice the data sets (or report queries) have a red 'X' beside them. This is because the pre-created report queries are not yet associated with a data source. All that is necessary now, prior to being able to run the report, is to set up the JDBC connection to Big SQL. To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section, select Download the Big SQL Client drivers.

Save the file to /home/biadmin/Desktop/IBD-1687A/.

Note: A report has been created on your behalf to more quickly illustrate the functionality/usage of the Big SQL drivers, while removing the tedious steps of designing a report in BIRT.


Open the folder where you saved the file and extract the contents of the client package under the same directory.

Back in Eclipse, add Big SQL as a data source. Right-click on Data Sources -> New Data Source in the Data Explorer pane on the top left-hand side. In the New Data Source window, select JDBC Data Source and specify “Big SQL” as the Data Source Name. Click Next.


In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the Manage JDBC Drivers window appears click on Add…

Point to the location where the client drivers were extracted, then click OK. Once added, you should have an entry for the BigSQLDriver in the Driver Class dropdown list. Select it, and complete the fields with the following information:

• Database URL: jdbc:bigsql://localhost:7052

• User Name: biadmin

• Password: biadmin


Click on ‘Test Connection...’ to ensure we can connect to Big SQL using the JDBC driver.

Double-click 'Orders per year' and add the Big SQL connection that was just defined.

Examine the query:

WITH test (order_year, order_date) AS
  (SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year


Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen' data set and examine the query.

Now that we have defined the Data Source and have the Data Sets configured, run the report in Web Viewer as shown in the diagram below.

The output from the web viewer against the orders table on Big SQL should be as follows.

WITH base (o_clerk, tot) AS
  (SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk ORDER BY tot DESC)
SELECT o_clerk, tot FROM base FETCH FIRST 5 ROWS ONLY

Note: Disregard the red ‘X’ that may still exist on the Data Sets. This is a bug and can safely be ignored.


As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC and ODBC data sources can be configured to work with Big SQL. We used BIRT here, but as another example, Cognos Business Intelligence can use Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions. Similarly, other tools like Tableau can leverage Big SQL’s ODBC drivers to work with data stored in a BigInsights cluster.


Communities

• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you:
    • Information Management: bit.ly/InfoMgmtCommunity
    • Business Analytics: bit.ly/AnalyticsCommunity
    • Enterprise Content Management: bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to the Information Management, Business Analytics, and Enterprise Content Management communities
  o ibm.com/champion

Thank You! Your Feedback is Important!

• Access the Conference Agenda Builder to complete your session surveys:
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite


Acknowledgements and Disclaimers:

Availability: References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.