Technical Report
NetApp FAS and HBase
Hadoop Online NoSQL with NetApp FAS NFS Connector for Hadoop
Ankita Dhawale and Karthikeyan Nagalingam, NetApp
September 2016 | TR-4541
Abstract
This document introduces NetApp® FAS NFS Connector for Apache HBase, which enables random read/write operations to be performed directly on NFS storage through the connector. It describes the underlying architecture of HBase, deployment with data on NFS, and performance results, and it showcases the benefits of using HBase with NetApp FAS NFS Connector.
TABLE OF CONTENTS
1 Introduction ........................................................................................................................................... 3
1.1 Big Data and HBase Overview ........................................................................................................................ 3
2 Solution Overview ................................................................................................................................ 3
2.1 NetApp FAS .................................................................................................................................................... 3
2.2 NetApp FAS NFS Connector for Hadoop ........................................................................................................ 3
2.3 HBase Solution Architecture ........................................................................................................................... 4
3 Solution Architecture ........................................................................................................................... 5
3.1 HBase Cluster Architecture ............................................................................................................................. 5
4 Solution Validation ............................................................................................................................... 6
4.1 Hardware and Software Prerequisites ............................................................................................................ 6
4.2 Setting Up an HBase Cluster with NFS Connector ............................................................................... 6
4.3 Test Validation ................................................................................................................................................ 7
5 Conclusion .......................................................................................................................................... 12
Appendix .................................................................................................................................................... 13
References ................................................................................................................................................. 13
LIST OF TABLES
Table 1) Hardware and software prerequisites. .............................................................................................................. 6
Table 2) Workloads. ....................................................................................................................................................... 8
LIST OF FIGURES
Figure 1) HBase architecture. ......................................................................................................................................... 4
Figure 2) HBase solution architecture. ........................................................................................................................... 5
Figure 3) CDH web UI. ................................................................................................................................................... 6
Figure 4) Test results for HBase cluster with 50% reads and 5 million records. ............................................................. 8
Figure 5) Test results for HBase cluster with 100% reads and 5 million records. ........................................................... 9
Figure 6) Test results for HBase cluster with 100% insertion and 10 million records. .................................................... 9
Figure 7) Test results for HBase cluster with 100% insertion and 30 million records. .................................................. 10
Figure 8) Test results for HBase cluster with 55% insertion and 45% reads with 10 million records. ........................... 10
Figure 9) Test results for HBase cluster with 55% insertion and 45% reads with 30 million records. ........................... 11
Figure 10) Test results for HBase cluster with 50% reads and 64 million records. ....................................................... 11
Figure 11) Test results for HBase cluster with 100% reads and 64 million records. .................................. 12
Figure 12) Storage-side CPU usage. ........................................................................................................................... 12
1 Introduction
1.1 Big Data and HBase Overview
Big data analytics is an emerging field in which huge amounts of data are examined to identify patterns. These patterns can be useful for predictive analysis, especially analysis related to human behavior and interactions. Considerable cost, effort, and time are associated with loading big data into traditional relational databases for analysis. Therefore, new approaches to storing, analyzing, and accessing data have emerged.
HBase is an open-source, nonrelational, distributed NoSQL database that was developed as part of the Apache Hadoop ecosystem. It generally runs on top of the Hadoop Distributed File System (HDFS), but it can also run on other file systems. Apache HBase is used when random, real-time read/write access to big data is needed.
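As a minimal illustration of this random read/write access (assuming an HBase installation with the shell on the path; the table and cell values here are hypothetical), a row can be written and read back interactively:

# Create a table with one column family, write one cell, and read it back.
hbase shell <<'EOF'
create 'demo', 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello'
get 'demo', 'row1'
EOF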
HBase has a few advantages that make it a powerful NoSQL database. It offers:
Horizontal scalability
Strictly consistent reads and writes
Automatic and configurable sharding of tables
Automatic failover support between region servers
An easy-to-use Java API for client access
2 Solution Overview
HBase normally uses HDFS to store data, but in our setup, the data resides in an NFS volume on the NetApp FAS storage controller and is accessed through NFS Connector. This approach eliminates the need to store data on HDFS and the requirement of maintaining three copies of the data. Only one copy of the data is kept (instead of the three copies that HDFS replication maintains), which saves close to 66% of storage capacity compared with the traditional approach.
NetApp FAS NFS Connector provides data access for Apache HBase from NFS storage without significant changes to the existing system and its configuration. Current trends indicate that data will continue to grow exponentially, and this growth requires correspondingly large amounts of storage space. It is in this space that NetApp can help store and manage petabytes of data.
2.1 NetApp FAS
NetApp FAS combines high-performance hardware with adaptive storage software that can unify your SAN and NAS needs. When used with HBase, the NetApp FAS solution provides manageability, high performance, and efficiency. For more information about FAS controllers, see the NetApp FAS page.
2.2 NetApp FAS NFS Connector for Hadoop
The NetApp FAS NFS Connector for Hadoop allows analytics software such as Apache Hadoop and the Apache HBase NoSQL database to access data on NetApp ONTAP® storage. For more information about NetApp FAS NFS Connector, see TR-4382.
2.3 HBase Solution Architecture
HBase uses a master-slave architecture. Typically, the HBase cluster has one master node called
HMaster and multiple region servers called HRegionServer. Each region server contains multiple
regions called HRegions.
Just as in a relational database, data in HBase is stored in tables, and those tables are stored in regions.
When a table becomes too big, it is partitioned into multiple regions. Those regions are assigned to region
servers across the cluster. Each region server hosts roughly the same number of regions. HBase recommends presplitting tables into predefined regions, roughly 10 times the number of available region servers (see the example after this paragraph). Each region server contains a write-ahead log (called HLog) and multiple regions.
Each region includes a MemStore and several StoreFiles (HFiles). The data resides in the StoreFiles in
the form of column families. The MemStore holds in-memory modifications to the store (data).
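For example, a table can be created with predefined (presplit) regions from the HBase shell. This is a hedged sketch in which the table name, column family, and region count are illustrative; with five region servers, roughly 50 regions follow the guideline above.

# Presplit the table into 50 regions at creation time.
hbase shell <<'EOF'
create 'usertable', 'cf', {NUMREGIONS => 50, SPLITALGO => 'HexStringSplit'}
EOF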
The mapping of regions to region servers is kept in a system catalog table called .META. (named hbase:meta in HBase 1.x). When reading or writing data, clients look up the required region information in this catalog table and communicate directly with the appropriate region server. Each region is identified by its start key (inclusive) and its end key (exclusive).
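In HBase 1.x, this mapping can be inspected directly; a minimal sketch, assuming the HBase shell is available:

# Show each region's boundaries and the region server that hosts it.
hbase shell <<'EOF'
scan 'hbase:meta', {COLUMNS => ['info:regioninfo', 'info:server']}
EOF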
In Apache HBase, ZooKeeper coordinates communication and shares state between the master and the region servers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Therefore, if HBase's ZooKeeper data is removed, only the transient operations are affected; data can still be written to and read from HBase.
This solution architecture, as depicted in Figure 1, uses NFS storage, accessed through NetApp FAS NFS Connector, as the HBase datastore. This setup uses a NetApp FAS8080 storage controller.
Figure 1) HBase architecture.
3 Solution Architecture
This section describes the HBase architecture that was used for the validation and testing, including its
components and layout.
3.1 HBase Cluster Architecture
HBase provides a wide-column data model and random, real-time CRUD operations on top of NFS storage. The cluster can scale out horizontally to efficiently serve billions of rows and millions of columns through autosharding. Because each region is served by only one region server at a time, HBase supports strong consistency for reads and writes. Automatic failover of region servers is also supported.
Figure 2 illustrates the architecture, in which the HMaster, five region servers, and the NetApp FAS8080 are connected through a 10Gb Ethernet switch. One aggregate was created from 10 SAS drives on the FAS8080, and one volume was used for the validation.
Figure 2) HBase solution architecture.
4 Solution Validation
4.1 Hardware and Software Prerequisites
Table 1) Hardware and software prerequisites.
Hardware
Host servers: 6 x servers, each with 2.4GHz CPU and 256GB RAM
Storage: NetApp FAS8080 with 10 x SAS disks
Network: 1 x 10Gbps switch
Software
Operating system: Red Hat Enterprise Linux 6.6
Database: HBase 1.2.0-cdh5.7.0
Benchmarking tool: Yahoo! Cloud Serving Benchmark (YCSB) 0.8.0
Cloudera version: 5.7.0
4.2 Setting Up an HBase Cluster with NFS Connector
To install the NetApp FAS NFS Connector for HBase, complete the following steps:
Note: Cloudera Distribution for Hadoop (CDH) was used for testing, and the HBase version used is the one provided with CDH. After the HBase service is added, its status appears as shown in Figure 3.
Figure 3) CDH web UI.
1. Download the JAR file (hadoop-nfs-connector-2.0.0.jar) from https://github.com/NetApp/NetApp-Hadoop-NFS-Connector/releases.
2. Replace the hadoop-nfs-2.6.0-cdh5.7.xxxx.jar file with hadoop-nfs-3.0.0-SNAPSHOT.jar from the previously mentioned link.
Note: The hadoop-nfs-2.6.0-cdh5.7.xxxx.jar file is shipped with Cloudera.
3. Extract the NFS Connector JAR files to the following locations:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hadoop-nfs-connector-2.0.0.jar
4. The NFS Connector JAR files must also be present in the following locations (see the example copy commands after this list):
Hadoop's library folder:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/hadoop-nfs-connector-2.0.0.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
HBase libraries:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase/lib/hadoop-nfs-connector-2.0.0.jar
ZooKeeper:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/zookeeper/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/zookeeper/lib/hadoop-nfs-connector-2.0.0.jar
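The following commands are a hedged sketch of step 4. They assume that both JAR files have been downloaded to the current working directory and that the CDH parcel path matches the locations listed above; they would typically be run on every node in the cluster.

# Copy the NFS Connector JARs into the CDH parcel library directories listed in step 4.
PARCEL=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45
# Hadoop library folders
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/hadoop/"
cp hadoop-nfs-3.0.0-SNAPSHOT.jar "$PARCEL/lib/hadoop/lib/"
# HBase libraries
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/hbase/lib/"
# ZooKeeper libraries
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/zookeeper/lib/"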
5. Configure core-site.xml by referring to the section “Installation and Configuration” in TR-4382:
NetApp FAS NFS Connector for Hadoop.
6. Configure hbase-site.xml.
The hbase-site.xml file is present in six different locations in Cloudera, and all of them must be updated. In the Cloudera search box, type hbase-site.xml to display its contents. Add the following property to all six locations, one by one, and then restart the cluster. Verify on the HBase web UI home page that the HBase root directory now points to NFS.
<property>
<name>hbase.rootdir</name>
<value>nfs://10.63.150.20:2049/Volume1_NFS/hbase</value>
</property>
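As a quick check after the restart, and assuming core-site.xml has already been configured for the NFS Connector as described in TR-4382, listing the HBase root directory through the nfs:// scheme confirms that Hadoop can reach the NFS volume:

# List the HBase root directory on the NFS export configured above.
hadoop fs -ls nfs://10.63.150.20:2049/Volume1_NFS/hbase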
4.3 Test Validation
The Yahoo! Cloud Serving Benchmark (YCSB) tool was used for benchmarking validation of HBase
(NFS) data through NetApp FAS NFS Connector for Hadoop.
Note: During the tests, the Java heap must be kept high. After a large number of data operations are completed, the in-memory data (MemStore) must be flushed, either by waiting for the automatic flush, by restarting the service, or by performing a manual flush.
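A manual flush can be triggered from the HBase shell; this is a minimal sketch, assuming the YCSB target table is named tablename as in the appendix commands:

# Flush the table's MemStore contents to StoreFiles on the NFS volume.
hbase shell <<'EOF'
flush 'tablename'
EOF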
YCSB Benchmarking
YCSB is an open-source specification and program that is commonly used for comparing the performance of NoSQL database management systems.
YCSB consists of two components:
A client that generates the load according to a workload type and records the latency and throughput that are associated with that workload
Workload files that define the workload type by describing the size of the dataset, the total number of requests, and the ratio of read and write queries
There are six major workload types in YCSB:
Update-heavy workload A: 50/50 update/read ratio
Read-heavy workload B: 5/95 update/read ratio
Read-only workload C: 100% read-only, maps closely to workload B
Read-latest workload D: 5/95 insert/read ratio; the read load is skewed toward the most recently inserted records and has similarities to workload B
Short-ranges workload E: 5/95 insert/scan ratio; short scans of up to 100 records each
Read/modify/write workload F: 50/50 write/read ratio; similar to workload A, but the writes are actual updates or modifications rather than just blind writes as in workload A
For more information, visit the YCSB GitHub.
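The read/write ratios above are controlled by YCSB workload properties and can also be overridden on the command line. The following is a hedged example of an update-heavy (workload A style) run; the table name, record counts, and thread count are illustrative:

# 50/50 read/update mix against the HBase binding; -p overrides the workload file properties.
./bin/ycsb run hbase10 -P workloads/workloada -p table=tablename -p columnfamily=cf \
  -p readproportion=0.5 -p updateproportion=0.5 \
  -p recordcount=5000000 -p operationcount=5000000 -threads 100 -s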
HBase Validation with YCSB
Table 2 lists the workloads that were used for benchmarking.
Table 2) Workloads.
Workload           Read %   Update %   Scan %   Insert %
Workload A         50       50         -        -
Workload C         100      -          -        -
Data Insertion     -        -          -        100
Insert and Read    45       -          -        55
Datasets of 5 million, 10 million, 30 million, and 64 million records were used in the tests.
In the first test scenario, we ran 5 million operations and collected the results from the NFS storage.
Figure 4 shows the test results for a mixed workload of 50% reads and 50% writes. A throughput of around 102,000 IOPS with a median latency of 0.9ms was achieved at 100 threads. As Figure 4 shows, the throughput increases linearly with the number of threads. Increasing the thread count beyond 500 improves throughput but also increases latency. NetApp recommends keeping the thread count at 500 for better performance.
Figure 4) Test results for HBase cluster with 50% reads and 5 million records.
Figure 5 shows the test results for a 100% read workload. The throughput is linearly proportional to the number of threads, and going beyond 500 threads increases the throughput significantly. During this test, the latency was 0.8ms at 100 threads, and 113,986 IOPS were observed.
(Figure 4 data: 102,530; 109,628; and 119,223 ops/sec at 100, 250, and 500 threads, respectively.)
Figure 5) Test results for HBase cluster with 100% reads and 5 million records.
Figure 6 shows the results with 10 million records for 100% insertion. HBase generally performs well
during writes, and therefore this test was performed to validate the write performance.
Figure 6) Test results for HBase cluster with 100% insertion and 10 million records.
Figure 7 shows the results of 100% insertion with 30 million records. The throughput increases linearly with the number of threads, but beyond 256 threads, latency increases along with throughput.
(Figure 5 data: 113,986; 116,732; and 121,247 ops/sec at 100, 250, and 500 threads, respectively.)
(Figure 6 data: 77,145; 78,480; and 84,380 ops/sec at 64, 124, and 256 threads, respectively.)
Figure 7) Test results for HBase cluster with 100% insertion and 30 million records.
Figure 8 shows the benchmarking results for 55% insertion and 45% reads with 10 million records. One observation is that throughput was low when the thread count was less than 256. It can therefore be concluded that it is best to increase the number of threads for this kind of mixed insert-and-read pattern.
Figure 8) Test results for HBase cluster with 55% insertion and 45% reads with 10 million records.
(Figure 7 data: 71,227; 81,003; and 86,991 ops/sec at 64, 124, and 256 threads, respectively.)
(Figure 8 data: 58,308; 74,420; and 86,920 ops/sec at 256, 512, and 1,024 threads, respectively.)
Figure 9 shows the results of a workload of 55% insertion and 45% reads with 30 million records. It was
observed that the latencies during this workload were 5.8ms for reads and 9.2ms for insertion.
Figure 9) Test results for HBase cluster with 55% insertion and 45% reads with 30 million records.
Figure 10 shows the test results for YCSB workload A (50% reads) with 64 million records. During the test, we observed 90,000 IOPS with 1,024 threads, and the latency was recorded as 5.7ms.
Figure 10) Test results for HBase cluster with 50% reads and 64 million records.
(Figure 9 data: 34,480; 58,625; and 61,309 ops/sec at 256, 512, and 1,024 threads, respectively.)
(Figure 10 data: 75,872; 79,299; and 90,422 ops/sec at 256, 512, and 1,024 threads, respectively.)
Additionally, YCSB workload C was run with 64 million records for a 100% read operation. As Figure 11 shows, data was first loaded into the database by using workload A (50% read and 50% write operations), and then workload C was used to run the YCSB test. Approximately 100,000 IOPS were observed, and the latency during this testing was 9.8ms with 1,024 threads.
Figure 11) Test results for HBase cluster with 100% reads and 64 million records (63,602; 101,758; and 103,494 ops/sec at 256, 512, and 1,024 threads, respectively).
Throughout this testing, the storage-side CPU usage was recorded (Figure 12). It remained at approximately 20% to 30%, which indicates that more headroom is available on the storage side.
Figure 12) Storage-side CPU usage.
5 Conclusion
The NetApp FAS NFS Connector for HBase is easy to deploy and enables the use of existing NFS storage for random read/write access to data. Benchmarking results show that HBase performs well with large quantities of data on NFS storage.
Appendix
Download and save YCSB from GitHub. The YCSB commands are as follows:
1. Run the load command to load data into the database.
./bin/ycsb load hbase10 -p columnfamily=cf -P workloads/workloada -p table=tablename -p recordcount=30000000 -p operationcount=30000000 -threads 256 -s > load_file.txt
2. Use the run command to run the desired workload on the loaded dataset.
./bin/ycsb run hbase10 -p columnfamily=cf -P workloads/workloada -p table=tablename -p recordcount=30000000 -p operationcount=30000000 -threads 256 -s > run_file.txt
The load_file.txt and run_file.txt store the output of the commands.
References
The following references were used in this TR:
HBase architecture http://www.netwoven.com/2013/10/hbase-overview-of-architecture-and-data-model/
Big data overview http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
HBase introduction https://haifengl.wordpress.com/2014/05/14/hbase-and-accumulo/
NetApp FAS NFS Connector for Hadoop https://www.netapp.com/us/media/tr-4382.pdf
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.
Copyright Information
Copyright © 1994–2016 NetApp, Inc. All rights reserved. Printed in the U.S. No part of this document covered by copyright may be reproduced in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval system—without prior written permission of the copyright owner.
Software derived from copyrighted NetApp material is subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice. NetApp assumes no responsibility or liability arising from the use of products described herein, except as expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent rights, trademark rights, or any other intellectual property rights of NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).
Trademark Information
NetApp, the NetApp logo, Go Further, Faster, AltaVault, ASUP, AutoSupport, Campaign Express, Cloud ONTAP, Clustered Data ONTAP, Customer Fitness, Data ONTAP, DataMotion, Flash Accel, Flash Cache, Flash Pool, FlashRay, FlexArray, FlexCache, FlexClone, FlexPod, FlexScale, FlexShare, FlexVol, FPolicy, GetSuccessful, LockVault, Manage ONTAP, Mars, MetroCluster, MultiStore, NetApp Fitness, NetApp Insight, OnCommand, ONTAP, ONTAPI, RAID DP, RAID-TEC, SANshare, SANtricity, SecureShare, Simplicity, Simulate ONTAP, SnapCenter, SnapCopy, Snap Creator, SnapDrive, SnapIntegrator, SnapLock, SnapManager, SnapMirror, SnapMover, SnapProtect, SnapRestore, Snapshot, SnapValidator, SnapVault, SolidFire, StorageGRID, Tech OnTap, Unbound Cloud, vFiler, WAFL, and other names are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. A current list of NetApp trademarks is available on the web at http://www.netapp.com/us/legal/netapptmlist.aspx. TR-4541-0916