Technical Report
NetApp FAS and HBase
Hadoop Online NoSQL with NetApp FAS NFS Connector for Hadoop
Ankita Dhawale and Karthikeyan Nagalingam, NetApp
September 2016 | TR-4541
Abstract
This document introduces NetApp® FAS NFS Connector for Apache HBase, which enables random read/write operations to be performed directly on NFS storage through the connector. It describes the underlying architecture of HBase, deployment with data on NFS, and performance results, and it showcases the benefits of using HBase with NetApp FAS NFS Connector.
TABLE OF CONTENTS
1 Introduction ........................................................................................................................................... 3
1.1 Big Data and HBase Overview ........................................................................................................................ 3
2 Solution Overview ................................................................................................................................ 3
2.1 NetApp FAS .................................................................................................................................................... 3
2.2 NetApp FAS NFS Connector for Hadoop ........................................................................................................ 3
2.3 HBase Solution Architecture ........................................................................................................................... 4
3 Solution Architecture ........................................................................................................................... 5
3.1 HBase Cluster Architecture ............................................................................................................................. 5
4 Solution Validation ............................................................................................................................... 6
4.1 Hardware and Software Prerequisites ............................................................................................................ 6
4.2 Setting Up an HBase Cluster with NFS Connector ............................................................................... 6
4.3 Test Validation ................................................................................................................................................ 7
5 Conclusion .......................................................................................................................................... 12
Appendix .................................................................................................................................................... 13
References ................................................................................................................................................. 13
LIST OF TABLES
Table 1) Hardware and software prerequisites. .............................................................................................................. 6
Table 2) Workloads. ....................................................................................................................................................... 8
LIST OF FIGURES
Figure 1) HBase architecture. ......................................................................................................................................... 4
Figure 2) HBase solution architecture. ........................................................................................................................... 5
Figure 3) CDH web UI. ................................................................................................................................................... 6
Figure 4) Test results for HBase cluster with 50% reads and 5 million records. ............................................................. 8
Figure 5) Test results for HBase cluster with 100% reads and 5 million records. ........................................................... 9
Figure 6) Test results for HBase cluster with 100% insertion and 10 million records. .................................................... 9
Figure 7) Test results for HBase cluster with 100% insertion and 30 million records. .................................................. 10
Figure 8) Test results for HBase cluster with 55% insertion and 45% reads with 10 million records. ........................... 10
Figure 9) Test results for HBase cluster with 55% insertion and 45% reads with 30 million records. ........................... 11
Figure 10) Test results for HBase cluster with 50% reads and 64 million records. ....................................................... 11
Figure 11) Test results for HBase cluster with 100% reads and 64 million records. .................................. 12
Figure 12) Storage-side CPU usage. ........................................................................................................................... 12
1 Introduction
1.1 Big Data and HBase Overview
Big data analytics is an emerging field in which huge amounts of data are examined to identify patterns. These patterns can be useful for predictive analysis, especially analysis related to human behavior and interactions. Considerable cost, effort, and time are associated with loading big data into traditional relational databases for analysis. Therefore, new approaches to storing, analyzing, and accessing data have emerged.
HBase is an open-source, nonrelational, distributed NoSQL database that was developed as part of the Apache Hadoop ecosystem. It generally runs on top of the Hadoop Distributed File System (HDFS), but it can also run on other file systems. Apache HBase is used when random, real-time read/write access to big data is needed.
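As a minimal illustration of this random read/write access (assuming an HBase installation with the shell on the path; the table and cell values here are hypothetical), a row can be written and read back interactively:

# Create a table with one column family, write one cell, and read it back.
hbase shell <<'EOF'
create 'demo', 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello'
get 'demo', 'row1'
EOF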
HBase has a few advantages that make it a powerful NoSQL database. It offers:
Horizontal scalability
Strictly consistent reads and writes
Automatic and configurable sharding of tables
Automatic failover support between region servers
An easy-to-use Java API for client access
2 Solution Overview
HBase normally uses HDFS to store data, but in our setup, the data resides in an NFS volume on the NetApp FAS storage controller and is accessed through NFS Connector. This approach eliminates the need to store data on HDFS and the requirement of maintaining three copies of the data. Only one copy of the data is kept (instead of the three copies that HDFS replication maintains), which saves close to 66% of storage capacity compared with the traditional approach.
NetApp FAS NFS Connector provides data access for Apache HBase from NFS storage without significant changes to the existing system and its configuration. Current trends indicate that data will continue to grow exponentially, and this growth requires correspondingly large amounts of storage space. It is in this space that NetApp can help store and manage petabytes of data.
2.1 NetApp FAS
NetApp FAS combines high-performance hardware with adaptive storage software that can unify your SAN and NAS needs. When used with HBase, the NetApp FAS solution provides manageability, high performance, and efficiency. For more information about FAS controllers, see the NetApp FAS page.
2.2 NetApp FAS NFS Connector for Hadoop
The NetApp FAS NFS Connector for Hadoop allows analytics software such as Apache Hadoop and the Apache HBase NoSQL database to access data on NetApp ONTAP® storage. For more information about NetApp FAS NFS Connector, see TR-4382.
2.3 HBase Solution Architecture
HBase uses a master-slave architecture. Typically, the HBase cluster has one master node called
HMaster and multiple region servers called HRegionServer. Each region server contains multiple
regions called HRegions.
Just as in a relational database, data in HBase is stored in tables, and those tables are stored in regions.
When a table becomes too big, it is partitioned into multiple regions. Those regions are assigned to region
servers across the cluster. Each region server hosts roughly the same number of regions. HBase recommends presplitting tables into predefined regions, roughly 10 times the number of available region servers (see the example after this paragraph). Each region server contains a write-ahead log (called HLog) and multiple regions.
Each region includes a MemStore and several StoreFiles (HFiles). The data resides in the StoreFiles in
the form of column families. The MemStore holds in-memory modifications to the store (data).
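For example, a table can be created with predefined (presplit) regions from the HBase shell. This is a hedged sketch in which the table name, column family, and region count are illustrative; with five region servers, roughly 50 regions follow the guideline above.

# Presplit the table into 50 regions at creation time.
hbase shell <<'EOF'
create 'usertable', 'cf', {NUMREGIONS => 50, SPLITALGO => 'HexStringSplit'}
EOF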
The mapping of regions to region servers is kept in a system catalog table called .META. (named hbase:meta in HBase 1.x). When reading or writing data, clients look up the required region information in this catalog table and communicate directly with the appropriate region server. Each region is identified by its start key (inclusive) and its end key (exclusive).
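In HBase 1.x, this mapping can be inspected directly; a minimal sketch, assuming the HBase shell is available:

# Show each region's boundaries and the region server that hosts it.
hbase shell <<'EOF'
scan 'hbase:meta', {COLUMNS => ['info:regioninfo', 'info:server']}
EOF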
In Apache HBase, ZooKeeper coordinates communication and shares state between the master and the region servers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Therefore, if HBase's ZooKeeper data is removed, only the transient operations are affected; data can still be written to and read from HBase.
This solution architecture, as depicted in Figure 1, uses NFS storage, accessed through NetApp FAS NFS Connector, as the HBase datastore. This setup uses a NetApp FAS8080 storage controller.
Figure 1) HBase architecture.
3 Solution Architecture
This section describes the HBase architecture that was used for the validation and testing, including its
components and layout.
3.1 HBase Cluster Architecture
HBase provides a wide-column data model and random, real-time CRUD operations on top of NFS storage. The cluster can scale out horizontally to efficiently serve billions of rows and millions of columns through autosharding. Because each region is served by only one region server at a time, HBase supports strong consistency for reads and writes. Automatic failover of region servers is also supported.
Figure 2 illustrates the architecture, in which the HMaster, five region servers, and the NetApp FAS8080 are connected through a 10Gb Ethernet switch. One aggregate was created from 10 SAS drives on the FAS8080, and one volume was used for the validation.
Figure 2) HBase solution architecture.
4 Solution Validation
4.1 Hardware and Software Prerequisites
Table 1) Hardware and software prerequisites.
Hardware
Host servers: 6 x servers, each with 2.4GHz CPU and 256GB RAM
Storage: NetApp FAS8080 with 10 x SAS disks
Network: 1 x 10Gbps switch
Software
Operating system: Red Hat Enterprise Linux 6.6
Database: HBase 1.2.0-cdh5.7.0
Benchmarking tool: Yahoo! Cloud Serving Benchmark (YCSB) 0.8.0
Cloudera version: 5.7.0
4.2 Setting Up an HBase Cluster with NFS Connector
To install the NetApp FAS NFS Connector for HBase, complete the following steps:
Note: Cloudera Distribution for Hadoop (CDH) was used for testing, and the HBase version used is the one provided with CDH. After the HBase service is added, its status appears as shown in Figure 3.
Figure 3) CDH web UI.
1. Download the JAR file (hadoop-nfs-connector-2.0.0.jar) from https://github.com/NetApp/NetApp-Hadoop-NFS-Connector/releases.
2. Replace the hadoop-nfs-2.6.0-cdh5.7.xxxx.jar file with hadoop-nfs-3.0.0-SNAPSHOT.jar from the previously mentioned link.
Note: The hadoop-nfs-2.6.0-cdh5.7.xxxx.jar file is shipped with Cloudera.
3. Extract the NFS Connector JAR files to the following locations:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/hadoop-nfs-connector-2.0.0.jar
4. The NFS Connector JAR files must also be present in the following locations (see the example copy commands after this list):
Hadoop's library folder:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/hadoop-nfs-connector-2.0.0.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
HBase libraries:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase/lib/hadoop-nfs-connector-2.0.0.jar
ZooKeeper:
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/zookeeper/lib/hadoop-nfs-3.0.0-SNAPSHOT.jar
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/zookeeper/lib/hadoop-nfs-connector-2.0.0.jar
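The following commands are a hedged sketch of step 4. They assume that both JAR files have been downloaded to the current working directory and that the CDH parcel path matches the locations listed above; they would typically be run on every node in the cluster.

# Copy the NFS Connector JARs into the CDH parcel library directories listed in step 4.
PARCEL=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45
# Hadoop library folders
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/hadoop/"
cp hadoop-nfs-3.0.0-SNAPSHOT.jar "$PARCEL/lib/hadoop/lib/"
# HBase libraries
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/hbase/lib/"
# ZooKeeper libraries
cp hadoop-nfs-3.0.0-SNAPSHOT.jar hadoop-nfs-connector-2.0.0.jar "$PARCEL/lib/zookeeper/lib/"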
5. Configure core-site.xml by referring to the section “Installation and Configuration” in TR-4382:
NetApp FAS NFS Connector for Hadoop.
6. Configure hbase-site.xml.
The hbase-site.xml file is present in six different locations in Cloudera, and all of them must be updated. In the Cloudera search box, type hbase-site.xml to display its contents. Add the following property to all six locations, one by one, and then restart the cluster. Verify on the HBase web UI home page that the HBase root directory now points to NFS.
<property>
<name>hbase.rootdir</name>
<value>nfs://10.63.150.20:2049/Volume1_NFS/hbase</value>
</property>
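As a quick check after the restart, and assuming core-site.xml has already been configured for the NFS Connector as described in TR-4382, listing the HBase root directory through the nfs:// scheme confirms that Hadoop can reach the NFS volume:

# List the HBase root directory on the NFS export configured above.
hadoop fs -ls nfs://10.63.150.20:2049/Volume1_NFS/hbase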
4.3 Test Validation
The Yahoo! Cloud Serving Benchmark (YCSB) tool was used for benchmarking validation of HBase
(NFS) data through NetApp FAS NFS Connector for Hadoop.
Note: During the tests, the Java heap must be kept high. After a large number of data operations are completed, the in-memory data (MemStore) must be flushed, either by waiting for the automatic flush, by restarting the service, or by performing a manual flush.
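A manual flush can be triggered from the HBase shell; this is a minimal sketch, assuming the YCSB target table is named tablename as in the appendix commands:

# Flush the table's MemStore contents to StoreFiles on the NFS volume.
hbase shell <<'EOF'
flush 'tablename'
EOF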
YCSB Benchmarking
YCSB is an open-source specification and program that is commonly used for comparing the performance of NoSQL database management systems.
YCSB consists of two components:
A client that generates the load according to a workload type and records the latency and throughput that are associated with that workload
Workload files that define the workload type by describing the size of the dataset, the total number of requests, and the ratio of read and write queries
There are six major workload types in YCSB:
Update-heavy workload A: 50/50 update/read ratio
Read-heavy workload B: 5/95 update/read ratio
Read-only workload C: 100% read-only, maps closely to workload B
Read-latest workload D: 5/95 insert/read ratio; the read load is skewed toward the most recently inserted records and has similarities to workload B
Short-ranges workload E: 5/95 insert/scan ratio; short scans of up to 100 records each
Read/modify/write workload F: 50/50 write/read ratio; similar to workload A, but the writes are actual updates or modifications rather than just blind writes as in workload A
For more information, visit the YCSB GitHub.
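The read/write ratios above are controlled by YCSB workload properties and can also be overridden on the command line. The following is a hedged example of an update-heavy (workload A style) run; the table name, record counts, and thread count are illustrative:

# 50/50 read/update mix against the HBase binding; -p overrides the workload file properties.
./bin/ycsb run hbase10 -P workloads/workloada -p table=tablename -p columnfamily=cf \
  -p readproportion=0.5 -p updateproportion=0.5 \
  -p recordcount=5000000 -p operationcount=5000000 -threads 100 -s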
HBase Validation with YCSB
Table 2 lists the workloads that were used for benchmarking.
Table 2) Workloads.
Workload           Read %   Update %   Scan %   Insert %
Workload A         50       50         -        -
Workload C         100      -          -        -
Data Insertion     -        -          -        100
Insert and Read    45       -          -        55
Datasets of 5 million, 10 million, 30 million, and 64 million records were used in the tests.
In the first test scenario, we ran 5 million operations and collected the results from the NFS storage.
Figure 4 shows the test results for a mixed workload of 50% reads and 50% writes. A throughput of around 102,000 IOPS with a median latency of 0.9ms was achieved at 100 threads. As Figure 4 shows, the throughput increases linearly with the number of threads. Increasing the thread count beyond 500 improves throughput but also increases latency. NetApp recommends keeping the thread count at 500 for better performance.
Figure 4) Test results for HBase cluster with 50% reads and 5 million records.
Figure 5 shows the test results for a 100% read workload. The throughput is linearly proportional to the number of threads, and going beyond 500 threads increases the throughput significantly. During this test, the latency was 0.8ms at 100 threads, and 113,986 IOPS were observed.
(Figure 4 data: 102,530; 109,628; and 119,223 ops/sec at 100, 250, and 500 threads, respectively.)
Figure 5) Test results for HBase cluster with 100% reads and 5 million records.
Figure 6 shows the results with 10 million records for 100% insertion. HBase generally performs well
during writes, and therefore this test was performed to validate the write performance.
Figure 6) Test results for HBase cluster with 100% insertion and 10 million records.
Figure 7 shows the results of 100% insertion with 30 million records. The throughput increases linearly with the number of threads, but beyond 256 threads, latency increases along with throughput.
(Figure 5 data: 113,986; 116,732; and 121,247 ops/sec at 100, 250, and 500 threads, respectively.)
(Figure 6 data: 77,145; 78,480; and 84,380 ops/sec at 64, 124, and 256 threads, respectively.)
Figure 7) Test results for HBase cluster with 100% insertion and 30 million records.
Figure 8 shows the benchmarking results for 55% insertion and 45% reads with 10 million records. One observation is that throughput was low when the thread count was less than 256. It can therefore be concluded that it is best to increase the number of threads for this kind of mixed insert-and-read pattern.
Figure 8) Test results for HBase cluster with 55% insertion and 45% reads with 10 million records.
(Figure 7 data: 71,227; 81,003; and 86,991 ops/sec at 64, 124, and 256 threads, respectively.)
(Figure 8 data: 58,308; 74,420; and 86,920 ops/sec at 256, 512, and 1,024 threads, respectively.)
Figure 9 shows the results of a workload of 55% insertion and 45% reads with 30 million records. It was
observed that the latencies during this workload were 5.8ms for reads and 9.2ms for insertion.
Figure 9) Test results for HBase cluster with 55% insertion and 45% reads with 30 million records.
Figure 10 shows the test results for YCSB workload A (50% reads) with 64 million records. During the test, we observed 90,000 IOPS with 1,024 threads, and the latency was recorded as 5.7ms.
Figure 10) Test results for HBase cluster with 50% reads and 64 million records.
(Figure 9 data: 34,480; 58,625; and 61,309 ops/sec at 256, 512, and 1,024 threads, respectively.)
(Figure 10 data: 75,872; 79,299; and 90,422 ops/sec at 256, 512, and 1,024 threads, respectively.)
Additionally, YCSB workload C was run with 64 million records for a 100% read operation. As Figure 11 shows, data was first loaded into the database by using workload A (50% read and 50% write operations), and then workload C was used to run the YCSB test. Approximately 100,000 IOPS were observed, and the latency during this testing was 9.8ms with 1,024 threads.
Figure 11) Test results for HBase cluster with 100% reads and 64 million records (63,602; 101,758; and 103,494 ops/sec at 256, 512, and 1,024 threads, respectively).
Throughout this testing, the storage-side CPU usage was recorded (Figure 12). It remained at approximately 20% to 30%, which indicates that more headroom is available on the storage side.
Figure 12) Storage-side CPU usage.
5 Conclusion
The NetApp FAS NFS Connector for HBase is easy to deploy and enables the use of existing NFS storage for random read/write access to data. Benchmarking results show that HBase performs well with large quantities of data on NFS storage.
Appendix
Download and save YCSB from GitHub. The YCSB commands are as follows:
1. Run the load command to load data into the database.
./bin/ycsb load hbase10 -p columnfamily=cf -P workloads/workloada -p table=tablename -p recordcount=30000000 -p operationcount=30000000 -threads 256 -s > load_file.txt
2. Use the run command to run the desired workload on the loaded dataset.
./bin/ycsb run hbase10 -p columnfamily=cf -P workloads/workloada -p table=tablename -p recordcount=30000000 -p operationcount=30000000 -threads 256 -s > run_file.txt
The load_file.txt and run_file.txt store the output of the commands.
References
The following references were used in this TR:
HBase architecture http://www.netwoven.com/2013/10/hbase-overview-of-architecture-and-data-model/
Big data overview http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
HBase introduction https://haifengl.wordpress.com/2014/05/14/hbase-and-accumulo/
NetApp FAS NFS Connector for Hadoop https://www.netapp.com/us/media/tr-4382.pdf
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.
Copyright Information
Copyright © 1994–2016 NetApp, Inc. All rights reserved. Printed in the U.S. No part of this document covered by copyright may be reproduced in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an electronic retrieval system—without prior written permission of the copyright owner.
Software derived from copyrighted NetApp material is subject to the following license and disclaimer:
THIS SOFTWARE IS PROVIDED BY NETAPP "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, WHICH ARE HEREBY DISCLAIMED. IN NO EVENT SHALL NETAPP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NetApp reserves the right to change any products described herein at any time, and without notice. NetApp assumes no responsibility or liability arising from the use of products described herein, except as expressly agreed to in writing by NetApp. The use or purchase of this product does not convey a license under any patent rights, trademark rights, or any other intellectual property rights of NetApp.
The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.277-7103 (October 1988) and FAR 52-227-19 (June 1987).
Trademark Information
NetApp, the NetApp logo, Go Further, Faster, AltaVault, ASUP, AutoSupport, Campaign Express, Cloud ONTAP, Clustered Data ONTAP, Customer Fitness, Data ONTAP, DataMotion, Flash Accel, Flash Cache, Flash Pool, FlashRay, FlexArray, FlexCache, FlexClone, FlexPod, FlexScale, FlexShare, FlexVol, FPolicy, GetSuccessful, LockVault, Manage ONTAP, Mars, MetroCluster, MultiStore, NetApp Fitness, NetApp Insight, OnCommand, ONTAP, ONTAPI, RAID DP, RAID-TEC, SANshare, SANtricity, SecureShare, Simplicity, Simulate ONTAP, SnapCenter, SnapCopy, Snap Creator, SnapDrive, SnapIntegrator, SnapLock, SnapManager, SnapMirror, SnapMover, SnapProtect, SnapRestore, Snapshot, SnapValidator, SnapVault, SolidFire, StorageGRID, Tech OnTap, Unbound Cloud, vFiler, WAFL, and other names are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. A current list of NetApp trademarks is available on the web at http://www.netapp.com/us/legal/netapptmlist.aspx. TR-4541-0916