Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha...
-
Upload
yahoo-developer-network -
Category
Documents
-
view
7.623 -
download
1
Transcript of Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha...
Apache Hadoop India Summit 2011
Searching Information
Inside Hadoop Platform
Abinasha KaranaDirector-Technology
Bizosys Technologies Pvt Ltd.
www.bizosys.com
Apache Hadoop India Summit 2011
To search a large dataset inside HDFS and HBase, At Bizosys we started with Map-Reduce and Lucene/Solr
Apache Hadoop India Summit 2011
Map-reduce
Result not in a mouse clickWhat didn’t work for us
Apache Hadoop India Summit 2011
It required vertical scaling with manual sharding and subsequent resharding as data grew
Lucene/Solr
What didn’t work for us
Apache Hadoop India Summit 2011
We built a new search for
Hadoop Platform HDFS and HBase
What we did
Apache Hadoop India Summit 2011
my learning from designing, developing
and benchmarking a distributed,
real-time search engine whose Index is
stored and served out of HBase
In the next few slides you will hear about
Apache Hadoop India Summit 2011
Key Learning1. Using SSD is a design decision.
2. Methods to reduce HBase table storage size
3. Serving a request without accessing ALL region servers
4. Methods to move processing near the data
5. Byte block caching to lower network and I/O trips to HBase
6. Configuration to balance network vs. CPU vs. I/O vs memory
Apache Hadoop India Summit 2011
1 Using SSD is a design decision
SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.
In HBase Table Schema Design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.
Apache Hadoop India Summit 2011
.. Our SSD Friendly Schema Design
Keyword + Document in 1 Table
Keyword: Reads all for a query.
Document: Reads 10 docs / query.
Keyword + Document in 2 Tables
SSD deployment is All or none SSD deployment is only for Keyword Table
Apache Hadoop India Summit 2011
2 Methods to reduceHBase table storage size
Storing a 4 byte cell requires >27bytes in HBase.
Key len
gth
Valu
e leng
th
Ro
w len
gth
Ro
w B
ytes
Fam
ily Len
gth
Fam
ily Bytes
Qu
alifier Bytes
Tim
estamp
Key Typ
e
Valu
e B
ytes
4 BYTES1 BYTE4 BYTES 4 BYTES 2 BYTES BYTES 1 BYTE BYTES 8 BYTESBYTES
Apache Hadoop India Summit 2011
.. to 1/3rd
• Stored large cell values by merging cells• Reduced the Family name to 1 Character• Reduced the Qualifier name to 1 Character
Apache Hadoop India Summit 2011
3Serving a request without
accessing ALL region servers
Consider a 100 node cluster of HBase and a single
search request need to access all of them.
Bad Design.. Clogged Network.. No scaling
Apache Hadoop India Summit 2011
3 And our solution…
0-1 M
1-2 M
2-3 M
3-4 M
4-5 M
Ro
w R
ang
es
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Fam
ily A
Fam
ily B
Scan “Family A” - 5 Machines HitScan Table A - 3 Machines Hit
Machine 1
Machine 2
Machine 3 Machine 3
Machine 4
Machine 5
Table A Table B
Index Table was divided on Column-Family as separate tables
Apache Hadoop India Summit 2011
4Methods to move
processing near the data
public class TermFilter implements Filter {public ReturnCode filterKeyValue(KeyValue kv) {
boolean isMatched = isFound(kv);if (isMatched ) return ReturnCode.INCLUDE;return ReturnCode.NEXT_ROW;
}…
• Sent filtered Rows over network.
public class DocFilter implements Filter {public void filterRow(List<KeyValue> kvL) {
byte[] val = extractNeededPiece(kvL);kvL.clear();kvL.add(new KeyValue(row,fam,,val));
}….
• Sent relevant section of a Field over network.
E.g. Matched rows for a keyword
E.g. Computing a best match section from within a document for a given query
• Sent relevant Fields of a Row over network.
Apache Hadoop India Summit 2011
5Byte block caching to lower
network and I/O trips to HBase
• Object caching – With growing number of objects
we encountered ‘Out of Memory’ exception
• HBase commit - Frequent flushing to HBase
introduced network and I/O latencies.
Converting Objects to intermediate Byte Blocks
increased record processing by 20x in 1 batch.
Apache Hadoop India Summit 2011
6Configuration to balance Network vs. CPU vs. I/O vs. Memory
DiskI/O
Network
Memory CPU
Block Caching
IPC Caching
Compression
Compression
Aggressive GC
In a Single Machine
Apache Hadoop India Summit 2011
… and it’s settings
• Network– Increased IPC Cache Limits (hbase.client.scanner.caching)
• CPU– JVM agressive heap
("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 XX:+AggressiveHeap “)
• I/O– LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”)
• Memory– HBase block caching (hfile.block.cache.size) and overall memory
allocation for data-node and region-server.
Apache Hadoop India Summit 2011
.. and parallelized to multi-machines
•HTable.batch (Get, Put, Deletes)•ParallelHTable (Scans)•FUTURE-coprocessors (hbase 0.92 release).
Allocating appropriate resources dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
Apache Hadoop India Summit 2011
HSearch Benchmarks on AWS
• Amazon Large instance 7.5 GB Memory * 11 Machines
with a single 7.5K SATA drive
• 100 Million Wikipedia pages of total 270GB and
completely indexed (Included common stopwords)
• 10 Million pages repeated 10 times. (Total indexing time
is 5 Hours)
• Search Query Response speed using a – regular word is 1.5 sec
– common word such as “hill” found 1.6 million matches and
sorted in 7 seconds
Apache Hadoop India Summit 2011
www.sourceforge.net/bizosyshsearch
• Apache-licensed (2 versions released)
• Distributed, real-time search
• Supports XML documents with rich search syntax
and various filtration criteria such as document
type, field type.
Apache Hadoop India Summit 2011
References• Initial performance reports (Bizosys HSearch, a Nosql
search engine, featured in Intel Cloud Builders Success Stories (July, 2010)
• HSearch is currently in use at http://www.10screens.com
• More on hsearch– http://www.bizosys.com/blog/– http://bizosyshsearch.sourceforge.net/
• More on SSD Product Technical Specifications http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf