SEARCHING BILLIONS OF PRODUCT LOGS IN REAL...
![Page 1: SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIMEnosqlroadshow.com/dl/.../SearchingBillionsofProductLogsinRealTime… · • Real Time Search • Searching before it hits disk NEW TECHNIQUES](https://reader036.fdocuments.us/reader036/viewer/2022081407/5f8e32b26a71ed117d32af6e/html5/thumbnails/1.jpg)
SEARCHING BILLIONS OF PRODUCT LOGS IN REAL TIME
Ryan Tabora - Think Big Analytics
NoSQL Search Roadshow - June 6, 2013
1
WHO AM I?
Ryan Tabora
Think Big Analytics - Senior Data Engineer
Lover of dachshunds, bass, and zombies
2
OVERVIEW
Primers
What are product logs?
How do they apply to big data?
Real use case
Real issues and designs
Conclusion
3
PRODUCT LOGS?
• Device data
• IT, Energy, Healthcare, Manufacturing, Telecom ...
• These devices are pushing data back home (pull works too!)
• As more devices are sold/installed, more and more data comes back to ‘home base’
4
POWER OF DEVICE DATA
• Realtime Visualization
• Realtime Response
• Ad Hoc Analysis
• Full Historical Capture
• Blended Data Sets
5
TRADITIONAL APPROACHES
• SQL: PostgreSQL, MySQL, Oracle, Microsoft
• SQL provides many of the search features required for typical search applications
• Joins, regex, group by, sorting, etc.
• But these technologies can only scale so far...
6
NEW TECHNIQUES: STORING DATA
• Hadoop
• HBase/Cassandra/Accumulo
• Search features are very limited
• HBase row scans, primary key index
• Cassandra limited secondary indexing
7
NEW TECHNIQUES: INDEXING DATA
• What is an index?
• Lucene
• Parallelizing index creation
• MapReduce/Flume/Storm
• Real Time Search
• Searching before it hits disk
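To make "What is an index?" concrete, here is a minimal sketch of an inverted index, the general structure that Lucene-style search is built around. It is a toy illustration only: the tokenizer, function names, and sample log lines are assumptions made for this example, and Lucene's real on-disk index is far more elaborate.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy tokenizer: lowercase, then keep runs of letters and digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, *terms):
    """Return ids of documents that contain every term (boolean AND)."""
    result = None
    for term in terms:
        ids = set(index.get(term.lower(), []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Hypothetical log lines, keyed by document id.
logs = {
    1: "WARNING: disk dead",
    2: "INFO: disk replaced",
    3: "WARNING: fan speed low",
}
index = build_inverted_index(logs)
print(search_all(index, "warning", "disk"))  # [1]
```

A lookup touches only the posting lists of the query terms, not every log line, which is why an index keeps search fast as the data grows.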
8
NEW TECHNIQUES: SEARCHING DATA
• Solr/ElasticSearch
• Both build on top of Lucene
• Search servers
• RESTful HTTP APIs
• Easy to administer
• Add powerful text/numerical search capabilities
9
BASIC SEARCH FEATURES
• Boolean logic (AND, OR, +, -)
• Sorting and Group By
• Range queries
• Phrase/Prefix/Fuzzy queries
10
ADVANCED SEARCH FEATURES
• Custom ranking/scoring
• More like this
• Auto suggest
• Faceting/Highlighting
• Geospatial search
11
SCALING SEARCH
• ElasticSearch and SolrCloud both have distributed features built in
• Auto-sharding
• Replication
• Query routing
• Transaction log
12
USE CASE
Problem
Sample Solution
Core Design Issues
Other Solutions
13
THE PROBLEM
[Diagram: NetApp Filer devices at Client A, Client B, and Client C each send logs to Home Base. Home Base offers the logs through a REST API, Full SQL Access, and Flat File Access to Applications and to Engineers & Analysts. Other labels in the diagram: Latest, All.]
14
SEARCH APPLICATION FEATURES
• Find last three days of raw logs from an entire cluster
• Group available capacity by machine serial number and show the largest capacities first
• Search all device header lines for “FAILURE”
• View all hard disk objects that have product number 2341AB
• Find all motherboards with an associated customer ticket
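As a rough illustration, the features above could map onto Solr request parameters along these lines. The parameters `q`, `fq`, `sort`, `group`, and `group.field` are standard Solr query parameters, but everything else is an assumption for this sketch: the host and core name, and the field names `capacity_available`, `serial_number`, `object_type`, `product_number`, and `ticket_id` are hypothetical, and the date range assumes `date_sent` is indexed as a Solr date field.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the host and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/logs/select"

# One parameter set per feature; field names are illustrative only.
FEATURE_QUERIES = {
    "raw logs from one cluster, last three days": {
        "q": 'cluster_id:"1333-2241-3411"',
        "fq": "date_sent:[NOW-3DAYS TO NOW]",  # Solr date math
    },
    "capacity grouped by serial number, largest first": {
        "q": "capacity_available:[* TO *]",
        "group": "true",
        "group.field": "serial_number",
        "sort": "capacity_available desc",
    },
    "header lines containing FAILURE": {
        "q": "header:FAILURE",
    },
    "hard disks with product number 2341AB": {
        "q": "object_type:hard_disk AND product_number:2341AB",
    },
    "motherboards with a customer ticket": {
        # field:[* TO *] matches documents where the field has any value
        "q": "object_type:motherboard AND ticket_id:[* TO *]",
    },
}


def solr_url(params):
    """Build a select URL with properly escaped query parameters."""
    return SOLR_SELECT + "?" + urlencode(params)


for feature, params in FEATURE_QUERIES.items():
    print(feature, "->", solr_url(params))
```

The sketch only builds URLs; sending them and parsing the responses is left out so that it runs without a Solr server.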
15
SAMPLE SOLUTION
[Diagram labels: Logs, Ingestion, Parsing/Loading, Indexing, HDFS, MapReduce, Custom RESTful Search API, Queries]
16
INGESTION
17
PARSING, LOADING, AND INDEXING
Load HBase with parsed objects
Store HBase ROW_ID
Store pointer to raw file in HDFS
Index a number of desired fields
[Diagram: three sequence files, /ingestion/sequencefile1, /ingestion/sequencefile2, and /ingestion/sequencefile3, each with record offsets starting at 0. Offsets shown: 1534 4562 5323 7232; 4601 5105; 1492 2987 4767 5987.]
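The write path described on this slide can be sketched with in-memory stand-ins. This is a sketch under stated assumptions, not the deck's actual code: a dict plays the role of the HBase table, a list plays the role of the search index, and the function name, rowkey value, and choice of indexed fields are all hypothetical.

```python
def load_and_index(hbase_table, search_index, rowkey, parsed, seq_file, offset,
                   indexed_fields=("system_id", "date_sent", "file_name")):
    """Store the full parsed object, then index a few fields plus pointers."""
    # 1. Load the parsed object into the (simulated) HBase table.
    hbase_table[rowkey] = dict(parsed)
    # 2. The index document keeps the row id and a pointer to the raw file.
    doc = {"rowkey": rowkey, "sequenceFile": seq_file, "offset": offset}
    # 3. Only a chosen subset of fields is made searchable.
    for field in indexed_fields:
        if field in parsed:
            doc[field] = parsed[field]
    search_index.append(doc)
    return doc


hbase_table, search_index = {}, []
doc = load_and_index(
    hbase_table, search_index,
    rowkey="row-0001",  # hypothetical row id
    parsed={"system_id": "sys-1", "date_sent": "2013/05/12",
            "file_name": "configs.log", "contents": "WARNING: DISK DEAD..."},
    seq_file="/ingestion/sequencefile1",
    offset=1534,
)
print(doc)
```

Keeping only pointers and selected fields in the index keeps it small, while the full object and the raw file stay in the stores built to hold them.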
18
INSIDE OF HBASE
[Diagram: an HBase table in which each row has a rowkey followed by columns object1, object2, object3, object4, object5, ...]
19
THE SOLR DOCUMENT
[Diagram: a Solr document with the fields rowkey, system_id, date_sent, file_name, contents, header, cluster_id, sequenceFile, and Offset. Sample values shown: 42ADFF-BZMM, 2013/05/12, configs.log, WARNING: DISK DEAD..., 1333-2241-3411, /ingest/file2, 2343sf.]
20
SEARCH APPLICATION
[Diagram: a User sends a Query to the Search Application and receives Query Results. Numbered steps 1 to 8 connect the labels Query, Data locations, Row_id, Stored Object, Raw Data Location, and Raw Data.]
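One possible reading of this flow, sketched with in-memory stand-ins: the search index answers the query with data locations, and the application then fetches the stored object by row id and the raw data by file and offset. The exact order of the eight numbered steps is not recoverable from the slide, so this is an assumption, as are all names and sample values below.

```python
def search(index, field, value):
    """Return index documents whose field equals value (exact match only)."""
    return [doc for doc in index if doc.get(field) == value]


def fetch(hit, hbase_table, raw_files):
    """Follow the pointers in a search hit to the stored object and raw data."""
    stored_object = hbase_table[hit["rowkey"]]
    raw_record = raw_files[hit["sequenceFile"]][hit["offset"]]
    return stored_object, raw_record


# Simulated search index: small documents holding pointers, not the data.
index = [{
    "rowkey": "row-0001",
    "file_name": "configs.log",
    "sequenceFile": "/ingestion/sequencefile1",
    "offset": 1534,
}]
# Simulated HBase table: row id -> parsed object.
hbase_table = {
    "row-0001": {"file_name": "configs.log", "contents": "WARNING: DISK DEAD..."},
}
# Simulated HDFS: file path -> {record offset -> raw record}.
raw_files = {"/ingestion/sequencefile1": {1534: b"raw bytes of configs.log"}}

results = [fetch(hit, hbase_table, raw_files)
           for hit in search(index, "file_name", "configs.log")]
print(results)
```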
21
CORE DESIGN ISSUES
• Changing the Solr schema (manual reindex)
• Elastic shard scaling (manual reindex)
• No distributed joining (denormalizing the data)
• Replication*
• Manually managing Solr partitioning/sharding*
• Write durability*
22
SOLRCLOUD
• Automatic shard creation, routing
• Replication
• Limited to a fixed number of shards defined on initial creation
• ZooKeeper for coordination
• Large community
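To see why a fixed shard count tends to force a reindex when the cluster grows, consider a deliberately simplified router that assigns each document to `hash(id) mod N`. This is not SolrCloud's actual routing scheme; it is an assumption chosen to keep the arithmetic visible. Going from 3 to 4 shards, a document stays put only when its hash gives the same remainder under both counts, which happens for 3 of every 12 hash values, so about three quarters of the documents would move.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash (Python's built-in hash() is salted)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    )
    return moved / len(doc_ids)


# Hypothetical document ids.
doc_ids = [f"doc-{i}" for i in range(1000)]
print(moved_fraction(doc_ids, 3, 4))
```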
23
ELASTICSEARCH
• Similar feature set to Solr
• Purpose built for easily managing a distributed index
• Rapidly growing community
• Custom built coordination mechanism
• JSON based API
24
DATASTAX ENTERPRISE
• Integrates Cassandra and Solr
• Automatic indexing in Solr/storing in Cassandra
• Automatic partitioning
• Automatic reindexing
• Not limited to fixed number of shards
• Proprietary and costs money
25
CONCLUSION
• Collecting and analyzing device data/product logs can be a very difficult challenge
• You can use NoSQL and search technologies like Solr or ElasticSearch in unison...
• ...but it is not always easy to integrate search with NoSQL
26
QUESTIONS?
• Feel free to reach out if you have any questions or need help with big data/search!
• http://ryantabora.com
• http://thinkbiganalytics.com
• http://www.slideshare.net/ratabora
• @ryantabora
27
BONUS SLIDES
28
HBASE AND SOLR
• Automatic partitioning/reindexing
• Automatic index updates on HBase inserts/deletes
• Mapping HBase cells to a Solr schema
• No perfect commercial/open source solution yet
• Many many many more...
29
HBASE + SOLR: AUTOMATIC INDEXING
• HBase coprocessors are like stored procedures/triggers
• New, powerful, and dangerous
• Triggers on HBase puts/deletes
• Mapping data to a schema?
30
HBASE + SOLR WRITE DURABILITY
[Diagram: a MapReduce Indexing Application writes SolrDocuments to the HBase table SOLR_QUEUE; a Solr Queue Reader moves them to Solr Shard 1, Solr Shard 2, and Solr Shard 3.]
1. Create SolrDocument from raw ASUP
2. Add SolrDocument to HBase for durability
3. Get oldest SolrDocument from HBase Queue Table
4. Use custom hash algorithm to determine which shard to add SolrDocument to
5. Query Solr; if SolrDocument was added, then remove it from the SOLR_QUEUE
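The queue-based durability steps on this slide can be sketched as a small simulation. Everything here is an in-memory stand-in and an assumption: a dict plays the SOLR_QUEUE table, `FlakyShard` is a hypothetical class that can silently drop writes, and the MD5-based `pick_shard` stands in for the deck's unspecified custom hash algorithm.

```python
import hashlib


class FlakyShard:
    """In-memory stand-in for a Solr shard that can silently drop writes."""

    def __init__(self, fail_first=0):
        self.docs = {}
        self.failures_left = fail_first

    def add(self, doc):
        if self.failures_left > 0:
            self.failures_left -= 1  # simulate a lost write
            return
        self.docs[doc["id"]] = doc

    def contains(self, doc_id):
        return doc_id in self.docs


def pick_shard(doc_id, shards):
    """Choose a shard from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def drain_queue(queue, shards):
    """One pass over the queue, oldest first; returns how many docs were confirmed."""
    confirmed = 0
    for doc_id in list(queue):  # dicts keep insertion order, so oldest comes first
        shard = pick_shard(doc_id, shards)
        shard.add(queue[doc_id])
        # Remove from the durable queue only after the shard confirms the doc.
        if shard.contains(doc_id):
            del queue[doc_id]
            confirmed += 1
    return confirmed


# The indexing job has already written both documents to the durable queue.
queue = {"a": {"id": "a"}, "b": {"id": "b"}}
shards = [FlakyShard(fail_first=1)]  # a single shard that drops its first write

first_pass = drain_queue(queue, shards)   # "a" is dropped, "b" is confirmed
second_pass = drain_queue(queue, shards)  # "a" is retried and confirmed
print(first_pass, second_pass, list(queue))  # 1 1 []
```

Because a document leaves the queue only after the shard confirms it, a dropped write costs a retry rather than data loss.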
31
HBASE + SOLR: ELASTIC SHARDING
• HBase’s distribution mechanism uses the concept of regions to split data across many nodes
• Region splitting can be automatic or manual (performance degrades while regions split)
• Piggybacking Solr sharding on HBase region splitting
32