Solr+Hadoop = Big Data Search
Uploaded by: cloudera-inc
Category: Technology
Transcript of Solr+Hadoop = Big Data Search
1
Solr + Hadoop = Big Data Search
Mark Miller
2
Who Am I?
Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member
First job out of college was in the newspaper archiving business.
First full-time employee at LucidWorks - a startup around Lucene/Solr.
Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.
3
Very fast and feature-rich ‘core’ search engine library.
Compact and powerful, Lucene is an extremely popular full-text search library.
Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
Just the core - either you write the ‘glue’ or use a higher-level search engine built with Lucene.
4
Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.
- Wikipedia
5
Search on Hadoop History
• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc
6
Family Tree
...
7
Strengthen the Family Bonds
• No need to build something radically new - we have the pieces we need.
• Focus on integration points.
• Create high quality, first class integrations and contribute the work to the projects involved.
• Focus on integration and quality first - then performance and scale.
8
SolrCloud
9
Solr Integration
• Read and Write directly to HDFS
• First Class Custom Directory Support in Solr
• Support Solr Replication on HDFS
• Other improvements around usability and configuration
10
Read and Write directly to HDFS
• Lucene did not historically support append-only filesystems.
• “Flexible Indexing” brought support for append-only filesystems.
• Lucene has supported append-only filesystems by default since 4.2.
11
Lucene Directory Abstraction
It’s how Lucene interacts with index files.
Solr uses the Lucene library and offers DirectoryFactory.

    class Directory {
        listAll();
        createOutput(file, context);
        openInput(file, context);
        deleteFile(file);
        makeLock(file);
        clearLock(file);
        …
    }
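To make the idea of a pluggable storage layer concrete, here is a minimal Python sketch, not Lucene's actual Java API. The method names mirror the operations listed on the slide, but the class names, the missing `context` argument, and the omitted lock methods are simplifications assumed for illustration.

```python
import io
from abc import ABC, abstractmethod


class Directory(ABC):
    """Hypothetical, simplified analogue of the file operations an index needs."""

    @abstractmethod
    def list_all(self): ...

    @abstractmethod
    def create_output(self, name): ...

    @abstractmethod
    def open_input(self, name): ...

    @abstractmethod
    def delete_file(self, name): ...


class InMemoryDirectory(Directory):
    """Stores files in a dict; an HDFS-backed class would implement the same methods."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name):
        files = self._files

        class _Output(io.BytesIO):
            def close(self):
                # Publish the written bytes once, when the writer is closed.
                if not self.closed:
                    files[name] = self.getvalue()
                super().close()

        return _Output()

    def open_input(self, name):
        return io.BytesIO(self._files[name])

    def delete_file(self, name):
        del self._files[name]
```

Code that only talks to `Directory` does not care where the bytes live, which is the property that lets an HDFS-backed implementation be swapped in.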
12
Putting the Index in HDFS
• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache.
• We contributed back optional ‘write’ caching.
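The block cache idea can be sketched in a few lines of Python. This is an illustrative LRU cache of fixed-size blocks in front of a slow positional reader, not the Blur or Solr implementation. The class name, the `read_at` callable, and the sizes are assumptions, and the blocks here live in ordinary Python memory rather than off heap.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size blocks in front of a slow reader (hypothetical sketch)."""

    def __init__(self, read_at, block_size=8192, max_blocks=1024):
        self._read_at = read_at        # callable(offset, length) -> bytes
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()   # block index -> bytes, oldest first
        self.misses = 0

    def _block(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)      # mark as recently used
            return self._blocks[index]
        self.misses += 1
        data = self._read_at(index * self._block_size, self._block_size)
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)     # evict least recently used
        return data

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self._block_size)
            chunk = self._block(index)[start:start + (end - offset)]
            if not chunk:
                break                            # read past end of file
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

Repeated random reads over the same hot region are then served from memory, and only cold blocks go back to the underlying store.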
13
Putting the TransactionLog in HDFS
• HdfsUpdateLog added - extends UpdateLog
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary.
• Same extensive testing as used on UpdateLog
14
Running Solr on HDFS
• Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.
• Set LockType to ‘hdfs’
• Use an UpdateLog dataDir location that begins with ‘hdfs:/’
• Or: java -Dsolr.directoryFactory=HdfsDirectoryFactory
           -Dsolr.lockType=solr.HdfsLockFactory
           -Dsolr.updatelog=hdfs://host:port/path -jar start.jar
15
Solr Replication on HDFS
• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaring, only a local filesystem based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.
16
Future Solr Replication on HDFS
• Take advantage of the “distributed filesystem” and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up if it has the capacity.
[Diagram: three Solr nodes on top of HDFS]
17
MR Index Building
• Scalable index creation via map-reduce
• Many initial ‘homegrown’ implementations sent documents from reducer to SolrCloud over http
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr
• The ideal impl will allow using as many reducers as are available in your hadoop cluster, and then merge the indexes down to the correct number of ‘shards’
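The "many reducers, then merge down to the shard count" idea can be illustrated with a toy Python model. It is a sketch under stated assumptions and not the actual map-reduce job: the inverted index is a plain dict, documents are spread round-robin, and all function names are hypothetical.

```python
from collections import defaultdict


def build_partial_indexes(docs, num_reducers):
    """Each 'reducer' builds a tiny inverted index: term -> set of doc ids."""
    partials = [defaultdict(set) for _ in range(num_reducers)]
    for n, (doc_id, text) in enumerate(docs):
        index = partials[n % num_reducers]   # round-robin stand-in for partitioning
        for term in text.lower().split():
            index[term].add(doc_id)
    return partials


def merge_indexes(indexes):
    """Union the postings of several partial indexes into one index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term] |= ids
    return merged


def merge_down(partials, num_shards):
    """Merge many partial indexes into exactly num_shards final indexes."""
    groups = [partials[s::num_shards] for s in range(num_shards)]
    return [merge_indexes(group) for group in groups]
```

With four reducers and two shards, reducers 0 and 2 merge into shard 0 and reducers 1 and 3 into shard 1. A real job would partition documents the same way the target Solr cluster routes them, so each merged index lands on the right shard.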
18
MR Index Building
[Diagram: three mappers ("Parse input into indexable document") feed arbitrary reducing steps of indexing and merging; End-Reducer (shard 1) and End-Reducer (shard 2) each index documents, producing Index shard1 and Index shard2]
19
SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• What URLs to GoLive to.
• The Schema to use when building indexes.
• Match hash -> shard assignments of a Solr cluster.
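Matching hash -> shard assignments means the offline job must place each document in the same shard the live cluster would choose. A minimal Python sketch of range-based assignment follows. It assumes each shard owns a contiguous slice of a 32-bit hash space and uses `zlib.crc32` purely as a stand-in; a real job would use the cluster's own hash function and the ranges read from its state in ZooKeeper. The function names are hypothetical.

```python
import zlib
from bisect import bisect_right


def shard_ranges(num_shards):
    """Split the 32-bit hash space into contiguous ranges; return each range's start."""
    step = 2**32 // num_shards
    return [i * step for i in range(num_shards)]


def shard_for(doc_id, starts):
    """Return the index of the shard whose hash range contains this document id."""
    h = zlib.crc32(doc_id.encode("utf-8"))   # stand-in hash, unsigned 32-bit
    return bisect_right(starts, h) - 1
```

Because the assignment depends only on the document id and the range table, an index built offline for shard N contains exactly the documents the cluster would have routed to shard N.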
20
GoLive
• After building your indexes with map-reduce, how do you deploy them to your Solr cluster?
• We want it to be easy - so we built the GoLive option.
• GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster.
• Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
21
Flume Solr Sync
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
- Apache Flume Website
[Diagram label: Other Logs]
22
Flume Solr Sync
[Diagram: two Flume agents, HDFS, and Solr]
23
SolrCloud Aware
• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• What URLs to send data to.
• The Schema for the collection being indexed to.
24
HBase Integration
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Morphlines library for ETL of rows
• AL2 licensed on github https://github.com/ngdata
25
HBase Integration
[Diagram labels: HDFS; HBase; interactive load; Indexer(s); Triggers on updates; Solr servers]
26
Morphlines
• A morphline is a configuration file that allows you to define ETL transformation pipelines
• Extract content from input files, transform content, load content (e.g. to Solr)
• Uses Tika to extract content from a large variety of input documents
• Part of the CDK (Cloudera Development Kit)
27
Morphlines
[Diagram: syslog -> Flume Agent -> Solr Sink (Command: readLine -> Command: grok -> Command: loadSolr) -> Solr]
• Open Source framework for simple ETL
• Ships as part of the Cloudera Development Kit (CDK)
• It’s a Java library
• AL2 licensed on github https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats: Avro, Sequence file, Text, Etc…
28
Morphlines
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
29
Morphlines
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
30
Morphlines Example Config

    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          { readLine {} }
          {
            grok {
              dictionaryFiles : [/tmp/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
              }
            }
          }
          { loadSolr {} }
        ]
      }
    ]

Example Input

    <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output Record

    syslog_pri:164
    syslog_timestamp:Feb  4 10:46:14
    syslog_hostname:syslog
    syslog_program:sshd
    syslog_pid:607
    syslog_message:listening on 0.0.0.0 port 22.
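To show what the readLine -> grok -> loadSolr chain does to a record, here is a rough Python analogue. It is a sketch and not the Morphlines library: the regular expression only approximates the grok dictionary patterns (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), the function names are hypothetical, and a plain list stands in for Solr.

```python
import re

# Simplified approximation of the grok expression in the config above.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def read_line(line):
    """Turn one input line into a record with a 'message' field."""
    return {"message": line.rstrip("\n")}


def grok(record):
    """Add the extracted fields to the record, or drop it if the line does not match."""
    match = SYSLOG.match(record["message"])
    if match is None:
        return None
    return {**record, **match.groupdict()}


def run_pipeline(lines, sink):
    """Chain the commands; appending to 'sink' stands in for loadSolr."""
    for line in lines:
        record = grok(read_line(line))
        if record is not None:
            sink.append(record)
```

Fed a syslog line of the shape shown in the example input, the pipeline yields a record carrying the six `syslog_*` fields listed in the output record, alongside the original `message`.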
31
Hue Integration
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
32
Cloudera Search
https://ccp.cloudera.com/display/SUPPORT/Downloads
Or Google
“cloudera search download”
Mark Miller, Cloudera
@heismark