Post on 27-Jan-2015
How to build a small distributed search engine using open source software
Building a distributed search engine
Search engine subsystems:
● Page database
● List of the pages to retrieve
● Page retrieval and storage
● Page content parsing
● Full-text indexing of the contents
● Graph database of the links for ranking
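To make the division of work between these subsystems concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the in-memory `FAKE_WEB` dictionary stands in for real HTTP retrieval, and the names `LinkAndTextParser` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "a": '<a href="b">to b</a> <a href="c">to c</a> alpha',
    "b": '<a href="c">to c</a> beta',
    "c": "gamma",
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.extend(data.split())


def crawl(seed):
    frontier = deque([seed])  # list of the pages to retrieve
    page_db = {}              # page database: URL -> raw HTML
    contents = {}             # parsed words, input for full-text indexing
    link_graph = {}           # URL -> outgoing links, input for ranking
    while frontier:
        url = frontier.popleft()
        if url in page_db or url not in FAKE_WEB:
            continue
        page_db[url] = FAKE_WEB[url]      # page retrieval and storage
        parser = LinkAndTextParser()
        parser.feed(page_db[url])         # page content parsing
        parser.close()
        contents[url] = parser.text
        link_graph[url] = parser.links
        frontier.extend(parser.links)
    return page_db, contents, link_graph
```

Calling `crawl("a")` visits all three pages and returns the stored HTML, the parsed words, and the link graph as three separate structures, one per subsystem.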
Open Source Software
• Apache Hadoop
  • MapReduce
  • HDFS
  • HBase
• Apache Lucene
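As an illustration of the MapReduce programming model, the sketch below counts inbound links per page, a typical input for link-based ranking. It runs in a single Python process and only mimics the map, shuffle, and reduce phases; it does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_links(page_url, outgoing_links):
    """Map step: emit (target, 1) for every outgoing link of one page."""
    for target in outgoing_links:
        yield target, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_inbound_links(link_graph):
    pairs = [
        pair
        for url, links in link_graph.items()
        for pair in map_links(url, links)
    ]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls would run in parallel on the nodes that hold the input, and the shuffle would move the grouped values across the network to the reducers.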
HDFS
Hadoop Distributed File System
HDFS – Assumptions and goals
● Hardware failure
● Big data
● Write once / read many
● Moving computation, not data
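The toy model below shows how these goals interact: a file is split into blocks, each block is copied to several nodes so that one failure loses nothing, and a task is sent to a node that already holds the block it needs. It is a deliberately simplified sketch. The round-robin placement and every function name are assumptions made for illustration; actual HDFS placement is decided by the NameNode and takes rack topology into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Toy round-robin placement: block b goes to `replication` distinct nodes.

    Assumes replication <= len(nodes).
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


def node_for_task(placement, block, failed_node=None):
    """Move computation to the data: pick a live node that stores the block."""
    return next(n for n in placement[block] if n != failed_node)
```

With three nodes and a replication factor of 2, any single node can fail and every block remains readable from another node.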
Lucene
Lucene - Inverted indexing

Term     Doc Id   Weight
JUG      301      0.97
         198      0.65
         120      0.43
Lugano   301      0.94
         278      0.15
         451      0.87
         103      0.45
         763      0.77
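A structure of this shape can be built in a few lines. The sketch below is plain Python, not Lucene code, and the weight it computes (term count divided by document length) is an assumption chosen for simplicity; the slide does not say how its weights were derived.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}.

    Returns {term: [(doc_id, weight), ...]} with the heaviest postings first.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for term, count in Counter(tokens).items():
            # Assumed weight: relative frequency of the term in the document.
            index[term].append((doc_id, count / len(tokens)))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)
```

The point of the inversion is that a query for one term reads a single posting list instead of scanning every document.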
Lucene - Indexing main classes
IndexWriter, Directory, Analyzer, Document, Field
Lucene - Searching main classes
IndexSearcher, Collector, Query, TopDocs, ScoreDoc
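Conceptually, a search walks the posting lists of the query terms, accumulates a score per document, and keeps only the best hits, which is roughly the role TopDocs and ScoreDoc play. The sketch below is plain Python over the postings from the inverted-index slide. Summing the weights is an assumed scoring rule used for illustration, not Lucene's scoring formula, and `search` is a hypothetical name.

```python
import heapq
from collections import defaultdict

# Postings copied from the inverted-index slide: term -> {doc_id: weight}
INDEX = {
    "jug": {301: 0.97, 198: 0.65, 120: 0.43},
    "lugano": {301: 0.94, 278: 0.15, 451: 0.87, 103: 0.45, 763: 0.77},
}


def search(query, n):
    """Return the n best (doc_id, score) pairs for a whitespace-separated query."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in INDEX.get(term, {}).items():
            scores[doc_id] += weight  # assumed rule: sum of matching weights
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])
```

For the query `JUG Lugano`, document 301 ranks first because it is the only document that appears in both posting lists.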
Lucene - Analyzers
Stop words: "the book is on the table" → [book, table]
Stemming: [paint, paints, painted, …] → paint
Synonyms: [cat, feline] → cat
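The three steps can be chained into a small analysis pipeline. The sketch below is plain Python and far cruder than Lucene's analyzers: the stop-word set, the two-suffix stemmer, and the synonym map are all assumptions sized to reproduce the examples above.

```python
# Illustrative data, chosen to match the slide examples.
STOP_WORDS = {"the", "is", "on", "a", "an", "of"}
SYNONYMS = {"feline": "cat"}
SUFFIXES = ("ed", "s")  # deliberately crude; real stemmers use many more rules


def stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, drop stop words, stem, then map synonyms to one term."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]
```

The same pipeline has to run on both documents and queries, otherwise a query for `painted` would never meet the indexed term `paint`.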
Lucene - Search options
Fields: Title: JUG, body: "JUG Lugano"
Wildcards: J?G → [JUG, JAG, ...], J*G → [JUG, JEEG, JUNG, …]
Fuzzy (dictionary-based): JUG~[n] → [MUG, JAG, …]
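Both wildcard and fuzzy queries are expanded against the terms already present in the index. The sketch below imitates that expansion in plain Python over a small hypothetical vocabulary. It uses `fnmatch` for `?` and `*`, and plain Levenshtein distance as an assumed similarity measure; Lucene's own matching and distance computation differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary of the index.
VOCABULARY = ["JUG", "JAG", "MUG", "JEEG", "JUNG", "JAVA"]


def wildcard(pattern, vocabulary):
    """'?' matches exactly one character, '*' matches zero or more."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy(term, vocabulary, max_edits=1):
    """All dictionary terms within max_edits of the query term."""
    return [t for t in vocabulary if edit_distance(term, t) <= max_edits]
```

Here `wildcard("J?G", VOCABULARY)` keeps only the three-letter matches, while `fuzzy("JUG", VOCABULARY)` also accepts `MUG` and `JUNG`, each one edit away.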
Lucene - Search options
Range: Year: [2002 TO 2012], Name: {Alberto TO Andrea}
Boost: JUG^5 Lugano, "JUG Lugano"^5
Proximity: "JUG Lugano"~5
Boolean and existence: AND, OR, NOT, (), +, -
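Two of these options map directly onto simple operations: Boolean operators combine the sets of matching documents, and the bracket style of a range decides whether its endpoints are included. The sketch below shows both in plain Python; the `POSTINGS` data and the function names are hypothetical, and boost and proximity are left out because they affect scoring and positions rather than set membership.

```python
# Hypothetical postings: term -> set of matching document ids.
POSTINGS = {
    "jug": {1, 2, 3},
    "lugano": {2, 3, 4},
    "zurich": {3, 5},
}


def docs(term):
    return POSTINGS.get(term, set())


jug_and_lugano = docs("jug") & docs("lugano")  # jug AND lugano
jug_or_lugano = docs("jug") | docs("lugano")   # jug OR lugano
jug_not_zurich = docs("jug") - docs("zurich")  # jug NOT zurich


def in_range(value, low, high, inclusive):
    """[low TO high] includes both endpoints, {low TO high} excludes them."""
    if inclusive:
        return low <= value <= high
    return low < value < high
```

With these definitions, `Year: [2002 TO 2012]` accepts 2002 itself, whereas `Name: {Alberto TO Andrea}` rejects `Alberto` but accepts a name such as `Alessio` that sorts strictly between the two bounds.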
HDFS - Lucene Integration
File copy from/to HDFS
Patch IndexWriter/Directory
Rewrite of IndexWriter on RAM
Lucene 4
And now...
Hands on!