IXE: Ideare Indexing Engine
Ideare SpA
www.ideare.com
Keep it as simple as possible but not simpler.
Albert Einstein
History
6/96 IOL-University cooperation
10/96 Arianna: first SE for Italian Web
1/98 EUROsearch project
10/98 WebNet: Categorization by Context
10/98 Automated Arianna catalogue
11/98 WWW8 paper on Categorization
1/99 Ideare spin-off
3/00 Tiscali purchases 60% of Ideare
6/01 First release of IXE
01-02 Large scale deployment
Goals
Specialized tool (indexing and search)
C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain
Approach
Quick and clean
Significant effort in designing best abstractions
Refined through extensive usage
Fundamental Ideas
Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk same as used by algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search
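The "structure on disk same as used by algorithms" idea can be illustrated with a minimal sketch, assuming a POSIX system. The `Posting` record, `writePostings` and `MappedPostings` are hypothetical names invented for this illustration and are not part of IXE; the point is only that a pointer-free, fixed-layout record can be read straight out of a memory-mapped file without any parsing step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Hypothetical fixed-layout record: no pointers, so the bytes on disk
// are exactly the bytes the search code reads.
struct Posting {
    std::uint32_t doc;
    std::uint32_t position;
};

// Write the records as one contiguous array.
inline bool writePostings(const char* path, const std::vector<Posting>& postings) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const std::size_t n =
        std::fwrite(postings.data(), sizeof(Posting), postings.size(), f);
    return std::fclose(f) == 0 && n == postings.size();
}

// Read-only view of the file as an array of Posting (POSIX mmap).
class MappedPostings {
public:
    explicit MappedPostings(const char* path) : data_(nullptr), bytes_(0) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const Posting*>(p);
                bytes_ = len;
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedPostings() {
        if (data_) ::munmap(const_cast<Posting*>(data_), bytes_);
    }
    MappedPostings(const MappedPostings&) = delete;
    MappedPostings& operator=(const MappedPostings&) = delete;

    std::size_t size() const { return bytes_ / sizeof(Posting); }
    const Posting& operator[](std::size_t i) const {
        assert(i < size());
        return data_[i];
    }

private:
    const Posting* data_;
    std::size_t bytes_;
};
```

With this layout the operating system's page cache decides which parts of the index stay in memory, which is one plausible reading of "rely on hardware caching and mmap".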
Indexing
Posting lists are created in memory
– Provide as much memory as possible to indexing machines
When size of lists reaches a threshold, dump partial index to disk
Perform final merging of partial indexes
Merging operation used also for:
– Incremental indexing
– Distributed indexing
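The build, dump and merge cycle above can be sketched as follows. This is an illustration under stated assumptions, not IXE code: `InMemoryIndexer`, `PartialIndex` and the posting-count threshold are hypothetical, and the "dump to disk" step is replaced by setting the partial index aside in memory so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: a posting list is a sorted vector of document IDs,
// and a partial index maps each term to its posting list.
using DocId = std::uint32_t;
using PartialIndex = std::map<std::string, std::vector<DocId>>;

class InMemoryIndexer {
public:
    explicit InMemoryIndexer(std::size_t threshold)
        : threshold_(threshold), postings_(0) {}

    // Record that `term` occurs in `doc`.
    // Assumption: documents arrive in increasing ID order.
    void add(const std::string& term, DocId doc) {
        std::vector<DocId>& list = current_[term];
        if (list.empty() || list.back() != doc) {
            list.push_back(doc);
            ++postings_;
        }
        if (postings_ >= threshold_) flush();
    }

    // Stand-in for "dump partial index to disk": the run is only
    // moved to a list of finished partial indexes.
    void flush() {
        if (current_.empty()) return;
        runs_.push_back(std::move(current_));
        current_.clear();
        postings_ = 0;
    }

    // Final merge: runs were produced in document order, so appending
    // each run's list keeps every merged posting list sorted.
    PartialIndex merge() {
        flush();
        PartialIndex merged;
        for (const PartialIndex& run : runs_) {
            for (const auto& entry : run) {
                std::vector<DocId>& out = merged[entry.first];
                for (DocId d : entry.second)
                    if (out.empty() || out.back() < d) out.push_back(d);
            }
        }
        return merged;
    }

    std::size_t runCount() const { return runs_.size(); }

private:
    std::size_t threshold_;
    std::size_t postings_;
    PartialIndex current_;
    std::vector<PartialIndex> runs_;
};
```

The same `merge` step would serve incremental indexing (merge a new run into an existing index) and distributed indexing (merge runs produced on different machines), provided document IDs are assigned consistently across runs.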
Colors
Generalization of Google hits properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to rank of document
and selective queries
– E.g. text matches author = attardi
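A minimal sketch of how colors could serve both purposes, ranking and selective queries. The `Color` values, the `Hit` layout and the numeric weights are assumptions made up for this illustration; the slides do not specify how IXE encodes colors or which weights it uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical color set; the slides mention title, author and anchor as examples.
enum class Color : std::uint8_t { Body = 0, Title = 1, Author = 2, Anchor = 3 };

// One occurrence of a word in a document, tagged with the zone it came from.
struct Hit {
    std::uint32_t position;
    Color color;
};

// Illustrative weights only: title words count more than body words.
inline int colorWeight(Color c) {
    switch (c) {
        case Color::Title:  return 4;
        case Color::Anchor: return 3;
        case Color::Author: return 2;
        case Color::Body:   return 1;
    }
    return 1;
}

// Ranking use: sum the weights of all hits of a word in a document.
inline int colorScore(const std::vector<Hit>& hits) {
    int total = 0;
    for (const Hit& h : hits) total += colorWeight(h.color);
    return total;
}

// Selective-query use: does the word occur in the requested color?
inline bool matchesInColor(const std::vector<Hit>& hits, Color wanted) {
    return std::any_of(hits.begin(), hits.end(),
                       [wanted](const Hit& h) { return h.color == wanted; });
}
```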
Early 2001
IXE released
Ideare starts deployment
– June: Italian Web (50 million documents) served by 3 PCs with IXE
– Fall: expanded to Germany, France, Switzerland
– Fall: Video, Image, Shopping search on IXE
October: evaluation and negotiations with major German search portal
Overall
IXE runs on PCs (better than Solaris or Alpha)
Fully self-contained library
Its own multithreaded server
Distributed crawler
Distributed indexing and merge
Parallel search
Web Service architecture
.NET managed code interface
Features
Full text + phrase + proximity
Boolean queries
Colors: HTML, XML tags
Multiple collections
Incremental indexing
Scalability:
– TeraByte collections
– Distributed multithreaded servers
Features (2)
Pluggable Document Readers: Office, PDF
Compressed document cache
Document snippets with highlights
Programmable query syntax
Clustering of results (prototype)
Technology
C++ OO architecture
Fast indexing
– Sort-based inversion
Fast search
– Efficient algorithms and data structures
– Query Compiler
  • Small Adaptive Set Intersection
– Suffix array with supra index
– Memory mapped index files
Programmable API library
Template metaprogramming
Full Object Data Base
Architecture
[Diagram: Gatherer, Table<DocInfo> (Berkeley DB), Indexer, Lexicon, Postings, Hit Lists, DocStore; mmap and local cache; DocInfo records with fields name, time, size, extended in derived records to name, time, size, title, summary, type]
Architecture
[Diagram: Gatherers (.html, .doc, .pdf, .ps, .txt), Indexers, Index, Posting, DocStore, Multithread Query]
Storing Objects in Relational Tables
SQL
create table video (
    name varchar(256),
    caption varchar(2048),
    format INT,
    PRIMARY KEY(name))
Template Metaprogramming
class Video : public DocInfo {
    char* name;
    char* caption;
    int format;
    META(Video, (SUPERCLASS(DocInfo),
                 VARKEY(name, 256),
                 VARFIELD(caption, 2048),
                 FIELD(format)));
};
Programming Applications (C++)
Collection<Video> videos("CNN");
videos.insert(video1);
Query q("caption MATCHES Jordan and format=wav");
Cursor<Video> cursor(videos, q);
while (cursor.hasNext())
    cout << cursor.get();
Small Adaptive Set Intersection
Query compiler
– One cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns first result r >= min
Single operator for all kinds of queries: e.g. proximity
SASI example
Query: world wide web
Posting lists, one per query word:
3, 9, 12, 20, 40, 47
1, 8, 10, 40, 41
2, 4, 6, 21, 40
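The cursor interface described above can be sketched on these three posting lists. This is a simplified illustration, not the IXE implementation: the class names follow the slide, but `DocId`, `kEnd` and `intersectAll` are hypothetical, results are plain document IDs rather than a `Result` object, and the conjunction uses binary search with round-robin probing instead of the galloping search and list ordering usually associated with Small Adaptive intersection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

using DocId = std::uint32_t;
const DocId kEnd = std::numeric_limits<DocId>::max();  // sentinel: no more results

struct QueryCursor {
    virtual ~QueryCursor() {}
    // Returns the first result r >= min, or kEnd when the cursor is exhausted.
    virtual DocId next(DocId min) = 0;
};

// Leaf node: a cursor over the posting list of a single word.
class CursorWord : public QueryCursor {
public:
    explicit CursorWord(std::vector<DocId> postings)
        : postings_(std::move(postings)), pos_(0) {}

    DocId next(DocId min) override {
        // Search only the tail that earlier calls have not skipped.
        auto it = std::lower_bound(postings_.begin() + pos_, postings_.end(), min);
        pos_ = static_cast<std::size_t>(it - postings_.begin());
        return it == postings_.end() ? kEnd : *it;
    }

private:
    std::vector<DocId> postings_;
    std::size_t pos_;
};

// Conjunction: each child is asked for its first result >= the current
// candidate until all children agree on the same value.
class CursorAnd : public QueryCursor {
public:
    explicit CursorAnd(std::vector<std::unique_ptr<QueryCursor>> children)
        : children_(std::move(children)) {}

    DocId next(DocId min) override {
        if (children_.empty()) return kEnd;
        DocId candidate = min;
        std::size_t agreed = 0;  // consecutive children that returned candidate
        std::size_t i = 0;
        while (agreed < children_.size()) {
            const DocId r = children_[i]->next(candidate);
            if (r == kEnd) return kEnd;
            if (r == candidate) {
                ++agreed;
            } else {
                candidate = r;  // jump ahead, skipping everything in between
                agreed = 1;
            }
            i = (i + 1) % children_.size();
        }
        return candidate;
    }

private:
    std::vector<std::unique_ptr<QueryCursor>> children_;
};

// Convenience driver: intersect several posting lists through the cursors.
inline std::vector<DocId> intersectAll(const std::vector<std::vector<DocId>>& lists) {
    std::vector<std::unique_ptr<QueryCursor>> children;
    for (const std::vector<DocId>& l : lists)
        children.emplace_back(new CursorWord(l));
    CursorAnd conj(std::move(children));
    std::vector<DocId> out;
    for (DocId r = conj.next(0); r != kEnd; r = conj.next(r + 1)) out.push_back(r);
    return out;
}
```

On the example lists the candidate moves 3, 8, 21, 40 and then all three cursors agree on 40, so most postings are never compared against each other. Other operators (or, phrase, proximity) could be written as further `QueryCursor` subclasses exposing the same `next(min)` contract.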
Performance
Comparison (single node)
System | Indexing Time | Search Speed ¹ | Programmability; Excerpts; Proximity Rank; Ranking
IXE | 2 GB/h | 30 q/s | C++ API; Link popularity
Fulcrum | 0.7 GB/h | 6 q/s | C API; no; no; no
Google | ? | 1 q/s | C, python; PageRank
Fast | 1-2 GB/h | 3 q/s | C; planned; FirstPage
SharePoint | 0.2 GB/h | 3 q/s | C++; no; ?; ?
Verity | 0.2 GB/h | 4 q/s | no; ?; ?
¹ 2 million documents
Comparison (2)
System | Paragraph indexing; Color Search; Column Search | Max doc size | O.S.
IXE | | no limit | Linux, Windows, Alpha, Solaris
Fulcrum | no; limited | 64 K | Windows, Linux, Solaris
Google | no; ?; limited | 4 K | Linux
Fast | no; ?; ? | ? | NetBSD
SharePoint | no; no; no | ? | Windows
An independent benchmark
[Bar chart: AltaVista vs. IXE on Indexing (Intel) and Retrieval (Intel); vertical axis from 0 to 250]
Independent evaluations
Major portal, Germany
Major portal, France
Major portal, Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond
IXE in use
Janas
– 150 million documents
– 50 million documents per server:
  • Pentium III, 1 GHz, 2 GB RAM, 2x75 GB IDE
– Italy: 3 PCs, 300 K queries/day
KataWeb
– largest Italian Web portal
– 4 GB documents
– 2nd largest Italian newspaper
Other Features
Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/Group by similarity
Conceptual Clustering
Snippets
Adaptive algorithm:
– Compiled regular expression search for few words
– Karp-Rabin algorithm for several words
Customizable on length of snippets, proximity of hits, etc.
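For the several-words case, a Karp-Rabin search can be sketched as below. This is a generic textbook-style version written for illustration, not IXE's snippet code: `findWords`, `Match`, the hash base and the modulus are all assumptions. Query words are grouped by length so that one rolling hash per length is enough, and every hash match is verified by a string comparison to rule out collisions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Match {
    std::size_t offset;  // position of the word in the text
    std::string word;
};

static const std::uint64_t kBase = 257;
static const std::uint64_t kMod = 1000000007ULL;

// Polynomial hash of s[start, start + len).
inline std::uint64_t hashOf(const std::string& s, std::size_t start, std::size_t len) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i)
        h = (h * kBase + static_cast<unsigned char>(s[start + i])) % kMod;
    return h;
}

// Finds all occurrences of any of `words` in `text`.
// Matches are reported per word length, shortest words first.
inline std::vector<Match> findWords(const std::string& text,
                                    const std::vector<std::string>& words) {
    // Group the pattern hashes by word length.
    std::map<std::size_t, std::unordered_multimap<std::uint64_t, std::string>> byLength;
    for (const std::string& w : words)
        if (!w.empty()) byLength[w.size()].emplace(hashOf(w, 0, w.size()), w);

    std::vector<Match> matches;
    for (const auto& group : byLength) {
        const std::size_t len = group.first;
        if (len > text.size()) continue;

        // high = kBase^(len-1) mod kMod, used to remove the leading character.
        std::uint64_t high = 1;
        for (std::size_t i = 1; i < len; ++i) high = (high * kBase) % kMod;

        std::uint64_t h = hashOf(text, 0, len);
        for (std::size_t pos = 0;; ++pos) {
            auto range = group.second.equal_range(h);
            for (auto it = range.first; it != range.second; ++it)
                if (text.compare(pos, len, it->second) == 0)  // verify the candidate
                    matches.push_back({pos, it->second});
            if (pos + len >= text.size()) break;
            // Roll the window one character to the right.
            const std::uint64_t lead =
                (static_cast<unsigned char>(text[pos]) * high) % kMod;
            h = (h + kMod - lead) % kMod;
            h = (h * kBase + static_cast<unsigned char>(text[pos + len])) % kMod;
        }
    }
    return matches;
}
```

The cost per text position is one hash update plus one table lookup regardless of how many words share that length, which is why the approach pays off as the number of query words grows.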
Programmable Query Syntax
Typical Search Options
– By document type (e.g. HTML, PDF, DOC)
– By color (e.g. title, author)
– Within site or domain (through prefix search on URL)
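The site or domain restriction through prefix search can be sketched over a sorted list of URL keys. Two assumptions are made here and are not taken from the slides: keys are kept in a sorted `std::vector`, and host names are stored reversed (e.g. `com.example.www/`) so that a whole domain shares one prefix. `withPrefix` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns every key in `sortedKeys` that starts with `prefix`.
// Precondition: `sortedKeys` is sorted in ascending order.
inline std::vector<std::string> withPrefix(const std::vector<std::string>& sortedKeys,
                                           const std::string& prefix) {
    std::vector<std::string> out;
    // All keys sharing the prefix form one contiguous run starting here.
    auto it = std::lower_bound(sortedKeys.begin(), sortedKeys.end(), prefix);
    for (; it != sortedKeys.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}
```

With reversed host names, `withPrefix(keys, "com.example.www/")` restricts to one site and `withPrefix(keys, "com.example.")` to the whole domain.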
Result Ranking
Based on combination of measures
– Classical IR
– Authoritativeness
– Link popularity
– Prioritized collections
Clients can provide their own criteria
– Pay for placement
– Adult filter
– Freshness, etc.
Ranking Measures
IR rank
– Based on frequencies (tf, idf)
– cosine, Okapi (Robertson), Amati
Best TREC-10 score: 0.22% relevance
IXE uses simplified cosine with additional scoring factors:
– Colors (presence in title, heading, etc.)
– Proximity for multiple words
– Capitalization/font possible (Google)
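A frequency-based score with one additional factor can be sketched as follows. The exact formula IXE uses is not given in the slides, so everything here is an assumption for illustration: the weight of a term is tf times log(N/df), a title occurrence multiplies it by a hypothetical `titleBoost`, and the cosine normalisation is approximated by dividing by the square root of the document length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Statistics for one query term in one document (hypothetical layout).
struct TermMatch {
    unsigned tf;          // occurrences of the term in the document
    std::size_t docFreq;  // number of documents containing the term
    bool inTitle;         // the term occurs in the title color
};

// Inverse document frequency: rare terms weigh more.
inline double idf(std::size_t numDocs, std::size_t docFreq) {
    assert(docFreq > 0 && docFreq <= numDocs);
    return std::log(static_cast<double>(numDocs) / static_cast<double>(docFreq));
}

// Simplified cosine-style score with a color factor.
inline double simplifiedCosine(const std::vector<TermMatch>& terms,
                               std::size_t numDocs, std::size_t docLength,
                               double titleBoost = 2.0) {
    assert(docLength > 0);
    double sum = 0.0;
    for (const TermMatch& t : terms) {
        double w = t.tf * idf(numDocs, t.docFreq);
        if (t.inTitle) w *= titleBoost;  // color factor
        sum += w;
    }
    // Length normalisation standing in for the full document vector norm.
    return sum / std::sqrt(static_cast<double>(docLength));
}
```

A proximity factor for multi-word queries would be applied in the same multiplicative way; it is left out to keep the sketch short.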
Authoritative score
Link popularity
– Based on incoming link count
Reference from authoritative site (e.g. Dmoz)
– Increase document rank
– Descriptions from Dmoz are added to document with special color
Citations (i.e. text surrounding link)
– Added to document with special color
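The incoming link count can be computed from the (source, target) pairs collected during crawling. The sketch below is an illustration only: `Link` and `incomingLinkCounts` are hypothetical names, and counting each source page once and ignoring self-links are assumptions the slides do not state.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Link = std::pair<std::string, std::string>;  // (source URL, target URL)

// Number of distinct pages linking to each target; self-links are skipped.
inline std::map<std::string, std::size_t>
incomingLinkCounts(const std::vector<Link>& links) {
    std::map<std::string, std::set<std::string>> sources;
    for (const Link& l : links)
        if (l.first != l.second) sources[l.second].insert(l.first);

    std::map<std::string, std::size_t> counts;
    for (const auto& entry : sources) counts[entry.first] = entry.second.size();
    return counts;
}
```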
Priority rank
Documents are arranged in several collections
Collections are searched in order
Earlier collections contain higher rank documents
Tunable cutoff at 4000 documents
Statistical estimate of overall number of results
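Searching collections in priority order with a cutoff can be sketched as below. The types are hypothetical, and the estimate shown is a plain proportional extrapolation chosen for illustration; the slides do not say which statistical estimate IXE uses. Each `Collection` carries its size and, as a stand-in for running the query, the matches that query would return in it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using DocId = std::uint32_t;

struct Collection {
    std::size_t size;            // number of documents in the collection
    std::vector<DocId> matches;  // stand-in for the query's matches in it
};

struct SearchOutcome {
    std::vector<DocId> results;  // at most `cutoff` documents
    std::size_t estimatedTotal;  // extrapolated number of matches overall
};

// Collections are ordered from highest to lowest priority.
inline SearchOutcome searchPrioritized(const std::vector<Collection>& collections,
                                       std::size_t cutoff) {
    SearchOutcome out;
    out.estimatedTotal = 0;

    std::size_t total = 0;
    for (const Collection& c : collections) total += c.size;

    std::size_t scanned = 0;  // documents in the collections actually searched
    std::size_t found = 0;    // matches seen in those collections
    for (const Collection& c : collections) {
        if (out.results.size() >= cutoff) break;  // later collections are skipped
        scanned += c.size;
        found += c.matches.size();
        for (DocId d : c.matches) {
            if (out.results.size() >= cutoff) break;
            out.results.push_back(d);
        }
    }
    // Assume unsearched collections match at the same rate as searched ones.
    if (scanned > 0) out.estimatedTotal = found * total / scanned;
    return out;
}
```

Because higher-rank documents sit in earlier collections, stopping at the cutoff loses only results that would have ranked low anyway.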
Custom rank
IR rank is computed from data in lexicon (word based)
Cosine, authoritativeness, custom rank are document related
Accessing document data during search is a drag in performance
Solution: associate direct access info (mmapped)
Nested Objects
class WebInfo : public DocInfo {
    CompressedText<65535> text;
    RankWeight weights;
    META(WebInfo,
         (SUPERCLASS(DocInfo),
          FIELD(text),
          KEY(weights, mapped)));
};
Custom Rank Nested Object
struct RankWeight
{
    int importance;
    int popularity;
    int freshness;
    int adult;
    …
};
Scalability
Distributed Indexing
– Performed on spidering machines
– Merged indexes
Server farm of cheap PCs
– 1.2 GHz Athlon or Pentium
– 2 GB RAM
– 2 x 75 GB disks
12 h indexing cycle for 50 million documents on 8 PCs
Query processing
Query broker
– Dispatches query
– Merge sort of results
– Maintains cache of results
IFL (Local Inverted File Partition)
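The three broker duties listed above can be sketched as follows. This is an illustration under assumptions, not the IXE broker: `Result`, `Server`, `mergeResults` and `QueryBroker` are hypothetical, each partition server is modelled as a callable returning results sorted by descending score, dispatch is sequential rather than parallel, and the cache is unbounded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Result {
    double score;
    std::uint32_t doc;
};

// A partition server: takes a query, returns results sorted by descending score.
using Server = std::function<std::vector<Result>(const std::string&)>;

// K-way merge of the sorted partial lists, keeping the best `limit` results.
inline std::vector<Result> mergeResults(const std::vector<std::vector<Result>>& partial,
                                        std::size_t limit) {
    struct Head { double score; std::size_t list; std::size_t pos; };
    struct LowerScore {
        bool operator()(const Head& a, const Head& b) const { return a.score < b.score; }
    };
    std::priority_queue<Head, std::vector<Head>, LowerScore> heap;
    for (std::size_t i = 0; i < partial.size(); ++i)
        if (!partial[i].empty()) heap.push(Head{partial[i][0].score, i, 0});

    std::vector<Result> out;
    while (!heap.empty() && out.size() < limit) {
        const Head h = heap.top();
        heap.pop();
        out.push_back(partial[h.list][h.pos]);
        if (h.pos + 1 < partial[h.list].size())
            heap.push(Head{partial[h.list][h.pos + 1].score, h.list, h.pos + 1});
    }
    return out;
}

class QueryBroker {
public:
    QueryBroker(std::vector<Server> servers, std::size_t limit)
        : servers_(std::move(servers)), limit_(limit) {}

    const std::vector<Result>& search(const std::string& query) {
        auto hit = cache_.find(query);
        if (hit != cache_.end()) return hit->second;  // served from the cache

        std::vector<std::vector<Result>> partial;
        for (const Server& s : servers_) partial.push_back(s(query));  // dispatch
        return cache_.emplace(query, mergeResults(partial, limit_)).first->second;
    }

private:
    std::vector<Server> servers_;
    std::size_t limit_;
    std::unordered_map<std::string, std::vector<Result>> cache_;
};
```

Since every partition already returns its results in rank order, the broker only needs a merge step rather than a full sort of the combined list.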
Distributed Crawler
High performance
– ~120 pages/sec on single node
Scalable
Fault tolerant
Collects data for link popularity, citations
Handles several document formats
Crawler Architecture
[Diagram: Crawler, Scheduler, Retrievers, Parser, Cache, CrawlInfo, select(), Table<UrlInfo>, Citations, Hosts, Robots, Host queues]
Web Service Support
C# integration
Managed code indexer DLL
Managed objects for controlling indexing:
– CollectionInfo
– Gatherer
– Gathered
WebForm GUI
GUI Architecture
[Diagram: GUIControl, CollectionInfo, Gatherer, table<Gathered>]
Collection Builder
[Diagram: *.coll, Serialize, copy, cache, Gathered records (URL, ID, time, size, MD5, lastSeen), Converter, WebInfo records (name, time, size), CollectionEnumerator, table<WebInfo>, WebIndexer, UnManaged]
Web Search Service
High performance search engine library
C++ template library
Handles terabytes of data
Available as Web Service
IndeXing Engine