Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram Sankar, LinkedIn
LinkedIn’s Vision “Create economic opportunity for every member of the global workforce”
• Find work
• Realize your dream job
• Be great at what you do
Comparison of different Search Engines
• Netflix: 100K (Lucene)
• AirBnB: 800K (Lucene)
• Ebay: 500M (Custom C++)
• Bing: 100’s of Billions (Custom C++)
• Google: 100’s of Billions (Custom C++)
• Facebook: Trillions (Custom C++)
LinkedIn: 100’s of Millions
Galene (Lucene based)
Lucene
Galene (Custom)
Important Galene Features
• Offline index building
• Live updates at a fine granularity
• Static rank and early termination
• Faceting
• Data distribution
• Relevance framework
A little about LinkedIn data
• Most datasets at LinkedIn are available in 2 ways
  • A real-time change notification stream
  • A complete dataset, ETL’d to Hadoop
• We often rely on derived datasets
• Many derived datasets can’t be crunched in real time
Anatomy of a Galene index
• Base Index
  • Generated by Hadoop periodically
  • Single-segment Lucene index
  • On disk. Immutable. MMAPed and MLOCKed
  • Contains complex / rich features that we can only afford to compute offline
• Live Index
  • Inverted index with our own format
  • In-memory data structure
  • Contains incremental updates to documents
• Snapshot Index
  • On-disk snapshot of the Live index, taken when necessary
  • Initially empty
  • Single-segment Lucene index. The Live index is folded in regularly
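The three layers above can be pictured with a toy model. This is only a sketch under stated assumptions: the class and method names are hypothetical, plain dicts stand in for the real index structures, and it assumes that newer layers (Live over Snapshot over Base) override older ones field by field, which the slides do not spell out.

```python
class LayeredIndex:
    """Toy model of a Base / Snapshot / Live index (hypothetical names)."""

    def __init__(self, base):
        self.base = dict(base)   # built offline, treated as immutable
        self.snapshot = {}       # on-disk snapshot, initially empty
        self.live = {}           # in-memory incremental updates

    def update(self, uid, fields):
        # Incremental updates only ever touch the live layer.
        self.live.setdefault(uid, {}).update(fields)

    def fold_live_into_snapshot(self):
        # Fold the live layer into the snapshot, then start a fresh live layer.
        for uid, fields in self.live.items():
            self.snapshot.setdefault(uid, {}).update(fields)
        self.live = {}

    def get(self, uid):
        # Assumption: newer layers override older ones field by field.
        doc = {}
        for layer in (self.base, self.snapshot, self.live):
            doc.update(layer.get(uid, {}))
        return doc


index = LayeredIndex({1: {"name": "Jeff"}})
index.update(1, {"headline": "engineer"})
before = index.get(1)
index.fold_live_into_snapshot()
after = index.get(1)
```

In this sketch a reader sees the same document before and after the fold; only the layer that holds the update changes.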
[Figure: two sample documents, numbered 1 and 2, whose text includes the terms Jeff, Reid and LinkedIn, shown alongside an Inverted Index (with Posting Lists) and a Forward Index]
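The difference between the two structures can be sketched in a few lines. The example reuses the three terms from the slide; which document contains which term is an assumption made for illustration, and the variable names are hypothetical.

```python
from collections import defaultdict

# Assumed toy corpus: document id -> terms found in that document.
docs = {
    1: ["jeff", "linkedin"],
    2: ["reid", "linkedin"],
}

# Forward index: document id -> the terms (or other fields) of that document.
forward_index = {doc_id: list(terms) for doc_id, terms in docs.items()}

# Inverted index: term -> posting list of the document ids that contain it.
inverted_index = defaultdict(list)
for doc_id in sorted(docs):
    for term in docs[doc_id]:
        inverted_index[term].append(doc_id)

linkedin_postings = inverted_index["linkedin"]
```

The inverted index answers "which documents contain this term", while the forward index answers "what does this document contain", which is what scoring typically needs once a document has been retrieved.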
Inverted Index: Three Segments
Three independent segments with non-overlapping UIDs:
• B1S1L1 (Base/Snapshot/Live) segment
  • Base has all the UIDs.
  • Neither Snapshot nor Live introduces new UIDs.
• S2L2 (Snapshot/Live) segment
  • None of its UIDs exist in BSL.
  • Snapshot has all the UIDs.
  • Live does not introduce any new UIDs.
• L3 (Live) segment
  • None of its UIDs exist in BSL or SL.
Search: Static Rank (SR)
• A global score of a document
  • Each document must have one and only one SR
  • It can be anything that globally represents the importance of a UID, for example, the number of 1st degree connections
  • Different documents may have the same SR
• B1S1L1 segment
  • Base knows the SRs of all UIDs of the segment
• S2L2 segment
  • Snapshot knows the SRs of all UIDs of the segment
• L3 segment
  • We assign artificial SRs in either of two ways:
    • Ascending order starting from the max SR of all UIDs in all 3 segments
    • Descending order starting from the min SR of all UIDs in all 3 segments
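The two ways of assigning artificial SRs to live-only documents can be sketched as follows. This is an illustration, not Galene's code: the function name is hypothetical, SRs are assumed to be integers, and "starting from" is read as starting just outside the existing range.

```python
def assign_artificial_srs(existing_srs, new_uids, ascending=True):
    """Give live-only UIDs static ranks outside the range already in use."""
    if ascending:
        # Count upwards from just above the current maximum SR.
        start, step = max(existing_srs) + 1, 1
    else:
        # Count downwards from just below the current minimum SR.
        start, step = min(existing_srs) - 1, -1
    return {uid: start + i * step for i, uid in enumerate(new_uids)}


existing = [10, 40, 25]  # SRs already known to the Base and Snapshot indexes
ascending_srs = assign_artificial_srs(existing, ["u7", "u8"], ascending=True)
descending_srs = assign_artificial_srs(existing, ["u7", "u8"], ascending=False)
```

Either way the live-only documents end up entirely above or entirely below every existing SR, so the L segment sorts as a block relative to the other segments.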
Search: Early Termination (ET)
• Segment Level ET
  • The ordering of the static rank assignment in the L segment determines the order in which the segments are searched:
    • BSL -> SL -> L (if it is descending)
    • L -> SL -> BSL (if it is ascending)
• Posting List Level ET
  • Since all postings are first sorted by SR, early termination on a posting list guarantees that the documents with the highest SRs are always retrieved first (however, this does not guarantee that their final scores are also the highest).
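A minimal sketch of both levels of early termination, assuming the descending case (BSL -> SL -> L) and posting lists that are already sorted by descending SR. The function name, the `matches` predicate and the toy UIDs are hypothetical.

```python
def search(segments, matches, limit):
    """Collect up to `limit` matching UIDs, stopping as soon as we have them."""
    hits = []
    for postings in segments:      # segment order, e.g. BSL -> SL -> L
        for uid in postings:       # each list is in descending SR order
            if matches(uid):
                hits.append(uid)
                if len(hits) == limit:
                    return hits    # early termination: skip everything else
    return hits


# Toy posting lists for one term, one per segment.
segments = [["a", "b"], ["c"], ["d"]]
top_two = search(segments, matches=lambda uid: uid != "b", limit=2)
```

Here "d" is never examined. If a query-dependent score would have ranked "d" above "a" or "c", the search still misses it, which is the caveat stated above.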
Going Forward
• Very efficient custom index in C++
• Base index build can be run in a distributed manner
• BSL supported at a more fundamental level
Faceting
• Types of facets supported:
  • discoverable (e.g. current company)
  • static values (e.g. network)
  • supplied values (e.g. my groups)
• Legacy stack had no early termination, allowing for exact facet counting (at a cost)
• Current Galene stack applies heuristics to determine counts in an approximate manner
• Going forward, custom posting list format will encode facet details for more efficient facet count estimation
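The slides do not say which heuristics Galene uses to approximate facet counts. As an illustration of the general idea only, the sketch below uses plain linear extrapolation from the documents scanned before early termination; the function name and sample values are hypothetical.

```python
from collections import Counter


def estimate_facet_counts(scanned_facet_values, scanned_docs, total_docs):
    """Scale the facet counts seen so far up to the full posting list length."""
    observed = Counter(scanned_facet_values)
    scale = total_docs / scanned_docs
    return {value: round(count * scale) for value, count in observed.items()}


# Four documents were scanned before terminating, out of 100 matching postings.
estimates = estimate_facet_counts(
    ["company_a", "company_a", "company_b", "company_a"],
    scanned_docs=4,
    total_docs=100,
)
```

Because postings are sorted by static rank, the scanned prefix is not a random sample, so a real heuristic would likely need to correct for that bias.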
Relevance Framework
• Infrastructure to support common scoring needs
• Provides framework to evaluate relevance changes
• Enables rapid iterations over relevance experiments
• Allows relevance engineers to focus on building features
Life of a Query – Within A Rewriter
[Diagram labels: Query, Rewriter State, three Rewriter Modules each paired with a Data Model, Rewritten Query]
Life of a Query – Within A Search Shard
[Diagram labels: Rewritten Query, Index, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
Case Study: Instant Member Search
• The index contains connections as document terms
  (term:diego AND prefix:buth AND (connection:35176 OR connection:418001 OR connection:1520032))
• Static Rank of documents reflects popularity
• Documents are augmented offline with spell correction data
  • “shreeram sa” : (term:shreeram OR cluster:5678) AND (prefix:sa) AND (connection:1234)
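The query shapes in this case study can be reproduced with a small builder. This is a sketch, not the actual rewriter: the function name and the `clusters` mapping are hypothetical, the field names (term, prefix, connection, cluster) follow the slide's examples, and the input text is assumed to be non-empty with the last word still being typed.

```python
def build_instant_query(text, connection_ids, clusters=None):
    """Build a query string in the style of the examples above."""
    clusters = clusters or {}
    *complete, partial = text.split()   # the last word is treated as a prefix
    clauses = []
    for word in complete:
        if word in clusters:
            # Spell-correction data: also match the word's cluster id.
            clauses.append(f"(term:{word} OR cluster:{clusters[word]})")
        else:
            clauses.append(f"term:{word}")
    clauses.append(f"prefix:{partial}")
    connections = " OR ".join(f"connection:{cid}" for cid in connection_ids)
    clauses.append(f"({connections})")
    return "(" + " AND ".join(clauses) + ")"


plain_query = build_instant_query("diego buth", [35176, 418001, 1520032])
corrected_query = build_instant_query("shreeram sa", [1234], {"shreeram": 5678})
```

Restricting matches to the searcher's connections keeps the candidate set small, which suits a query that is re-run on every keystroke.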