IBM Research, India
© 2009 IBM Corporation
Highly Scalable Algorithm for Distributed Real-time Text Indexing
Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay Garg
IBM Research, India
Email: {annarang, avikas, monkedia}@in.ibm.com, [email protected]

Agenda
Background
Challenges in Scalable Indexing
In-memory Index Data Structure Design
Parallel Indexing Algorithm
 - Parallel Pipelined Indexing
Asymptotic Time Complexity Analysis
Experimental Results
 - Strong Scalability
 - Weak Scalability
 - Search Performance
Conclusions & Future Work

Background
Data-intensive supercomputing is gaining strong research momentum
 - Large-scale computations over massive and changing data sets
 - Multiple domains: telescope imagery, online transaction records, financial markets, medical records, weather prediction
Massive-throughput real-time text indexing and search
 - Massive data arriving at a high rate (~1-10 GB/s)
 - Index expected to age off at regular intervals
Architectural innovations
 - Massively parallel / many-core architectures
 - Storage-class memories with tens of terabytes of storage
These trends demand very high indexing rates and stringent search response times
Optimizations are needed to:
 - Maximize indexing throughput
 - Minimize indexing latency (time from indexing a document until it is searchable)
 - Sustain search performance

Background – Index for Text Search (e.g., Lucene)
A Lucene index covers a set of documents
 - A document is a sequence of fields
 - A field is a sequence of terms
 - A term is a text string
A Lucene index consists of one or more segments
 - Each segment covers a set of documents
 - Each segment is a fully independent index
Index updates are serialized; multiple index searches can proceed concurrently
Supports simultaneous index update and search
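As a rough C++ illustration of this hierarchy (hypothetical simplified types, not the actual CLucene classes):

    #include <string>
    #include <vector>

    // Minimal sketch of the Lucene index model described above.
    using Term = std::string;              // a term is a text string

    struct Field {                         // a field is a sequence of terms
        std::string name;
        std::vector<Term> terms;
    };

    struct Document {                      // a document is a sequence of fields
        int id;
        std::vector<Field> fields;
    };

    struct Segment {                       // a fully independent index
        std::vector<Document> docs;        // the set of documents it covers
    };

    struct Index {                         // an index: one or more segments
        std::vector<Segment> segments;
    };
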
Challenges to Scalable In-memory Distributed Real-time Indexing
Scalability issues in typical approaches
 - Merge-sort of the sorted terms in the input segments to generate the list of Terms and TermInfos for the merged segment
 - Merging and re-organization of the Document-Lists and Position-Lists of the input segments
 - Load imbalance increases with the number of processors
 - The index-merge process quickly becomes the bottleneck, causing large indexing latency
Index data structure design challenges
 - Inherent trade-offs between index size, indexing throughput, and search throughput
 - Trade-offs between indexing latency, search response time, and throughput
Performance objective
 - Maximize indexing performance while sustaining search performance (both search response time and throughput)

Scalability Issues With Typical Indexing Approaches
[Figure: two input segments, Segment(1) and Segment(2), each holding a Term (T(i)) with its TermInfo, Document-List (Doc/Freq pairs), and Position-Lists, are combined into a Merged Segment in two steps.
 Step(1): Merge-sort of terms and creation of the new TermInfo.
 Step(2): Merge of the Document-Lists and Position-Lists.]

In-memory Indexing Data Structure Design
Two-level hierarchical index data structure
 - Top-level hash table: GHT (Global Hash Table)
   - Represents the complete index for a given set of documents
   - Maps: Term => second-level hash tables (IHTs)
 - Second-level hash table: IHT (Interval Hash Table)
   - Represents the index for an interval of documents with contiguous IDs
   - Maps: Term => list of documents containing that term
   - Postings data is also stored
Advantages of the design
 - No re-organization of data is needed while merging an IHT into the GHT
 - Merge-sort is eliminated in favor of hash operations
 - Efficient encoding reduces the memory footprint of an IHT
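A minimal sketch of the two-level design, assuming simplified in-memory postings (the paper's encoded IHT layout appears on a later slide); all type and member names are illustrative:

    #include <cstdint>
    #include <deque>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Postings for one document: doc ID, term frequency, term positions.
    struct Posting {
        uint32_t docId;
        uint32_t freq;
        std::vector<uint32_t> positions;
    };

    // IHT: index over an interval of documents with contiguous IDs.
    // Maps each term to the documents (with postings) that contain it.
    using IHT = std::unordered_map<std::string, std::vector<Posting>>;

    // GHT: the complete index; maps each term to the document intervals
    // (IHTs) in which it occurs. Merging an IHT needs only hash lookups
    // and appends: no merge-sort, and the per-interval postings data is
    // never re-organized.
    struct GHT {
        std::deque<IHT> intervals;  // stored IHTs; deque keeps addresses stable
        std::unordered_map<std::string, std::vector<const IHT*>> terms;

        void merge(IHT iht) {
            intervals.push_back(std::move(iht));
            const IHT& stored = intervals.back();
            for (const auto& entry : stored)
                terms[entry.first].push_back(&stored);
        }
    };
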
Interval Hash Table (IHT): Concept
[Figure: a term Ti hashes via HF(Ti) into the hash table, with per-bucket term collision resolution; the term's entry points into the IHT data, an array of (DocID, Frequency, Positions) records for the interval's documents Di, Di+1, ..., Dj.]

Global Hash Table (GHT): Concept
[Figure: a term Ti hashes via HF(Ti) into the hash table, with per-bucket term collision resolution; the term's entry points to a document-interval indexed hash table, where each interval Dj-Dk leads to the (DocID, Frequency, Positions) records for documents Dj, Dj+1, ..., Dk.]

Encoded IHT Representation
The encoded IHT is a sequence of flat sub-arrays:

 Sub-array (what it represents)                 | Size of the sub-array
 -----------------------------------------------|--------------------------
 Number of distinct terms per hash table entry  | # Hash table entries
 Term IDs                                       | # Distinct terms in IHT
 Number of docs in which each term occurred     | # Distinct terms in IHT
 Document IDs per term                          | # Docs/term * # terms
 Term frequency in each document                | # Docs/term * # terms
 Offset into position information               | # Docs/term * # terms

Steps to access the term positions for (TermID(Ti), DocID(Dj)):
 1. Get NumTerms from TermKey(Ti)
 2. GetTermID(Ti); GetNumDocs(Ti)
 3. GetDocIDs(Ti); get the offset into the position data for (Ti, Dj)
 4. GetNumTerms(Dj): the term's frequency in Dj, i.e., how many positions to read
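A hedged sketch of this flat layout and of the position lookup; the paper's actual encoding and field widths may differ, and EncodedIHT/findPositions are illustrative names:

    #include <cstdint>
    #include <vector>

    // Hypothetical flat encoding of an IHT, mirroring the sub-array table above.
    struct EncodedIHT {
        std::vector<uint32_t> termsPerEntry; // one count per hash table entry
        std::vector<uint32_t> termIds;       // one ID per distinct term
        std::vector<uint32_t> numDocs;       // docs containing each term
        std::vector<uint32_t> docIds;        // doc IDs, concatenated per term
        std::vector<uint32_t> freqs;         // frequency per (term, doc) pair
        std::vector<uint32_t> posOffsets;    // offset into positions per pair
        std::vector<uint32_t> positions;     // concatenated position data
    };

    // Return a pointer to the positions of the term with index termIdx in
    // document docId, or nullptr if that document does not contain the term.
    const uint32_t* findPositions(const EncodedIHT& e,
                                  uint32_t termIdx, uint32_t docId) {
        // Start of this term's run in the per-(term, doc) arrays; a real
        // implementation would precompute these prefix sums.
        uint32_t start = 0;
        for (uint32_t t = 0; t < termIdx; ++t) start += e.numDocs[t];
        for (uint32_t k = 0; k < e.numDocs[termIdx]; ++k)
            if (e.docIds[start + k] == docId)
                return &e.positions[e.posOffsets[start + k]];
        return nullptr;
    }
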
New Indexing Algorithm
Main steps of the indexing algorithm (see the sketch below):
 1. A posting table (LHT) is constructed for each document, without any sorting of terms
 2. The posting tables of k documents are merged into an IHT, which is then encoded
 3. The encoded IHTs are merged into a single GHT efficiently
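A condensed sketch of steps 1 and 2, reusing the hypothetical Document/Posting/IHT types from the earlier sketches (step 3 is the GHT::merge shown before); tokenization and the flat encoding are elided:

    // Step 1: per-document posting table (LHT), built with hash inserts
    // only -- terms are never sorted.
    IHT buildLHT(const Document& doc) {
        IHT lht;
        uint32_t pos = 0;
        for (const Field& f : doc.fields)
            for (const Term& t : f.terms) {
                std::vector<Posting>& plist = lht[t];
                if (plist.empty())
                    plist.push_back({static_cast<uint32_t>(doc.id), 0, {}});
                plist.back().freq++;
                plist.back().positions.push_back(pos++);
            }
        return lht;
    }

    // Step 2: merge the LHTs of k documents into one IHT via hash
    // operations; the result would then be encoded into the flat layout.
    IHT buildIHT(const std::vector<Document>& kDocs) {
        IHT iht;
        for (const Document& d : kDocs) {
            IHT lht = buildLHT(d);
            for (auto& entry : lht) {
                std::vector<Posting>& dst = iht[entry.first];
                for (Posting& p : entry.second)
                    dst.push_back(std::move(p));
            }
        }
        return iht;
    }
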
GHT Construction from IHT
[Figure: an array of encoded IHTs is folded into the Global Hash Table. For a new encoded IHT(g): (S1) scan its distinct terms; (S2(a)) hash each term Ti or Tj via HF(Ti)/HF(Tj) to locate its GHT entry; (S2(b)) append a reference to IHT(g) to that entry, creating the entry if the term is new.]

Parallel Group-based Indexing Algorithm
[Figure: incoming documents are partitioned across multiple index groups (I0, I1, I2, I3, I4); each indexing group builds its portion of the index, while a search group serves queries against the index groups.]

Pipeline Diagram (Distributed Indexing Algorithm)
[Figure: a timeline with Producer(1), Producer(2), Producer(3) and a Consumer. In each round, the producers produce IHTs/segments and send them to the consumer, which merges the IHTs/segments; rounds are separated by barrier synchronization.]
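The slides do not name the communication layer; below is a minimal MPI-flavored sketch of one such pipeline, assuming |P| producer ranks, one consumer rank, and hypothetical produceIHT/serialize/decode helpers (reusing the IHT/GHT types from the earlier sketches):

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Hypothetical helpers (not from the paper):
    IHT produceIHT(int round);                    // build the next IHT
    std::vector<uint8_t> serialize(const IHT&);   // flat-encode for the wire
    IHT decode(const std::vector<uint8_t>&);      // rebuild an IHT from bytes

    // One produce-consume pipeline over n rounds. Ranks 0..P-1 are
    // producers; rank P is the consumer. Error handling is elided.
    void runPipeline(int rank, int P, int n, GHT& ght) {
        for (int round = 0; round < n; ++round) {
            if (rank < P) {
                // Produce and encode an IHT, then ship it to the consumer.
                std::vector<uint8_t> bytes = serialize(produceIHT(round));
                MPI_Send(bytes.data(), static_cast<int>(bytes.size()),
                         MPI_BYTE, P, round, MPI_COMM_WORLD);
            } else {
                // Receive one IHT from every producer and merge it.
                for (int j = 0; j < P; ++j) {
                    MPI_Status st;
                    int count = 0;
                    MPI_Probe(j, round, MPI_COMM_WORLD, &st);
                    MPI_Get_count(&st, MPI_BYTE, &count);
                    std::vector<uint8_t> buf(count);
                    MPI_Recv(buf.data(), count, MPI_BYTE, j, round,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    ght.merge(decode(buf));  // fold the IHT into the GHT
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);     // barrier sync between rounds
        }
    }
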
Asymptotic Time Complexity Analysis
Definitions
 - Size of the indexing group: |G| = |P| + 1
   - P: the set of producers, plus a single consumer
 - n produce-consume rounds, with |P| producers and a single consumer in each round
 - Prod(j,i): total time of the j-th producer in the i-th round
   - ProdComp(j,i): its compute time; ProdComm(j,i): its communication time
 - Cons(i): total time of the consumer in the i-th round
   - ConsComp(i): its compute time; ConsComm(i): its communication time

Distributed indexing time (see the sketch below):
 T(distributed) = X + Y + Z, where
   X = max_j ProdComp(j,1)
   Y = Σ_{2 ≤ i ≤ n} max( max_j Prod(j,i), Cons(i−1) )
   Z = Cons(n)
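To make the recurrence concrete, a small illustrative sketch that evaluates T(distributed) from measured per-round times (names and 0-based indexing are mine, not the paper's):

    #include <algorithm>
    #include <vector>

    // prodComp[i][j], prod[i][j]: compute/total time of producer j in round i.
    // cons[i]: total consumer time in round i. All indices are 0-based.
    double distributedTime(const std::vector<std::vector<double>>& prodComp,
                           const std::vector<std::vector<double>>& prod,
                           const std::vector<double>& cons) {
        size_t n = prod.size();
        // X: the slowest producer's compute time in the first round.
        double X = *std::max_element(prodComp[0].begin(), prodComp[0].end());
        // Y: in each later round, the producers overlap the consumer
        // finishing the previous round; the slower side gates the pipeline.
        double Y = 0;
        for (size_t i = 1; i < n; ++i) {
            double maxProd = *std::max_element(prod[i].begin(), prod[i].end());
            Y += std::max(maxProd, cons[i - 1]);
        }
        // Z: the consumer's final merge of the last round.
        double Z = cons[n - 1];
        return X + Y + Z;
    }
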
Asymptotic Time Complexity Analysis
The overall indexing time depends on the balance of the pipeline stages. There are two cases: the produce phase dominates the merge phase, or the merge phase dominates the produce phase.

Time complexity upper bounds:
 - Case(1): production time per round > merge time per round
   T(Pghtl) = O(R / |P|)
   T(Porgl) = O((R / |P|) * log(k))
 - Case(2): merge time per round > production time per round
   T(Pghtl) = O(R / k)
   T(Porgl) = O((R / k) * log(|P|))

Experimental Setup
Based on the original CLucene codebase (v0.9.20)
Porgl implementation
 - Distributed in-memory indexing algorithm using RAMDirectory
 - Distributed search implementation
Pghtl implementation
 - Implementation of the IHT and GHT data structures
 - Distributed indexing and search algorithm implementation
Data: IBM intranet website data
 - Text data extracted from HTML files
 - Loaded equally into the memory of the producer nodes
Experiments run on Blue Gene/L
 - Up to 16K processor nodes, with 2 PPC 440 cores per node
 - Co-processor mode: 1 compute core, 1 router core
 - High-bandwidth 3D torus interconnect
For Porgl, k is chosen such that only one segment is created from all the text data fed to a producer, giving Porgl its best indexing throughput.

Strong Scalability Comparison: Pghtl vs. Porgl
Indexing time (s) vs. index group size, at 1 GB per index group:

 Index Group Size (#Nodes):   2     4     8      16    32     64     128     256    512
 Porgl                        600   480   224    162   151    182    195.33  220    265
 Pghtl                        304   119   60.88  36.9  26.18  24.55  28.72   37.82  39.02

SpeedUp Comparison: Pghtl vs. Porgl
[Figure: speedup (y-axis 0 to 14) vs. index group size (2 to 512 nodes) at 1 GB per index group, plotted for Pghtl and Porgl.]

Weak Scalability Comparison: Pghtl vs. Porgl
Indexing time (s) vs. index group size:

 Index Group Size (#Nodes):   4      8      16     32     64     128
 Porgl                        14.98  14.05  33.79  52.56  117.9  290
 Pghtl                        6.64   7.2    8.72   12.88  21.27  37.92

Scalability With Data Size: Pghtl vs. Porgl
[Figure: indexing time (s, y-axis 0 to 250) vs. text data size per index group (64, 128, 256, 512, 1024 MB) at G = 128, plotted for Porgl and Pghtl.]

Indexing Latency Variation: Pghtl vs. Porgl
Indexing latency (s) vs. index group size, at 1 GB per index group:

 Index Group Size (#Nodes):   8      16    32     64    128   256
 Porgl                        21.89  27.4  38.13  70.5  87    182
 Pghtl                        1.79   2.73  3.56   7.52  9.9   17

Search Performance Comparison (Single Index Group)

 Index Group Size (#Nodes)   Porgl Search Time (s)   Pghtl Search Time (s)
 32                          2.13                    1.75
 64                          2.85                    2.34
 128                         5.12                    5.12

Conclusions & Future Work
High-throughput text indexing demonstrated for the first time at such a large scale
 - Architecture-independent design of new data structures
 - Algorithm for distributed in-memory real-time group-based text indexing
   - Better load balance, low communication cost, and good cache performance
Proved analytically that the parallel time complexity of our indexing algorithm is asymptotically better than typical indexing approaches by at least a log(P) factor
Experimental results
 - 3x to 7x improvement in indexing throughput and around 10x better indexing latency on Blue Gene/L
 - Peak indexing throughput of 312 GB/min on 8K nodes
 - Estimate: 5 TB/min on 128K nodes
Future work: distributed search optimizations