IBM Research, India
© 2009 IBM Corporation
Highly Scalable Algorithm for Distributed Real-time Text Indexing
Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay Garg
IBM Research, India
Email: {annarang, avikas, monkedia}@in.ibm.com, [email protected]

Agenda
Background
Challenges in Scalable Indexing
In-memory Index Data Structure Design
Parallel Indexing Algorithm
 - Parallel Pipelined Indexing
Asymptotic Time Complexity Analysis
Experimental Results
 - Strong Scalability
 - Weak Scalability
 - Search Performance
Conclusions & Future Work

Background
Data-intensive supercomputing is gaining strong research momentum
 - Large-scale computations over massive and changing data sets
 - Multiple domains: telescope imagery, online transaction records, financial markets, medical records, weather prediction
Massive-throughput real-time text indexing and search
 - Massive data arriving at a high rate (~1-10 GB/s)
 - Index expected to age off at regular intervals
Architectural innovations
 - Massively parallel / many-core architectures
 - Storage-class memories with tens of terabytes of storage
These trends demand very high indexing rates and stringent search response times
Optimizations are needed to:
 - Maximize indexing throughput
 - Minimize indexing latency (time from indexing a document until it is searchable)
 - Sustain search performance

Background – Index for Text Search (e.g., Lucene)
A Lucene index covers a set of documents
 - A document is a sequence of fields
 - A field is a sequence of terms
 - A term is a text string
A Lucene index consists of one or more segments
 - Each segment covers a set of documents
 - Each segment is a fully independent index
Index updates are serialized; multiple index searches can proceed concurrently
Supports simultaneous index update and search
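As a rough C++ illustration of this hierarchy (hypothetical simplified types, not the actual CLucene classes):

    #include <string>
    #include <vector>

    // Minimal sketch of the Lucene index model described above.
    using Term = std::string;              // a term is a text string

    struct Field {                         // a field is a sequence of terms
        std::string name;
        std::vector<Term> terms;
    };

    struct Document {                      // a document is a sequence of fields
        int id;
        std::vector<Field> fields;
    };

    struct Segment {                       // a fully independent index
        std::vector<Document> docs;        // the set of documents it covers
    };

    struct Index {                         // an index: one or more segments
        std::vector<Segment> segments;
    };
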
Challenges to Scalable In-memory Distributed Real-time Indexing
Scalability issues in typical approaches
 - Merge-sort of the sorted terms in the input segments to generate the list of Terms and TermInfos for the merged segment
 - Merging and re-organization of the Document-Lists and Position-Lists of the input segments
 - Load imbalance increases with the number of processors
 - The index-merge process quickly becomes the bottleneck, causing large indexing latency
Index data structure design challenges
 - Inherent trade-offs between index size, indexing throughput, and search throughput
 - Trade-offs between indexing latency, search response time, and throughput
Performance objective
 - Maximize indexing performance while sustaining search performance (both search response time and throughput)

Scalability Issues With Typical Indexing Approaches
[Figure: two input segments, Segment(1) and Segment(2), each holding a Term (T(i)) with its TermInfo, Document-List (Doc/Freq pairs), and Position-Lists, are combined into a Merged Segment in two steps.
 Step(1): Merge-sort of terms and creation of the new TermInfo.
 Step(2): Merge of the Document-Lists and Position-Lists.]

In-memory Indexing Data Structure Design
Two-level hierarchical index data structure
 - Top-level hash table: GHT (Global Hash Table)
   - Represents the complete index for a given set of documents
   - Maps: Term => second-level hash tables (IHTs)
 - Second-level hash table: IHT (Interval Hash Table)
   - Represents the index for an interval of documents with contiguous IDs
   - Maps: Term => list of documents containing that term
   - Postings data is also stored
Advantages of the design
 - No re-organization of data is needed while merging an IHT into the GHT
 - Merge-sort is eliminated in favor of hash operations
 - Efficient encoding reduces the memory footprint of an IHT
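A minimal sketch of the two-level design, assuming simplified in-memory postings (the paper's encoded IHT layout appears on a later slide); all type and member names are illustrative:

    #include <cstdint>
    #include <deque>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Postings for one document: doc ID, term frequency, term positions.
    struct Posting {
        uint32_t docId;
        uint32_t freq;
        std::vector<uint32_t> positions;
    };

    // IHT: index over an interval of documents with contiguous IDs.
    // Maps each term to the documents (with postings) that contain it.
    using IHT = std::unordered_map<std::string, std::vector<Posting>>;

    // GHT: the complete index; maps each term to the document intervals
    // (IHTs) in which it occurs. Merging an IHT needs only hash lookups
    // and appends: no merge-sort, and the per-interval postings data is
    // never re-organized.
    struct GHT {
        std::deque<IHT> intervals;  // stored IHTs; deque keeps addresses stable
        std::unordered_map<std::string, std::vector<const IHT*>> terms;

        void merge(IHT iht) {
            intervals.push_back(std::move(iht));
            const IHT& stored = intervals.back();
            for (const auto& entry : stored)
                terms[entry.first].push_back(&stored);
        }
    };
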
Interval Hash Table (IHT): Concept
[Figure: a term Ti hashes via HF(Ti) into the hash table, with per-bucket term collision resolution; the term's entry points into the IHT data, an array of (DocID, Frequency, Positions) records for the interval's documents Di, Di+1, ..., Dj.]

Global Hash Table (GHT): Concept
[Figure: a term Ti hashes via HF(Ti) into the hash table, with per-bucket term collision resolution; the term's entry points to a document-interval indexed hash table, where each interval Dj-Dk leads to the (DocID, Frequency, Positions) records for documents Dj, Dj+1, ..., Dk.]

Encoded IHT Representation
The encoded IHT is a sequence of flat sub-arrays:

 Sub-array (what it represents)                 | Size of the sub-array
 -----------------------------------------------|--------------------------
 Number of distinct terms per hash table entry  | # Hash table entries
 Term IDs                                       | # Distinct terms in IHT
 Number of docs in which each term occurred     | # Distinct terms in IHT
 Document IDs per term                          | # Docs/term * # terms
 Term frequency in each document                | # Docs/term * # terms
 Offset into position information               | # Docs/term * # terms

Steps to access the term positions for (TermID(Ti), DocID(Dj)):
 1. Get NumTerms from TermKey(Ti)
 2. GetTermID(Ti); GetNumDocs(Ti)
 3. GetDocIDs(Ti); get the offset into the position data for (Ti, Dj)
 4. GetNumTerms(Dj): the term's frequency in Dj, i.e., how many positions to read
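A hedged sketch of this flat layout and of the position lookup; the paper's actual encoding and field widths may differ, and EncodedIHT/findPositions are illustrative names:

    #include <cstdint>
    #include <vector>

    // Hypothetical flat encoding of an IHT, mirroring the sub-array table above.
    struct EncodedIHT {
        std::vector<uint32_t> termsPerEntry; // one count per hash table entry
        std::vector<uint32_t> termIds;       // one ID per distinct term
        std::vector<uint32_t> numDocs;       // docs containing each term
        std::vector<uint32_t> docIds;        // doc IDs, concatenated per term
        std::vector<uint32_t> freqs;         // frequency per (term, doc) pair
        std::vector<uint32_t> posOffsets;    // offset into positions per pair
        std::vector<uint32_t> positions;     // concatenated position data
    };

    // Return a pointer to the positions of the term with index termIdx in
    // document docId, or nullptr if that document does not contain the term.
    const uint32_t* findPositions(const EncodedIHT& e,
                                  uint32_t termIdx, uint32_t docId) {
        // Start of this term's run in the per-(term, doc) arrays; a real
        // implementation would precompute these prefix sums.
        uint32_t start = 0;
        for (uint32_t t = 0; t < termIdx; ++t) start += e.numDocs[t];
        for (uint32_t k = 0; k < e.numDocs[termIdx]; ++k)
            if (e.docIds[start + k] == docId)
                return &e.positions[e.posOffsets[start + k]];
        return nullptr;
    }
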
New Indexing Algorithm
Main steps of the indexing algorithm (see the sketch below):
 1. A posting table (LHT) is constructed for each document, without any sorting of terms
 2. The posting tables of k documents are merged into an IHT, which is then encoded
 3. The encoded IHTs are merged into a single GHT efficiently
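A condensed sketch of steps 1 and 2, reusing the hypothetical Document/Posting/IHT types from the earlier sketches (step 3 is the GHT::merge shown before); tokenization and the flat encoding are elided:

    // Step 1: per-document posting table (LHT), built with hash inserts
    // only -- terms are never sorted.
    IHT buildLHT(const Document& doc) {
        IHT lht;
        uint32_t pos = 0;
        for (const Field& f : doc.fields)
            for (const Term& t : f.terms) {
                std::vector<Posting>& plist = lht[t];
                if (plist.empty())
                    plist.push_back({static_cast<uint32_t>(doc.id), 0, {}});
                plist.back().freq++;
                plist.back().positions.push_back(pos++);
            }
        return lht;
    }

    // Step 2: merge the LHTs of k documents into one IHT via hash
    // operations; the result would then be encoded into the flat layout.
    IHT buildIHT(const std::vector<Document>& kDocs) {
        IHT iht;
        for (const Document& d : kDocs) {
            IHT lht = buildLHT(d);
            for (auto& entry : lht) {
                std::vector<Posting>& dst = iht[entry.first];
                for (Posting& p : entry.second)
                    dst.push_back(std::move(p));
            }
        }
        return iht;
    }
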
GHT Construction from IHT
[Figure: an array of encoded IHTs is folded into the Global Hash Table. For a new encoded IHT(g): (S1) scan its distinct terms; (S2(a)) hash each term Ti or Tj via HF(Ti)/HF(Tj) to locate its GHT entry; (S2(b)) append a reference to IHT(g) to that entry, creating the entry if the term is new.]

Parallel Group-based Indexing Algorithm
[Figure: incoming documents are partitioned across multiple index groups (I0, I1, I2, I3, I4); each indexing group builds its portion of the index, while a search group serves queries against the index groups.]

Pipeline Diagram (Distributed Indexing Algorithm)
[Figure: a timeline with Producer(1), Producer(2), Producer(3) and a Consumer. In each round, the producers produce IHTs/segments and send them to the consumer, which merges the IHTs/segments; rounds are separated by barrier synchronization.]
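The slides do not name the communication layer; below is a minimal MPI-flavored sketch of one such pipeline, assuming |P| producer ranks, one consumer rank, and hypothetical produceIHT/serialize/decode helpers (reusing the IHT/GHT types from the earlier sketches):

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Hypothetical helpers (not from the paper):
    IHT produceIHT(int round);                    // build the next IHT
    std::vector<uint8_t> serialize(const IHT&);   // flat-encode for the wire
    IHT decode(const std::vector<uint8_t>&);      // rebuild an IHT from bytes

    // One produce-consume pipeline over n rounds. Ranks 0..P-1 are
    // producers; rank P is the consumer. Error handling is elided.
    void runPipeline(int rank, int P, int n, GHT& ght) {
        for (int round = 0; round < n; ++round) {
            if (rank < P) {
                // Produce and encode an IHT, then ship it to the consumer.
                std::vector<uint8_t> bytes = serialize(produceIHT(round));
                MPI_Send(bytes.data(), static_cast<int>(bytes.size()),
                         MPI_BYTE, P, round, MPI_COMM_WORLD);
            } else {
                // Receive one IHT from every producer and merge it.
                for (int j = 0; j < P; ++j) {
                    MPI_Status st;
                    int count = 0;
                    MPI_Probe(j, round, MPI_COMM_WORLD, &st);
                    MPI_Get_count(&st, MPI_BYTE, &count);
                    std::vector<uint8_t> buf(count);
                    MPI_Recv(buf.data(), count, MPI_BYTE, j, round,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    ght.merge(decode(buf));  // fold the IHT into the GHT
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);     // barrier sync between rounds
        }
    }
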
Asymptotic Time Complexity Analysis
Definitions
 - Size of the indexing group: |G| = |P| + 1
   - P: the set of producers, plus a single consumer
 - n produce-consume rounds, with |P| producers and a single consumer in each round
 - Prod(j,i): total time of the j-th producer in the i-th round
   - ProdComp(j,i): its compute time; ProdComm(j,i): its communication time
 - Cons(i): total time of the consumer in the i-th round
   - ConsComp(i): its compute time; ConsComm(i): its communication time

Distributed indexing time (see the sketch below):
 T(distributed) = X + Y + Z, where
   X = max_j ProdComp(j,1)
   Y = Σ_{2 ≤ i ≤ n} max( max_j Prod(j,i), Cons(i−1) )
   Z = Cons(n)
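To make the recurrence concrete, a small illustrative sketch that evaluates T(distributed) from measured per-round times (names and 0-based indexing are mine, not the paper's):

    #include <algorithm>
    #include <vector>

    // prodComp[i][j], prod[i][j]: compute/total time of producer j in round i.
    // cons[i]: total consumer time in round i. All indices are 0-based.
    double distributedTime(const std::vector<std::vector<double>>& prodComp,
                           const std::vector<std::vector<double>>& prod,
                           const std::vector<double>& cons) {
        size_t n = prod.size();
        // X: the slowest producer's compute time in the first round.
        double X = *std::max_element(prodComp[0].begin(), prodComp[0].end());
        // Y: in each later round, the producers overlap the consumer
        // finishing the previous round; the slower side gates the pipeline.
        double Y = 0;
        for (size_t i = 1; i < n; ++i) {
            double maxProd = *std::max_element(prod[i].begin(), prod[i].end());
            Y += std::max(maxProd, cons[i - 1]);
        }
        // Z: the consumer's final merge of the last round.
        double Z = cons[n - 1];
        return X + Y + Z;
    }
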
Asymptotic Time Complexity Analysis
The overall indexing time depends on the balance of the pipeline stages. There are two cases: the produce phase dominates the merge phase, or the merge phase dominates the produce phase.

Time complexity upper bounds:
 - Case(1): production time per round > merge time per round
   T(Pghtl) = O(R / |P|)
   T(Porgl) = O((R / |P|) * log(k))
 - Case(2): merge time per round > production time per round
   T(Pghtl) = O(R / k)
   T(Porgl) = O((R / k) * log(|P|))

Experimental Setup
Based on the original CLucene codebase (v0.9.20)
Porgl implementation
 - Distributed in-memory indexing algorithm using RAMDirectory
 - Distributed search implementation
Pghtl implementation
 - Implementation of the IHT and GHT data structures
 - Distributed indexing and search algorithm implementation
Data: IBM intranet website data
 - Text data extracted from HTML files
 - Loaded equally into the memory of the producer nodes
Experiments run on Blue Gene/L
 - Up to 16K processor nodes, with 2 PPC 440 cores per node
 - Co-processor mode: 1 compute core, 1 router core
 - High-bandwidth 3D torus interconnect
For Porgl, k is chosen such that only one segment is created from all the text data fed to a producer, giving Porgl its best indexing throughput.

Strong Scalability Comparison: Pghtl vs. Porgl
Indexing time (s) vs. index group size, at 1 GB per index group:

 Index Group Size (#Nodes):   2     4     8      16    32     64     128     256    512
 Porgl                        600   480   224    162   151    182    195.33  220    265
 Pghtl                        304   119   60.88  36.9  26.18  24.55  28.72   37.82  39.02

SpeedUp Comparison: Pghtl vs. Porgl
[Figure: speedup (y-axis 0 to 14) vs. index group size (2 to 512 nodes) at 1 GB per index group, plotted for Pghtl and Porgl.]

Weak Scalability Comparison: Pghtl vs. Porgl
Indexing time (s) vs. index group size:

 Index Group Size (#Nodes):   4      8      16     32     64     128
 Porgl                        14.98  14.05  33.79  52.56  117.9  290
 Pghtl                        6.64   7.2    8.72   12.88  21.27  37.92

Scalability With Data Size: Pghtl vs. Porgl
[Figure: indexing time (s, y-axis 0 to 250) vs. text data size per index group (64, 128, 256, 512, 1024 MB) at G = 128, plotted for Porgl and Pghtl.]

Indexing Latency Variation: Pghtl vs. Porgl
Indexing latency (s) vs. index group size, at 1 GB per index group:

 Index Group Size (#Nodes):   8      16    32     64    128   256
 Porgl                        21.89  27.4  38.13  70.5  87    182
 Pghtl                        1.79   2.73  3.56   7.52  9.9   17

Search Performance Comparison (Single Index Group)

 Index Group Size (#Nodes)   Porgl Search Time (s)   Pghtl Search Time (s)
 32                          2.13                    1.75
 64                          2.85                    2.34
 128                         5.12                    5.12

Conclusions & Future Work
High-throughput text indexing demonstrated for the first time at such a large scale
 - Architecture-independent design of new data structures
 - Algorithm for distributed in-memory real-time group-based text indexing
   - Better load balance, low communication cost, and good cache performance
Proved analytically that the parallel time complexity of our indexing algorithm is asymptotically better than typical indexing approaches by at least a log(P) factor
Experimental results
 - 3x to 7x improvement in indexing throughput and around 10x better indexing latency on Blue Gene/L
 - Peak indexing throughput of 312 GB/min on 8K nodes
 - Estimate: 5 TB/min on 128K nodes
Future work: distributed search optimizations