Optimizing Search in Un-Sharded Large-Scale Distributed...

1
Optimizing Search in Un-Sharded Large-Scale Distributed Systems This work is supported in part by the National Science Foundation under awards NSF-1461260 (REU). OVERVIEW Storage systems are increasingly distributed Discovery and access are crucial for management and analysis of data Nodes in unsharded environments more autonomous and network traffic decreased No general model for searching in unsharded environments. MOTIVATION Search Discover files based on names and contents Emphasis on speed and scalability Support for near-real-time discovery Environment Each document remains intact on each node Information stored in system not necessarily balanced among nodes OBJECTIVES Lucene Handles indexing, query processing, searching and scoring of documents Near real time indexing to search capabilities New challenges relating to efficiently discovering, accessing, managing, and analyzing distributed data Search framework does not rely on sharding Applicable to a range of distributed storage models Compares hierarchical index structure Test Bed 90000 Wiki documents per m3.large node common, rare, non-existent queries Evaluated against Solr and Grep ARCHITECTURE Server-Client Model Client interface and a server exists on each node Server gets query and begins searching while taking care of query distribution and result collection Query Distribution Spanning tree constructed using membership list of nodes, allowing for dynamic changes in cluster membership Spanning tree allows queries to be distributed efficiently and reduces network traffic Optimized by sending queries to nodes with larger indexes first, which are more likely to have a longer searchtime RESULTS Results Lower overhead Faster and scaled better than Solr and Grep Easy to integrate fast, scalable text search for unsharded environments Tree-based query distribution model Faster search than alternatives when scaled CONTRIBUTION Evaluation with ElasticSearch Fault Tolerance Smarter distribution structure Integration into Globus and FusionFS EVALUATION Suraj Chafle 1 , Jonathan J. Wu 2 , Ioan Raicu 1 , Kyle Chard 3 1 Department of Computer Science, Illinois Institute of Technology, Chicago, IL 2 Department of Computer Science, Washington University in St. Louis, St. Louis, MO 3 Department of Computer Science,` University of Chicago, Chicago, IL 0.01 0.1 1 10 100 1000 3 7 15 31 Time of Search (s) Number of Nodes Grep: Common DistSearch: Common Solr: Common Grep: Nonexistant DistSearch: Nonexistant Solr: Nonexistant Grep: Rare DistSearch: Rare Solr: Rare FUTURE WORK ACKNOWLEDGEMENTS

Transcript of Optimizing Search in Un-Sharded Large-Scale Distributed...

Page 1: Optimizing Search in Un-Sharded Large-Scale Distributed Systemsdatasys.cs.iit.edu/publications/2016_SC16-search-poster.pdf · Emphasis on speed and scalability Support for near-real-time

Optimizing Search in Un-Sharded Large-Scale Distributed Systems

This work is supported in part by the NationalScience Foundation under awards NSF-1461260(REU).

OVERVIEW

Storage systems are increasingly distributed

Discovery and access are crucial for management and analysis of data

Nodes in unsharded environments more autonomous and network traffic decreased

No general model for searching in unshardedenvironments.

MOTIVATION

•Search

Discover files based on names and contents

Emphasis on speed and scalability

Support for near-real-time discovery

•Environment

Each document remains intact on each node

Information stored in system not necessarily balanced among nodes

OBJECTIVES

• Lucene

Handles indexing, query processing, searching andscoring of documents

Near real time indexing to search capabilities

New challenges relating to efficiently discovering, accessing, managing, and analyzing distributed data

Search framework does not rely on sharding

Applicable to a range of distributed storage models

Compares hierarchical index structure

• Test Bed

90000 Wiki documents per m3.large node

common, rare, non-existent queries

Evaluated against Solr and Grep

ARCHITECTURE

• Server-Client Model

Client interface and a server exists on each node

Server gets query and begins searching whiletaking care of query distribution and resultcollection

• Query Distribution

Spanning tree constructed using membership listof nodes, allowing for dynamic changes in clustermembership

Spanning tree allows queries to be distributedefficiently and reduces network traffic

Optimized by sending queries to nodes with largerindexes first, which are more likely to have alonger searchtime

RESULTS

• Results

Lower overhead

Faster and scaled better than Solr and Grep

Easy to integrate fast, scalable text search forunsharded environments

Tree-based query distribution model

Faster search than alternatives when scaled

CONTRIBUTION

Evaluation with ElasticSearch

Fault Tolerance

Smarter distribution structure

Integration into Globus and FusionFS

EVALUATION

Suraj Chafle1, Jonathan J. Wu2, Ioan Raicu1, Kyle Chard3

1Department of Computer Science, Illinois Institute of Technology, Chicago, IL2Department of Computer Science, Washington University in St. Louis, St. Louis, MO

3Department of Computer Science,` University of Chicago, Chicago, IL

0.01

0.1

1

10

100

1000

3 7 15 31

Tim

e o

f S

ea

rch

(s

)

Number of Nodes

Grep: Common

DistSearch: Common

Solr: Common

Grep: Nonexistant

DistSearch: Nonexistant

Solr: Nonexistant

Grep: Rare

DistSearch: Rare

Solr: Rare

FUTURE WORK

ACKNOWLEDGEMENTS