Efficient and Scalable Archive Search
Avishek Anand
IS : Idealized Sharding
CA : Cost Aware Sharding
[Figure: Doc 1 to Doc 7 along a time axis, grouped as {Doc 1, Doc 2, Doc 7} and {Doc 3, Doc 4, Doc 5, Doc 6}.]
Web archives span a long period of time.
Challenge: Support search with temporal constraints.
Searching Archives
Web archives grow continuously over time.
Challenge: Scale search to growing archives.
Scaling Archive Search
Index Sharding
[1] Index Sharding for Space-Time Efficiency in Archive Search. Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. In SIGIR, 2011.
[2] Index Maintenance for Time-Travel Text Search. Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. In SIGIR, 2012.
[3] A Time Machine for Text Search. Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum. In SIGIR, July 2007.
[Figure: timeline from 2007 to 2013 with Doc 1 to Doc 7 placed along it; the index-list is split into Shard 1 and Shard 2.]
Need to design index structures that process time-travel queries efficiently and can be easily maintained.
obama @ [6/2009 – 6/2011]
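The query above pairs a keyword with a time interval. A minimal sketch of what such a time-travel query has to return, assuming each posting carries a validity interval and that both interval ends are inclusive; the `Posting` type, the `(year, month)` encoding, and the sample data are hypothetical and not taken from the poster:

```python
from dataclasses import dataclass
from typing import List, Tuple

Month = Tuple[int, int]  # (year, month); tuples compare lexicographically


@dataclass(frozen=True)
class Posting:
    doc_id: int
    begin: Month  # first month in which this version was valid (assumed inclusive)
    end: Month    # last month in which this version was valid (assumed inclusive)


def overlaps(p: Posting, q_begin: Month, q_end: Month) -> bool:
    # Two closed intervals overlap when each one starts before the other ends.
    return p.begin <= q_end and p.end >= q_begin


def time_travel_query(index_list: List[Posting], q_begin: Month, q_end: Month) -> List[Posting]:
    # Naive baseline: scan the whole index list and keep the overlapping postings.
    return [p for p in index_list if overlaps(p, q_begin, q_end)]


# Hypothetical index list for the term "obama".
index_list = [
    Posting(1, (2007, 1), (2008, 3)),
    Posting(2, (2008, 5), (2010, 1)),
    Posting(3, (2009, 9), (2012, 2)),
    Posting(4, (2011, 8), (2013, 1)),
]

# obama @ [6/2009 – 6/2011]
hits = time_travel_query(index_list, (2009, 6), (2011, 6))
```

The full scan touches every posting, including those that cannot match; that wasted access is what sharding is meant to avoid.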
Idealized Sharding: Eliminates access to postings that do not intersect the query time interval.
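One way to obtain this property can be sketched as follows. The sketch assumes postings are `(begin, end, doc_id)` tuples with inclusive integer times, and it uses a simple greedy assignment that keeps both begin and end times non-decreasing inside every shard. This is an illustration under those assumptions and not necessarily the algorithm of [1]; the function names and data are hypothetical.

```python
from bisect import bisect_left


def build_shards(postings):
    """Greedily partition postings so that, within each shard,
    begin times and end times are both non-decreasing."""
    shards = []
    for p in sorted(postings):  # ascending begin time
        for shard in shards:
            if shard[-1][1] <= p[1]:  # end times stay non-decreasing
                shard.append(p)
                break
        else:
            shards.append([p])  # no shard fits: open a new one
    return shards


def query_shard(shard, q_begin, q_end):
    """Return doc ids in one shard that overlap [q_begin, q_end]."""
    # A real index would store this array; it is rebuilt here for brevity.
    ends = [end for _, end, _ in shard]
    start = bisect_left(ends, q_begin)  # first posting with end >= q_begin
    hits = []
    for begin, _end, doc_id in shard[start:]:
        if begin > q_end:  # begins are sorted, so nothing later can match
            break
        hits.append(doc_id)  # every posting read here overlaps the query
    return hits


# Hypothetical postings: (begin year, end year, doc id)
postings = [
    (2007, 2013, 1),
    (2008, 2009, 2),
    (2009, 2011, 3),
    (2010, 2012, 4),
    (2012, 2013, 5),
]
shards = build_shards(postings)
result = sorted(d for s in shards for d in query_shard(s, 2010, 2011))
```

Because both orders hold inside a shard, the scan starts at the first posting that can overlap and stops at the first that cannot, so no non-overlapping posting is read.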
Cost Aware Shard Merging: Merge idealized shards by reconciling random and sequential access costs.
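The trade-off behind merging can be illustrated with a toy cost model. It assumes that opening a shard costs one random access and that every posting read costs one sequential access; the function names, the cost units, and the decision rule are assumptions made for illustration, not the model used in [1].

```python
def query_cost(shards_opened, postings_read, c_random, c_seq):
    # One random access per shard opened, one sequential access per posting read.
    return shards_opened * c_random + postings_read * c_seq


def merge_pays_off(wasted_reads_after_merge, c_random, c_seq):
    # Merging two shards saves one random access but may force the query
    # to read postings that do not overlap its time interval.
    return wasted_reads_after_merge * c_seq <= c_random


# Hypothetical costs: a random access is 1000 times a sequential read.
C_RANDOM, C_SEQ = 1000, 1

separate = query_cost(shards_opened=2, postings_read=50 + 30,
                      c_random=C_RANDOM, c_seq=C_SEQ)
merged = query_cost(shards_opened=1, postings_read=50 + 30 + 200,
                    c_random=C_RANDOM, c_seq=C_SEQ)
```

With these numbers the merged shard is cheaper despite 200 wasted reads; with several thousand wasted reads the separate shards would win.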
Index Sharding:
• Partitions each index-list disjointly.
• No index blow-up.
Index Maintenance
References
Experiments
Active Index
Archive Index
In-memory Archive Index
External-memory Archive Index
Crawls
Doc 4: version 2
Doc 3: version 2
Doc 2: version 9
Doc 1: version 1
Doc 4: version 3
Sent to Archive Indexing System
In the live index
now
Inserted incoming version
Appended popped posting
Shard buffers
Archive Index Shards
System Architecture : Separate indexes for active and retired versions.
Incremental Sharding:
• Online algorithm with approximation guarantee.
• Append-only operation on shards.
• Retains query performance.
End-time arrival order: Versions finalized in their end-time-order.
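The split between active and retired versions, and the resulting end-time arrival order, can be sketched as follows. The sketch assumes crawl timestamps arrive in non-decreasing order and that a version ends when its successor is crawled; the class, its method names, and the integer timestamps are hypothetical, and the shard buffers of the actual system are not modeled.

```python
class TimeTravelIndex:
    def __init__(self):
        self.active = {}   # doc_id -> (version, begin_time): the live index
        self.archive = []  # retired (doc_id, version, begin, end), append-only

    def add_version(self, doc_id, version, now):
        """Register a newly crawled version (assumes `now` never decreases)."""
        old = self.active.get(doc_id)
        if old is not None:
            old_version, begin = old
            # The previous version is finalized and handed to the archive.
            self.archive.append((doc_id, old_version, begin, now))
        self.active[doc_id] = (version, now)


index = TimeTravelIndex()
crawls = [
    (4, 2, 1), (3, 2, 2), (2, 9, 3), (1, 1, 4), (4, 3, 5),  # versions as in the figure
    (2, 10, 6), (3, 3, 7),                                  # hypothetical later crawls
]
for doc_id, version, now in crawls:
    index.add_version(doc_id, version, now)

end_times = [end for _, _, _, end in index.archive]
begin_times = [begin for _, _, begin, _ in index.archive]
```

Retired versions reach the archive sorted by end time but not by begin time, which is why the archive side needs an incremental strategy that works with append-only shards.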
query time-interval
SB : Vertical Partitioning with a trade-off between performance and index size [3]
Approach: Avoid accessing postings that do not overlap with the query time interval.
Approach: Avoid re-computation of the index by creating shards incrementally.
Wallclock-times comparison with SB
Index-size comparison
Index maintenance efficiency
Performance of incremental sharding
INC : Incremental Sharding