Batch indexing & near real time, keeping things fast.

Marc Sturlese, Software engineer @ Trovit

About me...

• Marc Sturlese – @sturlese

• Software engineer @Trovit. R&D focused

• Responsible for search and scalability

Agenda

• Who we are

• Batch architecture. Hadoop & Hive

• Near real time architecture. Storm & stuff

• Putting it all together

• Alternatives and Future directions

• Questions

Who we are

Trovit, a search engine for classifieds

Batch Layer

• Hadoop based

• Documents are crunched by a pipeline of MR jobs

• Hive to save stats of each phase

Batch Layer: Pipeline overview

[Figure: pipeline overview. Stages inside the Hadoop cluster: Ad Processor, Diff, Matching, Expiration, Deduplication, Indexing. Other labels: Incoming data, External Data, t – 1, Hive Stats, Lucene Indexes, Deployment.]
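
The chain of phases, each reporting its own stats, can be pictured with a minimal sketch in plain Python rather than MapReduce. The phase rules (an "expired" flag, duplicates keyed on title and price) and all function names are hypothetical, chosen only to show how a per-phase in/out count could be collected for something like the Hive stats.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]
Phase = Callable[[List[Doc]], List[Doc]]


def drop_expired(docs: List[Doc]) -> List[Doc]:
    # Hypothetical rule: an ad carries an "expired" flag.
    return [d for d in docs if not d.get("expired", False)]


def deduplicate(docs: List[Doc]) -> List[Doc]:
    # Hypothetical rule: same title and price means duplicate; keep the first.
    seen = set()
    unique = []
    for d in docs:
        key = (d["title"], d["price"])
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


def run_pipeline(docs: List[Doc], phases: List[Tuple[str, Phase]]):
    # Run each phase on the output of the previous one and record
    # how many documents went in and came out.
    stats = []
    for name, phase in phases:
        before = len(docs)
        docs = phase(docs)
        stats.append({"phase": name, "in": before, "out": len(docs)})
    return docs, stats


ads = [
    {"id": "1", "title": "Golf 2008", "price": 5000},
    {"id": "2", "title": "Golf 2008", "price": 5000},
    {"id": "3", "title": "Flat in BCN", "price": 900, "expired": True},
    {"id": "4", "title": "Java dev", "price": 0},
]
indexable, phase_stats = run_pipeline(
    ads, [("expiration", drop_expired), ("deduplication", deduplicate)]
)
```

In the real pipeline each phase would be a MapReduce job over the whole data set; the sketch only mirrors the shape of the chain.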

Batch Layer: The good things!

• Index always built from scratch. Small number of big segments

• Multicast deployment sends indexes to all slaves at the same time

• Backups convenient on HDFS

Batch Layer: That was cool but...

• Not even close to real time

• Crunching documents in batch means waiting until everything is processed. This can take a few hours

• We want to show the user fresher results!

Near real time Layer: Storm and stuff to the rescue

Near real time Layer: Storm properties

• Distributed real time computation system

• Fault tolerance

• Horizontal scalability

• Low latency

• Reliability

Near real time Layer: Storm in action

[Figure: Storm in action. Sources: XML feeds and a Kafka partition. Storm topology: XML spout, Kafka spout, Doc Manager bolt, Indexer bolt, connected by a shuffle grouping and a second grouping. Target: Solr prod replicas (slaves).]

• Spouts just read and send

• Doc Manager Bolt processes and classifies

• Indexer Bolt adds documents to Solr

• Replicated logic with different implementation

• Careful not to overload Solr slaves...
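
The division of labour between spouts and bolts can be illustrated with a small simulation in plain Python. This is not Storm code: the classes, the field names ("vertical", "country") and the index naming scheme are assumptions made for the example, and a dictionary stands in for the Solr cores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def xml_spout(feed: Iterable[dict]) -> Iterator[dict]:
    # Spouts just read and emit; no processing happens here.
    for raw in feed:
        yield raw


def doc_manager_bolt(raw: dict) -> dict:
    # Process and classify: clean the document and pick its target index.
    # The "<vertical>_<country>" naming is hypothetical.
    index = f"{raw['vertical']}_{raw['country']}".lower()
    return {"id": raw["id"], "title": raw["title"].strip(), "index": index}


class IndexerBolt:
    def __init__(self) -> None:
        # Stand-in for Solr: index name -> list of added document ids.
        self.added: Dict[str, List[str]] = defaultdict(list)

    def execute(self, doc: dict) -> None:
        self.added[doc["index"]].append(doc["id"])


def run_topology(feeds: Iterable[Iterable[dict]], indexer: IndexerBolt) -> None:
    for feed in feeds:
        for raw in xml_spout(feed):
            indexer.execute(doc_manager_bolt(raw))


indexer = IndexerBolt()
run_topology(
    [
        [{"id": "a1", "title": " Golf 2008 ", "vertical": "Cars", "country": "US"}],
        [{"id": "b1", "title": "Java dev", "vertical": "Jobs", "country": "BR"}],
    ],
    indexer,
)
```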

Near real time Layer: Storm in action. But...

• Now Solr has to handle user queries and storm inserts

• Fields grouping on Indexer Bolt for politeness

• Small bulks to reduce insert requests

• Committing on many cores, same host, same time can be painful
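
The two politeness measures above, routing by a field and sending small bulks, can be sketched as follows. This is a plain Python illustration in the spirit of a fields grouping (same field value, same task), not Storm's implementation; `task_for`, `BulkBuffer` and the `send` callback are hypothetical names.

```python
import zlib
from typing import Callable, Dict, List


def task_for(index_name: str, num_tasks: int) -> int:
    # Same index name always maps to the same indexer task, so a given
    # Solr core receives inserts from a single task.
    # crc32 is used because Python's built-in hash() of str varies per process.
    return zlib.crc32(index_name.encode("utf-8")) % num_tasks


class BulkBuffer:
    """Collects documents per index and sends them in small bulks."""

    def __init__(self, bulk_size: int, send: Callable[[str, List[dict]], None]):
        self.bulk_size = bulk_size
        self.send = send  # hypothetical callback that performs one insert request
        self.pending: Dict[str, List[dict]] = {}

    def add(self, index_name: str, doc: dict) -> None:
        docs = self.pending.setdefault(index_name, [])
        docs.append(doc)
        if len(docs) >= self.bulk_size:
            self.flush(index_name)

    def flush(self, index_name: str) -> None:
        docs = self.pending.pop(index_name, [])
        if docs:
            self.send(index_name, docs)


requests = []
buffer = BulkBuffer(3, lambda index, docs: requests.append((index, len(docs))))
for i in range(7):
    buffer.add("cars_us", {"id": str(i)})
buffer.flush("cars_us")  # send the remainder
```

Seven documents end up as three insert requests instead of seven, which is the point of bulking.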

Near real time Layer: Storm in action - Committing

[Figure: committing. Indexer Bolts (Cars US, Jobs BR) coordinate through a ZooKeeper Locker before committing on Slave 1, Slave 2 ... Slave N. Cores shown on the slaves: Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1.]

• Adding documents now is fast

• Keep number of segments small

• Avoid merges on big segments

• Just add new docs (no deletes or updates)
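
The idea of serialising commits per host can be sketched with an in-process lock. Here `threading.Lock` is only a stand-in for a ZooKeeper-backed distributed lock; `HostCommitLocker`, `commit_core` and the `do_commit` callback are hypothetical names, and the real locker may well work differently.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable


class HostCommitLocker:
    """In-process stand-in for a distributed locker: one lock per Solr host."""

    def __init__(self) -> None:
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    @contextmanager
    def commit_slot(self, host: str):
        with self._guard:
            lock = self._locks[host]
        with lock:
            yield


def commit_core(
    locker: HostCommitLocker,
    host: str,
    core: str,
    do_commit: Callable[[str, str], None],
) -> None:
    # Only one core per host commits at a time; different hosts do not block
    # each other.
    with locker.commit_slot(host):
        do_commit(host, core)
```

A bolt would call `commit_core(locker, "slave1", "cars_us_r1", do_commit)` after sending its bulks, so cores living on the same slave never commit simultaneously.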

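Mixed Architecture: Putting it all together

[Figure: putting it all together. Sources: XML feeds and a Kafka partition. The Storm topology checks "Exists?" against HBase doc info and does a bulk add to the Solr prod replicas (slaves); the MR pipeline is also shown.]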

Mixed Architecture: Swapping indexes

• Docs added by the NRT layer might be missing from the new batch index (they can be fresher than the batch index that is still being built)

• This can lead to inconsistencies...

Mixed Architecture: Swapping indexes. Time jumps!

[Figure, four animation frames: timeline with XML feed t, t+1 and t+2, batch pipeline runs t and t+1, and slave indexes t and t+1. Legend: NRT indexer, Batch indexer. The last two frames add NRT t+1 and NRT t+2.]

• Docs indexed by the NRT layer must be kept in temporary storage

• Fetch missing docs from the storage and add them before the next deploy

• This avoids time jumps
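
The reconciliation step before a deploy can be sketched as a simple merge. This assumes documents are keyed by id and that, when a document is present in both places, the batch version wins (the batch layer is the one that handles updates); `prepare_deploy` and the dictionary-based stores are hypothetical.

```python
from typing import Dict


def prepare_deploy(
    batch_index: Dict[str, dict], nrt_store: Dict[str, dict]
) -> Dict[str, dict]:
    """Return the index to deploy: batch docs plus NRT docs the batch run missed."""
    merged = dict(batch_index)
    for doc_id, doc in nrt_store.items():
        if doc_id not in merged:
            # The doc arrived after the batch run started; without this step
            # it would vanish from search results when the new index is swapped in.
            merged[doc_id] = doc
    return merged


batch_index = {"1": {"id": "1", "title": "Golf 2008 (processed)"}}
nrt_store = {
    "1": {"id": "1", "title": "Golf 2008"},
    "2": {"id": "2", "title": "Flat in BCN"},
}
to_deploy = prepare_deploy(batch_index, nrt_store)
```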

Mixed Architecture: Storm and Hadoop

• Near real time inserts, low latency

• Hadoop handles deletes and updates. No rush on those

• No merges on big segments so optimal query response times

• Tolerant to human errors

• Temporary loss of accuracy on the NRT layer

Alternatives: SolrCloud - Why not?

• Good for the vast majority of use cases

• Oriented to incremental inserts/updates/deletes. Segment merges are the price paid for real time

• We need to deploy full indexes fast (faster than rsync or HTTP replication)

• Now full deploy easier with aliases
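
One way to read the alias remark: build the new full index as a separate collection, then re-point an alias at it. The sketch below only builds the request URL for the Solr Collections API `CREATEALIAS` action; the host, alias and collection names are made up, and whether this matches the deploy flow the talk has in mind is an assumption.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collection: str) -> str:
    # CREATEALIAS takes the alias name and the collection(s) it should point to.
    params = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{solr_base}/admin/collections?{params}"


# Hypothetical names: queries keep using "cars_us" while the alias is moved
# to the freshly built collection.
url = create_alias_url("http://localhost:8983/solr", "cars_us", "cars_us_20130501")
```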

Future lines: Lucene real time feature

• Lets you see docs in the index before they are committed

• Good but not a must right now for the use case

• Very easy to integrate into the current architecture

Thanks for your attention!

Marc Sturlese, marc@trovit.com

Lucene/Solr Revolution 2013, San Diego, May 1 2013
