Post on 12-Jul-2015
Deduplication using SOLR
Neeraj Jain, Software Engineer, Stubhub, Inc. neeraj.adi@gmail.com
About myself
RIDE ON
StubHub is about…..
We enable “access” to events
We want to be more!!!
World's Largest Ticketing Marketplace
10M active listings
Some Fun Facts about StubHub!
Ø An eBay-owned company
Ø Over 25 million users and growing
Ø We sell one ticket per second
Ø ~8.5 million page views a day, on average
Ø ~3 million additional page views per day on mobile devices
Ø ~10 M tickets for sale in sports, concerts, and others
Ø ~1 TB of data processed monthly by the analytics infrastructure. This number will go up significantly as we bring in data from many of the unstructured data sources
Ø ~300 million SQL executions/day
2010 2011 2012 2013 2014
Search at Stubhub!
SOLR 1.2 SOLR 3+, Geospatial search SOLR Cloud SOLR 4, NRT
Agenda
Ø Use case
Ø Challenges
Ø Legacy solution
Ø Our approach
Ø Results
Use Case : Content Ingestion
Input record
Deduplication
Post deduplication
Pre deduplication
Normalize
Geocode
Review
Insert
Update
Discard
Filtering
Classification
Feed-1
Form
Feed-2
Feed-3
Feed-n
Event DB
Challenges : Deduplication
Ø Problem space
² Event catalog
Ø Performance considerations
² Real-time processing
² Batch processing
Ø Speed and data quality
Legacy Solution : Deduplication Flow
DeduplicationModule
for each field
Event DB
for each document
Client
1: getDuplicates()
2: getSubsetByLocation()
3: loop
4: DuplicateList
5: upsert()
Normalize Filter Compute Score
Feed Ingestor
UGC
Batch Job
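One possible reading of the legacy flow above, sketched in Java: the module fetches a location subset, then loops over every document and every field to normalize, filter, and score. All names here (`LegacyDedupeSketch`, the field maps, the fraction-of-matching-fields score, the threshold) are illustrative assumptions and not the original module's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the nested-loop legacy flow; not the original module. */
final class LegacyDedupeSketch {

    /**
     * subsetByLocation stands in for the result of getSubsetByLocation().
     * The score (fraction of shared fields that match) is an assumption.
     */
    static List<Map<String, String>> getDuplicates(Map<String, String> input,
                                                   List<Map<String, String>> subsetByLocation,
                                                   double threshold) {
        List<Map<String, String>> duplicateList = new ArrayList<>();
        for (Map<String, String> document : subsetByLocation) {          // for each document
            int compared = 0;
            int matched = 0;
            for (Map.Entry<String, String> field : input.entrySet()) {   // for each field
                String other = document.get(field.getKey());
                if (other == null) {
                    continue;                                            // filter: field not comparable
                }
                compared++;
                if (normalize(field.getValue()).equals(normalize(other))) {
                    matched++;
                }
            }
            double score = compared == 0 ? 0.0 : (double) matched / compared; // compute score
            if (score >= threshold) {
                duplicateList.add(document);
            }
        }
        return duplicateList;
    }

    /** Lowercases and collapses punctuation and whitespace runs to single spaces. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }
}
```

In a loop of this shape, the work per request grows with the number of documents in the subset times the number of fields compared.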
Approach : Problem Model
Ø Milpitas Library vs Milpitas Public Library
Ø 1601 E 7th St vs 1601 E. Seventh St.
Ø Pick the right algorithm: edit distance, Jaccard.
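To make the two measures named above concrete, here is a minimal Java sketch of token-level Jaccard similarity and Levenshtein edit distance. The class and method names are hypothetical, and the tokenization (lowercase, strip punctuation, split on whitespace) is an assumption rather than what the deck's system used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative helpers only; names and tokenization are assumptions. */
final class TextSimilarity {

    /** Lowercases, replaces punctuation with spaces, and splits on whitespace. */
    static Set<String> tokens(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", " ").trim();
        if (cleaned.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Jaccard similarity: size of the token intersection divided by size of the union. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein edit distance over characters, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Milpitas Library", "Milpitas Public Library"));
        System.out.println(editDistance("1601 E 7th St", "1601 E. Seventh St."));
    }
}
```

For the library names, the token sets share two of three distinct words, so the Jaccard similarity is 2/3, while the character edit distance is 7 (the inserted "Public "). The address pair shows why the choice matters: "7th" and "Seventh" share a meaning but few characters or tokens, so either measure benefits from normalizing the address first.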
Milpitas Library 160 N. Main St; 40 N. Milpitas Blvd. Distance : ~0.5 mi
Library, Restaurant, etc
e.g. venue name, street number Boost
Dup detection - name, address etc
Subset - Text Similarity on Categories
Subset - Geospatial distance
Venue Deduplication
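The geospatial subset step can be illustrated with a great-circle distance filter. This is a minimal sketch assuming venues carry latitude and longitude in degrees; the `Place` type, the haversine formula as the distance measure, and the radius are assumptions, since the deck does not say how distance was computed outside SOLR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keep only candidates within a radius of the input venue. */
final class GeoSubsetSketch {

    static final double EARTH_RADIUS_MILES = 3958.8; // mean Earth radius

    /** Minimal stand-in for a venue: a name plus coordinates in degrees. */
    static final class Place {
        final String name;
        final double lat;
        final double lon;

        Place(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Haversine great-circle distance in miles. */
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding error before asin.
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(Math.min(1.0, a)));
    }

    /** Returns the candidates that lie within radiusMiles of the input place. */
    static List<Place> within(Place input, List<Place> candidates, double radiusMiles) {
        List<Place> subset = new ArrayList<>();
        for (Place candidate : candidates) {
            double d = haversineMiles(input.lat, input.lon, candidate.lat, candidate.lon);
            if (d <= radiusMiles) {
                subset.add(candidate);
            }
        }
        return subset;
    }
}
```

Narrowing by distance first keeps the more expensive name and address comparison to a small set of nearby candidates, which is the role the "Subset" layers play in the model above.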
Approach : Deduplication Flow
Feed Ingestor
DeduplicationService
QueryBuilder QueryExecuter Scorer
SOLR Index
Client
UGC
Batch Job
1: /dedupe
3.1: /select
7: /update 3: execute() 4: compute() 2: build()
6: DedupeResponse
Event DB 8: upsert()
IndexUpdater
A1: poll()
A2: /update, /delete
NameFilter
AddressFilter
Filter
*Filter
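As a rough illustration of the build() step, the sketch below assembles a `/select` request string in plain Java. The request parameters (`defType=edismax`, `qf`, `fq` with `{!geofilt}`, `fl`, `rows`) are standard SOLR parameters, but the field names `name`, `address`, and `location`, the boost value, and the class itself are assumptions; the deck does not show the real QueryBuilder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical query builder; field names and boosts are placeholders. */
final class SelectQuerySketch {

    /** Builds a /select query string that matches on text and filters by distance. */
    static String build(String name, String address, double lat, double lon, double radiusKm) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", escape(name) + " " + escape(address));
        params.put("qf", "name^2 address");                 // boost the venue name (assumed weight)
        params.put("fq", "{!geofilt sfield=location pt=" + lat + "," + lon + " d=" + radiusKm + "}");
        params.put("fl", "*,score");                        // return the score with each document
        params.put("rows", "10");

        StringJoiner query = new StringJoiner("&", "/select?", "");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(p.getKey() + "=" + encode(p.getValue()));
        }
        return query.toString();
    }

    /** Backslash-escapes characters that are special in the Lucene query syntax. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** URL-encodes a parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

The `d` argument of `geofilt` is a distance in kilometers. Keeping the distance restriction in a filter query leaves the relevance score driven by the text match alone.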
Approach : Deduplication Service
public interface DeduplicationService<T> {
    /**
     * Checks for duplicate entities and returns a DeduplicationResponse containing information about
     * the duplicates found. For each possible duplicate, there is a justification as to why it's a duplicate.
     * @param t entity for which duplicates need to be found.
     * @param options use options provided by this object to find and filter the results.
     * @return a non-null instance of DeduplicationResponse.
     * @throws DeduplicationConnectivityException if there was an issue in connecting to the dedupe data store.
     */
    public DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}
Approach : Deduplication Service
@Component(value = "VenueDeduplicationService")
public class VenueDeduplicationService implements DeduplicationService<Venue> {
    @Override
    public DeduplicationResponse<Venue> findDuplicates(Venue venue, DedupeOptions options)
            throws DeduplicationConnectivityException {
    }
}

@Component(value = "EventDeduplicationService")
public class EventDeduplicationService implements DeduplicationService<Event> {
    @Override
    public DeduplicationResponse<Event> findDuplicates(Event event, DedupeOptions options)
            throws DeduplicationConnectivityException {
    }
}
Approach : Optimizations
Ø How to keep the score consistent?
² <similarity class="TfSimilarity"/>
Ø Auto commit settings
² <autoSoftCommit><maxTime>5</maxTime></autoSoftCommit>
Ø Custom PostFilter
² <queryParser name="fdist" class="DistanceQParserPlugin"/>
Ø Custom update handler
² <processor class="VenueUpdateProcessorFactory"></processor>
Results : Sample Output
Input Venue | Matched Venue | Score | Distance
Jillian's Billiards Club, 101 Fourth St. | Jillian's, 175 4th St. | 1.5573 | 5.6352
Lush Lounge, 1092 Post St. | Lush Lounge, 1221 Polk St. | 12.9836 | 16.6501
Mountain Theatre, 10 Panoramic Hwy. | Mountain Theater, Nearby E Ridgecrest Boulevard and Pantoll Road | 3.2509 | 5.8913
Results : Sample Output
Input Venue | Matched Venue | Score | Distance
The Hedley Club at Hotel DeAnza, 233 W. Santa Clara St. | Hedley Club, 233 W. Santa Clara St. | 5.0805 | 0.0000
Sonya Paz Fine Art Gallery, 1793 Lafayette St. | Sonya Paz Gallery and Studio, 1793 Lafayette St. Suite 110 | 6.6764 | 0.0069
Pearl Avenue Library Community Room, 4270 Pearl Ave. | Pearl Avenue Branch Library, 4270 Pearl Ave. | 5.7024 | 0.0000
Milpitas Library, 160 N. Main St. | Milpitas Library, 40 N. Milpitas Blvd. | 16.4318 | 0.7284
Summary
Ø Use case
² Content ingestion
Ø Challenges
² Deduplication
Ø Legacy solution
Ø Our approach
² Used SOLR for text similarity
² Extended default behavior
² REST endpoint over SOLR interface
Ø Next steps
² Big data
² Performer matching
² I18n
Ø Results
Thank You