Scaling / optimizing search on netlog

Scaling search & content filtering

Search optimization

Netlog => social network
• meet / connect to new people => search essential
• localized content => content filtering essential

Types of searches

Content filtering

Search filtering

Daily search statistics on Netlog

How to handle this

Problem 1: Large number of requests + each request is more or less unique

Problem 2: Content to search on is spread out
• different distributions (nl, en, fr, ...)
• each with their own database hosts / isolations: videos, photos, ...
• different shards, as explained previously

Solution #1

Add fulltext indexes to the existing tables and aggregate the different data later on, e.g. for VIDEOS:
• full text index on title, tags, description, ...
• combine results at the end

Problems
• Large indexes
• Not all indexes are effective
• Locking of tables => searches have an impact on other things on the site
• May work well for a small site, but otherwise => BAD

Solution #2

Create separate tables with fulltext indexes, specifically for search queries, e.g. for VIDEOS:
• Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
• Combine title, tags, description, ... in 1 MySQL text field: "searchvideo"
• Add a full text index on it
• Combine results at the end

Problems
• Duplication of data may cause inconsistencies
• Not easy to rebuild (takes a very long time)
• Peak moments: updates of changes + a lot of searches => table locks (MyISAM)
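A minimal sketch of how such a search table could be filled, assuming the VIDEOS table has the columns videoid, title, tags and description. It uses Python's built-in sqlite3 as a stand-in for MySQL and does not create a fulltext index; the function name build_search_table is hypothetical.

```python
import sqlite3


def build_search_table(conn):
    """Rebuild SEARCH_VIDEOS with one combined text field per video."""
    conn.execute("DROP TABLE IF EXISTS SEARCH_VIDEOS")
    conn.execute(
        "CREATE TABLE SEARCH_VIDEOS (videoid INTEGER PRIMARY KEY, searchvideo TEXT)"
    )
    # Concatenate the searchable columns into the single "searchvideo" field.
    conn.execute(
        "INSERT INTO SEARCH_VIDEOS (videoid, searchvideo) "
        "SELECT videoid, title || ' ' || tags || ' ' || description FROM VIDEOS"
    )
    conn.commit()


# In-memory database with one sample row (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VIDEOS (videoid INTEGER PRIMARY KEY, title TEXT, tags TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO VIDEOS VALUES (1, 'Beach trip', 'summer sea', 'Our holiday video')"
)
build_search_table(conn)
rows = conn.execute("SELECT videoid, searchvideo FROM SEARCH_VIDEOS").fetchall()
```

Because the combined field is a copy of the source columns, every change to a video has to be written twice, which is where the inconsistency and rebuild problems listed above come from.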

Solution #3 ...almost there :)

Looking for non-MySQL-based alternatives
• Google
  • no control over results, or over what is being indexed / when it is being indexed
• Yahoo BOSS
  • promising, a great step towards making search more open; it is rather new, so it may suffer from bugs
  • you still rely on a third party for delivering your results, e.g. the footnote on the site: * BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage
• Lucene
  • Java based, an Apache Software Foundation project
  • servers are not optimized for running Java / Tomcat + more custom coding is needed to make a PHP <-> Java bridge
• Sphinx
  • C++ based, more in-house expertise
  • fast results in test setup

Solution #3 ...sphinx!

How Sphinx works:
• Full text search engine
• Two essential CLI tools:
  • indexer
    • creates indexes
  • searchd
    • daemon that serves indexes & handles search requests, delivers results in the form of documentids & attributes
    • uses a custom protocol for retrieving results => you need a Sphinx API in PHP, Java, ... to talk to this daemon (use search for debugging)
• Some Sphinx terminology
  • sphinx.conf: the basic config file, with two essential parts: sources & indexes
  • documentid: id that uniquely identifies a document in the Sphinx search index (must be unique!)
  • attribute: each documentid can have additional attributes, these can ...

Indexing (1)

• Indexing
  • We need to index a data source (SQL database, text files, HTML files, ...). Defining this in sphinx.conf can be as easy as:

    source users
    {
        type = mysql
        sql_host = localhost
        sql_db = localdb
        sql_user = jayme
        sql_pass = *******
        sql_port = 3306
        sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS
        sql_attr_uint = counter_photos
    }

  • We define counter_photos as an attribute, because we want to sort / filter on it later on.

Indexing (2)

1. Define the index in the config, which searchd will serve. An index can have more than 1 source.

    index users
    {
        docinfo = extern
        source = users
        path = /var/lib/sphinx/data/users
    }

2. When running the indexer, Sphinx splits up each document (SQL record in our case) into several words. Internally it:
   a. creates a dictionary of all of these words (WordIDs)
   b. keeps references to documentIDs for each WordID
   c. stores attributes with references to documentIDs
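The three internal steps can be illustrated with a toy inverted index. This is only a sketch of the idea, not how Sphinx stores its data on disk; build_index, search and the sample records are hypothetical, and the field names are taken from the users source shown earlier.

```python
from collections import defaultdict


def build_index(records, text_fields, attr_fields):
    """Toy index: (a) word dictionary, (b) WordID -> documentIDs, (c) attributes."""
    dictionary = {}               # word -> WordID
    postings = defaultdict(set)   # WordID -> set of documentIDs
    attributes = {}               # documentID -> {attribute name: value}
    for rec in records:
        doc_id = rec["id"]
        attributes[doc_id] = {name: rec[name] for name in attr_fields}
        for field in text_fields:
            for word in rec[field].lower().split():
                # Assign the next free WordID the first time a word is seen.
                word_id = dictionary.setdefault(word, len(dictionary) + 1)
                postings[word_id].add(doc_id)
    return dictionary, postings, attributes


def search(index, word):
    """Return (documentID, attributes) pairs for documents containing the word."""
    dictionary, postings, attributes = index
    word_id = dictionary.get(word.lower())
    if word_id is None:
        return []
    return [(doc_id, attributes[doc_id]) for doc_id in sorted(postings[word_id])]


# Illustrative records only.
users = [
    {"id": 1, "firstname": "Ann", "lastname": "Peeters", "counter_photos": 12},
    {"id": 2, "firstname": "Jan", "lastname": "Peeters", "counter_photos": 0},
]
index = build_index(users, ["firstname", "lastname"], ["counter_photos"])
matches = search(index, "peeters")
```

Note that a search returns only documentIDs and attributes, never the original text, which matches what searchd delivers.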

Indexing (3) & searching

• Indexing

    ./indexer -c ../etc/sphinx.conf users

  or, when searchd is running:

    ./indexer -c ../etc/sphinx.conf users --rotate

• Searching using the PHP API:

Sphinx netlog setup

We use a main+delta scheme.

main: For each search type (people, video, photo, ...) we have a main index that is rebuilt every night. It takes +- 20 minutes to rebuild the largest table that we have.

delta: Changes to videos, photos, ... are tracked in a table, e.g. SPHINX_PHOTO_UPDATE, with 1 column: the ID of the photo. Every half hour Sphinx regenerates a delta index based on this table. This table is truncated once a day.

When searching we use 2 indexes: $cl->Query('test', 'users users_delta')
Sphinx will use the last index first when searching, so where needed the newer content is found / returned.
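The precedence between main and delta can be sketched as follows. This models only the "last index wins" rule for duplicate documentIDs, under the assumption that each index returns a list of (documentID, attributes) matches; it does not reproduce Sphinx ranking, and merge_results is a hypothetical helper.

```python
def merge_results(*index_results):
    """Merge matches from several indexes, given in the order they were queried.

    When the same documentID appears in more than one index, the match from
    the index listed last replaces the earlier one.
    """
    merged = {}
    for matches in index_results:
        for doc_id, attrs in matches:
            merged[doc_id] = attrs
    return merged


# Illustrative data: photo 1 was changed after the nightly main rebuild.
main_matches = [(1, {"title": "old title"}), (2, {"title": "sunset"})]
delta_matches = [(1, {"title": "new title"})]
merged = merge_results(main_matches, delta_matches)
```

With this ordering the changed photo is returned with its new data, while unchanged photos still come from the main index.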

Future developments on Netlog with sphinx

Indexing of shards (messages / friendships)
• Running an indexer on each shard
• Creating a main index for x shards (merge these shards into 1)
• Running distributed searches on these indexes

Generation of tag clouds

    ./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100

--buildstops => Sphinx has an option to generate the most used words in an index, which can be relevant for tags.
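Conceptually, such a list of most used words is a frequency count over all indexed text. A small sketch of that idea, assuming plain whitespace tokenization; most_used_words is a hypothetical helper and does not reproduce the output file format of the indexer.

```python
from collections import Counter


def most_used_words(documents, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts.most_common(limit)


# Illustrative documents only.
docs = ["basket ball", "basket fan", "ball basket"]
top_words = most_used_words(docs, 2)
```

For a tag cloud, the counts can then be mapped to font sizes.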

Some sphinx tips & tweaks

• Use range queries when indexing data: always try to have an auto-increment field on MySQL tables when indexing. Sphinx has a mechanism that indexes ranges of data, thus avoiding table locks (where id > 1000 AND id < 2000 etc.)
• Narrowest search first (e.g. when searching for users in Belgium with basketball as a hobby: @hobbies basket @country BE)
• Avoid searches on small words with OR (e.g. the|new|...)
• Define a charset table when indexing UTF-8 / foreign languages
• Check that there are no trailing spaces after \ in your sphinx.conf when using multi-line queries; otherwise this can cause weird errors
• Cache results!
• More info / advanced usage on: sphinxsearch.com
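The range query tip can be sketched as splitting one large SELECT into many small ones over the auto-increment id. In sphinx.conf this is configured with sql_query_range and sql_range_step plus the $start / $end macros in sql_query; the Python below only illustrates the stepping logic, and id_ranges and ranged_queries are hypothetical helpers.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) id ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def ranged_queries(table, columns, min_id, max_id, step):
    """Build one small SELECT per id range instead of one big table scan."""
    cols = ", ".join(columns)
    return [
        f"SELECT {cols} FROM {table} WHERE id >= {start} AND id <= {end}"
        for start, end in id_ranges(min_id, max_id, step)
    ]


# Illustrative values: ids 1..2500 fetched in steps of 1000.
ranges = list(id_ranges(1, 2500, 1000))
queries = ranged_queries("USERS", ["id", "firstname", "lastname"], 1, 2500, 1000)
```

Each query touches only a bounded slice of the table, so no single statement holds a lock for the duration of a full scan.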

Questions?

netlog.com/go/[email protected] - [email protected]
