Scaling / optimizing search on netlog
Uploaded by jayme-rotsaert
Scaling search & content filtering
Search optimization
Netlog => social network
• meet / connect to new people => search essential
• localized content => content filtering essential
Types of searches
Content filtering
Search filtering
Daily search statistics on Netlog
How to handle this
Problem 1: Large number of requests + each request is kind of unique
Problem 2: Content to search on is spread out
• different distributions (nl, en, fr, ..)
• each with their own database hosts / isolations: videos, photos, ...
• different shards as explained previously
Solution #1
Add full-text indexes to the existing tables and aggregate the different data later on, e.g. VIDEOS:
• Full-text index on title, tags, description
• Combine results at the end
Problems
• Large indexes
• Not all indexes are effective
• Locking of tables => searches have an impact on other things on the site
• May work well for a small site, but otherwise => BAD
Solution #2
Create separate tables with full-text indexes specifically for search queries, e.g. VIDEOS:
• Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
• Combine title, tags, description, .. into one MySQL text field: "searchvideo"
• Add a full-text index on it. Combine results at the end.
Problems
• Duplication of data may cause inconsistencies
• Not easy to rebuild (takes a very long time)
• Peak moments: updates of changes + a lot of searches => table locks (MyISAM)
Solution #3 ...almost there :)
Looking for non-MySQL-based alternatives
• Google
  • no control over the results or over what is being indexed / when it is being indexed
• Yahoo BOSS
  • promising, a great step towards making search more open; it is rather new, so it may suffer from bugs
  • still relies on a third party for delivering your results, e.g. the footnote on the site: "* BOSS offers developers unlimited daily queries, though Yahoo! reserves the right to limit unintended usage"
• Lucene
  • Java based, an Apache Software Foundation project
  • our servers are not optimized for running Java / Tomcat + more custom coding is needed to build a PHP <-> Java bridge
• Sphinx
  • C++ based, more in-house expertise
  • fast results in test setup
Solution #3 ...sphinx!
How Sphinx works:
• Full-text search engine
• Two essential CLI tools:
  • indexer
    • creates the indexes
  • searchd
    • daemon that serves indexes & handles search requests, delivers results in the form of documentids & attributes
    • uses a custom protocol for retrieving results => you need a Sphinx API in PHP, Java, .. to talk to this daemon (use the "search" CLI tool for debugging)
• Some Sphinx terminology
  • sphinx.conf: the basic config file, with two essential parts: sources & indexes
  • documentid: id that uniquely identifies a document in the Sphinx search index (must be unique!)
  • attribute: each documentid can have additional attributes, these can
Indexing (1)
• Indexing
  • We need to index a data source (SQL database, text files, HTML files, ..). Defining this in sphinx.conf can be as easy as:

    source users
    {
        type = mysql
        sql_host = localhost
        sql_db = localdb
        sql_user = jayme
        sql_pass = *******
        sql_port = 3306
        sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS
        sql_attr_uint = counter_photos
    }

  • We define counter_photos as an attribute, because we want to sort / filter on it later on.
Indexing (2)
1. Define the index in the config, which searchd will serve. An index can have more than one source.

    index users
    {
        docinfo = extern
        source = users
        path = /var/lib/sphinx/data/users
    }

2. When running the indexer, Sphinx splits up each document (an SQL record in our case) into several words. Internally it:
    a. creates a dictionary of all of these words (WordIDs)
    b. keeps references to documentIDs for each WordID
    c. stores attributes with references to documentIDs
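The three internal steps above can be pictured with a small Python sketch. This is only a toy model of the idea, not how Sphinx actually stores its data: the function names build_index and search, the sample rows and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(rows, text_fields, attr_fields):
    """Build a toy inverted index from SQL-like rows (dicts with an 'id' key)."""
    dictionary = {}               # a. word -> WordID
    postings = defaultdict(set)   # b. WordID -> set of documentIDs
    attributes = {}               # c. documentID -> attribute values
    for row in rows:
        doc_id = row["id"]
        attributes[doc_id] = {name: row[name] for name in attr_fields}
        for field in text_fields:
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            for word in str(row[field]).lower().split():
                word_id = dictionary.setdefault(word, len(dictionary) + 1)
                postings[word_id].add(doc_id)
    return dictionary, postings, attributes


def search(word, dictionary, postings):
    """Return the sorted documentIDs that contain the given word."""
    word_id = dictionary.get(word.lower())
    if word_id is None:
        return []
    return sorted(postings[word_id])


# Hypothetical rows shaped like the USERS query in the source definition.
rows = [
    {"id": 1, "firstname": "Anna", "lastname": "Peeters", "counter_photos": 12},
    {"id": 2, "firstname": "Jan", "lastname": "Peeters", "counter_photos": 0},
]
dictionary, postings, attributes = build_index(
    rows, ["firstname", "lastname"], ["counter_photos"]
)
matches = search("peeters", dictionary, postings)
```

A lookup only returns documentIDs; the attribute values stored per documentID are what make sorting and filtering possible without going back to the database.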
Indexing (3) & searching
• Indexing
    ./indexer -c ../etc/sphinx.conf users
  or (when searchd is running)
    ./indexer -c ../etc/sphinx.conf users --rotate
• Searching
  using the PHP API:
Sphinx netlog setup
We use a main+delta scheme
• main: for each search type (people, video, photo, ..) we have a main index that is rebuilt every night. It takes +- 20 minutes to rebuild the largest table that we have.
• delta: changes to videos, photos, .. are tracked in a table, e.g. SPHINX_PHOTO_UPDATE, with 1 column: the ID of the photo. Every half hour Sphinx regenerates a delta index based on this table. The table is truncated once a day.
• When searching we use 2 indexes: $cl->Query('test', 'users users_delta')
  Sphinx gives precedence to the last index listed, so where needed the newer content is found / returned.
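The effect of querying main and delta together can be sketched in a few lines of Python. This models only the precedence rule described above. merge_results and the sample data are hypothetical, and the real merging happens inside searchd.

```python
def merge_results(*index_results):
    """Merge per-index matches; for a duplicate documentID the later index wins."""
    merged = {}
    for matches in index_results:   # in query order, e.g. main first, then delta
        for doc_id, attrs in matches.items():
            merged[doc_id] = attrs  # a later index overwrites an earlier one
    return merged


# Hypothetical matches: documentID -> attributes, one dict per index.
main_matches = {1: {"title": "old title"}, 2: {"title": "cats"}}
delta_matches = {1: {"title": "new title"}}

results = merge_results(main_matches, delta_matches)
```

Document 1 was changed after the nightly rebuild, so the copy from the delta index replaces the stale copy from the main index, while document 2 is still served from main.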
Future developments on Netlog with sphinx
Indexing of shards (messages / friendships)
• Running an indexer on each shard
• Creating a main index for x shards (merge these shards into 1)
• Running distributed searches on these indexes
Generation of tag clouds
    ./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100 --buildfreqs
=> Sphinx has an option to generate the most used words in an index, which can be relevant for tags
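What such a "most used words" list amounts to can be sketched in Python. This is a rough illustration of the counting idea only, not the indexer's implementation: top_words, the whitespace tokenizer and the sample texts are assumptions.

```python
from collections import Counter


def top_words(documents, limit):
    """Return the `limit` most frequent words with their counts."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        counts.update(text.lower().split())
    return counts.most_common(limit)


# Hypothetical document texts.
docs = ["red cat", "red dog", "blue cat red"]
tag_candidates = top_words(docs, 2)
```

The counts can then be mapped to font sizes when rendering the tag cloud.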
Some sphinx tips & tweaks
• Use range queries when indexing data: always try to have an auto-increment field on MySQL tables when indexing. Sphinx has a mechanism that indexes data in ranges, thus avoiding table locks (WHERE id > 1000 AND id < 2000, etc.)
• Narrowest search first (e.g. when searching for users in Belgium who like basketball: @hobbies basket @country BE)
• Avoid searches on small words with OR (e.g. the|new|...)
• Define a charset table when indexing UTF-8 / foreign languages
• Check that there are no trailing spaces after \ in your sphinx.conf when using multi-line queries; otherwise this can cause weird errors.
• Cache results!
• More info / advanced usage on: sphinxsearch.com
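The range-query tip at the top of this list can be illustrated with a short Python sketch of the chunking idea. In Sphinx itself this is configured through sql_query_range and sql_range_step in sphinx.conf. The helper names, the step size and the query text below are assumptions for illustration.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) ID ranges that cover min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def range_queries(base_query, min_id, max_id, step):
    """Build one bounded SELECT per range instead of one query over the whole table."""
    return [
        f"{base_query} WHERE id >= {start} AND id <= {end}"
        for start, end in id_ranges(min_id, max_id, step)
    ]


# Hypothetical table with IDs 1..2500, fetched 1000 rows at a time.
queries = range_queries("SELECT id, firstname, lastname FROM USERS", 1, 2500, 1000)
```

Each query touches only a bounded slice of the table, so no single statement holds a lock for the duration of a full table scan.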
Questions?
netlog.com/go/[email protected] - [email protected]