Real time fulltext search with sphinx
![Page 1: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/1.jpg)
Real time fulltext search
with Sphinx
Adrian Nuta // Sphinxsearch // 2013
![Page 2: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/2.jpg)
Quick intro
Sphinx search
• high performance fulltext search engine
• written in C++
• serving searches since 2001
• can work on any modern architecture
• distributed under GPL2 licence
![Page 3: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/3.jpg)
Why a search engine?
• performance
o a search engine delivers searches faster and with fewer resources
• quality of search
o built-in FTS in databases doesn’t offer advanced search options
• independent FTS engines offer speed not only for FT searches, but also for other types, like geo or faceted searches
![Page 4: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/4.jpg)
Classic way of indexing in Sphinx
on-disk (classic) method:
• use a data source which is indexed
• to update the index you need to reindex
• in addition to the main index, a secondary (delta) index can be used to reindex only the latest changes
• easy because indexing doesn’t require changes
in the application, but:
• reindexing, even delta one, can put pressure
on data source and system
![Page 5: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/5.jpg)
Real time indexing in Sphinx
• index has no data source
• everything that needs to be indexed must be added manually to the index
• you can add/update/remove at any time
• compared to the classic method, RT requires changes in the application
• performance is the same or nearly the same as with a classic index
• only specific requirement: workers = threads
![Page 6: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/6.jpg)
Structures
![Page 7: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/7.jpg)
RealTime index definition
index rt {
type = rt
rt_field = title
rt_field = content
rt_attr_uint = user_id
rt_attr_string = title
rt_attr_json = metadata
}
![Page 8: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/8.jpg)
Schema - Fields
rt_field - fulltext field, raw text is not stored
Tokenization features:
wildcarding (prefix or infix),
morphology, custom charset definition,
stopwords, synonyms, segmentation, html
stripping, paragraph/sentence detection etc.
![Page 9: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/9.jpg)
Schema - Attributes
• rt_attr_uint & rt_attr_bigint
• rt_attr_bool
• rt_attr_float
• rt_attr_multi & rt_attr_multi_64 - integer sets
• rt_attr_timestamp
• rt_attr_string - actual text stored, kept in memory, used only for display, sorting and grouping.
• rt_attr_json - full support for JSON documents
![Page 10: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/10.jpg)
Content manipulation
![Page 11: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/11.jpg)
Quick intro to SphinxQL
• our SQL dialect
• any mysql client can be used to connect to
Sphinx
• MySQL server is not required!
• Full document updates only possible with
SphinxQL
• to enable it, add in searchd section of config
listen = host:port:mysql41
![Page 12: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/12.jpg)
Content insert
$mysql> INSERT INTO rt
(id,title,content,user_id,metadata)
VALUES(100,'My title','Some long content to search',10,
'{"image_id":1,"props":[20,30,40]}');
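When the statement is assembled by application code, the JSON attribute is easier to serialize than to type by hand. A minimal Python sketch, assuming hypothetical helpers `quote` and `build_insert` with simplified escaping (a real application would more likely rely on the escaping of its MySQL client library):

```python
import json

def quote(value):
    """Render a Python value as a SphinxQL literal (simplified escaping)."""
    if isinstance(value, (dict, list)):
        # JSON attributes are sent as a string literal containing JSON.
        value = json.dumps(value, separators=(",", ":"))
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)

def build_insert(index, row, verb="INSERT"):
    """Build an INSERT (or REPLACE) statement for an RT index."""
    columns = ",".join(row)
    values = ",".join(quote(v) for v in row.values())
    return f"{verb} INTO {index} ({columns}) VALUES({values});"

row = {
    "id": 100,
    "title": "My title",
    "content": "Some long content to search",
    "user_id": 10,
    "metadata": {"image_id": 1, "props": [20, 30, 40]},
}
statement = build_insert("rt", row)
print(statement)
```

Passing `verb="REPLACE"` produces the full-document replace form with the same column list.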
![Page 13: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/13.jpg)
Full content replace
$mysql> REPLACE INTO rt
(id,title,content,user_id,metadata)
VALUES(100,'My title','Some long content to search',10,
'{"image_id":1,"props":[20,30,40]}');
• needed for text field, json and string attribute
updates
![Page 14: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/14.jpg)
Updating numerics
• For numeric attributes including MVA:
$mysql> UPDATE rt SET user_id = 10 WHERE id = 100;
• For numeric JSON elements it’s possible to do in-place updates:
$mysql> UPDATE rt SET metadata.image_id = 1234 WHERE id = 100;
![Page 15: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/15.jpg)
Deleting
$mysql> DELETE FROM rt WHERE id = 100;
$mysql> DELETE FROM rt WHERE user_id > 100;
$mysql> TRUNCATE RTINDEX rt;
• TRUNCATE empties the memory shard, deletes all disk shards and releases the index binlogs
![Page 16: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/16.jpg)
Adding new attributes
mysql> ALTER TABLE rt ADD COLUMN gid INTEGER;
• only for int/bigint/float/bool attributes for now
![Page 17: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/17.jpg)
Searching
![Page 18: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/18.jpg)
Searching
• no difference in searching a RT or classic
index
• dict = keywords required for wildcard search.
![Page 19: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/19.jpg)
Relevancy ranking
• built-in rankers:
o proximity_bm25 (default)
o none, matchany, wordcount, fieldmask, bm25
• custom ranker - create your own ranking expression
example:
ranker = proximity_bm25
is the same as
ranker = expr('sum(lcs*user_weight)*1000+bm25')
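To see what this expression computes, here is a worked example in Python with made-up values. `lcs`, `user_weight` and `bm25` are the ranking factors named in the expression; the numbers and the helper `proximity_bm25` are purely illustrative.

```python
def proximity_bm25(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25 over the fulltext fields.

    fields: list of (lcs, user_weight) pairs, one per fulltext field.
    bm25:   document-level BM25 score (an integer in this sketch).
    """
    return sum(lcs * weight for lcs, weight in fields) * 1000 + bm25

# Hypothetical document: the title (weight 10) has lcs = 2,
# the content (weight 1) has lcs = 1.
fields = [(2, 10), (1, 1)]
weight = proximity_bm25(fields, bm25=650)
print(weight)  # (2*10 + 1*1) * 1000 + 650 = 21650
```

Because the per-field sum is multiplied by 1000, the phrase proximity part dominates and bm25 mostly breaks ties between documents with the same proximity score.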
![Page 20: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/20.jpg)
Tokenization settings example
index rt {
…
charset_type = utf-8
dict = keywords
min_word_len = 2
min_infix_len = 3
morphology = stem_en
enable_star = 1
…
}
![Page 21: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/21.jpg)
Operators on fulltext fields
• Boolean: hello | world, hello ! world
• phrasing: "hello world"
• proximity: "hello world"~10
• quorum: "world is a beautiful place"/3
• exact form: =cats and =dogs
• strict order: cats << and << dogs
• zone limit: ZONE:(h2,h4) cats and dogs
• SENTENCE: all SENTENCE words SENTENCE "in one sentence"
• PARAGRAPH: "this search" PARAGRAPH "is fast"
• selected fields only: @(title,body) hello world
• excluded fields: @!(title,body) hello world
Using API
<?php
require("sphinxapi.php");
$cl = new SphinxClient();
$res = $cl->Query('search me now','rt');
print_r($res);
Official: PHP, Python, Ruby, Java, C
Unofficial: JS (Node.js), Perl, C++, Haskell, .NET
![Page 23: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/23.jpg)
Using SphinxQL
$mysql> SELECT * FROM rt
WHERE MATCH('"search me fuzzy"~10') AND featured = 1
LIMIT 0,20;
$mysql> SELECT * FROM rt
WHERE MATCH('"search me fuzzy"~10 @tag computers') AND featured = 1
GROUP BY user_id ORDER BY title ASC LIMIT 30,60
OPTION field_weights=(title=10,content=1),
ranker=expr('sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25');
![Page 24: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/24.jpg)
Boolean filtering
$mysql> SELECT *,
views > 10 OR category = 4 AS cond
FROM rt WHERE
MATCH('"search me proximity"~10') AND
featured = 1 AND cond = 1
GROUP BY user_id ORDER BY title ASC
LIMIT 30,60 OPTION ranker=sph04;
![Page 25: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/25.jpg)
Geo search
mysql> SELECT *, GEODIST(lat,long,0.71147,-1.29153) AS distance
FROM rt WHERE distance < 1000 ORDER BY distance ASC;
mysql> SELECT *, GEODIST(lat,long,40.76439,-73.99976,
{in=degrees,out=miles,method=adaptive}) AS distance
FROM rt WHERE distance < 10 ORDER BY distance ASC;
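The two queries appear to use the same anchor point: without options GEODIST takes radians and returns meters, while the second query passes degrees and asks for miles. The Python sketch below converts the degree values to radians and computes a great-circle distance with the haversine formula. The Earth radius constant and the choice of haversine are assumptions of this sketch, not necessarily what Sphinx uses internally.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an assumption of this sketch

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Anchor point of the second query, converted from degrees to radians.
lat = math.radians(40.76439)
lon = math.radians(-73.99976)
print(lat, lon)

# Distance to a hypothetical point 0.001 rad further north, in meters.
print(haversine_m(lat, lon, lat + 0.001, lon))
```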
![Page 26: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/26.jpg)
Multi-queries
mysql> DELIMITER \\
mysql> SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_one ORDER BY counter DESC;
SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_two ORDER BY counter DESC;
SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
GROUP BY property_three ORDER BY counter DESC;
\\
• used for faceting
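Conceptually, each statement in the batch groups the same set of matches by a different attribute. A small Python sketch of that idea, using made-up documents and a hypothetical helper `facet`; it models the result shape only, not how Sphinx executes a multi-query:

```python
from collections import Counter

# Hypothetical documents that already matched the fulltext query.
matches = [
    {"id": 1, "property_one": "red", "property_two": "S"},
    {"id": 2, "property_one": "red", "property_two": "M"},
    {"id": 3, "property_one": "blue", "property_two": "M"},
]

def facet(rows, attribute):
    """Count matches per attribute value, most frequent first."""
    return Counter(row[attribute] for row in rows).most_common()

# One facet per attribute, like one GROUP BY query per attribute.
facets = {name: facet(matches, name) for name in ("property_one", "property_two")}
print(facets)
```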
![Page 27: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/27.jpg)
Internals
![Page 28: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/28.jpg)
Internal architecture
Each RT index is a sharded index consisting of:
• one memory shard for latest content
• one or more disk shards
![Page 29: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/29.jpg)
Internal shards management
rt_mem_limit = maximum size of the memory shard
When full, the memory shard is flushed to disk as a new disk shard.
• OPTIMIZE INDEX rt - merges all disk shards into one
o Merging too intensive? Throttle it with rt_merge_iops and rt_merge_maxiosize
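The shard lifecycle can be pictured with a toy Python model. The class `RtIndexModel` is hypothetical and counts documents rather than bytes (the real rt_mem_limit is a memory size), so it only mirrors the flush and merge behaviour described on this slide.

```python
class RtIndexModel:
    """Toy model of an RT index: one memory shard plus disk shards."""

    def __init__(self, mem_limit):
        self.mem_limit = mem_limit  # stands in for rt_mem_limit, in documents
        self.memory_shard = []
        self.disk_shards = []

    def insert(self, doc):
        self.memory_shard.append(doc)
        if len(self.memory_shard) >= self.mem_limit:
            # Memory shard is full: save it as a new disk shard.
            self.disk_shards.append(self.memory_shard)
            self.memory_shard = []

    def optimize(self):
        """Merge all disk shards into one, in the spirit of OPTIMIZE INDEX."""
        if self.disk_shards:
            merged = [doc for shard in self.disk_shards for doc in shard]
            self.disk_shards = [merged]

index = RtIndexModel(mem_limit=3)
for doc_id in range(1, 9):
    index.insert(doc_id)
print(len(index.disk_shards), index.memory_shard)  # 2 disk shards, [7, 8] in memory

index.optimize()
print(len(index.disk_shards), index.memory_shard)  # 1 disk shard, [7, 8] in memory
```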
![Page 30: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/30.jpg)
Binlog support
Sphinx supports binlogs, so the memory shard will not be lost in case of a disaster
• binlog_flush
o like innodb_flush_log_at_trx_commit
o 0 - flush and sync every second - fastest, up to 1 second of data can be lost
o 1 - flush and sync every transaction - safest, but slowest
o 2 - flush every transaction, sync every second - best balance, default mode
• binlog_path
o binlog_path = # disable logging
![Page 31: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/31.jpg)
Fast RT setup using classic index
• Create a classic index to get the initial data.
• Declare an RT index.
• mysql> ATTACH INDEX classic TO RTINDEX rt
o transforms the classic index into an RT index
o the operation is almost instant: in essence it is a file rename, the classic index becomes an RT disk shard
![Page 32: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/32.jpg)
Sphinx uses 1 CPU core per index
More power?
Distribute!
![Page 33: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/33.jpg)
Distributed RT index
Update on each shard, search on everything
index distributed
{
type = distributed
local = rtlocal_one
local = rtlocal_two
agent = some.ip:rtremote_one
}
don’t forget about dist_threads = x
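The "search on everything" half can be sketched in Python as a fan-out followed by a merge. Everything here is a hypothetical toy: shards are plain dictionaries, matching is a substring count, and the thread pool only loosely stands in for what dist_threads enables for local indexes.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, keyword):
    """Return (doc_id, weight) pairs for documents containing keyword."""
    return [(doc_id, text.count(keyword))
            for doc_id, text in shard.items() if keyword in text]

def search_distributed(shards, keyword, limit=20):
    """Query every shard in parallel and merge the results by weight."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda shard: search_shard(shard, keyword), shards))
    merged = [hit for hits in partial for hit in hits]
    # Highest weight first; document id breaks ties.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))[:limit]

shard_one = {1: "sphinx search", 2: "real time index"}
shard_two = {3: "search search engine", 4: "binlog"}
results = search_distributed([shard_one, shard_two], "search")
print(results)  # [(3, 2), (1, 1)]
```

Writes are the opposite case: the application decides which shard a document belongs to and sends the INSERT or REPLACE to that shard only.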
![Page 34: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/34.jpg)
Copy RT index from one server to
another
• just simulate a daemon restart
o searchd --stopwait
o flushes the memory shard to disk
• Copy all index files to the new server.
• Add the RT index to sphinx.conf on the new server.
• Start searchd on the new server.
![Page 35: Real time fulltext search with sphinx](https://reader030.fdocuments.us/reader030/viewer/2022012405/554a3d57b4c905293a8b4dac/html5/thumbnails/35.jpg)
Questions?
www.sphinxsearch.com
Docs: http://sphinxsearch.com/docs/
Wiki: http://sphinxsearch.com/wiki/
Official blog: http://sphinxsearch.com/blog/
SVN repository: https://code.google.com/p/sphinxsearch/