PyCon Russian 2015 - Dive into full text search with python.
Dive into full text search with Python
Andrii Soldatenko
18-19 September 2015
@a_soldatenko
About me:
• Lead QA Automation Engineer at
• Backend Python Developer at
• Speaker at PyCon Ukraine 2014
• Speaker at PyCon Belarus 2015
• @a_soldatenko
Preface
Information Explosion
Text Search
grep --ignore-case --recursive foo books/
grep --ignore-case --recursive --file=words.txt books/
Entry.objects.get(headline__icontains='foo')
# Read the search words, dropping trailing newlines.
with open('words.txt', 'r') as f:
    words = [line.strip() for line in f]
# Pseudocode: Django has no built-in icontains_in lookup.
Entry.objects.get(headline__icontains_in=words)
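Both grep and icontains scan every document for a substring. A minimal pure-Python sketch of that approach follows; the function name and the sample headlines are illustrative and not from the talk.

```python
def naive_search(documents, words):
    """Return the documents containing any of the words, ignoring case."""
    # Strip whitespace so words read from a file with trailing newlines still match.
    needles = [w.strip().lower() for w in words if w.strip()]
    return [
        doc for doc in documents
        if any(needle in doc.lower() for needle in needles)
    ]


headlines = ["Foo fighters on tour", "Quarterly report", "Buffoon of the year"]
matches = naive_search(headlines, ["foo\n", "bar\n"])
```

Every search reads every document, and "Buffoon" matches "foo" even though it is a different word. A search index is meant to address both problems.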
Full text search
Search index
Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
Inverted index
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------
Inverted index
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
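A minimal sketch of how such an index could be built and queried, assuming whitespace tokenization and no normalization. The helper names are illustrative; real engines store much more per term.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def score(index, query):
    """Count, per document, how many query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)


index = build_index(docs)
totals = score(index, "quick brown")
```

Here `totals` gives Doc_1 two matching terms and Doc_2 one, because the lowercase "quick" appears only in Doc_1.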
Inverted index: normalization
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
------------------------
Index before normalization, for comparison:
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------
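Normalization can be sketched as a function applied to each token before it is indexed. The version below assumes three steps: lowercasing, a hand-written stem table, and a synonym table. Both tables are illustrative stand-ins for a real stemmer and synonym filter.

```python
from collections import defaultdict

# Hypothetical lookup tables covering only the two sample sentences.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def normalize(token):
    """Lowercase the token, reduce it to its stem, then map synonyms."""
    term = token.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)


def build_index(documents):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


index = build_index(docs)
```

The query must go through the same `normalize` step, otherwise a search for "Foxes" would not find the indexed term "fox".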
Search Engines
PostgreSQL
PostgreSQL: operators for textual data types
-- PostgreSQL has operators for textual data types:
-- LIKE  - matches, case-sensitive
-- ILIKE - matches, case-insensitive
-- ~     - matches POSIX regular expression, case-sensitive
-- ~*    - matches POSIX regular expression, case-insensitive
select 'foo' LIKE 'foo'; -- true
select 'bar' ILIKE 'BAR'; -- true
select 'abc' LIKE 'b'; -- false (no wildcards, so the whole string must match)
select 'abc' LIKE 'c'; -- false
select 'abc' ~ 'abc'; -- true
select 'abc' ~ '^a'; -- true
select 'abc' ~ '(b|d)'; -- true
select 'abc' ~ '^(b|c)'; -- false
select 'andrii' ~* '.*Andrii.*'; -- true
PostgreSQL: accuracy issue
select 'prone' like '%one%'; -- true
select 'money' like '%one%'; -- true
select 'lonely' like '%one%'; -- true
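To make the LIKE semantics concrete, here is a small Python emulation built on the standard `re` module. It is only a sketch: the function name is made up, and it ignores LIKE escape characters.

```python
import re


def like(text, pattern, case_insensitive=False):
    """Emulate SQL LIKE: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    # LIKE must match the whole string, hence fullmatch rather than search.
    return re.fullmatch("".join(parts), text, flags) is not None


words = ["prone", "money", "lonely", "one"]
hits = [w for w in words if like(w, "%one%")]
```

All four words end up in `hits`. The pattern has no notion of word boundaries, which is the accuracy issue: a user looking for "one" also gets "prone", "money" and "lonely".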
Full text search in PostgreSQL
1. Creating tokens
2. Converting tokens into Lexemes
3. Storing preprocessed documents
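The three steps can be sketched in plain Python. This is a toy model and not PostgreSQL's parser or dictionaries: the stop-word set is a tiny illustrative list, and the "stemmer" only strips a trailing "s".

```python
import re

STOP_WORDS = {"a", "and", "in", "of", "on", "the"}  # tiny illustrative list


def tokenize(text):
    """Step 1: split the text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def to_lexeme(token):
    """Step 2: crude plural stripping, standing in for a real stemmer."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def to_vector(text):
    """Step 3: store each lexeme with its 1-based positions, skipping stop words."""
    vector = {}
    for position, token in enumerate(tokenize(text), start=1):
        if token in STOP_WORDS:
            continue
        vector.setdefault(to_lexeme(token), []).append(position)
    return vector


vector = to_vector("in the list of stop words")
```

Stop words are dropped but still consume a position, so `vector` is `{'list': [3], 'stop': [5], 'word': [6]}`.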
Full text search in PostgreSQL
27 built-in configurations for 10 languages
Support of user-defined FTS configurations
Pluggable dictionaries, parsers
Inverted indexes
functions to convert normal text to tsvector
explain SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)
explain SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; -- false
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)
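The match operator can be pictured as a boolean test against the set of lexemes in the vector. The sketch below assumes queries use only `&` and `|`, with `&` binding tighter than `|`. It does not handle negation, parentheses or prefix matching, and the function name is illustrative.

```python
def matches(lexemes, query):
    """Return True if any |-separated group has all of its &-joined terms present."""
    return any(
        all(term.strip() in lexemes for term in group.split("&"))
        for group in query.split("|")
    )


# Casting a string to tsvector does not normalize it; it just splits it into lexemes.
vector = set("a fat cat sat on a mat and ate a fat rat".split())

cat_and_rat = matches(vector, "cat & rat")
fat_and_cow = matches(vector, "fat & cow")
cow_or_rat = matches(vector, "cow | rat")
```

The first and third queries match and the second does not, because "cow" is missing from the vector.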
PostgreSQL: index management
CREATE FUNCTION notes_vector_update() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
    END IF;
    IF TG_OP = 'UPDATE' THEN
        IF NEW.name <> OLD.name THEN
            NEW.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
        END IF;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
PostgreSQL: stopwords
SELECT to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
/usr/pgsql-9.3/share/tsearch_data/english.stop
Django:
Malcolm Tredinnick's advice on writing SQL in Django:
“If you need to write advanced SQL you should write it. I would balance that by cautioning against overuse of the raw() and extra() methods.”
PostgreSQL full-text search integration with the Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    search_index = VectorField()

    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',  # this is default
        search_field='search_index',  # this is default
        auto_update_search_field=True
    )
To search, use the manager's search method:
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
Second way
class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    objects = SearchManager(fields=None, search_field=None)

>>> Page.objects.search("documentation & about", fields=('name', 'description'))
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True, fields=('name', 'description'))
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
Pros and Cons
Pros:
• Quick implementation
• No dependency
Cons:
• Indexes need to be managed manually
• Not as flexible as pure search engines
• Tied to PostgreSQL
• No analytics data
• No DSL, only `&` and `|` queries
• Difficult to manage stop words
ElasticSearch
Who uses ElasticSearch?
ElasticSearch: Quick Intro
Relational DB -> Databases -> Tables -> Rows -> Columns
ElasticSearch -> Indices -> Types -> Documents -> Fields
ElasticSearch: Quick Intro
PUT /haystack/user/1
{
    "first_name" : "Andrii",
    "last_name" : "Soldatenko",
    "age" : 30,
    "about" : "I love to go rock climbing",
    "interests": [ "sports", "music" ],
    "likes": [ "python", "django" ]
}
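The same request could be prepared from Python using only the standard library. This is a hedged sketch: `build_index_request` is a hypothetical helper, and the base URL assumes a local node on port 9200. The request is built but not sent, so no running server is needed.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request for /<index>/<type>/<id>."""
    url = "{}/{}/{}/{}".format(base_url.rstrip("/"), index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


request = build_index_request(
    "http://localhost:9200", "haystack", "user", 1,
    {
        "first_name": "Andrii",
        "last_name": "Soldatenko",
        "age": 30,
        "about": "I love to go rock climbing",
        "interests": ["sports", "music"],
        "likes": ["python", "django"],
    },
)
# Passing `request` to urllib.request.urlopen would send it to a running node.
```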
ElasticSearch: Locks
• Pessimistic concurrency control
• Optimistic concurrency control
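Optimistic concurrency control is usually implemented with a version number on each document: a write is rejected if the document changed since the writer last read it. The in-memory sketch below illustrates the idea only. The class and exception names are made up, and it is not Elasticsearch's implementation.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        """Return the (version, document) pair for doc_id."""
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        """Store the document; fail if expected_version is given and stale."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version %s, found %s" % (expected_version, current_version))
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
store.put(1, {"age": 30})
version, _ = store.get(1)            # both writers read version 1
store.put(1, {"age": 31}, expected_version=version)   # first writer succeeds
try:
    store.put(1, {"age": 32}, expected_version=version)  # second writer is stale
    conflict = False
except VersionConflict:
    conflict = True
```

No lock is held between the read and the write. The losing writer gets an error and has to re-read and retry, which is the trade-off compared with pessimistic locking.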
ElasticSearch: Setup
#!/bin/bash
VERSION=1.7.1
curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-$VERSION.zip
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION
# Download plugin marvel
./bin/plugin -i elasticsearch/marvel/latest
echo 'marvel.agent.enabled: false' >> ./config/elasticsearch.yml
# run elastic
./bin/elasticsearch -d
ElasticSearch: Setup
$ curl 'http://localhost:9200/?pretty'
{
  "status" : 200,
  "name" : "Dredmund Druid",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.1",
    "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp" : "2015-07-29T09:54:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
Haystack
Adding search functionality to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
Haystack: Installation
$ pip install django-haystack
$ cat settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    # Added.
    'haystack',
    # Then your usual apps...
    'blog',
]
Haystack: Installation
$ pip install elasticsearch
$ cat settings.py
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
...
Haystack: Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Page

class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    # Assumes Page defines a pub_date field.
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Page

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(
            pub_date__lte=datetime.datetime.now())
Haystack: SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet().\
    exclude(content='hello').\
    filter(content='world')
# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod
# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2
# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod
# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod
# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod
Signals
from django.db import models
from haystack.signals import BaseSignalProcessor

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
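Haystack ships this processor, but it is not active by default. As far as the Haystack 2.x documentation goes, it is enabled with a single setting. The fragment below is a sketch of that settings.py line, not something shown in the talk.

```python
# settings.py
# Update the search index on every model save/delete instead of relying
# solely on periodic `update_index` runs.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
```

Real-time updates add a search-engine round trip to every save, so on write-heavy sites a scheduled update_index run may still be the better choice.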
Haystack: Pros and Cons
Pros:
• Easy to set up
• Looks like the Django ORM, but for searches
• Search engine independent
• Supports 4 engines (Elastic, Solr, Xapian, Whoosh)
Cons:
• Poor SearchQuerySet API
• Difficult to manage stop words
• Loses performance because of the extra layer
• Model-based
Future FTS and Roadmap Django 1.9
• PostgreSQL Full Text Search (Marc Tamlyn)
https://github.com/django/django/pull/4726
• Custom indexes (Marc Tamlyn)
• etc.
Final Thoughts
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
Thank You
@a_soldatenko
https://asoldatenko.com
We are hiring
Questions?