Oktavia Search Engine - pyconjp2014

DeNA Co, Ltd. Yoshiki Shibukawa 9/14/2014 PyConJP


Page 1: Oktavia Search Engine - pyconjp2014

DeNA Co, Ltd. Yoshiki Shibukawa

9/14/2014 PyConJP

Page 2: Oktavia Search Engine - pyconjp2014

• Yoshiki Shibukawa
• Works for DeNA Co, Ltd.
• @shibu_jp (Twitter)
• yoshiki.shibukawa (Facebook)
• [email protected] (mail)

• Languages: C/C++, Python, JavaScript

• Founder of sphinx-users.jp
• San Francisco -> Tokyo

Page 3: Oktavia Search Engine - pyconjp2014

• The basics of existing search engines
• The structure of Oktavia
• Oktavia API examples

Page 4: Oktavia Search Engine - pyconjp2014

• In some cases, an inverted index is not a good fit for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
• It includes only the essential part of Oktavia.
• I will add more features.

Page 5: Oktavia Search Engine - pyconjp2014
Page 6: Oktavia Search Engine - pyconjp2014

AM.txt (0)

• Good morning

• Hi

PM.txt (1)

• Good afternoon

• Good evening

• Hi

Page 7: Oktavia Search Engine - pyconjp2014

Word        Document ID
Good        0, 1
Morning     0
Afternoon   1
Evening     1
Hi          0, 1

• Word -> Document
• Split the query string into words, look up each word in the table, and show the result.

Good Morning → (0, 1) and (0,) → (0,)
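The lookup above can be sketched in a few lines of Python. This is only an illustration of the inverted index idea using the AM.txt / PM.txt example; the function names are made up for this sketch and are not part of Oktavia.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Split the query by spaces and intersect the posting sets of each word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = [
    "Good morning Hi",                  # AM.txt (0)
    "Good afternoon Good evening Hi",   # PM.txt (1)
]
index = build_inverted_index(documents)
hits = search(index, "Good Morning")    # (0, 1) and (0,) -> document 0 only
```

Note that the whole approach depends on `split()`: it only works when the language puts spaces between words.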

Page 8: Oktavia Search Engine - pyconjp2014

• It is nice weather to go out to PyConJP. (English)
• 这是不错的天气出去PyConJP (Chinese)
• 今日はPyConJPに出かけるにはいい天気ですね (Japanese)
• 그것은 PyConJP 에 외출 좋은 날씨 입니다 (Korean ※)

※ Korean puts spaces between groups of words, but not between individual words.

Page 9: Oktavia Search Engine - pyconjp2014

今日はPyConJPに出かけるにはいい天気ですね

今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね

• Split the text into words using a natural language processor such as ChaSen, MeCab, or Kuromoji.
• This needs deep knowledge of each language and a big dictionary.

Page 10: Oktavia Search Engine - pyconjp2014

Word        Doc ID
今日        0
は          0, 0
PyConJP     0
に          0, 0
出かける    0
いい        0
天気        0
です        0
ね          0

• Once the document is split into words, it can use the same inverted index backend.
• The same word splitter is needed both when creating the index and when searching.

Page 11: Oktavia Search Engine - pyconjp2014

• Split a query word into fixed-length strings, then search for each chunk
• Use each chunk as a word

• 2-gram: こんにちは → こん|んに|にち|ちは
• 3-gram: こんにちは → こんに|んにち|にちは

Page 12: Oktavia Search Engine - pyconjp2014

Word    (Doc ID, Position)
こん    (0, 0)
んに    (0, 1)
にち    (0, 2)
ちは    (0, 3)

• It can still use an inverted index algorithm.
• The index file becomes big.
• It can't handle words shorter than the chunk size.

こんにちは → こん / んに / にち / ちは → (0, 0) / (0, 1) / (0, 2) / (0, 3) → (0, 0)
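As a rough sketch of the N-gram approach, the following Python code splits text into fixed-length chunks, stores each chunk with its (document ID, position), and matches a query by checking that its chunks appear at consecutive positions. All names here are illustrative and not taken from any real search engine.

```python
def ngrams(text, n):
    """Return every substring of length n, in order of position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(documents, n):
    """Map each chunk to a list of (doc_id, position) pairs."""
    index = {}
    for doc_id, text in enumerate(documents):
        for pos, chunk in enumerate(ngrams(text, n)):
            index.setdefault(chunk, []).append((doc_id, pos))
    return index


def search(index, query, n):
    """Return (doc_id, position) pairs where the whole query starts."""
    chunks = ngrams(query, n)
    if not chunks:
        return []  # the query is shorter than the chunk size
    candidates = set(index.get(chunks[0], []))
    for offset, chunk in enumerate(chunks[1:], start=1):
        postings = set(index.get(chunk, []))
        # Keep a start position only if the next chunk follows it directly.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return sorted(candidates)


bigram_index = build_ngram_index(["こんにちは"], 2)
hits = search(bigram_index, "こんにちは", 2)
too_short = search(bigram_index, "こ", 2)
```

The empty result for the one-character query shows the limitation mentioned above: a query shorter than the chunk size produces no chunks to look up.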

Page 13: Oktavia Search Engine - pyconjp2014

Inverted Index

• Languages with spaces: split the document by spaces. Simple, but spaces are needed.
• East Asian languages, N-gram: still simple, but the index becomes huge.
• East Asian languages, NLP: works perfectly with Asian languages, but an NLP processor and a dictionary are needed.

Page 14: Oktavia Search Engine - pyconjp2014
Page 15: Oktavia Search Engine - pyconjp2014

• It provides a search engine for the browser.
• Inverted index

• It didn't support Japanese.
• I sent some patches.
• But they were not enough…

Page 16: Oktavia Search Engine - pyconjp2014
Page 17: Oktavia Search Engine - pyconjp2014

• Developed by Paolo Ferragina and Giovanni Manzini

• FM-index is not popular in western countries.
• It is completely different from existing algorithms.
• Existing algorithms are enough for western languages.
• It is popular in genome analysis.

• I made a new search engine using this algorithm.

Page 18: Oktavia Search Engine - pyconjp2014

Estimated Time: 15min

Page 19: Oktavia Search Engine - pyconjp2014

• A search engine that works in the web browser.
• Written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/)
• It uses FM-index as its backend search algorithm.

Page 20: Oktavia Search Engine - pyconjp2014

• It is similar to ActionScript 3
• Class statement (no prototype!)
• Strict type checking
• No “this” hell
• Performance optimization

Page 21: Oktavia Search Engine - pyconjp2014
Page 22: Oktavia Search Engine - pyconjp2014

• FM-index is the fastest algorithm that uses a compressed index file.
• FM-index doesn't need word splitting.
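To give a feel for why no word splitting is needed, here is a toy FM-index in Python. It is a minimal sketch, not Oktavia's implementation: it builds the suffix array naively and keeps uncompressed tables, whereas a real FM-index uses compressed structures. The function names and the `"\0"` terminator are assumptions made for this example.

```python
def build_fm_index(text, terminator="\0"):
    """Build a toy FM-index: suffix array, first-column table, and Occ table."""
    # Assumption: the terminator does not occur in text and sorts before
    # every other character.
    text += terminator
    # Naive suffix array; real implementations construct it far more efficiently.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in suffix_array]

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[ch]: number of characters in the text that sort before ch.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[ch][i]: occurrences of ch in bwt[:i] (stored uncompressed here).
    occ = {ch: [0] for ch in counts}
    for bwt_ch in bwt:
        for ch, column in occ.items():
            column.append(column[-1] + (ch == bwt_ch))
    return suffix_array, first, occ


def find(index, pattern):
    """Backward search: return the sorted start positions of pattern."""
    suffix_array, first, occ = index
    if not pattern:
        return []
    lo, hi = 0, len(suffix_array)
    # Process the pattern from its last character to its first, narrowing
    # the range of suffixes that start with the part matched so far.
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


text = "今日はPyConJPに出かけるにはいい天気ですね"
fm = build_fm_index(text)
positions = find(fm, "には")
```

The search works on raw characters, so the Japanese sentence is indexed as-is, and any substring can be queried without a dictionary or a fixed chunk size.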

Page 23: Oktavia Search Engine - pyconjp2014

• Oktavia adds extra information: it adds region information to the source text.
• You can add as much metadata as you want.
• Section (documents and sections)
• Block (code blocks and so on)
• Splitter (word splitter)
• Table (rows and columns)

Ep4.txt: Use the Force, Luke.
Ep5.txt: No, I am your father.
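One way to picture region metadata: the source files are concatenated into one searchable text, and the start offset of each region is recorded so that a hit position can be mapped back to its section. The class below is a hypothetical sketch of that idea only; it is not Oktavia's actual Section class or API.

```python
import bisect


class SectionMap:
    """Hypothetical helper: remembers where each named section starts."""

    def __init__(self):
        self.text = ""
        self.starts = []   # start offset of each section, in increasing order
        self.names = []    # section name at the same list index

    def add(self, name, content):
        """Append content to the combined text and record its start offset."""
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def name_at(self, position):
        """Return the name of the section that contains the given offset."""
        i = bisect.bisect_right(self.starts, position) - 1
        return self.names[i]


sections = SectionMap()
sections.add("Ep4.txt", "Use the Force, Luke.")
sections.add("Ep5.txt", "No, I am your father.")

hit = sections.text.find("father")   # offset in the combined text
source = sections.name_at(hit)
```

Because the offsets are sorted, the lookup is a binary search, so mapping a hit back to its document stays cheap even with many sections.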

Page 24: Oktavia Search Engine - pyconjp2014

[Diagram. Steps: Read Source, Generate Index, File API, Read Index, File API, Search Result. Components: CLI tool, browser search program.]

Page 25: Oktavia Search Engine - pyconjp2014

[Diagram. Steps: Read Source, Generate Index, File API, Read Index, File API, Show Search Result. Components: CLI tool, browser search program.]

• I published it yesterday.
• It supports Python 2.6, 2.7, 3.3, and 3.4.

Page 26: Oktavia Search Engine - pyconjp2014

• Use the Oktavia API to implement a search feature in your application

Page 27: Oktavia Search Engine - pyconjp2014

• Build the JSX version
• web/bin/oktavia-jquery-ui.js and web/bin/oktavia-web-runtime.js are important.

$ git clone [email protected]:shibukawa/oktavia.git
$ cd oktavia
$ npm install
$ ./node_modules/.bin/grunt build

Page 28: Oktavia Search Engine - pyconjp2014

• Creating an index
• Dump the index file in base64 encoding and create a file in the following style:

var searchIndex = 'aGVsbG8gd29ybGQ…..=';

• Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js).
• Add web/bin/oktavia-jquery-ui.js to your website.
• It reads the index and runtime in a WebWorker, sends requests, and shows the results.
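The base64 step can be sketched with Python's standard `base64` module. The helper name `index_to_js` and the sample bytes are assumptions for illustration; how Oktavia actually produces the index bytes is not shown here.

```python
import base64


def index_to_js(index_bytes, variable="searchIndex"):
    """Wrap binary index data in a JavaScript variable assignment."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var {0} = '{1}';\n".format(variable, encoded)


# Placeholder bytes standing in for a real binary index file.
js_source = index_to_js(b"hello world")
```

The resulting string can then be written to a `.js` file and concatenated with the runtime script, as described in the steps above.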

Page 29: Oktavia Search Engine - pyconjp2014

Estimated Time: 23min

Page 30: Oktavia Search Engine - pyconjp2014

• Oktavia provides APIs for building a better search engine of your own.
• The most important part for user experience is the adjustment of scoring (sorting and filtering).
• In some cases, users feel that “not available” is important information, but in other cases, it is just noise.

Page 31: Oktavia Search Engine - pyconjp2014

• I want to buy a bottle of wine as a gift!

Cabernet Sauvignon [Sold Out] • From France
Pinot noir [Sold Out] • From Chile
Zinfandel [Sold Out] • From USA

Photo by Josh Kenzer under CC-NC-SA

Page 32: Oktavia Search Engine - pyconjp2014

• I want to buy “My Little Pony DVD”!

Season One $32

Season Two $32

Season Three [Sold out]
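The two shopping examples suggest that the handling of sold-out items should be adjustable per application. The following Python sketch shows one possible shape for that adjustment; the `arrange` function, the policy names, and the dictionary layout are all hypothetical and not part of Oktavia.

```python
def arrange(results, sold_out_policy):
    """Filter or reorder results depending on how sold-out items are treated.

    sold_out_policy is one of:
      "hide"   - drop sold-out items (they are just noise)
      "demote" - keep them, but list available items first
      "keep"   - leave the original order (sold out is useful information)
    """
    if sold_out_policy == "hide":
        return [r for r in results if not r["sold_out"]]
    if sold_out_policy == "demote":
        # sorted() is stable, so items keep their relative order within each group.
        return sorted(results, key=lambda r: r["sold_out"])
    return list(results)


dvds = [
    {"title": "Season One", "sold_out": False},
    {"title": "Season Two", "sold_out": False},
    {"title": "Season Three", "sold_out": True},
]
wines = [
    {"title": "Cabernet Sauvignon", "sold_out": True},
    {"title": "Pinot noir", "sold_out": True},
    {"title": "Zinfandel", "sold_out": True},
]

dvd_list = arrange(dvds, "keep")    # a fan wants to know Season Three exists
wine_list = arrange(wines, "hide")  # a gift buyer only cares about what can be bought
```

Which policy is right depends on what the user is looking for, which is why this kind of tuning is left to the application.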

Page 33: Oktavia Search Engine - pyconjp2014

• Oktavia class (oktavia.py): main entry point for creating and searching.
• Metadata classes (metadata.py): Section, Block, Splitter, Table
• Query, Result classes (TBD)

Page 34: Oktavia Search Engine - pyconjp2014
Page 35: Oktavia Search Engine - pyconjp2014

• Sorry, I am still working on this… In the future, the following code will work:

Page 36: Oktavia Search Engine - pyconjp2014

• In some cases, an inverted index is not a good fit for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
• It includes only the essential part of Oktavia.
• I will add more features.

Page 37: Oktavia Search Engine - pyconjp2014

• Office Hour: 13:40-14:10
• Message: Facebook (yoshiki.shibukawa), Twitter (@shibu_jp, @shibukawa)