Oktavia Search Engine - pyconjp2014
DeNA Co, Ltd. Yoshiki Shibukawa
9/14/2014 PyConJP
• Yoshiki Shibukawa
• Works for DeNA Co, Ltd.
• @shibu_jp (Twitter)
• yoshiki.shibukawa (Facebook)
• [email protected] (mail)
• Languages: C/C++, Python, JavaScript
• Founder of sphinx-users.jp
• San Francisco -> Tokyo
• The Basics of Existing Search Engines
• The Structure of Oktavia
• Oktavia API Examples
• In some cases, an inverted index is not a good fit for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
  • It includes only the essential part of Oktavia.
  • I will add more features.
AM.txt (0)
• Good morning
• Hi
PM.txt (1)
• Good afternoon
• Good evening
• Hi
Word Document ID
Good 0, 1
Morning 0
Afternoon 1
Evening 1
Hi 0, 1
• Word -> Document
• Split the query string into words, look up each word in the table, and show the result.
Good Morning → (0, 1) and (0,) → (0,)
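The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not Oktavia's code: the function names `build_inverted_index` and `search` are made up for this example, and words are lowercased so that "Good" and "good" match.

```python
from collections import defaultdict

# Toy corpus from the slides: document ID -> lines of text.
documents = {
    0: ["Good morning", "Hi"],                    # AM.txt
    1: ["Good afternoon", "Good evening", "Hi"],  # PM.txt
}


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, lines in docs.items():
        for line in lines:
            for word in line.lower().split():
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # AND of the per-word document sets
    return result


index = build_inverted_index(documents)
print(sorted(search(index, "Good Morning")))  # [0]
```

Splitting on whitespace is the step that breaks down for languages written without spaces between words.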
• It is nice weather to go out to PyConJP. English
• 这是不错的天气出去PyConJP Chinese
• 今日はPyConJPに出かけるにはいい天気ですね Japanese
• 그것은 PyConJP 에 외출 좋은 날씨 입니다 Korean※
※Korean has spaces between groups of words, but not between individual words.
今日はPyConJPに出かけるにはいい天気ですね
今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね
• Split the text into words with a natural language processor such as ChaSen, MeCab, or Kuromoji.
• This needs deep knowledge of each language and a big dictionary.
Word Doc ID
今日 0
は 0, 0
PyConJP 0
に 0, 0
出かける 0
いい 0
天気 0
です 0
ね 0
• Once the document is split into words, the same inverted index backend can be used.
• The same word splitter is needed both when creating the index and when searching.
• 2-gram
• 3-gram
• Split a query word into fixed-length chunks, then search for each chunk
• Use each chunk as a word
こんにちは
こん|んに|にち|ちは
こんにちは
こんに|んにち|にちは
Word (Doc ID, Position)
こん (0, 0)
んに (0, 1)
にち (0, 2)
ちは (0, 3)
• It can still use an inverted index algorithm.
• The index file becomes big.
• It can't handle words shorter than the chunk size.
こんにちは → こん / んに / にち / ちは → (0, 0) / (0, 1) / (0, 2) / (0, 3) → (0, 0)
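A minimal sketch of this N-gram lookup, assuming a 2-gram index that stores (document ID, position) pairs as in the table above. The function names and the second document (こんばんは) are illustrative additions, not part of the slides.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Split text into overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n=2):
    """Map each chunk to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, chunk in enumerate(ngrams(text, n)):
            index[chunk].add((doc_id, pos))
    return index


def search(index, query, n=2):
    """Return (doc_id, position) of every place the whole query occurs."""
    chunks = ngrams(query, n)
    if not chunks:
        return []  # the query is shorter than the chunk size
    hits = []
    for doc_id, pos in sorted(index.get(chunks[0], ())):
        # The k-th chunk of the query must sit k characters after the first.
        if all((doc_id, pos + k) in index.get(chunk, ())
               for k, chunk in enumerate(chunks)):
            hits.append((doc_id, pos))
    return hits


docs = {0: "こんにちは", 1: "こんばんは"}  # document 1 is an added example
index = build_ngram_index(docs)
print(search(index, "こんにちは"))  # [(0, 0)]
print(search(index, "こ"))          # [] (shorter than the chunk size)
```

Every character position produces an index entry, which is why the index file grows quickly.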
Inverted Index
Language type            Method                   Pros                                    Cons
Have space               Split document by space  Simple                                  Space is needed
Eastern Asian language   N-gram                   Still simple                            Index becomes huge
Eastern Asian language   NLP                      Works perfectly with Asian languages    NLP processor and dictionary are needed
• It provides a search engine for the browser.
  • Inverted Index
• It didn't support Japanese.
  • I sent some patches.
  • But they were not enough…
• Developed by…
  • Paolo Ferragina
  • Giovanni Manzini
• FM-index is not popular in Western countries.
  • It is completely different from existing algorithms.
  • Existing algorithms are enough for Western languages.
• It is popular in genome analysis.
• I made a new search engine using this algorithm.
• A search engine that works in the web browser.
• Written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/).
• It uses FM-index as its backend search algorithm.
• It is similar to ActionScript 3
  • Class statement (no prototype!)
  • Strict type checking
  • No “this” hell
  • Performance optimization
• FM-index is the fastest algorithm that uses a compressed index file.
• FM-index doesn’t need word splitting.
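To show why no word splitting is needed, here is a deliberately naive sketch of FM-index backward search. It is not Oktavia's implementation: it keeps the suffix array and the occurrence counts uncompressed, whereas a real FM-index stores them in compressed form. The function names are made up for this example, and the text is assumed not to contain the `"\0"` sentinel character.

```python
def build_fm_index(text):
    """Build a naive, uncompressed FM-index: suffix array, C table, Occ table."""
    text += "\0"  # sentinel that sorts before every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # suffix array
    bwt = "".join(text[i - 1] for i in sa)  # Burrows-Wheeler transform

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch] = number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return sa, c_table, occ


def locate(fm, pattern):
    """Return the sorted start positions of pattern, matching it back to front."""
    sa, c_table, occ = fm
    if not pattern:
        return []
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


text = "今日はPyConJPに出かけるにはいい天気ですね"
fm = build_fm_index(text)
print(locate(fm, "に"))    # [10, 15]
print(locate(fm, "には"))  # [15]
```

The search works on arbitrary substrings, so the Japanese sentence is indexed as-is, without a dictionary or a chunk size.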
• Oktavia adds extra information
  • It adds region information to the source text.
• You can add as much metadata as you want.
  • Section (documents and sections)
  • Block (code blocks and so on)
  • Splitter (word splitter)
  • Table (rows and columns)
Ep4.txt: Use the Force, Luke.
Ep5.txt: No, I am your father.
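One way to picture region information: concatenate all documents into a single text, remember where each section starts, and map a hit offset back to its section. The `SectionMap` class below is a hypothetical illustration, not Oktavia's `Section` metadata class, and `str.find` stands in for the real search.

```python
import bisect


class SectionMap:
    """Concatenated source text plus the start offset of each section."""

    def __init__(self):
        self.text = ""
        self.starts = []  # start offset of each section, in ascending order
        self.names = []   # section name for the same index in starts

    def add(self, name, content):
        """Append a section and record where it begins."""
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def section_of(self, offset):
        """Return the name of the section containing the given text offset."""
        if not 0 <= offset < len(self.text):
            raise IndexError("offset is outside the indexed text")
        i = bisect.bisect_right(self.starts, offset) - 1
        return self.names[i]


sections = SectionMap()
sections.add("Ep4.txt", "Use the Force, Luke. ")
sections.add("Ep5.txt", "No, I am your father.")

hit = sections.text.find("father")  # stand-in for a search over the text
print(sections.section_of(hit))     # Ep5.txt
```

With this layout the search itself runs over one text, and the metadata is consulted only to label the hits.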
[Figure: data flow. CLI tool: Read Source → Generate Index → File API. Browser search program: File API → Read Index → Show Search Result.]
• I published it yesterday.
  • It supports Python 2.6, 2.7, 3.3, 3.4.
• Use the Oktavia API to implement a search feature in your application.
• Build the JSX version
  • web/bin/oktavia-jquery-ui.js and web/bin/oktavia-web-runtime.js are the important files.
$ git clone [email protected]:shibukawa/oktavia.git
$ cd oktavia
$ npm install
$ ./node_modules/.bin/grunt build
• Creating the index
  • Dump the index file as base64 and create a file in the following style:
var searchIndex = 'aGVsbG8gd29ybGQ…..=';
  • Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js).
• Add web/bin/oktavia-jquery-ui.js to your website.
  • It loads the index and the runtime in a WebWorker, sends requests to it, and shows the results.
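The dump step could look roughly like this in Python. It is a sketch under assumptions: `dump_index_as_js` and `bundle` are hypothetical helper names, the bytes `b"hello world"` are placeholder data rather than a real Oktavia index, and the runtime source is passed in as a string instead of being read from web/bin/oktavia-web-runtime.js.

```python
import base64


def dump_index_as_js(index_bytes, variable_name="searchIndex"):
    """Wrap binary index data in a JavaScript variable as a base64 string."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var {0} = '{1}';\n".format(variable_name, encoded)


def bundle(index_js, runtime_js):
    """Concatenate the index declaration with the search runtime source."""
    return index_js + runtime_js


index_js = dump_index_as_js(b"hello world")  # placeholder index data
print(index_js, end="")  # var searchIndex = 'aGVsbG8gd29ybGQ=';
```

Base64 keeps the binary index safe inside a JavaScript string literal, at the cost of roughly one third more bytes.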
• Oktavia provides APIs for building a better search engine of your own.
• The most important part of the user experience is tuning the scoring (sorting and filtering).
• In some cases, users feel that “not available” is important information; in other cases, it is just noise.
• I want to buy a bottle of wine as a gift!
Cabernet Sauvignon [Sold Out] • From France
Pinot noir [Sold Out] • From Chile
Zinfandel [Sold Out] • From USA
Photo by Josh Kenzer under CC-NC-SA
• I want to buy “My Little Pony DVD”!
Season One $32
Season Two $32
Season Three [Sold Out]
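The two shopping examples suggest that the same result set needs different sorting and filtering depending on the situation. The sketch below is one hypothetical way to express that; the `Hit` type, the `rank` function, and the penalty value are assumptions for illustration and are not part of the Oktavia API.

```python
from collections import namedtuple

# Hypothetical search hit: title, relevance score, and stock status.
Hit = namedtuple("Hit", ["title", "score", "available"])


def rank(hits, hide_unavailable=False, unavailable_penalty=0.5):
    """Sort hits by score, either dropping or demoting unavailable items."""
    if hide_unavailable:
        hits = [h for h in hits if h.available]

    def adjusted(hit):
        return hit.score if hit.available else hit.score * unavailable_penalty

    return sorted(hits, key=adjusted, reverse=True)


wines = [
    Hit("Cabernet Sauvignon", 1.0, False),
    Hit("Pinot noir", 1.0, False),
    Hit("Zinfandel", 1.0, False),
]
dvds = [
    Hit("Season Three", 1.0, False),
    Hit("Season One", 1.0, True),
    Hit("Season Two", 1.0, True),
]

# Gift shopping: sold-out bottles are just noise, so hide them.
print([h.title for h in rank(wines, hide_unavailable=True)])
# Looking for a specific series: keep the sold-out season, but list it last.
print([h.title for h in rank(dvds)])
```

Which behavior is right depends on the application, which is why the scoring step is worth exposing to the developer.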
• Oktavia class (oktavia.py)
  • Main entry point for creating indexes and searching.
• Metadata classes (metadata.py)
  • Section
  • Block
  • Splitter
  • Table
• Query, Result classes (TBD)
• Sorry, I am still working on this. In the future the following code will work:
• In some cases, an inverted index is not a good fit for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
  • It includes only the essential part of Oktavia.
  • I will add more features.
• Office Hour
  • 13:40-14:10
• Message
  • Facebook (yoshiki.shibukawa)
  • Twitter (@shibu_jp, @shibukawa)