iFTS – SQL 2008 FTS Engine: Hebrew Full-Text Search in the Real World

Description

Shay's lecture from ISUG meeting number 87.

Transcript of iFTS – SQL 2008 FTS Engine

Page 1: I F T S –  S Q L 2008  F T S  Engine

iFTS – SQL 2008 FTS Engine

Hebrew Full-Text Search in the Real World

Page 2: I F T S –  S Q L 2008  F T S  Engine

Hebrew in the real world

Page 3: I F T S –  S Q L 2008  F T S  Engine

Agenda

• Tapuz
• iFTS – Introduction
• iFTS – Terms and keywords
• Setting up Full-Text
• Index structure
• Population
• Querying
• Improvements from 2005
• Tapuz solution
• Known Issues

Page 4: I F T S –  S Q L 2008  F T S  Engine

Tapuz – It’s all about content

• 5 major websites – Forums, Communa, Blogs, Flix (Video), Albums
• Over 165 million content items
• Over 3 million registered users
• Thousands of new items every day
• More than 30 web servers
• SQL Server:
  • SQL Server 2005 Enterprise Edition on a 2-node cluster
  • 4 quad-core CPUs, 16 GB RAM
  • ~500 GB of data in 5 major databases
  • ~1,200 batch requests per second

Page 5: I F T S –  S Q L 2008  F T S  Engine

Tapuz - old search engines

• 3 different search engines:
  ° 3 different database systems
  ° Search often didn’t return correct results
  ° 3 different relevance sort algorithms
  ° Very resource intensive (more than 20 servers used for search alone!)
  ° No support for advanced search (dynamic fields)
  ° Long period of time before a new item is indexed

Page 6: I F T S –  S Q L 2008  F T S  Engine

Tapuz Search - project requirements

• Search through most of the existing content (more than 165M items)
• Allow querying newly added items in real time
• The search engine's default language is Hebrew, and its special linguistic characteristics should be supported
• Dynamic fields search – the user can choose which fields to search
• Should have a relevance sorting mechanism

Page 7: I F T S –  S Q L 2008  F T S  Engine

Challenges

• The search should add minimal load on the production SQL Server
• Should have decent query performance
• Real-time item indexing
• How do we handle Hebrew ??!!*#$??!%?

Page 8: I F T S –  S Q L 2008  F T S  Engine

The solution

[Diagram: SQL 2005 Enterprise Cluster, transactional replication to SQL 2008 Standard, auto change tracking population]

Page 9: I F T S –  S Q L 2008  F T S  Engine

iFTS - Introduction

• FTS allows fast and flexible indexing for keyword-based querying of text data
• SQL Server has had full-text search capabilities since version 7.0
• The Full-Text Engine supports two roles: indexing and querying
• Full-text indexes can be created not only on textual data columns, but also on binary columns
• Common uses: searching Web sites, product catalogs, document management systems

Page 10: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords

Document Population

Full-Text Catalog

Full-Text Index

Page 11: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords

Document Population

Full-Text Catalog

Full-Text Index

Population (also known as a crawl) is the process of creating and maintaining a full-text index, that is, building the index and keeping it current.

Page 12: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter

Page 13: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter

Filter: Given a specified file extension such as .doc, filters extract text from a file stored in a varbinary(max) or image column.

Page 14: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter

Word breaker: For a given language, a word breaker tokenizes the text, identifying individual words by determining where word boundaries exist based on the lexical rules of the language.
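One way to watch a word breaker at work is the sys.dm_fts_parser function that ships with SQL Server 2008. The following is a minimal sketch, assuming the English word breaker (LCID 1033) and no stoplist; the sample sentence is only an illustration.

```sql
-- Tokenize a sample phrase with the English word breaker.
-- Arguments: query string, language LCID, stoplist id (NULL = no stoplist),
-- accent sensitivity (0 = insensitive).
SELECT keyword,        -- internal (hexadecimal) form of the token
       display_term,   -- human-readable form of the token
       occurrence,     -- word position within the parsed string
       special_term    -- e.g. whether the token is an exact match or a noise word
FROM sys.dm_fts_parser(N'"there is nothing like searching in English"', 1033, NULL, 0);
```

Changing the LCID argument selects a different word breaker, so the same text can be compared across languages.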

Page 15: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter

Stemmer: For a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language.
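The forms a stemmer generates can be inspected with sys.dm_fts_parser by wrapping a word in FORMSOF(INFLECTIONAL, ...). This is a sketch only, assuming the English stemmer (LCID 1033); the word "search" is an arbitrary example.

```sql
-- List the inflectional forms the English stemmer produces for one word.
-- expansion_type distinguishes the original word from stemmer expansions.
SELECT display_term,
       expansion_type,
       source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "search")', 1033, NULL, 0);
```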

Page 16: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter
Token

Page 17: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter
Token

Token: A token is a word or a character string identified by a word breaker.

Page 18: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter
Token
STOPLIST (STOPWORD, STOPWORD, STOPWORD, STOPWORD)

Page 19: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords – Population

Stemmer
Word breaker
Filter
Token
STOPLIST (STOPWORD, STOPWORD, STOPWORD, STOPWORD)

Stopword: A stopword is a word that is not relevant to your search and is filtered out of the indexing and query processes. SQL Server 2008 introduces stoplists. A stoplist is a list of stopwords.
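A stoplist is an ordinary database object in SQL Server 2008. The sketch below shows how one might be created and attached; the stoplist name ForumStoplist and the table dbo.ForumMessages are hypothetical, and the table is assumed to already have a full-text index.

```sql
-- Create an empty custom stoplist (hypothetical name).
CREATE FULLTEXT STOPLIST ForumStoplist;
GO

-- Add a stopword for a specific language (1033 = English LCID).
ALTER FULLTEXT STOPLIST ForumStoplist ADD N'the' LANGUAGE 1033;
GO

-- Attach the stoplist to an existing full-text index (hypothetical table).
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET STOPLIST ForumStoplist;
GO

-- Review the stopwords the stoplist now contains.
SELECT sw.stopword, sw.language_id
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'ForumStoplist';
```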

Page 20: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords

Document Population
Full-Text Catalog
Full-Text Index

Page 21: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords

Document Population
Full-Text Catalog
Full-Text Index

Full-Text Index: A full-text index stores information about significant words and their location within a given column.

Page 22: I F T S –  S Q L 2008  F T S  Engine

Terms and Keywords

Document Population
Full-Text Catalog
Full-Text Index

Full-Text Catalog: A full-text catalog is a logical concept that refers to a group of full-text indexes.

Page 23: I F T S –  S Q L 2008  F T S  Engine

Setting up Full-Text

Enable DB
Create Catalog
Create Full-Text Index

Page 24: I F T S –  S Q L 2008  F T S  Engine

Setting up Full-Text – Creating a Full-Text index

• A full-text index is a special type of token-based index
• In order to create a full-text index on a table or a view, it must have a unique, single-column, non-nullable index
• Can be created on columns of type: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max)
• Each indexed column supports only a single language

Page 25: I F T S –  S Q L 2008  F T S  Engine

Setting up Full-Text – Creating a Full-Text index

• Choose column and languages
• Assign to catalog
• Choose filegroup
• Choose stoplist
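The setup steps can be sketched in T-SQL as follows. All object names (dbo.ForumMessages, PK_ForumMessages, ForumCatalog) are hypothetical, and the PRIMARY filegroup is used only so the script runs as is. In SQL Server 2008, user databases are always full-text enabled, so the sketch starts from the catalog.

```sql
-- Hypothetical table. The primary key provides the unique, single-column,
-- non-nullable index that the full-text index needs as its key.
CREATE TABLE dbo.ForumMessages
(
    MessageId int           NOT NULL CONSTRAINT PK_ForumMessages PRIMARY KEY,
    Subject   nvarchar(200) NOT NULL,
    Body      nvarchar(max) NOT NULL
);
GO

-- Create the catalog that will group the full-text indexes.
CREATE FULLTEXT CATALOG ForumCatalog;
GO

-- Choose columns and languages (1037 = Hebrew LCID), assign the index to a
-- catalog and filegroup, and choose a stoplist.
CREATE FULLTEXT INDEX ON dbo.ForumMessages
(
    Subject LANGUAGE 1037,
    Body    LANGUAGE 1037
)
KEY INDEX PK_ForumMessages
ON (ForumCatalog, FILEGROUP [PRIMARY])
WITH CHANGE_TRACKING AUTO, STOPLIST = SYSTEM;
GO
```

In a real deployment the filegroup would typically be a dedicated one rather than PRIMARY.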

Page 26: I F T S –  S Q L 2008  F T S  Engine
Page 27: I F T S –  S Q L 2008  F T S  Engine

Index Structure

• Demo

Source Row:

ID   Text_English
1    there is nothing like searching in English

Full-Text Index:

Keyword     ColId   DocId   Occurrence
English     3       1       7
Nothing     3       1       3
Searching   3       1       5

Page 28: I F T S –  S Q L 2008  F T S  Engine

Index Structure

• Demo

Source Row:

ID   Text_English
1    there is nothing like searching in English

Full-Text Index:

Keyword     ColId   DocId   Occurrence
English     3       1       7
Nothing     3       1       3
Searching   3       1       5

The Keyword column contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.

Page 29: I F T S –  S Q L 2008  F T S  Engine


Page 30: I F T S –  S Q L 2008  F T S  Engine

Index Structure

• Demo

Source Row:

ID   Text_English
1    there is nothing like searching in English

Full-Text Index:

Keyword     ColId   DocId   Occurrence
English     3       1       7
Nothing     3       1       3
Searching   3       1       5

The ColId column contains a value that corresponds to a particular column that is full-text indexed.

Page 31: I F T S –  S Q L 2008  F T S  Engine


Page 32: I F T S –  S Q L 2008  F T S  Engine

Index Structure

• Demo

Source Row:

ID   Text_English
1    there is nothing like searching in English

Full-Text Index:

Keyword     ColId   DocId   Occurrence
English     3       1       7
Nothing     3       1       3
Searching   3       1       5

The DocId column contains eight-byte integer values, each of which maps to a particular full-text key value in a full-text indexed table.

Page 33: I F T S –  S Q L 2008  F T S  Engine


Page 34: I F T S –  S Q L 2008  F T S  Engine

Index Structure

• Demo

Source Row:

ID   Text_English
1    there is nothing like searching in English

Full-Text Index:

Keyword     ColId   DocId   Occurrence
English     3       1       7
Nothing     3       1       3
Searching   3       1       5

The Occurrence column contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
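The demo itself is not part of the transcript. As a rough sketch, the contents of a full-text index can be listed with sys.dm_fts_index_keywords_by_doc; the table name dbo.ForumMessages is hypothetical and is assumed to have a populated full-text index. Note that this function reports how many times a keyword occurs in a document (occurrence_count), not the word offsets themselves.

```sql
-- List indexed keywords per column and per document for one table.
SELECT display_term,      -- readable form of the keyword
       column_id,         -- which full-text indexed column it came from
       document_id,       -- full-text key value of the row
       occurrence_count   -- number of occurrences in that document
FROM sys.dm_fts_index_keywords_by_doc(DB_ID(), OBJECT_ID(N'dbo.ForumMessages'))
ORDER BY document_id, display_term;
```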

Page 35: I F T S –  S Q L 2008  F T S  Engine

Population Process – Population methods

1. Full – A full population builds index entries for all the rows of the base table or indexed view
2. Change Tracking – SQL Server tracks changes to the base table since the last population:
   1. Auto
   2. Manual
3. Incremental Timestamp-Based Population
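Each population method maps to an ALTER FULLTEXT INDEX statement. The statements below are alternatives rather than a script to run top to bottom, and the table name dbo.ForumMessages is hypothetical.

```sql
-- 1. Full population: rebuild index entries for every row.
ALTER FULLTEXT INDEX ON dbo.ForumMessages START FULL POPULATION;
GO

-- 2a. Change tracking, automatic: tracked changes are propagated
--     to the index in the background.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING AUTO;
GO

-- 2b. Change tracking, manual: changes are tracked but applied only
--     when an update population is started explicitly.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING MANUAL;
GO
ALTER FULLTEXT INDEX ON dbo.ForumMessages START UPDATE POPULATION;
GO

-- 3. Incremental population: relies on a timestamp column in the base table.
ALTER FULLTEXT INDEX ON dbo.ForumMessages START INCREMENTAL POPULATION;
GO
```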

Page 36: I F T S –  S Q L 2008  F T S  Engine

Querying

• CONTAINS, FREETEXT – used as a predicate (in the WHERE clause)
Syntax:
CONTAINS (column_name, search_string)

• CONTAINSTABLE, FREETEXTTABLE – table-valued functions (TVFs) that include ranking
Syntax:
SELECT * FROM CONTAINSTABLE (table_name, column_name, search_string, top n)
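A short sketch of both query styles, assuming a hypothetical table dbo.ForumMessages with an integer key MessageId and full-text indexed columns Subject and Body:

```sql
-- Predicate form: CONTAINS filters rows in the WHERE clause.
SELECT MessageId, Subject
FROM dbo.ForumMessages
WHERE CONTAINS(Body, N'"full text" AND search');

-- Predicate form: FREETEXT matches on meaning rather than exact wording.
SELECT MessageId, Subject
FROM dbo.ForumMessages
WHERE FREETEXT(Body, N'searching in English');

-- Table-valued form: CONTAINSTABLE returns [KEY] and [RANK] columns.
-- The last argument limits the result to the 10 highest-ranked matches.
SELECT m.MessageId, m.Subject, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, Body, N'"full text" AND search', 10) AS ft
JOIN dbo.ForumMessages AS m ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

The table-valued form has to be joined back to the base table on the full-text key to fetch the remaining columns.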

Page 37: I F T S –  S Q L 2008  F T S  Engine

iFTS enhancements in SQL Server 2008

• Fully integrated into SQL Server
• Stoplists
• New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs)
• A new word breaker family (Hebrew and other languages)
• Performance improvements (reasons: integer key, full integration)

Page 38: I F T S –  S Q L 2008  F T S  Engine

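Hebrew????

• איש (man) = אישה (woman)
• הולך (walks, singular) = הולכים (walk, plural)
• כלב (dog) = לב (heart)
• כביסה (laundry) = ביסה = ביס (bite)
• לבני = בני = בן\בני

DEMOS - Hebrew is a hard language (עברית שפה קשה)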
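The demos are not included in the transcript. One way such equivalences could be explored is to run the Hebrew word breaker and stemmer (LCID 1037) through sys.dm_fts_parser. This is a sketch only; the words are taken from the slide, and no particular output is implied.

```sql
-- Tokenize a Hebrew word with the Hebrew word breaker.
SELECT keyword, display_term, occurrence, special_term
FROM sys.dm_fts_parser(N'"כלב"', 1037, NULL, 0);

-- Ask for the inflectional forms generated for a Hebrew word.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "הולך")', 1037, NULL, 0);
```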

Page 39: I F T S –  S Q L 2008  F T S  Engine

From the life of the Hebrew language (מחיי השפה העברית)

Page 40: I F T S –  S Q L 2008  F T S  Engine

New DMVs and management tools

• sys.dm_fts_parser
• sys.dm_fts_index_keywords
• sys.dm_fts_index_keywords_by_doc
• sys.fulltext_index_fragments
• FULLTEXTCATALOGPROPERTY:
  – MergeStatus
  – PopulateStatus
• OBJECTPROPERTYEX:
  – TableFulltextPopulateStatus
  – TableFulltextPendingChanges
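A sketch of how these properties and views might be queried to monitor an index. The catalog name ForumCatalog and table name dbo.ForumMessages are hypothetical.

```sql
-- Catalog-level status: is a population or a merge currently running?
SELECT FULLTEXTCATALOGPROPERTY(N'ForumCatalog', 'PopulateStatus') AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY(N'ForumCatalog', 'MergeStatus')    AS MergeStatus;

-- Table-level status: population state and number of pending tracked changes.
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'), 'TableFulltextPopulateStatus') AS TablePopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'), 'TableFulltextPendingChanges') AS PendingChanges;

-- Fragments that currently make up the table's full-text index.
SELECT fragment_id, status, data_size, row_count
FROM sys.fulltext_index_fragments
WHERE table_id = OBJECT_ID(N'dbo.ForumMessages');
```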

Page 41: I F T S –  S Q L 2008  F T S  Engine

Tips and Tricks

• Why scan if you can seek? FORCESEEK, a new hint, can help a bit in determining the query plan
• When using CONTAINS, don’t forget to wrap the search term in double quotes (") when searching for more than one word
• Use \ to escape special characters
• To search quotes (“) in the text use "
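Two of these tips can be sketched in T-SQL, assuming a hypothetical table dbo.ForumMessages with an integer primary key MessageId and a full-text indexed Body column:

```sql
-- Multi-word search: wrap the phrase in double quotes inside the string.
-- Without them, CONTAINS(Body, N'hebrew search') is rejected as a
-- syntax error in the full-text search condition.
SELECT MessageId
FROM dbo.ForumMessages
WHERE CONTAINS(Body, N'"hebrew search"');

-- FORCESEEK (new in SQL Server 2008) asks the optimizer to use an index
-- seek on the base table when joining it to the full-text results.
SELECT m.MessageId, m.Subject, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, Body, N'"hebrew search"') AS ft
JOIN dbo.ForumMessages AS m WITH (FORCESEEK) ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```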

Page 42: I F T S –  S Q L 2008  F T S  Engine

Tips and Tricks

• Use an integer key as the unique index
• Place the full-text index on another filegroup
• Performance degrades when the full-text index is fragmented - use reorganize for merge
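A sketch of checking fragmentation and triggering a merge. The catalog name ForumCatalog is hypothetical, and how many fragments justify a reorganize is left to the reader.

```sql
-- Count queryable fragments per table (status 4 or 6 means the
-- fragment is closed and ready for queries).
SELECT OBJECT_NAME(table_id) AS table_name,
       COUNT(*)              AS fragment_count
FROM sys.fulltext_index_fragments
WHERE status IN (4, 6)
GROUP BY table_id;
GO

-- Merge the fragments of every index in the catalog into larger ones.
ALTER FULLTEXT CATALOG ForumCatalog REORGANIZE;
GO
```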

Page 43: I F T S –  S Q L 2008  F T S  Engine

Tapuz Solution

• SQL 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs
• Transactional replication
• FT indexes on a different filegroup (FG) than the main tables
• Change tracking (AUTO)
• Daily reorganization of fragmented indexes only
• A hierarchical set of queries to make sure the most relevant results are returned first
• Use dynamic SQL so that dynamic search fields can be used
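The slides do not show the actual Tapuz code. The following is only a sketch of how dynamic SQL could let the user choose the fields to search, with hypothetical names throughout (dbo.ForumMessages, Subject, Body, and the @searchSubject and @searchBody flags). The column list is assembled from fixed names rather than user input, and the search term is passed as a parameter.

```sql
DECLARE @searchSubject bit = 1;              -- user ticked "Subject"
DECLARE @searchBody    bit = 1;              -- user ticked "Body"
DECLARE @term nvarchar(400) = N'"sql server"';

-- Build the column list from a fixed set of known column names.
DECLARE @columns nvarchar(200) = N'';
IF @searchSubject = 1 SET @columns = @columns + N'Subject,';
IF @searchBody    = 1 SET @columns = @columns + N'Body,';

IF LEN(@columns) > 0
BEGIN
    -- Drop the trailing comma.
    SET @columns = LEFT(@columns, LEN(@columns) - 1);

    DECLARE @sql nvarchar(max) =
        N'SELECT m.MessageId, m.Subject, ft.[RANK]
          FROM CONTAINSTABLE(dbo.ForumMessages, (' + @columns + N'), @term) AS ft
          JOIN dbo.ForumMessages AS m ON m.MessageId = ft.[KEY]
          ORDER BY ft.[RANK] DESC;';

    -- The search term stays a parameter; only column names are concatenated.
    EXEC sys.sp_executesql @sql, N'@term nvarchar(400)', @term = @term;
END;
```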

Page 44: I F T S –  S Q L 2008  F T S  Engine

Results relevance sorting logic

• Freetext ranking (Okapi BM25)
• Contains
• Contains all words (using AND)
• Free search (freetext)
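The exact queries are not shown in the slides. One possible reading of this hierarchy, sketched under that assumption, is to try an exact phrase first, then all words, then free text, and to sort by tier before rank. Table and column names are hypothetical, and ranks from different functions are not strictly comparable, which is why the tier is the primary sort key.

```sql
DECLARE @phrase   nvarchar(400) = N'"sql server"';          -- exact phrase
DECLARE @allWords nvarchar(400) = N'"sql" AND "server"';    -- every word present
DECLARE @free     nvarchar(400) = N'sql server';            -- free-text search

WITH tiers AS
(
    SELECT [KEY], [RANK], 1 AS tier
    FROM CONTAINSTABLE(dbo.ForumMessages, Body, @phrase)
    UNION ALL
    SELECT [KEY], [RANK], 2
    FROM CONTAINSTABLE(dbo.ForumMessages, Body, @allWords)
    UNION ALL
    SELECT [KEY], [RANK], 3
    FROM FREETEXTTABLE(dbo.ForumMessages, Body, @free)
),
best AS
(
    -- Keep each row once, in the strongest tier in which it matched.
    SELECT [KEY], MIN(tier) AS tier, MAX([RANK]) AS best_rank
    FROM tiers
    GROUP BY [KEY]
)
SELECT TOP (50) m.MessageId, m.Subject, b.tier, b.best_rank
FROM best AS b
JOIN dbo.ForumMessages AS m ON m.MessageId = b.[KEY]
ORDER BY b.tier, b.best_rank DESC;
```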

Page 45: I F T S –  S Q L 2008  F T S  Engine

Numbers

• Index sizes – 53 GB (~68 GB data)
• Number of rows indexed – >165M
• Average search time – 1.7 sec
• More than 97% of the searches respond in less than 7 sec
• Number of searches (2 months) – more than 6 million
• Number of connections – ~900

Page 46: I F T S –  S Q L 2008  F T S  Engine

Known issues found so far

• High CPU load and intense disk I/O during queries
• Population and merges are resource intensive
• Ranking not as a TVF?? – impossible
• Statistics, query plans and join types are not always optimal – hints can’t be used
• No scale-out or partitioning options

Page 47: I F T S –  S Q L 2008  F T S  Engine

References

• Books Online
• SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227
• Pro Full-Text Search in SQL Server 2008 by Michael Coles

Page 48: I F T S –  S Q L 2008  F T S  Engine