iFTS – SQL 2008 FTS Engine

Hebrew Full-Text Search in the real world

Agenda
• Tapuz
• iFTS – Introduction
• iFTS – Terms and keywords
• Setting up Full-Text
• Index structure
• Population
• Querying
• Improvements from 2005
• Tapuz solution
• Known Issues

Tapuz – It’s all about content
• 5 major websites – Forums, Communa, Blogs, Flix (Video), Albums
• Over 165 million content items
• Over 3 million registered users
• Thousands of new items every day
• More than 30 web servers
• SQL Server:
° SQL Server 2005 Enterprise Edition on a 2-node cluster
° 4 quad-core CPUs, 16 GB RAM
° ~500 GB of data in 5 major databases
° ~1200 batch requests per second

Tapuz – old search engines
• 3 different search engines:
° 3 different database systems
° Search often didn’t return correct results
° 3 different relevance sort algorithms
° Very resource intensive (more than 20 servers used for search alone!)
° No support for advanced search (dynamic fields)
° Long period of time before a new item is indexed

Tapuz Search – project requirements
• Search through most of the existing content (more than 165M items)
• Allow querying newly added items in real time
• The search engine's default language is Hebrew, and its special linguistic characteristics should be supported
• Dynamic fields search – the user can choose which fields to search
• Should have a relevance sorting mechanism

Challenges
• The search should add minimal load on the production SQL Server
• Should have decent query performance
• Real-time item indexing
• How do we handle Hebrew ??!!*#$??!%?

The solution

SQL 2005 Enterprise Cluster → transactional replication → SQL 2008 Standard (auto change tracking population)

iFTS – Introduction
• FTS allows fast and flexible indexing for keyword-based querying of text data
• SQL Server has had full-text search capabilities since version 7.0
• The Full-Text Engine supports two roles: indexing and querying
• Full-text indexes can be created not only on textual data columns, but also on binary columns
• Common uses: searching Web sites, product catalogs, document management systems

Terms and Keywords
• Document
• Population (also known as a crawl) – the process of creating and maintaining a full-text index (creating and building the index).
• Filter – given a specified file extension such as .doc, filters extract text from a file stored in a varbinary(max) or image column.
• Word breaker – for a given language, a word breaker tokenizes the text, identifying individual words by determining where word boundaries exist based on the lexical rules of the language.
• Stemmer – for a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language.
• Token – a word or a character string identified by a word breaker.
• Stopword – a word that is not relevant to your search and is filtered out from indexing and query processes.
• Stoplist – a list of stopwords. SQL Server 2008 introduces stoplists.
• Full-text index – stores information about significant words and their location within a given column.
• Full-text catalog – a logical concept that refers to a group of full-text indexes.
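The word breaker, stemmer and stoplist terms can be observed directly with sys.dm_fts_parser. The following is a minimal sketch, not part of the original deck: the stoplist name DemoStoplist and the sample strings are hypothetical, English (LCID 1033) is assumed, and the stoplist statements assume database compatibility level 100.

```sql
-- Hypothetical stoplist used only for this demonstration.
CREATE FULLTEXT STOPLIST DemoStoplist;
ALTER FULLTEXT STOPLIST DemoStoplist ADD N'like' LANGUAGE 1033;  -- 1033 = English (US)

DECLARE @stoplistId INT;
SELECT @stoplistId = stoplist_id
FROM sys.fulltext_stoplists
WHERE name = N'DemoStoplist';

-- Word breaker: one row per token. special_term tells exact matches
-- apart from stopwords (reported as noise words).
SELECT occurrence, display_term, special_term
FROM sys.dm_fts_parser(N'"there is nothing like searching in English"',
                       1033, @stoplistId, 0);

-- Stemmer: FORMSOF(INFLECTIONAL, ...) expands a word into the
-- inflectional forms generated for the chosen language.
SELECT display_term, expansion_type
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, search)',
                       1033, @stoplistId, 0);
```

Passing the Hebrew LCID (1037) instead of 1033 runs the same text through the Hebrew word breaker, which is a quick way to check how a given word is tokenized before indexing it.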

Setting up Full-Text

Enable DB → Create Catalog → Create Full-Text Index

• A full-text index is a special type of token-based index
• In order to create a full-text index on a table or a view, it must have a unique, single-column, non-nullable index
• Can be created on columns of type: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max)
• Each index supports only a single language per column

Creating a Full-Text index
• Choose columns and languages
• Assign to catalog
• Choose filegroup
• Choose stoplist
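The setup steps might look like the following in T-SQL. This is a sketch under stated assumptions: the table dbo.Posts, its columns, the key index PK_Posts and the catalog PostsCatalog are hypothetical names, and Hebrew (LCID 1037) is chosen as the column language. In SQL Server 2008 user databases are full-text enabled by default, so no separate "enable" statement is shown.

```sql
-- Hypothetical base table. The full-text index needs a unique,
-- single-column, non-nullable index; an integer primary key qualifies.
CREATE TABLE dbo.Posts (
    PostId INT           NOT NULL CONSTRAINT PK_Posts PRIMARY KEY,
    Title  NVARCHAR(200) NOT NULL,
    Body   NVARCHAR(MAX) NOT NULL
);

-- Create the catalog (a logical container for full-text indexes).
CREATE FULLTEXT CATALOG PostsCatalog;

-- Create the full-text index: columns and languages, key index,
-- catalog, change tracking and stoplist.
-- To place the index on a separate, already existing filegroup, the
-- ON clause can instead be written as: ON (PostsCatalog, FILEGROUP <name>)
CREATE FULLTEXT INDEX ON dbo.Posts (
    Title LANGUAGE 1037,   -- 1037 = Hebrew
    Body  LANGUAGE 1037
)
KEY INDEX PK_Posts
ON PostsCatalog
WITH CHANGE_TRACKING AUTO, STOPLIST = SYSTEM;
```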

Index Structure

Source row:

ID  Text_English
1   there is nothing like searching in English

Full-text index:

Keyword    ColId  DocId  Occurrence
English    3      1      7
Nothing    3      1      3
Searching  3      1      5

• The Keyword column contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.
• The ColId column contains a value that corresponds to a particular column that is full-text indexed.
• The DocId column contains eight-byte integer values that map to a particular full-text key value in a full-text indexed table.
• The Occurrence column contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
• Demo
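The stored keywords can be inspected with a DMV. A minimal sketch, assuming a hypothetical table dbo.Posts in the current database that already has a populated full-text index:

```sql
-- One row per keyword, column and document.
-- display_term is the readable form of the stored keyword,
-- column_id corresponds to ColId and document_id to DocId.
SELECT display_term, column_id, document_id, occurrence_count
FROM sys.dm_fts_index_keywords_by_doc(DB_ID(), OBJECT_ID(N'dbo.Posts'))
ORDER BY document_id, display_term;
```

Note that this function reports how many times a keyword occurs in each document (occurrence_count), not the individual word offsets held in the Occurrence column.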

Population Process – Population methods

1. Full – a full population builds index entries for all the rows of the base table or indexed view
2. Change Tracking – SQL Server tracks changes to the base table since the last population:
   1. Auto
   2. Manual
3. Incremental timestamp-based population
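Each population method maps to an ALTER FULLTEXT INDEX option. The statements below are alternatives rather than a script to run top to bottom; dbo.Posts is a hypothetical table that is assumed to already have a full-text index.

```sql
-- 1. Full population: build index entries for every row.
ALTER FULLTEXT INDEX ON dbo.Posts START FULL POPULATION;

-- 2a. Change tracking, auto: tracked changes are propagated to the
--     index automatically in the background.
ALTER FULLTEXT INDEX ON dbo.Posts SET CHANGE_TRACKING AUTO;

-- 2b. Change tracking, manual: changes are tracked but only applied
--     when an update population is started explicitly.
ALTER FULLTEXT INDEX ON dbo.Posts SET CHANGE_TRACKING MANUAL;
ALTER FULLTEXT INDEX ON dbo.Posts START UPDATE POPULATION;

-- 3. Incremental population: relies on a timestamp column in the base
--    table to find rows changed since the last population.
ALTER FULLTEXT INDEX ON dbo.Posts START INCREMENTAL POPULATION;
```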

Querying

• CONTAINS, FREETEXT – used as predicates (in the WHERE clause)
Syntax:
CONTAINS (column_name, search_string)

• CONTAINSTABLE, FREETEXTTABLE – table-valued functions (TVFs) that include ranking
Syntax:
SELECT * FROM CONTAINSTABLE (table_name, column_name, search_string, top_n)
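As an illustration of the two query forms, here is a short sketch against a hypothetical full-text indexed table dbo.Posts (key PostId, indexed columns Title and Body); the search strings are made up.

```sql
-- Predicate form: CONTAINS filters rows in the WHERE clause.
SELECT p.PostId, p.Title
FROM dbo.Posts AS p
WHERE CONTAINS(p.Body, N'"full-text" AND search');

-- Table-valued form: CONTAINSTABLE returns [KEY] and [RANK] columns,
-- limited here to the top 100 matches, and is joined back on the key.
SELECT p.PostId, p.Title, ft.[RANK]
FROM CONTAINSTABLE(dbo.Posts, Body, N'"full text search"', 100) AS ft
JOIN dbo.Posts AS p
    ON p.PostId = ft.[KEY]
ORDER BY ft.[RANK] DESC;

-- FREETEXTTABLE takes plain text and can search several columns.
SELECT p.PostId, p.Title, ft.[RANK]
FROM FREETEXTTABLE(dbo.Posts, (Title, Body), N'hebrew search engine', 100) AS ft
JOIN dbo.Posts AS p
    ON p.PostId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```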

iFTS enhancements in SQL Server 2008
• Fully integrated into SQL Server
• Stoplists
• New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs)
• A new word breaker family (Hebrew and other languages)
• Performance improvements (reasons: integer key, full integration)

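DEMOS – עברית שפה קשה (Hebrew is a hard language)
• איש (man) = אישה (woman)
• הולך (goes) = הולכים (go, plural)
• כלב (dog) = לב (heart)
• כביסה (laundry) = ביסה = ביס (bite)
• לבני = בני = בן (son) \ בני

Hebrew????

מחיי השפה העברית (revivers of the Hebrew language)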

New DMVs and management tools
• sys.dm_fts_parser
• sys.dm_fts_index_keywords
• sys.dm_fts_index_keywords_by_doc
• sys.fulltext_index_fragments
• FULLTEXTCATALOGPROPERTY:
– MergeStatus
– PopulateStatus
• OBJECTPROPERTYEX:
– TableFulltextPopulateStatus
– TableFulltextPendingChanges
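The status properties listed above can be polled with plain SELECT statements. A minimal sketch, assuming a hypothetical catalog PostsCatalog and a hypothetical full-text indexed table dbo.Posts:

```sql
-- Catalog level: is a population or a master merge running?
SELECT FULLTEXTCATALOGPROPERTY(N'PostsCatalog', 'PopulateStatus') AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY(N'PostsCatalog', 'MergeStatus')    AS MergeStatus;

-- Table level: population state and number of tracked changes
-- that have not been applied to the index yet.
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.Posts'), 'TableFulltextPopulateStatus') AS TablePopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.Posts'), 'TableFulltextPendingChanges') AS PendingChanges;

-- Fragment level: the fragments that make up the table's full-text index.
SELECT fragment_id, status, data_size, row_count
FROM sys.fulltext_index_fragments
WHERE table_id = OBJECT_ID(N'dbo.Posts');
```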

Tips and Tricks
• Why scan if you can... FORCESEEK – a new hint that can help a bit in determining the query plan
• When using CONTAINS, don’t forget to use quotes (") when searching for more than one word
• Use \ to escape special characters
• To search for quotes (") in the text use "
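The quoting and FORCESEEK tips can be sketched as follows. The table dbo.Posts and its columns are hypothetical, and whether FORCESEEK actually improves the plan depends on the data and should be measured.

```sql
-- A phrase of more than one word goes inside double quotes
-- within the search string.
SELECT p.PostId, p.Title
FROM dbo.Posts AS p
WHERE CONTAINS(p.Body, N'"sql server"');

-- FORCESEEK on the base table asks the optimizer to seek into
-- dbo.Posts for the keys returned by the full-text function
-- rather than scanning the table.
SELECT p.PostId, p.Title, ft.[RANK]
FROM CONTAINSTABLE(dbo.Posts, Body, N'"sql server"', 100) AS ft
JOIN dbo.Posts AS p WITH (FORCESEEK)
    ON p.PostId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```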

Tips and Tricks
• Use an integer key as the unique index
• Place the full-text index on a separate filegroup
• Performance degrades when the full-text index is fragmented – use REORGANIZE to merge the fragments
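A reorganize can be made conditional on how fragmented the index is. This is a sketch only: the catalog PostsCatalog, the table dbo.Posts and the threshold of 10 fragments are hypothetical choices, not values from the deck.

```sql
-- Count the fragments that are currently queryable
-- (status 4 = closed and ready for query, 6 = merge input and ready for query).
DECLARE @fragments INT;

SELECT @fragments = COUNT(*)
FROM sys.fulltext_index_fragments
WHERE table_id = OBJECT_ID(N'dbo.Posts')
  AND status IN (4, 6);

-- Hypothetical threshold: merge only when the index is noticeably fragmented.
IF @fragments > 10
    ALTER FULLTEXT CATALOG PostsCatalog REORGANIZE;
```

Scheduling a check like this once a day keeps the merge, which is resource intensive, away from indexes that do not need it.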

Tapuz Solution
• SQL 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs
• Transactional replication
• FT indexes on a different filegroup than the main tables
• Change tracking (AUTO)
• Daily reorganize of fragmented indexes only
• A hierarchical set of queries to make sure the most relevant results are returned first
• Dynamic SQL so that dynamic search fields can be used
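One way to support user-selected search fields with dynamic SQL is sketched below. It is an illustration, not the Tapuz implementation: dbo.Posts, its columns and the two flag variables are hypothetical. Column names come only from fixed literals and the search term is passed as a parameter, so user input is never concatenated into the statement.

```sql
DECLARE @searchTitle BIT = 1;                   -- user chose to search titles
DECLARE @searchBody  BIT = 0;                   -- user chose not to search bodies
DECLARE @term NVARCHAR(200) = N'"sql server"';  -- search condition

-- Build the column list from whitelisted names only.
DECLARE @columns NVARCHAR(200) = N'';
IF @searchTitle = 1 SET @columns += N', Title';
IF @searchBody  = 1 SET @columns += N', Body';
IF @columns = N''   SET @columns = N', Title, Body';  -- fall back to all fields
SET @columns = STUFF(@columns, 1, 2, N'');            -- drop the leading ", "

DECLARE @sql NVARCHAR(MAX) = N'
SELECT TOP (20) p.PostId, p.Title, ft.[RANK]
FROM CONTAINSTABLE(dbo.Posts, (' + @columns + N'), @term, 100) AS ft
JOIN dbo.Posts AS p ON p.PostId = ft.[KEY]
ORDER BY ft.[RANK] DESC;';

EXEC sys.sp_executesql @sql, N'@term NVARCHAR(200)', @term = @term;
```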

Results relevance sorting logic
• Freetext ranking (Okapi BM25)
• Contains
• Contains all words (using AND)
• Free search (freetext)
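The slide does not show how the tiers are combined, so the following is only one possible reading: run an exact-phrase query, an all-words query and a free-text query, keep the best tier per row, and sort by tier and then by rank. The table dbo.Posts and the search strings are hypothetical.

```sql
DECLARE @phrase   NVARCHAR(200) = N'"hebrew search"';   -- exact phrase
DECLARE @allWords NVARCHAR(200) = N'hebrew AND search'; -- all words
DECLARE @free     NVARCHAR(200) = N'hebrew search';     -- free text

WITH tiers AS (
    SELECT [KEY], [RANK], 1 AS tier
    FROM CONTAINSTABLE(dbo.Posts, Body, @phrase, 100)
    UNION ALL
    SELECT [KEY], [RANK], 2
    FROM CONTAINSTABLE(dbo.Posts, Body, @allWords, 100)
    UNION ALL
    SELECT [KEY], [RANK], 3
    FROM FREETEXTTABLE(dbo.Posts, Body, @free, 100)
),
ranked AS (
    -- Keep only the best (lowest) tier in which each row was found.
    SELECT [KEY], [RANK], tier,
           ROW_NUMBER() OVER (PARTITION BY [KEY]
                              ORDER BY tier, [RANK] DESC) AS rn
    FROM tiers
)
SELECT TOP (20) p.PostId, p.Title, r.tier, r.[RANK]
FROM ranked AS r
JOIN dbo.Posts AS p
    ON p.PostId = r.[KEY]
WHERE r.rn = 1
ORDER BY r.tier, r.[RANK] DESC;
```

Ranks are compared only within a tier, since CONTAINSTABLE and FREETEXTTABLE compute them differently.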

Numbers
• Index sizes – 53 GB (~68 GB data)
• Number of rows indexed – >165M
• Average search time – 1.7 sec
• More than 97% of the searches respond in less than 7 sec
• Number of searches (2 months) – more than 6 million
• Number of connections – ~900

Known issues found so far
• High CPU load and intense disk IO during queries
• Population and merges are resource intensive
• Ranking not as a TVF?? – impossible
• Statistics, query plans and join types are not always optimal – hints can’t be used
• No scale-out or partitioning options

References
• Books Online
• SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227
• Pro Full-Text Search in SQL Server 2008 by Michael Coles
