iFTS – SQL 2008 FTS Engine
Hebrew Full-Text Search in the real world
Agenda
• Tapuz
• iFTS – Introduction
• iFTS – Terms and keywords
• Setting up Full-Text
• Index structure
• Population
• Querying
• Improvements from 2005
• Tapuz solution
• Known issues
Tapuz – It's all about content
• 5 major websites – Forums, Communa, Blogs, Flix (Video), Albums
• Over 165 million content items
• Over 3 million registered users
• Thousands of new items every day
• More than 30 web servers
• SQL Server:
  ° SQL Server 2005 Enterprise Edition on a 2-node cluster
  ° 4 quad-core CPUs, 16 GB RAM
  ° ~500 GB of data in 5 major databases
  ° ~1200 batch requests per second
Tapuz – old search engines
• 3 different search engines:
  ° 3 different database systems
  ° Search often didn't return correct results
  ° 3 different relevance sort algorithms
  ° Very resource intensive (more than 20 servers used for search alone!)
  ° No support for advanced search (dynamic fields)
  ° Long delay before a new item is indexed
Tapuz Search – project requirements
• Search through most of the existing content (more than 165M items)
• Allow querying newly added items in real time
• The search engine's default language is Hebrew, and its special linguistic characteristics should be supported
• Dynamic fields search – the user can choose which fields to search
• Should have a relevance sorting mechanism
Challenges
• The search should add minimal load on the production SQL Server
• Should have decent query performance
• Real-time item indexing
• How do we handle Hebrew ??!!*#$??!%?
The solution
• SQL 2005 Enterprise Cluster → transactional replication → SQL 2008 Standard
• Auto change tracking population
iFTS – Introduction
• FTS allows fast and flexible indexing for keyword-based querying of text data
• SQL Server has had full-text search capabilities since version 7.0
• The Full-Text Engine supports two roles: indexing and querying
• Full-text indexes can be created not only on textual data columns, but also on binary columns
• Common uses: searching web sites, product catalogs, document management systems
Terms and Keywords
Document, Population, Full-Text Catalog, Full-Text Index
• Population (also known as a crawl) – the process of creating and maintaining a full-text index (creating and building the index)
• Filter – Given a specified file extension such as .doc, a filter extracts text from a file stored in a varbinary(max) or image column
• Word breaker – For a given language, a word breaker tokenizes the text: it identifies individual words by determining where word boundaries exist, based on the lexical rules of the language
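One way to see what a word breaker produces is the sys.dm_fts_parser function, one of the new SQL Server 2008 management objects. The sketch below is only an illustration: it assumes LCID 1033 (English), borrows the sample sentence from the index-structure demo, and passes NULL as the stoplist ID so that no stoplist is applied. Calling the function requires sysadmin membership.

```sql
-- Sketch: list the tokens the English word breaker (LCID 1033) extracts from a phrase.
-- Arguments: query string, LCID, stoplist ID (NULL = no stoplist),
--            accent sensitivity (0 = insensitive).
SELECT occurrence,      -- position of the token within the parsed text
       display_term,    -- human-readable form of the token
       special_term     -- e.g. 'Exact Match', 'Noise Word', 'End of Sentence'
FROM sys.dm_fts_parser(N'"there is nothing like searching in English"', 1033, NULL, 0)
ORDER BY occurrence;
```

Swapping 1033 for the Hebrew LCID (1037 on a default installation; sys.fulltext_languages lists what is actually installed) shows how the Hebrew word breaker splits a phrase.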
• Stemmer – For a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language
• Token – A token is a word or a character string identified by a word breaker
• Stopword – A stopword is a word that is not relevant to your search and is filtered out from indexing and query processes
• Stoplist – A stoplist is a list of stopwords. SQL Server 2008 introduces stoplists
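As a rough illustration of how a stoplist is managed, the sketch below creates a custom stoplist from the system one, adds a stopword, and attaches it to a full-text index. ForumStoplist, the stopword 'forum' and the table dbo.ForumMessages are hypothetical names; the table is assumed to already have a full-text index.

```sql
-- Sketch: a custom stoplist seeded from the system stoplist.
-- ForumStoplist and dbo.ForumMessages are hypothetical names.
CREATE FULLTEXT STOPLIST ForumStoplist FROM SYSTEM STOPLIST;
GO

-- Add one extra English (LCID 1033) stopword.
ALTER FULLTEXT STOPLIST ForumStoplist ADD N'forum' LANGUAGE 1033;
GO

-- Attach the stoplist to an existing full-text index.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET STOPLIST ForumStoplist;
GO

-- Inspect the English stopwords now in the list.
SELECT sl.name, sw.stopword, sw.language_id
FROM sys.fulltext_stoplists AS sl
JOIN sys.fulltext_stopwords AS sw
  ON sw.stoplist_id = sl.stoplist_id
WHERE sl.name = N'ForumStoplist'
  AND sw.language_id = 1033;
```

GO is the batch separator used by SSMS and sqlcmd, not a T-SQL statement.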
• Full-Text Index – A full-text index stores information about significant words and their location within a given column
• Full-Text Catalog – A full-text catalog is a logical concept that refers to a group of full-text indexes
Setting up Full-Text
1. Enable DB
2. Create Catalog
3. Create Full-Text Index
Setting up Full-Text – Creating a Full-Text index
• A full-text index is a special type of token-based index
• To create a full-text index on a table or a view, the table or view must have a unique, single-column, non-nullable index
• Can be created on columns of type char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max)
• Each index supports only a single language per column
• Steps: choose columns and languages, assign to a catalog, choose a filegroup, choose a stoplist
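The setup steps can be sketched in T-SQL roughly as follows. All object names (dbo.ForumMessages, PK_ForumMessages, ftcForum) are hypothetical and only serve the illustration; the language is set to LCID 1037 on the assumption that the Hebrew word breaker is installed under that ID.

```sql
-- Sketch: a table with a unique, single-column, non-nullable integer key.
-- All names here are hypothetical.
CREATE TABLE dbo.ForumMessages
(
    MessageId INT           NOT NULL,
    Title     NVARCHAR(200) NOT NULL,
    Body      NVARCHAR(MAX) NULL,
    CONSTRAINT PK_ForumMessages PRIMARY KEY CLUSTERED (MessageId)
);
GO

-- Create the catalog that will group the full-text index.
CREATE FULLTEXT CATALOG ftcForum;
GO

-- Create the full-text index: columns and languages, key index,
-- catalog, change tracking and stoplist.
CREATE FULLTEXT INDEX ON dbo.ForumMessages
(
    Title LANGUAGE 1037,   -- 1037 = Hebrew (assumed installed)
    Body  LANGUAGE 1037
)
KEY INDEX PK_ForumMessages
ON ftcForum                -- or: ON (ftcForum, FILEGROUP some_filegroup)
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM);
GO
```

The commented alternative in the ON clause is where a separate filegroup would be named; the filegroup has to exist before the index is created.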
Index Structure
• Demo
Source Row:
ID  Text_English
1   there is nothing like searching in English
Full-Text Index:
Keyword    ColId  DocId  Occurrence
English    3      1      7
Nothing    3      1      3
Searching  3      1      5
• Keyword – contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.
• ColId – contains a value that corresponds to a particular column that is full-text indexed.
• DocId – contains an eight-byte integer value that maps to a particular full-text key value in a full-text indexed table.
• Occurrence – contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
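The contents of a full-text index can be inspected with sys.dm_fts_index_keywords_by_doc. The sketch below assumes a hypothetical full-text indexed table named dbo.ForumMessages in the current database. Note that this function reports how many times a keyword occurs in each document, not the individual word offsets.

```sql
-- Sketch: list the keywords stored for each document of one table.
-- dbo.ForumMessages is a hypothetical full-text indexed table.
SELECT display_term,      -- the Keyword in readable form
       column_id,         -- the ColId of the indexed column
       document_id,       -- the DocId, mapped to the full-text key value
       occurrence_count   -- number of occurrences in that document
FROM sys.dm_fts_index_keywords_by_doc(DB_ID(), OBJECT_ID(N'dbo.ForumMessages'))
WHERE display_term <> N'END OF FILE'   -- skip the end-of-data marker row
ORDER BY document_id, display_term;
```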
Population Process – Population methods
1. Full – A full population builds index entries for all the rows of the base table or indexed view
2. Change Tracking – SQL Server tracks changes to the base table since the last population:
   1. Auto
   2. Manual
3. Incremental Timestamp-Based Population
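Each population method maps to an ALTER FULLTEXT INDEX option. The statements below are alternatives to be run one at a time, not a script; dbo.ForumMessages is a hypothetical full-text indexed table.

```sql
-- Sketch: population methods for a hypothetical table dbo.ForumMessages.
-- Run the statement that matches the method you want; they are alternatives.

-- 1. Full population: rebuild index entries for every row.
ALTER FULLTEXT INDEX ON dbo.ForumMessages START FULL POPULATION;

-- 2a. Change tracking, automatic: changes are propagated in the background.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING AUTO;

-- 2b. Change tracking, manual: changes are tracked but applied only on request.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING MANUAL;
ALTER FULLTEXT INDEX ON dbo.ForumMessages START UPDATE POPULATION;

-- 3. Incremental population: needs a timestamp column on the table;
--    without one, SQL Server runs a full population instead.
ALTER FULLTEXT INDEX ON dbo.ForumMessages START INCREMENTAL POPULATION;
```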
Querying
• CONTAINS, FREETEXT – used as a predicate (WHERE)
Syntax:
CONTAINS (column_name, search_string)
• CONTAINSTABLE, FREETEXTTABLE – table-valued functions (TVF), include ranking
Syntax:
SELECT * FROM
CONTAINSTABLE (table_name, column_name, search_string, top_n)
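A short sketch of both query styles, assuming a hypothetical table dbo.ForumMessages with an integer key MessageId and a full-text indexed Body column:

```sql
-- Sketch: predicate form. Returns matching rows without a rank.
SELECT m.MessageId, m.Title
FROM dbo.ForumMessages AS m
WHERE CONTAINS(m.Body, N'"full text" AND search');

-- Sketch: table-valued form. Returns [KEY] and [RANK], joined back
-- to the base table on the full-text key column.
SELECT TOP (20) m.MessageId, m.Title, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, Body, N'"full text" AND search', 100) AS ft
JOIN dbo.ForumMessages AS m
  ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

The last argument of CONTAINSTABLE (100 here) limits the result to the top-ranked matches before the join. FREETEXT and FREETEXTTABLE follow the same two shapes.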
iFTS enhancements in SQL Server 2008
• Fully integrated into SQL Server
• Stoplists
• New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs)
• A new word breaker family (Hebrew and other languages)
• Performance improvements (reasons: integer key, full integration)
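Hebrew????
From the life of the Hebrew language
DEMOS – Hebrew is a hard language
• איש (man) = אישה (woman)
• הולך (walks, singular) = הולכים (walk, plural)
• כלב (dog) = לב (heart)
• כביסה (laundry) = ביסה = ביס (bite)
• לבני = בני = בן\בני (son / my son)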
New DMVs and management tools
• sys.dm_fts_parser
• sys.dm_fts_index_keywords
• sys.dm_fts_index_keywords_by_doc
• sys.fulltext_index_fragments
• FULLTEXTCATALOGPROPERTY:
  – MergeStatus
  – PopulateStatus
• OBJECTPROPERTYEX:
  – TableFulltextPopulateStatus
  – TableFulltextPendingChanges
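The property functions can be polled to see whether a population or merge is still running. A minimal sketch, assuming a hypothetical catalog ftcForum and table dbo.ForumMessages:

```sql
-- Sketch: catalog-level status (0 means idle for PopulateStatus).
SELECT FULLTEXTCATALOGPROPERTY(N'ftcForum', 'PopulateStatus') AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY(N'ftcForum', 'MergeStatus')    AS MergeStatus;

-- Sketch: table-level status and number of tracked changes not yet indexed.
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'),
                        'TableFulltextPopulateStatus') AS TablePopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'),
                        'TableFulltextPendingChanges') AS PendingChanges;
```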
Tips and Tricks
• Why scan if you can… FORCESEEK – a new hint that can help a bit in determining the query plan
• When using CONTAINS, don't forget to wrap the search string in double quotes (") if searching for more than one word
• Use \ to escape special characters
• To search for quotes (") in the text, use "
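A sketch of the first two tips, using a hypothetical table dbo.ForumMessages. Whether FORCESEEK actually improves the plan depends on the query, so treat it as something to test rather than a rule.

```sql
-- Sketch: a multi-word search must be a quoted phrase inside the search string.
-- N'sql server' (no inner double quotes) fails with a full-text syntax error.
SELECT m.MessageId, m.Title
FROM dbo.ForumMessages AS m WITH (FORCESEEK)   -- ask the optimizer to seek, not scan
WHERE CONTAINS(m.Body, N'"sql server"');
```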
Tips and Tricks
• Use an integer key as the unique index
• Place the full-text index on another filegroup
• Performance degrades when the full-text index is fragmented – use reorganize to merge
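Fragmentation can be checked before reorganizing. The sketch below assumes a hypothetical table dbo.ForumMessages whose full-text index lives in a hypothetical catalog ftcForum; what fragment count justifies a reorganize is a judgment call not taken from the deck.

```sql
-- Sketch: count the queryable fragments of one table's full-text index.
-- Status 4 = closed and ready for query, 6 = merge input and ready for query.
SELECT COUNT(*) AS fragment_count
FROM sys.fulltext_index_fragments
WHERE table_id = OBJECT_ID(N'dbo.ForumMessages')
  AND status IN (4, 6);

-- Merge the fragments of every index in the catalog (a master merge).
ALTER FULLTEXT CATALOG ftcForum REORGANIZE;
```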
Tapuz Solution
• SQL 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs
• Transactional replication
• FT indexes on a different filegroup than the main tables
• Change tracking (AUTO)
• Daily reorganizing of fragmented indexes only
• A hierarchical set of queries to make sure the most relevant results return first
• Use dynamic SQL so that dynamic search fields can be used
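The dynamic-fields idea could look something like the sketch below. This is not the Tapuz implementation, only an illustration under assumed names (dbo.ForumMessages, Title, Body, and the @SearchTitle / @SearchBody flags are all hypothetical). Only fixed column names are concatenated into the statement; the search string travels as a parameter.

```sql
-- Sketch: build the CONTAINSTABLE column list from user-selected fields.
DECLARE @SearchTitle bit = 1,
        @SearchBody  bit = 1,
        @Search      nvarchar(400) = N'"full text"';

DECLARE @Columns nvarchar(200) = N'';
IF @SearchTitle = 1 SET @Columns = @Columns + N', ' + QUOTENAME(N'Title');
IF @SearchBody  = 1 SET @Columns = @Columns + N', ' + QUOTENAME(N'Body');

IF @Columns = N''
BEGIN
    RAISERROR(N'Select at least one field to search.', 16, 1);
END
ELSE
BEGIN
    SET @Columns = STUFF(@Columns, 1, 2, N'');   -- drop the leading ", "

    DECLARE @Sql nvarchar(max) = N'
        SELECT TOP (20) m.MessageId, m.Title, ft.[RANK]
        FROM CONTAINSTABLE(dbo.ForumMessages, (' + @Columns + N'), @Search, 100) AS ft
        JOIN dbo.ForumMessages AS m
          ON m.MessageId = ft.[KEY]
        ORDER BY ft.[RANK] DESC;';

    -- The search string is passed as a parameter, never concatenated.
    EXEC sys.sp_executesql @Sql, N'@Search nvarchar(400)', @Search = @Search;
END;
```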
Results relevance sorting logic
• Freetext ranking (Okapi BM25)
• Contains
• Contains all words (using AND)
• Free search (freetext)
Numbers
• Index sizes – 53 GB (~68 GB data)
• Number of rows indexed – >165M
• Average search time – 1.7 sec
• More than 97% of the searches respond in less than 7 sec
• Number of searches (2 months) – more than 6 million
• Number of connections – ~900
Known issues found so far
• High CPU load and intense disk I/O during queries
• Population and merges are resource intensive
• Ranking not as a TVF?? – impossible
• Statistics, query plans and join types are not always optimal – hints can't be used
• No scale-out or partitioning options
References
• Books Online
• SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227
• Pro Full-Text Search in SQL Server 2008 by Michael Coles