SharePoint Saturday Belgium 2014 - Content Enrichment in SharePoint Search
Using the Fast Search for SharePoint Pipeline to Improve Search
Improving search using the pipeline in FAST Search for SharePoint
SurfRay.com ideaeng.com
Miles Kehoe
Author of: Professional Microsoft Search
@miles_kehoe
mileskehoe
Agenda
• Introductions
• When FS4SP makes sense
• What is the FS4SP indexing pipeline?
• Why is it important to you?
• How do you use it?
• Wrap Up
About Me
• Founder of New Idea Engineering Inc.
• Working with enterprise search since 1989
• Co-author of Professional Microsoft Search (Wrox)
• Author of several blogs:
- Enterprisesearchblog.com
- SearchComponentsOnline.com
• Search nerd
When to use FS4SP
Large datasets
• SP Search indexes up to 100M documents
• FS4SP is virtually unlimited (650M in tests)
• Rows and Columns concept
Need to fine-tune index & search
• Pipeline
• Need custom relevance profiles
• Need to fine-tune queries for relevance
What is the FS4SP indexing pipeline?
Standard sequence of ‘stages’ from crawl to index
• Format conversion & language detection
• Lemmatization / Stemming
• Entity extraction
• Map crawled properties to managed properties
Unique to FAST: the ability to insert custom processing
• ‘Must’ be just before mapper
• C# supported; but any code using STDIN/STDOUT ok
• Time critical!
A great way to fix up messy data!
Pipeline Architecture
[Diagram: Data Sources feed the Crawler, then the Content Processor, then the Indexer (Index Flow); User Queries go to the Query Processor. The Content Processor runs the FS4SP Pipeline stages: Format Conversion, Language Detection, Entity Extraction, Lemmatization, … Custom Extensibility, Mapper.]
Why is the pipeline important to you?
Sometimes content IS messy:
• URLs with abbreviations
• Additional metadata is in external sources
• Geo-tag documents
Diagnose problems in the indexing process:
• Identify bad or missing metadata
Examples where the pipeline can save you
Cryptic URLs
• With URLs like www.myco.com/mkt/prodmgmt/products.aspx
• I can add specific metadata to the document:
‘marketing’ (because of ‘mkt’) & ‘product management’ (because of ‘prodmgmt’)
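The URL example above can be sketched as a small lookup over path segments. This is only an illustration in Python, not the code used in the talk: the `ABBREVIATIONS` table and the `tags_from_url` name are hypothetical, and a real custom stage would emit the tags as a crawled property.

```python
from urllib.parse import urlparse

# Hypothetical abbreviation table; each site would maintain its own mapping.
ABBREVIATIONS = {
    "mkt": "marketing",
    "prodmgmt": "product management",
}

def tags_from_url(url):
    """Return metadata tags for known abbreviations found in the URL path."""
    # urlparse only fills .path as expected when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.lower().split("/")
    return [ABBREVIATIONS[s] for s in segments if s in ABBREVIATIONS]

print(tags_from_url("www.myco.com/mkt/prodmgmt/products.aspx"))
# ['marketing', 'product management']
```

Keeping the mapping in a table rather than in code means new abbreviations can be added without touching the stage logic.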
Adding valuable metadata:
• When I find a user name in a document I can look up and return that user's phone number and email
• When I find a city name I can geo-tag with latitude and longitude
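A minimal sketch of this kind of enrichment, in Python for brevity. The `DIRECTORY` and `GAZETTEER` tables are hypothetical stand-ins for an external directory and a geo lookup service, the sample entries are invented, and the naive substring matching is an assumption made only to keep the example short.

```python
# Hypothetical lookup tables; real data would come from external sources.
DIRECTORY = {"jane doe": {"phone": "555-0100", "email": "jane.doe@example.com"}}
GAZETTEER = {"brussels": (50.85, 4.35)}  # city -> (latitude, longitude)

def enrich(text):
    """Return extra properties for the first known user and city in the text."""
    lowered = text.lower()
    props = {}
    for name, contact in DIRECTORY.items():
        if name in lowered:  # naive substring match, for illustration only
            props["phone"] = contact["phone"]
            props["email"] = contact["email"]
            break
    for city, (lat, lon) in GAZETTEER.items():
        if city in lowered:
            props["latitude"] = lat
            props["longitude"] = lon
            break
    return props

print(enrich("Report by Jane Doe, presented in Brussels"))
```

In a real stage the lookups would need to be fast (for example, cached in memory), since every document passes through them.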
Debugging the indexing process
• When things are not as they seem I can diagnose problems in the indexing process
How do you use the pipeline?
Pipeline configuration files in \FASTSearch\etc
• PipelineConfig.xml
• PipelineExtensibility.xml
For each Document Processor node:
• Create an entry for a new ‘processor’
• Add your new processor name to the <pipelines> node
• Restart the ‘FAST processor server’ from CMD: psctrl reset
• Submit a single known test document
• Check your results
Config Files
Adding a Processor Stage
On each FAST document processor node:
• Edit %FASTSEARCH%\etc\pipelineconfig.xml
<processor name="Spy1" type="general" hidden="0">
  <load module="processors.Spy" class="Spy"/>
  <config>
    <param name="SpyDumpFile" value="var/log/spy.txt" type="str"/>
    <param name="FileStringCutOffLen" value="32768" type="int"/>
  </config>
  <inputs></inputs>
</processor>
• In the ‘Document Conversion’ section, add the new pipeline stage to run (in the Office 14 pipeline):
<processor name="Spy1" />
• Reset (each) document processor node:
psctrl reset
FS4SP Pipeline Extensibility
How do you create a custom stage?
Edit file %FASTSEARCH%\etc\pipelineconfig.xml as above
Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml
<PipelineExtensibility>
  <Run command="YourCode.EXE %(input)s %(output)s">
    <Input>
      <CrawledProperty propertyName="author" propertySet="GUID" varType="31"/>
    </Input>
    <Output>
      <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
      <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
    </Output>
  </Run>
</PipelineExtensibility>
Restart content servers from the command-line prompt:
psctrl reset
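The logic of a stage matching this configuration might look like the sketch below. It is written in Python only to keep it short, whereas the deck states that Microsoft supports .NET languages. It assumes the input and output files are XML documents with a `<Document>` root holding `<CrawledProperty>` elements, which should be checked against the product documentation. `PHONE_BOOK` and `process` are hypothetical names, and `"GUID"` is the placeholder from the slide, not a real property set.

```python
import sys
import xml.etree.ElementTree as ET

PROPERTY_SET = "GUID"  # placeholder copied from the slide; use your real property set
VAR_TYPE = "31"        # string type, as in the configuration above

PHONE_BOOK = {"jane doe": "555-0100"}  # hypothetical external lookup

def process(input_path, output_path):
    """Read the 'author' crawled property and write a 'phone' property."""
    doc = ET.parse(input_path).getroot()
    author = ""
    for prop in doc.iter("CrawledProperty"):
        if prop.get("propertyName") == "author":
            author = (prop.text or "").strip()

    out = ET.Element("Document")  # assumed root element name
    phone = PHONE_BOOK.get(author.lower())
    if phone:
        el = ET.SubElement(out, "CrawledProperty", propertySet=PROPERTY_SET,
                           varType=VAR_TYPE, propertyName="phone")
        el.text = phone
    ET.ElementTree(out).write(output_path, encoding="utf-8", xml_declaration=True)

# The Run command passes the input and output file paths as two arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
```

The stage always writes an output file, even when it has nothing to add, so the pipeline is never left waiting on a missing file.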
Pipeline is performance-critical
Pipeline runs in ‘sandbox’ environment
• NOT the same type of ‘sandbox’ in O365
• File I/O only allowed in C:\users\<fast service user>\AppData\LocalLow
• Maximum of 10 seconds to live
• Permissions restricted regardless of FAST Service user permissions
• Each Document Processor (DP) is an individual instance
• Only one item passes through a DP at a time
• If each document takes 1 second then 10 DPs can process at best 10 docs/sec
• Consider 1 sec for each of 100K docs: even with 10 DPs that is ~ 3 hours!
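The arithmetic behind these figures can be checked with a few lines of Python. The `indexing_hours` helper is hypothetical and assumes perfect parallelism across document processors, so the result is a best case.

```python
def indexing_hours(doc_count, seconds_per_doc, processors):
    """Best-case hours to push doc_count items through the pipeline."""
    docs_per_second = processors / seconds_per_doc
    return doc_count / docs_per_second / 3600

# 100K documents at 1 second each
print(round(indexing_hours(100_000, 1.0, 10), 2))  # 2.78 (about 3 hours on 10 DPs)
print(round(indexing_hours(100_000, 1.0, 1), 2))   # 27.78 (over a day on a single DP)
```

Halving the per-document time of a custom stage has the same effect as doubling the number of document processors.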
Pipeline Hints
MS only supports:
• Single custom stage (in PipelineConfig.xml)
• .NET languages (C#, etc)
But:
• A custom stage can appear in multiple places in PipelineConfig.xml even w/ different parameters
• Theoretically any executable that handles STDIN/STDOUT will do
• VC#/VC++/VBScript/CMD files seem to work
• Web services calls are supported
Using web services in Sandbox
[Diagram: pipeline stages exchange XML with a Web Service; XML Config]
Ontolica FAST Management
Ontolica FAST Management provides clear and easy-to-use configuration directly from within the SharePoint admin GUI. Replace XML configuration files, manual file deployments, and tricky PowerShell configuration with easy management consoles.
Key Features:
• Backup, Manage, & Deploy Configurations
• Manage FAST Relevance Profiles
• Upload & Manage Pipeline Extensions
• Create & Manage JDBC Connections
• FAST Webcrawler Configuration
• Manage FAST Server Processes from Central Admin
Additional Resources
• This slide deck live at http://slidesha.re/sCGAaP
• SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/
• Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/
• Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/
• ESW Blog - http://www.enterprisesearchwiki.com/wp/
• TechNet/MSDN/Microsoft
• And of course: SurfRay.com (Robert Piddocke & Josh Noble)
Miles Kehoe
Author of: Professional Microsoft Search
@miles_kehoe
mileskehoe
Q/A & Contact Details
ideaeng.com SurfRay.com
Robert Piddocke
Author: Pro SharePoint 2010 Search
@rpiddocke
R Piddocke