An Introduction to Distributed Search with Datastax Enterprise Search
Enterprise Search - Introduction
Enterprise Search
8/12/2011 – Damien Dewitte
2.
Enterprise Search: Setting the scene
Damien Dewitte
Lead ECM consultant
3.
Contents
- The enterprise search promise
- Some thoughts on search scenarios
- Make your content “findable”
- Search: How it works
- The enterprise search market
4.
5.
While on the Intranet …
6.
the Enterprise Search promise
7.
The Enterprise Search Promise
IDC 2001: “The High Cost of Not Finding Information”
Cost =
- Poor decisions based on faulty or poor information
- Duplicated efforts within different divisions/projects
- Lost sales due to customers’ inability to find products and services
- Lost productivity due to employees’ inability to find information
8.
The Enterprise Search Promise
Google (2008)
9.
The Enterprise Search Promise
10.
The Invisible Intranet
Using search on an intranet usually leaves a huge portion of existing valuable information ‘invisible’, because:
- Some information silos are not indexed:
  - Databases with structured content
  - External sources
  - Isolated departmental content repositories
  - Individual desktops
  - Content applications ‘in the cloud’
  - Digital archives
- Some information is “over-secured”
- Some information is trapped in proprietary file formats, which cannot be indexed
- Some information cannot be extracted as text:
  - Rich media files (audio, video)
  - Badly scanned documents
11.
The Enterprise Search Promise
12.
The Enterprise Search Promise
[Diagram: content sources connected to a central Enterprise Search Platform]
Content sources:
- RDBMS (JDBC, ODBC, SQLNet, DW, DM)
- Applications (e.g. ERM, CRM, Help Desk)
- Legacy Data (e.g. ISAM, VSAM, IMS)
- Message Queues (e.g. TIBCO, MQ-Series)
- DMS (e.g. M’Soft CMS, Documentum)
- eMail Systems (e.g. Notes, Exchange)
- Files (e.g. Word, Excel, pdf, images, mp3)
- Portals (e.g. WebSphere, WebLogic)
- WWW (HTML, XML, WML, JavaScript)
- Private Webs (e.g. news feeds, Intranets)
- Direct Push
Source categories: Unstructured, Structured, Real-time
Search applications on the platform: Site Search, Mail Search, BI Search, DMS Search, Corporate Search, eCommerce Search, …
13.
The Enterprise Search Promise
“There’s no reason to expect that search is going to get that much better. The basic algorithms by which search is done have not improved much since about 1975.
The only way to improve the situation is by enhancing search engines with more deterministic metadata.
If you look at the victory of Google, it wasn’t because they had better search techniques. It’s because they deployed one key metadata value – how many pages are linked to this one – to enhance the relevancy of their results. The same concepts need to be applied to the enterprise.”
(Tim Bray)
14.
Some thoughts on search scenarios
15.
Enterprise versus web search
Aspect | Web | Enterprise
Content | Mainly HTML and PDF | All formats and sources, including databases and legacy systems
Security | Focus on system security | Also restricting user access to specific content
Updates | Via (scheduled) crawling | Push updates to the index (near real time)
Volume | On average: 1000 files | Potentially: > 1.000.000 “records”
Metadata management | Centrally in e.g. Web CMS | Consolidate metadata from various source systems
Relevance | Popularity via hyperlinks | Popularity via “social” instruments?
16.
Enterprise versus web search
Probably the cheapest website search you can find
17.
Structured versus unstructured
Start by filtering
Start by typing
18.
Search versus research
“Meeting minutes social collaboration project”
“Amplexor proposal for Intranet”
“Timesheets april 2009”
“Ecm and Green IT in Europe”
“Does ECM have impact on governmental decisions in Spain?”
“I know you’re out there..”
“Life is like a box of chocolates, … You never know what you gonna get”
“average time spent on searching for content”
19.
Search versus research
Search based on:
- Information type (meeting minutes, proposal, invoice, timesheet, …)
- Document format (PDF, DOC, PPT, e-mail, …)
- Organisational source:
  - Projects
  - Products
  - Processes (HR, Compliance, Marketing, IT, …)
  - …
- Publication date, modification date
- Author
Example: “Meeting minutes social collaboration project”
Search queries are more or less predictable (after analysis)
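As a rough illustration of search driven by predictable metadata fields, the sketch below filters a small in-memory document list on exact metadata values. The `Document` fields, the sample documents and the `filter_documents` helper are assumptions made for this example; they do not come from any specific search product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    info_type: str    # information type, e.g. "Meeting minutes"
    doc_format: str   # document format, e.g. "PDF"
    source: str       # organisational source, e.g. a project name
    modified: date
    author: str


def filter_documents(docs, **criteria):
    """Return the documents whose metadata equals every given criterion."""
    return [
        d for d in docs
        if all(getattr(d, name) == value for name, value in criteria.items())
    ]


# Hypothetical sample data, only for illustration.
DOCS = [
    Document("Kick-off minutes", "Meeting minutes", "DOC",
             "Social collaboration", date(2011, 3, 1), "A. Author"),
    Document("Intranet proposal", "Proposal", "PDF",
             "Intranet", date(2011, 5, 9), "B. Writer"),
    Document("Steering minutes", "Meeting minutes", "PDF",
             "Intranet", date(2011, 6, 2), "A. Author"),
]

hits = filter_documents(DOCS, info_type="Meeting minutes",
                        source="Social collaboration")
print([d.title for d in hits])
```

Because the filter fields are known in advance, this kind of query can be served by simple drop-downs or facets in the search screen.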
20.
Search versus research
Research based on:
- Entities:
  - People
  - Geographical locations
  - Companies & Brands
  - …
- Source: internal or external
- Publication date range
- Natural language search
Example: “Does ECM have impact on governmental decisions in Spain?”
Search queries are unpredictable. The system should be “taught” how to interpret a query (natural language search, entity extraction from content, …).
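One very simple way to "teach" a system about entities is a lookup list (a gazetteer). The sketch below is a minimal, assumed illustration of entity extraction by dictionary lookup; real engines typically use far richer linguistic models, and the `GAZETTEER` contents are hypothetical.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> (canonical name, entity type).
GAZETTEER = {
    "spain": ("Spain", "location"),
    "europe": ("Europe", "location"),
    "amplexor": ("Amplexor", "company"),
}


def extract_entities(text):
    """Return the known entities mentioned in text, in order of first appearance."""
    found = []
    for token in re.findall(r"[a-z]+", text.lower()):
        entity = GAZETTEER.get(token)
        if entity and entity not in found:
            found.append(entity)
    return found


print(extract_entities("Does ECM have impact on governmental decisions in Spain?"))
```

The same function can be applied to both the query and the indexed content, so that a question about Spain can be matched against documents tagged with that location.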
21.
Metadata
What is metadata?
- Information about the information:
  - Descriptive
  - Structural
  - Administrative
Types of metadata:
- Implicit (e.g. creation date, publication date, URL, filename, file format, source system, …)
- Explicit (e.g. owner, topic, summary, expiry date, status, …)
Guiding metadata input with:
- Taxonomies
- Folksonomies
- Ontologies
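Implicit metadata can usually be read from the file system without asking the author anything. The following sketch, using only the Python standard library, collects a few such fields for a file; the function name, the returned field names and the default `source_system` label are assumptions for illustration. It uses the modification time, because a true creation date is not available on every platform.

```python
from datetime import datetime, timezone
from pathlib import Path


def implicit_metadata(path, source_system="file share"):
    """Collect metadata that can be derived from the file itself."""
    p = Path(path)
    stat = p.stat()
    return {
        "filename": p.name,
        "file_format": p.suffix.lstrip(".").lower(),
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        "size_bytes": stat.st_size,
        "source_system": source_system,  # assumed label for where the file lives
    }
```

Explicit metadata such as owner, topic or expiry date cannot be derived this way and still has to come from the contributor or from business rules.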
22.
Taxonomies
23.
Folksonomies
http://taggalaxy.de
24.
Ontologies
Taxonomies that represent knowledge as a set of concepts within a domain, together with the relationships between those concepts
http://en.wikipedia.org/wiki/Geopolitical_ontology
25.
Metadata
Statement 1: “A performant Enterprise Search Engine should not require information workers to add metadata. It should just crawl all my information sources”
But:
- Will users understand the results displayed? (title, author, …)
- How will they filter results?
- Does it really help to crawl 1.000.000 records if 900.000 have become irrelevant over time?
26.
Metadata
Statement 2: “Google doesn’t need metadata”
Are you sure?
27.
Metadata
So you think Google doesn’t need metadata?
28.
Simple example of the semantic web
29.
Metadata
Statement 3: “Adding metadata is so time consuming my information workers will never do it.”
Yes, but:
- In a structured ECM approach, it is possible to automate much of the metadata input, because it can be deduced from business rules
- If you’re not 100% sure you will need a metadata field for a specific purpose, then don’t create it
- Convince users of the value of the metadata fields that remain
- Make it user friendly for content contributors to add metadata
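Deducing metadata from business rules can be as simple as looking at where a document is stored and how it is named. The sketch below is a hypothetical rule set, not taken from any ECM product: each rule pairs a test on the document path with the metadata values to apply.

```python
def filename(path):
    """Return the lower-cased last segment of a slash-separated path."""
    return path.rsplit("/", 1)[-1].lower()


# Hypothetical business rules: (test on the path, metadata to apply when it matches).
RULES = [
    (lambda path: "/projects/" in path, {"organisational_source": "Projects"}),
    (lambda path: "/hr/" in path, {"organisational_source": "HR"}),
    (lambda path: "minutes" in filename(path), {"information_type": "Meeting minutes"}),
    (lambda path: "timesheet" in filename(path), {"information_type": "Timesheet"}),
]


def derive_metadata(path):
    """Apply every matching rule and merge the resulting metadata."""
    metadata = {}
    for applies, values in RULES:
        if applies(path):
            metadata.update(values)
    return metadata


print(derive_metadata("/projects/intranet/Minutes-2011-05.doc"))
```

Fields filled in this way no longer have to be entered by the contributor, which leaves only the few explicit fields that really need human input.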
30.
Metadata
Avoid defining metadata around the document, if it should already be present IN the document.
31.
Make content findable
32.
Findability
Findability is not obtained just by implementing search technology
AIIM.org: “Information Organization and Access (IOA) refers to a collection of technologies to help you organize and find information”, which includes:
- enterprise search
- content classification
- categorization and clustering
- fact and entity extraction
- taxonomy creation and management
- information presentation (i.e., visualization)
- information governance
33.
Findability Tips & Tricks
The more value content has, the more effort should be spent in managing it (and making it findable)
34.
Findability Tips & Tricks
One search interface doesn’t solve it all. Keep in mind that:
- Specific content sources or Lines of Business might require specialized search screens
35.
Findability Tips & Tricks
Define specific search scopes, if your information governance permits …
36.
Findability Tips & Tricks
Landing pages are still “in”!
- Projects overview page
- Knowledge base page (links to knowledge bases)
- Practical guide (categorized hyperlinks to practical information)
- Tools
- Forms
- Filtered listings (e.g. automatic listing of all FAQ content types)
37.
How search works
38.
How it works
[Architecture diagram]
- Content sources: Web content; files, documents; databases; custom applications; multimedia
- Connectors: web crawler, file traverser, database connector, content push
- Document processing pipeline, writing to the index files
- Search query & result processing pipeline, with filter
- Outputs: query results, alerts
- Front ends: vertical applications, portals, custom front-ends, mobile devices
- Tuning, administration
39.
How it works
Connect to content sources and get data:
- Web pages (e.g. XML, HTML, WML): Crawler
- Files, documents (e.g. Word, Excel, pdf): File traverser
- Database content (e.g. Oracle, DB2): Database connectors
- Applications (e.g. Sharepoint, Documentum, Exchange, CMS/DMS): Application connectors
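To make the connector idea concrete, here is a minimal sketch of a file traverser using only the Python standard library. It walks a directory tree and yields the text of files it knows how to read. The `INDEXABLE` extension list and the `traverse` function are assumptions for this example; a real connector would also convert binary formats such as Word or PDF to text.

```python
from pathlib import Path

# Assumed set of plain-text formats this toy traverser can read directly.
INDEXABLE = {".txt", ".html", ".xml"}


def traverse(root):
    """Yield (path, text) for every indexable file below root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in INDEXABLE:
            yield str(path), path.read_text(encoding="utf-8", errors="replace")
```

A crawler or database connector plays the same role for other sources: it hands raw content to the next stage without knowing how that content will be indexed.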
40.
How it works
Analyze and index content to make it searchable:
- Convert and process content through the pre-processing pipeline:
  - Lemmatization/stemming, entity extraction, taxonomy classification
  - Custom logic (e.g. adding special tags)
- Write content to index files
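The indexing step can be sketched as a tiny pipeline that normalizes text and writes it to an inverted index (term to documents). This is a simplified illustration under stated assumptions: the `stem` function is a crude suffix stripper standing in for real stemming or lemmatization, and the index is an in-memory dictionary rather than index files.

```python
import re
from collections import defaultdict


def stem(token):
    """Crude suffix stripping; a stand-in for real stemming or lemmatization."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lower-case, tokenize and stem a piece of text."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_index({1: "search engines", 2: "searching the index"})
print(index["search"])
```

Because "search" and "searching" are reduced to the same term, both documents end up under one index entry, which is what later makes them findable with either word form.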
41.
Search Engine: How It Works
Analyze query:
- Use query language or query API
- Convert and process query through the query pipeline:
  - Linguistic processing
  - Custom logic (e.g. query term modification/addition)
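A query pipeline can be pictured as a few small transformations applied before matching. The sketch below assumes a hypothetical stop-word list and synonym table; it lower-cases and tokenizes the query (linguistic processing) and then adds related terms (custom logic).

```python
import re

# Hypothetical configuration, for illustration only.
STOPWORDS = {"the", "a", "of", "for", "in", "on"}
SYNONYMS = {"ecm": ["content", "management"]}


def process_query(query):
    """Tokenize the query, drop stop words and add synonym terms."""
    terms = [t for t in re.findall(r"[a-z0-9]+", query.lower())
             if t not in STOPWORDS]
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(expanded))


print(process_query("Proposal for the ECM intranet"))
```

Whatever processing is applied to the content at indexing time generally has to be mirrored here, otherwise query terms and index terms will not line up.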
42.
How it works
Match query to content index:
- Query- and content-adaptive matching
- Exploit all information and structure in the data
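Matching a query against the index usually comes down to scoring each document for the query terms. The sketch below uses a basic TF-IDF sum as one well-known scoring scheme; it is not claimed to be what any particular engine does, and the sample documents are assumptions built from the example queries in this deck.

```python
import math
from collections import Counter

# Hypothetical mini-collection, for illustration only.
DOCS = {
    "minutes": "meeting minutes social collaboration project",
    "proposal": "amplexor proposal for intranet project",
    "timesheet": "timesheets april 2009",
}


def score(query, docs):
    """Rank document ids by a simple TF-IDF sum over the query terms."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        total = 0.0
        for term in query.split():
            # Document frequency: in how many documents does the term occur?
            df = sum(1 for other in tokenized.values() if term in other)
            if df:
                total += tf[term] * math.log(1 + n_docs / df)
        if total > 0:
            scores[doc_id] = total
    return sorted(scores, key=scores.get, reverse=True)


print(score("intranet project", DOCS))
```

Rare terms weigh more than common ones, so the document containing both "intranet" and "project" ranks above the one that only contains "project".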
43.
How it works
Return results to user:
- Convert and process results through the result pipeline:
  - Re-sort, filter for security, organize for dynamic drilldown
- Pass results on to application (generated or through API)
- Push results to alert engine and then external environment (e.g. mail, queue)
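The result pipeline can be sketched in a few lines: drop hits the user may not see, re-sort by score, and count values of a metadata field to support drilldown. The hit structure (`title`, `score`, `allowed_groups`, `doc_format`) is an assumption made for this example.

```python
from collections import Counter


def process_results(hits, user_groups):
    """Filter hits for security, re-sort by score and compute format facets.

    hits: list of dicts with 'title', 'score', 'allowed_groups', 'doc_format'.
    user_groups: set of group names the current user belongs to.
    """
    # Security trimming: keep a hit only if the user shares a group with it.
    visible = [h for h in hits if user_groups & set(h["allowed_groups"])]
    visible.sort(key=lambda h: h["score"], reverse=True)
    # Facet counts that a front end could show as drilldown options.
    facets = Counter(h["doc_format"] for h in visible)
    return visible, dict(facets)
```

Note that the facet counts are computed after the security filter, so the drilldown options never reveal the existence of documents the user cannot open.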
44.
Mediafin
45.
How it works
Federated Search: Relies on the indexes and the relevance algorithms of the underlying search engines
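Since a federated search layer has no index of its own, it can only forward the query and combine what comes back. The sketch below shows one simple merging strategy, round-robin interleaving, chosen here because scores from different engines are usually not comparable. The `engines` mapping and its callables are hypothetical stand-ins for real search back ends.

```python
from itertools import zip_longest


def federated_search(query, engines):
    """Send the query to every engine and interleave their ranked results.

    engines: {name: callable(query) -> ranked list of result ids}.
    Returns a list of (engine name, result id) tuples.
    """
    ranked_lists = [
        [(name, hit) for hit in search(query)]
        for name, search in engines.items()
    ]
    merged = []
    # Take the best remaining hit of each engine in turn.
    for tier in zip_longest(*ranked_lists):
        merged.extend(item for item in tier if item is not None)
    return merged
```

Each engine keeps its own ranking, so the quality of the merged list depends entirely on the quality of the underlying engines.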
46.
the Enterprise Search market
47.
The Enterprise Search Market
What’s the vendor’s focus?
- Business Intelligence
- Text-mining (linguistic support!)
- E-Commerce
- Image/Video: visual information retrieval
- Audio/Video: speech recognition
- eDiscovery
- …
48.
The Enterprise Search Market
Enterprise search products can be:
- Specialized: products that use search to address a need in a specific area like customer service or to supplement business intelligence platforms
- Integrated: products that merge search capabilities with other information management functions like content management, collaboration or analytics; the goal of these products is to become deeply ingrained in the technology portfolio so that the use of the tool becomes a ubiquitous part of the information workplace
- Detached: products like Google’s appliance focused on ease of deployment and flexibility
49.
The Enterprise Search Market
Forrester (September 2011) evaluated twelve vendors/products in its Market Overview (not including open source):
- Autonomy IDOL 7 (acquired by HP)
- Attivio AIE 1.3
- Coveo Platform 6.5
- Endeca Latitude 2 (acquired by Oracle)
- Exalead CloudView 5.1
- Fabasoft Mindbreeze 5.0
- Google Search Appliance 6.8
- IBM Content Analytics with Enterprise Search 2.2
- ISYS Enterprise Server v9.7
- Microsoft FAST Search for SharePoint Server 2010
- Sinequa ES 7
- Vivisimo Velocity 8.0
50.
The Enterprise Search Market
Important trends:
- Social and collaborative features
- Mobile support
- Audio/Video
- Cloud
- Spatial support
- Semantics/text analytics
- Search Based Applications (“SBA”)
51.
Wrap up
Search technology platforms are mature and available on the market in abundance and in multiple flavors.
But make sure you are:
- Cost-effective (what’s the business case? Priorities?)
- Consistent in content classification and governance
- Continuously monitoring usage and improving relevance
- Clever & pragmatic
- Creative (user interface, multi-device)
52.
Thank you!