GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)
About Me
Grover Fields
Revorg, LLC (Owner)
M.S. Information System (Troy University)
B.S. Industrial Engineering (Florida A&M University)
Stanford Project Management Courses
About Me
10+ years of development, analysis, and implementation
10+ years of ColdFusion experience
2+ years of Java experience
Commonspot, Strongmail, ClickFix (Developer)
Email: [email protected]
Web site: http://www.groverfields.com
Agenda
What? What can we do with GOAT?
Why? Why do we want to use GOAT and not Verity?
How? How do we do that?
Conclusion and alternative solutions
What
What is a Search Engine?
Builds an index on text
Answers queries using that index, a la Verity
Already have a database? A search engine offers:
Scalability
Relevance ranking
Tweaking
Integration of different sources (email, web pages, files, databases)
What is a search engine? (cont.)
Works on words, not on substrings: auto != automatic, automobile
Indexing process:
Convert document
Extract text and metadata
Normalize text
Write (inverted) index
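The indexing steps above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not Lucene's API: the function names `normalize`, `build_index`, and `search` and the sample documents are assumptions made for the example. It shows why a word-based inverted index treats "auto" and "automatic" as different terms.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and split it into whole-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Look up one whole word; substrings of longer words do not match."""
    tokens = normalize(word)
    if not tokens:
        return []
    return sorted(index.get(tokens[0], set()))


# Hypothetical documents, already converted to plain text.
docs = {
    1: "Auto parts for sale",
    2: "Automatic transmission repair",
    3: "Automobile insurance quotes",
}
index = build_index(docs)

print(search(index, "auto"))       # [1]
print(search(index, "AUTOMATIC"))  # [2]
```

A query is answered with one dictionary lookup rather than by rereading every document, which is the property the slides rely on.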
Apache Lucene Overview
Lucene Java 2.4
A high-performance, full-featured text search engine library written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
No GUI
http://lucene.apache.org
Apache Lucene Overview
Java library for indexing and searching
No dependencies
Works with Java 1.4 or later
Input for indexing: Document objects
Each document: a set of Fields, each with a field name and field content
Stores its index as files on disk or in memory
No document converters
No web crawler
Lucene Java users
HBCU.info
LinkedIn
IBM OmniFind Yahoo! Edition
Technorati.com
Eclipse
Monster.com
…
Lucene Java Summary
Java library for indexing and searching
Lightweight / no dependencies
Powerful, fast, and tested!
No document conversion
No GUI
Verity Limitations
10,000 documents for ColdFusion Developer Edition
125,000 documents for ColdFusion Standard Edition
250,000 documents for ColdFusion Enterprise Edition
What do developers do in a shared hosting environment?
Is it possible for the hosting company to limit the number of documents per Web site?
T-SQL Limitations?
Search for "Yahoo" on my blog:
    SELECT entry.id
    FROM tbl_mango_entry AS entry
    INNER JOIN tbl_mango_post AS post ON entry.id = post.id
    WHERE entry.blog_id = 'default'
      AND (entry.title LIKE '%yahoo%'
           OR entry.content LIKE '%yahoo%'
           OR entry.excerpt LIKE '%yahoo%')
      AND post.posted_on <= getdate()
      AND entry.status = 'published'
    ORDER BY post.posted_on DESC
Multiply that by 10, 100, 500, or 1,000 users per hour?
T-SQL Limitations?
Full table scan = the #1 performance killer!
No search sorting (results are not ranked by relevance)
An RDBMS isn't designed to do this, but it allows it
Use the right tools!
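The full-scan point can be checked on a small scale. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in, which is an assumption: the slides concern T-SQL, and the table `entry`, the index `idx_entry_title`, and the sample rows are made up for illustration. It asks the query planner how it would run a leading-wildcard `LIKE` compared with an exact match on an indexed column. The exact wording of the plan varies between SQLite versions, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_entry_title ON entry (title)")
conn.executemany(
    "INSERT INTO entry (title) VALUES (?)",
    [("Yahoo buys a startup",), ("Lucene tips",), ("Search at Yahoo",)],
)


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Leading wildcard: the sort order of the index cannot narrow the search,
# so every row (or every index entry) has to be examined.
like_plan = plan("SELECT id FROM entry WHERE title LIKE ?", ("%yahoo%",))

# Exact match: the planner can seek directly into the index.
exact_plan = plan("SELECT id FROM entry WHERE title = ?", ("Lucene tips",))

print(like_plan)
print(exact_plan)
```

On the SQLite versions I am aware of, the first plan is reported as a SCAN and the second as a SEARCH using the index. Other database engines report this differently, but the underlying limitation of `LIKE '%term%'` is the same.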
How? GOAT Search Solution
Lucene 2.4.0
ColdFusion MX 8
MX is fine, but the GUI needs to be rolled back
Commons IO 1.4
Simply packaged .jar files
Simple Web-based GUI
How? Macromedia JDBC Drivers
Same drivers that ColdFusion uses
No additional drivers to install
Supports RDBMS only: MSSQL, MySQL, Oracle
No file system support (yet)
Basics?
Indexing extracts both meaning and structure from unstructured information by indexing each document.
The index contains a complete list of all the words used in a given document, along with metadata about that document.
Lucene creates a collection that normalizes both the structured and unstructured data.
Search requests then check these collections rather than scanning the actual documents and database fields.
This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.
Basics?
Collection: A special database created by Lucene that contains metadata that describes the documents
Documents: A sequence of fields; similar to a row in a database table (Row 1, Row 2, etc.)
Fields: A named sequence of terms; similar to a column in a table (Primary Key, Column 1)
Terms: A term is a string
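The row/column analogy above can be made concrete with a small model. This is a sketch in Python, not Lucene's own classes: `Field`, `Document`, `row_to_document`, and the sample rows are hypothetical names and data chosen for illustration, and terms are produced by simple whitespace splitting rather than by a real analyzer.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str   # comparable to a column name
    value: str  # the raw content of that column

    def terms(self):
        """Split the field content into terms (plain lowercase strings)."""
        return self.value.lower().split()


@dataclass
class Document:
    fields: list = field(default_factory=list)  # comparable to one row

    def get(self, name):
        """Return the value of the first field with this name, or None."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None


def row_to_document(row):
    """Turn one database row (a dict of column -> value) into a Document."""
    return Document([Field(col, str(val)) for col, val in row.items()])


# Hypothetical rows, standing in for a table with a primary key column.
rows = [
    {"id": 1, "title": "ColdFusion developer", "resume": "ColdFusion and Java"},
    {"id": 2, "title": "Java developer", "resume": "Java and Lucene"},
]
collection = [row_to_document(r) for r in rows]

print(collection[0].get("id"))          # 1
print(collection[1].fields[2].terms())  # ['java', 'and', 'lucene']
```

Keeping the primary key as a field of each document is what lets a search result point back to the original database row.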
Knowledge?
Index: A special database created by Lucene that contains metadata that describes the documents
Query Syntax: Similar to Google's advanced search
field:value, e.g. resume:coldfusion
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
Results: primary key list of values; XML based on the document; CFX tag integration
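To illustrate the `field:value` form and the "primary key list" result, here is a minimal sketch in Python. It is not Lucene's query parser and it handles only a single `field:value` pair or a bare word. The names `parse_query` and `search`, the default field `content`, and the sample rows are assumptions for the example. For brevity it checks each row directly instead of consulting an index.

```python
def parse_query(query, default_field="content"):
    """Split 'field:value' into (field, value); bare words use default_field."""
    if ":" in query:
        field_name, value = query.split(":", 1)
        return field_name.strip(), value.strip().lower()
    return default_field, query.strip().lower()


def search(rows, query, key="id"):
    """Return primary keys of rows whose named field contains the term."""
    field_name, value = parse_query(query)
    hits = []
    for row in rows:
        words = str(row.get(field_name, "")).lower().split()
        if value in words:  # whole-word match, not substring match
            hits.append(row[key])
    return hits


# Hypothetical rows with a primary key column named "id".
rows = [
    {"id": 101, "resume": "ColdFusion and Java developer", "content": "profile"},
    {"id": 102, "resume": "Oracle DBA", "content": "coldfusion tutorial"},
]

print(search(rows, "resume:coldfusion"))  # [101]
print(search(rows, "coldfusion"))         # [102]
```

The caller can then use the returned keys to fetch the full rows from the database, which matches the "primary key list of values" result style named on the slide.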
Alternative Solutions for Search
Commercial vendors: FAST, $100k; Autonomy, $80k; Google, $50k
Commercial search engines based on Lucene: IBM OmniFind Yahoo! Edition
RDBMS with integrated search: Oracle, MySQL, MSSQL (PERFORMANCE KILLERS)
Road Map
Road map: a set of guidelines, instructions, or explanations, as in "wrote an ethics code as a road map for the behavior of elected officials."
Overhaul Java programming (still novice)
Integrate with other products: Aperture, Nutch, Solr
File system integration: .txt, .pdf, .doc, .ppt, etc.
Geospatial-based searches, e.g. all jobs within a 50-mile radius
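A radius search like the "jobs within 50 miles" example could be approached with a great-circle distance filter. The sketch below is one possible approach in Python, not a description of how GOAT or Lucene would implement it: the haversine formula is swapped in for illustration, and `haversine_miles`, `jobs_within`, the Earth radius constant, and the sample coordinates are all assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def jobs_within(jobs, lat, lon, radius_miles=50):
    """Return ids of jobs whose location is within radius_miles of the point."""
    return [
        job["id"]
        for job in jobs
        if haversine_miles(lat, lon, job["lat"], job["lon"]) <= radius_miles
    ]


# Hypothetical job locations: half a degree and one degree of latitude
# north of the search point (roughly 34.5 and 69 miles away).
jobs = [
    {"id": 1, "lat": 0.5, "lon": 0.0},
    {"id": 2, "lat": 1.0, "lon": 0.0},
]

print(jobs_within(jobs, 0.0, 0.0))  # [1]
```

Checking every job this way is a linear scan, so a real implementation would likely narrow the candidates first, for example with a bounding box on latitude and longitude.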