Search Engine - How to Make it
Uploaded by: andreas-yunanto
Category: Technology
Search Engine: How To Make It
Wednesday, December 12, 12
Search Engine
• Search Quality Measurement
[Diagram: within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL]
• Database search: low recall, high precision
• Web search: high recall, low precision
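The slide names RET, REL and RET ∩ REL but does not spell out the formulas. A minimal sketch, assuming the standard definitions precision = |RET ∩ REL| / |RET| and recall = |RET ∩ REL| / |REL|; the function name and the document ids are hypothetical and only illustrate the idea:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # RET ∩ REL
    # Guard against empty sets so the sketch never divides by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
ret = {"doc1", "doc2", "doc3", "doc4"}   # what the engine returned
rel = {"doc2", "doc4", "doc7"}           # what the user actually needed
precision, recall = precision_recall(ret, rel)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.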
Search Engine
[Architecture diagram, in pipeline order:
Sources: File System, 3rd party apps, Database
Crawlers: File System Crawler, Crawler API, Database Crawler
Raw content: Text, HTML, PDF, Document, Image, ...
Parsers: Text Parser, HTML Parser, PDF Parser
Parsed output: Documents (title, summary, author, datetime)
Document Enhancing: Documents (Categorized, Taxonomized)
Indexer (Stop Analyzer, Language Analyzer): Index
Index Searcher: Mobile Client, Web Client
Document Landing Page]
Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Indexing
• Searching
Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Duplicate Content Detection
• Document Enhancement
• Indexing
• Searching
• Document Serving
Search Engine
• Crawling
• Collecting Data
• Input : Data content to Search
• Output : Raw Content Data in its original format
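As an illustration of the input and output described above, here is a minimal sketch of a file system crawler. The function name `crawl_file_system` is hypothetical, and the temporary directory only stands in for a real content source; the point is that the crawler hands back raw bytes without parsing them:

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def crawl_file_system(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw bytes) for every regular file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # No decoding or parsing: content stays in its original format.
            yield str(path), path.read_bytes()


# Demo on a throwaway directory (hypothetical files, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.html").write_bytes(b"<p>hi</p>")
    crawled = dict(crawl_file_system(tmp))

for name, raw in crawled.items():
    print(name, len(raw), "bytes")
```

A database crawler or a crawler built on a 3rd party API would presumably expose the same shape of output, so that the parsing step does not need to know where the content came from.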
Search Engine
• Crawling
[Diagram: File System, 3rd party apps, Database → File System Crawler, Crawler API, Database Crawler → raw content (Text, HTML, PDF, Document, Image, ...)]
Search Engine
• Parsing
• Process to extract elements from crawled documents
• Input : Raw Contents
• Output : Textual Structured Documents
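A minimal sketch of the parsing step for one format, HTML, using Python's standard `html.parser` module. The class name, the function name and the output fields (`title`, `body`) are assumptions chosen for illustration; a real parser would also extract summary, author and datetime where available:

```python
from html.parser import HTMLParser


class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title> text and all other visible text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw: bytes) -> dict:
    """Turn raw crawled bytes into a small structured document."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return {"title": parser.title.strip(), "body": " ".join(parser.text_parts)}


raw = (b"<html><head><title>Search Engine</title></head>"
       b"<body><h1>Crawling</h1><p>Collecting data</p></body></html>")
doc = parse_html(raw)
print(doc)
```

Text and PDF parsers would produce the same kind of structured document, so the later steps can treat every source uniformly.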
Search Engine
• Parsing
[Diagram: raw content (Text, HTML, PDF, Document, Image, ...) → Text Parser, HTML Parser, PDF Parser → Documents (title, summary, author, datetime)]
Search Engine
• Duplicate Content Detection
• More data means more duplicated data
• Search engines implement similar-document detection
Search Engine
• Document Representation
Model: Term Frequency (Tf)
Example:
Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
Document 2 (d2) = "andi also likes to watch soccer game."
Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}
Document representation in the Tf model:
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
Search Engine
• Document Similarity
Similarity between documents d1 and d2: S(d1, d2)
S(d1, d2) = |d1 - d2|, the sum of the absolute differences between corresponding entries
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
Example:
S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|
S(d1, d2) = 5
With this definition, a smaller value means the two documents are more similar.
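The representation and the measure S can be checked with a short sketch. The function names `tf_vector` and `s` and the simple lowercase word tokenizer are assumptions for illustration. The dictionary and the two sentences are the ones from the slides, and the counts are taken directly from those sentences:

```python
import re

# Dictionary from the slide, in index order 1..7.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in dictionary]


def s(v1, v2):
    """S(d1, d2): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print(d1)         # [1, 2, 1, 1, 1, 1, 0]
print(d2)         # [1, 1, 1, 0, 0, 0, 1]
print(s(d1, d2))  # 5
```

Words that are not in the dictionary ("to", "his", "it", "also", "game") are simply ignored, and identical documents give S = 0.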
Search Engine
• Algorithm
1. Compute Tf for every document.
2. Find the smallest value of S(d, di) over all documents di in the collection to get the document most similar to d.
3. If S(d, di) < threshold, compare the creation dates of d and di and erase the older document.
4. Repeat steps 2 and 3 until no value of S is less than the threshold.
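The steps above can be sketched as follows. This is one possible reading of the slide, not a definitive implementation: it searches for the globally closest pair on each pass instead of fixing one document d, and the function name, document ids, dates and threshold are all hypothetical:

```python
from datetime import date
from itertools import combinations


def s(v1, v2):
    """S: sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def remove_near_duplicates(docs, threshold):
    """docs maps id -> (tf_vector, created). Returns the ids that survive."""
    docs = dict(docs)  # work on a copy
    while len(docs) > 1:
        # Step 2: find the pair of documents with the smallest S.
        a, b = min(combinations(docs, 2),
                   key=lambda pair: s(docs[pair[0]][0], docs[pair[1]][0]))
        # Step 4: stop once no pair is closer than the threshold.
        if s(docs[a][0], docs[b][0]) >= threshold:
            break
        # Step 3: erase the older of the two documents.
        older = a if docs[a][1] <= docs[b][1] else b
        del docs[older]
    return sorted(docs)


# Step 1 is assumed done: each document already carries its Tf vector.
collection = {
    "a": ([1, 2, 1, 1, 1, 1, 0], date(2010, 1, 1)),
    "b": ([1, 2, 1, 1, 1, 0, 0], date(2012, 1, 1)),
    "c": ([1, 1, 1, 0, 0, 0, 1], date(2011, 1, 1)),
}
survivors = remove_near_duplicates(collection, threshold=2)
print(survivors)  # ['b', 'c']
```

Documents "a" and "b" differ by 1, which is below the threshold, so the older one ("a") is erased; "b" and "c" differ by 4 and both stay. Comparing every pair is quadratic in the collection size, so a large collection would need a cheaper candidate-selection step.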
Search Engine
• Document Enhancement
• Tag documents based on a taxonomy
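The slides do not say how the tags are assigned. A minimal sketch, assuming the simplest possible approach of matching keywords against a hand-written taxonomy; the taxonomy, the function name `tag_document` and the sample document values are all hypothetical:

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "sports": {"soccer", "football", "game"},
    "entertainment": {"movie", "film", "cinema"},
}


def tag_document(doc, taxonomy=TAXONOMY):
    """Return a copy of doc with a 'categories' list added."""
    text = (doc["title"] + " " + doc["summary"]).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    # A category applies when at least one of its keywords occurs.
    categories = sorted(cat for cat, words in taxonomy.items() if tokens & words)
    return {**doc, "categories": categories}


doc = {
    "title": "Weekend plans",
    "summary": "andi also likes to watch soccer game.",
    "author": "andi",
    "datetime": "2012-12-12",
}
tagged = tag_document(doc)
print(tagged["categories"])  # ['sports']
```

The parsed fields are kept untouched and the categories are added alongside them, which matches the step from "Documents (title, summary, author, datetime)" to "Documents (Categorized, Taxonomized)".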
Search Engine
• Document Enhancement
[Diagram: Documents (title, summary, author, datetime) → Document Enhancing → Documents (Categorized, Taxonomized)]
Search Engine
• Indexing
• Indexes all the information gathered for each document
• Makes searching faster
• Allows searching on specific fields
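To show why an index makes searching faster and how field-based search can work, here is a minimal sketch of a fielded inverted index. It is not how Lucene is implemented; the class name, the stop-word list and the sample documents are assumptions for illustration:

```python
import re
from collections import defaultdict

# Assumed stop-word list; a real stop analyzer ships a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "it", "is", "of", "and"}


def analyze(text):
    """Lowercase, split into words and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


class FieldedIndex:
    def __init__(self):
        # (field, term) -> set of document ids containing that term.
        self.postings = defaultdict(set)

    def add(self, doc_id, doc):
        for field, value in doc.items():
            for term in analyze(value):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result


index = FieldedIndex()
index.add(1, {"title": "Movie night",
              "summary": "andi likes to watch movie", "author": "andi"})
index.add(2, {"title": "Soccer game",
              "summary": "andi also likes to watch soccer game", "author": "budi"})

print(index.search("summary", "watch soccer"))  # {2}
print(index.search("author", "andi"))           # {1}
```

A query only touches the posting sets of its own terms instead of scanning every document, which is where the speed-up comes from.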
Search Engine
• Indexing
[Diagram: Documents (Categorized, Taxonomized) → Indexer (Stop Analyzer, Language Analyzer) → Index]
Search Engine
• Searching
[Diagram: Index, Index Searcher, Mobile Client, Web Client]
Search Engine
• Document Serving
• The search engine is also responsible for displaying results
Search Engine
[Diagram: Index, Index Searcher, Mobile Client, Web Client, Document Landing Page]
Search Engine
• Recommended Open Source Technology
• Search Engine: Lucene, Nutch
• Programming Library: Hadoop, Scala Actor
• Database: MongoDB, PostgreSQL
• Programming Language: Java, Scala, PHP
Thank You