A customized web search engine
Supervised By
Dr. Mohamed A. El-Rashidy
Eng. Ahmed Ghozia
Dept. of Computer Science & Engineering
Faculty of Electronic Engineering,
Menoufiya University.
The main purpose of this project is to build our own search engine that meets our needs as a nation.
In this project, we tried to add customized features to the search engine, such as building and developing a time-based search engine meant to handle local and international news.
Question: What is a search engine?
How does a web search engine work?
Web crawler, Indexing, Ranking
Lucene, Nutch, Solr
Who uses Solr?
Set up Nutch for web crawling
Set up Solr for search
Running Nutch in Eclipse for development
Experiments
Answer: Software that
builds an index on text
answers queries using that index
A search engine offers
Scalability
Relevance Ranking
Integrates different data sources (email, web pages, files, database,...)
A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Ranking
A web crawler is a program or automated script that browses the World Wide Web.
It is used to create a copy of all the visited pages for later processing by a search engine.
It starts with a list of URLs to visit, called the seeds.
URLs are visited recursively according to a set of policies:
A selection policy
A re-visit policy
A politeness policy
A parallelization policy
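The seed-and-frontier idea above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory link graph (`WEB`) instead of real HTTP requests, and the names `crawl`, `max_depth`, and `fetch_links` are made up for illustration. Only a simple selection policy (a depth limit) is modeled; re-visit, politeness, and parallelization policies are left out.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
WEB = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, max_depth, fetch_links):
    """Breadth-first crawl from the seeds, visiting each URL at most once."""
    frontier = deque((url, 0) for url in seeds)  # URLs waiting to be visited
    seen = set(seeds)                            # URLs already queued
    visited = []                                 # URLs in visit order
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # selection policy: do not follow links past max_depth
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


pages = crawl(["http://a.example/"], max_depth=1,
              fetch_links=lambda url: WEB.get(url, []))
print(pages)
```

With `max_depth=1` the sketch visits the seed and the pages it links to directly; raising the limit lets it reach pages further away from the seed.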
The indexing process covers how data is collected, parsed, and stored to facilitate fast and accurate search query evaluation.
The process involves the following steps:
Data collection
Data traversal
Indexing
Indexing process:
Convert document
Extract text and metadata
Normalize text (stop words, stemming)
Write (inverted) index
Example:
Document 1: "Apache Lucene at Jazoon"
Document 2: "Jazoon conference"
Index:
apache -> 1
conference -> 2
jazoon -> 1, 2
lucene -> 1
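The example index above can be reproduced with a short sketch. This is plain Python, not Lucene code: the `STOP_WORDS` set and the function names are assumptions made for illustration, and stemming is omitted for brevity.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not Lucene's actual list).
STOP_WORDS = {"a", "at", "of", "the"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
index = build_index(docs)
for term, ids in index.items():
    print(term, "->", ids)
```

Answering a one-word query then reduces to a dictionary lookup such as `index.get("jazoon", [])`, which is what makes an inverted index fast.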
A web search engine responds to a query that a user enters to satisfy his or her information needs.
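As a rough illustration of answering a query with ranked results, the sketch below scores each document by how often the query terms occur in it. This term-frequency score is a deliberately simple stand-in and is not the scoring formula used by Lucene or Solr; the third document and the `search` function are hypothetical additions for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Return ids of matching documents, highest term-frequency score first."""
    terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score; break ties by ascending document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    1: "Apache Lucene at Jazoon",
    2: "Jazoon conference",
    3: "Lucene in action: Lucene search",  # hypothetical extra document
}
results = search("lucene jazoon", docs)
print(results)
```

Documents that match more query terms, or match them more often, appear earlier in the result list; documents with no matching term are left out.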
Lucene is a high-performance, scalable information retrieval (IR) library.
It lets you add searching capabilities to your applications.
It is a free, open source project implemented in Java.
With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages…the list goes on.
Nutch is web search engine software.
Open source web crawler
Coded entirely in the Java programming language
Advantages:
Scalability
Crawler politeness
Crawler management
Quality
Solr is an open source enterprise search platform based on the Apache Lucene project.
Powerful full-text search, hit highlighting, and faceted search
Database integration and rich document (e.g., Word, PDF) handling
Download a binary package (apache-nutch-bin.zip)
cd apache-nutch-1.X/
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Now you should be able to see the following directories created:
crawl/crawldb
crawl/linkdb
crawl/segments
If you already have a Solr core set up and wish to index to it, use
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Next, set up your Solr instance and index your crawl data.
Download the binary file (apache-Solr-bin.zip)
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
After you start Solr, you should be able to access the admin console at the following link:
http://localhost:8983/solr/admin/
Integrate Solr with Nutch
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
Run the Solr index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
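Once the crawl data is indexed, it can be queried over HTTP. The sketch below only builds a query URL for Solr's `/select` handler using the standard `q`, `rows`, and `wt` parameters. The helper name `solr_select_url` is hypothetical, and the `content` field is an assumption about the schema copied from Nutch; adjust it to whatever fields your schema.xml actually defines.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's /select handler that asks for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# "content" is assumed to be a field in the schema copied from Nutch.
url = solr_select_url("http://localhost:8983/solr/", "content:university")
print(url)
```

With Solr running, the printed URL can be opened in a browser or fetched with `urllib.request.urlopen` to inspect the matching documents.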
Crawling the Egyptian Universities
Crawling the Arabic news websites
Mustafa Mohammed Ahmed Elkhiat
Email:[email protected]