A Customized Web Search Engine

Supervised By

Dr. Mohamed A. El-Rashidy

Eng. Ahmed Ghozia

Dept. of Computer Science & Engineering

Faculty of Electronic Engineering,

Menoufiya University.

The main purpose of this project is to build our own search engine that meets our needs as a nation.

The project also adds customized features to the search engine, such as a time-based search designed to handle local and international news.

Question: What is a search engine?

How does a web search engine work?

Web crawler, indexing, ranking

Lucene, Nutch, Solr

Who uses Solr?

Set up Nutch for web crawling

Set up Solr for search

Running Nutch in Eclipse for development

Experiments

Answer: A search engine is software that

builds an index on text

answers queries using that index

A search engine offers

Scalability

Relevance Ranking

Integration of different data sources (email, web pages, files, databases, ...)

A search engine operates in the following order:

1. Web crawling

2. Indexing

3. Ranking

A web crawler is a program or automated script that browses the World Wide Web

It is used to create a copy of all visited pages for later processing by a search engine

It starts with a list of URLs to visit, called the seeds

URLs are visited recursively according to a set of policies:

A selection policy

A re-visit policy

A politeness policy

A parallelization policy
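The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory link graph that stands in for real HTTP fetches, and `crawl`, `max_depth`, and `allowed_hosts` are names invented for this example. Only a simple selection policy (a depth limit and an optional host filter) is shown; re-visit, politeness, and parallelization policies are left out.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> outgoing links.
# It replaces real HTTP fetching so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, max_depth=2, allowed_hosts=None):
    """Breadth-first crawl starting from the seeds; returns URLs in visit order."""
    frontier = deque((url, 0) for url in seeds)  # URLs waiting to be visited
    seen = set(seeds)                            # avoids queuing a URL twice
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)  # a real crawler would fetch and store the page here
        if depth == max_depth:
            continue  # selection policy: do not follow links beyond max_depth
        for link in FAKE_WEB.get(url, []):
            host = urlparse(link).netloc
            # Selection policy: optionally restrict the crawl to given hosts.
            if allowed_hosts is not None and host not in allowed_hosts:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


pages = crawl(["http://a.example/"], max_depth=1)
```

With the sample graph, a crawl of depth 1 from `http://a.example/` visits the seed and the two pages it links to, and stops before reaching `http://c.example/`.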

The indexing process covers how data is collected, parsed, and stored to enable fast and accurate query evaluation.

The process involves the following steps:

Data collection

Data traversal

Indexing

Indexing process:

Convert the document

Extract text and metadata

Normalize the text (stop-word removal, stemming)

Write the (inverted) index

Example:

Document 1: “Apache Lucene at Jazoon”

Document 2: “Jazoon conference”

Index:

apache -> 1

conference -> 2

jazoon -> 1, 2

lucene -> 1
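The example above can be reproduced with a minimal Python sketch. It is not how Lucene builds its index. The names `build_index` and `STOP_WORDS` are invented for this illustration, the stop-word list is assumed to contain only "at", and stemming is omitted.

```python
import re
from collections import defaultdict

# Assumed stop-word list for this example only.
STOP_WORDS = {"at"}


def build_index(docs):
    """Map each normalized term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Normalize: lowercase, split into word tokens, drop stop words.
        for token in re.findall(r"\w+", text.lower()):
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    # Sort terms and ids so the result is stable and easy to read.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
index = build_index(docs)
for term, doc_ids in index.items():
    print(term, "->", ", ".join(str(d) for d in doc_ids))
```

Answering a single-term query is then a dictionary lookup, for example `index.get("jazoon", [])`.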

Ranking: the search engine responds to the query a user enters to satisfy an information need, and orders the matching documents by relevance
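Ranking, the third step listed earlier, can be illustrated with a simple TF-IDF style score. This is a sketch under stated assumptions, not the scoring formula Lucene or Solr actually use. The function name `rank` and the exact weighting are choices made for this example.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return ids of matching documents, highest simple TF-IDF score first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)  # rarer terms weigh more
            score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score; break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
results = rank("lucene jazoon", docs)
```

Here document 1 matches both query terms and document 2 matches only one, so document 1 is ranked first.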

Lucene is a high-performance, scalable information retrieval (IR) library

It lets you add search capabilities to your applications

It is a free, open source project implemented in Java

With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages…the list goes on.

Nutch: web search engine software

Open source web crawler

Coded entirely in the Java programming language

Advantages:

Scalability

Crawler Politeness

Crawler Management

Quality

Solr: an open source enterprise search platform based on the Apache Lucene project

Powerful full-text search, hit highlighting, faceted search

Database integration, and rich document (e.g., Word, PDF) handling

Download a binary package (apache-nutch-bin.zip)

cd apache-nutch-1.X/

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Now you should be able to see the following directories created:

crawl/crawldb

crawl/linkdb

crawl/segments

If you already have a Solr core set up and wish to index to it, use:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

The next step covers how to set up your Solr instance and index your crawl data.

Download the binary package (apache-Solr-bin.zip)

cd ${APACHE_SOLR_HOME}/example

java -jar start.jar

After Solr has started, you should be able to access the admin console at the following link:

http://localhost:8983/solr/admin/

Integrate Solr with Nutch:

cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example

Run the Solr index command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
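Once the crawl data has been indexed, it can be queried over HTTP. The sketch below only builds a query URL for Solr's standard `select` handler with the `q`, `rows`, and `wt` parameters. The helper name `build_select_url` is invented for this example, and the field name `content` is an assumption about the schema copied from Nutch, so check it against your own schema.xml.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, query, rows=10):
    """Build a URL for Solr's select handler that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


# Assumes Solr is running locally on the default port used in this setup,
# and that the index has a field named "content" (an assumption).
url = build_select_url("http://localhost:8983/solr/", "content:news", rows=5)
print(url)
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen` while Solr is running.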

Crawling the Egyptian Universities

Crawling the Arabic news websites

Mustafa Mohammed Ahmed Elkhiat

