Bin Rank Project

49
BINRANK: SCALING DYNAMIC AUTHORITY-BASEDSEARCH USING MATERIALIZED SUBGRAPHS OBJECTIVE: In this project, a BinRank system that employs a hybrid approachwhere query time can be traded off for preprocessing t i m e a n d storage. BinRank closely approximates ObjectRank scores by runningthe same ObjectRank algorithm on a small subgraph, instead of the fulldata graph PROBLEM DEFINITION: 1.The idea of approximating ObjectRank by using Materializedsubgraphs (MSGs), which

Transcript of Bin Rank Project

Page 1: Bin Rank Project

 

BINRANK: SCALING DYNAMIC AUTHORITY-BASEDSEARCH USING MATERIALIZED SUBGRAPHS

OBJECTIVE:In this project, a BinRank system that employs a hybrid approachw h e r e q u e r y t i m e c a n b e t r a d e d o f f f o r p r e p r o c e s s i n g t i m e a n d storage. BinRank closely approximates ObjectRank scores by runningthe same ObjectRank algorithm on a small subgraph, instead of the fulldata graph

PROBLEM DEFINITION:

1.The idea of approximating ObjectRank by using Materializedsubgraphs (MSGs), which can be precomputed offline to supporton l i ne que ry ing fo r a spec i f i c que ry work load , o r t he en t i r e dictionary.

2.Use of ObjectRank itself to generate MSGs for “bins” of terms.

Page 2: Bin Rank Project

3 . A g r e e d y a l g o r i t h m t h a t m i n i m i z e s t h e n u m b e r o f b i n s b y clustering terms with similar posting lists.

4.Extensive experimental evaluation on the Wikipedia data set thatsupports our performance and search quality claims. The  evaluation demonstrates superiority of BinRank over other state-of-the-art approximation algorithms.

Abstract:Dynamic authority-based keyword search algorithms, such asObjectRank and personalized PageRank, leverage semantic linkinformation to provide high quality, high recall search in databases,and the Web. Conceptually, these algorithms require a querytimePageRank-style iterative computation over the full graph. Thiscomputation is too expensive for large graphs and not feasible atquery time. Alternatively, building an index of precomputed results forsome or all keywords involves very expensive preprocessing. Weintroduce BinRank, a system that approximates ObjectRank results byutilizing a hybrid approach inspired by materialized views in traditionalquery processing. We materialize a number of relatively small subsetso f t h e d a t a g r a p h i n s u c h

Page 3: Bin Rank Project

a w a y t h a t a n y k e y w o r d q u e r y c a n b e answered by runn ing Ob jec tRank on on ly one o f t he subg raphs . BinRank generates the subgraphs by partitioning all the terms in thecorpus based on their co-occurrence, executing ObjectRank for eachpartition using the terms to generate a set of random walk startingpoints, and keeping only those objects that receive non-negligiblescores. The intuition is that a subgraph that contains all objects andlinks relevant to a set of related terms should have all the informationn e e d e d t o r a n k o b j e c t s w i t h r e s p e c t t o o n e o f t h e s e t e r m s . W e  demonstrate that BinRank can achieve subsecond query executiontime on the English Wikipedia data set, while producing high-qualitysearch results that closely approximate the results of ObjectRank onthe original graph. The Wikipedia link graph contains about 108 edges,which is at least two orders of magnitude larger than what prior stateof the art dynamic authority-based search systems have been able todemonstrate. Our experimental evaluation investigates the trade-

Page 4: Bin Rank Project

off between query execution time, quality of the

results, and storagerequirements of BinRank.

 

LITERATURE REVIEW:

Page Rank :

The PageRank algorithm utilizes the Web graph link structure to assignglobal importance to Web pages. It works by modeling the behavior of a “random Web surfer” who starts at a random Web page and followsoutgoing links with uniform probability. The PageRank score isindependent of a keyword query. Recently, dynamic versions of thePageRank algorithm have become popular. They are characterized bya query-specific choice of the random walk starting points.

Personalized Page Rank :

In particular, two algorithms have got a lot of attention: PersonalizedPageRank (PPR) for Web graph data sets and ObjectRank for graph-modeled

Page 5: Bin Rank Project

databases. PPR is a modification of PageRank that performssearch personalized on a preference set that contains Web pages thata user likes. For a given preference set, PPR performs a very expensivefixpoint iterative computation over the entire Web graph, while itgenerates personalized search results. Therefore, the issue of scalability of PPR has attracted a lot of attention. ObjectRank extends(personalized) PageRank to perform keyword search in databases.ObjectRank uses a query term posting list as a set of random walkstarting points and conducts the walk on the instance graph of the  

database. The resulting system is well suited for “high recall” search,which exploits different semantic connection paths between objects inhighly heterogeneous data sets.Object Rank ObjectRank has successfully been applied to databases that havesocial networking components, such as bibliographic data andcollaborative product design. However, ObjectRank suffers from thesame sca l ab i l i t y i s sue s a s pe r sona l i z ed PageRank , a s i t r equ i r e s multiple iterations over all nodes and links of the entire databasegraph. The original ObjectRank system has two modes:

Page 6: Bin Rank Project

online andoffline. The online mode runs the ranking algorithm once the query isreceived, which takes too long on large graphs. For example, on agraph of articles of English Wikipedia1 with 3.2 million nodes and 109million links, even a fully optimized in-memory implementation of ObjectRank takes 20-50 seconds to run. In the offline mode,ObjectRank precomputes top-k results for a query workload inadvance. This precomputation is very expensive and requires a lot of storage space for precomputed results. Moreover, this approach is notf ea s ib l e fo r a l l t e rms ou t s i de t he que ry work load t ha t a u se r may search for, i.e., for all terms in the data set dictionary. For example, onthe same Wikipedia data set, the full dictionary precomputation wouldtake about a CPU-year.

  Existing System:•PageRank algorithm utilizes the Web graph link structure toassign global importance to Web pages. It works by modeling thebehavior of a “random Web surfer” who starts at a random Webpage and follows outgoing links with uniform probability.•The PageRank score is independent of a keyword query.•

Page 7: Bin Rank Project

Pe r sona l i z ed PageRank (PPR) fo r Web g raph da t a s e t s and ObjectRank for graph-modeled databases results. Therefore, theissue of scalability of PPR has attracted a lot of attention.•ObjectRank extends (personalized) PageRank to performkeyword search in databases. ObjectRank uses a query termposting list as a set of random walk starting points and conductsthe walk on the instance graph of the database.

  Proposed System:•In this project, a BinRank system that employs a hybrid approachwhere query time can be traded off for preprocessing time andstorage. BinRank closely approximates ObjectRank scores byrunning the same ObjectRank algorithm on a small subgraph,instead of the full data graph.•B inRank que ry execu t i on ea s i l y s ca l e s t o l a rge c lu s t e r s by distributing the subgraphs between the nodes of the cluster.•We are proposing the BinRank algorithm for the trade time of  search. Our alogorithm solves the time consuming problem inquery execution. Time will

Page 8: Bin Rank Project

be reduced because of cache storageand redundant query handling method.

  Proposed System:•In this project, a BinRank system that employs a hybrid approachwhere query time can be traded off for preprocessing time andstorage. BinRank closely approximates ObjectRank scores byrunning the same ObjectRank algorithm on a small subgraph,instead of the full data graph.•B inRank que ry execu t i on ea s i l y s ca l e s t o l a rge c lu s t e r s by distributing the subgraphs between the nodes of the cluster.•We are proposing the BinRank algorithm for the trade time of  search. Our alogorithm solves the time consuming problem inquery execution. Time will be reduced because of cache storageand redundant query handling method.

 

HARDWARE SPECIFICATIONProcessor: Any Processor above 500 MHz.Ram

Page 9: Bin Rank Project

: 128Mb.Hard Disk : 10 GB.Input device: Standard Keyboard and Mouse.Output device: VGA and High Resolution Monitor.SOFTWARE SPECIFICATIONOperating System: Windows Family.Pages developed using: Java Server Pages and HTML.Techniques: Apache Tomcat Web Server 5.0, JDK 1.5 or higher Web Browser: Microsoft Internet Explorer.Data Bases: SQlServer 2000Client Side Scripting: Java Script  Modules:.•User Registration•Authentication Module•

Page 10: Bin Rank Project

Search - Query Submission•Index Creation•BinRank Algorithm Implementation•Graph based on Rank User Registration:We are providing the facility to register new users. If anyone wants use our  application, they should become a member of our application. To getting the membershiplogin the users should made registration with our application. In registration we will getall the details about the users and it will be stored in a database to create membership.Authentication Module:This module provides the authentication to the users who are using our application.In this module we are providing the registration for new users and login for existingusers.Search Query Submission:Users query will be submitted in this module. Users can search any kind of thingsin our application when we connect with Internet. Users query will be processed based ontheir submission, and then it will produce the appropriate result. Result will be producedbased on our algorithm.

 

Page 11: Bin Rank Project

Index Creation:Index i s some th ing l i ke t he coun t o f s ea r ch and r e su l t wh i ch we p roduced wh i l e searching. Based on the index we will create the rank for the results, such like pages or corresponding websites. This will be maintained in background for future use like cachememory. By the way we are creating the index for speed up the search efficient and fastwith the help of implementing BinRank algorithm.BinRank Algorithm Implementation:We generate an MSG for every bin based on the intuition that a subgraph that contains allobjects and links relevant to a set of related terms should have all the information neededto rank objects with respect to one of these terms. Based on the index creation we need togenerate the results for the users query. BinRank algorithm will use the indexing andranking techniques to produce the efficient results in short time.Graph based on Rank Graph will be generated based on the users queries submitted. This graph will representthe user search key word, number of websites produced for their search, how many timesthat websites occurred in the search result and the Rank for websites based on the user clicks. User may search the same key word again and again, so result may also produceas same URLs. At that user will click some of the URLs; based on their clicks the Rank will be calculated. Based on the Number of

Page 12: Bin Rank Project

times URL occurrence, Rank and Keywordthe Graph will generate.

  Module INPUT/OUTPUT:Module 1:Given Input:User details registrationOutputAuthorized user to access applicationModule 2:Given Input:Username and passwordOutput:Access to applicationModule 3:Given Input:Search queryOutput:query to databaseModule 4:Given Input:Query result from databaseOutput:Results to user Module 5:Given Input:Binrank implementationOutput:Binrank for outputModule 6:

  Given Input:binrank Output:Graph based on binrank

  SAMPLE SOURCE CODESearch.java/** To change this template, choose Tools | Templates* and open the template in the editor.*/package org.websearch.Action;//import java.io.BufferedReader;import java.io.*;import java.net.*;import java.util.Calendar;import java.text.SimpleDateFormat;/**** @author Leonard*/public class Search{ public static void main(String args[]){System.out.println("Search File");long start =

Page 13: Bin Rank Project

System.nanoTime();System.out.println("Start: " + start+" nano seconds");try {URL url = newURL("http://www.google.com/search?q=example");URLConnection conn = url.openConnection();conn.setRequestProperty("User-Agent","Mozilla/5.0 (X11; U; Linuxx86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)");BufferedReader in = new BufferedReader(newInputStreamReader(conn.getInputStream()));String str;while ((str = in.readLine()) != null) {System.out.println(str);}in.close();}catch (MalformedURLException e) {}catch (IOException e) {}long end = System.nanoTime();System.out.println("End : " + end+" nano seconds");

  long elapsedTime = end - start;System.out.println("The process took approximately: "+ elapsedTime + " nano seconds");} }DrawGraph.java/** To change this template, choose Tools | Templates* and open the template in the editor.*/package org.websearch.Action;import java.io.BufferedWriter;import java.io.File;import java.io.FileWriter;import java.io.IOException;import java.io.PrintWriter;import java.io.Writer;import java.sql.Connection;import java.sql.PreparedStatement;import java.sql.ResultSet;import

Page 14: Bin Rank Project

javax.servlet.RequestDispatcher;import javax.servlet.ServletException;import javax.servlet.http.HttpServlet;import javax.servlet.http.HttpServletRequest;import javax.servlet.http.HttpServletResponse;import org.websearch.DB.ConnectionDB;/**** @author User*/public class DrawGraph extends HttpServlet {/*** Processes requests for both HTTP <code>GET</code> and<code>POST</code> methods.* @param request servlet request* @param response servlet response* @throws ServletException if a servlet-specific error occurs* @throws IOException if an I/O error occurs*/protected void processRequest(HttpServletRequest request,HttpServletResponse response)throws ServletException, IOException {response.setContentType("text/html;charset=UTF-8");PrintWriter out = response.getWriter();try {String key = request.getParameter("combo");ConnectionDB db = new ConnectionDB();

  Connection conObj = db.MakeConnectionDB();PreparedStatement stat = conObj.prepareStatement("selectdistinct p1.Key2,p1.key1,p2.link,Count(p1.Link) Freq,p1.Rank frompagerank p1, pagerank p2 where p1.link = p2.link group byp1.Key1,p1.key2,p1.Rank");//stat.setString(1, key);ResultSet rs = stat.executeQuery();Writer output = null;StringBuffer text = new StringBuffer("<chart palette=\"3\"caption=\"BinRank

Page 15: Bin Rank Project

Graph\" " +"PYAxisName=\"Keyword Frequency\"SYAxisName=\"Rank\" xAxisName=\"Keyword\" " +"animation=\"1\" showValues=\"0\"formatNumberScale=\"0\" numberPrefix=\"\" " +"labelDisplay=\"STAGGER\"seriesNameInToolTip=\"0\">");text.append("<categories>");while (rs.next()) {text.append("<category label=\"" + rs.getString("Key2")+ "," + rs.getString("Key1") + "\"/>");}text.append("</categories>");ResultSet rs1 = stat.executeQuery();text.append("<dataset seriesName=\"Rank\"parentYAxis=\"S\">");while (rs1.next()) {//String link = rs1.getString("Link");text.append("<set value=\"" + rs1.getString("Rank") +"\" toolText='"+rs1.getString("Link")+"'/>");}text.append("</dataset>");ResultSet rs2 = stat.executeQuery();text.append("<dataset seriesname=\"Keyword Frequency\">");while (rs2.next()) {text.append("<set value=\"" + rs2.getString("Freq") +"\" toolText='"+rs2.getString("Link")+"'/>");}text.append("</dataset>").append("<styles>").append("<definition>").append("<style type=\"font\" name=\"CaptionFont\" size=\"15\"color=\"666666\"/>").append("<style type=\"font\"name=\"SubCaptionFont\"bold=\"0\"/>").append("</definition>").append("<application>").append("<apply toObject=\"caption\" styles=\"CaptionFont\"/>").append("<applytoObject=\"SubCaption\"styles=\"SubCaptionFont\"/>").append("</

Page 16: Bin Rank Project

application>").append("</styles>").append("</chart>");File file = newFile("E:/Jeys/Project/WebSearch/web/Data.xml");//File file = new File("..\\" + request.getContextPath() +"/web/Data.xml");output = new BufferedWriter(new FileWriter(file));output.write(text.toString());output.close();RequestDispatcher rd =request.getRequestDispatcher("Demo.jsp");rd.forward(request, response);} catch (Exception ex) {  System.out.println("Exception: " + ex);RequestDispatcher rd =request.getRequestDispatcher("Demo.jsp");rd.forward(request, response);} finally {out.close();}}// <editor-fold defaultstate="collapsed" desc="HttpServlet methods.Click on the + sign on the left to edit the code.">/*** Handles the HTTP <code>GET</code> method.* @param request servlet request* @param response servlet response* @throws ServletException if a servlet-specific error occurs* @throws IOException if an I/O error occurs*/@Overrideprotected void doGet(HttpServletRequest request,HttpServletResponse response)throws ServletException, IOException {processRequest(request, response);}/*** Handles the HTTP <code>POST</code> method.* @param request servlet request* @param response servlet response* @throws ServletException if a servlet-specific error occurs* @throws IOException if an I/O error occurs*/@Overrideprotected void doPost(HttpServletRequest request,HttpServletResponse response)throws ServletException, IOException {processRequest(request, response);}/*** Returns a short

Page 17: Bin Rank Project

description of the servlet.* @return a String containing servlet description*/@Overridepublic String getServletInfo() {return "Short description";}// </editor-fold>}

ConnectionDB.java/** To change this template, choose Tools | Templates* and open the template in the editor.*/package org.websearch.DB;import java.sql.Connection;import java.sql.DriverManager;import java.sql.PreparedStatement;import java.sql.ResultSet;/**** @author Leonard*/public class ConnectionDB{private Connection con;public Connection MakeConnectionDB() {try{Class.forName("com.mysql.jdbc.Driver");con =DriverManager.getConnection("jdbc:mysql://localhost:3306/websearch","root", "root");}catch (Exception ex){ex.printStackTrace();}return con;}}

 

Page 18: Bin Rank Project

DATABASE DESIGN

Page 19: Bin Rank Project
Page 20: Bin Rank Project
Page 21: Bin Rank Project
Page 22: Bin Rank Project

 

Page 23: Bin Rank Project

SCREEN SHOTS

Page 24: Bin Rank Project
Page 25: Bin Rank Project
Page 26: Bin Rank Project
Page 27: Bin Rank Project
Page 28: Bin Rank Project
Page 29: Bin Rank Project
Page 30: Bin Rank Project
Page 31: Bin Rank Project
Page 32: Bin Rank Project
Page 33: Bin Rank Project
Page 34: Bin Rank Project
Page 35: Bin Rank Project
Page 36: Bin Rank Project

 

CONCLUSIONSIn this paper, we proposed BinRank as a practical solution for scalabledynamic au tho r i t y -ba sed r ank ing . I t i s ba sed on pa r t i t i on ing and approximation using a number of materialized subgraphs. We showedthat our tunable system offers a nice trade-off between query time

Page 37: Bin Rank Project

andpreprocessing cost.We introduce a greedy algorithm that groups co-occurring terms into anumber of bins for which we compute materialized subgraphs. Notethat the number of bins is much less than the number of terms. Thematerialized subgraphs are computed offline by using ObjectRanki t s e l f . The i n tu i t i on beh ind t he app roach i s t ha t a subg raph t ha t contains all objects and links relevant to a set of related terms shouldhave all the information needed to rank objects with respect to one of these terms. Our extensive experimental evaluation confirms thisi n tu i t i on . Fo r fu tu r e work , we wan t t o s t udy t he impac t o f o the r keyword relevance measures, besides term co-occurrence, such asthesaurus or ontologies, on the performance of BinRank. By increasingthe relevance of keywords in a bin, we expect the quality of materialized subgraphs, thus the top-k quality and the query time canbe improved. We also want to study better solutions for queries whoserandom surfer starting points are provided by Boolean conditions. Andultimately, although our system is tunable, the configuration of oursystem ranging from number of bins, size of bins, and tuning of theObjectRank algorithm itself (edge weights and thresholds) is quitechallenging, and a wizard to aid users is desirable

Page 38: Bin Rank Project