MG4J: Managing Gigabytes for Java Exercise Ida Mele.

37
MG4J: Managing Gigabytes for Java Exercise Ida Mele

Transcript of MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Page 1: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

MG4J: Managing Gigabytes for Java

Exercise

Ida Mele

Page 2: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Indexing in MG4J is centered around documents• Package: it.unimi.di.big.mg4j.document• The object document, which is the instance of the class

Document, represents a single document that can be indexed

• Different documents have different number and type of fields.

• For example,• E-mail: from, to, date, subject, body

• HTML page: title, url, body

Ida Mele MG4J - exercise 2

Document

Page 3: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Ida Mele MG4J - exercise 3

Document

• Summary of methods:

Page 4: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Package: it.unimi.di.big.mg4j.document• DocumentCollection is a randomly addressable lists

of documents

Ida Mele MG4J - exercise 4

DocumentCollection

Page 5: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

FileSetDocumentCollection

• Package: it.unimi.di.big.mg4j.document• The main method of FileSetDocumentCollection

allows to build and serialize a set of documents specified by their filenames

Ida Mele MG4J - exercise 5

Page 6: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Document Factory

• Package: it.unimi.di.big.mg4j.document• The factory turns a pure stream of bytes (file) into a

document made by several fields (title and text)

Ida Mele MG4J - exercise 6

Page 7: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Standard MG4J Document Factories

• CompositeDocumentFactory• HtmlDocumentFactory• IdentityDocumentFactory• MailDocumentFactory• PdfDocumentFactory• ReplicatedDocumentFactory• PropertyBasedDocumentFactory• TRECHeaderDocumentFactory• ZipDocumentCollection.ZipFactory

Ida Mele MG4J - exercise 7

Page 8: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Query

• Package: it.unimi.di.big.mg4j.query• To query the index we can use the main method of

the class Query • We can submit queries by using:

• command line • web browser

• QueryEngine: The query engine receives the query and returns the ranked list of results

• HttpQueryServer: A simple web server for query processing

Ida Mele MG4J - exercise 8

Page 9: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Indexing and querying: exercise

• TECHNICAL REQUIREMENTS:

• UNIX Operating System

• Java (>=6)

• Document collection and the libraries are available at:

http://www.dis.uniroma1.it/~mele/WebIR.html

Ida Mele MG4J - exercise 9

Page 10: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Set the classpath

• Download and extract htmlDIS.tar.gz

• Download and extract lib.zip

• Download the file set-classpath.sh

• Edit the first line of the file set-classpath.sh: replace your_directory with the path of the folder containing all the .jar files (lib folder)

• Set the CLASSPATH: source set-classpath.sh

Ida Mele MG4J - exercise 10

Page 11: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Building the collection of documents (1)

• Help:

java it.unimi.di.big.mg4j.document.FileSetDocumentCollection --help

• Create the collection:

find htmlDIS -iname \*.html | java it.unimi.di.big.mg4j.document.FileSetDocumentCollection -f HtmlDocumentFactory -p encoding=UTF-8 dis.collection

find returns the list of files, one per line. This list is provided as input to the main method of the FileSetDocumentCollection

Ida Mele MG4J - exercise 11

Page 12: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• We need also to specify a factory (the -f option) and the encoding as a property

• The name of the collection is dis.collection

• The collection does not contain the files, but only their names

• Deleting or modifying files of htmlDIS directory may cause inconsistence in the collection

Building the collection of documents (2)

Ida Mele MG4J - exercise 12

Page 13: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Building the index

• Help: java it.unimi.di.big.mg4j.tool.IndexBuilder --help

• Create the index:

java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis

• --downcase: this option forces all the terms to be downcased • -S : specifies that we are producing an index for the specified

collection. If the option is omitted, Index expects to index a document sequence read from standard input

• dis: basename of the index• If you have memory problem, you can use -Xmx for allocating more

memory to Java:

java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis

Ida Mele MG4J - exercise 13

Page 14: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• dis-{text,title}.terms: contain the terms of the dictionary. One term per line

more dis-text.terms

• dis-{text,title}.stats: contain statisticsmore dis-text.stats

• dis-{text,title}.properties: contain global informationmore dis-text.properties

Index files (1)

Ida Mele MG4J - exercise 14

Page 15: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• dis-{text,title}.frequencies: for each term, there is the number of documents with the term (-code)

• dis-{text,title}.globcounts: for each term, there is the number of occurrence of the term (-code)

• dis-{text,title}.offset: for each term, there is the offset (-code)

Index files (2)

Ida Mele MG4J - exercise 15

Page 16: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• dis-{title,text}.sizes: contain the list of the document sizes. The document size is the number of words contained in each document (- code)

• dis-{text,title}.batch<i>: temporary files with sub-indices (-code). Use the option --keep-batches to not delete temporary files

• dis-{text,title}.index: contain the index (-code)

Index files (3)

Ida Mele MG4J - exercise 16

Page 17: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Web server

• Help: java it.unimi.di.big.mg4j.query.Query --help

• Querying the index:

java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c dis.collection dis-text dis-title

• Command line: {text, title} > computer• Web browser: http://localhost:4242/Query

Ida Mele MG4J - exercise 17

Page 18: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Search one word: The result is the set of documents that contain the specified word

• Example: computer

• AND: more than one term separated by whitespace or by AND or &. The result is the set of documents that contain all the specified words

• Example: computer science• Example: computer AND science• Example: computer & science

Query (1)

Ida Mele MG4J - exercise 18

Page 19: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• OR: more than one term separated by OR or |. The result is the set of documents that contain any of the given words

• Example: conference | workshop

• NOT: the operator NOT or ! is used for negation• Example: conference & ! workshop

• Parentheses: the parentheses are used to enforce priority in complex queries

• Example: university & (rome | california)

Query (2)

Ida Mele MG4J - exercise 19

Page 20: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Proximity restriction: the words must appear within a limited portion of the document

• Example: (university rome)~6

• Phrase: using “ ” we can look for documents that contain the exact phrase

• Example: “university of rome la sapienza”

• Ordered AND: more than one term separated by <• Example: computer < science < department

Query (3)

Ida Mele MG4J - exercise 20

Page 21: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Wildcard (*): wildcard queries can be submitted appending * at the end of a term

• Example: infor*

• Index specifiers: prefixing a query with the name of an index followed by : you can restrict the search to that index

• Example: title:computer• Example: text:computer science AND

title:FOCS

Query (4)

Ida Mele MG4J - exercise 21

Page 22: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• MG4J provides sophisticated query tuning

• To use this features, we must use the command line interface

• $ --- to get some help on the available options

• Some examples: • $mode --- to choose the kind of results

Example: > $mode short

• $selector --- to choose the way the snippet or intervals are shown

Example: > $selector 3 40

Sophisticated queries (1)

Ida Mele MG4J - exercise 22

Page 23: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Other examples:• $mplex --- when multiplexing is on, each query is

multiplexed to all indices. When a scorer is used, it is a good idea to use multiplexing

Example: > $mplex on

• $score --- to choose the scorer

Example: > $score VignaScorer• $weight --- to change the weight of the indices. This

is useful when multiplexing is on

Example: >$weight text:1 title:3

Sophisticated queries (2)

Ida Mele MG4J - exercise 23

Page 24: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Scorer are important for ranking the documents result of a

query.

Default: BM25Scorer and VignaScorer

• ConstantScorer. Each document has a constant score (default is

0)

>$score ConstantScorer

• CountScorer. It is the product between the number of

occurrences of the term in the document and the weight

assigned to the index

>$score CountScorer

Scorer (1)

Ida Mele MG4J - exercise 24

Page 25: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• TfIdfScorer. It implements TF/IDF

TF is the term frequency of the term t for the document d: c/l;

where c is the number of occurrences of t in d and l is the

length of d

IDF is the inverse document frequency of the term t in the

collection: log(N/f); where N is the number of documents in

the collection and f is the number of documents where t

appears

>$score TfIdfScorer

Scorer (2)

Ida Mele MG4J - exercise 25

Page 26: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• DocumentRankScorer. The scores of documents are stored

in a text file

>$score DocumentRankScorer nameFile

Scorer (3)

Ida Mele MG4J - exercise 26

Page 27: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• A virtual field produces pieces of text that refer to other documents (possibly belonging to the collection)

• Referrer: the document that is referring to another document

• Referee: the document to which a piece of text of the Referrer is referring to

• Intuitively, the Referrer gives us information about the Referee

• The Referrer produces in a virtual field a number of fragments of text, each referring to a Referee

• The content of a virtual field is a list of pairs made by the piece of text (called virtual fragment) and by some string that is aimed at representing the Referee (called the document spec)

Virtual fields (1)

Ida Mele MG4J - exercise 27

Page 28: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Virtual fields (2)

Ida Mele MG4J - exercise 28

• In the case of the HTML document:

• the document spec is a URL (as specified in the href attribute)

• the virtual fragment is the content of the anchor element and some surrounding text (anchor context)

• The HTMLDocumentFactory produces the pairs (document spec, virtual fragment)

Page 29: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Create the list of URL of the documents in the collection:

java it.unimi.di.big.mg4j.tool.ScanMetadata -S dis.collection -u dis.urls

• Create the document resolver. It is able to map the document spec produced by some document factory into actual references to documents in the collection

• Given a document spec, the resolver will decide whether the spec really refers to a document in the collection or not, and in the first case it will find out to which document the spec refers to:

java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o dis.urls dis-anchor.resolver

Virtual fields (3)

Ida Mele MG4J - exercise 29

Page 30: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• Building the index:

java it.unimi.di.big.mg4j.tool.IndexBuilder -a -v anchor:dis-anchor.resolver --downcase -S dis.collection dis

• Querying the index:

java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c

dis.collection dis-text dis-title dis-anchor

{text, title, anchor} > anchor:conference

{text, title, anchor} > title:combinatorial algorithms AND

anchor:conference

{text, title, anchor} > text:RoboCup AND anchor:info

Virtual fields (4)

Ida Mele MG4J - exercise 30

Page 31: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• All the virtual fragments that refer to a given document of the collection are like a single text, called virtual text

• Virtual fragments coming from different anchors are concatenated, and they are in a text file

• This may produce false positive results• For example, the query

anchor:(computer AND science)

produces as result a list of documents that contain both the words in some of their anchors, but not necessarily in the same anchor

Virtual gap (1)

Ida Mele MG4J - exercise 31

Page 32: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• To avoid such kinds of false positives, we can use virtual gaps

• The virtual gap is a positive integer, representing the virtual space left between different virtual fragments

• For example, if the virtual gap is 64 (the default), anchors are concatenated by leaving 64 “empty words” between subsequent fragments

• We can submit the query:

>anchor:(computer AND science)~64

and we will be sure that only documents containing both the term in the same anchor are retrieved

Virtual gap (2)

Ida Mele MG4J - exercise 32

Page 33: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

• If the anchor is longer than 64 characters, we can still have false positives

• In the indexing phase, it is possible to specify a different virtual gap

• For example, we can use:

java it.unimi.di.big.mg4j.tool.IndexBuilder -a -g anchor:100 -v anchor:dis-anchor.resolver --downcase -S dis.collection dis

It uses 100 characters for the virtual gap

Virtual gap (3)

Ida Mele MG4J - exercise 33

Page 34: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Term map (1)

• A simple representation of a dictionary is the term list (the file .terms): a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, etc.)

• A more efficient representation is based on a monotone minimal perfect hash function: it is a very compact data structure that is able to answer to the question "What is the index of the term XXX?”

• You can build such a function from a sorted term list using:

java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction titles.mph dis-title.terms

Ida Mele MG4J - exercise 34

Page 35: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Term map (2)

• Monotone minimal perfect functions have a serious limit: they can answer correctly to the question "What is the index of the term XXX?” but only if the term appears in the dictionary

• To solve this problem, we can use a signed function• For terms not in the dictionary, the function will answer

with a special value (-1) that means "the word is not in the dictionary”

java it.unimi.dsi.util.ShiftAddXorSignedStringMap titles.mph titles.map mycollection-title.terms

Ida Mele MG4J - exercise 35

Page 36: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Term map (3)

• Wildcard searches require the use of a prefix map • A prefix map is able to answer correctly to the question

"What are the indices of terms starting with the characters YYY?”

• If terms are lexicographically sorted, the answer is a pair of integers, representing the first and the last index of terms satisfying the property

• We can build a prefix map by using:java it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki -o dis-title.terms dis-title.dict

Ida Mele MG4J - exercise 36

Page 37: MG4J: Managing Gigabytes for Java Exercise Ida Mele.

Homework

1. Read the MG4J (big) manual: http://www.dis.uniroma1.it/~mele/teaching/WebIR/manual-mg4j.pdf

2. Repeat the exercise

3. Create your own document collection, build the inverted index (with or without virtual fields), then submit some queries and try the different scorers

Ida Mele MG4J - exercise 37