IR Software for Large-Scale Research
Gregory B. Newby
School of Information and Library Science, University of North Carolina at Chapel Hill
CB 3360 Manning Hall, Chapel Hill, NC, 27599-3360
[email protected]

Abstract

Who
- Greg Newby has been working on experimental IR systems for over 10 years
- He's participated in TREC since 1986
- His interest has been in extending information space ideas to IR systems (see recent JASIST article, Information Space and Cognitive Space)
- IRTools is a more generalized version of software he's developed previously
- With an NSF/ITR grant, it's been possible to hire student programmers to help write code and test system performance

What
- IRTools is a software toolkit. It's not a ready-made IR system, but it can be easily configured to perform consistently with major IR models:
  - Boolean retrieval with various term and document weighting
  - Vector Space Model (VSM)
  - Latent Semantic Indexing (LSI) and Newby's Information Space
  - Probabilistic IR
- The software is designed for modularity, scalability and high performance, but with an emphasis on IR experimentation, not real-world production use

Where
- UNC Chapel Hill has a tradition of information retrieval research, systems development and evaluation
- School facilities include new SunFire servers
- The University provides additional computational hosts, and a robotic tape-to-disk library with unlimited storage
- Project facilities include two research systems with 2 and 4GB RAM and 1000GB disk space

When
- The NSF/ITR project runs for 3 years, ending in August 2003
- Software development is ongoing, and partners and contributors are sought to join a virtual development team
- The approximate timeline for IRTools is:
  - 2001: Fundamental software functional for Boolean and VSM
  - 2002: Functionality for LSI and Information Space
  - 2003: More emphasis on XML and other semi-structured data types

Why
- To have configurable, flexible software for IR experimentation that is freely available, high performance and scalable
- Excellent IR systems such as SMART, Okapi and INQUERY are each missing one or more of the desired qualities above
- Excellent Web retrieval software such as ht://dig is not suitable for experimentation, as it only implements a subset of desirable retrieval models
- The search engines don't share their source code, algorithms or methods

How
- Write code. We use mostly C++, with some reliance on the Standard Template Library (STL). We use C, Perl and other languages as needed
- Test and evaluate. The code includes a full regression test (make test)
- Experiment. We've been working with the 10GB Web dataset from TREC, with several years of relevance judgments
- Tune. Data structures, file structures and algorithms need experimental validation. Often, they must be tuned for particular retrieval methods

Getting the Code
- Source code is periodically assembled into releases. We have not yet made a 1.0 release
- Visit the project homepage for documentation and information about current work
- For the source code, visit our development site at Sourceforge: http://sf.net/projects/irtools
- You can download the most current code
- IRTools has been tested for:
  - Solaris
  - Linux (i686 and Alpha)

Full Disclosure
- Does this software work?
  Not fully, but many parts of it function quite well. It's a work in progress.
- So, your TREC 2001 results must have been pretty good, eh?
  No, there were some bugs that resulted in poor performance this year. We were trying to test our implementation of the VSM with pivoted term weights.
- Will this be better than Google?
  Doubtful, but that's not the point. This is for IR researchers, not a commercial product.
- Are you trying to get people to use IRTools for their own research?
  Not necessarily, but we hope it will be helpful for other researchers, and possibly for use in the classroom.

Major Components
- Spider: Retrieves documents (on disk or the Web)
- Indexer: Represents data: terms, inverted index, etc.
- Retrieval engine: Matches & ranks documents to queries

The Spider
- Needed for live Web use. For existing datasets (such as TREC data), we don't need the spider
- We're borrowing methods from wget and other open-source spidering tools
- Challenges include spider traps and poorly formed HTML
- The spider is solely concerned with Web interaction to get documents and handle errors. The indexer worries about seeking more documents (HREFs), parsing the documents, etc.

The Indexer
- Quite complicated, with dozens of classes and thousands of lines of code
- Some components are generic, but many are specific to a particular retrieval experiment. Different indexing methods are applied based on:
  - The type of data being indexed (Web, abstracts, full text)
  - What retrieval methods will be used (VSM, LSI, Boolean)
  - What term weighting is needed
  - The size of the data (e.g., to determine whether multiple files will be used for the inverted index, or only one)

The Retrieval Engine
- Highly configurable for different experiments
- One collection (aka set of indexed data) may be used with different retrieval methods. This is the core value of the software: to enable experiments in which many factors are held constant
- Small proxy servers enable the retrieval engine to interact with external interfaces (e.g., Java programs)
- Other small servers can retrieve from Web search engines, such as Google, then reformat hits internally
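As a rough illustration of running one collection under different retrieval methods, the sketch below ranks the same document vectors with two interchangeable scoring functions. All names here (Vec, Similarity, dot_product, negative_distance, rank_documents) are hypothetical and are not taken from the IRTools source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical types: a dense term-weight vector and a pluggable scoring function.
using Vec = std::vector<double>;
using Similarity = std::function<double(const Vec &, const Vec &)>;

// Inner product of two vectors (assumed to have the same length).
double dot_product(const Vec &a, const Vec &b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Negated Euclidean distance, so that larger values still mean "more similar".
double negative_distance(const Vec &a, const Vec &b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return -std::sqrt(sum);
}

// Return document positions ordered best-first under the given scoring function.
// Ties keep their original order because the sort is stable.
std::vector<std::size_t> rank_documents(const std::vector<Vec> &docs,
                                        const Vec &query,
                                        const Similarity &score) {
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        scored.emplace_back(score(docs[i], query), i);
    }
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<double, std::size_t> &l,
                        const std::pair<double, std::size_t> &r) {
                         return l.first > r.first;
                     });
    std::vector<std::size_t> order;
    for (const auto &entry : scored) order.push_back(entry.second);
    return order;
}
```

The collection (docs) and the query stay fixed; only the scoring function passed to rank_documents changes between runs.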

A Typical TREC-Style Experiment: Indexer Configuration
- Estimate high-water marks for memory and disk usage. Determine whether you can index the entire dataset with one run, or if you need multiple runs
- Bring together different indexing classes and methods into one program. For example:
  - File opener (to recursively retrieve files & directories)
  - Tokenizer (identify word boundaries)
  - Stemmer and stoplist handlers
  - Choice of HTML or XML tags or other elements to identify, and how to identify them
  - Choice of what data to store to disk (e.g., separate inverted indexes for particular tags; sequential index)
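The tokenizer, stoplist and stemmer steps above might be chained roughly as follows. This is a minimal sketch under stated assumptions: the function names (tokenize, stem, index_terms) are hypothetical, and the "stemmer" is a deliberately trivial stand-in that only strips a trailing "s", not a real stemming algorithm.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical tokenizer: split text into lowercase alphanumeric tokens.
std::vector<std::string> tokenize(const std::string &text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Placeholder stemmer (an assumption for illustration only):
// strips one trailing "s" from terms longer than three characters.
std::string stem(const std::string &term) {
    if (term.size() > 3 && term.back() == 's') {
        return term.substr(0, term.size() - 1);
    }
    return term;
}

// Tokenize, drop stoplisted terms, then stem what remains.
std::vector<std::string> index_terms(const std::string &text,
                                     const std::set<std::string> &stoplist) {
    std::vector<std::string> terms;
    for (const std::string &token : tokenize(text)) {
        if (stoplist.count(token) == 0) terms.push_back(stem(token));
    }
    return terms;
}
```

Applying the same three steps to queries at retrieval time keeps query terms consistent with the indexed terms.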

A Typical TREC-Style Experiment: Retrieval Engine Configuration
- For batch-oriented retrieval, queries may be pre-stemmed and stopped (or you could use term ID #s instead of the terms)
- For interactive retrieval or testing, the tokenizer, stemmer and stopword processor should match those used by the indexer
- Add components as needed, such as:
  - Candidate document selection (e.g., Boolean AND)
  - Query expansion
  - Weighting of terms and documents (tf*idf, pivoted, user specified)
  - Similarity measure (cosine, geometric distance)
  - Ranking
  - Presentation of results
  - Relevance feedback, query adjustment, etc.
- Adding CGI functionality, command line options and other interaction methods is easily done
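Two of the components listed above, tf*idf weighting and the cosine similarity measure, can be sketched as below. This assumes one common textbook form of tf*idf, tf * log(N / df); the variant IRTools actually uses is not stated here, and the names SparseVector, tfidf and cosine are hypothetical.

```cpp
#include <cmath>
#include <map>

// Hypothetical sparse vector: term ID -> weight.
using SparseVector = std::map<int, double>;

// One common tf*idf form (an assumption): tf * log(N / df).
// Returns 0 when the collection size or document frequency is not positive.
double tfidf(double tf, double num_docs, double doc_freq) {
    if (num_docs <= 0.0 || doc_freq <= 0.0) return 0.0;
    return tf * std::log(num_docs / doc_freq);
}

// Cosine of the angle between two sparse weight vectors.
// Returns 0 if either vector has zero length.
double cosine(const SparseVector &a, const SparseVector &b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto &entry : a) {
        norm_a += entry.second * entry.second;
        auto match = b.find(entry.first);
        if (match != b.end()) dot += entry.second * match->second;
    }
    for (const auto &entry : b) norm_b += entry.second * entry.second;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

A term that occurs in every document gets weight zero under this form, so it contributes nothing to the cosine score.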

Some Files used by IRTools
- Inverted index: Binary files. One file contains term ID #s, term counts, weights and offset locations to the document list. The second file contains the document list for each term. A third file contains the list of term locations (for NEAR operator)
- Sequential index: For each document, a list of the term ID #s, term counts and locations (3 separate binary files)
- Term map: a database (Berkeley DB) to look up a term ID # for a term
- Term ID data: For each term, its frequency in the collection
- Sparse matrix files: For co-occurrence data, term by document lists, etc. Binary files using a modified Harwell-Boeing format
- For any experiment, only some of these files (or others) are needed
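The relationship between the first two inverted index files (term records holding an offset into a shared document list) can be pictured with an in-memory sketch. The struct layout, field widths and function names below are assumptions for illustration; the actual on-disk format used by IRTools is not shown in these slides, and weights and term locations are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical term record: stands in for one entry of the first file.
struct TermEntry {
    std::uint32_t term_id;
    std::uint32_t doc_count;  // number of documents containing the term
    std::uint64_t offset;     // position of the term's first entry in doc_list
};

// In-memory stand-in for the two files.
struct InvertedIndex {
    std::vector<TermEntry> terms;         // "file one": term records
    std::vector<std::uint32_t> doc_list;  // "file two": concatenated document lists
};

// Build the index from a map of term ID -> document IDs.
InvertedIndex build_index(
    const std::map<std::uint32_t, std::vector<std::uint32_t>> &postings) {
    InvertedIndex index;
    for (const auto &entry : postings) {
        TermEntry record;
        record.term_id = entry.first;
        record.doc_count = static_cast<std::uint32_t>(entry.second.size());
        record.offset = static_cast<std::uint64_t>(index.doc_list.size());
        index.terms.push_back(record);
        index.doc_list.insert(index.doc_list.end(),
                              entry.second.begin(), entry.second.end());
    }
    return index;
}

// Fetch the document list for a term (linear scan, kept simple for the sketch).
std::vector<std::uint32_t> lookup(const InvertedIndex &index,
                                  std::uint32_t term_id) {
    for (const TermEntry &term : index.terms) {
        if (term.term_id == term_id) {
            auto first = index.doc_list.begin() +
                         static_cast<std::ptrdiff_t>(term.offset);
            return std::vector<std::uint32_t>(first, first + term.doc_count);
        }
    }
    return {};
}
```

Keeping fixed-size term records separate from the variable-length document lists means a term record can be read without loading any document list.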

A Little Source Code: weight.h

    // NOTE: the vector element types are missing from this transcript;
    // vector<irt_int> below is an assumption.
    class IRT_Weight {
    public:
        // Constructor: keep a pointer to the index
        IRT_Weight(IRT_Index &inarg) { in = &inarg; }

        // Destructor
        ~IRT_Weight() { }

        // Get a tf weight
        irt_float weight_get_tf(vector<irt_int>, irt_float);

        // Get an idf weight
        irt_float weight_get_idf(irt_float, irt_float);

        // Get a tf*idf weight
        irt_float weight_get_tfidf(irt_int termid, vector<irt_int> docidlist);

        // We'll need to use the index class to look up term frequencies
        IRT_Index *in;
    };

A Little Source Code: bool_or.cc
- This class member function merges lists of document ID #s
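The bool_or.cc source itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of how a Boolean OR over two document ID lists is commonly done, assuming both lists are sorted in ascending order with no duplicates. The free function merge_or is hypothetical and is not the IRTools member function.

```cpp
#include <cstddef>
#include <vector>

// Union of two sorted, duplicate-free document ID lists.
// The result is also sorted and duplicate-free.
std::vector<int> merge_or(const std::vector<int> &a, const std::vector<int> &b) {
    std::vector<int> merged;
    merged.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            merged.push_back(a[i++]);
        } else if (b[j] < a[i]) {
            merged.push_back(b[j++]);
        } else {
            // Same document in both lists: keep one copy, advance both.
            merged.push_back(a[i]);
            ++i;
            ++j;
        }
    }
    // Append whatever remains of the longer list.
    for (; i < a.size(); ++i) merged.push_back(a[i]);
    for (; j < b.size(); ++j) merged.push_back(b[j]);
    return merged;
}
```

Because each list is walked once, the merge takes time proportional to the combined length of the two lists.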

A Program to Index WT10G (the TREC 10GB Web dataset)
- This program runs in about 3 hours on our dual Alpha station
- It creates separate inverted indexes for terms found inside two different tags
- The slowest part is the tokenizer, which identifies terms and tags of interest. The token class is being redesigned for higher performance

Current Projects Include
- TREC Interactive Track. We'll be using IRTools to post-process Google results and display them via a proxy to the end user. A sitemap-style interface will be compared to a traditional list.
- 3D navigation. Several interfaces to navigate through information space. These can use a local dataset, or visualize relative locations of documents retrieved elsewhere.

Our Thanks To:
- gcc and g++: These compilers greatly facilitate cross-platform development
- The Berkeley DB: High performance database functionality for single-key data
- wget and ht://dig: open-source tools whose functionality we have learned from
