CS276A Text Information Retrieval, Mining, and Exploitation Lecture 10 7 Nov 2002.
CS276A Text Information Retrieval, Mining, and Exploitation
description
Transcript of CS276A Text Information Retrieval, Mining, and Exploitation
![Page 1: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/1.jpg)
CS276AText Information Retrieval, Mining, and
Exploitation
Project Information and ResourcesProject Ideas3 Oct 2002
![Page 2: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/2.jpg)
2
Project: The idea
An opportunity to do a reasonable sized project on a topic of interest related to the course
Open ended, but... Have a clear, focused objective Set reasonable goals Don’t reinvent the wheel Look for ways to evaluate and validate your
work A good project write-up is essential
![Page 3: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/3.jpg)
3
Project: General Info
40% of grade Work in teams of two
Stay after class to look for a partner if needed
Try asking on news:su.class.cs276a TA can help you find a partner
If you are not registered for the course, you may not have access to some of the data subscribe to cs276a-guests@lists if you are
not registered officially send a mesg with “subscribe cs276a-guests”
in the body to majordomo@lists
![Page 4: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/4.jpg)
4
Project: General Info cont.
Suggested profile information to put on newsgroup when looking for a partner: What times of the day are you free? What kind of project interests you?
theoretical efficiency visualization etc...
What programming language do you prefer? Do you prefer working remotely, or on
campus?
![Page 5: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/5.jpg)
5
Timeline
Oct 3, Oct 10: discuss project ideas and resources in class
Oct 17 [one week later]: Hand in project abstract (1 page)
We will email feedback by Oct 24 Nov 12: Submit summary of progress, and
plans for completing the project (2–3 pages)
Dec 3: Submit project Dec 5: Present project (very brief
“executive summary”) in class
![Page 6: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/6.jpg)
Tools & Resources
![Page 7: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/7.jpg)
7
Installed/Supported ToolsLinux & Solaris
Java (J2SE 1.4.0) Lucene Berkeley DB Google API WebBase API WordNet
software is in cs276a/software/ data is in cs276a/data1/
![Page 8: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/8.jpg)
8
Installed/Supported ToolsLinux & Solaris
Put the following in your .cshrc: “source
/afs/ir/class/cs276a/software/setpaths.sh” Please take a look at:
http://www.stanford.edu/class/cs276a/projects/libraries.html
http://www.stanford.edu/class/cs276a/projects/docs/ Running example, making use of many of these
tools, is in cs276a/software/examples/ package cs276a.examples.titleindexer contains a
running example we’ll be walking through
![Page 9: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/9.jpg)
9
Other Tools
Porter’s Stemmer Xerces XML parser Many free libraries available at
http://www.sourceforge.net/
![Page 10: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/10.jpg)
10
Java
Rich API of built-in functionality Networking (e.g., HTTP) HTML parser XML support Visualization ...
![Page 11: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/11.jpg)
11
Lucene
http://jakarta.apache.org/lucene/docs/ Library for text indexing Free, easy to use, and extensible API for both constructing and querying text
indexes IndexWriter IndexSearcher Analyzer
Detailed example later on
![Page 12: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/12.jpg)
12
BerkeleyDB
http://www.sleepycat.com/ External memory databases
BTree Hash table
Might be useful for certain tasks (like link analysis)
Very widely used
![Page 13: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/13.jpg)
13
Google API
http://www.google.com/apis/ Programmatic interface to Google Requires you to register to get a client key Limit of 1,000 queries / day Supports 3 kinds of queries
search - search query (more details next) cache - retrieved cached copy of a url spelling - spelling suggestion
![Page 14: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/14.jpg)
14
Google API cont.
Note that “search” query also allows link-queries
backlinks of stanford: “link:www.stanford.edu” related pages queries
similar to stanford: “related:www.stanford.edu” intitle:, daterange:, etc...
see http://www.google.com/advanced_search
![Page 15: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/15.jpg)
15
Stanford WebBase
http://www-diglib.stanford.edu/~testbed/doc2/WebBase/ Recent crawl of .Stanford.EDU is in
cs276a/data1/stanford.edu/ Need to use an HTML parser
Pages from the Arts, Business, and Computers ODP categories are in cs276a/data1/dmoz/
Already converted to plaintext (no HTML markup)
Use Feeder classes to access Pull-style iterator next()
![Page 16: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/16.jpg)
16
WordNet
http://www.cogsci.princeton.edu/~wn/ Run ‘wnb’ to play with the system locally, or use
the online Web version Library provides programmatic interface to hand
constructed semantic relationships for words synonyms word senses hypernyms meronyms etc...
See glossary of relationships at http://www.cogsci.princeton.edu/~wn/man1.7.1/wngloss.7WN.html
![Page 17: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/17.jpg)
17
Datasets
Find your own data, or use something from cs276a/data1/
WebBase data, as described previously Assorted corpora in data1/linguistic-data/
TREC Reuters Excite query logs Bilingual datasets much more...
![Page 18: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/18.jpg)
18
Machines
tree (solaris) elaine (solaris) firebird (linux) raptor (linux) If you switch architectures, and are storing
binary files on disk, don’t forget about endianness
![Page 19: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/19.jpg)
19
Kerberos trivia
“man keeptoken” to find out how to keep your processes authenticated for 1 week
might be necessary if you are doing complex text analysis and want your process to run longer than a day
sidenote: processing large web crawls takes up to a few days, even with several 1.4GHz Athlons operating in parallel
![Page 20: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/20.jpg)
Project Ideas
![Page 21: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/21.jpg)
21
Parametric search
Each document has, in addition to text, some “meta-data” e.g., Language = French Format = pdf Subject = Physics etc. Date = Feb 2000
A parametric search interface allows the user to combine a full-text query with selections on these parameters e.g., language, date range, etc.
![Page 22: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/22.jpg)
22
Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.
Parametric search example
![Page 23: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/23.jpg)
23
Parametric search example
We can add text search.
![Page 24: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/24.jpg)
24
Secure search
Set up a document collection in which each document can be viewed by a subset of users.
Simulate various users issuing searches, such that only docs they can see appear on the results.
Document the performance hit in your solution index space retrieval time
![Page 25: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/25.jpg)
25
“Natural language” search / UI
Present an interface that invites users to type in queries in natural language
Find a means of parsing such questions into full-text queries for the engine
Measure what fraction of users actually make use of the feature Bribe/beg/cajole your friends into
participating Suggest information discovery tasks for
them Understand some aspect of interface design
and its influence on how people search
![Page 26: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/26.jpg)
26
Link analysis
Measure various properties of links on the Stanford web what fraction of links are navigational rather
than annotative what fraction go outside (to other
universities?) (how do you tell automatically?)
What is the distribution of links in Stanford and how does this compare to the web?
Are there isolated islands in the Stanford web?
![Page 27: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/27.jpg)
27
Visual Search Interfaces
Pick a visual metaphor for displaying search results 2-dimensional space 3-dimensional space Many other possibilities
Design visualization for formulating and refining queries
![Page 28: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/28.jpg)
28
![Page 29: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/29.jpg)
29
![Page 30: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/30.jpg)
30
Visual Search Interfaces
Are visual search interfaces more effective?
On what measure? Time needed to find answer Time needed to specify query User satisfaction Precision/recall
![Page 31: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/31.jpg)
31
Cross-Language Information Retrieval
Given: a user is looking for information in a language that is not his/her native language.
Example: Spanish speaking doctor searching for information in English medical journals.
Simpler: The user can read the non-native language.
Harder: no knowledge of non-native language.
![Page 32: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/32.jpg)
32
Cross-Language Information Retrieval
Two simple approaches: Use bilingual dictionary to translate query Use simplistic transformation to normalize
orthographic differences (coronary/coronario)
Cross-language performance is expected to be worse. By how much?
Query refinement/modification more important in cross-language information retrieval. Implications for UI design?
![Page 33: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/33.jpg)
33
Latent Semantic Indexing (LSI)
LSI represents queries and documents in a “latent semantic space”, a transformation of term/word space
For sparse queries/short documents, LSI representation captures topical/semantic similarity better.
Based on SVD analysis of term by document matrix.
![Page 34: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/34.jpg)
34
Latent Semantic Indexing
Efficiencies of inverted index (for searching and index compression) not available. How can LSI be implemented efficiently?
Impact on retrieval performance (higher recall, lower precision)
Latent Semantic Indexing applied to a parallel corpus solves cross-language IR problem. (but need parallel corpus!)
![Page 35: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/35.jpg)
35
Meta Search Engine
Send user query to several retrieval systems and present combined results to user.
Two problems: Translate query to query syntax of each
engine Combine results into coherent list
What is the response time/result quality trade-off? (fast methods may give bad results)
How to deal with time-out issues?
![Page 36: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/36.jpg)
36
Meta Search Engine
Combined web search: Google, Altavista, Overture
Medical Information Google, Pubmed
University search Stanford, MIT, CMU
Research papers Universities, citeseer, e-print archive
Also: look at metasearch engines such as dogpile
![Page 37: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/37.jpg)
37
IR for Biological Data
Biological data offer a wealth of information retrieval challenges
Combine textual with sequence similarity Requires blast or other sequence homology
algorithm Term normalization is a big problem (greek
letters, roman numerals, name variants, eg, E. coli O157:H7)
![Page 38: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/38.jpg)
38
IR for Biological Data
One place to start: www.netaffx.com Microarray data Sequence data Textual data, describing genes/proteins Links to national center of bioinformatics
What is the best way to combine textual and non-textual data?
UI design for mixed queries/results Pros/Cons of querying on text only,
sequence only, text/sequence combined.
![Page 39: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/39.jpg)
39
Peer-to-Peer Search
Build information retrieval system with distributed collections and query engines.
Advantages: robust (eg, against law enforcement shutdown), fewer update problems, natural for distributed information creation
Challenges Which nodes to query? Combination of results from different nodes Spam / trust
![Page 40: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/40.jpg)
40
Personalized Information Retrieval
Most IR systems give the same answer to every user.
Relevance is often user dependent: Location Different degrees of prior knowledge Query context (buy a car, rent a car, car enthusiast)
Questions How can personalization information be represented Privacy concerns Expected utility Cost/benefit tradeoff
![Page 41: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader035.fdocuments.us/reader035/viewer/2022070405/56813d67550346895da7446a/html5/thumbnails/41.jpg)
41
N-gram Retrieval
Index on n-grams instead of words Robust for very noisy collections (lots of
typos, low-quality OCR output) Another possible approach to cross-
language information retrieval Questions
Compare to word-based indexing Effect on precision/recall Effect on index size/response time