© R. K. Belew 1996-2001 Finding Out About
Finding Out About - Chapter 2: Extracting Lexical Features
2.1 Building useful tools
2.2 Inter-document parsing
General issues
Email
From: [email protected]: [email protected]: [email protected]: new positions for students at Max Planck Institute, Berlin
Date: Tue, 13 Jan 98 19:14:08 +0100
X-Mts: smtp
Status:
Dear colleagues,
We are pleased to announce that we have postdoctoral and predoctoral positions available beginning next year here at our Center for Adaptive Behavior and ...
Lex/Yacc doesn’t always work!
“login (Full Name)” vs “Full Name <login>”
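The two address conventions above can be told apart by hand with a few string operations. The following is only a sketch: `parse_from` and `copy_trimmed` are hypothetical names, not part of the FOA code, and real mail headers allow many more variations (quoting, nested comments) than are handled here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src[0..len) into dst (capacity cap > 0),
   dropping leading and trailing blanks. */
static void copy_trimmed(char *dst, size_t cap, const char *src, size_t len)
{
    while (len > 0 && (*src == ' ' || *src == '\t')) { src++; len--; }
    while (len > 0 && (src[len - 1] == ' ' || src[len - 1] == '\t')) len--;
    if (len >= cap) len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split the value of a From: header into login and full name.
   Accepts "Full Name <login>" and "login (Full Name)".
   Returns 1 on success, 0 if neither form is recognized. */
int parse_from(const char *value, char *login, size_t lcap,
               char *name, size_t ncap)
{
    const char *lt = strchr(value, '<');
    const char *gt = lt ? strchr(lt, '>') : NULL;
    if (lt && gt) {                       /* Full Name <login> */
        copy_trimmed(name, ncap, value, (size_t)(lt - value));
        copy_trimmed(login, lcap, lt + 1, (size_t)(gt - lt - 1));
        return 1;
    }
    const char *lp = strchr(value, '(');
    const char *rp = lp ? strrchr(lp, ')') : NULL;
    if (lp && rp) {                       /* login (Full Name) */
        copy_trimmed(login, lcap, value, (size_t)(lp - value));
        copy_trimmed(name, ncap, lp + 1, (size_t)(rp - lp - 1));
        return 1;
    }
    return 0;
}
```

Calling `parse_from("jdoe (Jane Doe)", ...)` and `parse_from("Jane Doe <jdoe>", ...)` with made-up values should both yield login `jdoe` and name `Jane Doe`.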
Lex/Yacc doesn’t always work! (cont.)
THESIS# 00001
AUTHOR: SHINN, HONG SHIK
YEAR: 1989
TITLE: A UNIFIED APPROACH TO ANALOGICAL REASONING
CLASSIF: MACHINE LEARNING
UNIVERSITY: GEORGIA INSTITUTE OF TECHNOLOGY (0078)
ADVISOR: JANET L. KOLODNER
ABSTRACT:
Experiential reasoning is the most basic form of intelligent activity, consequently, in artificial intelligence, numerous computational models of
...problem solver but also a very general problem solver.
EOABSTRACT
AI Theses (AIT)
Common processing flow
2.3 Intra-document parsing
Major issues
Lexical analyzer (LA) generators
Finite state automata
Figure: state-transition diagram with states 0, 1, 2, 8, 9 and edges labeled {a-zA-Z0-9}, {b \n \t}, {a-zA-Z}, &, \0, else
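A hand-coded tokenizer in the spirit of such an automaton can be sketched as a loop over characters with an explicit state variable. This is not a transcription of the diagram: the two states and their names are assumptions made for illustration, with a token taken to start on a letter and continue over letters and digits, and everything else (blank, newline, tab, punctuation) treated as a separator.

```c
#include <ctype.h>
#include <stddef.h>

#define TOK_CAP 32              /* assumed maximum token length, incl. NUL */

enum state { BETWEEN, IN_TOKEN };   /* hypothetical state names */

/* Scan text and store up to max lowercased tokens in out.
   Returns the number of tokens stored. */
size_t tokenize(const char *text, char out[][TOK_CAP], size_t max)
{
    enum state s = BETWEEN;
    size_t ntok = 0, len = 0;

    for (const char *p = text; ; p++) {
        unsigned char ch = (unsigned char)*p;
        switch (s) {
        case BETWEEN:               /* skipping separators */
            if (isalpha(ch) && ntok < max) {
                out[ntok][0] = (char)tolower(ch);
                len = 1;
                s = IN_TOKEN;
            }
            break;
        case IN_TOKEN:              /* accumulating letters and digits */
            if (isalnum(ch)) {
                if (len < TOK_CAP - 1)
                    out[ntok][len++] = (char)tolower(ch);
            } else {                /* separator or '\0' closes the token */
                out[ntok][len] = '\0';
                ntok++;
                s = BETWEEN;
            }
            break;
        }
        if (ch == '\0')             /* end of input */
            break;
    }
    return ntok;
}
```

Because tokens must begin with a letter in this sketch, leading digits are skipped, so an input such as `9abc` produces the token `abc`. A different policy would need an extra state.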
Same lexical analysis for both documents and queries!
2.3.1 Stemming and Other Morphological Processing
Conflation
Stemming
Rewrite rules
“IES” except (“EIES” or “AIES”) --> “y”
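A rewrite rule of this kind amounts to a suffix test, an exception test, and an in-place replacement. The function below is a minimal sketch of that single rule, assuming upper-case input as written on the slide. `rewrite_ies` is a hypothetical name, not a function from the FOA sources.

```c
#include <string.h>

/* Apply the rule: "IES" except ("EIES" or "AIES") --> "Y".
   word is modified in place; returns 1 if the rule fired, else 0. */
int rewrite_ies(char *word)
{
    size_t n = strlen(word);

    if (n < 3 || strcmp(word + n - 3, "IES") != 0)
        return 0;                       /* suffix does not match */
    if (n >= 4 && (word[n - 4] == 'E' || word[n - 4] == 'A'))
        return 0;                       /* "EIES" / "AIES" exception */
    word[n - 3] = 'Y';                  /* replace "IES" by "Y" */
    word[n - 2] = '\0';
    return 1;
}
```

With this sketch, `QUERIES` becomes `QUERY`, while a string ending in `AIES` or `EIES` is left unchanged.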
Porter stemmer
Rules
/* Fields: id, old_end, new_end, old_offset, new_offset, min_root_size, condition */
static RuleList step1a_rules[] =
{
    {  101, "sses",  "ss",   3,  1, -1, NULL },
    {  102, "ies",   "i",    2,  0, -1, NULL },
    {  103, "ss",    "ss",   1,  1, -1, NULL },
    { 1100, "'s",    LAMBDA, 1, -1, -1, NULL },
    {  104, "s",     LAMBDA, 0, -1, -1, NULL },
    {  000, NULL,    NULL,   0,  0,  0, NULL },   /* id 0 ends the list */
};
static RuleList step1b_rules[] =
{
    {  105, "eed",   "ee",   2,  1,  1, NULL },
    {  106, "ed",    LAMBDA, 1, -1, -1, ContainsVowel },
    {  107, "ing",   LAMBDA, 2, -1, -1, ContainsVowel },
    { 1101, "ingly", LAMBDA, 4, -1, -1, ContainsVowel },
    {  000, NULL,    NULL,   0,  0,  0, NULL },   /* id 0 ends the list */
};
Rule matching
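/* Apply the first matching rule in the list to word, in place.
   Assumes a file-scope `char *end` pointing at the last character
   of word, as the uses of `end` below imply. */
int ReplaceEnd(char *word, RuleList *rule)
{
    char *ending;   /* start of the candidate suffix within word */
    char tmp_ch;    /* character overwritten while testing the root */

    while ( 0 != rule->id ) {
        ending = end - rule->old_offset;
        if ( word < ending )
            if ( 0 == strcmp(ending, rule->old_end) ) {
                tmp_ch = *ending;
                *ending = EOS;          /* cut off the suffix temporarily */
                if ( rule->min_root_size < WordSize(word) )
                    if ( !rule->condition || (*rule->condition)(word) ) {
                        (void)strcat( word, rule->new_end );
                        end = ending + rule->new_offset;
                        break;
                    }
                *ending = tmp_ch;       /* rule rejected: restore the word */
            }
        rule++;
    }
    return( rule->id );     /* id of the rule that fired, or 0 if none did */
} /* ReplaceEnd */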
Other approaches
Phrases
Asian languages
2.3.2 Noise words
Combining stopword removal with other lexical decisions
Character classes
Character lex-loop (Part 1)
Character lex-loop (Part 2)
2.4 Example corpora
THESIS# 00001
AUTHOR: SHINN, HONG SHIK
YEAR: 1989
TITLE: A UNIFIED APPROACH TO ANALOGICAL REASONING
CLASSIF: MACHINE LEARNING
UNIVERSITY: GEORGIA INSTITUTE OF TECHNOLOGY (0078)
ADVISOR: JANET L. KOLODNER
ABSTRACT:
Experiential reasoning is the most basic form of intelligent activity, consequently, in artificial intelligence, numerous computational models of
...problem solver but also a very general problem solver.
EOABSTRACT
AI Theses (AIT)
AIT year distribution
Basic algorithm
for every doc in corpus
    while (token = getNonNoiseToken)
        if (StemP)
            token = stem(token)
        Save Posting(token,doc) in Tree
for every token in Tree
    Accumulate ndoc(token), totfreq(token)
    Sort p \in Postings(token) in descending docfreq(p) order
    write token,ndoc,totfreq, Postings
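The pseudocode above can be made concrete in a few dozen lines of C. The sketch below makes several simplifying assumptions that are not in the original: a fixed-size array with linear search stands in for the Tree, whitespace splitting stands in for getNonNoiseToken, stemming is omitted, and the counts are accumulated while postings are saved rather than in a second pass. All names (`Term`, `Posting`, `find_term`, `index_corpus`) are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_TERMS 256           /* assumed limits for the sketch */
#define MAX_DOCS  16

typedef struct { int doc, freq; } Posting;
typedef struct {
    char    token[32];          /* tokens assumed shorter than 32 chars */
    int     ndoc, totfreq;      /* documents containing token; total count */
    Posting post[MAX_DOCS];
} Term;

static Term terms[MAX_TERMS];
static int  nterms;

/* Linear search; a tree or hash table would replace this in practice. */
Term *find_term(const char *tok)
{
    for (int i = 0; i < nterms; i++)
        if (strcmp(terms[i].token, tok) == 0) return &terms[i];
    return NULL;
}

/* Documents must arrive in increasing order, so the current document
   is always the last posting of a term, if it is present at all. */
static void save_posting(const char *tok, int doc)
{
    Term *t = find_term(tok);
    if (!t) {
        if (nterms == MAX_TERMS) return;
        t = &terms[nterms++];
        memset(t, 0, sizeof *t);
        strncpy(t->token, tok, sizeof t->token - 1);
    }
    if (t->ndoc == 0 || t->post[t->ndoc - 1].doc != doc) {
        if (t->ndoc == MAX_DOCS) return;
        t->post[t->ndoc].doc = doc;
        t->post[t->ndoc].freq = 0;
        t->ndoc++;
    }
    t->post[t->ndoc - 1].freq++;
    t->totfreq++;
}

static int by_freq_desc(const void *a, const void *b)
{
    return ((const Posting *)b)->freq - ((const Posting *)a)->freq;
}

/* Build the in-memory index; each document is truncated to 255 chars. */
void index_corpus(const char *docs[], int ndocs)
{
    nterms = 0;
    for (int d = 0; d < ndocs; d++) {
        char buf[256];
        strncpy(buf, docs[d], sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
            save_posting(tok, d);       /* noise-word test and stem() omitted */
    }
    for (int i = 0; i < nterms; i++)    /* descending within-document frequency */
        qsort(terms[i].post, (size_t)terms[i].ndoc, sizeof(Posting), by_freq_desc);
}
```

After `index_corpus` returns, writing each term's `token`, `ndoc`, `totfreq`, and postings to a file would complete the last step of the pseudocode.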
Basic Posting Data Structures
Refined Posting Data Structures
Minimizing OS dependencies
Assume:
Two central files
FileNo Path Date Size NBlock NMsg Out?
FileNo MsgNo BPos TxtPos EPos Proxy NLines Date From (To) (Cc)
Pointer structure
4 "Host:Email archive:950818:apple" "1991/08/01 14:15:53" 17058 1 7 0
5 "Host:Email archive:950818:bib-cites" "1991/05/31 16:18:27" 933 1 1 0
6 "Host:Email archive:950818:comp-bio" "1991/07/25 18:09:51" 22730 1 6 0
7 "Host:Email archive:950818:conf" "1991/08/01 13:28:26" 122422 4 11 0
8 "Host:Email archive:950818:contacts" "1991/08/19 16:31:07" 215475 7 67 0
9 "Host:Email archive:950818:cse" "1991/08/10 17:03:33" 66052 2 26 0
6 5 7071 8042 14159 236 "Exisiting RTGs" "1995/07/17 08:21:11" 19 (21 ) ()
6 6 14160 15178 22729 158 "Summary of Computational Biology Town Meeting, Jul" "1995/07/18 06:36:52"
7 1 0 515 3441 111 "[[email protected]: (DBWORLD) ESSIR - Europea" "1995/05/17 13:21:40"
7 2 3442 4094 9760 158 "From Animals to Animats" "1995/05/18 09:16:26" 24 (25 ) ()
7 3 9761 11341 18242 186 "CFP: Pacific Symposium on Biocomputing" "1995/05/18 11:32:47" 30
7 4 18243 19057 26378 194 "Call For Papers" "1995/05/18 16:10:29" 58 (57 ) ()
to find out more options send help
From ???@??? Thu May 18 15:17:49 1995
Received: from cogsci.ucsd.edu by odin.ucsd.edu; id AA10669 sendmail 5.67/UCSDPSEUDO.4-CS via SMTP Thu, 18 May 95 09:16:41 -0700 for rik
Received: from yakima.UCSD.EDU by cogsci.UCSD.EDU (4.1/UCSDPSEUDO.4) id AA17558 for rik@cs; Thu, 18 May 95 09:16:34 PDT
Received: by yakima.UCSD.EDU (4.1/UCSDPSEUDO.3) id AA00611 for [email protected]; Thu, 18 May 95 09:16:26 PDT
Date: Thu, 18 May 95 09:16:26 PDT
Message-Id: <[email protected]>
From: John Batali <[email protected]>
Sender: batali@cogsci
To: alife-lab@cogsci
Subject: From Animals to Animats
Status: O
Date: Thu, 18 May 1995 11:08:14 -0400
From: Maja Mataric <[email protected]>
Subject: Conference Announcement and Call For Papers
==============================================================================
Conference Announcement and Call For Papers
FROM ANIMALS TO ANIMATS
Fourth International Conference on Simulation of Adaptive Behavior (SAB96)
Cape Cod, Massachusetts, USA, September 9-13, 1996
The objective of the conference is to bring together researchers in ethology, psychology, ecology, artificial intelligence, artificial life, robotics, and related fields so as to further our understanding
...
SEP 9-13: Conference dates
General queries to: [email protected] Page: http://www.cs.brandeis.edu/conferences/sab96
==============================================================================
From ???@??? Thu May 18 15:17:59 1995
Received: from hydra.sdsc.edu by odin.ucsd.edu; id AA19421 sendmail 5.67/UCSDPSEUDO.4-CS via SMTP Thu, 18 May 95 11:36:01 -0700 for christos
Figure labels: file.d, doc.d, (Original) document file
2.5.2 Fine Points
STAIRS Posting Information
Quoted Lines in an Email Message
2.5.3 Software Libraries
FOA-CD:///FindingOutAbout/index.htm#FOA source code
C
Java
Other topics
String matching
Motivation
Formal model
$|\text{Matches}| = \dfrac{n - m + 1}{c^{m}}$
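One way to arrive at this count, under the assumption (not stated on the slide) that text and pattern characters are drawn independently and uniformly from an alphabet of size $c$, is to treat it as an expected value over the $n - m + 1$ possible alignments of the pattern:

```latex
\begin{align*}
\Pr[\text{match at position } i] &= \left(\tfrac{1}{c}\right)^{m} = c^{-m},
    \qquad 1 \le i \le n - m + 1,\\
E\bigl[\,|\text{Matches}|\,\bigr] &= \sum_{i=1}^{n-m+1} c^{-m}
    = \frac{n - m + 1}{c^{m}}.
\end{align*}
```

For instance, with illustrative values $n = 1000$, $m = 3$, and $c = 4$, this gives $998 / 64 \approx 15.6$ expected matches.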
Naive algorithm
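/* Search for pat[1...m] in text[1...n] (R. Baeza-Yates, Fig. 10.1).
   Both arrays are indexed from 1; element 0 is unused. */
void rpt_match(int pos);    /* reports a match starting at text[pos] */

void naive_search(const char text[], int n, const char pat[], int m)
{
    int i, j, k, lim;

    lim = n - m + 1;                    /* last possible starting position */
    for (i = 1; i <= lim; i++) {
        k = i;
        for (j = 1; j <= m && text[k] == pat[j]; j++)
            k++;
        if (j > m)
            rpt_match(i);               /* all m characters matched from i */
    }
}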
Naive alg. - Complexity
$$E(C_n) = \frac{c}{c-1}\left(1 - \frac{1}{c^{m}}\right)(n - m + 1) + O(1)$$
Boyer-Moore [1977]
Smart shift after unsuccessful match
Horspool [1980]
Address table
Boyer-Moore/Horspool Alg
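/* Search pat[1...m] in text[1...n]; both arrays are indexed from 1.
   pat must have room for pat[m+1], text[n+1] must be readable, and
   STOPPER_CHAR must not occur in text (including text[n+1]).
   ALPH_SIZE must cover every unsigned char value. */
void rpt_match(int pos);

void bmhsearch(const char text[], int n, char pat[], int m)
{
    int d[ALPH_SIZE], i, j, k, lim;
    char sav;

    /* Initialize d[]: shift indexed by the character just past the window */
    for (k = 0; k < ALPH_SIZE; k++) d[k] = m + 1;
    for (k = 1; k <= m; k++) d[(unsigned char)pat[k]] = m + 1 - k;

    sav = pat[m+1];
    pat[m+1] = STOPPER_CHAR;    /* To avoid test for m=n-k+1 */

    lim = n - m + 1;
    for (k = 1; k <= lim; k += d[(unsigned char)text[k+m]]) {
        i = k;
        for (j = 1; text[i] == pat[j]; j++) i++;
        if (j == m + 1) rpt_match(k);
    }
    pat[m+1] = sav;             /* restore the caller's pattern buffer */
}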
Shift-OR
Shift left one bit/char of pattern
$O(C_n) = E(C_n) = O(kn)$
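The shift-and-OR step can be sketched in C as follows. This is an illustrative version rather than the code from the slides: it uses 0-based C strings instead of the 1-based arrays of the other listings, keeps one bit per pattern character in a single `unsigned long`, and therefore assumes the pattern is no longer than the machine word. The name `shift_or_search` is hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Return the 0-based index of the first occurrence of pat in text,
   or -1 if there is none (or if pat is empty or too long for one word). */
long shift_or_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    unsigned long mask[UCHAR_MAX + 1];
    unsigned long state = ~0UL;         /* all ones: no prefix matched yet */

    if (m == 0 || m > sizeof(unsigned long) * CHAR_BIT)
        return -1;

    /* mask[c] has a 0 in bit j exactly when pat[j] == c */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        mask[c] = ~0UL;
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)pat[j]] &= ~(1UL << j);

    for (size_t i = 0; i < n; i++) {
        /* shift left one bit per text character, then OR in the mask */
        state = (state << 1) | mask[(unsigned char)text[i]];
        if ((state & (1UL << (m - 1))) == 0)    /* bit m-1 clear: full match */
            return (long)(i - m + 1);
    }
    return -1;
}
```

Each text character costs one shift, one OR, and one test regardless of the pattern, which is why the running time does not depend on how the comparison fails.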
Comparison
Figure: search time versus pattern length (m = 2 to 20) for the Naive, Shift-OR, and Boyer-Moore-Horspool algorithms
PAT trees