Superficial & Lexical level 1
-
Upload
hammett-bentley -
Category
Documents
-
view
17 -
download
0
description
Transcript of Superficial & Lexical level 1
![Page 1: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/1.jpg)
NLP superficial and lexic level 1
Superficial & Lexical level 1
• Superficial level• What is a word • Lexical level• Lexicons• How to acquire lexical information
![Page 2: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/2.jpg)
NLP superficial and lexic level 2
Superficial level 1
• Textual pre-process• Getting the document(s)
• Accessing a BD
• Accessing the Web (wrappers)
• Getting the textual fragments of a document• Multimedia documents, Web pages, ...
• Filtering out meta-information• tags HTML, XML, ...
![Page 3: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/3.jpg)
NLP superficial and lexic level 3
Superficial level 2
• Text segmentation into paragraphs or sentences
• Tokenization• orthographic vs grammatical word
• Multiword terms
• dates, formulas, acronyms, abbreviations, quantities (and units), idioms,
• Named entities• NER, NEC, NERC
• Unknown word
• Language identification
Beeferman et al, 1999Ratnaparkhi, 1998
Bikel et al, 1999Borthwick, 1999Mikheev et al, 1999
Elworthy, 1999Adams,Resnik, 1997
![Page 4: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/4.jpg)
NLP superficial and lexic level 4
Superficial level 3
• Vocabulary size (V)• Heap's Law
• V = KN
• K depends on the text 10 K 100
• N total number of words depends on the language, for English 0.4 0.6
• Vocabulary grows sublinealy but does not saturate tends to stabilize for 1Mb of text (150.000w)
words
Dif
fere
nt w
ords
![Page 5: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/5.jpg)
NLP superficial and lexic level 5
Superficial level 4
• word tokens vs word types• Statistical distribution of words in a document
• Obviously non uniform
• Most common words cover more than 50% of occurrences
• 50% of the words only occur once
• ~12% of the document is formed by word occurring less than 4 times.
![Page 6: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/6.jpg)
NLP superficial and lexic level 6
Superficial level 5
Histogram
0
50
100
150
200
250
300
350
Bin
Frequency
Frequency
10/
/1
NC
rCf
Zipf law:We sort the words occurring in a document by their frequency. The product of the frequency of a word (f) by its position (r) is aproximatelly constant
![Page 7: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/7.jpg)
NLP superficial and lexic level 7
Lexical level 1
• Part of Speech (POS)• Formal property of a word-type determining its acceptable
uses in syntax.• A POS can be seen as a class of words • A word-type can own several POS, a word-token only
one• Plain categories
• open, many elements, neologisms, independent and semantically rich classes
• N, Adj, Adv, V• Functional categories
• closed
![Page 8: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/8.jpg)
NLP superficial and lexic level 8
Lexical level 2
• Repository of lexical information for human or computer use
• Two aspects to consider• Representation of lexical information
• Acquisition of lexical information
Lexicon
![Page 9: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/9.jpg)
NLP superficial and lexic level 9
Lexical level 3
• Orthografic Transcription • Phonetic Transcription • Flexion model• diathesis alternations, subcategorization frames
• LOVE VTR (OBJLIST: SN).
• LOVE• CAT = VERB
• SUBCAT = <SN, SN>
Lexicon content
![Page 10: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/10.jpg)
NLP superficial and lexic level 10
• POS• Argument structure• Semantic information
• dictionaries => definition
• lexicons => semantic types predefined in a hierarchy.
• Lexical Relations• derivation
• Equivalence with other languages
Lexical level 4
![Page 11: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/11.jpg)
NLP superficial and lexic level 11
Lexical level 5
• Form• attribute/value pairs, binarr or n-ary relations, coded values,
open domain values…
• Multiple assignments• One to many and many to one relations
• Contextual dependencies …
• Facets of features• Mandatory or optional, cardinality, default values
• Grading• Exact values, preferences, probabilistic assigments.
Problems
![Page 12: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/12.jpg)
NLP superficial and lexic level 12
Lexical level 6
• General purpose databases• Textual databases• Lexical databases• OO formalisms• OO databases• Frames• Unification-based formalisms
Representation
![Page 13: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/13.jpg)
NLP superficial and lexic level 13
Lexical level 7
• Dictionaries• MRD
• Predefined internal structure
• Some degree of coding in some contents
• Internal relations (synonimy, hyponymy, ...)
• (sometimes) restricted vocabulary
• Some sistematics on building definitions
Lexical Information acquisition
![Page 14: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/14.jpg)
NLP superficial and lexic level 14
Lexical level 8
• Colocations
• Argument structure.
• Frecuency information
• Context
• Grammatical Induction
• Probabilistic Analysis.
• Lexical relations
• Examples of use.
• Selectional Restrictions
• Nominal compounds
• Idioms, ...
Information present in corpora
![Page 15: Superficial & Lexical level 1](https://reader036.fdocuments.us/reader036/viewer/2022071807/56812a44550346895d8d7152/html5/thumbnails/15.jpg)
NLP superficial and lexic level 15
Lexical level 9
• Raw corpus• Horizontal or vertical Corpus • Tagged corpora• Parenthized corpora• Treebanks
Corpus typology