Introduction to Computational Linguistics
![Page 1: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/1.jpg)
Introduction to Computational Linguistics
Dipti Misra Sharma
IIIT, Hyderabad
IASNLP 05-07-2012
![Page 2: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/2.jpg)
Outline
Background
What is Computational Linguistics (CL)?
What do computational linguists do?
What are the issues in processing natural languages?
What can we do with CL?
Approaches in CL?
![Page 3: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/3.jpg)
Background
Language is a means of communication
Therefore, one can say
It encodes what is communicated <information>
We apply the processes of
Analysis (decoding) for understanding
Synthesis (encoding) for expression (speaking)
![Page 4: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/4.jpg)
What do we communicate ?
Information (SPAIN delivered a football masterclass at Euro 2012)
Intention <purpose>
Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)
Introduces variation
![Page 5: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/5.jpg)
How do we communicate ?
We use linguistic elements such as
Words (country, park, the, is, Bandipur, of, as, and, considered, National,
a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)
Arrangement of the words (Sentences)
Words are related to each other to provide the composite meaning
(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)
![Page 6: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/6.jpg)
How do we communicate ?
Arrangement of sentences (Discourse) Sentences or parts of sentences are related to each other to provide a cohesive meaning
*(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.)
(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km.)
Languages differ in the way they organise information in these entities
All of these interact in the organisation of information
![Page 7: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/7.jpg)
What is Computational Linguistics?
Computational linguistics is the scientific study of language from a computational perspective.
![Page 8: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/8.jpg)
What does it mean?
Scientific: Provides an explanation for a linguistic or psycholinguistic phenomenon
Computational: Develops computational models/techniques for linguistic phenomena
Human language is the subject of study
![Page 9: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/9.jpg)
In other words
Computational linguistics is the application of linguistic theories and computational techniques
to problems of natural language processing.
http://www.ba.umist.ac.uk/public/departments/registrars/academicoffice/uga/lang.htm
![Page 10: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/10.jpg)
What do computational linguists do?
Linguistic research
Develop language models for processing natural languages
Develop language resources for NLP research/applications
Understand and develop models for analysis and generation of natural languages by the computers
![Page 11: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/11.jpg)
So,
A Computational Linguist needs to understand
How language works
What information is available in the language?
How do languages encode information?
How can this knowledge/information be represented for computational processing?
![Page 12: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/12.jpg)
Information in Language (1/4)
Languages encode information
cuuhe maarate haiN kutte
rats kill dogs
The Hindi sentence is ambiguous
Possible interpretations
Dogs kill rats
Rats kill dogs
However,
English sentence is not ambiguous
![Page 13: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/13.jpg)
Information in Language (2/4)
Ambiguity in Hindi is resolved if,
cuuhe maarate haiN kuttoN ko
rats kill dogs acc
English encodes information in positions
Hindi in morphemes
Languages encode information differently
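The contrast between position and morphemes can be made concrete with a toy sketch. It is only an illustration under stated assumptions: the function names, the word lists, and the "bare subject-verb-object clause" restriction are all hypothetical and not part of the slides.

```python
# Toy sketch: recovering "who does what to whom" in two ways.
# The word lists and rules below are illustrative assumptions.

HINDI_VERB_WORDS = {"maarate", "haiN"}   # verb + auxiliary from the slide
ACC_MARKER = "ko"                        # accusative postposition


def english_roles(tokens):
    """English: position decides the role (assumes a bare S-V-O clause)."""
    subject, _verb, obj = tokens
    return {"agent": subject, "patient": obj}


def hindi_roles(tokens):
    """Hindi: the noun marked by 'ko' is the patient, wherever it stands."""
    patient = tokens[tokens.index(ACC_MARKER) - 1]
    nouns = [t for t in tokens
             if t not in HINDI_VERB_WORDS and t != ACC_MARKER]
    agent = next(n for n in nouns if n != patient)
    return {"agent": agent, "patient": patient}


print(english_roles("rats kill dogs".split()))
print(hindi_roles("cuuhe maarate haiN kuttoN ko".split()))
print(hindi_roles("kuttoN ko cuuhe maarate haiN".split()))
```

Swapping the two nouns changes the English result, while the Hindi result stays the same under reordering because the marker travels with its noun.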
![Page 14: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/14.jpg)
Information in Language (3/4)
Another example,
This chair has been sat on
– The chair has been used for sitting
– X sat on this chair, and it is known
– The sentence does not mention X
Languages encode information partially
![Page 15: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/15.jpg)
Information in Language (4/4)
English pronouns: he, she, it
Hindi pronoun: vaha
He is going to Delhi ==> vaha dillii jaa rahaa hai
She is going to Delhi ==> vaha dillii jaa rahii hai
It broke ==> vaha TuuTa ??
Information does not always map fully from one language into another
Conceptual worlds may be different
![Page 16: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/16.jpg)
Differences ?
Words
English      Hindi                                  Telugu
boys         laDake/laDakoN
<n,pl>       <n,sg/pl,case>
He/she/it    vaha                                   atanu/aame/adi
is/am/are    hai/huuN/haiN/ho
is going     jaa rahaa hai/rahii hai/rahe haiN
![Page 17: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/17.jpg)
Indian Languages
Relatively flexible word order
1. a) baccaa phala khaataa hai
‘child’ ‘fruit’ ‘eat+hab’ ‘pres’
The child eats fruits
b) phala baccaa khaataa hai
c) phala khaataa hai baccaa
d) baccaa khaataa hai phala
![Page 18: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/18.jpg)
Some structural differences
English
Declarative : Ravi is coming today
Interrogative : Is Ravi coming today ?
Change in the position of ‘is’ brings the change in meaning
Hindi
Declarative : ravi aaj aa rahaa hai
Interrogative : kyaa ravi aaj aa rahaa hai ?
Word ‘kyaa’ encodes the question information
Alternatively, more natural spoken form in Hindi
ravi aaj aa rahaa hai ? (with appropriate intonation)
OR
ravi aaj aa rahaa hai kyaa ?
![Page 19: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/19.jpg)
Post nominal modification
'ing' clauses
I know [the man playing guitar]
Hindi, on the other hand
maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN
![Page 20: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/20.jpg)
Clauses having 'un-' negative constructions
English
Unless you reach there the job will not be done
Hindi
jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa
![Page 21: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/21.jpg)
Languages Differ
Different languages have different mechanisms/devices to encode information
Some devices are common across certain languages and some are different
There are alternative ways of expressing the same meaning within the same language
Languages show preferences for one device over the others
English exploits ‘position’ for encoding information
Hindi uses ‘words’ more effectively
Thus, differences in grammatical structures
![Page 22: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/22.jpg)
Ambiguity in Natural Language (1/2)
Look at the word 'plot' in the following examples
(a) The plot having rocks and boulders is not good.
(b) The plot having twists and turns is interesting.
'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
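One way to see how context words select a sense is a word-overlap heuristic in the spirit of the simplified Lesk algorithm. This is a minimal sketch, not a method the slides prescribe: the cue-word sets and the function name `disambiguate_plot` are assumptions made up for the illustration.

```python
# Minimal word-overlap sense picker for 'plot'.
# The cue words per sense are hand-picked assumptions, not a real lexicon.
SENSES = {
    "piece of land": {"land", "rocks", "boulders", "soil", "fence", "build"},
    "story outline": {"story", "twists", "turns", "characters", "novel", "events"},
}


def disambiguate_plot(sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # No overlapping cue word at all: the context gives no evidence.
    return best if scores[best] > 0 else None


print(disambiguate_plot("The plot having rocks and boulders is not good."))
print(disambiguate_plot("The plot having twists and turns is interesting."))
```

A sentence with no cue words returns `None`, which mirrors the point that the sense comes from the surrounding words rather than from 'plot' itself.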
![Page 23: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/23.jpg)
Ambiguity in Natural Language (2/2)
Lexical level
Sentence level
Structural differences between SL and TL in a Machine Translation system.
![Page 24: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/24.jpg)
Lexical ambiguity
Lexical ambiguity can be both for
Content words – nouns, verbs etc
Function words – prepositions, TAMs etc
Content words' ambiguity is of two types
Homonymy
Polysemy
![Page 25: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/25.jpg)
Homonymy
A word has two or more unrelated senses
Example : I was walking on the bank (river-bank)
I deposited the money in the bank (money-bank)
![Page 26: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/26.jpg)
Polysemy
A word having two or more related senses
Example : English word 'issue', noun
1. The issue is under discussion (muddaa)
2. The latest issue of the journal is out (aNka)
3. He buys stamps on the day of the issue (vimocan)
4. The couple has no issue even after five years of marriage (saNtaan)
![Page 27: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/27.jpg)
Information Flow and Ambiguity
1. He scratched a figure on the rock (engrave)
2. She scratched the figure on the rock (scrape)
• Other words in the context make a difference
• Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'
![Page 28: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/28.jpg)
Function words can also pose problems (1/4)
Function words can also be ambiguous
For example – English preposition 'in'
(a) I met him in the garden
maiN usase bagiice meiN milaa
(b) I met him in the morning
maiN usase subaha 0 milaa
'Ambiguity' here refers to the 'appropriate correspondence' in the target language.
![Page 29: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/29.jpg)
Function words can also pose problems (2/4)
1. He bought a shirt with tiny collars.
usane chote kaular vaalii kamiiz khariidii
‘he tiny collars with shirt bought’
‘with’ gets translated as ‘vaalii’ in Hindi
2. He washed a shirt with soap.
usane saabun se kamiiz dhoii
‘he soap with shirt washed’
‘with’ gets translated as ‘se’ .
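The two examples suggest that the choice depends on the noun that 'with' introduces. A rough sketch of such a lookup rule follows. The noun lists and the function name `translate_with` are hypothetical, and a real MT system would need much richer lexical and structural knowledge than this.

```python
# Hypothetical mini-lexicons; only 'soap' and 'collars' come from the slide.
INSTRUMENT_NOUNS = {"soap", "brush", "water"}
PART_NOUNS = {"collars", "pockets", "buttons"}


def translate_with(noun_after_with):
    """Pick a Hindi equivalent of 'with' from the noun it introduces."""
    noun = noun_after_with.lower()
    if noun in INSTRUMENT_NOUNS:
        return "se"        # instrument reading: 'washed a shirt with soap'
    if noun in PART_NOUNS:
        return "vaalii"    # attribute reading: 'a shirt with tiny collars'
    return None            # unknown noun: leave the choice open


print(translate_with("collars"))
print(translate_with("soap"))
```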
![Page 30: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/30.jpg)
Function words can also pose problems (3/4)
TAM Markers mark tense, aspect and modality
– Consist of inflections and/or auxiliary verbs in Hindi
– An important source of information
– Narrow down the meaning of a verb (eg. lied, lay)
![Page 31: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/31.jpg)
Function words can also pose problems (4/4)
English Simple Past vs Habitual
1a. He stayed in the guest house during his visit to our University in Jan (rahaa)
1b. He stayed in the guest house whenever he visited us (rahataa thaa)
2a. He went to the school just now (gayaa)
2b. He went to the school everyday (jaataa thaa)
![Page 32: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/32.jpg)
Sentence level ambiguity
I met the girl in the store
+ Possible readings
a) I met the girl who works in the store
b) I met the girl while I was in the store
Time flies like an arrow.
+ Possible parses:
a) Time flies like an arrow (N V Prep Det N)
b) Time flies like an arrow (N N V Det N)
c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
d) Time flies like an arrow (V N Prep Det N) (manner of timing)
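The category sequences listed above arise because several words belong to more than one class. The sketch below enumerates every combination allowed by a small lexicon. The lexicon and the name `candidate_taggings` are assumptions for illustration. Deciding which sequences are grammatical is left to a grammar or parser and is not attempted here.

```python
from itertools import product

# Assumed toy lexicon: each word with the categories it can take.
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["Prep", "V"],
    "an": ["Det"],
    "arrow": ["N"],
}


def candidate_taggings(sentence):
    """Return every category sequence the lexicon permits for the sentence."""
    words = sentence.lower().split()
    return [tuple(tags) for tags in product(*(LEXICON[w] for w in words))]


for tagging in candidate_taggings("Time flies like an arrow"):
    print(" ".join(tagging))
```

With this lexicon there are 2 × 2 × 2 × 1 × 1 = 8 candidate sequences, and the three distinct sequences on the slide are among them.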
![Page 33: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/33.jpg)
Thus,
Languages encode information differently
Languages code information only partially
Tension between BREVITY and PRECISION
Brevity wins, leading to inherent ambiguity at different levels
![Page 34: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/34.jpg)
Human beings use
World knowledge
Context (both linguistic and extra-linguistic)
Cultural knowledge and
Language conventions to resolve ambiguities
Can all this knowledge be provided to the machine ? Computational Linguistics aims for this.
![Page 35: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/35.jpg)
How to provide this knowledge ? (1/2)
Analyse language at various levels (word, phrase, sentence etc)
Build Tools for analysing the natural language at various levels in a text
POS tagger (category marking)
Morphological analysers (analysis of a word)
Morphological generators (word generators)
Chunkers (shallow parsers)
Parsers (syntactic analysis)
Filters (markers for special expressions)
Sense Disambiguation Algorithms
Etc
The tools need linguistic knowledge
![Page 36: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/36.jpg)
How to provide this knowledge ? (2/2)
Build language resources
Machine Readable Lexicon
Rules for various levels of linguistic analysis
Computational Grammars
Mapping rules for the concerned language pair for an MT system
Sense Disambiguation Rules
Annotated corpora
Etc
![Page 37: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/37.jpg)
POS Tagger
What is a POS?
Take the following English sentence
My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop .
Each word in the above sentence belongs to a word class (also called a Part Of Speech (POS))
The class to which a word may belong is based on its morphological and syntactic behavior
Morphological
Kind of affixes a word takes, for example,
boy, boys; girl, girls; book, books (noun class)
Syntactic
How it is distributed in a sentence
He chairs the next session (verb)
The chairs are new (noun)
![Page 38: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/38.jpg)
Why is POS relevant in CL/NLP ? (1/2)
• Word class information of a given word in a sentence helps to predict its neighbour
• WSD
He runs a mile every day (verb)
Their team made 250 runs (noun)
Time flies like an arrow (n v prep det n)
• Helps in further processing – chunking, morph pruning, sentence parsing
• IR
A POS tagger automatically marks the POS of all the words in a text
![Page 39: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/39.jpg)
POS tagged sentence
My possessive pronoun
old adjective
friend noun
Ram proper noun
recently adverb
bought verb
a determiner
book noun
on preposition
Indian adjective
snakes noun
for preposition
his possessive pronoun
cousin noun
from preposition
London proper noun
, punctuation
from preposition
the determiner
new adjective
bookshop noun
in preposition
town noun
![Page 40: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/40.jpg)
POS Tagging Approaches
Rule Based
Statistical
Transformation Based
![Page 41: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/41.jpg)
Rule Based POS Tagging
Two-stage architecture
(Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971)
Stage 1: assign POS by referring to the dictionary
Eg dictionary entry for the English word 'that'
that Conj, Adv, Pronoun
Stage 2: disambiguate, using manually crafted rules
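The two stages can be sketched as follows. Only the entry for 'that' comes from the slide. The other dictionary entries, the default of `Noun` for unknown words, and the two disambiguation rules are illustrative assumptions, far simpler than the rule sets of the cited systems.

```python
# Stage 1 resource: a tiny dictionary of possible tags per word.
DICTIONARY = {
    "that": ["Conj", "Adv", "Pronoun"],   # entry shown on the slide
    "he": ["Pronoun"], "she": ["Pronoun"], "it": ["Pronoun"],
    "said": ["Verb"], "left": ["Verb"], "is": ["Verb"],
    "not": ["Adv"], "good": ["Adj"], "true": ["Adj"],
}


def stage1(words):
    """Look up candidate tags; unknown words default to Noun (assumption)."""
    return [list(DICTIONARY.get(w.lower(), ["Noun"])) for w in words]


def stage2(candidates):
    """Resolve ambiguous words with two hand-written context rules."""
    tags = []
    for i, options in enumerate(candidates):
        if len(options) == 1:
            tags.append(options[0])
            continue
        prev_opts = candidates[i - 1] if i > 0 else []
        next_opts = candidates[i + 1] if i + 1 < len(candidates) else []
        if "Adv" in options and "Adj" in next_opts:
            tags.append("Adv")        # 'that good'
        elif "Conj" in options and "Verb" in prev_opts:
            tags.append("Conj")       # 'said that ...'
        else:
            tags.append("Pronoun" if "Pronoun" in options else options[0])
    return tags


def rule_based_tag(sentence):
    return stage2(stage1(sentence.split()))


print(rule_based_tag("He said that she left"))
print(rule_based_tag("It is not that good"))
print(rule_based_tag("That is true"))
```

The rules are easy to read and to correct by hand, but every new ambiguity needs another rule, which is the manual effort this approach requires.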
![Page 42: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/42.jpg)
Statistical
Taggers use probabilities for tagging
The tagger picks the most likely tag for a given word in a context
HMM based algorithms are most commonly used for POS tagging task
Requires manually tagged corpus
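A compact sketch of an HMM tagger is given below: it counts tag-to-tag transitions and tag-to-word emissions in a tiny hand-tagged corpus and then uses the Viterbi algorithm to pick the most likely tag sequence. The four-sentence corpus, the coarse tag names, and the add-one smoothing are assumptions chosen to keep the example small. A usable tagger needs a far larger corpus.

```python
import math
from collections import Counter, defaultdict

# Assumed toy corpus; 'runs' and 'chairs' occur both as noun and as verb.
CORPUS = [
    [("he", "PRON"), ("runs", "VERB"), ("a", "DET"), ("mile", "NOUN")],
    [("the", "DET"), ("team", "NOUN"), ("made", "VERB"), ("runs", "NOUN")],
    [("the", "DET"), ("chairs", "NOUN"), ("are", "VERB"), ("new", "ADJ")],
    [("he", "PRON"), ("chairs", "VERB"), ("the", "DET"), ("session", "NOUN")],
]


def train(corpus):
    """Count transitions (previous tag -> tag) and emissions (tag -> word)."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    vocab = {word for sentence in corpus for word, _ in sentence}
    return trans, emit, sorted(emit), vocab


def viterbi(words, model):
    """Most likely tag sequence under the add-one smoothed bigram HMM."""
    trans, emit, tags, vocab = model

    def log_t(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    def log_e(tag, word):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + 1) / (total + len(vocab)))

    best = {t: log_t("<s>", t) + log_e(t, words[0]) for t in tags}
    back = []
    for word in words[1:]:
        prev_best, best, pointers = best, {}, {}
        for t in tags:
            p = max(tags, key=lambda q: prev_best[q] + log_t(q, t))
            best[t] = prev_best[p] + log_t(p, t) + log_e(t, word)
            pointers[t] = p
        back.append(pointers)
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(back):   # follow back-pointers to the start
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


model = train(CORPUS)
print(viterbi("he runs the session".split(), model))
print(viterbi("the runs are new".split(), model))
```

In these two inputs 'runs' receives different tags because the neighbouring tags differ, which is the sense in which the tagger picks the most likely tag for a word in its context.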
![Page 43: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/43.jpg)
Annotating Corpus for POS
Annotated corpora are useful for developing statistical POS taggers
Tagging scheme
Set of POS Tags
Guidelines for the annotators
The tagged corpora should be
High quality (in terms of tagging accuracy)
Consistent
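Consistency can be checked by having two annotators tag the same text and comparing their tags. The sketch below computes plain agreement and Cohen's kappa, a common chance-corrected measure. The slides do not name a specific measure, so kappa is brought in here as an assumption, and the function names are hypothetical.

```python
from collections import Counter


def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which the two annotators chose the same tag."""
    same = sum(a == b for a, b in zip(tags_a, tags_b))
    return same / len(tags_a)


def cohen_kappa(tags_a, tags_b):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(tags_a)
    p_o = observed_agreement(tags_a, tags_b)
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["NN", "NN", "VB", "VB"]
annotator_2 = ["NN", "NN", "VB", "NN"]
print(observed_agreement(annotator_1, annotator_2))
print(cohen_kappa(annotator_1, annotator_2))
```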
![Page 44: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/44.jpg)
POS Tags for English
English
Penn Treebank – 45 tags
C5 - Lancaster – 61 tags – used in CLAWS
Basic tagset used for BNC http://view.byu.edu/bnc_tags.htm
- C7 – 147 tags – Leech
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
![Page 45: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/45.jpg)
Penn Treebank Tags
My PP$
old JJ
friend NN
Ram NNP
recently RB
bought VBD
a DT
book NN
on IN
Indian JJ
snakes NNS
for IN
his PP$
cousin NN
from IN
London NNP
, ,
from IN
the DT
new JJ
bookshop NN
in IN
town NN
![Page 46: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/46.jpg)
POS Tags for Indian Languages
Objective
To arrive at a standard POS and Chunk tagging scheme for all Indian languages
Assumption
Commonality in Indian Languages
![Page 47: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/47.jpg)
Issues in Tag Set Design (1/2)
Linguistic knowledge: coarse vs fine
Syntactic function vs lexical category (for POS tags)
New tags vs tags close to existing English tags
Should be comprehensive/complete
![Page 48: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/48.jpg)
Issues in Tag Set Design (2/2)
Simple
Less effort in manual tagging
Number of tags
Common for all Indian languages
![Page 49: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/49.jpg)
Linguistic Knowledge :Fine vs Coarse (1/2)
Example
Only noun
(NN) laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM
OR
Noun with gender, number, case information
(NNM) laDakA, laDake, laDakoM
(NNMS) laDakA, laDake
(NNMP) laDake, laDakoM
(NNMSD) laDakA
(NNMSO) laDake
(NNMPD) laDake
(NNMPO) laDakoM
The decision has implications for the size of corpora and machine learning
![Page 50: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/50.jpg)
Linguistic Knowledge :Fine vs Coarse (2/2)
Alternatives
Coarse - NN (advantages/disadvantages)
Fine - NNMSD (advantages/disadvantages)
Hierarchical
Example: NN_m_sg_d
Hierarchical tag set provides the possibility for underspecification
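Underspecification can be illustrated by treating a hierarchical tag as an ordered list of fields and letting a shorter tag match any longer tag that agrees on the fields it does specify. The field order (category, gender, number, case) is read off the example `NN_m_sg_d`. The field names and helper functions are assumptions for this sketch.

```python
# Assumed field order, following the example NN_m_sg_d.
FIELDS = ("category", "gender", "number", "case")


def parse_tag(tag):
    """Split a hierarchical tag into named fields; trailing fields may be absent."""
    return dict(zip(FIELDS, tag.split("_")))


def matches(full_tag, pattern):
    """True if the (possibly underspecified) pattern is compatible with full_tag."""
    full, wanted = parse_tag(full_tag), parse_tag(pattern)
    return all(full.get(field) == value for field, value in wanted.items())


print(parse_tag("NN_m_sg_d"))
print(matches("NN_m_sg_d", "NN"))       # coarse view of the same tag
print(matches("NN_m_sg_d", "NN_m_pl"))  # disagrees on number
```

An annotator or a tool can therefore stop at `NN` when finer information is unavailable, and finer tags remain comparable with coarse ones.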
![Page 51: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/51.jpg)
Considerations
POS tagger is NOT a replacement for a morph analyzer
Coarse analysis to begin with
Expandable if needed
If the information can be obtained from elsewhere, it need not be included in the POS tag
![Page 52: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/52.jpg)
Syntactic function vs lexical category
Example
harijana bAlaka ‘harijan’ ‘child’
Decision : Lexical category
Helps achieve
Consistency in annotation
Better learning
![Page 53: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/53.jpg)
New tags vs tags close to existing English tags
New tags
Noun, Pron, Adj, Adv
Familiar tags (Penn Treebank tags)
NN, PRP, JJ, RB
Decision : Penn tags for common lexical types
New tags for certain IL specific cases
![Page 54: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/54.jpg)
Comprehensive/Complete
All the lexical items occurring in a sentence should be marked for their POS, including punctuations.
If the language has some special cases, these should also be captured – Reduplications in ILs
![Page 55: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/55.jpg)
Simple
Why simple ?
The tags are designed for some manual annotation
Ease of learning
Consistency in annotation
![Page 56: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/56.jpg)
Less Effort in Manual Tagging
The annotators should not have to
Write too much
Take too many steps in annotating a lexical item
![Page 57: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/57.jpg)
Number of Tags
Number of tags makes a difference both for the man and the machine
For the man in decision making
For the machine in learning for automatic tagging
![Page 58: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/58.jpg)
Common for All Indian Languages
Indian languages belong to various language families
Share linguistic features
However, there are differences
Some languages have quotatives, some don't
Some have classifiers, some don't
![Page 59: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/59.jpg)
Chunking
What forms a chunk ?
Non-recursive phrase ((det adj noun))
Partial structure without distorting the dependencies
Include inflections (postposition/auxiliaries) with a lexical category
Example : ((mere choTe bhaaii ne))_NP
((jaa rahaa hai))_VG
![Page 60: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/60.jpg)
Chunker
A Chunker automatically groups words in a sentence as chunks and labels them
((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.
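A very small rule-based chunker can reproduce this grouping from POS tags. The sketch below is an assumption-laden illustration: it uses the Penn-style tags from the tagged example sentence, groups adjacent noun-phrase tags and adjacent verb-group tags, starts a new chunk at a determiner or possessive pronoun, and leaves prepositions outside chunks. It places the adverb inside the verb group, as in the example above.

```python
# Tag classes are assumptions tailored to the example sentence.
NP_TAGS = {"PP$", "DT", "JJ", "NN", "NNS", "NNP"}
NP_STARTERS = {"PP$", "DT"}      # these always open a new noun chunk
VG_TAGS = {"RB", "VBD"}


def chunk(tagged):
    """Group (word, tag) pairs into (label, [words]) chunks."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VG" if tag in VG_TAGS else None
        extend = (label is not None and chunks
                  and chunks[-1][0] == label and tag not in NP_STARTERS)
        if extend:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks


def render(chunks):
    """Print chunks in the ((...))_LABEL notation used on the slide."""
    parts = []
    for label, words in chunks:
        text = " ".join(words)
        parts.append(f"(({text}))_{label}" if label else text)
    return " ".join(parts)


TAGGED = [("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
          ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
          ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"), ("for", "IN"),
          ("his", "PP$"), ("cousin", "NN"), ("from", "IN"), ("London", "NNP"),
          ("from", "IN"), ("the", "DT"), ("new", "JJ"), ("bookshop", "NN")]

print(render(chunk(TAGGED)))
```

Because the rules look only at adjacent tags, the chunks stay non-recursive, which matches the definition of a chunk as a partial structure.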
![Page 61: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/61.jpg)
IL Chunk Tags (1/2)
NP     noun chunk            bahut acchii kitaab
JJP    adjective chunk       bahut sundar sii
RBP    adverb chunk          dhiire-dhiire
NEGP   chunk for negatives   nahiiN
CCP    conjunct chunks       raam aur shyaam
BLK    miscellaneous         interjections etc
![Page 62: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/62.jpg)
IL Chunk Tags (2/2)
VGF    Finite verb chunk                    jaa rahaa hai
VGNF   Non finite verb chunk                jaate hue
VGINF  Infinitive verb chunk                jaanaa
VGNN   Gerunds                              jaanaa
FRAGP  Discontiguous fragments of a chunk   raama (meraa bhaaii) ne
![Page 63: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/63.jpg)
Some Issues
How to chunk the following ?
Adverbs
within a verb chunk or separately
Eg ((recently bought)) or ((recently)) ((bought))
Punctuations
Particles – hii (only), to, bhii (also) etc
![Page 64: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/64.jpg)
Current approach
For punctuation – chunk them with the preceding chunk
Adverbs – chunk them separately
Particles – chunk them with the chunk to which they belong
((raam ne bhii)) ((jaa hii rahaa thaa))
![Page 65: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/65.jpg)
Issues
• Verb Negation
1. nahiiN jaa rahaa ‘not going’
2. kahaa hii nahiiN ‘just did not mention’
3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)
4. binaa yaha baata kahe ‘without saying this’
5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai
‘Not only this, in fact, this is also found in writing’
![Page 66: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/66.jpg)
Current approach
For cases 1 to 3, chunk NEG with the verb group
For 4, chunk the NEG separately in a chunk
For 5, also a separate NEGP chunk will work
NOUN NEGATION ???
![Page 67: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/67.jpg)
Chunking Co-ordinate Constructions
1. word1 CC word2
raam aur shyaam
((raam))_NP ((aur))_CCP ((shyaam))_NP
2. phrase CC phrase
meraa bhaaii shyaam aur tumhaaraa bhaaii mohan
((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP
3. clause CC clause
![Page 68: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/68.jpg)
Discontiguous Phrases
What about cases such as ' X (Y) Z' ?
where X = noun, Y = a phrase, Z = postposition
raam (meraa dillii vaalaa bhaaii) ne
OR
isa 'upanyaas – samraaT' shabda kaa
FRAGP
![Page 69: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/69.jpg)
Chunking Conjunct Verbs
Conjunct verbs
A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')
Should the conjunct verbs be tagged as a single chunk or two chunks?
'prawIkSA karanA', 'kSamA karanA' etc
‘to wait’ ‘to forgive’
![Page 70: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/70.jpg)
What about genitives ?
raam kaa betaa
'Ram's son'
usakaa betaa
'his/her son'
mere bhaaii raam kaa betaa
'my brother Ram's son'
iske pahale
'before this'
mez ke uupar
'above/on the table'
ravi ke saath
'with Ravi'
![Page 71: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/71.jpg)
Chunking Numbers/Quantifiers (1/2)
Numerals, quantifiers may occur as follows
a) ek laDakaa 'one boy'
b) 1 laDakaa '1 boy'
c) pahalaa laDakaa 'first boy'
d) karoDoN log 'crores (tens of millions) of people'
e) 1962 meiN 'in 1962'
![Page 72: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/72.jpg)
Chunking Numbers/Quantifiers (2/2)
The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in IL POS tagset
Examples (d) and (e) on the previous slide show cases where the quantifier behaves like a noun
The issue :
Should the quantifiers in cases such as (d) and (e) be tagged as a Q* or as NN since the chunk itself is a noun chunk ?
![Page 73: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/73.jpg)
Summary
A scheme needs to be designed for annotating POS and chunks
While doing so, the following issues need to be considered:
Definition of 'chunk'
Elements which together can form a chunk type
Whether to include postpositions, punctuations etc inside a chunk or form them as independent chunks
POS/Chunk tag labels
![Page 74: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/74.jpg)
Approaches in Computational Linguistics (for Tools)
Two major approaches
Rule based
Requires manually crafted rules
Explicit linguistic knowledge
Needs manual time and effort
Trained manpower
High precision
Less robust
![Page 75: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/75.jpg)
Approaches in Computational Linguistics (for Tools)
Data driven approach
Uses statistical methods or machine learning
Requires less human effort
Often requires large scale data sources (manually annotated corpora, lexicons etc)
Linguistic knowledge is implicit
More adaptive to noisy text
More robust
![Page 76: Introduction to Computational Linguistics](https://reader033.fdocuments.us/reader033/viewer/2022051218/568157a3550346895dc53486/html5/thumbnails/76.jpg)
Computational Linguistics Application Areas
Is useful for communication between
Man – machine
Question answering systems, interactive railway reservation
Text summarization
Web applications
Intelligent search engines
Cross lingual search
Man – man
Machine translation