Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.
![Page 1: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/1.jpg)
Natural Language Processing Assignment
Group Members: Soumyajit De, Naveen Bansal, Sanobar Nishat
![Page 2: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/2.jpg)
Outline
• POS tagging
* Tag-wise accuracy
* Graph: tag-wise accuracy
* Precision, recall, F-score
• Improvements in POS tagging
* Implementation of trigram
* POS tagging with smoothing
* Tag-wise accuracy
* Improved precision, recall and F-score
• Next word prediction
* Model #1
* Model #2
* Implementation method and details
* Scoring ratio
* Perplexity ratio
• NLTK
• Yago
* Different examples using Yago
• Parsing
* Different examples
* Conclusions
![Page 3: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/3.jpg)
POS Tagging
![Page 4: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/4.jpg)
![Page 5: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/5.jpg)
Outline
![Page 6: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/6.jpg)
Precision, Recall, F-Score
Precision = 0.92
Recall = 1
F-score = 0.958
![Page 7: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/7.jpg)
Improvements in POS tagger
![Page 8: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/8.jpg)
Improvement in POS Tagger
• Implementation of trigram
* Issue: sparsity (solution: smoothing)
* Result: increases overall accuracy up to 94%
![Page 9: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/9.jpg)
Improvement in POS Tagger (cont..)
• Implementation of smoothing technique
* Linear interpolation technique
* Formula:
i.e.
* Finding the value of lambda
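The formula itself did not survive in this transcript. As a rough illustration only, linear interpolation is commonly written as a weighted sum of unigram, bigram and trigram estimates, with weights (lambdas) that sum to one. The sketch below assumes that standard form; the function names, the `^` padding symbol and the lambda values are hypothetical and are not taken from the slides.

```python
from collections import Counter

def train_counts(tag_sequences):
    """Collect unigram, bigram and trigram tag counts plus context counts."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()
    total = 0
    for tags in tag_sequences:
        padded = ["^", "^"] + list(tags)  # "^" is an assumed start marker
        for i in range(2, len(padded)):
            t2, t1, t0 = padded[i - 2], padded[i - 1], padded[i]
            uni[t0] += 1
            bi[(t1, t0)] += 1
            tri[(t2, t1, t0)] += 1
            ctx1[t1] += 1
            ctx2[(t2, t1)] += 1
            total += 1
    return uni, bi, tri, ctx1, ctx2, total

def interpolated_prob(t2, t1, t0, counts, lambdas):
    """P(t0 | t2, t1) as a lambda-weighted mix of the three estimates."""
    uni, bi, tri, ctx1, ctx2, total = counts
    l1, l2, l3 = lambdas  # assumed to sum to 1
    p_uni = uni[t0] / total if total else 0.0
    p_bi = bi[(t1, t0)] / ctx1[t1] if ctx1[t1] else 0.0
    p_tri = tri[(t2, t1, t0)] / ctx2[(t2, t1)] if ctx2[(t2, t1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy training data; the lambda values here are placeholders, not tuned.
counts = train_counts([["AT0", "NN1", "VVZ"], ["AT0", "NN1"]])
seen = interpolated_prob("AT0", "NN1", "VVZ", counts, (0.1, 0.3, 0.6))
unseen = interpolated_prob("NN1", "VVZ", "NN1", counts, (0.1, 0.3, 0.6))
```

The unseen trigram still receives a non-zero probability from the unigram term, which is the point of smoothing for sparse trigram counts. In practice the lambdas would be chosen on held-out data rather than fixed by hand.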
![Page 10: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/10.jpg)
POS tagging Accuracy with smoothing
[Chart: accuracy (%) on the y-axis, from 94.02 to 94.22, plotted over points 1 to 10 on the x-axis]
![Page 11: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/11.jpg)
• Precision : tp/(tp+fp) = 0.9415
• Recall: tp/(tp+fn) = 1
• F-score: 2 × precision × recall / (precision + recall) = 0.97
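A minimal sketch of the three formulas above. The counts passed in are hypothetical, chosen only so that precision comes out at the reported 0.9415 and recall at 1; the actual tp/fp/fn counts are not given in the slides.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts: 9415 correct tags, 585 wrong tags, no missed tokens.
p, r, f = precision_recall_f(tp=9415, fp=585, fn=0)
print(round(p, 4), r, round(f, 2))
```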
![Page 12: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/12.jpg)
Tag wise accuracy
[Bar chart: accuracy (%) per tag, y-axis 0 to 120, for tags AJ0, AJC, AJS, AT0, AV0, AVP, AVQ, CJC, CJS, CJT, CRD, DPS, DT0, DTQ, EX0, ITJ, NN0, NN1, NN2, NP0]
![Page 13: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/13.jpg)
Tag wise accuracy (cont.)
[Bar chart: accuracy (%) per tag, y-axis 0 to 120, for tags ORD, PNI, PNP, PNQ, PNX, POS, PRF, PRP, PUL, PUN, PUQ, PUR, TO0, UNC, VBB, VBD, VBG, VBI, VBN, VBZ]
[Bar chart: accuracy (%) per tag, y-axis 0 to 120, for tags VDB, VDD, VDG, VDI, VDN, VDZ, VHB, VHD, VHG, VHI, VHN, VHZ, VM0, VVB, VVD, VVG, VVI, VVN, VVZ, XX0, ZZ0]
![Page 14: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/14.jpg)
Further improvements in POS tagging by handling unknown words
![Page 15: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/15.jpg)
Precision score (accuracy in %age)
![Page 16: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/16.jpg)
Tag wise accuracy
![Page 17: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/17.jpg)
Error Analysis
VVB - finite base form of lexical verbs (e.g. forget, send, live, return)
Count: 9916

| Confused with | Count | Reason |
|---|---|---|
| VVI (infinitive form of lexical verbs, e.g. forget, send, live, return) | 1201 | VVB is used to tag a word that has the same form as the infinitive without “to”, for all persons. E.g. “He has to show” / “Show me” |
| VVD (past tense form of lexical verbs, e.g. forgot, sent, lived, returned) | 145 | The base form and past tense form of many verbs are the same, so the emission probability of such words dominates and VVB is wrongly tagged as VVD. The transition probability may have had less influence. |
| NN1 | 303 | Words whose base form matches a common noun get confused with it. E.g. in “The seasonally adjusted total regarded as…”, “total” has been tagged as both VVB and NN1. |
![Page 18: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/18.jpg)
Error Analysis
ZZ0 - Alphabetical symbols (e.g. A, a, B, b, c, d) (Accuracy - 63%)
Count: 337

| Confused with | Count | Reason |
|---|---|---|
| AT0 (article, e.g. the, a, an, no) | 98 | The emission probability of “a” as AT0 is much higher than as ZZ0, so AT0 dominates when tagging “a”. |
| CRD (cardinal number, e.g. one, 3, fifty-five, 3609) | 16 | Because of the bigram/trigram transition probability assumption. |
![Page 19: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/19.jpg)
Error Analysis
ITJ - Interjection (Accuracy - 65%)
Count: 177
Reason: the ITJ tag appears so rarely that, although it was not misclassified many times, its accuracy percentage is still low.

| Confused with | Count | Reason |
|---|---|---|
| AT0 (article, e.g. the, a, an, no) | 26 | “No” is used both as ITJ and as an article in the corpus, so the confusion is due to the word's higher emission probability with AT0. |
| NN1 (singular common noun) | 14 | “Bravo” is tagged as both NN1 and ITJ in the corpus. |
![Page 20: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/20.jpg)
Error Analysis
UNC - Unclassified items (Accuracy - 23%)
Count: 756

| Confused with | Count | Reason |
|---|---|---|
| AT0 (article, e.g. the, a, an, no) | 69 | UNC is wrongly tagged because the transition probability dominates. |
| NN1 (singular common noun) | 224 | UNC is wrongly tagged because the transition probability dominates. |
| NP0 (proper noun, e.g. London, Michael, Mars, IBM) | 132 | A new word beginning with a capital letter is tagged as NP0, since UNC words mostly do not repeat across different corpora. |
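The per-tag accuracies and "confused with" counts in the error analysis can be derived from aligned gold and predicted tag sequences. This is a minimal sketch of that bookkeeping, not the group's actual evaluation script; `tagwise_report` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def tagwise_report(gold, predicted):
    """Return per-tag accuracy (%) and, per gold tag, counts of wrong predictions."""
    totals = Counter(gold)
    correct = Counter()
    confusions = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
        else:
            confusions[g][p] += 1
    accuracy = {tag: 100.0 * correct[tag] / totals[tag] for tag in totals}
    return accuracy, confusions

# Toy example: four VVB tokens, two of them mistagged as VVI and one as NN1.
gold = ["VVB", "VVB", "VVB", "VVB", "NN1"]
pred = ["VVB", "VVI", "VVI", "NN1", "NN1"]
accuracy, confusions = tagwise_report(gold, pred)
```

Sorting `confusions[tag].most_common()` gives the same kind of ranking as the tables above.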
![Page 21: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/21.jpg)
Next word prediction
![Page 22: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/22.jpg)
Model # 1
When only the previous word is given
Example: He likes -------
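One plausible reading of Model 1 is a bigram predictor that returns the word most often seen after the previous word. The sketch below assumes that reading; the function name and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_lookup(sentences):
    """Map each word to the word that most frequently follows it."""
    follow = defaultdict(Counter)
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return {prev: counts.most_common(1)[0][0] for prev, counts in follow.items()}

# Toy corpus, for illustration only.
corpus = [
    ["he", "likes", "tea"],
    ["he", "likes", "tea"],
    ["she", "likes", "coffee"],
]
lookup = build_lookup(corpus)
```

With this corpus, `lookup["likes"]` is `"tea"` because "tea" follows "likes" twice and "coffee" only once.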
![Page 23: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/23.jpg)
Model # 2
When the previous tag and previous word are known.
Example: He_PP0 likes_VB0 --------
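The exact formulation of Model 2 is not recoverable from this transcript. As an assumption, the sketch below conditions the prediction on the pair (previous word, previous tag), so that the same word can lead to different predictions under different tags. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def build_tagged_lookup(tagged_sentences):
    """Map (word, tag) to the word that most frequently follows that pair."""
    follow = defaultdict(Counter)
    for sent in tagged_sentences:
        for (prev_word, prev_tag), (next_word, _next_tag) in zip(sent, sent[1:]):
            follow[(prev_word, prev_tag)][next_word] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in follow.items()}

# Toy tagged corpus; the first sentence reuses the slide's He_PP0 likes_VB0 notation.
tagged = [
    [("He", "PP0"), ("likes", "VB0"), ("tea", "NN1")],
    [("his", "DPS"), ("likes", "NN2"), ("vary", "VVB")],
]
tagged_lookup = build_tagged_lookup(tagged)
```

Because each (word, tag) pair is seen less often than the word alone, the counts behind each prediction are thinner than in Model 1.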
Previous Work
![Page 24: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/24.jpg)
Model # 2 (cont..)
Current Work
![Page 25: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/25.jpg)
Evaluation Method
1. Scoring Method
• Divide the testing corpus into bigrams
• Match the 2nd word of each testing-corpus bigram with the word predicted by each model
• Increment the score if a match is found
• The final evaluation is the ratio of the two models' scores, i.e. model 1 / model 2
• If ratio > 1, model 1 is performing better, and vice versa.
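The scoring steps above can be sketched as follows. For simplicity both models are represented here as plain word-to-word lookup dictionaries, which is an assumption; the function names and test sentences are hypothetical.

```python
def score(lookup, test_sentences):
    """Count test bigrams whose 2nd word equals the model's prediction."""
    hits = 0
    for words in test_sentences:
        for prev, actual in zip(words, words[1:]):
            if lookup.get(prev) == actual:
                hits += 1
    return hits

def scoring_ratio(lookup1, lookup2, test_sentences):
    """Ratio model 1 / model 2; a value above 1 favours model 1."""
    s1 = score(lookup1, test_sentences)
    s2 = score(lookup2, test_sentences)
    return s1 / s2 if s2 else float("inf")

# Toy lookup tables and test data, for illustration only.
model1 = {"I": "see", "he": "looks"}
model2 = {"I": "see", "he": "goes"}
test = [["I", "see"], ["he", "looks"], ["he", "looks"], ["he", "goes"]]
ratio = scoring_ratio(model1, model2, test)
```

Here model 1 matches three of the four test bigrams and model 2 matches two, so the ratio is 1.5.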
![Page 26: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/26.jpg)
Implementation Detail
Look Up Table

| Previous Word | Next Predicted Word (Model 1) | Next Predicted Word (Model 2) |
|---|---|---|
| I | see | see |
| he | looks | goes |
| : | : | : |

The look-up table is used to predict the next word.
![Page 27: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/27.jpg)
Scoring Ratio
[Chart: scoring ratio on the y-axis, from 10.4 to 12.2, plotted over points 1 to 5 on the x-axis]
![Page 28: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/28.jpg)
2. Perplexity:
Comparison:
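The perplexity formula and the comparison were images and are missing from this transcript. The sketch below assumes the standard definition, the exponential of the negative average log-probability over the test bigrams, and compares two models by the ratio of their perplexities. The probability functions are hypothetical stand-ins for the two models.

```python
import math

def perplexity(prob, sentences):
    """exp(-(1/N) * sum(log P(cur | prev))) over all test bigrams."""
    log_sum = 0.0
    n = 0
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            log_sum += math.log(prob(prev, cur))
            n += 1
    return math.exp(-log_sum / n)

def perplexity_ratio(prob1, prob2, sentences):
    """Perplexity of model 1 divided by perplexity of model 2."""
    return perplexity(prob1, sentences) / perplexity(prob2, sentences)

# Stand-in models that assign a constant probability to every bigram.
test = [["he", "likes", "tea"], ["she", "likes", "coffee"]]
pp1 = perplexity(lambda prev, cur: 0.25, test)
pp2 = perplexity(lambda prev, cur: 0.2, test)
```

A real model must return a non-zero (smoothed) probability for every test bigram, otherwise `math.log` fails.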
![Page 29: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/29.jpg)
Perplexity Ratio
[Chart: perplexity ratio on the y-axis, from 0.988 to 1, plotted over points 1 to 5 on the x-axis]
![Page 30: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/30.jpg)
Remarks
• Model 2 performs worse than Model 1 because words are sparsely distributed among tags.
![Page 31: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/31.jpg)
Further Experiments
![Page 32: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/32.jpg)
Score (ratio) of word-prediction
[Chart: score ratio on the y-axis, from 1.13 to 1.23, plotted over points 1 to 10 on the x-axis]
![Page 33: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/33.jpg)
Perplexity (ratio) of word-prediction
[Chart: perplexity ratio on the y-axis, from 0.84 to 1, plotted over points 1 to 10 on the x-axis]
![Page 34: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/34.jpg)
Remarks
• Perplexity is found to decrease with this model.
• The overall score has increased.
![Page 35: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/35.jpg)
Yago
![Page 36: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/36.jpg)
Example #1
Query: Amitabh and Sachin
wikicategory_Living_people -- <type> -- Amitabh_Bachchan -- <givenNameOf> -- Amitabh
wikicategory_Living_people -- <type> -- Sachin_Tendulkar -- <givenNameOf> -- Sachin
ANOTHER-PATH
wikicategory_Padma_Shri_recipients -- <type> -- Amitabh_Bachchan -- <givenNameOf> -- Amitabh
wikicategory_Padma_Shri_recipients -- <type> -- Sachin_Tendulkar -- <givenNameOf> -- Sachin
![Page 37: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/37.jpg)
Example #2
Query: India and Pakistan
PATH
wikicategory_WTO_member_economies -- <type> -- India
wikicategory_WTO_member_economies -- <type> -- Pakistan
ANOTHER-PATH
wikicategory_English-speaking_countries_and_territories -- <type> -- India
wikicategory_English-speaking_countries_and_territories -- <type> -- Pakistan
ANOTHER-PATH
Operation_Meghdoot -- <participatedIn> -- India
Operation_Meghdoot -- <participatedIn> -- Pakistan
![Page 38: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/38.jpg)
ANOTHER-PATH
Operation_Trident_(Indo-Pakistani_War) -- <participatedIn> -- India
Operation_Trident_(Indo-Pakistani_War) -- <participatedIn> -- Pakistan
ANOTHER-PATH
Siachen_conflict -- <participatedIn> -- India
Siachen_conflict -- <participatedIn> -- Pakistan
ANOTHER-PATH
wikicategory_Asian_countries -- <type> -- India
wikicategory_Asian_countries -- <type> -- Pakistan
![Page 39: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/39.jpg)
ANOTHER-PATH
Capture_of_Kishangarh_Fort -- <participatedIn> -- India
Capture_of_Kishangarh_Fort -- <participatedIn> -- Pakistan
ANOTHER-PATH
wikicategory_South_Asian_countries -- <type> -- India
wikicategory_South_Asian_countries -- <type> -- Pakistan
ANOTHER-PATH
Operation_Enduring_Freedom -- <participatedIn> -- India
Operation_Enduring_Freedom -- <participatedIn> -- Pakistan
ANOTHER-PATH
wordnet_region_108630039 -- <type> -- India
wordnet_region_108630039 -- <type> -- Pakistan
![Page 40: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/40.jpg)
Example #3
Query: Tom and Jerry
wikicategory_Living_people -- <type> -- Tom_Green -- <givenNameOf> -- Tom
wikicategory_Living_people -- <type> -- Jerry_Brown -- <givenNameOf> -- Jerry
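The outputs above pair up paths that lead from a shared node to each of the two query terms. A minimal sketch of how such paths could be found is a bounded breadth-first search from each term, followed by an intersection of the reached nodes. This is not the Yago API: the triples are toy facts copied from Example #1, relations are treated as undirected, and every function name is hypothetical.

```python
from collections import defaultdict, deque

# Toy facts taken from Example #1; a real Yago dump is far larger.
TRIPLES = [
    ("Amitabh", "givenNameOf", "Amitabh_Bachchan"),
    ("Sachin", "givenNameOf", "Sachin_Tendulkar"),
    ("Amitabh_Bachchan", "type", "wikicategory_Living_people"),
    ("Sachin_Tendulkar", "type", "wikicategory_Living_people"),
    ("Amitabh_Bachchan", "type", "wikicategory_Padma_Shri_recipients"),
    ("Sachin_Tendulkar", "type", "wikicategory_Padma_Shri_recipients"),
]

def build_graph(triples):
    """Adjacency list that ignores edge direction (a simplifying assumption)."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((rel, subj))
    return graph

def paths_from(graph, start, max_hops):
    """Map each node within max_hops to one shortest path [node, <rel>, ..., start]."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        hops = (len(paths[node]) - 1) // 2
        if hops == max_hops:
            continue
        for rel, neighbour in graph[node]:
            if neighbour not in paths:
                paths[neighbour] = [neighbour, f"<{rel}>"] + paths[node]
                queue.append(neighbour)
    return paths

def common_paths(triples, term_a, term_b, max_hops=2):
    """Return (path to term_a, path to term_b) for every node both terms reach."""
    graph = build_graph(triples)
    from_a = paths_from(graph, term_a, max_hops)
    from_b = paths_from(graph, term_b, max_hops)
    shared = sorted(set(from_a) & set(from_b))
    return [(" -- ".join(from_a[n]), " -- ".join(from_b[n])) for n in shared]

result = common_paths(TRIPLES, "Amitabh", "Sachin")
```

Each entry of `result` is a pair of strings in the same `node -- <relation> -- node` layout as the examples.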
![Page 41: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/41.jpg)
Parsing
Example#1:
![Page 42: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/42.jpg)
Example#2
Example#3
![Page 43: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/43.jpg)
Example#4
![Page 44: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/44.jpg)
• Example#5
• Example#6
![Page 45: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/45.jpg)
• Example#7
![Page 46: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/46.jpg)
Conclusion
1. VBZ always comes at the end of the parse tree in Hindi and Urdu.
2. The structure in Hindi and Urdu always either expands as is or is rearranged to NP VB, e.g. S => NP VP (no change) OR VP => VBZ NP (interchange)
3. For exact translation in Hindi and Urdu, merging of sub-trees in English is sometimes required.
4. One-word-to-multiple-words mapping is common while translating from English to Hindi/Urdu, e.g. donor => aatiya shuda OR have => rakhta hai
5. Phrase-to-phrase translation is sometimes required, so chunking is required, e.g. hand in hand => choli daman ka saath (Urdu) => sath sath hain (Hindi)
6. DT NN or DT NP doesn't interchange.
7. In example#7: correct translation won’t require merging of two sub-trees MD and VP, e.g. could be => jasakta hai
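Conclusions 1 and 2 describe a reordering in which the verb moves to the end of its VP. A minimal sketch of that single rule on a parse tree written as nested tuples is shown below. It is an illustration only: it does no translation, it ignores the other conclusions, and the tree encoding and function names are hypothetical.

```python
def reorder(tree):
    """Return a copy of the tree in which VB* children of every VP come last."""
    if isinstance(tree, str):  # a leaf word
        return tree
    label, *children = tree
    children = [reorder(child) for child in children]
    if label == "VP":
        verbs = [c for c in children
                 if not isinstance(c, str) and c[0].startswith("VB")]
        others = [c for c in children if c not in verbs]
        children = others + verbs  # VP => VBZ NP becomes VP => NP VBZ
    return (label, *children)

def leaves(tree):
    """Collect the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

# (S (NP (PRP He)) (VP (VBZ likes) (NP (NNS apples))))
english = ("S",
           ("NP", ("PRP", "He")),
           ("VP", ("VBZ", "likes"), ("NP", ("NNS", "apples"))))
reordered = reorder(english)
```

The S => NP VP level is left unchanged, while the VP children are interchanged, giving the verb-final word order "He apples likes".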
![Page 47: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/47.jpg)
NLTK Toolkit
• NLTK is a suite of open source Python modules
• Components of NLTK: code, corpora (>30 annotated data sets)
1. corpus readers
2. tokenizers
3. stemmers
4. taggers
5. parsers
6. wordnet
7. semantic interpretation
![Page 48: Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.](https://reader038.fdocuments.us/reader038/viewer/2022110100/56649e165503460f94b01375/html5/thumbnails/48.jpg)
A* - Heuristic
^ $
Fixed: (min cost) × (no. of hops)
Selected Route
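The slide gives only the heuristic, so the following is a sketch under assumptions: the search runs over a tag lattice from `^` to `$`, edge costs are non-negative (for example negative log probabilities), and the fixed heuristic is the minimum edge cost multiplied by the number of hops still to go. The function name and the cost table are hypothetical.

```python
import heapq

def a_star_tag(n_words, tags, cost):
    """Cheapest route from '^' to '$'; cost(i, prev_tag, tag) is the cost of hop i."""
    total_hops = n_words + 1  # one hop per word plus the final hop into '$'
    all_costs = []
    for i in range(total_hops):
        prevs = ["^"] if i == 0 else tags
        curs = ["$"] if i == n_words else tags
        all_costs.extend(cost(i, p, c) for p in prevs for c in curs)
    min_cost = min(all_costs)

    # Frontier entries: (g + h, g, route so far), with h = min_cost * hops left.
    frontier = [(min_cost * total_hops, 0.0, ["^"])]
    while frontier:
        _, g, route = heapq.heappop(frontier)
        i = len(route) - 1  # hops already taken
        if i == total_hops:
            return route, g  # first complete route popped is the cheapest
        next_tags = ["$"] if i == n_words else tags
        for tag in next_tags:
            g2 = g + cost(i, route[-1], tag)
            h2 = min_cost * (total_hops - i - 1)
            heapq.heappush(frontier, (g2 + h2, g2, route + [tag]))
    return None, float("inf")

# Hypothetical costs for a two-word sentence with tags N and V.
TABLE = {
    (0, "^", "N"): 1, (0, "^", "V"): 3,
    (1, "N", "N"): 4, (1, "N", "V"): 1, (1, "V", "N"): 2, (1, "V", "V"): 5,
    (2, "N", "$"): 1, (2, "V", "$"): 1,
}
route, total = a_star_tag(2, ["N", "V"], lambda i, p, c: TABLE[(i, p, c)])
```

Because no hop can cost less than the minimum edge cost, this heuristic never overestimates the remaining cost, so the first complete route taken off the queue is the selected (cheapest) route.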