OpenNLP demo
gagan-gowda -
Samatha
Gagan Sunil
What is NLP?
• NLP provides means of analyzing text
• The goal of NLP is to make computers analyze and understand the languages that humans use naturally
• NLP enables interaction between computers and humans
Why Natural Language Processing?
• kJfmmfj mmmvvv nnnffn333
• Uj iheale eleee mnster vensi credur
• Baboi oi cestnitze
• Computers “see” text in English the same way you have seen above!
• People have no trouble understanding language
• Computers have
– No common sense knowledge
– No reasoning capacity
Natural Language Processing
[Figure: raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text]
Example: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.
Tokens: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .
POS tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
[Figure: parse tree over the tagged sentence, with NP, PP and VP phrase nodes under the root S]
Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt
Uses of NLP
• Text based application
• Dialogue based application
• Information extraction: extract useful information, e.g. from resumes
• Automatic summarization: condense 1 book into 1 page
What is OpenNLP?
OpenNLP is an open-source, Java-based set of NLP tools that perform: [1]
1. sentence detection
2. tokenization
3. POS tagging
4. parsing
5. named-entity detection
[1] http://opennlp.sourceforge.net/
Use of OpenNLP in our university project
• Named entity recognition can be used to search for names in text.
OpenNLP is used for:
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Named entity recognition
• Chunking
• Treebank Parser
Sentence splitting
sentence boundary = period + space(s) + capital letter
Input:
Unusually, the gender of crocodiles is determined by temperature. If the eggs are incubated at over 33°C, then the egg hatches into a male or 'bull' crocodile. At lower temperatures only female or 'cow' crocodiles develop.
Output:
Unusually, the gender of crocodiles is determined by temperature.
If the eggs are incubated at over 33°C, then the egg hatches into a male or 'bull' crocodile.
At lower temperatures only female or 'cow' crocodiles develop.
sentDetect(s, language = "en", model = NULL)
s: A character vector with texts from which sentences should be detected.
language: A character string giving the language of s. This argument is only used if model is NULL for selecting a default model.
model: A model. If model is NULL then a default model for sentence detection is loaded from the corresponding openNLP models language package.
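The boundary rule stated above (period + space(s) + capital letter) can be tried out with a few lines of Python. This is only a sketch of that rule, not how sentDetect or OpenNLP works internally; the function name `split_sentences` is made up for the illustration.

```python
import re

# Boundary rule from the slide: a period, then one or more spaces,
# then a capital letter. The lookbehind/lookahead keep the period and
# the capital letter in the sentences; only the spaces are consumed.
BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


text = (
    "Unusually, the gender of crocodiles is determined by temperature. "
    "If the eggs are incubated at over 33°C, then the egg hatches into "
    "a male or 'bull' crocodile. "
    "At lower temperatures only female or 'cow' crocodiles develop."
)

for sentence in split_sentences(text):
    print(sentence)
```

The rule is deliberately naive: an abbreviation such as "Dr." followed by a name also matches it and is split in the wrong place, which is one reason a trained model is used instead.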
Tokenization
• Convert a sentence into a sequence of tokens
• Divides the text into its smallest units (usually words), separating punctuation into tokens of its own
Rule:
• Use spaces as the boundaries
• Add spaces before and after special characters
tokenize(s, language = "en", model = NULL)
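The two tokenization rules above can be sketched in plain Python. This is an illustration of the rules only, not the OpenNLP tokenizer; `simple_tokenize` is a hypothetical name, and splitting "n't" off its verb is an extra step added here so that the result matches the deck's "does n't" example.

```python
import re


def simple_tokenize(sentence):
    """Tokenize by padding special characters with spaces, then splitting."""
    # Assumption for this sketch: split the contraction "n't" off its verb.
    s = re.sub(r"(\w)n't\b", r"\1 n't", sentence)
    # Add spaces before and after special characters (apostrophes excluded).
    s = re.sub(r"([^\w\s'])", r" \1 ", s)
    # Use spaces as the boundaries.
    return s.split()


sentence = ("A Saudi Arabian woman can get a divorce "
            "if her husband doesn't give her coffee.")
print(" ".join(simple_tokenize(sentence)))
```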
Tokenization
"A Saudi Arabian woman can get a divorce if her husband doesn't give her coffee."
" A Saudi Arabian woman can get a divorce if her husband does n't give her coffee . "
Part-of-speech tagging
Assign a part-of-speech tag to each token in a sentence.
Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS
Most lipstick is partially made of fish scales
tagPOS(sentence, language = "en", model = NULL, tagdict = NULL)
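Tagged output in the word/TAG form shown above is easy to post-process. The Python sketch below only reads that text format; it does not run a tagger, and the helper name `parse_tagged` is invented for the example.

```python
def parse_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # Split on the last slash so a word containing '/' stays intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = ("Most/JJS lipstick/NN is/VBZ partially/RB "
          "made/VBN of/IN fish/NN scales/NNS")
pairs = parse_tagged(tagged)

# Noun tags in the Penn tag set all start with "NN".
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```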
Part-of-speech tags [1]
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
PRP - Personal pronoun
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WRB - Wh-adverb
Phrase tags:
NP - Noun Phrase
PP - Prepositional Phrase
VP - Verb Phrase
[1] http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
Named-Entity Recognition
• Named entity recognition classifies tokens in text into predefined categories such as date, location, person, time.
• The name finder can find up to seven different types of entities - date, location, money, organization, percentage, person, and time.
Named-Entity Recognition
Diana Hayden was in Philadelphia city on 3rd october
<namefind/person>Diana Hayden</namefind/person> was in <namefind/location>Philadelphia</namefind/location> city on <namefind/date>3rd october</namefind/date>
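To use the result for searching names, the entities have to be pulled back out of the marked-up sentence. The Python sketch below assumes the markup looks exactly like the example above (`<namefind/TYPE>...</namefind/TYPE>`); the helper names are hypothetical, and no name finder is run here.

```python
import re

# Matches one marked-up entity; \1 forces the closing tag to use the
# same entity type as the opening tag.
ENTITY = re.compile(r"<namefind/(\w+)>(.*?)</namefind/\1>")


def extract_entities(text):
    """Return (entity_type, entity_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(text)]


def strip_markup(text):
    """Remove the entity tags and keep the plain sentence."""
    return ENTITY.sub(r"\2", text)


tagged = (
    "<namefind/person>Diana Hayden</namefind/person> was in "
    "<namefind/location>Philadelphia</namefind/location> city on "
    "<namefind/date>3rd october</namefind/date>"
)

print(extract_entities(tagged))
print(strip_markup(tagged))
```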
Chunking (shallow parsing)
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .
A chunker (shallow parser) segments a sentence into meaningful phrases.
Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt
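One simple way to hold a chunker's result in a program is a list of (label, words) pairs. The Python sketch below writes out the chunks of the example sentence by hand; the grouping is our reading of the labels on the slide, not the output of an actual chunker run, and the function names are made up for the illustration.

```python
# Chunks of the example sentence, typed in by hand (assumed grouping).
chunks = [
    ("NP", ["He"]),
    ("VP", ["reckons"]),
    ("NP", ["the", "current", "account", "deficit"]),
    ("VP", ["will", "narrow"]),
    ("PP", ["to"]),
    ("NP", ["only", "#", "1.8", "billion"]),
    ("PP", ["in"]),
    ("NP", ["September"]),
]


def render(chunks):
    """Show the chunks in bracket notation, e.g. [NP He] [VP reckons]."""
    return " ".join("[%s %s]" % (label, " ".join(words))
                    for label, words in chunks)


def phrases(chunks, label):
    """Return the text of every chunk with the given label."""
    return [" ".join(words) for lab, words in chunks if lab == label]


print(render(chunks))
print(phrases(chunks, "NP"))
```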
Tree bank parser
It tags tokens and groups phrases into a tree.
(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) (NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) (NP (DT the) (NN meter) (VBG running)))))))
A hospital bed is a parked taxi with the meter running
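[Figure: Visualization of Treebank Parser output, showing the parse tree of "A hospital bed is a parked taxi with the meter running" with S at the top, the phrase nodes NP, VP and PP below it, and the POS tags DT, NN, VBZ, VBN, IN and VBG above the words]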
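The bracketed string printed by the parser can be read back into a nested structure with a short recursive reader. The Python sketch below assumes well-formed input in exactly the bracket format of the example; `parse_tree`, `leaves` and `show` are names made up for this illustration and are not part of OpenNLP.

```python
import re


def parse_tree(s):
    """Read '(LABEL child child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested phrase or tag
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(tree):
    """Collect the words of the sentence from left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def show(tree, depth=0):
    """Print the tree with one node per line, indented by depth."""
    label, children = tree
    words = [c for c in children if not isinstance(c, tuple)]
    print("  " * depth + label + (" " + " ".join(words) if words else ""))
    for child in children:
        if isinstance(child, tuple):
            show(child, depth + 1)


parse = ("(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) "
         "(NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) "
         "(NP (DT the) (NN meter) (VBG running)))))))")
tree = parse_tree(parse)
show(tree)
print(" ".join(leaves(tree)))
```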