Extracting structured information from unstructured and/or semi-structured machine-readable documents. ...

Information extraction from bioinformatics-related documents

Extracting structured information from unstructured and/or semi-structured machine-readable documents.

Processing human language texts by means of NLP methods.

Text/Images/audio/video

Introduction

Computation on previously unstructured data.

For example, from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

Logical reasoning to draw inferences

Text simplification

Goals

◦ Named entity extraction
◦ Co-reference resolution
◦ Relationship extraction
◦ Language and vocabulary analysis extraction
◦ Audio extraction

Subtasks

Approaches

Hand-written regular expressions

Classifiers
◦ Generative: naïve Bayes
◦ Discriminative: maximum entropy models

Sequence models
◦ Hidden Markov model
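The hand-written regular expression approach can be sketched on the Foo Inc. / Bar Corp. news sentence. This is a minimal sketch, not a general extractor: the pattern, the company suffix list (Inc, Corp, Ltd), the relation name "Acquisition" and the function name `extract_acquisition` are all assumptions made for illustration.

```python
import re

# Hand-written pattern for a company name: capitalised words followed by a
# legal suffix. The suffix list is a deliberately tiny, hypothetical assumption.
COMPANY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* (?:Inc|Corp|Ltd)\."

# "<buyer> announced their/its acquisition of <target>"
ACQUISITION = re.compile(
    rf"(?P<buyer>{COMPANY}) announced (?:their|its) acquisition of (?P<target>{COMPANY})"
)


def extract_acquisition(sentence):
    """Return a (relation, buyer, target) tuple, or None if the pattern does not match."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return ("Acquisition", match.group("buyer"), match.group("target"))


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
result = extract_acquisition(sentence)
print(result)
```

A pattern like this is precise but brittle: any rewording of the sentence defeats it, which is one reason the classifier and sequence-model approaches exist.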

Field of computer science (CS), artificial intelligence (AI) and computational linguistics (CL) concerned with the interactions between computers and natural (human) languages.

Major Focus

Human-computer interaction (HCI)
Natural language understanding (NLU)
Natural language generation (NLG)

Natural Language Processing (NLP) Introduction

Hand-written rules

Statistical inference algorithms to produce models that are robust to
◦ unfamiliar input (e.g. containing words or structures that have not been seen before)
◦ erroneous input (e.g. with misspelled words or words accidentally omitted)

Methods used are stochastic, probabilistic and statistical

Methods for disambiguation often involve the use of corpora and Markov models.

NLP Methods

Automatic summarization
Discourse analysis
Machine translation
Morphological segmentation
Named entity recognition (NER)
Natural language generation
Natural language understanding

Major tasks in NLP

Native Language Identification
Stemming
Text simplification
Text-to-speech
Text-proofing
Natural language search
Query expansion
Automated essay scoring
Truecasing

Applications of NLP

Biomedical text mining (BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain.

It is a rather recent research field on the edge of NLP, bioinformatics, medical informatics and computational linguistics.

NLP Techniques for Bioinformatics

There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed.

Motivation

General Framework of NLP

Morphological and Lexical Processing

Syntactic Analysis

Semantic Analysis

Context Processing and Interpretation

Example: John runs.

Morphological and lexical processing: John run+s. (John: P-N; run+s: V 3-pre or N plu)

Syntactic analysis: [S [NP [P-N John]] [VP [V run]]]

Semantic analysis: Pred: RUN, Agent: John

Context processing and interpretation: John is a student. He runs.


Domain Analysis (Appelt: 1999)

Tokenization

Part of Speech Tagging

Term recognition

Inflection/Derivation

Compounding
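The lexical steps listed above (tokenization, inflection analysis, part-of-speech assignment) can be sketched on the "John runs." example. This is a minimal sketch, assuming a tiny hand-made lexicon; `LEXICON`, `tokenize` and `analyze` are illustrative names, not part of any toolkit, and the only inflection handled is the suffix -s.

```python
import re

# Toy lexicon (hypothetical): base form -> possible parts of speech.
LEXICON = {"john": {"P-N"}, "run": {"V", "N"}, "club": {"V", "N"}, "student": {"N"}}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def analyze(token):
    """Return every (base, suffix, tag) reading licensed by the toy lexicon."""
    word = token.lower()
    # Uninflected readings: the word itself is a lexicon entry.
    readings = [(word, "", pos) for pos in sorted(LEXICON.get(word, ()))]
    # Inflected readings: base + "s" is either a verb form or a plural noun.
    if word.endswith("s") and word[:-1] in LEXICON:
        base = word[:-1]
        if "V" in LEXICON[base]:
            readings.append((base, "s", "V 3-pre"))  # third person present
        if "N" in LEXICON[base]:
            readings.append((base, "s", "N plu"))    # plural noun
    return readings


tokens = tokenize("John runs.")
analyses = {tok: analyze(tok) for tok in tokens}
print(tokens)
print(analyses)
```

Both readings of "runs" are returned because lexical processing alone cannot choose between them; the choice is left to syntactic analysis.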





Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Lexicons
◦ Open class words
◦ Terms (term recognition)
◦ Named entities: company names, locations, numerical expressions

Incomplete Grammar
◦ Syntactic coverage
◦ Domain-specific constructions
◦ Ungrammatical constructions

Incomplete Domain Knowledge
◦ Interpretation rules
◦ Predefined aspects of information

(2) Ambiguities: Combinatorial Explosion

Most words in English are ambiguous in terms of their parts of speech.
◦ runs: v/3pre, n/plu
◦ clubs: v/3pre, n/plu and two meanings

Structural Ambiguities

Predicate-argument Ambiguities
A number of methods exist for determining context, such as automatic topic detection and theme extraction.

Context identifies "what" is being discussed. Nouns and noun phrases are used to define context, together with named entity recognition and extraction.

Noun and verb extraction from textual documents

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Synsets are interlinked by means of conceptual-semantic and lexical relations.

The resulting network of meaningfully related words and concepts can be navigated with the browser.

WordNet is freely and publicly available for download. Its structure makes it a useful tool for CL and NLP work.

WordNet for synonym finding

WordNet resembles a thesaurus in that it groups words by their meanings, with two differences:

WordNet interlinks not just word forms (strings of letters) but specific senses of words.
◦ Words that are found in close proximity to one another in the network are semantically disambiguated.

WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.

CATEGORIZATION / CLASSIFICATION

Given:
◦ A description of an instance, x ∈ X, where X is the instance language or instance space (e.g. how to represent text documents).
◦ A fixed set of categories C = {c1, c2, …, cn}

Determine:
◦ The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

A GRAPHICAL VIEW OF TEXT CLASSIFICATION

[Figure: documents assigned to the categories NLP, Graphics, AI, Theory and Arch.]

TEXT CLASSIFICATION

This concerns you as a patient.

Our medical records indicate you have had a history of illness. We are now encouraging all our patients to use this highly effective and safe solution.

Proven worldwide, feel free to read the many reports on our site from the BBC & ABC News.

We highly recommend you try this Anti-Microbial Peptide as soon as possible since its world supply is limited. The results will show quickly.

Regards, http://www.superbiograde.us/bkhog/

85% of all email!!


EXAMPLES OF TEXT CATEGORIZATION

LABELS=BINARY
◦ “spam” / “not spam”

LABELS=TOPICS
◦ “finance” / “sports” / “asia”

LABELS=OPINION
◦ “like” / “hate” / “neutral”

LABELS=AUTHOR
◦ “Shakespeare” / “Marlowe” / “Ben Jonson”
◦ The Federalist papers

Methods (1)

Manual classification
◦ Used by Yahoo!, Looksmart, about.com, ODP, Medline
◦ Very accurate when the job is done by experts
◦ Consistent when the problem size and team are small
◦ Difficult and expensive to scale

Automatic document classification
◦ Hand-coded rule-based systems (Reuters, CIA, Verity, …)
◦ Commercial systems have complex query languages (everything in IR query languages + accumulators)

Methods (2)

Supervised learning of document-label assignment function: Autonomy, Kana, MSN, Verity, …
◦ Naive Bayes (simple, common method)
◦ k-Nearest Neighbors (simple, powerful)
◦ Support-vector machines (new, more powerful)
◦ … plus many other methods
◦ No free lunch: requires hand-classified training data
◦ But can be built (and refined) by amateurs

Bayesian Methods

Learning and classification methods based on probability theory (see spelling / POS)
Bayes theorem plays a critical role
Build a generative model that approximates how data is produced
Uses prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes’ Rule

$$P(C, X) = P(C \mid X)\,P(X) = P(X \mid C)\,P(C)$$

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

Maximum a posteriori Hypothesis

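$$h_{MAP} \equiv \underset{h \in H}{\operatorname{argmax}}\; P(h \mid D)$$

$$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\; \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\; P(D \mid h)\,P(h)$$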

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$$h_{ML} \equiv \underset{h \in H}{\operatorname{argmax}}\; P(D \mid h)$$
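The difference between the MAP and maximum likelihood hypotheses can be checked numerically. This is a minimal sketch; the three hypotheses and their prior and likelihood values are made up for illustration.

```python
# Hypothetical priors P(h) and likelihoods P(D | h) for three hypotheses.
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.1, "h2": 0.5, "h3": 0.6}

# P(D) normalises the posterior but is the same for every h,
# so it does not change which hypothesis attains the maximum.
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

# MAP: maximise P(D | h) * P(h).  ML: maximise P(D | h) alone.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_ml = max(prior, key=lambda h: likelihood[h])

print(posterior)
print("MAP:", h_map, "ML:", h_ml)
```

With these numbers the two criteria disagree: h3 explains the data best, but h2 wins once the priors are taken into account. With a uniform prior they would pick the same hypothesis.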

Naive Bayes Classifiers

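Task: Classify a new instance based on a tuple of attribute values $\langle x_1, x_2, \ldots, x_n \rangle$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j \mid x_1, x_2, \ldots, x_n)$$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}$$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$$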

Naïve Bayes Classifier: Assumptions

P(cj)
◦ Can be estimated from the frequency of classes in the training examples.

P(x1,x2,…,xn|cj)
◦ Needs a very, very large number of training examples to estimate

Conditional Independence Assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

[Figure: Bayesian network with class node Flu and feature nodes X1 to X5 for the symptoms fever, sinus, cough, runny nose and muscle-ache]

The Naïve Bayes Classifier

Conditional Independence Assumption: features are independent of each other given the class:

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$

Learning the Model

Common practice: maximum likelihood
◦ simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$$

$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$

[Figure: class node C with feature nodes X1 to X6]
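The frequency-based estimates and the conditional independence assumption can be put together in a few lines. This is a minimal sketch of an unsmoothed Naive Bayes classifier; the six training examples, the two features and the class names are invented for illustration, and `train` and `classify` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def train(examples):
    """Estimate P(c) and P(x_i | c) by relative frequency from (features, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (feature name, label) -> value counts
    for features, label in examples:
        for name, value in features.items():
            feature_counts[(name, label)][value] += 1
    n = len(examples)
    p_class = {c: count / n for c, count in class_counts.items()}  # N(C=c) / N

    def p_feature(name, value, label):
        # N(X_i = x_i, C = c_j) / N(C = c_j)
        return feature_counts[(name, label)][value] / class_counts[label]

    return p_class, p_feature


def classify(features, p_class, p_feature):
    """Return the class maximising P(c) * prod_i P(x_i | c), plus all scores."""
    scores = {}
    for label, p in p_class.items():
        score = p
        for name, value in features.items():
            score *= p_feature(name, value, label)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy training set.
data = [
    ({"fever": True, "cough": True}, "flu"),
    ({"fever": True, "cough": False}, "flu"),
    ({"fever": False, "cough": True}, "flu"),
    ({"fever": False, "cough": True}, "no-flu"),
    ({"fever": False, "cough": False}, "no-flu"),
    ({"fever": False, "cough": False}, "no-flu"),
]

p_class, p_feature = train(data)
label, scores = classify({"fever": True, "cough": True}, p_class, p_feature)
print(label, scores)
```

Because the estimates are raw frequencies, a feature value never seen with a class gets probability zero and wipes out the whole product for that class, which is the usual motivation for smoothed estimates.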

Feature selection via Mutual Information

We might not want to use all words, but just reliable, good discriminators
In training set, choose k words which best discriminate the categories.
One way is in terms of Mutual Information:
◦ For each word w and each category c

$$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\,p(e_c)}$$
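The mutual information score for one word and one category can be computed from four document counts. This is a minimal sketch: the probabilities are estimated by relative frequency, the logarithm is taken to base 2 (the base is an assumption, it only rescales the score), and the function and argument names are illustrative.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """I(w, c) from document counts.

    n11: contains w and is in c      n10: contains w, not in c
    n01: lacks w but is in c         n00: lacks w, not in c
    """
    n = n11 + n10 + n01 + n00
    total = 0.0
    cells = ((1, 1, n11), (1, 0, n10), (0, 1, n01), (0, 0, n00))
    for e_w, e_c, joint_count in cells:
        if joint_count == 0:
            continue  # the term 0 * log 0 is taken to be 0
        p_joint = joint_count / n
        p_w = (n11 + n10) / n if e_w else (n01 + n00) / n
        p_c = (n11 + n01) / n if e_c else (n10 + n00) / n
        total += p_joint * math.log2(p_joint / (p_w * p_c))
    return total


print(mutual_information(25, 25, 25, 25))  # word and category independent
print(mutual_information(50, 0, 0, 50))    # word perfectly predicts category
```

To select features, the score would be computed for every word and category and the k highest-scoring words kept.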

OTHER APPROACHES TO FEATURE SELECTION

T-TEST
CHI SQUARE
TF/IDF (cf. IR lectures)
Yang & Pedersen 1997: eliminating features leads to improved performance

tfidf(t, d) = tf(t, d) · idf(t)

Tf·idf term-document matrix.

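The term frequency is $tf(t, d) = N_{t,d} / \sum_{k} N_{k,d}$, where $N_{t,d}$ is the number of occurrences of a term t in a document d, and the denominator is the sum of occurrences of all terms in that document d.

The inverse document frequency is commonly defined as $idf(t) = \log(|D| / W(t))$, where |D| is the total number of documents and W(t) is the number of documents containing the term t.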
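The tf·idf weighting can be sketched directly from these definitions. This is a minimal sketch under stated assumptions: idf is taken as the natural log of (number of documents / W(t)), one common variant; a term that occurs in no document gets idf 0 by convention here; and the three-document corpus is invented for illustration.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of term in the document divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log(|D| / W(t)), where W(t) is the number of documents containing term."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical tokenised corpus.
corpus = [
    ["gene", "protein", "gene"],
    ["protein", "binding"],
    ["gene", "expression", "protein", "protein"],
]

# One row per term, one column per document: a small tf-idf term-document matrix.
vocabulary = sorted({token for doc in corpus for token in doc})
matrix = {t: [tfidf(t, doc, corpus) for doc in corpus] for t in vocabulary}
for term, row in matrix.items():
    print(term, [round(value, 3) for value in row])
```

A term such as "protein" that appears in every document gets weight zero everywhere, while a term concentrated in few documents is weighted up.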

Used to evaluate the independence between two events. The relevance of a term t in a class c can be estimated by the following formula

Chi-Square Statistics

F11: #documents belonging to c and containing t;
F10: #documents which are not in c but containing t;
F01: #documents belonging to c but not containing t;
F00: #documents which are not in c and not containing t.
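The four counts above form a 2×2 contingency table, for which the chi-square statistic has a standard closed form. The formula itself did not survive in this transcript, so the sketch below assumes that standard form, N·(F11·F00 − F10·F01)² divided by the product of the four row and column totals; the function name is illustrative.

```python
def chi_square(f11, f10, f01, f00):
    """Chi-square statistic of term t and class c from a 2x2 contingency table.

    f11: in c, contains t        f10: not in c, contains t
    f01: in c, lacks t           f00: not in c, lacks t
    """
    n = f11 + f10 + f01 + f00
    numerator = n * (f11 * f00 - f10 * f01) ** 2
    denominator = (
        (f11 + f01)    # documents in c
        * (f10 + f00)  # documents not in c
        * (f11 + f10)  # documents containing t
        * (f01 + f00)  # documents not containing t
    )
    # An empty row or column makes the statistic undefined; report 0 in that case.
    return numerator / denominator if denominator else 0.0


print(chi_square(25, 25, 25, 25))  # t and c independent
print(chi_square(50, 0, 0, 50))    # t occurs exactly in the documents of c
```

A value near zero means the term tells us nothing about the class; larger values mean stronger dependence, so terms can be ranked by this score.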

Classification task: to decide which class to choose. Measure importance of term t for a class c.

MAP Estimates

where Nt|c and Nt are the numbers of occurrences of term t in the class c and in the entire corpus, respectively. Nc is the number of distinct classes.

where Nd|c is the number of documents in the scene class c, and Nd is the entire number of documents. Note that α1 and α2 are the smoothing parameters that are typically determined empirically.

OTHER CLASSIFICATION METHODS

K-NN
DECISION TREES
LOGISTIC REGRESSION
SUPPORT VECTOR MACHINES
