Extracting structured information from unstructured and/or semi-structured m/c- readable documents. ...

Information extraction from bioinformatics related documents

Transcript of Extracting structured information from unstructured and/or semi-structured m/c- readable documents. ...

Page 1: Extracting structured information from unstructured and/or semi-structured m/c- readable documents.  Processing human language texts by means of NLP.

Information extraction from bioinformatics related documents

Page 2:

Introduction

Extracting structured information from unstructured and/or semi-structured machine-readable documents.

Processing human language texts by means of NLP methods.

Text/Images/audio/video

Page 3:

Goals

Computation on previously unstructured data.
◦ For example, extracting facts from an online news sentence such as:
  "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
Logical reasoning to draw inferences
Text simplification

Page 4:

Subtasks

◦ Named entity
◦ Co-reference resolution
◦ Relationship extraction
◦ Language and vocabulary analysis extraction
◦ Audio extraction

Page 5:

Approaches

Hand-written regular expressions
Classifiers
◦ Generative: naïve Bayes
◦ Discriminative: maximum entropy models
Sequence models
◦ Hidden Markov model
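As a rough illustration of the hand-written regular expression approach, the sketch below pulls an acquirer/target pair out of a news sentence of the form "Foo Inc. announced their acquisition of Bar Corp.". The pattern, the function name `extract_acquisition` and the assumption that company names are capitalized words ending in "Inc." or "Corp." are all illustrative choices, not part of the original slides.

```python
import re

# Assumption: a company name is one or more capitalized words followed by
# "Inc." or "Corp."; real text would need a far richer pattern.
COMPANY = r"(?:[A-Z][A-Za-z]*\s)+(?:Inc|Corp)\."
ACQUISITION = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of (?P<target>{COMPANY})"
)


def extract_acquisition(sentence):
    """Return (acquirer, target) if the hand-written pattern matches, else None."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return match.group("acquirer"), match.group("target")


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
print(extract_acquisition(sentence))  # ('Foo Inc.', 'Bar Corp.')
```

Such rules are precise on the phrasing they were written for but miss every paraphrase, which is one motivation for the classifier and sequence-model approaches listed above.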

Page 6:

Natural Language Processing (NLP) Introduction

Field of computer science (CS), artificial intelligence (AI) and computational linguistics (CL) concerned with the interactions between computers and natural languages.

Major Focus

Human-computer interaction (HCI)
Natural language understanding (NLU)
Natural language generation (NLG)

Page 7:

NLP Methods

Hand-written rules
Statistical inference algorithms to produce models that are
◦ robust to unfamiliar input (e.g. containing words or structures that have not been seen before)
◦ robust to erroneous input (e.g. with misspelled words or words accidentally omitted)
Methods used are stochastic, probabilistic and statistical
Methods for disambiguation often involve the use of corpora and Markov models.

Page 8:

Major tasks in NLP

Automatic summarization
Discourse analysis
Machine translation
Morphological segmentation
Named entity recognition (NER)
Natural language generation
Natural language understanding

Page 9:

Applications of NLP

Native Language Identification
Stemming
Text simplification
Text-to-speech
Text-proofing
Natural language search
Query expansion
Automated essay scoring
Truecasing

Page 10:

NLP Techniques for Bioinformatics

Biomedical text mining (BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain.

It is a rather recent research field on the edge of NLP, bioinformatics, medical informatics and computational linguistics.

Page 11:

Motivation

There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed.

Page 12:
Page 13:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

John runs.

Page 14:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

John runs.

John run+s. (John: P-N; run+s: V 3-pre or N plu)

Page 15:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

John runs.

John run+s. (John: P-N; run+s: V 3-pre or N plu)

[S [NP [P-N John]] [VP [V run]]]

Page 16:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

John runs.

John run+s. (John: P-N; run+s: V 3-pre or N plu)

[S [NP [P-N John]] [VP [V run]]]

Pred: RUN, Agent: John

Page 17:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

John runs.

John run+s. (John: P-N; run+s: V 3-pre or N plu)

[S [NP [P-N John]] [VP [V run]]]

Pred: RUN, Agent: John

John is a student. He runs.

Page 18:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Domain Analysis (Appelt:1999)

Tokenization
Part of Speech Tagging
Term recognition
Inflection/Derivation
Compounding

Page 19:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Page 20:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Lexicons
◦ Open class words
◦ Terms
◦ Term recognition
◦ Named Entities: company names, locations, numerical expressions

Page 21:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Grammar
◦ Syntactic Coverage
◦ Domain Specific Constructions
◦ Ungrammatical Constructions

Page 22:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge
◦ Interpretation Rules

Predefined Aspects of Information

Page 23:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion

Page 24:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion

Most words in English are ambiguous in terms of their parts of speech.
◦ runs: v/3pre, n/plu
◦ clubs: v/3pre, n/plu, and two meanings
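To make the combinatorial explosion concrete, the following sketch enumerates every part-of-speech reading of a short word sequence. The tag sets for "runs" and "clubs" follow the slide, and "John" is tagged P-N as in the earlier example; the dictionary `LEXICON` and the function `tag_sequences` are hypothetical names introduced only for illustration.

```python
from itertools import product

# Toy lexicon: each word maps to its possible part-of-speech tags.
# A real lexicon would be far larger and, as noted above, incomplete.
LEXICON = {
    "john": ["P-N"],
    "runs": ["V-3pre", "N-plu"],
    "clubs": ["V-3pre", "N-plu"],
}


def tag_sequences(words):
    """Enumerate every combination of tags for the given words."""
    options = [LEXICON[word.lower()] for word in words]
    return list(product(*options))


readings = tag_sequences(["John", "runs", "clubs"])
print(len(readings))  # 1 * 2 * 2 = 4 candidate tag sequences
```

The number of candidate readings is the product of the per-word ambiguities, so it grows exponentially with sentence length unless later stages prune it.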

Page 25:

General Framework of NLP

Morphological and Lexical Processing
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion

Structural Ambiguities

Predicate-argument Ambiguities

Page 26:

Nouns, Verbs extraction from textual documents

A number of methods exist for determining context, such as automatic topic detection/theme extraction.
"What" is being discussed.
Nouns and noun phrases to define context.
Named entity recognition and extraction.

Page 27:

WordNet for synonym finding

WordNet is a large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and lexical relations.
The resulting network of meaningfully related words and concepts can be navigated with the browser.
WordNet is freely and publicly available for download.
WordNet's structure makes it a useful tool for CL and NLP work.

Page 28:

WordNet similarity to thesaurus (words and meanings)

WordNet interlinks not just word forms (strings of letters) but specific senses of words.
◦ Words that are found in close proximity to one another in the network are semantically disambiguated.
WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.

Page 29:
Page 30:

CATEGORIZATION / CLASSIFICATION

Given:
◦ A description of an instance, $x \in X$, where $X$ is the instance language or instance space (e.g. how to represent text documents).
◦ A fixed set of categories $C = \{c_1, c_2, \ldots, c_n\}$

Determine:
◦ The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.

Page 31:

A GRAPHICAL VIEW OF TEXT CLASSIFICATION

[Figure: documents assigned to the categories NLP, Graphics, AI, Theory and Arch.]

Page 32:

TEXT CLASSIFICATION

This concerns you as a patient.

Our medical records indicate you have had a history of illness. We are now encouraging all our patients to use this highly effective and safe solution.

Proven worldwide, feel free to read the many reports on our site from the BBC & ABC News.

We highly recommend you try this Anti-Microbial Peptide as soon as possible since its world supply is limited. The results will show quickly.

Regards, http://www.superbiograde.us/bkhog/

85% of all email!!

Page 33:

EXAMPLES OF TEXT CATEGORIZATION

LABELS=BINARY
◦ “spam” / “not spam”

LABELS=TOPICS
◦ “finance” / “sports” / “asia”

LABELS=OPINION
◦ “like” / “hate” / “neutral”

LABELS=AUTHOR
◦ “Shakespeare” / “Marlowe” / “Ben Jonson”
◦ The Federalist papers

Page 34:

Methods (1)

Manual classification
◦ Used by Yahoo!, Looksmart, about.com, ODP, Medline
◦ Very accurate when job is done by experts
◦ Consistent when the problem size and team is small
◦ Difficult and expensive to scale

Automatic document classification
◦ Hand-coded rule-based systems
  Reuters, CIA, Verity, …
  Commercial systems have complex query languages (everything in IR query languages + accumulators)

Page 35:

Methods (2)

Supervised learning of document-label assignment function: Autonomy, Kana, MSN, Verity, …
◦ Naive Bayes (simple, common method)
◦ k-Nearest Neighbors (simple, powerful)
◦ Support-vector machines (new, more powerful)
◦ … plus many other methods
◦ No free lunch: requires hand-classified training data
◦ But can be built (and refined) by amateurs

Page 36:

Bayesian Methods

Learning and classification methods based on probability theory (see spelling / POS)
Bayes theorem plays a critical role
Build a generative model that approximates how data is produced
Uses prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Page 37:

Bayes’ Rule

$$P(C, X) = P(C \mid X)\,P(X) = P(X \mid C)\,P(C)$$

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

Page 38:

Maximum a posteriori Hypothesis

$$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\; P(h \mid D)$$

$$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\; \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$h_{MAP} = \underset{h \in H}{\operatorname{argmax}}\; P(D \mid h)\,P(h)$$

Page 39:

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$$h_{ML} = \underset{h \in H}{\operatorname{argmax}}\; P(D \mid h)$$
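A small numeric sketch may help contrast Bayes' rule, the MAP hypothesis and the maximum likelihood hypothesis. The priors and likelihoods below are made-up numbers, and the function names `posterior`, `h_map` and `h_ml` are hypothetical helpers, not taken from the slides.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(h|D) = P(D|h) P(h) / P(D), with P(D) = sum over h of P(D|h) P(h)."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


def h_map(priors, likelihoods):
    """MAP hypothesis: maximizes P(D|h) P(h); P(D) is constant and can be dropped."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


def h_ml(likelihoods):
    """Maximum likelihood hypothesis: maximizes P(D|h) only."""
    return max(likelihoods, key=likelihoods.get)


# Illustrative (made-up) numbers for one observed document D.
priors = {"spam": 0.2, "not spam": 0.8}        # P(h)
likelihoods = {"spam": 0.6, "not spam": 0.3}   # P(D | h)

print(posterior(priors, likelihoods))  # "not spam" gets about 0.667
print(h_map(priors, likelihoods))      # not spam
print(h_ml(likelihoods))               # spam
```

With these numbers the two criteria disagree: the likelihood favours "spam", but the prior outweighs it, so the MAP choice is "not spam". With equal priors the two would coincide.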

Page 40:

Naive Bayes Classifiers

Task: Classify a new instance based on a tuple of attribute values $\langle x_1, x_2, \ldots, x_n \rangle$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j \mid x_1, x_2, \ldots, x_n)$$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}$$

$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$$

Page 41:

Naïve Bayes Classifier: Assumptions

$P(c_j)$
◦ Can be estimated from the frequency of classes in the training examples.

$P(x_1, x_2, \ldots, x_n \mid c_j)$
◦ Need very, very large number of training examples

Conditional Independence Assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

Page 42:

The Naïve Bayes Classifier

[Figure: class node Flu with feature nodes X1, …, X5 (fever, sinus, cough, runny nose, muscle-ache)]

Conditional Independence Assumption: features are independent of each other given the class:

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$

Page 43:

Learning the Model

[Figure: class node C with feature nodes X1, …, X6]

Common practice: maximum likelihood
◦ simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$$

$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$
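The frequency-based estimates and the conditional independence assumption can be combined into a very small classifier. This is a minimal sketch under stated assumptions: the training data are made up, features are discrete values in a tuple, no smoothing is applied, and the names `train` and `classify` are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Maximum likelihood estimates from (feature_tuple, label) pairs.

    Returns the class priors P(c) and a function cond(i, value, c)
    giving the relative frequency estimate of P(X_i = value | C = c).
    """
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    n = len(examples)
    prior = {c: count / n for c, count in class_counts.items()}

    def cond(i, value, c):
        return feature_counts[(i, c)][value] / class_counts[c]

    return prior, cond


def classify(features, prior, cond):
    """Pick the class maximizing P(c) * product over i of P(x_i | c)."""
    def score(c):
        p = prior[c]
        for i, value in enumerate(features):
            p *= cond(i, value, c)
        return p
    return max(prior, key=score)


# Made-up training data: (fever, cough) -> class label.
examples = [
    ((1, 1), "flu"), ((1, 0), "flu"), ((1, 1), "flu"),
    ((0, 0), "no flu"), ((0, 1), "no flu"), ((0, 0), "no flu"),
]
prior, cond = train(examples)
print(classify((1, 1), prior, cond))  # flu
```

Because the estimates are raw frequencies, any feature value never seen with a class drives that class's score to zero; smoothed estimates are the usual remedy.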

Page 44:

Feature selection via Mutual Information

We might not want to use all words, but just reliable, good discriminators
In training set, choose k words which best discriminate the categories.
One way is in terms of Mutual Information:
◦ For each word w and each category c

$$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\,p(e_c)}$$
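The mutual information between a word w and a category c can be computed directly from four document counts. The sketch below assumes probabilities are estimated as relative document frequencies and uses a base-2 logarithm; neither choice is stated on the slide, and the function name `mutual_information` is hypothetical.

```python
from math import log2


def mutual_information(n11, n10, n01, n00):
    """I(w, c) from document counts.

    n11: documents containing w and in c
    n10: documents containing w but not in c
    n01: documents not containing w but in c
    n00: documents neither containing w nor in c
    """
    n = n11 + n10 + n01 + n00
    joint = {(1, 1): n11 / n, (1, 0): n10 / n, (0, 1): n01 / n, (0, 0): n00 / n}
    p_w = {1: (n11 + n10) / n, 0: (n01 + n00) / n}   # p(e_w)
    p_c = {1: (n11 + n01) / n, 0: (n10 + n00) / n}   # p(e_c)
    total = 0.0
    for (e_w, e_c), p in joint.items():
        if p > 0:  # a zero joint probability contributes nothing to the sum
            total += p * log2(p / (p_w[e_w] * p_c[e_c]))
    return total


print(mutual_information(25, 25, 25, 25))  # 0.0: word and category are independent
print(mutual_information(50, 0, 0, 50))    # 1.0: word perfectly predicts the category
```

Ranking words by this score for each category and keeping the top k is one way to realize the selection step described above.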

Page 45:

OTHER APPROACHES TO FEATURE SELECTION

T-TEST
CHI SQUARE
TF/IDF (CFR. IR lectures)
Yang & Pedersen 1997: eliminating features leads to improved performance

Page 46:

Tf·idf term-document matrix.

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

$$\mathrm{tf}(t, d) = \frac{N_{t,d}}{\sum_{k} N_{k,d}}$$

where $N_{t,d}$ is the number of occurrences of a term t in a document d, and the denominator is the sum of occurrences of all terms in that document d.

idf(t) is computed from $W(t)$, the number of documents containing the term t.
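A short sketch of the tf·idf weighting follows. The term frequency matches the description above (occurrences of t in d divided by the total number of term occurrences in d). The exact idf formula is not preserved in the transcript, so the code assumes the common form log(number of documents / W(t)); the toy documents and function names are also illustrative assumptions.

```python
from collections import Counter
from math import log


def tf(term, doc):
    """Occurrences of term in doc divided by the total number of tokens in doc."""
    counts = Counter(doc)
    return counts[term] / sum(counts.values())


def idf(term, docs):
    """Assumed form: log(|D| / W(t)), where W(t) is the number of documents containing term.

    Raises ZeroDivisionError if the term occurs in no document.
    """
    w_t = sum(1 for doc in docs if term in doc)
    return log(len(docs) / w_t)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Toy tokenized documents (made up for illustration).
docs = [
    ["gene", "protein", "gene"],
    ["protein", "binding"],
    ["gene", "expression"],
]
print(tf("gene", docs[0]))              # 2 of 3 tokens
print(tfidf("binding", docs[1], docs))  # 0.5 * log(3)
```

A term that appears in every document gets idf 0 under this form, which is the intended effect: it carries no information for telling documents apart.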

Page 47:

Chi-Square Statistics

Used to evaluate the independence between two events. The relevance of a term t in a class c can be estimated by the following formula:

$$\chi^2(t, c) = \frac{N\,(F_{11} F_{00} - F_{10} F_{01})^2}{(F_{11} + F_{01})(F_{10} + F_{00})(F_{11} + F_{10})(F_{01} + F_{00})}$$

where $N = F_{11} + F_{10} + F_{01} + F_{00}$ is the total number of documents and
◦ F11: #documents belonging to c and containing t;
◦ F10: #documents which are not in c but containing t;
◦ F01: #documents belonging to c but not containing t;
◦ F00: #documents which are not in c and not containing t.
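The four counts F11, F10, F01 and F00 form a 2×2 contingency table, so the statistic can be computed in a few lines. The sketch assumes the standard chi-square formula for a 2×2 table, N(F11·F00 − F10·F01)² divided by the product of the four marginal totals; the function name `chi_square` is hypothetical.

```python
def chi_square(f11, f10, f01, f00):
    """Chi-square statistic for term t and class c from a 2x2 table of document counts.

    f11: in c and containing t        f10: not in c but containing t
    f01: in c but not containing t    f00: not in c and not containing t
    Raises ZeroDivisionError if any marginal total is zero.
    """
    n = f11 + f10 + f01 + f00
    numerator = n * (f11 * f00 - f10 * f01) ** 2
    denominator = (f11 + f01) * (f10 + f00) * (f11 + f10) * (f01 + f00)
    return numerator / denominator


print(chi_square(25, 25, 25, 25))  # 0.0: term and class are independent
print(chi_square(50, 0, 0, 50))    # 100.0: term occurs only in class c
```

Larger values indicate stronger dependence between the term and the class, so terms can be ranked by this score for feature selection.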

Page 48:

MAP Estimates

Classification task: to decide which class to choose. Measure importance of term t for a class c.

where $N_{t|c}$ and $N_t$ are the numbers of occurrences of term t in the class c and in the entire corpus, respectively. $N_c$ is the number of distinct classes.

where $N_{d|c}$ is the number of documents in the scene class c, and $N_d$ is the total number of documents. Note that α1 and α2 are the smoothing parameters that are typically determined empirically.

Page 49:

OTHER CLASSIFICATION METHODS

K-NN
DECISION TREES
LOGISTIC REGRESSION
SUPPORT VECTOR MACHINES
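Of the methods listed, k-NN is the simplest to sketch: a new document takes the majority label of its k most similar training documents. The example below is an illustration under assumptions not found in the slides: bag-of-words vectors, cosine similarity, made-up training texts, and the hypothetical function names `cosine` and `knn_classify`.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Label text by majority vote among its k most similar training documents."""
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Made-up labelled documents.
training = [
    ("protein binding site", "biology"),
    ("gene expression protein", "biology"),
    ("stock market price", "finance"),
    ("market interest rate", "finance"),
]
print(knn_classify("protein gene interaction", training, k=3))  # biology
```

Unlike Naive Bayes, k-NN has no training step beyond storing the labelled documents; all the work happens at classification time.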