Lars Juhl Jensen Biomedical text mining. exponential growth.

Lars Juhl Jensen

Biomedical text mining

exponential growth

~45 seconds per paper

information retrieval

named entity recognition

augmented browsing

text corpora

information extraction

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup
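
The indexing idea can be sketched as an inverted index: each word maps to the set of documents that contain it, so a Boolean AND query becomes a set intersection instead of a scan over every abstract. This is a minimal illustration, not how PubMed is implemented; the function names and the three toy documents are assumptions made for the example, and "cell cycle" is treated as two separate terms rather than as a phrase.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "cdc28." and "cdc28" match.
            index[token.strip(".,;:()")].add(doc_id)
    return index


def search_and(index, terms):
    """Return the ids of documents that contain every term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Toy documents, invented for this example.
docs = {
    1: "Yeast cell cycle regulation by Cdc28.",
    2: "Cell cycle checkpoints in human cells.",
    3: "Yeast metabolism under stress.",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
print(hits)
```

Building the index costs one pass over the corpus; after that, each query only touches the posting sets of its own terms.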

stemming

word endings
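
Stemming by stripping word endings might look like the sketch below. The suffix list is a small hand-picked assumption chosen for the example; real search engines use more careful rule sets such as the Porter stemmer.

```python
# Hand-picked suffixes, longest first, so "ation" is tried before "s".
SUFFIXES = ("ations", "ation", "ating", "ated", "ates", "ate", "ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("phosphorylated", "phosphorylation", "cyclins", "cyclin"):
    print(w, "->", naive_stem(w))
```

With this rule set, "phosphorylated" and "phosphorylation" reduce to the same stem, so a query for one also finds documents that use the other.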

dynamic query expansion

MeSH terms
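
Query expansion can be illustrated as rewriting each user term into an OR-group of the term and its synonyms. The synonym table below is written by hand for illustration and is not real MeSH data, and `expand_query` is a hypothetical helper, not part of any PubMed API.

```python
# Toy synonym table, written by hand for illustration; it is NOT real MeSH data.
SYNONYMS = {
    "yeast": ["saccharomyces cerevisiae", "budding yeast"],
    "cell cycle": ["cell division cycle"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Turn each term into an OR-group of its variants, then join groups with AND."""
    groups = []
    for term in terms:
        variants = [term] + synonyms.get(term.lower(), [])
        # Quote multi-word variants so they read as phrases.
        quoted = [f'"{v}"' if " " in v else v for v in variants]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = expand_query(["yeast", "cell cycle"])
print(query)
```

The user still types the short query; the expansion happens behind the scenes at search time.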

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

named entity recognition

computer

as smart as a dog

teach it specific tricks

identify the concepts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

comprehensive lexicon

proteins

chemicals

compartments

tissues

diseases

organisms

CDC2

cyclin dependent kinase 1

orthographic variation

upper- and lower-case

CDC2

Cdc2

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

prefixes and postfixes

CDC2

hCDC2

“black list”

SDS
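
The lexicon, the orthographic variation rules, and the black list can be combined in a small lookup sketch. Everything here is an assumption made for illustration: the three-entry lexicon, the made-up identifiers, and the rule that a leading "h" (as in hCDC2) may be dropped. The function only looks up a single candidate name; it does not scan running text.

```python
import re

# Hypothetical mini-lexicon: normalized name -> identifier (identifiers are made up).
LEXICON = {
    "cdc2": "GENE:CDC2",
    "cyclindependentkinase1": "GENE:CDC2",
    "sds": "GENE:SDS",
}

# Names in the lexicon that are suppressed, following the "black list" slide.
BLACKLIST = {"sds"}


def normalize(name):
    """Lower-case the name and drop spaces and hyphens."""
    return re.sub(r"[\s-]+", "", name.lower())


def lookup(name):
    """Return the identifier for a name, or None if unknown or blacklisted."""
    key = normalize(name)
    if key in BLACKLIST:
        return None
    if key in LEXICON:
        return LEXICON[key]
    # Assumed prefix rule: retry without a leading "h", as in hCDC2.
    stripped = key[1:]
    if key.startswith("h") and stripped in LEXICON and stripped not in BLACKLIST:
        return LEXICON[stripped]
    return None


for candidate in ("CDC2", "Cdc2", "hCDC2", "cyclin-dependent kinase 1", "SDS"):
    print(candidate, "->", lookup(candidate))
```

Normalizing both the lexicon keys and the candidate names keeps the lexicon small: one entry covers every case, spacing, and hyphenation variant.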

scalable implementation

text corpora

>10 km

<10 hours

most use Medline

~22 million abstracts

few use full-text articles

no access

PDF files

layout-aware extraction

millions of full-text articles

information extraction

formalize the facts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

two approaches

co-mentioning

counting

within documents

within paragraphs

within sentences

co-mentioning score
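
Counting co-mentions within documents, paragraphs, and sentences and combining the counts into one score could be sketched as follows. The weights are assumptions picked for the example (tighter co-occurrence counts for more), not the values used by any published scoring scheme, and name matching is a crude case-insensitive substring test.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed weights: co-occurrence in a smaller unit of text counts for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def mentions(text, names):
    """Return the names that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return {name for name in names if name.lower() in lowered}


def comention_scores(documents, names):
    """Sum weighted co-occurrence counts for every pair of names."""
    scores = Counter()
    for doc in documents:
        paragraphs = doc.split("\n\n")
        sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p)]
        levels = (("document", [doc]), ("paragraph", paragraphs), ("sentence", sentences))
        for level, units in levels:
            for unit in units:
                # Sort so each pair is always stored in the same order.
                for pair in combinations(sorted(mentions(unit, names)), 2):
                    scores[pair] += WEIGHTS[level]
    return scores


# One toy document with two paragraphs, invented for this example.
documents = ["Cdc28 phosphorylates Swe1. Cdc5 is also required.\n\nClb2 binds Cdc28."]
names = ["Cdc28", "Swe1", "Cdc5", "Clb2"]
scores = comention_scores(documents, names)
for pair, score in scores.most_common():
    print(pair, score)
```

Pairs that share a sentence score highest, pairs that only share a paragraph score lower, and pairs that only share the document score lowest.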

NLP: Natural Language Processing

grammatical analysis

part-of-speech tagging

multiword detection

semantic tagging

sentence parsing

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

extract stated facts
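
As a rough stand-in for the grammar-based analysis, a single hand-written pattern can pull a stated regulation out of the example sentence. This is a regular-expression sketch, not the parser behind the bracketed annotation; the gene set is assumed to come from an earlier entity-recognition step, and `extract_regulation` is a hypothetical name.

```python
import re

# Assumed output of a prior named entity recognition step.
GENES = {"CYC1", "CYC7", "HAP1"}

# One hand-written rule: "expression of <targets> is controlled by <regulator>".
PATTERN = re.compile(r"expression of (?P<targets>.+?) is controlled by (?P<regulator>\w+)")


def extract_regulation(sentence):
    """Return (regulator, target) pairs stated by the sentence, or an empty list."""
    match = PATTERN.search(sentence)
    if not match or match.group("regulator") not in GENES:
        return []
    # Keep only the words in the target phrase that are known gene names.
    targets = [w for w in re.findall(r"\w+", match.group("targets")) if w in GENES]
    return [(match.group("regulator"), target) for target in targets]


sentence = "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
pairs = extract_regulation(sentence)
print(pairs)
print(extract_regulation("HAP1 controls CYC1"))
```

The second call returns nothing even though the sentence states a similar fact, because no rule covers that phrasing.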

high precision

poor recall
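
Precision is the fraction of extracted facts that are correct; recall is the fraction of true facts that were extracted. The sketch below computes both for a made-up gold standard in which every extracted pair is right but half of the true pairs are missed; the pairs are toy data, not evaluation results.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for this example.
gold = {("HAP1", "CYC1"), ("HAP1", "CYC7"), ("CDC28", "SWE1"), ("CDC5", "SWE1")}
predicted = {("HAP1", "CYC1"), ("HAP1", "CYC7")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```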

Exercise

Go to http://diseases.jensenlab.org

Find TYMS disease associations

Inspect the text-mining evidence

Look for examples of synonym usage

Find genes linked to colorectal cancer

thank you!