Large-scale biomedical data and text integration

Protein association networks Lars Juhl Jensen


Transcript of Large-scale biomedical data and text integration

Page 1: Large-scale biomedical data and text integration

Protein association networks

Lars Juhl Jensen

Page 2: Large-scale biomedical data and text integration

association networks

Page 3: Large-scale biomedical data and text integration

guilt by association

Page 4: Large-scale biomedical data and text integration
Page 5: Large-scale biomedical data and text integration

biological systems

Page 6: Large-scale biomedical data and text integration

molecular networks

Page 7: Large-scale biomedical data and text integration

STRING

Page 8: Large-scale biomedical data and text integration

2000+ genomes

Page 9: Large-scale biomedical data and text integration

computational predictions

Page 10: Large-scale biomedical data and text integration

gene fusion

Page 11: Large-scale biomedical data and text integration

Korbel et al., Nature Biotechnology, 2004

Page 12: Large-scale biomedical data and text integration

gene neighborhood

Page 13: Large-scale biomedical data and text integration

Korbel et al., Nature Biotechnology, 2004

Page 14: Large-scale biomedical data and text integration

phylogenetic profiles

Page 15: Large-scale biomedical data and text integration

Korbel et al., Nature Biotechnology, 2004
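
A minimal sketch of the phylogenetic profile idea: genes that are present and absent in the same genomes are candidates for a functional association. The Jaccard similarity and the toy data below are illustrative assumptions, not the measure used in the cited work.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles (sets of genome IDs)."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy data: gene -> genomes in which an ortholog is present.
profiles = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g3", "g4"},
}

same = jaccard(profiles["geneA"], profiles["geneB"])       # identical profiles
different = jaccard(profiles["geneA"], profiles["geneC"])  # no shared genomes
```

Identical profiles give the highest similarity and disjoint profiles the lowest; real profiles span thousands of genomes and need a score that accounts for how the genomes are related.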

Page 16: Large-scale biomedical data and text integration

a real example

Page 17: Large-scale biomedical data and text integration
Page 18: Large-scale biomedical data and text integration
Page 19: Large-scale biomedical data and text integration
Page 20: Large-scale biomedical data and text integration

Cell

Cellulosomes

Cellulose

Page 21: Large-scale biomedical data and text integration

experimental data

Page 22: Large-scale biomedical data and text integration

gene coexpression

Page 23: Large-scale biomedical data and text integration
Page 24: Large-scale biomedical data and text integration

protein interactions

Page 25: Large-scale biomedical data and text integration

Jensen & Bork, Science, 2008

Page 26: Large-scale biomedical data and text integration

curated knowledge

Page 27: Large-scale biomedical data and text integration

complexes

Page 28: Large-scale biomedical data and text integration

pathways

Page 29: Large-scale biomedical data and text integration

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 30: Large-scale biomedical data and text integration

many databases

Page 31: Large-scale biomedical data and text integration

different formats

Page 32: Large-scale biomedical data and text integration

different identifiers

Page 33: Large-scale biomedical data and text integration

variable quality

Page 34: Large-scale biomedical data and text integration

not comparable

Page 35: Large-scale biomedical data and text integration

not same species

Page 36: Large-scale biomedical data and text integration

hard work

Page 37: Large-scale biomedical data and text integration

(Ph.D. students)

Page 38: Large-scale biomedical data and text integration

parsers

Page 39: Large-scale biomedical data and text integration

mapping files

Page 40: Large-scale biomedical data and text integration

common identifiers

Page 41: Large-scale biomedical data and text integration

clever ideas

Page 42: Large-scale biomedical data and text integration

quality assessment

Page 43: Large-scale biomedical data and text integration

scoring schemes

Page 44: Large-scale biomedical data and text integration

affinity purification

Page 45: Large-scale biomedical data and text integration

von Mering et al., Nucleic Acids Research, 2005

Page 46: Large-scale biomedical data and text integration

microarray experiments

Page 47: Large-scale biomedical data and text integration

Oliva et al., PLOS Biology, 2005

Page 48: Large-scale biomedical data and text integration

phylogenetic profiles

Page 49: Large-scale biomedical data and text integration
Page 50: Large-scale biomedical data and text integration
Page 51: Large-scale biomedical data and text integration

score calibration

Page 52: Large-scale biomedical data and text integration

gold standard

Page 53: Large-scale biomedical data and text integration

von Mering et al., Nucleic Acids Research, 2005

Page 54: Large-scale biomedical data and text integration

implicit weighting by quality

Page 55: Large-scale biomedical data and text integration

common scale
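
One way to picture score calibration: bin the raw scores from an evidence type and ask what fraction of the pairs in each bin is supported by the gold standard. The binning approach, the function name `calibrate` and the toy data are assumptions for illustration; the cited paper may fit a curve instead.

```python
from collections import defaultdict

def calibrate(scored_pairs, gold_positive, bin_width=1.0):
    """Map each raw-score bin to the fraction of its pairs found in the gold standard."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair, raw in scored_pairs:
        b = int(raw // bin_width)   # index of the score bin
        totals[b] += 1
        if pair in gold_positive:
            hits[b] += 1
    return {b: hits[b] / totals[b] for b in totals}

# Hypothetical raw scores and gold standard.
scored = [(("a", "b"), 0.2), (("a", "c"), 0.7), (("b", "c"), 1.5), (("c", "d"), 1.9)]
gold = {("b", "c")}
calibrated = calibrate(scored, gold)
```

Because every evidence type is mapped to the same kind of number (agreement with the gold standard), the calibrated scores become comparable across sources, and low-quality sources are implicitly down-weighted.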

Page 56: Large-scale biomedical data and text integration

interologs

Page 57: Large-scale biomedical data and text integration

homology-based transfer

Page 58: Large-scale biomedical data and text integration

orthologous groups

Page 59: Large-scale biomedical data and text integration

Franceschini et al., Nucleic Acids Research, 2013
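
A rough sketch of homology-based transfer, assuming each protein is assigned to exactly one orthologous group. The function name, the group labels and the protein IDs are hypothetical, and the sketch ignores the score penalties a real transfer scheme would apply.

```python
def transfer_by_orthology(source_links, group_of, target_proteins):
    """Transfer protein-protein links to a target species via shared orthologous groups."""
    # Index target-species proteins by their orthologous group.
    by_group = {}
    for p in target_proteins:
        by_group.setdefault(group_of[p], []).append(p)
    transferred = set()
    for a, b in source_links:
        for ta in by_group.get(group_of.get(a), []):
            for tb in by_group.get(group_of.get(b), []):
                if ta != tb:
                    transferred.add(tuple(sorted((ta, tb))))
    return transferred

# Hypothetical example: one yeast link transferred to two human proteins.
group_of = {"y1": "OG1", "y2": "OG2", "h1": "OG1", "h2": "OG2"}
links = transfer_by_orthology([("y1", "y2")], group_of, ["h1", "h2"])
```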

Page 60: Large-scale biomedical data and text integration

missing most of the data

Page 61: Large-scale biomedical data and text integration

Lars Juhl Jensen

Biomedical text mining

Page 62: Large-scale biomedical data and text integration

>10 km

Page 63: Large-scale biomedical data and text integration

too much to read

Page 64: Large-scale biomedical data and text integration

exponential growth

Page 65: Large-scale biomedical data and text integration

~40 seconds per paper

Page 66: Large-scale biomedical data and text integration

computer

Page 67: Large-scale biomedical data and text integration

as smart as a dog

Page 68: Large-scale biomedical data and text integration

teach it specific tricks

Page 69: Large-scale biomedical data and text integration
Page 70: Large-scale biomedical data and text integration
Page 71: Large-scale biomedical data and text integration

named entity recognition

Page 72: Large-scale biomedical data and text integration

comprehensive lexicon

Page 73: Large-scale biomedical data and text integration

CDC2

Page 74: Large-scale biomedical data and text integration

cyclin dependent kinase 1

Page 75: Large-scale biomedical data and text integration

orthographic variation

Page 76: Large-scale biomedical data and text integration

expansion rules

Page 77: Large-scale biomedical data and text integration

prefixes and suffixes

Page 78: Large-scale biomedical data and text integration

CDC2

Page 79: Large-scale biomedical data and text integration

hCdc2

Page 80: Large-scale biomedical data and text integration

flexible matching

Page 81: Large-scale biomedical data and text integration

spaces and hyphens

Page 82: Large-scale biomedical data and text integration

cyclin dependent kinase 1

Page 83: Large-scale biomedical data and text integration

cyclin-dependent kinase 1
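
Flexible matching of spaces and hyphens can be sketched with a regular expression built from the dictionary name. This is a minimal illustration, assuming case-insensitive matching and simple alphanumeric word boundaries; `flexible_pattern` is a hypothetical helper, not part of any tagger mentioned in the slides.

```python
import re

def flexible_pattern(name):
    """Build a regex that treats spaces and hyphens in a name as interchangeable and optional."""
    tokens = re.split(r"[\s-]+", name.strip())
    body = r"[\s-]?".join(re.escape(t) for t in tokens)
    # Lookarounds stop the name from matching inside a longer word or number.
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE)

pattern = flexible_pattern("cyclin dependent kinase 1")
hit = pattern.search("Activity of cyclin-dependent kinase 1 was measured")
miss = pattern.search("cyclin dependent kinase 12 was not affected")
```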

Page 84: Large-scale biomedical data and text integration

“black list”

Page 85: Large-scale biomedical data and text integration

SDS
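
A small sketch of dictionary-based tagging with a "black list": names that match the lexicon are tagged unless they are known to cause too many false positives. The lexicon entries, identifiers and the helper `tag_names` are hypothetical.

```python
def tag_names(tokens, lexicon, blacklist):
    """Return (position, identifier) for tokens found in the lexicon but not on the black list."""
    tagged = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key in lexicon and key not in blacklist:
            tagged.append((i, lexicon[key]))
    return tagged

# Hypothetical lexicon: lower-cased name -> common identifier.
lexicon = {"cdc2": "CDK1", "sds": "SDS_gene"}
blacklist = {"sds"}  # assumed here to be mostly a reagent name in running text
tags = tag_names("CDC2 was separated by SDS PAGE".split(), lexicon, blacklist)
```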

Page 86: Large-scale biomedical data and text integration

information extraction

Page 87: Large-scale biomedical data and text integration

co-mentioning

Page 88: Large-scale biomedical data and text integration

counting

Page 89: Large-scale biomedical data and text integration

within documents

Page 90: Large-scale biomedical data and text integration

within paragraphs

Page 91: Large-scale biomedical data and text integration

within sentences
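
Counting co-mentions at different levels can be sketched as follows. The sketch assumes naive substring matching and splitting sentences on full stops, and it covers only the document and sentence levels; the function name and example texts are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def comention_counts(documents, names):
    """Count how often each pair of names co-occurs within documents and within sentences."""
    doc_counts, sent_counts = Counter(), Counter()
    for text in documents:
        sentences = [s for s in text.split(".") if s.strip()]
        per_sentence = [{n for n in names if n in s} for s in sentences]
        in_doc = set().union(*per_sentence) if per_sentence else set()
        for pair in combinations(sorted(in_doc), 2):
            doc_counts[pair] += 1
        for found in per_sentence:
            for pair in combinations(sorted(found), 2):
                sent_counts[pair] += 1
    return doc_counts, sent_counts

docs = [
    "CYC1 is controlled by HAP1. CYC7 is also studied.",
    "HAP1 binds DNA. CYC7 is expressed.",
]
doc_counts, sent_counts = comention_counts(docs, ["CYC1", "CYC7", "HAP1"])
```

A scoring scheme can then weight sentence-level co-mentions more heavily than document-level ones, since names in the same sentence are more likely to be related.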

Page 92: Large-scale biomedical data and text integration

scoring scheme

Page 93: Large-scale biomedical data and text integration
Page 94: Large-scale biomedical data and text integration
Page 95: Large-scale biomedical data and text integration

score calibration

Page 96: Large-scale biomedical data and text integration

natural language processing

Page 97: Large-scale biomedical data and text integration

grammatical analysis

Page 98: Large-scale biomedical data and text integration

part-of-speech tagging

Page 99: Large-scale biomedical data and text integration

what you learned in school

pronoun pronoun verb preposition noun

Page 100: Large-scale biomedical data and text integration

multiword detection

Page 101: Large-scale biomedical data and text integration

compound nouns in Danish

Page 102: Large-scale biomedical data and text integration

semantic tagging

Page 103: Large-scale biomedical data and text integration

words of special interest

Page 104: Large-scale biomedical data and text integration

sentence parsing

Page 105: Large-scale biomedical data and text integration

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 106: Large-scale biomedical data and text integration

text corpus

Page 107: Large-scale biomedical data and text integration

~22 million abstracts

Page 108: Large-scale biomedical data and text integration

Medline

Page 109: Large-scale biomedical data and text integration

~2 million full-text articles

Page 110: Large-scale biomedical data and text integration

restricted access

Page 111: Large-scale biomedical data and text integration

Mini exercise 1

Go to http://string-db.org
Query for Mt H37Rv adhD (Rv3086)
Change between different views
Check evidence for adhD–lipR link
Extend network to 50 interactors

Page 112: Large-scale biomedical data and text integration
Page 113: Large-scale biomedical data and text integration
Page 114: Large-scale biomedical data and text integration

Mini exercise 2

Go to the paper PMC2995261
Extract the protein names in table 1
Create STRING network of them
Change to “advanced” mode
Analyze for clusters and enrichment

Page 115: Large-scale biomedical data and text integration

multi-page tables

Page 116: Large-scale biomedical data and text integration

Large-scale data integration

Lars Juhl Jensen

Page 117: Large-scale biomedical data and text integration

general approach

Page 118: Large-scale biomedical data and text integration

curated knowledge

Page 119: Large-scale biomedical data and text integration

experimental data

Page 120: Large-scale biomedical data and text integration

text mining

Page 121: Large-scale biomedical data and text integration

computational predictions

Page 122: Large-scale biomedical data and text integration

common identifiers

Page 123: Large-scale biomedical data and text integration

quality scores

Page 124: Large-scale biomedical data and text integration

score calibration

Page 125: Large-scale biomedical data and text integration

visualization

Page 126: Large-scale biomedical data and text integration

STRING

Page 127: Large-scale biomedical data and text integration

protein networks

Page 128: Large-scale biomedical data and text integration

string-db.org

Page 129: Large-scale biomedical data and text integration

STITCH

Page 130: Large-scale biomedical data and text integration

chemical networks

Page 131: Large-scale biomedical data and text integration

stitch-db.org

Page 132: Large-scale biomedical data and text integration

PubChem

Page 133: Large-scale biomedical data and text integration

metabolic pathway maps

Page 134: Large-scale biomedical data and text integration

drug target databases

Page 135: Large-scale biomedical data and text integration

high-throughput screening

Page 136: Large-scale biomedical data and text integration

COMPARTMENTS

Page 137: Large-scale biomedical data and text integration

subcellular localization

Page 138: Large-scale biomedical data and text integration

compartments.jensenlab.org

Page 139: Large-scale biomedical data and text integration

Gene Ontology

Page 140: Large-scale biomedical data and text integration

GO annotations

Page 141: Large-scale biomedical data and text integration

UniProtKB

Page 142: Large-scale biomedical data and text integration

model organism databases

Page 143: Large-scale biomedical data and text integration

sequence-based predictions

Page 144: Large-scale biomedical data and text integration

PSORT

Page 145: Large-scale biomedical data and text integration

YLoc

Page 146: Large-scale biomedical data and text integration

TISSUES

Page 147: Large-scale biomedical data and text integration

tissue expression

Page 148: Large-scale biomedical data and text integration

tissues.jensenlab.org

Page 149: Large-scale biomedical data and text integration

Brenda Tissue Ontology

Page 150: Large-scale biomedical data and text integration

high-throughput studies

Page 151: Large-scale biomedical data and text integration

EST libraries

Page 152: Large-scale biomedical data and text integration

microarrays

Page 153: Large-scale biomedical data and text integration

RNA-Seq

Page 154: Large-scale biomedical data and text integration

mass spectrometry

Page 155: Large-scale biomedical data and text integration

immunohistochemistry

Page 156: Large-scale biomedical data and text integration

DISEASES

Page 157: Large-scale biomedical data and text integration

disease associations

Page 158: Large-scale biomedical data and text integration

text mining

Page 159: Large-scale biomedical data and text integration

genetics databases

Page 160: Large-scale biomedical data and text integration

Genetics Home Reference

Page 161: Large-scale biomedical data and text integration

GWAS studies

Page 162: Large-scale biomedical data and text integration

NHGRI GWAS Catalog

Page 163: Large-scale biomedical data and text integration

cancer mutation data

Page 164: Large-scale biomedical data and text integration

COSMIC

Page 165: Large-scale biomedical data and text integration

Work on your own data

string-db.org

stitch-db.org

compartments.jensenlab.org

tissues.jensenlab.org

diseases.jensenlab.org

Page 166: Large-scale biomedical data and text integration

Lars Juhl Jensen

Medical text data mining

Page 167: Large-scale biomedical data and text integration

structured data

Page 168: Large-scale biomedical data and text integration

Jensen et al., Nature Reviews Genetics, 2012

Page 169: Large-scale biomedical data and text integration

unstructured data

Page 170: Large-scale biomedical data and text integration
Page 171: Large-scale biomedical data and text integration

central registries

Page 172: Large-scale biomedical data and text integration

individual hospitals

Page 173: Large-scale biomedical data and text integration
Page 174: Large-scale biomedical data and text integration

opt-out

Page 175: Large-scale biomedical data and text integration

opt-in

Page 176: Large-scale biomedical data and text integration

Danish registries

Page 177: Large-scale biomedical data and text integration

civil registration system

Page 178: Large-scale biomedical data and text integration

CPR number

Page 179: Large-scale biomedical data and text integration

established in 1968

Page 180: Large-scale biomedical data and text integration

Jensen et al., Nature Reviews Genetics, 2012

Page 181: Large-scale biomedical data and text integration

national discharge registry

Page 182: Large-scale biomedical data and text integration

14 years

Page 183: Large-scale biomedical data and text integration

6.2 million patients

Page 184: Large-scale biomedical data and text integration

45 million admissions

Page 185: Large-scale biomedical data and text integration

68 million records

Page 186: Large-scale biomedical data and text integration

119 million diagnoses

Page 187: Large-scale biomedical data and text integration

ICD-10

Page 188: Large-scale biomedical data and text integration

Jensen et al., Nature Reviews Genetics, 2012

Page 189: Large-scale biomedical data and text integration

not research

Page 190: Large-scale biomedical data and text integration

reimbursement

Page 191: Large-scale biomedical data and text integration

diagnosis trajectories

Page 192: Large-scale biomedical data and text integration

naïve approach

Page 193: Large-scale biomedical data and text integration

comorbidity

Page 194: Large-scale biomedical data and text integration

Jensen et al., Nature Reviews Genetics, 2012
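
The naïve view of comorbidity can be written as a relative risk: how much more often two diagnoses occur in the same patient than expected if they were independent. This is a sketch with made-up counts; it deliberately ignores the confounders that a careful analysis has to handle.

```python
def relative_risk(n_ab, n_a, n_b, n_total):
    """Observed co-occurrence of diagnoses A and B divided by the count expected under independence."""
    expected = n_a * n_b / n_total
    return n_ab / expected

# Hypothetical counts: 10,000 patients, 100 with A, 200 with B, 20 with both.
rr = relative_risk(n_ab=20, n_a=100, n_b=200, n_total=10_000)
```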

Page 195: Large-scale biomedical data and text integration

confounding factors

Page 196: Large-scale biomedical data and text integration

“known knowns”

Page 197: Large-scale biomedical data and text integration

gender

Page 198: Large-scale biomedical data and text integration

age

Page 199: Large-scale biomedical data and text integration

type of hospital encounter

Page 200: Large-scale biomedical data and text integration

Jensen et al., Nature Communications, 2014

Page 201: Large-scale biomedical data and text integration

“known unknowns”

Page 202: Large-scale biomedical data and text integration

smoking

Page 203: Large-scale biomedical data and text integration

diet

Page 204: Large-scale biomedical data and text integration

“unknown unknowns”

Page 205: Large-scale biomedical data and text integration

reporting biases

Page 206: Large-scale biomedical data and text integration

matched controls

Page 207: Large-scale biomedical data and text integration

temporal correlations

Page 208: Large-scale biomedical data and text integration

multiple testing
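
When many diagnosis pairs are tested at once, the significance threshold has to be adjusted. The sketch below uses a Bonferroni correction as one common choice; the slides do not state which correction was applied, and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant after dividing alpha by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three hypothetical tests: the corrected threshold is 0.05 / 3.
significant = bonferroni([0.001, 0.02, 0.04])
```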

Page 209: Large-scale biomedical data and text integration

trajectories

Page 210: Large-scale biomedical data and text integration

Jensen et al., Nature Communications, 2014

Page 211: Large-scale biomedical data and text integration

trajectory networks

Page 212: Large-scale biomedical data and text integration

Jensen et al., Nature Communications, 2014

Page 213: Large-scale biomedical data and text integration

key diagnoses

Page 214: Large-scale biomedical data and text integration

Jensen et al., Nature Communications, 2014

Page 215: Large-scale biomedical data and text integration

direct medical implications

Page 216: Large-scale biomedical data and text integration

electronic health records

Page 217: Large-scale biomedical data and text integration

structured data

Page 218: Large-scale biomedical data and text integration

Jensen et al., Nature Reviews Genetics, 2012

Page 219: Large-scale biomedical data and text integration

unstructured data

Page 220: Large-scale biomedical data and text integration
Page 221: Large-scale biomedical data and text integration

free text

Page 222: Large-scale biomedical data and text integration

Danish

Page 223: Large-scale biomedical data and text integration

busy doctors

Page 224: Large-scale biomedical data and text integration

typos

Page 225: Large-scale biomedical data and text integration

psychiatric patients

Page 226: Large-scale biomedical data and text integration

custom dictionaries

Page 227: Large-scale biomedical data and text integration

diseases

Page 228: Large-scale biomedical data and text integration

drugs

Page 229: Large-scale biomedical data and text integration

adverse drug reactions

Page 230: Large-scale biomedical data and text integration

expansion rules

Page 231: Large-scale biomedical data and text integration

Clozapine

Page 232: Large-scale biomedical data and text integration

Clozapine

clozapin
clossapin
klozapine
chlosapin
chlosapine
chlozapin
chlozapine
klossapin
closapine
klozapin
klosapin
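
Spelling variants like these can be generated by combining a few substitution rules. The rules below (c/k/ch at the start, z/s/ss in the middle, optional final e) are inferred from the variants listed on the slide for illustration; they are an assumption, not the rule set actually used.

```python
from itertools import product

def spelling_variants(prefixes, stem_a, middles, stem_b, endings):
    """Combine alternative spellings of each name segment into a set of variants."""
    return {
        p + stem_a + m + stem_b + e
        for p, m, e in product(prefixes, middles, endings)
    }

# Assumed segments for "clozapine": [c|k|ch] + "lo" + [z|s|ss] + "apin" + [""|"e"]
variants = spelling_variants(["c", "k", "ch"], "lo", ["z", "s", "ss"], "apin", ["", "e"])
```

The generated set is larger than the list on the slide, which is the usual trade-off: broad rules catch more typos but may need a black list to remove variants that collide with other words.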

Page 233: Large-scale biomedical data and text integration

post-coordination rules

Page 234: Large-scale biomedical data and text integration

failure of kidney

Page 235: Large-scale biomedical data and text integration

kidney failure
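
The reordering shown on these two slides can be sketched as a rewrite rule that turns "X of Y" into "Y X". This is a deliberately narrow, hypothetical rule covering only two-word terms in English; real dictionaries would need more patterns.

```python
import re

def reorder_of_phrase(term):
    """Rewrite 'X of Y' as 'Y X'; return other terms unchanged."""
    m = re.fullmatch(r"(\w+) of (\w+)", term.strip(), flags=re.IGNORECASE)
    return f"{m.group(2)} {m.group(1)}" if m else term

rewritten = reorder_of_phrase("failure of kidney")
unchanged = reorder_of_phrase("kidney failure")
```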

Page 236: Large-scale biomedical data and text integration

pharmacovigilance

Page 237: Large-scale biomedical data and text integration

clinical trials

Page 238: Large-scale biomedical data and text integration

spontaneous reports

Page 239: Large-scale biomedical data and text integration
Page 240: Large-scale biomedical data and text integration

underreporting

Page 241: Large-scale biomedical data and text integration

data mining

Page 242: Large-scale biomedical data and text integration

structured data

Page 243: Large-scale biomedical data and text integration

medication

Page 244: Large-scale biomedical data and text integration

semi-structured data

Page 245: Large-scale biomedical data and text integration

drug indications

Page 246: Large-scale biomedical data and text integration

known ADRs

Page 247: Large-scale biomedical data and text integration

unstructured data

Page 248: Large-scale biomedical data and text integration

adverse drug reactions

Page 249: Large-scale biomedical data and text integration

temporal correlations

Page 250: Large-scale biomedical data and text integration

hand-crafted rules

Page 251: Large-scale biomedical data and text integration

Eriksson et al., Drug Safety, 2014

Page 252: Large-scale biomedical data and text integration

Eriksson et al., Drug Safety, 2014

Page 253: Large-scale biomedical data and text integration

Eriksson et al., Drug Safety, 2014

Page 254: Large-scale biomedical data and text integration

Eriksson et al., Drug Safety, 2014

Page 255: Large-scale biomedical data and text integration

recall known ADRs

Page 256: Large-scale biomedical data and text integration

estimate ADR frequencies

Page 257: Large-scale biomedical data and text integration

Eriksson et al., Drug Safety, 2014

Page 258: Large-scale biomedical data and text integration

discover new ADRs

Page 259: Large-scale biomedical data and text integration

Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3

Eriksson et al., Drug Safety, 2014

Page 260: Large-scale biomedical data and text integration

Acknowledgments

Molecular networks
Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Localization and disease
Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue

Medical data mining
Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak