[Figure: Diseases, Anatomy, Genes and Physiology, linked through Medical Informatics and Bioinformatics]
Novel relationships & Deeper insights
Integrative Genomics for Understanding Disease Process
Anil Jegga
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati, Cincinnati, Ohio - [email protected]
Acknowledgement
• Jing Chen
• Mrunal Deshmukh
• Sivakumar Gowrisankar
• Chandra Gudivada
• Arvind Muthukrishnan
• Bruce J Aronow
Two Separate Worlds….. With Some Data Exchange…
Medical Informatics: Patient Records, Disease Database, PubMed, Clinical Trials
Disease Database: →Name →Synonyms →Related/Similar Diseases →Subtypes →Etiology →Predisposing Causes →Pathogenesis →Molecular Basis →Population Genetics →Clinical findings →System(s) involved →Lesions →Diagnosis →Prognosis →Treatment →Clinical Trials……
Bioinformatics: Genome, Transcriptome, Proteome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome
Disease World: OMIM Clinical Synopsis
354 “omes” so far………
and there is “UNKNOME” too - genes with no known function
http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as on October 15, 2006)
To correlate diseases with anatomical parts affected, the genes/proteins involved, and the underlying physiological processes (interactions, pathways, processes). In other words, bringing the disciplines of Medical Informatics (MI) and BioInformatics (BI) together (Biomedical Informatics - BMI) to support personalized or “tailor-made” medicine.
Motivation
How to integrate multiple types of genome-scale data across experiments and phenotypes in order to find genes associated with diseases
Model Organism Databases: Common Issues
• Heterogeneous Data Sets - Data Integration
– From Genotype to Phenotype
– Experimental and Consensus Views
• Incorporation of Large Datasets
– Whole genome annotation pipelines
– Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data Collection and Evaluation (MedLine)
• Data Mining
– extraction of new knowledge
– testable hypotheses (Hypothesis Generation)
Support Complex Queries
• Get me all genes involved in brain development that are expressed in the Central Nervous System.
• Get me all genes involved in brain development in human and mouse that also show iron ion binding activity.
• For this set of genes, what aspects of function and/or cellular localization do they share?
• For this set of genes, what mutations are reported to cause pathological conditions?
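The queries above combine several annotation types at once. A minimal Python sketch of that idea follows, assuming a small in-memory table; the `GeneRecord` fields, the `find_genes` and `shared_functions` helpers, and the GENE_A/B/C records are all hypothetical placeholders, not data from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    symbol: str
    species: str
    processes: frozenset   # GO-style biological process terms
    functions: frozenset   # GO-style molecular function terms
    tissues: frozenset     # tissues where expression was observed


# Hypothetical toy records; symbols and annotations are placeholders.
GENES = [
    GeneRecord("GENE_A", "human", frozenset({"brain development"}),
               frozenset({"iron ion binding"}),
               frozenset({"central nervous system"})),
    GeneRecord("GENE_B", "mouse", frozenset({"brain development"}),
               frozenset({"DNA binding"}),
               frozenset({"central nervous system", "liver"})),
    GeneRecord("GENE_C", "human", frozenset({"lipid transport"}),
               frozenset({"iron ion binding"}),
               frozenset({"liver"})),
]


def find_genes(genes, process=None, function=None, tissue=None, species=None):
    """Return records matching every criterion that is not None."""
    hits = []
    for g in genes:
        if process is not None and process not in g.processes:
            continue
        if function is not None and function not in g.functions:
            continue
        if tissue is not None and tissue not in g.tissues:
            continue
        if species is not None and g.species not in species:
            continue
        hits.append(g)
    return hits


def shared_functions(genes):
    """Molecular functions common to every record in the set."""
    sets = [g.functions for g in genes]
    return frozenset.intersection(*sets) if sets else frozenset()


# Query 1: brain development genes expressed in the CNS.
cns_brain_dev = find_genes(GENES, process="brain development",
                           tissue="central nervous system")
# Query 2: same process, human or mouse, with iron ion binding activity.
iron_binders = find_genes(GENES, process="brain development",
                          function="iron ion binding",
                          species={"human", "mouse"})
```

Each criterion is an independent filter, so new annotation types (for example reported mutations) could be added as further fields without changing the query shape.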
Bioinformatic Data-1978 to present
• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others………..
Human Genome Project – Data Deluge
Database name       Records
Nucleotide          11,512,792
Protein             313,099
Structure           8,490
Genome Sequences    51
Popset              20,801
SNP                 12,702,095
3D Domains          31,862
Domains             25
GEO Datasets        2,969
GEO Expressions     9,783,946
UniGene             86,804
UniSTS              322,092
PubMed Central      3,140
HomoloGene          20,123
Taxonomy            1
No. of Human Gene Records currently in NCBI: 31507 (excluding pseudogenes, mitochondrial genes and obsolete records). Includes ~460 microRNAs.
NCBI Human Genome Statistics – as on October 18, 2006
The Gene Expression Data Deluge
Till 2000: 413 papers on microarray!
Year    PubMed Articles
2001    834
2002    1557
2003    2421
2004    3508
2005    4400
2006    4083+
Problems Deluge!
Allison DB, Cui X, Page GP, Sabripour M. 2006. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 7(1): 55-65.
• 3 scientific journals in 1750
• Now - >120,000 scientific journals!
• >500,000 medical articles/year
• >4,000,000 scientific articles/year
• >16 million abstracts in PubMed derived from >32,500 journals
• >4.5 billion distinct web pages indexed by Google!
• Google search for integrative genomics: ~930,000 hits; for “integrative genomics” (quoted phrase): ~112,000 hits
Information Deluge…..
A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer (Baasiri et al., 1999 Oncogene 18: 7958-7965).
• Accelerin
• Antiquitin
• Bang Senseless
• Bride of Sevenless
• Christmas Factor
• Cockeye
• Crack
• Draculin
• Dickie’s small eye
• Fidgetin
• Gleeful
• Knobhead
• Lunatic Fringe
• Mortalin
• Orphanin
• Profilactin
• Sonic Hedgehog
Disease names
• Mobius Syndrome with Poland’s Anomaly
• Werner’s syndrome
• Down’s syndrome
• Angelman’s syndrome
• Creutzfeldt-Jakob disease
Data-driven Problems…..
Gene Nomenclature
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
• Develop/Use controlled or restricted vocabularies (IUPAC-like naming conventions, HGNC, MGI, UMLS, etc.)
• Create/Use thesauruses, central repositories or synonym lists (MeSH, UMLS, etc.)
• Work towards synoptic reporting and structured abstracting
Some Solutions
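A synonym list of the kind suggested above can be sketched in a few lines of Python. This is an illustration only: the `THESAURUS` contents are a hand-made assumption (p53 as an alias of the HGNC symbol TP53, "Dickie's small eye" for Pax6, and MT1 deliberately mapped to two metallothionein genes to show ambiguity), and `build_alias_index`, `normalize` and `ambiguous_names` are hypothetical helper names.

```python
# Hypothetical mini-thesaurus: official symbol -> aliases seen in the literature.
THESAURUS = {
    "TP53": {"p53"},
    "Pax6": {"Dickie's small eye"},
    "MT1A": {"MT1"},   # illustrative: one short name shared by
    "MT1B": {"MT1"},   # several members of a gene cluster
}


def build_alias_index(thesaurus):
    """Map every case-folded name to the set of official symbols it may denote."""
    index = {}
    for official, aliases in thesaurus.items():
        for name in {official, *aliases}:
            index.setdefault(name.casefold(), set()).add(official)
    return index


def normalize(name, index):
    """Return the official symbol, or None if the name is unknown or ambiguous."""
    matches = index.get(name.strip().casefold(), set())
    if len(matches) == 1:
        return next(iter(matches))
    return None


def ambiguous_names(index):
    """Names that point at more than one official symbol."""
    return {name: symbols for name, symbols in index.items() if len(symbols) > 1}


ALIAS_INDEX = build_alias_index(THESAURUS)
```

Returning None for ambiguous names, rather than picking one symbol, keeps the terminology problem visible so a curator can resolve it.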
1. Generally, the names refer to some feature of the mutant phenotype
2. Dickie’s small eye (Thieler et al., 1978, Anat Embryol (Berl), 155: 81-86) is now Pax6
3. Gleeful: "This gene encodes a C2H2 zinc finger transcription factor with high sequence similarity to vertebrate Gli proteins, so we have named the gene gleeful (Gfl)." (Furlong et al., 2001, Science 293: 1632)
What’s in a name!
Rose is a rose is a rose is a rose!
Some more ambiguous examples……..
• The yeast homologue of the human gene PMS1, which codes for a DNA repair protein, is called PMS2; whereas yeast PMS1 corresponds to human PMS2!
• Even more confusing, 4,257 abbreviated names were used to refer to more than one gene. Top of the list was MT1, used to describe at least 11 members of a cluster of genes encoding small proteins that bind to metal ions (Nature 411: 631-632).
and there are some weird ones too……..
• AR*E: aryl sulfatase E in all species
• f**K: fuculokinase gene in bacteria
Rose is a rose is a rose is a rose….. Not Really!
Image Sources: Somewhere from the internet…
What is a cell?
• any small compartment
• (biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals
• a device that delivers an electric current as a result of chemical reaction
• a small unit serving as part of or as the nucleus of a larger political movement
• cellular telephone: a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver
• small room in which a monk or nun lives
• a room where a prisoner is kept
Foundation Model Explorer
Semantic Groups, Types and Concepts:
• Semantic Group Biology – Semantic Type Cell
• Semantic Groups Object OR Devices – Semantic Types Manufactured Device or Electrical Device or Communication Device
• Semantic Group Organization – Semantic Type Political Group
Hepatocellular Carcinoma
CTNNB1, MET, TP53
CTNNB1:
1. COLORECTAL CANCER [3-BP DEL, SER45DEL]
2. COLORECTAL CANCER [SER33TYR]
3. PILOMATRICOMA, SOMATIC [SER33TYR]
4. HEPATOBLASTOMA, SOMATIC [THR41ALA]
5. DESMOID TUMOR, SOMATIC [THR41ALA]
6. PILOMATRICOMA, SOMATIC [ASP32GLY]
7. OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC [SER37CYS]
8. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PHE]
9. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PRO]
10. MEDULLOBLASTOMA, SOMATIC [SER33PHE]
TP53:
1. HEPATOCELLULAR CARCINOMA SOMATIC [ARG249SER]
TP53*: aflatoxin B1, a mycotoxin, induces a very specific G-to-T mutation at codon 249 in the tumor suppressor gene p53.
Environmental Effects
Many disease states are complex because they involve many genes (alleles & ethnicity, gene families, etc.), environmental effects (lifestyle, exposure, etc.) and the interactions among them.
The REAL Problems
HEPATOCELLULAR CARCINOMA
LIVER: •Hepatocellular carcinoma; •Micronodular cirrhosis; •Subacute progressive viral hepatitis
NEOPLASIA: •Primary liver cancer
CTNNB1, MET, TP53
CTNNB1:
1. ALK in cardiac myocytes
2. Cell to Cell Adhesion Signaling
3. Inactivation of Gsk3 by AKT causes accumulation of b-catenin in Alveolar Macrophages
4. Multi-step Regulation of Transcription by Pitx2
5. Presenilin action in Notch and Wnt signaling
6. Trefoil Factors Initiate Mucosal Healing
7. WNT Signaling Pathway
MET:
1. CBL mediated ligand-induced downregulation of EGF receptors
2. Signaling of Hepatocyte Growth Factor Receptor
TP53:
1. Estrogen-responsive protein Efp controls cell cycle and breast tumors growth
2. ATM Signaling Pathway
3. BTG family proteins and cell cycle regulation
4. Cell Cycle
5. RB Tumor Suppressor/Checkpoint Signaling in response to DNA damage
6. Regulation of transcriptional activity by PML
7. Regulation of cell cycle progression by Plk3
8. Hypoxia and p53 in the Cardiovascular system
9. p53 Signaling Pathway
10. Apoptotic Signaling in Response to DNA Damage
11. Role of BRCA1, BRCA2 and ATR in Cancer Susceptibility
….Many More…..
The REAL Problems
Integrative Genomics - what is it?
Another buzzword or a meaningful concept useful for biomedical research?
Acquisition, Integration, Curation, and Analysis of biological data
Integrative Genomics: the study of complex interactions between genes, organism and environment, the triple helix of biology. Gene <-> Organism <-> Environment
It is definitely beyond the buzzword stage - Universities now have programs named 'Integrated Genomics.'
Hypothesis
Information is not knowledge - Albert Einstein
1. Link-driven federations
• Explicit links between databanks.
2. Warehousing
• Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
3. Others….. Semantic Web, etc………
Methods for Integration
1. Creates explicit links between databanks
2. query: get interesting results and use web links to reach related data in other databanks
Examples: NCBI-Entrez, SRS
Link-driven Federations
http://www.ncbi.nlm.nih.gov/Database/datamodel/
Querying Entrez-Gene
No. of Records
Database name   Query = p53   Query = TP53 (HGNC)   Query = p53 OR TP53
PubMed          37,962        1928                  38,512
PMC             9647          373                   9738
Book            710           332                   744
Nucleotide      7062          1603                  8442
Protein         3882          314                   3970
Genome          12            0                     12
OMIM            317           79                    744
SNP             14,277        1513                  14,779
Gene            1058          258                   1115
Homologene      723           31                    735
GEO Profiles    68,000        10,539                70,718
Cancer Chr      292           129                   421
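The table shows why searching on a single name misses records, and why combining the alias and the official symbol with OR retrieves more. As a hedged sketch, the same kind of query can be composed programmatically for NCBI's E-utilities `esearch` endpoint; this is an assumption about how one might script it today, not the interface used to produce the counts on the slide. The code only builds the URL and makes no network call; `or_query` and `esearch_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (takes db, term and retmax parameters).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_query(*terms):
    """Join alternative names for one entity into a single Entrez OR query."""
    return " OR ".join(terms)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url("gene", or_query("p53", "TP53"))
```

Building the term from a synonym list, rather than typing it by hand, is one way to reduce the "terminology problem" in link-driven searches.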
1. Advantages
• complex queries
• fast
2. Disadvantages
• require good knowledge
• syntax based
• terminology problem not solved
Link-driven Federations
Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
Data Warehousing
Advantages
1. Good for very specific, task-based queries and studies.
2. Since it is custom-built and usually expert-curated, relatively less error-prone.
Disadvantages
1. Can quickly become outdated – needs constant updates.
2. Limited functionality – e.g., one disease-based or one system-based.
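The warehousing approach (download, filter, integrate, store, then query locally) can be sketched with Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions: the two-table schema and the `build_warehouse` and `pathways_for_disease` helpers are hypothetical, and the rows are a handful of gene-disease and gene-pathway pairs taken from the hepatocellular carcinoma slides.

```python
import sqlite3

# Rows as they might arrive from two separate source databanks.
GENE_DISEASE = [
    ("CTNNB1", "Hepatocellular carcinoma"),
    ("MET", "Hepatocellular carcinoma"),
    ("TP53", "Hepatocellular carcinoma"),
]
GENE_PATHWAY = [
    ("CTNNB1", "WNT Signaling Pathway"),
    ("MET", "Signaling of Hepatocyte Growth Factor Receptor"),
    ("TP53", "p53 Signaling Pathway"),
    ("TP53", "ATM Signaling Pathway"),
]


def build_warehouse(gene_disease, gene_pathway):
    """Load both sources into one in-memory database (the 'warehouse')."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene_disease (gene TEXT, disease TEXT)")
    conn.execute("CREATE TABLE gene_pathway (gene TEXT, pathway TEXT)")
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", gene_disease)
    conn.executemany("INSERT INTO gene_pathway VALUES (?, ?)", gene_pathway)
    conn.commit()
    return conn


def pathways_for_disease(conn, disease):
    """Answer a cross-source question from the warehouse alone."""
    rows = conn.execute(
        "SELECT DISTINCT p.pathway "
        "FROM gene_disease AS d JOIN gene_pathway AS p ON p.gene = d.gene "
        "WHERE d.disease = ? ORDER BY p.pathway",
        (disease,),
    )
    return [row[0] for row in rows]


warehouse = build_warehouse(GENE_DISEASE, GENE_PATHWAY)
hcc_pathways = pathways_for_disease(warehouse, "Hepatocellular carcinoma")
```

The join runs entirely on local copies, which is what makes warehouse queries fast and also why the contents go stale unless the load step is repeated.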
No Integrative Genomics is Complete without Ontologies
• Gene Ontology (GO): Gene World
• Unified Medical Language System (UMLS): Biomedical World
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
– What a product ‘does’, precise activity
• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
– Biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
– ‘is located in’ (‘is a subcomponent of’)
The 3 Gene Ontologies
http://www.geneontology.org
Function (what)                  Process (why)
Drive a nail into wood           Carpentry
Drive a stake into soil          Gardening
Smash a bug                      Pest Control
A performer’s juggling object    Entertainment
Example: Gene Product = hammer
• ISS: Inferred from sequence or structural similarity
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• TAS: Traceable author statement
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• ND: no data available
GO term associations: Evidence Codes
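Evidence codes let a user decide how much to trust each GO association. A minimal sketch of filtering by evidence code follows; the `Annotation` tuple, the GENE_A to GENE_D records and the `filter_by_evidence` helper are hypothetical, and the choice to treat IDA, IPI, IMP, IGI and IEP as the "experimental" subset is an assumption made for this example.

```python
from collections import namedtuple

Annotation = namedtuple("Annotation", "gene go_term evidence")

# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Assumption for this sketch: codes backed by a direct experiment.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotations; gene names are placeholders.
ANNOTATIONS = [
    Annotation("GENE_A", "DNA repair", "IDA"),
    Annotation("GENE_B", "DNA repair", "ISS"),
    Annotation("GENE_C", "purine metabolism", "IMP"),
    Annotation("GENE_D", "purine metabolism", "ND"),
]


def filter_by_evidence(annotations, allowed):
    """Keep annotations whose evidence code is in `allowed`."""
    unknown = set(allowed) - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a.evidence in allowed]


experimental_only = filter_by_evidence(ANNOTATIONS, EXPERIMENTAL)
```

Rejecting unknown codes early avoids silently dropping every annotation because of a typo in the filter.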
• Access gene product functional information
• Find how much of a proteome is involved in a process/ function/ component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and
– gene expression profiles
– proteomics data
What can researchers do with GO?
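One of the uses listed above, finding how much of a proteome is involved in a process, reduces to counting distinct gene products per GO term. A rough sketch, assuming annotations are available as simple (gene, term) pairs; `term_coverage`, the pairs and the proteome size of 4 are hypothetical.

```python
from collections import defaultdict


def term_coverage(annotations, proteome_size):
    """Fraction of the proteome annotated to each GO term.

    `annotations` is an iterable of (gene, go_term) pairs; a gene annotated
    to the same term more than once is counted only once.
    """
    if proteome_size <= 0:
        raise ValueError("proteome_size must be positive")
    genes_by_term = defaultdict(set)
    for gene, term in annotations:
        genes_by_term[term].add(gene)
    return {term: len(genes) / proteome_size
            for term, genes in genes_by_term.items()}


# Hypothetical annotations for a toy proteome of 4 gene products.
PAIRS = [
    ("GENE_A", "DNA repair"),
    ("GENE_B", "DNA repair"),
    ("GENE_B", "DNA repair"),       # duplicate annotation, counted once
    ("GENE_C", "purine metabolism"),
]
coverage = term_coverage(PAIRS, proteome_size=4)
```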
• Getting the GO and GO_Association Files
• Data Mining
– My Favorite Gene
– By GO
– By Sequence
• Analysis of Data
– Clustering by function/process
• Other Tools
And how?
http://www.geneontology.org/
Open biomedical ontologies
http://obo.sourceforge.net/
Unified Medical Language System Knowledge Server– UMLSKS
http://umlsks.nlm.nih.gov/kss/
• The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.
Unified Medical Language System Metathesaurus
• Over 1 million biomedical concepts
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
– linking between different clinical or biomedical vocabularies
– information retrieval from databases with human assigned subject index terms and from free-text information sources
– linking patient records to related information in bibliographic, full-text, or factual databases
– natural language processing and automated indexing research
UMLSKS – Semantic Network
• Complexity reduced by grouping concepts according to the semantic types that have been assigned to them.
• There are currently 15 semantic groups that provide a partition of the UMLS Metathesaurus for 99.5% of the concepts.
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
CONC|Concepts & Ideas|T170|Intellectual Product
DEVI|Devices|T074|Medical Device
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
GEOG|Geographic Areas|T083|Geographic Area
LIVB|Living Beings|T005|Virus
OBJC|Objects|T073|Manufactured Object
OCCU|Occupations|T091|Biomedical Occupation or Discipline
ORGA|Organizations|T093|Health Care Related Organization
PHEN|Phenomena|T038|Biologic Function
PHYS|Physiology|T040|Organism Function
PROC|Procedures|T061|Therapeutic or Preventive Procedure
Semantic Groups (15) → Semantic Types (135) → Concepts (millions)
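The pipe-delimited lines above (group abbreviation, group name, semantic type identifier, semantic type name) are easy to load into a lookup table. A small sketch, assuming exactly four fields per line; `parse_semgroups` and `group_of` are hypothetical helper names and only a few of the listed rows are reproduced.

```python
SEMGROUP_LINES = """\
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
"""


def parse_semgroups(text):
    """Map each semantic type id to (group abbreviation, group name, type name)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        abbrev, group, type_id, type_name = line.split("|")
        table[type_id] = (abbrev, group, type_name)
    return table


def group_of(type_id, table):
    """Name of the semantic group that a semantic type belongs to."""
    return table[type_id][1]


SEMGROUPS = parse_semgroups(SEMGROUP_LINES)
```

With such a table, any concept tagged with a semantic type can be rolled up to one of the 15 groups, which is the complexity reduction the slide describes.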
UMLSKS – Semantic Navigator
• The number of patients with AD in any community depends on the proportion of older people in the group. Traditionally, the developed countries had large proportions of elderly people, and so they had very many cases of Alzheimer’s disease in the community at one time.
• 4.5 million AD patients in the United States today.
• Expected to increase to 11 to 16 million by 2050.
• In 2000, health care costs for AD patients in the United States totaled approximately $31.9 billion, which is expected to reach $49.3 billion by 2010 (http://www.alz.org)
• World-wide: ~18 million (projected to nearly double by 2025 to 34 million).
• Demographic transition - Developing countries:
• Increased life expectancy (current life expectancy in India is >60 years).
• 1991 India Census: 70 million people were over 60 years.
• 2001 India Census: 77 million, or 7.6% of the population.
• By 2025, India is projected to have 177 million elderly people.
• Currently, more than 50% of people with Alzheimer’s disease live in developing countries and by 2025, this will be over 70%.
Alzheimer’s Disease – Alarming Statistics
Source: WHO & NIA
• The goal of applying computational data-mining approaches is to extract useful information from large amounts of data by employing mathematical methods that should be as automated as possible.
• Computational data-mining approaches are particularly appropriate in areas with much data but few explanations, such as gerontology. If researchers can find patterns in the data and turn them into information, that information may enhance our knowledge of aging.
Alzheimer’s Disease – Why Computational Approaches?
• The complexity and broad range of cellular and biochemical events make researchers believe that there must be a sophisticated network of AD signal transduction, gene regulation, and protein-protein interaction events.
• Therefore, deciphering AD-related molecular network “circuitry” can help researchers understand AD better, model it in detail, and propose treatment ideas.
Alzheimer Disease
Anatomy: Astrocytes, Basal Nucleus of Meynert, Cerebrum, Cerebral Cortex, Brain and Nervous System, Brain Microglia, Hippocampus, Frontal Lobe, Neurons, Temporal Lobe
Genes: A2M, APOE, ALOX12, ABCA1, ABCA2, NEF3, PARK2, STH, APP, NME1
A simplistic picture
[Figure: the Alzheimer Disease picture above (genes A2M, APOE, ALOX12, ABCA1, ABCA2, STH, APP, NME1, NEF3, PARK2) extended with Parkinson Disease and Schizophrenia; additional gene labels: PARK3, PARK7, PARP, SCZD2, SCZD3, SCZD8]
Many Diseases – Many Genes
→enzyme binding
→extracellular space
→interleukin-1 binding
→interleukin-8 binding
→intracellular protein transport
→protein carrier activity
→protein homooligomerization
→serine-type endopeptidase inhibitor activity
→tumor necrosis factor binding
→wide-spectrum protease inhibitor activity
Functions/Processes
Alzheimer's disease (Kegg)
Neurodegenerative Disorders (Kegg)
Deregulation of CDK5 in Alzheimers Disease (BioCarta)
Generation of amyloid b-peptide by PS1 (BioCarta)
Platelet Amyloid Precursor Protein Pathway (BioCarta)
Hemostasis (Reactome)
Pathways
Genes: Functions & Pathways
C1QBP
KNG1
KLKB1
CNTF
NS5A
TGFB2
APPBP1
Protein Interactions
1. Identifying the genetic players involved
2. Systematically perturbing individual players and/or pathways suspected of being involved in neurodegenerative diseases, using model organisms (e.g. knock-outs)
Understanding the genetic network of human Alzheimer’s disease - Two general phases
Computational Approaches
• Data-mining (Data marts): Comparative Genomics, Interactome, Comparative Phenomics, Regulomics (TFBSs, motif/pattern search)
• Text-mining: Literature mining (hypothesis-generator)
• Mathematical Modeling: Disease process modeling
Experimental Approaches
• Genetic Manipulations
• Gene Expression Studies
• Animal Models
• Cellular Studies (to investigate specific cellular processes)
Alzheimer Disease Related Genes
Proteomics
Genomics
Gene Expression
Model Organisms & Genetic Manipulations
Comparative Genomics
Differentially expressed genes
Cellular Studies
Transcriptome
Models of human neurodegenerative diseases
Post-Transcriptional Regulation - MicroRNAs
Transcriptional Regulation
Text-mining: Knowledge Discovery
Clustering Algorithms
NCBI Entrez Gene Query:
(alzheimer[Disease/Phenotype] OR alzheimer[All fields]) AND "homo sapiens"[Organism]
143 Genes
A2M
ABCA1
ABCA2
ABCB1
ABL1
ACE
AD5
AD6
AD7
AD8
AD9
ADAM10
AGER
AHSG
APBA1
APBB2
APH1A
APOC1
APOD
APOE
APOM
APP
ASAHL
ATF2
BACE1
BACE2
BAX
BCHE
BCL2
BCL2L2
BLMH
CBS
CD40
CDC2
CDK5
CDK5R1
CDK5R2
CHAT
CHRNA4
CHRNA7
CLU
COL18A1
COL25A1
COX10
CRH
CTCF
CTNNA3
CTSB
CTSD
CXCR3
CYP46A1
DHCR24
DLST
DSCR1
E2F1
EEF2
EEF2K
EIF2AK2
EIF4E
EIF4EBP1
ENO1
ERBB4
ESR1
FALZ
FAS
FASLG
FRAP1
FYN
GABBR1
GAL
GAPDH
GFAP
GRIA1
GRIA2
GRIA3
GRIN2A
GRIN2B
GSK3B
HADH2
HPCAL1
HTR2A
IDE
IFNG
IGF2R
IL1B
ITM2B
KCNC4
KLK10
KLK7
LAMA1
LAMC1
LOC644264
LRP8
MAP2K1
MAPT
MEOX2
MME
MPO
MRE11A
MSI1
MTRR
NACA
NCAM1
NCSTN
NDRG2
NES
NGFR
NME1
NME2
NOS3
NRG1
OLR1
P18SRP
PARK7
PAXIP1
PCSK1
PCSK2
PCSK9
PIN1
PLAU
PON1
PRDX1
PRDX2
PRDX3
PRNP
PSEN1
PSEN2
RPS3A
RABGAP1L
RTN4
SERPINA3
SFRS12
SLC1A2
SLC6A3
SLC6A4
SNCB
SORL1
TFAM
TGFB1
TNF
TUBB3
UBQLN1
VSNL1
Mining Interactome
Pathways (top 10)
Molecular & Cellular Functions (top 10)
Physiological System Development & Function (top 10)
Y-axis represents significance - probability that the genes within the dataset file are involved in a particular high level function (Ingenuity Analysis)
http://depts.washington.edu/l2l/
NCBI Entrez Gene Query:
(alzheimer[Disease/Phenotype] OR alzheimer[All fields]) AND "homo sapiens"[Organism]
143 AD-associated genes
Mining about 800 gene expression datasets
Text-mining MedLine Abstracts
• Data Source: GeneRIF (Gene Reference Into Function): manually entered/curated sentences.
• GeneRIF: “Abstract of Abstracts”
• NLP: MetaMap and GATE (General Architecture for Text Engineering)
• Keywords: MeSH and UMLS concepts for Alzheimer’s disease (AD, Alzheimer’s dementia, Alzheimer disease, etc.)
299 unique genes associated with Alzheimer’s disease
GATACA – Gene Association To Anatomy & Clinical Abnormality
299 genes associated with Alzheimer's Disease (based on text-mining Medline abstracts)
Entrez GENE ID | GENE SYMBOL | SENTENCE | PubMed_ID
2 | A2M | Genetic association of alpha2-macroglobulin polymorphisms with Alzheimer's disease | 12221172
5243 | ABCB1 | Deposition of Alzheimer beta amyloid is inversely correlated with expression of this protein in the brains of elderly non-demented humans. | 12360104
153 | ADRB1 | Single-nucleotide polymorphisms (SNPs) in the beta1-adrenergic receptor (ADRB1) allelic frequencies were analyzed in Alzheimer's disease. The combination of G protein beta3 subunit and ADRB1 polymorphisms produces AD susceptibility. | 15212839
239 | ALOX12 | 12/15-lipoxygenase is increased in Alzheimer's disease and has a possible role in brain oxidative stress | 15111312
246 | ALOX15 | 12/15-lipoxygenase is increased in Alzheimer's disease and has a possible role in brain oxidative stress | 15111312
9546 | APBA3 | Associated with etiological mechanism of Alzheimer's disease. | 11831025
List | PMID | Description | Total probes | Expected | Actual | Binomial probability
alzheimers_disease_dn | 14769913* | Downregulated in correlation with overt Alzheimer's Disease, in the CA1 region of the hippocampus | 1222 | 11.08886 | 49 | 2.83E-17
alzheimers_disease_up | 14769913* | Upregulated in correlation with overt Alzheimer's Disease, in the CA1 region of the hippocampus | 1665 | 15.1088 | 53 | 1.82E-14
ageing_brain_up | 15190254** | Age-upregulated in the human frontal cortex | 252 | 2.286737 | 19 | 3.67E-12
ageing_brain_dn | 15190254** | Age-downregulated in the human frontal cortex | 145 | 1.315781 | 13 | 1.07E-09
* Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW. 2004. Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A. 101(7): 2173-2178.
** Lu T, Pan Y, Kao SY, Li C, Kohane I, Chan J, Yankner BA. 2004. Gene regulation and DNA damage in the ageing human brain. Nature 429(6994): 883-891.
http://depts.washington.edu/l2l/
299 genes associated with Alzheimer’s disease: Comparison with genes differentially expressed in Alzheimer’s and ageing frontal cortex
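The "expected" and "actual" columns in the comparison table can be read as an over-representation test: how many of the 299 genes fall in a published list versus how many would be expected by chance. The sketch below computes the fold enrichment from those two columns and gives a generic binomial upper-tail probability. The slide does not state the exact trial count or per-trial probability the L2L tool used, so `binomial_upper_tail` is shown as a general formula with illustrative arguments and is not claimed to reproduce the table's probabilities.

```python
from math import comb


def fold_enrichment(actual, expected):
    """How many times more overlaps were seen than expected by chance."""
    return actual / expected


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits in n draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Values copied from the alzheimers_disease_dn row of the table.
enrichment_dn = fold_enrichment(actual=49, expected=11.08886)

# Illustrative call only: n and p here are made-up small numbers.
toy_tail = binomial_upper_tail(k=1, n=2, p=0.5)
```

A fold enrichment well above 1 together with a very small tail probability is what marks a list as significantly over-represented.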
Human CNS, Human non-CNS, Mouse CNS, Mouse non-CNS
A 940 gene ortholog pairs over-expressed in both human and mouse CNS
B 206 gene ortholog pairs over-expressed in human, not mouse CNS
C 266 gene ortholog pairs over-expressed in mouse, not human CNS
Kong and Jegga, unpublished
CNS-overexpressed genes in adult human and/or mouse
[Venn diagram region counts: 220, 28, 2130, 308, 865, 581]
940 human-mouse orthologous genes overexpressed in CNS
1222 genes downregulated in Alzheimer’s
299 genes associated with Alzheimer’s disease – Literature mining
APP
ARPP-19
CAMK2A
CDK5
CDK5R1
CHGA
CKB
GLUL
GNAS
GRIA3
KNS2
MAP2K1
MAPK1
MAPK8IP1
PCSK1
PRDX2
RGS4
SNCA
UCHL1
VSNL1
YWHAZ
How many of these are involved in CNS development or function – From GO
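The gene list above is what remains after intersecting three independent gene sets (CNS-overexpressed orthologs, genes downregulated in Alzheimer's, and literature-mined genes). That step can be sketched with Python sets; the three inputs below are tiny illustrative stand-ins in which APP, CDK5 and SNCA come from the overlap list and the OTHER_n names are hypothetical placeholders, and `overlap` is a hypothetical helper.

```python
def overlap(*gene_sets):
    """Sorted list of genes present in every one of the given sets."""
    sets = [set(s) for s in gene_sets]
    return sorted(set.intersection(*sets)) if sets else []


# Toy stand-ins for the three real lists (940, 1222 and 299 genes).
cns_overexpressed = {"APP", "CDK5", "SNCA", "OTHER_1"}
downregulated_in_ad = {"APP", "CDK5", "SNCA", "OTHER_2"}
literature_mined = {"APP", "CDK5", "SNCA", "OTHER_3"}

shared = overlap(cns_overexpressed, downregulated_in_ad, literature_mined)
```

The resulting shortlist could then be checked against GO annotations to answer the follow-up question about CNS development or function.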
Sequence Context
List of Transcription Factor Binding Sites
http://concise-scanner.cchmc.org
To identify putative gene targets of transcription factors
Human Mouse
GenomeTrafac Coordinates
Genome Assembly Coordinates
Conserved binding sites between human and mouse
Trachea & bronchial epithelial cells
Prostate
Gnf Expression Atlas - Human
• PDEF is an ETS transcription factor expressed in prostate epithelial cells.
• Nkx3.1 interacts with SPDEF or Prostate derived Ets factor.
GenomeTrafac Tracks
http://polydoms.cchmc.org
Goals – Summary………
• Enable discovery of novel disease-gene relationships
• Facilitate discovery of disease-pathway relationships
• Enable discovery of novel pathways and targets and associate them with disease processes
• Help researchers generate testable hypotheses
• Support efforts to prioritize research
• Facilitate meta-analyses
New/Future Directions…….
Biological/Genomics
• Gene regulation by microRNAs (miRNAs):
– ~22-nucleotide non-coding RNAs that primarily act post-transcriptionally by suppressing mRNAs
– At least 1% of the transcripts in the genome code for miRNAs
– miRNAs have at least 20-30% of the coding genes as their targets
– miRNAs are implicated in various cellular processes, such as cell fate determination, cell death, and tumorigenesis (Bartel 2004).
– E.g.: CREB-regulated miRNA regulates neuronal morphogenesis (Vo et al 2005)
Computational
• Semantic Web (SW): “A vision for the next generation web in which data from multiple sources described with rich semantics are integrated to enable processing by humans as well as software agents” (SW Life Sciences)
• Semantic Web Languages
– RDF (Resource Description Framework)
– RDFS (RDF Schema)
– OWL (Web Ontology Language)
– SPARQL (Semantic Web query language)
• Prioritization and ranking of entities in novel gene networks, and inferencing
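The Semantic Web idea of describing data as subject-predicate-object statements and querying them by pattern can be illustrated without any RDF library. The sketch below is plain Python, not RDF or SPARQL: the predicate names `associated_with` and `participates_in`, the `match` and `disease_pathways` helpers, and the choice of triples (taken from the hepatocellular carcinoma slides) are assumptions made for this example.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("TP53", "associated_with", "Hepatocellular carcinoma"),
    ("TP53", "participates_in", "p53 Signaling Pathway"),
    ("CTNNB1", "associated_with", "Hepatocellular carcinoma"),
    ("CTNNB1", "participates_in", "WNT Signaling Pathway"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def disease_pathways(triples, disease):
    """Chain two patterns: disease -> genes -> pathways."""
    genes = {s for s, _, _ in match(triples, p="associated_with", o=disease)}
    return sorted({o for s, _, o in match(triples, p="participates_in")
                   if s in genes})


hcc = disease_pathways(TRIPLES, "Hepatocellular carcinoma")
```

Because every source contributes statements in the same three-part shape, adding a new databank means adding triples rather than redesigning a schema.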
Take-home messages
• Networks and integration of databases are keys to success in Bioinformatics.
• Combining data computation and data integration into a single cohesive whole will increase the efficiency of research effort
– by reducing the serendipity and hit-and-miss nature of empirical research, and
– by giving biomedical researchers valuable clues on their choice of experiments, given limitations of funds, manpower and time.
• Researchers/users have to know what resources are available, how to access and use them, and what their limitations are.
Integrative Genomics - Biomedical Informatics: the Ultimate Goal…….
[Figure: the Medical Informatics world (Patient Records, Disease Database, PubMed, Clinical Trials) and the Bioinformatics world (Genome, Transcriptome, Proteome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome) joined through the Disease World (OMIM)]
►Personalized Medicine
►Decision Support System
►Outcome Predictor
►Course Predictor
►Diagnostic Test Selector
►Clinical Trials Design
►Hypothesis Generator…..
http://sbw.kgi.edu/
Thank You!
“To him who devotes his life to science, nothing can give more happiness than increasing the number of discoveries, but his cup of joy is full when the results of his studies immediately find practical applications”
— Louis Pasteur
http://anil.cchmc.org (under presentations)