Introduction to data integration in bioinformatics
Introduction to Data Integration in Bioinformatics
Yan Xu
Dec. 2013
Introduction to Data Integration in Bioinformatics Dec. 2013
Data Integration
miRNA
Copy Number
GeneExpression
Clinical data
Pathways
Methylation
Epigenome
Recent Publications
R. Louhimo, T. Lepikhova, O. Monni, and S. Hautaniemi, “Comparative analysis of algorithms for integration of copy number and expression data,” Nature Methods, 2012.
The ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature, 2012.
S. Aerts and J. Cools, “Cancer: Mutations close in on gene regulation,” Nature, Jul. 2013.
V. J. H. Powell and A. Acharya, “Disease Prevention: Data Integration,” Science, Dec. 2012.
A. Vinayagam, Y. Hu, M. Kulkarni, C. Roesel, R. Sopko, S. E. Mohr, and N. Perrimon, “Protein Complex–Based Analysis Framework for High-Throughput Data Sets,” Science Signaling, Feb. 2013.
DNA: the molecule of life
Protein-coding DNA makes up barely 2% of the human genome. About 80% of the bases in the genome may be expressed without an identified function.
Gene Expression
cap
Poly-A tail
DNA: Two long biopolymers made of nucleotides, composed of nucleobases:
A: Adenine
T: Thymine
C: Cytosine
G: Guanine
Sequence of amino acids
start codon
termination codon
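The labels on this slide (start codon, termination codon, sequence of amino acids) can be illustrated with a minimal Python sketch. It is not taken from the slides. It assumes the input is the coding DNA strand (T instead of U), contains only the letters A, C, G and T, and that translation starts at the first ATG found. The function name `translate_first_orf` and the example sequence are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_first_orf(coding_dna):
    """Translate from the first ATG to the first in-frame termination codon.

    Returns one-letter amino-acid codes, or "" if there is no start codon.
    Translation also stops if the sequence ends before a termination codon.
    Assumption: the input contains only A, C, G and T.
    """
    seq = coding_dna.upper()
    start = seq.find("ATG")  # start codon
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":  # termination codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence (hypothetical): CC | ATG GCC TAA | GG
print(translate_first_orf("CCATGGCCTAAGG"))  # prints "MA"
```

The sketch reads only one frame of one strand; a real open-reading-frame search would usually check all frames on both strands.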
Microarray
Reverse Transcription
Result
Next generation RNA-sequencing
EST: Expressed Sequence Tag
Reference: Open Reading Frame
Figure (animation): Reads of a single type of nucleotide at one moment. Axes: Time; The number of nucleotide reads at one moment.
DNA structural variation: Copy number
CNV (Copy Number Variation):
• CNVs account for roughly 12% of human genomic DNA
• About 0.4% of the genomes of unrelated people differ with respect to copy number
• Each variation ranges from about 1,000 nucleotide bases to several megabases
• Inherited or caused by de novo mutation (not inherited from either parent)
Relation to disease:
• EGFR (epidermal growth factor receptor) copy number can be higher than normal in non-small cell lung cancer (Cappuzzo et al., Journal of the National Cancer Institute, 2005).
• Higher copy number of CCL3L1 decreases susceptibility to HIV (Gonzalez et al., Science, 2005).
• Low copy number of FCGR3B increases susceptibility to inflammatory autoimmune disorders (Aitman et al., Nature, 2006).
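"Higher" and "low" copy number are often expressed as a log2 ratio of a sample signal to a reference signal. The slides do not describe a calling method, so the following is only a rough Python sketch of that convention. The thresholds of ±0.3, the diploid reference of 2 copies, and all function names are illustrative assumptions.

```python
import math


def copy_number_log2_ratio(sample_signal, reference_signal):
    """log2 ratio of sample to reference signal for one genomic segment."""
    if sample_signal <= 0 or reference_signal <= 0:
        raise ValueError("signals must be positive")
    return math.log2(sample_signal / reference_signal)


def classify_segment(log2_ratio, gain_threshold=0.3, loss_threshold=-0.3):
    """Label a segment as gain, loss or neutral.

    The default thresholds are assumptions for illustration only.
    """
    if log2_ratio >= gain_threshold:
        return "gain"
    if log2_ratio <= loss_threshold:
        return "loss"
    return "neutral"


def estimate_copies(log2_ratio, reference_copies=2):
    """Convert a log2 ratio back to an estimated copy count (diploid reference assumed)."""
    return reference_copies * 2 ** log2_ratio


# Made-up signals: the sample shows twice the reference signal.
ratio = copy_number_log2_ratio(60, 30)
print(ratio, classify_segment(ratio), estimate_copies(ratio))
```

Real data would need normalization and segmentation before a step like this; the sketch only shows the arithmetic behind "gain" and "loss".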
Epigenome: DNA Methylation
• Addition of a methyl group to the C or A DNA nucleotides
• Typically permanent and unidirectional
• Can be copied across cell divisions or even passed on to offspring
Why do we look so different even when we have exactly identical genes?
What, when and where directions
Epigenome Genome
miRNA (microRNA)
• Perfect complementary binding leads to mRNA degradation of the target gene
• Imperfect pairing inhibits translation of mRNA to protein
RISC: RNA-induced silencing complex. Uses miRNA as a template for recognizing complementary mRNA
The genome has protein-coding genes and also genes that code for small RNA:
• e.g., “transfer RNA”, which is used in translation, is coded by genes
• e.g., “ribosomal RNA”, which forms part of the structure of the ribosome, is also coded by genes
miRNA: 21-22 nucleotide non-coding RNA
miRNA Pathway
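The distinction this slide draws between perfect and imperfect pairing can be sketched in Python. This is a toy illustration, not a target-prediction method. It assumes both sequences are written 5' to 3', have equal length, contain only A, C, G and U, and that the miRNA is already known to bind the site. G:U wobble pairs are counted as mismatches. The function names and sequences are hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def count_mismatches(mirna, target_site):
    """Count positions where the miRNA does not Watson-Crick pair with the site.

    The two strands pair antiparallel, so the mRNA target site is read in
    reverse against the miRNA.
    """
    if len(mirna) != len(target_site):
        raise ValueError("sequences must have equal length")
    return sum(
        RNA_COMPLEMENT[m] != t
        for m, t in zip(mirna, reversed(target_site))
    )


def predicted_outcome(mirna, target_site):
    """Map perfect vs. imperfect pairing to the two outcomes named on the slide."""
    if count_mismatches(mirna, target_site) == 0:
        return "mRNA degradation"
    return "translation inhibition"


# Toy 4-nucleotide sequences (real miRNAs are about 21-22 nucleotides long).
print(predicted_outcome("UAGC", "GCUA"))  # perfect complementarity
print(predicted_outcome("UAGC", "GCUU"))  # one mismatch
```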
Clinical data
General clinical checkup data: temperature, blood pressure;
Pathology: blood test, antibody test;
Radiology: X-ray, CT (computed tomography), ultrasound, MRI (magnetic resonance imaging).
Texture Heterogeneity
High score Low score
Internal Arteries
High score Low score
Challenges of data integration analysis
• Large, highly connected data sources and ontologies
• Heterogeneity: functions, structures, data access and analysis methods, dissemination formats
• Incomplete or overlapping data sources
• Frequent changes
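The "incomplete or overlapping data sources" challenge can be made concrete with a small Python sketch. It assumes, purely for illustration, that each data type is already reduced to one value per sample and keyed by a shared sample ID. The function names, source names and values below are made up.

```python
def integrate_by_sample(sources):
    """Outer-join several {sample_id: value} tables on sample ID.

    A sample missing from a source gets None for that source, so incomplete
    coverage stays visible instead of silently dropping samples.
    """
    sample_ids = sorted(set().union(*(table.keys() for table in sources.values())))
    return {
        sample_id: {name: table.get(sample_id) for name, table in sources.items()}
        for sample_id in sample_ids
    }


def complete_samples(integrated):
    """Return the sample IDs that have a value in every source."""
    return [
        sample_id
        for sample_id, row in integrated.items()
        if all(value is not None for value in row.values())
    ]


# Made-up measurements for three hypothetical samples.
sources = {
    "copy_number": {"S1": 4, "S2": 2},
    "expression": {"S1": 8.5, "S3": 3.1},
    "methylation": {"S1": 0.7, "S2": 0.2, "S3": 0.9},
}
integrated = integrate_by_sample(sources)
print(integrated["S2"])
print(complete_samples(integrated))
```

An outer join keeps every sample and marks the gaps. Restricting the analysis to complete samples is then an explicit choice rather than a side effect of the merge.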
Case I
E. Segal et al., “Decoding global gene expression programs in liver cancer by noninvasive imaging,” Nature Biotechnology, May 2007.
E. Segal et al., “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Nature Genetics, 2003.
Case II
O. Gevaert et al., “Non–Small Cell Lung Cancer: Identifying Prognostic Imaging Biomarkers by Leveraging Public Gene Expression Microarray Data—Methods and Preliminary Results,” Radiology, Aug. 2012.