Introduction to data integration in bioinformatics

Introduction to Data Integration in Bioinformatics Yan Xu Dec. 2013

A brief introduction to the basic concepts and terms in bioinformatics

Page 1: Introduction to data integration in bioinformatics

Introduction to Data Integration in Bioinformatics

Yan Xu

Dec. 2013

Page 2: Introduction to data integration in bioinformatics

Data Integration

miRNA

Copy Number

GeneExpression

Clinical data

Pathways

Methylation

Epigenome

Page 3: Introduction to data integration in bioinformatics

Recent Publications

R. Louhimo, T. Lepikhova, O. Monni, and S. Hautaniemi, “Comparative analysis of algorithms for integration of copy number and expression data,” Nature Methods, 2012.

The ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature, 2012.

S. Aerts and J. Cools, “Cancer: Mutations close in on gene regulation,” Nature, Jul. 2013.

V. J. H. Powell and A. Acharya, “Disease Prevention: Data Integration,” Science, Dec. 2012.

A. Vinayagam, Y. Hu, M. Kulkarni, C. Roesel, R. Sopko, S. E. Mohr, and N. Perrimon, “Protein Complex–Based Analysis Framework for High-Throughput Data Sets,” Science Signaling, Feb. 2013.

Page 4: Introduction to data integration in bioinformatics

DNA the molecule of life

Protein-coding DNA makes up barely 2% of the human genome. About 80% of the bases in the genome may be expressed, yet have no identified function.

Page 5: Introduction to data integration in bioinformatics

Gene Expression

cap

Poly-A tail

DNA: two long biopolymers made of nucleotides, each carrying one of four nucleobases:

• A: Adenine

• T: Thymine

• C: Cytosine

• G: Guanine

Sequence of amino acids

start codon

termination codon
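
The labels on this slide (start codon, termination codon, sequence of amino acids) can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a coding-strand DNA string containing only A, C, G and T, and a single reading frame that begins at the first ATG. The function names and the example sequence are hypothetical and not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate from the first start codon (ATG) to the first termination codon.

    Returns one-letter amino acid codes; returns "" if there is no start codon.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # termination codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example: two leading bases, then ATG GCC TAA, then two trailing bases.
example = "CCATGGCCTAAGG"
print(transcribe(example))
print(translate(example))
```

Real genes add complications this sketch ignores, such as introns, multiple reading frames and ambiguous bases.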

Page 6: Introduction to data integration in bioinformatics

Microarray

Reverse Transcription

Result

Page 7: Introduction to data integration in bioinformatics

Next generation RNA-sequencing

EST: Expressed Sequence Tag 

Reference: Open Reading Frame

Reads of a single type of nucleotide at one moment

(animation)

Figure axes: Time; The number of nucleotide reads at one moment

Page 8: Introduction to data integration in bioinformatics

DNA structural variation: Copy number

CNV (Copy Number Variation):

• Accounts for about 12% of human genomic DNA

• About 0.4% of the genomes of unrelated people differ with respect to copy number

• Ranges from 1,000 nucleotide bases to several megabases

• Inherited or caused by de novo mutation (not inherited from either parent)

Relation to disease:

Higher EGFR (epidermal growth factor receptor) copy number is found in non-small cell lung cancer. (Cappuzzo et al. Journal of the National Cancer Institute, 2005)

Higher copy number of CCL3L1 decreases susceptibility to HIV. (Gonzalez et al. Nature, 2005)

Low copy number of FCGR3B increases susceptibility to inflammatory autoimmune disorders (Aitman et al. Nature, 2006).
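
To make "higher" and "low" copy number concrete, here is a minimal Python sketch. The slides do not say how copy number is measured. The sketch assumes, purely for illustration, that it is estimated from the ratio of sequencing read depth in a sample to read depth in a reference with two copies. The function names, the tolerance and the depths are hypothetical.

```python
def estimate_copy_number(sample_depth, reference_depth, reference_copies=2):
    """Estimate copy number by scaling the reference copy count by the depth ratio.

    Assumption: read depth is proportional to the number of copies present.
    """
    if reference_depth <= 0:
        raise ValueError("reference_depth must be positive")
    return reference_copies * sample_depth / reference_depth


def classify_copy_number(copy_number, reference_copies=2, tolerance=0.5):
    """Label a region as a gain, a loss, or neutral.

    The tolerance is an arbitrary illustrative threshold, not a published cutoff.
    """
    if copy_number > reference_copies + tolerance:
        return "gain"
    if copy_number < reference_copies - tolerance:
        return "loss"
    return "neutral"


# Hypothetical depths for three regions: (sample depth, reference depth).
regions = {"region_a": (300, 100), "region_b": (50, 100), "region_c": (98, 100)}
for name, (sample, reference) in regions.items():
    copies = estimate_copy_number(sample, reference)
    print(name, copies, classify_copy_number(copies))
```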

Page 9: Introduction to data integration in bioinformatics

Epigenome: DNA Methylation

• Addition of a methyl group to the C or A DNA nucleotides

• Permanent and unidirectional

• Can be copied across cell divisions or even passed on to offspring

Why do we look so different even when we have exactly identical genes?

What, when and where directions

Epigenome

Genome

Page 10: Introduction to data integration in bioinformatics

miRNA (microRNA)

• Perfect complementary binding leads to mRNA degradation of the target gene

• Imperfect pairing inhibits translation of mRNA to protein

RISC: RNA-induced silencing complex. Use miRNA as a template for recognizing complementary mRNA
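
The two pairing outcomes above can be sketched in Python as a simple complementarity check. This is an illustration under stated assumptions, not a target-prediction method. It compares a miRNA with an mRNA site of equal length, counts only Watson-Crick pairs, and uses an arbitrary mismatch limit (`max_mismatches`) to decide when pairing is too weak to count as binding. All names and the example sequence are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def classify_pairing(mirna, target_site, max_mismatches=6):
    """Classify how a miRNA pairs with an equal-length mRNA site.

    Both sequences are written 5' to 3'. A site that pairs perfectly equals
    the reverse complement of the miRNA.
    """
    if len(mirna) != len(target_site):
        raise ValueError("miRNA and target site must have the same length")
    expected = reverse_complement(mirna)
    mismatches = sum(a != b for a, b in zip(expected, target_site.upper()))
    if mismatches == 0:
        return "degradation"             # perfect complementary binding
    if mismatches <= max_mismatches:
        return "translation_inhibition"  # imperfect pairing
    return "no_binding"                  # assumption: too many mismatches


# Invented 22-nucleotide sequence, used only for illustration.
mirna = "AUGGCUAACGUUAGCCAUGGCA"
perfect_site = reverse_complement(mirna)
print(classify_pairing(mirna, perfect_site))
```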

The genome has protein-coding genes, and also has genes that code for small RNA:

• e.g., “transfer RNA”, which is used in translation, is coded by genes

• e.g., “ribosomal RNA”, which forms part of the structure of the ribosome, is also coded by genes

miRNA: 21-22 nucleotide non-coding RNA

miRNA Pathway

Page 11: Introduction to data integration in bioinformatics

Clinical data

General clinical checkup data: temperature, blood pressure;

Pathology: blood test, antibody test;

Radiology: X-ray, CT (Computed tomography), Ultrasound, MRI (Magnetic resonance imaging).

Figure panels: Texture Heterogeneity (high score, low score); Internal Arteries (high score, low score)

Page 12: Introduction to data integration in bioinformatics

Challenges of data integration analysis

• Large highly connected data sources and ontologies

• Heterogeneity: functions, structures, data access and analysis methods, dissemination formats.

• Incomplete or overlapping data sources

• Frequent changes
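
Two of the challenges above, heterogeneous formats and incomplete or overlapping sources, can be illustrated with a minimal Python sketch that joins per-sample tables from different data types. It assumes each source is a mapping from sample identifier to value, and that identifiers differ only in case and surrounding whitespace. The function names, sample identifiers and values are hypothetical.

```python
def normalize_id(sample_id):
    """Bring sample identifiers from different sources into one format."""
    return sample_id.strip().upper()


def integrate(sources):
    """Outer-join several {sample_id: value} tables on the sample identifier.

    Samples missing from a source get None for that data type.
    """
    cleaned = {
        name: {normalize_id(sid): value for sid, value in table.items()}
        for name, table in sources.items()
    }
    all_ids = sorted(set().union(*(table.keys() for table in cleaned.values())))
    return {
        sid: {name: table.get(sid) for name, table in cleaned.items()}
        for sid in all_ids
    }


def complete_cases(integrated):
    """Keep only the samples that have a value in every source."""
    return {
        sid: row
        for sid, row in integrated.items()
        if all(value is not None for value in row.values())
    }


# Hypothetical sources that overlap only partly and format identifiers differently.
expression = {"S1": 7.2, "s2 ": 5.1}
copy_number = {"S2": 3, "S3": 1}
integrated = integrate({"expression": expression, "copy_number": copy_number})
print(integrated)
print(list(complete_cases(integrated)))
```

Keeping the missing entries as None rather than dropping them at the join leaves the choice between complete-case analysis and imputation to a later step.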

Page 13: Introduction to data integration in bioinformatics

Case I

E. Segal et al., “Decoding global gene expression programs in liver cancer by noninvasive imaging,” Nature Biotechnology, May 2007.

E. Segal et al., “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Nature Genetics, 2003.

Page 14: Introduction to data integration in bioinformatics

Case II

O. Gevaert et al., “Non–Small Cell Lung Cancer: Identifying Prognostic Imaging Biomarkers by Leveraging Public Gene Expression Microarray Data—Methods and Preliminary Results,” Radiology, Aug. 2012.