CS491JH: Data Mining in Bioinformatics
Introduction to Microarray Technology
•Technology Background
•Data Processing Procedure
•Characteristics of Data
•Data integration and Data mining
Substrates for High Throughput Arrays
Nylon Membrane: single label, P33
Glass Slides: dual label, Cy3 and Cy5
GeneChip: single label, biotin-streptavidin
GeneChip® Probe Arrays
Image of Hybridized Probe Array
GeneChip Probe Array: 1.28 cm
Hybridized Probe Cell: 24 µm
>200,000 different complementary probes
Millions of copies of a specific oligonucleotide probe
Oligonucleotide probe
Single stranded, labeled RNA target
GeneChip® Expression Array Design
Gene Sequence (5´ to 3´)
Multiple oligo probes
Probes designed to be Perfect Match
Probes designed to be Mismatch
Procedures for Target Preparation
Cells
Poly(A)+ / Total RNA
cDNA
IVT (Biotin-UTP, Biotin-CTP)
Labeled transcript
Fragment (heat, Mg2+)
Labeled fragments
Hybridize (16 hours)
Wash & Stain
Scan
Microarray Technology
NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab
Printing Arrays on 50 slides
Cells from condition A, Cells from condition B
Total or mRNA
cDNA
Label Dye 1, Label Dye 2
Mix
Ratio of expression of genes from two sources: equal, over, under
NSF / U of Illinois Microarray Workshop, Steve Clough / Vodkin Lab
GSI Lumonics
Beta Actin
PKG
HPRT
Beta 2 microglobulin
Rubisco
AB binding protein
Major latex protein homologue (MSG)
Cattle and Soy Controls
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng), and MSG (0.05 ng) was labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
Fetal Spleen (Cy3) vs. Adult Spleen (Cy5); labeled spots: IgM, IgM heavy chain, MYLK, COL1A2
Placenta vs. Brain: 3800 Cattle Placenta Array (Cy3, Cy5)
GenePix Image Analysis Software
GeneFilter Comparison Report
GeneFilter names: O2#1 8-20-99adjfinal, N2#1finaladj
Intensities: raw and normalized

ORF NAME   GENE    CHRM  F  G  R  C  RAW GF1  RAW GF2   NORM GF1   NORM GF2  DIFFERENCE  RATIO
YAL001C    TFC3       1  1  A  1  2    12.03     7.38     403.83     209.79      194.04   1.92
YBL080C    PET112     2  1  A  1  3    53.21    35.62   1,786.11   1,013.13      772.98   1.76
YBR154C    RPB5       2  1  A  1  4    79.26    78.51   2,660.73   2,232.86      427.87   1.19
YCL044C    -          3  1  A  1  5    53.22    44.66   1,786.53   1,270.12      516.41   1.41
YDL020C    SON1       4  1  A  1  6    23.80    20.34     799.06     578.42      220.64   1.38
YDL211C    -          4  1  A  1  7    17.31    35.34     581.00   1,005.18     -424.18  -1.73
YDR155C    CPH1       4  1  A  1  8   349.78   401.84  11,741.98  11,428.10      313.88   1.03
YDR346C    -          4  1  A  1  9    64.97    65.88   2,180.87   1,873.67      307.21   1.16
YAL010C    MDM10      1  1  A  2  2    13.73     9.61     461.03     273.36      187.67   1.69
YBL088C    TEL1       2  1  A  2  3     8.50     7.74     285.38     220.01       65.37   1.30
YBR162C    -          2  1  A  2  4   226.84   293.83   7,614.82   8,356.39     -741.57  -1.10
YCL052C    PBN1       3  1  A  2  5    41.28    34.79   1,385.79     989.41      396.38   1.40
YDL028C    MPS1       4  1  A  2  6     7.95     6.24     266.99     177.34       89.65   1.51
YDL219W    -          4  1  A  2  7    16.08    11.33     539.93     322.20      217.74   1.68
YDR163W    -          4  1  A  2  8    19.13    14.19     642.17     403.56      238.61   1.59
YDR354W    TRP4       4  1  A  2  9    62.24    40.74   2,089.48   1,158.64      930.84   1.80
YAL018C    -          1  1  A  3  2    10.72     8.81     359.75     250.60      109.15   1.44
YBL096C    -          2  1  A  3  3    10.91     8.98     366.40     255.40      111.00   1.43
YBR169C    SSE2       2  1  A  3  4    17.33    27.81     581.80     790.84     -209.05  -1.36
YCL060C    -          3  1  A  3  5    17.99    24.75     603.96     703.75      -99.79  -1.17
YDL036C    -          4  1  A  3  6    14.22     8.86     477.39     251.94      225.44   1.89
YDL227C    HO         4  1  A  3  7    25.61    31.52     859.71     896.46      -36.75  -1.04
YDR171W    HSP42      4  1  A  3  8   102.08    98.37   3,426.83   2,797.58      629.25   1.22
YDR362C    -          4  1  A  3  9    16.32    12.95     547.96     368.39      179.57   1.49
YAL026C    DRS2       1  1  A  4  2    11.32     7.97     379.85     226.53      153.33   1.68
YBL102W    SFT2       2  1  A  4  3    55.88    63.74   1,875.82   1,812.81       63.02   1.03
YBR177C    -          2  1  A  4  4    63.31    29.03   2,125.20     825.60    1,299.60   2.57
YCL068C    -          3  1  A  4  5     8.33     4.47     279.51     127.16      152.35   2.20
YDL044C    MTF2       4  1  A  4  6    11.73     6.96     393.88     198.07      195.81   1.99
YDL235C    YPD1       4  1  A  4  7    38.71    30.20   1,299.33     858.83      440.50   1.51
YDR179C    -          4  1  A  4  8    12.77    11.05     428.60     314.12      114.48   1.36
YDR370C    -          4  1  A  4  9    16.70    15.30     560.62     435.13      125.49   1.29
YAL034C    FUN19      1  1  A  5  2    20.89    24.21     701.32     688.59       12.73   1.02
YBL111C    -          2  1  A  5  3    22.38    13.67     751.39     388.69      362.70   1.93
YBR185C    MBA1       2  1  A  5  4    38.42    19.96   1,289.61     567.78      721.83   2.27
YCLX03C    -          3  1  A  5  5     8.69     3.66     291.77     104.11      187.66   2.80
YDL052C    SLC1       4  1  A  5  6    52.37    49.87   1,758.05   1,418.33      339.73   1.24
YDL243C    -          4  1  A  5  7    15.56    12.95     522.24     368.30      153.94   1.42
YDR186C    -          4  1  A  5  8    16.48    15.01     553.30     426.75      126.55   1.30
YDR378C    -          4  1  A  5  9    31.13    28.08   1,045.01     798.50      246.50   1.31
YAL040C    CLN3       1  1  A  6  2   126.65   107.34   4,251.70   3,052.61    1,199.08   1.39
YBR006W    -          2  1  A  6  3    22.74    11.10     763.49     315.55      447.94   2.42
YBR193C    -          2  1  A  6  4    14.81    15.55     497.07     442.20       54.87   1.12
YCLX11W    -          3  1  A  6  5   161.96   175.34   5,436.86   4,986.41      450.44   1.09
YDL060W    -          4  1  A  6  6    29.84    37.13   1,001.65   1,055.98      -54.34  -1.05
YDR003W    -          4  1  A  6  7    23.99    23.22     805.48     660.25      145.22   1.22
YDR194C    MSS116     4  1  A  6  8    66.58    47.16   2,235.07   1,341.29      893.78   1.67
YDR386W    -          4  1  A  6  9    11.27     5.75     378.27     163.46      214.81   2.31
YAL047C    -          1  1  A  7  2    15.54    11.30     521.74     321.28      200.46   1.62
YBR012W-B  -          2  1  A  7  3    54.70    79.97   1,836.29   2,274.15     -437.86  -1.24
YBR201W    DER1       2  1  A  7  4    21.67    19.57     727.49     556.64      170.85   1.31
YCR007C    -          3  1  A  7  5    25.02    15.96     840.01     453.76      386.25   1.85
YDL068W    -          4  1  A  7  6    18.32    13.11     614.83     372.78      242.05   1.65
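Judging from the rows shown in the report, the DIFFERENCE column is the normalized GF1 intensity minus the normalized GF2 intensity, and the RATIO column is a signed fold change: the larger value divided by the smaller, made negative when GF2 is the larger one. The Python sketch below reproduces that convention. The function name is hypothetical, and the rule is inferred from the numbers rather than taken from the GeneFilter software documentation.

```python
def difference_and_ratio(gf1, gf2):
    """Return (difference, signed fold change) for two normalized intensities.

    Assumed convention, inferred from the report rows:
      difference = gf1 - gf2
      ratio      = gf1 / gf2        if gf1 >= gf2
                 = -(gf2 / gf1)     otherwise
    """
    if gf1 <= 0 or gf2 <= 0:
        raise ValueError("intensities must be positive")
    difference = gf1 - gf2
    ratio = gf1 / gf2 if gf1 >= gf2 else -(gf2 / gf1)
    return round(difference, 2), round(ratio, 2)


# Two rows from the report: YAL001C (higher in GF1) and YDL211C (higher in GF2).
print(difference_and_ratio(403.83, 209.79))   # (194.04, 1.92)
print(difference_and_ratio(581.00, 1005.18))  # (-424.18, -1.73)
```

With this convention the ratio never falls between -1 and 1, so values close to 1 or -1 both mean little change.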
Microarray Data Process
1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model building
6. Data Mining (clustering, pattern recognition, etc.)
7. Validation
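Steps 3 and 4 of the process can be sketched in a few lines of Python. This is only one possible approach: it assumes normalization by rescaling each array to a common mean intensity, and filtering by a minimum intensity floor. The function names, the target mean, the floor, and the sample values are all made up for illustration.

```python
from statistics import mean


def scale_normalize(raw, target_mean=1000.0):
    """Rescale one array's raw intensities so their mean equals target_mean."""
    factor = target_mean / mean(raw)
    return [value * factor for value in raw]


def filter_informative(genes, array_a, array_b, floor=200.0):
    """Keep genes whose normalized intensity reaches `floor` on at least one array."""
    return [
        (gene, a, b)
        for gene, a, b in zip(genes, array_a, array_b)
        if max(a, b) >= floor
    ]


# Made-up raw intensities for four hypothetical genes on two arrays.
genes = ["g1", "g2", "g3", "g4"]
raw_a = [10.0, 50.0, 80.0, 260.0]
raw_b = [5.0, 40.0, 75.0, 280.0]

norm_a = scale_normalize(raw_a)   # step 3: "clean" data
norm_b = scale_normalize(raw_b)
kept = filter_informative(genes, norm_a, norm_b)  # step 4: informative data
print([gene for gene, _, _ in kept])  # ['g2', 'g3', 'g4']
```

Here g1 is dropped because its signal is weak on both arrays, where a ratio would mostly reflect noise.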
Scatterplot of Normalized Data
Axes: Adult, Fetal
Highlighted thresholds: > 0.3 and < -0.3
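The cutoffs on the scatterplot can be applied programmatically. The sketch below assumes the thresholds are log10 ratios of adult to fetal intensity, which the slide does not state; under that assumption 0.3 corresponds to roughly a two-fold change. The function name and labels are hypothetical.

```python
import math


def classify_by_log_ratio(adult, fetal, cutoff=0.3):
    """Label a gene by its log10(adult / fetal) ratio against a symmetric cutoff."""
    log_ratio = math.log10(adult / fetal)
    if log_ratio > cutoff:
        return "higher in adult"
    if log_ratio < -cutoff:
        return "higher in fetal"
    return "unchanged"


# Made-up normalized intensities.
print(classify_by_log_ratio(2000.0, 500.0))  # higher in adult
print(classify_by_log_ratio(500.0, 2000.0))  # higher in fetal
print(classify_by_log_ratio(1000.0, 900.0))  # unchanged
```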
Characteristics of Data
Data can be viewed as an N×M matrix (N >> M):
N is the number of genes
M is the number of data points for each gene
Or as an N×(M+K) matrix:
K is the number of features describing each gene (genome location, functional description, metabolic pathway, etc.)
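One way to hold such data in code is a record per gene with M expression values and K annotation features. The Python sketch below is illustrative only: the class and function names are hypothetical, and the two sample genes reuse normalized values, gene names, and chromosome numbers from the GeneFilter report.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GeneRecord:
    orf: str
    expression: List[float]                                   # M data points
    features: Dict[str, str] = field(default_factory=dict)   # K annotations


def expression_matrix(records):
    """Return the N x M matrix as a list of rows, one row per gene."""
    return [record.expression for record in records]


def full_rows(records, feature_names):
    """Return the N x (M + K) view: expression values followed by features."""
    return [
        record.expression + [record.features.get(name, "") for name in feature_names]
        for record in records
    ]


records = [
    GeneRecord("YAL001C", [403.83, 209.79], {"gene": "TFC3", "chromosome": "1"}),
    GeneRecord("YBL080C", [1786.11, 1013.13], {"gene": "PET112", "chromosome": "2"}),
]

matrix = expression_matrix(records)
print(len(matrix), len(matrix[0]))  # 2 2  (N genes, M data points)
print(full_rows(records, ["gene", "chromosome"])[0])
# [403.83, 209.79, 'TFC3', '1']
```

In a real experiment N would be in the thousands while M stays small, which is the N >> M situation described above.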
Model for Data Analysis
•Gene Expression is a Dynamic Process
•Each microarray experiment is a snapshot of the process
•Basic biological knowledge is needed to build a model
For example:
Assumption: in most experiments, only a small set of genes (hundreds to thousands) is significantly affected.
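This assumption is what justifies a simple global normalization: if most genes are unchanged, the typical log ratio between two samples should be zero, so any overall offset can be treated as technical bias and subtracted. The Python sketch below uses median centering of log2 ratios as one such method. It is an illustration with made-up values and a hypothetical function name, not the procedure used in the course.

```python
import math
from statistics import median


def median_center_log_ratios(channel_a, channel_b):
    """Return log2(a / b) per gene, shifted so the median log ratio is zero."""
    log_ratios = [math.log2(a / b) for a, b in zip(channel_a, channel_b)]
    shift = median(log_ratios)
    return [log_ratio - shift for log_ratio in log_ratios]


# Made-up intensities: channel A is uniformly twice as bright for four genes
# (a dye or labeling bias), and only the last gene is truly changed.
channel_a = [200.0, 400.0, 800.0, 1600.0, 6400.0]
channel_b = [100.0, 200.0, 400.0, 800.0, 800.0]

print(median_center_log_ratios(channel_a, channel_b))
# [0.0, 0.0, 0.0, 0.0, 2.0]
```

After centering, the four unaffected genes sit at zero and the one changed gene stands out with a log2 ratio of 2, a four-fold difference.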
Data Mining
Need for Data Mining
•Data volumes are too large for traditional analysis methods
Large number of records and high-dimensional data
•Only a small portion of the data is analyzed
•The decision support process becomes more complex
Functions of Data Mining
•Use the data to build predictors: prediction, classification, deviation detection, segmentation
•Generate more sophisticated summaries and reports to aid understanding of the data: find clusters and partitions in the data
Data Mining Methods
Classification, Regression (Predictive Modeling)
Clustering (Segmentation)
Association Discovery (Summarization)
Change and deviation detection
Dependency Modeling
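As a small illustration of the first method, classification, the Python sketch below implements a nearest-centroid classifier: each class is summarized by the mean of its training expression profiles, and a new sample gets the label of the closest centroid. This is one simple technique chosen for illustration, not one named in the slides. The class labels and profiles are made up.

```python
import math


def centroid(profiles):
    """Mean expression profile of a list of equal-length profiles."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def train(labeled_profiles):
    """Map each class label to the centroid of its training profiles."""
    return {label: centroid(profiles) for label, profiles in labeled_profiles.items()}


def predict(centroids, profile):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Made-up two-gene expression profiles for two hypothetical sample classes.
training = {
    "class A": [[5.0, 1.0], [6.0, 2.0]],
    "class B": [[1.0, 5.0], [2.0, 6.0]],
}
model = train(training)
print(model["class A"])             # [5.5, 1.5]
print(predict(model, [5.0, 2.0]))   # class A
print(predict(model, [1.5, 4.0]))   # class B
```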
Information Visualization
Cholesterol Biosynthesis
Cell Cycle
Immediate Early Response
Signaling and Angiogenesis
Wound Healing and Tissue Remodeling
Clustered display of data from time course of serum stimulation of primary human fibroblasts.
Eisen et al. Proc. Natl. Acad. Sci. USA 95 (1998) pg 14865
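Clustered displays of this kind group genes whose expression profiles rise and fall together. The Python sketch below shows the general idea with agglomerative clustering that uses Pearson correlation as the similarity measure and mean pairwise correlation as the linkage rule. It is a generic variant for illustration and not necessarily the exact procedure of the cited paper. The gene names and values are made up.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined if either is constant)."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def average_linkage(profiles, n_clusters):
    """Merge the most similar clusters until n_clusters remain; return gene-name lists."""
    clusters = [[name] for name in profiles]

    def similarity(first, second):
        # Mean correlation over all gene pairs drawn from the two clusters.
        return mean(pearson(profiles[a], profiles[b]) for a in first for b in second)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up time courses: two genes rising over time, two genes falling.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
print(average_linkage(profiles, 2))
# [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Correlation compares the shape of the profiles rather than their magnitude, so geneA and geneB cluster together even though geneB is expressed at about twice the level.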
Self Organizing Maps
Molecular Classification of Cancer
Gene Expression Profile of Aging and Its Retardation by Caloric Restriction
Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla
Expression Landscape of cell-cycle regulated genes in yeast
Multi-dimension data visualization