The aroma.affymetrix package - How to analyze huge ... · PDF fileThe aroma.affymetrix...
-
Upload
truongkhanh -
Category
Documents
-
view
218 -
download
3
Transcript of The aroma.affymetrix package - How to analyze huge ... · PDF fileThe aroma.affymetrix...
The aroma.affymetrix package-
How to analyze huge Affymetrix data sets in Ron a notebook
Henrik Bengtsson - [email protected]
University of California, Berkeley
Sept 28, 2006
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Affymetrix chips
Hybridization of target sequences to probes
Target sequence: ...GGTTACCATCGGTAAGTACTCAATGATTA...
Perfect-match (PM) probe: ATCATGCGCCATTCATGAGTTACTA
Hybridization of target sequences to probes
Target sequence: ...GGTTACCATCGGTAAGTACTCAATGATTA...
Perfect-match (PM) probe: ATCATGCGCCATTCATGAGTTACTAMismatch (MM) probe: ATCATGCGCCATACATGAGTTACTA
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).Scanned image: 9x9 (+ cell margins) pixels/cell.
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).Scanned image: 9x9 (+ cell margins) pixels/cell.
Analyzed image: (mean pixel, stddev pixel, #pixels).
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Amount of data per array
Affymetrix chip data is stored in “CEL” files.
Per cell [10 bytes]:
◮ Average pixel intensity [float = 4 bytes = 40%]
◮ Std dev pixel intensity [float = 4 bytes = 40%]
◮ # pixels [integer = 2 bytes = 20%]
With an array of 1600x1600 cells this sums up to 25.6 · 106 bytes= 24.4 MB/array1.
11 kB = 1024 bytes, 1Mb = 1024 kB = 1048576 bytes.
Example of different Affymetrix chips
Chip type Dimension # cells # Units File size
Hu6800 536x536 0.29 · 106 7129 2.9MB
HG U95Av2 640x640 0.41 · 106 12625 3.9MB
Mapping10K Xba142 658x658 0.43 · 106 10208 4.1MB
HG-U133A 712x712 0.51 · 106 22283 4.8MBHG-U133B 712x712 0.51 · 106 22645 4.8MB
Mapping10K Xba131 712x712 0.51 · 106 11564 4.8MB
Mouse 430 v2 1002x1002 1.00 · 106 45101 9.6MB
Mapping50K Hind240 1600x1600 2.56 · 106 57299 24.4MBMapping50K Xba240 1600x1600 2.56 · 106 59015 24.4MB
Mapping250K Nsp 2560x2560 6.55 · 106 262338 62.5MBMapping250K Sty 2560x2560 6.55 · 106 238378 62.5MB
HuEx-1 0-st-v2 2560x2560 6.55 · 106 1432154 62.5MB
*) Sizes of binary CEL files; ASCII CEL files are much larger.
Example of data sets
Some public data sets:Name # samples Chip type Size Signals
Affymetrix CEPH 100K 90x2 chips Mapping 100K 4.5GB 1.8GB
Affymetrix CEPH 500K 48x2 chips Mapping 500K 6.0GB 2.4GB
Some data sets we’ve been working on:Name # samples Chip type Size Signals
Slater 100K 22+21 chips Mapping 100K 1.0GB 0.4GB
Affymetrix CEPH 500K 270x2 chips Mapping 500K 33.8GB 13.5GB
Broad Institute 500K 96x2 chips Mapping 500K 11.7GB 4.7GB
Affymetrix Services Laboratory 190+154 chips Mapping 500K 21.1GB 8.4GB
Sinclair 500K 19+16 chips Mapping 500K 2.4GB 1.0GB
WEHI Exon 35x2 chips Human Exon 2.2GB 0.9GB
Some large data sets we know of:Name # samples Chip type Size Signals
Sanger’s 500K 15,000x2 chips Mapping 500K 1746GB 698GB
“Signals” = amount of RAM required for probe intensities only.
Example of data sets
Some public data sets:Name # samples Chip type Size Signals in R
Affymetrix CEPH 100K 90x2 chips Mapping 100K 4.5GB 3.6GB
Affymetrix CEPH 500K 48x2 chips Mapping 500K 6.0GB 4.8GB
Some data sets we’ve been working on:Name # samples Chip type Size Signals in R
Slater 100K 22+21 chips Mapping 100K 1.0GB 0.8GB
Affymetrix CEPH 500K 270x2 chips Mapping 500K 33.8GB 27.0GB
Broad Institute 500K 96x2 chips Mapping 500K 11.7GB 9.4GB
Affymetrix Services Laboratory 190+154 chips Mapping 500K 21.1GB 16.8GB
Sinclair 500K 19+16 chips Mapping 500K 2.4GB 2.0GB
WEHI Exon 35x2 chips Human Exon 2.2GB 1.8GB
Some large data sets we know of:Name # samples Chip type Size Signals in R
Sanger’s 500K 15,000x2 chips Mapping 500K 1746GB 1396GB
“Signals” = amount of RAM required for probe intensities only.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Computer systems
Operating systems:Operating system Max address space
Windows XP 4GB*Windows XP 64-bit 17 billion GBLinux 32-bit 4GBLinux 64-bit 17 billion GB*) A single application can only use 2GB.
Hardware:Hardware limits the amount of memory to about 32-64 GB.
Department of Statistics, UC Berkeley:The main computational Linux (64-bit) server has 16 GB RAM.
The R software
Overview of R:
◮ A free open source application (GPL).
◮ Great community forums.
◮ Widely used. Dominant in Bioinformatics applications.
Overview of the language:
◮ All floating-point values are stored as double:s.⇒ float (4 bytes) to double (8 bytes); 200% more RAM.
◮ Functional language (no pointers/reference variables)⇒ there often 2-3 copies of a data object at any time.
◮ A workaround is to use environments (or the R.oo package)⇒ one copy of each data object.
Main issue is memory! (not only R)
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Probeset design
In gene-expression analysis, the “activity” (amount of RNAtranscripts) of a single gene is measured by a about 10-50 probeseach quering a small fraction of the genes DNA.
A probeset (aka unit):
1 2 3
30 31 32
MM
PM
Probeset design
In gene-expression analysis, the “activity” (amount of RNAtranscripts) of a single gene is measured by a about 10-50 probeseach quering a small fraction of the genes DNA.
A probeset (aka unit):
1 2 3
30 31 32
MM
PMcell
Probeset design
In gene-expression analysis, the “activity” (amount of RNAtranscripts) of a single gene is measured by a about 10-50 probeseach quering a small fraction of the genes DNA.
A probeset (aka unit):
1 2 3
30 31 32
MM
PMcell
probe
pair
Probeset design
In gene-expression analysis, the “activity” (amount of RNAtranscripts) of a single gene is measured by a about 10-50 probeseach quering a small fraction of the genes DNA.
A probeset (aka unit):
1 2 3
30 31 32
MM
PMcell
probe
pair
Most of the modelling is done probeset by probeset.Thus this the #1 way we access data.
There are 10,000s probesets on each array.
Probe-level modelling (PLM)
Ignoring the mismatch probes, model the PMs only:1 2 3
14 15 16
PM
Consider a given SNP with PM probes k = 1, . . . ,K and samplesi = 1, . . . , I . The PLM used in RMA is:
log yik = αi + βk + εik
with PM signal yik , chip effect αi for sample i , probe affinity(sensitivity) βk for probe k , and random error ξik .
Probe-level modelling (PLM)
Ignoring the mismatch probes, model the PMs only:1 2 3
14 15 16
PM
Consider a given SNP with PM probes k = 1, . . . ,K and samplesi = 1, . . . , I . The PLM used in RMA is:
log yik = αi + βk + εik
with PM signal yik , chip effect αi for sample i , probe affinity(sensitivity) βk for probe k , and random error ξik .
The PLM used by dChip (MBEI) is:
yik = θi · φk · ξik .
Chip-definition files (CDFs)
Probesets are defined in CDF files (one per chip type), e.g.Mapping250K Nsp.CDF. A fraction of this CDF:
$ SNP_A-1782949:List of 3
..$ type : int 2
..$ direction: int 1
..$ groups :List of 2
.. ..$ A:List of 5
.. .. ..$ x : int [1:12] 651 652 458 457 940 939 ...
.. .. ..$ y : int [1:12] 1772 1772 1388 1388 221 ...
.. .. ..$ pbase: chr [1:12] "c" "g" "a" "t" ...
.. .. ..$ tbase: chr [1:12] "g" "g" "a" "a" ...
.. .. ..$ expos: int [1:12] 13 13 15 15 16 16 17 17 ...
.. ..$ G:List of 5
.. .. ..$ x : int [1:12] 651 652 458 457 940 939 ...
.. .. ..$ y : int [1:12] 1771 1771 1389 1389 220 ...
.. .. ..$ pbase: chr [1:12] "c" "g" "a" "t" ...
.. .. ..$ tbase: chr [1:12] "g" "g" "a" "a" ...
.. .. ..$ expos: int [1:12] 13 13 15 15 16 16 17 17 ...
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Allelic cross-talk calibration (genotyping chips)
PMA probe: ATCATGCGCCATCCATGAGTTACTAPMB probe: ATCATGCGCCATACATGAGTTACTA
Allelic cross-talk calibration (genotyping chips)
PMA probe: ATCATGCGCCATCCATGAGTTACTAPMB probe: ATCATGCGCCATACATGAGTTACTA
yC
y A
0 5000 15000 25000
050
0015
000
2500
0
Allelic cross-talk calibration
An affine model for cross-talk between allele A and allele B is(ignoring sample index i) is:
[
yA,j ,k
yB,j ,k
]
=
[
aA,j
aB,j
]
+
[
WAA WAB
WBA WBB
] [
xA,j ,k
xB,j ,k
]
+
[
εA,j ,k
εB,j ,k
]
and in vector format
yj = a + Wxj + εj
We estimate this robustly using unpublished work by Wirapati &Speed (2002).
Allelic cross-talk calibration
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> dsC <- calibrateAllelicCrosstalk(ds)
Benchmarking:# arrays Chip type Total time Time/array
90x2 Mapping 100K 1:26h 28s
270x2 Mapping 500K 5:30h 75s
15,000x2 Mapping 500K 13.0 days* 75s*
Overheads (approx.): Reading: 15%, Fitting: 50%, Writing: 30%.
Allelic cross-talk calibration
yC
y A
0 5000 15000 25000
050
0015
000
2500
0
(699,771) 1.000 0.035 0.121 0.959
before
yCy A
0 5000 15000 25000
050
0015
000
2500
0
after
Quantile normalization
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> dsN <- normalizeQuantile(ds)
Calculating target distribution (averaging):# arrays Chip type Total time Time/array
90x2 Mapping 100K 0:07h 2.1s
270x2 Mapping 500K 1:45h 11.8s
15,000x2 Mapping 500K 4.1 days* 11.8s*
Normalizing arrays to target distribution:# arrays Chip type Total time Time/array
90x2 Mapping 100K 0:55h 18s
270x2 Mapping 500K 9:20h 62s
15,000x2 Mapping 500K 21.5 days* 62s*
Overheads (approx.): Reading: 10%, Fitting: 65%, Writing: 25%.
Fitting RMA PLM for total copy numbers
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> model <- RmaCnPlm(ds, mergeStrands=TRUE,
combineAlleles=TRUE)
> fit(model)
Benchmarking:# arrays Chip type Total time Time/array & unit
22+21 Mapping 100K 1:00h 1.4ms
19+16 Mapping 500K 3:20h 1.3ms
90x2 Mapping 100K 7:15h 1.3ms
270x2 Mapping 500K 2.0 days* 1.3ms*
15,000x2 Mapping 500K 119 days* 1.3ms*
Overheads (approx.): Reading: 20%, Fitting: 15%, Writing: 65%.
Fitting multiple copy-number PLMs at once
path <- findCelSet("SinclairA_etal_2006")
ds <- AffymetrixCelSet$fromFiles(path)
models <- list(
rma = RmaCnPlm(ds, mergeStrands=TRUE),
mbei = MbeiCnPlm(ds, mergeStrands=TRUE),
affine = AffineCnPlm(ds, mergeStrands=TRUE)
}
lapply(models, fit, units=1:5000)
Note: Read data is cached ⇒ average reading time scales down.
Displaying results
Graphical output is still under development, but...
User feedback
- fit()[.ProbeLevelModel] worked perfectly. Ran rma on 18mouse 430 2 chips [1002x1002 cells, 45101 units] in 14 minutes.
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1063975 28.5* 2403845 64.2 2403845 64.2*
Vcells 957616 7.4* 4826040 36.9 15966139 121.9*
Compare to memory usage for fitPLM():
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2088619 55.8* 4953636 132.3 3950498 105.5*
Vcells 42847107 326.9* 90835631 693.1 90538060 690.8*
:)
Cheers / Ken
Software robustness
All transformed data and parameter estimates are stored to fileimmediately (in chunks). This means:
◮ If/when R crashes (it happens!), or when there is a powerfailure, algorithms can pick from where they were interrupted.This is automagically taken care of by aroma.affymetrix .
◮ The algorithms may be interrupted in order to temporarilyrelease computer resources for other needs.
◮ The algorithms can be restarted on a different host.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Interfacing to Bioconductor
◮ Pre-processed data is already stored as CEL files, which canbe imported to Bioconductor (and other software).
◮ Ongoing: Porting algorithms to aroma.affymetrix . Most ofthis can be done by simple wrappers calling existingimplementations, cf. fitPLM(), crlmm(), gcrma() etc.
◮ To do: Provide an eSet interface to the data classes;
1. Extract data in memory.2. Virtual eSet class to still work with data on file (more tricky).
Parallelization
Since all data and parameter estimates are kept in a sharedpersistent memory (the file system), multiple processes/hosts canaccess the data and estimates at any time.Speed up:With N parallel hosts, total time T shrinks to ≈T/N, e.g.N = 20, T = 2.0 days ⇒ T/N = 2.4 hours.One process writing, multiple reading:With this setup there are no conflicts. All readers can access theestimates as soon as they are available. Examples:
◮ Visualizers, e.g. CN plots, SNP scatter plots.
◮ Progress bars.
Multiple writing processes:A file-lock mechanism for writing is required (todo). Examples:
◮ Single-array calibration and normalization methods.
◮ Modelling of data subsets, e.g. chromsome by chromsome.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Conclusions
◮ We are now capable of analyzing very large data sets.
◮ Almost all models and algorithm we work with can beperformed with bounded memory constraints.
◮ Advantages of storing “intermediate” data and estimates onthe file system are:
◮ Standard CEL files: data is ready to be imported in other tools.◮ Persistent memory: can be restarted after a software failure.◮ Parallelization: Data and estimates can be shared and process
by multiple hosts simultaneously.
◮ The package can easily be extended by other developers.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Suggestions
◮ ASCII (tab-delimited) data files are orders of magnitudeslower to parse than binary files.
◮ Know how to use read.table(..., colClasses=...).
◮ Understand how data is kept in memory.
◮ Understand that data in matrices in R are stored as stackedcolumns, that is, it is more efficient (caching) to work columnby column, rather than row by row.
◮ Understand how I/O of data can be optimized: contiguousdata is much faster to access than scattered data.
HDF5 National Center for Supercomputing Applications,University of Illinois at Urbana
◮ File format for storing scientific data:Primary objects: data sets and groups. A dataset is essentiallya multidimensional array of data elements, and a group is astructure for organizing objects in an HDF5 file.
◮ Efficient storage and I/O:Meets data management needs of scientists and engineersworking in high performance, data intensive computingenvironments. Compressed or chunked data. Read and writedata efficiently on parallel computing systems.
◮ Large user community:Engineering, scientific, and other fields, ranging fromcomputational fluid dynamics to film making.
◮ R package hdf5:Interface to the NCSA HDF5 library. Experimental.
Package R.huge
◮ Provides in memory like access to extremely large-size dataliving on the file system, e.g. x[1:30,56:60] andx[939220+1:20,2] <- NA.
◮ Supported dimensions: vectors, matrices, andmulti-dimensional arrays.
◮ Supported data types: byte (1 byte), single (2 bytes),integer (4 bytes), float (4 bytes), and double (8 bytes).
◮ Written using plain R; easy to extend.
◮ Experimental.
◮ Most of it’s usage in aroma.affymetrix have been replaced byI/O support of CEL/CDF files in order to ease migration ofdata and analysis.
Package R.huge - Example
> x <- FileByteMatrix("x.Rmatrix", nrow=1e6, ncol=1e4)
> x
[1] "FileByteMatrix: Pathname: ./x.Rmatrix. Opened: TRUE.
File size: 10000004268 bytes (9.3 GB). Dimensions: 1e+06x
1e+04. Number of elements: 1e+10. Bytes per cell: 1."
> x[939220+1:20,2] <- 1:40
> x[939220+5:8,1:3]
[,1] [,2] [,3]
[1,] 0 5 0
[2,] 0 6 0
[3,] 0 7 0
[4,] 0 8 0
Acknowledments
In no specific order:
◮ James Bullard, UC Berkeley.
◮ Pratyaksha Wirapati, Swiss Cancer Research Center.
◮ Ben Bolstad, UC Berkeley.
◮ Rafael Irizarry, John Hopkins University.
◮ Ken Simpson, WEHI, Melbourne, Australia.
◮ Benilton Carvalho, John Hopkins University.
◮ Terry Speed, UC Berkeley/WEHI.
◮ Jan Holst, Lund University, Sweden.
◮ Kasper D. Hansen, UC Berkeley.
◮ Jane Fridlyand, UCSF.
◮ Ola Hossjer, Stockholm University, Sweden.
aroma.affymetrix is available at http://www.braju.com/R/.
Converting an ASCII CDF to a binary CDF
Example: Human Exon array with > 1.4·106 units.
> cdf <- AffymetrixCdfFile$fromChipType("HuEx-1_0-st-v2")
> cdf
AffymetrixCdfFile:
Filename: HuEx-1_0-st-v2.text.cdf
Chip type: HuEx-1_0-st-v2
Number of units: 1432154
File size: 933.84 MB
> cdf2 <- convert(cdf)
> cdf2
AffymetrixCdfFile:
Filename: HuEx-1_0-st-v2.cdf
Chip type: HuEx-1_0-st-v2
Number of units: 1432154
File size: 376.78 MB