
The aroma.affymetrix package

How to analyze huge Affymetrix data sets in R on a notebook

Henrik Bengtsson - [email protected]

University of California, Berkeley

Sept 28, 2006

Outline

Introduction to the Affymetrix platform

Description of data

Computer systems & software

More on the design of Affymetrix arrays

Examples

Ongoing work & Future directions

Conclusions

General discussion on computation on large data sets

Affymetrix chips

Hybridization of target sequences to probes

Target sequence: ...GGTTACCATCGGTAAGTACTCAATGATTA...

Perfect-match (PM) probe: ATCATGCGCCATTCATGAGTTACTA

Mismatch (MM) probe: ATCATGCGCCATACATGAGTTACTA

Scanning & Image analysis

Example array: 1600x1600 cells; 65536 intensity levels (16 bits).

Scanned image: 9x9 (+ cell margins) pixels/cell.

Analyzed image: (mean pixel, stddev pixel, #pixels).

Amount of data per array

Affymetrix chip data is stored in “CEL” files.

Per cell [10 bytes]:

◮ Average pixel intensity [float = 4 bytes = 40%]

◮ Std dev pixel intensity [float = 4 bytes = 40%]

◮ # pixels [integer = 2 bytes = 20%]

With an array of 1600x1600 cells this sums up to 25.6 · 10^6 bytes = 24.4 MB/array*.

*) 1 kB = 1024 bytes, 1 MB = 1024 kB = 1048576 bytes.
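The per-array figure can be checked with a few lines of R. This is a minimal sketch that counts only the 10 bytes per cell listed above and ignores any file header; the function name `cel_size_mb` is made up for this illustration.

```r
# Bytes stored per cell in a binary CEL file, as listed on the slide.
bytes_per_cell <- c(mean = 4, stddev = 4, npixels = 2)

# Approximate CEL payload in MB (1 MB = 1048576 bytes), header ignored.
cel_size_mb <- function(nrow, ncol) {
  n_cells <- nrow * ncol
  n_cells * sum(bytes_per_cell) / 1024^2
}

size_50k  <- cel_size_mb(1600, 1600)  # a 1600x1600 array
size_250k <- cel_size_mb(2560, 2560)  # a 2560x2560 array
print(round(c(size_50k, size_250k), 1))
```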

Example of different Affymetrix chips

Chip type            Dimension    # cells        # units    File size*
Hu6800               536x536      0.29 · 10^6       7129     2.9MB
HG_U95Av2            640x640      0.41 · 10^6      12625     3.9MB
Mapping10K_Xba142    658x658      0.43 · 10^6      10208     4.1MB
HG-U133A             712x712      0.51 · 10^6      22283     4.8MB
HG-U133B             712x712      0.51 · 10^6      22645     4.8MB
Mapping10K_Xba131    712x712      0.51 · 10^6      11564     4.8MB
Mouse 430 v2         1002x1002    1.00 · 10^6      45101     9.6MB
Mapping50K_Hind240   1600x1600    2.56 · 10^6      57299    24.4MB
Mapping50K_Xba240    1600x1600    2.56 · 10^6      59015    24.4MB
Mapping250K_Nsp      2560x2560    6.55 · 10^6     262338    62.5MB
Mapping250K_Sty      2560x2560    6.55 · 10^6     238378    62.5MB
HuEx-1_0-st-v2       2560x2560    6.55 · 10^6    1432154    62.5MB

*) Sizes of binary CEL files; ASCII CEL files are much larger.

Example of data sets

Name                              # samples         Chip type       Size      Signals

Some public data sets:
Affymetrix CEPH 100K              90x2 chips        Mapping 100K    4.5GB     1.8GB

Some data sets we’ve been working on:
Slater 100K                       22+21 chips       Mapping 100K    1.0GB     0.4GB
Broad Institute 500K              96x2 chips        Mapping 500K    11.7GB    4.7GB
Affymetrix Services Laboratory    190+154 chips     Mapping 500K    21.1GB    8.4GB
Sinclair 500K                     19+16 chips       Mapping 500K    2.4GB     1.0GB
WEHI Exon                         35x2 chips        Human Exon      2.2GB     0.9GB

Some large data sets we know of:
Sanger’s 500K                     15,000x2 chips    Mapping 500K    1746GB    698GB

“Signals” = amount of RAM required for probe intensities only.

Example of data sets

Name                              # samples         Chip type       Size      Signals in R

Some public data sets:
Affymetrix CEPH 100K              90x2 chips        Mapping 100K    4.5GB     3.6GB

Some data sets we’ve been working on:
Slater 100K                       22+21 chips       Mapping 100K    1.0GB     0.8GB
Broad Institute 500K              96x2 chips        Mapping 500K    11.7GB    9.4GB
Affymetrix Services Laboratory    190+154 chips     Mapping 500K    21.1GB    16.8GB
Sinclair 500K                     19+16 chips       Mapping 500K    2.4GB     2.0GB
WEHI Exon                         35x2 chips        Human Exon      2.2GB     1.8GB

Some large data sets we know of:
Sanger’s 500K                     15,000x2 chips    Mapping 500K    1746GB    1396GB

“Signals” = amount of RAM required for probe intensities only.
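The "Signals" columns appear to follow from the CEL layout: the mean intensity is one 4-byte float out of 10 bytes per cell, and R stores the same value as an 8-byte double. The sketch below reproduces a few rows under that assumption; `signals_gb` is a name invented here, and small differences from the tables come from rounding.

```r
# RAM needed for the probe intensities alone, given the CEL file size.
# Assumes 10 bytes per cell on file, of which the intensity is one value.
signals_gb <- function(size_gb, bytes_per_value = 4, bytes_per_cell = 10) {
  size_gb * bytes_per_value / bytes_per_cell
}

size_gb <- c(Slater = 1.0, Broad = 11.7, Sanger = 1746)

as_float  <- signals_gb(size_gb)                       # 4-byte floats
as_double <- signals_gb(size_gb, bytes_per_value = 8)  # 8-byte doubles in R
print(rbind(as_float, as_double))
```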

Computer systems

Operating systems:

Operating system      Max address space
Windows XP            4GB*
Windows XP 64-bit     17 billion GB
Linux 32-bit          4GB
Linux 64-bit          17 billion GB

*) A single application can only use 2GB.

Hardware: Hardware limits the amount of memory to about 32-64 GB.

Department of Statistics, UC Berkeley: The main computational Linux (64-bit) server has 16 GB RAM.
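The address-space limits are simply powers of two. A small sketch, assuming 1 GB = 2^30 bytes and a full 32-bit or 64-bit address range; `address_space_gb` is an illustrative name.

```r
# Size of a flat address space in GB for a given pointer width.
address_space_gb <- function(bits) 2^bits / 2^30

gb_32 <- address_space_gb(32)  # 32-bit systems
gb_64 <- address_space_gb(64)  # 64-bit systems, roughly 17 billion GB
print(c(gb_32, gb_64))
```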

The R software

Overview of R:

◮ A free open source application (GPL).

◮ Great community forums.

◮ Widely used. Dominant in Bioinformatics applications.

Overview of the language:

◮ All floating-point values are stored as doubles. ⇒ float (4 bytes) to double (8 bytes); 100% more RAM (twice as much).

◮ Functional language (no pointers/reference variables) ⇒ there are often 2-3 copies of a data object at any time.

◮ A workaround is to use environments (or the R.oo package) ⇒ one copy of each data object.

Main issue is memory! (not only R)
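The three language points can be seen in plain R. The sketch below uses only base R; the function and variable names are illustrative and not part of aroma.affymetrix.

```r
# Doubles take 8 bytes each: one million values need about 8 MB.
x <- numeric(1e6)
bytes <- as.numeric(object.size(x))

# Copy semantics: a function works on its own copy of the argument.
zero_first <- function(v) {
  v[1] <- 0
  v
}
a <- c(1, 2, 3)
b <- zero_first(a)  # 'a' keeps its original values

# Reference semantics via an environment: the update happens in place.
store <- new.env()
store$values <- c(1, 2, 3)
zero_first_in_place <- function(env) {
  env$values[1] <- 0
  invisible(env)
}
zero_first_in_place(store)
```

After running this, `a` is still `c(1, 2, 3)`, while `store$values` has been changed through the environment.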

Probeset design

In gene-expression analysis, the “activity” (amount of RNA transcripts) of a single gene is measured by about 10-50 probes, each querying a small fraction of the gene's DNA.

A probeset (aka unit):

[Figure: a probeset drawn as probe pairs numbered 1, 2, 3, ..., 30, 31, 32, with one row of MM cells and one row of PM cells. Labels mark a single cell and a probe pair.]

Most of the modelling is done probeset by probeset. Thus this is the #1 way we access data.

There are tens of thousands of probesets on each array.

Probe-level modelling (PLM)

Ignoring the mismatch probes, model the PMs only:

[Figure: a probeset of PM probes numbered 1, 2, 3, ..., 14, 15, 16.]

Consider a given SNP with PM probes k = 1, ..., K and samples i = 1, ..., I. The PLM used in RMA is:

$$\log y_{ik} = \alpha_i + \beta_k + \varepsilon_{ik}$$

with PM signal $y_{ik}$, chip effect $\alpha_i$ for sample i, probe affinity (sensitivity) $\beta_k$ for probe k, and random error $\varepsilon_{ik}$.

The PLM used by dChip (MBEI) is:

$$y_{ik} = \theta_i \cdot \phi_k \cdot \xi_{ik}$$

with chip effect $\theta_i$, probe affinity $\phi_k$, and random error $\xi_{ik}$.
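To make the additive log-scale model concrete, here is a minimal sketch that fits it to one simulated probeset with base R's `medpolish()`, the median-polish fit commonly associated with RMA. The data are noise-free and invented for illustration, the probe affinities are constrained to have median zero, and nothing here is aroma.affymetrix code.

```r
# Simulated log2 PM signals for one probeset: K = 5 probes, I = 3 samples.
alpha <- c(8, 9, 10)              # chip effects, one per sample
beta  <- c(-1, -0.5, 0, 0.5, 1)   # probe affinities, median zero
log_y <- outer(beta, alpha, "+")  # K x I matrix, no noise added

# Median polish splits the matrix into overall + row + column effects.
fit <- medpolish(log_y, trace.iter = FALSE)

chip_effects     <- fit$overall + fit$col  # estimates of alpha
probe_affinities <- fit$row                # estimates of beta
print(chip_effects)
print(probe_affinities)
```

With real data the residuals `fit$residuals` are not zero, and the robustness of the medians is what protects the chip effects from outlying probes.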

Chip-definition files (CDFs)

Probesets are defined in CDF files (one per chip type), e.g. Mapping250K_Nsp.CDF. A fraction of this CDF:

$ SNP_A-1782949:List of 3

..$ type : int 2

..$ direction: int 1

..$ groups :List of 2

.. ..$ A:List of 5

.. .. ..$ x : int [1:12] 651 652 458 457 940 939 ...

.. .. ..$ y : int [1:12] 1772 1772 1388 1388 221 ...

.. .. ..$ pbase: chr [1:12] "c" "g" "a" "t" ...

.. .. ..$ tbase: chr [1:12] "g" "g" "a" "a" ...

.. .. ..$ expos: int [1:12] 13 13 15 15 16 16 17 17 ...

.. ..$ G:List of 5

.. .. ..$ x : int [1:12] 651 652 458 457 940 939 ...

.. .. ..$ y : int [1:12] 1771 1771 1389 1389 220 ...

.. .. ..$ pbase: chr [1:12] "c" "g" "a" "t" ...

.. .. ..$ tbase: chr [1:12] "g" "g" "a" "a" ...

.. .. ..$ expos: int [1:12] 13 13 15 15 16 16 17 17 ...
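A structure like this is a nested R list, so the cells of one probeset can be located with ordinary list indexing. The sketch below rebuilds only the first four (x, y) pairs shown for group A and converts them to linear cell indices. It assumes zero-based (x, y) coordinates, row-by-row cell numbering starting at 1, and the 2560-column layout of this chip type; the helper `cell_index` is hypothetical.

```r
# First four probe coordinates of group A, copied from the listing above.
group_A <- list(
  x = c(651L, 652L, 458L, 457L),
  y = c(1772L, 1772L, 1388L, 1388L)
)

ncol_chip <- 2560L  # cells per row on a 2560x2560 array

# Assumed convention: zero-based (x, y), one-based linear cell index.
cell_index <- function(x, y, ncol) y * ncol + x + 1L

cells <- cell_index(group_A$x, group_A$y, ncol_chip)
print(cells)
```

Because the probes of one probeset are scattered over the array, the resulting indices are far apart, which is why probeset-wise access is expensive on file.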

Allelic cross-talk calibration (genotyping chips)

PMA probe: ATCATGCGCCATCCATGAGTTACTA

PMB probe: ATCATGCGCCATACATGAGTTACTA

[Figure: scatter plot of yC versus yA; both axes run from 0 to 25000.]

Allelic cross-talk calibration

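An affine model for cross-talk between allele A and allele B (ignoring sample index i) is:

$$
\begin{bmatrix} y_{A,j,k} \\ y_{B,j,k} \end{bmatrix}
=
\begin{bmatrix} a_{A,j} \\ a_{B,j} \end{bmatrix}
+
\begin{bmatrix} W_{AA} & W_{AB} \\ W_{BA} & W_{BB} \end{bmatrix}
\begin{bmatrix} x_{A,j,k} \\ x_{B,j,k} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{A,j,k} \\ \varepsilon_{B,j,k} \end{bmatrix}
$$

and in vector format

$$y_j = a + W x_j + \varepsilon_j$$

We estimate this robustly using unpublished work by Wirapati & Speed (2002).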
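Once an offset vector a and a cross-talk matrix W are available, one natural way to use the affine model is to invert it: x = W^{-1}(y - a). The sketch below shows only that back-transform on noise-free toy data. The values of `a` and `W` are hypothetical, and the robust estimation step itself is not shown.

```r
# Hypothetical calibration parameters (not estimated from real data).
a <- c(700, 750)                       # offsets for allele A and B
W <- matrix(c(1.00, 0.10,
              0.05, 1.00), nrow = 2, byrow = TRUE)

# True allele signals for three toy probe pairs (columns): AA, AB, BB.
x <- cbind(AA = c(4000, 0), AB = c(2000, 2000), BB = c(0, 4000))

# Observed signals under the affine model, without noise.
y <- a + W %*% x

# Calibration: remove the offset, then undo the cross-talk.
x_hat <- solve(W, y - a)
print(round(x_hat))
```

In the toy data the homozygous AA pair shows a nonzero B signal before calibration (the cross-talk) and zero afterwards.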


> path <- findCelSet("SinclairA_etal_2006")

> ds <- AffymetrixCelSet$fromFiles(path)

> dsC <- calibrateAllelicCrosstalk(ds)

Benchmarking:

# arrays      Chip type       Total time    Time/array
90x2          Mapping 100K    1:26h         28s
270x2         Mapping 500K    5:30h         75s
15,000x2      Mapping 500K    13.0 days*    75s*

Overheads (approx.): Reading: 15%, Fitting: 50%, Writing: 30%.


[Figure: scatter plots of yC versus yA before and after calibration; both axes run from 0 to 25000. The figure carries the annotation (699,771) 1.000 0.035 0.121 0.959.]

Quantile normalization



> dsN <- normalizeQuantile(ds)

Calculating target distribution (averaging):

# arrays      Chip type       Total time    Time/array
90x2          Mapping 100K    0:07h         2.1s
270x2         Mapping 500K    1:45h         11.8s
15,000x2      Mapping 500K    4.1 days*     11.8s*

Normalizing arrays to target distribution:

# arrays      Chip type       Total time    Time/array
90x2          Mapping 100K    0:55h         18s
270x2         Mapping 500K    9:20h         62s
15,000x2      Mapping 500K    21.5 days*    62s*
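The two benchmarked passes correspond to the two steps of quantile normalization: first average the sorted intensities across arrays to get a target distribution, then map each array onto that target by rank. A minimal in-memory sketch in base R follows; the function names are made up here, and a bounded-memory version would read one array at a time in each pass instead of holding the full matrix.

```r
# Pass 1: target distribution = average of the sorted columns (arrays).
target_distribution <- function(Y) {
  rowMeans(apply(Y, 2, sort))
}

# Pass 2: replace each value by the target quantile of its rank.
normalize_to_target <- function(Y, target) {
  apply(Y, 2, function(y) target[rank(y, ties.method = "first")])
}

# Toy data: 3 probes (rows) on 2 arrays (columns).
Y <- cbind(c(5, 2, 3), c(4, 1, 6))

target <- target_distribution(Y)
Yn <- normalize_to_target(Y, target)
print(target)
print(Yn)
```

After normalization every column contains exactly the values of `target`, only in a different order.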


Fitting RMA PLM for total copy numbers



> model <- RmaCnPlm(ds, mergeStrands=TRUE,

combineAlleles=TRUE)

> fit(model)

Benchmarking:

# arrays      Chip type       Total time    Time/array & unit
22+21         Mapping 100K    1:00h         1.4ms
19+16         Mapping 500K    3:20h         1.3ms
90x2          Mapping 100K    7:15h         1.3ms
270x2         Mapping 500K    2.0 days*     1.3ms*
15,000x2      Mapping 500K    119 days*     1.3ms*
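The "Time/array & unit" column suggests that fitting time grows roughly with (number of arrays) x (number of units). The sketch below checks this for the 270x2 row, assuming that "270x2" means 270 arrays for each of the two Mapping250K chip types and taking the unit counts from the chip table earlier in the talk. Both the reading of the row and the linear scaling are assumptions.

```r
# Unit counts for the two Mapping 500K chip types (Nsp + Sty).
units_500k <- 262338 + 238378

arrays <- 270             # arrays per chip type (assumed reading of "270x2")
ms_per_array_unit <- 1.3  # benchmarked time per array and unit

# Total fitting time in days under linear scaling.
total_days <- arrays * units_500k * ms_per_array_unit / 1000 / 86400
print(round(total_days, 1))
```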


Fitting multiple copy-number PLMs at once

path <- findCelSet("SinclairA_etal_2006")
ds <- AffymetrixCelSet$fromFiles(path)

# Set up three copy-number PLMs on the same data set
models <- list(
  rma    = RmaCnPlm(ds, mergeStrands=TRUE),
  mbei   = MbeiCnPlm(ds, mergeStrands=TRUE),
  affine = AffineCnPlm(ds, mergeStrands=TRUE)
)

# Fit each model to the first 5000 units
lapply(models, fit, units=1:5000)

Note: Read data is cached ⇒ average reading time scales down.

Displaying results

Graphical output is still under development, but...

User feedback

- fit()[.ProbeLevelModel] worked perfectly. Ran rma on 18 mouse 430 2 chips [1002x1002 cells, 45101 units] in 14 minutes.

> gc()
          used  (Mb) gc trigger  (Mb) max used   (Mb)
Ncells 1063975  28.5*   2403845  64.2  2403845   64.2*
Vcells  957616   7.4*   4826040  36.9 15966139  121.9*

Compare to memory usage for fitPLM():

> gc()
           used  (Mb) gc trigger  (Mb) max used   (Mb)
Ncells  2088619  55.8*   4953636 132.3  3950498  105.5*
Vcells 42847107 326.9*  90835631 693.1 90538060  690.8*

:)

Cheers / Ken

Software robustness

All transformed data and parameter estimates are stored to file immediately (in chunks). This means:

◮ If/when R crashes (it happens!), or when there is a power failure, algorithms can pick up from where they were interrupted. This is automagically taken care of by aroma.affymetrix.

◮ The algorithms may be interrupted in order to temporarily release computer resources for other needs.

◮ The algorithms can be restarted on a different host.
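The restart behaviour can be illustrated with a generic chunked loop that writes each finished chunk to its own file and skips chunks whose file already exists. This is a sketch of the idea in base R, not the mechanism aroma.affymetrix actually uses; `process_in_chunks` and `fit_chunk` are hypothetical names, and the results are stored as RDS files rather than CEL files.

```r
# Process 'units' in chunks; each chunk result is saved before moving on.
# Returns the number of chunks that actually had to be computed.
process_in_chunks <- function(units, chunk_size, out_dir, fit_chunk) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  chunks <- split(units, ceiling(seq_along(units) / chunk_size))
  computed <- 0L
  for (i in seq_along(chunks)) {
    path <- file.path(out_dir, sprintf("chunk_%04d.rds", i))
    if (file.exists(path)) next  # already done in an earlier run
    tmp <- paste0(path, ".tmp")
    saveRDS(fit_chunk(chunks[[i]]), tmp)
    file.rename(tmp, path)       # publish only complete files
    computed <- computed + 1L
  }
  computed
}

out_dir <- tempfile("chunks_")
fit_chunk <- function(units) units^2  # stand-in for a real model fit

first_run  <- process_in_chunks(1:10, 4, out_dir, fit_chunk)
second_run <- process_in_chunks(1:10, 4, out_dir, fit_chunk)
print(c(first_run, second_run))
```

The second call finds all chunk files in place and computes nothing, which is the same effect as resuming after an interruption.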

Interfacing to Bioconductor

◮ Pre-processed data is already stored as CEL files, which can be imported to Bioconductor (and other software).

◮ Ongoing: Porting algorithms to aroma.affymetrix. Most of this can be done by simple wrappers calling existing implementations, cf. fitPLM(), crlmm(), gcrma() etc.

◮ To do: Provide an eSet interface to the data classes:

1. Extract data in memory.

2. Virtual eSet class to still work with data on file (more tricky).

Parallelization

Since all data and parameter estimates are kept in a shared persistent memory (the file system), multiple processes/hosts can access the data and estimates at any time.

Speed up: With N parallel hosts, total time T shrinks to ≈ T/N, e.g. N = 20, T = 2.0 days ⇒ T/N = 2.4 hours.

One process writing, multiple reading: With this setup there are no conflicts. All readers can access the estimates as soon as they are available. Examples:

◮ Visualizers, e.g. CN plots, SNP scatter plots.

◮ Progress bars.

Multiple writing processes: A file-lock mechanism for writing is required (to do). Examples:

◮ Single-array calibration and normalization methods.

◮ Modelling of data subsets, e.g. chromosome by chromosome.
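As a small illustration of the T/N estimate and of how work could be divided, the sketch below splits a set of unit indices into contiguous blocks, one per host. It assumes perfectly even scaling and ignores I/O contention; all names are illustrative.

```r
total_days <- 2.0
n_hosts <- 20

# Ideal wall-clock time per host, in hours.
hours_per_host <- total_days * 24 / n_hosts

# Assign units to hosts in contiguous blocks of equal size.
units <- 1:5000
block_size <- ceiling(length(units) / n_hosts)
units_by_host <- split(units, ceiling(seq_along(units) / block_size))

print(hours_per_host)
print(lengths(units_by_host))
```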

Conclusions

◮ We are now capable of analyzing very large data sets.

◮ Almost all models and algorithms we work with can be run within bounded memory.

◮ Advantages of storing “intermediate” data and estimates on the file system are:

  ◮ Standard CEL files: data is ready to be imported in other tools.

  ◮ Persistent memory: can be restarted after a software failure.

  ◮ Parallelization: Data and estimates can be shared and processed by multiple hosts simultaneously.

◮ The package can easily be extended by other developers.

Suggestions

◮ ASCII (tab-delimited) data files are orders of magnitude slower to parse than binary files.

◮ Know how to use read.table(..., colClasses=...).

◮ Understand how data is kept in memory.

◮ Understand that data in matrices in R are stored as stacked columns, that is, it is more efficient (caching) to work column by column, rather than row by row.

◮ Understand how I/O of data can be optimized: contiguous data is much faster to access than scattered data.
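Two of these suggestions are easy to try out in base R. The sketch below writes a tiny tab-delimited file (its contents are invented for the example), reads it back with explicit `colClasses` so that R does not have to guess column types, and then shows that a matrix is laid out column by column in memory.

```r
# A small tab-delimited file with made-up contents.
path <- tempfile(fileext = ".txt")
writeLines(c("unit\tx\ty\tintensity",
             "SNP_1\t651\t1772\t1024.5",
             "SNP_2\t652\t1772\t980.25"), path)

# Declaring the column types up front avoids type guessing while parsing.
data <- read.table(path, header = TRUE, sep = "\t",
                   colClasses = c("character", "integer", "integer", "numeric"))
col_types <- vapply(data, function(col) class(col)[1], character(1))

# Matrices are stored as stacked columns: column 1 first, then column 2, ...
m <- matrix(1:6, nrow = 2)
stacked <- as.vector(m)  # elements in memory order
second_column <- m[, 2]  # a contiguous block in memory
print(col_types)
print(stacked)
```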

HDF5

National Center for Supercomputing Applications, University of Illinois at Urbana

◮ File format for storing scientific data: Primary objects: data sets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file.

◮ Efficient storage and I/O: Meets data management needs of scientists and engineers working in high performance, data intensive computing environments. Compressed or chunked data. Read and write data efficiently on parallel computing systems.

◮ Large user community: Engineering, scientific, and other fields, ranging from computational fluid dynamics to film making.

◮ R package hdf5: Interface to the NCSA HDF5 library. Experimental.

Package R.huge

◮ Provides in-memory-like access to extremely large data living on the file system, e.g. x[1:30,56:60] and x[939220+1:20,2] <- NA.

◮ Supported dimensions: vectors, matrices, and multi-dimensional arrays.

◮ Supported data types: byte (1 byte), short (2 bytes), integer (4 bytes), float (4 bytes), and double (8 bytes).

◮ Written using plain R; easy to extend.

◮ Experimental.

◮ Most of its usage in aroma.affymetrix has been replaced by I/O support of CEL/CDF files in order to ease migration of data and analysis.

Package R.huge - Example

> x <- FileByteMatrix("x.Rmatrix", nrow=1e6, ncol=1e4)

> x

[1] "FileByteMatrix: Pathname: ./x.Rmatrix. Opened: TRUE.

File size: 10000004268 bytes (9.3 GB). Dimensions: 1e+06x

1e+04. Number of elements: 1e+10. Bytes per cell: 1."

> x[939220+1:20,2] <- 1:20

> x[939220+5:8,1:3]

[,1] [,2] [,3]

[1,] 0 5 0

[2,] 0 6 0

[3,] 0 7 0

[4,] 0 8 0
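The idea behind such file-backed matrices can be sketched with base R connections: element (row, col) of a column-major byte matrix lives at a fixed byte offset in the file, so a block can be written or read with `seek()` without loading the rest. The example below is not R.huge code; it uses a small 1000x10 file with no header, and all names are illustrative.

```r
# A zero-filled "matrix on file": 1000 rows x 10 columns, 1 byte per cell.
n_rows <- 1000L
n_cols <- 10L
path <- tempfile(fileext = ".bin")
writeBin(raw(n_rows * n_cols), path)

# Zero-based byte offset of element (row, col) in column-major order.
offset_of <- function(row, col) (col - 1L) * n_rows + (row - 1L)

# Write the values 1..20 into rows 501..520 of column 2.
con <- file(path, open = "r+b")
invisible(seek(con, where = offset_of(501L, 2L), rw = "write"))
writeBin(as.raw(1:20), con)
close(con)

# Read back rows 505..508 of column 2 without touching the rest of the file.
con <- file(path, open = "rb")
invisible(seek(con, where = offset_of(505L, 2L)))
vals <- as.integer(readBin(con, what = "raw", n = 4L))
close(con)
print(vals)
```

Reading down a column touches one contiguous run of bytes, whereas reading along a row requires one seek per element, which is the file-system counterpart of the column-by-column advice given earlier.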

Acknowledgments

In no specific order:

◮ James Bullard, UC Berkeley.

◮ Pratyaksha Wirapati, Swiss Cancer Research Center.

◮ Ben Bolstad, UC Berkeley.

◮ Rafael Irizarry, Johns Hopkins University.

◮ Ken Simpson, WEHI, Melbourne, Australia.

◮ Benilton Carvalho, Johns Hopkins University.

◮ Terry Speed, UC Berkeley/WEHI.

◮ Jan Holst, Lund University, Sweden.

◮ Kasper D. Hansen, UC Berkeley.

◮ Jane Fridlyand, UCSF.

◮ Ola Hossjer, Stockholm University, Sweden.

aroma.affymetrix is available at http://www.braju.com/R/.


Converting an ASCII CDF to a binary CDF

Example: Human Exon array with > 1.4 · 10^6 units.

> cdf <- AffymetrixCdfFile$fromChipType("HuEx-1_0-st-v2")

> cdf

AffymetrixCdfFile:

Filename: HuEx-1_0-st-v2.text.cdf

Chip type: HuEx-1_0-st-v2

Number of units: 1432154

File size: 933.84 MB

> cdf2 <- convert(cdf)

> cdf2

AffymetrixCdfFile:

Filename: HuEx-1_0-st-v2.cdf

Chip type: HuEx-1_0-st-v2

Number of units: 1432154

File size: 376.78 MB
