
PO34

Using SAS to Automate The Process of Microarrays Data Normalization

Don (Dongguang) Li, NCIC-CTG at Queen’s University
Lei Qin, Cancer Research Institute, Queen’s University

ABSTRACT

This article introduces a SAS macro-based program for microarray data normalization. The algorithms of several methods for both within-array and between-array normalization are discussed. Using breast cancer data from the Stanford Microarray Database, the authors implemented the program to automate the normalization process, with optional method selection and automatic graph generation. The results indicate that the program is valid and robust. For the example data, loess regression for within-array normalization and centering for between-array normalization give the better outcomes.

KEY WORDS

Microarray data analysis, Normalization, SAS automation

INTRODUCTION

In recent decades, bioinformatics has been introduced into many kinds of studies, including medical and health care research, and trans-disciplinary study has attracted the efforts of many researchers. Researchers are approaching the goal of understanding human health and disease at the molecular level, which creates intensive demand for genome data manipulation and analysis. DNA microarray data analysis is a major part of gene expression analysis. It has been widely used in medical research, for example in disease sub-type classification, selection of treatment-sensitive genes, and prognosis prediction.

Microarrays are devices that measure the expression of thousands of genes in parallel. Factors in the experimental process can introduce systematic errors into array data, such as unequal signal from the Cy3 and Cy5 labels and an imbalance of expression levels across different parts of the array chip. Normalization is therefore required immediately after the genes are assessed in the laboratory, as the first step before the microarray data are used in further analysis.

Plenty of commercial software is available for handling microarray data; Gene Spring, ArrayAnalyzer, ArrayAssist and MARAN are popular examples. Many application programs based on the S or R languages are available as well. SAS, although a powerful statistical package, has so far had very little impact in this area. This article explores the use of SAS programming techniques to perform microarray data normalization automatically.

BASIC STATISTICAL CONCEPTS

Logarithm of original intensity data and calculation of M and A

The microarray data are derived from image scanning, which uses the strength of the colored light to express the intensity of gene expression. A two-color array (Cy5/Cy3) is used as the example in this article. The data are right-skewed, with a long tail on the right side, so log transformation is common practice in microarray data management. The raw-data conversion software in the scanner usually offers several pre-processing functions, such as taking the means or medians of the signal and background pixels and performing background subtraction (mean intensity – median background). Some software can also transform the raw data to the log (base 2) scale. In the program we developed, log2 transformation is the first step, ensuring the data are in the right format before analysis proceeds. The net mean intensities of Cy5 (ch2: red) and Cy3 (ch1: green) are used.

log_ch2_mean = log2 (ch2_mean)

To get logarithms to base 2 from natural logarithms, use the equation:

$$\log_2 x = \frac{\ln x}{\ln 2}$$
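The identity can be checked directly in a SAS data step. The sketch below is illustrative only: the value 1024 and the variable names log_direct and log_via_ln are made up for the example, while ch2_mean follows the naming used above.

```sas
/* Illustrative check of the change-of-base formula (hypothetical values) */
data _null_;
  ch2_mean   = 1024;                      /* hypothetical net intensity     */
  log_direct = log2(ch2_mean);            /* built-in base-2 logarithm      */
  log_via_ln = log(ch2_mean) / log(2);    /* natural logs, change of base   */
  put log_direct= log_via_ln=;            /* write both values to the log   */
run;
```

The two values should agree (10 for this input) up to floating-point rounding, so either form can be used where log2 is not available.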

Although a plot of the log intensities of R against G can be used, the MA plot is the recommended way to display the data. Here M is the log ratio, M = log2(R/G), and A is the average of the log intensities, A = log2(sqrt(RG)) = (log2 R + log2 G)/2. An MA plot amounts to a 45° counterclockwise rotation of the log2 R / log2 G plot.

Within array normalization – linear regression of log ratio against average intensity
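Since the regression under this heading is fitted to M and A, it may help to see how they relate to the raw log intensities. As a sketch, assume log2 G is on the horizontal axis and log2 R on the vertical axis (an assumption; the text does not fix the orientation), and write x = log2 G, y = log2 R. Rotating the coordinate axes counterclockwise by 45° gives new coordinates x' and y':

```latex
\begin{aligned}
x' &= \frac{x + y}{\sqrt{2}}, &\qquad y' &= \frac{y - x}{\sqrt{2}},\\[4pt]
A  &= \frac{x + y}{2} = \frac{x'}{\sqrt{2}}, &\qquad
M  &= y - x = \sqrt{2}\,y'.
\end{aligned}
```

Under these assumptions the MA plot is the rotated plot with its axes rescaled, and spots with R = G fall on the line M = 0.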

Linear regression normalization can be applied either to log G against log R or to M against A. M against A is used in the program because it is considered more reliable. First, a linear regression of M (y-coordinate) on A (x-coordinate) is fitted. The fitted coefficients are then used to calculate the predicted M values, and the normalized M values are the original values minus the predicted ones.

Within array normalization – loess regression of log ratio against average intensity
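As a reference point for the loess method under this heading, the linear normalization just described can be written in symbols. The notation is introduced here for illustration and is not the paper's: i indexes the spots on one array, and the hatted quantities are the least-squares estimates.

```latex
\begin{aligned}
\hat{M}_i &= \hat{\beta}_0 + \hat{\beta}_1 A_i,\\
M_i^{\mathrm{norm}} &= M_i - \hat{M}_i .
\end{aligned}
```

Loess normalization keeps the second line and replaces the straight line in the first with a smooth fitted curve, \(\hat{M}_i = \hat{f}(A_i)\).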

Loess normalization is done in a similar way to linear regression. A loess regression is a kind of non-linear regression: it fits a smooth curve through the scatter by means of robust local linear fits. An output data set containing the fitted M values is created, and the normalized M is calculated by subtracting the fitted values from the original ones. The smoothing fraction can be controlled with the smooth option of the SAS loess procedure. Typically we specify a fraction of 0.2 – 0.4; otherwise the procedure applies its default.

There are other normalization methods, such as normalization by the mean or median of the M values, which can easily be implemented in the same manner. Methods for correcting spatial bias are not included in the current program. Including them would require additional variables representing spatial areas or blocks, and multiple loess regression models would be fitted.

Between array normalization – Scaling

Between-array normalization is necessary when comparing or correlating microarray data between patients or between arrays. Two basic methods, scaling and centering, are used in the current program. Box-plots are the preferred way to visualize the data. Scaling adjusts the data so that the means of all the distributions are equal. It is done simply by subtracting the mean of all the data on an array from each measurement on that array.

Between array normalization – Centering
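In symbols, the scaling step above and the centering step described under this heading differ only in the division by the standard deviation. The notation is introduced here for illustration: x_{ij} is the within-array normalized log ratio of gene i on array j, and \bar{x}_j and s_j are the mean and standard deviation of array j.

```latex
\begin{aligned}
\text{scaling:}   \quad x_{ij}^{\mathrm{scaled}}   &= x_{ij} - \bar{x}_j,\\[4pt]
\text{centering:} \quad x_{ij}^{\mathrm{centered}} &= \frac{x_{ij} - \bar{x}_j}{s_j}.
\end{aligned}
```

After scaling every array has mean 0; after centering every array has mean 0 and standard deviation 1.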

Centering makes both the means and the standard deviations of all the distributions equal. The method is similar to scaling: for each measurement on the array, subtract the mean of the array and divide by its standard deviation. Centering is very commonly used for comparing multiple arrays. It is particularly useful when the Pearson correlation coefficient is used to measure distance in a cluster analysis.

FEATURES OF THE PROGRAM

Using SAS macros, this normalization program is fully automated, with method selection controlled by macro parameters. The names of the array files in a data folder are extracted and turned directly into macro variables for later use. The selected methods generate the normalized data and produce the MA plots. The program covers the most useful methods for within-array and between-array normalization and provides a template for applying other methods.

SAS PROGRAMMING STRATEGIES

Create serial macro variables for all array data files

The program uses a file that lists the names of all the array data files and automatically creates a macro variable for each name. The list file can be generated with the Unix command ls -1 > fnames.txt, or under DOS with dir /b *.xls > fnames.txt. Read the list into a data set (fname) first, then use the following code, inside the macro, to count the data files (patients) and create the array macro variables:

    data _null_;
      set fname end=last;
      if last then call symputx("dsn", _n_);   /* number of data files */
    run;

    proc transpose data=fname out=fnm;         /* transpose data to a line with all names */
      var fname;
    run;

    data _null_;
      set fnm;
      %do i=1 %to &dsn;                        /* create array macro variables (file names) */
        call symput("fnm&i", trim(left(COL&i)));
      %end;
    run;

Macro looping for repeated analyses
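The loop described under this heading depends on the macro variables dsn and fnm1, fnm2, ... created in the previous step. As an alternative sketch (not the authors' code), they can be created in a single data step with call symputx, without the transpose. The three file names in the datalines are placeholders, and the data set and variable are both called fname to match the text.

```sas
/* Placeholder list of array files; in practice this is read from the list file */
data fname;
  length fname $40;
  input fname $;
  datalines;
2151.xls
2154.xls
2157.xls
;
run;

/* One pass: fnm1, fnm2, ... hold the names, dsn holds the count */
data _null_;
  set fname end=last;
  call symputx(cats('fnm', _n_), fname, 'G');   /* 'G' = global symbol table */
  if last then call symputx('dsn', _n_, 'G');
run;

%put NOTE: dsn=&dsn first=&fnm1 last=&&fnm&dsn;
```

call symputx strips leading and trailing blanks, so the trim(left( )) wrapper is not needed in this variant.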

Use %do i=1 %to &dsn; analysis code… %end; to loop the statistical analyses and plotting over each of the array profiles. The macro variables generated in the step above serve as the index for finding the data sets, and the selected operations are carried out on each data set in turn.

Optional methods selection controlled by macro parameter
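The loop just described and the parameter-controlled selection under this heading can be sketched together. This is a minimal illustration, not the authors' macro: the macro name demo_loop, the hard-coded count and the two file names are hypothetical, the %put statements stand in for the real analysis code, and wnormtyp follows the parameter naming used in the macro call later in the paper.

```sas
%macro demo_loop(wnormtyp=B);
  %local i dsn fnm1 fnm2;
  %let dsn  = 2;            /* hypothetical number of array files */
  %let fnm1 = 2151.xls;     /* hypothetical file names            */
  %let fnm2 = 2154.xls;

  %do i = 1 %to &dsn;
    %put NOTE: processing array &i (&&fnm&i);

    /* run the linear method only when L or B is requested */
    %if %upcase(&wnormtyp) = L or %upcase(&wnormtyp) = B %then %do;
      %put NOTE:   linear regression normalization would run here;
    %end;

    /* run the loess method only when M or B is requested */
    %if %upcase(&wnormtyp) = M or %upcase(&wnormtyp) = B %then %do;
      %put NOTE:   loess regression normalization would run here;
    %end;
  %end;
%mend demo_loop;

%demo_loop(wnormtyp=M);
```

Wrapping the parameter in %upcase lets the caller pass the code in either case.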

Use macro parameters to control the method selection, and use conditional macro code (%if … %then) to run only the selected methods.

Produce CGM graphic files

CGM-format graphic output is used in this program. The output graphic files can easily be inserted and edited in the Windows environment.

PROGRAM FLOW

Read in MA data with proc import from resources

    proc import out=work.gtemp
        datafile="&dtpath\&&fnm&i"
        dbms=EXCEL2000 replace;
      getnames=no;
    run;

Data manipulation (2 based logarithm and R/G ratio)

    data gtemp;
      set gtemp(where=(F5=0));                 /* keep spots with flag = 0 */
      if F3>0 then log_g=log2(F3); else log_g=.;
      if F4>0 then log_r=log2(F4); else log_r=.;
      if F3>0 and F4>0 then do;
        m=log2(F4/F3);                         /* M: log ratio (Cy5/Cy3) */
        a=log2(sqrt(F3*F4));                   /* A: average log intensity */
      end;
      else do;                                 /* avoid log of a non-positive value */
        m=.; a=.;
      end;
      ptid="P"||left(scan("&&fnm&i",1));
      rename F1=nameid F2=gname F3=ch1_g F4=ch2_r F5=flag;
    run;

Define graphic environment

    goptions reset=ALL ftext=hwcgm001 htitle=2 htext=1.3 device=cgmof97l
             gunit=cells gsfname=graph gsfmode=replace; *border;
    symbol1 color=black value=dot h=0.5;
    symbol2 color=red i=r height=20;
    axis1 order=(&whaxrng) minor=(number=1) label=(h=1.2 "Average Log Intensity");
    axis2 order=(&wvaxrng) minor=(number=1) label=(a=90 h=1.2 "Log Fold Ratio (Cy5/Cy3)");
    axis3 order=(&wvaxrng) minor=(number=1) label=(a=90 h=1.2 "Linear Normalized Log Fold Ratio (Cy5/Cy3)");
    axis4 order=(&wvaxrng) minor=(number=1) label=(a=90 h=1.2 "Loess Normalized Log Fold Ratio (Cy5/Cy3)");

Produce MA plots (macro parameter controlled option)

    filename graph "&dtpath\plots\MA&i..cgm";
    proc gplot data=gtemp;
      plot m*a m*a / haxis=axis1 vaxis=axis2 overlay;   /* dots (symbol1) plus regression line (symbol2) */
      title "M vs. A Plot: Original";
      footnote j=l "for ID: &&fnm&i";
    run; quit;

Perform linear regression normalization and produce normalized MA plots

    ods select ParameterEstimates;
    ods output ParameterEstimates=param;
    proc glm data=gtemp;
      model m=a / solution;
    run; quit;

    data _null_;
      set param;
      if Parameter='Intercept' then call symputx('int', Estimate);
      if Parameter='a' then call symputx('beta', Estimate);
    run;

    data nnn;
      set gtemp;
      pred = &int + &beta * a;   /* predicted M from the fitted line */
      normr = m - pred;          /* normalized M */
    run;

Use the normalized array data nnn to generate MA plots.

Perform loess regression normalization and produce normalized MA plots

    proc loess data=gtemp;
      model m=a;
      id nameid;
      ods output OutputStatistics=Results;
    run;

    proc sort data=gtemp; by nameid a; run;     /* both inputs must be sorted for the merge */
    proc sort data=results; by nameid a; run;

    data mmm;
      merge gtemp(where=(nameid^=.)) results(where=(nameid^=.));
      by nameid a;
      normr=m-pred;                             /* normalized M */
      format _ALL_;
    run;

Create MA plots using the normalized data mmm.
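The loess call above leaves the smoothing fraction at the procedure default. Since the paper mentions fractions of 0.2 to 0.4, a variant with an explicit fraction might look like the sketch below. The value 0.3 and the output data set name loefit are illustrative choices, not taken from the paper; gtemp, m, a and nameid are the names used above.

```sas
/* Sketch only: same model as above, with a fixed smoothing fraction */
proc loess data=gtemp;
  model m = a / smooth=0.3;               /* fraction of points in each local fit */
  id nameid;                              /* carry the spot identifier to the output */
  ods output OutputStatistics=loefit;     /* fitted values are in the Pred column */
run;
```

A smaller fraction follows local structure more closely, while a larger one gives a smoother curve, so the choice is a trade-off that is worth checking on the MA plots.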

Integrate data of all subjects for between-array normalization

Within the do-loop over the array profiles in the data directory, use the following code:

    %if &i=1 %then %do;
      data normloe; set mmm; run;
    %end;
    %else %do;
      proc append base=normloe data=mmm; run;
    %end;

Perform Centered or Scaled normalization (optional)

Get the mean and standard deviation of each array using proc means, then use them to carry out the between-array normalization by either the scaled or the centered method (code omitted).

Produce box-plots for normalized data (scaled or centered)
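The plotting code under this heading needs a between-array normalized data set, and the paper omits the code that creates it. One possible sketch is given here. It uses PROC STANDARD rather than the proc means approach the authors describe, so it is a substitute technique, not their implementation. normloe, normr and ptid are the paper's names, cented matches the data set plotted below, and scaled is a hypothetical name.

```sas
/* BY-group processing requires the data to be sorted by array (patient) */
proc sort data=normloe;
  by ptid;
run;

/* Scaling: subtract each array's mean, leaving the spread unchanged */
proc standard data=normloe mean=0 out=scaled;
  by ptid;
  var normr;
run;

/* Centering: subtract the mean and divide by the standard deviation */
proc standard data=normloe mean=0 std=1 out=cented;
  by ptid;
  var normr;
run;
```

In both output data sets normr is overwritten with the standardized values, so the same variable name can be passed to the box-plot step.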

    filename graph "f:\test1\madata\plots\cented.cgm";
    proc gplot data=cented;
      plot normr * ptid / haxis=axis1 vaxis=axis3;
      title "Between Array Normalization: Centered Log Ratio";
    run; quit;

Macro program calling

    %madt (dtpath=f:\test1\madata,    /* path of the source xls datasets */
           maplot=y,                  /* create M vs A plots: Y=yes N=no */
           wnormtyp=M,                /* method of within array normalization:
                                         L=linear regression M=Loess B=both */
           whaxrng=0 to 16 by 2,      /* define x-axis scale for MA plot */
           wvaxrng=-8 to 8 by 2,      /* define y-axis scale for MA plot */
           bnormtyp=S,                /* method of between array normalization:
                                         S=scaling C=centering */
           bvaxrng=-10 to 15 by 5,    /* define y-axis scale for Box plot */
           nmdat=fname2.txt );        /* name of the file containing the file names */

EXAMPLE: THE STANFORD BREAST TUMOR MA DATA NORMALIZATION

A set of microarray data from breast cancer patients was downloaded from the Stanford Microarray Database (http://genome-www.stanford.edu/MicroArray/) via (http://genome-www.stanford.edu/breast_cancer/). The database contains 176 patients, but only 5 patients are used here for testing purposes. The normalized data are used to make MA plots, which are compared with the pre-normalization plots. The results of within-array normalization are shown in Figures 1 to 3, and the results of between-array normalization are shown in Figure 4.

DISCUSSION

Many kinds of software are available for microarray data analysis, each with its own interface and its own emphasis. SAS does not yet seem to be widely used in microarray data analysis, so applying its powerful statistical facilities to this task is a worthwhile attempt. Compared with other statistical software, SAS has the strengths of efficient data management and powerful programming capabilities. The calculations and statistical analyses needed for microarray data normalization are straightforward. However, they involve intensive repeated operations on the data sets, repeated runs of statistical procedures, and heavy graphics production. SAS macros are especially well suited to automating this large amount of routine processing.

By developing the microarray data normalization program, we have reviewed the current methodological issues in depth and benefited greatly from the practice. Although some commercial software can perform microarray data normalization, it is of great importance to understand fully the algorithms and mechanisms of the process. With self-developed programs, we can be more confident in taking on microarray data analysis tasks when they arise and in properly applying other tools.

The test analyses demonstrate that the original microarray data are clearly biased in most cases and cannot be used directly without normalization. Comparison of linear and loess regression normalization indicates that the latter is the better method for these data. However, the loess procedure is demanding of computing resources and takes a long time to run. For between-array normalization, the centering method shows a better effect than the scaling method. The example breast tumor data from the Stanford Microarray Database are of relatively good quality; in the real world, the raw data might be much worse. Further study to evaluate microarray data normalization methods has been planned.

SELECTED GRAPHIC OUTPUT

(Note: only one graphic example is shown for each situation due to the large size of the graphs.)

Figure 1. MA plots of pre-normalized data

Pre-normalized MA plots show distortion of the regression lines for some patients (such as patients 2154, 2157 and 2160). It may be caused by bias in the experimental process.

Figure 2. MA plots of linear regression normalized data

Linear regression has considerably corrected the possible bias in the data. The regression lines of the normalized MA plots for patients whose data were distorted are now close to zero. However, for patient 2157 the regression line is still clearly away from zero.

Figure 3. MA plots of Loess regression normalized data

Loess regression normalization demonstrates a better effect than linear regression. The MA plots display satisfactory results for all 5 selected patients.

[Figure 1 plot: "M vs. A Plot: Original", for ID: 2154.xls. Vertical axis: Log Fold Ratio (Cy5/Cy3), -8 to 8; horizontal axis: Average Log Intensity, 0 to 16.]

[Figure 2 plot: "M vs. A Plot - Normalized / Normalized with linear regression", for ID: 2154.xls. Vertical axis: Linear Normalized Log Fold Ratio (Cy5/Cy3), -8 to 8.]


[Figure 3 plot: "M vs. A Plot - Normalized / Normalized with Loess regression", for ID: 2154.xls. Vertical axis: Loess Normalized Log Fold Ratio (Cy5/Cy3), -8 to 8.]


Figure 4. Between arrays normalization: centering and scaling methods:

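[Figure 4, first plot: "Inter Array Normalization: Centered Log Ratio". Vertical axis: Centered Log Ratio, -10 to 15; horizontal axis: patients P2151, P2154, P2157, P2160, P2164.]

[Figure 4, second plot: "Inter Array Normalization: Scaled Log Ratio". Vertical axis: Scaled Log Ratio, -10 to 15; horizontal axis: patients P2151, P2154, P2157, P2160, P2164.]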

Both centered and scaled normalizations show that the bias between arrays is well corrected. The plots, however, show that the data after centered normalization have smaller variance and greater homogeneity.

CONCLUSION

This article introduces a SAS macro-based program for microarray data normalization. The algorithms of several methods for both within-array and between-array normalization are discussed. Using breast cancer data from the Stanford Microarray Database, the authors implemented the program to automate the normalization process, with optional method selection and automatic graph generation. The results indicate that the program is valid and robust. For the example data, loess regression for within-array normalization and centering for between-array normalization give the better outcomes.

REFERENCE

1. Stekel D. Microarray Bioinformatics. Cambridge University Press, 2003.
2. Sorlie T., Tibshirani R., Parker J., et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. PNAS. 2003; 100:8418-8423.
3. Dudoit S., Yang Y.H., Callow M.J., and Speed T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stanford University, 2000.
4. Satagopan J.M. and Panageas K.S. Tutorial in biostatistics: a statistical perspective on gene expression data analysis. Statist Med. 2003; 22:481-499.
5. Engelen K., Coessens B., Marchal K., and De Moor B. MARAN: normalizing micro-array data. Bioinformatics. 2003; 19(7):893-894.
6. Zhou Y. and Liu J. AVA: visual analysis of gene expression microarray data. Bioinformatics. 2003; 19(2):293-294.
7. Murphy D. Gene expression studies using microarrays: principles, problems, and prospects. Adv in Physiol Educ. 2002; 26(4):256-270.

CONTACT INFORMATION

Don (Dongguang) Li, PhD, MPH, Senior Biostatistician
National Cancer Institute of Canada - Clinical Trials Group
10 Stuart Street, Queen’s University, Kingston, Canada, K7L 3N6
613 533 6000 Ext. 78337
