Data Analysis 101

Jim Jett: [email protected]

National Flow Cytometry Resource

Bioscience Division

Los Alamos National Laboratory

University of New Mexico

In this set of slides, I will not attempt to tell you how to analyze data but will present some of the concepts behind data collection and analysis.

Answer the following questions:

What happens during data collection?

How are the data stored?

How are the data displayed?

What are some of the ways that the data are analyzed?

Here are some of the questions that will be answered in what follows.

What happens during data collection?

Analog Signal Processing:

[Diagram: Detector (photomultiplier or photodiode) -> PreAmp -> Signal Processing -> ADC. The signal-processing stage is one of: Amplifier (pulse height), Integrator (pulse area), or Time-to-Amplitude converter (pulse width).]

This is the ‘old’ way of analog signal processing.

Many systems now on the market are digital from the preamp on.

However, these diagrams explain the parameters that are derived from a light scatter or fluorescence signal.

First, the signal can be simply amplified to fit the input range of the Analog-to-Digital Converter. The height/amplitude of the signal is digitized.

To derive the integral of the signal, the electronics sum the signal from the threshold crossing on the rising side until the signal falls below the threshold on the falling side. The resulting sum/integral is sent to the ADC for digitization.

The Pulse Width is derived by starting a linear voltage ramp at the first threshold crossing and stopping the ramp when the signal crosses the threshold on the down side. The ramp voltage, now proportional to the signal width, is passed to the ADC for digitization.
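The three derived parameters can be illustrated on a sampled pulse. The sketch below is only an illustration in software of what the analog circuits (or a DSP) compute; the function name, the sample values, and the sampling interval `dt` are assumptions made for the example.

```python
def pulse_parameters(samples, threshold, dt=1.0):
    """Return (height, area, width) of the first pulse above threshold.

    samples   : detector signal sampled at equal time intervals
    threshold : level that defines the start and end of the pulse
    dt        : time between samples (hypothetical units)
    """
    # Find the first sample above threshold (rising-side crossing).
    start = next((i for i, v in enumerate(samples) if v > threshold), None)
    if start is None:
        return None  # no pulse in this record

    # Walk forward until the signal falls back below threshold.
    end = start
    while end + 1 < len(samples) and samples[end + 1] > threshold:
        end += 1

    window = samples[start:end + 1]
    height = max(window)        # pulse height (amplifier path)
    area = sum(window) * dt     # pulse area (integrator path)
    width = len(window) * dt    # pulse width (time-to-amplitude path)
    return height, area, width


pulse = [0, 0, 1, 3, 6, 8, 6, 3, 1, 0, 0]
height, area, width = pulse_parameters(pulse, threshold=2)
print(height, area, width)
```

Note that both the area and the width depend on where the threshold is set, which is one reason the threshold is a critical acquisition setting.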

What happens during data collection?

Why is there Signal Processing?

Amplitude: proportional to maximum fluorescence emission

Integral: proportional to total fluorescence emission

Width: proportional to particle size, under the proper conditions

Why all this signal processing? To derive as much information as possible from the signals generated.

What are the ‘proper conditions’? To obtain size information from the signal width measurement, the width of the laser beam needs to be on the order of, or smaller than, the diameter of the particles being measured.

What happens during data collection?

Logarithmic Amplification:

[Diagram: Detector -> PreAmp -> Signal Processing -> Log Amp (3 or 4 decades) -> ADC. The log amp takes the amplified or integrated signal and outputs an amplitude proportional to the log of its input.]

In many flow measurements, it is necessary to cover several orders of magnitude to adequately describe the parameter being measured. This is particularly true for immunophenotyping experiments.

Some log amps cover 5 decades or a range of 100,000X in signal amplitude or signal integral.

Log vs. Linear Scales:

Three decade conversion:

[Figure: log scale (y-axis, 0 to 3) plotted against linear scale/channel number (x-axis, 1 to 1000).]

For a 3 decade conversion from a linear scale to log scale, the curve shows the transformation with the y-axis being the log of the x-axis.

Log vs. Linear Scales:

Three decade conversion:

[Figure: log scale/channel number (y-axis, 1 to 1000) plotted against linear scale/channel number (x-axis, 1 to 1000).]

Same transformation with the y-axis now being presented as channel number. To convert a linear channel number to a ‘log’ channel number, follow the linear channel number vertically until it hits the curve, then read across to the vertical scale.

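Log vs. Linear Scales:

[Figure: a linear axis running from 1 to 1000 aligned with a log axis running over channels 0, 341, 682, and 1023; the linear values 10 and 100 mark the decade boundaries on the log axis.]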

Another way to visualize the effect of the log transformation for three decades. The top decade (values 100 to 1000 on the linear scale) maps to the top third of the log scale. The bottom decade (values 1 to 10 on the linear scale) maps into the bottom third of the log scale. The effect is to expand the low value signals, providing more detail, while compressing the larger values. Generally, in measurements that use logarithmic amplification, the upper end of the distribution is relatively featureless.
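The three-decade mapping can be written as a one-line formula. The sketch below assumes a 1024-channel log axis (channels 0 to 1023) covering linear values 1 to 1000, which matches the 0, 341, 682, 1023 tick marks in the figure; the function name and defaults are illustrative only.

```python
import math


def linear_to_log_channel(value, decades=3, top_channel=1023):
    """Map a linear value in [1, 10**decades] onto a log-scale channel."""
    if value < 1:
        # There is no channel on the log scale for values at or near zero.
        raise ValueError("value must be >= 1 for this log scale")
    return round(top_channel * math.log10(value) / decades)


channels = [linear_to_log_channel(v) for v in (1, 10, 100, 1000)]
print(channels)
```

Each decade of the linear scale takes up one third of the log axis, so values 1 to 10 are spread over as many channels as values 100 to 1000.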

Logarithmic vs. Linearly Recorded Data:

How are they different?

Dynamic range use: Log = large; Linear = small

Typical use: Log = immunofluorescence; Linear = DNA

Distances along axis: Log = relative (multiplicative); Linear = absolute (additive)

Axis has: Log = no zero; Linear = a zero

Linear and logarithmic scales are used for different applications. Linear scales are used when the fluorescence range is less than a factor of 10. Log scales are used when a large dynamic range must be covered - up to 5 orders of magnitude.

For Linear Amplification:

Scale starts at ZERO.

Factors of 2 are measured from the origin.

The widths of peaks with constant CVs increase linearly with peak position.

[Figure: linear axis marked Origin, 1X, and 2X.]

Some attributes of linear scales.

For Logarithmic Amplification

There is no channel on the log scale that corresponds to zero on the linear scale.

Factors of 2 are constant distances, no matter where in the distribution.

Peaks with constant CVs have equal widths.

Useful for chromosome analysis.

[Figure: log axis with a 2X interval marked.]

Some attributes of log scales.

What happens during data collection?

This comment is not related to data acquisition.

Several approaches to the display of log scale data have been published. The problem being solved is the appearance of ‘negative’ events in compensated data. When there are negative events, they can be lost off the bottom of the displays.

One approach, reported by Bruce Bagwell, is the “HyperLog” transform, which adjusts the data such that there are no negative compensated events (Cytometry 64A:34 (2005)). A similar approach, “Logicle”, has been reported by the Stanford group (Cytometry 69A:541 (2006)).

Two methods of removing ‘negative’ events from compensated immunofluorescence data.

What’s Happening ?

Amplitudes of processed signals are digitized by an ADC into 256, 512, 1024, ... bins.

Voltages are converted to numbers.

Information passes from the continuous world of analog electronics to the discrete digital world.

Information is lost in the process.

An effect of this is that for some data manipulations, misleading results can be obtained.

Some attributes of data acquisition.

Misleading results can occur in the following situation. When the ratio of two measurements is calculated on a cell by cell basis the resulting ratio distribution can have extraneous holes or peaks that are due to artifacts of the digitization.

An example of a ratio calculation is the ratio of surface fluorescence to cell volume to the 2/3rds power (proportional to cell surface area for spherical cells) to estimate antigen surface density.

What happens during data collection?

Digital Signal Processing is used by many modern systems:

[Diagram: Detector -> PreAmp -> Amplifier -> Free-running ADC]

Pulse height, area, and width are calculated by a DSP circuit.

Basis of the MICAS system used in the build-it lab.

For modern commercial cytometers, the ADC resolution can be 18 bits, or 262,144 levels.

Many modern data acquisition systems use this approach. The ADCs can have up to 18 bit resolution. Features of a pulse such as height, width, area and logs are calculated digitally, often by dedicated digital signal processors.

Data Storage:

In a list mode data file -

Each measurement/parameter is recorded for each particle.

Histograms and other displays generated later.

Memory requirements - proportional to the number of particles analyzed times the number of parameters recorded.

Full correlation of the measurements is preserved.

Cell #: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 . . .

FALS: 0 2 4 3 6 4 9 7 6 4 3 2 1 8 . . .

90° LS: 1 4 2 3 5 8 4 5 9 4 4 5 2 9 . . .

FL-1 Int: 2 4 3 2 5 6 8 6 7 0 8 3 1 7 . . .

FL-2 Amp: 2 3 4 2 3 2 1 8 5 8 9 7 5 3 . . .

Although single parameter, or bivariate, histograms can be stored, flow data sets are predominately recorded in list mode, especially multiparameter data.

Data Display:

A picture is worth a thousand words.

Analysis can’t do it all.

However, analytical methods are needed to extract quantitative information.

After analysis, if it is possible, look at the results graphically.

Some philosophy.

Why Look at Data ?

“There is no single statistical tool that is as powerful as a well-chosen graph.”

“Our eye-brain system is the most sophisticated information processor ever developed, and through graphical displays we can put this system to good use to obtain deep insight into the structure of data.”

Edward Tufte

Two quotes from the literature. These are by an investigator who has thought a lot about data presentation.

Why Look at Data ?

Haskie Jim, a wise old uncle in Tony Hillerman's book “Coyote Waits” about a Navajo policeman, says:

"I think that from where we stand the rain seems random. If we could stand somewhere else, we would see the order in it."

The second one is from a novel.

An example shown later is of the use of principal component transformations to get a better “view” of a complex data set.

Display of Univariate Data:

A histogram is a graphical summary of many measurements.

[Figure: number of buses (y-axis, 0 to 40) vs. number of passengers per bus (x-axis, 1 to 15 and 30).]

An example of digital data and a way to plot it.

This is a hypothetical data set that has been plotted as the number of passengers per bus vs. the number of buses containing that number of passengers.

One characteristic of all histograms is that one axis is number of events per bin. That is true for single parameter, two parameter, or three parameter histograms. For single parameter histograms, the Y-axis is number of events. There are several ways to depict number of events for two or three parameter histograms.

Display of Univariate Data:

A histogram is a graphical summary of many measurements.

[Figure: the same data drawn as a bar chart, number of buses (y-axis, 0 to 40) vs. number of passengers per bus (x-axis, 1 to 15 and 30).]

Perhaps a bar chart is more appropriate. This is a histogram plot. The vertical axis is always number of observations/events.

The peak at 30 passengers/bus is due to rush hour.

Mean, Median, Mode, Average, etc.

Mean: The average of the measurements. Ave. = Sum(# x value) / Sum(#)

Median: The midpoint of the distribution. Half of the data points are below the median and half are above.

Mode: The value with the highest frequency.

Definitions of three statistics commonly used to describe distributions.

Display of Univariate Data:

A histogram is a graphical summary of many measurements.

[Figure: the bus histogram with the statistics marked: mean = 8.5, median = 6, mode = 3.]

This shows where the mean, mode, and median are located for the people-on-a-bus data.

If the distribution is symmetric with a single peak, the mean, mode, and median are equal.

Display of Univariate Data:

A histogram is a graphical summary of many measurements.

Number of passengers per bus: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 30

Number of buses: 20 30 35 25 25 22 20 15 17 12 10 7 3 5 2 33

People on the bus data for you to determine the mean, mode and median for yourselves.
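For checking a hand calculation, the three statistics can be computed directly from the histogram table. This is a minimal sketch; the function name is illustrative, and the median is taken as the first value at which the cumulative count reaches half of the total (for an even total that falls between two bins, this returns the lower one).

```python
passengers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 30]
buses = [20, 30, 35, 25, 25, 22, 20, 15, 17, 12, 10, 7, 3, 5, 2, 33]


def histogram_stats(values, counts):
    """Return (mean, median, mode) of a histogram given values and counts."""
    total = sum(counts)

    # Mean: Sum(# x value) / Sum(#)
    mean = sum(v * c for v, c in zip(values, counts)) / total

    # Mode: the value with the highest count
    mode = max(zip(values, counts), key=lambda pair: pair[1])[0]

    # Median: first value where the running count reaches half the total
    running = 0
    median = None
    for v, c in zip(values, counts):
        running += c
        if 2 * running >= total:
            median = v
            break
    return mean, median, mode


mean, median, mode = histogram_stats(passengers, buses)
print(round(mean, 1), median, mode)
```

The rush-hour peak at 30 passengers pulls the mean well above the median, which is typical for a skewed distribution.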

Display of Univariate Data:

One or multiple histograms can be generated from a data file.

This example is from an 8 parameter immunophenotyping data set.

Eight uncorrelated histograms of data obtained in an 8 parameter experiment at Los Alamos a number of years ago by C. Stewart and R. Habbersett. These data are used only as an example of multiparameter data.

CV = Coulter Volume

ALL = Axial light loss (amount of light taken out of laser beam by a cell)

FLS = Forward angle light scatter

OLS = 90-degree light scatter

CD8, CD4, Leu M3, and CD3 = four-color antibody fluorescence measurements.

Uncorrelated means that there is no connection between regions in one histogram with any of the other histograms. Or, as presented this way, the data are equivalent to 8 single parameter measurements.

A Confusing Display?

Graph from an email inquiry on the Cytometry bulletin board

This plot appeared on the Purdue web site with the question “Where are the G1, G2M cells?” The measurements were of bacterial DNA content.

The first thing to look at is the x-axis scale. It is a 4-decade log scale. Since it is a log scale, a factor of 2 is the same distance anywhere along the axis. The arrow on the left, near the log scale tick marks, denotes the distance that corresponds to a factor of 2. The two peaks on the right hand side of the distribution are a factor of two apart. Thus they may be the G1 and G2M peaks. However, what are all the events to the left of the G1 peak? Without knowing the experimental details, it appears that those events are due to broken cells and debris. Using the log scale has magnified the low signal region, which may or may not be of interest.

A more appropriate scale to use for displaying data of this type is the linear scale. One does not need to display 4 orders of magnitude in signal strength when interested in events that vary by a factor of 2 or 4.

Kindergarten Histogram:

Even in kindergarten they learn how to plot histograms !

Display of Univariate Data

As a cumulative distribution function:

[Figure: number of cells in channels less than or equal to n (y-axis, 0 to the total number of cells) vs. channel number n (x-axis).]

The cumulative distribution function is used in several ways in FCM data analysis.

The first step in calculating the cumulative distribution function (CDF) is to determine, as a function of channel number, the number of cells in a specific univariate histogram with measurements less than, or equal to, the channel number, and to plot that number on the y-axis. The plot starts at zero at the origin and increases to equal the number of cells measured at the last channel.

Display of Univariate Data:

As a cumulative distribution function

[Figure: probability P(n) = P(meas. <= n) (y-axis, 0.0 to 1.0) vs. channel number n (x-axis).]

The CDF is calculated by normalizing the previous plot by the total number of cells measured. The y-axis then becomes a probability axis. At any channel number, the value on the Y-axis is the probability that a cell will have a measurement value less than, or equal to, the channel number.
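The two steps described above, a running sum followed by normalization, can be sketched as follows. The function name and the example counts are illustrative only.

```python
from itertools import accumulate


def cdf_from_histogram(counts):
    """Return P(meas. <= n) for each channel n of a univariate histogram."""
    total = sum(counts)
    # accumulate() gives the number of cells in channels <= n;
    # dividing by the total turns the count into a probability.
    return [running / total for running in accumulate(counts)]


counts = [0, 2, 5, 2, 1]  # hypothetical 5-channel histogram, 10 cells
cdf = cdf_from_histogram(counts)
print(cdf)
```

The last entry is always 1.0, since every cell has a measurement less than or equal to the highest channel.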

Bivariate Data Display:

As histograms: contours, intensity, isometrics

Data storage defined by histogram resolution

Calculations made from histogram array

As dot plots:

Data storage defined by the number of events

Calculations made from original data

There are several methods used to display bivariate distributions.

Bivariate Data Display:

Dot plots: Each event is plotted as a dot. This type of display can quickly become saturated.

In a dot plot, a ‘dot’ is placed on the plot for every event displayed. This means that if 2 events have the same values, the ‘dots’ are plotted on top of each other and appear as one. Thus, the plot can become saturated and one loses the third dimension, the number of events at each position. However, an advantage of the dot plot is that there is no lower threshold. That is, regions of very low counts are plotted.

The data are plotted directly from the data file.

Some instruments have live dot plot displays during data acquisition.

Bivariate Data Display:

Color coded histograms: Color of the dots is related to the number of counts in a bivariate bin.

This looks like a dot plot - but it is not. Here, a bivariate histogram is formed in computer memory from the original data file. Then a color scale is used to convert the number of events in a bivariate channel into a color and the result is plotted.

Bivariate Data Display:

Histograms as contour plots: Contours interpolated between bins and color coded. Analogous to a topographic map.

The contour plot is formed like the previous plot with the additional step that equal count contours are calculated and plotted as connected lines. The regions of higher event numbers, ‘mountains’, are easily recognized. Contour plots can get very noisy at low counts. For this reason, a lower threshold below which events are not plotted is often employed.

Chromosome Analysis and Sorting:

A cool isometric plot of human chromosome data. The chromosomes that produced these data were stained with Hoechst 33258 and Chromomycin A3 to measure the AT and CG content of individual chromosomes. The large landmark peak in the middle of the distribution is due to chromosomes 9, 10, 11, and 12, which are not separated by the measurement.

The data were collected by J. Fawcett and the display generated by M. Pruitt, both of Los Alamos.

Trivariate Data Display:

[Figure: trivariate dot plot with axes Cyclin A, Cyclin B1, and pH3. Labeled clusters: G2p, Prophase, Prometaphase, Metaphase, Ana-Telophase, pH3+ Cytokinesis, pH3- Cytokinesis. Display: J. Jacobberger, et al.]

A tri-variate dot plot that correlates expression of 3 cell cycle antigens.

Typical questions answered by univariate data analysis:

What are the mean and standard deviation of a distribution?

What is the DNA content distribution of cells with low light scatter?

Are these distributions different?

What fraction of the cells analyzed are labeled by antibody Zzz?

How are the cells distributed around the cell cycle?

Etc.

Some of the types of questions that can be asked of univariate data.

Comparison of Histograms:

Global comparison: Complete distributions or portions of distributions

Channel-by-channel comparison: A measure of the uncertainty in each channel is developed for each distribution compared.

Statistical test applied to determine if differences are significant.

First, asking the question whether two distributions representing 2 different populations of cells are measurably/statistically different or not.

There are two approaches: comparison channel-by-channel of two distributions and whole or partial comparison of distributions.

Comparison of Histograms:

Global comparison: by Kolmogorov-Smirnov (KS) test

Ref: I. T. Young, J. Histochem Cytochem 25:935 (1977)

C. B. Bagwell, et al., Cytometry 9:469 (1988)

Generally too sensitive a test. Indicates differences most of the time, even when the biology does not support that conclusion.

[Figure: two cumulative distribution functions (y-axis, 0.0 to 1.0) with the maximum difference between them marked.]

One approach is to use a K-S test to test whether a difference exists between two distributions. If the test ‘says’ that there is a difference, it will not provide any information about where the distributions differ.
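The statistic at the heart of the K-S test is the maximum difference between the two cumulative distributions. The sketch below computes only that statistic for two histograms that share the same channels; it does not compute a critical value or a p-value, and the function name and example counts are illustrative assumptions.

```python
from itertools import accumulate


def ks_statistic(counts_a, counts_b):
    """Return (max CDF difference, channel where it occurs)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must have the same number of channels")
    total_a, total_b = sum(counts_a), sum(counts_b)
    cdf_a = [c / total_a for c in accumulate(counts_a)]
    cdf_b = [c / total_b for c in accumulate(counts_b)]
    differences = [abs(a - b) for a, b in zip(cdf_a, cdf_b)]
    d_max = max(differences)
    return d_max, differences.index(d_max)


control = [10, 30, 40, 15, 5]   # hypothetical counts per channel
test = [5, 15, 45, 25, 10]
d_max, channel = ks_statistic(control, test)
print(d_max, channel)
```

The channel of maximum difference is returned only as a convenience. As noted above, the test itself says nothing about where, or how, the distributions differ.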

Comparison of Histograms:

Global comparison: by Kolmogorov-Smirnov (KS) test

A more detailed analysis.

Ref: F. Lampariello, Cytometry 39:179 (2000).

This paper presents an approach to modifying the KS test to take into account sources of variability other than purely statistical, including “cellular sample variability, staining variability, and possible instrumental biases”.

The result is a modified Dcritical value that is used in the comparisons.

However, the strictly formed, statistically correct K-S test is too stringent for most flow data. Very few flow cytometric histograms are found to be the same, even if they were measurements of the same sample. The paper cited above gives reasons for this and shows how to modify the test statistic.

Comparison of Histograms:

Two real distributions: Are they the “same”?

Two distributions. One for males, one for females - the red bars. They have different numbers of ‘events’. The question one could ask is whether they are similar.

Comparison of Histograms:

The traditional K-S test is inconclusive.

To answer the question, the cumulative distributions are calculated and the K-S test applied. In this form they look pretty similar. And, the KS test is noninformative.

The data were derived from a list of the ages of drivers arrested for DUI and separated into male and female to look for differences in age distributions.

Comparison of Histograms:

Channel-by-channel comparison:

A measure of the uncertainty in each channel is calculated:

From Poisson counting statistics: variance = # counts (standard deviation = square root of # counts)

By calculating means and standard deviations using data from several ‘identical’ distributions

Using the uncertainties, each channel is tested using a t-test.

Another method of comparing histograms.
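A channel-by-channel comparison based on Poisson uncertainties can be sketched as below. This is a simplified stand-in, not the published method: it assumes the two histograms already have equal total counts, uses a z-like score per channel in place of a full t-test, and flags channels with an arbitrary cutoff of 3. The names and example counts are hypothetical.

```python
import math


def channel_scores(counts_a, counts_b):
    """Per-channel difference divided by its Poisson standard deviation."""
    scores = []
    for a, b in zip(counts_a, counts_b):
        # Poisson statistics: the variance of a count equals the count,
        # so the variance of the difference (a - b) is a + b.
        variance = a + b
        scores.append((a - b) / math.sqrt(variance) if variance > 0 else 0.0)
    return scores


hist_a = [100, 400, 100]  # hypothetical histograms with equal totals
hist_b = [100, 300, 200]
scores = channel_scores(hist_a, hist_b)
flagged = [i for i, s in enumerate(scores) if abs(s) > 3]
print(flagged)
```

Unlike the global K-S comparison, this approach reports which channels differ.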

Comparison of Histograms:

Ref: Burns, et al., Cytometry 4:150 (1977).

Two similar histograms. The x’s denote where the histograms crossed each other. The comparison is made after the histograms are normalized to have equal numbers of events.

Comparison of Histograms:

Ref: Burns, et al., Cytometry 4:150 (1977).

Two different histograms that demonstrate differences in the regions indicated by the vertical bars.

Comparison of Histograms:

A technique that is in between global comparison and channel-by-channel comparison: Probability Binning

Ref: M. Roederer, et al., Cytometry 43:37 (2001) and following. They also develop a theory for comparison of multivariate distributions and identifying subsets that move from one distribution to another.

The basic technique bins a control histogram such that there are equal numbers of cells in each bin. A statistic is calculated for a test distribution that is a measure of the difference between the two distributions.

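[Figure: control histogram divided into bins B1 to B6, each holding an equal number of cells.]

$$\chi^2 = \sum_i \frac{(c_i - t_i)^2}{c_i + t_i}$$

where $c_i$ is the fraction of cells in control bin $i$ and $t_i$ is the fraction of cells from the test distribution in the same bin.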

A method for comparing histograms that re-bins the data to have equal number of events and thus equal uncertainties in the number of events.
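The binning step and the statistic can be sketched for univariate data as below. This is a minimal illustration rather than the published algorithm: bin edges are simply placed at quantiles of the control events, ties in the control data can make the bins slightly unequal, and all function names are hypothetical.

```python
from bisect import bisect_right


def equal_count_edges(control, n_bins):
    """Interior bin edges placed so each bin holds ~equal control events."""
    ordered = sorted(control)
    return [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]


def bin_fractions(values, edges):
    """Fraction of the events that fall into each bin."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def pb_chi_square(control, test, n_bins):
    """Sum over bins of (c_i - t_i)^2 / (c_i + t_i)."""
    edges = equal_count_edges(control, n_bins)
    c = bin_fractions(control, edges)
    t = bin_fractions(test, edges)
    return sum((ci - ti) ** 2 / (ci + ti) for ci, ti in zip(c, t) if ci + ti > 0)


control = list(range(100))             # hypothetical control events
shifted = [v + 25 for v in control]    # same shape, moved up 25 channels
same = pb_chi_square(control, control, n_bins=4)
moved = pb_chi_square(control, shifted, n_bins=4)
print(same, moved)
```

Identical distributions give a value of zero, and the value grows as events move from one bin to another.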

CD4/CD8 Quadstats:

[Figure: bivariate plot of log PE fluorescence (CD4, y-axis) vs. log FITC fluorescence (CD8, x-axis, 0.1 to 1000), divided into quadrants 1 to 4 and annotated with 45%, 2%, 26%, and 27%. Courtesy P. Robinson.]

The most basic method of determining the fraction of cells that are positive for expressing a particular cell surface antigen is quad stats. To determine the fraction of cells that are --, +-, -+, ++ for the 2 antigens measured, one draws horizontal and vertical lines on a bivariate distribution that separate positive from negative expression of each parameter. The fraction of the cells that fall into each quadrant is reported as the quad stats.

In the relatively old example shown here, the division between positive and negative cells is clear. That is not necessarily the case. Quadrants can be defined on compensated or uncompensated distributions.

What Fraction of the Cells are Labeled?

For this histogram, the answer is clear. Threshold set above 98% of the cells in the control distribution. %+ = fraction of cells above the threshold in the test distribution.

[Figure: histogram of number of events (y-axis, 0 to 2000) vs. channel number (x-axis, 0 to 1000), with 2% of the control distribution above the threshold.]

For distributions like this, it is clear how to delineate the positively labeled cells from the unlabeled cells.
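The threshold rule on this slide can be sketched as follows. The function name, the integer `control_percent` argument, and the example data are assumptions made for illustration; real data would be channel numbers from the control and test list-mode files.

```python
import math


def percent_positive(control, test, control_percent=98):
    """Set a threshold above control_percent of control events,
    then report the percent of test events above that threshold."""
    ordered = sorted(control)
    # Index of the smallest value with at least control_percent
    # of the control events at or below it.
    index = math.ceil(control_percent * len(ordered) / 100) - 1
    threshold = ordered[index]
    positive = sum(1 for value in test if value > threshold)
    return threshold, 100.0 * positive / len(test)


control = list(range(1, 101))   # hypothetical control: channels 1..100
test = list(range(51, 151))     # hypothetical test: channels 51..150
threshold, percent = percent_positive(control, test)
print(threshold, percent)
```

By construction about 2% of the control events also lie above the threshold, so the reported percentage includes that background.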

Even here, the answer is clear.

Ref: W. R. Overton, Cytometry 9:619 (1988).

What Fraction of the Cells are Labeled?

Here it is still clear.

Here, the answer is not clear. Need another method.

What Fraction of the Cells are Labeled?

Now it is not clear. Several approaches have been proposed in the literature.

What Fraction of the Cells are Labeled?

Match Region:

Select a region on the left hand side of the test distribution.

Adjust the amplitude of the control distribution to match the test in the Match Region.

Subtract the adjusted control from the test distribution.

%+ = fraction of cells left in the test distribution.

Ref: K. J. Schementi, Cytometry 13:48 (1992).

An early method of determining the percent positive cells based upon matching lower portions of a control and ‘test’ distributions in the region that is presumed to be occupied by unlabeled cells only.

What Fraction of the Cells are Labeled?

Fitting approaches:

Fit the control distribution, subtract

Ref: Lampariello, Cytometry 15:294 (1994).

Fit the control and 100% positive distributions

Ref: Sladek and Jacobberger, Cytometry 14:23 (1993).

Fit the control and the test distributions

Ref: Lampariello and Aiello, Cytometry 32:241 (1998).

Numerous fitting approaches to determining the percent positive.

What Fraction of the Cells are Labeled?

Other approaches:

Probability Binning

Ref: M. Roederer, et al., Cytometry 43: 37 (2001)

General Summary

Ref: B. Bagwell, Clinical Immunology Newsletter, Vol 16 (1996)

Ratio of cumulative distributions

Ref: Lampariello, Cytometry, 75A:665 (2009)

Numerous other approaches to determining the percent positive.

What Fraction of the Cells are Labeled?

Ratio of cumulative distributions

Ref: Lampariello, Cytometry, 75A:665 (2009)

[Figure: ratio plotted against channel number.]

A relatively recent approach to determining the percent positive cells in immunofluorescence distributions. This method forms the ratio of the labeled and unlabeled CDFs and extrapolates the ratio back to channel zero to find the fraction labeled.

FCOM - A Way to Display Phenotyping Data:

After determining the %-positive for each surface labeling, FCOM is a method to display the combined results for all antibodies measured in a single experiment.

FCOM is a method to display, for each event in a file, a calculated parameter that classifies events based on combinations of selected gates for any number of parameters. Gates are set for each parameter of interest. Usually the gates differentiate between positive and negative cells for each parameter. Since an event is either inside or outside a gate, its state can be represented by a single digit: 0=outside and 1=inside. FCOM assigns each event an integer number reflecting the gate combination for that event. Thus, there are eight possible combinations of 0’s and 1’s for 3 parameters.

The histogram above shows an example of a FCOM histogram. This FCOM is the result of three gates, with the returned values of 0, 1, 2, 3, 4, 5, 6, or 7 representing each possible combination of inside or outside the gates. For an event that is outside all three gates, the value of “0” is returned. For an event that is inside all three gates, a value of “7” is returned. If the values 0 thru 7 were scaled to be plotted on a 1024 channel histogram, they would be shown as single channel spikes. In operation, the spikes are spread to give them a normal distribution appearance.

The complete equation used to generate the FCOM seen above is (FCOM(G2,G3,G4)+0.5)*1024/8+FRND(10).
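The idea behind the equation can be sketched as below. This is not the WinList implementation: the assignment of gates to bits (first gate as the lowest bit) is an assumption, and a Gaussian spread stands in for the FRND term, whose exact behavior is not described here. Only the values 0 and 7 for "outside all gates" and "inside all gates" are taken from the text.

```python
import random


def fcom_code(in_gates):
    """Combine inside/outside flags for each gate into one integer.

    Assumption: the first gate is the lowest bit. Any ordering gives
    0 for "outside all gates" and 2**n - 1 for "inside all gates".
    """
    code = 0
    for bit, inside in enumerate(in_gates):
        if inside:
            code |= 1 << bit
    return code


def fcom_channel(code, n_gates, n_channels=1024, spread=0.0, rng=random):
    """Place a code at the centre of its slice of the channel axis.

    spread > 0 adds Gaussian jitter so that each code shows up as a
    small peak instead of a single-channel spike (an assumed stand-in
    for the random term in the equation above).
    """
    centre = (code + 0.5) * n_channels / (2 ** n_gates)
    if spread > 0:
        return centre + rng.gauss(0.0, spread)
    return centre


outside_all = fcom_code((False, False, False))
inside_all = fcom_code((True, True, True))
print(outside_all, fcom_channel(outside_all, 3))
print(inside_all, fcom_channel(inside_all, 3))
```

With three gates on a 1024 channel axis, each combination gets a slice 128 channels wide, and the 0.5 offset centres the peak within its slice.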

Details are contained in:

http://www.vsh.com/Support/WinList%20-%20Using%20FCOM.PDF

Application: Cellular DNA Content Analysis

The cell cycle:

[Figure: DNA content (2n to 4n) vs. “time”/progression through the cell cycle phases G1, S, G2, and M.]

Start with a general picture of the relationship between cell cycle progression and DNA content. From this figure, one can visualize how a DNA distribution histogram would appear. This connection was first described by B. Bagwell.

DNA Histogram Analysis:

[Figure: DNA histogram annotated with the fraction of cells in G1, S, G2, and M; values shown: 40%, 13%, 7%, and 40%.]

From the previous figure one can derive the fraction of cells in the various phases of the cell cycle.

DNA Histogram Analysis:

Is the appearance of S-phase cells under the G0/G1 and G2/M peaks real?

This was answered in the second issue of Cytometry: Sheck, Muirhead, and Horan, “Evaluation of the S Phase Distribution of Flow Cytometric DNA Histograms by Autoradiography and Computer Analysis”, Cytometry 1:109 (1980).

[Figure legend: solid line, histogram; diamonds, 2-channel average; triangles, autoradiographic.]

This early paper verified that the models that assign cells under the G1 and G2M peaks to the S-phase population are correct. In a measured histogram there are events under the G1 and G2M peaks that are due to cells in early or late S-phase.

DNA Histogram Analysis:

How does cell cycle distribution analysis proceed?

Assume a model for S-phase.

Assume that in the measurement each component of the model is normally distributed, i.e., can be fit by a Gaussian function.

Assume some model for the background that underlies the cellular DNA content distribution.

Make some guesses for initial values of the model parameters.

“Fit” the data.

Calculate the desired parameters (%s, etc.) from the fit parameters.

Perhaps the oldest histogram analysis problem.

Models for Cell Cycle Distribution:

Differ exclusively in how the S-phase region is modeled.

Some of the approaches include:

Constant S-Phase

Linear S-Phase

Quadratic S-Phase

Higher order S-Phase

Rectangular S-Phase

Trapezoidal S-Phase

Analysis algorithms differ primarily in how S-phase is described.

DNA Histogram Analysis:

Constant S-phase

The simplest S-phase model. If one used this model for S-phase, this is what one would expect a CV = 0.0 measurement to look like.

DNA Histogram Analysis:

Linear S-phase

Next higher complexity.

DNA Histogram Analysis:

Quadratic or second order S-phase

Next higher complexity. Probably the most generally used model.

DNA Histogram Analysis:

Trapezoid model of S-phase

For even more complex S-phase shapes, such as is obtained in the analysis of synchronized cell populations. This is perhaps the most flexible S-phase model that does not have too many degrees of freedom. If the S-phase model has too many degrees of freedom, that is, it is too “flexible”, the fitting result may not make sense.

DNA Histogram Analysis:

Exponentially decaying background

And generally, some assumption is made of the shape of underlying background events. This shape is generally attributed to random cell breakage.

DNA Histogram Analysis:

Sliced nuclei background

This background shape is used in the analysis of the DNA content of cell nuclei obtained from sections of paraffin blocks. It was developed by B. Bagwell and P. Rabinovitch independently.

DNA Histogram Analysis:

Fitting algorithms adjust parameters, starting with the initial guesses, to reduce Chi-square, a global measure of the distance between the data and the fit.

A hypothetical plot of the value of Chi-square, a measure of the ‘distance’ between the fit and the data, as a function of a parameter value. The object is to obtain the lowest Chi-square value, which implies that the fit to the data is as good as it can get. However, fitting programs can get stuck in false minima such as the second lowest valley in this plot. Fitting programs adjust all parameters simultaneously to minimize Chi-square.

DNA Histogram Analysis:

Chi-square is reduced until it “bottoms out”. Then the algorithm reports the fit parameter values.

[Figure: Chi-square vs. iteration number, falling from the initial guesses through the first iteration, second iteration, etc.]

How Chi-square should change with iteration number. The fitting stops when Chi-square cannot be lowered. Usually, there is a cut-off such that when the change in Chi-square from one iteration to the next is below the cut-off, the iteration is stopped.
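The iterate-until-it-bottoms-out idea can be sketched with a toy fit of a single Gaussian peak. This is a deliberately crude coordinate search, not the algorithm used by any particular cell cycle package: it nudges one parameter at a time, keeps a change only if Chi-square drops, and halves the step whenever an iteration improves Chi-square by less than the cut-off. The model, names, starting guesses, and tolerances are all assumptions.

```python
import math


def gaussian(channel, amplitude, mean, sigma):
    return amplitude * math.exp(-0.5 * ((channel - mean) / sigma) ** 2)


def chi_square(data, params):
    """Sum of squared differences, weighted by the Poisson variance."""
    if params[2] <= 0:
        return math.inf  # a non-positive width is not a valid model
    total = 0.0
    for channel, observed in enumerate(data):
        expected = gaussian(channel, *params)
        variance = max(observed, 1.0)  # avoid dividing by zero in the tails
        total += (observed - expected) ** 2 / variance
    return total


def fit(data, guess, step=1.0, cutoff=1e-9, min_step=1e-4, max_iter=500):
    params = list(guess)
    best = chi_square(data, params)
    history = [best]                      # Chi-square after each iteration
    for _ in range(max_iter):
        previous = best
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                value = chi_square(data, trial)
                if value < best:          # keep only changes that help
                    params, best = trial, value
        history.append(best)
        if previous - best < cutoff:      # change is below the cut-off
            if step < min_step:
                break                     # Chi-square has bottomed out
            step *= 0.5                   # otherwise refine the search
    return params, history


# Noise-free synthetic "G1 peak": amplitude 100, mean 50, sigma 5.
data = [gaussian(ch, 100.0, 50.0, 5.0) for ch in range(100)]
params, history = fit(data, guess=(80.0, 45.0, 4.0))
print(params, history[0], history[-1])
```

Because a trial is accepted only when it lowers Chi-square, the recorded history can never increase. A search like this can still settle in a false minimum if the initial guesses are poor.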

Finally:

The model parameters determined during the fitting are used to:

Calculate the distribution and display it with the data

Compute values - based upon the parameters determined by the fit - for:

Means of G1, G2M

C. V. of G1, G2M

%G1, %S, %G2M

Once the fitting process is finished, that is the minimum Chi-square has been achieved and any further adjustment of the parameters does not significantly decrease Chi-square, the desired parameters can be calculated.

Typical questions answered by multivariate data analysis:

Is distribution X different from distribution Y ?

Are there distinct subpopulations of cells present?

What parameters or combination of parameters define the subpopulations present?

How many cells are in each defined population?

How many parameters are necessary to describe the data set?

Some of these questions are answered by techniques such as gating, clustering, CART, Artificial Intelligence, . . . .

These are just a few of the types of questions that can be asked of multiparameter data.

Multi-parameter Data Analyses:

• Hard

• Can’t visualize relationships between parameters easily

– For 14 parameters, 91 bivariate plots are needed

– Some analytical packages facilitate exploring multi-parameter data

– Such exploration is often manual

Analysis of such data is not easy.
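The count of 91 is simply the number of ways to pick 2 parameters out of 14. A short check (the function name is made up for this example):

```python
from itertools import combinations
from math import comb


def n_bivariate_plots(n_parameters):
    # Number of distinct parameter pairs, i.e. "n choose 2".
    return comb(n_parameters, 2)


print(n_bivariate_plots(14))  # 91
print(n_bivariate_plots(8))   # 28

# Listing the pairs themselves for a small, hypothetical parameter set.
pairs = list(combinations(["FALS", "90 LS", "FL-1", "FL-2"], 2))
print(pairs)
```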

Display of Univariate Data:

One or multiple histograms can be generated from a data file.

This example is from an 8 parameter immunophenotyping data set.

Back to the eight parameter data and 8 uncorrelated histograms of that data.

As presented here, one cannot determine the size distribution of CD8 positive cells, or any other similar correlation. A solution is gating or reprocessing of the original data file.

Gating/Data Reprocessing:

Show all the cells that: have large values for FALS, or are newly divided, or are large and FL2 dim, or . . . . .

Cell #     1  2  3  4  5  6  7  8  9 10 11 12 13 14 . . .

FALS       0  2  4  3  6  4  9  7  6  4  3  2  1  8 . . .
90° LS     1  4  2  3  5  8  4  5  9  4  4  5  2  9 . . .
FL-1 Int   2  4  3  2  5  6  8  6  7  0  8  3  1  7 . . .
FL-2 Amp   2  3  4  2  3  2  1  8  5  8  9  7  5  3 . . .

Cell #     1  2  3  4  5  6  7  8  9 10 11 12 13 14 . . .

FALS                   6     9  7  6              8 . . .
90° LS                 5     4  5  9              9 . . .
FL-1 Int               5     8  6  7              7 . . .
FL-2 Amp               3     1  8  5              3 . . .

First attempt at defining correlations among measured parameters.

In this example, only a subset of the original data file is kept, data for cells that have large values of FALS. The filtered data can then be plotted as individual or bivariate histograms.
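A minimal sketch of this reprocessing step using the first 14 cells from the slide. The slide does not state the gate, so the threshold `FALS > 4` is an assumption chosen because it reproduces the subset shown; the variable names are also made up.

```python
import numpy as np

cell = np.arange(1, 15)
data = {
    "FALS":     np.array([0, 2, 4, 3, 6, 4, 9, 7, 6, 4, 3, 2, 1, 8]),
    "90 LS":    np.array([1, 4, 2, 3, 5, 8, 4, 5, 9, 4, 4, 5, 2, 9]),
    "FL-1 Int": np.array([2, 4, 3, 2, 5, 6, 8, 6, 7, 0, 8, 3, 1, 7]),
    "FL-2 Amp": np.array([2, 3, 4, 2, 3, 2, 1, 8, 5, 8, 9, 7, 5, 3]),
}

# Boolean gate: True for cells with large FALS (assumed threshold).
gate = data["FALS"] > 4

# Keep every parameter, but only for the cells inside the gate.
kept_cells = cell[gate]
gated = {name: values[gate] for name, values in data.items()}

print(kept_cells.tolist())  # [5, 7, 8, 9, 14]
for name, values in gated.items():
    print(name, values.tolist())
```

Because the gate is applied to every parameter at once, the correlation between measurements on the same cell is preserved in the filtered data.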

Gating/Data Reprocessing:

ALL dim cells

(ALL = Axial light loss)

Histograms for ALL “dim” cells. ALL stands for axial light loss, a measure of how much light is removed from/scattered out of a laser beam. It is not a commonly made measurement.

Gating/Data Reprocessing:

ALL “bright” cells

Histograms for ALL bright cells shown as faint red lines under the parent distributions.

Gating/Data Reprocessing:

From Mittag, et al., Cytometry 65A, Page 108

Gates can be drawn in bivariate distributions and the resulting subsetted data plotted and further gated. In this case, gates are set in two bivariate distributions that lead to the histograms on the right.

Multi-parameter Data Analyses:

• What else can you do ??

• Reduce the number of parameters

– By computing ratios of parameters

– By Principal Component analysis

• Go to “automated” methods

So, what can one do? In some cases, one can reduce the number of parameters by forming ratios of measurements and expressing the result as a new parameter. As an example, some investigators have formed the ratio of the fluorescence from an antibody that labels a cell surface antigen to the cell volume raised to the 2/3 power (proportional to cell surface area) to obtain a measure of the surface density of the antigen. On second thought, this can increase the complexity of the data set.
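The surface density example can be written as a derived parameter. This is a sketch only: the function name is hypothetical, the values are invented, and the units are arbitrary.

```python
import numpy as np


def surface_density(fluorescence, volume):
    # Fluorescence divided by volume**(2/3), which is proportional
    # to the surface area of a roughly spherical cell.
    return fluorescence / np.power(volume, 2.0 / 3.0)


fluorescence = np.array([100.0, 90.0])  # hypothetical antibody signal
volume = np.array([8.0, 27.0])          # hypothetical cell volumes

density = surface_density(fluorescence, volume)
print(density)  # approximately [25, 10]
```

The larger cell has nearly the same total fluorescence but a much lower antigen density, which neither raw parameter shows on its own.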

Principal Component Analysis: PCA

Think of the measured parameters (FALS, CCS, FL1, . . .) as coordinates in a multi-dimensional space.

PCA defines a new set of coordinates that are linear combinations of the originals such that the first PC has the largest variation.

This is a statistical approach that basically makes rotations in the multi-dimensional measurement space such that the first principal component accounts for the largest amount of variation in the data set.
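One common way to compute such a rotation is an eigen-decomposition of the covariance matrix. The sketch below assumes that approach and uses random, hypothetical data in place of real measurements; `pca` and the variable names are made up for illustration.

```python
import numpy as np


def pca(data):
    # Center each parameter, then diagonalize the covariance matrix.
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs             # data in the rotated coordinates
    return scores, eigvecs, eigvals


# Hypothetical 3-parameter data in which two parameters are correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
data = np.column_stack([
    base + 0.1 * rng.normal(size=1000),
    2.0 * base + 0.1 * rng.normal(size=1000),
    0.5 * rng.normal(size=1000),
])

scores, components, variances = pca(data)
print(variances / variances.sum())  # fraction of variation per PC
print(components[:, 0])             # first PC as a mix of the originals
```

Each column of `components` is a linear combination of the original parameters, and the columns are mutually perpendicular, so the transformation is a rotation (possibly with a reflection) of the measurement space.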

PCA

An example: Original measurements

Number of events

A hypothetical example of PCA in two dimensions. The original observations are shown as two groups by the ellipses. There is no line parallel to the X- or Y-axis that separates the two groups.

PCA

An example: First PC

Number of events

After PCA, the groups can be separated by projecting the data onto the first principal axis as shown on the left.

PCA

An example: Second PC

Number of events

Projection onto the second principal component axis would reveal only one group.
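A rough numerical version of this two-group example. The group positions, spreads and sizes are invented to resemble the picture, and `separation` (distance between group means in units of the pooled standard deviation) is just one convenient, hypothetical way to score how well a projection splits the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
u = np.array([1.0, 1.0]) / np.sqrt(2.0)   # direction joining the groups
v = np.array([1.0, -1.0]) / np.sqrt(2.0)  # direction of elongation


def group(center):
    # Elongated cloud: narrow along u, wide along v.
    along = rng.normal(0.0, 0.5, n)
    across = rng.normal(0.0, 1.5, n)
    return center + np.outer(along, u) + np.outer(across, v)


data = np.vstack([group(np.array([-1.5, -1.5])),
                  group(np.array([1.5, 1.5]))])

centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC 1
pc2 = eigvecs[:, 0]


def separation(values):
    # Distance between the two group means in pooled-SD units.
    a, b = values[:n], values[n:]
    return abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))


sep_x = separation(centered[:, 0])
sep_pc1 = separation(centered @ pc1)
sep_pc2 = separation(centered @ pc2)
print(sep_x, sep_pc1, sep_pc2)
```

With these assumed numbers the groups overlap on the original X axis, are far apart on the first principal component, and collapse into a single group on the second.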

PCA used to bin multivariate data

Cytometric fingerprint: W. T. Rogers, et al., Cytometry 73A:430 (2008).

PCA has been used to develop a method for comparing bivariate distributions.

Cytometric fingerprint:

The numbers of cells in the bivariate bins, as defined by repeated PCA, are compared to locate increases or decreases of cells in the bins.

Automated Data Analysis:

Several approaches have been developed

Neural networks to define distinct populations

DNA cell cycle in batch mode

Batch mode gating/display

All analysis programs have batch mode capability: WinList, TreeStar, DeNovo, etc

R/Bioconductor (Cytometry 75A:699)

Clustering

Gemstone

There are numerous approaches to automated analysis of multiparameter data that have been developed with varying success. For example, neural network analysis has been applied to aquatic flow data to define species.

DNA histograms can be analyzed in batch mode, perhaps after trimming high and low events.

Numerous programs offer batch mode gating and subsequent statistical calculations.

Clustering methods have been attempted to define subpopulations of primarily immunophenotyping data.

GemStone is a multiparameter model-building program.

Clustering of Playing Cards :

Two clusters based on colors.

A clustering in which the number of clusters was predefined as 2 and the parameter used as a discriminator was the color of the card.

Clustering is an “automated” statistical method that determines the number of events in a data set that belong to a “cluster” of events. Generally, clustering programs require that the number of clusters be defined. This makes obtaining the “correct” clustering difficult, as shown in these examples with playing cards.

Clustering of Playing Cards :

Two clusters based on numbers vs. non-numbers.

Two clusters again, but the discriminator was number vs. non-number cards.

Clustering of Playing Cards:

Four clusters based on colors plus suits.

Four clusters were desired with a single discriminator of suit.

Clustering of Playing Cards:

Which clustering is “right”?

They are all correct.

For flow cytometry, the moral of the story is that the answer that one gets is dependent on the parameters that are used in the clustering and that should be dictated by the biology.

So, which clustering is correct ??
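The playing card story can be replayed with a small k-means routine. This is a sketch under assumptions: k-means stands in for “a clustering program”, the starting centers are picked deterministically (first card, then the farthest remaining card), the ace is counted as a non-number card, and the way cards are encoded as numbers is invented for the example.

```python
from itertools import product

import numpy as np

RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
cards = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def kmeans(X, k, n_iter=20):
    # Deterministic start: first point, then the point farthest
    # from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)      # assign to the nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def cluster_sizes(features, k):
    labels = kmeans(features, k)
    return sorted(np.bincount(labels, minlength=k).tolist())


# Three different choices of "parameter" for the same 52 cards.
color = np.array([[1.0 if s in ("diamonds", "hearts") else 0.0]
                  for _, s in cards])
number = np.array([[1.0 if r.isdigit() else 0.0] for r, _ in cards])
suit = np.array([[1.0 if s == name else 0.0 for name in SUITS]
                 for _, s in cards])

print(cluster_sizes(color, 2))   # [26, 26]
print(cluster_sizes(number, 2))  # [16, 36]
print(cluster_sizes(suit, 4))    # [13, 13, 13, 13]
```

The same cards and the same algorithm give three different, equally valid answers. Only the chosen parameters and the requested number of clusters changed.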

Gemstone - Data Driven Model Building:

A multi-parameter analysis method based upon prior knowledge, i.e., a model for the beginning parameter.

For example, to analyze cell cycle data, one may assume the relationship between cell cycle position and cellular DNA content.

The data points, recorded at random, are ordered in a progression according to the model.

While maintaining the order of the first parameter, succeeding parameters are ordered to develop a progression.

The final result is a model that correlates all the measurements made for the cell sample.

A word on GemStone. It is a multiparameter data analysis tool that develops a model for progression of measured parameters.

Final Word:

Finally, don’t catch “flowfright” !!

Alternate definition by J. Freyer, Univ. of New Mexico

The second definition is attributed to J. Freyer.

Don’t suffer from flowfright. Delve into your data.