Prospects & Overviews

A guide to stem cell identification: Progress and challenges in system-wide predictive testing with complex biomarkers

Roy Williams 1), Bernhard Schuldt 2) and Franz-Josef Müller 3)*

We have developed a first generation tool for the unbiased identification and characterization of human pluripotent stem cells, termed PluriTest. This assay utilizes all the information contained on a microarray and abandons the conventional stem cell marker concept. Stem cells are defined by the ability to replenish themselves and to differentiate into more mature cell types. As differentiation potential is a property that cannot be directly proven in the stem cell state, biologists have to rely on correlative measurements in stem cells associated with differentiation potential. Unfortunately, most, if not all, of those markers are only valid within narrow limits of specific experimental systems. Microarray technologies and recently next-generation sequencing have revolutionized how cellular phenotypes can be characterized on a systems-wide level. Here we discuss the challenges PluriTest and similar global assays need to address to fulfill their enormous potential for industrial, diagnostic and therapeutic applications.

Keywords: bioinformatics; biomarkers; gene expression; machine learning; pluripotent stem cells

Large datasets transform biology

Stem cell science, like any other biological discipline, is currently undergoing a revolutionary change. Most day-to-day biological experiments now lead to a large flood of digital data that can be stored and mined [1, 2].

Whereas biologists were once trained to conduct focused, hypothesis-driven experiments, in today’s scientific environment so-called “bottom-up” research has become more and more prevalent. In this paradigm, initially large amounts of data are gathered, next a biological “story” is distilled from the results and, ideally, this de novo-generated hypothesis is subsequently tested in wet lab experiments. Prominent examples of such data-driven research are: the Human Genome Project, the Human Epigenome Project [3], the 1000 Genomes Project [4] and the International Cancer Genome Consortium [5].

Notably, despite the enormous potential, this paradigm shift poses some challenging problems. In this review we focus on the conceptual basis underlying the unprecedented opportunity for building and using truly predictive, data-driven models in biology today. These models combined with nucleotide-based low-cost, high-throughput, ultra-high content methods will lay the foundations for novel types of global assays. Here, instead of using just a few genes or proteins as “simple” biomarkers, by looking at whole global data structures (referred to as “complex biomarkers”) much superior insights can be gained into cellular phenotypes, states, responses, and clinically relevant endpoints [6].

DOI 10.1002/bies.201100073

1) Bioinformatics Shared Resource, Sanford Burnham Medical Research Institute, La Jolla, CA, USA

2) Graduiertenschule Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen, Aachen, Germany

3) Zentrum für Integrative Psychiatrie, Kiel, Germany

*Corresponding author: Franz-Josef Müller
E-mail: [email protected]

Abbreviations: CHIP-seq, chromatin-immunoprecipitation sequencing; FACS, fluorescence-activated cell sorting; hESC, human embryonic stem cell; HSC, hematopoietic stem cell; iPSC, induced pluripotent stem cell; NGS, next-generation sequencing; NMF, non-negative matrix factorization; PCA, principal component analysis; POU5F1/OCT4, POU domain, class 5, transcription factor 1/octamer-binding transcription factor 4; PSC, pluripotent stem cell; QC, quality control; RNA-seq, RNA sequencing; whole transcriptome shotgun sequencing.

880 www.bioessays-journal.com Bioessays 33: 880–890, © 2011 WILEY Periodicals, Inc.

Methods, Models & Techniques


Cancer research has pioneered data-driven biology

Cancer research has led the way into the “present future” of predictive models in biology based on high-content datasets.

Patients suffering from clinically and pathologically indistinguishable malignant disorders frequently have widely variable disease courses and respond differently to treatments. This has fueled the hope that, by using highly discriminating biomarkers, patient groups could be intelligently subdivided and matched with the most appropriate treatment options.

The very first ordered DNA arrays were used to analyze malignant samples [7]. Scientists in the cancer field have pioneered the development and application of machine learning algorithms for high-content data [8, 9]. Most principles, methods and quality controls (QCs) for mining large-scale datasets were first established with a focus on cancer biology [microarray quality control project (MAQC)-I, MAQC-II, and MAQC-III (aka SEQC)] [10].

In particular, the MAQC-II project evaluated predictive models for several relevant endpoints, and has also provided significant insights for stem cell-related data models. The project analyzed more than 30,000 models developed by 36 teams, aimed at predicting 13 endpoints in the context of preclinical and clinical cancer datasets [10], and some of its results were quite remarkable.

First, not all endpoints could be modeled equally well. Some endpoints were easy to predict using adequate bioinformatics techniques, while others were nearly intractable for all methodologies used.

Second, the experience of the modeling teams mattered. Even when teams used the same algorithms, general bioinformatic experience and familiarity with the particular problem were associated with better predictive performance.

Third, and surprisingly, many approaches and algorithms are equally capable of predicting relevant endpoints. Mathematically less-involved linear methods (using only a few parameters) performed comparably to more biologically oriented approaches (using more sophisticated non-linear models, fitting many more biological parameters). This observation was also highlighted in the DREAM3 (Dialogue for Reverse Engineering Assessments and Methods, project 3) network identification challenge [11].

From our experience, the following lessons are particularly relevant for stem cell studies:

(i) Only datasets which have been optimally quality controlled, normalized, transformed, and filtered will serve as a reliable basis for determining discriminative predictors.

(ii) Many algorithms can give predictive answers equally well, if the methodology is adequate for the particular endpoint in question.

(iii) Some endpoints cannot reliably be predicted from current state-of-the-art gene expression datasets [e.g. effects due to epigenetic memory in induced pluripotent stem cells (iPSCs)]. In our opinion, this problem will not be resolved by using more and increasingly sophisticated algorithms, but by using more appropriate data sources [such as RNA sequencing; whole transcriptome shotgun sequencing (RNA-seq)].
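Lesson (i) can be made concrete with a small sketch. The text does not say which transformation, normalization, or filter is meant, so the choices below (log2 transform, quantile normalization, a variance filter) and all function names are illustrative assumptions rather than the pipeline used by the authors or by MAQC-II.

```python
import math
from statistics import pvariance

def log2_transform(matrix, offset=1.0):
    """Log2-transform raw intensities; the offset avoids log(0)."""
    return [[math.log2(v + offset) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same intensity distribution.

    matrix[g][s] is the value of gene g in sample s. Ties are broken by
    gene order, which is a simplification.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples.
    rank_means = [sum(col[k] for col in sorted_cols) / n_samples
                  for k in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            out[g][s] = rank_means[rank]
    return out

def filter_low_variance(matrix, min_variance):
    """Return indices of genes whose variance across samples reaches the threshold."""
    return [g for g, row in enumerate(matrix) if pvariance(row) >= min_variance]

# Hypothetical raw intensities: 4 genes (rows) x 3 samples (columns).
raw = [
    [100.0, 220.0, 90.0],
    [15.0, 30.0, 14.0],
    [900.0, 2100.0, 50.0],
    [40.0, 85.0, 41.0],
]
normalized = quantile_normalize(log2_transform(raw))
kept = filter_low_variance(normalized, min_variance=0.05)
```

In this toy matrix the second sample is roughly twice as bright as the first. After normalization all three samples share one distribution, and only genes whose rank changes between samples survive the variance filter.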

Yet, there are intrinsic and important differences in how relevant endpoints in stem cell studies can be modeled. Usually it is difficult to obtain more than five different stem cell lines from the same researcher or lab [12]. Studies analyzing 34 pluripotent stem cell (PSC) lines consider themselves as “large” [13], while current state-of-the-art cancer studies regularly collect and analyze between 400 and 6,000 samples from patients with a distinct malignant disorder [14, 15]. Also, the human embryonic stem cell (hESC) field is heavily biased toward a few cell lines that were reported early in the field, particularly those distributed by WiCell [16, 17].

Bioequivalence is today’s “holy grail” of stem cell biology

The discovery that cellular phenotypes with features nearly identical to those of their physiological and developmental counterparts can be readily induced from somatic cells with today’s molecular technologies has led to an unprecedented pace of novel discoveries. With several thousand scientists refining technologies to derive improved in vitro stem cell preparations with increasingly sophisticated methods, a central question has emerged and will dominate the field in the next few years: what constitutes bioequivalence?

For example, on a global scale, iPSCs are nearly indistinguishable from ESCs [12]; yet, when subjected to ultra-high-resolution, genome-wide assays, and potentially non-physiological differentiation protocols, small but significant differences between iPSC and ESC lines can be detected [18–22]. During reprogramming, iPSCs may acquire both single point mutations in gene-coding regions [23] and DNA copy number alterations [20, 21]. Furthermore, there are documented differences between iPSCs and ESCs in regard to DNA methylation, which appear to stem from the parental cell lines in the case of iPSCs [19, 24]. None of the differences between ESCs and iPSCs (genetic and epigenetic) seem to be present or absent in all iPSCs or ESCs [25]; thus there is no simple test or biomarker that could distinguish iPSCs from ESCs with high sensitivity or specificity. Although some iPSC lines have now been shown to possess epigenetic [19], genetic [20, 21, 23], and immunological [26] abnormalities when compared to ESCs, iPSCs can pass even the most rigorous test for “developmentally appropriate” pluripotency in the murine model system, the tetraploid complementation assay [27, 28].

As a result, a heated debate is currently ongoing between two schools of thought in the stem cell field [29]. On one “side”, the few, but potentially significant, differences in the epigenetic makeup and genomic integrity suffice to make iPSCs questionable as a preclinical and clinical tool. For the “opposition”, these findings are a starting point to ask if these documented differences do actually matter.

To fully realize how difficult it is to “diagnose” a stem cell with its core properties, we can bypass the iPSC-ESC discussion, and just consider the teratoma assay. Until the present day, injecting PSCs into immunocompromised mice and waiting until a tumor forms, which should contain derivatives from all three germ layers, is regarded as the “Gold Standard” assay for proving pluripotentiality in human cell lines [30, 31]. The teratoma assay was developed about 60 years ago [32], and is still required by many peer reviewers of stem cell studies [30].

Independently, several studies have shown that teratomas or teratoma-like structures can arise from iPSCs and ESCs. Some of these lines were incapable of contributing to all the cell types in the embryo proper [33, 34], or have epigenetic/genetic abnormalities, which impede pluripotential lineage differentiation in vitro [19].

“Classical” pluripotency genes as stem cell markers do not offer a solution as they unfortunately lack specificity. This becomes evident when considering the case of teratocarcinoma cell lines. These lines are derived from malignant human tissues and do express many, if not all, “pluripotency markers”. Thus, teratocarcinoma lines have even been proposed to possess utility as a reference standard for hESCs [35]. However, teratocarcinoma cells are clearly not suited for preclinical stem cell research, or clinical applications, as they might possess persistent tumorigenic potential and other abnormalities even after they have been differentiated.

With only continuous in vitro culture and the teratoma assays as a measure for self-renewal and differentiation, respectively, hPSC researchers face a vexing circular dilemma.

How can we devise a surrogate assay for hPSCs that accurately predicts lineage potential while lacking a reliable external outcome metric?

The issue is compounded by the partially stochastic nature of epigenetic mechanisms governing lineage choice in PSCs. While single epigenetic features can significantly bias a cell line toward a specific fate choice, so far we have been unable to identify any single local transcriptomic, epigenetic, or genetic feature that is sufficient and necessary to predict lineage bias.

As a result, “complex” biomarkers have been proposed to guide the selection of the “optimal” stem cell line or preparation.

“Complex” biomarkers lead to results with improved interpretability

Microarray studies (and other high-content data-generation technologies) have been mined for single or limited numbers of marker genes [36]. Complex biomarkers are linear and/or non-linear parameter combinations constructed from more than one measured biological feature. Complex biomarkers can be identified algorithmically and regularly outperform “classical” biomarkers.

Simple biomarkers can be significantly regulated genes that, when detected as present in one sample, can indicate, e.g. a specific disease state versus the normal condition, or a specific stem cell type versus a more differentiated somatic cell. Hematopoietic stem cells (HSCs) are the prototypical example of a stem cell type for which known single cell surface biomarkers are sufficiently sensitive and specific to be useful in preclinical and clinical applications [37, 38]. Bulk transplantation of blood-forming tissues from umbilical cords or bone marrow containing HSCs can reconstitute the whole blood-forming organ system in vivo [39]. Seminal work by Weissman and coworkers showed that, using a combination of cell surface markers, HSCs with long-term self-renewal and differentiation capacity can be identified from these heterogeneous cell populations [37, 38]. Similar markers for other stem cell types would be useful for wet lab experiments, since single markers could be selected with fluorescence-activated cell sorting (FACS) or perhaps overexpressed to alter a cellular or disease state [36]. While this concept has been tried extensively, only very few sensitive, specific and universal markers have been identified. In cases where the “marker hunt” was successful, moving such a molecular concept beyond a specific organ or experimental system has often failed.

The main problem with the identification of single markers for other stem cell preparations lies in the source material and the experimental readout: (i) it is relatively easy to harvest “pristine” tissue preparations from living organisms containing HSCs that have not been epigenetically or genetically altered through artificial culture in vitro; and (ii) after HSCs have been obtained and/or sorted, the functional hematopoietic reconstitution assay demonstrating self-renewal and differentiation potential in vivo is sensitive, specific, informative and reliable [37, 38]. Unfortunately, there is no other stem cell type that can be as easily obtained in such an ideal physiological state and tested with such an unequivocal outcome measure as HSCs.

Let us consider the case of POU domain, class 5, transcription factor 1/octamer-binding transcription factor 4 (POU5F1/OCT4) in PSCs and tissues. The gene was discovered because it was differentially expressed in murine oocytes, early mouse embryos and germ cells [40, 41]. Its overexpression alone in somatic cells can induce pluripotency [42]. Yet, against prior assumptions, POU5F1/OCT4 is not universally important for all types of stem cells [43]. Recent evidence shows that only very few genes and proteins show cell type- or organ-specific expression patterns [44], and that the combinatorial expression of genes defines cell types and tissues, not single marker genes [45]. This is also the case for POU5F1/OCT4 since it is tightly regulated and expressed in several non-identical pluripotency-associated tissues and cell types [46].

Aggregated groups of genes, so-called “signatures”, can mitigate the instability of predictions based on single genes. In this context, “gene-set-enrichment” techniques were developed [47, 48]: results from both high-content experiments and biological pathways can be conceptualized as gene sets that are then associated with particular cellular functions or disease states. These “synthetic gene” (sets) can be used to interrogate expression datasets. The up-regulation of a gene set in one condition versus another can signify biologically relevant alterations, e.g. in molecular pathways such as oxidative phosphorylation [47]. From this abstraction level it is only a small step toward using de novo, algorithmically identified data dimensions for stem cell class discovery [12, 49] or prediction [50].
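The idea of treating a gene set as one “synthetic gene” can be sketched in a few lines. This is a deliberately simplified score (mean standardized expression of the set members), not the enrichment statistic of the methods cited in [47, 48]. The gene names, values, and function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore_rows(expr):
    """Standardize each gene across samples so that genes are comparable."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def gene_set_scores(expr, gene_set):
    """Per-sample score: average standardized expression of the set members."""
    z = zscore_rows(expr)
    members = [g for g in gene_set if g in z]
    n_samples = len(next(iter(z.values())))
    return [mean(z[g][s] for g in members) for s in range(n_samples)]

# Hypothetical toy data: four genes measured in four samples.
expr = {
    "GENE_A": [5.1, 5.3, 8.9, 9.2],
    "GENE_B": [4.0, 4.2, 7.5, 7.9],
    "GENE_C": [6.0, 6.1, 6.0, 5.9],
    "GENE_D": [9.0, 8.8, 3.1, 3.0],
}
condition = ["control", "control", "treated", "treated"]

scores = gene_set_scores(expr, {"GENE_A", "GENE_B"})
treated = mean(s for s, c in zip(scores, condition) if c == "treated")
control = mean(s for s, c in zip(scores, condition) if c == "control")
shift = treated - control  # positive: the set is up-regulated in "treated"
```

Averaging over several members is what makes the set score less sensitive to noise in any single gene. A real analysis would additionally assess significance, for example by permuting sample labels.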

One class of algorithms, termed dimension reduction algorithms, has shown great utility for the automatic identification of so-called “complex” biomarkers.

To illustrate this concept, let us consider our “real” three-dimensional world. One can describe every physical object in space with three coordinates in the three spatial dimensions. Now imagine a stem cell sample that was analyzed on an inexpensive microarray, e.g. with 48,000 features. In mathematical terms, we can regard the 48,000 probes as dimensions and we could consider the sample analyzed on a microarray as “living” in a 48,000-dimensional space.

Fortunately, many of the gene measurements display highly correlated behavior and thus mathematical methods can “condense” the single gene dimensions without any significant information loss into surprisingly few dimensions (often referred to as “Principal Components”) [12, 49, 51]. The basic geometrical intuition to reduce the data dimensions to the “Principal Components” was already present in Karl Pearson’s landmark paper published in 1901; principal component analysis (PCA) remains one of the most popular dimension reduction methods [52].
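The geometric intuition can be written down compactly. The block below uses the standard textbook formulation of the first principal component; the notation (samples x_i, unit vector w, covariance matrix S, perpendicular distance d_i) is introduced here for illustration and does not appear in the original text.

```latex
% Centered samples x_1, ..., x_n in R^p (e.g. p = 48{,}000 probes)
S = \frac{1}{n-1}\sum_{i=1}^{n} x_i x_i^{\top}, \qquad
w_1 = \arg\max_{\lVert w \rVert = 1} \; w^{\top} S\, w

% Pythagoras for each sample and any unit vector w, where d_i(w) is the
% perpendicular distance of x_i to the line spanned by w:
\lVert x_i \rVert^{2} = (w^{\top} x_i)^{2} + d_i(w)^{2}

% Summing over all samples:
\sum_{i=1}^{n} d_i(w)^{2} = \sum_{i=1}^{n} \lVert x_i \rVert^{2} - (n-1)\, w^{\top} S\, w
```

Because the first term on the right does not depend on w, the direction of largest variance is also the line with the smallest sum of squared perpendicular distances to the data points.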

With PCA, one can illustrate the basic principle of dimension-reduction techniques (see Fig. 1): single measurements are analyzed for their main variation “axis” and those measurements supporting a common dimension (component) are condensed to a single vector. Many different algorithms have been proposed for performing dimension-reduction tasks [e.g. non-negative matrix factorization (NMF) [53]]. Dimension-reduction methods vary in how the space is mapped, what constraints are imposed on the different axes and how features supporting an individual axis are selected.
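As a minimal sketch of how correlated measurements condense into one dimension, the following example extracts the first principal component by power iteration, using only the Python standard library. The three-probe toy dataset, the hidden factor, and the function names are assumptions made for illustration. Real analyses would use an established numerical library and many more probes.

```python
import math
import random

def center(samples):
    """Subtract the per-feature mean so the data cloud sits at the origin."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in samples], means

def first_principal_component(centered, iterations=200):
    """Power iteration on X^T X without forming the p x p matrix explicitly."""
    p = len(centered[0])
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        scores = [sum(x[j] * w[j] for j in range(p)) for x in centered]           # X w
        v = [sum(s * x[j] for s, x in zip(scores, centered)) for j in range(p)]   # X^T (X w)
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w

# Toy data: 3 correlated "probes" driven by one hidden factor plus small noise.
rng = random.Random(0)
samples = []
for _ in range(50):
    t = rng.gauss(0.0, 2.0)
    samples.append([t + rng.gauss(0, 0.1),
                    2 * t + rng.gauss(0, 0.1),
                    -t + rng.gauss(0, 0.1)])

centered, _ = center(samples)
pc1 = first_principal_component(centered)

# Project every sample onto the component: three numbers become one.
projected = [sum(x[j] * pc1[j] for j in range(3)) for x in centered]
total_var = sum(c * c for x in centered for c in x)
explained = sum(s * s for s in projected) / total_var
```

Here `explained` is the fraction of total variance retained by the single component. Because the three probes move together, almost nothing is lost by replacing them with one coordinate.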

The resulting data dimensions can have many advantageous properties over single gene measurements and curated gene sets, as they tend to be more stable and resistant to measurement noise (see Figs. 2 and 3). Dimension-reduction methods can be used for many biologically relevant tasks, such as sample clustering, sample identification, or response measurement [49, 51, 54].

Figure 1. Principal component analysis (PCA). Large numbers of data points (P1, P2, P3, ..., Pn) can be reduced to a small number of principal components that effectively describe the variation among the data points in multidimensional spaces. In PCA, a principal component (also referred to as a “dimension”) is defined by the direction of largest variation among points, and is the line that is optimized for the smallest sum of squared perpendicular distances (p1, p2, p3, ..., pn) to the principal component. This figure has been redrawn from Pearson’s manuscript describing PCA in 1901 for illustration of data dimension reduction techniques [52].

Figure 2. Basic transcriptomic data dimensions. Very few PCA data dimensions can clearly distinguish biologically meaningful continents and classes of biological samples. A: Here, we highlight the continent concept by roughly outlining biological entities (such as blood, brain, and muscle) that are grouped together by the first two principal components of a very large microarray atlas. B: Highlighted here is the separation of cellular states by a “malignant” dimension, which separates tissues from cancer samples and cell lines. Note the sub-separation of the “blood continent” in (A) into tissue, cancer, and cell line sub-states in (B). Upon close inspection, it is apparent that these lines between states and continents are not perfect. PCA is useful for understanding high-order structures in data. For perfect separation or even prediction of such phenotypes and states, machine learning approaches need to be trained and optimized for each separation and prediction task (see Figs. 4 and 5). This figure has been modified from a figure originally published in a study by Lukk et al. [51] (http://www.nature.com/nbt/journal/v28/n4/abs/nbt0410-322.html).

Thus, these dimensions can serve as “complex biomarkers” to characterize stem cell phenotypes and possibly other states [12, 51, 55]. Moreover, we propose that these data dimensions might serve as “internal” indicators for characterizing cellular potential for differentiation. In other words, they can be seen as predictive models, potentially even in the absence of informative wet lab outcome measures.

Stem cells “live” in cellular states, phenotypes, and continents

Early results suggest that complex and characteristic higher order structures exist among the transcriptomes of tissues and in vitro cell preparations [12, 51, 55].

So far, only little is known of such large-scale patterns in the global, multidimensional transcriptional landscape of all human cells and tissues. Although many hundreds of labs have contributed microarray studies or even transcriptional atlases to the public microarray repositories, reliable insights were scarce until a study by Lukk et al. [51].

The authors used a large, heterogeneous, and rigorously quality-controlled expression atlas with more than 5,000 arrays from public microarray repositories to search for larger order structures in the human expression space [51].

Using PCA, they found that the first three principal components have a biological interpretation and can distinguish sample groups in a biologically meaningful way (see Fig. 2A). Specifically, a “hematological axis” dimension separates cells from the hematopoietic system from solid tissues, incompletely differentiated cell types and connective tissues. A separate “brain tissue axis” separates all neural tissues from the other samples, while a “malignancy axis” separates normal tissues from cell lines and malignant samples.

Do these rather abstract findings possess any meaning for the biological experimentalist?

For one, cellular phenotypes, such as pluripotency, appear to represent “continents” in specific dimensions of the transcriptional landscape in our studies [12, 50]. “Continents” in this context are connected areas, e.g. in the multidimensional PCA sample space (see Fig. 2B). “Continents” possess distinct boundaries to other sample type “continents”. Such an area is exclusively “populated” by a biological group of samples, e.g. occupied by blood organ system samples (HSCs, hematopoietic progenitors, monocytes etc.).

Somatic differentiation of pluripotency-associated cells (e.g. MII oocytes, ESCs, epiblast stem cells, germ cells), which likely occupy a distinct continent [12, 50], is associated with a complete reorganization of the transcriptional landscape and results in a “transition” to other “continents” [12, 50]. A “continent” can contain cellular phenotypes, such as in vitro PSC preparations, or brain tissues and muscle tissues in the case of the somatic solid tissue “continent” (Fig. 2B).

Within phenotypes, several sub-states appear to be possible. Such sub-states appear to be regulated by localized features, which may or may not have any significant impact on the biology of a certain phenotype. An excellent example is the Y-chromosome. Male and female hESCs as well as iPSC lines have been derived. The gender of a PSC line can be easily identified by probing for genes located on the Y-chromosome. Yet, while the Y-chromosome has no obvious impact on the somatic differentiation abilities of a given PSC line, it will presumably influence PSCs in germ-line differentiation experiments.

Other less obvious local states have been discovered in PSCs [13, 56, 57], but it is unclear what the actual biological implications of such states are. Within the “continent” concept, epigenetic memory in iPSCs would be a good representative of a PSC sub-state in the bounds of the pluripotent phenotype with a measurable basis in the epigenetic makeup of somatic cells.

Figure 3. The Black Swan challenge. To illustrate the inherent biases introduced by the training set, we use the Black Swan problem as a thought experiment. A: An algorithmic predictor is trained to separate between swans and ducks. Such a method will tune its parameters based on the training dataset, symbolized by White Swans (Genus: Cygnus, Species: olor) and ducks (Genus: Anas, Species: platyrhynchos). Now, if the resultant predictor is used on a Black Swan (Genus: Cygnus, Species: atratus), the result can be either: (a) identifying the bird as a “perfect” White Swan or (b) misclassifying the Black Swan as a duck. We propose [50] the concept of global model-fit estimation for each test subject [65]. This can be achieved by using the general dimension-reduction algorithm and unbiased systems-wide datasets. We illustrate such an approach in (B), where the algorithm uses a “swan signature” to correctly identify the genus (Cygnus) of the animal, but highlights through the model-fit assessment problems with the exact species determination (Cygnus olor would be white).

The “continent-phenotype-state” concept in combination with novel epigenetic high-content assays could lead to the development of better tests for stem cell features.

Predictive epigenetic assays are the next frontier in stem cell biology

The last few years have brought epigenetic mechanisms to theattention of stem cell biologists. It is widely believed thatobservable stem cell and somatic phenotypes are the resultof epigenetic mechanisms.

Around 1998, commercially available gene expressionmicroarrays became a mainstay in bio-medical research.Although at that time the complete human genome had notyet been sequenced, decades of experience in studying singlegene functions already existed. Genome-wide high-contentmethods for surveying a sample’s epigenetic landscape nowface the same challenges as gene expression microarray stud-ies several years ago.

Epigenetics is the study of phenotypic variation in cells andorganisms that is not caused by changes in the underlyingDNA code. Reversible mechanisms alter the DNA’s accessibil-ity to both transcription factors and the transcriptionalmachinery; this mechanism is known to influence actualmRNA transcription.

Methylation of cytosine bases that are trailed by a guanineor adenosine base (CpG, CpA), post-translational modificationof histone tails through acetylation, and methylation atspecific amino acid residues are among the most commonlystudied epigenetic modifications. There is now a large andbewildering menagerie of other epigenetic marks (such asother histone modifications or DNA hydroxyl methylation)that can be experimentally studied in stem cells [22, 58, 59].An epigenetic code [60], a rule-based conversion of a specificcombination of histone modifications into a specific transcrip-tional state of each gene, has been proposed but is still underinvestigation [61, 62].

Even though it is evident that stem cell phenotypes are a manifestation of epigenetic mechanisms, very little is understood about how specific epigenetic changes relate to actual transcriptional states and the potential for activation or silencing of a current transcriptional phenotype.

This highlights major challenges associated with the functional interpretation of the now abundant epigenomic datasets. Thus, current large-scale efforts aiming at mapping and integrating various epigenetic marks and signals apply two main strategies to these datasets. One is to correlate epigenetic patterns with functionally interpretable readouts, such as gene expression measurements [25, 63], and a second approach is to employ integrative dimension-reduction and clustering algorithms [61, 62].

Figure 4. Pluripotency Score Dimensions. We have plotted three NMF-derived dimensions, which separate several hundred PSCs (yellow) from somatic cells and tissues (red). Three such dimensions are combined by PluriTest into a single score (Pluripotency Score, see Fig. 5) to increase robustness against biological and technical noise.


In both scenarios, the validity and predictive value are strongly dependent on implicit assumptions made with the training dataset and how the machine learning algorithms handle deviations from the data model.

‘‘Black Swans’’ challenge biological predictions

We would like to illustrate this problem with the help of the Black Swan thought experiment. The Black Swan problem dates back to the philosopher Mill [64]. Famously, he questioned if we can infer from all White Swans that Black Swans do not exist. Only one single Black Swan can disprove the conjecture that all swans are white. Also, one might want to add, ‘‘having white feathers’’ does not necessarily characterize a swan at all.

What does this have to do with stem cells and predicting differentiation potential?

All attempts at defining a physiological cellular state must start by studying a known population of cells in vitro or in vivo. From this starting population all inferences on biomarkers, predictive tests, and phenotypic assays are made. This starting population is probably the most important and often under-appreciated implicit assumption in most, if not all, stem cell studies.

Importantly, once a valid biomarker, e.g. POU5F1/OCT4 for pluripotency in genetically normal cells, has been established, the reverse conclusion cannot necessarily be deduced: a bird that is white cannot necessarily be identified as a swan; a cell that expresses POU5F1/OCT4 is not automatically pluripotent.
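The asymmetry between a marker being present in all pluripotent cells and a marker-positive cell being pluripotent can be made concrete with a small calculation. The counts below are invented purely for illustration and are not taken from any dataset discussed in this article.

```python
# Hypothetical sample counts, chosen only to illustrate the argument.
pluripotent_marker_pos = 50    # pluripotent samples expressing the marker
pluripotent_marker_neg = 0     # pluripotent samples lacking the marker
other_marker_pos = 95          # non-pluripotent samples expressing the marker
other_marker_neg = 855         # non-pluripotent samples lacking the marker

# P(marker+ | pluripotent): how often a pluripotent sample shows the marker.
sensitivity = pluripotent_marker_pos / (
    pluripotent_marker_pos + pluripotent_marker_neg
)

# P(pluripotent | marker+): how often a marker-positive sample is pluripotent.
positive_predictive_value = pluripotent_marker_pos / (
    pluripotent_marker_pos + other_marker_pos
)

print(f"P(marker+ | pluripotent) = {sensitivity:.2f}")
print(f"P(pluripotent | marker+) = {positive_predictive_value:.2f}")
```

With these assumed counts every pluripotent sample is marker-positive, yet only about one in three marker-positive samples is pluripotent, because the starting population contains many more non-pluripotent samples.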

Although all this appears to be obvious, many studies have been published claiming to identify pluripotency in certain somatic cells based on marker profiles and ‘‘confirmation’’ of differentiation potential based on extremely sensitive but unfortunately rather unspecific marker-based assays. This is in our opinion even truer for much more subtle differences among PSC lines.

We believe the core challenge is to answer this question: how can we identify the cellular state or even differentiation potential of a biological system, without ever being sure if our training population was representative for the phenotype, state or differentiation potential we want to predict?

For this, one needs a comprehensive idea of the global sample space, which currently in cell biology only high-content datasets in combination with dimension reduction algorithms can provide.

Figure 5. Neural differentiation time course analyzed with PluriTest. hESCs were differentiated into neural precursor cells (blue) over a time of 14 days. This example uses the most commonly used hESC line, WA09, which has proven excellent multi-lineage potential and is depicted in red in its undifferentiated state. On day 0 (red), day 3 (light red), day 6 (light blue), and day 14 (dark blue) triplicate samples were collected, analyzed with Illumina HT12v3 microarrays, and the raw data processed with PluriTest. More details on this experiment can be found in ref. [50]. The two PluriTest Scores (‘‘Pluripotency Score’’ and the pluripotency model-fit assessment score ‘‘Novelty Score’’) for each sample are plotted on a third dimension representing an empirical density distribution derived from a test dataset with ~400 samples. Red surfaces and positive values on the PluriTest ‘‘landscape’’ indicate where pluripotent samples are mapped by PluriTest and blue and negative values on the density function indicate somatic samples.


PluriTest’s innovation is not only the identification of more or less complex biomarkers, but also a systematic evaluation of the model assumptions through ‘‘one class’’ classification [50, 65]. Consequently, we have proposed systematically assessing model fit in unbiased biological datasets with a ‘‘Novelty Score’’ [50, 65].

Large data atlases are essential for reliable biological predictions

Toward this end, an empirical data model has to be mapped to global datasets such as very large microarray atlases as a first step.

Unfortunately, public high-content data repositories such as NCBI GEO (National Center for Biotechnology Information Gene Expression Omnibus; http://www.ncbi.nlm.nih.gov/geo/) or ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) are fraught with datasets of largely unpredictable quality. Current estimates suggest that 10% of all microarrays deposited in ArrayExpress are unusable because of technical problems [66]. For a recent microarray meta-analysis, the authors had to exclude ~40% of all arrays from the downstream analysis after thorough QC [51]. Even if arrays pass basic QC thresholds, latent data structures that are arduous to detect but highly problematic (commonly referred to as ‘‘batch effects’’) often make meaningful fine-grained analysis of cellular sub-states difficult, if not impossible [67]. Statistical batch-effect-removal techniques, such as ComBat, can seemingly solve this problem [68, 69], but in our experience are prone to also removing co-varying biological signals.

Surprisingly, next-generation sequencing (NGS) data suffer from the same problems [67] with the additional caveat that, due to the extremely rapid pace at which NGS technologies (particularly sequencing chemistries) are evolving, it is currently even more difficult to develop stable and validated models for low-level QC.

We suggest another approach: as microarray and NGS data become cheaper by the hour, even small scale experiments can be planned in such a way that they can be integrated into larger scale datasets. Importantly, replication and reference samples are key components. Integrating as many biological replicates as possible for a sample type or state leads to increased ‘‘biological noise’’, which actually helps to identify truly relevant and significant signals. For example, Guenther et al. were able to show that there are no reproducible histone signatures characteristic for iPSCs by including sufficient biological replicates in a chromatin-immunoprecipitation sequencing (CHIP-seq) and gene expression study [58]. In contrast, others erroneously claimed to have identified an iPSC signature based on sample groups of only three Affymetrix GeneChip replicates [33].

Our integrative concept has led us to create the Stem Cell Matrix 1 [12] and 2 [21, 50] datasets. In such large sample collections, batch effects can be effectively controlled if different sample types are distributed over analysis batches and array or flow cell positions in a near-random fashion and reference samples are run with each batch [70].
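The layout principle described here, random distribution of sample types over batches and positions plus a reference sample in every batch, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used for the Stem Cell Matrix datasets: the function name `assign_batches`, the sample labels, and the batch size of seven are all hypothetical.

```python
import random


def assign_batches(samples, batch_size, reference="REFERENCE", seed=0):
    """Distribute samples over batches at random and add one reference per batch."""
    if batch_size < 2:
        raise ValueError("batch_size must leave room for a sample and the reference")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)            # mixes sample types across batches
    slots = batch_size - 1           # one position per batch is kept for the reference
    batches = []
    for start in range(0, len(shuffled), slots):
        batch = shuffled[start:start + slots] + [reference]
        rng.shuffle(batch)           # randomizes array or flow cell position
        batches.append(batch)
    return batches


# Hypothetical sample labels for three sample types.
samples = (
    [f"hESC_{i}" for i in range(6)]
    + [f"iPSC_{i}" for i in range(6)]
    + [f"somatic_{i}" for i in range(6)]
)
layout = assign_batches(samples, batch_size=7)
for number, batch in enumerate(layout, start=1):
    print(number, batch)
```

A plain shuffle does not guarantee that every batch contains every sample type; for small experiments a stratified assignment may be preferable, which is one reading of the ‘‘near-random’’ wording above.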

More reliable and generalizable signatures, including those from epigenetic analyses, can be derived from comprehensive databases with large sample numbers in contrast to the more commonly employed, limited sample set experiments. For example, Bock et al. recently built a predictive model for differentiation trajectories based on a small sample set of 20 hESC lines and 12 human iPSC lines, all cultured under identical, yet undefined, culture conditions [25, 63]. It remains to be determined if such local models will be generalizable to other, now more commonly used, hPSC culture technologies. Yet, this example also illustrates the clear advantages inherent to smaller, hypothesis-driven modeling projects, as these can adapt more quickly to novel technologies. In this particular case, the authors used a global and unbiased DNA methylation-profiling method for deep feature mapping, which was recently pioneered by the same group [71].

We believe that current next-generation technologies, particularly RNA-seq, whole genome bisulfite sequencing, and CHIP-seq for histone modifications, offer the opportunity to create large-scale and feature-deep datasets that will remain relevant even for future integrative efforts.

PluriTest is a framework for future systems-wide tests

Recently, we have reported on a first generation tool that integrates the insights outlined above into an online assay for predicting pluripotency in stem cell preparations [50].

For identifying human PSCs with an unbiased approach, we have de-convoluted the transcriptomic patterns contained in a large database of microarray profiles from stem and somatic cells with a powerful dimension-reduction algorithm. This algorithm, called NMF, identifies parts-based decompositions of the large-scale patterns [50, 53]. The resulting data dimensions are used as complex biomarkers in two separate steps.

First, we mathematically decomposed the data matrix of all samples, pluripotent and somatic, and all probes to identify components (i.e. data dimensions) that separate PSCs well from all other cell types and tissues (see Fig. 4). These dimensions bear a certain resemblance to classical marker signatures, with the main difference being that there is no need for an arbitrary cut-off limiting the list of genes included in such a signature.

Secondly, we used the same approach on a data matrix that only contained PSC samples to identify gene expression patterns that, when linearly combined, can ‘‘reconstruct’’ global PSC ‘‘data’’ samples in our training dataset.
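To make the decomposition step more tangible, the following is a minimal sketch of NMF using the multiplicative update rules of Lee and Seung [53], written in plain Python on a toy matrix. It is not the PluriTest implementation: the 4 × 4 ‘‘expression’’ matrix (rows as probes, columns as samples), the choice of two components, and all function names are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reconstruction_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, k, n_iter=200, seed=0, eps=1e-12):
    """Factor V (probes x samples) into non-negative W (probes x k) and H (k x samples)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_cols)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_rows)]
    return w, h


# Toy data: samples 1-2 share one expression pattern, samples 3-4 another.
expression = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]
w, h = nmf(expression, k=2)
w_first, h_first = nmf(expression, k=2, n_iter=1)
error_final = reconstruction_error(expression, w, h)
error_first = reconstruction_error(expression, w_first, h_first)
print(f"error after 1 iteration: {error_first:.3f}, after 200: {error_final:.3f}")
```

The columns of `w` play the role of the data dimensions described above, with a weight for every probe rather than a thresholded gene list, and each column of `h` gives the non-negative contribution of those dimensions to one sample.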

With these two optimized models, we can perform two tests on global gene expression datasets. We can check a query sample for the presence of data dimensions that we have identified to be expressed exclusively in hPSC in vitro preparations (thus this test’s results are termed ‘‘Pluripotency Score’’). Based on our experience, and as discussed before, such pluripotency dimensions are unfortunately not specific for karyotypically normal PSCs.

With the second model of pluripotency, we can assess model fit by asking the question: can we reconstruct the global gene expression patterns in our test dataset by a linear combination of the NMF data dimensions? If this is true, the analyzed sample fits the test’s stem cell model. All karyotypically and epigenetically abnormal pluripotent cell lines we have analyzed so far show global or localized deviations from this model fit parameter, termed ‘‘Novelty Score’’.
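The model-fit question can be illustrated with a small reconstruction test: given a fixed set of non-negative data dimensions, fit non-negative coefficients to a query sample and measure what is left unexplained. The sketch below uses the residual norm as a stand-in for a novelty measure; this is an assumption for illustration, and the actual Novelty Score computation is described in ref. [50]. The basis matrix, the two query samples, and the function names are hypothetical.

```python
def fit_coefficients(w, v, n_iter=500, eps=1e-12):
    """Non-negative coefficients h that approximately minimize ||v - W h|| for fixed W."""
    n_probes, k = len(v), len(w[0])
    wt_v = [sum(w[g][a] * v[g] for g in range(n_probes)) for a in range(k)]
    wt_w = [[sum(w[g][a] * w[g][b] for g in range(n_probes)) for b in range(k)]
            for a in range(k)]
    h = [1.0] * k
    for _ in range(n_iter):
        # Multiplicative update: h <- h * (W^T v) / (W^T W h), element-wise
        wt_w_h = [sum(wt_w[a][b] * h[b] for b in range(k)) for a in range(k)]
        h = [h[a] * wt_v[a] / (wt_w_h[a] + eps) for a in range(k)]
    return h


def residual_norm(w, v):
    """Size of the expression pattern that the model dimensions cannot explain."""
    h = fit_coefficients(w, v)
    fitted = [sum(w[g][a] * h[a] for a in range(len(h))) for g in range(len(v))]
    return sum((x - y) ** 2 for x, y in zip(v, fitted)) ** 0.5


# Hypothetical model: two data dimensions over four probes.
# The fourth probe is not part of any dimension.
basis = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
in_model = [1.0, 3.0, 2.0, 0.0]    # 1 x first dimension + 2 x second dimension
off_model = [1.0, 3.0, 2.0, 3.0]   # same sample plus signal on the fourth probe

h_fit = fit_coefficients(basis, in_model)
print("in-model residual: ", residual_norm(basis, in_model))
print("off-model residual:", residual_norm(basis, off_model))
```

The second sample carries the same model pattern as the first, but its additional signal lies outside the span of the model dimensions and therefore remains in the residual, even though nothing resembling it was part of the basis.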

In case we find that the ‘‘Novelty Score’’ model dimensions cannot explain all expression patterns seen in the novel stem cell sample, we can ask if the residual, unexplained gene expression can be summarized by biologically meaningful ‘‘themes’’. For example, parthenote-derived hESC lines [72] show gene expression patterns that cannot be explained by conventional hESC and iPSC transcriptomic patterns around cytobands that are known to be maternally imprinted.

Together, these observations strongly suggest that there are major advantages to using the global PluriTest approach over ‘‘classical’’ in vivo pluripotency assays [30] or single-marker-based approaches, e.g. for analyzing differentiation time course experiments (see Fig. 5). The ‘‘Novelty Score’’ enables PluriTest to identify and highlight unusual patterns in a new stem cell line that deviate from those of known PSCs, without ever having been exposed to these unusual, ‘‘Black Swan’’-type patterns before.

While providing a different approach to biomarker discovery and cellular phenotype assessment, we have also designed a novel way for biologists to interact with bioinformatics software and models [www.pluritest.org].

Today, in most cases bioinformatic software is either code based (e.g. in the case of Bioconductor/R) and requires expertise in programming and biostatistical approaches, or possesses a point and click interface, which is easier to use (e.g. Expander, GenePattern, or GeneSpring), but still requires considerable expertise in bioinformatic concepts, as many of the assay parameters need to be knowingly tuned to successfully conduct a bioinformatic experiment.

Google has led the way in encapsulating complex algorithms (such as the PageRank algorithm invented by Page and Brin [73]) into optimized user-centric interfaces with instantaneous results and feedback relevant to non-expert users.

No parameters need to be tuned or set for obtaining PluriTest results from raw microarray data through an online tool, as all bioinformatic steps are predefined and optimized for the task.

Restricting the user’s ability to ‘‘tune’’ the results to the expected behavior increases the reliability of the results. Similar workflows are required by regulatory agencies in clinical trials, where the statistical analysis has to be defined before analyzing data from patients.

Such a ‘‘hard-coded’’ processing pipeline allows for a simple online interface structure similar to those of web search engines, only with the raw microarray data as ‘‘query term’’.

Conclusions

Seminal work by Takahashi and Yamanaka [74] has led the way to engineering cellular phenotypes by overexpression of transcription factors, microRNAs, siRNAs, cell selection strategies, epigenetic modifiers, small compounds, and defined media compositions. Novel analytical tools need to be developed that can identify and position resultant cell populations in multidimensional transcriptional, epigenetic, and genetic maps.

Conventional in vivo assays, such as the teratoma assay, or even embryoid body formation and subsequent immunocytochemical analysis, will not be able to keep pace with the large-scale efforts currently underway, which plan to derive, QC and use thousands of iPSC lines as tools in genome- and epigenome-wide association studies or for ethnicity-aware drug discovery efforts.

If in the near future regenerative medicine based on patient-specific stem cells becomes a reality, we will need highly efficient methods to QC stem cell preparations, without ever being able to completely map the target space or (equally relevant) undesirable genetic or epigenetic aberrations.

With iPSC-based high-throughput drug screens already producing hundreds of hit compounds within a short time, it will be necessary to quickly understand therapeutic phenotypic and state shifts as well as to assess potential hepato-, cardio-, or nephro-toxic side effects.

These challenges can be solved within a general framework of unbiased genome-wide datasets derived from nucleotide-based high-content measurements in combination with algorithmically derived target signatures and model-fit estimates. We speculate that such assays, with PluriTest as a prototypical example, will soon become abundant in academic and industry settings.

For biologists, using ubiquitous computational models will require a state transition of another kind: only increasing mathematical literacy among wet lab experimentalists will ensure that the stem cell field can truly take advantage of powerful algorithms and machine learning methodologies.

Acknowledgments

We are grateful to Andreas Schuppert, Gulsah Altun, and Sonia Vivas for critical insights. Josef B. Aldenhoff and Jeanne F. Loring have provided critical comments and guidance. Johanna Goldman and Ibon Garitaonandia helped with the neural stem cell differentiation time course. We are in debt to anonymous reviewers for improving the paper with insightful comments and suggestions. We thank Corina Becker and Anja Fritz for their support and critical discussion of this paper. B.M.S. is supported by Bayer Technology Services GmbH and the Deutsche Forschungsgemeinschaft (GSC 111). F.-J.M. is supported by an Else-Kroner Fresenius Stiftung fellowship.

References

1. Kahvejian A, Quackenbush J, Thompson JF. 2008. What would you do if you could sequence everything? Nat Biotechnol 26: 1125–33.

2. Gray J. 2009. A transformed scientific method. In Hey T, Tansley S, Tolle K, eds; The Fourth Paradigm. Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.

3. American Association for Cancer Research Human Epigenome Task Force EUNoESAB. 2008. Moving AHEAD with an international human epigenome project. Nature 454: 711–5.

4. Siva N. 2008. 1000 Genomes project. Nat Biotechnol 26: 256.

5. Hudson TJ, Anderson W, Artez A, Barker AD, et al. 2010. International network of cancer genome projects. Nature 464: 993–8.

6. Group BDW. 2001. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther 69: 89–95.


7. Augenlicht LH, Wahrman MZ, Halsey H, Anderson L, et al. 1987. Expression of cloned sequences in biopsies of human colonic tissue and in colonic carcinoma cells induced to differentiate in vitro. Cancer Res 47: 6017–21.

8. Golub TR, Slonim DK, Tamayo P, Huard C, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–7.

9. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, et al. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436–42.

10. Shi L, Campbell G, Jones WD, Campagne F, et al. 2010. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 28: 827–38.

11. Marbach D, Prill RJ, Schaffter T, Mattiussi C, et al. 2010. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci USA 107: 6286–91.

12. Muller FJ, Laurent LC, Kostka D, Ulitsky I, et al. 2008. Regulatory networks define phenotypic classes of human stem cell lines. Nature 455: 401–5.

13. Kim H, Lee G, Ganat Y, Papapetrou EP, et al. 2011. miR-371-3 expression predicts neural differentiation propensity in human pluripotent stem cells. Cell Stem Cell 8: 695–706.

14. Hummel M, Bentink S, Berger H, Klapper W, et al. 2006. A biologic definition of Burkitt’s lymphoma from transcriptional and genomic profiling. N Engl J Med 354: 2419–30.

15. Cardoso F, Piccart-Gebhart M, Van’t Veer L, Rutgers E. 2007. The MINDACT trial: the first prospective clinical validation of a genomic tool. Mol Oncol 1: 246–51.

16. Scott CT, McCormick JB, Owen-Smith J. 2009. And then there were two: use of hESC lines. Nat Biotechnol 27: 696–7.

17. Loser P, Schirm J, Guhr A, Wobus AM, et al. 2010. Human embryonic stem cell lines and their use in international research. Stem Cells 28: 240–6.

18. Hu BY, Weick JP, Yu J, Ma LX, et al. 2010. Neural differentiation of human induced pluripotent stem cells follows developmental principles but with variable potency. Proc Natl Acad Sci USA 107: 4335–40.

19. Kim K, Doi A, Wen B, Ng K, et al. 2010. Epigenetic memory in induced pluripotent stem cells. Nature 467: 285–90.

20. Hussein SM, Batada NN, Vuoristo S, Ching RW, et al. 2011. Copy number variation and selection during reprogramming to pluripotency. Nature 471: 58–62.

21. Laurent LC, Ulitsky I, Slavin I, Tran H, et al. 2011. Dynamic changes in the copy number of pluripotency and cell proliferation genes in human ESCs and iPSCs during reprogramming and time in culture. Cell Stem Cell 8: 106–18.

22. Lister R, Pelizzola M, Kida YS, Hawkins RD, et al. 2011. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471: 68–73.

23. Gore A, Li Z, Fung HL, Young JE, et al. 2011. Somatic coding mutations in human induced pluripotent stem cells. Nature 471: 63–7.

24. Ohi Y, Qin H, Hong C, Blouin L, et al. 2011. Incomplete DNA methylation underlies a transcriptional memory of somatic cells in human iPS cells. Nat Cell Biol 13: 541–9.

25. Bock C, Kiskinis E, Verstappen G, Gu H, et al. 2011. Reference maps of human ES and iPS cell variation enable high-throughput characterization of pluripotent cell lines. Cell 144: 439–52.

26. Zhao T, Zhang ZN, Rong Z, Xu Y. 2011. Immunogenicity of induced pluripotent stem cells. Nature 474: 212–5.

27. Nagy A, Gocza E, Diaz EM, Prideaux VR, et al. 1990. Embryonic stem cells alone are able to support fetal development in the mouse. Development 110: 815–21.

28. Zhao XY, Li W, Lv Z, Liu L, et al. 2009. iPS cells produce viable mice through tetraploid complementation. Nature 461: 86–90.

29. Hayden EC. 2011. Stem cells: the growing pains of pluripotency. Nature 473: 272–4.

30. Muller FJ, Goldmann J, Loser P, Loring JF. 2010. A call to standardize teratoma assays used to define human pluripotent cell lines. Cell Stem Cell 6: 412–4.

31. Dolgin E. 2010. Putting stem cells to the test. Nat Med 16: 1354–7.

32. Stevens LC, Little CC. 1954. Spontaneous testicular teratomas in an inbred strain of mice. Proc Natl Acad Sci USA 40: 1080–7.

33. Chin MH, Mason MJ, Xie W, Volinia S, et al. 2009. Induced pluripotent stem cells and embryonic stem cells are distinguished by gene expression signatures. Cell Stem Cell 5: 111–23.

34. Stadtfeld M, Apostolou E, Akutsu H, Fukuda A, et al. 2010. Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells. Nature 465: 175–81.

35. Josephson R, Ording CJ, Liu Y, Shin S, et al. 2007. Qualification of embryonal carcinoma 2102Ep as a reference for human embryonic stem cell research. Stem Cells 25: 437–46.

36. Kornblum HI, Geschwind DH. 2001. Molecular markers in CNS stem cell research: hitting a moving target. Nat Rev Neurosci 2: 843–6.

37. Baum CM, Weissman IL, Tsukamoto AS, Buckle AM, et al. 1992. Isolation of a candidate human hematopoietic stem-cell population. Proc Natl Acad Sci USA 89: 2804–8.

38. Notta F, Doulatov S, Laurenti E, Poeppl A, et al. 2011. Isolation of single human hematopoietic stem cells capable of long-term multilineage engraftment. Science 333: 218–21.

39. McCulloch EA, Till JE. 1960. The radiation sensitivity of normal mouse bone marrow cells, determined by quantitative marrow transplantation into irradiated mice. Radiat Res 13: 115–25.

40. Scholer HR, Balling R, Hatzopoulos AK, Suzuki N, et al. 1989. Octamer binding proteins confer transcriptional activity in early mouse embryogenesis. EMBO J 8: 2551–7.

41. Scholer HR, Hatzopoulos AK, Balling R, Suzuki N, et al. 1989. A family of octamer-specific proteins present during mouse embryogenesis: evidence for germline-specific expression of an Oct factor. EMBO J 8: 2543–50.

42. Kim JB, Greber B, Arauzo-Bravo MJ, Meyer J, et al. 2009. Direct reprogramming of human neural stem cells by OCT4. Nature 461: 649–3.

43. Lengner CJ, Camargo FD, Hochedlinger K, Welstead GG, et al. 2007. Oct4 expression is not required for mouse somatic stem cell self-renewal. Cell Stem Cell 1: 403–15.

44. Ponten F, Gry M, Fagerberg L, Lundberg E, et al. 2009. A global view of protein expression in human cells, tissues, and organs. Mol Syst Biol 5: 337.

45. Ravasi T, Suzuki H, Cannistraci CV, Katayama S, et al. 2010. An atlas of combinatorial transcriptional regulation in mouse and man. Cell 140: 744–52.

46. Scholer HR, Ruppert S, Suzuki N, Chowdhury K, et al. 1990. New type of POU domain in germ line-specific protein Oct-4. Nature 344: 435–9.

47. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, et al. 2003. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34: 267–73.

48. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102: 15545–50.

49. Brunet JP, Tamayo P, Golub TR, Mesirov JP. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 101: 4164–9.

50. Muller FJ, Schuldt BM, Williams R, Mason D, et al. 2011. A bioinformatic assay for pluripotency in human cells. Nat Methods 8: 315–7.

51. Lukk M, Kapushesky M, Nikkila J, Parkinson H, et al. 2010. A global map of human gene expression. Nat Biotechnol 28: 322–4.

52. Pearson K. 1901. On lines and planes of closest fit to systems of points in space. Philos Mag Ser 6: 559–72.

53. Lee DD, Seung HS. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–91.

54. Kim PM, Tidor B. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res 13: 1706–18.

55. Dudley JT, Tibshirani R, Deshpande T, Butte AJ. 2009. Disease signatures are robust across tissues and experiments. Mol Syst Biol 5: 307.

56. Hanna J, Cheng AW, Saha K, Kim J, et al. 2010. Human embryonic stem cells with biological and epigenetic characteristics similar to those of mouse ESCs. Proc Natl Acad Sci USA 107: 9222–7.

57. Nichols J, Smith A. 2009. Naive and primed pluripotent states. Cell Stem Cell 4: 487–92.

58. Guenther MG, Frampton GM, Soldner F, Hockemeyer D, et al. 2010. Chromatin structure and gene expression programs of human embryonic and induced pluripotent stem cells. Cell Stem Cell 7: 249–57.

59. Pastor WA, Pape UJ, Huang Y, Henderson HR, et al. 2011. Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature 473: 394–7.

60. Strahl BD, Allis CD. 2000. The language of covalent histone modifications. Nature 403: 41–5.

61. Cheng C, Yan KK, Yip KY, Rozowsky J, et al. 2011. A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol 12: R15.

62. Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, et al. 2011. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature 471: 480–5.

63. Boulting GL, Kiskinis E, Croft GF, Amoroso MW, et al. 2011. A functionally characterized test set of human induced pluripotent stem cells. Nat Biotechnol 29: 279–86.


64. Mill JS. 1884. A System of Logic Ratiocinative and Inductive, Being a Connected View of the Principles of Evidence, and the Methods of Scientific Investigation. 3rd edn., London, England: Longmans, Green and Co. p. 622.

65. Tax DMJ, Muller KR. 2004. A consistency-based model selection for one-class classification. Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), Cambridge, England.

66. McCall MN, Murakami PN, Lukk M, Huber W, et al. 2011. Assessing Affymetrix GeneChip microarray quality. BMC Bioinf 12: 137.

67. Leek JT, Scharpf RB, Bravo HC, Simcha D, et al. 2010. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11: 733–9.

68. Johnson WE, Li C, Rabinovic A. 2007. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8: 118–27.

69. Luo J, Schumacher M, Scherer A, Sanoudou D, et al. 2010. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10: 278–91.

70. Cheng C, Shen K, Song C, Luo J, et al. 2009. Ratio adjustment and calibration scheme for gene-wise normalization to enhance microarray inter-study prediction. Bioinformatics 25: 1655–61.

71. Meissner A, Mikkelsen TS, Gu H, Wernig M, et al. 2008. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454: 766–70.

72. Harness JV, Turovets NA, Seiler MJ, Nistor G, et al. 2011. Equivalence of conventionally-derived and parthenote-derived human embryonic stem cells. PLoS One 6: e14499.

73. Page L, Brin S, Motwani R, Winograd T. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.

74. Takahashi K, Yamanaka S. 2006. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126: 663–76.
