Appendix A
Probability, Distribution Theory,
and Statistical Inference
A.1 Definitions
We define an event as any set of possible outcomes of a random experiment. In turn, an outcome is the result of a single trial of a random experiment. A random experiment is an experiment that has an outcome that is not completely predictable, that is, we can repeat the experiment under the same conditions and potentially get a different result. The sample space (sometimes called universe) is the set of all possible outcomes for one trial of the random experiment and is often denoted by Ω.¹ The symbol ∅ denotes the set containing no outcomes, or the empty set.

¹ The letter U is also sometimes used for this, and we may use either interchangeably.
Using the above definitions, we can now define the union, intersection, and complement of events, using two events A and B:

• The union of two events A and B is the set of outcomes that are in either A or B or both, and is denoted by A ∪ B.
• The intersection of two events A and B is the set of outcomes that are in both A and B simultaneously and is denoted by A ∩ B.

• The complement of an event A is the set of outcomes that are not in A and is denoted by Ā or A′.

• An event B is a subset of an event A if all of the outcomes of B are also outcomes of A.

• Two events A and B which have no outcomes in common are mutually exclusive, or disjoint. In this case the occurrence of A, without loss of generality, precludes the occurrence of event B.

[In the original, each definition is accompanied by a Venn diagram within the universe Ω.]
A.2 Axioms of Probability
A probability is a real number between 0 and 1 that is assigned to each of the events
that an experiment generates, with higher values indicating that the event is more
likely to occur. Assigning a value of 1 to an event means that the event is certain
to occur while assigning a value of 0 means that the event can never occur. The
following axioms must be satisfied for any consistent probability assignment:
1. 0 ≤ P(A) ≤ 1 for any event A.
2. P(Ω) = 1.
3. If A and B are mutually exclusive events, that is, A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B). In the more general case, P(A ∪ B ∪ C ∪ ⋯) = P(A) + P(B) + P(C) + ⋯
We can now use these axioms to prove the following rules:
1. P(∅) = 0. Since Ω ∪ ∅ = Ω and Ω ∩ ∅ = ∅, by axiom 3, 1 = 1 + P(∅).
2. P(Ā) = 1 − P(A). Since Ω = A ∪ Ā and A ∩ Ā = ∅, by axiom 3, 1 = P(A) + P(Ā).
3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B). We can write A ∪ B = A ∪ (Ā ∩ B), where the two parts are disjoint, so by axiom 3, P(A ∪ B) = P(A) + P(Ā ∩ B). Similarly, B = (A ∩ B) ∪ (Ā ∩ B), where the two parts are disjoint, so by axiom 3, P(B) = P(A ∩ B) + P(Ā ∩ B). Substituting for P(Ā ∩ B) in the first equation gives P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

A visualization of this rule from the Venn diagram shows us that P(A) + P(B) would actually count P(A ∩ B) twice; thus we subtract it once from the sum.
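Rule 3 is straightforward to check by simulation; the following minimal R sketch uses two illustrative events defined on a simulated die roll and estimates both sides of the identity:

```r
set.seed(42)
n <- 100000
rolls <- sample(1:6, n, replace = TRUE)

# Two illustrative events on a single die roll:
# A = "roll is even", B = "roll is at least 4"
A <- rolls %% 2 == 0
B <- rolls >= 4

pA   <- mean(A)       # P(A)
pB   <- mean(B)       # P(B)
pAB  <- mean(A & B)   # P(A ∩ B)
pAuB <- mean(A | B)   # P(A ∪ B)

# Both entries should agree (up to simulation error)
c(direct = pAuB, via_rule3 = pA + pB - pAB)
```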
A.3 Joint Probability
Consider a random experiment in which Ω includes two events A and B. We define the joint probability as the probability of both events occurring simultaneously.² The implication of this is that the set of outcomes occurs in both event A and event B, which, on the Venn diagram, is the quantity A ∩ B. That is, the joint probability of the events A and B is P(A ∩ B). But what happens if the events are independent of each other? In this case, P(A ∩ B) = P(A) × P(B).
We note here that independent events and disjoint events are not identical.
In probability, two events are independent if they have no effect on each other:
the occurrence of one has no impact on the occurrence, or lack thereof, of the other.
Thus, independence is a property that arises from the probabilities assigned to the
² Actually, we really mean simultaneously on the same iteration or repetition of the random experiment.
events and their intersection. Disjoint, or mutually exclusive, events are events which have no outcomes in common: such events with non-zero probabilities cannot be independent since their intersection is the empty set, with P(∅) = 0, which can never equal the product of the probabilities of the two events.
We call the probability of an event A, in the joint probability setting, its marginal probability, and this is calculated as P(A ∩ B) + P(A ∩ B̄) using the axioms of probability. (The confirmation of this is left as an exercise for the reader.)
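Both points can be illustrated with a short R sketch using a hypothetical two-coin experiment: the joint probability of two independent events factors into the product of their probabilities, and the marginal probability of A is recovered from its two disjoint parts.

```r
set.seed(1)
n <- 100000
# Hypothetical experiment: toss two fair coins
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)

A <- coin1 == "H"   # event A: first coin shows heads
B <- coin2 == "H"   # event B: second coin shows heads

# Independence: P(A ∩ B) = P(A) * P(B)
c(joint = mean(A & B), product = mean(A) * mean(B))

# Marginal probability: P(A) = P(A ∩ B) + P(A ∩ ¬B)
c(marginal = mean(A), via_parts = mean(A & B) + mean(A & !B))
```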
A.4 Conditional Probability
If we are told that event A has occurred, then we know that everything outside of A (that is, A′) is no longer possible. We can consider the reduced universe Ωᵣ = A since we only need to consider outcomes within A. Consider a second event B. If we know that event A has occurred, will this affect the probability of our second event B occurring? That is, what is the conditional probability of B occurring given that A has occurred? The only part of the event B which will be relevant is that part which is also a part of A, that is, A ∩ B. Given that we know that A has occurred, P(Ωᵣ) = 1, by the second axiom of probability. Therefore, the probability of B given A, written P(B|A), is the probability of A ∩ B multiplied by the scaling factor 1/P(A) due to our reduced universe Ωᵣ:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \tag{A.1}$$
Note that this equation tells us that P(B|A) ∝ P(A ∩ B). In the event that A and B are independent events, P(B|A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B), which states that knowledge about the event A does not affect the probability of the event B occurring when A and B are independent events.

Although it is tempting to consider P(A|B) in the analogous way and write P(A|B) = P(A ∩ B)/P(B), we need to remember that B is an unobservable event. That is, when we calculate the conditional probability – the probability of B occurring given that A occurred – we are not observing either the occurrence or the nonoccurrence of B. Thus, the probability of A occurring will be dependent on which one of B or B̄ has occurred: P(A) is conditional on the occurrence or nonoccurrence of event B. Rewriting the conditional probability formula as

$$P(A \cap B) = P(B) \times P(A \mid B) \tag{A.2}$$
we get the rule known as the multiplication rule for probability. Although this may
seem trivial, it provides us with a conditional probability rule for any observable
event given an unobservable event and allows us to find the joint probability.
Equation A.2 shows the relationship when B occurs and, similarly,

$$P(A \cap \bar{B}) = P(\bar{B}) \times P(A \mid \bar{B})$$

covers the case when B does not occur.
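A small R sketch makes the multiplication rule concrete; the numbers are purely illustrative, with B an unobservable event (say, disease status) and A an observable one (say, a positive test result):

```r
# Illustrative probabilities only
p_B      <- 0.10   # P(B): unobservable event
p_A_B    <- 0.90   # P(A | B)
p_A_notB <- 0.05   # P(A | ¬B)

# Multiplication rule (Eq. A.2) gives the joint probabilities
p_AB    <- p_B * p_A_B             # P(A ∩ B)
p_AnotB <- (1 - p_B) * p_A_notB    # P(A ∩ ¬B)

# Marginal probability of the observable event A
p_A <- p_AB + p_AnotB
c(P_AB = p_AB, P_AnotB = p_AnotB, P_A = p_A)
```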
A.5 Bayes’ Theorem
Although we discuss Bayes’ theorem at various points in the book, we include it
here for completeness.
From its definition, the marginal probability of an event A is calculated by summing the probabilities of its disjoint parts, P(A) = P(A ∩ B) + P(A ∩ B̄), and substituting this into Eq. A.1 gives

$$P(B \mid A) = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap \bar{B})}$$

Using the multiplication rule, we can find the joint probabilities and rewrite the equation to give Bayes' theorem for an event A:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})} \tag{A.3}$$
In A.3 and our definition for P(A), we used B and B̄, which comprise the universe – Ω = B ∪ B̄ – and B and B̄ are disjoint. We say that B and B̄ partition the universe. This is the simplest case. Often we will have more than two events which partition the universe. Consider the case where we have n events such that B₁ ∪ B₂ ∪ ⋯ ∪ Bₙ = Ω and Bᵢ ∩ Bⱼ = ∅ for all i, j, where i = 1,…,n, j = 1,…,n, i ≠ j.

[Venn diagrams: the universe Ω partitioned into B₁, …, B₄, with an observable event A overlapping each part, and A decomposed into the corresponding pieces A ∩ B₁, …, A ∩ B₄.]
An observable event A will therefore be partitioned into n parts: A = (A ∩ B₁) ∪ (A ∩ B₂) ∪ ⋯ ∪ (A ∩ Bₙ), where A ∩ Bᵢ and A ∩ Bⱼ are disjoint because Bᵢ and Bⱼ are disjoint. Thus,
$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) \qquad \text{(Law of Total Probability)}$$

Using the multiplication rule gives

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
where each conditional probability is P(Bᵢ|A) = P(A ∩ Bᵢ)/P(A) and the multiplication rule is used to determine the joint probability in the numerator, P(A ∩ Bᵢ) = P(Bᵢ) × P(A|Bᵢ), to give

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)}$$
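As a minimal illustration in R (the priors and likelihoods are hypothetical), the posterior probabilities P(Bᵢ|A) follow directly from the priors P(Bᵢ) and the likelihoods P(A|Bᵢ):

```r
# Bayes' theorem over a partition B1, ..., Bn of the universe.
# 'prior' holds P(B_i); 'lik' holds P(A | B_i).
posterior <- function(prior, lik) {
  stopifnot(length(prior) == length(lik),
            abs(sum(prior) - 1) < 1e-9)  # priors must partition Ω
  joint <- prior * lik   # multiplication rule: P(A ∩ B_i)
  joint / sum(joint)     # denominator is P(A) by total probability
}

# Illustrative three-way partition
posterior(prior = c(0.5, 0.3, 0.2),
          lik   = c(0.10, 0.40, 0.70))
```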
A.6 Estimators
Consider a random sample from a distribution where we are tossing a thumb tack (or drawing pin) and where the random variable X takes a value of 1 if the tack lands on its side and 0 otherwise. Then in tossing the tack we are sampling from a distribution of X which has the probability distribution p₀ = P(Xᵢ = 0) = 1 − p, p₁ = P(Xᵢ = 1) = p, where p is the probability that the tack lands on its side.
We will not typically know the value of p, and so a problem we often face is to estimate the value of p using a random sample. In such cases, we know the distribution type, but it depends upon the values of one or more unknown parameters. In our example, the unknown parameter is p. We therefore use a random sample to obtain estimates for the values of these parameters. In the general case, we have a random variable X whose distribution is of a known type, but which depends upon the value of some unknown parameter θ. We wish to obtain a numerical value for θ, and use the procedure of obtaining a random sample from the distribution of X and then using the value of an appropriate statistic to estimate θ.
Thus, formally, an estimator for a parameter θ is a statistic (random variable) whose value is used to estimate θ. We typically use the notation θ̂ to denote an estimator for θ. On occasion, it will be obvious what an estimator is for the unknown parameter. In our thumb tack example above, an obvious procedure would be to toss the tack n times, counting the number of times the tack lands on its side. If it lands on its side a total of a times, that is, the proportion of tosses which result in the tack landing on its side, then we could use

$$\hat{p} = \frac{a}{n}$$
as our estimator for p. However, the choice is not always so clear. In fact, there may be several choices for an estimator. For example, a random variable with probability distribution

$$f(x) = \begin{cases} \dfrac{1}{\theta} e^{-x/\theta} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

could use either the sample mean or the sample standard deviation of a random sample of size n to estimate θ. An important question is which is the best estimator for θ. If θ̂ is an estimator for θ, we obviously wish to know how good an estimator it is.
An intuitive question to help us answer this is: how close is the estimated value to the real value? We use the mean to help us with this. The mean of a random variable X with probability function pₓ is defined as

$$E(X) = \sum_x x \, p_x$$

where the summation is over all possible values of X. The notation E(X) is read as the expected value, or expectation, of X.
If E(θ̂) = θ, we say that θ̂ is an unbiased estimator for θ. If E(θ̂) ≠ θ, then we define the bias as given by

$$\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

and say that θ̂ is a biased estimator for θ. If X is a random variable with mean μ and variance σ², then the sample mean X̄ and the sample variance S² are unbiased estimators for μ and σ², respectively.

If we consider

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

as an estimator for σ², then E(S²) = ((n − 1)/n)σ², and so bias(S²) = −σ²/n, which explains why

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

is usually used as the estimator of the population variance.
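A quick R simulation (the sample size, number of replicates, and true variance are arbitrary illustrative choices) makes this bias visible:

```r
set.seed(7)
n      <- 5        # a small sample size makes the bias obvious
reps   <- 100000
sigma2 <- 4        # true variance of the sampling distribution

biased <- replicate(reps, {
  x <- rnorm(n, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n          # divide by n: the biased form
})
unbiased <- replicate(reps, var(rnorm(n, sd = sqrt(sigma2))))
# var() divides by n - 1, the unbiased form

# Expect roughly (n-1)/n * sigma2 = 3.2 versus sigma2 = 4
c(mean(biased), mean(unbiased))
```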
The variance of a random variable X is defined by Var(X) = E[(X − μ)²], where μ = E(X). The symbol σ² is reserved for the variance of a distribution, with S² being used for the variance of a set of data. This is a measure of the dispersion of the probability distribution.
We say that an unbiased estimator θ̂₁ is a more efficient (better) estimator for θ than a second unbiased estimator θ̂₂ if Var(θ̂₁) < Var(θ̂₂), and the relative efficiency of θ̂₂ with respect to θ̂₁ is Var(θ̂₁)/Var(θ̂₂). This fraction is often expressed as a percentage, [Var(θ̂₁)/Var(θ̂₂)] × 100. Thus, the relative efficiency is greater than 1 (or 100%) if θ̂₂ is more efficient than θ̂₁ and less than 1 if θ̂₁ is more efficient than θ̂₂.

The criterion outlined above for comparing estimators can only be used when all the estimators under consideration are unbiased. However, under certain conditions, we may prefer to use a biased estimator. We illustrate this by returning to our thumb tack example above where we selected p̂ = a/n as our estimator, which we will now denote by p̂₁. Let us define a second estimator as follows:

$$\hat{p}_2 = \frac{a+1}{n+2}$$

We won't explain the origin of this estimator since we're using it for illustration purposes. Instead, we note that this does in fact occur as an estimator in certain more complex models. Its expectation is

$$E(\hat{p}_2) = \frac{E(a)+1}{n+2} = \frac{np+1}{n+2}$$

In order for p̂₂ to be unbiased, we would need p = 1/2; otherwise it is biased. However, if n is large, the bias is small. Further, the value of Var(p̂₂) is also small when n is large. Thus, the estimate is close to the true value of p. So how can we compare two such estimators? One common method is to use the mean square error. The mean square error (MSE) of an estimator θ̂ for a parameter θ is defined as
$$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$

As can be seen below, it is related to other quantities we have introduced above:

$$\begin{aligned}
\text{MSE}(\hat{\theta}) &= E\left[(\hat{\theta} - \theta)^2\right] \\
&= E\left[\hat{\theta}^2 - 2\theta\hat{\theta} + \theta^2\right] \\
&= E(\hat{\theta}^2) - 2\theta E(\hat{\theta}) + \theta^2 \\
&= \left\{E(\hat{\theta}^2) - [E(\hat{\theta})]^2\right\} + [E(\hat{\theta})]^2 - 2\theta E(\hat{\theta}) + \theta^2 \\
&= \text{Var}(\hat{\theta}) + \left[E(\hat{\theta}) - \theta\right]^2 \\
&= \text{Var}(\hat{\theta}) + \left[\text{bias}(\hat{\theta})\right]^2
\end{aligned}$$
Thus, MSE(θ̂) = Var(θ̂) + [bias(θ̂)]². Note also that if θ̂ is unbiased, then MSE(θ̂) = Var(θ̂). Therefore, we can think of the MSE as being a generalization of the variance that allows for bias.

An estimator θ̂, based on a sample of size n, for a parameter θ is said to be a consistent estimator for θ if

E(θ̂) → θ and Var(θ̂) → 0 as n → ∞; equivalently, MSE(θ̂) → 0.
This is an important consideration because when an estimator is consistent, it
becomes increasingly likely that the estimate is close to the parameter’s actual
value as the sample size increases.
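This comparison is easy to run in R; the true p, sample size, and replicate count below are arbitrary illustrative choices:

```r
set.seed(11)
p    <- 0.3       # hypothetical true probability
n    <- 20
reps <- 100000

a  <- rbinom(reps, size = n, prob = p)  # successes in n tosses
p1 <- a / n                             # unbiased estimator
p2 <- (a + 1) / (n + 2)                 # biased alternative

mse <- function(est) mean((est - p)^2)
# For small n, the biased p2 often achieves the lower MSE
c(MSE_p1 = mse(p1), MSE_p2 = mse(p2))
```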
A.7 The Law of Large Numbers
The Law of Large Numbers states that for repeated, independent trials with the same probability p of success in each trial, the percentage of successes is increasingly likely to be close to the chance of success as the number of trials increases.

Formally, for every fixed positive amount ε > 0, the chance that the percentage of successes differs from the probability p of success by more than ε tends to zero as the number of trials n tends to infinity.
Another way of thinking about this is to consider an experiment where the
outcome is a random variable, and where the experiment is conducted repeatedly.
The outcome for different repetitions is independent from any other. The law of
large numbers says that as the number of independent repetitions increases, the
average of the observed outcomes approaches the average of all possible outcomes.
This can be easily illustrated by using the example of a die being rolled. The average of all possible outcomes is the average value of the six numbers: (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. Obviously, 3.5 is not a possible, observable outcome for any individual experiment since it is not on the face of a die. If the observed result we see for the ith experiment is denoted xᵢ, the average of the observed outcomes is

$$\frac{1}{n} \sum_{i=1}^{n} x_i$$
If x₁ = 3 and x₂ = 5, then the average of the observed outcomes is (3 + 5)/2 = 4; if x₃ = 1, the average of the observed outcomes is (3 + 5 + 1)/3 = 3, and so on. We can make the value as close as we like to 3.5 by increasing the number of repetitions of the experiment.
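The running average is simple to watch in R (a minimal sketch; the seed and number of rolls are arbitrary):

```r
set.seed(3)
n     <- 10000
rolls <- sample(1:6, n, replace = TRUE)   # simulated die rolls

# Running average of the observed outcomes after each roll
running <- cumsum(rolls) / seq_len(n)

# The average drifts toward the expected value 3.5
running[c(1, 2, 3, 10, 100, 1000, 10000)]
```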
Appendix B
Databases and Software Projects
A plethora of software products exist to support data mining efforts. These products range from core technical infrastructure, such as the underlying database management systems and the facilities within them that support mining models, through general-purpose products that help with the various stages of the mining effort, to those products which offer very specific and complete support for particular subject areas such as microarray analysis. To include an exhaustive list of products is simply not feasible and would certainly be completely out of date by the time this book is published.
Instead, we have included a number of products and databases that we have found useful in our experiences and which, we believe, will continue to be available, developed, and supported. We include these software products and databases, with the kind agreement of their respective authors, to illustrate the various mining techniques we describe.
All of the products or databases we use are listed here for easy reference by the
reader. We have kept the information to a minimum, preferring to point the reader
to the official web site or reference since the versions, licensing details, and scope of
these products or databases may change drastically between the time of writing and
when the reader accesses the software or data.
Wherever we have been granted permission, the actual version of the software/
database used in this book can be located at www.lsdm.org/thirdparty.
The descriptions of the products and databases contained in this appendix are
from the various vendors and not from the authors.
The nature of the Internet is such that URLs accurate at the time of writing may
no longer be accessible. We have tried to provide enough context for a search in the
event that the URL does not work.
B.1 Software Projects
B.1.1 Alyuda NeuroIntelligence
http://www.alyuda.com/neural-networks-software.htm
This is a neural network software application designed to assist experts in solving real-world problems. NeuroIntelligence features only proven algorithms and techniques. It is fast and easy to use.
B.1.2 AmiGO
www.godatabase.org
This tool provides a mechanism for searching the gene ontology database, which
comprises a controlled vocabulary of terms for biological concepts and for genes
and gene products which have been annotated using the controlled vocabulary. For
example, an inquiry using “protein kinase activity” returned 93 matches in the
database.³

³ A sample run performed on May 5, 2007.

See also “The Gene Ontology.”
B.1.3 AutoClass@IJM
http://ytat2.ijm.univ-paris-diderot.fr/AutoclassAtIJM.html
AutoClass@IJM is a freely available (to academia) computational resource with a web interface to AutoClass, a powerful unsupervised Bayesian classification system developed at NASA's Ames Research Center. End users upload their datasets through a web interface; computations are then queued on the cluster server. When the clustering is completed, a URL to the results is sent back to the user by e-mail.
B.1.4 Babelomics
http://babelomics.bioinfo.cipf.es/
An integrative platform for the analysis of transcriptomics, proteomics, and genomic data with advanced functional profiling, integrating primary (normalization,
calls, etc.) and secondary (signatures, predictors, associations, TDTs, clustering,
etc.) analysis tools within an environment that allows relating genomic data and/or
interpreting them by means of different functional enrichment or gene set methods.
B.1.5 BioCarta
www.biocarta.com
This is an interactive, web-based resource for life scientists that includes information on gene function, proteomic pathways, ePosters, and research reagents.
B.1.6 Bioconductor
www.bioconductor.org
The Bioconductor project was started in 2001 with the overall objective of developing a set of tools for analyzing and understanding genomic data. Gentleman et al. (2001) is the official FAQ for this project, available at www.bioconductor.org/docs/faq, and the reader is referred to this page for more details on the project's goals and objectives.
The version of Bioconductor used in this book is version 1.9.
B.1.7 Biomedical Informatics Research Network (BIRN)
The Biomedical Informatics Research Network (BIRN) is a national initiative to
advance biomedical research through data sharing and online collaboration. Funded
by the National Center for Research Resources (NCRR), a component of the US
National Institutes of Health (NIH), BIRN provides data-sharing infrastructure,
software tools, strategies, and advisory services – all from a single source.
http://www.birncommunity.org/
B.1.8 BioPAX
www.biopax.org
This is a data exchange format for biological pathway data.
B.1.9 BioTapestry
http://www.biotapestry.org/
BioTapestry is an interactive tool for building, visualizing, and simulating genetic
regulatory networks.
B.1.10 BLAST
http://www.ncbi.nlm.nih.gov/BLAST
B.1.11 BLAT
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
B.1.12 BrainMaker
http://www.calsci.com/BrainIndex.html
This is neural network software that lets you use your computer for business and marketing forecasting; stock, bond, commodity, and futures prediction; pattern recognition; medical diagnosis; sports handicapping; and almost any activity where you need special insight.
B.1.13 Cambridge Structure Database (CSD)
The Cambridge Structural Database (CSD) records bibliographic, chemical, and
crystallographic information for organic molecules and metal-organic compounds
whose 3D structures have been determined using X-ray diffraction and neutron
diffraction. Results are recorded for single crystal studies and powder diffraction
studies, which yield 3D atomic coordinate data for at least all non-hydrogen atoms.
In some cases, the Cambridge Crystallographic Data Centre (CCDC) is unable to obtain coordinates, and incomplete entries are archived to the CSD. Crystal structure data arising from publications in the open literature and from private deposits are included.
http://www.ccdc.cam.ac.uk/products/csd/
B.1.14 CDC National Center for Health Statistics
http://www.cdc.gov/nchs/datawh.htm
B.1.15 Clone|Gene ID Converter
http://idconverter.bioinfo.cnio.es/
This is a tool for converting gene, clone, or protein IDs to other IDs, which can be
used for small queries or for tens of thousands of IDs (typically from a microarray
experiment). Most of the conversions are pre-generated every time the databases
are updated in order to get a fast answer for each query.
B.1.16 Clustal
http://www.clustal.org
B.1.17 Cognos
http://www-01.ibm.com/software/data/cognos/
B.1.18 COPASI: Biological Network Simulator
http://www.copasi.org/tiki-view_articles.php
COPASI is a software application for simulation and analysis of biochemical
networks and their dynamics. COPASI is a stand-alone program that supports
models in the SBML standard and can simulate their behavior using ODEs or
Gillespie’s stochastic simulation algorithm; arbitrary discrete events can be
included in such simulations.
COPASI carries out several analyses of the network and its dynamics and has
extensive support for parameter estimation and optimization. COPASI provides
means to visualize data in customizable plots, histograms, and animations of
network diagrams.
B.1.19 cMAP
cmap.nci.nih.gov
B.1.20 CompuCell3D
CompuCell is an open-source software modeling environment and PDE solver. It is
largely used for cellular modeling (foams, tissues, etc.); however, efforts are being
made to include fluid simulation capabilities. Created in collaboration between
groups at IU and Notre Dame, CompuCell provides an easy user interface for
complex cellular modeling.
https://simtk.org/home/compucell3d
http://www.compucell3d.org/
B.1.21 Cytoscape
http://www.cytoscape.org/
Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and biological pathways and for integrating these networks with annotations, gene expression profiles, and other data. Cytoscape was originally designed for biological research; it is now a general platform supporting complex network analysis and visualization. Its use of plug-ins to provide additional features supporting new file formats, profiling analysis, layouts, scripting, or database connectivity, to name but a few domains, allows this tool to serve as a more general-purpose tool.
B.1.22 Database for Annotation, Visualization, and Integrated Discovery (DAVID)
http://david.abcc.ncifcrf.gov/home.jsp
Database for Annotation, Visualization, and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. Its functionality includes the ability to identify enriched biological themes, particularly GO terms; discover enriched functionally related gene groups; cluster redundant annotation terms; visualize genes on BioCarta and KEGG pathway maps; display related many-genes-to-many-terms relationships on 2-D views; search for other functionally related genes not in the user-provided list; list interacting proteins; explore gene names in batch; link gene–disease associations; highlight protein functional domains and motifs; redirect the user to related literature; and convert gene identifiers from one type to another.
B.1.23 EasyNN Plus
http://www.easynn.com/
This is a low-cost, intuitive software product for developing neural networks.
B.1.24 EMBOSS
http://www.hgmp.mrc.ac.uk/Software/EMBOSS
B.1.25 Entrez
www.ncbi.nlm.nih.gov/entrez
B.1.26 Evoker
Evoker is a graphical tool for visualizing genotype intensity data in order to assess
genotype calls as part of quality control procedures for genome-wide association
studies. It provides a solution to the computational and storage problems related to
being able to work with the huge volumes of data generated by such projects by
implementing a compact, binary format that allows rapid access to data, even with
hundreds of thousands of observations.
http://www.sanger.ac.uk/resources/software/evoker/
B.1.27 Flapjack
New software tools for graphical genotyping and haplotype visualization are
required that can routinely handle the large data volumes generated by high
throughput SNP and comparable genotyping technologies. Flapjack is a new
visualization tool to facilitate analysis of these data types. Its visualizations are
rendered in real time allowing for rapid navigation and comparisons between lines,
markers, and chromosomes.
Based on the input of map, genotype, and trait data, Flapjack is able to provide a
number of alternative graphical genotype views with individual alleles colored by
state, frequency, or similarity to a given standard line. Flapjack supports a range of
interactions with the data, including graphically moving lines or markers around the
display, insertions or deletions of data, and sorting or clustering of lines by either
genotype similarity to other lines or by trait scores. Any map-based information
such as QTL positions can be aligned against graphical genotypes to identify
associated haplotypes.
All results are saved in an XML-based project format and can also be exported as
raw data or graphically as image files. We have devised efficient data storage
structures that provide high-speed access to any subset of the data, resulting in
fast visualization regardless of the size of the data.
http://bioinf.scri.ac.uk/flapjack/
B.1.28 GEO
www.ncbi.nlm.nih.gov/geo
B.1.29 Gene Ontology (GO)
The Gene Ontology (GO) is a bioinformatics initiative aimed at standardizing the
representation of gene and gene product attributes (RNA or proteins resulting from
the expression of a gene) across all species by developing a controlled vocabulary,
annotating genes and gene products, assimilating and disseminating the
annotations, and providing tools to access the data, such as the AmiGO browser.
www.geneontology.org
B.1.30 g:Profiler
http://biit.cs.ut.ee/gprofiler/gconvert.cgi
This is a gene identifier tool that allows conversion of genes, proteins, microarray
probes, standard names, and various database identifiers.
B.1.31 Graphviz
www.graphviz.org
This is an open-source graph visualization software.
B.1.32 HMMER
http://hmmer.wustl.edu
B.1.33 Java Data Mining (JDM) Toolkit
http://www.jcp.org/en/jsr/detail?id=247
B.1.34 ID Converter
http://biodb.jp/#ids
This is a tool for converting data IDs for biological molecules that are used in a
database into other, corresponding data IDs that are used in other databases.
B.1.35 iModel
http://www.biocompsystems.com/products/imodel/
B.1.36 KnowledgeMiner
http://www.knowledgeminer.com/
KnowledgeMiner (yX) for Excel is a knowledge mining tool that works with data
stored in Microsoft Excel for building predictive and descriptive models from this
data autonomously and easily.
B.1.37 Libcurl
curl.haxx.se
B.1.38 MatchMiner
http://discover.nci.nih.gov/matchminer/MatchMinerLookup.jsp
This is a set of tools that enables the user to translate between disparate ids for the
same gene using data from the UCSC, LocusLink, Unigene, OMIM, Affymetrix,
and Jackson data sources to determine how different ids relate. Supported id types
include gene symbols and names, IMAGE and FISH clones, GenBank accession
numbers, and UniGene cluster ids.
B.1.39 MATLAB
http://www.mathworks.com/products/matlab/
MATLAB® probably needs no introduction to the majority of readers. It is a high-
level technical computing language and interactive environment for algorithm
development, data visualization, data analysis, and numeric computation. Using
the MATLAB product, you can solve technical computing problems faster than
with traditional programming languages, such as C, C++, and Fortran.
You can use MATLAB in a wide range of applications, including signal and
image processing, communications, control design, test and measurement, financial
modeling and analysis, and computational biology. Add-on toolboxes (collections
of special-purpose MATLAB functions, available separately) extend the MATLAB
environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your
work. You can integrate your MATLAB code with other languages and
applications, and distribute your MATLAB algorithms and applications.
Of more specific interest is the range of toolkits that are available within the
MATLAB architecture to support specific needs.
Image Processing Toolkit
http://www.mathworks.com/products/image/#thd1
Image Processing Toolbox™ provides a comprehensive set of reference-standard
algorithms and graphical tools for image processing, analysis, visualization, and
algorithm development. You can perform image enhancement, image deblurring,
feature detection, noise reduction, image segmentation, spatial transformations, and
image registration.
B.1.40 MemBrain
http://www.membrain-nn.de/main_en.htm
MemBrain is a powerful graphical neural network editor and simulator for
Microsoft Windows, supporting neural networks of arbitrary size and architecture.
B.1.41 Microarray Databases
• Gene Expression Omnibus – NCBI: http://www.ncbi.nlm.nih.gov/geo/
• Stanford Microarray database: http://smd.stanford.edu/
• GeneNetwork system: http://www.genenetwork.org/
• ArrayExpress at EBI: http://www.ebi.ac.uk/arrayexpress/
• UNC Microarray database: https://genome.unc.edu/
• Genevestigator database: https://www.genevestigator.com/
• caArray at NCI: http://array.nci.nih.gov/caarray/
• UPenn RAD database: http://www.cbil.upenn.edu/RAD
• UNC modENCODE Microarray database: https://genome.unc.edu:8443/nimblegen
• ArrayTrack: http://www.fda.gov/ScienceResearch/BioinformaticsTools/Arraytrack/
• MUSC database: http://proteogenomics.musc.edu/ma/musc_madb.php?page=home&act=manage
• UPSC-BASE: http://www.upscbase.db.umu.se/
B.1.42 Molecular Signatures Database (MSigDB)
http://www.broadinstitute.org/gsea/msigdb/index.jsp
This is a collection of annotated gene sets for use with GSEA software.
B.1.43 NAR
www3.oup.co.uk/nar/database/c
B.1.44 Neural Network Toolbox (MATLAB)
http://www.mathworks.com/products/neuralnet/
Neural Network Toolbox™ extends MATLAB® with tools for designing,
implementing, visualizing, and simulating neural networks. Neural networks are
invaluable for applications where formal analysis would be difficult or impossible,
such as pattern recognition and nonlinear system identification and control. Neural
Network Toolbox software provides comprehensive support for many proven
network paradigms, as well as graphical user interfaces (GUIs) that enable you to
design and manage your networks. The modular, open, and extensible design of the
toolbox simplifies the creation of customized functions and networks.
B.1.45 NeuralWorks
http://www.neuralware.com/index.jsp
NeuralWorks Predict is an integrated, state-of-the-art tool for rapidly creating and deploying prediction and classification applications. Predict combines neural network technology with genetic algorithms, statistics, and fuzzy logic to automatically find optimal or near-optimal solutions for a wide range of problems. Predict incorporates years of modeling and analysis experience gained from working with customers faced with a wide variety of analysis and interpretation problems.
B.1.46 NeuroSolutions
http://www.nd.com/
This leading-edge neural network development software combines a modular, icon-based network design interface with an implementation of advanced learning procedures, such as conjugate gradients and backpropagation through time.
B.1.47 NeuroXL
http://www.neuroxl.com/
NeuroXL Classifier is a fast, powerful, and easy-to-use neural network software tool for classifying data in Microsoft Excel. Designed to aid experts in real-world data mining and pattern recognition tasks, it hides the underlying complexity of neural network processes while providing graphs and statistics for the user to easily understand results. NeuroXL Classifier uses only proven algorithms and techniques, and integrates seamlessly with Microsoft Excel.

OLSOFT Neural Network Library is a class library for creating, training, and using Back Propagation neural networks and SOFM (Self-Organizing Feature Map) networks. The library makes integration of neural network functionality into your own applications easy and seamless, enabling your programs to handle data analysis, classification, and forecasting needs.
B.1.48 Open Biological and Biomedical Ontologies(OBO Foundry)
This is a collaborative experiment involving developers of science-based
ontologies who are establishing a set of principles for ontology development with
the goal of creating a suite of orthogonal interoperable reference ontologies in the
biomedical domain.
http://www.obofoundry.org/
B.1.49 Omegahat
www.omegahat.org
B.1.50 OMIM
www.ncbi.nlm.nih.gov/entrez
B.1.51 OpenDiseaseModels.Org
OpenDiseaseModels.org is an open-source disease/systems model development
project. Analogous to open-source software development projects, the goal of this
effort is to develop better, more useful models in a transparent and public collabo-
rative forum.
http://www.opendiseasemodels.org/
B.1.52 Perl
http://strawberryperl.com/
http://www.activestate.com/activeperl/
Perl is a programming language that has seen a particularly widespread adoption in
the bioinformatics space. It is available on all the usual platforms.
B.1.53 PIPE
http://sourceforge.net/projects/pipe2/
PIPE is a standards-compliant tool with which you can create, model, and analyze Petri nets. It is the active fork of the Platform Independent Petri net Editor project, which originated at Imperial College London.
B.1.54 PNK
http://www2.informatik.hu-berlin.de/top/pnk/index.html
PNK provides an infrastructure for realizing ideas for analyzing, simulating, and verifying Petri nets.
B.1.55 PNK2e
http://page.mi.fu-berlin.de/trieglaf/PNK2e/index.html
This is a software environment for the modeling and simulation of biological
processes that uses Stochastic Petri Nets (SPNs), a graphical representation of
Markov Jump Processes.
B.1.56 Proper
Most databases employ the relational model for data storage. To use this data in a
propositional learner, a propositionalization step has to take place. Similarly, the
data has to be transformed to be amenable to a multi-instance learner. The Proper
Toolbox contains an extended version of RELAGGS, the Multi-Instance Learning
Kit MILK, and can also combine the multi-instance data with aggregated data from
RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to
work with both primary keys and indices. For MILK, the relational model is
flattened into a single table and this data is fed into a multi-instance learner.
REMILK finally combines the aggregated data produced by RELAGGS and the
multi-instance data, flattened for MILK, into a single table that is once again the
input for a multi-instance learner. Several well-known datasets are used for
experiments which highlight the strengths and weaknesses of the different
approaches. (Abstract from Reutemann et al. 2004)
http://www.cs.waikato.ac.nz/ml/proper/
B.1.57 PseAAC
http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/
This site allows you to generate pseudo amino acid compositions, according to Chou (2001).
B.1.58 PubMed
www.ncbi.nlm.nih.gov/entrez
B.1.59 R
www.r-project.org
R is an open-source analytical environment that is not only comprehensive in its
own right but is widely enhanced and extended through packages that have been
developed. At the time of writing, several hundred add-on packages are available
for R that cover a wide array of subjects and disciplines.
B.1.60 R Packages
We have used many R packages throughout this book and gratefully appreciate
the skill and time that the various authors have spent developing these packages.
In particular, we have used several packages extensively and highlight these herein.
These packages can all be accessed using the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.
B.1.60.1 odfWeave
The Sweave function combines R code and LaTeX so that the output of the code is embedded in the processed document. The odfWeave package was created so that the functionality of Sweave can be used to generate documents that the end user can easily edit. The markup language used is the Open Document Format (ODF), which is an open, non-proprietary format that encompasses text documents, presentations, and spreadsheets.
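A minimal usage sketch (the file names are hypothetical): the source document contains Sweave-style code chunks, and odfWeave() writes an editable ODF document with the results embedded.

```r
library(odfWeave)

# 'report_src.odt' (hypothetical) mixes text with R code chunks
# marked up Sweave-style (<<...>>= ... @); odfWeave() executes the
# chunks and writes the woven, editable result to 'report_out.odt'.
odfWeave("report_src.odt", "report_out.odt")
```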
B.1.61 RapidMiner
http://rapid-i.com/content/view/181/190/
RapidMiner provides comprehensive data mining capabilities, including
• Data Integration, Analytical ETL, Data Analysis, and Reporting in one single
suite
• Powerful but intuitive graphical user interface for the design of analysis
processes
• Repositories for process, data, and metadata handling
• Only solution with metadata transformation: forget trial and error and inspect
results already during design time
• Only solution which supports on-the-fly error recognition and quick fixes
• Complete and flexible: hundreds of data loading, data transformation, data
modeling, and data visualization methods
RapidMiner Image Processing Extension

http://spl.utko.feec.vutbr.cz/en/component/content/article/21-zpracovani-medicinskych-signalu/46-image-processing-extension-for-rapidminer-5
This add-on package provides capabilities including
• Local image features extraction
• Segment feature extraction
• Global image feature extraction
• Extracting features from single/multiple image(s)
• Detect a template in image (rotation invariant)
• Point of interest detection
• Image comparison
• Image transforms
• Color mode transforms
• Noise reduction
• Image segmentation
• Object detection and object detector training (Haar-like features)
B.1.62 Reactome
www.reactome.org
B.1.63 Resourcerer
pga.tigr.org/tigr-scripts/magic/r1.pl
B.1.64 RStudio
RStudio™ is a new integrated development environment (IDE) for R. RStudio
combines an intuitive user interface with powerful coding tools to help you get the
most out of R.
http://www.rstudio.org/
B.1.65 Systems Biology Graphical Notation (SBGN)
The Systems Biology Graphical Notation (SBGN) project is an effort to standardize
the graphical notation used in maps of biochemical and cellular processes studied in
systems biology.
Standardizing the visual representation is crucial for more efficient and accurate
transmission of biological knowledge between different communities in research,
education, publishing, and more. When biologists are as familiar with the notation
as electronics engineers are familiar with the notation of circuit schematics, they
can save the time and effort required to familiarize themselves with different
notations, and instead spend more time thinking about the biology being depicted.
http://sbgn.org/
B.1.66 Systems Biology Markup Language (SBML)
http://sbml.org/
A free and open interchange format for computer models of biological processes.
SBML is useful for models of metabolism, cell signaling, and more. It has been in
development by an international community since the year 2000.
B.1.67 Systems Biology Workbench (SBW)
http://sys-bio.org/sbwWiki/doku.php?id=sysbio:sbw
This is an open-source framework connecting heterogeneous software applications.
SBW is made up of two kinds of components:
• Modules: These are the applications that a user would use. We have a wide
collection of model editing, model simulation, and model analysis tools.
• Framework: The software framework that allows developers to cross program-
ming language boundaries and connect application modules to form new
applications.
B.1.68 W3C
www.w3.org
B.1.69 Weka
This is a machine learning toolkit that includes an implementation of an SVM
classifier. Weka can be used both interactively through a graphical interface and as
a software library. (The SVM implementation is called “SMO.” It can be found in
the Weka Explorer GUI, under the “functions” category.)
http://www.cs.waikato.ac.nz/ml/weka/
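A minimal sketch of driving Weka's SMO from R, assuming the RWeka package (an R/Weka bridge) is installed; the iris data is used purely as a stand-in:

```r
library(RWeka)

# Fit Weka's SMO support vector classifier to the iris data
model <- SMO(Species ~ ., data = iris)

# 10-fold cross-validated performance estimate
evaluate_Weka_classifier(model, numFolds = 10)
```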
B.2 Data Sources
B.2.1 BioModels Database
This is a repository of peer-reviewed, published, computational models. These
mathematical models are primarily from the field of systems biology, but more
generally are of biological interest. This resource allows biologists to store, search,
and retrieve published mathematical models. In addition, models in the database
can be used to generate sub-models, can be simulated online, and can be converted
between different representational formats. This resource also features program-
matic access via web services.
http://www.ebi.ac.uk/biomodels-main/
B.2.2 DDBJ
http://www.ddbj.nig.ac.jp
B.2.3 EMBL
http://www.ebi.ac.uk/embl/index.html
B.2.4 GenBank
http://www.ncbi.nlm.nih.gov/GenBank
B.2.5 Pfam
http://pfam.wustl.edu
B.2.6 PROSITE
http://us.expasy.org/prosite
B.2.7 SWISS-PROT
http://us.expasy.org/sprot
B.3 Support Vector Machines
Support vector machines (SVMs) have seen a particular growth in the tools and
software available to support this valuable supervised learning paradigm. We list a
number of products available at the time of writing.⁴
The Kernel-Machine Library (GNU) – C++ template library for support vector
machines.
Lush – a Lisp-like interpreted/compiled language with C/C++/Fortran interfaces
that has packages to interface to a number of different SVM implementations.
Interfaces to LASVM, LIBSVM, mySVM, SVQP, SVQP2 (SVQP3 in future) are
available. Leverage these against Lush’s other interfaces to machine learning,
hidden Markov models, numerical libraries (LAPACK, BLAS, GSL), and built-in
vector/matrix/tensor engine.
4 As available from the Wikipedia entry http://en.wikipedia.org/wiki/Support_vector_machine
SVMlight – a popular implementation of the SVM algorithm by Thorsten Joachims;
it can be used to solve classification, regression, and ranking problems.
SVMProt – Protein Functional Family Prediction.
LIBSVM – A library for support vector machines, Chih-Chung Chang and Chih-Jen
Lin.
YALE – a powerful machine learning toolbox containing wrappers for SVMLight,
LibSVM, and MySVM in addition to many evaluation and preprocessing methods.
LS-SVMLab – MATLAB/C SVM toolbox; well-documented, many features.
Gist – implementation of the SVM algorithm with feature selection.
Weka – a machine learning toolkit that includes an implementation of an SVM
classifier; Weka can be used both interactively through a graphical interface and as
a software library. (The SVM implementation is called “SMO.” It can be found in
the Weka Explorer GUI, under the “functions” category.)
OSU SVM – MATLAB implementation based on LIBSVM.
Torch – C++ machine learning library with SVM.
Shogun – Large-Scale Machine Learning Toolbox that provides several SVM
implementations (like libSVM, SVMlight) under a common framework and
interfaces to Octave, MATLAB, Python, and R.
Spider – machine learning library for MATLAB.
kernlab – Kernel-based machine learning library for R.
e1071 – machine learning library for R.
SimpleSVM – SimpleSVM toolbox for MATLAB.
SVM and Kernel Methods MATLAB Toolbox
PCP – C program for supervised pattern classification; includes LIBSVM wrapper.
TinySVM – a small SVM implementation, written in C++.
pcSVM is an object-oriented SVM framework written in C++ and provides
wrapping to Python classes. The site provides a stand-alone demo tool for
experimenting with SVMs.
PyML – a Python machine learning package; includes SVM, nearest-neighbor classifiers, ridge regression, multi-class methods (one-against-one and one-against-rest), feature selection (filter methods, RFE, multiplicative update), model selection, and classifier testing (cross-validation, error rates, ROC curves, statistical tests for comparing classifiers).
Algorithm::SVM – Perl bindings for the libsvm support vector machine library.
SVM Classification Applet – performs classification on any given dataset and gives
10-fold cross-validation error rate.
Appendix C
The Patient Dataset
C.1 Introduction
At various points in the book we refer to a patient dataset. We define this dataset in
this appendix rather than in the main body of the book so as not to lose too much
momentum in the book itself.
The patient dataset definition, together with some artificially created data, is
available from the companion website along with some example code for various
data mining techniques.
The datasets defined below are not intended to be viewed as complete, or even as
necessarily valuable, but simply serve to provide us with a meaningful data
environment with which to illustrate some of the data mining concepts we discuss
within the book.
For the purposes of illustration of the data mining techniques, we have
denormalized the patient dataset into two datasets – one for the patient, and one
for the tests performed on specimens/isolates from those patients. A normalized
version (to mimic a transactional environment) and a star-schema (to mimic a data
warehouse environment) are available on the companion website.
C.2 Supporting Datasets
In addition to the dataset generated from our work, we would typically use a
number of datasets that come from other sources. For example, in chapter two we
use a body mass index dataset to allow us to calculate weight-related categories.
Where so used, we indicate these in the main text.
C.3 Patient Dataset Definition
• Id (Integer; PK) – Unique key value for this record in the dataset. Every record will have a unique value.
• Patient Label (String) – Some identification for the patient. This may be their name, initials, or other “human-readable” attribute. Note that this is not an identifier in the pure sense, but is simply a text label.
• MRN (String) – Medical record number.
• Patient Id (Integer) – A unique identifier assigned to each patient. This is an integral value and may simply be used as a sequence, where the next available number is assigned to the next unique patient without any particular ordering.
• Context Id (Integer) – A unique identifier assigned to a particular context associated with this data. Contexts will typically be some project or study, or other high-level grouping for the data. In this book, the context is synonymous with a project.
• Encounter Id (Integer) – A unique identifier assigned to each encounter between a patient and a healthcare professional within a specific context. The “within a specific context” constraint is important since it is likely that this value will be a sequence within a project/study.
• Age (Integer) – The patient's age.
• Weight (Double) – The patient's weight, measured in kilograms.
• Height (Double) – The patient's height, measured in meters.
• Gender (Character) – The patient's gender. Allowed values are M(ale), F(emale), N(ot provided), and ‘ ’ or NULL.
• Smoker (Boolean) – TRUE/FALSE value indicating whether the patient is a smoker or not.
• Activity Factor (Integer) – A scalar value from 0 to 10 indicating the level of activity/exercise in the patient's daily life. 0 indicates immobility, whereas 10 indicates an extreme level of activity such as for professional athletes.
C.4 Test Dataset Definition
• Id (Integer; PK) – Unique key value for this record in the dataset. Every record will have a unique value.
• Patient Id (Integer; FK) – The patient id value from the patient dataset for the patient associated with this test result.
• Test Id (String) – Some unique identifier for the test that is consistent within the data environment. For example, LDL-C might be a test id for measurement of LDL cholesterol. This can then be compared across instances.
• Test Description (String) – A description of the test.
• Test Result Value String (String) – The actual test result. This attribute will be populated based on the test result domain being a string or character value.
• Test Result Value (Double) – The actual test result. This attribute will be populated based on the test result domain being an integer or numeric value.
• Test Result Domain (Character) – A value indicating the actual result value's domain. Values are S(tring), I(nteger), N(umeric), and C(haracter). This can be used to convert the values to their appropriate types.
• Interpretation (String) – The interpretive result, if any, associated with this test.
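As an illustrative sketch (the file and column names are hypothetical renderings of the definitions above), the two denormalized datasets can be related in R through the Patient Id foreign key:

```r
# Hypothetical CSV exports of the two denormalized datasets
patients <- read.csv("patient.csv")  # one row per patient record
tests    <- read.csv("test.csv")     # one row per test result

# Relate test results to patients through the Patient Id key
combined <- merge(patients, tests, by = "Patient.Id",
                  suffixes = c(".patient", ".test"))

# Example derived attribute: BMI from weight (kg) and height (m)
patients$BMI <- patients$Weight / patients$Height^2
head(combined)
```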
Appendix D
The Clinical Trial and Data Mining
D.1 Introduction
Throughout our book we have referred to the clinical trial as an archetype for data
mining. This is for some intuitive reasons along with some which may not be quite
as intuitive to our readers.
It is definitely true to say that much of the data generated in the healthcare and life sciences arenas relates to the pharmaceutical product or medical device industries, and by our focus on these industries, we are certainly not intending to dismiss any of the other vitally important research and development disciplines that additionally form the foundation for the life sciences today. We instead have used the pharmaceutical development process because many more people outside of these disciplines have some knowledge of it: the process is front and center in the public psyche through advertising, public information on the safety of drugs and devices, and other media outlets.
D.2 The Clinical Trial Process
D.2.1 Statistics of a Phase III Clinical Trial
We now present some of the basic statistics of a phase III clinical trial.
• Number of patients (start of trial): 2,500
• Number of visits per patient: 12
• Number of data attributes collected per patient per visit, in bytes (characters): 4,000
• Total: 114.44 MB
In the above list, the data is the maximum amount of data that might be captured; it relies on several assumptions that typically do not hold in reality. The most significant of these is the assumption that every patient is included in the trial from the beginning to the very end. This does not typically happen, as patients drop out, are excluded, or, unfortunately, die before the trial ends. Since the largest driver of the data volume is obviously the number of patients involved in the trial, we will assume that only 25% of the patients initially enrolled see the trial through. This would result in 28.61 MB of data.
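The arithmetic behind these figures is straightforward to reproduce; a quick R check (taking 1 MB = 1024² bytes):

```r
patients <- 2500
visits   <- 12
bytes    <- 4000   # bytes captured per patient per visit

total <- patients * visits * bytes   # 120,000,000 bytes
total / 1024^2                       # ≈ 114.44 MB at full enrollment

0.25 * total / 1024^2                # ≈ 28.61 MB if 25% complete the trial
```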
References
Chou K-C (2001) Prediction of protein cellular attributes using pseudo amino acid composition.
Proteins 43:246–255
Gentleman RC et al (2001) The Bioconductor FAQ. www.bioconductor.org/docs/faq
Reutemann P, Pfahringer B, Frank E (2004) A toolbox for learning from relational data with
propositional and multi-instance learners. In: 17th Australian joint conference on Artificial
Intelligence (AI2004). Springer, Berlin
Index
A
Accuracy, 5, 10, 12, 25, 42, 70, 80, 120,
133–135, 142, 192, 211, 217, 233, 240,
260, 278, 327, 332, 352, 364, 365,
370–373, 396, 407, 429, 458, 460, 467,
472, 474, 484, 489–497, 514, 515, 585
Aggregation, 21, 41, 110–113, 202, 209–211,
225, 379, 497, 544
Analytical entity, 193, 194
Apriori algorithm, 146, 148–150, 372, 447, 587
ARFF, 86, 333
Artificial neural networks, 396, 399–425, 577
Association, 4, 6, 8, 11, 14, 18, 25, 38, 42, 46,
69–73, 82, 86, 87, 105, 119, 121,
145–152, 161, 170–172, 183, 283–285,
289, 297, 339, 364, 365, 372, 397, 447,
548, 561, 562, 566, 568, 576, 580
Association analysis, 42, 46
Association rules, 69–73, 82, 145–152, 183,
364, 372, 447
Asymmetry, 75, 238, 272–273
Attribute, 4, 36, 87, 125, 192, 236, 304, 365,
459, 514, 564
B
Back-propagation, 396, 412–416, 423–425,
469, 477
Bayes factor, 312–315
Bayesian, 14, 29, 183, 186, 237, 240, 241,
303–360, 456, 457, 469, 471–477, 480,
549, 577
Bayesian belief networks, 337–340
Bayesian classification, 327–337, 469
Bayesian modeling, 471–477
Bayesian reasoning, 322–327
Bayes’ theorem, 293, 304, 309–312, 317, 320,
327, 331, 339, 471
Bias, 10, 43–45, 47, 55, 129, 196, 222, 229,
239, 260, 269, 341, 351, 404, 411,
459, 495
Bias-variance decomposition, 55
Binary tree algorithm, 436–437
Binning, 211–215
Boolean logic, 440–448
Box plot, 154, 159–161, 272
Branching, 133–142, 144, 145, 532
Branching factor, 128
C
C4.5, 79, 138, 587
Causality, 8, 11
Characteristic, 24–26, 34, 36, 38, 42, 45, 48,
52, 57, 65, 70, 77, 81, 95, 102, 113, 133,
144, 145, 170, 216, 243, 261, 270, 286,
293, 294, 298, 325, 327, 351, 371, 375,
383, 386, 396, 399, 405, 424, 425, 431,
460, 472, 478, 482, 492, 512, 513, 515,
531, 540, 546, 549, 550, 564, 590
Characterization, 45, 351, 363
Classification, 4, 35, 87, 127, 241, 308, 364,
455–498, 513, 544, 587
Classification and regression trees, 129–145
Classification rules, 66–69, 72, 127–129, 371,
464, 465
Cluster analysis, 18, 47–48, 266, 364, 365, 396,
397, 446, 447
Clustering, 4, 14, 25, 73–74, 76, 77, 81, 119,
168, 169, 177, 187, 194, 211, 215,
371–373, 383, 396, 397, 426–431, 436,
446, 448, 456, 458–460, 473, 476, 477,
631
481, 483, 485, 486, 498, 514, 532, 533,
535, 544, 546, 576, 577
Concept, 2, 33–82, 85, 132, 194, 236, 304, 364,
456, 502, 553, 585
Concept hierarchy, 38, 52–55, 205, 206,
215, 227
Conditional random field (CRF), 351–355
Confidence, 13, 26, 42, 70, 75, 114, 145–148,
176, 238, 250, 259, 260, 278–282, 293,
341, 372, 394, 492
Confidence intervals (CI), 259, 260, 278–282,
293
Contingency tables, 127, 277, 278
Correlation, 2, 3, 6, 8, 11, 18, 24, 33, 35, 44, 45,
55, 75, 145, 161, 169, 186, 194, 219,
221, 224, 238, 241, 242, 259–265, 283,
284, 289, 295, 296, 314, 321, 322, 329,
396, 460, 480, 486–488, 531, 563
Co-training, 373, 374, 377–381, 387
Coverage, 4, 42, 70, 145, 208, 209, 260, 372,
467, 511
Cross-validation, 132, 143–144, 382, 493, 494
D
Data architecture, 2, 4, 12, 13, 29, 82,
85–122, 193
Database for Annotation, Visualization, and
Integrated Discovery (DAVID), 208,
209, 297
Databases, 2–4, 11, 12, 14, 16, 24, 26, 29,
30, 34, 53, 85, 86, 94, 96, 98, 99,
103–105, 110, 111, 116–118, 148,
171, 176, 181, 197, 202–204,
206–208, 218, 222, 228, 266, 290,
294, 295, 297, 315, 327, 328, 349,
352, 378–380, 433, 447, 458,
502–504, 507–512, 520, 538, 545,
546, 553, 563–565, 567, 570, 576
Data curation, 111, 115, 504
Data format, 13, 22–23, 172, 198, 504
Data integration, 102–106, 504–508, 545
Data mart, 29, 110–112, 228
Data mining
ethics, 10–12
process, 15, 19, 24, 43
results, 23, 24, 125–189
Data modeling, 85–122, 203
Data plot, 154–158
Data preparation, 19, 21–23, 93, 194–195,
459–460
Data reduction, 194, 212, 214, 215, 224–227
Data warehouse, 3, 4, 17, 29, 37, 53,
110–113, 115–116, 126, 191, 192,
207, 228, 230, 508
DAVID.See Database for Annotation,Visualization, and Integrated Discovery
(DAVID)
Decision tables, 125, 127–129
Decision trees, 23, 38, 44, 63–68, 79–81,
127–129, 215, 327, 332, 371, 448, 457,
463–466, 469, 588
Denormalization, 40, 201–206
Dependent entity, 87, 88
Diffusion map, 426–431
Dimensionality, 49–51, 74, 80, 114, 211, 217,
394, 426, 480, 481, 497
Discretization, 62, 193, 194, 212, 215, 227
Discrimination, 45
Distance, 48, 76–79, 154, 162, 198, 199, 216,
272, 340, 369, 382, 383, 385, 389–391,
416, 417, 422, 423, 426, 446, 447, 459,
470, 480, 484, 486–488, 498, 531–535,
537, 554
Dynamical systems, 202, 546–548,
567, 580
E
Enrichment analysis, 294–297
Entity, 46, 86–92, 95, 102–103, 105–109, 120,
180, 181, 184, 193, 194, 196, 218, 374,
379, 380, 569, 570, 576
Entrez, 99, 176, 208, 508–510, 566, 569
Entropy, 75, 134, 137–138, 140, 145, 238,
269, 270, 351, 467
Error, 10, 36, 86, 138, 211, 239, 333, 367, 461,
512, 546
Expectation-maximization (EM) algorithm, 119, 223, 346, 373–377, 382, 448, 587
External consistency, 195, 197, 206
Extract, 5, 17, 22, 91, 100, 120, 126, 132,
137, 204, 206–209, 217, 264, 363,
365, 378, 384, 466, 479, 505, 553,
567, 568, 576, 579
F
False discovery rate (FDR), 296–300
Feature selection, 51, 79–81, 364, 515
Feed-forward network, 400, 411–413, 477
Frequency polygon, 154
Functional networks, 404, 411, 568
Fuzzy logic, 394, 440–448
G
Gaussian mixture, 375–377
Gene expression analysis, 394–398, 430
Gene identifier cross-referencing,
208–209
Gene ontology (GO), 95–96, 116–118,
152, 176, 294, 509, 512, 544,
563–565, 567, 577, 578
Generalization, 4, 37, 43–45, 52, 81, 82,
87–89, 226–227, 390, 393, 423, 469,
493, 501, 565
Gene selection, 397, 514, 515
Genome-wide association studies (GWAS),
171, 172, 297, 580
Genomic data, 169, 170, 503, 563
Genotype visualization, 171–172
Gini, 75, 134–140, 142, 145, 238,
269–272
Graph theory, 130, 339, 554–563
H
Heat map, 166–170, 510
Heterogeneity, 75, 86, 104, 105, 238,
269–270, 363
Hidden Markov models (HMM), 49,
342–356, 358, 359, 374, 377, 387,
480, 538
Histogram, 153–154, 268, 277, 296,
298, 354
Hypothesis testing, 239–265, 300
I
Independent entity, 87, 88
Information gain, 134, 137–139
Input representation, 39–40
Instance, 10, 37, 87, 126, 193, 246, 304, 364,
457, 514, 544, 587
Integrating annotations, 505
Interestingness, 26–28
Internal consistency, 192, 195–197
Interpretive data, 93
K
Kernels, 50, 368, 383–386, 388, 391–394,
417, 426, 428, 431, 549
K-means, 49, 74, 375, 448, 477, 481–486,
498, 587
K-nearest neighbor (k-NN), 198, 446, 457,
470, 587
Kurtosis, 75, 238, 273–274
L
Learning algorithm, 49, 50, 73, 183, 382, 383,
385, 388, 391, 399, 405, 407, 408, 425,
448, 481, 588
Least squares, 74, 76, 286–290, 461
Linear model, 55, 74–75, 81, 130, 157, 161,
282, 286, 463, 549
Linear regression, 74, 212, 283–290, 460–463,
477, 488
M
Many-to-many relationship, 87, 88
Marginal probability, 309, 310, 471, 477
Maximum a posteriori (MAP) estimation, 235,
293–294, 374
Maximum likelihood estimation (MLE), 75,
76, 290–294, 374, 376
Measures of central tendency, 266–276
Metadata, 105, 215–217
MIAME. See Minimal information about a microarray experiment (MIAME)
Microarray databases, 563–565
Microarrays, 8, 9, 33, 121, 122, 171, 393,
395–397, 447, 457, 502, 505, 513–515,
537, 538, 544, 546, 548, 567
Minimal information about a microarray
experiment (MIAME), 505, 565
Minimum confidence threshold, 146
Minimum support threshold, 146, 148
Mining by navigation, 561–562
Missing data, 21, 40–41, 55, 58, 62, 82, 94,
104, 194, 199, 217–219, 222, 223,
228, 231, 232, 366, 367, 460, 474
Missing values, 22, 40, 41, 62, 66, 128, 129,
199, 217–223, 267, 374, 467
Model, 2, 34, 85, 126, 192, 236, 305, 364,
456, 501, 543, 585, 586
Modeling, 13, 19, 22, 23, 29, 51, 56–58, 66,
85–122, 126, 203, 216, 217, 230, 233,
237, 241, 282, 314, 322, 326, 347, 352,
355, 377, 438–440, 457, 461, 471–477,
502, 546–549, 552, 578, 580, 586
Monte Carlo strategies, 56–60
Motifs, 57, 94, 295, 297, 373, 387, 431–436,
480, 516–520, 528, 539, 577
Multiple sequence alignment, 99, 343
N
Naïve Bayes, 80, 304, 587
Naïve Bayes classifier, 183, 304, 309, 311, 312, 332–337
Nearest neighbor, 81, 382, 383, 469–470, 588
Neighbors, 76–79, 99, 188, 434, 470
Network analysis, 176, 513, 577
Network (graph)-based analysis, 431
Network motif, 431–436, 577
Network visualization, 172–175
Neural networks, 3, 46, 49, 241, 327, 332,
364, 370, 391, 392, 394, 396, 404,
411, 424, 436, 456, 457, 460, 474,
477–481, 492, 498, 549, 577, 588
Noisy data, 12, 211–215
Normal form (NF), 90, 91, 120, 201, 208
Normalization, 6, 49, 50, 90, 91, 121, 194, 196,
198, 200–201, 204, 406, 426, 510, 545
Null hypothesis, 75, 164, 238, 241–245, 250,
254, 256, 259, 263
O
Odds ratio, 312–315
One-to-many relationship, 87, 88, 203, 207
Ontologies, 95, 96, 105, 116, 152, 512, 564,
565, 567, 577
Operational data store (ODS), 110–113
Outlier, 4, 21, 22, 38, 48–49, 154, 163,
211, 239, 258, 268, 462, 488, 493,
510, 544, 576
Over-fitting, 12, 26, 27, 51–52, 81, 134, 142,
218, 290, 371, 393, 394, 458, 466,
469, 474, 476, 502
Over-training, 50, 368
P
Pairwise sequence alignment, 521–527
Parameter estimation, 340–341
Partial ordering, 53
Patterns, 2, 34, 102, 126, 194, 236, 324, 364,
457, 502, 543, 586
Perceptron, 391, 407–409, 411–412, 423,
424, 477
Petri nets (PNs), 437–440, 503, 549
Phylogenetic trees, 174, 188, 531, 537
Phylogenetic tree visualization, 177–178
Position map, 170–171
Posterior distribution, 304, 340
Prediction, 3, 4, 18, 27, 28, 34, 46–47, 52, 57, 127, 132, 144, 170, 240, 322, 335, 352–354, 356, 365, 372, 386–389, 395–397, 416, 422, 455–498, 501, 509, 514, 515, 543, 546, 553, 576
Principal component analysis, 51, 56, 80, 159, 169, 170, 194, 224, 225, 426, 477, 546
Prior distribution, 304, 324, 475
Prior probability, 305, 306, 308, 311, 312, 319, 324–326, 340, 471, 472, 477
Probabilistic Boolean networks (PBN), 548–550
Protein, 28, 33, 93, 172, 207, 290, 313, 386,
478, 502, 544
Protein identifier cross-referencing, 208–209
Proteomic data, 8, 503–504, 514
P-value, 200, 243–245, 255, 261–265, 296–299
Q
Qualitative and quantitative data, 90–93
R
1R, 16, 41, 60–63, 82, 237, 304, 465–469
Radial basis function (RBF), 50, 393, 416, 417
Random walk, 426–431
Receiver operating characteristic (ROC), 386, 486, 492
Reducing (the) dimensionality, 80, 481
Regression, 27, 35, 74, 129–145, 156, 157, 161,
166, 183, 186, 211, 212, 216, 238, 239,
241, 259, 283–287, 289, 290, 364, 389,
393, 407, 447, 457, 460, 577, 587
Regression analysis, 27, 46, 141, 282–290,
456–458
Regular expressions, 480, 516–520, 589
Regulatory networks, 431, 547–549, 552,
576, 580
Reinforcement learning, 425
Relational data model, 85, 94
Relevance, 80–82, 216, 242, 300, 473
Relevance analysis, 47, 194, 459
S
Sampling error, 239
Scatterplot, 154, 157, 161–162, 283–289
Self-organizing map (SOM), 49, 418, 419,
422, 423
Semi-supervised learning (SSL), 373–383
Sequence data, 28, 34, 93–95, 207, 215, 319,
386, 430, 492, 503–505, 508, 509, 520,
521, 528, 531, 536, 589
Shortest common superstring (SCS), 574, 575
Similarity, 14, 47, 48, 73, 76, 77, 94, 99, 173,
368–369, 381–388, 392, 393, 426, 428,
459, 464, 470, 478, 479, 486–488, 498,
510, 527, 539
Singular value decomposition (SVD), 51
Skewness, 75, 238, 258, 274
Spectral clustering, 426–431
Standardization, 6, 194, 198, 222, 512, 545
Star schema, 86, 109, 193
Statistical inference, 10, 237, 239–265,
269, 471
Statistical significance, 44, 242, 260, 300, 515
Structural risk minimization (SRM), 393–395
Subtype, 87–89, 354, 373, 397, 458, 509
Supertype, 88, 89
Supervised learning, 37, 216, 363–366, 368,
370–372, 389, 395, 399, 407–423, 425,
448, 473
Support, 2, 12, 13, 17, 19–21, 25, 29, 30, 38,
42, 70, 82, 86, 103, 110, 121, 126,
145–147, 162, 172, 175–177, 195, 196,
207, 228, 239, 251, 262, 268, 297, 320,
335, 351, 372, 378, 388, 390, 426, 429,
430, 434, 443, 444, 456, 471, 477, 479,
502, 508, 511, 512, 548, 552, 566
Support vector machines (SVMs), 46, 369,
384, 387–398, 456, 457, 469, 492, 587
Systems biology, 7, 97, 152, 153, 184, 498,
501, 502, 543–581
Systems biology graphical notation (SBGN),
183–185
Systems biology markup language (SBML),
176, 178–182, 186, 548
T
Temporal models, 544, 578
Text mining, 4, 97, 119, 120, 217, 377, 544,
546, 565–571, 575
Time dimension, 112–113, 230
Tools, 2, 56, 95, 126, 207, 236, 323, 374, 456,
504, 546, 586
Total ordering, 53, 54
Training, 23, 35, 91, 129, 260, 312, 364, 457,
502, 544
Training data, 35–37, 39, 43, 52, 282, 329,
333–335, 337, 345, 346, 353, 371,
373, 393, 413, 457, 467, 469, 480,
494, 502, 515
Transactional data model, 110–112
Transform and load, 17, 206–209
Transformation, 22, 82, 103, 104, 110, 126,
181, 191, 194, 198–200, 207, 274,
380, 412
Tree pruning, 133, 142
Twoing rule, 134, 137
U
Unsupervised learning, 37, 49, 50, 73,
363–366, 368, 372–373, 396, 399,
406, 407, 418, 423, 458, 459, 473,
477, 481, 544
V
Validation, 14, 23, 24, 29, 52, 69, 111,
132, 143, 216, 327, 371, 458, 460,
465, 468, 469, 478, 490, 494–496,
498, 534
Variation, 55, 75, 116, 170–172, 231, 238,
289, 296, 355, 396, 493, 509, 515,
528, 544, 580