Hypothesis-Driven and Exploratory Data Analysis


  • 8/12/2019 Hypothesis-Driven and Exploratory Data Analysis


    Hypothesis-Driven and Exploratory Data Analysis

    The 14th-century maxim known as Ockham's Razor, paraphrased by Jefferys and Berger (1992) as "It

    is vain to do with more what can be done with less", is usually applied to the interpretation of

    scientific results. However, it applies equally well to choice of analysis. Thus if one has a very simple

    ecological data set, consisting of few species and few samples, ordination is not worthwhile. In such a

    case, the data are easiest to interpret in a simple table.

    In a typical data set, however, there are dozens of species and samples. It is impossible for the human

    mind to simultaneously contemplate dozens of dimensions. The purpose of ordination is to assist the

    implementation of Ockham's Razor: a few dimensions are easier to understand than many dimensions.

    A good ordination technique will be able to determine the most important dimensions (or gradients) in

    a data set, and ignore "noise" or chance variation.

    Both direct and indirect gradient analysis have the potential to reduce the dimensionality of a data set.

    However, reduction of dimensionality is not the only reason to use ordination. Before the

    development of CCA, most widely-used ordination techniques were indirect, and the primary goal of

    ordination was considered "exploratory" (Gauch 1982). It was the job of the ecologist to use his or her

    knowledge and intuition to collect and interpret data; pure objectivity could potentially interfere with

    the ability to distinguish important gradients. Ordination was often considered as much an art as a

    science.

    Once CCA was available, multivariate direct gradient analysis became feasible. It became possible to

    rigorously test statistical hypotheses and go beyond mere "exploratory" analysis. However, testing

    hypotheses requires complete objectivity, which results in repeatability and falsifiability. The two

    basic motivations for multivariate direct gradient analysis, hypothesis testing and exploratory analysis,

    conflict with each other to some extent:

    Table 1. Hypothesis-driven analysis, exploratory analysis, and their major characteristics and

    motivations. This table applies to regression techniques and indirect gradient analysis in addition to

    CCA.

    HYPOTHESIS-DRIVEN | EXPLORATORY
    Motivating question: "Can I reject the null hypothesis that species are unrelated to a postulated environmental factor or factors?" | Motivating question: "How can I optimally explain or describe variation in my data set?"
    objective | subjective
    sites must be representative of universe: random, stratified random, regular placement | sites can be "encountered" or subjectively located
    analyses must be planned a priori | "data diving" permissible; post-hoc analyses, explanations, hypotheses OK
    p-values meaningful | p-values only a rough guide
    stepwise techniques not valid without cross-validation | stepwise techniques (e.g. forward selection) valid and useful

    Hypothesis-Driven and Exploratory Data Analysis http://ordination.okstate.edu/motivate

    of 3 6/5/14, 6:44

    To perform a hypothesis-driven analysis, one must be very specific about the analyses one wishes to

    perform. The null hypothesis must be clearly stated, and the data must be collected in a repeatable

    manner. Usually, the sampling design will involve random, stratified random, or regular distribution of

    study plots. If there is any subjectivity involved in locating or orienting study plots, the results are

    technically not valid. All of the analyses, including variations of data transformation and use of

    different ordination options (e.g. detrending or not), must be planned in advance, or else the user runs

    the risk of "data diving" or "data mining", i.e. getting an artificially significant result because so many

    options are tried. Stepwise techniques (discussed later) are automated forms of "data diving", and will

    typically also lead to incorrect statistical inference (Cliff 1987, Draper and Smith 1981). The reward

    for rigorously adhering to these rather stringent criteria is that the statistical inference (i.e. the p-value)

    is valid.
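The inflation caused by trying many options can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each analysis variant behaves as an independent test at the same significance level and that the null hypothesis is true every time. Variants of one ordination are usually correlated, so the real inflation will differ. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n_tests tests.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    # Probability that all tests avoid a false positive is (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Compare one planned analysis with 5 or 20 unplanned variants.
    for n in (1, 5, 20):
        print(n, round(familywise_error_rate(0.05, n), 3))
```

Under these assumptions, twenty unplanned variants at a significance level of 0.05 give roughly a 64% chance of at least one spuriously "significant" result.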

    Exploratory analyses might lack statistical rigor, but they are still a mainstay of vegetation research.

    The purpose of exploratory analysis is to find pattern in nature, which is an inherently subjective

    enterprise. Exploratory analyses incorporate the wisdom, skill, and intuition of the investigator into

    the experiment. Unless you can find another investigator with identical wisdom, skill and intuition, the

    analyses are not strictly repeatable, and are hence not falsifiable. While it is possible to perform

    exploratory analyses on sample plots located according to a rigorous, objective sampling design, such

    careful placement is not necessary. Indeed, an exploratory analysis can be aided if the investigator

    subjectively places study plots in locations he or she considers to be important or interesting.

    Orienting plots within vegetation which appears homogeneous is highly subjective, but very useful in

    evaluating differences between plots.

    With exploratory analysis, "data diving" (e.g. using different transformations of species abundances,

    adjusting ordination options, selecting different subsets of environmental variables, or selecting

    different subsets of study plots) is no longer to be avoided. Instead, it is a way for the investigator to

    learn more about the data set. Stepwise analysis is a form of automated data diving. It is useful as a

    tool to help discover "important" or "interesting" variables.

    Ecologists are often misled into thinking that p-values from stepwise methods have a rigorous

    meaning, and that the results of stepwise methods give the best possible model. Such thinking is false.
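One way to see why such p-values are too optimistic is to compare two permutation tests on the same data. This is a minimal sketch, not a method taken from the source: it mimics only the first step of forward selection (pick the single predictor with the strongest Pearson correlation), and all function names are hypothetical. The "naive" p-value tests the chosen predictor as if it had been specified in advance; the "honest" p-value repeats the selection step inside every permutation.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_abs_r(predictors, y):
    """Strength of the predictor a first forward-selection step would pick."""
    return max(abs(pearson_r(x, y)) for x in predictors)


def selection_p_values(predictors, y, n_perm=199, rng=None):
    """Return (naive_p, honest_p) for the selected predictor."""
    rng = rng or random.Random()
    observed = best_abs_r(predictors, y)
    best = max(predictors, key=lambda x: abs(pearson_r(x, y)))
    naive_hits = 0
    honest_hits = 0
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between response and predictors
        # Naive: pretend the selected predictor was chosen a priori.
        if abs(pearson_r(best, y_perm)) >= observed:
            naive_hits += 1
        # Honest: redo the selection on the permuted response.
        if best_abs_r(predictors, y_perm) >= observed:
            honest_hits += 1
    naive_p = (naive_hits + 1) / (n_perm + 1)
    honest_p = (honest_hits + 1) / (n_perm + 1)
    return naive_p, honest_p


if __name__ == "__main__":
    rng = random.Random(1)
    # Pure noise: the response is unrelated to all 20 candidate predictors.
    y = [rng.gauss(0, 1) for _ in range(30)]
    predictors = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(20)]
    print(selection_p_values(predictors, y, rng=rng))
```

Because the best of many predictors is always at least as strong as any single one, the honest p-value can never be smaller than the naive one. The gap between the two is the price of selection that a naive stepwise p-value ignores.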

    It is possible to combine exploratory analysis and hypothesis-driven analysis into a larger study. One

    way of doing this is to perform a 2-phase study, in which the first phase is an exploratory analysis,

    perhaps involving subjectively located plots and employing many variations on analysis. The patterns

    found in the first phase are then posed as hypotheses for the second phase. The second phase involves

    the collection of fresh data from objectively located plots, and an entirely planned data analysis.
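The objective plot placement required in the second phase can be generated by a script rather than by eye. The sketch below illustrates the three schemes named earlier (random, stratified random, and regular placement) for a hypothetical rectangular study area. The function names, the rectangular strata, and the coordinate units are all assumptions made for illustration.

```python
import random


def random_plots(n, width, height, rng):
    """n plot centres placed uniformly at random in a width x height area."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]


def stratified_random_plots(per_stratum, strata, rng):
    """Random plots within each stratum.

    strata maps a stratum name to its bounding box (xmin, xmax, ymin, ymax).
    """
    return {
        name: [(rng.uniform(x0, x1), rng.uniform(y0, y1))
               for _ in range(per_stratum)]
        for name, (x0, x1, y0, y1) in strata.items()
    }


def regular_plots(nx, ny, width, height):
    """Plot centres on a regular nx-by-ny grid, one per grid cell centre."""
    dx, dy = width / nx, height / ny
    return [((i + 0.5) * dx, (j + 0.5) * dy)
            for i in range(nx) for j in range(ny)]


if __name__ == "__main__":
    rng = random.Random(2024)  # a fixed seed makes the design repeatable
    print(random_plots(3, 100.0, 50.0, rng))
    print(stratified_random_plots(
        2, {"upland": (0, 50, 0, 50), "lowland": (50, 100, 0, 50)}, rng))
    print(regular_plots(2, 2, 100.0, 50.0))
```

Recording the seed alongside the data lets another investigator regenerate exactly the same plot locations, which is the repeatability that hypothesis-driven analysis depends on.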

    A second way to combine the two major types of analysis is through data set subdivision. The data set

    is randomly divided into two subsets: an exploratory subset and a confirmatory subset (alternatively

    called model building and model validation, respectively). Many, varied analyses can be performed on

    the exploratory subset (including stepwise analysis) - and such analyses can be based upon intuition,

    hunches, or superstition. If interesting patterns are found with respect to particular environmental

    variables, and using particular data transformations, these patterns can be statistically tested using the


    confirmatory subset. To use data set subdivision properly, samples must be objectively located.
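The random subdivision described above might be sketched as follows. The function name `split_plots`, the 50/50 default, and the plot identifiers are assumptions for illustration; the source does not prescribe a split ratio.

```python
import random


def split_plots(plot_ids, exploratory_fraction=0.5, seed=None):
    """Randomly divide plots into (exploratory, confirmatory) subsets."""
    if not 0.0 < exploratory_fraction < 1.0:
        raise ValueError("exploratory_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(plot_ids)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    plots = [f"plot{i:02d}" for i in range(10)]
    exploratory, confirmatory = split_plots(plots, seed=1)
    print("explore freely on:", exploratory)
    print("test planned hypotheses on:", confirmatory)
```

The split should be made once, before any analysis, and the confirmatory subset left unexamined until the hypotheses and the analysis plan are fixed.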

    Literature cited

    (see also selected references for self-education)

    Cliff, N. 1987. Analyzing Multivariate Data. Harcourt Brace Jovanovich, Publishers, San Diego,

    California.

    Draper, N. R., and H. Smith. 1981. Applied Regression Analysis. Second edition. Wiley, New York.

    Gauch, H. G., Jr. 1982. Multivariate Analysis and Community Structure. Cambridge University Press,

    Cambridge.

    Hallgren, E., M. W. Palmer, and P. Milberg. 1999. Data diving with cross validation: an investigation

    of broad-scale gradients in Swedish weed communities. Journal of Ecology 87:1037-1051.

    Jefferys, W. H., and J. O. Berger. 1992. Ockham's Razor and Bayesian Analysis. American Scientist 80:64-72.

    This page was created and is maintained by Michael Palmer
