Quantitative ethnobotany


  • This series of working papers is intended to provide information and to generate fruitful discussion on key issues in the sustainable and equitable use of plant resources. Please send comments on this paper and suggestions for future issues to:

    People and Plants Initiative, Division of Ecological Sciences, UNESCO, 7 Place de Fontenoy, 75352 Paris CEDEX 07 SP, France.

    PEOPLE AND PLANTS WORKING PAPER 6 - JUNE 1999

    [Cover figure: Rytigynia kiwuensis. x-axis: Diameter at breast height [cm]; y-axis: Total bark fresh weight within 2 m plant height [kg].]

    Quantitative Ethnobotany
    Applications of multivariate and statistical analyses in ethnobotany

    M. Höft, S.K. Barik and A.M. Lykke

  • The designations employed and the presentation of material throughout this publication do not imply the expression of any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city, or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The opinions expressed in this paper are entirely those of the authors and do not commit any Organization.

    Authors' addresses:

    M. Höft, c/o UNESCO Office Nairobi, P.O. Box 30592, Nairobi, KENYA

    S. K. Barik, Centre for Environmental Studies, North-Eastern Hill University, Shillong 793 014, INDIA

    A. M. Lykke, Dept. of Systematic Botany, Nordlandsvej 68, 8240 Risskov, DENMARK

    Photos: R. Höft

    Cover illustration: M. Höft

    Published in 1999 by the United Nations Educational, Scientific and Cultural Organization, 7, place de Fontenoy, 75352 Paris Cedex 07 SP, FRANCE. Printed by UNESCO on chlorine-free recycled paper.

    Edited by Malcolm Hadley and Robert Höft. Design: Ivette Fabbri. Layout: Martina Höft.

    © UNESCO / M. Höft, S.K. Barik & A.M. Lykke 1999

    SC-99/WS/41

    Recommended citation: Höft, M., Barik, S.K. & Lykke, A.M. 1999. Quantitative ethnobotany. Applications of multivariate and statistical analyses in ethnobotany. People and Plants working paper 6. UNESCO, Paris.

  • Abstract

    Some wild plant resources are severely threatened by habitat loss and species-selective overexploitation. In addition, indigenous knowledge about the uses of wild plant resources is rapidly disappearing from traditional communities. In the context of conservation and sustainable and equitable use of wild plant resources, quantitative ethnobotany can contribute to the scientific base for management decisions.

    In the past, most ethnobotanical studies have recorded vernacular names and uses of plant species with little emphasis on quantitative studies. In this working paper, a selection of multivariate and statistical methods particularly applicable to the analysis of ethnobotanical field data is presented. The working paper aims at assisting researchers and students to recognize the appropriate method to analyse their data and to develop management recommendations from scientifically sound conclusions.

    The techniques presented include cluster and principal component analysis, regression analysis, analysis of variance, and log-linear modelling.

    Multivariate and statistical analysis requires computerized statistics and graphics programs. Basic technical knowledge to use such tools as well as a basic understanding of statistical terms are important requirements to get the most benefit from this publication.

    In most cases ethnobotanical data collection requires simple tools such as measuring tape or spring balance. This photo shows the Loita Ethnobotany Team quantifying amounts of Olea europaea L. ssp. africana (Mill.) P. Green used for fuel in Maasai households.

  • Contents

    Abstract
    Contents

    Introduction
        Dimensions of data
        Sampling and organization of data
        Data standardization and transformation

    Classification and ordination techniques
        Clustering and classification
        Ordination
        Examples of data matrices
        Matrix structure and analysis

    Applications of cluster and principal component analysis
        Cluster analysis of Wood identification task
        Principal component analysis of Paired comparison of wood species task

    Comparisons of several means
        Hypothesis testing
        Prediction
        Linear correlation
        Cross-tabulation

    Applications of general linear models
        Analysis of variance
        Regression analysis
        Correlation
        Chi-square analysis of contingency tables

    References
    Acknowledgements
    Appendix I

  • Introduction

    In order to enhance the indicative value of ethnobotanical studies, there have been attempts in recent years to improve the traditional compilation-style approach by incorporating suitable quantitative methods of research in ethnobotanical data collection, processing and interpretation. Such quantitative approaches aim to describe the variables quantitatively and analyse the observed patterns in the study, besides testing hypotheses statistically. The concept of quantitative ethnobotany is relatively new and the term itself was coined only in 1987 by Prance and co-workers (Prance, 1991). Quantitative ethnobotany may be defined as "the application of quantitative techniques to the direct analysis of contemporary plant use data" (Phillips & Gentry 1993a and b). Quantification and associated hypothesis-testing help to generate quality information, which in turn contributes substantially to resource conservation and development. Further, the application of quantitative techniques to data analysis necessitates refinement of methodologies for data collection. Close attention to methodological issues not only improves the discipline of ethnobotany but also enhances the image of ethnobotany among other scientists (Phillips & Gentry 1993a and b).

    Different approaches are taken to collect and analyse quantitative and qualitative ethnobotanical data. The approaches depend on the objectives of the researcher and the nature of the study, and aim at the objective evaluation of the reliability of the conclusions based on the data. Multivariate and statistical methods are typically applied to the interpretation of the following types of ethnobotanical data (the list is not exhaustive):

    - relative importance of plant taxa and vegetation types to different ethnic, social or gender groups;
    - knowledge and uses of plants by different ethnic, social or gender groups;
    - preference information on different plant species;
    - size class distribution of woody plant species;
    - quantitative impact of human uses on growth and regeneration patterns;
    - quantitative impact of environmental factors on certain plant traits;
    - quantitative impact of agricultural or horticultural techniques on certain plant traits;
    - quantitative plant morphological and pharmacological characteristics of useful plants.

    The data processing techniques in ethnobotany may range from calculating a simple index to complex computational techniques of multivariate analysis such as classification and ordination. The selection of a particular technique for application to the data is based on the effectiveness of the technique for sound interpretation of the results and identification of the inter-relationships that may exist among the variables studied. In general, statistical applications may be classified into two broad categories:

    1) Sets of data where measurements are taken on only one attribute or response variable; the data so obtained are analysed through a set of techniques called univariate analysis techniques.

    2) Sets of data where measurements are taken simultaneously on more than one variable; the statistical techniques applied to such data sets are called multivariate analysis techniques.

    Studies of multivariate nature are more common in ethnobotanical research, and are treated in more detail in this paper.

    Dimensions of data

    Because of the complexities involved in most ethnobotanical studies, it is common for ethnobotanical researchers to collect observations on many different variables. The need to understand the relationships between many variables makes multivariate analysis mathematically complex, and the techniques to analyse such data invariably need a computer. Today a large number of computer packages are available for analysis of multivariate data sets. CANOCO, PC-ORD, TWINSPAN, R-Package, NTSYS (Numerical Taxonomy System), SAS (Statistical Analysis System), SYSTAT (SPSS Inc.), SPSS (Statistical Package for the Social Sciences) and BMDP (Biomedical Programs) are some of the popular and powerful software packages widely used in multivariate and statistical data analysis.

    Generally, multivariate and statistical methods aim at making large data sets mentally accessible, structures recognizable and patterns explicable, if not predictable. Johnson and Wichern give five basic applications for these methods (Johnson & Wichern 1988):

    1) Data reduction or structural simplification: The phenomenon being studied is represented as simply as possible with a reduced number of dimensions but without sacrificing valuable information. This makes interpretation easier.

    2) Sorting and grouping: Groups of similar objects or variables are created.

    3) Examining relationships among variables: Variables are investigated for mutual interdependency. If interdependencies are found, the pattern of dependency is determined.

    4) Prediction: Relationships between variables are determined for predicting the values of one or more variables on the basis of observations on the other variables.

    5) Testing of hypotheses: Specific statistical hypotheses formulated in terms of the parameters of multivariate populations are tested. This may be done to validate or reject assumptions.

    The different multivariate and statistical analysis techniques which are available for the above applications are derived from one simple linear mathematical model, the Multivariate General Linear Hypothesis (MGLH). In this paper the following linear models are presented along with their applications:

    1) classification and clustering;
    2) ordination;
    3) analysis of variance;
    4) regression analysis;
    5) correlation;
    6) log-linear modelling.

    These techniques will be demonstrated using examples from a People and Plants workshop on species used for woodcarving in Kenya, a Ph.D. study on alkaloid patterns of Tabernaemontana pachysiphon, and two Ugandan M.Sc. studies, one on Rytigynia kiwuensis and one on medicinal plant collection habits of different specialist groups. Before getting to the practical applications, some general remarks regarding types of data, sampling size, sorting and grouping of data are presented.

    First of all, the different types of quantitative and qualitative data must be distinguished (see Box 1, page 5). In the majority of cases ethnobotanical data are quantitative on an ordinal scale. Frequency and abundance are key parameters in vegetation analysis and population dynamics, ranking order reveals important information on preferences of user groups, and ordered multistate characters are data that fall into predefined hierarchical groups. Quantitative data on a ratio or interval scale may be collected to determine growth patterns of individual plant species, to assess the effectiveness of a certain remedy, or to express the impact of human uses.

    Qualitative data like presence/absence or yes/no are often recorded during interviews when people's knowledge of certain species or management techniques is assessed or the potential for the acceptance of substitutes for a particular resource is gauged.

    Counts are obtained when numbers of people falling in a certain category, or numbers of events taking place in a pre-defined category or time span, are recorded. To assist in determining relationships among and between variables, how they can be classified, and which analysis techniques are appropriate, Box 2 (page 5) lists some common data settings and research questions for which corresponding parametric and non-parametric methods exist. Not all of these techniques are discussed in this paper.

    Parametric methods apply to approximately normally distributed data. In a simple linear model

    Y = a + bX + e

    Y is the dependent and X the independent variable. Variables are defined as quantities that can vary in the same equation. In contrast, the parameters a and b are quantities that are constant in a particular equation, but can be varied in order to produce other equations in the same general family. The parameter a is the value of Y when X = 0. This is sometimes called the Y-intercept (the point where the line intersects the Y-axis in a graph, at X = 0). The parameter b is the slope of the line, or the number of units Y changes when X changes by one unit; e is referred to as an error or residual, which is the departure of an actual Y from what the equation predicts. When a and b are estimated by least squares, the sum of all e is zero.
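    The linear model above can be illustrated with a minimal sketch. It assumes that a and b are estimated by ordinary least squares; the function name `fit_line` and the data values are hypothetical and only serve to show that the residuals e sum to zero.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (intercept) and b (slope) in Y = a + bX + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual e: departure of each observed Y from the predicted value
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals

# Hypothetical measurements, e.g. stem diameter (X) and bark weight (Y)
diameters = [2.0, 4.0, 6.0, 8.0, 10.0]
weights = [0.5, 1.1, 1.4, 2.1, 2.4]
a, b, residuals = fit_line(diameters, weights)
print(a, b, sum(residuals))
```

    With these illustrative values the slope is about 0.24 and the intercept about 0.06, and the residuals cancel out up to rounding error.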

    Sampling and organization of data

    Having developed a well thought out research design before going to the field is likely to:

    1) save a lot of time (and money) when analysing data;
    2) enhance the expected output in terms of meaningful results;
    3) allow the results to be translated more easily into scientifically sound recommendations;
    4) leave you and others satisfied with the work.

    The following reflections are crucial when planning ethnobotanical research in the field:

    - How many samples need to be taken in different categories?


    Box 1. Qualitative and quantitative data.

    Qualitative
        binary or two-state: yes/no*; presence/absence*
        nominal or multi-state: categories

    Quantitative
        discrete, ordinal scale: frequency (number of observations for each value); abundance (number of observations for each value per unit of space); ranking order
        continuous, ratio or interval scale: units in time and space (e.g. for temperature, weight, height, circumference, etc.)

    * Usually coded as 1/0 or TRUE/FALSE

    Box 2. Applications of multivariate and statistical analysis techniques based on linear models.

    Research interest                                                 Method of analysis

    General relationships among variables                             Correlation analysis/analysis of co-variance
    Associations between variables                                    Detrended correspondence analysis
    Quantitative relationships among variables and prediction         Regression analysis
    Similarity/dissimilarity among variables or groups of variables   Cluster analysis
    Variance among variables or subjects in counted observations      Principal component analysis
    Testing of hypothesis regarding factorial effects on variables    Analysis of variance/Kruskal Wallis test
    Relationships among categories in multi-way frequency tables      Log-linear modelling
      and prediction of cell frequencies based on counts
    Exploring survival rates                                          Survival analysis

    - How many categories can realistically be studied without cutting down the minimum number of samples to be taken in each category?
    - Can equal sampling be assured for each category?
    - Which are the categories that would be representative for the research question?
    - Do seasonal, diurnal, or circumstantial fluctuations have to be accounted for?
    - Is repeated sampling necessary (i.e. the same samples studied at different times)?
    - Are the samples representative of the population?
    - If processes are to be documented: can changes realistically be observed within the time-frame of the study (e.g. growth records of plants)?
    - What indicators can be used to document processes?

    Due to the high cost in terms of time and money, true random sampling is not practicable in many applied research situations. A stratified or systematic random sampling strategy is, therefore, usually applied to study plant use by people. Charles Peters (1996) provides a good discussion of the different approaches.

    Mzee Ali Mwadzpea and Alex Jeremani construct litter traps. Sixteen traps were randomly set up in a coastal forest in Kenya to study the nutrient dynamics of soil and vegetation.

  • True random sampling of a finite population would mean the assignment of a number to each case and then the random selection of a sample of numbers. In SYSTAT, random numbers between 1 and 73,500 can be generated with the following expression:

    1 + INT (73500 * URN)

    When people are interviewed and questions asked with respect to some particular knowledge, the sample should be representative and include people from different social backgrounds, ages and genders. In order to allow interpretation of associations that may arise from data analysis, it is necessary to record as much information as possible from the interviewees.
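    The SYSTAT random-number expression quoted above can be mirrored in a general-purpose language. The sketch below is an illustration in Python, not part of SYSTAT; the function names are hypothetical, and it assumes that URN stands for a uniform random number between 0 and 1. The second function draws case numbers without repeats, which is usually what is wanted when selecting a sample.

```python
import random

def random_case_number(population_size, rng):
    """Mirror of 1 + INT(population_size * URN): a whole number from 1 to population_size."""
    return 1 + int(population_size * rng.random())

def draw_sample(population_size, sample_size, seed=None):
    """Draw distinct case numbers (sampling without replacement)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

rng = random.Random(1)                      # fixed seed so the draw can be repeated
one_case = random_case_number(73500, rng)   # a single random case number
cases = draw_sample(73500, 20, seed=1)      # twenty distinct case numbers
print(one_case, cases)
```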

    If the aim for ethnobotanical applications is to predict one quantitative plant trait from another (usually one that is easy to measure), the sample size must be sufficiently large to include individual variation according to environmental factors at the study site (e.g. altitude, exposure, soil nutrients) and endogenous factors within the species itself (e.g. age, phenological status). While for statistical confidence and accuracy a sampling size of 10% of the total population is desirable, for practical reasons the actual sampling size may not even reach 1% of the total population. An absolute minimum of four individuals within each cell (category) is indispensable for any statistical analysis. Obviously, the predictive value of inference increases with increasing sample size.

    The state of environmental factors may be described either quantitatively (e.g. concentration of nitrogen and phosphorus in the soil, average daily light sums, amount of water in the soil) or simply categorically (e.g. fertile soil, high light intensity, dry site). Categorical attributes of this kind are subjective, depending on the perceptions of the researcher. For quantitative observations of the independent variable on a continuous or ratio scale, regression analysis would apply. However, if the independent variable is categorical, then analysis of variance is applied. In both cases, relationships among variables and significant interactions between the environmental factors may be expected and have to be accounted for in analysis. Sampling schemes in the field should be planned in such a manner as to allow separate accounting for environmental effects. In addition to sampling in the field, experimental designs are used to separately test effects of factors, and proper planning is imperative in this respect. Linear relationships or co-variances among variables must be tested before applying any of the standard procedures.

    Ideally, the sample size in each category has to be equal and samples have to be taken consistently in time. Such a design is then called full factorial. Repeated sampling at different times might be done to account for seasonal and diurnal variations. Repeated sampling analysis is a special form of analysis of variance and is usually computed with the general linear model option.

    The power of the test depends on the sample size. The larger the sample, the smaller the minimum detectable difference. There is no upper limit as to the number of samples, as long as one can handle them. In theory, there is also no limit to the number of factors that might be analysed simultaneously. However, the number of possible interactions becomes unwieldy and interpretation of interactions of more than three or four variables is extremely difficult.
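    How the minimum detectable difference shrinks with sample size can be sketched numerically. The example below is an illustration under stated assumptions, not a method taken from this paper: it assumes a two-sided comparison of two group means, equal sample sizes, a known common standard deviation and a normal approximation. The function name and the default values for the significance level and power are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_difference(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest difference between two means detectable with the given power.

    Normal approximation: (z_(1-alpha/2) + z_power) * sd * sqrt(2 / n).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sd * sqrt(2 / n_per_group)

# Difference expressed in units of the standard deviation (sd = 1)
for n in (4, 16, 64):
    print(n, round(minimum_detectable_difference(n, sd=1.0), 3))
```

    Under these assumptions, quadrupling the number of samples per group halves the minimum detectable difference.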

    Another group of multivariate analysis involves the application of log-linear models, also referred to as discrete analysis of variance. In this case data are counts and are arranged in two- or multi-dimensional contingency tables. Again, the number of observations should ideally be equal in each of the categories.

    When preparing the data set it is crucial to clear ambiguous signs (e.g. numbers with question marks) or in-between categories for the final record. Usually a lot of time is wasted in cleaning data sets of such ambiguous entries. Often the whole entry is lost when the meaning of the symbols one used to mark a certain entry at a certain time cannot be recalled. It is better to invest more time in the field measuring or inquiring to obtain clear data entries from the beginning.
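    A simple check can help to locate ambiguous entries before analysis. The sketch below is only an illustration: the function name and the example records are hypothetical, and it assumes that every valid entry should be a plain number.

```python
def split_clean_and_ambiguous(raw_entries):
    """Separate entries that parse as numbers from those that need checking."""
    clean, ambiguous = [], []
    for position, entry in enumerate(raw_entries):
        try:
            clean.append((position, float(entry)))
        except ValueError:
            # e.g. a question mark, an in-between category or an empty cell
            ambiguous.append((position, entry))
    return clean, ambiguous

# Hypothetical field records as typed from a notebook
field_records = ["12.5", "14?", "9", "10-12", ""]
clean, ambiguous = split_clean_and_ambiguous(field_records)
print(clean)
print(ambiguous)
```

    The positions returned with the ambiguous entries point back to the records that should be clarified, ideally while still in the field.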

    In vegetation research extensive relevés are often produced and the crucial problem in the beginning is to decide on the right sampling method. Species distribution can be recorded using transects, whereas species abundance is recorded in plots. In stratified plot sampling, plots are arranged along imaginary lines following environmental gradients. Stratified plot sampling combines two approaches to vegetation analysis and is mainly used for investigating population dynamics. In ecosystems where woody vegetation is sparsely distributed, plotless sampling is most appropriate when wishing to derive estimates of woody species density. The simplest plotless sampling method is the nearest individual method. Random sampling points are determined in the area and the distance to the nearest individual(s) of each tree species is recorded. Successive distance measurements are taken and the procedure is repeated for a number of random points. The density of each species is then derived from the following formula:


    $$D_{Sp} = \frac{\text{mean area}}{2}$$

    where:

    $$\text{mean area} = (\text{mean distance to nearest individual of a species})^2$$

    In forestry, another plotless sampling method, the point-centred quarter method, is used to assess the economic value of tree stands. Here, the point centre is marked by an individual tree, and four equally sized plots are delineated around the centre.

    With respect to plot size or transect length the following guidelines exist. Inside forest and when dealing with large trees, subplot size should be 20 x 10 m or 20 x 20 m. For the analysis of regeneration patterns, subplot sizes of 10 x 10 m are sufficient to cover total areas between e.g. 0.1 and 1 ha. In grassland and when analysing herbaceous vegetation (including tree seedlings), subplot sizes of 1 x 1 m or 5 x 5 m are usually chosen. Transects may have lengths between 100 and 1000 m and are usually between 1 and 5 m wide along each side.

    Data are entered into the computer with the aid of a spreadsheet (rows and columns) that can later be imported into any statistics package, or simply as an ASCII (American Standard Code for Information Interchange) file, where entries are separated by blanks or tabulators. Some statistics programs put limits on the maximum file width (i.e. number of columns) that can be imported without specification of the file width.

    Data are entered either as numerical values (SI units) or character values. Character variables are also referred to as string variables and in most programs are marked with the $ sign after the variable name (e.g. name$), while numeric variables have no special sign added (e.g. length). In many statistics packages it is not possible to interchange character and numeric variables by simple editing of the spreadsheet. Instead, a new variable has to be defined, based on the value of the variable that is to be altered. Variable names should be as simple and straightforward as possible.
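    The file layout and the variable conventions described above can be sketched as follows. This is an illustration in Python rather than the syntax of any particular statistics package; the function name, the variable names and the data values are hypothetical. Entries are separated by blanks or tabs, names ending in $ are treated as character variables, and a new numeric variable is defined from a character variable instead of editing it in place.

```python
def read_table(text):
    """Read a blank- or tab-separated table whose first line holds variable names."""
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = {}
        for name, value in zip(names, line.split()):
            # Names ending in "$" hold character values; all others are numeric
            row[name] = value if name.endswith("$") else float(value)
        rows.append(row)
    return rows

# Hypothetical ASCII data: species, diameter, height and a size class typed as text
ascii_data = """
species$ dbh height class$
Rytigynia_kiwuensis 4.2 3.1 2
Rytigynia_kiwuensis 6.8 4.0 3
"""
rows = read_table(ascii_data)
for row in rows:
    # Define a new numeric variable based on the character variable
    row["class_no"] = int(row["class$"])
print(rows)
```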

    To study the abundance of three species of much sought-after medicinal plants (Rytigynia kigeziensis, R. kiwuensis, and R. bagshawei (Rubiaceae)) in Bwindi Impenetrable National Park, Uganda, Maud Kamatenesi set up more than 300 plots of 20 x 20 m, counted the individuals, measured dbh and height and determined the amount of bark used.

    Moses Kipelian of the Loita Ethnobotany Team measures DBH of Juniperus procera Endl. (Cupressaceae) trees, which are highly valued for the construction of stockades and fences. The resulting size class distribution curve showed a lack of regeneration, which has raised serious concern and has led to the establishment of a tree nursery.


    Data standardization and transformation

    Classical parametric methods of inference make the assumption that the underlying population from which the sample data are drawn shows a normal distribution. Normal probability plots (Figure 1) help to visualize the distribution of one or more variables. A sample from a normal distribution results in an approximately S-shaped curve. A few of these methods, e.g. the t-test, are robust in the sense that they are not sensitive to modest departures from normality. However, the accuracy of most tests is seriously affected by large deviations from normality. In that case, data are transformed so as to approximate a normal distribution (Berenson et al. 1983). In order to meet the conditions of normality, standardization of the basic data matrix is an essential step in most techniques. Besides, standardization in certain multivariate tests (e.g. principal component analysis, factor analysis) is done in order to remove the measurement units from the basic data. Standardization or transformation is achieved by treating the data with one of the transformation functions given in Box 3, where x'ij is the transformed value, while xij and y are the original data.

    Binary/two-state character data are not standardized. For combinations of two- and multistate characters, ordering should be used. For combinations of qualitative and quantitative data, one of the following options should be followed:

    a) ignore the problem;
    b) divide the data matrix;
    c) convert the quantitative data to qualitative.

    [Figure 1 plot. x-axis: Variable X (0 to 50); y-axis: Fraction of Data (0.0 to 1.0).]

    Figure 1. A quantile plot showing the standardized values of a variable Y (Fraction of Data) as a function of a variable X.

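    Box 3. Data transformations.

    Logarithmic transformation:

    $$x'_{ij} = \log_{10}(x_{ij}) \quad \text{or} \quad x'_{ij} = \log_{10}(x_{ij} + 1)$$

    Square root transformation:

    $$x'_{ij} = \sqrt{x_{ij}} \quad \text{or} \quad x'_{ij} = \sqrt{x_{ij} + 0.05}$$

    Divide by standard deviation:

    $$x'_{ij} = \frac{x_{ij}}{\sigma}$$

    Standardization:

    $$x'_{ij} = \frac{x_{ij} - \bar{x}}{\sigma}$$

    Proportional function:

    $$x'_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}}, \qquad 0.0 \le x'_{ij} \le 1.0$$

    Divide by the range value:

    $$x'_{ij} = \frac{x_{ij}}{x_{\max} - x_{\min}}, \qquad 0.0 \le x'_{ij} \le 1.0$$

    Ordering:

    $$x'_{ij} = \frac{x_{ij}}{x_{\max} - x_{\min}}, \qquad 0.0 \le x'_{ij} \le 1.0$$

    Linear transformation:

    $$Y' = \frac{Y - a}{b} + c$$

    Subtraction options: $y - y_{\min}$; $y - \bar{y}$

    Divide options: $y / y_{\max}$; $y / (y_{\max} - y_{\min})$; $y / \sigma$; $y / \bar{y}$; …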
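    Some of the transformations listed in Box 3 can be written out as a short sketch. The function names and data values are hypothetical, and the standard deviation used here is the sample standard deviation (divisor n - 1), which is an assumption since Box 3 does not say which form is meant.

```python
from math import log10, sqrt
from statistics import mean, stdev

def log_transform(values):
    """Logarithmic transformation log10(x + 1); adding 1 keeps zero counts defined."""
    return [log10(x + 1) for x in values]

def sqrt_transform(values):
    """Square root transformation."""
    return [sqrt(x) for x in values]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(x - centre) / spread for x in values]

def proportion(values):
    """Proportional function: each value divided by the sum of all values."""
    total = sum(values)
    return [x / total for x in values]

def divide_by_range(values):
    """Each value divided by the range (maximum minus minimum)."""
    value_range = max(values) - min(values)
    return [x / value_range for x in values]

counts = [0, 9, 99]                    # hypothetical counts
print(log_transform(counts))
print(standardize([2.0, 4.0, 6.0]))
print(proportion([1.0, 1.0, 2.0]))
```

    After standardization the values have a mean of zero and a standard deviation of one, so variables measured in different units become comparable.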


    Classification and ordination techniques

    In general, multivariate techniques are used to categorize or group the objects or experimental units. The aim of classification or ordination could be:

    1) to get an overview of the variance;
    2) to compare groups or trends among themselves or with additional data;
    3) to produce hypotheses to prepare further studies.

    Clustering and classification

    Classes have boundaries and hence an inner structure and relationships with external objects or other classes. Thus, algorithms have to address the problem of what to include in a particular class and what to exclude. Important criteria for judging, recognizing and testing of classifications and classes are:

    - the centres (averages for elements);
    - the density of classes;
    - the variance of classes;
    - the number of members;
    - the distinctness of delimitation.

    In different methods, different criteria are optimized. The significance of the respective criteria must be seen in relation to the objective of the study. The choice of methods depends on the objectives. Figure 2 gives an overview of the division of classification methods.

    There are situations where the categorization is done in terms of groups that are themselves determined from the data. Such exploratory techniques for grouping objects (variables or items) are called clustering. In classification methods other than cluster analysis, the number of groups is known beforehand and the objective is to assign new observations (items) to one of these groups. In cluster analysis, in contrast, no assumptions are made concerning the number of groups. Grouping is done on the basis of similarities or distances. The inputs required are similarity measures or data from which similarities can be computed.
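    To illustrate how distances can be computed from raw data, the sketch below builds a matrix of pairwise distances between objects. Euclidean distance is used as one common choice, not as the measure prescribed in this paper; the function names, object labels and scores are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two objects described by the same variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(objects):
    """Pairwise distances between all objects, keyed by (name, name)."""
    names = list(objects)
    return {(p, q): euclidean(objects[p], objects[q]) for p in names for q in names}

# Hypothetical scores given by three respondents to three wood species
scores = {"A": [1, 2, 3], "B": [1, 2, 4], "C": [5, 6, 3]}
distances = distance_matrix(scores)
print(distances[("A", "B")], distances[("A", "C")])
```

    Small distances indicate similar objects, so respondents A and B would be grouped before either is joined with C.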

    CLUSTER ANALYSIS

    Cluster analysis attempts to subdivide or partition a set of heterogeneous objects into relatively homogeneous groups. The objective of cluster analysis is to develop subgroupings such that objects within a particular subgroup are more alike than those in a different subgroup. Thus, the outcome of cluster analysis is a classification scheme that provides the sequence of groupings by which a set of objects is subdivided. Box 4 lists some examples of data which are suitable for cluster analysis.

    Pattern recognition
        Discriminant analysis
        Cluster analysis
            Hierarchical
                Divisive
                    Monothetic: Association analysis; Group analysis; Divisive information analysis; Nodal analysis
                    Polythetic: Indicator species (PHYTO) (TWINSPAN)
                Agglomerative: Single linkage; Complete linkage; Centroid sorting; Average linkage; Minimum variance
            Non-hierarchical: Serial clustering; Relocated group clustering; Grid analysis

    Figure 2. Classification of classification methods (after Fischer & Bemmerlein 1986).

    Sequencing proceeds by either hierarchical or non-hierarchical clustering. In non-hierarchical clustering, objects are divided into groups without relationships being established between them, i.e. no dendrogram can be produced. Non-hierarchical clustering is particularly suitable for large data sets, since a complete similarity matrix need not be calculated. All non-hierarchical clusters are calculated in the following way:

    1) choice of number and position of initial cluster centres;
    2) allocation of all objects to one respective cluster centre;
    3) new calculation of cluster centres;
    4) re-iteration of steps 2) and 3) until no further changes occur in the structure of clusters;
    5) eventually merging of clusters.
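    Steps 1) to 4) above can be sketched in a few lines. The sketch follows the widely used k-means procedure as one example of non-hierarchical clustering; it is not claimed to be the algorithm of any package named in this paper. The function names, starting centres and points are hypothetical, and the optional merging of clusters in step 5) is left out.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centres, max_iter=100):
    """Non-hierarchical clustering around a given set of initial centres (step 1)."""
    for _ in range(max_iter):
        # Step 2: allocate every object to its nearest cluster centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate each centre as the mean of its members
        new_centres = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centres no longer change
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

# Hypothetical objects measured on two variables, with two starting centres
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centres)
print(clusters)
```

    Only distances between objects and the current centres are needed, which is why no complete similarity matrix has to be held in memory.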

    The more widely used approach is hierarchi-cal clustering arrangement. In this approach,once two objects are linked together at a particu-lar stage, they cannot be separated into differentclusters later on. Therefore, clustering decisionsat a particular step are conditioned by thearrangement of objects at the previous step. Inthis approach the number of possible clusteringchoices decreases at each step. In hierarchicalclustering, groups at any lower level of a clusterare exclusive subgroups of those groups at high-er levels. In contrast to non-hierarchical cluster-ing, statements on the relationships of classes(but not of the relationships of members in therespective classes) can be made in the hierarchi-cal approach. The results can be depicted in theform of a dendrogram. All methods discussed inthe following paragraphs are hierarchical.

    Hierarchical clustering may be either divisive or agglomerative. In a divisive cluster analysis, the entire collection of objects is divided and re-divided, based on object similarities, to arrive at the final groupings (i.e. picture an inverted tree). In an agglomerative classification, as its name implies, individual objects are combined and re-combined successively to form larger groups of objects (i.e. the tree).

    Divisive arrangements may be either monothetic or polythetic, whereas agglomerative methods are always polythetic. The following groups therefore exist: monothetic divisive, polythetic divisive and polythetic agglomerative.

    In a monothetic clustering, the similarity of any two objects or groups is based on the value of a single variable, for example, preference ranking based on a single factor. In a polythetic classification, the similarity of any two objects or groups is based on their overall similarity as measured by numerous variables, for example, preference ranking based on several factors that are finally combined into an index.

    Box 4. Examples of data suitable for cluster analysis.

    Similarity/dissimilarity of people's responses to well-defined questions.

    Similarity/dissimilarity of plant utilization patterns among different ethnic, social or gender groups.

    Similarity/dissimilarity of species based on people's indication of use values.

    Similarity/dissimilarity of phenotypic characteristics (e.g. seeds) in different varieties of food plants.

    Similarity/dissimilarity of the pattern of secondary compounds (e.g. essential oils) in different varieties of medicinal or aromatic plants.

    An elderly woman sells herbal medicine at a market in Menglun, Yunnan Province, People's Republic of China. In most cases the older members of a community have a deeper knowledge of the environment and the properties and uses of plant species.


    Agglomerative clustering procedures begin by considering each object as its own distinct cluster. Then two objects are placed together in a single cluster according to certain optimization criteria, while each of the remaining objects is still grouped separately. In the next step, objects are grouped into either one cluster of three or two clusters of two (with each remaining object grouped separately). This clustering procedure continues sequentially until all objects are merged into one cluster.

    Another criterion for defining cluster analyses is related to the measure of distance used in linking the objects for cluster formation. Alternative approaches include complete linkage, single linkage and average linkage. In complete linkage, the merger of two subsets of objects is based on the maximum distance between objects. This approach is also called the farthest neighbour or diameter method and produces compact clusters of approximately equal size (unsuitable for ethnobotanical research questions). In single linkage, the merger is based on the minimum distance between objects. This approach is alternatively known as the nearest neighbour method and often produces a single large chain-like cluster and several small clusters during its sequencing process. The average linkage approach bases the merger of two subsets of objects on the average distance between objects and is considered intermediate between the first two approaches.
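    The three linkage criteria differ only in how the distances between the members of two clusters are summarized. The following sketch computes all three for two small clusters; the function name `linkage_distances`, the Euclidean metric and the coordinates are illustrative assumptions.

```python
import math
from itertools import product

def linkage_distances(cluster_a, cluster_b):
    """Summarize all between-cluster object distances under three linkage rules."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # nearest neighbour
        "complete": max(pairwise),                  # farthest neighbour
        "average": sum(pairwise) / len(pairwise),   # mean of all pairs
    }

# Hypothetical clusters of two objects each, placed on a line for easy checking.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
d = linkage_distances(cluster_a, cluster_b)
print(d)   # {'single': 2.0, 'complete': 5.0, 'average': 3.5}
```

    The average always lies between the single and complete linkage values, which is the sense in which average linkage is intermediate.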

    The general approach to cluster analysis is to compute a normal mode resemblance matrix between the objects (also referred to as sampling units or operational taxonomic units (OTUs)) using appropriate resemblance functions. The similarities/distances between all pairwise combinations of sampling units (SUs) in a collection are summarized into a SU x SU similarity/distance matrix, and the various cluster analysis strategies operate on this matrix.

    The cluster analysis models described here are agglomerative: they begin with a collection of N individual SUs and progressively build groups or clusters of similar SUs. During each clustering cycle, only one pair of entities may be joined to form a new cluster. This pair may be: 1) an individual SU with another individual SU, 2) an individual SU with an existing cluster of SUs, or 3) a cluster with a cluster. Hence, the term pair-group cluster analysis is applied.

    The first step in all pair-group cluster analysis strategies involves searching the similarity/distance matrix for the smallest distance value between two individual SUs. These two individual SUs may be represented by the symbols j and k, respectively. Hence, the first cluster is formed at a distance D(j,k) and this can be diagrammed using a dendrogram. The initial collection of N SUs is now reduced to one cluster C1 (= SUs j and k joined) and N - 2 individual SUs. Special equations have been developed to compute the distance between this cluster and each of these N - 2 remaining SUs. A general linear combinatorial equation developed by Lance & Williams (1967) is given below:

    $$ D_{(j,k)(h)} = \alpha_1 D(j,h) + \alpha_2 D(k,h) + \beta D(j,k) $$

    where the distance between the new cluster (j,k), formed from the jth and kth SUs, and a third, hth SU or group of SUs can be calculated from the known distances D(j,k), D(j,h) and D(k,h) and the parameters $\alpha_1$, $\alpha_2$ and $\beta$. For example, the distance between SU 3 and the cluster represented by SUs 1 and 4 is given by:

    $$ D_{(1,4)(3)} = \alpha_1 D(1,3) + \alpha_2 D(4,3) + \beta D(1,4) $$

    The different clustering strategies differ only in their values for $\alpha_1$, $\alpha_2$ and $\beta$, which are the weights for determining the new distances. Depending on the weighting scheme used, the resultant cluster formation varies. The group mean clustering strategy (the unweighted pair-group method with arithmetic averages, UPGMA) is most commonly used. It effectively computes the mean of all distances between the SUs of one group and the SUs of another and, hence, is unweighted (see Legendre & Legendre 1998 for weighting strategies).
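    As a worked illustration of the linear combinatorial equation, the sketch below applies it with the group-mean (UPGMA) weights, for which each alpha is the share of cluster members contributed by j or k, and beta is 0. The function names and the three distance values are hypothetical and serve only to show the arithmetic.

```python
def lance_williams(d_jh, d_kh, d_jk, alpha1, alpha2, beta):
    """Distance between the merged cluster (j,k) and a third SU or group h."""
    return alpha1 * d_jh + alpha2 * d_kh + beta * d_jk

def upgma_parameters(n_j, n_k):
    """Group-mean weights: proportional to the sizes of j and k, with beta = 0."""
    total = n_j + n_k
    return n_j / total, n_k / total, 0.0

# Hypothetical distances: SUs 1 and 4 have just been merged; SU 3 is the third unit.
d_13, d_43, d_14 = 4.0, 6.0, 2.0
alpha1, alpha2, beta = upgma_parameters(n_j=1, n_k=1)
d_14_3 = lance_williams(d_13, d_43, d_14, alpha1, alpha2, beta)
print(d_14_3)   # 5.0, the mean of D(1,3) and D(4,3)
```

    With these weights the updated distance always equals the arithmetic mean of all SU-to-SU distances between the two groups, so the full distance matrix never has to be revisited after a merger.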

    Pramoth Kheowvongsri interviews a Palong healer in No Lai, northern Thailand, on medicinal plant use and trade. Responses from structured interviews can be analysed using cluster analysis.


    Ordination

    Ordination involves reduction of dimensionality. The basic objective of reducing dimensionality in analysing multi-response data is to obtain simplicity for better understanding, visualization and interpretation. While reducing the dimensions, the techniques ensure the retention of sufficient details for adequate representation. Some of the important goals of reducing the dimensionality of multiple response data are as follows (Gnanadesikan 1977):
    1) to screen out redundant variables or to find more insightful ones as a preliminary step to further analysis;
    2) to stabilize scales of measurement, when a similar property is described by each of several variables. Here the aim is to compound the various measurements into fewer numbers;
    3) to help in assessing the significance for testing a null hypothesis by compounding the multiple information. For example, small departures from null conditions may be evidenced on each of several jointly observed responses. It is advisable to integrate these non-centralities into a smaller dimensional space wherein their existence might be more sensitively indicated;
    4) to obtain the preliminary specification of a space, which may be used later on in classification and discrimination procedures;
    5) to detect the possible functional dependencies among observations in high-dimensional space.

    In ordination two distinctly different approaches exist: direct and indirect gradient analysis (Figure 3).

    [Figure 3 (diagram not reproduced). Reduction of dimensionality (ordination) is divided into direct gradient analysis and indirect gradient analysis. Methods shown: principal component analysis, detrended correspondence analysis, factor analysis, multidimensional scaling, correspondence analysis, canonical correspondence analysis, Bray-Curtis ordination.]

    Figure 3. Classification of ordination methods.

    Historically, these methods have been employed to investigate the relative importance of underlying ecological factors in vegetation analysis. In direct gradient analysis, vegetation relevés are arranged in an ecological space along axes of moisture, nutrients, altitude, etc., and the influence of the respective factors on the vegetation is determined. Indirect gradient analysis, in contrast to direct gradient analysis, focuses on the floristic composition. Five methods are distinguished: Bray-Curtis ordination, correspondence analysis, multidimensional scaling, principal component analysis, and factor analysis.

    PRINCIPAL COMPONENT AND FACTOR ANALYSIS

    The two most widely used classical linear reduction methods are principal component analysis (PCA) and factor analysis. In PCA, a d-dimensional observation (usually with correlated variables) is replaced by k uncorrelated linear combinations of the original variables, where k is much smaller than d. Biplots are used to describe graphically both the relationships among the d-dimensional observations x1, x2, x3, ..., xn and the relationships among the variables in two dimensions. The underlying assumptions for a data set to be analysed using PCA are:
    1) data are normally distributed,
    2) linear relationships exist between variables.

    Only linear relationships are elaborated through PCA. The method looks at the objects (respondents) as an assembly of dots in a space whose axes represent the (plant) species in question. The aim of the method is to project the multi-dimensional space onto a two-dimensional space such that minimum information on the distances between dots is lost. The first axis is laid through the centre of the dot cloud in the direction of largest variance. The second and following axes are perpendicular to the first axis, pointing in the direction of largest remaining variance. PCA is a transformation in which the origin of the co-ordinate system is moved to the centre of the dot cloud and the axes are arranged according to variance. The problem of moving axes is solved mathematically through analysis of the eigenvectors (from the German "eigen", meaning "self") of the co-variance or correlation matrix. Detrended correspondence analysis and reciprocal averaging are forms of PCA which were specifically developed for plant sociological analyses and are not further discussed here.
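    The geometric description above (centre the dot cloud, then find the direction of largest variance) can be illustrated with a small standard-library sketch. It extracts only the first principal axis by power iteration on the co-variance matrix, which is a simplification chosen for brevity; statistical packages normally compute all eigenvectors at once. The function names and the five observations are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return the sample co-variance matrix and the centred observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]   # origin moved to the centre
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centred

def first_axis(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)                 # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, eigenvalue

# Hypothetical data: five respondents (rows) scoring two species (columns).
rows = [(2, 1), (3, 3), (4, 2), (5, 5), (6, 4)]
cov, centred = covariance_matrix(rows)
axis, eigenvalue = first_axis(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(c[j] * axis[j] for j in range(len(axis))) for c in centred]
print(axis, eigenvalue, explained)   # first axis, its variance, share of total variance
print(scores)                        # respondent positions along the first axis
```

    For these toy data the first axis carries 90% of the total variance (4.5 out of 5.0), so a one-dimensional ordination would lose little information about the distances between respondents.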

    A vendor at a market in Menglun, Yunnan Province, China, selling spices. Market surveys can provide data suitable for principal component analysis.

    Factor analysis, a method often confused with PCA, attempts to extract a lower dimensional linear structure from the data that explains the correlations between the variables. However, when one subset of variables is compared with the subset of the remaining variables in the set, the method of canonical correlation (not discussed here) is used to find suitable linear combinations within each subset. If any grouping of the observations in a lower dimension is required to be highlighted, then canonical discriminant analysis (discriminant coordinates) can be performed. Linear combinations are then chosen to highlight group separation. In Box 5 some examples for application of principal component analysis are given.

    Examples of data matrices

    The statistical analyses of the examples provided in this working paper are all based on matrices and matrix algebra. The following examples are drawn from an exercise in which sixteen Kenyan woodcarvers were interviewed. During a workshop, three sets of data were collected: free listing of wood suitable for carving; a wood identification task (yes/no; binary or two-state character); and a paired comparison of wood species (ordered multistate character). In the following paragraphs further details are provided on these data sets.

    Box 5. Examples of data suitable for principal component analysis.

    People asked to rank or categorize plant use values. PCA can be carried out on the People x Species matrix (with the rank in the cell). The resulting ordination diagram (with people in plant space) will reveal if there are certain groups of people that tend to value the same species in the same way, i.e. gender, ethnic or age groups. The species vectors in the diagram will indicate which species are characteristic for which groups.

    Spot people who respond differently from the majority. If a person just gave random answers or purposely replied incorrectly, this person will be seen as an outlier on the ordination diagram, on the condition that there is a pattern in the answers in general.

    People indicating if certain species are useful (or not) for a number of purposes. A Species x Use matrix can be formed (with the number of species indicated in a certain use category). Ordination on these data will group species according to the use values assigned by people, and the vectors will indicate which uses characterize a group of species.

    Characterizing changes in e.g. floristic composition along environmental gradients. The axes would provide information on the most influential factor.


    I. DATA SET ON FREE LISTING OF WOOD SUITABLE FOR CARVING

    For free listing of wood species that are suitable for carving, 16 interviewees were selected, representing women, old carvers, medium-aged carvers and young apprentices. The question "Which trees can be used for carving?" was put to each interviewee, and the fourteen most preferred species were recorded along with the frequency and position of their mention by the interviewees (Table 1).

    II. BASIC DATA MATRIX FOR WOOD IDENTIFICATION TASK

    The data on the "Wood identification task" were collected on eight species, based on the responses of sixteen respondents involved in a woodcarving project in Kenya. Each artisan was asked separately, for each of the eight species, whether he or she could identify the species. In the event of a positive reply (Yes), the value 1 was allotted; if the reply was No, the value 0 was assigned. In this way, the matrix for sixteen respondents and eight species was completed. The species were arranged across the rows, while the respondents were arranged across the columns (see Appendix I, Table 1).

    III. BASIC DATA MATRIX FOR PAIRED COMPARISON OF WOOD SPECIES

    In order to assess species preference among the artisans, a "Paired comparison of wood species" was undertaken. For this purpose, five tree species used for woodcarving were selected and the respondents (the sixteen artisans) were asked to state their preference within each pair combination of the five species. The preferences of each respondent for the ten possible species pair combinations (n (n-1) / 2, with n = 5) were tabulated as shown in Table 2. The score of a species is defined by the total number of times it is mentioned in the table, and the highest rank is assigned to the species with the highest score. Pairwise rank matrices were then prepared for each respondent (R1 ... R16). Finally, the ranks for the five species so obtained from the responses were tabulated in matrix form. The rows of the matrix represented the species and the columns represented the respondents.

    Table 1. Fourteen most preferred species in the Kenyan woodcarving industry with frequency and position of their mention by sixteen interviewees.

    Species                   Frequency of mention (X)   Position of mention (Y)
    Brachylaena huillensis    16                         1.4
    Dalbergia melanoxylon     16                         1.8
    Sterculia africana        13                         8.2
    Zanthoxylum chalybeum     10                         6.2
    Combretum schumannii      16                         4.9
    Terminalia brownii        9                          10.7
    Olea europaea             6                          9.3
    Albizia anthelmintica     8                          10.5
    Mangifera indica          10                         10.3
    Erythrina sacleuxii       12                         9.7
    Commiphora baluensis      11                         9.7
    Azadirachta indica        12                         6.3
    Oldfieldia somalensis     10                         12.5
    Platycelyphium voense     6                          12.0

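    Table 2. A pairwise ranking matrix for five tree species used in woodcarving. * Each cell gives the species preferred in the comparison of the row species with the column species; "-" marks cells that are not used.

          S1   S2   S3   S4   S5   Score   Rank
    S1    -    S2   S3   S4   S5   0       A
    S2    -    -    S2   S2   S2   4       E
    S3    -    -    -    S3   S5   2       C
    S4    -    -    -    -    S5   1       B
    S5    -    -    -    -    -    3       D

    * The table is based on the preferences mentioned by one respondent.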
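    The scoring rule behind Table 2 (count how often each species is preferred, then rank by score) is easy to reproduce. The sketch below encodes the ten preferences of the single respondent shown in Table 2; the variable names and the dictionary layout are illustrative assumptions, while the preferences, scores and rank letters are those of the table.

```python
from itertools import combinations

species = ["S1", "S2", "S3", "S4", "S5"]

# Preferred species for each pair, as read from the upper triangle of Table 2.
preferred = {
    ("S1", "S2"): "S2", ("S1", "S3"): "S3", ("S1", "S4"): "S4", ("S1", "S5"): "S5",
    ("S2", "S3"): "S2", ("S2", "S4"): "S2", ("S2", "S5"): "S2",
    ("S3", "S4"): "S3", ("S3", "S5"): "S5",
    ("S4", "S5"): "S5",
}

# Every one of the n(n-1)/2 = 10 pairs must have been judged exactly once.
assert set(preferred) == set(combinations(species, 2))

# Score = number of times a species is mentioned as the preferred one.
scores = {s: 0 for s in species}
for winner in preferred.values():
    scores[winner] += 1

# Rank letters run from A (lowest score) to E (highest score), as in Table 2.
ordered = sorted(species, key=lambda s: scores[s])
ranks = {s: "ABCDE"[i] for i, s in enumerate(ordered)}

print(scores)   # {'S1': 0, 'S2': 4, 'S3': 2, 'S4': 1, 'S5': 3}
print(ranks)    # {'S1': 'A', 'S4': 'B', 'S3': 'C', 'S5': 'D', 'S2': 'E'}
```

    Tied scores would need an explicit tie rule; none is stated in the text, and this respondent's answers produce no ties.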

    Muhuhu (Brachylaena huillensis O. Hoffm., Asteraceae) logs piled up outside a carving workshop in Wamunyu, Eastern Province, Kenya. Each year 40,000 muhuhu trees are felled in Kenya for woodcarving.


    Matrix structure and analysis

    The term descriptor is used for the attributes that describe or compare the objects of the study. The objects may be the respondents, samples, locations, quadrats, observations or any other sampling units (e.g. operational taxonomic units, OTUs, in numerical taxonomy). In our example the respondents (R1 - R16) were objects and the species were the descriptors, i.e. responses of the artisans (measures of ability to identify wood species used for woodcarving). Yes/No or positive/negative reply values were recorded in the "Wood identification task", while rank values of the species were recorded in the "Pairwise ranking of wood species" data.

    NORMAL VS. INVERSE ANALYSIS

    Data matrices can be viewed either down columns or across rows, i.e. one can look at relations between objects or between descriptors. For instance, one may wish to explore the relationship between respondents/objects to see whether certain groups of people gave similar responses and therefore may have similar attitudes towards carving wood. Or, one may wish to explore relations between descriptors/rows to highlight for which species people tend to give similar responses. Maximum information can often be obtained by carrying out both modes of analysis. The two modes of analysis require different measures of association, as objects are independent of each other (sampling of objects is preferably done in a way that ensures mutual independence of sampling units), whereas descriptors may be dependent. A variety of association measures are available to study the relationship of objects (e.g. Legendre & Legendre 1998). Different correlation coefficients are applied to the study of relations between descriptors.

    If objects are grouped on the basis of the entire set of descriptors, it is sometimes referred to as normal analysis, whereas in an inverse analysis, descriptors are grouped on the basis of their distribution in a series of objects (Kent & Coker 1994). In connection with ordinations the two modes of analysis have been referred to as objects in descriptors space and descriptors in objects space, e.g. people in species space and woodcarving species in people's space.

    The two modes of analysis described above are also frequently referred to as R and Q mode. The use of the terms R- and Q-mode, however, is a possible point of confusion, as certain authors define the mode on the basis of the association matrix, whereas others define the mode on the purpose of the analysis. Authors who base the definition of the mode on the association matrix call analyses based on the relationships between descriptors R-mode, and analyses based on the relationships of objects Q-mode (Jongmann et al. 1987, Legendre & Legendre 1998). Authors who base the definition of the mode on the purpose of the analysis do so in two contradictory ways: in some literature R-mode relates to classification/ordination of objects and Q-mode relates to species classification/ordination (Pielou 1984; Causton 1988; Kent & Coker 1994). Again, in other literature, Q-mode relates to classification/ordination of objects and R-mode relates to species classification/ordination (Romesburg 1984).

    Object ordinations normally begin with a dispersion/correlation matrix of descriptors (although they can be based on an association matrix of objects). According to Legendre & Legendre (1998), object ordinations can therefore be both R- and Q-mode. Because of these confusing notations, we prefer to use the terms normal and inverse to describe the purpose of the analysis.

    National Museums of Kenya researchers Mohamed Pakia, Raymond Obunga and Juma(?) Mududu measuring DBH and basal diameters of standing and cut Combretum schumannii Engl. (Combretaceae) trees in Dzombo Forest, coastal Kenya.

    SIMILARITY MEASURES

    The analysis is started on a resemblance matrix which is derived either from the original or from a transformed/standardized data matrix. These resemblance matrices are called similarity matrix or dissimilarity matrix depending on the way in which resemblance functions are calculated and the matrix is derived. In this section, some resemblance functions that quantify the similarity or dissimilarity between samples are described. The more similar the objects (respondents or samples) are with respect to a particular character (variable), the greater their resemblance and the smaller the distance between them when projected into a geometric space. Resemblance functions quantify the similarity or dissimilarity between two objects (samples) based on observations over a set of descriptors (Sneath & Sokal 1973). To explore the nature of relationships or affinities that exist among the respondents, normal mode analysis is usually applied. Two types of normal mode resemblance functions are distinguished: 1) similarity coefficients and 2) distance coefficients.

    Similarity coefficients vary from a minimum of 0 (when a pair of respondents are completely different) to 1 (when the respondents are identical). On the other hand, distance coefficients assume a minimum value of 0 when a pair of respondents are identical and have some maximum value (in some cases infinity) when the pair of respondents are completely different. Hence, distance coefficients are also referred to as dissimilarity coefficients. In fact, a similarity index may always be expressed as a distance just by a simple transformation such as 1 - similarity (Legendre & Legendre 1998). Thus, distance may be thought of as the complement of similarity (Sneath & Sokal 1973).

    Similarity coefficients are widely used indices. These indices are based solely on presence/positive reply (indicated with a 1) or absence/negative reply (indicated with a 0) data (see Appendix 1, Tables 1 to 5 for illustration).

    Three indices (Ochiai, Dice and Jaccard) are useful for calculating the similarity index of presence/absence or positive/negative reply data (qualitative) (see Box 6).

    These indices can be used to measure the degree of association between species (an inverse mode analysis, i.e. across the rows of the data matrix) as well as to compute a normal mode similarity between respondents. It may be mentioned here that these are the only types of functions that are used to measure both normal mode (sample similarity) and inverse mode (species association) resemblance (Ludwig & Reynolds 1988).

    DISTANCE COEFFICIENTS

    Measures of distance may be categorized into three groups:
    1) E-group (the Euclidean distance coefficients);
    2) BC-group (the Bray-Curtis dissimilarity index);
    3) RE-group (the relative Euclidean distance measures).


    Box 6. Indices for calculating similarity index of presence/absence or positive/negative reply data (qualitative).

    Ochiai Index (OI)

    $$ OI = \frac{a}{\sqrt{a+b}\,\sqrt{a+c}} $$

    In the above example:

    $$ OI_{R14,R15} = \frac{1}{\sqrt{1}\,\sqrt{3}} = 0.577 $$

    Dice Index (DI) (Sørensen Index)

    $$ DI = \frac{2a}{2a+b+c} $$

    In the above example:

    $$ DI_{R14,R15} = \frac{2}{2+0+2} = 0.5 $$

    Jaccard Index (JI)

    $$ JI = \frac{a}{a+b+c} $$

    In the above example:

    $$ JI_{R14,R15} = \frac{1}{1+0+2} = 0.33 $$
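    The three indices of Box 6 can be checked with a short sketch. It assumes the usual two-by-two notation, which the box itself does not spell out: a is the number of species with a positive reply from both respondents, b the number with a positive reply from the first respondent only, and c the number with a positive reply from the second only. The two reply vectors are hypothetical and were chosen only so that a = 1, b = 0 and c = 2, as in the worked example; they are not the Appendix data.

```python
import math

def contingency(x, y):
    """Count joint (a) and one-sided (b, c) positive replies of two respondents."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    return a, b, c

def ochiai(a, b, c):
    return a / math.sqrt((a + b) * (a + c))

def dice(a, b, c):
    return 2 * a / (2 * a + b + c)

def jaccard(a, b, c):
    return a / (a + b + c)

# Hypothetical yes/no replies of two respondents over eight species.
first = [1, 0, 0, 0, 0, 0, 0, 0]
second = [1, 1, 1, 0, 0, 0, 0, 0]
a, b, c = contingency(first, second)
print(a, b, c)                        # 1 0 2
print(round(ochiai(a, b, c), 3))      # 0.577
print(round(dice(a, b, c), 3))        # 0.5
print(round(jaccard(a, b, c), 3))     # 0.333
```

    Joint negative replies do not enter any of the three indices. A respondent with no positive reply at all would make the denominators zero, a case this sketch does not handle.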

    The distances computed between all possible pairs of sampling units (SUs) based on any of the above similarity or distance measures are arranged in a SU x SU matrix. Examination of this matrix quickly reveals the distance between any two SUs. It is on this distance matrix that the clustering strategies and ordination techniques such as principal component analysis operate. The distance coefficients are explained in Box 7.


    BC-GROUP DISTANCE

    This group is represented by a single index first introduced by Bray & Curtis (1957). The first step is to compute the percent similarity (PS) between SUs j and k as

    $$ PS_{jk} = \frac{2W}{A + B} \times 100 $$

    where

    $$ W = \sum_{i=1}^{S} \min(X_{ij}, X_{ik}), \qquad A = \sum_{i=1}^{S} X_{ij}, \qquad B = \sum_{i=1}^{S} X_{ik} $$

    Percent Dissimilarity (PD):

    $$ PD = 100 - PS $$

    PD may also be computed on a 0 - 1 scale as

    $$ PD = 1 - \frac{2W}{A + B} $$

    RE-GROUP DISTANCE

    This group contains distance indices that are expressed on standardized or relative scales.

    RELATIVE EUCLIDEAN DISTANCE (RED)

    $$ RED_{jk} = \sqrt{\sum_{i=1}^{S} \left[ \frac{X_{ij}}{\sum_{i=1}^{S} X_{ij}} - \frac{X_{ik}}{\sum_{i=1}^{S} X_{ik}} \right]^{2}} $$

    RED ranges from 0 to $\sqrt{2}$.

    RELATIVE ABSOLUTE DISTANCE (RAD)

    $$ RAD_{jk} = \sum_{i=1}^{S} \left| \frac{X_{ij}}{\sum_{i=1}^{S} X_{ij}} - \frac{X_{ik}}{\sum_{i=1}^{S} X_{ik}} \right| $$

    RAD has a range from 0 to 2.

    CHORD DISTANCE (CRD)

    This is done by projecting the SUs onto a circle of unit radius through the use of direction cosines. The measure is then the chord distance between the two SUs after such a projection.

    $$ CRD_{jk} = \sqrt{2\,(1 - ccos_{jk})} $$

    where the chord cosine (ccos) is computed from:

    $$ ccos_{jk} = \frac{\sum_{i=1}^{S} X_{ij} X_{ik}}{\sqrt{\sum_{i=1}^{S} X_{ij}^{2} \, \sum_{i=1}^{S} X_{ik}^{2}}} $$

    In the case of binary data, this ccos is identical to Ochiai's coefficient. CRD, like RED, ranges from 0 to $\sqrt{2}$.

    GEODESIC DISTANCE (GDD)

    This measure is the distance along the arc of the unit circle (rather than the chord distance) after projection of the SUs onto a circle of unit radius:

    $$ GDD_{jk} = \arccos(ccos_{jk}) $$

    GDD has a range from 0 to $\pi/2$ (i.e. 0 to 1.57).

    Box 7. Distance coefficients (after Ludwig & Reynolds 1988).

    E-GROUP DISTANCES

    EUCLIDEAN DISTANCE (ED)

    This measure is the familiar equation for calculating the distance between two points Rj and Rk in Euclidean space:

    $$ ED_{jk} = \sqrt{\sum_{i=1}^{S} (X_{ij} - X_{ik})^{2}} $$

    The value of ED ranges from zero to infinity, as do all of the E-group measures.

    SQUARED EUCLIDEAN DISTANCE (SED)

    This measure is the square of ED:

    $$ SED_{jk} = \sum_{i=1}^{S} (X_{ij} - X_{ik})^{2} $$

    MEAN EUCLIDEAN DISTANCE (MED)

    MED is similar to ED, but the final distance is on a smaller scale since the mean difference is used:

    $$ MED_{jk} = \sqrt{\frac{\sum_{i=1}^{S} (X_{ij} - X_{ik})^{2}}{S}} $$

    ABSOLUTE DISTANCE (AD)

    This measure is the sum of the absolute differences taken over the S species:

    $$ AD_{jk} = \sum_{i=1}^{S} \left| X_{ij} - X_{ik} \right| $$

    This distance measure is also known as the Manhattan or city block dissimilarity coefficient. The AD measure is the character difference in numerical taxonomy (Sneath & Sokal 1973).

    MEAN ABSOLUTE DISTANCE (MAD)

    The MAD is similar to AD, but a mean distance is used rather than an absolute distance:

    $$ MAD_{jk} = \frac{\sum_{i=1}^{S} \left| X_{ij} - X_{ik} \right|}{S} $$

    MAD is equivalent to the mean character difference used in numerical taxonomy (Sneath & Sokal 1973).
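    To make the distance coefficients of Box 7 concrete, the following sketch implements one or two measures from each group for a pair of sampling units. The function names and the two data vectors are hypothetical. The chord cosine is clamped at 1 only to guard against floating-point overshoot.

```python
import math

def euclidean(x, y):                      # ED
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def absolute(x, y):                       # AD (Manhattan, city block)
    return sum(abs(a - b) for a, b in zip(x, y))

def bray_curtis_pd(x, y):                 # PD on the 0 - 1 scale
    w = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * w / (sum(x) + sum(y))

def relative_euclidean(x, y):             # RED, on relative (proportional) values
    tx, ty = sum(x), sum(y)
    return math.sqrt(sum((a / tx - b / ty) ** 2 for a, b in zip(x, y)))

def chord_cosine(x, y):                   # ccos
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return min(1.0, num / den)

def chord(x, y):                          # CRD
    return math.sqrt(2 * (1 - chord_cosine(x, y)))

def geodesic(x, y):                       # GDD
    return math.acos(chord_cosine(x, y))

# Hypothetical values of two sampling units over three species.
su_j = [3, 0, 1]
su_k = [1, 2, 1]
print(euclidean(su_j, su_k))       # square root of 8
print(absolute(su_j, su_k))        # 4
print(bray_curtis_pd(su_j, su_k))  # 0.5

# Two units with no species in common reach the upper bounds quoted in Box 7.
print(chord([1, 0], [0, 1]), geodesic([1, 0], [0, 1]))   # sqrt(2) and pi/2
```

    Dividing the sum inside `euclidean` or `absolute` by the number of species S gives MED and MAD, respectively.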


    Cluster analysis of the "Wood identification task"

    The six basic steps involved in cluster analysis are described below, taking the data sets from the Kenyan woodcarving project, "Wood identification task" and "Paired comparison of wood species", as examples. Cluster analysis is useful for such data in two ways:
    1) The responses (objects) can be grouped according to their resemblances, i.e. based on the respondents' ability to identify a particular species used for woodcarving in the case of the "Wood identification task" data, and on species preferences in the "Paired comparison of wood species" data. The respondents in each cluster should have a number of common characteristics that set them apart from the respondents of other such clusters.
    2) The data sets can be reduced to homogeneous groups or clusters. The objective is to demonstrate the relationships of the respondents to each other and to simplify these relationships so that some general statements about the classes of respondents that exist can be made.

    This being an ethnobotanical problem, where the interest is to learn about the respondents through their views on the individual species, normal mode analysis will be used for both data sets. The procedure is a polythetic, agglomerative classification technique. The results are based on the output of the NTSYS package, but the basic steps are similar for any other package.

    Applications of cluster and principal component analysis

    STEP 1 Obtaining the basic data matrix (see Appendix 1, Tables 1 and 2).

    STEP 2 Standardizing the basic data matrix. The basic data matrix is standardized for the following reasons: 1) to make the species contribute more equally to the similarity between the respondents; 2) to remove all the measuring units (not applicable to the data presented here). The standardization is performed through a linear transformation of the original values for each character/element of the basic data matrix. Since binary data are not standardized, the basic data matrix for the "Wood identification task" will be used for further analysis. The basic data matrix for the "Paired comparison of wood species" has been standardized by dividing the matrix elements by the standard deviation (see Appendix I, Table 3).

    STEP 3 Computing the resemblance matrix. The next step in cluster analysis is to compute a normal mode resemblance between the respondents (R1 ... R16). Although any of the numerous resemblance functions available could be used, distance measures have been used for the multistate character data in the "Paired comparison of wood species" because of their heuristic value in a cluster analysis (Sneath & Sokal 1973). However, for the two-state data in the "Wood identification task", the similarity measure used is Jaccard's coefficient. The distances between all pairwise combinations of respondents are summarized into a 16 x 16 distance (D) or resemblance matrix for each data set (see Appendix I, Tables 6 and 7). The further cluster analysis strategies operate on these resemblance matrices.

    STEP 4 Executing the clustering method and obtaining the tree matrix. The clustering technique used here is a hierarchical agglomerative procedure based on UPGMA (unweighted pair-group method with arithmetic averages). The clustering was executed on the resemblance matrices (see Appendix I, Tables 6 and 7) to yield the tree matrices (see Appendix I, Tables 8 and 9).

    STEP 5 Drawing the tree or dendrogram. The tree matrix obtained in Step 4 produces a tree drawn to scale showing the clustering scheme. The dendrograms or trees for the "Wood identification" and "Paired comparison of wood species" data are given in Figures 4 and 5, respectively.

    STEP 6 Computing the cophenetic matrix and coefficient, and plotting. A tree is not exactly like the data matrix it represents. It is necessary to know how well the tree represents the basic data matrix. The cophenetic correlation coefficient measures how well the tree and the resemblance matrix match. The values that appear in the cophenetic matrix (see Appendix I, Tables 10 and 11) stem from the tree and are compared with those of the resemblance matrix either through a matrix plot or through the Pearson product moment correlation coefficient.

    Box 8. Steps involved in cluster analysis based on Wood identification task.
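    Steps 4 and 6 of Box 8 can be traced on a toy example. The sketch below is not the NTSYS procedure; it is a minimal standard-library illustration that runs UPGMA on a hypothetical distance matrix for four respondents, derives the cophenetic distances from the resulting merger levels, and correlates them with the original distances. All function names and distance values are assumptions made for the example.

```python
import math
from itertools import combinations

def upgma(labels, dist):
    """Return the mergers (group_j, group_k, level); dist maps frozenset({a, b}) to a distance."""
    clusters = [(lab,) for lab in labels]
    d = {frozenset(((a,), (b,))): dist[frozenset((a, b))]
         for a, b in combinations(labels, 2)}
    mergers = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # closest pair of clusters
        j, k = sorted(pair)
        level = d.pop(pair)
        clusters.remove(j)
        clusters.remove(k)
        merged = j + k
        for h in clusters:                    # group-mean update of the distances
            d_jh = d.pop(frozenset((j, h)))
            d_kh = d.pop(frozenset((k, h)))
            d[frozenset((merged, h))] = (len(j) * d_jh + len(k) * d_kh) / len(merged)
        clusters.append(merged)
        mergers.append((j, k, level))
    return mergers

def cophenetic(mergers):
    """Cophenetic distance of two objects = level at which they are first joined."""
    return {frozenset((a, b)): level
            for j, k, level in mergers for a in j for b in k}

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical distances between four respondents.
labels = ["R1", "R2", "R3", "R4"]
values = {("R1", "R2"): 1.0, ("R1", "R3"): 4.0, ("R1", "R4"): 5.0,
          ("R2", "R3"): 3.0, ("R2", "R4"): 6.0, ("R3", "R4"): 2.0}
dist = {frozenset(pair): v for pair, v in values.items()}

mergers = upgma(labels, dist)
coph = cophenetic(mergers)
pairs = list(dist)
r = pearson([dist[p] for p in pairs], [coph[p] for p in pairs])
print(mergers)        # merger levels 1.0, 2.0 and 4.5
print(round(r, 3))    # cophenetic correlation, about 0.845
```

    A coefficient close to 1 indicates that the dendrogram distorts the original resemblance matrix very little. Tied minimum distances are resolved arbitrarily here, which a real analysis would need to handle explicitly.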


    The clustering results depicted in the dendro-gram (Figure 4) for the Wood identification taskexhibit a clear separation of respondents at theeight-cluster level. Thetwo-cluster solution sepa-rates the two groups (R16from the rest) which maybe different in their socio-economic conditions, agestructure, artisan skills orethnic composition. Thismay be examined throughthe already collected datain these respects or a newexplorative study may bedesigned for testing theabove hypothesis. Further,the tree shows that R1, R2,R3, R4, R5, R15 and R11are similar. Based on this,the researcher can treatthese respondents as simi-lar in further experimentsor in designing new stud-ies. In addition, the factorsresponsible for such simi-larity may also beexplored, which may havehigh ethnobotanical rele-vance.

    Similarly, the tree in Figure 5 for the Paired comparison of wood species reveals the existence of two groups of respondents (R3, R4, R16 and R7 in one group and the rest in another). The factors underlying this grouping pattern may be explored. Further, two distinct subgroups of respondents exist: one with R1, R2, R10 and R14, and the other with R6, R13, R9, R11 and R12. Each subgroup consists of a fairly large number of similar respondents (four and five, respectively). The factor(s) behind such similarity may be an interesting ethnobotanical observation.


    Figure 4. Dendrogram based on the distance matrix, showing the clustering of sixteen respondents with respect to eight tree species used for woodcarving, according to the responses of the interviewees.

    Figure 5. Dendrogram based on the similarity matrix, showing the clustering of sixteen respondents with respect to five tree species used for woodcarving in the Paired comparison of wood species, according to the responses of the interviewees.


    Principal component analysis of the Paired comparison of wood species task

    In the Paired comparison of wood species data, the responses of sixteen respondents for five species were recorded. The basic purpose of principal component analysis (PCA) is to account for the total variation among these sixteen respondents in a five-dimensional space by forming a new set of orthogonal (uncorrelated) composite variates, the ordination axes. Each of these axes is a linear combination of the original set of variables, and each successive composite variate accounts for a smaller portion of the total variation. Thus, the first principal component has the largest variance, the second has a variance smaller than the first and larger than the third, etc. The new composite variables required to account adequately for the total variation are generally fewer than the original variables.

    The objectives are as follows:

    • To simplify and condense data sets. If there are many species, dimensions can be reduced, such that relationships between the respondents can be examined more easily.

    • To project the informants into the space of coordinates according to their responses, so that their positions relative to the axes and to each other provide maximum information about their similarities. By identifying similar informants from their position with respect to the axes, the factors underlying the observed pattern may be sought.

    • The observed patterns may later be explained on the basis of social and cultural features of the informants. Differences in the pattern may be correlated with ethnicity, relative economic conditions, family structure, etc.

    The basic data matrices (Appendix I, Tables 1 and 2) will be used. PCA is carried out through the following six steps (Box 9).

    STEP 1. Standardization of basic data matrix. Usually, it is not necessary to standardize data before computing PCA. To work with standardized data, the correlation matrix is chosen instead of the co-variance matrix, so that data are standardized automatically. If more weight is to be put on commonly mentioned species (as is sometimes preferable), the co-variance matrix is chosen so as to work with unstandardized data. As in cluster analysis, binary data should not be standardized. In the following, the standardized data matrix for the Paired comparison of wood species (see Appendix I, Table 3), computed across the rows for cluster analysis, is utilized.

    STEP 2. Calculation of correlation between characters. Unlike in cluster analysis, in PCA the correlation among the variables/characters (correlation among species with respect to respondents) is computed (inverse mode analysis). Similarity/correlation measures commonly used here are correlation, variance-covariance and the matrix times its transpose (X × Xᵀ). For two-state data, however, indices such as the Jaccard, Phi and Simple Matching Coefficients (SMC) are used. Thus, resemblance matrices across the rows are computed (see Appendix I, Tables 12 and 13). If similarity matrices other than correlation and co-variance are used, the ordination is generally called principal coordinate analysis (PCO or PCoA). For binary data sets, PCO is recommended instead of PCA.

    STEP 3. Double decentering the resemblance matrix. An additional step in data transformation for PCA is double decentering. This is performed on the resemblance matrix (see Appendix I, Tables 14 and 15).

    STEP 4. Eigen-analysis for deriving principal components. Eigen-values for each ordination axis and Eigen-vectors for each variable (species) are computed. The Eigen-value is the variance of a particular principal component, while the Eigen-vector is the set of coefficients defining that principal component. In our example, the first three principal components explain more than 85% of the variance in the Wood identification task data (see Appendix I, Table 16) and more than 89% in the Paired comparison of wood species data (see Appendix I, Table 17). Therefore, the first three principal components were used for further analysis. Eigen-vector matrices (U) with the loading of each variable (species) in each principal component are presented in Tables 18 and 19 of Appendix I.

    STEP 5. Projection of the respondents into the ordination space. The projection matrix (Y) for each data set is computed from the basic data matrix (A) and the Eigen-vector matrix (U): Y = A × U. This operation results in the projection matrices (Tables 20 and 21, Appendix I).

    STEP 6. Ordination or plotting of projection matrix. The projection matrix can be plotted in both a two-dimensional space (Figures 6 and 7) and a three-dimensional space (Figures 8 and 9, page 20). The former figures depict the position of respondents with respect to the species in a space of two principal components (in two dimensions, since only two PCs are considered), while the latter two arrange the respondents in a space of three principal components (thus in three dimensions).

    Box 9. Steps involved in principal component analysis.
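
The steps of Box 9 can be sketched with NumPy as follows. This is a minimal illustration, not the authors' procedure: the 16 × 5 score matrix is randomly generated and purely hypothetical, the correlation matrix is assumed as the resemblance measure, and the double decentering of step 3 is left out because the data are already centred by the standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: 16 respondents (rows) x 5 species (columns).
A = rng.integers(1, 6, size=(16, 5)).astype(float)

# Step 1: standardize each species column (this is what choosing the
# correlation matrix does implicitly).
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Step 2: correlation among the species (the variables).
R = np.corrcoef(A, rowvar=False)

# Step 4: eigen-analysis. eigh returns eigenvalues in ascending order,
# so they are re-sorted from largest to smallest.
eigvals, U = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]
explained = eigvals / eigvals.sum()

# Step 5: projection matrix Y = A x U, here using the standardized data.
Y = Z @ U

print("cumulative share of variance:", np.round(np.cumsum(explained), 3))
print("coordinates of first respondent on PC1-PC3:", np.round(Y[0, :3], 3))
```

The first two or three columns of `Y` are the coordinates that would be plotted in step 6. With a correlation matrix the eigenvalues sum to the number of variables, which gives a quick check on the computation.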


    INTERPRETING THE RESULTS OF PRINCIPAL COMPONENT ANALYSIS

    In both of the above cases, three composite variables (principal components) were derived from the original species variables: eight in the Wood identification task and five in the Paired comparison of wood species.

    The coefficients (i.e. loadings) represent the correlation of Y1 with the respective original variable. Thus, 0.2695 can be interpreted as the correlation between Y1 and the variable species S1. The principal components (composite variables) are interpreted on the basis of those variables with strong loading patterns. In the example (see Appendix I, Table 19), S1, S2, S4 and S7 have similar (positive) loading profiles within Y1, so the first principal component Y1 may be interpreted and named accordingly, for example "poor people's species" or "X ethnic community's species", depending upon the information available about the respondents. Similarly, PC 2 (Y2), in which S3 and S8 are the important defining variables, may be appropriately named, and so on.

    In the Paired comparison of wood species data, PC 1 can be defined in terms of species S1 (0.729) and S2 (0.6063), PC 2 in terms of S3 (0.4586) and PC 3 in terms of S5 (0.5808), since their loadings dominate the respective composite variables (principal components).

    Further, the relationship among the informants with respect to the groups of species identified above can be depicted by the projection plots of the informants in the space of the principal components (Figures 6-9, pages 19 and 20), according to their responses to the queries. Such relationships can later be correlated with various underlying factors that are important from an ethnobotanical point of view.
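
The reading of loadings as correlations can be checked numerically. The sketch below assumes a PCA computed on the correlation matrix and uses hypothetical random data. Under that assumption, the correlation between a variable and a component equals the eigenvector coefficient multiplied by the square root of the corresponding eigenvalue, so raw eigenvector coefficients match the correlations only up to that scaling.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 16 respondents x 5 species (not the data of Appendix I).
A = rng.normal(size=(16, 5))
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# PCA on the correlation matrix, components sorted by decreasing variance.
R = np.corrcoef(A, rowvar=False)
eigvals, U = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]
Y = Z @ U

# Loadings: eigenvector coefficients scaled by the square root of the
# eigenvalue of their component (column-wise broadcasting).
loadings = U * np.sqrt(eigvals)

# Direct correlation between every species (rows) and every component (columns).
direct = np.array([
    [np.corrcoef(Z[:, i], Y[:, j])[0, 1] for j in range(5)]
    for i in range(5)
])

print(np.round(loadings[:, 0], 3))  # correlations of the five species with PC 1
```

Species with large absolute values in one column of `loadings` are the ones that would be used to name that component.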

    Figure 6. Projection of eight respondents in the space defined by the first and second principal components in the Wood identification task. [Scatter plot; axes: Principal component 1 and Principal component 2.]

    Figure 7. Projection of eleven respondents in the space defined by the first and second principal components in the Paired comparison of wood species. [Scatter plot; axes: Principal component 1 and Principal component 2.]

    Figure 8. Projection of eight respondents in the space defined by the first, second and third principal components in the Wood identification task. [Three-dimensional scatter plot; axes: Principal components 1, 2 and 3.]

    Figure 9. Projection of eleven respondents in the space defined by the first, second and third principal components in the Paired comparison of wood species. [Three-dimensional scatter plot; axes: Principal components 1, 2 and 3.]


    Analysis of variance, regression analysis and log-linear modelling, like cluster and principal component analysis, are all based on linear mathematical equations. Log-linear modelling is also known as discrete multivariate analysis. Which technique is applied depends on the nature of the data (i.e. continuous, ordinal or qualitative).

    Hypothesis testing

    Most computer packages for statistics provide two procedures for analysis of variance: ANOVA and the general linear model (GLM). The latter is usually less automated and allows the analysis of randomized and incomplete block designs, fractional factorials, Latin square designs, analysis of co-variance with one or more covariates, crossover designs, split-plot designs, repeated measurements and nesting, as well as the definition of separate error terms for factors in means and effects models to test hypotheses in missing-cells designs.

    ANALYSIS OF VARIANCE (ANOVA)

    Inferential multivariate techniques are generalizations of classical univariate procedures. The multivariate analogue of analysis of variance is known as MANOVA. Similarly, for carrying out simultaneous hypothesis tests and constructing simultaneous confidence intervals, procedures for univariate cases are usually generalized and extended to multivariate situations. Important assumptions about the sampled populations for the application of ANOVA include the following:

    1) data (standardized or otherwise transformed) are normally distributed;
    2) variances are equal (condition of homogeneity of variances or homoscedasticity);
    3) no significant interactions exist between variables;
    4) group means and standard deviations are independent (i.e. the size of the group means is not related to the size of their standard deviations);
    5) data contain no gross outliers (outliers may be excluded from analysis upon plausibility checks);
    6) numbers of observations in different categories (cells) are equal (not obligatory).

    If, after standardization or transformation, conditions 1) to 3) are not met, data can be analysed by defining alternative multivariate general linear models, or hypotheses can be tested by specifying nonparametric models. If significant interactions among variables are suspected (e.g. the influence of the level of a given plant compound on the level of its chemical derivative), analysis of co-variance is carried out to adjust or remove the variability in the dependent variable due to the covariate.

    When the assumption of homogeneous variances does not hold, it is sometimes possible to adjust the degrees of freedom so that the resulting test statistic approximately follows an F distribution. In SYSTAT, a procedure based on Levene's test for unequal variances allows residuals to be saved and an ANOVA to be performed on the absolute values of the residuals, merged with the original grouping variables. If the test is significant, separate variance tests can be performed in the GLM module.
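
The idea behind the Levene-type procedure described above, an analysis of variance on the absolute residuals, can be sketched in plain Python. The three groups are hypothetical measurements, the residuals are taken about the group means (an assumption; other variants use medians), and the critical value is read from a standard F table rather than computed.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_groups, df_error) for a one-way analysis of variance."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_groups = len(groups) - 1
    df_error = len(values) - len(groups)
    return (ss_groups / df_groups) / (ss_error / df_error), df_groups, df_error


def levene_f(groups):
    """ANOVA on the absolute residuals about each group mean."""
    abs_residuals = [[abs(v - mean(g)) for v in g] for g in groups]
    return one_way_f(abs_residuals)


# Hypothetical data: the second group is visibly more variable.
groups = [
    [10, 11, 9, 10, 10],
    [10, 15, 5, 12, 8],
    [10, 11, 9, 10, 10],
]
w, df1, df2 = levene_f(groups)

F_CRITICAL = 3.89  # tabulated F for alpha = 0.05 with 2 and 12 degrees of freedom
variances_differ = w >= F_CRITICAL
print(round(w, 3), df1, df2, variances_differ)
```

A significant result, as in this constructed example, would point towards separate variance tests or a nonparametric alternative.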

    Although analysis of variance is generalized from two-sample procedures, it is invalid to test a hypothesis about several means by separately testing all possible pairs of means.

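    $$H_0\colon \mu_1 = \mu_2, \qquad H_0\colon \mu_2 = \mu_3, \qquad H_0\colon \mu_1 = \mu_3 \qquad \text{(invalid)}$$

    $$H_0\colon \mu_1 = \mu_2 = \mu_3 \qquad \text{(valid)}$$

    The above example is for hypothesis testing of a means model. Alternatively, hypothesis testing on an effects model would read:

    $$H_0\colon \alpha_1 = \alpha_2 = \alpha_3 = 0$$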
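
One way to see why separate pairwise tests are discouraged is to look at the chance of at least one false rejection. The short calculation below is a simplification that assumes the $m = 3$ pairwise tests are independent (they are not exactly so, since they share data) and that each is carried out at a significance level of $0.05$; both the symbol $m$ and the level are chosen here for illustration.

```latex
\begin{align*}
P(\text{at least one false rejection})
  &= 1 - (1 - 0.05)^{m} \\
  &= 1 - 0.95^{3} \\
  &= 1 - 0.857375 \\
  &\approx 0.143
\end{align*}
```

Under these assumptions the overall risk of a false positive is almost three times the nominal level of each individual test, whereas the single test of the joint hypothesis keeps it at the nominal level.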

    The null hypothesis

    The null hypothesis assumes that there is no difference between population means. When asking people from two different communities about the number of medicinal plant species they use, the null hypothesis assumes that the two communities use approximately equal (i.e. not significantly different) numbers of species. The F-test is used to determine whether the null hypothesis is to be accepted or rejected and at which confidence level. For example, p


    and the error MS will each be an estimate of $\sigma^2$, the variance common to all populations sampled. However, if the population means are not equal, then the groups MS will be greater than the error MS.

    $$F = \frac{\text{groups MS}}{\text{error MS}}$$

    If the calculated F-ratio is at least as large as the critical value, then H0 is rejected and the alternative hypothesis HA (the population means are not all equal) is accepted.

    The critical value is obtained as follows. The total sum of squares about the overall mean has N − 1 degrees of freedom, corresponding to the divisor normally used in calculating an estimate of variance. There are k − 1 degrees of freedom for the sum of squares relevant to fluctuations between the k populations sampled. Subtraction then gives the degrees of freedom for the residual, i.e. (N − 1) − (k − 1) = N − k. These degrees of freedom are used as divisors to obtain the mean squares, and the critical value of F is read from the F distribution with k − 1 and N − k degrees of freedom at the chosen significance level.
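
The F-ratio and its degrees of freedom can be worked through by hand or with a few lines of code. The sketch below uses hypothetical counts of medicinal species reported by five respondents in each of three communities; the function name is illustrative, and the critical value is taken from a standard F table for a 5% significance level.

```python
from statistics import mean


def one_way_anova(groups):
    """Return groups MS, error MS, F-ratio and both degrees of freedom."""
    values = [v for g in groups for v in g]
    n_total, k = len(values), len(groups)
    grand_mean = mean(values)
    # Sum of squares between groups and within groups (error).
    ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)
    # Degrees of freedom: k - 1 between groups, N - k for the residual.
    df_groups, df_error = k - 1, n_total - k
    ms_groups, ms_error = ss_groups / df_groups, ss_error / df_error
    return ms_groups, ms_error, ms_groups / ms_error, df_groups, df_error


# Hypothetical number of medicinal species named per respondent.
communities = [
    [12, 15, 11, 14, 13],
    [18, 20, 17, 19, 21],
    [14, 16, 13, 15, 17],
]
ms_groups, ms_error, f_ratio, df_groups, df_error = one_way_anova(communities)

F_CRITICAL = 3.89  # tabulated F for alpha = 0.05 with 2 and 12 degrees of freedom
reject_h0 = f_ratio >= F_CRITICAL
print(round(f_ratio, 2), df_groups, df_error, reject_h0)
```

Here the groups MS is far larger than the error MS, so the null hypothesis of equal community means would be rejected for these made-up data.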

    A typical computer output of a two-factor analysis (A and B) tabulates the source of variation (the main effects A and B and the interaction A × B), the sum of squares (SS), degrees of freedom (df), mean squares (MS), the F statistic and the p value. From this result alone, however, it cannot be determined which groups differ from which other groups. To examine specific pairwise group differences, post hoc testing is used. Bonferroni, Scheffé and Tukey tests are available in most statistical packages to test pairwise differences in multi-way designs. When the number of groups is small, the Bonferroni procedure is recommended. For more groups, the Tukey test yields more sensitive results.
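
A Bonferroni-style post hoc comparison can be sketched as follows. This is a simplified version under stated assumptions: each pair of groups is compared with an ordinary two-sample t-test from SciPy and the p value is multiplied by the number of comparisons, whereas statistical packages typically base the test on the pooled error term of the ANOVA. The data are hypothetical.

```python
from itertools import combinations

from scipy.stats import ttest_ind

# Hypothetical number of medicinal species named per respondent.
groups = {
    "A": [12, 15, 11, 14, 13],
    "B": [18, 20, 17, 19, 21],
    "C": [14, 16, 13, 15, 17],
}

pairs = list(combinations(groups, 2))  # (A, B), (A, C), (B, C)
adjusted = {}
for first, second in pairs:
    # Two-sample t-test assuming equal variances (the SciPy default).
    p_value = ttest_ind(groups[first], groups[second]).pvalue
    # Bonferroni adjustment: multiply by the number of comparisons, cap at 1.
    adjusted[(first, second)] = min(1.0, p_value * len(pairs))

for pair, p_adj in adjusted.items():
    verdict = "differ" if p_adj < 0.05 else "no evidence of a difference"
    print(pair, round(p_adj, 4), verdict)
```

The adjustment becomes very conservative as the number of pairs grows, which is why a Tukey-type test is usually preferred when many groups are compared.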

    Linear and quadratic contrasts

    Contrasts are used to test relationships among means. A contrast is a linear combination of means $\mu_i$ with coefficients $c_i$. Typically, the hypothesis takes the following form:

    $$H_0\colon c_1\mu_1 + c_2\mu_2 + \dots + c_k\mu_k = 0$$

    The test statistic for a contrast is similar to that for a two-sample t-test. The result of the contrast (a certain relation among means) appears in the numerator of the test statistic, and an estimate of within-group variability (the pooled variance estimate or the error term from the ANOVA) is part of the denominator. Specific contrast coefficients can be selected to test, for example, the following:

    • pairwise comparisons, for testing the difference between two particular means;
    • linear combinations of means (e.g. two treatment means vs. a control mean); or
    • linear or quadratic increases or decreases of a certain quality in response to different categories of treatment.
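
The contrast test described above can be sketched for the "two treatments versus a control" case. The data, the function name and the coefficient vector (-2, 1, 1) are illustrative assumptions; the denominator uses the error mean square of a one-way ANOVA as the estimate of within-group variability, and the critical t value is taken from a standard table.

```python
from math import sqrt
from statistics import mean


def contrast_t(groups, coefficients):
    """Return (t, df) for a contrast of group means with given coefficients."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Error mean square: pooled within-group variability.
    ms_error = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n_total - k)
    # Numerator: the estimated value of the contrast.
    estimate = sum(c * mean(g) for c, g in zip(coefficients, groups))
    # Denominator: standard error of the contrast.
    std_error = sqrt(ms_error * sum(c ** 2 / len(g) for c, g in zip(coefficients, groups)))
    return estimate / std_error, n_total - k


control = [12, 15, 11, 14, 13]      # hypothetical yields without treatment
treatment_1 = [18, 20, 17, 19, 21]  # hypothetical yields, first treatment
treatment_2 = [14, 16, 13, 15, 17]  # hypothetical yields, second treatment

# Two treatment means against the control mean; coefficients sum to zero.
coefficients = (-2, 1, 1)
t_value, df = contrast_t([control, treatment_1, treatment_2], coefficients)

T_CRITICAL = 2.179  # tabulated two-sided t for alpha = 0.05 with 12 degrees of freedom
print(round(t_value, 3), df, abs(t_value) >= T_CRITICAL)
```

A pairwise comparison is the special case with coefficients such as (1, -1, 0), and a linear trend across three ordered categories could be tested with (-1, 0, 1).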

    Block and repeated-measures experimental designs

    Suppose we want to compare the alkaloid accumulation patterns in the leaves of a certain plant species under three different soil conditions (factor A) and two light regimes (factor B). Within each of the three levels of factor A, we have, for example, seven individuals (or blocks), with an observation for each individual at each of the two levels of factor B. The total variability is divided into two parts: the variability among blocks and the variability within blocks (due to individual behaviour). Hypothesis testing for factor A takes the following form:

    H0: Mean alkaloid content of leaves is the same for all three soil types.
    HA: Mean alkaloid content of leaves is not the same for all three soil types.

    $$F = \frac{\text{soil types MS}}{\text{blocks within soil types MS}}$$

    For factor B:

    H0: Mean alkaloid content of leaves is the same under the two light regimes.
    HA: Mean alkaloid content of leaves is not the same under the two light regimes.

    $$F = \frac{\text{light regimes MS}}{\text{light regimes} \times \text{blocks within soil types MS}}$$

    For the A × B interaction:

    H0: The effect of light regime on mean alkaloid content is independent of soil type.
    HA: The effect of light regime on mean alkaloid content is not independent of soil type.

    $$F = \frac{\text{light regimes} \times \text{soil types MS}}{\text{light regimes} \times \text{blocks within soil types MS}}$$

    Repeated measurements may be taken at different time intervals to quantify changes over time. In a repeated-measures design, the same variable is measured several times for each subject. A paired-comparison t-test is the simplest form of this design (e.g. before-and-after measures). The following steps are involved in calculating the t statistic manually:

    • For each subject, the difference between the two measures is computed.
    • The average of the differences is calculated.
    • The standard deviation of the differences is calculated.


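    Finally, the test statistic is calculated from this mean and standard deviation as shown below:

    $$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$

    where $\bar{X}_1 - \bar{X}_2$ equals the mean of the differences and $s_{\bar{X}_1 - \bar{X}_2}$ is its standard error, i.e. the standard deviation of the differences divided by the square root of the number of subjects.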
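
The manual steps for the paired-comparison t statistic can be followed in code. The before and after values are hypothetical measurements on six subjects, the variable names are illustrative, and the critical value is taken from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements on the same six subjects.
before = [10, 12, 9, 11, 13, 10]
after = [12, 14, 10, 14, 15, 11]

# 1) difference between the two measures for each subject
differences = [a - b for a, b in zip(after, before)]
# 2) average of the differences
mean_difference = mean(differences)
# 3) standard deviation of the differences (sample version, n - 1 divisor)
sd_difference = stdev(differences)
# 4) t statistic: mean difference divided by its standard error
standard_error = sd_difference / sqrt(len(differences))
t_value = mean_difference / standard_error
df = len(differences) - 1

T_CRITICAL = 2.571  # tabulated two-sided t for alpha = 0.05 with 5 degrees of freedom
print(round(t_value, 3), df, abs(t_value) >= T_CRITICAL)
```

Because each subject serves as its own control, only the variability of the differences enters the denominator, which is what makes the paired design sensitive to small but consistent changes.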

    Changes are tested within subjects and between subjects. Tests of the within-subjects values are called polynomial tests of order 1, 2, ..., k, where k is one less than the number of repeated measures. The first-order polynomial is used to test linear changes (e.g. do the repeated measures increase or decrease around a line with a significant slope?), the second tests whether the responses fall along a quadratic curve, etc.

    Types of models

    1) If the levels of a factor are specifically chosen, the design is a fixed-effects model or Model I ANOVA.
    2) If we are interested in general differences between different categories, and samples are taken truly randomly, then we have a random-effects model or Model II ANOVA.
    3) If we have a factorial design with both fixed and random effects, the model is called a mixed-effects model or Model III ANOVA.

    GENERAL LINEAR MODEL (GLM)

    Specific general linear models (means or effects models) can be defined using the general linear model option available in standard statistical computer packages if the data design is not full factorial, i.e. if the numbers of observations in the categories are unequal or missing cells occur. The analysis is not robust to violations of the assumptions of normal distribution and equal variances. The latter can be tested using, for example, Levene's test. A model with a constant (effects model) and a model without a constant (means model) are given below:

    Y = constant + A + B + C + A × B × C

    or

    Y = A × B × C

    However, models without a constant are notadvised to the mathematically inexperienced. Ina means model, predictors are coded as cellmeans, while in the classic effects model effectsare coded as differences from the grand mean.Box 10 lists some examples of data suitable foranalysis of variance.
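
The difference between the two codings can be illustrated for a single factor with three levels. In this sketch the data are hypothetical, the design matrices are built by hand with NumPy, and the effects model uses sum-to-zero coding (an assumption about how a package would code it); with equal group sizes the constant then equals the grand mean.

```python
import numpy as np

# Hypothetical response for 3 levels of one factor, 5 observations each.
y = np.array([12, 15, 11, 14, 13,
              18, 20, 17, 19, 21,
              14, 16, 13, 15, 17], dtype=float)
level = np.repeat([0, 1, 2], 5)

# Means model (no constant): one indicator column per cell, so the
# fitted coefficients are the cell means.
X_means = np.eye(3)[level]
cell_means = np.linalg.lstsq(X_means, y, rcond=None)[0]

# Effects model (with constant): sum-to-zero coding, so the coefficients
# are differences from the grand mean. The last level takes -1 in both columns.
X_effects = np.column_stack([
    np.ones(len(y)),
    (level == 0).astype(float) - (level == 2).astype(float),
    (level == 1).astype(float) - (level == 2).astype(float),
])
constant, effect_1, effect_2 = np.linalg.lstsq(X_effects, y, rcond=None)[0]
effect_3 = -(effect_1 + effect_2)  # implied by the sum-to-zero constraint

print("cell means:", np.round(cell_means, 3))
print("constant and effects:", np.round([constant, effect_1, effect_2, effect_3], 3))
```

Both codings give identical fitted values; they differ only in how the coefficients are read, which is why hypotheses are stated differently for means and effects models.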

    Box 10. Examples of data suitable for analysis of variance and Kruskal-Wallis Test.

    • Quantifiable effects of herbal medicines as dependent on site, growing season, sampling procedure, etc.
    • Bioassay testing of ethno-medical recipes.
    • Validating and quantifying the described effect of an ethno-medical recipe in different user groups.
    • Assessment of quantifiable ethnobotanical knowledge as dependent on age, gender, ethnicity or other social factors.
    • Storability of grains and grain quality as dependent on the quality of granaries.
    • Life span of beehives as dependent on storage conditions during rainy season.
    • Effect of specific agricultural methods (e.g. soil working methods, burning, mechanical treatment of fruit trees) on yields.
    • Effect of reduced harvesting schemes on the regeneration potential of wild plant populations.
    • Effect of harvesting season for raw material on quality of baskets or other household items.

    Stephen Weru from Gatei, Kenya, collects the sap of Dalbergia lactea Vatke (Fabaceae) used locally to cure hepatitis.

    KRUSKAL-WALLIS TEST

    The extension of the two-sample Mann-Whitney Rank Sum Test to more than two groups is the Kruskal-Wallis Rank Sum Test. The Kruskal-Wallis test is also referred to as analysis of variance by ranks and is applied when data do not meet the conditions listed above for analysis of variance.

    As in parametric analysis of variance, however, it cannot be concluded which groups differ from which other groups. The only inference to be drawn is that at least one difference among the groups exists. The test is called nonparametric because no population parameters are used in the statement of hypotheses, and neither parameters nor sample statistics are used in the test calculations.
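
The rank-based calculation behind the Kruskal-Wallis test can be sketched as follows. The data are hypothetical and deliberately free of ties; the simple ranking used here assumes that no two values are equal, and the critical value is the tabulated chi-square value that is commonly used as an approximation for the H statistic.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic; this sketch assumes there are no tied values."""
    pooled = sorted(v for g in groups for v in g)
    # Rank 1 is given to the smallest value in the pooled sample.
    rank = {value: position for position, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Hypothetical scores for three user groups.
groups = [
    [11, 13, 14, 12, 17],
    [21, 22, 18, 24, 25],
    [16, 15, 19, 20, 23],
]
h_value = kruskal_wallis_h(groups)

CHI2_CRITICAL = 5.991  # tabulated chi-square for alpha = 0.05 with 2 degrees of freedom
print(round(h_value, 2), h_value >= CHI2_CRITICAL)
```

Only the ranks enter the calculation, so the result does not change if the raw values are transformed in any way that preserves their order.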

    Prediction

    As for analysis of variance, two underlying assumptions with respect to the distribution of values must be true for regression analysis:

    1) data must come from an approximately normally distributed population;
    2) variances must be equal.

    MULTIVARIATE REGRESSION ANALYSIS

    The relationship between two variables may be one of functional dependence of one variable on the other. The magnitude of one variable may thus be a function of the magnitude of the second variable, whereas the reverse is not true.

    Regression analysis is a statistical method for predicting values of one or more response (i.e. dependent) variables from a collection of predictor or explanatory (i.e. independent) variable values (Poole, 1974; Zar, 1974). In a simple linear regression analysis, a linear model is developed from which the values of a dependent (i.e. response) variable can be predicted based on particular values of a single independent variable. The population regression model is expressed as:

    $$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$$

    where:

    $\beta_0$ = the true intercept, a constant in the regression model representing the expected or fitted value of Y when X = 0;
    $\beta_1$ = the true slope, representing the amount that Y changes (either positively or negatively) per unit change in X;
    $\varepsilon_i$ = the random error or residual in Y for observation i.

    Since the entire population cannot be measured, it is not possible to compute the parameters $\beta_0$ and $\beta_1$ and obtain the population regression model. Therefore, the approximations $b_0$ (for $\beta_0$) and $b_1$ (for $\beta_1$) are generally estimated from the sample using the method of least squares. With this method, the statistics $b_0$ and $b_1$ are computed from the sample in such a manner that the best possible fit within the constraints of the least squares model is achieved. Thus, the following sample regression equation is obtained, in which the residual does not figure:

    $$\hat{Y}_i = b_0 + b_1 X_{1i}$$
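
The least squares estimates for the simple linear case can be computed directly from the sums of squares and cross-products. The five pairs of values below are hypothetical (for instance stem diameter and some measured yield), and the function name is illustrative.

```python
from statistics import mean


def least_squares_line(x, y):
    """Return (b0, b1): intercept and slope of the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample: independent variable x and dependent variable y.
x = [2, 4, 6, 8, 10]
y = [1.1, 1.9, 3.2, 3.8, 5.0]

b0, b1 = least_squares_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(round(b0, 3), round(b1, 3))
```

The residuals of a least squares line with an intercept sum to zero, which is a convenient check on a hand calculation.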

    Multivariate models from samples can be considered extensions of the univariate model described above. In multiple regression, at least two independent variables (X1, X2) are used to predict the value of a dependent variable (Y). As in the case of the simple linear regression model, when sample data are analysed, the sample regression coefficients (b0, b1 and b2) are used as estimates of