MTH410 S14 Lecture 01 May 12 -Mo-Wed

1 MTH410 Probability and Statistics Spring 2014 Nursel S. Ruzgar Mathematics Department [email protected] 416-979 5000/ext. 3173 MTH410 S14- Lecture 1


Statistics Lecture Notes


  • MTH410

    Probability and Statistics

    Spring 2014

    Nursel S. Ruzgar

    Mathematics Department

    [email protected]

    416-979 5000/ext. 3173

    MTH410 S14- Lecture 1

  • MTH410 S14- Lecture 1 2/153

    Discussion of Syllabus

    Required Text: Solved problems in Statistics, Part I- P. Ghargbouri, B. Todorow

    Exercises in Statistics, Part I- P. Gharghbouri, B. Todorow

    Meets

    Mondays: 2:00-5:00pm-KHE221,

    Wednesdays: 2:00-5:00pm- KHE221

    Office Hours: Tuesdays: 5pm-5:45pm-VIC707

    Labs: Section 1: Fridays: 10:00-12:00pm-ENGLG12

    Section 2: Fridays: 13:00-15:00pm-ENGLG12

    Section 3: Fridays: 16:00-18:00pm-ENGLG12

    Section 4: Wednesdays: 11:00-13:00pm-ENG102

  • MTH410 S14- Lecture 1 3/153

    Discussion of Syllabus (contd)

    Course Web Site: Blackboard

    Labs and Quizzes

    Labs will start in the first week, May 12.

    There will be a quiz each week, except the

    first week.

  • MTH410 S14- Lecture 1 4/153

    Discussion of Syllabus (contd)

    Academic Dishonesty (Strongly

    discouraged)

    Refer to the senate policy

    Tentative Course Outline

  • MTH410 S14- Lecture 1 5/153

    Course Objectives

    Identify and formulate problems where statistics can have an impact.

    See the relevance of statistics. Apply what has been learned to other engineering courses and to career practice.

    Understand the basics of Statistics and Probability Theory

    Interpret the statistical results and retrieve necessary information to help decision making

    Develop the bases for the other courses.

  • MTH410 S14- Lecture 1 6/153

    Evaluation

    30% Midterm Test (100 minutes) 10:00am,

    Saturday, June 14, 2014

    60% Final exam (180 minutes), room: TBA

    10% Lab quizzes

  • MTH410 S14- Lecture 1 7/153

    OUTLINE Lecture 1

    Statistics-Descriptive and Inferential Statistics

    Populations, Parameters, and Samples, Statistic, Variable

    Data & Types of Data Cross-Sectional vs Time Series Data

    Interval, Nominal Data, Ordinal

    Graphical descriptive techniques for each type of data Histograms, Pie and Bar Charts

    Scatter Diagrams, Contingency Table

    Line Chart

  • MTH410 S14- Lecture 1 8/153

    In today's world

    we are constantly surrounded by statistics and statistical information. For example:

    Political Polls, Customer Surveys

    Interest rates, Economic Predictions

    Course Marks, Job Market Information

    How can we make sense out of all these data?

    How can we differentiate valid from flawed claims?

    What is Statistics?!

  • MTH410 S14- Lecture 1 9/153

    What is Statistics? (Contd)

    Statistics is a way to get information from data

    Statistics

    Data

    Data: Facts, especially

    numerical facts, collected

    together for reference or

    information.

    Information

    Information: Knowledge

    communicated concerning

    some particular fact.

    Definitions: Oxford English Dictionary

    Statistics is a tool for creating new understanding from a set of numbers.

  • MTH410 S14- Lecture 1 10/153

    Example

    A student is somewhat apprehensive about the statistics course because the student believes the myth that the course is difficult. The professor provides last term's marks to the student. What information can the student obtain from this list?

    [Diagram: Data → Statistics → Information]

    Data

    List of last term's marks.

    95

    89

    70

    65

    78

    57

    :

    Information

    New information about

    the statistics class.

    E.g. Median of all marks,

    Typical mark, i.e. average,

    Mark distribution, etc.
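As a rough illustration of turning data into information, the sketch below summarizes only the six marks visible on the slide (the rest of the list is elided there, so these numbers describe the visible marks, not the whole class). The variable names are illustrative.

```python
from statistics import mean, median

# The six marks shown on the slide; the full list is elided in the source.
marks = [95, 89, 70, 65, 78, 57]

average_mark = mean(marks)    # the "typical" mark
median_mark = median(marks)   # middle value once the marks are sorted

print(f"average = {average_mark:.1f}, median = {median_mark}")
```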

  • MTH410 S14- Lecture 1 11/153

    Population, Parameter, Sample, Statistic, Variable

    Population: The group of all items of interest to the statistics practitioner. All the members of Ryerson University.

    Parameter: A descriptive measure of a population. Mean number of soft drinks sold at Ryerson every week.

    Sample: A set of items drawn from the population. 500 students surveyed.

    Statistic: A descriptive measure of a sample. Average number of soft drinks these students buy per week.

    Variable: A characteristic of population or sample that is of interest for us. Number of soft drinks a student buys every week.

  • MTH410 S14- Lecture 1 12/153

    Key Statistical Concepts

    Sample

    A sample is a set of data drawn from the population.

    Potentially very large, but smaller than the population.

    E.g. a sample of 765 voters surveyed in exit polls on election day.

    Population

    A population is the entire set of all items under study.

    Frequently very large, sometimes infinite.

    E.g. All 5 million Florida voters

  • MTH410 S14- Lecture 1 13/153

    Key Statistical Concepts (Contd)

    Statistic

    A descriptive measure of a sample.

    E.g. The proportion of the sample of 765 Floridians who voted

    for Obama.

    Parameter

    A descriptive measure of a population.

    In most applications of inferential statistics, the parameter

    represents the information we need.

    E.g. The proportion of the 5 million Florida voters who voted

    for Obama.

  • MTH410 S14- Lecture 1 14/153

    Key Statistical Concepts (Contd)

    Parameter

    Populations have

    Parameters

    Population

    Sample

    Subset

    Statistic

    Samples have

    Statistics

    Inference

  • MTH410 S14- Lecture 1 15/153

    Types of Statistics

    Descriptive statistics: involves the

    arrangement, summary, and presentation of

    data, to enable meaningful interpretation, and

    to support decision making.

    Inferential Statistics: a set of methods used

    to draw conclusions about characteristics of a

    population based on sample data.

  • MTH410 S14- Lecture 1 16/153

    Descriptive Statistics

    The actual method used depends on what information we

    would like to extract. Are we interested in:

    measure(s) of central location? and/or

    measure(s) of variability (dispersion)?

    Descriptive Statistics is a set of methods of organizing, summarizing, and presenting data in a convenient and informative way. These methods include:

    Graphical Techniques

    Numerical Techniques

  • MTH410 S14- Lecture 1 17/153

    Inferential Statistics

    Descriptive statistics describes the data set that's being analyzed, but it doesn't allow us to draw any conclusions or make any inferences about the population the data came from. Hence we need another branch of statistics: inferential statistics.

    Inferential statistics is also a set of methods, but it is used

    to draw conclusions or inferences about characteristics of

    populations based on data from a sample.

  • MTH410 S14- Lecture 1 18/153

    Statistical Inference

    Statistical inference is the process of making an estimate,

    prediction, or decision about a population based on a sample.

    Parameter

    Population

    Sample

    Statistic

    Inference

    What can we infer about a Population's Parameters based on a Sample's Statistics?

  • MTH410 S14- Lecture 1 19/153

    Statistical Inference(Contd)

    We use statistics to make inferences about parameters.

    Therefore, we can make an estimate, prediction, or

    decision about a population based on sample data.

    Then, we can apply what we know about a sample to the

    larger population from which the sample was drawn!

    What is the purpose or/and which kind of benefits

  • MTH410 S14- Lecture 1 20/153

    Statistical Inference (Contd)

    Rationale:

    Large populations make investigating each member

    impractical, extremely expensive and time-consuming.

    Easier and cheaper to take a sample and make estimates

    about the population from the sample.

    However:

    Such conclusions and estimates are not always going to be

    correct.

    Hence, we have to build into the statistical inference

    measures of reliability, namely confidence level and

    significance level.

  • MTH410 S14- Lecture 1 21/153

    Confidence & Significance Levels

    The confidence level is the proportion of times that an

    estimating procedure will be correct, if the sampling

    procedure were repeated a very large number of times.

    E.g. a confidence level of 95% means that, estimates based on

    this form of statistical inference will be correct 95% of the time.

    When the purpose of the statistical inference is to draw a

    conclusion about a population, the significance level

    measures how frequently the conclusion will be wrong in

    the long run.

    E.g. a 5% significance level means that, in repeated samples,

    this type of conclusion will be wrong 5% of the time.

  • MTH410 S14- Lecture 1 22/153

    Confidence & Significance Levels

    (Contd)

    So if we use α (Greek letter alpha) to represent the significance level (how frequently the conclusion will be wrong), then our confidence level is 1 − α.

    This relationship can also be stated as:

    Confidence Level + Significance Level = 1

  • MTH410 S14- Lecture 1 23/153

    Confidence & Significance Levels (Contd)

    Consider a statement from polling data you may hear about in the news these days:

    "This poll is considered accurate within 3.4 percentage points, 19 times out of 20."

    In this case, the confidence level is 95% (19/20 = 0.95), and the significance level is 5%.
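The arithmetic behind the poll statement can be checked in a couple of lines. This is only a sketch of the relationship Confidence Level + Significance Level = 1; the variable names are illustrative.

```python
# "19 times out of 20" expressed as a confidence level.
confidence_level = 19 / 20

# The significance level (alpha) is the complement of the confidence level.
significance_level = 1 - confidence_level

print(f"confidence = {confidence_level:.2f}, significance = {significance_level:.2f}")
```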

  • MTH410 S14- Lecture 1 24/153

    Graphical Descriptive

    Techniques

  • MTH410 S14- Lecture 1 25/153

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for Nominal

    Data

    Graphical Techniques for Interval Data

    Describing Time-Series Data

    Describing the Relationship Between Two

    Variables

  • MTH410 S14- Lecture 1 26/153

    Definitions

    A variable is some characteristic of a population or sample.

    Typically denoted with a capital letter: X, Y, Z

    E.g. student marks. Not all students achieve the same mark. The marks vary from student to student, hence the name variable.

    Values of a variable are all possible observations of the variable.

    E.g. student marks: all integers between 0 and 100.

    Data are the observed values of a variable.

    E.g. marks of 6 students in an exam: {67, 74, 71, 83, 93, 48}

  • MTH410 S14- Lecture 1 27/153

    Types of data analysis

    Knowing the type of data is necessary to properly

    select the technique to be used when analyzing data.

    Type of analysis allowed for each type of data

    Interval data - arithmetic calculations

    Nominal data - counting the number of observations in each category

    Ordinal data - computations based on an ordering process

  • MTH410 S14- Lecture 1 28/153

    Types of data and information

    Variable - a characteristic of population or

    sample that is of interest for us.

    Number of soft drinks a student buys every week

    The waiting time for medical services

    The score of a student in the Stats Exam.

    Data - the actual values of variables

    Interval data are numerical observations

    Nominal data are categorical observations

    Ordinal data are ordered categorical observations

  • MTH410 S14- Lecture 1 29/153

    Types of Data & Information

    Data (at least for purposes of Statistics)

    fall into three main groups:

    Interval Data

    Nominal Data

    Ordinal Data

  • MTH410 S14- Lecture 1 30/153

    Interval Data

    Real numbers, e.g. weights, prices, distances, etc.

    Also called quantitative or numerical.

    Arithmetic operations can be performed on interval data, so it's meaningful to talk about 2*Weight, or Price + $1.5, and so on.

  • MTH410 S14- Lecture 1 31/153

    Nominal Data

    The values of nominal data are categories.

    E.g. responses to questions about marital status, coded as:

    Single = 1, Married = 2, Divorced = 3, Widowed = 4

    Because the numbers are arbitrary, arithmetic operations don't make any sense (e.g. does Widowed ÷ 2 = Married?!)

    Any other numbering system is also valid provided that each category has a different number assigned to it.

    E.g. Another coding system as valid as the previous one:

    Single = 7, Married = 4, Divorced = 13, Widowed = 1

    Nominal data are also called qualitative or categorical.

  • MTH410 S14- Lecture 1 32/153

    Ordinal Data

    Ordinal Data appear to be nominal, but their values have an

    order, a ranking to them.

    E.g. The most active stocks traded on the NASDAQ in

    descending order

    MSFT = 1, CSCO = 2, Dell = 3, SunW = 4

    Any other numbering system is valid provided the order is

    maintained.

    E.g. Another coding system as valid as the previous one:

    MSFT = 6, CSCO = 11, Dell = 23, SunW = 45

    It is still not meaningful to do arithmetic operations on this kind of data (e.g.

    does 2*MSFT = CSCO?!).

    We can say something like the number of stocks traded from:

    Microsoft > Cisco or Sun Microsystems < Dell

  • MTH410 S14- Lecture 1 33/153

    Nominal data vs. Ordinal data

    The critical difference between nominal data and ordinal

    data is that the values of the latter are in order.

    E.g. It is valid for nominal data to have:

    Single = 7, Married = 4, Divorced = 13, Widowed = 1

    However, it won't be valid for ordinal data:

    MSFT = 7, CSCO = 4, Dell = 13, SunW = 1

    (The order changed to be: SunW, CSCO, MSFT, Dell. We must keep the order of MSFT, CSCO, Dell, SunW.)
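The rule that an ordinal coding is valid only if it keeps the ranking can be expressed as a small check. This is a sketch; the helper name `preserves_order` is hypothetical, and the codings are the ones quoted on the slides.

```python
def preserves_order(ranked_items, coding):
    """Return True if the codes strictly increase along the ranked list."""
    codes = [coding[item] for item in ranked_items]
    return all(a < b for a, b in zip(codes, codes[1:]))

# Ranking from the slides: most active stock first.
ranking = ["MSFT", "CSCO", "Dell", "SunW"]

original = {"MSFT": 1, "CSCO": 2, "Dell": 3, "SunW": 4}
recoded = {"MSFT": 6, "CSCO": 11, "Dell": 23, "SunW": 45}
scrambled = {"MSFT": 7, "CSCO": 4, "Dell": 13, "SunW": 1}

for name, coding in [("original", original), ("recoded", recoded), ("scrambled", scrambled)]:
    print(name, preserves_order(ranking, coding))
```

For nominal data no such check is needed: any coding with distinct numbers per category is acceptable.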

  • MTH410 S14- Lecture 1 34/153

    Interval data vs. Ordinal data

    The critical difference between interval data and ordinal data is that the intervals or differences between values of interval data are consistent and meaningful.

    E.g. The difference between marks of 85 and 80 is the same five-mark difference as that between 75 and 70.

    However for coding system like:

    MSFT = 1, CSCO = 2, Dell = 3, SunW = 4

    We can see CSCO − MSFT = 1, and SunW − Dell = 1

    But we can't conclude that the difference between the number of stocks traded in Microsoft and Cisco Systems is the same as the difference in the number of stocks traded between Dell Computer and Sun Microsystems.

  • MTH410 S14- Lecture 1 35/153

    Types of Data & Information (Contd)

    Data → Categorical?
        N → Interval Data
        Y → Categorical Data → Ranked?
            Y → Ordinal Data
            N → Nominal Data

    Knowing the type of data is necessary to properly select

    the technique to be used when analyzing data.

  • MTH410 S14- Lecture 1 36/153

    Calculations for Types of Data

    All calculations are permitted on interval data.

    Only calculations involving a ranking process are allowed for ordinal data.

    No calculations are allowed for nominal data, except counting the number of observations in each category.

    This leads to the following hierarchy of data

  • MTH410 S14- Lecture 1 37/153

    Hierarchy of Data

    Higher level may be treated as lower level(s)

    Interval

    Values are real numbers.

    All calculations are valid.

    Data may be treated as ordinal or nominal.

    Ordinal

    Values must represent the ranked order of the data.

    Calculations based on an ordering process are valid.

    Data may be treated as nominal but not as interval.

    Nominal

    Values are the arbitrary numbers that represent categories.

    Only calculations based on the frequencies of occurrence are valid.

    Data may not be treated as ordinal or interval.

  • MTH410 S14- Lecture 1 38/153

    E.g. Representing Student Grades

    Data → Categorical?
        N → Interval Data, e.g. integers in {0..100}
        Y → Categorical Data → Ranked?
            Y → Ordinal Data (ranked order to data), e.g. {F, D, C, B, A}
            N → Nominal Data (no ranked order to data), e.g. {Pass | Fail}

  • MTH410 S14- Lecture 1 39/153

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for

    Nominal Data

  • MTH410 S14- Lecture 1 40/153

    Graphical & Tabular Techniques for Nominal Data

    The only allowable calculation on nominal data is to

    count the frequency of each value of the variable.

    We can summarize the data in a table that presents the

    categories and their counts called a frequency

    distribution.

    A relative frequency distribution lists the categories

    and the proportion with which each occurs.

  • MTH410 S14- Lecture 1 41/153

    Relative frequency

    It is often preferable to show the relative frequency (proportion) of observations falling into each class, rather than the frequency itself.

    Relative frequencies should be used when

    the population relative frequencies are studied

    two or more histograms are compared

    the numbers of observations in the samples studied are different

    Class relative frequency = Class frequency ÷ Total number of observations

  • MTH410 S14- Lecture 1 42/153

    Class width

    It is generally best to use equal class widths, but sometimes unequal class widths are called for.

    Unequal class widths are used when the frequency associated with some classes is too low. Then, several classes are combined to form a wider and more populated class.

    It is possible to form an open-ended class at the higher end or lower end of the histogram.

  • MTH410 S14- Lecture 1 43/153

    Example: Light Beer Preference Survey

    In 2006 total light beer sales in the United States were approximately 3 million gallons.

    With this large a market breweries often need to know more

    about who is buying their product.

    The marketing manager of a major brewery wanted to

    analyze the light beer sales among college and university

    students who do drink light beer.

    A random sample of 285 graduating students was asked to

    report which of the following is their favorite light beer.

  • MTH410 S14- Lecture 1 44/153

    Example

    1. Budweiser Light

    2. Busch Light

    3. Coors Light

    4. Michelob Light

    5. Miller Lite

    6. Natural Light

    7. Other brand

    The responses were recorded using the codes. Construct a

    frequency and relative frequency distribution for these data

    and graphically summarize the data by producing a bar

    chart and a pie chart.

  • MTH410 S14- Lecture 1 45/153

    Example

    1 1 1 1 2 4 3 5 1 3 1 3 7 5 1

    1 5 2 1 5 1 3 3 3 1 1 5 3 1 5

    5 1 1 3 3 5 5 6 3 5 3 5 5 5 1

    1 2 1 1 5 5 3 2 1 6 1 1 4 5 1

    3 3 5 4 7 6 6 4 4 6 5 2 1 1 5

    3 3 1 3 5 3 3 7 3 7 2 1 5 7

    3 6 2 6 3 6 6 6 5 6 1 1 6 3

    7 1 1 1 5 1 3 1 3 7 7 2 1 1

    2 5 3 1 1 3 1 1 7 5 3 2 1 1

    6 5 7 1 3 2 1 3 1 1 7 5 5 6

    1 4 6 1 3 1 1 5 5 5 5 1 5 5

    6 1 3 3 1 3 7 1 1 1 2 4 1 1

    3 3 7 5 5 1 1 3 5 1 5 4 5 3

    4 1 4 5 3 1 5 3 3 3 1 1 5 3

    5 6 4 3 5 6 4 6 5 5 5 5 3 1

    2 3 2 7 5 1 6 6 2 3 3 3 1 1

    5 1 4 6 3 5 1 1 2 1 5 6 1 1

    5 1 3 5 1 1 1 3 7 3 1 6 3 1

    2 2 5 1 3 5 5 2 3 1 1 3 6 1

    1 1 1 7 3 1 5 3 3 3 5 3 1 7

  • MTH410 S14- Lecture 1 46/153

    Frequency and Relative Frequency Distributions

    Light Beer Brand    Frequency    Relative Frequency (%)
    Budweiser Light     90           31.6
    Busch Light         19           6.7
    Coors Light         62           21.8
    Michelob Light      13           4.6
    Miller Lite         59           20.7
    Natural Light       25           8.8
    Other brands        17           6.0
    Total               285          100
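The relative frequency column can be reproduced from the frequency column with the formula class relative frequency = class frequency ÷ total number of observations. The sketch below starts from the tabulated counts rather than the raw coded responses; the variable names are illustrative.

```python
# Frequencies from the light beer table, in brand-code order 1..7.
brands = [
    "Budweiser Light", "Busch Light", "Coors Light", "Michelob Light",
    "Miller Lite", "Natural Light", "Other brands",
]
frequencies = [90, 19, 62, 13, 59, 25, 17]

total = sum(frequencies)
# Relative frequency of each class, expressed as a percentage.
relative_percent = [100 * f / total for f in frequencies]

for brand, freq, pct in zip(brands, frequencies, relative_percent):
    print(f"{brand:<16} {freq:>4} {pct:>6.1f}%")
print(f"{'Total':<16} {total:>4}")
```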

  • MTH410 S14- Lecture 1 47/153

    Nominal Data (Frequency)

    Bar Charts are often used to display frequencies

    [Bar chart: frequency (vertical axis, 0 to 100) for each brand code 1 to 7 (horizontal axis); bar heights 90, 19, 62, 13, 59, 25, 17]

  • MTH410 S14- Lecture 1 48/153

    Nominal Data (Relative Frequency)

    Pie Charts show relative frequencies

    [Pie chart: slices by brand code: 1: 31%, 2: 7%, 3: 22%, 4: 4%, 5: 21%, 6: 9%, 7: 6%]

  • MTH410 S14- Lecture 1 49/153

    Nominal Data

    It's all the same information (based on the same data), just a different presentation.

    [Figure: the frequency and relative frequency table, the bar chart, and the pie chart from the previous slides shown together]

  • MTH410 S14- Lecture 1 50/153

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for

    Nominal Data

    Graphical Techniques for Interval Data

  • MTH410 S14- Lecture 1 51/153

    Graphical Techniques for Interval Data

    There are several graphical methods that are used when the data are interval.

    The most important of these graphical methods is the histogram, which is created by drawing rectangles

    whose bases are the intervals and whose heights are the

    frequencies.

    The histogram is not only a powerful graphical technique used to summarize interval data, but also used

    to help explain probabilities.

  • MTH410 S14- Lecture 1 52/153

    Building a Histogram

    Example: The marketing manager of a long-distance telephone company conducted a survey of 200 new customers wherein the first month's bills are recorded. What information can be extracted from those data?

    This manager was only able to find that the smallest bill is $0, the largest bill is $119.63, and most of the bills are less than $100.

    However, much more interesting information may be available:

    Bill distribution,

    Are there many small bills and few large bills?

    What is the typical bill?

    Are the bills somewhat similar or different?

  • MTH410 S14- Lecture 1 53/153

    Building a Histogram (Contd)

    1) Collect the Data

    2) Create a frequency distribution for the data

    a) Determine the number of classes to use

    Refer to Table.

    With 200 observations, we should have between 7 & 10 classes; 9 seems the best. For our purpose, let us pick 8.

    Alternatively, we could use Sturges' formula:

    Number of class intervals = 1 + 3.3 log(n) (base-10 logarithm), where n is the number of observations; with n = 200 we get about 9.
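Sturges' formula is easy to evaluate directly. The sketch below assumes the logarithm is base 10, which is the reading that gives about 9 classes for 200 observations; the function name is illustrative.

```python
import math

def sturges_classes(n):
    """Suggested number of class intervals: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

n_bills = 200
suggested = sturges_classes(n_bills)

print(f"Sturges suggests {suggested:.2f} classes, i.e. about {round(suggested)}")
```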

  • MTH410 S14- Lecture 1 54/153

    Building a Histogram(Contd)

    1) Collect the Data 2) Create a frequency distribution for the data

    a) Determine the number of classes to use. [8]

    b) Determine how large to make each class

    Look at the range of the data, that is,

    Range = Largest Observation − Smallest Observation

    Range = $119.63 − $0 = $119.63

    Then each class width becomes:

    Range ÷ (# of classes) = 119.63 ÷ 8 ≈ 15

    FYI: if we pick 9 classes, the width would be about 13.3
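The class-width arithmetic can be reproduced as follows. Rounding the width up to a convenient whole number is an assumption made here so that the classes cover the full range; the variable names are illustrative.

```python
import math

largest, smallest = 119.63, 0.0
data_range = largest - smallest

width_8 = data_range / 8   # exact width for 8 classes
width_9 = data_range / 9   # exact width for 9 classes

# Round up so that 8 classes of this width cover the whole range.
chosen_width = math.ceil(width_8)

print(f"8 classes: {width_8:.2f} -> use {chosen_width}; 9 classes: {width_9:.1f}")
```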

  • MTH410 S14- Lecture 1 55/153

    Building a Histogram(Contd)

    1) Collect the Data 2) Create a frequency distribution for the data

    a) Determine the number of classes to use. [8] b) Determine how large to make each class. [15]

    c) Place the data into each class

    each item can only belong to one class;

    each class contains observations greater than its

    lower limit and less than or equal to its upper limit.

    That means there is no overlap between any two classes.
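A sketch of the placement rule, where each class holds values greater than its lower limit and less than or equal to its upper limit. The function name is hypothetical, and sending a $0 bill to the first class is an assumption, since the rule as stated leaves a value equal to the lowest limit unassigned.

```python
import math

def class_index(bill, width=15.0):
    """Index of the class (lower, upper] that holds `bill`; class 0 is (0, 15]."""
    if bill <= 0:
        return 0  # assumption: a $0 bill is counted in the first class
    return math.ceil(bill / width) - 1

for bill in [0.0, 15.0, 15.01, 42.19, 119.63]:
    idx = class_index(bill)
    print(f"${bill:>6.2f} -> class {idx}: ({idx * 15}, {(idx + 1) * 15}]")
```

Because every bill maps to exactly one index, no observation can fall into two classes.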

  • MTH410 S14- Lecture 1 56/153

    Building a Histogram(Contd)

    3) Draw the Histogram

    1) Collect the Data

    2) Create a frequency

    distribution for the data.

  • MTH410 S14- Lecture 1 57/153

    Building a Histogram(Contd)

    1) Collect the Data 2) Create a frequency distribution for the data. 3) Draw the Histogram

  • MTH410 S14- Lecture 1 58/153

    Example : Interpret

    [Histogram: Frequency (vertical axis, 0 to 80) against Bills (horizontal axis, classes of width 15 from 0 to 120)]

    About half of all the bills are small: 71+37=108

    A few bills are in the middle range: 13+9+10=32

    A relatively large number of bills are large: 18+28+14=60

  • MTH410 S14- Lecture 1 59/153

    FYI: The difference from a bar chart

    Bar chart: All data are nominal. Each bin is one category. There is a gap between two neighboring bins.

    Histogram: All data are interval. Each bin is an interval of values. There is no gap between two neighboring bins.

  • MTH410 S14- Lecture 1 60/153

    There are four typical shape characteristics

    Shapes of histograms

  • MTH410 S14- Lecture 1 61/153

    Shapes of Histograms

    Symmetry

    A histogram is said to be symmetric if, when we

    draw a vertical line down the center of the histogram,

    the two sides are identical in shape and size:

    [Figure: three symmetric histograms, each plotting Frequency against Variable]

  • MTH410 S14- Lecture 1 62/153

    Shapes of Histograms(Contd)

    Skewness

    A skewed histogram is one with a long tail extending to

    either the right or the left:

    [Figure: two skewed histograms, labelled Negatively skewed and Positively skewed]

  • MTH410 S14- Lecture 1 63/153

    Shapes of Histograms(Contd)

    Modality

    A unimodal histogram is one with a single peak,

    while a bimodal histogram is one with two peaks:

    [Figure: two histograms plotting Frequency against Variable, one labelled Unimodal and one labelled Bimodal]

    A modal class is the class with the largest number of observations

  • MTH410 S14- Lecture 1 64/153

    A modal class is the one with the largest number of observations.

    A unimodal histogram

    The modal class

    Modal classes

  • MTH410 S14- Lecture 1 65/153

    Modal classes

    A bimodal histogram

    A modal class A modal class

  • MTH410 S14- Lecture 1 66/153

    Bell Shaped Histograms

    A special type of symmetric unimodal histogram

    is Bell Shaped:

    [Figure: a bell shaped histogram plotting Frequency against Variable]

    Many statistical techniques require that the population be bell shaped.

  • MTH410 S14- Lecture 1 67/153

    Many statistical techniques require that the population be bell shaped.

    Drawing the histogram helps verify the shape of the population in question

    Bell shaped histograms

  • MTH410 S14- Lecture 1 68/153

    Stem & Leaf Display

    Retains information about individual observations that would normally be lost in the creation of a histogram.

    Split each observation into two parts, a stem and a leaf:

    e.g. Observation value: 42.19

    There are several ways to split it up

    We could split it at the decimal point: Stem 42, Leaf 19

    Or split it at the tens position (while rounding to the nearest integer in the ones position): Stem 4, Leaf 2
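Both ways of splitting the observation 42.19 can be sketched in a few lines. The variable names are illustrative, and formatting to two decimals is an assumption that fits dollar amounts.

```python
observation = 42.19

# Split at the decimal point: the stem is the whole part, the leaf the decimals.
whole_part, decimal_part = f"{observation:.2f}".split(".")

# Split at the tens position after rounding to the nearest integer.
stem, leaf = divmod(round(observation), 10)

print(f"decimal split: stem {whole_part}, leaf {decimal_part}")
print(f"tens split:    stem {stem}, leaf {leaf}")
```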

  • MTH410 S14- Lecture 1 69/153

    Stem & Leaf Display

    Continue this process for all the observations.

    Then, use the stems for the classes and each leaf becomes part of the histogram as

    follows

    Stem  Leaf
    0     0000000000111112222223333345555556666666778888999999
    1     000001111233333334455555667889999
    2     0000111112344666778999
    3     001335589
    4     124445589
    5     33566
    6     3458
    7     022224556789
    8     334457889999
    9     00112222233344555999
    10    001344446699
    11    124557889

    Thus, we still have access to our original data points' values!

  • MTH410 S14- Lecture 1 70/153

    Histogram and Stem & Leaf

  • MTH410 S14- Lecture 1 71/153

    Ogive

    An ogive (pronounced "oh-jive") is a graph of a cumulative relative frequency distribution.

    We create an ogive in three steps

    First, from the frequency distribution created earlier, calculate relative frequencies

  • MTH410 S14- Lecture 1 72/153

    Relative Frequencies

    For example, we had 71 observations in the first class (telephone bills from $0.00 to $15.00). Hence, the relative frequency for this class is 71 ÷ 200 (the total # of phone bills) = 0.355 (or 35.5%)

  • MTH410 S14- Lecture 1 73/153

    Ogive (Contd)

    An ogive is a graph of a cumulative relative frequency distribution.

    We create an ogive in three steps

    1) Calculate relative frequencies.

    2) Calculate cumulative relative frequencies by adding the current class's relative frequency to the previous class's cumulative relative frequency. (For the first class, its cumulative relative frequency is just its relative frequency.)

  • MTH410 S14- Lecture 1 74/153

    Cumulative Relative

    Frequencies

    first class, just itself

    next class: .355+.185=.540

    last class: .930+.070=1.00

    Always or by chance?
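The cumulative relative frequencies can be rebuilt from the class frequencies. The eight frequencies below are taken from the sums annotated on the histogram slide (71+37, 13+9+10, 18+28+14), assumed here to be listed in class order; the variable names are illustrative. Summing counts first and dividing once avoids floating-point drift.

```python
from itertools import accumulate

# Class frequencies for the classes (0,15], (15,30], ..., (105,120].
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
total = sum(frequencies)

# Running totals of the counts, then one division per class.
cumulative_counts = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative_counts]

upper_limits = [15 * (i + 1) for i in range(len(frequencies))]
for upper, cum in zip(upper_limits, cumulative_relative):
    print(f"bills up to ${upper:>3}: {cum:.3f}")
```

The last value is 1.00 by construction, not by chance: the final running total is always the total number of observations.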

  • MTH410 S14- Lecture 1 75/153

    Ogive (Contd)

    An ogive is a graph of a cumulative relative frequency distribution.

    1) Calculate relative frequencies.

    2) Calculate cumulative relative frequencies.

    3) Graph the cumulative relative frequencies.

  • MTH410 S14- Lecture 1 76/153

    Ogive(Contd)

    The ogive can be

    used to answer

    questions like:

    What telephone bill

    value is at the 50th

    percentile?

    around $35

  • MTH410 S14- Lecture 1 77/153

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for

    Nominal Data

    Graphical Techniques for Interval

    Data

    Describing Time-Series Data

  • MTH410 S14- Lecture 1 78/153

    Describing Time Series Data

    Observations measured at the same point in time are called cross-sectional data.

    Observations measured at successive points in time are called time-series data.

    Time-series data are often graphed on a line chart, which plots the value of the variable on the vertical axis against the time periods on the horizontal axis.

  • MTH410 S14- Lecture 1 79/153

    Example

    We recorded the monthly average retail

    price of gasoline since 1978.

    Draw a line chart to describe these data

    and briefly describe the results.

  • MTH410 S14- Lecture 1 80/153

    Example

    [Line chart: monthly average retail price of gasoline (vertical axis, 0 to 3.5) against month number (horizontal axis, 1 to 337)]

  • MTH410 S14- Lecture 1 81/153

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for

    Nominal Data

    Graphical Techniques for Interval Data

    Describing Time-Series Data

    Describing the Relationship Between Two

    Variables

    Two Nominal Variables

  • MTH410 S14- Lecture 1 82/153

    Relationship between Two Nominal Variables

    So far we've looked at tabular and graphical techniques for one variable (either nominal or interval data).

    A cross-classification table (or cross-tabulation

    table) is used to describe the relationship between

    two nominal variables.

    A cross-classification table lists the frequency of

    each combination of the values of the two

    variables

  • MTH410 S14- Lecture 1 83/153

    Example

    In a major North American city there are four competing

    newspapers: the Post, Globe and Mail, Sun, and Star.

    To help design advertising campaigns, the advertising

    managers of the newspapers need to know which segments of

    the newspaper market are reading their papers.

    A survey was conducted to analyze the relationship between

    newspapers read and occupation.

    A sample of newspaper readers was asked to report which

    newspaper they read: Globe and Mail (1) Post (2), Star (3),

    Sun (4), and to indicate whether they were blue-collar worker

    (1), white-collar worker (2), or professional (3).

  • MTH410 S14- Lecture 1 84/153

    Example

    By counting the number of times each of the 12 combinations occurs, we produced the following table.

                 Occupation
    Newspaper    Blue Collar    White Collar    Professional    Total
    G&M          27             29              33              89
    Post         18             43              51              112
    Star         38             21              22              81
    Sun          37             15              20              72
    Total        120            108             126             354

  • MTH410 S14- Lecture 1 85/1532014/5/8

    Example

    If occupation and newspaper are related, then there will be differences in the newspapers read among the occupations. An easy way to see this is to convert the frequencies in each column to relative frequencies in each column. That is, compute the column totals and divide each frequency by its column total.

                              Occupation
    Newspaper   Blue Collar     White Collar    Professional
    G&M         27/120 = .23    29/108 = .27    33/126 = .26
    Post        18/120 = .15    43/108 = .40    51/126 = .40
    Star        38/120 = .32    21/108 = .19    22/126 = .17
    Sun         37/120 = .31    15/108 = .14    20/126 = .16
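    The column-by-column conversion described above can be sketched in a few lines of Python. This is only an illustration: the variable names (counts, column_totals, relative) are made up for the sketch, and the frequencies are copied from the cross-classification table.

```python
# Frequencies from the cross-classification table
# (rows: newspaper, columns: occupation).
occupations = ["Blue Collar", "White Collar", "Professional"]
counts = {
    "G&M":  [27, 29, 33],
    "Post": [18, 43, 51],
    "Star": [38, 21, 22],
    "Sun":  [37, 15, 20],
}

# Column totals: number of respondents in each occupation.
column_totals = [
    sum(row[j] for row in counts.values()) for j in range(len(occupations))
]

# Relative frequency = frequency divided by its column total.
relative = {
    paper: [row[j] / column_totals[j] for j in range(len(occupations))]
    for paper, row in counts.items()
}

print("Column totals:", dict(zip(occupations, column_totals)))
for paper, row in relative.items():
    print(f"{paper:5s}", "  ".join(f"{value:.2f}" for value in row))
```

    Each column of relative frequencies sums to 1, which is a quick check that the division used the right totals.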

  • MTH410 S14- Lecture 1 86/1532014/5/8

    Example

    Interpretation: The relative frequencies in columns 2 and 3 are similar, but there are large differences between columns 1 and 2 and between columns 1 and 3.

    This tells us that blue-collar workers tend to read different newspapers from both white-collar workers and professionals (dissimilar columns), and that white-collar workers and professionals are quite similar in their newspaper choice (similar columns).

  • MTH410 S14- Lecture 1 87/1532014/5/8

    Graphing the Relationship Between Two Nominal

    Variables

    Use the data from the cross-classification table to create bar charts.

    Professionals tend to read the Post more than twice as often as the Star or Sun (51 readers versus 22 and 20 in the table).

  • MTH410 S14- Lecture 1 88/1532014/5/8

    Agenda

    Types of Data and Information

    Graphical and Tabular Techniques for Nominal

    Data

    Graphical Techniques for Interval Data

    Describing Time-Series Data

    Describing the Relationship Between Two

    Variables

    Two Nominal Variables

    Two Interval Variables

  • MTH410 S14- Lecture 1 89/1532014/5/8

    Graphing the Relationship Between Two

    Interval Variables

    Moving from nominal data to interval data, we are

    frequently interested in how two interval variables are

    related.

    To explore this relationship, we employ a scatter

    diagram, which plots two variables against one another.

    The independent variable is labeled X and is usually

    placed on the horizontal axis, while the other, dependent

    variable, Y, is mapped to the vertical axis.

  • MTH410 S14- Lecture 1 90/1532014/5/8

    Example

    A real estate agent wanted to know to what extent the selling

    price of a home is related to its size. To acquire this

    information he took a sample of 12 homes that had recently

    sold, recording the price in thousands of dollars and the size

    in hundreds of square feet. These data are listed in the

    accompanying table. Use a graphical technique to describe

    the relationship between size and price.

    Size 23 18 26 20 22 14 33 28 23 20 27 18

    Price 315 229 355 261 234 216 308 306 289 204 265 195

  • MTH410 S14- Lecture 1 91/1532014/5/8

    Example

    It appears that there is in fact a relationship: the greater the house size, the greater the selling price.

  • MTH410 S14- Lecture 1 92/1532014/5/8

    Patterns of Scatter Diagrams

    Linearity and Direction are two concepts we are interested in.

    [Four scatter diagram panels: Positive Linear Relationship, Negative Linear Relationship, Non-Linear Relationship, No Relationship.]

  • MTH410 S14- Lecture 1 93/1532014/5/8

    Summary

                     Single Set of Data                   Relationship Between Two Variables
    Interval Data    Histogram, Ogive                     Scatter Diagram
    Nominal Data     Frequency and Relative Frequency     Cross-classification Table,
                     Tables, Bar and Pie Charts           Bar Charts

  • MTH410 S14- Lecture 1 94/153

    Agenda

    Introduction

    Measures of Central Location

    Measures of Variability

    Measures of Relative Standing

  • MTH410 S14- Lecture 1 95/153

    Numerical Descriptive Techniques

    Measures of Central Location

    Mean, Median, Mode

    Measures of Variability

    Range, Standard Deviation, Variance,

    Coefficient of Variation

    Measures of Relative Standing

    Percentiles, Quartiles

  • MTH410 S14- Lecture 1 96/153

    Agenda

    Introduction

    Measures of Central Location

  • MTH410 S14- Lecture 1 97/153

    Measures of Central Location

    Usually, we focus our attention on two

    types of measures when describing

    population characteristics:

    Central location (e.g. average)

    Variability or spread

    The measure of central location

    reflects the locations of all the actual

    data points.

  • MTH410 S14- Lecture 1 98/153

    Measures of Central Location

    The measure of central location reflects the locations of all the actual data points. How?

    With one data point, clearly the central location is at the point itself.

    With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

    But if a third data point appears on the left-hand side of the midrange, it should pull the central location to the left.

  • MTH410 S14- Lecture 1 99/153

    Arithmetic Mean

    The arithmetic mean, also called the average or simply the mean, is the most popular and useful measure of central location.

    It is computed by adding up all the observations and dividing by the total number of observations:

    Mean = (Sum of the observations) / (Number of observations)

    The arithmetic mean for a sample is denoted with an x-bar: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

  • MTH410 S14- Lecture 1 100/153

    Notation

    When referring to the number of

    observations in a population, we use

    uppercase letter N

    When referring to the number of

    observations in a sample, we use lower

    case letter n

    The arithmetic mean for a population is denoted with the Greek letter mu: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

  • MTH410 S14- Lecture 1 101/153

    Statistics is a pattern language

                Population   Sample
    Size        N            n
    Mean        $\mu$        $\bar{x}$

  • MTH410 S14- Lecture 1 102/153

    Mean (Cont'd)

    Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

    Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$


  • MTH410 S14- Lecture 1 104/153

    Mean (Cont'd)

    The mean is appropriate for describing interval data, e.g. heights of people, marks of student papers, etc.

    The mean is seriously affected by extreme values, called outliers.

    E.g. if Bill Gates moved into any neighborhood, the average household income for that neighborhood would increase dramatically beyond what it was previously!

  • MTH410 S14- Lecture 1 105/153

    The Arithmetic Mean

    Example

    The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

    $$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \cdots + x_{10}}{10} = \frac{0 + 7 + \cdots + 22}{10} = 11.0$$

    Example

    Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is

    $$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \cdots + x_{200}}{200} = \frac{42.19 + 38.45 + \cdots + 45.77}{200} = 43.59$$
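    As a quick check of the Internet-time example, the mean can be computed directly from its definition. This is a minimal sketch; the variable names are illustrative only.

```python
# Reported hours on the Internet for the 10 adults in the example.
times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Mean = sum of the observations / number of observations.
mean_time = sum(times) / len(times)

print(mean_time)  # 110 / 10 = 11.0
```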

  • MTH410 S14- Lecture 1 106/153

    Properties of Mean

    Calculated by using every data point.

    Every set of interval data has a unique mean.

    Sum of deviations from the mean is 0.

    Affected by extreme (very large or very small) values.

    Not meaningful for nominal or ordinal data.

    Useful for comparing 2 or more data sets.

  • MTH410 S14- Lecture 1 107/153

    Median

    The median is calculated by placing all the observations in order; the observation that falls in the middle is the median.

    Data: {0, 7, 12, 5, 14, 8, 0, 9, 22}, N = 9 (odd)

    Sort them from smallest to largest and find the middle:

    0 0 5 7 8 9 12 14 22

    median = 8

    Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}, N = 10 (even)

    Sort them from smallest to largest; there are two elements in the middle:

    0 0 5 7 8 9 12 14 22 33

    median = (8 + 9) / 2 = 8.5

    Sample and population medians are computed the same way.
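    The sort-and-pick-the-middle procedure can be written out as a small function. The function below is a sketch written for these notes (it is not from the course materials); Python's standard library offers the same calculation as statistics.median.

```python
import statistics


def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle element.
        return ordered[middle]
    # Even number of observations: average the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_data = [0, 7, 12, 5, 14, 8, 0, 9, 22]   # N = 9
even_data = odd_data + [33]                 # N = 10

print(median(odd_data))    # middle of 0 0 5 7 8 9 12 14 22
print(median(even_data))   # (8 + 9) / 2

# The standard library agrees with the hand-written version.
print(statistics.median(odd_data), statistics.median(even_data))
```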

  • MTH410 S14- Lecture 1 108/153

    Properties of Median

    Calculated by using only 1 or at most 2 values.

    Every set of interval data has a unique median.

    Not affected by extreme values.

    Can be calculated for ordinal data as well, but cannot be interpreted as the centre of location.

  • MTH410 S14- Lecture 1 109/153

    Mode

    The mode of a set of observations is the value that occurs most frequently. Sometimes we say

    MODE = PEAK of a curve.

    A set of data may have one mode (or modal class), or two modes, or more modes.

    Mode can be used for all data types, although mainly used for nominal data.

    For populations and large samples, the modal class is preferable.

    Sample and population modes are computed the same way.

  • MTH410 S14- Lecture 1 110/153

    Mode(Contd)

    E.g. Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}

    N=10

    Which observation appears most often?

    The mode for this data set is 0. How about

    this as a measure of central location?

    In a small sample, it may not be a good measure.

  • MTH410 S14- Lecture 1 111/153

    Mode(Contd)

    The mode may not be unique, e.g. there are 2 modes for bimodal data.

    Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one.

    You will have to use other techniques (e.g. a histogram) to determine if your data is bimodal, trimodal, etc.
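    Outside Excel, Python's standard library (version 3.8 or later) can list every mode rather than just one. The sketch below uses statistics.multimode on the lecture data; the bimodal list is a made-up example added only to show the multi-mode case.

```python
from statistics import multimode

# Data set from the mode example: 0 appears twice, every other value once.
data = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]
modes = multimode(data)
print(modes)          # a single mode

# Hypothetical bimodal data (not from the lecture): 1 and 3 both appear twice.
bimodal = [1, 1, 2, 3, 3]
bimodal_modes = multimode(bimodal)
print(bimodal_modes)  # both modes are reported
```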

  • MTH410 S14- Lecture 1 112/153

    Properties of Mode

    Not affected by extreme values.

    Multiple modes are possible, hence not a good measure of central location.

    Sometimes no mode exists, for example when every value occurs equally often.

    Can be calculated for nominal data as well, but cannot be interpreted as the centre of location.

  • MTH410 S14- Lecture 1 113/153

    Mean, Median, Mode

    If a distribution is symmetrical, the mean,

    median and mode may coincide

    [Figure: a symmetric distribution with the mode, mean, and median marked at the same point.]

  • MTH410 S14- Lecture 1 114/153

    Mean, Median, Mode (Cont'd)

    If a distribution is asymmetrical, say skewed to the left or to the right, the three measures may differ. E.g.:

    A positively skewed distribution (skewed to the right): mode < median < mean

    A negatively skewed distribution (skewed to the left): mean < median < mode

    Note: the median is not as sensitive as the mean to skewness.

  • MTH410 S14- Lecture 1 115/153

    Mean, Median, Mode(Contd)

    If data are symmetric, the mean, median,

    and mode will be approximately the same.

    If data are multimodal, report the mean,

    median and/or mode for each subgroup.

    If data are skewed, report the median.

  • MTH410 S14- Lecture 1 116/153

    About Ordinal & Nominal Data

    For ordinal and nominal data the calculation

    of the mean is not valid.

    Median is appropriate for ordinal data.

    For nominal data, a mode calculation is

    useful for determining highest frequency but

    not central location.

  • MTH410 S14- Lecture 1 117/153

    Course Marks: Mean & Median

    Example: Assume you got 35 marks on an exam and the average was 45 marks. What kind of result would you expect? Failed?

    Not sure. It depends on the actual marks of all the students.

    Suppose the marks were:

    15, 20, 25, 25, 25, 30, 35, 75, 100, 100

    Congratulations! Good job, you're the fourth highest!

    What if you were told that the median was 27.5, that is (25 + 30) / 2? Would you still worry?
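    The course-marks example is easy to verify numerically. The sketch below recomputes the mean, the median, and the rank of the mark 35; the variable names are illustrative only.

```python
import statistics

marks = [15, 20, 25, 25, 25, 30, 35, 75, 100, 100]
my_mark = 35

mean_mark = statistics.mean(marks)       # pulled upward by the 75 and the two 100s
median_mark = statistics.median(marks)   # middle of the sorted marks

# Rank counted from the top: 1 means the highest mark.
rank_from_top = sorted(marks, reverse=True).index(my_mark) + 1

print(mean_mark, median_mark, rank_from_top)
```

    A mark of 35 is below the mean but above the median, which is why the mean alone gives a misleading picture here.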

  • MTH410 S14- Lecture 1 118/153

    Mean, Median, Mode: Which is Best

    The mean is generally the first choice.

    The median is the best choice in the following scenarios:

    there are extreme observations

    we need to determine the rank of a particular value relative to the data set

    The mode is rarely the best measure.

  • MTH410 S14- Lecture 1 119/153

    Geometric Mean

    The geometric mean is used when the variable is a growth rate or rate of change, such as the value of an investment over periods of time.

    For a given series of rates of return $R_1, R_2, \ldots, R_n$, the nth period return is calculated by:

    $$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$$

    If the rate of return was $R_g$ in every period, the nth period return would be calculated by:

    $$(1 + R_g)^n$$

    The geometric mean $R_g$ is selected such that the two are equal, which gives

    $$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$$

  • MTH410 S14- Lecture 1 120/153

    Finance Example

    Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50% from $2,000 back to the original $1,000 in the second year. What is the average return?

    Using the arithmetic mean:

    $$\bar{R} = \frac{R_1 + R_2}{2} = \frac{100\% + (-50\%)}{2} = 25\%$$

    This would indicate having more than $1,000 at the end of the second year; however, in fact we only have $1,000. The arithmetic mean is misleading here.

    Using the geometric mean (the upper case Greek letter Pi represents a product of terms):

    $$R_g = \sqrt[2]{\prod_{i=1}^{2} (1 + R_i)} - 1 = \sqrt{(1 + 1)(1 - 0.5)} - 1 = 0$$

    Solving for the geometric mean yields a rate of 0%, which is more precise.
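    The finance example can be reproduced with a short calculation. This sketch assumes Python 3.8 or later (for math.prod) and expresses the returns as decimals, so 100% is 1.00 and a 50% loss is -0.50.

```python
import math

# Yearly rates of return from the example: +100%, then -50%.
returns = [1.00, -0.50]
n = len(returns)

# Arithmetic mean of the rates.
arithmetic_mean = sum(returns) / n

# Geometric mean: nth root of the product of growth factors, minus 1.
growth = math.prod(1 + r for r in returns)
geometric_mean = growth ** (1 / n) - 1

# Value of the $1,000 investment after two years.
final_value = 1000 * growth

print(arithmetic_mean, geometric_mean, final_value)
```

    The geometric mean of 0% matches the fact that the investment ends where it started, while the arithmetic mean of 25% does not.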

  • MTH410 S14- Lecture 1 121/153

    Measures of Central Location Summary

    Compute the mean to describe the central location of a single set of interval data.

    Compute the median to describe the central location of a single set of ordinal or interval data (with extreme observations).

    Compute the mode to describe a single set of nominal, ordinal or interval data.

    Compute the geometric mean to describe a single set of interval data based on growth rates.

  • MTH410 S14- Lecture 1 122/153

    Agenda

    Introduction

    Measures of Central Location

    Measures of Variability

  • MTH410 S14- Lecture 1 123/153

    Measures of variability

    Measures of central location fail to tell the whole story about the distribution.

    A question of interest still remains unanswered:

    How much are the observations spread out

    around the mean value?

  • MTH410 S14- Lecture 1 124/153

    Why not use mean

    Observe two data sets:

    Small variability: the average value provides a good representation of the observations in the data set.

    This data set is now changing to...

  • MTH410 S14- Lecture 1 125/153

    Why not use mean (Cont'd)

    Observe two data sets:

    Small variability: the average value provides a good representation of the observations in the data set.

    Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.

  • MTH410 S14- Lecture 1 126/153

    Range

    The range is the simplest measure of variability, and is calculated as:

    Range = Largest observation - Smallest observation

    E.g. Data set: {4, 4, 4, 4, 4, 50}, Range = 46

    Data set: {4, 8, 15, 24, 39, 50}, Range = 46

    The range is the same in both cases, but the data sets have very different distributions.
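    A two-line check makes the limitation concrete: both data sets produce the same range even though their values are spread very differently. The helper name data_range is made up for this sketch.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


clustered = [4, 4, 4, 4, 4, 50]      # five identical values and one extreme value
spread_out = [4, 8, 15, 24, 39, 50]  # values spread across the interval

print(data_range(clustered), data_range(spread_out))
```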

  • MTH410 S14- Lecture 1 127/153

    Range (Cont'd)

    But how do all the observations spread out?

    [Figure: a number line from the smallest observation to the largest observation, with the range spanning the two and question marks in between.]

    The range cannot assist in answering this question.

  • MTH410 S14- Lecture 1 128/153

    Variance

    Variance and its related measure,

    standard deviation, are arguably the most

    important statistics. Used to measure

    variability, they also play a vital role in

    almost all statistical inference procedures.

    Population variance is denoted by $\sigma^2$ (lower case Greek letter sigma squared).

    Sample variance is denoted by $s^2$ (lower case s squared).

  • MTH410 S14- Lecture 1 129/153

    Statistics is a pattern language

                Population    Sample
    Size        N             n
    Mean        $\mu$         $\bar{x}$
    Variance    $\sigma^2$    $s^2$

  • MTH410 S14- Lecture 1 130/153

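    Variance (Cont'd)

    The variance of a population is:

    $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

    where $\mu$ is the population mean and $N$ is the population size.

    The variance of a sample is:

    $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

    where $\bar{x}$ is the sample mean. Note the divisor: sample size minus one! The reason we will discuss later.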

  • MTH410 S14- Lecture 1 131/153

    Variance (Cont'd)

    With the defining formula, you have to calculate the sample mean in order to calculate the sample variance.

    Alternatively, there is a short-cut formulation to calculate the sample variance directly from the data without the intermediate step of calculating the mean. It's given by:

    $$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

  • MTH410 S14- Lecture 1 132/153

    Why not use the sum of deviations...

    Consider two small populations:

    A: 8, 9, 10, 11, 12

    B: 4, 7, 10, 13, 16

    The mean of both populations is 10, but measurements in B are more dispersed than those in A. Any good measure of dispersion should agree with this observation.

    Can the sum of deviations be a good measure of variability?

    A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0

    B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0

    The sum of deviations is zero for all populations and therefore is not a good measure of variability.

  • MTH410 S14- Lecture 1 133/153

    Let us calculate the variances

    $$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

    $$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

    Why is the variance defined as the average squared deviation rather than the sum of squared deviations?
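    The two populations A and B can be checked with a few lines of code. This sketch computes the sum of deviations (always zero) and the population variance from its definition; the function names are illustrative only.

```python
def sum_of_deviations(population):
    """Sum of (x - mu); this is zero for any data set."""
    mu = sum(population) / len(population)
    return sum(x - mu for x in population)


def population_variance(population):
    """Average squared deviation from the population mean."""
    mu = sum(population) / len(population)
    return sum((x - mu) ** 2 for x in population) / len(population)


A = [8, 9, 10, 11, 12]
B = [4, 7, 10, 13, 16]

print(sum_of_deviations(A), sum_of_deviations(B))      # both zero
print(population_variance(A), population_variance(B))  # B is larger
```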

  • MTH410 S14- Lecture 1 134/153

    Why not use the sum of squared deviations...

    Let us calculate the sum of squared deviations for both data sets in this example.

    Data set A: {1, 1, 1, 1, 1, 3, 3, 3, 3, 3}

    Data set B: {1, 5}

    [Figure: data set A plotted at 1 and 3 around its mean of 2; data set B plotted at 1 and 5 around its mean of 3.]

    Which data set has a larger dispersion? Data set B is more dispersed around the mean.

  • MTH410 S14- Lecture 1 135/153

    Why not use the sum of squared deviations...

    Data set A: {1, 1, 1, 1, 1, 3, 3, 3, 3, 3}

    Data set B: {1, 5}

    Sum of squared deviations for A = 5(1-2)^2 + 5(3-2)^2 = 10

    Sum of squared deviations for B = (1-3)^2 + (5-3)^2 = 8

    SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

  • MTH410 S14- Lecture 1 136/153

    How about averaged squared deviations...

    When calculated on a per-observation basis (variance), the data set dispersions are properly ranked.

    $\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$

    $\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$
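    The ranking reversal is easy to reproduce. In the sketch below (names are illustrative), the raw sum of squared deviations ranks A above B only because A has more observations; dividing by the number of observations restores the expected order.

```python
def sum_squared_deviations(data):
    """Sum of squared deviations from the mean of the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


set_a = [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]
set_b = [1, 5]

sum_a = sum_squared_deviations(set_a)
sum_b = sum_squared_deviations(set_b)

# Averaging per observation gives the (population) variance.
var_a = sum_a / len(set_a)
var_b = sum_b / len(set_b)

print(sum_a, sum_b)  # the sum ranks A as more dispersed
print(var_a, var_b)  # the average ranks B as more dispersed
```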

  • MTH410 S14- Lecture 1 137/153

    Application

    Example

    The following sample consists of the

    number of jobs six students applied for:

    17, 15, 23, 7, 9, 13.

    Find its mean and variance.

    What are we looking to calculate?

  • MTH410 S14- Lecture 1 138/153

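    Sample Mean & Variance

    Sample mean:

    $$\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$$

    Sample variance:

    $$s^2 = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$$

    Sample variance (shortcut method):

    $$s^2 = \frac{1}{6-1}\left[(17^2 + 15^2 + 23^2 + 7^2 + 9^2 + 13^2) - \frac{84^2}{6}\right] = \frac{1}{5}(1342 - 1176) = 33.2$$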
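    Both routes to the sample variance can be compared in code. This is a sketch using the six job-application counts from the example; the variable names are illustrative only.

```python
import math

jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)

# Sample mean.
sample_mean = sum(jobs) / n

# Definition: squared deviations from the mean, divided by n - 1.
variance_definition = sum((x - sample_mean) ** 2 for x in jobs) / (n - 1)

# Shortcut: uses only the sum of squares and the square of the sum.
sum_of_squares = sum(x * x for x in jobs)
variance_shortcut = (sum_of_squares - sum(jobs) ** 2 / n) / (n - 1)

print(sample_mean, variance_definition, variance_shortcut)

# The two formulas are algebraically equivalent.
print(math.isclose(variance_definition, variance_shortcut))
```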

  • MTH410 S14- Lecture 1 139/153

    Standard Deviation

    The standard deviation is the square root of the variance.

    Population standard deviation: $\sigma = \sqrt{\sigma^2}$

    Sample standard deviation: $s = \sqrt{s^2}$
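    As an illustration, the sample standard deviation of the job-application data can be obtained either by taking the square root of the sample variance or by calling statistics.stdev from Python's standard library. Reusing that data set here is a choice made for this sketch.

```python
import math
import statistics

jobs = [17, 15, 23, 7, 9, 13]

# Sample variance (divisor n - 1), then its square root.
sample_variance = statistics.variance(jobs)
s_by_hand = math.sqrt(sample_variance)

# The standard library computes the same quantity directly.
s_library = statistics.stdev(jobs)

print(sample_variance, s_by_hand, s_library)
```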

  • MTH410 S14- Lecture 1 140/153

    Statistics is a pattern language

                          Population    Sample
    Size                  N             n
    Mean                  $\mu$         $\bar{x}$
    Variance              $\sigma^2$    $s^2$
    Standard Deviation    $\sigma$      $s$

  • MTH410 S14- Lecture 1 141/153

    Mean Absolute Deviation

    There is another measure of deviation: the Mean Absolute Deviation (MAD), which is calculated by averaging the absolute values of the deviations. However, this statistic is rarely used.

    $$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

    E.g. Given data set {17, 15, 23, 7, 9, 13}, with mean 14:

    $$MAD = \frac{|17-14| + |15-14| + |23-14| + |7-14| + |9-14| + |13-14|}{6} = \frac{26}{6} = 4\tfrac{1}{3}$$
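    The MAD calculation for the example data can be sketched as follows; the function name is made up for this illustration.

```python
import math


def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


jobs = [17, 15, 23, 7, 9, 13]
mad = mean_absolute_deviation(jobs)

print(mad)  # 26 / 6, about 4.33
```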

  • MTH410 S14- Lecture 1 142/153

    Measures of Variability Summary

    If data are symmetric, with no serious outliers, use range and standard deviation.

    If comparing variation across two data sets, use coefficient of variation.

    The measures of variability introduced in this section can be used only for interval data.

    The next section will discuss a measure that can be used to describe the variability of ordinal data.

    There are no measures of variability for nominal data.

  • MTH410 S14- Lecture 1 143/153

    Agenda

    Introduction

    Measures of Central Location

    Measures of Variability

    Measures of Relative Standing

  • MTH410 S14- Lecture 1 144/153

    Measures of Relative Standing

    Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.

    Percentile: the Pth percentile is the smallest point in a distribution at or below which P percent of the cases are found.

    Example: Suppose your score is the 60th percentile of a GMAT test. That is, 60% of all the scores lie below your score and 40% lie above it.

    Note: The 60th percentile doesn't mean you scored 60% on the exam. It means that 60% of your peers scored lower than you on the exam.

  • MTH410 S14- Lecture 1 145/153

    Quartiles

    We have special names for the 25th, 50th, and 75th

    percentiles, namely quartiles.

    The first or lower quartile is labeled Q1 = 25th percentile.

    The second quartile, Q2 = 50th percentile (also the

    median).

    The third or upper quartile, Q3 = 75th percentile.

    We can also convert percentiles into quintiles (fifths) and

    deciles (tenths).

  • MTH410 S14- Lecture 1 146/153

    Quartiles vs. Variability

    Quartiles can provide an idea about the shape of a histogram.

    Positively skewed histogram: Q2 - Q1 < Q3 - Q2

    Negatively skewed histogram: Q2 - Q1 > Q3 - Q2

  • MTH410 S14- Lecture 1 147/153

    Commonly Used Percentiles

    First (lower) decile = 10th percentile

    First (lower) quartile, Q1, = 25th percentile

    Second (middle) quartile, Q2, = 50th percentile

    Third quartile, Q3, = 75th percentile

    Ninth (upper) decile, = 90th percentile

  • MTH410 S14- Lecture 1 148/153

    Location of Percentiles

    The following formula allows us to approximate the location of any percentile:

    $$L_P = (n + 1)\frac{P}{100}$$

    where $L_P$ is the location of the Pth percentile.

  • MTH410 S14- Lecture 1 149/153

    Location of Percentiles(Contd)

    Given the data:

    0 0 5 7 8 9 12 14 22 33

    Where is the location of the 25th percentile?

    L25 = (10+1)(25/100) = 2.75

    The 25th percentile is three-quarters of the distance between the second observation (which is 0) and the third observation (which is 5). Three-quarters of the distance is (.75)(5 - 0) = 3.75; because the second observation is 0, the 25th percentile is 0 + 3.75 = 3.75.

  • MTH410 S14- Lecture 1 150/153

    Location of Percentiles(Contd)

    What about the upper quartile?

    L75 = (10+1)(75/100) = 8.25

    0 0 5 7 8 9 12 14 22 33

    It is located one-quarter of the distance between the eighth and the ninth observations, which are 14 and 22, respectively. One-quarter of the distance is: (.25)(22 - 14) = 2, which means the 75th percentile is at: 14 + 2 = 16

  • MTH410 S14- Lecture 1 151/153

    Location of Percentiles(Contd)

    0 0 5 7 8 9 12 14 22 33

    [Figure: the sorted data with the 25th percentile (3.75) marked at position 2.75 and the 75th percentile (16) marked at position 8.25.]

    Lp determines the position in the data set where the percentile value lies, not the percentile itself.

    We have already shown how to find the median, which is the 50th percentile. Its location is

    $$L_{50} = (10 + 1)\frac{50}{100} = 5.5$$

    so it is the 5.5th observation: the 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is, (8 + 9)/2 = 8.5.
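    The location formula and the interpolation step can be combined into one small function. This is a sketch, not a definitive implementation: the function name is made up, and clamping to the smallest or largest observation when the location falls outside the data is an assumption added here, since the slides do not cover that case.

```python
def percentile(sorted_data, p):
    """Value at location L_P = (n + 1) * P / 100, with linear interpolation."""
    n = len(sorted_data)
    location = (n + 1) * p / 100      # 1-based position, possibly fractional
    lower = int(location)             # position of the observation just below
    fraction = location - lower       # how far to move toward the next one

    # Assumption (not in the slides): clamp locations outside the data.
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]

    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]  # already sorted

q1 = percentile(data, 25)   # location 2.75
q2 = percentile(data, 50)   # location 5.5
q3 = percentile(data, 75)   # location 8.25
print(q1, q2, q3)
```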

  • MTH410 S14- Lecture 1 152/153

    Interquartile Range

    The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:

    Interquartile range = Q3 - Q1

    The interquartile range measures the spread of the middle 50% of the observations.

    Large values of this statistic mean that the 1st and 3rd quartiles are far apart, indicating a high level of variability.
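    As an illustration, the interquartile range can be computed with statistics.quantiles from Python's standard library (version 3.8 or later). Its "exclusive" method places quartiles at positions based on n + 1, which matches the location formula used in these notes. Applying it to the ten-observation data set from the percentile slides is a choice made for this sketch.

```python
import math
import statistics

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

# Interquartile range = Q3 - Q1.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```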

  • MTH410 S14- Lecture 1 153/153

    1. It is a summary.

    2. It is also a guideline for

    selecting techniques.