
7/31/2019 Appl Statistics


Statistics, Probability and

Applications

Statistics, Probability and Applications by - Dr. Abhijit Kar Gupta, kg.abhi@gmail.com

This is a short course [based on the lecture notes for the M.Sc. (Geography) students of Distance Education] on the elements of Statistics and the concept of Probability, supplemented with examples and illustrations. The purpose of this general and basic course is to serve as a guideline to practical usage for any student.


What is Statistics?

Statistics is a systematic presentation of data out of which we may conclude something

meaningful.

Just a collection of raw data is meaningless unless we are able to calculate some quantities out

of them. It is only interesting when some patterns emerge out of the data that are

representative of some event or measurement.

After we collect a set of data, the first thing we like to do is to obtain the central tendency of it.

Central Tendency

The central tendency of a data set is obtained by calculating mean, median and mode.

There are various kinds of mean, (i) arithmetic, (ii) geometric, (iii) harmonic. We usually calculate

arithmetic mean and this we commonly call mean or average.

Suppose we have a set of $N$ data points: $x_1, x_2, \ldots, x_N$.

Arithmetic mean (A.M.):

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

The arithmetic mean or average is the measure of the ‘middle’ of the data set.

Now suppose $x_1$ appears $f_1$ times, $x_2$ appears $f_2$ times and so on in the data set. Here $f_1, f_2, \ldots$ are called the frequencies. The arithmetic mean in this case is

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} \qquad (2)$$

Formula (2) is called the weighted mean.

Note: In formula (2), if we put $f_i = 1$ for all $i$, we get back formula (1).

Formula (2) can also be written as

$$\bar{x} = \sum_i \left(\frac{f_i}{\sum_j f_j}\right) x_i,$$

where $f_i / \sum_j f_j$ is the relative frequency for each $x_i$ (each data point).

Example #1

The ages of father, mother, son and daughter in a family are 60 years, 55 years, 25 years

and 20 years respectively. What is the average age of the family members?

Ans. Average age = $\frac{60 + 55 + 25 + 20}{4} = 40$ years.

Example #2

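In the game of ‘Ludo’ (dice throwing), you obtain ‘1’ two times, ‘2’ five times, ‘3’ two times, ‘4’ six times, ‘5’ four times and ‘6’ only once from the random throwing of a dice.

What is the average value you get?

Ans. Average value = $\frac{1\times 2 + 2\times 5 + 3\times 2 + 4\times 6 + 5\times 4 + 6\times 1}{2 + 5 + 2 + 6 + 4 + 1} = \frac{68}{20} = 3.4$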
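The two examples above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `arithmetic_mean` and `weighted_mean` are our own choices and do not come from the notes.

```python
def arithmetic_mean(values):
    """Formula (1): the sum of the data points divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, frequencies):
    """Formula (2): the sum of f_i * x_i divided by the sum of f_i."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / sum(frequencies)


# Example #1: ages of the four family members
family_mean = arithmetic_mean([60, 55, 25, 20])

# Example #2: dice faces and how many times each face turned up
ludo_mean = weighted_mean([1, 2, 3, 4, 5, 6], [2, 5, 2, 6, 4, 1])

print(family_mean, ludo_mean)  # 40.0 3.4
```

Setting every frequency to 1 in `weighted_mean` gives back the plain arithmetic mean, as stated in the note on formula (2).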

Geometric mean (G.M.):

$$\text{G.M.} = \left(\prod_{i=1}^{N} x_i\right)^{1/N}$$

Harmonic mean (H.M.):

$$\text{H.M.} = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}$$

It is usually appropriate to calculate the arithmetic mean (A.M.) of a set of numbers, unless the numbers have some special relationship among them.

For example, if we are to find the mean of the following set of numbers: 2, 4, 8, 16, 32, it is useful to calculate the geometric mean: G.M. $= (2 \times 4 \times 8 \times 16 \times 32)^{1/5} = (2^{15})^{1/5} = 8$.

Note: The numbers 2, 4, 8, … are in geometric progression.

If we are asked to find the mean of numbers that are in harmonic progression, it would be interesting to find the harmonic mean (H.M.).

Note: The reciprocals of numbers in harmonic progression are in arithmetic progression.
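As a quick check of the two definitions, Python's standard `statistics` module provides `geometric_mean` and `harmonic_mean`. In the sketch below the geometric progression is the one from the text, while the harmonic progression `hp` is a hypothetical example chosen by us for illustration.

```python
from statistics import geometric_mean, harmonic_mean

# Geometric progression from the text
gp = [2, 4, 8, 16, 32]
gm = geometric_mean(gp)        # (2 * 4 * 8 * 16 * 32) ** (1 / 5)

# Hypothetical harmonic progression (an assumption, not from the notes):
# its reciprocals 1, 2, 3, 4, 5 are in arithmetic progression
hp = [1, 1 / 2, 1 / 3, 1 / 4, 1 / 5]
hm = harmonic_mean(hp)         # 5 / (1 + 2 + 3 + 4 + 5)

print(round(gm, 6), round(hm, 6))
```

For a geometric progression with an odd number of terms the G.M. is the middle term, and for this harmonic progression the H.M. is likewise the middle term, 1/3.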

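Useful Method of Mean Calculation:

In practical calculations, when we are to obtain the arithmetic mean (A.M.) of a set of big numbers, we follow a short-cut method:

Step I: We assume a mean by just looking at the numbers. Let this be $A$. (This is our choice and we do this as per our convenience.)

Step II: Next, we calculate the deviation of each data point from this assumed mean: $d_i = x_i - A$.

Now, the calculated mean of the deviations is $\bar{d} = \frac{1}{N}\sum_i d_i$.

The actual arithmetic mean is $\bar{x} = A + \bar{d}$.

Similarly, for data with frequencies,

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} = \frac{\sum_i f_i (A + d_i)}{\sum_i f_i} = A + \frac{\sum_i f_i d_i}{\sum_i f_i}.$$

Here also, we get the same formula as above, $\bar{x} = A + \bar{d}$, with $\bar{d} = \sum_i f_i d_i / \sum_i f_i$.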

Example #1

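Consider the following table. We are to calculate the mean rainfall over seven days in the monsoon season.

Day      Rainfall x (mm)    Deviation d = x - 200 (mm)
1        250                 50
2        240                 40
3        190                -10
4        254                 54
5        225                 25
6        232                 32
7        170                -30
Total    1561                161

Here the assumed mean is $A = 200$ mm and $N = 7$.

$\bar{d} = 161/7 = 23$ mm. The actual mean, $\bar{x} = A + \bar{d} = 200 + 23 = 223$ mm.

Also, verify by direct calculation: $\bar{x} = \frac{1}{N}\sum_i x_i = 1561/7 = 223$ mm.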
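The assumed-mean method can be sketched in Python as follows. The function name `assumed_mean` and its arguments are our own, used only for illustration; the rainfall figures are those of the table above.

```python
def assumed_mean(values, assumed, frequencies=None):
    """Mean via the short-cut method: assumed mean plus the mean deviation."""
    if frequencies is None:
        frequencies = [1] * len(values)          # every value counted once
    deviations = [x - assumed for x in values]   # d_i = x_i - A
    mean_deviation = (
        sum(f * d for f, d in zip(frequencies, deviations)) / sum(frequencies)
    )
    return assumed + mean_deviation


rainfall = [250, 240, 190, 254, 225, 232, 170]   # mm, seven days

rain_mean = assumed_mean(rainfall, assumed=200)
direct_mean = sum(rainfall) / len(rainfall)

print(rain_mean, direct_mean)  # 223.0 223.0
```

The result does not depend on the particular assumed mean; a convenient choice only keeps the intermediate numbers small.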

Example #2

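Calculate the mean of the following data with the help of the assumed mean method.

Interval    x     f     d = x - 40    f d
10-20       45    4      5             20
20-30       35    5     -5            -25
30-40       48    3      8             24
40-50       43    2      3              6
50-60       40    1      0              0
60-70       37    1     -3             -3
70-80       39    4     -1             -4
Total             20                   18

Here the assumed mean is $A = 40$ and the number of data points is $N = \sum f = 20$.

Mean of deviation, $\bar{d} = \frac{\sum f d}{\sum f} = \frac{18}{20} = 0.9$

The actual mean, $\bar{x} = A + \bar{d} = 40 + 0.9 = 40.9$

We can also check this from direct calculation, $\bar{x} = \frac{\sum f x}{\sum f} = \frac{818}{20} = 40.9$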

Median:

Median is the data in the middle when the data set is arranged in ascending or descending

order.

Example #1

9, 12, 6, 1, 11

After ordering, 1, 6, 9, 11, 12

Median = 9.

If the data set has an even number of entries, the median is the mean of the two data points in the middle after the ordering.

Example #2

9, 12, 6, 1, 11, 13

After ordering, 1, 6, 9, 11, 12, 13

Median = $\frac{9 + 11}{2} = 10$

Mode:

Mode is the data value which has the maximum frequency, that is, the value that occurs the maximum number of times in the data set.

Example:

0, 2, 5, 9, 3, 2, 6, 2, 3, 5, 4, 2, 1

In the above data set the number 2 occurs maximum times. Mode = 2.
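The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch for checking the hand calculations; the variable names are ours.

```python
from statistics import median, mode

# Median with an odd number of entries: the middle value after sorting
odd_median = median([9, 12, 6, 1, 11])          # 9

# Median with an even number of entries: mean of the two middle values
even_median = median([9, 12, 6, 1, 11, 13])     # 10.0

# Mode: the value with the highest frequency
most_frequent = mode([0, 2, 5, 9, 3, 2, 6, 2, 3, 5, 4, 2, 1])  # 2

print(odd_median, even_median, most_frequent)
```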

Usually, a data set with a single peak approximately follows the empirical relation: 3 Median - Mode = 2 Mean, that is, Mean - Mode = 3 (Mean - Median).

Measures of Position

It is often important to classify data!

In statistical data analysis, we often like to measure the position of a data point relative to

other values in the set. For example, we like to know the rank or position of a student relative

to others in a certain examination.

The measures are done for rank-ordered data, where the elements in the data set are arranged

in ascending order (from the smallest to the largest).

The following are the most common measures of position of the rank-ordered data:

Percentiles:

Percentile is the value of a variable below which a certain percent of observations fall. For example, the 90th percentile is the value (or score) below which 90% of the data are to be found.

Suppose we have $N$ values. How is the percentile calculated?

1. First the data is rank-ordered (arranged in ascending order).

2. To calculate the $P$-th percentile we have to find the rank $n$:

$$n = \frac{P}{100} \times N + \frac{1}{2}$$

3. Round off the above rank to the nearest integer and then take the value corresponding to that integer rank.

Example:

Given the numbers 2, 5, 4, 9, 8, 1

Rank ordered set: 1, 2, 4, 5, 8, 9

The rank of the 60th percentile, $n = \frac{60}{100} \times 6 + \frac{1}{2} = 4.1 \approx 4$ (rounded off to the nearest integer).

The 60th percentile is 5 (the 4th member in the ordered list).

Note: The 100th percentile is defined to be the largest value in the given data set.
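The three steps can be sketched in Python as below. We assume the rank rule $n = \frac{P}{100} N + \frac{1}{2}$ rounded to the nearest integer; other textbooks and software packages use slightly different conventions, so this is one possible reading and not a definitive implementation. The function name `percentile` is our own.

```python
import math


def percentile(data, p):
    """P-th percentile by the nearest-rank rule n = P/100 * N + 1/2 (assumed)."""
    ordered = sorted(data)                      # step 1: rank-order the data
    n_values = len(ordered)
    raw_rank = p / 100 * n_values + 0.5         # step 2: compute the rank
    rank = math.floor(raw_rank + 0.5)           # step 3: round to nearest integer
    rank = min(max(rank, 1), n_values)          # keep the rank inside 1..N
    return ordered[rank - 1]                    # ranks start at 1, indices at 0


numbers = [2, 5, 4, 9, 8, 1]
p60 = percentile(numbers, 60)     # 5, the 4th member of the ordered list
p100 = percentile(numbers, 100)   # 9, the largest value

print(p60, p100)
```

The clamping step is what makes the 100th percentile equal to the largest value, in line with the note above.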

Quartiles:

A quartile is one of the three points that divide a rank-ordered data set into four equal groups.

First quartile ($Q_1$): Cuts off the lowest 25% of data ⇨ 25th percentile

Second quartile ($Q_2$): Divides the data set into half ⇨ 50th percentile

Third quartile ($Q_3$): Cuts off the lowest 75% (or highest 25%) of data ⇨ 75th percentile

Interquartile range = upper quartile - lower quartile $= Q_3 - Q_1$

Note: The 50th percentile = Median

Deciles:

Like percentiles, deciles are calculated to find the position of data out of 10 (instead of 100). So all we have to do is to replace 100 by 10 in the above percentile formula.

Probability Theory and Applications

For randomly occurring events, we would like to know how many times we get a desired result out of all trials. This means we would like to know the fraction of favourable events or trials.

Suppose we flip a coin a number of times. We may count how many times there is a “Head” or a “Tail” out of all the flips.

$n$ = No. of favourable events and $N$ = Total no. of events.

$n/N$ = fraction of favourable events. We can also say this is the relative frequency in the usual language of Statistics.

Now, if we do the trials a large number of times, this fraction tends to some fixed value specific to the event. The limiting value of the fraction is what we call the probability $p$.

The trials we actually carry out form a ‘sample’ drawn out of the total ‘population’ of possible trials. As the no. of trials is increased, the sample becomes bigger.

Definition of Probability:

Probability is the ratio of the number of favourable events to the total number of events, provided the total number of events is very large (actually infinite):

$$p = \frac{n}{N}, \quad \text{when } N \to \infty \text{ (infinity)}.$$

So by definition, $p$ is a fraction between 0 and 1: $0 \le p \le 1$.

$p = 0$: No favourable outcome.

$p = 1$: All the outcomes are in favour.

We can also think in the following way: $p$ = probability that an event occurs, $q$ = probability that the event does not occur. Since the event will either occur or not occur, we must write: $p + q = 1$.

Therefore, we have $q = 1 - p$.

Example #1:

In a coin toss, we know from our experience, $p = \frac{1}{2}$ for Head and $q = 1 - p = \frac{1}{2}$ for Tail.

Example #2:

In a throw of a dice, we know that the probability of the dice facing “1” up, “2” up, “3” up etc. will be $\frac{1}{6}$, $\frac{1}{6}$, $\frac{1}{6}$ and so on.

The probability of “1” not occurring is $1 - \frac{1}{6} = \frac{5}{6}$.

The condition that the total probability of all the events has to be 1 is called normalization of probabilities.

Rules of Probability:

When more than one event takes place, we need to calculate the joint probability for all the events.

Mutually Exclusive Events

Two events are mutually exclusive (or disjoint) when they cannot occur at the same time. Suppose the two events are A and B and the individual probabilities for them are designated as $P(A)$ and $P(B)$. Mutually exclusive means $P(A \text{ and } B) = 0$.

Addition Rule: $P(A \text{ or } B) = P(A) + P(B)$

Example #1: The probability of getting either Head or Tail in a coin toss, $P(H \text{ or } T) = \frac{1}{2} + \frac{1}{2} = 1$.

Example #2: The probability of getting either “1” or “6” in a dice throw, $P(1 \text{ or } 6) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$.

Independent Events

When the occurrence of one event does not influence the other but they can occur at the same time, they are called independent. For example, rainfall today and Manchester United winning a match.

Multiplication Rule: $P(A \text{ and } B) = P(A) \times P(B)$

Example #1: What is the probability that two Heads will occur when we toss two coins together?

$P(H) = \frac{1}{2}$ for the first coin and $P(H) = \frac{1}{2}$ for the second coin, so $P(HH) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$.

Note that if we flip a single coin two times and ask for the probability of getting Heads twice, we would get the same answer.

Example #2: Now we ask the question, what is the probability of getting one Head and one Tail in the flipping of two coins together?

Consider the probability of obtaining Head in the first coin and Tail in the second coin: $P(HT) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$.

And the probability of obtaining Tail in the first and Head in the second: $P(TH) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$.

Now the total probability of the above two events (which are mutually exclusive): $P(HT \text{ or } TH) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$.

Note that in the flipping of two coins together, there are 4 types of events: HH, HT, TH, TT. Out of these, the relative occurrence of one Head and one Tail is 2/4 = 1/2.
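The counting argument for two coins can be checked by listing all the equally likely outcomes. The following is a minimal Python sketch; the helper name `probability` is ours.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of flipping two coins: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))


def probability(event):
    """Fraction of outcomes for which the given condition is true."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))


p_two_heads = probability(lambda o: o == ("H", "H"))
p_one_head_one_tail = probability(lambda o: sorted(o) == ["H", "T"])

print(p_two_heads, p_one_head_one_tail)  # 1/4 1/2
```

Direct counting gives the same numbers as the multiplication and addition rules used above.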

Events which are NOT Mutually Exclusive:

If the events are not mutually exclusive, there is some overlap. Suppose we designate an area A corresponding to the probability of some event A and an area B to the probability of another event B. The overlap between the two areas then represents the joint probability $P(A \text{ and } B)$. Note that for two mutually exclusive events the overlap would be zero.

Addition Rule in this case: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Events that are NOT Independent:

Multiplication rule: $P(A \text{ and } B) = P(A) \times P(B/A) = P(B) \times P(A/B)$

$P(B/A)$ ⇨ “The probability of B given A”. This is a conditional probability, i.e., the probability that B occurs provided A has occurred.

Similarly, $P(A/B)$ ⇨ “The probability of A given B”.

Note here that

$P(B/A) = P(B)$, when B does not depend on A, which means A and B are independent.

$P(A/B) = P(A)$, when A does not depend on B, which means A and B are independent.

So, we can write the formula for conditional probability:

$$P(B/A) = \frac{P(A \text{ and } B)}{P(A)}$$

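Now, to illustrate, consider the following table. In a survey of 100 people, the question was asked whether they are graduates or not.

           Graduate    Non-graduate    Total
Male       40          20              60
Female     10          30              40
Total      50          50              100

Q.1 What is the probability that a randomly selected person is a male?

Ans. $P(\text{male}) = 60/100 = 0.6$

Q.2 What is the probability that a randomly selected person is a female?

Ans. $P(\text{female}) = 40/100 = 0.4$

Q.3 What is the probability that a randomly selected person is a male who is a graduate?

Ans. $P(\text{male and graduate}) = 40/100 = 0.4$

[Also we can think, $P(\text{male and graduate}) = P(\text{male}) \times P(\text{graduate/male}) = \frac{60}{100} \times \frac{40}{60} = 0.4$]

Q.4 What is the probability that a randomly selected person is a female who is a non-graduate?

Ans. $P(\text{female and non-graduate}) = 30/100 = 0.3$

[Also, $P(\text{female}) \times P(\text{non-graduate/female}) = \frac{40}{100} \times \frac{30}{40} = 0.3$]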

Q.5 What is the probability that the randomly selected person is either a male graduate or a female non-graduate?

Ans. These two events are mutually exclusive and by the law of addition, $P = 0.4 + 0.3 = 0.7$.

Q.6 If we now select two persons, what is the probability that one of them is a male graduate and the other is a female non-graduate?

Ans. Two independent events are occurring together. So by the law of multiplication of probabilities, $P = 0.4 \times 0.3 = 0.12$.

Q.7 What is the probability that a randomly selected non-graduate is a female? [Prob. of female among non-graduates]

Ans. This is the no. of females out of total non-graduates, $30/50 = 0.6$.

Q.8 What is the probability that a randomly selected graduate is a male?

Ans. This is the no. of males out of total graduates, $40/50 = 0.8$.

Note: In Q.7 and Q.8, each probability is a conditional probability. However, we gave the answers by looking at the table directly. Now we answer them in terms of the law of conditional probability.

Ans. to Q.8: Suppose A = graduate, B = male, and $P(B/A)$ = probability of male given that they are graduates.

We use the formula: $P(B/A) = \frac{P(A \text{ and } B)}{P(A)}$

Here, $P(A \text{ and } B)$ = Prob. of male graduates = $40/100 = 0.4$, and $P(A)$ = prob. of graduates = $50/100 = 0.5$, so $P(B/A) = 0.4/0.5 = 0.8$.

Exercise: Q.7 can also be answered in terms of the conditional probability formula. Do this and check yourself.

Q.9 What is the probability that the selected person is either male or graduate?

Ans. Here the two events can happen together, so they are not mutually exclusive. So we use the formula:

$$P(\text{male or graduate}) = P(\text{male}) + P(\text{graduate}) - P(\text{male and graduate}) = 0.6 + 0.5 - 0.4 = 0.7$$
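The survey questions can also be worked out programmatically from the table of counts. The sketch below is only illustrative: the dictionary layout and the helper name `prob` are our own assumptions about how one might store the table.

```python
from fractions import Fraction

# Counts from the survey table: (gender, education) -> number of people
counts = {
    ("male", "graduate"): 40,
    ("male", "non-graduate"): 20,
    ("female", "graduate"): 10,
    ("female", "non-graduate"): 30,
}
total = sum(counts.values())  # 100 people


def prob(gender=None, education=None):
    """Probability that a random person matches the given attributes."""
    selected = sum(
        c for (g, e), c in counts.items()
        if gender in (None, g) and education in (None, e)
    )
    return Fraction(selected, total)


p_male = prob(gender="male")
p_graduate = prob(education="graduate")
p_male_and_graduate = prob("male", "graduate")

# Conditional probability: P(male / graduate) = P(male and graduate) / P(graduate)
p_male_given_graduate = p_male_and_graduate / p_graduate

# Addition rule for events that are not mutually exclusive
p_male_or_graduate = p_male + p_graduate - p_male_and_graduate

print(p_male_given_graduate, p_male_or_graduate)  # 4/5 7/10
```

The same helper can be used for the exercise on Q.7 by conditioning on the non-graduates instead.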

Probability Distributions

Let us think of the probabilities for a number of events marked 1, 2, 3, … and so on.

For each event $i$ we can have a probability $p_i$ with $0 \le p_i \le 1$, and for all the events together, $\sum_i p_i = 1$.

So we have a set of probabilities corresponding to a set of events. This collection is a probability distribution for all the discrete events.

Suppose, instead of discrete events, we think that $x$ is a variable which can take continuous values in some range and there is a probability $p(x)$ for each value of $x$. Now if we plot $p(x)$ against $x$, we get a continuous curve which is the continuous probability distribution curve (commonly referred to as the probability distribution curve).

The area under the curve (above the x-axis) can be obtained by summing up the areas of approximate rectangular bars (which we may easily find by plotting this on graph paper). The approximate area of one such bar of width $\Delta x$ and height $p(x)$ is $p(x)\,\Delta x$. So, the approximate total area between the two end points $x = a$ and $x = b$ is $\sum p(x)\,\Delta x$.

To calculate exactly, we need the help of Integral Calculus which essentially sums up the areas

of the rectangles (bars) of infinitesimally small width.

[ NOTE: Those not familiar with the Mathematics of Calculus, should not have to worry much as

the following explanation and symbols can be understood qualitatively and that may serve the

purpose for now.]

The area under the curve (between the two extreme points $a$ and $b$) is the following definite integral:

$$\text{Area} = \int_a^b p(x)\,dx = P.$$

$P$ is the total probability for all the values between the two limits. That is why $p(x)$ is often referred to as the probability density. So, $p(x)\,dx$ is the probability (and the area of the bar of height $p(x)$ and width $dx$) in between $x$ and $x + dx$, where $dx$ is the infinitesimally small (smaller than you can think) range!

Note that for the discrete case, the above is the sum of the probabilities of all the mutually exclusive events.

[The sum, $\sum_i p_i$, becomes the integral, $\int p(x)\,dx$, for the continuous case.]

$$\int_{-\infty}^{+\infty} p(x)\,dx = 1 \quad \text{(Normalization)}$$

The above means that the total area under the curve (extended from negative infinity to positive infinity, that is, over the entire stretch of the curve) is unity. This is true because in the discrete case we know that the sum of all the probabilities for all the events should be 1.

For discrete events, we calculated the relative frequency and then the Bar diagram from them.

Here for the continuous case, the bars merge together to form a continuous spectrum and that

is the probability distribution. The relative frequencies tend to the probabilities for

corresponding values of the variable for large number of events.

Now given the probability distribution curve, we would like to know about the shape and size of

the curve, some specific quantities that are representative of the character of the event.

For any discrete set of data collection, we measure the central tendency of the data set. We

commonly calculate mean, mean of square and variance.

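Mean:

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i} = \frac{1}{N}\sum_i f_i x_i = \sum_i \frac{f_i}{N}\, x_i,$$

where $f_i$ is the frequency of occurrence for event $x_i$ and we have total frequency, $N = \sum_i f_i$.

[Note: $f_i/N$ = relative frequency]

Mean of Square:

$$\overline{x^2} = \frac{1}{N}\sum_i f_i x_i^2 = \sum_i \frac{f_i}{N}\, x_i^2$$

Variance:

$$\text{Var}(x) = \sigma^2 = \frac{1}{N}\sum_i f_i (x_i - \bar{x})^2 = \overline{x^2} - \bar{x}^2 = \sum_i \frac{f_i}{N}\, x_i^2 - \left[\sum_i \frac{f_i}{N}\, x_i\right]^2.$$

Standard deviation $\sigma$ is the square root of the variance.

Now for a large number of events, the ratio $f_i/N$ in each of the above formulas becomes the corresponding probability: $f_i/N \to p_i$ as $N$ becomes very large.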

Mathematical Expectation and Mean

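If the probabilities $p_1$, $p_2$, $p_3$ etc. are known for the values $x_1$, $x_2$, $x_3$ and so on, we write

Expectation: $E(x) = \sum_i p_i x_i$, where $\sum_i p_i = 1$

However, when instead of probabilities we are given the frequencies $f_1, f_2, \ldots$ for the quantities $x_1, x_2, \ldots$ that appear in a data set, we calculate

Mean or average: $\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}$

For a large number of events the two coincide, so the mean, the mean of square and the variance can all be written in terms of probabilities.

Now we calculate these quantities for the following dice-throwing experiments.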

Example #1 Throwing of a single dice:

The chance of any side turning up is equal, which is 1 out of 6. We take these as the a priori probabilities for each case and find the mean and variance from the following table.

x        1      2      3      4       5       6       Total
p        1/6    1/6    1/6    1/6     1/6     1/6     1
p x      1/6    2/6    3/6    4/6     5/6     6/6     21/6
p x^2    1/6    4/6    9/6    16/6    25/6    36/6    91/6

From the table, we can calculate the mean, $\mu = \sum p\,x = \frac{21}{6} = 3.5$, and the variance, $\sigma^2 = \sum p\,x^2 - \mu^2 = \frac{91}{6} - (3.5)^2 = \frac{35}{12} \approx 2.92$.

If we plot $p$ against $x$, we obtain the probability distribution for this case. This distribution is uninteresting as we can check that the probabilities for all values of $x$ are the same! The curve obtained by joining the points will be a horizontal straight line.

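In terms of probabilities:

Mean, $\mu = \sum_i p_i x_i$

Mean of Square, $\overline{x^2} = \sum_i p_i x_i^2$

Variance, $\sigma^2 = \sum_i p_i x_i^2 - \mu^2$

$\sigma$ = Standard deviation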

Now we do a similar experiment taking two dice.

Example #2 (Two Dice)

We look for the value of $x$ which is the sum of the two numbers on the top faces of the two dice. Here we shall have $6 \times 6 = 36$ possible combinations of events and $x$ can have a minimum value, $x = 2$, and a maximum value, $x = 12$.

x        2       3       4       5        6        7        8        9        10       11       12       Total
p        1/36    2/36    3/36    4/36     5/36     6/36     5/36     4/36     3/36     2/36     1/36     1
p x      2/36    6/36    12/36   20/36    30/36    42/36    40/36    36/36    30/36    22/36    12/36    252/36
p x^2    4/36    18/36   48/36   100/36   180/36   294/36   320/36   324/36   300/36   242/36   144/36   1974/36

Mean, $\mu = \frac{252}{36} = 7$, Variance, $\sigma^2 = \frac{1974}{36} - 7^2 = \frac{35}{6} \approx 5.83$

Now if we plot $p$ against $x$, taking the values from the above table, we get an interesting distribution: it shows a peak in the middle, at $x = 7$ (the mean value), and it is symmetric around the peak!
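The two-dice table can be generated by counting all 36 combinations. The following Python sketch (variable names are ours) uses exact fractions so that the mean and variance come out without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 combinations give each sum x = 2, 3, ..., 12
combinations = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each sum
p = {x: Fraction(count, 36) for x, count in sorted(combinations.items())}

mean = sum(x * px for x, px in p.items())              # sum of p * x
mean_of_square = sum(x * x * px for x, px in p.items())  # sum of p * x^2
variance = mean_of_square - mean ** 2

print(mean, variance)  # 7 35/6
```

Changing `repeat=2` to a larger number of dice shows how the distribution of the sum becomes smoother and more bell-shaped.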

We can go on doing such experiments with 3 or more dice together, asking for the sum of the values occurring on all the dice and calculating the corresponding probabilities as above. We find that the distribution becomes smoother, retaining the symmetry with the peak value at the mean.

In fact, the envelope of the probability values at different $x$ (joining the tops of the bars) of the discrete distribution slowly assumes a continuous symmetric curve!

In the limit of a large number of dice thrown together, we tend to get a continuous bell-shaped symmetric distribution.

This is the Normal Distribution.

For many naturally occurring quantities, and for random errors of measurement in experiments, the distribution that occurs is approximately the Normal distribution. The bell-shaped symmetric curve is called the Normal curve. For example, if we calculate the height distribution among a population, the probability distribution turns out to be close to Normal. The name ‘normal’ is given as it occurs normally.

Properties of Normal Distribution:

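Symmetric about the mean (mean at the centre or at the peak position)

Approximate area under the curve:

A = 68%, within one standard deviation ($\sigma$) from the mean ($\mu$) on both sides, i.e. between $\mu - \sigma$ and $\mu + \sigma$

A = 95%, within two standard deviations, i.e. between $\mu - 2\sigma$ and $\mu + 2\sigma$

A = 99.7%, within three standard deviations, i.e. between $\mu - 3\sigma$ and $\mu + 3\sigma$

The sum (or the mean) of a large number of independent random variables is approximately normally distributed. This is called the Central Limit Theorem.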
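The three approximate areas can be verified numerically with `NormalDist` from Python's standard `statistics` module. This is a small check, assuming a standard Normal curve with mean 0 and standard deviation 1; the helper name `area_within` is ours.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the Normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


areas = {k: area_within(k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

Because the areas depend only on the number of standard deviations, the same percentages hold for any mean and standard deviation.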

NOTE: The following part can be skipped by students who are not familiar with Calculus.

How the Z-distribution is obtained from the Normal distribution:

Mathematical expression for the Normal distribution:

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1)$$

where $\mu$ = mean, $\sigma$ = standard deviation. The above expression is symmetric around the mean, $x = \mu$.

[The value of the exponential, $e \approx 2.718$]

Normal distributions are often referred to by the symbol: $N(\mu, \sigma^2)$

The total area under the curve, $\int_{-\infty}^{+\infty} P(x)\,dx = 1$

If we put $z = \frac{x - \mu}{\sigma}$, we get $dx = \sigma\,dz$.

Thus we can write the rescaled probability,

$$P(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad (2)$$

Now the above is a symmetric distribution around $z = 0$.

So the Normal distribution (1) has become a ‘Z-distribution’ in (2). This is nothing but a normal distribution with mean = 0 and standard deviation = 1.

We have to remember that the area under the curve between two values of $z$ gives us the total probability:

$$\text{Area} = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz.$$

Now instead of actually doing the integration over $z$, we are supplied with the $z$-score and we find the area under the curve (hence the total probability) between two limits from the table. (See the z-score table.)

Consider the following typical situations where we have to calculate the areas from the z-distribution:

(Total area under the curve = 1)

(Area between $-\infty$ and 0 is 0.5, and the area between 0 and $+\infty$ is 0.5, because of symmetry)

(Area between 0 and any other value of $z$)

(Area between two positive values of $z$ or between two negative values)

(Area between a negative value and a positive value)

(Area to the left of a negative value or to the right of a positive value)

Important:

In the z-score table we always look for the area between zero and any other value of $z$ (as the integral is actually done that way). So, zero is always the reference point.

Finally, the area between any two values of $z$ is obtained by adding or subtracting the areas involving zero. This will be clear from the following examples.

Examples: (While solving the following problems, consult the z-score table given in the appendix.)

#1. In the Geography examination, the marks distribution is known to be Normal where the mean is 52 and the standard deviation is 15. Determine the z-scores of students receiving marks: (i) 40, (ii) 95, (iii) 52.

Solution: Here $\mu = 52$, $\sigma = 15$ and $z = \frac{x - \mu}{\sigma}$.

(i) $z = \frac{40 - 52}{15} = -0.8$, (ii) $z = \frac{95 - 52}{15} \approx 2.87$, (iii) $z = \frac{52 - 52}{15} = 0$

So, we see the z-scores can be negative, positive or zero.

#2. Find the area under the normal curve in each of the following cases:

(i) Between $z = 0$ and $z = 1.2$

Area = 0.3849 from table.

(ii) Between $z = -0.68$ and $z = 0$

Area = 0.2518

(Note: The area is equal to the area between $z = 0$ and $z = 0.68$ as the curve is symmetric.)

(iii) Area between $z = -0.46$ and $z = 2.21$

Area = (area between $z = 0$ and 2.21) + (area between $z = 0$ and -0.46)

= 0.4864 + 0.1772 = 0.6636

(Note: The areas are added as they are on both sides of $z = 0$.)

(iv) Area between $z = 0.81$ and $z = 1.94$

Required area = (area between $z = 0$ and 1.94) - (area between $z = 0$ and 0.81)

= 0.4738 - 0.2910 = 0.1828

(Note: There is the subtraction as the two areas are on the same side of $z = 0$.)

(v) To the left of $z = -0.6$

Required area = 0.5 - (area between $z = -0.6$ and $z = 0$)

= 0.5 - 0.2257 = 0.2743

(vi) To the right of $z = -1.28$

Required area = (area between $z = -1.28$ and $z = 0$) + 0.5

= (area between $z = 0$ and $z = 1.28$) + 0.5

= 0.3997 + 0.5 = 0.8997

#3. Among 1000 students, the mean score in the final examination is 25 and the standard deviation is 4.0. Assume the distribution is Normal. Find the following.

(a) How many students score between 22 and 27?

$\mu = 25$, $\sigma = 4.0$, so $z_1 = \frac{22 - 25}{4} = -0.75$ and $z_2 = \frac{27 - 25}{4} = 0.5$.

So the probability is the area under the curve between -0.75 and 0.5

= (area between 0 and -0.75) + (area between 0 and 0.5)

= 0.2734 + 0.1915 = 0.4649

The number of students in this marks range = $0.4649 \times 1000 \approx 465$

(b) How many students score above 30?

$z = \frac{30 - 25}{4} = 1.25$

Probability = area to the right of $z = 1.25$

= 0.5 - (area between 0 and 1.25)

= 0.5 - 0.3944 = 0.1056

The number of students = $0.1056 \times 1000 \approx 106$

(c) How many students score below 15?

$z = \frac{15 - 25}{4} = -2.5$

Area = 0.5 - (area between 0 and -2.5) = 0.5 - 0.4938 = 0.0062

The number of students = $0.0062 \times 1000 \approx 6$

(d) How many score 24?

Here we have to calculate the area between 23.5 and 24.5: $z_1 = \frac{23.5 - 25}{4} = -0.375 \approx -0.38$, $z_2 = \frac{24.5 - 25}{4} = -0.125 \approx -0.13$

Area between $z_1$ and $z_2$

= (area between 0 and -0.38) - (area between 0 and -0.13)

= 0.1480 - 0.0517 = 0.0963

The number of students = $0.0963 \times 1000 \approx 96$.
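Instead of a printed z-score table, the areas in problem #3 can be computed with `NormalDist` from Python's standard `statistics` module. The sketch below uses the cumulative area from $-\infty$ up to a given mark, so small differences from the table-based answers (which use z-scores rounded to two decimals) are to be expected. The variable names are ours.

```python
from statistics import NormalDist

marks = NormalDist(mu=25, sigma=4)   # distribution of examination scores
n_students = 1000

# cdf(x) is the area under the curve to the left of x
p_between_22_and_27 = marks.cdf(27) - marks.cdf(22)
p_above_30 = 1 - marks.cdf(30)
p_below_15 = marks.cdf(15)
p_score_24 = marks.cdf(24.5) - marks.cdf(23.5)   # 24 taken as 23.5 to 24.5

for label, p in [
    ("between 22 and 27", p_between_22_and_27),
    ("above 30", p_above_30),
    ("below 15", p_below_15),
    ("score 24", p_score_24),
]:
    print(f"{label}: probability {p:.4f}, about {round(p * n_students)} students")
```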

Symmetry of Distribution, Skewness

We have seen that a Normal distribution is symmetric around its peak (most probable value or

the value for which the probability is the highest). In a symmetric distribution the mean,

median and mode are at the same position.

Skewness is any deviation from symmetry, or we can say, lack of symmetry.

Coefficient of skewness = (Mean - Mode) / Standard deviation

The above coefficient can be positive or negative. The two figures below demonstrate negative and positive skewness; the distributions are correspondingly called negatively skewed and positively skewed distributions.

(Negative Skewness: Mean < Mode) (Positive Skewness: Mean > Mode)

For a symmetric distribution, skewness is zero.

The distribution we are discussing is a unimodal distribution, that is, a distribution which has a single mode or one peak. But in many practical cases we can have a distribution with many peaks or many modes. For example, a distribution with two peaks (in fig.) is called a bimodal distribution.

(Bimodal distribution)

Combination Rules:

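When we scale a variable, that is, when we multiply it by a number or add a number to it, we need to know how the scaled variable behaves. Does it have the same statistical measures? Does it follow the same kind of distribution? We also ask the same questions for two or more variables when they are scaled and added together to form a combined variable.

For a scaled variable $aX + b$:

Mean: $E(aX + b) = a\,E(X) + b$

Variance: $\text{Var}(aX + b) = a^2\,\text{Var}(X)$

If $X$ has a Normal distribution, $aX + b$ also has a Normal distribution.

For a combination $aX + bY$ of two independent variables:

Mean: $E(aX + bY) = a\,E(X) + b\,E(Y)$

Variance: $\text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y)$

If $X$ and $Y$ separately have Normal distributions, $aX + bY$ then also has a Normal distribution.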

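Following the combination rules in the above box, we can solve the following problem.

Example:

The weight of individual people follows a Normal distribution, $N(\mu, \sigma^2)$. What will be the probability distribution of the total weight of 10 people taken together?

Ans. Here each person's weight has mean $\mu$ and variance $\sigma^2$.

Mean of the total weight of 10 people, $\mu + \mu + \ldots + \mu = 10\mu = 40$

Variance, $\sigma^2 + \sigma^2 + \ldots + \sigma^2 = 10\sigma^2 = 500$

The probability distribution of the total weight of 10 people taken together is $N(40, 500)$.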

Binomial and Poisson Probability Distributions

Binomial Probability:

Suppose the probability of a certain event occurring is $p$ and of the event not occurring is $q = 1 - p$. In a total of $N$ trials, the particular event occurs $n$ times, each with probability $p$, and does not occur $(N - n)$ times, each with probability $q$. Also, we have to know which $n$ events will occur out of the total $N$ events. The number of ways we can choose them is the number of combinations $= {}^{N}C_n$. Consider a variable $x$ which is equal to the relative frequency, $x = n/N$.

As the events are considered independent, the joint probability will be

$$P(n) = {}^{N}C_n\, p^n q^{N-n}$$

The above probability is called the binomial probability.

The meaning of the symbol ${}^{N}C_n$ is given in the box below.

[Those who are not familiar with the above mathematical notations and rules may consult the necessary introduction given in the following box.]

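Now consider the following table based on the binomial probability:

n       0       1               2                               …    N
P(n)    q^N     N p q^(N-1)     [N(N-1)/2] p^2 q^(N-2)          …    p^N

If we add all the terms of the second row above, we get the following binomial expansion:

$$(q + p)^N = q^N + N p\, q^{N-1} + \frac{N(N-1)}{2!}\, p^2 q^{N-2} + \ldots + p^N = 1 \qquad (1)$$

From the expression (1) above, we can easily check the following known algebraic formulas:

$$(p + q)^2 = p^2 + 2pq + q^2, \qquad (p + q)^3 = p^3 + 3p^2 q + 3p q^2 + q^3$$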

Factorial: $N! = N (N-1)(N-2) \cdots 3 \cdot 2 \cdot 1$

For example, five factorial, $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$.

Factorials of negative integers have no meaning, and $0! = 1$.

Note that we can write $N! = N \times (N-1)!$

Permutation: In how many different ways can $N$ objects be arranged among themselves?

The answer is the permutation of $N$ objects, $N!$. For example, for three objects A, B, C, the different arrangements are ABC, ACB, BCA, BAC, CAB, CBA: total 6 ways $= 3!$

Combination: This is the number of ways $n$ objects can be selected from $N$ objects: ${}^{N}C_n = \frac{N!}{n!\,(N-n)!}$

For example, if we want to know in how many ways 2 students can be selected from a total of 3 students, the answer is ${}^{3}C_2 = \frac{3!}{2!\,1!} = 3$.

Also note for quick calculations, ${}^{N}C_0 = \frac{N!}{0!\,N!} = 1$, ${}^{N}C_1 = \frac{N!}{1!\,(N-1)!} = N$ and ${}^{N}C_n = {}^{N}C_{N-n}$.

The coefficients of the terms on the right-hand side of such expansions can be arranged in the following triangular form, which is called Pascal’s triangle:

                        1
                      1   1
                    1   2   1
                  1   3   3   1
                1   4   6   4   1
              1   5  10  10   5   1
            1   6  15  20  15   6   1
          1   7  21  35  35  21   7   1
        1   8  28  56  70  56  28   8   1

The Rule:

Each number in a row (except the leftmost and rightmost ones) is the sum of the two numbers on either side of it in the preceding row.

So, from the row for $N = 8$ in Pascal’s triangle we can easily write the binomial expansion:

$$(p+q)^8 = p^8 + 8p^7q + 28p^6q^2 + 56p^5q^3 + 70p^4q^4 + 56p^3q^5 + 28p^2q^6 + 8pq^7 + q^8$$

Remember that each term represents a binomial probability. A binomial distribution is a collection of these discrete binomial probabilities. Note: the probabilities add up to 1, since $p + q = 1$.

Example #1:

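Five independent shots are fired at a target. The probability of a hit from each shot is 0.4.

Q. What is the probability that two shots will hit the target?

Ans. Here $N = 5$, $n = 2$, $p = 0.4$, $q = 0.6$:

$$P(2) = {}^{5}C_2\,(0.4)^2 (0.6)^3 = 10 \times 0.16 \times 0.216 = 0.3456$$

Q. What is the probability that there will be more than two hits?

Ans. Prob. $= P(3) + P(4) + P(5) = \frac{5!}{3!\,2!}(0.4)^3(0.6)^2 + \frac{5!}{4!\,1!}(0.4)^4(0.6) + (0.4)^5 = 0.2304 + 0.0768 + 0.01024 = 0.31744$

Q. What is the expectation value of the hits (that is, the mean number of hits on the target out of all five shots)?

Ans. For this we have to calculate the probabilities $P(0)$, $P(1)$, $P(2)$, … for the corresponding number of hits 0, 1, 2, …

The expectation value, $E(n) = \sum_n n\,P(n) = 0 \times P(0) + 1 \times P(1) + 2 \times P(2) + 3 \times P(3) + 4 \times P(4) + 5 \times P(5)$

= 0.2592 + 0.6912 + 0.6912 + 0.3072 + 0.0512 = 2.0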
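The three answers of this example can be checked with a short Python sketch that uses `math.comb` for the number of combinations. The function name `binomial_probability` is ours and the code is meant only as an illustration of the binomial formula.

```python
from math import comb


def binomial_probability(n, N, p):
    """Probability of exactly n occurrences in N independent trials."""
    q = 1 - p
    return comb(N, n) * p ** n * q ** (N - n)


N, p = 5, 0.4   # five shots, hit probability 0.4 for each shot

# Probabilities P(0), P(1), ..., P(5)
probabilities = [binomial_probability(n, N, p) for n in range(N + 1)]

p_two_hits = probabilities[2]
p_more_than_two = sum(probabilities[3:])
expected_hits = sum(n * P for n, P in enumerate(probabilities))

print(round(p_two_hits, 5), round(p_more_than_two, 5), round(expected_hits, 5))
```

The expectation value found this way agrees with the product $Np = 5 \times 0.4 = 2$.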

Example #2:

Now, imagine a situation where we toss 8 coins together or we toss one coin 8 times consecutively. We measure the relative occurrence of Head in 8 trials. Let us attach values, Head = 1 and Tail = 0. So, we can think of a variable $x = n/8$ which can take the values 0, 1/8, 2/8, 3/8, 4/8, … and so on up to 1. Thus we can associate probabilities $P(x)$ with the values of $x$ directly from Pascal’s triangle (or by using the formula). Note that the probability of Head occurring is $p = \frac{1}{2}$ and of Head not occurring is $q = \frac{1}{2}$.

If we now plot $P(x)$ against $x$, we get a symmetric discrete distribution with the peak value at $x = \frac{1}{2}$.

For a large number of trials, this distribution becomes the Normal distribution. Therefore, we can say the following:

The Binomial probability distribution for a random variable becomes the Normal distribution for a large number of trials.

Poisson Distribution:

The Poisson distribution is applicable to random but extremely rare events. For example, if we count the number of phone calls received in a span of 5 minutes over a day, or count the number of cars passing on the road in a time interval of 1 minute, we will have a distribution that is not quite symmetric like the Normal distribution, although it is random. This distribution is Poisson and it appears like a skewed one.

If $x$ is a variable that takes the values $x = 0, 1, 2, \ldots$, the Poisson probability is

$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!},$$

where $\lambda$ = mean of the distribution.

Some Characteristics of Poisson distribution:

One interesting thing is that for a Poisson distribution, the mean and the variance are the same. For a data set, if the mean and variance are not found to be approximately equal, then the Poisson distribution will not be a suitable model.

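We can arrive at the Poisson distribution from the Binomial distribution. How?

Consider $\lambda$ to be the mean number of occurrences out of a total of $N$ trials. We can then say that the probability of occurrence is $p = \frac{\lambda}{N}$ and so $q = 1 - \frac{\lambda}{N}$.

For $N$ trials, we write the Binomial probability:

$$P(n) = \frac{N!}{n!\,(N-n)!} \left(\frac{\lambda}{N}\right)^n \left(1 - \frac{\lambda}{N}\right)^{N-n}$$

$$\approx \frac{N^n}{n!}\,\frac{\lambda^n}{N^n} \left(1 - \frac{\lambda}{N}\right)^{N}$$

[We can write $\frac{N!}{(N-n)!} \approx N^n$ and $\left(1 - \frac{\lambda}{N}\right)^{-n} \approx 1$ as $N$ becomes large.]

$$= \frac{\lambda^n e^{-\lambda}}{n!}$$

[In the limit of large $N$, $\left(1 - \frac{\lambda}{N}\right)^N \to e^{-\lambda}$.]

For this distribution: Mean, $\mu = \lambda$; Variance, $\sigma^2 = \lambda$.

In the above derivation, the approximation from the Binomial to the Poisson distribution is possible only when we assume $N$ very large and $p$ very small (and thus $q$ close to 1). The small value of $p$ means a ‘rare event’!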
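The limit can be seen numerically by comparing the two probabilities for increasing $N$ while the mean $\lambda = Np$ is kept fixed. The sketch below takes $\lambda = 2$, an arbitrary value chosen by us for illustration; the function names are ours as well.

```python
from math import comb, exp, factorial


def binomial_probability(n, N, p):
    """Binomial probability of n occurrences in N trials."""
    return comb(N, n) * p ** n * (1 - p) ** (N - n)


def poisson_probability(n, lam):
    """Poisson probability of n occurrences when the mean is lam."""
    return lam ** n * exp(-lam) / factorial(n)


lam = 2.0   # fixed mean (illustrative assumption)


def largest_difference(N):
    """Largest gap between the two distributions for n = 0, 1, ..., 10."""
    p = lam / N   # the event becomes rarer as N grows
    return max(
        abs(binomial_probability(n, N, p) - poisson_probability(n, lam))
        for n in range(11)
    )


differences = {N: largest_difference(N) for N in (10, 100, 1000)}

for N, diff in differences.items():
    print(f"N = {N:4d}, p = {lam / N:.3f}, largest difference = {diff:.5f}")
```

The differences shrink as $N$ grows and $p$ falls, which is the ‘rare event’ limit described above.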

In the following figures we demonstrate how a symmetric binomial distribution (which would become a Normal distribution for a large number of events) becomes a Poisson distribution as the value of $N$ is increased (and so $p$ is decreased).

A Binomial distribution (with large $N$ and small $p$) is again plotted below along with an actual Poisson distribution (continuous curve) of appropriately chosen λ.

It is often difficult to differentiate a Poisson distribution from a Normal distribution with the naked eye when the mean and the variance of the Normal distribution are chosen suitably so as to match the Poisson distribution. Only a close examination can reveal the difference! Look at the following graph.

The Measure of Correlation

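Let us first note that the variance of a set of data $x$ is given by

$$\sigma_x^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2 = \frac{1}{N}\sum_i x_i^2 - \bar{x}^2$$

The variance of another set of data $y$ is likewise

$$\sigma_y^2 = \frac{1}{N}\sum_i (y_i - \bar{y})^2 = \frac{1}{N}\sum_i y_i^2 - \bar{y}^2$$

We can write the above two expressions in the following form:

$$\sigma_x^2 = \frac{1}{N}\sum_i x_i x_i - \left(\frac{1}{N}\sum_i x_i\right)\left(\frac{1}{N}\sum_i x_i\right)$$

and

$$\sigma_y^2 = \frac{1}{N}\sum_i y_i y_i - \left(\frac{1}{N}\sum_i y_i\right)\left(\frac{1}{N}\sum_i y_i\right)$$

Therefore, we can also define a similar kind of formula involving the two variables,

$$\sigma_{xy} = \frac{1}{N}\sum_i x_i y_i - \left(\frac{1}{N}\sum_i x_i\right)\left(\frac{1}{N}\sum_i y_i\right),$$

which is called the covariance of the two sets of data.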

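The linear correlation between two sets of data is defined by the following coefficient:

$$r = \text{Corr}(x, y) = \frac{\sigma_{xy}}{\sigma_x\,\sigma_y} = \frac{\text{Cov}(x, y)}{\sqrt{\text{Var}(x)\,\text{Var}(y)}}$$

The above coefficient is called Pearson’s correlation coefficient. Correlation is used to test how strongly a pair of variables is related.

Note: In many books, the correlation coefficient is written in the form $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xx} = \sum_i (x_i - \bar{x})^2$, $S_{yy} = \sum_i (y_i - \bar{y})^2$ and $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$.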

Properties of $r$:

The coefficient $r$ measures the strength of a linear relationship.

The range: $-1 \le r \le +1$

$r = +1$ ⇨ perfect positive linear correlation

$r = -1$ ⇨ perfect negative linear correlation

$r = 0$ ⇨ no linear correlation

We can have an idea of the kind of correlation between two sets of data from the ($x$, $y$) scatter plots.

Correlation Matrix:

For the relations among more than two sets of variables, it is useful to present the correlation coefficients between every two sets of variables in the form of a table. This is called the correlation matrix.

For example, for three sets of variables, $x$, $y$, $z$, we have the following table:

        $x$         $y$         $z$
$x$     1           $r_{xy}$    $r_{xz}$
$y$     $r_{yx}$    1           $r_{yz}$
$z$     $r_{zx}$    $r_{zy}$    1

Note that in the above table, we have only three different entries. The reason is that the matrix is symmetric, as the correlation between $x$ and $y$ is the same as between $y$ and $x$, and so on: $r_{xy} = r_{yx}$, $r_{yz} = r_{zy}$, $r_{zx} = r_{xz}$. Also, the correlation of a variable with itself is trivial; it is always the perfect correlation ($r_{xx} = r_{yy} = r_{zz} = 1$). So we have only three independent useful quantities.
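As an illustration, a correlation matrix for three variables can be built as sketched below. The data values and the function names are hypothetical, chosen only to show the symmetry and the unit diagonal; they do not come from the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the sums of x, y, x*x, y*y and x*y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x**2) * sqrt(n * sum_yy - sum_y**2)
    return numerator / denominator

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data for three variables (illustration only)
data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 1, 4, 3, 5],
    "z": [5, 3, 4, 1, 2],
}
matrix = correlation_matrix(data)
for a in data:
    print(a, [round(matrix[a][b], 3) for b in data])
```

Only the three entries above the diagonal carry independent information, in line with the remark in the text.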

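Practical Calculation of Correlation Coefficient:

For practical calculations, we often use the following formula, obtained by multiplying the numerator and the denominator of the formula for $r$ by $n^2$:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\;\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$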

Example #1

In the following table, some values of two variables $x$ and $y$ are given in two columns. We calculate the necessary quantities in the other columns to be put in the correlation formula.

Sl no.   $x$    $y$    $x^2$   $y^2$   $xy$
1        2      5      4       25      10
2        4      9      16      81      36
3        5      11     25      121     55
4        6      10     36      100     60
5        8      12     64      144     96
Total    25     47     145     471     257

Here we find $\sum x_i = 25$, $\sum y_i = 47$, $\sum x_i^2 = 145$, $\sum y_i^2 = 471$, $\sum x_i y_i = 257$ and $n = 5$.

The correlation coefficient,

$$r = \frac{5 \times 257 - 25 \times 47}{\sqrt{5 \times 145 - 25^2}\;\sqrt{5 \times 471 - 47^2}} = \frac{110}{\sqrt{100}\;\sqrt{146}} \approx 0.91$$

The above calculated value of the correlation coefficient is close to 1. Thus we may say that there is a good (positive) correlation between the two sets of data.

Example #2

Calculate the correlation coefficient from the following height-weight data:

Height   170 172 181 157 150 168 166 175 177 165 163 152 161 173 175
Weight   65  66  69  55  51  63  61  75  72  64  61  52  60  70  72

Example #3

Following is a table that represents the data for shoe sizes vs. height achieved by Olympic participants in a high jump event. Both the columns are measured in inches.

Shoe size   12.0 7.0 4.5 11.0 8.5 5.0 12.0 7.5 8.5 5.5 9.5 5.5 10.5 12.0 14.0 7.0 7.0
Height      72 64 62 70 69 65 72 65 65 65 68 61 69 77 73 65 67

Shoe size   12.0 12.0 7.0 13.0 11.0 12.0 4.5 10.5 10.0 10.0 13.0 7.5 4.5 8.5 14.0 10.0 6.5
Height      71 73 64 71 71 72 61 71 66 67 73 69 61 70 75 72 66

Follow the same procedure as in Example #1 and calculate the correlation coefficient.

For a visual impression, we may draw a scatter plot of the paired data. The relationship between them seems to be linear, which should be well reflected in the correlation coefficient.
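The procedure of Example #1 can be sketched in a few lines of code. The function name `pearson_r` is an assumption made for this illustration; the data are the $x$, $y$ columns of Example #1 and the height-weight rows of Example #2.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from the practical formula based on sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x**2) * sqrt(n * sum_yy - sum_y**2)
    return numerator / denominator

# Example #1: the x and y columns of the table
x1 = [2, 4, 5, 6, 8]
y1 = [5, 9, 11, 10, 12]
r1 = pearson_r(x1, y1)
print("Example #1: r =", round(r1, 2))

# Example #2: height-weight data
height = [170, 172, 181, 157, 150, 168, 166, 175, 177, 165, 163, 152, 161, 173, 175]
weight = [65, 66, 69, 55, 51, 63, 61, 75, 72, 64, 61, 52, 60, 70, 72]
r2 = pearson_r(height, weight)
print("Example #2: r =", r2)
```

For Example #1 the sums are 25, 47, 145, 471 and 257, so the function evaluates $110 / (\sqrt{100}\,\sqrt{146})$, which is about 0.91.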

Rank Correlation:

Spearman’s rank correlation coefficient between two sets of ranked variables is defined below.

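Suppose the original data sets for two variables $x$ and $y$ are rank-ordered to give two sets of ranks: $R(x_1), R(x_2), \dots, R(x_n)$ and $R(y_1), R(y_2), \dots, R(y_n)$.

We calculate the differences between the ranks of the two sets: $d_i = R(x_i) - R(y_i)$.

The rank correlation coefficient:

$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$$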
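A minimal sketch of the rank correlation calculation follows, using the standard Spearman formula $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$. It assumes that there are no tied values (ties would need averaged ranks, which are not handled here). The function names are illustrative, and the data are the $x$, $y$ values of Example #1.

```python
def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rank correlation from the squared rank differences."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

x = [2, 4, 5, 6, 8]
y = [5, 9, 11, 10, 12]
print(ranks(x), ranks(y), spearman_rho(x, y))
```

Here the ranks of $y$ are 1, 2, 4, 3, 5, so only two rank differences are non-zero and $\sum d_i^2 = 2$.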

Time Series, Auto Correlation

What is a Time Series?

A time series is a set of observations generated sequentially in time. Any electrical signal, stock exchange data (the daily trading curve), an ECG curve, or a record of temperature or humidity over a period are all basically time series.

A time series can tell us a lot about what is happening in the system, and this enables us to make predictions with a certain degree of accuracy.

It is sometimes important to see if there is any correlation among the data points within a given time series, that is, between data taken at some time and data taken at another time. This we call autocorrelation. It can throw some light on the hidden pattern inside the time series data.

Autocorrelation:

Remember, in the correlation formulas (on p. 34 and on p. 35) before, we considered the pair of quantities $x_i$ and $y_i$, which correspond to the same parameter or serial number $i$. Here we just have to consider a pair of values $x_i$ and $x_{i+k}$ of the same variable but at different times. If index $i$ corresponds to a time $t$, $i+k$ will correspond to another time $t + \tau$, where $\tau$ is the time lag.
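The idea can be sketched as follows: the series is paired with a copy of itself shifted by a lag, and the ordinary correlation coefficient of the pairs is computed. This is one simple way to estimate autocorrelation, not necessarily the formula intended in the notes. The series, the lag values and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x**2) * sqrt(n * sum_yy - sum_y**2)
    return numerator / denominator

def autocorrelation(series, lag):
    """Correlation between the values at index i and at index i + lag."""
    if lag <= 0 or lag >= len(series) - 1:
        raise ValueError("lag must be between 1 and len(series) - 2")
    return pearson_r(series[:len(series) - lag], series[lag:])

# Hypothetical periodic signal with period 4
series = [1, 2, 3, 2] * 6
for lag in (1, 2, 3, 4):
    print(lag, round(autocorrelation(series, lag), 3))
```

A lag equal to the period gives a coefficient of 1, while a lag of half the period gives a negative value; this is the kind of hidden pattern that autocorrelation can reveal.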

Regression

If two variables are related, that is, if there is a significant correlation between them, we can make a quantitative prediction of one variable for some value of the other. This is the basis of regression analysis.

There are two types of regression analysis:

Linear regression ⇨ when the data approximately follow a straight line

Non-linear regression ⇨ when no linear relationship exists; in general, a polynomial is considered to fit the data points.

A regression line is drawn through the scatter plot of two variables. The line is chosen so that it comes as close as possible to all the points.

Regression analysis is widely used for prediction and forecasting.

Linear Regression:

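Suppose we have a set of data for a pair of variables ($x$, $y$) and we predict that the dependent variable $y$ can be obtained from the independent variable $x$, where they obey a linear regression equation $y = mx + b$, where the coefficients $m$ and $b$ are given by the following:

$$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

[Derivations of the above formulas are given in the appendix.]

So the regression equation is the line with slope $m$ and intercept $b$ which passes through the point $(\bar{x}, \bar{y})$ [mean values].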

Example:

We plot the data in Example #3 above and obtain a scatter plot. Next we calculate the values of the parameters $m$ and $b$ by the above formulas. Then we can draw a straight line with slope $m$ and intercept (on the y-axis) $b$. We can verify that this straight line, superposed on the scatter data, is the best-fit line for the data points. This straight-line fit is also called the least square fit.
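The calculation for the shoe size vs. height data can be sketched as below. The names `fit_line`, `m` and `b` are assumptions for this illustration (slope and intercept written as in the least square appendix), and the two lists are the 34 data pairs of Example #3 typed in order.

```python
def fit_line(xs, ys):
    """Least square slope m and intercept b of the line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_xx - sum_x**2
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_xx * sum_y - sum_x * sum_xy) / denominator
    return m, b

shoe_size = [12.0, 7.0, 4.5, 11.0, 8.5, 5.0, 12.0, 7.5, 8.5, 5.5, 9.5, 5.5,
             10.5, 12.0, 14.0, 7.0, 7.0, 12.0, 12.0, 7.0, 13.0, 11.0, 12.0,
             4.5, 10.5, 10.0, 10.0, 13.0, 7.5, 4.5, 8.5, 14.0, 10.0, 6.5]
height = [72, 64, 62, 70, 69, 65, 72, 65, 65, 65, 68, 61,
          69, 77, 73, 65, 67, 71, 73, 64, 71, 71, 72,
          61, 71, 66, 67, 73, 69, 61, 70, 75, 72, 66]

m, b = fit_line(shoe_size, height)
mean_x = sum(shoe_size) / len(shoe_size)
mean_y = sum(height) / len(height)
print("slope =", m, "intercept =", b)
print("value of the line at the mean of x:", m * mean_x + b, "mean of y:", mean_y)
```

The last line illustrates the remark that the regression line passes through the point of mean values.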

Sampling

Basic Concept:

What is sampling?

Sampling is to take a subsection of the population for a particular study. The aim is to

select the data sample in order to represent the total data set.

In statistics, population means the total collection of data. When the population or the

entire collection of data is studied, it is called census.

In short, population is the total set and the sample is the subset of it.

Why is sampling done?

When the number of elements in a population is large, it is often not possible to investigate the population completely due to lack of time, money and resources. This is why sampling is necessary.

Sampling is done in such a way that the subset of data represents the entire set.

If a TV channel wants to know the popularity of a program, it would be expensive to ask everybody’s opinion. Instead, a subsection of viewers is interviewed and the data are collected.

Methods of Sampling:

A sample of size $n$ means there are $n$ data points in the collection. A sample of size $n$ is collected from a population of size $N$ in such a way that all the features of the population are well represented by it.

If a sampling method over-represents or under-represents a feature of the population, it is said to be biased. The aim of any selection method is to reduce the chance of bias as far as possible.

There are several methods of sampling; among them the most common is the random

sampling.

Random sampling:

For a sample of size $n$, we collect $n$ data from the population. We collect many such samples for our evaluation. If this is done randomly so that each group of size $n$ taken from the population has an equal chance of getting selected, we call this random sampling. Sometimes, it is called simple random sampling.

For random sampling, the successive drawings have to be independent.

Let us suppose we want to select a sample of size 100 from a population of size 10000. In the case of random sampling, we select the elements (that is, which element is to be picked) with the help of a random number (generated in a computer), by consulting a random number table, or by some kind of dice throwing.
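With computer-generated random numbers, the selection of 100 elements out of 10000 can be sketched as below. The population is assumed to be numbered from 1 to 10000 and drawing is done without replacement; the function name and the seed are illustrative choices.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct serial numbers from 1..population_size at random."""
    rng = random.Random(seed)  # a fixed seed makes the illustration repeatable
    return rng.sample(range(1, population_size + 1), sample_size)

sample = simple_random_sample(10000, 100, seed=42)
print(sorted(sample)[:10])
```

Every group of 100 serial numbers has the same chance of being produced by `rng.sample`, which is the defining property of simple random sampling stated above.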

Systematic Sampling:

If simple random sampling from the population is not possible, systematic sampling may be done. First, the population is enumerated from 1 onwards. If a sample of size $n$ from a population of size $N$ is to be obtained, every $k$-th item is selected, where $k = N/n$. First a random number between 1 and $k$ is selected and the corresponding item is taken as the 1st element. After this, every $k$-th element is taken.

Illustration (population listed by Sl no. and value, with $k = 3$): select a random number between 1 and 3; choose 2, for example. Start with item #2 and then take items #5, #8, #11, and so on.

Stratified Sampling:

In this method, the population is first divided into groups (strata). Each element of the sample belongs to one such group.

Divide the population into non-overlapping groups containing $N_1, N_2, \dots$ data each, such that $N_1 + N_2 + \dots = N$. Next do simple random sampling to collect one or a few elements from each group.

Suppose a population is classified into several groups according to age or something like that. Then from each group random samples are collected.

Note: This is also called restricted random sampling.
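Both selection schemes can be sketched in code. The following is only an illustration under simple assumptions: the population is numbered from 1 onwards, the step is the population size divided by the sample size, and the strata (three age groups here) are hypothetical.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Take every k-th serial number, beginning from a random start in 1..k."""
    k = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [start + j * k for j in range(sample_size)]

def stratified_sample(strata, per_group, seed=None):
    """Simple random sample of per_group elements from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_group) for name, members in strata.items()}

# Population of 12, sample of 4: step k = 3, starting from item 2
print(systematic_sample(12, 4, start=2))

# Hypothetical strata: serial numbers grouped by age class
strata = {
    "young": list(range(1, 41)),
    "middle": list(range(41, 81)),
    "old": list(range(81, 101)),
}
picked = stratified_sample(strata, 2, seed=1)
print(picked)
```

With a start of 2 and a step of 3, the systematic sample consists of items 2, 5, 8 and 11, as in the illustration in the text.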

Cluster Sampling:

In this method, like before, the population is divided into groups called clusters. Then

clusters are taken randomly and the elements are collected from them as sample.

Any method of sampling that uses (probabilistically) random selection is in general

called probability sampling.

Sampling variation:

When sampling from a population is done, we may take not one sample but several different samples of the same size. The differences among these samples are called sampling variation.

In practice, we often draw only one sample or one set of data from a population. But we may not be sure what would happen if we drew several other samples. Will we get the same result? The answer is no. If we look at the mean value, we see that the mean is not the same for all the samples that we are able to draw. We then get a distribution of the sample means.

$N$ = population size, $n$ = sample size, $n/N$ = the sampling fraction.

Many samples of the same size yield a sampling distribution.

The sampling distributions are usually assumed to follow some well-known probability distribution.

We look for various properties from the distribution curves.

We examine how the variation of sample size can affect the properties.

From experience and theory, we can say that the variability of sampling distributions decreases as the sample size increases.
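The last statement can be illustrated by a small simulation. Everything in it is an assumption made for the illustration: a population of 10000 uniformly distributed values, 500 samples per sample size, and a fixed seed.

```python
import random
from statistics import mean, pstdev

def spread_of_sample_means(population, sample_size, n_samples, rng):
    """Standard deviation of the means of n_samples random samples."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]
    return pstdev(means)

rng = random.Random(0)  # fixed seed so that the run is repeatable
population = [rng.uniform(0, 100) for _ in range(10000)]

spread_small = spread_of_sample_means(population, 10, 500, rng)
spread_large = spread_of_sample_means(population, 100, 500, rng)
print("spread of sample means, sample size 10 :", spread_small)
print("spread of sample means, sample size 100:", spread_large)
```

The sample means scatter much less for the larger sample size, which is what is meant by the variability of the sampling distribution decreasing with sample size.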

Hypothesis Testing

What is Hypothesis?

On the basis of sample information, we make certain decisions about the population. In taking such decisions we make certain assumptions. These assumptions are known as statistical hypotheses.

[Note: A collected set of data points which is a part of the population (a small number of data) is called a sample. The process of selection is called sampling. The complete set of data under study is called the population.]

How to test a Hypothesis?

Assuming the hypothesis to be correct, we calculate the probability of getting the observed sample. If this probability is less than a certain assigned value, the hypothesis is rejected.

The hypothesis that there is no significant difference between the observed value and the expected value is called the Null Hypothesis.

Test of significance:

The tests which enable us to decide whether to accept or to reject the null hypothesis are called tests of significance. If the differences between the sample values and the population values are significantly large, the null hypothesis is to be rejected.

Student t-test:

Let $x_1, x_2, \dots, x_n$ be the elements of a set of data from a random sampling. The sample is drawn from a population that is assumed to obey the Normal distribution.

$\mu$ = the actual mean of the distribution,

$\bar{x}$ = the sample mean.

A parameter $t$ is calculated as follows:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

where $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ and $n$ = sample size.

Example:

Q. The average life span of a citizen of India is 70 years. The average value obtained from a sample of 100 people is 75. The standard deviation is 40. Find if the claim is accepted using the level of significance of 0.05.

Ans. Here $\mu = 70$, $\bar{x} = 75$, $s = 40$, $n = 100$, so that

$$t = \frac{75 - 70}{40/\sqrt{100}} = 1.25$$

Now at the level of significance 0.05, we know (from the standard table) that the tabular value of $t$ is about 1.98 (two-tailed, 99 degrees of freedom). The calculated value of $t$ < the tabular value of $t$ [1.25 < 1.98].

Thus the claim is accepted at the 5% level of significance.
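The arithmetic of this example can be sketched as follows, taking the usual one-sample form $t = (\bar{x} - \mu)/(s/\sqrt{n})$. The function name is illustrative, and the critical value 1.984 (two-tailed, 5% level, 99 degrees of freedom) is taken from a standard $t$-table rather than from the notes.

```python
from math import sqrt

def t_statistic(sample_mean, population_mean, sample_sd, n):
    """One-sample t value: (sample mean - claimed mean) / (s / sqrt(n))."""
    return (sample_mean - population_mean) / (sample_sd / sqrt(n))

t = t_statistic(sample_mean=75, population_mean=70, sample_sd=40, n=100)

# Two-tailed critical value at the 5% level for 99 degrees of freedom
# (standard t-table value, an assumption not taken from the notes)
t_critical = 1.984
claim_accepted = abs(t) < t_critical
print("t =", t, "accepted:", claim_accepted)
```

Since the calculated value is smaller than the critical value, the difference between 75 and 70 is not significant at this level.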

$\chi^2$-test (Chi-square test):

Here we evaluate the following quantity:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

$O_i$ = Observed frequency

$E_i$ = Expected frequency

Now let us define another parameter called ‘degrees of freedom’.

Degrees of freedom = No. of independent observations

= No. of observations – No. of independent constraints.

In practical calculations, we often estimate the degrees of freedom from the number of columns ($c$) and the number of rows ($r$) in a data table.

Degrees of freedom = $(c - 1)(r - 1)$

Example:

Q. Given are the amounts of rainfall (in mm.) on different days in a week. Check if the rainfall is uniformly distributed over the week. Given that the critical values of $\chi^2$ at 5, 6, 7 degrees of freedom are respectively 11.07, 12.59, 14.07 at the 5% level of significance.

Day                  1    2    3    4    5    6    7
Rainfall (in mm.)    14   16   8    12   11   9    14

If the distribution is to be uniform, the expected frequency has to be $(14 + 16 + 8 + 12 + 11 + 9 + 14)/7 = 84/7 = 12$.

$$\chi^2 = \frac{(14-12)^2 + (16-12)^2 + (8-12)^2 + (12-12)^2 + (11-12)^2 + (9-12)^2 + (14-12)^2}{12} = \frac{50}{12} = 4.17$$

Here the degrees of freedom = $7 - 1 = 6$ and the tabulated value for 6 degrees of freedom is 12.59.

As the calculated value 4.17 < 12.59, we can accept the claim.

That is, the null hypothesis (the rainfall is uniformly distributed over the week) is not rejected.
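The same calculation can be sketched in code, using the usual definition of the chi-square quantity as the sum of (observed minus expected) squared over expected. The function name is illustrative; the data and the tabulated value 12.59 are those quoted in the example.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rainfall = [14, 16, 8, 12, 11, 9, 14]            # observed values for days 1 to 7
expected_value = sum(rainfall) / len(rainfall)   # uniform rainfall over the week
expected = [expected_value] * len(rainfall)

chi2 = chi_square(rainfall, expected)
degrees_of_freedom = len(rainfall) - 1
critical = 12.59                                 # tabulated value for 6 degrees of freedom, 5% level

print("expected =", expected_value, "chi-square =", round(chi2, 2))
print("uniform rainfall accepted:", chi2 < critical)
```

The squared deviations add up to 50, so the computed value is 50/12, about 4.17, well below the tabulated 12.59.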

Least Square Fit

(Regression Formulas)

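Let us think that we are about to fit the set of data by a straight line.

The equation of a straight line is $y = mx + b$.

Consider the data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, … etc. If we know the two parameters $m$ and $b$, we can draw a straight line with them.

Error is defined as

$$E(m, b) = \sum_{i=1}^{n} (m x_i + b - y_i)^2$$

For the best fit, this error should be minimum.

Therefore, we must have $\frac{\partial E}{\partial m} = 0$ and $\frac{\partial E}{\partial b} = 0$.

[We take partial derivatives of the error function with respect to the parameters.]

$$\frac{\partial E}{\partial m} = \frac{\partial}{\partial m}\sum_{i=1}^{n} (m x_i + b - y_i)^2 = \sum_{i=1}^{n} 2\,(m x_i + b - y_i)\, x_i = 2\left(m\sum x_i^2 + b\sum x_i - \sum x_i y_i\right) = 0 \qquad (1)$$

Similarly,

$$\frac{\partial E}{\partial b} = 2\left(m\sum x_i + nb - \sum y_i\right) = 0 \qquad (2)$$

From (1) and (2),

Slope,

$$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

and Intercept,

$$b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

Example:

For the data points (1,2), (2,3), (3,4), (4,5)

$n = 4$, $\sum x_i = 1 + 2 + 3 + 4 = 10$, $\sum y_i = 2 + 3 + 4 + 5 = 14$,

$\sum x_i y_i = 1 \times 2 + 2 \times 3 + 3 \times 4 + 4 \times 5 = 40$, $\sum x_i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30$

$$m = \frac{4 \times 40 - 10 \times 14}{4 \times 30 - 10^2} = \frac{160 - 140}{120 - 100} = 1, \qquad b = \frac{30 \times 14 - 10 \times 40}{4 \times 30 - 10^2} = \frac{420 - 400}{120 - 100} = 1$$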

FORTRAN Program:

C     Least Square fit
      open(1,file='xy.dat')
      open(2,file='fit.dat')
      write(*,*)'Number of Points?'
      read(*,*)n
C     Initialise the running sums
      sumx=0.0
      sumy=0.0
      sumsqx=0.0
      sumxy=0.0
      write(*,*)'Give data in the form: x,y'
C     Read the data, save them in xy.dat and accumulate the sums
      do i=1,n
         read(*,*)x,y
         write(1,*)x,y
         sumx=sumx+x
         sumy=sumy+y
         sumsqx=sumsqx+x*x
         sumxy=sumxy+x*y
      enddo
C     Slope and intercept from the least square formulas
      deno=n*sumsqx-sumx*sumx
      slope=(n*sumxy-sumx*sumy)/deno
      b=(sumsqx*sumy-sumx*sumxy)/deno
      write(*,*)'Slope, Intercept= ',slope,b
      write(*,*)'Give a lower and upper limits of X'
      read(*,*)xmin, xmax
C     Write three points of the fitted line in fit.dat
      x=xmin
      dx=(xmax-xmin)/2.0
      do i=1,3
         y=slope*x+b
         write(2,*)x,y
         x=x+dx
      enddo
      close(1)
      close(2)
      stop
      end

Figure: Least square fit of the given data points. The straight line is drawn with the values of the slope and the intercept obtained from the program.

Z-Score Table

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.1 0.03983 0.04380 0.04776 0.05172 0.05567 0.05962 0.06356 0.06749 0.07142 0.07535

0.2 0.07926 0.08317 0.08706 0.09095 0.09483 0.09871 0.10257 0.10642 0.11026 0.11409

0.3 0.11791 0.12172 0.12552 0.12930 0.13307 0.13683 0.14058 0.14431 0.14803 0.15173

0.4 0.15542 0.15910 0.16276 0.16640 0.17003 0.17364 0.17724 0.18082 0.18439 0.18793

0.5 0.19146 0.19497 0.19847 0.20194 0.20540 0.20884 0.21226 0.21566 0.21904 0.22240

0.6 0.22575 0.22907 0.23237 0.23565 0.23891 0.24215 0.24537 0.24857 0.25175 0.25490

0.7 0.25804 0.26115 0.26424 0.26730 0.27035 0.27337 0.27637 0.27935 0.28230 0.28524

0.8 0.28814 0.29103 0.29389 0.29673 0.29955 0.30234 0.30511 0.30785 0.31057 0.31327

0.9 0.31594 0.31859 0.32121 0.32381 0.32639 0.32894 0.33147 0.33398 0.33646 0.33891

1.0 0.34134 0.34375 0.34614 0.34849 0.35083 0.35314 0.35543 0.35769 0.35993 0.36214

1.1 0.36433 0.36650 0.36864 0.37076 0.37286 0.37493 0.37698 0.37900 0.38100 0.38298

1.2 0.38493 0.38686 0.38877 0.39065 0.39251 0.39435 0.39617 0.39796 0.39973 0.40147

1.3 0.40320 0.40490 0.40658 0.40824 0.40988 0.41149 0.41308 0.41466 0.41621 0.41774

1.4 0.41924 0.42073 0.42220 0.42364 0.42507 0.42647 0.42785 0.42922 0.43056 0.43189

1.5 0.43319 0.43448 0.43574 0.43699 0.43822 0.43943 0.44062 0.44179 0.44295 0.44408

1.6 0.44520 0.44630 0.44738 0.44845 0.44950 0.45053 0.45154 0.45254 0.45352 0.45449

1.7 0.45543 0.45637 0.45728 0.45818 0.45907 0.45994 0.46080 0.46164 0.46246 0.46327

1.8 0.46407 0.46485 0.46562 0.46638 0.46712 0.46784 0.46856 0.46926 0.46995 0.47062

1.9 0.47128 0.47193 0.47257 0.47320 0.47381 0.47441 0.47500 0.47558 0.47615 0.47670

2.0 0.47725 0.47778 0.47831 0.47882 0.47932 0.47982 0.48030 0.48077 0.48124 0.48169

2.1 0.48214 0.48257 0.48300 0.48341 0.48382 0.48422 0.48461 0.48500 0.48537 0.48574

2.2 0.48610 0.48645 0.48679 0.48713 0.48745 0.48778 0.48809 0.48840 0.48870 0.48899

2.3 0.48928 0.48956 0.48983 0.49010 0.49036 0.49061 0.49086 0.49111 0.49134 0.49158

2.4 0.49180 0.49202 0.49224 0.49245 0.49266 0.49286 0.49305 0.49324 0.49343 0.49361

2.5 0.49379 0.49396 0.49413 0.49430 0.49446 0.49461 0.49477 0.49492 0.49506 0.49520

2.6 0.49534 0.49547 0.49560 0.49573 0.49585 0.49598 0.49609 0.49621 0.49632 0.49643

2.7 0.49653 0.49664 0.49674 0.49683 0.49693 0.49702 0.49711 0.49720 0.49728 0.49736

2.8 0.49744 0.49752 0.49760 0.49767 0.49774 0.49781 0.49788 0.49795 0.49801 0.49807

2.9 0.49813 0.49819 0.49825 0.49831 0.49836 0.49841 0.49846 0.49851 0.49856 0.49861
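The entries of this table are the areas under the standard Normal curve between 0 and $z$. As a check, they can be reproduced with the error function; the function name below is an illustrative choice.

```python
from math import erf, sqrt

def area_zero_to_z(z):
    """Area under the standard Normal curve between 0 and z."""
    return 0.5 * erf(z / sqrt(2.0))

# Row 1.0, column 0.00; row 1.9, column 0.06; row 2.9, column 0.00
for z in (1.0, 1.96, 2.9):
    print(z, round(area_zero_to_z(z), 5))
```

For example, the entry in row 1.9 and column 0.06 corresponds to $z = 1.96$, and twice that area (0.95) is the central 95% of the distribution.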
