Instructor: Prof. Ken Tsang T.A. : Ms Lisa Liu Office: E409

Instructor: Prof. Ken Tsang

T.A. : Ms Lisa Liu

Office: E409

Tel: 362 0606(office) 362 0630(T.A.)

Email: [email protected] (Instructor)

[email protected] (TA)

Speaking of Statistics (13)

What is Statistics all about? The subject of statistics involves the study of how The subject of statistics involves the study of how

to to collectcollect, , summarizesummarize, , analyzeanalyze and and interpretinterpret data. data.

Data are numerical facts and figures from which

conclusions can be drawn. Such conclusions are

important to the decision-making processes of

many professions and organizations.

Some sources of data are: Data distributed by an organization or an individual A designed experiment A survey An observational study Web, telephone

Data can be Curves, figures Sounds Papers, books Web, telephone process

Data

Distinguished Statisticians in History!

Sir R. A. Fisher

1890-1962

Karl Pearson

1857-1936

W. Edwards Deming--The Father of the Quality Evolution

1900-1993

Data Scientist: Data Scientist:

The Sexiest Job of the The Sexiest Job of the 21st Century 21st Century

20%

10%

20%10%

40% Assi gnmentsQui zzesGroup Proj ectMi d-term TestFi nal Exami nati on

Final Evaluation ProportionFinal Evaluation Proportion

CRA (Criterion-Referenced Assessment)

Adoption of the Criterion-Referenced Assessment (CRA) for evaluating students’ performance

OBTL Syllabus

CRA model is directly compatible with the OBTL philosophy.

UIC Regulations on CRA

Assessment rubrics should be developed for assessment tasks, individually or in combination, which contribute to 40% of the course grade.

We will use rubric for following assessments.

1. Oral Presentation and Group Presentation (20%)

2. Final Examination (40%)

Rubric for Assessment of Oral and Group Presentation (1)

Criteria for assessment

Performance levels

Excellent4

Good3

Satisfactory2

Marginal Pass1

Fail0

Content of presentation

Organization ( _20_ % weighting)

Accuracy and Depth( _40_ % weighting)

Presentation techniques

Oral English( _10_ % weighting)

Body language and facial expressions ( _10_ % weighting)

Time management ( _10_ % weighting)

Question and Answer Performance

Responsiveness( _10_ % weighting)

Oral and Group Presentation (1) Choose your teammates

4-5 members in one team

Submit your group form before November

Study rubric for oral presentation (1)

Choose a topic for your team

Prepare your PPT

Oral presentation will be given (roughly) on the 12th week (Nov ? Dec 2014)

Suggested Grade Distribution

Assessment grade system: Assessment grade system: A (Not more than 5%)A (Not more than 5%) A and A- (Not more than A and A- (Not more than

15%)15%) A and B that include A, A and B that include A,

A-, B+, B and B- (Not A-, B+, B and B- (Not more than 75%)more than 75%)

Below C and not include C Below C and not include C (No any limit ).(No any limit ).

LetterGrade

AcademicPerformance

A Excellent

A- Excellent

B+ Good

B Good

B- Good

C+ Satisfactory

C Satisfactory

D Marginal Pass

F Fail

Some notices on this Course Assignments must be handed in before the deadline.

After the deadline, we refuse to accept your assignments!

For the mid-term test and final examination, you cannot bring anything except some stationeries and water! Mobile are not allowed.

For the final examination, we cannot tell you the score before the AR inform the official results. If you have any question on the score, you can check the marked sheet via AR.

General Information Textbook

Essentials of Business Statistcs

Bowerman/O'Connell/Murphree/Orris

McGraw Hill, International Edition

ISBN 978-0-07-131471-8

Advantages Unified textbook for all the year one students More applications

General Information References

Basic Statistics, for Business & Economics Fifth Edition

D.A. Lind, W.G. Marchal and S.A. Wathen

2006, McGraw Hill, International Edition Business Statistics, A First Course, Fourth Ed.

D.M. Levine, T.C. Krehbiel and M.L. Berenson

2006, Pearson Prentice Hall, New Jersey Statistics for Business and Economics, Ninth Ed.

J.T. McClave, P.G. Benson and T. Sincich

2005, Pearson Prentice Hall, New Jersey Modern Elementary Statistics, 11th Ed.

J.E. Freund, 2004, Prentice Hall.

Statistics for the Behavioral Sciences

Frederick J Gravetterand Larry B. WallnauWadsworth Publishing; 8 edition (December 10, 2008)

18

Chapter 1An introduction to Business Statistics

Populations and Samples Populations and Samples

Sampling a Population of Sampling a Population of Existing UnitsExisting Units

Sampling a ProcessSampling a Process

An Introduction to Survey An Introduction to Survey SamplingSampling

Section 1.1 Populations( 总体 ) and Samples( 样本 )

A set of existing units (people, objects, or A set of existing units (people, objects, or events)events)

Population

1.1. All of the last year’s graduates of Dartmouth College’s Master All of the last year’s graduates of Dartmouth College’s Master of Business Administration program.of Business Administration program.

2.2. All Lincoln Town Cars that were produced last year.All Lincoln Town Cars that were produced last year.

3.3. All accounts receivable invoices accumulated last year by The All accounts receivable invoices accumulated last year by The Procter & Gamble Company.Procter & Gamble Company.

4.4. All fire reported last month to the Tulsa, Oklahoma, fire All fire reported last month to the Tulsa, Oklahoma, fire department. department.

A measurable characteristic of the population.Variable(Variable( 变量变量 ))

The variable is said to be The variable is said to be quantitative(quantitative( 定量的定量的 )): : Measurements that represent quantities (for example, Measurements that represent quantities (for example, “how much” or “how many”). For example, “how much” or “how many”). For example, annual annual starting salarystarting salary is quantitative, is quantitative, age and number of age and number of childrenchildren is also quantitative is also quantitative

The variable is said to be The variable is said to be qualitative(qualitative( 定性的定性的 )) or or categorical(categorical( 属性的属性的 )): A descriptive category to which a : A descriptive category to which a population unit belongs. For example, population unit belongs. For example, a person’s gendera person’s gender, , the make of an automobilethe make of an automobile and and whether a person who whether a person who purchases a product is satisfied with the productpurchases a product is satisfied with the product are are qualitative. qualitative.

We carry out a We carry out a measurementmeasurement to assign a to assign a valuevalue of a of a variable to each population unit. variable to each population unit.

Nominative(Nominative( 无顺序分类的无顺序分类的 )):: Identifier or nameIdentifier or name Unranked categorizationUnranked categorization Example: gender, car colorExample: gender, car color

Ordinal(Ordinal( 顺序的顺序的 )):: All characteristics of nominative plus…All characteristics of nominative plus… Rank-order categoriesRank-order categories Ranks are relative to each otherRanks are relative to each other Example: Low (1), moderate (2), or high (3) riskExample: Low (1), moderate (2), or high (3) risk

There are two types of qualitative variables: There are two types of qualitative variables:

An examination of the entire population of An examination of the entire population of measurements.measurements.

Census(Census( 普查普查 ))

Note: Census usually too expensive, too time consuming, and Note: Census usually too expensive, too time consuming, and too much effort for a large population.too much effort for a large population.

A selected subset of the units of a population.A selected subset of the units of a population.SampleSample

Population

Sample

For example, a university graduated 8,742 studentsFor example, a university graduated 8,742 studentsa.a. This is too large for a census.This is too large for a census.b.b. So, we select a sample of these graduates and So, we select a sample of these graduates and learn their annual starting salaries.learn their annual starting salaries.

Measured values of the variable of interest for the Measured values of the variable of interest for the sample units.sample units.

For example, the actual annual starting salaries of the For example, the actual annual starting salaries of the sampled graduates.sampled graduates.

Sample of measurements

For example, for a set of annual starting salaries, we want For example, for a set of annual starting salaries, we want to know:to know:

How much to expectHow much to expectWhat is a high versus low salaryWhat is a high versus low salaryHow much the salaries differ from each otherHow much the salaries differ from each other

If the population is small enough, could take a census If the population is small enough, could take a census and not have to sample and make any statistical inferencesand not have to sample and make any statistical inferences But if the population is too large, then ……….But if the population is too large, then ……….

The science of describing the important aspects of a set of The science of describing the important aspects of a set of measurementsmeasurements

Descriptive statisticsDescriptive statistics

Statistical Inference(Statistical Inference( 统计推断统计推断 ))

The science of using a sample of measurements The science of using a sample of measurements to make generalizations about the important to make generalizations about the important aspects of a population of measurements.aspects of a population of measurements.

• For example, use a sample of starting salaries For example, use a sample of starting salaries to estimate the important aspects of the to estimate the important aspects of the population of starting salariespopulation of starting salaries

There is a criteria on how to choose a sample: the There is a criteria on how to choose a sample: the information contained in a sample is to accurately reflect information contained in a sample is to accurately reflect the population under study. the population under study.

The Lady Tasting Tea Tea is tasted different depending upon whether the tea was poured into the milk or whether the milk was poured into the tea.

Let us test the proposition!

Observation Study---Smoking is harmful to health

Section 1.2 Sampling a Population of Existing Units

For example, randomly pick two different people from a group of 15:For example, randomly pick two different people from a group of 15: Number the people from 1 to 15; and write their numbers on 15 Number the people from 1 to 15; and write their numbers on 15

different slips of paperdifferent slips of paper Thoroughly mix the papers and randomly pick two of themThoroughly mix the papers and randomly pick two of them The numbers on the slips identifies the people for the sampleThe numbers on the slips identifies the people for the sample

Each population unit has the same chance of being selected as Each population unit has the same chance of being selected as every other unitevery other unit Each possible sample (of the same size) has the same chance Each possible sample (of the same size) has the same chance

of being selectedof being selected

A random sample is a sample selected from a population so that:A random sample is a sample selected from a population so that:Random sample(Random sample( 随机样本随机样本 ))

Guarantees a sample of different unitsGuarantees a sample of different unitsEach sampled unit contributes different informationEach sampled unit contributes different informationSampling without replacement is the usual and customary Sampling without replacement is the usual and customary sampling methodsampling method

A sampled unit is withheld from possibly being A sampled unit is withheld from possibly being selected again in the same sampleselected again in the same sample

Sample without replacement(Sample without replacement( 无放回抽样无放回抽样 ))

The unit is placed back into the population for possible reselection However, the same unit in the sample does not contribute new information

Replace each sampled unit before picking next unitReplace each sampled unit before picking next unitSample with replacement(Sample with replacement( 有放回抽样有放回抽样 ))

Example 1.1Example 1.1 The Cell Phone Case: Estimating Cell Phone The Cell Phone Case: Estimating Cell Phone CostsCosts

The bank has 2,136 employees on a 500-minute-per-The bank has 2,136 employees on a 500-minute-per-month plan with a monthly cost of $50. The bank will month plan with a monthly cost of $50. The bank will estimate its cellular cost per minute for this plan by estimate its cellular cost per minute for this plan by examining the number of minutes used last month by each examining the number of minutes used last month by each of 100 randomly selected employees on this 500-minute of 100 randomly selected employees on this 500-minute plan.plan.

According to the cellular management service, if the According to the cellular management service, if the cellular cost per minute for the random sample of 100 cellular cost per minute for the random sample of 100 employees is over 18 cents per minute, the bank should employees is over 18 cents per minute, the bank should benefit from automated cellular management of its calling benefit from automated cellular management of its calling plans. plans.

In order to randomly select the sample of 100 cell In order to randomly select the sample of 100 cell

phone users, the bank will make a numbered list of the phone users, the bank will make a numbered list of the

2,136 users on the 500-minite plan. This list is called 2,136 users on the 500-minite plan. This list is called a a

frame(frame( 设计框架设计框架 )). .

The bank can use The bank can use a random number tablea random number table, such as , such as

Table 1.1(a), or a computer software package, such as Table 1.1(a), or a computer software package, such as

Table 1.1 (b), to select the needed sample. Table 1.1 (b), to select the needed sample.

The 100 cellular-usage figures are given in Table 1.2.The 100 cellular-usage figures are given in Table 1.2.

Approximately Random SamplesApproximately Random Samples

Sometimes it is not possible to list and thus number all Sometimes it is not possible to list and thus number all the units in a population. In such a situation we often the units in a population. In such a situation we often select select a systematic samplea systematic sample, which approximates a random , which approximates a random sample. sample.

A Systematic Sample(A Systematic Sample( 系统抽样系统抽样 ))

Randomly enter the population and systematically Randomly enter the population and systematically sample every sample every kkthth unit. unit.

Example 1.2Example 1.2 The Marketing Research Case: Rating a The Marketing Research Case: Rating a New Bottle DesignNew Bottle Design

To study consumer reaction to a new design, the brand group To study consumer reaction to a new design, the brand group will use “mall intercept method” in which shoppers at a large will use “mall intercept method” in which shoppers at a large metropolitan shopping mall are intercepted and asked to participate metropolitan shopping mall are intercepted and asked to participate in in a consumer surveya consumer survey. The questionnaire are shown in Figure 1.1. . The questionnaire are shown in Figure 1.1. Each shopper will be exposed to the new bottle design and asked to Each shopper will be exposed to the new bottle design and asked to rate the bottle image using a 7-point “Likert scale.”rate the bottle image using a 7-point “Likert scale.”

We select a systematic sample. To do this, every 100We select a systematic sample. To do this, every 100 thth shopper shopper passing a specified location in the mall will be invited to passing a specified location in the mall will be invited to participate in the survey. During a Tuesday afternoon and evening, participate in the survey. During a Tuesday afternoon and evening, a sample of 60 shoppers is selected by using the systematic a sample of 60 shoppers is selected by using the systematic sampling process. The 60 composite scores are given in Table 1.3. sampling process. The 60 composite scores are given in Table 1.3. From this table, we can estimate that 95 percent of the shoppers From this table, we can estimate that 95 percent of the shoppers would give the bottle design a composite score of at least 25.would give the bottle design a composite score of at least 25.

Voluntary response sampleVoluntary response sample

Participants select themselves to be in the sampleParticipants select themselves to be in the sample

• Participants “self-select”Participants “self-select”• For example, calling in to vote on For example, calling in to vote on American American

IdolIdol• Commonly referred to as a “non-scientific” Commonly referred to as a “non-scientific”

samplesample

Usually not representative of the populationUsually not representative of the population

• Over-represent individuals with strong opinionsOver-represent individuals with strong opinions• Usually, but not always, negative opinionsUsually, but not always, negative opinions

Another Sampling Method

Section 1.3 Sampling a Process Process( 过程 )A sequence of operations that takes inputs (labor, raw materials, methods, machines, and so on) and turns them into outputs (products, services, and the like)

Inputs Process Outputs

Cars will continue to be made over timeCars will continue to be made over time

For example, all automobiles of a particular make For example, all automobiles of a particular make and model, for instance, the Lincoln Town Carand model, for instance, the Lincoln Town Car

The “population” from a process is all output The “population” from a process is all output produced in the past, present, and the yet-to-occur produced in the past, present, and the yet-to-occur future.future.

Processes produce output over time

Example 1.3Example 1.3 The Coffee Temperature Case: Monitoring The Coffee Temperature Case: Monitoring Coffee TemperaturesCoffee Temperatures

This case concerns coffee temperatures at a fast-food This case concerns coffee temperatures at a fast-food restaurant. To do this, the restaurant personnel measure restaurant. To do this, the restaurant personnel measure the temperature of the coffee being dispensed (in degrees the temperature of the coffee being dispensed (in degrees F) at half-hour intervals from 10 A.M. to 9:30 P.M. on a F) at half-hour intervals from 10 A.M. to 9:30 P.M. on a given day. Data is list on Table 1.7. given day. Data is list on Table 1.7. A process is in statistical control if it does not exhibit any unusual process variations.To determine if a process is in control or not, sample To determine if a process is in control or not, sample the process often enough to detect unusual variationsthe process often enough to detect unusual variationsA A runs plotruns plot is a graph of individual process is a graph of individual process measurements over time. Figure 1.3 shows a runs plot of measurements over time. Figure 1.3 shows a runs plot of the temperature data. the temperature data.

Figure 1.3 Runs Plot of Coffee Temperatures: The Process Figure 1.3 Runs Plot of Coffee Temperatures: The Process is in Statistical Control.is in Statistical Control.

Over time, temperatures appear to have a fairly constant amount of variation around a fairly constant level The temperature is expected to be at the constant level

shown by the horizontal blue line Sometimes the temperature is higher and sometimes

lower than the constant level About the same amount of spread of the values (data

points) around the constant level The points are as far above the line as below it The data points appear to form a horizontal band

So, the process is in statistical control Coffee-making process is operating “consistently”

Results Results

Because the coffee temperature has been and is presently Because the coffee temperature has been and is presently in control, it will likely stay in control in the futurein control, it will likely stay in control in the future If the coffee making process stays in control, then If the coffee making process stays in control, then

coffee temperature is predicted to be between 152coffee temperature is predicted to be between 152oo and 170and 170oo F F

In general, if the process appears from the runs plot to In general, if the process appears from the runs plot to be in control, then it will probably remain in control in be in control, then it will probably remain in control in the futurethe future The sample of measurements was approximately The sample of measurements was approximately

randomrandom Future process performance is predictableFuture process performance is predictable

RemarkRemark

Section 1.4 An Introduction to Survey Sampling

Already know some sampling methods Also called sampling designs, they are: Random sampling The focus of this book

Systematic sampling Voluntary response sampling

But there are other sample designs: Stratified random sampling( 分层随机抽样 ) Cluster sampling( 分块抽样 )

Divide the population into non-overlapping groups, called strata, of similar units

Separately, select a random sample from each and every stratum

Combine the random samples from each stratum to make the full sample

Appropriate when the population consists of two or more different groups so that: The groups differ from each other with respect to the

variable of interest Units within a group are similar to each other For example, divide population into strata by age,

gender, income, etc

Stratified Random Sample

“Cluster” or group a population into subpopulations Cluster by geography, time, and so on…

Each cluster is a representative small-scale version of the population (i.e. heterogeneous group)

A simple random sample is chosen from each cluster Combine the random samples from each cluster to make the

full sample Appropriate for populations spread over a large geographic

area so that… There are different sections or regions in the area with

respect to the variable of interest A random sample of the cluster

Cluster Sampling

Want a sample containing n units from a population containing N units

Take the ratio N/n and round down to the nearest whole number Call the rounded result k

Randomly select one of the first k elements from the population list

Step through the population from the first chosen unit and select every kth unit

This method has the properties of a simple random sample, especially if the list of the population elements is a random ordering

More on Systematic Sampling

Random sampling should eliminate bias But even a random sample may not be representative

because of: Under-coverage Too few sampled units or some of the population

was excluded Non-response When a sampled unit cannot be contacted or

refuses to participate Response bias Responses of selected units are not truthful

Sampling Problem

Chapter 2Descriptive StatisticsDescribing the Shape of a DistributionDescribing the Shape of a Distribution

Describing Central TendencyDescribing Central Tendency

Measures of VariationMeasures of Variation

Percentiles, Quartiles, and Box-and-Percentiles, Quartiles, and Box-and-Whiskers DisplaysWhiskers Displays

Describing Qualitative DataDescribing Qualitative Data

Weighted MeansWeighted Means

Section 2.1 Describing the Shape of a Distribution To know what the population looks like, find the To know what the population looks like, find the

“shape” of its distribution“shape” of its distribution Picture the distribution graphically by any of the Picture the distribution graphically by any of the

following methods:following methods: Stem-and-leaf display(Stem-and-leaf display( 茎叶图茎叶图 )) Frequency distributions(Frequency distributions( 頻率分布表頻率分布表 )) Histogram(Histogram( 直方图直方图 )) Dot plot(Dot plot( 点图点图 ))

The purpose of a stem-and-leaf display is to see The purpose of a stem-and-leaf display is to see the overall pattern of the data, by grouping the the overall pattern of the data, by grouping the data into classesdata into classes To see:To see:

the variation from class to classthe variation from class to class the amount of data in each classthe amount of data in each class the distribution of the data within each the distribution of the data within each

classclass Best for small to moderately sized data Best for small to moderately sized data

distributionsdistributions

Stem-and-leaf displayStem-and-leaf display

Example 2.1Example 2.1 The Car Mileage Case The Car Mileage Case

In this case study, we consider a tax credit offered by In this case study, we consider a tax credit offered by the federal government to automakers for improving the federal government to automakers for improving the fuel economy of midsize cars. the fuel economy of midsize cars.

To find the combined city and highway mileage To find the combined city and highway mileage estimate for a particular car model, the EPA tests a estimate for a particular car model, the EPA tests a sample of cars. sample of cars.

Table 2.1 presents the sample of 49 gas mileages that Table 2.1 presents the sample of 49 gas mileages that have been obtained by the new midsize model. have been obtained by the new midsize model.

30.830.8 30.930.9 32.032.0 32.332.3 32.632.6

31.731.7 30.430.4 31.431.4 32.732.7 31.431.4

30.130.1 32.532.5 30.830.8 31.231.2 31.831.8

31.631.6 30.330.3 32.832.8 30.630.6 31.931.9

32.132.1 31.331.3 32.032.0 31.731.7 32.832.8

33.333.3 32.132.1 31.531.5 31.431.4 31.531.5

31.331.3 32.532.5 32.432.4 32.232.2 31.631.6

31.031.0 31.831.8 31.031.0 31.531.5 30.630.6

32.032.0 30.430.4 29.829.8 31.731.7 32.232.2

32.432.4 30.530.5 31.131.1 30.630.6

Table 2.1 A sample of 49 mileages

The stem-and-leaf display of car mileages:The stem-and-leaf display of car mileages:

29 829 830 1344566688930 1344566688931 0012334445556677788931 0012334445556677788932 000112234455678832 000112234455678833 333 3

29 + 0.8 = 29.829 + 0.8 = 29.8

33 + 0.3 = 33.333 + 0.3 = 33.3

Another display of the same data using more classes

Starred classes (*) extend from 0.0 to 0.4

Unstarred classes extend from 0.5 to 0.9

29 830* 134430 566688931* 00123344431 5556677788932* 000112234432 55678833* 3

Looking at the last stem-and-leaf display, the Looking at the last stem-and-leaf display, the distribution appears almost “distribution appears almost “symmetricalsymmetrical”” The upper portion of the display…The upper portion of the display…

Stems 29, 30*, 30, and 31*Stems 29, 30*, 30, and 31* … … is almost a mirror image of the lower portion of is almost a mirror image of the lower portion of

the displaythe display Stems 31, 32*, 32, and 33*Stems 31, 32*, 32, and 33*

But not exactly a mirror reflectionBut not exactly a mirror reflection Maybe slightly more data in the lower portion Maybe slightly more data in the lower portion

than in the upper portionthan in the upper portionLater, we will call this a slightly “left-Later, we will call this a slightly “left-

skewed” distributionskewed” distribution

Constructing a Stem-and-Leaf DisplayConstructing a Stem-and-Leaf Display

1.1. Decide what units will be used for the stems and the Decide what units will be used for the stems and the leaves. As a general rule, choose units for the stems so leaves. As a general rule, choose units for the stems so that there will be somewhere between 5 and 20 stems.that there will be somewhere between 5 and 20 stems.

2.2. Place the stems in a column with the smallest stem at Place the stems in a column with the smallest stem at the top of the column and the largest stem at the the top of the column and the largest stem at the bottom.bottom.

3.3. Enter the leaf for each measurement into the row Enter the leaf for each measurement into the row corresponding to the proper stem. The leaves should corresponding to the proper stem. The leaves should be single-digit numbers (rounded values).be single-digit numbers (rounded values).

4.4. If desired, rearrange the leaves so that they are in If desired, rearrange the leaves so that they are in increasing order from left to right.increasing order from left to right.

Example 2.2Example 2.2 The Payment Time Case: Reducing Payment Times

In order to assess the effectiveness of the system, the In order to assess the effectiveness of the system, the consulting firm will study the payment times for invoices consulting firm will study the payment times for invoices processed during the first three months of the system’s processed during the first three months of the system’s operation. operation.

During this period, 7,823 invoices are processed using During this period, 7,823 invoices are processed using the new system. To study the payment times of these the new system. To study the payment times of these invoices, the consulting firm numbers the invoices from invoices, the consulting firm numbers the invoices from 0001 to 7823 and uses random numbers to select a 0001 to 7823 and uses random numbers to select a random sample of 65 invoices. The resulting 65 payment random sample of 65 invoices. The resulting 65 payment times are given in Table 2.2times are given in Table 2.2

2222 2929 1616 1515 1818 1717 1212 1313 1717 1616 1515

1919 1717 1010 2121 1515 1414 1717 1818 1212 2020 1414

1616 1515 1616 2020 2222 1414 2525 1919 2323 1515 1919

1818 2323 2222 1616 1616 1919 1313 1818 2424 2424 2626

1313 1818 1717 1515 2424 1515 1717 1414 1818 1717 2121

1616 2121 2525 1919 2020 2727 1616 1717 1616 2121

Table 2.2 A Sample of Payment Times (in Days) for 65 Randomly Selected Invoices.

1 10 0 1 11 3 12 00 6 13 000 10 14 0000 17 15 0000000 26 16 000000000 (8) 17 00000000 30 18 000000 24 19 00000 19 20 000 16 21 000 13 22 000 10 23 00 8 24 000 5 25 00 3 26 0 2 27 0 1 28 1 29 0

Shorter tailL

onger tail

The leftmost column of The leftmost column of numbers are the numbers are numbers are the numbers are the amounts of values in the amounts of values in each stemeach stem

• The number 8 in The number 8 in parentheses indicates that parentheses indicates that there are 8 payments in there are 8 payments in the stem for 17 daysthe stem for 17 days

• The number 27 (no The number 27 (no parentheses) indicates that parentheses) indicates that there are 27 payments there are 27 payments made in 16 or less daysmade in 16 or less days

Looking at this display, we see that all of the sampled Looking at this display, we see that all of the sampled payment times are substantially less than the 39-day payment times are substantially less than the 39-day typical payment time of the former billing system. typical payment time of the former billing system.

The stem-and-leaf display do not appear symmetrical. The stem-and-leaf display do not appear symmetrical. The “tail” of the distribution consisting of the higher The “tail” of the distribution consisting of the higher payment times is longer than the “tail” of the payment times is longer than the “tail” of the distribution consisting of the smaller payment times.distribution consisting of the smaller payment times.

We say that the distribution is skewed with a tail to the We say that the distribution is skewed with a tail to the right.right.

The Payment Times: Results

A A frequency distributionfrequency distribution is a list of data classes is a list of data classes with the count or “frequency” of values that belong with the count or “frequency” of values that belong to each classto each class

• “ “Classify and count”Classify and count”• The frequency distribution is a tableThe frequency distribution is a table

Show the frequency distribution in a Show the frequency distribution in a histogramhistogram• The histogram is a picture of the frequency The histogram is a picture of the frequency distributiondistribution

See Examples 2.2, The Payment Time CaseSee Examples 2.2, The Payment Time Case

Frequency Distribution and Histogram

Steps in making a frequency distribution:Steps in making a frequency distribution:

1.1. Determine the number of classes Determine the number of classes KK

2.2. Determine the class lengthDetermine the class length

3.3. Set the starting value for the classes, that is, the Set the starting value for the classes, that is, the distribution “floor”distribution “floor”

4.4. Calculate the class limitsCalculate the class limits

5.5. Setup all the classesSetup all the classes Then tally the data into the Then tally the data into the KK classes and record the classes and record the

frequenciesfrequencies

Constructing the frequency distribution

Group all of the n data into K number of classes K is the smallest whole number for which

2K n

In Examples 2.2 , n = 65 For K = 6, 26 = 64, < n For K = 7, 27 = 128, > n So use K = 7 classes

The number of classes K

Class length L is the step size from one to the next

In Examples 2.2, The Payment Time Case, the largest value is 29 days and the smallest value is 10 days, so

Arbitrarily round the class length up to 3 days/class

KL

value smallest - value Largest

days/class 71432classes 7

days 19

classes 7

days 10 - 29.L

Class Length L

The classes start on the smallest data value This is the lower limit of the first class

The upper limit of the first class is

smallest value + (L – 1) In the example, the first class starts at 10 days and goes

up to 12 days The second class starts at the upper limit of the first class +

1 and goes up (L – 1) more The second class starts at 13 days and goes up to 15

days And so on

Starting the classes

Classes (days) Tally Frequency

10 to 12 ||| 3

13 to 15 |||| 14

16 to 18 ||| 23

19 to 21 || 12

22 to 24 ||| 8

25 to 27 |||| 4

28 to 30 | 1

65

||||||||

|||||||| ||||||||

||||||||

||||

Check: All frequencies must sum to Check: All frequencies must sum to nn

Tallies and Frequencies: Example 2.2

The relative frequency of a class is the proportion or fraction of data that is contained in that class Calculated by dividing the class frequency by the

total number of data values Relative frequency may be expressed as either a

decimal or percent A relative frequency distribution is a list of all

the data classes and their associated relative frequencies

Relative Frequency( 相对频率 )

Classes (days) Frequency Relative Frequency

10 to 12 3 3/65 = 0.0462

13 to 15 14 14/65 = 0.2154

16 to 18 23 0.3538

19 to 21 12 0.1846

22 to 24 8 0.1231

25 to 27 4 0.0615

28 to 30 1 0.0154

65 1.0000

Check: All relative frequencies must sum to 1Check: All relative frequencies must sum to 1

Relative Frequency: Example 2.2

Classes FrequencyClasses Frequency Relative Frequency Boundaries Relative Frequency Boundaries MidpointMidpoint

10 to 12 310 to 12 3 0.0462 9.5, 12.5 11 0.0462 9.5, 12.5 11

13 to 15 14 0.2154 12.5, 15.5 1413 to 15 14 0.2154 12.5, 15.5 14

16 to 18 2316 to 18 23 0.3538 15.5, 18.5 17 0.3538 15.5, 18.5 17

19 to 21 1219 to 21 12 0.1846 18.5, 21.5 20 0.1846 18.5, 21.5 20

22 to 24 822 to 24 8 0.1231 21.5, 24.5 23 0.1231 21.5, 24.5 23

25 to 27 425 to 27 4 0.0615 24.5, 27.5 26 0.0615 24.5, 27.5 26

28 to 30 28 to 30 1 1 0.01540.0154 27.5, 30.5 29 27.5, 30.5 29

6565 1.0000 1.0000

A graph in which rectangles represent the A graph in which rectangles represent the classesclasses

The base of the rectangle represents the class The base of the rectangle represents the class lengthlength

The height of the rectangle represents The height of the rectangle represents the frequency in a frequency histogram, orthe frequency in a frequency histogram, or the relative frequency in a relative frequency the relative frequency in a relative frequency

histogramhistogram

Histogram

HistogramExample 2.2: The Payment Times CaseExample 2.2: The Payment Times Case

Frequency HistogramFrequency Histogram Relative Frequency HistogramRelative Frequency Histogram

As with the earlier stem-and-leaf display, the tail on the As with the earlier stem-and-leaf display, the tail on the right appears to be right appears to be longerlonger than the tail on the left. than the tail on the left.

Example 2.1 The Car Mileage Case

We should use K=6 classes, the largest and smallest We should use K=6 classes, the largest and smallest mileages in Table 2.1 are 33.3 and 29.8. So we find the mileages in Table 2.1 are 33.3 and 29.8. So we find the class length by computing (33.3-29.8)/6=0.5833. class length by computing (33.3-29.8)/6=0.5833.

To obtain a more convenient class length, we round To obtain a more convenient class length, we round this value up to 0.6. this value up to 0.6.

To form the first class, we start with the smallest To form the first class, we start with the smallest mileage-29.8-and add 0.5 to obtain the class 29.8-30.3. mileage-29.8-and add 0.5 to obtain the class 29.8-30.3. Following this instruction, we can obtain all classes. Following this instruction, we can obtain all classes.

Remark:Remark: Although we have given a procedure for Although we have given a procedure for determining the number of classes, it is often desirable determining the number of classes, it is often desirable to let the nature of the problem determine the classes.to let the nature of the problem determine the classes.

Classes Freq.Classes Freq. Relative Freq. Boundaries Midpoint Relative Freq. Boundaries Midpoint

29.8-30.3 329.8-30.3 3 0.0612 29.75, 30.35 30.05 0.0612 29.75, 30.35 30.05

30.4-30.9 9 0.1837 30.35, 30.95 30.6530.4-30.9 9 0.1837 30.35, 30.95 30.65

31.0-31.5 1231.0-31.5 12 0.2449 30.95, 31.55 31.25 0.2449 30.95, 31.55 31.25

31.6-32.1 1331.6-32.1 13 0.2653 31.55, 32.15 31.85 0.2653 31.55, 32.15 31.85

32.2-32.7 932.2-32.7 9 0.1827 32.15, 32.75 32.45 0.1827 32.15, 32.75 32.45

32.8-33.3 332.8-33.3 3 0.0612 32.75, 33.35 33.05 0.0612 32.75, 33.35 33.05

Table: A Frequency Distribution and a Relative Frequency Distribution of the 49 Mileages

Back-to-Back histogram DisplayComparing Two Distributions with back-to-back Histogram

78

The Normal Curve( 正态曲线 )Symmetrical and bell-shaped Symmetrical and bell-shaped curve for a normally distributed curve for a normally distributed populationpopulationThe height of the normal over The height of the normal over any point represents the relative any point represents the relative proportion of values near that pointproportion of values near that point

Example 2.1, The Car Mileages Example 2.1, The Car Mileages CaseCase

The bean machine is a device The bean machine is a device invented by Sir Francis Galton to invented by Sir Francis Galton to demonstrate how the normal demonstrate how the normal distribution appears in nature. This distribution appears in nature. This machine consists of a vertical board machine consists of a vertical board with interleaved rows of pins. Small with interleaved rows of pins. Small balls are dropped from the top and balls are dropped from the top and then bounce randomly left or right then bounce randomly left or right as they hit the pins. The balls are as they hit the pins. The balls are collected into bins at the bottom collected into bins at the bottom and settle down into a pattern and settle down into a pattern resembling the Gaussian curve.resembling the Gaussian curve.

Normal distribution in natureNormal distribution in nature

Height (in.)Height (in.)

Normal distribution in natureNormal distribution in nature

Distribution of the heights of 1052 women fits the normal distribution, Distribution of the heights of 1052 women fits the normal distribution, with a goodness of fit p value of 0.75with a goodness of fit p value of 0.75

Histogram of daily percentage changes in the S&P 500 index

那么何谓正态分布呢？通俗地讲就是“中间多，两头少”，比如我们每个人的身高，巨人或侏儒在人口总数中所占的比例都很小，而中等身材的人占的比例最大。换成统计学的讲法，如果把身高做为随机变量，那么这种规律就是说一个人的身高达到平均值的概率最大，但身高越偏离平均值，其概率也越小。在自然现象和社会现象中，大量的随机变量都服从或近似地服从正态分布 .

由于 P ｛ a-b<X≤a+b ｝ =0.6826 ， P ｛ a-2b<X≤a+2b ｝ =0.9544 ， P｛ a-3b<X≤a+3b ｝ =0.9974 ，我们可以看到，对于服从正态分布的随机变量 X 来说，它的值落在 a-3b 与 a+3b 之间几乎是肯定的，这就是所谓的“ 3b 规则”。

Skewness( 偏度 )Skewed distributions are not symmetrical about their center. Rather, they are lop-sided with a longer tail on one side or the other.

• A population is distributed according to its relative frequency curve

• The skew is the side with the longer tail

Right SkewedRight SkewedLeft SkewedLeft Skewed SymmetricSymmetric

Section 2.2 Describing Central Tendency Population Parameters( 总体参数 )

A population parameter is a number calculated from all the population measurements that describes some aspect of the population

The population mean, denoted , is a population parameter and is the average of the population measurements

Point Estimates and Sample StatisticsA point estimate( 点估计 ) is a one-number estimate of the value of a population parameter

A sample statistic is a number calculated using sample measurements that describes some aspect of the sample Use sample statistics as point estimates of the Use sample statistics as point estimates of the

population parameterspopulation parameters

The sample mean, denoted x, is a sample statistic and is the average of the sample measurements The sample mean is a point estimate of the population The sample mean is a point estimate of the population

meanmean

Measures of Central TendencyMeanMean, , : The average or expected value : The average or expected value

MedianMedian, M, Mdd:: The value of the middle point of The value of the middle point of

the ordered measurementsthe ordered measurements

ModeMode, M, Moo: The most frequent value: The most frequent value

The Mean( 均值 )

Population X1, X2, …, XN

Population Mean

N

X

N

=1ii

Sample x1, x2, …, xn

Sample Mean

n

x x

n

=1ii

x

The Sample Mean( 样本均值 )For a sample of size For a sample of size nn, the , the sample meansample mean is defined as is defined as

n

xxx

n

xx n

n

ii

...211

and is a point estimate of the population mean

• It is the value to expect, on average and in the long run

90

Mean as the balance point for a distributionData: 2, 2, 6, 10mean=(2+2+6+10)/4=5

What will happen to the mean if we add one more number to the data?

Example: Car Mileage Case

Sample mean for first five car mileages from Table 2.1

30.8, 31.7, 30.1, 31.6, 32.1

5554321

5

1 xxxxxx

x ii

26.315

3.156

5

1.326.311.307.318.30

x

Example: Car Mileage Case Continued

Sample mean for all the car mileages from Table 2.1Sample mean for all the car mileages from Table 2.1

5531.3149

1.1546

49

49

1 i

ixx

Based on this calculated sample mean, the point estimate of mean mileage of all cars is 31.5531 mpg

The Median( 中位数 )The population or sample median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it

The median Md is found as follows:

1. If the number of measurements is odd, the median is the middlemost measurement in the ordered values

2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordered values

94

Data: 3, 5, 8, 10, 11median=8

95

Data: 3, 3, 4, 5, 7, 8median=(4+5)/2=4.5

Example: Sample Median Internist’s Yearly Salaries (x$1000)Internist’s Yearly Salaries (x$1000)

127 132 138 141 144 146 127 132 138 141 144 146 152152 154 165 171 177 192 241 154 165 171 177 192 241

Because Because nn = 13 (odd,) then the median is the middlemost = 13 (odd,) then the median is the middlemost or 7or 7thth value of the ordered data, so value of the ordered data, so

MMdd=152=152

An annual salary of $180,000 is in the high end, well An annual salary of $180,000 is in the high end, well above the median salary of $152,000above the median salary of $152,000

• In fact, $180,000 a very high and competitive In fact, $180,000 a very high and competitive salarysalary

Example 2.3Example 2.3

97

Data: 2, 2, 2, 3, 3, 12 mean=4median=(2+3)/2=2.5

The Mode( 众数 )The mode Mo of a population or sample of measurements is the measurement that occurs most frequently

• Modes are the values that are observed “most typically”

• Sometimes higher frequencies at two or more values• If there are two modes, the data is bimodal• If more than two modes, the data is multimodal

• When data are in classes, the class with the highest frequency is the modal class• The tallest box in the histogram

Example 2.4Example 2.4 DVD Recorder SatisfactionDVD Recorder Satisfaction

Satisfaction rankings on a scale of 1 (not satisfied) to 10 Satisfaction rankings on a scale of 1 (not satisfied) to 10 (extremely satisfied), arranged in increasing order(extremely satisfied), arranged in increasing order

1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10

Because Because nn = 20 (even,) then the median is the average of = 20 (even,) then the median is the average of two middlemost ratings; these are the 10two middlemost ratings; these are the 10 thth and 11 and 11thth values. Both of these are 8 (circled), so values. Both of these are 8 (circled), so

MMdd = 8 = 8

Because te rating 8 occurs with the highest rating, Because te rating 8 occurs with the highest rating,

MMoo = 8 = 8

Relationships Among Mean,

Median and Mode

Comparing Mean, Median & ModeBell-shaped distribution: Mean = Median = Mode

Right skewed distribution: Mean > Median > Mode

Left-skewed distribution: Mean < Median < Mode Also: The median is not affected by extreme values

• “Extreme values” are values much larger or much smaller than most of the data

• The median is resistant to extreme values The mean is strongly affected by extreme values

• The mean is sensitive to extreme values

Selecting a measure of Central Tendency Usually the mean is a good measure,

because it uses every score in the distribution.

There are some extreme cases in which the mean is not representative (or calculable). Then the mode and the median are used.

103

104

Mean=(10+11*4+12*3+13+100)/10=20.3

Median=(11+12)/2=11.5

Mode=11

105

Mean – not computableMedian=(12+13)/2=12.5Mode – not meaningful

Open-ended distributions A distribution is said to be open-ended when there is no upper limit (or lower limit) for one of the categories

Payment Time Case

Mean=18.108 daysMean=18.108 daysMedian=17.000 daysMedian=17.000 daysMode=16.000 daysMode=16.000 daysSo:So:Expect the mean payment time to be 18.108 Expect the mean payment time to be 18.108 daysdaysA long payment time would be > 17 days and a A long payment time would be > 17 days and a short payment time would be < 17 daysshort payment time would be < 17 daysThe typical payment time is 16 daysThe typical payment time is 16 days

Section 2.3 Measures of Variation( 变异数 )

Figure 2.31 20 Repair Times for Personal Computers at Two Service Centers

Figure 2.31 indicates that we need measures of Figure 2.31 indicates that we need measures of variation to express how the two distributions differ.variation to express how the two distributions differ.

Range(Range( 全距全距 ))

Largest minus the smallest measurementLargest minus the smallest measurement

The Population Variance (pronounced The Population Variance (pronounced sigma sigma squaredsquared) () ( 总体方差总体方差 ))

The average of the squared deviations of all The average of the squared deviations of all the population measurements from the the population measurements from the population meanpopulation mean

Standard Deviation (pronounced Standard Deviation (pronounced sigmasigma) () ( 标准差标准差 ))

The square root of the varianceThe square root of the variance

2

The RangeThe Range

Range = largest measurement - smallest measurementRange = largest measurement - smallest measurement

The range measures the interval spanned by all the dataThe range measures the interval spanned by all the data

Example 2.3: Internist’s Salaries (in thousands of Example 2.3: Internist’s Salaries (in thousands of dollars)dollars)

127 132 138 141 144 146 152 154 165 171 177 192 241127 132 138 141 144 146 152 154 165 171 177 192 241

Range = 241 - 127 = Range = 241 - 127 = 114114 ($114,000) ($114,000)

The VariancePopulation X1, X2, …, XN Sample x1, x2, …, xn

Sample Variance

s

11

2

2

n-

x - x= s

n

i=i

Population Variance

N

- X

N

=1ii

2

2

The VarianceFor a population of size For a population of size NN, the , the population variancepopulation variance is is defined asdefined as

N

xxx

N

xN

N

ii 22

22

11

2

2

For a sample of size n, the sample variance s2 is defined as

11

222

211

2

2

n

xxxxxx

n

xxs n

n

ii

2

and is a point estimate for 2

112

Sample variability tends to underestimate the population value

The Standard Deviation( 标准差 )

Population Standard Deviation, : 2

Sample Standard Deviation, s: 2ss


Consider the population of profit margins for five of the best big companies in America as rated by Forbes magazine on its website on March 16, 2005. These profit margins are 8%, 10%, 15%, 12% and 5%.

%105

50

5

51215108

2

22222

222222

%6.115

58

5

25425045

52502

5

105101210151010108

%40636112 ..

Population MeanPopulation Mean

Population Population VarianceVariance

Population Standard DeviationPopulation Standard Deviation

Sample variance and standard deviation for first five car mileages from Table 2.1

30.8, 31.7, 30.1, 31.6, 32.1

Example 2.6Example 2.6 The Car Mileage Case

2 63 1 s o .x

15

5

1

2

2

i

i xxs

4

26311322631631263113026317312631830 22222 ..........

= 2.572 /4 = 0.643

8019.0643.2 ss

Sample variance and standard deviation for all car mileages from Table 2.1, .

7992.0638793.0

638793.048

66204.30

149

2

49

1

2

2

ss

xxs i

i

The point estimate of the variance of all cars is 0.638793 mpgThe point estimate of the variance of all cars is 0.638793 mpg22 and the point estimate of the standard deviation of all cars is and the point estimate of the standard deviation of all cars is 0.7992 mpg.0.7992 mpg.

The computational formula for the sample variance

n

x

xn

s

n

iin

ii

2

1

1

22

1

1

Example 2.7Example 2.7 The Payment Time Case

Consider the sample of 65 payment times in Table 2.2.Consider the sample of 65 payment times in Table 2.2.

317,22)21()19()22(

177,1211922

222265

22

21

65

1

2

6521

65

1

xxxx

xxxx

ii

ii

ThereforeTherefore 69135.1564

2464.004,1

65

)177,1(317,22

)165(

1 22

s

andand 9612.369135.152 ss Days.Days.

The Empirical Rule( 经验准则 ) forNormal PopulationsIf a population has mean and standard deviation and is described by a normal curve, then

1. 68.26% of the population measurements lie within one standard deviation of the mean: [

2. 95.44% of the population measurements lie within two standard deviations of the mean: [22

3. 99.73% of the population measurements lie within three standard deviations of the mean: [33

Tolerance Intervals(Tolerance Intervals( 容许区间容许区间 )) An Interval that contains a specified percentage of the An Interval that contains a specified percentage of the individual measurements in a population is called a individual measurements in a population is called a tolerance intervaltolerance interval. .

The one, two, and three standard deviation intervals The one, two, and three standard deviation intervals around given in (1), (2) and (3) are tolerance around given in (1), (2) and (3) are tolerance intervals containing, respectively, 68.26 percent, 95.44 intervals containing, respectively, 68.26 percent, 95.44 percent and 99.73 percent of the measurements in a percent and 99.73 percent of the measurements in a normally distributed population. normally distributed population.

The The three-sigmathree-sigma interval to be a tolerance interval to be a tolerance interval that contains interval that contains almost allalmost all of the measurements of the measurements in a normally distributed population. in a normally distributed population.

3

Figure 2.32 The Empirical Rule and Tolerance Intervals

The Car Mileage CaseThe Car Mileage Case 68.26% of all individual cars will have mileages in 68.26% of all individual cars will have mileages in

the rangethe range

95.44% of all individual cars will have mileages in 95.44% of all individual cars will have mileages in the rangethe range

99.73% of all individual cars will have mileages in 99.73% of all individual cars will have mileages in the rangethe range


43283080631 .,...sx mpg

233030616312 .,...sx mpg

034229426313 .,...sx mpg

Skewness and the Empirical RuleSkewness and the Empirical Rule

The Empirical RuleThe Empirical Rule holds for normally distributed holds for normally distributed populations. populations.

This rule also approximately holds for populations This rule also approximately holds for populations having mound-shaped (single-peaked) distributions having mound-shaped (single-peaked) distributions that are not very skewed to the right or left. that are not very skewed to the right or left.

For example, Recall that the distribution of 65 For example, Recall that the distribution of 65 payment times, it indicates that the empirical rule payment times, it indicates that the empirical rule holds. holds.

Section 2.4 Percentiles, Quartiles( 四分之一分位点 ) and Box-and-Whiskers Display

For a set of measurements arranged in increasing order, For a set of measurements arranged in increasing order, the the ppthth percentile( percentile( 百分位点百分位点 )) is a value such that p is a value such that p percent of the measurements fall at or below the value percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above and (100-p) percent of the measurements fall at or above the valuethe value

The first quartile QThe first quartile Q11 is the 25th percentile is the 25th percentile

The second quartileThe second quartile (or median) M (or median) Mdd is the 50th percentile is the 50th percentile

The third quartile QThe third quartile Q33 is the 75th percentile is the 75th percentile

The interquartile range IQR(The interquartile range IQR( 四分位距四分位距 )) is Q is Q33 - Q - Q11

Calculating pth percentile Calculate the index i=(p/100) ×n

If i is not an integer, the next integer greater than i denotes the position of the pth percentile in the ordered arrangement.

If i is an integer, then the pth percentile is the average of the measurements in position i and i+1 in the ordered arrangement.

Figure 2.33 Using stem-and-leaf displays to find percentiles. Figure 2.33 Using stem-and-leaf displays to find percentiles.

(a) The 75th percentile of the 65 payment

times, and a five-number summary (b) The 5th percentile of the 60 bottle design ratings and a five-number summary

Example 2.10Example 2.10 DVD Recorder SatisfactionDVD Recorder Satisfaction

20 customer satisfaction ratings:

1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10

Q1 = (7+8)/2 = 7.5

Md = (8+8)/2 = 8

Q3 = (9+9)/2 = 9

IQR = Q3 Q1 = 9 7.5 = 1.5

Five Number Summary in descriptive statistic 1. The smallest measurement

2. The first quartile, Q1

3. The median, Md

4. The third quartile, Q3

5. The largest measurement

Displayed visually using a box-and-whiskers plot

Box-and-whisker plots

128

A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.

The box plots the: The box plots the: first quartile, Qfirst quartile, Q11

median, Mmedian, Mdd

third quartile, Qthird quartile, Q33

inner fences, located 1.5inner fences, located 1.5IQR away from the quartiles:IQR away from the quartiles: = Q= Q11 – (1.5 – (1.5 IQR) IQR) = Q= Q33 + (1.5 + (1.5 IQR) IQR)

outer fences, located 3outer fences, located 3IQR away from the quartiles:IQR away from the quartiles: = Q= Q11 – (3 – (3 IQR) IQR) = Q= Q33 + (3 + (3 IQR) IQR)

The Box-and-Whiskers Plots( 盒型图 )

The “whiskers” are dashed lines that plot the The “whiskers” are dashed lines that plot the range of the datarange of the data A dashed line drawn from the box below QA dashed line drawn from the box below Q11

down to the smallest measurementdown to the smallest measurement Another dashed line drawn from the box Another dashed line drawn from the box

above Qabove Q33 up to the largest measurement up to the largest measurement Note: QNote: Q11, M, Mdd, Q, Q33, the smallest value, and the , the smallest value, and the

largest value are sometimes referred to as the largest value are sometimes referred to as the five number summaryfive number summary

Outliers are measurements that are very different from Outliers are measurements that are very different from most of the other measurementsmost of the other measurements Because they are either very much larger or very much Because they are either very much larger or very much

smaller than most of the other measurementssmaller than most of the other measurements Outliers lie beyond the fences of the box-and-whiskers Outliers lie beyond the fences of the box-and-whiskers

plotplot Measurements between the inner and outer fences are Measurements between the inner and outer fences are

mild outliersmild outliers Measurements beyond the outer fences are Measurements beyond the outer fences are severe severe

outliersoutliers

Outliers( 异常值 )

Section 2.5 Describing Qualitative Data

Pie charts(Pie charts( 饼图饼图 )) of the proportion (as percent) of all of the proportion (as percent) of all cars sold in the United States by different cars sold in the United States by different manufacturers, 1970 versus 1997manufacturers, 1970 versus 1997

Bar Chart( 柱状图 )Percentage of Automobiles Sold by Manufacturer, 1970 versus 1997

Pie ChartPercentage of Automobiles Sold by Manufacturer,1997

An Bar Chart of U.S Automobile Sales in 1997An Bar Chart of U.S Automobile Sales in 1997

Misleading Graphs and Charts:Scale BreakBreak the vertical scale to exaggerate effect

Mean Salaries at a Major University, 2002 - 2005

Misleading Graphs and Charts: Scale EffectsCompress vs. stretch the vertical axis to exaggerate or minimize the effect

Mean Salary Increases at a Major University, 2002 - 2005

139

You can use simple mathematical operations (like averages) to create nonsensical “facts” that can drive whatever agenda you’d like. Example: the average wealth of the citizens of a particular town is $100,000, therefore they don’t need any government assistance. (The town consists of 1 stingy millionaire and 9 homeless people.)

Sometimes, some measurements are more important than others Assign numerical “weights” to the data Weights measure relative importance of the value

Calculate weighted mean as

where wi is the weight assigned to the ith measurement xi

i

ii

w

xw

Weighted Means( 加权均值 )


June 2001 unemployment rates in the U.S. by region

Want the mean unemployment rate for the U.S.

Census Region Civilian Labor Force

(millions)

Unemployment

Rate (%)

Northeast 26.9 4.1

South 50.6 4.7

Midwest 34.7 4.4

West 32.5 5.0

Calculate it as a weighted mean So that the bigger the region, the more heavily it counts

in the mean The data values are the regional unemployment

rates The weights are the sizes of the regional labor

forces

Note that the unweigthed mean is 4.55%, which underestimates the true rate by 0.03% That is, 0.0003 144.7 million = 43,410 workers

%5847144

29663532525734650926

05532447347465014926

..

......

........

Population and Sample Proportions

Population X1, X2, …, XN

p

Population Proportion

Sample x1, x2, …, xn

Sample Proportion

n

)V(ˆ ji

Xnp

p ˆ

p is the point estimate of p^

X is a qualitative variable.

Example 2.11Example 2.11 The Marketing Ethics Case

117 out of 205 marketing researchers disapproved of action taken in a hypothetical scenario

X = 117, number of researches who disapprove

n = 205, number of researchers surveyed

Sample Proportion:

570205

117.

n

Xp̂

Scatter Diagrams are used to examine possible relationships between two numerical variables

The Scatter Diagram: one variable is measured on the vertical

axis and the other variable is measured on the horizontal axis

Scatter Diagrams

Scatter Plots( 散点图 )

Restaurant Ratings: Mean Preference vs. Mean Taste

Visualize the data to see patterns, especially “trends”

A Scatter Plot Showing a Positive Linear Relationship

147

A Scatter Plot Showing a Little or No Linear Relationship

148

A Scatter Plot Showing a Negative Linear Relationship

149

Instructor: Prof. Ken Tsang T.A. : Ms Lisa Liu Office: E409

Documents

Transcript of Instructor: Prof. Ken Tsang T.A. : Ms Lisa Liu Office: E409