Collecting Data


THE MEANING OF STATISTICS

Statistics is a way to answer questions using information. The information has to be observed or collected, ordered, represented and then analysed.

The information that is collected is called raw data. The data you collect depends on the question you are trying to answer, and that question is usually turned into a statement to be tested, called a hypothesis.

A hypothesis is an assumption that may or may not be true that is the starting point of an investigation.

For example, you might want to answer the question “How does the price of a car change as it gets older?”

You could use the hypothesis “As a car gets older, the price goes down.”

The data needed to test this hypothesis would be the age and price of a number of cars.

TYPES OF DATA

Quantitative variables are numerical observations or measurements.

Quantitative variables can be continuous or discrete.

Height, weight, response time, subjective rating of pain, temperature, and score on an exam are all examples of quantitative variables.

Qualitative variables are non-numerical observations.

Also known as categorical variables, qualitative variables are variables with no natural sense of ordering.

They are therefore measured on a nominal scale. For example, hair color is a qualitative variable, as are names. Qualitative variables can be coded to appear numeric but their numbers are meaningless, as in male-1 and female-2.

CONTINUOUS AND DISCRETE DATA

Continuous data can take any value on a continuous numerical scale.

The length of a piece of string could take any value on this scale, so is continuous data.

Discrete data can only take particular values on a continuous numerical scale.

The number of eggs laid by a chicken can only take particular values, so is discrete data.


CATEGORICAL DATA

To make raw data easier to handle and easier to display they may be gathered or ordered in a particular way.

Categorical data is an example of this.

A set of data is categorical if values or observations belonging to it can be sorted into different categories.

Each piece of categorical data is put into one of a set of non-overlapping categories.

Shoes can be sorted according to colour. The characteristic colour can have non-overlapping categories (e.g. black, brown, blue etc.).

Numerical data can be put into categories. The characteristic length of thumb can have non-overlapping categories:

x ≤ 30mm, 30mm < x ≤ 40mm, 40mm < x ≤ 50mm, etc.
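The class intervals above can be sketched in code. This is only an illustration: the function name, the catch-all final category (standing in for the "etc.") and the measurements are assumptions made for the example, not part of the original slides.

```python
def thumb_category(length_mm: float) -> str:
    """Return the non-overlapping class interval for a thumb length in mm."""
    if length_mm <= 30:
        return "x <= 30mm"
    if length_mm <= 40:
        return "30mm < x <= 40mm"
    if length_mm <= 50:
        return "40mm < x <= 50mm"
    # Assumed catch-all standing in for the "etc." in the text.
    return "x > 50mm"


# Hypothetical measurements, invented for illustration only.
lengths_mm = [28.5, 30.0, 30.1, 40.0, 47.2]
categories = [thumb_category(x) for x in lengths_mm]
print(categories)
```

Because each boundary value belongs to exactly one interval (30mm falls in the first class, not the second), every measurement lands in one and only one category.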

RANKED DATA

Ranked data has values/observations that can be ranked (put in order) or have a rating scale attached. Ranked data can be counted and ordered, but not measured.

The categories for a ranked set of data have a natural order.

Example:

If judges ranked ten dogs on a scale of 1 to 10, 1 would represent the best of the dogs and 10 would be the worst.


BIVARIATE DATA

Definition:

Bivariate data are pairs of related variables.

Here are some examples of bivariate data pairs:

• Age and weight of a person

• Price and age of an object

• Shoe size and height

• Weight and fitness level

• Popularity and cost of an object

WHY USE BIVARIATE DATA PAIRS?

We use bivariate data pairs in research and analysis for the following reasons:

• To determine whether there is a relationship between two variables

• To analyse how a change in one variable will affect the other and vice versa

• Sometimes bivariate data pairs can simplify an analysis

FOR EXAMPLE…

Two students were trying to discover whether there was a relationship between shoe size and age in girls. They asked 3 friends and got these results:

Student number   Age   Shoe size
1                13    4
2                15    5
3                14    5

This suggests that as age (the first variable) increases, so does shoe size (the second variable), although three results are far too few to be sure. To investigate this pattern further they would need to ask more people.
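One common way to put a number on the strength of a relationship like this is the Pearson correlation coefficient. The slides do not mention it, so treat the sketch below as an optional extra: the function name `pearson_r` is an assumption, and only the three data pairs come from the example.

```python
from math import sqrt

# The three (age, shoe size) pairs from the example.
ages = [13, 15, 14]
shoe_sizes = [4, 5, 5]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(ages, shoe_sizes)
print(round(r, 3))  # a value near +1 means the variables rise together
```

With only three pairs the result says very little, which is why a larger sample is needed before drawing a conclusion.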

HOW ACCURATE IS CONTINUOUS DATA?

Continuous data is not always 100% accurate just because it is measured in numerical units.

This is because continuous data is measured in the kind of units that are usually rounded:

• Age (rounded to nearest year)

• Distance (nearest m, cm, km, mile etc.)

• Time (nearest minute, hour, day, week etc.)

ROUNDING CONTINUOUS DATA

Usually we round to the nearest 10, 100 or 1000, or to the nearest whole unit, depending on what is being recorded or measured.

If the digit after the one we are rounding to is 5 or above we round up; if it is below 5 we round down.

E.g. if someone is 150.5 cm tall we could say they are 151 cm tall.

Unfortunately this means recorded continuous data can be inaccurate by up to half a unit.
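The rounding rule can be sketched as follows. The function name `round_half_up` is an assumption for this example. Note that Python's built-in `round` uses round-half-to-even, so `round(150.5)` gives 150 rather than 151; the `decimal` module is used here to get the "5 or above rounds up" rule described in the text.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: float) -> int:
    """Round to the nearest whole unit, with a final 5 rounding up."""
    # Going through str() avoids binary floating-point surprises.
    exact = Decimal(str(value))
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


height_cm = 150.5
recorded_cm = round_half_up(height_cm)
error_cm = abs(recorded_cm - height_cm)
print(recorded_cm, error_cm)
```

The error between the recorded value and the true value is never more than half a unit, which matches the statement above.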

ACCURACY OF AGE IN CONTINUOUS DATA

Age isn't usually that accurate because it is almost always given as the age a person was on their last birthday, which means it could be wrong by up to 364 days.

FOR EXAMPLE…

If Jenny sits an exam it might ask for her age. If her last birthday was her 15th birthday she would have to write down 15 even if her 16th birthday was the next day. She would be a lot closer to 16 than 15, but age is recorded as the age at the last birthday rather than rounded to the nearest year.

This could make the analysis inaccurate, especially if age was one of the variables in a bivariate pair.

POPULATIONS AND SAMPLING

You must identify the population in an investigation. Population simply means everything or everybody that could be involved in an investigation.

Census data contains information about every member of the population.

Sample data contains information about only part of the population. It is usually used when the population is too large to survey all members, so a small but carefully chosen sample is used to represent the population.

          Advantages                        Disadvantages
Census    Unbiased, accurate, takes the     Time-consuming, expensive, lots of
          whole population into account     data to handle
Sample    Cheaper, less time-consuming,     Not completely representative,
          less data to be considered        may be biased

SAMPLING

Sampling units are the people or items in the population that are to be sampled.

A sampling frame is a list of all the sampling units in the population.

When doing a sample, there are two questions to be asked:

• How big does the sample need to be?

• How is a sample taken so that it represents the population accurately?

The sample size can vary, but the larger the sample, the more reliable the results are.

RANDOM SAMPLING

To represent the population accurately, the sample should be taken randomly so that it is free from bias. Bias can occur in many ways, such as:

• Using an unrepresentative sample

• Poor or misleading questions

• External factors affecting the data collection

• Not correctly identifying the whole population

A random sample is a sample that is taken without a conscious decision being made about which members of the population are to be selected.

Random sampling methods include simple random sampling and stratified sampling.

SIMPLE RANDOM SAMPLING

Simple random sampling makes sure that each sample of size n has an equal chance of being selected.

There are many ways to take a simple random sample, including:

• Using a random number table

• Using a random number generator on a calculator

• Using a computer to choose numbers

• Putting the numbers in a hat and then drawing as many as are needed for the sample.

For example:

16273 34971 01287 21847 08247 27648 98247 75286

13972 24397 90842 62745 14872 52786 81279 21847

Q. Select 10 random numbers each less than 50 starting from the top left hand corner and working your way down.
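Using a computer to choose the numbers might look like the sketch below. It does not read the random number table above; it simply lets Python's standard `random` module pick 10 different numbers below 50. The function name, the numbering of the frame from 1 to 49 and the seed are assumptions made for the example.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Pick n different items so that every possible sample is equally likely."""
    rng = random.Random(seed)  # a seed only makes the run repeatable
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: members numbered 1 to 49 (each less than 50).
frame = list(range(1, 50))
chosen = simple_random_sample(frame, 10, seed=1)
print(sorted(chosen))
```

`random.sample` selects without replacement, so no member can be chosen twice, just as a name drawn from a hat is not put back.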

STRATIFIED SAMPLING

Stratum means just one sub-population. Strata is the plural word meaning more than one sub-population.

There are often factors that divide the population into sub-populations. These have to be considered when choosing a sample, to ensure that it is representative of the whole population.

The size of the sample taken from each group must be in proportion to the relative size of the group from which it is taken.

An example question could be:

How many hours should be included from each group?

You can check your answer by totalling the answers from all the groups: the total should equal the sample size.

STRATIFIED SAMPLING

Example:

A sample of 90 is taken from a population of 450 split across four shifts:

Shift 1   Shift 2   Shift 3   Shift 4
125       100       85        140

You must divide the population within each shift by the total, which is 450, and then multiply it by the sample size, which in this case is 90.

Shift 1: 125/450 x 90 = 25

Shift 2: 100/450 x 90 = 20

Shift 3: 85/450 x 90 = 17

Shift 4: 140/450 x 90 = 28

Check: 25 + 20 + 17 + 28 = 90
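The same calculation can be sketched in code. The function name `stratified_allocation` is an assumption; the shift sizes and the sample size of 90 are taken from the example.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share a sample across strata in proportion to the size of each stratum."""
    total = sum(strata_sizes.values())
    return {
        name: round(size * sample_size / total)
        for name, size in strata_sizes.items()
    }


shifts = {"Shift 1": 125, "Shift 2": 100, "Shift 3": 85, "Shift 4": 140}
allocation = stratified_allocation(shifts, 90)
print(allocation)
print(sum(allocation.values()))  # check: should equal the sample size
```

In this example every share is a whole number. When the shares are not whole numbers, rounding can make the total differ slightly from the intended sample size, so the final check is worth keeping.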

NON-RANDOM SAMPLING (HIGHER STATISTICS)

• Cluster sampling: http://www.youtube.com/watch?v=JUiaB4-FTwI

– Used when the population being sampled splits naturally into groups or “clusters”

– Then a sample is randomly selected

NON-RANDOM SAMPLING

Quota Sampling:

A quota of subjects of a specified type is interviewed.

Systematic Sampling:

From the sampling frame, a starting point is chosen at random, and then items are chosen at regular intervals. Example:

20 students are needed from a total of 100 in the year group. 100/20=5, so every fifth student is chosen after a random starting point. If the starting point is student 1, the next students chosen are 6, 11, 16, 21, 26, 31...
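A minimal sketch of systematic sampling follows, assuming the students are numbered 1 to 100 and that the random starting point falls within the first interval. The function name and the `start` parameter are assumptions made for the example.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Choose every k-th member after a random start, where k = N // n."""
    interval = population_size // sample_size
    if start is None:
        # Random starting point within the first interval (1 to k).
        start = random.randint(1, interval)
    members = range(start, population_size + 1, interval)
    return list(members)[:sample_size]


# Fixing the start at 1 reproduces the pattern 1, 6, 11, 16, ...
sample = systematic_sample(100, 20, start=1)
print(sample)
```

Leaving `start` unset picks the starting point at random, which is the only random step in the method; every later choice is fixed by the interval.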

COLLECTING DATA

Primary data is data that has been collected by, or for, the person who is going to use it.

Secondary data is data that has already been collected by someone else.

Examples of primary data include:

• Observing and tally charts (how sunny it is in June)

• The height of the pupils in year 7

Secondary data may be collected from:

• Websites

• Magazines and newspapers

• Databases

• Research articles

SURVEYS

A survey is the collection of data from a given population. The data are used to analyse a particular issue. Primary data may be collected this way.

The main methods of collecting primary data in a survey are:

• Questionnaires

• Interviews

• Observations

• Experiments

• Data logging

Pilot surveys: Conducted on a small sample to test the design and methods of the survey.

QUESTIONNAIRES

A questionnaire is a set of questions designed to obtain data.

Anyone who answers a questionnaire is called a respondent.

Rules for writing a questionnaire:

• Use short, simple questions

• Use easily understood words and phrases

• Avoid leading questions

• Address only a single issue in each question, e.g. does your car run on diesel fuel?

• Use intervals for numerical answers, e.g. 0-10 years, 11-20 years etc.

• Avoid embarrassing questions

2 TYPES OF QUESTIONS

Open questions: have no suggested answers. The disadvantage is that the different answers have to be summarised before they can be analysed.

Closed questions: have a set of answers for the respondent to choose from. The advantage is that the answers are easier to summarise.

INTERVIEWS

The different types of interview are:

• Sending questions to people by post or email.

• Calling people on the phone

• Face to face personal questioning

Advantages: Postal and email questions are cheap. The response rate for telephone interviews is good. Personal interviews are good for complex questions.

Disadvantages: Email and postal questions are the least likely to get a response. Telephone and personal interviews are expensive, and both can be biased.

INVESTIGATIONS AND EXPERIMENTS

Experiments can help to collect data.

There are 5 ways of carrying out an experiment as part of a statistical investigation:

• Before and after experiments

• Control Groups

• Matched pairs

• Data Logging

• Capture-recapture method

INVESTIGATIONS AND EXPERIMENTS

Control groups- This is when one group is given a drug to see its effect, whereas the other group, the control group, is given an inactive substance instead. The members of each group are randomly selected, and the effectiveness of the drug can then be assessed by comparing the groups.

Matched pairs- This is when two individuals (or two groups) have everything in common apart from the factor being tested. Identical twins are especially useful for this.

Data logging- A method of collecting primary data automatically, using electronic or mechanical equipment.

Capture-recapture method- This is a way of estimating the size of a population. The population is of size N, which is the number to be estimated. First, capture M members of the population, mark them and release them. Wait some time so that they can mix with the rest of the population, then capture a second sample of n members. Record the number m of the second sample that are marked. The formula for this method is:

N = Mn/m

EXAMPLE

Forty fish in a lake are caught, marked and returned to the lake. A second sample of 100 fish is later caught. Of these 100 fish, 10 are marked.

Estimate the number of fish in the lake.

Answer:

M = 40, n = 100, m = 10, N = ?

N = Mn/m

N = 40 x 100 / 10

N = 400 fish
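The formula N = Mn/m can be checked with a short sketch. The function and parameter names are assumptions chosen to mirror M, n and m; the figures are those of the fish example.

```python
def capture_recapture_estimate(marked_first, second_sample, marked_in_second):
    """Estimate population size N = M * n / m."""
    if marked_in_second == 0:
        # With no marked members recaptured, the formula gives no estimate.
        raise ValueError("m must be greater than zero")
    return marked_first * second_sample / marked_in_second


# M = 40 marked fish, n = 100 in the second sample, m = 10 of them marked.
estimate = capture_recapture_estimate(40, 100, 10)
print(estimate)
```

The idea behind the formula is that the proportion of marked fish in the second sample (m/n) should be about the same as the proportion of marked fish in the whole lake (M/N).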

REPLICATION & DIRECT OBSERVATION

Replication: This means repeating the experiment. If the results are similar each time, they are more reliable.

Direct Observation: This means systematically recording the behaviour of the people or items being studied. For instance, counting the number of yellow cars driving down a road within 30 minutes.
