SS EN 20 Data Types - Amazon Web Services...SchoolofSixSigma& Data&Types& Overview&...



School of Six Sigma: Data Types

Overview

In this module we’re going to discuss data types. By the end of this module you’ll know the difference between Sample Statistics and Population Parameters. You’ll also know what the different types of data are, as well as the best kind to use whenever possible.

Differences Between Sample and Population

Let’s get started by learning the difference between a Sample and a Population. To do this we’re going to use an example. Let’s assume a bank wants to gauge its customers’ interest in some new features and has developed a short online survey. Let’s also assume this is a large bank with more than 50,000 paying customers.

Obviously, reaching all 50,000 customers would prove to be both difficult and expensive. The bank may only be able to reach 10,000 customers. These 10,000 customers would be known as a sample of the overall population of 50,000 customers. When we speak of a Population we’re referring to a collection of ALL subjects or objects of interest, with the key word being ALL. We’ll rarely have access to an entire population.

Conversely, a Sample is a subset of the population used to make inferences about the characteristics of the population. So, instead of contacting all 50,000 customers, the bank would send the survey to a subset, or sample, of the overall population. When we’re dealing with Samples we’re actually working with Sample STATISTICS, and when we’re dealing with a Population we’re actually working with Population PARAMETERS.

When we’re speaking about the mean, the Sample Statistic is called x-bar (x̄) while the Population Parameter is called mu (μ). When we’re speaking about the Standard Deviation, the Sample Statistic is a lowercase s while the Population Parameter is sigma (σ). You’ll notice the Population Parameters are Greek letters while the Sample Statistics are Roman letters.
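To make the notation concrete, here is a minimal Python sketch of the bank example. The satisfaction scores are invented for illustration (the module gives no actual survey data), and the variable names `mu`, `sigma`, `x_bar`, and `s` are simply chosen to match the symbols above.

```python
import random
import statistics

random.seed(20)  # fixed seed so the sketch is repeatable

# Hypothetical population: one made-up satisfaction score for each of 50,000 customers.
population = [random.gauss(70, 10) for _ in range(50_000)]

# Population parameters: computed from ALL subjects.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)

# Sample statistics: computed from a random subset of 10,000 customers.
sample = random.sample(population, 10_000)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)

print(f"mu    = {mu:.2f}   x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}   s     = {s:.2f}")
```

With a sample this large the statistics should land close to the parameters, which is what lets us use the sample to make inferences about the population.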

Two Main Types of Data

Now that we know the difference between Sample Statistics and Population Parameters, let’s turn our attention to the two main types of data we’ll work with as continuous improvement practitioners.

The first type is attributes data. When we speak of attributes data there are actually two variations. The first is called binary data. With binary data we’re dealing with two levels. For example, a test is either passed or failed. The light is either on or off. The product is either good or bad.

The second form of attributes data is count data. With this type of data we’re able to count things, as the name implies. For example, if someone fails a test we can count how many answers they missed. If the product is bad we can count the number of defects, and so on. If it’s available, we’ll always want to use count data rather than binary data since we can learn so much more about the situation.

For example, instead of saying a product is bad, it would be much more useful if we could count the number of defects on the product. Or, instead of simply telling the student he failed the exam, it would be useful if we could tell him exactly how many questions he missed.
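The difference between the two forms of attributes data can be sketched in a few lines of Python. The defect counts below are hypothetical inspection results, not data from the module; the point is only that the binary view can be derived from the counts, but not the other way around.

```python
# Hypothetical inspection results: number of defects found on each of six products.
defect_counts = [0, 3, 0, 1, 7, 0]

# Binary view: each product is simply good (no defects) or bad (one or more defects).
is_bad = [count > 0 for count in defect_counts]

bad_products = sum(is_bad)          # how many products are bad
total_defects = sum(defect_counts)  # how many defects there are overall
worst = max(defect_counts)          # defect count on the worst product

print(f"Binary data: {bad_products} of {len(defect_counts)} products are bad")
print(f"Count data:  {total_defects} defects in total, worst product has {worst}")
```

The binary data says three products are bad. The count data also shows that one of those three accounts for most of the defects, which is the kind of detail that would be lost if only good or bad were recorded.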

The second type of data is called variables data, sometimes referred to as continuous data. Variables data comes from a measurement scale that can be divided into finer and finer increments. Things like weight, distance, dimensions, and speed are all examples of variables data.

What Type of Data is Best?

The question is: which type of data is best? If both are available, what type of data should we seek to collect and analyze?

The answer is: if it’s available, we always want to collect and analyze variables data.

There are some statistical reasons for this related to something called power and sample size, which we’ll learn about later in the course, but the gist of it comes down to the fact that variables data is more powerful, statistically speaking, than attributes data.

We may only need 30 data points of variables data to characterize a process, while we may need 100 data points of attributes data to learn anything at all. When possible, always seek to collect and analyze variables data since we can learn so much more.
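As a rough illustration of why variables data tells us more, consider the Python sketch below. It is not a power or sample size calculation, and the shaft diameters, specification limits, and the helper `pass_count` are all hypothetical. It only shows how a shift in a process can be invisible in pass/fail data yet obvious in the measurements themselves.

```python
import statistics

# Hypothetical specification limits for a shaft diameter, in millimetres.
LOWER, UPPER = 9.90, 10.10

# Hypothetical measurements taken before and after a process change.
before = [9.98, 10.01, 10.00, 9.99, 10.02, 10.00, 9.97, 10.03]
after = [10.05, 10.07, 10.06, 10.08, 10.05, 10.09, 10.06, 10.07]

def pass_count(measurements):
    """Attributes view: count how many parts fall inside the spec limits."""
    return sum(LOWER <= m <= UPPER for m in measurements)

# Attributes data: every part passes in both groups, so nothing appears to have changed.
print(f"Passed before: {pass_count(before)} of {len(before)}")
print(f"Passed after:  {pass_count(after)} of {len(after)}")

# Variables data: the average diameter has clearly drifted toward the upper limit.
mean_before = statistics.fmean(before)
mean_after = statistics.fmean(after)
print(f"Mean before: {mean_before:.3f} mm")
print(f"Mean after:  {mean_after:.3f} mm")
```

Here eight measurements per group are enough to see the drift, while the pass/fail results would show nothing until parts actually started failing.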