BIT 150 Lab 10: Sequence Analysis and Association Mapping

The goals of this exercise are to:
1. Perform a multiple sequence alignment with ClustalW
2. Learn how to perform sequence analysis with DnaSP
3. Understand the concepts of association mapping
4. Perform association mapping with TASSEL

1. BACKGROUND:

Multiple Sequence Alignment: Multiple sequence alignment (MSA) refers to the process of aligning three or more sequences (DNA, RNA, or protein). Computational algorithms are used to produce and analyze the alignments. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively expensive (computationally speaking).

Common MSA Tools:
a. ClustalW
b. T-Coffee
c. ProbconsRNA

Nucleotide Diversity: Nucleotide diversity is a measure of genetic variation. It is usually associated with other statistical measures of population diversity, and is similar to expected heterozygosity. It can be calculated by examining the DNA sequences directly, or may be estimated from molecular marker data. This statistic may be used to monitor diversity within or between ecological populations, to examine the genetic variation in related species, or to determine evolutionary relationships.

Common Tools:
a. DnaSP
b. MEGA
c. Arlequin3

Association Mapping: Association mapping takes advantage of linkage disequilibrium to map phenotypes to genotypes. It is based on the idea that traits that have entered a population only recently will still be linked to the surrounding genetic sequence of the original evolutionary ancestor, or in other words, will more often be found within a given haplotype than outside of it. Association mapping asks whether a particular genetic marker (most often a SNP) is more common among individuals with a particular phenotype than you would expect by chance.

Common Tools:
a. TASSEL
b. R
c. PLINK (GWAS)
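As a rough illustration of what "more common than you would expect by chance" means, the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts by phenotype class. The function name and the counts are made up for illustration; they are not part of the lab dataset, and the tools listed above use more elaborate models.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

                      allele 1   allele 2
        phenotype 1       a          b
        phenotype 2       c          d

    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square tail probability equals
    # erfc(sqrt(chi2 / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical allele counts, for illustration only.
chi2, p = chi_square_2x2(30, 10, 15, 25)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
```

A small p-value here only says the allele and the phenotype are not independent in the sample; it does not by itself show that the marker causes the phenotype.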


2. MULTIPLE SEQUENCE ALIGNMENT WITH CLUSTALW:

DnaSP, the program we will use for nucleotide diversity analysis, requires multiple DNA sequence alignment files in FASTA, Nexus, or Phylip format. The .ace files that we worked with in previous labs (produced by Phred, Phrap, Polyphred, and Consed) do not work. Therefore, we need to create a multiple DNA sequence alignment file that the program will accept.

In your directory in ‘plantgenome’, you will find the ‘Lab10’ subdirectory. Within this subdirectory, the folder named ‘2_5395_01_fasta’ contains the FASTA files of the sequences you assembled in Hwk9. Move into the ‘Lab10’ subdirectory, then into the folder containing the FASTA files, and list the contents of this folder:

   >cd Lab10
   >cd 2_5395_01_fasta
   >ls

We need to create a multiple DNA sequence alignment file that contains all the sequences from a single contig. It is important to note that for this step you want only the sequences from a single contig. Aligning sequences from different contigs will produce problematic alignment files that show much more nucleotide diversity than is actually present.

Use the command ‘cat’ to concatenate all the sequences from a single contig from individual text files into one single text file. Name the output file ‘contig1.fasta’.

   >cat sequence1.fasta sequence2.fasta sequence3.fasta ... sequenceN.fasta > contig1.fasta

Here sequence1, sequence2, sequence3, ..., sequenceN are the names of the FASTA files of the sequences from Hwk9 contained in the folder ‘2_5395_01_fasta’ that we provided.
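The same concatenation can be sketched in Python, which also guards against a pitfall of plain ‘cat’: if an input file lacks a final newline, the next header line is glued onto the previous sequence. The function name and the demo file contents below are hypothetical; in the lab you would pass the actual FASTA files of one contig.

```python
import tempfile
from pathlib import Path

def concatenate_fasta(input_paths, output_path):
    """Write all input FASTA files into one file; return the record count."""
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            # Make sure each file ends with a newline so that the next
            # ">" header starts on its own line.
            if text and not text.endswith("\n"):
                text += "\n"
            n_records += sum(1 for line in text.splitlines()
                             if line.startswith(">"))
            out.write(text)
    return n_records

# Demo with two tiny made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "sequence1.fasta").write_text(">seq1\nACGTACGT\n")
    (tmp / "sequence2.fasta").write_text(">seq2\nACGTACGA")  # no final newline
    merged = tmp / "contig1.fasta"
    n_records = concatenate_fasta(
        [tmp / "sequence1.fasta", tmp / "sequence2.fasta"], merged)
    merged_text = merged.read_text()

print(n_records, "records written")
```

Counting the ">" header lines gives a quick check that the merged file contains as many sequences as you intended to include.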
To align all the sequences from a single contig that you just concatenated, we will use ClustalW.

   >clustalw

Load the input FASTA file by selecting option 1. Sequence Input From Disc.

Enter the name of your FASTA file created above (contig1.fasta).

Select option 2. Multiple Alignments to perform a multiple sequence alignment of the sequences from a single contig that you concatenated and saved in the FASTA file.

Select option 9. Output format option to select the format of the output file.

Turn option F. Toggle FASTA format output on by typing F and pressing Enter to produce a multiple sequence alignment output file in FASTA format.

Return to the previous menu to run the alignment (press Enter).

Select option 1. Do complete multiple alignment now.

You will need to enter a name for the ClustalW output file (default is the input file name ‘contig1’ with a .aln extension). Use the default for this (press Enter).

You will need to enter a name for the FASTA output file (default is the input file name with a .fas extension). Enter a name (‘contig11.fas’) for the output file and use the extension .fas.

You will need to enter a name for the new GUIDE TREE file (default is the input file name with a .dnd extension). Use the default for this (press Enter).

Once the multiple sequence alignment is finished, return to the main menu (type X) and exit ClustalW (press Enter and then type X to exit the program).

You should now have 3 new files in your ‘2_5395_01_fasta’ subdirectory: the .aln, .fas, and .dnd files. We will be using the .fas file (the multiple sequence alignment of the sequences for a single contig in FASTA format) for the DnaSP portion of the lab.

FileZilla is a program for transferring files between your computer and a UNIX server. The program can be downloaded from: http://filezilla-project.org/. Steps to use the software:

Open Start/Programs/BioInformatics/FileZilla or double click on the desktop shortcut
   In the host field: plantgenome.plantsciences.ucdavis.edu
   In the username field: your Kerberos username
   In the password field: your Kerberos password
   In the port field: 22
   Click on Quickconnect and you can now transfer files between your computer and the UNIX server
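Before opening the .fas file in DnaSP, it can be worth confirming that it really is an alignment, i.e. that every sequence has the same length once gaps are included. The sketch below is one way to check this; the function names and the example records are hypothetical, and in practice you would pass an open file handle for your .fas file to parse_fasta.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def alignment_length(records):
    """Return the common sequence length, or raise if lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"expected one common length, found: {sorted(lengths)}")
    return lengths.pop()

# Made-up aligned records; seq1 is wrapped over two lines and has one gap.
example = """>seq1
ACG-TAC
GT
>seq2
ACGTTACGT
"""
records = parse_fasta(example.splitlines())
print(len(records), "sequences of aligned length", alignment_length(records))
```

If the lengths differ, the file is probably the unaligned contig1.fasta rather than the FASTA output written by ClustalW.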

3. DNASP:

About DnaSP
DnaSP (Rozas and Rozas, 1999; Rozas et al., 2003) is a software package for the analysis of DNA polymorphism from nucleotide sequence data. DnaSP runs on a Windows platform and is freely available at http://www.ub.es/dnasp/.

In this lab, you will learn how to use DnaSP to calculate the nucleotide diversity present in nucleotide sequence data, and how to test for departure from a neutral model of evolution, i.e. one in which sequences change through genetic drift alone. However, DnaSP is also capable of performing a number of other calculations.

As explained at the beginning of this lab, DnaSP requires a multiple DNA sequence alignment file. We will use the alignment in FASTA format.

Open DnaSP. The opening screen has animated images of DNA double helices, which stop when you click anywhere on the screen.

Click on File in the toolbar to get a blank screen.

Go to File|Open Data File... and open the .fas file that you just prepared.

This opens a Data Information window, which shows a summary of your data, i.e. total number of nucleotide sites, total number of sequences, etc. Close this window (to open the Data Information window at any time, go to Display|Data info).

Go to Display|View Data to see the multiple sequence alignment.

This opens a DNA Sequence Polymorphism window with the aligned sequence names along the left side, and nucleotide positions along the top. You can scroll along the length of the sequence, or down through the list of sequences, using the scroll bars.

In the bottom right corner, there is a Select Sites/Codons… drop-down box with options for how you can view your data. This includes options for highlighting only the invariable (monomorphic) or only the variable (polymorphic) sites. Select these options in turn to see how this affects the data shown. If the sequences were annotated, you could also view the sequence as codons, and highlight synonymous and nonsynonymous nucleotide sites.
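To make the distinction between monomorphic and polymorphic sites concrete, the sketch below lists the variable columns of a small made-up alignment. Skipping columns that contain gaps or ambiguous bases is an assumption of this sketch, as are the function name and the example sequences.

```python
def segregating_sites(seqs, valid="ACGT"):
    """Return 0-based positions of polymorphic (segregating) columns.

    Columns containing anything other than A, C, G or T (for example
    gaps) are skipped.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        column = [s[i].upper() for s in seqs]
        if any(base not in valid for base in column):
            continue  # skip gapped or ambiguous columns
        if len(set(column)) > 1:
            sites.append(i)
    return sites

# Made-up alignment: columns 4 and 9 (1-based) vary among the sequences.
aligned = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTTC",
]
positions = segregating_sites(aligned)
print("polymorphic sites (1-based):", [p + 1 for p in positions])
```

The number of positions returned is the count of segregating sites, which is one of the two quantities used later in Tajima's test.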

Calculation of nucleotide diversity
Click on Analysis in the toolbar to see the variety of analyses that DnaSP can perform.

Go to Analysis|DNA polymorphism.

This brings up a DNA Polymorphism Options window.

The Data Set drop-down box gives you the option to select the dataset to be analyzed. Since your dataset contains only one set of sequences, the only option given will be All Included Sequences.

You can estimate the nucleotide diversity in your data set either across the entire sequence or in specific regions by selecting the Region to Analyze.

You can also estimate whether nucleotide diversity is particularly high in a specific region of the sequence using the Sliding Window option. If you check the Compute box, you can then define the size of the sequence block (Window Length) and how often to repeat the calculation (Step Size). As an example, you can see how the pattern of nucleotide diversity changes in 100-nucleotide blocks, every 25 nucleotides along your sequence.

Finally, there are several Options for the calculation of nucleotide diversity (average number of nucleotide differences per site between two sequences):

(i) Variance of Pi - this refers to the variance in the average number of nucleotide differences per site between two sequences (Nei 1987).

(ii) Nucleotide diversity with Jukes and Cantor correction factor - this model corrects for sites where mutation has occurred more than once. As such, the Jukes and Cantor correction accounts for how sequences evolved (Jukes and Cantor 1969; Lynch and Crease 1990).

(iii) Nucleotide diversity (gaps/missing data) - both of the earlier options assume that there are no gaps in the sequence data. However, if there are indels in the sequence, you will need to select this option, otherwise these indels will be ignored during the analysis.

NOTE: You can select only (i), only (ii), both (i) and (ii), or only (iii), but not all three together.

Select All Included Sequences as Data Set, the entire region as Region to Analyze, and both Compute Variance of Pi and Compute Pi as the Options to calculate nucleotide diversity. Click on OK.
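The definition of nucleotide diversity given above (average number of nucleotide differences per site between two sequences) and the sliding window idea can be sketched as follows. This is an illustrative calculation on made-up sequences, not DnaSP's implementation. It skips columns containing gaps, applies no Jukes and Cantor correction, and all function names are hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average pairwise differences per site (pi), ignoring gapped columns."""
    length = len(seqs[0])
    sites = [i for i in range(length)
             if all(s[i] in "ACGT" for s in seqs)]
    pairs = list(combinations(seqs, 2))
    if not sites or not pairs:
        return None  # nothing to compare
    diffs = sum(a[i] != b[i] for a, b in pairs for i in sites)
    return diffs / (len(pairs) * len(sites))

def sliding_window_pi(seqs, window, step):
    """Return (start, pi) for each window; start is a 0-based position."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        block = [s[start:start + window] for s in seqs]
        results.append((start, nucleotide_diversity(block)))
    return results

# Made-up alignment of four sequences, ten sites each.
aligned = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTTC",
]
pi = nucleotide_diversity(aligned)
windows = sliding_window_pi(aligned, window=5, step=5)
print(f"pi = {pi:.4f}")
for start, value in windows:
    print(f"window starting at site {start + 1}: pi = {value:.4f}")
```

In this example there are 7 pairwise differences over 6 sequence pairs and 10 sites, so pi = 7/60. With real data you would use a larger window and a smaller step, such as the 100 and 25 nucleotides suggested above.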

Once we have estimated nucleotide diversity, we can find out whether selection has potentially played any role in influencing these sequence changes.

Tajima's test, or D test statistic (Tajima, 1989), tests the neutral theory of molecular evolution (Kimura, 1983). This theory states that the vast majority of molecular differences that arise through spontaneous mutation do not influence the fitness of the individual. A corollary to this theory is that genomes evolve primarily through the process of genetic drift.

Tajima's D statistic compares two estimates of the amount of nucleotide variation, one based on the number of segregating sites (Watterson, 1975) and the other on the average number of pairwise differences (Nei and Li, 1979; Tajima, 1983). In a constant-sized population experiencing only genetic drift, both estimates should give equal values. Dissimilar values suggest that some form of selection could be acting on this sequence, or that the population has not remained constant in size.

A positive value of Tajima's D suggests 'balancing selection', in which case the data will show a few divergent haplotypes, whereas a negative value suggests that 'purifying selection' may have occurred, in which case the data will reveal an excess of singletons.

Go to Analysis|Tajima's test.

Select All Included Sequences as Data Set, the entire region as Region to Analyze, and Segregating Sites as the Nucleotide Substitutions Considered for the analysis. Click on OK.
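The comparison that Tajima's D makes can be written out directly from the published formula (Tajima, 1989). The sketch below takes the number of sequences, the number of segregating sites, and the average number of pairwise differences per sequence (not per site). The function name and the example values are hypothetical, and the sketch is meant for checking your understanding rather than for replacing the DnaSP output.

```python
import math

def tajimas_d(n, s, k):
    """Tajima's D from n sequences, s segregating sites and k, the average
    number of pairwise differences per sequence (not per site)."""
    if n < 4:
        raise ValueError("this sketch requires at least 4 sequences")
    if s == 0:
        return None  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: pairwise-difference estimate minus Watterson's estimate.
    # Denominator: estimated standard deviation of that difference.
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Made-up example: 4 sequences, 2 segregating sites, 7 differences
# spread over 6 sequence pairs.
d = tajimas_d(n=4, s=2, k=7 / 6)
print(f"Tajima's D = {d:.3f}")
```

A value near zero means the two estimates agree. Whether a non-zero value is statistically significant depends on the sample size, which is why DnaSP reports a significance level alongside D.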

Questions to consider:
What is the frequency of SNPs?
What are the nucleotide diversity statistics theta and pi?
Does the gene appear to be under selection? Why or why not?

4. TASSEL:
Trait Analysis by Association, Evolution, and Linkage (TASSEL) is a Java-based program intended to infer correlations between genetic markers and phenotypic traits (association mapping).

In this lab we will only focus on methods used to infer correlations between genetic and phenotypic data. Specifically, we will assess correlations between single nucleotide polymorphism (SNP) markers and various wood property characteristics in loblolly pine (Pinus taeda). However, this software also performs a variety of other quantitative analyses including calculation of molecular diversity, estimation of linkage disequilibrium, and inference of phylogenetic trees.

The goal in association mapping is to correlate genotypic with phenotypic variation. We refer to these correlations as marker-trait associations. The dataset, therefore, consists of both types of data.

The genotypic data are comprised of genotypic classes defined across a large number (n) of SNPs (n = 58 SNPs in the dataset).

For a standard SNP with only two states, there are only three genotypic classes in a diploid individual (homozygous for state 1, heterozygous, homozygous for state 2). ‘State’ refers to which nucleotide is found at a given SNP. These genotypic classes at each SNP are coded with single letters for each individual in the dataset.

The phenotypic data are comprised of quantitative measurements of various wood property traits (n = 18 traits).

We will use a General Linear Model (GLM) to estimate genetic effects on phenotypic data. In this context, variation at SNP markers is used to explain variation in phenotypes (y = a + bx + e, where y is the phenotypic trait, a is the intercept, x is the genotype at the SNP, b is the linear term corresponding to the SNP, and e is the error). A statistical test of the following form will be performed for each SNP and phenotypic trait:

H0: The linear term (b) corresponding to the SNP is equal to zero.

HA: The linear term (b) corresponding to the SNP is not equal to zero.

The null hypothesis is rejected when the corresponding p-value is less than 0.05 or some other predetermined significance threshold. Since there are as many tests as there are combinations of SNPs and phenotypic traits, the p-value is often adjusted to take into account the large number of statistical tests being performed (in this dataset that is 58*18 = 1044 tests!). When the p-value is less than the chosen threshold, we reject the null hypothesis and conclude that variation at this particular SNP is significantly correlated, or associated, with variation for a certain phenotypic trait.

The files you will need to work with Tassel are in your directory in ‘plantgenome’, in the Lab10 subdirectory, in a folder named ‘tassel_files’. There are two files corresponding to genotypic and phenotypic data, called ‘genWood.txt’ and ‘phenoWood.txt’, respectively.

   > cd Lab10
   > cd tassel_files
   > ls

- Transfer the files onto your computer using FileZilla.
- Open Tassel.
   Go to http://www.maizegenetics.net/index.php?option=com_content&id=89
   Click on “Launch TASSEL 2.0.1”

Note that the window is divided into three major frames. In the upper left is a data tree, where all the input data files and subsequent output files are listed. In the lower left is a status frame, which summarizes commands that are executed. On the right is a data window, which shows the data once they are imported into the program.

Click on POLY and open the genWood.txt file. The SNP genotypes for each individual are now loaded into the program. Click on the file named Allele located in the data tree to see the SNP data in the data window.

Click on TRAIT and open the phenoWood.txt file. The phenotypic data for each individual for each trait are now loaded into the program. Click on the file named 18 traits/environ located in the data tree to see the phenotypic data in the data window.

The next step is to combine the phenotypic data with the genotypic data to get a single dataset.

Highlight the files corresponding to each dataset, the SNP and the trait, in the data tree. To highlight both files (Allele and 18 traits/environ), hold the Ctrl key down while you click on each of the files. Click on U join.

We now have a complete dataset comprised of both genotypic and phenotypic data.

Click on the file named 18 traits/environ + Allele in the data tree to view the complete file.

We are now ready to perform an analysis.

Click on Analysis, and then on GLM.

The Input Data Definition window will appear, and is composed of two frames. The one on the left lists all the input data. For this lab these are the phenotypic traits and the population. Since there is only a single population in this dataset, the drop-down menu for pop should be set to Exclude. Next to each trait is a drop-down menu that specifies what should be done with that trait. A trait may be selected as data, a factor, a covariate, or be excluded. You will want to have all the phenotypic traits set to Data. Lastly, check the box in the right frame that is labeled Analyze each data column separately. This will perform separate analyses for each phenotypic trait.

Click on OK.

The Build a Linear Model window will appear, which allows a number of additional specifications to be listed.

Click on Run.

The analysis should now be running. You can verify this by looking at the status bar in the upper right corner of the program window. The results are printed to the results folder located in the data tree.

Click on the output file named GLM_18 traits/environ + Allele.

The results are located in the data window. The first column lists the phenotypic trait. There are 18 traits. Subsequent columns list important values of the GLM fittings and tests of those fits for each trait and SNP. Each phenotypic trait has 58 rows, one for each SNP. SNPs are labeled as markers with the abbreviation m(i) or q(j), where i = 1, 2, 3, ..., 48 and j = 1, 2, ..., 10.

There are two very important columns that you should inspect. The first is named F_marker. It is the test statistic used to test the hypothesis of marker-trait association. The larger the F value, the better the fit of the GLM. The second is named p_marker. This is the p-value associated with the test statistic (F). Remember that a p-value of less than 0.05 is considered significant. When p <<< 0.05, the evidence for the marker-trait association is very strong.

Questions to consider:
Are there significant associations between the traits and markers listed?
Do significant associations alone provide conclusive evidence of causation (i.e., that the variation in this marker CAUSES the variation in the phenotype)?
What additional data would be helpful to prove causation?
What is the relationship between F and p-value?
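The single-marker model described earlier in this section (y = a + bx + e) can be fitted by ordinary least squares, as in the sketch below. Coding the three genotypic classes as 0, 1 and 2 copies of one allele is an assumption made here for illustration, and TASSEL may treat the genotypic classes differently. The function name and the six example individuals are made up. The last lines show one common way to adjust the significance threshold for the 58*18 tests, a Bonferroni correction, which the handout does not prescribe.

```python
def single_marker_fit(genotypes, phenotypes):
    """Least-squares fit of y = a + b*x.

    Returns (a, b, f), where f is the F statistic for H0: b = 0 with
    1 and n - 2 degrees of freedom.
    """
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching lists with at least 3 individuals")
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes, phenotypes))
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_model = b * sxy            # variation explained by the marker
    ss_resid = syy - ss_model     # variation left unexplained
    f = float("inf") if ss_resid <= 0 else ss_model / (ss_resid / (n - 2))
    return a, b, f

# Made-up data: genotype coded as 0, 1 or 2 copies of one allele.
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
a, b, f = single_marker_fit(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, F = {f:.1f}")

# Bonferroni-adjusted threshold for 58 SNPs x 18 traits.
n_tests = 58 * 18
threshold = 0.05 / n_tests
print(f"{n_tests} tests, adjusted threshold = {threshold:.2e}")
```

For fixed degrees of freedom, a larger F corresponds to a smaller p-value, so sorting the results by F_marker or by p_marker picks out the same markers.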