
BIT 150 Lab 10: Sequence Analysis and Association Mapping

The goals of this exercise are to:

1. Perform a multiple sequence alignment with ClustalW
2. Learn how to perform sequence analysis with DnaSP
3. Understand the concepts of association mapping
4. Perform association mapping with TASSEL

1. BACKGROUND:

Multiple Sequence Alignment: refers to the process of aligning three or more sequences (DNA, RNA, or protein). Computational algorithms are used to produce and analyze the alignments. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively expensive (computationally speaking).

Common MSA Tools:

a. ClustalW
b. T-Coffee
c. ProbconsRNA


Nucleotide Diversity: Nucleotide diversity is a measure of genetic variation. It is usually associated with other statistical measures of population diversity, and is similar to expected heterozygosity. It can be calculated by examining the DNA sequences directly, or may be estimated from molecular marker data. This statistic may be used to monitor diversity within or between ecological populations, to examine the genetic variation in related species, or to determine evolutionary relationships.
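As a rough illustration of calculating nucleotide diversity directly from DNA sequences, the sketch below computes pi as the average number of differences per site over all pairs of sequences. The toy alignment and the function names are made up for this example, and the sketch assumes gap-free sequences of equal length.

```python
from itertools import combinations

def pairwise_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site (pi)."""
    pairs = list(combinations(sequences, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * len(sequences[0]))

# Hypothetical toy alignment: 4 sequences, 10 sites, no gaps.
toy_alignment = ["ATGCATGCAT", "ATGCATGCAT", "ATGTATGCAT", "ATGTATGCAA"]
pi = nucleotide_diversity(toy_alignment)
print(round(pi, 4))
```

In this toy example the six sequence pairs differ at 7 sites in total, so pi is 7 / (6 pairs x 10 sites).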

Common Tools:
a. DnaSP
b. MEGA
c. Arlequin3

Association Mapping: Association mapping takes advantage of linkage disequilibrium to map phenotypes to genotypes. It is based on the idea that traits that have entered a population only recently will still be linked to the surrounding genetic sequence of the original evolutionary ancestor; in other words, they will more often be found within a given haplotype than outside of it. Association mapping asks whether a particular genetic marker (most often a SNP) is more common in individuals with a particular phenotype than you would expect by chance.

Common Tools:

a. TASSEL
b. R
c. Plink (GWAS)
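To make the "more common than expected by chance" idea concrete, here is a minimal sketch of a chi-square test on a 2x2 table of allele presence against a two-class phenotype. This is only an illustration of the concept, not the method TASSEL uses in this lab, and the counts are invented.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 degree of freedom) for the table
           allele present   allele absent
    phenotype 1      a              b
    phenotype 2      c              d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Invented counts: the allele is seen in 30 of 40 individuals with
# phenotype 1, but only in 15 of 40 individuals with phenotype 2.
chi2 = chi_square_2x2(30, 10, 15, 25)
p = p_value_1df(chi2)
print(round(chi2, 3), p)
```

A small p-value here says only that the allele and the phenotype are not independent in the sample; it does not by itself show that the marker causes the phenotype.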


2. MULTIPLE SEQUENCE ALIGNMENT WITH CLUSTALW:

DnaSP, the program we will use for nucleotide diversity analysis, requires multiple DNA sequence alignment files in FASTA, Nexus, or Phylip format. The .ace files we worked with in previous labs (obtained from Phred, Phrap, Polyphred, and Consed) do not work. Therefore, we need to create a multiple DNA sequence alignment file that the program will accept.

In your directory in ‘plantgenome’, you will find a ‘Lab10’ subdirectory. Within this subdirectory, the folder named ‘2_5395_01_fasta’ contains the FASTA files of the sequences you assembled in Hwk9. Move into the ‘Lab10’ subdirectory, then into the folder containing the FASTA files, and list the contents of this folder:

>cd Lab10
>cd 2_5395_01_fasta
>ls

We need to create a multiple DNA sequence alignment file that contains all the sequences from a single contig. It is important that you use only the sequences from a single contig for this step. Aligning sequences from different contigs produces problematic alignment files that show much more nucleotide diversity than is actually present.

Use the command ‘cat’ to concatenate all the sequences from a single contig from individual text files into one single text file. Name the output file ‘contig1.fasta’.

>cat sequence1.fasta sequence2.fasta sequence3.fasta ... sequenceN.fasta > contig1.fasta

Here sequence1, sequence2, sequence3, ..., sequenceN are the names of the FASTA files of the sequences from Hwk9 that we provided in the folder ‘2_5395_01_fasta’.

To align all the sequences from a single contig that you just concatenated, we will use ClustalW.

>clustalw
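The concatenation step above can also be sketched in Python, which makes it easy to confirm how many sequences ended up in the combined file before aligning it. This is an optional illustration, not part of the lab procedure; the helper names and the two tiny demo files are made up.

```python
import tempfile
from pathlib import Path

def concatenate_fasta(input_paths, output_path):
    """Write the contents of each input FASTA file to output_path, in order."""
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            # Make sure each file ends with a newline so the next header
            # is not glued onto the previous sequence line.
            out.write(text if text.endswith("\n") else text + "\n")

def count_records(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Demonstration on two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "sequence1.fasta")
    second = Path(tmp, "sequence2.fasta")
    first.write_text(">seq1\nATGC")       # no trailing newline, on purpose
    second.write_text(">seq2\nATGT\n")
    combined = Path(tmp, "contig1.fasta")
    concatenate_fasta([first, second], combined)
    n_records = count_records(combined)
    combined_text = combined.read_text()

print(n_records)
```

If the number of records does not match the number of files you concatenated, check that every input file ends with a newline and begins with a ‘>’ header line.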


Load the input FASTA file by selecting option 1. Sequence Input From Disc:

Enter the name of your FASTA file created above (contig1.fasta).

Select option 2. Multiple Alignments to perform a multiple sequence alignment of the sequences from a single contig you concatenated and saved in the FASTA file:


Select option 9. Output format option to select the format of the output file:

Turn option F. Toggle FASTA format output on by typing F and pressing Enter to produce a multiple sequence alignment output file in FASTA format:


Return to the previous menu to run the alignment (press Enter):

Select option 1. Do complete multiple alignment now.

You will need to enter a name for the ClustalW output file (default is the input file name ‘contig1’ with a .aln extension). Use the default for this (press Enter).

You will need to enter a name for the FASTA output file (default is the input file name with a .fas extension). Enter a name (‘contig11.fas’) for the output file and use the extension .fas.

You will need to enter a name for the new GUIDE TREE file (default is the input file name with a .dnd extension). Use the default for this (press Enter).

Once the multiple sequence alignment is finished, return to the main menu (type X) and exit ClustalW (press Enter and then type X to exit the program).

You should now have 3 new files in your ‘2_5395_01_fasta’ subdirectory: the .aln, .fas, and .dnd files. We will be using the .fas file (the multiple sequence alignment of the sequences for a single contig in FASTA format) for the DnaSP portion of the lab.

FileZilla is a program for managing files when working with a UNIX server. The program can be downloaded from: http://filezilla-project.org/. Steps to use the software:

Open Start/Programs/BioInformatics/FileZilla or double-click on the desktop shortcut
In the host field: plantgenome.plantsciences.ucdavis.edu
In the username field: your Kerberos username
In the password field: your Kerberos password
In the port field: 22
Click on Quickconnect and you can now transfer files between your computer and the UNIX server
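Once the .fas file is on your computer, a quick sanity check is to confirm that every sequence in it has the same length, since an alignment pads shorter sequences with gap characters. The sketch below is an optional illustration with made-up function names; it parses FASTA text and tests for equal lengths.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name] += line
    return records

def is_aligned(records):
    """True if every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1

# Made-up example text; for a real file you would read its contents
# first, for example the .fas file written by ClustalW.
example = ">a\nATG-CA\nTT\n>b\nATGGCA\nTT\n>c\nATG-CA\nT-\n"
records = read_fasta(example)
print(len(records), is_aligned(records))
```

If the check fails, the file is probably the unaligned input (contig1.fasta) rather than the aligned output.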

3. DNASP:

About DnaSP
DnaSP (Rozas and Rozas, 1999; Rozas et al., 2003) is a software package for the analysis of DNA polymorphism from nucleotide sequence data. DnaSP runs on a Windows platform and is freely available at http://www.ub.es/dnasp/.

In this lab, you will learn how to use DnaSP to calculate the nucleotide diversity present in nucleotide sequence data, and how to test for departure from a neutral model of evolution, i.e. genetic drift. However, DnaSP is also capable of performing a number of other calculations.

As explained at the beginning of this lab, DnaSP requires a multiple DNA sequence alignment file in FASTA format.

Open DnaSP. The opening screen has animated images of DNA double-helices, which stop when you click anywhere on the screen.

Click on File in the toolbar to get a blank screen.

Go to File|Open Data File... and open the .fas file that you just prepared.

This opens a Data Information window, which shows a summary of your data, i.e. total number of nucleotide sites, total number of sequences, etc. Close this window (to open the Data Information window at any time, go to Display|Data info).

Go to Display|View Data to see the multiple sequence alignment.

This opens a DNA Sequence Polymorphism window with the aligned sequence names along the left side, and nucleotide bases along the top. You can scroll along the length of the sequence, or down the right side to view all the sequences, using the scroll bars.

In the bottom right corner, there is a Select Sites/Codons… drop-down box with options for how you can view your data. This includes options for highlighting only the invariable (monomorphic) or only the variable (polymorphic) sites. Select these options in turn to see how this affects the data shown. If the sequences were annotated, you could also view the sequence as codons, and highlight synonymous and nonsynonymous nucleotide sites.

Calculation of nucleotide diversity
Click on Analysis in the toolbar to see the variety of analyses that DnaSP can perform.

Go to Analysis|DNA polymorphism.

This brings up a DNA Polymorphism Options window.

The Data Set drop-down box gives you the option to select the dataset to be analyzed. Since your dataset contains only one set of sequences, the only option given will be All Included Sequences.

You can estimate the nucleotide diversity in your data set either across the entire sequence or in specific regions by selecting the Region to Analyze.

You can also estimate whether nucleotide diversity is particularly high in a specific region of the sequence using the Sliding Window option. If you check the Compute box, you can then define the size of the sequence block (Window Length) and how often to repeat the calculation (Step Size). As an example, you can see how the pattern of nucleotide diversity changes in 100 nucleotide blocks, every 25 nucleotides along your sequence.

Finally, there are several Options of associated algorithms for the calculation of nucleotide diversity (average number of nucleotide differences per site between two sequences):

(i) Variance of Pi - this refers to the variance in the average number of nucleotide differences per site between two sequences (Nei 1987).

(ii) Nucleotide diversity with Jukes and Cantor correction factor - this model corrects for bases where mutation has occurred more than once. As such, the Jukes and Cantor correction accounts for how sequences evolved (Jukes and Cantor 1969; Lynch and Crease 1990).

(iii) Nucleotide diversity (gaps/missing data) - both of the earlier options assume that there are no gaps in the sequence data. However, if there are indels in the sequence, you will need to select this option; otherwise these indels will be ignored during the analysis.

NOTE: You can select only (i), only (ii), both (i) and (ii), or only (iii), but not all three of them.

Select All Included Sequences as Data Set, the entire region as Region to Analyze, and both Compute Variance of Pi and Compute Pi as the Options to calculate nucleotide diversity. Click on OK.
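The Sliding Window option described above can be pictured with a short sketch: pi is recomputed in blocks of Window Length sites, moving Step Size sites each time. This is an illustration under simplifying assumptions (gap-free alignment, only full windows reported); it is not DnaSP's own implementation, and the function names and toy data are invented.

```python
from itertools import combinations

def pi_per_site(sequences):
    """Average number of pairwise differences per site."""
    n_pairs = 0
    n_diffs = 0
    for a, b in combinations(sequences, 2):
        n_pairs += 1
        n_diffs += sum(1 for x, y in zip(a, b) if x != y)
    return n_diffs / (n_pairs * len(sequences[0]))

def sliding_window_pi(sequences, window_length, step_size):
    """Return (start, end, pi) for each full window.

    start is 0-based and end is exclusive, as in Python slices.
    """
    length = len(sequences[0])
    results = []
    for start in range(0, length - window_length + 1, step_size):
        end = start + window_length
        window = [seq[start:end] for seq in sequences]
        results.append((start, end, pi_per_site(window)))
    return results

# Invented toy alignment of 12 sites; all variation is near the 3' end.
toy = ["A" * 12, "A" * 10 + "TA", "A" * 10 + "TT"]
windows = sliding_window_pi(toy, window_length=4, step_size=4)
for start, end, value in windows:
    print(start, end, round(value, 4))
```

With a step size smaller than the window length (for example 100 and 25, as in the text), consecutive windows overlap, which smooths the resulting diversity profile.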

Once we have estimated nucleotide diversity, we can find out whether selection has potentially played any role in influencing these sequence changes.

Tajima's test, or D test statistic (Tajima, 1989), tests the neutral theory of molecular evolution (Kimura, 1983). That is, the vast majority of molecular differences that arise through spontaneous mutation do not influence the fitness of the individual. A corollary to this theory is that genomes evolve primarily through the process of genetic drift.

Tajima's D statistic compares two estimates of the amount of nucleotide variation, one based simply on the number of segregating sites (Watterson, 1975) and the other on the average number of pairwise differences (Nei and Li, 1979; Tajima, 1983). In a constant-sized population experiencing only genetic drift, both estimates should give approximately equal values. Dissimilar values suggest that some form of selection could be acting on this sequence.

A positive value of Tajima's D suggests that there may have been 'balancing selection', and the data will show a few divergent haplotypes, whereas a negative value suggests that 'purifying selection' may have occurred, and the data will reveal an excess of singletons.

Go to Analysis|Tajima's test.

Select All Included Sequences as Data Set, the entire region as Region to Analyze, and Segregating Sites as the Nucleotide Substitutions Considered for the analysis. Click on OK.
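To show how the two estimates are compared, the sketch below computes Tajima's D from an alignment using the standard formulas from Tajima (1989): the average number of pairwise differences k, the number of segregating sites S, and the constants a1, a2, b1, b2, c1, c2, e1, e2. It assumes a gap-free alignment of at least four sequences; the function name and both toy data sets are invented for illustration, and results from DnaSP on real data may differ in how gaps and missing data are handled.

```python
import math
from itertools import combinations

def tajimas_d(sequences):
    """Tajima's D for a gap-free alignment; returns None if no site varies."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        return None
    # k: average number of pairwise differences over the whole sequence.
    pairs = list(combinations(sequences, 2))
    k = sum(sum(1 for x, y in zip(a, b) if x != y) for a, b in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimates, scaled by its standard deviation.
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Invented data: every variant is a singleton (seen in one sequence only).
singletons = ["AAAAAA", "TAAAAA", "ATAAAA", "AATAAA", "AAAAAA"]
# Invented data: two divergent haplotypes at equal frequency.
two_haplotypes = ["AAAA", "AAAA", "AAAA", "TTTT", "TTTT", "TTTT"]

d_singletons = tajimas_d(singletons)
d_haplotypes = tajimas_d(two_haplotypes)
print(d_singletons, d_haplotypes)
```

The first data set gives a negative D (an excess of singletons) and the second a positive D (a few divergent haplotypes), matching the interpretation given above.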

Questions to consider:
What is the frequency of SNPs?
What are the nucleotide diversity statistics theta and pi?
Does the gene appear to be under selection? Why or why not?

4. TASSEL:
Trait Analysis by Association, Evolution, and Linkage (TASSEL) is a Java-based program intended to infer correlations between genetic markers and phenotypic traits (association mapping).

In this lab we will only focus on methods used to infer correlations between genetic and phenotypic data. Specifically, we will assess correlations between single nucleotide polymorphism (SNP) markers and various wood property characteristics in loblolly pine (Pinus taeda). However, this software also performs a variety of other quantitative analyses including calculation of molecular diversity, estimation of linkage disequilibrium, and inference of phylogenetic trees.

The goal in association mapping is to correlate genotypic with phenotypic variation. We refer to this as marker-trait associations. The dataset, therefore, consists of both types of data.

The genotypic data are comprised of genotypic classes defined across a large number (n) of SNPs (n = 58 SNPs in the dataset).

For a standard SNP with only two states, there are only three genotypic classes in a diploid individual (homozygous for state 1, heterozygous, homozygous for state 2). ‘State’ refers to what nucleotides are found at a given SNP. These genotypic classes at each SNP are coded with single letters for each individual in the dataset.

The phenotypic data are comprised of quantitative measurements of various wood property traits (n = 18 traits).

We will use a General Linear Model (GLM) to estimate genetic effects on phenotypic data. In this context, variation at SNP markers is used to explain variation in phenotypes (y = a + bx + e, where y is the phenotypic trait, a is the intercept, x is the SNP genotype, b is the linear term corresponding to the SNP, and e is the error). A statistical test of the following form will be performed for each SNP and phenotypic trait:

H0: The linear term (b) corresponding to the SNP is equal to zero.

HA: The linear term (b) corresponding to the SNP is not equal to zero.

The null hypothesis is rejected when the corresponding p-value is less than 0.05 or some other predetermined significance threshold. Since there are as many tests as there are combinations of SNPs and phenotypic traits, the p-value is often adjusted to take into account the large number of statistical tests being performed (in this dataset that is 58*18 = 1044 tests!). When the p-value is less than the chosen threshold, we reject the null hypothesis and conclude that variation at this particular SNP is significantly correlated, or associated, with variation for a certain phenotypic trait.
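The test of H0: b = 0 can be sketched for a single SNP and a single trait with ordinary least squares. The sketch assumes the three genotypic classes are coded as 0, 1, and 2 copies of one state; this additive coding is an assumption made for illustration, and TASSEL may model the genotype classes differently. The genotypes and phenotype values are invented.

```python
def regression_f(x, y):
    """Fit y = a + b*x + e by least squares and return (a, b, F).

    F is the statistic for H0: b = 0, with 1 and n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_model = ss_total - ss_resid
    f_stat = ss_model / (ss_resid / (n - 2))
    return a, b, f_stat

# Invented data: six individuals, genotype coded 0/1/2 (assumed coding),
# and a made-up quantitative trait that increases with the genotype code.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
a, b, f_stat = regression_f(genotype, phenotype)
print(round(a, 3), round(b, 3), round(f_stat, 1))
```

A large F means the SNP term explains much more variation than is left in the residuals; converting F to a p-value requires the F distribution, which a statistics package provides.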
The files you will need to work with TASSEL are in your directory in ‘plantgenome’, in the Lab10 subdirectory, in a folder named ‘tassel_files’. There are two files corresponding to genotypic and phenotypic data, called ‘genWood.txt’ and ‘phenoWood.txt’, respectively.

> cd Lab10
> cd tassel_files
> ls

- Transfer the files onto your computer using FileZilla.
- Open TASSEL.
Go to http://www.maizegenetics.net/index.php?option=com_content&id=89
Click on “Launch TASSEL 2.0.1”

Note that the window is divided into three major frames. In the upper left is a data tree, where all the input data files and subsequent output files are listed. In the lower left is a status frame, which summarizes commands that are executed. On the right is a data window, which shows the data once it is imported into the program.

Click on POLY and open the genWood.txt file. The SNP genotypes for each individual are now loaded into the program. Click on the file named Allele located in the data tree to see the SNP data in the data window.

Click on TRAIT and open the phenoWood.txt file. The phenotypic data for each individual for each trait are now loaded into the program. Click on the file named 18 traits/environ located in the data tree to see the phenotypic data in the data window.

The next step is to combine the phenotypic data with the genotypic data to get a single dataset.

Highlight the files corresponding to each dataset, the SNP and the trait, in the data tree. To highlight both files (Allele and 18 traits/environ) hold the Ctrl key down while you click on each of the files. Click on U join.

We now have a complete dataset comprised of both genotypic and phenotypic data.

Click on the file named 18 traits/environ + Allele in the data tree to view the complete file.

We are now ready to perform an analysis.

Click on Analysis, and then on GLM.

The Input Data Definition window will appear, and is composed of two frames. The one on the left lists all the input data. For this lab that is the phenotypic traits and the population. Since there is only a single population in this dataset, the drop-down menu for pop should be set to Exclude. Next to each trait is a drop-down menu that specifies what should be done with these traits. They may be selected as data, a factor, a covariate, or be excluded. You will want to have all the phenotypic traits set to Data. Lastly, check the box in the right frame that is labeled Analyze each data column separately. This will perform separate analyses for each phenotypic trait.

Click on OK.

The Build a Linear Model window will appear, which allows a number of additional specifications to be listed.

Click on Run.

The analysis should now be running. You can verify this by looking at the status bar in the upper right corner of the program window. The results are printed to the results folder located in the data tree.

Click on the output file named GLM_18 traits/environ + Allele.

The results are located in the data window. The first column lists the phenotypic trait. There are 18 traits. Subsequent columns list important values of the GLM fittings and tests of those fits for each trait and SNP. Each phenotypic trait has 58 rows, one for each SNP. SNPs are labeled as markers with the abbreviation m(i) or q(j), where i = 1, 2, 3…48 and j = 1, 2…10.

There are two very important columns that you should inspect. The first is named F_marker. It is the test statistic used to test the hypothesis of marker-trait association. The larger the F value, the better the fit of the GLM. The second is named p_marker. This is the p-value associated with the test statistic (F). Remember that a p-value of less than 0.05 is considered significant. When p is much smaller than 0.05, the evidence for a marker-trait association is very strong.

Questions to consider:
Are there significant associations between the traits and markers listed?
Do significant associations alone provide conclusive evidence of causation (i.e., the variation in this marker CAUSES the variation in the phenotype)?
What additional data would be helpful to prove causation?
What is the relationship between F and p-value?
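When reading the p_marker column, it can help to compare each value against a threshold adjusted for the number of tests. The sketch below uses the Bonferroni correction (0.05 divided by the 58 x 18 tests in this dataset), which is one common adjustment and not necessarily the one your instructor expects. The three result rows are invented and do not come from the lab data.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the overall error rate near alpha."""
    return alpha / n_tests

n_snps, n_traits = 58, 18
threshold = bonferroni_threshold(0.05, n_snps * n_traits)

# Invented rows of (trait, marker, p_marker) for illustration only.
rows = [
    ("trait_A", "m1", 0.03),
    ("trait_A", "m2", 0.00001),
    ("trait_B", "q1", 0.2),
]
significant = [row for row in rows if row[2] < threshold]

print(threshold)
print(significant)
```

In this made-up example the first row passes the unadjusted 0.05 cutoff but not the adjusted one, which shows why the adjustment matters when over a thousand tests are run.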
