or, “Where Your Data Go After...

Page 1: or, “Where Your Data Go After Sequencing”cbsu.tc.cornell.edu/lab/doc/Buckler_FilterImpTools120531.pdf · BREAD Buckler BREAD-Maize-A D03 MR_0029.1 (PI655994 x PI655998)S4 PI655994

1

GBS Bioinformatics Pipeline ...or, “Where Your Data Go After Sequencing”

James Harriman Ed Buckler Jeff Glaubitz

Reference Genome Pipeline

[Flowchart. Each box is either a file (data structure) or a process.]

Files (data structures): Qseq; Key files; TagCounts per lane; TagCounts for species (Master Tags); Fastq; SAM alignment; TagsOnPhysicalMap; TagsByTaxa files (1 per lane); TagsByTaxa for species; HapMap

Processes: QseqToTagCount; Merge TagCounts; TagCountsToFASTQ; BWA (Burrows-Wheeler Aligner); SAM converter; QseqToTBT; Merge TagsByTaxa; TagsToSNPByAlignment

2

Non-Reference Genome Pipeline

[Flowchart. Each box is either a file (data structure) or a process.]

Files (data structures): Qseq; Key files; TagCounts per lane; TagCounts for species (Master Tags); TagsByTaxa files (1 per lane); TagsByTaxa for species; HapMap

Processes: QseqToTagCount; Merge TagCounts; QseqToTBT; Merge TagsByTaxa; TagHomologyPhaseNoAnchor

HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG ggggggggggggdeggggggggggg^dccebabcbde_Tb[\`T_baa``ddcdd^c^^b_ab[K]]SO_\_\T\VVVW\`dc\``b`b`]U]UZWXW[T 1 HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA fefefffefe_ddd]feeeceefffeeafbd`d`dfefedccffeeefccedfea^deddcZfae[[ba`bccb_da`^bQccc`c\`c`daT`W]T_[] 1 HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT ggcggggfgggggggggffgggggdeeggdffgfggggggggcgggggggggfgggceggeggcbgggggedebeefccfgcggegeddeegdfde`fef 1 HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG fffffffdefegfdffbcfeffeffffedegdggdgefbfe^cfcf\ded\]_]_Ya`]KW`cc`cYdd`Q[XYXabWaLa]_aZ]Y]TWddd^Y```Tc 1 HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG VTMKOJJMUGVZN[V_`_`YWZYPWILQSNYbb\Y_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15786 200078 0 1 
GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG fceffcddbeabbdcaebeccaedZc`bc`^``X`\cbbc^d`b[c\[dcdbcd^`^]`bZdabZaVa\dZZ[SQccS``c^_^c^bTa^\b\b_BBBBB 1 HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG effffefffffYfeedbdddad\ddbTb^^abMb`aedeedaedacdea`cWdaadcYdXdabb`df]efd^db`dddddfefff``Z`cc^ac^`BBBB 1 HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA ggggggggggf`ggfggggdggggf^gfgdfffffggdgeggeggggcgbaedaebdd`debccc^aVccbabK_`_b_`d_beeTMMSM``Z]^X``_B 1 HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT ^YTLX]X]ZX]\^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC gggggggfgggggggggggggggggegggg`gggggegggggggdeggggggeggggegegggeead^dddaa]bXYZ^`bdbbb_\Yd_`cM``bb^aW 1 HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC fdedffdc^ddbebececed[acde_^OTN__`a_ed^^dadaabK][^^X]NYMcc^[WOUSSE]`S[[U`ZU`^_I`a`Z_b`Q[]_]]]V[R^^`[` 1 HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC eeeb`ebbeeb^]^bebeeb[`X^`^dXcabZb]bb`bbb_dd]]_]_bbd\_\_cc[___a__Vca_Sc`b]Y\c__]b`V[\^[_PSbBBBBBBBBBB 1 HWI-ST397 0 3 68 15842 200099 0 1 
TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG dddddbffffffdfaadaeeffff[M[QUJda`[VRIQR\cX[cddc[df^cX^ISS^QZ`Y^TYEPTZS\`T\`MSTTSNVYQQ]]]T`\`QY_BBBBB 1 HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG XVKYYIUWWM\O__BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG \c\ededdceeeVeedffddaceedeeeWS[O``\cccccdadZd\daddeaacfefeee\^d_d_ccbUQVWHUIKPFGUY\UY[T[[XT`YT\]_b^] 1 HWI-ST397 0 3 68 15765 200113 0 1 CCAGCTCAGCATGGATCTCTCCTTGATGGACTGAAAGCGCGTGTGCTCCCCTGTGTGATGGAAAGTGGCAGTGATCGGAAGTTCGGTACGGAAGGAGTGC NQHLGNUSTUIQUTX_O_X][_YX[K\\U[O^KJ\[SYW]^[[^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15912 200114 0 1 CCAGCTCAGCTCAAGCATTGGCTTCCGCTTTGGCATCCTGGAGGGTAAGCTTCTGCTCTTCTCACTAGAGGAGGATCCTGGCTATCGAAACCTGTTGTAT a\YaY^^[^R^a^`[c[c\`VS^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15791 200127 0 1 ACAAACAGCAGAGGTCGCATTGTAGTTAGTCCGGGACTTGCCCAGTTCATTGCTGAGATCGGAAGAGCGGTTCAGCAGGCATGCCGAGACCGATCTCGTA QY^\^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15831 200117 0 1 GCTCTACAGCTTCTGGCCAGAATGCTTTTGGCACTTGTTTGTCACAAAGCATGCACTGAACCATATTCATGATAGTTCGATTTTTGCGTTCAGCTACCCC b_`ba\X\ZZ__X^XUabbaKKK\V\_\XbUHMXURRNURTT`PVQQTSXSV]NIVXVSHWTTSI[VYW^Wa`_`\^BBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15848 200124 0 1 
TTCTCCAGCTGCTACATGCACCGTGGGAAGAAGGTCTGCCCCACATACCCACCAGCCATCGCCCTTCTCACATTCGATTCAAACATCTTTGGGTTATCCC a__^a^d_dda_edeeec]ed]c^da^`^c`ca```cac]aa`Y_acc_`c[V^bc^``b`cG\`_]^VYUabX^UIXXX[OPVUW_]``HU[X]aS`bZ 1 HWI-ST397 0 3 68 15891 200120 0 1 GAGATACAGCTGCGAATTGGGGGTTCCTGTGTTGCGAAGTGGCACTCGTGTGCCAAACTTGGCTACGCAGAGATCGGTAGAGCGGTTAAGCAGGAATGCC eadcdeddffffefbbffdffefdfcedcdfffff]abcbdYddebde_dadXc`cMccbc]^`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HWI-ST397 0 3 68 15931 200128 0 1 AAAAGTTCAGCAATACCTGTTGAAGCCAAGCCCTTGTGGTGATTGCCTCGTTCATTGCTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG ddddNcb\_bdddddeaeceab`_]VX]_\___cbZT[ZQ[[[Y]`^cZ]bcbYbcYcU^K]\S[^`Q]]VVPIUIUQQRMQ[NWbcb\L__LZ`BBBBB 1 HWI-ST397 0 3 68 15991 200121 0 1 GAATCTGCTACTAGTGAGCCTTTGTATGGGGACCGAGTTCAGAAGCTCTAACCCTCGTTTTCCCATCTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATG [\W\YN]XTX\\URV]]T[[O^_[NPLWWT]XXRRYYTXR]T]XX^a`^Y\WJX]Y[GTTUULRKIMPEMT[O[Zaa^Ra`BBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15765 200133 0 1 TAGCATGCCTGCTGCAGGAGTTGGTGCCCAGCATTCTCAGGTGTAGTCCAAATTCTGTCTGATACTTATTGTTTATGCGATTTTGCCATCACATATGGAG d]dcd]ddeeeeceZ`ddaUY]^aX_^[[^acRda`^c``c_eebbbda]X`c^`ce``debc[[_Q_][`cd_Zee`a[eeeda_^_ZX]^[^Kb`^X^ 1 HWI-ST397 0 3 68 15810 200133 0 1 TTCAGACAGATGATGCTTGTCAAGGGTCACCATCTTGCATTGCGCTGCGTCACATCCTTAGTGGGAATAGGGGATCAGCCTCGCTTTTTGAAGCTGAGTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15871 200135 0 1 CTTGCTTCAGCCATGTAGAGTGGTGTTGCTCCTTTACTACCACGAATCATTGGTAACTCCCTGTTCTTATTCACCATTACCTCAGCAATCCTTGTGATTC gggggggggggggggggfggdff^ebbcccgeggggegggggggdggeggefgd^bbeeeggeegegfgfeaee`ccbccUedeedfdfe`caccT[baa 1 HWI-ST397 0 3 68 15974 200136 0 1 TTCAGACAGCCAAACGACGTCTTAGTGGAGAAAATACCTGAGAAAAGTCAAGAAACCAAAACACTAAAAAATGACCAAGAAATAGAGCAGAGATCGGAAG ggggggggggggggggggggeggggeggggegggdagedegggggdgbeddgeedgegdedggdedece^^baXc`f`dde[c[cbcIceddRda\_c\` 1 HWI-ST397 0 3 68 15909 200147 0 1 
AGCCTCAGCTTGGTTGCTTGTGGTTGGGGGTGAGGGGGCGGGCGGGAACTTATGTTTGCGCCCCGAGGCGGAGCCGCTCTAGGCAGAGATCGGAAGAAGG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15946 200152 0 1 CTTGACTGGGCGTGGTGCTGAGGCTACTGCGGAATTGAGGTGTTGTCATCCACCGGATTGGGTCGTAGGGCGTGGCTTTGAGCGGAGATCGGATGAGCGG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15774 200153 0 1 TTCAGACAGCCAACTGAGATGACTCTCATTCTTGGTAGGAACCAATTTCTGAGAGCTTCGTAATGACATCAACTACAGCTGTGATCGGAAGAGCGGTTCA U[P[VZ\NWR]M]T\[J``[_[I^RY[Q\\^Z]^KSSUMGOZIOI]WPTT_UNVVTWQMW[^O^\^]]aR][aBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15814 200155 0 1 GAGATACAGCAACAAATGATGTCATTCCTTGCAAAAGCTGTACAAAGCCCTGGTTTCTTAGCTCAGCTGGTACAGCAGAGATCGGACGAGCGGTTCGCCA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15850 200154 0 1 GTGTTTGGTCGTGAAAGTGGACCTCTTTCAGGTGCAGGTGCGAGTAGAAGGAGGTCCCAGAGACGTGCGGCTGGAGATCGGAAGGGTGCTGAGGCGGGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15870 200157 0 1 GAGAAACCGCAGAATGATAGCAAAAAGCGCGTTACAGGAGATATTAAGAAAAGGAGACTTGCAATGCAGGAGTAAGAGATCCATTCTGCTATGCATGCTC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 HWI-ST397 0 3 68 15984 200158 0 1 CGTCAACTGCATGAAGGAGGTTGTCTGGCCGTTGGAGGAGTGATTTTGGAAGGCTGAGATCGGAAGAAAGGTTCAGCAGGAATGCAGAGACCGATCTCGT N[a`cY^Y\X\MXZX\WRPTc`cab[[T_[\]MZ\GPPWQ\_N]_[X\[]TcNcaa\URaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1 HW

Raw Sequence (Qseq)
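Each Qseq record in the listing above has 11 whitespace-separated fields. As a minimal sketch, the Python below splits one record into named fields. The field names follow the usual Illumina Qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) and are an assumption, since the slide does not label them. The toy record reuses the header fields of the first record above with a shortened, made-up sequence and quality string.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one Qseq record into its 11 fields (assumed field order)."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, flt = fields
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return QseqRecord(machine, run, int(lane), int(tile), int(x), int(y),
                      index, int(read_no), seq, qual, flt == "1")


# Toy record: real header fields, shortened made-up sequence and quality.
toy_sequence = "GTCGATTCTGCT"
toy_line = f"HWI-ST397 0 3 68 15896 200039 0 1 {toy_sequence} {'g' * len(toy_sequence)} 1"
record = parse_qseq_line(toy_line)
print(record.lane, record.tile, record.passed_filter)
```

The last field is read here as a pass/fail filter flag; in the listing above, the records whose quality string is all "B" carry a 0 in that position.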

3

Assignment to Samples

Barcode sequences from the plate map are compared to barcode sequences in the reads in order to associate each read with the sample from which it originates.

Parameters: Users supply a plate map and staff members supply DNA barcodes. These are combined into a table of barcodes by sample.

Project Details Sample Details Organism Details Origin Details Project Name Source Lab Plate Name Well Sample Name Pedigree Population Stock Number Sample DNA Concentration Sample Volume Sample DNA mass Preparer Kingdom Phylum Class Order Family Genus Species Subspecies Variety Location Name Country BREAD Buckler BREAD-Maize-A A01 PI597982 inbred 04A0160A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B01 blank 0 0 0 plantae zea mays BREAD Buckler BREAD-Maize-A C01 PI576130 inbred 04A0191B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D01 PI655991 inbred 04A0165A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E01 PI656059 inbred 04A0193B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F01 CML91 inbred 04A0005BA 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A G01 CML311 inbred 04A0301A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H01 CML311 inbred 04A0200A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A02 MR_0011.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0281A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B02 MR_0013.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0279B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C02 MR_0014.3 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0164B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D02 MR_0015.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0163A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E02 MR_0016.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0315B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F02 MR_0018.2 (PI655994 x PI655998)S4 
PI655994 x PI655998 02F146114A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A G02 MR_0020.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0289B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H02 MR_0022.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0171A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A03 MR_0025.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0170B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B03 MR_0027.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0381B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C03 MR_0028.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0258A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D03 MR_0029.1 (PI655994 x PI655998)S4 PI655994 x PI655998 04A0304B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada Cacao Buckler BREAD-Maize-A E03 Tc1536 Catie F1 04A0216A 10 100 1000 Jemmy Takrama plantae theobroma cacao gha Tafo Cacao Buckler BREAD-Maize-A F03 Tc7959 Brazil F2 04A0255A 10 100 1000 Jemmy Takrama plantae theobroma cacao gha Tafo BREAD Buckler BREAD-Maize-A G03 PI542406 inbred 04A0217A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A H03 PI655981 inbred 04A0167A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A A04 PI656007 inbred 04P160451A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A B04 PI17548 inbred 04A0258B 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A C04 PI564163 inbred 04A0244A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A D04 PI651492 inbred 04A0298A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A E04 PI656008 inbred 04A0293B 10 100 1000 
Wenyan Zhu plantae zea mays mex Baja Encinada BREAD Buckler BREAD-Maize-A F04 PI655985 inbred 04A0296A 10 100 1000 Wenyan Zhu plantae zea mays mex Baja Encinada

Plate Map

4

Flowcell   Lane  barcode  sample   Plate#  Row  Column  PlateName
434GFAAXX  2     CTCC     M0001    1       A    1       IBM1       1A01
434GFAAXX  2     TGCA     M0012    1       A    2       IBM1       1A02
434GFAAXX  2     ACTA     M0021    1       A    3       IBM1       1A03
434GFAAXX  2     GTCT     M0029    1       A    4       IBM1       1A04
434GFAAXX  2     GAAT     M0038    1       A    5       IBM1       1A05
434GFAAXX  2     GCGT     M0046    1       A    6       IBM1       1A06
434GFAAXX  2     TGGC     M0057    1       A    7       IBM1       1A07
434GFAAXX  2     CGAT     M0067    1       A    8       IBM1       1A08
434GFAAXX  2     CTTGA    M0080    1       A    9       IBM1       1A09
434GFAAXX  2     TCACC    M0090    1       A    10      IBM1       1A10
434GFAAXX  2     CTAGC    M0099    1       A    11      IBM1       1A11
434GFAAXX  2     ACAAA    M0113    1       A    12      IBM1       1A12
434GFAAXX  2     TTCTC    M0003    1       B    1       IBM1       1B01
434GFAAXX  2     AGCCC    M0013    1       B    2       IBM1       1B02
434GFAAXX  2     GTATT    M0022    1       B    3       IBM1       1B03
434GFAAXX  2     CTGTA    M0030    1       B    4       IBM1       1B04
434GFAAXX  2     AGCAT    M0039    1       B    5       IBM1       1B05
434GFAAXX  2     ACTAT    M0047    1       B    6       IBM1       1B06
434GFAAXX  2     GAGAAT   M0058    1       B    7       IBM1       1B07
434GFAAXX  2     CCAGCT   M0068    1       B    8       IBM1       1B08
434GFAAXX  2     TTCAGA   M0081    1       B    9       IBM1       1B09
434GFAAXX  2     TAGGAA   unknown  1       B    10      IBM1       1B10

Example DNA Barcode Key
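The barcode key is what lets a read be traced back to its sample. The following is only a sketch of that lookup in Python, not the pipeline's own code: a read is assigned to a sample when it starts with that sample's barcode and the barcode is immediately followed by the cut-site remnant. The remnants `CAGC` and `CTGC` are an assumption (the example tags in these slides begin with `CAGC`), and `assign_sample` is a hypothetical helper name. The dictionary holds a few rows copied from the key above.

```python
from typing import Dict, Optional, Tuple

# A few rows from the example key above: barcode -> sample.
BARCODE_TO_SAMPLE: Dict[str, str] = {
    "CTCC": "M0001",
    "TGCA": "M0012",
    "CTTGA": "M0080",
    "GAGAAT": "M0058",
}

# Assumed cut-site remnants that must directly follow the barcode.
CUT_SITE_REMNANTS: Tuple[str, ...] = ("CAGC", "CTGC")


def assign_sample(read: str,
                  barcode_to_sample: Dict[str, str] = BARCODE_TO_SAMPLE,
                  remnants: Tuple[str, ...] = CUT_SITE_REMNANTS
                  ) -> Tuple[Optional[str], Optional[str]]:
    """Return (sample, read without barcode), or (None, None) if no match."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcode_to_sample, key=len, reverse=True):
        if read.startswith(barcode) and read.startswith(remnants, len(barcode)):
            return barcode_to_sample[barcode], read[len(barcode):]
    return None, None


print(assign_sample("CTCCCAGCCCTCGGCGG"))   # barcode CTCC is in the key
print(assign_sample("GTTGAACAGCCCTCGG"))    # barcode GTTGAA is not in this excerpt
```

Because the barcodes vary in length (4 to 6 bases in this key), checking for the cut-site remnant right after the barcode is what keeps a 4-base barcode from matching by chance inside a longer one.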

Notes on Names & Chromosomes

•  Chromosome (or contig) names MUST be integers

•  Sample names, some advice:

•  NO spaces

•  NO “:”

•  Try to avoid weird characters.
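The naming advice above can be checked before a key file is submitted. This is a small Python sketch under stated assumptions: "weird characters" is taken to mean anything other than letters, digits, underscore, period and hyphen, which is my reading and not a rule from the slides, and both function names are hypothetical.

```python
import re
from typing import List


def sample_name_problems(name: str) -> List[str]:
    """List the ways a sample name breaks the advice above (empty list = fine)."""
    if not name:
        return ["empty name"]
    problems = []
    if re.search(r"\s", name):
        problems.append("contains spaces")
    if ":" in name:
        problems.append('contains ":"')
    # Assumption: only letters, digits, "_", "." and "-" count as safe.
    if re.search(r"[^A-Za-z0-9_.\-\s:]", name):
        problems.append("contains unusual characters")
    return problems


def is_valid_chromosome(name: str) -> bool:
    """Chromosome (or contig) names must be plain integers."""
    return name.isascii() and name.isdigit()


print(sample_name_problems("MR_0029.1"))
print(sample_name_problems("(PI655994 x PI655998)S4"))
print(is_valid_chromosome("10"), is_valid_chromosome("chr10"))
```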

5

QseqToTagCount

Processes a Qseq file so we know which alleles (tags) are present in the sample:

•  Handles sequence quality issues

•  Identifies the barcodes

•  Removes problem tags

•  Counts tags

6

GBS Restriction Fragment Structure

[Figure. Labeled elements: barcode adapter, cut site, sequence (read), common adapter. Panels: accepted read (barcode adapter, cut site, read); rejected or trimmed reads (potential chimeric sequence, short sequence, adapter dimer).]

Sequence Processing

Raw sequence data is processed into unique 64-bp sequences. For example:

CTCCCAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC
GTTGAACAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC

Becomes:

CAGCCCTCGGCGGTCAAACCACCCGGTCATCCATGCACCAAGGCCTGCGTGCGGGCTTGGTGTCATCGTACGC 64 2

Parameters:

Restriction enzyme: Different enzymes will create different sequence motifs, such as overlapping cut sites, palindromes or wobble bases.

Barcode: Barcode sequences must be provided to identify acceptable reads.

Number of identical sequences accepted: This gives investigators the option to ignore repetitive sequences or singleton reads.
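The collapse from reads to counted tags can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline's implementation: the reads are taken to have their barcodes already removed, reads shorter than the tag length are simply dropped, and `count_tags`, `min_count` and `max_count` are hypothetical names standing in for the "number of identical sequences accepted" parameter.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def count_tags(trimmed_reads: Iterable[str],
               tag_length: int = 64,
               min_count: int = 1,
               max_count: Optional[int] = None) -> Dict[str, int]:
    """Truncate barcode-free reads to tag_length and count identical tags."""
    counts = Counter(read[:tag_length] for read in trimmed_reads
                     if len(read) >= tag_length)
    return {tag: n for tag, n in counts.items()
            if n >= min_count and (max_count is None or n <= max_count)}


# Toy reads (made up): two reads share their first 64 bases, one is a singleton.
toy_reads = ["CAGC" + "A" * 70, "CAGC" + "A" * 66, "CTGC" + "T" * 70]
print(count_tags(toy_reads))
print(count_tags(toy_reads, min_count=2))  # singletons ignored
```

Raising `min_count` drops singleton tags, which are often sequencing errors, while `max_count` is one way to set aside highly repetitive sequences.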

7

26442466 2 CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT 64 1 CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT 64 2 CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTTGGCACTCAAGCCCAAAACCACAGATCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCAAGTAATTTGTTGTCTCATACCTCATACCACAGAACTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTTTTCCAACCCCAAAACCCAAGGCTTC 64 1 CAGCAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTTTCCCAAACCCCAAACCCCAGGCTTT 64 1 CAGCAAAAAAAAAAAAAAAAAAAGGGATAGGGAAGATGGGGGAGAGTGGCGGCCACGCATGGAA 64 1 CAGCAAAAAAAAAAAAAAAAAACAACAAGGAATTTGGGTATTCATTCCCCATACCCCAGGATTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACAAAAAAATTTGTTTTCTCAACCCCAAAACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 2 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGGAATTGAATCTCTCACACCTTAAAACACCGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAACACCAATTATTTGAAAGATCATTACCCTATACCACGGGGTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTGATGTCTCATACCCCATACCACAGGACTCCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAAAAATTTTATTTCTCATACCCCAAACCCCAGGACTTCC 64 1 
CAGCAAAAAAAAAAAAAAAAAACCAAAGAATTTTATGTCTCATACCTCAAACCAAAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAATAAATTTGTTGCTCATACCCCAAACCACAGGGCTTTC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGCAATTTGATTCCACTTAATCTATCCCACAGAACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCC 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAAAAAATTTTTTGTTTCCCTAACCCCAAAACCACGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCAATGAATTTGTAGTGCCAAACCCCAAACCAACGGACTTT 64 1 CAGCAAAAAAAAAAAAAAAAAACCCCAAGAAATTTGATGTCTCATACCCCAAACCCCAGGACTT 64 1 CAGCAAAAAAAAAAAAAAAAAAGACCAGGTAATTATTGCTCACATACATCAAACTCCAATTGCC 64 1 CAGCAAAAAAAAAAAAAAAAAAGCGCCTAACGTTTCAAAATGAATGAGTTGCCAACCAAGGACT 64 1 CAGCAAAAAAAAAAAAAAAAAAGGGTTAGGAAAGATGGGTGGGAGGGGCGGGCCTGCTTGAAAT 64 1

TagCounts File

Header: Number of Tags, Max Size of Tag (x 32 bp)

Each record: Tag Sequence, Length (bp), Count
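Read this way, the TagCounts listing is a two-number header followed by repeating (sequence, length, count) triples. The Python sketch below parses that text layout from a whitespace-separated dump like the one on this slide. It assumes the text form only (the pipeline's own on-disk format may differ), and `parse_tag_counts` is a hypothetical name. The toy input reuses the header values shown above with made-up tags.

```python
from typing import List, NamedTuple, Tuple


class TagCount(NamedTuple):
    sequence: str
    length: int
    count: int


def parse_tag_counts(text: str) -> Tuple[int, int, List[TagCount]]:
    """Return (number of tags, max tag size in 32-bp units, records)."""
    tokens = text.split()
    if len(tokens) < 2 or (len(tokens) - 2) % 3 != 0:
        raise ValueError("expected a 2-token header followed by 3-token records")
    num_tags, max_words = int(tokens[0]), int(tokens[1])
    records = [TagCount(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
               for i in range(2, len(tokens), 3)]
    return num_tags, max_words, records


# Toy input: header values from the slide, made-up 64-base tags.
tag_a = "CAGC" + "A" * 60
tag_b = "CAGC" + "A" * 59 + "C"
num_tags, max_words, records = parse_tag_counts(
    f"26442466 2 {tag_a} 64 1 {tag_b} 64 2")
print(num_tags, max_words * 32, [(r.length, r.count) for r in records])
```

The header count describes the whole file, so an excerpt like the one shown here will contain far fewer records than the header announces.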

8

Unique Reads (FASTQ) @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGAC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGAATTTTATGTTTCCTACCTCCAACCCCAGGACTTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTGATGTCCTATACCTCATCCCACAGGACTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCAAGTAATTTTATTTCTCATACCTCATACCACAGGACTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTGATGTCTCAAACCCCAACACACAGGCTT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAACCCAAGAAATTTTTTGTCTCAAACCCCAACCCCCAGGCCT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAAAGGGGTTTTGAATAAAAAAAACTGAAGGATCTTAAATCTAC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTTTCATACCTCATACCACAGGACT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=2 CAGCAAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACT + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAAAAATTTTATGTCTCAAACCCCAAACCCCAGGGCTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAATAATTTGATGTCTCATACCTCATACCACAGGGCTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff @length=64count=1 CAGCAAAAAAAAAAAAAAAAAAACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTTC + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
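The FASTQ records above follow a simple pattern: the header carries the tag length and count, and every base gets the same placeholder quality character. A minimal Python sketch of that conversion follows. `tag_to_fastq` is a hypothetical name, and the constant quality character "f" is copied from the listing rather than derived from real base qualities.

```python
def tag_to_fastq(sequence: str, length: int, count: int,
                 quality_char: str = "f") -> str:
    """Format one tag as a four-line FASTQ record with uniform quality."""
    header = f"@length={length}count={count}"
    return "\n".join([header, sequence, "+", quality_char * len(sequence)])


# Toy tag (made up, shortened for readability).
print(tag_to_fastq("CAGCAAAA", 8, 2))
```

Carrying the length and count in the read name means they come back out in the first column of the aligner's output, so no separate lookup table is needed after alignment.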

BWA (Burrows-Wheeler Aligner)

•  Aligns the tags (in FASTQ format) to the reference genome

•  Parameters:

•  Similarity of read sequence and genome sequence. This controls the tradeoff between number of SNPs and confidence in the alignment. Default is 4 edits per sequence.

•  Gap penalty. This controls sensitivity to indels. Default is no indels within 5bp of the read ends.

•  Outputs a SAM alignment

•  There are many other aligners. BWA is fast and memory efficient, but may not be appropriate for your species.
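For readers who want to try the alignment step themselves, the Python sketch below assembles a typical `bwa index` / `bwa aln` / `bwa samse` sequence. The slides do not show the exact command line, so the mapping of the two parameters onto `bwa aln -n` (maximum differences) and `-i` (no indel within this many bases of the read ends) is an assumption, and all file names are hypothetical. The values 4 and 5 mirror the defaults quoted on this slide.

```python
import subprocess
from typing import List, Optional, Tuple

Command = Tuple[List[str], Optional[str]]  # (argv, file that receives stdout)


def bwa_commands(reference: str, fastq: str, prefix: str,
                 max_edits: int = 4, indel_end_skip: int = 5) -> List[Command]:
    """Build the index, aln and samse commands for single-end tags."""
    sai, sam = f"{prefix}.sai", f"{prefix}.sam"
    return [
        (["bwa", "index", reference], None),
        (["bwa", "aln", "-n", str(max_edits), "-i", str(indel_end_skip),
          reference, fastq], sai),
        (["bwa", "samse", reference, sai, fastq], sam),
    ]


def run(commands: List[Command]) -> None:
    """Execute the commands in order (requires bwa on the PATH)."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


# Hypothetical file names; nothing is executed here, the commands are only printed.
commands = bwa_commands("reference.fa", "master_tags.fq", "master_tags")
for argv, stdout_path in commands:
    print(" ".join(argv), "->", stdout_path)
```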

9

Generic Alignment (SAM)

length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=2 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=1 0 7 6994125 37 53M2I9M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:29T32 length=64count=1 0 7 6994125 37 54M2I8M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:62 length=64count=1 0 7 6994125 37 55M2I7M * 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:38G23 length=64count=4 0 7 6994125 37 4M3D47M2I11M * 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:5 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:2 MD:Z:4^A length=64count=1 16 17 14761759 25 64M * 0 0 CCTTTCTTGGCCTGGTTCTCACTCATCTGGGCTTGGAATTGAGAACCGTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:8T1 length=64count=7 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCCCGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:14G12A12T0G length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACGTTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG 
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:13C0G25T0G2 length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACAGGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:11C2G25T0G2 length=64count=4 16 18 1517944 25 64M * 0 0 GCCCGTCTACCCGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:10A3G25T0G2 length=64count=2 16 18 1517944 25 64M * 0 0 GCCCGTCTCCACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:8A5G25T0G22 length=64count=53 16 18 1517944 37 64M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:14G25T0G22 length=64count=1 16 18 1517944 25 64M * 0 0 CCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:0G13G25T0G2 length=64count=1 16 18 1517944 25 64M * 0 0 GCCCGTCTACACCCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:12G1G25T0G2 length=64count=1 0 10 10388735 37 58M1I5M * 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 MD:Z:59A length=64count=1 0 2 714861 37 64M * 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 
X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:2A61 length=64count=11 16 19 13463035 37 49M1I14M * 0 0 TGCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 length=58count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:57C length=64count=1 0 2 14032437 37 4M1I59M * 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:57C length=64count=1 16 19 13463036 37 48M2I14M * 0 0 GCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCCTTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:2 length=64count=1 0 6 20542400 37 64M * 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:13C length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGACTCCCCCTCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 length=64count=3 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGACTCACCCTCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 length=64count=1 16 5 15019027 37 49M1I14M * 0 0 CCCATTGTTGTATCTTGATTGCAGACTCACCCCCATCACTCTTTCTGCATTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 length=64count=1 16 5 15019027 37 49M1I14M * 0 0 ACCATTGTTGTATCTTGATTGCAGACTCTGCCTCATCACTCGTTCTGCATTTTTTTTTTTGCTG 
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 length=64count=1 0 6 20542400 37 4M1I59M * 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:63 length=64count=1 0 8 18851188 37 64M * 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:8G3 length=64count=5 16 19 13463034 23 64M * 0 0 CTGCCCGTCTACACGCTTGTGTCCCATGCACGCAAGCCGCCCCATCCCTCTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:3 X0:i:1 X1:i:1 XM:i:3 XO:i:0 XG:i:0 MD:Z:1C1 length=64count=1 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:26G37 length=64count=7 0 5 6176480 37 64M * 0 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:64 length=57count=31 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:57C length=64count=4 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:57C length=64count=1 0 2 14032437 25 64M * 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:54C length=64count=1 16 5 15019027 
37 16M1I47M * 0 0 TCCATTGTTGTATCTTCGATTGCAGACTCACCCTCATCACTCTTTCCGCCTTTTTTTTTTGCTG ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1

SAMConverter & TagsOnPhysicalMap (TOPM)

•  TOPM is the key file for interpreting the tags present in a species. It contains:
– Tag sequence
– Position
– Divergence from reference
– Polymorphisms
– Genetic mapping support


6040401 2 4 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 7 1 6994125 6994189 0 CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 17 0 14761759 1476182 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 18 0 1517944 1518008 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 10 1 10388735 1038879 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 2 1 714861 714925 0 CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 19 0 13463035 1346309 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 2 1 14032437 1403250 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 2 1 14032437 1403250 CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 19 0 13463036 1346310 
CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 6 1 20542400 2054246 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 5 0 15019027 1501909 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 6 1 20542400 2054246 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 8 1 18851188 1885125 CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 19 0 13463034 1346309 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0 CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 5 1 6176480 6176544 0

TagsOnPhysicalMap File

BWA sensitivity is pretty poor

Alignment Class        BWA    Bowtie2
Single Best Mapping    57%    69%
Multiple Mapping       17%    17%
Unmapped               26%    14%

BLAST performs about the same as Bowtie2. The code needs to be updated to parse Bowtie2 output. Many of the multiply mapping tags do NOT map with 100% identity, which suggests they can be genetically mapped.


[Figure: Reference Genome Pipeline flowchart, repeated from the overview slide]

6040401 2 88 08.0731-5 chardonnay 08.0731-19 08.0731-29 08.0731-6 08.0731-24 08.0731-37 08.0731-15 08.0731-38 08.0731-44 08.0731-23 08.0731-46 08.0731-31 08.0731-10 08.0731-49 08.0731-47 08.0731-7 08.0731-48 08.0731-21 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCAAAGGACTT 64 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGAAATTTGATGTCTCATACCTCATACCCCAGGACTT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAAAAACACCAAGTAATTTGATTTCTCATACCTCATACCAAAGGACTT 64 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAAAACACCAAGTAATTTGATGTCTCATACCTCATACCACAGGACTTCCC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 CAGCAAAAAAAAAAAACGGTTCTCAATTCCAAGCCCAGATGAGTGAGAACCAGGCCAAGAAAGG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGGGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAACGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCCTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGGGTAGACGGGC 64 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGGAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
CAGCAAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGGGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAAATAGAACTTAGAAACTTATACCGTGGGACACGTCAAGTGACTGCTGATG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAACCAAAGATCGACTTGCAACATCTGGATGGAAACAACAAACAAACAAAGA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGGGA 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAAGGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGC 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATTT 64 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGGGAGTCTGCAATCAAGATACAACAATGGG 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGAGGGTGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAAAGAGTGATGGGGGTGAGTCTGCAATCAAGATACAACAATGGG 64 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAATGCAGAACGAGTGATGAGGCAGAGTCTGCAATCAAGATACAACAATGGT 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAACATCCTCTCCTCATACGCTCCTCCCAGCTTGCACTAACGGCCAACAGATT 64 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGAGAGGCCTAAAAAGGGTAATGAAGGCAAAAGTGCCCTTCTTAGCTGTAG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGAGGGATGGGGCGGCTTGCGTGCATGGGACACAAGCGTGTAGACGGGCAG 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 CAGCAAAAAAAAAAGCCCAATCTAGACCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
CAGCAAAAAAAAAAGCCCAATCTAGAGCCTATCTTCTAATAGCGAATAAGAAAAGGCCCCATCC 64 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGAAAAAAA 64 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGCTGGAGAGAT 64 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CAGCAAAAAAAAAAGCTATGAACCATCGGGGGAGAGGTGAGAAATGTTGATTGGTTGCAGAGAA 64 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Tags by Taxa


[Figure: Reference Genome Pipeline flowchart, repeated from the overview slide]

Tags that align to the same region are aligned against one another, and SNPs and small indels are identified. Based on the alignments, SNPs are propagated to the specific lines having that tag and written into a HapMap file.

Parameters:
•  Chromosomes to search for SNPs
•  Bi- or tri-allelic SNPs
•  Indels
•  Genetic mapping support
•  Max markers on a chromosome

TagsToSNPByAlignment
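As a rough illustration of the idea (not the TASSEL implementation), the sketch below takes tags that share a start position, compares them column by column, and reports variable columns as candidate SNPs. It assumes the tags are already aligned with no indels; the function name `call_snps`, the `max_alleles` argument and the toy tags are hypothetical.

```python
from collections import Counter


def call_snps(tags, max_alleles=2):
    """Return {offset: [alleles]} for columns showing 2..max_alleles alleles.

    Assumption: all tags start at the same reference position and have the
    same length, so column i of every tag refers to the same base.
    """
    snps = {}
    for offset, column in enumerate(zip(*tags)):
        counts = Counter(column)
        if 2 <= len(counts) <= max_alleles:
            # Alleles ordered from most to least common.
            snps[offset] = [base for base, _ in counts.most_common()]
    return snps


# Three toy tags; only offset 5 differs among them.
tags = ["CAGCAAAGCC", "CAGCAAAGCC", "CAGCATAGCC"]
snps = call_snps(tags)
print(snps)
```

Raising `max_alleles` to 3 would admit tri-allelic sites, loosely mirroring the bi- or tri-allelic parameter listed above.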


rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3 SgSBRIL061:633Y5AAXX:2:E8 SgSBRIL013:633Y5AAXX:2:E2 SgSBRIL062:633Y5AAXX:
S1_2100 A/G 1 2100 + N N N N N N N R N A N A N N N N N N N
S1_2163 T/C 1 2163 + N N N N N N T C T T N N C N T N T N C
S1_13837 T/G 1 13837 + N N N N N N N G N N T N N N N N N N
S1_14606 C/T 1 14606 + N N C N N N T T T T C N C N N C N N
S1_20601 T/A 1 20601 + T N N N N N N A N N N N N N N N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N N N N N N N T
S1_68596 A/T 1 68596 + A N N N N N N N N A N N A N N N N N
S1_69309 G/A 1 69309 + N G N N N N N A N N N N A N N N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N T T T N N T T
S1_79961 T/G 1 79961 + N T T T T N T T N N N T T T N N T T
S1_80584 G 1 80584 + N N N N N N N N N N G N G N A G N N
S1_80647 C/T 1 80647 + N N N N N N N C N N C T N N N C T N
S1_81274 T/G 1 81274 + N N N N N N T G N N N N N N N N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N N N K G G G N
S1_115359 C/T 1 115359 + N N N N N N T C N T N N N N N N N N
S1_115362 T/C 1 115362 + N N N N N N N C N N N N T N N N C N
S1_115405 G/A 1 115405 + G G A N N G G G G N R N G N G N N N
S1_115516 T/G 1 115516 + N N T N N N T T N N T N T T T T N N
S1_116694 A/G 1 116694 + N A G N N N G A N N N N A N G A G N
S1_119016 C/T 1 119016 + N N N N C N N C N N N N N N N N N N
S1_155366 T/C 1 155366 + N T N N N N N Y N N N N C N N T N N

HapMap Format
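To make the genotype columns easier to read, here is a minimal parsing sketch. It assumes the five leading columns shown on the slide (full HapMap files carry additional metadata columns before the genotypes) and the standard IUPAC codes, where for example R is A/G, Y is C/T, K is G/T and N is missing. The function name `parse_hapmap_row` is hypothetical.

```python
# Single-letter IUPAC codes mapped to unphased diploid genotypes.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def parse_hapmap_row(line, n_leading=5):
    """Split one row into site metadata and per-taxon genotype calls.

    Assumption: n_leading=5 matches the slide excerpt; missing or unknown
    codes (such as N) become None.
    """
    fields = line.split()
    site = {
        "rs": fields[0],
        "alleles": fields[1].split("/"),
        "chrom": fields[2],
        "pos": int(fields[3]),
        "strand": fields[4],
    }
    calls = [IUPAC.get(code) for code in fields[n_leading:]]
    return site, calls


row = "S1_155366 T/C 1 155366 + N T N N N N N Y N N N N C N N T N N"
site, calls = parse_hapmap_row(row)
called = [c for c in calls if c is not None]
hets = [c for c in called if c[0] != c[1]]
print(site["rs"], len(called), "called,", len(hets), "heterozygous")
```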

Why another pipeline?

•  The last maize build (21,000 taxa) with the discovery pipeline took over 2 weeks.
•  Most common alleles have been identified after the first few discovery builds.
•  Use the information from the discovery pipeline to call SNPs in new runs quickly.
•  Improve efficiency and automate.

[Figures: the GBS bioinformatics pipeline diagram, built up over several slides. Discovery branch boxes: Fastq, Tag Counts, Tags by Taxa, TOPM, SNP Caller, Genotypes, Filtered Genotypes. Production branch boxes: Fastq, TagsOnPhysicalMap (TOPM), Genotypes, Filtered Genotypes.]

Running the Production Pipeline

•  Required files:
– Sequence file (fastq or qseq)
– Key file
– Production TOPM
•  TASSEL 3 Standalone & RawReadsToHapMapPlugin
•  Running the pipeline:
– One lane processed at a time
– HapMap files by chromosome
•  ~7 minutes

Testing Production Pipeline

•  Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline
•  Site comparison:
– Discovery 48,139
– Production 47,676
– Difference due to maximum of 8 alleles
•  99.98% correlation of genetic distance matrices


Shifting to HDF5

•  Hierarchical Data Format: supports very large data sets and complex data structures.
•  Widely used in the climate and astronomy communities
•  TBT files can approach 2 TB in size
•  Compressed HDF5 can be 40 times smaller
•  Access times look very good
•  Working to fuse TOPM, TBT, and Keyfile into one HDF5 repository

Edward Buckler USDA-ARS

Cornell University http://www.maizegenetics.net

Why can GBS be complicated? Tools for filtering, error correction and imputation.


Maize has more molecular diversity than humans and apes combined

Silent Diversity (Zhao PNAS 2000; Tenaillon et al., PNAS 2001)

1.34% 0.09%

1.42%

Only 50% of the maize genome is shared between two varieties

Fu & Dooner 2002, Morgante et al. 2005, Brunner et al. 2005
Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010

[Figure: overlap of genome content among individuals. Maize (Plant 1, Plant 2, Plant 3): 50% shared. Humans (Person 1, Person 2, Person 3): 99% shared.]


Maize genetic variation has been evolving for 5 million years

[Figure: timeline from 5 mya to the present (axis ticks at 5, 4, 3, 2 and 1 mya), spanning the warm Pliocene and the cold Pleistocene. Maize labels: Sister Genus Diverges, Zea species begin diverging, Modern Variation Begins Evolving, Maize domesticated. Human labels: Divergence from Chimps, Ardipithecus, Australopithecus, Homo erectus, Modern Variation Begins, Modern Humans.]

What are our expectations with GBS?


High Diversity Ensures High Return on Sequencing

•  Proportion of informative markers
– Highly repetitive: 15% not easily informative
– Half the genome is not shared between two maize lines
   •  Potentially all of these are informative with a large enough database
– Low copy shared proportion (1% diversity)
   •  Bi-parental information = 1 - (1-0.01)^64 ≈ 48% informative
   •  Association information = 1 - (1-0.05)^64 ≈ 97% informative
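The two percentages above follow from the chance that a 64 bp tag carries at least one difference. A minimal sketch, assuming sites vary independently and using the per-base diversity values from the slide (1% between two parents, 5% across the species); the function name is hypothetical:

```python
def informative_fraction(diversity, tag_length=64):
    """Probability that two copies of a tag differ at one or more bases.

    Assumption: every base differs independently with probability
    `diversity`, so the tag is identical with probability
    (1 - diversity) ** tag_length.
    """
    return 1.0 - (1.0 - diversity) ** tag_length


biparental = informative_fraction(0.01)
association = informative_fraction(0.05)
print(f"bi-parental: {biparental:.1%}, association: {association:.1%}")
```

Under these assumptions the values come out at roughly 47% and 96%, close to the rounded figures quoted on the slide.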

Expectation of marker distribution

Category            Biparental population    Across the species
Biallelic           17%                      -
Multiallelic        -                        34%
Too Repetitive      15%                      15%
Non-polymorphic     18%                      1%
Presence/Absence    50%                      50%


Sequencing Error

Illumina Basic Error Rate is ~1%

•  Error rates are associated with distance from the start of the sequence
– Bad: GBS puts these all at the same position
– Good: reverse reads can correct
– Good: errors are consistent and modelable


Reads with errors

•  Perfect sequences: 0.99^64 = 52.5% of the 64 bp sequences are perfect
•  47.5% are NOT perfect
•  The errors are autocorrelated, so the proportion of perfect sequences is a little higher, and the proportion with 2 or more errors is also higher.
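The 52.5% figure is the zero-error term of a binomial distribution. The sketch below assumes independent errors at 1% per base, which the slide notes is only approximately true because errors are autocorrelated; the function name is hypothetical.

```python
from math import comb


def prob_k_errors(k, read_length=64, error_rate=0.01):
    """Binomial probability of exactly k erroneous bases in one read."""
    return (
        comb(read_length, k)
        * error_rate ** k
        * (1 - error_rate) ** (read_length - k)
    )


p_perfect = prob_k_errors(0)
p_one = prob_k_errors(1)
p_two_plus = 1 - p_perfect - p_one
print(f"0 errors: {p_perfect:.3f}, 1 error: {p_one:.3f}, 2+: {p_two_plus:.3f}")
```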

Do we see these errors?

•  Assume 10,000 lines genotyped at 0.5X coverage

Base    Type     Read # (no SNP)    Read # (w/ SNP)
A       Major    4950               4900
C       Minor    17                 67 (50 real)
G       Error    17                 17
T       Error    17                 17
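The counts in the table can be reproduced with simple arithmetic. This sketch assumes a 1% error rate spread evenly over the three wrong bases, and a 1% minor allele frequency for the SNP case (inferred from the "50 real" reads out of 5,000); errors on the minor-allele reads themselves are ignored. The function and key names are hypothetical.

```python
def expected_reads(n_lines=10_000, coverage=0.5, error_rate=0.01,
                   minor_allele_freq=0.0):
    """Expected read counts per base class at a single site."""
    total = n_lines * coverage                 # 5,000 reads at 0.5X
    error_reads = total * error_rate           # reads carrying a wrong base
    per_wrong_base = error_reads / 3           # split over 3 alternative bases
    real_minor = total * minor_allele_freq     # genuine minor-allele reads
    return {
        "major": total - error_reads - real_minor,
        "minor": real_minor + per_wrong_base,
        "other_error": per_wrong_base,
    }


no_snp = expected_reads()
with_snp = expected_reads(minor_allele_freq=0.01)
print({k: round(v) for k, v in no_snp.items()})
print({k: round(v) for k, v in with_snp.items()})
```

The point of the comparison is that a rare real allele (67 reads) stands only modestly above the error background (17 reads), which is why the later filters matter.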


Do Errors Matter?

•  Yes: imputation, haplotype reconstruction
•  Maybe: GWAS for low frequency SNPs
•  No: GS, genetic distance, mapping on biparental populations

Expectations of Real SNPs

•  Vast majority are biallelic
•  Homozygosity is predicted by inbreeding coefficient
•  Allele frequency is constrained in structured populations
•  In linkage disequilibrium with neighboring SNPs


Clean Up and Imputation

[Figure: flowchart from a HapMap file through the filtering plugins to a cleaned HapMap file, then imputation and downstream analyses. Legend: Process; File (data structure).]

•  MergeDuplicateSNPsPlugin: merge reads from opposite sides
•  GBSHapMapFiltersPlugin: site coverage, taxa coverage, inbreeding coefficient, LD
•  BiParentalErrorCorrectionPlugin: error rate estimation, LD filters
•  MergeIdenticalTaxaPlugin: error rate estimation, LD filters
•  Imputation / Imputation & Phasing: HETEROZYGOUS NOT SOLVED YET; INBREDS PARTIALLY SOLVED
•  Downstream uses: Kinship, Distance, Phylogeny, LD, GS, GWAS

Filters in TagsToSNPByAlignmentMTPlugin

•  Only calls bi-allelic SNPs (hard coded now)
– Two most common alleles used
•  Inbreeding coefficient (-mnF)
– If you have inbred samples, definitely use it; very powerful against errors and paralogues
•  Minimum minor allele frequency (-mnMAF)
– Very important if you do not have other tools for filtering (bi-parental populations or LD)
– Set to >=1% if no other filter method is present
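To show what the two statistics measure, the sketch below computes a minor allele frequency and an inbreeding coefficient for one site, using the textbook definition F = 1 - H_obs / H_exp with H_exp = 2pq. This is an illustration under a biallelic assumption, not the plugin's code; the function name and the toy genotypes are hypothetical.

```python
def site_stats(genotypes):
    """Return (minor allele frequency, inbreeding coefficient F) for a site.

    genotypes: list of (allele1, allele2) tuples, with None for missing.
    F is None when it cannot be computed (no calls or a monomorphic site).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0, None
    counts = {}
    for a, b in called:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    ordered = sorted(counts.values(), reverse=True)
    total = 2 * len(called)
    maf = ordered[1] / total if len(ordered) > 1 else 0.0
    p = ordered[0] / total
    expected_het = 2 * p * (1 - p)             # biallelic assumption
    observed_het = sum(1 for a, b in called if a != b) / len(called)
    f = 1 - observed_het / expected_het if expected_het > 0 else None
    return maf, f


genotypes = [("A", "A")] * 6 + [("G", "G")] * 3 + [("A", "G")] + [None] * 2
maf, f = site_stats(genotypes)
# Thresholds follow the examples given for -mnMAF (0.01) and -mnF (0.9).
passes = maf >= 0.01 and f is not None and f >= 0.9
print(round(maf, 3), round(f, 3), passes)
```

A site with many apparent heterozygotes in inbred material gets a low F, which is the signature of paralogues or systematic error that the filter is meant to catch.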


MergeDuplicateSNPsPlugin

•  When restriction sites are less than 128 bp apart, we may read a SNP from both directions (strands)
•  ~13% of all sites
•  Fusing increases coverage
•  Fixes errors
•  -misMat = set maximum mismatch rate
•  -callHets = mismatch set to hets or not

GBSHapMapFiltersPlugin

•  Basic filters for coverage of sites and taxa, inbreeding coefficient, and LD

•  -mnTCov = minimum taxa coverage (e.g. 0.05)

•  -mnSCov = minimum site coverage, proportion of taxa with call (e.g. 0.10)

•  -mnMAF = minimum minor allele frequency (e.g. 0.01)
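The coverage filters can be pictured on a small genotype matrix. In this sketch rows are sites, columns are taxa, and "N" marks a missing call. Filtering taxa first and sites second is an assumption, as are the function names; the defaults simply echo the example values given for -mnTCov and -mnSCov.

```python
def site_coverage(matrix):
    """Proportion of taxa with a call, per site (row)."""
    return [
        sum(call != "N" for call in row) / len(row) if row else 0.0
        for row in matrix
    ]


def taxa_coverage(matrix):
    """Proportion of sites with a call, per taxon (column)."""
    n_sites = len(matrix)
    n_taxa = len(matrix[0])
    return [
        sum(row[j] != "N" for row in matrix) / n_sites for j in range(n_taxa)
    ]


def filter_by_coverage(matrix, taxa, min_taxa_cov=0.05, min_site_cov=0.10):
    """Drop low-coverage taxa, then low-coverage sites (order is assumed)."""
    keep = [j for j, c in enumerate(taxa_coverage(matrix)) if c >= min_taxa_cov]
    reduced = [[row[j] for j in keep] for row in matrix]
    sites = [
        row for row, c in zip(reduced, site_coverage(reduced))
        if c >= min_site_cov
    ]
    return [taxa[j] for j in keep], sites


matrix = [
    ["A", "N", "N", "A"],
    ["N", "N", "N", "N"],
    ["C", "N", "T", "C"],
]
names, sites = filter_by_coverage(
    matrix, ["t1", "t2", "t3", "t4"], min_taxa_cov=0.3, min_site_cov=0.5
)
print(names, sites)
```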


GBSHapMapFiltersPlugin

•  -mnF = minimum inbreeding coefficient (e.g. 0.9)
– Don’t use with outcrossers
•  -hLD = require that sites are in high local LD. The parameters are currently hard coded, so they are difficult to tune without editing the code.
– Tests a sliding window of 100 surrounding sites, and looks for a Bonferroni-corrected P<0.01
– Useful, but can be a slow option.
– More work needed here.

Biparental populations

Limited range of alleles, expected allele frequencies, high LD


Maize RIL population expectations

•  Allele frequency 0% or 50%
•  Nearby sites should be in very high LD (r2>50%)
•  Most sites can be tested if multiple populations are available
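For the LD expectation, a minimal sketch of r2 between two sites in inbred lines follows. It assumes each line is coded 0 or 1 for the parental allele it carries (None for missing) and uses the standard definition r2 = D^2 / (pA(1-pA) pB(1-pB)); the function name and toy data are hypothetical.

```python
def r_squared(site_a, site_b):
    """Squared LD correlation between two sites scored in inbred lines.

    site_a, site_b: lists of 0/1 parental calls, None for missing.
    Returns None if no lines are shared or either site is monomorphic.
    """
    pairs = [
        (a, b) for a, b in zip(site_a, site_b)
        if a is not None and b is not None
    ]
    n = len(pairs)
    if n == 0:
        return None
    p_a = sum(a for a, _ in pairs) / n
    p_b = sum(b for _, b in pairs) / n
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    return d * d / denom


site_1 = [0, 0, 0, 0, 1, 1, 1, 1]
site_2 = [0, 0, 0, 1, 1, 1, 1, 1]              # one apparent recombinant
r2 = r_squared(site_1, site_2)
print(round(r2, 3), r2 > 0.5)
```

A site whose r2 with its neighbours falls well below 0.5 in a segregating population is a candidate for a mapping error or a miscalled SNP.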

Bi-parental populations allow identification of error, and non-Mendelian segregation

[Figure: sites classified as Error, Non-segregating, or Segregating]


Bi-parental populations allow identification of error, and non-Mendelian segregation

Median error rate is 0.004, but there is a long tail of some high error sites.

[Figure: distribution of per-site error rate, with the median marked]


BiParentalErrorCorrectionPlugin

•  -popM = regex for population identification (e.g. “Z[0-9]{3}”)

•  -popF = population file (not implemented), an alternative to the -popM option

•  -mxE = maximum error rate (e.g. 0.01); calculated from non-segregating populations

BiParentalErrorCorrectionPlugin

•  -mnD = distortion from expectation (e.g. 2.0); the test uses both the binomial distribution and this distortion to classify segregation.

•  -mnPLD = minimum linkage disequilibrium (e.g. r2 = 0.5); this is calculated within each population, and then the median across segregating populations is used
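The -popM pattern can be tried out on taxon names before a run. The sketch below groups names by the first substring matching the example regex; whether the plugin searches anywhere in the name or only at the start is not stated, so `search` is an assumption, and the taxon names are made up for illustration.

```python
import re
from collections import defaultdict


def group_by_population(taxa_names, pattern=r"Z[0-9]{3}"):
    """Group taxon names by the substring that matches `pattern`.

    Names without a match are collected under the key None.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in taxa_names:
        match = regex.search(name)
        groups[match.group(0) if match else None].append(name)
    return dict(groups)


# Hypothetical taxon names: two populations plus one unmatched line.
names = ["Z001E0001", "Z001E0002", "Z002E0001", "B73"]
groups = group_by_population(names)
print(groups)
```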


MergeIdenticalTaxaPlugin

•  Fuse taxa with the same name. Useful for checks and duplicated runs. Also useful in determining error rates

•  -xHets = exclude heterozygous calls (e.g. true)

•  -hetFreq = frequency between hets and homozygous calls (e.g. 0.76)

Product of Filtering

•  After filters, in maize we find a 0.0018 error rate
– AA<>aa = < 0.0018
– AA<>Aa = 0.8 at low coverage

•  SNPs in wrong location <~1%. Lower in other species.


Clean Up and Imputation

[Figure: the Clean Up and Imputation flowchart shown again, with the imputation status now reading HETEROZYGOUS PARTIALLY SOLVED and INBREDS PARTIALLY SOLVED]

Missing Data

Two major sources:
•  Sampling
– Low coverage often used in big genomes with inbred lines
– Differential coverage caused by fragment size biases
•  Biological
– Region of the genome not shared between lines
– Cut site polymorphisms

We want to impute the data missing from sampling, but not the biologically missing data.


Standard Imputation

Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.

These are appropriate for high coverage loci, inbreds, and regions where biologically missing data are rare.

Some can be slow for the sample sizes that we have.

FastImputationBitFixedWindow

•  Imputation approach focused on speed and large sets of taxa with some closely related individuals.

•  Nearest neighbor approach, fixed window sizes

•  Strengths: very accurate (<1% error), and about 100X faster than other algorithms

•  Weakness: not good at recombination junctions or with heterozygosity

•  Code is in TASSEL: not a plugin, but available
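The nearest-neighbor, fixed-window idea can be sketched as follows. This is a simplified illustration rather than the TASSEL code: genotypes are plain strings with "N" for missing, each window borrows calls from the donor with the fewest mismatches, and the names `impute_window`, `window` and `max_mismatch` are hypothetical.

```python
def impute_window(target, donors, window=8, max_mismatch=0):
    """Fill 'N' calls in `target` from the closest donor in each window."""
    result = list(target)
    for start in range(0, len(target), window):
        end = min(start + window, len(target))
        best, best_score = None, None
        for donor in donors:
            shared = mismatches = 0
            for i in range(start, end):
                if target[i] != "N" and donor[i] != "N":
                    shared += 1
                    mismatches += target[i] != donor[i]
            if shared == 0 or mismatches > max_mismatch:
                continue                       # no evidence, or too different
            score = (mismatches, -shared)      # fewer mismatches, more overlap
            if best_score is None or score < best_score:
                best, best_score = donor, score
        if best is not None:
            for i in range(start, end):
                if result[i] == "N" and best[i] != "N":
                    result[i] = best[i]
    return "".join(result)


target = "ANNTCNNG"
donors = ["ACGTCTTA", "TCGACAAG"]
imputed = impute_window(target, donors, window=4)
print(imputed)
```

Because each window picks its donor independently, a recombination breakpoint that falls inside a window is handled poorly, which matches the weakness listed above.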


Hidden Markov Model TASSEL GBS Imputation

•  Developed by Peter Bradbury
•  Aimed at GBS and biparental populations
•  Hidden Markov Model
•  Very accurate at determining boundaries
•  Works well on Maize NAM inbred lines, and probably others.
•  AA <> BB error rate: 0.00005
•  AB > AA: 0.0278
•  Most problems appear in faulty populations
•  Available as TASSEL 4.0 plugin

Only 50% of the maize genome is shared between two varieties

Fu & Dooner 2002, Morgante et al. 2005, Brunner et al. 2005
Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010

[Figure: overlap of genome content among individuals. Maize (Plant 1, Plant 2, Plant 3): 50% shared. Humans (Person 1, Person 2, Person 3): 99% shared.]


Mapping all the alleles (TagCallerAgainstAnchor)

•  Most maize alleles have no position on the reference map
•  Map allele presence (TagsByTaxa) versus an anchor SNP map (HapMap)
•  8.7M alleles were mapped in <24 hours using a 100 CPU cluster
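Conceptually, mapping a tag against the anchor map means finding the anchor SNP whose genotypes across taxa best track the tag's presence or absence. The sketch below uses a squared Pearson correlation as a stand-in statistic; the test actually used by TagCallerAgainstAnchor is not described here, and the function names and toy data are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation, skipping pairs with a None value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov * cov / (var_x * var_y)


def best_anchor(tag_presence, anchor_snps):
    """Return (site, r2) for the anchor SNP most associated with the tag.

    tag_presence: 0/1 per taxon (tag absent/present in TagsByTaxa).
    anchor_snps: {(chrom, pos): 0/1/None per taxon} from the anchor map.
    """
    best_site, best_r2 = None, 0.0
    for site, calls in anchor_snps.items():
        r2 = squared_correlation(tag_presence, calls)
        if r2 is not None and r2 > best_r2:
            best_site, best_r2 = site, r2
    return best_site, best_r2


tag = [1, 1, 0, 0, 1, 0]
anchors = {
    ("1", 2100): [1, 1, 0, 0, 1, 0],
    ("5", 15019027): [1, 0, 1, 0, 1, 0],
}
site, r2 = best_anchor(tag, anchors)
print(site, round(r2, 3))
```

Each tag is scored independently of the others, so the work splits naturally across many CPUs, consistent with the cluster run mentioned above.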

Physical and genetic mapping of 8.7 million GBS alleles

[Figure legend]
•  Genetic and Physical Agree
•  Genetic and Physical Disagree
•  Not in Physical, Genetically mapped
•  Complex mapping or modest power currently
•  Consistent Error or Evenly repetitive
•  Reads with strong genetic and/or BLAST position
•  Reads with weaker position hypothesis
•  Reads with no hypothesis (Error or even repetitive)

•  Only 29% of alleles are simple - physical and genetic agree

•  55% of alleles are easily genetically mappable

•  Many complex alleles are rarer, so 71% of alleles are genetically and/or physically interpretable.

•  With more samples and better error models perhaps 90% will be useable

[Figure panels: Alleles; Reads]


Using the Presence/Absence Variants

•  In species like maize, this is the majority of the data

•  Less subject to sequencing error
•  Need imputation methods to differentiate between missing from sampling and biologically missing

Future

•  Need better integration of Whole Genome Sequence data with pipeline
– Add information on premature cut sites or mutated cut sites
•  Use paired-end read information
•  Full incorporation of presence/absence variants
•  Increase range of imputation tools and phasing for structured populations
•  Quantitative genotype tools for polyploids/GS