GBS Bioinformatics Pipeline(s) Overview
Getting from sequence files to genotypes.
Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman
Presentation: Terry Casstevens, with supporting information from the coders.
Three Pipelines
• Discovery Pipeline
  – Requires a reference genome
  – Multiple steps to get to genotypes
  – Hands-on tutorial is based on this pipeline
• Production Pipeline
  – Uses information from Discovery Pipeline
  – One step from sequence to genotypes
• UNEAK Pipeline
  – For species without a reference genome
  – Fei Lu will present this tomorrow at 9:30
Vocabulary
• Sequence File
  – Text file containing DNA sequence reads and supplemental information from the Illumina platform.
• Taxa
  – An individual sample
• GBS Bar Code
  – A short known sequence of DNA used to assign a GBS Tag to its original Taxa
• Key File
  – Text file used to assign a GBS Bar Code to a Taxa
• GBS Tag
  – DNA sequence consisting of a cut site remnant and additional sequence.
• Plugin
  – TASSEL pipeline module that performs a specific task
GBS Discovery Pipeline
[Figure: Discovery pipeline flowchart. Stages shown: Sequence, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes]
Raw Sequence (Qseq)
HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
Key File
GBS Tags
Fragment from GBS library: Barcode adapter | Cut site | Insert | Cut site | Common adapter
‘Good’ reads (only the first 64 bases after the barcode are kept):
• typical read: Barcode | Cut site | Insert (first 64 bases)
• short fragment: Barcode | Cut site | Insert (<64bp) | Cut site | Common adapter
• chimera or partial digestion: Barcode | Cut site | Insert (<64bp) | Cut site | 2nd Insert
Rejected reads:
• adapter dimer: Barcode | Cut site | Common adapter
• Not matching barcode and cut site remnant
• Contains N in first 64 bases after the barcode
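The good/rejected rules above can be sketched in a few lines of Python. This is an illustration only, not the TASSEL code: the function name `extract_tag`, the barcode list, and the cut site remnants `CAGC`/`CTGC` are assumptions made for the example, and the sketch does not trim a second cut site or common adapter from short fragments.

```python
TAG_LENGTH = 64
# Assumption for illustration: remnants left by the restriction enzyme.
CUT_SITE_REMNANTS = ("CAGC", "CTGC")


def extract_tag(read, barcodes):
    """Return (barcode, tag) for a 'good' read, or None if the read is rejected."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        after_barcode = read[len(barcode):]
        # The barcode must be followed directly by a cut site remnant.
        if not after_barcode.startswith(CUT_SITE_REMNANTS):
            continue
        # Keep only the first 64 bases after the barcode.
        tag = after_barcode[:TAG_LENGTH]
        # Reject reads with an N in the retained bases.
        if "N" in tag:
            return None
        return barcode, tag
    return None
```

A read that starts with a known barcode and a remnant yields a 64-base tag; anything else returns `None`.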
Tag Counts
• With information from the key file, each sequence file is processed, and tags are identified and counted.
• If a tag is shorter than 64 bases, it is padded.
• The tags and counts are put into a tag count file for each sequence file.
QseqToTagCountsPlugin / FastqToTagCountsPlugin
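As a rough sketch of the counting step (not the plugin's actual code), tags can be padded to a fixed length and tallied with a `Counter`. The pad base `A` is an assumption; the slide does not say which character is used.

```python
from collections import Counter

TAG_LENGTH = 64
PAD_BASE = "A"  # assumption: the slide does not name the padding character


def pad_tag(tag, length=TAG_LENGTH, pad=PAD_BASE):
    """Truncate or right-pad a tag so every tag has the same length."""
    return tag[:length].ljust(length, pad)


def count_tags(tags):
    """Count how often each (padded) tag occurs in one sequence file."""
    return Counter(pad_tag(tag) for tag in tags)
```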
Master Tag Counts
• The individual tag count files are merged into a master tag count file.
• A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).
MergeMultipleTagCountsPlugin
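The merge can be pictured as summing the per-file counts and then dropping rare tags. The sketch below assumes each tag count file has already been loaded into a dict; `merge_tag_counts` and `min_count` are illustrative names, not plugin options.

```python
from collections import Counter


def merge_tag_counts(per_file_counts, min_count=1):
    """Sum tag counts across files and keep tags seen at least min_count times."""
    master = Counter()
    for counts in per_file_counts:
        master.update(counts)  # adds counts for tags already present
    return {tag: n for tag, n in master.items() if n >= min_count}
```

With `min_count=2`, a tag seen once in total (a likely sequencing error) is left out of the master file.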
Conversion of Tags to Fastq
• Sequence aligners do not work with the tag count file format.
• In preparation for the alignment step, the Master Tag Count file is converted to fastq format.
TagCountsToFastqPlugin
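A minimal sketch of such a conversion follows. Tags carry no per-base quality, so a constant placeholder quality is written; the record naming scheme and the quality character are assumptions for this example, not the plugin's output format.

```python
def tag_counts_to_fastq(tag_counts, quality_char="I"):
    """Render a {tag: count} mapping as FASTQ text, one record per tag."""
    records = []
    for index, (tag, count) in enumerate(sorted(tag_counts.items())):
        header = f"@tag{index}_count={count}"  # hypothetical naming scheme
        quality = quality_char * len(tag)      # placeholder quality string
        records.append(f"{header}\n{tag}\n+\n{quality}\n")
    return "".join(records)
```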
Tag Alignment / TOPM
• The GBS pipeline uses an external aligner to do the initial alignment.
• The current version uses bowtie2 which produces the alignment in the SAM format.
• We convert the SAM file into our Tags On Physical Map (TOPM) format.
bowtie2
SAMConverterPlugin
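To give a feel for what the conversion has to extract, the sketch below reads one SAM alignment line and keeps the tag sequence, chromosome, position, and strand. The field positions and flag bits (0x4 unmapped, 0x10 reverse strand) come from the SAM format; the `TagPosition` record is a hypothetical stand-in and much simpler than a real TOPM entry.

```python
from typing import NamedTuple, Optional


class TagPosition(NamedTuple):
    tag: str
    chromosome: Optional[str]
    position: Optional[int]
    strand: Optional[str]


def sam_line_to_tag_position(line):
    """Parse one SAM line; return None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    sequence = fields[9]  # SEQ is reverse-complemented for reverse-strand hits
    if flag & 0x4:  # read did not align
        return TagPosition(sequence, None, None, None)
    strand = "-" if flag & 0x10 else "+"
    return TagPosition(sequence, fields[2], int(fields[3]), strand)
```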
TOPM
So Far We Have
• Identified and counted GBS tags.
• Converted tag counts file to fastq.
• Aligned the tags to a reference.
• Converted the alignment to TOPM.
Tags by Taxa
• In this step we identify which tags are present in which taxa.
  – Original Sequence Files
  – Key File
  – Master Tag Count File
• Recently migrated to HDF5 file format.
  – Efficient storage
  – Large data sets
SeqToTBTHDF5Plugin
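Conceptually, a Tags by Taxa table records how often each master tag was seen in each taxon. The sketch below uses plain dicts rather than HDF5 and assumes the reads have already been resolved to `(taxon, tag)` pairs via the key file; the function name is illustrative.

```python
from collections import defaultdict


def build_tags_by_taxa(observations, master_tags):
    """Count, per taxon, each observed tag that is in the master tag list."""
    master = set(master_tags)
    table = defaultdict(lambda: defaultdict(int))
    for taxon, tag in observations:
        if tag in master:  # tags filtered out at the merge stage are ignored
            table[taxon][tag] += 1
    return {taxon: dict(counts) for taxon, counts in table.items()}
```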
Tags By Taxa Additional Operations
• If many TBTs have been created they are merged into 1 TBT.
• Taxa that were sequenced multiple times are merged.
• The TBT table is pivoted in preparation for SNP calling.
ModifyTBTHDF5Plugin
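Two of these operations can be sketched on a dict-based table. The sketch assumes taxon names of the form `sample:flowcell:lane:well` and treats everything before the first colon as the sample name; that naming rule and both function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def merge_replicate_taxa(tbt):
    """Sum tag counts for entries that share the same sample name."""
    merged = defaultdict(Counter)
    for full_name, counts in tbt.items():
        sample = full_name.split(":", 1)[0]
        merged[sample].update(counts)
    return {sample: dict(counts) for sample, counts in merged.items()}


def pivot(tbt):
    """Turn taxon -> tag -> count into tag -> taxon -> count."""
    by_tag = defaultdict(dict)
    for taxon, counts in tbt.items():
        for tag, n in counts.items():
            by_tag[tag][taxon] = n
    return dict(by_tag)
```

After the pivot, all taxa carrying a given tag can be read off in one lookup, which is the access pattern SNP calling needs.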
SNP Calling
• Files used in SNP Calling
  – TOPM
  – TBT
  – Pedigree File (optional)
• Some Key Settings
  – mnF: Minimum F (inbreeding coefficient)
  – mnMAF: Minimum Minor Allele Frequency
  – mnMAC: Minimum Minor Allele Count
  – mnLCov: Minimum Locus Coverage
TagsToSNPByAlignmentPlugin
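The frequency and coverage settings can be illustrated on the calls at a single site. This sketch makes several simplifying assumptions: calls are single-letter homozygous genotypes with `N` for missing, each call counts as one allele observation, the threshold defaults are arbitrary, and a site passes if it meets the coverage threshold and either the MAF or the MAC threshold. The inbreeding coefficient (mnF) and pedigree handling are not covered.

```python
from collections import Counter

MISSING = "N"


def site_stats(calls):
    """Return locus coverage, minor allele count and minor allele frequency."""
    called = [c for c in calls if c != MISSING]
    coverage = len(called) / len(calls) if calls else 0.0
    ranked = Counter(called).most_common()
    # The minor allele is taken to be the second most common allele.
    mac = ranked[1][1] if len(ranked) > 1 else 0
    maf = mac / len(called) if called else 0.0
    return {"coverage": coverage, "mac": mac, "maf": maf}


def passes_filters(stats, mn_maf=0.01, mn_mac=10, mn_lcov=0.1):
    """Apply the thresholds; the either/or rule for MAF and MAC is an assumption."""
    frequent_enough = stats["maf"] >= mn_maf or stats["mac"] >= mn_mac
    return stats["coverage"] >= mn_lcov and frequent_enough
```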
HapMap
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G 1 2100 + N N N N N N N R N A N
S1_2163 T/C 1 2163 + N N N N N N T C T T N
S1_13837 T/G 1 13837 + N N N N N N N G N N T
S1_14606 C/T 1 14606 + N N C N N N T T T T C
S1_20601 T/A 1 20601 + T N N N N N N A N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N
S1_68596 A/T 1 68596 + A N N N N N N N N A N
S1_69309 G/A 1 69309 + N G N N N N N A N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N
S1_79961 T/G 1 79961 + N T T T T N T T N N N
S1_80584 G 1 80584 + N N N N N N N N N N G
S1_80647 C/T 1 80647 + N N N N N N N C N N C
S1_81274 T/G 1 81274 + N N N N N N T G N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N
S1_115359 C/T 1 115359 + N N N N N N T C N T
S1_115362 T/C 1 115362 + N N N N N N N C N N N
S1_115405 G/A 1 115405 + G G A N N G G G G N
S1_115516 T/G 1 115516 + N N T N N N T T N N T
S1_116694 A/G 1 116694 + N A G N N N G A N N N
S1_119016 C/T 1 119016 + N N N N C N N C N N N
S1_155366 T/C 1 155366 + N T N N N N
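The genotype columns use IUPAC codes: `N` is a missing call, and letters such as `R` or `K` stand for heterozygotes. A small reader for rows in the abbreviated five-column layout shown on this slide might look like the sketch below. The full HapMap format has more leading columns, so the column offsets here are an assumption tied to this excerpt.

```python
# IUPAC nucleotide codes for homozygous and two-allele heterozygous calls.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def decode_call(code):
    """Return the two alleles for a call, or None when the call is missing."""
    return IUPAC.get(code)  # "N" is not in the table, so it maps to None


def parse_hapmap_row(line):
    """Split one whitespace-separated row of the abbreviated layout."""
    fields = line.split()
    name, alleles, chrom, pos, strand = fields[:5]
    return {
        "name": name,
        "alleles": alleles.split("/"),
        "chrom": chrom,
        "pos": int(pos),
        "strand": strand,
        "calls": fields[5:],
    }
```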
GBS Discovery Pipeline
[Figure: Discovery pipeline flowchart. Stages shown: Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes, Filtered Genotypes]
Production Pipeline
Why another pipeline?
• The last maize build (30000 taxa) with the discovery pipeline took weeks.
• Most common alleles have been identified after the first few discovery builds.
• Use the information from the discovery pipeline to call SNPs in new runs quickly.
• Improve efficiency and automate.
GBS Bioinformatics Pipelines
[Figure: Discovery and Production pipelines shown together. Discovery labels: Fastq, Tag Counts, TOPM, Tags by Taxa, SNP Caller, Genotypes, Filtered Genotypes. Production labels: Fastq, TOPM, Genotypes]
Running the Production Pipeline
• Required Files:
  – Sequence file (fastq or qseq)
  – Key file
  – Production TOPM
• TASSEL 3 Standalone & RawReadsToHapMapPlugin
• Running the Pipeline:
  – One lane processed at a time
  – HapMap files by chromosome
• ~40 minutes
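The one-step idea can be pictured as a lookup: if each known tag is already tied to a site and an allele, genotypes follow directly from the tags seen in each taxon. The sketch below is a conceptual illustration only. The `production_topm` mapping of tag to `(site, allele)` is a hypothetical simplification, not the real production TOPM structure or the RawReadsToHapMapPlugin logic.

```python
from collections import defaultdict


def call_genotypes(tags_by_taxon, production_topm):
    """Look up each tag and collect the alleles seen per taxon and site."""
    seen = defaultdict(lambda: defaultdict(set))
    for taxon, tags in tags_by_taxon.items():
        for tag in tags:
            hit = production_topm.get(tag)
            if hit is None:  # tag unknown to the discovery build: skipped
                continue
            site, allele = hit
            seen[taxon][site].add(allele)
    return {
        taxon: {site: "/".join(sorted(alleles)) for site, alleles in sites.items()}
        for taxon, sites in seen.items()
    }
```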
Testing Production Pipeline
• Compared HapMap files produced by Discovery Pipeline and Production Pipeline
• Site Comparison:
  – Discovery 48,139
  – Production 47,676
  – Difference due to maximum 8 alleles
• 99.98% correlation of genetic distance matrices
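One way such a comparison could be done is sketched below; the slides do not say which distance measure or correlation was actually used, so both choices are assumptions. Distance is the share of differing calls among sites called in both taxa, and the two matrices are compared with a Pearson correlation over the taxon pairs they have in common.

```python
from itertools import combinations
from math import sqrt


def pairwise_distances(genotypes, missing="N"):
    """Map each taxon pair to the fraction of jointly called sites that differ."""
    distances = {}
    for a, b in combinations(sorted(genotypes), 2):
        shared = [(x, y) for x, y in zip(genotypes[a], genotypes[b])
                  if missing not in (x, y)]
        if shared:
            distances[(a, b)] = sum(x != y for x, y in shared) / len(shared)
    return distances


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def matrix_correlation(d1, d2):
    """Correlate two distance mappings over the taxon pairs present in both."""
    keys = sorted(set(d1) & set(d2))
    return pearson([d1[k] for k in keys], [d2[k] for k in keys])
```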
Next Steps In Pipeline Development
• Hierarchical Data Format (HDF5): supports very large data sets and complex data structures.
• Working to fuse TOPM, TBT, Keyfile, and Pedigree File into one HDF5 repository.
• Continued improvements to SNP caller.
• Ability to use tags not present in the reference.