NGS of data of immunological interest
Paolo Marcatili
Agenda
• 9.00 - 10.00: DNA and RNA sequencing
• 10.00 - 10.30: Problems and solutions in Metagenomics and Metatranscriptomics
• 10.30 - 12.00: Exercise: Assembling metatranscriptomics data
• 12.00 - 13.00: Lunch Break
• 13.00 - 16.00: Exercise: Select putative antigens from a metagenomic sample
Milestones
• First isolation of DNA: 1869 (Friedrich Miescher)
• Composition of nucleic acids; tetranucleotide theory: 1909 - 1940 (Phoebus Levene)
• G=C and A=T; however, the G/C and A/T content varies between organisms: 1950 (Erwin Chargaff)
• G/C content measured by annealing: 1968 (Mandel and Marmur)
• Maxam-Gilbert and Sanger sequencing: 1977
• Next-Generation Sequencing: 2005
Genomes Sequenced
• Virus – 3222 (Bacteriophage phi X 174, 5386 nt – 1977)
• Bacteria – 2289 (Haemophilus influenzae, 1.8 x 10^6 nt – 1995)
• Eukarya – 168 (S. cerevisiae, 1.2 x 10^7 nt – 1995; H. sapiens, 3 x 10^9 nt – 2001)
• Archaea – 152 (Methanococcus jannaschii, 1.7 x 10^6 nt – 1996)
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796
Changes in instrument capacity*
Next-Generation Sequencing
Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364
Illumina Sequencing Technology
Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364
Sequencing Types
Single Read
Paired-end read (short or negative linkers)
Mate-pair read (long linkers)
Library Types
• Many different library preps: DNA, mate-pair, mRNA, miRNA, ChIP
• Fragmentation
  – DNA: 300 - 500 nt
  – RNA: 150 - 200 nt
• Attachment of appropriate adapters
  – Complex: flow cell binding, F & R sequencing, barcode (BC)
  – Custom: avoid if possible
• Removal of dimers/small inserts
• Amplification (or not)
Applications
• De novo sequencing (genomes, transcriptomes)
• Resequencing (genomes, exomes, custom sequence capture)
• RNA-seq (mRNA, miRNA, degradome)
• ChIP-seq
• Methyl-seq
• RIP-seq
• Amplicon
Immunological data
• Single pathogen: 50-100M reads (genome or transcriptome, HiSeq)
• BCR or TCR repertoire: 5-50M long reads (454)
• Exploratory metatranscriptome: 30-100M reads (HiSeq)
• Annotated metatranscriptome: 200-500M reads (HiSeq)
FASTQ format
• FASTA + Quality
• Quality is encoded in 4 different scales
• Extremely large (10-100 Gb)
• Human-readable
@FCC41M5ACXX:6:1101:1660:1930#ACATGTAC/1
TGGCGGTGTGTACAAAGGGCAGGGACTTAATCAACGCAAGCTTATGACCCGCACTTACTGGGAATTCCTCGTTCATGGGGAATAATTGCAATCCCCGATC
+
aabeeeceeggggihiiiihiiiifiiiihiiiiiiihhiiiihiiiigggeeccccccccccccccccccccccdcccccccccddccccccccccccc
Line 1: Header
Line 2: Sequence
Line 3: Separator (+)
Line 4: Qualities
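To make the four-line layout concrete, here is a minimal Python sketch of a FASTQ reader. It assumes every record occupies exactly four lines, as in the example above. The names `FastqRecord`, `read_fastq` and `phred_scores` are made up for illustration, and the quality offset is an assumption the caller has to supply: the lowercase letters in the example suggest an offset of 64, while other files use 33.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1 without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int) -> List[int]:
    """Decode a quality string into integer scores for a given ASCII offset."""
    return [ord(char) - offset for char in quality]


# Small synthetic record (not taken from the slides).
example = io.StringIO("@read1/1\nACGT\n+\naabi\n")
for record in read_fastq(example):
    print(record.header, record.sequence, phred_scores(record.quality, offset=64))
```

With an offset of 64, the character `a` decodes to 33 and `i` to 41, so the read shown above is of high quality over most of its length.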
Sequencing Artefacts
• Wrong calls (Illumina) -> Quality
• 3' low quality -> trim the 3' end based on quality (threshold ~ 20)
• 5' artefacts -> cut n bases at the 5' end (0-10 bp)
• Primers and adapters sequenced -> ad hoc software to remove adapters
• PCR bias -> K-mer correction
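The two trimming steps can be sketched in a few lines of Python. This is a deliberately simple variant: it removes bases from the 3' end until it meets one at or above the threshold, then cuts a fixed number of bases from the 5' end. Dedicated trimmers use more robust schemes (sliding windows, running sums). The function name `trim_read` and the default offset of 33 are assumptions made for this example.

```python
from typing import Tuple


def trim_read(sequence: str, quality: str, threshold: int = 20,
              offset: int = 33, cut_5prime: int = 0) -> Tuple[str, str]:
    """Trim low-quality bases at the 3' end, then cut bases at the 5' end."""
    end = len(sequence)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    # Never cut past what is left after 3' trimming.
    start = min(cut_5prime, end)
    return sequence[start:end], quality[start:end]


# "I" is Q40 and "#" is Q2 with an offset of 33.
print(trim_read("ACGTACGT", "IIIII###"))
print(trim_read("ACGTACGT", "IIIII###", cut_5prime=2))
```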
FastQC
• Fastq Quality Check
• Standard way to check for sample quality
• BE CAREFUL: some checks are library-dependent
• Some K-mers in RNAseq are over-represented
Base quality plot
ATCG content
Per sequence GC content
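As a rough illustration of what the per-sequence GC content panel summarises, the sketch below computes the GC fraction of each read and bins the values into a histogram. It is not FastQC's own code; `gc_content` and `gc_histogram` are hypothetical helpers, and counting N bases in the denominator is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, Iterable


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_histogram(reads: Iterable[str]) -> Dict[int, int]:
    """Count reads per rounded GC percentage (0-100)."""
    return dict(Counter(round(100 * gc_content(read)) for read in reads))


print(gc_histogram(["ATGC", "GGCC", "ATAT", "ACGT"]))
```

A sharp secondary peak in such a histogram often points to contamination or to a few very abundant sequences.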
Adapters
CTCCGCTTCACGCCTCCGCCTTTGCACAGGGGTTTTCCCCTCCTGTACAGCTCCTGCAACGTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
+SRR941165.170 HWI-ST1176_0088:7:1101:3137:2306#10184_ACGTT length=95
FFFHHHHHJJJJIJJJJJJJJJJJJJJJJJJJ?FHIJIJJJJJJJIJIJHHHHHFFFFFEDDEDDEDDD8<;ABCDDBBDDDDDBDDDDDCDDDD
@SRR941165.427 HWI-ST1176_0088:7:1101:2503:2914#10184_ACGTT length=95
CGCTACTCATTCCGGCATTCTCTCTTCCCAGCCCTCCACGGCTCCTTTCGGTACCGCTTCGCCGGACTGGCAATGCTCCTCAACGTAGATCGGAA
+SRR941165.427 HWI-ST1176_0088:7:1101:2503:2914#10184_ACGTT length=95
FFFHHHHHJJJJIJJJJJJJJJJJJJJJJIHIIJJJJGIJJJJJJJJJJHHFDFFDDDDDDDDDDDDDDDD?>ACCDDDDDDDDDDDDBEDDDDD
• You have to know (or guess) the right adapters
• Adapter_1 and Adapter_2 are different, but usually have some similarity for Illumina kits
• Remember that if seq_1 is forward, seq_2 is reverse
• Remove overlaps with the adapter even when they are as short as 3 bp
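A minimal sketch of 3' adapter removal with a 3 bp minimum overlap is given below. It uses exact matching only, whereas real adapter trimmers tolerate mismatches and take base quality into account. The adapter prefix `AGATCGGAAGAGC` is the common Illumina adapter start, and it also appears at the 3' end of the reads shown above. Whether it is the right adapter for a given kit is an assumption to check, and `trim_adapter` is a name chosen for this example.

```python
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"  # assumed adapter; verify for your kit


def trim_adapter(read: str, adapter: str = ILLUMINA_ADAPTER_PREFIX,
                 min_overlap: int = 3) -> str:
    """Remove a 3' adapter, either complete or truncated by the read end."""
    # Case 1: the whole adapter occurs inside the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with the first k bases of the adapter.
    longest = min(len(read), len(adapter) - 1)
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


read = "ACGTACGTAGATCGGAA"  # synthetic read ending with 9 adapter bases
trimmed = trim_adapter(read)
print(trimmed, len(trimmed))  # the quality string must be cut to the same length
```

For paired-end data, the same function would be applied to the second read with Adapter_2.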
Merge Reads
• Especially in RNAseq data, left and right reads overlap
• Can correct 3' errors as well!
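The idea behind merging can be sketched as follows: reverse-complement the right read, look for the longest overlap between the end of the left read and the start of the right one, and keep the higher-quality base wherever the two disagree. That last step is how 3' errors get corrected. This sketch assumes the fragment is at least as long as each read (no read-through into adapters) and that both reads use the same quality encoding. `merge_pair` and its thresholds are illustrative, not those of any specific tool.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(_COMPLEMENT)[::-1]


def merge_pair(seq1: str, qual1: str, seq2: str, qual2: str,
               min_overlap: int = 10,
               max_mismatches: int = 1) -> Optional[Tuple[str, str]]:
    """Merge a read pair into one fragment, or return None if no overlap is found."""
    mate = reverse_complement(seq2)  # bring read 2 onto the strand of read 1
    mate_qual = qual2[::-1]
    longest = min(len(seq1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        left, right = seq1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(left, right))
        if mismatches > max_mismatches:
            continue
        start = len(seq1) - overlap
        bases, quals = [], []
        for i in range(overlap):
            q_left, q_right = qual1[start + i], mate_qual[i]
            # Keep the base with the higher quality character.
            if q_left >= q_right:
                bases.append(left[i])
                quals.append(q_left)
            else:
                bases.append(right[i])
                quals.append(q_right)
        merged_seq = seq1[:start] + "".join(bases) + mate[overlap:]
        merged_qual = qual1[:start] + "".join(quals) + mate_qual[overlap:]
        return merged_seq, merged_qual
    return None


# Synthetic 30 nt fragment read from both ends with 20 nt reads;
# the last base of read 1 is a low-quality miscall.
fragment = "ACGTTGCAAGGCTTACGATCGGATCCATGA"
read1, qual1 = fragment[:19] + "A", "I" * 19 + "#"
read2, qual2 = reverse_complement(fragment[10:]), "I" * 20
print(merge_pair(read1, qual1, read2, qual2))
```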
Mapping vs De Novo
• If you already have a sequenced genome -> map reads on it
• If you don't -> put reads together to form contigs/transcripts
• Mapping is easier, faster and more accurate
• De novo is the only solution in some cases
Mapping software
• BLAST – local, slow
• BWA – fast, short reads
• Bowtie – some isoform support
• TopHat – can manage isoforms and RNAseq
SAM and BAM
The alignment output format is usually SAM or BAM (binary, compressed SAM)
SRR941165.8819461 256 1 5927632 0 47M * 0 0 GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 CC:Z:= CP:i:5927638 HI:i:0
SRR941165.8819461 256 1 5927638 0 47M * 0 0 GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 CC:Z:= CP:i:5927644 HI:i:1
SRR941165.8819461 256 1 5927644 0 47M * 0 0 GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 CC:Z:= CP:i:5927650 HI:i:2
SRR941165.8819461 0 1 5927650 0 47M * 0 0 GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 CC:Z:= CP:i:5927656 HI:i:3
SRR941165.8819461 256 1 5927656 0 47M * 0 0 GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 HI:i:4
SRR941165.3434522 256 1 5934419 0 64M * 0 0 CTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCTTTCCCTTCCTTT JJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJIJJJJIJIJJJJJJJJJIJJJII###2H%HFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:64 YT:Z:UU NH:i:16 CC:Z:= CP:i:18704897 HI:i:0
SRR941165.4818838 0 1 9904364 50 136M * 0 0 GATCTCACCAGACAGGACTGCCAGATGACAACCAAGTAGTGTCCACATACATGCACCTACTGCCGCCGCAGCATCTGTCCAGGCCCTCCTGGTTCTTAAAAGTTCATGAATAATCTGCTGTTATTCTGATGGGCCT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIIIJJIIGHIJJJJJJJGIHHHHHFFFEDDFFFFFHHHHHJJJJJJIIJIJJIIJJIIJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJ
SRR941165.1983455 272 1 13434365 0 84M * 0 0 TTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCT FFFHH%%%JJJJJJJJJIJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:84 YT:Z:UU NH:i:20 CC:Z:= CP:i:
SRR941165.1983455 16 1 14400492 0 84M * 0 0 TTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCTTTTCT FFFHH%%%JJJJJJJJJIJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:84 YT:Z:UU NH:i:20 CC:Z:= CP:i:
SRR941165.3434522 272 1 18704897 0 64M * 0 0 AAAGGAAGGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAGGAAAG FFFH%H2###IIJJJIJJJJJJJJJIJIJJJJIJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:64 YT:Z:UU NH:i:16 CC:Z:12 CP:i:16358563 HI:i:1
Fields of one SAM record:
ID: SRR941165.8819461
FLAG: 256
SEQ/CHR: 1
MAP_POS: 5927632
MAPQ: 0
CIGAR: 47M
RNEXT: *
PNEXT: 0
TLEN: 0
SEQ: GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA
QUAL: JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
OPT: AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:47 YT:Z:UU NH:i:5 CC:Z:= CP:i:5927638 HI:i:0
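The field breakdown maps directly onto code. The sketch below splits one SAM line into its 11 mandatory fields plus optional tags and decodes the FLAG bits. It uses the field names of the SAM specification (QNAME, RNAME, POS) rather than the slide labels (ID, SEQ/CHR, MAP_POS). SAM files are tab-separated, so the example line is rebuilt with tabs. `parse_sam_line` and `decode_flag` are hypothetical helpers, and in practice a library such as pysam would normally do this job.

```python
from typing import Any, Dict, List

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INTEGER_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}

FLAG_BITS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "read unmapped",
    0x8: "mate unmapped", 0x10: "read reverse strand",
    0x20: "mate reverse strand", 0x40: "first in pair",
    0x80: "second in pair", 0x100: "secondary alignment",
    0x200: "fails quality checks", 0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """List the meanings of all bits set in a SAM FLAG value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if flag & bit]


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    tags: Dict[str, Any] = {}
    for item in columns[len(SAM_FIELDS):]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


# First record from the slide, rebuilt with tab separators.
example_line = "\t".join([
    "SRR941165.8819461", "256", "1", "5927632", "0", "47M", "*", "0", "0",
    "GAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGATGAAGA",
    "JJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ",
    "AS:i:0", "NM:i:0", "MD:Z:47", "NH:i:5", "CC:Z:=", "CP:i:5927638", "HI:i:0",
])
record = parse_sam_line(example_line)
print(record["rname"], record["pos"], decode_flag(record["flag"]), record["tags"]["NH"])
```

FLAG 256 marks a secondary alignment, and together with MAPQ 0 and NH:i:5 it shows that this repetitive read maps equally well to five places. FLAG 272 (256 + 16) is a secondary alignment on the reverse strand.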
De Novo assembly
• Much easier to do with long reads
• Need very good coverage
• Generally produces fragmented assemblies
• Necessary when you don't have a closely related (and correctly assembled) reference genome
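To put a number on "very good coverage", the expected average depth can be estimated as (number of reads x read length) / genome size. The sketch below applies this to figures taken from earlier slides: 50M reads for a single pathogen, 100 nt reads as in the FASTQ example, and a 1.8 x 10^6 nt bacterial genome. Combining these particular values is an assumption made for illustration, and `expected_coverage` is a hypothetical helper.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return n_reads * read_length / genome_size


# 50M reads of 100 nt on a 1.8 Mb genome.
depth = expected_coverage(50_000_000, 100, 1_800_000)
print(f"{depth:.0f}x")
```

This is only an average. Real coverage is uneven, and in a metagenomic or metatranscriptomic sample the reads are shared between many organisms and transcripts.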
Take Home
• Sequencing advancements
• Different sequencers for different needs
• Quality check and filtering (5', 3', adapters, merge)
• FASTQ, SAM & BAM
• When to map, when to go de novo