Pasteur deep seq_analysis_theory_2016
Uploaded by christophe-antoniewski
Deep Seq Data Analysis: Theoretical training
http://artbio.fr
Mouse Genetics, January 21, 2016, 13:30–15:00
Sequencing Technologies
Latest commercialized Sequencing Technology
Sequencing by pH variations in Ion Torrent
Sequencing Technologies : Quantitative Facts
Sequencing Technologies : Focus on Illumina technology
Deep sequencing applications
High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information
Stranded RNAseq library
20-30nt RNA gel purification
Small RNA library
(Biases)
Library “Bar coding”
ChIPseq library preparation (Non Directional)
What can I do with my sequence reads ?
Platform Selection
Library Preparation
Sequencing
Quality Control
Alignment Assembly
Visualization & Statistics
• Normalization (library comparison)
• Peak finding (Binding sites, Breakpoints, etc…)
• Differential Calling (expression, variants, etc)
What am I going to sequence ? For what analysis ?
Technical biases and limitations
Specific benefits (Read length, single or paired ends, number of reads)
Whole genome, Whole exome, Target enrichment
Size selection – Stranded/unstranded? Amplification, Single Cell Protocol
Length of the read, Single or paired ends
Number of lanes (depth of sequencing)
Adapter Clipping, Quality trimming
Contaminant and Sequencing Errors, Biases in GC contents
Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)
Velvet, Oases, Trinity, SOAP, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
R, MATLAB & Open Source software tools
Flowchart of a sequencing project
Think about the number of replicates
Basic Material for mining sequencing data
Connect to our server
$ ssh lbcd41.snv.jussieu.fr
$ mkdir <mydir>
$ cd <mydir>
What does this big fastq file contain?
mouse@GED-Server:~/raw_data$ more GKG-13.fastq
Each record spans four lines: Header (starts with @), Sequence, Header (starts with +), Sequence Quality (ASCII encoded).
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
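The four-line record layout shown above can be read programmatically. The following is a minimal Python sketch, not part of the original slides: the function names `read_fastq` and `phred_scores` are hypothetical, and the quality offset of 64 is an assumption suggested by characters such as `h` and `B`, which are typical of older Illumina pipelines.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes strict four-line records, as in the excerpt above.
    An incomplete record at the end of the input is ignored.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=64):
    """Decode an ASCII quality string; offset=64 is an assumption."""
    return [ord(c) - offset for c in qual]


# Second record of the excerpt, copied from the slide.
example = [
    "@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1",
    "]B]VWaaaaaagggfggggggcggggegdgfgeggbab",
]
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
```

On a real file, `read_fastq(open("GKG-13.fastq"))` would stream the records without loading the whole file in memory.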
How many sequence reads in my file ?
→ wc -l <path/to/my/file>
mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq
6425957
In the Python interpreter:
>>> 25703828 // 4
6425957
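Both counts agree for this file, but counting lines that start with `@` is not safe in general, because `@` is also a valid quality character and a quality line may begin with it. The sketch below compares the two approaches on a made-up toy record set (the reads and function names are hypothetical, for illustration only).

```python
def count_reads_by_lines(lines):
    """Count reads as (number of lines) / 4, assuming four-line records."""
    n = sum(1 for _ in lines)
    if n % 4 != 0:
        raise ValueError("line count %d is not a multiple of 4" % n)
    return n // 4


def count_lines_starting_with_at(lines):
    """Equivalent of grep -c -e "^@": may overcount."""
    return sum(1 for line in lines if line.startswith("@"))


# Toy data (made up): the quality line of read1 starts with '@'.
toy = [
    "@read1", "ACGT", "+", "@III",
    "@read2", "TTGA", "+", "IIII",
]
by_lines = count_reads_by_lines(toy)
by_at = count_lines_starting_with_at(toy)

# Figures from the slide: 25703828 lines in GKG-13.fastq.
reads_in_file = 25703828 // 4
```

Here `by_lines` gives the true number of reads while `by_at` counts one extra line.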
Are my sequence reads containing the adapter ?
→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of 6 425 957 sequences… not bad (98.9%)
My 3’ adapter: CTGTAGGCACCATCAAT
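The proportion of adapter-containing reads can be checked with a few lines of Python. This is a sketch under assumptions: `adapter_fraction` is a hypothetical helper, the first three toy reads are copied from the fastq excerpt, and the fourth is made up so that one read lacks the adapter.

```python
def adapter_fraction(sequences, adapter="CTGTAGG"):
    """Return the fraction of sequences containing the adapter substring."""
    sequences = list(sequences)
    if not sequences:
        raise ValueError("no sequences given")
    hits = sum(1 for s in sequences if adapter in s)
    return hits / len(sequences)


# Counts reported on the slide for GKG-13.fastq.
slide_percent = 100 * 6355061 / 6425957

toy_reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
    "ACGTACGTACGTACGTACGT",  # made-up read without adapter
]
toy_fraction = adapter_fraction(toy_reads)
```

Note that `grep` on the whole file tests every line, headers and quality strings included, whereas this helper expects sequence lines only.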
By contrast:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
cat GKG-13.fastq : outputs the content of a file, line by line
| : the output is passed to the input of the next command
perl -ne : the perl interpreter is called with the -ne options (loop & execute)
'print if /^[ATGCN]{22}CTGTAGG/' : inline perl code
/^[ATGCN]{22}CTGTAGG/ : regular expression
| : the output is passed to the input of the next command
wc -l : wc with the -l option counts the lines
A more advanced example of combining Unix commands
1 675 469 reads of 22 nt with a 3’ flanking CTGTAGG adapter sequence
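For comparison, the same filter can be written in Python with the `re` module. This is only an illustrative equivalent of the perl one-liner, with a hypothetical function name; the three test reads are copied from the fastq excerpt.

```python
import re

# Same regular expression as in the perl one-liner:
# exactly 22 nucleotides (N allowed) followed by the adapter start.
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")


def count_matching(lines):
    """Count lines matching the pattern, like `perl -ne 'print if ...' | wc -l`."""
    return sum(1 for line in lines if PATTERN.match(line))


reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22 nt insert + adapter
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # 30 nt insert
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",  # 21 nt insert
]
n_matching = count_matching(reads)
```

Only the first read has the adapter starting right after position 22, so it is the only one counted.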
Clipping adapter sequences
Unix Operating Systems already contain powerful native tools for sequence analyses
cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta
Final command line clipper
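The logic of the clipper can be sketched in Python to make two properties of the regular expression visible: reads containing an N before the adapter are discarded (the character class is `[GATC]`), and inserts shorter than 18 nt are discarded too. The function name `clip_to_fasta` and the two first toy reads are made up for illustration; the third read comes from the fastq excerpt.

```python
import re

# Same regular expression as in the final perl command line.
CLIP = re.compile(r"^([GATC]{18,})CTGTAGG")


def clip_to_fasta(lines):
    """Return FASTA lines (">n" then the clipped insert) for matching reads."""
    fasta = []
    count = 0
    for line in lines:
        match = CLIP.match(line)
        if match:
            count += 1
            fasta.append(">%d" % count)
            fasta.append(match.group(1))
    return fasta


reads = [
    "ACGTACGTACGTACGTACGTACCTGTAGGCACC",  # made up: 22 nt insert, kept
    "ACGTACGTACCTGTAGGCACC",  # made up: 10 nt insert, too short
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # from the slide: contains N
]
clipped = clip_to_fasta(reads)
```

Because `{18,}` is greedy, a read containing CTGTAGG more than once is clipped at the last occurrence.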
Sequence Quality Control
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC, GUI version
http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes
mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam
A bowtie alignment (command lines)
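../genomes/Dmel_r5.49 : bowtie index of the reference genome
-f clipped_GKG13.fasta : the input reads are in FASTA format
-v 1 : at most 1 mismatch per alignment
-k 1 : report at most 1 alignment per read
-p 6 : use 6 threads
--al droso_matched_GKG-13.fa : write the reads that aligned to this file
--un unmatched_GKG13.fa : write the reads that failed to align to this file
-S : output in SAM format
> GKG13_bowtie_output.sam : redirect the output to this file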
# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
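The figures of this report can be checked for consistency. The sketch below parses the three counting lines copied from the slide; `parse_bowtie_report` is a hypothetical helper written for this report layout, not a bowtie utility.

```python
import re

# Report lines copied from the slide.
report = """# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)"""


def parse_bowtie_report(text):
    """Map each '# label: count' line to an integer count."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"# (.+?): (\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


stats = parse_bowtie_report(report)
processed = stats["reads processed"]
aligned = stats["reads with at least one reported alignment"]
failed = stats["reads that failed to align"]
percent_aligned = 100 * aligned / processed
```

Aligned and unaligned reads add up to the number of reads processed, which is also the number of reads kept by the clipper.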
mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
SAM alignment: $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
SAM - BAM
Formats
Raw sequence: Fastq (quality), Fasta (w/o quality)
Aligned sequence: SAM, BAM
• Sorted
• Indexed
• Compressed
Genome annotation: GFF, GTF
GFF - GTF
Pileup Format
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
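Each pileup line gives the reference name, the position, the reference base, the read depth, the read bases and the base qualities. In the read bases column, `.` and `,` are matches on the forward and reverse strand, a letter is a mismatch, `^` followed by one character marks the start of a read, `$` marks the end of a read, `+n`/`-n` introduce an indel of n bases and `*` is a deleted base. The following Python sketch (function name hypothetical) counts matches and mismatches on lines copied from the example.

```python
import re


def summarize_pileup_bases(bases):
    """Count matches, mismatches and deleted bases in a pileup bases string."""
    counts = {"match": 0, "mismatch": 0, "deleted": 0}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":  # read start: skip the marker and its mapping quality
            i += 2
        elif c == "$":  # read end marker
            i += 1
        elif c in "+-":  # indel: skip the length and the indel sequence
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                raise ValueError("indel without length at index %d" % i)
            i += 1 + len(m.group()) + int(m.group())
        elif c in ".,":
            counts["match"] += 1
            i += 1
        elif c == "*":
            counts["deleted"] += 1
            i += 1
        elif c.upper() in "ACGTN":
            counts["mismatch"] += 1
            i += 1
        else:
            raise ValueError("unexpected character %r" % c)
    return counts


line = "seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<"
ref_name, pos, ref_base, depth, bases, quals = line.split()
summary = summarize_pileup_bases(bases)
first = summarize_pileup_bases(",.$.....,,.,.,...,,,.,..^+.")
```

For each line, matches plus mismatches plus deleted bases should equal the depth in the fourth column.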
Next week, we will perform an NGS analysis using the Galaxy framework.
We will speak about Accessibility, Reproducibility and Transparency.
Please have a look at http://galaxyproject.org/
You can register and try it.
Also, access http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)
AND
Register (Menu “user” → “register”) with your email address