Pasteur deep seq_analysis_theory_2016

Deep Seq Data Analysis Theoretical training [email protected] http://artbio.fr Mouse Genetics January 21, 2016, 13:30–15:00

Transcript of Pasteur deep seq_analysis_theory_2016

Page 1: Pasteur deep seq_analysis_theory_2016

Deep Seq Data Analysis: Theoretical training

[email protected]

http://artbio.fr

Mouse Genetics, January 21, 2016, 13:30–15:00

Page 2: Pasteur deep seq_analysis_theory_2016

Sequencing Technologies

Page 3: Pasteur deep seq_analysis_theory_2016

Latest commercialized Sequencing Technology

Sequencing by pH variations in Ion Torrent

Page 4: Pasteur deep seq_analysis_theory_2016

Sequencing Technologies : Quantitative Facts

Page 5: Pasteur deep seq_analysis_theory_2016

Sequencing Technologies : Focus on Illumina technology

Page 6: Pasteur deep seq_analysis_theory_2016

Deep sequencing applications

High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information

Page 7: Pasteur deep seq_analysis_theory_2016

Stranded RNAseq library

Page 8: Pasteur deep seq_analysis_theory_2016

20-30nt RNA gel purification

Small RNA library

(Biases)

Library “Bar coding”

Page 9: Pasteur deep seq_analysis_theory_2016

ChIPseq library preparation (Non Directional)

Page 10: Pasteur deep seq_analysis_theory_2016

What can I do with my sequence reads ?


Page 11: Pasteur deep seq_analysis_theory_2016

Flowchart of a sequencing project

What am I going to sequence? For what analysis?

Platform Selection
• Technical biases and limitations
• Specific benefits (read length, single or paired ends, number of reads)

Library Preparation
• Whole genome, whole exome, target enrichment
• Size selection, stranded/unstranded?
• Amplification
• Single-cell protocol

Sequencing
• Length of the read
• Single or paired ends
• Number of lanes (depth of sequencing)

Quality Control
• Adapter clipping, quality trimming
• Contaminants and sequencing errors
• Biases in GC content

Alignment
• Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)

Assembly
• Velvet, Oases, Trinity, SOAP, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))

Visualization & Statistics
• Normalization (library comparison)
• Peak finding (binding sites, breakpoints, etc.)
• Differential calling (expression, variants, etc.)
• R, MATLAB & open source software tools

Think about the number of replicates

Page 12: Pasteur deep seq_analysis_theory_2016

Basic Material for mining sequencing data

◆ …◆

Page 13: Pasteur deep seq_analysis_theory_2016

Connect to our server

$ ssh lbcd41.snv.jussieu.fr

$ mkdir <mydir>
$ cd <mydir>

Page 14: Pasteur deep seq_analysis_theory_2016

What does this big* FASTQ file contain?

mouse@GED-Server:~/raw_data$ more GKG-13.fastq

@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1     Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA      Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1     Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh      Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
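The four-line record structure shown above can be sketched in Python. This is a minimal illustration, not part of the original training material: `read_fastq` and `phred_scores` are hypothetical helper names, and the quality offset of 64 is an assumption based on the older Illumina encoding that these quality characters (`B`, `^` to `h`) suggest. Files from recent instruments normally use an offset of 33.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, sequence, _plus, quality = record
        yield header[1:], sequence, quality  # drop the leading '@'

def phred_scores(quality, offset=64):
    """Decode an ASCII quality string; offset=64 is an assumption (older Illumina)."""
    return [ord(char) - offset for char in quality]

# First record of the excerpt shown on the slide.
example = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]
records = list(read_fastq(example))
header, sequence, quality = records[0]
scores = phred_scores(quality)
print(header, len(sequence), scores[:5])
```

Under that assumed offset, the `B` at the second position decodes to a score of 2, which matches the `N` base call at the same position of the read.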

Page 15: Pasteur deep seq_analysis_theory_2016

How many sequence reads are in my file?

→ wc -l <path/to/my/file>

mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq

25703828 GKG-13.fastq

mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq

6425957

In the Python interpreter:
>>> 25703828 / 4
6425957
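Both counts agree here, but counting lines that start with `@` can overcount in general, because `@` is also a valid quality character and a quality line may begin with it. Here is a minimal Python sketch of the line-count approach, assuming a well-formed FASTQ file with exactly four lines per record. `count_reads` is a hypothetical helper name and the toy records are invented for illustration.

```python
def count_reads(lines):
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4: %d" % n_lines)
    return n_lines // 4

# Usage on a real file (file name taken from the slide):
#   with open("GKG-13.fastq") as handle:
#       print(count_reads(handle))

# Invented toy data: the first quality line starts with '@'.
toy = ["@read1", "ACGT", "+", "@@@@",
       "@read2", "TTGA", "+", "hhhh"]
n_records = count_reads(toy)
n_at_lines = sum(line.startswith("@") for line in toy)
print(n_records)   # 2 records
print(n_at_lines)  # 3 lines start with '@': the naive count overcounts
```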

Page 16: Pasteur deep seq_analysis_theory_2016

Do my sequence reads contain the adapter?

→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061

6 355 061 out of 6 425 957 sequences… not bad (98.9%)

My 3’ adapter: CTGTAGGCACCATCAAT

By contrast:

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
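The same adapter check can be sketched in Python. This is only an illustration: `adapter_fraction` is a hypothetical helper name, and the last toy read is invented. The first three sequences come from the FASTQ excerpt shown earlier. Unlike `grep`, the function is meant to receive sequence lines only, not headers or quality lines.

```python
def adapter_fraction(sequences, adapter_start="CTGTAGG"):
    """Return (n_with_adapter, n_total, percent) for an iterable of read sequences."""
    n_total = 0
    n_hits = 0
    for seq in sequences:
        n_total += 1
        if adapter_start in seq:
            n_hits += 1
    percent = 100.0 * n_hits / n_total if n_total else 0.0
    return n_hits, n_total, percent

reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # from the slide
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # from the slide
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # from the slide
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTAC",  # invented read without adapter
]
hits, total, percent = adapter_fraction(reads)
print(hits, total, percent)

# Percentage for the counts reported on the slide.
slide_percent = 100.0 * 6355061 / 6425957
print("%.1f" % slide_percent)  # 98.9
```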

Page 17: Pasteur deep seq_analysis_theory_2016

A more advanced example of combining Unix commands

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

cat GKG-13.fastq: outputs the content of a file, line by line
|: the output is passed to the input of the next command
perl -ne: the perl interpreter is called with the -n and -e options (loop over lines & execute)
'print if /…/': inline perl code
/^[ATGCN]{22}CTGTAGG/: regular expression
|: the output is passed to the input of the next command
wc -l: wc with the -l option counts the lines

Result: 1 675 469 reads that are 22 nt long and flanked in 3’ by the CTGTAGG adapter sequence
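For comparison, the same regular expression can be applied from Python. This is a sketch rather than a replacement for the one-liner: `count_22nt_inserts` is a hypothetical name, and the three test reads are taken from the FASTQ excerpt shown earlier. Their inserts are 22, 30 and 21 nt long, so only the first one matches.

```python
import re

# Same pattern as the perl one-liner: exactly 22 bases, then the adapter start.
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")

def count_22nt_inserts(lines):
    """Count lines that start with a 22-nt insert followed by CTGTAGG."""
    return sum(1 for line in lines if PATTERN.match(line))

reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",  # 22-nt insert
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC\n",  # 30-nt insert
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT\n",  # 21-nt insert
]
n_22 = count_22nt_inserts(reads)
print(n_22)  # 1
```

Changing the `{22}` quantifier gives the count for any other insert length, which is one way to build a read size distribution.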

Page 18: Pasteur deep seq_analysis_theory_2016

Clipping adapter sequences

Unix Operating Systems already contain powerful native tools for sequence analyses

cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta

Final command line clipper
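The logic of this clipper can be sketched in Python with the same regular expression. `clip_to_fasta` is a hypothetical name and the three toy reads are invented for illustration. The sketch keeps only inserts of at least 18 nt made of G, A, T or C, so reads containing an N before the adapter are dropped, exactly as in the perl pattern.

```python
import re

# Same pattern as the perl clipper: an insert of 18+ unambiguous bases, then the adapter start.
CLIP = re.compile(r"^([GATC]{18,})CTGTAGG")

def clip_to_fasta(lines):
    """Yield numbered FASTA records for the lines that match the clipping pattern."""
    count = 0
    for line in lines:
        match = CLIP.match(line)
        if match:
            count += 1
            yield ">%d\n%s\n" % (count, match.group(1))

toy_reads = [  # invented examples
    "TGGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",  # 22-nt insert: kept
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",  # contains N: dropped
    "GGGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC\n",  # 16-nt insert: too short, dropped
]
clipped = list(clip_to_fasta(toy_reads))
print("".join(clipped), end="")
```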

Page 19: Pasteur deep seq_analysis_theory_2016

Sequence Quality Control

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC, GUI version

Page 20: Pasteur deep seq_analysis_theory_2016

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

Page 21: Pasteur deep seq_analysis_theory_2016

mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam

A bowtie alignment (command lines)

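../genomes/Dmel_r5.49: basename of the bowtie index
-f clipped_GKG13.fasta: input reads, in FASTA format
-v 1: allow at most 1 mismatch per alignment
-k 1: report at most 1 alignment per read
-p 6: use 6 threads
--al droso_matched_GKG-13.fa: write the reads that aligned to this file
--un unmatched_GKG13.fa: write the reads that failed to align to this file
-S: output the alignments in SAM format
> GKG13_bowtie_output.sam: redirect the output to a file

# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)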
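The figures in this summary can be checked with a few lines of Python. This is a small sketch: `parse_bowtie_summary` is a hypothetical helper, and it only assumes the `# label: count` layout visible in the summary above.

```python
import re

SUMMARY = """# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)"""

def parse_bowtie_summary(text):
    """Map each '# label: count' line of the summary to an integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

counts = parse_bowtie_summary(SUMMARY)
processed = counts["reads processed"]
aligned = counts["reads with at least one reported alignment"]
failed = counts["reads that failed to align"]
aligned_percent = round(100.0 * aligned / processed, 2)
print(aligned + failed == processed, aligned_percent)
```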

mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49

Page 22: Pasteur deep seq_analysis_theory_2016

Bowtie outputs

deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

SAM alignment: $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
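To give an idea of what the SAM file contains, here is a minimal Python sketch that reads SAM lines and classifies them using the FLAG column (bit 0x4: read unmapped, bit 0x10: read aligned to the reverse strand). `summarize_sam` is a hypothetical helper, and the toy records below are invented, not taken from GKG13_bowtie_output.sam.

```python
def summarize_sam(lines):
    """Count forward, reverse and unmapped reads from SAM lines using the FLAG field."""
    counts = {"forward": 0, "reverse": 0, "unmapped": 0}
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        if flag & 0x4:
            counts["unmapped"] += 1
        elif flag & 0x10:
            counts["reverse"] += 1
        else:
            counts["forward"] += 1
    return counts

# Invented records with the 11 mandatory SAM columns:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
seq = "TGGGAACTTCATACCGTGCTCT"
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "\t".join(["1", "0", "2L", "1000", "255", "22M", "*", "0", "0", seq, "I" * 22]),
    "\t".join(["2", "16", "3R", "5000", "255", "22M", "*", "0", "0", seq, "I" * 22]),
    "\t".join(["3", "4", "*", "0", "0", "*", "*", "0", "0", seq, "I" * 22]),
]
summary = summarize_sam(toy_sam)
print(summary)
```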

Page 24: Pasteur deep seq_analysis_theory_2016

Formats

Raw sequence: FASTQ (with quality), FASTA (without quality)

Aligned sequence: SAM, BAM
• Sorted
• Indexed
• Compressed

Genome annotation: GFF, GTF

Page 26: Pasteur deep seq_analysis_theory_2016

Pileup Format

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
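The read-bases column of a pileup line can be decoded with a short Python sketch. The columns are: sequence name, position, reference base, depth, read bases, base qualities. In the read bases, `.` and `,` are matches on the forward and reverse strand, a letter is a mismatch, `^` plus one character marks a read start, `$` marks a read end, and `+n…` or `-n…` describe indels. `count_pileup_bases` is a hypothetical helper. It ignores `*` (deleted bases), which do not occur in the lines above.

```python
def count_pileup_bases(bases):
    """Return (matches, mismatches) for a pileup read-bases string."""
    matches = mismatches = 0
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":
            i += 2  # read start: skip '^' and the mapping-quality character
            continue
        if char in "+-":
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])  # skip the indel length and its sequence
            continue
        if char in ".,":
            matches += 1
        elif char in "ACGTNacgtn":
            mismatches += 1
        # '$' (read end) and '*' (deleted base) are not counted
        i += 1
    return matches, mismatches

line = "seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<"
name, position, ref, depth, bases, quals = line.split()
result = count_pileup_bases(bases)
print(name, position, ref, depth, result)
```

At position 276, one of the 22 reads carries a T where the reference has a G. Whether that is a sequencing error or a variant depends on the base quality and on how many reads support the same change.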

Page 27: Pasteur deep seq_analysis_theory_2016

Next week, we will perform an NGS analysis using the Galaxy framework.
We will speak about Accessibility, Reproducibility and Transparency.

Please have a look at http://galaxyproject.org/
You can register and try it.

Also, access http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)

AND

Register (Menu “user” → “register”) with your email address