Pasteur deep seq_analysis_theory_2016

Post on 13-Jan-2017

Deep Seq Data Analysis: Theoretical training

Christophe.antoniewski@upmc.fr

http://artbio.fr

Mouse Genetics, January 21, 2016, 13:30–15:00

Sequencing Technologies

Latest commercialized Sequencing Technology

Sequencing by pH variations in Ion Torrent

Sequencing Technologies : Quantitative Facts

Sequencing Technologies : Focus on Illumina technology

Deep sequencing applications

High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information

Stranded RNAseq library

20-30nt RNA gel purification

Small RNA library

(Biases)

Library “Bar coding”

ChIPseq library preparation (Non Directional)

What can I do with my sequence reads ?

Platform Selection

Library Preparation

Sequencing

Quality Control

Alignment / Assembly

Visualization & Statistics
• Normalization (library comparison)
• Peak finding (binding sites, breakpoints, etc.)
• Differential calling (expression, variants, etc.)

What am I going to sequence ? For what analysis ?

Technical biases and limitations

Specific benefits (read length, single or paired ends, number of reads)

Whole genome, whole exome, target enrichment

Size selection, stranded/unstranded?
Amplification
Single-cell protocol

Read length, single or paired ends

Number of lanes (depth of sequencing)

Adapter clipping, quality trimming

Contaminants and sequencing errors, biases in GC content

Bowtie, BWA, … (P Flicek & E Birney, Nature Methods 2009)

Velvet, Oases, Trinity, SOAP, SSAKE, … (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))

R, MATLAB & open source software tools

Flowchart of a sequencing project

Think about the number of replicates

Basic Material for mining sequencing data

Connect to our server

$ ssh lbcd41.snv.jussieu.fr

$ mkdir <mydir>
$ cd <mydir>

What does this big* fastq file contain?

mouse@GED-Server:~/raw_data$ more GKG-13.fastq

@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1     Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA      Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1     Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh      Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
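The four-line record layout shown above (header, sequence, repeated header, ASCII-encoded quality) can be read programmatically. The following is a minimal Python sketch, not part of the original slides. The function names `read_fastq` and `phred_scores` are hypothetical, and the quality offset of 64 is an assumption (characters such as `h` suggest an older Illumina encoding); adjust it to 33 for files using the Sanger/Illumina 1.8+ convention.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumes the simple layout shown in the slides: exactly four lines per read.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or truncated last record)
        header, seq, _plus, qual = record
        yield header[1:], seq, qual  # drop the leading "@"


def phred_scores(quality, offset=64):
    """Decode an ASCII quality string into integer scores.

    offset=64 is an assumption about this particular file, not a universal rule.
    """
    return [ord(char) - offset for char in quality]


example = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]
for header, seq, qual in read_fastq(example):
    print(header, seq, phred_scores(qual)[:5])
```

On a real file, `read_fastq(open("GKG-13.fastq"))` would iterate over the reads one at a time without loading the whole file into memory.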

How many sequence reads are in my file?

→ wc -l <path/to/my/file>

mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq

25703828 GKG-13.fastq

mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq

6425957

In the Python interpreter:
>>> 25703828 / 4
6425957
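Both counts agree here, but they rest on different assumptions: dividing the line count by four relies on the four-line record layout, while `grep -c "^@"` can overcount whenever a quality line happens to start with `@`. A minimal Python sketch of the line-count approach follows; the function name `count_fastq_reads` is hypothetical, and integer division (`//`) is used so that the result is an integer in both Python 2 and Python 3.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines, assuming four lines per read."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError("line count %d is not a multiple of 4" % n_lines)
    return n_lines // 4


# Two toy reads; the second quality line starts with "@" on purpose.
toy = [
    "@read1", "ACGT", "+", "hhhh",
    "@read2", "TTGA", "+", "@hhh",
]
print(count_fastq_reads(toy))                         # 2 reads
print(sum(1 for line in toy if line.startswith("@")))  # 3 lines start with "@"
```

For a file on disk, something like `count_fastq_reads(open("GKG-13.fastq"))` would apply the same logic.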

Do my sequence reads contain the adapter?

→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061

6 355 061 out of 6 425 957 sequences… not bad (98.9%)

My 3’ adapter: CTGTAGGCACCATCAAT
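The adapter check can also be written so that it looks only at sequence lines, whereas `grep` scans headers and quality strings as well. This is a minimal Python sketch under that assumption; `adapter_fraction` is a hypothetical helper, and it searches for the first seven bases of the adapter (CTGTAGG), as the commands on these slides do.

```python
from itertools import islice


def adapter_fraction(sequences, adapter_prefix="CTGTAGG"):
    """Return (reads_with_adapter, total_reads, fraction) for an iterable of sequences."""
    total = 0
    with_adapter = 0
    for seq in sequences:
        total += 1
        if adapter_prefix in seq:
            with_adapter += 1
    fraction = float(with_adapter) / total if total else 0.0
    return with_adapter, total, fraction


# Sequences taken from the fastq excerpt, plus one made-up read without adapter.
reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTAC",  # hypothetical, no adapter
]
print(adapter_fraction(reads))
```

With a FASTQ file, the sequence lines are every fourth line starting from the second one, for example `adapter_fraction(line.strip() for line in islice(open("GKG-13.fastq"), 1, None, 4))`.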

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308

By contrast

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

cat GKG-13.fastq : outputs the content of a file, line by line
| : the output is passed to the input of the next command
perl -ne : the perl interpreter is called with the -ne options (loop & execute)
'print if /^[ATGCN]{22}CTGTAGG/' : inline perl code
/^[ATGCN]{22}CTGTAGG/ : regular expression
| : the output is passed to the input of the next command
wc -l : wc with the -l option counts the lines

A more advanced example of combining Unix commands

1 675 469 reads of 22 nt with the 3’ flanking CTGTAGG adapter sequence
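The same regular expression works unchanged in Python's `re` module. The sketch below is an illustrative equivalent of the perl one-liner, not the command used in the course; `count_inserts_of_length` is a hypothetical name, and the function is applied to sequence strings only rather than to every line of the file.

```python
import re


def count_inserts_of_length(sequences, insert_len=22, adapter_prefix="CTGTAGG"):
    """Count reads made of exactly insert_len bases followed by the adapter prefix."""
    pattern = re.compile(r"^[ATGCN]{%d}%s" % (insert_len, re.escape(adapter_prefix)))
    return sum(1 for seq in sequences if pattern.match(seq))


# The six sequences from the fastq excerpt shown earlier.
reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
]
print(count_inserts_of_length(reads, 22))
```

Changing `insert_len` gives the count for other insert sizes, which is a quick way to look at the size distribution of a small RNA library.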

Clipping adapter sequences

Unix Operating Systems already contain powerful native tools for sequence analyses

cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more

mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta

Final command line clipper
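For readers more comfortable with Python, the clipper can be sketched as follows. This is an assumed equivalent of the perl one-liner, not the course's own code: `clip_to_fasta` is a hypothetical name, the regular expression is the same, and as in the perl version reads containing N in the insert or with an insert shorter than 18 nt are discarded. Unlike the perl command, it expects sequence strings only.

```python
import re


def clip_to_fasta(sequences, adapter_prefix="CTGTAGG", min_len=18):
    """Yield FASTA records (">n" newline insert newline) for reads that can be clipped.

    The insert must start the read, contain only G, A, T, C and be at
    least min_len bases long, immediately followed by the adapter prefix.
    """
    pattern = re.compile(r"^([GATC]{%d,})%s" % (min_len, re.escape(adapter_prefix)))
    count = 0
    for seq in sequences:
        match = pattern.match(seq)
        if match:
            count += 1
            yield ">%d\n%s\n" % (count, match.group(1))


reads = [
    "ACGTACGTACGTACGTACGTAACTGTAGGCACC",        # hypothetical read, 22 nt insert
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",   # from the excerpt, contains N
    "ACGTACGTCTGTAGGCACC",                      # hypothetical read, insert too short
]
for record in clip_to_fasta(reads):
    print(record, end="")
```

Writing the records to a file, for example with `open("clipped_GKG13.fasta", "w").writelines(clip_to_fasta(...))`, would give a FASTA file in the same format as the one produced above.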

Sequence Quality Control

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC, GUI version

http://bowtie-bio.sourceforge.net/

Bowtie aligns reads on indexed genomes

mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam

A bowtie alignment (command lines)

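../genomes/Dmel_r5.49 : basename of the bowtie index
-f clipped_GKG13.fasta : the reads are in fasta format
-v 1 : report alignments with at most 1 mismatch
-k 1 : report at most 1 alignment per read
-p 6 : use 6 threads
--al droso_matched_GKG-13.fa : write the reads that aligned to this file
--un unmatched_GKG13.fa : write the reads that failed to align to this file
-S : output in SAM format
> GKG13_bowtie_output.sam : redirect the output to this file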

# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
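The percentages in this summary follow directly from the three read counts. As a small worked check, the Python sketch below extracts the counts from the summary text and recomputes the rates; `parse_bowtie_summary` is a hypothetical helper written for this message layout only.

```python
import re


def parse_bowtie_summary(text):
    """Extract (processed, aligned, failed) read counts from a bowtie summary."""
    processed = int(re.search(r"# reads processed: (\d+)", text).group(1))
    aligned = int(
        re.search(r"# reads with at least one reported alignment: (\d+)", text).group(1)
    )
    failed = int(re.search(r"# reads that failed to align: (\d+)", text).group(1))
    return processed, aligned, failed


summary = """# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)"""

processed, aligned, failed = parse_bowtie_summary(summary)
print("aligned: %.2f%%" % (100.0 * aligned / processed))
print("failed:  %.2f%%" % (100.0 * failed / processed))
```

The aligned and failed counts add up to the number of reads processed, which is also the number of reads kept by the clipping step.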

mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49

Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa

SAM alignment: $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa

Formats

Raw sequence: Fastq (quality), Fasta (w/o quality)

Aligned sequence: Sam, Bam
• Sorted
• Indexed
• Compressed

Genome annotation: GFF, GTF,

Pileup Format

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
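Each pileup line gives the reference name, the position, the reference base, the read depth, the read bases and the base qualities. In the read-bases column, `.` and `,` are matches on the forward and reverse strand, a letter is a mismatch, `^` followed by one character marks the start of a read, `$` marks the end of a read, and `+n`/`-n` followed by n bases describe an insertion or deletion. The Python sketch below counts matches and mismatches for one line; `summarize_pileup_line` is a hypothetical helper, and it only handles the six-column layout shown here.

```python
from collections import Counter


def summarize_pileup_line(line):
    """Count matching and mismatching bases in one six-column pileup line."""
    chrom, pos, ref, depth, bases, _quals = line.split()[:6]
    matches = 0
    mismatches = Counter()
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":        # read start: skip the marker and the mapping quality
            i += 2
        elif char == "$":      # read end marker
            i += 1
        elif char in "+-":     # indel: skip the length digits and the indel bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
        else:
            if char in ".,":
                matches += 1
            elif char != "*":  # "*" is a placeholder for a deleted base
                mismatches[char.upper()] += 1
            i += 1
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "matches": matches,
        "mismatches": dict(mismatches),
    }


line = "seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<"
print(summarize_pileup_line(line))
```

At position 276, for instance, 21 of the 22 reads agree with the reference G and one read carries a T.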

Next week, we will perform an NGS analysis using the Galaxy framework.
We will speak about Accessibility, Reproducibility and Transparency.

Please have a look at http://galaxyproject.org/
You can register and try it.

Also, access http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)

AND

Register (Menu “user” → “register”) with your email address