Analysis of barcode sequencing - KRIBBmedical-genome.kribb.re.kr/barseq/Barcas_presentation.pdf ·...

Analysis of barcode sequencing

Jihyeob Mun, Department of Functional Genomics, UST

2016.12.07


Page 1: Analysis of barcode sequencing - KRIBBmedical-genome.kribb.re.kr/barseq/Barcas_presentation.pdf · 2020-01-17 · Memory usage Trie-data structure consumes more memory as sequence



Pooled library screen analysis

‘gene A’ is a target?

experience knowledge

Fail

Success

High-throughput

How?

Simplicity

Pooled library screen analysis

A pool:

- ‘gene A’ targeted cell
- ‘gene B’ targeted cell
- ‘gene C’ targeted cell
- ‘gene D’ targeted cell

The pool is used for the analysis.


◎ Barcode sequence

What is barcode sequencing

- Barcodes are genome-integrated artificial sequences that mark biological materials, such as cells or genes, with unique sequences.

- The barcodes are sequenced and analyzed by barcode sequencing (barcode-seq).

- Library: a set of barcode sequences.

- Barcode-seq is used in several genome-wide screening tools, including shRNAs, sgRNAs and barcoded yeast deletion strains.


Workflow : genome-wide shRNA screening


Workflow : barcoded yeast deletion strains


Limitation of barcode-seq data analysis

◎ Previously reported tools mostly focus on shRNA or sgRNA screening analysis.

◎ Error-free production of barcode libraries remains an important issue (for example, barcode errors and off-target problems).

Genome-wide functional analysis using the Barcode Sequence Alignment and Statistical Analysis (Barcas) tool


What is Barcas

- Barcas (Barcode sequence Alignment and Statistical analysis tool) is a specialized program for the analysis of multiplexed barcode sequencing (barcode-seq) data.

- Input: barcode-seq data (from shRNAs, sgRNAs and barcoded yeast deletion strains).


Analysis pipeline of barcode-seq data

Step 1: Data pre-processing

Step 2: QC of data

Step 3: Design experiment

Step 4: Statistical analysis


Three novel functions of Barcas

- Using a trie data structure, Barcas supports imperfect matching that tolerates mismatches, position shifts and indels (insertions and deletions).

- Detection of barcode errors in the library.

- Checking similarity between barcodes in the library collection (barcode library QC).


Feature 1: Trie data structure based imperfect matching


Previously reported tools for data preprocessing

Program | Mismatches | Shifts | Indels | Dynamic length | Backend tool | Ref
BiNGS!LS-seq | O | X | X | X | bowtie | Kim (2012) Methods Mol Bio
shALIGN | O | X | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio
edgeR | O | O | X | X | edgeR | Dai (2014) F1000Res
Barcas | O | O | O | O | Java | Mun (2016) BMC Bioinfo

Dynamic sequence length

[Figure: read layouts built from MID, universal primer and barcode segments in different arrangements; position axis ticks at 1, 10, 13, 18, 25, 28 and 33]

ex) In the Cellecta library (shRNA), the MID ranges from 9 bp to 17 bp.
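Because the MID length varies, the barcode does not start at a fixed position in the read. The following Python sketch shows one simple way to cope with that, assuming a MID, universal primer, barcode layout and an exact substring search for the primer. The primer sequence, the function name `extract_barcode` and the example reads are hypothetical; this is not how Barcas itself is implemented.

```python
def extract_barcode(read, primer, barcode_len):
    """Return the barcode that follows the primer, or None if it cannot be found."""
    # Exact search, so any MID length in front of the primer is tolerated.
    primer_start = read.find(primer)
    if primer_start == -1:
        return None
    barcode_start = primer_start + len(primer)
    barcode = read[barcode_start:barcode_start + barcode_len]
    # Reject reads that end before the full barcode has been sequenced.
    return barcode if len(barcode) == barcode_len else None


PRIMER = "GTTCAGAG"             # hypothetical universal primer
BARCODE = "TTAGCCGATCGATCGATC"  # hypothetical 18-bp barcode
short_mid_read = "ACGTACGTA" + PRIMER + BARCODE         # 9-bp MID
long_mid_read = "ACGTACGTACGTACGTA" + PRIMER + BARCODE  # 17-bp MID

print(extract_barcode(short_mid_read, PRIMER, 18))
print(extract_barcode(long_mid_read, PRIMER, 18))
```

Both reads yield the same 18-bp barcode even though it starts at different offsets. An exact primer search fails when the primer itself carries a mutation, which is one reason a tool may prefer imperfect matching.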


Trie data structure based imperfect matching

1:1 sequence matching processing

- Algorithm: list based
- Maximum time: N × M (N: read count, M: library sequence count)

1:M sequence matching processing

- Algorithm: trie based
- Maximum time: N (N: read count)

Read: TTAG

Library sequences: AGCT, TTAT, TTAG, TCAGT, GCAG, GCCAA, CGCT

[Figure: the library sequences stored as a trie under a single root; each node is a base and each path from the root to a leaf is a sequence]
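The difference between list-based and trie-based matching can be sketched in Python with the library sequences from this slide. This is a minimal illustration, not the Barcas implementation (which is written in Java): the class and function names are invented here, and the tolerant search handles substitutions only, whereas Barcas also handles shifts and indels.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # base -> TrieNode
        self.barcode = None  # set on the node where a library sequence ends


def build_trie(library):
    root = TrieNode()
    for seq in library:
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.barcode = seq
    return root


def match_exact(root, read):
    """Walk the read once; the cost does not depend on the library size."""
    node = root
    for base in read:
        if node.barcode is not None:
            return node.barcode
        node = node.children.get(base)
        if node is None:
            return None
    return node.barcode


def match_with_mismatches(root, read, max_mismatches=1):
    """Depth-first search that tolerates a bounded number of substitutions."""
    hits = set()
    stack = [(root, 0, 0)]  # (node, position in read, mismatches so far)
    while stack:
        node, pos, mismatches = stack.pop()
        if node.barcode is not None:
            hits.add((node.barcode, mismatches))
            continue
        if pos == len(read):
            continue
        for base, child in node.children.items():
            cost = mismatches + (base != read[pos])
            if cost <= max_mismatches:
                stack.append((child, pos + 1, cost))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


LIBRARY = ["AGCT", "TTAT", "TTAG", "TCAGT", "GCAG", "GCCAA", "CGCT"]
trie = build_trie(LIBRARY)
print(match_exact(trie, "TTAG"))
print(match_with_mismatches(trie, "TTACGG", max_mismatches=1))
```

With a list, each read would be compared against every library sequence in turn. With the trie, shared prefixes are stored once and a read is resolved in a single walk from the root.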


Comparison of speed and mapping rate

- Option

- Result: Barcas is 1.7 times faster than bowtie and 13 times faster than edgeR. Owing to indel mapping, Barcas mapped 8-12% more reads than the other two programs.

- Data: 215 million reads were mapped to 4,832 heterozygous diploid deletion strains of S. pombe. 45-bp sequences were used as the barcode library.


Feature 2: Detection of barcode errors


Methods of targeting regions (1/2)

○ Barcoded yeast deletion strains

○ shRNAs

○ sgRNAs

Homologous recombination site

When the artificial sequence targets an unexpected region, this is called off-target.


Methods of targeting regions (2/2)

Original design

Case | True | Off-target
Correct sequence | high | low
Barcode error | low | high

Solutions are provided by statistical analysis.

Not yet; imperfect matching is essential.


Detection of barcode errors (1/4)

Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains PNAS 101(30):11046-51

Smith et al (2009) Quantitative phenotyping via deep barcode sequencing Genome Res 19:1836-42

Measure | U1 | UpTag | U2 | D2 | DnTag | D1
# correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867
% correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7%
# correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095
% correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7%
% Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92%

○ Barcoded yeast deletion strains


Detection of barcode errors (2/4)

Ziller,MJ. et al., Nature 2015, 518, 355-9.

- Library: 1,230 shRNA sequences of the TRC library.

- Data: control samples in neuroepithelial (NE), early radial glial (ERG) and mid radial glial (MRG).

- We found 25 (2.03%) erroneous barcodes (≤ 2 base mismatches or indels).


Detection of barcode errors (3/4)

Deletion Mismatch Insertion


Detection of barcode errors (4/4)

A simple method for distinguishing barcode errors (PM: perfect matching, IM: imperfect matching)

○ Dominant PM

Original: GCTGGAGATCCTCAAAGTCAT
Real: GCTGGAGATCCTCAAAGTCAT (= original)

○ Barcode error

Original: GAATCTGCCACTCTCAGAATA
Real: AATCTGCCACTCTCAGAATA (≠ original, IM1)

[Figure panels: Deletion, Mismatch, Insertion]
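One way to read this slide is that a designed barcode is trusted when its perfect-match reads dominate, and flagged when an imperfect-match variant collects more reads than the original design. The Python sketch below encodes that reading. The decision rule (PM count at least as large as the largest IM count), the function name and the read counts are assumptions made for illustration, not values or logic taken from Barcas.

```python
def classify_barcode(pm_count, im_counts):
    """Label a designed barcode from its PM count and its per-variant IM counts.

    im_counts maps each observed variant sequence to its read count.
    Returns (label, dominant_variant_or_None).
    """
    if im_counts:
        top_variant, top_im = max(im_counts.items(), key=lambda item: item[1])
    else:
        top_variant, top_im = None, 0
    if pm_count == 0 and top_im == 0:
        return ("not detected", None)
    if pm_count >= top_im:  # assumed rule: the original design dominates
        return ("dominant PM", None)
    return ("barcode error", top_variant)


# Illustrative, invented read counts for the two barcodes shown on the slide.
print(classify_barcode(950, {"GCTGGAGATCCTCAAAGTCAA": 12}))
print(classify_barcode(3, {"AATCTGCCACTCTCAGAATA": 870}))
```

In the second call the variant with the first base deleted dominates, so the designed sequence GAATCTGCCACTCTCAGAATA would be reported as a barcode error together with the sequence actually observed.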


Inclusion of barcode errors

○ Barcoded yeast deletion strains

○ shRNAs (or sgRNAs)

▷ include barcode errors

▷ filtering barcode errors

Why use imperfect matching in shRNAs?

• Increase mapped read counts
• Consider mutated primers (shifts)
• Provide additional information

Barcas supports an option for filtering barcode errors.

Except for several libraries, ex) Cellecta library


Feature 3: Checking similarity between original barcodes


Library reference QC (1/2)

- Barcode errors can potentially be generated during the production of many barcodes.

- If some barcodes are designed similarly and mutations or sequencing errors occur, then it is hard to distinguish errors from true differences.

- Thus, barcodes that were originally designed to be similar should be separated at the pooling step.

- For this purpose, Barcas reports sequence similarity between barcodes.
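A library similarity check of this kind can be sketched in Python as a pairwise scan that reports barcodes differing by only a few bases. This is a simplified illustration: it counts substitutions only (Hamming distance) between equal-length barcodes, the threshold `max_diff` and the function names are assumptions, and the second library entry is a made-up near-duplicate added to trigger a report.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(library, max_diff=2):
    """Return (barcode_a, barcode_b, differences) for every pair within max_diff."""
    pairs = []
    for a, b in combinations(sorted(library), 2):
        if len(a) != len(b):
            continue  # substitution-only check needs equal lengths
        diff = hamming(a, b)
        if diff <= max_diff:
            pairs.append((a, b, diff))
    return pairs


library = [
    "GCTGGAGATCCTCAAAGTCAT",
    "GCTGGAGATCCTCAAAGTCAA",  # hypothetical near-duplicate of the first barcode
    "GAATCTGCCACTCTCAGAATA",
]
for a, b, diff in similar_pairs(library):
    print(a, b, diff)
```

The scan compares every pair, so its cost grows with the square of the library size. That is acceptable for a one-off QC pass over a library but would need indexing for much larger collections.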


Library reference QC (2/2)

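Screen | Library | Date | Species | Module | Barcode length | Barcode count | Gene count | Reference | Static sequence comparison | Dynamic sequence comparison
shRNA | TRC | 05/Apr/11 | Human | | 21-bp | 61,621 | 15,435 | http://www.broadinstitute.org/rnai/public/ | 790 (1.28 %) | 1,909 (3.10 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module1 | 18-bp | 27,500 | 5,046 | https://www.cellecta.com/ | 0 (0 %) | 412 (1.5 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module2 | 18-bp | 27,500 | 5,421 | https://www.cellecta.com/ | 0 (0 %) | 398 (1.45 %)
shRNA | Cellecta | 15/Feb/12 | Human | Module3 | 18-bp | 27,500 | 4,923 | https://www.cellecta.com/ | 0 (0 %) | 410 (1.49 %)
sgRNA | Yusa | | Mouse | | 19-bp | 87,437 | 19,149 | Koike et al., 2014 | 517 (0.59 %) | 3,944 (4.51 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library A | 20-bp | 63,950 | 21,669 | https://www.addgene.org/crispr/libraries/geckov2/ | 517 (0.81 %) | 538 (0.84 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library B | 20-bp | 56,869 | 19,834 | https://www.addgene.org/crispr/libraries/geckov2/ | 437 (0.77 %) | 441 (0.78 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library A | 20-bp | 65,959 | 22,486 | https://www.addgene.org/crispr/libraries/geckov2/ | 736 (1.12 %) | 755 (1.14 %)
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library B | 20-bp | 61,139 | 21,263 | https://www.addgene.org/crispr/libraries/geckov2/ | 850 (1.39 %) | 860 (1.41 %)
Deletion mutant strains | Heterozygous diploid | | Saccharomyces cerevisiae | | 20-bp | 6,318/UP, 6,126/DN | 6,131 | http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html | 0 (0 %) | 0 (0 %)
Deletion mutant strains | Heterozygous diploid | | Schizosaccharomyces pombe | | 20-bp | 4,832/UP, 4,832/DN | 4,832 | Kim,D.U. et al, 2010 | 0 (0 %) | 0 (0 %)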


Conclusion

◎ Barcas is all-in-one software for barcode-seq data analysis and provides several new functions for data pre-processing and quality control of barcode libraries.

◎ Points for improvement

Memory usage

- The trie data structure consumes more memory as sequences get longer because of the recursive function.
- Barcas consumes a large amount of memory while creating intermediate files (.seqmap) from FASTQ or FASTA files in the mapping step.
- For example, Barcas needs about 350 MB of memory to load the Yusa library (19-bp, 87,437 barcodes).

Statistical analysis

- Multiple-condition comparison (MAGeCK-VISPR)
- Utilization of metadata (HiTSelect)


Acknowledgement


Dr. Seon-Young Kim, Dr. Jong-Lyul Park and Jeong-Hwan Kim

Aging Research Center of KRIBB: Dong-Uk Kim

Chungnam National University: Dr. Kwang-Lae Hoe, Dr. 이숙정, Miyoung Nam, 이아름, etc.


Thank you for listening


Static comparison vs. Dynamic comparison

Comparison of the AGCT sequence with the ACTA sequence

Static comparison

- Based on the same lengths between sequences
- Difference: 2 bases

Dynamic comparison

- Based on the length of a specific sequence
- Difference: 1 base

Input sequence (read): AGCTACTA… (barcode region followed by other region)
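The two figures on this slide (2 bases and 1 base) can be reproduced with a small Python sketch under one plausible reading of the terms: the static comparison is the edit distance between the two full sequences, and the dynamic comparison is the smallest edit distance between the first sequence and any prefix of the second, which models a read that continues into the other region after an indel. This interpretation and the function names are assumptions, not definitions taken from Barcas.

```python
def prefix_distances(barcode, other):
    """Edit distance between the whole barcode and every prefix of `other`.

    Element j of the result is the distance between barcode and other[:j].
    """
    prev = list(range(len(other) + 1))  # empty barcode vs. each prefix of other
    for i, x in enumerate(barcode, start=1):
        cur = [i]  # barcode[:i] vs. the empty prefix
        for j, y in enumerate(other, start=1):
            cur.append(min(
                prev[j] + 1,             # drop x from the barcode
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = cur
    return prev


def static_distance(barcode, other):
    """Edit distance using the full length of both sequences."""
    return prefix_distances(barcode, other)[-1]


def dynamic_distance(barcode, other):
    """Best edit distance against any prefix of `other`."""
    return min(prefix_distances(barcode, other))


print(static_distance("AGCT", "ACTA"))   # 2
print(dynamic_distance("AGCT", "ACTA"))  # 1
```

Under this reading, deleting G from AGCT gives ACT, which matches the first three bases of ACTA, so a single deletion is enough in the dynamic case while the fixed-length comparison needs two edits.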