Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Bioinformatics Methods for Diagnosis and Treatment of Human Diseases Jorge Duitama Dissertation Proposal for the Degree of Doctorate in Philosophy Computer Science & Engineering Department University of Connecticut


Page 1: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Jorge Duitama

Dissertation Proposal for the Degree of Doctorate in Philosophy

Computer Science & Engineering Department

University of Connecticut

Page 2: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Outline

• Ongoing Research
– Primer Hunter
– Bioinformatics pipeline for detection of immunogenic cancer mutations

• Future Work
– Isoforms reconstruction problem

Page 3: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Introduction

• Research efforts during the last two decades have provided a huge amount of genomic information for almost every form of life

• Much effort is focused on refining methods for diagnosis and treatment of human diseases

• The focus of this research is on developing computational methods and software tools for diagnosis and treatment of human diseases

Page 4: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

PrimerHunter: A Primer Design Tool for PCR-Based Virus Subtype Identification

Jorge Duitama (1), Dipu Kumar (2), Edward Hemphill (3), Mazhar Khan (2), Ion Mandoiu (1), and Craig Nelson (3)

(1) Department of Computer Sciences & Engineering
(2) Department of Pathobiology & Veterinary Science
(3) Department of Molecular & Cell Biology

Page 5: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Avian Influenza

C.W.Lee and Y.M. Saif. Avian influenza virus. Comparative Immunology, Microbiology & Infectious Diseases, 32:301-310, 2009

Page 6: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Polymerase Chain Reaction (PCR)

http://www.obgynacademy.com/basicsciences/fetology/genetics/

Page 7: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Primer3

PRIMER PICKING RESULTS FOR gi|13260565|gb|AF250358

No mispriming library specified
Using 1-based sequence positions
OLIGO         start  len     tm    gc%   any    3'  seq
LEFT PRIMER     484   25  59.94  56.00  5.00  3.00  CCTGTTGGTGAAGCTCCCTCTCCAT
RIGHT PRIMER    621   25  59.95  52.00  3.00  2.00  TTTCAATACAGCCACTGCCCCGTTG
SEQUENCE SIZE: 1410
INCLUDED REGION SIZE: 1410

PRODUCT SIZE: 138, PAIR ANY COMPL: 4.00, PAIR 3' COMPL: 1.00

…
481 TGTCCTGTTGGTGAAGCTCCCTCTCCATACAATTCAAGGTTTGAGTCGGTTGCTTGGTCA
       >>>>>>>>>>>>>>>>>>>>>>>>>

541 GCAAGTGCTTGCCATGATGGCATTAGTTGGTTGACAATTGGTATTTCCGGGCCAGACAAC
                                                            <<<<

601 GGGGCAGTGGCTGTATTGAAATACAATGGTATAATAACAGACACTATCAAGAGTTGGAGA
    <<<<<<<<<<<<<<<<<<<<<
…

Page 8: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases
Page 9: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Tools Comparison

Page 10: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Notations

• s(l,i): subsequence of length l ending at position i (i.e., $s(l,i) = s_{i-l+1} \ldots s_{i-1} s_i$)

• Given a 5’ – 3’ sequence p and a 3’ – 5’ sequence s, |p| = |s|, the melting temperature T(p,s) is the temperature at which 50% of the possible p-s duplexes are in hybridized state

• Given two 5’ – 3’ sequences p, t and a position i, T(p,t,i): Melting temperature T(p,t’(|p|,i))

Page 11: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Notations (Cont)

• Given two 5’ – 3’ sequences p and s, |p| = |s|, and a 0-1 mask M, p matches s according to M if $p_i = s_i$ for every $i \in \{1, \ldots, |s|\}$ for which $M_i = 1$

AATATAATCTCCATAT
CTTTAGCCCTTCAGAT
0000000000011011

• I(p,t,M): Set of positions i for which p matches t(|p|, i) according to M
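
The mask-based matching and the set I(p,t,M) defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not PrimerHunter's implementation: the function names `matches` and `match_positions` are hypothetical, positions are 1-based as on the slides, and t(|p|, i) is read as the substring of t of length |p| ending at position i.

```python
def matches(p, s, mask):
    """True if p matches s according to the 0-1 mask (all three have equal length)."""
    if not (len(p) == len(s) == len(mask)):
        raise ValueError("p, s and mask must have equal length")
    # Only positions where the mask is '1' are required to agree.
    return all(a == b for a, b, m in zip(p, s, mask) if m == "1")


def match_positions(p, t, mask):
    """I(p, t, M): 1-based end positions i where p matches t(|p|, i) under the mask."""
    n = len(p)
    return [i for i in range(n, len(t) + 1) if matches(p, t[i - n:i], mask)]
```

With the two sequences and the mask shown on the slide, `matches` accepts the pair because the four masked positions agree, even though most unmasked positions differ.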

Page 12: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Discriminative Primer Selection Problem (DPSP)

Given
• Sets TARGETS and NONTARGETS of target/non-target DNA sequences in 5’ – 3’ orientation, 0-1 mask M, temperature thresholds Tmin_target and Tmax_nontarget

Find
• All primers p satisfying that
– for every t ∈ TARGETS, there exists i ∈ I(p,t,M) s.t. T(p,t,i) ≥ Tmin_target
– for every t ∈ NONTARGETS, T(p,t,i) ≤ Tmax_nontarget for every i ∈ {|p|, …, |t|}
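
A brute-force reading of the problem statement can be sketched as below. This is only an illustration of the two conditions, not the PrimerHunter algorithm: `select_primers` and `toy_tm` are hypothetical names, candidates are restricted to a single fixed length, the mask is assumed to apply to the 3’ end of the primer, and the melting temperature is passed in as a function because no thermodynamic model is implemented here.

```python
def select_primers(targets, nontargets, mask, tm, length, tmin_target, tmax_nontarget):
    """Return all length-`length` substrings of targets meeting both DPSP conditions."""
    k = len(mask)

    def seed_ok(p, window):
        # Assumption: the mask covers the last k (3'-end) positions of the primer.
        return all(a == b for a, b, m in zip(p[-k:], window[-k:], mask) if m == "1")

    def ends(t):
        # 1-based end positions i in {|p|, ..., |t|}
        return range(length, len(t) + 1)

    candidates = {t[i - length:i] for t in targets for i in ends(t)}
    selected = []
    for p in sorted(candidates):
        hits_every_target = all(
            any(seed_ok(p, t[i - length:i]) and tm(p, t, i) >= tmin_target for i in ends(t))
            for t in targets
        )
        avoids_nontargets = all(
            tm(p, t, i) <= tmax_nontarget for t in nontargets for i in ends(t)
        )
        if hits_every_target and avoids_nontargets:
            selected.append(p)
    return selected


def toy_tm(p, t, i):
    """Placeholder score, NOT a melting temperature model: 4 per identical base."""
    window = t[i - len(p):i]
    return 4.0 * sum(a == b for a, b in zip(p, window))
```

In practice `tm` would be replaced by a real duplex melting temperature calculation; the toy function only makes the sketch executable.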

Page 13: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Nearest Neighbor Model

• Given an alignment x:

$$T_m(x) = \frac{\Delta H(x)}{\Delta S(x) + 0.368 \cdot \frac{N}{2} \cdot \ln[\mathrm{Na}^+] + R \ln(C)}$$

where C is $c_1 - c_2/2$ if $c_1 \neq c_2$ and $(c_1 + c_2)/4$ if $c_1 = c_2$

• ΔH (x) and ΔS (x) are calculated by adding contributions of each pair of neighbor base pairs in x

• Problem: Find the alignment x maximizing Tm (x)
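
For a fixed alignment x whose ΔH and ΔS are already known, the formula above is a direct computation. The sketch below assumes ΔH in cal/mol, ΔS in cal/(mol·K), concentrations in mol/L and a result in Kelvin; these units, the function names, and the requirement that c1 be the larger concentration when the two differ are assumptions of this example, not statements from the slides. It does not compute the nearest-neighbor sums or search over alignments.

```python
from math import log

R = 1.987  # gas constant in cal/(mol*K)


def effective_concentration(c1, c2):
    """C from the slide: (c1 + c2)/4 if c1 == c2, else c1 - c2/2 (assumes c1 > c2)."""
    return (c1 + c2) / 4 if c1 == c2 else c1 - c2 / 2


def melting_temperature(delta_h, delta_s, n, na, c1, c2):
    """Tm(x) = dH / (dS + 0.368 * N/2 * ln[Na+] + R * ln C), in Kelvin."""
    c = effective_concentration(c1, c2)
    return delta_h / (delta_s + 0.368 * (n / 2) * log(na) + R * log(c))
```

For example, `melting_temperature(-180000.0, -500.0, 24, 0.05, 0.8e-6, 0.8e-6)` evaluates the formula for made-up ΔH and ΔS values at 50mM salt and 0.8μM primer concentration.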

Page 14: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Fractional Programming

• Given a finite set S and two functions f, g: S → R with g > 0, $t^* = \max_{x \in S} f(x)/g(x)$ can be approximated by the Dinkelbach algorithm:

1. Choose $t_1 \le t^*$; i ← 1

2. Find $x_i \in S$ maximizing $F(x) = f(x) - t_i g(x)$

3. If $F(x_i) \le \varepsilon$ for some tolerance ε > 0, output $t_i$

4. Else, $t_{i+1} \leftarrow f(x_i)/g(x_i)$, i ← i + 1, and go to step 2
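
The four steps above can be sketched for an explicitly enumerated finite set. The function name `dinkelbach`, the choice of the first element's ratio as $t_1$, and the iteration cap are assumptions of this example; the inner maximization is done by plain enumeration here, whereas the slides use dynamic programming for that step.

```python
def dinkelbach(items, f, g, eps=1e-9, max_iter=1000):
    """Maximize f(x)/g(x) over a finite collection, assuming g(x) > 0 for all x."""
    items = list(items)
    x = items[0]
    t = f(x) / g(x)  # step 1: any attained ratio is a valid lower bound on t*
    for _ in range(max_iter):
        # step 2: maximize the parametric objective F(y) = f(y) - t * g(y)
        x = max(items, key=lambda y: f(y) - t * g(y))
        # step 3: stop once the best parametric value is within tolerance of zero
        if f(x) - t * g(x) <= eps:
            return t, x
        # step 4: update the ratio estimate and repeat
        t = f(x) / g(x)
    raise RuntimeError("Dinkelbach iteration did not converge")
```

Each iteration strictly increases t until it reaches the optimum, so on a finite set the loop terminates after at most as many updates as there are distinct ratios.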

Page 15: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Fractional Programming Applied to Tm Calculation

• Use dynamic programming to maximize:

$$t_i \left(\Delta S(x) + 0.368 \cdot \frac{N}{2} \cdot \ln[\mathrm{Na}^+] + R \ln(C)\right) - \Delta H(x) = -\Delta G(x)$$

• ΔG(x) is the free energy of the alignment x at temperature $t_i$

Page 16: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Melting Temperature Calculation Results

Page 17: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Design forward primers and design reverse primers:

• Build candidates as suitable length substrings of one or more target sequences

• Iterate over targets to build a hash table H of occurrences of seed patterns according to mask M

• Test each candidate p:
– Test GC content, GC clamp, single base repeat and self complementarity
– For each target t use H to build I(p,t,M) and test if T(p,t,i) ≥ Tmin_target
– For each non target t test on every i if T(p,t,i) < Tmax_nontarget

Make pairs filtering by product length, cross dimerization and Tm
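
One way to read the hash table H is as an index from seed patterns to their occurrences in the targets, so that I(p,t,M) is obtained by a lookup instead of a scan. The sketch below is an assumption-laden illustration, not PrimerHunter's data structure: the names `seed_key`, `build_seed_index` and `match_positions` are hypothetical, the mask is assumed to cover only the 3’ end of the primer, and positions are 1-based end positions.

```python
from collections import defaultdict


def seed_key(window, mask):
    """Characters at the masked ('1') positions of the last len(mask) bases."""
    k = len(mask)
    return tuple(c for c, m in zip(window[-k:], mask) if m == "1")


def build_seed_index(targets, mask):
    """H: seed pattern -> list of (target index, 1-based end position)."""
    index = defaultdict(list)
    k = len(mask)
    for ti, t in enumerate(targets):
        for i in range(k, len(t) + 1):
            index[seed_key(t[i - k:i], mask)].append((ti, i))
    return index


def match_positions(index, p, mask, target_index):
    """I(p, t, M) via lookup; keeps only end positions that leave room for all of p."""
    hits = index.get(seed_key(p, mask), [])
    return [i for ti, i in hits if ti == target_index and i >= len(p)]
```

The index is built once per target set and reused for every candidate primer, which is where the saving over repeated scanning comes from.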

Page 18: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Design Success Rate

FP: Forward Primers; RP: Reverse Primers; PP: Primer Pairs

Page 19: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

NA Phylogenetic Tree

Page 20: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Primers Validation

Page 21: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Primers Validation

Page 22: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Current Status

• Paper published in Nucleic Acids Research in March 2009

• Web server and open source code available at http://dna.engr.uconn.edu/software/PrimerHunter/

• Successful primer design for 287 submissions since publication

Page 23: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing

Jorge Duitama1, Ion Mandoiu1, and Pramod Srivastava2

(1) University of Connecticut, Department of Computer Sciences & Engineering
(2) University of Connecticut Health Center

Page 24: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Immunology Background

J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

Page 25: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Cancer Immunotherapy

Tumor mRNA Sequencing → Tumor Specific Epitopes Discovery → Peptides Synthesis → Immune System Training → Tumor Remission

Tumor mRNA Sequencing:

CTCAATTGATGAAATTGTTCTGAAACTGCAGAGATAGCTAAAGGATACCGGGTTCCGGTATCCTTTAGCTATCTCTGCCTCCTGACACCATCTGTGTGGGCTACCATG

AGGCAAGCTCATGGCCAAATCATGAGA

Tumor Specific Epitopes Discovery:

SYFPEITHI
ISETDLSLL
CALRRNESL

Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

Page 26: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

2nd Generation Sequencing Technologies

• Illumina Genome Analyzer IIx: ~100-300M reads/pairs, 35-100bp, 4.5-33 Gb / run (2-10 days)

• Roche/454 FLX Titanium: ~1M reads, 400bp avg., 400-600Mb / run (10h)

• ABI SOLiD 3 plus: ~500M reads/pairs, 35-50bp, 25-60Gb / run (3.5-14 days)

• Helicos HeliScope: 25-55bp reads, >1Gb/day

Massively parallel, orders of magnitude higher throughput compared to classic Sanger sequencing

Page 27: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

SNP Calling from Genomic DNA Reads

Reference genome sequence

>ref|NT_082868.6|Mm19_82865_37:1-3688105 Mus musculus chromosome 19 genomic contig, strain C57BL/6J
GATCATACTCCTCATGCTGGACATTCTGGTTCCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATGCACAAATATTTCCACGCTTTTTCACTACAGATAAAGAACTGGGACTTGCTTATTTACCTTTAGATGAACAGATTCAGGCTCTGCAAGAAAATAGAATTTTCTTCATACAGGGAAGCCTGTGCTTTGTACTAATTTCTTCATTACAAGATAAGAGTCAATGCATATCCTTGTATAAT

Read sequences & quality scores

@HWI-EAS299_2:2:1:1536:631
GGGATGTCAGGATTCACAATGACAGTGCTGGATGAG
+HWI-EAS299_2:2:1:1536:631
::::::::::::::::::::::::::::::222220
@HWI-EAS299_2:2:1:771:94
ATTACACCACCTTCAGCCCAGGTGGTTGGAGTACTC
+HWI-EAS299_2:2:1:771:94
:::::::::::::::::::::::::::2::222220

Read Mapping

SNP calling

1 4764558 G T 2 1
1 4767621 C A 2 1
1 4767623 T A 2 1
1 4767633 T A 2 1
1 4767643 A C 4 2
1 4767656 T C 7 1

Page 28: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Mapping mRNA Reads

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Page 29: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Analysis Pipeline

• Tumor mRNA (PE) reads → CCDS Mapping → CCDS mapped reads

• Tumor mRNA (PE) reads → Genome Mapping → Genome mapped reads

• CCDS mapped reads + Genome mapped reads → Read merging → Mapped reads

• Mapped reads → Variants detection → Tumor-specific mutations → Epitopes Prediction → Tumor-specific CTL epitopes

• Unmapped reads → Gene fusion & novel transcript detection

Page 30: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Read Merging

Genome        CCDS          Agree?   Hard Merge   Soft Merge
Unique        Unique        Yes      Keep         Keep
Unique        Unique        No       Throw        Throw
Unique        Multiple      No       Throw        Keep
Unique        Not Mapped    No       Keep         Keep
Multiple      Unique        No       Throw        Keep
Multiple      Multiple      No       Throw        Throw
Multiple      Not Mapped    No       Throw        Throw
Not mapped    Unique        No       Keep         Keep
Not mapped    Multiple      No       Throw        Throw
Not mapped    Not Mapped    Yes      Throw        Throw
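
The merging table can be condensed into a small decision rule. The sketch below reproduces the Keep/Throw outcomes of the table; the function name `keep_read` and the status strings are hypothetical, chosen only for this example.

```python
UNIQUE, MULTIPLE, NOT_MAPPED = "unique", "multiple", "not mapped"


def keep_read(genome, ccds, agree, soft=False):
    """Return True if a read is kept under hard merge (default) or soft merge."""
    n_unique = (genome, ccds).count(UNIQUE)
    if n_unique == 2:
        # Both mappings are unique: keep only when they agree.
        return agree
    if n_unique == 1:
        other = ccds if genome == UNIQUE else genome
        # Hard merge needs the other mapping to be absent;
        # soft merge also tolerates a multiple mapping on the other side.
        return other == NOT_MAPPED or (soft and other == MULTIPLE)
    # No unique mapping at all: always thrown away.
    return False
```

The only difference between the two strategies is the treatment of a unique alignment on one side paired with multiple alignments on the other.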

Page 31: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Variant Calling Methods

• Binomial: Test used in e.g. [Levy et al 07, Wheeler et al 08] for calling SNPs from genomic DNA

• Posterior: Picks the genotype with best posterior probability given the reads, assuming uniform priors
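
The two calling strategies can be illustrated with a simplified model. Everything specific in this sketch is an assumption made for the example rather than the pipeline's actual procedure: a single fixed per-base error rate `err`, errors spread uniformly over the three other bases, diploid genotypes, and the function names `binomial_tail` and `posterior_genotype`.

```python
from itertools import combinations_with_replacement
from math import comb, exp, log

BASES = "ACGT"


def binomial_tail(n, k, err):
    """P(X >= k) for X ~ Binomial(n, err): chance that k or more of n reads
    show the alternative allele through sequencing error alone."""
    return sum(comb(n, j) * err**j * (1 - err) ** (n - j) for j in range(k, n + 1))


def posterior_genotype(bases, err=0.01):
    """Genotype with the highest posterior given the read bases, uniform priors."""
    log_lik = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for b in bases:
            p1 = 1 - err if b == a1 else err / 3
            p2 = 1 - err if b == a2 else err / 3
            total += log(0.5 * p1 + 0.5 * p2)  # read drawn from either allele
        log_lik[a1 + a2] = total
    best = max(log_lik, key=log_lik.get)
    # With uniform priors the posterior is the normalized likelihood.
    norm = sum(exp(v - log_lik[best]) for v in log_lik.values())
    return best, 1.0 / norm
```

A binomial caller would report a variant when `binomial_tail` falls below a chosen significance level, while the posterior caller reports the returned genotype whenever it differs from homozygous reference.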

Page 32: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Epitopes Prediction

• Predictions include MHC binding, TAP transport efficiency, and proteasomal cleavage

C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

Page 33: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Accuracy Assessment of Variants Detection

• 63 million Illumina mRNA reads generated from blood cell tissue of HapMap individual NA12878 (NCBI SRA database accession number SRX000566)

– We selected HapMap SNPs in known exons for which there was at least one mapped read by any method (22,362 homozygous reference, 7,893 heterozygous or homozygous variant)

– True positives: called variants for which HapMap genotype is heterozygous or homozygous variant

– False positives: called variants for which HapMap genotype is homozygous reference
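
The true/false positive definitions above amount to a simple tally over the selected HapMap sites. In the sketch below, the function name `count_tp_fp`, the genotype labels, and the (chromosome, position) keys are all hypothetical; called positions outside the selected HapMap set are ignored, which is an assumption of this example.

```python
def count_tp_fp(called_positions, hapmap_genotypes):
    """Count true and false positives among called variant positions.

    hapmap_genotypes maps a site to 'hom_ref', 'het' or 'hom_var'.
    """
    tp = fp = 0
    for site in called_positions:
        genotype = hapmap_genotypes.get(site)
        if genotype in ("het", "hom_var"):
            tp += 1  # HapMap confirms a non-reference allele at this site
        elif genotype == "hom_ref":
            fp += 1  # HapMap says the individual is homozygous reference
    return tp, fp
```

Sweeping a caller's threshold and recording the resulting pairs gives one curve of true positives against false positives per calling strategy.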

Page 34: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Comparison of Variant Calling Strategies

Plot of True Positives (y axis) vs. False Positives (x axis) for Posterior, Binomial and Maq. Genome Mapping, Alt. coverage 1

Page 35: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Comparison of Variant Calling Strategies

Plot of True Positives (y axis) vs. False Positives (x axis) for Posterior, Binomial and Maq. Genome Mapping, Alt. coverage 3

Page 36: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Comparison of Mapping Strategies

Plot of True Positives (y axis) vs. False Positives (x axis) for Transcripts, Genome, Hard Merge and Soft Merge. Posterior, Alt. coverage 3

Page 37: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Results on Meth A Reads

• 6.75 million Illumina reads from mRNA isolated from a mouse cancer tumor cell line

• Filters applied for variant candidates after hard merge mapping and posterior calling:
– Minimum of three reads per alternative allele
– Filtered out SNVs in or close to regions marked as repetitive by RepeatMasker
– Filtered out homozygous or triallelic SNVs

• 358 variants produced 617 epitopes with SYFPEITHI score higher than 15 for the mutated peptide
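
Turning one amino acid change into candidate epitopes can be sketched as enumerating every peptide window that covers the mutated residue and scoring it. This is an illustration under explicit assumptions: peptides of length 9, 0-based residue positions, and a caller-supplied `score` function standing in for whatever predictor is used (no SYFPEITHI interface is implied). The function names are hypothetical; the default threshold of 15 mirrors the cutoff quoted above.

```python
def peptides_spanning(protein, pos, k=9):
    """All length-k substrings of protein that contain residue index pos (0-based)."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(protein) - k)
    return [protein[s:s + k] for s in range(first_start, last_start + 1)]


def mutated_epitope_candidates(protein, pos, new_residue, score, threshold=15, k=9):
    """Return (mutated peptide, score, score difference to reference peptide)."""
    mutated = protein[:pos] + new_residue + protein[pos + 1:]
    candidates = []
    for ref_pep, mut_pep in zip(peptides_spanning(protein, pos, k),
                                peptides_spanning(mutated, pos, k)):
        mut_score = score(mut_pep)
        if mut_score > threshold:
            candidates.append((mut_pep, mut_score, mut_score - score(ref_pep)))
    return candidates
```

A single variant can yield up to nine candidate 9-mers under these assumptions, which is why the number of epitopes can exceed the number of variants.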

Page 38: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

SYFPEITHI Scores Distribution of Mutated Peptides

Page 39: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Distribution of SYFPEITHI Score Differences Between Mutated and Reference Peptides

Page 40: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Current Status

• Presented as a poster at ISBRA 2009 and as a talk at Genome Informatics in CSHL

• Over a hundred candidate epitopes are currently under experimental validation

Page 41: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Validation Results

• We are using mass spectrometry to confirm presentation of epitopes on the cell surface

• Mutations reported by [Noguchi et al 94] were found by this pipeline

• We are performing Sanger sequencing of PCR amplicons to confirm reported mutations

Page 42: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Ongoing and Future Work

• Primer Hunter
– Experiment with degenerate primers
– Capture probes design for TCR sequencing

• Bioinformatics Pipeline
– Increase mutation detection robustness
– Integrate tools for structural variation detection from paired end reads
– Include predictions of transport efficiency and proteasomal cleavage, and mass spectrometry data
– Detect short indels
– Detect novel transcripts

Page 43: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Alternative Splicing

http://en.wikipedia.org/wiki/File:Splicing_overview.jpg

Page 44: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Isoforms Reconstruction

• Problem: Given a set of mRNA reads, reconstruct the isoforms present in the sample

• Current RNA-Seq approaches are limited to finding evidence for exon junctions

• We hope to overcome read length limitations by using paired end reads

Page 45: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Transcription Levels Inference

• Isoforms set {s1, s2, … , sj, … sn}

• lj := Length of isoform j

• fj := Relative frequency of isoform j

• For a read r ∈ R, Ir is the set of isoforms that can originate r

• wr(j) := Probability of r coming from sj given that its starting position is sampled

Page 46: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Transcription Levels Inference

Page 47: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Acknowledgments

• Ion Mandoiu, Yufeng Wu and Sanguthevar Rajasekaran
• Mazhar Khan, Dipu Kumar (Pathobiology & Vet. Science)
• Craig Nelson and Edward Hemphill (MCB)
• Pramod Srivastava, Brent Graveley and Duan Fei (UCHC)
• NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
• UCONN Research Foundation UCIG grant

Page 48: Bioinformatics Methods for Diagnosis and Treatment of Human Diseases

Primers Design Parameters

1. Primer length between 20 and 25
2. Amplicon length between 75 and 200
3. GC content between 25% and 75%
4. Maximum mononucleotide repeat of 5
5. 3’-end perfect match mask M = 11
6. No required 3’ GC clamp
7. Primer concentration of 0.8μM
8. Salt concentration of 50mM
9. Tmin_target = Tmax_nontarget = 40 °C
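
The sequence-only parameters in this list (length, GC content, mononucleotide repeat) can be checked without any thermodynamic model. The sketch below is a hypothetical illustration, not PrimerHunter's configuration format: the class `PrimerParams`, its field names, and the helper functions are invented for this example, and the remaining parameters are stored but not used.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class PrimerParams:
    min_len: int = 20
    max_len: int = 25
    min_gc: float = 0.25
    max_gc: float = 0.75
    max_repeat: int = 5           # longest allowed run of a single base
    mask: str = "11"              # 3'-end perfect match mask M
    primer_conc: float = 0.8e-6   # mol/L (0.8 uM)
    salt_conc: float = 0.05       # mol/L (50 mM)
    tmin_target: float = 40.0     # degrees C
    tmax_nontarget: float = 40.0  # degrees C


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    return sum(base in "GC" for base in primer) / len(primer)


def longest_run(primer):
    """Length of the longest run of one repeated base."""
    return max(len(list(group)) for _, group in groupby(primer))


def passes_basic_filters(primer, params=PrimerParams()):
    """Length, GC content and mononucleotide repeat checks only."""
    return (params.min_len <= len(primer) <= params.max_len
            and params.min_gc <= gc_content(primer) <= params.max_gc
            and longest_run(primer) <= params.max_repeat)
```

As a sanity check, the left primer from the Primer3 example, CCTGTTGGTGAAGCTCCCTCTCCAT, has a GC content of 56% by this computation, in agreement with the gc% column of that output.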