CSE280Vineet Bafna CSE280a: Projects Vineet Bafna.


Page 1:

CSE280a: Projects

Vineet Bafna

Page 2:

Project Logistics

• Research project (70%)
• Work individually, or in groups of 2
• Two presentations:
– Introductory presentation: Feb 1st week (20 minutes) (20% grade)
  • Describe the goals of the project
  • Describe your (computational) formulation
  • Summarize/critique reading assignment
  • Present an algorithm
  • Constructive criticism of other projects
– One-on-one meeting with instructor (end of February) (10% grade)
  • Discuss preliminary results
– Final presentation (last 2-3 classes) (30% grade)
  • Submit a final report
  • Final presentation

Page 3:

Project 1: disease gene mapping

• Recall linkage disequilibrium (LD)
• In the absence of recombination:
– Correlation between columns
– The joint probability Pr[A=a, B=b] is different from P(a)P(b)
• With extensive recombination:
– Pr(a,b) = P(a)P(b)

Page 4:

Measures of LD

• Consider two bi-allelic sites with alleles marked with 0 and 1

• Define
– P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
– P0* = Pr[Allele 0 in locus 1]
– P*0 = Pr[Allele 0 in locus 2]

• Linkage equilibrium if P00 = P0* P*0

• D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
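As a minimal sketch of the definitions above, the following computes D for two bi-allelic sites from a list of haplotypes. The function name `ld_d` and the toy haplotypes are illustrative assumptions, not part of the course material.

```python
def ld_d(haplotypes, i, j):
    """Return D = |P00 - P0* P*0| for sites i and j.

    haplotypes: list of equal-length strings over {'0', '1'}, one per chromosome.
    """
    n = len(haplotypes)
    p00 = sum(1 for h in haplotypes if h[i] == "0" and h[j] == "0") / n
    p0_star = sum(1 for h in haplotypes if h[i] == "0") / n
    p_star0 = sum(1 for h in haplotypes if h[j] == "0") / n
    return abs(p00 - p0_star * p_star0)


# Toy data (hypothetical): sites 0 and 1 always carry the same allele,
# while sites 0 and 2 are independent of each other.
haps = ["000", "001", "110", "111"]
d_linked = ld_d(haps, 0, 1)
d_unlinked = ld_d(haps, 0, 2)
```

In this toy sample the fully correlated pair gives the largest D possible at allele frequency 0.5, and the independent pair gives zero.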

Page 5:

LD can be used to map disease genes

• LD decays with distance from the disease allele.

• By plotting LD, one can short list the region containing the disease gene.

[Figure: a SNP column (011001) shown against per-individual labels (DNNDDN), with an LD plot along the region.]

Page 6:

Multiple loci

• In complex diseases, multiple loci interact to confer disease susceptibility

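[Figure: two SNP columns (001001 and 011000) shown against per-individual labels (DNNDDN), with an LD plot along the region.]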

Page 7:

Testing for multiple loci

• Assume a SNP matrix with n individuals and m loci. Testing all sets of 5 SNPs requires on the order of $\binom{m}{5} \cdot n$ computations.

• Can you come up with computational strategies that speed this up?

Page 8:

Speeding up multiple locus computations

• A filtering strategy?
• Input: a SNP matrix with one or more pairs that interactively associate
• Output: a set of SNP pairs that includes the interacting pair(s).
• Method should be fast, and should NOT consider all pairs.

Page 9:

Speeding up the computations

• Correlated SNPs should also have low Hamming distance.
• Random SNPs should have high Hamming distance.
• Strategy: select k individuals at random.
– Hash each SNP column restricted to those k individuals
– Correlated SNPs should fall in the same bin with high probability

[Figure: example SNP columns 110011, 001001, and 101011, hashed with k=2.]
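The hashing strategy on this slide can be sketched as follows. This is one possible reading of the slide, not the instructor's implementation: each SNP column is keyed by its values on k randomly sampled individuals, and only SNPs that share a key are kept as candidate pairs. The function name `candidate_pairs`, the `rounds` parameter, and the example columns are assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations


def candidate_pairs(snp_columns, k, rounds, seed=0):
    """Return SNP index pairs that share a hash bucket in at least one round.

    snp_columns: list of equal-length 0/1 strings, one per SNP (a column of
    the SNP matrix, read across the n individuals).
    """
    rng = random.Random(seed)
    n = len(snp_columns[0])
    pairs = set()
    for _ in range(rounds):
        sample = rng.sample(range(n), k)            # k individuals at random
        buckets = defaultdict(list)
        for idx, col in enumerate(snp_columns):
            key = "".join(col[i] for i in sample)   # SNP restricted to the sample
            buckets[key].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))  # only within-bucket pairs
    return pairs


# Hypothetical columns: SNPs 0 and 1 are identical, SNP 2 is their complement.
columns = ["110011", "110011", "001100"]
found = candidate_pairs(columns, k=2, rounds=5)
```

Pairs are still enumerated inside each bucket, so the approach only pays off when buckets stay small; larger k gives smaller buckets but makes it more likely that a noisy correlated pair is split, which is why several rounds are used here.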

Page 10:

Project 2: mtDNA phylogeny

• In the absence of recombination, the history of mitochondrial DNA can be expressed by a tree.

• The goal of this project is to build a robust phylogeny using a heuristic modification of the perfect phylogeny.

Page 11:

The Genographic project

• The genographic project aims to trace geographic origins of the human race using mitochondrial DNA.

https://www3.nationalgeographic.com/genographic/atlas.html

Page 12:

Without recurrent mutations

• Unique tree can explain the evolutionary history

[Figure: tree rooted at r with leaves A, B, C, D, E; edges are labeled with mutations 1 to 5.]

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

Page 13:

With recurrent mutations

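• Adding another individual F destroys perfect phylogeny
• Why?
• It is not so easy to place F
• Can you suggest a strategy?

[Figure: the same tree rooted at r with leaves A, B, C, D, E and a candidate placement for F; mutation labels 1 and 2 appear again near F.]

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
F   0 1 0 0 0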
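A small check makes the "Why?" concrete. The sketch below assumes state 0 is ancestral (the root r is all zeros) and uses the standard pairwise condition for that case: a perfect phylogeny exists only if, for every pair of columns, the sets of individuals carrying state 1 are disjoint or nested. The function name `conflicting_columns` is an assumption; the matrices are the ones on the two slides above.

```python
from itertools import combinations


def conflicting_columns(matrix):
    """Return pairs of columns (1-based) that rule out a perfect phylogeny.

    matrix: dict mapping individual name -> list of 0/1 states, with 0 taken
    to be the ancestral (root) state. Two columns conflict when their sets of
    carriers overlap without one containing the other.
    """
    m = len(next(iter(matrix.values())))
    carriers = [{name for name, row in matrix.items() if row[c] == 1}
                for c in range(m)]
    conflicts = []
    for a, b in combinations(range(m), 2):
        sa, sb = carriers[a], carriers[b]
        if sa & sb and not (sa <= sb or sb <= sa):
            conflicts.append((a + 1, b + 1))
    return conflicts


snps = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 1, 0],
    "D": [0, 0, 1, 0, 1],
    "E": [1, 0, 0, 0, 0],
}
without_f = conflicting_columns(snps)
with_f = conflicting_columns({**snps, "F": [0, 1, 0, 0, 0]})
```

Without F no pair of columns conflicts. With F, columns 1 and 2 conflict: A and C carry both mutations, E carries only mutation 1, and F carries only mutation 2, so one of the two mutations must have occurred more than once or been reverted.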

Page 14:

Tests of Selection

• In class, we have discussed alleles that can be selectively neutral, or under active selection
– Active selection may be positive or negative

• How do we identify regions under positive, or negative selection?

• Balancing selection: sometimes it is helpful for a population to

Page 15:

Adaptive Selection

• Selection leads to loss of heterozygosity (will be explained in detail in the next lecture).

• Can you come up with a test for selection?
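One possible starting point, offered only as a sketch and not as a statistical test in its own right, is to scan for windows with unusually low heterozygosity. The code below uses the expected heterozygosity 2p(1-p) per bi-allelic site and averages it over sliding windows; the function names, the window width, and the data are all hypothetical.

```python
def site_heterozygosity(column):
    """Expected heterozygosity 2p(1-p) for one bi-allelic site (0/1 string)."""
    p = column.count("1") / len(column)
    return 2 * p * (1 - p)


def window_heterozygosity(columns, width):
    """Mean heterozygosity over each window of `width` consecutive sites."""
    h = [site_heterozygosity(c) for c in columns]
    return [sum(h[s:s + width]) / width for s in range(len(h) - width + 1)]


# Hypothetical data: the sites at index 2 and 3 are fixed in the sample.
sites = ["0101", "0011", "1111", "1111", "0110"]
profile = window_heterozygosity(sites, 2)
lowest = min(range(len(profile)), key=profile.__getitem__)
```

A real test would still need a null model to decide how low is "unusually low" for a neutral region of the same size.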

Page 16:

Balancing selection

• Sometimes both alleles are useful in a population, and it helps to have both around

• A simple example is when diversity is important (the two variants help maintain diversity)

• Bipolar disorder genes could be under balancing selection
– High creativity, which might confer some selective/reproductive advantage.
– Depression offers a disadvantage
• If so, the tests for this disorder might be tricky.
• How can we identify regions under balancing selection?

Page 17:

Testing for Balancing Selection

• Adaptive selection leads to loss of heterozygosity (will be explained in detail in the next lecture).

• Balancing selection leads to two dominant haplotypes
• Can you come up with a test for balancing selection?
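As a first step toward such a test, one could summarize how much of a window is explained by its two most common haplotypes. The sketch below only computes that summary; the function name `dominant_haplotypes` and the sample window are hypothetical, and deciding what counts as significant is left open.

```python
from collections import Counter


def dominant_haplotypes(haplotypes):
    """Return [(haplotype, frequency), ...] for the two most common haplotypes."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return [(h, c / n) for h, c in counts.most_common(2)]


# Hypothetical window: two haplotypes at intermediate frequency plus one rare one.
window = ["0011", "0011", "0011", "1100", "1100", "1100", "0101", "0011"]
top = dominant_haplotypes(window)
combined = sum(freq for _, freq in top)
```

A high combined share with both frequencies well away from 0 and 1 matches the pattern described on the slide, whereas a single dominant haplotype looks more like the loss of heterozygosity described for adaptive selection.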

Page 18:

Project: Primer design for cancer genomics

Page 19:

The Science behind Gleevec

• Fusions
– observed in leukemia, lymphoma, and sarcomas
• “Philadelphia Translocation”
– Drugs target this fusion protein

Page 20:

Fluorescence in situ hybridization

• Cancer genomes show extensive structural variation

Page 21:

Assaying for tumor variants

• Most tumors start off with a single cell, which then proliferates.

• Drugs like Gleevec are used well after cancer has taken hold.

• Can we detect the cancer early by detecting the genomic abnormality?

– If a very few cells in the person are cancerous, can we still detect it?

• Can we track a patient through his treatment?

Page 22:

Cancer genomics

• In cancers, large genetic changes can occur, including deletions, inversions, and rearrangements of genomes

• In the early stages, only a few cells will show this

deletion

Page 23:

Polymerase Chain Reaction

• PCR is a technique for amplifying and detecting a specific portion of the genome

• Amplification takes place if the primers are an ‘appropriate’ distance apart (<2 kb)

Page 24:

Assaying for Rare Variants

• PCR can be used to assay for a given genomic abnormality, even in a heterogeneous population of tumor and normal cells

[Figure: extract genomic DNA, then PCR, then detection. Labels: "Distance too large for amplification" and "Tumor cell".]

Page 25:

Variant Variants

• What if the variant is the minority in the cell population?

• What if deletion boundaries are uncertain?

[Figure: deletions in Patient A, Patient B, and Patient C.]

Page 26:

Observed variation in deletion size

Sizes of homozygous deletions in cell lines from different human cancers (scale is in megabases).

Page 27:

Primer Approximation Multiplex PCR (PAMP)*

• Multiple primers are optimally spaced, flanking a breakpoint of interest
– Upstream of breakpoint, forward primers
– Downstream of breakpoint, reverse primers
• The primers are run in a multiplex PCR reaction
– Any pair can form a viable product

[Figure: deletions in Patient B and Patient C.]

Page 28:

Experimental Design (500Kb region)

• 10 sets of 25 primers, upstream and downstream:
– 250 upstream
– 250 downstream

• Primer-pairs closest to breakpoint amplified

• Assay by oligo array

Goal: Computational selection of an ‘optimal’ primer set

Page 29:

Goal

• Input: a collection of primers
• Identify a subset of primers that do not cross-hybridize, are unique, yet cover the region completely
• Use combinatorial optimization, simulated annealing, integer linear programming…
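Of the options named on this slide, simulated annealing is the easiest to sketch. The version below is a simplified illustration under stated assumptions, not the PAMP design procedure: it handles primers on one side of the breakpoint only, treats cross-hybridization as a given set of forbidden pairs, and measures coverage as the length of the region lying more than `reach` bases downstream of the nearest selected primer. All names, parameters, and data are hypothetical.

```python
import math
import random


def uncovered(positions, selected, start, end, reach):
    """Length of [start, end) not within `reach` downstream of a selected primer."""
    pts = sorted(positions[i] for i in selected)
    if not pts:
        return end - start
    gaps = pts[0] - start
    for cur, nxt in zip(pts, pts[1:] + [end]):
        gaps += max(0, nxt - cur - reach)
    return gaps


def anneal(positions, conflicts, start, end, reach, steps=2000, seed=0):
    """Search for a conflict-free primer subset with small uncovered length."""
    rng = random.Random(seed)
    current, cur_cost = set(), end - start
    best, best_cost = set(), cur_cost
    for step in range(steps):
        # Linear cooling, scaled to the size of typical cost changes.
        temp = reach * max(1e-3, 1.0 - step / steps)
        i = rng.randrange(len(positions))
        trial = current ^ {i}                  # toggle primer i
        if i in trial and any(frozenset((i, j)) in conflicts for j in current):
            continue                           # never admit a cross-hybridizing pair
        cost = uncovered(positions, trial, start, end, reach)
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = trial, cost
            if cost < best_cost:
                best, best_cost = set(trial), cost
    return best, best_cost


# Hypothetical instance: five candidate primers, primers 1 and 2 cross-hybridize.
positions = [0, 100, 200, 300, 400]
conflicts = {frozenset((1, 2))}
best, best_cost = anneal(positions, conflicts, start=0, end=500, reach=150)
```

Because conflicting additions are rejected outright, every state the search visits is a valid primer set; the annealing only trades off which primers to keep. An integer linear program would express the same constraints exactly, at the cost of needing a solver.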

Page 30:

Spectral Networks Algorithms for De Novo Interpretation of

Tandem Mass Spectra

Nuno Bandeira, Ph.D.

Department of Computer Science and Engineering, University of California, San Diego

ProtIG seminar series

September 21, 2007

Page 31:

Proteins and their modifications

Proteins are fundamental players in the regulation of biological processes.

[Figure: DNA and Proteins linked by arrows labeled "encodes for" and "regulate".]

Knowing proteins involves knowing many things. This dissertation focuses on:
- Identification
- Sequencing
- Post-translational modifications

Page 32:

Protein sequences and modifications

From a computational perspective, a protein can be represented as a string over a weighted alphabet:

Protein sequence: …AFSRLEMILGF…

Subsequences are called peptides (obtained via enzymatic digestion), e.g. AFSRL, SRLEMILGF, EMILG.

Amino acid   Mass
A            71
F            147
S            87
R            156
L            113
E            129
M            131
I            113
G            57

Modifications change amino acid masses (the modified residue is marked * here):

Mass(SRLEMILGF) = 1047
Mass(M) = 131
Mass(SRLEM*ILGF) = 1063
Mass(M*) = 147
Mass(*) = 16
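The weighted-alphabet view translates directly into code. The sketch below sums the integer residue masses from the table and applies a modification as a mass offset at one position; `peptide_mass` and the `mods` argument are illustrative names. With the rounded table values the unmodified peptide sums to 1046 rather than the 1047 shown on the slide, a difference consistent with the slide rounding after summing more precise masses; the 16-unit shift is the same either way.

```python
# Integer residue masses from the table above.
MASS = {"A": 71, "F": 147, "S": 87, "R": 156, "L": 113,
        "E": 129, "M": 131, "I": 113, "G": 57}


def peptide_mass(peptide, mods=None):
    """Sum residue masses; `mods` maps a 0-based position to a mass offset."""
    mods = mods or {}
    return sum(MASS[aa] + mods.get(i, 0) for i, aa in enumerate(peptide))


plain = peptide_mass("SRLEMILGF")
modified = peptide_mass("SRLEMILGF", mods={4: 16})  # +16 on the M at index 4
shift = modified - plain
```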

Page 33:

Nobel prize in chemistry, 2002

Page 34:

What is mass spectrometry?

http://nobelprize.org/chemistry/laureates/2002/chemadv02.pdf


Page 35:

Tandem Mass Spectrometry (MS/MS)

Protein sequence: …THISISAVERYLARGESAMPLEPRTEINSEQENCE…

Peptide: LARGE

[Figure: MS/MS spectrum of peptide LARGE; x-axis Mass (m/z), y-axis Intensity. Peaks are labeled with prefix masses (b: L, LA, LAR, LARG) and suffix masses (y: E, GE, RGE, ARGE), up to the parent mass (PM).]

Modification: any event that changes the mass at a specific site.

[Figure: MS/MS spectrum of the modified peptide LARG*E; same axes. The prefix and suffix masses that include G* (LARG*, G*E, RG*E, ARG*E) show mass shifts.]
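The prefix and suffix masses in the figure can be reproduced with running sums over the residue masses. The sketch below uses the integer masses from the earlier table and ignores the constant offsets that real b and y ions carry, so the numbers are residue-mass sums rather than observed m/z values. The +16 offset on G is a hypothetical choice for illustration; the slide does not say which modification G* stands for.

```python
from itertools import accumulate

# Integer residue masses for the residues of LARGE (from the earlier table).
MASS = {"L": 113, "A": 71, "R": 156, "G": 57, "E": 129}


def prefix_suffix_masses(peptide, mods=None):
    """Return (prefix masses, suffix masses) over all proper prefixes/suffixes."""
    mods = mods or {}
    residues = [MASS[aa] + mods.get(i, 0) for i, aa in enumerate(peptide)]
    prefixes = list(accumulate(residues))[:-1]
    suffixes = list(accumulate(reversed(residues)))[:-1]
    return prefixes, suffixes


b_plain, y_plain = prefix_suffix_masses("LARGE")
b_mod, y_mod = prefix_suffix_masses("LARGE", mods={3: 16})  # LARG*E, assumed +16

# Positions where the modified spectrum differs from the unmodified one.
b_shifted = [p != q for p, q in zip(b_plain, b_mod)]
y_shifted = [p != q for p, q in zip(y_plain, y_mod)]
```

Only the prefix LARG and the suffixes GE, RGE, and ARGE contain the modified residue, so those are the only peaks that move, which is the pattern the "Mass shifts" label in the figure points at.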

Page 36:

Example of a real MS/MS spectrum

[Figure: a real MS/MS spectrum; annotations include "Symmetric", b10, and y12.]

Page 37:

Tandem Mass Spectrometry (MS/MS)

[Figure: workflow. Proteins are broken into peptides by enzymatic digestion; tandem mass spectrometry of the peptides produces a large set of MS/MS spectra. Each spectrum is interpreted either by database search or by de novo sequencing; the example spectrum is identified as peptide SEQUENCE.]

Database of known peptides:

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, SEQUENCE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

Page 38:

Mixture spectra

Sometimes, the instrument generates a single spectrum from two or more peptides:

[Figure: a mixture spectrum containing peaks from Peptide A (NLAFFQLR) and Peptide B (ALDDILNLK).]

Page 39:

How to identify mixture spectra?

Page 40:

Proposed approach

• When identifying a mixture spectrum of peptides A,B, assume you have non-mixture spectra for the same peptides.

• Compare the non-mixture spectra of known peptides to putative mixture spectra to determine peptide identifications

Page 41:

Project description

• Implement an algorithm to identify mixture spectra from pairs of peptides by combining previously identified spectra from isolated peptides.

• Test the above implementation by simulating mixture spectra using an existing database of spectra from isolated peptides.

• Propose a scoring procedure to separate correct from false identifications.

Nuno Bandeira
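To make the project description concrete, here is a minimal sketch of the first two bullets under strong simplifying assumptions: spectra are dictionaries from integer mass bins to intensities, a mixture is simulated by adding two library spectra at equal abundance, and candidates are scored by cosine similarity. None of this is prescribed by the project; the function names, the library, and all peak values are made up for illustration.

```python
import math
from itertools import combinations


def cosine(s, t):
    """Cosine similarity between two spectra given as {mass_bin: intensity}."""
    dot = sum(v * t.get(k, 0.0) for k, v in s.items())
    norm = (math.sqrt(sum(v * v for v in s.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return dot / norm if norm else 0.0


def combine(s, t):
    """Simulate a mixture spectrum by adding the peaks of two spectra."""
    out = dict(s)
    for k, v in t.items():
        out[k] = out.get(k, 0.0) + v
    return out


def best_pair(query, library):
    """Return (score, (name_a, name_b)) for the library pair best matching query."""
    return max(
        (cosine(query, combine(library[a], library[b])), (a, b))
        for a, b in combinations(sorted(library), 2)
    )


# Hypothetical toy library; peak positions and intensities are invented.
library = {
    "NLAFFQLR": {114: 1.0, 185: 0.5, 332: 0.8},
    "ALDDILNLK": {72: 0.9, 185: 0.4, 300: 0.7},
    "SEQUENCE": {88: 1.0, 217: 0.6, 345: 0.3},
}
mixture = combine(library["NLAFFQLR"], library["ALDDILNLK"])
score, pair = best_pair(mixture, library)
```

This exhaustive search is quadratic in the library size and assumes both peptides contribute equally, so a real implementation would need a pre-filter and a way to handle unequal abundances. The third bullet, a scoring procedure that separates correct from false identifications, is left open here: the cosine score alone does not say where to set a threshold.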