Combinatorial Problems for Human Polymorphisms Giuseppe Lancia University of Udine.


A genome is a long string over the DNA alphabet {A,C,G,T} encoding a form of life

In humans it is some 3,000,000,000 letters long

DNA is responsible for our diversity as well as our similarity

All humans are 99% identical. Small changes in a genome can make a big difference, like from... to...

What makes us different from each other?

The answer is

POLYMORPHISMS

Polymorphisms

A polymorphism is a feature

- common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

E.g. think of eye-color

Or blood-type for a feature not visible from outside

At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

- SNPs are the predominant form of human variation

- Used for drug design, disease studies, forensics, evolutionary studies...

- On average one every 1,000 bases

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

atcggattagttagggcacaggacgt

GENOTYPE: “union” of 2 haplotypes

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact).

Call them X and O. Also, call ? the fact that a site is heterozygous

HAPLOTYPE: string over X,O

GENOTYPE: string over X,O,?
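The change of symbols can be illustrated with a minimal Python sketch. The function name `genotype` is an assumption made for illustration; it only encodes the convention stated above (a site is `?` exactly when the two haplotypes differ there).

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes over {X, O} into a genotype over {X, O, ?}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Homozygous site: keep the shared symbol. Heterozygous site: write '?'.
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


print(genotype("XXOXX", "XXXOX"))  # XX??X
print(genotype("OOOXX", "OOOXX"))  # OOOXX
```

Note that the map is not invertible: a genotype with two or more `?` sites can come from several different haplotype pairs, which is what makes haplotyping a problem.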

THE HAPLOTYPING PROBLEM

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome/indiv.), under different objective functions

For the individual problem, the input is erroneous haplotype data, from sequencing

For the population problem, the input is ambiguous genotype data, from screening

The objective is led by Occam’s razor: find a minimum explanation of the observed data under a given hypothesis (a.k.a. parsimony principle)

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)

- Polynomial algorithms for bounded-length gapped haplotyping (Bafna, L, Istrail, Rizzi 02)

- NP-hardness for general gapped haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01)

Population

- Parsimony (Gusfield 03, L, Rizzi, Pinotti 02)

- Clark’s rule: APX-hardness and I.P. approach (Gusfield 00 & 01)

- Formulations for Disease Detection (L, Pesole 02)

- Polynomial algorithm for perfect phylogeny (Bafna, Gusfield, L, Yooseph 02)

The Single-Individual Haplotyping Problem

TGAGCCTAG GATTT GCCTAG CTATCTT

ATAGATA GAGATTTCTAGAAATC ACTGA

TAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation, sequencing, assembly

Shotgun Assembly of a Chromosome [Weber and Myers, 1997]

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

ERROR SOURCES

Sequencing errors:

ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT CATTGGAAC AATGGAACGGA

Paralogous regions:

ACAAACCCTTTGGGACT … CTAGTAAACCCTATGGGGA AAACCCTT TAAACCCT CTATGGGA CCTATGG CTTTGGGACT ACCCTATGGG

Given errors (sequencing errors and/or paralogous regions), the data may be inconsistent with exactly 2 haplotypes

Hence, the assembler is unable to build 2 chromosomes

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes

ACTGAAAGCGA ACTAGAGACAGCATGACTGATAGC GTAGAGTCAACTG TCGACTAGA CATGACTGA CGATCCATCG TCAGCACTGAAA ATCGATC AGCATG

X X O O O X X X X X O

The data: a SNP matrix (rows: fragments 1,..,m; columns: SNPs 1,..,n)

      1 2 3 4 5 6 7 8 9
  1   - - - O X X O O -
  2   - O - O X - - - X
  3   X X O X X - - - -
  4   O O X - - - - O -
  5   - - - - - - - X O
  6   - - - - O O O X -

Fragment conflict: two fragments that disagree at a SNP covered by both can’t be on the same haplotype

Fragment Conflict Graph GF(M): one node per fragment, one edge per pair of conflicting fragments

We have 2 haplotypes iff GF is BIPARTITE

PROBLEM (Fragment Removal): remove the fewest fragments so that GF becomes bipartite
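As a minimal sketch of the fragment conflict graph, the Python below builds GF(M) for the 6 x 9 SNP matrix shown earlier and tests bipartiteness by breadth-first 2-coloring. The names `in_conflict`, `conflict_graph` and `two_coloring` are illustrative assumptions, not taken from the slides.

```python
from collections import deque
from itertools import combinations

# SNP matrix from the slides: one row per fragment, '-' marks an uncovered SNP.
M = {
    1: "---OXXOO-",
    2: "-O-OX---X",
    3: "XXOXX----",
    4: "OOX----O-",
    5: "-------XO",
    6: "----OOOX-",
}


def in_conflict(f: str, g: str) -> bool:
    """Two fragments conflict if they disagree at a SNP covered by both."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def conflict_graph(matrix: dict) -> dict:
    """Adjacency sets of GF(M): nodes are fragments, edges are conflicts."""
    graph = {i: set() for i in matrix}
    for i, j in combinations(matrix, 2):
        if in_conflict(matrix[i], matrix[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def two_coloring(graph: dict):
    """Return a 0/1 coloring if the graph is bipartite, otherwise None."""
    color = {}
    for start in graph:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found
    return color


print(two_coloring(conflict_graph(M)))  # None: fragments 1, 3, 6 form a triangle
without_6 = {i: row for i, row in M.items() if i != 6}
print(two_coloring(conflict_graph(without_6)))  # fragments 1, 2, 4 vs 3, 5
```

The two color classes returned for the reduced matrix are the two groups of fragments, one per haplotype.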

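After removing fragment 6, the remaining fragments split into two groups, each consistent with one haplotype:

      1 2 3 4 5 6 7 8 9
  1   - - - O X X O O -
  2   - O - O X - - - X
  4   O O X - - - - O -
      O O X O X X O O X   (haplotype 1)

  3   X X O X X - - - -
  5   - - - - - - - X O
      X X O X X - - X O   (haplotype 2)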

Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph

- NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]

- O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]

- not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
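To make the problem statement concrete, here is a hedged brute-force sketch of minimum fragment removal. It tries removal sets of increasing size, so it is exponential and only suitable for tiny inputs; it is not one of the algorithms from the slides. All function names are assumptions.

```python
from itertools import combinations


def conflict(f: str, g: str) -> bool:
    """Fragments conflict if they disagree at a SNP covered by both."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_bipartite(frags: dict) -> bool:
    """2-color the conflict graph of the given fragments by depth-first search."""
    color = {}
    for s in frags:
        if s in color:
            continue
        color[s] = 0
        stack = [s]
        while stack:
            u = stack.pop()
            for v in frags:
                if v != u and conflict(frags[u], frags[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return False
    return True


def min_fragment_removal(matrix: dict) -> set:
    """Smallest set of fragments whose removal leaves a bipartite conflict graph."""
    for size in range(len(matrix) + 1):
        for removed in combinations(matrix, size):
            kept = {i: r for i, r in matrix.items() if i not in removed}
            if is_bipartite(kept):
                return set(removed)
    return set(matrix)


M = {1: "---OXXOO-", 2: "-O-OX---X", 3: "XXOXX----",
     4: "OOX----O-", 5: "-------XO", 6: "----OOOX-"}
print(len(min_fragment_removal(M)))  # 1
```

On this matrix one removal suffices, and the optimum is not unique: removing fragment 6 (as in the slides) or fragment 3 both leave a bipartite graph.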

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---OXXOO---OXOOX--- gap

---OXXOOXOXOXOOX--- gapless

---OXX--XO----OX--- 2 gaps
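A gap is a maximal run of `-` strictly inside a fragment; leading and trailing `-` only say where the fragment starts and ends. The short sketch below counts gaps under that reading (the helper name `count_gaps` is an assumption).

```python
import re


def count_gaps(fragment: str) -> int:
    """Number of maximal runs of '-' between the first and last called SNP."""
    core = fragment.strip("-")  # drop the uncovered ends
    return len(re.findall(r"-+", core))


for row in ("---OXXOO---OXOOX---", "---OXXOOXOXOXOOX---", "---OXX--XO----OX---"):
    print(row, count_gaps(row))
```

A matrix is gapless when every row has gap count 0.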

Why gaps?

Sequencing errors (don’t call with low confidence)

---OOXX?XX--- ===> ---OOXX-XX---

Celera’s mate pairs

attcgttgtagtggtagcctaaatgtcggtagaccttga

THEOREM

For a gapless M, the Min Fragment Removal Problem is Polynomial

An O(nm + n^3) D.P. algorithm

  1   - O O X X O O - -
  2   - - X O X X O - -
  3   - - - X X O - - -
  4   - - - - O O X O -
  5   - - - - - X O X O

LFT(i), RGT(i): leftmost and rightmost SNP covered by fragment i

Rows are sorted according to LFT

D(i; h,k) := min # of fragments removed to solve up to row i, with h, k not removed and put in different haplotypes, and maximizing RGT(h), RGT(k)

D(i; h,k) =
    D(i-1; h,k)        if i, k compatible and RGT(i) <= RGT(k), or i, h compatible and RGT(i) <= RGT(h)
    1 + D(i-1; h,k)    otherwise

OPT is the min over h,k of D(n; h,k) and can be found in time O(nm^2 + n^3)

WITH GAPS…

Th: NP-Hard if 2 gaps per fragment

proof: (simple) use the fact that for every G there is M s.t. G = GF(M) and reduce from Max Bipartite Induced Subgraph on 3-regular graphs

Th: NP-Hard even if 1 gap per fragment

proof: technical. Reduction from MAX2SAT

But, gaps must be long for the problem to be difficult.

We have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gaps length L

What for MFR with long gaps? ILP (x_f = 1 iff fragment f is removed):

    min   sum_f x_f
    s.t.  sum_{f in C} x_f >= 1    for all odd cycles C
          x in {0,1}^n

The LP relaxation can be solved in polynomial time

Randomized rounding heuristic: round and repeat. Worked well at Celera

Fragment removal is good for getting rid of contaminants.

However, we may want to keep all fragments and correct errors otherwise

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes

All fragments get assigned to one of the two haplotypes. We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.

      1 2 3 4 5 6 7 8 9
      - - - O X X O O -
      - O X O X - - - X
      X X O X X - - - -
      O O X - - - O O -
      - - - - - - X X O
      - - - - O O O X -

SNP conflicts

SNP conflict graph GS(M): 1 node for each SNP (column), edge between conflicting SNPs

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal problem is polynomial

PROOF of THEOREM 1 (sketch): by minimal counterexample

Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.

--OOXXOO-------------OOXOOXOXXO-----------XXOXOXXX-----XXOOXOXXO-----------XOOOX-----------XXXXXO-------XXOXXOXOO------

Take an odd cycle in GF. There is a generic structure of hor-vert cycle, with “vertical lines”:

--O?X???-------------O????????O-----------??O??X??-----??????X??-----------???O?-----------????X?-------X???????O------

There cannot be only one vertical line in an odd cycle

We merge the rightmost and the next to reduce them by 1

Hence, there cannot be a minimal (in n. of vertical lines) counterexample

Must be X:

--O?X???-------------O?????X??O-----------??O??X??-----??????X??-----------???O?-----------????X?-------X???????O------

Merge the rightmost lines:

--O?X???-------------O?????X--------------??O----------??????X-------------???O------------????X--------X???????O------

Still a counterexample!

Note: Theorem 1 is not true if there are gaps

      1 2 3
  1   O - O
  2   - O X
  3   X X -

[Figure: GF(M) and GS(M) for this matrix]
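The 3 x 3 example can be checked mechanically. The slides do not spell out when two SNPs conflict, so the sketch below assumes the definition used in the cited literature: columns c1 and c2 conflict if two fragments cover both SNPs and agree on exactly one of them. Function names are illustrative.

```python
from itertools import combinations


def fragment_conflicts(rows):
    """Pairs of rows (1-based) that disagree at a SNP covered by both."""
    out = []
    for (i, f), (j, g) in combinations(enumerate(rows, start=1), 2):
        if any(a != "-" and b != "-" and a != b for a, b in zip(f, g)):
            out.append((i, j))
    return out


def snp_conflicts(rows):
    """Pairs of columns (1-based) in conflict, under the assumed definition."""
    out = []
    for c1, c2 in combinations(range(len(rows[0])), 2):
        # Only fragments covering both SNPs can witness a conflict.
        full = [r for r in rows if r[c1] != "-" and r[c2] != "-"]
        if any((p[c1] == q[c1]) != (p[c2] == q[c2])
               for p, q in combinations(full, 2)):
            out.append((c1 + 1, c2 + 1))
    return out


M = ["O-O", "-OX", "XX-"]
print(fragment_conflicts(M))  # [(1, 2), (1, 3), (2, 3)]
print(snp_conflicts(M))       # []
```

Under this assumption GF(M) is a triangle, hence not bipartite, while GS(M) has no edges at all: with gaps, an independent GS(M) no longer guarantees a bipartite GF(M).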

Gapless MSR is polynomial (max stable set on a perfect graph).

We have better D.P. algorithms, O(mn + m^2)

What if gaps?

THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)

Again, gaps must be long for the problem to be difficult.

We have an O(mn^(2L+1) + n^(2L+2)) D.P. for MSR on a matrix with total gaps length L

Population Haplotyping Problems

The input is GENOTYPE data

INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

Each genotype is explained by two haplotypes

  xxoxx + xxxox = xx??x
  oooxx + xxxox = ????x
  xxoxx + oxxox = ?x??x
  xxoxx + xxoxx = xxoxx
  oooxx + oooxx = oooxx
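A small check that a haplotype set H explains a genotype set G can be sketched as follows. The helper names `combine` and `explains` are assumptions; the data is the INPUT/OUTPUT pair from the slide.

```python
from itertools import combinations_with_replacement


def combine(h1: str, h2: str) -> str:
    """Genotype produced by two haplotypes: '?' wherever they differ."""
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


def explains(haplotypes, genotypes) -> bool:
    """True if every genotype is the combination of two haplotypes of H.

    A haplotype may be paired with itself (homozygous individual).
    """
    pairs = list(combinations_with_replacement(sorted(haplotypes), 2))
    return all(any(combine(a, b) == g for a, b in pairs) for g in genotypes)


G = ["xx??x", "????x", "xxoxx", "?x??x", "oooxx"]
H = {"xxoxx", "xxxox", "oooxx", "oxxox"}
print(explains(H, G))               # True
print(explains(H - {"oxxox"}, G))   # False: ?x??x is left unexplained
```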

We will define some objectives for H

- 1st Objective (parsimony): minimize |H|

- 2nd Objective: based on Clark’s inference rule

- 3rd Objective: solution fits a phylogeny

- 4th Objective: disease detection

1st Objective (parsimony): minimize |H|

An easy approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, for n genotypes, we need at least about sqrt(2n) haplotypes, i.e. LB = Ω(sqrt(n)).

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. within a factor 2n / LB = O(sqrt(n)) of the optimum.

It’s hard to come up with better approximations (Lancia, Pinotti, Rizzi ’02):

THEOREM: Assuming each genotype has at most k symbols “?”, there is a 2^(k-1)-approximation algorithm

THEOREM: The parsimony haplotyping problem is APX-hard
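The counting argument can be turned into a few lines of Python. This is only a sketch of the two bounds, not of the cited approximation algorithm. Since k(k-1)/2 counts pairs of distinct haplotypes, the lower bound is applied here to the ambiguous genotypes only, which is an assumption about how the bound is meant; the function names are also assumptions.

```python
def lower_bound(n: int) -> int:
    """Smallest k with k(k-1)/2 >= n, i.e. roughly sqrt(2n)."""
    k = 0
    while k * (k - 1) // 2 < n:
        k += 1
    return k


def trivial_solution(genotypes):
    """Explain each genotype on its own: '?' -> x in one haplotype, o in the other."""
    H = set()
    for g in genotypes:
        H.add(g.replace("?", "x"))
        H.add(g.replace("?", "o"))
    return H


G = ["xx??x", "????x", "xxoxx", "?x??x", "oooxx"]
ambiguous = [g for g in G if "?" in g]
print(lower_bound(len(ambiguous)))   # 3
print(len(trivial_solution(G)))      # 6, never more than 2 * len(G)
```

On this instance the optimum shown in the slides uses 4 haplotypes, which sits between the two bounds.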

2nd Objective based on inference rule:

Inference Rule

  xoxxooxoxx +     known haplotype h
  ********** =     unknown haplotype
  x??xoox?x?       known (ambiguous) genotype g

  xoxxooxoxx +     known haplotype h
  xxoxooxxxo =     new (derived) haplotype h’
  x??xoox?x?       known (ambiguous) genotype g

We write h + h’ = g

g and h must be compatible to derive h’

2nd Objective (Clark, 1990)

1. Start with H = nonambiguous genotypes
2. while exists ambiguous genotype g in G
3.     take h in H compatible with g and let h + h’ = g
4.     set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Example with genotypes oooo, xooo, ??oo, xx??:

- derive xxoo (from oooo and ??oo), then xxxx (from xxoo and xx??): SUCCESS

- derive oxoo (from xooo and ??oo): FAILURE (can’t resolve xx??)

OBJ: find the order of application of the rule that leaves the fewest elements in G
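The inference rule and its non-determinism can be sketched in Python. The tie-breaking callback `pick` is an assumption introduced to make step 3 explicit; it is not part of Clark's rule, and the sketch does not attempt to find the best order.

```python
FLIP = {"x": "o", "o": "x"}


def derive(h: str, g: str):
    """Return h' such that h + h' = g, or None if h is not compatible with g."""
    out = []
    for a, b in zip(h, g):
        if b == "?":
            out.append(FLIP[a])   # heterozygous site: h' takes the other allele
        elif a == b:
            out.append(a)         # homozygous site: h' agrees with h
        else:
            return None           # h contradicts g
    return "".join(out)


def clark(genotypes, pick):
    """Apply the inference rule until no ambiguous genotype can be resolved.

    `pick` chooses one haplotype among the compatible candidates (step 3).
    Returns the haplotype list H and the list of unresolved genotypes.
    """
    H = [g for g in genotypes if "?" not in g]
    G = [g for g in genotypes if "?" in g]
    progress = True
    while G and progress:
        progress = False
        for g in list(G):
            candidates = [h for h in H if derive(h, g) is not None]
            if candidates:
                new = derive(pick(candidates), g)
                if new not in H:
                    H.append(new)
                G.remove(g)
                progress = True
    return H, G


genos = ["oooo", "xooo", "??oo", "xx??"]
print(clark(genos, lambda c: c[0]))    # resolves everything: SUCCESS
print(clark(genos, lambda c: c[-1]))   # leaves xx?? unresolved: FAILURE
```

Choosing oooo for ??oo yields xxoo and then xxxx; choosing xooo yields oxoo, after which no haplotype in H is compatible with xx??.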

- Problem is APX-hard (Gusfield, 00)

- Graph-Model + Integer Programming for practical solution (Gusfield, 01)

1. expand genotypes (e.g. x??o?)

2. create arc (h, h’) if there exists g s.t. h’ can be derived from g and h

3. largest number of nodes in a forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved

Hence, find the largest number of nodes in a forest rooted at unambiguous genotypes. Use I.P. model with vars x(ij).

3rd objective: haplotyping for perfect phylogeny

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

[Figure: example tree with edges labeled “sleeps > 10hrs / day”, “does research”, “starves” and leaves Undergrad, PhD student, Assistant professor, Associate professor, Full professor]

Theorem: such a matrix has a perfect phylogeny iff it has no 4x2 minor with rows 00, 10, 01, 11

                    sleeps   researches   starves
  undergraduate       1          0           0
  phd                 0          1           1
  assistant prof.     0          1           0
  assoc. prof.        0          0           0
  full prof.          1          0           0
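The theorem gives a direct test: look at every pair of columns and check whether all four combinations 00, 10, 01, 11 occur. The sketch below applies it to the feature matrix of the example; the function name is an assumption, and the quadratic loop over column pairs is the naive check, not an optimized algorithm.

```python
from itertools import combinations


def has_perfect_phylogeny(matrix) -> bool:
    """True iff no pair of columns shows all four patterns 00, 01, 10, 11."""
    n_cols = len(matrix[0])
    for c1, c2 in combinations(range(n_cols), 2):
        patterns = {(row[c1], row[c2]) for row in matrix}
        if len(patterns) == 4:
            return False  # forbidden 4x2 minor found
    return True


# Columns: sleeps, researches, starves (rows as in the table above).
features = [
    (1, 0, 0),  # undergraduate
    (0, 1, 1),  # phd
    (0, 1, 0),  # assistant prof.
    (0, 0, 0),  # assoc. prof.
    (1, 0, 0),  # full prof.
]
print(has_perfect_phylogeny(features))                # True
print(has_perfect_phylogeny(features + [(1, 1, 0)]))  # False
```

The extra row in the second call is hypothetical: someone who both sleeps and does research adds the pattern 11 to the first two columns and completes the forbidden minor.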

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

  O X ? O
  ? X O ?
  ? O ? O

  O X O O
  O X X O
  X X O X
  O X O O
  X O O O
  O O X O

NOT a perfect phylogeny solution !

  O X ? O
  O X O ?
  O O O ?

  O X O O
  O X X O
  O X O O
  X X O X
  O O O O
  O O O X

A perfect phylogeny

• Main ideas for an algorithm:

1. Companion columns: have a ? ? on a row

    ? O ? O
    O X ? ?

All ?? pairs on companion columns must be expanded in the same way:

    O O        X O
    X X   or   O X

so we can talk of pairs of columns being equated or negated

2. Forcing patterns:

    O X        ? X
    X O        X O
    ? ?        ? ?

Forced columns: must be equated or negated in all solutions

The most interesting forcing pattern is

    ? a
    b ?

forcing for all a, b in {O, X}.

    O a
    X a
    b O
    b X

Let PF be forced pairs and PN be non-forced pairs of companion columns. Define a graph G, with edges in PF U PN

The following is the key theorem to describe edges from PN. While there can be arbitrarily long cycles of forced pairs, if in a cycle there is one unforced pair, then there must be a shortcut (smaller cycle)

MAIN THEOREM (weak triangulation): every cycle in G of length > 3 that has an edge from PN, has a chord.

Theorem: we can find a solution to PP in polynomial time (O(m n^2))

( Bafna, Gusfield, L, Yooseph 2002 )

The algorithm is quite involved.

We find, for each pair of companion columns, if they must be equated or negated.

This is done on connected components in the graph induced by the edges in PF

Edges of PN are used to “jump from a component to another”…

Open problem: can we find a solution to PP in O(m n) time?

4th Objective: Disease Detection

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

  xxoxx + xxxox = xx??x
  oooxx + xxxox = ????x
  xxoxx + oxxox = ?x??x
  xxoxx + oooxx = ??oxx
  oooxx + oooxx = oooxx

H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy one has none

minimize |H|
