Alessandra Godi


description

Solving Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation. Alessandra Godi, IASI (CNR), Roma. Martine Labbé, Université Libre de Bruxelles. Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007.

Transcript of Alessandra Godi

Page 1: Alessandra Godi

Alessandra Godi

IASI (CNR), Roma

Martine Labbé

Université Libre de Bruxelles

Solving Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation

Airo Winter 2007 - Cortina d’Ampezzo, February 5th-9th, 2007

Page 2: Alessandra Godi

The alphabet of life…

Base pairs (A-T, G-C) are complementary

DNA structure=Double Helix (Watson-Crick)

Basic unit = nucleotide: sugar, phosphate, base (A, G, T, C)

Page 3: Alessandra Godi

Humans have 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes.

Each chromosome includes hundreds of different genes.

In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes.

Human Chromosomes

Page 4: Alessandra Godi

[Figure: father, mother and children, with chromosome copies labelled CP1, CP2 (father), CM1, CM2 (mother) and CM, CP (children).]

Human Chromosomes

Page 5: Alessandra Godi

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGAT

AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

Chromosomes

Page 6: Alessandra Godi

Chromosomes

A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype.

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect.

Genotype data is easy to collect.

Page 7: Alessandra Godi

All humans are 99.99% identical.

Diversity? Polymorphism.

A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

SNPs

Page 8: Alessandra Godi

SNP (Single Nucleotide Polymorphism)

[Figure: six aligned chromosome copies. The segments AATATATCG, TCCGTATACCTA, GGGGTGTGTGTAC, TGCTAGCACGCG and TGTGTAATATACG are identical in all six copies, with four single-nucleotide columns (SNPs) that vary between copies.]

Page 9: Alessandra Godi

SNP (Single Nucleotide Polymorphism)

[Figure: the four SNP columns of the six chromosome copies, shown without the conserved segments.]

Page 10: Alessandra Godi

SNP (Single Nucleotide Polymorphism)

[Figure: the four SNP columns of the six chromosome copies.]

             SNP 1         SNP 2         SNP 3       SNP 4
Haplotype 1: A             G             A           C
Haplotype 2: T             T             A           C
Genotype:    A/T           T/G           A           C
             Heterozygous  Heterozygous  Homozygous  Homozygous

Page 11: Alessandra Godi

SNP: encoding

[Figure: the four SNP columns of the six chromosome copies, each recoded as a binary string.]

SNP 1: 011000
SNP 2: 100011
SNP 3: 110000
SNP 4: 000111

Haplotype 1: 0 1 1 0
Haplotype 2: 1 0 1 0
Genotype: 0/1 1/0 1 0
Encoded genotype: 2 2 1 0

Page 12: Alessandra Godi

Haplotyping of a population

Given a set of genotypes G (strings in $\{0,1,2\}^n$), find a set of “generating” haplotypes H (strings in $\{0,1\}^n$).

[Figure: genotype of each individual.]

Page 13: Alessandra Godi

The GENOME is the set of genetic information which lies in the DNA sequence of each living organism.

The DNA sequence is a linear arrangement of four different molecules, called nucleotides or bases: A, T, C, G.

The bases are paired with each other by hydrogen bonds.

Page 14: Alessandra Godi

DNA accounts for the differences between individuals of the same species.

What makes us different from each other is called polymorphism.

Page 15: Alessandra Godi

At the DNA level, a polymorphism is a nucleotide sequence that varies within a chromosome population:

[Figure: aligned chromosome sequences such as atcagattagttagggcacaggacggac and atccgattagttagggcacaggacgtac, which differ at two single-nucleotide sites (a/c and g/t).]

Single Nucleotide Polymorphism (SNP)

Page 16: Alessandra Godi

Single Nucleotide Polymorphism (SNP)

[Figure: aligned chromosome sequences differing at two SNP sites (a/c and g/t).]

At the DNA level, a polymorphism is a nucleotide sequence that varies within a chromosome population.

Page 17: Alessandra Godi

[Figure: aligned chromosome sequences differing at two SNP sites (a/c and g/t).]

HOMOZYGOUS: same allele on both chromosomes

Page 18: Alessandra Godi

[Figure: aligned chromosome sequences differing at two SNP sites (a/c and g/t).]

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

Page 19: Alessandra Godi

[Figure: aligned chromosome sequences differing at two SNP sites (a/c and g/t).]

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

Page 20: Alessandra Godi

[Figure: aligned chromosome sequences differing at two SNP sites (a/c and g/t).]

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPES: chromosome at SNP level

Page 21: Alessandra Godi

ag  at
ct  cg
ct  ag
at  at
ag  cg

HAPLOTYPES: chromosome at SNP level

HETEROZYGOUS: different alleles

HOMOZYGOUS: same allele on both chromosomes

Page 22: Alessandra Godi

GENOTYPES: “union” of two haplotypes

Haplotype pair   Genotype
ag  at           Oa E
ct  cg           Oc E
ct  ag           E  E
at  at           Oa Ot
ag  cg           E  Og

(O: homozygous site, followed by its allele; E: heterozygous site)

HAPLOTYPES: chromosome at SNP level

HETEROZYGOUS: different alleles

HOMOZYGOUS: same allele on both chromosomes

Page 23: Alessandra Godi

Haplotype pair   Genotype
ag  at           Oa E
ct  cg           Oc E
ct  ag           E  E
at  at           Oa Ot
ag  cg           E  Og

(O: homozygous site, followed by its allele; E: heterozygous site)

CODING: each SNP has only 2 possible values in a biological population. Let us call them ‘0’ and ‘1’. Moreover, let ‘2’ denote a heterozygous site.

Page 24: Alessandra Godi

CODING: each SNP has only 2 possible values in a biological population.

Haplotype pair   Genotype
01  00           02
10  11           12
10  01           22
00  00           00
01  11           21

$\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$

$0 \oplus 0 = 0$, $1 \oplus 1 = 1$, $0 \oplus 1 = 1 \oplus 0 = 2$
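The coding rule on this slide (equal alleles keep their value, different alleles give 2) can be sketched in a few lines of Python. The function name `combine` and the use of strings for haplotypes are assumptions made for illustration; the five pairs are the haplotype pairs read from the slide after removing the duplicated characters.

```python
def combine(h1: str, h2: str) -> str:
    """Genotype obtained from two haplotypes over the alphabet {0, 1}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value; different alleles give the code '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Haplotype pairs from the slide (hypothetical variable name).
pairs = [("01", "00"), ("10", "11"), ("10", "01"), ("00", "00"), ("01", "11")]
print([combine(a, b) for a, b in pairs])  # ['02', '12', '22', '00', '21']
```

The result reproduces the five genotype codes listed on the slide.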

Page 25: Alessandra Godi

HAPLOTYPING of a population

Given a set G of genotypes (strings in $\{0,1,2\}^n$), find a set of generator haplotypes H (strings in $\{0,1\}^n$).

[Figure: genotype of each individual.]
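To make "generator haplotypes" concrete, the following sketch checks whether a candidate set H generates every genotype in G, and reports one generating pair per genotype. It is an illustrative brute-force check, not part of the talk; the names `combine` and `explain` are assumptions, and the sample data are the two-SNP genotypes from the coding slide.

```python
from itertools import combinations_with_replacement


def combine(h1, h2):
    """Equal alleles keep their value; different alleles are coded as '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explain(genotypes, haplotypes):
    """Map each genotype to one generating pair taken from `haplotypes`,
    or to None when no pair of the given haplotypes produces it."""
    result = {}
    for g in genotypes:
        result[g] = next(
            ((a, b) for a, b in combinations_with_replacement(haplotypes, 2)
             if combine(a, b) == g),
            None,
        )
    return result


G = ["02", "12", "22", "00", "21"]
H = ["00", "01", "10", "11"]
print(explain(G, H))
```

A set H is a valid answer to the problem exactly when no genotype maps to None; the parsimony objective discussed later asks for the smallest such H.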

Page 26: Alessandra Godi

HAPLOTYPING of a population: State of the Art

Perfect Phylogeny (Bafna, Gusfield, Yooseph 02)

Estimation of haplotype frequencies (probabilistic studies: Fallin, Schork 00)

Parsimony Objective (Gusfield 02, Brown 05)

Page 27: Alessandra Godi

HAPLOTYPING of a population: Parsimony Objective (NP-hard)

Combinatorial methods (Gusfield 2002, Brown 2004, Lancia and Rizzi 2002):

Exponential and polynomial ILP formulations

Rule-based methods (HAPINFER, Clark 1990):

Starting from genotypes, haplotypes are inferred

Statistical methods (PHASE, Stephens 2004; HAPLOTYPER, Niu 2001; GERBIL, Shamir 2005)

Page 28: Alessandra Godi

HAPLOTYPING of a population: our approach to the problem by using ILP

A new exponential formulation

1. A pure set covering model obtained from Gusfield’s (2002) model by the Fourier-Motzkin procedure

2. A branch and cut procedure to decrease the number of constraints

A new polynomial formulation

A formulation using class representatives

Page 29: Alessandra Godi

A new polynomial formulation

$G = \{g_1, g_2, \dots, g_m\}$: genotypes of length $n$

$I = \{h_1, \dots, h_q\}$: a solution of the problem

Main idea: class representatives

Each haplotype induces a subset of ordered genotypes, and each genotype belongs to exactly two of these subsets:

$h_1 \to \{g_i, g_j, g_k, \dots\} = S_i$

$h_2 \to \{g_i, g_l, g_r, g_s, \dots\} = S_{i'}$

$h_3 \to \{g_k, g_l, g_s, g_t, \dots\} = S_k$

…

The smallest-index genotype identifies the subset; the prime appears if the corresponding index has already been used.

$K = \{1, 2, \dots, m\}$, $K' = \{1', 2', \dots, m'\}$

Page 30: Alessandra Godi

A new polynomial formulation

VARIABLES

$y^k_{\{i,j\}} = 1$ if genotype $g_k$ belongs to two subsets of genotypes, one whose smallest-index genotype is $i$ and the other whose smallest-index genotype is $j$; $0$ otherwise.

$k \in K$; $i, j \in K \cup K'$

Page 31: Alessandra Godi

A new polynomial formulation

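Ex:

$g_1 = 021$, $g_2 = 002$, $g_3 = 012$

$h_1 = 001 \to \{g_1, g_2\} = S_1$

$h_2 = 011 \to \{g_1, g_3\} = S_{1'}$

$y^1_{\{1,1'\}} = 1$

Let us note that some y variables do not exist: $y^2_{\{1',2'\}} = 0$.

If $y^2_{\{1',2'\}} = 1$, then

$S_1 = \{g_1, \dots\}$, $S_{1'} = \{g_1, g_2, \dots\}$

$S_2 = \{g_2, \dots\}$, $S_{2'} = \{g_2, \dots\}$

Absurd! Genotype $g_2$ would belong to three subsets ($S_{1'}$, $S_2$ and $S_{2'}$) instead of exactly two.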
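The labelling rule behind the example (the smallest-index genotype names the subset, and a prime is added when that index is already taken) can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the function name `class_labels` and the input format (one haplotype pair per genotype, in genotype order) are assumptions. The partners 000 and 010 do not appear on the slide; they are the only haplotypes that complete $g_2 = 002$ with 001 and $g_3 = 012$ with 011.

```python
def class_labels(resolution):
    """resolution[k-1] is the haplotype pair explaining genotype g_k.
    Returns {haplotype: (label, indices of the genotypes it explains)}."""
    subsets = {}
    for k, pair in enumerate(resolution, start=1):
        # dict.fromkeys removes a repeated haplotype but keeps the order.
        for h in dict.fromkeys(pair):
            subsets.setdefault(h, []).append(k)

    labels, used = {}, set()
    for h, members in subsets.items():
        i = min(members)                      # smallest-index genotype
        label = f"{i}'" if i in used else str(i)
        used.add(i)
        labels[h] = (label, members)
    return labels


# g1 = 021, g2 = 002, g3 = 012, explained by these pairs (see lead-in).
resolution = [("001", "011"), ("001", "000"), ("011", "010")]
for h, (label, members) in class_labels(resolution).items():
    print(h, "-> S" + label, members)
```

For the slide's example this yields S1 = {g1, g2} for 001 and S1' = {g1, g3} for 011, plus the singleton classes S2 and S3 for the two partner haplotypes.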

Page 32: Alessandra Godi

A new polynomial formulation

VARIABLES

$x_i = 1$ if there exists a subset of genotypes in the solution having genotype $i$ as its smallest-index genotype; $0$ otherwise. $i \in K \cup K'$

$z_{i,p} \in \{0,1\}$ is the value of the $p$-th coordinate of the haplotype explaining the subset of genotypes used in the solution and having genotype $i$ as its smallest-index genotype. $i \in K \cup K'$, $p \in SNP$

OBJECTIVE FUNCTION:

$\min \sum_{i \in K \cup K'} x_i$

Page 33: Alessandra Godi

A new polynomial formulation

CONSTRAINTS:

1. $x_i \ge x_{i'}$, $\forall i \in K$, $i' \in K'$

2. $\sum_{i,j \in K \cup K',\ i \le k,\ j \le k} y^k_{\{i,j\}} \ge 1$, $\forall k \in K$

Page 34: Alessandra Godi

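A new polynomial formulation

CONSTRAINTS:

3. $\sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}} \le x_i$, $\forall k \in K$, $\forall i \in K \cup K'$

3a. $y^k_{\{k,k'\}} \le x_{k'}$, $\forall k \in K$, $i = k'$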

Page 35: Alessandra Godi

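A new polynomial formulation

CONSTRAINTS:

4a. $z_{i,p} = 0$, $\forall i \in K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 0$

4b. $z_{i,p} = 1$, $\forall i \in K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 1$

4c. $z_{i,p} + z_{j,p} = 1$, $\forall \{i,j\} \subseteq K \cup K'$, $\forall p \in SNP$ s.t. $g_i(p) = 2$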

Page 36: Alessandra Godi

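A new polynomial formulation

CONSTRAINTS:

5. $z_{i,p} \le 1 - \sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}} - \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i \in K \cup K'$, $\forall p \in SNP : g_k(p) = 0$

5a. $y^k_{\{k,k'\}} + z_{k',p} \le 1$, $\forall k \in K$, $i = k'$, $\forall p \in SNP : g_k(p) = 0$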

Page 37: Alessandra Godi

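A new polynomial formulation

CONSTRAINTS:

6. $z_{i,p} \ge \sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i \in K \cup K'$, $\forall p \in SNP : g_k(p) = 1$

6a. $z_{k',p} \ge y^k_{\{k,k'\}}$, $\forall k \in K$, $i = k'$, $\forall p \in SNP : g_k(p) = 1$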

Page 38: Alessandra Godi

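A new polynomial formulation

CONSTRAINTS:

7. $z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i,j \in K \cup K'$, $\forall p \in SNP : g_k(p) = 2$

7a. $z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}}$, $\forall k \in K$, $\forall i,j \in K \cup K'$, $\forall p \in SNP : g_k(p) = 2$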
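A small check can illustrate what constraints 5, 6, 7 and 7a enforce together: when a y variable equals 1, the two class haplotypes must combine into genotype $g_k$, and when it equals 0 the constraints are slack. The sketch below is a simplification made for illustration: it looks at a single y variable instead of the sums in constraints 5 and 6, and the function name `linking_holds` is hypothetical. The data are the genotypes and haplotypes of the earlier example.

```python
def linking_holds(g_k, z_i, z_j, y=1):
    """Check constraints 5, 6, 7 and 7a for one y variable and two classes,
    given the haplotype coordinates z_i and z_j (tuples of 0/1)."""
    for p, site in enumerate(g_k):
        if site == 0:       # constraint 5: z <= 1 - y for both classes
            ok = z_i[p] <= 1 - y and z_j[p] <= 1 - y
        elif site == 1:     # constraint 6: z >= y for both classes
            ok = z_i[p] >= y and z_j[p] >= y
        else:               # constraints 7 and 7a: y <= z_i + z_j <= 2 - y
            ok = y <= z_i[p] + z_j[p] <= 2 - y
        if not ok:
            return False
    return True


g1, g2 = (0, 2, 1), (0, 0, 2)
h1, h2 = (0, 0, 1), (0, 1, 1)
print(linking_holds(g1, h1, h2))        # True: 001 and 011 explain 021
print(linking_holds(g2, h1, h2))        # False: they do not explain 002
print(linking_holds(g2, h1, h2, y=0))   # True: constraints are slack
```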

Page 39: Alessandra Godi

Preliminary results

10x10               Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                15    12      0.01        54        0.12         263        14
Brown Model ['05]   15    2       0.05        140       4.85         16,646     1360

15x15               Opt   zLP     sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                27    22.83   0.01        173       0.08         173        11
Brown Model ['05]   27    8       0.02        129       4.25         19,301     2,213

Page 40: Alessandra Godi

Preliminary results

20x20               Opt   zLP   sec (zLP)   LP iter   sec (zILP)   MIP iter   B&B nodes
Poly                16    15    0.2         268       16           573        9
Brown Model ['05]   16    3     0.07        598       27.604       16*10^6    540,623

Page 41: Alessandra Godi

From Gusfield’s formulation (2002)…

Let G be the genotype set and $\hat{H}$ the set of haplotypes which are compatible with some genotype in G.

For each $g \in G$:

$P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in \hat{H} \mid h_1 \oplus h_2 = g\}$

INTEGER VARIABLES

$X_h = 1$ if h is chosen, $0$ otherwise

$y_{h_1,h_2} = 1$ if $(h_1,h_2)$ is selected, $0$ otherwise
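The set of pairs compatible with a genotype can be enumerated directly: every site coded 2 is free, and each way of splitting the free sites between the two haplotypes gives one pair. The sketch below is an illustration under the coding used in this talk (genotypes as strings over 0, 1, 2); the function name `compatible_pairs` is an assumption.

```python
from itertools import product


def compatible_pairs(g):
    """All unordered haplotype pairs (h1, h2) whose combination is g."""
    free = [p for p, site in enumerate(g) if site == "2"]
    if not free:
        return [(g, g)]          # no ambiguous site: a single pair (g, g)
    pairs = []
    # Fix the first free site to 0 in h1 so each unordered pair appears once.
    for bits in product("01", repeat=len(free) - 1):
        h1, h2 = list(g), list(g)
        for p, b in zip(free, ("0",) + bits):
            h1[p] = b
            h2[p] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(compatible_pairs("22"))    # [('00', '11'), ('01', '10')]
print(compatible_pairs("021"))   # [('001', '011')]
print(len(compatible_pairs("2222")))  # 8
```

A genotype with k ambiguous sites has 2^(k-1) compatible pairs, which is why the formulations built on these pairs grow exponentially with the number of 2s.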

Page 42: Alessandra Godi

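From Gusfield’s formulation (2002)…

OBJECTIVE FUNCTION

$\min \sum_{h \in \hat{H}} X_h$

CONSTRAINTS

1. $\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1$, $\forall g \in G$

2. $y_{h_1,h_2} \le X_{h_1}$, $\forall (h_1,h_2) \in P_g$, $\forall g \in G$

3. $y_{h_1,h_2} \le X_{h_2}$, $\forall (h_1,h_2) \in P_g$, $\forall g \in G$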

Page 43: Alessandra Godi

…to a new set covering formulation by using the Fourier-Motzkin procedure

Set-Covering

$\min \sum_{h \in \hat{H}} x_h$

s.t. $\sum_{(h_1,h_2) \in P_g} x_h \ge 1$, with $h = h_1$ or $h = h_2$ for each pair, $\forall g \in G$

$x \in \{0,1\}^n$

Genotype Structure + Basic SC theory

Facets and Valid Inequalities

Page 44: Alessandra Godi

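Set-covering for HIP

N is the set of SNPs

$F = \{p \in N : g(p) \in \{0,1\}\}$

[Figure: genotype g with its sites split into fixed sites (F) and free sites (N\F).]

Proposition

1. The polytope HSC is full-dimensional iff, $\forall g \in G$, $|N \setminus F| = 2$.

2. $x_j \ge 0$ is a facet for HSC iff, $\forall g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| = 3$.

3. $x_j \le 1$ is a facet $\forall j$.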

Page 45: Alessandra Godi

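Set-covering for HIP

N is the set of SNPs

$F = \{p \in N : g(p) \in \{0,1\}\}$

$F' = \{p \in N : g'(p) \in \{0,1\}\}$

$C = (N \setminus F') \cap F$

$\sum_{i \in S} x_i \ge 1$

[Figure: genotypes g and g' aligned site by site; the sites fall into fixed/fixed, fixed/free and free/free groups according to F, N\F, F' and N\F'.]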

Page 46: Alessandra Godi

Set-covering for HIP

Theorem. Let us consider a genotype g and a subset S of haplotypes which are associated with a minimal set covering inequality:

$\sum_{h \in S} x_h \ge 1$

This inequality is facet-defining iff for each genotype $g' \ne g$ one of the following conditions holds:

$|C| = |(N \setminus F') \cap F| \ge 3$

$|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$

Page 47: Alessandra Godi

Set-covering for HIP

NOTE: For the following cases:

1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$

2nd case: $|C| = |\{p\}| = 1$

3rd case: $C = \emptyset$

the set covering inequality is dominated by another one that can be defined by using a SEQUENTIAL LIFTING procedure.

Page 48: Alessandra Godi

Set-covering for HIP: main idea

To overcome the exponential structure of the formulation:

1. Add only set-covering inequalities which are facet-defining

2. Add them in a branch and cut procedure

Page 49: Alessandra Godi

Set-covering for HIP: a branch and cut procedure

$x^*$: a fractional solution of a subproblem of the original one

g: $(h_1, h_2)$ or $(h_3, h_4)$ or $(h_5, h_6)$ or $(h_7, h_8)$

All set covering inequalities associated with g have the following structure:

$x_{\{1 \text{ or } 2\}} + x_{\{3 \text{ or } 4\}} + x_{\{5 \text{ or } 6\}} + x_{\{7 \text{ or } 8\}} \ge 1$
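The structure above means that a genotype with four compatible pairs gives 2^4 = 16 inequalities, one for each way of picking one haplotype per pair. The following sketch enumerates them and checks, on every 0/1 selection of the eight haplotypes, that satisfying all of them is the same as selecting both members of at least one pair. It is an illustration only; the function names and the use of integer indices 1 to 8 for the haplotypes are assumptions.

```python
from itertools import product


def covering_inequalities(pairs):
    """One inequality per way of picking one haplotype from every pair.
    Each inequality is the set S in: sum of x_h over h in S >= 1."""
    return {frozenset(choice) for choice in product(*pairs)}


def satisfies_all(selected, pairs):
    """True if the 0/1 solution `selected` (a set) meets every inequality."""
    return all(ineq & selected for ineq in covering_inequalities(pairs))


def explains(selected, pairs):
    """True if both haplotypes of at least one pair are selected."""
    return any(a in selected and b in selected for a, b in pairs)


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
print(len(covering_inequalities(pairs)))                        # 16
print(satisfies_all({3, 4}, pairs), explains({3, 4}, pairs))    # True True
print(satisfies_all({1, 3, 5, 7}, pairs),
      explains({1, 3, 5, 7}, pairs))                            # False False
```

The selection {1, 3, 5, 7} fails because the inequality built from the other member of each pair, x2 + x4 + x6 + x8 ≥ 1, is violated.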

Page 50: Alessandra Godi

Set-covering for HIP: a branch and cut procedure

We want to find a set covering inequality of g that is violated by $x^*$:

$\min\{x^*_1, x^*_2\} + \min\{x^*_3, x^*_4\} + \min\{x^*_5, x^*_6\} + \min\{x^*_7, x^*_8\} < 1$

If it exists, we have found a set covering inequality which cuts off $x^*$!

We choose to add it to the system only if it is facet-defining.
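The separation test on this slide amounts to picking, in each pair, the haplotype with the smaller fractional value and comparing the sum with 1. A minimal sketch follows, assuming $x^*$ is given as a dictionary from haplotype index to value; the function name `separate`, the tolerance `eps` and the sample values are hypothetical, and the facet-defining check mentioned on the slide is not implemented here.

```python
def separate(x_star, pairs, eps=1e-9):
    """Return (chosen haplotypes, left-hand side) of the most violated
    set covering inequality of one genotype, or None if none is violated."""
    # In every pair keep the haplotype with the smaller fractional value.
    choice = [min(pair, key=lambda h: x_star[h]) for pair in pairs]
    lhs = sum(x_star[h] for h in choice)
    return (choice, lhs) if lhs < 1 - eps else None


pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
x_star = {1: 0.5, 2: 0.1, 3: 0.2, 4: 0.6, 5: 0.3, 6: 0.3, 7: 0.0, 8: 1.0}
print(separate(x_star, pairs))   # cut on haplotypes 2, 3, 5, 7 (sum about 0.6)
```

Because the minimum is taken pair by pair, this finds the inequality with the smallest left-hand side without enumerating all of them.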

Page 51: Alessandra Godi

Branch and Cut preliminary results

Instance            Av. on max # of 2s   # constr master problem   # constr reduced problem   # added cuts   Solving time
50 genos, 10 SNPs   5                    >60,000                   7                          30             0.00 sec
50 genos, 30 SNPs   8                    >2512                     7                          200            0.05 sec

Average on 10 samples for each kind of instance generated by MS (Hudson, 2002) with recombination level r = 0

Page 52: Alessandra Godi

Future Works

On Polynomial formulation:

1. Strengthening of the model by clique inequalities on the genotype conflict graph

2. Cplex Concert Technologies

3. More tests vs other polynomial formulations

On Exponential formulation:

1. Implementation of Lifting Procedure

2. More tests in comparison with Gusfield formulation