Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Population Genetics: More on haplotypes (coalescence, multi-population phasing, blocks, etc.)

Eric Xing

Lecture 19, March 27, 2006

Clustering

How to label them ? inference

How many clusters ??? model selection ? or inference ?

Genetic Demography

Are there genetic prototypes among them ? What are they ? How many ? (how many ancestors do we have ?)

Inference done separately, or jointly?

Multi-population Genetic Demography

Clustering as Mixture Modeling

Model selection

"intelligent" guess: ???

cross validation: data-hungry

information theoretic: AIC, TIC, MDL

Posterior inference:

we want to handle uncertainty of model complexity explicitly

we favor a distribution that does not constrain M in a "closed" space!

Model Selection vs. Posterior Inference

Model selection:

$\arg\min_K \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$

Posterior inference:

$p(M \mid D) \propto p(D \mid M)\, p(M)$

Parsimony, Occam's Razor

Haplotype h ≡ (h1, h2): possible associations of alleles to chromosomes

Genotype g: pairs of alleles, whose associations to chromosomes are unknown

[Figure: sequencing a heterozygous diploid individual; the genotype is observed as the allele pairs TC, TG, AA]

This is a mixture modeling problem!

Haplotype Ambiguity

The probability of a genotype g:

$p(g) = \sum_{h_1, h_2 \in H} p(h_1, h_2)\, p(g \mid h_1, h_2)$

where $p(h_1, h_2)$ is the haplotype model, $p(g \mid h_1, h_2)$ is the genotyping model, and H is the population haplotype pool.

Standard settings:

|H| = K << 2^J: fixed-sized population haplotype pool

p(h1, h2) = p(h1) p(h2) = f1 f2: Hardy-Weinberg equilibrium

Problem: K ? H ?
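The sum over haplotype pairs can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: it assumes a noise-free genotyping model (p(g | h1, h2) is 1 when the two haplotypes produce the genotype at every locus and 0 otherwise), Hardy-Weinberg equilibrium, and a small made-up haplotype pool `freqs`. The function name and the data are hypothetical.

```python
from itertools import product


def genotype_prob(genotype, freqs):
    """p(g) = sum over ordered pairs (h1, h2) of f(h1) * f(h2) * p(g | h1, h2).

    genotype: list of unordered allele pairs, one sorted 2-char string per locus.
    freqs:    dict mapping each pool haplotype (string) to its frequency.
    Assumes a noise-free genotyping model (an assumption of this sketch).
    """
    total = 0.0
    for (h1, f1), (h2, f2) in product(freqs.items(), repeat=2):
        consistent = all(
            "".join(sorted(a + b)) == g for a, b, g in zip(h1, h2, genotype)
        )
        if consistent:
            total += f1 * f2  # Hardy-Weinberg: p(h1, h2) = f1 * f2
    return total


# Hypothetical pool of K = 4 haplotypes over J = 2 binary loci.
pool = {"00": 0.5, "01": 0.2, "10": 0.2, "11": 0.1}
# A double heterozygote is ambiguous: (00, 11) and (01, 10) both explain it.
p_ambiguous = genotype_prob(["01", "01"], pool)
```

The double heterozygote shows the ambiguity directly: two distinct haplotype pairs contribute to the same genotype probability.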

A Finite (Mixture of ) Allele Model

[Figure: graphical model with individual haplotype nodes Hn1 and Hn2]

The coalescent process

[Figure: genealogy traced backward in time from the present. The 22 sampled individuals descend from 18, 16, 14, 12, 9, 8, 7, 5, 3, 2 and finally 1 ancestor, the most recent common ancestor (MRCA).]

Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)

[Figure: the present-day sample, descended from the most recent common ancestor (MRCA), shown as aligned sequences; * marks polymorphic sites]

TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
  *   ** * *
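Splitting the sample above into its six 13-base sequences, the polymorphic sites and the sub-clades they define can be found mechanically. The sketch below is illustrative only; the function names are hypothetical and sites are indexed from 0.

```python
from collections import defaultdict

# The six aligned sequences from the present-day sample.
sample = [
    "TCGAGGTATTAAC",
    "TCTAGGTATTAAC",
    "TCGAGGCATTAAC",
    "TCTAGGTGTTAAC",
    "TCGAGGTATTAGC",
    "TCTAGGTATCAAC",
]


def segregating_sites(seqs):
    """Return the 0-based positions at which the sequences are not all identical."""
    return [j for j, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def split_by_site(seqs, site):
    """Group sequence indices by the allele they carry at one polymorphic site."""
    clades = defaultdict(list)
    for idx, seq in enumerate(seqs):
        clades[seq[site]].append(idx)
    return dict(clades)


sites = segregating_sites(sample)
clades_at_first_site = split_by_site(sample, sites[0])
```

Each polymorphic site splits the sample into the lineages that inherited the mutation and those that did not, which is how typed sites recover sub-clades of the genealogy.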

Assuming there are presently k active lineages: The probability of coalescence:

The probability of mutation (killing a lineage):
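As a hedged illustration of these two probabilities, the sketch below assumes the standard Kingman coalescent with mutation: each pair of lineages coalesces at rate 1 and each lineage mutates at rate alpha/2 per unit time. The symbol `alpha` and the function name are assumptions of this sketch. With k lineages the competing rates are k(k-1)/2 and k*alpha/2, so the next event going back in time is a coalescence with probability (k-1)/(k-1+alpha) and a mutation with probability alpha/(k-1+alpha).

```python
def next_event_probabilities(k, alpha):
    """Probability that the next event (backward in time) is a coalescence or a mutation.

    k:     number of currently active lineages (k >= 1)
    alpha: scaled mutation parameter (> 0); each lineage mutates at rate alpha / 2
    """
    coalescence_rate = k * (k - 1) / 2.0  # one unit of rate per pair of lineages
    mutation_rate = k * alpha / 2.0       # alpha / 2 per lineage
    total = coalescence_rate + mutation_rate
    return coalescence_rate / total, mutation_rate / total


p_coal, p_mut = next_event_probabilities(k=5, alpha=1.0)
```

For k = 5 and alpha = 1 the rates are 10 and 2.5, matching the closed forms 4/5 and 1/5.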

Population Genetic Basis of an Infinite Allele model

Natural genealogy

Coalescent with mutation

Infinite mixtures

Kingman coalescent process with binary lineage merging

New population haplotype alleles emerge along all branches of the coalescence tree at rate α/2 per unit length

The Ewens Sampling Formula: an exchangeable random partition of individuals

Dirichlet process mixture

A Hierarchical Bayesian Infinite Allele model

• Assume an individual haplotype h is stochastically derived from a population haplotype $a_k$ with nucleotide-substitution frequency $\theta_k$:

$h \sim p(h \mid \{a, \theta\}_k)$.

• Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes.

• The number and identity of the population haplotypes are unknown: use a Dirichlet Process to construct a prior distribution G on $H \times R^J$.

[Figure: graphical model with individual haplotype nodes Hn1 and Hn2]

Dirichlet Process Priors

A c.d.f. G on a space Ω follows a Dirichlet Process if, for any finite measurable partition (B1, B2, ..., Bm) of Ω, the joint distribution of the random variables satisfies

$(G(B_1), G(B_2), \ldots, G(B_m)) \sim \mathrm{Dirichlet}(\alpha G_0(B_1), \ldots, \alpha G_0(B_m))$,

where G0 is the base distribution and α is the precision parameter.

Stick-breaking Process

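$\beta_k \sim \mathrm{Beta}(1, \alpha)$, $\pi_k = \beta_k \prod_{l<k} (1 - \beta_l)$

Location: each piece of the stick is paired with a location drawn from G0.

Example: break off 0.4 of the unit stick (π1 = 0.4, leaving 0.6); break off 0.5 of the remainder (π2 = 0.3, leaving 0.3); break off 0.8 of the remainder (π3 = 0.24).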
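The stick-breaking construction is easy to reproduce numerically. The sketch below is a minimal illustration: it assumes break proportions drawn from Beta(1, alpha) (the standard construction) and truncates the infinite sequence at `n_sticks` pieces; both function names are hypothetical.

```python
import random


def stick_weights(betas):
    """Turn break proportions beta_k into weights pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    weights, remaining = [], 1.0
    for beta in betas:
        weights.append(beta * remaining)  # break off a fraction of what is left
        remaining *= 1.0 - beta           # shrink the remaining stick
    return weights


def sample_stick_breaking(alpha, n_sticks, rng=None):
    """Draw a truncated set of mixture weights with beta_k ~ Beta(1, alpha)."""
    rng = rng or random.Random(0)
    betas = [rng.betavariate(1.0, alpha) for _ in range(n_sticks)]
    return stick_weights(betas)


# Break proportions 0.4, 0.5, 0.8 give weights close to 0.4, 0.3, 0.24.
example = stick_weights([0.4, 0.5, 0.8])
draw = sample_stick_breaking(alpha=1.0, n_sticks=20)
```

A truncated draw always sums to less than 1; the missing mass is the unbroken remainder of the stick.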

Chinese Restaurant Process

CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers

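$P(c_i = k \mid \mathbf{c}_{-i}) = \frac{n_k}{i - 1 + \alpha}$ for an occupied table k with $n_k$ customers, and $\frac{\alpha}{i - 1 + \alpha}$ for a new table.

[Figure: table of seating probabilities for successive customers; the first customer sits at the first table with probability 1]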
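The seating rule of the Chinese Restaurant Process can be sketched as follows. This is a minimal illustration under the standard CRP with precision parameter `alpha`: an occupied table is chosen in proportion to its occupancy and a new table in proportion to alpha. The function names are hypothetical.

```python
import random


def crp_seating_probs(counts, alpha):
    """Seating probabilities for the next customer.

    counts: number of customers already at each occupied table.
    Returns one probability per occupied table, plus a final entry for a new table.
    """
    n = sum(counts)
    probs = [c / (n + alpha) for c in counts]
    probs.append(alpha / (n + alpha))
    return probs


def sample_crp(n_customers, alpha, rng=None):
    """Seat customers one at a time; return table assignments and table sizes."""
    rng = rng or random.Random(0)
    counts, seats = [], []
    for _ in range(n_customers):
        probs = crp_seating_probs(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an occupied table
        seats.append(k)
    return seats, counts


seats, counts = sample_crp(n_customers=10, alpha=1.0)
```

The number of occupied tables is not fixed in advance; it grows with the number of customers at a rate controlled by alpha.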

The DP Mixture of Ancestral Haplotypes

The customers around a table form a cluster: associate a mixture component (i.e., a population haplotype) with a table

sample {a, θ} at each table from a base measure G0 to obtain the population haplotype and nucleotide-substitution frequency for that component

With p(h | {a, θ}) and p(g | h1, h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies)

DP-haplotyper

Inference: Markov Chain Monte Carlo (MCMC)

Gibbs sampling

Metropolis-Hastings

[Figure: graphical model with infinite mixture components (for population haplotypes) and a likelihood model (for individual haplotypes Hn1, Hn2 and genotypes)]

Model components

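Choice of base measure:

$G_0 \sim \mathrm{Unif}(a_j)\, \mathrm{Beta}(\theta)$

Nucleotide-substitution model:

$p(h_i \mid \{a, \theta\}_k) = \prod_j p(h_{i,j} \mid a_{k,j}, \theta_k)$, where $p(h_{i,j} \mid a_{k,j}, \theta_k)$ is determined by whether $h_{i,j} = a_{k,j}$.

Noisy genotyping model:

$p(g_i \mid h_{i_1}, h_{i_2}) = \prod_j p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j})$, where $p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j})$ is determined by whether $h_{i_1,j} \oplus h_{i_2,j} = g_{i,j}$.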
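A per-locus product of this kind can be sketched as below. The parameterization is an assumption of this sketch, not recovered from the slide: a site copies the ancestral allele with probability `match_prob`, and otherwise takes one of the remaining `alphabet_size - 1` alleles uniformly. The function name is hypothetical.

```python
def haplotype_likelihood(h, ancestor, match_prob, alphabet_size=4):
    """Likelihood of an individual haplotype given a population (ancestral) haplotype.

    Assumed parameterization: each locus matches the ancestor with probability
    match_prob, and takes each other allele with probability
    (1 - match_prob) / (alphabet_size - 1). Loci are treated as independent.
    """
    mismatch_prob = (1.0 - match_prob) / (alphabet_size - 1)
    likelihood = 1.0
    for allele, ancestral in zip(h, ancestor):
        likelihood *= match_prob if allele == ancestral else mismatch_prob
    return likelihood


identical = haplotype_likelihood("AC", "AC", match_prob=0.9)
one_change = haplotype_likelihood("AC", "AT", match_prob=0.9)
```

Under this assumption a haplotype that differs from its ancestor at one site is penalized by the ratio of the mismatch to the match probability, so haplotypes cluster around nearby ancestors.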

Gibbs sampling

Starting from some initial haplotype reconstruction $H^{(0)}$, pick a first table with an arbitrary $a_1^{(0)}$, and form the initial population-haplotype pool $A^{(0)} = \{a_1^{(0)}\}$:

i) Choose an individual i and one of his/her two haplotypes t, uniformly at random, from all ambiguous individuals;

ii) Sample $c_{i_t}^{(t+1)}$ from $p(c_{i_t} \mid \mathbf{c}_{-i_t}^{(t)}, H^{(t)}, A^{(t)})$, update $\mathbf{c}^{(t+1)}$;

iii) Sample $a_k^{(t+1)}$, where $k = c_{i_t}^{(t+1)}$, from $p(a_k \mid \{h_{i'_{t'}} \text{ s.t. } c_{i'_{t'}}^{(t+1)} = k\})$; update $A^{(t+1)}$;

iv) Sample $h_{i_t}^{(t+1)}$ from $p(h_{i_t} \mid A^{(t+1)}, \mathbf{c}^{(t+1)}, H^{(t)})$, update $H^{(t+1)}$.
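Step ii, resampling the table of one haplotype, can be sketched by combining the CRP prior with a likelihood. Everything specific here is an assumption of this sketch: the substitution model uses a fixed `match_prob` (not sampled), the base measure is uniform over alleles at each locus, and `counts` excludes the haplotype being resampled. Under those assumptions the marginal likelihood of a new table is (1/alphabet_size) per locus.

```python
def assignment_posterior(h, ancestors, counts, alpha, match_prob, alphabet_size=4):
    """Posterior over tables for haplotype h: CRP prior times likelihood, normalized.

    ancestors: population haplotype currently associated with each occupied table
    counts:    customers per occupied table, excluding the haplotype being resampled
    Returns one probability per occupied table, plus a final entry for a new table.
    """
    mismatch_prob = (1.0 - match_prob) / (alphabet_size - 1)

    def likelihood(ancestor):
        p = 1.0
        for allele, ancestral in zip(h, ancestor):
            p *= match_prob if allele == ancestral else mismatch_prob
        return p

    # Occupied tables: weight n_k * p(h | a_k); the CRP denominator cancels.
    weights = [n * likelihood(a) for n, a in zip(counts, ancestors)]
    # New table: alpha times the likelihood averaged over a uniform ancestor.
    weights.append(alpha * (1.0 / alphabet_size) ** len(h))
    total = sum(weights)
    return [w / total for w in weights]


posterior = assignment_posterior(
    "AC", ancestors=["AC", "GT"], counts=[2, 1], alpha=1.0, match_prob=0.9
)
```

A full sampler would draw the table from this distribution, then resample the ancestor at that table and the haplotype itself, as in steps iii and iv.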

Convergence of Ancestral Inference

Haplotyping Error

The Gabriel data

Multi-population Genetic Demography

Pool everything together and solve 1 hap problem? --- ignore population structures

Solve 4 hap problems separately? --- data fragmentation

Co-clustering … solve 4 coupled hap problems jointly

Why humans are so similar and polymorphic patterns are regional

Population bottleneck: a small population that interbred reduced the genetic variation

Out of Africa ~ 100,000 years ago

Out of Africa

Each population can be associated with a unique DP capturing population-specific genetic demography

Different populations may have unique haplotypes

Different populations may share common haplotypes

Thus population-specific DPs are marginally dependent

Population Specific DPs

Hierarchical DP Mixture

Simulated Data

As 1 HapTyping problem

As 5 HapTyping problems

HapMap Data

The block structure of haplotypes

The Daly et al. (2001) data set: this consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European-derived population, giving 258 transmitted and 258 untransmitted chromosomes.

The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.

Another study: the Patil et al data

The haplotype patterns for 20 independent, globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major allele, yellow = minor allele. Each column represents a single chromosome.

The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP block of 26 SNPs over 19 kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs.

Figure 2 of Patil et al

Haplotype Blocks

Markov Dependence Between Blocks

On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.
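The first-order dependence between blocks can be sketched as a Markov chain over block haplotypes. The representation below is an assumption of this sketch: an initial distribution over haplotypes in the first block and one transition table per pair of adjacent blocks. The haplotype labels and probabilities are made up for illustration.

```python
def chromosome_prob(block_haps, initial, transitions):
    """Probability of a chromosome described by its haplotype in each block.

    block_haps:  haplotype label carried at block 1, 2, ..., K
    initial:     dict, probability of each haplotype at the first block
    transitions: list of K - 1 nested dicts; transitions[k][prev][cur] is the
                 probability of haplotype cur at block k + 2 given prev at block k + 1
    """
    prob = initial[block_haps[0]]
    for prev, cur, table in zip(block_haps, block_haps[1:], transitions):
        prob *= table[prev][cur]
    return prob


# Hypothetical two-block example with two haplotypes per block.
initial = {"A": 0.6, "B": 0.4}
transitions = [{"A": {"X": 0.9, "Y": 0.1}, "B": {"X": 0.2, "Y": 0.8}}]
p_ax = chromosome_prob(["A", "X"], initial, transitions)
```

Strong off-diagonal structure in a transition table corresponds to little historical recombination between the two blocks.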

Block Partitioning Algorithms

Greedy algorithm (Patil et al. 2001). Begin by considering all possible blocks of ≥1 consecutive SNPs. Next, exclude all blocks in which < 80% of the chromosomes in the data are defined by haplotypes represented more than once in the block (80% coverage). Considering the remaining overlapping blocks simultaneously, select the one which maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block. Any of the remaining blocks that physically overlap with the selected block are discarded, and the process is repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 with no gaps and with every SNP assigned to a block.
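The greedy procedure can be sketched on toy data. This is a brute-force illustration under stated assumptions, not the published implementation: chromosomes are complete strings with no missing data, the number of discriminating SNPs is found by exhaustive search, a block with a single common haplotype is treated as needing one SNP, and ties are broken arbitrarily. All names and the example data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def min_tag_snps(haps):
    """Smallest number of columns that distinguishes all distinct haplotypes in haps."""
    haps = sorted(set(haps))
    if len(haps) <= 1:
        return 0
    width = len(haps[0])
    for r in range(1, width + 1):
        for cols in combinations(range(width), r):
            if len({tuple(h[c] for c in cols) for h in haps}) == len(haps):
                return r
    return width


def greedy_blocks(chroms, coverage=0.8):
    """Select non-overlapping SNP blocks (half-open [start, end)) greedily."""
    n_snps = len(chroms[0])
    candidates = []
    for start in range(n_snps):
        for end in range(start + 1, n_snps + 1):
            counts = Counter(c[start:end] for c in chroms)
            common = [h for h, n in counts.items() if n > 1]
            covered = sum(counts[h] for h in common) / len(chroms)
            if covered < coverage:
                continue  # fails the coverage criterion
            ratio = (end - start) / max(1, min_tag_snps(common))
            candidates.append((ratio, end - start, start, end))
    blocks = []
    while candidates:
        _, _, s, e = max(candidates)  # best SNPs-per-tag-SNP ratio
        blocks.append((s, e))
        # Discard every candidate that overlaps the selected block.
        candidates = [c for c in candidates if c[3] <= s or c[2] >= e]
    return sorted(blocks)


# Toy data: SNPs 0-2 and SNPs 3-5 are each tightly linked, but the two halves recombine.
chroms = ["000000", "000111", "111000", "111111", "000000"]
blocks = greedy_blocks(chroms)
```

On the toy data any block spanning the boundary between SNP 2 and SNP 3 fails the coverage test, so the procedure returns two blocks of three SNPs, each needing one SNP to tell its two haplotypes apart.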

Hidden Markov Models: maximum a posteriori inference (i.e., Viterbi) (Daly et al. 2001)

Minimum description length (Anderson et al. 2003)

Dynamic programming (Sun et al. 2002)

Haplotypes are Shaped by Recombination

Inheritance of unknown generation

Hidden Markov DP for Recombination

[Figure: hidden Markov chain over observations y1, y2, y3, …, yN]

References

E.P. Xing, R. Sharan and M.I. Jordan. Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML 2004).

N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294 (2001): 1719-1723.

M.J. Daly et al. High-resolution haplotype structure in the human genome. Nat. Genet. 29 (2001): 229-232.

E.C. Anderson and J. Novembre. Finding haplotype block boundaries using the minimum description length principle. American Journal of Human Genetics 73(2) (2003): 336-354.
