Gene Family Size Distributions Brought to You By Your Neighborhood Durand Lab Narayanan Raghupathy...

Gene Family Size Distributions Brought to You By Your Neighborhood Durand Lab Narayanan Raghupathy, Nan Song, Rose Hoberman



Page 1

Gene Family Size Distributions

Brought to You By Your Neighborhood Durand Lab

Narayanan Raghupathy

Nan Song

Rose Hoberman

Page 2

Why Are We Interested in Gene Family Size Distributions?

Want to find homologous chromosomal regions
• Genes as markers
• Matches between genes indicate possible regional homology

Cluster statistics depend on
• The total number of matches
• The distribution of matches
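To make the dependence on the distribution concrete, here is a minimal sketch under one stated assumption that is not in the slides: in a genome-vs-itself comparison, every pair of distinct genes in the same family counts as one match. The function name `match_counts` and the toy family sizes are hypothetical.

```python
from collections import Counter


def match_counts(family_sizes):
    """Return (total matches, matches contributed per family size).

    Assumption: each unordered pair of distinct genes within a family
    is one match, so a family of size k contributes k * (k - 1) / 2.
    """
    per_size = Counter()
    for k in family_sizes:
        per_size[k] += k * (k - 1) // 2
    return sum(per_size.values()), dict(per_size)


# Two toy genomes with 12 genes each but different family size distributions.
even = [3, 3, 3, 3]    # four families of three genes
skewed = [9, 1, 1, 1]  # one large family plus three singletons

total_even, _ = match_counts(even)
total_skewed, _ = match_counts(skewed)
print(total_even, total_skewed)
```

Under this assumption the same number of genes yields three times as many matches when one family dominates, which is why both the total and the distribution matter.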

Page 3

What do we mean by a “Family”?

Ideally: A group of sequences that have arisen from a common ancestor

In practice: families are most often defined based on
• Similar structure
• Similar sequence

Page 4

Families can be defined at many levels

Either domains or whole proteins can be grouped

Protein families and their evolution--a structural perspective. Orengo CA, Thornton JM.

Page 5

Why are other people interested in gene family sizes?

To understand protein family evolution
• Fit birth/death model to the data

To predict how many more genes there are in certain families

Page 6

How Can Genes Be Grouped Into Families?

Construct and analyze gene trees:
• Slow, requires manual supervision
• Tree construction is error-prone

Group based on structural similarity
• Structure may be similar even if not homologous
• Structure is generally not known

Cluster genes based on sequence similarity
• Heuristic
• Fast and comprehensive, even for large datasets

Page 7

Clustering

Group together genes with similar E-values (or other sequence-based score)
• Many heuristics have been proposed
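One of the simplest such heuristics can be sketched as follows: link two genes whenever their pairwise E-value is at or below a threshold, then call each connected component a family. This is only an illustration; the function name `cluster_by_evalue`, the gene identifiers, the hit list, and the default threshold are all hypothetical, not taken from the slides.

```python
def cluster_by_evalue(genes, hits, threshold=1e-10):
    """Group genes into families by thresholding pairwise E-values.

    hits is an iterable of (gene_a, gene_b, evalue) tuples. Genes joined
    by a chain of hits at or below the threshold end up in one family
    (connected components, tracked with a union-find structure).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the representative, halving paths as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, evalue in hits:
        if evalue <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # Largest families first; ties broken by member names.
    return sorted((sorted(m) for m in families.values()),
                  key=lambda m: (-len(m), m))


genes = ["g1", "g2", "g3", "g4", "g5"]
hits = [
    ("g1", "g2", 1e-40),
    ("g2", "g3", 1e-15),
    ("g3", "g4", 1e-3),   # too weak: does not join the two groups
    ("g4", "g5", 1e-12),
]
families = cluster_by_evalue(genes, hits)
print(families)
```

Note that g1 and g3 land in the same family without a direct hit between them, which is the chaining behaviour that makes the choice of heuristic matter.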

Page 8

Why bother with clustering heuristics?

May not find true “gene families”
• May be throwing away true matches
• May be including extra noise

However, may still be preferable to allowing only 1-to-1 matches

Page 9

[Figure: Chromosome 5 and Chromosome 3]

Page 10

Existing Gene Family Data

Data for individual species
• Recent data is only for bacteria

Data from multiple species
• Large sets of species: eukaryotes + prokaryotes

Page 11

The properties of protein family space depend on experimental design

Kunin et al., Bioinformatics 2005

Page 12

Our Questions

What does the gene family size distribution (GFSD) look like?
• How much does the clustering method affect the GFSD?
• How much does the cluster E-value threshold affect the GFSD?
• How much does the GFSD vary across species?
• Can we fit the GFSD to a particular function?
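As a rough sketch of what answering the last question might involve, the code below tabulates a family size distribution from a clustering and fits a straight line in log-log space. The slides do not say which function is being considered; a power law is used here purely as an assumed candidate, and ordinary least squares on logged counts is a quick check rather than a rigorous fit. The names `size_distribution` and `fit_power_law` and the toy data are hypothetical.

```python
import math
from collections import Counter


def size_distribution(families):
    """Map family size -> number of families of that size."""
    return dict(sorted(Counter(len(f) for f in families).items()))


def fit_power_law(distribution):
    """Least-squares line through (log size, log count).

    Returns (slope, intercept); under the assumed model
    count = C * size**slope, with C = exp(intercept).
    """
    xs = [math.log(size) for size in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Toy clustering: 64 singletons, 16 pairs, 4 families of four, 1 of eight.
toy_counts = {1: 64, 2: 16, 4: 4, 8: 1}
families = [["gene"] * size
            for size, count in toy_counts.items()
            for _ in range(count)]

gfsd = size_distribution(families)
slope, intercept = fit_power_law(gfsd)
print(gfsd, round(slope, 3))
```

Running the same two functions on clusterings produced with different methods, thresholds, or species gives directly comparable slopes, which is one way to frame the other three questions.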

Page 13

Our Analysis

Species:
• Yeast vs Yeast (5131 Genes)
• Mouse vs Mouse (7343 Genes)
• Human vs Human (10610 Genes) IN PROGRESS

Clustering Methods
• Hierarchical Clustering
    • Multiple variants
    • 5 E-value thresholds
• TribeMCL
    • 5 inflation parameters

Page 14

Hierarchical Clustering

Method, Threshold
• Complete linkage
• Average linkage
• Single linkage
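The three linkage rules differ only in how the distance between two clusters is derived from the pairwise gene distances. The sketch below is a naive agglomerative procedure that stops once the closest pair of clusters is farther apart than a threshold. It assumes distances where smaller means more similar (for E-values, some transformed score would be needed); the function name, the toy genes, and the distances are hypothetical.

```python
from itertools import combinations

# How to combine pairwise distances into one cluster-to-cluster distance.
LINKAGE = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}


def hierarchical_clusters(items, dist, threshold, linkage="complete"):
    """Repeatedly merge the two closest clusters until none is within threshold.

    dist maps frozenset({a, b}) -> distance between items a and b.
    """
    combine = LINKAGE[linkage]
    clusters = [[x] for x in items]

    def cluster_distance(c1, c2):
        return combine([dist[frozenset((a, b))] for a in c1 for b in c2])

    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


# Four toy genes on a line: neighbours are 1 apart, the ends are 3 apart.
position = {"a": 0, "b": 1, "c": 2, "d": 3}
dist = {frozenset((x, y)): abs(position[x] - position[y])
        for x, y in combinations(position, 2)}

single = hierarchical_clusters(list(position), dist, 1.5, "single")
complete = hierarchical_clusters(list(position), dist, 1.5, "complete")
average = hierarchical_clusters(list(position), dist, 1.5, "average")
print(single, complete, average)
```

With the same threshold, single linkage chains all four genes into one family while complete and average linkage stop at two families, so the linkage choice alone changes the resulting family sizes.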

Page 15

TribeMCL

Inflation parameter (but it is difficult to understand)
• 4-5: small clusters
• 1.1-3: larger clusters

However, clusters do not strictly increase in size when the inflation value is reduced
• i.e., clusters are not hierarchical

http://micans.org/mcl

Markov clustering
• More flow across higher-weight edges
• How much total flow is there between each pair of genes?

Handles multi-domain proteins?

Very efficient
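To show the role of the inflation parameter, here is a bare-bones sketch of the Markov clustering iteration: alternate expansion (squaring a column-stochastic matrix, which spreads flow) with inflation (raising entries to a power and renormalising, which strengthens strong flow and weakens weak flow). This is a teaching sketch in plain Python, not the TribeMCL or `mcl` implementation; the function names, the fixed iteration count, the cutoff, and the toy graph are assumptions.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column j holds the flow out of node j)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.0, iterations=50, cutoff=1e-6):
    """Cluster a symmetric weighted graph given as a square matrix."""
    n = len(weights)
    # Self-loops are added before normalising, a common MCL convention.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[v ** inflation for v in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Each row that still carries flow is an attractor; the columns it
    # receives flow from form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > cutoff)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two triangles of strongly similar genes joined by one weak edge (2-3).
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 0.1}
weights = [[0.0] * 6 for _ in range(6)]
for (i, j), w in edges.items():
    weights[i][j] = weights[j][i] = w

clusters = markov_cluster(weights)
print(clusters)
```

The weak edge between the two triangles carries too little flow to survive inflation, so the graph separates into two families.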

Page 16

[Figure: Mouse, complete linkage, 10^-10. Axis labels: gene family size; log(gene family size)]

Page 17

[Figure: Yeast, complete linkage]