Evolution and the Santa Cruz Genome Browser


Page 1: Evolution and the Santa Cruz Genome Browser

Evolution and the Santa Cruz Genome Browser

Jim Kent and the Genome Bioinformatics Group

University of California Santa Cruz

Pennsylvania State University

Page 2: Evolution and the Santa Cruz Genome Browser

Typical Gene Level View:

Sialic Acid Binding/Ig-like Lectin 7

Page 3: Evolution and the Santa Cruz Genome Browser

Typical Gene Level View:

Sialic Acid Binding/Ig-like Lectin 7

Page 4: Evolution and the Santa Cruz Genome Browser

Known Gene Details Page

Page 5: Evolution and the Santa Cruz Genome Browser

Known Gene Details Page

Page 6: Evolution and the Santa Cruz Genome Browser

PDB Ribbon Diagram

4 clicks away by the wonder of the world wide web

Page 7: Evolution and the Santa Cruz Genome Browser

Hox A Cluster, Many Tracks

Page 8: Evolution and the Santa Cruz Genome Browser

Track Controls are Now Grouped

Page 9: Evolution and the Santa Cruz Genome Browser

Packed mode saves space, makes labels easier to find.

Page 10: Evolution and the Santa Cruz Genome Browser

Squished mode is ideal for ESTs and mouse/human homology

Page 11: Evolution and the Santa Cruz Genome Browser

Squished mode is ideal for ESTs and mouse/human homology

ESTs hint at a smaller version of exon 2

Page 12: Evolution and the Santa Cruz Genome Browser

Publication Quality Output

Page 13: Evolution and the Santa Cruz Genome Browser

Comparative Genomics

Page 14: Evolution and the Santa Cruz Genome Browser

Chaining Alignments

• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.

• Local alignments tend to break at transposon insertions, inversions, duplications, etc.

• Global alignments tend to force non-homologous bases to align.

• Chaining is a rigorous way of joining together local alignments into larger structures.

Page 15: Evolution and the Santa Cruz Genome Browser

Chains join together related local alignments

Protease Regulatory Subunit 3

Page 16: Evolution and the Santa Cruz Genome Browser

Affine penalties are too harsh for long gaps

Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
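One way to read the "straight lines" remark, written out as a short derivation. The symbols a (gap-open cost), b (per-base extension cost), L (gap length), and N(L) (number of gaps of length L) are introduced here for illustration and do not appear on the slide. If a gap penalty is interpreted as a negative log-likelihood, an affine penalty predicts a log count that falls linearly with gap length:

```latex
g(L) = a + bL
\quad\Longrightarrow\quad
\log N(L) \approx c - bL \qquad \text{for some constant } c .
```

On that reading, any excess of long gaps over the straight line (for example at transposon-relic sizes) is charged more than the data would suggest, which is the sense in which affine penalties are too harsh for long gaps.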

Page 17: Evolution and the Santa Cruz Genome Browser

Gaps are needed in Both Sequences in the General Case of Pair-Wise Alignment

otherwise non-homologous bases can be forced to pair

Page 18: Evolution and the Santa Cruz Genome Browser

2-D histogram of observed gaps.

The horizontal axis is gap size in human; the vertical axis is gap size in mouse. The logarithm of the gap counts in bins of 10 (left) and bins of 500 (right) is plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.
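The binning described in this caption can be sketched as follows. This is a minimal illustration, not the code used to produce the figure; the function name `gap_histogram` and the input format (pairs of human and mouse gap sizes) are assumptions.

```python
import math
from collections import Counter


def gap_histogram(gaps, bin_size):
    """Bin (human_gap, mouse_gap) pairs and return log10 of the count per cell.

    `gaps` is an iterable of (human_gap, mouse_gap) sizes in bases.
    Keys of the result are (human_bin, mouse_bin) indices; empty cells are absent.
    """
    counts = Counter((h // bin_size, m // bin_size) for h, m in gaps)
    return {cell: math.log10(n) for cell, n in counts.items()}


# Toy data: mostly one-sided gaps, which fall along the axes (bin index 0).
toy_gaps = [(3, 0), (7, 0), (0, 4), (12, 0), (300, 250)]
fine = gap_histogram(toy_gaps, 10)     # bins of 10, as in the left panel
coarse = gap_histogram(toy_gaps, 500)  # bins of 500, as in the right panel
```

Cells whose human or mouse bin index is 0 correspond to the rows and columns nearest the axes, where one-sided gaps concentrate.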

Page 19: Evolution and the Santa Cruz Genome Browser

Before and After Chaining

Page 20: Evolution and the Santa Cruz Genome Browser

Chaining Algorithm

• Input: blocks of gapless alignments from blastz.

• Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$$

• Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
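The recurrence on this slide can be sketched in a few lines. Note the swap: this version is the plain quadratic dynamic program, not the KD-tree accelerated version the slide describes, and the `Block` type, the `gap_cost` function (one point per unaligned base), and the toy scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Gapless alignment block: [t_start, t_end) in one genome, [q_start, q_end) in the other."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    match: float  # score of the block on its own


def gap_cost(prev: Block, cur: Block) -> float:
    """Hypothetical penalty: 1 per unaligned base between two blocks."""
    return (cur.t_start - prev.t_end) + (cur.q_start - prev.q_end)


def chain_blocks(blocks: List[Block]) -> List[Block]:
    """Return the best-scoring chain using the O(N^2) form of the recurrence."""
    if not blocks:
        return []
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.match for b in blocks]  # a block may also start a new chain
    back: List[Optional[int]] = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before `cur` starts in both sequences.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = score[j] + cur.match - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    # Trace back from the best-scoring block.
    k: Optional[int] = max(range(len(blocks)), key=score.__getitem__)
    chain = []
    while k is not None:
        chain.append(blocks[k])
        k = back[k]
    chain.reverse()
    return chain


toy = [
    Block(0, 10, 0, 10, 10.0),
    Block(12, 20, 12, 20, 8.0),
    Block(15, 30, 100, 120, 5.0),  # off-diagonal block, left out of the chain
    Block(25, 35, 22, 32, 10.0),
]
best_chain = chain_blocks(toy)
```

The inner loop over every earlier block is what a KD-tree search would prune; with hundreds of thousands of blocks the quadratic version shown here would be too slow in practice.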

Page 21: Evolution and the Santa Cruz Genome Browser

Netting Alignments

• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.

• Net finds the best mouse match for each human region.

• Highest scoring chains are used first.

• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
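The greedy idea in the bullets above can be sketched roughly as follows. This is a simplified illustration, not the actual netting program: chains are reduced to a name, a score, and a list of human-side block intervals (all hypothetical structures), and the nesting levels of the hierarchy are not tracked.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the human sequence


def subtract(free: List[Interval], start: int, end: int):
    """Split `free` into the parts claimed by [start, end) and the parts left over."""
    claimed, remaining = [], []
    for f_start, f_end in free:
        lo, hi = max(f_start, start), min(f_end, end)
        if lo < hi:
            claimed.append((lo, hi))
            if f_start < lo:
                remaining.append((f_start, lo))
            if hi < f_end:
                remaining.append((hi, f_end))
        else:
            remaining.append((f_start, f_end))
    return claimed, remaining


def simple_net(region: Interval, chains):
    """Assign each human base to the highest-scoring chain that covers it.

    `chains` is a list of (name, score, blocks), where blocks are human intervals.
    Returns a sorted list of (start, end, chain_name).
    """
    free = [region]
    net = []
    # Highest-scoring chains claim sequence first; later chains only fill what is left.
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        for b_start, b_end in blocks:
            claimed, free = subtract(free, b_start, b_end)
            net.extend((s, e, name) for s, e in claimed)
    return sorted(net)


toy_chains = [
    ("paralog", 300, [(30, 70)]),
    ("ortholog", 1000, [(0, 40), (60, 100)]),
]
toy_net = simple_net((0, 100), toy_chains)
```

In the toy input the lower-scoring chain ends up filling only the gap between the two blocks of the higher-scoring one, which is the behavior the last bullet describes.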

Page 22: Evolution and the Santa Cruz Genome Browser

Net Focuses on Ortholog

Page 23: Evolution and the Santa Cruz Genome Browser

Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudogenes.

Page 24: Evolution and the Santa Cruz Genome Browser

Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

Page 25: Evolution and the Santa Cruz Genome Browser

Mouse/Human Rearrangement Statistics

Number of rearrangements of given type per megabase.

Page 26: Evolution and the Santa Cruz Genome Browser

A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Page 27: Evolution and the Santa Cruz Genome Browser

year of the rat - 2008

Rat Genome

Page 28: Evolution and the Santa Cruz Genome Browser

Rat/Mouse/Human Genome-Wide Multiz Alignments Available

Eye lens protein gamma crystallin a. Upstream region (on right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.

Page 29: Evolution and the Santa Cruz Genome Browser

Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.

Page 30: Evolution and the Santa Cruz Genome Browser

Zoom to Base Level

Detail near translation start of tubulin 8

Page 31: Evolution and the Santa Cruz Genome Browser

Zoom to Base Level

Intron consensus sequence visible.

Page 32: Evolution and the Santa Cruz Genome Browser

Zoom to Base Level

Possible alt-splice not consensus and not conserved.

Page 33: Evolution and the Santa Cruz Genome Browser

Tiling the genome in Microarrays

New genes on 21 and 22?

Page 34: Evolution and the Santa Cruz Genome Browser

Cross-hybridization at Work

Zoomed in on right side:

Page 35: Evolution and the Santa Cruz Genome Browser

>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgcctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcggagcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagctgcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgctggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcggctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgccctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggggtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggcagaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcctgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggctaaaagctacaagcgcctcagatataaaagactcctggacggattttcatccagcacagagcagctgaatccatatttggcagctagtggatgggataagaggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccgcggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggccggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtctgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc

200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters
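The FASTA headers on this slide carry the genomic range of each upstream sequence. A small sketch of how such a header might be parsed and checked against the "200 bases" claim is shown below. The function name `parse_header` and the regular expressions are illustrative assumptions, and the length calculation assumes the range is 1-based and inclusive, which is consistent with the numbers shown.

```python
import re

# Matches the name and the "range=chrom:start-end" field of a header line.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+range=(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)"
)
STRAND_RE = re.compile(r"strand=(?P<strand>[+-])")


def parse_header(line: str) -> dict:
    """Extract name, chromosome, coordinates, strand, and length from a header line."""
    m = HEADER_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized header: {line!r}")
    start, end = int(m["start"]), int(m["end"])
    s = STRAND_RE.search(line)
    return {
        "name": m["name"],
        "chrom": m["chrom"],
        "start": start,
        "end": end,
        "strand": s["strand"] if s else None,
        "length": end - start + 1,  # assumes 1-based, inclusive coordinates
    }


header = (">hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 "
          "5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none")
info = parse_header(header)
```

For the first record on the slide, the parsed range spans 200 bases on the minus strand of chr22, matching the slide title.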

Page 36: Evolution and the Santa Cruz Genome Browser

Acknowledgements

Individuals

Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler,

Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins,

LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats.

Everyone else!

Institutions

NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide.

Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers.

UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.

Page 37: Evolution and the Santa Cruz Genome Browser

THE END

Page 38: Evolution and the Santa Cruz Genome Browser

A Cautionary Note

• Infant digestive systems are very permeable and take up antibodies

• ~10% of infants are allergic to cow’s milk based formula

• These infants get soy/corn based formula

• As we engineer plants, let’s be careful what we put in infant formula

Page 39: Evolution and the Santa Cruz Genome Browser

New Algorithms and Data

• ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements.

• Rat genome is browsable and used in rat/mouse/human multiple alignments.

• Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.

Page 40: Evolution and the Santa Cruz Genome Browser

Ideal Gap Penalties

• Would allow gaps in both sequences at once.

• Would penalize long gaps less than affine gap scores.

• Still would be quick to compute.

• We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
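A gap cost of the shape described in the last bullet might look like the sketch below. The breakpoints in `SIZES` and `COSTS` and the value of `BOTH_PENALTY` are made-up illustrative numbers, not the values used for the mouse/human alignments, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical breakpoints: total gap size in bases and the cost at that size.
SIZES = [1, 10, 100, 1000, 10000]
COSTS = [350.0, 600.0, 900.0, 1500.0, 2500.0]
BOTH_PENALTY = 750.0  # hypothetical extra cost when both sequences have a gap


def piecewise_linear(size: int) -> float:
    """Linearly interpolate the cost between breakpoints."""
    if size <= SIZES[0]:
        return COSTS[0]
    # Segment containing `size`; sizes past the last breakpoint extend the final slope.
    k = min(bisect_right(SIZES, size) - 1, len(SIZES) - 2)
    x0, x1 = SIZES[k], SIZES[k + 1]
    y0, y1 = COSTS[k], COSTS[k + 1]
    return y0 + (size - x0) * (y1 - y0) / (x1 - x0)


def gap_cost(d_human: int, d_mouse: int) -> float:
    """Cost of a gap of d_human bases in human and d_mouse bases in mouse."""
    total = d_human + d_mouse
    if total == 0:
        return 0.0
    cost = piecewise_linear(total)
    if d_human > 0 and d_mouse > 0:
        cost += BOTH_PENALTY  # simultaneous gaps are allowed but discouraged
    return cost
```

Because the cost is a table lookup plus one interpolation, it stays quick to compute, and because the slope flattens as the breakpoints grow, a long gap costs far less per base than a short one, unlike a single affine line.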