Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation Jim Kent University of California Santa Cruz


Transcript of Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Page 1: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Jim Kent

University of California Santa Cruz

Page 2: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

A Challenge Every Speaker Faces:

• Who is the audience?

• Bioinformaticians:

– Biologists with bigger, better databases?

– Geeks trading bits for bases?

– Leading edge interdisciplinary super scientists?

Page 3: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Top 5 Reasons Biologists Go Into Bioinformatics

• 5 - Microscopes and biochemistry are so 20th century.

• 4 - Got started purifying proteins, but it turns out the cold room is really COLD.

• 3 - After 23 years of school wanted to make MORE than $23,000/year in a postdoc.

• 2 - Like to swear, @ttracted to $_ Perl #!!

• 1 - Getting carpal tunnel from pipetting

Page 4: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Top 5 Reasons Computer People go into Bioinformatics

• 5 - Bio courses have some females.

• 4 - Human genome stabler than Windows XP

• 3 - Having mastered binary trees, quad trees, and parse trees, ready for phylogenetic trees.

• 2 - Missing heady froth of the internet bubble.

• 1 - Must augment humanity to defeat evil artificial intelligent robots.

Page 5: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

The Paradox of Genomics

How does a long, static, one dimensional string of DNA turn into the remarkably complex, dynamic, and three dimensional human body?


GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATGTAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATAGCCAGCCAGCCACCACAGGCATGAGT

Page 6: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Models and Metaphors

• When trying to understand something we like to build up metaphors and models.

• Computer programs are complex systems that ultimately are built up of 0’s and 1’s. Perhaps they are a model for a genome built of A, C, G and T?

• Human genome lacks documentation, has accumulated 3 billion years of cruft, and does not believe in local variables.

• Therefore we must look to less than straightforward software programs as guides.

Page 7: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Bioperl CORBA module

sub new {
    my ( $class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    my ( $idl, $ior, $orbname ) = $self->_rearrange( [ qw(IDL IOR ORBNAME)], @args);
    $self->{'_ior'} = $ior || 'biocorba.ior';
    $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
    $self->{'_orbname'} = $orbname || 'orbit-local-orb';
    $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
    my $orb = CORBA::ORB_init($orbname);
    my $root_poa = $orb->resolve_initial_references("RootPOA");
    $self->{'_orb'} = $orb;
    $self->{'_rootpoa'} = $root_poa;
    return $self;
}

Page 8: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Obfuscated C

#define c(n,s)case n:s;continue
char x[]="((((((((((((((((((((((",w[]="\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";
char r[]={92,124,47},l[]={2,3,1,0};
char*T[]={" |"," |","%\\|/%"," %%%",""};
char d=1,p=40,o=40,k=0,*a,y,z,g=-1,G,X,**P=&T[4],f=0;
unsigned int s=0;
void u(int i){int n;printf("\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+*w,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t={0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>>3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21])*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<=*w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}
void e(){signal(14,SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}
int main(int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==(int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak");h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k=-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1));c(51,h(2));c(52,h(3));}}

Page 9: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Microsoft Windows

mouse

keyboard

network

elaborate proprietary process

blue screen of death

Page 10: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Looks like metaphor not enough, must study actual cells & DNA


Page 11: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

How DNA is Used by the Cell

Page 12: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Promoter Tells Where to Begin

Different promoters activate different genes in different parts of the body.

Page 13: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

A Computer in Soup

Idealized promoter for a gene involved in making hair. Proteins that bind to specific DNA sequences in the promoter region together turn a gene on or off. These proteins are themselves regulated by their own promoters, leading to a gene regulatory network with many of the same properties as a neural network.
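The neural network comparison can be made concrete with a toy model. The sketch below is only an illustration under strong assumptions: genes are either on (1) or off (0), every gene updates at the same time, and the gene names, weights, and thresholds are all invented for the example rather than taken from any real promoter.

```python
# Toy gene regulatory network: a gene switches on when the weighted sum of
# its currently active regulators reaches a threshold, much like a neuron.
# Gene names, weights, and thresholds are hypothetical.

WEIGHTS = {
    # target: {regulator: weight}; positive = activator, negative = repressor
    "keratin": {"tfA": 1.0, "tfB": 1.0, "repressorC": -2.0},
    "tfB": {"tfA": 1.0},
    "repressorC": {"keratin": 1.0},
}
THRESHOLDS = {"keratin": 2.0, "tfB": 1.0, "repressorC": 1.0}


def step(state):
    """Return the next on/off state of every gene, updated synchronously."""
    nxt = dict(state)  # genes with no listed regulators keep their state
    for gene, regulators in WEIGHTS.items():
        total = sum(w * state.get(reg, 0) for reg, w in regulators.items())
        nxt[gene] = 1 if total >= THRESHOLDS[gene] else 0
    return nxt


def simulate(state, n_steps):
    """Return the list of states visited, starting with the initial state."""
    history = [state]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history


if __name__ == "__main__":
    start = {"tfA": 1, "tfB": 0, "keratin": 0, "repressorC": 0}
    for t, s in enumerate(simulate(start, 5)):
        print(t, s)
```

In this toy the hypothetical hair gene needs both activators before it turns on, and its own product then switches on a repressor that shuts it off again, so even three genes produce a delayed pulse rather than a simple on/off response.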

Page 14: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Genes can be transcription factors that activate or repress other genes, leading to regulatory networks such as this one from the development of the central nervous system. (Image from D’Haeseleer and Somogyi 1999)

Page 15: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

The Decisions of a Cell

• When to reproduce?

• When to migrate and where?

• What to differentiate into?

• When to secrete something?

• When to make an electrical signal?

The more rapid decisions usually are via the cell membrane and 2nd messengers. The longer acting decisions are usually made in the nucleus.

Page 16: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Nucleus Used to Appear Simple

• Cheek cells stained with basic dyes. Nuclei are readily visible.


Page 17: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation


Mammalian Nuclei Stained in Various Ways

Image from Tom Misteli lab

Page 18: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Artist’s rendition of nucleus


Image from nuclear protein database

Page 19: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation


Chromatin

Page 20: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Turning on a gene:

• Getting DNA into the right compartment of the nucleus (may involve very diffuse signals in DNA over very long distances)

• Loosening up chromatin structure (this involves activators and repressors which can act over relatively long distances)

• Attracting RNA Polymerase II to the transcription start site (this involves relatively close factors both upstream and downstream of transcription start).

Page 21: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Methods for Studying Transcription

• Genetics in model organisms

• Promoters hooked to reporter genes

• Gel shifts and DNase footprinting.

• Phylogenetic footprinting

• Motif searches in clusters of coregulated genes.
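The last method in the list, motif searching in coregulated genes, can be sketched as a counting exercise. The code below is a deliberately simplified stand-in, not the method used in the talk: it looks only for exact k-mers, ignores the reverse strand, and ranks each k-mer by how much more often it appears in the cluster's promoters than in a background set. The function names and the example sequences are made up for illustration; practical motif finders use probabilistic models and proper statistics.

```python
from collections import Counter


def kmers_present(seq, k):
    """Return the set of distinct k-mers occurring in seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(cluster, background, k=6):
    """Rank k-mers by enrichment in cluster promoters over background.

    Returns (kmer, fraction_of_cluster, fraction_of_background) tuples,
    most enriched first. Each promoter counts a k-mer at most once.
    """
    in_cluster = Counter()
    for seq in cluster:
        in_cluster.update(kmers_present(seq, k))
    in_background = Counter()
    for seq in background:
        in_background.update(kmers_present(seq, k))

    ranked = []
    for kmer, n in in_cluster.items():
        frac_c = n / len(cluster)
        frac_b = in_background[kmer] / len(background) if background else 0.0
        ranked.append((kmer, frac_c, frac_b))
    # Largest (cluster - background) difference first, ties broken by name.
    ranked.sort(key=lambda r: (r[2] - r[1], r[0]))
    return ranked


if __name__ == "__main__":
    # Invented promoters for a hypothetical cluster and background set.
    cluster = ["GGCTATAAAGCC", "TTTATAAACGGA", "ACGTTATAAAGT"]
    background = ["GGCCGGCCGGCC", "ACGTACGTACGT", "TTGGCCAATTGG"]
    for kmer, frac_c, frac_b in shared_kmers(cluster, background)[:3]:
        print(kmer, round(frac_c, 2), round(frac_b, 2))
```

Counting presence per promoter rather than total occurrences keeps one repeat-rich sequence from dominating the ranking.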

Page 22: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Drosophila Genetics


normal / Antennapedia mutant


Page 23: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Reporter Gene Constructs

promoter to study easily seen gene


Drosophila embryo transfected with ftz promoter hooked up to lacZ reporter gene, creating stripes where ftz promoter is active.

Page 24: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation


Txn factor footprint

Gel showing selective protection of DNA from nuclease digestion where transcription factor is bound.

Biochemical Footprinting Assays

Page 25: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Pseudogenes

Page 26: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Creative Chaos & Genome

Page 27: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Finding Transcription Start

Page 28: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Phylogenetic Footprinting
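The idea behind phylogenetic footprinting is that functional regulatory sequence tends to stay conserved between species while the surrounding sequence drifts. A minimal sketch of that idea follows, assuming the two sequences have already been aligned (the alignment step itself is out of scope here). The function names, window size, and identity cutoff are arbitrary choices for illustration.

```python
def window_identity(aln_a, aln_b, window=10):
    """Percent identity of each sliding window over a pairwise alignment.

    aln_a and aln_b are equal-length alignment rows using '-' for gaps.
    Columns containing a gap count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    scores = []
    for start in range(len(matches) - window + 1):
        scores.append(100.0 * sum(matches[start:start + window]) / window)
    return scores


def conserved_blocks(scores, window, min_identity=80.0):
    """Merge windows scoring >= min_identity into half-open (start, end) spans."""
    blocks = []
    for start, score in enumerate(scores):
        if score < min_identity:
            continue
        end = start + window
        if blocks and start <= blocks[-1][1]:
            blocks[-1] = (blocks[-1][0], end)  # extend the previous block
        else:
            blocks.append((start, end))
    return blocks


if __name__ == "__main__":
    # Invented alignment rows: a conserved stretch followed by diverged sequence.
    species_a = "ACGTACGTACTTTTTTTTTT"
    species_b = "ACGTACGTACGGGGGGGGGG"
    scores = window_identity(species_a, species_b, window=5)
    print(conserved_blocks(scores, window=5, min_identity=100.0))
```

Candidate regulatory regions are then the conserved blocks that fall outside known coding exons.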

Page 29: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Mouse Paints Some Promoters

RefSeq

Spliced EST

Mouse

Fish

Repeat

Crystallin - a gene expressed in the eye. Coding regions are very similar to crystallins in the liver, but the promoter is different.

Page 30: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Normalized eScores

Page 31: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Mouse/Human Chrom 7 Synteny

Page 32: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Motifs in Coregulated Genes

Page 33: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Conservation Levels of Regulatory Regions

Page 34: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Transition from Private Research Interests to Role in Genome Project

Page 35: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Assembly War Story

Page 36: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Building a Better Browser

Page 37: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Pretty Adventurous Programming

Page 38: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Genome Browser

BLAT

Gene Sorter

Table Browser

Service Organization

Page 39: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Parasol and Kilo Cluster

• UCSC cluster has 1000 CPUs running Linux

• 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment

• We wrote Parasol job scheduler to keep up.

– Very fast and free.

– Jobs are organized into batches.

– Error checking at job and at batch level.
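The figures above imply a workload of roughly 90 CPU-seconds per job (1000 CPUs × 25 hours ÷ 1,000,000 jobs), or about 11 job completions per second, assuming every CPU stays busy. The sketch below illustrates the two ideas on the slide, batches of jobs and error checking at both the job and the batch level. It is not Parasol's interface or implementation: the `Job`, `BatchReport`, `run_job`, and `run_batch` names are hypothetical, and a local thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


def cpu_seconds_per_job(cpus, hours, jobs):
    """Average CPU time per job if every CPU is busy for the whole run."""
    return cpus * hours * 3600 / jobs


@dataclass
class Job:
    name: str
    run: Callable[[], object]        # does the work and returns a result
    check: Callable[[object], bool]  # job-level check on that result


@dataclass
class BatchReport:
    ok: List[str]
    failed: List[str]

    @property
    def complete(self):
        """Batch-level check: the batch succeeds only if no job failed."""
        return not self.failed


def run_job(job, max_tries):
    """Run one job, retrying after an exception or a failed check."""
    for _ in range(max_tries):
        try:
            if job.check(job.run()):
                return True
        except Exception:
            pass  # treat a crash like a failed check and try again
    return False


def run_batch(jobs, workers=4, max_tries=2):
    """Run all jobs on a pool of workers and summarise the outcome."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda j: run_job(j, max_tries), jobs))
    report = BatchReport(ok=[], failed=[])
    for job, good in zip(jobs, results):
        (report.ok if good else report.failed).append(job.name)
    return report


if __name__ == "__main__":
    print(cpu_seconds_per_job(cpus=1000, hours=25, jobs=1_000_000))
    batch = [
        Job("align_1", lambda: "output", lambda r: r == "output"),
        Job("align_2", lambda: 1 / 0, lambda r: True),  # always crashes
    ]
    report = run_batch(batch)
    print(report.ok, report.failed, report.complete)
```

Keeping the check separate from the job itself means a job that exits cleanly but produces bad output is still caught, which matters when a million jobs are too many to inspect by hand.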

Page 40: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Acknowledgements

Individuals Institutions

NHGRI, The Wellcome Trust, HHMI, Taxpayers in the US and worldwide.

Whitehead, Sanger, Wash U, Baylor, Stanford, DOE, and the international sequencing centers.

NCBI, Ensembl, Genoscope, The SNP Consortium, UCSC, Softberry, Affymetrix.

David Haussler, Chuck Sugnet

Francis Collins, Bob Waterston, Eric Lander, John Sulston, Richard Gibbs

Lincoln Stein, Sean Eddy, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, Greg Schuler, Deanna Church, Asif Chinwalla, Kim Worley, the Gene Cats.

Everyone else!

Page 41: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

THE END

Page 42: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

gctcgttcaggggtaaaggtgtattctagatCCACAACAAGCCCCGTGGTCTAGCACAGC AAAGAGAAAAAAAGAGAACACGAAAATGCCCTTGCTCCCCTCCGGGGGCCCCTTTTGTGC GGTTCTTGCCAACGCAGCAGCCCTCCTGCTATATAGCCCGCCGCGCCgCAGCCCCACCCG CTCAGCGCCGCCGCCCCACCAGCTCAGCACCGCCGTGCGCCCAGCCAGCCATGGGGAAGG TGAGCCCAGCCTGCGCCCCGGGACCCCGGAGCTTCCTCCATCGCGGGGGCCAGAGACTGG GGCAGGAGCAGGCCTGTGAGACCTCGCCTTGTCCCGCCTTGCCTTGCAGATCACCCTCTA CGAGGACCGGGGCTTCCAGGGCCGCCACTATGAATGCAGCAGCGACCACCCCAACCTGCA GCCCTACTTGAGCCGCTGCAACTCGGCGCGCGTGGACAGCGGCTGCTGGATGCTCTATGA GCAGCCCAACTACTCGGGCCTCCAGTACTTCCTGCGCCGCGGCGACTATGCCGACCACCA GCAGTGGATGGGCCTCAGCGACTCGGTCCGCTCCTGCCGCCTCATCCCCCACGTGAGTAC ATCCTCAAGTCAGGACCCAGGCCCTCAGGACACTCACTGGAtgGTTTCAAGCAAAAGTTA AACATTAGAAGTAGTGATCAGTcacaataaCTGAGAGTGGACAAAAGATGAACTATAGTG GATTAAGTCAATAGagttTGCTCCCCACATAAGCAAAGTATTACCCAGACAcCAGTTAAT caCAATTAATCCACAAATATGTATTGAGTAGGAATGTGTCTCCTGCCctAGGGGTTGTAT

Coloring CRYGD Start

Page 43: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Trends in Society & Biology

Decade | Society | Biology

50’s | Cars are good | Mitochondria and metabolism

60’s | Recording | DNA as recording media of genes

70’s | Birth control | Working out the cell cycle

80’s | Yuppies | Start of serious genetic engineering

90’s | Microsoft rules | Incyte, Celera race to patent genome

2000’s | |

Page 44: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation
Page 45: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

(The NEED for Bioinformatics)

• ~200 million bases of DNA are sequenced every day.

– Not much use without assembly.

• Protein and non-sequence data also being generated at a prodigious rate.

– How to store it and find the parts you want?

• Making models that are simple enough to understand, but rich enough to reflect the biology.
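To see why raw sequence is "not much use without assembly", it helps to picture what assembly does: it stitches short overlapping reads back into one longer sequence. The sketch below is a toy greedy overlap merger, offered only as an illustration and not as the approach used for the human genome. It assumes error-free reads, a single strand, and no repeats, all of which real assemblers have to handle. The function names are hypothetical, and the reads are cut by hand from the start of the DNA string shown earlier in the talk.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0.

    Overlaps shorter than min_len are ignored.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the remaining contigs; a single contig means every read joined.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


if __name__ == "__main__":
    reads = ["TTTGCTGCTCTAGGG", "GTTTGCCATC", "CCATCTTTTGC"]
    print(greedy_assemble(reads))
```

The all-against-all comparison is quadratic in the number of reads, which is one reason this toy does not scale to hundreds of millions of bases a day.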

Page 46: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

(My Road to a Bio PhD)

• Liked bio, but too many prerequisites!

• Had fun doing graphics/animation programming in 80’s & early 90’s.

• Bored of endlessly shifting Microsoft APIs

• Community college, UC extension to get bio BA equivalent in 97 & 98.

• UC Santa Cruz bio grad school 1999

• Interested in developmental biology and how a cell makes decisions.

Page 47: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Perhaps Must Study Actual Cells


Page 48: Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation

Spaghetti Code or Soupy Logic


Steaming fresh modules in sourceforge.net

Combinatorial assembly of transcription factors in cell.