Spaghetti Code, Soupy Logic adventures in gene expression & genome annotation Jim Kent University of California Santa Cruz
  • date post

    21-Dec-2015

Page 1:

Spaghetti Code, Soupy Logic

adventures in gene expression & genome annotation

Jim Kent

University of California Santa Cruz

Page 2:

A Challenge Every Speaker Faces:

• Who is the audience?

• Bioinformaticians:

– Biologists with bigger, better databases?

– Geeks trading bits for bases?

– Leading edge interdisciplinary super scientists?

Page 3:

Top 5 Reasons Biologists Go Into Bioinformatics

• 5 - Microscopes and biochemistry are so 20th century.

• 4 - Got started purifying proteins, but it turns out the cold room is really COLD.

• 3 - After 23 years of school wanted to make MORE than $23,000/year in a postdoc.

• 2 - Like to swear, @ttracted to $_ Perl #!!

• 1 - Getting carpal tunnel from pipetting

Page 4:

Top 5 Reasons Computer People go into Bioinformatics

• 5 - Bio courses have some females.

• 4 - Human genome stabler than Windows XP

• 3 - Having mastered binary trees, quad trees, and parse trees ready for phylogenic trees.

• 2 - Missing heady froth of the internet bubble.

• 1 - Must augment humanity to defeat evil artificial intelligent robots.

Page 5:

The Paradox of Genomics

How does a long, static, one dimensional string of DNA turn into the remarkably complex, dynamic, and three dimensional human body?


GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATGTAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATAGCCAGCCAGCCACCACAGGCATGAGT

Page 6:

Models and Metaphors

• When trying to understand something we like to build up metaphors and models.

• Computer programs are complex systems that ultimately are built up of 0’s and 1’s; perhaps they are a model for a genome built of A, C, G and T?

• Human genome lacks documentation, has accumulated 3 billion years of cruft, and does not believe in local variables.

• Therefore we must look to less than straightforward software programs as guides.

Page 7:

Bioperl CORBA module

sub new {
    my ( $class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    my ( $idl, $ior, $orbname ) = $self->_rearrange( [ qw(IDL IOR ORBNAME)], @args);
    $self->{'_ior'} = $ior || 'biocorba.ior';
    $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
    $self->{'_orbname'} = $orbname || 'orbit-local-orb';
    $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
    my $orb = CORBA::ORB_init($orbname);
    my $root_poa = $orb->resolve_initial_references("RootPOA");
    $self->{'_orb'} = $orb;
    $self->{'_rootpoa'} = $root_poa;
    return $self;
}

Page 8:

Obfuscated C

#define c(n,s)case n:s;continue
char x[]="((((((((((((((((((((((",w[]="\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";
char r[]={92,124,47},l[]={2,3,1,0};
char*T[]={" |"," |","%\\|/%"," %%%",""};
char d=1,p=40,o=40,k=0,*a,y,z,g=-1,G,X,**P=&T[4],f=0;
unsigned int s=0;
void u(int i){int n;printf("\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+*w,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t={0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>>3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21])*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<=*w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}
void e(){signal(14,SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}
int main(int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==(int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak");h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k=-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1));c(51,h(2));c(52,h(3));}}

Page 9:

Microsoft Windows

mouse

keyboard

network

elaborate proprietary process

blue screen of death

Page 10:

Looks like metaphor not enough, must study actual cells & DNA

Page 11:

How DNA is Used by the Cell

Page 12:

Promoter Tells Where to Begin

Different promoters activate different genes in different parts of the body.

Page 13:

A Computer in Soup

Idealized promoter for a gene involved in making hair. Proteins that bind to specific DNA sequences in the promoter region together turn a gene on or off. These proteins are themselves regulated by their own promoters, leading to a gene regulatory network with many of the same properties as a neural network.
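The idea that transcription factors regulate one another's promoters can be sketched as a tiny Boolean network. This is an illustrative toy, not a model from the talk: the gene names (`signal`, `activator`, `repressor`, `hair_gene`) and the update rules are assumptions invented for the example.

```python
# Toy Boolean gene regulatory network (hypothetical genes and rules).
# Each gene is on (True) or off (False). Each rule reads the current
# state and returns the gene's next state, loosely like a promoter
# integrating the transcription factors bound to it.

RULES = {
    # 'signal' is an external input and simply holds its value.
    "signal": lambda s: s["signal"],
    # The activator is made whenever the signal is present.
    "activator": lambda s: s["signal"],
    # The repressor is made once the hair gene has been on.
    "repressor": lambda s: s["hair_gene"],
    # The hair gene needs the activator and no repressor.
    "hair_gene": lambda s: s["activator"] and not s["repressor"],
}


def step(state):
    """Synchronously update every gene from the current state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def simulate(state, n_steps):
    """Return the list of states visited, starting with the input state."""
    history = [state]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history


if __name__ == "__main__":
    start = {"signal": True, "activator": False,
             "repressor": False, "hair_gene": False}
    for t, s in enumerate(simulate(start, 5)):
        # One row per time step, one column per gene in RULES order.
        print(t, "".join("1" if s[g] else "0" for g in RULES))
```

With these made-up rules the negative feedback from `repressor` makes `hair_gene` switch on and off in a cycle, which is the kind of behavior that makes such networks resemble small logic circuits.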

Page 14:

Genes can be transcription factors that activate or repress other genes, leading to regulatory networks such as this one from the development of the central nervous system. (Image from D’Haeseleer Somogyi 1999)

Page 15:

The Decisions of a Cell

• When to reproduce?

• When to migrate and where?

• What to differentiate into?

• When to secrete something?

• When to make an electrical signal?

The more rapid decisions usually are via the cell membrane and 2nd messengers. The longer acting decisions are usually made in the nucleus.

Page 16:

Nucleus Used to Appear Simple

• Cheek cells stained with basic dyes. Nuclei are readily visible.

Page 17:

Mammalian Nuclei Stained in Various Ways

Image from Tom Misteli lab

Page 18:

Artist’s rendition of nucleus


Image from nuclear protein database

Page 19:

Chromatin

Page 20:

Turning on a gene:

• Getting DNA into the right compartment of the nucleus (may involve very diffuse signals in DNA over very long distances)

• Loosening up chromatin structure (this involves activators and repressors which can act over relatively long distances)

• Attracting RNA Polymerase II to the transcription start site (these involve relatively close factors both upstream and downstream of transcription start).

Page 21:

Methods for Studying Transcription

• Genetics in model organisms

• Promoters hooked to reporter genes

• Gel shifts and DNase footprinting

• Phylogenic footprinting

• Motif searches in clusters of coregulated genes.
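The last method in this list, looking for motifs shared by coregulated genes, can be sketched very roughly as counting short words (k-mers) that appear in every promoter of a cluster. The sequences and the function names below are made up for illustration, and exact k-mer matching is an assumption standing in for the statistical models that real motif finders use.

```python
from collections import Counter


def shared_kmers(promoters, k):
    """Count, for each k-mer, how many promoter sequences contain it."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # Use a set so each k-mer counts at most once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


def candidate_motifs(promoters, k, min_fraction=1.0):
    """Return k-mers present in at least min_fraction of the promoters."""
    needed = min_fraction * len(promoters)
    counts = shared_kmers(promoters, k)
    return sorted(kmer for kmer, c in counts.items() if c >= needed)


if __name__ == "__main__":
    # Hypothetical promoter fragments from a cluster of coregulated genes.
    cluster = ["AAGCTATAAAGGC", "CCTATAAATTG", "GTTTATAAACC"]
    print(candidate_motifs(cluster, 6))
```

On these three toy fragments the only 6-mer found in all of them is `TATAAA`. A word that is common everywhere in the genome would also pass this filter, which is why practical tools compare against background frequencies.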

Page 22:

Drosophila Genetics

normal

antennapedia mutant

Page 23:

Reporter Gene Constructs

promoter to study | easily seen gene

Drosophila embryo transfected with ftz promoter hooked up to lacZ reporter gene, creating stripes where ftz promoter is active.

Page 24:

Txn factor footprint

Gel showing selective protection of DNA from nuclease digestion where transcription factor is bound.

Biochemical Footprinting Assays

Page 25:

Pseudogenes

Page 26:

Creative Chaos & Genome

Page 27:

Finding Transcription Start

Page 28:

Phylogenic Footprinting

Page 29:

Mouse Paints Some Promoters

Tracks shown: RefSeq, Spliced EST, Mouse, Fish, Repeat

Crystallin - a gene expressed in the eye. Coding regions are very similar to crystallins in the liver, but the promoter is different.

Page 30:

Normalized eScores

Page 31:

Mouse/Human Chrom 7 Synteny

Page 32:

Motifs in Coregulated Genes

Page 33:

Conservation Levels of Regulatory Regions

Page 34:

Transition from Private Research Interests to Role in Genome Project

Page 35:

Assembly War Story

Page 36:

Building a Better Browser

Page 37:

Pretty Adventurous Programming

Page 38:

Service Organization

Genome Browser

BLAT

Gene Sorter

Table Browser

Page 39:

Parasol and Kilo Cluster

• UCSC cluster has 1000 CPUs running Linux

• 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment

• We wrote Parasol job scheduler to keep up.

– Very fast and free.

– Jobs are organized into batches.

– Error checking at job and at batch level.
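The combination of batches with checks at both the job and the batch level can be sketched as follows. This is not Parasol's interface or implementation: the function names, the `(name, func, check)` job shape, and the serial loop are all assumptions made for illustration (a real scheduler would dispatch jobs across cluster nodes).

```python
# Hypothetical batch runner illustrating per-job and per-batch checks.
# This is NOT Parasol's interface; every name here is invented.

def run_batch(jobs):
    """Run each job in the batch and check its output.

    jobs is a list of (name, func, check) tuples. func() produces a
    result and check(result) returns True if the output looks valid.
    Returns (results, failed_names).
    """
    results, failed = {}, []
    for name, func, check in jobs:
        try:
            out = func()
        except Exception:
            failed.append(name)   # job-level error: the job crashed
            continue
        if check(out):
            results[name] = out
        else:
            failed.append(name)   # job-level error: output failed its check
    return results, failed


def batch_ok(jobs, results, failed):
    """Batch-level check: every job produced a valid result."""
    return not failed and len(results) == len(jobs)


if __name__ == "__main__":
    jobs = [
        ("chunk1", lambda: "ACGT"[::-1], lambda s: len(s) == 4),
        ("chunk2", lambda: 1 / 0, lambda s: True),  # deliberately crashes
    ]
    results, failed = run_batch(jobs)
    print(results, failed, batch_ok(jobs, results, failed))
```

Checking at both levels means a crashed or truncated job is caught individually, while the batch check catches the case where the whole run is incomplete even though each finished job looks fine.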

Page 40:

Acknowledgements

Individuals

David Haussler, Chuck Sugnet

Francis Collins, Bob Waterston, Eric Lander, John Sulston, Richard Gibbs

Lincoln Stein, Sean Eddy, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, Greg Schuler, Deanna Church, Asif Chinwalla, Kim Worley, the Gene Cats.

Everyone else!

Institutions

NHGRI, The Wellcome Trust, HHMI, Taxpayers in the US and worldwide.

Whitehead, Sanger, Wash U, Baylor, Stanford, DOE, and the international sequencing centers.

NCBI, Ensembl, Genoscope, The SNP Consortium, UCSC, Softberry, Affymetrix.

Page 41:

THE END

Page 42:

gctcgttcaggggtaaaggtgtattctagatCCACAACAAGCCCCGTGGTCTAGCACAGC AAAGAGAAAAAAAGAGAACACGAAAATGCCCTTGCTCCCCTCCGGGGGCCCCTTTTGTGC GGTTCTTGCCAACGCAGCAGCCCTCCTGCTATATAGCCCGCCGCGCCgCAGCCCCACCCG CTCAGCGCCGCCGCCCCACCAGCTCAGCACCGCCGTGCGCCCAGCCAGCCATGGGGAAGG TGAGCCCAGCCTGCGCCCCGGGACCCCGGAGCTTCCTCCATCGCGGGGGCCAGAGACTGG GGCAGGAGCAGGCCTGTGAGACCTCGCCTTGTCCCGCCTTGCCTTGCAGATCACCCTCTA CGAGGACCGGGGCTTCCAGGGCCGCCACTATGAATGCAGCAGCGACCACCCCAACCTGCA GCCCTACTTGAGCCGCTGCAACTCGGCGCGCGTGGACAGCGGCTGCTGGATGCTCTATGA GCAGCCCAACTACTCGGGCCTCCAGTACTTCCTGCGCCGCGGCGACTATGCCGACCACCA GCAGTGGATGGGCCTCAGCGACTCGGTCCGCTCCTGCCGCCTCATCCCCCACGTGAGTAC ATCCTCAAGTCAGGACCCAGGCCCTCAGGACACTCACTGGAtgGTTTCAAGCAAAAGTTA AACATTAGAAGTAGTGATCAGTcacaataaCTGAGAGTGGACAAAAGATGAACTATAGTG GATTAAGTCAATAGagttTGCTCCCCACATAAGCAAAGTATTACCCAGACAcCAGTTAAT caCAATTAATCCACAAATATGTATTGAGTAGGAATGTGTCTCCTGCCctAGGGGTTGTAT

Coloring CRYGD Start

Page 43:

Trends in Society & Biology

Decade   Society           Biology
50’s     Cars are good     Mitochondria and metabolism
60’s     Recording         DNA as recording media of genes
70’s     Birth control     Working out the cell cycle
80’s     Yuppies           Start of serious genetic engineering
90’s     Microsoft rules   Incyte, Celera race to patent genome
2000’s

Page 44:

Page 45:

(The NEED for Bioinformatics)

• ~200 million bases of DNA are sequenced every day.

– Not much use without assembly.

• Protein and non-sequence data also being generated at a prodigious rate.

– How to store it and find the parts you want?

• Making models that are simple enough to understand, but rich enough to reflect the biology.

Page 46:

(My Road to a Bio PhD)

• Liked bio, but too many prerequisites!

• Had fun doing graphics/animation programming in 80’s & early 90’s.

• Bored of endlessly shifting Microsoft APIs

• Community college, UC extension to get bio BA equivalent in 97 & 98.

• UC Santa Cruz bio grad school 1999

• Interested in developmental biology and how a cell makes decisions.

Page 47:

Perhaps Must Study Actual Cells

Page 48:

Spaghetti Code or Soupy Logic


Steaming fresh modules in sourceforge.net

Combinatorial assembly of transcription factors in cell.