Decoding ENCODE

Decoding ENCODE Jim Kent University of California Santa Cruz


Page 1: Decoding ENCODE

Decoding ENCODE

Jim Kent
University of California Santa Cruz

Page 2: Decoding ENCODE

ENCODE Timeline

• ENCyclopedia of DNA Elements.
– Attempt to catalog as many functional elements in human genome as possible using current technologies.
– Pilot project - finished 2007, covered 1% of genome.
– Production project - ramping up now. Genome-wide. Should have major amounts of data in 6 months.

Page 3: Decoding ENCODE

ENCODE Experiments

• Chromatin state:
– DNase hypersensitivity assays
– Chromatin Immunoprecipitation (ChIP)
    • Histones in various methylation states
    • Sequence-specific transcription factors
– DNA methylation
– Chromatin conformation capture (5C)
• Functional RNA discovery
– Nuclear & cytoplasmic, short & long
– RNA Immunoprecipitation
• Comparative Genomics
• Human curated gene annotation

Page 4: Decoding ENCODE

Role of UCSC

• Display data in context of what else is known on the UCSC Genome Browser and in other tools.

• Facilitate analysis of the data with both Web-based and command line tools.

Page 5: Decoding ENCODE

A Peek at the Pilot Project

Page 6: Decoding ENCODE

ENCODE pilot data at genome.ucsc.edu

Page 7: Decoding ENCODE

Correlation at gene starts in enr221

Page 8: Decoding ENCODE

Transcription at enm221

Page 9: Decoding ENCODE

ENCODE Chromatin Immunoprecipitation

Page 10: Decoding ENCODE

Scientific Highlights of Pilot

• Transcription:
– Lots of transcription outside of known genes.
– Outside of known genes transcribed areas not very well conserved across species.
– Lots of rare splice variants, also poorly conserved.
• DNA/Protein Interactions
– Good correlation between histone markers, gene starts, and _active_ transcription.
– Lots of “occupied transcription factor binding sites” not conserved, near promoters etc.
• Biological noise?
– Main controversy was whether to explain much of the data as “biological noise” that was tolerated but not necessary for function.

Page 11: Decoding ENCODE

From Pilot to Production Phase

Page 12: Decoding ENCODE

ENCODE Production Phase

• Moving from microarray based assays to assays based on next-generation sequencing. (ChIP-chip to ChIP-seq)
• Genome-wide rather than regional.
• Broader set of cell lines used more consistently between labs.
• Broader set of antibodies.
• Some new technology development continues.

Page 13: Decoding ENCODE

ENCODE Cell Lines

• Tier 1 - used in ALL experiments
– GM12878 (lymphoblastoid cell line)
– K562 (chronic myeloid leukemia)
• Tier 2 - used in most experiments
– HepG2 (hepatocellular carcinoma)
– HeLa-S3 (cervical carcinoma)
– HUVEC (umbilical vein endothelial cells)
– Keratinocyte (normal epidermal cells)
– Likely will do an embryonic stem cell too.
• Tier 3 - used in one or two experiments
– Many of these for assays such as DNase hypersensitivity and RNA measurements, where we don’t have to do a separate experiment for each antibody.

Page 14: Decoding ENCODE

Simple Model of Eukaryotic Transcription Regulation

• Initially chromatin “opened” to allow transcription factors to access DNA

• Multiple transcription factors bind to DNA in combination.
– Most factors have such small DNA binding sites that by themselves they are not specific, nor is the binding even stable.

• The right combination of factors in open chromatin leads to active transcription starting at the initiation complex.

• With the ENCODE experiments we can directly test most aspects of this model.

Page 15: Decoding ENCODE

Chromatin Experiments

• In general applied across a large number of cell lines.

• DNaseI hypersensitivity
• Formaldehyde Assisted Isolation of Regulatory Elements
• Methylation of CpG Islands
• ChIP-seq of relevant factors
– H3K4me1,2,3, H3K9me3, H4K20me3, H3K27me3, H3K36me3, RPol-II, etc.

Page 16: Decoding ENCODE

Transcription Factor ChIP

• Many antibodies in modest number of cell lines.

• Limited by good antibodies, hope for 100 or more.

• Current good antibodies include
– E2F1, E2F4, E2F6, KAP1, L3MBTL2, STAT1, CtBP1, CtBP2, SETDB1, ZNF180, ZNF239, ZNF263, ZNF266, ZNF317, ZNF342

• Part of project pipeline for raising and testing antibodies.

Page 17: Decoding ENCODE

RNA measurement

• RNA-seq of poly-A selected RNA to measure mRNA levels in many cell lines.
• Sequencing of G-cap selected tags (CAGE)
• Sequencing 5’ and 3’ ends (paired end tags)
• Measurement of RNAs of several types in several cell compartments of a few cell lines.
– Long/short, polyA/nonPolyA, associated with proteins/not associated with proteins
– Nucleus, cytosol, polysomes, chromatin, nucleolus

Page 18: Decoding ENCODE

New Pilot Projects Starting to Sprout

Page 19: Decoding ENCODE

New Pilot Projects

• Immunoprecipitation of RNA binding proteins/RNA sequencing.
• Mapping silencers and enhancers with transient transfection assays
• Computational identification of active promoters
• Deep comparative sequencing in targeted regions and conservation analysis.
• Chromatin Conformation Capture Carbon Copy (5C) to capture long range regulatory elements and their targets.

Page 20: Decoding ENCODE

ENCODE Timeline

• Grants funded for 4 years starting Sept 2007.

• First production data just now starting to roll into UCSC, not quite ready for public display.

• Data should accumulate quickly over next few years.

Page 21: Decoding ENCODE

Data Release Policy

• Once have reproducible data (where at least 2 of 3 replicates agree) should be released to public within a month.
• Data is still considered pre-publication!
– Ok to publish a paper using data on a few genes.
– Please wait for consortium papers before papers doing full genome analysis.
– Anyone can join ENCODE consortium analysis group to help us write the papers.
– We just have ~1 year after data release to write papers, after that fair game to publish full genome analysis.
– If in doubt please contact consortium via UCSC.

Page 22: Decoding ENCODE

Web Works for Mice and Men

Page 23: Decoding ENCODE

Mouse ES Cell Chromatin IP

• Brad Bernstein lab ChIP-seq based experiment on methylated histones now on UCSC Genome Browser.

• Shows some of the user interfaces that will be used for the ENCODE data

Page 24: Decoding ENCODE
Page 25: Decoding ENCODE

List of mouse chromatin subtracks….

Page 26: Decoding ENCODE
Page 27: Decoding ENCODE

Signal densities of entire mouse chromatin data set.

Page 28: Decoding ENCODE

The unending quest for genes

Page 29: Decoding ENCODE

Gencode Project

• Project to define structure (exons and introns) for all common splice variants of all genes.
• Human curators merge many lines of evidence including
– Computational gene predictions
– RNA/DNA alignments
– Paired end tags
– Cross-species alignments
– Possibly chromatin state data
• PI is Tim Hubbard
• Much of the work done by Havana group

Page 30: Decoding ENCODE

Data Mining with Table Browser


Page 31: Decoding ENCODE

Table Browser

• Complete access to UCSC Database with results in tab-delimited format

• Method for creating “custom tracks” by combining and filtering existing tracks.

• Sample query - getting a table of Ensembl gene coordinates and associated Superfamily annotations.

Page 32: Decoding ENCODE
Page 33: Decoding ENCODE
Page 34: Decoding ENCODE
Page 35: Decoding ENCODE
Page 36: Decoding ENCODE
Page 37: Decoding ENCODE

Selected fields from related tables results: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).

Page 38: Decoding ENCODE

Table Browser Filters

• Getting list of Ensembl genes that have SH3 domains.

Page 39: Decoding ENCODE
Page 40: Decoding ENCODE
Page 41: Decoding ENCODE
Page 42: Decoding ENCODE
Page 43: Decoding ENCODE
Page 44: Decoding ENCODE

Table Browser Intersection

• Getting list of Ensembl genes that don’t intersect UCSC Known Genes

Page 45: Decoding ENCODE
Page 46: Decoding ENCODE
Page 47: Decoding ENCODE
Page 48: Decoding ENCODE

Custom Track Output

• Useful for visualizing results of queries in genome browser

• The way to produce more complex queries.

• Here we look at how well genes that are Ensembl but not UCSC are conserved across species.

Page 49: Decoding ENCODE
Page 50: Decoding ENCODE
Page 51: Decoding ENCODE
Page 52: Decoding ENCODE
Page 53: Decoding ENCODE
Page 54: Decoding ENCODE

681/3329 (20%) of Ensembl not known also not conserved
1728/33,666 (5%) of Ensembl in general not conserved

Page 55: Decoding ENCODE

UCSC Gene Sorter

• Swiss army knife for dealing with gene sets.

• Highlights relationships and connections between genes.

• Powerful data mining tool.

Page 56: Decoding ENCODE

Cytochrome P450 - a gene family important in drug metabolism. The family is related in many ways. Sorted by protein homology.

Page 57: Decoding ENCODE

Various sorting methods let you focus on different types of relationships between genes.

Page 58: Decoding ENCODE

Sorting by gene distance is a quick way to browse candidate genes in a region.

Page 59: Decoding ENCODE

Clicking on row # or gene name selects that gene.

Page 60: Decoding ENCODE
Page 61: Decoding ENCODE
Page 62: Decoding ENCODE

Configuration page controls column order and display options.

Page 63: Decoding ENCODE

Also you can upload your own columns here.

Page 64: Decoding ENCODE

Controlling expression display

Page 65: Decoding ENCODE

GNF Atlas 2 column in ‘median of replicates’ mode. Actual column includes 79 tissues, slide only fits first half.

Page 66: Decoding ENCODE

Sorting based on expression similarity to selected gene.

Page 67: Decoding ENCODE
Page 68: Decoding ENCODE

The filters page turns the Family Browser into a powerful data mining tool.

Page 69: Decoding ENCODE
Page 70: Decoding ENCODE

GO-annotated membrane proteins that are expressed at least 8X in pancreatic islet cells and no more than 4X elsewhere outside of pancreas and central nervous system. These might be good candidates for targets of the autoimmune response that can cause Type I diabetes.

Candidate Pancreatic Islet Membrane Genes

Page 71: Decoding ENCODE

Direct Data Access

Page 72: Decoding ENCODE

FTP or HTTP Download

• Sequence
• Multiple genome alignments
• “Wiggle” track data.
• Database as tab-separated files
• Follow downloads link from http://genome.ucsc.edu
• Via ftp://hgdownload.cse.ucsc.edu

Page 73: Decoding ENCODE

Public MySQL Access

• Query mirror of our database directly
– Host: genome-mysql.cse.ucsc.edu
– User: genome
– No password needed
• Best to use table browser to find relevant tables in many cases.
• Some tables are split by chromosomes
– chr1_est, chr2_est, etc.
• Some data (genome sequence, multiple alignments, wiggles) are in files just referenced by SQL tables.
• For some purposes easier to use via UCSC C library code than via SQL.
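Because some tables are split by chromosome, a client has to assemble the table name before it can query it. The sketch below shows one way to do that in C. The helper name `splitTableName` is hypothetical and not part of the UCSC library; the only thing taken from the slide is the `chrN_est` naming pattern.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a UCSC library call): build the name of a
 * per-chromosome table, e.g. ("chr1", "est") -> "chr1_est".
 * Returns 0 on success, -1 if the name does not fit in buf. */
int splitTableName(const char *chrom, const char *track,
                   char *buf, size_t bufSize)
{
assert(chrom != NULL && track != NULL && buf != NULL);
int n = snprintf(buf, bufSize, "%s_%s", chrom, track);
return (n >= 0 && (size_t)n < bufSize) ? 0 : -1;
}
```

The resulting name can then be placed in a query sent to the public host listed above. Checking the return value guards against silently truncated table names.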

Page 74: Decoding ENCODE

The Sordid Details of the UCSC Genome Informatics

Code Base

Download via http://genome.ucsc.edu/admin/cvs.html
Many modules require MySQL to be installed.

Page 75: Decoding ENCODE

Lagging Edge Software

• C language - compilers still available!
• CGI Scripts - portable if not pretty.
• SQL database - at least MySQL is free.

Page 76: Decoding ENCODE

Problems with C

• Missing booleans and strings.
• No real objects.
• Must free things

Page 77: Decoding ENCODE

Coping with Missing Data Types in C

• #define boolean int
• Fixing lack of real string type much harder
– lineFile/common modules and autoSql code generator make parsing files relatively painless
– dyString module not a horrible string ‘class’
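To make the idea of a string ‘class’ in C concrete, here is a minimal growable string. It is a sketch only: `struct growString` and its functions are hypothetical names invented for illustration, and this is not the dyString API from the UCSC library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; NOT the UCSC dyString module. */
struct growString
    {
    char *text;       /* Null-terminated contents. */
    size_t size;      /* Characters in use, not counting the null. */
    size_t capacity;  /* Bytes allocated for text. */
    };

struct growString *growStringNew(void)
/* Allocate an empty string. Returns NULL if out of memory. */
{
struct growString *gs = malloc(sizeof(*gs));
if (gs == NULL)
    return NULL;
gs->capacity = 16;
gs->size = 0;
gs->text = malloc(gs->capacity);
if (gs->text == NULL)
    {
    free(gs);
    return NULL;
    }
gs->text[0] = 0;
return gs;
}

int growStringAppend(struct growString *gs, const char *s)
/* Append s, doubling the buffer until it fits.
 * Returns 0 on success, -1 if out of memory. */
{
assert(gs != NULL && s != NULL);
size_t len = strlen(s);
size_t needed = gs->size + len + 1;
if (needed > gs->capacity)
    {
    size_t newCap = gs->capacity;
    while (newCap < needed)
        newCap *= 2;
    char *newText = realloc(gs->text, newCap);
    if (newText == NULL)
        return -1;
    gs->text = newText;
    gs->capacity = newCap;
    }
memcpy(gs->text + gs->size, s, len + 1);  /* Copies the null too. */
gs->size += len;
return 0;
}

void growStringFree(struct growString **pGs)
/* Free string and set pointer to NULL. */
{
struct growString *gs = *pGs;
if (gs != NULL)
    {
    free(gs->text);
    free(gs);
    *pGs = NULL;
    }
}
```

Callers append pieces with `growStringAppend` and read the result from the `text` field, so the buffer bookkeeping lives in one place instead of being repeated at every call site.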

Page 78: Decoding ENCODE

Object Oriented Programming in C

• Build objects around structures.
• Make families of functions with names that start with the structure name, and that take the structure as the first argument.
• Implement polymorphism/virtual functions with function pointers in structure.
• Inheritance is still difficult. Perhaps this is not such a bad thing.

Page 79: Decoding ENCODE

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's. Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq's. */
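The declarations above could be backed by bodies along the following lines. This is one possible implementation written for illustration, not the library's source. It assumes that "white space and numbers" means any non-alphabetic character can be skipped, and it returns NULL when memory runs out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format (as declared on the slide). */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* Bases, null terminated. */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Sketch: copy the alphabetic characters of string into a new dnaSeq,
 * skipping white space and digits. Returns NULL if out of memory. */
{
assert(string != NULL);
struct dnaSeq *seq = calloc(1, sizeof(*seq));
if (seq == NULL)
    return NULL;
seq->dna = malloc(strlen(string) + 1);
if (seq->dna == NULL)
    {
    free(seq);
    return NULL;
    }
char *out = seq->dna;
for (char *in = string; *in != 0; ++in)
    if (isalpha((unsigned char)*in))
        *out++ = *in;
*out = 0;
seq->size = (int)(out - seq->dna);
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free every element of a list of dnaSeq's. */
{
struct dnaSeq *seq = *pList, *next;
while (seq != NULL)
    {
    next = seq->next;
    dnaSeqFree(&seq);
    seq = next;
    }
*pList = NULL;
}
```

Passing a pointer to the pointer lets the free functions null out the caller's variable, which makes an accidental use after free fail fast.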

Page 80: Decoding ENCODE

struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name. */
    int x,y,width,height;    /* Bounds of object. */
    void (*draw)(struct screenObj *obj);  /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                             /* Return true if x,y is in object */
    void *custom;            /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);  /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
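A concrete "subclass" shows how the function pointers provide polymorphism. In this sketch the rectangle type and the names `boxNew`, `boxIn`, `boxDraw` and `boxDrawCount` are hypothetical, and the drawing routine only counts calls so that there is something observable without a display.

```c
#include <assert.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* Same layout as on the slide. */
    {
    struct screenObj *next;
    char *name;
    int x,y,width,height;
    void (*draw)(struct screenObj *obj);
    boolean (*in)(struct screenObj *obj, int x, int y);
    void *custom;
    void (*freeCustom)(struct screenObj *obj);
    };

#define screenObjDraw(obj) (obj->draw(obj))

static int boxDrawCount = 0;  /* Number of times a box has been drawn. */

static void boxDraw(struct screenObj *obj)
/* Hypothetical draw method: just records that it was called. */
{
(void)obj;
++boxDrawCount;
}

static boolean boxIn(struct screenObj *obj, int x, int y)
/* Hypothetical 'in' method for a plain rectangle. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

struct screenObj *boxNew(int x, int y, int width, int height)
/* Hypothetical constructor: fill in bounds and method pointers.
 * Returns NULL if out of memory. */
{
assert(width >= 0 && height >= 0);
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    return NULL;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->draw = boxDraw;
obj->in = boxIn;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj->name);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that walks a list of `screenObj` can call `screenObjDraw(obj)` or `obj->in(obj, x, y)` without knowing the concrete type, which is the role virtual functions play in an object-oriented language.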

Page 81: Decoding ENCODE

Relational Databases

• Relational databases consist of tables, indices, and the Structured Query Language (SQL).
• Tables are much like tab-separated files:

#chrom  start     end       name   strand  score
chr22   14600000  14612345  ldlr   +       0.989
chr21   18283999  18298577  vldlr  -       0.998

Fields are simple - no lists or substructures.
• Can join tables based on a shared field. This is flexible, but only as fast as the index.
• Tables and joins are accessed a row at a time.
• The row is represented as an array of strings.
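Since a row is just an array of strings, a line from a tab-separated file can be turned into the same shape with a small splitter. The function `splitRow` below is a hypothetical helper written for this sketch, not a UCSC library routine; it splits the line in place.

```c
#include <assert.h>
#include <string.h>

int splitRow(char *line, char **row, int maxFields)
/* Hypothetical helper: split line on tabs in place, storing a pointer to
 * the start of each field in row. Returns the number of fields stored,
 * which is at most maxFields; any further fields are ignored. */
{
assert(line != NULL && row != NULL);
int count = 0;
char *s = line;
while (count < maxFields)
    {
    row[count++] = s;
    char *tab = strchr(s, '\t');
    if (tab == NULL)
        break;
    *tab = 0;      /* Terminate this field. */
    s = tab + 1;   /* Next field starts after the tab. */
    }
return count;
}
```

Every field stays a string at this point; converting `start`, `end` and `score` to numbers is a separate step, which is what the row-to-object code on the next slide does.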

Page 82: Decoding ENCODE

Converting A Row to Object

struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database. Dispose of this with exoFishFree(). */
{
struct exoFish *ret;
AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
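The loader relies on helpers from the UCSC library (`AllocVar`, `cloneString`, `sqlUnsigned`). To try it in isolation, the sketch below supplies simplified stand-ins for them. These stand-ins are written for this example and are not the library's code; the `struct exoFish` layout is an assumption inferred from the fields the loader assigns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct exoFish
/* Assumed layout, inferred from the fields used by exoFishLoad. */
    {
    struct exoFish *next;
    char *chrom;
    unsigned chromStart;
    unsigned chromEnd;
    char *name;
    unsigned score;
    };

static void *needZeroedMem(size_t size)
/* Stand-in allocator: zeroed memory, abort if out of memory. */
{
void *pt = calloc(1, size);
if (pt == NULL)
    abort();
return pt;
}

#define AllocVar(pt) (pt = needZeroedMem(sizeof(*pt)))  /* Stand-in macro. */

static char *cloneString(const char *s)
/* Stand-in: return a heap copy of s. */
{
char *copy = needZeroedMem(strlen(s) + 1);
strcpy(copy, s);
return copy;
}

static unsigned sqlUnsigned(const char *s)
/* Stand-in: convert a numeric string, aborting if nothing could be
 * parsed or trailing characters remain. */
{
char *end;
unsigned long val = strtoul(s, &end, 10);
if (end == s || *end != 0)
    abort();
return (unsigned)val;
}

struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from a row of strings, as on the slide. */
{
assert(row != NULL);
struct exoFish *ret;
AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}

void exoFishFree(struct exoFish **pEl)
/* Free an exoFish and set the pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Each column needs its own line in the loader and a matching line in the free routine, so a change to the table schema has to be repeated in several places by hand.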

Page 83: Decoding ENCODE

Motivation for AutoSql

• Row to object code is tedious at best.
• Also have save object, free object code to write.
• SQL create statement needs to match C structure.
• Lack of lists without doing a join can seriously impact performance and complicate schema.

Page 84: Decoding ENCODE

AutoSql Data Declaration

table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
    (
    string chrom;       "Human chromosome or FPC contig"
    uint chromStart;    "Start position in chromosome"
    uint chromEnd;      "End position in chromosome"
    string name;        "Ecore name in Genoscope database"
    uint score;         "Score from 0 to 1000"
    )

See autoSql.doc for more details.

Page 85: Decoding ENCODE

Occasionally useful tools

Page 86: Decoding ENCODE

Unix Command Line

• BLAT - RNA/DNA and DNA/DNA alignment.
• featureBits - figure out number of bases covered by a track or intersection of tracks, output track intersections.
• htmlCheck - check html tables and other basic web page stuff. Look at form variables.
• dbSnoop - summarize a MySQL database.
• autoSql - generate serialization C code for relational databases/tab-separated files.
• autoXml - generate XML parsers
• xmlToSql/sqlToXml - convert between XML and relational database representations
• parasol - manage jobs on computer cluster
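The kind of question featureBits answers, how many bases are covered by the intersection of two tracks, can be illustrated with a per-base flag array. This is a sketch of the idea under simple assumptions (one chromosome, half-open intervals, one byte per base), not the tool's actual code. `struct range` and `basesInBoth` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

struct range
/* Hypothetical half-open interval [start, end) on one chromosome. */
    {
    int start, end;
    };

static void markRanges(unsigned char *cover, int chromSize,
                       const struct range *r, int count, unsigned char flag)
/* Set flag on every base covered by any of the ranges. */
{
for (int i = 0; i < count; ++i)
    {
    assert(0 <= r[i].start && r[i].start <= r[i].end && r[i].end <= chromSize);
    for (int pos = r[i].start; pos < r[i].end; ++pos)
        cover[pos] |= flag;
    }
}

int basesInBoth(const struct range *a, int aCount,
                const struct range *b, int bCount, int chromSize)
/* Return the number of bases covered by both track a and track b.
 * Overlapping items within one track are counted once.
 * Returns -1 if out of memory. */
{
unsigned char *cover = calloc((size_t)chromSize + 1, 1);
if (cover == NULL)
    return -1;
markRanges(cover, chromSize, a, aCount, 1);
markRanges(cover, chromSize, b, bCount, 2);
int total = 0;
for (int pos = 0; pos < chromSize; ++pos)
    if (cover[pos] == 3)
        ++total;
free(cover);
return total;
}
```

Marking bases rather than comparing intervals pairwise keeps the logic trivial and handles overlapping items within a track for free, at the cost of memory proportional to chromosome size.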

Page 87: Decoding ENCODE

C Library Modules

• hdb - access UCSC genome database
• jksql - access SQL databases
• htmlPage - parse web pages, submit forms
• readers/writers for maf, psl, chain, net, bed, 2bit, other formats used at UCSC
• rangeTree & binRange - fast interval intersection tools
• Hashes, lists, trees, etc.
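One common way to make interval intersection fast is hierarchical binning: each feature is assigned to the smallest fixed-size bin that fully contains it, so a range query only has to look at a handful of bins. The sketch below is modeled on the binning scheme described for the UCSC browser, but the function name `binForRange` is hypothetical and the constants (five levels, from 128 kb bins up to a single 512 Mb bin) should be read as illustrative assumptions rather than as the binRange module's source.

```c
#include <assert.h>

/* Offset of the first bin at each level, from smallest bins to largest.
 * Levels hold 4096, 512, 64, 8 and 1 bins respectively. */
static const int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0};

#define BIN_LEVELS 5
#define BIN_FIRST_SHIFT 17  /* Smallest bins are 2^17 = 128 kb. */
#define BIN_NEXT_SHIFT 3    /* Each level up is 8 times larger. */

int binForRange(int start, int end)
/* Return the number of the smallest bin that fully contains the
 * half-open range [start, end), or -1 if no bin is large enough. */
{
assert(0 <= start && start < end);
int startBin = start >> BIN_FIRST_SHIFT;
int endBin = (end - 1) >> BIN_FIRST_SHIFT;
for (int i = 0; i < BIN_LEVELS; ++i)
    {
    if (startBin == endBin)
        return binOffsets[i] + startBin;
    startBin >>= BIN_NEXT_SHIFT;
    endBin >>= BIN_NEXT_SHIFT;
    }
return -1;
}
```

If the bin number is stored in an indexed column next to each feature, a query for a region only needs to check the bins that overlap the region at each level instead of scanning every feature on the chromosome.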