MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu


Page 1:

CS173

Lecture 14: Personal Genomics, GSEA/GREAT

MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu

http://cs173.stanford.edu [BejeranoWinter12/13]

Page 2:

Announcements

• Coming Monday 3/4, lecture is again in LK101 (see class website for room reminders).

• I’ll be working on grad student admissions, so Harendra will lecture about his work (we’ll prepare the ground today).

Page 3:

Quick recap

Page 4:

Sequencing

Public project:

Celera project:

Page 5:

Human Structural Variation

Page 6:

Human Disease

• Cancer
• Congenital defects
• Disease association studies
• Genic and cis-regulatory contributions

Page 7:

Personal genomics

Page 8:

Gameplan

1. As your budget allows, characterize all the variants in an individual’s genome:

A) Against the reference genome.
B) Against variants known in the population.
C) If possible, against unaffected relatives.

2. Compare the structural variants you observe to the body of knowledge about genome content & function. Seek culprit mutations.

3. Having detected a smoking gun mutation, attempt to recreate it in a cell population or organism to obtain a “disease model”.

Variant Types

• Single Nucleotide Variants (SNVs)
• Small Insertions / Deletions (indels)
• Copy Number Variants (CNVs)
• Structural Variants (SVs)
• Novel Sequence
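Step 1 of the gameplan is essentially successive filtering. The sketch below assumes each variant has already been called against the reference and reduced to a hypothetical (chromosome, position, ref, alt) tuple. Real pipelines work on VCF files and must also reconcile genotypes and variant representations, which is ignored here.

```python
# Hypothetical toy data: a variant is (chrom, pos, ref, alt), called against the reference.
def candidate_variants(individual, population, relatives):
    """Variants in the individual that are absent from the population
    catalog and from every unaffected relative."""
    candidates = set(individual) - set(population)
    for relative in relatives:
        candidates -= set(relative)
    return candidates

individual = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr7", 42, "G", "GA")}
population = {("chr1", 1000, "A", "G")}       # known common variant
relatives = [{("chr2", 500, "C", "T")}]       # also carried by an unaffected parent

remaining = candidate_variants(individual, population, relatives)
print(sorted(remaining))
```

Only the variants that survive all three filters move on to step 2.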

Page 9:

Targeted Sequencing, or looking under the lamp is 50x cheaper

Capture Methods vs. Shotgun

• Targeted sequencing allows for much higher coverage at less cost
• Will only capture known sites
• These methods also introduce significant capture bias, including failure to capture sites that differ significantly from the reference genome (analogous to microarrays)

[Figure: genomic DNA (Exon 1, Exon 2) turned into an exome library vs. a shotgun library]

Modified from Meyerson et al. 2010. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October): 685-696

Page 10:

Consumer genomics

Page 11:

Gameplan

1. Collect scientific literature about all structural variant correlations with human disease & traits.

2. Genotype customers for as many informative loci as is commercially viable.

3. Offer counseling for your findings, and their meaning.

4. Ask customers to phenotype themselves.

5. Discover new associations!

Page 12:

Pay, send biosample, get genotyped

Page 13:

Trait associations

Page 14:

Disease Risk Alleles

Page 15:

Side Effects: Serious Ethical Issues

Page 16:

Gene set enrichment analysis: The genic version

Page 17:

Imagine you did a microarray experiment

Page 18:

Cluster all genes for differential expression

[Figure: heatmap with genes as rows and Experiment (replicates) vs. Control (replicates) as columns. Row groups, top to bottom: most significantly up-regulated genes, unchanged genes, most significantly down-regulated genes.]

Page 19:

Determine cut-offs, examine individual genes

[Figure: the same heatmap (genes as rows, Experiment vs. Control replicates as columns), with cut-offs separating the most significantly up-regulated genes, the unchanged genes, and the most significantly down-regulated genes.]
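The rank-then-cut-off idea can be sketched as follows. This is only an illustration: the gene names, expression values, and threshold are hypothetical, and Welch's t-statistic is one of several reasonable ranking scores (the slides do not say which one was used).

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t-statistic comparing two groups of replicates."""
    return (mean(a) - mean(b)) / sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))

# Hypothetical expression values: gene -> (experiment replicates, control replicates)
expression = {
    "geneA": ([9.1, 9.4, 9.0], [5.0, 5.2, 4.9]),
    "geneB": ([6.0, 6.1, 5.9], [6.0, 5.8, 6.2]),
    "geneC": ([3.1, 2.9, 3.0], [7.2, 7.0, 7.1]),
}

scores = {gene: welch_t(exp, ctl) for gene, (exp, ctl) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)   # most up-regulated first

CUTOFF = 4.0  # arbitrary illustrative threshold on |t|
up = [g for g in ranked if scores[g] > CUTOFF]
down = [g for g in ranked if scores[g] < -CUTOFF]
print(ranked, up, down)
```

The ranked list is what the gene set methods on the following slides take as input. The cut-off lists are what one would inspect gene by gene.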

Page 20:

Genes usually work in groups

Biochemical pathways, signaling pathways, etc.

Asking about the expression perturbation of groups of genes is both more appealing biologically, and more powerful statistically (you sum perturbations).

Page 21:

Ask about whole gene sets

[Figure: heatmap (Exper. vs. Control) with Gene Sets 1, 2, and 3 marked along the ranked gene list, next to an ES/NES statistic axis running from + to -. Gene set 3 is up-regulated; gene set 2 is down-regulated.]

Page 22:

One approach: GSEA

[Figure: dataset distribution (number of genes vs. gene expression level), overlaid with the gene set 3 distribution and the gene set 1 distribution.]
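To make the enrichment score (ES) idea concrete, here is a simplified, unweighted running-sum sketch. It is not the published GSEA statistic, which weights each hit by the gene's correlation with the phenotype and derives the normalized score (NES) and significance from permutations. The gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up on set members and down on
    non-members; return the signed maximum deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]  # most up-regulated first
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g7", "g8"})         # set concentrated at the bottom
print(es_top, es_bottom)
```

A set concentrated at the top of the list gets a score near +1, one concentrated at the bottom a score near -1, and a set scattered through the list stays close to 0.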

Page 23:

Another popular approach: DAVID

Input: list of genes of interest (without expression values).

Page 24:

Multiple Testing Correction

Note that statistically you cannot just run individual tests on 1,000 different gene sets. You have to apply further statistical corrections, to account for the fact that even in 1,000 random experiments a handful may come out good by chance.

(E.g., experiment = throw a coin 10 times and ask if it is biased. If you repeat it 1,000 times, you are likely to get at least one all-heads series from a fair coin. You mustn’t deduce that the coin is biased.)
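The coin example can be checked with a few lines of arithmetic. The sketch below also applies a Bonferroni correction, one simple choice among several multiple testing corrections; the significance level of 0.05 is an assumption for illustration.

```python
n_tosses = 10
n_experiments = 1000

p_all_heads = 0.5 ** n_tosses                            # one fair-coin experiment
p_at_least_one = 1 - (1 - p_all_heads) ** n_experiments  # across all experiments
expected_count = n_experiments * p_all_heads             # expected all-heads series

alpha = 0.05                                  # assumed significance level
bonferroni_threshold = alpha / n_experiments  # per-test cutoff after correction

print(p_all_heads, p_at_least_one, expected_count, bonferroni_threshold)
```

A single all-heads series has probability of about 0.001, which looks significant on its own, yet across 1,000 experiments a fair coin produces at least one with probability above 0.6. After correction the per-test threshold drops to 0.00005, so the all-heads series is no longer called significant.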

Page 25:

What will you test?

Also note that this is a very general approach to test gene lists.

Instead of a microarray experiment you can do RNA-seq.

Instead of up/down-regulated genes you can test all the genes in a personal genome where you see surprising mutations.

Any gene list can be tested.

Page 26:

Gene Sets: Cataloging biological knowledge

Page 27:

Keyword lists are not enough

Sheer number of terms too much to remember and sort

• Need standardized, stable, carefully defined terms
• Need to describe different levels of detail
• So… defined terms need to be related in a hierarchy

With structured vocabularies/hierarchies

• Parent/child relationships exist between terms
• Increased depth -> Increased resolution
• Can annotate data at appropriate level
• May query at appropriate level

[Figure: Anatomy keywords (Organ system, Cardiovascular system, Heart) next to an Anatomy Hierarchy with nodes embryo, organ system, cardiovascular, heart, and further elided branches.]

Page 28:

TJL-2004

Annotate genes to most specific terms

Page 29:

General Implementations for Vocabularies

1. Annotate at appropriate level, query at appropriate level

2. Queries for higher level terms include annotations to lower level terms

[Figure: a Hierarchy (nodes embryo, organ system, cardiovascular, heart, …) next to a DAG (nodes molecular function, chaperone regulator, chaperone activator, enzyme regulator, enzyme activator, …). A query for a term returns things annotated to its descendants.]
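The "query a term, get its descendants too" rule can be sketched with a small DAG. The term names come from the slide, but the edges between them and the gene annotations are assumptions made up for illustration.

```python
# Assumed child -> parents edges; a DAG lets a term have more than one parent.
PARENTS = {
    "chaperone regulator": ["molecular function"],
    "enzyme regulator": ["molecular function"],
    "enzyme activator": ["enzyme regulator"],
    "chaperone activator": ["chaperone regulator", "enzyme activator"],
}

# Hypothetical annotations: each gene is annotated to its most specific terms.
ANNOTATIONS = {
    "geneA": {"chaperone activator"},
    "geneB": {"enzyme regulator"},
    "geneC": {"chaperone regulator"},
}

def ancestors(term):
    """The term itself plus every term reachable through parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def query(term):
    """Genes annotated to the term or to any of its descendants."""
    return {gene for gene, terms in ANNOTATIONS.items()
            if any(term in ancestors(t) for t in terms)}

print(sorted(query("enzyme regulator")))
```

Here geneA is returned for "enzyme regulator" even though it was only annotated to the more specific "chaperone activator".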

Page 30:

Gene Sets

• Gene Ontology (“GO”)
  – Biological Process
  – Molecular Function
  – Cellular Location
• Pathway Databases
  – KEGG
  – BioCarta
  – Broad Institute

Page 31:

Other Gene Sets

• Transcription factor targets
  – All the genes regulated by particular TFs
• Protein complex components
  – Sets of genes whose protein products function together
• Ion channel receptors
• RNA / DNA Polymerase
• Paralogs
  – Families of genes descended (in eukaryotic times) from a common ancestor

Page 32:

Natural Language Processing (NLP) Opportunities

[Figure: Literature, Genes, Ontology. Map genes to ontology using literature.]

Page 33:

Gene set enrichment analysis: The gene regulatory version

Page 34:

Combinatorial Regulatory Code

2,000 different proteins can bind specific DNA sequences.

A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.

[Figure: proteins bound to protein binding sites on the DNA next to a gene.]

Page 35:

ChIP-Seq: first glimpses of the regulatory genome in action

[Figure: peak calling; cis-regulatory peak.]

Page 36:

What is the transcription factor I just assayed doing?

[Figure legend: gene transcription start site; cis-regulatory peak.]

• Collect known literature of the form
  • Function A: Gene1, Gene2, Gene3, ...
  • Function B: Gene1, Gene2, Gene3, ...
  • Function C: ...

• Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.

• Form hypothesis and perform further experiments.

Page 37:

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile

[Figure legend: gene transcription start site; SRF binding ChIP-seq peak.]

• ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]

• SRF is known as a “master regulator of the actin cytoskeleton”

• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.

Page 38:

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile

Existing, gene-based method to analyze enrichment:

• Ignore distal binding events.

• Count affected genes.

• Rank by enrichment hypergeometric p-value.

[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; π = ontology term (e.g. ‘actin cytoskeleton’).]

N = 8 genes in genome
K = 3 genes annotated with π
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with π

P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
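The toy hypergeometric p-value on this slide can be computed directly. This is a minimal sketch using only the standard library; the function name is our own.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """Pr(X >= k) when n genes are picked without replacement from N genes,
    K of which carry the annotation."""
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favorable / comb(N, n)

p = hypergeom_tail(k=1, n=2, K=3, N=8)
print(p)  # 18/28, roughly 0.64
```

With only 2 selected genes out of 8, seeing at least one annotated gene is unremarkable, which is why real analyses need many more genes before a term can stand out.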

Page 39:

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for?

[Figure: a microarray tool normally takes microarray data as input; here gene regulation data is fed in instead.]

Pro: A lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis.
Does it matter?

Page 40:

SRF Gene-based enrichment results

• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]

SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding

Where’s the signal? Top “actin” term is ranked #28 in the list.

[1] Valouev A. et al., Nat. Methods, 2008

Page 41:

Associating only proximal peaks loses a lot of information

[Figure: Relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets. X-axis: distance to nearest transcription start site (kb), binned 0-2, 2-5, 5-50, 50-500, > 500. Y-axis: fraction of all elements, 0 to 0.7. Datasets: SRF (H: Jurkat), NRSF (H: Jurkat), GABP (H: Jurkat), Stat3 (M: ESC), p300 (M: ESC), p300 (M: limb), p300 (M: forebrain), p300 (M: midbrain).]

Restricting to proximal peaks often leads to complete loss of key enrichments

Page 42:

Bad Solution: Associating distal peaks brings in many false enrichments

SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?

Term                                  Bonferroni corrected p-value
nervous system development            5x10^-9
system development                    8x10^-9
anatomical structure development      7x10^-8
multicellular organismal development  1x10^-7
developmental process                 2x10^-6

Why bad? 14% of human genes are tagged ‘multicellular organismal development’. But 33% of base pairs have such a gene nearest upstream/downstream.

Large “gene deserts” are often next to key developmental genes
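The gap between "fraction of genes" and "fraction of base pairs" can be reproduced on a toy genome. Every number below is hypothetical, and each base is assigned to its single nearest TSS, a simplification of the nearest upstream/downstream rule on the slide.

```python
# Toy genome: 100 genes, 14 of them "developmental", each flanked by gene deserts.
SMALL_GAP, DESERT_GAP = 10_000, 40_000

is_dev = [i % 7 == 3 for i in range(100)]   # 14 of the 100 genes
tss = []
pos = 0
for i in range(100):
    next_to_dev = is_dev[i] or (i > 0 and is_dev[i - 1])
    pos += DESERT_GAP if next_to_dev else SMALL_GAP
    tss.append(pos)
genome_end = tss[-1] + SMALL_GAP

# Each gene "owns" the bases that are closer to its TSS than to any other TSS.
bounds = [0] + [(a + b) // 2 for a, b in zip(tss, tss[1:])] + [genome_end]
territory = [bounds[i + 1] - bounds[i] for i in range(100)]

frac_genes = sum(is_dev) / len(is_dev)
frac_bases = sum(t for t, dev in zip(territory, is_dev) if dev) / genome_end
print(frac_genes, round(frac_bases, 3))
```

In this toy setup 14% of genes own about 30% of the bases, so uniformly random regions land next to developmental genes about twice as often as a gene-list test assumes, and the test reports enrichment where there is none.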

Page 43:

Real Solution: Do not convert to gene list. Analyze the set of genomic regions

GREAT = Genomic Regions Enrichment of Annotations Tool

[Figure legend: gene transcription start site; gene regulatory domain; genomic region (ChIP-seq peak); π = ontology term (‘actin cytoskeleton’).]

p = 0.33 of genome annotated with π
n = 6 genomic regions
k = 5 genomic regions hit annotation π

P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)

Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.
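The binomial p-value in this toy example can be computed directly. This is a minimal sketch using only the standard library; the function name is our own.

```python
from math import comb

def binom_tail(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_value = binom_tail(k=5, n=6, p=0.33)
print(round(p_value, 4))
```

The result is about 0.017: when a third of the genome carries the annotation, 5 hits out of 6 regions is still unlikely by chance, whereas about 2 hits out of 6 would be the expected background.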

Page 44:

How does GREAT know how to assign distal binding peaks to genes?

Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms

Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.

• Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site, which is extended up to the basal domains of the nearest genes, within 1 Mb

• Though some associations may be missed or incorrect, in general signal richness and robustness is greatly improved by associating distal peaks
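The default rule can be sketched as below. This is an illustration under stated assumptions, not GREAT's actual implementation: it handles a single chromosome, treats the immediate TSS neighbors as the nearest genes, measures the 1 Mb limit from the TSS, and uses hypothetical gene names and coordinates.

```python
BASAL_UP, BASAL_DOWN, MAX_EXT = 5_000, 1_000, 1_000_000

def basal_domain(tss, strand):
    """5 kb upstream and 1 kb downstream of the TSS, relative to the gene's strand."""
    if strand == "+":
        return max(0, tss - BASAL_UP), tss + BASAL_DOWN
    return max(0, tss - BASAL_DOWN), tss + BASAL_UP

def regulatory_domains(genes):
    """genes: list of (name, tss, strand) on one chromosome -> {name: (start, end)}."""
    genes = sorted(genes, key=lambda g: g[1])
    basal = [basal_domain(tss, strand) for _, tss, strand in genes]
    domains = {}
    for i, (name, tss, _) in enumerate(genes):
        start, end = basal[i]
        # Extend left to the previous gene's basal domain, at most MAX_EXT from the TSS.
        left_limit = basal[i - 1][1] if i > 0 else 0
        start = min(start, max(left_limit, tss - MAX_EXT, 0))
        # Extend right to the next gene's basal domain, at most MAX_EXT from the TSS.
        right_limit = basal[i + 1][0] if i + 1 < len(genes) else tss + MAX_EXT
        end = max(end, min(right_limit, tss + MAX_EXT))
        domains[name] = (start, end)
    return domains

def genes_for_peak(position, domains):
    """Names of all genes whose regulatory domain contains the peak position."""
    return sorted(name for name, (s, e) in domains.items() if s <= position < e)

domains = regulatory_domains([("g1", 100_000, "+"), ("g2", 300_000, "-"), ("g3", 2_000_000, "+")])
print(domains)
print(genes_for_peak(200_000, domains), genes_for_peak(1_500_000, domains))
```

Because neighboring domains overlap in the intergenic space, a peak between g1 and g2 is associated with both genes, while a peak more than 1 Mb from g2 is associated only with g3.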

Page 45:

GREAT infers many specific functions of SRF from its binding profile

Top GREAT enrichments of SRF:

Ontology         Term                    # Genes  Binomial P-value  Experimental support*
Gene Ontology    actin cytoskeleton      30       7x10^-9           Miano et al. 2007
Gene Ontology    actin binding           31       5x10^-5           Miano et al. 2007
Pathway Commons  TRAIL signaling         32       5x10^-7           Bertolotto et al. 2000
Pathway Commons  Class I PI3K signaling  26       2x10^-6           Poser et al. 2000
TreeFam          FOS gene family         5        1x10^-8           Chai & Tarnawski 2002
TF Targets       Targets of SRF          84       5x10^-76          Positive control
TF Targets       Targets of GABP         28       4x10^-9           ChIP-Seq support
TF Targets       Targets of YY1          44       1x10^-6           Natesan & Gilman 1995
TF Targets       Targets of EGR1         23       2x10^-4

* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.

Compare with the top gene-based enrichments of SRF (top actin-related term 28th in list).

Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Page 46:

Limb P300: I was blind and I can see

Gene List

Page 47:

GREAT works with ANY cis-regulatory rich set

Example: GWAS Compendium set

Height-associated unlinked SNPs

Page 48:

GREAT analysis of histone mark combinations

Page 49:

GREAT includes multiple ontologies

Michael Hiller

• Twenty ontologies spanning broad categories of biology
• 44,832 total ontology terms tested in each GREAT run

[Figure: the twenty ontologies with their term counts: 2,800; 5,215; 834; 5,781; 427; 456; 150; 1,253; 288; 706; 6,700; 3,079; 911; 615; 19; 222; 9; 6,857; 8,272; 238.]

Page 50:

Advantages of the GREAT approach

Tailored to the biology of gene regulation:

• Distal sites are incorporated, not ignored
• Variable length gene regulatory domains
• Multiple bindings next to same target gene rewarded
• Extensive ontologies, some home-made

Page 51:

Algorithmic Optimization: (A) it works; (B) make it efficient

Page 52:

Enter GREAT.stanford.edu

Choose genome

Input peak list

Hit submit!

Page 53:

GREAT web app: (Optional) alter association rules

http://great.stanford.edu

Three association rule choices

Literature-curated domains for a small subset of genes

[Figure: Lnp, Evx2, HoxD cluster; adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]

Page 54:

GREAT web app: output summary

Ontology-specific enrichments

Additional ontologies, term statistics, multiple hypothesis corrections, etc.

Cool visualization opportunities!

Page 55:

GREAT web app: term details page

Genes annotated as “actin binding” with associated genomic regions

Genomic regions annotated with “actin binding”

Drill down to explore how a particular peak regulates Plectin and its role in actin binding

Page 56:

You can also submit any track straight from the UCSC Table Browser

A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help / Inquiries welcome!)

Page 57:

GREAT web app: export data

HTML output displays all user selected rows and columns

Tab-separated values also available for additional postprocessing

Page 58:

GREAT Web Stats

200-400 job submissions per day, from 7,000 IP addresses

Page 59:

Adding a new species to GREAT

We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations*

*Most valuable for species with independent annotations!

Page 60:

Adapting GREAT for zebrafish

We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations

Assembly  # Scaffolds  Avg. Scaffold Length  # Assembly Gaps
Zv8       11,724       129 Kb                ~55,000
Zv9       1,133        1,250 Kb              ~27,000

Zv9 = UCSC danRer7

Older assemblies? Liftover to Zv9/danRer7

Page 61:

Adapting GREAT for zebrafish

We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations

• Carefully combine (95% identity, 80% coverage):
  – RefSeq transcripts
  – Ensembl coding genes
  – RefSeq proteins
  – Uniprot proteins

Obtain 14,567 genes, all with ZFIN gene identifiers

• Using only RefSeq would miss 1,912 annotated genes
• Using only Ensembl would miss 1,218 annotated genes

Page 62:

Adapting GREAT for zebrafish

We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations

Curate zebrafish:
• Gene Ontology (GO) - Function, Process, Cellular Component
• ZFIN Phenotype
• Wiki Pathways
• ZFIN Wildtype Expression
• InterPro - protein domains, families and functional sites
• TreeFam - gene families of paralogs

Page 63:

96% of our gene set is annotated

Category         Ontology                  Genes   Terms*  Unfolded Annotations
Gene Ontology    Molecular Function        10,520  1,697   80,788
Gene Ontology    Biological Process        8,174   3,597   152,595
Gene Ontology    Cellular Component        7,138   592     68,922
Phenotype Data   ZFIN Phenotype            671     11,835  57,976
Pathway Data     Wiki Pathways             1,754   105     3,622
Gene Expression  ZFIN Wildtype Expression  8,421   9,812   888,189
Gene Families    Interpro                  12,746  6,667   43,994
Gene Families    Treefam                   11,324  6,010   11,338
Total                                      14,038  40,315  1,307,424

* At least one gene is annotated with the term