Extracting Biological Information from Gene Lists

Simon Andrews, Laura Biggins, Boo Virk

[email protected]@babraham.ac.uk

[email protected]

v1.0

Biological material

Isolation of DNA, RNA or proteins

Sample for analysis

Sample processing

Analysis of processed sample: Data acquisition – sequencing, microarray analysis, mass spectrometry

Raw data file(s)

Data analysis: identification of genes, transcripts or proteins

Public databases

Results Table: containing hits – genes, transcripts or proteins

Analysis doesn’t end here!

Why functional analysis?

Advantages:
• Biological insight
• Validation of experiment
• Generate new hypothesis

Limitations:
• Amount of information depends on the species
• Will only find known/published links between genes
– If working on something novel, information available may be limited

What this course covers

• Morning
– Introduction to Gene Lists
– Gene List Practical

• Coffee
– Presenting results
– Presenting Results Practical

• Afternoon
– Motif Searching
– Motif Searching Practical

• Coffee
– Networks and Interactions
– Network Practical
– Commercial tools

Gene Lists

Types of gene list:
• Names of genes
• Names of genes ordered by qualitative value
• Names of genes ordered by quantitative value

• Gene lists can be ranked
– P-value
– Other Stat
– Ordered

• Gene lists can be filtered
– Cut off point
– Subset of genes

Transforming Gene IDs

• Need to use relevant ID to extract information from databases
• BioMart ID conversion tool allows us to do this easily and quickly online

http://www.biomart.org/

• Download this data, import transformed IDs into table
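Importing converted IDs back into a results table is essentially a lookup. The sketch below assumes a hypothetical two-column, tab-delimited export (gene symbol, UniProt accession). The column names, the sample rows and the unmapped gene `XYZ1` are illustrative only and are not BioMart's actual export format.

```python
import csv
import io

# Hypothetical tab-delimited export from an ID conversion tool.
# Column names are assumptions made for this sketch.
EXPORT = "gene_symbol\tuniprot_id\nTP53\tP04637\nCDK1\tP06493\nPOLE\tQ07864\n"


def load_id_map(handle):
    """Read symbol -> converted ID pairs from a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_symbol"]: row["uniprot_id"] for row in reader}


def add_converted_ids(gene_list, id_map):
    """Attach the converted ID to each gene; None marks an unmapped gene."""
    return [(gene, id_map.get(gene)) for gene in gene_list]


id_map = load_id_map(io.StringIO(EXPORT))
table = add_converted_ids(["TP53", "CDK1", "XYZ1"], id_map)
for gene, converted in table:
    print(gene, converted)
```

Keeping unmapped genes in the table (with `None`) rather than dropping them makes it easy to see how many IDs failed to convert.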

I have my gene list, what next?

Hyperlinked table:

Gene    UniProt Name                                   Score   Links
TP53    P04637  Tumor Suppressor p53                   125527  Reactome UniProt
CDK1    P06493  Cyclin-dependent kinase 1              113740  Reactome UniProt
POLE    Q07864  DNA Polymerase Epsilon                 107190  Reactome UniProt
KPNB1   Q14974  Importin subunit beta-1                35542   Reactome UniProt
CHEK1   O14757  Serine/threonine-protein kinase Chk1   35271   Reactome UniProt
AURKB   Q96GD4  Aurora kinase B                        30803   Reactome UniProt
RPA2    P15927  Replication protein A 32 kDa subunit   22207   Reactome UniProt
CDT1    Q9H211  DNA replication factor Cdt1            21735   Reactome UniProt
MCMBP   Q9BTE3  MCM complex-binding protein            17811   Reactome UniProt
TUBG1   P23258  Tubulin gamma-1 chain                  16895   Reactome UniProt
RAN     P62826  GTP-binding nuclear protein Ran        16384   Reactome UniProt
RANGRF  Q9HD47  Ran guanine nucleotide factor Mog1     15527   Reactome UniProt
BLM     P54132  Bloom syndrome protein                 14883   Reactome UniProt
PCNA    P12004  Proliferating Cell Nuclear Antigen     13982   Reactome UniProt
SETD8   Q9NQR1  Pr-Set7                                13711   Reactome UniProt
RCC1    P18754  Regulator of chromosome condensation   13302   Reactome UniProt
MCM5    P33992  DNA replication licensing factor MCM5  12806   Reactome UniProt
CDC25C  P30307  M-phase inducer phosphatase 3          12510   Reactome UniProt
PLK1    P53350  Serine/threonine-protein kinase PLK1   10930   Reactome UniProt
MZT1    Q08AG7  Mitotic-spindle organizing protein 1   9210    Reactome UniProt

Hyperlinked tables

Advantages
– Easy to create – no special data-mining software needed
– One-click direct access to relevant pages
– Reference resource

However…
– Need to become familiar with the resources available: tailor hyperlinks to be specific for your organism and questions being asked
– Information on one gene at a time in your gene set
– Need to get relevant resource IDs
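A hyperlinked table can be generated from the IDs with simple string templates. This is a minimal sketch: the two URL patterns below are assumptions about how UniProt entries and Reactome searches are addressed, and should be checked against each resource's own linking guidance before use.

```python
from urllib.parse import quote

# URL templates are assumptions for this sketch, not confirmed by the slides.
UNIPROT_URL = "https://www.uniprot.org/uniprotkb/{accession}"
REACTOME_URL = "https://reactome.org/content/query?q={symbol}"

# First three rows of the hyperlinked table (gene symbol, UniProt accession).
genes = [("TP53", "P04637"), ("CDK1", "P06493"), ("POLE", "Q07864")]


def hyperlink_row(symbol, accession):
    """Build the link targets for one gene."""
    return {
        "gene": symbol,
        "uniprot": UNIPROT_URL.format(accession=quote(accession)),
        "reactome": REACTOME_URL.format(symbol=quote(symbol)),
    }


rows = [hyperlink_row(symbol, accession) for symbol, accession in genes]
for row in rows:
    print(row["gene"], row["uniprot"], row["reactome"], sep="\t")
```

Writing the links as tab-separated text means the result can be pasted into a spreadsheet, where a hyperlink formula or cell format can make them clickable.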

I have my gene list, what next?

Annotated table:

Gene    UniProt Name                                   Score   PANTHER GO-Slim BP
CHEK1   O14757  Serine/threonine-protein kinase Chk1   35271   apoptotic process;nitrogen compound metabolic process;biosynthetic process;transcription from RNA polymerase II promoter;cellular protein modification process;cell cycle;cell communication;apoptotic process;response to stress;response to abiotic stimulus;regulation of transcription from RNA polymerase II promoter;regulation of cell cycle;chromatin organization
MCM5    P33992  DNA replication licensing factor MCM5  12806   cell cycle;cell communication
RCC1    P18754  Regulator of chromosome condensation   13302   cellular component movement;mitosis;chromosome segregation;cellular component morphogenesis;intracellular protein transport;cellular component organization
CDC25C  P30307  M-phase inducer phosphatase 3          12510   DNA replication;cell cycle
PLK1    P53350  Serine/threonine-protein kinase PLK1   10930   DNA replication;DNA repair;DNA recombination;cell cycle
TP53    P04637  Tumor Suppressor p53                   125527  glycogen metabolic process;protein phosphorylation;mitosis;cell communication
CDK1    P06493  Cyclin-dependent kinase 1              113740  nitrogen compound metabolic process;biosynthetic process;DNA replication;RNA metabolic process;cellular process;regulation of biological process;regulation of catalytic activity
BLM     P54132  Bloom syndrome protein                 14883   nucleobase-containing compound metabolic process;cell cycle;cell communication;RNA localization;intracellular protein transport;nuclear transport
RPA2    P15927  Replication protein A 32 kDa subunit   22207   nucleobase-containing compound metabolic process;mitosis;nucleobase-containing compound transport;regulation of catalytic activity
TUBG1   P23258  Tubulin gamma-1 chain                  16895   phosphate-containing compound metabolic process;cellular protein modification process;cell cycle
KPNB1   Q14974  Importin subunit beta-1                35542   phosphate-containing compound metabolic process;protein phosphorylation;cytokinesis;cell cycle;regulation of cell cycle;chromatin organization;cytoskeleton organization
MZT1    Q08AG7  Mitotic-spindle organizing protein 1   9210    protein targeting;nuclear transport
PCNA    P12004  Proliferating Cell Nuclear Antigen     13982
RAN     P62826  GTP-binding nuclear protein Ran        16384
POLE    Q07864  DNA Polymerase Epsilon                 107190
AURKB   Q96GD4  Aurora kinase B                        30803
MCMBP   Q9BTE3  MCM complex-binding protein            17811
CDT1    Q9H211  DNA replication factor Cdt1            21735
RANGRF  Q9HD47  Ran guanine nucleotide factor Mog1     15527
SETD8   Q9NQR1  Pr-Set7                                13711

Annotation Sources

There are many databases for annotation sources, including (but not limited to):

• Gene Ontology (GO) (most popular)
• Pathways and Interactions
• Protein domains
• Co-expression
• Transcription binding sites

What is Gene Ontology (GO)?

• Collaborative effort addressing need for consistent descriptions of gene products across different databases

• GO project has three structured ontologies describing gene products independent of species:
– Biological Processes (BP)
– Cellular Components (CC)
– Molecular Functions (MF)

GO Structure

• 3 GO domains, each with its own root ontology term (1, 2, 3)
• Terms range from general (parent) to specific (child)

Subsets of GO terms

• GO slim terms:
– Cut-down versions of the GO ontologies that contain a subset of terms from the GO resource
– Give a broad overview of the ontology content without the detail of the specific, fine-grained terms

• GO fat terms:
– Subset comprising more specific terms

Pathways and Interactions

• Are specific pathways enriched in my list?
• What other genes are in this pathway?
• Which genes/gene products interact with my genes of interest?

• Databases include:

Protein Domain

• Can I find shared protein domains?
• What is the function of the shared domain?
• Which other proteins share this domain?

• Databases include:

Co-expression

• Which genes are co-expressed?

• Automatic grouping of genes (rather than human curation (GO))

• Databases include:

Annotated tables

Advantages
– Information on function from larger gene sets
– Sort groups of genes (GO term, pathway, protein domain)
– Relatively easy to create
– Reference resource

However…
– Need to become familiar with the resources specific to your research
– Lots of information can be difficult to sort efficiently
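Sorting an annotated table by function amounts to inverting it: instead of gene → terms, build term → genes. A minimal sketch, using the GO-Slim annotations of three genes from the annotated table (the function name is illustrative):

```python
from collections import defaultdict

# GO-Slim BP annotations for three genes, copied from the annotated table.
annotations = {
    "MCM5": ["cell cycle", "cell communication"],
    "CDC25C": ["DNA replication", "cell cycle"],
    "PLK1": ["DNA replication", "DNA repair", "DNA recombination", "cell cycle"],
}


def genes_by_term(annotations):
    """Invert gene -> terms into term -> genes."""
    grouped = defaultdict(list)
    for gene, terms in annotations.items():
        for term in terms:
            grouped[term].append(gene)
    return dict(grouped)


grouped = genes_by_term(annotations)
# List the terms shared by the most genes first.
for term, genes in sorted(grouped.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{term}: {', '.join(genes)}")
```

This only counts shared terms; it says nothing yet about whether a term is enriched, which needs a comparison against a background list.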

What does functional information tell me?

• Have functional information about a gene set

• Can verify that genes implicated in the experiment are functionally relevant, and discover unexpected shared functions

• Determine which functions are enriched in gene set
– How? Compare to a background list of genes

What is a background list?

• In theory, any gene that could have been differentially expressed in your experiment
• RNA seq – all genes apart from those with less than 10-20 reads
• Arrays – all genes in the array
• ChipSeq – any gene on the chip
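For RNA-seq, the background list can be built by dropping genes with too few reads. The sketch below uses made-up gene names and counts; the default threshold of 10 is taken from the lower end of the 10-20 reads guideline and is a parameter to tune.

```python
# Hypothetical read counts per gene; names and numbers are made up.
read_counts = {"GENE_A": 1520, "GENE_B": 4, "GENE_C": 87, "GENE_D": 0, "GENE_E": 15}


def background_genes(read_counts, min_reads=10):
    """Keep genes that were measurable, i.e. have at least min_reads reads."""
    return sorted(gene for gene, n in read_counts.items() if n >= min_reads)


background = background_genes(read_counts)
print(background)  # ['GENE_A', 'GENE_C', 'GENE_E']
```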

Choosing a Background List

• Which background list to use?
– Whole set of genes
– Tissue/cell specific genes
– Manually made list, derived from your experiment and analysis

Statistics to test for enrichment

Background: 13,101 genes on chip
• 3005 genes related to disease: 3005/13,101 = 22.9%

Gene List: 747 genes
• Related to disease: 260/747 = 34.8%
• Not related to disease: 487/747 = 65.2%

Are these proportions the same?

Lots of Statistical tests to choose from

• Hypergeometric test
• Fisher’s exact/Chi-squared
• Binomial
• Kolmogorov-Smirnov
• Permutation

Hypergeometric test

• Uses hypergeometric distribution to measure the probability of having drawn a specific number of successes (out of a total number of draws) from a population

• Example:

Imagine that there are 4 green and 16 red marbles in a box. You close your eyes and draw 5 marbles without replacement. What is the probability that exactly 2 of the 5 are green?
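The marble example can be worked through directly from the hypergeometric formula: the number of ways to pick 2 of the 4 green marbles, times the number of ways to pick 3 of the 16 red ones, divided by the number of ways to pick any 5 of the 20. A short sketch (the function name is illustrative):

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes) when drawing without replacement."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


# 4 green marbles among 20, draw 5, exactly 2 green.
p = hypergeom_pmf(2, population=20, successes=4, draws=5)
print(round(p, 4))  # 0.2167
```

So roughly one draw in five gives exactly 2 green marbles.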

Background: 13,101 genes on chip
• 3005 genes map to disease: 3005/13,101 = 22.9%

Gene List: 747 genes
• Map to disease: 260/747 = 34.8%
• Do not map to disease: 487/747 = 65.2%

Are these proportions the same?

What is the probability (p-value) that 260 or more genes (out of 747) map to disease, given that there are 3005 of those genes in the background (13,101 genes)?
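For an enrichment p-value, the usual quantity is the upper tail: the probability of seeing at least 260 disease genes in a list of 747 drawn from the background. The sketch below sums the hypergeometric probabilities exactly using integer arithmetic; it assumes the one-sided (over-representation) test, which is a common convention but not the only possible choice.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(at least k successes) when drawing without replacement."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    # Integer / integer division keeps the arithmetic exact until the last step.
    return favourable / comb(population, draws)


expected = 747 * 3005 / 13101  # genes expected by chance, about 171
p_value = hypergeom_tail(260, population=13101, successes=3005, draws=747)
print(f"expected {expected:.1f}, observed 260, p = {p_value:.3g}")
```

With about 171 genes expected by chance and 260 observed, the p-value is extremely small, so the two proportions are very unlikely to be the same.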

Hypergeometric test

Limitations:
• Assumes independence of categories
– Result terms often include directly related terms
– Is there really evidence for both terms?
• Works better with larger samples (5% of background)

Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/ranked list  Large (5% of background)  P-value  Finite population – probability of success changes

Based on Hypergeometric test:

Test            Input                 Sample size  Output   Specifics
Fisher’s Exact  Unranked/ranked list  Small        P-value  Can be used to compare 2 conditions as well as gene list to background; one-tailed or two-tailed
Binomial        Unranked/ranked list  Large        P-value  Does not assume finite population – probability of success remains the same
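The binomial test treats every gene in the list as an independent draw with the same success probability, here the background proportion 3005/13,101. A minimal sketch of the upper-tail calculation for the example of 260 disease genes in a list of 747, again assuming a one-sided test (function and argument names are illustrative):

```python
from math import comb


def binomial_tail(k, draws, successes, population):
    """P(at least k successes) when each draw succeeds with probability
    successes / population, i.e. sampling with replacement."""
    failures = population - successes
    favourable = sum(
        comb(draws, i) * successes**i * failures ** (draws - i)
        for i in range(k, draws + 1)
    )
    # Exact integers throughout; convert to a float only at the end.
    return favourable / population**draws


p_value = binomial_tail(260, draws=747, successes=3005, population=13101)
print(f"binomial p = {p_value:.3g}")
```

Because the success probability never changes, this is an approximation to the hypergeometric result that works best when the gene list is small relative to the background.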

Limitations of Fisher’s Exact and Binomial test

• Neither accounts for variation in the number of genes annotated to individual terms/functions being tested, or for the number of terms/functions associated with individual genes

• Therefore, tend to over-estimate significance if the gene set has an unusually high number of annotations

• Assume independence of categories

Lots of Statistical tests to choose from

• Hypergeometric
• Fisher’s exact/Chi-squared
• Binomial

Used for ranked gene lists only:
• Kolmogorov-Smirnov
• Permutation

Output: enrichment scores (ES) for functions, which can then be translated into a p-value

Multiple testing correction

Error types in statistics:

                      True state in Gene List
Statistical decision  Not overrepresented            Overrepresented
Significant           Type I error (False Positive)  Correct
Not significant       Correct                        Type II error (False Negative)

Traditionally, a test or a difference is said to be “significant” if the probability of a type I error is: α ≤ 0.05

• Example: You want to compare 3 groups and you carry out 3 hypergeometric tests, each with a 5% level of significance (P<0.05)

• Probability of not making a type I error in one test = 95% = (1 – 0.05)
– Overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857
– Therefore probability of at least one type I error is: 1 – 0.857 = 0.143 or 14.3%
– Probability of error increases from 5% to 14.3%
– If comparing 5 groups instead of 3 (10 pairwise tests), the multiple testing error rate is 40%! (= 1 – (0.95)^n, with n = 10)

• Solution for multiple comparisons: Multiple testing correction
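The error rate across n independent tests follows from 1 – (1 – α)^n. The sketch below reproduces the 14.3% figure for 3 tests; the 40% figure is reproduced with n = 10, on the assumption that the 5 groups are compared pairwise (5 × 4 / 2 = 10 tests).

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 10):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests} tests: {rate:.1%}")
```

Enrichment analyses routinely test hundreds of terms at once, so without correction at least one false positive is close to certain.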

Multiple test corrections

• Bonferroni
– Significance level (e.g. 0.05) / number of tests = new threshold
– This is an over-correction if tests are correlated

• Benjamini-Hochberg
– Rank the p-values
– Apply more stringent correction to the most significant, and least stringent to the least significant p-values
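Both corrections can be expressed as adjusted p-values that are then compared to the original threshold. This is a minimal sketch in plain Python with made-up p-values; the Benjamini-Hochberg function follows the standard step-up procedure (p × number of tests / rank, then made monotonic from the largest p-value down).

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Adjusted p-values controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the least significant p-value (rank m) to the most (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # made-up example values
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

In this example Bonferroni leaves two of the four terms below 0.05, while Benjamini-Hochberg keeps all four, illustrating that it is the less conservative correction.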

Statistical issues

• We want to identify functions of maximal biological significance
– BUT this is not perfectly correlated with statistical significance

• Use p-values as a tool to rank functions but don’t take them too literally

• Need to correct for multiple testing

Tools for functional gene list analysis

• There are many different tools available, both free and commercial

• Popular web-based tools include:

PANTHER (Protein ANnotation THrough Evolutionary Relationship)

http://www.pantherdb.org/

• One of the most widely used online resources for gene function classification and genome-wide data analysis

• PANTHER users have successfully analysed data from:
– Gene expression
– Proteomics
– Genome-wide association study (GWAS) experiments

• PANTHER is part of the GO consortium, thus PANTHER annotation = up to date GO curation

PANTHER for functional classification

Send list to > File: saves the table in a tab-delimited .txt file

PANTHER for statistics

• Annotations from PANTHER include:
– GO-slim terms
– PANTHER “protein class”
– PANTHER “Pathway” terms

• Doesn’t cluster together genes with similar GO terms in table

• Statistics: Binomial test with Bonferroni multiple testing correction

DAVID

https://david.ncifcrf.gov/

• Gathers data from many different databases – this is customisable

• Functional Clustering

• Uses many annotations, including GO-Fat terms – more specific set of GO terms

• Statistics: Fisher’s Exact Test and multiple testing correction

DAVID for functional classification

Functional Clustering

Which DAVID tool should I use?

GOrilla

http://cbl-gorilla.cs.technion.ac.il/

Which tool to use?

Choose a tool that:
– Includes your gene / probe identifiers
– Includes your species
– Has up-to-date annotation
– Lets you define your background (if possible)

– Try a few different tools
– Try gene lists of varying length