Extracting Biological Information from Gene Lists
Simon Andrews, Laura Biggins, Boo Virk
[email protected]@babraham.ac.uk
v1.0
Biological material
Isolation of DNA, RNA or proteins
Sample for analysis
Sample processing
Analysis of processed sample: data acquisition (sequencing, microarray analysis, mass spectrometry)
Raw data file(s)
Data analysis: identification of genes, transcripts or proteins
Public databases
Results table containing hits: genes, transcripts or proteins
Analysis doesn’t end here!
Why functional analysis?
Advantages:
• Biological insight
• Validation of experiment
• Generate new hypotheses
Limitations:
• Amount of information depends on the species
• Will only find known/published links between genes
  – If working on something novel, the information available may be limited
What this course covers
• Morning
  – Introduction to Gene Lists
  – Gene List Practical
• Coffee
  – Presenting Results
  – Presenting Results Practical
• Afternoon
  – Motif Searching
  – Motif Searching Practical
• Coffee
  – Networks and Interactions
  – Network Practical
  – Commercial tools
Gene Lists
Types of gene list:
• Names of genes
• Names of genes ordered by qualitative value
• Names of genes ordered by quantitative value
• Gene lists can be ranked
  – P-value
  – Other statistic
  – Ordered
• Gene lists can be filtered
  – Cut-off point
  – Subset of genes
Transforming Gene IDs
• Need to use the relevant ID to extract information from databases
• The BioMart ID conversion tool allows us to do this easily and quickly online
http://www.biomart.org/
I have my gene list, what next?
Hyperlinked table:
Gene    UniProt  Name                                   Score   Reactome  UniProt
TP53    P04637   Tumor Suppressor p53                   125527  Reactome  UniProt
CDK1    P06493   Cyclin-dependent kinase 1              113740  Reactome  UniProt
POLE    Q07864   DNA Polymerase Epsilon                 107190  Reactome  UniProt
KPNB1   Q14974   Importin subunit beta-1                35542   Reactome  UniProt
CHEK1   O14757   Serine/threonine-protein kinase Chk1   35271   Reactome  UniProt
AURKB   Q96GD4   Aurora kinase B                        30803   Reactome  UniProt
RPA2    P15927   Replication protein A 32 kDa subunit   22207   Reactome  UniProt
CDT1    Q9H211   DNA replication factor Cdt1            21735   Reactome  UniProt
MCMBP   Q9BTE3   MCM complex-binding protein            17811   Reactome  UniProt
TUBG1   P23258   Tubulin gamma-1 chain                  16895   Reactome  UniProt
RAN     P62826   GTP-binding nuclear protein Ran        16384   Reactome  UniProt
RANGRF  Q9HD47   Ran guanine nucleotide factor Mog1     15527   Reactome  UniProt
BLM     P54132   Bloom syndrome protein                 14883   Reactome  UniProt
PCNA    P12004   Proliferating Cell Nuclear Antigen     13982   Reactome  UniProt
SETD8   Q9NQR1   Pr-Set7                                13711   Reactome  UniProt
RCC1    P18754   Regulator of chromosome condensation   13302   Reactome  UniProt
MCM5    P33992   DNA replication licensing factor MCM5  12806   Reactome  UniProt
CDC25C  P30307   M-phase inducer phosphatase 3          12510   Reactome  UniProt
PLK1    P53350   Serine/threonine-protein kinase PLK1   10930   Reactome  UniProt
MZT1    Q08AG7   Mitotic-spindle organizing protein 1   9210    Reactome  UniProt
Hyperlinked tables
Advantages
  – Easy to create: no special data-mining software needed
  – One-click direct access to relevant pages
  – Reference resource
However…
  – Need to become familiar with the resources available: tailor hyperlinks to be specific for your organism and the questions being asked
  – Information on only one gene at a time in your gene set
  – Need to get the relevant resource IDs
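A hyperlinked table of this kind can be produced with a few lines of scripting. The following is a minimal sketch in Python using three rows from the table; the two URL templates are assumptions about how UniProt and Reactome address entries and should be checked against each resource before use, and `make_links` and `as_html_row` are hypothetical helper names.

```python
# Sketch only: the URL templates below are assumptions, not confirmed
# by the course material. Verify them against UniProt and Reactome.
UNIPROT_URL = "https://www.uniprot.org/uniprotkb/{acc}/entry"
REACTOME_URL = "https://reactome.org/content/query?q={acc}"

# (gene, UniProt accession, name, score) taken from the hyperlinked table
genes = [
    ("TP53", "P04637", "Tumor Suppressor p53", 125527),
    ("CDK1", "P06493", "Cyclin-dependent kinase 1", 113740),
    ("POLE", "Q07864", "DNA Polymerase Epsilon", 107190),
]


def make_links(accession):
    """Return one URL per resource for a UniProt accession."""
    return {
        "Reactome": REACTOME_URL.format(acc=accession),
        "UniProt": UNIPROT_URL.format(acc=accession),
    }


def as_html_row(gene, accession, name, score):
    """Format one gene as an HTML table row with clickable links."""
    links = " ".join(
        f'<a href="{url}">{label}</a>'
        for label, url in make_links(accession).items()
    )
    return (
        f"<tr><td>{gene}</td><td>{accession}</td>"
        f"<td>{name}</td><td>{score}</td><td>{links}</td></tr>"
    )


html = "<table>\n" + "\n".join(as_html_row(*g) for g in genes) + "\n</table>"
print(html)
```

Keeping the templates in one place makes it easy to tailor the links to a different organism or resource, which is the main maintenance cost noted above.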
I have my gene list, what next?
Annotated table:
Gene    UniProt  Name                                   Score   PANTHER GO-Slim BP
CHEK1   O14757   Serine/threonine-protein kinase Chk1   35271   apoptotic process;nitrogen compound metabolic process;biosynthetic process;transcription from RNA polymerase II promoter;cellular protein modification process;cell cycle;cell communication;apoptotic process;response to stress;response to abiotic stimulus;regulation of transcription from RNA polymerase II promoter;regulation of cell cycle;chromatin organization
MCM5    P33992   DNA replication licensing factor MCM5  12806   cell cycle;cell communication
RCC1    P18754   Regulator of chromosome condensation   13302   cellular component movement;mitosis;chromosome segregation;cellular component morphogenesis;intracellular protein transport;cellular component organization
CDC25C  P30307   M-phase inducer phosphatase 3          12510   DNA replication;cell cycle
PLK1    P53350   Serine/threonine-protein kinase PLK1   10930   DNA replication;DNA repair;DNA recombination;cell cycle
TP53    P04637   Tumor Suppressor p53                   125527  glycogen metabolic process;protein phosphorylation;mitosis;cell communication
CDK1    P06493   Cyclin-dependent kinase 1              113740  nitrogen compound metabolic process;biosynthetic process;DNA replication;RNA metabolic process;cellular process;regulation of biological process;regulation of catalytic activity
BLM     P54132   Bloom syndrome protein                 14883   nucleobase-containing compound metabolic process;cell cycle;cell communication;RNA localization;intracellular protein transport;nuclear transport
RPA2    P15927   Replication protein A 32 kDa subunit   22207   nucleobase-containing compound metabolic process;mitosis;nucleobase-containing compound transport;regulation of catalytic activity
TUBG1   P23258   Tubulin gamma-1 chain                  16895   phosphate-containing compound metabolic process;cellular protein modification process;cell cycle
KPNB1   Q14974   Importin subunit beta-1                35542   phosphate-containing compound metabolic process;protein phosphorylation;cytokinesis;cell cycle;regulation of cell cycle;chromatin organization;cytoskeleton organization
MZT1    Q08AG7   Mitotic-spindle organizing protein 1   9210    protein targeting;nuclear transport
PCNA    P12004   Proliferating Cell Nuclear Antigen     13982
RAN     P62826   GTP-binding nuclear protein Ran        16384
POLE    Q07864   DNA Polymerase Epsilon                 107190
AURKB   Q96GD4   Aurora kinase B                        30803
MCMBP   Q9BTE3   MCM complex-binding protein            17811
CDT1    Q9H211   DNA replication factor Cdt1            21735
RANGRF  Q9HD47   Ran guanine nucleotide factor Mog1     15527
SETD8   Q9NQR1   Pr-Set7                                13711
Annotation Sources
There are many databases for annotation sources, including (but not limited to):
• Gene Ontology (GO) (most popular)
• Pathways and Interactions
• Protein domains
• Co-expression
• Transcription binding sites
What is Gene Ontology (GO)?
• Collaborative effort addressing the need for consistent descriptions of gene products across different databases
• The GO project has three structured ontologies describing gene products independent of species:
  – Biological Processes (BP)
  – Cellular Components (CC)
  – Molecular Functions (MF)
Subsets of GO terms
• GO slim terms:
  – Cut-down versions of the GO ontologies that contain a subset of terms from the GO resource
  – Give a broad overview of the ontology content without the detail of the specific, fine-grained terms
• GO fat terms:
  – Subset comprising more specific terms
Pathways and Interactions
• Are specific pathways enriched in my list?
• What other genes are in this pathway?
• Which genes/gene products interact with my genes of interest?
• Databases include:
Protein Domain
• Can I find shared protein domains?
• What is the function of the shared domain?
• Which other proteins share this domain?
• Databases include:
Co-expression
• Which genes are co-expressed?
• Automatic grouping of genes, rather than human curation as in GO
• Databases include:
Annotated tables
Advantages
  – Information on function from larger gene sets
  – Sort groups of genes (GO term, pathway, protein domain)
  – Relatively easy to create
  – Reference resource
However…
  – Need to become familiar with the resources specific to your research
  – Lots of information can be difficult to sort efficiently
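Sorting genes into groups by annotation is straightforward once the annotated table is in a script. A minimal sketch in Python, using five rows of the PANTHER GO-Slim BP column from the annotated table; `genes_by_term` is a hypothetical helper name, and the semicolon-separated format is assumed from how the terms appear in that table.

```python
from collections import defaultdict

# Gene -> semicolon-separated GO-Slim BP terms, copied from the annotated table
annotations = {
    "MCM5": "cell cycle;cell communication",
    "CDC25C": "DNA replication;cell cycle",
    "PLK1": "DNA replication;DNA repair;DNA recombination;cell cycle",
    "TUBG1": "phosphate-containing compound metabolic process;"
             "cellular protein modification process;cell cycle",
    "MZT1": "protein targeting;nuclear transport",
}


def genes_by_term(annotations):
    """Invert gene -> terms into term -> list of genes."""
    groups = defaultdict(list)
    for gene, terms in annotations.items():
        # dict.fromkeys drops repeated terms while keeping their order
        for term in dict.fromkeys(terms.split(";")):
            groups[term].append(gene)
    return dict(groups)


groups = genes_by_term(annotations)

# Largest groups first, ties broken alphabetically by term
for term, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{len(members)}  {term}: {', '.join(members)}")
```

Grouping like this shows which terms are shared, but not whether a term is over-represented; that needs a comparison against a background list.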
What does functional information tell me?
• Have functional information about a gene set
• Can verify that genes implicated in the experiment are functionally relevant, and discover unexpected shared functions
• Determine which functions are enriched in the gene set
  – How? Compare to a background list of genes
What is a background list?
• In theory, any gene that could have been differentially expressed in your experiment
• RNA-seq: all genes apart from those with fewer than 10-20 reads
• Arrays: all genes on the array
• ChIP-seq: any gene on the chip
Choosing a Background List
• Which background list to use?
  – Whole set of genes
  – Tissue/cell specific genes
  – Manually made list, derived from your experiment and analysis
Statistics to test for enrichment
Background: 13,101 genes on chip
• Related to disease: 3005/13,101 = 22.9%
Gene list: 747 genes
• Related to disease: 260/747 = 34.8%
• Not related to disease: 487/747 = 65.2%
Are these proportions the same?
Lots of Statistical tests to choose from
• Hypergeometric test
• Fisher’s exact/Chi-squared
• Binomial
• Kolmogorov-Smirnov
• Permutation
Hypergeometric test
• Uses the hypergeometric distribution to measure the probability of drawing a specific number of successes (out of a total number of draws) from a population, without replacement
• Example: Imagine that there are 4 green and 16 red marbles in a box. You close your eyes and draw 5 marbles without replacement. What is the probability that exactly 2 of the 5 are green?
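The marble example can be worked through directly. A minimal sketch in Python using only the standard library; `hypergeom_pmf` is a hypothetical helper name, and it counts the ways of choosing the green and red marbles and divides by the number of ways of drawing 5 from 20.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes in `draws` draws without replacement)."""
    ways_success = comb(successes, k)                        # pick k green
    ways_failure = comb(population - successes, draws - k)   # pick the rest red
    return ways_success * ways_failure / comb(population, draws)


# 4 green + 16 red = 20 marbles, draw 5, ask for exactly 2 green
p = hypergeom_pmf(k=2, population=20, successes=4, draws=5)
print(f"{p:.4f}")  # 0.2167  (= 6 * 560 / 15504)
```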
Background: 13,101 genes on chip
• Map to disease: 3005/13,101 = 22.9%
Gene list: 747 genes
• Map to disease: 260/747 = 34.8%
• Do not map to disease: 487/747 = 65.2%
Are these proportions the same?
What is the probability that exactly 260 genes (out of 747) map to disease, given that there are 3005 of those genes in the background (13,101 genes)? The enrichment p-value is the probability of seeing 260 or more.
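The same calculation applies to the gene list figures. A minimal sketch in Python with the numbers from the slide; it reports both the probability of exactly 260 hits and the upper tail (260 or more), the latter being what enrichment tools usually report as the p-value. The variable names are illustrative only.

```python
from math import comb

N, K = 13101, 3005   # background genes, background genes mapping to disease
n, k = 747, 260      # gene list size, gene list genes mapping to disease

expected = n * K / N  # hits expected if the list were a random draw

total_ways = comb(N, n)

# Exactly k hits: sum integer counts first, convert to float only at the end
p_exact = comb(K, k) * comb(N - K, n - k) / total_ways

# k or more hits (upper tail), i.e. the over-representation p-value
tail_ways = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, n + 1))
p_tail = tail_ways / total_ways

print(f"expected hits: {expected:.1f}, observed: {k}")
print(f"P(X = {k})  = {p_exact:.3g}")
print(f"P(X >= {k}) = {p_tail:.3g}")
```

Python integers have arbitrary precision, so the large binomial coefficients are exact; a statistics library would normally be used instead for speed and convenience.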
Hypergeometric test
Limitations:
• Assumes independence of categories
  – Result terms often include directly related terms
  – Is there really evidence for both terms?
• Works better with larger samples (5% of background)
Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/ranked list  Large (5% of background)  P-value  Finite population: probability of success changes

Based on the hypergeometric test:
Test            Input                 Sample size               Output   Specifics
Fisher’s exact  Unranked/ranked list  Small                     P-value  Can be used to compare 2 conditions as well as gene list to background; one-tailed or two-tailed
Binomial        Unranked/ranked list  Large                     P-value  Does not assume finite population: probability of success remains the same
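The difference between the finite-population (hypergeometric) and fixed-probability (binomial) assumptions can be seen by running both on the gene list figures used earlier (260 of 747 list genes against 3005 of 13,101 background genes). This is a minimal sketch in Python with the standard library only; it is an illustration of the two models, not a reproduction of any particular tool's implementation.

```python
from math import comb

N, K = 13101, 3005   # background size, background genes in the category
n, k = 747, 260      # gene list size, gene list genes in the category

# Hypergeometric upper tail: draws without replacement, so the chance of a
# hit changes as genes are removed from the finite background.
hyper_tail = (
    sum(comb(K, i) * comb(N - K, n - i) for i in range(k, n + 1)) / comb(N, n)
)

# Binomial upper tail: every draw has the same success probability K/N.
# Written with integers (K**i * (N-K)**(n-i) / N**n) to avoid float underflow.
binom_tail = (
    sum(comb(n, i) * K**i * (N - K) ** (n - i) for i in range(k, n + 1)) / N**n
)

print(f"hypergeometric P(X >= {k}) = {hyper_tail:.3g}")
print(f"binomial       P(X >= {k}) = {binom_tail:.3g}")
```

With a gene list that is a sizeable fraction of the background, the two tails diverge; when the list is tiny relative to the background they become nearly identical.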
Limitations of Fisher’s Exact and Binomial test
• Neither accounts for variation in the number of genes annotated to the individual terms/functions being tested
• or for the number of terms/functions associated with individual genes
• Therefore, they tend to over-estimate significance if the gene set has an unusually high number of annotations
• Assume independence of categories
Lots of Statistical tests to choose from
• Hypergeometric
• Fisher’s exact/Chi-squared
• Binomial
• Kolmogorov-Smirnov (used for ranked gene lists only)
• Permutation (used for ranked gene lists only)
Output: enrichment scores (ES) for functions, which can then be translated into a p-value
Multiple testing correction
Error types in statistics:
                      True state in gene list
Statistical decision  Not overrepresented            Overrepresented
Significant           Type I error (false positive)  Correct
Not significant       Correct                        Type II error (false negative)
Traditionally, a test or a difference is said to be “significant” if the probability of a type I error is α ≤ 0.05
• Example: You want to compare 3 groups and you carry out 3 hypergeometric tests, each with a 5% level of significance (P<0.05)
• Probability of not making a type I error in one test = 1 – 0.05 = 0.95 (95%)
  – Overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857
  – Therefore the probability of at least one type I error is: 1 – 0.857 = 0.143 or 14.3%
  – The probability of error increases from 5% to 14.3%
  – If comparing 5 groups instead of 3 (10 pairwise tests), the multiple testing error rate is 40%! (= 1 – 0.95^n, where n is the number of tests; here n = 10)
• Solution for multiple comparisons: multiple testing correction
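The arithmetic behind these error rates is short enough to check in code. A minimal sketch in Python, assuming that "comparing groups" means testing every pair of groups, which is the reading that reproduces the 14.3% and 40% figures; the function names are hypothetical.

```python
from math import comb


def pairwise_comparisons(n_groups):
    """Number of tests needed to compare every pair of groups."""
    return comb(n_groups, 2)


def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for groups in (3, 5):
    tests = pairwise_comparisons(groups)
    print(f"{groups} groups -> {tests} tests -> {familywise_error(tests):.1%}")
# 3 groups -> 3 tests -> 14.3%
# 5 groups -> 10 tests -> 40.1%
```

The formula assumes the tests are independent; with correlated tests the true rate is lower, but it still grows with the number of tests.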
Multiple test corrections
• Bonferroni
  – Significance level (e.g. 0.05) / number of tests = new threshold
  – This is an over-correction if tests are correlated
• Benjamini-Hochberg
  – Rank the p-values
  – Apply the most stringent correction to the most significant p-values, and the least stringent to the least significant
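Both corrections can be sketched in a few lines. The following Python example is a minimal illustration, not the implementation used by any particular tool; the p-values are made up for the example and the function names are hypothetical. Benjamini-Hochberg is written in its common "adjusted p-value" form, where the p-value at rank r (1 = most significant) out of m tests is multiplied by m/r and then kept from exceeding any less significant adjusted value.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the least to the most significant p-value
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five tested functions (illustration only)
p_values = [0.001, 0.008, 0.02, 0.03, 0.20]

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

print("Bonferroni significant:", sum(bonf))                       # 2
print("Benjamini-Hochberg significant:", sum(a <= 0.05 for a in bh))  # 4
```

On this example Bonferroni keeps two functions and Benjamini-Hochberg keeps four, which illustrates why Bonferroni is described as the more conservative choice.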
Statistical issues
• We want to identify functions of maximal biological significance
  – BUT this is not perfectly correlated with statistical significance
• Use p-values as a tool to rank functions, but don’t take them too literally
• Need to correct for multiple testing
Tools for functional gene list analysis
• There are many different tools available, both free and commercial
• Popular web-based tools include:
PANTHER(Protein ANnotation THrough Evolutionary Relationship)
http://www.pantherdb.org/
• One of the most widely used online resources for gene function classification and genome wide data analysis
• PANTHER users have successfully analysed data from:
  – Gene expression
  – Proteomics
  – Genome-wide association study (GWAS) experiments
• PANTHER is part of the GO consortium, thus PANTHER annotation = up to date GO curation
• Annotations from PANTHER include:
  – GO-slim terms
  – PANTHER “protein class”
  – PANTHER “Pathway” terms
• Doesn’t cluster together genes with similar GO terms in table
• Statistics: Binomial test with Bonferroni multiple testing correction
DAVID (Database for Annotation, Visualization and Integrated Discovery)
https://david.ncifcrf.gov/
• Gathers data from many different databases – this is customisable
• Functional Clustering
• Uses many annotations, including GO-Fat terms – more specific set of GO terms
• Statistics: Fisher’s Exact Test and multiple testing correction