Bioinfo/Stat 545 Biostat646 Data Analysis in Molecular Biology
Lab 1: Bioinformatics Online Resources
Dongxiao Zhu
Overview
Main types of biological data
  Sequence data
  Interaction data
  Microarray and gene expression data
  Others: macromolecule structure data, human genes and disease data
Information retrieval strategies
Part I. Online Biological Data Resources
2004 Nucleic Acids Research database issue
  http://www3.oup.co.uk/nar/database/cap/ (database list)
  548 databases listed in total, 162 more than last year
Main types of biomedical data
  Sequence data (DNA and protein sequence)
    Gene sequencing: "whole genome shotgun" and the Lander & Waterman assembly algorithm
    Protein sequencing: de novo sequencing from tandem mass spectra
    Gene prediction, sequence alignment and BLAST
    Gene annotation and Gene Ontology
    Protein/RNA secondary/tertiary structure prediction
  Interaction data: biological pathways and networks
  Microarray and gene expression data
  Others: structure data, human genes and disease
Gene/Protein sequencing: data acquisition and data accuracy
Whole genome shotgun [1]
  Double end sequencing
    Short reads off both ends of large inserts
    Additional information for assembly
  Clone coverage vs. sequence coverage
  Scaffolds
    Ordered and oriented contigs
    Sequence gaps
De novo protein sequencing from tandem mass spectra [2]
Accuracy issues:
  Large scale repeats
  Missing and contaminating data
  Plasmids and minichromosomes
  Signature of tandem repeats
  Polymorphism
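The sequence coverage mentioned for whole genome shotgun can be made concrete with the idealized Lander & Waterman model. The sketch below is only an illustration: it assumes reads land uniformly at random, ignores the minimum overlap needed to join two reads, and uses made-up genome and read sizes.

```python
import math

def sequence_coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_length

def expected_fraction_uncovered(coverage):
    """Probability that a given base is hit by no read (Poisson model): exp(-c)."""
    return math.exp(-coverage)

def expected_num_contigs(num_reads, coverage):
    """Expected number of contigs when any overlap joins two reads: N * exp(-c)."""
    return num_reads * math.exp(-coverage)

# Hypothetical project: 1 Mb genome, 16,000 reads of 500 bp each.
c = sequence_coverage(num_reads=16_000, read_length=500, genome_length=1_000_000)
gap_fraction = expected_fraction_uncovered(c)
contigs = expected_num_contigs(16_000, c)
```

Under these assumptions, higher coverage shrinks both the uncovered fraction and the number of contigs exponentially, which is why sequence gaps become rare at high coverage while repeats remain a problem.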
Gene Prediction, Annotation and Gene Ontology
GENSCAN web service [3]
  http://genes.mit.edu/GENSCAN.html
  Sensitive in recognizing at least one exon
Biochemical functional annotation (biochemical view)
  Clone, expression and functional studies
  Database homolog/ortholog search
    Sequence alignment (similar sequence -> similar function)
    Structure alignment (similar structure -> similar function)
Protein sub-cellular location prediction using primary sequence alone (cellular view)
  Codon usage bias in differently localized proteins
  Signal peptide
Gene ontology: consistent descriptions of gene products in different databases
Sequence Alignment/BLAST and Literature Search: Bioinformatics Approaches to Gene Annotation
Why BLAST?
  The number of novel sequences is growing explosively. Even in E. coli, arguably the best-characterized organism, about half of the ~4200 proteins have not been studied experimentally. Moreover, every newly sequenced genome encodes hundreds to thousands of novel proteins.
  There is a need to infer the functional roles of these novel proteins.
  Compare novel sequences with previously characterized genes to annotate function.
BLAST algorithm [4]
  http://www.bioinformatics.med.umich.edu/Courses/526/lecturenotes.html
BLAST program selection guide
  http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml
BLAST tutorial
  http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Literature search (Part II)
Gene ontology (GO)[5]
Why GO?
  Use of GO terms by several collaborating databases facilitates uniform queries across them.
  GO is hierarchically structured, so a vocabulary can be queried at different levels. For example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases.
  GO allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product.
http://www.geneontology.org/index.shtml#downloads
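Querying a hierarchical vocabulary at different levels can be sketched with a toy term graph. Everything below is hypothetical: the term labels are simplified, the gene names are placeholders, and real GO data would be loaded from the download page above rather than typed in.

```python
# Toy fragment of a GO-like hierarchy (parent -> children). Hypothetical labels.
CHILDREN = {
    "biological process": ["signal transduction", "cell growth"],
    "signal transduction": ["receptor tyrosine kinase signaling"],
}

# Hypothetical annotations: term -> gene products annotated directly to it.
ANNOTATIONS = {
    "signal transduction": {"geneA"},
    "receptor tyrosine kinase signaling": {"geneB", "geneC"},
    "cell growth": {"geneD"},
}

def descendants(term):
    """All terms below `term` in the hierarchy."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in CHILDREN.get(current, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def gene_products(term):
    """Gene products annotated to `term` or to any more specific term."""
    genes = set()
    for t in {term} | descendants(term):
        genes |= ANNOTATIONS.get(t, set())
    return genes

broad = gene_products("signal transduction")
narrow = gene_products("receptor tyrosine kinase signaling")
```

The broad query collects annotations made at any depth below the term, while the narrow query "zooms in" on one branch; this is also why annotators can attach a gene product at whatever level their knowledge supports.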
What is GO? [5]
GO is designed to be a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO is used to annotate genes and gene products.
Three categories of GO
  Biological Process: a biological objective to which the gene or gene product contributes. A process is accomplished via one or more ordered assemblies of molecular functions. E.g. "cell growth and maintenance", "signal transduction", "cAMP biosynthesis".
  Molecular Function: the biochemical activity of a gene product. E.g. "enzyme", "ligand", "Toll receptor ligand".
  Cellular Component: the place in the cell where a gene product is active. E.g. "ribosome", "proteasome", "nuclear membrane".
An interesting analogy for GO
  Statistician's view: a multivariate definition
  DB developer's view: an entity/attributes definition in a DB schema
  Biologist's view: a nomenclature accepted by biochemists/molecular biologists, cell biologists, geneticists, neuroscientists and developmental biologists
What GO is NOT?
GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.
GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following.
  a. Knowledge changes and updates lag behind.
  b. Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
  c. GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO.
GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.
Protein/RNA secondary/tertiary structure prediction
Protein secondary/tertiary structure prediction
  Server list: http://www.embl-heidelberg.de/predictprotein/doc/explain_meta.html#list
  Prediction methodologies: sliding window based and machine learning based
  Easier and feasible at this moment: prediction of 2D topology for some functionally important proteins with simple patterns, e.g. transmembrane proteins [7]
RNA secondary/tertiary structure prediction
  Algorithms: Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998, p. 267
  Michael Zuker's prediction server [6]
    http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi
Interaction Data: Biological Pathway and Network
Three main types of interaction data
  Signal transduction or transcription regulation
  Protein-protein interaction
  Metabolic pathway (best in terms of studying network topology)
Interaction databases
  KEGG database: metabolic pathways and signal transduction pathways in 107 organisms
  http://dip.doe-mbi.ucla.edu/dip/Links.cgi
Network model (random vs. scale free, small world)
Network analysis and visualization software
  http://www-personal.umich.edu/~mejn/courses/2004/cscs535/syllabus.pdf
  Pajek, AT&T DOT etc.
Metabolic Network in Homo sapiens
Summary statistics of network analysis in 16 organisms
Domain    Kingdom/Phylum                Organism        Num nodes  Num edges  Max Kout  Max Kin  Fraction of roots  Fraction of leaves  Single edges  Mutual edges
Eukarya                                 H.sapiens       1040       1528       12        11       0.130769           0.161538            572           478
Eukarya                                 R.norvegicus    763        1028       10        8        0.138925           0.165138            348           340
Eukarya                                 C.elegans       706        974        10        9        0.15864            0.157224            324           325
Eukarya                                 S.cerevisiae    748        1072       9         10       0.129679           0.140374            396           338
Bacteria  Proteobacteria (gamma)        E.coli          893        1365       12        14       0.139978           0.113102            459           453
Bacteria  Proteobacteria (gamma)        V.cholerae      738        1076       12        12       0.150407           0.123306            370           353
Bacteria  Proteobacteria (beta)         R.solanacearum  864        1238       11        12       0.138889           0.118056            406           416
Bacteria  Firmicutes (Bacillales)       B.subtilis      787        1151       12        12       0.133418           0.125794            401           375
Bacteria  Firmicutes (Lactobacillales)  L.lactis        545        778        11        11       0.157798           0.12844             280           249
Bacteria  Actinobacteria                S.coelicolor    814        1154       12        12       0.14742            0.135135            406           374
Bacteria  Cyanobacteria                 T.elongatus     509        697        12        12       0.143418           0.133595            237           230
Archaea   Euryarchaeota                 M.acetivorans   489        633        8         7        0.143149           0.134969            209           212
Archaea   Euryarchaeota                 T.acidophilum   458        593        8         8        0.170306           0.135371            197           198
Archaea   Crenarchaeota                 S.solfataricus  586        730        8         7        0.187713           0.151877            256           237
Archaea   Crenarchaeota                 S.tokodaii      522        651        8         7        0.180077           0.149425            229           211
Archaea   Crenarchaeota                 P.aerophilum    482        622        8         7        0.161826           0.120332            204           209
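The column names in the summary table can be illustrated on a small directed graph. The definitions used here are assumptions, not taken from the slides: a root is a node with no incoming edge, a leaf is a node with no outgoing edge, a mutual edge is a pair A->B and B->A counted once, and self-loops are assumed absent. The edge list is a made-up toy network.

```python
from collections import Counter

def network_summary(edges):
    """Summary statistics for a directed graph given as (source, target) pairs."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    out_degree = Counter(src for src, _ in edge_set)
    in_degree = Counter(dst for _, dst in edge_set)
    roots = [n for n in nodes if in_degree[n] == 0]    # nothing points to them
    leaves = [n for n in nodes if out_degree[n] == 0]  # they point to nothing
    single = sum(1 for a, b in edge_set if (b, a) not in edge_set)
    mutual = sum(1 for a, b in edge_set if (b, a) in edge_set) // 2
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edge_set),
        "max_kout": max(out_degree.values()),
        "max_kin": max(in_degree.values()),
        "fraction_roots": len(roots) / len(nodes),
        "fraction_leaves": len(leaves) / len(nodes),
        "single_edges": single,
        "mutual_edges": mutual,
    }

# Hypothetical toy network: E -> A <-> B -> C -> D
toy = network_summary([("E", "A"), ("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")])
```

With these definitions, single edges plus twice the mutual edges equals the number of edges. The table is consistent with that reading, e.g. for H.sapiens 572 + 2 x 478 = 1528.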
Microarray and Gene Expression Data
Assumptions
  Measured signal is proportional to the amount of corresponding cDNA/mRNA
  Amount of mRNA determines amount of protein, i.e. there is no regulation at the translation level
  Neither assumption has been proven yet.
DNA microarray databases (useful links)
  http://industry.ebi.ac.uk/~alan/MicroArray/
  http://genome-www5.stanford.edu/resources.html
  http://www.ebi.ac.uk/microarray/
  There are many more; explore them yourself!
Download Gene Expression Data from SMD: An Example
Stanford Microarray Database (SMD)
Retrieving public data from SMD
  Retrieving data for an organism
    ftp://genome-ftp.stanford.edu/pub/smd/organisms
    One directory per organism, named with the two-letter code used by SMD
    Under each directory, one file per experiment
  Three ways to retrieve
    Web client, e.g. IE, Netscape, etc.
    Graphic ftp client, e.g. Flashget, etc.
    Command line ftp client
      ftp -i genome-ftp.stanford.edu   (-i turns off per-file prompting, so mget fetches them all)
      Name: anonymous
      Password: XX@
      cd pub/smd/organisms/SC
      mget *gz
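The command line session can also be scripted. The following is a minimal sketch using Python's standard ftplib; the host and directory are the ones on the slide and may no longer be reachable, and the function name fetch_organism is made up for illustration.

```python
from ftplib import FTP
from pathlib import Path

SMD_HOST = "genome-ftp.stanford.edu"  # host from the slide; availability not guaranteed
SMD_ROOT = "pub/smd/organisms"        # one sub-directory per two-letter organism code

def fetch_organism(ftp, organism_code, dest_dir):
    """Download every file ending in 'gz' for one organism (like `mget *gz`)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    ftp.cwd(f"{SMD_ROOT}/{organism_code}")
    saved = []
    for name in ftp.nlst():
        if not name.endswith("gz"):
            continue
        with open(dest / name, "wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)
        saved.append(name)
    return saved

# Usage (requires network access and a live server):
#   with FTP(SMD_HOST) as ftp:
#       ftp.login()                      # anonymous login
#       fetch_organism(ftp, "SC", "smd_data")
```

Passing the connection in as an argument keeps the download logic separate from the login step, so the same function works with any object that offers cwd, nlst and retrbinary.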
Continued
Retrieving all public data for a publication
  Go to http://genome-www5.stanford.edu/cgi-bin/tools/display/listMicroArrayData.pl?tableName=publication
  Click any entry in column "Data in SMD"
  Click "view" to read a brief description of the experiment design
  Click "display data" to do an experiment-wise query
  Click "Data Retrieval and Analysis" to filter and retrieve data
Part II. Information Retrieval in Bioinformatics
Mastering effective information retrieval techniques keeps your research thinking and work up to date.
My steps in doing biomedical research
  Identify an interesting topic and raise a scientific hypothesis
  Start from NCBI Entrez, the life science search engine: http://www.ncbi.nlm.nih.gov/Entrez/
  Input the keyword or phrase into the query box and click GO
  The number of items retrieved from each resource is displayed
  Briefly go through each kind of resource
NCBI Entrez (good starting point)
  Common retrieval interface to many databases
  Controlled links between databases
  Maintained at the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM)
PubMed and related IR Strategies
- biomedical literature and books
What is PubMed?
  PubMed is a web-based database of bibliographic information drawn primarily from the life sciences literature.
  PubMed tutorial: http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html
Search mechanisms
  PubMed uses an Automatic Term Mapping feature
    Look first in the MeSH Translation Table (translate keywords into MeSH terms, e.g. from "renal transplant" to "kidney transplant")
    Then look in the journal translation table
    Finally look in the author index
  As soon as PubMed finds a match, the mapping stops. That is, if a term matches in the MeSH Translation Table, PubMed does not continue looking in the next table.
  It is absolutely necessary to specify the "Limit" in NCBI. E.g. "cell" is both a MeSH term and a journal name.
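The stop-at-first-match behaviour described above can be sketched as an ordered lookup. The three tables below are tiny hypothetical stand-ins, not real PubMed data, and the function name is made up for illustration.

```python
# Hypothetical miniature lookup tables, checked in the order given on the slide.
MESH_TABLE = {"renal transplant": "kidney transplant", "cell": "cell"}
JOURNAL_TABLE = {"cell": "Cell"}
AUTHOR_INDEX = {"states dj": "States DJ"}

LOOKUP_ORDER = [
    ("MeSH Translation Table", MESH_TABLE),
    ("Journal Translation Table", JOURNAL_TABLE),
    ("Author Index", AUTHOR_INDEX),
]

def automatic_term_mapping(term):
    """Return (table name, mapped term) for the first table that matches, else None."""
    key = term.strip().lower()
    for table_name, table in LOOKUP_ORDER:
        if key in table:
            return table_name, table[key]  # mapping stops at the first match
    return None

cell_hit = automatic_term_mapping("cell")
```

In this toy setup "cell" is found in the MeSH table first, so the journal named Cell is never considered, which is the situation the "Limit" advice is meant to avoid.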
PubMed - Continued
What if "no match" is found?
  PubMed is unable to match a search term with either of the translation tables or the Author Index.
  PubMed will then search the individual words in All Fields. Individual terms will be combined (ANDed) together.
  Example: TATA Box associated transcription factor
Phrase searching
  Enclosing a phrase in double quotes instructs PubMed to bypass automatic term mapping. Instead, PubMed looks for the phrase in its Index of searchable terms. If the phrase is in the Index, PubMed will retrieve citations that contain the phrase.
  PubMed may fail to find a phrase because it is not in the Index. Your phrase may actually appear in citation and abstract data, but may not be in the Index. If this is the case, the double quotes are ignored and the phrase is processed using Automatic Term Mapping.
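The decision flow on this slide (phrase index first for quoted input, then term mapping, then ANDed single words) can be sketched as follows. The two lookup sets are hypothetical toy data, and the returned labels are invented for illustration rather than PubMed output.

```python
# Hypothetical toy data standing in for PubMed's internal tables.
PHRASE_INDEX = {"transcription factor"}
TRANSLATION_TABLES = {"renal transplant": "kidney transplant"}

def interpret(query):
    """Return (strategy, payload) describing how the query would be handled."""
    text = query.strip()
    quoted = len(text) >= 2 and text[0] == '"' and text[-1] == '"'
    if quoted:
        phrase = text[1:-1].lower()
        if phrase in PHRASE_INDEX:
            return ("phrase", phrase)   # found in the index of searchable terms
        text = phrase                   # not indexed: quotes are ignored
    key = text.lower()
    if key in TRANSLATION_TABLES:
        return ("mapped", TRANSLATION_TABLES[key])
    # No match anywhere: search each word in All Fields, ANDed together.
    return ("and", key.split())

no_match = interpret("TATA Box associated transcription factor")
```

The fallback explains why a long unquoted query can return few results: every word must occur somewhere in the record.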
MeSH database ("GO" in literature search)
Database of indexing terms
Entry example
  NF-kappa B
  Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA.
  Year introduced: 1991
Entrez => MeSH
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh
NLM => MeSH
  http://www.nlm.nih.gov/mesh/meshhome.html
Structure of MeSH (Combination of EC and GO)
Divisions
  Anatomy [A]
  Organisms [B]
  Diseases [C]
  Chemicals and Drugs [D]
  Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
  Psychiatry and Psychology [F]
  Biological Sciences [G]
  Physical Sciences [H]
  Anthropology, Education, Sociology and Social Phenomena [I]
  Technology and Food and Beverages [J]
  Humanities [K]
  Information Science [L]
  Persons [M]
  Health Care [N]
  Geographic Locations [Z]
Hierarchy with Multiple Inheritance
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    DNA-Binding Proteins [D12.776.260]
      NF-kappa B [D12.776.260.600]
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    Nuclear Proteins [D12.776.660]
      NF-kappa B [D12.776.660.600]
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    Transcription Factors [D12.776.930]
      NF-kappa B [D12.776.930.600]
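A MeSH tree number encodes the path to a term, so the broader terms can be recovered by dropping trailing segments. The sketch below uses only the DNA-Binding Proteins branch shown on the slide; the function name and the small lookup dictionary are made up for illustration.

```python
# Headings from the slide, keyed by tree number.
HEADINGS = {
    "D12": "Amino Acids, Peptides, and Proteins",
    "D12.776": "Proteins",
    "D12.776.260": "DNA-Binding Proteins",
    "D12.776.260.600": "NF-kappa B",
}

def ancestor_tree_numbers(tree_number):
    """Tree numbers of all broader terms, nearest parent first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

path = [HEADINGS[t] for t in ancestor_tree_numbers("D12.776.260.600")]
```

A term with multiple inheritance simply carries several tree numbers, one per parent, and each can be walked upward in the same way.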
MeSH Full Listing
NF-kappa B
  Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA.
  Year introduced: 1991
Subheadings:
  administration and dosage, agonists, analysis, antagonists and inhibitors, biosynthesis, blood, cerebrospinal fluid, chemistry, classification, deficiency, diagnostic use, drug effects, genetics, immunology, isolation and purification, metabolism, pharmacokinetics, pharmacology, physiology, radiation effects, secretion, therapeutic use, toxicity, ultrastructure
Restrict Search to Major Topic headings only
Do Not Explode this term (i.e., do not include MeSH terms found below this term in the MeSH tree).
Entry Terms:
  NF-kB; NF kB; Nuclear Factor kappa B; kappa B Enhancer Binding Protein; Immunoglobulin Enhancer-Binding Protein; Enhancer-Binding Protein, Immunoglobulin; Immunoglobulin Enhancer Binding Protein; Transcription Factor NF-kB; Factor NF-kB, Transcription; NF-kB, Transcription Factor; Transcription Factor NF kB; Ig-EBP-1; Ig EBP 1
Previous Indexing:
  DNA-Binding Proteins (1987-1990)
  Transcription Factors (1987-1990)
See Also: I-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > DNA-Binding Proteins > NF-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Nuclear Proteins > NF-kappa B
All MeSH Categories > Chemicals and Drugs Category > Amino Acids, Peptides, and Proteins > Proteins > Transcription Factors > NF-kappa B
Tips for increasing your searching sensitivity and specificity
Split the query yourself with logical AND and OR, look each term up yourself in the MeSH database, and use MeSH terms in your query
Use tags to search efficiently
  [au], "author", e.g. States DJ[au]
  [dp], "date of publication", e.g. 2004[dp]
  [ad], "address", e.g. Ann Arbor[ad], etc.
  [MeSH], "MeSH term", e.g. Transcription factor[MeSH]
Select the "Limited to" option to prevent the search from stopping prematurely
Use phrase searching ("") if you don't want your phrase to be partially searched
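Combining tagged terms with explicit Boolean operators is easy to do programmatically. The helper names below are invented for this sketch; it only assembles a query string in the style of the slide examples and does not contact PubMed.

```python
def tagged(term, tag):
    """Attach a search field tag, e.g. tagged('2004', 'dp') -> '2004[dp]'."""
    return f"{term}[{tag}]"

def all_of(*clauses):
    """Join clauses with AND inside parentheses."""
    return "(" + " AND ".join(clauses) + ")"

def any_of(*clauses):
    """Join clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"

query = all_of(
    tagged("Transcription factor", "MeSH"),
    tagged("States DJ", "au"),
    tagged("2004", "dp"),
)
```

Writing the operators and tags explicitly leaves less for the search engine to guess, which is the point of the tips above.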
Entrez Clipboard and Address Issue
Send to "clipboard"
  Place to save results collected from multiple searches
  Saved for ~1 hr
Task: Find a local expert on NF kappa B
  "NF kappa B" AND (48109 [ad] OR "Ann Arbor" [ad] NOT Pfizer [ad])
  (scan results for the most common senior author)
Need to think about all the ways people write addresses
  "University of Michigan" fails to pick up "Univ. Mich." or "UMMS" etc.
  Zipcodes are very specific, but only get about 70%
  Won't catch co-authored articles with a remote collaborator
IR Strategies
Term search
  Simple search for term matches (exact or stemmed)
  "Find articles containing 'p53'"
Boolean
  Logical combination of term matches
  "Find articles containing 'p53' AND 'apoptosis'"
Statistical neighboring
  Assume that articles on the same subject will use similar words
  Rank articles by similarity of word use
  "Find articles using vocabulary similar to the vocabulary in this title/abstract"
Deeper parsing
  Natural language processing and deeper understanding
  The field is still in its infancy
  "Find articles describing the mechanism of p53 activation in apoptosis"
Boolean Searches
Entrez attempts to intelligently parse your query
  Query: dna binding transcription factor macrophage
  Details => (((("dna"[MeSH Terms] OR dna[Text Word]) AND (("pharmacokinetics"[MeSH Subheading] OR "pharmacokinetics"[MeSH Terms]) OR binding[Text Word])) AND ("transcription factors"[MeSH Terms] OR transcription factor[Text Word])) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
You can force a Boolean search
  Query: "dna binding" AND "transcription factor" AND macrophage
  Details => (("dna binding"[All Fields] AND "transcription factor"[All Fields]) AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
Phrase Searching
Specify with quotes
  "transcription factor" vs. "transcription" "factor"
Precomputed
  Fast
  Often mapped to synonyms and MeSH terms
  Just because you get a "phrase not found" message does not mean it is not present
Text Neighboring
Related articles link (single or multiple articles)
Term usage similarity
  Articles talking about the same thing are likely to use the same words
  Good recall (sensitivity)
  Precomputed and fast
Limitations
  Strictly algorithmic, no understanding
    "Ras activates PI3K" vs. "PI3K activates Ras"
  Historical and author biases in vocabulary
  Poor precision (specificity)
  Ranking cannot satisfy everyone
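The "no understanding" limitation is easy to demonstrate: a method that only counts word usage sees the two example sentences as identical. The snippet is a minimal illustration of that point, not the algorithm behind the related articles link.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts with order discarded."""
    return Counter(text.lower().split())

forward = bag_of_words("Ras activates PI3K")
reverse = bag_of_words("PI3K activates Ras")
indistinguishable = forward == reverse
```

Both sentences reduce to the same three counts, so any ranking built purely on vocabulary overlap cannot tell which protein activates which.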
Computational Issues in Statistical Text Retrieval
Stop words
  Simple words like "the" and "and" are not worth scoring
Term weights
  Should weight matches of rare words more heavily than matches of common words
Stemming and synonyms
  Need to stem verbs and plural forms
  May or may not be able to reduce to a normalized set of synonyms
Normalizing for length
  Don't want to exclude short articles or articles without an abstract
All vs. all comparison is not feasible
  10^7 articles => 10^14 comparisons, not feasible
  Compute demands of the task are growing faster than Moore's law
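Several of these issues (stop words, heavier weights for rare words, length normalization) appear together in a standard TF-IDF cosine similarity. The sketch below is one common formulation, not necessarily what PubMed uses; the stop word list and the three example titles are made up, and stemming is left out for brevity.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "a", "is"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Unit-length TF-IDF vectors, one dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Rare words get larger weights; +1 keeps words that occur in every document.
    idf = {w: math.log(n_docs / doc_freq[w]) + 1.0 for w in doc_freq}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {w: tf[w] * idf[w] for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})  # length normalization
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = [
    "p53 induces apoptosis in the cell",
    "activation of p53 and apoptosis",
    "metabolic network topology in yeast",
]
vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

Because every vector is scaled to unit length, a short title is not penalized against a long abstract. Scoring all pairs this way is quadratic in the number of articles, which is the feasibility problem noted above.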
Acknowledgements
Some slides in Part II are taken from Dr. States' Bioinfo 526 class
  http://www.bioinformatics.med.umich.edu/Courses/526
Dr. Zhaohui (Steve) Qin for helpful discussion
All authors of references that I have cited