GS01 0163 Analysis of Microarray Data -...

62
GS01 0163 Analysis of Microarray Data Keith Baggerly and Brad Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center [email protected] [email protected] 7 September 2010

Transcript of GS01 0163 Analysis of Microarray Data -...

  • GS01 0163Analysis of Microarray Data

    Keith Baggerly and Brad BroomDepartment of Bioinformatics and Computational Biology

    UT M. D. Anderson Cancer [email protected]@mdanderson.org

    7 September 2010

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 1

    Lecture 3: Linking Numbers to Biology

    • So, why are we here?

    • Why do we care?

    • Affymetrix source for annotations

    • List of Affymetrix annotations

    • Updating the annotations in dChip

    • What is GeneOntology?

    • Using GeneOntology in dChip

    • GoMiner

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 2

    So, why are we here?

    We want to learn about Gene Annotations.

    Microarrays are designed, which means that someone firstchooses a set of genes of interest, selects probe sequencesto target those genes, and then places those sequences on amicroarray. In order to interpret (and possibly to analyze) thedata produced from a microarray experiment, you need torefer to the accompanying annotations, which describe boththe probes and the targeted genes.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 3

    Things Change

    One might naively think that gene annotations are static;meaning that they are produced when the microarray isdesigned and never change again. Wrong.

    The base pair sequences of probes placed on the array donot change. However, our knowledge of the human genomeis evolving, and thus our opinion about which genes aretargeted by those sequences may need to be updated.

    For Affymetrix microarrays, the company maintainsannotation files (updated quarterly) that contain their latestopinion on the nature and identity of the targeted genes.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 4

    Why Do We Care?

    Earlier, we compared microarray data from samples of acutelymphocytic leukemia (ALL) patients and mixed-lineageleukemia (MLL) patients. Using the criteria that the lowerbound of fold change (LBFC) should be at least 1.2-fold andthe mean difference in expression should be greater than100, we found about 600 probesets to be differentiallyexpressed.

    It is considered bad form to just hand the biologists a list of600 genes.

    They typically want to know: (a) do these genes reflectparticular biological functions that are different betwen thetwo groups of samples, or (b) do they identify specificbiological pathways or networks that are perturbed?

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 5

    List of Differentially Expressed Genes

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 6

    Affymetrix Web Site

    http://www.affymetrix.com

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

    http://www.affymetrix.com

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 7

    NETAFFX

    Annotations are updated quarterly...

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 8

    Affymetrix Support

    Go to the Affymetrix support page to get the full annotations.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 9

    Support By Product

    Follow the “support by product” link to “GeneChip Arrays”.

    ProductsMicroarray SolutionsRNA Analysis Solutions3’ IVT Expression AnalysisArraysU133 Set

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 10

    Affymetrix Main Annotation Files

    There is one primary annotation file:

    Annotation File: HG-U133A.na30.annot.csv contains theupdated annotations of all the genes targeted by themicroarray. (the zipped file is 11.8MB; unzipped, it is80.7MB.)

    The main Affy NetAffx page suggests that version 31 of theannotation was posted on Sep 3, but it isn’t here yet.

    Versions 20-29 are available as archived files.

    Note the header lines in these files: they’re important!

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 11

    What annotations does Affymetrix supply?

    As noted earlier, HG-U133A.na30.annot.csv is 80.7MB. Whatoccupies all that space?

    The file contains lots of redundant information. It hasinformation on 22,283 probesets, one per line, in 41 columns.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 12

    Description of annotation columns

    Probe Set ID. The unique identifier that describes anAffymetrix probe set. Also used in CEL files and CDF files.

    GeneChip Array. The chip type on which the probe setappears. The same entry is repeated for all probe sets.

    Species Scientific Name. The scientific name of thespecies whose gene sequences are on the array. Thesame information is repeated for all probe sets.

    Annotation Date. When the annotations were last updated.The same information is repeated for all probe sets.

    Sequence Type. The kind of sequence used in the design ofthe array: can be “Consensus”, “Control”, or “Exemplar”.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 13

    Sequence Source. Where did the design sequence comefrom? Usually “GenBank”, but rarely (only 81 times on theHG-U133A) from “Affymetrix Proprietary Database”.

    Transcript ID(Array Design). An identifier into one ofseveral unspecified databases indicating the designedtarget sequence.

    Target Description. Long text string describing the target,formed by combining several other fields.

    Representative Public ID. For non-control sequences, aGenBank/RefSeq identifier.

    Archival UniGene Cluster. The UniGene cluster identifierfrom the sequence at the time the array was designed (in

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 14

    this case, from UniGene build 133).

    UniGene ID. UniGene cluster identifier from the build ofUniGene current at the time the annotations were updated.

    Genome Version. The build of the human genome used forsequence alignments. The same information is repeatedfor all probe sets.

    Alignments. Location of the target sequence along thehuman genome, in base pairs along the chromosome.

    Gene Title. Official gene title (from UniGene or EntrezGene).

    Gene Symbol. Official gene symbol (either from UniGene orEntrez Gene).

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 15

    Chromosomal Location. Location of the gene in terms ofcytogenetic bands; e.g., 16p12.

    Unigene Cluster Type. Either absent if not present in thisbuild of UniGene (indicated by “—”), “est”, “full length”, or“est /// full length”.

    Ensembl. The unique identifier of the target sequence in theEnsembl database.

    Entrez Gene. The unique identifier of the target sequence inEntrez Gene (formerly LocusLink). Sequences with theseidentifiers tend to be better understood and more reliablethan genes without them. The identifiers refer to geneticloci that have been mapped explicitly because of theirconnection to specific diseases or biological processes.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 16

    SwissProt. The SwissProt identifier of the protein productproduced by the gene corresponding to the targetsequence.

    EC. Yet another database identifier.

    OMIM. The unique identifier asdsociated to the targetsequence gene in the Online Mendelian Inheritance in Man(OMIM) database, describing the ways in which the gene isknown to be associated with genetic diseases.

    RefSeq Protein ID. The GenBank identifier of the consensussequence for the protein produced by the target sequence.

    RefSeq Transcript ID. The GenBank identifiers of theconsensus sequences for the mRNA’s produced by the

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 17

    target gene. (Alternative splicing accounts for multiples.) Inmany cases, this coincides with the “Representative PublicID”.

    FlyBase. Corresponding identifier in the drosophiladatabase.

    AGI. Arabidopsis genome identifier.

    WormBase. Corresponding identifier in the C. elegansdatabase.

    MGI Name. Probably the identifier in the mouse database.

    RGD Name. Probably the identifier in the rat database.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 18

    SGD accession number. The identifer in thesaccharomyces database.

    Gene Ontology Biological Process. List of identifiers forannotations of the target gene into the “biological process”section of GeneOntology. More about this later.

    Gene Ontology Cellular Component. Similar.

    Gene Ontology Molecular Function. Similar.

    Pathway. List of pathways that the target sequence isinvolved in.

    InterPro. Another protein database.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 19

    Trans Membrane. Description of trans-membrane part of theprotein, if known or if applicable.

    QTL. Unknown.

    Annotation Description. Text description of how the probeset was annotated.

    Annotation Transcript Cluster. Unclear.

    Transcript Assignments. Very long description of theannotations.

    Annotation Notes. Additional comments.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 20

    Updating annotations in dChip

    In order for dChip (or any other Affymetrix microarrayanalysis package) to use the updated annotations, you haveto tell the software package where to get the information.

    In the case of dChip, their online manual page tells you howto build new gene information and genome information files.

    For many common chip types, the dChip web site containsup-to-date copies of these files. It’s still useful to see wherethe data comes from how and how you can update your ownversions.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 21

    dChip Manual on Gene Information

    Requires the annotation CSV files from Affymetrix, along withthree Gene Ontology files, which you can get from dChip orfrom the primary source.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 22

    http://www.geneontology.org

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

    http://www.geneontology.org

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 23

    Making the Gene Information file

    1. Get the updated annotation CSV file from Affymetrix.

    2. Get function.ontology, process.ontology, andcomponent.ontology from GeneOntology.

    3. Rename the three GeneOntology files by adding “.txt”.

    4. Use “Tools” − > “Make information file” in dChip.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 24

    Making the Gene Information file

    Specify the locations of the CSV file, the GeneOntology files,and where you want the output sent. I edited the defaultoutput file name to (i) start with the standard chip name and(2) use the underscore character as a separator.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 25

    The Gene Information file

    This step produces the three dChip annotation files thatwere described in Lecture 2.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 26

    Making the Genome Information file

    Using the same input files, you can also use dChip to createa “Genome information file”, which maps genes to specificpositions along the genome.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 27

    The Genome Information file

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 28

    What is GeneOntology?

    GeneOntology uses controlled vocabularies to create adirected acyclic graph (DAG; a generalized tree) thatdescribes the kinds of functions or properties that a genemight have. There are two parts to GeneOntology:

    • Annotations, maintained in databases like Entrez Gene,that describe which genes actually have which functions.

    • The DAG, maintained by the GeneOntology Consortium,that describes functions and relations between them:

    1. Biological process (what)2. Molecular function (how)3. Cellular component (where)

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 29

    GeneOntology: The top level

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 30

    GeneOntology annotations in Entrez Gene

    You can find the GeneOntology annotations for individualgenes in Entrez Gene. For genes with known functions, theEntrez Gene page will contain a section titled“GeneOntology”, which contains a list of the known functionsfor that gene.

    Every GO annotation asserts that a specific gene has aspecific function. As part of the design of GO, each assertionis itself annotated to explain the kinds of evidence theassertion is based on, as well as the organization orindividual that supplied the annotation.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 31

    GO annotations of the androgen receptor

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 32

    http://www.ebi.ac.uk/GOA/

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

    http://www.ebi.ac.uk/GOA/

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 33

    GO browsing

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 34

    GO browsing

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 35

    Edges are relationships

    Edges in the DAG represent two kinds of relationships:

    is a : Used when the child node is a special case of theparent node. For example, hormone binding is a kind ofbinding.

    part of : Used when the child node is a component of theparent node. For example, a membrane is part of a cell

    Genes may be annotated into different levels of the hierarchy,depending on how detailed the evidence is. In general, agene not only has the function corresponding to the nodewith direct annotation, but also has every property at parentnodes up through the hierarchy.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 36

    GO annotations of the androgen receptor

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 37

    GeneOntology: Evidence Codes

    IDA : inferred from direct assay; indicates that the annotationis based on a paper describing an experiment that directlytested this function for this gene

    TAS : traceable author statement; based on a review articleor textbook including references to the original experiments

    IMP : inferred from mutant phenotype; based on experimentsinvolving mutations, knockouts, antisense, etc.

    IPI : inferred from physical interaction; based on assays (likeco-immunoprecipitation) that demonstrate physicalinteractions between the gene in question and other geneproducts

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 38

    IGI : inferred from genetic interaction; based on experiments(such as synthetic lethals, suppressors, functionalcomplementation) that show a genetic interaction betweenthe gene in question and another gene

    ISS : inferred from sequence or structure similarity; based onBLAST results that have been reviewed for accuracy by acurator

    IEP : inferred from expression pattern; based on Northerns,Westerns, or microarray experiments that revealinformation about the timing or location of expression

    NAS : non-traceable author statement; statements in papers(abstract, introduction, discussion) that a curator cannottrace to another publication

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 39

    IEA : inferred from electronic annotation; based on sequencesimilarity searches or database records that have not beenreviewed by a curator

    IC : inferred by curator; even though no direct evidence isavailable, the property can reasonably be inferred by thecurator. For example, it is reasonable to infer from directevidence of “transcription factor activity” that the geneproduct is found in the nucleus

    ND : no biological data available; only used for annotations to“unknown”

    NR : not recorded; used only for annotations created beforecurators started adding evidence codes

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 40

    Quality of evidence

    The evidence codes fall into a rough hierarchy indicating howstrongly the annotation of function should be believed.

    1. IDA, TAS

    2. IMP, IPI, IGI

    3. ISS, IEP

    4. NAS

    5. IEA

    6. IC

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 41

    Using GeneOntology in dChip

    After running a sample comparison to find interesting genes,use the menu item “Tools” − > “Gene Function Enrichment”.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 42

    Using GeneOntology in dChip

    For the gene list file, select the “compare result” file producedpreviously. It may be a good idea to use the “Options” to setthe cutoff for significant p-values.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 43

    Using GeneOntology in dChip

    The results are available in a few seconds.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 44

    What do the results look like?

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 45

    Interpreting the Results

    Each group of entries in the results file is introduced by a linelike:

    Found 21 Gene Ontology "protein tyrosinekinase" genes in a list with 391 annotatedgenes (all: 157/7685, PValue: 0.000042)

    *****

    The part within quotation marks is the name of theGeneOntology category that was found to be signficantlyoverrepresented among the differentially expressed genes.

    What do the numbers tell us?

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 46

    1. There were 7685 probesets on the array with some kind ofGeneOntology annotation.

    2. There were 391 differentially expressed probesets that hadsome kind of GeneOntology annotation.

    3. Of all the annotated probe sets, 157 had the “proteintyrosine kinase” function.

    4. Of the selected annotated probe sets, 21 had the “proteintyrosine kinase” function.

    The p-value comes from modeling the data using ahypergeometric distribution, which means it is the same valueproduced by Fisher’s Exact Test on a 2× 2 contingency table.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 47

    What’s wrong with the results?

    First, the p-values haven’t been adjusted for multiple testing.Second, we cannot tell if the software has accounted for thefact that the GeneOntology categories form a DAG. Inparticular, a gene with “protein tyrosine kinase” activity alsoinherits every annotation above it in the DAG.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 48

    What’s wrong with the results?

    Third, by working with probe sets instead of genes, thecounts are wrong.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 49

    What alternatives are there?

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 50

    http://discover.nci.nih.gov/gominer

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

    http://discover.nci.nih.gov/gominer

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 51

    GoMiner: Getting Started

    You need a machine with

    • Java 1.3 or higher

    • Windows 98 or higher, Mac OS X or higher, Solaris, Linux,or FreeBSD

    • High-speed internet access

    Download the GoMiner Java code, install it, and double-clickon it to start the program.

    Then go to “File” − > “Load GO Terms” and click “OK”. Waita few minutes while the program loads the GeneOntologyinformation from the NCI.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 52

    GoMiner Start

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 53

    GoMiner: GO terms loaded

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 54

    GoMiner as GO browser

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 55

    Getting array data into GoMiner

    1. Go to “Data Source” and select “UniProt (Hs)” to restrict tohuman gene annotations

    2. Need a file listing all genes in the experiment, one HUGOsymbol per line. Use the “Browse” button, and then click“Query Gene File” to load this information. Time sink.

    3. Need a file containing a list of genes that changed. Can beone HUGO symbol per line. Optionally, you can include asecond column with 1 (overexpressed) or -1 (under). Use“Browse” and “Query Changed Gene File” to load this data.

    Note: GeneLink or Source can convert from various gene idsto HUGO symbols.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 56

    GoMiner with array gene list loaded

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 57

    GoMiner with changed gene list loaded

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 58

    GoMiner subgraphs

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 59

    GoMiner subgraphs

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 60

    Intepreting GoMiner results

    Enrichment is computed as

    changed genes in category/

    total genes in category

    changed genes on array/

    all genes on array

    Statistical evidence of enrichment is based on a Fisher exacttest.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA

  • GENE ANNOTATIONS: LINKING NUMBERS TO BIOLOGY 61

    Intepreting GoMiner results

    The p-values from the Fisher test are not corrected formultiple testing, but they should be since one is potentiallylooking at all GO categories. The categories are notindependent, so it is not clear exactly how one should correctfor multiple testing.

    If we filter genes before testing differential expression (e.g.,by removing low expressing or low variance genes), shouldthose genes be included in the “query gene file” for theexperiment?

    The Fisher exact test isn’t completely appropriate, sincegenes can have overlapping annotations into the GO DAG.

    No existing test exploits the GO evidence codes.

    c© Copyright 2004–2010, KR Coombes, KA Baggerly and BM Broom GS01 0163: ANALYSIS OF MICROARRAY DATA