CANDID: A candidate gene identification tool Janna Hutz jehutz@artsci.wustl.edu March 19, 2007.


CANDID: A candidate gene identification tool

Janna Hutz

jehutz@artsci.wustl.edu

March 19, 2007


Candidate genes

• Positional

– Linkage evidence

– Deletion syndrome

– Loss of heterozygosity

– Disease-related amplification

– Association

• Biological

– Pathways

– Phenotypic characteristics

ACT[A/G]GGA


A case study: acd


A case study: acd

[Figure: linkage map running from 0 cM to ~82 cM. Labels: acd, Os, Es-1, D8Mit5, D8Mit79, D8Mit13. Positions shown: 25 cM, ~31 cM, 38.7 cM, 51 cM (acd), 67 cM.]


A case study: acd

[Figure: fractions shown at five positions: 1/145, 3/145, 0/145, 0/145, 1/145]


Which gene is acd?


Prioritization tools

• Endocrinologist/Geneticist

• Ensembl

• RT-PCR

• Sequencing

BINGO!…two years later.


How can we improve this?


Improve our tools

• Clinician

– Has memorized information about many disorders; can name some relevant genes

– Gets his/her information from…

PubMed


PubMed

• How do we use PubMed to analyze our candidates?

– Enter our phenotypic keywords into PubMed. Read the papers that come up in the results. Make a list of genes.

– Do PubMed searches for all the candidates. Read the papers that come up in the results. Rate the candidates.

Better: Don’t do it yourself…


PubMed

• Each publication has a PubMed ID

• Each gene has a Gene ID

• Wouldn’t it be nice if we could link Gene IDs and PubMed IDs?

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA

– gene2pubmed.gz

– TaxonomyID; GeneID; PubMedID
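
A minimal Python sketch of how those links could be loaded, assuming the file is tab-delimited with a header line that starts with "#" and the three columns listed above. The function name and the sample rows are made up for illustration; this is not CANDID's own code.

```python
import io
from collections import defaultdict

def load_gene2pubmed(handle, tax_id="9606"):
    """Map each GeneID to the set of PubMed IDs linked to it for one taxonomy ID."""
    links = defaultdict(set)
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and the header
            continue
        tax, gene_id, pubmed_id = line.split("\t")[:3]
        if tax == tax_id:  # 9606 is the taxonomy ID for human
            links[gene_id].add(pubmed_id)
    return dict(links)

# Illustrative rows only; the gene and publication IDs are invented.
SAMPLE = (
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1001\t111\n"
    "9606\t1001\t222\n"
    "10090\t2002\t333\n"
)
human_links = load_gene2pubmed(io.StringIO(SAMPLE))
```

For the downloaded file, the same function should work on a handle opened with `gzip.open(path, "rt")`.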


Who makes that file? (1)

• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

• Links between Gene and PubMed are the result of the following:

• 1. Manual curation within NCBI. Part of the process of generating a REVIEWED RefSeq is an analysis of the current literature. Papers that are seminal in defining the gene, its sequence, and its function are added to the record at that time. Alert users point out gaps or errors in papers associated with a Gene record. These messages are reviewed and implemented as required.


Who makes that file? (2)

• 2. Integration of information from other public databases. Gene integrates gene-citation from resources external to NCBI such as model organism-specific databases, Gene Ontology (GO), groups curating interactions, and sequence databases. The assumption in using these sources is that they report citations specific to a gene in a known species. Gene does not process citations from OMIM automatically, because many of the citations in OMIM refer to studies of genes in species other than human.


Example 1

[Diagram labels: pancreatic cancer; sequence candidates; $]


Help Sally.

• Use CANDID’s literature criterion

http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html

User: workshop Password: perl031907


Help Sally.

• Look for genes that are involved with pancreatic cancer.

• What are some keywords we can use?


A measure of relevancy

• Find relevant publications

• Is Gene X linked to these publications?

• How many publications match?

• What percent of Gene X’s publications match?


By the numbers…

• Literature scores run from 0 to 1.

• The score is…

$$\text{literature score} = \frac{\text{number of gene’s publications that match}}{\text{number of gene’s publications}}$$
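
The ratio above can be sketched in a few lines of Python. The function name and the PubMed IDs are hypothetical, and returning 0 for a gene with no linked publications is an assumption, since the slide does not say how that case is handled.

```python
def literature_score(gene_pmids, matching_pmids):
    """Fraction of a gene's publications that also appear in the keyword matches."""
    gene_pmids = set(gene_pmids)
    if not gene_pmids:
        return 0.0  # assumption: a gene with no publications scores 0
    return len(gene_pmids & set(matching_pmids)) / len(gene_pmids)

# Invented IDs: the gene has four publications, two of which match the keywords.
score = literature_score({"111", "222", "333", "444"}, {"222", "444", "999"})
```

Because the numerator is always a subset of the denominator, the result stays between 0 and 1.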


Matching

• Every publication has a “Text Words” field that includes, when available, …

– Title

– Abstract

– Other abstract

– MeSH terms

– MeSH subheadings

– Publication types

– Substance names

– Personal name as subject

– MEDLINE secondary source

– Other terms
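
One way to search that field from a script is PubMed's `[tw]` field tag through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL. Joining keywords with OR and the helper names are assumptions for illustration, not a description of how CANDID issues its queries.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def text_word_query(keywords):
    """Restrict every keyword to the Text Words field and join them with OR."""
    return " OR ".join(f'"{kw}"[tw]' for kw in keywords)

def esearch_url(keywords, retmax=100):
    """Build (but do not send) an esearch request for the given keywords."""
    params = {"db": "pubmed", "term": text_word_query(keywords), "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

query = text_word_query(["pancreatic cancer", "insulin"])
```

The PubMed IDs returned by such a search are what a gene's own publication list would be compared against.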


Summary


Results


Exporting to Excel

• Output file is a comma-separated file

• Download it, and change the file extension from .output to .csv.

• If Excel doesn’t open it automatically when you click on it, paste the data into a new sheet and use the Text Import Wizard to separate the columns.
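
If you would rather script the rename, a small Python helper along these lines would do it. The function name and the example file name are hypothetical.

```python
from pathlib import Path

def as_csv(path):
    """Rename e.g. results.output to results.csv and return the new path."""
    source = Path(path)
    target = source.with_suffix(".csv")  # swaps only the final extension
    source.rename(target)
    return target
```

Calling `as_csv("results.output")` leaves the contents untouched and changes only the extension.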


Drawbacks

• What if a gene isn’t associated with any publications?

– It’s not important

– It’s not yet characterized


What about those genes?


Analyzing the “other genes”

• We don’t have literature data.

• We don’t have expression data.

• All we have is a sequence.


Fun with sequences

• DNA

– Cross-species conservation

• RNA (cDNA)

– Cross-species conservation

– Protein sequence prediction

• Protein conservation

• Protein domain prediction


Protein domains

• InterPro

• Conserved Domain Database (NCBI)

• Wouldn’t it be nice if we could link Gene IDs and protein domains?

InterPro

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA


Who makes those links?

• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

• Links between Entrez Gene and Conserved Domain Database (CDD) are calculated from the domains annotated by the CDD group on Reference Sequence proteins.


How can we use this?

• The CDD domains have descriptions.

• These descriptions can be searched…

1. CANDID finds domains containing our keywords.

2. If a gene has one of those domains, it gets a score of 1.

…just like when we searched PubMed!
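
The two steps above can be sketched in Python as follows. The domain identifiers and descriptions are invented, case-insensitive substring matching is an assumption, and so is the score of 0 for a gene without a matching domain (the slide only states the score of 1).

```python
def matching_domains(domain_descriptions, keywords):
    """Step 1: return the domains whose description contains any keyword."""
    lowered = [kw.lower() for kw in keywords]
    return {
        domain
        for domain, description in domain_descriptions.items()
        if any(kw in description.lower() for kw in lowered)
    }

def domain_score(gene_domains, hits):
    """Step 2: 1 if the gene carries at least one matching domain, else 0."""
    return 1 if set(gene_domains) & hits else 0

# Hypothetical domain table, for illustration only.
DOMAINS = {
    "dom_A": "Protein kinase, catalytic domain",
    "dom_B": "Zinc finger, C2H2 type",
}
hits = matching_domains(DOMAINS, ["kinase"])
```

Unlike the literature score, this one is all-or-nothing: a gene with several matching domains still scores 1.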


How far back does our gene go?

• Is our gene in mammals?

• Fish?

• Bacteria?


More sequence fun

• Many measures of conservation

– Nucleotide similarity (percentage, pairwise)

– Amino acid similarity (percentage, pairwise)

– etc., etc.


HomoloGene

• Gets sequences

• Uses amino acid AND nucleotide similarity measures

• Plus lots more math, equals…

• A label that answers our question


Labels used in CANDID

• Homo sapiens

• Primates (chimp, gorilla)

• Rodents (rat, mouse)

• Eutherian mammals (dog, cow, cat)

• Amniota (chicken)

• Insects (mosquito, bee)

• Bilateria (C. elegans)

• Fungi

• Eukaryotes

HIGHER SCORE
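
A rough Python sketch of turning such a label into a number. The transcript does not preserve which end of the list the "HIGHER SCORE" arrow points to, nor the actual values, so both the direction (exposed here as a parameter) and the even spacing between 0 and 1 are assumptions.

```python
LABELS = [
    "Homo sapiens", "Primates", "Rodents", "Eutherian mammals", "Amniota",
    "Insects", "Bilateria", "Fungi", "Eukaryotes",
]

def conservation_score(label, labels=LABELS, broadest_scores_highest=True):
    """Map a conservation label to an evenly spaced score between 0 and 1."""
    position = labels.index(label)  # raises ValueError for an unknown label
    fraction = position / (len(labels) - 1)
    # Assumption: by default, conservation reaching further back scores higher.
    return fraction if broadest_scores_highest else 1.0 - fraction

amniota = conservation_score("Amniota")
```

Flipping `broadest_scores_highest` reverses the ranking without changing the labels.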


Example 2

pancreas: tumor tissue

pancreas: normal tissue

custom microarray

Known and unknown genes


Array candidates

• Let’s increase the number of CANDID results we got in Example 1…


Weighting system

• Prioritize genes of known or unknown function

• Modify weights for each category

• Well-characterized genes: higher literature weight

• Uncharacterized genes: higher domain and conservation weights
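
To make the effect of the weights concrete, here is a small Python sketch that combines the per-criterion scores as a weighted average. The slides do not give the actual formula, so the weighted average, the function name, and the example numbers are all assumptions.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores; a missing weight counts as 0."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0  # assumption: no active criteria means no score
    weighted = sum(value * weights.get(name, 0.0) for name, value in scores.items())
    return weighted / total_weight

# Invented scores for a single gene.
scores = {"literature": 0.5, "domains": 1.0, "conservation": 0.25}

# Emphasize literature for well-characterized genes...
known = combined_score(scores, {"literature": 2, "domains": 1, "conservation": 1})
# ...and domains plus conservation for uncharacterized ones.
unknown = combined_score(scores, {"literature": 0, "domains": 1, "conservation": 1})
```

With these numbers the same gene scores 0.5625 under the first weighting and 0.625 under the second, which shows how the weights shift the ranking.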


Example 3

• Make up your own example!

• Use literature, domains, and/or conservation criteria.


Next week

• Expression data

• Linkage data

• Association data

• CANDID’s efficiency

• Anything else?
