Characterization of genes and proteins of cross-species biological pathways

Presented at the 2010 UMUC Biotechnology Symposium, May 21, 2010, Rockville, MD.

Transcript of Characterization of genes and proteins of cross-species biological pathways

Page 1: Characterization of genes and proteins of cross-species biological pathways

Results

Six pathways, three each from BioCarta and KEGG, were analyzed using this process, and the results for these pathways are presented below. The matrices for one of the pathways are also shown for illustration.

BioCarta Pathways

Interferon Gamma (IGP): The IGP pathway has a significant role in the body's immune response. It has 6 genes, all well conserved among mammals except for JAK1 and STAT1 in Pan troglodytes.

Nerve Growth Factor (NGF): NGF is important for the survival of neurons during embryonic development and affects the growth of sensory and sympathetic ganglia. The pathway has 20 genes, and most are well conserved. Across species, the exceptions are DPM2, ELK1, and KLK2. Within species, only Canis lupus familiaris had NGF genes that were less conserved.

Protein Kinase C through G-protein coupled receptor (PKC): GPCRs are involved in signal transduction and play a role in various cellular functions. There are 9 genes in this pathway, and all of them are extremely well conserved.

KEGG Pathways

Hedgehog Pathway: The hedgehog signaling pathway is believed to govern the growth of embryonic stem cells as well as metamorphosis in general. It has 44 genes, of which 23 are conserved among all represented mammals. Three genes, SAP18, DYRK1A, and BTRC, are common in all mammals.

Basal Transcription Factors (BTF): BTF is a major control point for gene expression in eukaryotes, and the pathway contains 34 genes. Most genes in this pathway are well conserved, except GTF2A1L and STON1.

Dorsal-ventral axis formation (DVF): The DVF pathway is controlled by GRK and EGFR and is important in limb development. It has 29 genes, and most of them are well conserved, the exception being FMN2. Matrices obtained through the Homologene and similarity search methods are shown below:

Future work

Future work includes:
1. Fully automate the process
2. Visualize the results
3. Develop a database schema to store the results

Materials and methods

Conclusions

We developed a process for characterizing cross-species conservation of genes and proteins in mammals and for finding variations within these genes and proteins. This project highlights the challenges associated with developing meaningful biological networks from disparate computational tools and databases. It also emphasizes the limitations of relying exclusively on BioCarta and KEGG in pathway discovery.

Objectives

This project focused on developing methods for deriving the cross-species annotations for genes and protein groups identified in candidate pathways. The project had three primary goals:
1. Produce a matrix containing genes in a particular biological pathway
2. Construct a list of known protein variations associated with each gene in a pathway
3. Develop a more effective procedure for characterizing cross-species conservation of genes and proteins

Introduction

The new era of genomics and proteomics, with the advent of high-throughput technologies such as microarrays and next-generation sequencing, has opened up great opportunities for the life science research community to better understand biological processes. The gene lists obtained from these experiments are generally analyzed further in the context of biological pathways as well as with available biological knowledge sets such as gene ontologies, gene sets, and gene enrichments. Efforts are underway to develop new methods to derive biologically meaningful information from the gene lists obtained from such technologies. Although considerable effort has been expended on building, maintaining, and distributing these gene sets, a system allowing visualization of their conservation across mammalian species has not been developed.

We have developed a process to retrieve information from two pathway databases, KEGG and BioCarta, and combine it with information from other biological databases such as Homologene and UniProt to characterize cross-species conservation of genes and proteins and gain insights into new biological knowledge. Specifically, we are trying to understand which genes and proteins are common in given pathways across mammals such as human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis lupus familiaris), cow (Bos taurus), and chimpanzee (Pan troglodytes). We also explore the problem of finding the variations or mutations in these genes and proteins that are well tolerated across these species.

ABCC/NCI-Frederick

P.O. Box B, Bldg. 430

Frederick, MD 21702

University of Maryland University College

3501 University Boulevard East

Adelphi, MD 20783

Jennifer Ivy Dong, Douglas James Joubert–NIH Library, Raina Kumar, & Robert Stephen–ABCC/NCI

Process flowchart (steps carried out with Perl scripts)

Start with a BioCarta/KEGG pathway name, then retrieve the gene list from CGAP with gene sequence IDs.

Identify homologous proteins in the Homologene database:
1. Retrieve the homolog group ID for each gene from the Homologene database at NCBI
2. Report value 1 for each mammalian species that has a homolog of the gene
3. Populate matrices (heat map), where genes are at X-axis and species at Y-axis

Identify homologous proteins by similarity search:
1. Map sequence ID to protein ID using bioDBnet
2. Perform BLASTP for the proteins
3. Populate matrices with best hits using the taxonomy report

Find variations from MSA:
1. Fetch sequences using the protein sequence ID for all the homologous genes of each pathway gene for mammals
2. Perform MSA with ClustalW
3. Use the *.dnd file to make a cladogram
4. Search for variations in the *.aln files
5. Report variations in tab-delimited files

Find known variations:
1. Identify protein IDs of all the proteins for the same species in the NCBI database using the sequence ID, or map the sequence ID to a UniProt ID using bioDBnet
2. For each protein, search for the UniProt entry in files derived from UniProt
3. Read known variations in the flat file and return the annotation in a tab-delimited file
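The matrix-population step in the workflow above can be sketched in a few lines. The poster's own scripts were written in Perl and are not shown, so this Python version is only an illustration: the function names (`build_matrix`, `to_tsv`) and the input shape (a set of species per gene) are assumptions, and the three example groups are copied from rows of the six-species DVF matrix in this transcript.

```python
# Illustrative sketch only: the original work used Perl scripts that are not
# reproduced here. Names and input layout are assumptions.
SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]


def build_matrix(homolog_groups, species=SPECIES):
    """Return {gene: [0/1 per species]}; 1 means a homolog was reported."""
    matrix = {}
    for gene, present in homolog_groups.items():
        matrix[gene] = [1 if name in present else 0 for name in species]
    return matrix


def to_tsv(matrix, species=SPECIES):
    """Render the matrix as tab-delimited text, one gene per row."""
    lines = ["\t".join(["Gene Symbol"] + list(species))]
    for gene in sorted(matrix):
        lines.append("\t".join([gene] + [str(v) for v in matrix[gene]]))
    return "\n".join(lines)


# Species sets taken from three rows of the six-species DVF matrix.
groups = {
    "BRAF": {"Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"},
    "CPEB1": {"Human", "Mouse", "Cow", "Rat", "Chimp"},
    "FMN2": {"Human"},
}
matrix = build_matrix(groups)
print(to_tsv(matrix))
```

Keeping the species order in one list means the same order is used for both the header and every row, which avoids silent column mismatches when more species are added.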

The process has four major modules:
1. Identify homologous proteins using the Homologene database
2. Identify homologous proteins using similarity search
3. Find variations using multiple sequence alignments
4. Find all known variations from the UniProt database
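For the module that finds variations from multiple sequence alignments, a minimal Python sketch is given here. It is not the authors' Perl code. It assumes a ClustalW-style `.aln` layout (a header line starting with `CLUSTAL`, name and sequence-chunk pairs, and a consensus line that starts with whitespace), and the toy alignment, sequence names, and function names are all made up for illustration.

```python
# Illustrative sketch: parse ClustalW-style alignment text and report the
# columns where a sequence differs from a chosen reference.
# The alignment below is a toy example, not real protein data.
ALN_TEXT = """CLUSTAL W (1.83) multiple sequence alignment

human      MKTAYIAK
mouse      MKTAYLAK
dog        MKT-YIAK
           *** *:**
"""


def parse_clustal(text):
    """Return {name: aligned_sequence}, joining chunks across blocks."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and consensus lines (leading space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        seqs[parts[0]] = seqs.get(parts[0], "") + parts[1]
    return seqs


def find_variations(seqs, reference):
    """Return [(column, ref_residue, {name: residue})] for differing columns.

    Columns are 1-based alignment positions; all sequences are assumed to
    have the same aligned length.
    """
    ref = seqs[reference]
    rows = []
    for i, ref_res in enumerate(ref):
        diffs = {name: seq[i] for name, seq in seqs.items()
                 if name != reference and seq[i] != ref_res}
        if diffs:
            rows.append((i + 1, ref_res, diffs))
    return rows


def to_tsv(rows):
    """Render the variations as tab-delimited text."""
    lines = ["column\treference\tsequence\tresidue"]
    for col, ref_res, diffs in rows:
        for name in sorted(diffs):
            lines.append(f"{col}\t{ref_res}\t{name}\t{diffs[name]}")
    return "\n".join(lines)


seqs = parse_clustal(ALN_TEXT)
variations = find_variations(seqs, "human")
print(to_tsv(variations))
```

Gaps (`-`) are reported as differences here; whether an insertion or deletion should count as a variation is a design choice that the poster does not specify.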

Gene Symbol  Human  Mouse  Dog  Cow  Rat  Chimp
BRAF         1      1      1    1    1    1
CPEB1        1      1      0    1    1    1
EGFR         1      1      1    1    1    1
ERBB2        1      1      1    0    1    0
ERBB4        1      1      1    0    1    1
ETS1         1      1      1    1    1    0
ETS2         1      1      1    1    1    1
ETV6         1      1      1    1    1    1
ETV7         1      0      1    0    0    1
FMN2         1      0      0    0    0    0
GRB2         1      1      1    1    1    1
KRAS         1      1      1    1    1    1
MAP2K1       1      1      1    1    1    1
MAPK1        1      1      1    1    1    1
MAPK3        1      1      1    1    1    1
NOTCH1       1      1      1    0    1    1
NOTCH2       1      1      1    1    1    1
NOTCH3       1      1      1    0    1    0
NOTCH4       1      1      1    1    1    1
PIWIL1       1      1      1    1    1    1
PIWIL2       1      1      1    0    1    1
PIWIL3       1      0      1    1    0    1
RAF1         1      1      1    1    1    1
RHBDL1       1      1      1    1    1    1
RHBDL3       1      1      1    1    1    1
SOS1         1      1      1    1    1    1
SOS2         1      1      1    1    0    0
-            1      1      1    1    1    1
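A matrix like the six-species one above can be summarized programmatically to find the genes conserved in every species and the species each remaining gene is missing from. The sketch below is a Python illustration under assumptions: the helper names are hypothetical, and only five rows (BRAF, CPEB1, ETV7, FMN2, SOS2) are copied from the matrix to keep the example short.

```python
# Illustrative sketch: summarize a presence/absence matrix.
# Rows are a subset copied from the six-species matrix above.
SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]
MATRIX = {
    "BRAF":  [1, 1, 1, 1, 1, 1],
    "CPEB1": [1, 1, 0, 1, 1, 1],
    "ETV7":  [1, 0, 1, 0, 0, 1],
    "FMN2":  [1, 0, 0, 0, 0, 0],
    "SOS2":  [1, 1, 1, 1, 0, 0],
}


def fully_conserved(matrix):
    """Genes reported in every species (all values are 1)."""
    return sorted(gene for gene, row in matrix.items() if all(row))


def missing_species(matrix, species):
    """For each gene that is not fully conserved, list the species with 0."""
    return {gene: [name for name, value in zip(species, row) if value == 0]
            for gene, row in matrix.items() if not all(row)}


def species_coverage(matrix, species):
    """Number of genes reported per species."""
    return {name: sum(row[i] for row in matrix.values())
            for i, name in enumerate(species)}


print(fully_conserved(MATRIX))
print(missing_species(MATRIX, SPECIES))
print(species_coverage(MATRIX, SPECIES))
```

In this subset, FMN2 is reported only for human, which matches the Results text naming FMN2 as the exception in the DVF pathway.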

Columns after Gene Symbol, left to right: Human, Mouse, Dog, Cow, Chimp, Rat, Horse, Platypus, Wild boar, Rhesus Macaque, Bonobo, Gorilla, Sumatran Orangutan, Cynomolgus monkey, Cat, Syrian Hamster, European Rabbit, Domestic Sheep, Opossum, mice, White Bear, Western baboon

Gene Symbol
BRAF     1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0
CPEB1    1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0
EGFR     1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ERBB2    1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ERBB4    1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ETS1     1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0
ETS2     1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0
ETV6     1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ETV7     1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
FMN2     1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
GRB2     1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 0
KRAS     1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0
MAP2K1   1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0
MAPK1    1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0
MAPK3    1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0
NOTCH1   1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
NOTCH2   1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1
NOTCH3   1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
NOTCH4   1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
PIWIL1   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL2   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL3   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL4   1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0
RAF1     1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0
RHBDL1   1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0
RHBDL3   1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0
SOS1     1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0
SOS2     1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0
-        1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0

Matrix labels, in order: Homologene, Similarity Search
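Because the two matrices list Rat and Chimp in a different column order, comparing them by position would be misleading. The Python sketch below keys every value by species name before comparing. It assumes that the six-species matrix is the Homologene one and that the 22-species matrix is the similarity search one (following the order of the labels), uses only four rows and the six shared species, and all function names are hypothetical.

```python
# Illustrative sketch: find gene/species cells where two methods disagree.
# Assumption: the first matrix is Homologene, the second is similarity search.
HOMOLOGENE_SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]
SIMILARITY_SPECIES = ["Human", "Mouse", "Dog", "Cow", "Chimp", "Rat"]  # first six columns

HOMOLOGENE = {
    "CPEB1": [1, 1, 0, 1, 1, 1],
    "ERBB2": [1, 1, 1, 0, 1, 0],
    "FMN2":  [1, 0, 0, 0, 0, 0],
    "GRB2":  [1, 1, 1, 1, 1, 1],
}
SIMILARITY = {
    "CPEB1": [1, 1, 1, 1, 1, 1],
    "ERBB2": [1, 1, 1, 1, 1, 1],
    "FMN2":  [1, 1, 1, 1, 1, 1],
    "GRB2":  [1, 1, 1, 1, 1, 1],
}


def by_species(rows, species):
    """Convert positional rows into {gene: {species: value}}."""
    return {gene: dict(zip(species, values)) for gene, values in rows.items()}


def disagreements(first, second):
    """Return (gene, species, first_value, second_value) where values differ."""
    out = []
    for gene in sorted(set(first) & set(second)):
        for name in sorted(set(first[gene]) & set(second[gene])):
            if first[gene][name] != second[gene][name]:
                out.append((gene, name, first[gene][name], second[gene][name]))
    return out


diffs = disagreements(by_species(HOMOLOGENE, HOMOLOGENE_SPECIES),
                      by_species(SIMILARITY, SIMILARITY_SPECIES))
for row in diffs:
    print("\t".join(str(item) for item in row))
```

In these four rows, every disagreement is a case where the six-species matrix reports 0 and the similarity search matrix reports 1.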

Disclaimer: The opinions and assertions presented here are the private views of the authors and are not necessarily those of ABCC/NCI.