Characterization of genes and proteins of cross-species biological pathways
Results
Six pathways, three each from BioCarta and KEGG, were
analyzed using this process and the results for these pathways
are presented below. The matrices for one of the pathways are
also shown for illustration.
BioCarta Pathways
Interferon Gamma (IGP): The IGP pathway has a significant
role in the body's immune response. It has 6 genes, all well
conserved among mammals except for JAK1 and STAT1 in
Pan troglodytes.
Nerve Growth Factor (NGF): NGF is important for the survival
of neurons during embryonic development and has an effect on
the growth of sensory and sympathetic ganglia. It has 20 genes
and most are well-conserved. Across species, the exceptions
are DPM2, ELK1, and KLK2. Within species, only Canis
lupus familiaris had NGF genes that were less conserved.
Protein Kinase C through G-protein coupled receptor
(PKC): GPCRs are involved in signal transduction and play a
role in various cellular functions. There are 9 genes in this
pathway, and all the genes are extremely well-conserved.
KEGG Pathways
Hedgehog Pathway: The hedgehog signaling pathway is
believed to govern the growth of embryonic stem cells as well
as metamorphosis in general. It has 44 genes, of which 23 are
conserved among all represented mammals. Three genes,
SPA18, DRYK1A, and BTRC, are common in all mammals.
Basal Transcription Factors (BTF): BTF is a major control
point for gene expression in eukaryotes and it contains 34
genes. Most genes in this pathway are well-conserved except
GTF2AIL and STON1.
Dorsal-ventral axis formation (DVF): The DVF pathway is
controlled by GRK and EGFR and is important in limb
development. It has 29 genes, and most of them are well-conserved;
the exception is FMN2. Matrices obtained through the Homologene
and similarity-search methods are shown below:
Future work
Future work includes:
1. Fully automate the process
2. Visualization
3. Develop a database schema to store the results
Materials and methods
Conclusions
We developed a process for characterizing cross-species
conservation of genes and proteins in mammals, and for finding
variations within these genes and proteins. This project
highlights the challenges associated with developing
meaningful biological networks from disparate computational
tools and databases. This project also emphasizes the
limitations of relying exclusively on BioCarta and KEGG in
pathway discovery.
Objectives
This project focused on developing methods for deriving the
cross-species annotations for genes and protein groups
identified in candidate pathways. The project had three primary
goals:
1. Produce a matrix containing genes in a particular biological
pathway
2. Construct a list of known protein variations associated with
each gene in a pathway
3. Develop a more effective procedure for characterizing cross-
species conservation of genes and proteins
Introduction
The new era of genomics and proteomics, with the advent of
high-throughput technologies such as microarrays and
next-generation sequencing, has opened up great opportunities
for the life science research community to better understand
biological processes. The gene lists obtained from these
experiments are generally analyzed further in the context of
biological pathways and of available biological knowledge
sets such as gene ontologies, gene sets, and gene enrichments.
Efforts are underway to develop new methods to derive
biologically meaningful information from these gene lists.
Although considerable effort has been expended on building,
maintaining, and distributing these gene sets, a system
allowing visualization of their conservation across mammalian
species has not been developed. We have developed a process
to retrieve information from two pathway databases, KEGG and
BioCarta, and combine it with information from other
biological databases such as Homologene and UniProt to
characterize cross-species conservation of genes and proteins
and to gain new biological insights. Specifically, we are
trying to understand which genes and proteins in given
pathways are common across mammalian species such as human
(Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus),
dog (Canis lupus familiaris), cow (Bos taurus), and chimpanzee
(Pan troglodytes). We also explore the problem of finding the
variations or mutations in these genes and proteins that are
well tolerated across these species.
ABCC/NCI-Frederick
P.O. Box B, Bldg. 430
Frederick, MD 21702
University of Maryland University College
3501 University Boulevard East
Adelphi, MD 20783
Jennifer Ivy Dong, Douglas James Joubert–NIH Library, Raina Kumar, & Robert Stephen–ABCC/NCI
The process has four major modules:
1. Identify homologous proteins using the Homologene database
2. Identify homologous proteins using similarity search
3. Find variations using multiple sequence alignments
4. Find all known variations from the UniProt database
Workflow (implemented with Perl scripts):
Start with a BioCarta/KEGG pathway name.
Retrieve the gene list, with gene sequence IDs, from CGAP.
Module 1. Identify homologous proteins in the Homologene database:
Retrieve the homolog group ID for each gene from the Homologene database at NCBI.
Report a value of 1 for each mammalian species that has a homolog of the gene.
Populate matrices (heat map), where genes are on the X-axis and species on the Y-axis.
Module 2. Identify homologous proteins by similarity search:
Map the sequence ID to a protein ID using bioDBnet.
Perform BlastP for the proteins.
Populate matrices with the best hits using the taxonomy report.
Module 3. Find variations from MSA:
Fetch sequences, using the protein sequence ID, for all the homologous genes of each pathway gene in mammals.
Perform MSA with ClustalW.
Use the *.dnd files to make a cladogram.
Search for variations in the *.aln files.
Report the variations in tab-delimited files.
Module 4. Find known variations:
Identify the protein IDs of all the proteins for the same species in the NCBI database using the sequence ID, or map the sequence ID to a UniProt ID using bioDBnet.
For each protein, search for the UniProt entry in files derived from UniProt.
Read the known variations in the flat file and return the annotation in a tab-delimited file.
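The variation search over ClustalW *.aln files can be illustrated with a minimal sketch. The poster's own scripts are in Perl and are not shown, so this Python version is only an assumption about how such a step might look: the function names, the toy alignment, and the choice to report alignment columns (rather than residue positions in any one protein) are all hypothetical.

```python
def parse_clustal(text):
    """Return {sequence name: aligned sequence} from ClustalW .aln text."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and the consensus line,
        # which starts with whitespace.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]
        # Alignments are written in blocks, so append each chunk.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


def find_variations(sequences):
    """Return (column, {name: residue}) for each column that is not identical."""
    names = list(sequences)
    length = min(len(seq) for seq in sequences.values())
    variations = []
    for col in range(length):
        residues = {name: sequences[name][col] for name in names}
        if len(set(residues.values())) > 1:
            variations.append((col + 1, residues))  # 1-based alignment column
    return variations


def to_tsv(variations, names):
    """Format the variations as tab-delimited text, one column per sequence."""
    lines = ["\t".join(["column"] + names)]
    for col, residues in variations:
        lines.append("\t".join([str(col)] + [residues[n] for n in names]))
    return "\n".join(lines)


# Toy alignment with made-up sequences (not data from the poster).
ALN_EXAMPLE = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "human      MKTAYIAK",
    "mouse      MKTAYLAK",
    "dog        MKSAYIAK",
    "           **.**:**",
    "",
])

aligned = parse_clustal(ALN_EXAMPLE)
report = to_tsv(find_variations(aligned), list(aligned))
print(report)
```

Gap characters are treated like any other symbol here, so an insertion or deletion in one species is reported as a variation in that column.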
Gene Symbol  Human  Mouse  Dog  Cow  Rat  Chimp
BRAF 1 1 1 1 1 1
CPEB1 1 1 0 1 1 1
EGFR 1 1 1 1 1 1
ERBB2 1 1 1 0 1 0
ERBB4 1 1 1 0 1 1
ETS1 1 1 1 1 1 0
ETS2 1 1 1 1 1 1
ETV6 1 1 1 1 1 1
ETV7 1 0 1 0 0 1
FMN2 1 0 0 0 0 0
GRB2 1 1 1 1 1 1
KRAS 1 1 1 1 1 1
MAP2K1 1 1 1 1 1 1
MAPK1 1 1 1 1 1 1
MAPK3 1 1 1 1 1 1
NOTCH1 1 1 1 0 1 1
NOTCH2 1 1 1 1 1 1
NOTCH3 1 1 1 0 1 0
NOTCH4 1 1 1 1 1 1
PIWIL1 1 1 1 1 1 1
PIWIL2 1 1 1 0 1 1
PIWIL3 1 0 1 1 0 1
RAF1 1 1 1 1 1 1
RHBDL1 1 1 1 1 1 1
RHBDL3 1 1 1 1 1 1
SOS1 1 1 1 1 1 1
SOS2 1 1 1 1 0 0
- 1 1 1 1 1 1
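A six-species presence/absence matrix like the one above could be produced and queried roughly as follows. This is a sketch under assumptions, not the authors' Perl implementation: the input is taken to be a mapping from gene symbol to the set of species found in its homolog group, and the function names are hypothetical. The three example rows are copied from the matrix above.

```python
SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]


def build_matrix(homolog_species):
    """Map each gene to a 1/0 row: 1 if the species has a homolog, else 0."""
    return {
        gene: [1 if species in present else 0 for species in SPECIES]
        for gene, present in homolog_species.items()
    }


def missing_species(matrix):
    """For genes not found in every species, list the species lacking a homolog."""
    return {
        gene: [species for species, value in zip(SPECIES, row) if value == 0]
        for gene, row in matrix.items()
        if 0 in row
    }


# Species sets matching three rows of the matrix above.
homolog_species = {
    "BRAF": {"Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"},
    "ETV7": {"Human", "Dog", "Chimp"},
    "FMN2": {"Human"},
}

matrix = build_matrix(homolog_species)
for gene, row in matrix.items():
    print(gene, *row)
print(missing_species(matrix))
```

Keeping the species order in a single list means the same order drives both the matrix columns and the report of missing species.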
Gene Symbol, followed by one column per species in this order: Human, Mouse, Dog, Cow, Chimp, Rat, Horse, Platypus, Wild boar, Rhesus Macaque, Bonobo, Gorilla, Sumatran Orangutan, Cynomolgus monkey, Cat, Syrian Hamster, European Rabbit, Domestic Sheep, Opossum, mice, White Bear, Western baboon
BRAF 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0
CPEB1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0
EGFR 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ERBB2 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ERBB4 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0
ETS1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0
ETS2 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0
ETV6 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ETV7 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
FMN2 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
GRB2 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 1 1 0
KRAS 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0
MAP2K1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0
MAPK1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0
MAPK3 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0
NOTCH1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
NOTCH2 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1
NOTCH3 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
NOTCH4 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0
PIWIL1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
PIWIL4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0
RAF1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0
RHBDL1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0
RHBDL3 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0
SOS1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0
SOS2 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0
- 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0
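The similarity-search matrix above is described as being populated from BlastP best hits. A minimal sketch of that idea follows, assuming the hits have already been parsed into (gene, species, bit score) tuples. The tuple layout, the function names, the `min_bit_score` cutoff, and the example scores are all assumptions made for illustration; the poster does not state how hits were filtered.

```python
def best_hit_scores(hits):
    """Keep the highest bit score seen for each (gene, species) pair."""
    best = {}
    for gene, species, bit_score in hits:
        key = (gene, species)
        if key not in best or bit_score > best[key]:
            best[key] = bit_score
    return best


def similarity_matrix(hits, genes, species, min_bit_score=0.0):
    """Return {gene: 1/0 row}, 1 where the best hit reaches min_bit_score."""
    best = best_hit_scores(hits)
    matrix = {}
    for gene in genes:
        row = []
        for name in species:
            score = best.get((gene, name))
            row.append(1 if score is not None and score >= min_bit_score else 0)
        matrix[gene] = row
    return matrix


# Made-up hits for illustration only (not results from the poster).
EXAMPLE_HITS = [
    ("FMN2", "Human", 900.0),
    ("FMN2", "Horse", 410.0),
    ("FMN2", "Horse", 385.0),
    ("ETV7", "Human", 620.0),
    ("ETV7", "Platypus", 35.0),
]
EXAMPLE_SPECIES = ["Human", "Horse", "Platypus"]

example = similarity_matrix(
    EXAMPLE_HITS, ["FMN2", "ETV7"], EXAMPLE_SPECIES, min_bit_score=50.0
)
for gene, row in example.items():
    print(gene, *row)
```

Any cutoff changes which cells are marked 1, so the threshold would need to be chosen and reported alongside the matrix.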
Matrix labels: Homologene (six-species matrix) and Similarity Search (22-species matrix).
Disclaimer: The opinions and assertions presented here are the private views of the authors and are not necessarily those of ABCC/NCI.