Characterization of genes and proteins of cross-species biological pathways

Results

Six pathways, three each from BioCarta and KEGG, were analyzed using this process, and the results for these pathways are presented below. The matrices for one of the pathways are also shown for illustration.

BioCarta Pathways

Interferon Gamma (IGP): The IGP pathway has a significant role in the body's immune response. It has 6 genes, all well-conserved among mammals except for JAK1 and STAT1 in Pan troglodytes.

Nerve Growth Factor (NGF): NGF is important for the survival of neurons during embryonic development and affects the growth of sensory and sympathetic ganglia. The pathway has 20 genes, and most are well-conserved. Across species, the exceptions are DPM2, ELK1, and KLK2. Among the species examined, only Canis lupus familiaris had NGF genes that were less conserved.

Protein Kinase C through G-protein coupled receptor (PKC): G-protein coupled receptors (GPCRs) are involved in signal transduction and play a role in various cellular functions. There are 9 genes in this pathway, and all of them are extremely well-conserved.

KEGG Pathways

Hedgehog Pathway: The hedgehog signaling pathway is believed to govern the growth of embryonic stem cells as well as metamorphosis in general. It has 44 genes, of which 23 are conserved among all represented mammals. Three genes, SPA18, DRYK1A, and BTRC, are common in all mammals.

Basal Transcription Factors (BTF): BTF is a major control point for gene expression in eukaryotes, and the pathway contains 34 genes. Most genes in this pathway are well-conserved, the exceptions being GTF2A1L and STON1.

Dorsal-ventral axis formation (DVF): The DVF pathway is controlled by GRK and EGFR and is important in limb development. It has 29 genes, and most of them are well-conserved, the exception being FMN2. The matrices obtained with the HomoloGene and similarity search methods are shown below.

Future work

Future work includes:

1. Fully automate the process
2. Visualize the results
3. Develop a database schema to store the results

Materials and methods

Conclusions

We developed a process for characterizing cross-species conservation of genes and proteins in mammals, and for finding variations within these genes and proteins. This project highlights the challenges of developing meaningful biological networks from disparate computational tools and databases. It also emphasizes the limitations of relying exclusively on BioCarta and KEGG in pathway discovery.

Objectives

This project focused on developing methods for deriving the cross-species annotations for genes and protein groups identified in candidate pathways. The project had three primary goals:

1. Produce a matrix containing genes in a particular biological pathway
2. Construct a list of known protein variations associated with each gene in a pathway
3. Develop a more effective procedure for characterizing cross-species conservation of genes and proteins

Introduction

The new era of genomics and proteomics, with the advent of high-throughput technologies such as microarrays and next-generation sequencing, has opened up great opportunities for the life science research community to better understand biological processes. The gene lists obtained from these experiments are generally analyzed further in the context of biological pathways and of available biological knowledge sets such as gene ontologies, gene sets, and gene enrichments. Efforts are underway to develop new methods to derive biologically meaningful information from the gene lists obtained from such technologies.

Although considerable effort has been expended on building, maintaining, and distributing these gene sets, a system allowing visualization of their conservation across mammalian species has not been developed. We have developed a process to retrieve information from two pathway databases, KEGG and BioCarta, and combine it with information from other biological databases such as HomoloGene and UniProt to characterize cross-species conservation of genes and proteins and gain insights into new biological knowledge. Specifically, we are trying to understand which genes and proteins are common in given pathways across mammalian species such as human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis lupus familiaris), cow (Bos taurus), and chimpanzee (Pan troglodytes). We also explore the problem of finding the variations or mutations in these genes and proteins that are well tolerated across these species.

ABCC/NCI-Frederick

P.O. Box B, Bldg. 430

Frederick, MD 21702

University of Maryland University College

3501 University Boulevard East

Adelphi, MD 20783

Jennifer Ivy Dong, Douglas James Joubert–NIH Library, Raina Kumar, & Robert Stephen–ABCC/NCI

Workflow (each module is implemented with Perl scripts)

Start with a BioCarta/KEGG pathway name, then retrieve the gene list from CGAP with gene sequence IDs.

Identify homologous proteins in HomoloGene database
1. Retrieve homolog group ID for each gene from the HomoloGene database at NCBI
2. Report value 1 for species from homolog for each gene for mammals
3. Populate matrices (heat map), where genes are at X-axis and species at Y-axis

Identify homologous proteins by similarity search
1. Map sequence ID to protein ID using bioDBnet
2. Perform BLASTP for proteins
3. Populate matrices with best hits using taxonomy report

Find variations from MSA
1. Fetch sequences using protein sequence ID for all the homologous genes for each pathway gene for mammals
2. Perform MSA by ClustalW
3. Use *.dnd to make cladogram
4. Search for variations in *.aln files
5. Report variations in tab-delimited files

Find known variations
1. Identify protein IDs of all the proteins for the same species in the NCBI database using sequence ID, or map sequence ID to UniProt ID using bioDBnet
2. For each protein, search for the UniProt entry in files derived from UniProt
3. Read known variations in the flat file and return annotation in a tab-delimited file
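The step "search for variations in *.aln files" can be illustrated with a short sketch. This is not the authors' code: the poster says the workflow was written in Perl, and Python is used here only for illustration. The function names, the toy alignment, and the output column names are all assumptions. The sketch reads ClustalW-style alignment text, compares every sequence against a chosen reference, and reports the alignment columns where any sequence differs.

```python
def parse_clustal(text):
    """Parse ClustalW .aln text into {sequence name: aligned sequence}."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines (which start with a space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        # parts[0] is the name, parts[1] the aligned chunk; a trailing count is ignored.
        seqs[parts[0]] = seqs.get(parts[0], "") + parts[1]
    return seqs


def variable_columns(seqs, reference):
    """Return (1-based alignment column, reference residue, {name: residue}) per differing column."""
    ref = seqs[reference]
    variations = []
    for col, ref_res in enumerate(ref, start=1):
        diffs = {
            name: seq[col - 1]
            for name, seq in seqs.items()
            if name != reference and seq[col - 1] != ref_res
        }
        if diffs:
            variations.append((col, ref_res, diffs))
    return variations


def variations_to_tsv(variations):
    """Format the variations as tab-delimited text, one alignment column per row."""
    lines = ["column\treference\tdifferences"]
    for col, ref_res, diffs in variations:
        diff_text = ";".join(f"{name}:{res}" for name, res in sorted(diffs.items()))
        lines.append(f"{col}\t{ref_res}\t{diff_text}")
    return "\n".join(lines)


# Hypothetical toy alignment, not data from the poster.
ALN = """CLUSTAL W (1.83) multiple sequence alignment

human      MKTAYIAKQR
mouse      MKTAYLAKQR
dog        MKTAY-AKQR
           ***** ****

human      QISFVKSH
mouse      QISFVKSH
dog        QISFAKSH
           **** ***
"""

seqs = parse_clustal(ALN)
variations = variable_columns(seqs, "human")
print(variations_to_tsv(variations))
```

Positions here are alignment columns, so a gap in the reference still counts as a column. Mapping back to residue numbers in the reference protein would need an extra counter that skips gaps.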

The process has four major modules:

1. Identify homologous proteins using the HomoloGene database
2. Identify homologous proteins using similarity search
3. Find variations using multiple sequence alignments
4. Find all known variations from the UniProt database
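As a rough illustration of the first module's matrix step, the sketch below turns per-gene homolog membership into a 0/1 presence matrix and writes it as tab-delimited text. It is written in Python rather than the Perl used for the poster, and it does not query HomoloGene: the `homologs` dictionary is a hypothetical stand-in for the species found in each gene's homolog group. Its three entries are set so the output matches the BRAF, CPEB1, and FMN2 rows of the DVF HomoloGene matrix on this poster.

```python
SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]

# Hypothetical stand-in for HomoloGene group membership (gene -> species with a homolog).
homologs = {
    "BRAF": {"Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"},
    "CPEB1": {"Human", "Mouse", "Cow", "Rat", "Chimp"},
    "FMN2": {"Human"},
}


def presence_matrix(homologs, species=SPECIES):
    """Report 1 where a species has a homolog of the gene, 0 otherwise."""
    return {
        gene: [1 if sp in found else 0 for sp in species]
        for gene, found in homologs.items()
    }


def matrix_to_tsv(matrix, species=SPECIES):
    """Write the matrix as tab-delimited text with one gene per row."""
    lines = ["\t".join(["Gene Symbol"] + list(species))]
    for gene in sorted(matrix):
        lines.append("\t".join([gene] + [str(v) for v in matrix[gene]]))
    return "\n".join(lines)


matrix = presence_matrix(homologs)
print(matrix_to_tsv(matrix))
```

A fixed species list keeps the column order stable across pathways, which makes matrices from different pathways directly comparable.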

HomoloGene matrix (DVF pathway)

Gene Symbol Human  Mouse  Dog    Cow    Rat    Chimp
BRAF        1      1      1      1      1      1
CPEB1       1      1      0      1      1      1
EGFR        1      1      1      1      1      1
ERBB2       1      1      1      0      1      0
ERBB4       1      1      1      0      1      1
ETS1        1      1      1      1      1      0
ETS2        1      1      1      1      1      1
ETV6        1      1      1      1      1      1
ETV7        1      0      1      0      0      1
FMN2        1      0      0      0      0      0
GRB2        1      1      1      1      1      1
KRAS        1      1      1      1      1      1
MAP2K1      1      1      1      1      1      1
MAPK1       1      1      1      1      1      1
MAPK3       1      1      1      1      1      1
NOTCH1      1      1      1      0      1      1
NOTCH2      1      1      1      1      1      1
NOTCH3      1      1      1      0      1      0
NOTCH4      1      1      1      1      1      1
PIWIL1      1      1      1      1      1      1
PIWIL2      1      1      1      0      1      1
PIWIL3      1      0      1      1      0      1
RAF1        1      1      1      1      1      1
RHBDL1      1      1      1      1      1      1
RHBDL3      1      1      1      1      1      1
SOS1        1      1      1      1      1      1
SOS2        1      1      1      1      0      0
-           1      1      1      1      1      1

Similarity search matrix (DVF pathway)

Columns: 1 Human, 2 Mouse, 3 Dog, 4 Cow, 5 Chimp, 6 Rat, 7 Horse, 8 Platypus, 9 Wild boar, 10 Rhesus macaque, 11 Bonobo, 12 Gorilla, 13 Sumatran orangutan, 14 Cynomolgus monkey, 15 Cat, 16 Syrian hamster, 17 European rabbit, 18 Domestic sheep, 19 Opossum, 20 Mice, 21 White bear, 22 Western baboon

Gene Symbol   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
BRAF          1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  0  1  0  0  0  0
CPEB1         1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  0  0  0  0  0  0
EGFR          1  1  1  1  1  1  1  1  1  1  0  0  1  0  1  1  0  0  0  0  0  0
ERBB2         1  1  1  1  1  1  1  1  1  1  0  0  1  0  1  1  0  0  0  0  0  0
ERBB4         1  1  1  1  1  1  1  1  1  1  0  0  1  0  1  1  0  0  0  0  0  0
ETS1          1  1  1  1  1  1  1  1  1  1  0  0  1  0  0  0  1  1  0  0  0  0
ETS2          1  1  1  1  1  1  1  1  1  1  0  0  1  0  0  0  0  1  0  0  0  0
ETV6          1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0
ETV7          1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0
FMN2          1  1  1  1  1  1  1  1  0  1  0  1  0  0  0  0  0  0  0  0  0  0
GRB2          1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  0  0  0  1  1  0
KRAS          1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  1  0  0  1  0  0
MAP2K1        1  1  1  1  1  1  1  1  1  1  0  0  0  1  0  1  1  0  0  1  0  0
MAPK1         1  1  1  1  1  1  1  0  1  1  0  0  1  0  0  0  0  1  0  0  0  0
MAPK3         1  1  1  1  1  1  1  0  1  1  0  0  1  0  0  0  0  1  1  0  0  0
NOTCH1        1  1  1  1  1  1  1  1  1  1  0  0  0  1  0  0  0  0  1  0  0  0
NOTCH2        1  1  1  1  1  1  1  1  1  1  0  0  0  1  0  0  0  0  1  0  0  1
NOTCH3        1  1  1  1  1  1  1  1  1  1  0  0  0  1  0  0  0  0  1  0  0  0
NOTCH4        1  1  1  1  1  1  1  1  1  1  0  0  0  1  0  0  0  0  1  0  0  0
PIWIL1        1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  1  0  0  0
PIWIL2        1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  1  0  0  0
PIWIL3        1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  1  0  0  0
PIWIL4        1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  1  0  1  0  0  0
RAF1          1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  0  1  1  0  0  0
RHBDL1        1  1  1  1  1  1  1  1  0  1  0  0  0  0  0  0  1  0  1  0  0  0
RHBDL3        1  1  1  1  1  1  1  1  0  1  0  0  0  0  0  0  0  0  1  0  0  0
SOS1          1  1  1  1  1  1  1  1  0  1  0  0  1  0  0  0  0  0  1  0  0  0
SOS2          1  1  1  1  1  1  1  1  1  1  0  0  1  0  0  0  0  0  1  0  0  0
-             1  1  1  1  1  1  1  1  1  1  0  0  1  1  0  0  0  0  1  1  1  0
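Statements such as "most of the genes are well-conserved, the exception being FMN2" can be read directly off a presence matrix. The sketch below is an illustration in Python, not part of the authors' Perl workflow, and its function names are assumptions. It uses four rows copied from the DVF HomoloGene matrix on this poster, lists the genes found in every species, names the species in which each remaining gene is missing, and counts the genes found per species.

```python
SPECIES = ["Human", "Mouse", "Dog", "Cow", "Rat", "Chimp"]

# Four rows copied from the DVF HomoloGene matrix.
rows = {
    "EGFR": [1, 1, 1, 1, 1, 1],
    "ERBB2": [1, 1, 1, 0, 1, 0],
    "FMN2": [1, 0, 0, 0, 0, 0],
    "SOS2": [1, 1, 1, 1, 0, 0],
}


def conservation_summary(species, rows):
    """Split genes into fully conserved ones and ones missing in at least one species."""
    fully_conserved = sorted(gene for gene, vals in rows.items() if all(vals))
    missing_in = {
        gene: [sp for sp, v in zip(species, vals) if v == 0]
        for gene, vals in rows.items()
        if not all(vals)
    }
    return fully_conserved, missing_in


def genes_per_species(species, rows):
    """Count how many of the listed genes were found in each species."""
    return {sp: sum(vals[i] for vals in rows.values()) for i, sp in enumerate(species)}


fully_conserved, missing_in = conservation_summary(SPECIES, rows)
counts = genes_per_species(SPECIES, rows)
print("Conserved in all species:", fully_conserved)
for gene, absent in sorted(missing_in.items()):
    print(f"{gene} not found in: {', '.join(absent)}")
print("Genes found per species:", counts)
```

The same functions would work on the wider similarity search matrix by passing its 22 species names and rows instead.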

Disclaimer: The opinions and assertions presented here are the private views of the authors and are not necessarily those of ABCC/NCI.
