
GENOME WIDE SURVEY OF CERTAIN

MAMMALIAN GPCRS AND OLFACTORY

RECEPTORS

A THESIS

Submitted by

NAGARATHNAM B

in partial fulfillment for the award of the degree

of

DOCTOR OF PHILOSOPHY

FACULTY OF SCIENCE AND HUMANITIES

ANNA UNIVERSITY

CHENNAI 600 025

JUNE 2012



iii

ABSTRACT

In the current era of G-protein coupled receptor (GPCR) research,
computational sequence analysis plays a vital role in identifying related
sequences (homologues), conserved features (domains, motifs) and
evolutionary relationships (orthologs) for protein families of interest at
intra- and inter-genomic levels. Candidate GPCRs and olfactory receptors
(ORs, class A type GPCRs) are important for their diverse cellular
activities and were therefore chosen for a genome-wide survey in selected
eukaryotic genomes, which in turn helps to establish structure-function
relationships.

GPCRs are generally predicted to have an extracellular N-terminus (N-out
topology), an intracellular C-terminus and seven transmembrane helices
(TMHs) connected by three intracellular and three extracellular loops, and
are therefore termed serpentine receptors.
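The N-out topology described above can be made concrete with a minimal sketch. It assumes that seven helix boundaries have already been predicted by some topology predictor; the function name `label_topology` and the coordinates used below are hypothetical and are not taken from this thesis.

```python
def label_topology(helices, length):
    """Label the segments of a 7TM protein, assuming N-out topology.

    helices: seven (start, end) pairs, 1-based and inclusive, in ascending
             order (hypothetical predictor output).
    length:  total sequence length.
    Returns a list of (label, start, end) tuples.
    """
    if len(helices) != 7:
        raise ValueError("expected exactly seven TM helices")
    # With an extracellular N-terminus, loops alternate inside/outside.
    loop_names = ["ICL1", "ECL1", "ICL2", "ECL2", "ICL3", "ECL3"]
    segments = []
    first_start = helices[0][0]
    if first_start > 1:
        segments.append(("N-term (extracellular)", 1, first_start - 1))
    for i, (start, end) in enumerate(helices):
        segments.append((f"TM{i + 1}", start, end))
        if i < 6:
            next_start = helices[i + 1][0]
            segments.append((loop_names[i], end + 1, next_start - 1))
    last_end = helices[-1][1]
    if last_end < length:
        segments.append(("C-term (intracellular)", last_end + 1, length))
    return segments


# Illustrative boundaries only, not real receptor coordinates.
example_helices = [(10, 30), (40, 60), (70, 90), (100, 120),
                   (130, 150), (160, 180), (190, 210)]
example_segments = label_topology(example_helices, 230)
```

The same labels (TM1 to TM7, ICL1 to ICL3, ECL1 to ECL3) are used throughout the thesis when motifs are mapped onto the membrane topology.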

Previous cross-genome studies on human-Drosophila GPCRs motivated a
cross-genome clustering of human-C. elegans GPCRs (Chapter 2).
Profile-based clustering (RPS-BLAST) was employed to associate more than
1000 C. elegans GPCRs with previously grouped human GPCR clusters covering
eight major receptor types. The resulting 32 human-C. elegans GPCR clusters
were analyzed for five types of cluster association observed in the tree
topology, for which the following terms are proposed: human GPCR clade
[HC], coclusters [CC], neighbor clades [NC], neighbor members [NM] and
species-specific members [SS]. These associations help to connect
functional relevance at intra- and inter-genomic levels. Interestingly, the
coclusters [CC] were significant and exhibited evolutionary integrity at
the inter-genomic level. In addition, the 27 identified orthologs
illustrate the effectiveness of cross-genome clustering techniques in
connecting related GPCRs even at



iv

remote homology. Overall, 84% of the GPCR sequences across genomes were
successfully associated by RPS-BLAST at significant E-value thresholds
(ranging from 0.001 to 1) (work published).
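As an illustration of E-value based association, the following minimal sketch assigns each query sequence to the cluster profile of its best hit below a chosen threshold and reports the fraction of associated sequences. It assumes that the search results are already available as (query, cluster, E-value) tuples; the function names, identifiers and values are hypothetical, and the sketch does not reproduce the actual RPS-BLAST workflow of Chapter 2.

```python
def assign_to_clusters(hits, max_evalue=0.001):
    """Map each query to the cluster of its lowest E-value hit.

    hits: iterable of (query_id, cluster_id, evalue) tuples
          (hypothetical, pre-parsed search output).
    Hits with an E-value above max_evalue are ignored.
    """
    best = {}
    for query, cluster, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (cluster, evalue)
    return {query: cluster for query, (cluster, _) in best.items()}


def fraction_associated(hits, all_queries, max_evalue):
    """Fraction of queries associated with any cluster at the threshold."""
    assigned = assign_to_clusters(hits, max_evalue)
    return len(assigned) / len(all_queries)


# Invented example hits; cluster labels follow the thesis abbreviations.
example_hits = [
    ("ce1", "PR", 1e-5),
    ("ce1", "BGA", 1e-3),
    ("ce2", "SEC", 0.5),
    ("ce3", "GLR", 5.0),
]
strict = assign_to_clusters(example_hits, max_evalue=0.001)
relaxed = assign_to_clusters(example_hits, max_evalue=1.0)
```

Relaxing the threshold from 0.001 to 1 associates more sequences at the cost of weaker hits, which is why results are reported per threshold.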

The cross-genome clustering of human and C. elegans GPCRs motivated a
phylogenetic analysis focused exclusively on serpentine receptors (SRs)
(Chapter 3). Because nearly 20 SR protein families from C. elegans are
related to chemosensation, a phylogenetic analysis of 683 serpentine
receptors was carried out to identify related sequences and clusters that
represent family-specific or receptor-specific sequence features, and
ultimately to connect them at the superfamily level. Interestingly, odr-10,
the only C. elegans receptor annotated for olfaction (it senses diacetyl),
was found together with 43 SRs in the phylogeny. All homologues associated
with odr-10 belong to the Str superfamily, and str-112 was found to be its
most closely related sequence homologue. As a case study, odr-10 was
modelled to understand its secondary structural details. An Str
family-specific QLF motif was identified in the ICL3/TM6 region of odr-10,
and 92 other SR family-specific motifs were identified using the TM-MOTIF
package. The identified sequence features can be used to train SVM models
and to predict putative receptors from other nematode species.

An attempt was made to design a user-friendly alignment viewer, TM-MOTIF
(work published), to detect and display conserved motifs on the predicted
membrane topology of a set of aligned transmembrane proteins (Chapter 4).
The tool is effective in identifying not only conserved motifs (default
conservation threshold 60%) but also amino acid substitutions (AAS) with
their respective physico-chemical properties (by using



v

an in-house program, MotifS) at each position of the alignment. TM-MOTIF
lets users submit their sequences of interest (multiple FASTA or MSA) and
visualizes the seven predicted helices of TM proteins in the VIBGYOR
colouring scheme. Users can also align a sequence of interest with any one
of the provided reference sequences (of known structure) to obtain a
pairwise alignment, a display that is particularly helpful as a
prerequisite for homology modelling. Users can further perform a BLAST
search to identify the nearest homologue in the incorporated cross-genome
GPCR and OR cluster datasets of selected organisms. In short, TM-MOTIF is
well suited to comparative genomics and to identifying cluster-specific,
receptor-specific and common motifs observed at various percentages of
conservation within and across genomes. The package is integrated into DOR
(Database of Olfactory Receptors).
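The idea of a percentage conservation threshold can be sketched as follows. This is only a minimal illustration under stated assumptions, not the MotifS or TM-MOTIF implementation: a column is called conserved when its most frequent residue occurs in at least 60% of the aligned sequences (gaps counted as non-matching), and runs of at least three conserved columns are reported as motifs. The function names, the minimum run length and the toy alignment are hypothetical.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.6):
    """Return, per column, the conserved residue or None.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    A residue is conserved if it occurs in >= threshold of all sequences.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        if not counts:
            consensus.append(None)
            continue
        residue, count = counts.most_common(1)[0]
        conserved = count / len(alignment) >= threshold
        consensus.append(residue if conserved else None)
    return consensus


def find_conserved_runs(alignment, threshold=0.6, min_len=3):
    """Report (0-based start column, motif) for runs of conserved columns."""
    consensus = conserved_columns(alignment, threshold)
    runs, start = [], None
    # A trailing None closes a run that reaches the last column.
    for i, residue in enumerate(consensus + [None]):
        if residue is not None and start is None:
            start = i
        elif residue is None and start is not None:
            if i - start >= min_len:
                runs.append((start, "".join(consensus[start:i])))
            start = None
    return runs


# Toy alignment (invented), with a DRY-like block in columns 1 to 3.
toy_alignment = ["ADRYL", "GDRYV", "SDRYI", "TERYM", "KDRFA"]
toy_runs = find_conserved_runs(toy_alignment)
```

Raising or lowering `threshold` corresponds to viewing motifs at different percentages of conservation within or across genomes.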

Conserved motifs and AAS play a crucial role in protein function. The
previously established 32 clusters of eight major receptor types in the
cross-genome GPCR datasets (human-Drosophila GPCR clusters, human-C.
elegans GPCR clusters and the human-only GPCR cluster dataset) were studied
primarily for conserved motifs (MotifS program), and the TM-MOTIF package
was used to map the observed motifs onto their respective membrane
topology (Chapter 5).

Interestingly, a total of 33 conserved motifs were identified in the
human-Drosophila GPCR clusters, and 76% of them were observed in TM
helices, predominantly in TM2 and TM7. Besides classical motifs such as
E/DRY and NPXXY, motifs observed in a single receptor type
(cluster-specific or receptor-specific motifs), in two receptor types and
in multiple receptor types were also documented for the cross-genome GPCR
clusters (work published).
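For readers unfamiliar with the motif notation, the two classical motifs named above can be written as regular expressions: E/DRY means glutamate or aspartate followed by arginine and tyrosine, and the X positions in NPXXY accept any residue. The sketch below is a hedged illustration; the function name and the test sequence are invented and are not receptor data from this study.

```python
import re

# Classical class A GPCR motifs as regular expressions.
# "[ED]" encodes the E/D alternative; "." encodes an arbitrary residue (X).
MOTIF_PATTERNS = {
    "E/DRY": r"[ED]RY",
    "NPxxY": r"NP..Y",
}


def find_classical_motifs(sequence):
    """Return {motif name: [1-based start positions]} for one sequence."""
    sequence = sequence.upper()
    hits = {}
    for name, pattern in MOTIF_PATTERNS.items():
        # The lookahead reports overlapping occurrences as well.
        lookahead = re.compile(f"(?=({pattern}))")
        hits[name] = [m.start() + 1 for m in lookahead.finditer(sequence)]
    return hits


# Invented sequence for illustration only.
demo_hits = find_classical_motifs("MAAADRYLLLNPLIYAA")
```

Whether a hit is meaningful still depends on where it falls in the membrane topology, for example the E/DRY motif near the TM3/ICL2 boundary.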



vi

An olfactory receptor data repository was generated for selected eukaryotic
organisms (yeast, worm, fly, mouse and human), and the sequences were
aligned to produce intra- and inter-genomic phylogenies. Interestingly, 371
functional ORs from the human genome were distributed in 10 distinct
clusters, and class I (sensing water-borne odors) and class II (sensing
air-borne odors) receptors could be discriminated when a few selected fish
and amphibian ORs were introduced into the human OR phylogeny. In another
study, fly ORs showed no significant coclustering with human ORs,
suggesting that insect ORs are evolutionarily distinct from mammalian ORs.
This could be due to independent evolution, life style or the reverse
topology of fly ORs. Selected nematode ORs also showed no coclustering with
human ORs, possibly owing to the long lineage separation and nematode life
style. The study of human-mouse OR clusters showed significant
coclustering, and further studies were carried out with ORs of canines,
rodents and non-human primates to analyze cluster association with human
ORs. The results of these sequence studies were organized in a publicly
available database, DOR. It provides sequences, predicted TM boundaries,
intra- and inter-genomic alignments and phylogenies of selected genomes. It
also includes a motif identification tool (TM-MOTIF) and is associated with
other features such as predicted secondary structure and dimer prediction
from collaborators (work in press).

In essence, the genome-wide survey yields representative sequences, cluster
associations, cluster-specific motifs, orthologs and coclusters at intra-
and inter-genomic levels, which ultimately guide the transfer of functional
properties from known to unknown genes/proteins and aid in understanding
structure-function relationships.



vii

ACKNOWLEDGEMENT

I express my deep sense of gratitude to Dr. V. Balakrishnan,

Department of Biotechnology, KSR College of Technology, Tiruchengode for

his valuable guidance for my Ph.D. study. Besides I am extremely thankful to

my co-supervisor and mentor Prof. Dr. R. Sowdhamini, Lab-25, National

Center for Biological Sciences, Bangalore who has been a source of

inspiration, help, guidance, advice to me throughout the course of this

research work. Further, I sincerely express my earnest gratitude to my

doctoral committee member Dr. S. SenthilKumar, PSG College of

Technology, Coimbatore. I express my heartfelt thanks to Prof. Dr. K.

Karunakaran, Vice Chancellor, and Dr. P. Renuka Devi, Director-Research,

Anna University of Technology Coimbatore for graciously permitting me to

do this research.

I submit my gratitude to Prof. Dr. K. Vijayaragavan, RSF,

Director, NCBS, Bangalore, Prof. Dr. Obaid Siddiqi, RSF, Prof. Dr. Apurva

Sarin, Prof. N. Srinivasan from IISc., Bangalore for extending care and moral

support to pursue the research work and I submit my deepest gratitude to

Mr. Ashok Rao, Mr. Shaju, teaching and non-teaching staff, my lab mates and

all@ncbs for their kind hearted support in encouraging my research thirst.

Thanks to my family members and my beloved APPA.

B. NAGARATHNAM



viii

TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iii

LIST OF TABLES xxii

LIST OF FIGURES xxiv

LIST OF ABBREVIATIONS xxx

1 INTRODUCTION 1

1.1. PRIOR ART ON GENOME-WIDE SURVEY 2

1.2. BREAKTHROUGHS IN GPCR CRYSTALLOGRAPHY STUDIES 4

1.3. GPCRS: POPULAR DRUG TARGETS 6

1.4. STRUCTURE AND CELLULAR ACTIVITIES OF MEMBRANE PROTEINS 7

1.5. MEMBRANE PROTEIN: TOPOLOGY 7

1.6. GPCR MECHANISM 9

1.7. GPCR CLASSIFICATION 10

1.7.1 Olfactory Receptors (ORs) 11

1.7.2 Classical Knowledge on Olfactory

Receptors 12

1.7.3 Olfactory Signaling Pathway in

Human ORs 13

1.7.4 ORs, GRs and IRs in Drosophila 14

1.7.5 Insect olfaction (Drosophila ORs) 14

1.7.6 Nematode Olfaction 15

1.7.7 Mouse Olfaction 16



ix


1.8 DATA REPOSITORIES FOR MEMBRANE

PROTEINS 16

1.9 COLLECTION OF GPCR- HOMOLOGUES 17

1.9.1 BLAST (Basic Local Alignment

Search Tool) 18

1.9.2 PSI-BLAST (Profile Vs Sequence

comparison method) 19

1.9.3 Reverse PSI-BLAST (Sequence Vs

Profile comparison method) 20

1.10 MULTIPLE SEQUENCE ALIGNMENT

TECHNIQUES 22

1.10.1 CLUSTAL W 23

1.10.2 PRALINETM 24

1.10.3 MAFFT 24

1.11 DERIVING PHYLOGENY OF GPCRs/ORs 25

1.11.1 PHYLIP 26

1.11.2 TREE-PUZZLE 26

1.11.3 MEGA (Molecular Evolutionary

Genetics Analysis) 27

1.12 CLUSTER ASSOCIATIONS 27

1.13 SEQUENCE CONSERVATION AND

DIVERSITY 28

1.14 HOMOLOGY MODELLING OF GPCRs/ORs 29

2 CROSS-GENOME CLUSTERING OF HUMAN AND

C. ELEGANS G-PROTEIN COUPLED

RECEPTORS 30

2.1 INTRODUCTION 30



x


2.2 C. elegans - AN ATTRACTIVE ANIMAL MODEL 30

2.2.1 Features Related to C. elegans and

Human GPCRs 31

2.3 OBJECTIVES 33

2.4 PRIOR ART 33

2.4.1 Superfamilies of Serpentine Receptors 34

2.5 METHODOLOGY 35

2.5.1 Selection Criteria for C. elegans GPCRs 35

2.5.2 Generation of Representative Profiles 38

2.5.3 Performing RPS-Blast 38

2.5.4 Cross-Genome Alignment of Human-C. elegans GPCRs 39

2.5.5 Cross-Genome Phylogeny of Human-C. elegans GPCRs 40

2.5.6 Terminologies used to Describe Phylogeny

2.5.6.1 Human GPCR clade [HC] 40

2.5.6.2 Coclusters [CC] 40

2.5.6.3 Neighbor Clades [NC] 41

2.5.6.4 Neighbor Members [NM] 41

2.5.6.5 Species specific Members [SS] 41

2.5.6.6 Superfamilies of Serpentine

receptors (SR) 41

2.6 RESULTS AND DISCUSSION 42

2.6.1 Result Summary for Peptide Receptors 43

2.6.2 Result Summary for Chemokine Receptors 67

2.6.3 Result Summary for Nucleotide and Lipid

receptors 68



xi


2.6.4 Result Summary for Biogenic Amine

Receptors 81

2.6.5 Result Summary for Class B (Secretin)

Receptors 94

2.6.6 Result Summary for Cell

Adhesion Receptors 99

2.6.7 Result Summary for Class C (Glutamate)

Receptors 101

2.6.8 Result Summary for Frizzled/Smoothened

Receptors 108

2.7 CONCLUSION 110

3 PHYLOGENETIC ANALYSIS OF SERPENTINE

RECEPTORS OF C. ELEGANS AND

IDENTIFICATION OF CONSERVED MOTIFS IN

SERPENTINE RECEPTOR SUPERFAMILIES 117

3.1 INTRODUCTION 117

3.2 HOMOLOGUES OF C. elegans GPCRs 118

3.3 OBJECTIVES 118

3.4 CHEMOSENSORY RECEPTORS IN C. elegans 119

3.5 CHEMOSENSORY NEURONS AND

OLFACTORY APPARATUS IN C. elegans 119

3.6 FAMILIES AND SUPERFAMILIES OF

SERPENTINE RECEPTORS IN C. elegans 120

3.7 FEATURES AND IMPORTANCE OF SRs 122

3.8 SRs: FUNCTIONAL RELEVANCE WITH

OTHER EUKARYOTIC GPCRs 122

3.9 METHODOLOGY 123



xii


3.9.1 Data Collection 123

3.9.2 Prediction of TM-helices by HMMTOP 123

3.9.3 Alignment Procedure by MAFFT 124

3.9.4 Phylogeny of Selected Serpentine Receptors 124

3.9.5 Identification of Motifs in SRs 124

3.10 RESULTS 125

3.10.1 Identified Motifs in SR Families : A

Pilot Study 127

3.10.2 Homology Modelling of odr-10 128

3.10.2.1 Pairwise alignment of odr-10

with bovine rhodopsin sequence 128

3.10.2.2 Alignment by MAFFT 129

3.10.2.3 Structure validation for Odr-

10 model 130

3.10.2.4 Preliminary phylogenetic analysis 131

3.10.2.5 Odr-10 an outgroup to HOR 131

3.11 CONCLUSION 132

4 TM-MOTIF: A PACKAGE AND AN ALIGNMENT

VIEWER TO IDENTIFY CONSERVED MOTIFS

AND AMINO ACID SUBSTITUTIONS IN ALIGNED SET OF SEVEN TRANSMEMBRANE

HELIX PROTEINS 135


4.1.1. Functional Importance of Conserved Motifs in TM-Proteins 136

4.1.2. Motif Related to Structural Integrity and Stability 137



xiii


4.1.3. Impacts of Motifs in Evolutionary Bioinformatics 138

4.2. OBJECTIVES OF TM-MOTIF 138

4.3. KEY FEATURES OF TM-MOTIF 139

4.4. METHODOLOGY 140

4.4.1. In-Built Dataset of Cross-Genome GPCR and OR Cluster Dataset 141

4.4.1.1 Human-Drosophila cross-genome

GPCR clusters 141

4.4.1.2 Human-C. elegans cross-genome

GPCR clusters 141

4.4.1.3 Human-mouse cross-genome OR

clusters 141

4.4.2 Alignment Procedures for Cross-Genome

GPCR/OR Clusters 141

4.4.3. Prediction of Membrane Topology for TM Helices and Loops 142

4.4.4 Detection of Motifs and Amino Acid Substitution (AAS) in the Cross-Genome Alignment 143

4.4.5 Mapping of Identified Motifs on TM-helices and Loops in MSA 143

4.4.6 Identification of Homologous Sequences for User-Submitted Queries by Performing BLAST 144

4.4.7 Pairwise Alignment in TM-MOTIF 144

4.5 RESULTS 145



xiv


4.5.1. Software Input and Output Options 145

4.5.2. Input Options 146

4.5.3. Output Options 146

4.5.3.1 Display of predicted 7 TM-helices in VIBGYOR colouring scheme (by using Run TM option) 146

4.5.3.2 Display of Identified Motifs and AAS in MSA (by using Run Motif option) 147

4.5.3.3 Display of Detected Motifs on TM-helices (by using Run TM-Motif option) 148

4.5.3.4 Alignment with Reference Sequence 150

4.5.3.5 Identifying closest homologues

of user sequence in selected

organisms 151

4.5.3.6 Display of Over predicted helices 151

4.6. DEFAULT PARAMETERS 152

4.6.1 TM-MOTIF- Output Files 152

4.7. CAVEAT AND FUTURE DEVELOPMENT 153

4.8. AVAILABILITY 154

4.9. CONCLUSIONS 154



xv


5 ANALYSIS ON CONSERVED MOTIFS

AND PERMITTED AMINO ACID

EXCHANGES IN CROSS-GENOME

GPCR CLUSTERS 156


5.2 OBJECTIVES 157

5.3 RESIDUE CONSERVATION IN CROSS-

GENOME SEQUENCES 158

5.4 IMPACT OF AMINO ACID CONSERVATION

AND TYPES OF SUBSTITUTIONS 159

5.5 METHODS 159

5.5.1 Cross-genome GPCR cluster dataset 160

5.5.2 Alignment Procedure 160

5.5.3 Prediction of membrane topology 161

5.5.4 Program to Detect Motifs and AAS 161

5.6 RESULTS 162

5.7 OCCURRENCE OF MOTIFS FOR SINGLE

RECEPTOR TYPE 163

5.8 MOTIFS OBSERVED IN HUMAN-DROSOPHILA CROSS-GENOME CLUSTERS 164

5.8.1 Motifs Observed in Transmembrane Helices 164

5.8.2 Motifs Observed in Loop Regions 165

5.9 MOTIFS OBSERVED IN HUMAN- C. elegans

GPCR CROSS-GENOME CLUSTERS 167

5.10 CHARACTERISTIC MOTIFS FROM

CROSS-GENOME GPCR CLUSTERS 169



xvi


5.10.1 Conserved D/ERY and NPXXY motifs in

GPCR Clusters 169

5.10.2 Identified KLK/R and RLAR/K motif in

Secretin Receptor 169

5.10.3 Conserved PMNYM / PMSYM motif in

BGA Receptor 170

5.11 SUMMARY 171

6 GENOME WIDE SURVEY OF

OLFACTORY RECEPTORS (ORS) IN

SELECTED EUKARYOTIC GENOMES 173

6.1. PHYLOGENETIC STUDY ON SELECTED HUMAN ORS 173

6.1.1. Introduction 173

6.1.2. Objectives and Scopes 173

6.1.3. Olfactory Receptors 174

6.1.4. OR: Membrane Topology 175

6.1.5. Prior Studies on ORs 175

6.1.6. Methodology 177

6.1.6.1. Retrieval of OR sequences 177

6.1.6.2. Prediction of membrane topology: Human ORs 178

6.1.6.3. Alignment procedure 179

6.1.6.4. Phylogeny on selected human olfactory receptors 179

6.1.6.5. Analysis of phylogeny 180



xvii


6.1.7. Results 181

6.1.7.1. Class I and II type receptors in human OR phylogeny 181

6.1.7.2. Sequence features of 10 human OR-subclusters 181

6.1.7.3. Representative OR sequences 182

6.1.7.4. Motif analysis on human


6.1.7.5. SVM Analysis 185

6.2. CROSS-GENOME PHYLOGENY ON SELECTED ORS FROM HUMAN AND FISH GENOMES 186

6.2.1. Objective 186

6.2.2. Review of Literatures 187

6.2.3. Fish ORs 187

6.2.4. Results 188

6.2.5. Sequence conservation: across fish and human ORs 189

6.3 CROSS-GENOME PHYLOGENY ON

SELECTED ORS FROM HUMAN AND

AMPHIBIAN GENOME 191

6.3.1 Objective 191

6.3.2 Literature survey on class

I and II type ORs 192

6.3.3 Amphibian ORs 192

6.3.4 Results 193

6.3.5.1 Cocluster HXC1 Class

I type receptors 195

6.3.5.2 Cocluster HXC2- class

II type receptors 195



xviii


6.3.5.3 Cocluster HXC3 - class II type receptors 196

6.4 PHYLOGENETIC ANALYSIS ON

DROSOPHILA OLFACTORY RECEPTORS 199

6.4.1 Background 199

6.4.2 Drosophila ORs 199

6.4.3 Results on Drosophila OR

Phylogeny Analysis 200

6.4.3.1 Cluster association: 10 subclusters 200

6.4.4 Summary 203

6.5 CROSS-GENOME PHYLOGENETIC

ANALYSIS ON SELECTED ORS

FROM DROSOPHILA, YEAST AND

HOMO SAPIENS 204


6.5.2 Insect ORs and mammalian ORs:

(Evolutionarily unrelated) 204

6.5.3 Membrane proteins in Yeast 205

6.5.4 Results 205

6.5.5 Summary 206

6.6 CROSS-GENOME PHYLOGENETIC ANALYSIS ON SELECTED OLFACTORY

RECEPTORS FROM HUMAN AND

C. elegans GENOMES 206

6.6.1 Odr-10 and homologues 207

6.6.2 Results and Discussion 208

6.6.3 Summary 211



xix


6.7 CROSS-GENOME PHYLOGENETIC

ANALYSIS ON SELECTED ORS

FROM HUMAN AND MOUSE GENOMES 212

6.7.1 Introduction 212

6.7.2 Objectives 213

6.7.3 Human-Mouse OR Orthology 213

6.7.4 Complex Picture on Human-Mouse

OR Orthology 214

6.7.5 Methodology 215

6.7.6 Results 215

6.7.6.1 Cross-genome OR cluster

association 215

6.7.6.2 Cross-genome phylogeny

with Class-I type receptor

homologues 217

6.7.7 Common motifs in the Cross-genome

phylogeny 218

6.7.8 Summary 218


OLFACTORY RECEPTORS FROM

SELECTED HUMAN AND NON-HUMAN

PRIMATES 220



6.8.3 Methodology 220

6.8.4 Results 221

6.8.5 Summary 222



xx


6.9 DATABASE OF OLFACTORY

RECEPTORS (DOR) 222


6.9.2 Features on OR sequences in DOR 224

6.9.2.1 OR sequences of target genomes: 225

6.9.2.2 Predicted TM boundaries 226

6.9.2.3 Single/cross- genome OR

alignments 227

6.9.2.4 Cluster association and

Phylogeny 228

6.9.2.5 Software and Tools

(TM-MOTIF) in DOR 229

6.9.3 Structural features (Application of

sequence searches) 230

6.9.4 Summary 233

7 CONCLUSION 236

7.1 COMPENDIUM 236

7.2 CROSS-GENOME GPCR CLUSTERING 237


SERPENTINE RECEPTORS 240

7.4 TM-MOTIF PACKAGE 242

7.5 STUDY ON CONSERVED MOTIFS AND

AAS IN CROSS-GENOME GPCR CLUSTERS 245

7.6 PHYLOGENETIC ANALYSIS ON ORS

IN SELECTED EUKARYOTIC GENOMES 247

7.7 SUMMARY 253



xxi


APPENDIX 1 THE LIST OF IDENTIFIED

FAMILY-SPECIFIC MOTIFS IN SR 256

REFERENCES 260

LIST OF PUBLICATIONS 284

CURRICULUM VITAE 285



xxii

LIST OF TABLES

TABLE NO. TITLE PAGE NO.

2.1 Distribution of Human and C. elegans GPCRs in

32 Clusters 114

2.2 List of Identified Orthologs 116

3.1 List of identified motifs in serpentine receptor superfamilies 134

5.1 Motifs observed in the transmembrane helices and loop regions of human and Drosophila GPCR clusters 162

6.1 Analysis on sequence features of 10 human

OR subclusters 183

6.2 List of conserved motifs in 10 human OR

subclusters (60% level of conservations) 184

6.3 Sequence identity of neighboring fish ORs

and human class I type receptors observed in

cross-genome OR phylogeny 191

6.4 Sequence identity of neighboring frog ORs and

human class I type receptors observed in



human class II type receptors observed in

cross-genome OR phylogeny (referred as HXC2) 198


human class II type receptors observed in

cross-genome OR phylogeny (referred as HXC3) 198



xxiii

TABLE NO. TITLE PAGE NO.

6.7 Significant cluster association for str type

receptors in CeC3 and sequence pairs with high

/low identity has been given 210

6.8 Sequence identity and similarity between

odr-10 and associated SR 213

6.9 Percentage identity for selected human and

mouse ORs for significant association from


6.10 Percentage Identity between selected human ORs and

non-human ORs 221



xxiv

LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.

1.1 Central dogma of genome-wide survey on sequences 3

1.2 Crystal structure of bovine rhodopsin (Li et al 2004) 5

1.3 Membrane topology of olfactory receptor

(odr-10) in C. elegans 8

1.4 GPCR signaling pathway 10

1.5 ORs and organization of the olfactory system in

mammals and OR signaling pathway

(Meyer et al 2000) 13

1.6 Overview on the techniques involved in

genome-wide survey 22

2.1 Flow-chart to depict the step-wise procedure for

cross-genome clustering of GPCRs 37

2.2(a-c) Pictorial representation for various types of

cluster association 42

2.3(a-b) Cross-genome phylogeny of peptide receptors:

(Rectangular Display & Radial Display) 46











xxv










2.12 (a-b) Cross-genome phylogeny of peptide receptors:

(Rectangular Display and Radial Display) 64



2.14(a-b) Cross-genome phylogeny of chemokine receptors:


2.15(a-b) Cross-genome phylogeny of chemokine receptors:


2.16(a-b) Cross-genome phylogeny of nucleotide and lipid

receptors (Rectangular Display & Radial Display) 72

2.17(a-b) Cross-genome phylogeny of nucleotide and

lipid receptors (Rectangular Display & Radial Display) 74



2.19(a-b) Cross-genome phylogeny of peptide receptors

nucleotide and lipid receptors (Rectangular Display

& Radial Display) 78





xxvi




2.22(a-b) Cross-genome phylogeny of biogenic amine

receptors (Rectangular Display &

Radial Display) 84


receptor (Rectangular Display & Radial Display) 86







2.27(a-b) Cross-genome phylogeny of secretin type


2.28(a-b) Cross-genome phylogeny of secretin type


2.29(a-b) Cross-genome phylogeny of cell adhesion type


2.30(a-b) Cross-genome phylogeny of glutamate receptor










xxvii


2.34(a-b) Cross-genome phylogeny of FRZ/SMT type


2.35 (a-b) Distribution of C. elegans GPCRs at various E-value

thresholds 112

3.1 Pie-diagram to show the distribution of serpentine

receptors (SR) in the dataset 123

3.2 Phylogeny on selected serpentine receptors

(circular view tree) 125

3.3 The subcluster showing odr-10 and its homologues 127

3.4 Pairwise alignment of odr-10 with bovine

rhodopsin sequence 129

3.5 Three-dimensional model of olfactory receptor

odr-10 and structure validation 130

3.6 Phylogeny on selected human olfactory receptors

with an olfactory receptor (odr-10) from C. elegans 132

4.1 Flow-chart 140

4.2 Tool guide of TM-MOTIF : an overview 142

4.3 Snapshot for the available main menu of the front window of TM-MOTIF with user interactive features 145

4.4 Options given for the submission of input sequences

in TM-MOTIF package 146

4.5 Sample output for the option RUN TM 147

4.6 Sample output for the option RUN MOTIF 148

4.7 Sample output for the option RUN TM-Motif 149

4.8 Snapshot for the display of pairwise alignment of

users input sequence with selected reference sequence 150

4.9 Snapshot Depicts the Display of Over Predicted

TM-Helices 151



xxviii


5.1 Pictorial representation to denote the occurrence

of highly conserved DRY motif in TM3, ICL2 158

5.2 Flow-chart describes about the steps involved in

the study 159

5.3 Percentage residue conservation in TM helices and

loops in GPCR Clusters 168

5.4(a-c) Illustration of characteristic motifs (observed at

60% conservation) 171

6.1 Flow-chart for the sequence analysis on


6.2(a-b) Phylogenetic display of selected human

olfactory receptor 180

6.3 Phylogeny of selected olfactory receptors in

Homo sapiens and fish genomes 189

6.4 Snapshot of Alignment window for the motif

KAFSTC in human ORs and in few fish

ORs at cross-genome alignment 190

6.5 Snapshot depicts the co-clustering of fish ORs

with class I type receptors of human ORs in

HSC1 (given in A), also exhibiting the coclusters

like HXC1, HXC2 and HXC3 to indicate the class

I and II type receptors from frog ORs with human ORs (given in B). 193

6.6 Snapshot depicts the co-clustering of fish ORs

with class I type receptors of human ORs in

HSC1 (given in A), also exhibiting the

coclusters like HXC1, HXC2 and HXC3 to indicate

the class I and II type receptors from frog ORs

with human ORs (given in B). 194



xxix


6.7 Phylogeny of Drosophila Olfactory receptors 201

6.8 Observed 10 subclusters of Drosophila olfactory receptors 203

6.9 Cross-genome phylogeny on selected ORs from

human, Drosophila and yeast 206

6.10 Observed cluster association in the cross-genome

phylogeny of selected ORs from human and

C. elegans genomes 208

6.11 Cross-genome phylogeny of selected olfactory

receptors (ORs) from human and mouse genomes 216

6.12 Phylogeny on selected human and mouse

olfactory receptors with special emphasis on mouse

class I type receptors 216

6.13 Cross genome phylogeny on selected human ORs

with ORs from non human primates and aves 222

6.14 Available main menu in the front page of DOR 225

6.15 A snapshot of the given option sequence and

its application in DOR 226

6.16 Display of predicted membrane boundaries in DOR 227

6.17 Display of Alignment option in DOR 228

6.18 Display of cross-genome OR phylogeny in DOR 229

6.19 Overview on pictorial representation of available

features in DOR for sequence analysis 230

6.20 Overview on DOR features for sequence

and structural information for olfactory receptors

in DOR 231

6.21 Display of 3D Structure and related features in DOR 233



xxx

LIST OF ABBREVIATIONS

AAS - Amino acid substitutions

BGA receptors - Biogenic amine receptors

BLAST - Basic Local Alignment Search Tool

BS - Bootstrap

CAR - Cell adhesion receptors

CC - Co-clusters

CMK - Chemokine receptors

FRZ/SMT - Frizzled/smoothened receptors

GLR - Class C (glutamate) receptors

GPCRs - G-protein coupled receptors

HC - Human GPCR clade

MAFFT - Multiple Alignment using Fast Fourier Transform

MEGA - Molecular Evolutionary Genetics Analysis

N&L - Nucleotide and lipid receptors

NC - Neighbor clades

NJ - Neighbor joining

NM - Neighbor members

ORs - Olfactory receptors

PR - Peptide receptors

RMSD - Root-mean-square deviation

RPS-BLAST - Reverse PSI-BLAST

SEC - Class B (secretin) receptors

SR - Serpentine receptors

SS - Species-specific members

SVM - Support vector machine

TM proteins - Trans-membrane proteins



1

CHAPTER 1

INTRODUCTION

The frequent and large-scale updating of sequence databases to build
repositories for various genomes, and the accurate prediction of structural
information for these sequences, are two critical steps in computational
genomics. The available knowledge and approaches for genomics (Lipman et al
2011) and structural genomics (Redfern et al 2008) are drastically
different, but they can be connected effectively for the purpose of
functional annotation (Alfarano et al 2005).

The huge accumulation of sequence information on the one hand and the
limited resources on structural details on the other is a critical
situation in bioinformatics. This imbalance makes it challenging to
identify the function(s) of gene(s) of interest quickly.

However, such large data repositories can be handled effectively only
through bioinformatics techniques such as the genome-wide survey, which is
more sophisticated than the traditional gene-by-gene approach and provides
clues to connect sequences from various genomes that share a common
function. Methods such as data clustering, principal component analysis,
artificial neural networks and support vector machines are useful for
gene/protein prediction, classification, association and annotation of
novel proteins, and further support the analysis of functional genomics
data.

The objective of this work is to apply effective bioinformatics
approaches such as genome-wide survey and cross-genome phylogenetic analysis



2

on certain GPCRs and ORs to propose representative sequences, cluster
associations, cluster-specific motifs, orthologs, species-specific behavior
and co-clusters at intra- and inter-genomic levels, and ultimately to
connect the functional properties of known to unknown genes/proteins
(Figure 1.1).

In principle, sequence comparison studies, along with reference to

structural similarities, provide clues to connect functional resemblance

(Redfern et al 2008) at cellular, biochemical and molecular levels

(Ye et al 2006).

This unidirectional hypothesis of associating sequences, predicting
structural details and relating biochemical functions to phenotypes forms
the baseline of computational biology. Sequence studies across genomes
provide an opportunity to identify groups of associated proteins based on
phylogeny, which can then be exploited for functional relevance. This
conceptual framework helps to compare sequences from various genomes and
provides clues to connect sequences of known function to those of unknown
function. This rationale for genome-wide surveys of gene/protein sequences
of interest provides a platform to integrate knowledge on the
sequence-structure-function paradigm for public access (Kerrien et al
2011). Thus, sequence studies act as a primary step connecting structural
and functional studies.

1.1 PRIOR ART ON GENOME-WIDE SURVEY

Genome-wide surveys of selected protein families (Tripathi and Sowdhamini
2008, Metpally and Sowdhamini 2005) illustrate the approach of accumulating
related proteins (associated gene clusters), identifying putative orthologs
and observing conserved motifs across genomes. Cross-genome sequence
analysis provides knowledge on sequence conservation across taxa, preserved



3

species-specific tendencies and evolutionary integrity at the cross-genome
level (Figure 1.1). In particular, cross-genome sequence studies with
selected model organisms have many practical applications. For instance, a
cross-genome phylogenetic analysis of selected GPCRs from the human and
Drosophila genomes (Metpally and Sowdhamini 2005), organized into eight
major groups of GPCRs, generated 32 cross-genome GPCR clusters. Such an
approach proved valuable for identifying the natural ligands of Drosophila
and human orphan receptors.


Figure 1.1 Central dogma of genome-wide survey on sequences

Note: Pictorial representation of the procedures involved in genome-wide sequence analysis. Label 1 refers to the selection of genomes of interest. Label 2 refers to the collection of non-redundant sequences from the selected genomes. Label 3 refers to the cross-genome alignment procedure. Label 4 refers to the cross-genome phylogeny of sequences. Label 5 refers to cross-genome cluster association and analysis for species-specificity, co-cluster arrangements, identification of orthologs and conserved motifs, and functional clues for hypothetical proteins in the phylogeny.
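Two of the steps in Figure 1.1 can be sketched in a few lines: the collection of non-redundant sequences (Label 2) and a simple pairwise percentage identity computed from already aligned sequences, of the kind used when comparing neighbouring members in a phylogeny (Label 5). This is a minimal sketch under stated assumptions: redundancy is defined here as exact sequence identity, identity is computed over positions where neither sequence has a gap, and the function names and records are hypothetical.

```python
def remove_redundant(records):
    """Keep the first record for each distinct sequence.

    records: list of (identifier, sequence) pairs.
    Redundancy is taken to mean an exact, case-insensitive match
    (an assumption; real pipelines often use an identity cut-off).
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over ungapped aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Invented records for illustration only.
demo_records = [("h1", "MKTA"), ("h2", "mkta"), ("d1", "MKSA")]
demo_unique = remove_redundant(demo_records)
```

The alignment and phylogeny steps themselves (Labels 3 and 4) rely on dedicated programs such as those listed in Sections 1.10 and 1.11 and are not reproduced here.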

Other case studies, such as the genome-wide survey identifying putative
serine/threonine protein kinases (STKs) in cyanobacteria (Zhang et al 2007),
practically useful insights into symbiotic nitrogen-fixing alpha-



4

proteobacterium like Sinorhizobium meliloti (Schluter et al 2010) based on

experimental data, phylogenetic classification on transporters and membrane

proteins from lower organisms (De Hertogh et al 2002) to higherorder

organisms (Chang et al 2004), phylogenetic analysis on olfactory receptor

subfamilies (class I and class II type) in fish (Freitag et al 1999), amphibians

(Freitag et al 1995), phylogenetic analysis in discriminating gustatory and

olfactory receptors in Drosophila (Robertson et al 2003), phylogenetic

grouping of serpentine receptor superfamilies in C. elegans (Robertson and

Thomas 2006), identifying olfactory receptor subfamilies in mouse

(Sullivan et al 1996) and human (Glusman et al 2001), influence of

phylogenetic analysis in ethno-medicinal studies (Saslis-Lagoudakis et al

2011) are highly commendable. These case studies illustrate the important

applications of genome-wide survey and usage of phylogeny in identifying

similar or related sequences for a protein of interest across genomes.

1.2 BREAKTHROUGHS IN GPCR CRYSTALLOGRAPHY STUDIES

Cell surface proteins, which account for about 30% of the proteins encoded by the human genome, are well known for their therapeutic importance and applications. Among the more than 82,160 structures available in the PDB, crystal structures are available for only a few membrane proteins. For structural

crystallization, membrane proteins embedded in the lipid bilayer have to be

extracted and need to form a protein-detergent complex (PDC) (Koszelak-

Rosenblum et al 2009). Also, the surrounding environmental lipids in cell

membranes interfere with both crystallography and nuclear magnetic

resonance (NMR) spectroscopy, while solving three-dimensional structures of

membrane proteins. As purification and crystallization are critical and demanding steps in membrane protein crystallography (Dilanian et al 2011), only a limited number of membrane protein structures have been reported so far.



Figure 1.2 Crystal structure of bovine rhodopsin (Li et al 2004)

a) Crystal structure of bovine rhodopsin displayed in ribbon representation (Li et al 2004). The observed seven TM-helices and one peripheral helix are colored in rainbow order: TM-helix 1 in dark blue (residues 34-64); TM-helix 2 in light blue (71-100); TM-helix 3 in blue-green (106-140); TM-helix 4 in yellow-green (150-173); TM-helix 5 in yellow (200-230); TM-helix 6 in orange (241-276); TM-helix 7 in red (286-309); TM-helix 8 in magenta (311-321).

b) Space-filling representation of rhodopsin, a photoreceptor protein.

Rhodopsin was the first GPCR crystal structure to be solved (Palczewski et al 2000) (Figure 1.2 a and b). The β1 adrenergic receptor (Warne et al 2008), β2 adrenergic receptor (Rasmussen et al 2007), adenosine receptor (Jaakola et al 2008), dopamine D3 receptor, CXCR4 chemokine receptor (Wu et al 2010), histamine receptor and, most recently, the lipid GPCR sphingosine 1-phosphate receptor (S1P1 receptor) are a few other important crystal structures.

These structural studies make it possible to compare reference structures with models of disease-implicated gene products and thereby to interpret their dysfunction.

Most of the solved structures are used as templates for molecular modelling.



1.3 GPCRS: POPULAR DRUG TARGETS

As GPCRs are involved in a wide variety of physiological processes, such as regulation of immune system activity and inflammation, cell density sensing, the senses of smell and vision, autonomic nervous system transmission, and behavioral and mood regulation, they are effectively targeted in medicinal chemistry. Several reviews highlight the clinical importance of GPCRs (Insel et al 2007), and a few examples illustrate the importance of GPCR biology in medicine. For instance, a number of monogenic mutations in rhodopsin cause retinitis pigmentosa, and GPCRs are also implicated in a number of endocrine disorders and in serious illnesses such as schizophrenia (Seeman 1987), Alzheimer's disease and Parkinson's disease (Lee et al 1978). Other reported examples include genetic disorders of the calcium-sensing receptor (CaSR), Graves' disease, cancer, diabetes, heart diseases, neurodegenerative diseases, asthma, autoimmune diseases and AIDS, which together emphasize the multi-functional role of GPCRs and its clinical implications.

The diversity of GPCRs and of their ligand-binding properties makes these receptors interesting targets for structure-based drug design (Schlyer and Horuk 2006) and even opens up scope for personalized medicine.

Notably, receptor subtypes such as the AT1 angiotensin, adrenergic, dopamine and serotonin (5-hydroxytryptamine, 5-HT) receptors are among the most exploited drug targets because of their clinical importance and associated diseases.



1.4 STRUCTURE AND CELLULAR ACTIVITIES OF MEMBRANE PROTEINS

Membrane proteins are embedded within the lipid bilayer and are

designated as transmembrane proteins, since they loop inside and outside of

the cell boundaries (Figure 1.2). One class of cell-surface receptors shares a characteristic architecture: an extracellular N-terminus, an intracellular C-terminus and seven transmembrane helices (TMHs) connected by three intracellular and three extracellular loops. Because this arrangement resembles a snake winding through the membrane, these proteins are called 7TM receptors, heptahelical receptors or serpentine-like receptors (Probst et al 1992). Since their downstream targets are guanine nucleotide-binding proteins, they are also referred to as guanine nucleotide-binding protein-coupled receptors or G-protein coupled receptors (GPCRs), and they are well known for their versatile functional importance.

GPCRs are ubiquitous, participate chiefly in signal transduction and recognize various types of ligands (Bockaert and Pin 1999).

Substantial evidence on GPCR oligomerization (Prinster et al 2005),

participation in signaling pathways (Greenwald 2005), clinical importance

(Kuwabara and N 2001) and availability of repositories for multiple

organisms (Fredriksson and Schioth 2005) provide significant impetus for the

study of GPCR sequences and their ligand-binding properties. Ligands can be endogenous compounds such as amines, peptides, Wnt proteins or cell surface adhesion molecules, as well as photons and exogenous compounds such as odorants.

1.5 MEMBRANE PROTEIN: TOPOLOGY

Several methods are available online to predict the topology of membrane proteins. These prediction methods are mainly based on



the hydrophobicity profile of the helices. Notably, canonical GPCRs, including the olfactory receptors of higher-order organisms, exhibit N-out and C-in topology (Figure 1.3). Interestingly, Drosophila ORs and GRs instead adopt an N-in and C-out topology (Bargmann 2006, Benton et al 2006, Lundin et al 2007), which is referred to as inverted or reverse topology.

Methods such as HMMTOP (Tusnady and Simon 2001), SOSUI (Hirokawa et al 1998), TMHMM (Krogh et al 2001), TMAP, MEMSAT, TMpred, TSEG, TM-finder, Pred-TMP, SPLIT, DAS, TopPred II, PRED-TMR2, MPEx, Phobius and TOPCONS are popularly used to predict the transmembrane secondary structure and topology of membrane proteins. Methods are also available to discriminate signal peptides (Lao et al 2002) in proteins.
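As an illustration of the hydrophobicity-based principle behind these predictors, the following minimal Python sketch computes a sliding-window hydropathy profile with the Kyte-Doolittle scale and merges windows whose mean exceeds a threshold. It is not the algorithm used by HMMTOP, TMHMM or any other server named above; the window size (19), the threshold (1.6) and the function names are assumptions chosen only for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping windows above the threshold.

    Returns 1-based, inclusive (start, end) residue ranges. The ranges are
    coarse: they cover every residue of every qualifying window, so they
    tend to overshoot the true helix boundaries.
    """
    segments = []
    for i, value in enumerate(hydropathy_profile(seq, window)):
        if value < threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], end)  # extend previous segment
        else:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    # Hypothetical toy sequence: a 20-residue leucine stretch between
    # charged flanks (not a real receptor).
    toy = "D" * 10 + "L" * 20 + "K" * 10
    print(candidate_tm_segments(toy))
```

A sequence with seven well-separated candidate segments would be consistent with a 7TM architecture, but deciding the N-in or N-out orientation requires additional information that this sketch does not model.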

Figure 1.3 Membrane topology of olfactory receptor (odr-10) in

C. elegans

The seven transmembrane helices predicted by HMMTOP for odr-10 are shown in a TOPO2 display: residues 12-31 for TM1, 44-63 for TM2, 94-113 for TM3, 126-145 for TM4, 202-225 for TM5, 256-275 for TM6 and 286-305 for TM7. The conserved YRY motif in TM3, ICL2 and the Str superfamily-specific QLF motif in ICL3 are highlighted in red.



1.6 GPCR MECHANISM

Membrane proteins are actively involved in signal transduction (Figure 1.4), in which GPCRs are activated by various external stimuli (Rodbell et al 1971). Under the influence of these stimuli, receptors undergo a conformational change (a minimal rearrangement occurs in the TM6 and TM3 helices, although the details remain unclear), which causes the activation of a guanine nucleotide-binding protein (G-protein). GPCRs recognize intercellular messenger molecules (such as hormones, neurotransmitters, lipids, biogenic amines, and growth and developmental factors) and several sensory messages (such as light, odors and gustative molecules). The outcome also depends primarily on the type of G-protein. For instance, the Golf subunit mainly senses chemosensory signals and participates in olfactory signaling pathways (Figure 1.4). The Gs type of G-protein regulates the enzyme adenylate cyclase (AC). AC activity is triggered when it binds to a subunit of the activated G-protein, which subsequently triggers the cAMP pathway for further transduction, resulting in various biological responses. Activation of AC stops when the G-protein returns to the GDP-bound state (Figure 1.4). GPCRs also signal through other effectors such as ion channels, adenylyl cyclases and phospholipases.



Figure 1.4 GPCR signaling pathway

The image represents GPCR signal transduction, depicting the entry of ligands/stimuli, activation of the G-protein subunit, subsequent activation of cAMP and the event of internalization for biological responses. (Image adopted from DB-DRD4, a database of dopamine D4 receptor (home page). Source: TRENDS in Pharmacological Sciences. URL: http://www.ibibiobase.com/projects/db-drd4/G_protein.htm)

1.7 GPCR CLASSIFICATION

GPCRs comprise the most prolific family of cell membrane proteins. Knowledge of GPCR classification is necessary since these receptors are involved in various signaling pathways, recognize a diverse set of ligands and are related to various biological functions. The candidate GPCRs with

characteristic seven TM-helices were classified with the aid of several

prediction methods and classifiers. Though all the candidate GPCRs from

various families retain seven TM-helices and are connected by ICLs and

ECLs, sequence differences occur and exhibit subtle structural diversity

(Gether 2000). The GPCR superfamily is classified mainly into class A (rhodopsin-like), class B (secretin-like), class C (metabotropic glutamate), class D (fungal pheromone), class E (cAMP receptors) and class F (frizzled/smoothened) (Kristiansen 2004). In particular, class A is the largest, accounting for about 80% of GPCRs, and contains diverse receptors such as rhodopsin, olfactory, biogenic amine, bioactive lipid, nucleic acid, and



peptide receptors. Receptors for secretin, calcitonin, glucagon, parathyroid hormone, vasoactive intestinal peptide and so on belong to class B. Class C includes metabotropic glutamate receptors (mGluRs), the Ca2+-sensing receptor, γ-aminobutyric acid type B receptors (GABA-B) and vomeronasal receptors type 2. Class D retains receptors such as fungal pheromone P- and α-factor receptors (STE2/MAM2), whereas fungal pheromone A- and M-factor receptors (STE3/MAP3) are related to class E.

Class F retains slime mold cyclic adenosine monophosphate (cAMP)

receptors. Recently, few other GPCR families, such as frizzled type

receptors/FRZ (Vinson and Adler 1987, Bhanot et al 1996), smoothened type

receptors/SMT (Alcedo et al 1996 and Nehme et al 2010), vomeronasal

receptors type 1 /VNS (Dulac and Axel, 1995), ocular albinism (Schiaffino et

al 1996, Schiaffino et al 1999), and plant receptors (Grill and Christmann 2007), i.e., the Arabidopsis thaliana receptor GCR1 (Josefsson and Rask 1997, Perfus-Barbeoch et al 2004), have also been added to the existing GPCR

families. It has been observed that Class A, B and C cover nearly 600 GPCRs

in the human genome, excluding putative candidate GPCRs. Notably, olfactory receptors (ORs) are members of class A and are dealt with exclusively in Chapter 6, under the title of genome-wide survey on olfactory receptors in selected eukaryotes.

1.7.1 Olfactory Receptors (ORs)

The sense of smell, or olfaction, is beyond simple scientific understanding. In general, chemical senses are broadly divided into olfaction (the sense of smell) and gustation (the sense of taste). Understanding and analyzing olfaction is important not only from a biological or chemical perspective, but also because it is a powerful socio-cultural phenomenon (Low 2005).

Olfactory receptors participate in sensing diverse chemical stimuli

or odors (Firestein 2001). ORs are fascinating for their functional significance



in detecting food, assessing its quality, enhancing its flavor, indicating the presence of potential toxins and pathogens, and conveying information about reproductive status, gender, genetic identity, conspecifics, mates and threats. ORs activate chemosensory cells, leading to neural recognition, and influence behaviour, hormonal state and mood (Munger et al 2009). Because of these diverse roles, ORs are part of everyday life and need to be explored in more detail for their practical applications in the pharmaceutical industry (aroma therapy), the cosmetic industry (scent/perfume manufacturing) and the food industry, and for the study of olfacto-sexual function, olfacto-neural communication, olfactory disorders and so on. Thus, performing a genome-wide survey of ORs in selected eukaryotic organisms will add to scientific knowledge and ultimately serve human benefit.

1.7.2 Classical Knowledge on Olfactory Receptors

The landmark paper published in 1991 by the Nobel Laureates Buck and Axel explained the role of olfactory receptors and the organization of the olfactory system in humans (Buck and Axel 1991). Around three percent of our genes code for different odorant receptors on the membrane of the olfactory receptor cells. Further research

studies on phylogenetic approach in discriminating class I and class II type

receptors to sense the water- and air-borne odors in higher eukaryotes i.e.,

human and mouse (Zozulya et al 2001; Niimura and Nei 2005), studies

related to insect olfaction (Robertson et al 2003), nematode olfaction

(Robertson and Thomas 2006), olfactory signaling , availability of ORs in

various genomes, and observed common peptides in OR subfamilies

(Gottlieb et al 2009) provide valuable background and facilitate the genome-wide survey of ORs in selected eukaryotic genomes, which aims to identify OR subclusters, cluster-specific motifs, species-specific tendencies and co-clusters in tree topology (see Chapter 6 for more details).



1.7.3. Olfactory Signaling Pathway in Human ORs

The process of olfaction starts with the binding of an odorant to a specific receptor on a sensory neuron, where chemical energy is transformed into electrical signals. Such binding activates Golf, a G protein. The alpha subunit of Golf activates the enzyme adenylyl cyclase, generating the major second messenger 3',5'-cyclic adenosine monophosphate (cAMP), which directly opens the cyclic nucleotide-gated channel. This allows Na+ and Ca2+ to flow in and depolarize the cell. Depolarization of these cells causes action potentials (nerve impulses), which are sent to the olfactory bulb. A second pathway involving guanylyl cyclase GC-D has also been proposed (Meyer et al 2000). The human nose expresses different types of receptors, which enable the main olfactory system to encode thousands of odorants using a common pathway (Figure 1.5 a and b).

(a) (b)

Figure 1.5 ORs and organization of the olfactory system in mammals and OR signaling pathway (Meyer et al 2000)

a) Pictorial representation of ORs and the organization of the olfactory system in mammals.

b) OR signaling pathway, depicting the two proposed hypotheses of OR signal transduction (Meyer et al 2000). The upper panel shows odors being recognized by ORs and initiating the cAMP signaling pathway, which involves a G protein (Golf), an adenylyl cyclase (ACIII), a cyclic nucleotide-gated (CNG) channel (341b) and a chloride channel (ClC). After the response, cAMP is degraded by a CaM-dependent phosphodiesterase (PDE1C2). The lower panel shows the components of the cGMP signaling pathway and the putative targets of cGMP, which involve the receptor guanylyl cyclase GC-D, cGMP-regulated PDE2, an unknown cGMP-regulated ion channel and the known CNG channel of the cAMP signaling pathway.



1.7.4. ORs, GRs and IRs in Drosophila

As we know, olfactory neurons play a central role in sensing

volatile cues that afford the organism the ability to detect food, predators and

mates. Gustatory neurons, in contrast, sense soluble chemical cues that elicit feeding

behaviours. In insects, the taste neurons initiate innate sexual and

reproductive responses.

It is believed that nearly 60 olfactory receptors (Berkeley Drosophila Genome Project database) play a major role in identifying and discriminating the diverse odors relevant to insect survival, and the Drosophila olfactory receptor (DOR) gene family has been identified as a family of G-protein coupled receptors (Clyne et al 1997, Gao and Chess 1999, Vosshall and Stocker 2007). These proteins are expressed in distinct subsets of olfactory neurons, and certain family members are restricted to distinct portions of the olfactory system. Nearly the same number of gustatory receptors (GRs) serve gustatory functions (Clyne et al 1997).

Notably, insect GRs have the same transmembrane topology as ORs. Ionotropic Glutamate Receptors (IRs) in Drosophila are referred to as a new family of odorant receptors; these proteins accumulate in sensory dendrites and are not present at synapses. Whereas conventional ionotropic glutamate receptors mediate chemical communication between neurons at synapses, IRs are expressed in a combinatorial fashion in sensory neurons that respond to many distinct odors but do not express either insect odorant receptors (ORs) or gustatory receptors (GRs).

1.7.5. Insect olfaction (Drosophila ORs)

Several fundamental studies (Siddiqi, 1990; Clyne et al 1999) have investigated the molecular mechanism of Drosophila olfaction. Electrophysiological studies explained the differentiation in the



morphology of the olfactory sensilla and their distribution patterns

(Venkatesh and Singh 1984, Stocker 1994). Studies suggest that there are 30

different classes of ORNs in the antenna (in adult ~40), based upon the odor

response profile of individual neurons and few exhibit odor specificity.

Notably, 24 antennal receptors such as Or2a, Or47b, Or33b, Or49b, Or65a,

Or23a, Or85f, Or88a, Or67c, Or43a, Or7a, Or43b, Or59b, Or9a, Or85a,

Or47a, Or22a, Or19a, Or67a, Or35a, Or98a, Or85b, Or82a and Or10a were tested experimentally with 110 odorant molecules using the empty neuron system (Dobritsa et al 2003), and the responses of the receptors varied across chemical classes.

Generally, a functional insect OR consists of a variable insect OR paired with a constant receptor called OR83b; the two form a heteromeric complex that participates in the signaling pathway. OR83b is also called a co-receptor (Vosshall and Stocker 2007) for its functional importance. It has also been reported (Larsson et al 2004) that heteromeric insect ORs comprise a new class of ligand-activated non-selective cation channels (Sato et al 2008).

Notably, insect ORs lack homology to G-protein coupled

chemosensory receptors of vertebrates and exhibit drastically differing

mechanisms in olfaction. Recent studies explained insect ORs as heteromeric

ligand-gated ion channels (More details in Chapter 6).

1.7.6. Nematode Olfaction

Chemosensory receptors in nematodes are highly diverse and large

in number. Since worms lack both auditory and visual senses, chemosensation plays a central role in nematode survival. In C. elegans, chemosensory receptors belong to the G-protein coupled receptors and retain seven transmembrane helices. Around 1330 genes and 400 pseudogenes have been



identified as chemoreceptors (Robertson and Thomas 2006) in C. elegans.

Many of these receptors are also known as serpentine receptors, and around 19 large gene families have been reported so far. Among this large number of proteins, only one, namely odr-10 (Figure 1.3), has been reported as an olfactory receptor in C. elegans (Sengupta et al 1996).

1.7.7. Mouse Olfaction

As in human, mouse ORs fall into two broad classes with excellent bootstrap support (Glusman et al 2001). Class I ORs in mouse resemble those found in fish and frog and had been considered an evolutionary relic in mammals (Ngai et al 1993), whereas class II receptors are found in amphibians and terrestrial vertebrates (Freitag et al 1995). There are 147 class I OR genes in the mouse OR subgenome, of which 120 are potentially functional. In mouse, all class I ORs are located in a single large cluster on chromosome 7.

1.8 DATA REPOSITORIES FOR MEMBRANE PROTEINS

A large number of data repositories and topology prediction servers are available exclusively for membrane proteins. Notable GPCR-related repositories (Elefsinioti et al 2004) include gpDB (Theodoropoulou et al 2008), GPCRDB and integrated web resources such as the G Protein Coupled Receptor - Oligomerization Knowledge Base Project and the GPCR Natural Variants database (NaVa). The SEVENS database (Ono et al

2005) provides useful sequence information, chromosomal location and intra-

genomic phylogenetic clusters for membrane proteins from more than 50

eukaryotic organisms. IUPHAR (Committee on Receptor Nomenclature and

Drug classification) incorporates detailed pharmacological, functional and



patho-physiological information on GPCRs, voltage-gated ion channels,

ligand-gated ion channels and nuclear hormone receptors.

Other related structural resources, such as PDBTM and TOPDB (Tusnady et al 2008), provide collections of domains and sequence

motifs. TMpad (Trans Membrane Protein Helix-Packing Database) and

MPDB (Membrane Protein Data Bank) are useful to provide structural

information on integral, peripheral and anchored membrane proteins and also

peptides (Raman et al 2006).

Data repositories for olfactory receptors are also available for

public access. ORDB (Skoufos et al 2000), HORDE (The Human Olfactory

Data Explorer) and integrated web resources from SenseLab for ORs, with associated links such as odorDB and odorMapDB, are highly useful and particularly relevant for retrieving OR sequences from multiple genomes.

1.9 COLLECTION OF GPCR HOMOLOGUES

Sequence similarity searches are robust techniques for identifying the nearest homologues of a query sequence in a database of interest. Pairwise comparison of proteins is a fundamental step in sequence similarity searches. The similarity scores depend on sequence features such as the amino acids themselves and the permitted amino acid substitutions (AAS). Generally, when a query and a subject align with a high similarity score, the two sequences can be inferred to be homologues; in other words, two proteins with significantly similar sequences are likely to share a common ancestor. Homologues are further classified into orthologs and paralogs. Orthologs evolved from a common ancestral gene through speciation and therefore belong to different genomes, whereas paralogs arose by gene duplication and belong to the same genome. Thus, homologues share



significant sequence similarity and can be further connected for their

functional relevance. A necessity arises to select an appropriate technique for

similarity search when we deal with evolutionarily distant sequences and

particularly membrane proteins. Each method is unique for its scoring scheme

with respect to amino acid substitutions and the gap penalties.

Functionally and evolutionarily important protein similarities can

be recognized by comparing three-dimensional structures, but when structures

are not available, patterns of conservation such as motifs, profiles, position-

specific scoring matrices, and Hidden Markov Models can be used to identify

related sequences from the database of protein sequences.

Several methods, such as BLAST (Altschul et al 1997) and FASTA (Lipman and Pearson 1985) for sequence-based searches, IMPALA (Schaffer et al 1999) for profile-based searches, and other approaches such as PSI-BLAST and RPS-BLAST, are effectively used to find homologues and, further, to identify common functional relevance.

1.9.1 BLAST (Basic Local Alignment Search Tool)

Sequence comparisons between two sequences are achieved by

producing quality alignments which maximize the correspondence between

similar residues and minimize gaps (Altschul et al 1997). The objective here

is to align or match a sequence of unknown function with

characterized/annotated proteins from model organisms, so that the structure

and function can be extrapolated to the new sequence. Generally, dynamic programming yields optimal alignments, either local or global, whereas BLAST and FASTA (Lipman and Pearson 1985) are fast, robust heuristic methods. Conceptually, the heuristic approach (BLAST) can deal with sequences that differ considerably in length and identifies islands of


short matches. It approximates the Smith-Waterman algorithm (Smith and Waterman 1981), which is guaranteed to find the optimal local alignment with respect to the scoring system, and reports maximal scoring segment pairs (MSPs). The scoring system mainly comprises the substitution matrix and the gap-scoring scheme used to align the sequences based on possible similarities. BLAST, a robust sequence comparison tool, offers five main search programs, namely blastp, blastn, blastx, tblastn and tblastx, for varying inputs such as nucleotide and protein sequences.
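To make the notion of an optimal local alignment concrete, the following minimal Python sketch implements the Smith-Waterman dynamic programming recurrence with a linear gap penalty. It is an illustration only: BLAST itself uses word seeding and extension heuristics rather than this full matrix, and the simple match/mismatch/gap values used here (2, -1, -2) are assumptions standing in for a substitution matrix such as PAM or BLOSUM.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Hypothetical toy sequences sharing one short island of identity.
    print(smith_waterman("AAAGPCRAAA", "TTTGPCRTTT"))
```

The full matrix costs time and memory proportional to the product of the two sequence lengths, which is the practical reason heuristic tools are preferred for database-scale searches.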

BLAST reports statistically significant alignments in its output, and features such as raw scores, bit scores and E-values are used to quantify alignment significance. Among them, E-values are most often used. Generally, the lower the E-value, the more significant the alignment. An E-value refers to the number of alignments one expects to find with a score greater than or equal to the observed alignment score in a search against a random database. The PAM (point accepted mutations per 100 residues) amino acid scoring matrices, which are based on an explicit evolutionary model (Dayhoff et al 1978), are provided in the BLAST software distribution and include PAM40, PAM120 and PAM250, whereas the BLOSUM matrices are based on an implicit model of evolution and include BLOSUM 45, 62 and 85 (Henikoff and Henikoff 1992). Generally, these matrices are very appropriate for globular proteins, whereas PAM and JTT-200 (Jones et al 1992) can be used for membrane proteins.
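The relationship between raw score, bit score and E-value can be sketched with the Karlin-Altschul statistics, in which E = K * m * n * exp(-lambda * S) for a raw score S, query length m and database length n. The Python sketch below is a simplified illustration: it ignores the edge-effect (effective length) corrections that BLAST applies, and the values lambda = 0.318 and K = 0.134 are illustrative parameters assumed here, since the real values depend on the scoring matrix and gap penalties in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    lam, k = 0.318, 0.134           # assumed, illustrative parameters
    m, n = 350, 5_000_000           # hypothetical query and database lengths
    for raw in (40, 80, 120):
        print(raw, round(bit_score(raw, lam, k), 1), e_value(raw, m, n, lam, k))
```

Because the E-value scales with the search space m * n, the same raw score is less significant against a larger database, which is why E-value thresholds should be reconsidered when the database changes.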

1.9.2 PSI-BLAST (Profile Vs Sequence comparison method)

Among the five BLAST programs, the work described in this thesis

mostly relies on the basic protein BLAST technique, which includes blastp

(protein-protein BLAST), PSI-BLAST (Position Specific Iterated BLAST),

PHI-BLAST (Pattern Hit Initiated BLAST) and DELTA-BLAST (Domain



Enhanced Lookup Time Accelerated BLAST). As the name suggests, blastp compares a protein query with a protein database. PSI-BLAST allows the user to build a PSSM (position-specific scoring matrix) from the results of the first blastp run and then iteratively uses this profile as the query against the database of protein sequences (Altschul et al 1997). The profile is rebuilt at each iteration and searched against the database until convergence (that is, until no new sequences are found). This method is therefore effective in detecting distantly related sequences with remote homology. The approach can be extended further, as per user requirement, through jump-start PSI-BLAST (Altschul et al 1997), the jack-knife approach, HOE (homologous over-extension) reduced profile search (Gonzalez and Pearson, 2010) and improved PSI-BLAST search techniques such as cascade PSI-BLAST (Bhadra et al 2006).

1.9.3 Reverse PSI-BLAST (Sequence Vs Profile comparison method)

To associate remotely related sequences, reverse PSI-BLAST

technique (RPS-BLAST) is highly effective. This method differs from other sequence searches in that the query sequences are searched against a database of PSSM (Position Specific Scoring Matrix) profiles. PSSMs give

the amino acid propensities at each sequence position based on the multiple

alignments. PSSM generation also uses the multiple alignment sequence weights, the expected number of amino acids and the frequencies of unobserved amino acids (pseudocounts). Representative sequences from protein families (for example, 3PFDB; Shameer et al 2009), related domains and cluster types can be used to generate profiles that represent sequence properties as a block of consensus amino acids. Hence, the sequence search space is broadened and sequences can be connected at the level of remote homology (Figure 1.6).
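The idea of a PSSM and of scoring a query against a profile can be sketched as follows. This minimal Python example builds log-odds column scores from an ungapped alignment using a flat pseudocount and then slides the profile along a query sequence. It is a simplification and not the RPS-BLAST implementation: sequence weighting is omitted, the background frequencies are assumed uniform (1/20), and the toy alignment and function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one dict of log2-odds scores per alignment column."""
    if background is None:
        # Assumption: uniform background instead of observed frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    pssm = []
    for col in range(length):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm


def score_window(pssm, segment):
    """Sum of column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Best-scoring ungapped placement of the profile: (score, 0-based start)."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    toy_alignment = ["DRY", "ERY", "DRY", "DRF"]  # hypothetical motif block
    profile = build_pssm(toy_alignment)
    print(best_hit(profile, "AAADRYAAA"))
```

Pseudocounts keep unobserved residues from receiving a score of minus infinity, so a query that differs from every profile member at one position can still match the profile overall.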



The alternative approach, which compares a protein sequence against a database of protein sequences, has some limitations. If stringent criteria are applied in such sequence-versus-sequence searches, there is a chance of missing very distantly related sequences. RPS-BLAST, in contrast, helps to associate even distantly related sequences with their related profiles. Its practical applications include generating cross-genome phylogenies, finding new members, associating evolutionarily distant sequences, classification and assigning functional annotation to new sequences based on known data. To be effective, the method requires care in designing profiles, setting significant E-value thresholds and interpreting the sequence search results for related profiles.

Separately, Hidden Markov Models (HMMs) can also be used for pattern recognition, as they provide a mathematical representation of a protein sequence (Eddy 1998, Karplus et al 1998). HMMs have been used for gene prediction, recognition of transmembrane helices (Sonnhammer et al 1998), phylogenetic analysis (Felsenstein and Churchill, 1996) and distant homology detection (Krogh et al 1994b). Machine learning approaches are appropriate techniques for pattern recognition problems and for recognizing remote homology. Methods such as support vector machines (SVMs) (Pugalenthi et al 2010) are effectively used in classification problems, where a model trained on a dataset with known features (positive and negative sets) is used to classify unknown gene/protein sequences and to propose putative members; the predictions rely upon the training dataset.



Figure 1.6 Overview of the techniques involved in genome-wide survey

The diagram depicts the use of available data repositories related to membrane proteins (GPCRDB, SEVENS DB, ORDB, HORDE and so on), followed by the collection of sequences, prediction of membrane topology and use of a redundancy filter as the primary steps for cross-genome studies. The methodology starts with sequence search programs (such as BLAST, PHI-BLAST, PSI-BLAST, RPS-BLAST) to collect homologous sequences and to perform cross-genome analysis.

1.10 MULTIPLE SEQUENCE ALIGNMENT TECHNIQUES

Alignment procedures play a crucial role (Figure 1.1 and Figure 1.6) in analyzing the relationships among diverse sequences. Two or more sequences are arranged by aligning them at common properties or sites. Weights can be assigned to the aligned elements to determine the degree of relatedness or to detect homology between the sequences. A pairwise alignment involves two sequences, whereas a multiple sequence alignment (MSA) involves many; both facilitate sequence comparison studies, and sequences can be aligned by various methods. MSA can be regarded as a generalization of pairwise sequence alignment: instead of aligning two sequences, n sequences are aligned simultaneously, where


n is always >2, hence the name multiple sequence alignment. The alignment of

multiple sequences is made possible by introducing gaps ('-') into the sequences.

Membrane proteins differ considerably from globular proteins in

sequence composition. The regions that insert into the cell membrane

possess different hydrophobicity patterns when compared to soluble

proteins. Multiple sequence alignment techniques designed for globular

proteins are therefore not optimal for aligning transmembrane proteins, and

the alignment procedures recommended for them (Pirovano 2008) should be

employed with care. When sequences from different genomes are aligned

together, the alignment is referred to as a cross-genome sequence

alignment and the resulting phylogeny as a cross-genome

phylogeny (Figure 1.6).

1.10.1 CLUSTAL W

CLUSTAL W (Thompson et al 1994) is a popular MSA tool. Its progressive

MSA technique consists of three main stages: 1) all pairs of sequences are

aligned separately in order to calculate a distance matrix giving the

divergence of each pair of sequences; 2) a guide tree is generated from the

distance matrix; 3) the sequences are progressively aligned

according to the branching order in the guide tree.
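
The first two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is a simplified K-tuple measure (fraction of distinct K-tuples not shared), the guide tree is built by UPGMA-style average linkage for brevity (CLUSTAL W itself uses neighbor-joining for the guide tree), and the function names ktuple_distance and guide_tree_merges are hypothetical.

```python
from itertools import combinations


def ktuple_distance(a, b, k=2):
    """Crude pairwise distance: 1 - (shared distinct k-tuples / smaller k-tuple set)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        raise ValueError("sequences must be at least k residues long")
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def guide_tree_merges(seqs, k=2):
    """Stage 1: distance matrix. Stage 2: average-linkage (UPGMA-style) merging.

    seqs maps a sequence name to its residues. Returns the merges in the
    order in which a progressive aligner would process them (stage 3).
    """
    names = sorted(seqs)
    dist = {frozenset(p): ktuple_distance(seqs[p[0]], seqs[p[1]], k)
            for p in combinations(names, 2)}

    def average(c1, c2):
        # Mean distance over all leaf pairs drawn from the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    toy = {"A": "MKTAYIAKQR", "B": "MKTAYIAKQK", "C": "GGSLLPWWEN"}
    for left, right in guide_tree_merges(toy):
        print(left, "+", right)
```

The most similar pair is merged first, so closely related sequences are aligned before divergent ones are added, which is the rationale for following the guide tree.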

Initially, the CLUSTAL W program applies a fast approximate (heuristic)

method based on the number of K-tuple matches (the K-tuple being the size of

the exactly matching fragment that is used) to generate pairwise distances

(Wilbur and Lipman, 1983). Subsequently, a dynamic programming algorithm is

used to enhance accuracy, with scores that include gap opening penalties

(GOP) and gap extension penalties (GEP). The method improves the quality of

alignment by implementing amino acid weight matrices such as the BLOSUM

series (80, 62, 45, 30), the PAM series (20, 60, 120, 350) and the GONNET



matrix (which can be used for larger datasets) with the series 80, 120, 160, 250 and 350.

Though CLUSTAL W is handy for aligning large numbers of sequences with

reliable accuracy, a few alignment tools are specifically recommended for

transmembrane proteins; these are conceptually different in that they align TM helices

and loops using different matrices (for example PRALINETM and MAFFT).
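
The role of the gap opening and gap extension penalties mentioned above can be made concrete with a small scoring function. This is a sketch under assumptions, not CLUSTAL W's implementation: a toy match/mismatch score stands in for a BLOSUM, PAM or GONNET matrix, a gap run of length L is charged GOP + GEP x L (conventions differ between programs), and the penalty values and the name score_alignment are illustrative only.

```python
def score_alignment(a, b, match=1.0, mismatch=-1.0, gop=10.0, gep=0.5):
    """Score two already aligned sequences of equal length ('-' marks a gap).

    The first column of a gap run costs gop + gep; every further column of
    the same run (same sequence gapped) costs gep only.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0.0
    gap_in = None  # which sequence currently carries an open gap: "a", "b" or None
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            total -= gep if side == gap_in else gop + gep
            gap_in = side
        else:
            total += match if x == y else mismatch
            gap_in = None
    return total


if __name__ == "__main__":
    print(score_alignment("AC--GT", "ACTTGT"))  # one gap of length 2
    print(score_alignment("A-C-G", "ATCTG"))    # two gaps of length 1
```

Because opening a gap is much more expensive than extending one, a single long gap is penalized less than several short ones, which favours alignments with few, contiguous insertions or deletions.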

1.10.2 PRALINETM

Servers for aligning TM proteins (like PRALINETM) are more specific, in

that the transmembrane regions are first predicted (Pirovano 2008).

Reliable topology prediction methods define the boundaries of the TM

domains and loops as an initial requirement. PRALINETM refers to HMMTOP

v2.1 (Tusnady and Simon, 2001), TMHMM v2.0 (Krogh et al 2001) and

Phobius (Käll et al 2007) for membrane topology predictions. Then, the

profile scoring scheme applies TM-specific substitution scores from matrices

like PHAT to reliably compare TM positions. Finally, an iterative scheme is

applied to enhance the alignment quality. A previous study suggests that the

PHAT matrix outperforms the JTT matrix (Jones et al 1992), especially in

database searching (Ng et al 2000). An earlier method, STAM (Shafrir and

Guy, 2004), is also useful and was the first multiple sequence

alignment program targeted at transmembrane proteins.

1.10.3 MAFFT

MAFFT (Multiple Alignment using Fast Fourier Transform) can be

used for aligning large datasets of transmembrane proteins. The method is

more advanced than many other alignment programs in increasing the accuracy of

alignments, even for sequences having large insertions or extensions as well as

for distantly related sequences of similar length. The MAFFT alignment program

(Katoh et al 2002) offers two different heuristics, namely the

progressive method (FFT-NS-2) and the iterative refinement method



(FFT-NS-i). Another important feature of the program is that the number of

input sequences can be very large and that it offers a range of multiple alignment

methods such as L-INS-i (accurate; for alignment of



by assigning probabilities to every possible evolutionary change at

informative sites, and by maximizing the total probability of the tree, search

for the optimal choice can be reached. The NJ method eliminates possible

errors that can occur when the UPGMA method is used. The NJ algorithm

not only evaluates pairwise distances (using distance matrices), but also selects

neighbors that minimize the total length of the tree. The NJ method is

recommended for sequences whose evolutionary distances are short.

Multiple packages are available for both standalone and on-line access.

Suites like PHYLIP, TREE-PUZZLE and MEGA are user-friendly and are

appropriate tools for performing phylogenetic analysis with both the ML and NJ methods.
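
How NJ selects neighbors, rather than simply joining the closest pair as UPGMA does, can be illustrated with its standard selection criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below computes only this first selection step on a hypothetical five-taxon distance matrix; the function name nj_pick_neighbors and the distances are assumptions made for illustration.

```python
from itertools import combinations


def nj_pick_neighbors(names, dist):
    """Return the pair of taxa with the smallest Q value, and that Q value.

    dist maps frozenset({i, j}) to the pairwise distance d(i, j).
    """
    n = len(names)
    # Sum of distances from each taxon to all others.
    row_sum = {i: sum(dist[frozenset((i, k))] for k in names if k != i)
               for i in names}
    q = {(i, j): (n - 2) * dist[frozenset((i, j))] - row_sum[i] - row_sum[j]
         for i, j in combinations(names, 2)}
    best = min(q, key=q.get)
    return best, q[best]


# Hypothetical distances between five taxa (illustrative values only).
TOY_NAMES = ["a", "b", "c", "d", "e"]
TOY_DIST = {
    frozenset(pair): d for pair, d in {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }.items()
}

if __name__ == "__main__":
    print(nj_pick_neighbors(TOY_NAMES, TOY_DIST))
```

In this toy matrix the smallest raw distance is between d and e, yet the Q criterion selects a and b, because it also accounts for how far each taxon is from all the others. This correction for unequal rates is what distinguishes NJ from UPGMA.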

1.11.1 PHYLIP

PHYLIP (Phylogeny Inference Package) (Felsenstein, 1981) is a

free computational phylogenetics package consisting of 35 portable programs.

It supports parsimony, distance matrix and likelihood methods,

including bootstrapping and consensus trees.

1.11.2 TREE-PUZZLE

TREE-PUZZLE is a popular computer program for reconstructing phylogenetic

trees from molecular sequence data, such as nucleotide or protein sequences,

based on the maximum likelihood (ML) method (Schmidt et al 2002). It

implements the quartet puzzling algorithm. The average distance between all

pairs of sequences (maximum likelihood distances) is also computed; these

distances can be viewed as a rough measure of the overall sequence

divergence. Tree reconstruction is performed in three steps. In the ML step,

all quartets that can be formed from the n sequences in the alignment are

evaluated using the ML method, and the three possible quartet topologies,

ab|cd, ac|bd and ad|bc, are weighted by their posterior probabilities. In the

puzzling step, intermediate trees are built from the quartet trees by adding

sequences one by one. As this step



is highly dependent on the order of sequences, many intermediate trees are

constructed from different input orders. In the consensus step, a majority

rule consensus tree is built from the generated intermediate trees.

These two steps are time-consuming, and the result files (.dist, .puzzle and

.outtree) are useful for interpreting tree topologies. Evolutionary models

such as the DAYHOFF, JTT and mtREV24 (Adachi and Hasegawa, 1996; for

use with proteins encoded on mtDNA) matrices are provided. Others, like

BLOSUM 62 and the WAG model (Whelan and Goldman, 2004), are for more

distantly related amino acid sequences. VT is likewise for use with proteins

of distant relationships (Muller and Vingron 2000).
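
The ML step of quartet puzzling described above can be sketched as follows, assuming equal prior probability for the three topologies of a quartet, so that each posterior weight is the likelihood of a topology divided by the sum over all three. The function names are hypothetical and the log-likelihood values would in practice come from an ML evaluation under a chosen substitution model, which is not shown here.

```python
import math
from itertools import combinations


def all_quartets(taxa):
    """Every group of four sequences; there are C(n, 4) of them for n taxa."""
    return list(combinations(taxa, 4))


def quartet_topologies(a, b, c, d):
    """The three possible unrooted topologies: ab|cd, ac|bd and ad|bc."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def posterior_weights(log_likelihoods):
    """Posterior probabilities of the three topologies under a uniform prior.

    The maximum is subtracted before exponentiating to avoid underflow.
    """
    top = max(log_likelihoods)
    rel = [math.exp(value - top) for value in log_likelihoods]
    total = sum(rel)
    return [r / total for r in rel]


if __name__ == "__main__":
    taxa = ["s%d" % i for i in range(1, 11)]
    print(len(all_quartets(taxa)), "quartets for", len(taxa), "sequences")
    print(quartet_topologies("a", "b", "c", "d"))
```

The number of quartets grows with the fourth power of the number of sequences, which is one reason the quartet evaluation is computationally demanding for large alignments.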

1.11.3 MEGA (Molecular Evolutionary Genetics Analysis)

MEGA is a user-friendly software package for phylogenetic studies, which

also integrates sequence alignment approaches like CLUSTAL W and

MUSCLE. MEGA 5 can be employed for phylogenetic reconstruction and

phylogeny visualization, and for testing an array of evolutionary hypotheses using

maximum likelihood (ML), maximum composite likelihood (MCL),

neighbor-joining (NJ), minimum evolution (ME) and maximum parsimony

(MP) methods, producing bootstrap consensus trees for the required number of replications.

MEGA is handy for displaying tree topologies legibly in rectangular, radial

and circular layouts (Kumar et al 2008).

1.12 CLUSTER ASSOCIATIONS

Cluster associations can be inferred from the generated tree topologies.

Understanding the distribution of clusters with significant bootstrap (BS) values

helps to classify / group the related sequences. For example, in the phylogenetic

analysis of mouse olfactory receptors (Zhang and Firestein 2002), nearly 1000

OR genes were classified into several OR families by using a consensus tree.

For the classification, they identified reliable clusters as those having >50%



bootstrap support and more than 40% protein identity. By this definition, mouse

ORs were classified into 228 families. This kind of segregation of gene/protein

sequences creates cluster associations for the protein families of interest. Cluster

associations provide information about conserved species-specific

behaviors and the evolutionary integrity observed at intra- and inter-genomic levels

(Figure 1.6).
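
The two thresholds quoted above (>50% bootstrap support and more than 40% protein identity) can be expressed as a simple filter. The sketch below is an assumption-laden illustration and not the procedure of the cited study: identity is computed over aligned columns in which neither sequence has a gap, the identity criterion is applied to the least similar pair in the cluster, and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identical residues over columns where both sequences are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def is_reliable_cluster(aligned_members, bootstrap,
                        min_bootstrap=50.0, min_identity=40.0):
    """True if the clade passes both the bootstrap and the identity threshold.

    aligned_members: aligned sequences (equal length) belonging to one clade.
    bootstrap: bootstrap support of that clade, in percent.
    """
    identities = [percent_identity(a, b)
                  for a, b in combinations(aligned_members, 2)]
    if not identities:
        return False  # a single sequence does not form a cluster
    return bootstrap > min_bootstrap and min(identities) > min_identity


if __name__ == "__main__":
    clade = ["ACDE", "ACDF", "ACDE"]
    print(is_reliable_cluster(clade, bootstrap=80.0))
```

Using the minimum pairwise identity is the strictest reading of the criterion; an average over all pairs would be a more permissive alternative.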

1.13 SEQUENCE CONSERVATION AND DIVERSITY

The intra- and inter-genomic phylogenetic studies performed guide

the association of sequences, revealing species-specific tendencies as well as

co-clustering arrangements. Evolutionarily conserved sequence properties such

as motifs (Scott Gleim 2009) are highly important for further connecting

sequences to their structural and functional relevance.

Several computational techniques and software tools are available

to locate and display conserved amino acid residues in an aligned set of

homologous sequences. Available tools and databases such as TOPDOM,

MeMotif, PROSITE, IMOTdb, SmoS and WEBLOGO, together with the

in-house program MotifS (by Sowdhamini, yet to be published),

can be used to visualize a set of aligned TM proteins along with the observed motifs

and AAS. Such annotation tools can be applied in comparative genomics of

GPCRs or ORs to identify cluster-specific/family-specific motifs along with

the knowledge of predicted topology (Figure 1.6).
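
The idea of locating conserved positions in an aligned set of sequences can be sketched with the per-column information content that sequence logos display. This is a minimal illustration and not the method of any tool named above: it assumes a 20-letter amino acid alphabet, ignores gaps and small-sample corrections, and uses hypothetical function names and a toy alignment.

```python
import math
from collections import Counter


def column_information(alignment):
    """Information content (bits) per column: log2(20) minus the Shannon entropy."""
    bits = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            bits.append(0.0)  # all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values())
        bits.append(math.log2(20) - entropy)
    return bits


def conserved_positions(alignment, min_bits=3.0):
    """1-based alignment columns whose information content reaches min_bits."""
    return [i for i, b in enumerate(column_information(alignment), start=1)
            if b >= min_bits]


if __name__ == "__main__":
    toy = ["DRYA", "DRYG", "DRYS", "ERYT"]  # toy alignment, illustrative only
    print(conserved_positions(toy))
```

Columns that pass the threshold are candidate motif positions; mapping them onto the predicted topology then indicates whether a motif lies in a TM helix or in a loop.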

1.14 HOMOLOGY MODELLING OF GPCRs/ORs

The sequence searches and clustering provide representative

sequences to generate three-dimensional structures and this further helps to

map hotspots and to associate functional properties. Comparative



modelling/homology modelling is an appropriate procedure for generating 3D

models of the proteins of interest and can be achieved by the following steps:

i) First, homologous sequences of the query are collected by using

effective sequence search methods. The nearest homologous sequence

whose structure is known can be used as a template.

ii) A pairwise alignment of the template and target sequences is made

by using appropriate alignment methods. Procedures such as PRALINETM

and MAFFT can be used for membrane proteins. Alignments can be

manually edited (using MEGA) to improve the alignment quality.

iii) The co-ordinates of the three-dimensional model are built from the

generated alignment by using software like MODELLER (Sali and

Blundell, 1993) or web servers like SWISS-MODEL (Arnold et al 2006).

iv) The potential accuracy of the generated models is assessed, and the

models with the lowest energy can be selected. If unfavorable

conformations and short contacts are observed, the model can be energy

minimized by using the SYBYL software package (Tripos Associates Inc).

v) Structure validation can be done by checking for disallowed

conformations or structural environments, guided by Ramachandran plot

values using the PROCHECK server (Laskowski et al 1993) and by

VERIFY 3D (Bowie et al 1991).

In essence, this introductory chapter provides the necessary

background for the work described in the following Chapters 2-6.



CHAPTER 2

CROSS-GENOME CLUSTERING OF HUMAN AND

C. ELEGANS G-PROTEIN COUPLED RECEPTORS

2.1 INTRODUCTION

Membrane proteins are ubiquitous (Perez 2005), constitute nearly

20% of whole genomes and are the most attractive drug targets, since they are

implicated in various diseases. Membrane proteins are embedded within the

lipid bilayer and are designated as transmembrane proteins, since they loop

in and out across the cell boundary. A class of cell-surface receptors

shares characteristic structural features: an extracellular N-terminus, an

intracellular C-terminus and seven transmembrane helices (TMHs) connected

by three intracellular and three extracellular loops. This snake-like

arrangement gives rise to names such as 7TM receptors, heptahelical receptors or

serpentine-like receptors. If the downstream targets of such membrane