  • 7/28/2019 0 Front Pages New_merged

    1/317

    GENOME WIDE SURVEY OF CERTAIN

    MAMMALIAN GPCRS AND OLFACTORY

    RECEPTORS

    A THESIS

    Submitted by

    NAGARATHNAM B

    in partial fulfillment for the award of the degree

    of

    DOCTOR OF PHILOSOPHY

    FACULTY OF SCIENCE AND HUMANITIES

    ANNA UNIVERSITY

    CHENNAI 600 025

    JUNE 2012


    ABSTRACT

    In the current era of G-protein coupled receptor (GPCR) research,
    computational sequence analysis plays a vital role in identifying related
    sequences (homologues), conserved features (domains, motifs) and
    evolutionary relationships (orthologs) for protein families of interest at
    intra- and inter-genomic levels. Candidate GPCRs and olfactory receptors
    (ORs, class A type GPCRs) are important for their diverse cellular
    activities and were therefore chosen for a genome-wide survey in selected
    eukaryotic genomes, which helps to establish structure-function
    relationships.

    Generally, GPCRs are predicted to have an extracellular N-terminus
    (N-out topology), an intracellular C-terminus and seven transmembrane
    helices (TMHs) connected by three intracellular and three extracellular
    loops, and are therefore termed serpentine-like receptors.
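
    The N-out, seven-helix topology described above can be sketched as a
    small labelling routine. This is an illustrative sketch only: the function
    name `label_topology` and the example helix boundaries are hypothetical and
    are not taken from the thesis or from any TM prediction tool.

```python
from typing import List, Tuple

# Segment names for an N-out, seven-helix topology, in N- to C-terminal order.
# ICL = intracellular loop, ECL = extracellular loop.
SEGMENTS = [
    "N-term", "TM1", "ICL1", "TM2", "ECL1", "TM3", "ICL2", "TM4",
    "ECL2", "TM5", "ICL3", "TM6", "ECL3", "TM7", "C-term",
]


def label_topology(length: int,
                   helices: List[Tuple[int, int]]) -> List[Tuple[str, int, int]]:
    """Return (segment, start, end) triples using 1-based inclusive positions.

    `helices` holds seven (start, end) pairs, e.g. from a TM helix predictor.
    """
    if len(helices) != 7:
        raise ValueError("expected exactly seven TM helices")
    labelled = []
    cursor = 1                      # first residue not yet assigned
    names = iter(SEGMENTS)
    for start, end in helices:
        if start < cursor or end < start or end > length:
            raise ValueError("helices must be ordered, non-overlapping "
                             "and inside the sequence")
        loop_name = next(names)     # terminus or loop preceding this helix
        if start > cursor:          # skip zero-length loops/termini
            labelled.append((loop_name, cursor, start - 1))
        labelled.append((next(names), start, end))
        cursor = end + 1
    tail_name = next(names)         # "C-term"
    if cursor <= length:
        labelled.append((tail_name, cursor, length))
    return labelled


# Hypothetical helix boundaries for a 350-residue receptor.
example = label_topology(
    350,
    [(30, 52), (65, 87), (100, 122), (140, 162),
     (190, 212), (240, 262), (275, 297)],
)
for name, start, end in example:
    print(f"{name:7s} {start:4d}-{end:4d}")
```

    With this naming, odd-numbered gaps between helices fall on the
    intracellular side (ICL1 to ICL3) and even-numbered gaps on the
    extracellular side (ECL1 to ECL3), matching the N-out convention.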

    Previous cross-genome studies on human-Drosophila GPCRs motivated a
    cross-genome clustering of human and C. elegans GPCRs (Chapter 2). A
    profile-based clustering approach (RPS-BLAST) was employed to associate
    more than 1000 C. elegans GPCRs with previously grouped human GPCR
    clusters covering eight major receptor types. The resulting 32 human-C.
    elegans GPCR clusters were analyzed for five types of cluster association
    observed in the tree topology, for which the following terms are proposed:
    human GPCR clade [HC], coclusters [CC], neighbor clades [NC], neighbor
    members [NM] and species-specific members [SS]. These associations help
    to connect functional relevance at intra- and inter-genomic levels.
    Interestingly, the coclusters [CC] were significant and exhibited
    evolutionary integrity at the inter-genomic level. In addition, the 27
    identified orthologs illustrate the effectiveness of cross-genome
    clustering in connecting related GPCRs even at remote homology. Overall,
    84% of the GPCR sequences across genomes were successfully associated by
    RPS-BLAST at significant E-value thresholds (ranging from 0.001 to 1)
    (work published).
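
    The idea of associating sequences with cluster profiles at a series of
    E-value thresholds can be illustrated with a minimal sketch. It does not
    run RPS-BLAST; it assumes the hits have already been reduced to
    (query, profile, E-value) tuples. The identifiers, the toy hits and the
    intermediate thresholds 0.01 and 0.1 are assumptions made for illustration
    and do not come from the thesis.

```python
from typing import Dict, Iterable, Sequence, Tuple

Hit = Tuple[str, str, float]  # (query id, cluster profile id, E-value)


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, the profile hit with the lowest E-value."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, profile, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (profile, evalue)
    return best


def associate(hits: Iterable[Hit], threshold: float) -> Dict[str, str]:
    """Map each query to its best cluster if that hit passes the threshold."""
    return {query: profile
            for query, (profile, evalue) in best_hits(hits).items()
            if evalue <= threshold}


def coverage(hits: Sequence[Hit], n_queries: int,
             thresholds: Sequence[float] = (0.001, 0.01, 0.1, 1.0)
             ) -> Dict[float, float]:
    """Fraction of all query sequences associated at each threshold."""
    return {t: len(associate(hits, t)) / n_queries for t in thresholds}


# Toy data: five hypothetical queries, one of which (seqE) has no hit at all.
toy_hits = [
    ("seqA", "peptide_01", 1e-20),
    ("seqA", "amine_03", 1e-5),
    ("seqB", "amine_03", 0.05),
    ("seqC", "secretin_01", 0.8),
    ("seqD", "glutamate_02", 5.0),
]
print(associate(toy_hits, 0.001))
print(coverage(toy_hits, n_queries=5))
```

    Relaxing the threshold step by step, as in this sketch, shows how many
    additional sequences are picked up at each level and at what cost in
    stringency.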

    The cross-genome clustering of human and C. elegans GPCRs motivated a
    phylogenetic analysis focused exclusively on serpentine receptors (SRs)
    (Chapter 3). Since nearly 20 protein families of SRs from C. elegans are
    related to chemosensation, a phylogenetic analysis of 683 serpentine
    receptors was carried out to identify related sequences and clusters that
    represent family-specific or receptor-specific sequence features and,
    ultimately, to connect them at the superfamily level. Interestingly,
    odr-10, the only receptor in C. elegans annotated for olfaction (it senses
    diacetyl), was observed together with 43 SRs in the phylogeny. All the
    homologues associated with odr-10 are from the Str superfamily, and
    str-112 was found to be its most closely related sequence homologue. As a
    case study, odr-10 was modelled to understand its secondary structural
    details. A Str family-specific QLF motif was identified in ICL3/TM6 of
    odr-10, and 92 other SR family-specific motifs were also identified using
    the TM-MOTIF package. The identified sequence features can be used further
    to train SVM models and to predict putative receptors from other nematode
    species.

    An attempt was made to design a user-friendly alignment viewer, TM-MOTIF
    (work published), to detect and display conserved motifs on the predicted
    membrane topology in a set of aligned transmembrane proteins (Chapter 4).
    The tool identifies not only conserved motifs (default threshold 60%) but
    also amino acid substitutions (AAS), with their respective
    physico-chemical properties, at each position of the alignment (using an
    in-house program named MotifS). TM-MOTIF lets users submit their sequences
    of interest (multiple FASTA or MSA) and visualize the seven predicted
    helices of TM proteins in the VIBGYOR colouring scheme. Users can also
    align a sequence of interest with any one of the given reference sequences
    (of known structure) to obtain a pairwise alignment; this display is
    helpful as a prerequisite for homology modelling. Users can also perform a
    BLAST search to identify the nearest homologue from the incorporated
    cross-genome GPCR and OR cluster datasets of selected organisms. In short,
    TM-MOTIF is well suited to comparative genomics and to identifying
    cluster-specific, receptor-specific and common motifs observed at various
    percentages of conservation within and across genomes. The package is
    integrated into DOR (Database of Olfactory Receptors).
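
    The notion of a motif conserved at a given percentage across an alignment
    can be made concrete with a short sketch. It is not the MotifS or TM-MOTIF
    implementation: the function names, the minimum motif length of three
    columns and the toy alignment are assumptions chosen for illustration.
    Only the 60% default threshold is taken from the text.

```python
from collections import Counter
from typing import List, Tuple


def column_consensus(alignment: List[str], column: int) -> Tuple[str, float]:
    """Most frequent non-gap residue in a column and its fraction of all rows."""
    residues = [seq[column] for seq in alignment]
    counts = Counter(r for r in residues if r != "-")
    if not counts:
        return "-", 0.0
    residue, count = counts.most_common(1)[0]
    return residue, count / len(residues)


def find_motifs(alignment: List[str], threshold: float = 0.6,
                min_length: int = 3) -> List[Tuple[int, str]]:
    """Return (0-based start column, consensus) for runs of conserved columns."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    width = len(alignment[0])
    motifs: List[Tuple[int, str]] = []
    run: List[str] = []
    run_start = 0
    for col in range(width + 1):        # the extra step flushes a final run
        conserved = False
        if col < width:
            residue, fraction = column_consensus(alignment, col)
            conserved = fraction >= threshold
        if conserved:
            if not run:
                run_start = col
            run.append(residue)
        else:
            if len(run) >= min_length:
                motifs.append((run_start, "".join(run)))
            run = []
    return motifs


def substitutions(alignment: List[str], column: int) -> List[str]:
    """Residues other than the consensus observed in a column (gaps ignored)."""
    consensus, _ = column_consensus(alignment, column)
    return sorted({seq[column] for seq in alignment} - {consensus, "-"})


# Toy alignment of five hypothetical sequences containing a DRY-like motif.
toy_alignment = [
    "AKDRYLTV",
    "GLDRYMAI",
    "SVDRYLSC",
    "TFERYFGL",
    "A-DRYQPV",
]
print(find_motifs(toy_alignment))        # conserved runs at the 60% default
print(substitutions(toy_alignment, 2))   # alternatives seen at the D position
```

    In this toy alignment the first motif position is D in four of five
    sequences and E in the remaining one, which is the kind of permitted
    exchange summarized by the E/DRY notation.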

    Conserved motifs and AAS are known to play a crucial role in function.
    The previously established 32 clusters, covering eight major receptor
    types, from the cross-genome GPCR datasets (human-Drosophila GPCR
    clusters, human-C. elegans GPCR clusters and the human-only GPCR cluster
    dataset) were studied primarily for conserved motifs (MotifS program), and
    the TM-MOTIF package was used to map the observed motifs to their
    respective membrane topology (Chapter 5).

    Interestingly, a total of 33 conserved motifs were identified in the
    human-Drosophila GPCR clusters, and 76% of them were observed in TM
    helices, predominantly in TM2 and TM7. Besides the classical motifs such
    as E/DRY and NPXXY, motifs observed in a single receptor type
    (cluster-specific or receptor-specific motifs), in two receptor types and
    in multiple receptor types were also documented for the cross-genome GPCR
    clusters (work published).


    An olfactory receptor data repository was generated for selected
    eukaryotic organisms (yeast, worm, fly, mouse and human), and the
    sequences were aligned to produce intra- and inter-genomic phylogenies.
    Interestingly, 371 functional ORs from the human genome were distributed
    in 10 distinct clusters, and class I (sensing water-borne odors) and class
    II (sensing air-borne odors) receptors could be discriminated when a few
    selected fish and amphibian ORs were introduced into the human OR
    phylogeny. In another study, fly ORs showed no significant coclustering in
    the human OR phylogeny, which indicates that insect ORs are evolutionarily
    distinct from mammalian ORs. This could be due to independent evolution,
    life style or the reverse topology of fly ORs. Selected nematode ORs also
    showed no coclustering with human ORs, owing to the long evolutionary
    lineage and nematode life style. The study of human-mouse OR clusters
    showed significant coclustering, and further studies were carried out with
    ORs of canines, rodents and non-human primates to analyze cluster
    association with human ORs. The results of the sequence studies were
    organized in a publicly available database named DOR. It provides
    sequences, predicted TM boundaries, intra- and inter-genomic alignments
    and phylogenies of selected genomes. It also includes a motif
    identification tool (TM-MOTIF) and is associated with other features such
    as predicted secondary structure and dimer prediction from collaborators
    (work in press).

    In essence, the genome-wide survey yields representative sequences,
    cluster associations, cluster-specific motifs, orthologs and coclusters at
    intra- and inter-genomic levels, which ultimately help to transfer
    functional properties from known to unknown genes/proteins and to
    understand structure-function relationships.


    ACKNOWLEDGEMENT

    I express my deep sense of gratitude to Dr. V. Balakrishnan, Department
    of Biotechnology, KSR College of Technology, Tiruchengode, for his
    valuable guidance during my Ph.D. study. I am also extremely thankful to
    my co-supervisor and mentor, Prof. Dr. R. Sowdhamini, Lab-25, National
    Center for Biological Sciences, Bangalore, who has been a source of
    inspiration, help, guidance and advice throughout the course of this
    research work. Further, I sincerely express my earnest gratitude to my
    doctoral committee member, Dr. S. SenthilKumar, PSG College of
    Technology, Coimbatore. I express my heartfelt thanks to Prof. Dr. K.
    Karunakaran, Vice Chancellor, and Dr. P. Renuka Devi, Director-Research,
    Anna University of Technology Coimbatore, for graciously permitting me to
    do this research.

    I submit my gratitude to Prof. Dr. K. Vijayaragavan, RSF, Director,
    NCBS, Bangalore, Prof. Dr. Obaid Siddiqi, RSF, Prof. Dr. Apurva Sarin and
    Prof. N. Srinivasan from IISc., Bangalore, for extending care and moral
    support to pursue the research work. I also submit my deepest gratitude to
    Mr. Ashok Rao, Mr. Shaju, the teaching and non-teaching staff, my lab
    mates and all@ncbs for their kind-hearted support in encouraging my
    thirst for research.

    Thanks to my family members and my beloved APPA.

    B. NAGARATHNAM


    TABLE OF CONTENTS

    CHAPTER NO. TITLE PAGE NO.

    ABSTRACT iii

    LIST OF TABLES xxii

    LIST OF FIGURES xxiv

    LIST OF ABBREVIATIONS xxx

    1 INTRODUCTION 1

    1.1. PRIOR ART ON GENOME-WIDE SURVEY 2

    1.2. BREAKTHROUGHS IN GPCR CRYSTALLOGRAPHY STUDIES 4

    1.3. GPCRS: POPULAR DRUG TARGETS 6

    1.4. STRUCTURE AND CELLULAR ACTIVITIES OF MEMBRANE PROTEINS 7

    1.5. MEMBRANE PROTEIN: TOPOLOGY 7

    1.6. GPCR MECHANISM 9

    1.7. GPCR CLASSIFICATION 10

    1.7.1 Olfactory Receptors (ORs) 11

    1.7.2 Classical Knowledge on Olfactory

    Receptors 12

    1.7.3 Olfactory Signaling Pathway in

    Human ORs 13

    1.7.4 ORs, GRs and IRs in Drosophila 14

    1.7.5 Insect olfaction (Drosophila ORs) 14

    1.7.6 Nematode Olfaction 15

    1.7.7 Mouse Olfaction 16


    1.8 DATA REPOSITORIES FOR MEMBRANE

    PROTEINS 16

    1.9 COLLECTION OF GPCR- HOMOLOGUES 17

    1.9.1 BLAST (Basic Local Alignment

    Search Tool) 18

    1.9.2 PSI-BLAST (Profile Vs Sequence

    comparison method) 19

    1.9.3 Reverse PSI-BLAST (Sequence Vs

    Profile comparison method) 20

    1.10 MULTIPLE SEQUENCE ALIGNMENT

    TECHNIQUES 22

    1.10.1 CLUSTAL W 23

    1.10.2 PRALINETM 24

    1.10.3 MAFFT 24

    1.11 DERIVING PHYLOGENY OF GPCRs/ORs 25

    1.11.1 PHYLIP 26

    1.11.2 TREE-PUZZLE 26

    1.11.3 MEGA (Molecular Evolutionary

    Genetics Analysis) 27

    1.12 CLUSTER ASSOCIATIONS 27

    1.13 SEQUENCE CONSERVATION AND

    DIVERSITY 28

    1.14 HOMOLOGY MODELLING OF GPCRs/ORs 29

    2 CROSS-GENOME CLUSTERING OF HUMAN AND

    C. ELEGANS G-PROTEIN COUPLED

    RECEPTORS 30

    2.1 INTRODUCTION 30


    2.2 C. elegans - AN ATTRACTIVE ANIMAL MODEL 30

    2.2.1 Features Related to C. elegans and

    Human GPCRs 31

    2.3 OBJECTIVES 33

    2.4 PRIOR ART 33

    2.4.1 Superfamilies of Serpentine Receptors 34

    2.5 METHODOLOGY 35

    2.5.1 Selection Criteria for C. elegans GPCRs 35

    2.5.2 Generation of Representative Profiles 38

    2.5.3 Performing RPS-Blast 38

    2.5.4 Cross-Genome Alignment of Human-C. elegans GPCRs 39

    2.5.5 Cross-Genome Phylogeny of Human-C. elegans GPCRs 40

    2.5.6 Terminologies used to Describe Phylogeny

    2.5.6.1 Human GPCR clade [HC] 40

    2.5.6.2 Coclusters [CC] 40

    2.5.6.3 Neighbor Clades [NC] 41

    2.5.6.4 Neighbor Members [NM] 41

    2.5.6.5 Species specific Members [SS] 41

    2.5.6.6 Superfamilies of Serpentine

    receptors (SR) 41

    2.6 RESULTS AND DISCUSSION 42

    2.6.1 Result Summary for Peptide Receptors 43

    2.6.2 Result Summary for Chemokine Receptors 67

    2.6.3 Result Summary for Nucleotide and Lipid

    receptors 68


    2.6.4 Result Summary for Biogenic Amine

    Receptors 81

    2.6.5 Result Summary for Class B (Secretin)

    Receptors 94

    2.6.6 Result Summary for Cell

    Adhesion Receptors 99

    2.6.7 Result Summary for Class C (Glutamate)

    Receptors 101

    2.6.8 Result Summary for Frizzled/Smoothened

    Receptors 108

    2.7 CONCLUSION 110

    3 PHYLOGENETIC ANALYSIS OF SERPENTINE

    RECEPTORS OF C. ELEGANS AND

    IDENTIFICATION OF CONSERVED MOTIFS IN

    SERPENTINE RECEPTOR SUPERFAMILIES 117

    3.1 INTRODUCTION 117

    3.2 HOMOLOGUES OF C. elegans GPCRs 118

    3.3 OBJECTIVES 118

    3.4 CHEMOSENSORY RECEPTORS IN C. elegans 119

    3.5 CHEMOSENSORY NEURONS AND

    OLFACTORY APPARATUS IN C. elegans 119

    3.6 FAMILIES AND SUPERFAMILIES OF

    SERPENTINE RECEPTORS IN C. elegans 120

    3.7 FEATURES AND IMPORTANCE OF SRs 122

    3.8 SRs: FUNCTIONAL RELEVANCE WITH

    OTHER EUKARYOTIC GPCRs 122

    3.9 METHODOLOGY 123


    3.9.1 Data Collection 123

    3.9.2 Prediction of TM-helices by HMMTOP 123

    3.9.3 Alignment Procedure by MAFFT 124

    3.9.4 Phylogeny of Selected Serpentine Receptors 124

    3.9.5 Identification of Motifs in SRs 124

    3.10 RESULTS 125

    3.10.1 Identified Motifs in SR Families : A

    Pilot Study 127

    3.10.2 Homology Modelling of odr-10 128

    3.10.2.1 Pairwise alignment of odr-10

    with bovine rhodopsin sequence 128

    3.10.2.2 Alignment by MAFFT 129

    3.10.2.3 Structure validation for Odr-

    10 model 130

    3.10.2.4 Preliminary phylogenetic analysis 131

    3.10.2.5 Odr-10 an outgroup to HOR 131

    3.11 CONCLUSION 132

    4 TM-MOTIF: A PACKAGE AND AN ALIGNMENT

    VIEWER TO IDENTIFY CONSERVED MOTIFS

    AND AMINO ACID SUBSTITUTIONS IN ALIGNED SET OF SEVEN TRANSMEMBRANE

    HELIX PROTEINS 135

    4.1 INTRODUCTION 135

    4.1.1. Functional Importance of Conserved Motifs in TM-Proteins 136

    4.1.2. Motif Related to Structural Integrity and Stability 137


    4.1.3. Impacts of Motifs in Evolutionary Bioinformatics 138

    4.2. OBJECTIVES OF TM-MOTIF 138

    4.3. KEY FEATURES OF TM-MOTIF 139

    4.4. METHODOLOGY 140

    4.4.1. In-Built Dataset of Cross-Genome GPCR and OR Cluster Dataset 141

    4.4.1.1 Human-Drosophila cross-genome

    GPCR clusters 141

    4.4.1.2 Human-C. elegans cross-genome

    GPCR clusters 141

    4.4.1.3 Human-mouse cross-genome OR

    clusters 141

    4.4.2 Alignment Procedures for Cross-Genome

    GPCR/OR Clusters 141

    4.4.3. Prediction of Membrane Topology for TM Helices and Loops 142

    4.4.4 Detection of Motifs and Amino Acid Substitution (AAS) in the Cross-Genome Alignment 143

    4.4.5 Mapping of Identified Motifs on TM-helices and Loops in MSA 143

    4.4.6 Identification of Homologous Sequences for User Submitted Queries by Performing BLAST 144

    4.4.7 Pairwise Alignment in TM-MOTIF 144

    4.5 RESULTS 145


    4.5.1. Software Input and Output Options 145

    4.5.2. Input Options 146

    4.5.3. Output Options 146

    4.5.3.1 Display of predicted 7 TM-helices in VIBGYOR colouring scheme: (by using Run TM option) 146

    4.5.3.2 Display of Identified Motifs and AAS in MSA: (by using Run Motif option) 147

    4.5.3.3 Display of Detected Motifs

    on TM-helices: (by using

    Run TM-Motif option) 148

    4.5.3.4 Alignment with Reference Sequence 150

    4.5.3.5 Identifying closest homologues

    of user sequence in selected

    organisms 151

    4.5.3.6 Display of Over predicted helices 151

    4.6. DEFAULT PARAMETERS 152

    4.6.1 TM-MOTIF- Output Files 152

    4.7. CAVEAT AND FUTURE DEVELOPMENT 153

    4.8. AVAILABILITY 154

    4.9. CONCLUSIONS 154


    5 ANALYSIS ON CONSERVED MOTIFS

    AND PERMITTED AMINO ACID

    EXCHANGES IN CROSS-GENOME

    GPCR CLUSTERS 156

    5.1 INTRODUCTION 156

    5.2 OBJECTIVES 157

    5.3 RESIDUE CONSERVATION IN CROSS-

    GENOME SEQUENCES 158

    5.4 IMPACT OF AMINO ACID CONSERVATION

    AND TYPES OF SUBSTITUTIONS 159

    5.5 METHODS 159

    5.5.1 Cross-genome GPCR cluster dataset 160

    5.5.2 Alignment Procedure 160

    5.5.3 Prediction of membrane topology 161

    5.5.4 Program to Detect Motifs and AAS 161

    5.6 RESULTS 162

    5.7 OCCURRENCE OF MOTIFS FOR SINGLE

    RECEPTOR TYPE 163

    5.8 MOTIFS OBSERVED IN HUMAN-DROSOPHILA CROSS-GENOME CLUSTERS 164

    5.8.1 Motifs Observed in Transmembrane Helices 164

    5.8.2 Motifs Observed in Loop Regions 165

    5.9 MOTIFS OBSERVED IN HUMAN- C. elegans

    GPCR CROSS-GENOME CLUSTERS 167

    5.10 CHARACTERISTIC MOTIFS FROM

    CROSS-GENOME GPCR CLUSTERS 169


    5.10.1 Conserved D/ERY and NPXXY motifs in

    GPCR Clusters 169

    5.10.2 Identified KLK/R and RLAR/K motif in

    Secretin Receptor 169

    5.10.3 Conserved PMNYM / PMSYM motif in

    BGA Receptor 170

    5.11 SUMMARY 171

    6 GENOME WIDE SURVEY OF

    OLFACTORY RECEPTORS (ORS) IN

    SELECTED EUKARYOTIC GENOMES 173

    6.1. PHYLOGENETIC STUDY ON SELECTED HUMAN ORS 173

    6.1.1. Introduction 173

    6.1.2. Objectives and Scopes 173

    6.1.3. Olfactory Receptors 174

    6.1.4. OR: Membrane Topology 175

    6.1.5. Prior Studies on ORs 175

    6.1.6. Methodology 177

    6.1.6.1. Retrieval of OR sequences 177

    6.1.6.2. Prediction of membrane topology: Human ORs 178

    6.1.6.3. Alignment procedure 179

    6.1.6.4. Phylogeny on selected human olfactory receptors 179

    6.1.6.5. Analysis of phylogeny 180


    6.1.7. Results 181

    6.1.7.1. Class I and II type receptors in human OR phylogeny 181

    6.1.7.2. Sequence features of 10 human OR-subclusters 181

    6.1.7.3. Representative OR sequences 182

    6.1.7.4. Motif analysis on human olfactory receptors 183

    6.1.7.5. SVM Analysis 185

    6.2. CROSS-GENOME PHYLOGENY ON SELECTED ORS FROM HUMAN AND FISH GENOMES 186

    6.2.1. Objective 186

    6.2.2. Review of Literatures 187

    6.2.3. Fish ORs 187

    6.2.4. Results 188

    6.2.5. Sequence conservation: across fish and human ORs 189

    6.3 CROSS-GENOME PHYLOGENY ON

    SELECTED ORS FROM HUMAN AND

    AMPHIBIAN GENOME 191

    6.3.1 Objective 191

    6.3.2 Literature survey on class

    I and II type ORs 192

    6.3.3 Amphibian ORs 192

    6.3.4 Results 193

    6.3.5.1 Cocluster HXC1 Class

    I type receptors 195

    6.3.5.2 Cocluster HXC2- class

    II type receptors 195


    6.3.5.3 Cocluster HXC3 - class II type receptors 196

    6.4 PHYLOGENETIC ANALYSIS ON

    DROSOPHILA OLFACTORY RECEPTORS 199

    6.4.1 Background 199

    6.4.2 Drosophila ORs 199

    6.4.3 Results on Drosophila OR Phylogeny Analysis 200

    6.4.3.1 Cluster association: 10 subclusters 200

    6.4.4 Summary 203

    6.5 CROSS-GENOME PHYLOGENETIC

    ANALYSIS ON SELECTED ORS

    FROM DROSOPHILA, YEAST AND

    HOMO SAPIENS 204

    6.5.1 Background 204

    6.5.2 Insect ORs and mammalian ORs:

    (Evolutionarily unrelated) 204

    6.5.3 Membrane proteins in Yeast 205

    6.5.4 Results 205

    6.5.5 Summary 206

    6.6 CROSS-GENOME PHYLOGENETIC ANALYSIS ON SELECTED OLFACTORY

    RECEPTORS FROM HUMAN AND

    C. elegans GENOMES 206

    6.6.1 Odr-10 and homologues 207

    6.6.2 Results and Discussion 208

    6.6.3 Summary 211


    6.7 CROSS-GENOME PHYLOGENETIC

    ANALYSIS ON SELECTED ORS

    FROM HUMAN AND MOUSE GENOMES 212

    6.7.1 Introduction 212

    6.7.2 Objectives 213

    6.7.3 Human-Mouse OR Orthology 213

    6.7.4 Complex Picture on Human-Mouse

    OR Orthology 214

    6.7.5 Methodology 215

    6.7.6 Results 215

    6.7.6.1 Cross-genome OR cluster

    association 215

    6.7.6.2 Cross-genome phylogeny

    with Class-I type receptor

    homologues 217

    6.7.7 Common motifs in the Cross-genome

    phylogeny 218

    6.7.8 Summary 218

    6.8 PHYLOGENETIC ANALYSIS ON

    OLFACTORY RECEPTORS FROM

    SELECTED HUMAN AND NON-HUMAN

    PRIMATES 220

    6.8.1 Objectives 220

    6.8.2 Background 220

    6.8.3 Methodology 220

    6.8.4 Results 221

    6.8.4 Summary 222


    6.9 DATABASE OF OLFACTORY

    RECEPTORS (DOR) 222

    6.9.1 Objectives 222

    6.9.2 Features on OR sequences in DOR 224

    6.9.2.1 OR sequences of target genomes: 225

    6.9.2.2 Predicted TM boundaries 226

    6.9.2.3 Single/cross- genome OR

    alignments 227

    6.9.2.4 Cluster association and

    Phylogeny 228

    6.9.2.5 Softwares and Tools

    (TM-MOTIF) in DOR 229

    6.9.3 Structural features (Application of

    sequence searches) 230

    6.9.4 Summary 233

    7 CONCLUSION 236

    7.1 COMPENDIUM 236

    7.2 CROSS-GENOME GPCR CLUSTERING 237

    7.3 PHYLOGENETIC ANALYSIS ON

    SERPENTINE RECEPTORS 240

    7.4 TM-MOTIF PACKAGE 242

    7.5 STUDY ON CONSERVED MOTIFS AND

    AAS IN CROSS-GENOME GPCR CLUSTERS 245

    7.6 PHYLOGENETIC ANALYSIS ON ORS

    IN SELECTED EUKARYOTIC GENOMES 247

    7.7 SUMMARY 253


    APPENDIX 1 THE LIST OF IDENTIFIED

    FAMILY-SPECIFIC MOTIFS IN SR 256

    REFERENCES 260

    LIST OF PUBLICATIONS 284

    CURRICULUM VITAE 285


    LIST OF TABLES

    TABLE NO. TITLE PAGE NO.

    2.1 Distribution of Human and C. elegans GPCRs in

    32 Clusters 114

    2.2 List of Identified Orthologs 116

    3.1 List of identified motifs in serpentine receptor super families 134

    5.1 Motifs observed in the transmembrane helices and loop regions of human and Drosophila GPCR clusters 162

    6.1 Analysis on sequence features of 10 human

    OR subclusters 183

    6.2 List of conserved motifs in 10 human OR

    subclusters (60% level of conservations) 184

    6.3 Sequence identity of neighboring fish ORs

    and human class I type receptors observed in

    cross-genome OR phylogeny 191

    6.4 Sequence identity of neighboring frog ORs and

    human class I type receptors observed in

    cross-genome OR phylogeny 197

    6.5 Sequence identity of neighboring frog ORs and

    human class II type receptors observed in

    cross-genome OR phylogeny (referred as HXC2) 198

    6.6 Sequence identity of neighboring frog ORs and

    human class II type receptors observed in

    cross-genome OR phylogeny (referred as HXC3) 198


    6.7 Significant cluster association for str type

    receptors in CeC3 and sequence pairs with high

    /low identity has been given 210

    6.8 Sequence identity and similarity between

    odr-10 and associated SR 213

    6.9 Percentage identity for selected human and

    mouse ORs for significant association from

    cross-genome OR phylogeny 219

    6.10 Percentage Identity between selected human ORs and

    non-human ORs 221


    LIST OF FIGURES

    FIGURE NO. TITLE PAGE NO.

    1.1 Central dogma of genome-wide survey on sequences 3

    1.2 Crystal structure of bovine rhodopsin (Li et al 2004) 5

    1.3 Membrane topology of olfactory receptor

    (odr-10) in C. elegans 8

    1.4 GPCR signaling pathway 10

    1.5 ORs and organization of the olfactory system in

    mammals and OR signaling pathway

    (Meyer et al 2000) 13

    1.6 Overview on the techniques involved in

    genome-wide survey 22

    2.1 Flow-chart to depict the step-wise procedure for

    cross-genome clustering of GPCRs 37

    2.2(a-c) Pictorial representation for various types of

    cluster association 42

    2.3(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 46

    2.4(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 48

    2.5(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 50

    2.6(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 52

    2.7(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 55


    2.8(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 57

    2.9(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 59

    2.10(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 61

    2.11(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 63

    2.12 (a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display and Radial Display) 64

    2.13(a-b) Cross-genome phylogeny of peptide receptors:

    (Rectangular Display & Radial Display) 66

    2.14(a-b) Cross-genome phylogeny of chemokine receptors:

    (Rectangular Display & Radial Display) 69

    2.15(a-b) Cross-genome phylogeny of chemokine receptors:

    (Rectangular Display & Radial Display) 70

    2.16(a-b) Cross-genome phylogeny of nucleotide and lipid

    receptors (Rectangular Display & Radial Display) 72

    2.17(a-b) Cross-genome phylogeny of nucleotide and

    lipid receptors (Rectangular Display & Radial Display) 74

    2.18(a-b) Cross-genome phylogeny of nucleotide and lipid

    receptors (Rectangular Display & Radial Display) 76

    2.19(a-b) Cross-genome phylogeny of peptide receptors

    nucleotide and lipid receptors (Rectangular Display

    & Radial Display) 78

    2.20(a-b) Cross-genome phylogeny of nucleotide and lipid

    receptors (Rectangular Display & Radial Display) 80


    2.21(a-b) Cross-genome phylogeny of nucleotide and lipid

    receptors (Rectangular Display & Radial Display) 82

    2.22(a-b) Cross-genome phylogeny of biogenic amine

    receptors (Rectangular Display &

    Radial Display) 84

    2.23(a-b) Cross-genome phylogeny of biogenic amine

    receptor (Rectangular Display & Radial Display) 86

    2.24(a-b) Cross-genome phylogeny of biogenic amine

    receptor (Rectangular Display & Radial Display) 88

    2.25(a-b) Cross-genome phylogeny of biogenic amine

    receptor (Rectangular Display & Radial Display) 91

    2.26(a-b) Cross-genome phylogeny of biogenic amine

    receptor (Rectangular Display & Radial Display) 93

    2.27(a-b) Cross-genome phylogeny of secretin type

    receptors (Rectangular Display & Radial Display) 96

    2.28(a-b) Cross-genome phylogeny of secretin type

    receptors (Rectangular Display & Radial Display) 98

    2.29(a-b) Cross-genome phylogeny of cell adhesion type

    receptor (Rectangular Display & Radial Display) 100

    2.30(a-b) Cross-genome phylogeny of glutamate receptor

    (Rectangular Display & Radial Display) 102

    2.31(a-b) Cross-genome phylogeny of glutamate receptor

    (Rectangular Display & Radial Display) 104

    2.32(a-b) Cross-genome phylogeny of glutamate receptor

    (Rectangular Display & Radial Display) 105

    2.33(a-b) Cross-genome phylogeny of glutamate receptor

    (Rectangular Display & Radial Display) 107


    2.34(a-b) Cross-genome phylogeny of FRZ/SMT type

    receptor (Rectangular Display & Radial Display) 109

    2.35 (a-b) Distribution of C. elegans GPCRs at various E-value

    thresholds 112

    3.1 Pie-diagram to show the distribution of serpentine

    receptors (SR) in the dataset 123

    3.2 Phylogeny on selected serpentine receptors

    (circular view tree) 125

    3.3 The subcluster showing odr-10 and its homologues 127

    3.4 Pairwise alignment of odr-10 with bovine

    rhodopsin sequence 129

    3.5 Three-dimensional model of olfactory receptor

    odr-10 and structure validation 130

    3.6 Phylogeny on selected human olfactory receptors

    with an olfactory receptor (odr-10) from C. elegans 132

    4.1 Flow-chart 140

    4.2 Tool guide of TM-MOTIF : an overview 142

    4.3 Snapshot for the available main menu of the front window of TM-MOTIF with user interactive features 145

    4.4 Options given for the submission of input sequences

    in TM-MOTIF package 146

    4.5 Sample output for the option RUN TM 147

    4.6 Sample output for the option RUN MOTIF 148

    4.7 Sample output for the option RUN TM-Motif 149

    4.8 Snapshot for the display of pairwise alignment of

    users input sequence with selected reference sequence 150

    4.9 Snapshot Depicts the Display of Over Predicted

    TM-Helices 151


    5.1 Pictorial representation to denote the occurrence

    of highly conserved DRY motif in TM3, ICL2 158

    5.2 Flow-chart describes about the steps involved in

    the study 159

    5.3 Percentage residue conservation in TM helices and

    loops in GPCR Clusters 168

    5.4(a-c) Illustration of characteristic motifs (observed at

    60% conservation) 171

    6.1 Flow-chart for the sequence analysis on

    olfactory receptors 179

    6.2(a-b) Phylogenetic display of selected human

    olfactory receptor 180

    6.3 Phylogeny of selected olfactory receptors in

    Homo sapiens and fish genomes 189

    6.4 Snapshot of Alignment window for the motif

    KAFSTC in human ORs and in few fish

    ORs at cross-genome alignment 190

    6.5 Snapshot depicts the co-clustering of fish ORs

    with class I type receptors of human ORs in

    HSC1 (given in A), also exhibiting the coclusters

    like HXC1, HXC2 and HXC3 to indicate the class

    I and II type receptors from frog ORs with human ORs (given in B). 193

    6.6 Snapshot depicts the co-clustering of fish ORs

    with class I type receptors of human ORs in

    HSC1 (given in A), also exhibiting the

    coclusters like HXC1, HXC2 and HXC3 to indicate

    the class I and II type receptors from frog ORs

    with human ORs (given in B). 194


    6.7 Phylogeny of Drosophila olfactory receptors 201

    6.8 Observed 10 subclusters of Drosophila olfactory

    receptors 203

    6.9 Cross-genome phylogeny on selected ORs from

    human, Drosophila and yeast 206

    6.10 Observed cluster association in the cross-genome

    phylogeny of selected ORs from human and

    C. elegans genomes 208

    6.11 Cross-genome phylogeny of selected olfactory

    receptors (ORs) from human and mouse genomes 216

    6.12 Phylogeny on selected human and mouse

    olfactory receptors with special emphasis on mouse

    class I type receptors 216

    6.13 Cross genome phylogeny on selected human ORs

    with ORs from non human primates and aves 222

    6.14 Available main menu in the front page of DOR 225

    6.15 A snapshot of the give option sequence and

    its application in DOR 226

    6.16 Display of predicted membrane boundaries in DOR 227

    6.17 Display of Alignment option in DOR 228

    6.18 Display of cross-genome OR phylogeny in DOR 229

    6.19 Overview on pictorial representation of available

    features in DOR for sequence analysis 230

    6.20 Overview on DOR features for sequence

    and structural information for olfactory receptors

    in DOR 231

    6.21 Display of 3D Structure and related features in DOR 233


    LIST OF ABBREVIATIONS

    AAS - Amino acid substitutions

    BGA receptors - Biogenic amine receptors

    BLAST - Basic Local Alignment Search Tool

    BS - Bootstrap

    CAR - Cell adhesion receptors

    CC - Co-clusters

    CMK - Chemokine receptors

    FRZ/SMT - Frizzled/smoothened receptors

    GLR - Class C (glutamate) receptors

    GPCRs - G-protein coupled receptors

    HC - Human GPCR clade

    MAFFT - Multiple Alignment using Fast Fourier Transform

    MEGA - Molecular Evolutionary Genetics Analysis

    N&L - Nucleotide and lipid receptors

    NC - Neighbor clades

    NJ - Neighbor joining

    NM - Neighbor members

    ORs - Olfactory receptors

    PR - Peptide receptors

    RMSD - Root-mean-square deviation

    RPS-BLAST - Reverse PSI-BLAST

    SEC - Class B (secretin) receptors

    SR - Serpentine receptors

    SS - Species-specific members

    SVM - Support vector machine

    TM proteins - Trans-membrane proteins


    CHAPTER 1

    INTRODUCTION

    The frequent, large-scale updating of sequence databases to build
    repositories for various genomes, and the accurate prediction of structural
    information for these sequences, are two critical steps in computational
    genomics. The available knowledge and approaches for genomics (Lipman et al
    2011) and structural genomics (Redfern et al 2008) differ drastically,
    but they can be connected effectively for the purpose of identifying
    functional annotations (Alfarano et al 2005).

    The huge accumulation of sequence information on one hand, and the limited
    resources for structural details on the other, is the crucial scenario in
    bioinformatics. This imbalance makes it a challenge to identify the
    function(s) of gene(s) of interest immediately.

    However, such large data repositories can be handled effectively only
    through bioinformatics techniques such as the genome-wide survey, which is
    a more sophisticated approach than the traditional gene-by-gene approach and
    provides clues to connect sequences from various genomes that share a
    common function. Methods such as data clustering, principal component
    analysis, artificial neural networks and support vector machines are useful
    for gene/protein prediction, classification, association and the annotation
    of novel proteins, and further support the analysis of functional genomics
    data.

    My current objective is to apply effective bioinformatics approaches, such
    as genome-wide survey and cross-genome phylogenetic analysis, to certain
    GPCRs and ORs in order to propose representative sequences, cluster
    associations, cluster-specific motifs, orthologs, species-specific behavior
    and co-clusters at intra- and inter-genomic levels, and ultimately to
    connect the functional properties of known genes/proteins to unknown ones
    (Figure 1.1).

    In principle, sequence comparison studies, along with reference to

    structural similarities, provide clues to connect functional resemblance

    (Redfern et al 2008) at cellular, biochemical and molecular levels

    (Ye et al 2006).

    This unidirectional hypothesis, which associates sequences, predicts
    structural details and relates biochemical functions to phenotypes, forms
    the baseline of computational biology. Sequence studies across various
    genomes provide the opportunity to identify a group of associated proteins
    based on phylogeny, which can then be exploited for functional relevance.
    This conceptual framework helps to compare sequences from various genomes
    and provides clues to connect sequences of known function to those of
    unknown function. This rationale for the genome-wide survey of
    gene/protein sequences of interest provides a platform to integrate
    knowledge on the sequence-structure-function paradigm for public access
    (Kerrien et al 2011). Thus, sequence studies act as a primary step that
    connects structural and functional studies.

    1.1 PRIOR ART ON GENOME-WIDE SURVEY

    Genome-wide surveys of selected protein families (Metpally and Sowdhamini
    2005, Tripathi and Sowdhamini 2008) illustrate the approach of
    accumulating related proteins (associated gene clusters), identifying
    putative orthologs and observing conserved motifs across various genomes.
    Cross-genome sequence analysis provides knowledge on sequence conservation
    across taxa and on preserved species-specific tendencies, and exhibits
    evolutionary integrity at the cross-genome level (Figure 1.1). In
    particular, cross-genome sequence studies with selected model organisms
    have wide practical application. For instance, a cross-genome phylogenetic
    analysis of selected GPCRs from the human and Drosophila genomes (Metpally
    and Sowdhamini 2005), organized as eight major groups of GPCRs, generated
    32 cross-genome GPCR clusters. Such an approach proved valuable for
    identifying the natural ligands of Drosophila and human orphan receptors.

    Figure 1.1 Central dogma of genome-wide survey on sequences

    Note: Pictorial representation describing the procedures involved in
    genome-wide sequence analysis. Label 1 refers to the selection of genomes of
    interest. Label 2 refers to the collection of non-redundant sequences from
    the selected genomes. Label 3 refers to the cross-genome alignment
    procedure. Label 4 refers to cross-genome phylogeny on sequences. Label 5
    refers to cross-genome cluster association and analysis for
    species-specificity, co-cluster arrangements, identification of orthologs,
    conserved motifs, and observing functional clues to hypothetical proteins
    in the phylogeny.

    Other case studies are highly commendable: a genome-wide survey identifying
    putative serine/threonine protein kinases (STKs) in cyanobacteria (Zhang et
    al 2007); practically useful insights into the symbiotic nitrogen-fixing
    alpha-proteobacterium Sinorhizobium meliloti, based on experimental data
    (Schluter et al 2010); phylogenetic classification of transporters and
    membrane proteins from lower organisms (De Hertogh et al 2002) to
    higher-order organisms (Chang et al 2004); phylogenetic analysis of
    olfactory receptor subfamilies (class I and class II type) in fish (Freitag
    et al 1999) and amphibians (Freitag et al 1995); phylogenetic analysis
    discriminating gustatory and olfactory receptors in Drosophila (Robertson
    et al 2003); phylogenetic grouping of serpentine receptor superfamilies in
    C. elegans (Robertson and Thomas 2006); identification of olfactory
    receptor subfamilies in mouse (Sullivan et al 1996) and human (Glusman et
    al 2001); and the influence of phylogenetic analysis in ethno-medicinal
    studies (Saslis-Lagoudakis et al 2011). These case studies illustrate the
    important applications of genome-wide survey and the use of phylogeny in
    identifying similar or related sequences for a protein of interest across
    genomes.

    1.2 BREAKTHROUGHS IN GPCR CRYSTALLOGRAPHY STUDIES

    Diverse cell-surface proteins account for about 30% of the human genome and
    are well known for their therapeutic importance and applications. Among the
    more than 82,160 structures available in the PDB, crystal structures exist
    for only very few membrane proteins. For crystallization, membrane
    proteins embedded in the lipid bilayer have to be extracted and must form a
    protein-detergent complex (PDC) (Koszelak-Rosenblum et al 2009). In
    addition, the surrounding lipids of the cell membrane interfere with both
    crystallography and nuclear magnetic resonance (NMR) spectroscopy when
    solving three-dimensional structures of membrane proteins. As purification
    and crystallization are critical steps in membrane protein crystallography
    (Dilanian et al 2011), only a limited number of membrane protein structures
    have been reported so far.


    Figure 1.2 Crystal structure of bovine rhodopsin (Li et al 2004)

    a) Crystal structure of bovine rhodopsin displayed in ribbon representation
    (Li et al 2004). The observed seven TM-helices and one peripheral helix are
    colored in rainbow order: TM-helix 1 in dark blue (residues 34-64);
    TM-helix 2 in light blue (71-100); TM-helix 3 in blue-green (106-140);
    TM-helix 4 in yellow-green (150-173); TM-helix 5 in yellow (200-230);
    TM-helix 6 in orange (241-276); TM-helix 7 in red (286-309); helix 8 in
    magenta (311-321).

    b) Space-filling representation of rhodopsin, a photoreceptor protein.

    Rhodopsin was the first GPCR crystal structure to be solved (Palczewski et
    al 2000) (Figure 1.2 a and b). The β1 adrenergic receptor (Warne et al
    2008), β2 adrenergic receptor (Rasmussen et al 2007), adenosine receptor
    (Jaakola et al 2008), dopamine D3 receptor, CXCR4 chemokine receptor (Wu et
    al 2010), histamine receptor and the most recently reported lipid GPCR,
    the sphingosine 1-phosphate receptor (S1P1 receptor), are a few other
    important crystal structures. These structural studies make it possible to
    compare the reference structures with disease-implicated genes through
    modelling in order to interpret dysfunctions. Most of the solved structures
    are used as templates for molecular modelling.


    1.3 GPCRS: POPULAR DRUG TARGETS

    As GPCRs are involved in a wide variety of physiological processes, such
    as regulation of immune system activity and inflammation, cell density
    sensing, the sense of smell, vision, autonomic nervous system transmission,
    and behavioral and mood regulation, they are effectively targeted in
    medicinal chemistry. Several previous reviews highlight the clinical
    importance of GPCRs (Insel et al 2007), and a few examples illustrate the
    importance of GPCR biology in medicine. For instance, a number of monogenic
    mutations in rhodopsin have been identified as causing retinitis
    pigmentosa, and GPCRs are implicated in a number of endocrine disorders and
    in serious illnesses such as schizophrenia (Seeman 1987), Alzheimer's
    disease and Parkinson's disease (Lee et al 1978). Other reported disorders,
    such as genetic disorders of the calcium-sensing receptor (CaSR), Graves'
    disease, cancer, diabetes, heart diseases, neurodegenerative diseases,
    asthma, diseases related to autoimmunity and AIDS, further emphasize the
    multi-functional role of GPCRs and its clinical implications.

    The diversity of GPCRs and their ligand-binding properties make these
    receptors interesting targets for structure-based drug design (Schlyer
    and Horuk 2006) and even open up scope for personalized medicine.

    Notably, AT1 angiotensin, adrenergic, dopamine and serotonin
    (5-hydroxytryptamine, 5-HT) receptor subtypes are the most exploited for
    their clinical importance in related diseases, and all are useful drug
    targets.


    1.4 STRUCTURE AND CELLULAR ACTIVITIES OF MEMBRANE PROTEINS

    Membrane proteins are embedded within the lipid bilayer and are
    designated as transmembrane proteins, since they loop inside and outside
    the cell boundary (Figure 1.2). One class of cell-surface receptors shares
    a set of structural features: an extracellular N-terminus, an intracellular
    C-terminus and seven transmembrane helices (TMHs) connected by three
    intracellular and three extracellular loops. This snake-like arrangement
    gives rise to names such as 7TM receptors, heptahelical receptors and
    serpentine-like receptors (Probst et al 1992). Since the downstream targets
    of such membrane receptors are guanine nucleotide-binding proteins, they
    are also referred to as guanine nucleotide-binding protein-coupled
    receptors, or G-protein coupled receptors (GPCRs), and are well known for
    their versatile functional importance.

    GPCRs are ubiquitous: they play a major part in signal transduction and
    recognize various types of ligands (Bockaert and Pin 1999). Substantial
    evidence on GPCR oligomerization (Prinster et al 2005), participation in
    signaling pathways (Greenwald 2005), clinical importance (Kuwabara and N
    2001) and the availability of repositories for multiple organisms
    (Fredriksson and Schioth 2005) provides significant impetus for the study
    of GPCR sequences and their ligand-binding properties. Ligands can be
    endogenous compounds such as amines, peptides, Wnt proteins or cell-surface
    adhesion molecules, photons, or exogenous compounds such as odorants.

    1.5 MEMBRANE PROTEIN: TOPOLOGY

    Several prediction methods are available online to predict the topology
    of membrane proteins. These methods are mainly based on the hydrophobicity
    profile of the helices. Notably, canonical GPCR members, including
    olfactory receptors in higher-order organisms, exhibit N-out and C-in
    topology (Figure 1.3). Interestingly, Drosophila ORs and GRs instead retain
    an N-in and C-out topology (Bargmann 2006, Benton et al 2006, Lundin et al
    2007), which is also referred to as inverted or reverse topology.

    Methods such as HMMTOP (Tusnady and Simon 2001), SOSUI (Hirokawa et al
    1998), TMHMM (Krogh et al 2001), TMAP, MEMSAT, TMpred, TSEG, TM-finder,
    Pred-TMP, SPLIT, DAS, TopPred II, PRED-TMR2, MPEx, Phobius and TOPCONS are
    popularly used to predict the secondary structure of membrane proteins.
    Methods are also available to discriminate signal peptides (Lao et al 2002)
    in proteins.
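    The hydrophobicity-profile idea behind such predictors can be sketched in a
    few lines. The example below is only an illustration, assuming the
    Kyte-Doolittle hydropathy scale with a 19-residue window and a cut-off of
    1.6; it is not the algorithm used by HMMTOP, TMHMM or the other servers
    named above, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]  # assumes standard residues only
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """1-based (start, end) ranges covered by runs of windows >= threshold."""
    profile = hydropathy_profile(seq, window)
    segments, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i  # first window of a hydrophobic run
        elif value < threshold and start is not None:
            # the run ended at window i - 1, which covers up to residue i - 1 + window
            segments.append((start + 1, i - 1 + window))
            start = None
    if start is not None:  # run extends to the last window
        segments.append((start + 1, len(profile) - 1 + window))
    return segments


if __name__ == "__main__":
    toy = "D" * 10 + "L" * 25 + "K" * 10  # artificial sequence, not a real receptor
    print(candidate_tm_segments(toy))
```

    Because every residue of a passing window is reported, the candidate range
    is wider than the hydrophobic core itself; dedicated predictors refine the
    helix boundaries and also assign the in/out orientation.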

    Figure 1.3 Membrane topology of olfactory receptor (odr-10) in

    C. elegans

    The seven transmembrane helices predicted by HMMTOP for odr-10 are shown in
    a TOPO2 display: residues 12-31 for TM1, 44-63 for TM2, 94-113 for TM3,
    126-145 for TM4, 202-225 for TM5, 256-275 for TM6 and 286-305 for TM7. The
    conserved YRY motif in TM3, ICL2 and the Str superfamily-specific QLF
    motif in ICL3 are highlighted in red.
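    The loop boundaries implied by these helix ranges can be derived
    mechanically. The sketch below uses the TM ranges quoted in the caption and
    assumes an N-out topology, so that the first loop after TM1 is
    intracellular; the function name and the loop labelling are illustrative
    and are not output of HMMTOP or TOPO2.

```python
# HMMTOP helix ranges for odr-10 as quoted in the Figure 1.3 caption (1-based).
ODR10_TM = [(12, 31), (44, 63), (94, 113), (126, 145),
            (202, 225), (256, 275), (286, 305)]


def loops_from_tm(tm_ranges, n_terminus="out"):
    """Label the segments between consecutive helices as alternating ICL/ECL."""
    # With the N-terminus outside the cell, TM1 ends on the cytoplasmic side,
    # so the first connecting loop is intracellular (assumed topology).
    side = "ICL" if n_terminus == "out" else "ECL"
    counts = {"ICL": 0, "ECL": 0}
    loops = []
    for (_, prev_end), (next_start, _) in zip(tm_ranges, tm_ranges[1:]):
        counts[side] += 1
        loops.append((f"{side}{counts[side]}", prev_end + 1, next_start - 1))
        side = "ECL" if side == "ICL" else "ICL"  # each helix flips the side
    return loops


if __name__ == "__main__":
    for name, start, end in loops_from_tm(ODR10_TM):
        print(f"{name}: {start}-{end}")
```

    Under this assumption ICL2 spans residues 114-125 and ICL3 spans residues
    226-255, which is where the caption places the YRY and QLF motifs.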


    1.6 GPCR MECHANISM

    Membrane proteins are actively involved in signal transduction
    (Figure 1.4), in which GPCRs are activated by various external stimuli
    (Rodbell et al 1971). Under the influence of these stimuli, the receptor
    undergoes a conformational change (a minimal rearrangement occurs in the
    TM3 and TM6 helices, although the details remain unclear) that activates a
    guanine nucleotide-binding protein (G-protein). GPCRs are dedicated to
    recognizing intercellular messenger molecules (such as hormones,
    neurotransmitters, lipids, biogenic amines, and growth and developmental
    factors) and several sensory messages (such as light, odors and gustatory
    molecules). The downstream event depends primarily on the type of
    G-protein. For instance, the Golf subunit mainly senses chemosensory
    signals and participates in olfactory signaling pathways (Figure 1.4),
    while the Gs type of G-protein regulates the enzyme adenylate cyclase (AC).
    AC activity is triggered when AC binds to a subunit of the activated
    G-protein; this in turn triggers the cAMP pathway for further transduction,
    resulting in various biological responses. Activation of AC stops when the
    G-protein returns to the GDP-bound state (Figure 1.4). GPCRs are also
    involved in various secondary pathways through effectors such as ion
    channels, adenylyl cyclases and phospholipases.


    Figure 1.4 GPCR signaling pathway

    The image represents GPCR signal transduction, depicting the entry of
    ligands/stimuli, activation of the G-protein subunit, subsequent activation
    of cAMP and the internalization event leading to biological responses.
    (Image adopted from the home page of DB-DRD4, a database of the dopamine D4
    receptor; source: Trends in Pharmacological Sciences. URL:
    http://www.ibibiobase.com/projects/db-drd4/G_protein.htm)

    1.7 GPCR CLASSIFICATION

    GPCRs comprise the most prolific family of cell membrane proteins.
    Knowledge of GPCR classification is necessary since GPCRs are involved in
    various signaling pathways, recognize a diverse set of ligands and are
    related to various biological functions. Candidate GPCRs with the
    characteristic seven TM-helices have been classified with the aid of
    several prediction methods and classifiers. Although candidate GPCRs from
    all families retain seven TM-helices connected by ICLs and ECLs, their
    sequences differ and they exhibit subtle structural diversity (Gether
    2000). The GPCR superfamily is classified mainly into class A
    (rhodopsin-like), class B (secretin-like), class C (metabotropic
    glutamate), class D (fungal pheromone), class E (cAMP receptors) and class
    F (frizzled/smoothened) (Kristiansen 2004). Class A is the largest,
    accounting for 80% of the distribution, and retains diverse receptors such
    as rhodopsin, olfactory, biogenic amine, bioactive lipid, nucleic acid and
    peptide receptors. Receptors such as the secretin, calcitonin, glucagon,
    parathyroid hormone and vasoactive intestinal peptide receptors belong to
    class B. Class C includes the metabotropic glutamate receptors (mGluRs),
    the Ca2+-sensing receptor, γ-aminobutyric acid type B (GABA-B) receptors
    and vomeronasal receptors type 2. Class D retains receptors such as the
    fungal pheromone P and α-factor receptors (STE2/MAM2), whereas the fungal
    pheromone A and M-factor receptors (STE3/MAP3) are related to class E.
    Class F retains slime mold cyclic adenosine monophosphate (cAMP)
    receptors. Recently, a few other GPCR families, such as frizzled type
    receptors/FRZ (Vinson and Adler 1987, Bhanot et al 1996), smoothened type
    receptors/SMT (Alcedo et al 1996, Nehme et al 2010), vomeronasal receptors
    type 1/VNS (Dulac and Axel 1995), ocular albinism (Schiaffino et al 1996,
    Schiaffino et al 1999) and plant receptors (Grill and Christmann 2007),
    i.e., the Arabidopsis thaliana receptor GCR1 (Josefsson and Rask 1997,
    Perfus-Barbeoch et al 2004), have also been added to the existing GPCR
    families. Classes A, B and C cover nearly 600 GPCRs in the human genome,
    excluding putative candidate GPCRs. Notably, olfactory receptors (ORs) are
    members of class A and are dealt with exclusively in Chapter 6, under the
    title of genome-wide survey on olfactory receptors in selected eukaryotes.

    1.7.1 Olfactory Receptors (ORs)

    The sense of smell, the process of olfaction, is beyond simple scientific
    understanding. In general, the chemical senses are broadly divided into
    olfaction (the sense of smell) and gustation (the sense of taste). A
    critical understanding of olfaction is necessary not only from a
    biological or chemical perspective, but also because olfaction is a
    powerful socio-cultural phenomenon (Low 2005).

    Olfactory receptors participate in sensing diverse chemical stimuli, or
    odors (Firestein 2001). ORs are fascinating for their functional
    significance in detecting food, assessing its quality, enhancing its
    flavor, indicating the presence of potential toxins and pathogens, and
    conveying information about reproductive status, gender, genetic identity,
    conspecifics, mates and threats. ORs activate chemosensory cells, leading
    to neural recognition, and influence behaviour, hormonal state and mood
    (Munger et al 2009). Because of these diverse roles, ORs are important and
    present in our everyday experience, and they need to be explored in more
    detail for their wide practical applications in the pharmaceutical
    industry (aroma therapy), the cosmetic industry (scent/perfume
    manufacturing) and the food industry, and for the study of olfacto-sexual
    function, olfacto-neural communication, olfactory disorders and so on.
    Thus, performing a genome-wide survey on the ORs of selected eukaryotic
    organisms will add to scientific knowledge and ultimately serve human
    benefit.

    1.7.2 Classical Knowledge on Olfactory Receptors

    The landmark paper published in 1991 by Nobel Laureates Buck and Axel
    explained the role of olfactory receptors and the organization of the
    olfactory system (Buck and Axel 1991). Around three percent of our genes
    code for different odorant receptors on the membrane of the olfactory
    receptor cells. Further studies provide a remarkable background: the
    phylogenetic approach to discriminating the class I and class II type
    receptors that sense water- and air-borne odors in higher eukaryotes, i.e.,
    human and mouse (Zozulya et al 2001, Niimura and Nei 2005); studies of
    insect olfaction (Robertson et al 2003) and nematode olfaction (Robertson
    and Thomas 2006); work on olfactory signaling; the availability of ORs in
    various genomes; and the common peptides observed in OR subfamilies
    (Gottlieb et al 2009). Together these facilitate the genome-wide survey of
    ORs in selected eukaryotic genomes to identify OR subclusters,
    cluster-specific motifs, species-specific tendencies and co-clusters in
    tree topology (see Chapter 6 for more details).


    1.7.3. Olfactory Signaling Pathway in Human ORs

    The process of olfaction starts with the binding of an odorant to a
    specific receptor on a sensory neuron, where chemical energy is transformed
    into electrical signals that give rise to the sense of smell. Such binding
    activates Golf, a G protein. The alpha subunit of Golf activates the enzyme
    adenylyl cyclase, generating the major second messenger 3',5'-cyclic
    adenosine monophosphate (cAMP), which directly opens the cyclic
    nucleotide-gated channel. This allows Na+ and Ca2+ to flow in and
    depolarize the cell. Depolarization causes action potentials (nerve
    impulses), which are sent to the olfactory bulb. A second pathway
    involving guanylyl cyclase GC-D has also been proposed (Meyer et al 2000).
    The human nose expresses different types of receptors, which enable the
    main olfactory system to encode thousands of odorants through a common
    pathway (Figure 1.5 a and b).

    Figure 1.5 ORs and organization of the olfactory system in mammals
    and OR signaling pathway (Meyer et al 2000)

    a) Pictorial representation of ORs and the organization of the olfactory
    system in mammals.

    b) OR signaling pathway, depicting the two proposed hypotheses of OR
    signal transduction (Meyer et al 2000). The upper panel describes the entry
    of various odors, which are recognized by ORs and initiate the cAMP
    signaling pathway involving a G protein (Golf), an adenylyl cyclase
    (ACIII), a cyclic nucleotide-gated (CNG) channel (341b) and a chloride
    channel (ClC). After the response, cAMP is degraded by a CaM-dependent
    phosphodiesterase (PDE1C2). The other hypothesis (lower panel) explains the
    components of the cGMP signaling pathway and putative targets of cGMP,
    which involve the receptor guanylyl cyclase GC-D, cGMP-regulated PDE2, an
    unknown cGMP-regulated ion channel and the known CNG channel of the
    cAMP signaling pathway.


    1.7.4. ORs, GRs and IRs in Drosophila

    Olfactory neurons play a central role in sensing volatile cues, giving the
    organism the ability to detect food, predators and mates. Gustatory
    neurons, in contrast, sense soluble chemical cues that elicit feeding
    behaviours. In insects, the taste neurons initiate innate sexual and
    reproductive responses.

    Nearly 60 olfactory receptors (Berkeley Drosophila Genome Project
    database) are believed to play a major role in identifying and
    discriminating diverse odors for insect survival, and this Drosophila
    olfactory receptor (DOR) gene family has been identified as G-protein
    coupled receptors (Clyne et al 1997, Gao and Chess 1999, Vosshall and
    Stocker 2007). These proteins are expressed in distinct subsets of
    olfactory neurons, and certain family members are restricted to distinct
    portions of the olfactory system. Nearly the same number of gustatory
    receptors (GRs) serve gustatory functions (Clyne et al 1997).

    Notably, insect GRs have the same transmembrane topology as ORs. The
    ionotropic receptors (IRs) in Drosophila, relatives of the ionotropic
    glutamate receptors, are referred to as a new family of odorant receptors;
    these proteins accumulate in sensory dendrites and are not present at
    synapses. Whereas ionotropic glutamate receptors mediate chemical
    communication between neurons at synapses, IRs are expressed in a
    combinatorial fashion in sensory neurons that respond to many distinct
    odors but do not express either insect odorant receptors (ORs) or gustatory
    receptors (GRs).

    1.7.5. Insect Olfaction (Drosophila ORs)

    Several fundamental studies have been published (Siddiqi 1990, Clyne et al
    1999) investigating the molecular mechanism of Drosophila olfaction.
    Electrophysiological studies explained the differentiation in the
    morphology of the olfactory sensilla and their distribution patterns
    (Venkatesh and Singh 1984, Stocker 1994). Studies suggest that there are 30
    different classes of ORNs in the antenna (in adult ~40), based on the odor
    response profile of individual neurons, and a few exhibit odor specificity.
    Notably, 24 antennal receptors (Or2a, Or47b, Or33b, Or49b, Or65a, Or23a,
    Or85f, Or88a, Or67c, Or43a, Or7a, Or43b, Or59b, Or9a, Or85a, Or47a, Or22a,
    Or19a, Or67a, Or35a, Or98a, Or85b, Or82a and Or10a) were tested
    experimentally with 110 odorant molecules using the empty neuron system
    (Dobritsa et al 2003), and the responses of the receptors vary across
    chemical classes.

    Generally, functional insect ORs combine a variable OR with a constant
    receptor called OR83b, which together form a heteromeric complex that
    participates in the signaling pathway. OR83b is also called the
    co-receptor (Vosshall and Stocker 2007) for its functional importance. In
    the literature (Larsson et al 2004), it is also reported that heteromeric
    insect ORs comprise a new class of ligand-activated non-selective cation
    channels (Sato et al 2008).

    Notably, insect ORs lack homology to the G-protein coupled chemosensory
    receptors of vertebrates and exhibit drastically different mechanisms of
    olfaction. Recent studies describe insect ORs as heteromeric ligand-gated
    ion channels (more details in Chapter 6).

    1.7.6. Nematode Olfaction

    Chemosensory receptors in nematodes are highly diverse and large in
    number. Since worms lack both auditory and visual senses, chemosensation
    plays a central role in nematode survival. In C. elegans, chemosensory
    receptors belong to the G-protein coupled receptors and retain seven
    transmembrane helices. Around 1330 genes and 400 pseudogenes have been
    identified as chemoreceptors in C. elegans (Robertson and Thomas 2006).
    Many of these receptors are known as serpentine receptors, and around 19
    large gene families have been reported so far. Among this large number of
    proteins, only one, odr-10 (Figure 1.3), has been reported as an olfactory
    receptor in C. elegans (Sengupta et al 1996).

    1.7.7. Mouse Olfaction

    As in human, mouse ORs fall into two broad classes with excellent
    bootstrap support (Glusman et al 2001). Class I ORs, as found in fish and
    frog, had been considered an evolutionary relic in mammals (Ngai et al
    1993), whereas class II receptors are found in amphibians and terrestrial
    vertebrates (Freitag et al 1995). There are 147 class I OR genes in the
    mouse OR subgenome, of which 120 are potentially functional. In mouse, all
    of the class I ORs are located in a single large cluster on chromosome 7.

    1.8 DATA REPOSITORIES FOR MEMBRANE PROTEINS

    A large number of data repositories and topology prediction servers are
    available exclusively for membrane proteins. Notable GPCR-related
    repositories (Elefsinioti et al 2004) include gpDB (Theodoropoulou et al
    2008) and GPCRDB, along with integrated web resources such as the G
    Protein Coupled Receptor - Oligomerization Knowledge Base Project and the
    GPCR Natural Variants database (NaVa). The SEVENS database (Ono et al 2005)
    provides useful sequence information, chromosomal locations and
    intra-genomic phylogenetic clusters for membrane proteins from more than 50
    eukaryotic organisms. IUPHAR (Committee on Receptor Nomenclature and Drug
    Classification) incorporates detailed pharmacological, functional and
    patho-physiological information on GPCRs, voltage-gated ion channels,
    ligand-gated ion channels and nuclear hormone receptors.

    Related structural resources such as PDBTM and TOPDB (Tusnady et al
    2008) provide collections of domains and sequence motifs. TMpad (Trans
    Membrane Protein Helix-Packing Database) and MPDB (Membrane Protein Data
    Bank) provide structural information on integral, peripheral and anchored
    membrane proteins and also peptides (Raman et al 2006).

    Data repositories for olfactory receptors are also available for public
    access. ORDB (Skoufos et al 2000), HORDE (The Human Olfactory Data
    Explorer) and the integrated SenseLab web resources for ORs, with
    associated links such as odorDB and odorMapDB, are highly useful and
    particularly relevant for retrieving olfactory receptor (OR) sequences from
    multiple genomes.

    1.9 COLLECTION OF GPCR HOMOLOGUES

    Sequence similarity searches are robust techniques for identifying the

    nearest homologues of a query sequence in a database of interest. Pairwise

    comparison of proteins is a fundamental step in sequence similarity

    searches. The similarity scores depend on sequence features such as the

    amino acids aligned and the permitted amino acid substitutions (AAS).

    Generally, when a query and a subject align with a high similarity score,

    they are inferred to be related and are called homologues. Homologues are

    further classified into orthologs and paralogs. Orthologous proteins

    evolved from a common ancestral gene and belong to two different genomes,

    whereas paralogs were generated by gene duplication and belong to the same

    genome. Thus, homologues share

    significant sequence similarity and can be further connected for their

    functional relevance. An appropriate similarity search technique must be

    selected when dealing with evolutionarily distant sequences, and

    particularly with membrane proteins. Each method has its own scoring scheme

    for amino acid substitutions and gap penalties.

    Functionally and evolutionarily important protein similarities can

    be recognized by comparing three-dimensional structures, but when structures

    are not available, patterns of conservation such as motifs, profiles, position-

    specific scoring matrices, and Hidden Markov Models can be used to identify

    related sequences from the database of protein sequences.

    Several methods, such as the sequence-based searches BLAST (Altschul et al

    1997) and FASTA (Lipman and Pearson 1985), the profile-based search IMPALA

    (Schaffer et al 1999), and other approaches such as PSI-BLAST and

    RPS-BLAST, are effectively used to find homologues and to identify their

    common functional relevance.

    1.9.1 BLAST (Basic Local Alignment Search Tool)

    Two sequences are compared by producing a quality alignment that

    maximizes the correspondence between similar residues and minimizes gaps

    (Altschul et al 1997). The objective is to match a sequence of unknown

    function with characterized/annotated proteins from model organisms, so

    that structure and function can be extrapolated to the new sequence.

    Dynamic programming yields optimal local or global alignments, whereas

    BLAST and FASTA (Lipman and Pearson 1985) are faster heuristic methods

    that report local alignments. Conceptually, the heuristic approach (BLAST)

    can deal with sequences differing considerably in length and identifies

    islands of

    short matches. It approximates the Smith-Waterman algorithm (Smith and

    Waterman 1981), which is guaranteed to find the optimal local alignment

    with respect to the scoring system, and reports maximal scoring segment

    pairs (MSPs). The scoring system mainly comprises the substitution matrix

    and the gap-scoring scheme used to align the sequences. BLAST offers five

    main search programs (blastp, blastn, blastx, tblastn and tblastx) for

    different combinations of nucleotide and protein queries and databases.
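
    The optimal local alignment that heuristic searches aim to approximate can be
    illustrated with a minimal Smith-Waterman sketch. The match, mismatch and
    linear gap scores below are hypothetical toy values; real searches use a
    substitution matrix such as BLOSUM or PAM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the optimal local alignment.

    Toy scoring (assumption): fixed match/mismatch scores and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The shared island "WLVF" is recovered from two otherwise unrelated sequences.
print(smith_waterman("AAAWLVFAAA", "GGGWLVFGGG"))
```

    The quadratic cost of filling the whole matrix is what makes seeded heuristic
    searches attractive for database-scale comparisons.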

    BLAST reports statistically significant alignments in its output,

    and raw scores, bit scores and E-values are used to quantify alignment

    significance. Among them, E-values are most often used. Generally, the

    lowest E-values indicate the most significant alignments. An E-value is

    the number of alignments one expects to find with a score greater than or

    equal to the observed alignment score in a search against a random

    database. The PAM (point accepted mutations per 100 residues) amino acid

    scoring matrices, which are based on an explicit evolutionary model

    (Dayhoff et al 1978), are provided in the BLAST software distribution and

    include PAM40, PAM120 and PAM250, whereas the BLOSUM matrices are based on

    an implicit model of evolution and include BLOSUM 45, 62 and 85 (Henikoff

    and Henikoff 1992). Generally, these matrices are appropriate for

    globular proteins, whereas PAM and JTT-200 (Jones et al 1992) can be used

    for membrane proteins.
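
    The relationship between raw scores, bit scores and E-values can be sketched
    with the standard Karlin-Altschul formulas. The function names are
    illustrative, and the statistical parameters lambda and K are treated here as
    plain inputs (assumption); in practice they depend on the scoring system, and
    effective rather than actual lengths are used.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bit_score(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Each additional bit of score halves the expected number of chance alignments.
for bits in (20, 30, 40):
    print(bits, evalue_from_bit_score(bits, 1024, 1024))
```

    Because the E-value scales with the search space, the same alignment is less
    significant against a larger database.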

    1.9.2 PSI-BLAST (Profile Vs Sequence comparison method)

    Among the five BLAST programs, the work described in this thesis

    mostly relies on the basic protein BLAST technique, which includes blastp

    (protein-protein BLAST), PSI-BLAST (Position Specific Iterated BLAST),

    PHI-BLAST (Pattern Hit Initiated BLAST) and DELTA-BLAST (Domain

    Enhanced Lookup Time Accelerated BLAST). As the name suggests, blastp

    compares a protein query with a protein database. PSI-BLAST builds a PSSM

    (position-specific scoring matrix) from the results of the first blastp

    run and then uses this profile iteratively as the query against the

    database of protein sequences (Altschul et al 1997). The profile

    generated at each iteration is searched against the database until

    convergence, that is, until no new sequences are found. This method is

    therefore effective in detecting even distantly related sequences with

    remote homology. The approach can be further improved, as per user

    requirement, by using jump-start PSI-BLAST (Altschul et al 1997), the

    jack-knife approach, the HOE (homologous over-extension) reduced profile

    search (Gonzalez and Pearson, 2010) and improved PSI-BLAST search

    techniques such as cascade PSI-BLAST (Bhadra et al 2006).
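
    The iterate-until-convergence idea can be sketched with a toy profile search.
    This is not the PSI-BLAST implementation: it assumes ungapped sequences of
    equal length, a uniform background, a single pseudocount and a fixed,
    hypothetical score threshold instead of E-values.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # uniform background: a simplifying assumption


def build_pssm(members, pseudocount=1.0):
    """Log-odds score per column and residue from equal-length sequences."""
    pssm = []
    for column in zip(*members):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
                     for aa in ALPHABET})
    return pssm


def score(pssm, seq):
    """Ungapped profile score of a sequence of the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from all hits each round until no new hit appears."""
    members = [query]
    for round_no in range(1, max_rounds + 1):
        pssm = build_pssm(members)
        hits = [s for s in database if s not in members and score(pssm, s) >= threshold]
        if not hits:
            return members, round_no  # converged: no new sequences found
        members.extend(hits)
    return members, max_rounds


database = ["ACDEFGHAKV", "ACDEFGHIKV", "WWWWWWWWWW"]
members, rounds = iterate_search("ACDEFGHIKL", database, threshold=8.0)
print(members, rounds)
```

    In this toy run the more divergent sequence is recruited only in the second
    round, after a closer homologue has been added to the profile.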

    1.9.3 Reverse PSI-BLAST (Sequence Vs Profile comparison method)

    The reverse PSI-BLAST technique (RPS-BLAST) is highly effective for

    associating remotely related sequences. It differs from other sequence

    searches in that the query sequences are searched against a database of

    PSSM (position-specific scoring matrix) profiles. PSSMs give the amino

    acid propensities at each sequence position based on a multiple

    alignment. PSSM generation also uses the sequence weights of the multiple

    alignment, the expected number of amino acids and the frequencies of

    unobserved amino acids (pseudocounts). Representative sequences from

    protein families (for example, 3PFDB; Shameer et al 2009), related domains

    and cluster types can be used to generate profiles that represent

    sequence properties as a block of consensus amino acids. The sequence

    search space is thereby broadened, making it possible to connect

    sequences at remote homology (Figure 1.6).

    Methods that compare protein sequences against a database of

    protein sequences have some limitations. If stringent criteria are

    applied in such sequence-against-sequence searches, very distantly

    related sequences may be missed. RPS-BLAST, in contrast, helps to

    associate even distantly related sequences with their related profiles.

    Its practical applications include generating cross-genome phylogenies,

    finding new members, associating evolutionarily distant sequences,

    classification and transferring functional annotation to new sequences

    based on known data. The method should be employed with care in

    designing profiles, setting significant E-value thresholds and

    interpreting the search results for related profiles.
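
    The sequence-versus-profile direction can be sketched as follows: one query is
    scored against a small database of profiles and assigned to the best-scoring
    one. This is a toy illustration, not the RPS-BLAST algorithm; the cluster
    names, the short alignments and the ungapped sliding-window scoring are all
    assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM from an ungapped alignment, assuming a uniform background."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in ALPHABET})
    return pssm


def best_window_score(pssm, seq):
    """Best ungapped score of the profile slid along the sequence."""
    width = len(pssm)
    if len(seq) < width:
        return float("-inf")
    return max(sum(col[aa] for col, aa in zip(pssm, seq[i:i + width]))
               for i in range(len(seq) - width + 1))


def assign_profile(query, profile_db):
    """Score the query against every profile and return the best match."""
    scores = {name: best_window_score(p, query) for name, p in profile_db.items()}
    return max(scores, key=scores.get), scores


# Hypothetical profile database built from two tiny toy alignments.
profile_db = {
    "cluster_A": build_pssm(["DRYLA", "DRYVA", "ERYLA"]),
    "cluster_B": build_pssm(["NPWWY", "NPFWY", "NPLWY"]),
}
best, scores = assign_profile("MKTDRYLAIVG", profile_db)
print(best, scores)
```

    In a real survey the score cut-off would be replaced by an E-value threshold,
    which is why threshold selection needs the care noted above.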

    Separately, Hidden Markov Models (HMMs) can also be used for

    pattern recognition; they provide a mathematical representation of a

    protein sequence (Eddy 1998, Karplus et al 1998). HMMs have been used for

    gene prediction, recognition of transmembrane helices (Sonnhammer et al

    1998), phylogenetic analysis (Felsenstein and Churchill, 1996) and distant

    homology detection (Krogh et al 1994b). Machine learning approaches are

    appropriate for pattern recognition problems and for recognizing remote

    homology. Methods such as support vector machines (SVMs) (Pugalenthi et al

    2010) are effectively used in classification problems, where a model

    trained on sequences with known labels (positive and negative sets) is

    used to classify unknown gene/protein sequences and to propose putative

    members; the predictions rely upon the training dataset.
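
    How an HMM can recognize a transmembrane helix may be illustrated with a
    two-state Viterbi decoder. The states, transition probabilities and emission
    probabilities below are hypothetical toy values, far simpler than the models
    used by published topology predictors, and the input is assumed to be
    non-empty.

```python
import math

HYDROPHOBIC = set("AILMFVW")
STATES = ("-", "H")  # "-" = loop, "H" = membrane helix
START = {"-": 0.5, "H": 0.5}
TRANS = {"-": {"-": 0.9, "H": 0.1}, "H": {"-": 0.1, "H": 0.9}}
# Emission probabilities for (hydrophobic, not hydrophobic) residues: toy values.
EMIT = {"-": {True: 0.3, False: 0.7}, "H": {True: 0.8, False: 0.2}}


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    obs = [aa in HYDROPHOBIC for aa in seq]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("KKKKLLLLLLLLKKKK"))
```

    The sticky transitions prevent single hydrophobic residues from being called
    helices, which is the behaviour a hydropathy threshold alone cannot provide.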

    Figure 1.6 Overview of the techniques involved in the genome-wide survey

    The diagram depicts the use of available data repositories related to membrane

    proteins (GPCRDB, SEVENS DB, ORDB, HORDE and so on), followed by the collection of

    sequences, prediction of membrane topology and redundancy filtering as the primary steps

    for the cross-genome studies. The methodology starts with sequence search programs

    (such as BLAST, PHI-BLAST, PSI-BLAST, RPS-BLAST) to identify homologous sequences and to

    perform cross-genome analysis.

    1.10 MULTIPLE SEQUENCE ALIGNMENT TECHNIQUES

    Alignment procedures play a crucial role (Figure 1.1 and

    Figure 1.6) in analyzing the relationships among diverse sequences. Two

    or more sequences are arranged by aligning them at common properties or

    sites. Weights can be assigned to the aligned elements to determine the

    degree of relatedness or to detect homology among the sequences. A

    pairwise alignment involves two sequences and a multiple sequence

    alignment (MSA) involves many; both facilitate sequence comparison

    studies, and various alignment methods are available. An MSA can be

    regarded as a generalization of pairwise sequence alignment. Here,

    instead of aligning two sequences, n sequences are aligned

    simultaneously, where

    n is always >2; such alignments are therefore called multiple sequence

    alignments, and they are made possible by introducing gaps (-) into the

    sequences.

    Membrane proteins differ considerably from globular proteins in

    sequence composition. The region that inserts into the cell membrane

    possesses different hydrophobicity patterns when compared to soluble

    proteins. Multiple sequence alignment techniques designed for

    globular proteins are not optimal for aligning transmembrane proteins, and

    the recommended alignment procedures (Pirovano 2008) should be employed

    with care. When sequences from different genomes are aligned together,

    the alignment is referred to as a cross-genome sequence alignment and the

    resulting phylogeny as a cross-genome phylogeny (Figure 1.6).

    1.10.1 CLUSTAL W

    CLUSTAL W (Thompson JD, 1994) is a popular MSA tool. Its procedure

    consists of three main stages: 1) all pairs of sequences are aligned

    separately in order to calculate a distance matrix giving the divergence

    of each pair of sequences; 2) a guide tree is generated from the distance

    matrix; 3) the sequences are progressively aligned according to the

    branching order in the guide tree.

    Initially, the CLUSTAL W program applies a fast approximate (heuristic)

    method based on the number of k-tuple matches (k being the size of the

    exactly matching fragment used) to generate pairwise distances (Wilbur and

    Lipman, 1983). Later, a dynamic programming algorithm was used to enhance

    accuracy by scoring with gap opening penalties (GOP) and gap extension

    penalties (GEP). The method improves alignment quality by using amino acid

    weight matrices such as the BLOSUM series (80, 62, 45, 30), the PAM series

    (20, 60, 120, 350) and the GONNET

    matrix series (80, 120, 160, 250 and 350; can be used for larger datasets).

    Though CLUSTAL W is handy for aligning large numbers of sequences with

    reliable accuracy, a few alignment tools are recommended specifically for

    transmembrane proteins; these differ conceptually in that TM helices and

    loops are aligned using different matrices (for example PRALINE TM and MAFFT).
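
    The fast k-tuple stage that feeds the guide tree can be sketched as below.
    The distance used here (one minus the fraction of shared k-tuples) is a
    simplified stand-in for the Wilbur and Lipman score, not the CLUSTAL W
    formula, and the sequence names are hypothetical.

```python
from itertools import combinations


def ktuples(seq, k):
    """Set of all exactly matching fragments (k-tuples) of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ktuple_distance(a, b, k=2):
    """1 - shared k-tuples / k-tuples in the smaller set (simplified measure)."""
    ka, kb = ktuples(a, k), ktuples(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def distance_matrix(seqs, k=2):
    """Stage 1: pairwise distances for every pair of named sequences."""
    dist = {(n, n): 0.0 for n in seqs}
    for x, y in combinations(list(seqs), 2):
        dist[(x, y)] = dist[(y, x)] = ktuple_distance(seqs[x], seqs[y], k)
    return dist


def closest_pair(dist):
    """Stage 2 starts from the least divergent pair when building the guide tree."""
    return min((p for p in dist if p[0] < p[1]), key=dist.get)


seqs = {"s1": "ACDEFGHIK", "s2": "ACDEFGHIR", "s3": "WYWYWYWYW"}
dist = distance_matrix(seqs)
print(dist[("s1", "s2")], closest_pair(dist))
```

    Counting shared fragments avoids a full alignment for every pair, which is
    why this stage scales to large sequence sets.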

    1.10.2 PRALINETM

    Servers for aligning TM proteins (like PRALINETM) are more specific:

    the transmembrane regions are first predicted (Pirovano 2008). Reliable

    topology prediction methods define the boundaries of TM domains and loops

    as an initial requirement. PRALINETM uses HMMTOP v2.1 (Tusnady and Simon,

    2001), TMHMM v2.0 (Krogh et al 2001) and Phobius (Käll et al 2007) for

    membrane topology prediction. The profile scoring scheme then applies

    TM-specific substitution scores from matrices like PHAT to compare TM

    positions reliably. Finally, an iterative scheme is applied to enhance

    the alignment quality. The PHAT matrix (Ng et al 2000) has been reported

    to outperform the JTT matrix (Jones et al 1992), especially in database

    searching (Ng et al 2000). An earlier method, STAM (Shafrir and Guy, 2004),

    is also useful and was the first multiple sequence alignment program

    targeted at transmembrane proteins.
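
    The idea of scoring TM positions and loop positions with different matrices
    can be sketched as follows. The two matrices hold hypothetical toy scores
    (not PHAT or BLOSUM values), the topology string is assumed to come from a
    prior prediction step, and the function names are illustrative.

```python
def score_column(a, b, in_membrane, tm_matrix, loop_matrix, gap=-4):
    """Score one aligned column with the matrix that matches its region."""
    if a == "-" or b == "-":
        return gap
    matrix = tm_matrix if in_membrane else loop_matrix
    return matrix[tuple(sorted((a, b)))]


def score_alignment(seq_a, seq_b, topology, tm_matrix, loop_matrix):
    """Sum column scores; the topology string marks membrane columns with 'M'."""
    return sum(score_column(a, b, t == "M", tm_matrix, loop_matrix)
               for a, b, t in zip(seq_a, seq_b, topology))


# Hypothetical toy scores keyed by sorted residue pair.
TM_TOY = {("L", "L"): 2, ("K", "L"): -5, ("K", "K"): 4}
LOOP_TOY = {("L", "L"): 4, ("K", "L"): -2, ("K", "K"): 5}

total = score_alignment("LLLKKL", "LLKKKL", "MMM...", TM_TOY, LOOP_TOY)
print(total)
```

    Here a charged residue substituted inside the membrane is penalized more
    heavily than the same substitution in a loop, which is the behaviour a
    TM-specific matrix is meant to capture.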

    1.10.3 MAFFT

    MAFFT (Multiple Alignment using Fast Fourier Transform) can be

    used for aligning large datasets of transmembrane proteins. Compared with

    other alignment programs, the method improves alignment accuracy even for

    sequences having large insertions or extensions, as well as for

    distantly related sequences of similar length. The MAFFT alignment program

    (Katoh et al 2002) offers two different heuristics: the

    progressive method (FFT-NS-2) and the iterative refinement method

    (FFT-NS-i). Another important feature of the program is that the number of

    input sequences can be very large, and it offers a range of multiple alignment

    methods such as L-INS-i (accurate; for alignment of


    by assigning probabilities to every possible evolutionary change at

    informative sites, and by maximizing the total probability of the tree, the

    optimal tree can be found. The NJ method eliminates possible errors that

    can occur with the UPGMA method. The NJ algorithm not only evaluates

    pairwise distances (using distance matrices), but also selects neighbors

    that minimize the total length of the tree. The NJ method is recommended

    for sequences whose evolutionary distances are short.
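
    One neighbor-joining step can be sketched as follows: the pair to join is the
    one minimizing Q(i,j) = (n-2) d(i,j) - r(i) - r(j), where r is the sum of
    distances from a taxon to all others. The five-taxon distance matrix is a toy
    example and the helper names are illustrative.

```python
def symmetric(pairs):
    """Expand {(i, j): distance} into a nested, symmetric lookup table."""
    d = {}
    for (i, j), value in pairs.items():
        d.setdefault(i, {})[j] = value
        d.setdefault(j, {})[i] = value
    return d


def nj_pick_pair(names, d):
    """Return the pair of neighbors with the smallest Q value, and that value."""
    n = len(names)
    r = {i: sum(d[i][j] for j in names if j != i) for i in names}
    best, best_q = None, None
    for a in range(n):
        for b in range(a + 1, n):
            i, j = names[a], names[b]
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best_q is None or q < best_q:
                best, best_q = (i, j), q
    return best, best_q


def nj_branch_lengths(i, j, names, d):
    """Branch lengths from i and j to their new parent node (requires n > 2)."""
    n = len(names)
    r_i = sum(d[i][k] for k in names if k != i)
    r_j = sum(d[j][k] for k in names if k != j)
    to_i = d[i][j] / 2 + (r_i - r_j) / (2 * (n - 2))
    return to_i, d[i][j] - to_i


names = ["a", "b", "c", "d", "e"]
d = symmetric({("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
               ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
               ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3})
pair, q = nj_pick_pair(names, d)
print(pair, q, nj_branch_lengths(pair[0], pair[1], names, d))
```

    Note that d and e have the smallest raw distance, yet a and b are joined
    first; correcting for the overall divergence of each taxon is what
    distinguishes NJ from UPGMA.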

    Multiple packages are available for both standalone and on-line use.

    Suites like PHYLIP, TREE-PUZZLE and MEGA are user-friendly and are

    appropriate tools for phylogenetic analysis with both the ML and NJ methods.

    1.11.1 PHYLIP

    PHYLIP (Phylogeny Inference Package) (Felsenstein, 1981) is a

    free computational phylogenetic package consisting of 35 portable programs.

    It supports parsimony, distance matrix and likelihood methods,

    including bootstrapping and consensus trees.

    1.11.2 TREE-PUZZLE

    TREE-PUZZLE is a popular computer program for reconstructing phylogenetic

    trees from molecular sequence data (nucleotide or protein sequences) based

    on the maximum likelihood (ML) method (Schmidt et al 2002). It implements

    the quartet puzzling algorithm. The average distance between all pairs of

    sequences (maximum likelihood distances) is computed; these distances can

    be viewed as a rough measure of the overall sequence divergence. The

    reconstruction is performed in three steps. In the ML step, quartets are

    formed from the n supplied sequences (n being the number of sequences in

    the alignment). All quartets are evaluated using the ML method and the

    three possible quartet topologies, ab|cd, ac|bd and ad|bc, are weighted

    by their posterior probabilities. In the puzzling step, an intermediate

    tree is built from the quartet trees by adding sequences one by one. As

    this step

    is highly dependent on the order of sequences, many intermediate trees from

    different input orders are constructed. In the consensus step, a majority

    rule consensus tree is built from the generated intermediate trees. These

    two steps are time-consuming, and the result files (.dist, .puzzle and

    .outtree) are useful for interpreting tree topologies. Evolutionary

    models such as the DAYHOFF, JTT and mtREV24 (Adachi and Hasegawa, 1996; for

    use with proteins encoded on mtDNA) matrices are provided. Others, like

    BLOSUM 62 and the WAG model (Whelan and Goldman, 2004), are for more

    distantly related amino acid sequences. VT is likewise for use with

    proteins of distant relationships (Muller and Vingron 2000).
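
    The bookkeeping of the ML step can be sketched as follows: enumerate all
    quartets, list the three topologies of each, and convert three
    log-likelihoods into posterior weights. Equal prior probabilities are
    assumed, the log-likelihood values are hypothetical, and the function names
    are illustrative rather than part of TREE-PUZZLE.

```python
import math
from itertools import combinations


def all_quartets(names):
    """Every subset of four sequences; there are n choose 4 of them."""
    return list(combinations(names, 4))


def quartet_topologies(a, b, c, d):
    """The three possible unrooted topologies for four sequences."""
    return [f"{a}{b}|{c}{d}", f"{a}{c}|{b}{d}", f"{a}{d}|{b}{c}"]


def posterior_weights(log_likelihoods):
    """Normalise three log-likelihoods into weights summing to 1 (equal priors)."""
    top = max(log_likelihoods)
    relative = [math.exp(value - top) for value in log_likelihoods]
    total = sum(relative)
    return [value / total for value in relative]


print(len(all_quartets("abcde")))
print(quartet_topologies("a", "b", "c", "d"))
print(posterior_weights([-1200.0, -1203.0, -1205.0]))  # hypothetical values
```

    Subtracting the largest log-likelihood before exponentiating avoids numerical
    underflow, since real log-likelihoods are large negative numbers.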

    1.11.3 MEGA (Molecular Evolutionary Genetics Analysis)

    MEGA is a user-friendly software package for phylogenetic studies, which

    also integrates sequence alignment approaches like CLUSTAL W and

    MUSCLE. MEGA 5 can be employed for phylogenetic reconstruction and

    phylogeny visualization, and for testing an array of evolutionary

    hypotheses using maximum likelihood (ML), maximum composite likelihood

    (MCL), neighbor-joining (NJ), minimum evolution (ME) and maximum parsimony

    (MP), producing bootstrap trees for the required number of replications.

    MEGA displays tree topologies legibly in rectangular, radial and circular

    layouts (Kumar et al 2008).
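
    The resampling behind bootstrap replications can be sketched as below:
    alignment columns are drawn with replacement to create pseudo-alignments,
    each of which would then be passed to a tree-building method. This is a
    generic illustration with a toy alignment, not MEGA's implementation.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; rows stay in register."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicates(alignment, replicates, seed=0):
    """Generate the requested number of pseudo-alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(replicates)]


alignment = ["ACDE", "ACDF", "WCDE"]
replicates = bootstrap_replicates(alignment, 100)
print(replicates[0])
```

    The bootstrap value of a cluster is then the percentage of replicate trees in
    which that cluster is recovered.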

    1.12 CLUSTER ASSOCIATIONS

    Cluster associations can be inferred from the generated tree topologies.

    Understanding the distribution of clusters with significant bootstrap (BS) values

    helps to classify / group the related sequences. For example, in the phylogenetic

    analysis of mouse olfactory receptors (Zhang and Firestein 2002), nearly 1000 OR

    genes were classified into several OR families by using a consensus tree.

    For the classification, they identified reliable clusters as those having >50%

    bootstrap support and more than 40% protein identity. By this definition, mouse

    ORs were classified into 228 families. This kind of segregation of gene/protein

    sequences creates cluster associations for the protein families of interest. Cluster

    associations provide information about conserved species-specific

    behaviors and the evolutionary integrity observed at intra- and inter-genomic levels

    (Figure 1.6).
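
    A cluster filter in the spirit of the criteria above (more than 50% bootstrap
    support and more than 40% protein identity) can be sketched as follows. How
    identity was measured within a cluster is not stated here, so taking the
    lowest pairwise identity among pre-aligned members is an assumption, as are
    the function names.

```python
from itertools import combinations


def percent_identity(a, b):
    """Identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def is_reliable(members, bootstrap, min_bootstrap=50.0, min_identity=40.0):
    """Apply both thresholds; expects at least two aligned member sequences."""
    lowest = min(percent_identity(a, b) for a, b in combinations(members, 2))
    return bootstrap > min_bootstrap and lowest > min_identity


cluster = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGAAAA"]
print(is_reliable(cluster, bootstrap=80.0))
print(is_reliable(cluster, bootstrap=45.0))
```

    Requiring both criteria guards against clusters that are well supported but
    too divergent, or similar in sequence but unstable in the tree.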

    1.13 SEQUENCE CONSERVATION AND DIVERSITY

    Intra- and inter-genomic phylogenetic studies reveal how sequences

    associate, whether through species-specific tendencies or co-clustering

    arrangements. Evolutionarily conserved sequence properties such

    as motifs (Scott Gleim 2009) are highly important for establishing

    structural and functional relevance.

    Several computational techniques and software tools are available

    to locate and display conserved amino acid residues in an aligned set of

    homologous sequences. Tools and databases such as TOPDOM, MeMotif,

    PROSITE, IMOTdb, SmoS and WEBLOGO, together with the in-house program

    MotifS (by Sowdhamini, yet to be published), can be used to visualize a

    set of aligned TM proteins and the observed motifs and AAS. Such

    annotation tools can be applied in comparative genomics of GPCRs or ORs

    to identify cluster-specific/family-specific motifs in combination with

    knowledge of the predicted topology (Figure 1.6).
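
    Locating conserved positions in an alignment can be sketched with the
    per-column information content that underlies sequence logos. The small-sample
    correction applied by logo tools is omitted, and the 4-bit cut-off and the toy
    alignment are hypothetical.

```python
import math
from collections import Counter


def column_information(column):
    """Information content in bits: log2(20) minus the Shannon entropy."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(20) - entropy


def conserved_positions(alignment, min_bits=4.0):
    """Zero-based alignment columns whose information content reaches the cut-off."""
    return [i for i, column in enumerate(zip(*alignment))
            if column_information(column) >= min_bits]


alignment = ["DRYA", "DRYV", "ERYL", "DRYI"]
print(conserved_positions(alignment))
```

    Mapping such positions onto the predicted topology indicates whether a motif
    lies within a helix or a loop.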

    1.14 HOMOLOGY MODELLING OF GPCRs/ORs

    The sequence searches and clustering provide representative

    sequences for generating three-dimensional structures, which further helps

    to map hotspots and to associate functional properties. Comparative

    modelling/homology modelling is an appropriate procedure for generating 3D

    models of the proteins of interest and can be achieved by the following steps:

    i) Homologous sequences of the query are first collected by using

    effective sequence search methods. The nearest homologue whose

    structure is known can be used as a template.

    ii) A pairwise alignment of the template and target sequences is made

    by using appropriate alignment methods. Procedures such as

    PRALINE TM and MAFFT can be used for membrane proteins. Alignments

    can be manually edited to improve the alignment quality (using MEGA).

    iii) The co-ordinates of the three-dimensional model are built from

    the generated alignment by using software like MODELLER (Sali and

    Blundell, 1993) or a web server like SWISS-MODEL (Arnold et al 2006).

    iv) The potential accuracy of the generated models is assessed, and

    models with the least energy constraints can be selected. If

    unfavorable conformations and short contacts are observed, the model

    can be minimized by using the SYBYL software package (Tripos

    Associates Inc).

    v) Structure validation can be done by checking for disallowed

    conformations or structural environments, guided by Ramachandran

    plot values using the PROCHECK server (Laskowski et al 1993) and by

    VERIFY 3D (Bowie et al 1991).

    In essence, this introductory chapter provides the necessary

    background for the work presented in chapters 2-6.


    CHAPTER 2

    CROSS-GENOME CLUSTERING OF HUMAN AND

    C. ELEGANS G-PROTEIN COUPLED RECEPTORS

    2.1 INTRODUCTION

    Membrane proteins are ubiquitous (Perez 2005), constitute nearly

    20% of whole genomes and are most attractive drug targets since they are

    implicated in various diseases. Membrane proteins are embedded within the

    lipid bilayer and are designated as transmembrane proteins, since they loop

    inside and outside of the cell boundaries. A class of cell-surface receptors

    shares characteristic structural features: an extracellular N-terminus, an

    intracellular C-terminus and seven transmembrane helices (TMHs) connected by

    three intracellular and three extracellular loops. This snake-like arrangement

    gives rise to names such as 7TM receptors, heptahelical receptors or

    serpentine-like receptors. If the downstream targets of such membrane