STATISTICAL PHYSICS APPROACH TO POST-TRANSCRIPTIONAL … · 2017. 3. 14. · STATISTICAL PHYSICS...

STATISTICAL PHYSICS APPROACH TO POST-TRANSCRIPTIONAL REGULATION PhD Thesis Candidate: Araks Martirosyan Advisors: Prof. Andrea De Martino, Prof. Enzo Marinari Università La Sapienza, Rome, Italy December 9, 2015


  • STATISTICAL PHYSICS APPROACH TO POST-TRANSCRIPTIONAL REGULATION

    PhD Thesis

    Candidate: Araks Martirosyan

    Advisors: Prof. Andrea De Martino, Prof. Enzo Marinari

    Università La Sapienza, Rome, Italy

    December 9, 2015

  • Acknowledgments

    I take this opportunity to thank all the wonderful people who helped me open the door to the life sciences wider. I was fortunate to carry out my doctoral program within the NETADIS network. It was delightful to be part of that great family and to have the opportunity to develop my skills and acquire new ones in the fields of Statistical Physics, Complex Systems and their applications.

    I would like to express my deepest gratitude to my advisors, Andrea De Martino and Enzo Marinari, for their continuous help, support and guidance. My sincere thanks go to all of the Sapienza Physics faculty members, and especially to Matteo Figliuzzi, who was always there for brainstorming, ready to help both in finding new ideas and in implementing them.

    I had the pleasure of spending two months of my doctoral program with the HUGEF group in Torino, where I started a project in collaboration with Riccardo Zecchina, Andrea Pagnani, Carla Bosia and Carlo Baldassi. I would like to thank them all for the interesting work and the motivating project, which is not yet complete. Special thanks go to Corinna Giorgi from EBRI (Rome) for frequent consultations. I am looking forward to carrying on our research in the coming years.

    I had the great opportunity to spend another two months in the QLS group at ICTP, Trieste, where I joined the ongoing project run by Matteo Marsili, Areejit Samal and Ryan John Cubero in close collaboration with the group of Mauro Giacca at ICGEB. I want to express my gratitude to them all for the highly collaborative work and a memorable experience. Special thanks go to Antonio Celani and Michele Vendruscolo for their support in the development of our work. I look forward to contributing to the further progress of the project.

    I hereby acknowledge the support of the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7, which made this research possible.

    Last but not least, I would like to thank my family and friends, who were there to share both my disappointments and my excitement while I walked the uneasy path of uncertainty and knowledge.

  • List of Publications

    Martirosyan A, Figliuzzi M, Marinari E, De Martino A. Probing the limits to microRNA-mediated control of gene expression. Submitted to PLOS Computational Biology, 2015.

  • Abstract

    With the development of technologies such as RNA sequencing and mass spectrometry, large gene and protein expression datasets have become available. Analysing the stored information remains a challenging open problem. Our ability to interpret experimental observations is limited by the complexity of cellular processes and, most importantly, by molecular noise.

    In this thesis we focus on non-coding RNA molecules, miRNAs in particular. These are short RNA strands, ∼22 nucleotides long, that silence gene expression at the post-transcriptional stage via sequence-dependent base-pairing with mRNA molecules. A growing body of evidence confirms that miRNA molecules play an extremely important role in the control of the cell cycle. Owing to their exceptional regulatory role, miRNA molecules are applied in a number of ways to control cell death and differentiation and to regulate the expression of particular genes.

    In our work we (i) perform database analysis to construct miRNA-target interaction networks, (ii) employ information theory to quantify miRNA-mediated processes, such as noise buffering and the ceRNA effect, and (iii) build a coarse-grained model to understand the influence of miRNA molecules on the proliferation capacity of cardiomyocytes.

  • Abbreviations used in the manuscript

    Biosystems Biological systems

    CM Cardiomyocytes

    DC Destruction Complex

    GA Gillespie Algorithm

    HTS High Throughput Screening

    LOOCV Leave-one-out cross validation

    miRNA MicroRNA

    MI Mutual information

    NMD Nonsense-mediated decay

    ncRNA Non-coding RNA

    RNAi RNA interference

    TF Transcription factor

  • Contents

    Preface

    1 Regulatory RNAs
      1.1 Introduction
      1.2 miRNA molecules
        1.2.1 Genomics and Biogenesis
        1.2.2 The functionality
        1.2.3 The Role of miRNA Molecules
        1.2.4 miRNA-mRNA interaction networks
      1.3 Summary

    2 Information processing in gene expression
      2.1 Introduction
      2.2 Entropy and Information
      2.3 Mutual Information
        2.3.1 MI in small noise approximation
        2.3.2 The relevance and usefulness of MI approach
      2.4 Information processing in miRNA-regulated gene expression
        2.4.1 Mathematical model of ceRNA competition
        2.4.2 Comparing indirect miRNA-mediated regulation with direct transcriptional control
        2.4.3 Robustness and Effective Information Transmission
      2.5 Summary
      2.6 Supporting Figures

    3 Modeling miRNA regulation of CM proliferation
      3.1 Introduction
      3.2 Proliferation pathways
        3.2.1 Hippo Signaling Pathway
        3.2.2 Canonical Wnt Pathway
        3.2.3 The interplay between Hippo and Wnt signaling pathways
      3.3 The model
        3.3.1 The coarse-grained picture
        3.3.2 miRNA integration
      3.4 Inference
      3.5 Results
      3.6 Summary

    Outlook

    Appendix
      3.7 Stochastic processes
      3.8 Gillespie Algorithm
        3.8.1 Reaction Probability Density Function
        3.8.2 The algorithm
      3.9 Linear noise approximation
      3.10 Model parameters

  • Preface

    “In three words I can sum up everything I’ve learned about life: it goes on.”
    Robert Frost

    The second law of thermodynamics states that the entropy of an isolated system (such as the universe is supposed to be) increases over time [1]. Entropy is understood as a measure of uncertainty (lack of knowledge). In the high-entropy state of the universe there will be no gatherings of atoms into shapes, no structures. However, nature gifted us living organisms, matter able to show “resistance” to the equilibration of the universe: “life” [2]. Resisting the universal trend of rising entropy, living organisms have evolved over time, and there are now millions of species on Earth. The structure and behavior of every individual is an interesting problem, as is understanding the interactions between them. One can group living organisms into biological systems (biosystems), in which only the entities relevant to the biological process under consideration are involved. Thinking of biosystems, one can scale from a single cell to entire populations. Earlier investigations of those systems targeted questions like “What is life?” or “How did it begin?”; in recent decades, with the development of technology, the inquisitive human mind has also started to ask whether life can be created in the laboratory. Although science has diverged into Physics, Mathematics, Chemistry and Biology, all these theories meet again when one tries to answer such universal questions.

    Typically biosystems have a large number of components. For instance, human DNA consists of 3 billion nucleotides encoding ∼21 000 protein-coding genes [3], while there are millions of known species on Earth. The complexity of biosystems changes with the scale. Phenomena relevant at small scales might become irrelevant at large ones: interactions between amino acids, for example, become unimportant once we start speaking about biological pathways. Experimental observations show that the components of biosystems are governed mostly by non-linear processes that happen on different time scales. For instance, a bacterium divides every 20-40 minutes and evolution needs many years, while biochemical processes take place on scales of a few minutes [3]. A notable difference in time scales might be considered an additional layer of complexity, but at the same time it allows us to separate fast and slow processes. One can then either (i) assume that the fast processes are equilibrated and integrate them out, leaving only the slow ones to be investigated, or (ii) focus on characterizing the fast processes, assuming that the slow ones are stationary. On top of all this, biological processes are stochastic; a cellular function therefore needs to be designed robustly, so that random perturbations from internal and external sources do not damage it.

    Large numbers of involved species, non-linearity of processes, stochasticity and the varying level of complexity across time and spatial scales make it difficult to find universal laws that every biosystem would obey. That is why research in Biology tends towards reduction: it focuses on understanding every component individually. In parallel, the physicist’s inclination to test whether universal laws, like energy conservation, hold in biological processes, or whether the ideas of complementarity and indeterminacy of Quantum Mechanics limit our understanding of the phenomenon of life, brought a new insight to modern Biology: the aim of finding principles [4]. As was elegantly done in Physics with Maxwell’s theory of electromagnetism and Einstein’s theory of general relativity, one hopes that the structure and behavior of every living organism will eventually turn out to be derived from one single theory. That theory of Nature would be charming and simple.

    However, before a universal theory is established one needs to prepare the ground for it by dealing with the complexity of biosystems. One of the best tools that can be applied is network theory [5]. A network is a mathematical abstraction, a representation of a set of objects (nodes) connected by links (edges). Each connection indicates the influence of the connected pair on each other. The impact of one node on another might have a different character (activation or repression) and be uni- or bidirectional, corresponding to one-way effects and mutual interactions respectively. Depending on the interaction strength, different weights might be assigned to the edges. If several links meet at one node, it is necessary to specify the logical function (AND, OR, etc.) that connects the inputs and defines their cooperative effect on the output.
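    The ingredients listed above (signed, weighted, directed edges and a logical function at each node) can be sketched in a few lines of Python. This is only an illustration: the node names, the weights and the `node_output` helper are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    sign: int      # +1 for activation, -1 for repression
    weight: float  # interaction strength (stored, not used by the Boolean rule)


# Hypothetical example: one activator and one repressor acting on the same node.
edges = [
    Edge("TF", "gene", +1, 1.0),
    Edge("miRNA", "gene", -1, 0.5),
]


def node_output(active, edges, target, logic="AND"):
    """Boolean state of `target`, given the set of currently active regulators."""
    inputs = []
    for e in edges:
        if e.target != target:
            continue
        present = e.source in active
        # An activator is a favourable input when present, a repressor when absent.
        inputs.append(present if e.sign > 0 else not present)
    if not inputs:
        return False
    return all(inputs) if logic == "AND" else any(inputs)


print(node_output({"TF"}, edges, "gene"))                 # activator alone
print(node_output({"TF", "miRNA"}, edges, "gene"))        # activator plus repressor, AND
print(node_output({"TF", "miRNA"}, edges, "gene", "OR"))  # same inputs, OR
```

    With the AND rule the repressor vetoes the activator, while with the OR rule a single favourable input is enough; the choice of logic function therefore changes the output for the same wiring.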

    Treating molecules, species or groups of them as nodes and marking their interactions as edges, one can map biosystems to networks. Usually those networks are constructed from experimental data, with possible links inferred by calculating (i) correlations [6–10] or (ii) mutual information [11–21] between different nodes. If the dataset is highly noisy (which is usually the case), false interactions between nodes might be inferred. Moreover, not every detected correlation corresponds to a direct link between species: it might be the outcome of an effective interaction emerging from cross-talk effects. Therefore, constructing a more reliable network of a biosystem requires integrating several experiments done from different perspectives, as well as computational tools. The final confirmation is to test experimentally the hypotheses that an inferred network suggests.

    The mapping of biosystems to networks resembles engineering blueprints, but there is a notable difference between them: engineers use chosen materials to build a block with a required functionality, addressing from scratch the problem of optimality (of operating time, spatial distribution and usage of material and resources). In biological systems, blocks were constructed over the years of evolution from the available material in the manner “As long as it functions, it’s good. Better it functions, it’s better” [4]. Functions, too, were not pre-defined; they were generated spontaneously and selected to persist. Nevertheless, there are similarities between engineered and evolutionarily selected systems. In both one can distinguish relatively independent parts (modules), and both display robustness and contain repeating patterns. This observation led to the hypothesis that, as a result of natural selection, the structure and functionality of living organisms converge to their optimal solutions. From this viewpoint one can state that evolution is an algorithm for making “better” organisms. It is important to understand what “better” means in this context. A living organism is characterized by the ability to metabolize, grow, reproduce and respond to changes in its environment. The “better” organism is the one that performs those functions more reliably.
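    The difference between the two link-inference criteria can be illustrated with a toy dataset. The sketch below assumes discrete-valued measurements and a simple plug-in estimator; the data and the function names are made up for illustration and are not the procedure used in the cited works.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy data: y depends on x deterministically but non-linearly (y = x^2).
xs = [-2, -1, 0, 1, 2] * 4
ys = [x * x for x in xs]

print("correlation:", pearson(xs, ys))
print("mutual information (bits):", mutual_information(xs, ys))
```

    For this symmetric, non-linear dependence the correlation vanishes while the mutual information is clearly positive, which is one reason mutual information is used to detect non-linear interactions. With real, noisy and finite data the plug-in estimator is biased, so a careful analysis would need a correction or a significance test.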

    To survive and develop, a living organism collects energy by taking in substrates from the environment, which are then processed through metabolism. There are therefore constant flows of matter and energy from one part of a living organism to another, as well as from and into the environment. Importantly, alongside those flows there is another essential process: the flow of information [4]. During its life every living organism receives, processes and responds to signals coming from the outer and inner worlds. Observations show that some groups of species in a cell can regulate others. Interconnected regulatory mechanisms form an integral network controlling all the cellular and multicellular processes happening in the living organism, from the simplest to the most complex. Those regulatory networks are the focus of our research. In particular, our study is devoted to the regulation of gene expression at the post-transcriptional level by so-called non-coding RNA molecules.

    To characterize information flows we follow the approach of Shannon, who introduced entropy as a measure of information [22,23]. That theory was developed with the syntactic meaning of information transmission in mind: Shannon wanted to find reliable and optimal ways of transferring a message from the sender to the receiver through a communication channel, encoded and decoded appropriately so that the syntax of the message would be readable on both sides. This approach maps directly onto Molecular Biology in the example of the transfer of genetic information from DNA to protein sequences. However, thinking at the semantic level, one can give information a more conceptual meaning, treating it as a measure of the control power that some molecular species (regulators) can have over others (targets). Choosing input and output nodes of a biological regulatory network, one can treat it as a communication channel and estimate its capacity. Quantitative characterization of biological channels is challenging, since their functionality might depend on the range of the input variable. Moreover, in the presence of feedback loops the input might be affected by the processes happening in the channel. The information-theoretic approach has been successfully applied to problems of Molecular Biology at both the syntactic and semantic levels: (i) for the characterization of information flows from DNA to proteins [26–29], (ii) for the interpretation of experimental data, in particular for finding non-linear interactions between species [4,13,14], and (iii) for the inference of network structure and properties [16–19,30].

    The thesis consists of three main chapters entitled “Regulatory RNAs”, “Information processing in gene expression” and “Modeling miRNA regulation of cardiomyocyte proliferation”.
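    The notion of channel capacity can be made concrete with a textbook example. The sketch below uses a binary symmetric channel with an assumed error probability of 0.1 (a standard toy channel, not a model from this thesis) and finds the capacity by scanning over input distributions.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def mutual_information(p_in, channel):
    """I(X;Y) = H(Y) - H(Y|X), where channel[i][j] = P(output j | input i)."""
    n_in, n_out = len(p_in), len(channel[0])
    p_out = [sum(p_in[i] * channel[i][j] for i in range(n_in)) for j in range(n_out)]
    noise_entropy = sum(p_in[i] * entropy(channel[i]) for i in range(n_in))
    return entropy(p_out) - noise_entropy


# Binary symmetric channel: each input symbol is flipped with probability eps.
eps = 0.1  # assumed value, chosen only for illustration
bsc = [[1 - eps, eps], [eps, 1 - eps]]

# Capacity = maximum of I(X;Y) over input distributions (coarse grid scan).
capacity = max(
    mutual_information([p / 1000, 1 - p / 1000], bsc) for p in range(1001)
)
print("capacity (bits per use):", capacity)
```

    The scan recovers the known closed form 1 - H(eps) for this channel, attained at the uniform input. For a biological channel the same quantity depends on which range of inputs is accessible, which is the difficulty mentioned above.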

    In the first chapter we give a short introduction to non-coding RNA (ncRNA) molecules. Although their function and biological role initially escaped notice, they are now believed to be responsible for the biodiversity of living organisms. ncRNA molecules are found to be building blocks of complexes that transport and localize proteins. Moreover, they are shown to be highly integrated in the gene regulatory networks that control protein synthesis. Their regulatory role, often referred to as RNA interference (RNAi), was a revolutionary discovery in Biology. In particular, we focus on understanding the functions of microRNA (miRNA) molecules, a large class of ncRNAs. Those molecules are known to control the development of the cell, so unbalanced perturbations in their populations may lead to diseases such as cancer, cardiac toxicity, etc. A full understanding of the mechanisms by which miRNA molecules function, and of their biological role, will open a new way of designing effective therapies. In this chapter, based on bioinformatics, we construct a miRNA-RNA interaction network and integrate it with nonsense-mediated decay (NMD), another gene regulatory mechanism, to understand the paradoxical behavior of the miniature synaptic current in response to over-expression of the protein ARC.

    In the second chapter we estimate the limits of miRNA-mediated post-transcriptional regulation to understand the role of miRNA molecules as noise suppressors and as mediators of an effective positive interaction between their targets. To this end we employ information theory, in particular using the mutual information between regulator and target as a measure of the capacity of the regulatory circuit. Although it is not clear how close to their capacities those regulatory elements function in their natural environment, understanding their potential power might help to highlight situations where their function could be relevant.

    In the third and final chapter we study the activation of cardiomyocyte proliferation by the injection of selected miRNA molecules. We build a coarse-grained model of proliferation pathways to understand this phenomenon. Inferring the parameters of the model from experimental measurements and bioinformatic predictions of miRNA-RNA interactions, we suggest several hypotheses for further testing.

    The first chapter, which contains a general introduction to miRNA molecules, is linked to the second and third ones; however, the basic ideas are repeated in order to make every chapter a complete and comprehensible independent unit. To avoid overloading the main text with technical details, descriptions of the methods and algorithms used, and any other relevant information, are placed in the appendix at the end of the thesis.

  • Chapter 1

    Regulatory RNAs

    Abstract

    Non-coding RNA molecules are shown to inhibit gene expression via base-pairing with protein-coding mRNAs. The presence of non-coding RNA molecules in gene regulatory networks is ubiquitous; they are known to control cell development, differentiation, growth and death.

    In this chapter we give an introduction to the biology of non-coding RNA molecules with a focus on miRNAs. We discuss the ability of miRNA-mediated regulation to confer robustness on gene expression, a phenomenon that has been observed in several experiments. Next, we introduce the recently described “ceRNA effect”, which refers to the establishment of an effective positive interaction between target genes due to competition for a limited number of shared miRNA regulators. Finally, we construct a miRNA-target interaction network and integrate it with post-transcriptional NMD control in order to identify proteins regulating the amplitude of miniature synaptic currents.

    1.1 Introduction

    Gene expression is the process by which information from a gene (a segment of DNA) is used in the synthesis of proteins [31]. Proteins maintain the functionality of the cell [31,32]; they are involved in nearly all cellular processes and perform various functions essential for the main physiological processes of life, such as the catalysis of chemical reactions (enzymes) [33], the translation of mRNA (ribosomes) [34], the transport of materials across the cell (transport proteins) [35] and the transmission of chemical signals (hormones) [36]. According to the central dogma of Molecular Biology, a protein is synthesized in two basic steps: (a) transcription of DNA to RNA and (b) translation of RNA to protein [31,37]. Gene expression is shown schematically in Fig. 1.1.

    To get a spatial picture of how and where proteins are expressed, let us recall the structure of eukaryotic cells (animals, plants, fungi, etc.), which are the subject of our work [31,32]. The cytoplasm, where most cellular activities occur, fills the volume of the cell. It is the framework of cell functionality. Inside the cell one can distinguish an independent compartment, the nucleus, which is separated from the rest of the cell volume by the nuclear membrane and encloses the genetic material (DNA) packed in chromosomes.

    DNA is a polymer consisting of four types of monomers, the nucleotides A, T, C and G. When the cell needs instructions, the required pages of DNA are copied in the form of a single-stranded ribonucleic acid, the pre-RNA. The latter is a sequence of nucleotides A, U, C and G composed of two kinds of regions, namely introns and exons. By definition introns are non-coding regions, while the genetic code is written in the sequence of exons. In a somewhat simplified picture, at the first stage the introns are cut out of the pre-RNA transcript and the exons are joined to make a readable genetic code. This process is called splicing of the pre-RNA into a messenger RNA (mRNA). The mRNA leaves the nucleus through tiny pores and brings the genetic information to the cytoplasm. In the cytoplasm mRNAs meet ribosomes, which read the nucleotide sequences and synthesize proteins.
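    The two steps just described can be mimicked on strings. The sketch below is a deliberate simplification: it works with the coding strand (so transcription reduces to replacing T by U), and the sequence and exon coordinates are invented for illustration.

```python
def transcribe(coding_strand):
    """Return the pre-RNA corresponding to a DNA coding strand (T becomes U)."""
    return coding_strand.replace("T", "U")


def splice(pre_rna, exons):
    """Join the exons, given as (start, end) half-open index pairs, into an mRNA."""
    return "".join(pre_rna[start:end] for start, end in exons)


# Hypothetical gene: exon 1 (positions 0-5), intron (6-11), exon 2 (12-17).
dna = "ATGGCC" + "GTAAGT" + "TTTTAA"
exons = [(0, 6), (12, 18)]

pre_rna = transcribe(dna)
mrna = splice(pre_rna, exons)
print(pre_rna)
print(mrna)
```

    In a real cell the exon boundaries are recognised by the splicing machinery from sequence signals; here they are simply supplied as coordinates.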

    On the mRNA strand each three-nucleotide sequence (codon) encodes an amino acid, the building block of proteins. In the mRNA molecule one can distinguish functionally different regions: the 5’ untranslated region (5’UTR) precedes the start codon (typically AUG), which signals the ribosome to start translation; the open reading frame (ORF) codes for the protein; and the 3’ untranslated region (3’UTR) follows the stop codon (UAG, UAA or UGA), which terminates translation [31].
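    Reading an ORF codon by codon can be sketched as follows. The codon table is only a small subset of the standard genetic code, sufficient for the made-up example sequence; codons missing from it are rendered as "X".

```python
# Subset of the standard genetic code (one-letter amino acid symbols).
CODON_TABLE = {"AUG": "M", "GCC": "A", "UUU": "F", "GGA": "G"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon: nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


# Hypothetical mRNA: 2-nt 5'UTR, ORF "AUG GCC UUU", stop "UAA", 2-nt 3'UTR.
print(translate("GGAUGGCCUUUUAAGG"))  # prints "MAF"
```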

    For a long time RNA molecules were seen only as intermediate polymers that transfer genetic information from DNA to proteins. However, they appear to have a more complex role. The scientific community started to focus its attention on RNAs after the important discovery that RNAs can auto-catalyze [38], meaning that they are able to catalyze reactions that lead to the production of molecules like themselves. This self-promoting property is believed to have been the starting point for life, since it gives rise to distinguishable self-reproducing species. Remarkably, RNAs are proposed to have been the first carriers of genetic information, while DNA is hypothesized to have evolved later on [39].

    Next, it was found that simple organisms like C. elegans have approximately the same number of protein-coding genes as humans, although the human genome is about 30 times longer than that of C. elegans [3]. Hence, it was hypothesized that the complexity of an organism depends on its ncRNA molecules [40]. In the human genome, for instance, alongside ∼21 000 protein-coding RNA molecules, ∼19 000 non-protein-coding RNAs (ncRNAs) have been identified [3]. The latter have various roles in cellular processes [41,42]. Some of them transfer material from one place to another, such as transfer RNAs (tRNA); others enter as structural units into certain protein complexes, like the ribosomal RNAs (rRNA) that are integrated into the ribosome. Another large group of ncRNA molecules, such as miRNAs, siRNAs, piRNAs, etc., have a regulatory role. They appear in mammals in large numbers. In humans, for instance, more than a thousand miRNAs, hundreds of siRNAs and millions of piRNAs have been found [3]. Those regulatory ncRNA molecules silence gene expression at the post-transcriptional stage by destroying mRNA molecules or impeding their function. The discovery of this regulatory function of RNA, named RNA interference (RNAi), for which Andrew Fire and Craig Mello received the Nobel Prize in 2006, opened a new page in Molecular Biology [43,44]. The use of RNA molecules instead of proteins in various cellular processes is energetically beneficial for the cell, since RNA synthesis and degradation are less costly.

    [Figure 1.1 (diagram): DNA → (transcription) → RNA; mRNA → (translation) → Protein; ncRNA divides into regulatory RNAs (miRNA, siRNA, piRNA, ...) and constituent RNAs (tRNA, rRNA, ...).]

    Figure 1.1: Gene expression and RNA classification. At the top we see the basic steps of gene expression: transcription of DNA to RNA and translation of mRNA (the class of protein-coding RNA molecules) to proteins. Another class of RNA molecules, ncRNAs (non-coding RNA molecules), are not translated to functional proteins, but remain in the cytoplasm and can play constituent (tRNA, rRNA, etc.) or regulatory (miRNA, piRNA, siRNA, etc.) roles.

    In this chapter we study the regulatory mechanism of ncRNA molecules with a focus on microRNAs (miRNAs). These are small polymers, ∼22 nucleotides long, that bind to the RNA sequence based on Watson-Crick complementarity (pairs A-U and C-G) and suppress the production of proteins [45–47]. Notably, miRNA molecules show high conservation among homologous species, hinting that the post-transcriptional regulation of gene expression by miRNA molecules was one of the first regulatory mechanisms developed in the cell.
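    Watson-Crick complementarity between a miRNA and a target can be checked with a simple string search. The sketch below assumes perfect antiparallel pairing over the whole query, ignores G-U wobble pairs and mismatches, and uses made-up sequences; real target prediction tools are considerably more elaborate.

```python
PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(rna):
    """Sequence that pairs with `rna` in antiparallel orientation."""
    return "".join(PAIR[base] for base in reversed(rna))


def binding_sites(mirna, target):
    """Start positions in `target` (5' to 3') of sites perfectly complementary to `mirna`."""
    site = reverse_complement(mirna)
    width = len(site)
    return [i for i in range(len(target) - width + 1) if target[i:i + width] == site]


mirna_fragment = "UGAGGUAG"    # hypothetical 8-nt fragment of a miRNA
target = "AAACUACCUCAGGG"      # hypothetical target containing one matching site

print(reverse_complement(mirna_fragment))
print(binding_sites(mirna_fragment, target))
```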

    Biological function of miRNA molecules ranges from development to disease [48–50]. Besides down-regulating their targets, miRNA molecules are shown to have a noise-buffering effect on the output protein level, and thereby confer robustness on the gene expression process [51,52]. This functionality might be one of the reasons why miRNA molecules are widespread in eukaryotes. It is now clear that a single miRNA molecule can bind to hundreds of targets and a single RNA molecule can base-pair with several miRNA molecules [53,54]. This observation led to the discovery of a more subtle role: an effective positive regulation among targets sharing the same miRNA molecules, due to the competition effect [55,56]. Such derivative effects might, in specific situations, be as important as the primary ones.

    In the first part of this chapter we give a general introduction to miRNA molecules, their biogenesis and function. In the second part we construct a miRNA-target network, aiming to integrate miRNA-mediated regulation with the NMD control of gene expression in order to understand the suppression of the amplitude of the miniature synaptic current in response to ARC protein over-expression.
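    The competition effect can be illustrated with a generic titration model in which two target mRNAs share one miRNA pool. The equations, the parameter values and the assumption that the miRNA is consumed on binding are illustrative choices made here; they are not necessarily the model studied later in the thesis.

```python
def steady_state(b1, b2, beta=2.0, d=1.0, delta=1.0, k=10.0, dt=1e-3, steps=50_000):
    """Forward-Euler integration of a toy two-target titration model.

    m1, m2: free target mRNA levels, transcribed at rates b1, b2, degraded at rate d.
    mu:     free miRNA level, transcribed at rate beta, degraded at rate delta.
    Each binding event (rate constant k) removes one mRNA and one miRNA.
    """
    m1 = m2 = mu = 0.0
    for _ in range(steps):
        bind1 = k * mu * m1
        bind2 = k * mu * m2
        dm1 = b1 - d * m1 - bind1
        dm2 = b2 - d * m2 - bind2
        dmu = beta - delta * mu - bind1 - bind2
        m1 += dt * dm1
        m2 += dt * dm2
        mu += dt * dmu
    return m1, m2, mu


# Raise the transcription rate of target 2 only and watch target 1.
m1_low, _, _ = steady_state(b1=1.0, b2=1.0)
m1_high, _, _ = steady_state(b1=1.0, b2=3.0)
print("m1 when b2 = 1:", m1_low)
print("m1 when b2 = 3:", m1_high)
```

    Although the two targets never interact directly, increasing the transcription of target 2 sequesters more of the shared miRNA and the level of target 1 rises, which is the effective positive coupling described above.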

    1.2 miRNA molecules

    In the early 1990s an interesting phenomenon was observed [57,58]: the lin-4 gene, which controls the developmental timing of C. elegans, does not encode any protein, but produces a pair of small RNAs of ∼22 and ∼61 nucleotides in length. It soon became clear that the short lin-4 RNA was silencing the expression of the lin-14 gene. However, the lin-14 product (protein) was repressed without changes in the lin-14 mRNA level. It was therefore hypothesized that lin-4 represses the translation of lin-14 mRNA into protein. The non-coding RNA lin-4 is considered the founding member of the miRNA family. Later on, thousands of molecules with similar functionality were discovered [59–62]. It is predicted that miRNA genes account for 1-5% of the human genome and regulate at least 30% of protein-coding genes [3].

    Although there are tissue- and organism-specific miRNA molecules responsible for the regulation of specific functions, most miRNA molecules are highly conserved among homologous species in both plants and animals [45,61]. Not only miRNA molecules but also their targets display conservation at the binding positions, suggesting that gene expression in the common ancestor of those species was similarly regulated by the same miRNA molecules. Hence miRNA-mediated post-transcriptional regulation emerged long ago.

    Figure 1.2: miRNA maturation.

    Little is understood about the target recognition mechanisms and biological functions of miRNA molecules; however, it is evident that miRNAs play a crucial role in the regulation of gene expression, controlling diverse pathways. miRNA molecules are shown to play an important role in a number of biological processes, including embryo development, cell proliferation, differentiation, apoptosis and metabolism [45,59,60]. They are shown to have great functional significance in the formation and maintenance of the phenotype [46,47,61].

    The therapeutic potential of miRNA molecules is expected to be high. It has been shown that specific miRNAs change their expression profiles in disease. For instance, let-7 expression is frequently reduced in different cancers. Another observation showed that exogenous activation of let-7 synthesis dramatically inhibits tumor growth in an in vivo lung cancer model, suggesting that the loss of let-7 is crucial in cancer development [63]. Another well-known example is miR-122, which the hepatitis C virus (HCV) needs for its replication and which is currently targeted in HCV therapies [64].

    Below we review the state of the art of miRNA genomics and biogenesis, and end with a discussion of their function and role in cellular processes.

    1.2.1 Genomics and Biogenesis

    The location of miRNA genes in the genome is various. They might be placed eitherfar from previously annotated genes or introns or, surprisingly, in the exonic regions.Typically, the primary transcribe of the miRNA (pri-miRNA) is long ∼ 77 nucleotidepolymer containing a stem-loop. During miRNA maturation, see Fig. 1.2, two cleavingsteps take place. In the nucleus the molecule Dosha/Pasha crops the stem loop andreleases ∼ 65 nucleotide length RNA (pre-miRNA), which is hairpin-shaped. The lattertransfers to the cytoplasm by RAN-GTP or Exportin-5 where it can be maturated afterthe second cleavage performed by the molecule called Dicer. This cleavage happens closeto the terminal loop and releases a small duplex, usually annotated by miRNA:miRNA*where miRNA and miRNA* denote the ‘mature’ and ‘passenger’ strands respectively.Those two strands have nearly-perfect complementarity and both of them can be usedfor gene silencing. In the last step one strand of the duplex loads into the Ago2 to formthe RISC (RNA-induced silencing complex), the other peels away [45,60].

    The mechanism of duplex cleavage is not yet clear; however, some models suggest that it takes place at the step of loading into Ago2, which separates the mature strand, binds it and frees the passenger one [65]. How the mature miRNA strand is identified is still an open question. Some observations led to the hypothesis that the strand that enters the RISC complex is nearly always the one whose end is less tightly paired [66,67]. In the work of Schwarz et al. [67] it was shown that the strands of duplexes having nearly equivalent stabilities at their ends were loaded into Ago2 to form the RISC complex with similar frequency. This observation indeed supports the previous hypothesis.

    At the final stage the mature RISC complex binds to the RNA sequence, based on the complementarity between its miRNA strand and the target, inhibiting target gene synthesis. It is now clear that several RISC complexes can bind to the same target simultaneously. Their cooperative function amplifies gene repression. This explains the presence of multiple miRNA complementary sites in most genetically identified targets [57,58,68–70].

    Finally, it is worth mentioning that one can block miRNA maturation by titrating away essential molecules required for that process, such as Drosha, Dicer or Ago2. This mechanism is typically applied in experiments: the level of miRNA molecules is decreased by Drosha, Dicer or Ago2 knockdown [71–73].

    1.2.2 The functionality

    Mature miRNA molecules guide the RISC complex to bind a target RNA by base-pairing with it [45–47]. The degree of miRNA:mRNA complementarity is a major determinant of the regulatory mechanism [54, 60, 61]. It is widely recognized that a miRNA molecule can direct the RISC complex to down-regulate gene expression through three different mechanisms: (i) mRNA cleavage, (ii) mRNA destabilization and (iii) translational repression. Let us discuss them separately.

    1. mRNA cleavage: When the mature miRNA molecule loaded in the RISC has nearly-perfect complementarity with the target (at least 7-8 base-pairs), it can cleave the latter. After the cleavage the mRNA molecule is no longer functional; the miRNA, however, returns to the cytoplasm and can potentially guide other mRNA cleavages. This mechanism was first observed for siRNA double strands and is the basis of RNA interference (RNAi). Apparently, it can lead to very effective gene silencing, since only a few miRNAs might be enough to completely repress the target. In what follows, mRNA cleavage will also be referred to as the miRNA-recycling process [75–77,149].

    2. mRNA destabilization: The second mechanism of miRNA-led gene inhibition is the enhancement of the degradation rate of the targets. This happens through the removal of an adenylate group from the targets, which leads to their destabilization and thus promotes degradation. As the primary effect, the population of proteins in the cell drops [78–80].

    3. Translational repression: In their study Wightman et al. [58] showed that lin-14 proteins are downregulated by lin-4 miRNA without notable changes in lin-14 mRNA. The lin-28 gene was also shown to be regulated by the same lin-4 miRNA without changes in the lin-28 mRNA level [81]. Later on, several other miRNA-mediated target inhibitions were shown to have the same character. These observations suggest that the mRNA itself is not destroyed, but translation of the proteins is repressed. Translational repression is hypothesized to occur in two ways: either miRNAs prevent the ribosome from translating the RNA sequence into proteins, or they direct selective degradation of already translated proteins [82]. Both mechanisms are observed experimentally [45].

    While the first mechanism, cleavage, requires nearly-perfect complementarity, the last two do not [45, 60]. The flexibility with which miRNA molecules base-pair with and repress target genes makes it possible for a single miRNA molecule to co-regulate several targets. On the other hand, the same target gene can be regulated by different miRNA molecules. Therefore miRNAs can be embedded in large regulatory networks, causing not only direct inhibition effects but also derived effective interactions among their targets [45,54–56,60,61,149].

    1.2.3 The Role of miRNA Molecules

    Due to their regulatory role and vast involvement in gene regulatory networks, miRNA molecules play a central physiological role. However, in the two limiting situations of very high and very low populations miRNAs can have a negligible effect on their targets. In the case of a high miRNA population the target might be totally repressed, and relatively small changes in the miRNA population will not change its state. On the other hand, highly populated targets will not feel the presence of a few miRNA molecules unless the miRNA-recycling mechanism takes place [83].

    Generally speaking, the presence of miRNA molecules increases the effective degradation rate of mRNAs. Based on the relation between spontaneous and miRNA-mediated degradation of a target, one can distinguish the following regimes for mRNAs (for a quantitative description see Chapter 2-2.4 or [83]):

    • free or unrepressed: spontaneous mRNA degradation dominates over miRNA-mediated decay channels, so that, effectively, the mRNA level is weakly sensitive to small changes in miRNA concentration;

    • bound or repressed: miRNA-mediated mRNA decay dominates over spontaneous mRNA degradation but, again, the mRNA level is weakly sensitive to small changes in miRNA concentration, as most mRNAs are bound in complexes with the miRNA;

    • susceptible: spontaneous and miRNA-mediated decay channels have comparable weights and the mRNA level is very sensitive to small changes in miRNA expression level.

    By repressing their targets, miRNA molecules have been shown to be able to confer robustness to gene expression and to cause cross-talk between their targets through the competition mechanism. Below we discuss these phenomena in more detail.

    Noise buffering and robustness

    Biological processes have a stochastic character, since molecules are discrete and interact through random collisions during their diffusive motion in the cellular environment. On one hand, it is exactly this randomness of events that makes biochemical reactions possible. Indeed, for a chemical reaction to occur, the system of substrates must be transformed from one energy level to another, overcoming the activation energy barrier $E^*$. The reaction rate $k$ is connected to the activation energy by the Arrhenius law $k \sim \exp(-E^*/k_B T)$, where $k_B$ is the Boltzmann constant and $T$ is the temperature. Apparently, the activation energy can be gained through random fluctuations [4]. On the other hand, noise can impede tuning a system to its most beneficial state and keeping it there. It is known that at the steady state the noise of a system scales with its number of degrees of freedom (size) $N$ as $1/\sqrt{N}$. Hence, for large populations at equilibrium noise does not have any significant effect [84]. However, in the cell proteins are usually present in low copy numbers [3]. Moreover, biological processes develop dynamically, keeping the cell out of equilibrium. That is why random fluctuations may propagate, amplify and cause serious diseases in the living organism. As a consequence, biological functions may fail. Hence, to survive and develop, any living organism should be robust against perturbations.
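    The $1/\sqrt{N}$ scaling quoted above can be illustrated with a minimal numerical sketch. It assumes a toy system that is not taken from the text: N independent molecules, each found in an "active" state with probability q, so that the number of active molecules is binomially distributed. The function names and parameter values are illustrative only.

```python
import math
import random
import statistics


def relative_noise_exact(n_molecules: int, q: float) -> float:
    """Std/mean of a Binomial(n_molecules, q) count: sqrt((1 - q) / (q * n))."""
    return math.sqrt((1.0 - q) / (q * n_molecules))


def relative_noise_sampled(n_molecules: int, q: float, n_trials: int,
                           rng: random.Random) -> float:
    """Estimate std/mean of the active-molecule count from n_trials samples."""
    counts = [
        sum(1 for _ in range(n_molecules) if rng.random() < q)
        for _ in range(n_trials)
    ]
    return statistics.pstdev(counts) / statistics.fmean(counts)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility only
    for n in (25, 100, 400):
        exact = relative_noise_exact(n, q=0.5)
        sampled = relative_noise_sampled(n, q=0.5, n_trials=2000, rng=rng)
        print(f"N={n:4d}  exact={exact:.4f}  sampled={sampled:.4f}")
```

    In this toy setting, quadrupling N halves the relative fluctuation, which is why noise matters mostly for species present in low copy numbers.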

    Most importantly, protein expression is a noisy process [85, 86]. Recording the protein expression level $p$ at one of the cell’s steady states, one will observe that it fluctuates around an average value $\overline{p}$. Noise can be quantified by the Fano factor (FF), defined as the variance of the variable divided by its average value

    $$\mathrm{FF} = \frac{\overline{p^2} - \overline{p}^{\,2}}{\overline{p}} , \tag{1.1}$$

    where bars stand for averages over the probability distribution at that steady state.

    The total noise on the protein expression level is generated by both intrinsic (translation and transcription) and extrinsic sources. Below, following Swain et al. [86], we discuss how to distinguish these two sources of noise in practice. Suppose $I$ is the set of internal variables and $E$ is the set of external ones, which affect gene expression and are distributed according to some probability density function $P(I, E)$. Imagine an experiment performed on $N$ identical cells, labeled by $k$. The $m$-th moment of the protein expression level is

    $$\frac{1}{N} \sum_{k=1}^{N} p_k^m = \int dE \int dI \, p^m \, P(I, E) . \tag{1.2}$$

    Defining the conditional probability as $\pi(I|E) = P(I,E)/P(E)$, where $P(E) = \int dI \, P(I, E)$, one can rewrite the last expression as

    $$\frac{1}{N} \sum_{k=1}^{N} p_k^m = \int dE \, P(E) \int dI \, p^m \, \pi(I|E) = \overline{\langle p^m \rangle} . \tag{1.3}$$

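    Here the bar and angular brackets indicate averages over external and internal variables respectively. With this notation, the total noise on the protein expression measured in experiments can be written in the following form

    $$\mathrm{FF} = \frac{\overline{\langle p^2 \rangle} - \overline{\langle p \rangle}^{\,2}}{\overline{\langle p \rangle}} = \frac{\overline{\langle p^2 \rangle} - \overline{\langle p \rangle^2}}{\overline{\langle p \rangle}} + \frac{\overline{\langle p \rangle^2} - \overline{\langle p \rangle}^{\,2}}{\overline{\langle p \rangle}} = \mathrm{FF}_{\mathrm{int}} + \mathrm{FF}_{\mathrm{ext}} , \tag{1.4}$$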

    which means that the measured total noise is the sum of two components: the intrinsic noise $\mathrm{FF}_{\mathrm{int}}$ and the extrinsic noise $\mathrm{FF}_{\mathrm{ext}}$. It is indeed easy to see that if the variables $E$ are fixed the second term vanishes, whilst if the variables $I$ are fixed the first term does.

    Imagine we could make two identical copies $p_k^{(1)}$ and $p_k^{(2)}$ of the protein $p_k$ in $N$ samples (here $k \in \{1, \dots, N\}$ labels cells). If in the experiments those two proteins could be measured simultaneously, then the external parameters might be considered to be the same and the average of the product $p_k^{(1)} p_k^{(2)}$ would be

    $$\frac{1}{N} \sum_{k=1}^{N} p_k^{(1)} p_k^{(2)} = \int dE \, P(E) \int dI_1 \, p_k^{(1)} \pi(I_1|E) \int dI_2 \, p_k^{(2)} \pi(I_2|E) = \tag{1.5}$$

    $$= \int dE \, P(E) \left( \int dI \, p_k \, \pi(I|E) \right)^2 = \overline{\langle p \rangle^2} . \tag{1.6}$$

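    Taking into account that $\overline{\langle p \rangle} = \overline{\langle p^{(1)} \rangle}$, $\overline{\langle p^2 \rangle} = \overline{\langle (p^{(1)})^2 \rangle}$ and $\overline{\langle p \rangle}^{\,2} = \overline{\langle p^{(1)} \rangle}^{\,2}$, the intrinsic and extrinsic noise contributions to the protein expression can be calculated from eq. (1.4). This approach was successfully applied to E. coli by Elowitz et al. [87] to calculate the intrinsic noise on green fluorescent proteins.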
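    The two-reporter decomposition of eqs. (1.4)-(1.6) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `decompose_fano` is hypothetical, the single-reporter moments are symmetrized over the two copies (a choice made here, whereas the text uses copy 1 alone), and the cell average of the product of the two reporters is used as the estimate of the extrinsic average of the squared conditional mean.

```python
from statistics import fmean


def decompose_fano(p1: list[float], p2: list[float]) -> dict[str, float]:
    """Split the Fano factor of a two-reporter experiment as in eq. (1.4).

    p1[k] and p2[k] are the levels of the two identical reporters in cell k.
    """
    if len(p1) != len(p2) or not p1:
        raise ValueError("need two non-empty lists of equal length")
    # Single-reporter moments, symmetrized over the two copies (assumption).
    mean = fmean((a + b) / 2.0 for a, b in zip(p1, p2))
    second = fmean((a * a + b * b) / 2.0 for a, b in zip(p1, p2))
    # Cross moment: plays the role of the right-hand side of eq. (1.6).
    cross = fmean(a * b for a, b in zip(p1, p2))
    return {
        "total": (second - mean ** 2) / mean,
        "intrinsic": (second - cross) / mean,
        "extrinsic": (cross - mean ** 2) / mean,
    }


if __name__ == "__main__":
    # Made-up numbers: both reporters agree in every cell, so all of the
    # variability is shared and the intrinsic term is zero.
    print(decompose_fano([10, 20, 30], [10, 20, 30]))
```

    By construction the intrinsic and extrinsic terms returned by this sketch always add up to the total, mirroring eq. (1.4); with few cells the extrinsic estimate can come out slightly negative.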

    Keeping the number of expressed proteins in the desired range is a task of crucial importance for the cell, so the gene expression process is expected to be robust. It has been argued that miRNA molecules have a noise-buffering role in certain biological circuits [51, 52, 88–92]. The Carthew laboratory recently provided the first strong experimental evidence for this phenomenon. They demonstrated that in Drosophila the miRNA molecule miR-7 is able to stabilize gene expression in certain developmental networks in case of perturbations [93]. Interestingly, the function of miR-7 as a noise suppressor is evident only when the system experiences temperature instability. Thus, miRNAs are believed to contribute to developmental stability in case of environmental changes. Although a series of in silico experiments confirms the noise-buffering role of miRNA molecules [52], its relevance in vivo under normal physiological conditions remains questionable [94,95].

    Below we compare the noise on the output protein level of a non-miRNA-regulated transcript (Fig. 1.3 a) and of a miRNA-regulated one (Fig. 1.3 b) in the (i) bound, (ii) susceptible and (iii) free regimes.

    [Figure 1.3, panels: (a) DNA → mRNA → P; (b) DNA → mRNA → P, with a miRNA acting on the mRNA.]

    Figure 1.3: Schematic representation of protein expression. Two basic steps are shown, transcription from DNA to mRNA and translation from mRNA to protein (P), where the mRNA transcript is (a) not miRNA-regulated, (b) miRNA-regulated.

    In the first row of Fig. 1.4 we see the distribution of protein expression levels $P(p)$ at the steady state for the two circuits under consideration. Apparently, $P(p)$ is more concentrated for the miRNA-regulated target (red) if it is in the bound or susceptible regime, while in the free regime of the target the two circuits generate the same amount of noise.

    To check how miRNA regulation affects the robustness of gene expression (the property of the system to remain in, or return to, its initial state in case of perturbations), let us perform the following experiment on both circuits: at the steady state, when the system is equilibrated, let us inject extra mRNA molecules. The introduced perturbation will break the equilibrium, leading to growth of the protein population. However, after a certain period of time the system is expected to stabilize and return to its initial state.

    Results of the experiment are reported in the second row of Fig. 1.4, where one can see that in the bound and susceptible regimes the protein population of the miRNA-regulated circuit does not change after the injection of extra mRNA molecules (at the 100th time step). At the same time, the population of the products of the non-miRNA-regulated transcript grows significantly and returns to its steady-state value only after a certain time period, demonstrating that this circuit is not robust towards perturbations of the mRNA level. This observation can be understood by recalling the main function of miRNA molecules: gene silencing. Indeed, in the bound or susceptible regimes miRNA molecules have the potential to bind and repress newly added mRNAs, while in the free regime the same phenomenon cannot be observed due to the negligible impact of miRNA molecules on the process of gene expression.

    The ceRNA effect

    Investigations of the miRNA-mediated post-transcriptional regulatory network have hinted at a possibly more subtle and complex role. It is indeed now clear that the miRNA-RNA network describing the potential couplings stretches across a major fraction of the transcriptome, with a large heterogeneity both in the number of miRNA targets and in the number of miRNA regulators for a given mRNA [53, 54, 57, 96–98]. The competition effects that may emerge in such conditions suggest that miRNAs may act as channels through which perturbations in the levels of one RNA could be transmitted to other RNA species sharing the same miRNA regulator(s). Such a scenario has been termed the ‘ceRNA effect’, whereby ceRNA stands for ‘competing endogenous RNA’ [55]. In view of its considerable regulatory and therapeutic implications, the ceRNA effect has been extensively analyzed both theoretically and experimentally [83,92,99–102,104–108,154].

    [Figure 1.4, panels: (a) Bound mRNA, (b) Susceptible mRNA, (c) Free mRNA, each showing probability (Prob) versus protein level (Protein); (d) Bound mRNA, (e) Susceptible mRNA, (f) Free mRNA, each showing protein level (Protein) versus time (Time).]

    Figure 1.4: Noise buffering. (a, b, c) Histograms of the protein expression levels at the steady state. (d, e, f) Perturbation of the protein level by injection of 100 additional mRNA molecules, and return to the steady state. Blue corresponds to the non-miRNA-regulated gene expression circuit (Fig. 1.3 a), red to the miRNA-regulated one (Fig. 1.3 b). Details of the modeling are discussed in Chapter 2-2.4.3; model parameters are reported in Table 3.5.

    At this point let us discuss a simple network containing one miRNA molecule and two of its targets (ceRNAs), see Fig. 1.5. If ceRNA1 is injected, the level of miRNA molecules will decrease, leading to a rise of the ceRNA2 population; this demonstrates how an effective positive interaction emerges between two ceRNA molecules binding to the same miRNA. Apparently, in vivo it is hard to find such a small ceRNA regulatory network. Typically they contain hundreds of nodes, which makes it difficult to assess the effectiveness of the ceRNA effect experimentally.
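    The cross-talk in this three-species network can be illustrated with a minimal steady-state sketch. It is not the model of Chapter 2: it assumes a simplified titration scheme in which each ceRNA is transcribed at rate b_i, decays at rate d_i, and is removed together with one miRNA at rate k_i m_i mu, while the miRNA is transcribed at rate beta and decays at rate delta. All names and rate values are hypothetical.

```python
def steady_state(beta, delta, b, d, k, tol=1e-12):
    """Steady state of a minimal titration model (an assumption, see lead-in).

    dm_i/dt = b_i - d_i m_i - k_i m_i mu
    dmu/dt  = beta - delta mu - sum_i k_i m_i mu
    Returns (mu, [m_1, m_2, ...]).
    """
    def mirna_balance(mu):
        # Net miRNA production once every m_i sits at b_i / (d_i + k_i mu).
        sink = sum(ki * bi * mu / (di + ki * mu) for bi, di, ki in zip(b, d, k))
        return beta - delta * mu - sink

    # The balance is positive at mu = 0, non-positive at beta / delta and
    # strictly decreasing in between, so bisection finds the unique root.
    lo, hi = 0.0, beta / delta
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if mirna_balance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return mu, [bi / (di + ki * mu) for bi, di, ki in zip(b, d, k)]


if __name__ == "__main__":
    # Hypothetical rates in arbitrary units; only the trend matters.
    for b1 in (1.0, 10.0, 100.0):
        mu, (m1, m2) = steady_state(beta=10.0, delta=1.0,
                                    b=[b1, 10.0], d=[1.0, 1.0], k=[1.0, 1.0])
        print(f"b1={b1:6.1f}  miRNA={mu:.3f}  ceRNA2={m2:.3f}")
```

    Raising the transcription rate of ceRNA1 in this sketch lowers the free miRNA level and raises the ceRNA2 level, which is the qualitative behavior described above.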

    Several studies have shown that the ceRNA effect is selective towards targets [109,110]. It seems that among the hundreds of targets of a miRNA molecule only a few are able to compete for the miRNA. This leads to the assumption that the other targets just control the miRNA population size, so that it can accordingly influence the few selected targets both by repression and by the ceRNA effect. In particular, the observed ubiquity of pseudogenes might be explained by such an argument. Pseudogenes are genomic DNA sequences that resemble known genes but are non-functional, since their translation is interrupted by premature stop codons, frameshift mutations, insertions or deletions [111]. About 19,000 pseudogenes have been identified in humans [3]. Since pseudogenes copy functional genes, they may share miRNA molecules with the latter [156] and regulate the “availability” of miRNA regulators through the ceRNA effect [55].

    [Figure 1.5, panels: (a) network diagram with nodes ceRNA1, miRNA, ceRNA2 and void; (b) levels (N) of ceRNA2 and miRNA versus ln [ceRNA1].]

    Figure 1.5: The ceRNA effect. (a) A simple ceRNA network containing two ceRNA molecules and one miRNA, which targets both of them. (b) Steady-state behavior of the ceRNA2 and miRNA species when the number of ceRNA1 molecules in the system increases. Details of the modeling are discussed in Chapter 2-2.4; model parameters are reported in Table 3.5.

    The importance of ceRNA regulation cannot be overestimated in the cases where it operates. Indeed, changing the synthesis rate of miRNA molecules can affect an entire vector of ceRNAs, causing dramatic changes in the phenotype. However, there are a number of fundamental questions about the effectiveness of “regulation via competition” per se. In fact, there is an ongoing debate about the ceRNA effect and its relevance in vivo. On one hand, Denzler et al. [99], analyzing in experiments on mouse hepatocytes the response of miRNA targets to the expression of a competitor, hypothesized that modulation of miRNA target abundance is unlikely to cause significant effects on gene expression through a ceRNA effect. According to their results, the latter might be important only in non-physiological conditions. On the other hand, the work of Bosson et al. [100] on mouse embryonic stem cells revealed that miRNA:target pool ratios and an affinity-partitioned target pool accurately predict miRNA susceptibility to target competition. They argued that hierarchical binding of high- to low-affinity miRNA targets might be a key characteristic of in vivo activity. In this respect, it would be especially important to understand in which conditions the degree of control of the output variable (i.e. the ceRNA/protein level) accomplished through post-transcriptional miRNA-mediated cross-talk may exceed that obtainable by different regulatory mechanisms. In the next chapter, using information-theoretic means, we will study the ceRNA effect, highlighting situations in which it functions optimally or in which its role in the regulation of gene expression is comparable to that of direct transcriptional control.

    1.2.4 miRNA-mRNA interaction networks

    ♦ This work has been done in collaboration with Riccardo Zecchina, Andrea Pagnani, Carla Bosia, Carlo Baldassi and Corinna Giorgi.

    As mentioned above, miRNA molecules find their targets based on Watson-Crick complementarity (A-U and C-G), see Fig. 1.6. It has been observed that the complementarity of the miRNA residues at positions 2-8 to a site in the 3’UTR of the target mRNA is decisive for whether the miRNA can bind to the mRNA or not. This region of miRNA molecules is known as the seed sequence. Sites in target mRNAs that are complementary to the miRNA seed region are called canonical sites. Further analyses show that both seeds and canonical sites that guide miRNA-dependent mRNA degradation display evolutionary selection, suggesting that there was a selective pressure to maintain possible miRNA regulation of target genes. In particular, complementary sites 7-8 nucleotides long appear to be highly conserved among homologous species [113].
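    The seed-matching rule just described can be sketched as a short search for perfect Watson-Crick matches. This is a minimal illustration, not the algorithm of any of the prediction tools: it only looks for a perfect 7-nucleotide match to miRNA positions 2-8, and the function names are hypothetical. The example sequences are the two strands drawn in Fig. 1.6.

```python
WATSON_CRICK = {"A": "U", "U": "A", "C": "G", "G": "C"}


def canonical_motif(mirna_5to3: str) -> str:
    """mRNA motif (read 5'->3') complementary to miRNA positions 2-8."""
    seed = mirna_5to3[1:8]  # 1-based positions 2..8
    # The two strands pair antiparallel, hence the reversal.
    return "".join(WATSON_CRICK[base] for base in reversed(seed))


def find_canonical_sites(utr_5to3: str, mirna_5to3: str) -> list[int]:
    """0-based start indices of perfect seed matches in the 3'UTR."""
    motif = canonical_motif(mirna_5to3)
    width = len(motif)
    return [i for i in range(len(utr_5to3) - width + 1)
            if utr_5to3[i:i + width] == motif]


if __name__ == "__main__":
    # Sequences as printed in Fig. 1.6; the miRNA is drawn 3'->5' there,
    # so it is reversed to obtain the 5'->3' reading.
    utr = "AAUCGCGAAGAUCUACUAGAGUAGGUCACCAGGA"
    mirna = "ACGACGUUCUCAUUCAGUGGUU"[::-1]
    print(canonical_motif(mirna), find_canonical_sites(utr, mirna))
    # prints: GUCACCA [24]
```

    Real predictors also score conservation, site context and imperfect pairing, which this sketch deliberately leaves out.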

    [Figure 1.6, diagram: the mRNA strand 5’ AAUCGCGAAGAUCUACUAGAGUAGGUCACCAGGA 3’, carrying the canonical site, paired with the miRNA strand 3’ ACGACGUUCUCAUUCAGUGGUU 5’, carrying the seed.]

    Figure 1.6: miRNA binding to mRNA. The target is recognized based on the complementarity of the miRNA seed and the mRNA canonical site in the mRNA 3’UTR region.

    Identification of canonical sites is the main focus of computational methods for miRNA target prediction, such as TargetScan [114], miRanda [115], miRBase [116], PicTar [117], etc. Alignment algorithms are designed to find the potential canonical sites of the target mRNA matching the seed of the miRNA molecule. Based on the alignment, the interaction strength can be estimated. However, those predictions remain hypothetical: they can give false positive and false negative pairs [113]. Experimentally, targets can be tested by altering the miRNA population in the system and recording changes in the phenotype (the vector of expressed proteins) or changes in the target degradation rate [99, 100]. However, experiments are also not very reliable, since side effects might be confused with direct ones due to the involvement of a miRNA molecule in large regulatory networks. Synthetic biology could indeed be a good tool, as networks with the required components can be designed and tested [118, 119], but it does not address the question of the relevance of a given interaction in vivo. A combination of several experimental and computational tools is therefore required for the identification of miRNA targets.

    The miRNA-mRNA interaction networks (ceRNA networks) are bipartite graphs with two types of nodes (referring to mRNA and miRNA molecules) that are linked if there is a predicted interaction between them. Sketching ceRNA networks can indeed be very useful for understanding the impact of the entire regulatory network on certain miRNA-driven processes. In particular, here we will study the ceRNA network of the protein ARC to understand the recent experiments of [120, 121]. ARC is believed to play a critical role in learning and memory-related molecular processes because of its influence on synaptic activity. In particular, over-expression of ARC causes a reduction of the amplitude of miniature postsynaptic currents (mEPSC) [120].

    It is known that ARC is a natural target of the so-called Nonsense-Mediated Decay (NMD), a mechanism regulating mRNA expression [122]. This regulation is a result of pre-RNA splicing and it inhibits the synthesis of proteins. To understand how NMD functions, let us recall the splicing process. The pre-RNA primary transcript consists of introns (non-coding sequences) and exons (coding sequences). During pre-RNA splicing (see Fig. 1.7) introns are removed and exons are joined to create the mRNA. At this step EJC proteins sit on the junction points, independently of the nucleotide sequence. During translation, if a stop codon is read, an EJC sitting on the mRNA after the position of the ribosome detects it and terminates production of the protein. If the stop codon appeared in the ORF as a result of a mutation, this mechanism detects and inhibits translation of a non-functional mutant gene. However, if an EJC sits after the natural stop codon, production of functional proteins is repressed; in other words, the target gene undergoes NMD.
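    The rule stated above, that a transcript undergoes NMD when an EJC sits downstream of the stop codon being read, can be written as a one-line check. This is a deliberately crude sketch: the function name and the coordinates in the example are hypothetical, and any further requirement that real NMD may impose is ignored.

```python
def is_nmd_target(stop_codon_end: int, junctions: list[int]) -> bool:
    """True if at least one exon-exon junction lies downstream of the stop codon.

    Positions are nucleotide indices along the spliced mRNA, read 5'->3'.
    """
    return any(junction > stop_codon_end for junction in junctions)


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(is_nmd_target(1200, [300, 800]))              # all junctions in the ORF
    print(is_nmd_target(1200, [300, 800, 1350, 1500]))  # two junctions in the 3'UTR
```

    The second call mimics a transcript with two junctions in its 3’UTR, the situation the text attributes to Arc.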

    There are several mRNA transcripts that have a junction point in their 3’UTR. For instance, Arc has two of them. Therefore knockdown of EJC proteins is expected to lead to growth of the ARC protein population in the cell, and this is indeed observed in the experiments. As an outcome of Arc over-expression the amplitude of mEPSC was expected to decrease, but the experimental observation shows the opposite: it increases [120].

    [Figure 1.7, diagram: a pre-RNA made of alternating exons and introns is spliced into mRNAs whose exon-exon junctions carry EJCs. Left: premature stop codon, degradation of a mutant gene. Right: natural stop codon, degradation of a functional gene (NMD mechanism).]

    Figure 1.7: Schematic representation of NMD regulation. On the left, EJC molecules sitting after a premature stop codon repress translation of a non-functional protein, whilst on the right, sitting after the natural stop codon, EJC prevents synthesis of a functional protein, causing nonsense-mediated decay (NMD).

    miRNA         targets (TargetScan)   targets (miRanda)   overlapping targets
    miR-7a/b      434                    1750                172
    miR-19a/b     1132                   1739                428
    miR-135a/b    685                    1988                296

    Table 1.1: miRNA target predictions. Number of mRNA targets predicted for a given miRNA molecule according to the TargetScan (release 5.2) and miRanda (August 2010 release) databases.

    A hypothesis was suggested that integrating miRNA-related regulation with NMD might resolve this paradox. The idea is that the growth of the mEPSC amplitude is an indirect effect, which can result from the influence on the latter of other proteins that are also regulated by NMD or are competitors of ARC for the commonly shared miRNAs. Such genes could indeed have been altered by the EJC knockdown as well. To find them, let us look at the conserved mouse families in two databases, TargetScan [114] (release 5.2) and miRanda [115] (August 2010 release).

    As the first step we search for miRNA molecules that target Arc. According to the predictions of the TargetScan database there are 8 such molecules, while the miRanda database predicts 72 of them. Apparently the two different algorithms used in those two databases give very different results. However, if we check the overlap of those predictions we find that there are 3 miRNA families (miR-7a/b, miR-19a/b and miR-135a/b) predicted by both databases. Assuming the overlap of the two predictions to be relatively more reliable, let us move on to the second step: finding all the ceRNAs competing with Arc for those miRNAs. The results are reported in Table 1.1. Apparently, in these predictions a huge difference between the two databases is again detected.

    Figure 1.8: The ceRNA network of Arc protein. miRNA-mRNA interactions are predicted by the TargetScan (release 5.2) and miRanda (August 2010 release) databases for conserved mRNA/miRNA families in mouse.

    Taking into account only the overlap of the two databases, we obtain the interaction graph shown in Fig. 1.8. In the bipartite graph red nodes correspond to the miRNA molecules and blue ones to the ceRNAs. We observe that every pair of miRNA molecules shares a small fraction of ceRNA transcripts. On the other hand, all 3 miRNAs share only 4 mRNA targets, one of which is indeed Arc. Another one is the protein Slc6a8, which is also known to play an important role in the regulation of synaptic activity [123]. Moreover, one transcript of Slc6a8, gAug10, is predicted to be a natural target of NMD regulation according to [124]. Therefore it can compete with Arc not only for miRNA molecules but also for EJCs. Thus both processes, the knockdown of EJC and the over-expression of ARC, are expected to increase the level of Slc6a8 proteins in neural cells, due to the suppression of NMD regulation and the ceRNA effect. We suggest that observations of Slc6a8 in the experiments and of its effect on the mEPSC amplitude might shed light on the unexpected response of synapses to ARC over-expression.

    In the analysis carried out to construct the Arc ceRNA network, we have discarded all the targets and miRNA regulators that were not predicted by both databases. One can choose another approach, i.e. assign a probability to every interaction link: the more databases predict an interaction, the more probable it is. This approach will be employed in Chapter 3.
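    Both strategies, keeping only the overlap and weighting each link by the number of databases that predict it, can be sketched with plain set operations. The helper names are hypothetical, the support fraction is just one simple choice of weight (the scheme actually used in Chapter 3 may differ), and the toy input below is made up apart from the miRNA and Arc labels.

```python
from collections import Counter

Link = tuple[str, str]  # (miRNA family, target transcript)


def consensus_links(predictions: dict[str, set[Link]]) -> set[Link]:
    """Links predicted by every database (the overlap strategy)."""
    return set.intersection(*predictions.values())


def link_support(predictions: dict[str, set[Link]]) -> dict[Link, float]:
    """Fraction of databases that predict each link (one possible weight)."""
    votes = Counter(link for links in predictions.values() for link in links)
    return {link: n / len(predictions) for link, n in votes.items()}


if __name__ == "__main__":
    # Toy input: geneX, geneY and their links are placeholders, not real
    # database records.
    toy = {
        "TargetScan": {("miR-7a/b", "Arc"), ("miR-7a/b", "geneX")},
        "miRanda": {("miR-7a/b", "Arc"), ("miR-19a/b", "geneY")},
    }
    print(consensus_links(toy))
    print(link_support(toy))
```

    The consensus set is exactly the set of links with support 1, so the overlap strategy is the strictest threshold of the weighted one.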

    1.3 Summary

    In recent years non-coding RNA molecules have captured the attention of the scientific community due to the integral role they play in cell differentiation and development. They are involved in almost all cellular processes and are believed to be the reason behind biodiversity in living organisms [38–40]. Some of them have been shown to play an exceptional role in the regulation of gene expression [40–42]. In particular, miRNA molecules silence genes by targeting the corresponding mRNA molecules, based on sequence complementarity, via translational repression, mRNA cleavage or mRNA destabilization. Interestingly, miRNA molecules can inhibit their target genes even if the complementarity is not perfect. Due to that ability a single miRNA molecule can target hundreds of mRNAs. On the other hand, a given mRNA transcript can be a target for several miRNA molecules. Therefore miRNA molecules are expected to have a huge impact on the phenotype [45–47,53].

    The biological role of miRNA molecules in cellular processes is still an open question. In this chapter we have considered two miRNA-mediated processes: (i) noise suppression on the output protein expression levels, conferring robustness to gene expression profiles [51, 52], and (ii) the ‘ceRNA hypothesis’, suggesting that an effective positive interaction between target RNAs may emerge as a result of the competition for a shared miRNA regulator [55,56].

    Several algorithms have been developed to predict miRNA targets [114–117], but they might give false results since our understanding of the target recognition and binding mechanism is still limited [113,125]. Based on the overlap of the TargetScan [114] and miRanda [115] databases, in this chapter we have constructed the ceRNA network of the protein Arc. The aim was to integrate it with the NMD post-transcriptional regulation [122] in order to understand the influence of ARC protein over-expression on miniature synaptic currents (mEPSC) [120,121]. Our analysis suggests that Arc shares 3 miRNA families with 3 more genes. One of those genes, Slc6a8, is known to regulate synaptic currents, similarly to ARC [123]. We suggest that alterations of the Slc6a8 protein expression level may explain the recent experimental results of Giorgi et al. [120], which revealed the growth of miniature synaptic currents (mEPSC) in response to ARC over-expression.

  • Chapter 2

    Information processing in gene expression

    Abstract

    There are numerous examples of regulatory circuits appearing in biological networks. One can treat these circuits as communication channels and estimate their capacity in order to understand how well they could control the corresponding biological processes.

    In this chapter we first review the basic concepts of information theory and introduce mutual information as a measure of a channel’s capacity. Then we employ an information-theoretic approach to analyze simple TF-based and miRNA-mediated gene regulatory circuits, asking how the choice of parameters may influence the optimized information transmission from input to output at the steady state. We point out situations where miRNA-mediated phenomena, such as noise buffering and the “ceRNA effect”, might play a decisive role in conferring robustness to gene expression and/or regulating the latter.

    2.1 Introduction

    The ability of a living organism to answer quickly to the changes in its natural envi-ronment is a central requirement for its survival. By the sensors the organism receives aknowledge from the environment and using its resources demonstrates beneficial behaviorfor its own survival. Therefore, one can view the living organism as a device that receivesand transmits information. To this respect not only the whole organism, but also itssubsystems are information transmission channels, e.g. the flow of genetic informationfrom DNA to proteins in the gene expression process. Information flow from DNA toproteins should be with the least amount of errors in transforming nucleotide sequencesto amino acids. At the same time that process should be well controlled to tune proteinexpression levels, which is essential for eukaryotic cell functionality. A variety of molecu-lar mechanisms are implemented to guarantee, on one hand, that protein copy numbersstay within a range that is optimal in the given conditions and, on the other, that shiftsin expression levels can be achieved efficiently whenever necessary [41, 42, 126] (whereby‘efficiency’ here encompasses both a dynamical characterization, in terms of the times


  • 22 CHAPTER 2. INFORMATION PROCESSING IN GENE EXPRESSION

    required to shift, and a static one, in terms of moving as precisely as possible from one functional range to another). Quantifying and comparing their effectiveness in different conditions is an important step both to deepen our fundamental understanding of regulatory circuits and to gain case-by-case functional insight into why a specific biochemical network has been selected over the others.

    As the major direct regulators of gene expression, transcription factors (TFs) are most immediately identified as the key potential modulators of protein levels [127]. In a somewhat simplified picture, one may imagine that a change in the amount of a TF can induce a change in the expression level of the corresponding gene, and that the ability to regulate the latter (the output node) via the former (the input node) can be assessed by how strongly the two levels correlate. The effectiveness of a regulatory element is however limited by the stochasticity of intracellular processes, from the TF-DNA binding dynamics to translation [128]. A convenient framework to analyze how noise constrains regulation is provided by information theory [22, 23]. In the first part of this chapter, following [4, 129], we will argue that the mutual information maximized over the distribution of modulator levels can be used as a measure of a channel's capacity. We will discuss the simplest situation, in which a single TF modulates the expression of a single protein. This circuit can be characterized analytically under the assumption that the noise affecting the input-output channel is sufficiently small. Remarkably, in at least one case this maximum has been found to be almost saturated by the actual information flow measured in a living system [129]. In other terms, for sufficiently small noise levels in the channel that links TFs to their functional products, one may quantify the optimal regulatory performance achievable in terms of the maximum number of bits of mutual information that can be exchanged between modulator and target.

    In the second part of this chapter the same approach will be applied to miRNA-mediated post-transcriptional regulation. miRNAs are short RNA molecules, ∼22 nucleotides in length, that are known to base pair with protein-coding mRNAs and silence gene expression through the RNAi machinery (for more details about RNAi see Chapter 1). A complete understanding of the biological role of miRNA molecules is yet to be achieved. It is however known that they are able to stabilize protein output by buffering transcriptional noise and to create an effective positive interaction between the levels of their target RNAs through a simple competition mechanism known as the ‘ceRNA effect’. Although hundreds of targets are predicted for a single miRNA, observations show that only a few of them are sensitive to changes in miRNA expression levels. Most targets are likely to provide a global buffering mechanism through which miRNA levels are overall stabilized [55, 56]. Effective competition between miRNA targets requires that the ratio of miRNA molecules to the number of target sites lies in a specific range, so that the relative abundance of miRNA and RNA species must be tightly regulated for the ceRNA mechanism to operate [56, 99, 101, 130–132]. On the other hand, the magnitude of the ceRNA effect is tunable by the miRNA binding and mRNA loss rates [83, 101, 132]. The performance of a regulatory element, however, depends not only on kinetic parameters, but also on the range of variability (and possibly on the distribution itself) of modulator levels (e.g. TFs) [4, 129]. The maximal regulatory effectiveness of a given genetic circuit, quantifying how precisely the output level can be determined by the input level, can therefore generically be obtained by solving an optimization problem over the distribution of inputs. This type of approach provides an upper bound to the effectiveness of a regulatory mechanism as well as indications concerning which parameters, noise sources and/or interactions most hamper its performance.

  • 2.2. ENTROPY AND INFORMATION 23

    It is especially important to understand in which conditions the degree of control of the output variable (i.e. the ceRNA/protein level) that can be accomplished through post-transcriptional miRNA-mediated cross-talk may exceed that obtainable by different regulatory mechanisms. We will characterize the maximal regulatory power achievable by miRNA-mediated control and compare it with that of a direct, TF-based transcriptional unit [30]. In principle, since fluctuations can be reduced by increasing the number of molecules, an (almost) arbitrary amount of information can be transmitted through a biochemical network. However, cells have to face the burden of macromolecular synthesis [133–135]. Optimality is therefore the result of a trade-off between the benefits of reduced fluctuations and the drawbacks of the associated metabolic costs. For this reason, we will start by fixing a maximal rate of transcription (or, alternatively, the maximal number of output molecules) so as to have a simple but reasonable framework to characterize and compare the capacities of the different regulatory channels. Next, we will quantify how an input signal is processed by the transcriptional (TF-based) and post-transcriptional (miRNA-mediated) regulatory elements by characterizing the response in the output ceRNA's expression levels. In such a setting, information flow is hampered by intrinsic noise if the target gene is weakly derepressed by the activation of its competitor. Otherwise, target derepression appears to have a strong impact on a regulatory element's capacity. Upon varying the magnitude of derepression by tuning the kinetic parameters, we will then show that in certain regimes miRNA-mediated regulation can indeed outperform direct control of gene expression. Finally, we argue that the presence of miRNA molecules in large copy numbers notably reduces the level of intrinsic noise on weakly targeted transcripts. In this case, the mutual regulation of ceRNA molecules by miRNA-mediated channels may become a primary mechanism to finely tune gene expression.

    It is known that miRNA regulation may confer robustness to gene expression by reducing the size of output protein fluctuations. In the last part of this chapter we will show that, for a certain range of the miRNA binding rate, information transmission from a TF to proteins whose expression is regulated post-transcriptionally by miRNA molecules may be performed more reliably than in the case of simple gene expression without miRNA regulation.

    Besides providing a quantitative characterization of the maximal regulatory power achievable through miRNA-based post-transcriptional control, these results provide important hints on the circumstances in which regulation by small RNAs may function as the main tuner of gene expression in cells.

    2.2 Entropy and Information

    Without focusing on particular examples, in this section we consider a classical communication system, an information transmission channel, schematically represented in Fig. 2.1. It consists of three parts:

    1. Input, which is the message to be communicated,

    2. Channel, which is the medium used to transmit the input signal,

    3. Output, which is the received message.

    Figure 2.1: Simple communication channel. A signal from the input (which might be noisy) is transmitted through a noisy channel and the output is read.

    Figure 2.2: Choice making in a schematic way. In the left scheme one chooses among 3 symbols with corresponding probabilities $p_1$, $p_2$, $p_3$, while in the right scheme one of the choices ($p_0$) is itself broken into two sub-choices with probabilities $p_2/p_0$ and $p_3/p_0$.

    Let us consider a discrete channel, that is, a system which transfers a sequence of symbols from a finite set $\{S_1, S_2, S_3, ..., S_n\}$ from one point to another. We can think of the discrete source of such a channel as a generator that produces symbols one by one according to some probability distribution; in other words, it chooses the symbols to transmit with certain probabilities. Suppose $\{p_1, p_2, p_3, ..., p_n\}$ is the set of event probabilities, i.e. $p_i$ is the probability that symbol $S_i$ is chosen to be transmitted through the channel. Our goal is to find a measure $H(p_i)$ of the information produced by the source. We require the following of that measure:

    1. H should be continuous in pi,

    2. If all n entities have equal probabilities ($p_i = 1/n$ for each $i$), H should be a monotonically increasing function of n, since transferring symbols selected from a wider set with equal probabilities increases the information one can gain about the source.

    3. If a choice of a symbol is broken down into 2 successive choices (sub-choices), the original H should be the weighted sum of the individual values of H corresponding to the sub-choices. In other words (see Fig. 2.2), with $p_0 = p_2 + p_3$,

    $$H(p_1, p_2, p_3) = H(p_1, p_0) + p_0\, H\!\left(\frac{p_2}{p_0}, \frac{p_3}{p_0}\right). \tag{2.1}$$

    Following [22] we prove below the fundamental theorem about the measure of information.

    Theorem: The only function satisfying the three requirements above has the form

    $$H(p_i) = -K \sum_i p_i \log p_i, \tag{2.2}$$

    where K is a positive constant.

    To prove this theorem we start from equally likely events, i.e. $p_i = 1/n$ for every $i$. In that case H is a function of n only,

    $$H\!\left(\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n}\right) = A(n).$$

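    Consider the special case $n = s^m$: a choice among $s^m$ equally likely possibilities can be decomposed into $m$ successive choices, each among $s$ equally likely possibilities. Then, as follows from the third requirement, the total information is the sum of the $m$ contributions,

    $$A(s^m) = m\,A(s). \tag{2.3}$$

    For any $s$ and $t$, and for arbitrarily large $n \gg 1$, one can find $m$ such that

    $$s^m \le t^n \le s^{m+1}. \tag{2.4}$$

    Taking the logarithm of all sides and dividing by $n \log s$, for large $n$ we get

    $$\frac{m}{n} \le \frac{\log t}{\log s} \le \frac{m}{n} + \frac{1}{n} = \frac{m}{n} + \epsilon, \tag{2.5}$$

    where $\epsilon = 1/n$ is a small variable. The second requirement states that $A(n)$ is a monotonically increasing function. Hence, from Eq. (2.4) one has $A(s^m) \le A(t^n) \le A(s^{m+1})$. Using the property given by Eq. (2.3), one obtains

    $$m\,A(s) \le n\,A(t) \le (m+1)\,A(s),$$

    which is equivalent to

    $$\frac{m}{n} \le \frac{A(t)}{A(s)} \le \frac{m}{n} + \frac{1}{n} = \frac{m}{n} + \epsilon. \tag{2.6}$$

    Putting Eqs. (2.5) and (2.6) together one has

    $$\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \le \epsilon \to 0. \tag{2.7}$$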

    Thus, for equally likely events $A(s) = K \log s$, where $K$ is an arbitrary constant. To generalize the result to an arbitrary probability distribution we again invoke the third requirement. Suppose we have a choice among $l$ options with probabilities $p_i = n_i/n$, where $n = \sum_i n_i$ and the $n_i$ are integers. Denote the information one can have about that system by $H(p_1, ..., p_l)$. Each choice $i$ can be broken down into $n_i$ equally likely sub-choices, as shown in Fig. 2.3. According to the previous result, the total information of the $n$ equally likely sub-choices is $A(n) = K \log n$. On the other hand, adding $H(p_1, ..., p_l)$ to the weighted sum of the information about each sub-choice, $\sum_i p_i A(n_i) = K \sum_i p_i \log n_i$, also gives $A(n)$, see Eq. (2.1). Hence

    $$K \log n = H(p_1, \dots, p_l) + K \sum_i p_i \log n_i .$$

    Therefore total information reads:

    $$H(p_1, \dots, p_l) = -K \sum_i p_i \log \frac{n_i}{\sum_j n_j} = -K \sum_i p_i \log p_i ,$$

    where one can recognize the famous expression of the entropy, the measure of uncertainty. Since $p_i \le 1$, $K$ is taken to be positive in order to have non-negative information, $H \ge 0$. The theorem is proven.

    Figure 2.3: Choice breakdown. Every choice with a certain probability $p_i = n_i/n$ is broken into $n_i$ equally probable sub-choices.

    In the following we will set the base of the logarithm to 2 and K = 1 in order to measure information in bits, its most common unit. Clearly, if we have 2 possible states with equal probabilities, $H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit.
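The definition of entropy in bits and the grouping requirement of Eq. (2.1) can be checked numerically. The following is a minimal sketch, not part of the original derivation; the function name `entropy_bits` and the probabilities 1/2, 1/3, 1/6 are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), with 0 * log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely states carry one bit.
fair_coin = entropy_bits([0.5, 0.5])

# Grouping requirement, Eq. (2.1), with illustrative values (an assumption):
# p1 = 1/2, p2 = 1/3, p3 = 1/6, so the grouped choice has p0 = p2 + p3 = 1/2.
p1, p2, p3 = 1 / 2, 1 / 3, 1 / 6
p0 = p2 + p3
direct = entropy_bits([p1, p2, p3])
grouped = entropy_bits([p1, p0]) + p0 * entropy_bits([p2 / p0, p3 / p0])

print(f"fair coin: {fair_coin} bit")
print(f"direct = {direct:.6f} bits, grouped = {grouped:.6f} bits")
```

Both ways of evaluating the three-symbol entropy should agree up to floating-point rounding, which is what the third requirement demands.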

    Entropy, defined in this way, has several useful properties that make it an intuitively acceptable measure of information. Below we introduce some of them:

    1. H = 0 when the probability of one choice is 1 and the rest are 0.

    2. H has its maximum when all choices occur with the same probability (the most uncertain case).

    This property can be easily derived using Lagrange multipliers. Indeed, the Lagrangian $L = H(p_1, ..., p_l) - \lambda \sum_i p_i$ is stationary at equal probabilities.

    3. The entropy of a joint probability distribution is less than or equal to the sum of the individual entropies

    $$H(x, y) \le H(x) + H(y) . \tag{2.8}$$

    In the last equation x and y are events with m and n possible values respectively, distributed according to the joint probability distribution p(x, y). The corresponding entropies are defined as

    $$H(x, y) = -\sum_{x,y} p(x, y) \log_2 p(x, y) , \tag{2.9}$$

    $$H(x) = -\sum_{x,y} p(x, y) \log_2 \sum_{y'} p(x, y') , \tag{2.10}$$

    $$H(y) = -\sum_{x,y} p(x, y) \log_2 \sum_{x'} p(x', y) , \tag{2.11}$$

    where x runs over the set $\{x_1, ..., x_m\}$ and y runs over the set $\{y_1, ..., y_n\}$. The equality holds for statistically independent variables.

    Jensen's inequality states that $E[f(\xi)] \le f(E[\xi])$ for a concave function $f$, where $E[\xi]$ stands for the average value of $\xi$ over the probability distribution of its arguments. Since $\log_2$ is a concave function, writing $p(x) = \sum_y p(x, y)$ and $p(y) = \sum_x p(x, y)$, the following holds

    $$\begin{aligned}
    H(x, y) - H(x) - H(y) &= -\sum_{x,y} p(x, y) \log_2 p(x, y) + \sum_{x,y} p(x, y) \log_2 p(x) + \sum_{x,y} p(x, y) \log_2 p(y) \\
    &= \sum_{x,y} p(x, y) \log_2 \frac{p(x)\, p(y)}{p(x, y)} \\
    &\le \log_2 \sum_{x,y} p(x, y)\, \frac{p(x)\, p(y)}{p(x, y)} = \log_2 \sum_{x,y} p(x)\, p(y) = \log_2 1 = 0 .
    \end{aligned}$$

    For independent variables the bound is saturated: if $p(x, y) = p(x)\, p(y)$, then

    $$H(x, y) = -\sum_{x,y} p(x, y) \log_2 p(x, y) = -\sum_{x,y} p(x)\, p(y) \left[ \log_2 p(x) + \log_2 p(y) \right] = H(x) + H(y) . \tag{2.12}$$

    4. $H_x(y) = H(x, y) - H(x)$, where $H_x(y) = -\sum_{x,y} p(x, y) \log_2 p_x(y)$ is the conditional entropy, defined through the conditional probability $p_x(y) = p(x, y) / \sum_{y} p(x, y)$.

    It is indeed easy to see that

    $$\begin{aligned}
    H_x(y) &= -\sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{\sum_{y'} p(x, y')} \\
    &= -\sum_{x,y} p(x, y) \log_2 p(x, y) + \sum_{x,y} p(x, y) \log_2 \sum_{y'} p(x, y') \\
    &= H(x, y) - H(x) .
    \end{aligned}$$

    5. The conditional entropy of a variable is always less than or equal to the entropy of that variable, $H_x(y) \le H(y)$.

    Indeed,

    $$H(x, y) = H_x(y) + H(x) \le H(x) + H(y) \quad\Rightarrow\quad H_x(y) \le H(y) ,$$

    which means that the uncertainty about a variable never increases upon receiving knowledge about the other one. Obviously, it remains the same if the two variables are independent.
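The five properties listed above can be verified on a small example. This is a minimal sketch under stated assumptions: the joint table `pxy` is a hypothetical distribution chosen only for illustration, and `entropy_bits` is a helper name introduced here.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a probability table; zero entries contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.10, 0.20]])
px = pxy.sum(axis=1)          # marginal p(x)
py = pxy.sum(axis=0)          # marginal p(y)

H_xy, H_x, H_y = entropy_bits(pxy), entropy_bits(px), entropy_bits(py)

# Conditional entropy H_x(y) computed directly from p_x(y) = p(x, y) / p(x).
H_y_given_x = float(-(pxy * np.log2(pxy / px[:, None])).sum())

# Independent variables with the same marginals saturate the bound (2.8).
H_indep = entropy_bits(np.outer(px, py))

print("Property 1: H(1, 0, 0)      =", entropy_bits([1.0, 0.0, 0.0]))
print("Property 2: H(x) <= log2(3) :", H_x <= np.log2(3))
print("Property 3: H(x,y) <= H(x)+H(y):", H_xy <= H_x + H_y)
print("Property 4: H_x(y) - (H(x,y) - H(x)) =", H_y_given_x - (H_xy - H_x))
print("Property 5: H_x(y) <= H(y)  :", H_y_given_x <= H_y)
```

The conditional entropy is computed from its definition rather than from Property 4, so the comparison in the fourth line is a genuine check and not a tautology.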


    In analogy with the discrete case one can define the entropy of a continuous variable distributed according to p(x) as

    $$H = -\int_{-\infty}^{\infty} dx\, p(x) \log_2 p(x) .$$

    The joint and conditional entropies are respectively

    $$H(x, y) = -\int dx\, dy\, p(x, y) \log_2 p(x, y) ,$$

    $$H_x(y) = -\int dx\, dy\, p(x, y) \log_2 \frac{p(x, y)}{p(x)} ,$$

    $$H_y(x) = -\int dx\, dy\, p(x, y) \log_2 \frac{p(x, y)}{p(y)} ,$$

    where $p(x) = \int dy\, p(x, y)$ and $p(y) = \int dx\, p(x, y)$.

    The properties of the entropy of a discrete variable carry over to the continuous case. In addition, below we prove two fundamental theorems about the maximization of entropy.

    Theorem I Suppose a variable x is distributed according to p(x) with fixed standard deviation σ. Then the entropy H(x) has its maximum for the Gaussian distribution and is equal to

    $$H = \log_2 \left( \sqrt{2\pi e}\, \sigma \right) . \tag{2.13}$$

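    To prove this statement we use the method of Lagrange multipliers. We should maximize $H(x) = -\int dx\, p(x) \log_2 p(x)$ with the constraints $\int_{-\infty}^{\infty} dx\, p(x) = 1$ and $\int_{-\infty}^{\infty} dx\, x^2 p(x) = \sigma^2$. Since $\log_2 p = \ln p / \ln 2$, the base of the logarithm only rescales H by a constant factor and does not affect the maximizer, so we carry out the variation with the natural logarithm. Let us define the Lagrangian as

    $$\Lambda(p(x)) = -\int dx\, p(x) \ln p(x) + \lambda \int dx\, x^2 p(x) + \mu \int dx\, p(x) , \tag{2.14}$$

    where λ and µ are Lagrange multipliers. Next, to find the distribution that maximizes Λ(p(x)), let us calculate the derivative

    $$\frac{\partial \Lambda(p(x))}{\partial p(x)} = -1 - \ln p(x) + \lambda x^2 + \mu = 0 . \tag{2.15}$$

    From the last expression we find

    $$p(x) = e^{\lambda x^2 + \mu - 1} . \tag{2.16}$$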

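    The parameters λ and µ can be calculated using the conditions of fixed variance and normalization, $\int_{-\infty}^{\infty} dx\, x^2 p(x) = \sigma^2$ and $\int_{-\infty}^{\infty} dx\, p(x) = 1$. These give

    $$\sigma^2 = \frac{\sqrt{\pi}}{2} \frac{e^{-1+\mu}}{(-\lambda)^{3/2}} , \qquad 1 = \frac{e^{-1+\mu} \sqrt{\pi}}{\sqrt{-\lambda}} , \tag{2.17}$$

    which corresponds to

    $$\sigma^2 = -\frac{1}{2\lambda} , \qquad \frac{1}{\sqrt{2\pi\sigma^2}} = e^{-1+\mu} . \tag{2.18}$$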

  • 2.3. MUTUAL INFORMATION 29

    Therefore, the p(x) that maximizes the entropy has a Gaussian form

    $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}} . \tag{2.19}$$

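    Finally, let us calculate the entropy of a Gaussian variable. Using $-\log_2 p(x) = \log_2 \sqrt{2\pi\sigma^2} + \frac{x^2}{2\sigma^2} \log_2 e$, one has

    $$\begin{aligned}
    H &= -\int dx\, p(x) \log_2 p(x) = \int dx\, p(x) \left( \log_2 \sqrt{2\pi\sigma^2} + \frac{x^2}{2\sigma^2} \log_2 e \right) \\
    &= \log_2 \sqrt{2\pi\sigma^2} \int dx\, p(x) + \frac{\log_2 e}{2\sigma^2} \int dx\, p(x)\, x^2 = \log_2 \sqrt{2\pi\sigma^2} + \frac{\log_2 e}{2\sigma^2}\, \sigma^2 = \log_2 \sqrt{2\pi e \sigma^2} .
    \end{aligned}$$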

    Theorem is proven.
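Theorem I can be illustrated by comparing the Gaussian with other densities of the same standard deviation. The sketch below is an assumption-laden illustration rather than part of the proof: the Laplace and uniform densities are comparison cases chosen here, σ = 1 is an arbitrary value, and the entropy integral is approximated by a plain Riemann sum on a finite grid.

```python
import numpy as np

def differential_entropy_bits(pdf_values, dx):
    """Riemann-sum estimate of -integral p(x) log2 p(x) dx on a uniform grid."""
    p = np.asarray(pdf_values, dtype=float)
    safe = np.where(p > 0, p, 1.0)            # log2(1) = 0 where p vanishes
    return float(-(p * np.log2(safe)).sum() * dx)

sigma = 1.0                                   # illustrative value (assumption)
x = np.linspace(-12 * sigma, 12 * sigma, 240_001)
dx = x[1] - x[0]

# Gaussian with standard deviation sigma, Eq. (2.19).
gauss = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Laplace density with scale b; its variance 2 b^2 equals sigma^2.
b = sigma / np.sqrt(2)
laplace = np.exp(-np.abs(x) / b) / (2 * b)

# Uniform density of width w; its variance w^2 / 12 equals sigma^2.
w = sigma * np.sqrt(12)
uniform = np.where(np.abs(x) <= w / 2, 1 / w, 0.0)

h_gauss = differential_entropy_bits(gauss, dx)
h_laplace = differential_entropy_bits(laplace, dx)
h_uniform = differential_entropy_bits(uniform, dx)
h_theory = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # Eq. (2.13)

print(f"Gaussian: {h_gauss:.4f} bits (closed form {h_theory:.4f})")
print(f"Laplace:  {h_laplace:.4f} bits")
print(f"Uniform:  {h_uniform:.4f} bits")
```

Under these assumptions the Gaussian estimate should match the closed form to within the discretization error, and both alternative densities should come out lower.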

    Theorem II If x is limited to the positive axis and the first moment of x is fixed to a certain value a,

    $$\int_0^{\infty} dx\, x\, p(x) = a , \tag{2.20}$$

    then the distribution $p(x) = \frac{1}{a} e^{-x/a}$ maximizes the entropy and gives $H = \log_2(e\, a)$.

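    To solve the constrained maximization problem let us define a Lagrangian function. As the base of the logarithm only rescales the entropy by a constant factor, we carry out the variation with the natural logarithm:

    $$\Lambda(p(x)) = -\int dx\, p(x) \ln p(x) + \lambda \int dx\, p(x) + \mu \int dx\, x\, p(x) , \tag{2.21}$$

    where λ and µ are Lagrange multipliers. Differentiating the latter with respect to p(x) gives

    $$-1 - \ln p(x) + \lambda + \mu x = 0 . \tag{2.22}$$

    Hence the probability distribution that maximizes the entropy is given by $p(x) = e^{-1+\lambda+\mu x}$. Requiring the normalization and $\int dx\, x\, p(x) = a$ conditions to hold, one gets $p(x) = \frac{1}{a} e^{-x/a}$.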

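    Now let us calculate the maximal entropy attained by that solution. Using $-\log_2 p(x) = \log_2 a + \frac{x}{a} \log_2 e$, one has

    $$\begin{aligned}
    H &= -\int dx\, p(x) \log_2 p(x) = \int dx\, p(x) \left( \log_2 a + \frac{x}{a} \log_2 e \right) \\
    &= \log_2 a \int dx\, p(x) + \frac{\log_2 e}{a} \int dx\, p(x)\, x = \log_2 (a\, e) .
    \end{aligned}$$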

    Theorem is proven.
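Theorem II can be illustrated in the same spirit by comparing densities on the positive axis that share the same mean. This is a minimal sketch under stated assumptions: the half-normal and uniform densities are comparison cases introduced here, a = 1 is an arbitrary value, and the integrals are approximated by midpoint sums on a truncated grid.

```python
import numpy as np

def differential_entropy_bits(pdf_values, dx):
    """Midpoint-sum estimate of -integral p(x) log2 p(x) dx on a uniform grid."""
    p = np.asarray(pdf_values, dtype=float)
    safe = np.where(p > 0, p, 1.0)            # log2(1) = 0 where p vanishes
    return float(-(p * np.log2(safe)).sum() * dx)

a = 1.0                                       # fixed first moment (illustrative)
dx = a * 1e-4
x = (np.arange(600_000) + 0.5) * dx           # cell midpoints covering (0, 60 a)

# Exponential density with mean a, the claimed maximizer.
exponential = np.exp(-x / a) / a

# Half-normal density with scale s chosen so that its mean s*sqrt(2/pi) equals a.
s = a * np.sqrt(np.pi / 2)
half_normal = np.sqrt(2 / np.pi) / s * np.exp(-x**2 / (2 * s**2))

# Uniform density on (0, 2a), whose mean is also a.
uniform = np.where(x <= 2 * a, 1 / (2 * a), 0.0)

h_exp = differential_entropy_bits(exponential, dx)
h_half = differential_entropy_bits(half_normal, dx)
h_unif = differential_entropy_bits(uniform, dx)
h_theory = np.log2(np.e * a)                  # value stated in Theorem II

means = [float((p * x).sum() * dx) for p in (exponential, half_normal, uniform)]

print(f"Exponential: {h_exp:.4f} bits (closed form {h_theory:.4f})")
print(f"Half-normal: {h_half:.4f} bits")
print(f"Uniform:     {h_unif:.4f} bits")
```

All three densities satisfy the constraint of Eq. (2.20), so the ordering of the entropies is a fair comparison.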

    2.3 Mutual Information

    Figure 2.4: Schematic representation of the input and the output of a noisy channel and the relation between them. After an observation over a time period $T$, one input can be associated with $2^{H_x(y)T}$ outputs and one output can be associated with $2^{H_y(x)T}$ inputs.

    Here we are going to understand how noise in the channel affects information transmission from the input to the output. If the channel is noisy, a signal sent by the source does not undergo the same changes every time. Therefore, we may have different output signals for the same input. Let us denote the input signal by x, with probability distribution p(x), and the output signal by y, with probability distribution p(y). For this system we can define the entropies of the input and of the output, H(x) and H(y), the joint entropy H(x, y) and the conditional entropies $H_x(y)$ and $H_y(x)$. Obviously, if the channel is noiseless there will be no information loss, and one will have H(x) = H(y). For a noisy channel, however, some information might be lost and it is not always possible to recover the original message. To measure how much our uncertainty about the input decreases when we know the output, let us define the mutual information (MI), or rate of transmission, as follows [22]

    $$I(x, y) = H(x) - H_y(x) . \tag{2.23}$$

    Note that, since the conditional entropy of a variable is always less than or equal to the entropy of that variable (see Property 5 of H in Section 2.2), MI is a non-negative quantity. Using Property 4, MI can be written in either of the following equivalent forms

    $$I(x, y) = H(y) - H_x(y) = H(x) + H(y) - H(x, y) . \tag{2.24}$$
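The equivalence of the three expressions for MI can be checked on a concrete joint distribution. The sketch below assumes a hypothetical binary channel that flips its input with probability 0.1 and is fed equally likely inputs; the helper names are introduced here for illustration only.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a probability table; zero entries contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def conditional_entropy_bits(pxy, given_axis):
    """-sum p(x,y) log2 [p(x,y) / marginal]; given_axis=0 conditions on x (rows)."""
    pxy = np.asarray(pxy, dtype=float)
    marginal = pxy.sum(axis=1 - given_axis, keepdims=True)
    mask = pxy > 0
    ratio = np.where(mask, pxy / marginal, 1.0)
    return float(-(pxy * np.log2(ratio)).sum())

# Hypothetical example: uniform binary input, flip probability 0.1.
flip = 0.1
pxy = np.array([[0.5 * (1 - flip), 0.5 * flip],
                [0.5 * flip, 0.5 * (1 - flip)]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi_from_input = entropy_bits(px) - conditional_entropy_bits(pxy, given_axis=1)   # H(x) - H_y(x)
mi_from_output = entropy_bits(py) - conditional_entropy_bits(pxy, given_axis=0)  # H(y) - H_x(y)
mi_symmetric = entropy_bits(px) + entropy_bits(py) - entropy_bits(pxy)

# For independent variables the mutual information vanishes.
p_indep = np.outer(px, py)
mi_indep = entropy_bits(px) + entropy_bits(py) - entropy_bits(p_indep)

print(mi_from_input, mi_from_output, mi_symmetric, mi_indep)
```

For this symmetric example the result should equal one bit minus the entropy of the flip event, which gives an independent check on the three forms.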

    Let us define the capacity of a given channel as the maximum value of MI taken over all possible input sources p(x)

    $$C = \max_{p(x)} I(x, y) . \tag{2.25}$$
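For a discrete channel the maximization in Eq. (2.25) can be carried out numerically. The sketch below uses the Blahut-Arimoto iteration, a standard algorithm that is not discussed in the text and is brought in here purely as an illustration; the function name, tolerances and the two example channels are assumptions.

```python
import numpy as np

def _divergence_rows(W, q):
    """For each input x, sum_y W[x, y] * log2(W[x, y] / q[y]), in bits."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(W > 0, W / q, 1.0)   # entries with W = 0 contribute 0
    return (W * np.log2(ratio)).sum(axis=1)

def blahut_arimoto(W, tol=1e-12, max_iter=10_000):
    """Capacity in bits of a channel W[x, y] = p(y | x), and the optimal p(x)."""
    W = np.asarray(W, dtype=float)
    p = np.full(W.shape[0], 1.0 / W.shape[0])      # start from the uniform input
    for _ in range(max_iter):
        q = p @ W                                  # output distribution p(y)
        new_p = p * np.exp2(_divergence_rows(W, q))
        new_p /= new_p.sum()
        converged = np.abs(new_p - p).max() < tol
        p = new_p
        if converged:
            break
    capacity = float(p @ _divergence_rows(W, p @ W))   # I(x, y) at the final p(x)
    return capacity, p

# Binary symmetric channel with flip probability 0.1 (illustrative).
cap_bsc, p_bsc = blahut_arimoto([[0.9, 0.1],
                                 [0.1, 0.9]])

# Asymmetric channel: input 0 is noiseless, input 1 is flipped half of the time.
cap_z, p_z = blahut_arimoto([[1.0, 0.0],
                             [0.5, 0.5]])

print(f"symmetric channel:  C = {cap_bsc:.4f} bits, p(x) = {p_bsc}")
print(f"asymmetric channel: C = {cap_z:.4f} bits, p(x) = {p_z}")
```

The symmetric case is maximized by equally likely inputs, whereas the asymmetric case favors the noiseless input, which shows why the capacity is defined through an optimization over p(x) rather than evaluated at a fixed input distribution.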

    At this point let us prove a theorem that will ‘justify’ our definition (2.25) of the capacity of a noisy channel and explain its meaning.

    Theorem: If a noisy discrete channel has a capacity C and a discrete source has entropy per second H, such that H ≤ C, there exists an encoding such that the input signal can be transmitted over the channel with an arbitrarily small frequency of errors.

    Let us calculate the frequency of errors for a random association of the input messages over a long time period T. If we denote by H(x) the rate of transmission of the input x and by H(y) the rate of recording of the output y, then over the time period T we will have $2^{H(x)T}$ inputs transmitted and $2^{H(y)T}$ outputs received, where $2^{H_y(x)T}$ possible inputs could give the recorded output y and $2^{H_x(y)T}$ possible outputs could be generated from the same input x, as presented schematically in Fig. 2.4.


    Suppose a particular output was observed in response to an input source that produces information at a rate R < C. The probability that a given one of the other inputs compatible with that output is also a message of the source is

    $$p = \frac{2^{RT}}{2^{H(x)T}} .$$

    Therefore the probability that none of the input messages (except the original one) will give the observed output during the time period T is equal to

    $$P = (1 - p)^{2^{H_y(x)T}} = \left( 1 - 2^{(R - H(x))T} \right)^{2^{H_y(x)T}} .$$

    Since by assumption $R < C = H(x) - H_y(x)$, one has $R - H(x) = -H_y(x) - \eta$, where η is a positive variable. Using this notation the probability P takes the form

    $$P = \left( 1 - 2^{-H_y(x)T - \eta T} \right)^{2^{H_y(x)T}} .$$

    For a large enough time period T the term $2^{-H_y(x)T - \eta T}$ is vanishingly small. In that limit one can expand P to first order,

    $$P \simeq 1 - 2^{-H_y(x)T - \eta T}\, 2^{H_y(x)T} = 1 - 2^{-T\eta} \xrightarrow[T \to \infty]{} 1 .$$

    This result shows that, with probability approaching one, no message other than the original one is compatible with the observed output. Hence the probability of error approaches zero. The theorem is proven.
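The approach of P to one can be made concrete by evaluating the exact expression for increasing T and comparing it with the first-order form $1 - 2^{-\eta T}$. This is a minimal numerical sketch; the values chosen for $H_y(x)$ and η are arbitrary assumptions, and the function name is introduced here.

```python
import math

def prob_no_confusion(T, h_cond, eta):
    """P = (1 - 2^{-(h_cond + eta) T})^(2^{h_cond T}), evaluated via logarithms."""
    n_candidates = 2.0 ** (h_cond * T)             # inputs compatible with the output
    p_single = 2.0 ** (-(h_cond + eta) * T)        # chance that one of them is a message
    # log1p keeps precision when p_single is far below machine epsilon.
    return math.exp(n_candidates * math.log1p(-p_single))

h_cond = 0.5    # H_y(x) in bits per unit time (illustrative assumption)
eta = 0.1       # gap between the capacity and the source rate (illustrative assumption)
periods = [10, 50, 100, 200]

exact = [prob_no_confusion(T, h_cond, eta) for T in periods]
first_order = [1.0 - 2.0 ** (-eta * T) for T in periods]

for T, p_exact, p_approx in zip(periods, exact, first_order):
    print(f"T = {T:4d}: P = {p_exact:.8f}, first order = {p_approx:.8f}")
```

The two columns should differ noticeably at short times, where the first-order expansion is not yet justified, and agree closely once $2^{-\eta T}$ is small.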

    The analogue of a discrete communication channel (or a discrete channel in the continuous limit) is the continuous one [23], which takes as input a continuous function of time f(t) belonging to a set of possible inputs. The output is also a continuous function, which can be perturbed by noise. The rate of transmission of information, or MI, is defined similarly to the discrete case

    $$I(x, y) = H(x) - H_y(x) , \tag{2.26}$$

    where H(x) is the entropy of the input and $H_y(x)$ is the conditional entropy of x given y. The capacity of the channel, as in the discrete case, is defined as the maximum information transmission rate obtained over all possible inputs

    $$\begin{aligned}
    C &= \max_{p(x)} \left( -\int dx\, p(x) \log_2 p(x) + \iint dx\, dy\, p(x, y) \log_2 \frac{p(x, y)}{p(y)} \right) \\
    &= \max_{p(x)} \iint dx\, dy\, p(x, y) \log_2 \frac{p(x, y)}{p(y)\, p(x)} .
    \end{aligned}$$

    Mutual information has several properties that make it very convenient for quantifying dependences in biosystems [4, 30]. In particular

    1. MI is defined both for discrete and continuous variables. In biosystems one can deal with the absolute number of molecules (discrete variables) or their concentrations (cont