Genetomic Promototypes: High-throughput, Computational Design of Synthetic Promoter Regions
by
Mirkó Palla
Clarkson University
Genetomic Promototypes: High-throughput, Computational Design of Synthetic Promoter Regions
A Thesis by
Mirkó Palla
Department of Mechanical and Aeronautical Engineering
Harvard Medical School, Genetics Department – Church Laboratory
Submitted in partial fulfillment of the requirements for a
Bachelor of Science Degree with
University Honors
April 2007
Accepted by the Honors Program
______________________________ Dana Pe’er, Advisor Date
______________________________ James Schulte, Honors Reader Date
______________________________ David Craig, Honors Director Date
Contents
Executive Summary
1 Introduction
1.1 Biological Background
1.2 Oligonucleotide Synthesis
1.3 BASHER – Computational Design Tool
2 Background Information
2.1 DNA Discovery and Chromosomes
2.2 From DNA to Protein
2.3 From DNA to RNA
2.4 The Transcriptional Machinery
2.5 Basics of Genetic Switches
2.6 Eukaryotic Transcriptional Regulation
3 Methodology
3.1 The Rise of Synthetic Biology
3.2 Overall Design of Hypothesis Testing
3.3 Designing Promoter Variations
4 Results
4.1 BASHER Preliminaries
4.2 BASHER Structure
4.2.1 Configuration File
4.2.2 Mutagenesis Module
4.2.3 Combinatorial Module
4.2.4 Promoter Visualization
4.2.5 Regulatory Combinatory
4.2.6 Regulatory Combinatory
4.2.7 Pair Mutagenesis
4.2.8 Overlapping Pairs
4.2.9 Module Mutagenesis
5 Discussion
6 Conclusion
6.1 Future Prospects
6.2 Project Barriers
6.3 Project Reflection
Acknowledgements
I would like to thank my mentors Dana Pe’er, Aimee Dudley and Noel Goddard at
Harvard University for providing me the unique opportunity to work with them on the
Polypromoter Project. It was truly a life-changing experience, which helped me get a
glimpse of what it takes to be a real research scientist. I would also like to thank Prof.
George Church for letting me work in his innovative and well-respected laboratory; I felt
privileged to interact with such a bright group of researchers. I would like to give many
thanks to the people in the laboratory for their enormous support from day one; they
were never too busy to spend some quality time with a “greenhorn” in the field of
genomics.
This thesis would not have been possible without the support from the Honors Program.
Prof. Craig and Prof. Shen offered me great mentorship during my academic career here
at Clarkson University, which gave me a tremendous opportunity to blossom individually
and intellectually. I would like to express my gratitude to Prof. Craig, who helped me
through difficult and sometimes even stressful times. I feel that along the way, I have
grown a lot personally and will enter the “real world” with experiences I could not have
gotten anywhere other than the Honors Program at Clarkson.
Finally, I must express my gratitude to my mother, who sacrificed a lot so that I could
come to study in the United States. She blindly supported me during my Odyssey-like
journey, giving me strength and confidence to believe in myself in every situation I
encountered. I dedicate this thesis to my ever-young and energetic grandma, who updates
me regularly with her hand-written letters, which keep me going here. She is always with
me – in my thoughts – even though an ocean separates us. I would also like to thank my
dad and two brothers, and their beautiful families, for their continuous support; this work
could not have been done without their encouragement.
Executive Summary
Over the last decade most of molecular biology has concentrated on analyzing naturally
occurring DNA sequences as revealed by large-scale sequencing efforts. In contrast,
synthetic biology aims to write new genetic information, thereby designing non-natural
DNA regions, genes, proteins, biological processes and entire organisms [1].
Unlike in the past, protein and DNA sequences have now become easier to obtain
electronically through databases than physically from library clones. At the same time,
gene synthesis technologies have developed to a level of high reliability. For this reason
direct synthesis of DNA regions is swiftly becoming the most efficient way to make functional
genetic constructs, and it enables applications such as gene, protein and genome engineering
[2].
Synthetic biology lies at the junction of molecular biology and engineering principles, and
is supported by efficient technologies for creating full-length genes, promoters and even
genomes [3]. Mutating DNA segments in the upstream regulatory region of a gene has
been shown to often drastically increase gene expression levels [4]. Central to such
efforts is the ability to design genetic constructs as easily as possible while
considering multiple design parameters in parallel. For example, the degree of sequence
identity to homologs and the presence or absence of specific regulatory sites or motifs
must all be considered simultaneously. Current sequence manipulation packages are
typically very feature-rich, with graphical user interfaces and multiple integrated tools to
allow for a seamless workflow. However, they are primarily built to analyze sequence data and
offer very little freedom for designing and fine-tuning genetic information.
Our software BASHER, on the other hand, is built to integrate promoter region
manipulation with all the tools necessary to design, write and edit sequence information
within one unifying interface. The software enables the quick, reliable and robust creation
of predefined and custom genetic building blocks, a process essential for systems
biologists seeking to understand how gene expression is controlled by the cell, which is the main
objective of this research project. The project title, genetomic promototypes, reflects
this design approach. Just as atoms make up all elements in nature, genetic building
blocks (small functional DNA sequences) are used to create new types of synthetic
regulatory regions, hence the name genetomic promototypes.
Chapter 1
Introduction
Transcriptional regulation plays a vital role in all living organisms. It influences
development, complexity, diversity, homeostasis and other important biological functions [5].
Transcription is the first stage in the universal information flow from the genome, where
all genetic programs are stored, to the proteome, through which these programs are executed.
Thus, understanding the complex mechanism behind the control of the transcription
machinery constitutes one of the fundamental goals of quantitative biology. At the most
fundamental level, transcription is controlled by the combinatorial interplay of
cis-regulatory elements (or motifs) present in the gene’s promoter region1 and associated
regulatory proteins (or transcription factors) present in the cytoplasm [6]. Because all
transcription factors are gene products themselves, this mechanism is in turn regulated by the set
of motifs present in each transcription factor gene’s promoter. Thus, the elementary principles
governing transcription can be understood through a quantitative description of how a
motif’s influence on gene expression depends on promoter context. In spite of major
efforts aimed at identifying motifs in different species using a variety of approaches and
analyzing their precise influence on gene expression, little is known about the principles
by which a gene’s motifs translate into an expression level [7]. In other words, the
quantitative effect of motifs on gene expression as a function of their promoter context is
still poorly understood.
1 See Figure 4 on page 17 for hypothetical gene control mechanism.
1.1 Biological Background
Modern molecular biology has brought many new tools to research scientists, as well
as an expanding database of genomes and new genes for study. Of particular use in the
analysis of these genes is the synthetic promoter region, a 600-1000 base pair nucleotide
sequence designed to the specifications of the investigator, which controls the
transcription machinery. A synthetic promoter controls the same product
as the gene of interest, but the bioengineered nucleotide sequence regulating that protein
may express it differently under various environmental conditions. Designing synthetic
promoters by hand is a time-consuming and error-prone process that may involve several
computer programs. For this reason, an integrated bioengineering tool (design software
called BASHER) is under development that combines many modules to provide a
platform for high-throughput synthetic promoter region design for multi-kilobase
sequences. Of all sequenced genomes, that of the yeast Saccharomyces cerevisiae has gained the
most attention due to the availability of multiple yeast genomes and high-quality mRNA
data [8]. For this reason, this yeast species was chosen as our core model in the genomic
analysis.
1.2 Oligonucleotide Synthesis
The power and flexibility of oligonucleotide synthesis are increasingly being recognized in
the bioengineering community. Traditional promoter region synthesis applications
include facilitation of site-directed mutagenesis, structural analysis and investigation of
transcription regulation. The new theory of promoter variant design takes combinatorial
and spatial effects of cis-binding sites2 into account and incorporates them into the
modeling process. Since binding sites can act as activators or inhibitors and can form
modules (sets of cis-elements) with linear, epistatic, synergistic or switch effects as a result
of their interaction, a deep combinatorial analysis is needed to decipher the governing
regulatory logic. Previous studies show that the spatial organization of these regulatory
elements has functional and mechanistic implications [9]. There are physical
interactions between them, as certain transcription factor binding sites overlap, implying
the possibility of protein complex formation. Also, in the higher-order chromatin structure,
there are regions of 3-dimensional occlusion that block protein binding to regulatory
motif sequences. Motif positioning relative to the transcription start site plays a significant role in
the transcription regulatory mechanism, so synthetic DNA segment2 insertions might
reveal some functionality. Finally, the distance between cis-elements plays a major role
in regulation; certain motif pairs only occur at a particular base pair distance from each
other, and some pairs occur more frequently than others in the promoter. It was also
shown that motif orientation and order have regulatory effects, i.e., a regulatory module
will only influence gene expression in the right spatial combination (orientation, order)
[9]. To decipher the governing regulatory logic, combinations of elements must first be
removed or replaced with new synthetic motif sequences, so that the resulting gene
expression profile can be analyzed under various environmental conditions. Furthermore,
additional logical design steps should include randomly moving a binding site to
other locations, making small changes to cis-elements, or adding new motifs based on
new statistical data. These design steps are performed by BASHER, resulting in a set
of systematic promoter variants in a high-throughput manner.
2 See Figure VII on page 61 in Appendix for an example of cis-binding sites of YCL027W promoter
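The design steps described in this section can be illustrated with a short Python sketch. It is not BASHER's actual implementation: the toy promoter, the site names and coordinates, the function names and the use of "N" as a neutral filler are all assumptions made for illustration, and the binding sites are assumed not to overlap.

```python
from itertools import combinations

# Hypothetical toy promoter assembled from labeled pieces; the two "sites"
# and their coordinates (0-based, end-exclusive) are invented for illustration.
PROMOTER = "AAAA" + "TGACTCA" + "TTTTT" + "CACGTG" + "AAAACCCC"
SITES = {"siteA": (4, 11), "siteB": (16, 22)}


def replace_site(seq, start, end, filler="N"):
    """Return seq with the half-open interval [start, end) overwritten by filler."""
    return seq[:start] + filler * (end - start) + seq[end:]


def knockout_variants(seq, sites):
    """Yield (removed_site_names, variant) for every non-empty subset of sites.

    Assumes the sites do not overlap, so each replacement is independent.
    """
    names = sorted(sites)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            variant = seq
            for name in subset:
                start, end = sites[name]
                variant = replace_site(variant, start, end)
            yield subset, variant


def move_site(seq, start, end, new_start, filler="N"):
    """Blank out a site and write its sequence at new_start, keeping the length."""
    motif = seq[start:end]
    blanked = replace_site(seq, start, end, filler)
    return blanked[:new_start] + motif + blanked[new_start + len(motif):]


variants = dict(knockout_variants(PROMOTER, SITES))
moved = move_site(PROMOTER, 4, 11, 22)
```

With two sites the sketch produces three knockout variants (each site alone, then both together). The number of variants grows as 2^n - 1 with the number of sites n, which is one reason an automated, high-throughput tool is preferable to designing variants by hand.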
1.3 BASHER – Computational Design Tool
In the past, researchers used many different programs to address the requirements of the
separate steps of synthetic promoter design [10]. Alternatively, they sent off their
requirements to a black box provided by a gene synthesis company and let it use its
proprietary programs to design the nucleotide sequences of interest. To facilitate the use of
synthetic promoter regions in both traditional and high-throughput applications, new and
more flexible solutions are required. BASHER is a useful tool for investigators who wish
to optimize protein expression and/or redesign their promoter of interest for detailed
structure/function studies (e.g., mutagenesis) [11]. The objective of this research project
is to create a command-line program that is able to perform all of the functions outlined
above for promoter design in a directed, step-wise manner. It accepts as input both
ortholog promoter sequences and global transcription factor binding site maps of the
organism of interest, and allows users to move through the design process in a series of
modules that address practical issues surrounding oligonucleotide design. Users can
follow the main “design a promoter” path or use the modules individually as needed.
Chapter 2
Background Information
Life depends on the cell’s ability to store, recover and translate the
genetic information needed to create and maintain a living organism. At cell division this
hereditary instruction is passed on from a cell to its daughter cells, and from one
generation to the next through the organism’s reproductive mechanism. Every living cell
contains these instructions in the form of genes3, the information-containing regions of
the DNA (deoxyribonucleic acid) that determine the hereditary characteristics of a
distinct species and of the individuals within it.
At the beginning of the twentieth century, when genetics started to emerge as a
scientific field of its own, researchers set out to understand the biochemical
structure of genes and cell functions in general. They knew that the hereditary information
in genes is copied and transmitted from cell to cell many times during the life cycle of a
multi-cellular organism. They also realized that the genetic code remains
essentially unchanged during this process. At the time, they did not know what type of molecule was capable of
virtually unlimited replication at such a level of accuracy while directing the development and
daily life of a living cell. The next logical step was to figure out the type of instructions
the genetic code contains and the physical organization of the genetic information, which
is responsible for the development and maintenance of even the simplest organism alive.
In the 1940s, when researchers discovered that genetic information consists primarily
of instructions for making proteins, some light was shed on these earlier
questions. Proteins are macromolecules that perform the majority of cellular functions:
they enable cells to move and to communicate with each other, they serve as building
blocks for cellular structures, they form the enzymes that catalyze nearly all chemical reactions
inside the cell, and they regulate gene expression.
3 See Figure I and II in Appendix for definition of gene and further details on gene architecture.
2.1 DNA Discovery and Chromosomes
The other crucial discovery of this era was the identification of DNA4 as the most
probable carrier of genetic information [12]. But the mechanism whereby the hereditary
code is transmitted unaltered from cell to cell, and how proteins are specified by the
instructions encoded in the DNA, remained unknown. In 1953 this mystery was solved
when two molecular biologists, James D. Watson and Francis Crick, determined the chemical
and geometric structure of DNA. The structure of DNA immediately
solved the problem of how the information in this molecule might be replicated, and also
provided insight into how a DNA molecule might encode the instructions for making proteins.
Biologists had also recognized that genes are carried on chromosomes, threadlike
structures5 in the nucleus that were first observed in the nineteenth century. Later, they discovered
that chromosomes consist of both DNA and protein. As discussed previously, we know
that the hereditary information of the cell is encoded in the DNA; in contrast, the
protein components of chromosomes play a vital role in the packaging and control of the
enormously long DNA molecules, so that they fit inside cells and can easily be accessed
by them.
Despite its molecular simplicity, the structure and chemical properties of DNA
make it an excellent fit as the raw material of genes. Every gene of every cell on Earth is
made of DNA, and insights into the relationship between DNA and genes have come
from experiments in a wide variety of organisms. It is crucial to understand how genes
and other important regions of DNA are arranged, in a 3-dimensional fashion, on the DNA
molecules present in chromosomes. It is also fundamental to fully
comprehend how eukaryotic cells fold these long DNA molecules into compact
chromosomes, which can then be correctly replicated and divided between two daughter cells at
cell division. Furthermore, more understanding must be gained about the enzymatic
repair of chromosomal DNA and the specialized proteins that direct the expression of the
DNA’s many genes.
4 For DNA double-helix architecture see Figure III in Appendix.
5 For eukaryote chromosome structure see Figure IV in Appendix.
2.2 From DNA to Protein
When the structure of DNA was discovered, it became clear how the hereditary
information is encoded in DNA’s sequence of nucleotides. Over the past forty
years, scientific progress has been astonishing. We now have complete genome
sequences for many organisms, and thus know the maximum amount of information needed
to produce a complex organism like ourselves. Since the hereditary information has finite
limits constrained by biochemical and structural features of the cell, it is now clear
that biology has finite complexity.
At this stage, we still have a great deal to discover about how the genome directs the
development of a simple, unicellular organism with about 500 genes, let alone a human
with approximately 30,000 genes. A vast number of questions remain to be answered,
posing great challenges to the next generation of bioengineers. But, as we now know,
much of the DNA-encoded information present in the genome is used to specify a linear
amino acid order6 for every protein of the organism. The amino acid sequence in turn
determines how each protein folds into a distinct 3-dimensional molecular shape, which
gives it unique chemical characteristics. When a specific protein is produced by the cell, the
corresponding genome region must be precisely decoded. Additional sequence
information in the DNA of the genome determines exactly when in the life of the cell and
in which cell types each gene will be expressed into protein [13]. Since proteins are the
main components of living cells, the decoding of the genome determines the mechanical
configuration, the biochemical properties and the distinctive features of every species on
Earth.
6 For the schematic depiction of a portion of chromosome 2 from the genome of the fruit fly see Figure V.
Even though we might expect the genome information to be arranged in an orderly manner,
the genomes of most multi-cellular organisms are unexpectedly disordered. Small
sections of coding DNA are interspersed with large blocks of seemingly meaningless
DNA. Some sections of the genome contain multiple genes and others lack genes
altogether. It is common for cooperating proteins in the cell to have their genes located on
different chromosomes, while adjacent genes usually express proteins that do not interact
at all [15]. Thus, deciphering genomes is not a simple task. Even with the help of
powerful computers, it is still very difficult to locate with certainty the beginning and end of
genes in the DNA sequences of complex genomes, and to predict when each gene is
expressed in the life cycle.
It is RNA, acting as an intermediate molecule, that directs protein synthesis, not the DNA itself.
When the cell needs a specific protein, the DNA sequence of the corresponding portion
of the chromosome is first copied into RNA, a process called transcription. These
RNA copies of the DNA sequence are then used directly as templates to synthesize protein,
a process called translation. The genetic information flow in cells is therefore
from DNA to RNA to protein. All cells on Earth, from unicellular to complex
multi-cellular organisms, express their genetic information in this way, which is termed the
central dogma of molecular biology because of its universality (see Figure 1).
Figure 1 – The central dogma (left). The flow of genetic information from DNA to RNA
(transcription) and from RNA to protein (translation) occurs in all living cells.
Figure 2 – Different gene expression efficiencies (right). Genes can be expressed with different
efficiencies. Gene A is transcribed and translated much more efficiently than gene B. This allows
the amount of protein A in the cell to be much greater than that of protein B [15].
Despite the generality of the dogma, there are major variations in the way information
flows from DNA to protein. One of the most important variations in eukaryotic cells is
that the RNA transcripts undergo a series of processing steps in the nucleus before they
are allowed to exit and be translated into protein. These steps can fundamentally change
the functionality of the RNA molecule and are therefore vital to understanding how the
eukaryotic genome is deciphered. It is also interesting to point out that for some
genes RNA, not protein, is the final product. Many of these RNAs fold into defined
three-dimensional structures that have structural and catalytic roles in the cell.
2.3 From DNA to RNA
Since the primary focus of this research is RNA transcription, it is crucial to
understand in detail the process of transcription, by which an RNA molecule is
produced from the DNA of a gene. Transcription is the means by which cells read out the
genetic code in their genes. Because many identical RNA copies can be produced from
the same gene, and each RNA molecule can direct the synthesis of many identical
protein molecules, cells can synthesize a large amount of protein when necessary. Also,
each gene can be transcribed and translated at a different rate7, allowing the cell to
control the quantity of each protein on a fine scale. Furthermore, the cell can
control its gene expression by adjusting RNA production according to the
momentary needs of the cell [16].
The first step a cell takes in retrieving the needed part of its genetic instructions is to copy
a specified portion of its DNA nucleotide sequence, a gene, into an RNA sequence. The
information copied from DNA to RNA, although in another chemical form, is still a
nucleotide sequence, hence the name transcription. RNA is a linear polymer made of four
different types of nucleotide subunits linked together by phosphodiester bonds, similarly
to DNA. But it differs chemically from DNA in two ways: (1) the nucleotides in RNA
are ribonucleotides rather than deoxyribonucleotides; (2) although both nucleic acids contain the
bases adenine (A), guanine (G), and cytosine (C), RNA contains the base uracil (U) instead
of the thymine (T) found in DNA. In RNA, G pairs with C, and A pairs with U by
hydrogen-bonding. It is not uncommon, however, to find other types of base pairs: for example, G
occasionally pairs with U.
7 For gene efficiency control see Figure 2 on page 12.
Despite these minor chemical differences, DNA and RNA differ quite significantly in
overall structure. Whereas DNA in cells forms a double-stranded helix, RNA is
single-stranded. RNA chains can therefore fold up into a variety of complex three-dimensional
shapes, which provide the structural and catalytic functions mentioned before.
The transcription process begins with the opening and unwinding of a small section
of the double helix to expose the bases on each strand, as in DNA replication. One
of the two strands then acts as a template for the synthesis of an RNA molecule. The
nucleotide sequence of the RNA chain is determined by complementary base-pairing
between incoming nucleotides and the DNA template (see Figure 3). When an
appropriate match is found, the incoming ribonucleotide is linked to the growing RNA
chain by covalent bonding, which is catalyzed enzymatically. The transcript is thus
elongated one nucleotide at a time, and is exactly complementary to the strand of DNA
used as the template.
Figure 3 - DNA transcription produces a single-stranded RNA molecule that is complementary to
one strand of DNA [15].
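The base-pairing rule behind Figure 3 can be written out as a minimal Python sketch. The function name and the example template are hypothetical. The template strand is assumed to be given in the 3'-to-5' direction, so that the RNA comes out 5'-to-3'.

```python
# DNA template base -> RNA base that pairs with it (A-U, T-A, G-C, C-G).
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Build the RNA 5'->3' by reading the template 3'->5', one base at a time."""
    rna = []
    for base in template_3to5:
        # Each incoming ribonucleotide is chosen by complementarity to the template.
        rna.append(PAIRING[base])
    return "".join(rna)


template = "TACGGCATT"      # hypothetical template strand, written 3'->5'
rna = transcribe(template)  # "AUGCCGUAA"
```

The resulting RNA is complementary to the template strand, and therefore identical to the other (non-template) DNA strand except that U replaces T.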
However, transcription differs from replication in several ways. Unlike a newly constructed
DNA strand, the RNA strand does not remain hydrogen-bonded to the DNA template.
Just behind the site of ribonucleotide addition, the RNA chain is displaced and the
DNA helix re-forms. Thus, RNA molecules are single-stranded when released from the
DNA template. Also, since they are copied from only a specific region of the DNA, the
resultant RNA molecules are much shorter than DNA. In the human body, most RNAs are
no more than a few thousand nucleotides long, and many are considerably shorter.
The enzymes that perform transcription are called RNA polymerases. RNA polymerases
catalyze the formation of the bonds that link the nucleotides together to form
the linear RNA chain. The enzyme moves stepwise along the DNA strand, opening the
helix just ahead of the active site for polymerization to expose a new region of the
template DNA for complementary base-pairing. Thus, the growing RNA chain is
elongated by one nucleotide at a time in the 5’-to-3’ direction. The immediate release of
the RNA sequence from the DNA as it is transcribed means that many RNA copies can
be made from the same gene in a relatively short time. When RNA polymerase molecules
follow closely behind one another in this way, over a thousand transcripts can be synthesized in an
hour from a single gene.
It is important to point out that the majority of genes carried in a cell's DNA specify
the amino acid sequence of proteins; the RNA molecules
that are copied from these genes are called messenger RNA (mRNA) molecules. To
transcribe a gene precisely, RNA polymerase must identify where on the genome to start
and where to finish. The initiation of transcription is an essential step in gene
expression because it is the point at which the cell regulates which proteins are to be
produced and at what rate.
2.4 The Transcriptional Machinery
Bacterial RNA polymerase is a multi-subunit complex, in which the sigma (σ) factor is
largely responsible for its ability to tell where to begin transcribing [17]. Initially, RNA
polymerase molecules hold on only weakly to the bacterial DNA. The polymerase
molecule typically slides swiftly along the DNA molecule until it arrives at a region
called a promoter, a special sequence of nucleotides indicating the starting point for
synthesis. At this moment it binds tightly to the promoter DNA and opens up the double
helix to expose a short stretch of nucleotides on each strand. With the DNA unwound,
one of the two exposed DNA strands acts as a template for complementary base-pairing
with incoming ribonucleotides. After approximately the first ten nucleotides of RNA
synthesis, the σ factor relaxes its firm hold on the polymerase and eventually dissociates
from it. RNA chain elongation continues until the enzyme encounters a second signaling
region in the DNA, called the terminator, where the polymerase halts and releases both
the DNA template and the new RNA chain. After the polymerase is released at a
terminator, it reassociates with a free σ factor and searches for a new promoter, where the
transcription cycle can start again.
As described above, the processes of transcription initiation and termination involve a
complex sequence of structural transitions in protein, DNA, and RNA molecules. Thus,
the signals encoded in DNA that specify these critical areas are difficult for researchers to
identify. On the one hand, comparisons of many bacterial promoters reveal that they are
heterogeneous. On the other hand, all of them contain related sequences, reflecting
the fact that they are recognized directly by the σ factor. These common features are often
summarized in the form of a consensus sequence. In general, a consensus nucleotide
sequence is derived by comparing many sequences with the same basic functionality and
tallying the most common nucleotide found at each position. It therefore serves as a
summary or “average” of a large number of individual nucleotide sequences.
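The derivation of a consensus sequence described above can be sketched in Python as a per-position majority vote. The example sites are invented for illustration, and the sketch assumes the sequences are already aligned and of equal length. Ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def consensus(sequences):
    """Return the most common nucleotide at each position of aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned and of equal length")
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        # Highest count wins; ties are broken alphabetically for determinism.
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


# Hypothetical aligned promoter elements (toy data, not measured sites).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
summary = consensus(sites)  # "TATAAT"
```

A single consensus string discards how strongly each position is conserved. A table of per-position nucleotide frequencies would keep that information, at the cost of a less compact summary.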
One of the reasons bacterial promoters differ in composition is that the specific
sequence determines the strength of the promoter, i.e., the number of initiation events per unit time. In other
words, evolution has tuned each promoter to initiate as often as needed and has created a
wide array of promoters. Promoters for genes that code for abundant proteins are much
stronger than those associated with genes encoding rare proteins, and their nucleotide
sequences are responsible for these differences. Like bacterial promoters, transcription
terminators also include a wide variety of sequences; in some cases the potential to form a simple RNA
structure is the most important common feature [18]. Since an almost unlimited number of
nucleotide sequences have this potential, terminator sequences are much more
heterogeneous than those of promoters.
Although a great deal is known about bacterial promoters, terminators and their
consensus sequences, their variability makes it difficult for researchers to locate
them with certainty simply by inspecting the nucleotide sequence of a genome. When analogous
sequences are encountered in eukaryotes, the problem of locating them is even more
complicated. Often, additional information, some of it from direct experimentation, is
needed to accurately locate the short DNA signals contained in genomes.
All promoter sequences are asymmetric, which plays an important role in their
arrangement in genomes. Since DNA is double-stranded, in theory two different RNA
molecules could be transcribed from any gene, using each of the strands as a template.
However, a typical gene has only a single promoter, and because the nucleotide sequences
of promoters are asymmetric, the RNA polymerase can bind in only one orientation.
Since the polymerase can synthesize RNA in the 5’ to 3’ direction only, the template
DNA strand for each gene is determined by the location and orientation of the promoter.
Analysis of genome sequences has revealed that the DNA strand used as the template for
transcription varies from gene to gene [19].
In contrast to bacteria, eukaryotic nuclei have three RNA polymerases, called RNA
polymerase I, RNA polymerase II, and RNA polymerase III. They are structurally similar
to one another and also to the bacterial enzyme. They share some common subunits and
many structural features, but they transcribe different types of genes. Transfer RNA,
ribosomal RNA, and other small RNAs are transcribed by RNA polymerases I and III,
while the vast majority of genes, namely those that encode proteins, are transcribed by
RNA polymerase II. Despite its many structural similarities to the bacterial polymerase,
eukaryotic RNA polymerase II differs from it functionally in important ways:
1. Eukaryotic RNA polymerases require general transcription factors (set of specific
proteins), which must assemble at the promoter with the polymerase before the
polymerase can begin transcription (see Figure 4).
2. Eukaryotic transcription initiation must deal with the packing of DNA into
nucleosomes and higher order forms of chromatin structure.
Figure 4 - Gene control mechanism for gene X [15].
The general transcription factors help position the RNA polymerase correctly at the
promoter, aid in pulling the two strands of DNA apart to allow transcription to begin, and
release the polymerase from the promoter into the elongation phase once transcription has begun.
2.5 Basics of Genetic Switches
In the previous section, the basic components of genetic switches - regulatory proteins
and the DNA sequences (motifs) that these proteins recognize - were identified. To
understand how these components operate to turn genes on and off in response to a range
of signals, consider a study of E. coli bacteria in which the composition of the growth
medium was changed. This is an example of one of the simplest control mechanisms in
gene regulation: an on-off switch that responds to a single signal [20].
The chromosome of the bacterium E. coli, a single-celled organism, consists of a
single circular DNA molecule, which encodes approximately 4300 proteins. The
expression of these genes is regulated according to the food available in the surrounding
environment. This is demonstrated by five E. coli genes that code for enzymes that
manufacture the amino acid tryptophan. These genes are clustered together on the
chromosome and are transcribed as one mRNA molecule from a single promoter
(an operon). When tryptophan is present in the medium, the cell shuts off their
production, since it no longer needs these enzymes. The molecular basis for this switch is
understood in extensive detail. If the level of tryptophan is low, the polymerase binds to
the promoter and transcribes the genes of the tryptophan operon. If the level of
tryptophan is high, the tryptophan repressor is activated and binds to the operator, where
it blocks the binding of RNA polymerase to the promoter. When the level of tryptophan drops, the
repressor releases its tryptophan and becomes inactive, allowing the polymerase to begin
transcribing these genes (see Figure 5). Thus the tryptophan repressor and operator form
a simple device that switches production of the tryptophan enzymes on and off according
to the availability of free tryptophan. Because the active, DNA-binding form of the
protein serves to turn genes off, this mode of gene regulation is called negative control,
and the gene regulatory proteins that function in this way are called transcriptional
repressors.
Figure 5 – Tryptophan negative control in E. coli [15].
In some cases, a bacterial promoter functions poorly on its own, either because it is
poorly recognized by the RNA polymerase or because the polymerase has difficulty opening
the DNA double helix at initiation. In both cases, the promoter can be helped by
gene regulatory proteins that bind to a nearby site on the DNA and contact the
RNA polymerase so that the probability of transcription increases drastically. This form of
gene regulation is termed positive control, since a DNA-binding protein switches the
gene on. For this reason the gene regulatory proteins that function in this manner are
known as transcriptional activators. In some cases, activator proteins aid RNA
polymerase binding to the promoter by providing an additional surface for attachment. In
other cases, they help the DNA-bound polymerase make the transition to the active
transcription phase.
For example, the bacterial activator protein CAP (catabolite activator protein)
activates genes that enable E. coli to use other carbon sources when glucose is not
available [21]. When the glucose level falls, there is an increase in the intracellular
signaling molecule cyclic AMP, which binds to the CAP protein, enabling it to bind near
its target promoters and thereby switch on the corresponding genes. Thus, the expression
of a target gene is turned on or off depending on whether the cyclic AMP level in the cell is
high or low, respectively (see Figure 6).
Figure 6 – Positive and negative gene control by regulatory proteins in prokaryotes [15].
Note that the addition of an "inducing" ligand can turn on a gene either by removing a
gene repressor protein from the DNA (upper left panel) or by causing a gene activator
protein to bind (lower right panel). Likewise, the addition of an "inhibitory" ligand can
turn off a gene either by removing a gene activator protein from the DNA (upper right
panel) or by causing a gene repressor protein to bind (lower left panel).
Positive and negative controls can be combined to form more complicated genetic
switches [22]. An example is the lac operon in E. coli, which is controlled
by both negative and positive regulatory mechanisms, through the lac repressor protein
and CAP respectively (see Figure 7). The lac operon codes for proteins required for lactose
transport and breakdown, while CAP enables the bacteria to use alternative carbon sources
such as lactose when glucose is scarce in the medium. It would be wasteful for CAP to
induce lac operon expression if lactose is not present, so the lac repressor ensures that
the lac operon is off in a lactose-free environment. This circuitry allows the lac operon
to respond to and integrate two distinct signals, so that it is expressed only when two
conditions are met: lactose must be present and glucose must be absent.
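The dual control of the lac operon can be written down as a small truth table. The sketch below is a deliberately idealized Boolean model in Python (real expression levels are graded, not on/off); the function and variable names are hypothetical and chosen only for illustration.

```python
def lac_operon_expressed(lactose_present: bool, glucose_present: bool) -> bool:
    """Idealized on/off model of the lac operon's dual control."""
    # Negative control: the lac repressor stays on the operator without lactose.
    repressor_bound = not lactose_present
    # Positive control: cyclic AMP rises when glucose is absent, so CAP binds.
    cap_bound = not glucose_present
    return cap_bound and not repressor_bound


for lactose in (False, True):
    for glucose in (False, True):
        state = "ON" if lac_operon_expressed(lactose, glucose) else "off"
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {state}")
```

Only one of the four input combinations switches the operon on, which is the AND-like behavior described in the text.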
Figure 7 - Dual control of the lac operon [15].
The logic of this simple genetic switch first attracted the attention of biologists over 50
years ago. As explained above, the molecular basis of the switch was uncovered by a
combination of genetics and biochemistry, providing the first insight into how gene
expression is controlled. Although the same basic strategies are used to control gene
expression in higher organisms, the genetic switches that are used are usually much more
complex.
2.6 Eukaryotic Transcriptional Regulation
Transcriptional regulation in eukaryotes differs in three important ways from that
found in bacteria. First, in eukaryotes there are gene regulatory proteins that can control
transcription even when they are bound to DNA thousands of nucleotide pairs away from the
promoter that they influence. Second, the eukaryotic RNA polymerase II requires general
transcription factors, which must be assembled at the promoter before transcription can
be initiated. This assembly process can be regulated by signals, so that transcription
initiation can be sped up or slowed down. Third, the packaging of DNA into
chromatin provides opportunities for regulation not available to bacteria.
Eukaryotes use gene regulatory proteins (activators and repressors) to regulate the
expression of their genes, just like bacteria. The DNA sites to which eukaryotic gene
activators bind are called enhancers, since their presence increases the rate of
transcription. To great surprise, in 1979 scientists discovered that these activators can
act from thousands of base pairs away from the promoter. Moreover, they could influence
transcription when bound either upstream or downstream of it. In this case the DNA between
the enhancer and the promoter loops out to allow the activator proteins bound to the
enhancer to come into contact with proteins bound to the promoter (see Figure 8).
Figure 8 - Transcription initiation by an activator from a distance in a eukaryotic cell [15].
In eukaryotes, the DNA control regions are often spread over a long stretch of DNA,
since some regulatory proteins control gene expression from a distance. For this reason
the gene control region should be defined as the whole DNA stretch involved in
transcriptional regulation, i.e. it should include the promoter, the location of general
transcription factor assembly, and all regulatory sequences to which regulatory proteins
bind to control the rate of the assembly at the promoter.
There are thousands of different gene regulatory proteins, some of which regulate
gene expression by recognizing specific DNA sequences via DNA-binding motifs.
Others do not recognize DNA directly but instead assemble on other DNA-bound
proteins (see Figure 4 and Figure 10). These proteins turn the genes of an organism
on or off according to their presence in different cell types, thus producing the unique
gene expression patterns that give each cell type its own characteristics. It is also
interesting to point out that each gene in a eukaryotic cell is regulated differently from
every other gene. Therefore, given the number of genes and the sheer complexity of
regulatory logic, it has been almost impossible to come up with standard rules for gene
regulatory mechanisms.
Most gene regulatory proteins have two domains with distinct functions. One
of the domains contains the motifs that recognize a specific regulatory DNA sequence,
while the other influences the rate of transcription initiation. As shown by biochemists,
the main function of activators is to bind, position, and modify the general transcription
factors and the polymerase at the promoter. This is accomplished in two ways: 1) by acting
directly on the transcription machinery itself, or 2) by changing the chromatin structure
around the promoter region.
As pointed out earlier, eukaryotic gene activators can influence several steps of
transcription initiation, and this has important consequences when they work together.
In many cases the joint effect of several regulators is the product, rather than the sum,
of the effects of the individual regulators. For example, if factor X increases the rate
of a certain process 10-fold and another factor Y independently increases it 10-fold by a
different mechanism, then together they produce a 100-fold overall increase. In a similar
manner, when activators X and Y each help recruit proteins to some reaction site, the
result is multiplicative. This behavior of gene activator proteins is called
transcriptional synergy [23].
Figure 9 – Transcriptional synergy [15].
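The arithmetic behind transcriptional synergy is simple enough to state explicitly. The following Python sketch contrasts the multiplicative combination described in the text with a purely additive one; the function names are hypothetical, and the assumption that the factors act independently is an idealization.

```python
from math import prod


def synergistic_fold_change(fold_changes):
    """Combined effect when independent factors multiply (transcriptional synergy)."""
    return prod(fold_changes)


def additive_fold_change(fold_changes):
    """Combined effect if each factor merely added its own increase over baseline."""
    return 1 + sum(f - 1 for f in fold_changes)


factors = [10, 10]  # factor X and factor Y, each a 10-fold activator
print(synergistic_fold_change(factors), additive_fold_change(factors))
```

The gap between the two numbers widens quickly as more activators are added, which is why a handful of cooperating factors can produce a switch-like response.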
Transcriptional synergy is observed both between different activator proteins bound
upstream of a gene and between multiple DNA-bound molecules of the same activator.
Therefore, with this collaborative switch-like mechanism, multiple gene regulatory
proteins - each binding to a different regulatory motif - are responsible for controlling
the transcription rate of a eukaryotic gene (see Figure 9). In summary, a regulatory
protein must be bound to DNA to influence its target promoter, and the rate of
transcription depends on the precise arrangement of regulatory proteins bound upstream and
downstream of the transcription start site.
Up until now, eukaryotic regulatory proteins have been treated as individual
components of the control mechanism. In reality, though, most are building blocks of
complexes composed of several polypeptides, each with its own function (see Figure 10).
These complexes often require a sequence-specific DNA binding site. In some well-studied
cases, for example, two gene regulatory proteins that each bind DNA only weakly on their
own bind cooperatively with sufficient combined affinity [24]. A particular regulatory
protein usually forms more than one type of complex, acting neither as activator nor
repressor on its own, but as a component of a regulatory complex whose function is
determined by its final assembly. This assembly depends both on the arrangement of
sequences in the control region and on the variety of regulatory proteins present in the
cell. In summary, the assembly of larger complexes of regulatory proteins provides a
second mechanism of combinatorial control, offering a new dimension of
opportunities.
Figure 10 – Eukaryotic regulatory protein complex formation [15].
Researchers have shown that in Drosophila (fruit fly) regulatory proteins are
positioned at multiple sites along long stretches of DNA, forming multi-component
complexes [25]. They influence the chromatin structure and the recruitment and assembly
of the general transcription machinery at the promoter. With these small cooperative
modules present in the control region, there are almost unlimited opportunities for the
control of eukaryotic gene transcription.
Another interesting type of regulatory control is based on the combinatorial
interplay of regulatory proteins bound near the promoter [26]. An example is the
Drosophila gene ‘eve’ (even-skipped), which is regulated by two gene activators (Bicoid
and Hunchback) and two repressors (Krüppel and Giant). The relative concentrations of these
four proteins determine whether protein complexes forming at the stripe 2 module turn on
transcription of the ‘eve’ gene. Seven combinations of regulatory proteins activate eve
expression, while many other combinations keep the stripe elements silent. This is a
striking example of combinatorial control, where a single gene can respond to an
enormous number of combinatorial inputs.
Chapter 3
Methodology
The motivation of this research project was to understand the fundamentals of
transcriptional gene regulation in the model organism yeast, Saccharomyces cerevisiae. The
genome of this organism was selected for initial hypothesis testing, since a great deal is
known about its genetic regulatory mechanisms. This is a very complex area of active
research. First, the cis-regulatory elements on the DNA must be accurately mapped,
providing the physical locations of regulatory protein binding sites. Second, it has to be
described how these gene regulatory elements affect expression under different
environmental conditions. Third, since many gene regulatory sites act as complexes, it
must be known which other regulatory proteins bind to them to form functional subunits.
Fourth, it must be accurately described how these regulatory elements or units interact in
a combinatorial manner, i.e. what kinds of general regulatory circuitry and universal
combinatorial logic exist in nature. Finally, it is important to point out that the
expression of a gene is the product of the design of its regulatory region, the
environmental conditions and the abundance of its regulators, which gives a 3-dimensional
scope to the problem (see Figure 11).
Figure 11 – Transcriptional gene regulation as a 3-dimensional space
3.1 The Rise of Synthetic Biology
In the last decade, molecular biology has focused on reading and analyzing naturally
occurring DNA sequences as the result of world-wide sequencing efforts. In contrast, our
innovative research project aimed to write new genetic information, thereby creating
non-natural DNA sequences, proteins and biological processes in order to test a biological
hypothesis. Since protein and DNA sequences have become easier to obtain
electronically through databases than physically from library clones, direct synthesis of
DNA regions of interest is rapidly becoming the most efficient way to make functional
genetic constructs, which enables applications such as genetic engineering. Thus,
systematic high-throughput computational promoter design provides a new means
of dissecting the underlying mechanism of transcriptional gene regulation.
As mentioned earlier, mapping transcription factor binding sites on the regulatory
regions of DNA is an active challenge in computational biology, which has given rise to
dozens of methods and hundreds of papers with limited success [27]. These methods suffer
from high false positive and false negative rates, have provided little understanding
of the functional role of regulators, and offer too few combinatorial examples to support
solid theoretical conclusions. Even though many pioneering mathematical and computational
models were developed, the common position-specific scoring matrix (PSSM) and
comparative phylogenetic conservation methods provided predictive, but not confirmatory,
power in hypothesis testing [7, 28].
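To make the PSSM idea concrete, the following Python sketch builds a log-odds matrix from a handful of aligned binding sites and scans a sequence for the best-scoring window. It is an illustration only and not one of the cited methods: the example sites, the uniform background of 0.25 and the pseudocount of 1 are assumptions, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dictionary per motif position."""
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pssm


def score(pssm, window):
    """Sum the per-position scores of a window as long as the motif."""
    return sum(column[base] for column, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical aligned binding sites, invented for illustration.
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pssm = build_pssm(sites)
print(best_hit(pssm, "AAATGACTCAAA"))
```

A high score only says that a window resembles known sites; as the text notes, it does not confirm that a protein binds there or that the site is functional.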
Another difficulty is the magnitude and complexity of the
problem. Previous researchers invested great effort in exhaustive promoter
mutagenesis (mutating virtually every base pair in the region of interest), but this
approach turned out to be very time consuming [29, 30]. Many years of work had to be
devoted to the analysis of a single promoter and even then, the results were not as
comprehensive as desired. They only pinpointed individual regulatory regions on the DNA,
providing no global insight into the regulatory mechanism.
Therefore, in order to fully comprehend the transcriptional regulation of genes we must
understand the mechanisms of regulation, how they evolved, and the
role and interaction of the contributing factors on a high-throughput, comprehensive scale.
It is essential to know the physical locations of the regulatory binding sites
(cis-elements) on the DNA, their effect on expression under different control conditions,
the combinatorial interplay between gene regulatory proteins bound to control regions, the
principles of promoter spatial organization and finally the function of upstream
transcription factors, signaling molecules and chromatin modifiers controlling gene
regulation. This is the ultimate “wish-list” that BASHER, the promoter design software,
tries to address by constructing a series of synthetic promoter regions with mutations
based on well-studied regulatory principles for hypothesis testing. It is important to
point out that BASHER is only one element in the analysis pipeline: it contributes
only the computational promoter design, which results in a text file listing the synthetic
DNA sequences specific to the hypothesis being tested. Each of these
promoters is labeled with a unique non-coding DNA segment (bar-code), which is
produced by one of BASHER’s algorithms. In this way, when the promoters are introduced into
yeast and tested under different stress conditions, the resulting gene expression patterns
can be uniquely identified by these bar-codes in the pool of mRNAs.
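The thesis does not spell out how BASHER generates its bar-codes at this point, so the following Python sketch shows only one plausible approach and should not be read as BASHER's algorithm: random candidates are accepted when they differ from every previously accepted bar-code in at least a minimum number of positions, so that a few sequencing errors cannot turn one bar-code into another. The length, minimum distance and function names are all assumptions.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(count, length=12, min_distance=3, seed=0, max_attempts=100000):
    """Greedily collect random DNA bar-codes that are pairwise well separated."""
    rng = random.Random(seed)  # fixed seed keeps the design reproducible
    barcodes = []
    for _ in range(max_attempts):
        if len(barcodes) == count:
            return barcodes
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(hamming(candidate, b) >= min_distance for b in barcodes):
            barcodes.append(candidate)
    if len(barcodes) == count:
        return barcodes
    raise RuntimeError("could not find enough bar-codes; relax the constraints")


for code in make_barcodes(5):
    print(code)
```

A practical design would add further filters, for example rejecting bar-codes that resemble known regulatory motifs or sequences already present in the host genome.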
3.2 Overall Design of Hypothesis Testing
BASHER’s output, a list of synthetic promoter regions of interest (in a text file), is
decomposed into an array of 30-mers with unique flanking regions, each of which is
complementary to the next segment of the same construct. These oligonucleotide
segments can then be synthesized industrially using traditional solid-phase array
technology as a collection of microscopic DNA spots attached to a solid surface. The
oligonucleotide chip obtained in this way contains all segments of the synthetic promoters
for the particular experimental design of interest. In the next step, the polymerase chain
reaction (PCR) is used to amplify and join the pool of 30-mers into their
distinct promoter constructs [31]. In theory, the synthetic promoter regions are
obtained when the 30-mers line up through their overlapping flanking ends,
which allows them to be joined. Once the promoter regions have been successfully
constructed, they are distributed into different yeast cell pools, where homologous
recombination transfers the synthetic promoters into the genome of the host organism. These
pools are then subjected to different stress conditions (rich media, amino acid
starvation, osmotic stress, mating factor, etc.), which results in the expression of
different genes in the organism. These gene expression characteristics are directly
correlated with the mRNA content of the cytoplasm, which can be measured by the polony
sequencing technique developed in the Church laboratory [32]. Since the synthetic promoters
were each labeled with a bar-code, the mRNA can be uniquely identified and thus qualitative
and quantitative conclusions can be drawn about particular gene regulatory
mechanisms for each promoter construct.
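The decomposition of a promoter into overlapping 30-mers and their reassembly can be sketched as follows. This Python example is a simplification under stated assumptions: the overlap length of 10 bases is invented, and all segments are kept on the same strand, whereas assembly by PCR relies on complementary flanks on alternating strands. The function names are hypothetical.

```python
import random


def split_into_oligos(sequence, oligo_len=30, overlap=10):
    """Cut a sequence into oligos whose ends overlap the next oligo."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than the oligo")
    step = oligo_len - overlap
    oligos, start = [], 0
    while True:
        oligos.append(sequence[start:start + oligo_len])
        if start + oligo_len >= len(sequence):
            return oligos
        start += step


def assemble(oligos, overlap=10):
    """Join oligos in order, checking that each flank matches its neighbor."""
    assembled = oligos[0]
    for oligo in oligos[1:]:
        if assembled[-overlap:] != oligo[:overlap]:
            raise ValueError("flanking regions do not match")
        assembled += oligo[overlap:]
    return assembled


rng = random.Random(1)
promoter = "".join(rng.choice("ACGT") for _ in range(95))  # toy promoter
oligos = split_into_oligos(promoter)
print(len(oligos), assemble(oligos) == promoter)
```

In a pooled reaction the order of the oligos is not given in advance; it is the uniqueness of each flanking region that ensures every segment can only join its intended neighbor.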
Paired with the gene deletion strains available for Saccharomyces cerevisiae, an
even more informative gene regulatory comparison is possible [30]. As with the
synthetic promoter constructs, naïve (unmodified) bar-coded yeast cells are mixed with
the yeast deletion cell line, which, through homologous recombination, yields a new yeast
deletion cell line labeled with the particular bar-code for each gene of interest. As
before, these cells are grown under different stress conditions and their mRNA fingerprint
is captured by polony sequencing. Therefore, in the final stage of the comparative analysis
we have two mRNA profiles in hand, one for the synthetic cell line and another for the
yeast deletion cell line. Comparing these profiles makes it possible to validate gene
regulatory mechanisms and gives some of the most informative results on regulatory logic
available today. In this way, we are able to pinpoint synergistic transcription factor
cooperation or even requirements for complex formation at the regulatory region of the
DNA (see footnote 8).
As the result of this comparative analysis we obtain lists of cis-regulatory elements
and trans-regulatory proteins and their biochemical interactions in gene control under
particular cell states. From these data, we can infer the regulatory logic of the gene(s)
of interest, which might yield data of sufficient quality to facilitate biophysical
modeling. By following up on interesting constructs or interactions, new regulatory
mechanisms can be discovered, which can lead to gene regulatory network reconstruction.
Combining a number of genes with the basic principles of cis-regulation promises to improve
our understanding of gene regulatory mechanisms. These results can ultimately guide better
computational “motif finding” and thus give a clearer picture of gene regulatory proteins.
8 See flow chart of hypothesis testing in Figure VI in Appendix.
3.3 Designing Promoter Variations
When designing synthetic promoter regions, it is vital to model cis-regulatory
elements properly, since they are the fundamental building blocks of transcriptional gene
regulation. First, we must recognize that a particular DNA segment is a binding site for
a regulatory protein. This is usually done by computational methods validated
by ChIP-on-chip experiments (see footnote 9), which give a list of functionally important
elements by location relative to the promoter start site [33]. To determine what effect a
cis-binding site has, we must consider its location and neighboring sequences by randomly
moving binding sites to different locations on the regulatory DNA sequence. In addition,
the already mapped binding motifs should be modified by small base pair substitutions and
by partial or even full motif deletions. Finally, in order to understand regulatory
protein control comprehensively, the design should include the addition of new motifs based
on PSSM data available from previous experimental efforts [28].
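The three basic manipulations mentioned above (base substitution, motif deletion and motif relocation) amount to simple string operations. The Python sketch below is illustrative only; BASHER itself was written in Perl, and the function names, coordinates (0-based) and toy promoter are assumptions.

```python
def substitute(promoter, position, base):
    """Replace the base at a 0-based position."""
    return promoter[:position] + base + promoter[position + 1:]


def delete_motif(promoter, start, length):
    """Remove a motif of the given length starting at a 0-based position."""
    return promoter[:start] + promoter[start + length:]


def move_motif(promoter, start, length, new_start):
    """Cut a motif out and re-insert it at new_start in the remaining sequence."""
    motif = promoter[start:start + length]
    rest = delete_motif(promoter, start, length)
    if not 0 <= new_start <= len(rest):
        raise ValueError("new_start lies outside the promoter")
    return rest[:new_start] + motif + rest[new_start:]


toy = "AAAATGACTCGGGG"  # hypothetical promoter with a motif at positions 4-9
print(substitute(toy, 5, "C"))
print(delete_motif(toy, 4, 6))
print(move_motif(toy, 4, 6, 0))
```

Moving a motif keeps the overall promoter length constant, whereas a deletion shortens it; in a real design the deleted stretch might instead be replaced by neutral filler so that the spacing of the remaining sites is preserved.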
To cover gene regulatory logic, we must decipher how cis-elements combine to
determine gene expression. It is important to know whether the regulatory elements act as
activators or inhibitors and whether individual elements form bigger, more complex
regulatory modules. If they do, the modules can combine according to various
regulatory modes; for example, they can be in linear, epistatic, synergistic, or switch
relationships with each other. In a modular regulatory arrangement we need to know
which element to modify or remove completely from the complex in order to change the effect
of the regulatory mechanism or even turn it off completely.
Another important aspect of synthetic promoter design is the spacing of
regulatory elements on the active DNA segment. As shown previously, different
spatial organizations of transcription factors imply functional and mechanistic
characteristics in gene regulation [34]. The first dimension of control is based on
physical interactions of regulatory elements, which is embodied in the overlap of closely
spaced transcription factor binding sites. This means that some regulatory
proteins, even though they have different functional roles, might share exactly the same or
similar sequence motifs for binding to the DNA. But regulatory proteins
might also have an effect on the target promoter that does not necessarily contribute to
the transcriptional layer of regulation, but rather influences the positioning of
higher-order chromatin structure by generating 3-dimensional occlusions. This layer
provides the second dimension of regulatory control. “Blocking” proteins or protein
complexes close to or even far away from the regulatory region might activate or inhibit
RNA polymerase initiation and determine the rate of transcription through the mechanisms
described in the theory section. In the spatial organization of regulatory elements the
relative distance to the promoter start site also plays a major role. Thus, when inserting
large DNA segments to modify the regulatory hotspot on the promoter, we might infer
protein interaction patterns related to RNA polymerase II positioning. On the other hand,
when inserting small segments that mutate a smaller portion of the regulatory region, we
can draw conclusions regarding nucleosome positioning (see Figure 12).
9 ChIP-on-chip is a technique that combines chromatin immunoprecipitation (ChIP) with microarray technology (chip). It is used to investigate interactions between proteins and DNA in vivo.
Figure 12 - Nucleosomes are the fundamental repeating subunit of all eukaryotic chromatin [15].
The modification of nucleosomes by remodeling factors can open up a new DNA region
for transcription if the new cell state requires the expression of different genes. In this
way, nucleosomes play an essential role in RNA polymerase transcription efficiency, since
they prevent the polymerase from unnecessarily accessing the promoter regions of genes that
are not needed by the cell in that state.
As pointed out before, there are many transcription factor binding sites in the 5’
upstream promoter regions, which interact according to the cell’s transcriptional needs.
Beer and Tavazoie [9] have shown that the base pair distances
between these regulatory elements and the order and orientation of protein binding play a
vital role in the regulatory mechanism. Their key findings were that there is a great
deal of redundancy in the modes of transcriptional regulation (OR logic), that many factors
require at least one partner to be functional (AND logic), and that one mode of
combinatorial regulation is the absence of a factor that would cause a different mode of
regulation (NOT logic). Therefore, systematically moving pairs of regulatory sites closer
to or farther away from each other influences gene expression patterns under different
stress conditions of the cell. Likewise, inverting the orientation of some of the
transcription factor binding sites on the DNA produced a different mRNA production
response. Similarly, changing the relative order of two or more regulatory elements
produced significant deviations from the original expression levels.
Figure 13 – Gene regulatory logic inferred from motif sequence and expression pattern
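The spacing, orientation and order variations discussed above can be generated systematically for a pair of binding sites. The Python sketch below is an assumption-laden illustration: the two sites are invented, the spacer is a plain run of A's (a real design would use a sequence known to be free of regulatory motifs), and inverting a site is modeled as taking its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Return the site as read on the opposite strand (inverted orientation)."""
    return site.translate(COMPLEMENT)[::-1]


def pair_variants(site_a, site_b, spacings, filler="A"):
    """List (label, construct) pairs varying spacing, orientation and order."""
    variants = []
    for gap in spacings:
        spacer = filler * gap
        variants.append((f"spacing={gap}", site_a + spacer + site_b))
        variants.append((f"spacing={gap},inverted_b",
                         site_a + spacer + reverse_complement(site_b)))
        variants.append((f"spacing={gap},swapped", site_b + spacer + site_a))
    return variants


# Hypothetical binding sites, invented for illustration.
for label, construct in pair_variants("TGACTC", "TTACCG", [0, 5, 10]):
    print(f"{label:24} {construct}")
```

Each construct can then be embedded in the template promoter and bar-coded, so that the expression effect of a single change in distance, orientation or order can be read out separately.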
Chapter 4
Results
My main contribution to the overall collaborative Harvard project was to develop
BASHER, the core software for synthetic promoter region design, which
incorporates multiple built-in functions for DNA sequence modification based on
previous research results. It was written in the freely available, open-source programming
language Perl, version 5.8.6, because of its strong string manipulation capabilities. The
synthetic promoters described in this thesis were designed on a 1.0 GHz Intel Pentium III
PC with 512 MB of memory, running the Fedora 1.0 Linux operating system. BASHER
uses multiple databases as input (ortholog promoter sequences and global transcription
factor binding site maps of the organism of interest) and allows users to move through the
design process in a series of modules that address practical issues surrounding
oligonucleotide design (see Figure 14). BASHER is a useful tool for computationally
experienced investigators who wish to optimize protein expression and/or redesign their
promoter of interest for detailed structural or functional studies.
Figure 14 – Flow chart of BASHER architecture for input/output variants
4.1 BASHER Preliminaries
The objective of BASHER is to provide well-designed promoter sequences for the gene
of interest (YFG). Combinatorially, this is a very complicated task, since the total number
of possible sequence variants of a single 600 base pair long promoter is 4^600 when
single nucleotide substitutions are allowed at every position. This number is so large that
exhaustive enumeration exceeds the capabilities of even the most powerful computers of
today. Thus, the solution to this problem is to develop an interactive tool, operated by a
biologist, which automatically provides “smart” promoter designs founded on the results of
preliminary research on transcriptional regulation.
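The size of this design space is easy to check with arbitrary-precision integers. The short Python calculation below also shows, for contrast, how few variants exist if exactly one position of a given template is substituted; the variable names are illustrative only.

```python
PROMOTER_LENGTH = 600
ALPHABET_SIZE = 4  # A, C, G, T

# Every position may hold any of the four bases.
all_sequences = ALPHABET_SIZE ** PROMOTER_LENGTH

# Exactly one position of a fixed template is changed to one of three other bases.
single_substitutions = (ALPHABET_SIZE - 1) * PROMOTER_LENGTH

print(f"4^600 has {len(str(all_sequences))} decimal digits")
print(f"single-substitution variants of one template: {single_substitutions}")
```

The contrast between the two numbers illustrates why guided designs that change a few informative positions are tractable while exhaustive exploration is not.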
There are six major sources of motifs available in the literature, two of which are used
as input to BASHER. The motif compendiums came from various collaborators
involved in transcription factor mapping in Saccharomyces cerevisiae and other yeast
genomes. The first compendium was obtained by computational motif discovery
techniques based on genome-wide chromatin immunoprecipitation data by Fraenkel [34].
The second was produced by the National Laboratory of Protein Engineering and Plant
Genetic Engineering in Germany with the database named TRANSFAC [35]. The third
compendium is based on Kellis’ genome-wide comparative analysis of
three closely related yeast species [36]. The fourth motif compendium was compiled by
automated, comprehensive analyses of promoter regulatory motifs based on expression
coherence by Lapidot [37]. The fifth database was generated by p-value statistical
analysis derived from probabilistic graph theory developed by Friedman [38]. Finally,
the sixth database was obtained by Tanay using bicluster analysis of enriched motifs in
previously published results from heterogeneous experimental techniques [39].
Unfortunately, these motif databases are far from reliable or complete; thus, as an
initial step, it is necessary to filter out and choose between redundant motifs.
Even though these comprehensive motif compendiums are available, we
must settle on the one that most realistically models regulatory binding sites in the
upstream regions of the promoters. Deciphering motifs poses a fundamental barrier,
since there are no accurate and systematic computational or experimental validation
methods available in the scientific community. Therefore, our research group decided to
use motif compendiums [34] and [38] as defaults in BASHER, since they provided the
most comprehensive motif coverage and a unique probabilistic method showing great
potential, respectively. Based on the user’s configuration parameters, BASHER is able to
run its design functions on each individual data set and on the combined data set. In this
way, the investigator is able to compare and combine transcription factor binding locations
derived from independent theoretical techniques. It must be noted that during the
development of BASHER these motif compendiums were updated multiple times following the
latest developments in motif discovery. With this need for flexibility in mind, BASHER was
designed so that future compendium upgrades are easy to implement.
As mentioned before, BASHER also needs a library of ortholog promoter sequences to
use as templates for synthetic promoter region manipulation. We used data from the
Saccharomyces Genome Database (SGD) project [40], which collects information on
and maintains an up-to-date database of the molecular biology of the yeast
Saccharomyces cerevisiae. Thus, about 8000 ortholog promoter sequences, permanently
stored in a sub-directory of BASHER, were available for input in standard FASTA
format in each experimental promoter design. This simple, text-based format contains
promoter sequences in which bases are represented by single letters (A, C, T, and
G). The format also allows sequence names and comments to precede the sequences,
which makes it easy to parse and manipulate sequences using the Perl scripting language.
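As an illustration of why the FASTA format is easy to parse, the following is a minimal sketch in Python rather than Perl. It is not BASHER’s own parser; the function name `parse_fasta` and the example records are assumptions made for this example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first token is the name, the rest is a comment.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so they are concatenated.
            records[name] += line.upper()
    return records


# Hypothetical two-record library; real promoters are far longer.
example = [">FUS1 upstream region", "acgtac", "GGTA", ">STE12", "TTGACA"]
promoters = parse_fasta(example)
```

Each promoter is then available as a plain string keyed by its name, which is the form assumed by the other sketches in this chapter.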
4.2 BASHER Structure
BASHER’s structure reflects the data sources at hand. That is, it has a blindly
functional part that manipulates the promoter sequences without prior biological
knowledge of the regulatory regions and transcription factors involved in transcriptional
regulation. To perform synthetic promoter region design at this level, the software does not
need an input that specifies the cis-regulatory elements; it performs systematic,
combinatorial string manipulation.
The second structural level of BASHER is based on biological data already
discovered by collaborating research groups on transcriptional regulation in yeast. The
software uses these PSSM compendiums, stored in a specific library in the main
frame, to perform data mining of cis-regulatory element binding sites on the promoter.
In other words, from raw computational and experimental data, it constructs a cis-regulatory
map (one for each promoter of interest), which is used in all functionalities built into this
layer of computational promoter design. Using these regulatory maps, one of
BASHER’s unique features is the visualization of transcription factor binding
sites on the promoter of choice in a graphical user interface (GUI). According to the current
literature, this has never been done before, which places BASHER in a novel category of design
software for computational biologists interested in transcriptional regulation. This second
structural level, with prior knowledge of cis-regulatory elements, performs various
mutations on the promoter region based on regulatory logic and cis-element pair analysis.
These mutation algorithms result in a set of newly created synthetic promoter regions in
text format, which are the basis of our hypothesis testing as described before. Since
BASHER is itself a software prototype, it was designed so that it can be easily
extended with other capabilities using the same or different resources. Thus, the
flexible main frame is easily upgradeable to incorporate updated biological data and/or
function libraries.
4.2.1 Configuration File
In order to run BASHER, a configuration file [config.txt] must be specified and placed in
the main directory; it contains the configuration data required by all scripts. These
configuration parameters can be changed by the investigator, so each promoter
sequence is designed according to the given specifications. The only input to BASHER
is thus the configuration file, which after evaluation results in a set of modified, synthetic
promoter sequences written to a text file (see Figure 15).
Figure 15 – Flow chart of BASHER input/output requirements
For each promoter variant, the mutation steps and locations are documented. This way,
when the investigator finds an interesting construct with a unique gene expression
pattern, he or she can pinpoint exactly which change caused that response. Then, with another
slight modification of that combinatorial mutation, new regulatory mechanisms can be
retrieved from the expression data of synthetic regulatory regions.
At the beginning of every program, data is retrieved from ‘config.txt’. If required
by the procedure, the particular keyword, surrounded by brackets [KEYWORD], is
pattern-matched and checked for input validity. This means each input field in use must
have a specific data type; otherwise an error message is generated, prompting for the
correct input format. After this step, the next field is read, checked, and stored in an
input hash under its keyword. The configuration file typically contains fields
specifying certain directory locations (base directory, promoter library, PSSM matrix
library, etc.), mutation algorithm arguments (kmer length, cis-element distance threshold,
overlap gap, etc.) and running mode designators (unit definition modes, module
definition modes) as shown below in Figure 16. Thus, the newly created input hash
acts as a memory module from which any script can retrieve the input configuration for the
promoter design of interest. This way, the input has to be read from the text file only once.
Figure 16 – Partial list of configuration file parameters.
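The read, validate and store cycle described above can be sketched as follows. This is a hedged Python illustration, not the Perl code used in BASHER: the one-field-per-line layout `[KEYWORD] value`, the keyword names in `FIELD_TYPES` and the function name `read_config` are all assumptions, since the exact file syntax is shown only in Figure 16.

```python
import re
from typing import Callable, Dict, Iterable

# Hypothetical keywords and expected data types (not BASHER's real field names).
FIELD_TYPES: Dict[str, Callable[[str], object]] = {
    "BASE_DIR": str,
    "KMER_LENGTH": int,
    "LOG_THRESHOLD": float,
}

# Assumed layout: one "[KEYWORD] value" field per line.
LINE_PATTERN = re.compile(r"^\[(\w+)\]\s*(.*)$")


def read_config(lines: Iterable[str]) -> Dict[str, object]:
    """Pattern-match each field, check its type, and store it in a dict (hash)."""
    config: Dict[str, object] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"line {number}: expected '[KEYWORD] value'")
        keyword, value = match.groups()
        if keyword not in FIELD_TYPES:
            raise ValueError(f"line {number}: unknown keyword [{keyword}]")
        try:
            config[keyword] = FIELD_TYPES[keyword](value)
        except ValueError:
            raise ValueError(
                f"line {number}: bad value for [{keyword}]: {value!r}"
            ) from None
    return config
```

Because the result is an ordinary dictionary, every later step can look up its parameters without touching the file again, which mirrors the role of the input hash.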
4.2.2 Mutagenesis Module
The goal of the first structural module of BASHER is to pinpoint all active cis-elements
and elements with functional relevance in the promoter. One of the scripts mimics a
typical genetic technique called scanning mutagenesis, used in the laboratory by
microbiologists when trying to determine the transcriptional hotspots in the DNA region
of interest. The method behind this technique is to remove kmers (short segments of
DNA) and detect the resulting change in the expression pattern of the gene under investigation.
When performing mutations systematically, sliding window iteration is used, in which the
kmer length and sliding window size are adjustable (see Figure 17). As noted before, these
algorithm parameters (or arguments) can be modified in the configuration file as the
investigator requires, thus providing great flexibility in high-throughput synthetic
promoter generation.
Figure 17 – Partial list of configuration file parameters.
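The sliding window iteration can be sketched in a few lines. This Python fragment is only an illustration under stated assumptions: the function name `sliding_windows` and the `step` argument (how far the window advances each time) are introduced here and are not taken from BASHER’s scripts.

```python
from typing import Iterator, Tuple


def sliding_windows(sequence: str, kmer_length: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (0-based start, kmer) for each full window along the sequence."""
    if kmer_length <= 0 or step <= 0:
        raise ValueError("kmer_length and step must be positive")
    for start in range(0, len(sequence) - kmer_length + 1, step):
        yield start, sequence[start:start + kmer_length]


# Toy sequence; a real promoter would be several hundred bases long.
windows = list(sliding_windows("ACGTACGTAC", kmer_length=4, step=3))
```

Each yielded window is one candidate site for a mutation, so the number of variants per promoter follows directly from the kmer length and step.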
There are three types of sequence mutations available in BASHER’s repertoire of
algorithms. The first mutation is called permutation, which randomly permutes a kmer of given
length in the promoter, i.e., k-length sub-sequences are randomly shuffled using
the sliding window method described above. This “weak” mutation conserves the GC
content¹⁰ of the promoter region and thus, according to theory, does not cause major
disturbances in the genome. This computational script mimics naturally occurring DNA
replication errors, providing a simple modeling option for promoter design. [For a
detailed algorithm description see ‘permute.pl’ in the Manual.]
10 Guanine-cytosine content (GC-content) is a characteristic of the genome of an organism or a piece of DNA. Usually expressed as a percentage, it is the proportion of GC-base pairs in the genome of interest. The remaining fraction provides the AT-content (adenine-thymine content). For example, 58% GC-content = 42% AT-content.
Figure 18 – Partial permutation script output for gene FUS1 with specified changes.
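A minimal sketch of the permutation (“weak”) mutation is given below in Python. It is not the ‘permute.pl’ script; the function names, the `step` argument and the fixed random seed are assumptions made so the example is reproducible.

```python
import random
from typing import List, Tuple


def permute_window(sequence: str, start: int, kmer_length: int, rng: random.Random) -> str:
    """Return a copy of the sequence with the bases of one window shuffled."""
    window = list(sequence[start:start + kmer_length])
    rng.shuffle(window)
    return sequence[:start] + "".join(window) + sequence[start + kmer_length:]


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def permutation_variants(sequence: str, kmer_length: int, step: int,
                         seed: int = 0) -> List[Tuple[int, str]]:
    """One variant per window position, each with a single shuffled kmer."""
    rng = random.Random(seed)
    return [(start, permute_window(sequence, start, kmer_length, rng))
            for start in range(0, len(sequence) - kmer_length + 1, step)]
```

Shuffling only reorders the bases inside the window, so the GC content and the promoter length are unchanged by construction. Note that a shuffle can occasionally return the original kmer, so a real pipeline might want to re-draw in that case.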
Before introducing the second mutation type, a proper mathematical definition of the
position-specific scoring matrix (PSSM), a commonly used representation of
motifs (patterns) in biological sequences, is necessary. The PSSM is a matrix of score values that gives
a weighted match to any given substring of fixed length. It has one row for each symbol
of the alphabet, and one column for each position in the pattern. The score assigned by a
PSSM to a substring $s = (s_j)_{j=1}^{N}$ is defined as $\sum_{j=1}^{N} m_{s_j, j}$, where $j$ represents position in
the substring, $s_j$ is the symbol at position $j$ in the substring, and $m_{\alpha, j}$ is the score in row $\alpha$,
column j of the matrix. In other words, a PSSM score is the sum of position-specific
scores for each symbol in the substring. [http://en.wikipedia.org/wiki/PSSM]
A PSSM assumes independence between positions in the pattern, as it calculates
scores at each position independently from the symbols at other positions. The score of a
substring aligned with a PSSM can be interpreted as the log-likelihood of the substring
under a product multinomial distribution. The PSSM scores can also be interpreted in a
physical framework as the sum of binding energies for all nucleotides aligned with the
PSSM.
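The PSSM score defined above is a direct sum over positions, which the following Python sketch makes concrete. The matrix `toy_pssm` and the function name `pssm_score` are invented for illustration and do not come from the motif compendiums used in BASHER.

```python
from typing import Dict, List

# One row per base, one column per motif position.
Pssm = Dict[str, List[float]]


def pssm_score(pssm: Pssm, substring: str) -> float:
    """Sum the position-specific scores of each symbol in the substring."""
    width = len(pssm["A"])
    if len(substring) != width:
        raise ValueError(f"substring must have length {width}")
    return sum(pssm[base][j] for j, base in enumerate(substring))


# Made-up 3-column matrix for illustration only.
toy_pssm: Pssm = {
    "A": [1.0, -2.0, 0.5],
    "C": [-1.0, 0.0, -0.5],
    "G": [0.0, 2.0, -1.0],
    "T": [-0.5, -1.0, 1.5],
}

score = pssm_score(toy_pssm, "AGT")  # 1.0 + 2.0 + 1.5
```

Each position contributes independently to the total, which is exactly the independence assumption noted in the text.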
In our model, the PSSM matrices, which represent a collection of 204 cis-regulatory
binding motifs, were obtained from one of our collaborators at MIT. These regulatory
sequences were computationally derived as demonstrated in [34]. These matrices can be
graphically represented as sequence logos, where at each position the size of each residue is
proportional to its frequency in that position compared to the background frequency (see
Figure 19).
Figure 19 – Graphical representation of a motif.
The second mutation is called randomization, which randomly generates a log-value
proof kmer, i.e. a short (4-10 base pair) DNA sequence that is checked against all
PSSM matrices and must satisfy a constant log-value threshold of 0.005 (default) or the value
specified by the user in the configuration file. This ensures that a functionally neutral sequence
is created, which acts as a “strong” mutation when systematically inserted into the
promoter region using the sliding window method, similarly to the “weak” mutation (see
Figure 19). The GC-content is not conserved in this case, so the small region is
completely erased while the original promoter length stays the same. This computational
script mimics a procedure frequently used in microbiology, as in yeast deletion strains,
which blocks out the functional region completely. [For a detailed algorithm description see
‘randomize.pl’ in the Manual.]
Figure 19 – Partial randomize script output for gene FUS1 showing random kmer insertion.
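The generate-and-check loop behind randomization can be sketched as follows in Python. This is a simplified illustration, not ‘randomize.pl’: it assumes that “satisfying the threshold” means every PSSM scores every window of the candidate below the threshold, it ignores motifs wider than the kmer, and the names `best_match` and `neutral_kmer` are hypothetical.

```python
import random
from typing import Dict, List, Optional

Pssm = Dict[str, List[float]]  # one row per base, one column per position


def best_match(pssm: Pssm, sequence: str) -> float:
    """Highest PSSM score over all full-width windows of the sequence."""
    width = len(pssm["A"])
    scores = [sum(pssm[base][j] for j, base in enumerate(sequence[i:i + width]))
              for i in range(len(sequence) - width + 1)]
    # Motifs wider than the sequence cannot match at all in this sketch.
    return max(scores) if scores else float("-inf")


def neutral_kmer(pssms: List[Pssm], length: int, threshold: float,
                 rng: random.Random, max_tries: int = 10000) -> Optional[str]:
    """Draw random kmers until one scores below the threshold for every PSSM."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(best_match(pssm, candidate) < threshold for pssm in pssms):
            return candidate
    return None  # no neutral kmer found within the allowed number of tries
```

The accepted kmer would then overwrite one window of the promoter, for example `promoter[:start] + kmer + promoter[start + len(kmer):]`, which keeps the promoter length unchanged.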
Therefore, when performing both mutations in parallel in an experimental promoter
design, we expect four possible outcomes with different biological meanings. 1) Case (--):
If neither of the mutations has any effect on the gene expression pattern, then the
promoter region must be a non-functional sequence. 2) Case (++): If both mutations have
an effect on the mRNA levels due to transcriptional changes, then the
promoter region is a possible hotspot for cis-regulatory elements. 3) Case (-+): If
permutation does not have an effect but randomization changes gene
expression, then we conclude that the randomly generated kmer is a newly discovered
motif, since it plays a regulatory role. 4) Case (+-): This case is highly unlikely,
since the “strong” mutation does not show a regulatory effect while the “weak” one
does. This is very unrealistic in biological systems and should be treated as a
low-probability (or even non-occurring) scenario.
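The four cases above amount to a small decision table, sketched here in Python. The function name `interpret` and the wording of the returned labels are assumptions; the mapping itself follows the cases as listed, with the permutation result first and the randomization result second.

```python
def interpret(permutation_effect: bool, randomization_effect: bool) -> str:
    """Map the (weak, strong) mutation outcomes to the four cases in the text."""
    if not permutation_effect and not randomization_effect:
        return "(--) non-functional region"
    if permutation_effect and randomization_effect:
        return "(++) possible cis-regulatory hotspot"
    if randomization_effect:
        return "(-+) inserted kmer may itself be a new motif"
    return "(+-) unexpected outcome, treat as low probability"


for weak in (False, True):
    for strong in (False, True):
        print(weak, strong, interpret(weak, strong))
```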
The third built-in mutation type available in BASHER is ‘scan’, which was
implemented because of one of the well-known combinatorial transcriptional mechanisms
present in yeast. It has been shown that there are promoters in which four
copies of the STE12 motif (each varying by 1 base pair) are present, two of which are
enough for a transcriptional response, while three active sites produce the full response.
Because of the presence of this combinatorial control, we scan the promoter regions with
a sliding window and map the locations of similar kmer occurrences. The required degree of
similarity can be set by the user in the configuration file by specifying the number of base
pair differences allowed in the motif. After the mapping algorithm has finished, we randomize a
kmer corresponding to the motif length of interest via the randomization script described
above. Then we replace these frequently occurring motifs with the same random
sequence, safeguarding against STE12-like combinatorial control (see Figure 20). With
this type of promoter mutation the investigator has the option to filter out cis-regulatory
complexes that were not caught during the first two filtration processes. [For a detailed
algorithm description see ‘scan.pl’ in the Manual.]
Figure 20 – Combinatorial control demonstrated by STE12 via promoter visualization.
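The mapping and replacement steps of ‘scan’ can be illustrated with a short Python sketch. It is not the ‘scan.pl’ script: similarity is assumed to be measured as the number of mismatching positions (Hamming distance), the replacement kmer is supplied by the caller instead of being drawn by the randomization step, and the example sequence is invented.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_similar(sequence: str, motif: str, max_mismatches: int) -> List[int]:
    """0-based starts of all windows within max_mismatches of the motif."""
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1)
            if hamming(sequence[i:i + k], motif) <= max_mismatches]


def replace_sites(sequence: str, starts: List[int], replacement: str) -> str:
    """Overwrite every mapped site with the same replacement kmer."""
    bases = list(sequence)
    for start in starts:
        bases[start:start + len(replacement)] = replacement
    return "".join(bases)


# Invented toy promoter with two near-identical copies of a 7-mer.
promoter = "TTTGAAACAGGTGAAACGCC"
sites = find_similar(promoter, "TGAAACA", max_mismatches=1)
mutated = replace_sites(promoter, sites, "CTCTCTC")
```

If two mapped sites overlap, the later replacement overwrites part of the earlier one in this sketch; how such cases should be handled is a design decision the text does not specify.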
4.2.3 Combinatorial Module
The second structural level of BASHER is based on biological data already discovered by
collaborating research groups on transcriptional regulation in yeast. The software uses the
cis-regulatory element maps and the ortholog promoter sequences to perform data mining
of regulatory binding sites in the promoter. Both of these sources are needed to generate
the synthetic DNA regions of S. cerevisiae, since the cis-regulatory maps are based only
on the relative motif start site location in the corresponding promoter. Thus, we first need
to retrieve the motif coordinates from the maps and then locate them on the ortholog
promoters for further sequence manipulation. This mapping procedure is performed by
‘factors’, the fundamental data mining script used by all sequence
modifying algorithms in the combinatorial module. The purpose of this program is to
output transcription factor binding site data given cis-element maps and promoter regions
as inputs (see Figure 21).
Figure 21 – Flow chart of BASHER’s ‘factors’ data mining I/O requirements
The output is a text file with specific transcription factor binding site parameters
such as name, orientation, relative offset, location on chromosome, P-value, and motif sequence
(see Figures 22 and 23). These parameters are used in the string manipulation part of the
algorithm to locate the motif on the promoter of interest.
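Locating a mapped motif on a promoter string might look like the following Python sketch. The record layout is an assumption: `BindingSite` holds only a subset of the parameters listed above, the offset is taken to be a 0-based start within the promoter string, and sites on the minus strand are returned as the reverse complement. None of this is taken from the ‘factors’ script itself.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class BindingSite:
    factor: str   # transcription factor name, e.g. "STE12"
    strand: str   # "+" or "-"
    offset: int   # assumed 0-based start within the promoter string
    length: int   # motif width in base pairs


def site_sequence(promoter: str, site: BindingSite) -> str:
    """Extract the motif occurrence, reverse-complemented on the minus strand."""
    fragment = promoter[site.offset:site.offset + site.length]
    if site.strand == "-":
        return fragment.translate(COMPLEMENT)[::-1]
    return fragment


# Invented promoter fragment and sites for illustration.
promoter = "AAATGAAACATT"
plus_site = BindingSite("STE12", "+", offset=3, length=7)
minus_site = BindingSite("STE12", "-", offset=3, length=7)
```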
Figure 22 – Partial list of transcription factors bound by gene STE12.
Figure 23 – Visualization script showing partial STE12 promoter with transcription factors
4.2.4 Promoter Visualization
Using the result of ‘factors’, one of BASHER’s unique features is the
visualization of transcription factor binding sites in the promoter of choice in a graphical
user interface (GUI). The Perl Tk module was used for the graphical features of BASHER, so it
may need to be installed from the CPAN site.
Depending on the user’s filtering selection in the configuration file, based on evolutionary
conservation and/or log-value threshold, the corresponding regulatory motifs can be
displayed via this script. The bound transcription factors (TF) are shown in the same
color as the cis-element coloring in the sequence and are aligned with the starting point of
the site (see Figure 24).
Figure 24 – Major differences in the cis-regulatory maps for the same FUS1 promoter based on
two techniques of motif mapping: (left) Friedman group’s regulatory map based on statistical
analysis, (right) Fraenkel group’s map based on regulatory protein binding strength
The factors are denoted by their name and designated orientation (e.g. YAP6-) on the DNA
strand, where the plus denotes the Watson and the minus the Crick strand of the helix. If
two or more TF binding sites have the same offset, the TF names are displayed in the same
color under each other, like DIG1 and STE12 on line [1-80] highlighted in purple. If two
or more TF binding sites overlap, the bound TFs are displayed in separate rows under
each other in their assigned colors, like the PHO4-RTG3-CBF1-INO2
{blue-gold-purple-blue} complex on line [161-240]. At the beginning of each promoter segment,
the relative offset location is denoted in brackets.
This open-source script was frequently used by my research group members to take
an initial look at the promoter region and its potential control regions, thus contributing to
the “smart” series of the synthetic promoter design.
4.2.5 Regulatory Combinatory
In this BASHER module, we developed algorithms investigating the fundamental
combinatorial interplay between cis-regulatory elements in the upstream region of the
promoters. We wish to find out what types of functions promoters “calculate” and how
these “calculations” are performed to produce a particular gene expression pattern. The
relationships that exist between cis-regulatory elements must be mapped out based on
the biological data available. It must be noted, however, that there is also a complex control
mechanism between cis- and trans-regulatory elements, which should be looked at as a
future extension of the software. As shown by previous research efforts [41], there are
different types of interactions between cis-elements. In one case, under one
environmental condition there is no interaction between elements, but in another state of the
cell the elements take an active part in regulatory control. Functional relationships can be
embodied in additive or opposing logic. As part of opposing logic, inhibiting and/or
occluding factors can play a major role when they are bound to the regulatory region of the
DNA. Other types of regulatory logic might include epistatic (OR) and/or synergistic
(AND) interactions between elements; after detailed analysis, a regulatory network
can be constructed to visualize these interactions (see Figure 25). When
deciphering combinatorial interplay, it should also be known how many cis-elements
(pairs, triplets, or beyond) interact with each other as part of this regulatory circuitry.
Since regulatory modules exist that are composed of a set of transcription factor binding sites,
it is important to break them down into the smallest regulatory blocks and compare them to
unique combinations from pair-wise analysis. In this way, fundamental regulatory rules can
be deduced and used later in synthetic promoter region design as optional mutation
parameters.
Figure 25 – Simple, Boolean regulatory logic of cis-elements [template.bio.warwick.ac.uk]
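The Boolean view of regulatory logic can be written out as truth tables over the presence or absence of two cis-elements. The Python sketch below is purely illustrative: the AND and OR labels follow the text, while the “A AND NOT B” gate is an assumed example of opposing logic with an inhibiting element B, and real promoters need not behave as clean Boolean gates.

```python
from itertools import product

# Expression output as a function of whether elements A and B are active.
GATES = {
    "synergistic (AND)": lambda a, b: a and b,
    "epistatic (OR)": lambda a, b: a or b,
    "opposing (A AND NOT B)": lambda a, b: a and not b,  # assumed inhibitor example
}


def truth_table(gate):
    """Evaluate a two-input gate for every on/off combination of A and B."""
    return {(a, b): gate(a, b) for a, b in product([False, True], repeat=2)}


for name, gate in GATES.items():
    print(name, truth_table(gate))
```

Read against synthetic promoters, each row of a table corresponds to one construct: both elements intact, either one removed, or both removed.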
4.2.6 Defining Cis-Modules
In BASHER, computational methods are used to define cis-regulatory modules in the
promoter, targeting clusters of elements that potentially interact during gene regulation.
An algorithm has been developed that gives some flexibility in the module definition,
which is specified by the user in the configuration file. There are three running modes of
cis-module definition, in which one of the following is treated as a distinct unit: 1) every
copy of a motif, 2) repeated copies of the same motif, or 3) copies of the same motif within a
certain base pair distance from each other (see Figure 26). These unit definitions are based on
copies of the same motif, which is a true
model of biological modules already discovered by molecular biologists. Extending the
module definition capabilities, BASHER can also define a module as a combination of motifs
specified by the investigator. This flexible
cis-regulatory module definition is essential in the effort to understand gene
regulation.
Figure 26 – Unit definition flexibility for cis-element TEC1 in the UME6 gene regulatory region
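One possible reading of the three unit definition modes is sketched below in Python. The interpretation is an assumption: mode 1 makes each motif copy its own unit, mode 2 groups all copies of the same motif, and mode 3 groups copies of the same motif whose start offsets lie within `max_distance` of the previous copy. The function name and the example sites are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (factor name, start offset in the promoter)


def define_units(sites: List[Site], mode: int, max_distance: int = 0) -> List[List[Site]]:
    """Group binding sites into units according to one of three modes."""
    ordered = sorted(sites)
    if mode == 1:  # every copy is its own unit
        return [[site] for site in ordered]
    by_factor: Dict[str, List[Site]] = defaultdict(list)
    for site in ordered:
        by_factor[site[0]].append(site)
    if mode == 2:  # all copies of the same motif form one unit
        return list(by_factor.values())
    if mode == 3:  # copies of the same motif within max_distance form one unit
        units: List[List[Site]] = []
        for copies in by_factor.values():
            current = [copies[0]]
            for site in copies[1:]:
                if site[1] - current[-1][1] <= max_distance:
                    current.append(site)
                else:
                    units.append(current)
                    current = [site]
            units.append(current)
        return units
    raise ValueError("mode must be 1, 2 or 3")


# Invented offsets for illustration.
sites = [("TEC1", 10), ("TEC1", 25), ("TEC1", 200), ("STE12", 50)]
```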
4.2.7 Pair Mutagenesis
When investigating cis-regulatory element interactions, the most fundamental and
combinatorially simplest comparison is pair-wise analysis. The regulatory motifs
can be located in various geometrical arrangements in the upstream regions, one of which is
the overlapping motif pair scenario. In this case, the strategy is to remove one element
of the pair while keeping the other unchanged, and vice versa. Another built-in
pair mutagenesis algorithm removes each possible cis-regulatory pair in
the promoter according to the mutation types available in BASHER. Finally, the
rest of the scripts in this module deal with the relative distances and orientations of the
elements of pairs in the control regions. They can reveal a peaked distribution of distances or
peaked co-expression at a specified distance between the pairs under investigation. As with
regulatory element distances, orientation and order patterns can be investigated in the same
fashion, providing another unique synthetic promoter design option for the user of this software.
4.2.8 Overlapping Pairs
This algorithm first scans the promoter region of interest and determines the exact
locations of overlapping transcription factor pairs according to the ‘overlap’ definition. There
are three overlap type definitions available in BASHER, which are handled differently in
the script. The potential transcription factor overlap positioning can occur as follows:
1. TYPE: Fake (overlap)
Def.: If two motifs lie within a certain distance, [Overlap_gap]¹¹, of each other, then
consider them a type fake overlap.
2. TYPE: Real->I (overlap)
Def.: If two motifs overlap such that there is one overlapping and only one
non-overlapping fragment (from either motif’s point of view), then consider them
a type real->I overlap.
11 Overlap gap parameter [Overlap_gap] can be modified in the configuration file by the user.
3. TYPE: Real->II (overlap)
Def.: If two motifs overlap such that there is only one overlapping and two
non-overlapping fragments (from the point of view of at least one of the two motifs), then
consider them a type real->II overlap.
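The three overlap definitions can be expressed as a small classifier over motif intervals. The Python sketch below is one possible reading, not BASHER’s script: motifs are half-open intervals (start, end), the gap is measured between the end of one motif and the start of the next, and the extra labels "none" and "identical" are added here for cases the definitions do not cover.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open (start, end) on the promoter


def non_overlapping_fragments(x: Interval, y: Interval) -> int:
    """How many flanks of x stick out beyond y (0, 1 or 2)."""
    return int(x[0] < y[0]) + int(x[1] > y[1])


def classify_overlap(a: Interval, b: Interval, overlap_gap: int) -> str:
    """Classify a motif pair as fake, real->I, real->II, identical or none."""
    first, second = sorted([a, b])
    if second[0] >= first[1]:
        # No shared bases: a fake overlap if the motifs are close enough.
        gap = second[0] - first[1]
        return "fake" if gap <= overlap_gap else "none"
    most = max(non_overlapping_fragments(a, b), non_overlapping_fragments(b, a))
    if most == 2:
        return "real->II"  # one motif sticks out on both sides of the other
    if most == 1:
        return "real->I"   # partial overlap with a single free fragment
    return "identical"     # same coordinates; not covered by the definitions
```

For example, motifs at (0, 10) and (5, 15) share five bases and each has one free fragment, so the pair is a real->I overlap; a motif at (5, 10) nested inside (0, 20) leaves two free fragments on the longer motif and is a real->II overlap.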
Then, in every overlapping motif pair, motif A is "knocked out" using PSSM values while
motif B is conserved [as it was observed originally], and vice versa (see Figure 27).
The “knock out” algorithm, called ‘remove motif’, replaces the overlapping region with
the most extreme motif mutation, the one with the lowest log-value possible for that
particular case: at each PSSM position in the non-overlapping motif fragment the
base with the smallest entry value is chosen, while in the overlapping fragment the best
base-pair entry combination is chosen¹². This way, in each overlap mutation
direction, one cis-element is damaged while the other is conserved, since it is theoretically
untouched. This script automatically generates a log file in PDF format, in which all overlap
mutation steps are documented for each cis-element pair of interest. It includes
before/after motif sequence modification statistics with sequence and log-value changes.
It also contains the sequence logos for the particular cis-elements based on the default PSSM
matrices (see Figure 28). This log file is very useful for the investigator, since a visual
representation of the modifications is provided via sequence logos instead of matrices,
which are hard to read.
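The part of the “knock out” that acts on the non-overlapping fragment can be sketched as follows in Python. This is a deliberate simplification and not ‘remove_overlap.pl’: positions shared with the conserved motif are simply left untouched, whereas the text describes choosing the best base combination there, and the function name, the `protected` argument and the toy matrix are assumptions.

```python
from typing import Dict, List, Set

Pssm = Dict[str, List[float]]  # one row per base, one column per position


def knock_out(promoter: str, pssm: Pssm, motif_start: int, protected: Set[int]) -> str:
    """Replace each unprotected motif position with its lowest-scoring base.

    `protected` holds promoter positions that belong to the conserved motif.
    """
    bases = list(promoter)
    width = len(pssm["A"])
    for j in range(width):
        position = motif_start + j
        if position in protected:
            continue  # leave the overlapping fragment as it was
        # Ties are broken in the fixed order A, C, G, T.
        bases[position] = min("ACGT", key=lambda base: pssm[base][j])
    return "".join(bases)


# Made-up 3-column matrix and promoter for illustration only.
toy_pssm: Pssm = {
    "A": [1.0, -2.0, 0.5],
    "C": [-1.0, 0.0, -0.5],
    "G": [0.0, 2.0, -1.0],
    "T": [-0.5, -1.0, 1.5],
}
mutated = knock_out("TTAGTCC", toy_pssm, motif_start=2, protected={4, 5})
```

Here the motif occupies positions 2 to 4, position 4 is shared with the conserved motif, and only positions 2 and 3 are replaced by their lowest-scoring bases.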
12 For more detailed algorithm steps see ‘remove_overlap.pl’ in Manual.
Figure 27 – SPT23-STE12 cis-element overlap and matrix entry “knock out” schematics
Figure 28 – Statistics of PHD1-SKN7 cis-element overlap removal
4.2.9 Module Mutagenesis
As described before, regulatory modules play a major role in transcriptional regulation
[43]. Thus, it is crucial to understand their functional role as singleton elements or a list
of interacting regulatory complexes. A particular module of BASHER was developed for
the investigation of these cis-regulatory complexes, which has multiple built-in promoter
mutation algorithms available.
The first focuses on the functional role of the module. 1) The program removes all
copies of the same transcription factor if it occurs more than once in the upstream region of
the promoter (see Figures 29 and 30). 2) It removes all elements of a unit (see definition
above) if the number of elements of the unit is more than one but not all. 3) It can also
remove all cis-regulatory elements except the module itself, focusing on the individual unit’s
contribution to regulation. These removal techniques can be specified in the configuration
file as desired.
The second algorithm deals with the relationships within the module. It removes all
pair-wise combinations of elements within a module. The third investigates the
relationships between the modules, i.e., it removes pairs of entire modules and removes
representatives from pairs of (possibly more) modules. Finally, we can also vary the
distance of a module from the transcription start site. In all of the above algorithms, by
removal we mean “strong” mutation [randomization], so that the removed regions become
functionally insignificant.
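Two of the removal plans described above, repeated copies of one factor and pair-wise combinations within a module, reduce to simple enumeration. The Python sketch below only lists what would be removed; the function names and the example sites are hypothetical, and the actual removal would be carried out by the randomization step.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (factor name, start offset)


def repeated_factor_sites(sites: List[Site]) -> Dict[str, List[int]]:
    """Offsets of every factor that occurs more than once in the promoter."""
    by_factor: Dict[str, List[int]] = defaultdict(list)
    for factor, offset in sites:
        by_factor[factor].append(offset)
    return {factor: sorted(offsets)
            for factor, offsets in by_factor.items() if len(offsets) > 1}


def pairwise_removals(module: List[str]) -> List[Tuple[str, str]]:
    """Every unordered pair of elements within one module."""
    return list(combinations(module, 2))


# Invented offsets for illustration.
sites = [("DIG1", 40), ("DIG1", 120), ("STE12", 75), ("TEC1", 200)]
```

A module of n elements yields n(n-1)/2 pair-wise variants, so the number of constructs grows quickly with module size.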
Figure 29 – Removal of all copies of DIG1 motif in the upstream region of gene DIG1
Figure 30 – Removal of all copies of DIG1 motif in sequence format via permutation
Chapter 5
Discussion
Here, we describe a user-friendly, advanced software package called BASHER for the
design of synthetic promoter regions in a high-throughput manner. The software, written in
Perl, was developed to design synthetic promoters in Saccharomyces cerevisiae to be made by the
PCR assembly of short oligonucleotides. It should, however, be adaptable to other yeast
genomes (e.g. S. paradoxus, S. bayanus, S. hanseii) which are closely related to S.
cerevisiae in the phylogenetic tree.
It provides a powerful and flexible tool for hypothesis testing of regulatory logic in
the eukaryotic yeast cell. Besides traditional approaches to promoter region synthesis such as
site-directed mutagenesis, structural analysis and investigation of transcription regulation, it
incorporates the option of the new theory of promoter variant design. It takes
combinatorial and spatial effects of cis-binding sites into account and integrates them
into the modeling process. It provides novel algorithms that consider local and global
binding site geometry, with functional and mechanistic implications for gene
regulation. It also considers the physical interactions between cis-elements, with linear,
epistatic, synergistic or switch-like effects as a result of their interaction. Novel mutation
algorithms were developed using the latest PSSM compendiums available in the
literature. Based on this data, BASHER is able to visualize
transcription factor binding sites in the promoter of choice in a graphical user interface
(GUI). The software also contains a module designed to analyze combinatorial interactions
between regulatory complexes. The algorithms design synthetic promoter regions, which might
lead to an understanding of the functional roles of modules, the relationships within modules,
the relationships between modules and the spatial relevance of modules to the transcription
start site.
BASHER is a useful tool for computationally trained investigators who wish to
optimize protein expression and/or redesign their promoter of interest in a step-wise
manner for detailed structure/function studies. It accepts as input both ortholog promoter
sequences and global transcription factor binding site maps of the organism of interest
and allows users to move through the process of design in a series of modules that
address practical issues surrounding oligonucleotide design. Users can follow the main
“design a promoter” path or use the modules individually as needed. The design software
is freely available from the author upon request. The software is provided
“as is” for non-commercial use, with no guarantee or warranty of any kind.
Chapter 6
Conclusion
The proposed software for high-throughput synthetic promoter region design has been
developed into a very exciting tool for scientific investigators interested in
genome engineering. But, like all software packages, it can always be updated with new data
libraries as they become available and with new functionalities resulting from fundamental
genetic research. Since scientific efforts in cis-element discovery are ongoing, the motif
compendium should be periodically updated with more reliable data sources. The
software uses internally standardized formats, so new data can be smoothly
incorporated using some of the supporting scripts written for data mining from
FASTA and other standardized formats¹³ used in bioinformatics.
6.1 Future Prospects
There are many more combinatorial possibilities involved in pair-wise cis-element
analysis, which were not implemented because of the practically limitless algorithm options in
the subject area. Since I was the only software developer, we had to draw a realistic line in
order to consider BASHER a finished product. Even though we focused on some of the major
combinatorial interactions and geometric constraints of regulatory elements, available
mRNA expression data was never incorporated as a guiding biological result in the
design. With this data in hand we could have selected an interesting set of genes with
similar gene expression patterns. The co-occurrence of these sets could have been
investigated in all promoter regions or all hyper-geometric biclusters. This could have
provided another interesting module in BASHER, which could strictly modify the
upstream region of the promoter based on these experimental results obtained under
various experimental conditions.
Another interesting idea came up during the research project to incorporate other
yeast species from the phylogenetic tree. In this way, other genetically closely related
13 For supporting data mining Perl script descriptions see the Manual’s data miner section.
genomes could have been compared to each other, so that evolutionary conservation of
transcriptional regulatory logic could have been inferred from the homologous promoter
regions [44, 45].
As of now, the software requires some computational knowledge, since a graphical user
interface (GUI) for user interaction had not been developed at the time of my departure. I feel
that the fundamental algorithms, which are the research part of the
computational project, have been completed. The GUI development is just “icing on the cake”,
which does not require any theoretical understanding of gene regulation or design. Therefore
it can be easily implemented by software engineers if needed, so that the software can be widely
used even by those with limited programming experience.
6.2 Project Barriers
It is also important to point out that, since the “Polypromoter Project” was a collaborative
effort between computational and experimental biologists at the Church laboratory, we
were mutually relying on each other’s results. As the product of BASHER’s output, a list
of synthetic promoter regions of interest could be printed onto an industrially obtained
oligonucleotide chip [46]. The promoters are decomposed into an array of 30-mers on the
chip surface with unique flanking regions, which are complementary to the next segment,
for each construct. From the computational design phase, the project now transitions into
the experimental phase.
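The decomposition of a promoter into overlapping 30-mers can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the 10-base overlap is an arbitrary choice, both function names are hypothetical, and strand alternation (the complementary flanks mentioned above) is ignored, so consecutive tiles simply share bases on the same strand.

```python
import random
from typing import List


def tile_oligos(promoter: str, oligo_length: int = 30, overlap: int = 10) -> List[str]:
    """Split a promoter into oligos in which neighbours share `overlap` bases."""
    if not 0 <= overlap < oligo_length:
        raise ValueError("overlap must satisfy 0 <= overlap < oligo_length")
    step = oligo_length - overlap
    tiles, start = [], 0
    while True:
        tiles.append(promoter[start:start + oligo_length])
        if start + oligo_length >= len(promoter):
            break  # the last tile reaches the end and may be shorter
        start += step
    return tiles


def reassemble(tiles: List[str], overlap: int) -> str:
    """In-silico check that the tiles join back into the original sequence."""
    assembled = tiles[0]
    for tile in tiles[1:]:
        if overlap and assembled[-overlap:] != tile[:overlap]:
            raise ValueError("neighbouring tiles do not share the expected overlap")
        assembled += tile[overlap:]
    return assembled


# Random 100 bp stand-in for a designed promoter (fixed seed for repeatability).
rng = random.Random(0)
promoter = "".join(rng.choice("ACGT") for _ in range(100))
tiles = tile_oligos(promoter, oligo_length=30, overlap=10)
```

The reassembly check only confirms that the tiling is consistent on paper; it says nothing about how evenly the oligos amplify in an actual PCR.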
In the next step, polymerase chain reaction (PCR) reaction is utilized to amplify and
connect the pool of 30-mers into their distinctive promoter constructs. In theory, the
synthetic promoter regions will be obtained when the components of 30-mers line up
with their overlapping flanking end, which leads to combination. Unfortunately, this part
of the project did not work, even though a post-doctoral candidate spent tremendous
hours in the optimization of the protocol. I was involved in the trouble-shooting process
of figuring out the reason behind the unsuccessful assembly, where I developed scripts to
analyze DNA segment obtained from defected experimental trials.
As a result, we concluded that certain oligonucleotides amplified from the oligo-chip
with more success than others, resulting in an uneven distribution in the PCR solution.
For this reason, the desired promoter regions were not constructed evenly, at the same
concentration, as expected. This could have had two causes. One source of error could
have been defective industrial DNA spotting of the oligo-chip; the other could have been
our unsuccessful reproduction of the large-scale oligo-assembly protocol described by
[cite]. As a consequence, the constructs were not available for hypothesis testing. In a
sense, the research group was ahead of current technology, since large-scale oligo-assembly
does not yet exist in an industrial setting, only in certain research laboratories. When
the technology for assembly becomes available, however, BASHER can be revisited and
commonly used in promoter design to investigate the underlying principles of gene regulation.
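The kind of analysis that points to uneven oligonucleotide representation can be sketched as follows. This is a hypothetical illustration, not the scripts used in the project: it assumes that the observed DNA segments are available as plain sequence strings and that a segment counts only if it matches a designed oligonucleotide exactly. The function names are invented for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def oligo_abundance(designed_oligos, observed_segments):
    """Return the fraction of matching observed segments per designed oligo.

    Segments that match no designed oligo exactly are ignored
    (a simplifying assumption; real data would need error-tolerant matching).
    """
    designed = set(designed_oligos)
    counts = Counter(s for s in observed_segments if s in designed)
    total = sum(counts.values())
    return {o: (counts[o] / total if total else 0.0) for o in designed_oligos}


def coefficient_of_variation(fractions):
    """Spread of the abundances relative to their mean; 0 means perfectly even."""
    values = list(fractions.values())
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


if __name__ == "__main__":
    designed = ["AAA", "CCC", "GGG"]  # toy stand-ins for 30-mers
    observed = ["AAA"] * 6 + ["CCC"] * 3 + ["GGG"] + ["TTT"] * 5
    fractions = oligo_abundance(designed, observed)
    print(fractions)
    print(coefficient_of_variation(fractions))
```

A coefficient of variation close to zero would indicate that all designed oligonucleotides are present in similar amounts; a large value indicates that some dominate the pool, which is the situation described above.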
6.3 Project Reflection
In conclusion, I had a marvelous intellectual experience at Harvard Medical School,
thanks to the generosity of the Honors Program and of my advisor, who gave me the
opportunity to be a part of such a well-respected laboratory. I learned a great deal
not only about computational genomics and bioengineering, but also about a cutting-edge
research institution and about what it takes to be a good research scientist. I was
exposed to an area of my career interest, bioengineering, in which I had no particular
training, and I feel that this opportunity challenged me in every aspect of life. I
learned a new programming language and the finest details of software development while
at Harvard. I also learned a tremendous amount of biology and genetics by reading
hundreds of journal articles assigned by my advisor on a regular basis, attending lab
meetings twice a week and simply talking with my experienced lab mates every day.
Participating in a graduate-level course in biophysics also helped me on this
intellectual journey, which confirmed my decision to continue my career path into the
field of biomedical engineering. This unique experience also gave me confidence in my
ability to learn a completely new topic to which I had never been exposed. Thinking
about a problem and making my own design decisions every day helped develop my logical
thought process, which improved a lot while I was there. Even though it was a
challenging experience, it was great fun too: I met many bright individuals with whom I
formed lifelong friendships, and the experience opened my eyes to a direction in which
I would like to head in the near future.
Bibliography
[1] S.A. Benner, A.M. Sismour, “Synthetic biology,” Nature Reviews Genetics, 6: 533-543 (2005).
[2] C. Gustafsson, S. Govindarajan, J. Minshull, “Putting engineering back into protein engineering: bioinformatic approaches to catalyst design,” Current Opinion in Biotechnology, 14: 366-370 (2003).
[3] S.J. Kodumal, K.G. Patel, R. Reid, H.G. Menzella, M. Welch, D.W. Santi, “Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster,” Proc. Natl. Acad. Sci., 101: 15573-15578 (2004).
[4] C. Gustafsson, S. Govindarajan, J. Minshull, “Codon bias and heterologous protein expression,” Trends in Biotechnology, 22: 346-353 (2004).
[5] E.H. Davidson, “Genomic Regulatory Systems: Development and Evolution,” San Diego: Academic Press, 2001.
[6] F. Jacob, J. Monod, “Genetic regulatory mechanisms in the synthesis of proteins,” Journal of Molecular Biology, 3: 318–356 (1961).
[7] A.M. McGuire, G.M. Church, “Predicting regulons and their cis-regulatory motifs by comparative genomics,” Nucleic Acids Research, 15: 4523–4530 (2000).
[8] F.P. Roth, J.D. Hughes, P.W. Estep, G.M. Church, “Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation,” Nature Biotechnology, 16: 939–945 (1998).
[9] M.A. Beer, S. Tavazoie, “Predicting gene expression from sequence,” Cell, 117: 185–198 (2004).
[10] D.M. Hoover, J. Lubkowski, “DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis,” Nucleic Acids Research, 30: e43 (2002).
[11] G. Giaever et al., “Functional profiling of the Saccharomyces cerevisiae genome,” Nature, 418: 387–391 (2002).
[12] J.D. Watson, “The Double Helix: A Personal Account of the Discovery of the Structure of DNA,” Touchstone, 2001.
[13] C.T. Harbison et al., “Transcriptional regulatory code of a eukaryotic genome,” Nature, 431: 99–104 (2004).
[14] R.K. Mortimer, D.C. Hawthorne, “Genetic mapping in yeast,” Methods in Cell Biology, 11: 221-33 (1975).
[15] B. Alberts, A. Johnson, J. Lewis, K. Roberts, M. Raff, and P. Walter, Molecular Biology of the Cell. Garland, 2002.
[16] K. Gausing, “Efficiency of protein and messenger RNA synthesis in bacteriophage T4-infected cells of Escherichia coli,” Journal of Molecular Biology, 7: 529-45 (1972).
[17] W.G. Haldenwang, “The sigma factors of Bacillus subtilis,” Microbiology Review, 59: 1-30 (1995).
[18] D.F. Browning, S.J.W. Busby, “The regulation of bacterial transcription initiation,” Nature Reviews Microbiology, 2004.
[19] A.I. Lamond, A.A. Travers, “Stringent control of bacterial transcription,” Cell, 41: 6-8 (1985).
[20] T. Denis et al., “From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli,” BioEssays, 5: 433-440 (1998).
[21] O. Soutourina et al., “Multiple Control of Flagellum Biosynthesis in Escherichia coli: Role of H-NS Protein and the Cyclic AMP-Catabolite Activator Protein Complex in Transcription of the flhDC Master Operon,” Journal of Bacteriology, 24: 7500-7508 (1999).
[22] B.L. Wanner, R. Kodaira, F.C. Neidhart, “Physiological regulation of a decontrolled lac operon,” Journal of Bacteriology, 130: 212-222 (1977).
[23] M. Carey, “The Enhanceosome and Transcriptional Synergy,” Cell, 92: 5–8 (1998).
[24] S. Tavazoie et al., “Systematic determination of genetic network architecture,” Nature Genetics, 22: 281–285 (1999).
[25] E.B. Lewis, “A gene complex controlling segmentation in Drosophila,” Nature, 276: 565-570 (1978).
[26] M. Hoch, E. Seifert, H. Jäckle, “Gene expression mediated by cis-acting sequences of the Krüppel gene in response to the Drosophila morphogens bicoid and hunchback,” EMBO Journal, 10: 2267–2278 (1991).
[27] J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church, “Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae,” Journal of Molecular Biology, 296: 1205–1214 (2001).
[28] K.D. MacIsaac et al., “A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data,” Bioinformatics, 22: 423–429 (2006).
[29] Hagen et al., “Pheromone response elements are necessary and sufficient for basal and pheromone-induced transcription of the FUS1 gene of Saccharomyces cerevisiae,” Molecular and Cellular Biology, 11: 2952-61 (1991).
[30] A.M. Dudley et al., “A global view of pleiotropy and phenotypically derived gene function in yeast,” Molecular Systems Biology, 1: 2005.0001 (2005).
[31] C.A. Heid et al., “Real time quantitative PCR,” Genome Research, 6: 986-994 (1996).
[32] J. Shendure et al., “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science, 309: 1728–1732 (2005).
[33] F. Gao, B.C. Foat, H.J. Bussemaker, “Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data,” BMC Bioinformatics, 5: 31 (2004).
[34] K.D. MacIsaac, T. Wang, D.B. Gordon, D.K. Gifford, G. Stormo, E. Fraenkel, “An Improved Map of Conserved Regulatory Sites for Saccharomyces cerevisiae,” BMC Bioinformatics, 7: 113 (2006).
[35] E. Wingender et al., “TRANSFAC: an integrated system for gene expression regulation,” Nucleic Acids Research, 28: 316–319 (2000).
[36] M. Kellis et al., “Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery,” Journal of Computational Biology, 11: 319-55 (2004).
[37] M. Lapidot, Y. Pilpel, “Comprehensive quantitative analyses of the effects of promoter sequence elements on mRNA transcription,” Nucleic Acids Research, 31: 3824-8 (2003).
[38] Y. Barash, G. Elidan, T. Kaplan, N. Friedman, “CIS: Compound importance sampling method for protein-DNA binding site p-value estimation,” Bioinformatics, 2004.
[39] A. Tanay et al., “Integrative analysis of genome-wide experiments in the context of a large high-throughput data compendium,” Molecular Systems Biology, 1: 2005.0002 (2005).
[40] J.M. Cherry et al., “Genetic and physical maps of Saccharomyces cerevisiae,” Nature, 387: 67-73 (1997).
[41] Y. Pilpel, P. Sudarsanam, G.M. Church, “Identifying regulatory networks by combinatorial analysis of promoter elements,” Nature Genetics, 29: 153–159 (2001).
[42] E. Segal et al., “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Nature Genetics, 34: 166–176 (2003).
[43] A.M. McGuire, J.D. Hughes, G.M. Church, “Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes,” Genome Research, 10: 744–757 (2000).
[45] M. Kellis et al., “Sequencing and comparison of yeast species to identify genes and regulatory elements,” Nature, 423: 241–254 (2003).
[46] J. Tian et al., “Accurate multiplex gene synthesis from programmable DNA microchips,” Nature, 432: 1050–1054 (2004).
APPENDIX
[Index of Tables and Figures]
Figure I - Gene in global view [Wikipedia].
As shown in Figure I, each functional unit (gene) corresponds to a single protein or RNA
(ribonucleic acid) and encompasses coding sequences, non-coding regulatory DNA sequences
and introns. In most genes, exons contain the part of the open reading frame (ORF) that
codes for a specific portion of the protein, while introns are regions that are removed
(spliced out) after transcription, but before the RNA is used. Contrary to a common
misconception, exons comprise not only the coding sequences for the final protein, but
also some non-coding sequences that play a major role in the translation phase.
Figure II – Gene in local view [Wikipedia].
Figure II depicts an unedited mRNA transcript, or pre-mRNA. Both the sequences that code
for amino acids (red) and the untranslated stretches (grey) are classified as exons. Regions of
unused sequence called introns (blue) are spliced out, and the exons are joined together to
form the final functional mRNA. The untranslated regions are vital for efficient
translation of the transcript and for control of the translation rate.
Figure III - The structure of part of a DNA double helix [Wikipedia].
Figure IV - A representation of a condensed eukaryotic chromosome, as seen during cell division [Wikipedia].
Figure V - Schematic depiction of a portion of chromosome 2 from the genome of the fruit fly Drosophila melanogaster [15].
This figure represents approximately 3% of the total Drosophila genome, arranged as six
contiguous segments. The symbols are as follows. Rainbow-colored bar: G-C
base-pair content; black vertical lines: locations of transposable elements; colored boxes:
genes coded on one strand of DNA. The color of each gene box (see color code in the
key) indicates whether a closely related gene is known to occur in other organisms. For
example, MWY means the gene has close relatives in mammals, in the worm
Caenorhabditis elegans, and in the yeast Saccharomyces cerevisiae. MW indicates the
gene has close relatives in mammals and the worm but not in yeast.
Figure VI – Flow chart of hypothesis testing using synthetic promoter constructs and the yeast deletion strain.
Figure VII – Transcription factor binding sites for promoter YCL027W.