Copyright by Yidan Qin 2016
The Dissertation Committee for Yidan Qin Certifies that this is the approved
version of the following dissertation:
Thermostable Group II Intron Reverse Transcriptases and Their
Applications in Next Generation RNA Sequencing, Diagnostics, and
Precision Medicine
Committee:
Alan M. Lambowitz, Supervisor
Vishwanath R. Iyer
Robert M. Krug
Rick Russell
Scott W. Stevens
Christopher S. Sullivan
Thermostable Group II Intron Reverse Transcriptases and Their
Applications in Next Generation RNA Sequencing, Diagnostics, and
Precision Medicine
by
Yidan Qin, B.S. Biochem.; B.S. Forensic Sci.
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
May 2016
Dedication
Dedicated to my parents, Caiying Xia and Huihong Qin.
Acknowledgements
I would like to express my sincere gratitude to my advisor, Dr. Alan Lambowitz.
His guidance, encouragement and inspiration have allowed me to learn and grow as a scientist. I
would also like to thank my committee members, Dr. Vishwanath Iyer, Dr. Robert Krug,
Dr. Rick Russell, Dr. Scott Stevens, and Dr. Chris Sullivan, for providing their important
expertise and valuable critiques throughout the development of this research work.
Moreover, I am very grateful for the fun and fruitful work I shared with all the
collaborators at or outside the University of Texas at Austin.
Many thanks go to the current and previous members of the Lambowitz Lab for
their enormous support, particularly Jun Yao, Sabine Mohr, Marta Mastroianni, Ryan
Nottingham, and Tawsy Lamech. I am very fortunate to have joined this lab and worked
with them. My appreciation also extends to my friends outside the Lambowitz Lab,
particularly Xia Xia, Tina Hsiang, and Lily Wang, whose friendship has been vital throughout
my graduate studies.
Finally, I would like to take this opportunity to thank my family. I thank my
cousin Zhen Qin, my cousin-in-law Jian Zhou, my aunt Huapei Li, and my uncle
Huiming Qin, for their love and caring. Most importantly, I thank my parents, Caiying
Xia and Huihong Qin, for always being there for me.
Thermostable Group II Intron Reverse Transcriptases and Their
Applications in Next Generation RNA Sequencing, Diagnostics, and
Precision Medicine
Yidan Qin, Ph.D.
The University of Texas at Austin, 2016
Supervisor: Alan M. Lambowitz
Thermostable group II intron reverse transcriptases (TGIRTs) from thermophilic
bacteria are advantageous for biotechnological applications that require cDNA synthesis,
such as RT-qPCR and RNA-seq. TGIRTs have higher thermostability, processivity and
fidelity than conventional retroviral RTs, along with a novel end-to-end template-
switching activity that attaches RNA-seq adapters to target RNAs without RNA ligation.
First, I optimized the TGIRT template-switching method for RNA-seq analysis of small
non-coding RNAs (ncRNAs). I showed that TGIRT-seq gives full-length reads of tRNAs,
which are refractory to retroviral RTs, and enables identification of a variety of base
modifications in tRNAs by distinctive patterns of misincorporated nucleotides. With
collaborators, I developed an efficient and quantitative high-throughput tRNA sequencing
method, identified RNAs bound by the human interferon-induced protein IFIT5, yielding
new insights into its functions in tRNA quality control and innate immunity, and
uncovered a novel mRNA-independent mechanism for elongation of nascent peptides.
Second, I developed a new, streamlined TGIRT-seq method for comprehensive analysis
of all RNA size classes in a single RNA-seq reaction. This method enables RNA-seq library
construction from <1 ng of fragmented RNAs in <5 h. By using the method, I showed
that human plasma contains large numbers of protein-coding and long ncRNAs together
with diverse classes of small ncRNAs, which are mostly present as full-length transcripts.
With collaborators, I showed that TGIRT-seq analysis of circulating RNAs identified
potential biomarkers at different stages of multiple myeloma and may provide a sensitive,
non-invasive diagnostic tool for a variety of human diseases. Finally, I adapted TGIRTs
for use in mapping of RNA structures and RNA-protein interaction sites, and
identification of RNA targets of cellular RNA-binding proteins. My research led to a
series of new biological insights, which would have been difficult or impossible to obtain
by current methods, and established TGIRTs as a tool for a broad range of applications in
RNA research and diagnostics.
Table of Contents
List of Tables
List of Figures
Chapter 1: Thermostable Group II Intron Reverse Transcriptases
1.1 Group II introns
1.2 Group II intron reverse transcriptases
1.3 Thermostable group II intron reverse transcriptases are advantageous for cDNA synthesis
1.4 Thermostable group II intron reverse transcriptases are advantageous for next-generation RNA sequencing
1.5 Overview of the dissertation research
Chapter 2: RNA-seq of transfer RNAs
2.1 Efficient and quantitative high-throughput tRNA sequencing*
2.1.1 tRNA sequencing by combining demethylase treatment with the TGIRT-seq small RNA/CircLigase method
2.1.2 Analysis of tRNA isoacceptors, modifications and gene expressions
2.1.3 Discussion
2.2 Analysis of precursor and mature tRNAs associated with the human interferon-induced protein IFIT5*
2.3.1 The human IFIT5 protein
2.3.2 TGIRT-seq profiling of IFIT5-bound cellular RNAs
2.3.3 IFIT5 binds to a broad spectrum of precursor and processed tRNAs, as well as other RNA polymerase III transcripts
2.3.4 Discussion
2.4 Analysis of tRNAs associated with the yeast Rqc2p protein*
2.4.1 The Rqc2p protein
2.4.2 TGIRT-seq profiling of Rqc2p-bound tRNAs
2.4.3 Discussion
2.5 Materials and methods
2.5.1 Deacylation of tRNA samples
2.5.2 Construction of RNA-seq libraries by TGIRT-seq small RNA/CircLigase method
Chapter 3: RNA-seq of circulating RNAs in human plasma
3.1 Introduction
3.2 TGIRT-seq, the total RNA method
3.2.1 Overview of the TGIRT-seq total RNA method
3.2.2 Validation of the TGIRT-seq total RNA method
3.3 Human plasma RNA
3.3.1 Preparations and treatments of human plasma RNAs
3.3.2 TGIRT-seq of human plasma RNA samples
3.3.3 Classes of RNAs detected in human plasma
3.3.4 Protein-coding gene and long non-coding RNAs in human plasma
3.3.5 Small non-coding RNAs in human plasma
3.4 Discussion
3.5 Materials and methods
3.5.1 Thermostable group II intron RTs
3.5.2 Preparation of human plasma RNA samples
3.5.3 Construction of plasma RNA-seq libraries
3.5.4 RNA-seq analysis of cDNA recopying by TGIRT enzymes
3.5.5 Bioinformatics analysis
3.5.6 Accession numbers
Chapter 4: Identification of circulating RNA biomarkers in multiple myeloma
4.1 Introduction
4.2 RNA profiles of extracellular vesicles in human plasma
4.3 TGIRT-seq identifies differentially expressed transcripts by disease stages
4.4 Discussion
4.5 Materials and methods
4.5.1 Thermostable group II intron RTs
4.5.2 RNA preparations*
4.5.3 Construction of RNA-seq libraries
4.5.4 Bioinformatics*
Chapter 5: Mapping RNA secondary structures and RNA-protein interaction sites
5.1 Overview of SHAPE and CRAC
5.2 Protein-assisted group II intron splicing
5.2.1 Determination of optimal exon length and protein concentration for in vitro splicing of the GsI-IIC intron
5.2.2 RNA-structure mapping of the GsI-IIC intron via TGIRT-SHAPE*
5.2.3 Mapping of RNA-protein contact sites by TGIRT-CRAC*
5.3 Discussion
5.4 Materials and methods
5.4.1 Recombinant plasmids
5.4.2 Preparation of GsI-IIC intron RNA and IEP
5.4.3 GsI-IIC intron splicing
5.4.4 TGIRT-SHAPE
5.4.5 TGIRT-CRAC
Bibliography
Vita
List of Tables
Table 2.1: TGIRT-seq read mapping.
Table 2.2: Biological replicate sequencing of pooled RNA.
Table 3.1: Read statistics and mapping for RNA-seq of total plasma RNAs using TeI4c group II intron RT.
Table 3.2: Read statistics and mapping for RNA-seq of total plasma RNAs using GsI-IIC group II intron RT.
Table 3.3: Analysis of 3’-terminal nucleotides of RNAs in RNA-seq datasets constructed from total plasma RNA using TeI4c or GsI-IIC group II intron RTs.
Table 3.4: Read statistics and mapping for RNA-seq of whole-cell RNAs by using TeI4c or GsI-IIC group II intron RT.
Table 3.5: Summary of RNA-seq datasets.
Table 4.1: Read statistics and mapping for RNA-seq of plasma EV-RNAs.
List of Figures
Figure 1.1: Group II intron splicing and mobility.
Figure 1.2: Comparison of group II intron and retroviral RTs.
Figure 2.1: Demethylase-thermostable group II intron RT tRNA sequencing (DM-tRNA-seq).
Figure 2.2: cDNA synthesis of IFIT-bound RNAs by TGIRT-seq small RNA/CircLigase method.
Figure 2.3: Broad representation of IFIT5-bound tRNAs.
Figure 2.4: Individual gene coverage by reads from the WT IFIT5 cross-linked RNA sample.
Figure 2.5: Read sequence alignments for the WT IFIT5 cross-linked RNA sample.
Figure 2.6: Composite read start sites for IFIT5-bound tRNAs.
Figure 2.7: Rqc2p-dependent enrichment of tRNAAla(IGC) and tRNAThr(IGU).
Figure 3.1: TGIRT-seq overview.
Figure 3.2: Bioanalyzer traces showing size profiles of plasma RNAs before and after various treatments.
Figure 3.3: Bioanalyzer traces testing the efficiency of DNase treatments used on plasma RNA preparations.
Figure 3.4: The distribution of transcript lengths in total plasma RNA libraries calculated by the coverage of paired-end read span.
Figure 3.5: Percentage of TGIRT-seq reads from total plasma RNA datasets mapping to different categories of genomic features.
Figure 3.6: Correlation analysis for biological replicates of total plasma RNA libraries.
Figure 3.7: RNA-seq analysis of total plasma RNA libraries constructed with GsI-IIC group II intron RT.
Figure 3.8: Human plasma RNA is enriched in intron and antisense sequences compared to whole-cell RNAs.
Figure 3.9: Proportion of reads mapping to the sense strand of protein-coding genes as a function of gene length in RNA-seq datasets of human plasma or whole-cell RNAs.
Figure 3.10: Human plasma contains both mature and pre-miRNAs.
Figure 3.11: Tissue expression profiles for mature miRNAs in plasma.
Figure 3.12: Tissue expression profiles of mature miRNA identified in total plasma RNA prepared by the mirVana combined method.
Figure 3.13: TGIRT-seq detects full-length pre-miRNAs and a miRNA that may be present in plasma in an RNA/DNA hybrid.
Figure 3.14: Relative abundance and IGV alignments of miRNAs identified in a small plasma RNA-seq dataset constructed with GsI-IIC RT.
Figure 3.15: TGIRT-seq identifies full-length mature tRNAs and tRNA fragments in human plasma.
Figure 3.16: Other classes of small non-coding RNAs identified as full-length mature transcripts in human plasma by TGIRT-seq.
Figure 4.1: Bioanalyzer traces showing size profiles of plasma EV-RNAs.
Figure 4.2: Percentage of TGIRT-seq reads from EV-RNA datasets mapping to different categories of genomic features.
Figure 4.3: Heatmap for sample-to-sample distance.
Figure 4.4: Transcript expressions in plasma EVs.
Figure 4.5: Survival curves.
Figure 5.1: Determining the optimal exon length for in vitro splicing of the GsI-IIC intron.
Figure 5.2: Determining the optimal IEP concentration for in vitro splicing of the GsI-IIC intron.
Figure 5.3: SHAPE analysis of the GsI-IIC intron RNA.
Figure 5.4: Mapping of protein binding sites in GsI-IIC intron RNA.
Chapter 1: Thermostable Group II Intron Reverse Transcriptases
1.1 GROUP II INTRONS
Group II introns are mobile genetic elements found in bacterial and organellar
genomes and are thought to be evolutionary ancestors of eukaryotic spliceosomes,
retrotransposons, and retroviruses (Lambowitz and Belfort, 2015). A mobile group II intron
consists of a catalytic intron RNA (a “ribozyme”), which folds into stable secondary and
tertiary structures, and an intron-encoded protein (IEP), which is a multifunctional
reverse transcriptase (RT) that assists intron splicing and promotes intron mobility within
the genome (Lambowitz and Zimmerly, 2011). The IEP binds to the intron RNA to
stabilize the catalytically active RNA structure for intron splicing (Matsuura et al., 2001).
Group II introns use the same splicing mechanism as spliceosomal introns in
higher organisms, producing an excised lariat intron RNA via two transesterification
steps (Fig. 1.1A). After splicing, the IEP remains bound to the excised lariat intron,
forming a ribonucleoprotein (RNP) to promote intron mobility to new DNA sites. Intron
mobility occurs by “retrohoming”, a process in which the intron RNA reverse splices
directly into a specific DNA site and is then reverse transcribed by the IEP (Fig. 1.1B).
Studies of protein-assisted group II intron splicing and mobility can further our
understanding of how proteins promote RNA folding and catalysis, and the origin,
evolution and mechanisms of spliceosomal introns in higher organisms.
1.2 GROUP II INTRON REVERSE TRANSCRIPTASES
Group II intron reverse transcriptases (RTs) consist of four domains: an N-terminal
RT domain, an X domain, and C-terminal DNA-binding (D) and DNA
endonuclease (En) domains (Fig. 1.2) (Mohr et al., 2013). The RT domain of group II
intron RTs contains seven conserved sequence blocks that correspond to the finger and
palm regions of retroviral RTs, such as the HIV-1 RT. However, their RT domain is
larger in size due to an N-terminal extension and several insertions, some of which are
conserved in retroplasmid and non-LTR-retrotransposon RTs (Blocker et al., 2005).
These additional regions may contribute to more extensive interactions between the
group II intron RT and the RNA template, leading to high processivity during reverse
transcription (Chen and Lambowitz, 1997; Bibillo and Eickbush, 2002; Blocker et al.,
2005). The X domain is structurally homologous to the thumb domain of retroviral RTs
(Blocker et al., 2005). Both the RT and X domains function in binding the intron RNA
for RNA splicing and in reverse transcription to synthesize a full-length cDNA copy of
the group II intron RNA during intron mobility (Cui et al., 2004). In contrast to retroviral
RTs, group II intron RTs lack an RNase H domain and instead have D and En domains
for binding and cleaving DNA target sites during intron mobility (Blocker et al., 2005;
Lambowitz and Zimmerly, 2011).
1.3 THERMOSTABLE GROUP II INTRON REVERSE TRANSCRIPTASES ARE ADVANTAGEOUS
FOR CDNA SYNTHESIS
A wide range of biotechnological applications, such as mapping of RNA structures
and RNA-protein interactions, qRT-PCR, and next-generation RNA sequencing (RNA-seq),
requires cDNA synthesis by reverse transcriptases (RTs) (Tijerina et al., 2007; Wang et al.,
2009; Mayer et al., 2011; Ozsolak and Milos, 2011; Lusvarghi et al., 2013). However, the
only commercially available RTs used for these applications are retroviral RTs, which
have inherently low fidelity and processivity, properties that help retroviruses introduce
genetic variation and propagate it by RNA recombination in order to evade host
defenses (Ji and Loeb, 1992; Hu and Hughes, 2012). Additionally, only a few RTs are
capable of functioning at elevated temperatures, which facilitate the melting of higher-order
RNA structures for full-length cDNA synthesis, and these typically have decreased
fidelity (Beckman et al., 1985; Baranauskas et al., 2012; Mohr et al., 2013).
In contrast to retroviral RTs, group II intron RTs have inherently high fidelity and
processivity in order to perform their normal biological function during intron mobility,
which requires accurate and full-length cDNA synthesis of a highly structured, 2-3-kb
intron RNA (Conlan et al., 2005; Lambowitz and Zimmerly, 2011; Mohr et al., 2013;
Enyeart et al., 2014; Lambowitz and Belfort, 2015). Group II intron RTs found in
thermophilic bacteria can potentially combine the above useful properties with high
thermostability. However, group II introns have remained untapped as a source of RTs
for biotechnological applications due to two major challenges: (i) although hundreds of
group II intron RTs were identified by genome sequencing (Candales et al., 2012), they
often have mutations that decrease or abolish RT activity, suggesting that they are under
selective pressure to suppress intron mobility, which is deleterious to their hosts (Mohr et
al., 2010); and (ii) group II intron RTs have generally been difficult to express with high
yield and activity and become mostly insoluble without the bound intron RNA (Vellore et
al., 2004; Ng et al., 2007). Most previous studies of group II intron RTs have focused on
the LtrA protein encoded by the Lactococcus lactis Ll.LtrB intron, for which expression
and solubility problems could be partially overcome under some experimental conditions.
The LtrA protein has been expressed in Escherichia coli with a cleavable intein-affinity
tag and purified with relatively high yield and activity (Saldanha et al., 1999). In vivo, the
LtrA protein synthesizes a full-length cDNA copy of the ~3-kb Ll.LtrB intron with a
significantly lower error rate (~10^-5) than that of retroviral RTs (Cousineau
et al., 1998; Conlan et al., 2005).
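As a rough illustration of what an error rate of this magnitude implies for a full-length intron copy, the sketch below estimates the expected number of misincorporations and the fraction of error-free cDNAs. It assumes independent errors at a uniform per-nucleotide rate, which is a simplification introduced here for illustration and not a model taken from the cited studies; the function names are hypothetical.

```python
def expected_errors(error_rate: float, length_nt: int) -> float:
    """Expected number of misincorporations in one cDNA copy."""
    return error_rate * length_nt


def error_free_fraction(error_rate: float, length_nt: int) -> float:
    """Fraction of cDNA copies with no misincorporation.

    Assumes each nucleotide is copied independently with the same
    per-nucleotide error rate (an illustrative simplification).
    """
    return (1.0 - error_rate) ** length_nt


# Values quoted in the text: error rate of ~10^-5 and a ~3-kb intron.
ltra_error_rate = 1e-5
intron_length_nt = 3000

print(expected_errors(ltra_error_rate, intron_length_nt))
print(error_free_fraction(ltra_error_rate, intron_length_nt))
```

Under these assumptions, a 3-kb copy carries about 0.03 expected errors, so roughly 97% of full-length cDNAs would be error-free. The same functions can be called with any other error rate to compare enzymes.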
Our laboratory recently identified thermostable group II introns that are actively
mobile (Mohr et al., 2010), and developed general methods for the high-level expression
of thermostable group II intron RTs (TGIRTs) as fusion proteins with a non-cleavable
solubility tag attached via a rigid linker (denoted MRF) (Mohr et al., 2013). The two most
active TGIRTs identified were TeI4c-MRF from Thermosynechococcus elongatus and
GsI-IIC-MRF from Geobacillus stearothermophilus (Vellore et al., 2004; Mohr et al.,
2010, 2013). We found that these TGIRT enzymes have higher thermostability,
processivity, and fidelity than retroviral RTs. They carried out reverse transcription
at high temperatures (up to 81°C) and synthesized cDNAs with uniform 5’ to 3’
coverage of a 1.2-kb RNA template, as measured by a TaqMan qRT-PCR assay. Similarly,
in a capillary electrophoresis assay, TGIRTs produced full-length cDNAs of an 807-nt
highly structured group II intron RNA with significantly fewer premature stops than
SuperScript III (SSIII; Thermo Fisher Scientific), a widely used genetically engineered
derivative of Moloney murine leukemia virus (M-MLV) RT. The high signal (full-length
cDNA copies) to noise (premature RT stops) ratio is crucial for accurately and efficiently
mapping RNA structures and RNA-protein interactions. Finally, the TGIRT enzymes
were found to have a two- to four-fold lower in vitro error rate than SSIII in an M13-
based lacZ forward mutation assay (Mohr et al., 2013).
1.4 THERMOSTABLE GROUP II INTRON REVERSE TRANSCRIPTASES ARE ADVANTAGEOUS
FOR NEXT-GENERATION RNA SEQUENCING
Next-generation RNA sequencing (RNA-seq) is a powerful method for
transcriptome profiling and gene expression analysis, with applications that include the
identification of novel biomarkers and new diagnostic methods for diseases (Wang et al.,
2009; Wilhelm and Landry, 2009; Ozsolak and Milos, 2011; Chen et al., 2012).
All RNA-seq methods rely upon an initial cDNA synthesis step in which a reverse
transcriptase (RT) converts RNA sequences into DNA, which can then be sequenced by
powerful high-throughput DNA sequencing technologies. Current RNA-seq methods can
be divided into two general categories. In one category, used for the analysis of mRNAs
and long non-coding RNAs (lncRNAs), the initial reverse transcription step typically
enriches for cDNAs of polyadenylated (poly(A)+) RNAs, either by priming with
oligo(dT) or by priming with random oligomers after depletion of the highly abundant
rRNAs (Levin et al., 2010; Ozsolak and Milos, 2011). The resulting cDNAs are then
converted into suitably sized double-stranded DNAs and ligated to platform-specific
sequencing adapters (Ozsolak and Milos, 2011). The most widely used of these methods
employs RNA fragmentation, random hexamer priming, and addition of dUTP during
second-strand synthesis; after adapter ligation, the uridine-containing second strand is
either excluded during PCR with a high-fidelity DNA polymerase or degraded
enzymatically to achieve strand specificity (Levin et al., 2010; Head et al., 2014). A
second category of RNA-seq methods, used for miRNAs and other small non-coding
RNAs (small ncRNAs), involves ligation of RNA-seq adapters containing primer-binding
sites to the 3’ and/or 5’ ends of target RNAs with RNA ligase, followed by reverse
transcription and PCR amplification for RNA-seq library construction (Levin et al., 2010;
Raabe et al., 2014). Limitations of these methods include: (i) the inability to
comprehensively profile mRNAs and lncRNAs together with small ncRNAs in the same
RNA-seq reaction; (ii) the relatively low fidelity and processivity of retroviral RTs used
for cDNA synthesis (Hu and Hughes, 2012), making it difficult to analyze RNA sequence
polymorphisms and highly structured or GC-rich RNAs; and (iii) the inefficiency and/or
biases introduced by RNA-seq adapter ligation using RNA ligases or by random hexamer
priming (Linsen et al., 2009; Hansen et al., 2010; Levin et al., 2010; Lamm et al., 2011;
Raabe et al., 2014).
In addition to high thermostability, processivity, and fidelity, properties that are
useful for producing full-length reads from highly structured or GC-rich RNAs,
TGIRT enzymes also have a novel end-to-end template-switching activity that can attach
RNA-seq adapters to the target RNA during reverse transcription without a separate RNA
ligase step (Mohr et al., 2013). TGIRTs differ from retroviral RTs in template-switching
with minimal base-pairing to the 3’ ends of the target RNA (Mohr et al., 2013). Recent
work in our lab showed that the use of TGIRT template-switching enables facile
RNA-seq analysis of miRNAs with less bias than two commercial kits and could potentially
have wide RNA-seq applications (Mohr et al., 2013).
1.5 OVERVIEW OF THE DISSERTATION RESEARCH
This dissertation focuses on the further development of the TGIRT template-
switching method and its broad applications in next-generation RNA sequencing,
diagnostics and precision medicine. Specifically, by providing a new biotechnology that
is simple, rapid and efficient, I aim to: (i) contribute new insights into biological studies
that require high-throughput sequencing of structured RNAs that are refractory to
conventional RNA-seq analysis (Chapter 2); (ii) develop sensitive, non-invasive and cost-
effective diagnostic tools and personalized medical care for diseases, including cancer
(Chapters 3 and 4); and (iii) improve the accuracy and efficiency of current research tools used
in the mapping of RNA secondary structure and RNA-protein interactions (Chapter 5).
In Chapter 2, I optimized the initial TGIRT template-switching method for RNA-
seq analysis of diverse small RNA classes, now referred to as the TGIRT-seq small
RNA/CircLigase method, and demonstrated its usefulness by sequencing tRNAs, which
are virtually absent from datasets obtained with conventional RNA-seq methods due to
their stable secondary and tertiary structures, and extensive post-transcriptional
modifications. Through collaboration with Dr. Tao Pan’s research group at the University
of Chicago, I developed an efficient and quantitative high-throughput tRNA sequencing
method that can be widely used in studies of tRNA expression, modification and
regulation. Additionally, I describe two studies that revealed novel functions of tRNA-
binding proteins by utilizing TGIRTs for tRNA deep sequencing. In the first study, by
collaborating with Dr. Kathleen Collins’ research group at the University of California-
Berkeley, we showed that the human interferon-induced protein IFIT5 binds to a broad
spectrum of precursor and processed tRNA transcripts, uncovering a surprisingly flexible
order of human tRNA processing reactions, and potential roles of IFIT5 protein in
cytosolic tRNA quality control and innate immunity. In the second study, by
collaborating with several research groups, including those of Dr. Adam Frost at the University of
Utah, Dr. Onn Brandman at Stanford University, and Drs. Jonathan Weissman and
Yifan Cheng at the University of California-San Francisco, we established the tRNA
recognition specificity of the Rqc2 protein, a component of the yeast quality control
complex, and uncovered a novel mRNA-independent mechanism for elongation of
nascent peptides.
In Chapter 3, I developed a new TGIRT-seq method that is simple, rapid and
efficient for analysis of RNAs of all sizes in a single RNA-seq reaction, now referred to
as the TGIRT-seq total RNA method. I demonstrated the use of the method in profiling
circulating RNAs in human plasma. Circulating RNAs are potentially useful as
biomarkers for human diseases. However, the extraction and analysis of circulating
RNAs have been challenging due to their extremely low quantity and quality. In this
chapter, I describe methods for plasma RNA isolation and RNA-seq analysis by the TGIRT-
seq total RNA method, which enabled construction of RNA-seq libraries from <1 ng of
plasma RNAs in <5 h. TGIRT-seq of RNA in 1-mL plasma samples from a healthy
individual revealed RNA fragments mapping to a diverse population of protein-coding
genes and lncRNAs, which are enriched in intron and antisense sequences, as well as
nearly all known classes of small ncRNAs, some of which have never before been seen in
plasma. Surprisingly, many of the small ncRNA species were present as full-length
transcripts, suggesting that they are protected from plasma RNases in ribonucleoprotein
(RNP) complexes and/or exosomes. The TGIRT-seq total RNA method is readily
adaptable for profiling of whole-cell and exosomal RNAs, and related procedures
including ribosome profiling.
In Chapter 4, by using RNAs isolated from extracellular vesicles in plasma, I
explored the use of the TGIRT-seq total RNA method for the identification of novel
biomarkers in patients at different stages of multiple myeloma, which is a prevalent blood
cancer. This is an on-going study done in collaboration with Drs. Flavia Pichiorri and
Craig Hofmeister’s group at the Ohio State University. Preliminary sequencing results
showed that TGIRT-seq identified differentially expressed mRNA transcripts that are
consistent with patient survival based on a published microarray-based gene expression
dataset (Popovici et al., 2010; Shi et al., 2010). Additionally, TGIRT-seq also identified
several small ncRNAs as potential novel biomarkers, including Y RNA derived
fragments. Other on-going collaborations described in Chapter 4 include analysis of FFPE
(formalin-fixed, paraffin-embedded) tumor tissue, PBMCs (peripheral blood
mononuclear cells) and plasma samples from patients with inflammatory breast cancer
with Dr. Naoto Ueno’s group at the MD Anderson Cancer Center, and analysis of plasma
samples with Dr. Joseph McCormick’s group at the University of Texas Rio Grande
Valley for a large-scale population study of environmental impact on human health.
In Chapter 5, I adapted TGIRT-seq for use in commonly used procedures for mapping
RNA secondary structure and RNA-protein interactions, including: (i) selective 2′-
hydroxyl acylation analyzed by primer extension (SHAPE); (ii) cross-linking and
analysis of cDNAs (CRAC); and (iii) individual-nucleotide resolution cross-linking and
immunoprecipitation (iCLIP). Using the group IIC intron GsI-IIC, found in Geobacillus
stearothermophilus, and its encoded protein (denoted GsI-IIC-MRF), as an in vitro model
system, I demonstrated the ability of TGIRT-SHAPE to map the secondary structure of a
722-nt highly structured GsI-IIC intron RNA at single-nucleotide resolution using a
single primer annealed to the 3’ end of the RNA. The secondary structure of GsI-IIC
intron RNA obtained by TGIRT-SHAPE agreed with that predicted based on
phylogenetic studies (unpublished). I also used TGIRT-CRAC to identify the direct
interaction sites between GsI-IIC intron RNA and its IEP at the pre-catalytic step of
splicing. Preliminary data identified regions known to be involved in IEP binding in other
group II introns, and several nucleotides involved in long-range RNA interactions at the
tertiary level, suggesting the IEP functions to facilitate formation of active intron RNA
structures during splicing. I also contributed to adapting TGIRT-seq for iCLIP procedures
to study RNA-protein interactions in vivo, including the identification of RNA substrates
and binding sites recognized by NS1 protein of influenza virus, and by human MDA5
protein, through collaborations with research groups including Dr. Krug at the University
of Texas at Austin and Dr. Michael Gale, Jr. at the University of Washington,
respectively.
Figure 1.1: Group II intron splicing and mobility.
(A) Intron splicing. After transcription, the group II intron RNA folds into conserved
secondary and tertiary structures and forms an active site that binds the splice sites and the
branch-point nucleotide to catalyze splicing. The intron-encoded protein is a multifunctional
reverse transcriptase (RT) that binds specifically to the intron RNA and stabilizes the
catalytically active RNA structure for RNA splicing. (B) Intron mobility. After splicing, the
group II intron RT remains bound to the excised intron lariat RNA in an RNP that
promotes intron mobility (“retrohoming”) to new DNA sites. In this process, the intron RNA
reverse splices directly into the top strand of the target DNA, while the intron-encoded
multifunctional RT cleaves the bottom strand of the target DNA and uses the 3′ end of the
cleavage site as a primer for reverse transcription of the inserted intron RNA. The
resulting intron cDNA is integrated into the host genome by cellular DNA recombination
and/or repair mechanisms (Lambowitz and Belfort, 2015).
Figure 1.2: Comparison of group II intron and retroviral RTs.
Group II intron RT domains: N-terminal RT domain with conserved sequence
blocks RT-1 to RT-7, corresponding to the fingers and palm domains of retroviral RTs
(HIV-1 RT); X/thumb with predicted α-helices (above) corresponding to thumb domain
of retroviral RTs; C-terminal DNA binding (D) and DNA endonuclease (En) domains
instead of the RNase H domain of retroviral RTs (HIV-1 RT). Group II intron RTs have
an N-terminal extension (RT-0) and insertions between the conserved RT sequence
blocks (RT-2a, -3a, -4a and -7a) that are absent in retroviral RTs (HIV-1 RT).
Chapter 2: RNA-seq of transfer RNAs
2.1 EFFICIENT AND QUANTITATIVE HIGH-THROUGHPUT TRNA SEQUENCING*
*Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12, 835-837 (2015). Authors
include Guanqun Zheng, Yidan Qin, Wesley C. Clark, Qing Dai, Chengqi Yi, Chuan He, Alan M.
Lambowitz and Tao Pan; G.Z. and Y.Q. equally contributed to this work; T.P. and A.M.L. jointly
supervised this work.
Widely used small RNA-seq methods start with adaptor ligation and cDNA
synthesis from biological RNA samples followed by PCR amplification to generate
sequencing libraries (Wang et al., 2009). These standard methods are able to sequence
most cellular RNAs, including short-read sequencing of microRNAs, fragments of
rRNAs, small nuclear RNAs, and small nucleolar RNAs, or fragmented mRNAs and
long-noncoding RNAs (lncRNAs). tRNA is the only class of small cellular RNA for
which the standard sequencing methods cannot yet be applied efficiently and
quantitatively, although attempts have been made (Pang et al., 2014). Significant
obstacles for the sequencing of tRNA include the presence of numerous post-
transcriptional modifications and its stable and extensive secondary structure, which
interfere with cDNA synthesis and adaptor ligation. tRNAs are essential for cells, and
their synthesis is under stringent cellular control (Phizicky and Hopper, 2010). Recent
findings show that tRNA expression and mutations, and cleaved tRNA fragments are
associated with various diseases, such as neurological pathologies and cancer
development (Abbott et al., 2014; Anderson and Ivanov, 2014; Goodarzi et al., 2015;
Kirchner and Ignatova, 2015). The lack of efficient and quantitative tRNA sequencing
methods has hindered biological studies of tRNA.
2.1.1 tRNA sequencing by combining demethylase treatment with the TGIRT-seq
small RNA/CircLigase method
In collaboration with Dr. Tao Pan’s research group at the University of Chicago,
we applied two strategies to eliminate or substantially reduce the obstacles of tRNA
modification and structure for efficient and quantitative tRNA sequencing (Fig. 2.1)
(Zheng et al., 2015).
First, an enzyme mixture was used to remove methylations at the Watson-Crick
face. Three specific modifications are abundant in eukaryotic tRNAs and are particularly
problematic for reverse transcriptases (RTs), causing cDNA synthesis to stop or
incorporate a wrong nucleotide. In mammals, N1-methyladenosine (m1A) is present in all
tRNAs at position 58, N3-methylcytosine (m3C) is present in five tRNAs at position 32
and the variable loop, and N1-methylguanosine (m1G) is present in about half of all
tRNAs at position 37 or 9 (Fig. 2.1A). Our collaborators used a mixture of two
recombinant enzymes, a wild-type AlkB (wtAlkB) from E. coli and an engineered mutant
AlkB (D135S) to remove ~70-80% of these three methylations in human tRNAs (Falnes
et al., 2002; Trewick et al., 2002). The remaining m1A or m1G may be buried deeper in
the tRNA tertiary structure (m1A or m1G at position 9 of tRNAs) and thus not easily
accessible to demethylase treatment without causing tRNA degradation.
Second, we used a thermostable group II intron reverse transcriptase (TGIRT) to
generate cDNAs from highly structured tRNAs (Fig. 2.1B). First, the TGIRT binds to an
initial template-primer substrate comprised of an RNA oligonucleotide containing RNA-
seq adapter sequences annealed to a complementary DNA primer. For Illumina
sequencing, the RNA-seq adapter contains both Illumina Read 1 and 2 primer-binding
sites, and the DNA primer contains the complementary sequence (Materials and
Methods). After forming a complex with the initial template-primer substrate, the TGIRT
initiates reverse transcription by switching directly from the 5’ end of the RNA-seq
adapter to the 3’ end of a target RNA, yielding a continuous cDNA linking the two
sequences. The RNA-seq adapter has a 3’-blocking group that impedes secondary
template-switching to the 3’ end of that RNA.
To increase the efficiency of template-switching, the DNA primer annealed to the
RNA-seq adapter in the initial adapter substrate has a single-nucleotide 3’ overhang. This
3’-overhang nucleotide base-pairs to the 3’-terminal nucleotide of the target RNA,
resulting in a seamless template-switching junction between the RNA-seq adapter and the
target RNA (Mohr et al., 2013). In the present work, an initial template-primer substrate
with a single T overhang was used to enrich for mature tRNAs, which always have an A
at their 3’ ends due to the post-transcriptionally added CCA sequence. Alternatively, an
equimolar mixture of A, C, G, and T 3’ overhangs (denoted N (Mohr et al., 2013)) can be
used to construct RNA-seq libraries from RNA pools with minimal bias. The resulting
cDNAs are gel-purified, circularized by CircLigase II ssDNA Ligase (Epicentre),
amplified by PCR, and sequenced on an Illumina instrument. This method is widely
applicable for other small RNA classes, including miRNA, and is referred to as the
TGIRT-seq small RNA/CircLigase method.
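
The base-pairing rule behind the single-nucleotide 3’ overhang can be illustrated with a minimal Python sketch. The function names and the two example sequences are hypothetical and are not part of the published protocol. The sketch only checks Watson-Crick complementarity between the DNA overhang nucleotide and the 3’-terminal nucleotide of a target RNA; in practice the overhang enriches for matching RNAs rather than excluding all others.

```python
# DNA overhang nucleotide -> RNA 3'-terminal nucleotide it pairs with
# (Watson-Crick pairs only; wobble pairing is ignored in this sketch).
PAIRS = {"T": "A", "A": "U", "G": "C", "C": "G"}


def overhang_pairs(overhang: str, target_rna: str) -> bool:
    """True if the primer's 3' overhang base-pairs with the RNA's 3' end."""
    return PAIRS[overhang.upper()] == target_rna.upper()[-1]


def favored_targets(overhang: str, rnas: list[str]) -> list[str]:
    """RNAs whose 3'-terminal nucleotide is complementary to the overhang."""
    return [rna for rna in rnas if overhang_pairs(overhang, rna)]


# Hypothetical inputs: one RNA ending in CCA (as mature tRNAs do) and one
# RNA ending in U.
rnas = ["GCAUUGGUCCA", "UGAGGUAGUAGGUUGUAUAGUU"]
t_favored = favored_targets("T", rnas)  # only the CCA-terminated RNA
```

With an equimolar N overhang mixture, every RNA in the pool has one matching primer species, which is the rationale for using N when minimal 3’-end bias is desired.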
The tRNA libraries were sequenced on an Illumina HiSeq 2500 instrument. The
sequencing reads were mapped to the genomic tRNA database, which contains 515
predicted tRNA genes distributed over 330 unique sequences and 110 predicted tRNA
pseudogenes (Chan and Lowe, 2009). In combination with the demethylase treatment, we
obtained longer and full-length tRNA reads with markedly reduced amounts of RT stops
at the m1A58 and m1G37 positions (Figure 2.1C), a property that is crucial for the ability
to adequately map the mammalian tRNAome at single-base resolution. Alternatively, we
found that the TGIRT enzyme produced more full-length cDNA products when the
reaction time of reverse transcription was increased (Fig. 2.1D).
2.1.2 Analysis of tRNA isoacceptors, modifications and gene expression
Our collaborators performed additional analysis to further demonstrate the
usefulness of our sequencing method. Plotting each tRNA isoacceptor against its gene
copy number showed a poor correlation, which is consistent with the known tissue-
specific tRNA expression in humans (Dittmar et al., 2006; Gingold et al., 2014). The
comparison between sequencing and array results of the Arg-tRNA showed the same
trend of isoacceptor abundance, thus validating the quantitative nature of tRNA
abundance obtained independently through sequencing- and hybridization-based
approaches. The analysis of RT misincorporations at known modification positions with
and without demethylase treatment indicates that the DM-tRNA-seq (demethylase-
thermostable group II intron RT tRNA sequencing) method can determine differences in
the modification dynamics of m1A, m1G and m3C at single-base resolution, as well as
potentially infer positions of non-demethylated modifications. Finally, the examination of
unique tRNA genes from human chromosome 6, which contains a major tRNA gene
cluster (Horton et al., 2004), showed higher expression level within the cluster than
outside of the cluster. The expression levels of tRNA genes in the cluster were uneven,
suggesting that the expression of tRNA genes was not coordinated throughout the entire
cluster in HEK293T cells.
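
The misincorporation analysis described above amounts to counting, at a given tRNA position, how many covering reads disagree with the reference base. The following Python sketch shows one way this could be computed. It assumes ungapped alignments given as (start, sequence) pairs, and the function name and input layout are illustrative assumptions rather than the pipeline used in the published work.

```python
def mismatch_fraction(reference: str, aligned_reads, position: int) -> float:
    """Fraction of reads covering `position` that differ from the reference.

    reference     -- reference tRNA sequence (DNA alphabet)
    aligned_reads -- iterable of (start, sequence) ungapped alignments,
                     with 0-based start coordinates on the reference
    position      -- 0-based reference position to examine
    """
    covered = 0
    mismatched = 0
    for start, seq in aligned_reads:
        offset = position - start
        if 0 <= offset < len(seq):  # read covers the position
            covered += 1
            if seq[offset] != reference[position]:
                mismatched += 1
    return mismatched / covered if covered else 0.0
```

Computing this value for the same position in libraries prepared with and without demethylase treatment gives a simple per-position comparison of the two conditions.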
2.1.3 Discussion
The approach described above makes efficient and quantitative tRNA-seq
feasible. Furthermore, in a time-course reverse transcription reaction of tRNAs, the
TGIRT enzyme produced more full-length cDNA products at longer time points. This
suggests extremely tight binding to the RNA template by the TGIRT enzyme, which
stalls at the modification sites without falling off, and is capable of reading through the
modified nucleotides with more time given. Interestingly, it also appears that the TGIRT
enzyme yields a distinct pattern of misincorporated nucleotides characteristic of the
modification, providing an additional advantage of being able to study modifications at
single-nucleotide resolution in a high-throughput manner.
2.2 ANALYSIS OF PRECURSOR AND MATURE TRNAS ASSOCIATED WITH THE HUMAN
INTERFERON-INDUCED PROTEIN IFIT5*
*Broad and adaptable RNA structure recognition by the human interferon-induced tetratricopeptide repeat
protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030 (2014). Authors include George E.
Katibah, Yidan Qin, David J. Sidote, Jun Yao, Alan M. Lambowitz, and Kathleen Collins. G.E.K., A.M.L.,
and K.C. designed research; G.E.K., Y.Q., and D.J.S. performed research; Y.Q., D.J.S., and J.Y.
contributed new reagents/analytic tools; G.E.K., Y.Q., D.J.S., J.Y., A.M.L., and K.C. analyzed data; and
G.E.K., Y.Q., D.J.S., J.Y., A.M.L., and K.C. wrote the paper.
Innate immune responses provide a front-line defense against pathogens. Unlike
adaptive immune responses, innate immunity relies on general principles of
discrimination between self and pathogen epitopes to trigger pathogen suppression
(Gürtler and Bowie, 2013). Pathogen-specific features that can provide this
discrimination come under evolutionary selection to evade host detection, and in turn
host genes adapt new recognition specificities for pathogen signatures. Among the most
clearly established targets of innate immune response recognition are nucleic acid
structures not typical of the host cell, such as cytoplasmic double-stranded RNA (Goubau
et al., 2013). Detection of a pathogen nucleic acid signature robustly induces type I
interferon, which activates a cascade of pathways for producing anti-viral effectors
(Schoggins and Rice, 2011). Highly expressed interferon-induced proteins with
tetratricopeptide repeats (IFITs) are proposed to function as RNA binding proteins, but
the RNA binding and discrimination specificities of IFIT proteins remain unclear.
2.3.1 The human IFIT5 protein
Cytoplasmic viral RNA synthesis occurs without co-transcriptional coupling to
the 5'-capping machinery, which acts pervasively on host-cell nuclear RNA polymerase
II transcripts (Ghosh and Lima, 2010; Topisirovic et al., 2011). Eukaryotic mRNA 5'
ends are first modified by addition of a cap0 structure containing N7-methylated
guanosine, which is joined to the first nucleotide (nt) of the RNA by a 5’-5’ triphosphate
linkage (7mGpppN). In higher eukaryotes including humans, cap0 is further modified by
ribose 2'-O-methylation of at least 1 nt (7mGpppNm, cap1) and sometimes 2 nt
(7mGpppNmpNm, cap2). Cap0 addition makes essential contributions to mRNA
biogenesis and function in steps of mRNA splicing, translation and protection from decay
(Ghosh and Lima, 2010; Topisirovic et al., 2011). In contrast, the biological role of
mRNA cap0 modification to cap1 and cap2 structures is largely enigmatic. Some viruses
encode enzymes for 7mGpppN formation and less frequently the ribose 2'-O-methylation
necessary to generate cap1 (Decroly et al., 2012). Recent studies show that virally
encoded cap 2’-O-methyltransferase activity can inhibit the innate immune response
(Daffis et al., 2010; Züst et al., 2011; Szretter et al., 2012; Habjan et al., 2013; Kimura et
al., 2013).
The IFIT family of interferon-induced proteins with tetratricopeptide repeats
(TPRs) are among the most robustly accumulated proteins following type I interferon
signaling (Diamond and Farzan, 2013; Zhou et al., 2013). Phylogenetic analyses reveal
different copy numbers and combinations of four distinct IFIT proteins (IFIT1, 2, 3 and
5) even within mammals, generated by paralog expansions and/or gene deletions,
including the loss of IFIT5 in mice and rats (Liu et al., 2013). Human IFIT1, IFIT2 and
IFIT3 co-assemble in cells into poorly characterized multimeric complexes that exclude
IFIT5 (Pichlmair et al., 2011; Katibah et al., 2013). Recombinant IFIT-family proteins
range from monomer to multimer, with crystal structures solved for a human IFIT2
homodimer, the human IFIT5 monomer, and an N-terminal fragment of human IFIT1
(Yang et al., 2012; Abbas et al., 2013; Feng et al., 2013; Katibah et al., 2013). Studies of
IFIT1 report its preferential binding to either 5' triphosphate (ppp) RNA or cap0 RNA or
optimally cap0 without guanosine N7-methylation (Pichlmair et al., 2011; Habjan et al.,
2013; Kimura et al., 2013; Kumar et al., 2014). Reports of IFIT5 RNA binding specificity
are likewise inconsistent: the protein has been described to bind RNA single-stranded 5'
ends with ppp and monophosphate (p) but not OH (Katibah et al., 2013); ppp but not p,
OH or cap0 (Abbas et al., 2013); ppp but not cap0 (Habjan et al., 2013; Kumar et al.,
2014); or single-stranded 5'-p RNA and double-stranded DNA (Feng et al., 2013).
Using structure-guided mutagenesis coupled with quantitative binding assays of
purified recombinant protein, our collaborators in Dr. Kathleen Collins’s research group
at the University of California-Berkeley established that IFIT5 can alternatively expand
or introduce bias in protein binding to RNAs with 5' monophosphate, triphosphate, cap0
(triphosphate-bridged N7-methylguanosine) or cap1 (cap0 with RNA 2’-O-methylation)
(Katibah et al., 2014). This surprisingly adaptable IFIT5 recognition specificity for RNA
5' structure in vitro suggested that it could bind to many cellular RNAs.
2.3.2 TGIRT-seq profiling of IFIT5-bound cellular RNAs
To investigate the diversity of IFIT5-bound cellular RNAs in an unbiased manner,
we deep sequenced RNAs copurified with IFIT5 from HEK293 cells. A previous study
had shown that IFIT5 binds to tRNAs (Katibah et al., 2013), which are recalcitrant to
standard sequencing methods. Therefore, we used the TGIRT-seq small RNA/CircLigase
method, with an equimolar mixture of A, T, G, C overhangs in the initial template-primer
substrate for RNA-seq library construction with minimal bias. TGIRT-seq was first
performed for cellular RNAs that co-purified with IFIT5 from a HEK293 cell line with
3xF-IFIT5 expressed at a physiological level (Katibah et al., 2013). To capture in vivo
protein-RNA interactions, formaldehyde cross-linking was used before stringent
purification and then reversed prior to analyzing the bound RNAs. We also compared
IFIT5-bound RNAs isolated under native affinity purification conditions from extracts of
cells with or without prior interferon-β treatment. In addition, we compared wild-type and
mutant E33A and E33A/D334A IFIT5 proteins expressed in HEK293 cells by transient
transfection. In the first set of 3 samples, comparing wild-type IFIT5 with or without
formaldehyde cross-linking or interferon-β treatment prior to cell lysis, cDNA products
were pooled and amplified together (Fig. 2.2A, Table 2.1). In the second set of 3 samples,
because RNAs bound to wild-type and mutant IFIT5 had different size profiles, we
amplified and sequenced discrete pools of cDNA lengths (Fig. 2.2B, Table 2.1). Finally,
in a third sample, we pooled cDNAs before amplification and sequencing for a biological
replicate of the wild-type versus mutant IFIT5 comparison (Table 2.2). For each
purification condition, cDNAs were sequenced on an Illumina MiSeq to a depth of 1
million or more reads, which were mapped to the Ensembl GRCh37 human genome
reference sequence.
RNA from IFIT5 purifications gave TGIRT-seq reads that mapped predominantly
to tRNA gene loci in all samples (Table 2.1 and Table 2.2). Cross-linked and native
extract purifications showed a large diversity of bound tRNAs, with reads from different
samples mapping to 507-527 of the 625 annotated human tRNA and tRNA pseudogene
loci (Fig. 2.3). For IFIT5 expressed by transient transfection with size-selected cDNA
pools sequenced separately, the largest cDNA size pool contained substantial amounts of
5S rRNA, which is less abundant in the cross-linked RNA purification (Table 2.1) and
thus could in part reflect IFIT5 binding of a highly abundant RNA in native cell extract
(Katibah et al., 2013) (Table 2.1; size categories a, b and c correspond to cDNA of ~55-
82, 84-150 and 150-230 nt, respectively, including the 42 nt primer added by template-
switching; Fig. 2.2).
2.3.3 IFIT5 binds to a broad spectrum of precursor and processed tRNAs, as well as
other RNA polymerase III transcripts
To further characterize IFIT5-bound tRNAs, we plotted read coverage across
individual tRNA loci from 50 bp upstream to 50 bp downstream of the mature tRNA
ends, with representative coverage plots shown for the cross-linked RNA sample (Fig.
2.4; mature RNA ends are indicated with dashed lines). Some tRNA loci were
represented by reads abundant only across the mature tRNA region (iMetCAT, AspGTC
and HisGTG). Read alignments to the genome sequence revealed that many IFIT5-bound
mature tRNAs were full length including the post-transcriptionally added 3’ CCA tail. In
the case of HisGTG, the alignments also detected the expected post-transcriptional 5’
guanosine addition (Fig. 2.5A) (Phizicky and Hopper, 2010). Post-transcriptionally
modified nucleotides within the tRNA were evident from positions of frequent read
mismatch to the genome sequence (Fig. 2.5A). Some IFIT5-bound tRNA reads had
truncated 5' and/or 3' ends (Fig. 2.4 and Fig. 2.5A; iMetCAT) resulting from nuclease
cleavage of tRNAs and, for 5'-truncated ends, potentially from premature reverse
transcription stops.
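
As an illustration of the coverage plots described above, the following Python sketch computes per-base read depth over a window extending 50 bp upstream and downstream of a mature tRNA. The coordinate convention (0-based, half-open intervals) and the function name are assumptions made for this example and do not describe the actual analysis scripts.

```python
FLANK = 50  # bp of flanking sequence on each side of the mature tRNA


def coverage_profile(mature_start, mature_end, reads, flank=FLANK):
    """Per-base read depth over [mature_start - flank, mature_end + flank).

    mature_start, mature_end -- 0-based, half-open mature tRNA coordinates
    reads                    -- iterable of (start, end) aligned intervals,
                                also 0-based and half-open
    """
    win_start = mature_start - flank
    win_end = mature_end + flank
    depth = [0] * (win_end - win_start)
    for r_start, r_end in reads:
        # Clip each read to the window; reads outside it contribute nothing.
        lo = max(r_start, win_start)
        hi = min(r_end, win_end)
        for pos in range(lo, hi):
            depth[pos - win_start] += 1
    return depth
```

In such a profile, reads confined to the mature region produce depth only between the two mature ends, whereas 5'-extended precursors add depth in the upstream flank.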
In addition to mature tRNAs, we were surprised to find that numerous tRNA loci
were represented by abundant IFIT5-bound tRNAs with the 5' extension of a primary Pol
III transcript (Fig. 2.4, AlaTGC, ValAAC, ArgTCT, and LeuCAA). Many of these 5'-
extended tRNAs included the full-length mature tRNA sequence with a 3' CCA tail (Fig.
2.5A, AlaTGC). Also, some IFIT5-bound tRNAs with a 5' precursor extension and CCA
tail had undergone splicing to remove the intron (Fig. 2.4, ArgTCT and LeuCAA), which
is unexpected given that 5' processing precedes splicing in known tRNA biogenesis
pathways (Phizicky and Hopper, 2010). Furthermore, some of the spliced tRNAs had
aberrant splice junctions suggestive of missplicing (Fig. 2.5B). Of interest, some 5' and/or
3' extended or truncated tRNAs had post-transcriptionally appended poly-U tails (Table
2.1, Table 2.2 and Fig. 2.5C). We also found tRNA pseudogene transcripts (Fig. 2.4,
PseudoCCC), as well as a few tRNAs with atypically long 5' or 3' extensions or with
sequence reads ending at an internal modified nucleotide position suggestive of a reverse
transcription stop.
Cellular IFIT5 binding to the RNAs described above is consistent with its
biochemical specificity of RNA interaction in vitro: precursor tRNAs are expected to
have 5'-ppp from RNA polymerase III initiation, while mature tRNAs are expected to
have 5'-p generated by RNase P. Although biochemically consistent, some types of
incompletely processed IFIT5-bound tRNAs should be nuclear, whereas IFIT5 is
cytoplasmic (Katibah et al., 2013). The cytoplasmic localization of IFIT5 suggests that
some immature or aberrantly processed tRNA transcripts escape the nucleus to become
available for IFIT5 binding, either via mistransport or during mitosis.
IFIT5 also bound to a family of cytoplasmic, ~120 nt, Alu-related, primate-
specific RNA polymerase III small NF90-associated RNA (snaR) transcripts and 5S
rRNA (Fig. 2.4, Table 2.1, Table 2.2 and Fig. 2.5D). The snaRs have a single-stranded 5’
end but extensive secondary structure that impedes cDNA synthesis by a conventional
reverse transcriptase (Parrott and Mathews, 2007; Parrott et al., 2011). Nonetheless
TGIRT-seq gave coverage across the full snaR (Fig. 2.4 and Fig. 2.5D). The snaR
association with IFIT5 was further confirmed using blot hybridization. Notably, the poly-
U tailing of IFIT5-bound tRNAs was also observed for IFIT5-bound snaRs (Fig. 2.5D).
Compared to wild-type IFIT5 assayed in parallel, mutant E33A or E33A/D334A
IFIT5 purifications contained an increased proportion of rRNA and mRNA (Table 2.1
and Table 2.2). The mRNA reads showed no obvious bias for 5' ends and were more
abundant in native than in cross-linked samples (Table 2.1), suggestive of IFIT5 binding
to 5'-p mRNA fragments generated in cell extract. To investigate a potential change in
specificity of IFIT5 binding to tRNA 5' ends imposed by the E33A and E33A/D334A
substitutions, we determined the overall frequency of tRNA read start-site positions for
all tRNA loci combined (Fig. 2.6). Using reads mapped against tRNA loci from 50 bp
upstream to 50 bp downstream of the mature tRNA ends, most read start sites
corresponded to 5'-extended precursor (positions 1-50) or the mature tRNA 5’ end
(position 51). The cross-linked sample had a higher fraction of mature tRNA start sites at
position 51 than the two native purifications from the same cell line (Fig. 2.6A). Mutant
E33A or E33A/D334A IFIT5 purifications also showed an increased fraction of read start
sites at the mature tRNA 5' end (position 51) compared to the parallel purification of
wild-type IFIT5 (Fig. 2.6B), possibly reflecting some shift of the mutant IFIT5 proteins
toward binding of 5’-p versus 5’-ppp RNAs.
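
The start-site tabulation can be sketched in Python as follows, using the position numbering given above (positions 1-50 for the 5'-extended precursor region and position 51 for the mature tRNA 5’ end). The category labels, the treatment of positions beyond 51 as "internal", and the function names are assumptions introduced for illustration only.

```python
from collections import Counter

CATEGORIES = ("precursor", "mature", "internal")


def classify_start(pos: int) -> str:
    """Classify a 1-based read start position within the tRNA locus window."""
    if pos < 1:
        raise ValueError("start positions are 1-based within the window")
    if pos <= 50:
        return "precursor"  # within the 50-bp upstream flank
    if pos == 51:
        return "mature"     # first nucleotide of the mature tRNA
    return "internal"       # downstream of the mature 5' end


def start_site_fractions(start_positions):
    """Fraction of read start sites falling in each category."""
    counts = Counter(classify_start(p) for p in start_positions)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: counts[c] / total for c in CATEGORIES}
```

A higher "mature" fraction in one purification than in another would correspond to the shift toward position 51 described for the cross-linked and mutant samples.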
2.3.4 Discussion
TGIRT-seq analysis supports IFIT5 binding to both 5'-p and 5'-ppp cellular RNAs
and also the poly-U tailing of IFIT5-bound RNA fragments, which appeared to be the
case for an IFIT5-bound tRNA fragment sequenced previously (Katibah et al., 2013).
Recent studies describe poly-U tailing as a commitment step for RNA degradation by the
human cytoplasmic exonuclease DIS3L2, which is deficient in human Perlman syndrome
(Astuti et al., 2012; Chang et al., 2013; Malecki et al., 2013). Because IFIT5-bound
tRNAs include 3′-extended or truncated poly-U tailed forms that would be a minority of
total cellular tRNA forms, we suggest that IFIT5 may not only sequester cellular tRNAs
but also trigger their subsequent degradation by DIS3L2. Analogous modes of action
have been found for RNaseL, which degrades cellular RNA to mediate its function in
innate immunity, and human schlafen 11, which binds tRNAs to alter translation as its
antiviral effector mechanism (Malathi et al., 2007; Li et al., 2012). We speculate that any
cytoplasmic single-stranded viral RNA 5′-p or 5′-ppp end would be bound by IFIT5,
potentially inhibiting viral mRNA capping and/or translation. In addition, by recruiting
RNA degradation enzymes to bound RNAs, IFIT5 could target virally encoded RNAs for
rapid turnover. Finally, our results suggest that IFIT5 could also play a general role,
beyond its function in innate immunity, in cytoplasmic surveillance for 5′-ppp RNA
polymerase III transcripts that escape the nucleus.
2.4 ANALYSIS OF TRNAS ASSOCIATED WITH THE YEAST RQC2P PROTEIN*
*Protein synthesis. Rqc2p and 60S ribosomal subunits mediate mRNA-independent elongation of nascent
chains. Science 347, 75–78 (2015). Authors include Peter S. Shen, Joseph Park, Yidan Qin, Xueming Li,
Krishna Parsawar, Matthew H. Larson, James Cox, Yifan Cheng, Alan M. Lambowitz, Jonathan S.
Weissman, Onn Brandman, Adam Frost. P.S.S., A.M.L., J.S.W., O.B. and A.F. designed research. P.S.S.,
J.P., Y.Q., X.L., M.L., O.B. and A.F. performed research. P.S.S., J.P., Y.Q., X.L., M.L., Y.C., A.M.L., J.S.W.,
O.B. and A.F. analyzed data. P.S.S., J.S.W., O.B. and A.F. wrote the paper.
Despite the processivity of protein synthesis, faulty messages or defective
ribosomes can result in translational stalling and incomplete nascent chains. In Eukarya,
this leads to recruitment of the RQC (Ribosome Quality Control) complex for
ubiquitylation and degradation of incompletely-synthesized nascent chains (Brandman et
al., 2012; Defenouillère et al., 2013; Shao et al., 2013; Verma et al., 2013). The molecular
components of the RQC complex include the AAA ATPase Cdc48p and its ubiquitin-
binding cofactors, the RING-domain E3 ligase Ltn1p, and two proteins of unknown
function, Rqc1p and Rqc2p. In collaboration with several research groups, including Dr.
Adam Frost at the University of Utah, Dr. Onn Brandman at Stanford University,
and Drs. Jonathan Weissman and Yifan Cheng at the University of California-San
Francisco, we set out to determine the mechanism(s) by which relatively rare proteins
like Ltn1p, Rqc1p, and Rqc2p recognize and rescue stalled 60S ribosome-nascent chain
complexes, which are vastly outnumbered by ribosomes translating normally or in stages
of assembly (Li et al., 2014).
2.4.1 The Rqc2p protein
Using cryo–electron microscopy (Cryo-EM) structures, our collaborators found
that the RQC components Ltn1p (YMR247C/Rkr1), a RING-domain E3 ubiquitin
ligase, and Rqc2p (YPL009C/Tae2) bind to the 60S subunit at sites exposed after 40S
dissociation, placing the Ltn1p RING domain near the exit channel of the ribosome and
Rqc2p over the P-site transfer RNA (tRNA) (Shen et al., 2015). Cryo-EM structures also
revealed Rqc2p binding to an ~A-site tRNA whose 3′-CCA tail is within the peptidyl
transferase center of the 60S. This observation was unexpected since A-site tRNA
interactions with the large ribosomal subunit are typically unstable and require mRNA
templates and elongation factors (Lill et al., 1986). Rqc2p’s interactions with the ~A-site
tRNA appeared to involve binding of the anticodon loop by a globular N-terminal
domain, as well as D-loop and T-loop interactions along Rqc2p’s coiled coil.
2.4.2 TGIRT-seq profiling of Rqc2p-bound tRNAs
To determine whether Rqc2p binds specific tRNA molecules, we extracted total
RNA after RQC purification from strains with intact RQC2 versus rqc2Δ strains. Deep
sequencing using the TGIRT-seq small RNA/CircLigase method revealed that the
presence of Rqc2p leads to an ~10-fold enrichment of tRNAAla(AGC) and tRNAThr(AGT) in
the RQC (Fig. 2.7A). In complexes isolated from strains with intact RQC2, Ala(AGC)
and Thr(AGT) are the most abundant tRNA molecules, even though they are less
abundant than a number of other tRNAs in yeast (Chu et al., 2011).
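
An enrichment of this kind can be expressed as the ratio of a tRNA's share of reads in the two purifications. The Python sketch below shows the calculation on made-up toy counts, which are not data from this study. The optional pseudocount is an assumption added to avoid division by zero for tRNAs absent from one sample, and the function name is hypothetical.

```python
def fold_enrichment(counts_with, counts_without, pseudocount=1.0):
    """Ratio of each tRNA's read fraction with Rqc2p vs. without Rqc2p.

    counts_with, counts_without -- dicts mapping tRNA name -> read count
    pseudocount                 -- added to every count (assumption; set to 0
                                   only if every tRNA has reads in both samples)
    """
    names = set(counts_with) | set(counts_without)
    n = len(names)
    total_with = sum(counts_with.values()) + pseudocount * n
    total_without = sum(counts_without.values()) + pseudocount * n
    result = {}
    for name in names:
        frac_with = (counts_with.get(name, 0) + pseudocount) / total_with
        frac_without = (counts_without.get(name, 0) + pseudocount) / total_without
        result[name] = frac_with / frac_without
    return result


# Toy counts for illustration only (not measured values).
toy_with = {"Ala(AGC)": 500, "Thr(AGT)": 300, "Gly(GCC)": 200}
toy_without = {"Ala(AGC)": 50, "Thr(AGT)": 30, "Gly(GCC)": 920}
toy_enrichment = fold_enrichment(toy_with, toy_without, pseudocount=0.0)
```

Normalizing to the total read count of each library keeps the ratio comparable when the two purifications are sequenced to different depths.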
Cryo-EM structures suggested that Rqc2p’s specificity for these tRNAs is due in
part to direct interactions between Rqc2p and nucleotides 32-36 of the anticodon loop,
some of which are edited or modified in the mature tRNA (Fig. 2.7B). Adenosine 34 in
the anticodon of both tRNAAla(AGC) and tRNAThr(AGT) is deaminated to inosine (Crick,
1966; Gerber and Keller, 1999; Agris et al., 2007), and this was detected by TGIRT-seq
as a diagnostic guanosine upon reverse transcription (Fig. 2.7B,C) (Delannoy et al., 2009;
Katibah et al., 2014). Further analysis of the sequencing data revealed that cytosine 32 in
tRNAThr(AGT) is also deaminated to uracil in ~70% of the Rqc2p-enriched reads (Fig.
2.7C) (Rubio et al., 2006). Together with the structure, this suggests that Rqc2p binds to
the D-, T- and anticodon loops of the ~A-site tRNA, and that recognition of the 32-
UUIGY-36 edited motif accounts for Rqc2p’s specificity for these two tRNAs (Fig. 2.7C).
The pyrimidine at position 36 could explain the discrimination between the otherwise
similar anticodon loops that harbor purines at base 36.
Through a series of biochemical and genetic assays, our collaborators demonstrated
that Rqc2p recruits alanine- and threonine-charged tRNAs to the A site and directs the
elongation of stalled nascent chains with non-templated Carboxy-terminal Ala and Thr
extensions or “CAT” tails, which may also function in the activation of Heat Shock
Factor 1 (Hsf1p). The identification of the Rqc2p-bound tRNAAla(AGC) and tRNAThr(AGT)
could not be done by conventional RNA-seq and required the use of TGIRT-seq.
2.4.3 Discussion
Integrating our observations, we propose the model schematized in Figure 2.8.
Ribosome stalling leads to dissociation of the 60S and 40S subunits, followed by
recognition of the peptidyl-tRNA-60S species by Rqc2p and Ltn1p. Ltn1p ubiquitinates
the stalled nascent chain, and this leads to Cdc48 recruitment for extraction and
degradation of the incomplete translation product. Rqc2p, through specific binding to
Ala(IGC) and Thr(IGU) tRNAs, directs the template-free and 40S-free elongation of the
incomplete translation product with CAT tails. CAT tails induce a heat shock response
through a mechanism that is yet to be determined.
Hypomorphic mutations in the mammalian homolog of LTN1 cause
neurodegeneration in mice (Chu et al., 2009). Similarly, mice with mutations in a CNS-
specific isoform of tRNAArg and GTPBP2, a homolog of yeast Hbs1 which works with
PELOTA/Dom34 to dissociate stalled 80S ribosomes, suffer from neurodegeneration
(Ishimura et al., 2014). These observations speak to the consequences ribosome stalls
impose on the cellular economy. Eubacteria rescue stalled ribosomes with the tmRNA-
SmpB system, which releases nascent chains fused with a unique C-terminal tag that
targets the nascent chain for proteolysis (Moore and Sauer, 2007). The mechanisms
utilized by eukaryotes, which lack tmRNA, to recognize and rescue stalled ribosomes and
their incomplete translation products have been unclear. The RQC complex, and
Rqc2p’s CAT tail tagging mechanism in particular, bears both similarities and contrasts
to the tmRNA trans-translation system. The evolutionary convergence upon distinct
mechanisms for extending incomplete nascent chains at the C-terminus argues for their
importance in maintaining proteostasis. One advantage of tagging stalled chains is that it
may distinguish them from normal translation products and promote their removal from
the protein pool. An alternate, not mutually exclusive, possibility is that the extension
serves to test the functional integrity of large ribosomal subunits so that the cell can
detect and dispose of defective large subunits that induce stalling.
2.5 MATERIALS AND METHODS
2.5.1 Deacylation of tRNA samples
The TGIRT enzyme initiates reverse transcription by an end-to-end template-
switching mechanism that is sensitive to whether or not the 3’ end of the tRNA is
aminoacylated. For deacylation of tRNA, RNA samples were incubated in 0.1 M Tris-
HCl (pH 9.0) for 45 min at 37°C (Dittmar et al., 2005), and purified by ethanol
precipitation in the presence of 0.3 M sodium acetate (pH 5.2) or with an RNA Clean &
Concentrator Kit (Zymo Research). Portions of the purified RNA samples before and
after deacylation were analyzed with the Small RNA Kit on a 2100 Bioanalyzer (Agilent)
to assess the quality and quantity of the RNAs.
2.5.2 Construction of RNA-seq libraries by TGIRT-seq small RNA/CircLigase
method
The construction of RNA-seq libraries via TGIRT template-switching was done
by using an initial template-primer substrate consisting of a 41-nt RNA oligonucleotide
(5'-AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC/3SpC3/-3'),
which contains both the Illumina Read 1 and Read 2 primer-binding sites and a 3'
blocking group (C3 Spacer, 3SpC3; IDT), annealed to a complementary 42-nt 32P-labeled
DNA primer that leaves an equimolar mixture of A, C, G, or T single-nucleotide 3'
overhangs. Reactions were done with RNA samples, initial template-primer substrate
(100 nM), TGIRT enzyme and 1 mM dNTPs (an equimolar mix of dATP, dCTP, dGTP,
and dTTP) in 20 μl of reaction medium containing 450 mM NaCl, 5 mM MgCl2, 20 mM
Tris-HCl, pH 7.5, 1 mM dithiothreitol (DTT). ~100 ng of gel-purified HEK293T tRNA
or ~1 μg HEK293T whole-cell RNA were used in the high-throughput tRNA sequencing
experiment; ~30 ng of IFIT5-bound RNA or 25 ng of synthetic miRNA control was used
in the IFIT5 study; and ~100-200 ng of RQC-bound RNA was used in the RQC study.
For TGIRT enzyme, a thermostable GsI-IIC reverse transcriptase (TGIRT-III; InGex)
was used at 500 nM in the high-throughput tRNA sequencing experiment; and a
thermostable TeI4c-MRF reverse transcriptase with a C-terminal truncation of the DNA
endonuclease domain was used at 1 μM in both the IFIT5 and the RQC studies.
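
As a consistency check on the substrate design described above, the 41-nt core of the complementary DNA primer can be derived from the RNA oligonucleotide sequence. The sketch below assumes that the primer core is the exact reverse complement of the RNA oligonucleotide and that the single overhang nucleotide is appended at the primer 3' end. The function and variable names are illustrative and do not come from the original protocol.

```python
# RNA oligonucleotide from the text (3' C3 spacer omitted), spaces removed.
ADAPTER_RNA = (
    "AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC".replace(" ", "")
)

# RNA base -> complementary DNA base.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_complement_dna(rna):
    """Return the DNA strand (5'->3') complementary to an RNA sequence."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna))


def primer_pool(rna):
    """Primer core plus each possible single-nucleotide 3' overhang."""
    core = reverse_complement_dna(rna)
    return [core + overhang for overhang in "ACGT"]


if __name__ == "__main__":
    print(len(ADAPTER_RNA))  # length of the RNA oligonucleotide
    for primer in primer_pool(ADAPTER_RNA):
        print(len(primer), primer)
```

Under these assumptions the pool contains four primers of 42 nt each, in agreement with the 41-nt RNA oligonucleotide and 42-nt DNA primer lengths given above.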
After pre-incubating a mixture of all components except dNTPs at room
temperature for 30 min, reactions were initiated by adding dNTPs, incubated at 60°C for
15 min (IFIT5) or 30 min (high-throughput tRNA sequencing and RQC), and terminated
by adding 5 M NaOH to a final concentration of 0.25 M, incubating at 95°C for 3 min,
and then neutralizing with 5 M HCl. The labeled cDNAs were analyzed by
electrophoresis in a denaturing 6% polyacrylamide gel, which was scanned with a
Typhoon FLA9500 phosphorImager (GE Healthcare). Gel regions containing the desired
cDNA products were isolated and electroeluted using a D-tube Dialyzer Maxi with
MWCO of 6-8 kDa (EMD Millipore) and ethanol precipitated in the presence of 0.3 M
sodium acetate and linear acrylamide carrier (25-50 μg; Thermo Scientific). The purified
cDNAs were then circularized with CircLigase II (Epicentre; manufacturer’s protocol
with an extended incubation time of 5 h at 60°C), extracted with phenol-chloroform-
isoamyl alcohol (25:24:1) and ethanol precipitated. The circularized cDNA products were
amplified by PCR with Phusion-HF (Thermo Scientific) using Illumina multiplex (5'-
AAT GAT ACG GCG ACC ACC GAG ATC TAC ACG TTC AGA GTT CTA CAG
TCC GAC GAT C -3') and barcode (5'- CAA GCA GAA GAC GGC ATA CGA GAT
BARCODE GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC T -3') primers
under conditions of initial denaturation at 98°C for 5 s, followed by 12 (high-throughput
tRNA sequencing) or 15 (IFIT5 and RQC) cycles of 98°C for 5 s, 60°C for 10 s and 72°C
for 10 s.
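
The alkaline termination step above amounts to a simple dilution calculation. As a worked example, the sketch below estimates the volume of 5 M NaOH needed to bring a 20 μl reaction to 0.25 M. It assumes ideal mixing and ignores the buffering capacity of the reaction medium; the function name is hypothetical.

```python
def stock_volume_for_final(stock_molar, final_molar, initial_volume_ul):
    """Volume of stock (in µl) to add so that the mixture reaches final_molar.

    Solves stock * v = final * (initial_volume + v) for v.
    """
    if not 0 < final_molar < stock_molar:
        raise ValueError("final concentration must lie between 0 and the stock")
    return final_molar * initial_volume_ul / (stock_molar - final_molar)


if __name__ == "__main__":
    # 5 M NaOH stock, 0.25 M final, 20 µl reaction (values from the text).
    naoh_ul = stock_volume_for_final(5.0, 0.25, 20.0)
    print(round(naoh_ul, 2))  # about 1.05 µl
```

Because the HCl stock is also 5 M, an equal volume supplies the same number of moles of acid, which is the sense in which the reaction is neutralized in this simplified picture.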
Table values are rounded percentages. RNA classes >1.5% of reads in any sample are
shown.
*Cross-linked and native extract purifications were done following induction of IFIT5
expression in cells without (−) or with (+) IFN-β.
†Size categories of cDNA a, b, and c (defined in text) were analyzed for WT and mutant
IFIT5 proteins expressed by transfection.
‡Transcript categories from Ensembl GRCh37.
Table 2.1: TGIRT-seq read mapping.
Table values are rounded percentages. RNA classes included were >1.5% of reads in at
least one sample analyzed in Table 2.2.
*Transcript categories from Ensembl GRCh37.
Table 2.2: Biological replicate sequencing of pooled RNA.
Figure 2.1: Demethylase-thermostable group II intron RT tRNA sequencing (DM-tRNA-
seq).
Schematic representation for (A) Demethylation and (B) TGIRT-seq small
RNA/CircLigase method. (C) RT reaction for both purified tRNA and total RNA as
template with (+) or without (−) demethylase treatment. The blue line shows the gel
region excised for library construction. (D) Time-course RT reaction for purified tRNA
as template without demethylase treatment.
Figure 2.2: cDNA synthesis of IFIT-bound RNAs by TGIRT-seq small RNA/CircLigase
method.
(A) PCR amplification and sequencing were done using pooled cDNA excised as
two gel slices of ∼55–82 and ∼84–200 nt, excluding only the 83-nt cDNA product from
template switching to Read1,2 RNA. In the legend above the gel, N indicates native
extract purification and XL indicates purification after in vivo cross-linking. (B)
Amplification and sequencing were performed separately for cDNA from size pools a =
∼55–82, b = ∼84–150, and c = ∼150–230 nt. TGIRT-seq from the biological replicate in
which cDNAs were processed as in A is summarized in Table 2.2.
Figure 2.3: Broad representation of IFIT5-bound tRNAs.
Profiles of tRNA read abundance were plotted using RNA libraries from in vivo
cross-linking or native extract of cells without (−) or with (+) prior IFN-β treatment. All
tRNA loci with mapped reads were rank-ordered by normalized abundance. The number
of different tRNA species identified by sequence reads is indicated in parentheses in the
plots on the right and represents most of the 625 reference human genome tRNA and
tRNA pseudogene loci searched.
Figure 2.4: Individual gene coverage by reads from the WT IFIT5 cross-linked RNA
sample.
Read coverage of loci was mapped for tRNA, snaR, and 5S rRNA genes across a
window from 50 bp upstream to 50 bp downstream of the mature RNA ends, which are
indicated with dashed lines. The top left plot shows coverage for an iMet tRNA gene,
followed by six additional tRNA genes, a tRNA pseudogene, a snaR locus (National
Center for Biotechnology Information NR_024229.1), and 5S rRNA (Ensembl
ENSG00000199352.1). Each tRNA gene is identified by chromosome number,
chromosome position, charged amino acid, and anticodon sequence (5′–3′). The apparent
excess of 3′ exon fragments for LeuCAA likely reflects misalignment of truncated 5′
exon sequences by Bowtie 2 local alignment after the gap resulting from intron removal.
Figure 2.5: Read sequence alignments for the WT IFIT5 cross-linked RNA sample.
The figure shows screen shots of IGV sequence alignments for some RNAs bound
by the WT IFIT5. The blue bar at the top delineates the mature tRNA sequence encoded
in the genome, with the arrow indicating 5′ to 3′ direction of the tRNA, which differs
across alignments depending on the DNA strand to which the reads are mapped by the
Bowtie 2 aligner. The total number of reads mapped to the locus is indicated near the top
of each panel. To fit the entire alignment on one page, loci with more than 1,500 mapped
reads were down-sampled to 1,500 reads in IGV, and only parts of the IGV screen shot
were shown in (B) and (C). Reads were sorted by their start site on the chromosome,
which can be from either the 5′ or 3′ RNA end depending on the orientation of the gene
on the chromosome. In the coverage plot profiles, nucleotides matching the genome
sequence are represented in gray color, and mismatches are represented in different
colors (A, green; C, blue; G, brown; T, red). Soft-clipped sites, which demarcate the
beginning of extra 5′ and 3′ nucleotides that do not match the genomic sequence, are
indicated by a short black bar, and read continuity between a genome sequence gap, such
as a spliced intron, is indicated by a black line. Pol III ter, predicted RNA polymerase III
termination site. For the spliced tRNAs, reads were mapped with or without intron
removal from the gene sequence to highlight inaccurate splice junctions and modified
nucleotides near the junction that affect the sequence alignment. The spliced ArgTCT
tRNA reads contain potential examples of missplicing with a shifted splice junction
and/or one extra nucleotide inserted at the junction (highlighted in the inset sequence
alignment). Examples of untrimmed adapter sequence, nontemplated nucleotide addition
by the TGIRT at cDNA 3′ ends (corresponding to tRNA 5′ ends), and rare second
template switches are indicated in the alignments. Mismatches at positions corresponding
to modified nucleotides known to be present in the tRNA are indicated by arrows
indicating the tRNA position and modified nucleotide. The spectrum of misincorporated
nucleotides at modification sites is shown in the coverage plot, with a misincorporation
threshold of 10%. In at least some cases (e.g., m1A and m2,2G), the spectrum
of mismatches appears to be characteristic of the modified base and may be useful for
identifying unknown base modifications in other coding and noncoding RNAs. A position
of potential posttranscriptional modification of a conserved guanosine residue in the snaR
is indicated in the alignment. Cm, 2′-O-methylcytidine; D, dihydrouridine; I, inosine; i6A,
N6-isopentenyl adenosine; t6A, N6-threonylcarbamoyladenosine; m1A, 1-
methyladenosine; m1G, 1-methylguanosine; m1I, 1-methylinosine; m2,2G, N2,N2-
dimethylguanosine; m3C, 3-methylcytidine.
Figure 2.6: Composite read start sites for IFIT5-bound tRNAs.
Cross-comparison of tRNA read start sites for WT IFIT5 variously purified from
extracts of a stable cell line (A) or WT and mutant IFIT5 proteins purified after
expression by transient transfection (B). Native extract was from cells without (−) or with
(+) IFN-β treatment. X axis positions are as in Fig. 2.4, and the y axis represents the
percentage of reads starting at each position. Precursor tRNA ends are at positions 1–50
and the mature tRNA 5′ end is at position 51. Read start sites at positions within the
tRNA correlate with positions of reverse transcription stops at or near modification sites
common among eukaryotic tRNAs (Fig. 2.5): position 59, G9/1-methylguanosine (m1G);
position 70, U20/dihydrouridine (D); position 77, G26/N2,N2-dimethylguanosine (m2,2G)
or U27/pseudouridine (Ψ) depending on the length of the tRNA D-loop; position 87,
A37/N6-isopentenyladenosine (i6A), N6-threonylcarbamoyladenosine (t6A) or 1-
methylinosine (m1I), and G37/m1G or wybutosine (yW); position 108, A58/1-
methyladenosine (m1A).
Figure 2.7: Rqc2p-dependent enrichment of tRNAAla(IGC) and tRNAThr(IGU).
(A) tRNA cDNA reads extracted from purified RQC particles and summed per
unique anticodon, with versus without Rqc2p. (B) Secondary structures of tRNAAla(IGC)
and tRNAThr(IGU). Identical nucleotides are underlined. Edited nucleotides are indicated
with asterisks. (C) Weblogo representation of cDNA sequencing reads related to shared
sequences found in anticodon loops (positions 32 to 38) of mature tRNAAla(IGC) and
tRNAThr(IGU).
Chapter 3: RNA-seq of circulating RNAs in human plasma
3.1 INTRODUCTION
Next-generation RNA sequencing (RNA-seq) is a supremely powerful method for
transcriptome profiling and gene expression analysis, with applications that include the
identification of novel biomarkers and new diagnostic methods for diseases (Wang et al.,
2009; Wilhelm and Landry, 2009; Ozsolak and Milos, 2011; Chen et al., 2012). A recent
exciting application of RNA-seq is the analysis of extracellular RNAs present in plasma
and other bodily fluids (Mitchell et al., 2008; Burgos et al., 2013; Huang et al., 2013;
Williams et al., 2013; Koh et al., 2014). Such extracellular RNAs are potential
biomarkers for human disease and may be involved in intercellular communication
(Valadi et al., 2007; Zernecke et al., 2009; Fabbri et al., 2012; Grasedieck et al., 2013). In
plasma, extracellular RNAs, also known as circulating RNAs, are present in vesicles,
such as exosomes, microvesicles, and apoptotic bodies, and/or in ribonucleoprotein
(RNP) complexes, e.g., miRNAs with Argonaute2 (Ago2) or high-density lipoproteins
(HDLs) (Zernecke et al., 2009; Arroyo et al., 2011; Vickers et al., 2011; Huang et al.,
2013). Circulating RNAs found in human plasma include fragments of mRNAs and long
non-coding RNAs (lncRNAs), possibly resulting from intracellular RNA turnover and
secretion in exosomes, as well as miRNAs and other small non-coding RNAs (small
ncRNAs) (Huang et al., 2013; Williams et al., 2013; Koh et al., 2014). Dysregulation of
non-coding RNAs and malfunctions in their processing machinery are frequently
hallmarks of human diseases, including cancer and Alzheimer’s disease (Croce, 2009;
Esteller, 2011; Batista and Chang, 2013). Further, the expression profiles of miRNAs and
lncRNAs are often tissue- and cell-state specific, which may facilitate disease diagnoses
(Lu et al., 2005; Rosenfeld et al., 2008; Cabili et al., 2011; Brunner et al., 2012). Multiple
reports correlate the presence of specific mRNAs or miRNAs in plasma or serum with
different types of cancer and other diseases, suggesting that the analysis of circulating
RNAs may provide a non-invasive, cost-effective solution for detecting and monitoring
cancer progression (Kopreski et al., 2001; Silva et al., 2007; Keller et al., 2011; Moussay
et al., 2011; Koh et al., 2014). Thus far, however, knowledge of different RNA types that
circulate in human plasma and their relative abundance remains limited. Here, I
optimized methods for plasma RNA isolation to maximize small RNA representation,
and developed a new method for RNA-seq library construction via the use of
thermostable group II intron reverse transcriptases (TGIRTs), which allow the analysis of
all human plasma RNAs in a single RNA-seq experiment.
3.2 TGIRT-SEQ, THE TOTAL RNA METHOD
3.2.1 Overview of the TGIRT-seq total RNA method
In the initial TGIRT-seq small RNA/CircLigase method (see Chapter 2), the
cDNAs with an RNA-seq adapter linked by TGIRT template switching during reverse
transcription were size-selected on a denaturing polyacrylamide gel and circularized with
CircLigase II ssDNA Ligase (Epicentre) prior to PCR amplification (Mohr et al., 2013;
Katibah et al., 2014; Shen et al., 2015; Zheng et al., 2015). Although this procedure
remains useful for RNA-seq of specific RNA size classes or homogeneously sized RNA
fragments in procedures like HITS-CLIP or ribosome profiling, disadvantages include: (i)
size limitations introduced by CircLigase, whose efficiency decreases for longer cDNAs
(Epicentre product literature); (ii) a gel-purification step, which is time consuming and
results in loss of material; and (iii) the use of hazardous chemicals, such as phenol and
chloroform.
To achieve simplicity, speed, and high efficiency, we developed a new method for
using TGIRT template-switching in RNA-seq library construction from RNA pools
without size selection, referred to as the total RNA method (Qin et al., 2016). By
eliminating gel-purification and phenol-extraction steps, the method enables the
construction of RNA-seq libraries from small amounts of RNA in <5 h. The method is
readily adaptable for a variety of other applications, including sequencing of whole-cell
and exosomal RNAs, profiling of miRNAs and other non-coding RNAs, and for
streamlining the identification of protein- or ribosome-bound RNA fragments in
procedures like HITS-CLIP, RIP-Seq, and ribosome profiling.
Figure 3.1A outlines the new TGIRT-seq total RNA method. First, the TGIRT
binds to an initial template-primer substrate comprised of an RNA oligonucleotide
containing an RNA-seq adapter sequence annealed to a complementary DNA primer. For
Illumina sequencing, the RNA oligonucleotide contains an Illumina Read 2 primer-
binding site (R2 RNA), and the DNA primer contains the complementary sequence (R2R
DNA) (Fig. 3.1A,B). After forming a complex with the initial template-primer substrate,
the TGIRT initiates reverse transcription by switching directly from the 5’ end of the
RNA-seq adapter to the 3’ end of a target RNA, yielding a continuous cDNA linking the
two sequences. The RNA-seq adapter has a 3’-blocking group that impedes secondary
template-switching to the 3’ end of that RNA.
To increase the efficiency of template-switching, the DNA primer annealed to the
RNA-seq adapter in the initial template-primer substrate has a single-nucleotide 3’
overhang. This 3’-overhang nucleotide base-pairs to the 3’-terminal nucleotide of the
target RNA, resulting in a seamless template-switching junction between the RNA-seq
adapter and the target RNA (Mohr et al., 2013). In the present work, an initial template-
primer substrate with an equimolar mixture of A, C, G, or T 3’ overhangs (denoted N
(Mohr et al., 2013)) was used to construct RNA-seq libraries from RNA pools with
minimal bias. The ability of a single base pair between the 3’-overhang nucleotide and
the 3’ end of the target RNA to direct TGIRT template-switching at 60°C, the operational
temperature of TGIRT enzymes, indicates a very potent strand annealing activity of
group II intron RTs. Alternatively, to enrich for certain target RNAs, the A, C, G or T
overhangs can be mixed at a customized ratio or replaced by a string of nucleotides
complementary to the 3’ end sequences of the target RNA (Zheng et al., 2015).
Because an RNA-seq adapter is added directly during cDNA synthesis, TGIRT-
seq is inherently strand-specific. This strand specificity was confirmed by the low
frequency of antisense reads from a 74-nt RNA synthetic oligonucleotide template (0.72
and 1.9 × 10^-5 for the TeI4c and GsI-IIC thermostable group II intron RTs, respectively;
Materials and Methods).
For RNA-seq profiling, reverse transcription by TGIRT enzymes is done at 60°C
in a reaction medium containing high salt (450 mM NaCl), which limits multiple
template-switches. In the primary plasma RNA-seq datasets (DSs) presented here (DS1-10),
the percentage of fusion reads, which include multiple template-switches, was ≤0.14%,
comparable to conventional RNA-seq methods using retroviral RTs (Lu and Matera,
2014). Multiple template-switches that do occur are sporadic and can be distinguished
from novel biologically relevant junctions resulting from DNA translocations or
unannotated splice junctions by a combination of technical replicates, Integrative
Genomics Viewer (IGV) alignments, and qRT-PCR validation. Because TGIRT enzymes
have very high processivity, TGIRT template-switching is virtually always end-to-end
and does not occur appreciably from internal sites (Mohr et al., 2013). By contrast,
retroviral RTs frequently template-switch by dissociating from an internal site and
reinitiating at a different site, resulting in artifactual internal deletions (Mader et al.,
2001; Cocquet et al., 2006).
In the previous small RNA/CircLigase method, cDNAs were linked by TGIRT
template-switching to an RNA-seq adapter containing the complements to both the
Illumina Read 2 (R2R) and Read 1 (R1R) primer-binding sites, gel purified and then
circularized with CircLigase prior to PCR amplification (Katibah et al., 2014; Shen et al.,
2015; Zheng et al., 2015). By contrast, in the new TGIRT-seq method developed here,
the cDNAs linked to an R2R adapter sequence are processed into RNA-seq libraries
without size selection by ligating a 5’-adenylated (5’ App) DNA oligonucleotide
containing the R1R adapter to the cDNA 3’ end with Thermostable 5’ AppDNA/RNA
Ligase (New England Biolabs). The 5’ App DNA oligonucleotide has a 3’-blocking
group that impedes self-ligation. The ligated cDNAs were then amplified by 12 cycles of
PCR with primers that introduce Illumina P5 and P7 flow cell capture sites and barcodes
(Fig. 3.1B). The elimination of the gel-purification step improves sample recovery and
decreases processing time, enabling us to construct RNA-seq libraries from small
amounts of starting material in less than 5 h.
Because TGIRTs give full-length reads of tRNAs and other small ncRNAs, we
developed a pipeline for read mapping, which uses TopHat v2.0.10 end-to-end alignment
followed by Bowtie2 local alignment (Fig. 3.1C) to include RNAs with post-
transcriptionally added nucleotides, such as the 3’ CCA of tRNAs or poly(U) tails
(Malecki et al., 2013; Katibah et al., 2014). Like other RTs and DNA polymerases,
TGIRTs can add a small number of extra non-templated nucleotides to the 3’ ends of
cDNAs (referred to as terminal transferase activity) (Clark, 1988; Golinelli and Hughes,
2002). Such extra nucleotides remain after local alignment, but are readily evaluated by
IGV plots.
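
The benefit of adding a local alignment pass can be shown with a toy example. The sketch below is not the TopHat/Bowtie2 pipeline itself and uses exact string matching only. It illustrates that a read carrying a post-transcriptionally added 3' CCA fails a strict end-to-end match against the genome but is recovered when its unmatched end may be soft-clipped. The sequences and function names are hypothetical.

```python
from difflib import SequenceMatcher


def end_to_end_hit(read, reference):
    """True if the whole read occurs in the reference (no clipping allowed)."""
    return read in reference


def local_hit(read, reference, min_len=20):
    """Find the longest exact block shared by the read and the reference.

    Returns (clipped_5p, matched_len, clipped_3p), where the clipped values
    count read nucleotides left unaligned at each end, or None if the shared
    block is shorter than min_len.
    """
    matcher = SequenceMatcher(None, read, reference, autojunk=False)
    m = matcher.find_longest_match(0, len(read), 0, len(reference))
    if m.size < min_len:
        return None
    return (m.a, m.size, len(read) - m.a - m.size)


if __name__ == "__main__":
    gene = "GCATTGGTGGTTCAGTGGTAGAATTCTCGCC"  # made-up tRNA-like gene body
    reference = "AAAA" + gene + "TTTT"        # gene with genomic flanks
    read = gene + "CCA"                       # mature RNA with added 3' CCA

    print(end_to_end_hit(read, reference))    # strict match fails
    print(local_hit(read, reference))         # 3 nt soft-clipped at the 3' end
```

In real data, the soft-clipped nucleotides reported by the local aligner are the ones inspected in IGV to distinguish post-transcriptional additions from non-templated nucleotides added by the RT.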
3.2.2 Validation of the TGIRT-seq total RNA method
In parallel work carried out primarily by Ryan Nottingham and Douglas Wu, the
TGIRT-seq total RNA method was validated by using two well-characterized,
commercially available human RNA reference samples including the Universal Human
Reference RNA (UHR) and the Human Brain Reference RNA (HBR) (Nottingham et al.,
2016). This work showed that TGIRT-seq recapitulates the relative abundance of human
transcripts and RNA spike-ins in ribo-depleted, fragmented RNA samples comparably to
non-strand-specific TruSeq v2 and better than strand-specific TruSeq v3. Moreover,
TGIRT-seq is more strand-specific than TruSeq v3 and eliminates sampling biases from
random hexamer priming, which are inherent to TruSeq. The TGIRT-seq datasets also
show more uniform 5’ to 3’ gene coverage and identify more splice junctions,
particularly near the 5’ ends of mRNAs, than do the TruSeq datasets. Finally, TGIRT-seq
enables the simultaneous profiling of mRNAs and lncRNAs in the same RNA-seq
experiment as structured small ncRNAs, including tRNAs, which are essentially absent
with TruSeq.
3.3 HUMAN PLASMA RNA
3.3.1 Preparations and treatments of human plasma RNAs
To obtain suitable starting material for RNA-seq, we tested several different
plasma RNA preparations and DNase treatment methods with the aims of increasing the
representation of miRNAs, which comprise only a small proportion of plasma RNA, and
reducing contamination from plasma DNA. Each RNA-seq dataset presented below was
constructed from RNAs extracted from 1 ml of plasma obtained from a healthy male
individual at intervals at least one week apart. For the primary datasets, plasma RNAs
were extracted by using Trizol LS Reagent (Thermo Fisher Scientific) followed by a
Direct-zol RNA MiniPrep Kit (Zymo Research), as described in Materials and Methods.
This method, which we refer to as the Direct-zol method, typically recovered 2-8 ng of
nucleic acids per ml plasma, comparable to yields in previous studies (Burgos et al.,
2013; Williams et al., 2013; Spornraft et al., 2014). The plasma RNA samples were
analyzed by RNA-seq with no further treatment (NT), after enzymatic treatment to
remove 3’ phosphates (-3’P), which block TGIRT template-switching (Mohr et al.,
2013), or after on-column DNase I digestion (OCD) under conditions that completely
digest 10 ng of a mixture of 74-nt ssDNA and 275-nt dsDNA PCR product (Figs. 3.2 and
3.3).
Bioanalyzer traces of the NT sample showed two broad peaks: Peak 1 at ~40-60
nt and Peak 2 at ~160-170 nt (Fig. 3.2A). After the on-column DNase I treatment, Peak 2
disappeared, leaving only Peak 1 (Fig. 3.2B), which was sensitive to RNase I, an enzyme
that degrades ssRNA, or alkaline hydrolysis, which degrades RNA but not DNA (Fig.
3.2C,D). The DNase sensitivity of Peak 2 is consistent with previous findings that
plasma DNA fragments cluster at ~160-170 bp corresponding in size to the length of
dsDNA protected by nucleosomes (Fan et al., 2008). We found that total plasma RNA
prepared by mirVana miRNA Isolation Kit (Thermo Fisher Scientific) using a method
that combines large and small RNA fractions to increase small RNA recovery (Materials
and Methods) also contains a DNase-sensitive peak of size similar to Peak 2 (Fig. 3.3B-D).
Thus, plasma RNA prepared by either method cannot be assumed to be free of DNA.
Because TGIRT enzymes can template-switch to either RNA or DNA fragments
containing a 3’ OH (Mohr et al., 2013), RNA-seq datasets constructed from the NT and -
3’P samples potentially contain both plasma RNA and DNA sequences, whereas those
constructed from the DNase-treated samples correspond almost entirely to RNA
sequences, as judged by their sensitivity to RNase I and alkali (Fig. 3.2C,D).
3.3.2 TGIRT-seq of human plasma RNA samples
Table 3.1 summarizes mapping statistics for RNA-seq datasets constructed from
Direct-zol NT, -3’ P, and OCD plasma RNAs by using the thermostable TeI4c group II
intron RT. Samples were sequenced on an Illumina HiSeq 2500 (Dataset 1 (DS1); 69.4
million 100-nt paired-end reads) or NextSeq 500 (DS2-10; 14.6 to 37.8 million 75-nt or
150-nt paired-end reads). For each type of RNA preparation, we obtained at least three
RNA-seq datasets, each using a different plasma sample taken from the same individual.
After trimming and filtering to remove adapter sequences and low quality base calls,
transcript lengths determined by the coverage of the paired-end read span were consistent
with plasma RNA size profiles in bioanalyzer traces (Fig. 3.4A,B). The processed reads
were mapped to a human genome reference sequence (Ensembl GRCh38 Release 76)
supplemented with additional rRNA gene contigs (Materials and Methods). For the
plasma RNA-seq datasets constructed with the TeI4c thermostable group II intron RT,
85.7-95.3% of the paired-end reads mapped to the human genome, and 27.3-30.7% were
concordant read pairs that mapped uniquely and with high mapping quality (MAPQ ≥15)
to genomic features in the annotated orientation (Table 3.1). For confidence, only
features with ≥10 hits were counted in the analysis.
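
The read-pair filtering criteria described above can be summarized in a few lines of code. The following is a minimal sketch under stated assumptions rather than the actual pipeline: the record fields (`feature`, `mapq`, `concordant`, `unique`, `sense`) are hypothetical names, and only the two thresholds (MAPQ ≥15 and ≥10 hits per feature) are taken from the text.

```python
from collections import Counter

MIN_MAPQ = 15          # mapping-quality cutoff stated in the text
MIN_FEATURE_HITS = 10  # minimum hits for a feature to be counted


def count_features(read_pairs):
    """Count hits per feature from read pairs that pass all filters.

    Each read pair is a dict with the hypothetical keys 'feature', 'mapq',
    'concordant', 'unique', and 'sense' (mapped in the annotated orientation).
    """
    counts = Counter(
        pair["feature"]
        for pair in read_pairs
        if pair["concordant"]
        and pair["unique"]
        and pair["sense"]
        and pair["mapq"] >= MIN_MAPQ
    )
    # Keep only features supported by enough read pairs.
    return {f: n for f, n in counts.items() if n >= MIN_FEATURE_HITS}


if __name__ == "__main__":
    def make(feature, mapq, n):
        return [dict(feature=feature, mapq=mapq, concordant=True,
                     unique=True, sense=True) for _ in range(n)]

    pairs = make("geneA", 40, 12) + make("geneB", 40, 9) + make("geneC", 5, 15)
    print(count_features(pairs))
```

In this made-up input, only the feature with at least 10 high-quality read pairs is retained; features with too few hits, or with hits below the MAPQ cutoff, drop out.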
3.3.3 Classes of RNAs detected in human plasma
Figure 3.5 shows the percentage of reads mapping to different genomic features in
the RNA-seq datasets constructed by using TeI4c RT for total plasma RNA treated in
various ways, using only uniquely mapped concordant read pairs for the calculation. The
number of individual genes to which the reads mapped is shown next to each feature in
the stacked bar graphs. The datasets for NT, -3’ P, and OCD-treated plasma RNAs show
similar overall profiles of RNA classes with the majority of the reads corresponding to
fragmented protein-coding gene and lncRNAs (Fig. 3.5A), and a smaller proportion (1.8-
5.8%) mapping to a variety of small ncRNAs (Fig. 3.5B).
While having little effect on the proportion of reads mapping to protein-coding
gene and lncRNAs, the removal of 3’ phosphates (-3’P), which block TGIRT template-
switching, reproducibly increased the proportion of reads mapping to 18S and 28S
rRNAs (from 0.9 ± 0.2 to 6.3 ± 4.0% of reads mapped to features, p-value = 0.15) and 5’-
tRNA halves (from 0.4% to 7.1% of reads mapped to tRNAs, see below). These findings
suggest that the protein-coding and lncRNA fragments present in plasma were either
generated by RNases that leave a 3’ OH or had their 3’ phosphates removed by a
phosphatase. Previous findings indicate that most intracellular RNases involved in
cellular RNA turnover leave 3’-OH groups (Houseley and Tollervey, 2009; Schoenberg
and Maquat, 2012). By contrast, the rRNA and 5’-tRNA halves present in plasma, whose
representation increased after 3’ phosphate removal, were generated by RNases that leave
a 2’3’-cyclic phosphate or 3’ phosphate (e.g., RNase A in blood or angiogenin in the case
of tRNA halves) (Houseley and Tollervey, 2009; Yamasaki et al., 2009).
Despite the differences in plasma collection dates, DNA sequencers, and read
lengths, the biological replicates for RNA-seq datasets constructed with the TeI4c RT
from each type of plasma RNA preparation (NT, -3’P and OCD) were highly
reproducible, with pairwise Spearman’s correlation coefficients (ρ) ranging from 0.85 to
0.92 (Fig. 3.6A-C).
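
For reference, Spearman’s correlation coefficient is the Pearson correlation of the ranks of two sets of values, with tied values receiving the mean of their ranks. The sketch below shows one way this could be computed for per-gene read counts from two replicates. It is a plain-Python illustration with hypothetical input values, not the software used to generate Fig. 3.6.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical read counts for five genes in two replicate datasets.
    replicate_1 = [120, 15, 980, 43, 7]
    replicate_2 = [150, 11, 1010, 60, 12]
    print(round(spearman_rho(replicate_1, replicate_2), 3))
```

Because the coefficient depends only on rank order, it is insensitive to differences in sequencing depth between datasets, which makes it suitable for comparing replicates sequenced on different instruments.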
We obtained additional RNA-seq datasets of NT plasma RNA with the GsI-IIC
thermostable group II intron RT, which is sold commercially as TGIRT-III enzyme
(Materials and Methods). The GsI-IIC RT datasets were very similar to TeI4c RT
datasets in terms of mapping statistics, reproducibility, and features detected (Table 3.2,
Fig. 3.6D and Fig. 3.7). The correlation coefficient between combined NT plasma RNA
datasets obtained with the two TGIRT enzymes (DS1-3 versus DS12-14) was 0.92, with
most of the differences due to low abundance RNA species (Fig. 3.6E). Analysis of 3’-
terminal nucleotides of RNAs in RNA-seq datasets constructed from DNase-treated
plasma RNA preparations showed a relatively even distribution of the four possible 3’-
terminal nucleotides by both enzymes, with only small differences of unknown
significance in the frequencies of some di- or tri-nucleotide sequences (Table 3.3).
3.3.4 Protein-coding gene and long non-coding RNAs in human plasma
The TGIRT-seq profiles suggest that human plasma RNA consists largely of
RNA fragments derived from a diverse population of protein-coding gene and lncRNAs.
From the bioanalyzer traces of the on-column DNase I-treated (OCD) samples, we infer
that the protein-coding and lncRNA fragments, which comprise a high proportion of
plasma RNA, are heterogeneous in size with a broad peak at ~40-60 nt (Peak 1; Fig.
3.2B), and this was supported by separately calculating the length distribution of protein-
coding gene reads (excluding embedded small ncRNAs) in the DNase-treated samples
(Fig. 3.4C).
Further analysis of the protein-coding gene reads in NT and OCD-treated plasma
RNA datasets indicated that they are enriched in intron and antisense sequences
compared to human whole-cell RNAs analyzed by the same TGIRT-seq method using
TeI4c RT (Jurkat cells) or GsI-IIC RT (K562 cells) (Fig. 3.8, and Table 3.4). RNA-seq
datasets constructed from plasma RNA prepared by either the Direct-zol or mirVana
combined methods and treated with Baseline-ZERO DNase (Epicentre), which according
to the manufacturer digests DNA to mononucleotides, showed similar enrichments of
intron and antisense sequences (datasets BZD and M-BZD in Fig. 3.8), as did limiting the
analyzed protein-coding reads in the DNase-treated datasets to ≥30 nt to exclude residual
small DNA fragments (denoted read span ≥30 nt in Fig. 3.8). Plots of the proportion of
reads mapping to the sense and antisense strands versus gene length in the datasets for
DNase-treated plasma RNAs showed wide variations for different genes with
convergence toward 50% sense/antisense reads for longer genes in the larger datasets
(Fig. 3.9).
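
The quantity plotted in such a comparison is simple to compute. The sketch below is illustrative only; the gene names, lengths, and read counts are hypothetical and are not taken from the datasets described here.

```python
def percent_sense(sense_reads, antisense_reads):
    """Percent of a gene's reads on the annotated (sense) strand, or None if no reads."""
    total = sense_reads + antisense_reads
    if total == 0:
        return None
    return 100.0 * sense_reads / total


# Hypothetical records: (gene name, gene length in bp, sense reads, antisense reads).
genes = [
    ("GENE_A", 2_500, 9, 1),
    ("GENE_B", 40_000, 30, 45),
    ("GENE_C", 900_000, 510, 490),
]

# Sort by gene length so the values can be read as the x-ordered series of a plot.
for name, length, sense, antisense in sorted(genes, key=lambda g: g[1]):
    print(f"{name}\t{length}\t{percent_sense(sense, antisense):.1f}% sense")
```

With few reads per gene the percentage is dominated by sampling noise, which is one plausible reason why the values vary widely for short genes and small datasets.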
Previous studies have shown that a high proportion of the human genome is
transcribed from both strands, with many annotated antisense RNAs overlapping protein-
coding sequences on the opposite strand and concordantly regulated with the sense RNAs
(Katayama et al., 2005; Werner, 2013; Brown et al., 2014; Khorkova et al., 2014; Portal
et al., 2015). Our findings raise the possibility that plasma RNA is enriched in extraneous
intron and antisense RNAs, which may be preferentially targeted for degradation and
cellular secretion, eventually finding their way into plasma.
3.3.5 Small non-coding RNAs in human plasma
miRNAs. The TGIRT-seq profiles for different types of plasma RNA preparations
indicate that miRNAs are not abundant in human plasma. Fig. 3.10A shows profiles of
miRNAs detected in total plasma RNAs prepared by the Direct-zol method with on-
column DNase I treatment (OCD) and by the mirVana combined method with Baseline-
ZERO DNase treatment (M-BZD; Materials and Methods). The miRNAs detected by
TeI4c RT in both types of RNA preparations showed skewed distributions (Fig. 3.10A).
miRNA species with the highest read counts in both datasets include miR-451a, miR-142,
miR-16-2, miR-122 (a liver-specific miRNA), miR-223, miR-19a, let-7a, miR-16-1, let-7b,
miR-6087, miR-126, miR-17, and miR-21 (Fig. 3.10A). The abundant plasma
miRNAs identified here include those previously reported to be present in plasma in
complex with Ago2 proteins (e.g., miR-451a, miR-16, miR-122, miR-223, miR-19a, let-7b,
and miR-21), largely in exosomes (e.g., miR-142 and let-7a) or in both Ago2
complexes and exosomes (miR-126) (Arroyo et al., 2011).
Tissue expression profiles of the mature miRNAs in the RNA-seq datasets for
both types of DNase-treated plasma RNA (Fig. 3.11 and Fig. 3.16) indicate that plasma is
enriched in miRNAs that are abundant in endocrine glands and highly vascularized
organs, along with a subset of miRNAs that are abundant (top 10 percentile) in red blood
cells or platelets (miRNA names indicated in red in Fig. 3.11 and Fig. 3.12) (Landgraf et
al., 2007; Wang et al., 2012). Some miRNAs abundant in brain were also detected with
relatively high read counts in the plasma, in agreement with a previous study which detected
brain-specific transcripts in plasma with increased abundance of certain neuronal
transcripts correlated with Alzheimer’s disease (Koh et al., 2014).
IGV plots, in which reads are aligned to the genomic sequence, showed that most
of the abundant miRNAs are present in plasma as full-length, mature species, including
some with post-transcriptionally added 3’ A residues (e.g., miR-122) (Fig. 3.10B)
(Norbury, 2013). For miR-126, both the mature miRNA (miR-126-3p) and passenger
strand (miR-126-5p) are present in human plasma, consistent with previous findings
(Arroyo et al., 2011). In addition to annotated miRNAs, the M-BZD dataset identified
mature-sized miRNAs from several predicted miRNA loci (e.g., AC034205.1,
AC023050.1, and AL589669.1) (Fig. 3.10C). The IGV plots also show that a few
miRNA species are present in plasma as full-length pre-miRNAs with both 5’ and 3’ ends
corresponding exactly to the annotated mature miRNA arms (Fig. 3.13A). Some of these
pre-miRNAs are present together with the mature miRNAs (e.g., let-7f, miR-27a, miR-
146a, and miR-30c), whereas others are present almost entirely as the pre-miRNA (e.g.,
miR-1229 and miR-139) (Fig. 3.10B,C). Such distinctions would be missed in miRNA
quantitation by qRT-PCR or microarray assays. Although GsI-IIC RT used at a limiting
concentration (500 nM) appears to under-represent miRNAs in total plasma RNA
preparations, RNA-seq datasets constructed with GsI-IIC RT for mirVana small RNA
preparations (Materials and Methods) were similar to those for TeI4c RT, with mostly
minor differences in profiles for abundant miRNA species detected by the two TGIRT
enzymes (Fig. 3.14).
Finally, although the abundant miRNA species in DNase-treated plasma RNA
datasets (OCD and M-BZD) correspond well to those detected in the non-treated (NT)
plasma RNA datasets, we note the curious case of miR-182 for which we detected
abundant reads corresponding to the exact antisense of the annotated mature miRNA in
the NT datasets (Fig. 3.13B), the only mature miRNA for which antisense sequences
were detected. This antisense miR-182 sequence was found reproducibly in multiple
datasets of non-treated plasma RNAs generated by both TGIRT enzymes (98% of miR-182
reads in total plasma RNA datasets constructed with TeI4c, and 4% and 14% of
reads in total and small RNA datasets constructed with GsI-IIC RT, respectively), but
disappeared after DNase treatment, leaving only the annotated sense orientation of the
miRNA. These findings raise the possibility that antisense miR-182 was initially part of
an RNA/DNA hybrid with the annotated miRNA, either an in vitro artifact or hinting at a
novel DNA-based mode of miRNA-regulated gene expression.
tRNAs and tRNA fragments. tRNAs are the most abundant small ncRNAs
detected in the datasets for total plasma RNA (83.0-93.4% of the small ncRNA reads,
mapping to 376-419 different tRNA genes; Fig. 3.5B). tRNA species grouped by
anticodon showed a skewed distribution, with good correspondence between the
abundant tRNA species detected by TeI4c in the NT and -3’P plasma RNA preparations
(Fig. 3.15A). IGV alignments for representative tRNA species to individual loci showed
that most are full-length, extending from the processed 5’ end of the mature tRNA, or
post-transcriptionally added 5’ G residue in the case of tRNAHis, to the post-
transcriptionally added 3’ CCA (Fig. 3.15B). In contrast to retroviral RTs, which
terminate at base modifications that affect Watson-Crick base-pairing interactions
(Burnett and McHenry, 1997; Ansmant et al., 2001; Jackman et al., 2003), TGIRT
enzymes frequently read through a number of such modifications (e.g., m1A58 and
m1G9) by misincorporation, with the spectrum of misincorporated nucleotides
characteristic of the modification (Elagib et al., 2013; Katibah et al., 2014). tRNA-protein
complexes have been identified previously in human sera as autoantigens in patients with
autoimmune diseases, a well-studied example being HisGUG, which is bound to histidyl-
tRNA synthetase in the polymyositis-specific autoantigen Jo-1 (Hardin et al., 1982;
Mathews and Bernstein, 1983; Rosa et al., 1983). Our findings indicate that HisGUG and
other full-length tRNAs are normal, relatively abundant components of human plasma.
In addition to full-length tRNAs, several abundant tRNA species in the NT and -
3’P plasma RNA-seq datasets correspond to 5’- and 3’-tRNA halves resulting from
cleavage within the anticodon loop (Fig. 3.15C). As noted previously, the percentage of
5’-tRNA half reads increased from 0.4% of mapped tRNA reads in NT datasets to
7.1% of mapped tRNA reads in -3’P datasets, consistent with cleavage by an RNase, such
as angiogenin, which leaves a 2’,3’-cyclic phosphate or 3’ phosphate (Fig. 3.15C) (Fu et al.,
2009). 5’-tRNA halves in plasma have been reported to be present in RNP complexes
that are destabilized by chelating agents such as EDTA, which was used in our plasma
preparation (Dhahbi et al., 2013a). It is possible that the proportion of 5’-tRNA halves
detected by TGIRT-seq would be higher in plasma prepared without EDTA.
Other small ncRNAs. The remaining small ncRNAs detected by TeI4c RT in NT
total plasma RNA datasets include Y RNAs (3.8%; 84 species, including 3 of 4 known Y
RNAs); snoRNAs (1.9%; 220 species); 7SL RNAs (1.8%; 191 species); snRNAs (0.9%;
145 species); Vault RNAs (VT; 0.8%; 5 species, including 3 of 4 known Vault RNAs);
and 7SK RNAs (0.5%; 71 species) (Fig. 3.5B). Only fragments of snoRNAs, snRNAs
and Y RNAs were previously reported to be present in plasma or exosomes (Dhahbi et
al., 2013b; Huang et al., 2013; Spornraft et al., 2014). We detected longer transcripts
mapping to the piRNA cluster but not mature piRNAs, possibly reflecting the 2’-O-
methyl group at their 3’ end, which inhibits TGIRT template-switching (Mohr et al.,
2013).
Remarkably, many of the small ncRNAs that we identified in plasma are full-
length transcripts, including snRNAs, both H/ACA-box and C/D-box snoRNAs, Y
RNAs, Vault RNAs, 7SL RNAs (299 nt), and 7SK RNAs (332 nt) (Fig. 3.16A). All of
these RNAs function intracellularly in RNP complexes (Walter and Blobel, 1982;
Kickhoefer et al., 2002; He et al., 2008; Markert et al., 2008; Esteller, 2011; Chen et al.,
2013), and their presence as full-length transcripts protected from plasma RNases
suggests that they are present as such in plasma. Y RNA and Vault RNA are associated
with autoantigens Ro/SSA and La/SSB, respectively, both of which have been implicated
in autoimmune diseases, including systemic lupus erythematosus and Sjögren’s syndrome
(Halse et al., 1999; Xue et al., 2003; Routsias and Tzioufas, 2010), while 7SL RNA, an
RNA component of the signal recognition particle, has been implicated in the
autoimmune disease myositis (Satoh et al., 2005). 7SK RNA, the central scaffold of an
RNP complex that regulates nuclear transcription elongation (He et al., 2008; Markert et
al., 2008), has not been reported previously in plasma. Notably, the unmapped reads
contain 5’ truncated Y RNAs and Vault RNA fragments with poly(U) tails (Fig. 3.16B),
presumably reflecting that they were targeted for degradation before being exported into
plasma (Malecki et al., 2013).
3.4 DISCUSSION
The RNA-seq method developed here employing a thermostable group II intron
reverse transcriptase (TGIRT-seq) enables strand-specific comprehensive RNA profiling
of different RNA size classes starting from small amounts of RNA. In addition to simpler
library preparation without known biases of RNA ligation or random hexamer priming of
reverse transcription (Linsen et al., 2009; Hansen et al., 2010; Levin et al., 2010; Lamm
et al., 2011; Hu and Hughes, 2012; Raabe et al., 2014), TGIRT-seq distinguishes mature
miRNAs from pre-miRNAs and longer miRNA-containing transcripts, and it gives full-
length reads including both the 5’- and 3’-RNA termini of a variety of highly structured
small ncRNAs. Because gel-purification and phenol-extraction steps in previous versions
of the method have been eliminated, RNA-seq libraries can be prepared from a small
amount of starting material in <5 h and can potentially be automated to further enhance
efficiency and throughput.
In this initial demonstration of the method, we prepared RNA from 1 ml of human
plasma and used Illumina sequencing to obtain 14.6-69.4 million paired-end reads for
total plasma RNA datasets, enabling profiling of plasma RNAs at relatively low cost. We
found that human plasma RNAs consist largely of fragments of protein-coding genes and
lncRNAs, together with less abundant small ncRNAs. The RNA fragments of
protein-coding genes appear to be enriched in intron and antisense sequences, possibly reflecting
preferential turnover of extraneous RNA sequences, which are packaged into exosomes,
exported into the intercellular space, and eventually find their way into plasma.
Surprisingly, we found that many of the small ncRNAs, including miRNAs, tRNAs,
snoRNAs, snRNAs, Y RNAs, Vault RNAs, 7SL RNAs, and 7SK RNAs, are present as
full-length transcripts, suggesting that they are protected from plasma RNases in RNP
complexes and/or exosomes. Although miRNAs are not abundant in the total plasma
RNA preparations, they were amply detected in a way that distinguishes mature miRNAs
from pre-miRNAs, and their coverage could be improved by greater sequencing depth or
by small RNA enrichment.
The TGIRT-seq method should be easily modifiable for different sequencing
platforms. By including additional steps for rRNA depletion followed by RNA
fragmentation and 3’-phosphate removal (Materials and Methods), TGIRT-seq is readily
adaptable for the profiling of whole-cell RNAs, as well as for the analysis of exosomal
RNAs and protein-bound RNA fragments in procedures like HITS-CLIP, RIP-Seq, and
for ribosome profiling.
3.5 MATERIALS AND METHODS
3.5.1 Thermostable group II intron RTs
Reverse transcription of plasma RNAs for the construction of RNA-seq libraries
was done by using a thermostable TeI4c group II intron RT (TeI4c-∆En fusion protein
RT for Datasets 1-11 and 16; TeI4c-MRF group II intron RT (Mohr et al., 2013) for
Dataset 18; Table 3.5), and a thermostable GsI-IIC group II intron RT (TGIRT-III;
InGex) (Datasets 12-15, 17 and 19; Table 3.5). The TeI4c-∆En fusion protein RT was a
gift from Enzymatics and is functionally equivalent to the TeI4c-MRF group II intron
RTs described and used previously (Mohr et al., 2013).
3.5.2 Preparation of human plasma RNA samples
Plasma from a healthy male individual was obtained from the Genome
Sequencing and Analysis Facility at the University of Texas at Austin. To prepare
plasma, fresh blood was collected in 10-ml K+/EDTA venous blood collection tubes,
mixed with an equal volume of phosphate-buffered saline without calcium and
magnesium (PBS -/-; Thermo Fisher Scientific), gently layered over 15-ml Ficoll-Paque
PLUS (GE Healthcare) in a 50-ml conical tube, and centrifuged at 400 x g for 35 min at
room temperature. After centrifugation, plasma (top layer) was transferred into a clean
tube, aliquoted, and stored at -80°C.
To prepare total plasma RNA using the Direct-zol method, plasma (1 ml or four
250-µl aliquots) was mixed with 3-volume Trizol LS Reagent (Thermo Fisher Scientific),
shaken vigorously for 10-30 sec to obtain a homogeneous mixture, incubated at room
temperature for 10 min with occasional mixing, and centrifuged at 12,000 x g for 10 min
at 4 °C in a 1.7-ml Eppendorf tube. The resulting supernatant was then mixed with 1-
volume 100% ethanol and 5 μg of linear acrylamide carrier (Thermo Fisher Scientific),
incubated at room temperature for 10 min with occasional mixing, and processed with a
Direct-zol RNA Miniprep Kit (Zymo Research) following the manufacturer’s protocol.
RNA extracted from 1-ml plasma was concentrated into 11 µl of double-distilled water
(ddH2O) by ethanol precipitation in the presence of 0.3 M sodium acetate (pH 5.2) or
with an RNA Clean & Concentrator Kit (Zymo Research) with 8 volumes of 100%
ethanol added to the sample to increase recovery of small RNAs.
To prepare total plasma RNA by using the mirVana combined method, 1 ml of
plasma was processed by using a mirVana miRNA Isolation kit (Thermo Fisher
Scientific) following the manufacturer’s protocol, but combining the large and small
RNA fractions to obtain a total plasma RNA preparation. After mixing the plasma lysate
with 1/3-volume 100% ethanol, the large RNA fraction was bound to the first column and
eluted, while the small RNA fraction collected in the filtrate was mixed with an
additional 2/3-volume 100% ethanol, bound to the second column, eluted, and combined
with the large RNA fraction. For mirVana small plasma RNA preparation, the large RNA
fraction was discarded. In either case, the RNA was concentrated and cleaned up as
described above for the Direct-zol method.
RNA samples were used for RNA-seq either without further treatment (denoted
NT), after 3’-phosphate removal (denoted -3’ P), or after different DNase treatments. For
3’-phosphate removal, the RNA samples were treated with T4 polynucleotide kinase
(Epicentre) according to the manufacturer’s recommendations, extracted with acid
phenol-chloroform-isoamyl alcohol (25:24:1; Thermo Fisher Scientific), ethanol precipitated,
and dissolved in 11-µl double-distilled (dd) H2O. DNase treatment of RNA samples
prepared by the Direct-zol RNA MiniPrep Kit (Zymo Research) was done following the
manufacturer’s protocol for on-column DNase I digestion with either 5-units DNase I
(Zymo Research) as specified in the protocol (DS15) or 20-units DNase I (DS7-10).
Alternatively, DNase treatment was done on the eluted RNA by using Baseline-ZERO
DNase (Epicentre) according to the manufacturer’s recommendations. For RNase digestion,
the on-column DNase I-treated samples were digested with RNase I (Epicentre)
following the manufacturer’s protocol, and for alkaline hydrolysis, they were incubated at
95°C for 15 min in the presence of 0.25 M NaOH and then neutralized with equimolar HCl.
After treatments, RNA samples were cleaned up with an RNA Clean & Concentrator
Kit (Zymo Research) and eluted with 11-µl ddH2O. To check the efficiency of DNase
digestion, we used a 10-ng mixture of a 74-nt synthetic ssDNA oligonucleotide (5’-TTT
TGA TTG TTT TTC GAT GAT GTT CGG TGA GCA TTG TTC GAG TTT CA TTT
TAT CAC AGC CAG CTT TGA TGT GC-3’; IDT) and a 275-bp dsDNA PCR product
derived from the Lactococcus lactis Ll.LtrB group II intron.
RNA quality and quantity were assessed by running 1 µl of the 11-µl RNA
samples on a 2100 Bioanalyzer (Agilent) using the RNA 6000 Pico Kit (mRNA assay) or
Small RNA Kit for total or small plasma RNA preparations, respectively.
3.5.3 Construction of plasma RNA-seq libraries
For the construction of plasma RNA-seq libraries, TGIRT template-switching
reverse transcription reactions were done by using an initial template-primer substrate
consisting of a 34-nt RNA oligonucleotide (R2 RNA), which contains an Illumina Read 2
primer-binding site and a 3’-blocking group (C3 Spacer, 3SpC3; IDT), annealed to a
complementary 35-nt DNA primer (R2R DNA) that leaves an equimolar mixture of A, C,
G, or T single-nucleotide 3’ overhangs (Fig. 3.1B). Reactions were done in 20 µl of
reaction medium containing plasma RNA (0.9-4.4 ng for total RNA and 7.2-12 ng for
small RNA preparations in 10-µl double-distilled water), 100 nM template-primer
substrate, TGIRT enzyme (2 µM TeI4c or 500 nM GsI-IIC RT), and 1 mM dNTPs (an
equimolar mix of dATP, dCTP, dGTP, and dTTP) in 450 mM NaCl, 5 mM MgCl2, 20
mM Tris-HCl, pH 7.5, and dithiothreitol (DTT; 1 mM for TeI4c RT and 5 mM for GsI-
IIC RT). DTT was either prepared freshly or from a frozen concentrated (0.5 or 1 M)
stock solution. Reactions were assembled by adding all components, except dNTPs, to a
sterile PCR tube containing plasma RNAs with the TGIRT enzyme added last. After pre-
incubating at room temperature for 30 min, reactions were initiated by adding dNTPs and
incubated for 15 min at 60°C. cDNA synthesis was terminated by adding 5 M NaOH to a
final concentration of 0.25 M, incubating at 95°C for 3 min, and then neutralizing with 5
M HCl. The resulting cDNAs were purified with a MinElute Reaction Cleanup Kit
(QIAGEN) and ligated at their 3’ end to a 5’-adenylated/3’-blocked (C3 spacer, 3SpC3;
IDT) adapter (R1R; Fig. 3.1B) by using Thermostable 5’ AppDNA/RNA Ligase (New
England Biolabs) according to the manufacturer’s recommendations. The ligated cDNA
products were re-purified with a MinElute column and amplified by PCR by using
Phusion High-Fidelity DNA polymerase (Thermo Fisher Scientific) with 200 nM of
Illumina multiplex and 200 nM of barcode primers (a 5’ primer that adds a P5 capture
site and a 3’ primer that adds a barcode plus P7 capture site; Fig. 3.1B). PCR was done
with initial denaturation at 98°C for 5 sec followed by 12 cycles of 98°C for 5 sec, 60°C
for 10 sec and 72°C for 10 sec. The PCR products were purified by using the Agencourt
AMPure XP (Beckman Coulter) and sequenced on a HiSeq 2500 or a NextSeq 500
instrument (Illumina) to obtain 100-nt (HiSeq), 75-nt (NextSeq) or 150-nt (NextSeq)
paired-end reads.
RNA-seq libraries of cellular RNAs were constructed similarly from RNAs
isolated from K562 cells (ATCC CCL-243, maintained in IMDM supplemented with
10% FBS at 37°C with a 5% CO2 atmosphere) using a mirVana miRNA Isolation Kit
(Thermo Fisher Scientific) following the manufacturer’s protocol, or commercial T Cell
Leukemia (Jurkat) Total RNA (Thermo Fisher Scientific). Whole-cell RNAs (5 µg) were
ribo-depleted by using a RiboZero Gold Kit (Human/Mouse/Rat) (Epicentre) and then
fragmented to a size predominantly between 70 and 100 nt by using an NEBNext
Magnesium Fragmentation Module (New England Biolabs). 40 ng of fragmented RNAs
was treated with T4 Polynucleotide Kinase (Epicentre) to remove 3’ phosphates, cleaned
up with an RNA Clean & Concentrator Kit (Zymo Research), and used for RNA-seq
library construction with TGIRT enzymes (GsI-IIC for K562 and TeI4c for Jurkat) as
described above.
3.5.4 RNA-seq analysis of cDNA recopying by TGIRT enzymes
Control RNA-seq to assess the strand specificity of TGIRT enzymes was done
with 50 ng of a 74-nt synthetic RNA oligonucleotide (5’-UUU UGA UUG UUU UUC
GAU GAU GUU CGG UGA GCA UUG UUC GAG UUU CAU UUU UAU CAC AGC
CAG CUU UGA UGU GC; IDT) using 2 µM TeI4c-MRF or 1 µM GsI-IIC RTs under
the conditions described above. Libraries were sequenced on an Illumina HiSeq, yielding
6.5-6.9 × 10⁵ 100-nt single-end reads that mapped to the RNA oligonucleotide sequence
in the expected orientation. Only a very small number of reads (3 for TeI4c-MRF RT and
12 for GsI-IIC RT) mapped to the RNA oligonucleotide in the antisense orientation,
corresponding to re-copying frequencies of 0.72 × 10⁻⁵ and 1.9 × 10⁻⁵ for TeI4c-MRF and
GsI-IIC RTs, respectively. All of the antisense reads resulted from template-switching to a
previously synthesized cDNA from either the 5’ end of the R2 RNA (the template-primer
substrate) or from the 5’ end of a previously copied RNA, resulting in a product with the
R2R DNA sequence on one end and the R2 RNA sequence on the other end. Both types
of recopying are readily identifiable by examining the reads without adapter trimming.
3.5.5 Bioinformatics analysis
The bioinformatics pipeline used for analysis of RNA-seq data is outlined in
Figure 3.1C. First, Illumina TruSeq DNA adapter and primer sequences were trimmed
from the reads by using cutadapt (Martin, 2011) (sequencing quality score cut-off at 20;
p-value < 0.01), and reads <18-nt after trimming were discarded. Reads were then
mapped by using Tophat v2.0.10 and Bowtie2 v2.1.0 (default settings) to the human
genome reference sequence (Ensembl GRCh38 Release 76) (Langmead and Salzberg,
2012; Kim et al., 2013) supplemented with additional contigs encoding the 5S rRNA
gene (2.2-kb 5S rRNA repeats from the cluster on chromosome 1 (1q42); GenBank:
X12811) and the 45S rRNA gene (43-kb 45S rRNA repeats containing 5.8S, 18S and 28S
rRNA sequences from clusters on chromosomes 13, 14, 15, 21, and 22; GenBank:
U13369). Other sequences used for mapping included DNA oligonucleotide sequences
used in control experiments (see above) to test for sample cross-contamination, and the E.
coli genome sequence (GenBank: NC_000913) to remove any reads resulting from E.
coli nucleic acids in enzyme preparations. Unmapped reads from this first pass (Pass 1)
were re-mapped to Ensembl GRCh38 Release 76 by Bowtie2 with local alignment
(default settings) to improve the mapping rate for those reads that contain post-
transcriptionally added nucleotides (e.g., CCA and poly(U)), untrimmed adapter
sequences, and non-templated nucleotides added to the 3’ end of the cDNAs by TGIRT
enzymes (Pass 2). The mapped reads from Passes 1 and 2 were combined and filtered by
mapping quality (MAPQ ≥15; p-value < 0.03), and concordant read pairs were collected
by using Samtools. The concordant read pairs were then intersected with gene
annotations (Ensembl GRCh38 Release 76) and piRNA cluster annotations from
piRNABank (Sai Lakshmi and Agrawal, 2008) to collect reads that mapped uniquely in
the annotated orientation to genomic features (genomic coordinates for piRNAs were
converted to Ensembl GRCh38 Release 76 coordinates using scripts from the UCSC
genome browser website). Coverage of each feature was calculated by Bedtools. To
improve the mapping rate for tRNAs, mapped reads from Passes 1 and 2 were intersected
with tRNA annotations from the Genomic tRNA Database (Lowe and Eddy, 1997) to
collect both uniquely and multiply mapped tRNA reads. These were then combined with
unmapped reads after Pass 2 and mapped to the tRNA reference sequences (UCSC
genome browser website) using Bowtie2 local alignment with default settings. Because
similar or identical tRNAs with the same anticodon can be multiply mapped to different
tRNA loci by Bowtie2, mapped tRNA reads with MAPQ ≥1 were combined according to
their tRNA anticodon prior to calculating the tRNA distributions. Only those features
with ten or more mapped reads were counted.
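
The p-values quoted alongside the quality-score and MAPQ cut-offs above are consistent with the standard Phred relation, Q = -10 log10(p). The short sketch below makes that conversion explicit. It assumes that both the base-quality and mapping-quality scores follow this scaling; the function names are illustrative and are not part of cutadapt, Bowtie2, or Samtools.

```python
import math


def phred_to_probability(q):
    """Error probability p for a Phred-scaled score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Phred-scaled score Q for an error probability p."""
    return -10 * math.log10(p)


# Base-quality cut-off of 20 used for read trimming.
print(f"Q = 20   -> p = {phred_to_probability(20):.3f}")
# Mapping-quality cut-off of 15 used for filtering mapped reads.
print(f"MAPQ 15  -> p = {phred_to_probability(15):.3f}")
```

Under this relation, a score of 20 corresponds to p = 0.01 and a score of 15 to p of approximately 0.03, matching the thresholds stated in the text.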
Coverage plots and alignments of reads were created by using Integrative
Genomics Viewer (IGV) (Robinson et al., 2011). Information about single nucleotide
polymorphisms (SNPs) was obtained from NCBI dbSNP (Database of Single Nucleotide
Polymorphisms Build 142; common category, minor allele frequency ≥1% in at least one
of the 26 major populations, with at least two unrelated individuals having the minor
allele).
For correlation analysis, RNA-seq datasets were normalized for the total number
of mapped reads by using DESeq (Anders and Huber, 2010) and plotted with ggplot2 in
R. To assess tissue expression profiles for mature miRNAs detected in plasma, reads
mapped to genomic features (Ensembl GRCh38 Release 76) were filtered by size and
reads shorter than 30 nt were intersected with miRBase 21 to obtain reads for mature
miRNAs. The latter were intersected with a published database to obtain RNA-seq
expression values (Landgraf et al., 2007), which were then normalized across different
tissues and plotted with ggplot2 in R.
To identify RNAs with poly(U) tails, unmapped reads after the first Tophat
alignment (pass 1; see above) were processed by using cutadapt and custom scripts to
find a stretch of 10 Us with <10% other nucleotides at the beginning of the Read 2
reads. The corresponding Read 1 reads were then mapped to human genome reference
sequence using Bowtie2 local alignment to identify the RNA species to which the
poly(U) tails are appended, and were used for IGV plots.
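
The custom scripts used for this step are not reproduced here. As a rough illustration of the kind of test involved, the sketch below scans the beginning of a read for a homopolymer stretch that tolerates a small fraction of other nucleotides. The function name, the exact acceptance rule (a prefix of at least 10 nt that ends on the tail base and contains <10% other nucleotides), and the example reads are assumptions made for illustration. Whether the tail appears as T or A in Read 2 depends on the library orientation, so the base is passed as a parameter.

```python
def poly_tail_length(read, base, min_len=10, max_other_frac=0.10):
    """Length of the longest read prefix that looks like a homopolymer tail.

    A prefix qualifies when it is at least `min_len` long, ends on `base`,
    and contains less than `max_other_frac` non-`base` nucleotides.
    Returns 0 when no prefix qualifies.
    """
    shortest = max(min_len, 1)
    # Try the longest prefix first so the first hit is the longest tail.
    for n in range(len(read), shortest - 1, -1):
        prefix = read[:n]
        if prefix[-1] != base:
            continue
        others = n - prefix.count(base)
        if others / n < max_other_frac:
            return n
    return 0


# Hypothetical reads used only to exercise the function.
examples = [
    "TTTTTTTTTTTTACGT",   # clean 12-nt stretch
    "TTTTTATTTTTTTGCGC",  # 13-nt stretch with one interruption
    "ACGTACGTACGT",       # no tail
]
for read in examples:
    print(read, poly_tail_length(read, "T"))
```

Reads that pass such a filter can then be paired with their mates by read name, so that the mate can be mapped to identify the RNA carrying the tail.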
Excel spreadsheets for miRNAs, tRNAs, and other small ncRNAs identified by
TGIRT-seq in different plasma RNA preparations are included in the supplemental data
file as part of the manuscript (see reference (Qin et al., 2016)).
3.5.6 Accession numbers
The plasma RNA-seq datasets have been deposited in the National Center for
Biotechnology Information Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra)
under accession number SRP064378.
                          NT                          -3’P                        OCD
Dataset                   1     2     3     1-3       4     5     6     4-6       7     8     9     10    7-10
Total reads (×10⁶)¹       69.4  23.4  31.7  124.5     20.5  21.5  26.0  68.0      14.6  37.8  36.4  28.5  117.4
Mapped to genome (%)²     92.0  95.3  93.5  93.0      91.1  88.8  92.3  90.8      90.2  85.7  86.6  87.7  87.0
Mapped to features (%)³   28.6  28.8  27.6  28.4      29.3  27.3  29.2  28.7      30.7  30.2  30.1  30.4  30.3
¹Total reads after trimming and filtering.
²Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference
sequence.
³Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated features of the human
genome reference sequence.
Table 3.1: Read statistics and mapping for RNA-seq of total plasma RNAs using TeI4c group II intron RT.
RNA-seq libraries were prepared from plasma RNA samples by using TeI4c RT and sequenced on an Illumina HiSeq
or NextSeq instrument to obtain the indicated number of 100-nt (HiSeq; DS1), 150-nt (NextSeq; DS2-6), or 75-nt (NextSeq;
DS7-10) paired-end reads. Each sample corresponds to plasma RNA (0.9-4.4 ng) obtained from a healthy individual at
intervals at least one week apart and was analyzed either with no further treatment (NT), after T4 polynucleotide kinase
treatment under conditions that remove 3’ phosphates (-3’ P), or after on-column DNase I treatment (OCD). The reads were
trimmed to remove adapter sequences and low quality base-calls (sequencing quality score cut-off at 20 (p-value <0.01)), and
reads <18-nt after trimming were discarded. Trimmed reads were filtered and then mapped by using Tophat and Bowtie2 to a
human genome reference sequence (Ensembl GRCh38 Release 76) supplemented with additional rRNA gene contigs, as
described in Materials and Methods.
                          GsI-IIC, NT                 GsI-IIC, OCD
Dataset                   12    13    14    12-14     15
Total reads (×10⁶)¹       33.6  27.8  43.6  104.9     22.9
Mapped to genome (%)²     90.1  94.6  93.0  92.5      95.1
Mapped to features (%)³   27.9  27.8  27.7  27.8      29.3
¹Total reads after trimming and filtering.
²Percentage of concordant or discordant paired-end reads that mapped uniquely or
multiply to the human genome reference sequence.
³Percentage of concordant paired-end reads that mapped uniquely in the correct
orientation to annotated features of the human genome reference sequence.
Table 3.2: Read statistics and mapping for RNA-seq of total plasma RNAs using GsI-IIC
group II intron RT.
RNA-seq libraries were prepared from different plasma RNA samples by using
GsI-IIC RT and sequenced on an Illumina NextSeq instrument to obtain the indicated
number of 150-nt paired-end reads. Each sample corresponds to plasma RNA (1.4-4.4
ng) obtained from a healthy individual at intervals at least one week apart and was
analyzed with no further treatment (NT) or after on-column DNase I treatment (OCD).
The reads were trimmed to remove adapter sequences and low quality base-calls
(sequencing quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming
were discarded. Trimmed reads were filtered and mapped by using Tophat and Bowtie2
to a human genome reference sequence (Ensembl GRCh38 Release 76) supplemented
with additional rRNA gene contigs, as described in Materials and Methods.
N-3’  TeI4c  GsI-IIC  NN-3’  TeI4c  GsI-IIC  NNN-3’  TeI4c  GsI-IIC
A     26.2   26.2     AA     3.2    1.3      AAA     0.9    0.3
                                             CAA     1.1    0.5
                                             GAA     0.5    0.2
                                             UAA     0.7    0.4
                      CA     11.1   15.0     ACA     2.4    2.6
                                             CCA     4.4    6.7
                                             GCA     1.5    1.4
                                             UCA     2.8    4.4
                      GA     7.3    3.9      AGA     2.5    1.2
                                             CGA     1.2    0.2
                                             GGA     1.3    0.8
                                             UGA     2.3    1.6
                      UA     4.6    6.0      AUA     1.0    1.1
                                             CUA     1.4    1.9
                                             GUA     0.6    0.7
                                             UUA     1.6    2.3
C     26.6   25.8     AC     1.8    1.8      AAC     0.5    0.2
                                             CAC     0.6    1.2
                                             GAC     0.2    0.1
                                             UAC     0.5    0.3
                      CC     9.4    11.3     ACC     2.4    1.8
                                             CCC     2.1    3.5
                                             GCC     1.5    1.3
                                             UCC     3.4    4.8
                      GC     5.1    4.0      AGC     1.5    1.2
                                             CGC     0.7    0.3
                                             GGC     0.9    0.7
                                             UGC     2.0    1.9
                      UC     10.2   8.7      AUC     1.8    1.3
                                             CUC     3.0    3.4
                                             GUC     1.2    0.9
                                             UUC     4.3    3.0
G     25.7   24.0     AG     4.1    2.8      AAG     1.0    0.6
                                             CAG     1.5    1.1
                                             GAG     0.7    0.5
                                             UAG     0.9    0.7
                      CG     3.6    2.4      ACG     0.8    0.5
                                             CCG     1.2    0.9
                                             GCG     0.6    0.3
                                             UCG     1.0    0.8
                      GG     8.7    7.4      AGG     2.5    2.4
                                             CGG     1.1    0.4
                                             GGG     1.5    1.4
                                             UGG     3.7    3.2
                      UG     9.2    11.3     AUG     1.9    2.4
                                             CUG     2.9    4.3
                                             GUG     1.5    1.7
                                             UUG     2.9    3.0
U     21.6   24.0     AU     1.6    1.0      AAU     0.4    0.1
                                             CAU     0.4    0.4
                                             GAU     0.3    0.1
                                             UAU     0.5    0.3
                      CU     6.2    13.8     ACU     1.3    1.5
                                             CCU     1.6    5.3
                                             GCU     1.0    2.4
                                             UCU     2.4    4.5
                      GU     3.1    3.1      AGU     0.8    0.7
                                             CGU     0.3    0.2
                                             GGU     0.6    0.5
                                             UGU     1.4    1.8
                      UU     10.7   6.1      AUU     2.0    0.8
                                             CUU     2.8    2.9
                                             GUU     1.5    0.8
                                             UUU     4.4    1.6
Table 3.3: Analysis of 3’-terminal nucleotides of RNAs in RNA-seq datasets constructed
from total plasma RNA using TeI4c or GsI-IIC group II intron RTs.
Read 2 sequences from RNA-seq datasets constructed from on-column DNase I-treated
total plasma RNA by using TeI4c (DS7-10) or GsI-IIC (DS15) group II intron RTs were
trimmed for adapter sequence and low quality bases. Nucleotide frequencies for the
first three nucleotides of Read 2 (corresponding to the last three nucleotides of the RNA)
were calculated by using customized scripts. Frequencies for the last (N-3’), last two
(NN-3’), and the last three (NNN-3’) nucleotides of RNAs are shown as the percent of all
3’ RNA ends in the dataset.
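The customized scripts used for this calculation are not reproduced here. The following is a minimal sketch of how such a tally could be made, assuming that Read 2 corresponds to the cDNA (antisense) strand, so that its first k bases must be reverse-complemented to recover the last k nucleotides of the RNA. The function names and the handling of short or ambiguous reads are illustrative assumptions, not the original scripts.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Complement of each DNA base in Read 2, reported in RNA letters (U instead of T).
_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def rna_3prime_end(read2: str, k: int) -> Optional[str]:
    """Infer the last k nucleotides of the RNA from the first k bases of Read 2.

    Assumption: Read 2 is the cDNA strand, so the prefix is reverse-complemented.
    Returns None if the read is shorter than k or the prefix has an ambiguous base.
    """
    prefix = read2[:k].upper()
    if len(prefix) < k or any(base not in _COMPLEMENT for base in prefix):
        return None
    return "".join(_COMPLEMENT[base] for base in reversed(prefix))


def end_frequencies(reads: Iterable[str], k: int) -> Dict[str, float]:
    """Percentage of usable reads ending in each k-mer (k = 1, 2, or 3 in Table 3.3)."""
    counts = Counter()
    for read in reads:
        end = rna_3prime_end(read, k)
        if end is not None:
            counts[end] += 1
    total = sum(counts.values())
    return {kmer: 100.0 * n / total for kmer, n in counts.items()} if total else {}
```

In this sketch the denominator is the number of reads with an unambiguous prefix of length k, which may differ slightly from the denominator used for the table.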
                             Jurkat   K562
Dataset                      18       19
Total reads (×10^6)^1        23.4     37.8
Mapped to genome (%)^2       86.2     93.4
Mapped to features (%)^3     70.1     73.0
^1 Total reads after trimming and filtering.
^2 Percentage of concordant or discordant paired-end reads that mapped uniquely or
multiply to the human genome reference sequence.
^3 Percentage of concordant paired-end reads that mapped uniquely in the correct
orientation to annotated features of the human genome reference sequence.
Table 3.4: Read statistics and mapping for RNA-seq of whole-cell RNAs by using TeI4c
or GsI-IIC group II intron RT.
RNA-seq libraries were prepared from 40 ng of ribo-depleted, fragmented whole-
cell RNAs by using TeI4c RT (Jurkat cells) or GsI-IIC RT (K562 cells) and sequenced on
an Illumina NextSeq instrument to obtain the indicated number of 150-nt paired-end
reads. The reads were trimmed to remove adapter sequences and low quality base-calls
(sequencing quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming
were discarded. Trimmed reads were then mapped by using Tophat and Bowtie2 to a
human genome reference sequence (Ensembl GRCh38 Release 76) supplemented with
additional rRNA gene contigs, as described in Materials and Methods.
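The quality cut-off of 20 quoted above corresponds to an error probability of 0.01 under the standard Phred scale, Q = -10 log10(p). The sketch below illustrates that relationship together with a simplified 3'-end trimming and length filter. It is not the algorithm of the trimming software used here, only a stand-in that removes trailing bases below the cut-off and discards reads shorter than 18 nt; the function names are hypothetical.

```python
from typing import List, Optional


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def trim_and_filter(seq: str, quals: List[int],
                    cutoff: int = 20, min_len: int = 18) -> Optional[str]:
    """Strip trailing bases with quality below `cutoff`, then apply a length filter.

    Simplified illustration: only the 3' end is trimmed, one base at a time.
    Returns the trimmed read, or None if it is shorter than `min_len`.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    trimmed = seq[:end]
    return trimmed if len(trimmed) >= min_len else None
```

For example, a 20-nt read whose last two base calls have quality below 20 is trimmed to 18 nt and retained, whereas one whose last three base calls fall below the cut-off is trimmed to 17 nt and discarded.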
Plasma RNA prepared by the Direct-zol method
Dataset 1 (DS1) TeI4c RT, total plasma RNA, no treatment (NT)
Dataset 2 (DS2) TeI4c RT, total plasma RNA, no treatment (NT)
Dataset 3 (DS3) TeI4c RT, total plasma RNA, no treatment (NT)
Dataset 4 (DS4) TeI4c RT, total plasma RNA, 3’phosphates removal (-3’P)
Dataset 5 (DS5) TeI4c RT, total plasma RNA, 3’phosphates removal (-3’P)
Dataset 6 (DS6) TeI4c RT, total plasma RNA, 3’phosphates removal (-3’P)
Dataset 7 (DS7) TeI4c RT, total plasma RNA, on-column DNase I (OCD)
Dataset 8 (DS8) TeI4c RT, total plasma RNA, on-column DNase I (OCD)
Dataset 9 (DS9) TeI4c RT, total plasma RNA, on-column DNase I (OCD)
Dataset 10 (DS10) TeI4c RT, total plasma RNA, on-column DNase I (OCD)
Dataset 11 (DS11) TeI4c RT, total plasma RNA, Baseline-ZERO DNase (BZD)^1
Dataset 12 (DS12) GsI-IIC RT, total plasma RNA, no treatment (NT)
Dataset 13 (DS13) GsI-IIC RT, total plasma RNA, no treatment (NT)
Dataset 14 (DS14) GsI-IIC RT, total plasma RNA, no treatment (NT)
Dataset 15 (DS15) GsI-IIC RT, total plasma RNA, on-column DNase I (OCD)
Plasma RNA prepared with a mirVana miRNA isolation kit
Dataset 16 (DS16) TeI4c RT, total plasma RNA, Baseline-ZERO DNase (M-BZD)^1,2
Dataset 17 (DS17) GsI-IIC RT, small plasma RNA, no treatment (NT)^2
Whole-cell RNA
Dataset 18 (DS18) TeI4c RT, Jurkat cells, ribo-depleted and fragmented
Dataset 19 (DS19) GsI-IIC RT, K562 cells, ribo-depleted and fragmented
^1 Datasets constructed from plasma RNA treated with Baseline-ZERO DNase had
decreased mapping rates and complexity, reflecting loss of material due to additional
treatment and recovery steps when starting with very small amounts of plasma RNA.
^2 The dataset contains reads combined from multiple biological replicates (two for DS16
and four for DS17).
Table 3.5: Summary of RNA-seq datasets.
Figure 3.1: TGIRT-seq overview.
(A) RNA-seq library construction via TGIRT template-switching. TGIRT
template-switching reverse transcription reactions use an initial template-primer substrate
comprised of an RNA oligonucleotide, which contains an Illumina Read 2 primer-binding
site (R2 RNA) and has a 3’-blocking group, annealed to a complementary DNA primer
(R2R DNA), which leaves an equimolar mixture of A, C, G, and T (denoted N) single-
nucleotide 3’ overhangs. The initial R2 RNA-R2R DNA substrate was mixed with target
RNA and TGIRT enzyme in the reaction medium, with the enzyme added last, and then
pre-incubated for 30 min at room temperature prior to initiating reverse transcription
reactions by adding dNTPs. The reactions were incubated at 60°C for 5 to 30 min,
depending on the length and/or modification level of target RNA, and terminated by
alkaline treatment (Materials and Methods). The cDNA products were then purified with
a MinElute Reaction Cleanup Kit (QIAGEN) and ligated at their 3’ ends to a 5’-
adenylated/3’-blocked DNA oligonucleotide complementary to an Illumina Read 1
primer (R1R) by using a Thermostable 5’ AppDNA/RNA Ligase (New England Biolabs).
The ligated cDNAs were re-purified and amplified by PCR for 12 cycles to add Illumina
flow cell capture sites (P5 and P7) and barcode sequences for sequencing. (B) Sequences
of oligonucleotides used for TGIRT-seq. (C) Mapping pipeline for human RNA-seq
datasets constructed with TGIRT enzymes. After trimming adapter sequences and reads
with low quality base calls by using cutadapt, reads of ≥18 nt were mapped by Tophat
and Bowtie2 (default settings) to a human genome reference sequence (Ensembl GRCh38
release 76) supplemented with additional rRNA gene contigs and other sequences (Pass
1; see Materials and Methods). Unmapped reads from Pass 1 were then re-mapped to the
same human genome reference sequence using Bowtie2 local alignment (default settings)
to recover reads from RNAs with post-transcriptionally added nucleotides (e.g., 3’ CCA,
poly(U)) or short introns (e.g., tRNA introns) (Pass 2). Concordant read pairs that
mapped uniquely with MAPQ ≥15 from Passes 1 and 2 were combined and mapped to
genomic features. Reads that mapped to tRNA genes were filtered and combined with the
reads that remained unmapped after the Bowtie2 local alignment, and remapped to
human tRNA reference sequences (UCSC genome browser website) to achieve optimal
recovery and mapping of tRNA reads. tRNA reads with MAPQ ≥1 were combined with
mapped genome reads from the prior steps for downstream analysis.
Figure 3.2: Bioanalyzer traces showing size profiles of plasma RNAs before and after
various treatments.
Total plasma RNA was prepared by the Direct-zol method, and a 1-µl portion was
analyzed with an RNA 6000 Pico Kit (mRNA assay) on a 2100 Bioanalyzer (Agilent) to
obtain the traces shown in the Figure. (A) Total plasma RNA with no further treatment
(NT). (B) Total plasma RNA after on-column DNase I treatment (OCD). (C) and (D)
Total plasma RNA after OCD treatment followed by RNase I or alkaline hydrolysis
treatments, respectively.
Figure 3.3: Bioanalyzer traces testing the efficiency of DNase treatments used on plasma
RNA preparations.
(A) On-column DNase I treatment. A mixture containing 10 ng of a 74-nt single-
stranded DNA and 275-bp double-stranded DNA (Materials and Methods) was mock
extracted with Trizol LS Reagent and processed by the Direct-zol method without or with
on-column DNase I treatment (OCD), as described for plasma RNA preparations in
Materials and Methods. After processing, a 1-µl portion of the DNA was analyzed with a
High Sensitivity DNA Kit on a 2100 Bioanalyzer (Agilent). (B-D) Baseline-Zero DNase
treatment. A 1-µl portion of total plasma RNA extracted from 1 ml of plasma by using
the mirVana combined method was analyzed with an RNA 6000 Pico Kit (mRNA assay)
on a 2100 Bioanalyzer with (B) no further treatment (M-NT); (C) after addition of a 10-
ng mixture of the same ssDNA and dsDNA as in (A); and (D) after addition of the 10-ng
mixture of the DNAs followed by treatment with Baseline-ZERO DNase (M-BZD).
Figure 3.4: The distribution of transcript lengths in total plasma RNA libraries calculated
by the coverage of paired-end read span.
(A) and (B) Distribution of calculated transcript lengths in total plasma RNA
prepared by the Direct-zol method with no further treatment (NT; combined DS1-3) or
after on-column DNase I treatment (OCD; combined DS7-10), respectively. Transcript
lengths were calculated by paired-end read span using bedtools, and their distribution was
plotted in R. (C) The distribution of transcript lengths for paired-end reads mapping to
protein-coding genes in the OCD datasets calculated and plotted as above. The reads
mapping to protein-coding genes were filtered to remove reads for which >50% of the
read length overlapped embedded small ncRNAs prior to calculating transcript lengths.
The read gap correlates with read length in the RNA-seq reaction and is caused by the
loss of coverage due to trimming of the final nucleotides of the reads, which are often
lower quality base calls.
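The two calculations described in this legend, the paired-end read span and the >50% overlap filter for embedded small ncRNAs, can be sketched as follows. This is an illustration under stated assumptions rather than the bedtools commands actually used: coordinates are taken as 0-based, half-open intervals on the same chromosome, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open


def read_pair_span(read1: Interval, read2: Interval) -> int:
    """Length of the region covered from the leftmost to the rightmost mate end."""
    return max(read1[1], read2[1]) - min(read1[0], read2[0])


def overlap_fraction(read: Interval, feature: Interval) -> float:
    """Fraction of the read length that overlaps the feature."""
    overlap = max(0, min(read[1], feature[1]) - max(read[0], feature[0]))
    length = read[1] - read[0]
    return overlap / length if length > 0 else 0.0


def keep_for_protein_coding(read: Interval,
                            small_ncrnas: Iterable[Interval],
                            max_fraction: float = 0.5) -> bool:
    """Keep a read unless more than `max_fraction` of it overlaps a small ncRNA."""
    return all(overlap_fraction(read, f) <= max_fraction for f in small_ncrnas)
```

For a read pair with mates at (100, 175) and (140, 215), the calculated span is 115 nt. A 100-nt read that overlaps a small ncRNA by 60 nt would be removed, whereas one overlapping by exactly 50 nt would be kept.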
Figure 3.5: Percentage of TGIRT-seq reads from total plasma RNA datasets mapping to
different categories of genomic features.
RNA-seq datasets were constructed by using TeI4c RT for total plasma RNA
prepared by the Direct-zol method and either not treated (NT; combined DS1-3), 3’
dephosphorylated (-3’ P; combined DS4-6), or on-column DNase I-treated (OCD;
combined DS7-10). Reads were mapped to genomic features as described in Materials
and Methods. (A) Stacked bar graphs showing the percentage of concordant read pairs
that mapped uniquely in the correct orientation to the indicated category of genomic
features. Protein-coding genes include immunoglobulin and T-cell receptor genes; long
ncRNAs include lincRNAs, antisense RNAs and other lncRNAs; and rRNA genes
include 5S, 5.8S, 18S, and 28S rRNA genes. (B) Stacked bar graphs showing the
percentage of small ncRNA read pairs (1.8-5.8% of the reads in the total plasma RNA
datasets) that mapped to different categories of small ncRNA genes. In (A) and (B), the
numbers next to each stacked bar segment indicate the number of different genes for
which transcripts were identified in that category. Only features with ten or more mapped
reads in the combined datasets were included. Abbreviation: MT, mitochondrial genes.
Figure 3.6: Correlation analysis for biological replicates of total plasma RNA libraries.
Reads from the indicated RNA-seq datasets constructed by using either TeI4c or
GsI-IIC RTs for total plasma RNA prepared by the Direct-zol method and treated in
different ways were normalized to generate (A-D) correlation matrices and (E) a scatter
plot. Pairwise Spearman’s correlation coefficients (ρ) are shown in the boxes of the
correlation matrices and at the upper left of the scatter plot. NT, not treated; -3’ P, treated
to remove 3’ phosphates; OCD, on-column DNase I treatment.
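For reference, Spearman's correlation coefficient is the Pearson correlation of the ranks of two variables, with tied values assigned their average rank. The following is a minimal standard-library sketch of that definition for two vectors of per-gene read counts. It is not the code used to produce the Figure, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because the coefficient depends only on ranks, any normalization that preserves the order of genes within a dataset, such as scaling by sequencing depth, leaves it unchanged.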
Figure 3.7: RNA-seq analysis of total plasma RNA libraries constructed with GsI-IIC
group II intron RT.
RNA-Seq libraries were constructed by using GsI-IIC RT from total plasma RNA
prepared by the Direct-zol method without (GsI-IIC, NT; combined DS12-14) or with
(GsI-IIC, OCD; DS15) on-column DNase I treatment following the manufacturer’s
protocol. (A) Stacked bar graphs showing the percentage of concordant read pairs that
mapped uniquely in the annotated orientation to the indicated category of features. (B)
Stacked bar graphs showing the percentage of small ncRNA read pairs (1.3-2.1% of the
reads in the total plasma RNA datasets; also see Supplemental Data File) that mapped to
different categories of small ncRNAs. Protein-coding genes include immunoglobulin and
T-cell receptor genes; long ncRNAs include lincRNAs, antisense RNAs and other
lncRNAs; and rRNA genes include 5S, 5.8S, 18S, and 28S rRNA genes. The numbers
next to the stacked bar segments indicate the number of different genes for which
transcripts were identified in each category of features. Only features with ten or more
mapped reads in the combined datasets were included. Abbreviation: MT, mitochondrial
genes. (C) Stacked bar graphs showing the percentage of bases in protein-coding gene
reads that mapped to coding sequences (CDS), introns, 5’- and 3’-untranslated regions
(UTRs), and intergenic regions. (D) Stacked bar graphs showing the proportion of
concordant read pairs that mapped to the sense and antisense strands of protein-coding
genes. In (C) and (D), the reads that mapped to protein-coding genes were filtered to
remove those with >50% of the read length overlapping embedded small ncRNAs, and
the percentage of bases or reads mapping to different regions or strands was calculated by
using picard tools.
Figure 3.8: Human plasma RNA is enriched in intron and antisense sequences compared
to whole-cell RNAs.
Reads mapping to protein-coding genes were analyzed to assess coverage across
different regions and both DNA strands in RNA-seq datasets constructed with TGIRT
enzymes for total plasma or whole-cell RNA prepared and treated in different ways.
These include plasma RNA prepared by the Direct-zol method with no further treatment
(NT; combined DS1-3), after on-column DNase I treatment (OCD; combined DS7-10),
or after Baseline-ZERO DNase treatment (BZD; DS11); plasma RNA prepared by the
mirVana combined method after Baseline-ZERO DNase treatment (M-BZD; DS16); and
ribo-depleted and fragmented whole-cell RNA from Jurkat cells (TeI4c RT; DS18) or
K562 cells (GsI-IIC RT; DS19). (A) Stacked bar graphs showing the percentage of bases
in protein-coding gene reads that mapped to coding sequences (CDS), introns, 5’- and 3’-
untranslated regions (UTRs), and intergenic regions. (B) Stacked bar graphs showing the
proportion of concordant read pairs that mapped to the sense and antisense strands of
protein-coding genes. In (A) and (B), reads that mapped to protein-coding genes were
filtered to remove those with >50% of the read length overlapping embedded small
ncRNAs, and the percentage of bases or reads mapping to different regions or strands
was calculated by using picard tools. Reads from the OCD, BZD, and M-BZD datasets
were analyzed with or without removal of read pairs with a span of <30 nt to exclude
short DNA fragments that may have escaped DNase treatment.
Figure 3.9: Proportion of reads mapping to the sense strand of protein-coding genes as a
function of gene length in RNA-seq datasets of human plasma or whole-cell
RNAs.
Reads that mapped to either the sense or antisense strands of the protein-coding
genes in the datasets indicated in the Figure were retrieved using bedtools and filtered to
remove reads for which >50% of the read length overlapped embedded small ncRNAs.
The percentage of sense reads (black dots) versus gene length (red line) was then plotted
using R for genes with ≥10 reads mapping to one or both strands.
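The per-gene quantity plotted here can be sketched as follows, assuming that the ≥10-read threshold applies to the sum of sense and antisense reads for each gene. The function and gene names are hypothetical, and the original analysis was done with bedtools and R rather than with this code.

```python
from typing import Dict, Tuple


def percent_sense_by_gene(counts: Dict[str, Tuple[int, int]],
                          min_reads: int = 10) -> Dict[str, float]:
    """Percentage of reads on the sense strand for each gene passing the threshold.

    `counts` maps a gene name to (sense_reads, antisense_reads).
    Assumption: a gene is retained if sense + antisense >= min_reads.
    """
    result = {}
    for gene, (sense, antisense) in counts.items():
        total = sense + antisense
        if total >= min_reads:
            result[gene] = 100.0 * sense / total
    return result
```

A gene with 90 sense and 10 antisense reads would be plotted at 90% sense, while a gene with only 7 reads in total would be omitted.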
Figure 3.10: Human plasma contains both mature and pre-miRNAs.
(A) Relative abundance of miRNAs identified in RNA-seq datasets constructed
with TeI4c RT for total plasma RNAs prepared by the Direct-zol method with on-column
DNase I treatment (OCD; combined DS7-10; left) or by the mirVana combined method
with Baseline-ZERO DNase treatment (M-BZD; DS16; right). miRNA loci with ten or
more mapped reads were rank-ordered by read count and plotted to display relative
abundance. The 20 most abundant miRNA loci by read count are shown in the bar graph
insets. Loci encoding predicted miRNAs (Ensembl GRCh38 Release 76) were not
included in the bar graphs unless mature-sized miRNAs mapping to the locus were
identified in the datasets. (B) and (C) IGV screen shots showing coverage plots (CP;
above) and alignments (below) of reads for loci in which abundant miRNA transcripts
were identified in the OCD and M-BZD datasets, respectively. In (B), the miRNA
transcripts were ordered based on abundance as shown in the left panel of (A). (C) IGV
screen shots showing additional miRNA transcripts that were abundant in the M-BZD
dataset, but less abundant or not present in the OCD datasets. The arrow at the top
indicates the boundaries and 5’ to 3’ orientation of the mature miRNA on the
chromosomal DNA sequence. Reads were sorted by start site on the chromosome, which
can be from either the 5’ or 3’ end depending on the orientation of the gene on the
chromosome. Nucleotides matching the genome sequence are shown in gray, and
mismatches are shown as different colors (A, green; C, blue; G, brown; and T, red),
which can either correspond to or be the complement of the RNA sequence depending on
the orientation of the gene on the chromosome. Mismatches were checked against NCBI
dbSNP, and known SNPs are indicated with the nucleotide change and corresponding
SNP ID. Mismatches at the 5’ end of the reads are likely due to non-templated nucleotide
addition by the TGIRT enzyme to the 3’ end of the cDNAs. Some miRNAs (e.g., miR-122)
have post-transcriptionally added A or AA residues at their 3’ ends (Norbury, 2013).
Figure 3.11: Tissue expression profiles for mature miRNAs in plasma.
The Figure shows tissue expression profiles of the mature miRNAs identified by
TGIRT-Seq in total plasma RNA prepared by the Direct-zol method with on-column
DNase I treatment (OCD; combined DS7-10). The profiles are based on the relative
RNA-seq expression values of the miRNAs in a published database (Landgraf et al.,
2007), and only miRNAs present in that database are shown. Tissue categories:
podocytes include both differentiated and undifferentiated podocytes; peripheral
leukocytes include T-lymphocytes, NK cells, monocytes, granulocytes and dendritic
cells. miRNAs highlighted in red are also abundant (top 10 percentile) in red blood cells
or platelets (Wang et al., 2012), cell types for which relative RNA-seq expression values
were not available in the database used to calculate the expression profiles (Landgraf et
al., 2007).
Figure 3.12: Tissue expression profiles of mature miRNA identified in total plasma RNA
prepared by the mirVana combined method.
The Figure shows tissue expression profiles of mature miRNAs in an RNA-seq
dataset constructed with TeI4c RT from total plasma RNA prepared by the mirVana
combined method and treated with Baseline-ZERO DNase (M-BZD; DS16). Tissue
expression profiles were plotted as described in Fig. 3.11. Tissue categories: podocytes
include both differentiated and undifferentiated podocytes; peripheral leukocytes include
T-lymphocytes, NK cells, monocytes, granulocytes and dendritic cells. miRNAs
highlighted in red are also abundant (top 10 percentile) in red blood cells or
platelets (Wang et al., 2012), cell types for which relative RNA-seq expression values
were not available in the database used to calculate the expression profiles (Landgraf et
al., 2007).
Figure 3.13: TGIRT-seq detects full-length pre-miRNAs and a miRNA that may be
present in plasma in an RNA/DNA hybrid.
(A) Secondary structures of full-length pre-miRNAs shown in the IGV plots of
Figure 3.10C. (B) IGV screen shots showing coverage plots (CP; above) and alignments
(below) of reads for miR-182 in the RNA-seq datasets indicated in the Figure for non-
treated (NT) or on-column DNase I (OCD)-treated plasma RNA preparations with TeI4c
or GsI-IIC RTs. The arrow at the top indicates the boundaries and 5’ to 3’ orientation of
the annotated mature miRNA on the chromosomal DNA sequence. Read pairs were
grouped and colored by orientation, with the sense read pairs shown in light purple and
the antisense read pairs shown in salmon. The numbers to the right of the alignment
indicate the number of reads in each category. The alignment with >1,000 mapped reads
was down-sampled to 1,000 reads in IGV.
Figure 3.14: Relative abundance and IGV alignments of miRNAs identified in a small
plasma RNA-seq dataset constructed with GsI-IIC RT.
(A) Relative abundance. Small plasma RNA was isolated by the mirVana small
RNA enrichment method, and RNA-seq libraries were constructed by using GsI-IIC RT
(GsI-IIC, Small; DS17). miRNA loci with ten or more mapped reads were rank-ordered by
read count and plotted to display relative abundance. The 20 most abundant miRNA loci
by read count are shown in the bar graph inset. Loci encoding predicted miRNAs (Ensembl
GRCh38 Release 76) were not included in the bar graph unless mature-sized miRNAs
mapping to the locus were identified in the dataset. (B) IGV screen shots. The screen shots
show coverage plots (CP; above) and alignments (below) of reads for loci in which the 20
most abundant miRNA transcripts were identified in the dataset. The IGV coverage plots
and alignments of reads are as described in Figure 3.10.
Figure 3.15: TGIRT-seq identifies full-length mature tRNAs and tRNA fragments in
human plasma.
(A) Relative abundance of tRNAs identified in RNA-seq datasets constructed
with TeI4c RT for total plasma RNA prepared by the Direct-zol method without (NT;
combined DS1-3) or with treatment to remove 3’ phosphates (-3’ P; combined DS4-6).
The plots show tRNAs with ten or more mapped reads grouped by anticodon and rank-
ordered by read count. The 15 most abundant tRNAs based on anticodon are shown in the
bar graph insets. (B) IGV screen shots showing coverage plots (CP; above) and
alignments (below) of reads for abundant full-length mature tRNAs identified in the NT
datasets. The tRNAs were ordered by abundance as in the left panel of (A). For cases in
which multiple loci encode tRNAs with the same sequence, tRNA reads were distributed
equally among different tRNA loci for the IGV alignments. (C) IGV screen shots
showing coverage plots and alignments of reads for representative 3’-tRNA halves in the
NT datasets (AlaAGC and ThrCGT) and 5’-tRNA halves in the -3’ P datasets (GlyCCC,
ArgCCG and AspGTC). The arrow at the top indicates the boundaries and 5’ to 3’
orientation of the mature tRNA on the chromosomal DNA sequence. In order to fit the
entire alignment in one panel, genes with >1,000 mapped reads were down-sampled to
1,000 reads in IGV. Reads were sorted by start site on the chromosome. Nucleotides
matching the genome sequence are shown in gray, and mismatches are shown as different
colors (A, green; C, blue; G, brown; and T, red). Mismatches at the 5’ end of the reads
are likely due to non-templated nucleotide addition by the TGIRT enzyme to the 3’ end
of the cDNAs. Mismatches due to misincorporation at known sites of post-transcriptional
modifications are highlighted with the name of the modification. Modifications: I,
inosine; m1A, 1-methyladenosine; m3C, 3-methylcytidine; m5C, 5-methylcytidine; m1G,
1-methylguanosine; m2G, N2-methylguanosine; m22G, N2,N2-dimethylguanosine.
Figure 3.16: Other classes of small non-coding RNAs identified as full-length mature
transcripts in human plasma by TGIRT-seq.
(A) IGV screen shots showing coverage plots (CP; above) and alignments (below)
of reads mapping to small ncRNA loci in RNA-seq datasets constructed with TeI4c RT
for total plasma RNA prepared by the Direct-zol method (NT; combined DS1-3). The RNA
biotype is indicated at the top with the gene name and transcript length in parentheses. (B)
Examples of small ncRNA fragments with poly(U) tails. IGV screen shots showing
coverage plots (CP; above) and alignments (below) of Read 1s for poly(U)-tailed small
ncRNAs found among the unmapped reads in NT datasets. In (A) and (B), the arrow at the
top indicates the boundaries and 5’ to 3’ orientation of the mature transcript on the
chromosomal DNA sequence. In order to fit the entire alignment in one panel, genes with
>1,000 mapped reads were down-sampled to 1,000 reads in IGV. Reads were sorted by
start site on the chromosome, which can be from either the 5’ or 3’ end depending on the
orientation of the gene on the chromosome. Nucleotides matching the genome sequence
are shown in gray, and mismatches are shown as different colors (A, green; C, blue; G,
brown; and T, red), which can either correspond to or be the complement of the RNA
sequence. Mismatches were checked against NCBI dbSNP, and known SNPs are indicated
with the nucleotide change and corresponding SNP ID. Other mismatches were manually
checked and were due to lower quality base-calls, non-templated nucleotide addition to the
3’ end of the cDNA resulting in extra nucleotides at the 5’ end of the read, or misalignment
by Bowtie2 local alignment.
Chapter 4: Identification of circulating RNA biomarkers in multiple
myeloma
4.1 INTRODUCTION
Multiple myeloma is the second most prevalent hematological cancer in the USA
after non-Hodgkin lymphoma (Raab et al., 2009). It remains an incurable disease that
causes 15-20% of deaths from blood malignancies and about 2% of all deaths from cancer
(International Myeloma Working Group, 2003), with a median survival of around five
years for newly diagnosed patients (Bergsagel et al., 2013). In myeloma, malignant
plasma cells in the bone marrow proliferate and interfere with production of normal
blood cells (Raab et al., 2009). They also produce a monoclonal protein, referred to as the
paraprotein, which can be detected in blood or urine, or both. The paraproteins are
comprised of monoclonal immunoglobulins, typically IgG or IgA, and monoclonal free
light chains, which together are responsible for decreased humoral immunity and over
90% of the renal impairment that occurs in myeloma (Stringer et al., 2011). Other
common symptoms associated with myeloma include anemia, bone disease, and
hypercalcemia (Smith and Yong, 2013). The genetic abnormalities underlying the
pathogenesis of myeloma include chromosomal translocations, multiple trisomies, and
late onset mutations (Smith and Yong, 2013).
Diagnosis of myeloma often involves: (i) paraprotein concentration in serum or
urine, detected by serum electrophoresis and immunofixation; (ii) plasma cell infiltration
in the bone marrow (BMPCs), assessed by bone marrow aspirate; and (iii) bone lesions,
screened by skeletal survey using plain radiographs and magnetic resonance imaging
(MRI) (Raab et al., 2009). Monoclonal gammopathy of undetermined significance
(MGUS) and smoldering multiple myeloma (SMM) are conditions in which patients have
less (MGUS) or more (SMM) than 30 g/L paraproteins and <10% BMPCs, but have not
yet developed any myeloma-associated symptoms (Smith and Yong, 2013). The risk of
progression to active multiple myeloma (AMM) in the first 5 years after initial diagnosis
is 1% and 10% per year for MGUS and SMM, respectively (Rajkumar et al., 2015).
However, SMM is biologically heterogeneous, including a subset of patients displaying
premalignancy similar to MGUS, and a subset of high-risk patients progressing to AMM
with a median time of only 2 years (Rajkumar et al., 2015). Although evidence suggests
that high-risk SMM patients may benefit from early therapeutic intervention, there is
unfortunately no reliable pathological or molecular biomarker that can be used to
distinguish the MGUS-like SMM patients from the malignant SMM patients, making it
challenging for early detection and treatment (Rajkumar et al., 2015).
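To put the annual progression rates quoted above in perspective, they can be converted into an approximate cumulative risk over the first five years. The sketch below assumes a constant annual risk that applies independently each year to patients who have not yet progressed, which is a simplification introduced here for illustration and not a statement from the cited study.

```python
def cumulative_risk(annual_risk: float, years: int) -> float:
    """Probability of progressing at least once within `years`.

    Assumption: the same annual risk applies independently in each year.
    """
    return 1 - (1 - annual_risk) ** years


mgus_5yr = cumulative_risk(0.01, 5)  # about 0.049, i.e. roughly 5%
smm_5yr = cumulative_risk(0.10, 5)   # about 0.410, i.e. roughly 41%
```

Under this simplification, roughly 5% of MGUS patients and roughly 41% of SMM patients would progress to AMM within five years of diagnosis.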
Next-generation RNA-sequencing (RNA-seq) is a powerful tool for transcriptome
profiling and gene expression analysis, which can potentially allow diagnostics of human
diseases to move from morphology and low-sensitivity protein analysis into global
identification of RNA biomarkers (Meldrum et al., 2011; Byron et al., 2016). The key
factor underlying the success of RNA-based diagnostics is the ability to analyze all RNAs
at the same time with high sensitivity and minimal bias. However, retroviral RTs used in
conventional methods for reverse transcription of target RNAs have inherently low
processivity and fidelity, resulting in RNA-seq libraries with reduced complexity and
accuracy (Hu and Hughes, 2012). Additionally, the use of RNA ligase for attaching
RNA-seq adapters to target RNAs leads to bias and low efficiency in RNA-seq library
construction (Linsen et al., 2009; Levin et al., 2010; Lamm et al., 2011).
Recently, we developed new RNA-seq methods for the analysis of whole-cell,
exosomal, and plasma RNAs based on the use of thermostable group II intron reverse
transcriptases (TGIRT enzymes) (Qin et al., 2016; Nottingham et al., 2016). TGIRTs have
higher thermostability, processivity and fidelity than conventional retroviral reverse
transcriptases, along with a novel end-to-end template-switching activity that attaches
RNA-seq adapters to target RNAs without using RNA ligase (Mohr et al., 2013). TGIRTs
give full-length reads of structured small non-coding RNAs (small ncRNAs), including
tRNAs and snoRNAs, which are refractory to retroviral RTs, and enable identification of
a variety of base modifications in these RNAs by distinctive patterns of misincorporated
nucleotides (Katibah et al., 2014; Shen et al., 2015; Zheng et al., 2015). Validation of
TGIRT-seq on well-characterized human RNA reference samples and comparisons to
published Illumina TruSeq datasets for these samples further showed that TGIRT-seq: (i)
is simpler yet more strand-specific than TruSeq v3; (ii) recapitulates the relative abundance
of human transcripts and RNA spike-ins in ribo-depleted, fragmented RNA samples
comparably to the non-strand-specific TruSeq v2 and better than the strand-specific
TruSeq v3 methods; (iii) gives more uniform 5’ to 3’ gene coverage than either TruSeq
method; (iv) detects more splice junctions, particularly near the 5’ ends of genes, than
TruSeq v3 at comparable read depth, even from fragmented RNAs; and (v) eliminates
sequence biases due to random hexamer priming that are inherent in TruSeq (Nottingham
et al., 2016). By using the TGIRT-seq total RNA method, we constructed RNA-seq
libraries from <1 ng plasma RNA in <5 h (Qin et al., 2016). We found that plasma contains
RNA fragments derived from large numbers of protein-coding genes and lncRNAs, along with
most known classes of small ncRNAs. Many of the latter are present as full-length
transcripts, suggesting protection from plasma RNases in ribonucleoprotein complexes
and/or exosomes (Qin et al., 2016). A particular advantage of TGIRT-seq is that these
structured small ncRNAs can be profiled in the same RNA-seq run as protein-coding and
long non-coding RNAs (lncRNAs), potentially providing more robust biomarker
identification than conventional methods.
We are currently collaborating with Drs. Flavia Pichiorri and Craig Hofmeister’s
group at the Ohio State University to identify RNA biomarkers for early, sensitive and
non-invasive myeloma diagnostics by carrying out TGIRT-seq analysis of circulating
RNAs isolated from extracellular vesicles in human plasma. Extracellular vesicles (EVs)
are membrane-enclosed structures containing nucleic acids and proteins released by cells
(Raposo and Stoorvogel, 2013). They have been recognized as important vehicles for
intercellular communication, and their emerging roles in diagnostics and therapeutics are
under intensive investigation (EL Andaloussi et al., 2013).
4.2 RNA PROFILES OF EXTRACELLULAR VESICLES IN HUMAN PLASMA
To obtain RNAs from extracellular vesicles (EV-RNAs), ~20 ml of plasma
collected from 10 de-identified individuals, including 4 healthy individuals (H), 3 patients
with smoldering multiple myeloma (SMM), and 3 patients with active multiple myeloma
(AMM), was processed directly with an exoEasy Maxi
Kit (QIAGEN) and analyzed on a 2100 Bioanalyzer (Agilent). Previous studies of whole
plasma RNA in our laboratory showed the presence of ~160-bp DNA fragments in
plasma (Qin et al., 2016). However, Bioanalyzer traces of the EV-RNA preparations
showed a broad peak centered around 40-60 nt with a minor or undetectable peak at ~160
bp (Fig. 4.1), suggesting that the ~160-bp DNA fragments in plasma may come from
other sources, such as apoptotic bodies, rather than from extracellular vesicles. For
initial analysis described here, we used the TGIRT-seq total RNA method as described
for human plasma RNAs (see Chapter 3 and ref. Qin et al., 2016) to construct RNA-seq
libraries from 1.7-7.5 ng of EV-RNAs without further treatment from four healthy
individuals (H1-4), three patients with SMM (SMM1-3), and three patients with AMM
(AMM1-3).
Table 4.1 summarizes the mapping statistics for all RNA-seq datasets. The
samples were sequenced on an Illumina NextSeq 500 to give 3.3-61.2 million 75-nt
paired-end reads. The reads were trimmed to remove adapter sequences and low quality
base calls and then mapped to a human genome reference sequence (Ensembl GRCh38
Release 76) supplemented with additional rRNA gene contigs (Materials and Methods),
using the TGIRT-seq mapping pipeline (Qin et al. 2016). For all RNA-seq datasets, 67.5-
93.7% of the paired-end reads mapped to the human genome, and 4.7-36.2% were
concordant read pairs that mapped uniquely and with high mapping quality (MAPQ ≥15)
to non-ribosomal and non-mitochondrial genomic features in the annotated orientation.
For confidence, only features with ≥10 hits were counted in the analysis. The reduced
total number of reads obtained for dataset AMM1 is likely explained by its extremely
low RNA input (~1.7 ng), which resulted in excess primer-dimers being produced during PCR
that remained after clean-up and consumed the majority of the reads. Datasets SMM1 and
AMM3 had lower mapping rates to genomic features, which were due to mitochondrial
contamination during the preparation of EVs.
Figure 4.2 shows the percentage of reads mapping to different genomic features in
combined EV-RNA datasets obtained for each group (H, SMM and AMM), using only
uniquely mapped concordant read pairs for the calculation. The number of individual
genes to which the reads mapped is shown next to each feature in the stacked bar graphs.
The TGIRT-seq profiles of EV-RNA samples are very similar to previously published
plasma RNA samples (Qin et al. 2016), with the majority of the reads corresponding to
fragmented protein-coding gene transcripts and lncRNAs (Fig. 4.2A), and a smaller proportion
mapping to a variety of small ncRNAs (Fig. 4.2B). These findings suggest that
extracellular vesicles are major contributors to the plasma RNA pool.
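The per-category percentages and gene counts described above can be illustrated with a minimal sketch. It assumes a hypothetical input of (gene, feature class, read-pair count) tuples and applies the ≥10-read threshold stated in the text; the function name and the toy numbers are illustrative only and are not part of the TGIRT-seq pipeline.

```python
from collections import defaultdict

def summarize_features(gene_counts, min_reads=10):
    """Return percent of read pairs and number of genes per feature class.

    gene_counts: iterable of (gene_id, feature_class, read_pairs) tuples
    (hypothetical input layout). Genes below min_reads are ignored.
    """
    reads = defaultdict(int)
    genes = defaultdict(int)
    for gene_id, feature_class, n in gene_counts:
        if n >= min_reads:
            reads[feature_class] += n
            genes[feature_class] += 1
    total = sum(reads.values())
    return {
        c: {"percent": 100 * reads[c] / total, "genes": genes[c]}
        for c in reads
    }

# Toy data (made up for illustration)
toy = [
    ("geneA", "protein_coding", 600),
    ("geneB", "protein_coding", 200),
    ("lnc1", "lncRNA", 150),
    ("Y_RNA_1", "small_ncRNA", 50),
    ("rare_1", "small_ncRNA", 5),  # below threshold, not counted
]
summary = summarize_features(toy)
for feature_class, stats in summary.items():
    print(feature_class, f"{stats['percent']:.1f}%", stats["genes"], "genes")
```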
4.3 TGIRT-SEQ IDENTIFIES DIFFERENTIALLY EXPRESSED TRANSCRIPTS BY DISEASE
STAGES
Despite the overall similarity among RNA classes identified in EV-RNAs for
healthy, SMM and AMM groups, their TGIRT-seq profiles separated into three distinct
classes (Fig. 4.3), suggesting gene expression changes as the disease advances. Indeed, a
number of differentially expressed protein-coding and small ncRNAs that were up- or
down-regulated in the SMM group showed progressive increases or decreases,
respectively, in the AMM group (Fig. 4.4). Therefore, TGIRT-seq analysis of RNAs in
plasma EVs is potentially useful for the identification of high-risk SMM patients with
rapid progression to malignancy.
Among the differentially expressed transcripts was a population of ~33-nt RNA
fragments derived from Y RNAs, which were strongly elevated in both SMM and AMM
patients (YF; Fig. 4.4). Y RNAs are small RNAs (84-112 nt) that are part of the 60-kDa
Ro ribonucleoprotein autoantigens and function in RNA stability and cellular responses
to stress (Chen and Wolin, 2004; Wolin et al., 2012). Recent studies show that Y RNAs
are essential for the initiation of chromosomal DNA replication in vertebrates (Christov
et al., 2006; Krude et al., 2009). Interestingly, emerging evidence has identified short
fragments of Y RNAs in cells, solid tumors and bodily fluids of humans and other mammals
with implications in a variety of human diseases, including cancer (Kowalski and Krude,
2015). The potential role of Y RNA fragments as a novel diagnostic biomarker for
myeloma will be further investigated, including validation by qRT-PCR.
Finally, we compared our TGIRT-seq datasets to a previously published
microarray dataset, which tracked gene expression profiles of CD138-selected plasma
cells in 559 newly diagnosed myeloma patients for 730 days, with end-points
representing event-free survival (EFS), meaning a lack of malignancy or disease
recurrence, and overall survival (OS) (Popovici et al., 2010; Shi et al., 2010). Several
protein-coding genes, which had altered expression levels in EV-RNAs, demonstrated
association with survival (Fig. 4.5), providing support for using TGIRT-seq analysis of
circulating RNAs as an easily accessible and sensitive diagnostic tool. In the next phase
of this research, we will increase the number of patient samples for each disease stage
and include MGUS in the analysis.
4.4 DISCUSSION
We demonstrated the potential biotechnological applications of TGIRT-seq in
diagnostics for human diseases by the identification of circulating RNA biomarkers in
multiple myeloma. In order to obtain a complete profile of protein-coding genes and
lncRNAs together with small ncRNAs that are present in plasma EVs, we used the
TGIRT-seq total RNA method for rapid and efficient RNA-seq library construction with
no size selection and minimal bias (Nottingham et al., 2016; Qin et al., 2016). Initial
results for 10 datasets obtained from de-identified healthy individuals and patients with
SMM or AMM identified differentially expressed transcripts, including novel small
ncRNAs, Y RNA-derived fragments, and protein-coding genes correlated with survival
based on a previously published microarray study (Popovici et al., 2010; Shi et al., 2010).
We are now extending this approach to other types of cancer including
inflammatory breast cancer, a rare and very aggressive disease. By collaborating with Dr.
Naoto Ueno’s group at MD Anderson Cancer Center, we will be analyzing RNA samples
from FFPE (formalin-fixed, paraffin-embedded) tumor tissue, PBMCs (peripheral blood
mononuclear cells) and plasma. Finally, we are collaborating with Dr. Joseph
McCormick’s group at the University of Texas at Brownsville to analyze plasma RNA
samples for a large-scale population study of environmental impact on human health. We
will continue to explore and develop methods for using TGIRT-seq analysis of
circulating RNAs in blood or other bodily fluids as a sensitive, non-invasive and cost-
effective tool for early detection of a variety of human diseases, and for personalized
medical care.
4.5 MATERIALS AND METHODS
4.5.1 Thermostable group II intron RTs
Reverse transcription of RNAs for the construction of RNA-seq libraries was
done by using a thermostable GsI-IIC RT (TGIRT-III; InGex, St. Louis MO).
4.5.2 RNA preparations*
* This work was done by Enrico Caserta in Flavia Pichiorri’s research group at Ohio State University.
Plasma from de-identified healthy individuals or patients at different stages of
myeloma (SMM and AMM) was collected and processed with an exoEasy Maxi Kit
(QIAGEN) to obtain RNAs in the extracellular vesicles (EV-RNAs).
4.5.3 Construction of RNA-seq libraries
EV-RNA preparations were concentrated with an RNA Clean & Concentrator Kit
(Zymo Research) to 23 µl for healthy individuals and 12 µl for myeloma patients (SMM
and AMM). The quality and quantity of EV-RNAs were assessed by running 1 µl on a
2100 Bioanalyzer (Agilent) using the RNA 6000 Pico Kit (mRNA assay).
For construction of RNA-seq libraries, TGIRT template-switching reverse
transcription reactions were done by using an initial template-primer substrate consisting
of a 34-nt RNA oligonucleotide (R2 RNA), which contains an Illumina Read 2 primer-
binding site and a 3’-blocking group (C3 Spacer, 3SpC3; IDT), annealed to a
complementary 35-nt DNA primer (R2R DNA) that leaves an equimolar mixture of A, C,
G, or T single-nucleotide 3’ overhangs. Reactions were done in 20 µl of reaction medium
containing EV-RNAs (1.7-7.5 ng in 11-µl ddH2O), 100 nM template-primer substrate, 1
µM TGIRT-III enzyme, and 1 mM dNTPs (an equimolar mix of dATP, dCTP, dGTP,
and dTTP) in 450 mM NaCl, 5 mM MgCl2, 20 mM Tris-HCl, pH 7.5, and 5 mM
dithiothreitol (DTT). DTT was either prepared freshly or from a frozen concentrated 1 M
stock solution. Reactions were assembled by adding all components, except dNTPs, to a
sterile PCR tube containing EV-RNAs with the TGIRT-III enzyme added last. After pre-
incubating at room temperature for 30 min, reactions were initiated by adding dNTPs and
incubated for 15 min at 60°C. cDNA synthesis was terminated by adding 5 M NaOH to a
final concentration of 0.25 M, incubating at 95°C for 3 min, and then neutralizing with 5
M HCl. The resulting cDNAs were purified with a MinElute Reaction Cleanup Kit
(QIAGEN) and ligated at their 3’ end to a 5’-adenylated/3’-blocked (C3 spacer, 3SpC3;
IDT) adapter (R1R) by using Thermostable 5’ AppDNA/RNA Ligase (New England
Biolabs) according to the manufacturer’s recommendations. The ligated cDNA products
were re-purified with a MinElute column and amplified by PCR by using Phusion High-
Fidelity DNA polymerase (Thermo Fisher Scientific) with 200 nM of Illumina multiplex
and 200 nM of barcode primers (a 5’ primer that adds a P5 capture site and a 3’ primer
that adds a barcode plus P7 capture site). PCR was done with initial denaturation at 98°C
for 5 sec followed by 12 cycles of 98°C for 5 sec, 60°C for 10 sec and 72°C for 10 sec.
The PCR products were purified by using the Agencourt AMPure XP (Beckman Coulter)
and sequenced on a NextSeq 500 instrument (Illumina) to obtain 75-nt paired-end reads.
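The stock-to-final concentration arithmetic in the termination step can be checked with a short calculation. The sketch below assumes, for illustration only, that the reaction volume when NaOH is added is exactly 20 µl and that volumes are additive; the function name is hypothetical.

```python
def stock_volume_for_final(stock_molar, final_molar, start_volume_ul):
    """Volume of stock (µl) to add so that the mixture reaches final_molar.

    Solves stock * v = final * (start + v) for v, assuming additive volumes
    and none of the solute in the starting solution.
    """
    if not 0 < final_molar < stock_molar:
        raise ValueError("final concentration must lie between 0 and the stock")
    return final_molar * start_volume_ul / (stock_molar - final_molar)

# 5 M NaOH added to an assumed 20-µl reaction to reach 0.25 M final
naoh_ul = stock_volume_for_final(5.0, 0.25, 20.0)
print(f"Add about {naoh_ul:.2f} µl of 5 M NaOH")
```

Under these assumptions, an equal volume of 5 M HCl supplies the same number of moles of acid and would neutralize the added base.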
4.5.4 Bioinformatics*
* This work was done by Jun Yao in the Lambowitz lab and Dennis Wylie in the Bioinformatics Consulting Group
at the University of Texas at Austin.
Analysis of all RNA-seq datasets was done by using the TGIRT-seq mapping
pipeline as described previously for human plasma RNAs (Qin et al., 2016). First,
Illumina TruSeq DNA adapter and primer sequences were trimmed from the reads by
using cutadapt (Martin, 2011) (sequencing quality score cut-off at 20; p-value < 0.01),
and reads <18-nt after trimming were discarded. Reads were then mapped by using
Tophat v2.0.10 and Bowtie2 v2.1.0 (default settings) to the human genome reference
sequence (Ensembl GRCh38 Release 76) (Langmead and Salzberg, 2012; Kim et al.,
2013) supplemented with additional contigs encoding the 5S rRNA gene (2.2-kb 5S
rRNA repeats from the cluster on chromosome 1 (1q42); GenBank: X12811) and the
45S rRNA gene (43-kb 45S rRNA repeats containing 5.8S, 18S and 28S rRNA sequences
from clusters on chromosomes 13, 14, 15, 21, and 22; GenBank: U13369). Other
sequences used for mapping included the E. coli genome sequence (GenBank:
NC_000913) to remove any reads resulting from E. coli nucleic acids in enzyme
preparations. Unmapped reads from this first pass (Pass 1) were re-mapped to Ensembl
GRCh38 Release 76 by Bowtie2 with local alignment (default settings) to improve the
mapping rate for those reads that contain post-transcriptionally added nucleotides (e.g.,
CCA and poly(U)), untrimmed adapter sequences, and non-templated nucleotides added
to the 3’ end of the cDNAs by TGIRT enzymes (Pass 2). The mapped reads from Passes
1 and 2 were combined and filtered by mapping quality (MAPQ ≥15; p-value < 0.03),
and concordant read pairs were collected by using Samtools. The concordant read pairs
were then intersected with gene annotations (Ensembl GRCh38 Release 76) and piRNA
cluster annotations from piRNABank (Sai Lakshmi and Agrawal, 2008) to collect reads
that mapped uniquely in the annotated orientation to genomic features (genomic
coordinates for piRNAs were converted to Ensembl GRCh38 Release 76 coordinates
using scripts from the UCSC genome browser website). Coverage of each non-ribosomal
and non-mitochondrial feature was calculated by Bedtools. To improve the mapping rate
for tRNAs, mapped reads from Passes 1 and 2 were intersected with tRNA annotations
from the Genomic tRNA Database (Lowe and Eddy, 1997) to collect both uniquely and
multiply mapped tRNA reads. These were then combined with unmapped reads after
Pass 2 and mapped to the tRNA reference sequences (UCSC genome browser website)
using Bowtie2 local alignment with default settings. Because similar or identical tRNAs
with the same anticodon can be multiply mapped to different tRNA loci by Bowtie2,
mapped tRNA reads with MAPQ ≥1 were combined according to their tRNA anticodon
prior to calculating the tRNA distributions. Only those features with ten or more mapped
reads were counted.
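The p-values quoted next to the quality cut-offs above follow from the Phred scale, in which a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below shows the conversion for the two cut-offs used here; the function names are illustrative and are not taken from cutadapt, Bowtie2 or Samtools.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse conversion, Q = -10 * log10(p)."""
    return -10 * math.log10(p)

for label, q in [("base-quality cut-off", 20), ("MAPQ cut-off", 15)]:
    print(f"{label}: Q = {q} -> p = {phred_to_error_prob(q):.4f}")
```

A base-quality score of 20 corresponds to p = 0.01, and a MAPQ of 15 corresponds to p of roughly 0.03, in line with the thresholds stated in the text.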
For transcript expression analysis, RNA-seq datasets were normalized for the total
number of mapped reads by using DESeq (Anders and Huber, 2010) and plotted in R. To
assess correlation with survival, protein-coding genes that showed significant differences
(p < 0.05) in EV-RNA datasets obtained for healthy, SMM and AMM groups, were
intersected with a published microarray dataset (GSE24080) (Popovici et al., 2010; Shi et
al., 2010) to obtain their expression levels in CD138-selected plasma cells of 559 newly
diagnosed myeloma patients across 730 days. For each protein-coding gene, patients
were divided into three equal-size groups based on its expression level (low, middle, and
high), and event-free survival (EFS) was plotted as a function of time for
each patient group.
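The grouping step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the patient records are hypothetical dictionaries, groups are formed by sorting on expression and cutting at one third and two thirds, and censoring is ignored (a Kaplan-Meier estimator would be needed to handle it). It is not the code used for the analysis.

```python
def split_into_tertiles(patients):
    """Sort by expression and cut into three groups of (near-)equal size."""
    ranked = sorted(patients, key=lambda p: p["expression"])
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {"low": ranked[:cut1], "middle": ranked[cut1:cut2], "high": ranked[cut2:]}

def event_free_fraction(group, day):
    """Fraction of a group with no event on or before `day`.

    event_day is None for patients without an event during the study period.
    """
    free = sum(1 for p in group if p["event_day"] is None or p["event_day"] > day)
    return free / len(group)

# Nine made-up patients: expression level and day of first event (None = no event)
toy_patients = [
    {"expression": 1.0, "event_day": None},
    {"expression": 2.0, "event_day": 700},
    {"expression": 3.0, "event_day": None},
    {"expression": 4.0, "event_day": 400},
    {"expression": 5.0, "event_day": None},
    {"expression": 6.0, "event_day": 650},
    {"expression": 7.0, "event_day": 100},
    {"expression": 8.0, "event_day": 300},
    {"expression": 9.0, "event_day": None},
]
groups = split_into_tertiles(toy_patients)
for name, members in groups.items():
    print(name, [round(event_free_fraction(members, d), 2) for d in (365, 730)])
```

With this cutting rule, 559 patients would fall into groups of 186, 186 and 187; how the original analysis handled the remainder is not stated.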
Dataset                     H1    H2    H3    H4    SMM1  SMM2  SMM3  AMM1  AMM2  AMM3
Total reads (×10⁶)¹         34.0  42.0  41.7  31.2  28.6  11.8  17.4   3.3  19.7  61.2
Mapped to genome (%)²       91.3  93.0  90.2  90.8  91.7  86.1  79.9  67.5  84.2  93.7
Mapped to features (%),
excluding MT and rRNA³      26.9  21.0  28.0  32.8   7.3  31.3  36.2  31.8  16.8   4.7
¹Total reads after trimming and filtering.
²Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference
sequence.
³Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated non-ribosomal and
non-mitochondrial features of the human genome reference sequence.
Table 4.1: Read statistics and mapping for RNA-seq of plasma EV-RNAs.
RNA-seq libraries were constructed from plasma EV-RNAs by using the TGIRT-seq total RNA method and sequenced
on an Illumina NextSeq instrument to obtain the indicated number of 75-nt paired-end reads. Each sample corresponds to
plasma EV-RNA (1.7-7.5 ng) obtained from de-identified healthy individuals (H), patients with smoldering (SMM) or active
(AMM) multiple myeloma. The reads were trimmed to remove adapter sequences and low quality base-calls (sequencing
quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming were discarded. Trimmed reads were filtered and
then mapped by using Tophat and Bowtie2 to a human genome reference sequence (Ensembl GRCh38 Release 76)
supplemented with additional rRNA gene contigs, as described in Materials and Methods.
Figure 4.1: Bioanalyzer traces showing size profiles of plasma EV-RNAs.
Plasma EV-RNAs were prepared with an exoEasy Maxi Kit (QIAGEN),
concentrated with an RNA Clean & Concentrator Kit (Zymo Research), and a 1-µl
portion was analyzed with an RNA 6000 Pico Kit (mRNA assay) on a 2100 Bioanalyzer
(Agilent) to obtain the traces shown in the Figure. SMM, smoldering multiple myeloma;
AMM, active multiple myeloma.
[Figure 4.1 panels: Healthy, Smoldering, Active; x-axis in seconds [s]; main peak labeled 40-60 nt in each trace]
Figure 4.2: Percentage of TGIRT-seq reads from EV-RNA datasets mapping to different
categories of genomic features.
Reads from individual RNA-seq datasets obtained for healthy individuals (H1-4),
patients with smoldering (SMM1-3) and active (AMM1-3) multiple myeloma were
combined, and mapped to genomic features as described in Materials and Methods. (A)
Stacked bar graphs showing the percentage of concordant read pairs that mapped
uniquely in the correct orientation to the indicated category of genomic features. Protein-
coding genes include immunoglobulin and T-cell receptor genes; lncRNAs include
lincRNAs, antisense RNAs and other lncRNAs. (B) Stacked bar graphs showing the
percentage of small ncRNA read pairs in (A) that mapped to different categories of small
ncRNA genes. In (A) and (B), the numbers next to each stacked bar segment indicate the
number of different genes for which transcripts were identified in that category. Only
features with ten or more mapped reads in the combined datasets were included.
Figure 4.3: Heatmap for sample-to-sample distance.
Reads from EV-RNA datasets obtained for healthy individuals (H1-4), patients
with smoldering (SMM1-3) and active (AMM1-3) multiple myeloma were normalized
and plotted in R. EV-RNA datasets were clustered based on Euclidean distance, which is
a measure of sample divergence, with larger values indicating greater variation in
transcript expression between two datasets.
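As an illustration of the distance measure named in this legend, the sketch below computes pairwise Euclidean distances between expression profiles. The profiles are made-up three-gene vectors, and the function names are hypothetical; the actual heatmap was produced in R from normalized TGIRT-seq data.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

def distance_matrix(profiles):
    """All pairwise distances, keyed by (sample, sample)."""
    names = list(profiles)
    return {(a, b): euclidean_distance(profiles[a], profiles[b])
            for a in names for b in names}

# Made-up normalized, log-scaled values for three genes
toy_profiles = {
    "H1": [1.0, 2.0, 3.0],
    "H2": [1.0, 2.0, 4.0],
    "AMM1": [4.0, 6.0, 3.0],
}
distances = distance_matrix(toy_profiles)
print(distances[("H1", "H2")], distances[("H1", "AMM1")])
```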
Figure 4.4: Transcript expression in plasma EVs.
(A) Scatter plots comparing average levels of RNA species detected by TGIRT-
seq in plasma EVs prepared using a kit from 3 patients with smoldering multiple
myeloma (SMM; left), 3 patients with active multiple myeloma (AMM; right),
and 4 healthy individuals (H). Read counts in RNA-seq datasets for different individuals
were normalized with DESeq, averaged, and log2 scaled with an offset of 1. Selected
protein-coding genes whose levels in plasma vesicles correlated positively or negatively
with survival based on data in previous microarray studies (GSE24080) (Popovici et al.,
2010; Shi et al., 2010) are highlighted in red and blue, respectively. Examples of small
ncRNAs identified by TGIRT-seq as potential biomarkers subject to RT-qPCR validation
are in green. YF is a 5’ Y RNA fragment whose abundance appears strongly correlated
with myeloma. (B) Box plots comparing average levels of selected protein-coding genes
(top row) and small ncRNAs (bottom row) in (A).
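The normalization and scaling described in this legend can be sketched in plain Python. The example re-implements the median-of-ratios size-factor idea from DESeq (Anders and Huber, 2010) on a made-up count table and then applies the log2 transform with an offset of 1; it is an illustration of the principle, not the DESeq R package, and the function names are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Only genes with nonzero counts in every sample contribute; at least one
    such gene is required.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm
                      for g, lgm in enumerate(log_geo_means) if lgm is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors

def log2_mean_normalized(counts, factors, gene_index):
    """log2(mean of size-factor-normalized counts + 1) for one gene."""
    normalized = [counts[s][gene_index] / factors[s] for s in counts]
    return math.log2(sum(normalized) / len(normalized) + 1)

# Made-up counts: the second sample was sequenced about twice as deeply
toy_counts = {"H1": [10, 20, 40, 0], "H2": [20, 40, 80, 5]}
toy_factors = size_factors(toy_counts)
print({s: round(f, 3) for s, f in toy_factors.items()})
print(round(log2_mean_normalized(toy_counts, toy_factors, 0), 3))
```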
Figure 4.5: Survival curves.
Patients from a previous microarray study (GSE24080) (Popovici et al., 2010; Shi
et al., 2010) were divided into three equal-size groups based on the expression level
(blue, low; grey, middle; red, high) of each selected protein-coding gene, which was
significantly (p < 0.05) down- (top row) or up- (bottom row) regulated in plasma EVs
(see Fig. 4.4). Event-free survival (EFS) was plotted as a function of time,
with values on the x-axis representing the percentage of the 730-day study period.
Chapter 5: Mapping RNA secondary structures and
RNA-protein interaction sites
5.1 OVERVIEW OF SHAPE AND CRAC
All cellular RNAs must fold into specific structures and/or interact with proteins
in order to fulfill their biological functions. In recent years, powerful new methods have
been developed for studying RNA folding and RNA-protein interactions (Weeks and
Mauger, 2011; Ule et al., 2005; Granneman et al., 2009; König et al., 2010; Zarnack et
al., 2013).
Selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) is a
quantitative method, which probes RNA structure in a simple two-step process: (i) RNA
modification and (ii) primer extension (Weeks and Mauger, 2011). In the first step,
exposure of RNA molecules to the electrophilic SHAPE reagent results in flexible
nucleotides, often located in single-stranded regions, being preferentially acylated at their
2’-hydroxyl group (denoted 2’-O-adducts). In the second step, the modified nucleotide
residues are mapped by primer extension. Because RNA 2’-O-adducts create stops in
reverse transcription one nucleotide prior to the RNA modification, the length of the
cDNA maps a position of modification on the RNA, revealing flexible or single-stranded
regions of the RNA at single-nucleotide resolution. Time-resolved SHAPE using fast SHAPE
reagents is also very powerful for mapping changes in RNA structure and RNA-protein
interaction sites during protein-assisted RNA folding (Weeks and Mauger, 2011).
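The relationship between cDNA length and the modified position can be written out explicitly. The sketch below assumes 1-based template coordinates numbered 5' to 3', a primer whose 5' end pairs with a known template position, and a cDNA length that includes the primer; the function name and the numbers are hypothetical.

```python
def modified_position(primer_5p_template_pos, cdna_length):
    """Template position of the 2'-O-adduct inferred from a cDNA stop.

    The cDNA covers template positions primer_5p_template_pos down to
    primer_5p_template_pos - cdna_length + 1. Because reverse transcription
    stops one nucleotide before the adduct, the modified nucleotide lies
    immediately 5' of the last copied position.
    """
    last_copied = primer_5p_template_pos - cdna_length + 1
    adduct = last_copied - 1
    if adduct < 1:
        raise ValueError("cDNA reaches the 5' end of the template; no internal stop")
    return adduct

# Hypothetical 100-nt template, primer 5' end paired with position 100
print(modified_position(100, 40))  # a 40-nt cDNA implies an adduct at position 60
```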
For the mapping of direct binding sites between RNA and protein in vitro or in
vivo, cross-linking and analysis of cDNAs (CRAC) or cross-linking and
immunoprecipitation (CLIP) are often used to covalently link the protein of interest to the
target RNA (Ule et al., 2005; Granneman et al., 2009; König et al., 2010; Zarnack et al.,
2013). The isolated RNA-protein complexes are digested with RNase followed by
protease, leaving RNA fragments with one or more amino acid residues cross-linked to
the nucleotides at the site of protein contact. In current versions of the methods, RNA
ligase is used to attach adaptor sequences containing PCR-primer binding sites to the
RNA fragments, which are then reverse transcribed to cDNAs, PCR amplified, and
sequenced. The binding sites between RNA and protein are identified by reverse
transcription stops at the cross-linked sites during cDNA synthesis (König et al., 2010).
5.2 PROTEIN-ASSISTED GROUP II INTRON SPLICING
Studies of protein-assisted group II intron splicing are important for
understanding eukaryotic gene structure, expression, and evolution. A substantial fraction
(~25%) of the human genome is comprised of spliceosomal introns, whose evolutionary
ancestors are group II introns, and their splicing plays important roles in both gene
expression and regulation (Black, 2000, 2003; Lambowitz and Belfort, 2015). Defective
splicing has been implicated in numerous human diseases including cancer (Faustino and
Cooper, 2003; Karni et al., 2007). Studies of group II intron splicing can also provide
insights into mechanisms of RNA catalysis, which underlies critical cellular processes
including pre-messenger RNA (pre-mRNA) splicing, transfer RNA (tRNA) processing
and translation (Lambowitz and Zimmerly, 2011; Lambowitz and Belfort, 2015). RNA
splicing and catalysis are also involved in propagation and replication of human
pathogens, such as HIV-1 and hepatitis virus (Been and Wickham, 1997; Stoltzfus,
2009).
Group II introns have a conserved secondary structure consisting of six domains
(DI-DVI), which interact via tertiary contacts to form the core and the peripheral RNA
structures that are crucial for splicing (Lambowitz and Zimmerly, 2011). Although group
II introns are ribozymes that catalyze their own splicing, their efficient splicing under
physiological conditions requires the binding of the intron-encoded protein (IEP) to
promote formation of active RNA structures (Lambowitz and Zimmerly, 2011). Using
the Ll.LtrB intron, a group IIA intron found in Lactococcus lactis, and its IEP, denoted
the LtrA protein, as a model system, our laboratory discovered that the IEP has high-
affinity binding sites in intron subdomain DIVa and weaker binding sites across
contiguous regions of DI, II and VI, suggesting LtrA promotes intron splicing by
stabilizing the interactions of these RNA domains (Matsuura et al., 1997; Saldanha et al.,
1999; Matsuura et al., 2001; Dai et al., 2008). However, there is no information about
RNA conformational changes at different steps of splicing, such as exon binding and the
release of branch point adenosine, and how the interactions between intron RNA and IEP
facilitate such processes. Additionally, there is a lack of information about protein-
assisted splicing of group IIB and IIC introns, which have distinctive differences in the
active site and peripheral structures from group IIA introns (Lambowitz and Zimmerly,
2011). For example, group IIA introns position the 5’ and 3’ exons at the active site by
base-pairing of exon-binding sites 1, 2 (EBS1 and EBS2, respectively) and δ from DI to
intron-binding sites 1, 2 (IBS1 and IBS2, respectively) and δ’ in the flanking 5’ and 3’
exons. In contrast, there is no EBS2 sequence in group IIC introns, which instead use only
two interactions (IBS1/EBS1 and IBS3/EBS3) and may also recognize a stem-loop
derived from a transcription terminator or attC site in the 5′ exon (Toor et al., 2006;
Robart et al., 2007; Lambowitz and Zimmerly, 2011).
Here I focused on a small group IIC intron found in the thermophile Geobacillus
stearothermophilus (denoted GsI-IIC). The GsI-IIC intron is closely related to a group
IIC intron from Oceanobacillus iheyensis (about 50% identity in the catalytic RNA part
of the intron), whose crystal structure has been determined (Toor et al., 2008). To map
the secondary structure of GsI-IIC intron RNA and to study the mechanism of protein-
assisted intron splicing, I employed the SHAPE and CRAC methods with thermostable
group II intron reverse transcriptases (TGIRTs) replacing retroviral RTs and RNA
ligation used in both methods. The higher thermostability and processivity of TGIRT
enzymes can potentially address some of the limitations of using retroviral RTs in the
primer extension step used in SHAPE. One major issue of the retroviral RTs is that they
fall off at stable RNA secondary structures during cDNA synthesis, resulting in
premature stops that produce high background noise in SHAPE and other RNA-structure
mapping methods. Such premature stops will be minimized by TGIRT enzymes.
Additionally, TGIRT enzymes can attach RNA-seq adapter sequences during cDNA
synthesis via template-switching, thereby eliminating the use of RNA ligase, which has
sequence bias (Linsen et al., 2009; Levin et al., 2010; Lamm et al., 2011), is time-
consuming, and results in loss of material. The latter is a major challenge for methods
such as CRAC and CLIP, which are limited by the amount of starting material for RNA-
seq library construction.
5.2.1 Determination of optimal exon length and protein concentration for in vitro
splicing of the GsI-IIC intron.
To determine the optimal length of the flanking 5’ exon required for protein-
assisted GsI-IIC intron splicing, precursor RNAs containing an IEP-ORF-deleted (ΔORF)
intron (656-nt), flanking 3’ exon (32-nt), and different flanking 5’ exons (55-nt, 46-nt,
and 35-nt; denoted GsI2c5532, 4632 and 3532, respectively), were constructed and used
for intron transcription and splicing.
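The expected precursor lengths follow directly from the segment lengths given above. The sketch below adds them up, assuming for illustration that each transcript consists only of the 5' exon, the ΔORF intron and the 3' exon, with no additional vector-derived nucleotides.

```python
INTRON_NT = 656   # ΔORF intron
EXON3_NT = 32     # flanking 3' exon
EXON5_NT = {"GsI2c5532": 55, "GsI2c4632": 46, "GsI2c3532": 35}

def precursor_length(exon5_nt, intron_nt=INTRON_NT, exon3_nt=EXON3_NT):
    """Total precursor length as 5' exon + intron + 3' exon."""
    return exon5_nt + intron_nt + exon3_nt

lengths = {name: precursor_length(nt) for name, nt in EXON5_NT.items()}
for name, total in lengths.items():
    print(name, total, "nt")
```

Under this assumption the construct with the 35-nt 5' exon is 723 nt, so removing a single nucleotide would give 722 nt, which matches the transcript length quoted for the branch-point deletion construct in Section 5.2.2.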
The GsI-IIC intron RNA was transcribed in vitro from each construct using a
mutant T7 polymerase, which does not pause or terminate at a variety of signals,
including a terminator found fortuitously in the human preproparathyroid hormone (PTH)
gene, a pause site found in the concatamer junction (CJ) of replicating T7 DNA, and
termination signals that are also utilized by Escherichia coli RNAP (e.g. rrnB T2)
(Lyakhov et al., 1997). To suppress hydrolytic splicing, in which water, rather than the
2’-hydroxyl group (2’-OH) of the branch-site nucleotide, is used as the nucleophile in the
first transesterification step (van der Veen et al., 1987; Jarrell et al., 1988), 4 mM dTTP
was added to the transcription reaction to sequester extra magnesium ions.
The in vitro splicing reactions for GsI2c5532, 4632 and 3532 were done in
reaction medium containing 450 mM KCl and 5 mM Mg2+ at 50°C, and were initiated by
adding a two-fold molar excess of the IEP, which was expressed in Escherichia coli and
purified with high yield and activity as a fusion protein with a non-cleavable maltose-
binding protein (MalE) attached to the N-terminus of the protein via a rigid linker
(denoted GsI-IIC-MRF) (Mohr et al., 2013). To examine the splicing activity of each
GsI-IIC precursor RNA, the splicing reaction was quenched at different time-points over
a course of one hour, and was analyzed by electrophoresis in a denaturing 4% acrylamide
gel.
Figure 5.1 shows that at low Mg2+ concentration (5 mM), GsI-IIC intron splicing
occurred via lariat formation and was strictly dependent upon the addition of the IEP for
all precursor RNAs. Due to inefficient splicing of GsI2c5532 and 4632 (Fig. 5.1A,B),
GsI2c3532 (Fig. 5.1C), which has a 35-nt flanking 5’ exon including the complete hairpin
structure, was used for further biochemical characterization of the protein-assisted
splicing.
To determine the optimal concentration of IEP required for protein-assisted GsI-IIC
intron splicing, one-, two-, and five-fold molar excess of IEP, were examined in a time-
course splicing reaction using GsI2c3532 (Fig. 5.2A). The percentage of lariat formation
for each IEP to RNA molar ratio was quantitated using ImageQuant TL (GE
Healthcare), and the time-courses were fit to a two-exponential equation using
SigmaPlot (Systat Software, Inc). Figure 5.2B shows that the protein-assisted splicing of
GsI-IIC intron was biphasic with an initial fast phase followed by a slow phase.
Interestingly, optimal splicing activity of the GsI-IIC intron occurred at a 1:1 molar ratio
between IEP and RNA (fast phase, 5.7/min and slow phase, 0.14/min), which is lower
than the optimal molar ratio 2:1 used for in vitro protein-assisted splicing of the Ll.LtrB
group IIA intron (Saldanha et al., 1999; Matsuura et al., 2001; Rambo and Doudna,
2004). This finding may reflect that the IEP of GsI-IIC intron, GsI-IIC-MRF, functions in
splicing as a monomer rather than as a dimer, which is thought to be the case for the IEP
of Ll.LtrB group IIA intron, LtrA (Saldanha et al., 1999). This difference could reflect
different active site and peripheral structures of group IIA and IIC introns, which enable
IIC introns to be efficiently spliced by an IEP protein monomer. Alternatively, this
difference could be explained by a higher proportion of inactive protein in the LtrA
preparations.
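To make the biphasic behavior concrete, the sketch below evaluates a common form of a two-exponential rise using the fast and slow rate constants reported above for the 1:1 molar ratio. The exact equation used for the fit and the fitted amplitudes are not given in the text, so the functional form and the amplitudes here are assumptions chosen only for illustration.

```python
import math

def biphasic_fraction(t_min, amp_fast, k_fast, amp_slow, k_slow):
    """Fraction reacted at time t for a sum of two first-order phases."""
    return (amp_fast * (1 - math.exp(-k_fast * t_min))
            + amp_slow * (1 - math.exp(-k_slow * t_min)))

def half_time(k_per_min):
    """Half-time (min) of a first-order phase with rate constant k."""
    return math.log(2) / k_per_min

K_FAST, K_SLOW = 5.7, 0.14      # per min, from the text
AMP_FAST, AMP_SLOW = 0.5, 0.3   # hypothetical amplitudes

for t in (0.0, 0.5, 5.0, 60.0):
    print(t, round(biphasic_fraction(t, AMP_FAST, K_FAST, AMP_SLOW, K_SLOW), 3))
print(round(half_time(K_FAST) * 60, 1), "s;", round(half_time(K_SLOW), 2), "min")
```

With these rate constants the fast phase has a half-time of about 7 s and the slow phase about 5 min, independent of the assumed amplitudes.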
5.2.2 RNA-structure mapping of the GsI-IIC intron via TGIRT-SHAPE*
*This work was done in collaboration with Jacob Grohman in the Lambowitz Lab.
For mapping the RNA structure by TGIRT-SHAPE, we used a 722-nt in vitro
transcript corresponding to GsI2c3532 but with deletion of the branch-point adenosine
(denoted GsI2c3532ΔA) to trap the intron in a pre-catalytic state prior to lariat formation
without affecting IEP binding (Matsuura et al., 2001). The transcript was incubated under
splicing conditions to allow proper folding, and then modified with the SHAPE reagent
isatoic anhydride (IA; Sigma-Aldrich) under conditions that give an average of one
modification per RNA molecule. The modified intron RNA was then reverse transcribed
from a fluorescently labeled DNA primer annealed at its 3’ end by a thermostable group
II intron TeI4c-MRF RT or by SuperScript III (SSIII; Thermo Fisher Scientific). SHAPE
modifications were identified by capillary electrophoresis as reverse transcription stops.
Figure 5.3A shows a plot of SHAPE reactivity determined by TeI4c-MRF as a
function of nucleotide position in the intron RNA, with high reactivity representing
flexible or single-stranded regions and low reactivity representing inflexible or base-
paired regions. The high processivity of TeI4c-MRF RT allowed mapping of the entire
722-nt intron RNA at single-nucleotide resolution by using a single primer annealed at its
3’ end (Fig. 5.3A). By using RNAstructure (Reuter and Mathews, 2010) software with
SHAPE reactivities as constraints, we obtained a secondary structure of the GsI-IIC
intron RNA in agreement with that predicted based on phylogenetic analysis (Fig. 5.3B).
Figure 5.3B also shows a stem-loop region from DIII of the intron RNA in which
TGIRT-SHAPE indicates formation of a short stem containing A-U and G-C base pairs
that appear unpaired due to high background noise caused by premature reverse
transcription stops of SSIII RT.
5.2.3 Mapping of RNA-protein contact sites by TGIRT-CRAC*
*This work was done in collaboration with David Sidote in the Lambowitz Lab.
Next, we used TGIRT-CRAC to map direct binding sites between the intron RNA
and its IEP during splicing. Figure 5.4A outlines the methods for TGIRT-CRAC. The in
vitro transcribed GsI2c3532ΔA was incubated in the presence or absence of the IEP under
splicing conditions, and then irradiated on ice by an ultraviolet (UV) lamp (Spectroline)
at 254 nm emission (UV-C). The cross-linked ribonucleoprotein complexes (RNPs) were
digested by RNase T1 (Thermo Fisher Scientific) at low or high concentration, followed
by RNase inactivation using SUPERaseIn (Thermo Fisher Scientific), and then treated
with [γ-32P]-ATP and T4 polynucleotide kinase (Epicentre) for 5’-labeling and 3’-
dephosphorylation of the RNA fragments. The 32P-labeled RNase-digested RNPs were
analyzed by SDS-PAGE followed by nitrocellulose membrane transfer. An
autoradiogram of the membrane (Fig. 5.4B) showed labeled bands with a molecular
weight (MW) higher than 80 kDa (the MW of the IEP alone), which was indicated on the
autoradiogram using an 80kDa marker from an unlabeled protein ladder. RNA fragments
of the GsI-IIC intron RNA were released from the membrane by digestion with proteinase
K (Thermo Fisher Scientific) in the presence of 7 M urea, and ethanol-precipitated. The
purified RNA fragments were used for RNA-seq library construction by using the
TGIRT-seq small RNA/CircLigase method to attach RNA-seq adapter sequences via
template-switching during cDNA synthesis without the use of RNA ligase. Samples were
sequenced on an Illumina MiSeq instrument.
Sequencing analysis of the reads that mapped to the GsI-IIC intron RNA with
distinctive 5’ ends (Fig. 5.4C), which represent positions of the cross-linking sites,
revealed a number of nucleotides potentially involved in direct interactions with the IEP
in DI, DIV and DVI. These regions have been previously shown to be involved in IEP
binding in the Ll.LtrB group IIA intron (Dai et al., 2008). The cross-linking sites
identified here include ε, γ, EBS1 and EBS3, DIVa, which is a known high-affinity
IEP-binding site in the Ll.LtrB intron (Wank et al., 1999; Matsuura et al., 2001), and the
guanosine opposite the branch-point adenosine (Fig. 5.4D).
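The read-start analysis described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the toy alignments, and the criterion for a "distinctive" 5’ end (a minimum read count and a fold-enrichment over the mean read-start count) are hypothetical and were chosen for illustration.

```python
from collections import Counter


def read_start_counts(alignments):
    """Count reads whose 5' end maps to each position.

    `alignments` is an iterable of (start, end) pairs, 1-based and inclusive,
    for reads already mapped to the intron RNA in the sense orientation.
    """
    return Counter(start for start, _end in alignments)


def distinctive_starts(counts, fold=3.0, min_reads=5):
    """Return positions whose read-start count stands out.

    Hypothetical criterion (an assumption, not from the text): a position is
    kept if it has at least `min_reads` read starts and at least `fold` times
    the mean count over positions with one or more read starts.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(
        pos for pos, n in counts.items() if n >= min_reads and n >= fold * mean
    )


# Toy alignments (invented): 20 reads start at position 120, a few elsewhere.
toy = [(120, 150)] * 20 + [(300, 330)] * 2 + [(301, 329)] + [(410, 440)]
counts = read_start_counts(toy)
print(distinctive_starts(counts))  # [120]
```

Because UV also induces RNA-RNA cross-links, a real analysis would compare these counts between the samples prepared with and without the IEP rather than rely on a single threshold.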
5.3 DISCUSSION
Here, I established a protein-assisted in vitro splicing system for the group IIC
intron GsI-IIC, and analyzed its secondary structure and interaction sites with the IEP.
Moreover, I demonstrated the usefulness of TGIRT enzymes in the SHAPE and CRAC
procedures used for mapping RNA secondary structures and RNA-protein interactions.
By using TGIRT-SHAPE, I mapped the secondary structure of a 722-nt highly structured
GsI-IIC intron RNA at single-nucleotide resolution using a single primer annealed to its
3’ end. The secondary structure of the GsI-IIC intron RNA obtained by TGIRT-SHAPE
is consistent with that predicted based on phylogenetic studies, suggesting not only
efficiency, but also accuracy of the method. Furthermore, by using TGIRT-CRAC, I
showed that potential interaction sites between the GsI-IIC intron RNA and its IEP reside
in the DI, DIV, and DVI regions, which are known to bind the IEP in group IIA introns
(Wank et al., 1999; Matsuura et al., 2001; Dai et al., 2008). Most of the identified
nucleotides are involved in making long-range RNA tertiary contacts, suggesting the IEP
functions to facilitate formation of active intron RNA structures during splicing. Since
UV also induces RNA-RNA cross-linking, further investigation will be conducted to
show protein-dependent enrichment of the cross-linked sites identified here.
The use of the TGIRT-seq small RNA/CircLigase method in CRAC allows
construction of the RNA-seq libraries without RNA ligation, eliminates two steps from
the original protocol, and greatly reduces RNA sample loss, which is critical in
procedures like CRAC and CLIP. Finally, the new TGIRT-seq total RNA method (see
Chapter 3) will further improve the speed and efficiency of CRAC and CLIP procedures,
facilitating the identification of global RNA-protein interactions in vivo. I am currently
collaborating with Dr. Robert Krug’s research group at the University of Texas at Austin
to apply these methods to identify RNAs bound by the influenza virus NS1A and NS1B
proteins, and with Dr. Michael Gale, Jr.’s research group at the University of Washington
to identify pathogen-associated molecular patterns (PAMPs) recognized by the MDA5 protein.
Both projects will contribute new insights into the fundamental understanding of
innate immunity against viral infection and virus-host interactions.
5.4 MATERIALS AND METHODS
5.4.1 Recombinant plasmids
Recombinant plasmids used for in vitro transcription contain the GsI-IIC intron
precursor RNA sequence cloned downstream of a phage T7 promoter and upstream of a BamHI
recognition site in a pUC19 vector (New England BioLabs). I made three
constructs that express different GsI-IIC intron precursor RNAs. They comprise
the same ΔORF 656-nt intron and 32-nt 3’ exon, but different 5’ exons: 55 nt for
GsI2c5532, 46 nt for GsI2c4632, and 35 nt for GsI2c3532. Plasmid GsI2c3532ΔA differs
from GsI2c3532 by deletion of the single branch-point adenosine in the intron.
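As a quick consistency check, the precursor lengths implied by these constructs can be tallied. The sketch below assumes that each run-off transcript consists only of the 5’ exon, the intron, and the 3’ exon, with no extra vector-derived nucleotides; that assumption, and the function and variable names, are introduced here for illustration. Under it, the ΔA construct comes out at 722 nt, matching the length quoted in the Discussion.

```python
INTRON_NT = 656  # ΔORF intron length given in the text
EXON3_NT = 32    # 3' exon length given in the text
EXON5_NT = {"GsI2c5532": 55, "GsI2c4632": 46, "GsI2c3532": 35}


def precursor_length(name, branch_point_deleted=False):
    """Precursor length in nt: 5' exon + intron + 3' exon.

    Assumes no additional vector-derived nucleotides in the transcript.
    """
    length = EXON5_NT[name] + INTRON_NT + EXON3_NT
    return length - 1 if branch_point_deleted else length


for name in EXON5_NT:
    print(name, precursor_length(name))
print("GsI2c3532ΔA", precursor_length("GsI2c3532", branch_point_deleted=True))
```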
5.4.2 Preparation of GsI-IIC intron RNA and IEP
GsI-IIC intron RNA was transcribed in vitro from 1 µg of recombinant plasmid,
which was linearized by BamHI (New England BioLabs), using a mutant T7 polymerase
that does not pause or terminate at a variety of signals, including a terminator found
fortuitously in the human preproparathyroid hormone (PTH) gene, a pause site found in
the concatemer junction (CJ) of replicating T7 DNA, and termination signals that are also
utilized by Escherichia coli RNAP (e.g., rrnB T2) (Lyakhov et al., 1997). Transcription
was done at 37°C for 2 h in reaction medium containing 40 mM Tris-HCl (pH 7.9), 10
mM DTT, 2 mM spermidine, 6 mM MgCl2, 1 mM GTP, 1 mM CTP, 1 mM ATP, 1 mM
UTP, and 4 mM dTTP to sequester extra Mg2+, which favors hydrolytic splicing.
Transcripts were treated with 2 units of DNase I (New England BioLabs) at 37°C for 10
min according to the manufacturer’s protocol, extracted with phenol-chloroform-isoamyl
alcohol (25:24:1), and purified with a Sephadex G-50 column (Sigma-Aldrich). The
TGIRT enzymes, GsI-IIC-MRF used for splicing, and TeI4c-MRF used for primer
extension and TGIRT-seq, were expressed and purified as described previously (Mohr et
al., 2013).
5.4.3 GsI-IIC intron splicing
For time-course RNA splicing reactions in vitro, 30 nM precursor RNA internally
labeled with [α-32P]-UTP was denatured by heating at 85°C for 2 min in double-distilled
H2O (ddH2O) and then renatured at 50°C for 2 min in reaction medium containing 450
mM KCl, 20 mM Tris-HCl (pH 7.5), and 5 mM MgCl2. To measure splicing rates at
different IEP concentrations, splicing reactions were initiated by adding 0, 30, 60, or 150
nM GsI-IIC-MRF, incubated at 50°C, and then terminated at different time points by
adding stop solution containing 0.25 M EDTA and 0.2% SDS. The splicing products
were analyzed in a denaturing 4% polyacrylamide gel, which was scanned with a
Phosphorimager (GE Healthcare). Band intensities were quantified by using ImageQuant
TL (GE Healthcare) and plotted using SigmaPlot (Systat Software Inc).
5.4.4 TGIRT-SHAPE
The in vitro transcript GsI2c3532ΔA incubated under splicing conditions (2 pmol
in 9 µl of splicing buffer) was added to 1 µl of freshly prepared isatoic anhydride (50 mM
in DMSO; Sigma-Aldrich) or 1 µl of DMSO as a negative control. The RNA was
incubated at 37°C for 36 min (~5 half-lives for isatoic anhydride) and ethanol
precipitated (3 volumes of ethanol, one-tenth volume of 3 M sodium acetate, pH 5.2, and
1 µl of 20 mg/ml glycogen). Alternatively, if available, the SHAPE reagent
1-methyl-7-nitroisatoic anhydride (1M7) can be used to capture faster 2’-OH
conformational dynamics (t1/2 ~20 s; incubation at 37°C for 3 min) (Weeks and Mauger,
2011). Primer extension of the
SHAPE-modified or control RNAs was carried out using a fluorescently labeled primer A
(5’-/Cy5/CAT ACA ACG CCT TTT TCT CTC CAG G-3’; IDT), which anneals near the
3’ end of the RNA. The annealed template-primer substrate was pre-incubated with 2 µM
TeI4c-MRF RT at room temperature for 30 min in 28.2 µl of reaction medium containing
450 mM NaCl, 20 mM Tris-HCl (pH 7.5), 5 mM MgCl2, and 5 mM fresh DTT. Reverse
transcription reactions were initiated by adding 1.8 µl of 25 mM dNTPs (final
concentration 1.5 mM) and incubated at 60°C for 1 h. Reverse transcription using
SuperScript III (Invitrogen) was done in parallel according to the manufacturer’s
protocol. Reactions were stopped by adding 1 µl of 5 M NaOH to a final concentration of
0.1 M, incubating at 95°C for 3 min, and neutralizing with an equal volume of 5 M HCl.
The resulting cDNAs were then ethanol precipitated, as described above for the
GsI2c3532ΔA in vitro transcript, and dissolved in 40 µl of Hi-Di formamide or another
capillary electrophoresis instrument-specific loading solution. Sequencing reactions were
performed using TeI4c-MRF or SSIII RT following the methods described above, except
that unmodified RNA was used as the template, and a Cy5.5-labeled primer B (IDT) of
identical sequence to primer A and 1.5 mM ddCTP were added to the reaction.
Cy5-labeled cDNAs synthesized from SHAPE-modified RNA or control RNA were mixed
with Cy5.5-labeled cDNAs from sequencing reactions and electrophoresed in a single
capillary of a GenomeLab™ GeXP Genetic Analysis System (Beckman Coulter).
Samples were denatured at 90°C for 180 sec, injected into the capillary array at 2.0 kV
for 30 sec, and separated at 4.8 kV for 80 min. The temperature of the capillary array was
maintained at 60°C throughout the separation. The raw capillary electrophoresis data
were analyzed with the automated QuShape software (Karabiber et al., 2013). SHAPE
reactivities were then used as constraints in RNAstructure software (Reuter and
Mathews, 2010) to predict the secondary structure of the GsI-IIC intron RNA.
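The hand-off from per-nucleotide reactivities to the structure-prediction step can be sketched as below. The sketch assumes the plain two-column SHAPE data format described in the RNAstructure documentation (nucleotide number, then reactivity, with values below -500 meaning "no data"); the function name, the `NO_DATA` placeholder, and the toy values are hypothetical, and the exact command-line option used to pass such a file to RNAstructure should be taken from its documentation rather than from this example.

```python
NO_DATA = -999.0  # assumed placeholder for nucleotides without a reactivity


def format_shape_constraints(reactivities):
    """Format reactivities as 'position<TAB>reactivity' lines.

    `reactivities` is a list ordered from nucleotide 1; entries that are None
    (for example, positions under the primer) are written as NO_DATA.
    """
    lines = []
    for position, value in enumerate(reactivities, start=1):
        if value is None:
            value = NO_DATA
        lines.append(f"{position}\t{value:.3f}")
    return "\n".join(lines) + "\n"


# Toy reactivities (invented values, not data from this work).
toy = [0.05, 1.20, None, 0.43]
print(format_shape_constraints(toy), end="")
```

Writing the returned text to a file gives one line per nucleotide, which keeps positions without data explicit instead of silently shifting the numbering.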
5.4.5 TGIRT-CRAC
The in vitro transcript GsI2c3532ΔA incubated under splicing conditions (500 nM
in 100 µl of splicing buffer) in the absence or presence of its IEP GsI-IIC-MRF (1 µM)
was irradiated on ice by a Spectroline ultraviolet (UV) lamp at 254 nm emission (UV-C)
for 10 min. The cross-linked RNA-protein complexes (50 µl) were digested with RNase
T1 (final concentration 0.08 U/µL or 4 U/µL; Thermo Fisher Scientific) at 37°C for 30
min, followed by incubation with 1 U/µL SUPERaseIn (Thermo Fisher Scientific) at
37°C for 3 h to inactivate the RNases. The digested products were treated with 0.5 U/µL
T4 polynucleotide kinase (New England BioLabs) and 100 µCi [γ-32P]-ATP at 37°C for
15 min. The radiolabeled RNA-protein complexes were analyzed in a NuPAGE 4-12%
Bis-Tris gel (Thermo Fisher Scientific), transferred to a 100% nitrocellulose membrane
(Invitrogen), exposed overnight, and scanned with a Phosphorimager (GE Healthcare).
Fragments of the GsI-IIC intron RNA were released from the cross-linked RNA-protein
complexes by digesting GsI-IIC-MRF with 4 mg/L proteinase K (Thermo Fisher Scientific)
in the presence of 7 M urea, and ethanol-precipitated using the published protocol (Ule et
al., 2005).
The purified RNA fragments were used for RNA-seq library construction by the
TGIRT-seq small RNA/CircLigase method. Template-switching reactions were done
using an initial template-primer substrate consisting of a 41-nt RNA oligonucleotide (5'-
AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC/3SpC3/-
3'), which contains both the Illumina Read 1 and 2 primer-binding sites (Read 1,2 RNA)
and a 3' blocking group (C3 Spacer, 3SpC3; IDT), annealed to a complementary
32P-labeled DNA primer, leaving an equimolar mixture of A, C, G, and T overhangs. For
reverse transcription reactions, the initial template-primer substrate (100 nM) was mixed
with 10 µL of cross-linked RNA fragments and 2 µM TeI4c-MRF RT in reaction
medium containing 450 mM NaCl, 5 mM MgCl2, 20 mM Tris-HCl pH 7.5, 1 mM DTT
and 1 mM dNTPs at room temperature. The reactions were initiated by raising the
temperature to 60°C, incubated for 15 min and terminated by adding 1 M NaOH to a final
concentration of 0.1 M, incubating at 95°C for 3 min, and neutralizing with 1 M HCl. The
resulting cDNAs were purified in a denaturing 10% polyacrylamide gel, electroeluted
using a D-tube Dialyzer Maxi with MWCO of 6-8 kDa (EMD Millipore), and ethanol
precipitated with 0.3 M sodium acetate in the presence of 25 µg of linear acrylamide
(Thermo Fisher Scientific). The purified cDNAs were then circularized with CircLigase
II (Epicentre), extracted with phenol-chloroform-isoamyl alcohol (25:24:1), ethanol
precipitated, amplified with Phusion-HF (Thermo Fisher Scientific) and Illumina
multiplex and barcode primers for 15 cycles of 98°C for 5 sec, 60°C for 10 sec and 72°C
for 15 sec, and sequenced on an Illumina MiSeq instrument.
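The geometry of the initial template-primer substrate can be illustrated with a short sketch. The RNA sequence is the 41-nt Read 1,2 RNA given above, with the 3' blocking group omitted. The DNA primer sequence is not given in the text, so it is derived here as the simple reverse complement, and the overhang is modeled as a single extra nucleotide at the primer 3' end; both points, and all function names, are assumptions made for illustration only.

```python
# 41-nt Read 1,2 RNA from the text (spaces and the 3SpC3 blocker removed).
RNA_OLIGO = "AGAUCGGAAGAGCACACGUCUAGUUCUACAGUCCGACGAUC"

DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def complementary_dna(rna):
    """Return the DNA strand (5'->3') complementary to an RNA given 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(rna))


def primer_pool(rna, overhang_bases="ACGT"):
    """Hypothetical primer mixture: complement plus one extra 3' nucleotide.

    Each primer ends in a different base, modeling an equimolar mixture of
    A, C, G, and T overhangs (single-nucleotide length is an assumption).
    """
    core = complementary_dna(rna)
    return [core + base for base in overhang_bases]


print(len(RNA_OLIGO))
for primer in primer_pool(RNA_OLIGO):
    print(primer)
```

In this picture the overhanging nucleotide can pair with the 3' end of an incoming RNA fragment, which is how the adapter sequence becomes joined to the cDNA without an RNA ligation step.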
Figure 5.1: Determining the optimal exon length for in vitro splicing of the GsI-IIC
intron.
The splicing reactions using 30 nM (A) GsI2c5532, (B) GsI2c4632 and (C)
GsI2c3532 precursor RNAs with 5’ exon lengths of 55 nt, 46 nt and 35 nt, respectively,
were done in reaction medium containing 450 mM KCl, 20 mM Tris-HCl (pH 7.5), and 5
mM Mg2+ at 50°C in the absence or presence of 60 nM GsI-IIC-MRF. At various time
points, 10 µl of the splicing reaction was withdrawn, terminated with 0.25 M EDTA and
0.2% SDS, and analyzed in a denaturing 4% polyacrylamide gel. The gel was dried and
scanned with a Phosphorimager (GE Healthcare).
Figure 5.2: Determining the optimal IEP concentration for in vitro splicing of the GsI-IIC
intron.
(A) Splicing of GsI2c3532. A time course of GsI2c3532 splicing was done in
the absence of IEP or at a one- (1X), two- (2X) or five-fold (5X) molar ratio of IEP to
precursor RNA, as described in Figure 5.1. (B) Plot showing percentage of lariat formation
as a function of time. The protein-assisted splicing of the GsI-IIC intron was biphasic,
with an initial fast phase followed by a slow phase, and optimal splicing occurred at a 1:1
molar ratio between the IEP and the intron RNA (fast phase, 5.7/min and slow phase, 0.14/min).
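A biphasic time course of this kind is commonly described by the sum of two first-order phases. The sketch below assumes that form; the rate constants are the values quoted in the legend, whereas the amplitudes of the two phases are not given here and the values used (0.4 each) are placeholders for illustration only.

```python
import math

K_FAST = 5.7   # per min, fast phase (from the legend)
K_SLOW = 0.14  # per min, slow phase (from the legend)


def lariat_fraction(t_min, a_fast, a_slow, k_fast=K_FAST, k_slow=K_SLOW):
    """Fraction of lariat formed at time t for a two-phase first-order model.

    f(t) = a_fast * (1 - exp(-k_fast * t)) + a_slow * (1 - exp(-k_slow * t))
    The amplitudes a_fast and a_slow are assumptions supplied by the caller.
    """
    fast = a_fast * (1.0 - math.exp(-k_fast * t_min))
    slow = a_slow * (1.0 - math.exp(-k_slow * t_min))
    return fast + slow


def half_time(k):
    """Half-time in minutes of a first-order phase with rate constant k."""
    return math.log(2) / k


for t in (0, 0.5, 1, 5, 20):
    print(t, round(lariat_fraction(t, a_fast=0.4, a_slow=0.4), 3))
print(round(half_time(K_FAST), 3), round(half_time(K_SLOW), 3))
```

With rate constants this far apart, the fast phase is essentially complete within the first minute, so early time points mainly determine the fast-phase rate and later ones the slow-phase rate.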
Figure 5.3: SHAPE analysis of the GsI-IIC intron RNA.
[Figure 5.3, panels A and B. Axis label: nucleotide position. Region labels: 5’ exon, DI, DII, DIII, DIV, DV, DVI, 3’ exon.]
(A) Plot of SHAPE reactivities. The in vitro GsI2c3532ΔA transcript was
incubated under splicing conditions, modified by isatoic anhydride and reverse
transcribed by TeI4c-MRF (Materials and Methods). SHAPE reactivities were calculated
for each nucleotide by using QuShape (Karabiber et al., 2013). Red represents high
reactivity, yellow represents medium reactivity, and black represents no reactivity. (B) The
secondary structure of the GsI-IIC intron RNA predicted by RNAstructure (Reuter and
Mathews, 2010) using SHAPE reactivities as constraints. A stem-loop region from DIII
of the intron RNA is enlarged and used as an example to compare cDNA traces
produced by TeI4c-MRF or by SSIII in capillary electrophoresis. Peaks in the traces
represent reverse transcription stops at single-nucleotide resolution. TeI4c-MRF
stopped only at SHAPE-modification sites in the RNA and yielded a structure that matched
the predicted base-pairing interactions in the short stem, whereas the data from SSIII,
which has a greater propensity for premature termination during reverse transcription,
did not indicate stable base pairing in the short stem. The nucleotides in the stem loop
are colored to indicate
SHAPE reactivities as shown in (A). EBS, exon-binding site; IBS, intron-binding site.
Nucleotide sequences involved in long-range tertiary interactions are boxed, circled or
indicated by arrows and are assigned with Greek letters.
Figure 5.4: Mapping of protein binding sites in GsI-IIC intron RNA.
(A) TGIRT-CRAC method. Protein and RNA are irradiated with UV light under
the desired conditions and digested by RNases, followed by RNase inactivation, RNA 5’-end
labeling with [γ-32P]-ATP, and 3’-end dephosphorylation. RNA-protein complexes are
analyzed by SDS-PAGE and transferred to a membrane. RNA fragments are released
from the membrane and subjected to RNA-seq library construction by using the
TGIRT-seq small RNA/CircLigase method. (B) Cross-linked GsI-IIC RNA-IEP complexes on a
nitrocellulose membrane. The in vitro transcript GsI2c3532ΔA was incubated in the
absence or presence of its IEP GsI-IIC-MRF under splicing conditions, irradiated, and
digested by RNase T1 at low or high concentration. The RNA-IEP complexes had
molecular weights higher than 80 kDa (that of GsI-IIC-MRF alone) on the membrane. (C)
The coverage map of RNA-seq reads. Reads were mapped to the GsI-IIC intron RNA and the
number of hits at each nucleotide position was plotted. (D) Predicted IEP binding sites
shown on the secondary structure of the GsI-IIC intron RNA. Cross-linked sites were
identified as distinctive read start sites in (C) and are shown by red arrowheads. EBS,
exon-binding site; IBS, intron-binding site. Nucleotide sequences involved in long-range
tertiary interactions are boxed, circled or indicated by arrows and are assigned with Greek
letters.
Bibliography
Abbas, Y.M., Pichlmair, A., Górna, M.W., Superti-Furga, G., and Nagar, B. (2013).
Structural basis for viral 5’-PPP-RNA recognition by human IFIT proteins. Nature 494,
60–64.
Abbott, J.A., Francklyn, C.S., and Robey-Bond, S.M. (2014). Transfer RNA and human
disease. Front. Genet. 5, 158.
Agris, P.F., Vendeix, F.A.P., and Graham, W.D. (2007). tRNA’s wobble decoding of the
genome: 40 years of modification. J. Mol. Biol. 366, 1–13.
Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count
data. Genome Biol. 11, R106.
Anderson, P., and Ivanov, P. (2014). tRNA fragments in human health and disease. FEBS
Lett. 588, 4297–4304.
Ansmant, I., Motorin, Y., Massenet, S., Grosjean, H., and Branlant, C. (2001).
Identification and characterization of the tRNA:Psi 31-synthase (Pus6p) of
Saccharomyces cerevisiae. J. Biol. Chem. 276, 34934–34940.
Arroyo, J.D., Chevillet, J.R., Kroh, E.M., Ruf, I.K., Pritchard, C.C., Gibson, D.F.,
Mitchell, P.S., Bennett, C.F., Pogosova-Agadjanyan, E.L., Stirewalt, D.L., et al. (2011).
Argonaute2 complexes carry a population of circulating microRNAs independent of
vesicles in human plasma. Proc. Natl. Acad. Sci. U. S. A. 108, 5003–5008.
Astuti, D., Morris, M.R., Cooper, W.N., Staals, R.H.J., Wake, N.C., Fews, G.A., Gill, H.,
Gentle, D., Shuib, S., Ricketts, C.J., et al. (2012). Germline mutations in DIS3L2 cause
the Perlman syndrome of overgrowth and Wilms tumor susceptibility. Nat. Genet. 44,
277–284.
Baranauskas, A., Paliksa, S., Alzbutas, G., Vaitkevicius, M., Lubiene, J., Letukiene, V.,
Burinskas, S., Sasnauskas, G., and Skirgaila, R. (2012). Generation and characterization
of new highly thermostable and processive M-MuLV reverse transcriptase variants.
Protein Eng. Des. Sel. 25, 657–668.
Batista, P.J., and Chang, H.Y. (2013). Long noncoding RNAs: cellular address codes in
development and disease. Cell 152, 1298–1307.
Beckman, R.A., Mildvan, A.S., and Loeb, L.A. (1985). On the fidelity of DNA
replication: manganese mutagenesis in vitro. Biochemistry 24, 5810–5817.
Been, M.D., and Wickham, G.S. (1997). Self-cleaving ribozymes of hepatitis delta virus
RNA. Eur. J. Biochem. 247, 741–753.
Bergsagel, P.L., Mateos, M.-V., Gutierrez, N.C., Rajkumar, S.V., and San Miguel, J.F.
(2013). Improving overall survival and overcoming adverse prognosis in the treatment of
cytogenetically high-risk multiple myeloma. Blood 121, 884–892.
Bibillo, A., and Eickbush, T.H. (2002). High processivity of the reverse transcriptase
from a non-long terminal repeat retrotransposon. J. Biol. Chem. 277, 34836–34845.
Black, D.L. (2000). Protein diversity from alternative splicing: a challenge for
bioinformatics and post-genome biology. Cell 103, 367–370.
Black, D.L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev.
Biochem. 72, 291–336.
Blocker, F.J.H., Mohr, G., Conlan, L.H., Qi, L., Belfort, M., and Lambowitz, A.M.
(2005). Domain structure and three-dimensional model of a group II intron-encoded
reverse transcriptase. RNA 11, 14–28.
Brandman, O., Stewart-Ornstein, J., Wong, D., Larson, A., Williams, C.C., Li, G.-W.,
Zhou, S., King, D., Shen, P.S., Weibezahn, J., et al. (2012). A ribosome-bound quality
control complex triggers degradation of nascent peptides and signals translation stress.
Cell 151, 1042–1054.
Brown, J.B., Boley, N., Eisman, R., May, G.E., Stoiber, M.H., Duff, M.O., Booth, B.W.,
Wen, J., Park, S., Suzuki, A.M., et al. (2014). Diversity and dynamics of the Drosophila
transcriptome. Nature.
Brunner, A.L., Beck, A.H., Edris, B., Sweeney, R.T., Zhu, S.X., Li, R., Montgomery, K.,
Varma, S., Gilks, T., Guo, X., et al. (2012). Transcriptional profiling of long non-coding
RNAs and novel transcribed regions across a diverse panel of archived human cancers.
Genome Biol. 13, R75.
Burgos, K.L., Javaherian, A., Bomprezzi, R., Ghaffari, L., Rhodes, S., Courtright, A.,
Tembe, W., Kim, S., Metpally, R., and Van Keuren-Jensen, K. (2013). Identification of
extracellular miRNA in human cerebrospinal fluid by next-generation sequencing. RNA
19, 712–722.
Burnett, B.P., and McHenry, C.S. (1997). Posttranscriptional modification of retroviral
primers is required for late stages of DNA replication. Proc. Natl. Acad. Sci. U. S. A. 94,
7210–7215.
Byron, S.A., Van Keuren-Jensen, K.R., Engelthaler, D.M., Carpten, J.D., and Craig,
D.W. (2016). Translating RNA sequencing into clinical diagnostics: opportunities and
challenges. Nat. Rev. Genet. 17, 257–271.
Cabili, M.N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., and Rinn,
J.L. (2011). Integrative annotation of human large intergenic noncoding RNAs reveals
global properties and specific subclasses. Genes Dev. 25, 1915–1927.
Candales, M.A., Duong, A., Hood, K.S., Li, T., Neufeld, R.A.E., Sun, R., McNeil, B.A.,
Wu, L., Jarding, A.M., and Zimmerly, S. (2012). Database for bacterial group II introns.
Nucleic Acids Res. 40, D187–D190.
Chan, P.P., and Lowe, T.M. (2009). GtRNAdb: a database of transfer RNA genes
detected in genomic sequence. Nucleic Acids Res. 37, D93–D97.
Chang, H.-M., Triboulet, R., Thornton, J.E., and Gregory, R.I. (2013). A role for the
Perlman syndrome exonuclease Dis3l2 in the Lin28-let-7 pathway. Nature 497, 244–248.
Chen, B., and Lambowitz, A.M. (1997). De novo and DNA primer-mediated initiation of
cDNA synthesis by the mauriceville retroplasmid reverse transcriptase involve
recognition of a 3’ CCA sequence. J. Mol. Biol. 271, 311–332.
Chen, X., and Wolin, S.L. (2004). The Ro 60 kDa autoantigen: insights into cellular
function and role in autoimmunity. J. Mol. Med. 82, 232–239.
Chen, R., Mias, G.I., Li-Pook-Than, J., Jiang, L., Lam, H.Y.K., Chen, R., Miriami, E.,
Karczewski, K.J., Hariharan, M., Dewey, F.E., et al. (2012). Personal omics profiling
reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307.
Chen, X., Taylor, D.W., Fowler, C.C., Galan, J.E., Wang, H.-W., and Wolin, S.L. (2013).
An RNA degradation machine sculpted by Ro autoantigen and noncoding RNA. Cell
153, 166–177.
Christov, C.P., Gardiner, T.J., Szüts, D., and Krude, T. (2006). Functional requirement of
noncoding Y RNAs for human chromosomal DNA replication. Mol. Cell. Biol. 26,
6993–7004.
Chu, D., Barnes, D.J., and von der Haar, T. (2011). The role of tRNA and ribosome
competition in coupling the expression of different mRNAs in Saccharomyces cerevisiae.
Nucleic Acids Res. 39, 6705–6714.
Chu, J., Hong, N.A., Masuda, C.A., Jenkins, B.V., Nelms, K.A., Goodnow, C.C., Glynne,
R.J., Wu, H., Masliah, E., Joazeiro, C.A.P., et al. (2009). A mouse forward genetics
screen identifies LISTERIN as an E3 ubiquitin ligase involved in neurodegeneration.
Proc. Natl. Acad. Sci. U. S. A. 106, 2097–2103.
Clark, J.M. (1988). Novel non-templated nucleotide addition reactions catalyzed by
procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res. 16, 9677–9686.
Cocquet, J., Chong, A., Zhang, G., and Veitia, R.A. (2006). Reverse transcriptase
template switching and false alternative transcripts. Genomics 88, 127–131.
Conlan, L.H., Stanger, M.J., Ichiyanagi, K., and Belfort, M. (2005). Localization,
mobility and fidelity of retrotransposed Group II introns in rRNA genes. Nucleic Acids
Res. 33, 5262–5270.
Cousineau, B., Smith, D., Lawrence-Cavanagh, S., Mueller, J.E., Yang, J., Mills, D.,
Manias, D., Dunny, G., Lambowitz, A.M., and Belfort, M. (1998). Retrohoming of a
bacterial group II intron: mobility via complete reverse splicing, independent of
homologous DNA recombination. Cell 94, 451–462.
Crick, F.H. (1966). Codon--anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19,
548–555.
Croce, C.M. (2009). Causes and consequences of microRNA dysregulation in cancer.
Nat. Rev. Genet. 10, 704–714.
Cui, X., Matsuura, M., Wang, Q., Ma, H., and Lambowitz, A.M. (2004). A group II
intron-encoded maturase functions preferentially in cis and requires both the reverse
transcriptase and X domains to promote RNA splicing. J. Mol. Biol. 340, 211–231.
Daffis, S., Szretter, K.J., Schriewer, J., Li, J., Youn, S., Errett, J., Lin, T.-Y., Schneller,
S., Zust, R., Dong, H., et al. (2010). 2’-O methylation of the viral mRNA cap evades host
restriction by IFIT family members. Nature 468, 452–456.
Dai, L., Chai, D., Gu, S.-Q., Gabel, J., Noskov, S.Y., Blocker, F.J.H., Lambowitz, A.M.,
and Zimmerly, S. (2008). A three-dimensional model of a group II intron RNA and its
interaction with the intron-encoded reverse transcriptase. Mol. Cell 30, 472–485.
Decroly, E., Ferron, F., Lescar, J., and Canard, B. (2012). Conventional and
unconventional mechanisms for capping viral mRNA. Nat. Rev. Microbiol. 10, 51–65.
Defenouillère, Q., Yao, Y., Mouaikel, J., Namane, A., Galopier, A., Decourty, L., Doyen,
A., Malabat, C., Saveanu, C., Jacquier, A., et al. (2013). Cdc48-associated complex
bound to 60S particles is required for the clearance of aberrant translation products. Proc.
Natl. Acad. Sci. U. S. A. 110, 5046–5051.
Delannoy, E., Le Ret, M., Faivre-Nitschke, E., Estavillo, G.M., Bergdoll, M., Taylor,
N.L., Pogson, B.J., Small, I., Imbault, P., and Gualberto, J.M. (2009). Arabidopsis tRNA
adenosine deaminase arginine edits the wobble nucleotide of chloroplast tRNAArg(ACG)
and is essential for efficient chloroplast translation. Plant Cell 21, 2058–2071.
Dhahbi, J.M., Spindler, S.R., Atamna, H., Yamakawa, A., Boffelli, D., Mote, P., and
Martin, D.I.K. (2013a). 5’ tRNA halves are present as abundant complexes in serum,
concentrated in blood cells, and modulated by aging and calorie restriction. BMC
Genomics 14, 298.
Dhahbi, J.M., Spindler, S.R., Atamna, H., Boffelli, D., Mote, P., and Martin, D.I.K.
(2013b). 5’-YRNA fragments derived by processing of transcripts from specific YRNA
genes and pseudogenes are abundant in human serum and plasma. Physiol. Genomics 45,
990–998.
Diamond, M.S., and Farzan, M. (2013). The broad-spectrum antiviral functions of IFIT
and IFITM proteins. Nat. Rev. Immunol. 13, 46–57.
Dittmar, K.A., Sørensen, M.A., Elf, J., Ehrenberg, M., and Pan, T. (2005). Selective
charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–
157.
Dittmar, K.A., Goodenbour, J.M., and Pan, T. (2006). Tissue-specific differences in
human transfer RNA expression. PLoS Genet. 2, e221.
Elagib, K.E., Rubinstein, J.D., Delehanty, L.L., Ngoh, V.S., Greer, P.A., Li, S., Lee, J.K.,
Li, Z., Orkin, S.H., Mihaylov, I.S., et al. (2013). Calpain 2 activation of P-TEFb drives
megakaryocyte morphogenesis and is disrupted by leukemogenic GATA1 mutation. Dev.
Cell 27, 607–620.
EL Andaloussi, S., Mäger, I., Breakefield, X.O., and Wood, M.J.A. (2013). Extracellular
vesicles: biology and emerging therapeutic opportunities. Nat. Rev. Drug Discov. 12,
347–357.
Enyeart, P.J., Mohr, G., Ellington, A.D., and Lambowitz, A.M. (2014). Biotechnological
applications of mobile group II introns and their reverse transcriptases: gene targeting,
RNA-seq, and non-coding RNA analysis. Mob. DNA 5, 2.
Esteller, M. (2011). Non-coding RNAs in human disease. Nat. Rev. Genet. 12, 861–874.
Fabbri, M., Paone, A., Calore, F., Galli, R., Gaudio, E., Santhanam, R., Lovat, F., Fadda,
P., Mao, C., Nuovo, G.J., et al. (2012). MicroRNAs bind to Toll-like receptors to induce
prometastatic inflammatory response. Proc. Natl. Acad. Sci. U. S. A. 109, E2110–E2116.
Falnes, P.Ø., Johansen, R.F., and Seeberg, E. (2002). AlkB-mediated oxidative
demethylation reverses DNA damage in Escherichia coli. Nature 419, 178–182.
Fan, H.C., Blumenfeld, Y.J., Chitkara, U., Hudgins, L., and Quake, S.R. (2008).
Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal
blood. Proc. Natl. Acad. Sci. U. S. A. 105, 16266–16271.
Faustino, N.A., and Cooper, T.A. (2003). Pre-mRNA splicing and human disease. Genes
Dev. 17, 419–437.
Feng, F., Yuan, L., Wang, Y.E., Crowley, C., Lv, Z., Li, J., Liu, Y., Cheng, G., Zeng, S.,
and Liang, H. (2013). Crystal structure and nucleotide selectivity of human IFIT5/ISG58.
Cell Res. 23, 1055–1058.
Fu, H., Feng, J., Liu, Q., Sun, F., Tie, Y., Zhu, J., Xing, R., Sun, Z., and Zheng, X.
(2009). Stress induces tRNA cleavage by angiogenin in mammalian cells. FEBS Lett.
583, 437–442.
Gerber, A.P., and Keller, W. (1999). An adenosine deaminase that generates inosine at
the wobble position of tRNAs. Science 286, 1146–1149.
Ghosh, A., and Lima, C.D. (2010). Enzymology of RNA cap synthesis. Wiley
Interdiscip. Rev. RNA 1, 152–172.
Gingold, H., Tehler, D., Christoffersen, N.R., Nielsen, M.M., Asmar, F., Kooistra, S.M.,
Christophersen, N.S., Christensen, L.L., Borre, M., Sørensen, K.D., et al. (2014). A dual
program for translation regulation in cellular proliferation and differentiation. Cell 158,
1281–1292.
Golinelli, M.-P., and Hughes, S.H. (2002). Nontemplated nucleotide addition by HIV-1
reverse transcriptase. Biochemistry 41, 5894–5906.
Goodarzi, H., Liu, X., Nguyen, H.C.B., Zhang, S., Fish, L., and Tavazoie, S.F. (2015).
Endogenous tRNA-Derived Fragments Suppress Breast Cancer Progression via YBX1
Displacement. Cell 161, 790–802.
Goubau, D., Deddouche, S., and Reis e Sousa, C. (2013). Cytosolic sensing of viruses.
Immunity 38, 855–869.
Granneman, S., Kudla, G., Petfalski, E., and Tollervey, D. (2009). Identification of
protein binding sites on U3 snoRNA and pre-rRNA by UV cross-linking and high-
throughput analysis of cDNAs. Proc. Natl. Acad. Sci. U. S. A. 106, 9613–9618.
Grasedieck, S., Sorrentino, A., Langer, C., Buske, C., Döhner, H., Mertens, D., and
Kuchenbauer, F. (2013). Circulating microRNAs in hematological diseases: principles,
challenges, and perspectives. Blood 121, 4977–4984.
Gürtler, C., and Bowie, A.G. (2013). Innate immune detection of microbial nucleic acids.
Trends Microbiol. 21, 413–420.
Habjan, M., Hubel, P., Lacerda, L., Benda, C., Holze, C., Eberl, C.H., Mann, A., Kindler,
E., Gil-Cruz, C., Ziebuhr, J., et al. (2013). Sequestration by IFIT1 impairs translation of
2’O-unmethylated capped RNA. PLoS Pathog. 9, e1003663.
Halse, A.-K., Wahren-Herlenius, M., and Jonsson, R. (1999). Ro/SS-A- and La/SS-B-
reactive B lymphocytes in peripheral blood of patients with Sjögren’s syndrome. Clin.
Exp. Immunol. 115, 208–213.
Hansen, K.D., Brenner, S.E., and Dudoit, S. (2010). Biases in Illumina transcriptome
sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131.
Hardin, J.A., Rahn, D.R., Shen, C., Lerner, M.R., Wolin, S.L., Rosa, M.D., and Steitz,
J.A. (1982). Antibodies from patients with connective tissue diseases bind specific
subsets of cellular RNA-protein particles. J. Clin. Invest. 70, 141–147.
He, N., Jahchan, N.S., Hong, E., Li, Q., Bayfield, M.A., Maraia, R.J., Luo, K., and Zhou,
Q. (2008). A La-Related Protein Modulates 7SK snRNP Integrity to Suppress P-TEFb-
Dependent Transcriptional Elongation and Tumorigenesis. Mol. Cell 29, 588–599.
Head, S.R., Komori, H.K., LaMere, S.A., Whisenant, T., Van Nieuwerburgh, F.,
Salomon, D.R., and Ordoukhanian, P. (2014). Library construction for next-generation
sequencing: overviews and challenges. BioTechniques 56, 61–64, 66, 68, passim.
Horton, R., Wilming, L., Rand, V., Lovering, R.C., Bruford, E.A., Khodiyar, V.K., Lush,
M.J., Povey, S., Talbot, C.C., Wright, M.W., et al. (2004). Gene map of the extended
human MHC. Nat. Rev. Genet. 5, 889–899.
Houseley, J., and Tollervey, D. (2009). The Many Pathways of RNA Degradation. Cell
136, 763–776.
Hu, W.-S., and Hughes, S.H. (2012). HIV-1 reverse transcription. Cold Spring Harb.
Perspect. Med. 2.
Huang, X., Yuan, T., Tschannen, M., Sun, Z., Jacob, H., Du, M., Liang, M., Dittmar,
R.L., Liu, Y., Liang, M., et al. (2013). Characterization of human plasma-derived
exosomal RNAs by deep sequencing. BMC Genomics 14, 319.
International Myeloma Working Group (2003). Criteria for the classification of
monoclonal gammopathies, multiple myeloma and related disorders: a report of the
International Myeloma Working Group. Br. J. Haematol. 121, 749–757.
Ishimura, R., Nagy, G., Dotu, I., Zhou, H., Yang, X.-L., Schimmel, P., Senju, S.,
Nishimura, Y., Chuang, J.H., and Ackerman, S.L. (2014). RNA function. Ribosome
stalling induced by mutation of a CNS-specific tRNA causes neurodegeneration. Science
345, 455–459.
Jackman, J.E., Montange, R.K., Malik, H.S., and Phizicky, E.M. (2003). Identification of
the yeast gene encoding the tRNA m1G methyltransferase responsible for modification at
position 9. RNA 9, 574–585.
Jarrell, K.A., Peebles, C.L., Dietrich, R.C., Romiti, S.L., and Perlman, P.S. (1988). Group
II intron self-splicing. Alternative reaction conditions yield novel products. J. Biol.
Chem. 263, 3432–3439.
Ji, J.P., and Loeb, L.A. (1992). Fidelity of HIV-1 reverse transcriptase copying RNA in
vitro. Biochemistry 31, 954–958.
Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid,
accurate, and best-practices quantification of nucleic acid probing information, resolved
by capillary electrophoresis. RNA 19, 63–73.
Karni, R., de Stanchina, E., Lowe, S.W., Sinha, R., Mu, D., and Krainer, A.R. (2007).
The gene encoding the splicing factor SF2/ASF is a proto-oncogene. Nat. Struct. Mol.
Biol. 14, 185–193.
Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K., Nakanishi, M., Nakamura, M.,
Nishida, H., Yap, C.C., Suzuki, M., Kawai, J., et al. (2005). Antisense transcription in the
mammalian transcriptome. Science 309, 1564–1566.
Katibah, G.E., Lee, H.J., Huizar, J.P., Vogan, J.M., Alber, T., and Collins, K. (2013).
tRNA binding, structure, and localization of the human interferon-induced protein IFIT5.
Mol. Cell 49, 743–750.
Katibah, G.E., Qin, Y., Sidote, D.J., Yao, J., Lambowitz, A.M., and Collins, K. (2014).
Broad and adaptable RNA structure recognition by the human interferon-induced
tetratricopeptide repeat protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030.
Keller, A., Leidinger, P., Bauer, A., Elsharawy, A., Haas, J., Backes, C., Wendschlag, A.,
Giese, N., Tjaden, C., Ott, K., et al. (2011). Toward the blood-borne miRNome of human
diseases. Nat. Methods 8, 841–843.
Khorkova, O., Myers, A.J., Hsiao, J., and Wahlestedt, C. (2014). Natural antisense
transcripts. Hum. Mol. Genet.
Kickhoefer, V.A., Poderycki, M.J., Chan, E.K.L., and Rome, L.H. (2002). The La RNA-
binding protein interacts with the vault RNA and is a vault-associated protein. J. Biol.
Chem. 277, 41282–41286.
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013).
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions
and gene fusions. Genome Biol. 14, R36.
Kimura, T., Katoh, H., Kayama, H., Saiga, H., Okuyama, M., Okamoto, T., Umemoto,
E., Matsuura, Y., Yamamoto, M., and Takeda, K. (2013). Ifit1 inhibits Japanese
encephalitis virus replication through binding to 5’ capped 2’-O unmethylated RNA. J.
Virol. 87, 9997–10003.
Kirchner, S., and Ignatova, Z. (2015). Emerging roles of tRNA in adaptive translation,
signalling dynamics and disease. Nat. Rev. Genet. 16, 98–112.
Koh, W., Pan, W., Gawad, C., Fan, H.C., Kerchner, G.A., Wyss-Coray, T., Blumenfeld,
Y.J., El-Sayed, Y.Y., and Quake, S.R. (2014). Noninvasive in vivo monitoring of tissue-
specific global gene expression in humans. Proc. Natl. Acad. Sci. U. S. A. 111, 7361–
7366.
König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J.,
Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in
splicing at individual nucleotide resolution. Nat. Struct. Mol. Biol. 17, 909–915.
Kopreski, M.S., Benko, F.A., and Gocke, C.D. (2001). Circulating RNA as a tumor
marker: detection of 5T4 mRNA in breast and lung cancer patient serum. Ann. N. Y.
Acad. Sci. 945, 172–178.
Kowalski, M.P., and Krude, T. (2015). Functional roles of non-coding Y RNAs. Int. J.
Biochem. Cell Biol. 66, 20–29.
Krude, T., Christov, C.P., Hyrien, O., and Marheineke, K. (2009). Y RNA functions at
the initiation step of mammalian chromosomal DNA replication. J. Cell Sci. 122, 2836–
2845.
Kumar, P., Sweeney, T.R., Skabkin, M.A., Skabkina, O.V., Hellen, C.U.T., and Pestova,
T.V. (2014). Inhibition of translation by IFIT family members is determined by their
ability to interact selectively with the 5’-terminal regions of cap0-, cap1- and 5’ppp-
mRNAs. Nucleic Acids Res. 42, 3228–3245.
Lambowitz, A.M., and Belfort, M. (2015). Mobile Bacterial Group II Introns at the Crux
of Eukaryotic Evolution. Microbiol. Spectr. 3.
Lambowitz, A.M., and Zimmerly, S. (2011). Group II Introns: Mobile Ribozymes that
Invade DNA. Cold Spring Harb. Perspect. Biol. 3.
Lamm, A.T., Stadler, M.R., Zhang, H., Gent, J.I., and Fire, A.Z. (2011). Multimodal
RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a
refined and extended description of the C. elegans transcriptome. Genome Res. 21, 265–
275.
Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice,
A., Kamphorst, A.O., Landthaler, M., et al. (2007). A Mammalian microRNA Expression
Atlas Based on Small RNA Library Sequencing. Cell 129, 1401–1414.
Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2.
Nat. Methods 9, 357–359.
Levin, J.Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D.A., Friedman, N.,
Gnirke, A., and Regev, A. (2010). Comprehensive comparative analysis of strand-
specific RNA sequencing methods. Nat. Methods 7, 709–715.
Li, G.-W., Burkhardt, D., Gross, C., and Weissman, J.S. (2014). Quantifying absolute
protein synthesis rates reveals principles underlying allocation of cellular resources. Cell
157, 624–635.
Li, M., Kao, E., Gao, X., Sandig, H., Limmer, K., Pavon-Eternod, M., Jones, T.E.,
Landry, S., Pan, T., Weitzman, M.D., et al. (2012). Codon-usage-based inhibition of HIV
protein synthesis by human schlafen 11. Nature 491, 125–128.
Lill, R., Robertson, J.M., and Wintermeyer, W. (1986). Affinities of tRNA binding sites
of ribosomes from Escherichia coli. Biochemistry 25, 3245–3255.
Linsen, S.E.V., de Wit, E., Janssens, G., Heater, S., Chapman, L., Parkin, R.K., Fritz, B.,
Wyman, S.K., de Bruijn, E., Voest, E.E., et al. (2009). Limitations and possibilities of
small RNA digital gene expression profiling. Nat. Methods 6, 474–476.
Liu, Y., Zhang, Y.-B., Liu, T.-K., and Gui, J.-F. (2013). Lineage-specific expansion of
IFIT gene family: an insight into coevolution with IFN gene family. PLoS ONE 8, e66859.
Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of
transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
Lu, Z., and Matera, A.G. (2014). Vicinal: a method for the determination of ncRNA ends
using chimeric reads from RNA-seq experiments. Nucleic Acids Res. gku207.
Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero,
A., Ebert, B.L., Mak, R.H., Ferrando, A.A., et al. (2005). MicroRNA expression profiles
classify human cancers. Nature 435, 834–838.
Lusvarghi, S., Sztuba-Solinska, J., Purzycka, K.J., Rausch, J.W., and Le Grice, S.F.J.
(2013). RNA Secondary Structure Prediction Using High-throughput SHAPE. J. Vis.
Exp.
Lyakhov, D.L., He, B., Zhang, X., Studier, F.W., Dunn, J.J., and McAllister, W.T.
(1997). Mutant bacteriophage T7 RNA polymerases with altered termination properties.
J. Mol. Biol. 269, 28–40.
Mader, R.M., Schmidt, W.M., Sedivy, R., Rizovski, B., Braun, J., Kalipciyan, M., Exner,
M., Steger, G.G., and Mueller, M.W. (2001). Reverse transcriptase template switching
during reverse transcriptase-polymerase chain reaction: artificial generation of deletions
in ribonucleotide reductase mRNA. J. Lab. Clin. Med. 137, 422–428.
Malathi, K., Dong, B., Gale, M., and Silverman, R.H. (2007). Small self-RNA generated
by RNase L amplifies antiviral innate immunity. Nature 448, 816–819.
Malecki, M., Viegas, S.C., Carneiro, T., Golik, P., Dressaire, C., Ferreira, M.G., and
Arraiano, C.M. (2013). The exoribonuclease Dis3L2 defines a novel eukaryotic RNA
degradation pathway. EMBO J. 32, 1842–1854.
Markert, A., Grimm, M., Martinez, J., Wiesner, J., Meyerhans, A., Meyuhas, O.,
Sickmann, A., and Fischer, U. (2008). The La‐related protein LARP7 is a component of
the 7SK ribonucleoprotein and affects transcription of cellular and viral polymerase II
genes. EMBO Rep. 9, 569–575.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.journal 17, 10.
Mathews, M.B., and Bernstein, R.M. (1983). Myositis autoantibody inhibits histidyl-
tRNA synthetase: a model for autoimmunity. Nature 304, 177–179.
Matsuura, M., Saldanha, R., Ma, H., Wank, H., Yang, J., Mohr, G., Cavanagh, S., Dunny,
G.M., Belfort, M., and Lambowitz, A.M. (1997). A bacterial group II intron encoding
reverse transcriptase, maturase, and DNA endonuclease activities: biochemical
demonstration of maturase activity and insertion of new genetic information within the
intron. Genes Dev. 11, 2910–2924.
Matsuura, M., Noah, J.W., and Lambowitz, A.M. (2001). Mechanism of maturase-
promoted group II intron splicing. EMBO J. 20, 7259–7270.
Mayer, G., Müller, J., and Lünse, C.E. (2011). RNA diagnostics: real-time RT-PCR
strategies and promising novel target RNAs. Wiley Interdiscip. Rev. RNA 2, 32–41.
Meldrum, C., Doyle, M.A., and Tothill, R.W. (2011). Next-generation sequencing for
cancer diagnostics: a practical perspective. Clin. Biochem. Rev. 32, 177–195.
Mitchell, P.S., Parkin, R.K., Kroh, E.M., Fritz, B.R., Wyman, S.K., Pogosova-
Agadjanyan, E.L., Peterson, A., Noteboom, J., O’Briant, K.C., Allen, A., et al. (2008).
Circulating microRNAs as stable blood-based markers for cancer detection. Proc. Natl.
Acad. Sci. U. S. A. 105, 10513–10518.
Mohr, G., Ghanem, E., and Lambowitz, A.M. (2010). Mechanisms used for genomic
proliferation by thermophilic group II introns. PLoS Biol. 8, e1000391.
Mohr, S., Ghanem, E., Smith, W., Sheeter, D., Qin, Y., King, O., Polioudakis, D., Iyer,
V.R., Hunicke-Smith, S., Swamy, S., et al. (2013). Thermostable group II intron reverse
transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA
sequencing. RNA 19, 958–970.
Moore, S.D., and Sauer, R.T. (2007). The tmRNA system for translational surveillance
and ribosome rescue. Annu. Rev. Biochem. 76, 101–124.
Moussay, E., Wang, K., Cho, J.-H., van Moer, K., Pierson, S., Paggetti, J., Nazarov, P.V.,
Palissot, V., Hood, L.E., Berchem, G., et al. (2011). MicroRNA as biomarkers and
regulators in B-cell chronic lymphocytic leukemia. Proc. Natl. Acad. Sci. U. S. A. 108,
6573–6578.
Ng, B., Nayak, S., Gibbs, M.D., Lee, J., and Bergquist, P.L. (2007). Reverse
transcriptases: intron-encoded proteins found in thermophilic bacteria. Gene 393, 137–
144.
Norbury, C.J. (2013). Cytoplasmic RNA: a case of the tail wagging the dog. Nat. Rev.
Mol. Cell Biol. 14, 643–653.
Nottingham, R.M., Wu, D.C., Qin, Y., Yao, J., Hunicke-Smith, S., and Lambowitz, A.M.
(2016). RNA-seq of human reference RNA samples using a thermostable group II intron
reverse transcriptase. RNA 22, 597–613.
Ozsolak, F., and Milos, P.M. (2011). RNA sequencing: advances, challenges and
opportunities. Nat. Rev. Genet. 12, 87–98.
Pang, Y.L.J., Abo, R., Levine, S.S., and Dedon, P.C. (2014). Diverse cell stresses induce
unique patterns of tRNA up- and down-regulation: tRNA-seq for quantifying changes in
tRNA copy number. Nucleic Acids Res. 42, e170.
Parrott, A.M., and Mathews, M.B. (2007). Novel rapidly evolving hominid RNAs bind
nuclear factor 90 and display tissue-restricted distribution. Nucleic Acids Res. 35, 6249–
6258.
Parrott, A.M., Tsai, M., Batchu, P., Ryan, K., Ozer, H.L., Tian, B., and Mathews, M.B.
(2011). The evolution and expression of the snaR family of small non-coding RNAs.
Nucleic Acids Res. 39, 1485–1500.
Phizicky, E.M., and Hopper, A.K. (2010). tRNA biology charges to the front. Genes Dev.
24, 1832–1860.
Pichlmair, A., Lassnig, C., Eberle, C.-A., Górna, M.W., Baumann, C.L., Burkard, T.R.,
Bürckstümmer, T., Stefanovic, A., Krieger, S., Bennett, K.L., et al. (2011). IFIT1 is an
antiviral protein that recognizes 5’-triphosphate RNA. Nat. Immunol. 12, 624–630.
Popovici, V., Chen, W., Gallas, B.G., Hatzis, C., Shi, W., Samuelson, F.W., Nikolsky,
Y., Tsyganova, M., Ishkin, A., Nikolskaya, T., et al. (2010). Effect of training-sample
size and classification difficulty on the accuracy of genomic predictors. Breast Cancer
Res. 12, R5.
Portal, M.M., Pavet, V., Erb, C., and Gronemeyer, H. (2015). Human cells contain
natural double-stranded RNAs with potential regulatory functions. Nat. Struct. Mol. Biol.
22, 89–97.
Qin, Y., Yao, J., Wu, D.C., Nottingham, R.M., Mohr, S., Hunicke-Smith, S., and
Lambowitz, A.M. (2016). High-throughput sequencing of human plasma RNA by using
thermostable group II intron reverse transcriptases. RNA 22, 111–128.
Raab, M.S., Podar, K., Breitkreutz, I., Richardson, P.G., and Anderson, K.C. (2009).
Multiple myeloma. Lancet 374, 324–339.
Raabe, C.A., Tang, T.-H., Brosius, J., and Rozhdestvensky, T.S. (2014). Biases in small
RNA deep sequencing data. Nucleic Acids Res. 42, 1414–1426.
Rajkumar, S.V., Landgren, O., and Mateos, M.-V. (2015). Smoldering multiple myeloma.
Blood 125, 3069–3075.
Rambo, R.P., and Doudna, J.A. (2004). Assembly of an active group II intron-maturase
complex by protein dimerization. Biochemistry 43, 6486–6497.
Raposo, G., and Stoorvogel, W. (2013). Extracellular vesicles: Exosomes, microvesicles,
and friends. J. Cell Biol. 200, 373–383.
Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary
structure prediction and analysis. BMC Bioinformatics 11, 129.
Robart, A.R., Seo, W., and Zimmerly, S. (2007). Insertion of group II intron
retroelements after intrinsic transcriptional terminators. Proc. Natl. Acad. Sci. U. S. A.
104, 6620–6625.
Robinson, J.T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G.,
and Mesirov, J.P. (2011). Integrative genomics viewer. Nat. Biotechnol. 29, 24–26.
Rosa, M.D., Hendrick, J.P., Lerner, M.R., Steitz, J.A., and Reichlin, M. (1983). A
mammalian tRNAHis-containing antigen is recognized by the polymyositis-specific
antibody anti-Jo-1. Nucleic Acids Res. 11, 853–870.
Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y., Zepeniuk, M.,
Benjamin, H., Shabes, N., Tabak, S., Levy, A., et al. (2008). MicroRNAs accurately
identify cancer tissue origin. Nat. Biotechnol. 26, 462–469.
Routsias, J.G., and Tzioufas, A.G. (2010). B-cell epitopes of the intracellular
autoantigens Ro/SSA and La/SSB: tools to study the regulation of the autoimmune
response. J. Autoimmun. 35, 256–264.
Rubio, M.A.T., Ragone, F.L., Gaston, K.W., Ibba, M., and Alfonzo, J.D. (2006). C to U
editing stimulates A to I editing in the anticodon loop of a cytoplasmic threonyl tRNA in
Trypanosoma brucei. J. Biol. Chem. 281, 115–120.
Sai Lakshmi, S., and Agrawal, S. (2008). piRNABank: a web resource on classified and
clustered Piwi-interacting RNAs. Nucleic Acids Res. 36, D173–D177.
Saldanha, R., Chen, B., Wank, H., Matsuura, M., Edwards, J., and Lambowitz, A.M.
(1999). RNA and protein catalysis in group II intron splicing and mobility reactions using
purified components. Biochemistry 38, 9069–9083.
Satoh, T., Okano, T., Matsui, T., Watabe, H., Ogasawara, T., Kubo, K., Kuwana, M.,
Fertig, N., Oddis, C.V., Kondo, H., et al. (2005). Novel autoantibodies against 7SL RNA
in patients with polymyositis/dermatomyositis. J. Rheumatol. 32, 1727–1733.
Schoenberg, D.R., and Maquat, L.E. (2012). Regulation of cytoplasmic mRNA decay.
Nat. Rev. Genet. 13, 246–259.
Schoggins, J.W., and Rice, C.M. (2011). Interferon-stimulated genes and their antiviral
effector functions. Curr. Opin. Virol. 1, 519–525.
Shao, S., von der Malsburg, K., and Hegde, R.S. (2013). Listerin-dependent nascent
protein ubiquitination relies on ribosome subunit dissociation. Mol. Cell 50, 637–648.
Shen, P.S., Park, J., Qin, Y., Li, X., Parsawar, K., Larson, M.H., Cox, J., Cheng, Y.,
Lambowitz, A.M., Weissman, J.S., et al. (2015). Protein synthesis. Rqc2p and 60S
ribosomal subunits mediate mRNA-independent elongation of nascent chains. Science
347, 75–78.
Shi, L., Campbell, G., Jones, W.D., Campagne, F., Wen, Z., Walker, S.J., Su, Z., Chu, T.-
M., Goodsaid, F.M., Pusztai, L., et al. (2010). The MicroArray Quality Control (MAQC)-
II study of common practices for the development and validation of microarray-based
predictive models. Nat. Biotechnol. 28, 827–838.
Silva, J., García, V., García, J.M., Peña, C., Domínguez, G., Díaz, R., Lorenzo, Y.,
Hurtado, A., Sánchez, A., and Bonilla, F. (2007). Circulating Bmi-1 mRNA as a possible
prognostic factor for advanced breast cancer patients. Breast Cancer Res. 9, R55.
Smith, D., and Yong, K. (2013). Multiple myeloma. BMJ 346, f3863.
Spornraft, M., Kirchner, B., Haase, B., Benes, V., Pfaffl, M.W., and Riedmaier, I. (2014).
Optimization of Extraction of Circulating RNAs from Plasma – Enabling Small RNA
Sequencing. PLoS ONE 9.
Stoltzfus, C.M. (2009). Chapter 1. Regulation of HIV-1 alternative RNA splicing and its
role in virus replication. Adv. Virus Res. 74, 1–40.
Stringer, S., Basnayake, K., Hutchison, C., and Cockwell, P. (2011). Recent advances in
the pathogenesis and management of cast nephropathy (myeloma kidney). Bone Marrow
Res. 2011, 493697.
Szretter, K.J., Daniels, B.P., Cho, H., Gainey, M.D., Yokoyama, W.M., Gale, M., Virgin,
H.W., Klein, R.S., Sen, G.C., and Diamond, M.S. (2012). 2’-O methylation of the viral
mRNA cap by West Nile virus evades ifit1-dependent and -independent mechanisms of
host restriction in vivo. PLoS Pathog. 8, e1002698.
Tijerina, P., Mohr, S., and Russell, R. (2007). DMS footprinting of structured RNAs and
RNA-protein complexes. Nat. Protoc. 2, 2608–2623.
Toor, N., Robart, A.R., Christianson, J., and Zimmerly, S. (2006). Self-splicing of a
group IIC intron: 5’ exon recognition and alternative 5’ splicing events implicate the
stem-loop motif of a transcriptional terminator. Nucleic Acids Res. 34, 6461–6471.
Toor, N., Keating, K.S., Taylor, S.D., and Pyle, A.M. (2008). Crystal structure of a self-
spliced group II intron. Science 320, 77–82.
Topisirovic, I., Svitkin, Y.V., Sonenberg, N., and Shatkin, A.J. (2011). Cap and cap-
binding proteins in the control of gene expression. Wiley Interdiscip. Rev. RNA 2, 277–
298.
Trewick, S.C., Henshaw, T.F., Hausinger, R.P., Lindahl, T., and Sedgwick, B. (2002).
Oxidative demethylation by Escherichia coli AlkB directly reverts DNA base damage.
Nature 419, 174–178.
Ule, J., Jensen, K., Mele, A., and Darnell, R.B. (2005). CLIP: a method for identifying
protein-RNA interaction sites in living cells. Methods 37, 376–386.
Valadi, H., Ekström, K., Bossios, A., Sjöstrand, M., Lee, J.J., and Lötvall, J.O. (2007).
Exosome-mediated transfer of mRNAs and microRNAs is a novel mechanism of genetic
exchange between cells. Nat. Cell Biol. 9, 654–659.
van der Veen, R., Kwakman, J.H., and Grivell, L.A. (1987). Mutations at the lariat
acceptor site allow self-splicing of a group II intron without lariat formation. EMBO J. 6,
3827–3831.
Vellore, J., Moretz, S.E., and Lampson, B.C. (2004). A group II intron-type open reading
frame from the thermophile Bacillus (Geobacillus) stearothermophilus encodes a heat-
stable reverse transcriptase. Appl. Environ. Microbiol. 70, 7140–7147.
Verma, R., Oania, R.S., Kolawa, N.J., and Deshaies, R.J. (2013). Cdc48/p97 promotes
degradation of aberrant nascent polypeptides bound to the ribosome. eLife 2, e00308.
Vickers, K.C., Palmisano, B.T., Shoucri, B.M., Shamburek, R.D., and Remaley, A.T.
(2011). MicroRNAs are transported in plasma and delivered to recipient cells by high-
density lipoproteins. Nat. Cell Biol. 13, 423–433.
Walter, P., and Blobel, G. (1982). Signal recognition particle contains a 7S RNA
essential for protein translocation across the endoplasmic reticulum. Nature 299, 691–
698.
Wang, K., Yuan, Y., Cho, J.-H., McClarty, S., Baxter, D., and Galas, D.J. (2012).
Comparing the MicroRNA spectrum between serum and plasma. PLoS ONE 7, e41561.
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for
transcriptomics. Nat. Rev. Genet. 10, 57–63.
Wank, H., SanFilippo, J., Singh, R.N., Matsuura, M., and Lambowitz, A.M. (1999). A
reverse transcriptase/maturase promotes splicing by binding at its own coding segment in
a group II intron RNA. Mol. Cell 4, 239–250.
Weeks, K.M., and Mauger, D.M. (2011). Exploring RNA structural codes with SHAPE
chemistry. Acc. Chem. Res. 44, 1280–1291.
Werner, A. (2013). Biological functions of natural antisense transcripts. BMC Biol. 11,
31.
Wilhelm, B.T., and Landry, J.-R. (2009). RNA-Seq-quantitative measurement of
expression through massively parallel RNA-sequencing. Methods 48,
249–257.
Williams, Z., Ben-Dov, I.Z., Elias, R., Mihailovic, A., Brown, M., Rosenwaks, Z., and
Tuschl, T. (2013). Comprehensive profiling of circulating microRNA via small RNA
sequencing of cDNA libraries reveals biomarker potential and limitations. Proc. Natl.
Acad. Sci. U. S. A. 110, 4255–4260.
Wolin, S.L., Sim, S., and Chen, X. (2012). Nuclear noncoding RNA surveillance: is the
end in sight? Trends Genet. 28, 306–313.
Xue, D., Shi, H., Smith, J.D., Chen, X., Noe, D.A., Cedervall, T., Yang, D.D., Eynon, E.,
Brash, D.E., Kashgarian, M., et al. (2003). A lupus-like syndrome develops in mice
lacking the Ro 60-kDa protein, a major lupus autoantigen. Proc. Natl. Acad. Sci. U. S. A.
100, 7503–7508.
Yamasaki, S., Ivanov, P., Hu, G., and Anderson, P. (2009). Angiogenin cleaves tRNA
and promotes stress-induced translational repression. J. Cell Biol. 185, 35–42.
Yang, Z., Liang, H., Zhou, Q., Li, Y., Chen, H., Ye, W., Chen, D., Fleming, J., Shu, H.,
and Liu, Y. (2012). Crystal structure of ISG54 reveals a novel RNA binding structure and
potential functional mechanisms. Cell Res. 22, 1328–1338.
Zarnack, K., König, J., Tajnik, M., Martincorena, I., Eustermann, S., Stévant, I., Reyes,
A., Anders, S., Luscombe, N.M., and Ule, J. (2013). Direct competition between hnRNP
C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell
152, 453–466.
Zernecke, A., Bidzhekov, K., Noels, H., Shagdarsuren, E., Gan, L., Denecke, B., Hristov,
M., Köppel, T., Jahantigh, M.N., Lutgens, E., et al. (2009). Delivery of microRNA-126
by apoptotic bodies induces CXCL12-dependent vascular protection. Sci. Signal. 2, ra81.
Zheng, G., Qin, Y., Clark, W.C., Dai, Q., Yi, C., He, C., Lambowitz, A.M., and Pan, T.
(2015). Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12,
835–837.
Zhou, X., Michal, J.J., Zhang, L., Ding, B., Lunney, J.K., Liu, B., and Jiang, Z. (2013).
Interferon induced IFIT family genes in host antiviral defense. Int. J. Biol. Sci. 9, 200–
208.
Züst, R., Cervantes-Barragan, L., Habjan, M., Maier, R., Neuman, B.W., Ziebuhr, J.,
Szretter, K.J., Baker, S.C., Barchet, W., Diamond, M.S., et al. (2011). Ribose 2’-O-
methylation provides a molecular signature for the distinction of self and non-self mRNA
dependent on the RNA sensor Mda5. Nat. Immunol. 12, 137–143.
Vita
Yidan Qin was born in Zhengzhou, Henan, People’s Republic of China to Caiying
Xia and Huihong Qin. After completing her high school studies at St Cyprian’s School in
Cape Town, Republic of South Africa, she enrolled at the University of Nebraska-Lincoln
in 2005 and received a B.S. in Biochemistry and a B.S. in Forensic Science in 2009. She
joined the Microbiology graduate program at the University of Texas at Austin in 2009,
and began her graduate work under the supervision of Dr. Alan Lambowitz in 2010.
She co-authored the following papers:
Mohr, S., Ghanem, E., Smith, W., Sheeter, D., Qin, Y., King, O., Polioudakis, D., Iyer,
V.R., Hunicke-Smith, S., Swamy, S., et al. (2013). Thermostable group II intron reverse
transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA
sequencing. RNA 19, 958–970.
Katibah, G.E., Qin, Y., Sidote, D.J., Yao, J., Lambowitz, A.M., and Collins, K. (2014).
Broad and adaptable RNA structure recognition by the human interferon-induced
tetratricopeptide repeat protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030.
Shen, P.S., Park, J., Qin, Y., Li, X., Parsawar, K., Larson, M.H., Cox, J., Cheng, Y.,
Lambowitz, A.M., Weissman, J.S., et al. (2015). Protein synthesis. Rqc2p and 60S
ribosomal subunits mediate mRNA-independent elongation of nascent chains. Science
347, 75–78.
Zheng, G.*, Qin, Y.*, Clark, W.C., Dai, Q., Yi, C., He, C., Lambowitz, A.M., and Pan, T.
(2015). Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12,
835–837.
Qin, Y.*, Yao, J.*, Wu, D.C., Nottingham, R.M., Mohr, S., Hunicke-Smith, S., and
Lambowitz, A.M. (2016). High-throughput sequencing of human plasma RNA by using
thermostable group II intron reverse transcriptases. RNA 22, 111–128.
Nottingham, R.M.*, Wu, D.C.*, Qin, Y., Yao, J., Hunicke-Smith, S., and Lambowitz,
A.M. (2016). RNA-seq of human reference RNA samples using a thermostable group II
intron reverse transcriptase. RNA 22, 597–613.
*Co-first authorship.
Permanent address: [email protected]
This dissertation was typed by Yidan Qin.