Copyright by Yidan Qin 2016

Copyright

by

Yidan Qin

2016

The Dissertation Committee for Yidan Qin Certifies that this is the approved

version of the following dissertation:

Thermostable Group II Intron Reverse Transcriptases and Their

Applications in Next Generation RNA Sequencing, Diagnostics, and

Precision Medicine

Committee:

Alan M. Lambowitz, Supervisor

Vishwanath R. Iyer

Robert M. Krug

Rick Russell

Scott W. Stevens

Christopher S. Sullivan

Thermostable Group II Intron Reverse Transcriptases and Their

Applications in Next Generation RNA Sequencing, Diagnostics, and

Precision Medicine

by

Yidan Qin, B.S.Biochem.; B.S.ForensicSci.

Dissertation

Presented to the Faculty of the Graduate School of

The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

The University of Texas at Austin

May 2016

Dedication

Dedicated to my parents, Caiying Xia and Huihong Qin.

Acknowledgements

I would like to express my sincere gratitude to my advisor, Dr. Alan Lambowitz.

His guidance, encouragement and inspiration have allowed me to learn and grow as a scientist. I

would also like to thank my committee members, Dr. Vishwanath Iyer, Dr. Robert Krug,

Dr. Rick Russell, Dr. Scott Stevens, and Dr. Chris Sullivan, for providing their important

expertise and valuable critiques throughout the development of this research work.

Moreover, I am very grateful for the fun and fruitful work I shared with all the

collaborators at or outside the University of Texas at Austin.

Many thanks go to the current and previous members of the Lambowitz Lab for

their enormous support, particularly Jun Yao, Sabine Mohr, Marta Mastroianni, Ryan

Nottingham, and Tawsy Lamech. I am very fortunate to have joined this lab and worked

with them. My appreciation also extends to my friends outside the Lambowitz Lab,

particularly Xia Xia, Tina Hsiang, and Lily Wang, whose friendship has been vital throughout

my time in graduate school.

Finally, I would like to take this opportunity to thank my family. I thank my

cousin Zhen Qin, my cousin-in-law Jian Zhou, my aunt Huapei Li, and my uncle

Huiming Qin, for their love and care. Most importantly, I thank my parents, Caiying

Xia and Huihong Qin, for always being there for me.

Thermostable Group II Intron Reverse Transcriptases and Their

Applications in Next Generation RNA Sequencing, Diagnostics, and

Precision Medicine

Yidan Qin, Ph.D.

The University of Texas at Austin, 2016

Supervisor: Alan M. Lambowitz

Thermostable group II intron reverse transcriptases (TGIRTs) from thermophilic

bacteria are advantageous for biotechnological applications that require cDNA synthesis,

such as RT-qPCR and RNA-seq. TGIRTs have higher thermostability, processivity and

fidelity than conventional retroviral RTs, along with a novel end-to-end template-

switching activity that attaches RNA-seq adapters to target RNAs without RNA ligation.

First, I optimized the TGIRT template-switching method for RNA-seq analysis of small

non-coding RNAs (ncRNAs). I showed that TGIRT-seq gives full-length reads of tRNAs,

which are refractory to retroviral RTs, and enables identification of a variety of base

modifications in tRNAs by distinctive patterns of misincorporated nucleotides. With

collaborators, I developed an efficient and quantitative high-throughput tRNA sequencing

method, identified RNAs bound by the human interferon-induced protein IFIT5, yielding

new insights into its functions in tRNA quality control and innate immunity, and

uncovered a novel mRNA-independent mechanism for elongation of nascent peptides.

Second, I developed a new, streamlined TGIRT-seq method for comprehensive analysis

of all RNA size classes in a single RNA-seq reaction. This method enables RNA-seq library

construction from <1 ng of fragmented RNAs in <5 h. By using the method, I showed

that human plasma contains large numbers of protein-coding and long ncRNAs together

with diverse classes of small ncRNAs, which are mostly present as full-length transcripts.

With collaborators, I showed that TGIRT-seq analysis of circulating RNAs identified

potential biomarkers at different stages of multiple myeloma and may provide a sensitive,

non-invasive diagnostic tool for a variety of human diseases. Finally, I adapted TGIRTs

for use in mapping of RNA structures and RNA-protein interaction sites, and

identification of RNA targets of cellular RNA-binding proteins. My research led to a

series of new biological insights, which would have been difficult or impossible to obtain

by current methods, and established TGIRTs as a tool for a broad range of applications in

RNA research and diagnostics.

Table of Contents

List of Tables

List of Figures

Chapter 1: Thermostable Group II Intron Reverse Transcriptases

1.1 Group II introns

1.2 Group II intron reverse transcriptases

1.3 Thermostable group II intron reverse transcriptases are advantageous for cDNA synthesis

1.4 Thermostable group II intron reverse transcriptases are advantageous for next-generation RNA sequencing

1.5 Overview of the dissertation research

Chapter 2: RNA-seq of transfer RNAs

2.1 Efficient and quantitative high-throughput tRNA sequencing*

2.1.1 tRNA sequencing by combining demethylase treatment with the TGIRT-seq small RNA/CircLigase method

2.1.2 Analysis of tRNA isoacceptors, modifications and gene expressions

2.1.3 Discussion

2.2 Analysis of precursor and mature tRNAs associated with the human interferon-induced protein IFIT5*

2.3.1 The human IFIT5 protein

2.3.2 TGIRT-seq profiling of IFIT5-bound cellular RNAs

2.3.3 IFIT5 binds to a broad spectrum of precursor and processed tRNAs, as well as other RNA polymerase III transcripts

2.3.4 Discussion

2.4 Analysis of tRNAs associated with the yeast Rqc2p protein*

2.4.1 The Rqc2p protein

2.4.2 TGIRT-seq profiling of Rqc2p-bound tRNAs

2.4.3 Discussion

2.5 Materials and methods

2.5.1 Deacylation of tRNA samples

2.5.2 Construction of RNA-seq libraries by TGIRT-seq small RNA/CircLigase method

Chapter 3: RNA-seq of circulating RNAs in human plasma

3.1 Introduction

3.2 TGIRT-seq, the total RNA method

3.2.1 Overview of the TGIRT-seq total RNA method

3.2.2 Validation of the TGIRT-seq total RNA method

3.3 Human plasma RNA

3.3.1 Preparations and treatments of human plasma RNAs

3.3.2 TGIRT-seq of human plasma RNA samples

3.3.3 Classes of RNAs detected in human plasma

3.3.4 Protein-coding gene and long non-coding RNAs in human plasma

3.3.5 Small non-coding RNAs in human plasma

3.4 Discussion

3.5 Materials and methods

3.5.1 Thermostable group II intron RTs

3.5.2 Preparation of human plasma RNA samples

3.5.3 Construction of plasma RNA-seq libraries

3.5.4 RNA-seq analysis of cDNA recopying by TGIRT enzymes

3.5.5 Bioinformatics analysis

3.5.6 Accession numbers

Chapter 4: Identification of circulating RNA biomarkers in multiple myeloma

4.1 Introduction

4.2 RNA profiles of extracellular vesicles in human plasma

4.3 TGIRT-seq identifies differentially expressed transcripts by disease stages

4.4 Discussion

4.5 Materials and methods

4.5.1 Thermostable group II intron RTs

4.5.2 RNA preparations*

4.5.3 Construction of RNA-seq libraries

4.5.4 Bioinformatics*

Chapter 5: Mapping RNA secondary structures and RNA-protein interaction sites

5.1 Overview of SHAPE and CRAC

5.2 Protein-assisted group II intron splicing

5.2.1 Determination of optimal exon length and protein concentration for in vitro splicing of the GsI-IIC intron

5.2.2 RNA-structure mapping of the GsI-IIC intron via TGIRT-SHAPE*

5.2.3 Mapping of RNA-protein contact sites by TGIRT-CRAC*

5.3 Discussion

5.4 Materials and methods

5.4.1 Recombinant plasmids

5.4.2 Preparation of GsI-IIC intron RNA and IEP

5.4.3 GsI-IIC intron splicing

5.4.4 TGIRT-SHAPE

5.4.5 TGIRT-CRAC

Bibliography

Vita

List of Tables

Table 2.1: TGIRT-seq read mapping.

Table 2.2: Biological replicate sequencing of pooled RNA.

Table 3.1: Read statistics and mapping for RNA-seq of total plasma RNAs using TeI4c group II intron RT.

Table 3.2: Read statistics and mapping for RNA-seq of total plasma RNAs using GsI-IIC group II intron RT.

Table 3.3: Analysis of 3’-terminal nucleotides of RNAs in RNA-seq datasets constructed from total plasma RNA using TeI4c or GsI-IIC group II intron RTs.

Table 3.4: Read statistics and mapping for RNA-seq of whole-cell RNAs by using TeI4c or GsI-IIC group II intron RT.

Table 3.5: Summary of RNA-seq datasets.

Table 4.1: Read statistics and mapping for RNA-seq of plasma EV-RNAs.

List of Figures

Figure 1.1: Group II intron splicing and mobility.

Figure 1.2: Comparison of group II intron and retroviral RTs.

Figure 2.1: Demethylase-thermostable group II intron RT tRNA sequencing (DM-tRNA-seq).

Figure 2.2: cDNA synthesis of IFIT-bound RNAs by TGIRT-seq small RNA/CircLigase method.

Figure 2.3: Broad representation of IFIT5-bound tRNAs.

Figure 2.4: Individual gene coverage by reads from the WT IFIT5 cross-linked RNA sample.

Figure 2.5: Read sequence alignments for the WT IFIT5 cross-linked RNA sample.

Figure 2.6: Composite read start sites for IFIT5-bound tRNAs.

Figure 2.7: Rqc2p-dependent enrichment of tRNAAla(IGC) and tRNAThr(IGU).

Figure 3.1: TGIRT-seq overview.

Figure 3.2: Bioanalyzer traces showing size profiles of plasma RNAs before and after various treatments.

Figure 3.3: Bioanalyzer traces testing the efficiency of DNase treatments used on plasma RNA preparations.

Figure 3.4: The distribution of transcript lengths in total plasma RNA libraries calculated by the coverage of paired-end read span.

Figure 3.5: Percentage of TGIRT-seq reads from total plasma RNA datasets mapping to different categories of genomic features.

Figure 3.6: Correlation analysis for biological replicates of total plasma RNA libraries.

Figure 3.7: RNA-seq analysis of total plasma RNA libraries constructed with GsI-IIC group II intron RT.

Figure 3.8: Human plasma RNA is enriched in intron and antisense sequences compared to whole-cell RNAs.

Figure 3.9: Proportion of reads mapping to the sense strand of protein-coding genes as a function of gene length in RNA-seq datasets of human plasma or whole-cell RNAs.

Figure 3.10: Human plasma contains both mature and pre-miRNAs.

Figure 3.11: Tissue expression profiles for mature miRNAs in plasma.

Figure 3.12: Tissue expression profiles of mature miRNA identified in total plasma RNA prepared by the mirVana combined method.

Figure 3.13: TGIRT-seq detects full-length pre-miRNAs and a miRNA that may be present in plasma in an RNA/DNA hybrid.

Figure 3.14: Relative abundance and IGV alignments of miRNAs identified in a small plasma RNA-seq dataset constructed with GsI-IIC RT.

Figure 3.15: TGIRT-seq identifies full-length mature tRNAs and tRNA fragments in human plasma.

Figure 3.16: Other classes of small non-coding RNAs identified as full-length mature transcripts in human plasma by TGIRT-seq.

Figure 4.1: Bioanalyzer traces showing size profiles of plasma EV-RNAs.

Figure 4.2: Percentage of TGIRT-seq reads from EV-RNA datasets mapping to different categories of genomic features.

Figure 4.3: Heatmap for sample-to-sample distance.

Figure 4.4: Transcript expressions in plasma EVs.

Figure 4.5: Survival curves.

Figure 5.1: Determining the optimal exon length for in vitro splicing of the GsI-IIC intron.

Figure 5.2: Determining the optimal IEP concentration for in vitro splicing of the GsI-IIC intron.

Figure 5.3: SHAPE analysis of the GsI-IIC intron RNA.

Figure 5.4: Mapping of protein binding sites in GsI-IIC intron RNA.

Chapter 1: Thermostable Group II Intron Reverse Transcriptases

1.1 GROUP II INTRONS

Group II introns are mobile genetic elements found in bacterial and organellar

genomes and are thought to be evolutionary ancestors of eukaryotic spliceosomes,

retrotransposons, and retroviruses (Lambowitz and Belfort, 2015). A mobile group II intron

consists of a catalytic intron RNA (a “ribozyme”), which folds into stable secondary and

tertiary structures, and an intron-encoded protein (IEP), which is a multifunctional

reverse transcriptase (RT) that assists intron splicing and promotes intron mobility within

the genome (Lambowitz and Zimmerly, 2011). The IEP binds to the intron RNA to

stabilize the catalytically active RNA structure for intron splicing (Matsuura et al., 2001).

Group II introns use the same splicing mechanism used by the spliceosomal introns in

higher organisms, producing an excised lariat intron RNA via two transesterification

steps (Fig. 1.1A). After splicing, the IEP remains bound to the excised lariat intron,

forming a ribonucleoprotein (RNP) to promote intron mobility to new DNA sites. Intron

mobility occurs by “retrohoming”, a process in which the intron RNA reverse splices

directly into a specific DNA site and is then reverse transcribed by the IEP (Fig. 1.1B).

Studies of protein-assisted group II intron splicing and mobility can further our

understanding of how proteins promote RNA folding and catalysis, and the origin,

evolution and mechanisms of spliceosomal introns in higher organisms.

1.2 GROUP II INTRON REVERSE TRANSCRIPTASES

Group II intron reverse transcriptases (RTs) consist of four domains: an N-terminal

RT domain, an X domain, and C-terminal DNA-binding (D) and DNA endonuclease

(En) domains (Fig. 1.2) (Mohr et al., 2013). The RT domain of group II

intron RTs contains seven conserved sequence blocks that correspond to the finger and

palm regions of retroviral RTs, such as the HIV-1 RT. However, their RT domain is

larger in size due to an N-terminal extension and several insertions, some of which are

conserved in retroplasmid and non-LTR-retrotransposon RTs (Blocker et al., 2005).

These additional regions may contribute to more extensive interactions between the

group II intron RT and the RNA template, leading to high processivity during reverse

transcription (Chen and Lambowitz, 1997; Bibillo and Eickbush, 2002; Blocker et al.,

2005). The X domain is structurally homologous to the thumb domain of retroviral RTs

(Blocker et al., 2005). Both the RT and X domains function in binding the intron RNA

for RNA splicing and in reverse transcription to synthesize a full-length cDNA copy of

the group II intron RNA during intron mobility (Cui et al., 2004). In contrast to retroviral

RTs, group II intron RTs lack an RNase H domain and instead have D and En domains

for binding and cleaving DNA target sites during intron mobility (Blocker et al., 2005;

Lambowitz and Zimmerly, 2011).

1.3 THERMOSTABLE GROUP II INTRON REVERSE TRANSCRIPTASES ARE ADVANTAGEOUS

FOR CDNA SYNTHESIS

A wide range of biotechnological applications, such as mapping of RNA structures

and RNA-protein interactions, qRT-PCR, and next-generation RNA sequencing (RNA-seq),

requires cDNA synthesis by reverse transcriptases (RTs) (Tijerina et al., 2007; Wang et al.,

2009; Mayer et al., 2011; Ozsolak and Milos, 2011; Lusvarghi et al., 2013). However, the

only commercially available RTs used for these applications are retroviral RTs, which

have inherently low fidelity and processivity, properties that help retroviruses introduce

genetic variations and propagate them by RNA recombination in order to evade host

defenses (Ji and Loeb, 1992; Hu and Hughes, 2012). Additionally, only a few RTs are

capable of functioning at elevated temperature, which facilitates the melting of higher-

order RNA structures for full-length cDNA synthesis, and these typically have decreased

fidelity (Beckman et al., 1985; Baranauskas et al., 2012; Mohr et al., 2013).

In contrast to retroviral RTs, group II intron RTs have inherently high fidelity and

processivity in order to perform their normal biological function during intron mobility,

which requires accurate and full-length cDNA synthesis of a highly structured, 2-3-kb

intron RNA (Conlan et al., 2005; Lambowitz and Zimmerly, 2011; Mohr et al., 2013;

Enyeart et al., 2014; Lambowitz and Belfort, 2015). Group II intron RTs found in

thermophilic bacteria can potentially combine the above useful properties with high

thermostability. However, group II introns have remained untapped as a source of RTs

for biotechnological applications due to two major challenges: (i) although hundreds of

group II intron RTs were identified by genome sequencing (Candales et al., 2012), they

often have mutations that decrease or abolish RT activity, suggesting that they are under

selective pressure to suppress intron mobility, which is deleterious to their hosts (Mohr et

al., 2010); and (ii) group II intron RTs have generally been difficult to express with high

yield and activity and become mostly insoluble without the bound intron RNA (Vellore et

al., 2004; Ng et al., 2007). Most previous studies of group II intron RTs have focused on

the LtrA protein encoded by the Lactococcus lactis Ll.LtrB intron for which expression

and solubility problems could be partially overcome under some experimental conditions.

The LtrA protein has been expressed in Escherichia coli with a cleavable intein-affinity

tag and purified with relatively high yield and activity (Saldanha et al., 1999). In vivo, the

LtrA protein synthesizes a full-length cDNA copy of the ~3-kb Ll.LtrB intron with a

significantly lower error rate (~10⁻⁵) than that of retroviral RTs (Cousineau

et al., 1998; Conlan et al., 2005).

Our laboratory recently identified thermostable group II introns that are actively

mobile (Mohr et al., 2010), and developed general methods for the high-level expression

of thermostable group II intron RTs (TGIRTs) as fusion proteins with a non-cleavable

solubility tag attached via a rigid linker (denoted MRF) (Mohr et al., 2013). The two most

active TGIRTs identified were TeI4c-MRF from Thermosynechococcus elongatus and

GsI-IIC-MRF from Geobacillus stearothermophilus (Vellore et al., 2004; Mohr et al.,

2010, 2013). We found that these TGIRT enzymes have higher thermostability,

processivity, and fidelity than retroviral RTs. They carried out reverse transcription

at high temperatures (up to 81°C) and synthesized cDNAs with uniform 5’ to 3’

coverage of a 1.2-kb RNA template, as measured by a TaqMan qRT-PCR assay. Similarly,

in a capillary electrophoresis assay, TGIRTs produced full-length cDNAs of an 807-nt

highly structured group II intron RNA with significantly fewer premature stops than

SuperScript III (SSIII; Thermo Fisher Scientific), a widely used genetically engineered

derivative of Moloney murine leukemia virus (M-MLV) RT. The high signal (full-length

cDNA copies) to noise (premature RT stops) ratio is crucial for accurately and efficiently

mapping RNA structures and RNA-protein interactions. Finally, the TGIRT enzymes

were found to have a two- to four-fold lower in vitro error rate than SSIII in an M13-

based lacZ forward mutation assay (Mohr et al., 2013).
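
As a rough illustration of the signal-to-noise idea described above, the following Python sketch counts full-length cDNA products against premature stops for a template of known length. The function name, the `tolerance` parameter, and the example product lengths are hypothetical and are not taken from the assays cited here; the sketch only shows how such a ratio could be computed from a list of cDNA product lengths.

```python
def full_length_signal_to_noise(cdna_lengths, template_length, tolerance=0):
    """Count full-length products (signal) and premature stops (noise).

    cdna_lengths: iterable of cDNA product lengths in nucleotides.
    template_length: length of the RNA template in nucleotides.
    tolerance: hypothetical allowance (nt) for calling a product full-length.
    Returns (full_length_count, premature_stop_count, ratio).
    """
    lengths = list(cdna_lengths)
    full = sum(1 for n in lengths if n >= template_length - tolerance)
    premature = len(lengths) - full
    # With no premature stops the ratio is unbounded.
    ratio = float("inf") if premature == 0 else full / premature
    return full, premature, ratio


# Made-up product lengths for an 807-nt template (illustration only).
example_lengths = [807, 807, 807, 412, 807, 650, 807, 807]
print(full_length_signal_to_noise(example_lengths, 807))
```

A higher ratio means that a larger share of products reach the end of the template, which is the property that matters when RT stops are later interpreted as structural or protein-binding signals.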

1.4 THERMOSTABLE GROUP II INTRON REVERSE TRANSCRIPTASES ARE ADVANTAGEOUS

FOR NEXT-GENERATION RNA SEQUENCING

Next-generation RNA sequencing (RNA-seq) is a supremely powerful method for

transcriptome profiling and gene expression analysis, with applications that include the

identification of novel biomarkers and new diagnostic methods for diseases (Wang et al.,

2009; Wilhelm and Landry, 2009; Ozsolak and Milos, 2011; Chen et al., 2012).

All RNA-seq methods rely upon an initial cDNA synthesis step in which a reverse

transcriptase (RT) converts RNA sequences into DNA, which can then be sequenced by

powerful high-throughput DNA sequencing technologies. Current RNA-seq methods can

be divided into two general categories. In one category, used for the analysis of mRNAs

and long non-coding RNAs (lncRNAs), the initial reverse transcription step typically

enriches for cDNAs of polyadenylated (poly(A)+) RNAs, either by priming with

oligo(dT) or by priming with random oligomers after depletion of the highly abundant

rRNAs (Levin et al., 2010; Ozsolak and Milos, 2011). The resulting cDNAs are then

converted into suitably sized double-stranded DNAs and ligated to platform-specific

sequencing adapters (Ozsolak and Milos, 2011). The most widely used of these methods

employs RNA fragmentation, random hexamer priming, and addition of dUTP during

second-strand synthesis; after adapter ligation, the uridine-containing second strand is

either excluded during PCR with a high-fidelity DNA polymerase or degraded

enzymatically to achieve strand specificity (Levin et al., 2010; Head et al., 2014). A

second category of RNA-seq methods, used for miRNAs and other small non-coding

RNAs (small ncRNAs), involves ligation of RNA-seq adapters containing primer-binding

sites to the 3’ and/or 5’ ends of target RNAs with RNA ligase, followed by reverse

transcription and PCR amplification for RNA-seq library construction (Levin et al., 2010;

Raabe et al., 2014). Limitations of these methods include: (i) the inability to

comprehensively profile mRNAs and lncRNAs together with small ncRNAs in the same

RNA-seq reaction; (ii) the relatively low fidelity and processivity of retroviral RTs used

for cDNA synthesis (Hu and Hughes, 2012), making it difficult to analyze RNA sequence

polymorphisms and highly structured or GC-rich RNAs; and (iii) the inefficiency and/or

biases introduced by RNA-seq adapter ligation using RNA ligases or by random hexamer

priming (Linsen et al., 2009; Hansen et al., 2010; Levin et al., 2010; Lamm et al., 2011;

Raabe et al., 2014).

In addition to high thermostability, processivity, and fidelity, properties that are

useful for producing full-length reads from the highly structured or GC-rich RNAs,

TGIRT enzymes also have a novel end-to-end template-switching activity that can attach

RNA-seq adapters to the target RNA during reverse transcription without a separate RNA

ligase step (Mohr et al., 2013). TGIRTs differ from retroviral RTs in template-switching

with minimal base-pairing to the 3’ ends of the target RNA (Mohr et al., 2013). Recent

work in our lab showed that TGIRT template-switching enables facile RNA-seq analysis

of miRNAs with less bias than two commercial kits and could potentially

have wide RNA-seq applications (Mohr et al., 2013).

1.5 OVERVIEW OF THE DISSERTATION RESEARCH

This dissertation focuses on the further development of the TGIRT template-

switching method and its broad applications in next-generation RNA sequencing,

diagnostics and precision medicine. Specifically, by providing a new biotechnology that

is simple, rapid and efficient, I aim to: (i) contribute new insights into biological studies

that require high-throughput sequencing of structured RNAs that are refractory to

conventional RNA-seq analysis (Chapter 2); (ii) develop sensitive, non-invasive and cost-

effective diagnostic tools and personalized medical care for diseases, including cancer

(Chapters 3 and 4); and (iii) improve the accuracy and efficiency of current research tools used

in the mapping of RNA secondary structure and RNA-protein interactions (Chapter 5).

In Chapter 2, I optimized the initial TGIRT template-switching method for RNA-

seq analysis of diverse small RNA classes, now referred to as the TGIRT-seq small

RNA/CircLigase method, and demonstrated its usefulness by sequencing tRNAs, which

are virtually absent from datasets obtained with conventional RNA-seq methods due to

their stable secondary and tertiary structures, and extensive post-transcriptional

modifications. Through collaboration with Dr. Tao Pan’s research group at the University

of Chicago, I developed an efficient and quantitative high-throughput tRNA sequencing

method that can be widely used in studies of tRNA expression, modification and

regulation. Additionally, I describe two studies that revealed novel functions of tRNA-

binding proteins by utilizing TGIRTs for tRNA deep sequencing. In the first study, by

collaborating with Dr. Kathleen Collins’ research group at the University of California-

Berkeley, we showed that the human interferon-induced protein IFIT5 binds to a broad

spectrum of precursor and processed tRNA transcripts, uncovering a surprisingly flexible

order of human tRNA processing reactions, and potential roles of IFIT5 protein in

cytosolic tRNA quality control and innate immunity. In the second study, by

collaborating with several research groups, including Dr. Adam Frost at the University of

Utah, Dr. Onn Brandman at Stanford University, and Drs. Jonathan Weissman and

Yifan Cheng at the University of California-San Francisco, we established tRNA

recognition specificity of the Rqc2 protein, a component of the yeast quality control

complex, and uncovered a novel mRNA-independent mechanism for elongation of

nascent peptides.

In Chapter 3, I developed a new TGIRT-seq method that is simple, rapid and

efficient for analysis of RNAs of all sizes in a single RNA-seq reaction, now referred to

as the TGIRT-seq total RNA method. I demonstrated the use of the method in profiling

circulating RNAs in human plasma. Circulating RNAs are potentially useful as

biomarkers for human diseases. However, the extraction and analysis of circulating

RNAs have been challenging due to their extremely low quantity and quality. In this

chapter, I describe methods for plasma RNA isolation and RNA-seq analysis by the TGIRT-

seq total RNA method, which enabled construction of RNA-seq libraries from <1 ng of

plasma RNAs in <5 h. TGIRT-seq of RNA in 1-mL plasma samples from a healthy

individual revealed RNA fragments mapping to a diverse population of protein-coding

genes and lncRNAs, which are enriched in intron and antisense sequences, as well as

nearly all known classes of small ncRNAs, some of which have never before been seen in

plasma. Surprisingly, many of the small ncRNA species were present as full-length

transcripts, suggesting that they are protected from plasma RNases in ribonucleoprotein

(RNP) complexes and/or exosomes. The TGIRT-seq total RNA method is readily

adaptable for profiling of whole-cell and exosomal RNAs, and related procedures

including ribosome profiling.

In Chapter 4, by using RNAs isolated from extracellular vesicles in plasma, I

explored the use of the TGIRT-seq total RNA method for the identification of novel

biomarkers in patients at different stages of multiple myeloma, which is a prevalent blood

cancer. This is an on-going study done in collaboration with Drs. Flavia Pichiorri and

Craig Hofmeister’s group at the Ohio State University. Preliminary sequencing results

showed that TGIRT-seq identified differentially expressed mRNA transcripts that are

consistent with patient survival based on a published microarray-based gene expression

dataset (Popovici et al., 2010; Shi et al., 2010). Additionally, TGIRT-seq also identified

several small ncRNAs as potential novel biomarkers, including Y RNA derived

fragments. Other on-going collaborations described in Chapter 4 include analysis of FFPE

(formalin-fixed, paraffin-embedded) tumor tissue, PBMCs (peripheral blood

mononuclear cells) and plasma samples from patients with inflammatory breast cancer

with Dr. Naoto Ueno’s group at the MD Anderson Cancer Center, and analysis of plasma

samples with Dr. Joseph McCormick’s group at the University of Texas Rio Grande

Valley for a large-scale population study of environmental impact on human health.

In Chapter 5, I adapted TGIRT-seq to commonly used procedures for mapping

RNA secondary structure and RNA-protein interactions, including: (i) selective 2′-

hydroxyl acylation analyzed by primer extension (SHAPE); (ii) cross-linking and

analysis of cDNAs (CRAC); and (iii) individual-nucleotide resolution cross-linking and

immunoprecipitation (iCLIP). Using the group IIC intron GsI-IIC, found in Geobacillus

stearothermophilus, and its encoded protein (denoted GsI-IIC-MRF) as an in vitro model

system, I demonstrated the ability of TGIRT-SHAPE to map the secondary structure of the

highly structured 722-nt GsI-IIC intron RNA at single-nucleotide resolution using a

single primer annealed to the 3’ end of the RNA. The secondary structure of GsI-IIC

intron RNA obtained by TGIRT-SHAPE agreed with that predicted based on

phylogenetic studies (unpublished). I also used TGIRT-CRAC to identify the direct

interaction sites between GsI-IIC intron RNA and its IEP at the pre-catalytic step of

splicing. Preliminary data identified regions known to be involved in IEP binding in other

group II introns, and several nucleotides involved in long-range RNA interactions at the

tertiary level, suggesting the IEP functions to facilitate formation of active intron RNA

structures during splicing. I also contributed to adapting TGIRT-seq for iCLIP procedures

to study RNA-protein interactions in vivo, including the identification of RNA substrates

and binding sites recognized by NS1 protein of influenza virus, and by human MDA5

protein, through collaborations with research groups including Dr. Krug at the University

of Texas at Austin and Dr. Michael Gale, Jr. at the University of Washington,

respectively.

Figure 1.1: Group II intron splicing and mobility.

(A) Intron splicing. After transcription, the group II intron RNA folds into conserved

secondary and tertiary structures and forms an active site that binds the splice sites and the

branch-point nucleotide to catalyze splicing. The intron-encoded protein is a multifunctional

reverse transcriptase (RT) that binds specifically to the intron RNA and stabilizes the

catalytically active RNA structure for RNA splicing. (B) Intron mobility. After splicing, the

group II intron RT remains bound to the excised intron lariat RNA in an RNP that

promotes intron mobility (“retrohoming”) to new DNA sites. In this process, the intron RNA

reverse splices directly into the top strand of the target DNA, while the intron-encoded

multifunctional RT cleaves the bottom strand of the target DNA and uses the 3′ end of the

cleavage site as a primer for reverse transcription of the inserted intron RNA. The

resulting intron cDNA is integrated into the host genome by cellular DNA recombination

and/or repair mechanisms (Lambowitz and Belfort, 2015).

Figure 1.2: Comparison of group II intron and retroviral RTs.

Group II intron RT domains: N-terminal RT domain with conserved sequence

blocks RT-1 to RT-7, corresponding to the fingers and palm domains of retroviral RTs

(HIV-1 RT); X/thumb with predicted α-helices (above) corresponding to thumb domain

of retroviral RTs; C-terminal DNA binding (D) and DNA endonuclease (En) domains

instead of the RNase H domain of retroviral RTs (HIV-1 RT). Group II intron RTs have

an N-terminal extension (RT-0) and insertions between the conserved RT sequence

blocks (RT-2a, -3a, -4a and -7a) that are absent in retroviral RTs (HIV-1 RT).

Chapter 2: RNA-seq of transfer RNAs

2.1 EFFICIENT AND QUANTITATIVE HIGH-THROUGHPUT TRNA SEQUENCING*

*Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12, 835-837 (2015). Authors

include Guanqun Zheng, Yidan Qin, Wesley C. Clark, Qing Dai, Chengqi Yi, Chuan He, Alan M.

Lambowitz and Tao Pan; G.Z. and Y.Q. equally contributed to this work; T.P. and A.M.L. jointly

supervised this work.

Widely used small RNA-seq methods start with adaptor ligation and cDNA

synthesis from biological RNA samples followed by PCR amplification to generate

sequencing libraries (Wang et al., 2009). These standard methods are able to sequence

most cellular RNAs, including short-read sequencing of microRNAs, fragments of

rRNAs, small nuclear RNAs, and small nucleolar RNAs, or fragmented mRNAs and

long-noncoding RNAs (lncRNAs). tRNA is the only class of small cellular RNA for

which the standard sequencing methods cannot yet be used efficiently and

quantitatively, although attempts have been made (Pang et al., 2014). Significant

obstacles for the sequencing of tRNA include the presence of numerous post-

transcriptional modifications and its stable and extensive secondary structure, which

interfere with cDNA synthesis and adaptor ligation. tRNAs are essential for cells, and

their synthesis is under stringent cellular control (Phizicky and Hopper, 2010). Recent

findings show that tRNA expression and mutations, and cleaved tRNA fragments are

associated with various diseases, such as neurological pathologies and cancer

development (Abbott et al., 2014; Anderson and Ivanov, 2014; Goodarzi et al., 2015;

Kirchner and Ignatova, 2015). The lack of efficient and quantitative tRNA sequencing

methods has hindered biological studies of tRNA.

2.1.1 tRNA sequencing by combining demethylase treatment with the TGIRT-seq

small RNA/CircLigase method

In collaboration with Dr. Tao Pan’s research group at the University of Chicago,

we applied two strategies to eliminate or substantially reduce the obstacles of tRNA

modification and structure for efficient and quantitative tRNA sequencing (Fig. 2.1)

(Zheng et al., 2015).

First, an enzyme mixture was used to remove methylations at the Watson-Crick

face. Three specific modifications are abundant in eukaryotic tRNAs and are particularly

problematic for reverse transcriptases (RTs), causing cDNA synthesis to stop or

incorporate a wrong nucleotide. In mammals, N1-methyladenosine (m1A) is present in all

tRNAs at position 58, N3-methylcytosine (m3C) is present in five tRNAs at position 32

and the variable loop, and N1-methylguanosine (m1G) is present in about half of all

tRNAs at position 37 or 9 (Fig. 2.1A). Our collaborators used a mixture of two

recombinant enzymes, a wild-type AlkB (wtAlkB) from E. coli and an engineered mutant

AlkB (D135S) to remove ~70-80% of these three methylations in human tRNAs (Falnes

et al., 2002; Trewick et al., 2002). The remaining m1A or m1G may be buried deeper in

the tRNA tertiary structure (m1A or m1G at position 9 of tRNAs) and thus not easily

accessible to demethylase treatment without causing tRNA degradation.

Second, we used a thermostable group II intron reverse transcriptase (TGIRT) to

generate cDNAs from highly structured tRNAs (Fig. 2.1B). The TGIRT first binds to an

initial template-primer substrate composed of an RNA oligonucleotide containing RNA-

seq adapter sequences annealed to a complementary DNA primer. For Illumina

sequencing, the RNA-seq adapter contains both Illumina Read 1 and 2 primer-binding

sites, and the DNA primer contains the complementary sequence (Materials and

Methods). After forming a complex with the initial template-primer substrate, the TGIRT

initiates reverse transcription by switching directly from the 5’ end of the RNA-seq

adapter to the 3’ end of a target RNA, yielding a continuous cDNA linking the two

sequences. The RNA-seq adapter has a 3’-blocking group that impedes secondary

template-switching to the 3’ end of that RNA.

To increase the efficiency of template-switching, the DNA primer annealed to the

RNA-seq adapter in the initial adapter substrate has a single-nucleotide 3’ overhang. This

3’-overhang nucleotide base-pairs to the 3’-terminal nucleotide of the target RNA,

resulting in a seamless template-switching junction between the RNA-seq adapter and the

target RNA (Mohr et al., 2013). In the present work, an initial template-primer substrate

with a single T overhang was used to enrich for mature tRNAs, which always have an A

at their 3’ ends due to the post-transcriptionally added CCA sequence. Alternatively, an

equimolar mixture of A, C, G, and T 3’ overhangs (denoted N (Mohr et al., 2013)) can be

used to construct RNA-seq libraries from RNA pools with minimal bias. The resulting

cDNAs are gel-purified, circularized by CircLigase II ssDNA Ligase (Epicentre),

amplified by PCR, and sequenced on an Illumina instrument. This method is widely

applicable to other small RNA classes, including miRNA, and is referred to as the

TGIRT-seq small RNA/CircLigase method.
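
The base-pairing rule behind the 3’-overhang design can be illustrated with a minimal sketch. It assumes only standard Watson-Crick pairing between the overhang nucleotide of the DNA primer and the 3’-terminal nucleotide of the target RNA. The function names and toy sequences are hypothetical and are not part of the published protocol.

```python
# Illustrative sketch only: names and sequences are hypothetical.
# DNA overhang nucleotide that pairs with each possible RNA 3'-terminal nucleotide.
RNA_TO_DNA_PARTNER = {"A": "T", "C": "G", "G": "C", "U": "A"}


def matching_overhang(rna: str) -> str:
    """Return the DNA overhang nucleotide that base-pairs with the RNA 3' end."""
    return RNA_TO_DNA_PARTNER[rna[-1].upper()]


def favored_targets(rnas: list[str], overhang: str) -> list[str]:
    """Select RNAs whose 3'-terminal nucleotide pairs with the given overhang."""
    return [rna for rna in rnas if matching_overhang(rna) == overhang]


# Toy sequences (made up): one CCA-tailed RNA and two RNAs with other 3' ends.
toy_rnas = ["GCAUCCA", "GGAUCUG", "AUGCCUU"]
print(matching_overhang("GCAUCCA"))    # overhang that pairs with a terminal A
print(favored_targets(toy_rnas, "T"))  # RNAs favored by a single T overhang
```

In this toy case, only the CCA-tailed RNA is selected by the T overhang, whereas an equimolar mixture of all four overhangs would provide a matching partner for every 3’ end.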

The tRNA libraries were sequenced on an Illumina HiSeq 2500 instrument. The

sequencing reads were mapped to the genomic tRNA database, which contains 515

predicted tRNA genes distributed over 330 unique sequences and 110 predicted tRNA

pseudogenes (Chan and Lowe, 2009). In combination with the demethylase treatment, we

obtained longer and full-length tRNA reads with markedly reduced amounts of RT stops

at the m1A58 and m1G37 positions (Fig. 2.1C), a property that is crucial for the ability

to adequately map the mammalian tRNAome at single-base resolution. In addition, we

found that the TGIRT enzyme produced more full-length cDNA products when the

reverse transcription reaction time was increased (Fig. 2.1D).

2.1.2 Analysis of tRNA isoacceptors, modifications and gene expressions

Our collaborators performed additional analysis to further demonstrate the

usefulness of our sequencing method. Plotting the abundance of each tRNA isoacceptor against its gene

copy number showed a poor correlation, which is consistent with the known tissue-

specific tRNA expression in humans (Dittmar et al., 2006; Gingold et al., 2014). The

comparison between sequencing and array results of the Arg-tRNA showed the same

trend of isoacceptor abundance, thus validating the quantitative nature of tRNA

abundance obtained independently through sequencing- and hybridization-based

approaches. The analysis of RT misincorporations at known modification positions with

and without demethylase treatment indicates that the DM-tRNA-seq (demethylase-

thermostable group II intron RT tRNA sequencing) method can determine differences in

the modification dynamics of m1A, m1G and m3C at single-base resolution, as well as

potentially infer positions of non-demethylated modifications. Finally, the examination of

unique tRNA genes from human chromosome 6, which contains a major tRNA gene

cluster (Horton et al., 2004), showed higher expression levels within the cluster than

outside of the cluster. The expression levels of tRNA genes in the cluster were uneven,

suggesting that the expression of tRNA genes was not coordinated throughout the entire

cluster in HEK293T cells.
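
The comparison of RT misincorporation with and without demethylase treatment can be sketched as a per-position mismatch fraction. This is an illustration under stated assumptions, not the analysis pipeline used in the study: the function name is hypothetical, the input is a string of base calls from reads aligned to one reference position, and the base calls are made-up toy data.

```python
from collections import Counter


def misincorporation_rate(base_calls: str, reference_base: str) -> float:
    """Fraction of aligned reads whose base call differs from the reference base."""
    if not base_calls:
        raise ValueError("no reads cover this position")
    counts = Counter(base_calls.upper())
    mismatches = sum(
        n for base, n in counts.items() if base != reference_base.upper()
    )
    return mismatches / len(base_calls)


# Toy base calls (made up) at a single reference adenosine.
untreated_calls = "AATTGATTTA"  # 6 of 10 reads differ from the reference A
treated_calls = "AAAAATAAAA"    # 1 of 10 reads differs from the reference A

untreated_rate = misincorporation_rate(untreated_calls, "A")
treated_rate = misincorporation_rate(treated_calls, "A")
print(untreated_rate, treated_rate)
```

A position whose mismatch fraction drops after demethylase treatment, as in this toy example, would be a candidate for a demethylase-sensitive modification.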

2.1.3 Discussion

The approach described above makes efficient and quantitative tRNA-seq

feasible. Furthermore, in a time-course reverse transcription reaction of tRNAs, the

TGIRT enzyme produced more full-length cDNA products at longer time points. This

suggests extremely tight binding of the TGIRT enzyme to the RNA template: the enzyme

stalls at the modification sites without falling off, and is capable of reading through the

modified nucleotides when given more time. Interestingly, it also appears that the TGIRT

enzyme yields a distinct pattern of misincorporated nucleotides characteristic of the

modification, providing an additional advantage of being able to study modifications at

single-nucleotide resolution in a high-throughput manner.

2.2 ANALYSIS OF PRECURSOR AND MATURE TRNAS ASSOCIATED WITH THE HUMAN

INTERFERON-INDUCED PROTEIN IFIT5*

*Broad and adaptable RNA structure recognition by the human interferon-induced tetratricopeptide repeat

protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030 (2014). Authors include George E.

Katibah, Yidan Qin, David J. Sidote, Jun Yao, Alan M. Lambowitz, and Kathleen Collins. G.E.K., A.M.L.,

and K.C. designed research; G.E.K., Y.Q., and D.J.S. performed research; Y.Q., D.J.S., and J.Y.

contributed new reagents/analytic tools; G.E.K., Y.Q., D.J.S., J.Y., A.M.L., and K.C. analyzed data; and

G.E.K., Y.Q., D.J.S., J.Y., A.M.L., and K.C. wrote the paper.

Innate immune responses provide a front-line defense against pathogens. Unlike

adaptive immune responses, innate immunity relies on general principles of

discrimination between self and pathogen epitopes to trigger pathogen suppression

(Gürtler and Bowie, 2013). Pathogen-specific features that can provide this

discrimination come under evolutionary selection to evade host detection, and in turn

host genes adapt new recognition specificities for pathogen signatures. Among the most

clearly established targets of innate immune response recognition are nucleic acid

structures not typical of the host cell, such as cytoplasmic double-stranded RNA (Goubau

et al., 2013). Detection of a pathogen nucleic acid signature robustly induces type I

interferon, which activates a cascade of pathways for producing anti-viral effectors

(Schoggins and Rice, 2011). Highly expressed interferon-induced proteins with

tetratricopeptide repeats (IFITs) are proposed to function as RNA binding proteins, but

the RNA binding and discrimination specificities of IFIT proteins remain unclear.

2.3.1 The human IFIT5 protein

Cytoplasmic viral RNA synthesis occurs without co-transcriptional coupling to

the 5'-capping machinery, which acts pervasively on host-cell nuclear RNA polymerase

II transcripts (Ghosh and Lima, 2010; Topisirovic et al., 2011). Eukaryotic mRNA 5'

ends are first modified by addition of a cap0 structure containing N7-methylated

guanosine, which is joined to the first nucleotide (nt) of the RNA by a 5’-5’ triphosphate

linkage (7mGpppN). In higher eukaryotes including humans, cap0 is further modified by

ribose 2'-O-methylation of at least 1 nt (7mGpppNm, cap1) and sometimes 2 nt

(7mGpppNmpNm, cap2). Cap0 addition makes essential contributions to mRNA

biogenesis and function in steps of mRNA splicing, translation and protection from decay

(Ghosh and Lima, 2010; Topisirovic et al., 2011). In contrast, the biological role of

mRNA cap0 modification to cap1 and cap2 structures is largely enigmatic. Some viruses

encode enzymes for 7mGpppN formation and less frequently the ribose 2'-O-methylation

necessary to generate cap1 (Decroly et al., 2012). Recent studies show that virally

encoded cap 2’-O-methyltransferase activity can inhibit the innate immune response

(Daffis et al., 2010; Züst et al., 2011; Szretter et al., 2012; Habjan et al., 2013; Kimura et

al., 2013).

The IFIT family of interferon-induced proteins with tetratricopeptide repeats

(TPRs) are among the most robustly accumulated proteins following type I interferon

signaling (Diamond and Farzan, 2013; Zhou et al., 2013). Phylogenetic analyses reveal

different copy numbers and combinations of four distinct IFIT proteins (IFIT1, 2, 3 and

5) even within mammals, generated by paralog expansions and/or gene deletions,

including the loss of IFIT5 in mice and rats (Liu et al., 2013). Human IFIT1, IFIT2 and

IFIT3 co-assemble in cells into poorly characterized multimeric complexes that exclude

IFIT5 (Pichlmair et al., 2011; Katibah et al., 2013). Recombinant IFIT-family proteins

range from monomer to multimer, with crystal structures solved for a human IFIT2

homodimer, the human IFIT5 monomer, and an N-terminal fragment of human IFIT1

(Yang et al., 2012; Abbas et al., 2013; Feng et al., 2013; Katibah et al., 2013). Studies of

IFIT1 report its preferential binding to either 5' triphosphate (ppp) RNA or cap0 RNA or

optimally cap0 without guanosine N7-methylation (Pichlmair et al., 2011; Habjan et al.,

2013; Kimura et al., 2013; Kumar et al., 2014). Reports of IFIT5 RNA binding specificity

are likewise inconsistent: the protein has been described to bind RNA single-stranded 5'

ends with ppp and monophosphate (p) but not OH (Katibah et al., 2013); ppp but not p,

OH or cap0 (Abbas et al., 2013); ppp but not cap0 (Habjan et al., 2013; Kumar et al.,

2014); or single-stranded 5'-p RNA and double-stranded DNA (Feng et al., 2013).

Using structure-guided mutagenesis coupled with quantitative binding assays of

purified recombinant protein, our collaborators in Dr. Kathleen Collins’s research group

at the University of California-Berkeley established that IFIT5 can alternatively expand

or introduce bias in protein binding to RNAs with 5' monophosphate, triphosphate, cap0

(triphosphate-bridged N7-methylguanosine) or cap1 (cap0 with RNA 2’-O-methylation)

(Katibah et al., 2014). This surprisingly adaptable IFIT5 recognition specificity for RNA

5' structure in vitro suggested that it could bind to many cellular RNAs.

2.3.2 TGIRT-seq profiling of IFIT5-bound cellular RNAs

To investigate the diversity of IFIT5-bound cellular RNAs in an unbiased manner,

we deep sequenced RNAs copurified with IFIT5 from HEK293 cells. A previous study

had shown that IFIT5 binds to tRNAs (Katibah et al., 2013), which are recalcitrant to

standard sequencing methods. Therefore, we used the TGIRT-seq small RNA/CircLigase

method, with an equimolar mixture of A, T, G, C overhangs in the initial template-primer

substrate for RNA-seq library construction with minimal bias. TGIRT-seq was first

performed for cellular RNAs that co-purified with IFIT5 from a HEK293 cell line with

3xF-IFIT5 expressed at a physiological level (Katibah et al., 2013). To capture in vivo

protein-RNA interactions, formaldehyde cross-linking was used before stringent

purification and then reversed prior to analyzing the bound RNAs. We also compared

IFIT5-bound RNAs isolated under native affinity purification conditions from extracts of

cells with or without prior interferon-β treatment. In addition, we compared wild-type and

mutant E33A and E33A/D334A IFIT5 proteins expressed in HEK293 cells by transient

transfection. In the first set of 3 samples, comparing wild-type IFIT5 with or without

formaldehyde cross-linking or interferon-β treatment prior to cell lysis, cDNA products

were pooled and amplified together (Fig. 2.2A, Table 2.1). In the second set of 3 samples,

because RNAs bound to wild-type and mutant IFIT5 had different size profiles, we

amplified and sequenced discrete pools of cDNA lengths (Fig. 2.2B, Table 2.1). Finally,

in a third set of samples, we pooled cDNAs before amplification and sequencing for a biological

replicate of the wild-type versus mutant IFIT5 comparison (Table 2.2). For each

purification condition, cDNAs were sequenced on an Illumina MiSeq to a depth of 1

million or more reads, which were mapped to the Ensembl GRCh37 human genome

reference sequence.

RNA from IFIT5 purifications gave TGIRT-seq reads that mapped predominantly

to tRNA gene loci in all samples (Table 2.1 and Table 2.2). Cross-linked and native

extract purifications showed a large diversity of bound tRNAs, with reads from different

samples mapping to 507-527 of the 625 annotated human tRNA and tRNA pseudogene

loci (Fig. 2.3). For IFIT5 expressed by transient transfection with size-selected cDNA

pools sequenced separately, the largest cDNA size pool contained substantial amounts of

5S rRNA, which is less abundant in the cross-linked RNA purification (Table 2.1) and

thus could in part reflect IFIT5 binding of a highly abundant RNA in native cell extract

(Katibah et al., 2013) (Table 2.1; size categories a, b and c correspond to cDNA of ~55-

82, 84-150 and 150-230 nt, respectively, including the 42 nt primer added by template-

switching; Fig. 2.2).
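
Because each cDNA includes the 42-nt primer added by template-switching, the length of the RNA-derived portion follows by subtraction. The sketch below works through that arithmetic for the three size categories quoted above. The names are hypothetical, the ranges are the approximate values from the text, and assigning the shared 150-nt boundary to category b is an arbitrary choice made only for this illustration.

```python
from typing import Optional

PRIMER_LENGTH_NT = 42  # primer sequence added by template-switching (from the text)

# Approximate cDNA size ranges from the text, in nt (treated here as inclusive).
SIZE_CATEGORIES = {"a": (55, 82), "b": (84, 150), "c": (150, 230)}


def rna_insert_length(cdna_length: int) -> int:
    """Length of the RNA-derived portion of a cDNA of the given total length."""
    return cdna_length - PRIMER_LENGTH_NT


def size_category(cdna_length: int) -> Optional[str]:
    """Return the first size category containing this cDNA length, if any."""
    for label, (low, high) in SIZE_CATEGORIES.items():
        if low <= cdna_length <= high:
            return label
    return None


for label, (low, high) in SIZE_CATEGORIES.items():
    print(label, rna_insert_length(low), rna_insert_length(high))
# a 13 40
# b 42 108
# c 108 188
```

Under these assumptions, category b (RNA inserts of roughly 42-108 nt) spans the size range expected for full-length tRNAs and short precursors.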

2.3.3 IFIT5 binds to a broad spectrum of precursor and processed tRNAs, as well as

other RNA polymerase III transcripts

To further characterize IFIT5-bound tRNAs, we plotted read coverage across

individual tRNA loci from 50 bp upstream to 50 bp downstream of the mature tRNA

ends, with representative coverage plots shown for the cross-linked RNA sample (Fig.

2.4; mature RNA ends are indicated with dashed lines). Some tRNA loci were

represented by reads abundant only across the mature tRNA region (iMetCAT, AspGTC

and HisGTG). Read alignments to the genome sequence revealed that many IFIT5-bound

mature tRNAs were full length including the post-transcriptionally added 3’ CCA tail. In

the case of HisGTG, the alignments also detected the expected post-transcriptional 5’

guanosine addition (Fig. 2.5A) (Phizicky and Hopper, 2010). Post-transcriptionally

modified nucleotides within the tRNA were evident from positions of frequent read

mismatch to the genome sequence (Fig. 2.5A). Some IFIT5-bound tRNA reads had

truncated 5' and/or 3' ends (Fig. 2.4 and Fig. 2.5A; iMetCAT) resulting from nuclease

cleavage of tRNAs and, for 5'-truncated ends, potentially from premature reverse

transcription stops.
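
The coverage plots described above can be approximated by counting, at each position of the window, the reads that overlap it. The following is a minimal sketch rather than the code used for Fig. 2.4. It assumes plus-strand loci, 1-based inclusive read intervals, and hypothetical names and coordinates.

```python
FLANK = 50  # bp plotted upstream and downstream of the mature tRNA ends (from the text)


def window_coverage(reads, mature_start, mature_end):
    """Per-position read depth from FLANK bp upstream to FLANK bp downstream.

    reads: iterable of (start, end) genomic intervals, 1-based and inclusive.
    Returns a list in which index 0 corresponds to mature_start - FLANK.
    """
    window_start = mature_start - FLANK
    window_end = mature_end + FLANK
    depth = [0] * (window_end - window_start + 1)
    for start, end in reads:
        # Clip each read to the plotted window before counting.
        low = max(start, window_start)
        high = min(end, window_end)
        for pos in range(low, high + 1):
            depth[pos - window_start] += 1
    return depth


# Toy locus (made up): a 72-nt mature tRNA at positions 1001-1072.
toy_reads = [
    (1001, 1072),  # full-length mature tRNA
    (990, 1072),   # 5'-extended precursor
    (1001, 1040),  # 3'-truncated fragment
]
depth = window_coverage(toy_reads, 1001, 1072)
print(len(depth), depth[44], depth[50], depth[99])
```

Depth that extends upstream of the mature 5’ end, as contributed by the second toy read, is the signature of the 5’-extended precursors discussed below.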

In addition to mature tRNAs, we were surprised to find that numerous tRNA loci

were represented by abundant IFIT5-bound tRNAs with the 5' extension of a primary Pol

III transcript (Fig. 2.4, AlaTGC, ValAAC, ArgTCT, and LeuCAA). Many of these 5'-

extended tRNAs included the full-length mature tRNA sequence with a 3' CCA tail (Fig.

2.5A, AlaTGC). Also, some IFIT5-bound tRNAs with a 5' precursor extension and CCA

tail had undergone splicing to remove the intron (Fig. 2.4, ArgTCT and LeuCAA), which

is unexpected given that 5' processing precedes splicing in known tRNA biogenesis

pathways (Phizicky and Hopper, 2010). Furthermore, some of the spliced tRNAs had

aberrant splice junctions suggestive of missplicing (Fig. 2.5B). Of interest, some 5' and/or

3' extended or truncated tRNAs had post-transcriptionally appended poly-U tails (Table

2.1, Table 2.2 and Fig. 2.5C). We also found tRNA pseudogene transcripts (Fig. 2.4,

PseudoCCC), as well as a few tRNAs with atypically long 5' or 3' extensions or with

sequence reads ending at an internal modified nucleotide position suggestive of a reverse

transcription stop.
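
One simple way to flag candidate poly-U tails is to examine the 3’ portion of a read that does not align to the genome. The sketch below is an assumption-laden illustration and not the procedure used to generate Table 2.1 or Fig. 2.5C. The function name, the requirement that the unaligned tail consist only of U residues, and the minimum tail length of 2 are all hypothetical choices.

```python
def poly_u_tail_length(read: str, aligned_length: int, min_length: int = 2) -> int:
    """Length of a candidate non-templated 3' U tail, or 0 if none is found.

    read: RNA-sense read sequence (U may be written as T).
    aligned_length: number of 5' bases that aligned to the genome; any remaining
        3' bases are treated as non-templated in this sketch.
    min_length: shortest tail reported (hypothetical threshold).
    """
    tail = read[aligned_length:].upper().replace("T", "U")
    if len(tail) >= min_length and set(tail) == {"U"}:
        return len(tail)
    return 0


# Toy reads (made up): the first 7 nt of each are assumed to align to the genome.
print(poly_u_tail_length("GCAUCCAUUUU", 7))  # unaligned tail is UUUU
print(poly_u_tail_length("GCAUCCAUUGU", 7))  # tail contains a G, so not counted
print(poly_u_tail_length("GCAUCCA", 7))      # no unaligned tail
```

A real analysis would also need to exclude U residues encoded in the genome immediately downstream of the aligned region, such as the oligo(U) tract at RNA polymerase III termination sites.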

Cellular IFIT5 binding to the RNAs described above is consistent with its

biochemical specificity of RNA interaction in vitro: precursor tRNAs are expected to

have 5'-ppp from RNA polymerase III initiation, while mature tRNAs are expected to

have 5'-p generated by RNase P. Although biochemically consistent, some types of

incompletely processed IFIT5-bound tRNAs should be nuclear, whereas IFIT5 is

cytoplasmic (Katibah et al., 2013). The cytoplasmic localization of IFIT5 suggests that

some immature or aberrantly processed tRNA transcripts escape the nucleus to become

available for IFIT5 binding, either via mistransport or during mitosis.

IFIT5 also bound to a family of cytoplasmic, ~120 nt, Alu-related, primate-

specific RNA polymerase III small NF90-associated RNA (snaR) transcripts and 5S

rRNA (Fig. 2.4, Table 2.1, Table 2.2 and Fig. 2.5D). The snaRs have a single-stranded 5’

end but extensive secondary structure that impedes cDNA synthesis by a conventional

reverse transcriptase (Parrott and Mathews, 2007; Parrott et al., 2011). Nonetheless

TGIRT-seq gave coverage across the full snaR (Fig. 2.4 and Fig. 2.5D). The snaR

association with IFIT5 was further confirmed using blot hybridization. Notably, the poly-

U tailing of IFIT5-bound tRNAs was also observed for IFIT5-bound snaRs (Fig. 2.5D).

Compared to wild-type IFIT5 assayed in parallel, mutant E33A or E33A/D334A

IFIT5 purifications contained an increased proportion of rRNA and mRNA (Table 2.1

and Table 2.2). The mRNA reads showed no obvious bias for 5' ends and were more

abundant in native than in cross-linked samples (Table 2.1), suggestive of IFIT5 binding

to 5'-p mRNA fragments generated in cell extract. To investigate a potential change in

specificity of IFIT5 binding to tRNA 5' ends imposed by the E33A and E33A/D334A

substitutions, we determined the overall frequency of tRNA read start-site positions for

all tRNA loci combined (Fig. 2.6). Using reads mapped against tRNA loci from 50 bp

upstream to 50 bp downstream of the mature tRNA ends, most read start sites

corresponded to 5'-extended precursor (positions 1-50) or the mature tRNA 5’ end

(position 51). The cross-linked sample had a higher fraction of mature tRNA start sites at

position 51 than the two native purifications from the same cell line (Fig. 2.6A). Mutant

E33A or E33A/D334A IFIT5 purifications also showed an increased fraction of read start

sites at the mature tRNA 5' end (position 51) compared to the parallel purification of

wild-type IFIT5 (Fig. 2.6B), possibly reflecting some shift of the mutant IFIT5 proteins

toward binding of 5’-p versus 5’-ppp RNAs.
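
The start-site analysis can be summarized by classifying each read start according to its position in the 50-bp-flanked window, in which positions 1-50 are upstream of the mature tRNA and position 51 is the mature 5’ end. The following sketch uses hypothetical names and made-up start positions, and is meant only to illustrate the bookkeeping behind Fig. 2.6.

```python
from collections import Counter

MATURE_5P_POSITION = 51  # window position of the mature tRNA 5' end (from the text)


def classify_start(position: int) -> str:
    """Assign a read start position in the window to a 5'-end class."""
    if position < 1:
        raise ValueError("window positions are 1-based")
    if position < MATURE_5P_POSITION:
        return "5'-extended precursor"
    if position == MATURE_5P_POSITION:
        return "mature 5' end"
    return "internal"


def start_site_fractions(start_positions):
    """Fraction of reads in each 5'-end class."""
    counts = Counter(classify_start(p) for p in start_positions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy read start positions (made up) pooled across tRNA loci.
toy_starts = [51, 51, 51, 40, 45, 51, 60, 51, 30, 51]
fractions = start_site_fractions(toy_starts)
print(fractions)
```

Comparing the fraction assigned to the mature 5’ end between samples is the kind of contrast drawn above between cross-linked and native purifications, and between wild-type and mutant IFIT5.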

2.3.4 Discussion

TGIRT-seq analysis supports IFIT5 binding to both 5'-p and 5'-ppp cellular RNAs

and also the poly-U tailing of IFIT5-bound RNA fragments, which appeared to be the

case for an IFIT5-bound tRNA fragment sequenced previously (Katibah et al., 2013).

Recent studies describe poly-U tailing as a commitment step for RNA degradation by the

human cytoplasmic exonuclease DIS3L2, which is deficient in human Perlman syndrome

(Astuti et al., 2012; Chang et al., 2013; Malecki et al., 2013). Because IFIT5-bound

tRNAs include 3′-extended or truncated poly-U tailed forms that would be a minority of

total cellular tRNA forms, we suggest that IFIT5 may not only sequester cellular tRNAs

but also trigger their subsequent degradation by DIS3L2. Analogous modes of action

have been found for RNaseL, which degrades cellular RNA to mediate its function in

innate immunity, and human schlafen 11, which binds tRNAs to alter translation as its

antiviral effector mechanism (Malathi et al., 2007; Li et al., 2012). We speculate that any

cytoplasmic single-stranded viral RNA 5′-p or 5′-ppp end would be bound by IFIT5,

potentially inhibiting viral mRNA capping and/or translation. In addition, by recruiting

RNA degradation enzymes to bound RNAs, IFIT5 could target virally encoded RNAs for

rapid turnover. Finally, our results suggest that IFIT5 could also play a general role,

beyond its function in innate immunity, in cytoplasmic surveillance for 5′-ppp RNA

polymerase III transcripts that escape the nucleus.

2.4 ANALYSIS OF TRNAS ASSOCIATED WITH THE YEAST RQC2P PROTEIN*

*Protein synthesis. Rqc2p and 60S ribosomal subunits mediate mRNA-independent elongation of nascent

chains. Science 347, 75–78 (2015). Authors include Peter S. Shen, Joseph Park, Yidan Qin, Xueming Li,

Krishna Parsawar, Matthew H. Larson, James Cox, Yifan Cheng, Alan M. Lambowitz, Jonathan S.

Weissman, Onn Brandman, Adam Frost. P.S.S., A.M.L., J.S.W., O.B. and A.F. designed research. P.S.S.,

J.P., Y.Q., X.L., M.L., O.B. and A.F. performed research. P.S.S., J.P., Y.Q., X.L., M.L., Y.C., A.M.L., J.S.W.,

O.B. and A.F. analyzed data. P.S.S., J.S.W., O.B. and A.F. wrote the paper.

Despite the processivity of protein synthesis, faulty messages or defective

ribosomes can result in translational stalling and incomplete nascent chains. In Eukarya,

this leads to recruitment of the RQC (Ribosome Quality Control) complex for

ubiquitylation and degradation of incompletely-synthesized nascent chains (Brandman et

al., 2012; Defenouillère et al., 2013; Shao et al., 2013; Verma et al., 2013). The molecular

components of the RQC complex include the AAA ATPase Cdc48p and its ubiquitin-

binding cofactors, the RING-domain E3 ligase Ltn1p, and two proteins of unknown

function, Rqc1p and Rqc2p. In collaboration with several research groups, including Dr.

Adam Frost at the University of Utah, Dr. Onn Brandman at Stanford University,

and Drs. Jonathan Weissman and Yifan Cheng at the University of California-San

Francisco, we set out to determine the mechanism(s) by which relatively rare proteins

like Ltn1p, Rqc1p, and Rqc2p recognize and rescue stalled 60S ribosome-nascent chain

complexes, which are vastly outnumbered by ribosomes translating normally or in stages

of assembly (Li et al., 2014).

2.4.1 The Rqc2p protein

Using cryo–electron microscopy (Cryo-EM) structures, our collaborators found

that the RQC components Ltn1p (YMR247C/Rkr1), a RING-domain E3 ubiquitin

ligase, and Rqc2p (YPL009C/Tae2) bind to the 60S subunit at sites exposed after 40S

dissociation, placing the Ltn1p RING domain near the exit channel of the ribosome and

Rqc2p over the P-site transfer RNA (tRNA) (Shen et al., 2015). Cryo-EM structures also

revealed Rqc2p binding to an ~A-site tRNA whose 3′-CCA tail is within the peptidyl

transferase center of the 60S. This observation was unexpected since A-site tRNA

interactions with the large ribosomal subunit are typically unstable and require mRNA

templates and elongation factors (Lill et al., 1986). Rqc2p’s interactions with the ~A-site

tRNA appeared to involve binding of the anticodon loop by a globular N-terminal

domain, as well as D-loop and T-loop interactions along Rqc2p’s coiled coil.


2.4.2 TGIRT-seq profiling of Rqc2p-bound tRNAs

To determine whether Rqc2p binds specific tRNA molecules, we extracted total

RNA after RQC purification from strains with intact RQC2 versus rqc2 strains. Deep

sequencing using the TGIRT-seq small RNA/CircLigase method revealed that the

presence of Rqc2p leads to an ~10-fold enrichment of tRNAAla(AGC) and tRNAThr(AGT) in

the RQC (Fig. 2.7A). In complexes isolated from strains with intact RQC2, Ala(AGC)

and Thr(AGT) are the most abundant tRNA molecules, even though they are less

abundant than a number of other tRNAs in yeast (Chu et al., 2011).
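
The enrichment comparison above can be illustrated with a minimal sketch. It assumes that reads have already been summed per anticodon (as in Fig. 2.7A) and that each library is normalized to its own total tRNA read count before the ratio is taken. The function name, the pseudocount, and all read counts are hypothetical and are not taken from the datasets.

```python
def fold_enrichment(with_rqc2p, without_rqc2p, pseudocount=1.0):
    """Return per-anticodon fold enrichment (with Rqc2p / without Rqc2p).

    Both inputs map anticodon labels to raw read counts. Counts are converted
    to fractions of each library so that sequencing depth cancels out; the
    pseudocount (an assumption) avoids division by zero.
    """
    labels = set(with_rqc2p) | set(without_rqc2p)
    total_with = sum(with_rqc2p.values()) + pseudocount * len(labels)
    total_without = sum(without_rqc2p.values()) + pseudocount * len(labels)
    result = {}
    for label in labels:
        frac_with = (with_rqc2p.get(label, 0) + pseudocount) / total_with
        frac_without = (without_rqc2p.get(label, 0) + pseudocount) / total_without
        result[label] = frac_with / frac_without
    return result


# Hypothetical counts, for illustration only.
example_with = {"Ala(AGC)": 5000, "Thr(AGT)": 4000, "Gly(GCC)": 1000}
example_without = {"Ala(AGC)": 500, "Thr(AGT)": 400, "Gly(GCC)": 9100}
example_ratios = fold_enrichment(example_with, example_without)
```

With these made-up counts, the two enriched anticodons come out near 10-fold while the third falls below 1, which is the pattern of result the comparison is designed to reveal.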

Cryo-EM structures suggested that Rqc2p’s specificity for these tRNAs is due in

part to direct interactions between Rqc2p and nucleotides 32-36 of the anticodon loop,

some of which are edited or modified in the mature tRNA (Fig. 2.7B). Adenosine 34 in

the anticodon of both tRNAAla(AGC) and tRNAThr(AGT) is deaminated to inosine (Crick,

1966; Gerber and Keller, 1999; Agris et al., 2007), and this was detected by TGIRT-seq

as a diagnostic guanosine upon reverse transcription (Fig. 2.7B,C) (Delannoy et al., 2009;

Katibah et al., 2014). Further analysis of the sequencing data revealed that cytosine 32 in

tRNAThr(AGT) is also deaminated to uracil in ~70% of the Rqc2p-enriched reads (Fig.

2.7C) (Rubio et al., 2006). Together with the structure, this suggests that Rqc2p binds to

the D-, T- and anticodon loops of the ~A-site tRNA, and that recognition of the

32-UUIGY-36 edited motif accounts for Rqc2p’s specificity for these two tRNAs (Fig. 2.7C).

The pyrimidine at position 36 could explain the discrimination between the otherwise

similar anticodon loops that harbor purines at base 36.
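
As a rough illustration of how the edited motif described above could be scored in sequencing reads, the following sketch checks positions 32-36 of each read. It assumes that reads are reported in the RNA sense, so that U appears as T and inosine at position 34 appears as G; the function names and the idea of pre-extracted 5-nt loop strings are assumptions made for the example.

```python
import re

# Positions 32-36 as they would appear in sequencing reads reported in the
# RNA sense: U is read as T and inosine (I34) is read as G, so the edited
# 32-UUIGY-36 motif becomes TTGG followed by a pyrimidine (C or T).
EDITED_MOTIF = re.compile(r"TTGG[CT]")


def has_edited_motif(loop_32_to_36):
    """Return True if the 5-nt string for positions 32-36 matches the motif."""
    return EDITED_MOTIF.fullmatch(loop_32_to_36.upper()) is not None


def edited_fraction(loops):
    """Fraction of reads whose positions 32-36 match the edited motif."""
    if not loops:
        return 0.0
    return sum(has_edited_motif(loop) for loop in loops) / len(loops)
```

A read with an unedited C32 (CTGGT) or an unedited A34 (TTAGC) does not match, so the fraction reports only reads that carry both edits.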


Through a series of biochemical and genetic assays, our collaborators demonstrated

that Rqc2p recruits alanine- and threonine-charged tRNAs to the A site and directs the

elongation of stalled nascent chains with non-templated Carboxy-terminal Ala and Thr

extensions or “CAT” tails, which may also function in the activation of Heat Shock

Factor 1 (Hsf1p). The identification of the Rqc2p-bound tRNAAla(AGC) and tRNAThr(AGT)

could not be done by conventional RNA-seq and required the use of TGIRT-seq.

2.4.3 Discussion

Integrating our observations, we propose the model schematized in Figure 2.8.

Ribosome stalling leads to dissociation of the 60S and 40S subunits, followed by

recognition of the peptidyl-tRNA-60S species by Rqc2p and Ltn1p. Ltn1p ubiquitinates

the stalled nascent chain, and this leads to Cdc48p recruitment for extraction and

degradation of the incomplete translation product. Rqc2p, through specific binding to

Ala(IGC) and Thr(IGU) tRNAs, directs the template-free and 40S-free elongation of the

incomplete translation product with CAT tails. CAT tails induce a heat shock response

through a mechanism that is yet to be determined.

Hypomorphic mutations in the mammalian homolog of LTN1 cause

neurodegeneration in mice (Chu et al., 2009). Similarly, mice with mutations in a CNS-

specific isoform of tRNAArg and GTPBP2, a homolog of yeast Hbs1 which works with

PELOTA/Dom34 to dissociate stalled 80S ribosomes, suffer from neurodegeneration

(Ishimura et al., 2014). These observations speak to the consequences ribosome stalls

impose on the cellular economy. Eubacteria rescue stalled ribosomes with the tmRNA-


SmpB system, which releases nascent chains fused with a unique C-terminal tag that

targets the nascent chain for proteolysis (Moore and Sauer, 2007). The mechanisms

utilized by eukaryotes, which lack tmRNA, to recognize and rescue stalled ribosomes and

their incomplete translation products have been unclear. The RQC complex, and

Rqc2p’s CAT tail tagging mechanism in particular, bears both similarities and contrasts

to the tmRNA trans-translation system. The evolutionary convergence upon distinct

mechanisms for extending incomplete nascent chains at the C-terminus argues for their

importance in maintaining proteostasis. One advantage of tagging stalled chains is that it

may distinguish them from normal translation products and promote their removal from

the protein pool. An alternate, not mutually exclusive, possibility is that the extension

serves to test the functional integrity of large ribosomal subunits so that the cell can

detect and dispose of defective large subunits that induce stalling.

2.5 MATERIALS AND METHODS

2.5.1 Deacylation of tRNA samples

The TGIRT enzyme initiates reverse transcription by an end-to-end template-

switching mechanism that is sensitive to whether or not the 3’ end of the tRNA is

aminoacylated. For deacylation of tRNA, RNA samples were incubated in 0.1 M Tris-

HCl (pH 9.0) for 45 min at 37°C (Dittmar et al., 2005), and purified by ethanol

precipitation in the presence of 0.3 M sodium acetate (pH 5.2) or with an RNA Clean &

Concentrator Kit (Zymo Research). Portions of the purified RNA samples before and

after deacylation were analyzed with the Small RNA Kit on a 2100 Bioanalyzer (Agilent)


to assess the quality and quantity of the RNAs.

2.5.2 Construction of RNA-seq libraries by TGIRT-seq small RNA/CircLigase

method

The construction of RNA-seq libraries via TGIRT template-switching was done

by using an initial template-primer substrate that consists of a 41-nt RNA oligonucleotide (5'-

AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC/3SpC3/-

3'), which contains both the Illumina Read 1 and Read 2 primer-binding sites and a 3'

blocking group (C3 Spacer, 3SpC3; IDT), annealed to a complementary 42-nt 32P-labeled

DNA primer that leaves an equimolar mixture of A, C, G, or T single-nucleotide 3'

overhangs. Reactions were done with RNA samples, initial template-primer substrate

(100 nM), TGIRT enzyme and 1 mM dNTPs (an equimolar mix of dATP, dCTP, dGTP,

and dTTP) in 20 μl of reaction medium containing 450 mM NaCl, 5 mM MgCl2, 20 mM

Tris-HCl, pH 7.5, 1 mM dithiothreitol (DTT). ~100 ng of gel-purified HEK293T tRNA

or ~1 μg HEK293T whole-cell RNA were used in the high-throughput tRNA sequencing

experiment; ~30 ng of IFIT5-bound RNA or 25 ng of synthetic miRNA control was used

in the IFIT5 study; and ~100-200 ng of RQC-bound RNA was used in the RQC study.

For TGIRT enzyme, a thermostable GsI-IIC reverse transcriptase (TGIRT-III; InGex)

was used at 500 nM in the high-throughput tRNA sequencing experiment; and a

thermostable TeI4c-MRF reverse transcriptase with a C-terminal truncation of the DNA

endonuclease domain was used at 1 μM in both the IFIT5 and the RQC studies.
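
The relationship between the 41-nt adapter RNA and the 42-nt DNA primers described above can be sketched as follows. The sketch assumes that each primer is the reverse complement of the adapter RNA with one additional nucleotide at its 3' end, which is what a single-nucleotide 3' overhang on a fully complementary primer implies; the function name is hypothetical, and the 32P label and the 3' blocking group on the RNA are not modeled.

```python
# 41-nt adapter RNA (5'->3') as given in the text, without the 3' blocker.
ADAPTER_RNA = "AGAUCGGAAGAGCACACGUCUAGUUCUACAGUCCGACGAUC"

# RNA base -> complementary DNA base
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def primers_with_overhang(adapter_rna, overhangs="ACGT"):
    """Return DNA primers (5'->3') complementary to the adapter RNA.

    Each primer is the reverse complement of the adapter plus one extra
    3' nucleotide that extends past the adapter's 5' end.
    """
    reverse_complement = "".join(_COMPLEMENT[base] for base in reversed(adapter_rna))
    return [reverse_complement + nt for nt in overhangs]


primers = primers_with_overhang(ADAPTER_RNA)
```

Mixing the four resulting primers in equal amounts corresponds to the equimolar A, C, G, and T overhang mixture used in the reactions.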

After pre-incubating a mixture of all components except dNTPs at room


temperature for 30 min, reactions were initiated by adding dNTPs, incubated at 60°C for

15 min (IFIT5) or 30 min (high-throughput tRNA sequencing and RQC), and terminated

by adding 5 M NaOH to a final concentration of 0.25 M, incubating at 95°C for 3 min,

and then neutralizing with 5 M HCl. The labeled cDNAs were analyzed by

electrophoresis in a denaturing 6% polyacrylamide gel, which was scanned with a

Typhoon FLA9500 phosphorImager (GE Healthcare). Gel regions containing the desired

cDNA products were isolated and electroeluted using a D-tube Dialyzer Maxi with

MWCO of 6-8 kDa (EMD Millipore) and ethanol precipitated in the presence of 0.3 M

sodium acetate and linear acrylamide carrier (25-50 μg; Thermo Scientific). The purified

cDNAs were then circularized with CircLigase II (Epicentre; manufacturer’s protocol

with an extended incubation time of 5 h at 60°C), extracted with phenol-chloroform-

isoamyl alcohol (25:24:1) and ethanol precipitated. The circularized cDNA products were

amplified by PCR with Phusion-HF (Thermo Scientific) using Illumina multiplex (5'-

AAT GAT ACG GCG ACC ACC GAG ATC TAC ACG TTC AGA GTT CTA CAG

TCC GAC GAT C -3') and barcode (5'- CAA GCA GAA GAC GGC ATA CGA GAT

BARCODE GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC T -3') primers

under conditions of initial denaturation at 98°C for 5 s, followed by 12 (high-throughput

tRNA sequencing) or 15 (IFIT5 and RQC) cycles of 98°C for 5 s, 60°C for 10 s and 72°C

for 10 s.
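
For the termination step described above, the volume of 5 M NaOH needed to reach a final concentration of 0.25 M can be worked out with a short calculation. The sketch assumes that the reaction volume at the time of termination is the 20 μl stated for the reaction medium and that neutralization uses an equimolar amount of HCl; the function name is hypothetical.

```python
def stock_volume_for_final_conc(sample_volume_ul, stock_molar, final_molar):
    """Volume of stock (in ul) to add so the mixture reaches final_molar.

    Solves stock * v = final * (sample + v) for v, which accounts for the
    dilution caused by the added volume itself.
    """
    if final_molar >= stock_molar:
        raise ValueError("final concentration must be below the stock concentration")
    return final_molar * sample_volume_ul / (stock_molar - final_molar)


# 20-ul reaction, 5 M NaOH stock, 0.25 M final (values from the text).
naoh_ul = stock_volume_for_final_conc(20.0, 5.0, 0.25)
# An equal volume of 5 M HCl supplies the same number of moles of acid.
hcl_ul = naoh_ul
```

Under these assumptions the calculation gives slightly more than 1 μl of each stock, because the added NaOH itself dilutes the reaction.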


Table values are rounded percentages. RNA classes >1.5% of reads in any sample are

shown.

*Cross-linked and native extract purifications were done following induction of IFIT5

expression in cells without (−) or with (+) IFN-β.

†Size categories of cDNA a, b, and c (defined in text) were analyzed for WT and mutant

IFIT5 proteins expressed by transfection.

‡Transcript categories from Ensembl GRCh37.

Table 2.1: TGIRT-seq read mapping.


Table values are rounded percentages. RNA classes included were >1.5% of reads in at

least one sample analyzed in Table 2.2.

*Transcript categories from Ensembl GRCh37.

Table 2.2: Biological replicate sequencing of pooled RNA.


Figure 2.1: Demethylase-thermostable group II intron RT tRNA sequencing (DM-tRNA-

seq).

Schematic representation for (A) Demethylation and (B) TGIRT-seq small

RNA/CircLigase method. (C) RT reaction for both purified tRNA and total RNA as

template with (+) or without (−) demethylase treatment. The blue line shows the gel

region excised for library construction. (D) Time-course RT reaction for purified tRNA

as template without demethylase treatment.


Figure 2.2: cDNA synthesis of IFIT-bound RNAs by TGIRT-seq small RNA/CircLigase

method.

(A) PCR amplification and sequencing were done using pooled cDNA excised as

two gel slices of ∼55–82 and ∼84–200 nt, excluding only the 83-nt cDNA product from

template switching to Read1,2 RNA. In the legend above the gel, N indicates native

extract purification and XL indicates purification after in vivo cross-linking. (B)

Amplification and sequencing were performed separately for cDNA from size pools a =

∼55–82, b = ∼84–150, and c = ∼150–230 nt. TGIRT-seq from the biological replicate in

which cDNAs were processed as in A is summarized in Table 2.2.


Figure 2.3: Broad representation of IFIT5-bound tRNAs.

Profiles of tRNA read abundance were plotted using RNA libraries from in vivo

cross-linking or native extract of cells without (−) or with (+) prior IFN-β treatment. All

tRNA loci with mapped reads were rank-ordered in normalized abundance. The number

of different tRNA species identified by sequence reads is indicated in parentheses in the

plots on the right and represents most of the 625 reference human genome tRNA and

tRNA pseudogene loci searched.


Figure 2.4: Individual gene coverage by reads from the WT IFIT5 cross-linked RNA

sample.

Read coverage of loci was mapped for tRNA, snaR, and 5S rRNA genes across a

window from 50 bp upstream to 50 bp downstream of the mature RNA ends, which are

indicated with dashed lines. The top left plot shows coverage for an iMet tRNA gene,

followed by six additional tRNA genes, a tRNA pseudogene, a snaR locus (National

Center for Biotechnology Information NR_024229.1), and 5S rRNA (Ensembl

ENSG00000199352.1). Each tRNA gene is identified by chromosome number,

chromosome position, charged amino acid, and anticodon sequence (5′–3′). The apparent

excess of 3′ exon fragments for LeuCAA likely reflects misalignment of truncated 5′

exon sequences by Bowtie 2 local alignment after the gap resulting from intron removal.


Figure 2.5: Read sequence alignments for the WT IFIT5 cross-linked RNA sample.

The figure shows screen shots of IGV sequence alignments for some RNAs bound

by the WT IFIT5. The blue bar at the top delineates the mature tRNA sequence encoded

in the genome, with the arrow indicating 5′ to 3′ direction of the tRNA, which differs


across alignments depending on the DNA strand to which the reads are mapped by the

Bowtie 2 aligner. The total number of reads mapped to the locus is indicated near the top

of each panel. To fit the entire alignment on one page, loci with more than 1,500 mapped

reads were down-sampled to 1,500 reads in IGV, and only parts of the IGV screen shot

were shown in (B) and (C). Reads were sorted by their start site on the chromosome,

which can be from either the 5′ or 3′ RNA end depending on the orientation of the gene

on the chromosome. In the coverage plot profiles, nucleotides matching the genome

sequence are represented in gray color, and mismatches are represented in different

colors (A, green; C, blue; G, brown; T, red). Soft-clipped sites, which demarcate the

beginning of extra 5′ and 3′ nucleotides that do not match the genomic sequence, are

indicated by a short black bar, and read continuity between a genome sequence gap, such

as a spliced intron, is indicated by a black line. Pol III ter, predicted RNA polymerase III

termination site. For the spliced tRNAs, reads were mapped with or without intron

removal from the gene sequence to highlight inaccurate splice junctions and modified

nucleotides near the junction that affect the sequence alignment. The spliced ArgTCT

tRNA reads contain potential examples of missplicing with a shifted splice junction

and/or one extra nucleotide inserted at the junction (highlighted in the inset sequence

alignment). Examples of untrimmed adapter sequence, nontemplated nucleotide addition

by the TGIRT at cDNAs 3′ ends (corresponding to tRNA 5′ ends), and rare second

template switches are indicated in the alignments. Mismatches at positions corresponding

to modified nucleotides known to be present in the tRNA are indicated by arrows

indicating the tRNA position and modified nucleotide. The spectrum of misincorporated


nucleotides at modification sites is shown in the coverage plot, with a misincorporation

threshold of 10%. In at least some cases (e.g., m1A and m2,2G), the spectrum

of mismatches appears to be characteristic of the modified base and may be useful for

identifying unknown base modifications in other coding and noncoding RNAs. A position

of potential posttranscriptional modification of a conserved guanosine residue in the snaR

is indicated in the alignment. Cm, 2′-O-methylcytidine; D, dihydrouridine; I, inosine; i6A,

N6-isopentenyl adenosine; t6A, N6-threonylcarbamoyladenosine; m1A, 1-

methyladenosine; m1G, 1-methylguanosine; m1I, 1-methylinosine; m2,2G, N2,N2-

dimethylguanosine; m3C, 3-methylcytidine.


Figure 2.6: Composite read start sites for IFIT5-bound tRNAs.

Cross-comparison of tRNA read start sites for WT IFIT5 variously purified from

extracts of a stable cell line (A) or WT and mutant IFIT5 proteins purified after

expression by transient transfection (B). Native extract was from cells without (−) or with

(+) IFN-β treatment. X axis positions are as in Fig. 2.4, and the y axis represents the

percentage of reads starting at each position. Precursor tRNA ends are at positions 1–50

and the mature tRNA 5′ end is at position 51. Read start sites at positions within the

tRNA correlate with positions of reverse transcription stops at or near modification sites

common among eukaryotic tRNAs (Fig. 2.5): position 59, G9/1-methylguanosine (m1G);

position 70, U20/dihydrouridine (D); position 77, G26/N2,N2-dimethylguanosine (m2,2G)

or U27/pseudouridine (Ψ) depending on the length of the tRNA D-loop; position 87,

A37/N6-isopentenyladenosine (i6A), N6-threonylcarbamoyladenosine (t6A) or 1-

methylinosine (m1I), and G37/m1G or wybutosine (yW); position 108, A58/1-

methyladenosine (m1A).


Figure 2.7: Rqc2p-dependent enrichment of tRNAAla(IGC) and tRNAThr(IGU).

(A) tRNA cDNA reads extracted from purified RQC particles and summed per

unique anticodon, with versus without Rqc2p. (B) Secondary structures of tRNAAla(IGC)

and tRNAThr(IGU). Identical nucleotides are underlined. Edited nucleotides are indicated

with asterisks. (C) Weblogo representation of cDNA sequencing reads related to shared

sequences found in anticodon loops (positions 32 to 38) of mature tRNAAla(IGC) and

tRNAThr(IGU).


Chapter 3: RNA-seq of circulating RNAs in human plasma

3.1 INTRODUCTION

Next-generation RNA sequencing (RNA-seq) is a supremely powerful method for

transcriptome profiling and gene expression analysis, with applications that include the

identification of novel biomarkers and new diagnostic methods for diseases (Wang et al.,

2009; Wilhelm and Landry, 2009; Ozsolak and Milos, 2011; Chen et al., 2012). A recent

exciting application of RNA-seq is the analysis of extracellular RNAs present in plasma

and other bodily fluids (Mitchell et al., 2008; Burgos et al., 2013; Huang et al., 2013;

Williams et al., 2013; Koh et al., 2014). Such extracellular RNAs are potential

biomarkers for human disease and may be involved in intercellular communication

(Valadi et al., 2007; Zernecke et al., 2009; Fabbri et al., 2012; Grasedieck et al., 2013). In

plasma, extracellular RNAs, also known as circulating RNAs, are present in vesicles,

such as exosomes, microvesicles, and apoptotic bodies, and/or in ribonucleoprotein

(RNP) complexes, e.g., miRNAs with Argonaute2 (Ago2) or high-density lipoproteins

(HDLs) (Zernecke et al., 2009; Arroyo et al., 2011; Vickers et al., 2011; Huang et al.,

2013). Circulating RNAs found in human plasma include fragments of mRNAs and long

non-coding RNAs (lncRNAs), possibly resulting from intracellular RNA turnover and

secretion in exosomes, as well as miRNAs and other small non-coding RNAs (small

ncRNAs) (Huang et al., 2013; Williams et al., 2013; Koh et al., 2014). Dysregulation of

non-coding RNAs and malfunctions in their processing machinery are frequently

hallmarks of human diseases, including cancer and Alzheimer’s disease (Croce, 2009;


Esteller, 2011; Batista and Chang, 2013). Further, the expression profiles of miRNAs and

lncRNAs are often tissue- and cell-state specific, which may facilitate disease diagnoses

(Lu et al., 2005; Rosenfeld et al., 2008; Cabili et al., 2011; Brunner et al., 2012). Multiple

reports correlate the presence of specific mRNAs or miRNAs in plasma or serum with

different types of cancer and other diseases, suggesting that the analysis of circulating

RNAs may provide a non-invasive, cost-effective solution for detecting and monitoring

cancer progression (Kopreski et al., 2001; Silva et al., 2007; Keller et al., 2011; Moussay

et al., 2011; Koh et al., 2014). Thus far, however, knowledge of different RNA types that

circulate in human plasma and their relative abundance remains limited. Here, I

optimized methods for plasma RNA isolation to maximize small RNA representation,

and developed a new method for RNA-seq library construction via the use of

thermostable group II intron reverse transcriptases (TGIRTs), which allow the analysis of

all human plasma RNAs in a single RNA-seq experiment.

3.2 TGIRT-SEQ, THE TOTAL RNA METHOD

3.2.1 Overview of the TGIRT-seq total RNA method

In the initial TGIRT-seq small RNA/CircLigase method (see Chapter 2), the

cDNAs with an RNA-seq adapter linked by TGIRT template switching during reverse

transcription were size-selected on a denaturing polyacrylamide gel and circularized with

CircLigase II ssDNA Ligase (Epicentre) prior to PCR amplification (Mohr et al., 2013;

Katibah et al., 2014; Shen et al., 2015; Zheng et al., 2015). Although this procedure

remains useful for RNA-seq of specific RNA size classes or homogeneously sized RNA


fragments in procedures like HITS-CLIP or ribosome profiling, disadvantages include: (i)

size limitations introduced by CircLigase, whose efficiency decreases for longer cDNAs

(Epicentre product literature); (ii) a gel-purification step, which is time consuming and

results in loss of material; and (iii) the use of hazardous chemicals, such as phenol and

chloroform.

To achieve simplicity, speed, and high efficiency, we developed a new method for

using TGIRT template-switching in RNA-seq library construction from RNA pools

without size selection, referred to as the total RNA method (Qin et al., 2016). By

eliminating gel-purification and phenol-extraction steps, the method enables the

construction of RNA-seq libraries from small amounts of RNA in <5 h. The method is

readily adaptable for a variety of other applications, including sequencing of whole-cell

and exosomal RNAs, profiling of miRNAs and other non-coding RNAs, and for

streamlining the identification of protein- or ribosome-bound RNA fragments in

procedures like HITS-CLIP, RIP-Seq, and ribosome profiling.

Figure 3.1A outlines the new TGIRT-seq total RNA method. First, the TGIRT

binds to an initial template-primer substrate comprised of an RNA oligonucleotide

containing an RNA-seq adapter sequence annealed to a complementary DNA primer. For

Illumina sequencing, the RNA oligonucleotide contains an Illumina Read 2 primer-

binding site (R2 RNA), and the DNA primer contains the complementary sequence (R2R

DNA) (Fig. 3.1A,B). After forming a complex with the initial template-primer substrate,

the TGIRT initiates reverse transcription by switching directly from the 5’ end of the

RNA-seq adapter to the 3’ end of a target RNA, yielding a continuous cDNA linking the


two sequences. The RNA-seq adapter has a 3’-blocking group that impedes secondary

template-switching to the 3’ end of that RNA.

To increase the efficiency of template-switching, the DNA primer annealed to the

RNA-seq adapter in the initial template-primer substrate has a single-nucleotide 3’

overhang. This 3’-overhang nucleotide base-pairs to the 3’-terminal nucleotide of the

target RNA, resulting in a seamless template-switching junction between the RNA-seq

adapter and the target RNA (Mohr et al., 2013). In the present work, an initial template-

primer substrate with an equimolar mixture of A, C, G, or T 3’ overhangs (denoted N

(Mohr et al., 2013)) was used to construct RNA-seq libraries from RNA pools with

minimal bias. The ability of a single base pair between the 3’-overhang nucleotide and

the 3’ end of the target RNA to direct TGIRT template-switching at 60°C, the operational

temperature of TGIRT enzymes, indicates a very potent strand annealing activity of

group II intron RTs. Alternatively, to enrich for certain target RNA, the A, C, G or T

overhangs can be mixed at a customized ratio or replaced by a string of nucleotides

complementary to the 3’ end sequences of the target RNA (Zheng et al., 2015).

Because an RNA-seq adapter is added directly during cDNA synthesis, TGIRT-

seq is inherently strand-specific. This strand specificity was confirmed by the low

frequency of antisense reads from a 74-nt synthetic RNA oligonucleotide template (0.72 × 10⁻⁵

and 1.9 × 10⁻⁵ for the TeI4c and GsI-IIC thermostable group II intron RTs, respectively;

Materials and Methods).

For RNA-seq profiling, reverse transcription by TGIRT enzymes is done at 60°C

in a reaction medium containing high salt (450 mM NaCl), which limits multiple


template-switches. In the primary plasma RNA-seq datasets (DSs) presented here (DS1-

10), the percentage of fusion reads, which include multiple template-switches, was ≤0.14,

comparable to conventional RNA-seq methods using retroviral RTs (Lu and Matera,

2014). Multiple template-switches that do occur are sporadic and can be distinguished

from novel biologically relevant junctions resulting from DNA translocations or

unannotated splice junctions by a combination of technical replicates, Integrative

Genomics Viewer (IGV) alignments, and qRT-PCR validation. Because TGIRT enzymes

have very high processivity, TGIRT template-switching is virtually always end-to-end

and does not occur appreciably from internal sites (Mohr et al., 2013). By contrast,

retroviral RTs frequently template-switch by dissociating from an internal site and

reinitiating at a different site, resulting in artifactual internal deletions (Mader et al.,

2001; Cocquet et al., 2006).

In the previous small RNA/CircLigase method, cDNAs were linked by TGIRT

template-switching to an RNA-seq adapter containing the complements to both the

Illumina Read 2 (R2R) and Read 1 (R1R) primer-binding sites, gel purified and then

circularized with CircLigase prior to PCR amplification (Katibah et al., 2014; Shen et al.,

2015; Zheng et al., 2015). By contrast, in the new TGIRT-seq method developed here,

the cDNAs linked to an R2R adapter sequence are processed into RNA-seq libraries

without size selection by ligating a 5’-adenylated (5’ App) DNA oligonucleotide

containing the R1R adapter to the cDNA 3’ end with Thermostable 5’ AppDNA/RNA

Ligase (New England Biolabs). The 5’ App DNA oligonucleotide has a 3’-blocking

group that impedes self-ligation. The ligated cDNAs were then amplified by 12 cycles of


PCR with primers that introduce Illumina P5 and P7 flow cell capture sites and barcodes

(Fig. 3.1B). The elimination of the gel-purification step improves sample recovery and

decreases processing time, enabling us to construct RNA-seq libraries from small

amounts of starting material in less than 5 h.

Because TGIRTs give full-length reads of tRNAs and other small ncRNAs, we

developed a pipeline for read mapping, which uses TopHat v2.0.10 end-to-end alignment

followed by Bowtie2 local alignment (Fig. 3.1C) to include RNAs with post-

transcriptionally added nucleotides, such as the 3’ CCA of tRNAs or poly(U) tails

(Malecki et al., 2013; Katibah et al., 2014). Like other RTs and DNA polymerases,

TGIRTs can add a small number of extra non-templated nucleotides to the 3’ ends of

cDNAs (referred to as terminal transferase activity) (Clark, 1988; Golinelli and Hughes,

2002). Such extra nucleotides remain after local alignment, but are readily evaluated by

IGV plots.
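
The control flow of the two-pass mapping strategy described above can be sketched in a few lines. This is only a schematic under stated assumptions: the two aligners are represented by toy stand-in functions rather than by TopHat or Bowtie2, and all names and the reference sequence are hypothetical. The point is that reads failing a strict end-to-end pass are retried with a permissive local pass that tolerates unmatched 3' nucleotides such as a post-transcriptionally added CCA.

```python
def two_pass_map(reads, end_to_end, local):
    """Map reads with a strict first pass and a permissive second pass.

    end_to_end and local are callables that take a read sequence and return
    a hit (any object) or None. Reads that fail the first pass are retried
    with the second; reads that fail both are returned as unmapped.
    """
    mapped, unmapped = {}, []
    for name, seq in reads.items():
        hit = end_to_end(seq)
        pass_used = "end_to_end"
        if hit is None:
            hit = local(seq)
            pass_used = "local"
        if hit is None:
            unmapped.append(name)
        else:
            mapped[name] = (pass_used, hit)
    return mapped, unmapped


# Toy stand-ins for real aligners: the reference lacks the 3' CCA that is
# added to tRNAs after transcription.
REFERENCE = "GGGCGTGTGGCGTAGTCGG"


def toy_end_to_end(seq):
    """Accept only reads identical to the reference; return position 0."""
    return 0 if seq == REFERENCE else None


def toy_local(seq, min_len=12):
    """Accept a read if a long enough prefix matches the reference start,
    mimicking soft-clipping of unmatched 3' nucleotides."""
    for end in range(len(seq), min_len - 1, -1):
        if REFERENCE.startswith(seq[:end]):
            return 0
    return None
```

In this toy setting, a read equal to the reference maps in the first pass, the same read with an extra CCA is rescued by the second pass, and an unrelated read remains unmapped.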

3.2.2 Validation of the TGIRT-seq total RNA method

In parallel work carried out primarily by Ryan Nottingham and Douglas Wu, the

TGIRT-seq total RNA method was validated by using two well-characterized,

commercially available human RNA reference samples including the Universal Human

Reference RNA (UHR) and the Human Brain Reference RNA (HBR) (Nottingham et al.,

2016). This work showed that TGIRT-seq recapitulates the relative abundance of human

transcripts and RNA spike-ins in ribo-depleted, fragmented RNA samples comparably to

non-strand-specific TruSeq v2 and better than strand-specific TruSeq v3. Moreover,


TGIRT-seq is more strand-specific than TruSeq v3 and eliminates sampling biases from

random hexamer priming, which are inherent to TruSeq. The TGIRT-seq datasets also

show more uniform 5’ to 3’ gene coverage and identify more splice junctions,

particularly near the 5’ ends of mRNAs, than do the TruSeq datasets. Finally, TGIRT-seq

enables the simultaneous profiling of mRNAs and lncRNAs in the same RNA-seq

experiment as structured small ncRNAs, including tRNAs, which are essentially absent

with TruSeq.

3.3 HUMAN PLASMA RNA

3.3.1 Preparations and treatments of human plasma RNAs

To obtain suitable starting material for RNA-seq, we tested several different

plasma RNA preparations and DNase treatment methods with the aims of increasing the

representation of miRNAs, which comprise only a small proportion of plasma RNA, and

reducing contamination from plasma DNA. Each RNA-seq dataset presented below was

constructed from RNAs extracted from 1 ml of plasma obtained from a healthy male

individual at intervals at least one week apart. For the primary datasets, plasma RNAs

were extracted by using Trizol LS Reagent (Thermo Fisher Scientific) followed by a

Direct-zol RNA MiniPrep Kit (Zymo Research), as described in Materials and Methods.

This method, which we refer to as the Direct-zol method, typically recovered 2-8 ng of

nucleic acids per ml plasma, comparable to yields in previous studies (Burgos et al.,

2013; Williams et al., 2013; Spornraft et al., 2014). The plasma RNA samples were

analyzed by RNA-seq with no further treatment (NT), after enzymatic treatment to


remove 3’ phosphates (-3’P), which block TGIRT template-switching (Mohr et al.,

2013), or after on-column DNase I digestion (OCD) under conditions that completely

digest 10 ng of a mixture of 74-nt ssDNA and 275-nt dsDNA PCR product (Figs. 3.2 and

3.3).

Bioanalyzer traces of the NT sample showed two broad peaks: Peak 1 at ~40-60

nt and Peak 2 at ~160-170 nt (Fig. 3.2A). After the on-column DNase I treatment, Peak 2

disappeared, leaving only Peak 1 (Fig. 3.2B), which was sensitive to RNase I, an enzyme

that degrades ssRNA, or alkaline hydrolysis, which degrades RNA but not DNA (Fig.

3.2C,D). The DNase sensitivity of Peak 2 is consistent with previous findings that

plasma DNA fragments cluster at ~160-170 bp corresponding in size to the length of

dsDNA protected by nucleosomes (Fan et al., 2008). We found that total plasma RNA

prepared by mirVana miRNA Isolation Kit (Thermo Fisher Scientific) using a method

that combines large and small RNA fractions to increase small RNA recovery (Materials

and Methods) also contains a DNase-sensitive peak similar in size to Peak 2 (Fig. 3.3B-D).

Thus, plasma RNA prepared by either method cannot be assumed to be free of DNA.

Because TGIRT enzymes can template-switch to either RNA or DNA fragments

containing a 3’ OH (Mohr et al., 2013), RNA-seq datasets constructed from the NT and -

3’P samples potentially contain both plasma RNA and DNA sequences, whereas those

constructed from the DNase-treated samples correspond almost entirely to RNA

sequences, as judged by their sensitivity to RNase I and alkali (Fig. 3.2C,D).


3.3.2 TGIRT-seq of human plasma RNA samples

Table 3.1 summarizes mapping statistics for RNA-seq datasets constructed from

Direct-zol NT, -3’ P, and OCD plasma RNAs by using the thermostable TeI4c group II

intron RT. Samples were sequenced on an Illumina HiSeq 2500 (Dataset 1 (DS1); 69.4

million 100-nt paired-end reads) or NextSeq 500 (DS2-10; 14.6 to 37.8 million 75-nt or

150-nt paired-end reads). For each type of RNA preparation, we obtained at least three

RNA-seq datasets, each using a different plasma sample taken from the same individual.

After trimming and filtering to remove adapter sequences and low quality base calls,

transcript lengths determined by the coverage of the paired-end read span were consistent

with plasma RNA size profiles in bioanalyzer traces (Fig. 3.4A,B). The processed reads

were mapped to a human genome reference sequence (Ensembl GRCh38 Release 76)

supplemented with additional rRNA gene contigs (Materials and Methods). For the

plasma RNA-seq datasets constructed with the TeI4c thermostable group II intron RT,

85.7-95.3% of the paired-end reads mapped to the human genome, and 27.3-30.7% were

concordant read pairs that mapped uniquely and with high mapping quality (MAPQ ≥15)

to genomic features in the annotated orientation (Table 3.1). For confidence, only

features with ≥10 hits were counted in the analysis.
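The pooled columns of Table 3.1 (for example, DS1-3) can be related to the per-dataset values with a short calculation. The sketch below assumes that a pooled percentage is the read-weighted mean of the individual percentages; this is an inference from the table values, not a procedure stated in the text, and the function name is hypothetical.

```python
# Per-dataset values copied from Table 3.1 (NT samples, DS1-3).
total_reads_millions = [69.4, 23.4, 31.7]
mapped_to_genome_pct = [92.0, 95.3, 93.5]


def pooled_percentage(reads, percentages):
    """Read-weighted mean of per-dataset percentages (assumed pooling rule)."""
    if len(reads) != len(percentages):
        raise ValueError("reads and percentages must have the same length")
    mapped = sum(r * p for r, p in zip(reads, percentages))
    return mapped / sum(reads)


pooled = pooled_percentage(total_reads_millions, mapped_to_genome_pct)
print(f"Pooled NT mapping rate: {pooled:.1f}%")
```

Under this assumption the result rounds to the 93.0% listed for DS1-3, and the same calculation on DS4-6 rounds to the 90.8% listed for the -3’P samples.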

3.3.3 Classes of RNAs detected in human plasma

Figure 3.5 shows the percentage of reads mapping to different genomic features in

the RNA-seq datasets constructed by using TeI4c RT for total plasma RNA treated in

various ways, using only uniquely mapped concordant read pairs for the calculation. The

number of individual genes to which the reads mapped is shown next to each feature in

the stacked bar graphs. The datasets for NT, -3’ P, and OCD-treated plasma RNAs show

similar overall profiles of RNA classes with the majority of the reads corresponding to

fragmented protein-coding gene and lncRNAs (Fig. 3.5A), and a smaller proportion (1.8-

5.8%) mapping to a variety of small ncRNAs (Fig. 3.5B).

While having little effect on the proportion of reads mapping to protein-coding

gene and lncRNAs, the removal of 3’ phosphates (-3’P), which block TGIRT template-

switching, reproducibly increased the proportion of reads mapping to 18S and 28S

rRNAs (from 0.9 ± 0.2 to 6.3 ± 4.0% of reads mapped to features, p-value = 0.15) and 5’-

tRNA halves (from 0.4% to 7.1% of reads mapped to tRNAs, see below). These findings

suggest that the protein-coding and lncRNA fragments present in plasma were either

generated by RNases that leave a 3’ OH or had their 3’ phosphates removed by a

phosphatase. Previous findings indicate that most intracellular RNases involved in

cellular RNA turnover leave 3’-OH groups (Houseley and Tollervey, 2009; Schoenberg

and Maquat, 2012). By contrast, the rRNA and 5’-tRNA halves present in plasma, whose

representation increased after 3’ phosphate removal, were generated by RNases that leave

a 2’3’-cyclic phosphate or 3’ phosphate (e.g., RNase A in blood or angiogenin in the case

of tRNA halves) (Houseley and Tollervey, 2009; Yamasaki et al., 2009).

Despite the differences in plasma collection dates, DNA sequencers, and read

lengths, the biological replicates for RNA-seq datasets constructed with the TeI4c RT

from each type of plasma RNA preparation (NT, -3’P and OCD) were highly

reproducible, with pairwise Spearman’s correlation coefficients (ρ) ranging from 0.85 to

0.92 (Fig. 3.6A-C).

We obtained additional RNA-seq datasets of NT plasma RNA with the GsI-IIC

thermostable group II intron RT, which is sold commercially as TGIRT-III enzyme

(Materials and Methods). The GsI-IIC RT datasets were very similar to TeI4c RT

datasets in terms of mapping statistics, reproducibility, and features detected (Table 3.2,

Fig. 3.6D and Fig. 3.7). The correlation coefficient between combined NT plasma RNA

datasets obtained with the two TGIRT enzymes (DS1-3 versus DS12-14) was 0.92, with

most of the differences due to low abundance RNA species (Fig. 3.6E). Analysis of 3’-

terminal nucleotides of RNAs in RNA-seq datasets constructed from DNase-treated

plasma RNA preparations showed a relatively even distribution of the four possible 3’-

terminal nucleotides by both enzymes, with only small differences of unknown

significance in the frequencies of some di- or tri-nucleotide sequences (Table 3.3).

3.3.4 Protein-coding gene and long non-coding RNAs in human plasma

The TGIRT-seq profiles suggest that human plasma RNA consists largely of

RNA fragments derived from a diverse population of protein-coding gene and lncRNAs.

From the bioanalyzer traces of the on-column DNase I-treated (OCD) samples, we infer

that the protein-coding and lncRNA fragments, which comprise a high proportion of

plasma RNA, are heterogeneous in size with a broad peak at ~40-60 nt (Peak 1; Fig.

3.2B), and this was supported by separately calculating the length distribution of protein-

coding gene reads (excluding embedded small ncRNAs) in the DNase-treated samples

(Fig. 3.4C).

Further analysis of the protein-coding gene reads in NT and OCD-treated plasma

RNA datasets indicated that they are enriched in intron and antisense sequences

compared to human whole-cell RNAs analyzed by the same TGIRT-seq method using

TeI4c RT (Jurkat cells) or GsI-IIC RT (K562 cells) (Fig. 3.8, and Table 3.4). RNA-seq

datasets constructed from plasma RNA prepared by either the Direct-zol or mirVana

combined methods and treated with Baseline-ZERO DNase (Epicentre), which according

to the manufacturer digests DNA to mononucleotides, showed similar enrichments of

intron and antisense sequences (datasets BZD and M-BZD in Fig. 3.8), as did limiting the

analyzed protein-coding reads in the DNase-treated datasets to 30 nt to exclude residual

small DNA fragments (denoted read span 30 nt in Fig. 3.8). Plots of the proportion of

reads mapping to the sense and antisense strands versus gene length in the datasets for

DNase-treated plasma RNAs showed wide variations for different genes with

convergence toward 50% sense/antisense reads for longer genes in the larger datasets

(Fig. 3.9).

Previous studies have shown that a high proportion of the human genome is

transcribed from both strands, with many annotated antisense RNAs overlapping protein-

coding sequences on the opposite strand and concordantly regulated with the sense RNAs

(Katayama et al., 2005; Werner, 2013; Brown et al., 2014; Khorkova et al., 2014; Portal

et al., 2015). Our findings raise the possibility that plasma RNA is enriched in extraneous

intron and antisense RNAs, which may be preferentially targeted for degradation and

cellular secretion, eventually finding their way into plasma.

3.3.5 Small non-coding RNAs in human plasma

miRNAs. The TGIRT-seq profiles for different types of plasma RNA preparations

indicate that miRNAs are not abundant in human plasma. Fig. 3.10A shows profiles of

miRNAs detected in total plasma RNAs prepared by the Direct-zol method with on-

column DNase I treatment (OCD) and by the mirVana combined method with Baseline-

ZERO DNase treatment (M-BZD; Materials and Methods). The miRNAs detected by

TeI4c RT in both types of RNA preparations showed skewed distributions (Fig. 3.10A).

miRNA species with the highest read counts in both datasets include miR-451a, miR-142,

miR-16-2, miR-122 (a liver-specific miRNA), miR-223, miR-19a, let-7a, miR-16-1, let-7b,

miR-6087, miR-126, miR-17, and miR-21 (Fig. 3.10A). The abundant plasma

miRNAs identified here include those previously reported to be present in plasma in

complex with Ago2 proteins (e.g., miR-451a, miR-16, miR-122, miR-223, miR-19a, let-7b,

and miR-21), largely in exosomes (e.g., miR-142 and let-7a) or in both Ago2

complexes and exosomes (miR-126) (Arroyo et al., 2011).

Tissue expression profiles of the mature miRNAs in the RNA-seq datasets for

both types of DNase-treated plasma RNA (Fig. 3.11 and Fig. 3.12) indicate that plasma is

enriched in miRNAs that are abundant in endocrine glands and highly vascularized

organs, along with a subset of miRNAs that are abundant (top 10 percentile) in red blood

cells or platelets (miRNA names indicated in red in Fig. 3.11 and Fig. 3.12) (Landgraf et

al., 2007; Wang et al., 2012). Some miRNAs abundant in brain were also detected with

relatively high read counts in the plasma, in agreement with a previous study which detected

brain-specific transcripts in plasma with increased abundance of certain neuronal

transcripts correlated with Alzheimer’s disease (Koh et al., 2014).

IGV plots, in which reads are aligned to the genomic sequence, showed that most

of the abundant miRNAs are present in plasma as full-length, mature species, including

some with post-transcriptionally added 3’ A residues (e.g., miR-122) (Fig. 3.10B)

(Norbury, 2013). For miR-126, both the mature miRNA (miR-126-3p) and passenger

strand (miR-126-5p) are present in human plasma, consistent with previous findings

(Arroyo et al., 2011). In addition to annotated miRNAs, the M-BZD dataset identified

mature-sized miRNAs from several predicted miRNA loci (e.g., AC034205.1,

AC023050.1, and AL589669.1) (Fig. 3.10C). The IGV plots also show that a few

miRNA species are present in plasma as full-length pre-miRNAs with both 5’ and 3’ ends

corresponding exactly to the annotated mature miRNA arms (Fig. 3.13A). Some of these

pre-miRNAs are present together with the mature miRNAs (e.g., let-7f, miR-27a, miR-

146a, and miR-30c), whereas others are present almost entirely as the pre-miRNA (e.g.,

miR-1229 and miR-139) (Fig. 3.10B,C). Such distinctions would be missed in miRNA

quantitation by qRT-PCR or microarray assays. Although GsI-IIC RT used at a limiting

concentration (500 nM) appears to under-represent miRNAs in total plasma RNA

preparations, RNA-seq datasets constructed with GsI-IIC RT for mirVana small RNA

preparations (Materials and Methods) were similar to those for TeI4c RT, with mostly

minor differences in profiles for abundant miRNA species detected by the two TGIRT

enzymes (Fig. 3.14).

Finally, although the abundant miRNA species in DNase-treated plasma RNA

datasets (OCD and M-BZD) correspond well to those detected in the non-treated (NT)

plasma RNA datasets, we note the curious case of miR-182 for which we detected

abundant reads corresponding to the exact antisense of the annotated mature miRNA in

the NT datasets (Fig. 3.13B), the only mature miRNA for which antisense sequences

were detected. This antisense miR-182 sequence was found reproducibly in multiple

datasets of non-treated plasma RNAs generated by both TGIRT enzymes (98% of miR-182

reads in total plasma RNA datasets constructed with TeI4c, and 4% and 14% of

reads in total and small RNA datasets constructed with GsI-IIC RT, respectively), but

disappeared after DNase treatment, leaving only the annotated sense orientation of the

miRNA. These findings raise the possibility that antisense miR-182 was initially part of

an RNA/DNA hybrid with the annotated miRNA, either an in vitro artifact or hinting at a

novel DNA-based mode of miRNA-regulated gene expression.

tRNAs and tRNA fragments. tRNAs are the most abundant small ncRNAs

detected in the datasets for total plasma RNA (83.0-93.4% of the small ncRNA reads,

mapping to 376-419 different tRNA genes; Fig. 3.5B). tRNA species grouped by

anticodon showed a skewed distribution, with good correspondence between the

abundant tRNA species detected by TeI4c in the NT and -3’P plasma RNA preparations

(Fig. 3.15A). IGV alignments for representative tRNA species to individual loci showed

that most are full-length, extending from the processed 5’ end of the mature tRNA, or

post-transcriptionally added 5’ G residue in the case of tRNAHis, to the post-

transcriptionally added 3’ CCA (Fig. 3.15B). In contrast to retroviral RTs, which

terminate at base modifications that affect Watson-Crick base-pairing interactions

(Burnett and McHenry, 1997; Ansmant et al., 2001; Jackman et al., 2003), TGIRT

enzymes frequently read through a number of such modifications (e.g., m1A58 and

m1G9) by misincorporation, with the spectrum of misincorporated nucleotides

characteristic of the modification (Elagib et al., 2013; Katibah et al., 2014). tRNA-protein

complexes have been identified previously in human sera as autoantigens in patients with

autoimmune diseases, a well-studied example being HisGUG, which is bound to histidyl-

tRNA synthetase in the polymyositis-specific autoantigen Jo-1 (Hardin et al., 1982;

Mathews and Bernstein, 1983; Rosa et al., 1983). Our findings indicate that HisGUG and

other full-length tRNAs are normal, relatively abundant components of human plasma.

In addition to full-length tRNAs, several abundant tRNA species in the NT and -

3’P plasma RNA-seq datasets correspond to 5’- and 3’-tRNA halves resulting from

cleavage within the anticodon loop (Fig. 3.15C). As noted previously, the percentage of

5’-tRNA halves reads increased from 0.4% of mapped tRNA reads in NT datasets to

7.1% of mapped tRNA reads in -3’P datasets, consistent with cleavage by an RNase, such

as angiogenin, which leaves a 2’,3’-cyclic phosphate or 3’ phosphate (Fig. 3.15C) (Fu et al.,

2009). 5’-tRNA halves in plasma have been reported to be present in RNP complexes

that are destabilized by chelating agents such as EDTA, which was used in our plasma

preparation (Dhahbi et al., 2013a). It is possible that the proportion of 5’-tRNA halves

detected by TGIRT-seq would be higher in plasma prepared without EDTA.

Other small ncRNAs. The remaining small ncRNAs detected by TeI4c RT in NT

total plasma RNA datasets include Y RNAs (3.8%; 84 species, including 3 of 4 known Y

RNAs); snoRNAs (1.9%; 220 species); 7SL RNAs (1.8%; 191 species); snRNAs (0.9%;

145 species); Vault RNAs (VT; 0.8%; 5 species, including 3 of 4 known Vault RNAs);

and 7SK RNAs (0.5%; 71 species) (Fig. 3.5B). Only fragments of snoRNAs, snRNAs

and Y RNAs were previously reported to be present in plasma or exosomes (Dhahbi et

al., 2013b; Huang et al., 2013; Spornraft et al., 2014). We detected longer transcripts

mapping to the piRNA cluster but not mature piRNAs, possibly reflecting the 2’-O-

methyl group at their 3’ end, which inhibits TGIRT template-switching (Mohr et al.,

2013).

Remarkably, many of the small ncRNAs that we identified in plasma are full-

length transcripts, including snRNAs, both H/ACA-box and C/D-box snoRNAs, Y

RNAs, Vault RNAs, 7SL RNAs (299 nt), and 7SK RNAs (332 nt) (Fig. 3.16A). All of

these RNAs function intracellularly in RNP complexes (Walter and Blobel, 1982;

Kickhoefer et al., 2002; He et al., 2008; Markert et al., 2008; Esteller, 2011; Chen et al.,

2013), and their presence as full-length transcripts protected from plasma RNases

suggests that they are present as such in plasma. Y RNA and Vault RNA are associated

with autoantigens Ro/SSA and La/SSB, respectively, both of which have been implicated

in autoimmune diseases, including systemic lupus erythematosus and Sjögren’s syndrome

(Halse et al., 1999; Xue et al., 2003; Routsias and Tzioufas, 2010), while 7SL RNA, an

RNA component of the signal recognition particle, has been implicated in the

autoimmune disease myositis (Satoh et al., 2005). 7SK RNA, the central scaffold of an

RNP complex that regulates nuclear transcription elongation (He et al., 2008; Markert et

al., 2008), has not been reported previously in plasma. Notably, the unmapped reads

contain 5’ truncated Y RNAs and Vault RNA fragments with poly(U) tails (Fig. 3.16B),

presumably reflecting that they were targeted for degradation before being exported into

plasma (Malecki et al., 2013).

3.4 DISCUSSION

The RNA-seq method developed here employing a thermostable group II intron

reverse transcriptase (TGIRT-seq) enables strand-specific comprehensive RNA profiling

of different RNA size classes starting from small amounts of RNA. In addition to simpler

library preparation without known biases of RNA ligation or random hexamer priming of

reverse transcription (Linsen et al., 2009; Hansen et al., 2010; Levin et al., 2010; Lamm

et al., 2011; Hu and Hughes, 2012; Raabe et al., 2014), TGIRT-seq distinguishes mature

miRNAs from pre-miRNAs and longer miRNA-containing transcripts, and it gives full-

length reads including both the 5’- and 3’-RNA termini of a variety of highly structured

small ncRNAs. Because gel-purification and phenol-extraction steps in previous versions

of the method have been eliminated, RNA-seq libraries can be prepared from a small

amount of starting material in <5 h and can potentially be automated to further enhance

efficiency and throughput.

In this initial demonstration of the method, we prepared RNA from 1 ml of human

plasma and used Illumina sequencing to obtain 14.6-69.4 million paired-end reads for

total plasma RNA datasets, enabling profiling of plasma RNAs at relatively low cost. We

found that human plasma RNAs consist largely of fragments of protein-coding genes and

lncRNAs, together with less abundant small ncRNAs. The RNA fragments of protein-coding

genes appear to be enriched in intron and antisense sequences, possibly reflecting

preferential turnover of extraneous RNA sequences, which are packaged into exosomes,

exported into the intercellular space, and eventually find their way into plasma.

Surprisingly, we found that many of the small ncRNAs, including miRNAs, tRNAs,

snoRNAs, snRNAs, Y RNAs, Vault RNAs, 7SL RNAs, and 7SK RNAs, are present as

full-length transcripts, suggesting that they are protected from plasma RNases in RNP

complexes and/or exosomes. Although miRNAs are not abundant in the total plasma

RNA preparations, they were amply detected in a way that distinguishes mature miRNAs

from pre-miRNAs, and their coverage could be improved by greater sequencing depth or

by small RNA enrichment.

The TGIRT-seq method should be easily modifiable for different sequencing

platforms. By including additional steps for rRNA depletion followed by RNA

fragmentation and 3’-phosphate removal (Materials and Methods), TGIRT-seq is readily

adaptable for the profiling of whole-cell RNAs, as well as for the analysis of exosomal

RNAs and protein-bound RNA fragments in procedures like HITS-CLIP and RIP-Seq, and

for ribosome profiling.

3.5 MATERIALS AND METHODS

3.5.1 Thermostable group II intron RTs

Reverse transcription of plasma RNAs for the construction of RNA-seq libraries

was done by using a thermostable TeI4c group II intron RT (TeI4c-∆En fusion protein

RT for Datasets 1-11 and 16; TeI4c-MRF group II intron RT (Mohr et al., 2013) for

Dataset 18; Table 3.5), and a thermostable GsI-IIC group II intron RT (TGIRT-III;

InGex) (Datasets 12-15, 17 and 19; Table 3.5). The TeI4c-∆En fusion protein RT was a

gift from Enzymatics and is functionally equivalent to the TeI4c-MRF group II intron

RTs described and used previously (Mohr et al., 2013).

3.5.2 Preparation of human plasma RNA samples

Plasma from a healthy male individual was obtained from the Genome

Sequencing and Analysis Facility at the University of Texas at Austin. To prepare

plasma, fresh blood was collected in 10-ml K+/EDTA venous blood collection tubes,

mixed with an equal volume of phosphate-buffered saline without calcium and

magnesium (PBS -/-; Thermo Fisher Scientific), gently layered over 15-ml Ficoll-Paque

PLUS (GE Healthcare) in a 50-ml conical tube, and centrifuged at 400 x g for 35 min at

room temperature. After centrifugation, plasma (top layer) was transferred into a clean

tube, aliquoted, and stored at -80°C.

To prepare total plasma RNA using the Direct-zol method, plasma (1 ml or four

250-µl aliquots) was mixed with 3-volume Trizol LS Reagent (Thermo Fisher Scientific),

shaken vigorously for 10-30 sec to obtain a homogenous mixture, incubated at room

temperature for 10 min with occasional mixing, and centrifuged at 12,000 x g for 10 min

at 4 °C in a 1.7-ml Eppendorf tube. The resulting supernatant was then mixed with 1-

volume 100% ethanol and 5 μg of linear acrylamide carrier (Thermo Fisher Scientific),

incubated at room temperature for 10 min with occasional mixing, and processed with a

Direct-zol RNA Miniprep Kit (Zymo Research) following the manufacturer’s protocol.

RNA extracted from 1-ml plasma was concentrated into 11 µl of double-distilled water

(ddH2O) by ethanol precipitation in the presence of 0.3 M sodium acetate (pH 5.2) or

with an RNA Clean & Concentrator Kit (Zymo Research) with 8 volumes of 100%

ethanol added to the sample to increase recovery of small RNAs.

To prepare total plasma RNA by using the mirVana combined method, 1 ml of

plasma was processed by using a mirVana miRNA Isolation kit (Thermo Fisher

Scientific) following the manufacturer’s protocol, but combining the large and small

RNA fractions to obtain a total plasma RNA preparation. After mixing the plasma lysate

with 1/3-volume 100% ethanol, the large RNA fraction was bound to the first column and

eluted, while the small RNA fraction collected in the filtrate was mixed with an

additional 2/3-volume 100% ethanol, bound to the second column, eluted, and combined

with the large RNA fraction. For mirVana small plasma RNA preparation, the large RNA

fraction was discarded. In either case, the RNA was concentrated and cleaned up as

described above for the Direct-zol method.

RNA samples were used for RNA-seq either without further treatment (denoted

NT), after 3’-phosphate removal (denoted -3’ P), or after different DNase treatments. For

3’-phosphate removal, the RNA samples were treated with T4 polynucleotide kinase

(Epicentre) according to manufacturer’s recommendations, extracted with acid phenol-

chloroform-isoamyl alcohol (25:24:1; Thermo Fisher Scientific), ethanol precipitated,

and dissolved in 11-µl double-distilled (dd) H2O. DNase treatment of RNA samples

prepared by the Direct-zol RNA MiniPrep Kit (Zymo Research) was done following the

manufacturer’s protocol for on-column DNase I digestion with either 5-units DNase I

(Zymo Research) as specified in the protocol (DS15) or 20-units DNase I (DS7-10).

Alternatively, DNase treatment was done on the eluted RNA by using Baseline-ZERO

DNase (Epicentre) according to manufacturer’s recommendations. For RNase digestion,

the on-column DNase I-treated samples were digested with RNase I (Epicentre)

following the manufacturer’s protocol, and for alkaline hydrolysis, they were incubated at

95°C for 15 min in the presence of 0.25 M NaOH and then neutralized with equimolar HCl.

After treatments, RNA samples were cleaned up with an RNA Clean & Concentrator

Kit (Zymo Research) and eluted with 11-µl ddH2O. To check the efficiency of DNase

digestion, we used a 10-ng mixture of a 74-nt synthetic ssDNA oligonucleotide (5’-TTT

TGA TTG TTT TTC GAT GAT GTT CGG TGA GCA TTG TTC GAG TTT CAT TTT

TAT CAC AGC CAG CTT TGA TGT GC-3’; IDT) and a 275-bp dsDNA PCR product

derived from the Lactococcus lactis Ll.LtrB group II intron.

RNA quality and quantity were assessed by running 1 µl of the 11-µl RNA

samples on a 2100 Bioanalyzer (Agilent) using the RNA 6000 Pico Kit (mRNA assay) or

Small RNA Kit for total or small plasma RNA preparations, respectively.

3.5.3 Construction of plasma RNA-seq libraries

For the construction of plasma RNA-seq libraries, TGIRT template-switching

reverse transcription reactions were done by using an initial template-primer substrate

consisting of a 34-nt RNA oligonucleotide (R2 RNA), which contains an Illumina Read 2

primer-binding site and a 3’-blocking group (C3 Spacer, 3SpC3; IDT), annealed to a

complementary 35-nt DNA primer (R2R DNA) that leaves an equimolar mixture of A, C,

G, or T single-nucleotide 3’ overhangs (Fig. 3.1B). Reactions were done in 20 µl of

reaction medium containing plasma RNA (0.9-4.4 ng for total RNA and 7.2-12 ng for

small RNA preparations in 10-µl double-distilled water), 100 nM template-primer

substrate, TGIRT enzyme (2 µM TeI4c or 500 nM GsI-IIC RT), and 1 mM dNTPs (an

equimolar mix of dATP, dCTP, dGTP, and dTTP) in 450 mM NaCl, 5 mM MgCl2, 20

mM Tris-HCl, pH 7.5, and dithiothreitol (DTT; 1 mM for TeI4c RT and 5 mM for GsI-

IIC RT). DTT was either prepared freshly or from a frozen concentrated (0.5 or 1 M)

stock solution. Reactions were assembled by adding all components, except dNTPs, to a

sterile PCR tube containing plasma RNAs with the TGIRT enzyme added last. After pre-

incubating at room temperature for 30 min, reactions were initiated by adding dNTPs and

incubated for 15 min at 60°C. cDNA synthesis was terminated by adding 5 M NaOH to a

final concentration of 0.25 M, incubating at 95°C for 3 min, and then neutralizing with 5

M HCl. The resulting cDNAs were purified with a MinElute Reaction Cleanup Kit

(QIAGEN) and ligated at their 3’ end to a 5’-adenylated/3’-blocked (C3 spacer, 3SpC3;

IDT) adapter (R1R; Fig. 3.1B) by using Thermostable 5’ AppDNA/RNA Ligase (New

England Biolabs) according to the manufacturer’s recommendations. The ligated cDNA

products were re-purified with a MinElute column and amplified by PCR by using

Phusion High-Fidelity DNA polymerase (Thermo Fisher Scientific) with 200 nM of

Illumina multiplex and 200 nM of barcode primers (a 5’ primer that adds a P5 capture

site and a 3’ primer that adds a barcode plus P7 capture site; Fig. 3.1B). PCR was done

with initial denaturation at 98°C for 5 sec followed by 12 cycles of 98°C for 5 sec, 60°C

for 10 sec and 72°C for 10 sec. The PCR products were purified by using the Agencourt

AMPure XP (Beckman Coulter) and sequenced on a HiSeq 2500 or a NextSeq 500

instrument (Illumina) to obtain 100-nt (HiSeq), 75-nt (NextSeq) or 150-nt (NextSeq)

paired-end reads.
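The alkaline termination step above involves a small dilution calculation. The following sketch assumes the 20-µl reaction volume given in the text and accounts for the volume added by the NaOH itself; whether the original protocol made this correction is not stated, and the function name is hypothetical.

```python
def stock_volume_to_add(initial_volume_ul, stock_molar, final_molar):
    """Volume of stock (µl) to add so that the mixture reaches final_molar.

    Solves stock * v = final * (initial_volume + v) for v, which accounts
    for the volume increase caused by the addition itself.
    """
    if final_molar >= stock_molar:
        raise ValueError("final concentration must be below the stock concentration")
    return final_molar * initial_volume_ul / (stock_molar - final_molar)


# 20-µl reaction, 5 M NaOH stock, 0.25 M final concentration (values from the text).
naoh_ul = stock_volume_to_add(20.0, 5.0, 0.25)

# Neutralizing with an equimolar amount of 5 M HCl takes the same volume,
# because the acid and base stocks have the same molarity.
hcl_ul = naoh_ul * 5.0 / 5.0

print(f"5 M NaOH to add: {naoh_ul:.2f} µl; 5 M HCl to neutralize: {hcl_ul:.2f} µl")
```

In practice this is close to 1 µl of each stock per 20-µl reaction.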

RNA-seq libraries of cellular RNAs were constructed similarly from RNAs

isolated from K562 cells (ATCC CCL-243, maintained in IMDM supplemented with

10% FBS at 37°C with a 5% CO2 atmosphere) using a mirVana miRNA Isolation Kit

(Thermo Fisher Scientific) following the manufacturer’s protocol, or commercial T Cell

Leukemia (Jurkat) Total RNA (Thermo Fisher Scientific). Whole-cell RNAs (5 µg) were

ribo-depleted by using a RiboZero Gold Kit (Human/Mouse/Rat) (Epicentre) and then

fragmented to a size predominantly between 70 and 100 nt by using an NEBNext

Magnesium Fragmentation Module (New England Biolabs). Fragmented RNA (40 ng)

was treated with T4 Polynucleotide Kinase (Epicentre) to remove 3’ phosphates, cleaned

up with an RNA Clean & Concentrator Kit (Zymo Research), and used for RNA-seq

library construction with TGIRT enzymes (GsI-IIC for K562 and TeI4c for Jurkat) as

described above.

3.5.4 RNA-seq analysis of cDNA recopying by TGIRT enzymes

Control RNA-seq to assess the strand specificity of TGIRT enzymes was done

with 50 ng of a 74-nt synthetic RNA oligonucleotide (5’-UUU UGA UUG UUU UUC

GAU GAU GUU CGG UGA GCA UUG UUC GAG UUU CAU UUU UAU CAC AGC

CAG CUU UGA UGU GC-3’; IDT) using 2 µM TeI4c-MRF or 1 µM GsI-IIC RTs under

the conditions described above. Libraries were sequenced on an Illumina HiSeq, yielding

6.5-6.9 × 10⁵ 100-nt single-end reads that mapped to the RNA oligonucleotide sequence

in the expected orientation. Only a very small number of reads (3 for TeI4c-MRF RT and

12 for GsI-IIC RT) mapped to the RNA oligonucleotide in the antisense orientation,

corresponding to re-copying frequencies of 0.72 and 1.9 × 10⁻⁵ for TeI4c-MRF and GsI-IIC

RTs, respectively. All of the antisense reads resulted from template-switching to a

previously synthesized cDNA from either the 5’ end of the R2 RNA (the template-primer

substrate) or from the 5’ end of a previously copied RNA, resulting in a product with the

R2R DNA sequence on one end and the R2 RNA sequence on the other end. Both types

of recopying are readily identifiable by examining the reads without adapter trimming.
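Two quantities in this control are simple to check computationally: the length of the synthetic oligonucleotide and the fraction of antisense reads. The sketch below is illustrative only. It assumes the re-copying frequency is the antisense read count divided by all reads mapped to the oligonucleotide, which the text does not define explicitly, and the read counts in the example call are hypothetical.

```python
# Sequence of the 74-nt synthetic RNA oligonucleotide, copied from the text.
RNA_OLIGO = (
    "UUU UGA UUG UUU UUC GAU GAU GUU CGG UGA GCA UUG UUC GAG UUU "
    "CAU UUU UAU CAC AGC CAG CUU UGA UGU GC"
).replace(" ", "")

# DNA version of the same sequence, as it would appear in sense-strand reads.
dna_equivalent = RNA_OLIGO.replace("U", "T")


def recopying_frequency(antisense_reads, sense_reads):
    """Fraction of oligonucleotide-mapped reads in the antisense orientation.

    Assumed definition: antisense / (antisense + sense).
    """
    total = antisense_reads + sense_reads
    if total == 0:
        raise ValueError("no mapped reads")
    return antisense_reads / total


print(len(RNA_OLIGO))                      # length check against the stated 74 nt
print(recopying_frequency(5, 499_995))     # hypothetical counts
```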

3.5.5 Bioinformatics analysis

The bioinformatics pipeline used for analysis of RNA-seq data is outlined in

Figure 3.1C. First, Illumina TruSeq DNA adapter and primer sequences were trimmed

from the reads by using cutadapt (Martin, 2011) (sequencing quality score cut-off at 20;

p-value < 0.01), and reads <18-nt after trimming were discarded. Reads were then

mapped by using Tophat v2.0.10 and Bowtie2 v2.1.0 (default settings) to the human

genome reference sequence (Ensembl GRCh38 Release 76) (Langmead and Salzberg,

2012; Kim et al., 2013) supplemented with additional contigs encoding the 5S rRNA

gene (2.2-kb 5S rRNA repeats from the cluster on chromosome 1 (1q42); GenBank:

X12811) and the 45S rRNA gene (43-kb 45S rRNA repeats containing 5.8S, 18S and 28S

rRNA sequences from clusters on chromosomes 13, 14, 15, 21, and 22; GenBank:

U13369). Other sequences used for mapping included DNA oligonucleotide sequences

used in control experiments (see above) to test for sample cross-contamination, and the E.

coli genome sequence (GenBank: NC_000913) to remove any reads resulting from E.

coli nucleic acids in enzyme preparations. Unmapped reads from this first pass (Pass 1)

were re-mapped to Ensembl GRCh38 Release 76 by Bowtie2 with local alignment

(default settings) to improve the mapping rate for those reads that contain post-

transcriptionally added nucleotides (e.g., CCA and poly(U)), untrimmed adapter

sequences, and non-templated nucleotides added to the 3’ end of the cDNAs by TGIRT

enzymes (Pass 2). The mapped reads from Passes 1 and 2 were combined and filtered by

mapping quality (MAPQ ≥15; p-value < 0.03), and concordant read pairs were collected

by using Samtools. The concordant read pairs were then intersected with gene

annotations (Ensembl GRCh38 Release 76) and piRNA cluster annotations from

piRNABank (Sai Lakshmi and Agrawal, 2008) to collect reads that mapped uniquely in

the annotated orientation to genomic features (genomic coordinates for piRNAs were

converted to Ensembl GRCh38 Release 76 coordinates using scripts from the UCSC

genome browser website). Coverage of each feature was calculated by Bedtools. To

improve the mapping rate for tRNAs, mapped reads from Passes 1 and 2 were intersected

with tRNA annotations from the Genomic tRNA Database (Lowe and Eddy, 1997) to

collect both uniquely and multiply mapped tRNA reads. These were then combined with

unmapped reads after Pass 2 and mapped to the tRNA reference sequences (UCSC

genome browser website) using Bowtie2 local alignment with default settings. Because

similar or identical tRNAs with the same anticodon can be multiply mapped to different

tRNA loci by Bowtie2, mapped tRNA reads with MAPQ ≥1 were combined according to

their tRNA anticodon prior to calculating the tRNA distributions. Only those features

with ten or more mapped reads were counted.
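Two steps in this pipeline lend themselves to a brief illustration: the relationship between the Phred-scaled cut-offs and the p-values quoted above, and the pooling of tRNA reads by anticodon. The sketch below is not the pipeline's actual code. The locus names, anticodon labels, and read assignments are hypothetical, and applying the ten-read minimum after pooling by anticodon is an assumption.

```python
from collections import Counter


def phred_to_error_probability(score):
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)


def counts_by_anticodon(read_loci, locus_to_anticodon, min_reads=10):
    """Pool per-locus tRNA read assignments by anticodon.

    read_loci: one locus name per mapped read.
    locus_to_anticodon: mapping from locus name to anticodon label.
    Only anticodon groups with at least min_reads reads are returned.
    """
    counts = Counter(locus_to_anticodon[locus] for locus in read_loci)
    return {anticodon: n for anticodon, n in counts.items() if n >= min_reads}


# Thresholds quoted in the text.
base_call_error = phred_to_error_probability(20)  # sequencing quality cut-off
mapping_error = phred_to_error_probability(15)    # MAPQ cut-off

# Hypothetical loci and read assignments, for illustration only.
locus_to_anticodon = {"locus_A": "HisGTG", "locus_B": "HisGTG", "locus_C": "GlyGCC"}
read_loci = ["locus_A"] * 6 + ["locus_B"] * 5 + ["locus_C"] * 3
pooled = counts_by_anticodon(read_loci, locus_to_anticodon)

print(f"Q20 error probability: {base_call_error:.3f}")
print(f"MAPQ 15 error probability: {mapping_error:.3f}")
print(pooled)
```

A quality score of 20 corresponds to an error probability of 0.01, and a MAPQ of 15 to about 0.032, in line with the approximate p-values given in the text. In the toy data, neither His locus reaches ten reads on its own, but the pooled HisGTG group does, which is why pooling precedes the count filter in this sketch.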

Coverage plots and alignments of reads were created by using Integrative

Genomics Viewer (IGV) (Robinson et al., 2011). Information about single nucleotide

polymorphisms (SNPs) was obtained from NCBI dbSNP (Database of Single Nucleotide

Polymorphisms Build 142; common category, minor allele frequency ≥1% in at least one

of the 26 major populations, with at least two unrelated individuals having the minor

allele).

For correlation analysis, RNA-seq datasets were normalized for the total number

of mapped reads by using DESeq (Anders and Huber, 2010) and plotted with ggplot2 in

R. To assess tissue expression profiles for mature miRNAs detected in plasma, reads

mapped to genomic features (Ensembl GRCh38 Release 76) were filtered by size and

reads shorter than 30 nt were intersected with miRBase 21 to obtain reads for mature

miRNAs. The latter were intersected with a published database to obtain RNA-seq

expression values (Landgraf et al., 2007), which were then normalized across different

tissues and plotted with ggplot2 in R.
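For readers unfamiliar with the rank-based correlation used to compare datasets, the following is a minimal, dependency-free sketch of Spearman's ρ with average ranks for ties. It is not the DESeq and R workflow used here, and the per-feature counts are hypothetical. Because ρ depends only on ranks, rescaling one dataset by a constant factor leaves it unchanged.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical per-feature read counts for two replicate datasets.
replicate_a = [120, 15, 0, 340, 15]
replicate_b = [98, 22, 3, 410, 9]

rho = spearman_rho(replicate_a, replicate_b)
rho_scaled = spearman_rho(replicate_a, [2 * v for v in replicate_b])
print(f"rho = {rho:.3f}; after rescaling one replicate: {rho_scaled:.3f}")
```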

To identify RNAs with poly(U) tails, unmapped reads after the first Tophat

alignment (pass 1; see above) were processed by using cutadapt and custom scripts to

find a stretch of 10 Us with <10% other nucleotides at the beginning of the Read 2

reads. The corresponding Read 1 reads were then mapped to human genome reference

sequence using Bowtie2 local alignment to identify the RNA species to which the

poly(U) tails are appended, and were used for IGV plots.
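The custom scripts for poly(U) detection are not reproduced here; the sketch below shows one way such a prefix scan could be written. It rests on several assumptions: the stretch is taken to be at least 10 nt long, the tolerance is fewer than 10% non-matching bases, the stretch must end on a matching base, and the base to scan for is left as a parameter because it depends on read orientation (U appears as T in sequencing reads, or as A on the complementary strand).

```python
def homopolymer_prefix_length(read, base="T", min_length=10, max_other_fraction=0.10):
    """Length of the longest qualifying homopolymer-like prefix of a read.

    A prefix qualifies if it is at least min_length long, ends on `base`,
    and contains a fraction of other bases below max_other_fraction.
    Returns 0 if no prefix qualifies.
    """
    best = 0
    others = 0
    for position, nucleotide in enumerate(read.upper(), start=1):
        if nucleotide != base:
            others += 1
            continue  # a qualifying prefix must end on the target base
        if position >= min_length and others / position < max_other_fraction:
            best = position
    return best


# Hypothetical reads, for illustration only.
tailed = "T" * 12 + "ACGTACGT"
tailed_with_mismatch = "TTTTATTTTTTT" + "GGGG"
untailed = "ACGTACGTACGT"

for read in (tailed, tailed_with_mismatch, untailed):
    print(read, homopolymer_prefix_length(read))
```

Reads for which the function returns a non-zero length would be the candidates whose mates are then mapped to identify the tailed RNA.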

Excel spreadsheets for miRNAs, tRNAs, and other small ncRNAs identified by

TGIRT-seq in different plasma RNA preparations are included in the supplemental data

file of the published manuscript (Qin et al., 2016).

3.5.6 Accession numbers

The plasma RNA-seq datasets have been deposited in the National Center for

Biotechnology Information Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra)

under accession number SRP064378.

Dataset                   NT                          -3’P                        OCD
                          1      2      3      1-3    4      5      6      4-6    7      8      9      10     7-10
Total reads (×10⁶)¹       69.4   23.4   31.7   124.5  20.5   21.5   26.0   68.0   14.6   37.8   36.4   28.5   117.4
Mapped to genome (%)²     92.0   95.3   93.5   93.0   91.1   88.8   92.3   90.8   90.2   85.7   86.6   87.7   87.0
Mapped to features (%)³   28.6   28.8   27.6   28.4   29.3   27.3   29.2   28.7   30.7   30.2   30.1   30.4   30.3

¹ Total reads after trimming and filtering.

² Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference sequence.

³ Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated features of the human genome reference sequence.

Table 3.1: Read statistics and mapping for RNA-seq of total plasma RNAs using TeI4c group II intron RT.

RNA-seq libraries were prepared from plasma RNA samples by using TeI4c RT and sequenced on an Illumina HiSeq

or NextSeq instrument to obtain the indicated number of 100-nt (HiSeq; DS1), 150-nt (NextSeq; DS2-6), or 75-nt (NextSeq;

DS7-10) paired-end reads. Each sample corresponds to plasma RNA (0.9-4.4 ng) obtained from a healthy individual at

intervals at least one week apart and was analyzed either with no further treatment (NT), after T4 polynucleotide kinase

treatment under conditions that remove 3’ phosphates (-3’ P), or after on-column DNase I treatment (OCD). The reads were

trimmed to remove adapter sequences and low quality base-calls (sequencing quality score cut-off at 20 (p-value <0.01)), and

reads <18-nt after trimming were discarded. Trimmed reads were filtered and then mapped by using Tophat and Bowtie2 to a

human genome reference sequence (Ensembl GRCh38 Release 76) supplemented with additional rRNA gene contigs, as

described in Materials and Methods.
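The combined columns of Table 3.1 (1-3, 4-6, 7-10) are numerically consistent with pooling the reads of the individual datasets, so that each combined percentage is a read-weighted mean. This is an inference from the tabulated values rather than a stated procedure, and the function name below is hypothetical. A short Python check for the NT datasets:

```python
def pooled_percentage(total_reads, percentages):
    """Read-weighted mean of per-dataset percentages."""
    weighted = sum(n * p for n, p in zip(total_reads, percentages))
    return weighted / sum(total_reads)


# NT datasets 1-3 from Table 3.1 (reads in millions, % mapped to genome)
nt_reads = [69.4, 23.4, 31.7]
nt_mapped = [92.0, 95.3, 93.5]

print(round(sum(nt_reads), 1))                            # 124.5
print(round(pooled_percentage(nt_reads, nt_mapped), 1))   # 93.0
```

Because the table values are rounded to one decimal place, recomputed totals for other groups can differ from the tabulated ones in the last digit.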

Dataset                   GsI-IIC, NT                 GsI-IIC, OCD
                          12     13     14     12-14  15
Total reads (×10⁶)¹       33.6   27.8   43.6   104.9  22.9
Mapped to genome (%)²     90.1   94.6   93.0   92.5   95.1
Mapped to features (%)³   27.9   27.8   27.7   27.8   29.3

¹ Total reads after trimming and filtering.

² Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference sequence.

³ Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated features of the human genome reference sequence.

Table 3.2: Read statistics and mapping for RNA-seq of total plasma RNAs using GsI-IIC

group II intron RT.

RNA-seq libraries were prepared from different plasma RNA samples by using

GsI-IIC RT and sequenced on an Illumina NextSeq instrument to obtain the indicated

number of 150-nt paired-end reads. Each sample corresponds to plasma RNA (1.4-4.4

ng) obtained from a healthy individual at intervals at least one week apart and was

analyzed with no further treatment (NT) or after on-column DNase I treatment (OCD).

The reads were trimmed to remove adapter sequences and low quality base-calls

(sequencing quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming

were discarded. Trimmed reads were filtered and mapped by using Tophat and Bowtie2

to a human genome reference sequence (Ensembl GRCh38 Release 76) supplemented

with additional rRNA gene contigs, as described in Materials and Methods.

N-3’  TeI4c  GsI-IIC   NN-3’  TeI4c  GsI-IIC   NNN-3’  TeI4c  GsI-IIC
A     26.2   26.2      AA     3.2    1.3       AAA     0.9    0.3
                                               CAA     1.1    0.5
                                               GAA     0.5    0.2
                                               UAA     0.7    0.4
                       CA     11.1   15.0      ACA     2.4    2.6
                                               CCA     4.4    6.7
                                               GCA     1.5    1.4
                                               UCA     2.8    4.4
                       GA     7.3    3.9       AGA     2.5    1.2
                                               CGA     1.2    0.2
                                               GGA     1.3    0.8
                                               UGA     2.3    1.6
                       UA     4.6    6.0       AUA     1.0    1.1
                                               CUA     1.4    1.9
                                               GUA     0.6    0.7
                                               UUA     1.6    2.3
C     26.6   25.8      AC     1.8    1.8       AAC     0.5    0.2
                                               CAC     0.6    1.2
                                               GAC     0.2    0.1
                                               UAC     0.5    0.3
                       CC     9.4    11.3      ACC     2.4    1.8
                                               CCC     2.1    3.5
                                               GCC     1.5    1.3
                                               UCC     3.4    4.8
                       GC     5.1    4.0       AGC     1.5    1.2
                                               CGC     0.7    0.3
                                               GGC     0.9    0.7
                                               UGC     2.0    1.9
                       UC     10.2   8.7       AUC     1.8    1.3
                                               CUC     3.0    3.4
                                               GUC     1.2    0.9
                                               UUC     4.3    3.0
G     25.7   24.0      AG     4.1    2.8       AAG     1.0    0.6
                                               CAG     1.5    1.1
                                               GAG     0.7    0.5
                                               UAG     0.9    0.7
                       CG     3.6    2.4       ACG     0.8    0.5
                                               CCG     1.2    0.9
                                               GCG     0.6    0.3
                                               UCG     1.0    0.8
                       GG     8.7    7.4       AGG     2.5    2.4
                                               CGG     1.1    0.4
                                               GGG     1.5    1.4
                                               UGG     3.7    3.2
                       UG     9.2    11.3      AUG     1.9    2.4
                                               CUG     2.9    4.3
                                               GUG     1.5    1.7
                                               UUG     2.9    3.0
U     21.6   24.0      AU     1.6    1.0       AAU     0.4    0.1
                                               CAU     0.4    0.4
                                               GAU     0.3    0.1
                                               UAU     0.5    0.3
                       CU     6.2    13.8      ACU     1.3    1.5
                                               CCU     1.6    5.3
                                               GCU     1.0    2.4
                                               UCU     2.4    4.5
                       GU     3.1    3.1       AGU     0.8    0.7
                                               CGU     0.3    0.2
                                               GGU     0.6    0.5
                                               UGU     1.4    1.8
                       UU     10.7   6.1       AUU     2.0    0.8
                                               CUU     2.8    2.9
                                               GUU     1.5    0.8
                                               UUU     4.4    1.6

Table 3.3: Analysis of 3’-terminal nucleotides of RNAs in RNA-seq datasets constructed

from total plasma RNA using TeI4c or GsI-IIC group II intron RTs.

Read 2 reads from RNA-seq datasets constructed from on-column DNase I-treated total

plasma RNA by using TeI4c (DS7-10) or GsI-IIC (DS15) group II intron RTs were

trimmed to remove adapter sequences and low quality bases. Nucleotide frequencies for the

first three nucleotides of Read 2 (corresponding to the last three nucleotides of the RNA)

were then calculated by using customized scripts. Frequencies for the last (N-3’), last two

(NN-3’), and the last three (NNN-3’) nucleotides of RNAs are shown as the percent of all

3’ RNA ends in the dataset.
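The tabulation described in this legend can be sketched in Python as follows. The customized scripts are not given in the text, so this is only an illustration: the function names are hypothetical, and the code assumes that Read 2 is the reverse complement of the RNA, so that its first k nucleotides must be reverse-complemented to recover the RNA 3’ end.

```python
from collections import Counter

# DNA read letters are complemented directly into RNA letters.
_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def rna_3prime_end(read2_seq, k):
    """Last k nucleotides of the RNA (5' to 3'), from the first k nt of Read 2."""
    return read2_seq[:k].upper().translate(_COMPLEMENT)[::-1]


def terminal_frequencies(read2_seqs, k):
    """Percentage of RNA 3' ends carrying each k-nt terminal sequence.

    Reads shorter than k nt, or with an N among the first k bases, are skipped.
    """
    ends = [
        rna_3prime_end(seq, k)
        for seq in read2_seqs
        if len(seq) >= k and "N" not in seq[:k].upper()
    ]
    counts = Counter(ends)
    total = sum(counts.values())
    return {end: 100.0 * n / total for end, n in counts.items()}
```

Calling `terminal_frequencies` with k = 1, 2, and 3 would give the N-3’, NN-3’, and NNN-3’ columns, respectively. Under this scheme the values for the four NN-3’ ends sharing a final nucleotide sum to the corresponding N-3’ value, as they do in Table 3.3 (for example, AA + CA + GA + UA = 26.2 for TeI4c).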

Dataset                   Jurkat   K562
                          18       19
Total reads (×10⁶)¹       23.4     37.8
Mapped to genome (%)²     86.2     93.4
Mapped to features (%)³   70.1     73.0

¹ Total reads after trimming and filtering.

² Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference sequence.

³ Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated features of the human genome reference sequence.

Table 3.4: Read statistics and mapping for RNA-seq of whole-cell RNAs by using TeI4c

or GsI-IIC group II intron RT.

RNA-seq libraries were prepared from 40 ng of ribo-depleted, fragmented whole-

cell RNAs by using TeI4c RT (Jurkat cells) or GsI-IIC RT (K562 cells) and sequenced on

an Illumina NextSeq instrument to obtain the indicated number of 150-nt paired-end

reads. The reads were trimmed to remove adapter sequences and low quality base-calls

(sequencing quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming

were discarded. Trimmed reads were then mapped by using Tophat and Bowtie2 to a

human genome reference sequence (Ensembl GRCh38 Release 76) supplemented with

additional rRNA gene contigs, as described in Materials and Methods.
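The correspondence between the quality score cut-off of 20 and an error probability of 0.01 quoted in the table legends follows from the standard Phred definition, Q = -10 log10(p). A small Python illustration (the function names are hypothetical):

```python
import math


def phred_to_error_probability(q):
    """Base-calling error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for a base-calling error probability p."""
    return -10 * math.log10(p)
```

With this definition, a base call with Q = 20 has an estimated error probability of 0.01, and each additional 10 units of Q lowers the error probability tenfold.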

Plasma RNA prepared by the Direct-zol method
  Dataset 1 (DS1)     TeI4c RT, total plasma RNA, no treatment (NT)
  Dataset 2 (DS2)     TeI4c RT, total plasma RNA, no treatment (NT)
  Dataset 3 (DS3)     TeI4c RT, total plasma RNA, no treatment (NT)
  Dataset 4 (DS4)     TeI4c RT, total plasma RNA, 3’-phosphate removal (-3’P)
  Dataset 5 (DS5)     TeI4c RT, total plasma RNA, 3’-phosphate removal (-3’P)
  Dataset 6 (DS6)     TeI4c RT, total plasma RNA, 3’-phosphate removal (-3’P)
  Dataset 7 (DS7)     TeI4c RT, total plasma RNA, on-column DNase I (OCD)
  Dataset 8 (DS8)     TeI4c RT, total plasma RNA, on-column DNase I (OCD)
  Dataset 9 (DS9)     TeI4c RT, total plasma RNA, on-column DNase I (OCD)
  Dataset 10 (DS10)   TeI4c RT, total plasma RNA, on-column DNase I (OCD)
  Dataset 11 (DS11)   TeI4c RT, total plasma RNA, Baseline-ZERO DNase (BZD)¹
  Dataset 12 (DS12)   GsI-IIC RT, total plasma RNA, no treatment (NT)
  Dataset 13 (DS13)   GsI-IIC RT, total plasma RNA, no treatment (NT)
  Dataset 14 (DS14)   GsI-IIC RT, total plasma RNA, no treatment (NT)
  Dataset 15 (DS15)   GsI-IIC RT, total plasma RNA, on-column DNase I (OCD)

Plasma RNA prepared with a mirVana miRNA isolation kit
  Dataset 16 (DS16)   TeI4c RT, total plasma RNA, Baseline-ZERO DNase (M-BZD)¹,²
  Dataset 17 (DS17)   GsI-IIC RT, small plasma RNA, no treatment (NT)²

Whole-cell RNA
  Dataset 18 (DS18)   TeI4c RT, Jurkat cells, ribo-depleted and fragmented
  Dataset 19 (DS19)   GsI-IIC RT, K562 cells, ribo-depleted and fragmented

¹ Datasets constructed from plasma RNA treated with Baseline-ZERO DNase had decreased mapping rates and complexity, reflecting loss of material due to additional treatment and recovery steps when starting with very small amounts of plasma RNA.

² The dataset contains reads combined from multiple biological replicates (two for DS16 and four for DS17).

Table 3.5: Summary of RNA-seq datasets.

Figure 3.1: TGIRT-seq overview.

(A) RNA-seq library construction via TGIRT template-switching. TGIRT

template-switching reverse transcription reactions use an initial template-primer substrate

comprised of an RNA oligonucleotide, which contains an Illumina Read 2 primer-binding

site (R2 RNA) and has a 3’-blocking group, annealed to a complementary DNA primer

(R2R DNA), which leaves an equimolar mixture of A, C, G, and T (denoted N) single-

nucleotide 3’ overhangs. The initial R2 RNA-R2R DNA substrate was mixed with target

RNA and TGIRT enzyme in the reaction medium, with the enzyme added last, and then

pre-incubated for 30 min at room temperature prior to initiating reverse transcription

reactions by adding dNTPs. The reactions were incubated at 60°C for 5 to 30 min,

depending on the length and/or modification level of target RNA, and terminated by

alkaline treatment (Materials and Methods). The cDNA products were then purified with

a MinElute Reaction Cleanup Kit (QIAGEN) and ligated at their 3’ ends to a 5’-

adenylated/3’-blocked DNA oligonucleotide complementary to an Illumina Read 1

primer (R1R) by using a Thermostable 5’ AppDNA/RNA Ligase (New England Biolabs).

The ligated cDNAs were re-purified and amplified by PCR for 12 cycles to add Illumina

flow cell capture sites (P5 and P7) and barcode sequences for sequencing. (B) Sequences

of oligonucleotides used for TGIRT-seq. (C) Mapping pipeline for human RNA-seq

datasets constructed with TGIRT enzymes. After trimming adapter sequences and reads

with low quality base calls by using cutadapt, reads of ≥18 nt were mapped by Tophat

and Bowtie2 (default settings) to a human genome reference sequence (Ensembl GRCh38

release 76) supplemented with additional rRNA gene contigs and other sequences (Pass

1; see Materials and Methods). Unmapped reads from Pass 1 were then re-mapped to the

same human genome reference sequence using Bowtie2 local alignment (default settings)

to recover reads from RNAs with post-transcriptionally added nucleotides (e.g., 3’ CCA,

poly(U)) or short introns (e.g., tRNA introns) (Pass 2). Concordant read pairs that

mapped uniquely with MAPQ ≥15 from Passes 1 and 2 were combined and mapped to

genomic features. Reads that mapped to tRNA genes were filtered and combined with the

reads that remained unmapped after the Bowtie2 local alignment, and remapped to

human tRNA reference sequences (UCSC genome browser website) to achieve optimal

recovery and mapping of tRNA reads. tRNA reads with MAPQ ≥1 were combined with

mapped genome reads from the prior steps for downstream analysis.

Figure 3.2: Bioanalyzer traces showing size profiles of plasma RNAs before and after

various treatments.

Total plasma RNA was prepared by the Direct-zol method, and a 1-µl portion was

analyzed with an RNA 6000 Pico Kit (mRNA assay) on a 2100 Bioanalyzer (Agilent) to

obtain the traces shown in the Figure. (A) Total plasma RNA with no further treatment

(NT). (B) Total plasma RNA after on-column DNase I treatment (OCD). (C) and (D)

Total plasma RNA after OCD treatment followed by RNase I or alkaline hydrolysis

treatments, respectively.

Figure 3.3: Bioanalyzer traces testing the efficiency of DNase treatments used on plasma

RNA preparations.

(A) On-column DNase I treatment. A mixture containing 10 ng of a 74-nt single-

stranded DNA and 275-bp double-stranded DNA (Materials and Methods) was mock

extracted with Trizol LS Reagent and processed by the Direct-zol method without or with

on-column DNase I treatment (OCD), as described for plasma RNA preparations in

Materials and Methods. After processing, a 1-µl portion of the DNA was analyzed with a

High Sensitivity DNA Kit on a 2100 Bioanalyzer (Agilent). (B-D) Baseline-Zero DNase

treatment. A 1-µl portion of total plasma RNA extracted from 1 ml of plasma by using

the mirVana combined method was analyzed with an RNA 6000 Pico Kit (mRNA assay)

on a 2100 Bioanalyzer with (B) no further treatment (M-NT); (C) after addition of a 10-

ng mixture of the same ssDNA and dsDNA as in (A); and (D) after addition of the 10-ng

mixture of the DNAs followed by treatment with Baseline-ZERO DNase (M-BZD).

Figure 3.4: The distribution of transcript lengths in total plasma RNA libraries calculated

by the coverage of paired-end read span.

(A) and (B) Distribution of calculated transcript lengths in total plasma RNA

prepared by the Direct-zol method with no further treatment (NT; combined DS1-3) or

after on-column DNase I treatment (OCD; combined DS7-10), respectively. Transcript

lengths were calculated by paired-end read span using bedtools, and their distribution was

plotted in R. (C) The distribution of transcript lengths for paired-end reads mapping to

protein-coding genes in the OCD datasets calculated and plotted as above. The reads

mapping to protein-coding genes were filtered to remove reads for which >50% of the

read length overlapped embedded small ncRNAs prior to calculating transcript lengths.

The read gap correlates with read length in the RNA-seq reaction and is caused by the

loss of coverage due to trimming of the final nucleotides of the reads, which are often

lower quality base calls.
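The paired-end span calculation mentioned in this legend was done with bedtools; the underlying arithmetic can be illustrated with a short Python sketch. The function names and the tuple representation of mates are assumptions made for illustration, with 0-based, half-open coordinates as in BED format.

```python
from collections import Counter


def read_pair_span(mate1, mate2):
    """Genomic span in nt covered by a read pair.

    Each mate is a (chrom, start, end) tuple with 0-based, half-open
    coordinates. The span runs from the leftmost start to the rightmost end.
    """
    chrom1, start1, end1 = mate1
    chrom2, start2, end2 = mate2
    if chrom1 != chrom2:
        raise ValueError("mates map to different reference sequences")
    return max(end1, end2) - min(start1, start2)


def span_histogram(pairs):
    """Count read pairs at each span length."""
    return Counter(read_pair_span(m1, m2) for m1, m2 in pairs)
```

For short RNAs that are read through completely from both ends, the two mates cover the same interval and the span equals the RNA length. The span is a genomic distance, so a pair that crosses a splice junction would include the intron length.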

Figure 3.5: Percentage of TGIRT-seq reads from total plasma RNA datasets mapping to

different categories of genomic features.

RNA-seq datasets were constructed by using TeI4c RT for total plasma RNA

prepared by the Direct-zol method and either not treated (NT; combined DS1-3), 3’

dephosphorylated (-3’ P; combined DS4-6), or on-column DNase I-treated (OCD;

combined DS7-10). Reads were mapped to genomic features as described in Materials

and Methods. (A) Stacked bar graphs showing the percentage of concordant read pairs

that mapped uniquely in the correct orientation to the indicated category of genomic

features. Protein-coding genes include immunoglobulin and T-cell receptor genes; long

ncRNAs include lincRNAs, antisense RNAs and other lncRNAs; and rRNA genes

include 5S, 5.8S, 18S, and 28S rRNA genes. (B) Stacked bar graphs showing the

percentage of small ncRNA read pairs (1.8-5.8% of the reads in the total plasma RNA

datasets) that mapped to different categories of small ncRNA genes. In (A) and (B), the

numbers next to each stacked bar segment indicate the number of different genes for

which transcripts were identified in that category. Only features with ten or more mapped

reads in the combined datasets were included. Abbreviation: MT, mitochondrial genes.

Figure 3.6: Correlation analysis for biological replicates of total plasma RNA libraries.

Reads from the indicated RNA-seq datasets constructed by using either TeI4c or

GsI-IIC RTs for total plasma RNA prepared by the Direct-zol method and treated in

different ways were normalized to generate (A-D) correlation matrices and (E) a scatter

plot. Pairwise Spearman’s correlation coefficients (ρ) are shown in the boxes of the

correlation matrices and at the upper left of the scatterplot. NT, not treated; -3’ P, treated

to remove 3’ phosphates; OCD, on-column DNase I treatment.
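The pairwise comparison summarized in this figure can be illustrated with a small, dependency-free Python sketch of Spearman's rank correlation. It does not reproduce the DESeq normalization used for the figure: the `per_million` helper is a simple library-size scaling swapped in for illustration, and all function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values given their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def per_million(counts):
    """Scale raw counts to counts per million mapped reads."""
    total = sum(counts)
    return [1e6 * c / total for c in counts]
```

Because Spearman's ρ depends only on the rank order of features within each dataset, multiplying a dataset by a single scale factor changes the plotted values but leaves ρ unchanged.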

Figure 3.7: RNA-seq analysis of total plasma RNA libraries constructed with GsI-IIC

group II intron RT.

RNA-Seq libraries were constructed by using GsI-IIC RT from total plasma RNA

prepared by the Direct-zol method without (GsI-IIC, NT; combined DS12-14) or with

(GsI-IIC, OCD; DS15) on-column DNase I treatment following the manufacturer’s

protocol. (A) Stacked bar graphs showing the percentage of concordant read pairs that

mapped uniquely in the annotated orientation to the indicated category of features. (B)

Stacked bar graphs showing the percentage of small ncRNA read pairs (1.3-2.1% of the

reads in the total plasma RNA datasets; also see Supplemental Data File) that mapped to

different categories of small ncRNAs. Protein-coding genes include immunoglobulin and

T-cell receptor genes; long ncRNAs include lincRNAs, antisense RNAs and other

lncRNAs; and rRNA genes include 5S, 5.8S, 18S, and 28S rRNA genes. The numbers

next to the stacked bar segments indicate the number of different genes for which

transcripts were identified in each category of features. Only features with ten or more

mapped reads in the combined datasets were included. Abbreviation: MT, mitochondrial

genes. (C) Stacked bar graphs showing the percentage of bases in protein-coding gene

reads that mapped to coding sequences (CDS), introns, 5’- and 3’-untranslated regions

(UTRs), and intergenic regions. (D) Stacked bar graphs showing the proportion of

concordant read pairs that mapped to the sense and antisense strands of protein-coding

genes. In (C) and (D), the reads that mapped to protein-coding genes were filtered to

remove those with >50% of the read length overlapping embedded small ncRNAs, and

the percentage of bases or reads mapping to different regions or strands was calculated by

using picard tools.

Figure 3.8: Human plasma RNA is enriched in intron and antisense sequences compared

to whole-cell RNAs.

Reads mapping to protein-coding genes were analyzed to assess coverage across

different regions and both DNA strands in RNA-seq datasets constructed with TGIRT

enzymes for total plasma or whole-cell RNA prepared and treated in different ways.

These include plasma RNA prepared by the Direct-zol method with no further treatment

(NT; combined DS1-3), after on-column DNase I treatment (OCD; combined DS7-10),

or after Baseline-ZERO DNase treatment (BZD; DS11); plasma RNA prepared by the

mirVana combined method after Baseline-ZERO DNase treatment (M-BZD; DS16); and

ribo-depleted and fragmented whole-cell RNA from Jurkat cells (TeI4c RT; DS18) or

K562 cells (GsI-IIC RT; DS19). (A) Stacked bar graphs showing the percentage of bases

in protein-coding gene reads that mapped to coding sequences (CDS), introns, 5’- and 3’-

untranslated regions (UTRs), and intergenic regions. (B) Stacked bar graphs showing the

proportion of concordant read pairs that mapped to the sense and antisense strands of

protein-coding genes. In (A) and (B), reads that mapped to protein-coding genes were

filtered to remove those with >50% of the read length overlapping embedded small

ncRNAs, and the percentage of bases or reads mapping to different regions or strands

was calculated by using picard tools. Reads from the OCD, BZD, and M-BZD datasets

were analyzed with or without removal of read pairs with a span of <30 nt to exclude

short DNA fragments that may have escaped DNase treatment.

Figure 3.9: Proportion of reads mapping to the sense strand of protein-coding genes as a

function of gene length in RNA-seq datasets of human plasma or whole-cell

RNAs.

Reads that mapped to either the sense or antisense strands of the protein-coding

genes in the datasets indicated in the Figure were retrieved using bedtools and filtered to

remove reads for which >50% of the read length overlapped embedded small ncRNAs.

The percentage of sense reads (black dots) versus gene length (red line) was then plotted

using R for genes with ≥10 reads mapping to one or both strands.
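The two steps in this legend, removing reads that mostly overlap embedded small ncRNAs and computing the percentage of sense reads per gene, can be sketched in Python as follows. This is an illustration under assumptions, not the bedtools and R workflow itself: the function names and data structures are hypothetical, intervals are 0-based and half-open, and the ten-read threshold is applied to the sum of reads on both strands, which is one possible reading of "one or both strands".

```python
def overlap_fraction(read, feature):
    """Fraction of the read length that overlaps the feature.

    read and feature are (start, end) intervals on the same reference.
    """
    read_start, read_end = read
    feat_start, feat_end = feature
    overlap = max(0, min(read_end, feat_end) - max(read_start, feat_start))
    return overlap / (read_end - read_start)


def keep_read(read, small_ncrnas, max_fraction=0.5):
    """True unless more than max_fraction of the read overlaps a small ncRNA."""
    return all(overlap_fraction(read, f) <= max_fraction for f in small_ncrnas)


def percent_sense_by_gene(strand_counts, min_reads=10):
    """Percentage of sense reads for genes with at least min_reads reads.

    strand_counts maps gene name to (sense_reads, antisense_reads).
    """
    result = {}
    for gene, (sense, antisense) in strand_counts.items():
        total = sense + antisense
        if total >= min_reads:
            result[gene] = 100.0 * sense / total
    return result
```

Plotting the returned percentages against gene length would then give the type of display described in the legend.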

Figure 3.10: Human plasma contains both mature and pre-miRNAs.

(A) Relative abundance of miRNAs identified in RNA-seq datasets constructed

with TeI4c RT for total plasma RNAs prepared by the Direct-zol method with on-column

DNase I treatment (OCD; combined DS7-10; left) or by the mirVana combined method

with Baseline-ZERO DNase treatment (M-BZD; DS16; right). miRNA loci with ten or

more mapped reads were rank-ordered by read count and plotted to display relative

abundance. The 20 most abundant miRNA loci by read count are shown in the bar graph

insets. Loci encoding predicted miRNAs (Ensembl GRCh38 Release 76) were not

included in the bar graphs unless mature-sized miRNAs mapping to the locus were

identified in the datasets. (B) and (C) IGV screen shots showing coverage plots (CP;

above) and alignments (below) of reads for loci in which abundant miRNA transcripts

were identified in the OCD and M-BZD datasets, respectively. In (B), the miRNA

transcripts were ordered based on abundance as shown in the left panel of (A). (C) IGV

screen shots showing additional miRNA transcripts that were abundant in the M-BZD

dataset, but less abundant or not present in the OCD datasets. The arrow at the top

indicates the boundaries and 5’ to 3’ orientation of the mature miRNA on the

chromosomal DNA sequence. Reads were sorted by start site on the chromosome, which

can be from either the 5’ or 3’ end depending on the orientation of the gene on the

chromosome. Nucleotides matching the genome sequence are shown in gray, and

mismatches are shown as different colors (A, green; C, blue; G, brown; and T, red),

which can either correspond to or be the complement of the RNA sequence depending on

the orientation of the gene on the chromosome. Mismatches were checked against NCBI

dbSNP, and known SNPs are indicated with the nucleotide change and corresponding

SNP ID. Mismatches at the 5’ end of the reads are likely due to non-templated nucleotide

addition by the TGIRT enzyme to the 3’ end of the cDNAs. Some miRNAs (e.g., miR-122)

have post-transcriptionally added A or AA residues at their 3’ ends (Norbury, 2013).

Figure 3.11: Tissue expression profiles for mature miRNAs in plasma.

The Figure shows tissue expression profiles of the mature miRNAs identified by

TGIRT-Seq in total plasma RNA prepared by the Direct-zol method with on-column

DNase I treatment (OCD; combined DS7-10). The profiles are based on the relative

RNA-seq expression values of the miRNAs in a published database (Landgraf et al.,

2007), and only miRNAs present in that database are shown. Tissue categories:

podocytes include both differentiated and undifferentiated podocytes; peripheral

leukocytes include T-lymphocytes, NK cells, monocytes, granulocytes and dendritic

cells. miRNAs highlighted in red are also abundant (top 10 percentile) in red blood cells

or platelets (Wang et al., 2012), cell types for which relative RNA-seq expression values

were not available in the database used to calculate the expression profiles (Landgraf et

al., 2007).

Figure 3.12: Tissue expression profiles of mature miRNA identified in total plasma RNA

prepared by the mirVana combined method.

The Figure shows tissue expression profiles of mature miRNAs in an RNA-seq

dataset constructed with TeI4c RT from total plasma RNA prepared by the mirVana

combined method and treated with Baseline-ZERO DNase (M-BZD; DS16). Tissue

expression profiles were plotted as described in Fig. 3.11. Tissue categories: podocytes

include both differentiated and undifferentiated podocytes; peripheral leukocytes include

T-lymphocytes, NK cells, monocytes, granulocytes and dendritic cells. miRNAs

highlighted in red are also abundant (top 10 percentile) in red blood cells or

platelets (Wang et al., 2012), cell types for which relative RNA-seq expression values

were not available in the database used to calculate the expression profiles (Landgraf et

al., 2007).

Figure 3.13: TGIRT-seq detects full-length pre-miRNAs and a miRNA that may be

present in plasma in an RNA/DNA hybrid.

(A) Secondary structures of full-length pre-miRNAs shown in the IGV plots of

Figure 3.10C. (B) IGV screen shots showing coverage plots (CP; above) and alignments

(below) of reads for miR-182 in the RNA-seq datasets indicated in the Figure for non-

treated (NT) or on-column DNase I (OCD)-treated plasma RNA preparations with TeI4c

or GsI-IIC RTs. The arrow at the top indicates the boundaries and 5’ to 3’ orientation of

the annotated mature miRNA on the chromosomal DNA sequence. Read pairs were

grouped and colored by orientation, with the sense read pairs shown in light purple and

the antisense read pairs shown in salmon. The numbers to the right of the alignment

indicate the number of reads in each category. The alignment with >1,000 mapped reads

was down-sampled to 1,000 reads in IGV.

Figure 3.14: Relative abundance and IGV alignments of miRNAs identified in a small

plasma RNA-seq dataset constructed with GsI-IIC RT.

(A) Relative abundance. Small plasma RNA was isolated by the mirVana small

RNA enrichment method, and RNA-seq libraries were constructed by using GsI-IIC RT

(GsI-IIC, Small; DS17). miRNA loci with ten or more mapped reads were rank-ordered by

read count and plotted to display relative abundance. The 20 most abundant miRNA loci

by read count are shown in the bar graph inset. Loci encoding predicted miRNAs (Ensembl

GRCh38 Release 76) were not included in the bar graph unless mature-sized miRNAs

mapping to the locus were identified in the dataset. (B) IGV screen shots. The screen shots

show coverage plots (CP; above) and alignments (below) of reads for loci in which the 20

most abundant miRNA transcripts were identified in the dataset. The IGV coverage plots

and alignments of reads are as described in Figure 3.10.

Figure 3.15: TGIRT-seq identifies full-length mature tRNAs and tRNA fragments in

human plasma.

(A) Relative abundance of tRNAs identified in RNA-seq datasets constructed

with TeI4c RT for total plasma RNA prepared by the Direct-zol method without (NT;

combined DS1-3) or with treatment to remove 3’ phosphates (-3’ P; combined DS4-6).

The plots show tRNAs with ten or more mapped reads grouped by anticodon and rank-

ordered by read count. The 15 most abundant tRNAs based on anticodon are shown in the

bar graph insets. (B) IGV screen shots showing coverage plots (CP; above) and

alignments (below) of reads for abundant full-length mature tRNAs identified in the NT

datasets. The tRNAs were ordered by abundance as in the left panel of (A). For cases in

which multiple loci encode tRNAs with the same sequence, tRNA reads were distributed

equally among different tRNA loci for the IGV alignments. (C) IGV screen shots

showing coverage plots and alignments of reads for representative 3’-tRNA halves in the

NT datasets (AlaAGC and ThrCGT) and 5’-tRNA halves in the -3’ P datasets (GlyCCC,

ArgCCG and AspGTC). The arrow at the top indicates the boundaries and 5’ to 3’

orientation of the mature tRNA on the chromosomal DNA sequence. In order to fit the

entire alignment in one panel, genes with >1,000 mapped reads were down-sampled to

1,000 reads in IGV. Reads were sorted by start site on the chromosome. Nucleotides

matching the genome sequence are shown in gray, and mismatches are shown as different

colors (A, green; C, blue; G, brown; and T, red). Mismatches at the 5’ end of the reads

are likely due to non-templated nucleotide addition by the TGIRT enzyme to the 3’ end

of the cDNAs. Mismatches due to misincorporation at known sites of post-transcriptional

modifications are highlighted with the name of the modification. Modifications: I,

inosine; m1A, 1-methyladenosine; m3C, 3-methylcytidine; m5C, 5-methylcytidine; m1G,

1-methylguanosine; m2G, N2-methylguanosine; m22G, N2,N2-dimethylguanosine.

Figure 3.16: Other classes of small non-coding RNAs identified as full-length mature

transcripts in human plasma by TGIRT-seq.

(A) IGV screen shots showing coverage plots (CP; above) and alignments (below)

of reads mapping to small ncRNA loci in RNA-seq datasets constructed with TeI4c RT

for total plasma RNA prepared by the Direct-zol method (NT; combined DS1-3). The RNA

biotype is indicated at the top with the gene name and transcript length in parentheses. (B)

Examples of small ncRNA fragments with poly(U) tails. IGV screen shots showing

coverage plots (CP; above) and alignments (below) of Read 1s for poly(U)-tailed small

ncRNAs found among the unmapped reads in NT datasets. In (A) and (B), the arrow at the

top indicates the boundaries and 5’ to 3’ orientation of the mature transcript on the

chromosomal DNA sequence. In order to fit the entire alignment in one panel, genes with

>1,000 mapped reads were down-sampled to 1,000 reads in IGV. Reads were sorted by

start site on the chromosome, which can be from either the 5’ or 3’ end depending on the

orientation of the gene on the chromosome. Nucleotides matching the genome sequence

are shown in gray, and mismatches are shown as different colors (A, green; C, blue; G,

brown; and T, red), which can either correspond to or be the complement of the RNA

sequence. Mismatches were checked against NCBI dbSNP, and known SNPs are indicated

with the nucleotide change and corresponding SNP ID. Other mismatches were manually

checked and were due to lower quality base-calls, non-templated nucleotide addition to the

3’ end of the cDNA resulting in extra nucleotides at the 5’ end of the read, or misalignment

by Bowtie2 local alignment.

Chapter 4: Identification of circulating RNA biomarkers in multiple

myeloma

4.1 INTRODUCTION

Multiple myeloma is the second most prevalent hematological cancer in the USA

after non-Hodgkin lymphoma (Raab et al., 2009). It remains an incurable disease that

causes 15-20% of deaths from blood malignancies and about 2% of all deaths from cancer

(International Myeloma Working Group, 2003), with a median survival of around five

years for newly diagnosed patients (Bergsagel et al., 2013). In myeloma, malignant

plasma cells in the bone marrow proliferate and interfere with production of normal

blood cells (Raab et al., 2009). They also produce a monoclonal protein, referred to as the

paraprotein, which can be detected in blood or urine, or both. The paraproteins are

comprised of monoclonal immunoglobulins, typically IgG or IgA, and monoclonal free

light chains, which together are responsible for decreased humoral immunity and over

90% of the renal impairment that occurs in myeloma (Stringer et al., 2011). Other

common symptoms associated with myeloma include anemia, bone disease, and

hypercalcemia (Smith and Yong, 2013). The genetic abnormalities underlying the

pathogenesis of myeloma include chromosomal translocations, multiple trisomies, and

late onset mutations (Smith and Yong, 2013).

Diagnosis of myeloma often involves: (i) paraprotein concentration in serum or

urine, detected by serum electrophoresis and immunofixation; (ii) plasma cell infiltration

in the bone marrow (BMPCs), assessed by bone marrow aspirate; and (iii) bone lesions,

screened by skeletal survey using plain radiographs and magnetic resonance imaging

(MRI) (Raab et al., 2009). Monoclonal gammopathy of undetermined significance

(MGUS) and smoldering multiple myeloma (SMM) are conditions in which patients have

less (MGUS) or more (SMM) than 30 g/L paraproteins and <10 % BMPCs, but have not

yet developed any myeloma-associated symptoms (Smith and Yong, 2013). The risk of

progression to active multiple myeloma (AMM) in the first 5 years after initial diagnosis

is 1% and 10% per year for MGUS and SMM, respectively (Rajkumar et al., 2015).

However, SMM is biologically heterogeneous, including a subset of patients displaying

premalignancy similar to MGUS, and a subset of high-risk patients progressing to AMM

with a median time of only 2 years (Rajkumar et al., 2015). Although evidence suggests

that high-risk SMM patients may benefit from early therapeutic intervention,

unfortunately there is no reliable pathological or molecular biomarker that can be used to

distinguish the MGUS-like SMM patients from the malignant SMM patients, making it

challenging for early detection and treatment (Rajkumar et al., 2015).
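
As a rough worked example of the per-year risks quoted above, the cumulative risk over the first 5 years can be approximated by treating the annual risk as constant and independent from year to year. That is a simplifying assumption made here for illustration, not a statement from the cited studies, and the function name is hypothetical.

```python
def cumulative_risk(annual_risk: float, years: int) -> float:
    """Probability of progressing within `years`, assuming a constant annual risk."""
    return 1.0 - (1.0 - annual_risk) ** years


# Annual risks of progression to AMM quoted in the text (first 5 years).
mgus_5yr = cumulative_risk(0.01, 5)
smm_5yr = cumulative_risk(0.10, 5)

print(f"MGUS 5-year risk: {mgus_5yr:.1%}")
print(f"SMM 5-year risk:  {smm_5yr:.1%}")
```

Under this assumption, roughly 5% of MGUS patients and roughly 41% of SMM patients would progress within 5 years, which illustrates why the SMM group is the main target for early risk stratification.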

Next-generation RNA-sequencing (RNA-seq) is a powerful tool for transcriptome

profiling and gene expression analysis, which can potentially allow diagnostics of human

diseases to move from morphology and low-sensitivity protein analysis into global

identification of RNA biomarkers (Meldrum et al., 2011; Byron et al., 2016). The key

factor underlying the success of RNA-based diagnostics is the ability to analyze all RNAs

at the same time with high sensitivity and minimal bias. However, retroviral RTs used in

conventional methods for reverse transcription of target RNAs have inherently low

processivity and fidelity, resulting in RNA-seq libraries with reduced complexity and

accuracy (Hu and Hughes, 2012). Additionally, the use of RNA ligase for attaching

RNA-seq adapters to the target RNA leads to bias and low efficiency in RNA-seq library

construction (Linsen et al., 2009; Levin et al., 2010; Lamm et al., 2011).

Recently, we developed new RNA-seq methods for the analysis of whole-cell,

exosomal, and plasma RNAs based on the use of thermostable group II intron reverse

transcriptases (TGIRT enzymes) (Qin et al., 2016; Nottingham et al., 2016). TGIRTs have

higher thermostability, processivity and fidelity than conventional retroviral reverse

transcriptases, along with a novel end-to-end template-switching activity that attaches

RNA-seq adapters to target RNAs without using RNA ligase (Mohr et al., 2013). TGIRTs

give full-length reads of structured small non-coding RNAs (small ncRNAs), including

tRNAs and snoRNAs, which are refractory to retroviral RTs, and enable identification of

a variety of base modifications in these RNAs by distinctive patterns of misincorporated

nucleotides (Katibah et al., 2014; Shen et al., 2015; Zheng et al., 2015). Validation of

TGIRT-seq on well-characterized human RNA reference samples and comparisons to

published Illumina TruSeq datasets for these samples further showed that TGIRT-seq: (i)

is simpler yet more strand-specific than TruSeq v3; (ii) recapitulates the relative abundance

of human transcripts and RNA spike-ins in ribo-depleted, fragmented RNA samples

comparably to the non-strand-specific TruSeq v2 and better than the strand-specific

TruSeq v3 methods; (iii) gives more uniform 5’ to 3’ gene coverage than either TruSeq

method; (iv) detects more splice junctions, particularly near the 5’ ends of genes, than

TruSeq v3 at comparable read depth, even from fragmented RNAs; and (v) eliminates

sequence biases due to random hexamer priming that are inherent in TruSeq (Nottingham

et al., 2016). By using the TGIRT-seq total RNA method, we constructed RNA-seq

libraries from <1 ng plasma RNA in <5 h (Qin et al., 2016). We find that plasma contains

RNA fragments derived from large numbers of protein-coding genes and lncRNAs, along with

most known classes of small ncRNAs. Many of the latter are present as full-length

transcripts, suggesting protection from plasma RNases in ribonucleoprotein complexes

and/or exosomes (Qin et al., 2016). A particular advantage of TGIRT-seq is that these

structured small ncRNAs can be profiled in the same RNA-seq run as protein-coding and

long non-coding RNAs (lncRNAs), providing a potentially more robust biomarker

identification than conventional methods.

We are currently collaborating with Drs. Flavia Pichiorri and Craig Hofmeister’s

group at the Ohio State University to identify RNA biomarkers for early, sensitive and

non-invasive myeloma diagnostics by carrying out TGIRT-seq analysis of circulating

RNAs isolated from extracellular vesicles in human plasma. Extracellular vesicles (EVs)

are membrane-enclosed structures containing nucleic acids and proteins released by cells

(Raposo and Stoorvogel, 2013). They have been recognized as important vehicles for

intercellular communication, and their emerging roles in diagnostics and therapeutics are

under intensive investigation (EL Andaloussi et al., 2013).

4.2 RNA PROFILES OF EXTRACELLULAR VESICLES IN HUMAN PLASMA

To obtain RNAs from extracellular vesicles (EV-RNAs), ~20 ml of plasma

collected from 10 de-identified individuals, including 4 healthy (H), 3 smoldering (SMM)

and 3 active multiple myeloma (AMM), were processed directly with an exoEasy Maxi

Kit (QIAGEN) and analyzed on a 2100 Bioanalyzer (Agilent). Previous studies of whole

plasma RNA in our laboratory showed the presence of ~160-bp DNA fragments in

plasma (Qin et al., 2016). However, Bioanalyzer traces of the EV-RNA preparations

showed a broad peak centered around 40-60 nt with a minor or undetectable peak at ~160

bp (Fig. 4.1), suggesting that the ~160-bp DNA fragments in plasma potentially come from

other sources, such as apoptotic bodies, rather than from the extracellular vesicles. For

initial analysis described here, we used the TGIRT-seq total RNA method as described

for human plasma RNAs (see Chapter 3 and ref. Qin et al., 2016) to construct RNA-seq

libraries from 1.7-7.5 ng of EV-RNAs without further treatment from four healthy

individuals (H1-4), three patients with SMM (SMM1-3), and three patients with AMM

(AMM1-3).

Table 4.1 summarizes the mapping statistics for all RNA-seq datasets. The

samples were sequenced on an Illumina NextSeq 500 to give 3.3-61.2 million 75-nt

paired-end reads. The reads were trimmed to remove adapter sequences and low quality

base calls and then mapped to a human genome reference sequence (Ensembl GRCh38

Release 76) supplemented with additional rRNA gene contigs (Materials and Methods),

using the TGIRT-seq mapping pipeline (Qin et al. 2016). For all RNA-seq datasets, 67.5-

93.7% of the paired-end reads mapped to the human genome, and 4.7-36.2% were

concordant read pairs that mapped uniquely and with high mapping quality (MAPQ ≥15)

to non-ribosomal and non-mitochondrial genomic features in the annotated orientation.

For confidence, only features with ≥10 hits were counted in the analysis. The reduced

total number of reads obtained for dataset AMM1 was likely explained by its extremely

low RNA input (~1.7 ng), which resulted in excess primer-dimers being produced during PCR;

these remained after clean-up and consumed the majority of the reads. Datasets SMM1 and

AMM3 had lower mapping rates to genomic features, which were due to mitochondrial

contamination during the preparation of EVs.

Figure 4.2 shows the percentage of reads mapping to different genomic features in

combined EV-RNA datasets obtained for each group (H, SMM and AMM), using only

uniquely mapped concordant read pairs for the calculation. The number of individual

genes to which the reads mapped is shown next to each feature in the stacked bar graphs.

The TGIRT-seq profiles of EV-RNA samples are very similar to previously published

plasma RNA samples (Qin et al. 2016), with the majority of the reads corresponding to

fragmented protein-coding gene transcripts and lncRNAs (Fig. 4.2A), and a smaller proportion

mapping to a variety of small ncRNAs (Fig. 4.2B). These findings suggest that

extracellular vesicles are major contributors to the plasma RNA pool.

4.3 TGIRT-SEQ IDENTIFIES DIFFERENTIALLY EXPRESSED TRANSCRIPTS BY DISEASE

STAGES

Despite the overall similarity among RNA classes identified in EV-RNAs for

healthy, SMM and AMM groups, their TGIRT-seq profiles separated into three distinct

classes (Fig. 4.3), suggesting gene expression changes as the disease advances. Indeed, a

number of differentially expressed protein-coding and small ncRNAs that were up- or

down-regulated in the SMM group showed progressive increases or decreases,

respectively, in the AMM group (Fig. 4.4). Therefore, TGIRT-seq analysis of RNAs in

plasma EVs is potentially useful for the identification of high-risk SMM patients with

rapid progression to malignancy.
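
The idea of a progressive change across disease stages can be sketched as a simple monotonic-trend check on group means. This is only an illustration of the criterion described above, not the analysis actually performed; the function name and the toy values are hypothetical.

```python
from statistics import mean


def progressive_trend(healthy, smm, amm):
    """Return 'up' or 'down' if mean levels change monotonically H -> SMM -> AMM, else None."""
    h, s, a = mean(healthy), mean(smm), mean(amm)
    if h < s < a:
        return "up"
    if h > s > a:
        return "down"
    return None


# Toy normalized levels (hypothetical): 4 healthy, 3 SMM, 3 AMM samples.
rising = progressive_trend([10, 12, 11, 9], [30, 28, 35], [80, 75, 90])
falling = progressive_trend([50, 55, 48, 52], [30, 25, 28], [5, 8, 6])
mixed = progressive_trend([10, 12, 11, 9], [30, 28, 35], [12, 10, 11])
```

A transcript flagged "up" or "down" by such a check would still need a statistical test and independent validation before being considered a candidate biomarker.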

Among the differentially expressed transcripts was a population of ~33-nt RNA

fragments derived from Y RNAs, which are strongly elevated in both SMM and AMM

patients (YF; Fig. 4.4). Y RNAs are small RNAs (84-112 nt) that are part of the 60-kDa

Ro ribonucleoprotein autoantigens and function in RNA stability and cellular responses

to stress (Chen and Wolin, 2004; Wolin et al., 2012). Recent studies show that Y RNAs

are essential for the initiation of chromosomal DNA replication in vertebrates (Christov

et al., 2006; Krude et al., 2009). Interestingly, emerging evidence has identified short

fragments of Y RNAs in cells, solid tumors and bodily fluids of humans and other mammals

with implications in a variety of human diseases, including cancer (Kowalski and Krude,

2015). The potential role of Y RNA fragments as a novel diagnostic biomarker for

myeloma will be further investigated, including validation by qRT-PCR.

Finally, we compared our TGIRT-seq datasets to a previously published

microarray dataset, which tracked gene expression profiles of CD138-selected plasma

cells in 559 newly diagnosed myeloma patients for 730 days, with end-points

representing event-free survival (EFS), meaning a lack of malignancy or disease

recurrence, and overall survival (OS) (Popovici et al., 2010; Shi et al., 2010). Several

protein-coding genes, which had altered expression levels in EV-RNAs, demonstrated

association with survival (Fig. 4.5), providing support for using TGIRT-seq analysis of

circulating RNAs as an easily accessible and sensitive diagnostic tool. In the next phase

of this research, we will increase the number of patient samples for each disease stage

and include MGUS in the analysis.

4.4 DISCUSSION

We demonstrated the potential biotechnological applications of TGIRT-seq in

diagnostics for human diseases by the identification of circulating RNA biomarkers in

multiple myeloma. In order to obtain a complete profile of protein-coding genes and

lncRNAs together with small ncRNAs that are present in plasma EVs, we used the

TGIRT-seq total RNA method for rapid and efficient RNA-seq library construction with

no size selection and minimal bias (Nottingham et al., 2016; Qin et al., 2016). Initial

results for 10 datasets obtained from de-identified healthy individuals, patients with

SMM and AMM identified differentially expressed transcripts, including novel small

ncRNAs, Y RNA-derived fragments, and protein-coding genes correlated with survival

based on a previously published microarray study (Popovici et al., 2010; Shi et al., 2010).

We are now extending this approach to other types of cancer including

inflammatory breast cancer, a rare and very aggressive disease. By collaborating with Dr.

Naoto Ueno’s group at MD Anderson Cancer Center, we will be analyzing RNA samples

from FFPE (formalin-fixed, paraffin-embedded) tumor tissue, PBMCs (peripheral blood

mononuclear cells) and plasma. Finally, we are collaborating with Dr. Joseph

McCormick’s group at the University of Texas at Brownsville to analyze plasma RNA

samples for a large-scale population study of environmental impact on human health. We

will continue to explore and develop methods for using TGIRT-seq analysis of

circulating RNAs in blood or other bodily fluids as a sensitive, non-invasive and cost-

effective tool for early detection of a variety of human diseases, and for personalized

medical care.

4.5 MATERIALS AND METHODS

4.5.1 Thermostable group II intron RTs

Reverse transcription of RNAs for the construction of RNA-seq libraries was

done by using a thermostable GsI-IIC RT (TGIRT-III; InGex, St. Louis MO).

4.5.2 RNA preparations*

* This work was done by Enrico Caserta in Flavia Pichiorri’s research group at Ohio State University.

Plasma from de-identified healthy individuals or patients at different stages of

myeloma (SMM and AMM) was collected and processed with an exoEasy Maxi Kit

(QIAGEN) to obtain RNAs in the extracellular vesicles (EV-RNAs).

4.5.3 Construction of RNA-seq libraries

EV-RNA preparations were concentrated with an RNA Clean & Concentrator Kit

(Zymo Research) to 23 µl for healthy individuals and 12 µl for myeloma patients (SMM

and AMM). The quality and quantity of EV-RNAs were assessed by running 1 µl on a

2100 Bioanalyzer (Agilent) using the RNA 6000 Pico Kit (mRNA assay).

For construction of RNA-seq libraries, TGIRT template-switching reverse

transcription reactions were done by using an initial template-primer substrate consisting

of a 34-nt RNA oligonucleotide (R2 RNA), which contains an Illumina Read 2 primer-

binding site and a 3’-blocking group (C3 Spacer, 3SpC3; IDT), annealed to a

complementary 35-nt DNA primer (R2R DNA) that leaves an equimolar mixture of A, C,

G, or T single-nucleotide 3’ overhangs. Reactions were done in 20 µl of reaction medium

containing EV-RNAs (1.7-7.5 ng in 11-µl ddH2O), 100 nM template-primer substrate, 1

µM TGIRT-III enzyme, and 1 mM dNTPs (an equimolar mix of dATP, dCTP, dGTP,

and dTTP) in 450 mM NaCl, 5 mM MgCl2, 20 mM Tris-HCl, pH 7.5, and 5 mM

dithiothreitol (DTT). DTT was either prepared freshly or from a frozen concentrated 1 M

stock solution. Reactions were assembled by adding all components, except dNTPs, to a

sterile PCR tube containing EV-RNAs with the TGIRT-III enzyme added last. After pre-

incubating at room temperature for 30 min, reactions were initiated by adding dNTPs and

incubated for 15 min at 60°C. cDNA synthesis was terminated by adding 5 M NaOH to a

final concentration of 0.25 M, incubating at 95°C for 3 min, and then neutralizing with 5

M HCl. The resulting cDNAs were purified with a MinElute Reaction Cleanup Kit

(QIAGEN) and ligated at their 3’ end to a 5’-adenylated/3’-blocked (C3 spacer, 3SpC3;

IDT) adapter (R1R) by using Thermostable 5’ AppDNA/RNA Ligase (New England

Biolabs) according to the manufacturer’s recommendations. The ligated cDNA products

were re-purified with a MinElute column and amplified by PCR by using Phusion High-

Fidelity DNA polymerase (Thermo Fisher Scientific) with 200 nM of Illumina multiplex

and 200 nM of barcode primers (a 5’ primer that adds a P5 capture site and a 3’ primer

that adds a barcode plus P7 capture site). PCR was done with initial denaturation at 98°C

for 5 sec followed by 12 cycles of 98°C for 5 sec, 60°C for 10 sec and 72°C for 10 sec.

The PCR products were purified by using the Agencourt AMPure XP (Beckman Coulter)

and sequenced on a NextSeq 500 instrument (Illumina) to obtain 75-nt paired-end reads.
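
The reaction arithmetic implied by these conditions can be checked with a short calculation. The sketch below assumes the 20-µl reaction contains no NaOH before termination and ignores the volume change on mixing; the helper names are hypothetical.

```python
def stock_volume_to_add(v_initial_ul: float, c_stock_m: float, c_final_m: float) -> float:
    """Volume of stock (µl) to add so the mixture reaches c_final_m.

    Solves c_stock * v = c_final * (v_initial + v) for v.
    """
    return c_final_m * v_initial_ul / (c_stock_m - c_final_m)


def picomoles(conc_nm: float, volume_ul: float) -> float:
    """Amount in pmol; nM x µl gives fmol, so divide by 1000."""
    return conc_nm * volume_ul / 1000.0


REACTION_UL = 20.0

# 5 M NaOH needed to bring a 20-µl reaction to 0.25 M final.
naoh_ul = stock_volume_to_add(REACTION_UL, c_stock_m=5.0, c_final_m=0.25)

# Template-primer substrate (100 nM) and TGIRT-III enzyme (1 µM = 1000 nM).
substrate_pmol = picomoles(100.0, REACTION_UL)
enzyme_pmol = picomoles(1000.0, REACTION_UL)

print(f"5 M NaOH to add: {naoh_ul:.2f} µl")
print(f"Template-primer: {substrate_pmol:.0f} pmol; TGIRT-III: {enzyme_pmol:.0f} pmol")
```

On these assumptions, about 1.05 µl of 5 M NaOH terminates the reaction, and the enzyme is present at a 10-fold molar excess over the template-primer substrate.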

4.5.4 Bioinformatics*

* This work was done by Jun Yao in the Lambowitz lab and Dennis Wylie in the Bioinformatics Consulting Group

at the University of Texas at Austin.

Analysis of all RNA-seq datasets was done by using the TGIRT-seq mapping

pipeline as described previously for human plasma RNAs (Qin et al., 2016). First,

Illumina TruSeq DNA adapter and primer sequences were trimmed from the reads by

using cutadapt (Martin, 2011) (sequencing quality score cut-off at 20; p-value < 0.01),

and reads <18-nt after trimming were discarded. Reads were then mapped by using

Tophat v2.0.10 and Bowtie2 v2.1.0 (default settings) to the human genome reference

sequence (Ensembl GRCh38 Release 76) (Langmead and Salzberg, 2012; Kim et al.,

2013) supplemented with additional contigs encoding the 5S rRNA gene (2.2-kb 5S

rRNA repeats from the cluster on chromosome 1 (1q42); GenBank: X12811) and the

45S rRNA gene (43-kb 45S rRNA repeats containing 5.8S, 18S and 28S rRNA sequences

from clusters on chromosomes 13, 14, 15, 21, and 22; GenBank: U13369). Other

sequences used for mapping included the E. coli genome sequence (GenBank:

NC_000913) to remove any reads resulting from E. coli nucleic acids in enzyme

preparations. Unmapped reads from this first pass (Pass 1) were re-mapped to Ensembl

GRCh38 Release 76 by Bowtie2 with local alignment (default settings) to improve the

mapping rate for those reads that contain post-transcriptionally added nucleotides (e.g.,

CCA and poly(U)), untrimmed adapter sequences, and non-templated nucleotides added

to the 3’ end of the cDNAs by TGIRT enzymes (Pass 2). The mapped reads from Passes

1 and 2 were combined and filtered by mapping quality (MAPQ ≥15; p-value < 0.03),

and concordant read pairs were collected by using Samtools. The concordant read pairs

were then intersected with gene annotations (Ensembl GRCh38 Release 76) and piRNA

cluster annotations from piRNABank (Sai Lakshmi and Agrawal, 2008) to collect reads

that mapped uniquely in the annotated orientation to genomic features (genomic

coordinates for piRNAs were converted to Ensembl GRCh38 Release 76 coordinates

using scripts from the UCSC genome browser website). Coverage of each non-ribosomal

and non-mitochondrial feature was calculated by Bedtools. To improve the mapping rate

for tRNAs, mapped reads from Passes 1 and 2 were intersected with tRNA annotations

from the Genomic tRNA Database (Lowe and Eddy, 1997) to collect both uniquely and

multiply mapped tRNA reads. These were then combined with unmapped reads after

Pass 2 and mapped to the tRNA reference sequences (UCSC genome browser website)

using Bowtie2 local alignment with default settings. Because similar or identical tRNAs

with the same anticodon can be multiply mapped to different tRNA loci by Bowtie2,

mapped tRNA reads with MAPQ ≥1 were combined according to their tRNA anticodon

prior to calculating the tRNA distributions. Only those features with ten or more mapped

reads were counted.
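
Two of the steps above lend themselves to a short illustration: the relationship between Phred-scaled cut-offs (base quality 20, MAPQ 15) and the quoted p-values, and the pooling of tRNA reads by anticodon followed by the ten-read minimum. The sketch below is not the TGIRT-seq pipeline itself; the locus names follow a hypothetical "tRNA-<amino acid>-<anticodon>-<family>-<copy>" pattern and the counts are invented.

```python
from collections import Counter


def phred_to_error_prob(score: float) -> float:
    """Convert a Phred-scaled score (base quality or MAPQ) to an error probability."""
    return 10 ** (-score / 10)


def collapse_by_anticodon(locus_counts: dict, min_reads: int = 10) -> dict:
    """Sum read counts over tRNA loci sharing an anticodon; keep totals >= min_reads."""
    totals = Counter()
    for locus, n_reads in locus_counts.items():
        # Hypothetical locus name layout: tRNA-Gly-GCC-1-1
        _, amino_acid, anticodon = locus.split("-")[:3]
        totals[f"{amino_acid}-{anticodon}"] += n_reads
    return {key: n for key, n in totals.items() if n >= min_reads}


base_q20 = phred_to_error_prob(20)   # base-call quality cut-off
mapq15 = phred_to_error_prob(15)     # mapping quality cut-off

toy_counts = {"tRNA-Gly-GCC-1-1": 6, "tRNA-Gly-GCC-2-1": 7, "tRNA-Ala-AGC-1-1": 4}
collapsed = collapse_by_anticodon(toy_counts)
```

Here a quality score of 20 corresponds to an error probability of 0.01 and MAPQ 15 to about 0.03. In the toy data, neither Gly-GCC locus reaches ten reads alone, but the pooled anticodon group does, whereas Ala-AGC is dropped.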

For transcript expression analysis, RNA-seq datasets were normalized for the total

number of mapped reads by using DESeq (Anders and Huber, 2010) and plotted in R. To

assess correlation with survival, protein-coding genes that showed significant differences

(p < 0.05) in EV-RNA datasets obtained for healthy, SMM and AMM groups, were

intersected with a published microarray dataset (GSE24080) (Popovici et al., 2010; Shi et

al., 2010) to obtain their expression levels in CD138-selected plasma cells of 559 newly

diagnosed myeloma patients across 730 days. For each protein-coding gene, patients

were divided into three equal-size groups based on its expression level (low, middle, and

high), and event-free survival (EFS) was plotted as a function of time for

each patient group.
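
The grouping step can be sketched as follows: patients are ranked by the expression level of one gene, split into three equal-size groups, and the event-free fraction of each group is evaluated at a chosen day. This is a minimal illustration that ignores censoring and does not reproduce the published analysis; the patient identifiers, values, and function names are hypothetical.

```python
def split_into_tertiles(expression: dict):
    """Rank patients by expression and return (low, middle, high) lists of patient IDs."""
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    first_cut, second_cut = n // 3, 2 * n // 3
    return ranked[:first_cut], ranked[first_cut:second_cut], ranked[second_cut:]


def event_free_fraction(group, event_day: dict, day: int) -> float:
    """Fraction of the group with no event by `day` (None = no event during follow-up)."""
    free = sum(1 for p in group if event_day[p] is None or event_day[p] > day)
    return free / len(group)


# Toy data for six hypothetical patients.
expression = {"p1": 1.0, "p2": 2.0, "p3": 3.0, "p4": 4.0, "p5": 5.0, "p6": 6.0}
event_day = {"p1": 100, "p2": None, "p3": 400, "p4": None, "p5": None, "p6": 700}

low, middle, high = split_into_tertiles(expression)
efs_at_one_year = [event_free_fraction(g, event_day, 365) for g in (low, middle, high)]
```

Evaluating `event_free_fraction` over a range of days for each group gives the kind of curves shown in Fig. 4.5.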

Dataset                      H1    H2    H3    H4    SMM1  SMM2  SMM3  AMM1  AMM2  AMM3
Total reads (×10^6) (1)      34.0  42.0  41.7  31.2  28.6  11.8  17.4  3.3   19.7  61.2
Mapped to genome (%) (2)     91.3  93.0  90.2  90.8  91.7  86.1  79.9  67.5  84.2  93.7
Mapped to features (%),
excluding MT and rRNA (3)    26.9  21.0  28.0  32.8  7.3   31.3  36.2  31.8  16.8  4.7

(1) Total reads after trimming and filtering.

(2) Percentage of concordant or discordant paired-end reads that mapped uniquely or multiply to the human genome reference sequence.

(3) Percentage of concordant paired-end reads that mapped uniquely in the correct orientation to annotated non-ribosomal and non-mitochondrial features of the human genome reference sequence.

Table 4.1: Read statistics and mapping for RNA-seq of plasma EV-RNAs.

RNA-seq libraries were constructed from plasma EV-RNAs by using the TGIRT-seq total RNA method and sequenced

on an Illumina NextSeq instrument to obtain the indicated number of 75-nt paired-end reads. Each sample corresponds to

plasma EV-RNA (1.7-7.5 ng) obtained from de-identified healthy individuals (H), patients with smoldering (SMM) or active

(AMM) multiple myeloma. The reads were trimmed to remove adapter sequences and low quality base-calls (sequencing

quality score cut-off at 20 (p-value <0.01)), and reads <18-nt after trimming were discarded. Trimmed reads were filtered and

then mapped by using Tophat and Bowtie2 to a human genome reference sequence (Ensembl GRCh38 Release 76)

supplemented with additional rRNA gene contigs, as described in Materials and Methods.
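
The values in Table 4.1 can be summarized programmatically to recover the ranges quoted in the text and to flag the datasets with unusually low feature-mapping rates. The 10% cut-off used below is an arbitrary illustrative threshold, not one used in this work.

```python
# Table 4.1 values: (total reads x10^6, mapped to genome %, mapped to features %).
table_4_1 = {
    "H1": (34.0, 91.3, 26.9),
    "H2": (42.0, 93.0, 21.0),
    "H3": (41.7, 90.2, 28.0),
    "H4": (31.2, 90.8, 32.8),
    "SMM1": (28.6, 91.7, 7.3),
    "SMM2": (11.8, 86.1, 31.3),
    "SMM3": (17.4, 79.9, 36.2),
    "AMM1": (3.3, 67.5, 31.8),
    "AMM2": (19.7, 84.2, 16.8),
    "AMM3": (61.2, 93.7, 4.7),
}

total_reads = {name: row[0] for name, row in table_4_1.items()}
genome_pct = [row[1] for row in table_4_1.values()]

smallest = min(total_reads, key=total_reads.get)
largest = max(total_reads, key=total_reads.get)
genome_range = (min(genome_pct), max(genome_pct))

LOW_FEATURE_CUTOFF = 10.0  # illustrative threshold (assumption)
low_feature = [name for name, row in table_4_1.items() if row[2] < LOW_FEATURE_CUTOFF]

print(f"Read depth: {total_reads[smallest]} ({smallest}) to {total_reads[largest]} ({largest}) million")
print(f"Genome mapping: {genome_range[0]}-{genome_range[1]}%")
print(f"Low feature mapping: {low_feature}")
```

The flagged datasets are the two attributed in the text to mitochondrial contamination, and the smallest dataset is the one with the lowest RNA input.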

Figure 4.1: Bioanalyzer traces showing size profiles of plasma EV-RNAs.

Plasma EV-RNAs were prepared with an exoEasy Maxi Kit (QIAGEN),

concentrated with an RNA Clean & Concentrator Kit (Zymo Research), and a 1-µl

portion was analyzed with an RNA 6000 Pico Kit (mRNA assay) on a 2100 Bioanalyzer

(Agilent) to obtain the traces shown in the Figure. SMM, smoldering multiple myeloma;

AMM, active multiple myeloma.

[Figure 4.1 panels: Healthy, Smoldering, Active. Each trace is annotated with a peak at 40-60 nt; x-axis in seconds [s].]

Figure 4.2: Percentage of TGIRT-seq reads from EV-RNA datasets mapping to different

categories of genomic features.

Reads from individual RNA-seq datasets obtained for healthy individuals (H1-4),

patients with smoldering (SMM1-3) and active (AMM1-3) multiple myeloma were

combined, and mapped to genomic features as described in Materials and Methods. (A)

Stacked bar graphs showing the percentage of concordant read pairs that mapped

uniquely in the correct orientation to the indicated category of genomic features. Protein-

coding genes include immunoglobulin and T-cell receptor genes; lncRNAs include

lincRNAs, antisense RNAs and other lncRNAs. (B) Stacked bar graphs showing the

percentage of small ncRNA read pairs in (A) that mapped to different categories of small

ncRNA genes. In (A) and (B), the numbers next to each stacked bar segment indicate the

number of different genes for which transcripts were identified in that category. Only

features with ten or more mapped reads in the combined datasets were included.

Figure 4.3: Heatmap for sample-to-sample distance.

Reads from EV-RNA datasets obtained for healthy individuals (H1-4), patients

with smoldering (SMM1-3) and active (AMM1-3) multiple myeloma were normalized

and plotted in R. EV-RNA datasets were clustered based on Euclidean distance, which is

a measure of sample divergence, with larger values indicating greater variation in

transcript expression between two datasets.
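
For reference, the Euclidean distance between two datasets is the square root of the summed squared differences between their per-transcript values. The sketch below computes it for every pair of samples; the three-transcript profiles are invented, and any transformation applied to the real data before the distance calculation is not specified here.

```python
import math
from itertools import combinations


def pairwise_euclidean(profiles: dict) -> dict:
    """Euclidean distance for every pair of samples (same transcript order in each profile)."""
    return {
        (a, b): math.dist(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }


# Toy normalized expression profiles (hypothetical values for three transcripts).
toy_profiles = {
    "H1": [5.0, 2.0, 0.0],
    "SMM1": [5.0, 5.0, 4.0],
    "AMM1": [5.0, 8.0, 8.0],
}
distances = pairwise_euclidean(toy_profiles)
```

In this toy case the healthy and active samples are farthest apart, with the smoldering sample in between, which is the kind of pattern the heatmap is meant to reveal.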

Figure 4.4: Transcript expressions in plasma EVs.

(A) Scatter plots comparing average levels of RNA species detected by TGIRT-

seq in plasma EVs prepared using a kit from 3 patients with smoldering multiple

myeloma (SMM; left), 3 patients with active multiple myeloma (AMM; right),

and 4 healthy individuals (H). Read counts in RNA-seq datasets for different individuals

were normalized with DESeq, averaged, and log2 scaled with an offset of 1. Selected

protein-coding genes whose levels in plasma vesicles correlated positively or negatively

with survival based on data in previous microarray studies (GSE24080) (Popovici et al.,

2010; Shi et al., 2010) are highlighted in red and blue, respectively. Examples of small

ncRNAs identified by TGIRT-seq as potential biomarkers subject to RT-qPCR validation

are in green. YF is a 5’ Y RNA fragment whose abundance appears strongly correlated

with myeloma. (B) Box plots comparing average levels of selected protein-coding genes

(first row) and small ncRNAs (bottom row) in (A).
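
The normalization and scaling described in this legend can be approximated as follows. The sketch implements the median-of-ratios size-factor idea on which DESeq is based, in plain Python, followed by averaging within a group and log2 scaling with an offset of 1. It is a simplified stand-in for the DESeq package, with invented counts and hypothetical function names.

```python
import math
from statistics import median


def size_factors(counts: dict) -> dict:
    """Median-of-ratios size factors; genes with a zero count in any sample are skipped."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def group_log2_mean(counts: dict, factors: dict, group: list) -> list:
    """Average normalized counts over a group of samples, then apply log2(x + 1)."""
    n_genes = len(counts[group[0]])
    means = [
        sum(counts[s][g] / factors[s] for s in group) / len(group)
        for g in range(n_genes)
    ]
    return [math.log2(m + 1) for m in means]


# Toy counts: sample B was sequenced twice as deeply as sample A.
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(toy_counts)
scaled = group_log2_mean(toy_counts, factors, ["A", "B"])
```

After normalization the first three transcripts have identical levels in the two toy samples, so the twofold difference in sequencing depth no longer appears as a difference in expression.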

Figure 4.5: Survival curves.

Patients from a previous microarray study (GSE24080) (Popovici et al., 2010; Shi

et al., 2010) were divided into three equal-size groups based on the expression level

(blue, low; grey, middle; red, high) of each selected protein-coding gene, which was

significantly (p < 0.05) down- (top row) or up- (bottom row) regulated in plasma EVs

(see Fig. 4.4). Event-free survival (EFS) was plotted as a function of time,

with the numbers on the x-axis representing the percentage of a 730-day study period.

Chapter 5: Mapping RNA secondary structures and

RNA-protein interaction sites

5.1 OVERVIEW OF SHAPE AND CRAC

All cellular RNAs must fold into specific structures and/or interact with proteins

in order to fulfill their biological functions. In recent years, powerful new methods have

been developed for studying RNA folding and RNA-protein interactions (Weeks and

Mauger, 2011; Ule et al., 2005; Granneman et al., 2009; König et al., 2010; Zarnack et

al., 2013).

Selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) is a

quantitative method, which probes RNA structure in a simple two-step process: (i) RNA

modification and (ii) primer extension (Weeks and Mauger, 2011). In the first step,

exposure of RNA molecules to the electrophilic SHAPE reagent results in flexible

nucleotides, often located in single-stranded regions, being preferentially acylated at their

2’-hydroxyl group (denoted 2’-O-adducts). In the second step, the modified nucleotide

residues are mapped by primer extension. Because RNA 2’-O-adducts create stops in

reverse transcription one nucleotide prior to the RNA modification, the length of the

cDNA maps the position of a modification on the RNA, revealing flexible or single-stranded

regions of the RNA at single-nucleotide resolution. Time-resolved SHAPE using fast SHAPE

reagents is also very powerful for mapping changes in RNA structure and RNA-protein

interaction sites during protein-assisted RNA folding (Weeks and Mauger, 2011).

For the mapping of direct binding sites between RNA and protein in vitro or in

vivo, cross-linking and analysis of cDNAs (CRAC) or cross-linking and

immunoprecipitation (CLIP) are often used to covalently link the protein of interest to the

target RNA (Ule et al., 2005; Granneman et al., 2009; König et al., 2010; Zarnack et al.,

2013). The isolated RNA-protein complexes are digested with RNase followed by

protease, leaving RNA fragments with one or more amino acid residues cross-linked to

the nucleotides at the site of protein contact. In current versions of the methods, RNA

ligase is used to attach adaptor sequences containing PCR-primer binding sites to the

RNA fragments, which are then reverse transcribed to cDNAs, PCR amplified, and

sequenced. The binding sites between RNA and protein are identified by reverse

transcription stops at the cross-linked sites during cDNA synthesis (König et al., 2010).

5.2 PROTEIN-ASSISTED GROUP II INTRON SPLICING

Studies of protein-assisted group II intron splicing are important for

understanding eukaryotic gene structure, expression, and evolution. A substantial fraction

(~25%) of the human genome is comprised of spliceosomal introns, whose evolutionary

ancestors are group II introns, and their splicing plays important roles in both gene

expression and regulation (Black, 2000, 2003; Lambowitz and Belfort, 2015). Defective

splicing has been implicated in numerous human diseases including cancer (Faustino and

Cooper, 2003; Karni et al., 2007). Studies of group II intron splicing can also provide

insights into mechanisms of RNA catalysis, which underlies critical cellular processes

including pre-messenger RNA (pre-mRNA) splicing, transfer RNA (tRNA) processing

and translation (Lambowitz and Zimmerly, 2011; Lambowitz and Belfort, 2015). RNA

splicing and catalysis are also involved in propagation and replication of human

pathogens, such as HIV-1 and hepatitis virus (Been and Wickham, 1997; Stoltzfus,

2009).

Group II introns have a conserved secondary structure consisting of six domains

(DI-DVI), which interact via tertiary contacts to form the core and the peripheral RNA

structures that are crucial for splicing (Lambowitz and Zimmerly, 2011). Although group

II introns are ribozymes that catalyze their own splicing, their efficient splicing under

physiological conditions requires the binding of the intron-encoded protein (IEP) to

promote formation of active RNA structures (Lambowitz and Zimmerly, 2011). Using

the Ll.LtrB intron, a group IIA intron found in Lactococcus lactis, and its IEP, denoted

the LtrA protein, as a model system, our laboratory discovered that the IEP has high-

affinity binding sites in intron subdomain DIVa and weaker binding sites across

contiguous regions of DI, II and VI, suggesting LtrA promotes intron splicing by

stabilizing the interactions of these RNA domains (Matsuura et al., 1997; Saldanha et al.,

1999; Matsuura et al., 2001; Dai et al., 2008). However, there is no information about

RNA conformational changes at different steps of splicing, such as exon binding and the

release of branch point adenosine, and how the interactions between intron RNA and IEP

facilitate such processes. Additionally, there is a lack of information about protein-

assisted splicing of group IIB and IIC introns, which have distinctive differences in the

active site and peripheral structures from group IIA introns (Lambowitz and Zimmerly,

2011). For example, group IIA introns position the 5’ and 3’ exons at the active site by

base-pairing of exon-binding sites 1, 2 (EBS1 and EBS2, respectively) and δ from DI to

intron-binding sites 1, 2 (IBS1 and IBS2, respectively) and δ’ in the flanking 5’ and 3’

exons. In contrast, there is no EBS2 sequence in group IIC introns, which instead use only

two interactions (IBS1/EBS1 and IBS3/EBS3) and may also recognize a stem-loop

derived from a transcription terminator or attC site in the 5′ exon (Toor et al., 2006;

Robart et al., 2007; Lambowitz and Zimmerly, 2011).

Here I focused on a small group IIC intron found in the thermophile Geobacillus

stearothermophilus (denoted GsI-IIC). The GsI-IIC intron is closely related to a group

IIC intron from Oceanobacillus iheyensis (about 50% identity in the catalytic RNA part

of the intron), whose crystal structure has been determined (Toor et al., 2008). To map

the secondary structure of GsI-IIC intron RNA and to study the mechanism of protein-

assisted intron splicing, I employed the SHAPE and CRAC methods with thermostable

group II intron reverse transcriptases (TGIRTs) replacing retroviral RTs and RNA

ligation used in both methods. The higher thermostability and processivity of TGIRT

enzymes can potentially address some of the limitations of using retroviral RTs in the

primer extension step used in SHAPE. One major issue of the retroviral RTs is that they

fall off at stable RNA secondary structures during cDNA synthesis, resulting in

premature stops that produce high background noise in SHAPE and other RNA-structure

mapping methods. Such premature stops will be minimized by TGIRT enzymes.

Additionally, TGIRT enzymes can attach RNA-seq adapter sequences during cDNA

synthesis via template-switching, thereby eliminating the use of RNA ligase, which has

sequence bias (Linsen et al., 2009; Levin et al., 2010; Lamm et al., 2011), is time-

consuming, and results in loss of material. The latter is a major challenge for methods

such as CRAC and CLIP, which are limited by the amount of starting material for RNA-

seq library construction.

5.2.1 Determination of optimal exon length and protein concentration for in vitro

splicing of the GsI-IIC intron.

To determine the optimal length of the flanking 5’ exon required for protein-

assisted GsI-IIC intron splicing, precursor RNAs containing an IEP-ORF-deleted (ΔORF)

intron (656-nt), flanking 3’ exon (32-nt), and different flanking 5’ exons (55-nt, 46-nt,

and 35-nt; denoted GsI2c5532, 4632 and 3532, respectively), were constructed and used

for intron transcription and splicing.

The GsI-IIC intron RNA was transcribed in vitro from each construct using a

mutant T7 polymerase, which does not pause or terminate at a variety of signals,

including a terminator found fortuitously in the human preproparathyroid hormone (PTH)

gene, a pause site found in the concatamer junction (CJ) of replicating T7 DNA, and

termination signals that are also utilized by Escherichia coli RNAP (e.g. rrnB T2)

(Lyakhov et al., 1997). To suppress hydrolytic splicing, in which water rather than the 2’-

hydroxyl group (2’-OH) of the branch-site nucleotide, is used as the nucleophile in the

first transesterification step (van der Veen et al., 1987; Jarrell et al., 1988), 4 mM dTTP

was added to the transcription reaction to sequester extra magnesium ions.

The in vitro splicing reactions for GsI2c5532, 4632 and 3532 were done in

reaction medium containing 450 mM KCl and 5 mM Mg2+ at 50°C, and were initiated by

adding a two-fold molar excess of the IEP, which was expressed in Escherichia coli and

purified with high yield and activity as a fusion protein with a non-cleavable maltose-

binding protein (MalE) attached to the N-terminus of the protein via a rigid linker

(denoted GsI-IIC-MRF) (Mohr et al., 2013). To examine the splicing activity of each

GsI-IIC precursor RNA, the splicing reaction was quenched at different time-points over

a course of one hour, and was analyzed by electrophoresis in a denaturing 4% acrylamide

gel.

Figure 5.1 shows that at low Mg2+ concentration (5 mM), GsI-IIC intron splicing

occurred via lariat formation and was strictly dependent upon the addition of the IEP for

all precursor RNAs. Due to inefficient splicing of GsI2c5532 and 4632 (Fig. 5.1A,B),

GsI2c3532 (Fig. 5.1C), which has a 35-nt flanking 5’ exon including the complete hairpin

structure, was used for further biochemical characterization of the protein-assisted

splicing.

To determine the optimal concentration of IEP required for protein-assisted GsI-IIC

intron splicing, one-, two-, and five-fold molar excess of IEP, were examined in a time-

course splicing reaction using GsI2c3532 (Fig. 5.2A). The percentage of lariat formation

for each IEP-to-RNA molar ratio was quantitated using ImageQuant TL (GE

Healthcare), and the time courses were fit to a two-exponential equation using

SigmaPlot (Systat Software, Inc.). Figure 5.2B shows that the protein-assisted splicing of

GsI-IIC intron was biphasic with an initial fast phase followed by a slow phase.

Interestingly, optimal splicing activity of the GsI-IIC intron occurred at a 1:1 molar ratio

between IEP and RNA (fast phase, 5.7/min and slow phase, 0.14/min), which is lower

than the optimal molar ratio 2:1 used for in vitro protein-assisted splicing of the Ll.LtrB

group IIA intron (Saldanha et al., 1999; Matsuura et al., 2001; Rambo and Doudna,

2004). This finding may reflect that the IEP of GsI-IIC intron, GsI-IIC-MRF, functions in

splicing as a monomer rather than as a dimer, which is thought to be the case for the IEP

of Ll.LtrB group IIA intron, LtrA (Saldanha et al., 1999). This difference could reflect

different active site and peripheral structures of group IIA and IIC introns, which enable

IIC introns to be efficiently spliced by an IEP monomer. Alternatively, this

difference could be explained by a higher proportion of inactive protein in the LtrA

preparations.
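The biphasic time course described above can be illustrated with a minimal sketch. It assumes one common form of a two-exponential equation, the sum of two saturating exponentials; the exact parameterization used in SigmaPlot is not stated here. The rate constants are those reported for the 1:1 IEP-to-RNA ratio, whereas the amplitudes (`AMP_FAST`, `AMP_SLOW`) and the function name are hypothetical placeholders, not measured values.

```python
import math

def biphasic_fraction(t_min, amp_fast, k_fast, amp_slow, k_slow):
    """Fraction of precursor converted to lariat after t_min minutes.

    Assumed model: sum of two saturating exponentials (fast + slow phase).
    """
    fast = amp_fast * (1.0 - math.exp(-k_fast * t_min))
    slow = amp_slow * (1.0 - math.exp(-k_slow * t_min))
    return fast + slow

# Rate constants reported in the text for the 1:1 IEP-to-RNA ratio.
K_FAST = 5.7   # per min
K_SLOW = 0.14  # per min

# Amplitudes are NOT given in the text; these are illustrative placeholders.
AMP_FAST = 0.5
AMP_SLOW = 0.3

if __name__ == "__main__":
    for t in (0, 0.5, 1, 5, 20, 60):
        pct = 100.0 * biphasic_fraction(t, AMP_FAST, K_FAST, AMP_SLOW, K_SLOW)
        print(f"{t:>5} min  {pct:5.1f}% lariat")
```

With these rate constants the fast phase is essentially complete within about one minute, while the slow phase continues over tens of minutes, which is why the time course has to be sampled densely at early time points.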

5.2.2 RNA-structure mapping of the GsI-IIC intron via TGIRT-SHAPE*

*This work was done in collaboration with Jacob Grohman in the Lambowitz Lab.

For mapping the RNA structure by TGIRT-SHAPE, we used a 722-nt in vitro

transcript corresponding to GsI2c3532 but with deletion of the branch-point adenosine

(denoted GsI2c3532ΔA) to trap the intron in a pre-catalytic state prior to lariat formation

without affecting IEP binding (Matsuura et al., 2001). The transcript was incubated under

splicing conditions to allow proper folding, and then modified with the SHAPE reagent

isatoic anhydride (IA; Sigma-Aldrich) under conditions that give an average of one

modification per RNA molecule. The modified intron RNA was then reverse transcribed

from a fluorescently labeled DNA primer annealed at its 3’ end by the thermostable group

II intron RT TeI4c-MRF or by SuperScript III (SSIII; Thermo Fisher Scientific). SHAPE

modifications were identified by capillary electrophoresis as reverse transcription stops.

Figure 5.3A shows a plot of SHAPE reactivity determined by TeI4c-MRF as a

function of nucleotide position in the intron RNA, with high reactivity representing

flexible or single-stranded regions and low reactivity representing inflexible or base-

paired regions. The high processivity of TeI4c-MRF RT allowed mapping of the entire

722-nt intron RNA at single-nucleotide resolution by using a single primer annealed at its

3’ end (Fig. 5.3A). By using RNAstructure (Reuter and Mathews, 2010) software with

SHAPE reactivities as constraints, we obtained a secondary structure of the GsI-IIC

intron RNA in agreement with that predicted based on phylogenetic analysis (Fig. 5.3B).

Figure 5.3B also shows a stem-loop region from DIII of the intron RNA in which

TGIRT-SHAPE indicates formation of a short stem containing A-U and G-C base pairs

that appear unpaired when SSIII RT is used, due to high background noise caused by

premature reverse transcription stops of SSIII RT.
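As a hedged illustration of how SHAPE reactivities can enter structure prediction, the expression below is the pseudo-free-energy term commonly used with RNAstructure. It is given as an assumption about the general approach: the slope and intercept values used in this work are not stated in the text, so they are left symbolic.

```latex
% Pseudo-free-energy change applied to nucleotide i when it is in a base-paired stack.
% r_i : normalized SHAPE reactivity of nucleotide i
% m   : slope parameter (positive; penalizes pairing of reactive nucleotides)
% b   : intercept parameter (negative; favors pairing of unreactive nucleotides)
\Delta G_{\mathrm{SHAPE}}(i) = m \, \ln\left( r_i + 1 \right) + b
```

In this formulation a nucleotide with zero reactivity receives the bonus b, and the penalty for pairing grows logarithmically with reactivity.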

5.2.3 Mapping of RNA-protein contact sites by TGIRT-CRAC*

*This work was done in collaboration with David Sidote in the Lambowitz Lab.

Next, we used TGIRT-CRAC to map direct binding sites between the intron RNA

and its IEP during splicing. Figure 5.4A outlines the methods for TGIRT-CRAC. The in

vitro transcribed GsI2c3532ΔA was incubated in the presence or absence of the IEP under

splicing conditions, and then irradiated on ice by an ultraviolet (UV) lamp (Spectroline)

at 254 nm emission (UV-C). The cross-linked ribonucleoprotein complexes (RNPs) were

digested by RNase T1 (Thermo Fisher Scientific) at low or high concentration, followed

by RNase inactivation using SUPERaseIn (Thermo Fisher Scientific), and then treated

with [γ-32P]-ATP and T4 polynucleotide kinase (Epicentre) for 5’-labeling and 3’-

dephosphorylation of the RNA fragments. The 32P-labeled RNase-digested RNPs were

analyzed by SDS-PAGE followed by nitrocellulose membrane transfer. An

autoradiogram of the membrane (Fig. 5.4B) showed labeled bands with a molecular

weight (MW) higher than 80 kDa (the MW of the IEP alone), which was indicated on the

autoradiogram using an 80-kDa marker from an unlabeled protein ladder. RNA fragments

of the GsI-IIC intron RNA were released from the membrane by digestion with proteinase

K (Thermo Fisher Scientific) in the presence of 7 M urea, and ethanol-precipitated. The

purified RNA fragments were used for RNA-seq library construction by using the

TGIRT-seq small RNA/CircLigase method to attach RNA-seq adapter sequences via

template-switching during cDNA synthesis without the use of RNA ligase. Samples were

sequenced on an Illumina MiSeq instrument.

Sequencing analysis of the reads that mapped to the GsI-IIC intron RNA with

distinctive 5’ ends (Fig. 5.4C), which represent positions of the cross-linking sites,

revealed a number of nucleotides potentially involved in direct interactions with the IEP

in DI, DIV and DVI. These regions have been previously shown to be involved in IEP

binding in the Ll.LtrB group IIA intron (Dai et al., 2008). The cross-linking sites

identified here include ε, γ, EBS1 and EBS3, and DIVa, which is a known high-affinity IEP-

binding site in the Ll.LtrB intron (Wank et al., 1999; Matsuura et al., 2001), and the

guanosine opposite to the branch-point adenosine (Fig. 5.4D).
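The idea of calling candidate cross-linking sites from distinctive read 5’ ends can be sketched as follows. This is a simplified illustration, not the analysis pipeline used here: the function names, the toy read-start positions, and the `min_fraction` threshold are all hypothetical.

```python
from collections import Counter

def count_read_starts(read_starts, rna_length):
    """Return the number of reads whose 5' end maps to each position (1-based)."""
    counts = Counter(read_starts)
    return [counts.get(pos, 0) for pos in range(1, rna_length + 1)]

def candidate_sites(start_counts, min_fraction=0.05):
    """Positions whose share of all read starts is at least min_fraction.

    min_fraction is an arbitrary illustrative threshold, not a value from the text.
    """
    total = sum(start_counts)
    if total == 0:
        return []
    return [pos for pos, n in enumerate(start_counts, start=1)
            if n / total >= min_fraction]

if __name__ == "__main__":
    # Toy data: 5' end positions of 16 mapped reads on a 50-nt RNA (hypothetical).
    toy_starts = [10] * 8 + [11] + [25] * 6 + [40]
    profile = count_read_starts(toy_starts, rna_length=50)
    print(candidate_sites(profile, min_fraction=0.25))
```

Because UV can also induce RNA-RNA cross-links, a comparison of the same profile from samples irradiated without the IEP would be needed before interpreting any position as protein-dependent.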

5.3 DISCUSSION

Here, I established a protein-assisted in vitro splicing system for the group IIC

intron GsI-IIC, and analyzed its secondary structure and interaction sites with the IEP.

Moreover, I demonstrated the usefulness of TGIRT enzymes in the SHAPE and CRAC

procedures used for mapping RNA secondary structures and RNA-protein interactions.

By using TGIRT-SHAPE, I mapped the secondary structure of a 722-nt highly structured

GsI-IIC intron RNA at single-nucleotide resolution using a single primer annealed to its

3’ end. The secondary structure of the GsI-IIC intron RNA obtained by TGIRT-SHAPE

is consistent with that predicted based on phylogenetic studies, suggesting not only

efficiency, but also accuracy of the method. Furthermore, by using TGIRT-CRAC, I

showed that potential interaction sites between the GsI-IIC intron RNA and its IEP reside

in DI, DIV, and DVI regions, which are known for IEP binding in group IIA introns

(Wank et al., 1999; Matsuura et al., 2001; Dai et al., 2008). Most of the identified

nucleotides are involved in making long-range RNA tertiary contacts, suggesting the IEP

functions to facilitate formation of active intron RNA structures during splicing. Since

UV also induces RNA-RNA cross-linking, further investigation will be conducted to

show protein-dependent enrichment of the cross-linked sites identified here.

The use of the TGIRT-seq small RNA/CircLigase method in CRAC allows

construction of RNA-seq libraries without RNA ligation, eliminates two steps from

the original protocol, and greatly reduces RNA sample loss, which is critical in

procedures like CRAC and CLIP. Finally, the new TGIRT-seq total RNA method (see

Chapter 3) will further improve the speed and efficiency of CRAC and CLIP procedures,

facilitating the identification of global RNA-protein interactions in vivo. I am currently

collaborating with Dr. Robert Krug’s research group at the University of Texas at Austin

to apply these methods to identify RNAs bound by the influenza virus NS1A and NS1B

proteins, and with Dr. Michael Gale, Jr.’s research group at the University of Washington

to identify pathogen-associated molecular patterns (PAMPs) of the MDA5 protein. Both

projects will contribute new insights to the fundamental understanding of our innate

immunity against viral infection and virus-host interactions.

5.4 MATERIALS AND METHODS

5.4.1 Recombinant plasmids

Recombinant plasmids used for in vitro transcription contain GsI-IIC intron

precursor RNA cloned downstream of a phage T7 promoter and upstream of a BamHI

recognition site in a pUC19 vector (New England BioLabs). I made three

constructs that express different GsI-IIC intron precursor RNAs. They comprised

the same ΔORF 656-nt intron and 32-nt 3’ exon, but different 5’ exons, 55-nt for

GsI2c5532, 46-nt for GsI2c4632 and 35-nt for GsI2c3532. Plasmid GsI2c3532ΔA differs

from GsI2c3532 by a single branch-point adenosine deletion in the intron.

5.4.2 Preparation of GsI-IIC intron RNA and IEP

GsI-IIC intron RNA was transcribed in vitro from 1 µg of recombinant plasmid,

which was linearized by BamHI (New England BioLabs), using a mutant T7 polymerase

that does not pause or terminate at a variety of signals, including a terminator found

fortuitously in the human preproparathyroid hormone (PTH) gene, a pause site found in

the concatamer junction (CJ) of replicating T7 DNA, and termination signals that are also

utilized by Escherichia coli RNAP (e.g. rrnB T2) (Lyakhov et al., 1997). Transcription

was done at 37°C for 2 h in reaction medium containing 40 mM Tris-HCl (pH 7.9), 10

mM DTT, 2 mM spermidine, 6 mM MgCl2, 1 mM GTP, 1 mM CTP, 1 mM ATP, 1 mM

UTP, and 4 mM dTTP to sequester extra Mg2+, which favors hydrolytic splicing.

Transcripts were treated with 2 units of DNase I (New England BioLabs) at 37°C for 10

min according to the manufacturer’s protocol, extracted with phenol-chloroform-isoamyl

alcohol (25:24:1), and purified on a Sephadex G-50 column (Sigma-Aldrich). The

TGIRT enzymes, GsI-IIC-MRF used for splicing, and TeI4c-MRF used for primer

extension and TGIRT-seq, were expressed and purified as described previously (Mohr et

al., 2013).

5.4.3 GsI-IIC intron splicing

For time-course RNA splicing reactions in vitro, 30 nM precursor RNA internally

labeled with [α-32P]-UTP was denatured by heating at 85°C for 2 min in double-distilled

H2O (ddH2O) and then renatured at 50°C for 2 min in reaction medium containing 450

mM KCl, 20 mM Tris-HCl (pH 7.5), and 5 mM MgCl2. To measure splicing rates under

different IEP concentrations, splicing reactions were initiated by adding 0, 30, 60, or 150

nM GsI-IIC-MRF, incubated at 50°C and then terminated at different time points by

adding stop solution containing 0.25 M EDTA and 0.2% SDS. The splicing products

were analyzed in a denaturing 4% polyacrylamide gel, which was scanned with a

Phosphorimager (GE Healthcare). Band intensities were quantified by using ImageQuant

TL (GE Healthcare) and plotted using SigmaPlot (Systat Software Inc).

5.4.4 TGIRT-SHAPE

The in vitro transcript GsI2c3532ΔA incubated under splicing conditions (2 pmol

in 9 µl of splicing buffer) was added to 1 µl of freshly prepared isatoic anhydride (50 mM

in DMSO; Sigma-Aldrich) or 1 µl of DMSO as a negative control. The RNA was

incubated at 37°C for 36 min (~5 half-lives for isatoic anhydride) and ethanol

precipitated (3 volumes of ethanol, one-tenth volume of 3 M sodium acetate, pH 5.2, and

1 µl of 20 mg/ml glycogen). Alternatively, if available, 1-methyl-7-nitroisatoic anhydride

(1M7) SHAPE reagent can be used to capture faster 2’-OH conformational dynamics (t1/2

~20 s; incubation at 37°C for 3 min) (Weeks and Mauger, 2011). Primer extension of the

SHAPE-modified or control RNAs was carried out using a fluorescently labeled primer A

(5’-/Cy5/CAT ACA ACG CCT TTT TCT CTC CAG G-3’; IDT), which anneals near the

3’ end of the RNA. The annealed template-primer substrate was pre-incubated with 2 µM

TeI4c-MRF RT at room temperature for 30 min in 28.2 µl of reaction medium containing

450 mM NaCl, 20 mM Tris-HCl (pH 7.5), 5 mM MgCl2, and 5 mM fresh DTT. Reverse

transcription reactions were initiated by adding 1.8 µl of 25 mM dNTPs (final

concentration 1.5 mM) and incubated at 60°C for 1 h. Reverse transcription using

SuperScript III (Invitrogen) was done in parallel according to the manufacturer’s

protocol. Reactions were stopped by adding 1 µl of 5 M NaOH to a final concentration of

0.1 M, incubating at 95˚C for 3 min, and neutralizing with an equal volume of 5 M HCl.

The resulting cDNAs were then ethanol precipitated, as described above for the GsI2c3532ΔA

in vitro transcript, and dissolved in 40 µl of Hi-Di formamide or other capillary

electrophoresis instrument-specific loading solution. Sequencing reactions were

performed using TeI4c-MRF or SSIII RTs following methods described above, except

unmodified RNA was used as a template, and a Cy5.5-labeled primer B (IDT) of

identical sequence to primer A and 1.5 mM ddCTP were added to the reaction. Cy5-

labeled cDNAs synthesized from SHAPE-modified RNA or control RNA were mixed

with Cy5.5-labeled cDNAs from sequencing reactions and electrophoresed in a single

capillary of a GenomeLab™ GeXP Genetic Analysis System (Beckman Coulter).

Samples were denatured at 90˚C for 180 sec, injected into the capillary array at 2.0 kV

for 30 sec, and separated at 4.8 kV for 80 min. The temperature of the capillary array was

maintained at 60˚C throughout the separation. The raw capillary electrophoresis data

were analyzed by automated QuSHAPE software (Karabiber et al., 2013). SHAPE

reactivities were then used as constraints in RNAstructure software (Reuter and

Mathews, 2010) to predict the secondary structure

of GsI-IIC intron RNA.
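Two of the numbers quoted in this protocol can be checked with simple arithmetic: the final dNTP concentration after adding 1.8 µl of 25 mM stock to a 28.2 µl pre-incubation, and the fraction of SHAPE reagent expected to remain after the stated incubation times. The sketch below assumes simple first-order decay of the reagent and uses hypothetical function names; the isatoic anhydride half-life is inferred from the statement that 36 min corresponds to about 5 half-lives.

```python
def final_concentration(stock_conc, stock_volume, other_volume):
    """Concentration after mixing stock_volume of stock into other_volume (same units)."""
    return stock_conc * stock_volume / (stock_volume + other_volume)

def fraction_remaining(elapsed, half_life):
    """Fraction of reagent left after `elapsed`, assuming first-order decay."""
    return 0.5 ** (elapsed / half_life)

# dNTPs: 1.8 ul of 25 mM stock added to a 28.2 ul pre-incubation mix.
dntp_mM = final_concentration(25.0, 1.8, 28.2)

# Isatoic anhydride: 36 min is described as ~5 half-lives (implied t1/2 = 7.2 min).
ia_left = fraction_remaining(36.0, 36.0 / 5)

# 1M7: t1/2 of ~20 s and a 3 min (180 s) incubation, as quoted in the text.
m7_left = fraction_remaining(180.0, 20.0)

if __name__ == "__main__":
    print(f"final dNTP concentration: {dntp_mM:.2f} mM")
    print(f"isatoic anhydride remaining after 36 min: {100 * ia_left:.1f}%")
    print(f"1M7 remaining after 3 min: {100 * m7_left:.2f}%")
```

In both cases only a few percent or less of the reagent should remain at the end of the incubation, so the modification reaction is effectively complete.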

5.4.5 TGIRT-CRAC

The in vitro transcript GsI2c3532ΔA incubated under splicing conditions (500 nM

in 100 µl of splicing buffer) in the absence or presence of its IEP GsI-IIC-MRF (1 µM)

was irradiated on ice by a Spectroline ultraviolet (UV) lamp at 254 nm emission (UV-C)

for 10 min. The cross-linked RNA-protein complexes (50 µl) were digested with RNase

T1 (final concentration 0.08 U/µL or 4 U/µL; Thermo Fisher Scientific) at 37°C for 30

min, followed by incubation with 1 U/µL SUPERaseIn (Thermo Fisher Scientific) at

37°C for 3 h to inactivate the RNases. The digested products were treated with 0.5 U/µL

T4 polynucleotide kinase (New England BioLabs) and 100 µCi γ-32P-ATP at 37°C for

15 min. The radiolabeled RNA-protein complexes were analyzed in a NuPAGE 4-12%

Bis-Tris gel (Thermo Fisher Scientific), transferred to a 100% nitrocellulose membrane

(Invitrogen), exposed overnight and scanned by a Phosphorimager (GE Healthcare).

Fragments of the GsI-IIC intron RNA were released from the cross-linked RNA-protein

complex by digesting GsI-IIC-MRF with 4 mg/L proteinase K (Thermo Fisher Scientific)

and 7 M urea, and ethanol-precipitated using the published protocol (Ule et al., 2005).

The purified RNA fragments were used for RNA-seq library construction by the

TGIRT-seq small RNA/CircLigase method. Template-switching reactions were done

using an initial template-primer substrate consisting of a 41-nt RNA oligonucleotide (5'-

AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC/3SpC3/-

3'), which contains both the Illumina Read 1 and 2 primer-binding sites (Read 1,2 RNA)

and a 3' blocking group (C3 Spacer, 3SpC3; IDT), annealed to a complementary 32P-

labeled DNA primer, which leaves an equimolar mixture of A, C, G, and T overhangs. For

reverse transcription reactions, the initial template-primer substrate (100 nM) was mixed

with 10 µL of cross-linked RNA fragments and 2 µM TeI4c-MRF RT in reaction

medium containing 450 mM NaCl, 5 mM MgCl2, 20 mM Tris-HCl pH 7.5, 1 mM DTT

and 1 mM dNTPs at room temperature. The reactions were initiated by raising the

temperature to 60°C, incubated for 15 min and terminated by adding 1 M NaOH to a final

concentration of 0.1 M, incubating at 95°C for 3 min, and neutralizing with 1 M HCl. The

resulting cDNAs were purified in a denaturing 10% polyacrylamide gel, electroeluted

using a D-tube Dialyzer Maxi with MWCO of 6-8 kDa (EMD Millipore), and ethanol

precipitated with 0.3 M sodium acetate in the presence of 25 µg of linear acrylamide

(Thermo Fisher Scientific). The purified cDNAs were then circularized with CircLigase

II (Epicentre), extracted with phenol-chloroform-isoamyl alcohol (25:24:1), ethanol

precipitated, amplified with Phusion-HF (Thermo Fisher Scientific) and Illumina

multiplex and barcode primers for 15 cycles of 98°C for 5 sec, 60°C for 10 sec and 72°C

for 15 sec, and sequenced on an Illumina MiSeq instrument.
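The template-primer substrate described above can be checked with a short sketch that confirms the stated 41-nt length of the RNA oligonucleotide and derives its DNA complement. The variable and function names are hypothetical, and the representation of the overhang as one extra 3' nucleotide appended to the complementary primer is an assumption made for illustration; the text states only that the primer leaves an equimolar mixture of A, C, G, and T overhangs.

```python
# RNA oligonucleotide sequence as given in the text (3' blocking group omitted).
READ12_RNA = "AGA UCG GAA GAG CAC ACG UCU AGU UCU ACA GUC CGA CGA UC"

RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def strip_spaces(seq):
    """Remove the triplet spacing used for readability."""
    return seq.replace(" ", "")

def complementary_dna(rna):
    """DNA strand complementary to an RNA sequence, written 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna))

rna = strip_spaces(READ12_RNA)
primer_core = complementary_dna(rna)

# ASSUMPTION: the overhang is modeled as one extra 3' nucleotide on the primer,
# giving four primer species that would be mixed in equimolar amounts.
primer_pool = [primer_core + overhang for overhang in "ACGT"]

if __name__ == "__main__":
    print(len(rna), "nt RNA oligonucleotide")
    for primer in primer_pool:
        print(primer)
```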

Figure 5.1: Determining the optimal exon length for in vitro splicing of the GsI-IIC

intron.

The splicing reactions using 30 nM (A) GsI2c5532, (B) GsI2c4632 and (C)

GsI2c3532 precursor RNAs with 5’ exon length of 55 nt, 46 nt and 35 nt, respectively,

were done in reaction medium containing 450 mM KCl, 20 mM Tris-HCl (pH 7.5), and 5

mM Mg2+ at 50°C in the absence or presence of 60 nM GsI-IIC-MRF. At variable time-

points, 10 µl of the splicing reaction was withdrawn, terminated with 0.25 M EDTA and

0.2% SDS, and analyzed in a denaturing 4% acrylamide gel. The gel was dried and

scanned with a Phosphorimager (GE Healthcare).

Figure 5.2: Determining the optimal IEP concentration for in vitro splicing of the GsI-IIC

intron.

(A) Splicing of GsI2c3532. A time course of GsI2c3532 splicing was done in the

absence or presence of one- (1X), two- (2X) or five-fold (5X) molar excess of IEP as

described in Figure 5.1. (B) Plot showing percentage of lariat formation as a function of

time. The protein-assisted splicing of GsI-IIC intron was biphasic with an initial fast

phase followed by a slow phase, with optimal splicing occurring at a 1:1 molar ratio

between the IEP and the intron RNA (fast phase, 5.7/min and slow phase, 0.14/min).

Figure 5.3: SHAPE analysis of the GsI-IIC intron RNA.

[Figure 5.3, panels A and B. Labels: Nucleotide position; DI, DII, DIII, DIV, DV, DVI; 5’ exon; 3’ exon.]

(A) Plot of SHAPE reactivities. The in vitro GsI2c3532ΔA transcript was

incubated under splicing conditions, modified by isatoic anhydride and reverse

transcribed by TeI4c-MRF (Materials and Methods). SHAPE reactivities were calculated

for each nucleotide by using QuSHAPE (Karabiber et al., 2013). Red represents high

reactivity, yellow represents medium reactivity, and black represents no reactivity. (B) The

secondary structure of GsI-IIC intron RNA predicted by RNAstructure (Reuter and

Mathews, 2010) using SHAPE reactivities as constraints. A stem loop region from DIII

of the intron RNA was enlarged and used as an example to compare cDNA traces

produced by TeI4c-MRF or by SSIII in capillary electrophoresis. Peaks in the trace

represent reverse transcription stops at single-nucleotide resolution. TeI4c-MRF only

stopped at SHAPE-modification sites in the RNA and produced a structure that matched

predicted base-pairing interactions in the short stem, whereas SSIII, which has a greater

propensity for premature termination during reverse transcription, did not predict stable

base pairing in the short stem. The nucleotides in the stem loop are colored to indicate

SHAPE reactivities as shown in (A). EBS, exon-binding site; IBS, intron-binding site.

Nucleotide sequences involved in long-range tertiary interactions are boxed, circled or

indicated by arrows and are assigned with Greek letters.

Figure 5.4: Mapping of protein binding sites in GsI-IIC intron RNA.

(A) TGIRT-CRAC methods. Protein and RNA are irradiated by UV light under

desired conditions, digested by RNases followed by RNase-inactivation, RNA 5’-end

labeling with γ-32P-ATP, and 3’-end dephosphorylation. RNA-protein complexes are

analyzed by SDS-PAGE and transferred to a membrane. RNA fragments are released

from the membrane and subjected to RNA-seq library construction by using the TGIRT-

seq small RNA/CircLigase method. (B) Cross-linked GsI-IIC RNA-IEP complexes on a

nitrocellulose membrane. The in vitro transcript GsI2c3532ΔA was incubated in the

absence or presence of its IEP GsI-IIC-MRF under splicing conditions, irradiated, and

digested by RNase at low or high concentration. The RNA-IEP complexes had

molecular weights higher than 80 kDa (that of GsI-IIC-MRF alone) on the membrane. (C)

The coverage map of RNA-seq reads. Reads were mapped to GsI-IIC intron RNA and the

number of hits at each nucleotide position was plotted. (D) Predicted IEP binding sites

shown on the secondary structure of GsI-IIC intron RNA. Cross-linked sites were

identified as distinctive read start sites in (C) and are shown by red arrowheads. EBS,

exon-binding site; IBS, intron-binding site. Nucleotide sequences involved in long-range

tertiary interactions are boxed, circled or indicated by arrows and are assigned with Greek

letters.

Bibliography

Abbas, Y.M., Pichlmair, A., Górna, M.W., Superti-Furga, G., and Nagar, B. (2013).

Structural basis for viral 5’-PPP-RNA recognition by human IFIT proteins. Nature 494,

60–64.

Abbott, J.A., Francklyn, C.S., and Robey-Bond, S.M. (2014). Transfer RNA and human

disease. Front. Genet. 5, 158.

Agris, P.F., Vendeix, F.A.P., and Graham, W.D. (2007). tRNA’s wobble decoding of the

genome: 40 years of modification. J. Mol. Biol. 366, 1–13.

Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count

data. Genome Biol. 11, R106.

Anderson, P., and Ivanov, P. (2014). tRNA fragments in human health and disease. FEBS

Lett. 588, 4297–4304.

Ansmant, I., Motorin, Y., Massenet, S., Grosjean, H., and Branlant, C. (2001).

Identification and characterization of the tRNA:Psi 31-synthase (Pus6p) of

Saccharomyces cerevisiae. J. Biol. Chem. 276, 34934–34940.

Arroyo, J.D., Chevillet, J.R., Kroh, E.M., Ruf, I.K., Pritchard, C.C., Gibson, D.F.,

Mitchell, P.S., Bennett, C.F., Pogosova-Agadjanyan, E.L., Stirewalt, D.L., et al. (2011).

Argonaute2 complexes carry a population of circulating microRNAs independent of

vesicles in human plasma. Proc. Natl. Acad. Sci. U. S. A. 108, 5003–5008.

Astuti, D., Morris, M.R., Cooper, W.N., Staals, R.H.J., Wake, N.C., Fews, G.A., Gill, H.,

Gentle, D., Shuib, S., Ricketts, C.J., et al. (2012). Germline mutations in DIS3L2 cause

the Perlman syndrome of overgrowth and Wilms tumor susceptibility. Nat. Genet. 44,

277–284.

Baranauskas, A., Paliksa, S., Alzbutas, G., Vaitkevicius, M., Lubiene, J., Letukiene, V.,

Burinskas, S., Sasnauskas, G., and Skirgaila, R. (2012). Generation and characterization

of new highly thermostable and processive M-MuLV reverse transcriptase variants.

Protein Eng. Des. Sel. 25, 657–668.

Batista, P.J., and Chang, H.Y. (2013). Long noncoding RNAs: cellular address codes in

development and disease. Cell 152, 1298–1307.

Beckman, R.A., Mildvan, A.S., and Loeb, L.A. (1985). On the fidelity of DNA

replication: manganese mutagenesis in vitro. Biochemistry 24, 5810–5817.

Been, M.D., and Wickham, G.S. (1997). Self-cleaving ribozymes of hepatitis delta virus

RNA. Eur. J. Biochem. 247, 741–753.

Bergsagel, P.L., Mateos, M.-V., Gutierrez, N.C., Rajkumar, S.V., and San Miguel, J.F.

(2013). Improving overall survival and overcoming adverse prognosis in the treatment of

cytogenetically high-risk multiple myeloma. Blood 121, 884–892.

Bibillo, A., and Eickbush, T.H. (2002). High processivity of the reverse transcriptase

from a non-long terminal repeat retrotransposon. J. Biol. Chem. 277, 34836–34845.

Black, D.L. (2000). Protein diversity from alternative splicing: a challenge for

bioinformatics and post-genome biology. Cell 103, 367–370.

Black, D.L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev.

Biochem. 72, 291–336.

Blocker, F.J.H., Mohr, G., Conlan, L.H., Qi, L., Belfort, M., and Lambowitz, A.M.

(2005). Domain structure and three-dimensional model of a group II intron-encoded

reverse transcriptase. RNA 11, 14–28.

Brandman, O., Stewart-Ornstein, J., Wong, D., Larson, A., Williams, C.C., Li, G.-W.,

Zhou, S., King, D., Shen, P.S., Weibezahn, J., et al. (2012). A ribosome-bound quality

control complex triggers degradation of nascent peptides and signals translation stress.

Cell 151, 1042–1054.

Brown, J.B., Boley, N., Eisman, R., May, G.E., Stoiber, M.H., Duff, M.O., Booth, B.W.,

Wen, J., Park, S., Suzuki, A.M., et al. (2014). Diversity and dynamics of the Drosophila

transcriptome. Nature.

Brunner, A.L., Beck, A.H., Edris, B., Sweeney, R.T., Zhu, S.X., Li, R., Montgomery, K.,

Varma, S., Gilks, T., Guo, X., et al. (2012). Transcriptional profiling of long non-coding

RNAs and novel transcribed regions across a diverse panel of archived human cancers.

Genome Biol. 13, R75.

Burgos, K.L., Javaherian, A., Bomprezzi, R., Ghaffari, L., Rhodes, S., Courtright, A.,

Tembe, W., Kim, S., Metpally, R., and Van Keuren-Jensen, K. (2013). Identification of

extracellular miRNA in human cerebrospinal fluid by next-generation sequencing. RNA

19, 712–722.

Burnett, B.P., and McHenry, C.S. (1997). Posttranscriptional modification of retroviral

primers is required for late stages of DNA replication. Proc. Natl. Acad. Sci. U. S. A. 94,

7210–7215.

Byron, S.A., Van Keuren-Jensen, K.R., Engelthaler, D.M., Carpten, J.D., and Craig,

D.W. (2016). Translating RNA sequencing into clinical diagnostics: opportunities and

challenges. Nat. Rev. Genet. 17, 257–271.

Cabili, M.N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., and Rinn,

J.L. (2011). Integrative annotation of human large intergenic noncoding RNAs reveals

global properties and specific subclasses. Genes Dev. 25, 1915–1927.

Candales, M.A., Duong, A., Hood, K.S., Li, T., Neufeld, R.A.E., Sun, R., McNeil, B.A.,

Wu, L., Jarding, A.M., and Zimmerly, S. (2012). Database for bacterial group II introns.

Nucleic Acids Res. 40, D187–D190.

Chan, P.P., and Lowe, T.M. (2009). GtRNAdb: a database of transfer RNA genes

detected in genomic sequence. Nucleic Acids Res. 37, D93–D97.

Chang, H.-M., Triboulet, R., Thornton, J.E., and Gregory, R.I. (2013). A role for the

Perlman syndrome exonuclease Dis3l2 in the Lin28-let-7 pathway. Nature 497, 244–248.

Chen, B., and Lambowitz, A.M. (1997). De novo and DNA primer-mediated initiation of

cDNA synthesis by the Mauriceville retroplasmid reverse transcriptase involve

recognition of a 3’ CCA sequence. J. Mol. Biol. 271, 311–332.

Chen, X., and Wolin, S.L. (2004). The Ro 60 kDa autoantigen: insights into cellular

function and role in autoimmunity. J. Mol. Med. 82, 232–239.

Chen, R., Mias, G.I., Li-Pook-Than, J., Jiang, L., Lam, H.Y.K., Chen, R., Miriami, E.,

Karczewski, K.J., Hariharan, M., Dewey, F.E., et al. (2012). Personal omics profiling

reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307.

Chen, X., Taylor, D.W., Fowler, C.C., Galan, J.E., Wang, H.-W., and Wolin, S.L. (2013).

An RNA degradation machine sculpted by Ro autoantigen and noncoding RNA. Cell

153, 166–177.

Christov, C.P., Gardiner, T.J., Szüts, D., and Krude, T. (2006). Functional requirement of

noncoding Y RNAs for human chromosomal DNA replication. Mol. Cell. Biol. 26,

6993–7004.

Chu, D., Barnes, D.J., and von der Haar, T. (2011). The role of tRNA and ribosome

competition in coupling the expression of different mRNAs in Saccharomyces cerevisiae.

Nucleic Acids Res. 39, 6705–6714.

Chu, J., Hong, N.A., Masuda, C.A., Jenkins, B.V., Nelms, K.A., Goodnow, C.C., Glynne,

R.J., Wu, H., Masliah, E., Joazeiro, C.A.P., et al. (2009). A mouse forward genetics

screen identifies LISTERIN as an E3 ubiquitin ligase involved in neurodegeneration.

Proc. Natl. Acad. Sci. U. S. A. 106, 2097–2103.

Clark, J.M. (1988). Novel non-templated nucleotide addition reactions catalyzed by

procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res. 16, 9677–9686.

Cocquet, J., Chong, A., Zhang, G., and Veitia, R.A. (2006). Reverse transcriptase

template switching and false alternative transcripts. Genomics 88, 127–131.

Conlan, L.H., Stanger, M.J., Ichiyanagi, K., and Belfort, M. (2005). Localization,

mobility and fidelity of retrotransposed Group II introns in rRNA genes. Nucleic Acids

Res. 33, 5262–5270.

Cousineau, B., Smith, D., Lawrence-Cavanagh, S., Mueller, J.E., Yang, J., Mills, D.,

Manias, D., Dunny, G., Lambowitz, A.M., and Belfort, M. (1998). Retrohoming of a

bacterial group II intron: mobility via complete reverse splicing, independent of

homologous DNA recombination. Cell 94, 451–462.

Crick, F.H. (1966). Codon--anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19,

548–555.

Croce, C.M. (2009). Causes and consequences of microRNA dysregulation in cancer.

Nat. Rev. Genet. 10, 704–714.

Cui, X., Matsuura, M., Wang, Q., Ma, H., and Lambowitz, A.M. (2004). A group II

intron-encoded maturase functions preferentially in cis and requires both the reverse

transcriptase and X domains to promote RNA splicing. J. Mol. Biol. 340, 211–231.

Daffis, S., Szretter, K.J., Schriewer, J., Li, J., Youn, S., Errett, J., Lin, T.-Y., Schneller,

S., Zust, R., Dong, H., et al. (2010). 2’-O methylation of the viral mRNA cap evades host

restriction by IFIT family members. Nature 468, 452–456.

Dai, L., Chai, D., Gu, S.-Q., Gabel, J., Noskov, S.Y., Blocker, F.J.H., Lambowitz, A.M.,

and Zimmerly, S. (2008). A three-dimensional model of a group II intron RNA and its

interaction with the intron-encoded reverse transcriptase. Mol. Cell 30, 472–485.

Decroly, E., Ferron, F., Lescar, J., and Canard, B. (2012). Conventional and

unconventional mechanisms for capping viral mRNA. Nat. Rev. Microbiol. 10, 51–65.

Defenouillère, Q., Yao, Y., Mouaikel, J., Namane, A., Galopier, A., Decourty, L., Doyen,

A., Malabat, C., Saveanu, C., Jacquier, A., et al. (2013). Cdc48-associated complex

bound to 60S particles is required for the clearance of aberrant translation products. Proc.

Natl. Acad. Sci. U. S. A. 110, 5046–5051.

Delannoy, E., Le Ret, M., Faivre-Nitschke, E., Estavillo, G.M., Bergdoll, M., Taylor,

N.L., Pogson, B.J., Small, I., Imbault, P., and Gualberto, J.M. (2009). Arabidopsis tRNA

adenosine deaminase arginine edits the wobble nucleotide of chloroplast tRNAArg(ACG)

and is essential for efficient chloroplast translation. Plant Cell 21, 2058–2071.

Dhahbi, J.M., Spindler, S.R., Atamna, H., Yamakawa, A., Boffelli, D., Mote, P., and

Martin, D.I.K. (2013a). 5’ tRNA halves are present as abundant complexes in serum,

concentrated in blood cells, and modulated by aging and calorie restriction. BMC

Genomics 14, 298.

Dhahbi, J.M., Spindler, S.R., Atamna, H., Boffelli, D., Mote, P., and Martin, D.I.K.

(2013b). 5’-YRNA fragments derived by processing of transcripts from specific YRNA

genes and pseudogenes are abundant in human serum and plasma. Physiol. Genomics 45,

990–998.

Diamond, M.S., and Farzan, M. (2013). The broad-spectrum antiviral functions of IFIT

and IFITM proteins. Nat. Rev. Immunol. 13, 46–57.

Dittmar, K.A., Sørensen, M.A., Elf, J., Ehrenberg, M., and Pan, T. (2005). Selective

charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–

157.

Dittmar, K.A., Goodenbour, J.M., and Pan, T. (2006). Tissue-specific differences in

human transfer RNA expression. PLoS Genet. 2, e221.

Elagib, K.E., Rubinstein, J.D., Delehanty, L.L., Ngoh, V.S., Greer, P.A., Li, S., Lee, J.K.,

Li, Z., Orkin, S.H., Mihaylov, I.S., et al. (2013). Calpain 2 activation of P-TEFb drives

megakaryocyte morphogenesis and is disrupted by leukemogenic GATA1 mutation. Dev.

Cell 27, 607–620.

EL Andaloussi, S., Mäger, I., Breakefield, X.O., and Wood, M.J.A. (2013). Extracellular

vesicles: biology and emerging therapeutic opportunities. Nat. Rev. Drug Discov. 12,

347–357.

Enyeart, P.J., Mohr, G., Ellington, A.D., and Lambowitz, A.M. (2014). Biotechnological

applications of mobile group II introns and their reverse transcriptases: gene targeting,

RNA-seq, and non-coding RNA analysis. Mob. DNA 5, 2.

Esteller, M. (2011). Non-coding RNAs in human disease. Nat. Rev. Genet. 12, 861–874.

Fabbri, M., Paone, A., Calore, F., Galli, R., Gaudio, E., Santhanam, R., Lovat, F., Fadda,

P., Mao, C., Nuovo, G.J., et al. (2012). MicroRNAs bind to Toll-like receptors to induce

prometastatic inflammatory response. Proc. Natl. Acad. Sci. U. S. A. 109, E2110–E2116.

Falnes, P.Ø., Johansen, R.F., and Seeberg, E. (2002). AlkB-mediated oxidative

demethylation reverses DNA damage in Escherichia coli. Nature 419, 178–182.

Fan, H.C., Blumenfeld, Y.J., Chitkara, U., Hudgins, L., and Quake, S.R. (2008).

Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal

blood. Proc. Natl. Acad. Sci. U. S. A. 105, 16266–16271.

Faustino, N.A., and Cooper, T.A. (2003). Pre-mRNA splicing and human disease. Genes

Dev. 17, 419–437.

Feng, F., Yuan, L., Wang, Y.E., Crowley, C., Lv, Z., Li, J., Liu, Y., Cheng, G., Zeng, S.,

and Liang, H. (2013). Crystal structure and nucleotide selectivity of human IFIT5/ISG58.

Cell Res. 23, 1055–1058.

Fu, H., Feng, J., Liu, Q., Sun, F., Tie, Y., Zhu, J., Xing, R., Sun, Z., and Zheng, X.

(2009). Stress induces tRNA cleavage by angiogenin in mammalian cells. FEBS Lett.

583, 437–442.

Gerber, A.P., and Keller, W. (1999). An adenosine deaminase that generates inosine at

the wobble position of tRNAs. Science 286, 1146–1149.

Ghosh, A., and Lima, C.D. (2010). Enzymology of RNA cap synthesis. Wiley

Interdiscip. Rev. RNA 1, 152–172.

Gingold, H., Tehler, D., Christoffersen, N.R., Nielsen, M.M., Asmar, F., Kooistra, S.M.,

Christophersen, N.S., Christensen, L.L., Borre, M., Sørensen, K.D., et al. (2014). A dual

program for translation regulation in cellular proliferation and differentiation. Cell 158,

1281–1292.

Golinelli, M.-P., and Hughes, S.H. (2002). Nontemplated nucleotide addition by HIV-1

reverse transcriptase. Biochemistry (Mosc.) 41, 5894–5906.

Goodarzi, H., Liu, X., Nguyen, H.C.B., Zhang, S., Fish, L., and Tavazoie, S.F. (2015).

Endogenous tRNA-Derived Fragments Suppress Breast Cancer Progression via YBX1

Displacement. Cell 161, 790–802.

Goubau, D., Deddouche, S., and Reis e Sousa, C. (2013). Cytosolic sensing of viruses.

Immunity 38, 855–869.

Granneman, S., Kudla, G., Petfalski, E., and Tollervey, D. (2009). Identification of

protein binding sites on U3 snoRNA and pre-rRNA by UV cross-linking and high-

throughput analysis of cDNAs. Proc. Natl. Acad. Sci. U. S. A. 106, 9613–9618.

Grasedieck, S., Sorrentino, A., Langer, C., Buske, C., Döhner, H., Mertens, D., and

Kuchenbauer, F. (2013). Circulating microRNAs in hematological diseases: principles,

challenges, and perspectives. Blood 121, 4977–4984.

Gürtler, C., and Bowie, A.G. (2013). Innate immune detection of microbial nucleic acids.

Trends Microbiol. 21, 413–420.

Habjan, M., Hubel, P., Lacerda, L., Benda, C., Holze, C., Eberl, C.H., Mann, A., Kindler,

E., Gil-Cruz, C., Ziebuhr, J., et al. (2013). Sequestration by IFIT1 impairs translation of

2’O-unmethylated capped RNA. PLoS Pathog. 9, e1003663.

Halse, A.-K., Wahren-Herlenius, M., and Jonsson, R. (1999). Ro/SS-A- and La/SS-B-

reactive B lymphocytes in peripheral blood of patients with Sjögren’s syndrome. Clin.

Exp. Immunol. 115, 208–213.

Hansen, K.D., Brenner, S.E., and Dudoit, S. (2010). Biases in Illumina transcriptome

sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131.

Hardin, J.A., Rahn, D.R., Shen, C., Lerner, M.R., Wolin, S.L., Rosa, M.D., and Steitz,

J.A. (1982). Antibodies from patients with connective tissue diseases bind specific

subsets of cellular RNA-protein particles. J. Clin. Invest. 70, 141–147.

He, N., Jahchan, N.S., Hong, E., Li, Q., Bayfield, M.A., Maraia, R.J., Luo, K., and Zhou,

Q. (2008). A La-Related Protein Modulates 7SK snRNP Integrity to Suppress P-TEFb-

Dependent Transcriptional Elongation and Tumorigenesis. Mol. Cell 29, 588–599.

Head, S.R., Komori, H.K., LaMere, S.A., Whisenant, T., Van Nieuwerburgh, F.,

Salomon, D.R., and Ordoukhanian, P. (2014). Library construction for next-generation

sequencing: overviews and challenges. BioTechniques 56, 61–64, 66, 68, passim.

Horton, R., Wilming, L., Rand, V., Lovering, R.C., Bruford, E.A., Khodiyar, V.K., Lush,

M.J., Povey, S., Talbot, C.C., Wright, M.W., et al. (2004). Gene map of the extended

human MHC. Nat. Rev. Genet. 5, 889–899.

Houseley, J., and Tollervey, D. (2009). The Many Pathways of RNA Degradation. Cell

136, 763–776.

Hu, W.-S., and Hughes, S.H. (2012). HIV-1 reverse transcription. Cold Spring Harb.

Perspect. Med. 2.

Huang, X., Yuan, T., Tschannen, M., Sun, Z., Jacob, H., Du, M., Liang, M., Dittmar,

R.L., Liu, Y., Liang, M., et al. (2013). Characterization of human plasma-derived

exosomal RNAs by deep sequencing. BMC Genomics 14, 319.

International Myeloma Working Group (2003). Criteria for the classification of

monoclonal gammopathies, multiple myeloma and related disorders: a report of the

International Myeloma Working Group. Br. J. Haematol. 121, 749–757.

Ishimura, R., Nagy, G., Dotu, I., Zhou, H., Yang, X.-L., Schimmel, P., Senju, S.,

Nishimura, Y., Chuang, J.H., and Ackerman, S.L. (2014). RNA function. Ribosome

stalling induced by mutation of a CNS-specific tRNA causes neurodegeneration. Science

345, 455–459.

Jackman, J.E., Montange, R.K., Malik, H.S., and Phizicky, E.M. (2003). Identification of

the yeast gene encoding the tRNA m1G methyltransferase responsible for modification at

position 9. RNA N. Y. N 9, 574–585.

Jarrell, K.A., Peebles, C.L., Dietrich, R.C., Romiti, S.L., and Perlman, P.S. (1988). Group

II intron self-splicing. Alternative reaction conditions yield novel products. J. Biol.

Chem. 263, 3432–3439.

Ji, J.P., and Loeb, L.A. (1992). Fidelity of HIV-1 reverse transcriptase copying RNA in

vitro. Biochemistry (Mosc.) 31, 954–958.

Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid,

accurate, and best-practices quantification of nucleic acid probing information, resolved

by capillary electrophoresis. RNA N. Y. N 19, 63–73.

Karni, R., de Stanchina, E., Lowe, S.W., Sinha, R., Mu, D., and Krainer, A.R. (2007).

The gene encoding the splicing factor SF2/ASF is a proto-oncogene. Nat. Struct. Mol.

Biol. 14, 185–193.

Katayama, S., Tomaru, Y., Kasukawa, T., Waki, K., Nakanishi, M., Nakamura, M.,

Nishida, H., Yap, C.C., Suzuki, M., Kawai, J., et al. (2005). Antisense transcription in the

mammalian transcriptome. Science 309, 1564–1566.

Katibah, G.E., Lee, H.J., Huizar, J.P., Vogan, J.M., Alber, T., and Collins, K. (2013).

tRNA binding, structure, and localization of the human interferon-induced protein IFIT5.

Mol. Cell 49, 743–750.

Katibah, G.E., Qin, Y., Sidote, D.J., Yao, J., Lambowitz, A.M., and Collins, K. (2014).

Broad and adaptable RNA structure recognition by the human interferon-induced

tetratricopeptide repeat protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030.

Keller, A., Leidinger, P., Bauer, A., Elsharawy, A., Haas, J., Backes, C., Wendschlag, A.,

Giese, N., Tjaden, C., Ott, K., et al. (2011). Toward the blood-borne miRNome of human

diseases. Nat. Methods 8, 841–843.

Khorkova, O., Myers, A.J., Hsiao, J., and Wahlestedt, C. (2014). Natural antisense

transcripts. Hum. Mol. Genet.

Kickhoefer, V.A., Poderycki, M.J., Chan, E.K.L., and Rome, L.H. (2002). The La RNA-

binding protein interacts with the vault RNA and is a vault-associated protein. J. Biol.

Chem. 277, 41282–41286.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013).

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions

and gene fusions. Genome Biol. 14, R36.

Kimura, T., Katoh, H., Kayama, H., Saiga, H., Okuyama, M., Okamoto, T., Umemoto,

E., Matsuura, Y., Yamamoto, M., and Takeda, K. (2013). Ifit1 inhibits Japanese

encephalitis virus replication through binding to 5’ capped 2’-O unmethylated RNA. J.

Virol. 87, 9997–10003.

Kirchner, S., and Ignatova, Z. (2015). Emerging roles of tRNA in adaptive translation,

signalling dynamics and disease. Nat. Rev. Genet. 16, 98–112.

Koh, W., Pan, W., Gawad, C., Fan, H.C., Kerchner, G.A., Wyss-Coray, T., Blumenfeld,

Y.J., El-Sayed, Y.Y., and Quake, S.R. (2014). Noninvasive in vivo monitoring of tissue-

specific global gene expression in humans. Proc. Natl. Acad. Sci. U. S. A. 111, 7361–

7366.

König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J.,

Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in

splicing at individual nucleotide resolution. Nat. Struct. Mol. Biol. 17, 909–915.

Kopreski, M.S., Benko, F.A., and Gocke, C.D. (2001). Circulating RNA as a tumor

marker: detection of 5T4 mRNA in breast and lung cancer patient serum. Ann. N. Y.

Acad. Sci. 945, 172–178.

Kowalski, M.P., and Krude, T. (2015). Functional roles of non-coding Y RNAs. Int. J.

Biochem. Cell Biol. 66, 20–29.

Krude, T., Christov, C.P., Hyrien, O., and Marheineke, K. (2009). Y RNA functions at

the initiation step of mammalian chromosomal DNA replication. J. Cell Sci. 122, 2836–

2845.

Kumar, P., Sweeney, T.R., Skabkin, M.A., Skabkina, O.V., Hellen, C.U.T., and Pestova,

T.V. (2014). Inhibition of translation by IFIT family members is determined by their

ability to interact selectively with the 5’-terminal regions of cap0-, cap1- and 5’ppp-

mRNAs. Nucleic Acids Res. 42, 3228–3245.

Lambowitz, A.M., and Belfort, M. (2015). Mobile Bacterial Group II Introns at the Crux

of Eukaryotic Evolution. Microbiol. Spectr. 3.

Lambowitz, A.M., and Zimmerly, S. (2011). Group II Introns: Mobile Ribozymes that

Invade DNA. Cold Spring Harb. Perspect. Biol. 3.

Lamm, A.T., Stadler, M.R., Zhang, H., Gent, J.I., and Fire, A.Z. (2011). Multimodal

RNA-seq using single-strand, double-strand, and CircLigase-based capture yields a

refined and extended description of the C. elegans transcriptome. Genome Res. 21, 265–

275.

Landgraf, P., Rusu, M., Sheridan, R., Sewer, A., Iovino, N., Aravin, A., Pfeffer, S., Rice,

A., Kamphorst, A.O., Landthaler, M., et al. (2007). A Mammalian microRNA Expression

Atlas Based on Small RNA Library Sequencing. Cell 129, 1401–1414.

Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2.

Nat. Methods 9, 357–359.

Levin, J.Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D.A., Friedman, N.,

Gnirke, A., and Regev, A. (2010). Comprehensive comparative analysis of strand-

specific RNA sequencing methods. Nat. Methods 7, 709–715.

Li, G.-W., Burkhardt, D., Gross, C., and Weissman, J.S. (2014). Quantifying absolute

protein synthesis rates reveals principles underlying allocation of cellular resources. Cell

157, 624–635.

Li, M., Kao, E., Gao, X., Sandig, H., Limmer, K., Pavon-Eternod, M., Jones, T.E.,

Landry, S., Pan, T., Weitzman, M.D., et al. (2012). Codon-usage-based inhibition of HIV

protein synthesis by human schlafen 11. Nature 491, 125–128.

Lill, R., Robertson, J.M., and Wintermeyer, W. (1986). Affinities of tRNA binding sites

of ribosomes from Escherichia coli. Biochemistry (Mosc.) 25, 3245–3255.

Linsen, S.E.V., de Wit, E., Janssens, G., Heater, S., Chapman, L., Parkin, R.K., Fritz, B.,

Wyman, S.K., de Bruijn, E., Voest, E.E., et al. (2009). Limitations and possibilities of

small RNA digital gene expression profiling. Nat. Methods 6, 474–476.

Liu, Y., Zhang, Y.-B., Liu, T.-K., and Gui, J.-F. (2013). Lineage-specific expansion of

IFIT gene family: an insight into coevolution with IFN gene family. PloS One 8, e66859.

Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of

transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.

Lu, Z., and Matera, A.G. (2014). Vicinal: a method for the determination of ncRNA ends

using chimeric reads from RNA-seq experiments. Nucleic Acids Res. gku207.

Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero,

A., Ebert, B.L., Mak, R.H., Ferrando, A.A., et al. (2005). MicroRNA expression profiles

classify human cancers. Nature 435, 834–838.

Lusvarghi, S., Sztuba-Solinska, J., Purzycka, K.J., Rausch, J.W., and Le Grice, S.F.J.

(2013). RNA Secondary Structure Prediction Using High-throughput SHAPE. J. Vis.

Exp. JoVE.

Lyakhov, D.L., He, B., Zhang, X., Studier, F.W., Dunn, J.J., and McAllister, W.T.

(1997). Mutant bacteriophage T7 RNA polymerases with altered termination properties.

J. Mol. Biol. 269, 28–40.

Mader, R.M., Schmidt, W.M., Sedivy, R., Rizovski, B., Braun, J., Kalipciyan, M., Exner,

M., Steger, G.G., and Mueller, M.W. (2001). Reverse transcriptase template switching

during reverse transcriptase-polymerase chain reaction: artificial generation of deletions

in ribonucleotide reductase mRNA. J. Lab. Clin. Med. 137, 422–428.

Malathi, K., Dong, B., Gale, M., and Silverman, R.H. (2007). Small self-RNA generated

by RNase L amplifies antiviral innate immunity. Nature 448, 816–819.

Malecki, M., Viegas, S.C., Carneiro, T., Golik, P., Dressaire, C., Ferreira, M.G., and

Arraiano, C.M. (2013). The exoribonuclease Dis3L2 defines a novel eukaryotic RNA

degradation pathway. EMBO J. 32, 1842–1854.

Markert, A., Grimm, M., Martinez, J., Wiesner, J., Meyerhans, A., Meyuhas, O.,

Sickmann, A., and Fischer, U. (2008). The La‐related protein LARP7 is a component of

the 7SK ribonucleoprotein and affects transcription of cellular and viral polymerase II

genes. EMBO Rep. 9, 569–575.

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput

sequencing reads. EMBnet.journal 17, 10.

Mathews, M.B., and Bernstein, R.M. (1983). Myositis autoantibody inhibits histidyl-

tRNA synthetase: a model for autoimmunity. Nature 304, 177–179.

Matsuura, M., Saldanha, R., Ma, H., Wank, H., Yang, J., Mohr, G., Cavanagh, S., Dunny,

G.M., Belfort, M., and Lambowitz, A.M. (1997). A bacterial group II intron encoding

reverse transcriptase, maturase, and DNA endonuclease activities: biochemical

demonstration of maturase activity and insertion of new genetic information within the

intron. Genes Dev. 11, 2910–2924.

Matsuura, M., Noah, J.W., and Lambowitz, A.M. (2001). Mechanism of maturase-

promoted group II intron splicing. EMBO J. 20, 7259–7270.

Mayer, G., Müller, J., and Lünse, C.E. (2011). RNA diagnostics: real-time RT-PCR

strategies and promising novel target RNAs. Wiley Interdiscip. Rev. RNA 2, 32–41.

Meldrum, C., Doyle, M.A., and Tothill, R.W. (2011). Next-generation sequencing for

cancer diagnostics: a practical perspective. Clin. Biochem. Rev. Aust. Assoc. Clin.

Biochem. 32, 177–195.

Mitchell, P.S., Parkin, R.K., Kroh, E.M., Fritz, B.R., Wyman, S.K., Pogosova-

Agadjanyan, E.L., Peterson, A., Noteboom, J., O’Briant, K.C., Allen, A., et al. (2008).

Circulating microRNAs as stable blood-based markers for cancer detection. Proc. Natl.

Acad. Sci. U. S. A. 105, 10513–10518.

Mohr, G., Ghanem, E., and Lambowitz, A.M. (2010). Mechanisms used for genomic

proliferation by thermophilic group II introns. PLoS Biol. 8, e1000391.

Mohr, S., Ghanem, E., Smith, W., Sheeter, D., Qin, Y., King, O., Polioudakis, D., Iyer,

V.R., Hunicke-Smith, S., Swamy, S., et al. (2013). Thermostable group II intron reverse

transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA

sequencing. RNA N. Y. N 19, 958–970.

Moore, S.D., and Sauer, R.T. (2007). The tmRNA system for translational surveillance

and ribosome rescue. Annu. Rev. Biochem. 76, 101–124.

Moussay, E., Wang, K., Cho, J.-H., van Moer, K., Pierson, S., Paggetti, J., Nazarov, P.V.,

Palissot, V., Hood, L.E., Berchem, G., et al. (2011). MicroRNA as biomarkers and

regulators in B-cell chronic lymphocytic leukemia. Proc. Natl. Acad. Sci. U. S. A. 108,

6573–6578.

Ng, B., Nayak, S., Gibbs, M.D., Lee, J., and Bergquist, P.L. (2007). Reverse

transcriptases: intron-encoded proteins found in thermophilic bacteria. Gene 393, 137–

144.

Norbury, C.J. (2013). Cytoplasmic RNA: a case of the tail wagging the dog. Nat. Rev.

Mol. Cell Biol. 14, 643–653.

Nottingham, R.M., Wu, D.C., Qin, Y., Yao, J., Hunicke-Smith, S., and Lambowitz, A.M.

(2016). RNA-seq of human reference RNA samples using a thermostable group II intron

reverse transcriptase. RNA N. Y. N 22, 597–613.

Ozsolak, F., and Milos, P.M. (2011). RNA sequencing: advances, challenges and

opportunities. Nat. Rev. Genet. 12, 87–98.

Pang, Y.L.J., Abo, R., Levine, S.S., and Dedon, P.C. (2014). Diverse cell stresses induce

unique patterns of tRNA up- and down-regulation: tRNA-seq for quantifying changes in

tRNA copy number. Nucleic Acids Res. 42, e170.

Parrott, A.M., and Mathews, M.B. (2007). Novel rapidly evolving hominid RNAs bind

nuclear factor 90 and display tissue-restricted distribution. Nucleic Acids Res. 35, 6249–

6258.

Parrott, A.M., Tsai, M., Batchu, P., Ryan, K., Ozer, H.L., Tian, B., and Mathews, M.B.

(2011). The evolution and expression of the snaR family of small non-coding RNAs.

Nucleic Acids Res. 39, 1485–1500.

Phizicky, E.M., and Hopper, A.K. (2010). tRNA biology charges to the front. Genes Dev.

24, 1832–1860.

Pichlmair, A., Lassnig, C., Eberle, C.-A., Górna, M.W., Baumann, C.L., Burkard, T.R.,

Bürckstümmer, T., Stefanovic, A., Krieger, S., Bennett, K.L., et al. (2011). IFIT1 is an

antiviral protein that recognizes 5’-triphosphate RNA. Nat. Immunol. 12, 624–630.

Popovici, V., Chen, W., Gallas, B.G., Hatzis, C., Shi, W., Samuelson, F.W., Nikolsky,

Y., Tsyganova, M., Ishkin, A., Nikolskaya, T., et al. (2010). Effect of training-sample

size and classification difficulty on the accuracy of genomic predictors. Breast Cancer

Res. BCR 12, R5.

Portal, M.M., Pavet, V., Erb, C., and Gronemeyer, H. (2015). Human cells contain

natural double-stranded RNAs with potential regulatory functions. Nat. Struct. Mol. Biol.

22, 89–97.

Qin, Y., Yao, J., Wu, D.C., Nottingham, R.M., Mohr, S., Hunicke-Smith, S., and

Lambowitz, A.M. (2016). High-throughput sequencing of human plasma RNA by using

thermostable group II intron reverse transcriptases. RNA N. Y. N 22, 111–128.

Raab, M.S., Podar, K., Breitkreutz, I., Richardson, P.G., and Anderson, K.C. (2009).

Multiple myeloma. Lancet Lond. Engl. 374, 324–339.

Raabe, C.A., Tang, T.-H., Brosius, J., and Rozhdestvensky, T.S. (2014). Biases in small

RNA deep sequencing data. Nucleic Acids Res. 42, 1414–1426.

Rajkumar, S.V., Landgren, O., and Mateos, M.-V. (2015). Smoldering multiple myeloma.

Blood 125, 3069–3075.

Rambo, R.P., and Doudna, J.A. (2004). Assembly of an active group II intron-maturase

complex by protein dimerization. Biochemistry (Mosc.) 43, 6486–6497.

Raposo, G., and Stoorvogel, W. (2013). Extracellular vesicles: Exosomes, microvesicles,

and friends. J. Cell Biol. 200, 373–383.

Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary

structure prediction and analysis. BMC Bioinformatics 11, 129.

Robart, A.R., Seo, W., and Zimmerly, S. (2007). Insertion of group II intron

retroelements after intrinsic transcriptional terminators. Proc. Natl. Acad. Sci. U. S. A.

104, 6620–6625.

Robinson, J.T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G.,

and Mesirov, J.P. (2011). Integrative genomics viewer. Nat. Biotechnol. 29, 24–26.

Rosa, M.D., Hendrick, J.P., Lerner, M.R., Steitz, J.A., and Reichlin, M. (1983). A

mammalian tRNAHis-containing antigen is recognized by the polymyositis-specific

antibody anti-Jo-1. Nucleic Acids Res. 11, 853–870.

Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y., Zepeniuk, M.,

Benjamin, H., Shabes, N., Tabak, S., Levy, A., et al. (2008). MicroRNAs accurately

identify cancer tissue origin. Nat. Biotechnol. 26, 462–469.

Routsias, J.G., and Tzioufas, A.G. (2010). B-cell epitopes of the intracellular

autoantigens Ro/SSA and La/SSB: tools to study the regulation of the autoimmune

response. J. Autoimmun. 35, 256–264.

Rubio, M.A.T., Ragone, F.L., Gaston, K.W., Ibba, M., and Alfonzo, J.D. (2006). C to U

editing stimulates A to I editing in the anticodon loop of a cytoplasmic threonyl tRNA in

Trypanosoma brucei. J. Biol. Chem. 281, 115–120.

Sai Lakshmi, S., and Agrawal, S. (2008). piRNABank: a web resource on classified and

clustered Piwi-interacting RNAs. Nucleic Acids Res. 36, D173–D177.

Saldanha, R., Chen, B., Wank, H., Matsuura, M., Edwards, J., and Lambowitz, A.M.

(1999). RNA and protein catalysis in group II intron splicing and mobility reactions using

purified components. Biochemistry (Mosc.) 38, 9069–9083.

Satoh, T., Okano, T., Matsui, T., Watabe, H., Ogasawara, T., Kubo, K., Kuwana, M.,

Fertig, N., Oddis, C.V., Kondo, H., et al. (2005). Novel autoantibodies against 7SL RNA

in patients with polymyositis/dermatomyositis. J. Rheumatol. 32, 1727–1733.

Schoenberg, D.R., and Maquat, L.E. (2012). Regulation of cytoplasmic mRNA decay.

Nat. Rev. Genet. 13, 246–259.

Schoggins, J.W., and Rice, C.M. (2011). Interferon-stimulated genes and their antiviral

effector functions. Curr. Opin. Virol. 1, 519–525.

Shao, S., von der Malsburg, K., and Hegde, R.S. (2013). Listerin-dependent nascent

protein ubiquitination relies on ribosome subunit dissociation. Mol. Cell 50, 637–648.

Shen, P.S., Park, J., Qin, Y., Li, X., Parsawar, K., Larson, M.H., Cox, J., Cheng, Y.,

Lambowitz, A.M., Weissman, J.S., et al. (2015). Protein synthesis. Rqc2p and 60S

ribosomal subunits mediate mRNA-independent elongation of nascent chains. Science

347, 75–78.

Shi, L., Campbell, G., Jones, W.D., Campagne, F., Wen, Z., Walker, S.J., Su, Z., Chu, T.-

M., Goodsaid, F.M., Pusztai, L., et al. (2010). The MicroArray Quality Control (MAQC)-

II study of common practices for the development and validation of microarray-based

predictive models. Nat. Biotechnol. 28, 827–838.

Silva, J., García, V., García, J.M., Peña, C., Domínguez, G., Díaz, R., Lorenzo, Y.,

Hurtado, A., Sánchez, A., and Bonilla, F. (2007). Circulating Bmi-1 mRNA as a possible

prognostic factor for advanced breast cancer patients. Breast Cancer Res. BCR 9, R55.

Smith, D., and Yong, K. (2013). Multiple myeloma. BMJ 346, f3863.

Spornraft, M., Kirchner, B., Haase, B., Benes, V., Pfaffl, M.W., and Riedmaier, I. (2014).

Optimization of Extraction of Circulating RNAs from Plasma – Enabling Small RNA

Sequencing. PLoS ONE 9.

Stoltzfus, C.M. (2009). Chapter 1. Regulation of HIV-1 alternative RNA splicing and its

role in virus replication. Adv. Virus Res. 74, 1–40.

Stringer, S., Basnayake, K., Hutchison, C., and Cockwell, P. (2011). Recent advances in

the pathogenesis and management of cast nephropathy (myeloma kidney). Bone Marrow

Res. 2011, 493697.

Szretter, K.J., Daniels, B.P., Cho, H., Gainey, M.D., Yokoyama, W.M., Gale, M., Virgin,

H.W., Klein, R.S., Sen, G.C., and Diamond, M.S. (2012). 2’-O methylation of the viral

mRNA cap by West Nile virus evades Ifit1-dependent and -independent mechanisms of

host restriction in vivo. PLoS Pathog. 8, e1002698.

Tijerina, P., Mohr, S., and Russell, R. (2007). DMS footprinting of structured RNAs and

RNA-protein complexes. Nat. Protoc. 2, 2608–2623.

Toor, N., Robart, A.R., Christianson, J., and Zimmerly, S. (2006). Self-splicing of a

group IIC intron: 5’ exon recognition and alternative 5’ splicing events implicate the

stem-loop motif of a transcriptional terminator. Nucleic Acids Res. 34, 6461–6471.

Toor, N., Keating, K.S., Taylor, S.D., and Pyle, A.M. (2008). Crystal structure of a self-

spliced group II intron. Science 320, 77–82.

Topisirovic, I., Svitkin, Y.V., Sonenberg, N., and Shatkin, A.J. (2011). Cap and cap-

binding proteins in the control of gene expression. Wiley Interdiscip. Rev. RNA 2, 277–

298.

Trewick, S.C., Henshaw, T.F., Hausinger, R.P., Lindahl, T., and Sedgwick, B. (2002).

Oxidative demethylation by Escherichia coli AlkB directly reverts DNA base damage.

Nature 419, 174–178.

Ule, J., Jensen, K., Mele, A., and Darnell, R.B. (2005). CLIP: a method for identifying

protein-RNA interaction sites in living cells. Methods San Diego Calif 37, 376–386.

Valadi, H., Ekström, K., Bossios, A., Sjöstrand, M., Lee, J.J., and Lötvall, J.O. (2007).

Exosome-mediated transfer of mRNAs and microRNAs is a novel mechanism of genetic

exchange between cells. Nat. Cell Biol. 9, 654–659.

van der Veen, R., Kwakman, J.H., and Grivell, L.A. (1987). Mutations at the lariat

acceptor site allow self-splicing of a group II intron without lariat formation. EMBO J. 6,

3827–3831.

Vellore, J., Moretz, S.E., and Lampson, B.C. (2004). A group II intron-type open reading

frame from the thermophile Bacillus (Geobacillus) stearothermophilus encodes a heat-

stable reverse transcriptase. Appl. Environ. Microbiol. 70, 7140–7147.

Verma, R., Oania, R.S., Kolawa, N.J., and Deshaies, R.J. (2013). Cdc48/p97 promotes

degradation of aberrant nascent polypeptides bound to the ribosome. eLife 2, e00308.

Vickers, K.C., Palmisano, B.T., Shoucri, B.M., Shamburek, R.D., and Remaley, A.T.

(2011). MicroRNAs are transported in plasma and delivered to recipient cells by high-

density lipoproteins. Nat. Cell Biol. 13, 423–433.

Walter, P., and Blobel, G. (1982). Signal recognition particle contains a 7S RNA

essential for protein translocation across the endoplasmic reticulum. Nature 299, 691–

698.

Wang, K., Yuan, Y., Cho, J.-H., McClarty, S., Baxter, D., and Galas, D.J. (2012).

Comparing the MicroRNA spectrum between serum and plasma. PloS One 7, e41561.

Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for

transcriptomics. Nat. Rev. Genet. 10, 57–63.

Wank, H., SanFilippo, J., Singh, R.N., Matsuura, M., and Lambowitz, A.M. (1999). A

reverse transcriptase/maturase promotes splicing by binding at its own coding segment in

a group II intron RNA. Mol. Cell 4, 239–250.

Weeks, K.M., and Mauger, D.M. (2011). Exploring RNA structural codes with SHAPE

chemistry. Acc. Chem. Res. 44, 1280–1291.

Werner, A. (2013). Biological functions of natural antisense transcripts. BMC Biol. 11,

31.

Wilhelm, B.T., and Landry, J.-R. (2009). RNA-Seq-quantitative measurement of

expression through massively parallel RNA-sequencing. Methods San Diego Calif 48,

249–257.

Williams, Z., Ben-Dov, I.Z., Elias, R., Mihailovic, A., Brown, M., Rosenwaks, Z., and

Tuschl, T. (2013). Comprehensive profiling of circulating microRNA via small RNA

sequencing of cDNA libraries reveals biomarker potential and limitations. Proc. Natl.

Acad. Sci. U. S. A. 110, 4255–4260.

Wolin, S.L., Sim, S., and Chen, X. (2012). Nuclear noncoding RNA surveillance: is the

end in sight? Trends Genet. TIG 28, 306–313.

Xue, D., Shi, H., Smith, J.D., Chen, X., Noe, D.A., Cedervall, T., Yang, D.D., Eynon, E.,

Brash, D.E., Kashgarian, M., et al. (2003). A lupus-like syndrome develops in mice

lacking the Ro 60-kDa protein, a major lupus autoantigen. Proc. Natl. Acad. Sci. U. S. A.

100, 7503–7508.

Yamasaki, S., Ivanov, P., Hu, G., and Anderson, P. (2009). Angiogenin cleaves tRNA

and promotes stress-induced translational repression. J. Cell Biol. 185, 35–42.

Yang, Z., Liang, H., Zhou, Q., Li, Y., Chen, H., Ye, W., Chen, D., Fleming, J., Shu, H.,

and Liu, Y. (2012). Crystal structure of ISG54 reveals a novel RNA binding structure and

potential functional mechanisms. Cell Res. 22, 1328–1338.

Zarnack, K., König, J., Tajnik, M., Martincorena, I., Eustermann, S., Stévant, I., Reyes,

A., Anders, S., Luscombe, N.M., and Ule, J. (2013). Direct competition between hnRNP

C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell

152, 453–466.

Zernecke, A., Bidzhekov, K., Noels, H., Shagdarsuren, E., Gan, L., Denecke, B., Hristov,

M., Köppel, T., Jahantigh, M.N., Lutgens, E., et al. (2009). Delivery of microRNA-126

by apoptotic bodies induces CXCL12-dependent vascular protection. Sci. Signal. 2, ra81.

Zheng, G., Qin, Y., Clark, W.C., Dai, Q., Yi, C., He, C., Lambowitz, A.M., and Pan, T.

(2015). Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12,

835–837.

Zhou, X., Michal, J.J., Zhang, L., Ding, B., Lunney, J.K., Liu, B., and Jiang, Z. (2013).

Interferon induced IFIT family genes in host antiviral defense. Int. J. Biol. Sci. 9, 200–

208.

Züst, R., Cervantes-Barragan, L., Habjan, M., Maier, R., Neuman, B.W., Ziebuhr, J.,

Szretter, K.J., Baker, S.C., Barchet, W., Diamond, M.S., et al. (2011). Ribose 2’-O-

methylation provides a molecular signature for the distinction of self and non-self mRNA

dependent on the RNA sensor Mda5. Nat. Immunol. 12, 137–143.

Vita

Yidan Qin was born in Zhengzhou, Henan, People’s Republic of China to Caiying

Xia and Huihong Qin. After completing her high school studies at St Cyprian’s School in

Cape Town, Republic of South Africa, she enrolled at the University of Nebraska-Lincoln

in 2005 and received a B.S. in Biochemistry and a B.S. in Forensic Science in 2009. She

joined the Microbiology graduate program at the University of Texas at Austin in 2009,

and began her graduate work under the supervision of Dr. Alan Lambowitz in 2010.

She co-authored the following papers:

Mohr, S., Ghanem, E., Smith, W., Sheeter, D., Qin, Y., King, O., Polioudakis, D., Iyer,

V.R., Hunicke-Smith, S., Swamy, S., et al. (2013). Thermostable group II intron reverse

transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA

sequencing. RNA 19, 958–970.

Katibah, G.E., Qin, Y., Sidote, D.J., Yao, J., Lambowitz, A.M., and Collins, K. (2014).

Broad and adaptable RNA structure recognition by the human interferon-induced

tetratricopeptide repeat protein IFIT5. Proc. Natl. Acad. Sci. U. S. A. 111, 12025–12030.

Shen, P.S., Park, J., Qin, Y., Li, X., Parsawar, K., Larson, M.H., Cox, J., Cheng, Y.,

Lambowitz, A.M., Weissman, J.S., et al. (2015). Rqc2p and 60S

ribosomal subunits mediate mRNA-independent elongation of nascent chains. Science

347, 75–78.

Zheng, G.*, Qin, Y.*, Clark, W.C., Dai, Q., Yi, C., He, C., Lambowitz, A.M., and Pan, T.

(2015). Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods 12,

835–837.

Qin, Y.*, Yao, J.*, Wu, D.C., Nottingham, R.M., Mohr, S., Hunicke-Smith, S., and

Lambowitz, A.M. (2016). High-throughput sequencing of human plasma RNA by using

thermostable group II intron reverse transcriptases. RNA 22, 111–128.

Nottingham, R.M.*, Wu, D.C.*, Qin, Y., Yao, J., Hunicke-Smith, S., and Lambowitz,

A.M. (2016). RNA-seq of human reference RNA samples using a thermostable group II

intron reverse transcriptase. RNA 22, 597–613.

*Co-first authorship.

Permanent address: [email protected]

This dissertation was typed by Yidan Qin.