RNA bioinformatics Marcela Davila-Lopez...


Page 1: RNA bioinformatics Marcela Davila-Lopez …bio.lundberg.gu.se/courses/ht10/bio2/2010_rna_to_print.pdfRNA bioinformatics Marcela Davila-Lopez (marcela.davila@medkem.gu.se) RNA world

RNA bioinformatics
Marcela Davila-Lopez (marcela.davila@medkem.gu.se)

RNA world hypothesis

[Figure: RNA world timeline. ~3 billion years ago: RNA (storage, catalysis). Today: DNA, protein, and RNA (mRNA; ribosome, RNase P).]

RNA types

• mRNA codes for proteins.

• A non-coding RNA (ncRNA) is any RNA molecule that is not translated into a protein. Other names in use:
  – small RNA (sRNA), in bacteria
  – non-protein-coding RNA (npcRNA)
  – non-messenger RNA (nmRNA)
  – small non-messenger RNA (snmRNA)
  – functional RNA (fRNA)

• ncRNA content may be responsible for the complexity of the different organisms (Huttenhofer, A., P. Schattner, and N. Polacek, Trends Genet, 2005).


Role of ncRNAs
Gisela Storz, Shoshy Altuvia and Karen M. Wassarman (2005); Matera, A.G., R.M. Terns, and M.P. Terns, Nat Rev Mol Cell Biol, 2007.

Disease
Prasanth, K.V. and D.L. Spector, Genes Dev, 2007; Costa, F.F. Drug Discov Today 2009; Pandey, A.K., P. Agarwal, K. Kaur, and M. Datta. Cell Physiol Biochem 2009

• Protein – primary sequence: sequence similarity → biological relation, same function
• ncRNA – primary sequence: limited sequence conservation, but the structure is conserved


Covariation: consistent and compensatory mutations that (often) conserve the structure

A single mutation can radically change the structure.
• Canonical pairs (Watson-Crick: A-U, G-C)
• Non-canonical pairs: GU wobble
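The distinction between consistent and compensatory mutations can be made concrete with a small sketch. It assumes the usual reading of the terms: a consistent mutation changes one side of a pair and keeps it pairable (GC → GU), a compensatory mutation changes both sides (GC → AU). The function name and the "disruptive" label are illustrative choices, not from the slides.

```python
# Pairs treated as valid here: Watson-Crick plus the GU wobble.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def classify_pair_change(ref_pair, new_pair):
    """Classify how a base pair differs between two aligned sequences."""
    if ref_pair not in VALID_PAIRS:
        raise ValueError("reference positions do not form a pair")
    if new_pair not in VALID_PAIRS:
        return "disruptive"  # pairing is lost, the structure may change
    # Count how many of the two positions changed.
    changed = sum(a != b for a, b in zip(ref_pair, new_pair))
    return ["unchanged", "consistent", "compensatory"][changed]

print(classify_pair_change("GC", "GU"))  # one side changed, still pairs
print(classify_pair_change("GC", "AU"))  # both sides changed, still pairs
print(classify_pair_change("GC", "GA"))  # pairing lost
```

Columns of an alignment where many sequence pairs fall into the first two classes are the ones that support a conserved stem.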

Consensus Secondary structure


Secondary structure: RNA functionality depends on structure

Pseudoknot: example SAM II Riboswitch

Tertiary structure: comprises interactions between secondary structure (SS) elements

Family structure: typically each family adopts a characteristic secondary structure. However, there is also structural variability within an ncRNA family.

Dictyostelium discoideum

Candida albicans

Trypanosoma brucei


RNA regulatory elements
• Cis-regulatory element
• Trans-regulatory element

[Figure: Histone 3’ end processing machinery. Labels: histone pre-mRNA, U7 snRNA, B, D3, E, F, G, Lsm10, Lsm11, SLBP, ZFP-100, Symplekin, CPSF-73, CPSF-100]

Riboswitch: part of an mRNA molecule that can directly bind a small target molecule, affecting the gene’s activity (auto-regulation)
• Typically found in the 5’ UTR
• Biosynthesis, catabolism and transport of various cellular metabolites (amino acids [K, G], cofactors, nucleotides and metal ions)
• Most known examples occur in Bacteria
Examples: Serganov A, Patel DJ. Biochim Biophys Acta. 2009; Serganov A et al., Nature Reviews Genetics, 2007

IDENTIFICATION by comparative analysis

SECIS element
Selenocysteine (Sec): approx. 25 selenoproteins
Incorporation of Sec into selenoproteins requires:
1. UGA-Sec
2. Sec tRNA[Ser]Sec
3. SECIS, the selenocysteine insertion sequence element: several kb away from the UGA, in the 3’UTR
4. SRE, the selenocysteine redefinition element: 6 nt downstream of the UGA, in the CDS
5. Several protein factors:
   – EFSec: Sec-specific elongation factor
   – SBP2
   – ribosomal protein L30
   – Secp43: RBP
   – SLA: soluble liver antigen


Eukaryotic SECIS: non-canonical A-G base pairs, K-turn motif

IRE: Iron responsive element
• [Fe]↓: cellular growth arrest and death; anemia, retardation in children
• [Fe]↑: generation of hydroxyl or lipid radicals that damage lipid membranes, proteins, and nucleic acids; hemochromatosis, liver/heart failure
• Balance: iron-responsive element/iron regulatory protein regulatory system

REGULATION
• Ferritin: iron storage protein
• Transferrin receptor: iron acquisition protein
[Figure: regulation of both under [Fe]↓ and [Fe]↑]


Protein vs RNA identification
Protein: BLAST, FastA, PsiBLAST, HMM …
RNA:
• Sequence similarity searches: ribosomal RNAs, RNAs from closely related species
• Patterns/motifs: primary sequence + secondary structure
• Analysis of mutual information
• Prediction of secondary structure
• Statistical models: consensus secondary structure
• Phylogenetic analysis
• Custom-designed programs for a particular RNA class: tRNAscan-SE, SRPscan, miRseeker . . .

Secondary structure representation

Vienna RNA package
• RNAfold -- predict minimum energy secondary structures and pair probabilities
• RNAeval -- evaluate energy of RNA secondary structures
• RNAinverse -- inverse fold (design) sequences with predefined structure
• RNAdistance -- compare secondary structures
• RNAplot -- RNA structure drawings in PostScript, SVG, or GML
• RNAcofold -- predict hybrid structure of two sequences
• RNAduplex -- predict possible hybridization sites between two sequences
• RNAalifold -- predict the consensus structure of several aligned sequences

Mutual information: a quantity that measures the mutual dependence of two variables (here, two alignment positions).


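f_{x_i} = frequency of one of the 4 bases in column i
f_{x_i x_j} = frequency of one of the 16 base pairs in columns i and j

$$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$

M_ij = 2: maximum value, informative (the two columns covary perfectly)
M_ij = 0: conserved positions, not informative

Example:
1 2 3 4
G G C C
G C C G
G A C U
G U C A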
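The example can be checked numerically. The sketch below assumes the flattened slide text is read as four aligned sequences of four columns each (GGCC, GCCG, GACU, GUCA) and that mutual information is measured in bits (log base 2), which is what gives the maximum of 2. The function name is illustrative.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (bits) between 0-based columns i and j."""
    n = len(alignment)
    fi = Counter(seq[i] for seq in alignment)             # base counts, column i
    fj = Counter(seq[j] for seq in alignment)             # base counts, column j
    fij = Counter((seq[i], seq[j]) for seq in alignment)  # pair counts
    total = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return total

alignment = ["GGCC", "GCCG", "GACU", "GUCA"]
print(mutual_information(alignment, 1, 3))  # columns 2 and 4 covary: 2.0
print(mutual_information(alignment, 0, 2))  # columns 1 and 3 are conserved: 0.0
```

Columns 2 and 4 always form a Watson-Crick pair while both vary, so they reach the maximum; columns 1 and 3 never vary and therefore carry no covariation signal.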


The plot: diagonals of covarying positions correspond to the four stems of the tRNA. Dashed lines indicate some of the additional tertiary contacts observed in the yeast tRNA-Phe crystal structure.

PatScan is a pattern matcher (deterministic motifs as well as secondary structure constraints) that searches protein or nucleotide sequence archives.


PatScan pattern example:
r1={au,ua,gc,cg,gu,ug} p1=6...7 GGG[1,0,0] p2=8...9 4...4 r1~p2[1,0,1] 3...4

Covariance models

Regular grammar: primary sequence models
S → aT | bS
T → aS | bT | ε
Example derivation step: S ⇒ aT

Example: model repeat regions (e.g. FMR-1 triplet repeat region)
S → gW1
W1 → cW2
W2 → gW3
W3 → cW4
W4 → gW5
W5 → gW6
W6 → cW7 | aW4 | cW4
W7 → tW8
W8 → g
Triplet strings shown on the slide: gcg cgg ctg gcg cgg agg cgg ctg gag agg ctg gcg agg cgg ctg gcg agg cgg cgg

Context-free grammar: primary sequence models, palindromes
S → aSa | bSb | aa | bb

RNA secondary structure: also a palindrome
S → aW1u | cW1g | gW1c | uW1a
W1 → aW2u | cW2g | gW2c | uW2a
W2 → aW3u | cW3g | gW3c | uW3a
W3 → ggaa | gcaa
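The RNA secondary structure grammar above generates a three-base-pair stem closed by a ggaa or gcaa loop. A minimal recognizer for that language is sketched below: it peels one Watson-Crick pair per production level (S, W1, W2) and then checks the loop (W3). The function and constant names are illustrative and not part of any package.

```python
# Watson-Crick partners allowed by the productions S, W1 and W2.
WC_PARTNER = {"a": "u", "u": "a", "g": "c", "c": "g"}
# Terminal strings allowed by W3.
LOOPS = {"ggaa", "gcaa"}

def derives(seq, stem_length=3):
    """Return True if seq can be derived from S in the hairpin grammar."""
    seq = seq.lower()
    for _ in range(stem_length):
        # Each level consumes one base at each end, and they must pair.
        if len(seq) < 2 or WC_PARTNER.get(seq[0]) != seq[-1]:
            return False
        seq = seq[1:-1]
    return seq in LOOPS

print(derives("gcaggaaugc"))  # True: g-c, c-g, a-u around ggaa
print(derives("gcaggaaugg"))  # False: the outermost g-g does not pair
```

A regular grammar cannot express this check in general, because the base at one end constrains a base arbitrarily far away; the nested productions of the context-free grammar are what capture the pairing.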


Stochastic regular grammar: weighted primary sequence models (Hidden Markov Models)
S → rW1 (0.45)
S → kW1 (0.45)
S → nW1 (0.10)

Stochastic context-free grammar (SCFG)

Covariance models: probabilistic models that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence family

• Search for additional and family-related sequences in sequence databases

• Build a model (automatically) from an existing sequence alignment

Infernal: inference of RNA alignment

Rfam: database containing information about ncRNA families and other structured RNA elements

Comparative analysis: EvoFold


Secondary structure prediction

Nussinov: finding the structure with the most base pairs (dynamic programming)

Drawbacks:
• The structure is not unique
• Testing all possible structures
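A minimal sketch of Nussinov-style base-pair maximization follows. It returns only the maximum number of pairs, which also illustrates the first drawback: several different structures can reach the same count. The `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) and the inclusion of GU wobble pairs are assumptions of this sketch, not details from the slides.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in seq (dynamic programming)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])      # i or j unpaired
            if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)             # i pairs with j
            for k in range(i + 1, j):                   # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAACCC", min_loop=3))  # 3
```

The three nested loops make this cubic in sequence length, which is still far cheaper than enumerating every possible structure.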

Zuker: the correct structure is the one with the lowest equilibrium free energy, which is the sum of individual contributions from loops, base pairs and other secondary structure elements
• Loop energies depend on: loop size, loop type, sequence, temperature, ionic conditions

Vienna package: energy minimization is the most used function in the package

miRNA

• Single-stranded (SS) RNA
• ~22 nucleotides
• Inhibit the translation of mRNAs to their protein products by binding to specific regions in the 3’ UTR
• Account for ~1% of all transcripts in humans and potentially regulate 10%-30% of all genes
• Expressed ubiquitously and highly conserved in Metazoans (animal kingdom)

Biological processes: apoptosis, cell proliferation, cell differentiation, development, organism defense against infections, tissue morphogenesis, regulation of metabolism

Diseases: cancer, viral infections, neurodegenerative disorders, cardiac pathologies, muscle disorders, diabetes


Multiple binding sites: lin-4 is partially complementary to 7 sites in the lin-14 3′ UTR
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009. He, L. and G.J. Hannon, Nat Rev Genet 2004

miRNA genes

Kim VN Nat Rev Mol Cell Biol. 2005 Winter J et al Nat Cell Biol. 2009

Exonic in non-coding transcripts (single, clustered)

Intronic in non-coding transcripts

Intronic in protein coding transcripts

miRNA Biogenesis

Winter, J., S. Jung, S. Keller, R.I. Gregory, and S. Diederichs. Nat Cell Biol 2009. Paul S. Meltzer, Nature, 2005


miRNA structure

Identification

• Homology search based
  – BLAST
  – miRAlign, ProMiR, microHARVESTER
• Gene finding
  – Identification of conserved genomic regions
  – Folding of the identified regions (Mfold, RNAfold)
  – Evaluation of hairpins
  – miRseeker, miRscan
• Neighbour stem loop (~42% of human miRNA genes are clustered together)
  – Check the surroundings of a known miRNA for candidate secondary structures
• Comparative genomics
  – BLAST intergenic sequences of two genomes against each other
  – Filter based on rules inferred from known miRNAs
  – miRFinder
• Intragenomic matching (a functional miRNA should have at least one target)
  – miRNAs show perfect complementarity to their targets (?)
  – Simultaneously predicts miRNAs and their targets
  – miMatcher


miRBase

miRNA target prediction

• Predicting miRNA targets in plants is easier, due to the perfect complementarity to the miRNAs
• In animals, perfect complementarity is not common
  – miRNA seed complementarity (6 to 9 nt)
  – High false-positive rate
• Common approach
  – Experimental evidence
  – Validated miRNA/target pairs
  – TarBase, miRecords
• Computational methods:
  – Base-pairing rules and binding-site sequence features
  – Conservation
  – Thermodynamics
  – Structural accessibility
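Seed complementarity, the first of the computational criteria, can be sketched as a plain string search. The sketch assumes a 7-nt seed at miRNA positions 2 to 8 and Watson-Crick pairing only (no G:U wobble); both are simplifying assumptions, and the function names and input sequences are illustrative rather than taken from the slides.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna, start=2, length=7):
    """Target-site string (5'->3' on the mRNA) matching the miRNA seed."""
    seed = mirna[start - 1:start - 1 + length]   # 1-based seed start
    # The site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def find_seed_sites(mirna, utr, start=2, length=7):
    """0-based positions in the 3' UTR where the seed site occurs."""
    site = seed_site(mirna, start, length)
    hits, pos = [], utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)            # allow overlapping hits
    return hits

mirna = "UGAGGUAGUAGGUUGUAUAGUU"                  # illustrative input
utr = "AAACUACCUCAGGGCUACCUCA"                    # illustrative input
print(seed_site(mirna))                          # CUACCUC
print(find_seed_sites(mirna, utr))               # [3, 14]
```

Because a 7-nt string occurs by chance very often in long UTRs, a search like this on its own produces many false positives, which is why the other criteria in the list are applied as filters.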

Base pairing (Bartel, D.P. Cell 2009):
• 5’ dominant sites
  – 6-9 nt, usually starting at P2
  – P1 is typically unpaired or starts with U
  – Often flanked by A
  – Usually no G:U wobbles (vs regulation)
• 3’ compensatory site
  – May compensate for insufficient base pairing in the seed

Conservation


Thermodynamics
• Evaluation of ΔG of predicted duplexes, usually < -20 kcal/mol
• Discards false positives, but favorable interactions do not always correspond to actual duplexes

Structural accessibility
• The target site on the mRNA should not be involved in any intramolecular base pairs
• Any existing secondary structure must first be removed

miRNAs in cancer

Lu, J., et al., Nature, 2005


miRNAs as tumor suppressors

MiR-29b inhibits leukemic growth in vivo.

(A) Diagram illustrating the experimental design of the mice xenograft experiment. (B) Graphic representing the tumor volume determinations at the indicated days during the experiment for the three groups; mock (n= 6), scrambled (n=12) and synthetic miR-29b (n=12). (C) Tumor weight averages between scrambled and synthetic miR-29b treated mice groups at the end of the experiment (Day +14). P-values were obtained using t-test. Bars represent ±S.D. (D) Photographs of two mice injected with miR-29b (left flank) or scrambled (right flank).