Critique

Paper Critique

Comparative genomics beyond sequence-based alignments:

RNA structures in the ENCODE regions

Amer Talal [email protected]

18/4/2011

Introduction

ENCODE (ENCyclopedia Of DNA Elements) is a pilot project to identify the functional elements in genome sequences.

A major challenge in these projects is to annotate the large number of non-coding RNAs (ncRNAs).

The steadily increasing number of discovered ncRNAs has dramatically changed views on the roles and importance of ncRNAs.

ncRNAs are difficult to find by computational or experimental means.

Computationally finding ncRNAs is difficult because one has to consider secondary structure as well as nucleotide sequence.

But structure can be detected more reliably from a set of related sequences, if available.

The recent approach is to align the sequences first, then do RNA structure inference based on the alignment.
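As a rough illustration of inferring structure from an alignment, the sketch below scores two alignment columns by mutual information, a classic covariation signal for base pairing. The toy alignment and the function name are made up for this example, and this is not the scoring used by CMfinder, RNAz or EvoFold.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length gapped strings. Rows with a gap
    in either column are skipped.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    joint = Counter(pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# Hypothetical toy alignment: columns 0 and 4 always form a Watson-Crick
# pair, although the bases themselves change from row to row.
toy_alignment = [
    "GACUC",
    "CACUG",
    "AACUU",
    "UACUA",
]

covarying = mutual_information(toy_alignment, 0, 4)   # compensatory changes
invariant = mutual_information(toy_alignment, 1, 2)   # fully conserved columns
```

Columns that change together score high, while perfectly conserved columns score zero, which is why sequence conservation alone says little about structure.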


This study describes the first large-scale search for structured ncRNAs in several vertebrate genomes using a local structural motif-finding algorithm, which identified several thousand novel candidate ncRNAs.

Materials and Methods

They used CMfinder, a structure-oriented RNA motif prediction tool, to search the vertebrate multiple alignments of the ENCODE regions.

The CMfinder scan complements the RNAz/EvoFold scans of the ENCODE regions.

They obtained their candidates from the multiple alignment blocks of the UCSC MULTIZ alignments, one block at a time (155 nt long on average).

A group of 11 high-scoring ncRNA candidates was chosen for experimental verification and tested by RT-PCR and Northern blotting.

Ten were confirmed to be present as RNA transcripts in certain tissues.

Their experimental verification shows evidence of significant differential expression across tissues.

Results

They found a large number of potential ncRNAs in the ENCODE regions.

They reported 6587 candidate regions with an estimated false-positive rate of 50%.

With their new candidates they increased the number of ncRNA candidates in the ENCODE regions by 32%.
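To put the reported numbers in perspective, the short calculation below estimates how many candidates would be expected to be real. It assumes the 50% figure means the fraction of reported candidates that are false, which is an interpretation for this sketch and not a statement from the paper.

```python
# Figures quoted on the slide.
reported_candidates = 6587
false_positive_fraction = 0.50  # assumed to apply to the reported candidates

def expected_true_candidates(n_reported, fp_fraction):
    """Expected number of genuine hits among n_reported candidates."""
    return n_reported * (1.0 - fp_fraction)

expected_true = expected_true_candidates(reported_candidates, false_positive_fraction)
```

Under that assumption roughly half of the list, about 3,300 regions, would be expected to hold up.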

Discussion

To demonstrate the possible benefits of structure-aware alignment, they examined the MULTIZ multiple alignment blocks identified by Wang et al. (2007) that have good matches to the Rfam model in all species in the same region of the alignment.

They reported that CMfinder’s alignment of the region differs from the MULTIZ alignment in only 13% of the positions.
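One simple way to quantify how much two alignments of the same sequences differ is sketched below. The measure used here (the fraction of reference residues whose aligned partner changes) and the names `partners` and `fraction_differing` are assumptions for illustration; the slide does not say how the 13% figure was computed.

```python
def partners(ref_gapped, other_gapped):
    """For each residue of the reference, return the index of the residue of
    the other sequence in the same column, or None if it faces a gap."""
    other_index = {}
    k = 0
    for col, ch in enumerate(other_gapped):
        if ch != "-":
            other_index[col] = k
            k += 1
    return [other_index.get(col) for col, ch in enumerate(ref_gapped) if ch != "-"]

def fraction_differing(aln_a, aln_b, ref):
    """Fraction of (reference residue, other sequence) pairings that differ
    between two alignments given as dicts of name -> gapped string."""
    differing = 0
    total = 0
    for name in aln_a:
        if name == ref:
            continue
        pa = partners(aln_a[ref], aln_a[name])
        pb = partners(aln_b[ref], aln_b[name])
        if len(pa) != len(pb):
            raise ValueError("reference sequence differs between alignments")
        total += len(pa)
        differing += sum(x != y for x, y in zip(pa, pb))
    return differing / total if total else 0.0

# Hypothetical two-sequence example: the same sequences, with the gap in
# the second sequence placed one column apart.
sequence_based = {"human": "GGCAUC", "mouse": "GGC-UC"}
structure_based = {"human": "GGCAUC", "mouse": "GG-CUC"}

difference = fraction_differing(sequence_based, structure_based, ref="human")
```

Shifting a single gap by one column changes the partner of two of the six reference residues, so even a small realignment registers under this measure.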

CMfinder is also an alignment-independent method.

CMfinder ignores a sequence if it does not contain the motif, and the program still reports a high-scoring motif for the rest of the sequences.

Also, CMfinder does not remove individual gap-rich sequences, whereas RNAz and EvoFold remove sequences with >25% and >20% gaps, respectively.
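The gap filter mentioned above can be sketched as follows. The 25% and 20% thresholds come from the slide; how RNAz and EvoFold actually count gaps is not stated, so the per-sequence gap fraction used here, the function names and the toy block are assumptions.

```python
def gap_fraction(gapped):
    """Fraction of alignment columns in which this sequence has a gap."""
    if not gapped:
        return 0.0
    return gapped.count("-") / len(gapped)

def filter_by_gaps(block, max_gap_fraction):
    """Keep only sequences whose gap fraction does not exceed the threshold."""
    return {
        name: seq
        for name, seq in block.items()
        if gap_fraction(seq) <= max_gap_fraction
    }

# Hypothetical alignment block (name -> gapped sequence).
block = {
    "human": "ACGUAGUAC",    # no gaps
    "mouse": "ACG--GUAC",    # 2 of 9 columns are gaps (about 22%)
    "chicken": "A----G--C",  # 6 of 9 columns are gaps (about 67%)
}

kept_at_25 = filter_by_gaps(block, 0.25)  # threshold quoted for RNAz
kept_at_20 = filter_by_gaps(block, 0.20)  # threshold quoted for EvoFold
```

A scan without this filter keeps all three sequences and leaves it to the motif finder to ignore the ones that do not carry the motif.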

Although MULTIZ is most frequently shown to be quite accurate in these challenging cases, as a rational proof of cross-species conservation of each motif instance, several studies have occasionally revealed compelling evidence of misalignment.

Even small misalignments have adverse effects on drawing any biological inferences.

Two main misalignment categories:

"partial alignments" and "chimeric alignments"

"partial alignments"

Comprise 5.1% of the MULTIZ sequences.

What is aligned to the ncRNA includes a large gap, either within the same species or among species.

The aligned fragment by itself does not pass the threshold of certain tests for ncRNA family membership.

"chimeric alignments"

Comprise 5.4% of the MULTIZ sequences.

What is aligned to the ncRNA is not a contiguous sequence. Instead, it is composed of sequence fragments from different regions or even different chromosomes.

None of these fragments individually passed the threshold of certain tests of ncRNA family membership.

Structural approaches to distinguish ncRNAs

Using CMfinder and other structural programs to classify transcripts as ncRNAs is likely to lead to significant false-positive rates or false discoveries, since conserved secondary structures are also commonly found in mRNAs (especially 3’ UTRs).

Functional ncRNAs may also contain secondary or tertiary structures with non-canonical base interactions that are not considered by structural prediction programs.

Machine Learning

CMfinder: motif features were integrated for scoring by machine-learning algorithms (Support Vector Machine).

But these methods did not perform well because of:

heterogeneity of the features

limitations of available training data

Suggestions

Limiting the search to the most promising regions.

I suggest the CFTR region (syntenic, few duplications, higher quality of annotation and well conserved)

Using longer blocks (locally aligned sequences) of >300 nt.

Thank You

Questions
