Page 1:

Homology Based Analysis of the Human/Mouse lncRNome

Cédric Notredame, Giovanni Bussotti

Comparative Bioinformatics lab, CRG

Page 2:

Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes

Strategy: PipeR one2many homolog assignment

PipeR Parameters:

Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking

Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)

Template:

genes                                   10840
transcripts                             17547
exons                                   58857
sum of mature transcript length (nt)    16·927·027
real coverage (nt)                      13·083·478
non overlapping loci                    7428

Page 3:

PipeR: a pipeline for mapping lncRNAs

• blast-exonerate based framework to map lncRNAs against target genomes

• algorithm used:

[Figure: PipeR algorithm. Labels: lncRNA, spliced transcript, chromosome 2, Blast hits, mapping extension, Exonerate]
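The mapping extension step can be sketched as follows, using the "seed extension 2X" parameter (twice the genomic span of the query, on both sides of the Blast hits). This is only an illustration, not PipeR's own code: the function name `extend_seed`, the 0-based half-open coordinates and the clamping to the chromosome ends are assumptions.

```python
def extend_seed(hit_start, hit_end, query_genomic_span, chrom_length, factor=2):
    """Return the (start, end) window around a Blast hit region.

    Assumptions (not taken from the slides): coordinates are 0-based and
    half-open, and the window is clamped to the chromosome boundaries.
    """
    if hit_start < 0 or hit_end > chrom_length or hit_start >= hit_end:
        raise ValueError("hit must lie inside the chromosome")
    # Flank of `factor` times the genomic span of the query, on each side.
    flank = factor * query_genomic_span
    start = max(0, hit_start - flank)
    end = min(chrom_length, hit_end + flank)
    return start, end


if __name__ == "__main__":
    # A 500 nt hit region for a query spanning 2000 nt on its own genome.
    print(extend_seed(10_000, 10_500, 2_000, 1_000_000))
```

The resulting window is the region that would be handed to the spliced aligner (Exonerate in the slides).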

Page 4:

PipeR: lncRNA Homology Mapping

GENCODE lncRNAs vs Complete Genomes

1. Anchor points: ENCODE vs Mouse with tuned Blast

2. Extension: Exonerate

3. Filtering: Id and Coverage

4. Validation of the GFF annotation
- Overlap with Annotation
- Overlap with Cufflinks Models
- RPKM on target genome

5. Further Mapping Parameter Space Exploration using Experimental Evidence

GFF File

Notredame, Bussotti
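The filtering step (identity and coverage) could look like the sketch below. The 70% coverage default comes from the Exonerate parameters of Part 1. The identity cutoff is not given in the slides, so `min_identity` is a hypothetical parameter that defaults to no filtering. The function name and the block representation are assumptions as well.

```python
def passes_filter(aligned_query_blocks, query_length, percent_identity,
                  min_coverage=0.70, min_identity=0.0):
    """Decide whether a candidate homolog is kept.

    aligned_query_blocks: (start, end) half-open intervals on the query
    that are part of the alignment (assumed representation).
    min_identity is hypothetical; the slides do not state a cutoff.
    """
    covered = 0
    last_end = 0
    # Walk the blocks in order, counting each query base only once.
    for start, end in sorted(aligned_query_blocks):
        start = max(start, last_end)
        if end > start:
            covered += end - start
            last_end = end
    coverage = covered / query_length
    return coverage >= min_coverage and percent_identity >= min_identity


if __name__ == "__main__":
    print(passes_filter([(0, 400), (350, 800)], 1000, 90.0))  # coverage 0.8
    print(passes_filter([(0, 300)], 1000, 99.0))              # coverage 0.3
```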

Page 5:

Mapping overview

[Figure: transcripts of the query species (Transcript 1, Transcript 2, Transcript 3; Gene A, Gene B) and their homologs in the target species (Homolog 1, Homolog 2, Homolog 3, Homolog 4). Labels: Blast/Exonerate failed; Multiple Homologues; Best reciprocal; Conserved exon number; High repeat coverage; Overlap with protein]

Page 6:

GENCODEv10 vs human genome

• mapped 17327 transcripts out of 17547

• many lncRNAs found in multiple copies (lncRNA families)
- found 144566 homologs corresponding to 501355 exons

• Annotations of discovered homologs are readily available

Page 7:

Homolog repeat coverage

• About 10% of all our homolog predictions are fully covered by repeats

Page 8:

Homolog repeat coverage

• We could sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%
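A minimal sketch of how a homolog could be assigned to these sets. It assumes that repeat coverage means the percentage of exonic bases overlapped by annotated repeats, and that the three sets are nested (every homolog in the <= 20% set is also in the <= 80% and <= 100% sets). The function names and the half-open interval representation are illustrative only.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_coverage(exons, repeats):
    """Percent of exonic bases overlapped by repeats (same chromosome assumed)."""
    exons = merge(exons)
    repeats = merge(repeats)
    total = sum(end - start for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            covered += max(0, min(e_end, r_end) - max(e_start, r_start))
    return 100.0 * covered / total


def coverage_sets(percent, thresholds=(20, 80, 100)):
    """Return every threshold set the homolog belongs to (nested sets)."""
    return [t for t in thresholds if percent <= t]


if __name__ == "__main__":
    pct = repeat_coverage([(0, 100), (200, 300)], [(50, 120), (90, 110), (250, 260)])
    print(pct, coverage_sets(pct))
```

The pairwise loop is quadratic and only suitable for illustration. A genome-wide run would use a sweep over sorted intervals or an interval index.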

Page 9:

Mapping statistics (HUMAN)

                                                                    <= 20%   <= 80%   <= 100%
genV10 mapped genes                                                   6088    10425     10698
genV10 mapped transcripts                                             9318    16856     17327
Total homologs                                                       35399   102250    144566
Homologs whose exons overlap protein coding exons (same strand)       3621     5076      8988

Page 10:

GENCODEv10 vs mouse genome

• mapped 3190 transcripts out of 17547 representing 2249 human genes

• many lncRNAs found in multiple copies (lncRNA families)
- found 14936 homologs corresponding to 38910 exons

• Annotations of discovered homologs are readily available

Page 11:

Human/Mouse Exon Number Conservation

• Difference between the number of exons in the human transcripts and in the mouse homologs

• “0” means that the exon number is the same

• Negative bins indicate mouse homologs having more exons than the human query

• 1160 GENCODE v10 transcripts have at least 1 homolog in mouse with the same exon number

[Figure: distribution of exon number differences; axis regions labeled "human > mouse" and "human < mouse"]
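The binning described above can be sketched in a few lines. The difference is taken as query exon count minus homolog exon count, so negative bins correspond to homologs with more exons than the query. The function names and the input tuples are assumptions made for illustration.

```python
from collections import Counter


def exon_difference_histogram(pairs):
    """pairs: iterable of (query_exon_count, homolog_exon_count)."""
    return Counter(query - homolog for query, homolog in pairs)


def queries_with_conserved_exon_number(records):
    """records: iterable of (query_id, query_exon_count, homolog_exon_count).

    Returns the queries having at least one homolog with the same exon number.
    """
    return {qid for qid, query, homolog in records if query == homolog}


if __name__ == "__main__":
    hist = exon_difference_histogram([(3, 3), (2, 4), (5, 3), (3, 3)])
    for diff in sorted(hist):
        print(diff, hist[diff])
```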

Page 12:

Homolog repeat coverage

• We could sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%

Page 13:

Mapping statistics (MOUSE)

                                                                    <= 20%   <= 80%   <= 100%   Reciprocal homologs
genV10 mapped genes                                                   1867     2172      2249                  1445
genV10 mapped transcripts                                             2586     3076      3190                  1966
Total homologs                                                        6108    11141     14936                  1966
Homologs whose exons overlap protein coding exons (same strand)       1611     2290      3177                   497
Homologs with conserved number of exons                               1534     2407      2958                   689

Best Candidates: There are 148 transcripts that have < 20% repeat coverage, conserved exon structure, do not overlap protein coding exons and are best reciprocal homologs with the human queries
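The four best-candidate criteria can be combined as in the sketch below. The `Homolog` record and its field names are hypothetical, and "conserved exon structure" is approximated here by an identical exon count, which is an assumption rather than the definition used for the slides.

```python
from dataclasses import dataclass


@dataclass
class Homolog:
    # Hypothetical record; field names are not taken from PipeR output.
    query_id: str
    repeat_coverage: float      # percent of exonic bases covered by repeats
    query_exons: int
    homolog_exons: int
    overlaps_coding_exon: bool  # same-strand overlap with a coding exon
    best_reciprocal: bool


def is_best_candidate(h, max_repeat=20.0):
    """Apply the four criteria listed for the best candidates."""
    return (h.repeat_coverage < max_repeat
            and h.query_exons == h.homolog_exons  # assumed proxy for structure
            and not h.overlaps_coding_exon
            and h.best_reciprocal)


def best_candidates(homologs):
    """Return the sorted query transcript ids with a qualifying homolog."""
    return sorted({h.query_id for h in homologs if is_best_candidate(h)})


if __name__ == "__main__":
    demo = [
        Homolog("t1", 5.0, 3, 3, False, True),
        Homolog("t2", 5.0, 3, 4, False, True),
        Homolog("t3", 50.0, 2, 2, False, True),
    ]
    print(best_candidates(demo))
```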

Page 14:

PipeR: lncRNA Homology Mapping

GENCODE lncRNAs vs Complete Genomes

1. Anchor points: ENCODE vs Mouse with tuned Blast

2. Extension: Exonerate

3. Filtering: Id and Coverage

4. Validation of the GFF annotation
- Overlap with Annotation
- Overlap with Cufflinks Models
- RPKM on target genome

5. Further Mapping Parameter Space Exploration using Experimental Evidence

GFF File

Notredame, Bussotti
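For the "RPKM on target genome" validation, the standard definition of RPKM (reads per kilobase of transcript per million mapped reads) is shown below as a small function. How the reads are counted per predicted homolog is not described in the slides, so the inputs here are assumed to be already available.

```python
def rpkm(reads_in_feature, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = reads * 10^9 / (feature length in nt * total mapped reads)
    """
    if feature_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return reads_in_feature * 1e9 / (feature_length_nt * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2000 nt homolog in a library of 10 million mapped reads.
    print(rpkm(500, 2_000, 10_000_000))
```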

Page 15:

BlastR vs The World

Page 16:

BlastR vs The World

Page 17:

Figure 2: Exon read support.

a) Venn diagram indicating the number of exons detected by the different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods): blastn (8749), blastr (12093), blastnOpt (12487), all (7492)

b) Average amount of reads per exon (x axis: methods blastn, blastnOpt, blastr, all; y axis: average reads per exon, 800 to 1,400)

c) Percent of reads covered by at least one exon (x axis: methods blastn, blastnOpt, blastr, all; y axis: % exons with read, 60 to 80)

Page 18:

Part 2: Ensembl.v65 lncRNAs screening vs human and mouse genomes

Strategy: PipeR one2many homolog assignment

PipeR Parameters:

Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking

Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)

Template:

genes                                   3845
transcripts                             5669
exons                                   18353
sum of mature transcript length (nt)    7279679
real coverage (nt)                      6091050
non overlapping loci                    2790

Page 19:

Ensembl.v65 vs human genome

• mapped 1187 transcripts out of 5669

• many lncRNAs found in multiple copies (lncRNA families)
- found 13193 homologs corresponding to 46770 exons

• Annotations of discovered homologs are readily available

Page 20:

Ensembl.v65 vs mouse genome

• mapped 5622 transcripts out of 5669

• many lncRNAs found in multiple copies (lncRNA families)
- found 41005 homologs corresponding to 121515 exons

• Annotations of discovered homologs are readily available

Page 21:

Mouse/Human Exon Number Conservation

• Difference between the number of exons in the mouse transcripts and in the human homologs

• “0” means that the exon number is the same

• Negative bins indicate human homologs having more exons than the mouse query

• 481 Ensembl.v65 transcripts have at least 1 homolog in human with the same exon number

[Figure: distribution of exon number differences; axis regions labeled "mouse > human" and "mouse < human"]

Page 22:

Homolog repeat coverage

• No peak of homolog predictions fully covered by repeats was observed

Page 23:

Ensembl.v65 and GENCODEv10 repeat coverage

• Input lncRNA datasets have similar repeat distributions

Page 24:

Mapping statistics (MOUSE)

ensV65 mapped genes                                                          3815
ensV65 mapped transcripts                                                    5622
Total homologs                                                               41005
Homologs whose exons overlap protein coding exons (same strand)              10086

Mapping statistics (HUMAN)

ensV65 mapped genes                                                          879
ensV65 mapped transcripts                                                    1187
Total homologs                                                               13193
Homologs whose exons overlap protein coding exons (same strand)              3642
Homologs whose exons do not overlap any gencode v10 element (same strand)    6085
Homologs with conserved number of exons                                      4925

Page 25:

Part 3: GENCODE v10 lncRNA coding potential check

Strategies:

1) GeneId ORF score comparison between mRNAs and lncRNAs

2) BlastX against human proteins (ensembl 65)

3) Overlap with protein coding gene exon annotations (gencodeV10)

4) PipeR filtering routines

Page 26:

1) ORF scores as returned by GeneID

2) blastX against human proteins indicates that 1202 GENCODE v10 lncRNAs match proteins

Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search just the plus strand. Human Ensembl 65 protein set
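Counting the lncRNAs that match proteins could be done as sketched below, assuming the blastX results were written in the standard 12-column tabular BLAST format (query id in column 1, e-value in column 11). The function name is hypothetical, and the plus-strand restriction and the seg/repeat filtering are assumed to have been applied when running the search. The threshold is kept exactly as written on the slide.

```python
def queries_with_protein_hits(tabular_lines, max_evalue=float("10e-10")):
    """Return the set of query ids with at least one hit at or below max_evalue.

    Assumes tab-separated lines with the 12 standard tabular BLAST columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Note that Python reads "10e-10" as 1e-09.
    """
    matched = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            matched.add(fields[0])
    return matched


if __name__ == "__main__":
    demo = [
        "lnc1\tprotA\t95.0\t120\t6\t0\t1\t360\t1\t120\t1e-20\t250",
        "lnc2\tprotB\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t35",
    ]
    print(sorted(queries_with_protein_hits(demo)))
```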

Page 27:

3)
- Checked the overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons.
- Found 846 lncRNAs having at least one exon overlapping with a protein coding gene exon

[Figures: Example 1, Example 2]
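The exon overlap check can be illustrated with a simple interval test. The tuple layout `(chrom, start, end, strand)`, the half-open coordinates and the function names are assumptions. The same-strand requirement is taken from the mapping statistics tables and exposed as a parameter, since this slide does not state it.

```python
def exons_overlap(a, b, same_strand=True):
    """a, b: (chrom, start, end, strand) with half-open coordinates (assumed)."""
    if a[0] != b[0]:
        return False
    if same_strand and a[3] != b[3]:
        return False
    return a[1] < b[2] and b[1] < a[2]


def lncrnas_overlapping_coding(lncrna_exons, coding_exons, same_strand=True):
    """lncrna_exons: dict mapping transcript id to a list of exon tuples.

    Returns the ids with at least one exon overlapping a coding exon.
    """
    return {
        tid
        for tid, exons in lncrna_exons.items()
        if any(exons_overlap(e, c, same_strand) for e in exons for c in coding_exons)
    }


if __name__ == "__main__":
    coding = [("chr1", 100, 200, "+")]
    lnc = {
        "lncA": [("chr1", 150, 250, "+")],
        "lncB": [("chr1", 150, 250, "-")],
    }
    print(sorted(lncrnas_overlapping_coding(lnc, coding)))
```

The all-against-all comparison is only meant for small examples. A full annotation would call for sorted intervals or an interval tree.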

Page 28:

4) Extensive filtering

7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines

Filtering rules:
- overlap with protein coding exons
- geneID ORF score similar to the ones of mRNA
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam