Homology Based Analysis of the Human/Mouse lncRNome
![Page 1: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/1.jpg)
Homology Based Analysis of the Human/Mouse lncRNome
Cédric Notredame, Giovanni Bussotti
Comparative Bioinformatics lab, CRG
![Page 2: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/2.jpg)
Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
Template:
genes: 10840
transcripts: 17547
exons: 58857
sum of mature transcript length (nt): 16,927,027
real coverage (nt): 13,083,478
non overlapping loci: 7428
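The "seed extension 2X" parameter can be read as widening each Blast hit by twice the query's genomic span on both sides before the spliced alignment is run. The sketch below follows that reading only; the function name, the coordinate convention (0-based, half-open) and the clipping to the chromosome ends are assumptions, not PipeR's actual code.

```python
def extension_window(hit_start, hit_end, query_genomic_span, chrom_length, factor=2):
    """Return the (start, end) target region around a Blast hit.

    Assumed interpretation: the hit is extended by factor * query_genomic_span
    on each side and clipped to the chromosome boundaries.
    """
    flank = factor * query_genomic_span
    start = max(0, hit_start - flank)
    end = min(chrom_length, hit_end + flank)
    return start, end


# Hypothetical hit at 50,000-50,300 for a query whose locus spans 4,000 nt.
window = extension_window(50_000, 50_300, 4_000, 1_000_000)
print(window)
```

Clipping matters for hits close to a chromosome end, where the full flank is not available.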
![Page 3: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/3.jpg)
PipeR: a pipeline for mapping lncRNAs
• blast-exonerate based framework to map lncRNAs against target genomes
• algorithm used:
[Figure labels: lncRNA; chromosome2; Blast hits; mapping extension; Exonerate; spliced transcript]
![Page 4: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/4.jpg)
PipeR: lncRNA Homology Mapping
GENCODE lncRNAs vs complete genomes
1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: identity and coverage
4. Validation of the GFF annotation
   - Overlap with annotation
   - Overlap with Cufflinks models
   - RPKM on target genome
5. Further mapping parameter space exploration using experimental evidence
Output: GFF file
Notredame, Bussotti
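Step 3 (filtering on identity and coverage) can be sketched as a simple predicate over alignments. The 70% coverage default comes from the parameter slide; the `Alignment` record, its field names and the identity threshold are hypothetical, since the slides do not give an identity cutoff.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_id: str
    query_length: int    # length of the mature lncRNA (nt)
    aligned_length: int  # query nucleotides covered by the alignment
    identity: float      # percent identity, 0-100


def passes_filter(aln, min_coverage=70.0, min_identity=0.0):
    """Keep an alignment only if it covers enough of the query.

    min_identity defaults to 0 because no identity cutoff is stated
    in the slides; set it explicitly if one is known.
    """
    coverage = 100.0 * aln.aligned_length / aln.query_length
    return coverage >= min_coverage and aln.identity >= min_identity


alignments = [
    Alignment("lnc1", 1000, 900, 85.0),
    Alignment("lnc2", 1000, 500, 95.0),
]
kept = [a.query_id for a in alignments if passes_filter(a)]
print(kept)
```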
![Page 5: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/5.jpg)
Mapping overview
[Diagram: transcripts 1 to 3 of genes A and B in the query species are mapped to homologs 1 to 4 in the target species. Labelled outcomes: Blast/Exonerate failed; multiple homologues; best reciprocal; conserved exon number; high repeat coverage; overlap with protein.]
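The "best reciprocal" label can be illustrated with a small sketch: a query and a homolog form a best reciprocal pair when each is the other's top hit. The slides do not say how "best" is scored, so the use of a single numeric alignment score, and the function names, are assumptions.

```python
def best_hits(scores):
    """scores: dict mapping (query, target) -> alignment score.

    Returns a dict query -> highest-scoring target.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal(forward, backward):
    """Pairs (query, target) that are each other's best hit.

    forward:  scores of query-species transcripts mapped on the target genome
    backward: scores of the predicted homologs mapped back on the query genome
    """
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return {(q, t) for q, t in fwd.items() if bwd.get(t) == q}


# Hypothetical scores for two transcripts and two homologs.
forward = {("T1", "H1"): 90, ("T1", "H2"): 80, ("T2", "H2"): 70}
backward = {("H1", "T1"): 88, ("H2", "T1"): 85}
pairs = best_reciprocal(forward, backward)
print(pairs)
```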
![Page 6: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/6.jpg)
GENCODEv10 vs human genome
• mapped 17327 transcripts out of 17547
• many lncRNAs found in multiple copies (lncRNA families)
  - found 144566 homologs corresponding to 501355 exons
• Annotations of discovered homologs are readily available
![Page 7: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/7.jpg)
Homolog repeat coverage
• About 10% of all our homolog predictions are fully covered by repeats
![Page 8: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/8.jpg)
Homolog repeat coverage
• We can sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%
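A minimal sketch of how repeat coverage and the three sets could be computed. It assumes exons and repeats are given as 0-based half-open intervals on one chromosome and that repeats do not overlap each other; the function names are illustrative. The sets are built cumulatively (a homolog with 10% coverage belongs to all three), which matches the mapping statistics, where the counts grow from one threshold to the next.

```python
def covered_length(exons, repeats):
    """Number of exon nucleotides overlapped by repeat intervals."""
    total = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            total += max(0, min(e_end, r_end) - max(e_start, r_start))
    return total


def repeat_coverage(exons, repeats):
    """Percent of the homolog's exonic sequence covered by repeats."""
    length = sum(end - start for start, end in exons)
    return 100.0 * covered_length(exons, repeats) / length


def nested_sets(coverage_by_homolog, thresholds=(20, 80, 100)):
    """Map each threshold to the homologs with coverage <= threshold."""
    return {
        t: {h for h, c in coverage_by_homolog.items() if c <= t}
        for t in thresholds
    }


# Hypothetical homolog with two exons and two repeat annotations.
cov = repeat_coverage([(0, 100), (200, 300)], [(50, 120), (290, 400)])
groups = nested_sets({"h1": 10.0, "h2": cov, "h3": 100.0})
print(cov, groups)
```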
![Page 9: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/9.jpg)
Mapping statistics (HUMAN)

| | <= 20% | <= 80% | <= 100% |
|---|---|---|---|
| genV10 mapped genes | 6088 | 10425 | 10698 |
| genV10 mapped transcripts | 9318 | 16856 | 17327 |
| Total homologs | 35399 | 102250 | 144566 |
| Homologs whose exons overlap protein coding exons (same strand) | 3621 | 5076 | 8988 |
![Page 10: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/10.jpg)
GENCODEv10 vs mouse genome
• mapped 3190 transcripts out of 17547 representing 2249 human genes
• many lncRNAs found in multiple copies (lncRNA families)
  - found 14936 homologs corresponding to 38910 exons
• Annotations of discovered homologs are readily available
![Page 11: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/11.jpg)
Human/Mouse Exon Number Conservation
• The histogram shows the difference between the number of exons in the human transcripts and in the mouse homologs
• “0” means that the exon number is the same
• Negative bins indicate mouse homologs having more exons than the human query
• 1160 GENCODE v10 transcripts find at least 1 homolog in mouse with the same exon number
[Histogram annotations: human > mouse; human < mouse]
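The binning described on this slide can be sketched as follows, taking the difference as query exons minus homolog exons so that negative bins mean the homolog has more exons. The input layout and function names are assumptions made for illustration.

```python
from collections import Counter


def exon_difference_histogram(homologs):
    """homologs: iterable of (query_id, query_exons, homolog_exons).

    Returns a Counter mapping (query_exons - homolog_exons) to the
    number of homologs in that bin.
    """
    return Counter(q - h for _, q, h in homologs)


def queries_with_conserved_homolog(homologs):
    """Query transcripts with at least one homolog of equal exon number."""
    return {qid for qid, q, h in homologs if q == h}


# Hypothetical homolog predictions for three query transcripts.
homologs = [
    ("T1", 3, 3),
    ("T1", 3, 5),
    ("T2", 4, 2),
    ("T3", 2, 2),
]
histogram = exon_difference_histogram(homologs)
conserved = queries_with_conserved_homolog(homologs)
print(sorted(histogram.items()), conserved)
```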
![Page 12: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/12.jpg)
Homolog repeat coverage
• We can sub-group the homologs into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%
![Page 13: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/13.jpg)
Mapping statistics (MOUSE)

| | <= 20% | <= 80% | <= 100% | Reciprocal homologs |
|---|---|---|---|---|
| genV10 mapped genes | 1867 | 2172 | 2249 | 1445 |
| genV10 mapped transcripts | 2586 | 3076 | 3190 | 1966 |
| Total homologs | 6108 | 11141 | 14936 | 1966 |
| Homologs whose exons overlap protein coding exons (same strand) | 1611 | 2290 | 3177 | 497 |
| Homologs with conserved number of exons | 1534 | 2407 | 2958 | 689 |

Best candidates: 148 transcripts have < 20% repeat coverage, a conserved exon structure, no overlap with protein coding exons, and are best reciprocal homologs of the human queries.
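The four criteria for the best candidates combine into a single predicate. The sketch below is illustrative only: the `Homolog` record and its field names are hypothetical, and "conserved exon structure" is approximated by an equal exon count, which may be looser than what was actually checked.

```python
from dataclasses import dataclass


@dataclass
class Homolog:
    query_id: str
    repeat_coverage: float       # percent of exonic sequence in repeats
    query_exons: int
    homolog_exons: int
    overlaps_coding_exon: bool   # same-strand overlap with a coding exon
    best_reciprocal: bool


def is_best_candidate(h, max_repeat=20.0):
    """Apply the four criteria listed for the best candidates."""
    return (
        h.repeat_coverage < max_repeat
        and h.query_exons == h.homolog_exons  # assumed proxy for exon structure
        and not h.overlaps_coding_exon
        and h.best_reciprocal
    )


candidates = [
    Homolog("T1", 5.0, 3, 3, False, True),
    Homolog("T2", 5.0, 3, 4, False, True),
    Homolog("T3", 45.0, 2, 2, False, True),
    Homolog("T4", 0.0, 2, 2, True, True),
]
best = {h.query_id for h in candidates if is_best_candidate(h)}
print(best)
```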
![Page 14: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/14.jpg)
PipeR: lncRNA Homology Mapping
GENCODE lncRNAs vs complete genomes
1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: identity and coverage
4. Validation of the GFF annotation
   - Overlap with annotation
   - Overlap with Cufflinks models
   - RPKM on target genome
5. Further mapping parameter space exploration using experimental evidence
Output: GFF file
Notredame, Bussotti
![Page 15: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/15.jpg)
BlastR vs The World
![Page 16: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/16.jpg)
BlastR vs The World
![Page 17: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/17.jpg)
[Figure 2 panels: a Venn diagram of exons detected by blastn (8749), blastnOpt (12487) and blastr (12093), with 7492 shared by all three; two bar plots over the methods blastn, blastnOpt, blastr and all, with y-axes "% exons with read" (ticks 60 to 80) and "average reads per exon" (ticks 800 to 1,400).]
Figure 2: Exon read support.
a) Venn diagram indicating the number of exons detected by the different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods).
b) Average number of reads per exon.
c) Percent of exons covered by at least one read.
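The two quantities plotted in Figure 2 can be computed from per-exon read counts as sketched below. Whether the average was taken over all predicted exons or only over supported ones is not stated; this sketch assumes all exons, and the function name and input layout are illustrative.

```python
def exon_read_support(read_counts):
    """read_counts: dict exon_id -> number of reads mapped to that exon.

    Returns (percent of exons with at least one read,
             average number of reads per exon over all exons).
    """
    n_exons = len(read_counts)
    supported = sum(1 for count in read_counts.values() if count > 0)
    percent_with_read = 100.0 * supported / n_exons
    average_reads = sum(read_counts.values()) / n_exons
    return percent_with_read, average_reads


# Hypothetical read counts for the exons predicted by one method.
support = exon_read_support({"e1": 0, "e2": 10, "e3": 20, "e4": 0})
print(support)
```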
![Page 18: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/18.jpg)
Part 2: Ensembl.v65 lncRNAs screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
Template:
genes: 3845
transcripts: 5669
exons: 18353
sum of mature transcript length (nt): 7,279,679
real coverage (nt): 6,091,050
non overlapping loci: 2790
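The gap between "sum of mature transcript length" and "real coverage" is what one would expect if the latter counts each genomic nucleotide once, even when several transcripts share it. Under that assumption (the slides do not define the term), both numbers follow from merging exon intervals, as in this sketch for a single chromosome and strand with 0-based half-open coordinates.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def length_and_coverage(exons):
    """Return (summed exon length, non-redundant covered nucleotides)."""
    total = sum(end - start for start, end in exons)
    covered = sum(end - start for start, end in merge_intervals(exons))
    return total, covered


# Hypothetical exons from two overlapping transcripts plus one isolated exon.
stats = length_and_coverage([(0, 100), (50, 150), (200, 250)])
print(stats)
```

The same merge applied to whole-transcript spans instead of exons would give a count of non overlapping loci, again assuming that is how the slide defines them.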
![Page 19: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/19.jpg)
Ensembl.v65 vs human genome
• mapped 1187 transcripts out of 5669
• many lncRNAs found in multiple copies (lncRNA families)
  - found 13193 homologs corresponding to 46770 exons
• Annotations of discovered homologs are readily available
![Page 20: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/20.jpg)
Ensembl.v65 vs mouse genome
• mapped 5622 transcripts out of 5669
• many lncRNAs found in multiple copies (lncRNA families)
  - found 41005 homologs corresponding to 121515 exons
• Annotations of discovered homologs are readily available
![Page 21: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/21.jpg)
Mouse/Human Exon Number Conservation
• The histogram shows the difference between the number of exons in the mouse transcripts and in the human homologs
• “0” means that the exon number is the same
• Negative bins indicate human homologs having more exons than the mouse query
• 481 Ensembl.v65 transcripts find at least 1 homolog in human with the same exon number
[Histogram annotations: mouse > human; mouse < human]
![Page 22: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/22.jpg)
Homolog repeat coverage
• No peak of homolog predictions fully covered by repeats was observed
![Page 23: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/23.jpg)
Ensembl.v65 and GENCODEv10 repeat coverage
• Input lncRNA datasets have similar repeat distributions
![Page 24: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/24.jpg)
Mapping statistics

MOUSE

| Statistic | Count |
|---|---|
| ensV65 mapped genes | 3815 |
| ensV65 mapped transcripts | 5622 |
| Total homologs | 41005 |
| Homologs whose exons overlap protein coding exons (same strand) | 10086 |

HUMAN

| Statistic | Count |
|---|---|
| ensV65 mapped genes | 879 |
| ensV65 mapped transcripts | 1187 |
| Total homologs | 13193 |
| Homologs whose exons overlap protein coding exons (same strand) | 3642 |
| Homologs whose exons do not overlap any gencode v10 element (same strand) | 6085 |
| Homologs with conserved number of exons | 4925 |
![Page 25: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/25.jpg)
Part 3: GENCODE v10 lncRNA coding potential check
Strategies:
1) GeneID ORF score comparison between mRNAs and lncRNAs
2) BlastX against human proteins (Ensembl 65)
3) Overlap with protein coding gene exon annotations (GENCODE v10)
4) PipeR filtering routines
![Page 26: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/26.jpg)
1) ORF scores as returned by GeneID
2) blastX against human proteins indicates that 1202 GENCODE v10 lncRNAs match proteins
Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search just the plus strand.
Database: human Ensembl 65 protein set
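Counting the lncRNAs that match a protein amounts to collecting the distinct query identifiers with a hit under the e-value cutoff. The sketch below assumes the blastX results were written in the standard 12-column tabular BLAST format (query id in column 1, e-value in column 11); the slides do not say which output format was used, and the function name is illustrative.

```python
def queries_with_protein_hit(lines, max_evalue=10e-10):
    """Collect query ids having at least one hit with evalue <= max_evalue.

    lines: iterable of tab-separated BLAST tabular records
    (qseqid, sseqid, pident, length, mismatch, gapopen,
     qstart, qend, sstart, send, evalue, bitscore).
    """
    hits = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


# Hypothetical records: lnc1 has a significant hit, lnc2 does not.
records = [
    "lnc1\tP1\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-20\t250\n",
    "lnc1\tP2\t40.0\t60\t30\t2\t10\t190\t5\t64\t0.001\t35\n",
    "lnc2\tP3\t35.0\t50\t28\t3\t1\t150\t1\t50\t0.5\t22\n",
]
matched = queries_with_protein_hit(records)
print(len(matched), matched)
```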
![Page 27: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/27.jpg)
3) Overlap with protein coding exons
- Checked the overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons.
- Found 846 lncRNAs having at least one exon overlapping with a protein coding gene exon
[Figures: Example 1; Example 2]
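The exon overlap check can be sketched with a plain interval test. Exons are represented here as (chromosome, start, end, strand) tuples with 0-based half-open coordinates; requiring the same strand follows the wording of the mapping statistics tables and is an assumption for this particular check. A real run over full annotations would need an indexed search rather than this all-against-all loop.

```python
def exons_overlap(a, b):
    """True if two exons share at least one nucleotide on the same strand."""
    same_place = a[0] == b[0] and a[3] == b[3]
    return same_place and a[1] < b[2] and b[1] < a[2]


def lncrnas_overlapping_coding(lnc_exons, coding_exons):
    """lnc_exons: dict transcript_id -> list of exon tuples.

    Returns the transcripts with at least one exon overlapping a coding exon.
    """
    return {
        tid
        for tid, exons in lnc_exons.items()
        if any(exons_overlap(e, c) for e in exons for c in coding_exons)
    }


# Hypothetical annotations.
coding = [("chr1", 100, 200, "+"), ("chr2", 500, 600, "-")]
lnc = {
    "lncA": [("chr1", 150, 250, "+")],   # overlaps on the same strand
    "lncB": [("chr1", 150, 250, "-")],   # opposite strand only
    "lncC": [("chr2", 600, 700, "-")],   # adjacent, no shared nucleotide
}
flagged = lncrnas_overlapping_coding(lnc, coding)
print(flagged)
```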
![Page 28: Homology Based Analysis of the Human/Mouse lncRNome](https://reader035.fdocuments.us/reader035/viewer/2022062501/568166cc550346895ddad98c/html5/thumbnails/28.jpg)
4) Extensive filtering
7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines
Filtering rules:
- overlap with protein coding exons
- geneID ORF score similar to the ones of mRNA
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam
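Passing all routines means being flagged by none of them, so the final set is the full transcript list minus the union of the per-filter flagged sets. This is a sketch of that set logic only; the filter names and identifiers are hypothetical and it does not reproduce how PipeR runs each routine.

```python
def passing_all(all_ids, flagged_by_filter):
    """Transcripts not flagged by any filter.

    all_ids: iterable of transcript ids
    flagged_by_filter: dict filter_name -> set of flagged transcript ids
    """
    flagged = set().union(*flagged_by_filter.values())
    return set(all_ids) - flagged


# Hypothetical results of three of the filtering rules.
flags = {
    "coding_exon_overlap": {"T2"},
    "blastx_uniprot": {"T2", "T4"},
    "rfam": set(),
}
clean = passing_all(["T1", "T2", "T3", "T4"], flags)
print(sorted(clean))
```

Keeping the flagged sets per filter also makes it easy to report how many transcripts each rule removes on its own.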