“V Jornada de Usuarios de la RES”
Use of TBLASTX to find regions of homology
among multiple large-size mammalian genomes
Francisco Câmara Ferreira
Bioinformatics & Genomics Unit (Roderic Guigó, CRG)
Why TBLASTX?
• SGP2 (Parra et al. 2003)
• ab initio Geneid + sequence similarity search algorithm (TBLASTX)
SGP2 is a comparative gene prediction tool. QUERY sequences from one genome (e.g. H. sapiens) are compared with TBLASTX against a collection of sequences from a second TARGET (REFERENCE) genome (e.g. M. musculus). The resulting high-scoring segment pairs (“HSPs”) are then used to modify the scores of the exons produced by the underlying ab initio gene prediction tool, GENEID.
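To make the idea of HSPs modifying exon scores concrete, here is a minimal Python sketch. It is not the SGP2 scoring scheme, which these slides do not spell out: the `Interval` type, the `rescore_exon` helper, the `weight` parameter and the bonus formula are all hypothetical, chosen only to show an exon score rising when TBLASTX hits overlap the exon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A scored region on the query sequence (1-based, inclusive)."""
    start: int
    end: int
    score: float


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def rescore_exon(exon: Interval, hsps: list[Interval], weight: float = 0.2) -> Interval:
    """Return the exon with a similarity bonus added to its ab initio score.

    ASSUMPTION: the bonus is weight * HSP score * fraction of the HSP that
    falls inside the exon. This formula is illustrative only.
    """
    bonus = 0.0
    for hsp in hsps:
        shared = overlap(exon, hsp)
        if shared:
            hsp_length = hsp.end - hsp.start + 1
            bonus += weight * hsp.score * shared / hsp_length
    return Interval(exon.start, exon.end, exon.score + bonus)


exon = Interval(100, 199, 5.0)
hsps = [Interval(150, 249, 80.0), Interval(400, 450, 90.0)]
rescored = rescore_exon(exon, hsps)
```

In this toy case only the first HSP overlaps the exon, so only it contributes to the new score; the second HSP lies elsewhere on the sequence and is ignored.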
WHAT IS SGP2?
Geneid:
• Geneid is a protein-coding gene prediction tool that can be optimized for prediction in different species.
• Geneid follows a hierarchical structure: signal -> exon -> gene
• Exon score: score of exon-defining signals + protein-coding potential
• Dynamic programming algorithm: maximize score of assembled exons -> assembled gene
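The dynamic programming step can be illustrated with a simplified Python sketch. It is not Geneid's algorithm: it only picks the highest-scoring chain of non-overlapping candidate exons and ignores reading-frame compatibility, exon types and intron length limits, all of which a real gene model has to respect. The function name `best_gene` and the tuple layout are assumptions made for this example.

```python
from bisect import bisect_left


def best_gene(exons):
    """Pick the non-overlapping exon chain with the highest total score.

    exons: list of (start, end, score) tuples, 1-based inclusive coordinates.
    Returns (total_score, chain), with the chain in left-to-right order.
    Simplified sketch: frame and exon-type constraints are ignored.
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)  # best[i]: optimum over the first i exons
    take = [False] * len(exons)      # take[i]: is exon i part of best[i + 1]?
    prev = [0] * len(exons)          # prev[i]: exons ending before exon i starts

    for i, (start, _end, score) in enumerate(exons):
        p = bisect_left(ends, start)  # count of exons ending strictly before start
        prev[i] = p
        if best[p] + score > best[i]:
            best[i + 1] = best[p] + score
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Trace back from the full problem to recover the chosen exons.
    chain = []
    i = len(exons)
    while i > 0:
        if take[i - 1]:
            chain.append(exons[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chain.reverse()
    return best[len(exons)], chain


candidates = [(1, 100, 3.0), (50, 150, 5.0), (120, 200, 4.0), (210, 300, 2.0)]
total, gene = best_gene(candidates)
```

Here the single best exon, (50, 150), is left out because the three compatible exons around it score more in total, which is the point of maximizing over the assembled gene rather than over individual exons.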
SGP2
TBLASTX as a gene-prediction tool
Coding sequences evolve slowly
compared to surrounding DNA
“Proper” evolutionary distance?
TBLASTX CHR5_1_5000000 CHR1_mm -hspmax=0 -gspmax=0 W=5 E=0.01 E2=0.01 -nogap -filter=xnu+seg S2=80 -matrix=blosum62 -altscore="* any -999" -altscore="any * -999"
TBLASTX is computationally expensive
“flavour” of BLAST
6-frame translation of query/target
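The 6-frame translation is the reason for the cost: every DNA sequence becomes six protein sequences, and each translated query frame is compared with each translated target frame. A minimal Python sketch of the translation step follows, assuming the standard genetic code; the helper names `translate` and `six_frames` are illustrative and are not part of BLAST.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Stop codons are kept as `*` in the translated frames.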
Why MareNostrum?
• H.sapiens vs. M.musculus
• 7-10 days on a 20-25 CPU grid
• 12-13 hours on 256 CPUs
• Multiple genomes compared concurrently
PARALLELIZATION!
LARGE SIZE OF MAMMALIAN GENOMES (e.g. Human & Cow ~3 Gbases, Mouse 2.5 Gbases…)
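The run times quoted above for H.sapiens vs. M.musculus can be turned into rough CPU-hour and speedup figures. The short Python calculation below is back-of-the-envelope arithmetic only; pairing the low end of one range with the low end of the other is an assumption, since the slides give ranges rather than matched measurements.

```python
def cpu_hours(wall_hours: float, cpus: int) -> float:
    """Total CPU time consumed by a run on `cpus` processors."""
    return wall_hours * cpus


# Grid: 7-10 days on 20-25 CPUs (ASSUMPTION: low pairs with low, high with high).
grid_low = cpu_hours(7 * 24, 20)
grid_high = cpu_hours(10 * 24, 25)

# 256-CPU run: 12-13 hours.
mn_low = cpu_hours(12, 256)
mn_high = cpu_hours(13, 256)

# Wall-clock speedup: most pessimistic and most optimistic pairings.
speedup_min = (7 * 24) / 13
speedup_max = (10 * 24) / 12

print(f"grid: {grid_low:.0f}-{grid_high:.0f} CPU-hours")
print(f"256 CPUs: {mn_low:.0f}-{mn_high:.0f} CPU-hours")
print(f"wall-clock speedup: {speedup_min:.1f}x to {speedup_max:.1f}x")
```

Under these assumptions the total CPU time is of the same order on both systems (roughly 3,000 to 6,000 CPU-hours), so the gain is mainly in turnaround: from more than a week to about half a day.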
Strategy for MN TBLASTX:
• Fragment “query” genome:
• H.sapiens genome: >650 5-Mbase fragments
• Reference genome divided into 10-Mbase fragments (internally)
• 22 chromosomes for M.musculus
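The fragmentation strategy can be sketched in a few lines of Python. This is not the MN pipeline itself: the `CHROM_START_END` naming is inferred from the `CHR5_1_5000000` name in the command line shown earlier, the chromosome length in the example is a made-up toy value, and pairing every query fragment with every reference chromosome is an assumption about how the jobs were laid out.

```python
def fragment_coords(length: int, size: int = 5_000_000) -> list[tuple[int, int]]:
    """Split a sequence of `length` bases into 1-based inclusive windows."""
    return [(s + 1, min(s + size, length)) for s in range(0, length, size)]


def fragment_name(chrom: str, start: int, end: int) -> str:
    """Name a fragment as CHROM_START_END (convention inferred, not confirmed)."""
    return f"{chrom}_{start}_{end}"


def jobs(query_chroms: dict[str, int], target_dbs: list[str], size: int = 5_000_000):
    """Yield one (query fragment, target database) pair per independent job."""
    for chrom, length in query_chroms.items():
        for start, end in fragment_coords(length, size):
            for db in target_dbs:
                yield fragment_name(chrom, start, end), db


# Toy input: a hypothetical 12-Mbase query chromosome and two reference databases.
job_list = list(jobs({"CHR5": 12_000_000}, ["CHR1_mm", "CHR2_mm"]))
```

Each pair is independent of the others, which is what makes the comparison easy to spread over many CPUs. If every one of the more than 650 query fragments is run against each of the 22 reference chromosomes, that gives over 14,000 separate TBLASTX runs.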
TBLASTX MN PIPELINE: David García Cortés/Xavier Pastor
Significant publications (MN-derived)
SGP2 importance as an annotation tool
Component of the comparative gene prediction pipelines used to annotate:
• Human (MN)
• Mouse
• Rat
• Cow (MN)
• Chicken
• Paramecium
• Also several species of insects and plants (Melon/Bean)
UCSC Genome browser: http://www.genome.ucsc.edu
GBL Web server: http://genome.crg.es/genepredictions/
Acknowledgments
• BSC-CNS/U. de Cantabria
• Xavier Pastor
• David García Cortés
• Genis Parra/Josep Abril/Roderic Guigó (developers of SGP2)