Rice Sequence and Map Analysis
Leonid Teytelman
Rice Genome Annotation
•Sequence Alignments
•Automation
Comparative Maps
•Genetic Marker Correspondences
•FPC Map
•FPC I-Map
EnsEMBL Pipeline
•Automated Annotation
•Compute Farms
Rice Genome Annotation
Aligned Data Sets:
Rice Coding Sequences
•Rice Complete CDSs
•Rice TIGR GIs
•Rice BGI EST Clusters
•Rice dbEST ESTs
•Rice BGI ESTs
Non-Rice Coding Sequences
•Maize Unigene Clusters
•Maize TIGR GIs
•Maize dbEST ESTs
•Barley dbEST ESTs
•Wheat dbEST ESTs
•Sorghum dbEST ESTs
Rice CUGI BAC ends
Rice JRGP/Cornell RFLP Markers
Rice Cornell SSRs
Alignment Tools:
•BLAT: search & alignment
•pslReps: filtering of low-quality matches
•e-PCR: matches based on near-identity to the PCR primers and their correct order
[Diagram labels: Target, Queries]
Alignment Methods:
Rice Coding Sequences:
•BLAT search & alignment
•pslReps filtering of repetitive matches
•Accept based on percent of EST length matched
Non-Rice Coding Sequences:
•BLAT search & alignment
•pslReps filtering of repetitive matches
•Accept based on hit length and hit frequency
Rice BAC ends:
•BLAT search & alignment
•Accept based on gap length, percent of BAC end length matched, percent identity, and hit frequency.
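The "percent of EST length matched" criterion can be illustrated with a short filter over BLAT's tab-separated PSL output. This is only a sketch: the 0.9 cutoff and the function names are assumptions for illustration, since the slides do not state the thresholds actually used.

```python
from typing import Iterable, Iterator, List, NamedTuple


class PslHit(NamedTuple):
    query: str       # qName (PSL column 10)
    target: str      # tName (PSL column 14)
    matched: int     # matches + repMatches (PSL columns 1 and 3)
    query_size: int  # qSize (PSL column 11)


def parse_psl(lines: Iterable[str]) -> Iterator[PslHit]:
    """Yield one hit per PSL alignment line, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Alignment lines have 21 columns and start with the match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield PslHit(
            query=fields[9],
            target=fields[13],
            matched=int(fields[0]) + int(fields[2]),
            query_size=int(fields[10]),
        )


def accepted_hits(lines: Iterable[str], min_fraction: float = 0.9) -> List[PslHit]:
    """Keep hits whose matched bases cover at least min_fraction of the query.

    min_fraction is a hypothetical threshold, not the value from the talk.
    """
    return [
        hit
        for hit in parse_psl(lines)
        if hit.query_size > 0 and hit.matched / hit.query_size >= min_fraction
    ]
```

In practice the input would be a PSL file that has already been through pslReps, for example `accepted_hits(open("rice_est.filtered.psl"))`, where the file name is a placeholder.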
Alignment Methods (continued):
Rice Markers:
•BLAT search & alignment
•Accept based on percent of marker length matched and the gap length in case of genomic markers.
•Utilize genetic map information; accept those whose genetic & physical chromosome assignment is concordant.
Rice SSRs:
•e-PCR with default parameters, allowing 0 mismatches in the primers
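The idea behind the SSR step, finding both primers with 0 mismatches and in the correct order, can be sketched in a few lines. This is not the e-PCR program itself: it searches only one strand, uses exact string matching, and the `max_size` product-size limit is an assumed parameter.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_amplicons(target: str, forward: str, reverse: str,
                   max_size: int = 1000) -> List[Tuple[int, int]]:
    """Return (start, product_size) for each exact primer-pair match.

    The forward primer must occur upstream of the reverse primer's
    binding site (its reverse complement on this strand), and the
    product may not exceed max_size bases.
    """
    target = target.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())
    hits = []
    start = target.find(fwd)
    while start != -1:
        end = target.find(rev_site, start + len(fwd))
        while end != -1:
            size = end + len(rev_site) - start
            if size > max_size:
                break  # later sites only give larger products
            hits.append((start, size))
            end = target.find(rev_site, end + 1)
        start = target.find(fwd, start + 1)
    return hits
```

A marker would be counted as mapped to a clone when `find_amplicons` returns at least one hit for that clone's sequence.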
February 2002 BAC/PAC Dataset
Total BACs/PACs: 1,847
Total bp: 250,879,896 (250 Mb)
Phase 1: 78
Phase 2: 1,238
Phase 3: 531
Annotated Phase 3: 330
Annotated Genes: 8,034
Alignment Totals
DATASET                          TOTAL COMPARED   TOTAL MAPPED   % MAPPED
Rice Complete CDSs                        1,358            505        37%
Rice TIGR GIs                            12,354          6,290        51%
Rice BGI EST Clusters                    24,179         12,135        50%
Rice dbEST ESTs                         104,549         49,773        48%
Rice BGI ESTs                            86,623         40,049        46%
Maize Unigene Clusters                   10,678          3,972        37%
Maize TIGR GIs                           27,642          6,941        25%
Maize dbEST ESTs                        147,657         38,718        26%
Barley dbEST ESTs                       148,651         50,579        34%
Wheat dbEST ESTs                        166,513         49,146        29%
Sorghum dbEST ESTs                       84,711         28,044        33%
Rice CUGI BAC ends                       88,053         18,260        21%
Rice JRGP/Cornell RFLP Markers            2,682          1,320        49%
Rice Cornell SSRs                           524            228        44%
Automating Alignments:
For each group of data sets, there is a script to automatically:
•Run pslReps
•Load results into the database
•Discard low-quality matches
•Update documentation
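A per-group script of this kind might start along the following lines. This sketch covers only the pslReps step; the database loading and documentation steps are omitted because the slides give no detail on them. The output file naming, the function names, and the choice of `-minCover` as the tuned option are assumptions.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def pslreps_command(in_psl: str, out_dir: str,
                    min_cover: Optional[float] = None) -> List[str]:
    """Build a `pslReps [options] in.psl out.psl out.psr` command line.

    The output names (<stem>.filtered.psl, <stem>.psr) are a hypothetical
    convention chosen for this example.
    """
    src = Path(in_psl)
    out = Path(out_dir)
    cmd = ["pslReps"]
    if min_cover is not None:
        cmd.append(f"-minCover={min_cover}")
    cmd += [
        str(src),
        str(out / f"{src.stem}.filtered.psl"),
        str(out / f"{src.stem}.psr"),
    ]
    return cmd


def run_group(psl_files: Iterable[str], out_dir: str,
              min_cover: Optional[float] = None,
              runner: Callable = subprocess.run) -> None:
    """Run pslReps on every BLAT output file belonging to one data set group."""
    for psl in psl_files:
        # check=True makes a failed pslReps run raise instead of passing silently.
        runner(pslreps_command(psl, out_dir, min_cover), check=True)
```

Passing `runner` as a parameter keeps the command construction testable on a machine where pslReps is not installed.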
Comparative Maps
Map Correspondences
Same marker in multiple mapping studies:
•Name-identity
•Curated evidence
Sequence-based correspondences for JRGP and Cornell markers:
•BLAT search & alignment
•Utilize genetic mapping information, accepting matches on the same chromosome and less than 30 cM apart.
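The acceptance rule for sequence-based correspondences (same chromosome, less than 30 cM apart) is simple enough to state as a predicate. The record type and names below are illustrative assumptions; only the rule itself comes from the slide.

```python
from typing import NamedTuple


class MarkerPosition(NamedTuple):
    chromosome: str  # chromosome assignment on the marker's genetic map
    cm: float        # genetic position in centimorgans


def is_concordant(a: MarkerPosition, b: MarkerPosition,
                  max_distance_cm: float = 30.0) -> bool:
    """Accept a sequence match between two markers only if their genetic
    positions are on the same chromosome and less than max_distance_cm apart."""
    return a.chromosome == b.chromosome and abs(a.cm - b.cm) < max_distance_cm
```

For example, a JRGP marker at chromosome 1, 42.0 cM and a Cornell marker at chromosome 1, 55.5 cM would be accepted, while the same pair at 42.0 cM and 80.0 cM would not.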
[Figure: map correspondences labeled by evidence type: curator, same name, sequence-based]
FPC data from CUGI, synchronized with the latest release.
Discordant
Cornell/JRGP markers mapped to sequenced clones were assigned positions on the FPC contigs.
Total: 2,272 4,417
EnsEMBL Pipeline in a Nutshell
EnsEMBL Pipeline Overview
•System for automated genome annotation
•Executes and keeps track of computational jobs
•Analysis job execution is serial, allowing stage dependencies
•Jobs are user-defined
•Can take advantage of a compute farm
Example analysis chains:
RepeatMasker → Genscan → Blast → GenomeBuilder → Hmmer
RepeatMasker → BLAT → GeneWise → Hmmer
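Serial execution with stage dependencies amounts to running analyses in an order that respects a dependency graph. The sketch below uses Python's standard `graphlib` module for this; it is not the EnsEMBL pipeline's own mechanism. It also assumes that each stage named on the slide depends on the one listed before it.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the stages in an order where every stage follows its prerequisites.

    dependencies maps each stage name to the set of stages it waits for.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Assumed chain: each stage depends on the one listed before it.
CHAIN = {
    "RepeatMasker": set(),
    "Genscan": {"RepeatMasker"},
    "Blast": {"Genscan"},
    "GenomeBuilder": {"Blast"},
    "Hmmer": {"GenomeBuilder"},
}
```

Calling `execution_order(CHAIN)` yields the stages from RepeatMasker through Hmmer in that order.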
Organization
•Utilizes and expands on the EnsEMBL-core modules and database schema
•Database stores:
  •analysis program names and parameters
  •analysis results
  •rules for job dependencies
  •progress status for each job
•Perl modules:
  •access the database
  •execute specified analysis programs
  •parse the analysis results and load them into the database
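To make the four kinds of stored information concrete, here is a toy tracking database. It is a hypothetical schema written for SQLite, not the actual EnsEMBL schema, and it is in Python rather than the pipeline's Perl. All table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical tables: one per kind of information the slide lists.
SCHEMA = """
CREATE TABLE analysis (            -- program names and parameters
    analysis_id INTEGER PRIMARY KEY,
    program     TEXT NOT NULL,
    parameters  TEXT
);
CREATE TABLE rule (                -- job dependency rules
    analysis_id INTEGER NOT NULL REFERENCES analysis(analysis_id),
    depends_on  INTEGER NOT NULL REFERENCES analysis(analysis_id)
);
CREATE TABLE job (                 -- progress status for each job
    job_id      INTEGER PRIMARY KEY,
    analysis_id INTEGER NOT NULL REFERENCES analysis(analysis_id),
    input_id    TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE result (              -- analysis results
    job_id  INTEGER NOT NULL REFERENCES job(job_id),
    feature TEXT NOT NULL
);
"""


def open_tracking_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the example tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def runnable_jobs(conn: sqlite3.Connection) -> list:
    """Return ids of pending jobs whose prerequisite analyses are all
    marked 'done' for the same input."""
    rows = conn.execute(
        """
        SELECT j.job_id FROM job j
        WHERE j.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM rule r
              WHERE r.analysis_id = j.analysis_id
                AND NOT EXISTS (
                    SELECT 1 FROM job d
                    WHERE d.analysis_id = r.depends_on
                      AND d.input_id = j.input_id
                      AND d.status = 'done'))
        ORDER BY j.job_id
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping the rules in the database, rather than in the code, means a new analysis stage can be added by inserting rows instead of editing the scheduler.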
Cluster Utilization
•How to split up tasks?
•Load management and scheduling (LSF, PBS, etc.)
•Contig-by-contig approach
•How to execute jobs on slave nodes?
•Management of management:
•Automatic job submission
•Error/completion checking
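The contig-by-contig approach with automatic submission and error/completion checking can be sketched with a local thread pool standing in for a batch scheduler. This is an illustration only: a real farm would submit through LSF or PBS, and the `analyse` callable and the function name are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def run_per_contig(contigs: Iterable[str],
                   analyse: Callable[[str], object],
                   max_workers: int = 4) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Submit one job per contig and sort outcomes into completed and failed.

    Returns (completed, failed): completed maps contig -> result, and
    failed maps contig -> error description, so failures can be resubmitted.
    """
    completed: Dict[str, object] = {}
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One independent job per contig; the pool plays the scheduler's role.
        futures = {pool.submit(analyse, contig): contig for contig in contigs}
        for future in as_completed(futures):
            contig = futures[future]
            try:
                completed[contig] = future.result()
            except Exception as exc:  # record the error, keep other jobs going
                failed[contig] = repr(exc)
    return completed, failed
```

Splitting by contig keeps jobs independent of one another, so a failure on one contig does not block the rest and only the failed contigs need to be rerun.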