Rice Sequence and Map Analysis

Leonid Teytelman

Rice Genome Annotation

•Sequence Alignments

•Automation

Comparative Maps

•Genetic Marker Correspondences

•FPC Map

•FPC I-Map

EnsEMBL Pipeline

•Automated Annotation

•Compute Farms

Rice Genome Annotation

Aligned Data Sets:

Rice Coding Sequences

•Rice Complete CDSs

•Rice TIGR GIs

•Rice BGI EST Clusters

•Rice dbEST ESTs

•Rice BGI ESTs

Non-Rice Coding Sequences

•Maize Unigene Clusters

•Maize TIGR GIs

•Maize dbEST ESTs

•Barley dbEST ESTs

•Wheat dbEST ESTs

•Sorghum dbEST ESTs

Rice CUGI BAC ends

Rice JRGP/Cornell RFLP Markers

Rice Cornell SSRs

Alignment Tools:

BLAT: search & alignment

pslReps: filtering of low-quality matches

e-PCR: matches based on near-identity to the PCR primers, and correct order

[Diagram labels: Queries, Target]

Alignment Methods:

Rice Coding Sequences:

•BLAT search & alignment

•pslReps filtering of repetitive matches

•Accept based on percent of EST length matched

Non-Rice Coding Sequences:

•pslReps filtering of repetitive matches

•Accept based on hit length and hit frequency

Rice BAC ends:

•Accept based on gap length, percent of BAC end length matched, percent identity, and hit frequency.
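
The "percent of EST length matched" criterion can be sketched in code. This is a minimal illustration, assuming the standard 21-column tab-separated PSL output that BLAT writes. The 0.80 cutoff and the names PslHit, parse_psl_line, fraction_matched, and accept are hypothetical; the slides do not state the thresholds actually used.

```python
from typing import NamedTuple, Optional


class PslHit(NamedTuple):
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str
    q_size: int
    t_name: str


def parse_psl_line(line: str) -> Optional[PslHit]:
    """Parse one PSL data line; return None for header or blank lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    return PslHit(
        matches=int(fields[0]),
        mismatches=int(fields[1]),
        rep_matches=int(fields[2]),
        q_name=fields[9],
        q_size=int(fields[10]),
        t_name=fields[13],
    )


def fraction_matched(hit: PslHit) -> float:
    """Fraction of the query (for example an EST) covered by aligned bases."""
    aligned = hit.matches + hit.mismatches + hit.rep_matches
    return aligned / hit.q_size


def accept(hit: PslHit, min_fraction: float = 0.80) -> bool:
    # min_fraction is an assumed cutoff, not a value taken from the slides.
    return fraction_matched(hit) >= min_fraction
```

The same pattern would extend to the other criteria listed (gap length, percent identity, hit frequency) by reading further PSL columns.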

Alignment Methods (continued):

Rice Markers:

•Accept based on percent of marker length matched and, for genomic markers, the gap length.

•Use genetic map information: accept matches whose genetic and physical chromosome assignments are concordant.

Rice SSRs:

•e-PCR with default parameters, allowing 0 mismatches in the primers
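
The idea behind a zero-mismatch primer search can be illustrated as follows. This is not e-PCR's own algorithm, only a sketch of the criterion it applies: both primers found exactly, in the correct order, with a product size close to the expected one. The function names and the 50 bp size margin are assumptions, and only the forward strand of the target is searched.

```python
from typing import List, Tuple


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text: str, pattern: str) -> List[int]:
    """Start positions of every exact occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def exact_sts_hits(target: str, forward: str, reverse: str,
                   expected_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) of products with zero primer mismatches.

    The forward primer must precede the reverse complement of the
    reverse primer, and the product size must be within `margin`
    (an assumed tolerance) of the expected size.
    """
    target = target.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse)
    hits = []
    for f in find_all(target, fwd):
        for r in find_all(target, rev_rc):
            end = r + len(rev_rc)
            size = end - f
            if r >= f + len(fwd) and abs(size - expected_size) <= margin:
                hits.append((f, end))
    return hits
```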

February 2002 BAC/PAC Dataset

Total BACs/PACs: 1,847

Total bp: 250,879,896 (250 Mb)

Phase 1: 78

Phase 2: 1,238

Phase 3: 531

Annotated Phase 3: 330

Annotated Genes: 8,034

Alignment Totals

DATASET                           TOTAL COMPARED   TOTAL MAPPED   % MAPPED
Rice Complete CDSs                         1,358            505        37%
Rice TIGR GIs                             12,354          6,290        51%
Rice BGI EST Clusters                     24,179         12,135        50%
Rice dbEST ESTs                          104,549         49,773        48%
Rice BGI ESTs                             86,623         40,049        46%
Maize Unigene Clusters                    10,678          3,972        37%
Maize TIGR GIs                            27,642          6,941        25%
Maize dbEST ESTs                         147,657         38,718        26%
Barley dbEST ESTs                        148,651         50,579        34%
Wheat dbEST ESTs                         166,513         49,146        29%
Sorghum dbEST ESTs                        84,711         28,044        33%
Rice CUGI BAC ends                        88,053         18,260        21%
Rice JRGP/Cornell RFLP Markers             2,682          1,320        49%
Rice Cornell SSRs                            524            228        44%

Automating Alignments:

For each group of data sets, there is a script to automatically:

•Run pslReps

•Load results into the database

•Discard low-quality matches

•Update documentation
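
The four steps above might look roughly like this for one group of data sets. It is a sketch only: the table layout, the function names, and the coverage threshold are hypothetical, and SQLite stands in for whatever database the project used. The pslReps command uses its three positional arguments (input PSL, output PSL, output PSR) and is built but not executed here.

```python
import sqlite3
from typing import Iterable, List, Tuple


def pslreps_command(in_psl: str, out_psl: str, out_psr: str) -> List[str]:
    """Argument list for pslReps; pass it to subprocess.run to execute."""
    return ["pslReps", in_psl, out_psl, out_psr]


def load_hits(conn: sqlite3.Connection, dataset: str,
              hits: Iterable[Tuple[str, str, float]]) -> None:
    """Load (query, target, coverage) rows into a hypothetical `hit` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hit "
        "(dataset TEXT, query TEXT, target TEXT, coverage REAL)"
    )
    conn.executemany(
        "INSERT INTO hit VALUES (?, ?, ?, ?)",
        [(dataset, q, t, c) for q, t, c in hits],
    )


def discard_low_quality(conn: sqlite3.Connection, dataset: str,
                        min_coverage: float) -> int:
    """Delete hits below the (assumed) coverage cutoff; return rows removed."""
    cur = conn.execute(
        "DELETE FROM hit WHERE dataset = ? AND coverage < ?",
        (dataset, min_coverage),
    )
    return cur.rowcount


def summary_line(conn: sqlite3.Connection, dataset: str) -> str:
    """One line of documentation: how many distinct queries remain mapped."""
    (n,) = conn.execute(
        "SELECT COUNT(DISTINCT query) FROM hit WHERE dataset = ?", (dataset,)
    ).fetchone()
    return f"{dataset}: {n} queries mapped"
```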

Comparative Maps

Map Correspondences

Same marker on multiple mapping studies

•Name-identity

•Curated evidence

Sequence-based correspondences for JRGP and Cornell markers:

•Utilize genetic mapping information, accepting matches on the same chromosome and less than 30 cM apart.
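
The acceptance rule for sequence-based correspondences (same chromosome, less than 30 cM apart) is simple enough to state as a predicate. The MapPosition type and the function name are illustrative assumptions; only the 30 cM limit comes from the slide.

```python
from typing import NamedTuple


class MapPosition(NamedTuple):
    chromosome: int
    cm: float  # genetic position in centimorgans


def is_concordant(a: MapPosition, b: MapPosition, max_cm: float = 30.0) -> bool:
    """True when both markers are on the same chromosome and less than max_cm apart."""
    return a.chromosome == b.chromosome and abs(a.cm - b.cm) < max_cm
```

For example, is_concordant(MapPosition(1, 10.0), MapPosition(1, 35.5)) holds, while a pair on different chromosomes is rejected regardless of position.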

[Figure legend: curator, same name, sequence-based]

FPC data from CUGI, synchronized with the latest release.

Discordant

Cornell/JRGP markers mapped to sequenced clones were assigned positions on the FPC contigs.

Total: 2,272 4,417

EnsEMBL Pipeline in a Nutshell

•Can take advantage of a compute farm

EnsEMBL Pipeline Overview

•System for automated genome annotation

•Executes and keeps track of computational jobs

•Analysis job execution is serial, allowing stage dependencies

•Jobs are user-defined

Example analysis chains:

RepeatMasker, Genscan, Blast, GenomeBuilder, Hmmer

RepeatMasker, BLAT, GeneWise, Hmmer
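
Serial execution with stage dependencies can be sketched as below. This is not the EnsEMBL pipeline API; run_chain, the runners mapping, and the status dictionary are hypothetical names used to show the idea that a stage runs only after the stages before it have completed, and that progress is recorded per job.

```python
from typing import Callable, Dict, List, Tuple


def run_chain(contig: str,
              chain: List[str],
              runners: Dict[str, Callable[[str], None]],
              status: Dict[Tuple[str, str], str]) -> bool:
    """Run the analyses for one contig in order.

    Stages already marked "done" are skipped, so the chain can be
    resumed. Execution stops at the first failure, so a later stage
    never runs without the output of the stage it depends on.
    """
    for analysis in chain:
        key = (contig, analysis)
        if status.get(key) == "done":
            continue
        try:
            runners[analysis](contig)
        except Exception:
            status[key] = "failed"
            return False
        status[key] = "done"
    return True
```

Calling run_chain again after fixing a failed stage picks up where the previous run stopped, because completed stages are skipped.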

Organization

•Utilizes and expands on the EnsEMBL-core modules and database schema

•Database stores:

•analysis program names and parameters

•analysis results

•rules for job dependencies

•and progress status for each job

•Perl modules:

•access the database

•execute specified analysis programs

•parse the analysis results and load them into the database
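
To make the database's role concrete, here is a toy layout holding analysis definitions, dependency rules, and per-job status, with a query for the analyses that are ready to run on a contig. The table and column names are invented for illustration and do not reproduce the actual EnsEMBL schema; SQLite is used only so the sketch is self-contained.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE analysis (name TEXT PRIMARY KEY, program TEXT, parameters TEXT);
CREATE TABLE rule (analysis TEXT, depends_on TEXT);
CREATE TABLE job (contig TEXT, analysis TEXT, status TEXT);
"""

READY_SQL = """
SELECT a.name FROM analysis a
WHERE NOT EXISTS (
    SELECT 1 FROM job j
    WHERE j.contig = :contig AND j.analysis = a.name AND j.status = 'done')
AND NOT EXISTS (
    SELECT 1 FROM rule r
    WHERE r.analysis = a.name
    AND NOT EXISTS (
        SELECT 1 FROM job j
        WHERE j.contig = :contig AND j.analysis = r.depends_on
        AND j.status = 'done'))
ORDER BY a.name
"""


def ready_analyses(conn: sqlite3.Connection, contig: str) -> List[str]:
    """Analyses not yet done for this contig whose dependencies are all done."""
    return [row[0] for row in conn.execute(READY_SQL, {"contig": contig})]
```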

Cluster Utilization

•How to split up tasks?

•Contig-by-contig approach

•How to execute jobs on slave nodes?

•Load management and scheduling (LSF, PBS, etc.)

•Management of management:

•Automatic job submission

•Error/completion checking
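
A contig-by-contig submission loop with completion checking might be organized as follows, taking LSF as the scheduler. The bsub options -J (job name), -o (output file), and -e (error file) are standard LSF options; the run_analysis executable, the log directory, and the function names are placeholders. The commands are built but not executed.

```python
from typing import Dict, List, Tuple


def bsub_command(contig: str, analysis: str, log_dir: str = "logs") -> List[str]:
    """Build an LSF submission command for one analysis on one contig."""
    job_name = f"{analysis}.{contig}"
    return [
        "bsub",
        "-J", job_name,
        "-o", f"{log_dir}/{job_name}.out",
        "-e", f"{log_dir}/{job_name}.err",
        "run_analysis", analysis, contig,  # hypothetical wrapper script
    ]


def unfinished(contigs: List[str], analysis: str,
               status: Dict[Tuple[str, str], str]) -> List[str]:
    """Contigs whose job for this analysis is missing or did not complete."""
    return [c for c in contigs if status.get((c, analysis)) != "done"]


def submission_plan(contigs: List[str], analysis: str,
                    status: Dict[Tuple[str, str], str]) -> List[List[str]]:
    """Commands to submit (or resubmit) for every unfinished contig."""
    return [bsub_command(c, analysis) for c in unfinished(contigs, analysis, status)]
```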
