DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Page 1: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

DM Church- NCBI

Assembling and Annotating Genomes

Deanna M. Church, NCBI

January 12, 2005

Page 2: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Of mice and men

Page 3: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Of mice and men

Fleischman et al. (1991) PNAS 88:10885-10889

Both carry mutations in the Kit gene.

Page 4: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Basic Model

[Diagram: the basic data model, linking Organisms, Genomes, Chromosome, Resources (Maps, Clones, etc), Gene, Structure, Transcript, mRNA, ProPeptide, Mature Peptide, Function/Phenotype, and Disease]

Page 5: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Putting Genomes Together

Hierarchical Shotgun Assembly

- BAC insert and BAC vector
- Shotgun sequence, then assemble. This part is relatively cheap and easy.
- Fold sequence coverage: 200 Kb BAC, 0.5 Kb/read; 400 reads = 1X, 2000 reads = 5X
- Gaps: deeper sequence coverage rarely resolves all gaps. This part is hard and expensive.
- "Finishers" go in to manually fill the gaps, often by PCR.
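The fold-coverage figures on this slide follow from simple arithmetic, which the sketch below reproduces. The function names are illustrative only. The uncovered-fraction estimate assumes the idealized Lander-Waterman model (reads placed uniformly at random); that model is an assumption added here, not something stated on the slide.

```python
import math

def reads_for_coverage(insert_kb, read_kb, fold):
    """Number of reads needed to reach `fold`-X coverage of an insert."""
    return math.ceil(insert_kb * fold / read_kb)

def expected_uncovered_fraction(fold):
    """Idealized (Lander-Waterman) fraction of bases hit by no read."""
    return math.exp(-fold)

# Slide values: 200 Kb BAC, 0.5 Kb per read.
one_x = reads_for_coverage(200, 0.5, 1)
five_x = reads_for_coverage(200, 0.5, 5)
```

Even under this idealized model a small fraction of bases stays uncovered at 5X, which is in line with the point that deeper coverage rarely resolves all gaps.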

Page 6: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


HTGS keywords

htgs_phase0: low coverage sequence 1-2X

htgs_phase1: generally 4-5X sequence coverage, several fragments not ordered or oriented

htgs_phase2: sequence coverage can vary (generally 5-10X) but fragments are ordered and oriented.

htgs_phase3: highly accurate, finished sequence. Error rate < 10^-5

Draft sequence: phase 1 or 2, but >90% of the bases are high quality (phred 20 or better)

htgs_active_fin: center has finished shotgun phase and moved to finishing

htgs_cancelled: sequencing has discontinued on this clone
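The draft criterion above (more than 90% of bases at phred 20 or better) can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the only outside fact used is the standard phred definition, Q = -10 log10(P), so phred 20 corresponds to a 1% error probability.

```python
def phred_to_error(q):
    """Error probability implied by a phred quality score."""
    return 10 ** (-q / 10)

def fraction_high_quality(quals, threshold=20):
    """Fraction of bases with quality at or above the threshold."""
    if not quals:
        return 0.0
    return sum(q >= threshold for q in quals) / len(quals)

def meets_draft_quality(quals):
    """True when more than 90% of bases are phred 20 or better."""
    return fraction_high_quality(quals) > 0.90
```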

Page 7: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


The Raw Data

Page 8: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Putting genomes together

- Remove contaminants (vector, E. coli, other organisms, virus)
- Bin clones by chromosome arm
- Incorporate clone order information using TPF
- Identify fragment overlaps
- Determine fragment order and orientation, remove sequence redundancy (this produces sequence contigs given NT_XXXXXX type accession numbers)
- Place contigs on chromosome

UCSC: Jim Kent
NCBI: Paul Kitts, Greg Schuler, Richa Agarwala
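The "identify fragment overlaps" and "remove sequence redundancy" steps can be illustrated with a toy exact-match example. This is only a sketch under strong assumptions: it is not the NCBI or UCSC implementation, it requires exact suffix/prefix matches, and it ignores orientation and sequencing errors. The function names and the `min_len` parameter are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def merge(a, b, min_len=3):
    """Join two overlapping fragments, keeping the shared bases once."""
    n = suffix_prefix_overlap(a, b, min_len)
    return a + b[n:] if n else None

contig = merge("ACGTTGCA", "TGCAGGTC")
```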

Page 9: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Sequence Tagged Sites (STS)

STS marker D6S1606: forward primer, microsatellite, reverse primer

PCR product size: 92 - 100 bases

Top strand:    GAGTTTGCACCATTGCACTCCAGCCTGGGCAAC (CA)n AACGTGGCATGTGCCTGTACTCTCC
Bottom strand: CTCAAACGTGGTAACGTGAGGTCGGACCCGTTG (GT)n TTGCACCGTACACGGACATGAGAGG

A common language for physical mapping of the human genome. M. Olson, L. Hood, C. Cantor, and D. Botstein. Science 245, 1434-1435 (1989).

Page 10: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Original Genome Resources: STS Maps

The genome is fragmented in one of three ways:

- meiosis: genetic map
- radiation: RH map
- clones: clone based map

- Each line represents an individual cell line/animal that carries a particular break.
- STSs can be amplified from DNA in these cell lines/animals.
- Based on cell line/animal marker content, the breaks can be determined and the markers ordered.

[Figure: PCR gel for marker D2Wsu129e; lanes labeled 129, water, and hamster, plus numbered lanes 1 to 30]
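The idea that marker content determines the breaks, and the breaks determine marker order, can be sketched with a brute-force "minimum obligate breaks" search. This is an illustrative toy, not the method used to build the maps on the slide: the data are invented, the function names are hypothetical, and trying every permutation only works for a handful of markers.

```python
from itertools import permutations

def obligate_breaks(order, retained_sets):
    """Count retained/not-retained transitions between adjacent markers."""
    breaks = 0
    for retained in retained_sets:
        pattern = [marker in retained for marker in order]
        breaks += sum(x != y for x, y in zip(pattern, pattern[1:]))
    return breaks

def best_order(markers, retained_sets):
    """Marker order that needs the fewest breaks (exhaustive search)."""
    return min(permutations(sorted(markers)),
               key=lambda order: obligate_breaks(order, retained_sets))

# Invented data: each set lists the markers amplified from one cell line.
lines = [{"A", "B"}, {"C", "D"}, {"B", "C", "D"}, {"A"}, {"A", "B", "C"}]
order = best_order("ABCD", lines)
```

An order and its reverse always score the same, so the result is only defined up to orientation.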

Page 11: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Electronic PCR (e-PCR)

STS marker D6S1606: forward primer, microsatellite repeat, reverse primer

PCR product size: 92 - 100 bases

Top strand:    GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTCC
Bottom strand: TCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG

e-PCR software searches DNA sequences for exact matches to both primers in the correct order, orientation, and spacing to be consistent with the known PCR product size.

Schuler (1997), Genome Research 7, 541-550
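The search described above (exact primer matches in the right order, orientation, and spacing) can be sketched as follows. This is not the e-PCR program itself, only a minimal illustration of the idea: the function names are hypothetical, the two 10-base primers and the template are toy values, and only the top strand is searched.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_all(text, pattern):
    """Yield every start position of `pattern` in `text`."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)

def epcr(sequence, forward, reverse, min_size, max_size):
    """Return (start, end, size) for each exact in-silico PCR product."""
    rev_site = reverse_complement(reverse)  # reverse primer as seen on the top strand
    hits = []
    for f in find_all(sequence, forward):
        for r in find_all(sequence, rev_site):
            size = r + len(rev_site) - f
            if r >= f + len(forward) and min_size <= size <= max_size:
                hits.append((f, r + len(rev_site), size))
    return hits

# Toy primers and template (hypothetical, for illustration only).
forward, reverse = "GAGTTTGCAC", "GGAGAGTACA"
template = "TTT" + forward + "CA" * 10 + reverse_complement(reverse) + "AAA"
hits = epcr(template, forward, reverse, 35, 45)
```

Tightening the size window to one the product does not fit, such as 92 to 100, returns no hits for this template.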

Page 12: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Electronic PCR (e-PCR)

http://www.ncbi.nlm.nih.gov/sutils/e-pcr/

Page 13: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

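Putting genomes together

Ideally…

[Diagram: sequence fragments carrying markers A through O are compared with a non-sequence based map (markers A, B, C, D, F, G, H, K, L, N, O); a fragment in the wrong orientation is flipped so that the marker order agrees]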

Page 14: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

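Putting genomes together

More like…

[Diagram: the same comparison with real data. Marker order in the fragments (A through O) conflicts between sources, unexpected markers (Z, Y, X, W, V) appear, and some placements remain unresolved (?)]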

Page 15: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Starting Material:

                 Phase 1    Phase 2    Phase 3
Number           10632      777        30470
Length (Kb)      1726.24    101.11     3621.30

http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html

Framework assemblies: 388 contigs, 3.02 Gb

Contig Information:

Type of source sequence    Number used    Length (bp)
Draft only                 46             10,284,900
Finished only              334            2,833,780,000

Human assembly: Build 35. Assembly is now defined by AGP* files rather than a formal assembly process. These are maintained by chromosome coordinators.

*AGP = A Golden Path

Reference Contig N50#: 38.5 Mb

#N50 length: Contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
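The N50 definition above translates directly into code. A minimal sketch (the function name is illustrative): sort contig lengths from longest to shortest and return the length at which the running total first reaches half of all bases.

```python
def n50(lengths):
    """Contig length at which 50% of all bases sit in contigs at least that long."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

For example, contigs of 4, 3, 2, and 1 Kb total 10 Kb; the 4 Kb contig alone holds less than half, and adding the 3 Kb contig passes the halfway point, so the N50 is 3 Kb.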

Page 16: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Contigs and components in the MapViewer

Page 17: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Mouse Genome Sequencing

- Aug. 2001: 19.2M (3.5X)
- Oct. 2001: 25.2M (4.5X)
- Nov. 2001: 30M (5.5X)
- Feb. 2002: 40.1M (7X)

Page 18: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Putting genomes together

WGS

- Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
- For the mouse project only 40 kb clones and BAC clones are available
- BAC clones were constructed and end sequenced before the WGS project started
- End-sequence all clones and retain pairing information ("mate-pairs")
- Each end sequence is referred to as a read
- Find sequence overlaps to build WGS contigs

[Diagram labels: WGS contig, tails]

David Jaffe, Jim Mullikin

Page 19: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Putting genomes together

Constructing Supercontigs (scaffolds)

David Jaffe, Jim Mullikin
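One way to picture supercontig construction is to count mate-pairs whose two reads land in different WGS contigs and to link contigs supported by enough pairs. The sketch below is an assumption-laden illustration, not how Arachne or Phusion work: the function name, the input layout, and the `min_links` threshold are all hypothetical, and insert sizes and read orientation are ignored.

```python
from collections import Counter

def link_contigs(read_to_contig, mate_pairs, min_links=2):
    """Return contig pairs joined by at least `min_links` mate-pairs."""
    links = Counter()
    for left, right in mate_pairs:
        a, b = read_to_contig.get(left), read_to_contig.get(right)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1
    return {pair for pair, count in links.items() if count >= min_links}

# Invented example: two mate-pairs span c1/c2, only one spans c2/c3.
placement = {"r1": "c1", "r2": "c2", "r3": "c1", "r4": "c2", "r5": "c2", "r6": "c3"}
pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
scaffold_links = link_contigs(placement, pairs)
```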

Page 20: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Mouse Genome: MGSCv3

David Jaffe: Arachne
Jim Mullikin: Phusion
(The Mouse Genome Sequencing Consortium)

Waterston et al, 2004

The Starting Material

- 40.7 million WGS reads (2, 4, 6, 10, 40 Kb)
- ~450,000 BAC end sequences (RPCI-23: 197 Kb; RPCI-24: 155 Kb)
- (+ 274 finished BACs, 49.5 Mb)

The Assembly

- 224,713 WGS contigs (CAAA01000100-)
  - Length of contigs > 1 kb: 2.53 Gb
  - Length of contigs with >= 1 BES: 2.06 Gb
  - Length of contigs with >= 1 mapped STS: 0.344 Gb
  - N50 length: 24.8 Kb
  - Mapped: 173550
- 42,620 Supercontigs (NW_XXXXXX)
  - Length of sc with >= 1 BES: 2.41 Gb
  - Length of sc with >= 1 mapped STS: 2.4 Gb
  - N50 length: 17.7 Mb
  - Mapped: 366
  - N50 of mapped supercontigs: 17 Mb
  - N50 of unmapped supercontigs: 4.9 Kb
- ChrUn
- Total length of the assembly: 2.5 Gb (90.9% of genome)*

* Assumes a 2.75 Gb genome

Page 21: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Mouse Genome: over time…

[Figure: MGSCv3, chromosomes 1-19 and X, shaded by sequence type: Finished, Draft, WGS, Gap]

Page 22: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Contig/Supercontig size by chromosome

[Chart: Contig (Kb) and Supercontig (Mb) size for chromosomes 1-19 and X; vertical axis 0 to 80]

Page 23: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

How does MGSCv3 compare to non-sequence based maps?

[Chart: Chromosome 7, map position (0 to 140) plotted against sequence position in basepairs (0 to 160,000,000) for the WI-Gen map and the WI/MRC RH map]

- ~80% of STS markers on WI-Genetic Map localized by e-PCR
- ~72% of STS markers on WI/MRC RH Map localized by e-PCR
- <3% chromosome conflict

Page 24: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Finished NT Contig By Build

[Chart: finished NT contig length (0 to 200,000,000) per chromosome (1-19, X, Y, Un) for Build 29, Build 30, Build 32, and Build 33, alongside Estimated Length]

Finished sequences are used to build hand-curated contigs (NT contigs).

Currently ~1.8 Gb (mostly) non-redundant sequence; 1.1 Gb in Build 33.

Page 25: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Mouse Genome: over time…

Mouse Build 30:

- Integrated 730 Mb of Finished C57BL/6J sequence into the assembly
- MGSCv3 was used as a Tiling Path to guide the assembly
- Freeze date: Jan 27, 2003
- Release date: Feb 27, 2003

[Figure: chromosomes 1-19 and X, shaded by sequence type: Finished, Draft, WGS, Gap]

NCBI: Richa Agarwala

Page 26: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Mouse Genome: combining resources…

Unplaced versus Total curated Contigs, Build 30

[Chart: unplaced and total contigs per chromosome (1-19, X, Y); vertical axis 0 to 800. Percentage labels, in extracted order: .56%, .27%, 1.83%, 1.93%, 4.07%, 3.64%, 3.61%, 1.19%, 2.94%, 0, 0, 5.56%, 1.38%, 4.48%, 0, 0, 1.27%, 1.41%, 0, 0.9%, 100%]

780 Mb of Curated NT Sequence

NCBI: Richa Agarwala, Deanna Church

Page 27: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

The Mouse Genome: combining resources…

Mmu4 unplaced contigs (Build 30): 10 unplaced NT contigs (11 GenBank accessions)

- Do align to WGS contigs mapped to Mmu4: NT_039271, NT_039272, NT_039276, NT_039280
- Align to WGS contigs mapped to another chromosome: NT_039273 (MmuX)
- No hits/bad hits (mostly chrUn): NT_039269, NT_039270, NT_039274, NT_039278, NT_039279

NCBI: Richa Agarwala, Deanna Church

Page 28: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

Large, nearly identical copies of genomic DNA: > 1 Kb, > 90% identity. They can be intrachromosomal or interchromosomal.

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 29: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

WGAC Analysis: Whole Genome Assembly Comparison

- BLAST the genome against itself and look for sequence similarity.
- Caveat: difficult to distinguish between biological duplication and artificial duplication introduced when producing draft assemblies.

WSSD Analysis: Whole Genome Shotgun Sequence Detection

- BLAST WGS reads against an assembly and look for increased depth of coverage.

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church
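The WSSD idea (duplicated regions attract more shotgun reads than unique sequence) can be illustrated by scanning per-base read depth in windows. This is a simplified sketch, not the published WSSD procedure: the function name, the window size, and the `factor` threshold are assumptions chosen for the example, and real analyses calibrate thresholds against known unique sequence.

```python
def flag_high_depth_windows(depths, window, factor=2.0):
    """Start positions of non-overlapping windows whose mean depth is
    at least `factor` times the overall mean depth."""
    if not depths:
        return []
    overall = sum(depths) / len(depths)
    flagged = []
    for start in range(0, len(depths) - window + 1, window):
        chunk = depths[start:start + window]
        if sum(chunk) / window >= factor * overall:
            flagged.append(start)
    return flagged

# Invented depth profile: a 5-base stretch with three times the usual depth.
depths = [5] * 20 + [15] * 5 + [5] * 15
candidates = flag_high_depth_windows(depths, window=5)
```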

Page 30: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

MGSCv3 (>20 Kb; >95%)

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 31: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

MGSCv3 (>90% ID; >10 Kb)

60% of all duplications map to chrUn in MGSCv3.

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 32: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

Comparison of duplication in the Mouse and Human Genomes (WGAC analysis)

          Human Build 31   MGSCv3 (2.55 Gb)      Mouse Build 29 (0.439 Gb, Finished BACs only)
Size      (2.75 Gb)        w/ unpl    w/o unpl   initial    filtered
>1 Kb     5.25%            ND         ND         3.74%      2.35%
>5 Kb     4.78%            1.95%      1.01%      3.25%      2.00%
>10 Kb    4.52%            0.70%      0.38%      2.71%      1.60%
>20 Kb    4.06%            0.11%      0.10%      2.23%      1.14%

Duplications are underrepresented in the Whole Genome Assembly (MGSCv3).

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 33: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

WSSD: Finished BACs

[Chart legend: Unique: pre-quality score; Unique: post-quality score; Duplicated: pre-quality score; Duplicated: post-quality score]

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 34: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

WSSD (>95% id) analysis of Build 30 BACs (4298 BACs tested; 141 dup pos BACs)

Size      BACs     MGSCv3 w/ Un   MGSCv3 w/o Un
>1 Kb     ND       ND             ND
>5 Kb     ND       ND             ND
>10 Kb    1.51%    2.09%          0.27%
>20 Kb    1.46%    2.01%          0.23%

The 6 BACs (5 NT clones) from Mmu4 that hit chrUn are on the duplication positive list.

Case Western Reserve: Evan Eichler, Jeff Bailey
NCBI: Deanna Church

Page 35: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplications

RP23-3D2, chr.X_A3

• Validated 18/27 (67%) in silico predictions by FISH
• 16/18 (~90%) were clustered intrachromosomal duplications

This region is described in Mileham and Brown (1996) as 'a repeat sequence island'.

Bari, Italy: Mario Ventura, Mariano Rocchi

Page 36: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

MGSCv3 Duplication Analysis (Build 33 data; both non redundant dup)

Columns:
(1) WGAC (Mb)
(2) WSSD supported WGAC (Mb)
(3) WSSD overlap WGAC (%)
(4) WSSD (Mb)
(5) WGAC overlap WSSD (%)
(6) Proportion of WSSD supported WGAC in chrom (%)

Chrom    (1)      (2)      (3)      (4)      (5)      (6)
chr1     3.25     0.38     11.58    0.57     66.51    0.21
chr2     2.03     0.13     6.57     0.32     42.11    0.08
chr3     2.17     0.11     5.23     0.16     69.09    0.08
chr4     2.19     0.27     12.12    0.69     38.64    0.19
chr5     2.81     0.42     14.96    0.88     47.92    0.31
chr6     3.72     0.37     9.97     0.86     43.00    0.27
chr7     4.48     0.78     17.41    2.10     37.16    0.64
chr8     1.54     0.15     9.54     0.27     54.63    0.12
chr9     1.56     0.10     6.11     0.34     28.03    0.08
chr10    1.62     0.10     5.94     0.19     51.39    0.08
chr11    1.13     0.08     6.94     0.21     36.63    0.07
chr12    1.79     0.39     21.85    0.88     44.42    0.37
chr13    1.86     0.41     22.08    1.01     40.66    0.38
chr14    1.19     0.15     12.39    0.33     44.38    0.14
chr15    0.94     0.04     3.87     0.05     77.47    0.04
chr16    1.08     0.01     0.75     0.02     40.64    0.01
chr17    3.35     0.22     6.62     0.99     22.30    0.26
chr18    0.75     0.02     2.62     0.02     87.52    0.02
chr19    0.92     0.05     5.53     0.31     16.52    0.09
chrUn    23.78    13.03    54.80    82.02    15.89    12.91
chrX     3.17     0.31     9.91     0.86     36.41    0.23

Evan Eichler, Xinwei She, Ginger Chang, Eray Tuzan, Deanna Church

Page 37: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Build 33

Reference assembly N50: 22.3 Mb

[Figure: chromosomes 1-19, X, and Y, shaded by sequence type: WGS, Finished, Draft, Gap]

Page 38: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Chromosome 7 inversion still present…

Page 39: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Mmu7 (3M – 6M)

Page 40: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Segmental Duplication: Genome annotation will under-represent the gene content if segmental duplications are not included in the reference assembly.

Page 41: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Large scale variation in the genome

Nature Genetics, Sept. 2004

Page 42: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Types of annotation

Feature: Method

- Genes: By alignment, by prediction
- Markers: By ePCR
- Clones/Cytogenetic location: By alignment (BAC ends, insert) or assembly
- Gene Trap Clones: By alignment
- Variation: By alignment
- Phenotype: Via gene identification, associated markers
- Cytogenetic Position: By annotated BAC-end sequenced clones; by FISH-mapped clones used in assembly
- Sequence characteristics: CpG islands, source of assembly

Note: Genes from other organisms are also positioned by aligning mRNAs from one species to the genome of another. Example: the human Map Viewer shows the position of ESTs and other mRNAs from cow, pig, mouse, and rat.

Page 43: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Reference Sequences…

Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule

- NC_000000: chromosome
- NT_000000 / NW_000000: contig
- NG_000000: genomic
- NM_000000 / NR_000000: RNA
- NP_000000: protein
- XM_000000 / XR_000000: predicted RNA
- XP_000000: predicted protein

Key: Curated annotation / Calculated annotation

Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
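The accession prefixes on this slide lend themselves to a simple lookup. The sketch below maps each prefix to the molecule label used on the slide; the function name and the dictionary are illustrative, not an NCBI API, and accessions without a RefSeq-style prefix return None.

```python
# Prefix-to-label table taken from the slide.
REFSEQ_TYPES = {
    "NC": "chromosome",
    "NT": "contig",
    "NW": "contig",
    "NG": "genomic",
    "NM": "RNA",
    "NR": "RNA",
    "NP": "protein",
    "XM": "predicted RNA",
    "XR": "predicted RNA",
    "XP": "predicted protein",
}

def refseq_type(accession):
    """Return the slide's molecule label for a RefSeq accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_TYPES.get(prefix.upper())
```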

Page 44: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Why do we need RefSeq?

Entrez Nucleotide

Page 45: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

mRNA alignment

Tools: Sim4, est2genome, Spidey, BLAT, SPLIGN

• General alignment:
  – at least 50% of length or >1.0 kb
  – >95% identity, unless short exon
  – No longer one alignment per contig per strand (changed recently because this led to failure to annotate all members of a gene cluster)
  – Constraints on intron length (compactness)
  – Shift within 3 nt to find splice sites conforming to consensus (GT-AG, GC-AG, AT-AC)
  – Rank alignment by bit score, % identity, score, gaps, compactness
  – global alignment

• Best placement:
  – Add to score for introns to compensate for gap penalty
  – Known ambiguity if gene/pseudogene pairs are highly related, and few introns in gene

Page 46: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Aligning cDNAs to the genome

- Different algorithms can produce different results
- Trying to balance alignment with searching for splice sites.

Example: NM_003490 (synapsin 3), between exons 7 and 8. The tools compared (Sim4, splign/gpipe/BLAT, spidey) place the same junction in three different ways:

cDNA:   ACAG++++++++++GAG
        |||           |||
genome: ACATGTxxxxACAGGAG

cDNA:   AC++++++++++AGGAG
        ||          |||||
genome: ACATGTxxxxACAGGAG

cDNA:   ACA++++++++++GGAG
        |||          ||||
genome: ACATGTxxxxACAGGAG
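The trade-off between alignment quality and splice-site consensus can be made concrete by enumerating every way to place a single intron in the slide's toy example. This is a sketch under stated assumptions: the cDNA is taken to span the whole genomic string with exactly one intron, the function name is hypothetical, and no real aligner's scoring is reproduced.

```python
CONSENSUS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def candidate_junctions(genome, cdna):
    """For each split point k, return (k, intron ends, mismatches, is_consensus)."""
    intron_len = len(genome) - len(cdna)
    results = []
    for k in range(1, len(cdna)):
        intron = genome[k:k + intron_len]
        flanks = genome[:k] + genome[k + intron_len:]   # exonic bases
        mismatches = sum(a != b for a, b in zip(cdna, flanks))
        ends = (intron[:2], intron[-2:])
        results.append((k, ends, mismatches, ends in CONSENSUS))
    return results

# Toy sequences from the slide ("x" stands for the rest of the intron).
rows = candidate_junctions("ACATGTxxxxACAGGAG", "ACAGGAG")
by_split = {row[0]: row for row in rows}
```

Splitting after 2 bases gives a perfect match with AT-AC intron ends, after 3 bases a perfect match with non-consensus ends, and after 4 bases GT-AG ends at the cost of one mismatch, which mirrors the three placements shown above.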

Page 47: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Making Gene Models (at NCBI)

- Align RefSeq mRNAs to the genome
- Select the best alignment (by score? exon structure?)
- Run ab initio gene prediction on regions between these alignments. We use gnomon (GeneScan, GenomeScan, TwinScan, SGP)
- Select best gene models:
  - RefSeq alignments (NM_XXXXXXXXX)
  - ab initio models with support (XM_XXXXXXXXX)

Known issues:

- Don't make ab initio models in introns of known genes
- Skewed to what we know
- Don't really predict non-coding RNAs well
- Hard to sort out gene vs. pseudo-genes
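The step "run ab initio gene prediction on regions between these alignments" implies computing the stretches of sequence not covered by any mRNA alignment. A minimal sketch of that interval arithmetic follows; the function name and the half-open coordinate convention are assumptions, and this is not NCBI pipeline code.

```python
def regions_between(alignments, region_start, region_end):
    """Half-open intervals in [region_start, region_end) covered by no alignment."""
    gaps = []
    cursor = region_start
    for start, end in sorted(alignments):
        if start > cursor:
            gaps.append((cursor, min(start, region_end)))
        cursor = max(cursor, end)   # overlapping alignments are merged here
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return gaps

# Invented alignment spans on a 1000-base region.
uncovered = regions_between([(100, 200), (150, 300), (500, 600)], 0, 1000)
```

Because only the space between alignments is searched, introns of aligned genes are never examined, which is the first known issue listed above.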

Page 48: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Page 49: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Conflict resolution

Integrated comparison with Ensembl and UCSC:

- Placement of CDS
- Placement and consensus of splice junctions
- % identity between RefSeq and Genome
- Reading frame

Possible Actions:

- Review current evidence
- Review alignment algorithms
- Review current RefSeqs

Page 50: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Future consensus annotation

• CCDS identifier assigned to annotated proteins that are consistently placed

• Sequence may not be identical because NCBI annotates and places existing RefSeqs that are based on cDNAs, and Ensembl generates mRNA and protein products solely from the reference genome:
  – cDNA (and thus protein) from a different allele
  – RNA editing
  – selenoproteins
  – ribosomal slippage
  – non-AUG initiation codon
  – cDNA source has undetected sequence errors

Page 51: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Future consensus annotation

Preliminary Statistics based on Human Build 34.3

Count    Total    Conditions Satisfied
7802     7802     100% nucleotide+position
1499     9301     100% protein+position
3053     12336    100% exon position
23       12359    NCBI/Hinxton both "good"
1540     13899    NCBI annotation projected
1772     15671    One model better
52       15723    Other model better

Page 52: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Now that the genome is together

I. Text based queries

Entrez:
- organism restriction
- molecule type restrictions
- keyword restrictions

II. Sequence comparisons

- BLAST (Basic Local Alignment Search Tool)
- SSAHA (Sequence Search and Alignment by Hashing Algorithm)
- BLAT

III. Query by location

- Base pair position
- cM position
- cytogenetic position

http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090

Page 53: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Data Access

Entry point into the Genome: view BLAST results in the Map Viewer

http://www.ncbi.nlm.nih.gov/genome/seq/MmBlast.html (also /HsBlast.html, /RnBlast.html, /DrBlast.html)

DATABASES

- Assembled Sequence: Reference assembly (C57BL/6J), Alternate assemblies, Celera Mouse 16
- Input Sequences: HTGS, WGS Traces, All other Traces, BAC ends
- Transcribed Sequences: Reference mRNA, Build RNA, ESTs
- Proteins: Reference proteins, Build proteins
- Other data sets: Gene Trap Clones

Page 54: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Data Access

Page 55: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Navigating by location

Jump to chromosome

15M, 30M

- Add & Remove Maps
- Change Map Order
- Add Rulers
- Add another organism

Page 56: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Multiple assemblies can be a good thing…

Alignment of human Reference mRNAs:

- 256: Reference assembly only
- 10: Celera assembly only

• Assembly Gaps
• Assembly Errors
• Biological variation

Page 57: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Multiple assemblies can be a good thing…

Page 58: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Page 59: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Page 60: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Multiple assemblies can be a good thing…

Inversions: An exon of DOCK3 is inverted in the reference assembly relative to other available information.

[Figure: strand (+/-) of the aligned segments in the Reference Assembly and the Celera Assembly]

Other sequence data indicate the reference assembly includes an inversion:

NM_004947 181 tgaaggggatctttcctgcaaattacattcacttgaaaaaggcaattgtcagtaataggg 240 +
AY254099 181 ............................................................ 240 +
AY145303 158 ............................................................ 217 +
AY145302 509 .a......c................t.....t.......................c.... 568 +
AK172930 518 .a......c................t.....t.......................c.... 577 +
AK122353 445 .c.....t..a......t.c.gc..tg.............t..ctg...a.ag..c.aa. 504 +
AY233380 158 .c.....t..a......t.c.gc...g.............t..ctg...a.ag..c.aa. 217 +
AC121608 21865 .....c................t.....t.......................c.... 21921 +
AL672208 61296 .....c................t.....t.......................c.... 61240 +

Page 61: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Multiple assemblies can be a good thing…

Page 62: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Page 63: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


Genome assembly and annotation is an ongoing issue.

Weigh all of the evidence carefully

Multiple lines of evidence better than a single thread

Page 64: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Acknowledgments

RefSeq Curator Staff, BLAST Team, Entrez Team, NCBI Service Desk Staff

Genome Team: Richa Agarwala, Hsiu-Chuan Chen, Slava Chetvernin, Deanna Church, Olga Ermolaeva, Wratko Hlavina, Wonhee Jang, Jonathan Kans, Yuri Kapustin, Ken Katz, Paul Kitts, Donna Maglott, Jim Ostell, Kim Pruitt, Sergey Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Alexandre Souvorov, Tatiana Tatusova, Lukas Wagner

Trace and Assembly Archive: Vladimir Alekseyev, Anton Butanaev, Alexey Egorov, Andrew Klymenko, Sergey Pomorov, Eugene Yaschenko, Mike Dicuccio

Duplication Analysis: Evan Eichler, Xinwei She, Ze Cheng, Eray Tuzan, Jeff Bailey, Mario Ventura, Mariano Rocchi

Page 65: DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.

Acknowledgments

Mouse Genome Sequencing Consortium:

- Sanger Institute
- Washington University Genome Sequencing Center
- Whitehead (Broad) Institute Genome Center
- Baylor College of Medicine
- Cold Spring Harbor Laboratory
- Genome Therapeutics Corporation
- Harvard Partners Genome Center
- Joint Genome Institute
- NIH Intramural Sequencing Center
- UK-MRC Sequencing Consortium
- The University of Oklahoma Advanced Center for Genome Technology
- The University of Texas Southwest