Peg Folta Lawrence Livermore National Laboratory 3/12/02

The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ Peg Folta Lawrence Livermore National Laboratory 3/12/02 TRANSCRIPTOME 2002 Seattle, WA


Page 1:

The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ

Peg Folta

Lawrence Livermore National Laboratory

3/12/02

TRANSCRIPTOME 2002 Seattle, WA

Page 2:

I.M.A.G.E. maintains the world’s largest publicly available cDNA collection

5,819,514 clones arrayed

I.M.A.G.E. clones account for 64% of human ESTs in GenBank

[Chart: cumulative clones arrayed]

Page 3:

The I.M.A.G.E. collection has been shaped by projects (C-GAP, MGC…)

[Charts of collection composition, one per attribute:]

• Species: Human, Mouse, Zebrafish, Xenopus, Other

• Library Method: Standard, Full-length, Normalized, Subtracted, Norm/Sub, Norm/FL

• Developmental state: adult, embryonic, juvenile

• Tissue: normal, abnormal, treated

• Clone sequence: 3' EST, 5' EST, Full length

Page 4:

Informatics focus this year was on tools to characterize and query the collection.

• IMAGEne – mature clustering tool

• IMAGEne Tissue – allows searching of tissue type dominance in clusters

• IQ – Intelligent Query tool allows mining of I.M.A.G.E. data

• Library/plate query – allows selective searching of libraries and plates

• Problem report and query – allows users to report or query problems related to I.M.A.G.E. clones

Redesign of data management system

Page 5:

IMAGEne-Human Process

[Flow diagram; box labels as extracted:]

• 2,289,020 quality I.M.A.G.E. sequences

• 14,566 NCBI RefSeq

• IMAGEne

• 1,676,516 sequences

• 623,294 sequences

• Remaining sequences

• >50 basepairs of contiguous, non-repeat sequence

• Known clusters: 14,566

• Candidate clusters w/consensus: 67,521

• I.M.A.G.E. singletons: 268,472

• 279,262 lower quality I.M.A.G.E. sequences
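The ">50 basepairs of contiguous, non-repeat sequence" criterion on this slide can be illustrated with a minimal Python sketch. It assumes that repeats have already been hard-masked as N or X before the check; the slide does not say how IMAGEne masks repeats or applies the threshold, so the function names and the masking convention here are hypothetical.

```python
import re

# Threshold taken from the slide: ">50 basepairs".
MIN_RUN = 50


def longest_unmasked_run(seq: str) -> int:
    """Length of the longest stretch of A/C/G/T with no masked base.

    Assumption: repeats are hard-masked as N or X, so any character
    other than A, C, G or T breaks a run.
    """
    runs = re.findall(r"[ACGT]+", seq.upper())
    return max((len(run) for run in runs), default=0)


def passes_filter(seq: str) -> bool:
    """True if the sequence has more than MIN_RUN contiguous unmasked bases."""
    return longest_unmasked_run(seq) > MIN_RUN
```

For example, passes_filter("ACGT" * 13) is true (52 unmasked bases), while a sequence whose unmasked stretches are all 50 bases or shorter is rejected.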

Page 6:

Initial query page, construct the query.

Page 7:

Clusters matching query results, choose your cluster.

Page 8:

Display of cluster

Page 9:

Known gene clusters with full length I.M.A.G.E. clones have doubled in number.

[Bar chart: # of clusters (y-axis, 0 to 16,000) by IMAGEne version (x-axis: V3.0, V3.1, V3.2.1, V3.3), broken down by cluster coverage: Empty, Unknown, Partial, Predicted Full, Full. Avg. gene length labels: 3392, 2763, 3380, 1896, 1578.]

Page 10:

Known Gene Cluster distribution of full length clones

[Histogram: Number of Clones (y-axis, 0 to 500) by Length of Clone (x-axis, 200 to 7200).]

avg. length = 948

Page 11:

Candidate gene cluster consensus sequences and contigs are generated by CAP4

[Histogram: Number of Clusters (y-axis, 0 to 1000) by Number of Contigs in a cluster (x-axis, 1 to 21). Bar labels: 61,314; 4,971; 824; 95; 227; 40.]

Page 12:

Candidate Gene cluster characteristics.

[Chart of candidate clusters by clone sequence type:]

• full insert: 1,938

• 3'&5': 26,236

• 3' only: 28,317

• 5' only: 11,030
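As a quick consistency check, the four counts on this slide add up to 67,521, the number of candidate clusters with consensus given on the IMAGEne-Human Process slide. The short Python sketch below does the arithmetic and derives each category's share; pairing each number with a label in the order they appear on the slide is an assumption.

```python
# Counts from the slide, paired with labels in slide order (assumed mapping).
counts = {
    "full insert": 1938,
    "3'&5'": 26236,
    "3' only": 28317,
    "5' only": 11030,
}

total = sum(counts.values())

# Share of candidate clusters in each category, as a percentage.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:12s} {n:6d} {shares[label]:5.1f}%")
print(f"{'total':12s} {total:6d}")
```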

Page 13:

Singleton: Wheat within the chaff

[Histogram: # of sequences (y-axis, 0 to 1200) by High Quality Sequence Length (x-axis, 0 to 4000).]

305 full insert sequences are singletons.

62,143 singletons have a 3’ PolyA site.

Avg. length is 547

Page 14:

IMAGEne Tissue query allows searching for tissue proportions within clusters.
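To make "tissue proportions within clusters" concrete, here is a minimal Python sketch. The slide does not describe how IMAGEne Tissue computes proportions or dominance, so the input shape (one tissue label per clone in a cluster), the function names, and the 0.5 dominance threshold are all hypothetical.

```python
from collections import Counter


def tissue_proportions(clone_tissues):
    """Fraction of a cluster's clones that come from each tissue.

    clone_tissues: one tissue label per clone in the cluster
    (hypothetical input shape).
    """
    counts = Counter(clone_tissues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tissue: n / total for tissue, n in counts.items()}


def dominant_tissue(clone_tissues, threshold=0.5):
    """Tissue whose share meets the threshold, or None if none does."""
    props = tissue_proportions(clone_tissues)
    if not props:
        return None
    tissue, share = max(props.items(), key=lambda item: item[1])
    return tissue if share >= threshold else None
```

A cluster with clones labeled brain, brain, liver, brain would yield proportions of 0.75 for brain and 0.25 for liver, and brain would be reported as dominant at the default threshold.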

Page 15:

Introducing the Intelligent Query - IQ

• For a given category (currently clone and library) a user can specify a query based on key database attributes.

• The user can specify the fields returned.

• Various result format options (HTML, text)

• Initial version was rolled out last summer

• New functionality to be added this year (additional categories, etc.)
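The query model in the bullets above (pick a category, filter on attributes, choose the returned fields, choose an output format) can be sketched in a few lines of Python. This is only an illustration of the idea, not the IQ implementation: the record layout, field names, and function names are hypothetical, and the sample data is made up.

```python
from html import escape


def run_query(records, criteria, fields):
    """Return the requested fields for records matching every criterion."""
    matches = [
        rec for rec in records
        if all(rec.get(key) == value for key, value in criteria.items())
    ]
    return [{f: rec.get(f, "") for f in fields} for rec in matches]


def format_text(rows, fields):
    """Tab-delimited text output with a header line."""
    lines = ["\t".join(fields)]
    lines += ["\t".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join(lines)


def format_html(rows, fields):
    """Simple HTML table output."""
    head = "".join(f"<th>{escape(f)}</th>" for f in fields)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[f]))}</td>" for f in fields) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Made-up records for the "clone" category (hypothetical attributes).
clones = [
    {"clone_id": "clone-1", "species": "Human", "library": "lib-A"},
    {"clone_id": "clone-2", "species": "Mouse", "library": "lib-B"},
    {"clone_id": "clone-3", "species": "Human", "library": "lib-B"},
]

fields = ["clone_id", "library"]
rows = run_query(clones, {"species": "Human"}, fields)
print(format_text(rows, fields))
```

Keeping the filter step separate from the formatters means the same result rows can be rendered as either HTML or text, mirroring the format choice the slide describes.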

Page 16:

Specify a clone-based query.

Page 17:

Next, specify which clone-centric results will be returned and in what format.

Page 18:

HTML version of clone-based query results.

Page 19:

Specify a library-based query.

Page 20:

Similarly, specify which library-centric results will be returned.

Page 21:

HTML version of library-based query results.

Page 22:

Other tools to mine I.M.A.G.E. information

Query plates from libraries. Query for reported problems.

Page 23:

Quality control for historical collection

Plates        Source                                     Well Error Rate
1-3705        Incyte                                     13
1-3705        LLNL Master                                10
1-3705        Research Genetics                          12
1-3705        Resource Center of Human Genome Project    10
1-3705        ATCC                                       11
3,796-6000    Incyte                                     7
3,796-6000    LLNL Master                                7
3,796-6000    Research Genetics                          10
3,796-6000    Resource Center of Human Genome Project    12

Page 24:

QC on-going

           LLNL Replication                     Master vs. GenBank
Month      Well error rate  Plate error rate    Well error rate  Plate error rate
6/2000     1 (1,3)          0                   7 (4,11)         2
10/2000    1 (0,3)          0                   1 (0,3)          2
12/00      0 (0,2)          2                   1 (0,3)          2
1/01       2 (1,4)          0                   6 (4,11)         3
2/01       1 (0,3)          0                   2 (1,5)          2
3/01       2 (1,5)          2                   2 (1,5)          0
4/01       1 (0,3)          2                   2 (1,4)          0
5/01       0 (0,1)          0                   2 (1,5)          0
6/01       1 (0,3)          0                   1 (0,4)          0
7/01       1 (0,4)          0                   2 (1,6)          0
8/01       2 (1,3)          0                   3 (2,6)          0

Page 25:

Ongoing QC results

[Charts: Comparing master to GenBank; Error in replication @ LLNL]

Page 26:

Next for I.M.A.G.E. Informatics

• Extensive expansion of query tools and data access

• IMAGEne non-species specific

• Analysis of human cluster candidate genes and singletons

• Redo of web site, easier to navigate

MUCH influenced by public needs… you have a say!

Page 27:

Acknowledgements

• LLNL
– Christa Prange, I.M.A.G.E. PI
– Tim Harsch, Amber Johnston, Julie Amundson

• Sponsors
– DOE, Marv Stodolsky
– NIH, Bob Strausberg

This work was partially funded by the NIH and was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.

image.llnl.gov