ESTminer CHADO adaptor
The University of Georgia
Alan Gingle, [email protected]
Yecheng Huang, [email protected]
http://cggc.agtec.uga.edu/
Nov 1, 2004
Introduction
• The purpose of this presentation is to draft an EST CHADO schema that is open for community comment.
• Examples demonstrate our approach to applying CHADO to EST data.
Contents:
• ESTminer CHADO schema overview
• Controlled vocabulary: ontology and definition
• Feature and its properties, relationships and location
• Appendix (examples used in the slides, minor tables)
ESTminer CHADO schema overview
• The major part of CHADO that is relevant to the ESTminer project
EST_CHADO V0.1

dbxrefprop
  dbxrefprop_id (PK)
  dbxref_id (FK1, U1, I1)
  type_id (FK2, U1, I2)
  value (U1)
  rank (U1)

featureprop
  featureprop_id (PK)
  feature_id (FK1, U1, I1)
  type_id (FK2, U1, I2)
  value (U1)
  rank (U1)

cvterm
  cvterm_id (PK)
  cv_id (FK1, U1, I1)
  name (U1)
  definition
  dbxref_id (FK2)

feature
  feature_id (PK)
  dbxref_id (FK1, I1)
  organism_id (I2, U1)
  name (I5, I6)
  uniquename (I4, U1)
  residues
  seqlen
  md5checksum
  type_id (FK2, I3, U1)
  is_analysis
  timeaccessioned
  timelastmodified

analysis
  analysis_id (PK)
  name
  description
  program (U1)
  programversion (U1)
  algorithm
  sourcename (U1)
  sourceversion
  sourceuri
  timeexecuted

cvterm_dbxref
  cvterm_dbxref_id (PK)
  cvterm_id (FK2, U1, I1)
  dbxref_id (FK1, U1, I2)

featurerange
  featurerange_id (PK)
  featuremap_id (I1)
  feature_id (FK5, I2)
  leftstartf_id (FK4, I3)
  leftendf_id (FK3, I4)
  rightstartf_id (FK2, I5)
  rightendf_id (FK1, I6)
  rangestr

cvterm_relationship
  cvterm_relationship_id (PK)
  type_id (FK3, I1, U1)
  subject_id (FK2, I2, U1)
  object_id (FK1, U1, I3)

analysisfeature
  analysisfeature_id (PK)
  feature_id (FK2, U1, I1)
  analysis_id (FK1, U1, I2)
  rawscore
  normscore
  significance
  identity

db
  db_id (PK)
  name (U1, I1)
  contact_id
  description
  urlprefix
  url

feature_cvterm
  feature_cvterm_id (PK)
  feature_id (FK1, I1, U1)
  cvterm_id (FK2, I2, U1)
  pub_id (I3, U1)

dbxref
  dbxref_id (PK)
  db_id (FK1, I1, U1)
  accession (U1, I2)
  version (U1, I3)
  description

analysisprop
  analysisprop_id (PK)
  analysis_id (FK1, U1, I1)
  type_id (FK2, U1, I2)
  value (U1)

feature_relationship
  feature_relationship_id (PK)
  subject_id (FK2, I1, U1)
  object_id (FK1, I2, U1)
  type_id (FK3, U1, I3)
  rank

cv
  cv_id (PK)
  name (U1)
  definition

featureloc
  featureloc_id (PK)
  feature_id (PK, FK1, U1, I2)
  srcfeature_id (FK2, I4, I3)
  fmin (I4, I1)
  is_fmin_partial
  fmax (I4, I1)
  is_fmax_partial
  strand
  phase
  residue_info
  locgroup (U1)
  rank (U1)
EST Controlled vocabulary I - Ontology
cvterm_relationship
  cvterm_relationship_id (PK)
  type_id (FK3, I1, U1)
  subject_id (FK2, I2, U1)
  object_id (FK1, U1, I3)
Ontology diagram nodes (cvterm_id: name):
1: Read3'
2: Sequence
3: Contig
4: Cluster
5: ESTName
6: Library
7: GB_ACC_#
8: Scr1o
9: Scr1e
10: Scr2o
11: Scr2e
12: QUAL16o
13: QUAL16e
14: QUAL20o
15: QUAL20e
16: GB_Access
17: Identity_threshold
18: Length_threshold
19: Library_name
20: stage
21: cultivar
22: cell_type
23: organ
24: strain
25: Organism
26: imo
27: ipo
28: numofSeq
29: numofcontig
…
EST Controlled vocabulary II - Definition
cv
  cv_id (PK)
  name (U1)
  definition
cvterm_id  name                definition
1          Read3               3' read
2          sequence            EST Sequence
3          Contig              EST Contig
4          Cluster             EST Cluster
5          Name                EST Name
6          Lib                 Library
7          GB_Access           GenBank Access Number
8          Scr1o               Screen offset 1
9          Scr1e               Screen end 1
10         Scr2o               Screen offset 2
11         Scr2e               Screen end 2
12         QUAL16o             Quality16 offset
13         QUAL16e             Quality16 end
14         QUAL20o             Quality20 offset
15         QUAL20e             Quality20 end
16
17         Identity_threshold
18         Length_threshold
19         Library_name
20         stage
21         cultivar
22         Cell_type
23         Organ
24         strain
25         Organism            Organism and species
26         imo                 Is member of
27         ipo                 Is part of
28         numofseq            Number of seq
29         numofcontig         Number of contig
…          …                   …
• insert into cv (cv_id, name, definition) values (1, 'CGGC_UGA', 'University of Georgia, Comparative Grass Genomic Center');
• insert into cvterm (cvterm_id, cv_id, name, definition, dbxref_id) values (1, 1, 'Read5', '5'' read', 1);
cvterm
  cvterm_id (PK)
  cv_id (FK1, U1, I1)
  name (U1)
  definition
  dbxref_id (FK2)
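The vocabulary loading shown on this slide can be sketched end to end as follows. This is a minimal illustration only: it assumes an in-memory SQLite database in place of the PostgreSQL server CHADO normally runs on, it keeps only the cv and cvterm columns shown on the slide, and the `term_id` helper is a hypothetical name introduced here.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a real CHADO database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cv (
    cv_id      INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    definition TEXT
);
CREATE TABLE cvterm (
    cvterm_id  INTEGER PRIMARY KEY,
    cv_id      INTEGER NOT NULL REFERENCES cv (cv_id),
    name       TEXT NOT NULL,
    definition TEXT,
    dbxref_id  INTEGER,
    UNIQUE (cv_id, name)  -- mirrors the U1 marker on cv_id and name
);
""")

conn.execute(
    "INSERT INTO cv (cv_id, name, definition) VALUES (?, ?, ?)",
    (1, "CGGC_UGA", "University of Georgia, Comparative Grass Genomic Center"),
)

# A few terms copied from the definition table on this slide.
terms = [
    (2, "sequence", "EST Sequence"),
    (3, "Contig", "EST Contig"),
    (4, "Cluster", "EST Cluster"),
    (26, "imo", "Is member of"),
    (27, "ipo", "Is part of"),
]
conn.executemany(
    "INSERT INTO cvterm (cvterm_id, cv_id, name, definition) VALUES (?, 1, ?, ?)",
    terms,
)


def term_id(name):
    """Hypothetical helper: look up a cvterm_id by name in the CGGC_UGA vocabulary."""
    row = conn.execute(
        "SELECT t.cvterm_id FROM cvterm t JOIN cv ON cv.cv_id = t.cv_id "
        "WHERE cv.name = 'CGGC_UGA' AND t.name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None
```

Resolving term names through a lookup like this keeps numeric cvterm_id values out of the loading scripts, so the ids can change without touching the code that uses them.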
EST Feature
feature
  feature_id (PK)
  dbxref_id (FK1, I1)
  organism_id (I2, U1)
  name (I6, I5)
  uniquename (I4, U1)
  residues
  seqlen
  md5checksum
  type_id (FK2, I3, U1)
  is_analysis
  timeaccessioned
  timelastmodified
insert into feature (feature_id, uniquename, residues, seqlen, type_id, …) values (1, 'IP1_1_F11.g1_A002', 'TGAG…CATTT', 788, 1, …);
feature_id  uniquename         residues            seqlen  type_id
1           IP1_1_F11.g1_A002  TGAG…CATTT          788     1
2           IP1                                            6
3           Q20_1              TTT...TGGA          579     1
4           Q16_1              TTT…TTCCGAT         618     1
5           CTGSB_100848       Consensus residues  …       3
6           CLSB_1540                              …       4
**** See the example in the appendix ****
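As a rough sketch of how the rows above could be loaded and queried, the following uses an in-memory SQLite table with only a subset of the feature columns. SQLite, the reduced column set, and the `sequence_names` variable are assumptions made for illustration; residues are left out because the slide elides them.

```python
import sqlite3

# Assumption: in-memory SQLite and a reduced feature table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    seqlen     INTEGER,
    type_id    INTEGER NOT NULL,
    UNIQUE (uniquename, type_id)  -- simplified: the slide's U1 also covers organism_id
)""")

# Rows taken from the slide; seqlen is NULL where the slide gives no number.
rows = [
    (1, "IP1_1_F11.g1_A002", 788, 1),
    (2, "IP1", None, 6),
    (3, "Q20_1", 579, 1),
    (4, "Q16_1", 618, 1),
    (5, "CTGSB_100848", None, 3),
    (6, "CLSB_1540", None, 4),
]
conn.executemany(
    "INSERT INTO feature (feature_id, uniquename, seqlen, type_id) VALUES (?, ?, ?, ?)",
    rows,
)

# Every sequence-type feature (type_id 1 on this slide), in feature_id order.
sequence_names = [
    name
    for (name,) in conn.execute(
        "SELECT uniquename FROM feature WHERE type_id = 1 ORDER BY feature_id"
    )
]
```

Because libraries, sequences, contigs and clusters all share one table, type_id is the only thing that distinguishes them in a query.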
EST Feature and Properties
feature table
Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
2           IP1                6 (Library)
5           CTGSB_100848       3 (contig)
6           CLSB_1540          4 (cluster)
…
feature_property table: IP1_1_F11.g1_A002
Featureprop_id  Feature_id  Type_id         Value
1               1           2 (sequence)
2               1           5 (ESTname)     IP1_1_F11.g1_A002
3               1           12 (QUAL16o)    11
4               1           13 (QUAL16e)    628
5               1           14 (QUAL20o)    11
6               1           15 (QUAL20e)    589
7               1           16 (GB_Access)  BG946868
feature_property table: IP1
Featureprop_id  Feature_id  Type_id            Value
8               2           19 (Library_name)  IP1
9               2           20 (stage)
10              2           21 (cultivar)      BTx623
11              2           22 (cell_type)     N/A
12              2           23 (organ)         Developing preanthesis pannicles
13              2           24 (strain)        N/A
14              2           25 (Organism)      Sorghum bicolor L.
feature_property table: CLSB_1540
Featureprop_id  Feature_id  Type_id                  Value
16              6           17 (Identity_threshold)  95
17              6           18 (Length_threshold)    20
18              6           29 (numofcontig)         1
feature_property table: CTGSB_100848
Featureprop_id  Feature_id  Type_id        Value
15              5           28 (numofseq)  2
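The property tables above follow a type/value pattern: each row points to a cvterm that names the property. A minimal sketch of reading them back as name/value pairs is given below. It assumes in-memory SQLite, a cvterm table cut down to two columns, and a hypothetical `properties` helper; the values are the ones listed for IP1_1_F11.g1_A002.

```python
import sqlite3

# Assumption: in-memory SQLite with simplified cvterm and featureprop tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id     INTEGER NOT NULL,
    type_id        INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value          TEXT,
    rank           INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, type_id, value, rank)
);
""")
conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (12, "QUAL16o"), (13, "QUAL16e"), (14, "QUAL20o"),
    (15, "QUAL20e"), (16, "GB_Access"),
])
conn.executemany(
    "INSERT INTO featureprop (featureprop_id, feature_id, type_id, value) "
    "VALUES (?, ?, ?, ?)",
    [(3, 1, 12, "11"), (4, 1, 13, "628"), (5, 1, 14, "11"),
     (6, 1, 15, "589"), (7, 1, 16, "BG946868")],
)


def properties(feature_id):
    """Hypothetical helper: return {term name: value} for one feature."""
    rows = conn.execute(
        "SELECT t.name, p.value FROM featureprop p "
        "JOIN cvterm t ON t.cvterm_id = p.type_id "
        "WHERE p.feature_id = ? ORDER BY p.featureprop_id",
        (feature_id,),
    ).fetchall()
    return dict(rows)
```

Values are stored as text, so numeric properties such as the quality offsets need converting by the caller.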
EST Feature Relationship
feature_relationship
  feature_relationship_id (PK)
  subject_id (FK2, I1, U1)
  object_id (FK1, I2, U1)
  type_id (FK3, U1, I3)
  rank
Feature relationship table
Feature_relationship_id  subject_id  object_id    type_id            rank
3                        1           5 (contig)   26 (is member of)
4                        5           6 (cluster)  26 (is member of)
5                        1           2 (library)  26 (is member of)
feature table
Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
2           IP1                6 (library)
5           CTGSB_100848       3 (contig)
6           CLSB_1540          4 (cluster)
…
feature_id 1 (sequence) --member of--> feature_id 5 (contig) --member of--> feature_id 6 (cluster)
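The membership chain in the diagram (sequence, then contig, then cluster) can be walked with a recursive query. The sketch below is one possible way to do it, not part of the proposed schema: it assumes in-memory SQLite, loads only the two "member of" edges drawn in the diagram, and uses a hypothetical `containers` helper.

```python
import sqlite3

# Assumption: in-memory SQLite with simplified feature tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
CREATE TABLE feature_relationship (
    feature_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id    INTEGER NOT NULL,
    UNIQUE (subject_id, object_id, type_id)
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?)", [
    (1, "IP1_1_F11.g1_A002"), (5, "CTGSB_100848"), (6, "CLSB_1540"),
])

IS_MEMBER_OF = 26  # cvterm_id of "imo" in the vocabulary slides

# The two edges drawn in the diagram: sequence -> contig -> cluster.
conn.executemany(
    "INSERT INTO feature_relationship (subject_id, object_id, type_id) "
    "VALUES (?, ?, ?)",
    [(1, 5, IS_MEMBER_OF), (5, 6, IS_MEMBER_OF)],
)


def containers(feature_id):
    """Hypothetical helper: names of everything the feature belongs to, nearest first."""
    rows = conn.execute("""
        WITH RECURSIVE up (id, depth) AS (
            SELECT object_id, 1 FROM feature_relationship
            WHERE subject_id = ? AND type_id = ?
            UNION
            SELECT r.object_id, up.depth + 1
            FROM feature_relationship r JOIN up ON r.subject_id = up.id
            WHERE r.type_id = ?
        )
        SELECT f.uniquename FROM up JOIN feature f ON f.feature_id = up.id
        ORDER BY up.depth
    """, (feature_id, IS_MEMBER_OF, IS_MEMBER_OF)).fetchall()
    return [name for (name,) in rows]
```

The query assumes the membership graph has no cycles, which holds for a sequence/contig/cluster hierarchy.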
EST Feature Location
featureloc
  featureloc_id (PK)
  feature_id (PK, FK1, U1, I2)
  srcfeature_id (FK2, I4, I3)
  fmin (I4, I1)
  is_fmin_partial
  fmax (I4, I1)
  is_fmax_partial
  strand
  phase
  residue_info
  locgroup (U1)
  rank (U1)
feature table
Feature_id  Uniquename         Type_id
1           IP1_1_F11.g1_A002  1 (sequence)
3           Q20_1              1 (sequence)
4           Q16_1              1 (sequence)
…
featureloc table
Featureloc_id  Feature_id  Srcfeature_id  fmin  fmax
1              3           1              11    589
2              4           1              11    628
…
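[Figure: feature_id 1 spans positions 1 to 788; feature_id 3 and feature_id 4 are located on it, both starting at position 11 and ending at 589 and 628 respectively.]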
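A quick consistency check between featureloc and feature.seqlen can be sketched as below. It assumes in-memory SQLite and reads fmin and fmax as 1-based inclusive positions, which is how the example numbers on these slides behave (11 to 589 gives 579 residues). Standard CHADO releases define featureloc in zero-based interbase coordinates, in which case the `+ 1` would be dropped. Each quality region is paired with the end position implied by its seqlen and by the QUAL20e and QUAL16e properties.

```python
import sqlite3

# Assumption: in-memory SQLite, simplified tables, 1-based inclusive coordinates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    seqlen     INTEGER
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature (feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    fmin          INTEGER NOT NULL,
    fmax          INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "IP1_1_F11.g1_A002", 788),
    (3, "Q20_1", 579),
    (4, "Q16_1", 618),
])
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", [
    (1, 3, 1, 11, 589),  # Phred 20+ region
    (2, 4, 1, 11, 628),  # Phred 16+ region
])

# Length of each region located on feature 1.
spans = dict(conn.execute(
    "SELECT f.uniquename, l.fmax - l.fmin + 1 "
    "FROM featureloc l JOIN feature f ON f.feature_id = l.feature_id "
    "WHERE l.srcfeature_id = 1"
))

# Regions whose located length disagrees with the stored seqlen.
mismatches = conn.execute(
    "SELECT f.uniquename FROM featureloc l "
    "JOIN feature f ON f.feature_id = l.feature_id "
    "WHERE l.fmax - l.fmin + 1 <> f.seqlen"
).fetchall()
```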
Appendix – Example of EST Library
IP1
STAGE: N/A
FULL_NAME: Immature pannicle 1
CULTIVAR: BTx623
CELL_TYPE: N/A
STRAIN: N/A
ORGANISM: Sorghum bicolor L.
BOTANICAL_NAME: S. bicolor
ORGAN: Developing preanthesis pannicles
CELL_LINE: N/A
COMMENT_FOR_EST: Sequences have been trimmed to exclude PolyA, vector and regions below Phred quality 16. The threshold for high quality sequence is 20. Three-prime sequences, which are obtained with PolyTMix or T7 sequencing primer, are presented as the reverse complement.
PUBLISH: Y
HOST: N/A
SEX: N/A
RE_2: EcoRI
TISSUE: N/A
RE_1: XhoI
LIB_NAME: IP1
VECTOR: pBluescript II SK(-) from Lambda Zap II
V_TYPE: Plasmid
DESCR: The library was made from poly-A RNA in the cloning vector lambda ZAP II. Clones to be sequenced were prepared by mass excision.
Appendix – Example of EST Sequence
Sequence Name: IP1_1_F11.g1_A002
GenBank Access Number: BG946868
1 11 21 31 41 51 61 71 81 91
1 TGAGTTTTTT TTTTTTTTTT TTGTTCTTAA TTATTCAATT CATTCATGAT ACTACTGTCT GCTATTTCCA CAGTAAATGT TCATATTACA TAGGAGCCAC
101 TGGCTCCTCC GGATTCCTTA AAAAAAATGT CCATATTACA ATTGGATTTA TGATACTACA CAGGTTCGCG AAATCGAGCA GGTTAGAAAA GCTTCCACTT
201 GCTGACCTCA CTAAAAGTGA AACACAGTTC CGGGAAGTTC ATACAGTTTT CCCATATAGA TCAATTGATC CTATCTGAAA CCTTGGATTA GAATGAGATT
301 CTCTTACGCG TAGAAACCTA AACCGGAAAG CATTTGCTTT ATATCTCTTA TCCACTGTAA ATGTTTTTCT AAGGAAACGG CTCTCAAACA TTTCAGAATT
401 CCGAGCATCA AGTAGATTCC AGGTGGAACC TGCATCTGTG CTCCCTTCAA GAACCCAGTC CATTGGATCC CTCTCTGGAG CATCATTAGC TGACATCAAA
501 TCATATGACT CCAACTCACA ACTTTTGCCA AGCTTGCATTG TATAAATCAG CCAACATCCT TTGGCTCCAT CAGGCTCTTC CCATTTGGAA GAATGGATGC
601 CGTCAAAAGC TGCTGTTGCA ATTCCGATTG GGAGCTGTTC CCTGCTTGCA AGGACTGAAC CTGAGCATAC TCTGTTCCCC TCTGGGAAAT GGTTGCCCTC
701 TGTGAAAGAG GTATTANNTC TATAATACTC ATATCTCATT ACTGCATCCA GTGCTACTGG TAACGCTNAG GATGAGTGGA TTGCATTT
• Length of Sequence: 788
• Screened Vector
• Phred Quality 20+ START: 11 END: 589
• Phred Quality 16+ START: 11 END: 628
• Phred Quality Below 16
Appendix – Example of EST Cluster and Contig
95-20-CLSB_1540
Identity Threshold: 95
Length Threshold: 20
Cluster Name: CLSB_1540
Number of Contigs: 1
CTGSB_100848
Contig Name: CTGSB_100848
Number of Sequences: 2
• IP1_1_F11.g1_A002
• P1_48_H11.g1_A002
Appendix – Example of EST Database
db
  db_id (PK)
  name (U1, I1)
  contact_id
  description
  urlprefix
  url

dbxref
  dbxref_id (PK)
  db_id (FK1, I1, U1)
  accession (U1, I2)
  version (U1, I3)
  description
• insert into db (db_id, name, …) values (1, 'CGGC_UGA', …);
• insert into dbxref (dbxref_id, db_id, …) values (1, 1, …);
• insert into dbxrefprop (dbxrefprop_id, dbxref_id, …) values (1, 1, …);
dbxrefprop
  dbxrefprop_id (PK)
  dbxref_id (FK1, U1, I1)
  type_id (FK2, U1, I2)
  value (U1)
  rank (U1)
Appendix – Example of Analysis
analysis
  analysis_id (PK)
  name
  description
  program (U1)
  programversion (U1)
  algorithm
  sourcename (U1)
  sourceversion
  sourceuri
  timeexecuted

analysisfeature
  analysisfeature_id (PK)
  feature_id (FK2, U1, I1)
  analysis_id (FK1, U1, I2)
  rawscore
  normscore
  significance
  identity
analysis_id 1 …
name CGGC_01 …
description …
program blast …
algorithm cagt_miner …
analysisfeature_id 1 2 …
analysis_id 1 1 …
feature_id 5 6 …
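The link between an analysis run and the features it produced can be sketched as a simple join. As with the other sketches, this assumes in-memory SQLite and only the columns that carry values in the example above; `features_of` is a hypothetical helper name.

```python
import sqlite3

# Assumption: in-memory SQLite with only the columns filled in on this slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    name        TEXT,
    program     TEXT NOT NULL,
    algorithm   TEXT
);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id         INTEGER NOT NULL,
    analysis_id        INTEGER NOT NULL REFERENCES analysis (analysis_id),
    UNIQUE (feature_id, analysis_id)
);
""")
conn.execute(
    "INSERT INTO analysis (analysis_id, name, program, algorithm) VALUES (?, ?, ?, ?)",
    (1, "CGGC_01", "blast", "cagt_miner"),
)
# Contig (feature 5) and cluster (feature 6) both come from analysis 1.
conn.executemany(
    "INSERT INTO analysisfeature (analysisfeature_id, feature_id, analysis_id) "
    "VALUES (?, ?, ?)",
    [(1, 5, 1), (2, 6, 1)],
)


def features_of(analysis_name):
    """Hypothetical helper: feature_ids produced by the named analysis."""
    rows = conn.execute(
        "SELECT af.feature_id FROM analysisfeature af "
        "JOIN analysis a ON a.analysis_id = af.analysis_id "
        "WHERE a.name = ? ORDER BY af.feature_id",
        (analysis_name,),
    ).fetchall()
    return [fid for (fid,) in rows]
```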