Issues in Managing and Disseminating Changing Information in Biology
Sue Rhee
Carnegie Institution
Department of Plant Biology
Stanford, CA
Information Dissemination Media in Biology
• Journals: ~150 years; peer-reviewed; highly referenced; limited size; static
• Public Repositories: ~20 years; minimum review; minimum reference; unlimited size; static
• Community Databases: ~5 years; curator-reviewed; moderately referenced; unlimited size; dynamic
TAIR: The Arabidopsis Information Resource
• A community database of Arabidopsis information
• Researchers can search, download, and analyze data via commonly used web browsers and FTP
• NSF-funded project (1999-2004)
• Collaboration between Carnegie (Stanford, CA), NCGR (Santa Fe, NM), and ABRC (Columbus, OH)
• http://www.arabidopsis.org
Who are the users?

People
Unspecified 8863; Graduate Student 792; Post-Doctoral Researcher 748; Professor 361; Research Scientist 348; Assistant Professor 330; Associate Professor 246; Research Associate 154; Group Leader 135; Research Assistant 121; Other 110; Unknown 100; Research Fellow 82; Project Leader 71; Undergraduate Student 70; Director 42; Lecturer 36; Senior Research Officer 25; Curator 20; Programmer 16; Teacher 10; Coordinator 10; Senior Lecturer 10; High School Teacher 9; High School Student 8; President 7; Advisory Board Member 2; Secretary 1; Middle School Teacher 1

Groups
Lab 4724; Institute 74; Project 41; Education_outreach_program 21; Facility 15; Company 14; University 14; Collaboration 13; Database 8; 4_year_college 7; Center 4; Stock_center 3; Committee 2; Foundation 1; Organization 1; Community_college 1

Organism of Interest
Arabidopsis 3211; Rice 777; Maize 535; Wheat 390; Tomato 351; Legumes 331; Bacteria 274; Fungi 227; Potato 209; Animals 177; Other Crops 150; Microorganisms 124; Legume 93; tobacco 28; barley 17; cotton 12; Tobacco 11; tomato 10; Brassica 9; petunia 8; Barley 7; Chlamydomonas 7; poplar 6

Total: 12,300 individuals and 4,700 labs working on plant research
Usage Statistics (monthly)
• ~5 million files served
• ~900,000 page views
• ~29,000 IP addresses
• ~30 Gb served
What do we do?
1. Capture data generated by large genome projects and individual researchers
– Read and extract information from the literature; establish contact with large-scale project groups
2. Curate and analyze the information
– Error checking, making associations, synthesizing summaries, and adding quality-control filters through a series of standard operating procedures and analysis pipelines
3. Make information accessible to users in an intuitive form
– In-house biologists and user feedback from surveys and workshops
4. Develop data query, analysis, curation, and visualization tools
– Collaboration between software developers and biologists; an iterative process
5. Communicate with the users
– Data submission, suggestions, error and other problem reports
What is PubSearch?
• A web application and database for literature curation
• Stores complete literature information
– References, abstracts, full-text articles (PDF)
• Stores biological information
– Genes, proteins, descriptions
• Stores ontologies (GO terms)
• Links literature, GO terms, and biological information
• Assists manual curation with fast, automatic matching (using a suffix-tree indexer)
• Is password-protected and easy to set up and use
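The automatic matching step can be illustrated with a small sketch. The slides do not show PubSearch's indexer, so this example swaps in a suffix array, a simpler relative of the suffix tree that supports the same kind of fast substring lookup. The function names, the sample abstract, and the term list are all hypothetical.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order.

    Naive construction, fine for a short abstract; a production indexer
    would use a linear-time suffix tree or suffix array algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, term):
    """Return the sorted start offsets at which term occurs in text."""
    k = len(term)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose k-character prefix is >= term.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + k] < term:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix that starts with term sits in one contiguous run.
    hits = []
    while lo < len(suffix_array) and text.startswith(term, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def match_terms(abstract, terms):
    """Map each term found in the abstract to its list of offsets."""
    text = abstract.lower()
    suffix_array = build_suffix_array(text)
    matches = {}
    for term in terms:
        offsets = find_occurrences(text, suffix_array, term.lower())
        if offsets:
            matches[term] = offsets
    return matches


# Hypothetical abstract and vocabulary, for illustration only.
abstract = "AP2 is required for flower development. Loss of AP2 alters flower development."
print(match_terms(abstract, ["AP2", "flower development", "root hair"]))
```

The index is built once per document and then queried for every gene name and vocabulary term, which is what makes this approach attractive when tens of thousands of terms are matched against each abstract. A curator would still validate each candidate hit by hand.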
TAIR Installation Statistics (9/12/03)
• 20,272 literature references
• 14,920 research papers with abstracts
• 8,642 full-text papers (58%)
• 16,956 controlled vocabulary terms
• 105,671 hits between terms and articles (2359 terms)
• 38,010 gene names
• 29,841 hits between genes and articles (4268 genes)
• 14,943 hits validated
– (70% valid, 29% not valid, 0.5% maybe)
• 11,497 manual annotations to 5981 genes from 2113 articles
• 38 relationship types for gene2term and gene2gene
• 103 evidence types
TAIR Data Size

Type           Info Stored                                               Size in 1999  Size in 2003
Website        General information, help, external sites                 0.7 Gb        25 Gb
Database       Data, external links, definition of database fields       3 Gb          20 Gb
FTP directory  Large datasets generated from database or external sites  N.D.          13 Gb
DVD Archive    Microarray raw data                                       0             1.6 Gb
Current Issues in Community Databases
1. How to maximize connection with public repositories and journals?
2. How to ensure information is up-to-date?
3. How to cross-reference all the information in independent sites?
4. What happens after the funding?
Making Connections with Public Repositories
1. Utilizing existing standards
   A. LinkOut
      a. Data capture includes the GenBank accession (e.g. a seed stock containing an insertion, and the insert-site sequence with its GenBank accession)
      b. Data are downloaded from GenBank by accession using E-utilities
      c. Data curation/analysis generates additional associations (e.g. the insertion site is used to identify the associated gene and a polymorphism for that gene)
      d. Sequence-associated information is sent back to GenBank in the LinkOut XML format
2. Collaborating to make new standards
   A. Plant microarray submission standards with ArrayExpress
   B. MIAME standards for microarrays
      a. Researchers submit microarray data in prefilled Excel sheets
      b. Excel sheets are converted into XML and loaded into the TAIR database
      c. Data curation/analysis generates additional associations (e.g. usage of controlled vocabularies)
      d. Data are exported into XML and sent to ArrayExpress
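The download-by-accession step in the LinkOut workflow can be sketched with NCBI's E-utilities EFetch service. This is a minimal illustration using today's public endpoint and parameters, not a description of the TAIR pipeline; the helper names are hypothetical and the accession in the usage line is a placeholder, not a real record.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public EFetch endpoint of NCBI E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an EFetch URL requesting one nucleotide record in GenBank flat-file format."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": retmode})
    return f"{EFETCH_URL}?{query}"


def fetch_genbank_record(accession):
    """Download the record as text (requires network access; not called here)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# "XX000000" is a placeholder accession used only to show the URL shape.
print(efetch_url("XX000000"))
```

Keeping URL construction separate from the network call makes the request easy to log and to test offline. A bulk pipeline would also need to batch requests and respect NCBI's usage guidelines.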
Making Connections with Journals
1. Publication requirement to adhere to existing standards
   A. Stock accessions
   B. Gene symbol registry (currently under discussion)
2. Data sharing
   A. Image data for gene expression
   B. Supplementary data (e.g. microarray results)
3. Resource sharing
   A. Publication through community databases?
Keeping Information Up-To-Date
1. In-house curation
– pro: experience and standard operating procedures can ensure consistency
– con: becoming difficult to keep up as the amount and complexity of information increases
2. Community involvement
– pro: expertise and sheer number of the community
– con: has not worked successfully (no incentive in the current academic reward structure; not considered to be a typical role of a scientist)
3. Others?
Impact Factor of Top Journals

Journal                               Impact Factor  Total Citations  Total Articles  Citations per Article
Nature                                30.4           326546           889             367
Science                               29.0           296080           987             300
Cell                                  27.3           139765           350             399
Genes & Development                   19.7           45227            268             169
Current Opinion in Cell Biology       19.0           12818            90              142
Molecular Cell                        16.5           16125            271             60
Journal of Cell Biology               12.5           68928            412             167
Trends in Plant Sciences              12.4           4283             60              71
Developmental Cell                    11.5           1196             139             9
The Plant Cell                        10.8           17373            241             72
PNAS                                  10.7           315820           2911            108
EMBO Journal                          10.7           77524            677             115
Current Opinion in Plant Biology      9.5            2510             74              34
Molecular Biology of the Cell         7.6            14170            347             41
Current Biology                       7.0            20020            341             59
Journal of Cell Science               7.0            20840            460             45
Journal of Biological Chemistry       6.7            370056           6444            57
The Plant Journal                     5.9            12721            287             44
Molecular Microbiology                5.8            23553            521             45
Plant Physiology                      5.8            33690            531             63
Traffic                               5.4            1182             87              14
Plant Molecular Biology               4.5            10522            194             54
Molecular Plant Microbe Interactions  3.8            4449             140             32
Journal of Computational Biology      3.5            711              44              16
Fungal Genetics and Biology           3.2            1044             72              15
Planta                                3.0            10641            245             43
Phytopathology                        2.2            9913             167             59
Current Genetics                      1.9            2788             77              36
Impact of TAIR

                   2000    2001    2002    2003
TAIR mentioned     44      60      110     143
Total full-text    1031    1036    1228    1258
Percent mentioned  4.27%   5.79%   8.96%   11.37%
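The "percent mentioned" figures follow directly from the two counts: the number of full-text articles mentioning TAIR divided by the total number of full-text articles for that year. A short check, using only the counts from the slide (the variable and function names are illustrative):

```python
# (TAIR mentioned, total full-text articles) per year, as reported on the slide.
counts = {
    2000: (44, 1031),
    2001: (60, 1036),
    2002: (110, 1228),
    2003: (143, 1258),
}


def percent_mentioned(mentioned, total):
    """Share of full-text articles that mention TAIR, as a percentage with two decimals."""
    return round(100 * mentioned / total, 2)


for year, (mentioned, total) in counts.items():
    print(year, f"{percent_mentioned(mentioned, total):.2f}%")
```

The share grows every year even though the total number of full-text articles rises only modestly, so the trend reflects more frequent mentions rather than a larger article pool.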
People Involved

TAIR-Carnegie: Tanya Berardini, Marga Garcia-Hernandez, Eva Huala, Suparna Mundodi, Leonore Reiser, Julie Tacklind, Iris Xu, Danny Yoo, Peifen Zhang, Nick Moseyko, Brandon Zoekler, Jessie Zhang

TAIR-NCGR: Dan Weems, Neil Miller, Mary Montoya

ABRC: Randy Scholl, Debbie Crist, Emma Knee, Luz Rivero
Information Dissemination Media in Biology
1. Scientific Journals
• Traditional medium of knowledge dissemination
• Long history of publishing
• Recently have moved to electronic publishing
2. Public Repositories
• Permanent operations for electronic storage and dissemination of basic data
• Shorter history than journals, about 20 years
• A good example is NCBI's GenBank
3. Community Databases
• Information resources that are created, maintained, and improved by the research community
• Funded by governments; not permanent
• A few large databases share a similar history with public repositories
• Recently there has been a radiation of community databases