Bioinformatics & LIS


Page 1: Bioinformatics & LIS

Bioinformatics & LIS

A brief talk for librarians, information scientists, and computer scientists about resources and collaborative opportunities with biology.

April 18, 2006

G. Benoit

Page 2: Bioinformatics & LIS

Outline of the talk

• Bioinformatics defined

• Generation of data

• Tools and databases

• Activities for Librarianship, Computer and Information Science

• Examples:
  – Entrez, NCBI, Visualization

• Collaborations

Page 3: Bioinformatics & LIS

Bioinformatics defined

• Over 70 definitions

• Differences arise from the work

• National Center for Biotechnology Information (NCBI):
  – The development of new algorithms and statistics with which to assess relationships among members of large data sets;
  – The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
  – The development and implementation of tools that enable efficient access and management of different types of information.

Page 4: Bioinformatics & LIS

Without getting into the science…

• How the data started …

• Four chemical bases: purines [adenine (A), guanine (G)] and pyrimidines [cytosine (C), thymine (T)]

• Their precise order and linking (each base is attached to a sugar molecule and to a phosphate molecule to create a nucleotide) …

Page 5: Bioinformatics & LIS

DNA

Page 6: Bioinformatics & LIS

• A pairs with T; G with C to make unique and very long strings, called sequences

• E.g., AATGACCAT codes for a different gene than GGGCCATAG would

• Transcription: RNA consists of A, G, C, and uracil (U) and has ribose instead of deoxyribose

• Point is one can predict missing data, sometimes…
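The pairing rules above are what make some missing data predictable: given one strand, the other follows. A minimal Python sketch of that idea, using the example sequence from this slide; the function names are illustrative and not taken from any tool mentioned in the talk.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


def to_rna(strand):
    """Rewrite a DNA strand in the RNA alphabet (thymine becomes uracil)."""
    return strand.upper().replace("T", "U")


print(complement("AATGACCAT"))          # TTACTGGTA
print(reverse_complement("AATGACCAT"))  # ATGGTCATT
print(to_rna("AATGACCAT"))              # AAUGACCAU
```

Complementing twice returns the original strand, which is the sense in which one strand "predicts" the other.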

Page 7: Bioinformatics & LIS

In short… the nucleotides are linked in a certain order or sequence through the phosphate group; their precise order and linking within the DNA determine what proteins the gene produces and the phenotype of the organism.

Page 8: Bioinformatics & LIS

Generation of Data

• Raw data from sequencing

• Expression data

• Data generated by linking other raw data in very large, multidimensional databases (e.g., OMIM)

• Research literature (full-text journals)

• Data models to describe the literature for retrieval, linking to other data, and linking to the raw data

• New data models to support greater flexibility in describing & manipulating data …

Page 9: Bioinformatics & LIS

Generation of Data

• To support integrated search and retrieval

• To focus on single organisms or find similarities across them

• Feed other technology

• Visualization of natural phenomena and of abstract phenomena

Page 10: Bioinformatics & LIS

Tools & Databases

• A host of tools for database searching…
  – BLAST (basic local alignment search tool)
  – FASTA (sequence strings)
  – ChopUp (protein analysis)
  – Integrated packages (Lasergene Sequence Analysis Software)
  – The many services offered through NCBI and NLM
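The name FASTA is also used for the plain-text sequence format that many of these tools read: a header line beginning with ">" followed by one or more lines of sequence. A minimal sketch of reading that format in Python follows; the sample records and the function name are hypothetical and stand in for real database output.

```python
def parse_fasta(text):
    """Return a dict mapping each record's header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]       # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical sample data, not taken from any database.
SAMPLE = """>example_1 hypothetical record
AATGAC
CAT
>example_2 hypothetical record
GGGCCATAG
"""

records = parse_fasta(SAMPLE)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

Production work would normally rely on an established parser rather than a hand-written one; the sketch only shows how little structure the format imposes.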

Page 11: Bioinformatics & LIS

• Take a look at the handout, Table 1, publicly accessible databases

Page 12: Bioinformatics & LIS

Data Categories

• Monographs, Journals, Announcements (text)

• Datasets:
  – Bibliographic (http://www.expasy.org/links.html)
  – Taxonomic
  – Nucleic acid
  – Genomic (e.g., GDB, OMIM)
  – Protein DB (SwissProt, TrEMBL)
  – Protein families, domains, and functional sites
  – Proteomics initiative
  – Enzyme/metabolic pathways
  – Sequence Retrieval System (SRS) and NCBI Data Model

Page 13: Bioinformatics & LIS

• Take a look at the handout, Table 2, publicly accessible databases defined, and then

• Entrez sample, Table 3

Page 14: Bioinformatics & LIS

Entrez example

• Notice the familiar access points (author, journal, title) as well as domain-specific ones (exon, gene, organism)

• Notice, too, the DNA …
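Fielded access points like these can also be combined programmatically. The sketch below only builds a search URL and makes no network call. It assumes NCBI's ESearch endpoint and the `[Author]` and `[Organism]` field tags, which the talk itself does not show; the author name is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not shown in the talk).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(**fields):
    """Join field-tagged terms with AND, e.g. author='Smith J' -> 'Smith J[Author]'."""
    parts = [f"{value}[{field.capitalize()}]" for field, value in fields.items()]
    return " AND ".join(parts)


def esearch_url(db, term):
    """Return an ESearch URL for the given database and query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# 'Smith J' is a placeholder author, used only for illustration.
term = build_query(author="Smith J", organism="Escherichia coli")
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

The familiar bibliographic field (author) and the domain-specific one (organism) end up in the same query string, which is the point the slide makes about Entrez.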

Page 15: Bioinformatics & LIS

NCBI Homepage

• http://www.ncbi.nih.gov/

• Notice the variety of tools (left menu)

• Site map: http://www.ncbi.nih.gov/Sitemap/index.html

• Alpha list: http://www.ncbi.nih.gov/Sitemap/AlphaList.html

Page 16: Bioinformatics & LIS

Linking across resources

• http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html

• NCBI's structure database is called the Molecular Modeling Database (MMDB). It is a subset of the 3D structures in the Protein Data Bank (PDB), excluding theoretical models. Data are obtained from X-ray crystallography and NMR spectroscopy. The goal is to make it easier to compare structures.

• Searching: a variety of access points: author, title, text terms, a 4-character PDB code, or a numerical MMDB-id

• MMDB data: PDB records are parsed to extract sequences, citations, and structural information, then converted to ASN.1.

• Taxonomy: used to help end users see term relationships and databases, along with literature references:
  – Example: http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/
  – http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Escherichia+coli&lvl=0&srchmode=1
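The mix of access points for structure searching (free text, a 4-character PDB code, a numerical MMDB-id) means a search front end has to decide what kind of term it was given. The sketch below shows one way to do that. It assumes a PDB code is a digit from 1 to 9 followed by three letters or digits; the function name and category labels are hypothetical.

```python
import re

# Assumption: a PDB code is one digit (1-9) followed by three alphanumerics.
PDB_CODE = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def classify_term(term):
    """Guess which access point a structure search term refers to."""
    term = term.strip()
    if term.isdigit():
        # An all-digit term is treated as an MMDB-id, even if it is
        # four characters long and could also be read as a PDB code.
        return "mmdb_id"
    if PDB_CODE.match(term):
        return "pdb_code"
    return "text"


for example in ["1TUP", "12345", "hemoglobin"]:
    print(example, "->", classify_term(example))
```

Checking digits first is a design choice: it resolves the overlap between the two identifier styles in favor of the numeric MMDB-id.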

Page 17: Bioinformatics & LIS

Linking across resources

• XML - there are hundreds of XML schemas used in biology

• Calls for mapping to ASN.1 records [see NCBI example]

• Calls for mapping across schemas

• Calls for exporting data for different devices…
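Mapping across schemas can be as simple as a lookup table from one vocabulary to another. The sketch below uses Python's standard `xml.etree.ElementTree`; both the source and target element names are invented for illustration and do not correspond to any real biological schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical source record; element names are invented for illustration.
SOURCE = (
    '<entry id="X1">'
    "<organism>Escherichia coli</organism>"
    "<seq>AATGACCAT</seq>"
    "<note>ignored by the mapping</note>"
    "</entry>"
)

# Source element name -> target element name (both hypothetical).
FIELD_MAP = {"organism": "taxon", "seq": "sequence"}


def map_record(xml_text):
    """Rebuild a source record using the target schema's element names."""
    src = ET.fromstring(xml_text)
    dst = ET.Element("record", {"accession": src.get("id", "")})
    for child in src:
        target = FIELD_MAP.get(child.tag)
        if target is None:
            continue  # no equivalent in the target schema; drop the field
        ET.SubElement(dst, target).text = child.text
    return dst


mapped = ET.tostring(map_record(SOURCE), encoding="unicode")
print(mapped)
```

Real mappings are rarely one-to-one; fields that have no equivalent, like the `note` element here, are where most of the curation effort goes.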

Page 18: Bioinformatics & LIS

Visualization

• Cn3D - uses MMDB (Entrez's structure database)
  – http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml

• RasMol http://www.umass.edu/microbio/rasmol/

• Protein Explorer http://www.umass.edu/microbio/rasmol/rotating.htm

• OpenRasMol http://www.openrasmol.org/

• MolviZ.org http://www.umass.edu/microbio/chime

• World Index of Molecular Visualization http://molvis.sdsc.edu/visres/index.html

Page 19: Bioinformatics & LIS

Recap main points

• Very large data sets - "homogenized" through ASN.1

• Goal to integrate (text-text, visualization-text, text-vis)

• Raw data + research literature + visualization

• Biologists provide domain knowledge

• XML is a big player

• CS and IS provide technology

• Librarians provide maintenance and access to resources

Page 20: Bioinformatics & LIS

Collaborative Opportunities

• For LIS and CS:
  – Domain analysis
  – Information use, communication, theories of information;
  – Systems analysis and design,
  – Data modeling,
  – Classification,
  – Storage and retrieval,
  – HCI mapped onto a generalized model of a molecular biology experimental cycle

• [Denn & MacMullen, 2002, p. 556]

Page 21: Bioinformatics & LIS

Collaborative Opportunities

• "Insertion Points" - development of new tools and methods for managing, integrating & visualizing data

• For local use: download selected data sets for local needs (Stapley & Benoit, 2000)

• XML Transformations

• XML - SVG - X3D

• Automated retrieval

• Clustering (data- and text-mining)
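As one illustration of the XML-to-SVG path, the sketch below turns a small XML list of sequence features into SVG rectangles laid out along a line. The input schema, the scale factor, and the function name are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical feature list; the schema is invented for illustration.
FEATURES = """<features length="100">
  <feature name="exon1" start="10" end="40"/>
  <feature name="exon2" start="60" end="90"/>
</features>"""


def features_to_svg(xml_text, scale=4, height=20):
    """Draw each feature as a rectangle positioned by its start and end."""
    root = ET.fromstring(xml_text)
    width = int(root.get("length")) * scale
    svg = ET.Element(
        "svg", {"xmlns": SVG_NS, "width": str(width), "height": str(height)}
    )
    for feat in root.iter("feature"):
        start, end = int(feat.get("start")), int(feat.get("end"))
        rect = ET.SubElement(svg, "rect", {
            "x": str(start * scale),
            "y": "0",
            "width": str((end - start) * scale),
            "height": str(height),
        })
        # A <title> child gives viewers a tooltip with the feature name.
        ET.SubElement(rect, "title").text = feat.get("name")
    return ET.tostring(svg, encoding="unicode")


svg_text = features_to_svg(FEATURES)
print(svg_text)
```

The same feature list could feed a different renderer (X3D, for instance) by swapping the output step, which is the appeal of keeping the data in XML.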

Page 22: Bioinformatics & LIS

Collaborative Opportunities

• Biologists' needs:
  – To go beyond mining of genomic data to investigate causal entailments in inter- and intracellular dynamics

• LIS's response:
  – To aid understanding of the scientific processes through visualization of literature, metadata and graphic representations in general and for disease-specific analysis

Page 23: Bioinformatics & LIS

Back to you…

• Thanks …