Biological Data Extraction and Integration A Research Area Background Study Cui Tao Department of Computer Science Brigham Young University
Date posted: 21-Dec-2015


Biological Data Extraction and Integration

A Research Area Background Study

Cui Tao

Department of Computer Science

Brigham Young University


Research Field Overview

My research

Semantic Web

Data Integration

Schema Matching

Information Extraction

Bioinformatics


Information Extraction

• “Information extraction systems process text documents and locate a specific set of relevant items.” [Califf99]


• “Because the WWW consists primarily of text, information extraction is central to all effort that would use the web as a resource for knowledge discovery.” [Freitag98]


Information Extraction

• Traditional information extraction

• Hidden web crawling

• Biological data extraction


Traditional Information Extraction

• Different groups of IE tools [Laender02]:
  – Wrapper generation tools
  – NLP-based and learning-based tools
  – Ontology-based tools


Traditional Information Extraction

• Wrapper generation tools
  – Lixto [Baumgartner01]
    • Supervised wrapper generation
    • Semi-automatic
    • Not robust; does not work well with unstructured data
  – ROADRUNNER [Crescenzi01]
    • Fully automatic wrapper generation
    • Does not generate robust and general wrappers
    • Only works for highly regular web pages


Traditional Information Extraction

• NLP-based and learning-based tools
  – SRV [Freitag98]
    • Top-down learner
    • Learns based on simple and relational features
    • Single slot filling
  – RAPIER [Califf99]
    • Bottom-up learner
    • Learns pre-filler, slot filler, and post-filler patterns
    • Only works for free text
    • Single slot filling
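The pre-filler, slot filler, and post-filler split can be illustrated with a small hand-written rule. This is only a sketch: the salary slot, the example sentence, and the three regular expressions are assumptions made for illustration, and the rule is written by hand rather than learned bottom-up as the slide describes.

```python
import re

# Hypothetical single-slot rule for a "salary" slot.
# The three parts mirror the pre-filler / filler / post-filler split,
# but they are hand-written here, not learned from training data.
PRE_FILLER = r"salary\s+of\s+"      # context that must precede the filler
FILLER = r"\$\d[\d,]*"              # the text that fills the slot
POST_FILLER = r"\s+per\s+year"      # context that must follow the filler

RULE = re.compile(f"{PRE_FILLER}({FILLER}){POST_FILLER}", re.IGNORECASE)


def fill_slot(text):
    """Return the first filler whose left and right context both match, else None."""
    match = RULE.search(text)
    return match.group(1) if match else None


result = fill_slot("We offer a salary of $85,000 per year plus benefits.")
```

A dollar amount without the surrounding context is not extracted, which is the point of requiring both context patterns around the filler.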


Traditional Information Extraction

• Ontology-based tools
  – BYU Ontos [Embley99]
    • Based on domain-specific extraction ontologies
    • Robust to changes
    • Multiple slot filling
    • Ontologies have to be built manually
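A rough sketch of ontology-driven, multiple-slot extraction follows. It assumes a hypothetical car-ad domain in which each concept is paired with a hand-built regular-expression recognizer; the concept names and patterns are illustrative and are not taken from the BYU Ontos system.

```python
import re

# Hypothetical, manually built extraction ontology for car ads:
# each concept (slot) is paired with a regular-expression recognizer.
CAR_AD_ONTOLOGY = {
    "Year": re.compile(r"\b(19|20)\d{2}\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Mileage": re.compile(r"\b\d[\d,]*\s*miles\b", re.IGNORECASE),
}


def extract_record(text, ontology):
    """Fill every slot whose recognizer matches; unmatched slots are omitted."""
    record = {}
    for concept, recognizer in ontology.items():
        match = recognizer.search(text)
        if match:
            record[concept] = match.group(0)
    return record


record = extract_record(
    "1998 Honda Civic, 72,000 miles, asking $4,500", CAR_AD_ONTOLOGY
)
```

Because the recognizers depend on the domain vocabulary rather than on page layout, the same ontology can be applied to differently formatted pages, at the cost of writing the ontology by hand.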


Hidden Web Crawling

• Traditional IE tools: publicly indexable web pages
• Hidden web crawling
  – Crawl the hidden web according to a user’s query
  – HiWE (Hidden Web Exposer) [Raghavan01]
    • Match the source form representation to task-specific DB concepts
    • Fill out and submit forms
    • Retrieve information hidden behind the form
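The form-filling step can be sketched as matching form labels against a task-specific database of concepts and candidate values. Everything below is assumed for illustration: the `TASK_DB` contents, the form fields, and the use of exact matching after normalization, which stands in for whatever label matching a real crawler would use.

```python
# Hypothetical task-specific database: concept label -> candidate values.
TASK_DB = {
    "organism": ["Homo sapiens", "Mus musculus"],
    "gene name": ["BRCA1", "TP53"],
}


def normalize(label):
    """Lower-case a form label, drop colons, and collapse whitespace."""
    return " ".join(label.lower().replace(":", " ").split())


def assign_values(form_labels, task_db):
    """Map each form field to the values of the concept whose label matches."""
    assignments = {}
    for field_name, label in form_labels.items():
        concept = normalize(label)
        if concept in task_db:
            assignments[field_name] = task_db[concept]
    return assignments


# Hypothetical form: internal field name -> label shown to the user.
form = {"org": "Organism:", "gene": "Gene  Name", "fmt": "Output format"}
assignments = assign_values(form, TASK_DB)
```

Fields with no matching concept (here `fmt`) are left unassigned; a crawler would then submit the form once per combination of assigned values.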


Biological Data Extraction

• Mainly from plain text
• Extract biological terms
  – Dictionary-based
  – Rule-based
• Extract relationships between biological terms/elements
• Example systems
  – BLAST-based name identifier [Krauthammer00]
  – PASTA (Protein Active Site Template Acquisition) [Gaizauskas03]
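Dictionary-based term extraction can be sketched as a longest-match lookup against a term list. The four-entry dictionary and the example sentence are hypothetical; a real system would load a curated lexicon and handle spelling variants.

```python
import re

# Hypothetical mini-dictionary of biological terms.
PROTEIN_DICTIONARY = ["p53", "BRCA1", "insulin receptor", "insulin"]


def find_terms(text, dictionary):
    """Return (term, start_offset) pairs, preferring longer dictionary entries."""
    # Sort longest-first so "insulin receptor" wins over "insulin".
    alternatives = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


hits = find_terms(
    "The insulin receptor binds insulin; p53 is unaffected.",
    PROTEIN_DICTIONARY,
)
```

A rule-based extractor would replace the fixed term list with patterns over word shape and context, which helps with names that are missing from the dictionary.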


The Semantic Web

• Machine-understandable web

• Gives information a well-defined meaning

• Allows automation of tasks

• Provides biologists with
  – Intelligent information services
  – Personalized web resources
  – Semantically empowered search engines


The Semantic Web

• Semantic web languages
  – XOL (XML-based Ontology Exchange Language)
  – SHOE (Simple HTML Ontology Extension)
  – OML (Ontology Markup Language)
  – RDF(S) (Resource Description Framework (Schema))
  – OIL (Ontology Interchange Language)
  – DAML+OIL (DARPA Agent Markup Language + OIL)
  – OWL (Web Ontology Language)

• Semantic annotation
  – Old: indexing of publications in libraries
  – New: information extraction


Schema Matching

• Previous methods [Raghavan01]:
  – Individual matchers vs. combining matchers
  – Schema-based matchers vs. instance-based matchers
  – Learning-based matchers vs. rule-based matchers
  – Element-level matchers vs. structure-level matchers


Schema Matching

• LSD (Learning Source Descriptions) [Doan01]
  – Semi-automatic
  – Learning-based
  – Both schema-level and instance-level
  – Only 1-1 mappings

• GLUE & CGLUE [DMD+03]
  – Ontology alignment
  – CGLUE: complex (non-1-1) mappings


Schema Matching

• Cupid [Madhavan01]
  – Rule-based matcher
  – Both element-level and structure-level
  – Schema-based
  – Works on hierarchical schemas with schema tree
  – Linguistic similarity & structure similarity
  – Matches tree elements by weighted similarities
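Matching by weighted similarities can be sketched as a convex combination of a linguistic score and a structural score per element pair. The function names, the default weight, the threshold, and the example scores below are assumptions for illustration; computing the two input scores from names and tree structure is out of scope here.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Combine linguistic (lsim) and structural (ssim) similarity into one score."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be between 0 and 1")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def select_matches(scored_pairs, threshold=0.5, w_struct=0.5):
    """Keep element pairs whose weighted similarity reaches the threshold."""
    matches = {}
    for pair, (lsim, ssim) in scored_pairs.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold:
            matches[pair] = wsim
    return matches


# Hypothetical (lsim, ssim) scores for pairs of schema-tree elements.
scored_pairs = {
    ("PO.Qty", "Order.Quantity"): (0.9, 0.7),
    ("PO.Qty", "Order.Date"): (0.1, 0.3),
}
matches = select_matches(scored_pairs)
```

Raising `w_struct` makes the surrounding tree structure count for more than the element names, which matters when names are abbreviated or ambiguous.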


Schema Matching

• COMA (COmbining MAtch) [Do02]
  – Combines different matchers
  – Interactive with users
  – Also an evaluation platform for different matchers
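Combining matchers can be sketched as running several independent matchers on the same element pair and aggregating their scores. The two toy matchers, the element dictionaries, and the choice of average as the default aggregation are assumptions for illustration, not COMA's actual matcher library.

```python
def name_matcher(a, b):
    """Jaccard overlap of lower-cased name tokens split on underscores."""
    tokens_a = set(a["name"].lower().split("_"))
    tokens_b = set(b["name"].lower().split("_"))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def type_matcher(a, b):
    """1.0 when the declared data types are identical, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def average(scores):
    return sum(scores) / len(scores)


def combine(a, b, matchers, aggregate=average):
    """Run every matcher on the pair and aggregate the individual scores."""
    return aggregate([matcher(a, b) for matcher in matchers])


# Hypothetical schema elements from a source and a target schema.
source_element = {"name": "cust_name", "type": "string"}
target_element = {"name": "customer_name", "type": "string"}

avg_score = combine(source_element, target_element, [name_matcher, type_matcher])
max_score = combine(source_element, target_element, [name_matcher, type_matcher], aggregate=max)
```

Swapping the aggregation function (average, maximum, minimum) changes how optimistic the combined result is, and a framework of this shape makes it easy to evaluate matchers individually or in combination.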


Biological Data Integration

• Challenges:
  – Huge amount of data, growing rapidly
  – Highly diverse in granularity and variety
  – Different terminologies, ID systems, units
  – Unstable and unpredictable
  – Different interfaces and querying capabilities


Biological Data Integration

• SRS (Sequence Retrieval System) [Etzold96]
  – Keyword-based retrieval system
  – Returns simple aggregation of matched records
  – Only works for relational databases

• BioKleisli [Davidson97]
  – Integrated digital library in biomedical domain
  – No global schema or ontology
  – A mediator works on top of source-specific wrappers
  – Horizontal integration


Biological Data Integration

• DiscoveryLink [Haas01]
  – Mediator-based, wrapper-oriented
  – Provides virtual DB access from different sources
  – Cannot deal with complex source data
  – Hard to add new sources
  – Requires knowledge of specific query language

• TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [Stevens00]
  – Mediator-based
  – Uses global ontology and schema
  – Maps source and target concepts manually
  – Not robust to changes
  – Hard to add new sources
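The mediator-over-wrappers architecture can be sketched as follows. The class names, the two in-memory sources, and the shared field names are hypothetical; the manual `field_map` stands in for the hand-written source-to-global mappings mentioned above, and real wrappers would query remote databases instead of lists.

```python
class SourceWrapper:
    """Hypothetical wrapper: renames one source's fields to shared field names."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records
        self.field_map = field_map  # source field name -> shared field name

    def fetch(self):
        """Yield each source record translated into the shared vocabulary."""
        for record in self.records:
            translated = {
                self.field_map[field]: value
                for field, value in record.items()
                if field in self.field_map
            }
            translated["source"] = self.name
            yield translated


class Mediator:
    """Sends one query to every wrapper and concatenates the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, **conditions):
        return [
            record
            for wrapper in self.wrappers
            for record in wrapper.fetch()
            if all(record.get(f) == v for f, v in conditions.items())
        ]


# Two made-up sources that name the same attribute differently.
source_a = SourceWrapper(
    "SourceA",
    [{"gene_symbol": "GENE1", "org": "human"}, {"gene_symbol": "GENE2", "org": "mouse"}],
    {"gene_symbol": "gene", "org": "organism"},
)
source_b = SourceWrapper(
    "SourceB",
    [{"symbol": "GENE1", "len": 120}],
    {"symbol": "gene", "len": "length"},
)

mediator = Mediator([source_a, source_b])
results = mediator.query(gene="GENE1")
```

Adding a source means writing one more wrapper and its mapping by hand, which illustrates why such systems are described as hard to extend and not robust to source changes.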


Bioinformatics

• Biological ontology

• Bioinformatics data source discovery

• Trustworthiness and provenance


Bioinformatics

• Biological ontology
  – GO (Gene Ontology) [Ashburner00]
    • Controlled vocabulary
      – Molecular Function (7278 terms)
      – Biological Process (8151 terms)
      – Cellular Component (1379 terms)
    • Represents knowledge hierarchically
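Hierarchical representation can be sketched as a child-to-parents map that is walked upward to collect all more general terms. The four-term fragment below is a simplified toy example and is not actual GO content; the `IS_A` name and the `ancestors` helper are hypothetical.

```python
# Toy, simplified is-a fragment (child -> list of parents).
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a):
    """Return every term reachable by following is-a links upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


general_terms = ancestors("protein kinase activity", IS_A)
```

Storing a list of parents per term lets the same traversal work when a term has more than one parent, and it is what allows an annotation on a specific term to be found by a query on a general one.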


Bioinformatics

• Biological ontology
  – LinKBase [Verschelde03]
    • Originally a biomedical ontology
      – Over 2,000,000 medical concepts
      – Over 5,300,000 instantiations
      – 543 relations
    • Expanded using GO
    • Only describes simple binary relationships


Bioinformatics

• Bioinformatics data source discovery
  – First step in integrating or answering queries
  – Example system [Rocco03]:
    • Pre-defined classes with class descriptions
    • Tries to map a source to a class

• Trustworthiness and provenance
  – Trustworthiness
    • Consistency
    • Reliability
    • Competence
    • Honesty
  – Provenance
    • Record history
    • Transformations
    • Annotations
    • Updates


Summary and Future Work

• Overcome drawbacks of existing systems

• Develop new algorithms to solve the problem of locating and extracting data from heterogeneous biological sources

My research Semantic Web

Schema Matching

Information Extraction

Bioinformatics