Ontology-based Information Extraction for Business Intelligence

Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva

Natural Language Processing Group

University of Sheffield

United Kingdom

Outline

• The MUSING Project

• Ontology-based IE

• MUSING Natural Language Processing Technology

• MUSING applications
– Customisation
– Results

• Conclusions & Future Work

MUSING project

• Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making

• Many systems in BI are portals which allow business analysts access to information

• It is the work of the business analyst to dig into the documents in order to extract useful facts for decision making

• MUSING is a 7th Framework Programme Project from the European Commission which promotes the adoption of BI tools based on semantic-based knowledge and content systems

• Analytical techniques traditionally used in BI rely on structured information and hardly ever use qualitative information, which the industry is keen to use (e.g. opinions)

• One of the goals of MUSING is to use structured as well as unstructured information for decision making

Ontology-based Information Extraction (OBIE)

• Information extraction (IE) is a technology which extracts key pieces of information from text
– generic: identify specific name mentions in text (person names, location names, money, etc.)
– specific: populate a structured representation (e.g. template) with “strings” from text (e.g., full information on a joint venture)

• OBIE is the process of finding, in text and other sources, the concepts, instances, and relations expressed in an ontology

Ontology-based Information Extraction (OBIE)

• Extracting information about a company requires, for example, identifying the Company Name, Company Address, Parent Organization, Shareholders, etc.

• These associated pieces of information should be asserted as property values of the company instance

• Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”; etc.)
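
The statements above can be pictured as subject-predicate-object triples. The following Python fragment is a minimal sketch under stated assumptions, not the MUSING implementation: the namespace `http://example.org/musing#`, the helper names, and the N-Triples-like output are hypothetical. Only the property names `hasAlias` and `hasWebPage` and the Alcoa values come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical namespace, used only for illustration; the real MUSING
# ontology URIs are not given in these slides.
NS = "http://example.org/musing#"

Triple = Tuple[str, str, str]


def company_statements(name: str, properties: Dict[str, List[str]]) -> List[Triple]:
    """Turn extracted property values into (subject, predicate, object) triples."""
    subject = NS + name.replace(" ", "_")
    triples = []
    for prop, values in properties.items():
        for value in values:
            triples.append((subject, NS + prop, value))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise triples in an N-Triples-like form with literal objects."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


statements = company_statements(
    "Alcoa Inc",
    {"hasAlias": ["Alcoa"], "hasWebPage": ["http://www.alcoa.com"]},
)
print(to_ntriples(statements))
```

Treating every object as a string literal keeps the sketch short; a real population step would presumably distinguish literals from links to other instances.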

Ontology-based Information Extraction in MUSING

[Figure: architecture diagram. Labels: Data Source Provider; Document Collector; MUSING Data Repository; Document; Ontology-based Document Annotation; MUSING Ontology; Ontology Curator; Domain Expert; Annotated Document; Ontology Population; Knowledge Base; Instances & Relations; MUSING Application; Region Selection Model; Enterprise Intelligence; Company Information; Economic Indicators; Region Rank; Report; User; User Input]

Data Sources and Ontology

• Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some private data)
– Il Sole 24 ORE, CreditReform data
– Companies’ web pages (main, “about us”, “contact us”, etc.)
– Wikipedia, CIA Fact Book, etc.

• The ontology is manually developed through interaction with domain experts and ontology curators
– It extends the PROTON ontology and covers the financial, international, and IT operational risk domains

Partial Ontology View

Natural Language Processing Technology

• The OBIE system for English is being developed using the GATE system (http://gate.ac.uk); the German and Italian systems are based on SProUT tools developed by DFKI

• GATE components used include: tokeniser, sentence splitter, part-of-speech tagger, morphological analyser, parsers, etc.

• GATE comes with an extraction system called ANNIE, but it covers only a small fraction of the MUSING application domain

Natural Language Processing Technology

• Main components adapted for MUSING applications are gazetteer lists and grammars used for named entity recognition

• New components include
– an ontology mapping component: entities are mapped into specific classes in the given ontology
– a component that creates RDF statements for ontology population based on the application specification
  • for example, create a company instance with all its properties as found in the text
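
The ontology mapping step can be illustrated with a simple lookup from entity types to ontology classes. This is only a sketch: the entity type names, the class names, and the `Entity` structure are hypothetical and are not taken from the MUSING ontology or from GATE.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from named-entity types to ontology class names.
# Neither the annotation types nor the class names are taken from MUSING.
ENTITY_TO_CLASS = {
    "Organization": "Company",
    "Person": "Person",
    "Location": "Region",
}


@dataclass
class Entity:
    text: str
    entity_type: str


def map_to_class(entity: Entity) -> Optional[str]:
    """Return the ontology class for an entity, or None if it is unmapped."""
    return ENTITY_TO_CLASS.get(entity.entity_type)


entities = [Entity("Alcoa Inc", "Organization"), Entity("25%", "Percent")]
mapped = [(e.text, map_to_class(e)) for e in entities]
print(mapped)
```

Returning `None` for unmapped types lets a later step decide whether to drop the entity or flag it for an ontology curator.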

Cross-source Coreference

• One important problem to be addressed in extraction from multiple sources is deciding whether a person name, or any other name, occurring in two different sources refers to the same individual

• Given a set of documents containing a given person name, we apply an agglomerative clustering algorithm; at the end, documents referring to the same individual belong to the same cluster

• The algorithm uses vector representations of the documents (terms and weights)

• We experimented with two types of terms: words and entity names. Our results indicate that a representation using one specific type of name (i.e., Organization) achieves state-of-the-art performance; however, performance varies depending on the data set
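
The clustering idea can be sketched in a few lines of Python. The slides do not specify the similarity measure, the linkage criterion, the term weighting, or the stopping condition, so cosine similarity, single linkage, raw term frequencies, and a fixed threshold are all assumptions made here; the toy documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def vectorise(terms: List[str]) -> Dict[str, float]:
    """Represent a document as term -> weight (raw frequency here)."""
    return dict(Counter(terms))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(vectors: List[Dict[str, float]], threshold: float) -> List[List[int]]:
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def similarity(c1: List[int], c2: List[int]) -> float:
        # Single linkage: similarity of the closest pair of documents.
        return max(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Toy documents mentioning the same person name, each reduced to the
# organisation names found in it (invented for illustration).
docs = [
    ["University of Sheffield", "GATE"],
    ["University of Sheffield", "DFKI"],
    ["Alcoa"],
]
clusters = agglomerate([vectorise(d) for d in docs], threshold=0.4)
print(clusters)
```

Each resulting cluster is a list of document indices taken to refer to the same individual; switching between word terms and entity-name terms only changes what is passed to `vectorise`.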

MUSING Applications

• A number of applications have been specified to demonstrate the use of semantic-based technology in BI; some examples include
– Collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors
– Identifying chances of success in regions in a particular country
– Identifying appropriate partners to do business with
– Creating a joint ventures database from multiple sources

Region Selection Application

• Given information on a company and the desired form of internationalisation (e.g., export, direct investment, alliance), the application provides a ranking of regions which indicates the most suitable places for the type of business

• A number of social, political, geographical, and economic indicators or variables of regions, such as surface area, labour costs, tax rates, population, literacy rates, etc., have to be collected to feed a statistical model

Region Selection Application

• Data sources used for the OBIE application are statistics from governmental sources and available region profiles found on the Web (e.g. Wikipedia)

• Gazetteer lists contain location names and associated information together with keywords to help identify the key information

• Grammars use contextual information and named entities to identify the target variables
– e.g. “unemployment rate of 25% (2001)”

• Extraction performance obtained: F-score > 80%
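
To make the interplay of keyword lists and contextual patterns concrete, here is a small Python sketch in which a regular expression stands in for the GATE grammars. It is not the MUSING grammar: the keyword list, the indicator codes (`UNEMP`, `LIT`), and the pattern itself are assumptions, and it only handles percentages followed by an optional year in parentheses.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical keyword gazetteer: surface keyword -> indicator code.
# The codes are invented for this sketch.
INDICATOR_KEYWORDS = {
    "unemployment rate": "UNEMP",
    "literacy rate": "LIT",
}


class Indicator(NamedTuple):
    code: str
    value: float
    year: Optional[int]


# Keyword, optional "of"/"is"/"was", a percentage, and an optional "(year)".
PATTERN = re.compile(
    r"(?P<keyword>" + "|".join(re.escape(k) for k in INDICATOR_KEYWORDS) + r")"
    r"\s+(?:of|is|was)?\s*(?P<value>\d+(?:\.\d+)?)\s*%"
    r"(?:\s*\((?P<year>\d{4})\))?",
    re.IGNORECASE,
)


def extract_indicators(text: str) -> List[Indicator]:
    """Find indicator mentions and return them as structured records."""
    results = []
    for m in PATTERN.finditer(text):
        code = INDICATOR_KEYWORDS[m.group("keyword").lower()]
        year = int(m.group("year")) if m.group("year") else None
        results.append(Indicator(code, float(m.group("value")), year))
    return results


found = extract_indicators("The region has an unemployment rate of 25% (2001).")
print(found)
```

The keyword anchors the match and the surrounding context supplies the value and the year, which mirrors the division of labour between gazetteer lists and grammars described above.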

Region Selection Application: example

Tamil Nadu

...

Population (2001) 62,405,679 (6)

Density 478/km²

...

<rdf>
  <indicator:Measurement rdf:ID="Measurement_91567">
    <indicator:hasValue>478</indicator:hasValue>
    <indicator:hasPoliticalRegion rdf:resource=".../int/region#TamilNadu" />
    <indicator:hasIndicator rdf:resource=".../int/indicator#DENS" />
    <time:hasTimeSlice xmlns:time=".../general/time#">
      <time:TimeSlice rdf:ID="TimeSlice_40715">
        <time:hasTemporalEntity>
          <time:ProperInstantYear rdf:ID="ProperInstantYear_57895">
            <time:year rdf:datatype="#int">2001</time:year>
          </time:ProperInstantYear>
        </time:hasTemporalEntity>
      </time:TimeSlice>
    </time:hasTimeSlice>
  </indicator:Measurement>
</rdf>
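
As a rough illustration of how an extracted value might be turned into a statement of this shape, the Python sketch below fills a fixed template. The `Measurement` class and `to_rdf_xml` function are hypothetical; the element and attribute names are copied from the example, and the elided URI prefixes (`...`) are passed through unchanged rather than guessed. How the identifiers are generated is not stated in the slides, so they are simply supplied by the caller.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Measurement:
    """One extracted indicator value for a region in a given year."""
    measurement_id: str
    time_slice_id: str
    year_id: str
    region_uri: str
    indicator_uri: str
    time_ns: str
    value: str
    year: int


def to_rdf_xml(m: Measurement) -> str:
    """Render a measurement with the element names used in the example above."""
    lines = [
        "<rdf>",
        f"  <indicator:Measurement rdf:ID={quoteattr(m.measurement_id)}>",
        f"    <indicator:hasValue>{escape(m.value)}</indicator:hasValue>",
        f"    <indicator:hasPoliticalRegion rdf:resource={quoteattr(m.region_uri)} />",
        f"    <indicator:hasIndicator rdf:resource={quoteattr(m.indicator_uri)} />",
        f"    <time:hasTimeSlice xmlns:time={quoteattr(m.time_ns)}>",
        f"      <time:TimeSlice rdf:ID={quoteattr(m.time_slice_id)}>",
        "        <time:hasTemporalEntity>",
        f"          <time:ProperInstantYear rdf:ID={quoteattr(m.year_id)}>",
        f'            <time:year rdf:datatype="#int">{m.year}</time:year>',
        "          </time:ProperInstantYear>",
        "        </time:hasTemporalEntity>",
        "      </time:TimeSlice>",
        "    </time:hasTimeSlice>",
        "  </indicator:Measurement>",
        "</rdf>",
    ]
    return "\n".join(lines)


density = Measurement(
    measurement_id="Measurement_91567",
    time_slice_id="TimeSlice_40715",
    year_id="ProperInstantYear_57895",
    region_uri=".../int/region#TamilNadu",      # elided prefix kept as in the slide
    indicator_uri=".../int/indicator#DENS",     # elided prefix kept as in the slide
    time_ns=".../general/time#",                # elided prefix kept as in the slide
    value="478",
    year=2001,
)
rendered = to_rdf_xml(density)
print(rendered)
```

A production system would more likely build the graph with an RDF library and let it handle serialisation; the template is used here only to keep the correspondence with the example visible.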

Conclusions and Future Work

• MUSING integrates ontology-based extraction as a useful tool for Business Intelligence

• The NLP applications analyse documents and populate a knowledge base

• A number of practical applications have been defined which will use the KB’s stored facts. Extraction technology and performance so far are promising

• Our future work will concentrate on
– the full problem of ontology population, including a cross-source coreference mechanism
– the identification of qualitative information (such as opinions), e.g. to model company reputation
– moving from a rule-based system to a machine learning approach
