Ontology-based Information Extraction for Business Intelligence
Horacio Saggion, Adam Funk, Diana Maynard, Kalina Bontcheva
Natural Language Processing Group
University of Sheffield
United Kingdom
Outline
• The MUSING Project
• Ontology-based IE
• MUSING Natural Language Processing Technology
• MUSING applications
  – Customisation
  – Results
• Conclusions & Future Work
MUSING project
• Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making
• Many systems in BI are portals which allow business analysts access to information
• It is the work of the business analyst to dig into the documents in order to extract useful facts for decision making
• MUSING is a 7th Framework Programme Project from the European Commission which promotes the adoption of BI tools based on semantic-based knowledge and content systems
• Analytical techniques traditionally used in BI rely on structured information and hardly ever use qualitative information (e.g. opinions), which the industry is keen to use
• One of the goals of MUSING is to use structured as well as unstructured information for decision making
Ontology-based Information Extraction (OBIE)
• Information extraction (IE) is a technology which extracts key pieces of information from text
  – generic: identify specific name mentions in text (person names, location names, money, etc.)
  – specific: populate a structured representation (e.g. template) with “strings” from text (e.g., full information on a joint venture)
• OBIE is the process of finding, in text and other sources, the concepts, instances, and relations expressed in an ontology
Ontology-based Information Extraction (OBIE)
• Extracting information about a company requires, for example, identifying the Company Name, Company Address, Parent Organization, Shareholders, etc.
• These associated pieces of information should be asserted as property values of the company instance
• Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”, etc.)
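The statement creation described above can be sketched in a few lines of Python. This is only an illustration, not the MUSING implementation: the `Statement` type and the `company_statements` helper are hypothetical names, and statements are kept as plain string triples rather than real RDF.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One (subject, predicate, object) statement; a hypothetical helper type."""
    subject: str
    predicate: str
    obj: str


def company_statements(name: str, properties: dict[str, list[str]]) -> list[Statement]:
    """Turn extracted property values into statements about one company instance."""
    statements = []
    for predicate, values in properties.items():
        for value in values:
            statements.append(Statement(name, predicate, value))
    return statements


# Property values as they might come out of the extraction step.
extracted = {
    "hasAlias": ["Alcoa"],
    "hasWebPage": ["http://www.alcoa.com"],
}
statements = company_statements("Alcoa Inc", extracted)
for s in statements:
    print(f'"{s.subject}" {s.predicate} "{s.obj}"')
```

Each property may hold several values (a company can have more than one alias), which is why the sketch maps a property name to a list.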
[Figure: architecture diagram. Labels: Data Source Provider; Document Collector; MUSING Data Repository; Document; Ontology-based Document Annotation; MUSING Ontology; Annotated Document; Ontology Population; Knowledge Base; Instances & Relations; Ontology Curator; Domain Expert; MUSING Application; Region Selection Model; Enterprise Intelligence; Report; Region Rank; Company Information; Economic Indicators; User; User Input]
Ontology-based Information Extraction in MUSING
Data Sources and Ontology
• Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some private data)
  – Il Sole 24 ORE, CreditReform data
  – Companies’ web pages (main, “about us”, “contact us”, etc.)
  – Wikipedia, CIA Fact Book, etc.
• Ontology is manually developed through interaction with domain experts and ontology curators
  – It extends the PROTON ontology and covers the financial, international, and IT operative risk domain
Partial Ontology View
Natural Language Processing Technology
• The OBIE system for English is being developed using the GATE system (http://gate.ac.uk); the German and Italian systems are based on SProUT tools developed by DFKI
• GATE components used include: tokeniser; sentence splitter; parts-of-speech tagger; morphological analyser; parsers; etc.
• GATE comes with an extraction system called ANNIE, but it targets only a small fraction of the MUSING application domain
Natural Language Processing Technology
• Main components adapted for MUSING applications are gazetteer lists and grammars used for named entity recognition
• New components include
  – an ontology mapping component: entities are mapped into specific classes in the given ontology
  – a component that creates RDF statements for ontology population based on the application specification
    • for example, create a company instance with all its properties as found in the text
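As a rough illustration of the ontology mapping step, the sketch below maps named-entity types to ontology classes with a lookup table. The class names on the right-hand side and the `map_entities` helper are assumptions made for this example; the actual MUSING ontology classes and mapping rules are not given in these slides.

```python
# Hypothetical mapping from named-entity types to ontology classes.
# The class names are illustrative, not taken from the MUSING ontology.
ENTITY_TO_CLASS = {
    "Organization": "Company",
    "Location": "PoliticalRegion",
    "Person": "Person",
}


def map_entities(entities):
    """Return (mention, ontology_class) pairs; entity types with no mapping are skipped."""
    mapped = []
    for mention, entity_type in entities:
        ontology_class = ENTITY_TO_CLASS.get(entity_type)
        if ontology_class is not None:
            mapped.append((mention, ontology_class))
    return mapped


# (mention, entity type) pairs as a named entity recogniser might produce them.
entities = [
    ("Alcoa Inc", "Organization"),
    ("Tamil Nadu", "Location"),
    ("25%", "Percent"),
]
mapped = map_entities(entities)
for mention, ontology_class in mapped:
    print(f"{mention} -> {ontology_class}")
```

A flat table is the simplest possible design; a real mapping would probably also need context (for example, whether an organisation is a company or a government body).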
Cross-source Coreference
• One important problem to be addressed in extraction from multiple sources is deciding whether a person name (or any other name) occurring in two different sources refers to the same individual.
• Given a set of documents containing a given person name, we apply an agglomerative clustering algorithm; at the end, documents referring to the same individual belong to the same cluster
• The algorithm uses vector representations of the documents (terms and weights)
• We experimented with two types of terms: words and entity names. Our results indicate that a representation using one specific type of name (i.e., Organization) achieves state-of-the-art performance, although performance varies depending on the data set
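A minimal sketch of agglomerative clustering over term vectors follows. Several details are assumptions, since the slides do not specify them: raw term counts as weights, cosine similarity, single-link merging, and a fixed stopping threshold of 0.3. The toy documents (lists of organisation names, one of them invented) are illustrative only.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Repeatedly merge the two most similar clusters until no pair reaches the threshold.

    Cluster similarity is the best similarity over document pairs (single link).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if sim > best:
                    best, best_pair = sim, (i, j)
        if best < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Each document mentions the same person name; terms are the organisation
# names found in the document (toy data).
docs = [
    ["University of Sheffield", "GATE"],
    ["University of Sheffield", "MUSING"],
    ["Example Bank"],
]
vectors = [Counter(terms) for terms in docs]
clusters = cluster(vectors, threshold=0.3)
print(clusters)
```

Documents 0 and 1 share an organisation name and end up in one cluster, while document 2 shares nothing and stays alone. The threshold controls how readily mentions are judged to refer to the same individual.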
MUSING Applications
• A number of applications have been specified to demonstrate the use of semantic-based technology in BI; some examples include
  – Collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors
  – Identifying chances of success in regions in a particular country
  – Identifying appropriate partners to do business with
  – Creation of a Joint Ventures Database from multiple sources
Region Selection Application
• Given information on a company and the desired form of internationalisation (e.g., export, direct investment, alliance), the application provides a ranking of regions indicating the most suitable places for that type of business
• A number of social, political, geographical, and economic indicators or variables (such as the surface area, labour costs, tax rates, population, literacy rates, etc.) of regions have to be collected to feed a statistical model
Region Selection Application
• Data sources used for the OBIE application are statistics from governmental sources and available region profiles found on the Web (e.g. Wikipedia)
• Gazetteer lists contain location names and associated information together with keywords to help identify the key information
• Grammars use contextual information and named entities to identify the target variables
  – “unemployment rate of 25% (2001)”
• Extraction performance obtained: F-score > 80%
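To make the gazetteer-plus-grammar idea concrete, here is a small sketch that recognises phrases like “unemployment rate of 25% (2001)”. It uses a Python regular expression as a stand-in for the actual grammars, which are not shown in these slides; the keyword table, the indicator codes `UNEMP` and `LIT`, and the `extract_indicators` helper are all hypothetical.

```python
import re

# Keywords play the role of gazetteer entries; the indicator codes are hypothetical.
INDICATOR_KEYWORDS = {"unemployment rate": "UNEMP", "literacy rate": "LIT"}

# keyword + "of" + percentage + optional year in parentheses
PATTERN = re.compile(
    r"(?P<keyword>" + "|".join(re.escape(k) for k in INDICATOR_KEYWORDS) + r")"
    r"\s+of\s+(?P<value>\d+(?:\.\d+)?)\s*%"
    r"(?:\s*\((?P<year>\d{4})\))?",
    re.IGNORECASE,
)


def extract_indicators(text):
    """Return one dict per indicator mention found in the text."""
    results = []
    for m in PATTERN.finditer(text):
        results.append({
            "indicator": INDICATOR_KEYWORDS[m.group("keyword").lower()],
            "value": float(m.group("value")),
            "year": int(m.group("year")) if m.group("year") else None,
        })
    return results


found = extract_indicators("The state had an unemployment rate of 25% (2001).")
print(found)
```

The keyword anchors the match and the surrounding context (“of”, the percent sign, the bracketed year) supplies the value and its time reference, which is the information a measurement statement needs.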
Region Selection Application: example
Tamil Nadu
...
Population (2001) 62,405,679 (6)
Density 478/km²
...
<rdf>
  <indicator:Measurement rdf:ID="Measurement_91567">
    <indicator:hasValue>478</indicator:hasValue>
    <indicator:hasPoliticalRegion rdf:resource=".../int/region#TamilNadu" />
    <indicator:hasIndicator rdf:resource=".../int/indicator#DENS" />
    <time:hasTimeSlice xmlns:time=".../general/time#">
      <time:TimeSlice rdf:ID="TimeSlice_40715">
        <time:hasTemporalEntity>
          <time:ProperInstantYear rdf:ID="ProperInstantYear_57895">
            <time:year rdf:datatype="#int">2001</time:year>
          </time:ProperInstantYear>
        </time:hasTemporalEntity>
      </time:TimeSlice>
    </time:hasTimeSlice>
  </indicator:Measurement>
</rdf>
Conclusions and Future Work
• MUSING integrates ontology-based extraction as a useful tool for Business Intelligence
• The NLP applications analyse documents and populate a knowledge base
• A number of practical applications have been defined which will use the KB’s stored facts. Extraction technology and performance so far are promising
• Our future work will concentrate on
  – the full problem of ontology population including a cross-source coreference mechanism
  – the identification of qualitative information (such as opinions) e.g. to model company reputation
  – moving from a rule-based system to a machine learning approach