Project Results
© Brigitte Jörg June 4th, 2008 in Maribor, Slovenia
Analyzing European Research Competencies in IST
Results from a European SSA Project
Brigitte Jörg (DFKI), Jure Ferlez (IJS), Hans Uszkoreit (DFKI), Mitja Jermol (IJS)
Project Information
Funding Organization: European Commission
Funding Program: Sixth Framework Programme (FP6: IST (3rd Call))
Project Type: Specific Support Action (SSA)
Duration: 32 Months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, CCLRC
Project Consortium: 15 partners from EU MS, NMS and ACC
Project Consortium
Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
Institute Jozef Stefan, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Comp. and Autom. Research Inst., Hung. Academy of Sc., Hungary
Institute of Mathematics and Computer Science, Uni of Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK (formerly CCLRC, UK)
Technology Partners
DFKI (Co-ordinator): “LT World” Portal, Information Extraction, Semantic Web
Jozef Stefan Institute (Technical Co-ordinator): “Project Intelligence”, Data Mining, Social Network Analysis
Ontotext: “KIM Semantic Annotation Platform”
euroCRIS: “CERIF” Standard, Access to Data
Project Objectives
Set up and populate an information portal on IST research
Provide information about RTD actors and their experience and expertise
Provide innovative and automated services
To promote RTD competencies in specific fields
To support partner search for IST proposals and commercial projects
Presentation Outline
Information Repository
Data Collection
Data Integration / Data Cleaning
Evaluation of Results
Analytic Tools
Overall Conclusion
Repository Features
Information Repository (CERIF 2004) containing: Organisation, Person, Project, Publications
Data Collection (CERIF XML) from: National CRISs, National Collections, Web Crawlings, Community Support
Data Integration into ONE single dataset to enable analysis at European level
Data Cleaning with Supervised Machine Learning Methods (Active Learning)
Repository Data Analysis
Duplicate records inherent in single datasets
Even more duplicate records after merging single datasets
Most obvious duplicates for organisations and persons
No significant number of duplicate projects
Publications have been ignored
Duplicate records are a known problem
Formal Problem Definition (Winkler 2006)
Problem: duplicate detection in record set A
Given: a set of records in A
Classify: every pair (a, b) ∈ A x A into M (set of true matches) or U (set of true non-matches)
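A minimal Python sketch of this problem statement. The toy records and the naive `is_match` rule are hypothetical and only illustrate the partition into M and U. Unlike the formal A x A, the sketch enumerates each unordered pair of distinct records once.

```python
from itertools import combinations

def classify_pairs(records, is_match):
    """Split all unordered pairs of distinct records into M (matches) and U (non-matches)."""
    M, U = set(), set()
    for a, b in combinations(records, 2):
        (M if is_match(a, b) else U).add((a, b))
    return M, U

# Hypothetical toy records and a deliberately naive matching rule.
A = ["DFKI GmbH", "dfki gmbh", "Jozef Stefan Institute"]
M, U = classify_pairs(A, lambda a, b: a.lower() == b.lower())
```

The number of pairs grows quadratically with the number of records, which is why comparing all pairs becomes expensive on large datasets.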
IST World Problem Definition
Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets
most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets
not so many duplicates found in national datasets
a lot of duplicate person records across all datasets
no duplicate records found in project datasets
only some duplicate records across project datasets
publications have not been examined
Decision taken with respect to the IST World scope:
not touching project records
ignore publication records
find a solution for person records (IST World Community)
concentrate on cleaning organisation records
Problems with Organisation Records
Most entries had slightly different names caused by additional special characters or character modifications:
Capitalization, Lowercase Letters
Blanks, extra Spaces
Hyphens
Quotes
Comma in Different Places
Article in Name
Full stop in Name
Incomplete Names
English Translation
Word Order
Language Specific Characters (Jorg instead of Jörg)
Special Characters (wrong encoding &, ?)
Mixture of Organisation Names and Department Names
Differences in Addresses
Data Cleaning Application
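Several of the surface variations listed for organisation names (case, extra blanks, hyphens, quotes, full stops, articles, language-specific characters) can be reduced by normalizing names before comparison. The following is a minimal sketch under stated assumptions, not the project's actual cleaning code: the function name `normalize_name` and the English-only `ARTICLES` list are hypothetical.

```python
import re
import unicodedata

# Hypothetical, English-only list of articles to ignore.
ARTICLES = {"the"}

def normalize_name(name):
    # Decompose accented characters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.casefold()
    # Replace punctuation (hyphens, quotes, commas, full stops) with spaces.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Drop articles and collapse runs of whitespace.
    tokens = [t for t in no_punct.split() if t not in ARTICLES]
    return " ".join(tokens)
```

Normalization of this kind does not address word order, translations, incomplete names or department names; those need similarity features rather than exact comparison.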
IST World Dataset Integration
Input: records in A
1. blocking (Organisation Names: Fulltext Indexing, Querying); output: blocked records in A x A
2. feature generation (Organisation Names + Location); output: blocked records in A x A with comparison features
(1) Name/Location Strings (Bag of Words)
(2) Word/Character Order (String Kernels)
(3) Spelling Errors (Edit Distance Measure)
(4) Normalization of (1-3)
3. active learning (Human Decision: M = Match, U = Non-Match, - = unknown); output: some records have M (red) and U (green) class labels
4. model induction (Machine Learning with a Support Vector Machine: M = Match, U = Non-Match, - = unknown); output: Automatic Classification Model
5. model application (Machine Decision: M = Match, U = Non-Match); output: all records have M (red) and U (green) class labels
Knowledge about Records
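The first two steps (blocking and feature generation) can be sketched in plain Python. This is an illustration under assumptions, not the IST World implementation: blocking is approximated here by a shared-word inverted index rather than a fulltext index, string kernels are omitted, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(names):
    """Blocking: only pair up records that share at least one word."""
    index = defaultdict(set)
    for rec_id, name in enumerate(names):
        for token in set(name.lower().split()):
            index[token].add(rec_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def jaccard(a, b):
    """Bag-of-words overlap of two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def features(a, b):
    """Comparison features for one record pair, each normalized to [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return (jaccard(a, b), 1.0 - edit_distance(a.lower(), b.lower()) / longest)

# Hypothetical example: only records 0 and 1 share words, so only they are compared.
names = ["Jozef Stefan Institute", "Institute Jozef Stefan", "Ontotext Lab"]
candidates = block_pairs(names)
vectors = {(i, j): features(names[i], names[j]) for i, j in candidates}
```

The word-overlap feature is insensitive to word order, while the edit-distance feature is not, which is one reason to combine several features per pair.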
Active Learning Application
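One common way to choose which record pairs a human should label next is uncertainty sampling: ask about the pairs the current model is least sure of. The slides do not state which selection strategy was used, so the sketch below is an assumption. The linear model, its weights and the feature vectors are hypothetical stand-ins for a trained Support Vector Machine.

```python
def decision_value(x, weights, bias):
    """Signed score of a linear model: the sign gives the class, the magnitude the confidence."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def select_queries(unlabeled, weights, bias, k):
    """Uncertainty sampling: pick the k feature vectors closest to the decision boundary."""
    return sorted(unlabeled, key=lambda x: abs(decision_value(x, weights, bias)))[:k]

# Hypothetical feature vectors (word overlap, edit similarity) and model parameters.
pool = [(0.9, 0.95), (0.5, 0.5), (0.1, 0.2), (0.6, 0.45)]
weights, bias = (1.0, 1.0), -1.0
to_label = select_queries(pool, weights, bias, k=2)
```

After the human labels the selected pairs as M or U, the model would be retrained and the selection repeated; that loop is left out here.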
Evaluation of Results in CORDIS FP6 dataset
human evaluation of 1000 organisation record pairs
30 M correct; 934 U correct
1 M incorrect; 35 U incorrect
97% precision
46% recall
integration approach worked well
can be used for large scale integration tasks
Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
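The reported precision and recall can be reproduced from the four counts. The sketch assumes that "M incorrect" means a pair wrongly classified as a match (false positive) and "U incorrect" a true match classified as a non-match (false negative); this reading is consistent with the reported 97% and 46%.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the match class M."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the evaluation: 30 M correct, 1 M incorrect, 35 U incorrect, 934 U correct.
tp, fp, fn, tn = 30, 1, 35, 934
precision, recall = precision_recall(tp, fp, fn)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

With these counts, precision is 30/31 and recall is 30/65, and the four counts add up to the 1000 evaluated pairs.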
Analytic Tools
Advanced Tools: Collaboration Diagram, Competence Diagram
Experimental Tools: Collaboration Trends, Competence Trends, Consortia Prediction, Semantic Search
How to analyze or generate a Diagram
(1) definition of a query in the IST World Portal
(2) get a list of result records matching the query
(3) generate diagrams based on results
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE
Projects (Red Dots): Linked with Full Record in Repository
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …
Configuration of Result Space: 40% of result list, 30 topics
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Configuration of Result Space: 40% of result list, 30 topics
Diagram labels: Goals, Themes
Collaboration Diagram
Query: IST SSA projects within FP6
Aim: investigate the collaboration of SSA partners in FP6
Configuration of Result Space: 20% of result list
Diagram labels: Project, Number of joint partners
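The quantity behind such a diagram, the number of partners two projects have in common, can be sketched as follows. The project names and consortia are hypothetical, and the sketch assumes that projects are the nodes and joint-partner counts the link weights; how the portal actually computes and lays out the diagram is not described in the slides.

```python
from itertools import combinations

def joint_partner_counts(project_partners):
    """Map each pair of projects to the number of partners they share (pairs sharing none are omitted)."""
    counts = {}
    for p1, p2 in combinations(sorted(project_partners), 2):
        shared = len(project_partners[p1] & project_partners[p2])
        if shared:
            counts[(p1, p2)] = shared
    return counts

# Hypothetical projects and consortia, for illustration only.
projects = {
    "Project A": {"Org 1", "Org 2", "Org 3"},
    "Project B": {"Org 2", "Org 3", "Org 4"},
    "Project C": {"Org 5"},
}
edges = joint_partner_counts(projects)
```

Counts like these are only meaningful when duplicate organisation records have been merged first; otherwise the same partner under two spellings is not recognized as shared.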
Evaluation of Analytic Tools
IST World made it possible to perform the defined tasks
For more details, see the full paper in the Proceedings
All analytics depend on the underlying data
The analytic tools are very powerful
Evaluation of Queries
Query execution performed in March 2008
Queried datasets: IST World / Cordis
IST World Portal: http://www.ist-world.org/
CORDIS Search: http://cordis.europa.eu/en/home.html
Query | IST World (crawled data from CORDIS) | CORDIS
Investigation Date | March 2008 | March 2008
Last Updated | January 2007 | constantly updated
Query 1: Specific Support Action IST FP6 | 64 | 208
Query 2: Specific Support Action IST | 377 | 1178
Query 3: Specific Support Action FP6 | 185 | 1507
Query 4: Specific Support Action | 1554 | 2012
Query 5: Project Keywords: Specific Support Action, Programme: IST, StartDate: After 01/01/2002 | 200 | not checked
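To compare the two sources at a glance, the hit counts of Queries 1 to 4 can be put in relation to each other. The ratio below is only a rough indicator introduced here for illustration: it compares counts, not the actual overlap of the result lists.

```python
# (IST World hits, CORDIS hits) per query, as listed in the comparison above.
hits = {
    "Query 1": (64, 208),
    "Query 2": (377, 1178),
    "Query 3": (185, 1507),
    "Query 4": (1554, 2012),
}

# Share of the CORDIS hit count that IST World returns for the same query.
ratios = {query: ist_world / cordis for query, (ist_world, cordis) in hits.items()}

for query, ratio in ratios.items():
    print(f"{query}: IST World returns {ratio:.0%} of the CORDIS hit count")
```

Query 5 is left out because its CORDIS count was not checked.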
Results of Query Evaluation
Discovered inconsistencies with Cordis data:
“FP6” string: 30 of 80 relevant records lacked the string
“SSA” string: 15 of 208 relevant records lacked the string
“Specific Support Action” string: 15 of 208 relevant records lacked the string
Dates (year of the call): not consistently recorded
Query 1: 22 projects contained the string “Coordination Action”, “Specific Targeted Action”, “Integrated Project”, or others
An investigation of the results of Query 1 in Cordis revealed: 80 projects of the result list are missing in IST World
Overall Conclusion
Integration Method:
Could be further developed
Test data could be used to generate a better classification model
Feature generation could be improved by using ontological knowledge
Transfer learning methods might be helpful for re-use of the learned model
Evaluation of large Datasets:
very difficult
needs expert knowledge
Analytic Tools:
depend on the quality of the underlying data
are very powerful for investigation of large datasets
European Research Dataset (entries)
January 2008
European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs
Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
Cyprus: 29 Orgs
Czech Republic: 183 Orgs, 163 Proj, 164 Exp
Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
Latvia: 106 Orgs, 830 Proj, 701 Exp
Lithuania: 102 Orgs
Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
Romania: 169 Orgs, 68 Proj, 87 Exp
Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
Slovakia: 56 Orgs, 432 Proj, 683 Exp
Turkey: 285 Orgs
EPRI-start: 286 Orgs, 275 Exp
Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
Community: 61 Orgs, 41 Proj, 435 Exp
Beyond the Project
IST World is online: http://www.ist-world.org/
Registration is free
Create your Competence Map / Collaboration Map
Continuation is planned …