Data Integration in Current Research Information Systems
Integration vs. Aggregation
Maximilian Stempfhuber
GESIS / IZ Social Science Information Centre
Bonn, Germany
euroCRIS IR Workshop, November 9, 2006
Topics
• What users want
• Current information landscape in Germany
• Aggregation vs. Integration: Dealing with heterogeneity
• Model for integrating decentralized and heterogeneous information
• Focus: Semantic level
• Integrating entities
• Coping with sustainability
Looking at the Scientific User
• Spends 0.5 days/week searching for information
• Most frequently used information sources
• Journals (73%)
• Internet search engines (71%)
• Books (67%)
• Personal / informal communication (52%)
• Scientific portals / subject gateways (39%)
• Big differences between disciplines
Boekhorst et al. 2003, Poll 2004
Why are Internet search engines preferred to dedicated (research) information systems?
What Scientific Users Want
• Specialized portals (deep indexing, integration)
• Interdisciplinary links (cluster search)
• Intelligent integration (all types of information)
• Quality („no waste“, role of search engines?)
• Quantity + relevance (but no information overflow)
• Direct access („now-or-never“, reference + source)
• Communication (invisible colleges)
In line with models from information science
Confirms results from recent surveys
Boekhorst et al. 2003, Poll 2004, IMAC 2002, RSLG 2002, Binder et al. 2001, Stahl et al. 1998, WWW Search Engines: Machill & Welp 2003
Demands of Other Types of Users (Examples)
University or State level
• Overview over all scholars / research units
• Overview over all research projects
• Overview over publications, internal/external co-operations, funding received, …
• Input by users, quality assurance by research officers, automatic reporting, benchmarking, visibility of research, data exchange, …
Federal level
• Research administration
• Benchmarking / rating / ranking of instruments, programs, research organizations and disciplines
Consequences and Difficulties for Building a CRIS
• Different types of information needed (e.g. research units, persons, projects, publications, datasets, co-operations)
• Only parts of the information produced in-house
• Information produced by different groups of people (researchers, administration, funding agency/reviewers)
• Large amounts of data must be externally acquired (e.g. other institutes, publishers, harvesting)
• Data is of different structure and quality, difficult to convert / analyze at a very detailed level; sometimes modification of data not allowed
• Not all data is visible to all users
• Different demands for information access and use
Difficult to convert to a standardized data model
Heterogeneity in the Information Landscape
Current Information Landscape in Germany (extract)
National Central Libraries
National research collection system (SSG) and Virtual Libraries (funding: DFG)
Information networks (funding: BMBF)
Research institutes
Building a national CRIS by Aggregation (vascoda.de)
www.vascoda.de
Aggregation – Heterogeneity as a Challenge
• Data types
• Indexing languages
• Metadata schemas
• User interfaces
• Technical interfaces
• Natural languages, …
Heterogeneous
Features of Aggregation
• Single point of access to information
• Standardized functions applied to all information
• Features reflect least common denominator
• Enforced standards
• Remaining differences ignored (or lead to exclusion)
• Information entities not connected
Meta search (if distributed) Data Warehouse (if centralized)
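The "least common denominator" effect of aggregation can be sketched in a few lines. This is an illustration only: the two sample records and all field names are invented and are not taken from vascoda or any partner system.

```python
# Hypothetical records from two sources with different metadata schemas.
library_record = {
    "title": "Action research in schools",
    "author": "Doe, J.",
    "year": 2004,
    "shelfmark": "A 123",
}
project_record = {
    "title": "Action research in teacher training",
    "funder": "Example Foundation",
    "year": 2005,
    "duration_months": 24,
}


def aggregate(records):
    """Reduce every record to the fields that all records share."""
    shared = set(records[0])
    for record in records[1:]:
        shared &= set(record)
    # Source-specific fields (author, funder, ...) are dropped here.
    return [{key: record[key] for key in sorted(shared)} for record in records]


aggregated = aggregate([library_record, project_record])
```

In this sketch only `title` and `year` survive, so any function that relies on `author` or `funder` can no longer be offered at the single point of access.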
Features of Integration
• Single point of access to information
• Source-specific functions available
• Features reflect different demands
• Enforced standards
• Remaining differences are treated
• Information entities tightly connected
© IBM
Model
SOWIPORT – CRIS for the Social Sciences
Thematic documentations (Thematische Dokumentationen)
sowi Reihe SoFid
13
SOWIPORT – Core Content
• GESIS
• Literature references and full text documents
• Projects, institutes, journals, WWW resources, …
• Empirical data
• Partners
• …
• Library catalogues
• Open Access journals
• Topic-specific electronic publications
• Deutsche Forschungsgemeinschaft (DFG)
• National licenses to electronic resources
Theoretical Foundation: Layer Model
Core: SOLIS, FORIS, SoLit, CSA, …
Layer 1: Databases, OA Repository (Self archiving), Harvesting (Metadata), Reviews, Wikipedia, …
Layer 2: Homepages of SOFO-Institutes, Harvesting (Grey literature), …
Layer 3: Google Scholar, MS Academic Search, Scirus, …
SOWIPORT – Information Architecture 1/2
SOWIPORT-Partners
intellectual (CC) Heterogeneity statistical, …
social sciences Content scientific systematic Content indexing not systematic
SOWIPORT – Information Architecture 2/2
Databases Publications
Documentation unit (Service) Publication
SOWIPORT – Semantic Integration
DZI
IZ
CSA
Cross-Concordances between Thesauri
Query Transformation
• Query term: Aktionsforschung (action research)
• SOLIS: Aktionsforschung
• DZI SoLit: Handlungsforschung
• CSA: Action Research
Relevance Ranking
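A minimal sketch of query transformation through cross-concordances, using the single term pair shown on this slide. The table layout, the function name and the fallback for unmapped terms are assumptions made for illustration; they do not describe the actual SOWIPORT implementation.

```python
# Hypothetical cross-concordance table:
# (source thesaurus, source term) -> {target thesaurus: [target terms]}
CROSS_CONCORDANCES = {
    ("IZ", "Aktionsforschung"): {
        "DZI": ["Handlungsforschung"],
        "CSA": ["Action Research"],
    },
}


def transform_query(term, source, targets):
    """Translate one controlled term into the vocabulary of each target."""
    mapping = CROSS_CONCORDANCES.get((source, term), {})
    result = {}
    for target in targets:
        if target == source:
            result[target] = [term]
        else:
            # Assumption: an unmapped term is passed through unchanged.
            result[target] = mapping.get(target, [term])
    return result


queries = transform_query("Aktionsforschung", "IZ", ["IZ", "DZI", "CSA"])
```

Each database is then searched with the term from its own thesaurus, and the result lists are merged before relevance ranking.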
Treating Heterogeneity Between Indexing Vocabularies
[Diagram: three ways of mapping between indexing vocabularies: a) intellectual, 1 : n; b) statistical, parallel corpus, n : m; c) statistical, search, 1 : n]
SOWIPORT – Semantic Integration of Core Content
DZI
IZ
CSA
Cross-concordances between thesauri
• Methodology for terminology mappings
• intellectual
• statistical
• deductive
• Mapping between vocabularies
• bilateral („pure“ model)
• central vocabulary (efficiency)
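The efficiency argument for a central vocabulary can be made concrete by counting the mappings that have to be built and maintained. The sketch below assumes undirected mappings and assumes that one of the n vocabularies serves as the hub; both are simplifications for illustration.

```python
def bilateral_mappings(n):
    """Every pair of vocabularies gets its own mapping: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def central_mappings(n):
    """Every vocabulary is mapped once to the central one: n - 1."""
    return n - 1


# Compare both models for a growing number of vocabularies.
comparison = {n: (bilateral_mappings(n), central_mappings(n)) for n in (3, 5, 10)}
```

With 10 vocabularies the bilateral model needs 45 mappings and the central model 9. The price of the central model is that a term may have to pass through two mappings, which can lose precision.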
Thesauri in SOWIPORT
• IZ (SOLIS, FORIS, WZB OPAC)
• DZI (SoLit)
• DZA (GeroLit)
• SWD (SGG OPAC)
• FES (FES OPAC)
• ASSIA (Applied Social Sciences Index and Abstracts)
• PEI (Physical Education Index)
• WPSA (Worldwide Political Science Abstracts)
• CSA (Soc. Abstr., Soc. Serv. Abstr.)
• MADIERA (Surveys)
• EuroThes (IBLK OPAC)
• FIS Bildung
• APA (Psyndex)
• BiSP (SpoLit, SpoFor, SpoMedia)
vascoda: Context for Connecting Disciplines
Pedagogics Psychology
Economics Sports, …
Medicine
Cross-concordances between 12 thesauri
SOWIPORT – Structural, Local Integration (Core)
Partners' databases (RDBMS, Allegro, …)
sowiport-XML-Schema
DBClear
Services: Terminology service, Personalization, Authentication, …
Indexing / Retrieval
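Structural integration of this kind can be sketched as a per-partner field mapping into one common record. The partner names, field names and the dictionary-based record are hypothetical; the real sowiport XML schema is not shown in these slides.

```python
# Hypothetical field mappings: partner field name -> common field name.
FIELD_MAPS = {
    "partner_a": {"TI": "title", "AU": "authors", "PY": "year"},
    "partner_b": {"titel": "title", "verfasser": "authors", "jahr": "year"},
}


def to_common(partner, record):
    """Convert one partner record into the common structure."""
    field_map = FIELD_MAPS[partner]
    common = {
        target: record[source]
        for source, target in field_map.items()
        if source in record
    }
    common["source"] = partner
    # Unmapped fields are kept so source-specific functions remain possible.
    common["extra"] = {k: v for k, v in record.items() if k not in field_map}
    return common


record_a = to_common("partner_a", {"TI": "Example title", "PY": 2005, "CL": "10200"})
record_b = to_common("partner_b", {"titel": "Beispieltitel", "verfasser": ["Muster, M."]})
```

Keeping the unmapped fields under `extra` instead of discarding them is one way to preserve the source-specific detail that distinguishes integration from aggregation.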
Self Archiving in SOWIPORT
Integrated Search
Product Catalogue
SOLIS
Literature Search
Communication
Homepages
CV + Publications (Self archiving)
Full text Repository
Metadata
SOFO
Affiliation
• Initial motivation: WR Evaluation
• Sustainability: Incentives
Self Archiving and Evaluation
SOLIS + CSA
DBClear
WR supplies names of universities and researchers
• Retrieval of publications
• Quality control
Review and additions by researchers
Evaluation by WR
Perspective:
• Basis for scholars' homepages / Who-is-who
• Self archiving in Open Access Repository
Transfer of metadata to the SOWIPORT core
Reflecting Integration at the UI Level
Conclusions
• Integration goes well beyond aggregation
• Shift from data orientation to an information use perspective necessary
• Challenges
• Deal with heterogeneity at different levels
• Integrate primary data with publications, …
• Organize information sharing / access / sustainability
• Emerging infrastructures allow for integration
• Licensing and access issues are still a problem
Thank You!
Dr. Maximilian Stempfhuber
st@iz-soz.de
www.gesis.org/IZ