Federating Research Profiling Data



Clinical and Translational Science Institute Accelerating Research to Improve Health

This project was supported by NIH/NCRR UCSF-CTSI Grant Number UL1 TR000004. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Data Harvesting and Indexing

• LOD are acquired from SPARQL-compatible sites through a multi-threaded harvester program.

• Per-site harvesting times vary significantly, from a current low of 8 minutes to a current high of 4+ hours. Factors in this variability include both scale (e.g., number of persons represented) and endpoint implementation (e.g. Loki data are served by a teiid data federation layer coupled to a D2RQ bridge).

• Additional LOD are harvested through platform-specific multi-threaded crawlers (one thread per site). Current versions of VIVO and Profiles support direct access to RDF characterizations, allowing data collection from sites not yet making SPARQL endpoints available, while avoiding the need to screen-scrape HTML. In the one case of HTML-only data (Stanford’s CAP) we use a DOM parsing library to extract data.

• Harvested data are cached locally in a relational database to support indexing experiments without the need to harvest data repeatedly. Harvested data are enhanced where possible with supplemental metadata from MEDLINE, including abstracts, keywords, MeSH terms, chemicals and genes.

• The resulting aggregated text is then processed with a UMLS concept extractor and the resulting concept codes are added to the record. Shared publications then support both true multi-site federated search and concept-driven visualization.
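The harvest-and-cache steps above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual harvester: the names `harvest_sites` and `_collect`, the `triple` table layout, and the injected `fetch` callable are all hypothetical, and SQLite merely stands in for the unspecified relational database. One worker thread is allotted per site, mirroring the one-thread-per-site crawlers described above.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def _collect(fetch: Callable[[str], Iterable[Triple]], url: str) -> List[Triple]:
    """Run one site's fetch to completion inside a worker thread."""
    return list(fetch(url))


def harvest_sites(
    sites: Dict[str, str],
    fetch: Callable[[str], Iterable[Triple]],
    db: sqlite3.Connection,
) -> Dict[str, int]:
    """Harvest all sites concurrently and cache their triples locally.

    `fetch` is a placeholder for whatever retrieves triples from a site
    (a SPARQL query, an RDF download, or an HTML parser).
    Returns the number of triples stored per site, or -1 if a site failed.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS triple "
        "(site TEXT, subject TEXT, predicate TEXT, object TEXT)"
    )
    counts: Dict[str, int] = {}
    # One thread per site (assumption based on the crawler description).
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        futures = {
            name: pool.submit(_collect, fetch, url) for name, url in sites.items()
        }
        # Results are written from the calling thread only, so the SQLite
        # connection is never shared across threads.
        for name, future in futures.items():
            try:
                triples = future.result()
            except Exception:
                counts[name] = -1  # a slow or broken site does not stop the rest
                continue
            db.executemany(
                "INSERT INTO triple VALUES (?, ?, ?, ?)",
                [(name, s, p, o) for s, p, o in triples],
            )
            counts[name] = len(triples)
    db.commit()
    return counts
```

Because the cache is an ordinary table, indexing experiments can be rerun with plain SQL queries against it instead of contacting the remote sites again.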

Federating Research Profiling Data

David Eichmann, PhD, University of Iowa, Iowa City

Eric Meeks, Clinical and Translational Science Institute, UCSF

CTSAsearch ORNG Open Research Networking Gadgets

Introduction

Research profiling systems have achieved notable adoption by research institutions.

• Multi-site search of research profiling systems has substantially evolved since the first deployment of systems such as DIRECT2Experts.

• CTSAsearch is a federated search engine using VIVO-compliant Linked Open Data (LOD) published by members of the NIH-funded Clinical and Translational Science (CTSA) consortium and other interested parties.

• Fifty-seven institutions are currently included, spanning six distinct platforms and three continents (North America, Europe and Australia).

• In aggregate, CTSAsearch has data on 150-300 thousand unique researchers and their 10 million publications. The public interface is available at http://research.icts.uiowa.edu/polyglot.

Cross-linking Metadata

• Almost all research profiling sites currently provide only internal links. In the case of non-institutional co-authors, either no information is provided or stub profiles are generated containing only an author name taken from the citation.

• We cross-correlate publications and assert that two person URIs refer to the same individual if they share one or more publications with the same PMID or DOI, have the same family name, and either have the same first name or one first name is a single initial that matches the first name of the other.

• We currently cross-link co-author data from ProfilesRNS to their respective home institution profiles through the CrossLinks project.
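The matching rule described above can be expressed as a small predicate. The sketch below is an illustration under stated assumptions rather than the CTSAsearch implementation: the `Profile` class, the function names, and the name normalization (case-folding and dropping a trailing period) are hypothetical choices, as is the `pmid:`/`doi:` prefix convention for publication identifiers.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Profile:
    uri: str
    family_name: str
    first_name: str
    publication_ids: FrozenSet[str]  # PMIDs and/or DOIs, e.g. "pmid:123"


def _norm(name: str) -> str:
    """Normalize a name for comparison (assumed policy, not from the poster)."""
    return name.strip().rstrip(".").casefold()


def first_names_match(a: str, b: str) -> bool:
    """True if the names are equal, or one is a single initial of the other."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return False
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def same_person(p: Profile, q: Profile) -> bool:
    """Apply the three conditions: shared publication, family name, first name."""
    return (
        bool(p.publication_ids & q.publication_ids)
        and _norm(p.family_name) == _norm(q.family_name)
        and first_names_match(p.first_name, q.first_name)
    )
```

Requiring a shared PMID or DOI before comparing names keeps the rule conservative: two researchers with the same name at different institutions are not merged unless a common publication ties them together.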

Conclusion

• CTSAsearch and CrossLinks demonstrate that substantial value can be added to the existing research networking landscape through federation of these data.

• This better reflects the larger collaborative networks that our researchers comprise, and provides a better user experience through seamless inter-site navigation.

Profiling system counts by platform

Co-authorships between 313 researchers with publications involving ontology

External Collaborators links out to co-author pages in other Profiling systems

1. Linked Open Data from many research profiling sources is harvested and processed by the University of Iowa.

2. A SPARQL endpoint at Iowa is used by UCSF to capture a subset of data representing cross-institutional co-authorships.

3. Research profiling installations that support ORNG query UCSF at run time to retrieve co-authorship data as JSON-LD.

Data flow and key

• Our future work in this area will include enhanced ability to interconnect these systems and to visualize the resulting aggregated information space.

• CrossLinks interrogates the CTSAsearch SPARQL endpoint (http://marengo.info-science.uiowa.edu:2020), then provides real-time JSON-LD, supporting cross-site linking (with thumbnail images), and effectively creating a single inter-institutional information space.
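To make the SPARQL-to-JSON-LD step concrete, the following sketch converts a result set in the standard SPARQL 1.1 JSON results format into a small JSON-LD document. It is only an illustration: the variable names `coauthor`, `name`, and `thumbnail`, the function name `bindings_to_jsonld`, and the choice of FOAF terms are assumptions, and the actual CrossLinks output format is not described here.

```python
import json
from typing import Any, Dict, List

FOAF = "http://xmlns.com/foaf/0.1/"


def bindings_to_jsonld(sparql_json: Dict[str, Any]) -> Dict[str, Any]:
    """Turn SPARQL JSON result bindings into a JSON-LD graph of people.

    Each row is expected to bind ?coauthor (a profile URI) and may bind
    ?name and ?thumbnail; these variable names are hypothetical.
    """
    graph: List[Dict[str, Any]] = []
    for row in sparql_json["results"]["bindings"]:
        node: Dict[str, Any] = {
            "@id": row["coauthor"]["value"],
            "@type": "foaf:Person",
        }
        if "name" in row:
            node["foaf:name"] = row["name"]["value"]
        if "thumbnail" in row:
            # The image is itself a resource, so it is emitted as a node reference.
            node["foaf:img"] = {"@id": row["thumbnail"]["value"]}
        graph.append(node)
    return {"@context": {"foaf": FOAF}, "@graph": graph}


if __name__ == "__main__":
    sample = {
        "head": {"vars": ["coauthor", "name", "thumbnail"]},
        "results": {
            "bindings": [
                {
                    "coauthor": {"type": "uri", "value": "http://site-b.example/p/9"},
                    "name": {"type": "literal", "value": "J. Smith"},
                    "thumbnail": {"type": "uri", "value": "http://site-b.example/p/9.jpg"},
                }
            ]
        },
    }
    print(json.dumps(bindings_to_jsonld(sample), indent=2))
```

A consuming profile page could then render each node in `@graph` as an external collaborator link, using the `@id` as the link target and the image as the thumbnail.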