Data+Need=Hack

Nikos Manolis, AgroKnow
5th July 2014
2nd SemaGrow Hackathon (in conjunction with IRSS14)


The hack equation

The Hackathon Challenges

• How to help agricultural researchers to discover the resources they need?

• How to support food safety trainers in preparing their training courses by using high quality material?

Lots of Open Data

The Data

• How to access
– Application programming interface (API): POST, GET, PUT, DELETE
– Dump files
– SPARQL endpoints
– Harvesting from services (OAI-PMH)
– HTML / data scraping
– Crawling
– …combination of the above

Green Learning Network (GLN)


• Two main parts:
– Metadata acquisition and preparation: transformation, correction, identification, filtering, post-processing, broken-link checking
– Maintenance of aggregated metadata: up-to-date metadata records, broken-link checking
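The broken-link checking step listed above could be sketched roughly as follows. This is only a minimal illustration: the record layout (dicts with `id` and `url` keys) and both function names are assumptions, not the GLN implementation.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def is_alive(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds (status below 400)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except (URLError, ValueError, OSError):
        # HTTPError is a subclass of URLError, so 4xx/5xx replies land here too.
        return False


def find_broken_links(records, check=is_alive):
    """Return the ids of records whose `url` fails the check.

    `records` is a hypothetical list of dicts with `id` and `url` keys.
    """
    return [record["id"] for record in records if not check(record["url"])]
```

Passing the check as a parameter keeps the filtering logic testable without network access; a maintenance job could rerun it periodically over the aggregated records.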

Need for data aggregation and harmonization

GLN Search API (agINFRA powered)

• REST-based queries over harmonized information (result of metadata processing)

• Internal data model supported
– akif: describing educational resources for agriculture, http://domain/search-api/v1/akif/?q=*

ABN Search API (agINFRA powered)

• Agriculture Bibliographic Network (ABN)

• REST-based queries over aggregated metadata

• Internal data model supported
– agrif: describing bibliographic resources for food & agriculture (mainly from FAO’s data): http://domain/search-api/v1/agrif/?q=*

Search options

• Simple search
http://domain/search-api/v1/akif/?q=tomato

• Searching within specific fields
http://BASE_URL/search-api/v1/akif/?languageBlocks.en.description=tomato

• Temporal
http://BASE_URL/search-api/v1/akif/?creationDate=2013-04-16

• Fetching specific items
http://BASE_URL/search-api/v1/akif/COLLECTION/20296
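The four search options above can be assembled programmatically. The sketch below only builds the URLs. The helper names `search_url` and `item_url` are hypothetical, and `BASE_URL` remains the same placeholder host used in the slide.

```python
from urllib.parse import quote, urlencode

# Placeholder host, mirroring the BASE_URL placeholder used on the slide.
BASE_URL = "http://BASE_URL/search-api/v1"


def search_url(model, **params):
    """Build a search URL for a data model such as 'akif' or 'agrif'."""
    # safe="*" keeps the wildcard query q=* literal instead of encoding it.
    return f"{BASE_URL}/{model}/?{urlencode(params, safe='*')}"


def item_url(model, collection, item_id):
    """Build the URL of a single item inside a collection."""
    return f"{BASE_URL}/{model}/{quote(collection)}/{item_id}"


print(search_url("akif", q="tomato"))
print(search_url("akif", **{"languageBlocks.en.description": "tomato"}))
print(search_url("akif", creationDate="2013-04-16"))
print(item_url("akif", "COLLECTION", 20296))
```

Field names containing dots, such as `languageBlocks.en.description`, are not valid Python keyword names, which is why that call unpacks a dict.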

Managing results

• Sorting results
e.g. ?q=*&sort_by=creationDate&sort_order=desc

• Facets
e.g. ?facets=set&facet_size=3

• Pagination
e.g. ?q=sea&page_size=25&page=3

Resources related to food safety risk analysis:
http://api.greenlearningnetwork.com/search-api/v1/akif/?q=risk?analysis&set=aglrfaocdx,optunesco,faocapacityportal,oeorganiceprints,oeintute
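As a rough sketch, the sorting, facet, and pagination parameters can be gathered in one helper that omits whatever is not set. The function name `results_query` is hypothetical, and whether the service accepts every combination of these parameters is an assumption that the slides do not confirm.

```python
from urllib.parse import urlencode


def results_query(q="*", sort_by=None, sort_order=None,
                  facets=None, facet_size=None, page_size=None, page=None):
    """Build a query string, leaving out every option that is not set."""
    options = {
        "q": q,
        "sort_by": sort_by,
        "sort_order": sort_order,
        "facets": facets,
        "facet_size": facet_size,
        "page_size": page_size,
        "page": page,
    }
    chosen = {name: value for name, value in options.items() if value is not None}
    # Keep '*' and ',' literal so the output matches the slide examples.
    return "?" + urlencode(chosen, safe="*,")


print(results_query(sort_by="creationDate", sort_order="desc"))
print(results_query(q=None, facets="set", facet_size=3))
print(results_query(q="sea", page_size=25, page=3))
```

The three calls reproduce the three example query strings from the slide; stepping through a result set would amount to incrementing `page` while keeping the other arguments fixed.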

The agDataHarvester service

• Implements the OAI-PMH protocol to harvest metadata records from open data providers
– REST-based API
– Harvested dataset available through HTTP

• AgDataHarvester parameters

    {
      "document_type": "harvesting_target",
      "harvesting_target": {
        "name": "Repository name",
        "description": "Short Repository Description",
        "url": "OAI-PMH target URL",
        "type": "metadata format prefix",
        "frequency": hours
      }
    }

param.json

    {
      "document_type": "harvesting_target",
      "harvesting_target": {
        "name": "Indian Academy of Science",
        "description": "Indian Academy of Science",
        "url": "http://repository.ias.ac.in/cgi/oai2",
        "type": "mets",
        "frequency": 24
      }
    }

    curl -X POST [email protected] http://'demo001':[email protected]/agcouchdb

    { "ok": true, "id": "5c56a3fa18fa21d2a85fd63cc9eb78ac", "rev": "1-19ef1210376df8f1695a32b53ecb963a" }

http://agro.ipb.ac.rs/agcouchdb/_design/datasets/_list/search/list?dataset.process_parameter_id=5c56a3fa18fa21d2a85fd63cc9eb78ac
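The registration flow above (build the parameter document, POST it, then use the returned `id` to list the harvested datasets) might be scripted along these lines. The helper names are hypothetical. The HTTP call itself is left out because the credentials in the slide are obscured, so the sketch feeds in the reply shown above as a string.

```python
import json


def harvesting_target(name, description, url, metadata_prefix, frequency_hours):
    """Build the JSON document that describes an OAI-PMH harvesting target."""
    return {
        "document_type": "harvesting_target",
        "harvesting_target": {
            "name": name,
            "description": description,
            "url": url,
            "type": metadata_prefix,
            "frequency": frequency_hours,
        },
    }


def dataset_list_url(service_url, response_text):
    """Derive the dataset listing URL from the JSON reply to the POST."""
    reply = json.loads(response_text)
    if not reply.get("ok"):
        raise ValueError(f"harvesting target was not accepted: {reply}")
    return (f"{service_url}/_design/datasets/_list/search/list"
            f"?dataset.process_parameter_id={reply['id'].strip()}")


target = harvesting_target("Indian Academy of Science", "Indian Academy of Science",
                           "http://repository.ias.ac.in/cgi/oai2", "mets", 24)
print(json.dumps(target, indent=2))  # content of param.json

# Reply copied from the slide; in practice it would come from the POST request.
reply = ('{"ok": true, "id": "5c56a3fa18fa21d2a85fd63cc9eb78ac", '
         '"rev": "1-19ef1210376df8f1695a32b53ecb963a"}')
print(dataset_list_url("http://agro.ipb.ac.rs/agcouchdb", reply))
```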

Using scientific information

The AGRIS case

• A collection of more than 7 million bibliographic references in agriculture

• AGRIS records come with AGROVOC descriptors

• An RDF-aware system
– the AGRIS database is exposed as RDF
– AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)


AgroTagger

• The purpose of the application is to index some Web resources (i.e. URLs) with the AGROVOC thesaurus

• The application can accept two different inputs:
– A text file with a list of URLs
– The output file of an Apache Nutch Web crawler (which contains a list of discovered URLs, but in a specific format)

• The output is a set of connections between input URLs and some extracted AGROVOC URIs
– It can be a simple text file or a set of triples (N-Triples serialization)
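To make the triple output concrete, here is a minimal sketch of serializing URL-to-concept connections as N-Triples. It is not the tool's actual code: the `dcterms:subject` predicate is an assumption (the slides do not say which predicate is emitted), and the resource URL and concept identifiers are placeholders.

```python
def to_ntriples(annotations, predicate="http://purl.org/dc/terms/subject"):
    """Serialize {resource URL: [AGROVOC URIs]} as N-Triples lines.

    The dcterms:subject predicate is an assumption; the slides do not say
    which predicate the tool emits.
    """
    lines = []
    for url in sorted(annotations):
        for concept in sorted(annotations[url]):
            # One statement per line: <subject> <predicate> <object> .
            lines.append(f"<{url}> <{predicate}> <{concept}> .")
    return "\n".join(lines)


# Hypothetical input: the URL and the concept identifiers are placeholders.
annotations = {
    "http://example.org/tomato-guide": [
        "http://aims.fao.org/aos/agrovoc/c_0002",
        "http://aims.fao.org/aos/agrovoc/c_0001",
    ],
}
print(to_ntriples(annotations))
```

Sorting the URLs and concepts keeps the output stable between runs, which makes successive tagging results easier to compare.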

AgroTagger output

Crawling the Web

• Objective: discovering Web resources in agriculture and interlinking them to AGRIS records

• Final Goal: when the system displays an AGRIS record, a list of related Web resources should be available to the user

DataSets and APIs

http://wiki.agroknow.gr/agroknow/index.php/SemaGrow_Hackathon#DataSets_and_APIs

thank you!

Nikos Manolis, AgroKnow

[email protected]