Integrating Government Data New

Integrating Government Data using Semantic Web technology Dean Allemang Chief Scientist, TopQuadrant Inc. Prepared for ISWC 2009


Dean Allemang's Presentation at ISWC 2009

Transcript of Integrating Government Data New

Page 1: Integrating Government Data New

Integrating Government Data

using Semantic Web technology

Dean Allemang, Chief Scientist, TopQuadrant Inc.

Prepared for ISWC 2009

Page 2: Integrating Government Data New

Government Data Sources

Recent efforts have changed the face of government data distribution

– Better motivated
– More sources
– ‘Mandate’ (well, memorandum, anyway) for sharing data

Government data sources
– Data.gov (main focus)
– DOI Architecture - http://www.doi.gov/ocio/architecture/
– USGS Earthquakes - http://earthquake.usgs.gov/eqcenter/
– USASpending.gov

Non-government data sources
– Dbpedia.org
– oeGov

Page 3: Integrating Government Data New

“Objets trouvés”

Artwork made from “found objects” Project Runway, etc.

Lal Hitchcock Sculptures

Page 4: Integrating Government Data New

“Found data”

Data integration efforts try to make data reusable
– Data ‘wholesale’ instead of ‘retail’
– Multiple efforts result in multiple data formats
– Many efforts to ‘unify’ how data is represented
  – (competing) global data standards.
  – Maybe one day, one will win.

Until that time, we have to make do with “found data”
– data that is already available, however it is.

RDF (etc.) can help us do that

Page 5: Integrating Government Data New

Formats for “Found Data” in government

Format | Examples | Notes
Spreadsheets | Data.gov, USASpending.gov, DOI | Flexibility makes it popular, but makes work at re-use time
XML | Data.gov | Not really a single format, but can be parsed uniformly
RSS | USASpending.gov, USGS | Syntax wars largely irrelevant now. Easy to read, dynamic
RDFa | <none?> | New kid on the block, supported by Google, Yahoo!, Drupal
SPARQL Endpoint | Dbpedia.org | Most flexible of all, dynamic
RDF/N3/SKOS | OEGov, Tetherless World | Flexible, relatively static. Great for vocabularies etc.

Page 6: Integrating Government Data New

Quality Considerations of Found Data

Correctness
– Usual notion for data quality; is it right?
– Misspellings, out-of-date data, etc.

Understandability
– Found data requires interpretation.
– E.g., what do columns in a spreadsheet mean?

Accessibility
– How easily can the data be organized?
– E.g., spreadsheets can have haphazard organization
– E.g., RSS feeds that aren’t dynamic, don’t have readable fields, etc.

Reusability/Repurposing
– References to Controlled Vocabularies
– Use of standardized ‘columns’ (properties)

Page 7: Integrating Government Data New

A few species of Found Data

Quantitative Data feeds
– This is what we are usually actually interested in
– Data is described using properties, units, tags, etc.

Vocabularies*
– Structured, unstructured
– Sometimes with strong standards behind them (Westlaw, AGROVOC)
– Not always advertised as ‘vocabularies’ – also as org diagrams, architectures, or even data
  • FEA, TOGAF
  • Geographical entities (States, cities, countries), FAO Geopolitical ontology
  • Units of measure, structure of gov’t agencies

Schema*
– Used to standardize properties (columns, XML tags, etc.)
  • DC, WGS, FOAF, SIOC
  • 11179

* Two kinds of “controlled vocabulary” – often confused!

Page 8: Integrating Government Data New

Integration strategy using RDF

IMPORT data into RDF
– RDF is a sort of ‘least common denominator’ data representation

MERGE data
– A wide variety of technologies available here
– Semantic Web approach – you MODEL your mapping.

ANALYZE and DISPLAY conclusions
– RDF is a sort of ‘least common denominator’ data representation

Page 9: Integrating Government Data New

Import data into RDF

RDF as Common Data representation
‘rote’ transformations

<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>

Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist
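A ‘rote’ transformation can be sketched in a few lines of Python. This is a minimal illustration, not the importer used in the talk: the `http://example.org/import/` namespace, the function names, and the row/element naming scheme are all assumptions made for the example. Each spreadsheet row and each XML element becomes one subject, and each column or child tag becomes one property.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical base namespace for generated identifiers (an assumption).
BASE = "http://example.org/import/"

def rows_to_triples(csv_text, source):
    """One entity per spreadsheet row, one property per column."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        subject = f"{BASE}{source}/row{i}"
        for column, value in row.items():
            triples.append((subject, f"{BASE}{source}#{column}", value))
    return triples

def element_to_triples(xml_text, source):
    """One entity per XML element, one property per child tag."""
    elem = ET.fromstring(xml_text)
    subject = f"{BASE}{source}/{elem.tag}{elem.get('id')}"
    return [(subject, f"{BASE}{source}#{child.tag}", child.text) for child in elem]

SHEET = """Name,Address,Company,Title
Dean Allemang,10 Downing St.,TopQuadrant,Chief Scientist
Michael Brodie,14 Wysteria Lane,Verizon,Chief Scientist
"""

PERSON = """<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>"""

sheet_triples = rows_to_triples(SHEET, "sheet")
xml_triples = element_to_triples(PERSON, "xml")
for triple in sheet_triples + xml_triples:
    print(triple)
```

Nothing is interpreted at this stage: `Company` and `employer` remain distinct properties until a later mapping step relates them.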

Page 10: Integrating Government Data New

Import Data into RDF

Each common data type can be input ‘rote’ into RDF
– Input preserves information from original; entities for e.g., spreadsheet rows, XML elements, database tables, RSS channels, etc.
– Often “found data” requires further processing to make sense, e.g.:
  • Extracting trees from spreadsheets
  • Resolving references in XML
– SPARQL CONSTRUCT is useful for any of these, once data is ‘rote’ translated into RDF

Genus | Species | Sub-species
Canus | Dog | Collie
Canus | Dog | Beagle
Canus | Dog | Terrier
Canus | Wolf | Steppen
Canus | Wolf | Lone

Canus
  Dog
    Collie
    Beagle
    Terrier
  Wolf
    Steppen
    Lone
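Extracting the tree from the spreadsheet can be sketched as follows. This is an illustrative Python version under the assumption that each row lists a path from root to leaf; the talk itself suggests SPARQL CONSTRUCT for this step, and the function names here are invented for the example.

```python
# Rows copied from the spreadsheet on this slide.
ROWS = [
    ("Canus", "Dog", "Collie"),
    ("Canus", "Dog", "Beagle"),
    ("Canus", "Dog", "Terrier"),
    ("Canus", "Wolf", "Steppen"),
    ("Canus", "Wolf", "Lone"),
]

def rows_to_tree(rows):
    """Nest each row's cells so that repeated prefixes share a branch."""
    tree = {}
    for row in rows:
        node = tree
        for cell in row:
            node = node.setdefault(cell, {})
    return tree

def parent_links(tree, parent=None):
    """Flatten the nested dict into (child, parent) pairs."""
    links = []
    for name, children in tree.items():
        if parent is not None:
            links.append((name, parent))
        links.extend(parent_links(children, name))
    return links

tree = rows_to_tree(ROWS)
links = parent_links(tree)
print(links)
```

The `(child, parent)` pairs are the kind of statement that a broader/narrower vocabulary relation would record once the data is in RDF.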

Page 11: Integrating Government Data New

Data Quality and Controlled Vocabularies

Do you reference a controlled vocabulary?
– Flickr, del.icio.us: no
– DOI, GSA, FTF, etc. reference FEA
– Some reference more than one, e.g., GSA references TOGAF also
– Legal briefs reference West Key Numbering System (WestLaw)
– If you reference one (or more), then information sharing becomes possible along that vocabulary

Did you tell us which one you referenced?
– Reference is often implicit, or hidden in column name “Service Standard” (did you recognize that as FEA?)
– Reference is often explicit but informal: ISBN-10: 0123735564
– RDF provides global means of referencing vocabulary with a URI
  http://www.fao.org/aos/agrovoc#c_16080 rdfs:label “Cow milk”@en

Page 12: Integrating Government Data New

Data Quality and Controlled Vocabularies (cont)

How did you specify the term?
– del.icio.us, Flickr, etc. use (uncontrolled) strings
– FEA uses controlled strings (which notion of “Quality” do you mean?)
– WestLaw uses Key Numbering System: 2233(2) “Regular income”
– RDF/SKOS uses global means of referring to terms with the URI
  http://www.fao.org/aos/agrovoc#c_16080 rdfs:label “Cow milk”@en

Sounds familiar?
The URI solves many problems of reference with respect to shared controlled vocabularies!
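A small Python sketch of why the URI helps. The two records and the label "cow's milk" are hypothetical, invented for this example; only the AGROVOC URI and the label “Cow milk” come from the slide. Matching on free-text labels misses the connection, while matching on the URI finds it.

```python
AGROVOC_COW_MILK = "http://www.fao.org/aos/agrovoc#c_16080"

# Hypothetical records from two sources that spell the label differently.
source_a = [{"item": "report-1", "topic_label": "Cow milk", "topic_uri": AGROVOC_COW_MILK}]
source_b = [{"item": "dataset-7", "topic_label": "cow's milk", "topic_uri": AGROVOC_COW_MILK}]

def shared_topics(a, b, key):
    """Topics that appear in both sources when compared on the given field."""
    return {record[key] for record in a} & {record[key] for record in b}

by_label = shared_topics(source_a, source_b, "topic_label")
by_uri = shared_topics(source_a, source_b, "topic_uri")
print(by_label, by_uri)
```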

Page 13: Integrating Government Data New

Unstructured data and Controlled Vocabularies

Found Data sometimes doesn’t refer to vocabularies directly
– “Microsoft announced today that negotiations to acquire search giant Yahoo! have stalled.”
– MICROSOFT, YAHOO! etc. could be controlled terms!
– ‘standard’ terms might not match exactly (SEC names, etc.)

Concept Extraction technology can be relevant here
– Reuters Calais reads news stories and extracts concepts in a controlled vocabulary
– Still has all the reference issues from before
– Calais uses RDF (URIs) to resolve this.

Hooray for Calais!

Page 14: Integrating Government Data New

Merging Data

“Schema mapping”
– Useful when multiple data sources provide the same information about similar items
– Same information is described using different terms (columns, properties)

“Tagging” or “Sorting”
– ‘tags’ data (like del.icio.us or Library of Congress)
– Useful for grouping similar items for search and discovery

Both can be used together
– E.g., use tags to find similar things, then map schemas to report data uniformly

Page 15: Integrating Government Data New

Data mapping Style 1: Schema Mapping Examples

Different sources use different names

Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist

<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>

Name = name, but
Company = employer
Title = position
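The effect of these equivalences can be sketched in Python. This is a simplified, one-directional renaming for illustration, not how an RDFS/OWL reasoner works; the function name `to_common_schema` and the choice of the spreadsheet's column names as the common terms are assumptions.

```python
# Equivalences from the slide: name = Name, employer = Company, position = Title.
EQUIVALENT = {"name": "Name", "employer": "Company", "position": "Title"}

def to_common_schema(record, equivalences=EQUIVALENT):
    """Rename each property to its common name; unmapped properties pass through."""
    return {equivalences.get(prop, prop): value for prop, value in record.items()}

spreadsheet_row = {
    "Name": "Dean Allemang",
    "Address": "10 Downing St.",
    "Company": "TopQuadrant",
    "Title": "Chief Scientist",
}
xml_person = {"name": "Irene Polikoff", "employer": "TopQuadrant", "position": "CEO"}

merged = [to_common_schema(spreadsheet_row), to_common_schema(xml_person)]

# Once both records use the same terms, one query covers both sources.
colleagues = [record["Name"] for record in merged if record["Company"] == "TopQuadrant"]
print(colleagues)
```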

Page 16: Integrating Government Data New

Schema Mapping Examples (cont)

Different structures for similar data

<rss:item ID="3">
  <wgs:lat>39.945345</wgs:lat>
  <wgs:long>-79.34524</wgs:long>
</rss:item>

<image src="doggie.jpg">
  <wgs:Point>
    <wgs:lat>39.945345</wgs:lat>
    <wgs:long>-79.34524</wgs:long>
  </wgs:Point>
</image>

<Entry>
  <position>39.945345,-79.34524</position>
</Entry>

Page 17: Integrating Government Data New

Schema mapping solutions:

With RDFS/OWL:
  :employer owl:equivalentProperty :Company .
  :position owl:equivalentProperty :Title .

With SPARQL
  CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
  WHERE {
    ?x sxml:child ?point .
    ?point a :Point .
    ?point wgs:lat ?lat .
    ?point wgs:long ?long
  }

With SPARQL extensions
  CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
  WHERE {
    ?x :position ?pos .
    LET (?lat := str:before(?pos, ","))
    LET (?long := str:after(?pos, ","))
  }
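The same two structural mappings can be sketched in Python for readers who want to trace the logic. This is an illustration only: the dict shapes stand in for the three XML structures from the previous slide, and the function names are invented. `split_position` plays the role of the `str:before` / `str:after` pair, and the `Point` branch plays the role of lifting lat/long from the child node.

```python
def split_position(pos):
    """Split a "lat,long" string at the first comma."""
    lat, _, long_ = pos.partition(",")
    return float(lat), float(long_)

def normalize(record):
    """Return (lat, long) for any of the three shapes."""
    if "position" in record:   # combined "lat,long" string
        return split_position(record["position"])
    if "Point" in record:      # lat/long nested inside a Point child
        record = record["Point"]
    return float(record["lat"]), float(record["long"])

rss_item = {"lat": "39.945345", "long": "-79.34524"}
image = {"src": "doggie.jpg", "Point": {"lat": "39.945345", "long": "-79.34524"}}
entry = {"position": "39.945345,-79.34524"}

coords = [normalize(record) for record in (rss_item, image, entry)]
print(coords)
```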

Page 18: Integrating Government Data New

Schema mapping solutions (cont)

With a controlled meta-vocabulary and RDFS: E.g., 11179

  :employer rdfs:subPropertyOf 11179:Concept1234 .
  :position rdfs:subPropertyOf 11179:Concept5678 .
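What `rdfs:subPropertyOf` buys can be sketched with the basic RDFS rule: if (s p o) holds and p is a sub-property of q, then (s q o) holds too. The Python below is a single-step illustration, not a full reasoner (chains of sub-properties are not followed). The mappings for `:Company` and `:Title` and the two sample triples are assumptions added to show a second source mapping to the same concepts.

```python
SUBPROPERTY_OF = {
    ":employer": "11179:Concept1234",
    ":position": "11179:Concept5678",
    # Assumption: the spreadsheet source maps its columns to the same concepts.
    ":Company": "11179:Concept1234",
    ":Title": "11179:Concept5678",
}

def infer_superproperties(triples, subproperty_of=SUBPROPERTY_OF):
    """Add (s, q, o) for every (s, p, o) where p is a sub-property of q."""
    inferred = set(triples)
    for s, p, o in triples:
        q = subproperty_of.get(p)
        if q is not None:
            inferred.add((s, q, o))
    return inferred

# Hypothetical sample data, one triple from each source.
triples = [
    (":irene", ":employer", "TopQuadrant"),
    (":dean", ":Company", "TopQuadrant"),
]

inferred = infer_superproperties(triples)
employed = sorted(s for s, p, o in inferred if p == "11179:Concept1234")
print(employed)
```

Neither source adopts the other's property names; each maps to the shared concept, and a query against the shared concept sees both.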

Page 19: Integrating Government Data New

Role of Standards in the Mapping

Schema standards like WGS:
– If all parties use them, no mapping necessary!!
– Simple standards encourage reuse: Microformats

Schema meta-standards like 11179
– If all parties map to them, no new mapping necessary – just use theirs!
– One mega-standard makes re-use difficult
– Meta-standard (don’t use my words, just map to them) makes reuse easier

Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Not very applicable at this stage
– Will come into their own in the next step . . .

Page 20: Integrating Government Data New

Data Mapping Style 2: Tagging or Sorting

Like del.icio.us etc.

<Bookmark href="http://www.topquadrant.com">
  <tag>Semantic Web</tag>
</Bookmark>

<System name="Central Bookkeeping">
  <Evaluation>
    <PerformanceMeasure>Quality</PerformanceMeasure>
    <Result>Fair</Result>
  </Evaluation>
</System>

That’s an FEA reference!

Where does this come from?

Page 21: Integrating Government Data New

Role of Standards in the Mapping

Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Useful for organizing collaboration among groups
– Used extensively by libraries, professional organizations, focused domain groups, etc.
– Not used by del.icio.us, Flickr, etc.
– Related to “Folksonomies”

Page 22: Integrating Government Data New

Analysis and Display

Wide variety of options, including e.g.:
– Use tags and tag structure to amalgamate data
– Display merged properties in a table
– Display merged data on a specific widget (e.g., mapping geospatial data)
– Business Intelligence reporting – pie chart, bars, graphs, etc.

Page 23: Integrating Government Data New

Tags as Amalgamation

FEA

DOI

GSA

If two sources use the same controlled vocabulary, they can be amalgamated along that dimension.
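Amalgamating along a shared vocabulary can be sketched as a grouping step. The records below are hypothetical: only the system name "Central Bookkeeping" and the FEA term "Quality" appear elsewhere in these slides; the other systems, the term "Timeliness", and the function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records from two agencies that both tag systems with FEA terms.
doi_systems = [
    {"system": "Central Bookkeeping", "fea_term": "Quality"},
    {"system": "Land Records", "fea_term": "Timeliness"},
]
gsa_systems = [
    {"system": "Fleet Tracking", "fea_term": "Quality"},
]

def amalgamate(sources, key):
    """Group records from every source under the controlled term they reference."""
    groups = defaultdict(list)
    for agency, records in sources.items():
        for record in records:
            groups[record[key]].append((agency, record["system"]))
    return dict(groups)

by_term = amalgamate({"DOI": doi_systems, "GSA": gsa_systems}, "fea_term")
print(by_term["Quality"])
```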

Page 24: Integrating Government Data New

Mapping Columns

Page 25: Integrating Government Data New

Model-driven displays

SELECT ?lat ?long
WHERE {
  ?item a :DisplayLocation .
  ?item geo:lat ?lat .
  ?item geo:long ?long .
}

Name | latitude | longitude
Slausen | -171.3 | 38.4
Union | -171.4 | 38.2
Vine | -170.9 | 37.9
McArthur | -170.4 | 38.1
Anaheim | -171.3 | 38.2
Chinatown | -171.1 | 38.5
Beverly | -171.3 | 38.1

[Diagram: model relating the table's latitude and longitude columns and the Station class to geo:lat, geo:long, and :DisplayLocation through domain, subPropertyOf, and subClassOf links]
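How a model can drive the display is sketched below in Python. The exact edges of the slide's diagram did not survive extraction, so the model here is an assumption: `latitude` and `longitude` are taken to be sub-properties of `geo:lat` and `geo:long`, and `Station` a subclass of `:DisplayLocation`. The station values are copied as listed in the table; the function names are invented.

```python
# Assumed model edges (the original diagram is not recoverable in detail).
SUBPROPERTY_OF = {"latitude": "geo:lat", "longitude": "geo:long"}
SUBCLASS_OF = {"Station": ":DisplayLocation"}

stations = [
    {"type": "Station", "Name": "Slausen", "latitude": -171.3, "longitude": 38.4},
    {"type": "Station", "Name": "Union", "latitude": -171.4, "longitude": 38.2},
]

def apply_model(record):
    """Add the generic type and properties that the model entails for one record."""
    types = {record["type"]}
    if record["type"] in SUBCLASS_OF:
        types.add(SUBCLASS_OF[record["type"]])
    props = dict(record)
    for local, generic in SUBPROPERTY_OF.items():
        if local in record:
            props[generic] = record[local]
    return types, props

def display_locations(records):
    """Rough equivalent of the SELECT: lat/long of every :DisplayLocation."""
    found = []
    for record in records:
        types, props = apply_model(record)
        if ":DisplayLocation" in types and "geo:lat" in props and "geo:long" in props:
            found.append((props["geo:lat"], props["geo:long"]))
    return found

points = display_locations(stations)
print(points)
```

The display code only knows about `:DisplayLocation`, `geo:lat`, and `geo:long`; a new data source becomes displayable by adding model statements, not by changing the query.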

Page 26: Integrating Government Data New

Exercises

Will use TopBraid™ Ensemble and TopBraid™ Composer

Using data from
– oeGov
– USASpending.gov
– … others TBD …

Merge, slice, amalgamate, etc…