Semantic Data Science for the US Census Bureau

Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community
http://semanticommunity.info/
http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/
https://silverspotfire.tibco.com/us/library#/users/bniemann/Public
http://semanticommunity.info/Census_Semantic_Knowledge_Base
November 14, 2013

Transcript of Semantic Data Science for the US Census Bureau

Page 1: Semantic Data Science for the US Census Bureau

1

Semantic Data Science for the US Census Bureau

Dr. Brand Niemann
Director and Senior Data Scientist

Semantic Community
http://semanticommunity.info/

http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/
https://silverspotfire.tibco.com/us/library#/users/bniemann/Public
http://semanticommunity.info/Census_Semantic_Knowledge_Base

November 14, 2013

Page 2: Semantic Data Science for the US Census Bureau

2

Google Search Display: Census Bureau

Page 3: Semantic Data Science for the US Census Bureau

3

Google Search Result: Census Bureau

• Home Page
– First source for current population data and the latest Economic Indicators

• State and County QuickFacts
– USA QuickFacts

• American FactFinder
– Your source for population, housing ...

• 2010 Census
– Redistricting Data - What is the Census?

• Population Estimates
– The Census Bureau's Population Estimates Program

• Easy Stats
– Easy Stats gives you quick and easy access

• Data Access Tools
– The Census Bureau data tools provide on-line access

Page 4: Semantic Data Science for the US Census Bureau

4

Data Access Tools

• Interactive Internet Data Tools:

– Data Visualization Gallery - A weekly exploration of Census data used to promote visualization and make data accessible to a broader audience.

– DataFerrett is a tool and data librarian that searches and retrieves data across federal, state, and local surveys, executes customized variable recoding, and creates complex tabulations and business graphics. Available datasets include the Current Population Survey, Survey of Income and Program Participation, American Community Survey, American Housing Survey, Small Area Income Poverty Estimates, Population Estimates, Economic Census Areawide Statistics, National Center for Health Statistics data, Centers for Disease Control data, and more.

– DataFerrett’s newest tool, the Community Economic Development HotReport provides community and business leaders speedy access to information on counties and the Employment & Training Administration’s Workforce Innovation in Regional Economic Development (WIRED) areas across the U.S.

Page 5: Semantic Data Science for the US Census Bureau

5

Data Visualization Gallery

http://www.census.gov/dataviz/

Page 6: Semantic Data Science for the US Census Bureau

6

Census Data Visualization Gallery As Data For the Digital Government Strategy

http://semanticommunity.info/Census_Data_Visualization

My Note: Structured and unstructured information is all turned into a knowledge base of data for relational and graph database processing.

My Note: The entire platform can be searched. The entire knowledge base page can be searched.

Page 7: Semantic Data Science for the US Census Bureau

7

Census Data Visualization Gallery: Spotfire

Spotfire Web Player

My Note: This is a federation of diverse data sources to find, facet filter, visualize, and discover new facts.
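The federate-then-facet-filter idea in the note above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Spotfire works internally: the two record sets, their field names, and all values below are made up, and the helper names (`federate`, `facet_counts`, `facet_filter`) are hypothetical.

```python
from collections import Counter

# Two toy "sources" that share a county identifier (all values are made up).
population = [
    {"county_id": "001", "state": "VA", "population": 1000},
    {"county_id": "002", "state": "VA", "population": 2500},
    {"county_id": "003", "state": "MD", "population": 4000},
]
business = [
    {"county_id": "001", "sector": "Retail"},
    {"county_id": "002", "sector": "Health"},
    {"county_id": "003", "sector": "Retail"},
]


def federate(left, right, key):
    """Join two lists of records on a shared key (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def facet_counts(rows, field):
    """Count how many records fall under each value of a facet field."""
    return Counter(row[field] for row in rows)


def facet_filter(rows, **selected):
    """Keep only records that match every selected facet value."""
    return [r for r in rows if all(r.get(f) == v for f, v in selected.items())]


merged = federate(population, business, "county_id")
retail = facet_filter(merged, sector="Retail")
```

Calling `facet_counts(merged, "state")` gives the counts a facet panel would display next to each state, and `retail` holds the records left after one facet value is selected.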

Page 8: Semantic Data Science for the US Census Bureau

8

The Data Web: DataFerrett

http://dataferrett.census.gov/

Page 9: Semantic Data Science for the US Census Bureau

9

DataFerrett Description

• DataFerrett is a data analysis and extraction tool to customize federal, state, and local data to suit your requirements. Using DataFerrett, you can develop an unlimited array of customized spreadsheets that are as versatile and complex as your usage demands, then turn those spreadsheets into graphs and maps without any additional software.

• My Comment: This is what I use Spotfire for on Open Government Data for the Digital Government Strategy.

Page 10: Semantic Data Science for the US Census Bureau

10

Community Economic Development HotReport Description

• This site, the Community Economic Development HotReport, provides access for users seeking economic indicators for individual counties.

• For areas that experience economic disruptions due to natural disasters, plant closings, base closings, and other economic changes, such as abrupt increases in employment, this HotReport shows pertinent economic indicators in unified on-line reports from many data sources.

Page 11: Semantic Data Science for the US Census Bureau

11

Community Economic Development HotReport Web Site

Click on graph to view table.

Community Economic Development HotReport

Page 12: Semantic Data Science for the US Census Bureau

12

White House Big Data Event: Data to Knowledge to Action

Making the Most of Big Data

“Just wanted to say how helpful it is that you take notes and share so broadly at these types of events. Thanks for your ongoing contributions to all the communities of which you are a part.”

Page 13: Semantic Data Science for the US Census Bureau

13

Semantic Data Science Team Attends White House Big Data Event

• Our work is an example of the bold new collaboration theme: “Harnessing the Potential of Data Scientists and Big Data for Scientific Discovery” that shows “Data Innovation Across Sectors” and includes the following Breakout session topics:

– Education and Workforce Development (George Mason University and Johns Hopkins University - see below)

• My Note: Census is one of 9 agencies involved in this NITRD effort.

– Research and Development (NIH and YarcData)

– Innovation (DC Data Science Community and Semantic Community)

Page 14: Semantic Data Science for the US Census Bureau

14

NITRD Supplement to the FY14 President’s Budget

• We have worked to support the NITRD Current and Planned Coordination Activities as follows:

– Working with two of the six agencies (NSF and NIH) and trying to work with the other four: DoD, DARPA, DOE, and USGS;

– Following the work in the NSF-NIH Solicitation, Core Techniques and Technologies for Advancing Big Data Science & Engineering, for datasets and results that can be reused;

– Helping ensure a trained workforce to capitalize on big data resources by working with GMU Data Science as part of our team and preparing a graduate course on data science using the applications and data sets mentioned above and below;

– Providing examples of applications that use multiagency big datasets and core technology that is needed to turn heterogeneous data into more homogeneous, interoperable data;

– Providing big data infrastructure development for domain science with Spotfire and the YarcData Graph Appliance; and

– Attending the second National Big Data R&D Initiative event.

• My Note: We would like to work with Census on any or all of these!

Current and Planned Coordination Activities

Page 16: Semantic Data Science for the US Census Bureau

16

Contact Information

• Brand Niemann, Semantic Community
– [email protected]
– 703-268-9314
– http://semanticommunity.info

• N. Fredrik Salvesen, SBK LLC Alliance Partner YarcData
– [email protected]
– 443 994-5193
– http://yarcdata.com/

Page 17: Semantic Data Science for the US Census Bureau

17

Some Next Steps

• So after about 10 years of development and the recent work of our Semantic Data Science Team, we think we have the best US Federal Government semantic knowledge base (NIH Semantic Medline) running on one of the best graph computers (YarcData) for the OSTP/NITRD Federal Big Data Senior Steering WG.

• Our goal is to produce the “Killer Semantic Web Application for the US Federal Government” and we still have a ways to go.

• Now we need to help other agencies do the same by applying semantic data science to their data and metadata to develop their semantic knowledge base for piloting on the best graph computers.

• The following is a pilot example to begin to develop a semantic knowledge base for US Census showing the steps for preparing legacy US Census data sources and for collecting new US Census data sources so they are stored directly in a semantic knowledge base.

– A historical note: This is like when I led the E-forms For E-government Pilot for OMB and the Federal CIO Council – I selected the US Census Economic Census E-forms solution by Rick Fenestra to be the best practice for getting about 15 E-forms solutions being used by the US Federal Government to adopt a common e-Grant XML Schema so all 15 could become semantically interoperable and agencies would not have to “rip and replace” solutions. This approach could make agency semantic knowledge bases interoperable so they can be federated and we would have a “killer semantic web application” on top of “individual killer semantic web applications”!

Page 18: Semantic Data Science for the US Census Bureau

18

Data Access Tools

http://www.census.gov/main/www/access.html

• Quick Facts
• American FactFinder
• Easy Stats
• My Congressional District
• Population Finder
• American Community Survey
• 2010 Census
• Economic Census
• Interactive Maps
• Data Visualizations
• Training & Workshops
• Data Tools
• Catalogs
• Publications

Page 19: Semantic Data Science for the US Census Bureau

19

Census Semantic Knowledge Base

• US Census data is available in the following ways:

– Data Access Tools: Making It Easier to Use the Data Than Just Direct File Access Below (Start Here)

– Research Data Centers: Access to Confidential Data (Defer This Until Later Stage)

– Software to Download: More Tools to Use (This is More About Data Than Software)

– Direct File Access: Public (Include This) and Private (Not Applicable Here)

– Access Tools at Other Sites: Is There a Better Place to Build This Semantic Knowledge Base? (That University of Minnesota Web Site Looks Pretty Good!)

My Note: This defines how to start and the scope of the semantic knowledge base.

Page 20: Semantic Data Science for the US Census Bureau

20

Semantic Knowledge Base

• Initially we need at least a taxonomy and a vocabulary.

• Eventually, we would like an ontology and thesaurus.

• We need to build a data and metadata ecosystem with relational and graph data sets.

• The pilot will build a knowledge base in MindTouch, spreadsheets in Excel, a dashboard in Spotfire, and a business process for data collection in Be Informed.

• The pilot will be scaled up to create an RDF triple store for the YARCData Graph Appliance.

• In essence, I am going to build a “SemanticData.gov” type application for the US Census Data.
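As a rough illustration of how a small taxonomy and vocabulary could become input for an RDF triple store, here is a minimal Python sketch. It rests on assumptions the slides do not state: the namespace `http://example.org/census/` is a placeholder, the use of SKOS predicates (`prefLabel`, `broader`, `definition`) is one plausible modeling choice, and the definition text is abbreviated for the example.

```python
BASE = "http://example.org/census/"  # placeholder namespace, not a real Census URI
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Taxonomy: term -> parent term (tool names taken from the slides).
taxonomy = {
    "Data Access Tools": None,
    "DataFerrett": "Data Access Tools",
    "QuickFacts": "Data Access Tools",
}

# Vocabulary: term -> definition (text abbreviated for illustration).
vocabulary = {
    "DataFerrett": "A data analysis and extraction tool.",
}


def iri(term):
    """Format a term as an N-Triples IRI under the placeholder namespace."""
    return f"<{BASE}{term.replace(' ', '_')}>"


def lit(text):
    """Format text as an N-Triples string literal, escaping \\ and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_triples(taxonomy, vocabulary):
    """Build (subject, predicate, object) triples from both structures."""
    triples = []
    for term, parent in taxonomy.items():
        triples.append((iri(term), f"<{SKOS}prefLabel>", lit(term)))
        if parent is not None:
            triples.append((iri(term), f"<{SKOS}broader>", iri(parent)))
    for term, definition in vocabulary.items():
        triples.append((iri(term), f"<{SKOS}definition>", lit(definition)))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

`to_ntriples(to_triples(taxonomy, vocabulary))` yields plain text that a triple store could load; moving from a taxonomy toward an ontology would mean adding more predicates than the three used here.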

Page 21: Semantic Data Science for the US Census Bureau

21

Data Access Tools

• Data Visualization Gallery: Recall Slide 6 Knowledge Base and Slide 7 Spotfire
• 2010 Census Interactive Population Map
• The American FactFinder
• QuickFacts
• Easy Stats
• County Business & Demographics Map
• Economic Database Search and Trend Charts
• Glossary: See Slide 26 Excel and Slide 27 Spotfire Knowledge Bases
• Censtats
• Online Mapping Tools
• US Gazetteer
• Business Dynamics Statistics
• DataFerrett: Recall Slides 8-9
• Community Economic Development HotReport: Recall Slides 10-11
• QWI Online
• OnTheMap
• Industry Focus
• Census 2000 EEO Data Tool

My Note: This is another taxonomy!

Page 22: Semantic Data Science for the US Census Bureau

22

Data Access Tools: Knowledge Base Spreadsheet

http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx

My Note: This is a taxonomy in Semantic Web Linked Open Data Format.

Page 23: Semantic Data Science for the US Census Bureau

23

Direct File Access: Public

http://www2.census.gov/census_2000/datasets/

My Note: This is a taxonomy of how Census organizes its data files that needs to be a searchable index in a spreadsheet.
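One way to turn a directory layout into a searchable spreadsheet index is sketched below in Python. The relative paths are hypothetical stand-ins (the real listing under the URL above is not reproduced here), and the column names are an assumption made for the example.

```python
import csv
import io
from pathlib import PurePosixPath

ROOT = "http://www2.census.gov/census_2000/datasets/"

# Hypothetical relative paths; replace with the real directory listing.
paths = [
    "example_series/state_a/file_one.zip",
    "example_series/state_b/file_two.zip",
    "another_series/readme.txt",
]


def build_index(paths, root=ROOT):
    """Split each path into spreadsheet columns (one row per file)."""
    rows = []
    for p in paths:
        parts = PurePosixPath(p).parts
        rows.append({
            "top_level": parts[0],
            "folder": "/".join(parts[:-1]),
            "file": parts[-1],
            "depth": len(parts),
            "url": root + p,
        })
    return rows


def to_csv(rows):
    """Write the index as CSV text that Excel or Spotfire can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def search(rows, text):
    """Case-insensitive substring search over the full URL."""
    text = text.lower()
    return [r for r in rows if text in r["url"].lower()]


index = build_index(paths)
```

The `top_level` and `folder` columns carry the taxonomy (how the files are grouped), while `search` stands in for the filter box a spreadsheet or dashboard would provide.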

Page 24: Semantic Data Science for the US Census Bureau

24

Direct File Access Public: Knowledge Base Spreadsheet

http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx

My Note: This is in both relational and graph (subject, object, & predicate) database formats.
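To make the relational-versus-graph point concrete, the following Python sketch converts spreadsheet-style rows into (subject, predicate, object) triples and back again. The tool names come from the slides; the `type` column and its values are assumptions added for illustration, and the function names are hypothetical.

```python
def rows_to_triples(rows, key):
    """Turn each non-key cell into a (subject, predicate, object) triple."""
    return [
        (row[key], column, value)
        for row in rows
        for column, value in row.items()
        if column != key
    ]


def triples_to_rows(triples, key):
    """Pivot triples back into one record per subject."""
    records = {}
    for subject, predicate, obj in triples:
        records.setdefault(subject, {key: subject})[predicate] = obj
    return list(records.values())


# Relational form: one row per tool, one column per attribute.
table = [
    {"tool": "DataFerrett", "category": "Data Access Tools", "type": "extraction"},
    {"tool": "QuickFacts", "category": "Data Access Tools", "type": "summary"},
]

# Graph form: the key column becomes the subject, column names become predicates.
triples = rows_to_triples(table, key="tool")
```

Because `triples_to_rows(triples, "tool")` reproduces the original table, the same spreadsheet can serve both a relational tool and a graph tool without losing information, as long as every row has a unique key.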

Page 25: Semantic Data Science for the US Census Bureau

25

Census Taxonomy and Vocabulary: MindTouch Matrix

http://semanticommunity.info/Census_Semantic_Knowledge_Base#Story

My Note: The entire page & platform can be searched.

Page 26: Semantic Data Science for the US Census Bureau

26

Census Semantic Knowledge Base: Excel Glossary

http://semanticommunity.info/@api/deki/files/27084/CensusSemanticKnowledgeBase.xlsx

My Note: All of these spreadsheets can be searched.

My Note: The Semantic Community approach is consistent with the EU ISA Recommended URI Design and Management Principles.
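As a hedged illustration of what following URI design principles could look like for glossary terms, the Python sketch below mints URIs of the form `http://{domain}/{type}/{concept}/{reference}`. That pattern is assumed here as a summary of the ISA guidance and should be checked against the ISA document itself; the domain `example.org` and the helper names are placeholders, not part of the Semantic Community platform.

```python
import re


def slugify(text):
    """Lowercase the text and collapse non-alphanumeric runs to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")


def mint_uri(domain, resource_type, concept, reference):
    """Build a URI as http://{domain}/{type}/{concept}/{reference} (assumed pattern)."""
    return (
        f"http://{domain}/{resource_type}/"
        f"{slugify(concept)}/{slugify(reference)}"
    )


# Placeholder domain; "id" marks an identifier for a real-world thing.
term_uri = mint_uri("example.org", "id", "Glossary", "American FactFinder")
```

Keeping the URI free of file extensions, version numbers, and tool names is what lets the same identifier survive a move from Excel to Spotfire to a triple store.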

Page 27: Semantic Data Science for the US Census Bureau

27

Census Semantic Knowledge Base: Spotfire Glossary

https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp

Page 28: Semantic Data Science for the US Census Bureau

28

Census Semantic Knowledge Base: Spotfire Taxonomy

https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp

Page 29: Semantic Data Science for the US Census Bureau

29

Conclusions and Recommendations

• A taxonomy (Interactive Internet Data Tools) and vocabulary (Glossary) from Census were used to pilot a semantic knowledge base.

• Agile development of the semantic knowledge base was possible when the data dictionary and data are readily available in a spreadsheet or at the download site so one can focus on doing the data science and analytics.

• The Census "Building Deep Links into American FactFinder" can be Semantic Web Linked Open Data.

– See 2012 Statistical Abstract as a Semantic Knowledge Base in the Next Slide.

• The Semantic Community Platform can produce a Census data science ecosystem and products in an interoperability interface with semantic interoperability.

• Next is piloting Be Informed for Census survey data collection and then YARCData on the triple stores that are created.

Page 30: Semantic Data Science for the US Census Bureau

30

Statistical Abstract 2012: Spotfire Knowledge Base

http://semanticommunity.info/FedStats.net#Spotfire_Dashboard