Semantic Search for NSF Decision Making Dr. Brand Niemann Director and Senior Enterprise Architect...


1

Semantic Search for NSF Decision Making

Dr. Brand Niemann
Director and Senior Enterprise Architect – Data Scientist

Semantic Community
http://semanticommunity.info/

AOL Government Blogger
http://gov.aol.com/bloggers/brand-niemann/

April 4, 2012

2

Overview

• Background
• NITRD Dashboards
• Data.gov Developer Community
• Research.gov Dashboard
• Semantic MedLine
• Some Next Steps

3

Background

• At EPA, where I was Senior Enterprise Architect and Data Scientist and led several Federal CIO Council activities, and since leaving government to become Director and Senior Enterprise Architect – Data Scientist of Semantic Community, my role has been to implement high-level direction as follows:

4

Background

• Teri Takai (DoD CIO) - Harvard Leadership for a Networked World, Lead Practitioner. I am an Invited Practitioner who mentors students under her direction.
– Social Business Intelligence from Open Government Data

• Letitia Long, Director of the National Geospatial-Intelligence Agency. I am the lead for the pilot demonstration for the NCOIC-NGA CRADA at the upcoming 13th SOA for eGov Conference, April 3rd.
– A Quint – Cross Information Sharing and Integration for the Intelligence Community
– Demonstration at the 13th SOA for E-Government Conference, April 3, 2012, at MITRE

• Donna Roy, Executive Director of NIEM. She requested that I provide suggestions and demonstrations for evolving NIEM, which I have done twice.
– A Plan for Scaling NIEM to Big Data
– Build The NIEM Information Exchange Clearinghouse In The Cloud

• Gus Hunt, CIA CTO. He challenged me to show how to make the CIA World Fact Book more semantic and to work with Digital Reasoning.
– CIA World Fact Book
– Digital Reasoning

5

Background

• Sonny Bhagowalia, David McClure, and Jeanne Holm (Data.gov Program Executive, GSA Associate Administrator, and Data.gov Evangelist, respectively) challenged me to do data science for Data.gov.
– Data.gov
– Data.gov Developers Community Space Launched

• Wyatt Kash, Editor in Chief for AOL Government, challenged me to build Shared Services like those Federal CIO Steven VanRoekel is asking for.
– Federal IT Dashboard in Motion and In Memory

• Dennis Wisnosky, DoD CTO, and Walt Okon, DoD Senior Architect Engineer, challenged me to Build DoD in the Cloud and Federate It with Other DoD and non-DoD Architectures (e.g. TOGAF).
– Build DoD in the Cloud and Build TOGAF in the Cloud
– Enterprise Information Web for Semantic Interoperability at DoD

• Dr. George Strawn, Director of the NSF NITRD and White House OSTP Staff to the CTO (Aneesh Chopra and Todd Park), challenged me to do data science dashboards.
– A NITRD Dashboard (March and April 2011)
– SIRA for Semantic Search (August 10, 2011)
– A Research.gov Dashboard (March 2012)
– Semantic MedLine (In process)

6

NITRD Dashboards

http://semanticommunity.info/A_NITRD_Dashboard#Spotfire

Note: Also see Build the NITRD Dashboard in the Cloud and Build the R&D Dashboard in the Cloud.

7

Data.gov Developer Community

• Play the role of a data scientist from an agency, use a platform that supports the capabilities listed below, and build an app that provides semantic search for NSF abstracts that allows decision makers to identify future scientific research needs.

• My distilled suggestions for the recent excellent Data.gov meeting are:
– Add a data scientist to the Data.gov team to lead a new community of data scientists from the agencies and non-government organizations.
– Ensure that the new data.gov platform supports the sitemap and schema protocols with well-defined URLs for content, faceted search, and big data in memory.

– Encourage the new developer community to build their own data.gov sites to become both publishers and consumers of data to support the new data scientist community above.

• Note: Invited by Jeanne Holm, Data.gov Evangelist, to give a presentation at the end of April.

http://semanticommunity.info/AOL_Government/Data.gov_Developers_Community_Space_Launched
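
As an illustration of the sitemap suggestion above, here is a minimal sketch of generating a sitemap with one well-defined URL per dataset page. It uses only the Python standard library and the public sitemaps.org namespace. The function name `build_sitemap`, the `example.gov` URLs, and the dates are hypothetical placeholders, not actual Data.gov content.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a sitemap XML string for an iterable of (url, last_modified) pairs."""
    # Register the namespace as the default so the output has no prefixes.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical dataset pages; example.gov is a placeholder domain.
PAGES = [
    ("https://example.gov/dataset/nsf-awards", "2012-03-30"),
    ("https://example.gov/dataset/nitrd-budget", "2012-04-04"),
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

A crawler or search indexer can read such a file to discover every content page without scraping navigation menus, which is the property the suggestion relies on.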

8

Research.gov Dashboard

• Build an app that provides semantic search for NSF abstracts that allows decision makers to identify future scientific research needs.

• Created 176 MB Excel file (60,981 rows by 44 columns) for Spotfire Dashboard.
– Get 2011 data from state tables?

• Tried to extract text for Semantic Search with SIRA and Digital Reasoning, but found that the Abstract text is cut off and that URLs are embedded in the Publications and Project Outcomes columns.
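
One way to audit an export for the two problems noted above (cut-off abstracts and URLs embedded in text columns) is sketched below. The column names (`Award`, `Abstract`, `Publications`), the sample rows, and the end-of-sentence heuristic are all assumptions for illustration, not the actual Research.gov schema.

```python
import csv
import io
import re

# Simple URL matcher; stops at whitespace and common delimiters.
URL_PATTERN = re.compile(r"https?://[^\s,;<>]+")
# Endings treated as a complete sentence (a heuristic, not a guarantee).
SENTENCE_ENDINGS = (".", "!", "?", '."', ".)")


def looks_truncated(text):
    """Return True for non-empty text that does not end like a finished sentence."""
    stripped = text.strip()
    return bool(stripped) and not stripped.endswith(SENTENCE_ENDINGS)


def split_urls(text):
    """Return (text with URLs removed, list of URLs found) for one cell."""
    urls = URL_PATTERN.findall(text)
    cleaned = " ".join(URL_PATTERN.sub(" ", text).split())
    return cleaned, urls


def audit_rows(rows):
    """Flag likely truncated abstracts and collect embedded URLs per award."""
    report = []
    for row in rows:
        _, urls = split_urls(row["Publications"])
        report.append({
            "award": row["Award"],
            "truncated": looks_truncated(row["Abstract"]),
            "urls": urls,
        })
    return report


# Hypothetical sample standing in for a CSV export of the spreadsheet.
SAMPLE_CSV = """Award,Abstract,Publications
0000001,This project studies coastal erosion.,Smith 2011 http://example.org/paper1
0000002,This project develops new methods for,No publications reported
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit_rows(rows)
for entry in report:
    print(entry)
```

A report like this only quantifies the damage; recovering the missing abstract text still requires the raw source data.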

9

Research.gov Spending & Results

https://www.research.gov/

Download Data Sets

10

Research.gov Dashboard

http://semanticommunity.info/A_NITRD_Dashboard/Research.gov#Spotfire_Dashboard

11

Sample of Hand Parsed Text

http://semanticommunity.info/A_NITRD_Dashboard/Research.gov/Sample_of_Hand_Parsed_Text

Note: We will need to get the raw text data to accomplish the objectives of this work.

12

Semantic MedLine Prototype: Home

• Semantic MEDLINE is a prototype Web application that summarizes MEDLINE citations returned by a PubMed search. Natural language processing is used to analyze salient content in titles and abstracts. This information is then presented in a graph that has links to the MEDLINE text processed.

• Currently, the results from 35 PubMed searches (including a variety of disorders and drugs) are available to be processed. The 500 most recent citations (from the date of the search) are available for further processing by Semantic MEDLINE.

• Begin at the Search tab by selecting a search; then move to the Summarize tab. Choose a summary type to specify the point of view of the summary (Treatment of Disease, Substance Interactions, Diagnosis, or Pharmacogenomics). After selecting the topic of the summary, click the Summarize and Visualize button. The graph appears below. Right click on an edge to display a MEDLINE citation.

http://skr3.nlm.nih.gov/SemMedDemo/index.jsp

13

Semantic MedLine Prototype: Search

http://skr3.nlm.nih.gov/SemMedDemo/InitializeSearch.do

14

Semantic MedLine Prototype: Summarize

http://skr3.nlm.nih.gov/SemMedDemo/Summary.do

15

Semantic MedLine

http://skr3.nlm.nih.gov/SemMed/

16

Semantic MedLine Prototype: Knowledgebase

http://semanticommunity.info/A_NITRD_Dashboard/Semantic_Medline

17

Semantic MedLine: Predication Database

ftp://lhcftp.nlm.nih.gov/outgoing/cgsb/

Note: Large Tar and GZIP files!

18

Semantic MedLine: Data Extraction

http://semanticommunity.info/A_NITRD_Dashboard/Semantic_Medline/Data_Extraction

22

Some Next Steps

• We will need to get the raw text data to accomplish the objectives of the work with the Research.gov Abstracts, Project Outcomes, etc.

• We need to extract the large Semantic MedLine Predication Database files for Semantic Search with SIRA and Digital Reasoning.
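
For the large tar and gzip files, one possible approach is to read archive members line by line instead of unpacking everything to disk first. The sketch below uses Python's standard `tarfile` module. The member name, the `.txt` suffix, and the pipe-delimited subject|predicate|object layout are hypothetical; the actual layout of the predication files would need to be checked before adapting this.

```python
import io
import tarfile


def iter_archive_lines(fileobj, suffix=".txt"):
    """Yield (member_name, line) pairs from a .tar.gz without extracting to disk."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and files with other suffixes.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            extracted = archive.extractfile(member)
            for raw in extracted:
                yield member.name, raw.decode("utf-8", errors="replace").rstrip("\n")


def make_demo_archive():
    """Build a tiny in-memory .tar.gz so the sketch runs without any download."""
    # Hypothetical predication rows in an assumed subject|predicate|object layout.
    payload = b"aspirin|TREATS|headache\nmetformin|TREATS|diabetes\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="demo_predications.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


lines = list(iter_archive_lines(make_demo_archive()))
for name, line in lines:
    subject, predicate, obj = line.split("|")
    print(name, subject, predicate, obj)
```

For a real download, the in-memory buffer would be replaced by a file opened in binary mode, and the generator would be consumed incrementally rather than collected into a list.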

23

AOL Government Stories

• Semantic Medline (Pending)
• HPN Health Prize for Health Data Palooza (Pending)
• From Catalyst to Semantic Synthesis - How the IC Finds More Needles in Bigger Haystacks (Pending)
• Challenges and Opportunities in Big Data: Defense Department Bets Big On Big Data
• Semantics and Ontologies for the Intelligence Community Working Toward Standards (Pending)
• Data.gov Developers Community Space Launched - Is Dr. Merkin In the House? (Pending)
• Building Trust Between Cloud Computing Providers and Suppliers
• Health Datapalooza Would Benefit From Real Innovation Investment
• Has NIEM Reached A Choke Point With Big Data
• Put Federal IT Dashboard Into Motion
• Why The Intelligence Community Loves Big Data
• Big Data Science Visualizations Past Present and Future

http://semanticommunity.info/#AOL_Government_Stories

24

Challenges and Opportunities in Big Data

http://gov.aol.com/2012/03/30/defense-department-bets-big-on-big-data/

25

My Suggestions

• I think it leaves us with a federal big data program that is disconnected between the science and intelligence communities, with the former considerably behind the latter.

• As Professor Jim Hendler, RPI Computer Scientist, commented during the meeting: "Computer scientists like us have to move to the social science side of things to really do big data."

• This new White House Initiative needs Todd Park's entrepreneurial spirit, Gus Hunt's experience, and DoD's new money, spent in a coordinated way with the IC and civilian agencies to make big data across the federal government a reality.

26

Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)

http://www.nsf.gov/publications/pub_summ.jsp?WT.z_pims_id=504767&ods_key=nsf12499