SemTech2010

The Research Intelligence Project California Institute for Telecommunications and Information Technology (Calit2) Jerry Sheehan, Chief of Staff June 25th, 2010 SemTech 2010

description

Presentation made at SemTech 2010 detailing the Calit2 Research Intelligence system for faculty expertise profiles and our experience with semantics in this space.

Transcript of SemTech2010

Page 1: SemTech2010

The Research Intelligence Project

California Institute for Telecommunications and Information Technology (Calit2)

Jerry Sheehan, Chief of Staff

June 25th, 2010

SemTech 2010

Page 2: SemTech2010

The Research Intelligence Project

Outline

Our Problem

The Research Intel Tools

Semantic Data Evolution

Future Directions

Concluding Thoughts

Page 3: SemTech2010

My Bias

Image Courtesy of Matt Jones, Creative Commons License, Flickr (blackbeltjones)

Prefer Found Elsewhere

Page 4: SemTech2010

Topic I

Our Problem

Page 5: SemTech2010

Who Are We?

Page 6: SemTech2010

What Do We Do?

Page 7: SemTech2010

The Standard “Completed” Faculty Profile

Dr. H’s

Dr H.

drh@edu

Page 8: SemTech2010

Different Way to Think About Our Problem

Image Courtesy of Scott Granneman, Creative Commons License, Flickr (rsgranne)

Page 9: SemTech2010

Topic II

Tools We Developed

Page 10: SemTech2010

How Research Universities Look

At Their Business Data

Image Courtesy of HA! Designers, Creative Commons License, Flickr (artbyheather)

Page 11: SemTech2010

How We Could Look At Our Data

Logo Design by Kyle Bowen, http://www.educause.edu/Community/MemDir/Profiles/KyleBowen/58744

Page 12: SemTech2010

Research Intelligence Platform Development

Timeline: 2005, 2006, 2007, 2008, 2009, 2010

Phases: Idea; Proof of Concept; Alpha/Beta for Calit2; Beta for Others; Production for Campus; New Domains

# of Users: 460; 250; 300; 480; 900 Faculty; 71 Companies

Page 13: SemTech2010

Research Intelligence Development in Web History Timeline

Page 14: SemTech2010

2005: Topic Modeling of Researchers

Initial Site Developed by David Newman with Direction from Padhraic Smyth, University of California, Irvine

Page 15: SemTech2010

2005: The Topic Modeling Proof of Concept

http://datalab-1.ics.uci.edu/calit2/

Page 16: SemTech2010

Conceptual Challenges with 2005 Model

NLP Algorithm Human Intervention

Discipline Bias

Page 17: SemTech2010

The Folksonomy vs Taxonomy Debate

Felis Bengalensis

Bengal Cat

Folksonomy
• Cat
• Bengal Cat
• F6
• Leopard
• Hybrid
• Nikita

Taxonomy
• Kingdom: Animalia
• Phylum: Chordata
• Class: Mammalia
• Order: Carnivora
• Family: Felidae
• Genus: Felis
• Species: Bengalensis
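
The contrast on this slide can be sketched as two data structures. This is an illustrative sketch only, not the system's implementation: the folksonomy is modeled as a flat set of free-form tags and the taxonomy as an ordered path of ranks. The `domestic_cat` entry and both helper functions are hypothetical, added only to show how matching differs between the two.

```python
# Folksonomy: a flat, unordered set of free-form tags (values from the slide).
folksonomy = {"Cat", "Bengal Cat", "F6", "Leopard", "Hybrid", "Nikita"}

# Taxonomy: an ordered path from the most general rank to the most specific.
taxonomy = [
    ("Kingdom", "Animalia"),
    ("Phylum", "Chordata"),
    ("Class", "Mammalia"),
    ("Order", "Carnivora"),
    ("Family", "Felidae"),
    ("Genus", "Felis"),
    ("Species", "Bengalensis"),
]

# Hypothetical second entry, used only to demonstrate taxonomy matching.
domestic_cat = taxonomy[:-1] + [("Species", "Catus")]


def shares_tag(tags_a, tags_b):
    """Folksonomy matching: any shared tag relates two items."""
    return bool(set(tags_a) & set(tags_b))


def common_ancestor(path_a, path_b):
    """Taxonomy matching: the deepest rank at which two paths still agree."""
    deepest = None
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        deepest = a
    return deepest


print(shares_tag(folksonomy, {"Leopard", "Snow"}))
print(common_ancestor(taxonomy, domestic_cat))
```

Folksonomy matching is cheap but depends on people choosing the same words; taxonomy matching gives a precise level of relatedness but requires every item to be placed in the hierarchy first.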

Page 18: SemTech2010

Manual Tagging Experiment

• A three-person team examined one university-affiliated web page for each affiliated faculty member and associated a minimum of three keywords with each person.

• No controlled vocabulary was used; instead, a narrative question focused the manual tagging:

• What type of research does this person primarily do?

• Created a SQL database of all UCSD-affiliated academic researchers.
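
A minimal sketch of what such a database could look like, using Python's built-in `sqlite3` module. The table names, column names, and helper functions are assumptions made for illustration; the slide does not describe the actual schema. The only rule taken from the slide is the minimum of three keywords per person.

```python
import sqlite3


def build_db(path=":memory:"):
    """Create a hypothetical two-table schema: researchers and their tags."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE researcher (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            profile_url TEXT NOT NULL
        );
        CREATE TABLE keyword (
            researcher_id INTEGER NOT NULL REFERENCES researcher(id),
            tag           TEXT NOT NULL,
            UNIQUE (researcher_id, tag)
        );
        """
    )
    return conn


def add_researcher(conn, name, profile_url, tags):
    """Store one researcher; enforce the three-keyword minimum from the slide."""
    cleaned = sorted({t.strip().lower() for t in tags if t.strip()})
    if len(cleaned) < 3:
        raise ValueError("at least three keywords are required per person")
    cur = conn.execute(
        "INSERT INTO researcher (name, profile_url) VALUES (?, ?)",
        (name, profile_url),
    )
    conn.executemany(
        "INSERT INTO keyword (researcher_id, tag) VALUES (?, ?)",
        [(cur.lastrowid, t) for t in cleaned],
    )
    conn.commit()
    return cur.lastrowid


def researchers_for(conn, tag):
    """Return the names of researchers tagged with the given keyword."""
    rows = conn.execute(
        "SELECT r.name FROM researcher r "
        "JOIN keyword k ON k.researcher_id = r.id "
        "WHERE k.tag = ? ORDER BY r.name",
        (tag.strip().lower(),),
    )
    return [name for (name,) in rows]


conn = build_db()
add_researcher(conn, "Dr. H", "http://example.edu/drh",
               ["Photonics", "Sensors", "Wireless"])
print(researchers_for(conn, "photonics"))
```

Because there is no controlled vocabulary, the sketch only normalizes case and whitespace; it makes no attempt to merge synonyms.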

Page 19: SemTech2010

Unfiltered Tags: Automated Extraction

1. ucsd (157)
2. email (117)
3. university of california san diego (112)
4. sdsc (55)
5. contact (50)
6. california san diego (47)
7. professor (44)
8. university of california (44)
9. computer science (36)
10. mail (36)
11. edu (34)
12. wireless (31)
13. telecommunications (31)
14. california institute (28)
15. photonics (27)
16. physics (26)
17. signal processing (23)
18. visualization (22)
19. computer engineering (22)
20. bioinformatics (21)
21. capsule bio (21)
22. nanotechnology (19)
23. uc san diego (19)
24. sensors (18)
25. scripps institution of oceanography (18)
26. information technology (17)
27. ucsd faculty (17)
28. structural engineering (16)
29. associate professor (16)
30. electrical engineering (16)
31. department of computer science (16)
32. cse (16)
33. responsphere (16)
34. computational biology (15)
35. adjunct professor (15)
36. algorithms (15)
37. nsf (14)
38. networking (14)
39. digital signal processing (14)
40. geophysics (14)
41. (14)
42. california institutes (14)
43. information technology staff (14)
44. cwc (13)
45. san diego supercomputer center (13)
46. biology (13)
47. cognitive science (13)
48. information theory (13)
49. optical networking (13)
50. mit (13)

Page 20: SemTech2010

Filtered Tags: Automated Extraction

1. wireless (31)
2. telecommunications (31)
3. photonics (27)
4. physics (26)
5. signal processing (23)
6. visualization (22)
7. computer engineering (22)
8. bioinformatics (21)
9. nanotechnology (19)
10. sensors (18)
11. information technology (17)
12. structural engineering (16)
13. electrical engineering (16)
14. responsphere (16)
15. computational biology (15)
16. algorithms (15)
17. nsf (14)
18. networking (14)
19. digital signal processing (14)
20. geophysics (14)
21. (14)
22. cwc (13)
23. san diego supercomputer center (13)
24. biology (13)
25. cognitive science (13)
26. information theory (13)
27. optical networking (13)
28. computer (13)
29. san diego supercomputer (13)
30. supercomputing (12)
31. communications (12)
32. embedded systems (12)
33. semiconductors (11)
34. networks (11)
35. biochemistry (11)
36. pharmacology (11)
37. systems biology (11)
38. chemistry (11)
39. neural networks (11)
40. computer vision (11)
41. http (11)
42. journal of geophysical research (11)
43. music (10)
44. integrated circuits (10)
45. vlsi (10)
46. information storage (10)
47. artificial intelligence (10)
48. engineering (10)
49. engineering university (10)
50. rescue (10)
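
One way the step from the unfiltered to the filtered list could be done is with a stoplist of institutional and contact-page terms. The sketch below is an assumption about the approach, not the project's actual filter: the counts are copied from the unfiltered slide, while `STOP_TAGS` and `filter_tags` are hypothetical names, and the stoplist holds only terms that appear in the unfiltered list but not in the filtered one.

```python
from collections import Counter

# A few (tag, count) pairs copied from the unfiltered list.
unfiltered = Counter({
    "ucsd": 157,
    "email": 117,
    "university of california san diego": 112,
    "sdsc": 55,
    "contact": 50,
    "california san diego": 47,
    "professor": 44,
    "university of california": 44,
    "wireless": 31,
    "telecommunications": 31,
    "photonics": 27,
    "physics": 26,
})

# Hypothetical stoplist: terms that describe the institution or the page
# rather than the research.
STOP_TAGS = {
    "ucsd", "email", "university of california san diego", "sdsc",
    "contact", "california san diego", "professor",
    "university of california",
}


def filter_tags(counts, stop_tags):
    """Drop stoplisted and empty tags, keeping the rest ordered by count."""
    return [
        (tag, n)
        for tag, n in counts.most_common()
        if tag.strip() and tag not in stop_tags
    ]


for rank, (tag, n) in enumerate(filter_tags(unfiltered, STOP_TAGS), start=1):
    print(f"{rank}. {tag} ({n})")
```

A fixed stoplist is simple but leaky: the filtered list on the slide still contains entries such as "http" and "nsf", which a filter like this would only catch if they were added by hand.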

Page 21: SemTech2010

The Archimedes Project 2006

Page 22: SemTech2010

Importance of Value Propositions

Image: Norman Rockwell for Tom Sawyer and Huck Finn, 1935

Page 23: SemTech2010

What Researchers are Interested In

TreeMap, Federal Funding, May 2010, Data and Visualization Calit2

Page 24: SemTech2010

Really Good Government

Page 25: SemTech2010

Federal Funding Opportunities 2006

Page 26: SemTech2010

Federal Funding Opportunities Production Workflow

Page 27: SemTech2010

Research Intelligence 2007: Faculty and Funding Keywords

606 Grants, 5700 Tags

Page 28: SemTech2010

Research Intelligence 2007 Workflow

Page 29: SemTech2010

Research Intelligence Campus 2009

http://ric.ucsd.edu

Page 30: SemTech2010

Campus RI: Integrated Researcher Metadata

http://ric.ucsd.edu

Page 31: SemTech2010

Research Intelligence The 2009 Semantic Engine

900 Users

5400 Documents

70,000 Tags

Diagram labels: Keywords, Relevancy; Keywords, Semantics, Linked Data; Keywords, Topics; Keywords, Semantics; Keywords, Semantics

Page 32: SemTech2010

Community Research Intelligence: New Application Thrust 2010

Page 33: SemTech2010

Topic III

Semantic Data Evolution

Page 34: SemTech2010

Research Intelligence View of Semantic Data Evolution

Closed NLP Text Mining

Few Open APIs for NLP

Initial Open APIs, Semantic Services

Initial Open Linked Data Repositories

Axes: Complexity (vertical) vs. Time (horizontal: 2005, 2008, 2009, 2010)

Page 35: SemTech2010

Research Intelligence: The Data, Grant Abstract

Computation is accepted as the third pillar supporting innovation and discovery in science and engineering and is central to NSF's future vision of Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)[1]. Software is an integral part of the computation paradigm and a primary modality for realizing the CF21 vision. Scientific discovery and innovation are advancing fundamentally new pathways opened by development of increasingly sophisticated software. Software is also directly responsible for increased scientific productivity and significant enhancement of researchers' capabilities. In order to nurture, accelerate and sustain this critical mode of scientific progress, NSF is establishing a new program, Software Infrastructure for Sustained Innovation (SI2), with the overarching goal of transforming innovations in research and education into sustained software resources that are an integral part of the cyberinfrastructure. SI2 is a long-term investment focused on catalyzing new thinking, paradigms, and practices in using software to understand natural, human, and engineered systems. SI2's intent is to foster a pervasive cyberinfrastructure to help researchers address problems of unprecedented scale, complexity, resolution, and accuracy by integrating computation, data, networking and experiments in novel ways. It is NSF's expectation that SI2 investment will result in robust, reliable, usable and sustainable software infrastructure that is critical to the CF21 vision and will transform science and engineering. It is expected that SI2 will generate and nurture the multidisciplinary processes required to support the entire software lifecycle and will result in the development of sustainable software communities.

SI2 envisions vibrant partnerships among academia, government laboratories and industry for the development and stewardship of a sustainable software infrastructure that can enhance productivity and accelerate innovation in science and engineering. The goal of the SI2 program is to create a software ecosystem that includes all levels of the software stack and scales from individual or small groups of software innovators to large hubs of software excellence. The program addresses all aspects of CI, from embedded sensor systems and instruments, to desktops and high-end data and computing systems, to major instruments and facilities.

The SI2 program envisions three classes of awards:

1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need, encapsulating innovation in science and engineering. The effort targeted by a SSE award is up to a level roughly comparable to: summer support for two investigators with complementary expertise; two graduate students; and their collective research needs (e.g. materials, supplies, travel) for three years.

2. Scientific Software Integration (SSI): SSI awards target larger groups of PIs organized around common research problems as well as common software infrastructure, and will result in a sustainable community software framework. The effort targeted by a SSI award is up to a level roughly comparable to: summer support for three to four investigators with complementary expertise; three to four graduate students; one or two senior personnel (including post-doctoral researchers, software developers, and staff); and their collective research needs (e.g., materials, supplies, travel) for three to five years. The integrative contributions of the SSI team should clearly be greater than the sum of the contributions of each individual member of the team.

3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term community-wide hubs of software excellence. These hubs will provide expertise, processes, resources and implementation mechanism to transform computational science and engineering innovations and community software into robust and sustained tools for enabling science and engineering. S2I2 proposals will bring together multidisciplinary teams of domains scientists and engineers, computer scientists and software engineers, technologists and educators.

The FY 2010 SI2 competition will be limited to SSE and SSI awards. The solicitation in FY 2011, and in subsequent years, will outline funding opportunities for all three classes of awards (SSE, SSI and S2I2), subject to availability of funds.

[1] http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp

NSF Solicitation: Software Infrastructure for Sustained Innovation

Page 36: SemTech2010

Keyword Extraction Across Sources

Columns: Term, Human, Yahoo, KEA, Calais, Alchemy, OAmplify

Common Software Infrastructure

Community Software

Cyberinfrastructure

Embedded Sensor

Engineering

Hubs of Scientific Innovation

Innovation

NSF

Scientific

Scientific Discovery

Scientific Software

Scientific Software Integration

Scientific Software Innovation Institutes

SI2

Software

Software Developers

Software Ecosystem

Software Elements

Software Engineers

Software Infrastructure

Software Innovators

Software Lifecycle

Software Stack

SSI

Sustainable Software

Sustained Tool

Vision

Totals: Human 12, Yahoo 3, KEA 9, Calais 15, Alchemy 20, OAmplify 10

Page 37: SemTech2010

Semantic Structure Returned by Open Calais

Industry Terms
• Community Software
• Software Lifecycle
• Sustainable Software Communities
• Usable and Sustainable Software Infrastructure
• Software Infrastructure
• Software Stack
• Software Developers
• Sustainable Community Software Framework
• Sustained Software Resources
• Software Ecosystem
• Software Excellence
• Embedded Sensor Systems
• Software Elements
• Sustainable Software Infrastructure

Organization
• National Science Foundation

Social Tags
• Cyberinfrastructure
• E-Science
• Computing
• Computer Software
• Innovation
• Software Engineer
• Technology
• Science
• Technology_Internet

URL
• http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp

http://www.opencalais.com/

Page 38: SemTech2010

Semantic Structure Returned by Alchemy API

Tags
• Scientific productivity
• overarching goal
• graduate students
• scientific discovery
• 21st century science
• collective research
• common research problems
• common software infrastructure
• community software
• complementary expertise
• computation paradigm
• cyberinfrastructure framework
• entire software lifecycle
• envisions vibrant partnerships
• innovation computation
• innovations
• long-term community-wide hubs
• nsf's expectation
• nsf's future vision
• pervasive cyberinfrastructure
• pillar supporting innovation
• primary modality
• program envisions
• researchers address problems
• robust software elements
• scientific progress
• scientific software elements
• scientific software innovation
• scientific software integration
• si2's intent
• small groups
• software elements
• software excellence
• software infrastructure
• software innovators
• software resources
• sophisticated software
• sse award
• ssi awards
• ssi team
• summer support
• sustainable community software
• sustainable software communities
• sustainable software infrastructure

Company
• Scientific Software

Field Terminology
• Software
• Software Stack
• Software Developers
• Software Ecosystems
• Software Engineers

Organization
• NSF
• SSI

Category
• Science and Technology

http://www.openamplify.com/
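
The services return overlapping but differently worded terms for the same abstract, and a simple set comparison makes that overlap measurable. The sketch below is illustrative only: it compares the Industry Terms list returned by Open Calais with the Field Terminology list on this slide, both copied from the slides. The function names and the choice of Jaccard similarity are assumptions, not part of the presented system.

```python
def normalize(terms):
    """Lowercase and trim terms so that only wording differences remain."""
    return {t.strip().lower() for t in terms}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = normalize(a), normalize(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Copied from the Open Calais slide (Industry Terms).
calais_industry_terms = [
    "Community Software", "Software Lifecycle",
    "Sustainable Software Communities",
    "Usable and Sustainable Software Infrastructure",
    "Software Infrastructure", "Software Stack", "Software Developers",
    "Sustainable Community Software Framework",
    "Sustained Software Resources", "Software Ecosystem",
    "Software Excellence", "Embedded Sensor Systems", "Software Elements",
    "Sustainable Software Infrastructure",
]

# Copied from this slide (Field Terminology).
field_terminology = [
    "Software", "Software Stack", "Software Developers",
    "Software Ecosystems", "Software Engineers",
]

shared = normalize(calais_industry_terms) & normalize(field_terminology)
print(sorted(shared))
print(round(jaccard(calais_industry_terms, field_terminology), 3))
```

Exact matching treats "Software Ecosystem" and "Software Ecosystems" as different terms, so this measure understates the real agreement; stemming or fuzzy matching would be needed to close that gap.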

Page 39: SemTech2010

Semantic Modeling Challenge Even with XML/DTD

http://vivoweb.org

Page 40: SemTech2010

Grants.Gov Technical Support Doesn’t Like Data Questions

http://vivoweb.org

Page 41: SemTech2010

Open Calais Faculty Linked Data Results

Tag | Type | Linked Data | Relevancy
National Science Foundation | Organization | http://d.opencalais.com/genericHasher-1/f7d1451f-915f-31bc-8194-b9794401ea2d.html | 52%
Software Excellence | Industry Term | http://d.opencalais.com/genericHasher-1/3da6f84d-cff9-3eec-8fce-99ea792e370c.html | 34%
Sustained Software Resources | Industry Term | http://d.opencalais.com/genericHasher-1/61a1eb6d-196d-3493-ad6c-8ea0b85ce421.html | 32%
Usable and Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/9e6fe116-e562-3753-9b93-8f938095a715.html | 31%
Software Lifecycle | Industry Term | http://d.opencalais.com/genericHasher-1/9c7876e1-a85f-307c-8b38-163c129f19f7.html | 30%
Sustainable Software Communities | Industry Term | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 29%
Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/4be05ead-30cd-3c3a-bd88-5dbb8427acc9.html | 27%
Software Stack | Industry Term | http://d.opencalais.com/genericHasher-1/c22ad2e5-bd08-3083-9dc5-14945fb77010.html | 24%
Software Innovators | Industry Term | http://d.opencalais.com/genericHasher-1/eba4d676-5aa8-3b1e-83dc-c4bd91b4d0f4.html | 21%
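
Results like these are typically consumed by ranking on the relevancy score and keeping only tags above a cutoff. The sketch below uses the tag names, types, and relevancy values from the table above; the `CalaisTag` type, the `keep_relevant` helper, and the 0.30 threshold are hypothetical choices for illustration and are not taken from the Open Calais API or from the presented system.

```python
from typing import NamedTuple


class CalaisTag(NamedTuple):
    name: str
    tag_type: str
    relevancy: float  # the slide's percentage, expressed as a fraction


# Rows copied from the table above (linked-data URLs omitted for brevity).
results = [
    CalaisTag("National Science Foundation", "Organization", 0.52),
    CalaisTag("Software Excellence", "Industry Term", 0.34),
    CalaisTag("Sustained Software Resources", "Industry Term", 0.32),
    CalaisTag("Usable and Sustainable Software Infrastructure",
              "Industry Term", 0.31),
    CalaisTag("Software Lifecycle", "Industry Term", 0.30),
    CalaisTag("Sustainable Software Communities", "Industry Term", 0.29),
    CalaisTag("Sustainable Software Infrastructure", "Industry Term", 0.27),
    CalaisTag("Software Stack", "Industry Term", 0.24),
    CalaisTag("Software Innovators", "Industry Term", 0.21),
]


def keep_relevant(tags, threshold=0.30):
    """Keep tags at or above the (assumed) threshold, most relevant first."""
    kept = [t for t in tags if t.relevancy >= threshold]
    return sorted(kept, key=lambda t: t.relevancy, reverse=True)


for tag in keep_relevant(results):
    print(f"{tag.relevancy:.0%}  {tag.tag_type}: {tag.name}")
```

Where to set the cutoff is a judgment call: a higher threshold yields fewer but more dependable tags for a profile, while a lower one keeps more specific phrases.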

Page 42: SemTech2010

Open Calais Linked Data Examples

National Science Foundation

Software Excellence

Page 43: SemTech2010

Zemanta Linked Data Examples from Grant Abstract

Page 44: SemTech2010

Linking to Freebase Via API from Grant Abstract

Page 45: SemTech2010

Are Faculty Yet Data Objects? Depends on Their Popularity

Page 46: SemTech2010

My Boss is 32 Triples

Page 47: SemTech2010

Faculty Web Page

Page 48: SemTech2010

Open Calais Faculty Linked Data Example

Tag | Type | Linked Data | Relevancy
Lo Research Group | Company | http://d.opencalais.com/comphash-1/2cf74602-005c-3d32-a184-4bc49ef2d5f2.html | 50%
California Institute | Facility | http://d.opencalais.com/genericHasher-1/37ab20cd-0681-3775-bf97-7583b4ec1434.html | 46%
[email protected] | EmailAddress | http://d.opencalais.com/genericHasher-1/babf08c8-1f57-3b99-b020-7e0dd8eaf1fc.html | 31%
California Institute for Telecommunications | Organization | http://d.opencalais.com/genericHasher-1/6a1fba6f-cf57-300b-94fc-f36d027c8ff0.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/e8e3ad15-ace3-3616-be5a-ae9038bc0678.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 31%
Information Technology | Technology | http://d.opencalais.com/genericHasher-1/a0f02cf0-dc13-3b0f-a139-5509b026bd96.html | 31%
optoelectronic devices | Industry Term | http://d.opencalais.com/genericHasher-1/7f81f0c9-b94f-3959-b35b-67be2f703ab4.html | 29%
International Business Machines | Company | http://d.opencalais.com/er/company/ralg-tr1r/9e3f6c34-aa6b-3a3b-b221-a07aa7933633.html | 6%

Page 49: SemTech2010

Open Calais Linked Data Examples

Calit2

Page 50: SemTech2010

Open Calais Linked Data Examples

IBM

Page 51: SemTech2010

Zemanta Linked Data Results

Tag | Linked Data | Confidence
Integrated Circuits | wikipedia: Integrated circuit | 0.65
UC Berkeley | geolocation: University of California, Berkeley; homepage: University of California, Berkeley; wikipedia: University of California, Berkeley | 0.64
Information Technology | wikipedia: Information technology | 0.63
Calit2 | geolocation: California Institute for Telecommunications and Information Technology; wikipedia: California Institute for Telecommunications and Information Technology | 0.60
Almaden Research Center | geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center | 0.59
Age related Macular Degeneration | wikipedia: Macular degeneration | 0.59
Minimally Invasive Surgery | wikipedia: Invasiveness of surgical procedures | 0.58
Cancer | http://en.wikipedia.org/wiki/Cancer | 0.57
Cornell | geolocation: Cornell University; homepage: Cornell University; wikipedia: Cornell University; youtube: Cornell University | 0.57
Fluorescence Activated Cell Sorter | wikipedia: Flow cytometry | 0.57

Page 52: SemTech2010

Zemanta Linked Data Examples

Integrated Circuit Calit2

Page 53: SemTech2010

The Linked Data Cloud

Page 54: SemTech2010

Linked Data and a Wikipedia Base

Source: Jeremy Hsu, “Wikipedia: How Accurate is it?”, November 2009, Live Science, http://www.livescience.com/technology/091106-ttr-wikipedia.html#comments

Wikipedia: How Accurate?

Page 55: SemTech2010

Is It A Problem?

SOURCE: USA Today, November 29, 2005

John S., Is a Possible Assassin of, John K

Page 56: SemTech2010

Maybe Not?

SOURCE: PHARMANEWS.EU, January 23, 2009

How Important is Validity to Researchers?

Page 57: SemTech2010

Topic IV

Future Directions?

Page 58: SemTech2010

Life Sciences Example

http://www.collexis.com/

Page 59: SemTech2010

SciVal From Elsevier

http://www.scival.com/

Page 60: SemTech2010

SciVal Terms And Conditions

http://www.scival.com/terms-and-conditions

Page 61: SemTech2010

The Value of Your Data

http://www.turbulence.org/Works/swipe/calculator.html

Page 62: SemTech2010

Beginning to Have Data Portability Policies...for Sites

http://portabilitypolicy.org:80/sample-policies.html

Page 63: SemTech2010

Future Direction for Semantic Academic Communities?

http://vivoweb.org

Page 64: SemTech2010

VIVO Ontology

http://vivoweb.org

Page 65: SemTech2010

Emerging/Growing Semantic Catalog

http://www.data.gov/semantic/catalog

Page 66: SemTech2010

Example: DOE Awards Semantic Catalog

http://www.data.gov/semantic/catalog

Page 67: SemTech2010

Topic V

Conclusions

Page 68: SemTech2010

Words of Wisdom

“One of the Most Important Things I Learned is What Not to Pay Attention To”

John Wooley

Page 69: SemTech2010

Is the Semantic Web and Linked Data This?

Image Courtesy of Alan Vernon, Creative Commons License, Flickr (alanvernon)

Page 70: SemTech2010

Or Is the Semantic Web and Linked Data This?

Image Courtesy of Vince Huang, Creative Commons License, Flickr (vincehuang)