The Best Way to Get BIG DATA is By Starting Small

1


Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
for Johns Hopkins University School of Medicine and Modus Operandi

December 12, 2013

http://semanticommunity.info/
http://semanticommunity.info/A_NITRD_Dashboard/Making_the_Most_of_Big_Data#Story
http://semanticommunity.info/Modus_Operandi

2

BIG DATA

• The new Digital Government Strategy is "treating all content as data." So big data = all your content:
  – But just a small sample to start a pilot.
• There are many Big Data Technologies to choose from and many early adopters are finding them more expensive than expected:
  – Use open source tools and free trials to pilot.
• There are many Big Data Problems to solve that could “boil the ocean”:
  – Use a data scientist to help build a team and community for a fast, inexpensive, and small semantic data science pilot.

3

Subcommittee on Networking and Information Technology Research and Development (NITRD Subcommittee)

http://www.nitrd.gov/
http://www.nitrd.gov/nitrdgroups/index.php?title=Subcommittee_on_Networking_and_Information_Technology_Research_and_Development_(NITRD_Subcommittee)

These three activities fostered Semantic Medline on the YarcData Graph Appliance for the White House Big Data Initiative.

4

Data Science Team Example: Chief Data Science Officer

• Chief Data Science Officer:
  – Dr. George Strawn, Director, White House OSTP NITRD/NCO: Semantic Medline could be the “killer” Semantic Web application for the US Federal Government
• Data Science Team:
  – Dr. Brand Niemann, Lead
  – Dr. Tom Rindflesch, NLM Semantic Medline Creator
  – Professor Kirk Borne, George Mason University
• Federal Big Data Senior Steering WG Workforce Training Initiative
  – Tim White, Director, YarcData Federal Global Head
  – Aaron Bossett, YarcData Federal Solution Architect
  – Dr. Eric Little, Modus Operandi Chief Scientist

5

Generic Problems

• How to get Big Data:
  – Unstructured (Natural Language Processing to Graph-RDF Triples) and Structured (Relational-RDF Triples)
• Where to store Big Data:
  – Graph-RDF Triples and Relational
• What to show about Big Data:
  – Statistics, Visualizations, and Network Graphs
• Note: RDF Triples make Big Data smaller, smarter, and integrated!
  – Semantic Medline on the YarcData Graph Appliance is an example of the best content on the best graph data store with the best visualization results so far (in my humble opinion)!
    • Our Semantic Data Science Team delivered this for the recent White House Big Data Event: See Making the Most of Big Data

http://semanticommunity.info/A_NITRD_Dashboard/Making_the_Most_of_Big_Data#Story
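
To make the "Relational-RDF Triples" idea above concrete, here is a minimal sketch in plain Python. It assumes a triple is just a (subject, predicate, object) tuple. The table name, column names, the sample row, and the NLP output are all hypothetical and are not taken from Semantic Medline. It shows how one relational row becomes several triples and how triples from a structured and an unstructured source can be merged into a single graph.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def row_to_triples(table: str, key: str, row: Dict[str, str]) -> List[Triple]:
    """Turn one relational row into triples: one triple per non-key column."""
    subject = f"{table}/{row[key]}"
    return [(subject, column, value) for column, value in row.items() if column != key]


def merge(*triple_sets: List[Triple]) -> List[Triple]:
    """Integrate triples from several sources, dropping exact duplicates."""
    seen = set()
    merged: List[Triple] = []
    for triples in triple_sets:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


# Hypothetical structured record (one row of a relational table).
row = {"id": "42", "title": "Example article", "year": "2013"}
structured = row_to_triples("article", "id", row)

# Hypothetical output of an NLP step over unstructured text.
unstructured = [
    ("article/42", "mentions", "glutamate"),
    ("article/42", "year", "2013"),  # duplicate of a structured fact
]

graph = merge(structured, unstructured)
for triple in graph:
    print(triple)
```

Because both sources share the subject identifier `article/42`, the merge needs no schema change. That is one reading of the "integrated" claim on this slide.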

6

Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Work Flow

7


Semantic Medline Database Application

http://skr3.nlm.nih.gov/SemMedDB/index.jsp

See More Information: http://skr3.nlm.nih.gov/SemMedDB/MoreInfo.do


8


Visualization and Linking to Original Text

9


Bioinformatics Publication

http://bioinformatics.oxfordjournals.org/content/28/23/3158.short

My Note: MySQL database for non-commercial use.

10


Semantic Medline at NIH-NLM

• Current: Web based research tool.
• Transition: Current systems re-engineered to leverage Urika (less than 5 days).
• Purpose: Build a platform for users to perform increasingly complex analysis.
• Immediate Requirement: Replicate current capability.
• Future: Allow for increasingly complex analysis. Ability to capture and share analytics in addition to sharing data. Tailor Urika to less complex queries.

11


Graphs and Traditional Technologies

• Square peg, round hole:
  – Current technology does not support efficient representation, storage, and interaction with complex graph structures
  – Traditional relational models only add to an already complex structure
  – Traditional hardware approaches do not support efficient access to highly interconnected graphs
• You don’t know what you don’t know:
  – Efficient relational schemas require prior knowledge of the relationships between database fields
  – Updating and modifying schemas frequently introduces delays and errors
• Problems in partitioning the problem:
  – Distributed computing solutions are good… if your problem can be easily partitioned
  – Graphs are not predictable; accessing graph nodes across large clusters can be unwieldy at best and does not work at scale

(Diagram: “?” above CPU, CPU, CPU, …)
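
The partitioning problem named above can be illustrated with a small sketch. It assumes a naive scheme that assigns each node to a machine by `node % n_partitions` and measures the fraction of edges whose endpoints land on different machines. The ring graph and the function names are illustrative choices only and do not come from the YarcData material.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]


def partition_of(node: int, n_partitions: int) -> int:
    """Naive placement: assign a node to a partition by modulo (an assumption)."""
    return node % n_partitions


def cut_fraction(edges: Iterable[Edge], n_partitions: int) -> float:
    """Fraction of edges whose two endpoints live on different partitions."""
    edge_list = list(edges)
    if not edge_list:
        return 0.0
    crossing = sum(
        1
        for a, b in edge_list
        if partition_of(a, n_partitions) != partition_of(b, n_partitions)
    )
    return crossing / len(edge_list)


# A 12-node ring: node i is connected to node i + 1, and node 11 back to node 0.
ring = [(i, (i + 1) % 12) for i in range(12)]

single_machine = cut_fraction(ring, 1)  # everything on one machine, nothing crosses
four_machines = cut_fraction(ring, 4)   # neighbors always land on different machines

print(single_machine, four_machines)
```

Every traversal step across a cut edge becomes a network hop. A ring is regular enough that a smarter placement would cut far fewer edges, but for graphs without predictable structure such a placement is much harder to find, which is the point the slide makes.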

12

Real-time, Interactive Analytics on Large Graph Problems

Business Challenge:
(Diagram: “?” above CPU, CPU, CPU, …)

The YarcData Approach
• Large Shared Memory Architecture: Up to 512 TB
• XMT2 Massively Multi-Threaded Processors: 128 Threads
• Scalable IO: Up to 350 TB per Hour

13


New Use Cases

• Schizophrenia
  – Current therapies target dopamine receptors
    • Not entirely effective
    • Side effects
  – Basic research is exploring glutamate and its NMDA receptor
  – Goal: can we use Semantic MEDLINE to discover that research trend in the scientific literature
• Cancer
  – With some exceptions, therapy is not effective
    • Has not progressed significantly in 60 years
  – Scientific basis
    • Traditionally – cancer cells
    • More recently – non-cancer cells (immune system)
  – Immune system and cancer
    • Connection noted in 1863 (Virchow)
    • But not exploited until recently
  – Goal: look for trends in cancer immunotherapy

Note: See Two YouTube Video Demos: Schizo (7 minutes) and Cancer (21 minutes)

http://www.youtube.com/watch?v=ShfI4SNzNO4
http://www.youtube.com/watch?v=6frNAmPD0mo

Discovery Browsing Method for Exploiting Semantic MEDLINE

• Cooperative reciprocity
  – Between system and human
• Issue query
• Inspect graph for “interesting” concept
• Use selected concept to seed another query
• Iterate until satisfied
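
The discovery browsing loop (issue query, inspect, pick a concept, re-seed, iterate) can be sketched as follows. This is a minimal illustration and not the Semantic MEDLINE implementation. The three triples are toy data loosely based on the schizophrenia use case, the predicate names are placeholders, and the `choose` callback stands in for the human who decides which concept looks interesting.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy triples for illustration only; not actual Semantic MEDLINE content.
TRIPLES: List[Triple] = [
    ("schizophrenia", "ASSOCIATED_WITH", "dopamine receptor"),
    ("schizophrenia", "ASSOCIATED_WITH", "glutamate"),
    ("glutamate", "INTERACTS_WITH", "NMDA receptor"),
]


def query(concept: str) -> List[Triple]:
    """Issue a query: return every triple that mentions the concept."""
    return [t for t in TRIPLES if concept in (t[0], t[2])]


def neighbors(concept: str, results: List[Triple]) -> List[str]:
    """Inspect the result graph: list the concepts linked to the seed."""
    found: List[str] = []
    for subject, _, obj in results:
        other = obj if subject == concept else subject
        if other not in found:
            found.append(other)
    return found


def discovery_browse(
    seed: str,
    choose: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 10,
) -> List[str]:
    """Iterate query -> inspect -> choose until the chooser is satisfied."""
    path = [seed]
    current = seed
    for _ in range(max_rounds):
        candidates = [c for c in neighbors(current, query(current)) if c not in path]
        picked = choose(current, candidates)
        if picked is None:  # the human is satisfied or nothing new is left
            break
        path.append(picked)
        current = picked
    return path


def prefer_glutamate_pathway(current: str, candidates: List[str]) -> Optional[str]:
    """Stand-in for the human judgment of what counts as 'interesting'."""
    for wanted in ("glutamate", "NMDA receptor"):
        if wanted in candidates:
            return wanted
    return None


path = discovery_browse("schizophrenia", prefer_glutamate_pathway)
print(" -> ".join(path))
```

Keeping the chooser as a callback mirrors the "cooperative reciprocity" between system and human: the system supplies candidates and the person steers.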

14

Modus Operandi: Mantra, Performance, and Vision

• Mantra:
  – Speeding the Discovery, Integration, and Fusion of Information
• Performance:
  – SBIR Phase Three Successes: Wave Exploitation Framework (EF)
  – Wave EF: Government-off-the-shelf (GOTS) technology for intelligence applications that tackles the difficult problem of processing unstructured and semi-structured data
  – C4ISR Government Customers: U.S. Air Force, U.S. Army, U.S. Marine Corps, U.S. Navy, DARPA, DTRA, Missile Defense Agency, and Intelligence Agencies
• Vision:
  – Wave All-Source Semantic Fusion Engine: In development to support individual medical researchers/intelligence analysts to work with big data
  – Semedy (former Ontoprise founders): Reasoner and Triple Store

15

Modus Operandi: Finding the Right Needle in the Right Haystack

• Dyson said: “So a lot of what we’re doing is enabling that by making the data sources accessible and searchable.”

• “Our specialization is what we call ‘semantic technology,’ which is just a way of making the data smarter. We enrich the data with various tags to make it easier to find.”

• The software also provides what McNeight called data “provenance” which has to do with the traceability back to the source of the data - the really important aspect for intelligence personnel.

• “We don’t make decisions,” McNeight explained. “We just help (the analyst) to make decisions and to find the right data. He may only be interested in a certain person in a certain location at a certain time. We can bring that back to him across multiple databases.”

• Source: http://www.spacecoastbusiness.com/modus-operandi-delivers-information-based-intelligence/


16

Data Science Team Example: President of Modus Operandi

• President of Modus Operandi:
  – Richard McNeight, President, Master's Degree in Artificial Intelligence & Computer Science, Board of Regents, Florida Institute of Technology University, Recognized for Entrepreneurial Leadership, and Recipient of Florida County Economic Development Grant for Big Medical Data
• Data Science Team:
  – Lee Watkins, Director of Bioinformatics & IT JHMI, and Dr. Brand Niemann, Semantic Community, Co-Leads
  – Dr. Eric Little, Modus Operandi Chief Scientist, Ontology and Wave All-Source Semantic Fusion Engine Development
  – Bryan Thompson and Michael Personick, SYSTAP Principals, Bigdata® Platform
  – Tim Barr, YarcData Medical Informatics, and Aaron Bossett, YarcData Federal Solution Architect
  – Others to be added as needed
• Advisors:
  – Dr. Tom Rindflesch, NIH/NLM Semantic Medline Creator
  – Dr. Richard Ford and Dr. Marco Carvalho, Florida Institute of Technology

17

Wave and the vMDC (virtual metadata catalog – which is a query translator for non-semantic queries)

An engine that can ingest any kind of data, transform that data into RDF graphs, then do a lot of semantic coolness with those graphs.

Diagram components:
• Streaming Data
• Batch Data
• Structured, Semi-structured, Unstructured Data
• Wave Ingest
• Generated Semantic Graph (RDF)
• Trust/Provenance Algorithms
• Semantic Reasoner
• High Performance Triple Store (Rya)
• Accumulo DB
• vMDC
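
As a rough sketch of the "ingest any kind of data, transform it into graphs" idea, the following plain-Python example turns records from two sources into statements that each carry their source, so that any fact can be traced back to where it came from. It is an illustration under stated assumptions and not the Wave engine: the `Statement` type, the `ingest` and `trace` functions, the field names, and the source labels are all hypothetical, and no Rya or Accumulo API is used.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Statement(NamedTuple):
    """A triple plus provenance: which source asserted it."""
    subject: str
    predicate: str
    obj: str
    source: str


def ingest(
    records: Iterable[Dict[str, Optional[str]]],
    source: str,
    id_field: str = "id",
) -> List[Statement]:
    """Turn records (rows or key/value documents) into statements, skipping empty fields."""
    statements: List[Statement] = []
    for record in records:
        subject = str(record[id_field])
        for field, value in record.items():
            if field == id_field or value is None:
                continue
            statements.append(Statement(subject, field, str(value), source))
    return statements


def trace(statements: Iterable[Statement], subject: str, predicate: str) -> List[str]:
    """Provenance lookup: which sources say anything about subject/predicate?"""
    return sorted(
        {s.source for s in statements if s.subject == subject and s.predicate == predicate}
    )


# Hypothetical batch and streaming inputs describing the same entity.
batch = ingest([{"id": "p1", "name": "Alice", "city": None}], source="batch.csv")
stream = ingest([{"id": "p1", "name": "Alice", "seen_at": "2013-12-12"}], source="stream-feed")

store = batch + stream
print(trace(store, "p1", "name"))
```

Carrying the source on every statement is one simple way to support the traceability that the provenance component in the diagram refers to.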

18

How Wave Drives the BLADE Semantic Wiki and Other Kinds of Analytic Visualizations

(Diagram label: BLADE 2.0 WikiApps and Visualizations)

The wiki is just a way to view the entities in the model and make changes and see related content without having to type any SPARQL code or really know anything about the backend model structure – just point and click at the content you want to see.
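
For readers unfamiliar with what is being hidden, here is a hedged sketch of the kind of SPARQL a point-and-click page might issue on the user's behalf when an entity is clicked. The function name, the example URI, and the idea that the wiki works this way internally are assumptions for illustration; the source does not describe how BLADE generates its queries.

```python
def describe_entity_query(entity_uri: str, limit: int = 50) -> str:
    """Build a SPARQL query listing every property and value of one entity."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in entity_uri for ch in '<>"{}|^`\\ '):
        raise ValueError(f"not a safe IRI: {entity_uri!r}")
    return (
        "SELECT ?property ?value\n"
        "WHERE {\n"
        f"  <{entity_uri}> ?property ?value .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical entity URI; a real deployment would use its own namespace.
query_text = describe_entity_query("http://example.org/entity/receptor-1")
print(query_text)
```

The user only clicks on "receptor-1"; the query text stays behind the scenes.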

19

Possible Scenario

• For medicine – the Blade 2.0 Semantic Wiki would allow different researchers to view the data collectively from within their areas of expertise, but connect them to other areas effortlessly.
• This means – scientist 1 could be looking up information on a given receptor on a cell, while scientist 2 is looking at proteomic information (perhaps not even knowing it is the underlying substance of that cell/receptor).
• Scientist 3 could add some new information about a given compound that shows reactions at the receptor site scientist 1 is studying.
• Upon entering that information, scientist 1 would see a new linked piece of data about their receptor related to the compound – and the cool part is scientist 2 would also see information about the connection between their protein structure and that compound.
• Scientist 3 would see the information about the protein related to their compound as well (since they were only looking at the receptor-compound connection).
• All 3 would basically have new linked information available to pursue if they wanted.
• Now imagine being able to do those kinds of joins in near-real-time with a simple tool across the entire corpus of the Semantic Medline data set. Kaboom!
• Source: Dr. Eric Little, Chief Scientist and Ontologist
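
The three-scientist scenario above can be sketched with a tiny in-memory graph. This is only an illustration: the `LinkedData` class, the entity names (`receptor-R`, `protein-P`, `compound-C`), and the predicate names are hypothetical, and a breadth-first walk stands in for whatever join mechanism the real system uses.

```python
from collections import defaultdict, deque
from typing import Dict, Set, Tuple


class LinkedData:
    """A tiny undirected view over triples, enough to follow links between entities."""

    def __init__(self) -> None:
        self._adjacent: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Store the link in both directions so it can be followed from either end.
        self._adjacent[subject].add((predicate, obj))
        self._adjacent[obj].add((predicate, subject))

    def connected(self, start: str) -> Set[str]:
        """Everything reachable from `start` by following links (breadth-first)."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _, other in self._adjacent[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        seen.discard(start)
        return seen


kb = LinkedData()

# Existing knowledge: scientist 2's protein underlies scientist 1's receptor.
kb.add("protein-P", "underlies", "receptor-R")
before = kb.connected("receptor-R")

# Scientist 3 enters one new fact about a compound.
kb.add("compound-C", "reacts_at", "receptor-R")

view_scientist_1 = kb.connected("receptor-R")  # now includes the compound
view_scientist_2 = kb.connected("protein-P")   # also sees the compound, via the receptor
view_scientist_3 = kb.connected("compound-C")  # also sees the protein, via the receptor

print(sorted(view_scientist_1), sorted(view_scientist_2), sorted(view_scientist_3))
```

One added statement changes what all three researchers can reach, which is the effect the scenario describes.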

20

Knowledge Base: Modus Operandi Web Intelligence in MindTouch


Practical Example of How to Get BIG DATA By Starting Small with Structured & Unstructured Data as Relational & RDF Triples Stored in Excel and Visualized in Spotfire.
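
One way to start small along these lines is to keep triples in a three-column sheet that a spreadsheet or visualization tool can open directly. The sketch below writes triples as CSV text and computes a simple per-predicate count. The column names and sample triples are assumptions; the actual workbook layout used in the practical example is not described in this transcript.

```python
import csv
import io
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def triples_to_csv(triples: Iterable[Triple]) -> str:
    """Serialize triples as CSV text with subject, predicate, object columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(triples)
    return buffer.getvalue()


def predicate_counts(csv_text: str) -> Dict[str, int]:
    """A first statistic to chart: how many triples use each predicate."""
    counts: Dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["predicate"]] = counts.get(row["predicate"], 0) + 1
    return counts


# Hypothetical sample triples.
sample = [
    ("article/1", "mentions", "glutamate"),
    ("article/2", "mentions", "NMDA receptor"),
    ("article/1", "year", "2013"),
]

csv_text = triples_to_csv(sample)
counts = predicate_counts(csv_text)
print(csv_text)
print(counts)
```

Saving `csv_text` to a `.csv` file gives a sheet that can be loaded into a spreadsheet or a visualization tool without any further conversion.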


21

Big Data in Memory: Innovation Story

• Met Jef Sharp, President, Panève:
  – Amazingly fast access and massive storage – Big Data Supercomputer on My Mobile Device
  – Johns Hopkins University – Blackbook (CIA Cloud)
• I suggested:
  – Greylock Partners - #2 Data Scientist in the World (DJ Patil, Entrepreneur-in-Residence who built the first formal data science team at LinkedIn)
    • Works for In-Q-Tel (Robert Ames, Senior VP for Technology, In-Q-Tel)
    • Works for CIA (Gus Hunt, CTO, CIA)
  – Who Wants Big Data Supercomputer on Mobile Devices

22

Future: Possibility
Panève’s ZettaLeaf & ZettaTree Products

• Scalable single level storage
  – Panève’s scalable single level storage model collapses the server, network, and storage by removing software and replacing it with memory system primitives. This eliminates all network and network-processing overhead associated with accessing storage and delivers a 10,000X increase in raw performance.

http://semanticommunity.info/@api/deki/files/19353/exec_summary_20120916.pdf
http://www.paneve.com/technology/
