Vital AI: Big Data Modeling

Big Data Modeling Today: Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected] 917.463.4776

description

Video: https://www.youtube.com/watch?v=Rt2oHibJT4k

Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem, but the "Variety" problem is largely unaddressed: a great deal of manual "data wrangling" is still required to manage data models, and these manual processes do not scale. Not only is the variety of data increasing, the rate of change in data definitions is increasing as well. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it. This talk presents tools and a methodology to manage Big Data Models in a rapidly changing world.

This talk covers:

Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaboration/Coordination processes using Big Data Models and Git
Managing change: Big Data Models with rapidly changing Data Resources

Transcript of Vital AI: Big Data Modeling

Page 1: Vital AI: Big Data Modeling

Big Data Modeling

Today:

Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected] 917.463.4776

Page 2: Vital AI: Big Data Modeling

intro

Marc C. Hadfield, Founder, Vital AI http://vital.ai [email protected]

Page 3: Vital AI: Big Data Modeling

Big Data Modeling is Data Modeling with the “Variety” Big Data dimension in mind…

Page 4: Vital AI: Big Data Modeling

Big Data “Variety” Dimension

The “Variety” problem can be addressed by a combination of improved tools and a methodology involving both system architecture and data science/analysis.

Compared to Volume and Velocity, Variety is a very labor-intensive, human-centric process.

Variety is the many types of data to be utilized together in a data-driven application.

Potentially too many types for any single person to keep track of (especially in Life Sciences).

Page 5: Vital AI: Big Data Modeling

Key Takeaways:

Using OWL as a “meta-schema” can drastically reduce operations/development effort and increase the value of the data for analysis.

OWL can augment, not replace, familiar development processes and tools.

A huge amount of ongoing development effort is spent transforming data across components and keeping data consistent during analysis.

Collecting Good Data = Good Analytics

Page 6: Vital AI: Big Data Modeling

Big Data Modeling:

Challenges

Goals

OWL as Modeling Language

Using OWL-based Models…

Collaboration/Modeling Tools

Page 7: Vital AI: Big Data Modeling

Examples from NYC Department of Education:

Domain Ontology

Application Architecture

Development Methodology/Tools

Page 8: Vital AI: Big Data Modeling

NYC Department of Education:

Pages 9–13: Vital AI: Big Data Modeling (image slides)
Page 14: Vital AI: Big Data Modeling

Data Models in:

Data Architecture

Data Science

Page 15: Vital AI: Big Data Modeling

Challenges

Page 16: Vital AI: Big Data Modeling

Mobile/Web App Architecture

(Diagram) Mobile App —> Server Implementation —> Database, with a Data Model at each boundary.

Page 17: Vital AI: Big Data Modeling

Enterprise Data Warehouse Architecture (diagram): multiple source Databases feed a Master Database ("Data Lake") through an ETL Process; Business Intelligence / Data Analytics and a Dashboard sit on top.

Schema “on read” or “on write”; a separate Data Model accompanies each database and processing step.

Page 18: Vital AI: Big Data Modeling

Lambda Architecture + Hadoop: Data Driven App (diagram): Mobile App, Server Layer, Real Time Data, Calculated Views, Hadoop, Predictive Analytics, Master Database ("Data Lake"), Business Intelligence / Data Analytics, Dashboard; once again, a separate Data Model sits at each component boundary.

Page 19: Vital AI: Big Data Modeling

Data Wrangling / Data Science (diagram): Raw Data and the Master Database ("Data Lake") feed R and Business Intelligence / Data Analytics, each with its own Data Model.

Prediction Models must integrate back with the production environment:

Page 20: Vital AI: Big Data Modeling

Same Data, Different Contexts…

Redundant Models.

Page 21: Vital AI: Big Data Modeling

Data Architecture Issues

Redundant Data Definitions:

Database Schema
JSON Data
Data Object Classes
Avro/Parquet

Considerable Development / Maintenance / Operational Overhead
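To make the redundancy concrete, here is a minimal Groovy sketch (purely illustrative names, not the actual project schemas) of the same entity defined three times over; a change to any one field must be repeated in each place by hand:

// Hypothetical example: the same "NYCSchool" entity defined redundantly.

// 1. Database schema (SQL DDL, embedded here as a string)
def ddl = '''
CREATE TABLE nyc_school (
    uri  VARCHAR(255) PRIMARY KEY,
    name VARCHAR(255)
);
'''

// 2. Avro schema (JSON)
def avroSchema = '''
{ "type": "record", "name": "NYCSchool",
  "fields": [ { "name": "uri",  "type": "string" },
              { "name": "name", "type": "string" } ] }
'''

// 3. Data object class
class NYCSchool {
    String uri
    String name
}

// Renaming "name" to "schoolName" now requires three coordinated edits
// (plus any JSON payloads in flight) -- exactly the overhead described above.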

Page 22: Vital AI: Big Data Modeling

Data Science / Data Wrangling Issues

Data Harmonization: Merging Datasets from Multiple Sources

Loss of Context: Feature f123 = Column135 X Column45 / Column13

Side note: Let’s stop using CSV files for datasets! No more flat datasets!

Page 23: Vital AI: Big Data Modeling

Goals

Page 24: Vital AI: Big Data Modeling

Goals:

Reduce redundancy in Data Definitions
Enforce Clean/Harmonized Data
Use Contextual Datasets
Use Best Software Components (Databases, Analytics, …)
Use Familiar Tools (IDE, git, Languages, R)

Page 25: Vital AI: Big Data Modeling

OWL as Modeling Language

Page 26: Vital AI: Big Data Modeling

Web Ontology Language (OWL)

Specifies an Ontology (“Data Model”)

Formal Semantics, W3C Standard

Provides a language to describe the meaning of data properties and how they relate to classes.

Example: Mammal. Necessary conditions: warm-blooded vertebrate animal, has hair or fur, secretes milk, (typically) gives live birth

Greater descriptive power than Schema (SQL Tables) and Serialization Frameworks (Avro)

Page 27: Vital AI: Big Data Modeling

Why OWL?

If we can more formally specify what the data *means*, then a single data model (ontology) can apply to our entire architecture, and data can be transformed automatically, locally, to meet the needs of a specific software module.

Manually coded data transforms may be “lossy” and/or introduce errors, so eliminating them helps keep data clean.

Page 28: Vital AI: Big Data Modeling

Why OWL? (continued)

Example: if we specify what a “Document” is, then a text-mining analyzer will know how to access the textual data without further prompting.

Example: if we specify Features for use in Machine Learning in the ontology, then features can be generated automatically to train Machine Learning Models, and the same features would be generated when we use the model in production.
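A minimal Groovy sketch of that idea, with hypothetical stand-ins (the Feature class and extractFeatures below are illustrative, not the VitalSigns API): because the feature definitions live in one place, training and production share a single extraction path.

// Hypothetical sketch: feature definitions read from the ontology,
// so training and production share one extraction path.

class Feature {                      // illustrative stand-in
    String name
    Closure<Double> compute          // how to derive the value from an object
}

// Pretend these definitions were generated from the ontology:
def features = [
    new Feature(name: 'enrollment', compute: { it.enrollment as double }),
    new Feature(name: 'gradRate',   compute: { it.graduates / it.enrollment })
]

// One extractor, used both to build training vectors and to score live data.
def extractFeatures = { obj -> features.collect { f -> f.compute.call(obj) } }

def school = [enrollment: 400, graduates: 360]
println extractFeatures(school)   // prints [400.0, 0.9]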

Page 29: Vital AI: Big Data Modeling

Why OWL? (continued)

Note: since ontologies can extend other ontologies, a collection of linked ontologies can be used rather than a single monolithic ontology, allowing the model to be segmented across an organization.

Page 30: Vital AI: Big Data Modeling

Vital Core Ontology

Protégé Editor…

Nodes, Edges, HyperNodes, HyperEdges get URIs

John/WorksFor/IBM —> Node / Edge / Node
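As a rough Groovy sketch of that graph shape (the Node and Edge classes here are simplified stand-ins for the generated bindings, not the actual Vital classes):

// Simplified stand-ins for generated graph classes -- every object gets a URI.
class Node { String URI; Map props = [:] }
class Edge { String URI; String sourceURI; String destinationURI }

def john = new Node(URI: 'urn:person123', props: [name: 'John'])
def ibm  = new Node(URI: 'urn:company456', props: [name: 'IBM'])

// "John works for IBM" becomes Node / Edge / Node:
def worksFor = new Edge(URI: 'urn:worksFor123',
                        sourceURI: john.URI,
                        destinationURI: ibm.URI)

println "${worksFor.sourceURI} --worksFor--> ${worksFor.destinationURI}"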

Page 31: Vital AI: Big Data Modeling

Vital Core Ontology

Vital Domain Ontology

Application Domain Ontology

Extending the Ontology

Page 32: Vital AI: Big Data Modeling

NYC Dept of Education Domain Ontology

Page 33: Vital AI: Big Data Modeling

Generating Data Bindings with VitalSigns:

Ontology —> VitalSigns —> Code/Schema Generation:

Groovy Bindings
Semantic Bindings
Hadoop Bindings
Prolog Bindings
Graph Bindings
HBase Bindings
JavaScript Bindings

vitalsigns generate -ont name…

Page 34: Vital AI: Big Data Modeling

Data Representations:

Groovy:
person123.name = "John"
person123.worksFor.company456

RDF:
<person123> <hasName> "John"
<worksFor123> <hasSource> <person123>
<worksFor123> <hasDestination> <company456>
<worksFor123> <hasType> <worksFor>

HBase:
person123, Node:type=Person, Node:hasName="John"
worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456

Page 35: Vital AI: Big Data Modeling

VitalSigns

Generation —> JAR Library

Runtime —> Domain Ontologies loaded dynamically by the VitalSigns class

Page 36: Vital AI: Big Data Modeling

Using OWL-based Models

Page 37: Vital AI: Big Data Modeling

Developing with the Ontology in UI, Hadoop, NLP, Scripts, ...

Node:Person —Edge:hasFriend—> Node:Person

Set<Friend> person123.getFriends()

Eclipse IDE

Page 38: Vital AI: Big Data Modeling

// Reference to an NYCSchool object
NYCSchool school123 = … // get from database

// Get a list of programs, local context (cache)
List<NYCSchoolProgram> programs = school123.getPrograms()

// Get a list of programs, global context (database)
List<NYCSchoolProgram> programs = school123.getPrograms(Context.ServiceWide)

JVM Development

Page 39: Vital AI: Big Data Modeling

Using JSON-Schema Data in JavaScript

for(var i = 0; i < progressReports.length; i++) {
    var r = progressReports[i];
    var sub = $('<ul>');
    sub.append('<li>Overall Grade: ' + r.progReportOverallGrade + '</li>');
    sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>');
    sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>');
    sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade + '</li>');
    sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade + '</li>');
    sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints + '</li>');
    sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>');
    sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>');
}

Page 40: Vital AI: Big Data Modeling

NoSQL Queries

Query API / CRUD Operations

Queries generated into “native” NoSQL query format:
SPARQL / Triplestore (Allegrograph)
HBase / DynamoDB
MongoDB
Hive/HiveQL (on Spark/Hadoop 2.0)

Query Types: “Select” and “Graph”

Abstract type of datastore from application/analytics code

Pass in a “native” query when necessary
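A hedged Groovy sketch of what such an abstraction might look like; the SelectQuery class and its toNative method are hypothetical, shown only to illustrate rendering one logical query into different “native” formats:

import groovy.json.JsonOutput

// Hypothetical query abstraction: one logical query, many native renderings.
class SelectQuery {
    String type                         // ontology class, e.g. 'NYCSchool'
    Map<String, Object> constraints = [:]

    String toNative(String backend) {
        switch (backend) {
            case 'sparql':
                def filters = constraints.collect { k, v -> "?s <$k> \"$v\" ." }
                return "SELECT ?s WHERE { ?s a <$type> . ${filters.join(' ')} }"
            case 'mongodb':
                return JsonOutput.toJson([type: type] + constraints)
            default:
                throw new IllegalArgumentException("unsupported backend: $backend")
        }
    }
}

def q = new SelectQuery(type: 'NYCSchool', constraints: [borough: 'Brooklyn'])
println q.toNative('sparql')    // rendered for a triplestore
println q.toNative('mongodb')   // rendered for MongoDB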

Page 41: Vital AI: Big Data Modeling

Data Serialization, Analytics Jobs

Data Serialized into file format by blocks of objects

Leverage Hadoop Serialization Standards: Sequence File, Avro, Parquet

Get data in and out of HDFS files

Spark/Hadoop jobs are passed a set of objects as input; the URI of each object is the key

Data Objects are serialized into Compressed Strings for transport over Flume, etc.
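For example, writing a block of objects into a Hadoop Sequence File keyed by URI might look roughly like this Groovy sketch (standard Hadoop SequenceFile API; the object list and serializer are stand-ins for the real block serialization):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.Text
import groovy.json.JsonOutput

// Stand-ins for the sketch: a trivial object list and serializer.
def objects   = [[URI: 'urn:school123', name: 'PS 123']]
def serialize = { obj -> JsonOutput.toJson(obj) }

// Persist objects keyed by URI so Spark/Hadoop jobs can consume them.
def conf = new Configuration()
def writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path('schools.seq')),
        SequenceFile.Writer.keyClass(Text),
        SequenceFile.Writer.valueClass(Text))

objects.each { obj ->
    writer.append(new Text(obj.URI), new Text(serialize(obj)))
}
writer.close()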

Page 42: Vital AI: Big Data Modeling

Machine Learning

Via Hadoop, Spark, R

Mahout, MLlib

Build Predictive Models

Classification, Clustering...

Use Features defined in Ontology

Learn Target defined in Ontology

Models consume Ontology Data as input

Page 43: Vital AI: Big Data Modeling

Natural Language Processing/Text Mining

Topic Categorization…

Extract Entities…

Text Features from Ontology

Classes extending Document…

Page 44: Vital AI: Big Data Modeling

Graph Analytics

GraphX, Giraph: PageRank, Centrality, Interest Graph, …

Page 45: Vital AI: Big Data Modeling

Inference / Rules

Use Semantic Web Rule Engines / Reasoners

Load Ontology + RDF Representation of Data Instances (Individuals)
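A minimal Groovy sketch using Apache Jena as one concrete Semantic Web reasoner (file names are placeholders):

import org.apache.jena.ontology.OntModelSpec
import org.apache.jena.rdf.model.ModelFactory

// Load the ontology plus instance data into a model backed by an
// OWL rule reasoner, then read back inferred statements.
def model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF)
model.read('file:domain-ontology.owl')   // ontology classes/properties (TBox)
model.read('file:instances.rdf')         // data instances / individuals (ABox)

// Inferred triples appear alongside asserted ones:
model.listStatements().each { stmt -> println stmt }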

Page 46: Vital AI: Big Data Modeling

R Analytics

Load Serialized Data into R Dataframes

Reference Classes and Properties by Name in Dataframes (cleaner code than a huge number of columns)

Page 47: Vital AI: Big Data Modeling

Graph Visualization with Cytoscape

Data already in Node/Edge Graph Form

Page 48: Vital AI: Big Data Modeling

Graph Visualization with Cytoscape

Page 49: Vital AI: Big Data Modeling

Visualize Data “Hot Spots”

Page 50: Vital AI: Big Data Modeling

NYC Schools Architecture

Components (diagram):

Mobile App
JSON-Schema
Vert.x
Vital Flow Queue
Rule Engine
NLP
DynamoDB
Vital Prime
VitalServiceClient
NYC Schools Data Model
R: Serialized Data / Data Insights

Page 51: Vital AI: Big Data Modeling

Collaboration & Tools

Page 52: Vital AI: Big Data Modeling

Collaboration/Tools

git - code revision system

OWL Ontologies treated as code artifact

Coordinate across Teams: “Front End”, “Back End”, “Operations”, “Business Intelligence”, “Data Science”…

Coordinate across Enterprise: Departments / Business Units

“Data Model of Record”

Page 53: Vital AI: Big Data Modeling

Ontology Versioning

NYCSchoolRecommendation-0.1.8.owl

Semantic Versioning (http://semver.org/)

Page 54: Vital AI: Big Data Modeling

vitalsigns command line

vitalsigns generate: code/schema generation

vitalsigns upversion/downversion: increase the version patch number, move the previous version to archive, rename the OWL file to include the username

JAR files pushed into Maven (Continuous Integration)

Page 55: Vital AI: Big Data Modeling

Git Integration

git: add, commit, push, pull

diff: determine differences

merge: merge two Ontologies

detect types of Ontology changes

merge into new patch version

Page 56: Vital AI: Big Data Modeling

OWL as Data Modeling Language: Data Architecture & Data Science / Analytics

Conclusions

Leverage Existing Tools, Components

Reduce model redundancy, reduce effort.

A Means to Collaborate Across Teams: Data Model of Record

Cleaner Data

Integrate additional analysis

Page 57: Vital AI: Big Data Modeling

For more information, please contact: Marc C. Hadfield http://vital.ai [email protected] 917.463.4776

Thank You!
