Designing Experiments - Vargas-Solar

Genoveva Vargas-Solar
French Council of Scientific Research, LIG
[email protected]

Designing Experiments
Overview of data management & exploitation solutions
http://vargas-solar.com/data-centric-smart-everything/



Volume, Velocity, Variety, Veracity, Value

1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte

Computational Science
Digital humanities
Social Data Science
Network Science

Computation (Algorithm: mathematical model)
Experiment setting (Architecture: computing environment)


Big data variety: the right model according to data


DATABASE ENVIRONMENTS LANDSCAPE


22/09/2019

Data collections rawness degree: Curated, Raw
Increased versatility & complexity
Increased scalability & speed

Key-Value stores
Document stores
Extensible record stores
NewSQL
Relational databases
Graph Databases

Querying, Look up (R/W), Analytics, Aggregation, Processing, Navigation

DATA CURATION AT SCALE


Tuple
- Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar

Document
- Allows values to be nested documents or lists, as well as scalar values
- Attributes are not defined in a global schema

Extensible record
- Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added on a per-record basis

DATA MODELS
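
The three data models can be contrasted with a small Python sketch. The names `Person`, `COLUMN_FAMILIES` and `make_extensible_record`, and all the values, are hypothetical and only serve the illustration.

```python
from collections import namedtuple

# Tuple: attributes are fixed by a schema and every value is a scalar.
Person = namedtuple("Person", ["uid", "name", "city"])
person_tuple = Person(uid=1, name="Ana", city="Lyon")

# Document: no global schema; values may be nested documents or lists.
person_document = {
    "uid": 1,
    "name": "Ana",
    "addresses": [{"city": "Lyon", "zip": "69005"}],
}

# Extensible record: attribute families come from a schema,
# but each record may add new attributes inside a family.
COLUMN_FAMILIES = {"identity", "contact"}  # hypothetical schema


def make_extensible_record(**families):
    """Build a record, rejecting attribute families missing from the schema."""
    unknown = set(families) - COLUMN_FAMILIES
    if unknown:
        raise ValueError(f"undefined attribute families: {sorted(unknown)}")
    return {family: dict(attrs) for family, attrs in families.items()}


record = make_extensible_record(
    identity={"name": "Ana"},
    contact={"email": "ana@example.org", "pager": "555-0100"},  # per-record attributes
)
```

In this sketch only the family names are checked against the schema; the attributes inside a family are free, which is the hybrid behaviour described above.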


Key-value
- Systems that store values and an index to find them, based on a key

Document
- Systems that store documents, providing index and simple query mechanisms

Extensible record
- Systems that store extensible records that can be partitioned vertically and horizontally across nodes

Graph
- Systems that model data as graphs, where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the nodes

Relational
- Systems that store, index and query tuples

DATA STORES


“Simplest data stores” use a data model similar to the memcached distributed in-memory cache

Single key-value index for all data

Provide a persistence mechanism

Replication, versioning, locking, transactions, sorting

API: inserts, deletes, index lookups

No secondary indices or keys

SYSTEM      ADDRESS
Redis       code.google.com/p/redis
Scalaris    code.google.com/p/scalaris
Tokyo       tokyocabinet.sourceforge.net
Voldemort   project-voldemort.com
Riak        riak.basho.com
Membrain    schoonerinfotech.com/products
Membase     membase.com

KEY-VALUE STORES
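
The API listed for this family of stores (inserts, deletes, index lookups on a single key index) can be sketched in a few lines of Python. The class `KeyValueStore` and its method names are hypothetical and do not correspond to the API of any of the systems listed; persistence and replication are left out.

```python
class KeyValueStore:
    """Toy in-memory key-value store with a single index on the key."""

    def __init__(self):
        self._index = {}  # the only index: key -> value

    def insert(self, key, value):
        self._index[key] = value

    def delete(self, key):
        # Deleting a missing key is a no-op in this sketch.
        self._index.pop(key, None)

    def lookup(self, key):
        # Index lookup by key; returns None when the key is absent.
        return self._index.get(key)


store = KeyValueStore()
store.insert("user:42", {"name": "Ana", "city": "Lyon"})
store.insert("user:43", {"name": "Luis", "city": "Puebla"})
store.delete("user:43")

# Without a secondary index, finding users by city means scanning every value.
in_lyon = [k for k, v in store._index.items() if v["city"] == "Lyon"]
```

The last line shows the cost of having no secondary indices: any question that is not "give me the value for this key" turns into a full scan.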


SELECT name
FROM group
WHERE gid IN ( SELECT gid
               FROM group_member
               WHERE uid = me() )

SELECT name, pic, profile_url
FROM user
WHERE uid = me()

SELECT name, pic
FROM user
WHERE online_presence = "active"
  AND uid IN ( SELECT uid2
               FROM friend
               WHERE uid1 = me() )

SELECT name
FROM friendlist
WHERE owner = me()

SELECT message, attachment
FROM stream
WHERE source_id = me() AND type = 80

https://developers.facebook.com/docs/reference/fql/


<805114856, >


Support more complex data: pointerless objects, i.e., documents

Secondary indexes (e.g. B-trees), multiple types of documents (objects) per database, nested documents and lists

Automatic sharding (scale writes), no explicit locks, weaker concurrency (eventual consistency for scaling reads) and atomicity properties

API: select, delete, getAttributes, putAttributes on documents

Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism

SYSTEM      ADDRESS
SimpleDB    amazon.com/simpledb
CouchDB     couchdb.apache.org
MongoDB     mongodb.org
Terrastore  code.google.com/terrastore

DOCUMENT STORES
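
The idea of distributing a query over several nodes with a map-reduce mechanism can be sketched as follows. The shards, the documents and the function names `map_shard` and `reduce_counts` are hypothetical; a real document store would run the map step on each node rather than in a local loop.

```python
from collections import Counter
from functools import reduce

# Hypothetical shards: each node holds part of the document collection.
shards = [
    [{"type": "VehicleObstruction", "status": "active"},
     {"type": "Accident", "status": "active"}],
    [{"type": "VehicleObstruction", "status": "closed"},
     {"type": "VehicleObstruction", "status": "active"}],
]


def map_shard(documents):
    """Map step, run independently per node: count active documents by type."""
    return Counter(doc["type"] for doc in documents if doc["status"] == "active")


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


partial_counts = [map_shard(shard) for shard in shards]  # parallelizable step
active_by_type = reduce(reduce_counts, partial_counts, Counter())
```

Because each partial count depends only on its own shard, the map step needs no coordination between nodes; only the small partial results travel to the reducer.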


Basic data model is rows and columns

Basic scalability model is splitting rows and columns over multiple nodes
- Rows split across nodes through sharding on the primary key
- Split by range rather than hash function
- Rows analogous to documents: variable number of attributes, attribute names must be unique
- Grouped into collections (tables)
- Queries on ranges of values do not go to every node

Columns are distributed over multiple nodes using “column groups”
- Which columns are best stored together
- Column groups must be pre-defined with the extensible record stores

SYSTEM      ADDRESS
HBase       hbase.apache.com
HyperTable  hypertable.org
Cassandra   incubator.apache.org/cassandra

EXTENSIBLE RECORD STORES
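
Why range sharding keeps range queries away from most nodes can be seen in a minimal sketch. The split points and the function names `node_for_key` and `nodes_for_range` are assumptions made for the example, not the partitioning scheme of any listed system.

```python
from bisect import bisect_right

# Hypothetical split points on the primary key: node 0 holds keys < "h",
# node 1 holds "h" <= key < "p", node 2 holds keys >= "p".
SPLIT_POINTS = ["h", "p"]


def node_for_key(key):
    """Return the index of the node whose key range contains `key`."""
    return bisect_right(SPLIT_POINTS, key)


def nodes_for_range(low, high):
    """Nodes to contact for a range query low <= key <= high."""
    return list(range(node_for_key(low), node_for_key(high) + 1))
```

For instance, `nodes_for_range("a", "c")` only involves node 0, while the same query under hash partitioning would have to be sent to every node, because consecutive keys hash to unrelated nodes.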


SQL: rich declarative query language

Databases enforce referential integrity

ACID semantics

Well understood operations:
- Configuration, Care and feeding, Backups, Tuning, Failure and recovery, Performance characteristics

Use small-scope operations
- Challenge: joins that do not scale with sharding

Use small-scope transactions
- ACID transactions are inefficient with communication and 2PC overhead

Shared nothing architecture for scalability

Avoid cross-node operations

SYSTEM         ADDRESS
MySQL Cluster  mysql.com/cluster
VoltDB         voltdb.com
Clustrix       clustrix.com
ScaleDB        scaledb.com
ScaleBase      scalebase.com
NimbusDB       nimbusdb.com

SCALABLE RELATIONAL SYSTEMS


Theoretical & Practical aspects (DBMS)

Domains & $R \subseteq D_1 \times D_2 \times \dots \times D_n$, Algebra $\rightarrow$ 1st Order Predicate Logic

Languages: SQL (wins), QUEL, QBE

DBMS Prototypes (1975), Products (1980)

A major improvement in DB: provide data independence & a simple, tabular view of data

Normal Forms & Dependencies (DB design, consistency)

Controversial: missing values, duplicates

More than 30 years: maturity!

$R \times S$
$R \cup S$
$R - S$
$R[a]$
$R : \varphi$
-------
$R * S$

1970 - 2000 RELATIONAL DB
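
Some of the algebra operators listed on this slide can be sketched over relations stored as lists of dictionaries. This assumes the usual reading of the notation (selection by a predicate, projection `R[a]`, natural join `R * S`); the function names and the sample relations are hypothetical.

```python
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection R[a]: keep the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def natural_join(left, right):
    """Natural join R * S: combine tuples that agree on all shared attributes."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


# Hypothetical relations used only for illustration.
users = [{"uid": 1, "name": "Ana"}, {"uid": 2, "name": "Luis"}]
members = [{"uid": 1, "gid": 10}, {"uid": 1, "gid": 11}, {"uid": 2, "gid": 10}]

group_10 = select(natural_join(users, members), lambda row: row["gid"] == 10)
names_in_group_10 = project(group_10, ["name"])
```

The nested-loop join is quadratic and only meant to show the semantics; a DBMS would pick an index or hash join instead.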


https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner2

HANDS ON EXAMPLE


{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction",
    "probability": "certain",
    "mobility": "",
    "creationtime": "2016-06-07 19:40:00",
    "publiceventtype": "",
    "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00",
    "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0",
    "effectonroadlayout": "",
    "creator": "CRITER",
    "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00",
    "version": "1",
    "linkname": "",
    "type": "VehicleObstruction",
    "status": "active",
    "direction": "bothWays",
    "locationtype": "nonLinkedPoint",
    "disturbanceactivitytype": "",
    "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "",
    "creationreference": "",
    "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon",
    "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "",
    "versiontime": "2016-06-07 19:43:26",
    "starttime": "2016-06-07 19:40:00",
    "gid": "39258",
    "abnormaltraffictype": ""
  }
}

Raw data collections
- tabular (csv, excel)
- Media (XML, JSON, BLOB)
- Graph

- How to transform data collections?
- Which is the best adapted model?

→ Polyglot persistence
- Relational
- Key-Value
- Column oriented Tabular
- Document oriented

Approaches dealing with transformation rules inspired by the relational case

MODELING DATA COLLECTIONS
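
One possible transformation from the document model to a tabular one is to flatten nested attributes into dotted column names. The sketch below uses a shortened copy of the traffic event shown above; the function `flatten` and the dotted naming rule are assumptions for the example, not a rule prescribed by the slides.

```python
import json

# Shortened version of the traffic event document shown above.
raw = """{
  "_id": "Criter11185353",
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "properties": {"type": "VehicleObstruction", "status": "active",
                 "starttime": "2016-06-07 19:40:00", "endtime": ""}
}"""


def flatten(document, prefix=""):
    """Turn nested documents into a flat dict with dotted column names."""
    row = {}
    for key, value in document.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Lists become one column per position (a deliberate simplification).
            for i, item in enumerate(value):
                row[f"{column}.{i}"] = item
        else:
            row[column] = value
    return row


row = flatten(json.loads(raw))
```

The resulting `row` could be written to a CSV file or a relational table; lists of varying length are exactly where this kind of rule starts to strain, which is one motivation for polyglot persistence.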


External medium (tape or cards)
Data dedicated to one application
Data seen as flowing through a stationary processor

Not much different from Hollerith’s tabulator
“Batch Processing”

Invoice

Master Tape Updated Master

Orders

Computer

SEQUENTIAL DATA PROCESSING


The key property of disks:
- Random access to stored data

Records can contain pointers to other records

Records can be indexed by their values and accessed directly using B-trees

Prof. Rudolf Bayer, Tech. Univ Munich

Don Chamberlin

record

record

record

DISKS MADE MODERN DATABASES POSSIBLE


Jim Gray

IDENTIFICATION: document

ENVIRONMENT: OS

DATA: Files/Records

PROCEDURE: code

COBOL

AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.

CONFIGURATION SECTION. INPUT-OUTPUT SECTION.

FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.

- CODASYL DBTG (1967): COnference on DAta SYstems Languages, Data Base Task Group
- Defined DDL for a network data model

Set-Relationship semantics, Cursor Verbs

- Isolated from procedures
- No encapsulation
- DATA division is the Schema ancestor

“Us”

“them”

CODE & DATA: SEPARATED AT BIRTH


- Many applications share data in common
- Data is managed by a centralized system
- Costs are shared
- Redundancy and inconsistency are minimized
- Control is improved
- Access language can be standardized
- Data becomes an enterprise resource
- Shared utilities: backup, recovery, replication, . . .

Database Management System

Schema & Data

INTEGRATED DATABASES (MID 60’S)




Data Processing: Data analytics [Feldman et al. 2013], Data Aggregation [Jayram et al. 2007]

Data Collections Storage: NoSQL, NewSQL (Hive), Data sharding (MongoDB, CouchDB)

Harvesting & Cleaning
Preserving
Processing & Analysis
Exploiting

DATA COLLECTIONS LIFE CYCLE


Centralized repository containing virtually inexhaustible amounts of raw data to be analysed

CRM Sensor logs

DATA LAKE


Repositories grow ever bigger and more complex, to the point that a lake becomes a swamp

CRM Sensor logs

DATA SWAMP


Step 1: Harvesting (Collecting, Vetting, Selecting)

Step 2: Extracting meta-data (Preserving, Describing)

Step 3: Exploitation (Grooming, Provisioning)

Data wrangling [Terrizzano et al. 2015]

DATA EXPLORATION WORKFLOW


Quantitative meta-data (e.g. Compass MongoDB)

Structural meta-data (e.g. Schemata like DTD, XML schema)

Semantic meta-data (linked data approaches)

Linked data collections
- Topic
- Geographic location
- Temporal releases

Figure labels: Eco-Bike, Traffic, Ontology (similar, analysed)

EXTRACTING META-DATA

[Constance, SIGMOD ’16 ]


Harvesting, Preserving, Describing, Extracting meta-data, Exploring

Level 1: No meta-data (Sensing & collecting raw data)

Level 2: Partial meta-data (Size, frequency, freshness, type)

Level 3: Complete meta-data (Quantitative & qualitative data + semantics)

[Stonebraker et al. 2015]

information

EXTRACTING META-DATA
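
The quantitative meta-data mentioned here (size, type, freshness) can be derived directly from a raw collection. The following is a minimal sketch under the assumption that records are dictionaries carrying a `last_update` timestamp; the records and the function name `quantitative_metadata` are hypothetical.

```python
from datetime import datetime

# Hypothetical records from a raw collection.
records = [
    {"id": "a", "type": "VehicleObstruction", "last_update": "2016-06-07 19:43:30"},
    {"id": "b", "type": "Accident", "last_update": "2016-06-08 08:10:00", "lanes": 2},
]


def quantitative_metadata(collection, time_field="last_update"):
    """Summarise size, attribute types and freshness of a collection."""
    attribute_types = {}
    for record in collection:
        for name, value in record.items():
            attribute_types.setdefault(name, set()).add(type(value).__name__)
    timestamps = [
        datetime.strptime(r[time_field], "%Y-%m-%d %H:%M:%S")
        for r in collection if r.get(time_field)
    ]
    return {
        "size": len(collection),
        "attribute_types": attribute_types,
        "freshest": max(timestamps) if timestamps else None,
    }


metadata = quantitative_metadata(records)
```

Attributes that appear in only some records, such as `lanes` here, show up in `attribute_types` and hint at the structural variety of the collection.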


exploreCollections( )

Constance
- Semantic meta-data discovery
- Structural meta-data discovery

Ontology: Transport, Green transport, Cycling road, Electric car

- Which green transport modes are available in the city?
- Which attributes are used for modelling transport modes?
- Are there data collections dealing with green transport?

FISHING DATA IN THE LAKE

[Constance, SIGMOD ’16 ]


Eco Bikes

Electric buses

Buses

Car rental

Green transport

Downtown Lyon

Transport for rental

Data regarding traffic in downtown, by date, traffic data collected in January 2018.

EXTRACTING META-DATA

[Constance, SIGMOD ’16 ]


Harvesting, Preserving, Describing, Extracting meta-data, Exploring

ETL

Parallel Data Processing Platforms
- Spark (RDD: Tables/Graphs)
- Hadoop ecosystem tools (e.g., Pig)

Parallel Data Processing Platforms
- Spark (descriptive statistics functions)
- Hadoop ecosystem tools (e.g., Hive)

NoSQL & NewSQL (Parallel)

Parallel Data Querying & Analytics
- Parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS)
- Parallel analytics (Matlab, R)

Structured Data provision

Parallel data collection (Flink, Stream, Flume)

PREPARING DATA COLLECTIONS


DATA SCIENCE TOOLBOXES


- Data loading
- In memory/cache/disk indexing
- Data persistence
- Query optimization
- Concurrent access
- Consistency and access control

Functions must be revisited under weaker hypotheses to support the enactment of data science pipelines

DATA CONSUMPTION PHILOSOPHIES


ENACTMENT ENVIRONMENTS

Big Data Platforms & Stacks
WIDE environments
Machine Learning Services


WHICH TOOLS FOR DATA SCIENCE

Programming language:
- Python is one of the most flexible programming languages because it can be seen as a multiparadigm language
- Alternatives are MATLAB and R

Fundamental libraries for data scientists in Python: NumPy, SciPy, Matplotlib, Scikit-Learn and Pandas
- NumPy provides support for multidimensional arrays, with basic operations on them and useful linear algebra functions.
- SciPy provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics.
- Matplotlib provides tools for data visualization.
- Scikit-Learn: machine learning library with tools such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
- Pandas: provides high-performance data structures and data analysis tools for data manipulation with integrated indexing.


DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT

To get started on solving data-oriented problems, we need to set up our programming environment

Decide on the programming language version, and whether to install the data science ecosystem as individual toolboxes or as a bundle
- For example, Anaconda Python provides integration of all the Python toolboxes and applications for data scientists in a single directory

The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity.
- The basic pieces of any IDE are three: the editor, the compiler (or interpreter) and the debugger.
- Examples: PyCharm, WingIDE, SPYDER (Scientific Python Development EnviRonment)


WEB INTEGRATED DEVELOPMENT ENVIRONMENT

Web-based IDEs were developed so that not only your code but also your whole environment and executions can be stored on a server.
- The server can be set up in a centre, such as a university or school
- Students can work on their homework either in the classroom or at home
- Students can execute all the previous steps over and over again, and then change some particular code cell (a segment of the document that may contain source code that can be executed) and execute the operation again.

For example, IPython's interactive console has been released in a browser version: Jupyter
- Markdown (a wiki text language) cells can be added to introduce algorithms.
- It is also possible to insert Matplotlib graphics to illustrate examples or even web pages.
- Experiments can become completely replicable.


DATA SCIENCE VIRTUAL MACHINE


COMPUTING CAPACITY


Genoveva Vargas-Solar
French Council of Scientific Research, LIG

[email protected]

Exploring data collections
Descriptive statistics

http://vargas-solar.com/data-centric-smart-everything/

Thanks to Prof. J.L. Zechinelli Martini, UDLAP-LAFMIA, Mexico for our collaborative construction of these slides

* This presentation was created using the content of https://www.python.org/


Helps to simplify large amounts of data in a sensible way
- A simple way to describe the data
- Presenting quantitative descriptions in a manageable form

Main steps
- Data preparation: generate statistically valid descriptions
- Descriptive statistics: generate different statistics to
  - Describe and summarize the data concisely
  - Evaluate different ways to visualize them

DESCRIPTIVE STATISTICS


Obtaining the data: Read from a file or obtained by scraping the web

Parsing the data: Format the data which can be in plain text, fixed columns, CSV, XML, HTML, etc.

Cleaning the data: A simple strategy is to remove or ignore incomplete records

Building data structures: A data structure that lends itself to the analysis we are interested in.

Databases provide a mapping from keys to values, so they serve as dictionaries


PREPARING DATA
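
The four preparation steps (obtaining, parsing, cleaning, building a data structure) can be sketched with the standard library only. The CSV text, the column names and the function `prepare` are hypothetical; the cleaning rule is the simple one mentioned above, dropping incomplete records.

```python
import csv
import io

# Hypothetical CSV content standing in for a file or a scraped page.
raw_text = """age,sex,income
39,Male,<=50K
50,Female,>50K
,Male,>50K
28,Female,<=50K
"""


def prepare(text):
    """Parse CSV text, drop incomplete records, index the rest by position."""
    reader = csv.DictReader(io.StringIO(text))          # parsing
    complete = [r for r in reader if all(r.values())]   # cleaning
    for record in complete:
        record["age"] = int(record["age"])              # typed values
    return dict(enumerate(complete))                    # key -> record dictionary


table = prepare(raw_text)
```

The returned dictionary plays the role of the in-memory data structure; with data that does not fit in memory, the same key-to-record mapping would be delegated to a database.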


Financial parameters related to the US population*

- Features: Age, sex, marital, country, income, education, occupation, capital gain, etc.
- Question: Are men more likely to become high-income professionals than women, i.e., to receive an income of over $50,000 per year?
- Preparing data collections
  - Read and check the data
  - Represent the data, for instance using a tabular data structure with features (columns) and records (rows)
  - Group the data

* UCI’s Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Adult

ANALYSING INCOME ACCORDING TO GENDER
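
The grouping step behind the question above can be sketched as follows. The seven records are invented for the illustration and are not taken from the Adult dataset, so the resulting rates say nothing about the real data; the function name `high_income_rate_by` is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the spirit of the Adult dataset; not real data.
records = [
    {"sex": "Male", "income": ">50K"},
    {"sex": "Male", "income": "<=50K"},
    {"sex": "Male", "income": ">50K"},
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Female", "income": ">50K"},
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Female", "income": "<=50K"},
]


def high_income_rate_by(records, feature):
    """Group records by `feature` and return the share earning over $50K."""
    totals, high = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[feature]
        totals[group] += 1
        high[group] += record["income"] == ">50K"
    return {group: high[group] / totals[group] for group in totals}


rates = high_income_rate_by(records, "sex")
```

Comparing rates rather than raw counts matters because the two groups need not have the same size in the sample.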


Measurements and categories represent a sample distribution of a variable:
- which approximately represents the population distribution of the variable
- to make tentative assumptions about the population distribution

Different techniques:
- Summarizing the data
- Data distributions
- Outlier treatment
- Kernel density

EXPLORATORY DATA ANALYSIS
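
Two of the techniques listed, summarizing the data and outlier treatment, can be sketched with Python's `statistics` module. The readings are hypothetical, and the outlier rule used here (Tukey's 1.5 × IQR fences) is one common choice, not necessarily the one intended by the slides.

```python
import statistics

# Hypothetical hourly energy readings with one suspicious value.
readings = [12.0, 13.5, 12.8, 14.1, 13.2, 95.0, 12.9, 13.7]


def summarize(values):
    """Basic summary statistics of a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if q1 - k * spread <= v <= q3 + k * spread]


cleaned = drop_outliers(readings)
```

Comparing `summarize(readings)` with `summarize(cleaned)` shows how strongly a single extreme value pulls the mean and the standard deviation, while the median barely moves.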


https://www.kaggle.com/robikscube/hourly-energy-consumption


DATA ANALYSIS TECHNIQUES


J. LESKOVEC, A. RAJARAMAN, J. ULLMAN: MINING OF MASSIVE DATASETS, HTTP://WWW.MMDS.ORG

Given lots of data

Discover patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern

PRINCIPLE


Based on three guiding principles
- Decision backwards
- Step by step
- Test and learn

Big Data
Predictive & Optimization Models
Organizational transformation

CAPTURING VALUE FROM ADVANCED ANALYTICS


STATISTICS

Concepts:
- Population: collection of objects, items (“units”) about which information is sought.
- Sample: a part of the population that is observed.

Descriptive statistics: simplify large amounts of data in a sensible way, presenting quantitative descriptions in a manageable form. It is a way to describe data.
- Applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study.
- These procedures are essential to provide summaries about the samples as an approximation of the population.
- Together with simple graphics, they form the basis of every quantitative analysis of data.

Inferential statistics: draws conclusions beyond the analysed data; reaches conclusions regarding the hypotheses made; aims at inferring characteristics of the “population” of the data.


DATA FOR STATISTICS

To describe the sample data and to infer any conclusion, data preparation is required for generating statistically valid descriptions & conclusions

1. Harvesting: data are read from a file or obtained from sources (archives, data stores, sensors).

2. Parsing: depends on the format the data are in, for example plain text, fixed columns, CSV, XML, HTML, etc.

3. Cleaning: survey responses and other data files are almost always incomplete. Sometimes there are multiple codes for things such as not asked, did not know, and declined to answer. And there are almost always errors.

4. Building data structures: store data in a data structure that lends itself to the analysis we are interested in.
   - If the data fit into the memory, building a data structure is usually the way to go.
   - If not, usually a database is built, which is an out-of-memory data structure.
   - Most databases provide a mapping from keys to values, so they serve as dictionaries.
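
The cleaning step, where several codes all mean "no usable answer", can be sketched as below. The set `MISSING_CODES`, the survey records and the function names are hypothetical; real surveys document their own codes.

```python
# Hypothetical codes a survey might use for "no usable answer".
MISSING_CODES = {"", "NA", "not asked", "did not know", "declined to answer", "?"}


def clean_value(value):
    """Map every missing-value code to None and strip stray whitespace."""
    text = value.strip()
    return None if text in MISSING_CODES else text


def clean_records(records, required):
    """Normalise values, then keep only records with all required fields."""
    cleaned = []
    for record in records:
        normalised = {k: clean_value(v) for k, v in record.items()}
        if all(normalised.get(field) is not None for field in required):
            cleaned.append(normalised)
    return cleaned


survey = [
    {"age": "39", "occupation": "Sales", "income": ">50K"},
    {"age": " 50 ", "occupation": "?", "income": "<=50K"},
    {"age": "declined to answer", "occupation": "Tech", "income": ">50K"},
]
usable = clean_records(survey, required=["age", "income"])
```

Keeping a record whose optional field is missing, while dropping one whose required field is missing, avoids throwing away more data than the analysis actually needs.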


Computer Science

Artificial Intelligence

Curated Knowledge

Machine Learning

Reverse Engineering The Brain

Data Mining

AI & DATA MINING


J. LESKOVEC, A. RAJARAMAN, J. ULLMAN: MINING OF MASSIVE DATASETS, HTTP://WWW.MMDS.ORG

Data mining overlaps with:
- Databases: Large-scale data, simple queries
- Machine learning: Small data, Complex models
- CS Theory: (Randomized) Algorithms

Different cultures:
- To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
  - Result is the query answer
- To a ML person, data mining is the inference of models
  - Result is the parameters of the model

In this class we will do both!

Machine Learning, CS Theory, Data Mining, Database systems

DATA MINING CULTURES


Description

Prediction

Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)

Classification (Neural network, PLSDA, KNN, decision trees)

Trend prediction

Regression (PLSR, PCR)

Association

Modelling (LU, QR, PCA=SVD, PARAFAC)

- How many accidents are reported per day?
- What percentage of the available bicycles is used downtown?
- Which are the traffic bottleneck regions in the city?
- Is the number of car accidents related to seasons?
- How will the use of bicycles evolve downtown during the summers of the next 5 years?
- Which types of cars have the most accidents on high-speed roads?
- Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?

PROCESSING DATA COLLECTIONS


Spatial analysis of dynamic movements of Vélo’v, Lyon’s shared bicycle program [ECCS 2009]

Contribution: PCA and k-means for predicting the usage trend of Vélo’v in Lyon

Description: Clustering

PCA & k-means

An orthogonal linear transformation maps the data to a new coordinate system such that
- the greatest variance by some projection of the data comes to lie on the first coordinate (first principal component),
- the second greatest variance on the second coordinate, and so on.

PROCESSING DATA COLLECTIONS
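
For two-dimensional data the first principal component can be computed in closed form from the covariance matrix, which makes the definition above concrete. This is a minimal sketch, not the method of the cited study; the function name and the sample points are hypothetical, and k-means is not covered.

```python
import math


def first_principal_component(points):
    """Return (unit direction, variance) of the first principal component of 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centred data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    half_trace = (sxx + syy) / 2
    largest = half_trace + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Corresponding eigenvector; fall back to an axis when sxy is zero.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), largest


# Hypothetical points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variance = first_principal_component(points)
```

Here the returned direction is proportional to (1, 2): all the variance lies along the line, so the second component would carry none.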


Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]

Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads

Description: Modelling

Principal Component Analysis (PCA)

PROCESSING DATA COLLECTIONS
