Intelligent Data Analysis

Post on 07-Mar-2018

1

Budapest University of Technology and Economics, Department of Measurement and Information Systems

Budapest University of Technology and Economics, Fault Tolerant Systems Research Group

Data processing

Intelligent Data Analysis

http://www.mit.bme.hu/node/8036

2

Outline

Data format/representation

Data processing

ETL, workflow support

Outlook: OLAP

Case studies

3

Data science „process”

https://en.wikipedia.org/wiki/Data_science

4

DATA FORMAT

5

Tidy data

3 simple rules to facilitate statistics and visualization

One variable – one column

One observation – one row

Each type of observational unit – one table

… seems to be trivial

… not true in most practical cases

… and even for statistical tools (e.g. output of R packages)

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. https://github.com/hadley/tidy-data

6

Data originally: long/wide

https://en.wikipedia.org/wiki/Wide_and_narrow_data
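
To make the two layouts concrete, here is a minimal sketch that converts a wide table (one column per variable) into the long/narrow form (one row per person and variable). It is plain Python rather than R; the helper name `wide_to_long`, the column names and the values are made up for illustration.

```python
def wide_to_long(rows, id_col, key_name="variable", value_name="value"):
    """Turn each non-id column of a wide row into its own (id, key, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue
            long_rows.append({id_col: row[id_col], key_name: column, value_name: value})
    return long_rows


# Hypothetical wide table: one row per person, one column per variable.
wide = [
    {"person": "Bob", "age": 32, "weight": 168},
    {"person": "Alice", "age": 24, "weight": 150},
]
long_table = wide_to_long(wide, id_col="person")
```

Two wide rows with two variables each become four long rows; the first one is `{"person": "Bob", "variable": "age", "value": 32}`.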

7

How to use these formats?

Sparse Screening for Exact Data Reduction. Jieping Ye, Arizona State University

8

Examples for tidy data

http://garrettgman.github.io/tidying/

R dataframe representation:

9

„tidying”

R: spread(data,key,value)

http://garrettgman.github.io/tidying/
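
The idea behind `spread(data, key, value)` is that every distinct entry of the key column becomes a column of its own. A rough plain-Python analogue might look like the sketch below; it assumes rows are dictionaries, the helper `spread` is a hypothetical stand-in and not the R implementation, and the sample data is invented.

```python
def spread(rows, key, value):
    """Pivot long rows to wide: each distinct `key` entry becomes a column."""
    wide = {}  # maps the identifying columns of a row to its output record
    for row in rows:
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        record = wide.setdefault(ident, dict(ident))
        record[row[key]] = row[value]
    return list(wide.values())


# Hypothetical long table: the "key" column mixes two different variables.
long_rows = [
    {"country": "A", "year": 1999, "key": "cases", "value": 10},
    {"country": "A", "year": 1999, "key": "population", "value": 1000},
    {"country": "B", "year": 1999, "key": "cases", "value": 20},
    {"country": "B", "year": 1999, "key": "population", "value": 4000},
]
tidy = spread(long_rows, key="key", value="value")
```

After the call each (country, year) pair has one row, with `cases` and `population` as separate columns, which matches the "one variable, one column" rule.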

10

„tidying”

R: spread(data,key,value)

Generalization?

http://garrettgman.github.io/tidying/

11

Data restructuring examples (in R)

https://www.r-statistics.com/2012/01/aggregation-and-restructuring-data-from-r-in-action/

12

DATA STORAGE

13

Common data storage techniques

.CSV

o Majority of inputs

o Length? Header? Encoding?

DB with a schema (in memory?)

Graph databases, ontologies, RDF…

Key-value stores (redis)

Time series databases (openTSDB, influxDB)

o Time series + metadata

„Data in motion”

o Streams as input for processing/analysis
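
The three CSV questions (length, header, encoding) usually have to be answered in code before any analysis can start. The sketch below is one possible way to do that with the Python standard library; the candidate encodings and the "a header row has no numeric cells" rule are assumptions for illustration, not a general solution.

```python
import csv
import io


def is_number(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False


def load_csv(raw_bytes, encodings=("utf-8", "latin-1")):
    """Decode with the first encoding that works, then split header and data rows."""
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the input")
    rows = list(csv.reader(io.StringIO(text)))
    # Heuristic (an assumption): a header row contains no numeric cells.
    has_header = bool(rows) and not any(is_number(cell) for cell in rows[0])
    header = rows[0] if has_header else None
    data = rows[1:] if has_header else rows
    return encoding, header, data


# Hypothetical input that is not valid UTF-8, so the fallback encoding is used.
raw = "host,cpu\nnódé,0.5\n".encode("latin-1")
encoding, header, data = load_csv(raw)
```

The length question is then simply `len(data)`; for large files one would stream the rows instead of building a list.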

14

Time series example: influxDB

Data: measurement

o Fields, tags, timestamp
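
A point with a measurement name, tags, fields and a timestamp can be written as one line of text in the InfluxDB line protocol. The sketch below formats such a line by hand to show how the four parts fit together; the escaping is deliberately simplified, and the measurement, tag and field names are invented. A real client library should be preferred in practice.

```python
def escape(text, chars):
    """Backslash-escape the given separator characters."""
    text = str(text)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def field_value(value):
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integer fields carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'  # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as '<measurement>,<tags> <fields> <timestamp>'."""
    tag_part = "".join(
        f",{escape(k, ',= ')}={escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape(k, ',= ')}={field_value(v)}" for k, v in fields.items()
    )
    return f"{escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: tags are indexed metadata, fields hold the measured values.
line = to_line_protocol(
    "cpu", {"host": "node1", "region": "eu"}, {"usage": 0.64, "cores": 4},
    1520380800000000000,
)
```

Here `line` is `cpu,host=node1,region=eu usage=0.64,cores=4i 1520380800000000000`: tags sit next to the measurement name, fields follow after a space, and the timestamp comes last.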

15

Dashboards… (e.g. Grafana)

https://grafana.com/dashboards/1443

16

DATA PROCESSING WORKFLOW & TOOLS

17

ETL

„Extract-Transform-Load”

Originally: to fill a snowflake/star schema

In data science: create dataframes

Cleaning tasks

o Standardization

o Normalization

o Deduplication

o Enrichment

o Clear/fill NAs
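
The cleaning tasks listed above can be sketched on a toy table as follows. The interpretation of each term is an assumption (standardization as consistent text formatting, normalization as min-max scaling, filling NAs with the mean), and the field names `host`, `load`, `location` as well as the data are hypothetical.

```python
from statistics import mean


def clean(records, locations):
    # Standardization: trim and lower-case host names so equal hosts compare equal.
    rows = [{"host": r["host"].strip().lower(), "load": r["load"]} for r in records]

    # Deduplication: keep the first occurrence of each (host, load) pair.
    seen, unique = set(), []
    for row in rows:
        key = (row["host"], row["load"])
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Fill NAs: replace missing loads (None) with the mean of the known ones.
    known = [row["load"] for row in unique if row["load"] is not None]
    fill = mean(known)
    for row in unique:
        if row["load"] is None:
            row["load"] = fill

    # Normalization: rescale loads to the [0, 1] range (min-max scaling).
    low, high = min(known), max(known)
    for row in unique:
        row["load_scaled"] = (row["load"] - low) / (high - low) if high > low else 0.0

    # Enrichment: attach a location from a separate lookup table.
    for row in unique:
        row["location"] = locations.get(row["host"], "unknown")
    return unique


raw = [
    {"host": " Node1 ", "load": 2.0},
    {"host": "node1", "load": 2.0},   # duplicate once the name is standardized
    {"host": "node2", "load": None},  # missing value
    {"host": "node3", "load": 6.0},
]
cleaned = clean(raw, {"node1": "rack A", "node3": "rack B"})
```

The order of the steps matters: deduplication only finds the second `node1` row because standardization ran first.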

18

Example data processing workflow (KNIME)

Steps: reading, filtering/aggregation, transformation, plotting, …

Status of the concrete execution

KNIME

19

Measurement processing: RapidMiner

Operator labels in the process figure:

o Delete unnecessary attribute

o Calculating averages (interval)

o Read CSV

o Format conversion

o Identifying source node

o Filter to cpu.usage.average

o Add machine information
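
The core of such a measurement-processing chain (read CSV, convert formats, filter to one metric, average per interval) can be approximated in a few lines. The sketch below is not the RapidMiner process itself; the column names `timestamp`, `node`, `metric`, `value`, the 60-second interval and the sample rows are assumptions.

```python
import csv
import io
from collections import defaultdict


def interval_averages(csv_text, metric="cpu.usage.average", interval_s=60):
    """Average one metric per (node, interval start)."""
    sums = defaultdict(lambda: [0.0, 0])    # (node, interval start) -> [sum, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["metric"] != metric:          # filter to the metric of interest
            continue
        timestamp = int(row["timestamp"])    # format conversion: text to numbers
        value = float(row["value"])
        bucket = timestamp - timestamp % interval_s
        acc = sums[(row["node"], bucket)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


sample = """timestamp,node,metric,value
0,vm1,cpu.usage.average,10
30,vm1,cpu.usage.average,20
30,vm1,mem.usage.average,55
60,vm1,cpu.usage.average,40
45,vm2,cpu.usage.average,8
"""
averages = interval_averages(sample)
```

For the sample, `vm1` averages 15.0 in the interval starting at 0 and 40.0 in the one starting at 60; the memory row is dropped by the filter.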

20

CASE STUDY

Processing of telco data

21

SOME BACKGROUND… OLAP

22

On-Line Analytical Processing (img: snowplowanalytics.com)

Business intelligence approach

Extensively used since early 2000s

o Still! (although not as popular as it was, at least in academic research)

Features

o Multi-dimensional analysis

o Fast query execution

o Exploratory analysis of data

• Support ad-hoc queries

o Report generation

o (Visualization)

23

On-Line Analytical Processing (img: snowplowanalytics.com)

Central concept: OLAP cube

o Multi-dimensional array:

• set of separate data

– Dimensionality >3

– technically a hypercube

– ~ a multi-dimensional spreadsheet

o Slicer: dimension held constant

• For a given query (e.g. sales in a particular year)
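
A tiny cube and a slicer can be sketched with a dictionary keyed by coordinate tuples. This is only an illustration of the concept, not how OLAP engines store data; the dimensions (year, region, product) and the sales figures are invented.

```python
DIMENSIONS = ("year", "region", "product")

# Hypothetical cube: (year, region, product) -> sales
cube = {
    (2016, "north", "phone"): 10,
    (2016, "south", "phone"): 7,
    (2017, "north", "phone"): 12,
    (2017, "north", "tablet"): 4,
    (2017, "south", "tablet"): 6,
}


def slice_cube(cube, dimension, value):
    """Hold one dimension constant and return the remaining lower-dimensional cube."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


# "Sales in a particular year": the year dimension is the slicer.
sales_2017 = slice_cube(cube, "year", 2017)
```

The result has one dimension fewer: its keys are (region, product) pairs for 2017 only.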

24

OLAP process (img: Pranav Joshi)

25

OLAP operations

Operations

o Slicing & dicing

o Drill up & down

o Pivoting

Easy to visualize by the cube itself
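
Dicing, drilling up and pivoting can be sketched in the same spirit on a dictionary-based toy cube. The helper names `dice`, `roll_up` and `pivot`, the dimensions and the numbers are hypothetical; summation is assumed as the aggregation function.

```python
from collections import defaultdict

DIMENSIONS = ("year", "region", "product")

# Hypothetical cube: (year, region, product) -> sales
cube = {
    (2016, "north", "phone"): 10,
    (2016, "south", "phone"): 7,
    (2017, "north", "phone"): 12,
    (2017, "north", "tablet"): 4,
    (2017, "south", "tablet"): 6,
}


def dice(cube, **allowed):
    """Keep cells whose coordinates are in the allowed set on each named dimension."""
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[DIMENSIONS.index(dim)] in values for dim, values in allowed.items())
    }


def roll_up(cube, dimension):
    """Drill up: aggregate (here: sum) one dimension away."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, measure in cube.items():
        totals[key[:axis] + key[axis + 1:]] += measure
    return dict(totals)


def pivot(cube, order):
    """Reorder the dimensions (rotate the cube) without changing any value."""
    axes = [DIMENSIONS.index(dim) for dim in order]
    return {tuple(key[a] for a in axes): measure for key, measure in cube.items()}


diced = dice(cube, year={2017}, region={"north"})
by_year_region = roll_up(cube, "product")
pivoted = pivot(cube, ("product", "region", "year"))
```

Drilling down is the inverse of `roll_up`: it needs the more detailed cube, which is why the aggregated result alone is not enough to go back.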

26

Slicing (img: Wikipedia)

27

Dicing (img: Wikipedia)

28

Drill up & down (img: Wikipedia)

29

Pivoting (img: Wikipedia)

30

OLAP vs. “regular/modern” data analysis

OLAP cube: like a set of spreadsheets

o multi-dimensional

o interlinked

Modern data analysis: “flat” data frames

o Modern machine learning algorithms:

• require (?) single dataframes

Operations: basically the same (slicing, dicing, drill up & down, pivoting)

31

CASE STUDY 3

„Deep insights from observations with the help of modern data analysis tools” – CECRIS IAPP project

Railway accidents: casualties by type of accident, Department for Transport Statistics, Rail Statistics, Table TSGB0805 (RAI0501)

(https://www.gov.uk/government/organisations/department-for-transport/series/rail-statistics)

Analysis: next class; for now, let us process the data

32

PowerBI data import

33

Load data to PowerQuery

34

Remove unnecessary top rows

35

Remove unnecessary bottom rows

36

Remove blank rows

37

Remove columns

38

Promote first row to header

39

Filter “total” and “all” rows

40

Split first column

41

Replace empty values with null in first column

42

Replace empty values with null in second column

43

Remove colon character from first column
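
The Power Query steps on the preceding slides can be imitated in code, which also makes the cleaning repeatable. The sketch below runs on an invented sheet, not on the real rail statistics table; the number of top and bottom rows, the dropped column, the split at the last space and all cell values are assumptions.

```python
def tidy_sheet(rows, top=2, bottom=1, drop_columns=(3,)):
    rows = rows[top:len(rows) - bottom]                          # remove top and bottom rows
    rows = [r for r in rows if any(cell.strip() for cell in r)]  # remove blank rows
    rows = [[c for i, c in enumerate(r) if i not in drop_columns] for r in rows]  # remove columns
    header, body = rows[0], rows[1:]                             # promote first row to header
    # Filter out the "total" and "all" summary rows.
    body = [r for r in body if not r[0].lower().startswith(("total", "all"))]
    records = []
    for first, *rest in body:
        category, _, casualty = first.rpartition(" ")            # split first column at the last space
        category = category.replace(":", "") or None             # remove colon; empty value becomes None
        casualty = casualty or None
        records.append([category, casualty, *rest])
    return ["category", "casualty"] + header[1:], records


# Hypothetical sheet with title rows, a blank row, a notes column and a footer.
sheet = [
    ["Hypothetical table title", "", "", ""],
    ["", "", "", ""],
    ["Type", "2015", "2016", "Notes"],
    ["Train accidents: Killed", "1", "0", ""],
    ["", "", "", ""],
    ["Train accidents: Injured", "20", "18", ""],
    ["Total casualties", "21", "18", ""],
    ["Source: made-up numbers", "", "", ""],
]
header, records = tidy_sheet(sheet)
```

Each line of the function corresponds to one manual step, so a changed input layout only requires adjusting the parameters instead of repeating the clicks.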

44

Automation: RapidMiner process

https://my.rapidminer.com/nexus/account/index.html#downloads

45

Read Excel

Read measurements

46

Filter rows

47

Split

48

Rename attributes

49

Loop attributes

50

Replace spaces

51

Replace colon character

52

Header row problem

To be kept

To be removed

(derived)