Post on 07-Mar-2018
Budapest University of Technology and Economics, Department of Measurement and Information Systems
Budapest University of Technology and Economics, Fault Tolerant Systems Research Group
Data processing
Intelligent Data Analysis
http://www.mit.bme.hu/node/8036
Outline
Data format/representation
Data processing
ETL, workflow support
Outlook: OLAP
Case studies
Data science „process”
https://en.wikipedia.org/wiki/Data_science
DATA FORMAT
Tidy data
3 Simple rules to facilitate statistics and visualization
One variable – one column
One observation – one row
Each type of observational unit – one table
… seems to be trivial
… not true in most practical cases
… and even for statistical tools (e.g. output of R packages)
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. https://github.com/hadley/tidy-data
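The three rules can be illustrated with a small sketch in plain Python. The records and the helper names (`is_rectangular`, `column`) are invented for illustration and do not come from the paper: each dictionary is one observation, each key is one variable, and the list as a whole is one table for one type of observational unit.

```python
# Hypothetical measurements: one row per observation, one key per variable.
tidy_rows = [
    {"country": "A", "year": 1999, "cases": 10},
    {"country": "A", "year": 2000, "cases": 12},
    {"country": "B", "year": 1999, "cases": 7},
]


def is_rectangular(rows):
    """Return True if every row carries exactly the same set of variables."""
    columns = set(rows[0])
    return all(set(row) == columns for row in rows)


def column(rows, name):
    """Extract one variable (column) across all observations."""
    return [row[name] for row in rows]


# With tidy data, a statistic over a variable is a one-liner.
total_cases = sum(column(tidy_rows, "cases"))
```

Once the data has this shape, statistics and plots only need to name a column, which is the practical benefit the rules aim for.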
Data originally: long/wide
https://en.wikipedia.org/wiki/Wide_and_narrow_data
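A minimal sketch of converting a wide table into the long (narrow) form, in plain Python. The function name `wide_to_long`, the column names, and the sample values are assumptions made up for this example.

```python
def wide_to_long(rows, id_column, key_name, value_name):
    """Emit one (id, variable, value) row per non-id cell of a wide table."""
    long_rows = []
    for row in rows:
        for column_name, value in row.items():
            if column_name == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], key_name: column_name, value_name: value}
            )
    return long_rows


# Hypothetical wide table: one row per person, one column per variable.
wide = [
    {"person": "Bob", "age": 32, "weight": 168},
    {"person": "Alice", "age": 24, "weight": 150},
]
long_rows = wide_to_long(wide, "person", "variable", "value")
```

The long form has one row per person and variable, so two wide rows with two variables each become four long rows.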
How to use these formats?
Sparse Screening for Exact Data Reduction. Jieping Ye, Arizona State University
Examples for tidy data
http://garrettgman.github.io/tidying/
R dataframe representation:
„tidying”
R: spread(data,key,value)
http://garrettgman.github.io/tidying/
„tidying”
R: spread(data,key,value)
Generalization?
http://garrettgman.github.io/tidying/
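The reshaping performed by `spread(data, key, value)` can be sketched in plain Python. This is only an illustration of the idea, not the R implementation: the Python function `spread` below and the sample rows are assumptions. Every column other than the key and value columns is treated as identifying the output row.

```python
def spread(rows, key, value):
    """Turn key/value pairs into columns; the other columns identify a row."""
    wide = {}
    for row in rows:
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        # Start a new output row from the identifying columns if needed,
        # then add the value under the column named by the key.
        wide.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide.values())


# Hypothetical long table: the "key" column holds variable names.
long_rows = [
    {"country": "A", "year": 1999, "key": "cases", "value": 10},
    {"country": "A", "year": 1999, "key": "population", "value": 1000},
    {"country": "B", "year": 1999, "key": "cases", "value": 7},
    {"country": "B", "year": 1999, "key": "population", "value": 500},
]
wide_rows = spread(long_rows, "key", "value")
```

Because the key and value column names are parameters, the same function covers any number of variables stored in key/value form.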
Data restructuring examples (in R)
https://www.r-statistics.com/2012/01/aggregation-and-restructuring-data-from-r-in-action/
DATA STORAGE
Common data storage techniques
.CSV
o Majority of inputs
o Length? Header? Encoding?
DB with a schema (in memory?)
Graph databases, ontologies, RDF…
Key-value stores (redis)
Time series databases (openTSDB, influxDB)
o Time series + metadata
„Data in motion”
o Streams as input for processing/analysis
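The CSV questions above (length, header, encoding) usually have to be answered explicitly when reading a file. A minimal sketch with the Python standard library follows. The function name `read_csv_bytes`, its parameters, and the sample content are assumptions for illustration.

```python
import csv
import io


def read_csv_bytes(raw, encoding="utf-8", has_header=True, delimiter=","):
    """Decode raw CSV bytes and return (header, body) as lists of strings."""
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip completely empty lines
    if has_header:
        header, body = rows[0], rows[1:]
    else:
        # No header in the file: generate placeholder column names.
        header = [f"col{i}" for i in range(len(rows[0]))]
        body = rows
    return header, body


# Hypothetical input: semicolon-separated, ISO 8859-2 encoded, with a header.
raw = "város;érték\nBudapest;1\nGyőr;2\n".encode("iso8859_2")
header, body = read_csv_bytes(raw, encoding="iso8859_2", delimiter=";")
row_count = len(body)
```

Decoding the same bytes with the wrong encoding would either fail or silently corrupt the accented characters, which is why the encoding is a parameter rather than a guess.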
Time series example: influxDB
Data: measurement
o Fields, tags, timestamp
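As a sketch of how a measurement with tags, fields and a timestamp is written, the helper below formats one point in the InfluxDB line protocol (`measurement,tag=value field=value timestamp`). The function name and the sample point are assumptions, and the sketch deliberately ignores the escaping of spaces, commas and equals signs in names and tag values.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as 'measurement,tags fields timestamp' (no escaping)."""

    def format_field(value):
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integers carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)
        return '"' + str(value).replace('"', '\\"') + '"'  # strings are quoted

    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: tags are indexed metadata, fields hold the measured values.
line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "eu"},
    {"usage": 0.64, "cores": 4},
    1520380800000000000,
)
```

Tags describe where a value comes from and are meant for filtering and grouping, while fields hold the measured values themselves.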
Dashboards… (e.g. Grafana)
https://grafana.com/dashboards/1443
DATA PROCESSING WORKFLOW & TOOLS
ETL
„Extract-Transform-Load”
Originally: to fill a snowflake/star schema
In data science: create dataframes
Cleaning tasks
o Standardization
o Normalization
o Deduplication
o Enrichment
o Clear/fill NAs
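Some of the cleaning tasks above can be sketched in a few lines of plain Python. The record layout (`name`, `value`) and the concrete choices are assumptions: standardization is taken to mean trimming and lower-casing text, NAs are filled with the mean, and normalization is taken to mean min-max scaling. Enrichment is omitted because it needs an external data source.

```python
def clean(records):
    """Standardize names, drop duplicates, fill missing values, scale to [0, 1]."""
    # Standardization: trim whitespace and unify the case of the text key.
    standardized = [{**r, "name": r["name"].strip().lower()} for r in records]

    # Deduplication: keep the first occurrence of each name.
    seen, unique = set(), []
    for r in standardized:
        if r["name"] not in seen:
            seen.add(r["name"])
            unique.append(r)

    # Fill NAs: replace missing values with the mean of the known ones.
    known = [r["value"] for r in unique if r["value"] is not None]
    mean = sum(known) / len(known)
    filled = [{**r, "value": mean if r["value"] is None else r["value"]} for r in unique]

    # Normalization (assumed here to be min-max scaling).
    low = min(r["value"] for r in filled)
    high = max(r["value"] for r in filled)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**r, "value": (r["value"] - low) / span} for r in filled]


# Hypothetical raw records with a duplicate and a missing value.
records = [
    {"name": " Alpha ", "value": 10.0},
    {"name": "alpha", "value": 10.0},
    {"name": "Beta", "value": None},
    {"name": "Gamma", "value": 30.0},
]
cleaned = clean(records)
```

The order matters: standardizing first lets the deduplication step recognize " Alpha " and "alpha" as the same entry.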
Example data processing workflow (KNIME)
Steps: reading, filtering/aggregation, transformation, plotting, …
Status of the concrete execution
KNIME
Measurement processing: RapidMiner
o Delete unnecessary attribute
o Calculating averages (interval)
o Read CSV
o Format conversion
o Identifying source node
o Filter to cpu.usage.average
o Add machine information
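Two of the steps named on this slide, filtering to `cpu.usage.average` and calculating averages per interval, can be sketched in plain Python. This is not the RapidMiner process itself: the record layout (`node`, `metric`, `timestamp`, `value`), the function name, and the sample values are assumptions.

```python
from collections import defaultdict


def average_per_interval(samples, metric, interval_s):
    """Average one metric per (node, interval start) bucket."""
    buckets = defaultdict(list)
    for s in samples:
        if s["metric"] != metric:  # filter to the metric of interest
            continue
        start = s["timestamp"] - s["timestamp"] % interval_s
        buckets[(s["node"], start)].append(s["value"])
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}


# Hypothetical samples; timestamps are seconds since an arbitrary origin.
samples = [
    {"node": "vm1", "metric": "cpu.usage.average", "timestamp": 0, "value": 10.0},
    {"node": "vm1", "metric": "cpu.usage.average", "timestamp": 20, "value": 30.0},
    {"node": "vm1", "metric": "mem.usage.average", "timestamp": 20, "value": 99.0},
    {"node": "vm1", "metric": "cpu.usage.average", "timestamp": 60, "value": 50.0},
]
averages = average_per_interval(samples, "cpu.usage.average", 60)
```

Keying the buckets by source node as well as interval keeps measurements from different machines apart.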
CASE STUDY
Processing of telco data
SOME BACKGROUND… OLAP
On-Line Analytical Processing (img: snowplowanalytics.com)
Business intelligence approach
Extensively used since the early 2000s
o Still! (although not as popular as it was, at least in academic research)
Features
o Multi-dimensional analysis
o Fast query execution
o Exploratory analysis of data
• Support ad-hoc queries
o Report generation
o (Visualization)
On-Line Analytical Processing (img: snowplowanalytics.com)
Central concept: OLAP cube
o Multi-dimensional array:
• set of separate data
– Dimensionality >3
– technically a hypercube
– ~ a multi-dimensional spreadsheet
o Slicer: dimension held constant
• For a given query (e.g. sales in a particular year)
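A minimal sketch of a cube and a slicer in plain Python, assuming a dictionary keyed by coordinate tuples. The dimension names, the sales figures, and the function `slice_cube` are invented for illustration. Three dimensions are used for brevity, but the same code works for any number of them.

```python
# Hypothetical sales cube: (year, region, product) -> amount.
DIMENSIONS = ("year", "region", "product")
cube = {
    (2016, "north", "phone"): 5,
    (2016, "south", "phone"): 3,
    (2017, "north", "phone"): 7,
    (2017, "north", "tablet"): 2,
}


def slice_cube(cube, dimension, value):
    """Hold one dimension constant and return the remaining sub-cube."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: amount
        for key, amount in cube.items()
        if key[axis] == value
    }


# Sales in a particular year: the year dimension is the slicer.
sales_2017 = slice_cube(cube, "year", 2017)
```

The result has one dimension fewer than the original cube, because the sliced dimension is fixed to a single value.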
OLAP process (img: Pranav Joshi)
OLAP operations
Operations
o Slicing & dicing
o Drill up & down
o Pivoting
Easy to visualize by the cube itself
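The operations can also be sketched on a small cube stored as a Python dictionary. The cube contents and the function names are assumptions for illustration. Drill up is implemented as summing away one dimension, and drill down is its inverse, which requires the more detailed data to be available.

```python
from collections import defaultdict

# Hypothetical sales cube: (year, region, product) -> amount.
DIMENSIONS = ("year", "region", "product")
cube = {
    (2016, "north", "phone"): 5,
    (2016, "south", "phone"): 3,
    (2017, "north", "phone"): 7,
    (2017, "north", "tablet"): 2,
}


def dice(cube, allowed):
    """Keep the cells whose coordinates fall into the allowed value sets."""
    axes = {DIMENSIONS.index(d): values for d, values in allowed.items()}
    return {
        key: amount
        for key, amount in cube.items()
        if all(key[axis] in values for axis, values in axes.items())
    }


def drill_up(cube, dimension):
    """Aggregate one dimension away by summing over it."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[key[:axis] + key[axis + 1:]] += amount
    return dict(totals)


def pivot(cube, order):
    """Reorder the dimensions, i.e. rotate the cube to another view."""
    axes = [DIMENSIONS.index(d) for d in order]
    return {tuple(key[a] for a in axes): amount for key, amount in cube.items()}


north_only = dice(cube, {"year": {2016, 2017}, "region": {"north"}})
by_year_product = drill_up(cube, "region")
product_first = pivot(cube, ("product", "year", "region"))
```

Dicing keeps the dimensionality but shrinks the cube, drilling up removes a dimension, and pivoting changes only the orientation.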
Slicing (img: Wikipedia)
Dicing (img: Wikipedia)
Drill up & down (img: Wikipedia)
Pivoting (img: Wikipedia)
OLAP vs. “regular/modern” data analysis
OLAP cube: like a set of spreadsheets
o multi-dimensional
o interlinked
Modern data analysis: “flat” data frames
o Modern machine learning algorithms:
• require (?) single dataframes
Operations: basically the same (slicing, dicing, drill up & down, pivoting)
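To illustrate the relation between the two views, the sketch below flattens a small cube into data-frame-like rows and then slices with an ordinary row filter. The cube contents, the measure name `amount`, and the function `cube_to_rows` are assumptions made up for this example.

```python
def cube_to_rows(cube, dimensions, measure):
    """Flatten a cube into one row per cell: dimension columns plus the measure."""
    return [dict(zip(dimensions, key), **{measure: value}) for key, value in cube.items()]


# Hypothetical cube: (year, region, product) -> amount.
cube = {
    (2016, "north", "phone"): 5,
    (2017, "north", "phone"): 7,
    (2017, "north", "tablet"): 2,
}
rows = cube_to_rows(cube, ("year", "region", "product"), "amount")

# Slicing on the flat representation is a plain row filter.
rows_2017 = [row for row in rows if row["year"] == 2017]
```

Each cube cell becomes one tidy row, so the flat table carries the same information as the cube.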
CASE STUDY 3
„Deep insights from observations with the help of modern data analysis tools” – CECRIS IAPP project
Railway accidents: casualties by type of accident, Department for Transport Statistics, Rail Statistics, Table TSGB0805 (RAI0501)
(https://www.gov.uk/government/organisations/department-for-transport/series/rail-statistics)
Analysis: next class; for now, let us process the data
PowerBI data import
Load data to PowerQuery
Remove unnecessary top rows
Remove unnecessary bottom rows
Remove blank rows
Remove columns
Promote first row to header
Filter out “total” and “all” rows
Split first column
Replace empty values with null in first column
Replace empty values with null in second column
Remove colon character from first column
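The PowerQuery steps above can be approximated in plain Python to show what each transformation does. Everything concrete here is an assumption: the sheet content is invented and is not the Department for Transport table, the split delimiter `" - "` is a guess at how the first column is divided, and the output column names `category` and `subcategory` are hypothetical.

```python
def clean_sheet(rows, skip_top, skip_bottom, keep_columns, delimiter):
    """Apply the cleaning steps to a sheet given as a list of string lists."""
    # Remove unnecessary top and bottom rows.
    rows = rows[skip_top:len(rows) - skip_bottom]
    # Remove blank rows.
    rows = [r for r in rows if any(cell.strip() for cell in r)]
    # Remove columns that are not needed.
    rows = [[r[i] for i in keep_columns] for r in rows]
    # Promote the first row to header.
    header, body = rows[0], rows[1:]
    # Filter out "total" and "all" rows.
    body = [r for r in body if r[0].strip().lower() not in ("total", "all")]

    cleaned = []
    for r in body:
        # Split the first column into two columns.
        first, _, second = r[0].partition(delimiter)
        # Remove the colon character; replace empty values with null (None).
        first = first.replace(":", "").strip() or None
        second = second.strip() or None
        cleaned.append([first, second] + r[1:])
    return ["category", "subcategory"] + header[1:], cleaned


# Invented sheet with a title row, blank rows, a total row and a footer.
raw = [
    ["Invented table title", "", "", ""],
    ["", "", "", ""],
    ["Type", "2015", "2016", "note"],
    ["Collisions: - staff", "1", "2", "x"],
    ["", "", "", ""],
    ["Collisions: - passengers", "3", "4", ""],
    ["Total", "4", "6", ""],
    ["Source: invented", "", "", ""],
]
header, table = clean_sheet(
    raw, skip_top=2, skip_bottom=1, keep_columns=[0, 1, 2], delimiter=" - "
)
```

Writing the steps as one function makes the cleaning repeatable when a new version of the sheet arrives, which is the same motivation as automating the workflow in a tool.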
Automation: RapidMiner process
https://my.rapidminer.com/nexus/account/index.html#downloads
Read Excel
Read measurements
Filter rows
Split
Rename attributes
Loop attributes
Replace spaces
Replace colon character
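The idea of looping over attributes and applying the same replacement to each can be sketched in plain Python. This is not the RapidMiner operator: the function names, the record layout, and the assumption that spaces and colons are simply removed from the values are all made up for illustration.

```python
def loop_attributes(records, attributes, transform):
    """Apply transform to the string value of each listed attribute."""
    return [
        {
            name: (transform(value) if name in attributes and isinstance(value, str) else value)
            for name, value in record.items()
        }
        for record in records
    ]


def strip_spaces_and_colons(text):
    """Remove colon characters and spaces from a value."""
    return text.replace(":", "").replace(" ", "")


# Hypothetical records as they might look after reading and splitting.
records = [{"category": "Collisions:", "count": " 12", "year": 2016}]
cleaned = loop_attributes(records, {"category", "count"}, strip_spaces_and_colons)
```

Passing the transformation as a parameter keeps the loop reusable for other per-attribute fixes.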
Header row problem
To be kept
To be removed
(derived)