Budapest University of Technology and Economics, Department of Measurement and Information Systems

Budapest University of Technology and Economics, Fault Tolerant Systems Research Group

Data processing

Intelligent Data Analysis

http://www.mit.bme.hu/node/8036

Outline

Data format/representation

Data processing

ETL, workflow support

Outlook: OLAP

Case studies

Data science “process”

https://en.wikipedia.org/wiki/Data_science

DATA FORMAT

Tidy data

3 Simple rules to facilitate statistics and visualization

One variable – one column

One observation – one row

Each type of observational unit – one table

… seems to be trivial

… not true in most practical cases

… and even for statistical tools (e.g. output of R packages)

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. https://github.com/hadley/tidy-data

Data originally: long/wide

https://en.wikipedia.org/wiki/Wide_and_narrow_data

How to use these formats?

Sparse Screening for Exact Data Reduction. Jieping Ye, Arizona State University

Examples for tidy data

http://garrettgman.github.io/tidying/

R dataframe representation:

“tidying”

R: spread(data,key,value)
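
The reshaping that spread(data, key, value) performs can be sketched outside R as well. The following is a minimal Python illustration, not the tidyr implementation: the function name mirrors the R call, while the sample rows and column names are hypothetical. (In current tidyr releases, spread() is superseded by pivot_wider().)

```python
def spread(rows, key, value):
    """Turn long-format rows into wide-format rows.

    Each distinct entry of the `key` column becomes a new column that holds
    the matching entry of the `value` column. All remaining columns are
    treated as identifiers of the observation.
    """
    wide = {}
    for row in rows:
        # Columns other than key/value identify the observation.
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        wide.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(wide.values())


# Hypothetical long-format data: one row per (country, year, measure).
long_rows = [
    {"country": "A", "year": 2000, "key": "cases", "value": 10},
    {"country": "A", "year": 2000, "key": "population", "value": 1000},
    {"country": "B", "year": 2000, "key": "cases", "value": 20},
    {"country": "B", "year": 2000, "key": "population", "value": 4000},
]

tidy_rows = spread(long_rows, "key", "value")
for row in tidy_rows:
    print(row)
```

After the call, each observation (country, year) occupies one row and each variable (cases, population) one column, which is the tidy layout described earlier.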

Generalization?

Data restructuring examples (in R)

https://www.r-statistics.com/2012/01/aggregation-and-restructuring-data-from-r-in-action/

DATA STORAGE

Common data storage techniques

o Majority of inputs

o Length? Header? Encoding?

DB with a schema (in memory?)

Graph databases, ontologies, RDF…

Key-value stores (redis)

Time series databases (openTSDB, influxDB)

o Time series + metadata

“Data in motion”

o Streams as input for processing/analysis

Time series example: influxDB

Data: measurement

o Fields, tags, timestamp
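
To make the measurement / tags / fields / timestamp structure concrete, here is a small Python sketch that renders one data point in InfluxDB's text-based line protocol (`measurement,tag=value field=value timestamp`). The helper names and the sample point are hypothetical, and the escaping is simplified, so treat it as an illustration rather than a complete client.

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value: floats as-is, integers with an 'i' suffix,
    booleans as true/false, everything else as a quoted string."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: tags are indexed metadata, fields carry the values."""
    head = _escape(measurement, ", ")
    for k in sorted(tags):
        head += f",{_escape(k, ',= ')}={_escape(str(tags[k]), ',= ')}"
    body = ",".join(
        f"{_escape(k, ',= ')}={format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical point: CPU usage of one host, timestamp in nanoseconds.
line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "eu"},
    {"usage": 0.64, "cores": 4},
    1500000000000000000,
)
print(line)
```

The split matters for the data model: tags describe where a value comes from (the time series metadata), while fields hold the measured values themselves.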

Dashboards… (e.g. Grafana)

https://grafana.com/dashboards/1443

DATA PROCESSING WORKFLOW & TOOLS

“Extract-Transform-Load”

Originally: to fill a snowflake/star schema

In data science: create dataframes

Cleaning tasks

o Standardization

o Normalization

o Deduplication

o Enrichment

o Clear/fill NAs
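
The cleaning tasks listed above can be sketched on a toy record set. This Python example makes several assumptions that the slides do not state: standardization is read as bringing values to one common format, normalization as min-max rescaling to [0, 1], and NA filling as replacing missing values with the mean. The records and the lookup table are hypothetical.

```python
raw = [
    {"host": " Web01 ", "load": "0.50"},
    {"host": "web01", "load": "0.50"},   # duplicate once standardized
    {"host": "DB01", "load": ""},        # missing value (NA)
    {"host": "db02", "load": "1.50"},
]
site_of = {"web01": "Budapest", "db01": "Budapest", "db02": "Vienna"}  # hypothetical lookup

# Standardization: one spelling per host, numeric type, None marks an NA.
rows = [
    {"host": r["host"].strip().lower(),
     "load": float(r["load"]) if r["load"] else None}
    for r in raw
]

# Deduplication: keep the first occurrence of each identical record.
seen, unique = set(), []
for r in rows:
    key = (r["host"], r["load"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Fill NAs: replace missing loads with the mean of the known ones.
known = [r["load"] for r in unique if r["load"] is not None]
mean = sum(known) / len(known)
for r in unique:
    if r["load"] is None:
        r["load"] = mean

# Normalization: min-max rescaling to the [0, 1] range.
lo, hi = min(r["load"] for r in unique), max(r["load"] for r in unique)
for r in unique:
    r["load_norm"] = (r["load"] - lo) / (hi - lo) if hi > lo else 0.0

# Enrichment: attach information from another source.
for r in unique:
    r["site"] = site_of.get(r["host"])

for r in unique:
    print(r)
```

The order of the steps is a design choice: deduplicating after standardization catches records that differ only in spelling, and filling NAs before normalization keeps the rescaling defined for every row.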

Example data processing workflow (KNIME)

Steps: reading, filtering/aggregation, transformation, plotting, …

Status of the concrete execution

Measurement processing: RapidMiner

Delete unnecessary attribute

Calculating averages (interval)

Read CSV

Format conversion

Identifying source

Filter to cpu.usage.average

Add machine information
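
The operator labels above suggest a pipeline that can be approximated in a few lines of Python. This is a sketch, not the RapidMiner process itself: the order of the steps, the CSV layout, the column names, the machine lookup table, and the 60-second interval are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical measurement export; the real file layout is not shown in the slides.
CSV_TEXT = """timestamp,host,metric,value,unit
0,vm1,cpu.usage.average,10,percent
20,vm1,cpu.usage.average,30,percent
30,vm1,mem.usage.average,55,percent
70,vm1,cpu.usage.average,50,percent
80,vm2,cpu.usage.average,20,percent
"""
MACHINES = {"vm1": "rack-A", "vm2": "rack-B"}  # hypothetical machine information
INTERVAL = 60                                  # averaging window in seconds (assumed)

sums = defaultdict(lambda: [0.0, 0])
for row in csv.DictReader(io.StringIO(CSV_TEXT)):            # Read CSV
    if row["metric"] != "cpu.usage.average":                  # Filter to cpu.usage.average
        continue
    del row["unit"]                                           # Delete unnecessary attribute
    ts, value = int(row["timestamp"]), float(row["value"])    # Format conversion
    source = row["host"]                                      # Identifying source
    rack = MACHINES[source]                                   # Add machine information
    bucket = ts // INTERVAL * INTERVAL                        # start of the interval
    acc = sums[(source, rack, bucket)]
    acc[0] += value
    acc[1] += 1

# Calculating averages (interval)
averages = {key: total / count for key, (total, count) in sums.items()}
for key, avg in sorted(averages.items()):
    print(key, avg)
```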

CASE STUDY

Processing of telco data

SOME BACKGROUND… OLAP

On-Line Analytical Processing (img: snowplowanalytics.com)

Business intelligence approach

Extensively used since the early 2000s

o Still! (although not as popular as it was, at least in academic research)

Features

o Multi-dimensional analysis

o Fast query execution

o Exploratory analysis of data

• Support ad-hoc queries

o Report generation

o (Visualization)

On-Line Analytical Processing (img: snowplowanalytics.com)

Central concept: OLAP cube

o Multi-dimensional array:

• set of separate data

– Dimensionality >3

– technically a hypercube

– ~ a multi-dimensional spreadsheet

o Slicer: dimension held constant

• For a given query (e.g. sales in a particular year)

OLAP process (img: Pranav Joshi)

OLAP operations

Operations

o Slicing & dicing

o Drill up & down

o Pivoting

Easy to visualize by the cube itself

Slicing (img: Wikipedia)

Dicing (img: Wikipedia)

Drill up & down (img: Wikipedia)

Pivoting (img: Wikipedia)

OLAP vs. “regular/modern” data analysis

OLAP cube: like a set of spreadsheets

o multi-dimensional

o interlinked

Modern data analysis: “flat” data frames

o Modern machine learning algorithms:

• require (?) single dataframes

Operations: basically the same (slicing, dicing, drill up & down, pivoting)
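
To illustrate that the cube operations carry over to a flat table, the sketch below expresses slicing, dicing, drill up/down and pivoting on a plain list of records in Python. The sales data, the dimension names, and the helper function are hypothetical.

```python
from collections import defaultdict

# Hypothetical flat "data frame": one row per fact, dimensions as columns.
sales = [
    {"year": 2016, "quarter": "Q1", "region": "North", "product": "A", "amount": 10},
    {"year": 2016, "quarter": "Q2", "region": "North", "product": "B", "amount": 20},
    {"year": 2016, "quarter": "Q1", "region": "South", "product": "A", "amount": 30},
    {"year": 2017, "quarter": "Q1", "region": "North", "product": "A", "amount": 40},
    {"year": 2017, "quarter": "Q2", "region": "South", "product": "B", "amount": 50},
]


def aggregate(rows, dims):
    """Sum the amount over all rows that share the same values in `dims`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["amount"]
    return dict(totals)


# Slicing: hold one dimension constant (e.g. sales in a particular year).
slice_2016 = [r for r in sales if r["year"] == 2016]

# Dicing: restrict several dimensions at once, yielding a sub-cube.
dice = [r for r in sales if r["region"] == "North" and r["quarter"] == "Q1"]

# Drill down: year -> (year, quarter); drill up is the reverse aggregation.
by_year = aggregate(sales, ["year"])
by_quarter = aggregate(sales, ["year", "quarter"])

# Pivoting: regions as rows, years as columns.
pivot = defaultdict(dict)
for (region, year), total in aggregate(sales, ["region", "year"]).items():
    pivot[region][year] = total

print(by_year)
print(by_quarter)
print(dict(pivot))
```

In a data frame library the same operations map to row filters, group-by aggregations, and pivot tables.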

CASE STUDY

“Deep insights from observations with the help of modern data analysis tools” – CECRIS IAPP project

Railway accidents: casualties by type of accident, Department for Transport Statistics, Rail Statistics, Table TSGB0805 (RAI0501)

(https://www.gov.uk/government/organisations/department-for-transport/series/rail-statistics)

Analysis: next class, now let us process the data

PowerBI data import

Load data to PowerQuery

Remove unnecessary top rows

Remove unnecessary bottom rows

Remove blank rows

Remove columns

Promote first row to header

Filter “total” and “all” rows

Split first column

Replace empty values with null in first column

Replace empty values with null in second column

Remove colon character from first column
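
The same sequence of import steps can be sketched in plain Python. The sheet below is made up for illustration and is not the contents of Table TSGB0805; the row and column positions, the labels, and the choice to split the first column at the first space are assumptions.

```python
# Hypothetical raw sheet, as a list of rows with string cells.
raw = [
    ["Table title (hypothetical)", "", "", ""],
    ["", "", "", ""],
    ["Category: type", "Note", "2014", "2015"],
    ["Passengers: killed", "x", "3", ""],
    ["Passengers: injured", "x", "12", "9"],
    ["", "", "", ""],
    ["Staff: killed", "x", "1", "2"],
    ["All: total", "x", "16", "11"],
    ["Source: made-up numbers", "", "", ""],
]

rows = raw[2:]                                                # remove unnecessary top rows
rows = rows[:-1]                                              # remove unnecessary bottom rows
rows = [r for r in rows if any(cell.strip() for cell in r)]   # remove blank rows
rows = [[r[0]] + r[2:] for r in rows]                         # remove the "Note" column
header, body = rows[0], rows[1:]                              # promote first row to header

# Filter out the derived "total" and "all" rows.
body = [r for r in body if not r[0].lower().startswith(("all", "total"))]

table = []
for first, *values in body:
    group, _, kind = first.partition(" ")       # split first column at the first space
    group = group.replace(":", "") or None      # remove colon; empty value becomes None
    kind = kind or None                         # empty value becomes None
    record = {"group": group, "type": kind}
    for year, cell in zip(header[1:], values):
        record[year] = int(cell) if cell else None
    table.append(record)

for record in table:
    print(record)
```

Dropping the derived rows matters for later analysis: totals can be recomputed from the detail rows, whereas keeping them would double-count in any aggregation.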

Automation: RapidMiner process

https://my.rapidminer.com/nexus/account/index.html#downloads

Read Excel

Read measurements

Filter rows

Rename attributes

Loop attributes

Replace spaces

Replace colon character

Header row problem

To be kept

To be removed

(derived)
