Big Data Praktikum - dbs.uni-leipzig.de/file/bigprak_intro.pdf

Big Data Praktikum (Big Data Lab Course), Abteilung Datenbanken (Database Group), Summer Semester 2017


Page 2

Organization

Goal: design and implement an application or an algorithm using existing Big Data frameworks

Schedule

Attendance of the whole group is mandatory at all Testat (review) sessions

By early April: first meeting with the supervisor (request an appointment by e-mail)

End of May: Testat 1: getting to know the system / data import / solution sketch

Mid/end of July: Testat 2: present implementation and results

Early August: Testat 3: presentation

15 minutes per group

Attendance is mandatory for all lab course participants

Page 3

Technical Details

Source code: GitHub repository, group members => collaborators

Repositories will be forked to https://github.com/leipzig-bigdata-lab after the lab course

Java: Apache Maven 3 for project management

Test-driven development is encouraged; see the documentation on unit tests in the respective frameworks

Source code documentation is mandatory!

Use stable versions (consult the supervisor if in doubt), e.g. Flink 1.1.2

Solutions that run locally can be executed on a dedicated cluster

Arrange an appointment in early July with [email protected]

Datasets: https://github.com/caesar0301/awesome-public-datasets

Page 4

Is the globe really warming?

Yin-Chi Lin

Page 5

Mr Trump's top advisers are currently divided on the issue (Paris climate agreement), with some, including Environmental Protection Agency head Scott Pruitt, eager for the US to leave the deal. "Paris is something that we need to really look at closely, because it's something we need to exit, in my opinion," Mr Pruitt said in an interview with Fox News Channel's "Fox & Friends" last week. "It's a bad deal for America. It was an America second, third or fourth kind of approach."

Is the globe warming?

• If yes, since when and at what magnitude?

• Are there regional differences (e.g. between different continents, countries, climate zones …)?

• Are there seasonal differences?

• Is the rise in temperature really correlated with the increase in CO2 emissions?

• “Europe’s Atlantic-facing countries will suffer heavier rainfalls, greater flood risk, more severe storm damage and an increase in ‘multiple climatic hazards…’”

…….

IS THE GLOBE WARMING? Tuesday 18 April 2017

Page 6

Data Sources & Tools

GHCN-Daily dataset (Global Historical Climatology Network):

• 1763-2017

• more than 100,000 stations across the globe

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/

ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/

https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf

Global CO2 Emissions from Fossil-Fuel Burning, Cement Manufacture, and Gas Flaring:

• 1751-2014

http://cdiac.ornl.gov/ftp/ndp030/global.1751_2014.ems

Tools: SparkR + map visualization tool
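One possible first step towards the questions on the previous slide is to reduce the daily station records to one mean value per year and to correlate that series with the yearly emissions. The sketch below uses plain Python rather than SparkR. It assumes the comma-separated by_year layout (station ID, date as YYYYMMDD, element, value, followed by flags) and that TMAX is reported in tenths of a degree Celsius; both should be checked against the GHCN-Daily documentation. The station ID, the sample rows and the CO2 figures are invented for illustration.

```python
import csv
import io
import math
from collections import defaultdict


def yearly_mean_tmax(lines):
    """Return {year: mean TMAX in deg C} from GHCN-Daily by_year style rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip empty or truncated rows
        station, date, element, value = row[:4]
        if element != "TMAX":
            continue  # ignore precipitation, TMIN, snow, ...
        year = int(date[:4])
        sums[year] += int(value) / 10.0  # assumed unit: tenths of deg C
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sums}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented sample rows for a hypothetical station "XX000000001".
SAMPLE = io.StringIO(
    "XX000000001,20140101,TMAX,100,,,S,\n"
    "XX000000001,20140102,TMAX,120,,,S,\n"
    "XX000000001,20140102,PRCP,5,,,S,\n"
    "XX000000001,20150101,TMAX,130,,,S,\n"
    "XX000000001,20160101,TMAX,150,,,S,\n"
)
temps = yearly_mean_tmax(SAMPLE)
co2 = {2014: 1.0, 2015: 2.0, 2016: 4.0}  # invented emission values
years = sorted(temps)
r = pearson([temps[y] for y in years], [co2[y] for y in years])
```

A correlation coefficient alone does not establish causation, and a single station is not a global mean; on the real dataset the same aggregation would have to run per station and region before averaging.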

Page 7

Analysing Metabolic Networks in Gradoop

Anika Groß

Page 8

Analysing Metabolic Networks in Gradoop

• Modelling of metabolic networks and biochemical reactions in the EPGM (Extended Property Graph Model)

• Transformation and import into Gradoop

• Data from http://bigg.ucsd.edu

• Data analysis: "hub" molecules, pattern search, finding frequent subgraphs, …

[Figures taken from [1], [2] and http://bigg.ucsd.edu]

[1] Lanzeni, Messina, Archetti: Graph models and mathematical programming in biochemical network analysis and metabolic engineering design. Computers & Mathematics with Applications, 2008.

[2] Junghanns, Petermann: Verteilte Graphanalyse mit Gradoop. JavaSPEKTRUM 05/2016.
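The search for "hub" molecules can be illustrated without Gradoop: if reactions and metabolites form a bipartite graph, a hub is simply a metabolite that takes part in many reactions. The following is a minimal sketch in plain Python; the three reactions and the `min_degree` threshold are hypothetical, and the dictionary layout is not the BiGG or EPGM format.

```python
from collections import Counter

# Hypothetical toy reactions: each maps substrates to products.
REACTIONS = {
    "R1": {"substrates": ["glucose", "ATP"], "products": ["G6P", "ADP"]},
    "R2": {"substrates": ["G6P"], "products": ["F6P"]},
    "R3": {"substrates": ["F6P", "ATP"], "products": ["FBP", "ADP"]},
}


def metabolite_degrees(reactions):
    """Count in how many reactions each metabolite takes part."""
    degree = Counter()
    for reaction in reactions.values():
        # A set avoids counting a metabolite twice within one reaction.
        participants = set(reaction["substrates"]) | set(reaction["products"])
        degree.update(participants)
    return degree


def hubs(reactions, min_degree):
    """Return metabolites taking part in at least min_degree reactions."""
    degree = metabolite_degrees(reactions)
    return sorted(m for m, d in degree.items() if d >= min_degree)


hub_list = hubs(REACTIONS, min_degree=2)
```

In Gradoop the same idea would presumably be expressed as a degree aggregation over the imported graph; the sketch only shows the counting logic.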

Page 9

Analyzing Panama Papers with Gradoop

Eric Peukert

Page 10

Analyzing Panama Papers with Gradoop

• Loading Panama Papers with Gradoop (Neo4j connector or from CSV)

• Visualize schema

• Implement analytical workflows

• Optional: link with additional sources in Germany such as people/companies from DBpedia

Page 11

Analytics of Development Project Data

Eric Peukert

Page 12

Analytics of Development Project Data

https://issues.apache.org/jira/rest/api/2/project

Analytical Workflows

Page 13

Analysis of LOD datasets within Gradoop

Markus Nentwig

Page 14

Linked Open Data

• Structured, interlinked data using standard technologies

  • HTTP: dereferencing of entities

  • RDF: machine-readable data exchange format

  • URIs: identification of entities

Page 15

Gradoop: Distributed Graph Analytics

- Use graph operators like aggregation, grouping or subgraph to analyse data

- On top of Apache Flink

- Extended Property Graph Model

- Support for different data sources like CSV, JSON, …

- Currently missing: RDF

Page 16

Tasks:

[Figure: Data Source -> Gradoop Analysis -> Data Sink]

- Implement data source and data sink for RDF data format

  - Based on existing data sources

  - Import/export LOD data set

  - Handle RDF reification

- Analyze a given dataset with simple Gradoop operators
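The core decision of an RDF data source is how triples map onto a property graph. One common convention, assumed here and not prescribed by the slides, is that a triple with a literal object becomes a vertex property while a triple with a URI object becomes a labelled edge. The sketch below shows this mapping in plain Python for a simplified subset of N-Triples; it is not the Gradoop data source API, it ignores blank nodes, escapes, datatypes, language tags and reification, and the example.org URIs are invented.

```python
import re
from collections import defaultdict

# Simplified N-Triples pattern: URI subject, URI predicate, and a URI or
# plain-literal object. Blank nodes, escapes, datatypes and language tags
# are deliberately not handled in this sketch.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$')


def triples_to_graph(lines):
    """Map triples to (vertices, edges): literals become vertex properties,
    URI objects become labelled edges."""
    vertices = defaultdict(dict)  # URI -> {property key: value}
    edges = []                    # (source URI, label, target URI)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines and unsupported syntax
        subject, predicate, uri_object, literal = match.groups()
        if uri_object is not None:
            vertices[subject]      # make sure both endpoints exist
            vertices[uri_object]
            edges.append((subject, predicate, uri_object))
        else:
            vertices[subject][predicate] = literal
    return dict(vertices), edges


SAMPLE = [
    '<http://example.org/leipzig> <http://example.org/name> "Leipzig" .',
    '<http://example.org/leipzig> <http://example.org/locatedIn> <http://example.org/germany> .',
    '# a comment line',
]
vertices, edges = triples_to_graph(SAMPLE)
```

A data sink would run the mapping in reverse, writing one triple per property and one per edge. A real implementation would more likely rely on an existing RDF parser than on a regular expression.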

Page 17

Analytics of Publication Data with Graphulo

Matthias Kricke

Page 18

Analytics of Publication Data with Graphulo

Technologies

• Graphulo, which is based on

  • Apache Accumulo (distributed DBS)

  • Apache Hadoop HDFS (distributed file system)

Data

• DBLP

  • open bibliographic information on major computer science journals and proceedings

Task

• Import DBLP into Graphulo

• Analyze DBLP by means of Graphulo

  • Graph diameter?

  • Size of biggest connected component?

  • …
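To make the two example questions concrete, the following sketch computes the biggest connected component and its diameter with breadth-first search on a small in-memory graph. It is plain Python, not Graphulo code, and the co-author graph is hypothetical; on DBLP scale the one-BFS-per-vertex diameter computation would be far too slow, which is part of the motivation for a distributed system.

```python
from collections import deque


def bfs_distances(graph, start):
    """Hop distances from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def largest_component(graph):
    """Vertex set of the biggest connected component."""
    seen, best = set(), set()
    for vertex in graph:
        if vertex in seen:
            continue
        component = set(bfs_distances(graph, vertex))
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def diameter(graph, component):
    """Longest shortest path inside one component (one BFS per vertex)."""
    return max(max(bfs_distances(graph, v).values()) for v in component)


# Hypothetical undirected co-author graph as adjacency lists.
COAUTHORS = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
}
biggest = largest_component(COAUTHORS)
```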

Cluster Environment

Page 19

Classification of program traces using TensorFlow or Caffe

Martin Grimmer

Page 20

Classification of program traces using TensorFlow

• A program trace is the sequence of system calls of a program.

54 175 120 175 175 3 175 175 120 175 120 175 120 175 175 120 175 3 3 3 175 120 175 175 175 7 3 3 175 120 175 7 175 7 119 174 54 3 3 175 175 3 120 175 175 120 175 120 120 175 175 54 140 3 175 120 175 175 175 175 175 174 7 175 7 119 3 3 175 3 175 175 120 175 7 175 3 175 120 175 175 54 7 174 3 175 120 7 175 175 120 175 175 3 175 120 175 3 3 120 175 120 175 175 7 54 175 120 175 7 175 7 119 174 54 3 120 175 175 120 54 3 120 175 175 54 140 175 175 174 54 175 120 175 175 54 140

• TensorFlow is an open source library for artificial intelligence.

• https://www.tensorflow.org/

• The task: Build a classifier with TensorFlow that learns what is normal.

-> One-class classification problem!

Use this classifier to test unknown system traces for abnormal behavior.
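Before building a neural model it can help to have a simple baseline for the one-class setting. The sketch below is such a baseline and deliberately does not use TensorFlow: it memorises every sliding window of system calls seen in normal traces and scores a new trace by the fraction of windows it has never seen. The window size of 3, the class name `WindowModel` and the "odd" trace are assumptions for illustration; the training trace is the beginning of the example sequence on this slide.

```python
def windows(trace, size):
    """All contiguous system-call windows of the given size."""
    return [tuple(trace[i:i + size]) for i in range(len(trace) - size + 1)]


class WindowModel:
    """One-class baseline: remembers every window seen in normal traces."""

    def __init__(self, size=3):
        self.size = size
        self.known = set()

    def fit(self, normal_traces):
        for trace in normal_traces:
            self.known.update(windows(trace, self.size))
        return self

    def anomaly_score(self, trace):
        """Fraction of windows never seen during training (0.0 to 1.0)."""
        candidates = windows(trace, self.size)
        if not candidates:
            return 0.0
        unseen = sum(1 for w in candidates if w not in self.known)
        return unseen / len(candidates)


# Training data: the first system calls of the example trace.
NORMAL = [[54, 175, 120, 175, 175, 3, 175, 175, 120, 175, 120, 175]]
model = WindowModel(size=3).fit(NORMAL)
score_normal = model.anomaly_score([175, 120, 175, 175, 3])
score_odd = model.anomaly_score([9, 9, 9, 9])  # invented, never-seen calls
```

A learned model would replace the exact-match lookup with something that generalises, for example by predicting the next system call and flagging traces with low likelihood.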

Page 21

Speed up Entity Resolution with Bit Arrays

Ziad Sehili

Page 22

Entity Resolution

Find records in different databases that refer to the same real-world object

Build tokens:

trigrams(tommas schmidt) = {tom, omm, mma, mas, sch, chm, hmi, mid, idt}

trigrams(tomas schmidt) = {tom, oma, mas, sch, chm, hmi, mid, idt}

Jaccard similarity = #intersection_tokens / #union_tokens = 7/10
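The trigram example can be reproduced in a few lines. The sketch assumes, as the token sets on the slide suggest, that trigrams are built per word, so no trigram spans the space between first and last name; the function names are illustrative.

```python
def trigrams(text):
    """Set of character trigrams, built per word (no trigram spans a space)."""
    grams = set()
    for word in text.lower().split():
        grams.update(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left = trigrams("tommas schmidt")
right = trigrams("tomas schmidt")
similarity = jaccard(left, right)  # 7 shared trigrams out of 10 overall
```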

Page 23

Entity Resolution With Bit Arrays

[Figure: a hash function h maps each trigram to one position of a 20-bit array (positions 0 to 19)]

tommas schmidt: {tom, omm, mma, mas, sch, chm, hmi, mid, idt} -> h -> 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0

tomas schmidt: {tom, oma, mas, sch, chm, hmi, mid, idt} -> h -> 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0

Jaccard similarity = AND / OR = 7/9

Problems:

1. How to get similar/same quality as string comparison? (length of bit array to avoid collisions 303???) or increase the number of hash functions!!!

2. Does this method improve the runtime?
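The bit-array variant can be sketched with Python integers as bit arrays: every trigram sets one bit, and the Jaccard similarity becomes the number of set bits in the AND divided by the number of set bits in the OR. The slide does not say which hash function h is used; CRC32 modulo the array length is an assumption here, so the resulting bit patterns will differ from the ones in the figure.

```python
import zlib


def trigrams(text):
    """Character trigrams per word, as in the Entity Resolution example."""
    return {w[i:i + 3] for w in text.lower().split() for i in range(len(w) - 2)}


def to_bits(tokens, n_bits):
    """Set one bit per token; the position comes from a CRC32 hash."""
    bits = 0
    for token in tokens:
        position = zlib.crc32(token.encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits


def bit_jaccard(x, y):
    """popcount(x AND y) / popcount(x OR y) on integer bit arrays."""
    union = bin(x | y).count("1")
    if union == 0:
        return 1.0
    return bin(x & y).count("1") / union


N_BITS = 20  # length used in the figure; larger arrays collide less often
a = to_bits(trigrams("tommas schmidt"), N_BITS)
b = to_bits(trigrams("tomas schmidt"), N_BITS)
similarity = bit_jaccard(a, b)
```

Comparing `similarity` against the exact set-based value for growing `N_BITS` is one way to study the first problem listed above; timing both variants on many record pairs addresses the second.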

Page 24

Parameter Tuning for Entity-Resolution Problems

Victor Christen

Page 25

Parameter Tuning for Entity-Resolution Problems

• Quality depends on the similarity function sim_f

  • Determined by compared attributes and weights for each attribute combination

[Figure: two example value vectors over the attributes Name, Description and Price, (0.7, 0.4, 0.9) and (0.6, 0.4, 0.5), each feeding into sim_f]
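One simple form such a similarity function could take is a weighted average of per-attribute similarities combined with a match threshold. The sketch below assumes this form; it also assumes that the first value vector on the slide holds attribute similarities of one record pair, which the extracted text does not confirm. Both weight settings and the thresholds are hypothetical.

```python
def sim_f(similarities, weights):
    """Weighted average of per-attribute similarities (one possible sim_f)."""
    total_weight = sum(weights[attr] for attr in similarities)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(similarities[attr] * weights[attr] for attr in similarities)
    return weighted / total_weight


def is_match(similarities, weights, threshold):
    """Classify a record pair as a match if sim_f reaches the threshold."""
    return sim_f(similarities, weights) >= threshold


# Per-attribute similarities of one record pair (interpretation assumed).
pair = {"name": 0.7, "description": 0.4, "price": 0.9}
# Two hypothetical weight settings to compare.
equal_weights = {"name": 1.0, "description": 1.0, "price": 1.0}
name_heavy = {"name": 3.0, "description": 1.0, "price": 1.0}

score_equal = sim_f(pair, equal_weights)
score_name = sim_f(pair, name_heavy)
```

Parameter tuning then amounts to searching for the weights and threshold that give the best match quality on labelled pairs.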

Page 26

Classification problem

• Entity Resolution as classification problem

Task

• Determine a logistic regression classifier based on given similarity vectors and a training data set

• Evaluate different training data set sizes by determining quality, variance, …

Advanced

• Investigate the impact of different classifiers on the similarity vectors

Technology

• SparkML

  • Logistic Regression, K-Means

[Figure: plot with axes labelled name and description]
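To show what the classifier learns from similarity vectors, here is a minimal logistic regression trained with batch gradient descent. It is a plain Python sketch for illustration, not SparkML; the training vectors, labels, learning rate and epoch count are all invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, epochs=2000, rate=0.5):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(vectors[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - rate * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(vectors)
    return weights, bias


def predict(model, x):
    """Match probability for one similarity vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Invented training set: (name, description, price) similarities per pair,
# label 1 = same real-world object, 0 = different objects.
VECTORS = [(0.9, 0.8, 0.9), (0.8, 0.9, 0.7), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2)]
LABELS = [1, 1, 0, 0]
model = train(VECTORS, LABELS)
p_match = predict(model, (0.85, 0.9, 0.8))
p_non_match = predict(model, (0.15, 0.2, 0.1))
```

Repeating the training on random subsets of different sizes and recording precision and recall on held-out pairs gives the quality and variance figures the task asks for.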

Page 27

Graph-based Similarities for medical concepts

• Relatedness is determined by using a knowledge base such as an ontology

• Ontologies represent the backbone of the Semantic Web

  • Structure knowledge by defining concepts and relations between concepts, such as “Heart infarction”, “diabetes mellitus”, …

  • Hierarchical structure of concepts

Related?

Page 28

Concept Similarity

• Similarities based on basic measures

[Figure: concept graph with nodes A to L; "isa" edges lead from specialized concepts (e.g. Heart infarction) to general concepts (e.g. disease). Concepts are annotated with their measures, e.g. Measure Concept(C): #subsumer 2, #leaves 3]

Local depth-first search based implementation needs more than a day up to weeks!!!!

Page 29

Task & Requirements

Data

• Extracted directed acyclic graph (DAG) from the Unified Medical Language System

  • 2.2 million concepts, 2.9 million relations

Task

• Parallel traversal algorithm to determine the measures

Technology

• Apache Flink/Gelly
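The two basic measures can be defined precisely on a small example. The sketch below computes, for every concept of a toy DAG, the number of subsumers (all more general concepts) and the number of leaves below it, using memoised recursion so that shared sub-results are computed only once. It is sequential plain Python, not a Flink/Gelly program. The edges are invented and do not reproduce the figure, and whether a concept counts as its own subsumer or leaf is an assumption made explicit in the docstrings.

```python
from functools import lru_cache

# Hypothetical toy DAG: concept -> directly more general concepts (is-a).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["C"],
    "F": ["B", "E"],
}

# Invert the is-a relation once to walk towards specialized concepts.
CHILDREN = {concept: [] for concept in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(child)


@lru_cache(maxsize=None)
def subsumers(concept):
    """All strictly more general concepts (ancestors) of a concept."""
    result = set()
    for parent in PARENTS[concept]:
        result |= {parent} | subsumers(parent)
    return frozenset(result)


@lru_cache(maxsize=None)
def leaves(concept):
    """All leaf concepts reachable below a concept (a leaf yields itself)."""
    if not CHILDREN[concept]:
        return frozenset({concept})
    result = set()
    for child in CHILDREN[concept]:
        result |= leaves(child)
    return frozenset(result)


measures = {c: (len(subsumers(c)), len(leaves(c))) for c in PARENTS}
```

On a graph with millions of concepts the per-concept sets become large and the recursion deep, so a parallel version would more plausibly propagate the sets level by level in iterative supersteps instead of recursing.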

Page 30

Topic Overview

Topic | Framework | #Students | Supervisor
Is the globe really warming? | SparkR | 2 | Lin
Analysing Metabolic Networks in Gradoop | Gradoop/Flink | 2 | Groß
Analyzing Panama Papers with Gradoop | Gradoop/Flink | 2 | Peukert
Analytics of Development Project Data | Gradoop/Flink | 2 | Peukert
Analyse LOD datasets within Gradoop | Gradoop/Flink | 2 | Nentwig
Analytics of Publication Data with Graphulo | Apache Accumulo | 2 | Kricke
Classification of program traces using TensorFlow or Caffe | TensorFlow or Caffe/Python/C++ | 2 | Grimmer
Speed up Entity Resolution with Bit Arrays | - | 2 | Sehili
Graph-based Similarities for medical concepts | Flink | 2 | Christen
Parameter Tuning for Entity-Resolution Problems | SparkML | 2 | Christen