Fusion der Welten: Hadoop als DWH-Backend bei Pro7 · Fusion der Welten: Hadoop als DWH-Backend bei...

Fusion der Welten:Hadoop als DWH-Backend bei Pro7

Dr. Kathrin SpreyerBig Data Engineer

Open Source Business IntelligenceKarlsruhe, 20.02.2014

Agenda

1. Relationales DWH

2. Bewirtschaftung mit PDI

3. Unter der Haube: Hadoop

4. Datenimport mit Flume

5. Hive im Fachbereich

Architektur

| 20. März 2013 | ProSiebenSat.1 Digital GmbH | Business Intelligence | Jürgen Popp Page 18

Lösungsansatz Hybrides System aus relationaler Datenbank und Hadoop Cluster

Architektur

Relationales DWH

Datenquellen

1. DWH zur Integration von Reichweiten-, Vermarktungserlös- und Transaktionsdaten

Online TV Channel ProSiebenSat.1 Network Externe Mandanten

ProSiebenSat.1 Digital Wesentlicher Treiber der Digitalstrategie

Datenquellen

1. DWH zur Integration von Reichweiten-, Vermarktungserlös- und Transaktionsdaten

Online TV Channel ProSiebenSat.1 Network Externe Mandanten

ProSiebenSat.1 Digital Wesentlicher Treiber der Digitalstrategie

Datenquellen

1. MyVideo Apache Logs VV, VT, Ads, Player-Fehler, ..

2. Webtrekk VV, VT, Ads

3. DfP (Adserver) Ads

4. Google Analytics VV, PI, Visits, Visitors

5. Soziale Netzwerke Posts, Comments, Likes

6. Conviva Bitrate, Video startup time, ..

7. ....7

Metriken:

Datenmodell

1. Relationales Datenmodell (Sternschema)

Reporting

1. Momentan viel Excel

2. Im Aufbau: Reporting mit Business Objects

3. Empfänger: BI, Controlling, Management, ...

DWH-Bewirtschaftung

1. ETL-Tool Pentaho Data Integration (PDI)

1. Datenakquise

2. Validierung / Reinigung / Fehlerbehandlung

3. Aggregation

4. Export ins DWH

DWH-Bewirtschaftung

1. Konnektoren zu gängigen Daten-/Speicherformaten

1. Dateien: CSV, JSON, XML, ...

2. RDBMS: 40+ connection types

3. Big data: HDFS, Cassandra, HBase, MongoDB, ...

DWH-Bewirtschaftung

1. Konnektoren zu gängigen Daten-/Speicherformaten

1. Dateien: CSV, JSON, XML, ...

2. RDBMS: 40+ connection types

3. Big data: HDFS, Cassandra, HBase, MongoDB, ...

Pentaho Data Integration

Pentaho MapReduce

Hadoop-Backend

HDFS-Architektur

data nodes 03, 05, 08

name node

client nodedata node 01

data node 02

data node 03

data node 04

data node 05

data node 06

data node 07

data node 08

data node 09

data node 10

data node 11

data node 12

rack 1 rack 2 rack 3

blk 2 blk 3 blk 4blk 1

Where do I store block 1?

blk 1 (03, 05, 08)

blk 1(03, 05, 08)

www.inovex.de/trainings/offene-trainings/hadoop-training/

MapReduce-Verarbeitung

1. Aggregation der Rohdaten

2. Transparente Parallelisierung

3. Horizontal skalierbar

Java: Mapper und Reducer

public class WebtrekkEventMapper extends Mapper<Text, Text, Text, IntWritable> {

@Override protected void map( Text key, Text value, Context context ) ! throws IOException, InterruptedException {! // key contains entire record! String[] fields = key.toString().split( ";" );! // extract relevant information! String eventname = fields[12];! // emit output key and count! context.write( new Text( eventname ), ! ! new IntWritable( 1 )); }}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {! @Override protected void reduce( Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {! int sum = 0;! for ( IntWritable partialCount : values ) {! sum += partialCount.get();! }! context.write( key, new IntWritable( sum ) ); }}

Java: Mapper und Reducer

public class WebtrekkEventMapper extends Mapper<Text, Text, Text, IntWritable> {

@Override protected void map( Text key, Text value, Context context ) ! throws IOException, InterruptedException {! // key contains entire record! String[] fields = key.toString().split( ";" );! // extract relevant information! String eventname = fields[12];! // emit output key and count! context.write( new Text( eventname ), ! ! new IntWritable( 1 )); }}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {! @Override protected void reduce( Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {! int sum = 0;! for ( IntWritable partialCount : values ) {! sum += partialCount.get();! }! context.write( key, new IntWritable( sum ) ); }}

Mapper

Reducer

(+Driver)

PDI: Main, Mapper und Reducer

Driver

Mapper

Reducer

PDI: Mapper-Transformation

Pentaho MapReduce

1. Unterstützt alle PDI-Schritte

2. Weniger effizient als Java MR

3. (Fast) keine Java-Kenntnisse nötig

Java MapReduce

1. Native Hadoop-API

2. Unterstützung aller Hadoop-Features

3. Mehr Kontrolle, Optimierungsmöglichkeiten

1. Abstraktion über MapReduce

2. Übersetzung in MapReduce-Jobs

3. Nicht immer effizient

4. Praktisch für (Equi-) Joins

Datenimport

Datenimport mit Flume

1. Import via FTP-Server: Daten erst nach 24 Stunden im DWH

2. Apache Flume: kontinuierlicher Datenstrom

3. Automatische “Einsortierung” in HDFS-Verzeichnissez.B. nach Zeitstempel

Datenimport mit Flume

1. Simulation v. Datenströmen: spooling directory

2. Sonst TCP (netcat)

3. Bisher Single-Tier-Architektur

Datenvolumen

1. via Flume: jährlich 20 TB (netto)

2. via FTP: jährlich 21.5 TB (netto)

3. Archivierung: jährlich 500 GB (netto)

4. insgesamt: 40 TB (120 TB) pro Jahr

Clustergröße

1. dev: 4 DN + NN + SN + DB + Admin

2. prod: 6 DN + NN + SN + DB + 2 Admin

3. HDFS-Kapazität: 185 TB (brutto)

4. skalierbar

Clustergröße

1. dev: 4 DN + NN + SN + DB + Admin

2. prod: 6 DN + NN + SN + DB + 2 Admin

3. HDFS-Kapazität: 185 TB (brutto)

4. skalierbar

Hive im Fachbereich

Apache Hive

1. Bekanntes Interface: SQL

2. Teilweise Nutzung zur Aggregation im Backend

3. Hauptsächlich Eruierung neuer Metriken

4. Untersuchung von Daten-”Anomalien”

Beispiel-Query

SELECT mandant, day, count(*)FROM webtrekk_cust_para_click_2WHERE mandant = 'sat1' AND day BETWEEN '2013-01-01' AND '2013-01-31'GROUP BY sid, request_id, times, day, mandant HAVING count(*) >1

Hive-Zugriff auf Rohdaten

Zusammenfassung

inovex Academy

1. U.a. Hadoop-Entwickler-Training

2. 1-3 Tage

3. Inhouse oder offen

4. Offenes Trainings 2014:

18.-20. März (Köln)24.-26. Juni (München)18.-20. November (Karlsruhe)

www.inovex.de/trainings/offene-trainings/hadoop-training/

Fragen?

Fusion der Welten: Hadoop als DWH-Backend bei Pro7 · Fusion der Welten: Hadoop als DWH-Backend bei...

Documents

Transcript of Fusion der Welten: Hadoop als DWH-Backend bei Pro7 · Fusion der Welten: Hadoop als DWH-Backend bei...

Welten Institute Research Group on Technology Enhanced ... · Technology Enhanced Learning Welten Institute Research Group on Technology Enhanced Learning Prof. Dr. Marcus Specht

'DWH 5HYLHZ 'DWH

DWH | Larcom

Fusion der Welten: Hadoop als DWH-Backend bei ProSieben

DWH Basics

GoodFuels Isabel Welten - ITF · GoodFuels – Isabel Welten OECD International Transport Forum Decarbonizing Maritime Transport November 2018 . 0 200 400 600 800 1000 1200 Annual

DWH Concentrated

pr app dtl.aspx?param=wGEVXQ - loxwindowsanddoors.com · 3urgxfw $ssurydo 0hwkrg 0hwkrg 2swlrq ' 'dwh 6xeplwwhg 'dwh 9dolgdwhg 'dwh 3hqglqj )%& $ssurydo 'dwh $ssuryhg 6xppdu\ ri 3urgxfwv

Evidence of Cis Trans-Isomerization at Pro7/Pro16 …faculty.fiu.edu/~fernandf/pubs/104_116_2019_JASMS_Lasso...Evidence of Cis/Trans-Isomerization at Pro7/Pro16 in the Lasso Peptide

Telecom Dwh

DWH Testing

5HSRUW $OO 3RUWIROLRV %HJLQ 'DWH (QG 'DWH%hjlq 'dwh (qg 'dwh 20(6 5lvn 0jpw 5hyroylqj )xqg $6$ í

DWH Concepts

Die physikalische Welt und mögliche Welten H. J. Pirnerpir/uploads/Main/InterdisciplinaryPublications/... · Die physikalische Welt und mögliche Welten H. J. Pirner Institut für

Recurrent Neural Networks - Greg Mori - CMPT 419/726mori/courses/cmpt726/slides/chapterrnn_handout.pdf*DWH,QSXW *DWH)RUJHW* DWH,QSXW0RGXODW LRQ*DWH ... alent to introducing a Þxed

Dwh Record

DWH GLOSSORY

DWH Simplified

7(1'(5 ,' 12 & 'DWH '8( 'DWH +UV ,67 7HQGHU 2SHQLQJ 'DWH +UV … · 2020. 8. 20. · ,7, /7' 0$1.$385 ',67 *21'$ 8 3 ,1',$ 7(1'(5 ,' 12 & 'dwh '8( 'dwh +uv ,67 7hqghu 2shqlqj 'dwh

DWH Material

Recurrent Neural Networks - Greg Mori - CMPT 419/726mori/courses/cmpt726/slides/chapterrnn_handout.pdfDWH,QSXW DWH)RUJHW* DWH,QSXW0RGXODW LRQ*DWH ... alent to introducing a Þxed