Web Stats with Hive at Scoop.it

Web Stats with Hive

A case study with Scoop.it

Once upon a time…

And then Hadoop arrived…

Stats

Advanced stats

More stats

Events

• PageViewEvent
• PostCurationEvent
• SearchEvent
• CommentEvent
• ShareEvent
• TopicDeletionEvent
• UserDeletionEvent
• …

First version

• View counter: MySQL
• Visitor counter: MySQL
• Event storage: MySQL
• Stats by source: Google Analytics
• Stats by country: Google Analytics
• Event analysis: SQL

The price of success

• MySQL write rate
  – Quick fix:

• Storage space
• Google Analytics API: slow and approximate

Requirements

• View counter: computed in real time
• Visitor counter: computed once a day
• Event storage: flat files
• Stats by source: computed once a day
• Stats by country: computed once a day
• Event analysis: on demand and on a regular schedule

Solution

• View counter: Cassandra
• Visitor counter: Hive
• Event storage: ad hoc HDFS
• Stats by source: Hive
• Stats by country: Hive
• Event analysis: Hive

Cassandra vs HBase

• HBase:
  – "open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable"
  – "Bigtable-like capabilities on top of Hadoop and HDFS"
• Cassandra:
  – "a BigTable data model running on an Amazon Dynamo-like infrastructure"

Cassandra vs HBase

• Pro HBase
  – Hadoop cluster already deployed
  – Hive supports HBase

• Pro Cassandra
  – "Real-time" cluster vs "asynchronous" cluster
  – No SPOF (cf. Hadoop NameNode)
  – Operationally simple

Hive vs Pig

• Pig
  – "high-level language for expressing data analysis programs"
  – "compiler that produces sequences of Map-Reduce programs"
• Hive
  – "data warehouse system for Hadoop"
  – "query the data using a SQL-like language"

Hive vs Pig

• Pro Pig:
  – Closer to the Map-Reduce algorithm

• Pro Hive:
  – SQL-like

ad hoc HDFS vs Flume

• Flume
  – "distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data"

ad hoc HDFS vs Flume

• Pro Flume
  – Fault tolerant
  – Streaming
  – Scalable
  – Aggregation

• Anti Flume
  – Yet another technology to deploy
  – Yet another technology to learn
  – Data volume still "small"

Hive architecture

Source: http://www.javabloger.com/article/apache-hive-jdbc-mapreduce.html

Oozie

• "workflow scheduler system to manage Apache Hadoop jobs"

• Supports Hive

But:
• XML everywhere
• Project still in beta in 2011
• A "private" analytics page?

Scoop-mapred

• Spring MVC webapp
• Embedded Hive
• Quartz Scheduler

Architecture

[Diagram: scoop web frontend; Cassandra (view++); events; Scoop-mapred (getViews, getVisitors, getTopicSources); JobTracker ("Launch Jobs"); TaskTrackers ("Launch map & reduce tasks"); HDFS; Cassandra]

Map/Reduce computations

Via HiveQL

Hive: CREATE TABLE

CREATE TABLE httpdlogs (
  ip STRING, domain STRING, user STRING,
  date STRING, method STRING, request STRING,
  protocol STRING, status INT, bodySize INT,
  referer STRING, useragent STRING
);

LOAD DATA INPATH '/var/log/site_access.log' INTO TABLE httpdlogs;

SELECT status, COUNT(*)
FROM httpdlogs
WHERE referer = 'www.google.com'
GROUP BY status;
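To make the "SQL => Map/Reduce" idea concrete, here is a minimal in-memory sketch in Python of how a filter plus GROUP BY/COUNT(*) query like the one above can be split into a map phase and a reduce phase. It is an illustration only, not the plan Hive actually generates; the function names and `SAMPLE_ROWS` are hypothetical.

```python
from collections import defaultdict


def map_phase(rows, referer="www.google.com"):
    """Emit (status, 1) for each row that passes the WHERE filter."""
    for row in rows:
        if row["referer"] == referer:
            yield row["status"], 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, like GROUP BY status with COUNT(*)."""
    counts = defaultdict(int)
    for status, one in pairs:
        counts[status] += one
    return dict(counts)


def count_by_status(rows, referer="www.google.com"):
    """Chain both phases; a real cluster would shuffle pairs between them."""
    return reduce_phase(map_phase(rows, referer))


# Hypothetical sample rows, not taken from real logs.
SAMPLE_ROWS = [
    {"status": 200, "referer": "www.google.com"},
    {"status": 200, "referer": "www.google.com"},
    {"status": 404, "referer": "www.google.com"},
    {"status": 200, "referer": "www.example.org"},
]
```

Calling `count_by_status(SAMPLE_ROWS)` groups the three matching rows by status and ignores the row with a different referer.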

Hive: INSERT INTO TABLE

CREATE TABLE google_httpdlogs (ip STRING, user STRING, date STRING);

INSERT INTO TABLE google_httpdlogs
SELECT ip, user, date
FROM httpdlogs
WHERE referer LIKE '%google%';

SELECT * FROM google_httpdlogs WHERE date > '2012-01-15';

Hive: CREATE EXTERNAL TABLE

CREATE EXTERNAL TABLE PageViewEvent (
  date STRING, uri STRING, querystring STRING,
  useragent STRING, referer STRING, ip STRING, …
)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION '/events/PageViewEvent';

ALTER TABLE PageViewEvent ADD PARTITION (day='20121205')
LOCATION '/events/PageViewEvent/20121205';

SELECT COUNT(*)
FROM PageViewEvent
WHERE day = '20121205'
  AND date > '2012-12-05 12:00:00'
  AND date < '2012-12-05 13:00:00';
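Since each day of events lives in its own directory and has to be registered as a partition, a scheduled job could generate the ALTER TABLE statement from the date. The Python sketch below shows one way to do that, assuming the `/events/<EventType>/<yyyymmdd>` layout seen in the LOCATION clauses; the helper names are hypothetical and not taken from Scoop-mapred.

```python
from datetime import date

# Root directory as it appears in the LOCATION clauses of the slides.
EVENTS_ROOT = "/events"


def partition_key(day: date) -> str:
    """Format a date as the yyyymmdd partition value, e.g. '20121205'."""
    return day.strftime("%Y%m%d")


def partition_location(event_type: str, day: date) -> str:
    """Build the HDFS directory holding one day of one event type."""
    return f"{EVENTS_ROOT}/{event_type}/{partition_key(day)}"


def add_partition_statement(event_type: str, day: date) -> str:
    """Build the HiveQL statement that registers the day's directory."""
    key = partition_key(day)
    location = partition_location(event_type, day)
    return (
        f"ALTER TABLE {event_type} ADD PARTITION (day='{key}') "
        f"LOCATION '{location}';"
    )
```

Filtering on `day` in later queries then lets Hive read only the matching directory instead of scanning every event file.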

Hive: CREATE ‘Cassandra’ TABLE

CREATE EXTERNAL TABLE CassandraTopicVisitors (
  themeid BIGINT, day STRING, visitors INT
)
STORED BY 'org...cassandra.hadoop.hive.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  'cassandra.columns.mapping' = ':key,:column,:value',
  'cassandra.cf.name' = 'TopicVistors',
  'cassandra.host' = 'cassandra-1',
  'cassandra.port' = '9160'
)
TBLPROPERTIES ('cassandra.ks.name' = 'topic');

INSERT INTO TABLE CassandraTopicVisitors
SELECT themeid, '2012-12-05', COUNT(DISTINCT userid)
FROM PageViewEvent
WHERE day = '20121205'
GROUP BY themeid;
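The daily visitor count boils down to COUNT(DISTINCT userid) grouped by themeid. As a rough illustration of that semantics, here is a small in-memory Python sketch; the function name and the `(themeid, userid)` tuple layout are assumptions made for the example, and it does not reflect how Hive executes the query on the cluster.

```python
from collections import defaultdict


def visitors_per_topic(page_views):
    """Count distinct user ids per topic from (themeid, userid) pairs.

    Mirrors COUNT(DISTINCT userid) ... GROUP BY themeid: repeated views by
    the same user count once, and NULL (None) user ids are not counted.
    """
    seen = defaultdict(set)
    for themeid, userid in page_views:
        users = seen[themeid]  # creates the group even if userid is None
        if userid is not None:
            users.add(userid)
    return {themeid: len(users) for themeid, users in seen.items()}
```

For example, two views of topic 1 by the same user and one by another user yield 2 visitors for that topic, not 3.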

CassandraStorageHandler

• Patches:
  – https://issues.apache.org/jira/browse/CASSANDRA-913
  – https://issues.apache.org/jira/browse/HIVE-1434

• Writes: work great
• Reads: avoid, or test first

After 1 year of use

The machines

• 4 machines (Intel Xeon 1.87GHz, 8G RAM)

HDFS and small files

$ sudo hadoop-fuse-dfs dfs://namenode:8020 ro hdfs

$ du -sh hdfs/events
78G hdfs/events

$ ls -l hdfs/events/GrabbedPostEvent/20121201/grabbing.10.tsv.gz
-rw-r--r-- 1 99 99 2,6M 2012-12-01 07:04 hdfs/events/GrabbedPo[...]

$ ls -l hdfs/events/PageViewEvent/20121201/web-3.6.tsv.gz
-rw-r--r-- 1 99 99 960K 2012-12-01 04:47 hdfs/events/PageViewE[...]

$ du -sh hdfs/apache_logs
360G hdfs/apache_logs
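To put the file sizes above in perspective, the Python sketch below compares them with an HDFS block. The 64 MiB block size is an assumption (a common default at the time); the cluster's actual setting is not given in the slides, and the helper names are hypothetical.

```python
# Multipliers for the human-readable suffixes printed by du/ls.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# Assumption: 64 MiB blocks; the real cluster setting is not in the slides.
ASSUMED_BLOCK_SIZE = 64 * 1024 ** 2


def parse_size(text):
    """Parse sizes such as '960K' or '2,6M' (comma as decimal separator)."""
    number, unit = text[:-1], text[-1]
    return int(float(number.replace(",", ".")) * UNITS[unit])


def block_fill_ratio(size_text, block_size=ASSUMED_BLOCK_SIZE):
    """Return the fraction of one block that a file of this size fills."""
    return parse_size(size_text) / block_size
```

Under that assumption, both the 960K and the 2,6M files fill only a few percent of a block, which is the "small files" situation the slide title refers to.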

Big Data?

• Hadoop is oversized for the job
• Non-trivial architecture
• Non-trivial deployment

~
• Scalable
• Hive rocks: SQL => Map/Reduce
• Data mining / Recommendation
