hadoop&zing
![Page 1: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/1.jpg)
PRESENTER: HUNGVVW: http://me.zing.vn/hung.vo
2011-08
HADOOP & ZING
![Page 2: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/2.jpg)
AGENDA

- Using Hadoop in ZingRank
- Introduction to Hadoop, Hive
- A case study: Log Collecting, Analyzing & Reporting System
- Conclusion
![Page 3: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/3.jpg)
Hadoop & Zing

What
- It's a framework for large-scale data processing
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project – Hadoop is open source

Why
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work – you can focus on processing data
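The MapReduce model the slide refers to can be sketched in miniature. This is a hypothetical in-memory version for illustration only; real Hadoop distributes the map, shuffle, and reduce phases across many nodes.

```python
# Minimal in-memory sketch of the MapReduce idea (word count).
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop mapper.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Sum the counts for one key, like a Hadoop reducer.
    return key, sum(values)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                      # map + shuffle
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["hadoop hive", "hadoop zing"]))
# {'hadoop': 2, 'hive': 1, 'zing': 1}
```

Because the reducer only needs the values grouped per key, the framework can run mappers and reducers on different machines, which is what makes commodity-hardware scaling possible.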
![Page 4: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/4.jpg)
Data Flow into Hadoop

Web Servers → Scribe MidTier → Network Storage and Servers → Hadoop Hive Warehouse → MySQL
![Page 5: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/5.jpg)
Hive – Data Warehouse

A system for managing and querying structured data, built on top of Hadoop:
- Map-Reduce for execution
- HDFS for storage
- Metadata in an RDBMS

Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance
- Efficient SQL to Map-Reduce compiler
![Page 6: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/6.jpg)
Hive Architecture

- Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
- Hive QL pipeline: Parser → Planner → Optimizer → Execution
- SerDe: CSV, Thrift, Regex
- UDF/UDAF: substr, sum, average
- File formats: TextFile, SequenceFile, RCFile
- User-defined Map-Reduce scripts
- Runs on Map Reduce for execution and HDFS for storage
![Page 7: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/7.jpg)
Hive DDL

DDL supports complex columns, partitions, and buckets.

Example:

```sql
CREATE TABLE stats_active_daily (
  username STRING,
  userid INT,
  last_login INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```
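Each PARTITIONED BY column becomes one directory level on HDFS, which is what lets Hive skip whole partitions at query time. A sketch of the layout (the warehouse root path here is an assumed default, not taken from the slides):

```python
# Sketch: how a Hive partition maps to an HDFS directory.
# WAREHOUSE is a hypothetical default root; real paths depend on config.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, **partition_cols):
    # One "key=value" directory level per partition column, in order.
    parts = "/".join(f"{k}={v}" for k, v in partition_cols.items())
    return f"{WAREHOUSE}/{table}/{parts}"

print(partition_path("stats_active_daily", dt="2011-08-01", app="zingme"))
# /user/hive/warehouse/stats_active_daily/dt=2011-08-01/app=zingme
```

A query with `WHERE dt='2011-08-01'` then only reads files under that one directory instead of scanning the whole table.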
![Page 8: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/8.jpg)
Hive DML

Data loading:

```sql
LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*'
OVERWRITE INTO TABLE stats_login PARTITION (dt='$YESTERDAY', app='${APP}');
```

Insert data into Hive tables:

```sql
INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}')
SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
FROM stats_login
WHERE dt='$YESTERDAY' AND app='${APP}'
GROUP BY username, userid;
```
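To make the aggregation concrete, here is a plain-Python equivalent of what that INSERT computes per (username, userid): last login time, login count, and the number of "long session" logins, assuming login_type 3 marks a long session as in the query. The sample rows are invented for illustration.

```python
# Plain-Python equivalent of MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
# grouped by (username, userid). Sample data is made up.
from collections import defaultdict

def daily_active(login_rows):
    stats = defaultdict(lambda: [0, 0, 0])  # [last_login, num_login, num_longsession]
    for row in login_rows:
        s = stats[(row["username"], row["userid"])]
        s[0] = max(s[0], row["login_time"])          # MAX(login_time)
        s[1] += 1                                    # COUNT(1)
        s[2] += 1 if row["login_type"] == 3 else 0   # SUM(IF(login_type=3,1,0))
    return {k: tuple(v) for k, v in stats.items()}

rows = [
    {"username": "hung", "userid": 1, "login_time": 100, "login_type": 3},
    {"username": "hung", "userid": 1, "login_time": 200, "login_type": 1},
]
print(daily_active(rows))  # {('hung', 1): (200, 2, 1)}
```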
![Page 9: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/9.jpg)
Hive Query Language

SQL features supported:
- WHERE
- GROUP BY
- Equi-join
- Subquery in the FROM clause
![Page 10: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/10.jpg)
Multi-table Group-By/Insert

```sql
FROM user_information
INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid;
```
![Page 11: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/11.jpg)
File Formats

TextFile:
- Easy for other applications to write/read
- Gzip-compressed text files are not splittable

SequenceFile (http://wiki.apache.org/hadoop/SequenceFile):
- Only Hadoop can read it
- Supports splittable compression

RCFile, block-based columnar storage (https://issues.apache.org/jira/browse/HIVE-352):
- Uses the SequenceFile block format
- Columnar storage inside a block
- ~25% smaller compressed size
- On-par or better query performance, depending on the query
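The RCFile idea above can be sketched in a few lines: rows are grouped into blocks (row groups), and inside each block values are regrouped column by column, so a query can read only the columns it touches. This is an illustrative sketch of the layout, not the actual on-disk format.

```python
# Sketch of RCFile's layout: row groups, columnar inside each group.
def to_row_groups(rows, group_size):
    for i in range(0, len(rows), group_size):
        block = rows[i:i + group_size]
        # Transpose the block: one list per column.
        yield [list(col) for col in zip(*block)]

rows = [("a", 1), ("b", 2), ("c", 3)]
print(list(to_row_groups(rows, 2)))
# [[['a', 'b'], [1, 2]], [['c'], [3]]]
```

Storing similar values contiguously is also why the compressed size shrinks: a compressor sees long runs of one column's values instead of interleaved fields.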
![Page 12: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/12.jpg)
SerDe

Serialization/deserialization of the row format:
- CSV (LazySimpleSerDe)
- Thrift (ThriftSerDe)
- Regex (RegexSerDe)
- Hive binary format (LazyBinarySerDe)

LazySimpleSerDe and LazyBinarySerDe:
- Deserialize a field only when it is needed
- Reuse objects across different rows
- Text and binary formats
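The "lazy" part can be sketched like this: keep the raw row and only split it into fields on first access, so queries that never touch a row pay no parsing cost for it. This is an illustration of the principle, not Hive's actual implementation.

```python
# Sketch of lazy deserialization: parse the row only when a field is read.
class LazyRow:
    def __init__(self, raw, delimiter="\t"):
        self._raw = raw
        self._delim = delimiter
        self._fields = None          # not deserialized yet

    def field(self, index):
        if self._fields is None:     # first access pays the split cost
            self._fields = self._raw.split(self._delim)
        return self._fields[index]

row = LazyRow("hung\t42\t2011-08-01")
print(row.field(1))  # '42'
```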
![Page 13: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/13.jpg)
UDF/UDAF

Features:
- Use either Java or Hadoop objects (int, Integer, IntWritable)
- Overloading
- Variable-length arguments
- Partial aggregation for UDAF

Example UDF:

```java
public class UDFExampleAdd extends UDF {
  public int evaluate(int a, int b) {
    return a + b;
  }
}
```
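Partial aggregation, the UDAF feature listed above, means each mapper produces a mergeable partial state instead of raw rows. A sketch for AVG, where the partial state is a (sum, count) pair that merges associatively (illustrative, not Hive's UDAF API):

```python
# Sketch of UDAF partial aggregation for AVG: partial states merge
# associatively, so no single node needs to see all rows.
def partial(values):
    return (sum(values), len(values))   # per-mapper partial state

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])   # combine two partial states

def terminate(p):
    return p[0] / p[1]                  # final average

p = merge(partial([1, 2, 3]), partial([4, 5]))
print(terminate(p))  # 3.0
```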
![Page 14: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/14.jpg)
What do we use Hadoop for?

- Storing Zing Me core log data
- Storing Zing Me Game/App log data
- Storing backup data
- Processing/analyzing data with Hive
- Storing social data (feeds, comments, voting, chat messages, …) with HBase
![Page 15: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/15.jpg)
Data Usage

Statistics per day:
- ~300 GB of new data added
- ~800 GB of data scanned
- ~10,000 Hive jobs
![Page 16: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/16.jpg)
Where is the data stored?

Hadoop/Hive warehouse:
- 90 TB of data
- 20 nodes, 16 cores per node
- 16 TB per node
- Replication = 2
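A back-of-the-envelope check of those numbers: 20 nodes at 16 TB each gives the raw capacity, and with replication = 2 the 90 TB of data occupies twice its logical size.

```python
# Back-of-the-envelope capacity check for the figures above.
nodes, tb_per_node = 20, 16
data_tb, replication = 90, 2

raw_tb = nodes * tb_per_node        # total raw capacity
used_tb = data_tb * replication     # space actually occupied
print(raw_tb, used_tb, round(used_tb / raw_tb, 2))  # 320 180 0.56
```

So the cluster is a bit over half full, with headroom for the ~300 GB/day growth mentioned above.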
![Page 17: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/17.jpg)
Log Collecting, Analyzing & Reporting

Needs:
- A simple, high-performance framework for log collection
- Central, highly available & scalable storage
- An easy-to-use tool for data analysis (schema-based, SQL-like query, …)
- A robust framework for developing reports
![Page 18: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/18.jpg)
Version 1 (RDBMS-style)

- Log data goes directly into a MySQL database (master)
- Data is transformed into another MySQL database (off-load)
- Statistics queries run and export data into other MySQL tables
- Performance problems:
  - Slow log inserts, concurrent inserts
  - Slow query time on large datasets
![Page 19: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/19.jpg)
Version 2 (Scribe, Hadoop & Hive)

- Fast logging
- Acceptable query time on large datasets
- Data replication
- Distributed calculation
![Page 20: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/20.jpg)
Components:
- Log Collector
- Log/Data Transformer
- Data Analyzer
- Web Reporter

Process:
- Define logs
- Integrate logging (into the application)
- Analyze logs/data
- Develop reports
![Page 21: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/21.jpg)
Log Collector: Scribe
- A server for aggregating streaming log data
- Designed to scale to a very large number of nodes and to be robust to network and node failures
- Hierarchy of stores
- Thrift service using the non-blocking C++ server
- Thrift clients in C/C++, Java, PHP, …
![Page 22: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/22.jpg)
Log format (common)

Application-action log:

```
server_ip server_domain client_ip username actionid createdtime appdata execution_time
```

Request log:

```
server_ip request_domain request_uri request_time execution_time memory client_ip username application
```

Game action log:

```
time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
```
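Since the tables above use `FIELDS TERMINATED BY '\t'`, a log line is just a tab-delimited record of the fields listed above. A sketch of parsing one application-action line (the sample line and values are made up):

```python
# Sketch: parse one tab-delimited application-action log line into the
# fields listed above. The sample line is invented for illustration.
FIELDS = ["server_ip", "server_domain", "client_ip", "username", "actionid",
          "createdtime", "appdata", "execution_time"]

def parse_action_log(line):
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

line = "10.0.0.1\tme.zing.vn\t1.2.3.4\thung\t7\t1312156800\t-\t12\n"
print(parse_action_log(line)["username"])  # 'hung'
```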
![Page 23: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/23.jpg)
Scribe – file store

```ini
port=1463
max_msg_per_second=2000000
max_queue_size=10000000
new_thread_per_category=yes
num_thrift_server_threads=10
check_interval=3

# DEFAULT - write all other categories to /data/scribe_log
<store>
category=default
type=file
file_path=/data/scribe_log
base_filename=default_log
max_size=8000000000
add_newlines=1
rotate_period=hourly
#rotate_hour=0
rotate_minute=1
</store>
```
![Page 24: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/24.jpg)
Scribe – buffer store

```ini
<store>
category=default
type=buffer
target_write_size=20480
max_write_interval=1
buffer_send_rate=1
retry_interval=30
retry_interval_range=10
<primary>
type=network
remote_host=xxx.yyy.zzz.ttt
remote_port=1463
</primary>
<secondary>
type=file
fs_type=std
file_path=/tmp
base_filename=zmlog_backup
max_size=30000000
</secondary>
</store>
```
![Page 25: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/25.jpg)
Log/Data Transformer

- Helps import data from multiple source types into Hive
- Semi-automated
- Log files to Hive: LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …
- MySQL data to Hive: extract data using SELECT … INTO OUTFILE …, then import using LOAD DATA
![Page 26: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/26.jpg)
Data Analyzer

- Calculation using the Hive query language (HQL): SQL-like
- Data partitioning and query optimization are very important for speed:
  - distributed data reading
  - optimizing queries for one-pass data reading
- Automation: hive --service cli -f hql_file, Bash shell, crontab
- Export data and import into MySQL for the web report:
  - export with the Hadoop command line: hadoop fs -cat
  - import using LOAD DATA LOCAL INFILE … INTO TABLE …
![Page 27: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/27.jpg)
Web Reporter

- PHP web application
- Modular
- Standard format and template
- Charts with JpGraph
![Page 28: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/28.jpg)
Applications

Summarization:
- User/app indicators: active users, churn rate, logins, returns, …
- User demographics: age, gender, education, job, location, …
- User interactions / app actions

Also:
- Data mining
- Spam detection
- Application performance
- Ad-hoc analysis
- …
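One of the indicators above, churn rate, can be sketched as the share of users active on one day who did not return the next day. The definition and sample sets here are illustrative assumptions, not taken from the slides.

```python
# Sketch of a daily churn-rate indicator: fraction of day-1 active users
# who were not active on day 2. Sample user sets are made up.
def churn_rate(active_day1, active_day2):
    if not active_day1:
        return 0.0
    churned = active_day1 - active_day2   # active on day 1, gone on day 2
    return len(churned) / len(active_day1)

print(churn_rate({"a", "b", "c", "d"}, {"a", "b"}))  # 0.5
```

In the pipeline above, the active-user sets would come from a Hive query over the daily partitions, and the resulting rate would be loaded into MySQL for the web report.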
![Page 29: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/29.jpg)
THANK YOU!