hadoop&zing
![Page 1: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/1.jpg)
PRESENTER: HUNGVVW: http://me.zing.vn/hung.vo
2011-08
HADOOP & ZING
![Page 2: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/2.jpg)
AGENDA

- Using Hadoop in ZingRank
- Introduction to Hadoop, Hive
- A case study: Log Collecting, Analyzing & Reporting System
- Conclusion
![Page 3: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/3.jpg)
Hadoop & Zing

What
- It's a framework for large-scale data processing
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project – Hadoop is open source

Why
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work – you can focus on processing data
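The MapReduce model the slide refers to can be sketched in miniature. This is a hypothetical in-memory version for illustration only; real Hadoop distributes the map, shuffle, and reduce phases across many nodes.

```python
# Minimal in-memory sketch of the MapReduce idea (word count).
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop mapper.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Sum the counts for one key, like a Hadoop reducer.
    return key, sum(values)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                      # map + shuffle
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["hadoop hive", "hadoop zing"]))
# {'hadoop': 2, 'hive': 1, 'zing': 1}
```

Because the reducer only needs the values grouped per key, the framework can run mappers and reducers on different machines, which is what makes commodity-hardware scaling possible.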
![Page 4: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/4.jpg)
Data Flow into Hadoop

Web Servers → Scribe MidTier → Network Storage and Servers → Hadoop Hive Warehouse → MySQL
![Page 5: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/5.jpg)
Hive – Data Warehouse

A system for managing and querying structured data, built on top of Hadoop:
- Map-Reduce for execution
- HDFS for storage
- Metadata in an RDBMS

Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance
- Efficient SQL to Map-Reduce compiler
![Page 6: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/6.jpg)
Hive Architecture

- Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
- Hive QL pipeline: Parser → Planner → Optimizer → Execution
- SerDe: CSV, Thrift, Regex
- UDF/UDAF: substr, sum, average
- File formats: TextFile, SequenceFile, RCFile
- User-defined Map-Reduce scripts
- Runs on Map Reduce for execution and HDFS for storage
![Page 7: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/7.jpg)
Hive DDL

DDL supports complex columns, partitions, and buckets.

Example:

```sql
CREATE TABLE stats_active_daily (
  username STRING,
  userid INT,
  last_login INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```
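Each PARTITIONED BY column becomes one directory level on HDFS, which is what lets Hive skip whole partitions at query time. A sketch of the layout (the warehouse root path here is an assumed default, not taken from the slides):

```python
# Sketch: how a Hive partition maps to an HDFS directory.
# WAREHOUSE is a hypothetical default root; real paths depend on config.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, **partition_cols):
    # One "key=value" directory level per partition column, in order.
    parts = "/".join(f"{k}={v}" for k, v in partition_cols.items())
    return f"{WAREHOUSE}/{table}/{parts}"

print(partition_path("stats_active_daily", dt="2011-08-01", app="zingme"))
# /user/hive/warehouse/stats_active_daily/dt=2011-08-01/app=zingme
```

A query with `WHERE dt='2011-08-01'` then only reads files under that one directory instead of scanning the whole table.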
![Page 8: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/8.jpg)
Hive DML

Data loading:

```sql
LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*'
OVERWRITE INTO TABLE stats_login PARTITION (dt='$YESTERDAY', app='${APP}');
```

Insert data into Hive tables:

```sql
INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}')
SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
FROM stats_login
WHERE dt='$YESTERDAY' AND app='${APP}'
GROUP BY username, userid;
```
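To make the aggregation concrete, here is a plain-Python equivalent of what that INSERT computes per (username, userid): last login time, login count, and the number of "long session" logins, assuming login_type 3 marks a long session as in the query. The sample rows are invented for illustration.

```python
# Plain-Python equivalent of MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
# grouped by (username, userid). Sample data is made up.
from collections import defaultdict

def daily_active(login_rows):
    stats = defaultdict(lambda: [0, 0, 0])  # [last_login, num_login, num_longsession]
    for row in login_rows:
        s = stats[(row["username"], row["userid"])]
        s[0] = max(s[0], row["login_time"])          # MAX(login_time)
        s[1] += 1                                    # COUNT(1)
        s[2] += 1 if row["login_type"] == 3 else 0   # SUM(IF(login_type=3,1,0))
    return {k: tuple(v) for k, v in stats.items()}

rows = [
    {"username": "hung", "userid": 1, "login_time": 100, "login_type": 3},
    {"username": "hung", "userid": 1, "login_time": 200, "login_type": 1},
]
print(daily_active(rows))  # {('hung', 1): (200, 2, 1)}
```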
![Page 9: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/9.jpg)
Hive Query Language

SQL features supported:
- WHERE
- GROUP BY
- Equi-join
- Subquery in the FROM clause
![Page 10: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/10.jpg)
Multi-table Group-By/Insert

```sql
FROM user_information
INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY')
  SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid;
```
![Page 11: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/11.jpg)
File Formats

TextFile:
- Easy for other applications to write/read
- Gzip-compressed text files are not splittable

SequenceFile (http://wiki.apache.org/hadoop/SequenceFile):
- Only Hadoop can read it
- Supports splittable compression

RCFile, block-based columnar storage (https://issues.apache.org/jira/browse/HIVE-352):
- Uses the SequenceFile block format
- Columnar storage inside a block
- ~25% smaller compressed size
- On-par or better query performance, depending on the query
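The RCFile idea above can be sketched in a few lines: rows are grouped into blocks (row groups), and inside each block values are regrouped column by column, so a query can read only the columns it touches. This is an illustrative sketch of the layout, not the actual on-disk format.

```python
# Sketch of RCFile's layout: row groups, columnar inside each group.
def to_row_groups(rows, group_size):
    for i in range(0, len(rows), group_size):
        block = rows[i:i + group_size]
        # Transpose the block: one list per column.
        yield [list(col) for col in zip(*block)]

rows = [("a", 1), ("b", 2), ("c", 3)]
print(list(to_row_groups(rows, 2)))
# [[['a', 'b'], [1, 2]], [['c'], [3]]]
```

Storing similar values contiguously is also why the compressed size shrinks: a compressor sees long runs of one column's values instead of interleaved fields.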
![Page 12: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/12.jpg)
SerDe

Serialization/deserialization of the row format:
- CSV (LazySimpleSerDe)
- Thrift (ThriftSerDe)
- Regex (RegexSerDe)
- Hive binary format (LazyBinarySerDe)

LazySimpleSerDe and LazyBinarySerDe:
- Deserialize a field only when it is needed
- Reuse objects across different rows
- Text and binary formats
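The "lazy" part can be sketched like this: keep the raw row and only split it into fields on first access, so queries that never touch a row pay no parsing cost for it. This is an illustration of the principle, not Hive's actual implementation.

```python
# Sketch of lazy deserialization: parse the row only when a field is read.
class LazyRow:
    def __init__(self, raw, delimiter="\t"):
        self._raw = raw
        self._delim = delimiter
        self._fields = None          # not deserialized yet

    def field(self, index):
        if self._fields is None:     # first access pays the split cost
            self._fields = self._raw.split(self._delim)
        return self._fields[index]

row = LazyRow("hung\t42\t2011-08-01")
print(row.field(1))  # '42'
```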
![Page 13: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/13.jpg)
UDF/UDAF

Features:
- Use either Java or Hadoop objects (int, Integer, IntWritable)
- Overloading
- Variable-length arguments
- Partial aggregation for UDAF

Example UDF:

```java
public class UDFExampleAdd extends UDF {
  public int evaluate(int a, int b) {
    return a + b;
  }
}
```
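Partial aggregation, the UDAF feature listed above, means each mapper produces a mergeable partial state instead of raw rows. A sketch for AVG, where the partial state is a (sum, count) pair that merges associatively (illustrative, not Hive's UDAF API):

```python
# Sketch of UDAF partial aggregation for AVG: partial states merge
# associatively, so no single node needs to see all rows.
def partial(values):
    return (sum(values), len(values))   # per-mapper partial state

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])   # combine two partial states

def terminate(p):
    return p[0] / p[1]                  # final average

p = merge(partial([1, 2, 3]), partial([4, 5]))
print(terminate(p))  # 3.0
```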
![Page 14: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/14.jpg)
What do we use Hadoop for?

- Storing Zing Me core log data
- Storing Zing Me Game/App log data
- Storing backup data
- Processing/analyzing data with Hive
- Storing social data (feeds, comments, voting, chat messages, …) with HBase
![Page 15: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/15.jpg)
Data Usage

Statistics per day:
- ~300 GB of new data added
- ~800 GB of data scanned
- ~10,000 Hive jobs
![Page 16: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/16.jpg)
Where is the data stored?

Hadoop/Hive warehouse:
- 90 TB of data
- 20 nodes, 16 cores per node
- 16 TB per node
- Replication = 2
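A back-of-the-envelope check of those numbers: 20 nodes at 16 TB each gives the raw capacity, and with replication = 2 the 90 TB of data occupies twice its logical size.

```python
# Back-of-the-envelope capacity check for the figures above.
nodes, tb_per_node = 20, 16
data_tb, replication = 90, 2

raw_tb = nodes * tb_per_node        # total raw capacity
used_tb = data_tb * replication     # space actually occupied
print(raw_tb, used_tb, round(used_tb / raw_tb, 2))  # 320 180 0.56
```

So the cluster is a bit over half full, with headroom for the ~300 GB/day growth mentioned above.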
![Page 17: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/17.jpg)
Log Collecting, Analyzing & Reporting

Needs:
- A simple, high-performance framework for log collection
- Central, highly available & scalable storage
- An easy-to-use tool for data analysis (schema-based, SQL-like query, …)
- A robust framework for developing reports
![Page 18: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/18.jpg)
Version 1 (RDBMS-style)

- Log data goes directly into a MySQL database (master)
- Data is transformed into another MySQL database (off-load)
- Statistics queries run and export data into other MySQL tables
- Performance problems:
  - Slow log inserts, concurrent inserts
  - Slow query time on large datasets
![Page 19: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/19.jpg)
Version 2 (Scribe, Hadoop & Hive)

- Fast logging
- Acceptable query time on large datasets
- Data replication
- Distributed calculation
![Page 20: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/20.jpg)
Components:
- Log Collector
- Log/Data Transformer
- Data Analyzer
- Web Reporter

Process:
- Define logs
- Integrate logging (into the application)
- Analyze logs/data
- Develop reports
![Page 21: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/21.jpg)
Log Collector: Scribe
- A server for aggregating streaming log data
- Designed to scale to a very large number of nodes and to be robust to network and node failures
- Hierarchy of stores
- Thrift service using the non-blocking C++ server
- Thrift clients in C/C++, Java, PHP, …
![Page 22: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/22.jpg)
Log format (common)

Application-action log:

```
server_ip server_domain client_ip username actionid createdtime appdata execution_time
```

Request log:

```
server_ip request_domain request_uri request_time execution_time memory client_ip username application
```

Game action log:

```
time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
```
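Since the tables above use `FIELDS TERMINATED BY '\t'`, a log line is just a tab-delimited record of the fields listed above. A sketch of parsing one application-action line (the sample line and values are made up):

```python
# Sketch: parse one tab-delimited application-action log line into the
# fields listed above. The sample line is invented for illustration.
FIELDS = ["server_ip", "server_domain", "client_ip", "username", "actionid",
          "createdtime", "appdata", "execution_time"]

def parse_action_log(line):
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

line = "10.0.0.1\tme.zing.vn\t1.2.3.4\thung\t7\t1312156800\t-\t12\n"
print(parse_action_log(line)["username"])  # 'hung'
```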
![Page 23: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/23.jpg)
Scribe – file store

```ini
port=1463
max_msg_per_second=2000000
max_queue_size=10000000
new_thread_per_category=yes
num_thrift_server_threads=10
check_interval=3

# DEFAULT - write all other categories to /data/scribe_log
<store>
category=default
type=file
file_path=/data/scribe_log
base_filename=default_log
max_size=8000000000
add_newlines=1
rotate_period=hourly
#rotate_hour=0
rotate_minute=1
</store>
```
![Page 24: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/24.jpg)
Scribe – buffer store

```ini
<store>
category=default
type=buffer
target_write_size=20480
max_write_interval=1
buffer_send_rate=1
retry_interval=30
retry_interval_range=10
<primary>
type=network
remote_host=xxx.yyy.zzz.ttt
remote_port=1463
</primary>
<secondary>
type=file
fs_type=std
file_path=/tmp
base_filename=zmlog_backup
max_size=30000000
</secondary>
</store>
```
![Page 25: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/25.jpg)
Log/Data Transformer

- Helps import data from multiple source types into Hive
- Semi-automated
- Log files to Hive: LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …
- MySQL data to Hive: extract data using SELECT … INTO OUTFILE …, then import using LOAD DATA
![Page 26: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/26.jpg)
Data Analyzer

- Calculation using the Hive query language (HQL): SQL-like
- Data partitioning and query optimization are very important for speed:
  - distributed data reading
  - optimizing queries for one-pass data reading
- Automation: hive --service cli -f hql_file, Bash shell, crontab
- Export data and import into MySQL for the web report:
  - export with the Hadoop command line: hadoop fs -cat
  - import using LOAD DATA LOCAL INFILE … INTO TABLE …
![Page 27: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/27.jpg)
Web Reporter

- PHP web application
- Modular
- Standard format and template
- Charts with JpGraph
![Page 28: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/28.jpg)
Applications

Summarization:
- User/app indicators: active users, churn rate, logins, returns, …
- User demographics: age, gender, education, job, location, …
- User interactions / app actions

Also:
- Data mining
- Spam detection
- Application performance
- Ad-hoc analysis
- …
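One of the indicators above, churn rate, can be sketched as the share of users active on one day who did not return the next day. The definition and sample sets here are illustrative assumptions, not taken from the slides.

```python
# Sketch of a daily churn-rate indicator: fraction of day-1 active users
# who were not active on day 2. Sample user sets are made up.
def churn_rate(active_day1, active_day2):
    if not active_day1:
        return 0.0
    churned = active_day1 - active_day2   # active on day 1, gone on day 2
    return len(churned) / len(active_day1)

print(churn_rate({"a", "b", "c", "d"}, {"a", "b"}))  # 0.5
```

In the pipeline above, the active-user sets would come from a Hive query over the daily partitions, and the resulting rate would be loaded into MySQL for the web report.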
![Page 29: hadoop&zing](https://reader035.fdocuments.us/reader035/viewer/2022062511/54c6a9164a795973318b45b0/html5/thumbnails/29.jpg)
THANK YOU!