Hadoop & Zing
Transcript of Hadoop & Zing
PRESENTER: HUNGVVW: http://me.zing.vn/hung.vo
2011-08
HADOOP & ZING
AGENDA
Introduction to Hadoop, Hive
Using Hadoop in ZingRank
A case study: Log Collecting, Analyzing & Reporting System
Conclusion
Hadoop & Zing
What:
A framework for large-scale data processing
Inspired by Google's architecture: MapReduce and GFS
A top-level Apache project; Hadoop is open source
Why:
Fault-tolerant hardware is expensive
Hadoop is designed to run on cheap commodity hardware
It automatically handles data replication and node failure
It does the hard work, so you can focus on processing data
Data Flow into Hadoop
Web Servers -> Scribe MidTier -> Network Storage and Servers -> Hadoop Hive Warehouse -> MySQL
Hive – Data Warehouse
A system for managing and querying structured data, built on top of Hadoop:
MapReduce for execution
HDFS for storage
Metadata in an RDBMS
Key building principles:
SQL as a familiar data warehousing tool
Extensibility: types, functions, formats, scripts
Scalability and performance
An efficient SQL-to-MapReduce compiler
Hive Architecture
[Architecture diagram, summarized:]
Interfaces: Web UI + Hive CLI + JDBC/ODBC (browse, query, DDL)
Hive QL pipeline: Parser -> Planner -> Optimizer -> Execution
SerDe: CSV, Thrift, Regex
UDF/UDAF: substr, sum, average
FileFormats: TextFile, SequenceFile, RCFile
User-defined MapReduce scripts
Runs on: MapReduce + HDFS
Hive DDL
DDL features: complex columns, partitions, buckets
Example:
CREATE TABLE stats_active_daily (
  username STRING,
  userid INT,
  last_login INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Hive DML
Data loading:
LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*'
OVERWRITE INTO TABLE stats_login PARTITION (dt='$YESTERDAY', app='${APP}');
Inserting data into Hive tables:
INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}')
SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
FROM stats_login
WHERE dt='$YESTERDAY' AND app='${APP}'
GROUP BY username, userid;
Hive Query Language
SQL features: WHERE, GROUP BY, equi-join, subquery in the FROM clause
Multi-table Group-By/Insert
FROM user_information
INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid
File Formats
TextFile:
Easy for other applications to read/write
Gzip-compressed text files are not splittable
SequenceFile (http://wiki.apache.org/hadoop/SequenceFile):
Only Hadoop can read it
Supports splittable compression
RCFile – block-based columnar storage (https://issues.apache.org/jira/browse/HIVE-352):
Uses the SequenceFile block format
Columnar storage inside a block
~25% smaller compressed size
On-par or better query performance, depending on the query
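The file format is chosen per table with the STORED AS clause in Hive DDL. A minimal sketch of the three formats above (table and column names are hypothetical):

```sql
-- Plain text: readable by any tool, but gzip'd files are not splittable
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;

-- Hadoop-only binary format with splittable compression
CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;

-- Block-based columnar storage built on the SequenceFile block format
CREATE TABLE logs_rc (line STRING)
STORED AS RCFILE;
```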
SerDe
Serialization/Deserialization row formats:
CSV (LazySimpleSerDe)
Thrift (ThriftSerDe)
Regex (RegexSerDe)
Hive binary format (LazyBinarySerDe)
LazySimpleSerDe and LazyBinarySerDe:
Deserialize a field only when needed
Reuse objects across different rows
Text and binary formats
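A SerDe is selected per table with ROW FORMAT SERDE. A sketch using RegexSerDe to parse a simple access-log line; the table name, columns, and the regex are illustrative assumptions:

```sql
CREATE TABLE access_log (
  client_ip STRING,
  request   STRING,
  status    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column; illustrative pattern
  "input.regex" = "([^ ]*) \"([^\"]*)\" ([0-9]*)"
)
STORED AS TEXTFILE;
```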
UDF/UDAF
Features:
Use either Java or Hadoop objects (int, Integer, IntWritable)
Overloading
Variable-length arguments
Partial aggregation for UDAF
Example UDF:
public class UDFExampleAdd extends UDF {
  public int evaluate(int a, int b) {
    return a + b;
  }
}
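A UDF like the one above is registered from the Hive CLI before use. A sketch, assuming the class is compiled into a jar (the jar path and function name are hypothetical):

```sql
-- Register the jar and bind the class to a SQL-callable name
ADD JAR /tmp/udf_example.jar;
CREATE TEMPORARY FUNCTION example_add AS 'UDFExampleAdd';

-- Use it like any built-in function
SELECT example_add(userid, 1) FROM stats_login LIMIT 10;
```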
What do we use Hadoop for?
Storing Zing Me core log data
Storing Zing Me Game/App log data
Storing backup data
Processing/analyzing data with Hive
Storing social data (feeds, comments, voting, chat messages, …) with HBase
Data Usage
Statistics per day:
~300 GB of new data added
~800 GB of data scanned
~10,000 Hive jobs
Where is the data stored?
Hadoop/Hive warehouse:
90 TB of data
20 nodes, 16 cores per node
16 TB per node
Replication factor = 2
Log Collecting, Analyzing & Reporting
Needs:
A simple, high-performance framework for log collection
Central, highly available & scalable storage
An easy-to-use tool for data analysis (schema-based, SQL-like queries, …)
A robust framework for developing reports
Version 1 (RDBMS-style):
Log data goes directly into a MySQL database (master)
Data is transformed into another MySQL database (off-load)
Statistics queries run and export data into other MySQL tables
Performance problems:
Slow log inserts, concurrent inserts
Slow query times on large datasets
Log Collecting, Analyzing & Reporting
Version 2 (Scribe, Hadoop & Hive):
Fast logging
Acceptable query times on large datasets
Data replication
Distributed calculation
Log Collecting, Analyzing & Reporting
Components:
Log Collector
Log/Data Transformer
Data Analyzer
Web Reporter
Process:
Define the logs
Integrate logging (into the application)
Analyze the logs/data
Develop the reports
Log Collecting, Analyzing & Reporting
Log Collector – Scribe:
A server for aggregating streaming log data
Designed to scale to a very large number of nodes and be robust to network and node failures
Hierarchical stores
A Thrift service using the non-blocking C++ server
Thrift clients in C/C++, Java, PHP, …
Log Collecting, Analyzing & Reporting
Common log formats:
Application-action log:
server_ip server_domain client_ip username actionid createdtime appdata execution_time
Request log:
server_ip request_domain request_uri request_time execution_time memory client_ip username application
Game action log:
time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
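A log in one of these formats maps naturally onto a Hive table like the stats tables shown earlier. A sketch for the game action log; the field list follows the slide, but the column types and table name are assumptions:

```sql
CREATE TABLE game_action_log (
  log_time      INT,     -- "time" in the log; renamed to avoid keyword clashes
  username      STRING,
  actionid      INT,
  gameid        INT,
  goldgain      INT,
  coingain      INT,
  expgain       INT,
  itemtype      INT,
  itemid        INT,
  userid_affect INT,
  appdata       STRING
)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```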
Log Collecting, Analyzing & Reporting
Scribe – file store:
port=1463
max_msg_per_second=2000000
max_queue_size=10000000
new_thread_per_category=yes
num_thrift_server_threads=10
check_interval=3
# DEFAULT - write all other categories to /data/scribe_log
<store>
category=default
type=file
file_path=/data/scribe_log
base_filename=default_log
max_size=8000000000
add_newlines=1
rotate_period=hourly
#rotate_hour=0
rotate_minute=1
</store>
Log Collecting, Analyzing & Reporting
Scribe – buffer store:
<store>
category=default
type=buffer
target_write_size=20480
max_write_interval=1
buffer_send_rate=1
retry_interval=30
retry_interval_range=10
<primary>
type=network
remote_host=xxx.yyy.zzz.ttt
remote_port=1463
</primary>
<secondary>
type=file
fs_type=std
file_path=/tmp
base_filename=zmlog_backup
max_size=30000000
</secondary>
</store>
Log Collecting, Analyzing & Reporting
Log/Data Transformer:
Helps import data from multiple source types into Hive
Semi-automated
Log files to Hive: LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …
MySQL data to Hive: extract data using SELECT … INTO OUTFILE …, then import using LOAD DATA
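The MySQL-to-Hive path above can be sketched as two statements; the file path, columns, and table names are illustrative:

```sql
-- MySQL side: dump rows to a tab-delimited file
SELECT username, userid, genderid, dob
INTO OUTFILE '/tmp/user_information.tsv'
  FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM user_information;

-- Hive side: load the dump into a matching table
LOAD DATA LOCAL INPATH '/tmp/user_information.tsv'
OVERWRITE INTO TABLE user_information PARTITION (dt='$YESTERDAY');
```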
Log Collecting, Analyzing & Reporting
Data Analyzer:
Calculation using the Hive query language (HQL), which is SQL-like
Data partitioning and query optimization are very important for speed:
distributed data reading
optimize queries for one-pass data reading
Automation: hive --service cli -f hql_file, Bash shell, crontab
Export data and import it into MySQL for web reporting:
export with the Hadoop command line: hadoop fs -cat
import using LOAD DATA LOCAL INFILE … INTO TABLE …
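The reverse direction, Hive results into MySQL for the web reporter, can be sketched as follows; the file path and table name are illustrative:

```sql
-- After `hadoop fs -cat` has written the query result to a local
-- tab-delimited file (e.g. /tmp/report.tsv), load it on the MySQL side:
LOAD DATA LOCAL INFILE '/tmp/report.tsv'
INTO TABLE report_active_daily
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
```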
Log Collecting, Analyzing & Reporting
Web Reporter:
PHP web application
Modular
Standard format and templates
Charting with jpgraph
Log Collecting, Analyzing & Reporting
Applications:
Summarization:
User/app indicators: active users, churn rate, logins, returns, …
User demographics: age, gender, education, job, location, …
User interactions / app actions
Data mining:
Spam detection
Application performance
Ad-hoc analysis
…
Log Collecting, Analyzing & Reporting
THANK YOU!