Post on 22-May-2018
Apache Hive
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016 Adam Shook
What Is Hive?
• Developed by Facebook and a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java programmers via SQL-like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a traditional RDBMS
  – Even when dealing with relatively small data (<100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
  – Databases: namespaces that separate tables and other objects
  – Tables: homogeneous units of data with the same schema
    • Analogous to tables in an RDBMS
  – Partitions: determine how the data is stored
    • Allow efficient access to subsets of the data
  – Buckets/clusters
    • For subsampling within a partition
    • Join optimization
HiveQL
• HiveQL/HQL provides the basic SQL-like operations:
  – Select columns using SELECT
  – Filter rows using WHERE
  – JOIN between tables
  – Evaluate aggregates using GROUP BY
  – Store query results into another table
  – Download results to a local directory (i.e., export from HDFS)
  – Manage tables and queries with CREATE, DROP, and ALTER
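A few of the operations listed above can be sketched as follows. This is only an illustration: the `sales` table is assumed to have `region` and `amount` columns, and the `region_totals` table and the output path are hypothetical names chosen for the example.

```sql
-- Assumed tables (hypothetical schemas):
--   sales(region STRING, amount DOUBLE)
--   region_totals(region STRING, total DOUBLE)

-- Filter with WHERE, aggregate with GROUP BY,
-- and store the query results in another table.
INSERT OVERWRITE TABLE region_totals
SELECT region, SUM(amount)
FROM sales
WHERE amount > 10
GROUP BY region;

-- Export query results to a directory on the local file system.
-- The path is a placeholder.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/region_totals'
SELECT * FROM region_totals;
```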
Primitive Data Types
Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1, 2, 4 and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single and double precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes
Complex Data Types
Type      Comments
STRUCT    A collection of elements. If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP       Key-value tuple. If M is a map from 'group' to GID: M['group'] returns value of GID
ARRAY     Indexed list. If A is an array of elements ['a','b','c']: A[0] returns 'a'
HiveQL Limitations
• HQL only supports equi-joins, outer joins, and left semi-joins
• Because it is only a shell for MapReduce, complex queries can be hard to optimise
• Missing large parts of the full SQL specification (later releases fill some of these gaps):
  – HAVING clause in SELECT
  – Correlated sub-queries
  – Sub-queries outside FROM clauses
  – Updatable or materialized views
  – Stored procedures
Hive Metastore
• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
  – Embedded (in-process metastore, in-process database)
    • Mainly for unit tests
  – Local (in-process metastore, out-of-process database)
    • Each Hive client connects to the metastore directly
  – Remote (out-of-process metastore, out-of-process database)
    • Each Hive client connects to a metastore server, which connects to the metadata database itself
Hive Warehouse
• Hive tables are stored in the Hive “warehouse”
  – Default HDFS location: /user/hive/warehouse
• Tables are stored as sub-directories in the warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files
Hive Schemas
• Hive is schema-on-read
  – Schema is only enforced when the data is read (at query time)
  – Allows greater flexibility: same data can be read using multiple schemas
• Contrast with an RDBMS, which is schema-on-write
  – Schema is enforced when the data is loaded
  – Speeds up queries at the expense of load times
Create Table Syntax
CREATE TABLE table_name
(col1 data_type,
col2 data_type,
col3 data_type,
col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
More Complex Table
CREATE TABLE employees
(name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
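As a brief sketch of how the complex columns of `employees` might be queried, the statement below uses the access syntax from the Complex Data Types slide. The map key `'Federal Taxes'` is an assumed example value, not something defined in the slides.

```sql
-- Reads the complex-typed columns of the employees table defined above.
SELECT name,
       subordinates[0]             AS first_subordinate, -- ARRAY: zero-based index
       deductions['Federal Taxes'] AS federal_tax,       -- MAP: lookup by key (assumed key)
       address.city                AS city               -- STRUCT: dot notation
FROM employees
WHERE size(subordinates) > 0;  -- size() returns the number of array elements
```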
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
• CREATE TABLE
  – LOAD: file moved into Hive’s data warehouse directory
  – DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
  – LOAD: no files moved
  – DROP: only metadata deleted
  – Use this when sharing with other Hadoop applications, or when you want to use multiple schemas on the same data
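To illustrate reading the same data with more than one schema, the sketch below defines a second external table over the directory already used by `page_view_stg`. The table name `page_view_urls` is hypothetical. The sketch relies on the default text SerDe ignoring trailing fields that the schema does not declare.

```sql
-- A second, narrower schema over the same files as page_view_stg.
-- viewTime is read as a STRING here instead of an INT.
CREATE EXTERNAL TABLE page_view_urls
(viewTime STRING,
 userid BIGINT,
 page_url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

-- Both tables now read the same underlying files.
SELECT page_url FROM page_view_urls LIMIT 10;

-- Dropping an external table removes only its metadata;
-- the files stay in place and page_view_stg keeps working.
DROP TABLE page_view_urls;
```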
Partitioning
• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating table
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table’s partitions
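A minimal sketch of a partitioned table follows. The table name `page_view_part` and the choice of `dt` and `country` as partition columns are assumptions made for this example. In HiveQL the clause is spelled `PARTITIONED BY`, and partition columns are declared there rather than in the main column list.

```sql
-- Hypothetical partitioned variant of the page_view table.
CREATE TABLE page_view_part
(viewTime INT,
 userid BIGINT,
 page_url STRING,
 referrer_url STRING,
 ip STRING)
PARTITIONED BY (dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Once data has been loaded, list the partitions that exist.
SHOW PARTITIONS page_view_part;

-- Filtering on the partition columns lets Hive read only
-- the matching partition subdirectories.
SELECT page_url
FROM page_view_part
WHERE dt = '2008-06-08' AND country = 'US';
```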
Bucketing
• Can speed up queries that involve sampling the data
  – Sampling works without bucketing, but Hive has to scan the entire dataset
• Use CLUSTERED BY when creating table
  – For sorted buckets, add SORTED BY
• To query a sample of your data, use TABLESAMPLE
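The clauses above might be combined as in the sketch below. The table name `page_view_bucketed`, the bucketing column, and the bucket count of 32 are illustrative assumptions.

```sql
-- Hypothetical bucketed table: rows are hashed on userid into 32 buckets,
-- and each bucket is kept sorted by viewTime.
CREATE TABLE page_view_bucketed
(viewTime INT,
 userid BIGINT,
 page_url STRING,
 referrer_url STRING,
 ip STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Sample roughly 1/32 of the data by reading a single bucket
-- instead of scanning the whole table.
SELECT *
FROM page_view_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```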
Browsing Tables And Partitions
Command                                            Comments
SHOW TABLES;                                       Show all the tables in the database
SHOW TABLES 'page.*';                              Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                         Show the partitions of the page_view table
DESCRIBE page_view;                                List columns of the table
DESCRIBE EXTENDED page_view;                       More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');    List information about a partition
Loading Data
• Use LOAD DATA to load data from a file or directory
  – Will read from HDFS unless LOCAL keyword is specified
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (dt='2008-06-08', country='US');
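For contrast, a sketch of the HDFS variant is shown here. The input path is a placeholder, and the partition columns are assumed to be `dt` and `country`, as in the INSERT examples for `page_view`.

```sql
-- No LOCAL keyword: the path refers to HDFS, and the file is moved
-- into the table's directory rather than copied.
-- No OVERWRITE keyword: the data is appended to the partition.
LOAD DATA INPATH '/user/staging/pv_2008-06-09_us.txt'
INTO TABLE page_view
PARTITION (dt='2008-06-09', country='US');
```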
Inserting Data
• Use INSERT to load data from a Hive query
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
Inserting Data
• Normally only one partition can be inserted into with a single INSERT
• A multi-insert lets you insert into multiple partitions
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'UK';
Inserting Data During Table Creation
• Use AS SELECT in the CREATE TABLE statement to populate a table as it is created
CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';
Loading And Inserting Data: Summary
Use this                    For this purpose
LOAD                        Load data from a file or directory
INSERT                      Load data from a query
                            • One partition at a time
                            • Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)      Insert data while creating a table
Add/modify external file    Load new data into external table
Sample Select Clauses
• Select from a single table
SELECT *
FROM sales
WHERE amount > 10 AND region = "US";
• Select from a partitioned table
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31';
Relational Operators
• ALL and DISTINCT
  – Specify whether duplicate rows should be returned
  – ALL is the default (all matching rows are returned)
  – DISTINCT removes duplicate rows from the result set
• WHERE
  – Filters by expression
  – Does not support IN, EXISTS or sub-queries in the WHERE clause (before Hive 0.13)
• LIMIT
  – Indicates the number of rows to be returned
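These operators can be combined in a single query, as in the short sketch below. It assumes the `sales` table from the earlier SELECT example has `region` and `amount` columns.

```sql
-- WHERE filters rows, DISTINCT removes duplicate regions from the result,
-- and LIMIT caps the number of rows returned.
SELECT DISTINCT region
FROM sales
WHERE amount > 10
LIMIT 5;
```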
Relational Operators
• GROUP BY
  – Group data by column values
  – Select statement can only include columns included in the GROUP BY clause, plus aggregate expressions
• ORDER BY / SORT BY
  – ORDER BY performs total ordering
    • Slow, poor performance
  – SORT BY performs partial ordering
    • Sorts output from each reducer
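The difference between these clauses can be sketched as follows, again assuming a `sales` table with `region` and `amount` columns. `DISTRIBUTE BY` is not covered in the slides; it appears here only to show how rows can be routed to reducers before `SORT BY` is applied.

```sql
-- GROUP BY: every non-aggregated column in the SELECT list
-- also appears in the GROUP BY clause.
SELECT region, COUNT(*) AS num_sales, SUM(amount) AS total
FROM sales
GROUP BY region;

-- ORDER BY: total ordering; all rows pass through a single reducer.
SELECT region, amount
FROM sales
ORDER BY amount DESC;

-- SORT BY: partial ordering; each reducer sorts only its own output.
-- DISTRIBUTE BY sends all rows for a region to the same reducer.
SELECT region, amount
FROM sales
DISTRIBUTE BY region
SORT BY region, amount DESC;
```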
Advanced Hive Operations
• JOIN
  – If the same column of each table is used in every join condition, then only one MapReduce job will run
    • This results in 1 MapReduce job:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key
    • This results in 2 MapReduce jobs:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key
  – If multiple tables are joined, put the biggest table last and the reducer will stream the last table, buffer the others
  – Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
Advanced Hive Operations
• JOIN
  – Do not specify join conditions in the WHERE clause
    • Hive does not know how to optimise such queries
    • Will compute a full Cartesian product before filtering it
• Join Example
SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND a.ymd > '2010-01-01';
Hive Stinger
• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to SQL on Hadoop
References
• http://hive.apache.org