SQL In The Big Data Era

Globalcode – Open4education Big Data – SQL In The Big Data Era Rafael Aguiar Data Science Engineer @InLocoMedia

Transcript of SQL In The Big Data Era

Page 1: SQL In The Big Data Era

Globalcode – Open4education

Big Data – SQL In The Big Data Era

Rafael Aguiar

Data Science Engineer @InLocoMedia

Page 2: SQL In The Big Data Era

Agenda

Context
Definition of Big Data
A map of the ecosystem
Apache Hive
Apache Hue
Where to start

Page 3: SQL In The Big Data Era

Mobile ad network based on high-precision location (1-3 m)

Terabytes of compressed data per month
How can we understand visit patterns?
How can we recommend better ads?

Page 4: SQL In The Big Data Era

Big Data

“Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

McKinsey (2011)

Page 5: SQL In The Big Data Era

Ecosystem

Page 6: SQL In The Big Data Era

Apache Hive

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides:

Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats.
Query execution via Apache Tez, Apache Spark, or MapReduce.
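As a rough illustration of imposing structure on files that already sit in distributed storage, the sketch below declares an external table over tab-delimited text and queries it with plain SQL. The table name page_visits, its columns, and the path /data/raw/page_visits are hypothetical, and the engine setting assumes Tez is installed on the cluster.

```sql
-- Hypothetical example: table name, columns, and HDFS path are illustrative only.
-- EXTERNAL means Hive only records the schema; dropping the table leaves the files in place.
CREATE EXTERNAL TABLE IF NOT EXISTS page_visits (
  user_id    STRING,
  url        STRING,
  visited_at BIGINT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/page_visits';

-- Pick the execution engine for this session (mr, tez, or spark).
-- Assumption: Tez is available on the cluster.
SET hive.execution.engine=tez;

-- The ten most visited URLs.
SELECT url, count(*) AS visits
FROM page_visits
GROUP BY url
ORDER BY visits DESC
LIMIT 10;
```

The same query text runs unchanged on any of the three engines; only the SET line changes.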

Page 7: SQL In The Big Data Era

Apache Hive

Page 8: SQL In The Big Data Era

Apache Hive

When should you use Hive?
You already know SQL and want to start processing large datasets without a headache.
You need to run a job quickly and do not have the time to write clean, optimized code.

Page 9: SQL In The Big Data Era

Apache Hive

-- Table of event participants, backed by CSV text files.
CREATE TABLE tdc_participants (
  name STRING,
  age INT,
  skills ARRAY<STRING>,
  likes_beer BOOLEAN,
  home_town STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "'",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE;

-- Per home town, count the participants who list "big-data" as a skill and like beer.
SELECT home_town, count(*)
FROM tdc_participants
WHERE array_contains(skills, "big-data")
  AND likes_beer = TRUE
GROUP BY home_town;
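One caveat worth checking against your Hive version: according to the Hive documentation for the CSV SerDe, OpenCSVSerde exposes every column as STRING regardless of the declared type, so typed columns such as skills and likes_beer may not behave as declared. A possible workaround, sketched below, is to keep a raw all-STRING table and put a casting view on top of it. The names tdc_participants_raw and tdc_participants_typed, and the assumption that skills are pipe-separated inside a single CSV field, are hypothetical.

```sql
-- Hypothetical raw table: every column is STRING, which is what OpenCSVSerde yields anyway.
CREATE TABLE tdc_participants_raw (
  name STRING,
  age STRING,
  skills STRING,      -- assumption: pipe-separated, e.g. big-data|sql|java
  likes_beer STRING,  -- assumption: the text 'true' or 'false'
  home_town STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "'",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE;

-- Hypothetical view that converts the raw strings to the intended types.
CREATE VIEW tdc_participants_typed AS
SELECT
  name,
  CAST(age AS INT)             AS age,
  split(skills, '\\|')         AS skills,      -- STRING -> ARRAY<STRING>
  lower(likes_beer) = 'true'   AS likes_beer,  -- explicit comparison instead of a string-to-boolean cast
  home_town
FROM tdc_participants_raw;

-- The original query then runs against the typed view.
SELECT home_town, count(*)
FROM tdc_participants_typed
WHERE array_contains(skills, 'big-data')
  AND likes_beer = TRUE
GROUP BY home_town;
```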

Page 10: SQL In The Big Data Era

Apache Hive

-- Register the Esri spatial UDFs used below.
CREATE TEMPORARY FUNCTION st_linestring AS "com.esri.hadoop.hive.ST_LineString";
CREATE TEMPORARY FUNCTION st_setsrid AS "com.esri.hadoop.hive.ST_SetSRID";
CREATE TEMPORARY FUNCTION st_geodesiclengthwgs84 AS "com.esri.hadoop.hive.ST_GeodesicLengthWGS84";

CREATE TABLE location (id STRING, lat DOUBLE, lng DOUBLE, epoch BIGINT) {...};

SET hivevar:PLACE_OF_INTEREST = named_struct("lat", 1.0, "lng", 1.0);
SET hivevar:MAX_DISTANCE = 10;
SET hivevar:SPATIAL_REF_ID = 4326;

-- Count the distinct ids seen within MAX_DISTANCE (in meters) of the place of interest.
-- The distance is the geodesic length of the line from the place of interest to each point.
SELECT count(DISTINCT id)
FROM location
WHERE location.lat IS NOT NULL
  AND location.lng IS NOT NULL
  AND st_geodesiclengthwgs84(
        st_setsrid(
          st_linestring(
            ${hivevar:PLACE_OF_INTEREST}.lng, ${hivevar:PLACE_OF_INTEREST}.lat,
            location.lng, location.lat),
          ${hivevar:SPATIAL_REF_ID})) < ${hivevar:MAX_DISTANCE};
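The query above calls the spatial functions for every row. One possible refinement, offered only as a sketch and not part of the original talk, is to discard most rows first with a cheap bounding-box comparison and run the exact geodesic test on what remains. The variables POI_LAT, POI_LNG, and BOX_DEGREES are hypothetical. The margin of 0.001 degrees is an assumption: it is roughly 111 m of latitude, and a degree of longitude covers less ground away from the equator, so the margin should be re-derived for the latitudes in your data.

```sql
-- Assumes the CREATE TEMPORARY FUNCTION statements and the location table
-- from the example above already exist in this session.
SET hivevar:POI_LAT = 1.0;
SET hivevar:POI_LNG = 1.0;
SET hivevar:BOX_DEGREES = 0.001;  -- assumption: generous margin around a 10 m radius
SET hivevar:MAX_DISTANCE = 10;
SET hivevar:SPATIAL_REF_ID = 4326;

SELECT count(DISTINCT id)
FROM location
-- Cheap pre-filter: plain numeric comparisons, which also drop NULL coordinates.
WHERE location.lat BETWEEN ${hivevar:POI_LAT} - ${hivevar:BOX_DEGREES}
                       AND ${hivevar:POI_LAT} + ${hivevar:BOX_DEGREES}
  AND location.lng BETWEEN ${hivevar:POI_LNG} - ${hivevar:BOX_DEGREES}
                       AND ${hivevar:POI_LNG} + ${hivevar:BOX_DEGREES}
  -- Exact test: geodesic length of the line from the place of interest to the point.
  AND st_geodesiclengthwgs84(
        st_setsrid(
          st_linestring(
            ${hivevar:POI_LNG}, ${hivevar:POI_LAT},
            location.lng, location.lat),
          ${hivevar:SPATIAL_REF_ID})) < ${hivevar:MAX_DISTANCE};
```

The result should match the original query as long as the box fully contains the search radius; whether it is actually faster depends on the data and the execution engine.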

Page 11: SQL In The Big Data Era

Apache Hue

http://demo.gethue.com/

Page 12: SQL In The Big Data Era

Where to start

https://hive.apache.org/
http://gethue.com/
Programming Hive, by Edward Capriolo
https://github.com/Prokopp/the-free-hive-book

Page 13: SQL In The Big Data Era

Rafael Aguiar
[email protected]
@rafadaguiar

#TDCHive

Page 14: SQL In The Big Data Era

Thank you!