Pig and Python to Process Big Data
Big Data with Pig and Python
Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013
Tuesday, April 9, 13
About Me
• Mathematician/Physicist turned Consultant
• Graduate Student in CS at UNO
• Current Software Engineer at Sojern
Working with Big Data
What is Big Data?
• Wikipedia Database Dump: 9GB
• Open Street Map: 19GB
• Common Crawl: 81TB
• 1000 Genomes: 200TB
• Large Hadron Collider: 15PB annually
Gigabytes: normal size for relational databases
Terabytes: relational databases may start to experience scaling issues
Petabytes: relational databases struggle to scale without a lot of fine-tuning
Working With Data: Expectation vs. Reality
• Different File Formats
• Missing Values
• Inconsistent Schema
• Loosely Structured
• Lots of it
MapReduce
Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview
• Map - Emit key/value pairs from data
• Reduce - Collect data with common keys
• Tries to minimize moving data between nodes
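The map and reduce phases above can be sketched in plain Python as a toy word count (this illustrates the programming model only, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs from each input record
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: collect values that share a key, then combine them
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data", "big pig"]))
# counts == {'big': 2, 'data': 1, 'pig': 1}
```

In a real cluster the shuffle step also moves each key's values to one node, which is the data movement MapReduce tries to minimize.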
MapReduce Issues
• Very low-level abstraction
• Cumbersome Java API
• Unfamiliar to data analysts
• Rudimentary support for data pipelines
Pig
• Eats anything
• SQL-like, procedural data flow language
• Extensible with Java, Jython, Groovy, Ruby or JavaScript
• Provides opportunities to optimize workflows
Alternatives
• Java MapReduce API
• Hadoop Streaming
• Hive
• Spark
• Cascading
• Cascalog
Python
• Data analysis - pandas, numpy, networkx
• Machine learning - scikits.learn, milk
• Scientific - scipy, pyephem, astropysics
• Visualization - matplotlib, d3py, ggplot
Pig Features
Input/Output
• HBase
• JDBC Database
• JSON
• CSV/TSV
• Avro
• ProtoBuff
• Sequence File
• Hive Columnar
• XML
• Apache Log
• Thrift
• Regex
Relational Operators
LIMIT GROUP FILTER CROSS
COGROUP JOIN STORE DISTINCT
FOREACH LOAD ORDER UNION
Built-In Functions
COS SIN AVG SUM
COUNT RANDOM LOWER UPPER
CONCAT MAX MIN TOKENIZE
User Defined Functions
• Easy way to add arbitrary code to Pig
• Eval - Filter, aggregate, or evaluate
• Storage - Load/Store data
• Full support for Java and Jython
• Experimental support for Groovy, Ruby and JavaScript
Census Example
Getting Data
Convert to TSV
ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"
• Uses Geospatial Data Abstraction Library (GDAL) to convert to TSV
• TSV beats CSV here: the WKT geometry column itself contains commas
Inspect Headers
f = open('CSA_2010Census_DP1.tsv')
header = f.readline()
headers = header.strip('\n').split('\t')
list(enumerate(headers))
[(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), . . .
Pig Quick Start
pig -x local
grunt> ls
file:/data/CSA_2010Census_DP1.dbf<r 1> 841818
file:/data/CSA_2010Census_DP1.prj<r 1> 167
file:/data/CSA_2010Census_DP1.shp<r 1> 76180308
file:/data/CSA_2010Census_DP1.shx<r 1> 3596
file:/data/CSA_2010Census_DP1.tsv<r 1> 111224058
http://pig.apache.org/releases.html
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
• Download Pig Distribution
• Untar package
• Start Pig in local mode
Loading Data
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
Extracting Data
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
grunt> describe extracted_no_types;
extracted_no_types: {name: bytearray, population: bytearray}
Adding Schema
grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
grunt> describe extracted;
extracted: {name: chararray, population: int}
Ordering
grunt> ordered = ORDER extracted BY population DESC;
grunt> dump ordered;
("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
("San Jose-San Francisco-Oakland, CA CSA",7468390)
("Dallas-Fort Worth, TX CSA",6731317)
("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)
Storing Data
grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');

ls -a
.part-m-00035.crc  .part-m-00115.crc  .pig_header   part-m-00077  part-m-00157
.part-m-00036.crc  .part-m-00116.crc  .pig_schema   part-m-00078  part-m-00158
.part-m-00037.crc  .part-m-00117.crc  _SUCCESS      part-m-00079  part-m-00159
.part-m-00038.crc  .part-m-00118.crc  part-m-00000  part-m-00080  part-m-00160
Space Catalog Example
Space Catalog
• 14,000+ objects in public catalog
• Use Two Line Element sets to propagate out positions and velocities
• Can generate over 100 million positions & velocities per day
Two Line Elements
ISS (ZARYA)
1 25544U 98067A 08264.51782528 -.00002182 00000-0 -11606-4 0 2927
2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537
• Use Python script to convert to Pig friendly TSV
• Create Python UDF to parse TLE into parameters
• Use Python UDF with Java libraries to propagate out positions
Python UDFs
• Easy way to extend Pig with new functions
• Uses Jython which is at Python 2.5
• Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc...)
• Can use Java classes
TLE Parsing

def parse_tle_number(tle_number_string):
    # TLE floats have an assumed leading decimal point:
    # '-11606-4' means -0.11606e-4
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:  # negative mantissa, negative exponent
        new_number = '-0.' + split_string[1] + 'e-' + split_string[2]
    elif len(split_string) == 2:  # positive mantissa, negative exponent
        new_number = '0.' + split_string[0] + 'e-' + split_string[1]
    elif len(split_string) == 1:  # no exponent
        new_number = '0.' + split_string[0]
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)
Columns 54-61: BSTAR drag (decimal assumed), e.g. -11606-4
Full parser at https://gist.github.com/shawnhermans/4569360
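The "decimal assumed" convention can also be shown as a self-contained decoder (the function name and structure here are mine, not from the gist, and it only handles the exponent forms that appear in TLE fields):

```python
def tle_float(field):
    """Decode a TLE 'assumed decimal point' field, e.g. '-11606-4' -> -0.11606e-4."""
    # Leading sign applies to the mantissa
    sign = -1.0 if field[0] == '-' else 1.0
    body = field.lstrip('+-')
    if '-' in body:  # negative exponent, e.g. '11606-4'
        mantissa, exponent = body.split('-')
        return sign * float('0.' + mantissa) * 10.0 ** -int(exponent)
    if '+' in body:  # positive exponent, e.g. '11606+1'
        mantissa, exponent = body.split('+')
        return sign * float('0.' + mantissa) * 10.0 ** int(exponent)
    return sign * float('0.' + body)  # no exponent, e.g. '0006703'
```

So the BSTAR field -11606-4 from the ISS example decodes to about -1.1606e-5.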
Simple UDF
import tleparser
@outputSchema("params:map[]")
def parseTle(name, line1, line2):
    params = tleparser.parse_tle(name, line1, line2)
    return params
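Outside Jython, the UDF body is just a function returning a dict, which Pig exposes as a map. A minimal stand-in for tleparser.parse_tle (hypothetical and heavily trimmed; the real parser extracts many more fields, and @outputSchema only exists inside Pig) might look like:

```python
def parse_tle(name, line1, line2):
    # Fields live at fixed character columns in a TLE (0-indexed slices below)
    return {
        'name': name.strip(),
        'satellite_number': int(line1[2:7]),   # columns 3-7 of line 1
        'inclination': float(line2[8:16]),     # columns 9-16 of line 2, degrees
    }

params = parse_tle(
    'ISS (ZARYA)',
    '1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927',
    '2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537')
```

Pig would serialize the returned dict as a map like [satellite_number#25544, inclination#51.6416, ...].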
Extract Parameters
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])
Storing Results
grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',', '-schema');
UDF with Java Import

from jsattrak.objects import SatelliteTleSGP4

@outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
    satellite = SatelliteTleSGP4(name, line1, line2)
    ecef_positions = []
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = start_time
    while current_time <= end_time:
        positions = [current_time]
        positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
        ecef_positions.append(tuple(positions))
        current_time += increment
    return ecef_positions
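SatelliteTleSGP4 is a Java class from JSatTrak, so the UDF above only runs under Jython. The surrounding time-stepping logic, though, can be sketched in plain Python with a stand-in propagator (the toy position function below is not real orbital mechanics):

```python
def sample_track(position_at, start_time, end_time, number_of_points):
    # Mirror the UDF's loop: walk from start_time to end_time in fixed
    # increments, emitting (time, x, y, z) tuples. With steps that are not
    # exact binary fractions, float accumulation may drop the endpoint.
    increment = (float(end_time) - float(start_time)) / float(number_of_points)
    current_time = float(start_time)
    track = []
    while current_time <= end_time:
        track.append((current_time,) + tuple(position_at(current_time)))
        current_time += increment
    return track

# Stand-in for satellite.calculateJ2KPositionFromUT (hypothetical)
toy = lambda t: (7000.0 + t, 2.0 * t, 3.0 * t)
track = sample_track(toy, 0.0, 1.0, 4)
# 5 samples at t = 0.0, 0.25, 0.5, 0.75, 1.0 (0.25 is exact in binary)
```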
Propagate Positions
grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
grunt> DESCRIBE propagated;
propagated: {params: map[], propagated: {positions: (time: double, x: double, y: double, z: double)}}
grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
grunt> DESCRIBE flattened;
flattened: {bytearray, propagated::time: double, propagated::x: double, propagated::y: double, propagated::z: double}
Result
(38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
(38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
(38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
(38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
(38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)
Pig on Amazon EMR
Pig with EMR
• SSH in to box to run interactive Pig session
• Load data to/from S3
• Run standalone Pig scripts on demand
Conclusion
Other Useful Tools
• python-dateutil: super-duper date parser
• Oozie: Hadoop workflow engine
• Piggybank and Elephant Bird: third-party Pig libraries
• chardet: character-encoding detection library for Python
Parting Thoughts
• Great ETL tool/language
• Flexible enough to write general purpose MapReduce jobs
• Limited, but emerging 3rd party libraries
• Jython for UDFs is extremely limiting (Spark?)
Twitter: @shawnhermans
Email: [email protected]