Parquet and AVRO
Transcript of Parquet and AVRO
![Page 1: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/1.jpg)
PARQUET & AVRO
http://airisdata.com/
![Page 2: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/2.jpg)
Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
• http://www.slideshare.net/bunkertor
• http://sparkdeveloper.com/
• http://www.twitter.com/PaasDev
![Page 3: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/3.jpg)
Presenter Introduction
• Srinivas Daruna, Data Engineer, airis.DATA, Spark Certified Developer
![Page 4: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/4.jpg)
airis.DATA
airis.DATA is a next-generation system integrator that specializes in rapidly deployable machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at petabyte scale utilizing the most advanced, best-in-class technologies and frameworks including Spark, H2O, Mahout, and Flink.
Our data pipelining solutions can be deployed in batch, real-time or near-real-time settings to fit your specific business use case.
![Page 5: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/5.jpg)
Agenda
• Avro and Parquet - When and why to use which format?
• Use cases for schema evolution & practical examples
• Data modeling - Avro and Parquet schema
• Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write data frame to Parquet
  - Read back from Parquet
• Our experiences with Avro and Parquet
• Some helpful insights for project design
![Page 6: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/6.jpg)
AVRO - Introduction
• Doug Cutting created Avro, a data serialization and RPC library, to help improve data interchange, interoperability, and versioning in the Hadoop ecosystem.
• It is a serialization and RPC library and also a storage format.
• What led to a new serialization mechanism?
• Thrift and Protocol Buffers are not splittable, and both require code generation. Dynamic reading is not possible.
• Sequence files do not have schema evolution.
• Evolved as the in-house serialization and RPC library for Hadoop. Good overall performance that can match up to Protocol Buffers in some aspects.
![Page 7: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/7.jpg)
Some Important features of AVRO
• Dynamic access - No need for code generation to access the data.
• Untagged data - Which allows better compression.
• Platform independent - Has libraries in Java, Scala, Python, Ruby, C and C#.
• Compressible and splittable - Complements parallel processing systems such as MapReduce and Spark.
• Schema evolution: "Data models evolve over time", and it's important that your data formats support your need to modify your data models. Schema evolution allows you to add, modify, and in some cases delete attributes, while at the same time providing backward and forward compatibility for readers and writers.
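The schema evolution rules can be illustrated with a minimal sketch. It does not use the Avro library; it only mimics, with plain dictionaries, how a reader schema is applied to a record written under a different schema: missing fields take the reader's default, and fields the reader does not know are ignored. The `resolve` helper and the sample records are hypothetical.

```python
def resolve(record, reader_fields):
    """Project a decoded record onto the reader's field list.

    Illustrative only: mirrors the idea of Avro schema resolution for
    records, not the library's actual implementation.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from reader_fields are dropped.
    return resolved


# Hypothetical reader schemas (field lists only).
fields_v1 = [
    {"name": "name", "type": "string"},
    {"name": "going", "type": "int"},
]
fields_v2 = fields_v1 + [
    {"name": "organizer", "type": ["null", "string"], "default": None},
]

old_record = {"name": "Avro meetup", "going": 42}
new_record = {"name": "Parquet meetup", "going": 7, "organizer": "airis"}

# Backward compatibility: a new reader reads old data.
upgraded = resolve(old_record, fields_v2)
# Forward compatibility: an old reader reads new data.
downgraded = resolve(new_record, fields_v1)
```

Adding a field without a default would make `resolve` raise for old data, which is why new fields are usually declared with a default.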
![Page 8: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/8.jpg)
Summary of AVRO Properties
• Row based
• Direct mapping from/to JSON
• Interoperability: can serialize into Avro/Binary or Avro/JSON
• Provides rich data structures
• Map keys can only be strings (could be seen as a limitation)
• Compact binary form
• Extensible schema language
• Untagged data
• Bindings for a wide variety of programming languages
• Dynamic typing
• Provides a remote procedure call
• Supports block compression
• Avro files are splittable
• Best compatibility for evolving data schemas
![Page 9: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/9.jpg)
AVRO Schema Types
Primitive Types
null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence
Primitive types have no specified attributes. Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to: {"type": "string"}
Complex Types
Records, Enums, Arrays, Maps, Unions, Fixed
https://avro.apache.org/docs/current/spec.html
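As a small illustration of why the binary form is compact: the Avro specification writes int and long values with variable-length zig-zag coding, so small magnitudes take a single byte. The sketch below is a standalone illustration of that coding, not code from the Avro library; the function names are hypothetical.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag varint."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in 64 bits")
    # Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z >= 0x80:
        # Emit the low 7 bits and set the continuation bit.
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Decode a zig-zag varint produced by encode_long."""
    z = 0
    shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break  # last byte of this value
        shift += 7
    # Undo the zig-zag mapping.
    return (z >> 1) ^ -(z & 1)


small = encode_long(1)    # one byte
larger = encode_long(64)  # two bytes
```

Values between -64 and 63 fit in one byte, which suits counters and small identifiers.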
![Page 10: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/10.jpg)
Avro Schema
Understanding the Avro schema is very important for Avro data.
• JSON format is used to define the schema
• Simpler than the IDL (Interface Definition Language) of Protocol Buffers and Thrift
• Very useful in RPC. Schemas are exchanged to ensure data correctness
• You can specify the order (ascending or descending) for fields.
Sample Schema:
{
  "type": "record",
  "name": "Meetup",
  "fields": [
    {"name": "name", "type": "string", "order": "descending"},
    {"name": "value", "type": ["null", "string"]}
    …..
  ]
}
Union: ["null", "string"] (the field may hold null or a string)
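Because an Avro schema is plain JSON, it can be inspected with any JSON parser. The sketch below loads the two fields shown in the sample schema (the elided fields are left out, and the text uses straight quotes so that it is valid JSON) and reports which fields are nullable unions and which sort order applies. The helper name `is_nullable` is hypothetical.

```python
import json

# The two fields visible in the sample schema; elided fields are omitted.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Meetup",
  "fields": [
    {"name": "name", "type": "string", "order": "descending"},
    {"name": "value", "type": ["null", "string"]}
  ]
}
"""


def is_nullable(field_type):
    """A union is written as a JSON array; it is nullable if it lists "null"."""
    return isinstance(field_type, list) and "null" in field_type


schema = json.loads(SCHEMA_JSON)
nullable = [f["name"] for f in schema["fields"] if is_nullable(f["type"])]
# "ascending" is the default sort order when "order" is not given.
sort_orders = {f["name"]: f.get("order", "ascending") for f in schema["fields"]}
```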
![Page 11: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/11.jpg)
File Structure - Avro
![Page 12: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/12.jpg)
Workshop - Code Examples
• Java API to create an Avro file - API support
• Hive query to create an external table with Avro storage format - Schema evolution
• Accessing an Avro file generated from Java in Python - Language independence
• Spark - Avro data access
![Page 13: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/13.jpg)
Few interesting things…
• Avro CLI - Avro Tools jar that can provide some command-line help
• Thrift and Protocol Buffers
• Kryo
• Jackson-avro-databind Java API
• Project Kiji (schema management in HBase)
Please drop us a mail for support if you have any issues or suggestions on Avro.
![Page 14: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/14.jpg)
PARQUET - Introduction
• Columnar storage format that came out of a collaboration between Twitter and Cloudera, based on Dremel
• What is a storage format?
• Well suited to OLAP workloads
• High level of integration with Hadoop and the ecosystem (Hive, Impala and Spark)
• Interoperable with Avro, Thrift and Protocol Buffers
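To make "columnar" concrete, here is a minimal sketch that pivots row-oriented records into per-column lists. It says nothing about Parquet's actual on-disk encoding; it only shows why an aggregate over one column needs to touch one list instead of every record. The sample rows and the `to_columns` helper are hypothetical, and all rows are assumed to share the same keys.

```python
# Hypothetical row-oriented data, as a row-based format would store it.
rows = [
    {"name": "Spark meetup", "going": 120, "organizer": "A"},
    {"name": "Avro meetup", "going": 45, "organizer": "B"},
    {"name": "Parquet meetup", "going": 80, "organizer": "A"},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)
# An OLAP-style aggregate reads a single column and skips the others.
total_going = sum(columns["going"])
```

Values of one column also tend to be similar to each other, which is one reason columnar layouts compress well.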
![Page 15: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/15.jpg)
PARQUET cont..
• Allows compression. Currently supports Snappy and Gzip.
• Well supported across the Hadoop ecosystem.
• Very well integrated with Spark SQL and DataFrames.
• Predicate pushdown: projection and predicate pushdowns involve an execution engine pushing the projection and predicates down to the storage format to optimize the operations at the lowest level possible.
• Keeps I/O to a minimum by reading from disk only the data required for the query.
• Schema evolution to some extent. Allows adding new columns at the end.
• Language independent. Supports Scala, Java, C++, Python.
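Projection and predicate pushdown can be sketched without any Parquet library. In the toy model below, data is split into row groups that each carry min/max statistics per column; a scan skips any group whose statistics show that no row can match, and reads only the projected column from the rest. The data layout, the `scan` function and its parameters are assumptions made for illustration, not Parquet's API.

```python
# Hypothetical row groups: column data plus precomputed (min, max) statistics.
row_groups = [
    {"stats": {"going": (5, 12)},
     "columns": {"going": [5, 12, 9], "name": ["a", "b", "c"]}},
    {"stats": {"going": (40, 61)},
     "columns": {"going": [40, 55, 61], "name": ["d", "e", "f"]}},
    {"stats": {"going": (18, 70)},
     "columns": {"going": [18, 70, 33], "name": ["g", "h", "i"]}},
]


def scan(row_groups, project, filter_column, lower_bound):
    """Return `project` values for rows where filter_column > lower_bound."""
    results = []
    groups_read = 0
    for group in row_groups:
        _, highest = group["stats"][filter_column]
        if highest <= lower_bound:
            continue  # predicate pushdown: no row in this group can match
        groups_read += 1
        columns = group["columns"]
        # Projection pushdown: only the two needed columns are touched.
        for index, value in enumerate(columns[filter_column]):
            if value > lower_bound:
                results.append(columns[project][index])
    return results, groups_read


names, groups_read = scan(row_groups, "name", "going", 50)
```

Here the first row group is never read, because its maximum `going` value is below the filter bound.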
![Page 16: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/16.jpg)
File Structures - Parquet
![Page 17: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/17.jpg)
Sample Schema
message Meetup {
  required binary name (UTF8);
  required binary meetup_date (UTF8);
  required int32 going;
  required binary organizer (UTF8);
  required group topics (LIST) {
    repeated binary array (UTF8);
  }
}
• Be careful with Parquet data types
• Does not have a good standalone API as Avro does; it has converters for Avro, Protocol Buffers and Thrift instead
• Flattens all nested data types in order to save them as columnar structures
![Page 18: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/18.jpg)
Nested Schema resolution
![Page 19: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/19.jpg)
Parquet few important notes..
• Parquet requires a lot of memory when writing files because it buffers writes in memory to optimize the encoding and compressing of the data
• Using a heavily nested data structure with Parquet will likely limit some of the optimizations that Parquet makes for pushdowns. If possible, try to flatten your schema
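One way to follow the "flatten your schema" advice is to turn nested records into top-level columns before writing. The sketch below does this for nested dictionaries by joining keys with dots; the record, the `flatten` helper and the dotted naming are hypothetical choices, and lists are left untouched.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structures, extending the path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical nested record.
nested = {
    "name": "Spark meetup",
    "venue": {"city": "Princeton", "geo": {"lat": 40.35, "lon": -74.66}},
}
flat = flatten(nested)
```

Each resulting key can then become a simple top-level column, so a filter such as one on `venue.city` applies to a plain column.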
![Page 20: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/20.jpg)
Code examples
• Java API
• Spark Example
• Kafka Example
![Page 21: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/21.jpg)
How to decide on storage format
• What kind of data do you have?
• What is the processing framework? Future and current
• Data processing and querying
• Do you have RPC/IPC?
• How much schema evolution do you have?
![Page 22: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/22.jpg)
Our experiences with Parquet and Avro
![Page 23: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/23.jpg)
Namespace re-definitions
<item>
  <chapters>
    <content>
      <name>content1</name>
      <pages>100</pages>
    </content>
  </chapters>
</item>
<otheritem>
  <chapters>
    <othercontent>
      <randomname>xyz</randomname>
      <someothername>abcd</someothername>
    </othercontent>
  </chapters>
</otheritem>
![Page 24: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/24.jpg)
Is Agenda Accomplished..??
✓ Avro and Parquet - When and why to use which format?
✓ Use cases for schema evolution & practical examples
✓ Data modeling - Avro and Parquet schema
✓ Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write data frame to Parquet
  - Read back from Parquet
✓ Our experiences with Avro and Parquet
✓ Some helpful insights for project design
![Page 25: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/25.jpg)
Questions… ????????
![Page 26: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/26.jpg)
Notes
• https://dzone.com/articles/where-should-i-store-hadoop-data
• https://developer.ibm.com/hadoop/blog/2016/01/14/5-reasons-to-choose-parquet-for-spark-sql/
• http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015
• http://parquet.apache.org/
• https://github.com/cloudera/parquet-examples
• http://avro.apache.org/docs/current/spec.html#schema_primitive
• http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
• https://cwiki.apache.org/confluence/display/AVRO/FAQ
• http://avro.apache.org/
• https://github.com/miguno/avro-cli-examples
• https://dzone.com/articles/getting-started-apache-avro
• https://github.com/databricks/spark-avro
![Page 27: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/27.jpg)
Notes
• https://github.com/twitter/bijection
• https://github.com/mkuthan/example-spark
• http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
• https://github.com/databricks/spark-avro/blob/master/README.md
• http://www.bigdatatidbits.cc/2015/01/how-to-load-some-avro-data-into-spark.html
• http://engineering.intenthq.com/2015/08/pucket/
• https://dzone.com/articles/understanding-how-parquet
• http://blog.cloudera.com/blog/2015/03/converting-apache-avro-data-to-parquet-format-in-apache-hadoop/
![Page 28: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/28.jpg)
Our Solutions Team
o Prasad Sripathi, CEO, Experienced Big Data Architect, Head of NJ Data Science and Hadoop Meetups
o Sergey Fogelson, PhD, Director, Data Science
o Eric Marshall, Senior Systems Architect, Hortonworks Hadoop Certified Administrator
o Kristina Rogale Plazonic, Spark Certified Data Engineer
o Ravi Kora, Spark Certified Senior Data Scientist
o Srinivasarao Daruna, Spark Certified Data Engineer
o Srujana Kuntumalla, Spark Certified Data Engineer
o Tim Spann, Senior Solutions Architect, ex-Pivotal
o Rajiv Singla, Data Engineer
o Suresh Kempula, Data Engineer
![Page 29: Parquet and AVRO](https://reader031.fdocuments.us/reader031/viewer/2022022412/58f9ad95760da3da068b99d0/html5/thumbnails/29.jpg)
Technology Stack