Dipti Borkar, Co-Founder & CPO | Ahana
An introduction to Presto, an open source distributed SQL engine

Founder
Mom
Immigrant
Girl data geek (DB)
Engineer always
Product techie
Team builder
Open source believer
Mixologist

Agenda
• What is Presto?
• History of federation
• Introduction to Presto
• What made Presto different?
• Scalable architecture
• Flexible connectors
• Performance
• The life of a query

Technology Cycles Rhyme: Data Federation
80's: FDBMS Challenges RDBMS
• FDBMS paper by McLeod / Heimbigner (1985)
• FDBMS paper by Sheth / Larson (1990)
90's: OLTP to DW Wins
• Data warehouse becomes the source of truth
• Star schema becomes sacred
2000's: Cloud & Big Data
• Composite Software (founded 2001)
• Garlic paper by Laura Haas (2002) → DB2 Federated
• Google File System paper (2003)
• MapReduce paper (2006)
• Spark paper (2010)
• Too many data sources, no one uber schema
2010's: New Cloud DW w/ Data Lakes
• Based on SQL
• Self-service platforms which enable self-service analytics
2020's: SQL Federation Makes Comeback
• Dremel paper (2010) → Drill paper (2012)
• SQL++ paper (2014) → Couchbase SQL++ engine (2018)
• Presto paper (2019), PartiQL (2019)

Presto: One of the Fastest Growing Open Source Projects in Data Analytics
Business Needs
• Data-driven decision making
• Businesses need more data to iterate over
Technology Trends
• Disaggregation of storage and compute
• The rise of data lakes

What is Presto?
• Distributed SQL query engine
• ANSI SQL on databases and data lakes
• Designed to be interactive
• Access to petabytes of data
• Open source, hosted on GitHub
• https://github.com/prestodb

Presto Overview

Common Questions
• Is Presto a database?
• How is it related to Hadoop?
• How is it different from a data warehouse?

Sample Presto deployment stack & use cases
• Ad hoc
• BI tools
• Dashboard
• A/B testing
• ETL / scheduled job
• Online service

What made Presto different?
• Scalable architecture
• Pluggable connectors
• Performance

Scalable Architecture
• Two roles: coordinator and worker
• Easy scale up and scale down
• Scale up to 1000 workers
• Validated at web-scale companies
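
The two roles can be sketched in a few lines of Python. This is only an illustration of the idea, not Presto's API: the function names, the chunking scheme, and the use of a thread pool in place of worker nodes are all assumptions made for the example. The coordinator cuts the work into independent chunks, workers compute partial results, and the coordinator merges them.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Coordinator side (hypothetical): cut the input into independent chunks."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def worker_sum(split):
    """Worker side (hypothetical): compute a partial result for one chunk."""
    return sum(split)


def run_query(rows, num_workers, split_size=4):
    """Coordinator: schedule chunks on workers, then merge the partial results."""
    splits = make_splits(rows, split_size)
    # A thread pool stands in for a cluster of worker nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_sum, splits))
    return sum(partials)


data = list(range(100))
# The answer does not depend on how many workers are available.
result_small = run_query(data, num_workers=2)
result_large = run_query(data, num_workers=8)
```

Because each chunk is independent, adding or removing workers changes only how fast the chunks are processed, which is the property that makes scaling up and down easy.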

Scalable Architecture

Pluggable Presto Connectors

Presto Connector Data Model
• Connector: Driver for a data source.
• Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
• Catalog: Contains schemas from a data source specified by the connector.
• Schemas: Namespace to organize tables.
• Tables: Set of unordered rows organized into columns with types.
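
The hierarchy above maps onto the catalog.schema.table names used to address tables in Presto SQL. The sketch below models that nesting with plain Python dictionaries. The catalog, schema, table, and column names are hypothetical, and `resolve` is an illustrative helper, not part of Presto.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    columns: dict  # column name -> type name
    rows: list = field(default_factory=list)  # unordered rows


# Hypothetical catalogs, each backed by a different connector:
# catalog -> schema -> table.
catalogs = {
    "hive": {"web": {"clicks": Table({"user_id": "bigint", "url": "varchar"})}},
    "mysql": {"crm": {"users": Table({"user_id": "bigint", "name": "varchar"})}},
}


def resolve(qualified_name):
    """Look up a table by its catalog.schema.table name."""
    catalog, schema, table = qualified_name.split(".")
    return catalogs[catalog][schema][table]


clicks = resolve("hive.web.clicks")
users = resolve("mysql.crm.users")
```

Since every table is reached through the same three-part name, a single query can refer to tables from different data sources side by side.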

Presto Hive Connector for Object Stores & File Systems

Presto Hive Connector – Access Control

Presto Hive Connector – Data File Types
• Supported file types: ORC, Parquet, Avro, RCFile, SequenceFile, JSON, Text
• No data ingestion needed

Presto Druid Connector for real-time analytics

Why Presto is Fast
• In-memory processing
• Pull model
• Columnar storage and execution
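
A rough way to picture these three points together is a pipeline of Python generators over column-oriented data. This is a generic sketch, not Presto's execution engine: the data, the function names, and the reading of "pull model" as "the consumer asks upstream for the next batch" are assumptions made for illustration.

```python
# Column-oriented layout: one list per column instead of one record per row.
# Values are hypothetical; tax and discount are whole numbers for simplicity.
columns = {
    "orderkey": [1, 2, 3, 4],
    "discount": [0, 10, 0, 0],
    "tax": [50, 20, 30, 10],
}


def scan(cols, names, batch_size):
    """Yield batches holding only the requested columns, one batch per request."""
    n = len(cols[names[0]])
    for start in range(0, n, batch_size):
        yield {name: cols[name][start:start + batch_size] for name in names}


def filter_zero_discount(batches):
    """Pull batches from upstream and keep the tax values where discount is 0."""
    for batch in batches:
        yield [t for d, t in zip(batch["discount"], batch["tax"]) if d == 0]


def total(batches):
    """The final consumer drives the pipeline by pulling batches one at a time."""
    return sum(sum(batch) for batch in batches)


# Only the two columns the query needs are read; orderkey is never touched.
pipeline = filter_zero_discount(scan(columns, ["discount", "tax"], batch_size=2))
tax_total = total(pipeline)
```

Everything stays in memory, each stage produces data only when the next stage asks for it, and columns that the query does not mention are never scanned.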

The Life of a Query – Simple Scan

The Life of a Query – Join and Aggregation
SELECT orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem
  ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey
This example is from Presto: SQL on Everything
https://research.fb.com/publications/presto-sql-on-everything/
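
To make the query's semantics concrete, here is a small Python sketch that computes the same result with a hash join followed by an aggregation. It is one common evaluation strategy, shown under assumptions: the sample rows are invented, tax and discount are assumed to be columns of lineitem, and nothing here claims to reproduce the plan Presto actually chooses.

```python
from collections import defaultdict

# Hypothetical sample data.
orders = [{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]
lineitem = [
    {"orderkey": 1, "tax": 5, "discount": 0},
    {"orderkey": 1, "tax": 7, "discount": 0},
    {"orderkey": 2, "tax": 3, "discount": 10},
    {"orderkey": 2, "tax": 4, "discount": 0},
]


def join_and_aggregate(orders, lineitem):
    """Return {orderkey: SUM(tax)} for line items with discount = 0."""
    # Build phase: filter lineitem and hash the surviving rows by join key.
    build = defaultdict(list)
    for item in lineitem:
        if item["discount"] == 0:
            build[item["orderkey"]].append(item)

    # Probe phase plus aggregation: look up each order and sum tax per key.
    totals = {}
    for order in orders:
        key = order["orderkey"]
        for item in build.get(key, []):
            totals[key] = totals.get(key, 0) + item["tax"]
    return totals


result = join_and_aggregate(orders, lineitem)
```

Order 3 has no line items, so the LEFT JOIN would pair it with NULLs; the `discount = 0` predicate then discards that row, which is why it does not appear in the result.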

Logical Plan - Do NOT Join Two Big Tables

Limitations
• Memory limitation
• Fault tolerance
• Single coordinator

Get started
Docker Sandbox for Presto
https://hub.docker.com/r/ahanaio/prestodb-sandbox
AWS Sandbox AMI for Presto
https://ahana.io/tutorials/aws-sandbox/

Ahana
• SQL analytics company based on Presto
• Team of experts in cloud, database, and Presto
• Investment from Google Ventures
• Named CRN Top 10 Big Data Startup of 2020
• Premier member of the Presto Foundation
“[Ahana founders] have been strong supporters of the Presto Foundation since its launch in September 2019”
“We are excited to welcome Ahana, as the first and only company focused on supporting Presto of the Presto Foundation”

https://events.linuxfoundation.org/prestocon/
PRESTO20WIBD
Free for WiBD Members

Join the Presto Community
• Request a new feature or file a bug: github.com/prestodb/presto
• Slack: prestodb.slack.com
• Twitter: @prestodb
Stay Up-to-Date with Ahana
• URL: ahana.io
• Twitter: @ahanaio

Q & A
And yes! We are hiring!
8/27/20

Presto Foundation: Community Driven

Data-Driven Companies Need Low Data Latency
Analysts and scientists need to answer questions:
“Data Latency” = the time it takes from a user having a question to the time they can actually answer it
1. User wants to track or explore some new data
2. User meets with the Data Eng team to make a plan
3. Data team acquires the data and checks access permissions
4. Data team builds and tests the ETLs and makes tables available to the user
5. Data team notifies the user so they can ask their questions
This can take days or weeks.