AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet...

49
SUMMIT BERLIN

Transcript of AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet...

Page 1: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

S U MM I TBERL IN

Page 2: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Query your data in S3 with SQL andoptimize for cost and performance

Steffen GrunwaldSolutions Architect, Amazon Web Services

[email protected]: @steffeng

Page 3: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Data

every 5 years

There is more data than people think.

years

live for

Data platforms need to

scalegrows

Page 4: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Traditionally, analytics looked like this

Expensive: Large initial capex + $10k $50k/TB/year

GBs-TBs scale [not designed for PB/EBs]

Relational data

90% of data was thrown away because of cost

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence

Page 5: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Snowball

Snowmobile KinesisData Firehose

KinesisData Streams

S3Redshift

EMRAthena Kinesis

Elasticsearch Service

Data Lakes on AWS

KinesisVideo Streams

AI Services

QuickSight

Exabyte scale

Store and analyze relational and non-relational data

Purpose-built analytic tools

Analytics run on the same data without movement

Cost effective• Store at $0.0245 per GB-month in S3 Standard (Frankfurt)• Query with Athena at $0.05/GB scanned• DW with Redshift for $1,000/TB/year

Page 6: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Ingest data with Amazon Kinesis Firehose

• Stores stream of records as files in a bucket• "YYYY/MM/DD/HH“ prefix or

(New!) custom pattern based on ingestion time (UTC)• Optionally compress (GZIP, ZIP, Snappy)• Optionally store as columnar format (ORC, Parquet)• Optionally transform records with AWS Lambda

Amazon Kinesis Data Firehose

Amazon Simple Storage Service (S3)

Page 7: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Amazon Athena

Fast. Really Fast.

Interactive performance even for large datasets. Athena automatically

executes queries in parallel, so most results come back

within seconds.

Open. Powerful. Standard

Start Querying Instantly Pay Per Query

Athena is serverless. Just point to your data in

Amazon S3, define the schema, and start querying

using the built-in query editor.

Amazon Athena uses Presto with ANSI SQL support and

works with a variety of standard data formats,

including CSV, JSON, ORC, Avro, and Parquet

With Amazon Athena, you pay only for the queries that you run. You are charged $5 per terabyte scanned by your

queries.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Page 8: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Familiar technologies under the covers

Used for SQL QueriesIn-memory distributed query engineANSI-SQL compatible with extensions(Eg. SELECT * FROM tableName)

Used for DDL functionalityComplex data typesMultitude of formats Supports data partitioning(Eg. CREATE TABLE, ALTER TABLE,

MSCK REPAIR)

Page 9: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Movable Ink provides real-time personalization of marketing emails based on a wide range of user, device, and contextual data

Movable Ink has been collecting data on user actions since 2011, and this database grows by 75 to 100GB per day. To reduce time-to-insight, optimize costs, and increase flexibility for its analytics, the company recently adopted the serverless Amazon Athena query service.Since the company began using Amazon Athena, it has realized both cost savings and improved performance for analytics related to user actions.

“Using Amazon Athena, we’re able to query seven years’ worth of data—adding up to hundreds of terabytes—get results at least 50 percent faster, and save nearly $15,000 per month, all without keeping a cluster running.”

Matt Chesler, director of DevOps at Movable Ink

Page 10: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 11: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

The example data: NYC Taxi trips

Data is generated by kinesis-taxi-stream-produceravailable at http://a.co/3KyaGLV

java -jar kinesis-taxi-stream-producer.jar-speedup 400 -statisticsFrequency 10000-stream nyctlc-ingestion –noWatermark-region eu-central-1 -adaptTime ingestion

à ~2GB/h of raw data, 11 days, 487 GB total

Page 12: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Firehose (orc)

Firehose (raw)

Firehose (gzip)

Firehose (parquet)

Test setup: Ingesting data with different settings

(max. Firehose bufferinghints: 128MB & 900s)

S3Amazon Kinesis Data Streams

Instance

Page 13: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query I

Show some rides on 2nd September 10-11h:SELECT *FROM "128mb"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10

Run time: 3.53 seconds, Data scanned: 4.62GB

Page 14: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query II (gzip)gzip reduces 487GB to 76GB.

Show some rides on 2nd September 10-11h:SELECT *FROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10

Run time: 3.53 seconds, Data scanned: 4.62GBRun time: 2.45 seconds, Data scanned: 303.04KB

Page 15: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query III (without LIMIT 10)What was the distribution of passenger load on 2nd September 10-11h?

SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'GROUP BY passenger_count

Run time: 50.36 seconds, Data scanned: 76.5GB

Page 16: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Photo by Tang Junwen on Unsplash

Page 17: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Partitions to theRescue

AWS Glue crawlers addpartitions based on fileprefixes/ directories.

Page 18: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query IV

What was the distribution of passenger load on 2nd September 10-11h?

SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||

partition_3BETWEEN '2018090210' AND '2018090215'

GROUP BY passenger_count

Run time: 50.36 seconds, Data scanned: 76.5GBRun time: 5.59 seconds, Data scanned: 1.77GB

Page 19: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 20: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 1: Crawl with AWS Glue

Why? Just schedule the crawler, no need to code!Deals with schema evolution.

Lookupschema

Create tablepartitions

Query data

Crawldata

S3 Bucket

Glue Crawler

Glue Datacatalog

Amazon AthenaLog Data

Page 21: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 2: Use Apache Hive-style file format in S3Move/ copy:

YYYY/MM/DD/HH/file

year=YYYY/month=MM/day=DD/hours=HH/file

Why? Format easy to create on write, easy to move.Reload partitions by: msck repair table.

Page 22: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 3: Add partitions with AWS Lambda

Why? Add partitions instantly, just AWS Lambda cost.

Lookupschema

Add tablepartition

Query data

New filetrigger

S3 Bucket

LambdaFunction

Glue Datacatalog

Amazon AthenaLog Data

Page 23: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 4: Prepopulate partitions

Issue statements with Amazon Athena every hour/ day:

ALTER TABLE mytableADD PARTITION(year='2015',month='01',day='01')LOCATION 's3://[...]/2015/01/01/'

Why? Easy for predictable paths.Data is instantly available in queries.

Page 24: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 25: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Last_NameFischerMeierSchulzMalerMüller

Age3425456329

GenderFemFemFemMalMal

First_NameHeidiAnnaNicoleKlausRainer

Flat file sample layout

Page 26: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Last_NameFischerMeierSchulzMalerMüller

MIN: FischerMAX: Schulz

Age3425456329

MIN: 25MAX: 63

GenderFemFemFemMalMal

MIN: FemMAX: Mal

First_NameHeidiAnnaNicoleKlaus

RainerMIN: Anna

MAX: Rainer

Columnar formats layout (Parquet & ORC)

Page 27: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

SELECT * FROM ... WHERE Age > 30

Last_NameFischerMeierSchulzMalerMüller

MIN: FischerMAX: Schulz

Age3425456329

MIN: 25MAX: 63

GenderFemFemFemMalMal

MIN: FemMAX: Mal

First_NameHeidiAnnaNicoleKlaus

RainerMIN: Anna

MAX: Rainer

Benefit 1: predicate pushdown

Page 28: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Last_NameFischerMeierSchulzMalerMüller

MIN: FischerMAX: Schulz

Age3425456329

MIN: 25MAX: 63

GenderFemFemFemMalMal

MIN: FemMAX: Mal

First_NameHeidiAnnaNicoleKlaus

RainerMIN: Anna

MAX: Rainer

SELECT First_Name FROM ... WHERE Age > 30

Benefit 2: projection pushdown/ column pruning

Page 29: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 30: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query V (parquet)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mbparquet"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||

partition_3BETWEEN '2018090210' AND '2018090215'

GROUP BY passenger_count

Run time: 5.59 seconds, Data scanned: 1.77GBRun time: 3.21 seconds, Data scanned: 300.7MB

Page 31: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Analyzing an Apache parquet file

• parquet-tools• head – view data in file• meta – get metadata summary• dump -d -n – get detailed metadata down to page

level stats included

Page 32: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Schema information

1. Download and build https://github.com/apache/parquet-mr/2. run: java -jar parquet-tools.jar meta parquetfile

Row Count Total Byte Size Size in Bytes Value Count Encoding

Page 33: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

parquet-tools dump: encoding & statisticstotal_amount:- DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03 [more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]dropoff_datetime:- BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60 [more]... ST:[no stats for this column]

Use (unix epoch) or partition by timestamp for time seriesdata.

Page 34: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query VI (ORC)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mborc"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||

partition_3BETWEEN '2018090210' AND '2018090215'

GROUP BY passenger_count

Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 3.61 seconds, Data scanned: 303.38MB

Page 35: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Analyzing Apache ORC: orcdumpfile

Spin up a single node/ master EMR Cluster and use thehive command:

hive --orcfiledump file://<absolutepath>/file.orc

[…]

Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7

Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384

[…]

Page 36: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Page 37: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 1: Extract, Transform, Load with AWS Glue

Lookupschema

Create tablepartitions

Query data

Read/Write

S3 Bucket

Glue Job

Glue Datacatalog

Amazon AthenaLog Data

Page 38: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example Zeppelin/ AWS Glue Notebook

Page 39: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Option 2: Amazon Athena – Create Table as Select

Lookupschema

Create tablepartitions

Query data

Read/Write

S3 Bucket

AthenaCTAS

Glue Datacatalog

Amazon AthenaLog Data

Page 40: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Athena CTAS

Repartition one day:

Pay for data scanned

4.86GB / 1TB * 5 USD= 0.0243 USD

Page 41: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query VII (repartitioned)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "partitioned_by_hour"WHERE year = 2018

AND month = 9AND day = 2AND hour = 10

GROUP BY passenger_count

Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 2.68 seconds, Data scanned: 1.41 MB

Page 42: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Example query VIII (aggregated)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, trip_countFROM "aggregates_by_hour"WHERE year = 2018

AND month = 9AND day = 2AND hour = 10

Run time: 2.42 seconds, Data scanned: 2.06MBRun time: 1.84 seconds, Data scanned: 1.9 KB

Page 43: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Photo by Benjamin Davies on Unsplash

I applied these simple tricks when storing datafor Amazon Athena andyou won‘t believe whathappened next...

Page 44: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Measure.Then optimize. There‘s no silver bullet.

Photo by Cesar Carlevarino Aragon on Unsplash

Page 45: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Optimize for cost and performance 1/2

• Use Athena in the region of your buckets.• Compress your data for less storage & query cost.• Use LIMIT in queries for faster results.• Partition your data based on data access patterns.• Use partitions in your queries.• Add partitions by crawling or S3 triggers.

Page 46: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Optimize for cost and performance 2/2

• Columnar formats as ORC & parquet reduce scanneddata: faster, less cost

• Pick format depending on data, access patterns, clients• Inspect/ verify the results• Create aggregates for frequent queries• Try Amazon Athena's Create Table as Select first• Shorten turnaround times for Glue job development:

• Use a provisioned development endpoint• Use small subset of your data (think KB!)

Page 47: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Get started!

Find more detailsand deploy a sample application forAmazon CloudFront logs:

http://a.co/aJFQuMI

Page 48: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT

Thank you!

SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Steffen Grunwald

[email protected]: @steffeng

Page 49: AWSSummit Berlin Day2 steffeng Query your data in …...including CSV, JSON, ORC, Avro, and Parquet With Amazon Athena, you pay only for the queries that you run. You are charged $5

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM ITSUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.