Accessing Data in the Cloud: Using SAS to read data from Amazon Simple Storage Service (S3)
seleritysas.com
What is Amazon Simple Storage Service (S3)?
• An object store, not a file system
• Write once, read many (WORM)
• Eventually consistent
• 99.999999999% durability
• Unlimited storage capacity
• Highly scalable and available data storage
• Low latency and high throughput performance
What Public Data is Available in S3?
• AWS Public Datasets
  • https://aws.amazon.com/public-datasets/
  • Geospatial and Environmental Datasets
  • Genomics and Life Science Datasets
  • Datasets for Machine Learning
  • Regulatory and Statistical Data
• awesome-public-datasets
  • https://github.com/caesar0301/awesome-public-datasets
• NYC Taxi and Limousine Commission
  • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
What is the typical workflow to use raw data from S3?
• Download the data file from S3 to your PC using HTTP/HTTPS
• Upload/Import the data to SAS
What would make this more efficient?
• Cutting out the middle-man (your local PC)
How can we have S3 communicate directly with the SAS Server?
• Use the FILENAME URL access method
✓ Easy to implement
✗ File is retrieved using the http protocol (serially)
✗ The slowest of all options, subject to timeouts for very large files
• Use PROC S3 to download files to the SAS Server’s filesystem
✓ Very fast, as it uses parallel downloads
✗ Only available from SAS 9.4M4 onward
✗ Only works with secure S3 files, not public S3 files
• Use the AWS CLI to download files to the SAS Server’s filesystem
✓ Very fast, as it uses parallel downloads
✗ Need to install the AWS CLI on the SAS Server
✗ Need the ability to run X commands on the SAS Server
• “Mount” the S3 storage on the SAS Server
✓ Treat it like a local disk
✗ S3 is not designed for block storage/access
✗ Potential issues with current storage driver implementations
Example: NYC Trip Data in S3
• NYC Yellow Cab trip data for January 2017
  • 9,710,124 records
  • CSV format
  • 815 MB
• Location
  • Bucket: nyc-tlc
  • Object Key: trip data/yellow_tripdata_2017-01.csv
• HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
• S3 Protocol: “s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv”
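The two addresses above are built from the same bucket and object key. The sketch below shows that relationship, and also what an AWS CLI copy command for this object might look like. It is written in Python purely for illustration, since the slides themselves use SAS. The helper names and the destination path `/tmp/yellow_tripdata_2017-01.csv` are hypothetical. The `--no-sign-request` flag assumes the bucket allows anonymous reads.

```python
from urllib.parse import quote_plus

BUCKET = "nyc-tlc"
KEY = "trip data/yellow_tripdata_2017-01.csv"


def http_url(bucket: str, key: str) -> str:
    """Path-style HTTPS URL: spaces in the key become '+', slashes are kept."""
    return f"https://s3.amazonaws.com/{bucket}/{quote_plus(key, safe='/')}"


def s3_uri(bucket: str, key: str) -> str:
    """S3 URI as used by the AWS CLI: the key is not URL-encoded."""
    return f"s3://{bucket}/{key}"


def aws_cli_copy_command(bucket: str, key: str, dest: str) -> list[str]:
    """Argument list for 'aws s3 cp'; built here, not executed.

    --no-sign-request requests anonymous access (assumes a public bucket).
    """
    return ["aws", "s3", "cp", s3_uri(bucket, key), dest, "--no-sign-request"]


if __name__ == "__main__":
    print(http_url(BUCKET, KEY))
    print(s3_uri(BUCKET, KEY))
    # Hypothetical destination path on the SAS Server's filesystem.
    print(aws_cli_copy_command(BUCKET, KEY, "/tmp/yellow_tripdata_2017-01.csv"))
```

Passing the command as an argument list rather than a single string avoids having to quote the space in the object key.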
FILENAME URL Access Method
NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
real time 36.09 seconds
cpu time 33.85 seconds
PROC S3
NOTE: PROCEDURE S3 used (Total process time):
real time 3.77 seconds
cpu time 6.31 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
real time 26.75 seconds
cpu time 26.75 seconds
AWS CLI
NOTE: DATA statement used (Total process time):
real time 5.80 seconds
cpu time 0.00 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
real time 26.59 seconds
cpu time 26.59 seconds
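To compare the three methods end to end, the real-time figures from the logs above can be added up per method. The sketch below does only that arithmetic. It is in Python for illustration, and it assumes that the single FILENAME URL timing covers both retrieval and import, which the log excerpt does not state explicitly.

```python
# Real-time figures (seconds) copied from the log excerpts above.
# Each list holds the steps logged for that method.
real_time_steps = {
    "FILENAME URL": [36.09],        # assumed to cover retrieval and import
    "PROC S3": [3.77, 26.75],       # download, then PROC IMPORT
    "AWS CLI": [5.80, 26.59],       # download, then PROC IMPORT
}

# Total elapsed real time per method, rounded to match the log precision.
totals = {
    method: round(sum(steps), 2)
    for method, steps in real_time_steps.items()
}

# Methods ordered from fastest to slowest.
ranking = sorted(totals, key=totals.get)

if __name__ == "__main__":
    for method in ranking:
        print(f"{method}: {totals[method]:.2f} s")
```

On these figures the download step is a small share of the total for PROC S3 and the AWS CLI. Most of the elapsed time is spent in PROC IMPORT.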