AWS_ac_ra_largescale_05



[Architecture diagram: data flows between Amazon EC2, Amazon S3, and Amazon EBS. Flows shown: high throughput / parallel upload into S3 (or AWS Import/Export); read/write data from S3 (using HTTP or a FUSE layer); download results from S3 buckets; share results from S3 buckets. Alternates: upload into EC2 / EBS; use EBS for staging, temporary, or result storage; download results from EBS; share results using snapshots.]

System Overview

LARGE SCALE COMPUTING & HUGE DATA SETS

Amazon Web Services is very popular for large-scale computing scenarios such as scientific computing, simulation, and research projects. These scenarios involve huge data sets collected from scientific equipment, measurement devices, or other compute jobs. After collection, these data sets need to be analyzed by large-scale compute jobs to generate result data sets. Ideally, results will be available as soon as the data is collected. Often, these results are then made available to a larger audience.


1. To upload large data sets into AWS, it is critical to make the most of the available bandwidth. You can do so by uploading data into Amazon Simple Storage Service (S3) in parallel from multiple clients, each using multithreading to enable concurrent uploads, or using multipart uploads for further parallelization. TCP settings such as window scaling and selective acknowledgement can be adjusted to further enhance throughput. With the proper optimizations, uploads of several terabytes a day are possible. Another alternative for huge data sets is AWS Import/Export, which supports sending storage devices to AWS and loading their contents directly into Amazon S3 buckets or Amazon EBS volumes.
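As a concrete illustration of combining client-side parallelism with multipart uploads, here is a minimal sketch using the boto3 SDK (an assumption on my part; the original note names no tooling, and the bucket and file names below are hypothetical). boto3's TransferConfig splits large files into parts uploaded concurrently, while a thread pool uploads several files at once.

    import concurrent.futures
    import pathlib

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Multipart settings: files larger than 64 MiB are split into 16 MiB
    # parts, with up to 10 threads uploading parts of one file concurrently.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,
        multipart_chunksize=16 * 1024 * 1024,
        max_concurrency=10,
    )

    s3 = boto3.client("s3")
    BUCKET = "example-input-data"  # hypothetical bucket name

    def upload(path):
        # upload_file switches to a multipart upload above the threshold.
        s3.upload_file(str(path), BUCKET, path.name, Config=config)
        return path.name

    # Parallelism across files: several uploads run at once, and each
    # large file is itself uploaded in parallel parts.
    files = list(pathlib.Path("input-data").glob("*.dat"))
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for name in pool.map(upload, files):
            print("uploaded", name)

Running the same script from multiple client machines adds the third level of parallelism the paragraph describes.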

2. Parallel processing of large-scale jobs is critical, and existing parallel applications can typically be run on multiple Amazon Elastic Compute Cloud (EC2) instances. A parallel application may assume large scratch areas that all nodes can efficiently read from and write to. S3 can be used as such a scratch area, either directly over HTTP or through a FUSE layer (for example, s3fs or SubCloud) if the application expects a POSIX-style file system.
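To make the direct-HTTP variant concrete, a minimal boto3 sketch follows (again an assumption, with hypothetical bucket and key names): one node writes an intermediate result to S3 and any other node reads it back by key, with no shared file system in between.

    import boto3

    s3 = boto3.client("s3")
    SCRATCH = "example-scratch-bucket"  # hypothetical scratch bucket

    # A worker node writes an intermediate result straight over HTTP(S).
    s3.put_object(
        Bucket=SCRATCH,
        Key="job-42/node-07/partial-result.bin",
        Body=b"intermediate bytes from this node",
    )

    # Any other node in the job can read it back by key.
    obj = s3.get_object(Bucket=SCRATCH, Key="job-42/node-07/partial-result.bin")
    data = obj["Body"].read()
    print(len(data), "bytes read from scratch area")

An application that insists on POSIX paths could instead mount the same bucket through a FUSE layer such as s3fs and use ordinary file I/O.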

3. Once the job has completed and the result data is stored in Amazon S3, the Amazon EC2 instances can be shut down and the result data set can be downloaded. The output data can be shared with others, either by granting read permissions to select users or to everyone, or by using time-limited URLs.
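Both sharing options can be sketched with boto3 (assumed here; bucket, key, and expiry are hypothetical): a canned ACL grants read access to everyone, while a presigned URL provides time-limited access without changing the object's permissions at all.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-results"     # hypothetical bucket name
    KEY = "run-001/output.tar.gz"  # hypothetical object key

    # Option A: grant everyone read access with a canned ACL.
    s3.put_object_acl(Bucket=BUCKET, Key=KEY, ACL="public-read")

    # Option B: a time-limited (presigned) URL; it stops working after
    # 24 hours and needs no permission changes on the object.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": KEY},
        ExpiresIn=24 * 3600,
    )
    print(url)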

4. Instead of using Amazon S3, you can use Amazon EBS to stage the input set, act as a temporary storage area, and/or capture the output set. During the upload, the same concepts of parallel upload streams and TCP tuning apply. In addition, uploads that use UDP may increase speed further. The result data set can be written to EBS volumes, at which time snapshots of the volumes can be taken for sharing.
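One way to realize the snapshot-sharing step, sketched with boto3 (the volume ID and account ID below are hypothetical, and sharing with a specific account is just one interpretation of "sharing"): snapshot the results volume, wait for the snapshot to complete, then grant another account permission to create volumes from it.

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot the EBS volume holding the result data set.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
        Description="result data set, run 001",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Grant another AWS account (hypothetical ID) permission to create
    # volumes from the snapshot.
    ec2.modify_snapshot_attribute(
        SnapshotId=snap["SnapshotId"],
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=["111122223333"],
    )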
