SSLShader: Cheap SSL Acceleration with Commodity Processors

Keon Jang+, Sangjin Han+, Seungyeop Han*, Sue Moon+, and KyoungSoo Park+

KAIST+ and University of Washington*

Security of Paper Submission Websites

Network and Distributed System Security Symposium

Security Threats in the Internet

Public WiFi without encryption

• Easy target that requires almost no effort

Deep packet inspection by governments

• Used for censorship

• In the name of national security

NebuAd’s targeted advertisement

• Modify user’s Web traffic in the middle

Secure Sockets Layer (SSL)

A de-facto standard for secure communication

• Authentication, Confidentiality, Content integrity

[Figure: SSL session between client and server. Steps shown: TCP handshake; key exchange using a public-key algorithm (e.g., RSA) with server identification; encrypted data.]

SSL Deployment Status

Most websites are not SSL-protected

• Less than 0.5%

• [NETCRAFT Survey Jan ‘09]

Why is SSL not ubiquitous?

• Small sites: lack of recognition, manageability, etc.

• Large sites: cost

• SSL requires lots of computation power

SSL Computation Overhead

Performance overhead (HTTPS vs. HTTP)

• Connection setup

• Data transfer

Good privacy is expensive

• More servers

• H/W SSL accelerators

Our suggestion:

• Offload SSL computation to GPU

SSL-accelerator leveraging GPU

• High-performance

• Cost-effective

SSL reverse proxy

• No modification on existing servers

[Figure: SSLShader sits in front of a Web server, an SMTP server, and a POP3 server. Link labels: SSL-encrypted session, plain TCP.]

Our Contributions

GPU cryptography optimization

• The fastest RSA on GPU

• Superior to high-end hardware accelerators

• Low latency

SSLShader

• Complete system exploiting GPU for SSL processing

• Batch processing

• Pipelining

• Opportunistic offloading

• Scaling with multiple cores and NUMA nodes

CRYPTOGRAPHIC PROCESSING WITH GPU

How Does a GPU Differ From a CPU?

Intel Xeon 5650 CPU: 6 cores, 62×10^9 instructions / sec

NVIDIA GTX580 GPU: 512 cores, 870×10^9 instructions / sec

Example code: vector addition (C = A + B)

CPU code

void VecAdd(int *A, int *B, int *C, int N)
{
    // iterate over N elements
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

VecAdd(A, B, C, N);

GPU code: Single Instruction Multiple Threads (SIMT)

__global__ void VecAdd(int *A, int *B, int *C)
{
    // each of the N threads handles one element
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Launch N threads
VecAdd<<<1, N>>>(A, B, C);

(Speaker note: one-third point, 8 min 10 sec)

Parallelism in SSL Processing

Client 1, Client 2, ..., Client N

1. Independent Sessions

2. Independent SSL Records

3. Parallelism in Cryptographic Operations

SSLShader

Our GPU Implementation

Choices of cipher-suite

Optimization of GPU algorithms

• Exploiting massive parallel processing

• Parallelization of algorithms

• Batch processing

• Data copy overhead is significant

• Concurrent copy and execution

(Editor note: bring in the earlier figure so that this slide maps onto it.)

Client Server

Encryption: AES Message Authentication: SHA1

Key exchange: RSA

Basic RSA Operations

M: plain-text, C: cipher-text

(e, n): public key, (d, n): private key

Encryption: $C = M^e \bmod n$

Decryption: $M = C^d \bmod n$

n and d are 1024- or 2048-bit integers (300 ~ 600 decimal digits)

e is a small number: 3, 17, or 65537

Decryption at the server side is the bottleneck

Exponentiation requires many multiplications
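To make the cost of exponentiation concrete, here is a minimal C sketch of square-and-multiply modular exponentiation that also counts modular multiplications. The function name `modexp_count` and the toy key (n = 3233, e = 17, d = 2753, textbook values that do not come from the slides) are assumptions for illustration; real RSA operands are 1024 or 2048 bits and need a big-integer library.

```c
#include <stdint.h>

/* Toy right-to-left square-and-multiply: returns base^exp mod n.
 * All operands must stay below 2^32 so that products fit in 64 bits.
 * mul_count (hypothetical, for illustration only) receives the number
 * of modular multiplications performed, squarings included. */
static uint64_t modexp_count(uint64_t base, uint64_t exp, uint64_t n,
                             unsigned *mul_count)
{
    uint64_t result = 1 % n;
    base %= n;
    *mul_count = 0;
    while (exp > 0) {
        if (exp & 1) {                    /* multiply step for a set bit */
            result = (result * base) % n;
            (*mul_count)++;
        }
        base = (base * base) % n;         /* squaring step for every bit */
        (*mul_count)++;
        exp >>= 1;
    }
    return result;
}
```

With a small public exponent the loop runs only a few times, while the private exponent is as long as the modulus, which is consistent with the slide's point that server-side decryption dominates.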

Server-side

Server

Client

Breakdown of Large Integer Multiplication

Schoolbook multiplication

      649
  x   627
---------
       63
      280
     4200
      180
      800
    12000
     5400
    24000
+  360000
---------
   406923

Accumulation is difficult to parallelize due to "overlapping digits" and "carry propagation"

3 x 3 = 9 multiplications, 9 additions of 6-digit integers
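The split between independent partial products and serial carry handling can be sketched in C as follows. This is an illustrative CPU version using decimal digits to mirror the 649 x 627 example, not the authors' GPU kernel; the name `schoolbook_mul` and the 32-digit limit are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define BASE 10u  /* decimal digits, to mirror the 649 x 627 example */

/* Multiplies two little-endian digit arrays a[0..s-1] and b[0..s-1] and
 * writes 2*s digits to out. Returns 0 on success, -1 if s is too large
 * for the fixed scratch buffer (an arbitrary limit for this sketch). */
static int schoolbook_mul(const uint8_t *a, const uint8_t *b, size_t s,
                          uint8_t *out)
{
    uint32_t col[64] = {0};
    if (s > 32)
        return -1;

    /* Phase 1: every digit product is independent of the others; products
     * that land in the same column are the "overlapping digits". */
    for (size_t i = 0; i < s; i++)
        for (size_t j = 0; j < s; j++)
            col[i + j] += (uint32_t)a[i] * b[j];

    /* Phase 2: carry propagation is inherently serial, because each
     * column depends on the carry out of the column below it. */
    uint32_t carry = 0;
    for (size_t k = 0; k < 2 * s; k++) {
        uint32_t v = col[k] + carry;
        out[k] = (uint8_t)(v % BASE);
        carry = v / BASE;
    }
    return 0;
}
```

Deferring the carries until all column sums are known is one common way to expose the parallel part of the work; the slides' own scheme is described only at the level of step counts.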

O(s) Parallel Multiplications

Example of 649 x 627 = 406,923

2s steps

1 or 2 steps (s – 1 worst case)

s = # of words in a large integer (e.g., 1024 bits = 16 x 64-bit words)

More Optimizations on RSA

Common optimizations for RSA

• Chinese Remainder Theorem (CRT)

• Montgomery Multiplication

• Constant Length Non-zero Window (CLNW)

Parallelization of serial algorithms

• Faster Calculation of M×n

• Interleaving of T + M×n

• Mixed-Radix Conversion Offloading

GPU specific optimizations

• Warp Utilization

• Loop Unrolling

• Elimination of Divergence

• Avoiding Bank Conflicts

• Instruction-Level Optimization
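As an illustration of the first item, the following C sketch shows RSA decryption with the Chinese Remainder Theorem: two exponentiations with half-size moduli replace one full-size exponentiation. The names `crt_key`, `modpow_u64`, and `rsa_crt_decrypt` and the toy key (p = 61, q = 53) are assumptions for illustration, not taken from the slides, and the code only works for operands below 2^32.

```c
#include <stdint.h>

/* Toy modular exponentiation; operands must stay below 2^32. */
static uint64_t modpow_u64(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Precomputed CRT parameters (hypothetical struct for this sketch):
 * dp = d mod (p-1), dq = d mod (q-1), qinv = q^-1 mod p. */
typedef struct {
    uint64_t p, q, dp, dq, qinv;
} crt_key;

/* Recovers M = C^d mod (p*q) from two half-size exponentiations. */
static uint64_t rsa_crt_decrypt(uint64_t c, const crt_key *k)
{
    uint64_t m1 = modpow_u64(c % k->p, k->dp, k->p);
    uint64_t m2 = modpow_u64(c % k->q, k->dq, k->q);
    /* h = qinv * (m1 - m2) mod p, kept non-negative before reducing */
    uint64_t diff = (m1 + k->p - (m2 % k->p)) % k->p;
    uint64_t h = (k->qinv * diff) % k->p;
    return m2 + h * k->q;
}
```

The two half-size exponentiations are independent of each other, which is one reason CRT fits well with parallel hardware.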

[Figure: RSA throughput (operations/s, axis 0 to 80,000) after each successive optimization. Bar names as extracted: Initial (1); (3) Warp Utilization; (5); (6) 64-bit words; (7) Avoiding bank conflicts; (8) Instruction-level Optimization; CLNW; (9) Post-exponentiation offloading. Bar labels as extracted: 4054, 6620, 13281, 9891, 10146, 6627, 21041.]

Read our paper for details

Parallelism in SSL Processing

Client 1, Client 2, ..., Client N

1. Independent Sessions

2. Independent SSL Records

3. Parallelism in Cryptographic Operations

SSLShader

Batch Processing

GTX580 Throughput w/o Batching

[Figure: throughput relative to a single CPU core (Intel Nehalem single core, 2.66 GHz) for RSA, AES-ENC, AES-DEC, and SHA1. Bar labels as extracted: 0.08x, 0.02x, 0.02x, 0.0x.]

GTX580 Throughput w/ Batching

[Figure: throughput relative to a single CPU core for RSA, AES-ENC, AES-DEC, and SHA1. Bar labels as extracted: 6.8x, 7.7x, 9.4x.]

Difference: ratio of computation to copy

Batch size: 32~4096 depending on the algorithm

Copy Overhead in GPU Cryptography

GPU processing works by

• Data copy: CPU -> GPU

• Execution in GPU

• Data copy: GPU -> CPU

                  AES-ENC (Gbps)   AES-DEC (Gbps)   HMAC-SHA1 (Gbps)
GTX580 w/ copy    8.8              10               31
GTX580 no copy    21.8             33               124

[Figure: throughput with copy vs. without copy for AES-ENC, AES-DEC, and HMAC-SHA1. Labeled gains without copy, as extracted: ↑2.4x, ↑3.3x.]

Hiding Copy Overhead

Synchronous Execution

• Data copy: CPU -> GPU, then Execution in GPU, then Data copy: GPU -> CPU, one batch at a time

• Processing time: 3t

Pipelining

• The three stages of successive batches overlap

• Amortized processing time: t
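The 3t versus t figures can be checked with a small C model. It assumes that each of the three stages takes exactly t time units and that stages of different batches overlap perfectly, which is an idealization; the function names are hypothetical.

```c
/* Total time for `batches` batches when copy-in, execution, and copy-out
 * run strictly one after another for every batch. */
static unsigned long sync_time(unsigned long batches, unsigned long t)
{
    return 3 * t * batches;
}

/* Total time with an ideal 3-stage pipeline: the first batch completes
 * after 3t, and every later batch completes t after the previous one. */
static unsigned long pipelined_time(unsigned long batches, unsigned long t)
{
    return batches == 0 ? 0 : (batches + 2) * t;
}
```

Under these assumptions, 100 batches take 300t synchronously and 102t when pipelined, so the per-batch cost approaches t as the number of batches grows.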

GTX580 Performance w/ Pipelining

[Figure: throughput relative to a single core for AES-ENC, AES-DEC, and SHA1. Series: synchronous, pipelining, w/o copy. Labeled gains as extracted: ↑36%, ↑36%, ↑51%.]

Summary of GPU Cryptography

Performance gain from GTX580

• GPU performs as fast as 9 ~ 28 CPU cores

Lessons

• Batch processing is essential to fully utilize a GPU

• AES and SHA1 are bottlenecked by data copy

  • PCIe 3.0

  • Integrated GPU and CPU

            RSA-1024 (ops/sec)   AES-ENC (Gbps)   AES-DEC (Gbps)   SHA1 (Gbps)
GTX580      91.9K                11.5             12.5             47.1
CPU core    3.3K                 1.3              1.3              3.3
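The "9 ~ 28 CPU cores" range follows from dividing each GTX580 figure by the matching single-core figure. A one-line C helper (the name `cpu_core_equivalents` is hypothetical) makes the arithmetic explicit:

```c
/* Number of CPU cores needed to match one GPU, given throughput for the
 * same algorithm in the same unit (ops/sec or Gbps). */
static double cpu_core_equivalents(double gpu_throughput,
                                   double cpu_core_throughput)
{
    return gpu_throughput / cpu_core_throughput;
}
```

Applied to the table, AES-ENC gives roughly 8.8 (11.5 / 1.3) and RSA-1024 gives roughly 27.8 (91.9K / 3.3K), with AES-DEC and SHA1 in between.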

(Speaker note: 16 min 30 sec)

BUILDING SSL-PROXY THAT LEVERAGES GPU

SSLShader Design Goals

Use existing application without modification

• SSL reverse proxy

Effectively leverage GPU

• Batching cryptographic operations

• Load balancing between CPU and GPU

Scale performance with architecture evolution

• Multi-core CPUs

• Multiple NUMA nodes

Batching Crypto Operations

Network workloads vary over time

• Waiting for a fixed batch size doesn't work

Output queue

Input queue

SSL Stack

Batch size is dynamically adjusted to queue length

Balancing Load Between CPU and GPU

For a small batch, the CPU is faster than the GPU

• Opportunistic offloading

Output queue

Input queue

CPU processing

GPU processing when input queue length > threshold

GPU queue

Cryptographic operation   Minimum   Maximum
RSA (1024-bit)            16        512
AES Decryption            32        2048
AES Encryption            128       2048
HMAC-SHA1                 128       2048

Input queue length > threshold
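A minimal C sketch of the offloading decision, under stated assumptions: the table's Minimum is treated as the queue-length threshold for switching to the GPU, and the Maximum caps the batch handed to the GPU. The types and the function `choose_target` are hypothetical and are not SSLShader's actual interface.

```c
#include <stddef.h>

typedef enum { RUN_ON_CPU, RUN_ON_GPU } target_t;

/* Per-algorithm limits, e.g. {16, 512} for RSA (1024-bit). */
typedef struct {
    size_t min_batch;
    size_t max_batch;
} batch_limits;

typedef struct {
    target_t target;
    size_t batch;   /* number of queued operations to process now */
} dispatch_t;

/* Short queue: handle one operation on the CPU for low latency.
 * Long queue: send a batch to the GPU, sized to the queue length
 * but never larger than max_batch. */
static dispatch_t choose_target(size_t queue_len, batch_limits lim)
{
    dispatch_t d;
    if (queue_len < lim.min_batch) {
        d.target = RUN_ON_CPU;
        d.batch = queue_len > 0 ? 1 : 0;
    } else {
        d.target = RUN_ON_GPU;
        d.batch = queue_len < lim.max_batch ? queue_len : lim.max_batch;
    }
    return d;
}
```

Because the batch size follows the queue length, light load stays on the CPU path and heavy load naturally produces larger GPU batches, which matches the dynamic batching described earlier in the deck.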

Scaling with Multiple Cores

Per-core worker threads

• Network I/O, cryptographic operation

Sharing a GPU with multiple cores

• More parallelism with larger batch size

[Figure: per-core input queues and output queues on the CPU side, feeding a shared GPU queue.]

Scaling with NUMA systems

A process = worker threads + a GPU thread

• Separate process per NUMA node

• Minimizes data sharing across NUMA nodes

Node 0 Node 1

Evaluation

Experimental configurations

Server Specification

       Model             Spec                   Qty
CPU    Intel X5650       2.66 GHz x 6 cores     2
GPU    NVIDIA GTX580     1.5 GHz x 512 cores    2
NIC    Intel X520-DA2    10GbE x 2              2

[Figure: testbed. 8 clients connect over HTTPS through 4x 10GbE links to a server running either SSLShader + lighttpd or lighttpd + OpenSSL.]

Evaluation Metrics

HTTPS connection handling performance

• Use small content size

• Stress on RSA computation

Latency distribution at different loads

• Test opportunistic offloading

Data transfer rate at various content size

HTTPS Connection Rate

[Figure: connections / sec (axis 0 to 35,000) for SSLShader and lighttpd at RSA key sizes of 1024 bits and 2048 bits. Only one bar label survives extraction: 3.6K.]

CPU Usage Breakdown (RSA 1024)

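[Figure: CPU usage breakdown. Values as extracted: Kernel (including TCP/IP stack), 60.35; Libc, 9.88; SSLShader, 5.31; lighttpd, 4.9; others, 4.35. The values for Kernel NIC device driver and IPP + libcrypto did not survive extraction. Callout: Current Bottleneck.]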

Latency at Light Load

Similar latency at light load

[Figure: latency distribution. X-axis: latency (ms), log scale from 1 to 1000; y-axis ticks 0 to 90. Series: lighttpd at 1k connections / sec, SSLShader at 1k connections / sec.]

Latency at Heavy Load

Lower latency and higher throughput at heavy load

[Figure: latency distribution. X-axis: latency (ms), log scale from 1 to 1000. Series: lighttpd at 11k connections / sec, SSLShader at 29k connections / sec.]

Data Transfer Performance

[Figure: data transfer throughput vs. content size (4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB, 64MB) for SSLShader and lighttpd.]

Typical web content size is under 100KB

SSLShader: 13 Gbps

CONCLUSIONS

Summary

Cryptographic algorithms in GPU

• Fast RSA, AES, and SHA1

SSLShader

• Transparent integration

• Effective utilization of GPU for SSL processing

• Up to 6x connections / sec

• 13 Gbps throughput

Linux network stack performance

Copy overhead

QUESTIONS?

THANK YOU!

For more details

https://shader.kaist.edu/sslshader
