Post on 20-Dec-2015
AES Microcode Implementation in IXP2400 and a Study of the Reconfigurable Crypto Unit
Piyush Ranjan Satapathy
CS203B Class Project Presentation
Road Map
• AES Algorithm Overview
• IXP2400 Platform: A Quick Look
• Microcode Overview
• Implementation of AES
• Experimental Results
• Reconfigurable Crypto Unit of Intel IXP2850
Algorithm Overview
• Designed by Daemen and Rijmen for NIST
• Originally called Rijndael
• Symmetric-key block substitution cipher
• Replacement for DES
• Successful field testing since inception
• Three bit-modes
• State defined as a 4 x 4 array of 16 bytes
• Key size is either 16, 24 or 32 bytes
• A byte is represented by a Galois-field polynomial
Bit Mode   Key Length (Nk words)   State Size (Nb words)   Number of Rounds (Nr)
128        4                       4                       10
192        6                       4                       12
256        8                       4                       14
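The rows of the table above follow the Rijndael rule Nr = max(Nk, Nb) + 6. The short sketch below reproduces the table from the key length; the function name rounds_for_key_bits is only an illustrative choice and does not come from the slides.

```python
# Reproduce the Nk / Nb / Nr table from the key length alone.
NB = 4  # State size in 32-bit words (fixed at 4 for AES)


def rounds_for_key_bits(key_bits: int) -> int:
    """Return the number of rounds Nr for a 128-, 192- or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key must be 128, 192 or 256 bits")
    nk = key_bits // 32      # key length in 32-bit words
    return max(nk, NB) + 6   # Rijndael rule for the round count


for bits in (128, 192, 256):
    print(bits, bits // 32, NB, rounds_for_key_bits(bits))
```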
Stages of AES Algorithm
Detailed view of round n. Each round performs the following operations:
• Non-linear Layer: no linear relationship between the input and output of a round
• Linear Mixing Layer: guarantees high diffusion over multiple rounds; very small correlation between bytes of the round input and the bytes of the output
• Key Addition Layer: bytes of the input are simply EXOR'ed with the expanded round key
Round data flow: result from round n-1 -> ByteSub -> ShiftRow -> MixColumn -> AddRoundKey (with round key Kn) -> pass to round n+1
1. SubBytes Function
• Multiplicative inverse in GF(2^8) followed by an affine transformation
• Direct implementation is complex
• Easily performed by a 16 x 16 LUT ROM
• Simple byte substitution
• Combinational logic
Each byte at the input of a round undergoes a non-linear byte substitution according to the substitution ("S")-box.
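To make the trade-off between direct computation and a lookup table concrete, here is a minimal sketch that builds the 256-entry S-box once from its mathematical definition (inverse in GF(2^8), then the affine transformation) and afterwards substitutes bytes by table lookup. The helper names are illustrative assumptions; the slides do not show how the microcode version fills its table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B          # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):       # a^254 == a^-1 because a^255 == 1
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a: int) -> int:
    """Direct computation: field inverse followed by the affine transform."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The 16 x 16 LUT: row = high nibble, column = low nibble of the input byte.
SBOX = [sbox_entry(i) for i in range(256)]


def sub_bytes(state: list[int]) -> list[int]:
    """Substitute every byte of the 16-byte State through the table."""
    return [SBOX[b] for b in state]


print(hex(SBOX[0x00]), hex(SBOX[0x53]))
```

Once the table exists, each substitution is a single indexed read, which is why the slide favours the LUT over the direct computation.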
2. Shift Row
• Shifting done only on the bottom three rows of the State
• Left rotate for encryption
• Right rotate for decryption
Depending on the block length, each "row" of the block is cyclically shifted by a fixed offset; for the 4-word AES State, rows 1, 2 and 3 are shifted by 1, 2 and 3 bytes.
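The row rotation described above can be sketched as follows, assuming the State is held as four rows of four bytes (a layout chosen here for readability, not necessarily the register layout used in the microcode).

```python
def shift_rows(state: list[list[int]], decrypt: bool = False) -> list[list[int]]:
    """Rotate row r of a 4 x 4 State by r bytes: left to encrypt, right to decrypt."""
    shifted = []
    for r, row in enumerate(state):
        k = (-r if decrypt else r) % 4   # row 0 is never moved
        shifted.append(row[k:] + row[:k])
    return shifted


state = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
encrypted_layout = shift_rows(state)
print(encrypted_layout)
print(shift_rows(encrypted_layout, decrypt=True) == state)
```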
3. MixColumns Function
• Matrix multiplication in GF(2^8)
• MixColumns functionality resides primarily in the controller and instruction memory
• A series of conditional XOR and left shift operations
Each column is multiplied by a fixed polynomial: $c(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$
This corresponds to the matrix multiplication $b(x) = c(x) \otimes a(x)$, i.e. multiplication modulo $x^4 + 1$.
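The "conditional XOR and left shift" formulation can be sketched as below: multiplying by {02} is one left shift plus a conditional XOR with the reduction polynomial, and {03} is {02} XOR {01}. This is a generic illustration of the technique, not a transcription of the project's microcode.

```python
def xtime(a: int) -> int:
    """Multiply a byte by {02} in GF(2^8): left shift, then conditional XOR."""
    a <<= 1
    if a & 0x100:        # overflow out of 8 bits: reduce by x^8 + x^4 + x^3 + x + 1
        a ^= 0x11B
    return a & 0xFF


def mix_column(col: list[int]) -> list[int]:
    """Multiply one 4-byte column by c(x) = {03}x^3 + {01}x^2 + {01}x + {02}."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # 02*a0 + 03*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # a0 + 02*a1 + 03*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # a0 + a1 + 02*a2 + 03*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # 03*a0 + a1 + a2 + 02*a3
    ]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```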
4. Key Expansion and Addition
• Performed before both the encrypt and decrypt process
• Byte values from the Key are read and manipulated into the RoundKey
• A series of SubBytes and XOR operations with RCON ROM values and the Key
• Performs XOR operation between the State and the RoundKey
• This is the only function without a separate inverse, since XOR is its own inverse
Each word is simply EXOR'ed with the expanded round key.
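As a rough illustration of the expansion and addition steps, the sketch below expands a 128-bit key into 44 round-key words using RotWord, SubWord and the RCON constants, and then XORs State words with round-key words. It covers only the 128-bit case, and all function names are illustrative assumptions rather than names from the project code.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r


def _build_sbox() -> list[int]:
    """Build the S-box: field inverse (brute force) plus affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if _gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = _build_sbox()


def expand_key_128(key: bytes) -> list[list[int]]:
    """Expand a 16-byte key into 44 four-byte words (11 round keys)."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # RotWord
            temp = [SBOX[b] for b in temp]      # SubWord
            temp[0] ^= rcon                     # XOR with the RCON value
            rcon = _gf_mul(rcon, 0x02)          # next RCON value
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


def add_round_key(state_words, round_key_words):
    """XOR each State word with the matching round-key word."""
    return [[s ^ k for s, k in zip(sw, kw)]
            for sw, kw in zip(state_words, round_key_words)]


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(bytes(words[4]).hex(), bytes(words[43]).hex())
```

Applying add_round_key twice with the same round key returns the original State, which is why no separate inverse routine is needed for this step.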
IXP2400 Platform: A Quick Look
Name             Size (Bytes)   Transfer Size (Bytes)   Reference latency in cycles
GPR (per ME)     256 x 4        4                       1
TR (per ME)      512 x 4        4                       1
NNR (per ME)     128 x 4        4                       1
LM (per ME)      640 x 4        4                       3
Scratch          16K            4                       60
SRAM             64M            4                       90
DRAM             1G             16                      120
• Achieve high processing performance
• Programming flexibility
• Cheaper than ASIC
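The latency column of the table above suggests why table placement matters for AES. The back-of-the-envelope sketch below multiplies the reference latency by the number of S-box lookups per block. It assumes one lookup per State byte per round and that no latency is hidden by multithreading, so the figures are upper bounds for illustration only and are not measurements from the project.

```python
# Reference latencies in cycles, copied from the memory table.
LATENCY_CYCLES = {
    "GPR": 1, "TR": 1, "NNR": 1, "LM": 3,
    "Scratch": 60, "SRAM": 90, "DRAM": 120,
}
ROUNDS = {128: 10, 192: 12, 256: 14}


def sbox_lookup_cycles(memory: str, key_bits: int = 128) -> int:
    """Cycles spent on S-box reads for one block, with no latency hiding (assumption)."""
    lookups = 16 * ROUNDS[key_bits]   # one lookup per State byte per round
    return lookups * LATENCY_CYCLES[memory]


for mem in ("LM", "Scratch", "SRAM"):
    print(mem, sbox_lookup_cycles(mem))
```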
Microcode Overview
alu[dest1, a, +, b]                                        ; ALU addition of a and b, storing the result in dest1
alu[dest2, dest1, -, c]                                    ; ALU subtraction
Move(reg1, reg2)                                           ; Moving from reg1 to reg2, both are GPRs
Immed[reg, 0x0020]                                         ; Immediate value assignment to register
local_csr_wr[ACTIVE_LM_ADDR_0, 0x0]                        ; Local memory indexing with index 0
begin … endm                                               ; Macro begin and end
if … endif                                                 ; If conditional
xbuf_alloc($$state, 4, read)                               ; Buffer allocation in DRAM transfer registers
reg gen_register $sram_reg $$dram_reg                      ; Register declaration
sig sram_sig dram_sig                                      ; Signal declaration
while … endw                                               ; While looping
for round[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] … endloop         ; For looping
alu_shf[index, --, B, s0, >>24]                            ; ALU shift function of B
scratch[read, $T, index, 0, 1], ctx_swap[sram_sig]         ; Scratch read instruction
ld_field_w_clr[t1, 1000, $T]                               ; Performs a write to the t1 register
dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig]  ; DRAM write
ctx_arb[dram_sig]  ctx_arb[kill]                           ; Signaling
Implementation Setup
Environmental Setup
• Intel IXP 41 600MHz ME configurations
• 200-MHz SRAMs
• 150-MHz RDRAMs
• Executed in multiple threads
• Executed in different MicroEngines
Experimental Results (1)
[Figure: SRAM Utilization. Command Bus Arbiter Statistics; y-axis: Percentage (0 to 100); categories: None-SRAM, SRAM; series: Idle due to memory queue fullness, Idle due to no request, Used]
[Figure: ME utilization. MicroEngine Utilisation Percentage; x-axis: No. of Threads in Execution (8, 4, 2, 1); y-axis: Percentage (0 to 100); series: Idle, Stalled, Aborted, Executing]
Experimental Results (2)
[Figure: Throughput Improvement for 1 MicroEngine with different threads; x-axis: No. of threads (8, 4, 2, 1); y-axis: Throughput (MIPS), 0 to 500; caption: Throughput Performance Across Threads in 1 ME]
[Figure: AES Throughput Across MicroEngines; x-axis: No. of MicroEngines (1, 2, 4, 8); y-axis: Throughput (MIPS), 0 to 1800]
Crypto Unit of IXP2850
Intel IXP2850 Encryption Data Flow
Crypto Unit Overview
Simple Encrypt Example
Simple Encrypt and Hash Example
3DES Core
• 2 cores per crypto unit
• Takes 192-bit key: (56-bit + 8-bit parity) x 3 keys
• Operates on 8-byte blocks
• Result is written to ME transfer registers or TBUF element
• Result can be passed to the SHA-1 unit for hashing
• Security processing: pipelining and interleaving using three wires and one core
• Multiple keys and IVs
AES Core
• All AES key sizes are supported (128, 192 or 256)
• Both encryption and decryption supported
• Operates on 16-byte blocks
AES Key Scheduler
SHA1 Core
• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
• This gives flexibility to run SHA and AES/3DES at different rates
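The rate decoupling mentioned above can be pictured with a small software analogy: the cipher emits 16-byte (AES) or 8-byte (3DES) blocks, while the hash consumes 64-byte blocks, so a buffer sits in between. The class below is a hypothetical model built on Python's hashlib; it is not the IXP2850 programming interface.

```python
import hashlib


class ShaInputBuffer:
    """Accumulate cipher output and hand it to SHA-1 in 64-byte blocks."""

    BLOCK = 64  # SHA-1 block size in bytes (512 bits)

    def __init__(self) -> None:
        self._pending = bytearray()
        self._hash = hashlib.sha1()
        self.blocks_consumed = 0

    def push(self, data: bytes) -> None:
        """Queue cipher output; hash whenever a full 64-byte block is ready."""
        self._pending += data
        while len(self._pending) >= self.BLOCK:
            self._hash.update(bytes(self._pending[:self.BLOCK]))
            del self._pending[:self.BLOCK]
            self.blocks_consumed += 1

    def digest(self) -> bytes:
        """Hash any leftover bytes and return the final SHA-1 digest."""
        final = self._hash.copy()
        final.update(bytes(self._pending))
        return final.digest()


buf = ShaInputBuffer()
cipher_blocks = [bytes([i]) * 16 for i in range(5)]   # five 16-byte cipher outputs
for block in cipher_blocks:
    buf.push(block)
print(buf.blocks_consumed, buf.digest().hex())
```

In this model four 16-byte cipher blocks fill one hash block, so the hash side runs once for every four cipher operations.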
SHA1 Critical Path Analysis
Some of the Crypto Commands
crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig)  ; Perform and wait for the write
crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig)  ; Loading IV data
crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig)  ; Loading key
crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)
Acknowledgement
• Yan Luo
• Chris Baron, http://cnscenter.future.co.kr/resource/rsc-center/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
• Mel Tsai, UC Berkeley (for some slides)
• Thomas Sodon et al., EE, College of New Jersey
• Zhangxi Tan et al., Tsinghua University
Q……
2 Shift Row Shifting done only on the
bottom three rows of the State Left rotate for encryption Right rotate for decryption
Depending on the block length each ldquorowrdquo of the block is cyclically shifted according to the above table
3 MixColumns Functionbull Matrix multiplication in GF (28)bull MixColumns functionality
resides primarily in the controller and instruction memory
bull A series of conditional XOR and left shift operations
Each column is multiplied by a fixed polynomialC(x) = rsquo03rsquoX3 + rsquo01rsquoX2 + rsquo01rsquoX + rsquo02rsquo
This corresponds to matrix multiplication b(x) = c(x) a(x)
4 Key Expansion and Addition
• Key expansion is performed before both the encrypt and decrypt processes
• Byte values from the Key are read and manipulated into the RoundKey
• Uses a series of SubBytes and XOR operations with RCON ROM values and the Key
• Key addition performs an XOR operation between the State and the RoundKey
• This is the only function without a separate inverse, since XOR undoes itself
Each word is simply XORed with the expanded round key.
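The expansion and addition steps above can be sketched for the 128-bit key case. This is an illustrative Python model, not the project's microcode: the S-box is computed on the fly instead of being read from a ROM table, and all function names are assumptions made for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def build_sbox():
    """Compute the S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key_128(key):
    """Expand a 16-byte key into 44 round-key words of 4 bytes each."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # rotate the word by one byte
            temp = [SBOX[b] for b in temp]    # SubBytes on each byte
            temp[0] ^= rcon                   # XOR with the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words


def add_round_key(state_words, round_key_words):
    """XOR each State word with the matching round-key word."""
    return [[s ^ k for s, k in zip(sw, kw)]
            for sw, kw in zip(state_words, round_key_words)]


# Example key from FIPS-197 Appendix A.1.
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
w = expand_key_128(key)
state = [[0x32, 0x43, 0xF6, 0xA8], [0x88, 0x5A, 0x30, 0x8D],
         [0x31, 0x31, 0x98, 0xA2], [0xE0, 0x37, 0x07, 0x34]]
once = add_round_key(state, w[0:4])
twice = add_round_key(once, w[0:4])   # XORing twice restores the State
```

Because the second XOR restores the State, decryption can reuse the same AddRoundKey routine with the round keys taken in reverse order.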
IXP2400 Platform: A Quick Look
Name      Size (bytes)   Transfer size (bytes)   Reference latency (cycles)
GPR/ME    256 x 4        4                       1
TR/ME     512 x 4        4                       1
NNR/ME    128 x 4        4                       1
LM/ME     640 x 4        4                       3
Scratch   16K            4                       60
SRAM      64M            4                       90
DRAM      1G             16                      120
• Achieve high processing performance
• Programming flexibility
• Cheaper than ASIC
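To see why the placement of lookup tables matters, the latencies in the table above can be turned into a rough per-block cost. The sketch below is a back-of-the-envelope Python estimate under loudly stated assumptions: one S-box lookup per State byte per round for AES-128 (16 x 10 lookups), no overlap between lookups, and no latency hiding by other threads. None of these figures are measurements from the project.

```python
# Reference latencies in cycles, taken from the memory table above.
LATENCY_CYCLES = {
    "local memory": 3,
    "scratch": 60,
    "SRAM": 90,
    "DRAM": 120,
}

# Assumption for illustration: one table lookup per byte per round, AES-128.
LOOKUPS_PER_BLOCK = 16 * 10


def lookup_cycles(memory):
    """Un-overlapped cycles spent on S-box lookups for one 16-byte block."""
    return LOOKUPS_PER_BLOCK * LATENCY_CYCLES[memory]


for name in LATENCY_CYCLES:
    print(f"{name:>12}: {lookup_cycles(name)} cycles per block")
```

In practice a MicroEngine swaps to another thread while a memory reference is outstanding, so the real cost is lower than this serial estimate; the ratio between memories is the point of the exercise.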
Microcode Overview
alu[dest1, a, +, b]    ALU addition of a and b, storing the result in dest1
alu[dest2, dest1, -, c]    ALU subtraction
Move(reg1, reg2)    Moving from reg1 to reg2; both are GPRs
Immed[reg, 0x0020]    Immediate value assignment to a register
local_csr_wr[ACTIVE_LM_ADDR_0, 0x0]    Local memory indexing with index 0
begin … endm    Macro begin and end
if … endif    If block
xbuf_alloc($$state, 4, read)    Buffer allocation in DRAM transfer registers
reg gen_register $sram_reg $$dram_reg    Register declaration
sig sram_sig dram_sig    Signal declaration
while … endw    While loop
for round[1,2,3,4,5,6,7,8,9,10] … endloop    For loop
alu_shf[index, --, B, s0, >>24]    ALU shift function of B
scratch[read, $T, index, 0, 1], ctx_swap[sram_sig]    Scratch read instruction
ld_field_w_clr[t1, 1000, $T]    Performs a write to the t1 register
dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig]    DRAM write
ctx_arb[dram_sig] / ctx_arb[kill]    Signaling
Implementation Setup
Environmental Setup
• Intel IXP 41
• 600-MHz ME configurations
• 200-MHz SRAMs
• 150-MHz RDRAMs
• Executed with multiple threads
• Executed on different MicroEngines
Experimental Results (1)
[Figure: Command Bus Arbiter Statistics (SRAM utilization). Y-axis: percentage, 0 to 100. Categories: None-SRAM, SRAM. Series: idle due to memory queue fullness, idle due to no request, used.]
[Figure: MicroEngine Utilisation Percentage (ME utilization). X-axis: number of threads in execution (8, 4, 2, 1). Y-axis: percentage, 0 to 100. Series: idle, stalled, aborted, executing.]
Experimental Results (2)
[Figure: Throughput Improvement for 1 MicroEngine with different threads. X-axis: number of threads (8, 4, 2, 1). Y-axis: throughput (MIPS), 0 to 500.]
[Figure: AES Throughput Across MicroEngines. X-axis: number of MicroEngines (1, 2, 4, 8). Y-axis: throughput (MIPS), 0 to 1800.]
Throughput Performance Across Threads in 1 ME
Crypto Unit of IXP2850
Intel IXP2850 Encryption Data Flow
Crypto Unit Overview
Simple Encrypt Example
Simple Encrypt and Hash Example
3DES Core
• 2 cores per crypto unit
• Takes a 192-bit key: (56-bit + 8-bit parity) x 3 keys
• Operates on 8-byte blocks
• Result is written to ME transfer registers or a TBUF element
• Result can be passed to the SHA-1 unit for hashing
Security Processing: pipelining and interleaving using three wires and one core; multiple keys and IVs
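The 192-bit 3DES key layout listed above (three keys, each 56 key bits plus 8 parity bits) can be illustrated with a short Python sketch. It assumes the standard DES convention that the least significant bit of every key byte is an odd-parity bit; the function names are made up for this example and are not part of the IXP2850 interface.

```python
def set_odd_parity(key):
    """Return the key with the low bit of each byte set so that every
    byte has an odd number of 1 bits (the DES parity convention)."""
    out = bytearray()
    for b in key:
        upper = b & 0xFE                      # the 7 real key bits
        ones = bin(upper).count("1")
        out.append(upper | (0 if ones % 2 else 1))
    return bytes(out)


def effective_key_bits(key):
    """Count the bits that actually contribute to the key (7 per byte)."""
    return 7 * len(key)


# 24 bytes = 192 bits = three 64-bit DES keys (hypothetical sample value).
raw_key = bytes(range(24))
key = set_odd_parity(raw_key)
```

For a 24-byte key this gives 168 effective key bits, which matches the 3 x 56 figure in the list above.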
AES Core
• All AES key sizes are supported (128, 192 or 256)
• Both encryption and decryption supported
• Operates on 16-byte blocks
AES Key Scheduler
SHA-1 Core
• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or the crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
• This gives flexibility to run SHA and AES/3DES at different rates
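The rate decoupling described above can be modelled in software: cipher output arrives in 8-byte (3DES) or 16-byte (AES) pieces, and the hash only consumes data once a full 64-byte block has accumulated. The following Python sketch is an analogy built on hashlib, not a description of the IXP2850 hardware; the class name HashAccumulator is an assumption made for this example.

```python
import hashlib

SHA1_BLOCK = 64  # bytes, i.e. the 512-bit block size


class HashAccumulator:
    """Buffer cipher output and feed SHA-1 one full 64-byte block at a time."""

    def __init__(self):
        self._buf = bytearray()
        self._sha = hashlib.sha1()
        self.blocks_hashed = 0

    def push(self, chunk):
        """Accept one cipher output chunk (for example 8 or 16 bytes)."""
        self._buf += chunk
        while len(self._buf) >= SHA1_BLOCK:
            self._sha.update(bytes(self._buf[:SHA1_BLOCK]))
            del self._buf[:SHA1_BLOCK]
            self.blocks_hashed += 1

    def digest(self):
        """Hash whatever is still buffered and return the final digest."""
        final = self._sha.copy()
        final.update(bytes(self._buf))
        return final.digest()


# Eight 16-byte chunks stand in for eight AES output blocks (sample data).
chunks = [bytes([i]) * 16 for i in range(8)]
acc = HashAccumulator()
for c in chunks:
    acc.push(c)
```

Four AES blocks or eight 3DES blocks fill one SHA-1 block, so the producer and consumer sides can run at different rates without losing data.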
SHA1 Critical Path Analysis
Some of the Crypto Commands
crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig)
    Perform and wait for the write
crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig)
    Loading IV data
crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig)
    Loading key
crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)
Acknowledgement
• Yan Luo
• Chris Baron, http://cnscenter.future.co.kr/resource/rsc-center/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
• Mel Tsai, UC Berkeley (for some slides)
• Thomas Sodon et al., EE, College of New Jersey
• Zhangxi Tan et al., Tsinghua University
Q…