Experimenting with linear search in encrypted data



Richard Brinkman, Ling Feng, Sandro Etalle, Pieter Hartel, Willem Jonker

{brinkman,ling,etalle,pieter,jonker}@cs.utwente.nl

University of Twente, Enschede

September 12, 2003

Abstract

Song, Wagner and Perrig have published a theoretical paper about keyword search on encrypted textual data. We describe a prototype implementing their theory. Tests are carried out with this prototype to analyse efficiency and timing aspects. As expected, encryption and search times are linear in the size of the database. More interestingly, they also depend on the parameters used in the protocol.

1 Introduction

In their paper, Song et al. [1] describe a protocol to store sensitive data on an untrusted server. A client (Alice) can store data on the untrusted server (Bob) and search in it, without revealing the plain text of either the stored data, the query or the query result. The protocol consists of three parts: storage, search and retrieval. After summarising the protocol we will discuss our implementation and test results showing the influence of various parameters on the encryption and search times.

Storage. Before Alice can store information on Bob she has to do some calculations. First of all she has to fragment the whole plain text W into several fixed sized words W_i. Each W_i has length n. She also generates encryption keys k′ and k′′ and a sequence of random numbers S_i using a pseudo random generator. She then has, or can calculate, the following for each block W_i:

    W_i                                 plain text block
    k′′                                 encryption key
    X_i = E_{k′′}(W_i) = ⟨L_i, R_i⟩     encrypted text block
    k′                                  key for f
    k_i = f_{k′}(L_i)                   key for F
    S_i                                 random number i
    T_i = ⟨S_i, F_{k_i}(S_i)⟩           tuple used by search
    C_i = X_i ⊕ T_i                     value to be stored


where E is an encryption function and f and F are keyed hash functions:

    E : key_64 → int_n → int_n
    f : key_64 → int_{n−m} → key_64
    F : key_64 → int_{n−m} → int_m

The encrypted word X_i has the same block length as W_i (i.e. n). L_i has length n−m and R_i has length m (see Figure 1). The parameters n and m may be chosen freely. Section 4 gives guidelines for efficient values of n and m. The value C_i can be sent to Bob and stored there. Alice may now forget the values W_i, X_i, L_i, R_i, k_i, T_i and C_i, but should still remember k′, k′′ and S_i.
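The storage steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's code: the prototype uses DES for E, f and F, whereas this sketch substitutes a toy Feistel network for E and HMAC-SHA256 for f and F so that it runs with the standard library alone. The function names, the derivation of S_i from a seed, the zero padding of the last block, and the choice of bytes as the unit for n and m are all assumptions made for the example.

```python
import hashlib
import hmac


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def keyed_hash(key: bytes, data: bytes, out_len: int) -> bytes:
    """Stand-in for f and F: HMAC-SHA256 in counter mode, cut to out_len bytes."""
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hmac.new(key, bytes([counter]) + data, hashlib.sha256).digest()
        counter += 1
    return out[:out_len]


def toy_encrypt(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Stand-in for E: a toy balanced Feistel network (NOT DES, not secure)."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for r in range(rounds):
        left, right = right, xor(left, keyed_hash(key, bytes([r]) + right, half))
    return left + right


def random_value(seed: bytes, i: int, length: int) -> bytes:
    """Stand-in pseudo random generator: S_i is derived from the seed and i."""
    return keyed_hash(seed, i.to_bytes(8, "big"), length)


def store_block(W_i: bytes, i: int, k1: bytes, k2: bytes, seed: bytes, m: int) -> bytes:
    """Compute C_i = X_i xor T_i for one block (k1 plays k', k2 plays k'')."""
    n = len(W_i)
    X_i = toy_encrypt(k2, W_i)           # X_i = E_{k''}(W_i)
    L_i = X_i[: n - m]                   # left n-m bytes of X_i
    k_i = keyed_hash(k1, L_i, 8)         # k_i = f_{k'}(L_i), a 64-bit key
    S_i = random_value(seed, i, n - m)   # random value for position i
    T_i = S_i + keyed_hash(k_i, S_i, m)  # T_i = <S_i, F_{k_i}(S_i)>
    return xor(X_i, T_i)                 # C_i, the value sent to Bob


def store(plain: bytes, k1: bytes, k2: bytes, seed: bytes, n: int, m: int) -> list:
    """Fragment W into n-byte blocks (zero-padding the last) and encrypt each."""
    padded = plain + b"\x00" * (-len(plain) % n)
    return [store_block(padded[o : o + n], o // n, k1, k2, seed, m)
            for o in range(0, len(padded), n)]
```

Calling `store(text, k1, k2, seed, 8, 3)` returns one n-byte cipher block per plain text block; only the two keys and the seed have to stay with Alice.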

Search. After the encrypted data has been stored on Bob in the previous phase, Alice can send Bob queries. Alice can provide Bob with an encrypted version of the plain text word W_j and ask him if and where W_j occurs in the original document. Note that Alice does not have to know the value of j. If W_j was a block in the original data then ⟨j, C_j⟩ is returned. Alice has, or can calculate:

    k′′                                 encryption key
    k′                                  key for f
    W_j                                 plain text block to look for
    X_j = E_{k′′}(W_j) = ⟨L_j, R_j⟩     encrypted block
    k_j = f_{k′}(L_j)                   key for F

Then Alice sends the values of X_j and k_j to Bob. Having X_j and k_j, Bob is able to compute for each C_p:

    T_p = C_p ⊕ X_j = ⟨S_p, S′_p⟩, which is ⟨S_p, F_{k_j}(S_p)⟩ if p = j and garbage otherwise
    IF S′_p = F_{k_j}(S_p) THEN RETURN ⟨p, C_p⟩

Note that all locations with a correct T_p value are returned. However, there is a small chance that T satisfies T = ⟨S_q, F_{k_j}(S_q)⟩ but where S_q ≠ S_p. Therefore Alice should check for each answer whether the correct random value was used.
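Bob's side of the search, and Alice's follow-up check, can be sketched as below. As before this is an illustration only: `F` is an HMAC-SHA256 stand-in for the DES-based keyed hash of the prototype, lengths are in bytes, and the names `search` and `alice_confirms` are hypothetical.

```python
import hashlib
import hmac


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def F(key: bytes, data: bytes, out_len: int) -> bytes:
    """Stand-in keyed hash: HMAC-SHA256 in counter mode, cut to out_len bytes."""
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hmac.new(key, bytes([counter]) + data, hashlib.sha256).digest()
        counter += 1
    return out[:out_len]


def search(cipher_blocks, X_j: bytes, k_j: bytes, m: int) -> list:
    """Bob's linear scan: return every <p, C_p> whose tuple T_p checks out."""
    hits = []
    for p, C_p in enumerate(cipher_blocks):
        T_p = xor(C_p, X_j)                  # T_p = C_p xor X_j = <S_p, S'_p>
        S_p, S_check = T_p[:-m], T_p[-m:]    # n-m bytes and m bytes
        if hmac.compare_digest(S_check, F(k_j, S_p, m)):
            hits.append((p, C_p))            # may include false hits
    return hits


def alice_confirms(C_p: bytes, X_j: bytes, S_p: bytes) -> bool:
    """Alice's check: the recovered random value must equal her own S_p."""
    return xor(C_p, X_j)[: len(S_p)] == S_p
```

Bob never needs E, f or the keys k′ and k′′; the scan uses only X_j, k_j and the stored blocks, which is why every block has to be visited.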

Retrieval. Alice can also ask Bob for the cipher text at any position p. Alice, knowing k′, k′′ and the seed for S, can recalculate W_p by

    k′                                  key for f
    k′′                                 encryption key
    p                                   desired location
    C_p = ⟨C_{p,l}, C_{p,r}⟩            stored block
    S_p                                 random value
    X_{p,l} = C_{p,l} ⊕ S_p             left part of encrypted block
    k_p = f_{k′}(X_{p,l})               key for F
    T_p = ⟨S_p, F_{k_p}(S_p)⟩           check tuple
    X_p = C_p ⊕ T_p                     encrypted block
    W_p = D_{k′′}(X_p)                  plain text block

where D is the decryption function D : key_64 → int_n → int_n such that D_{k′′}(E_{k′′}(W_i)) = W_i.
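The retrieval steps can be traced with the sketch below. It is an illustration under assumptions, not the prototype: a toy Feistel network stands in for DES as E and D, HMAC-SHA256 stands in for f and F, lengths are in bytes, and all function names are hypothetical. Alice is assumed to have regenerated S_p from her seed before calling `retrieve_block`.

```python
import hashlib
import hmac


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def keyed_hash(key: bytes, data: bytes, out_len: int) -> bytes:
    """Stand-in for f and F: HMAC-SHA256 in counter mode, cut to out_len bytes."""
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hmac.new(key, bytes([counter]) + data, hashlib.sha256).digest()
        counter += 1
    return out[:out_len]


def toy_encrypt(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Stand-in for E: a toy balanced Feistel network (NOT DES, not secure)."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for r in range(rounds):
        left, right = right, xor(left, keyed_hash(key, bytes([r]) + right, half))
    return left + right


def toy_decrypt(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Stand-in for D: undoes toy_encrypt by running the rounds backwards."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for r in reversed(range(rounds)):
        left, right = xor(right, keyed_hash(key, bytes([r]) + left, half)), left
    return left + right


def retrieve_block(C_p: bytes, S_p: bytes, k1: bytes, k2: bytes, m: int) -> bytes:
    """Recover W_p from the stored block C_p (k1 plays k', k2 plays k'')."""
    n = len(C_p)
    X_left = xor(C_p[: n - m], S_p)      # X_{p,l} = C_{p,l} xor S_p
    k_p = keyed_hash(k1, X_left, 8)      # k_p = f_{k'}(X_{p,l})
    T_p = S_p + keyed_hash(k_p, S_p, m)  # T_p = <S_p, F_{k_p}(S_p)>
    X_p = xor(C_p, T_p)                  # X_p = C_p xor T_p
    return toy_decrypt(k2, X_p)          # W_p = D_{k''}(X_p)
```

The left part of X_p can be recovered before k_p is known because S_p masks exactly the first n−m bytes, and f takes only that left part as input.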


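[Figure 1: Encryption schema. The diagram shows W_i (length n) encrypted by E into X_i = ⟨L_i, R_i⟩ (lengths n−m and m); f, keyed with the 64-bit key k′, maps L_i to the 64-bit key k_i; F, keyed with k_i, maps S_i (length n−m) to F_{k_i}(S_i) (length m); X_i combined with ⟨S_i, F_{k_i}(S_i)⟩ yields C_i (length n).]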


This is all Alice needs. She can store, find and read the text while Bob cannot read anything of the plain text. The only information Bob gets from Alice is C in the store phase and X_j and k_j in the search phase. Since C and X_j are both encrypted with a key only known to Alice and k_j is only used to hash one particular random value, Bob does not learn anything of the plain text.

2 Implementation

Section 1 introduces three functions: E, f and F. Figure 1 shows how they are connected to each other. E should be a block cipher in ECB mode and f and F keyed hash functions. In the original article [1] they were called pseudo-random functions. This suggests that the output of the function is random, which it is not. Therefore we prefer the name keyed hash function. For our prototype we chose DES for all three of them. E is exactly DES in ECB mode. Since DES works on blocks of 64 bits, n should be a multiple of 64 bits.

f and F are keyed hash functions with variable sized hash values. Standard hash functions like SHA-1 have a fixed sized hash value. You could use the last (or the first) m bits of the hash value, but then m cannot exceed the size of the hash value (160 bits for SHA-1). To allow a larger value for m our prototype uses DES in CBC mode. To hash a data block of length n−m to a hash value of length m, the block is encrypted with the specified key (a 64-bit DES key) but only the last m bits are used as hash value. The only restriction for m is that n−m ≥ m and thus n ≥ 2m. See the ‘Handbook of Applied Cryptography’ [2] for a more detailed description of the hash algorithm used.
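The limitation of the truncation approach is easy to see in code. The sketch below is not what the prototype does (the prototype uses DES in CBC mode); it shows the rejected alternative, using HMAC-SHA1 as an assumed keyed variant of SHA-1, with m counted in bytes and a hypothetical function name.

```python
import hashlib
import hmac

# SHA-1 produces 160 bits, i.e. 20 bytes, regardless of the input length.
SHA1_BYTES = hashlib.sha1().digest_size


def truncated_keyed_hash(key: bytes, data: bytes, m: int) -> bytes:
    """Keyed hash of m bytes, taken from the end of an HMAC-SHA1 value."""
    if not 1 <= m <= SHA1_BYTES:
        raise ValueError(f"m must be between 1 and {SHA1_BYTES} bytes")
    return hmac.new(key, data, hashlib.sha1).digest()[-m:]
```

With n = 64 bytes the largest allowed m is n/2 = 32 bytes, which is more than the 20 bytes a single SHA-1 value can supply, so this construction cannot cover the whole parameter range.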

The encryption algorithm is implemented in a Java file named Encrypt.java. It is a command-line tool that encrypts the standard input to the standard output. It uses the standard crypto package shipped with JDK 1.4. The parameters n, m, the seed for S and the two keys k′ and k′′ are stored in a property file. This property file is the only information that has to be stored on Alice.

The search algorithm closely follows the protocol described in [1] as summarised in section 1. It is implemented in the Search.java file. All the parameters except W are read from the same property file used for encryption. The parameter W is supplied as a command-line parameter. The program takes the whole cipher text as input and produces the ⟨i, C_i⟩ pairs as output.

3 Experimental data

The Encrypt and Search tools give us the opportunity to experiment with the parameters used in the protocol. We are especially interested in the influence the parameters n and m have on the encryption and search speed. We used the XML benchmark¹ to generate three sample files of sizes 1 MB, 10 MB and 100 MB. Although these files are XML files the tree structure is not used in the protocol. The tools just consider them as large text files. The benchmark is only used to compare the results with previous and with future experiments where we intend to exploit the tree structure for more efficient queries on encrypted data.

¹ http://www.xml-benchmark.org


The number of collisions has also been measured. Collisions are the false hits that occur because of collisions in the hash function F. F hashes the random value S_i of size n−m to a hash value of length m, where n−m ≥ m. Therefore collisions are unavoidable in general (they could be avoided only when n−m = m and F is bijective, but bijective functions are not considered good hash functions).

Tests are carried out for all n ∈ {8, 16, 24, 32, 40, 48, 56, 64}, where these values are numbers of bytes, not bits. Because we use DES in ECB mode for the encryption function E we only use multiples of 8 bytes. m should be less than or equal to n/2, so m ∈ {1, 2, . . . , n/2}. Measurement results are plotted in figures 2-4. The absolute values are not interesting because they depend on the physical hardware. However, differences between the various configurations are interesting.
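The parameter space described above can be enumerated directly. The following sketch only lists the (n, m) pairs implied by the text, with both values in bytes; the function name is hypothetical and the actual test harness is not described in the paper.

```python
def parameter_grid():
    """All (n, m) pairs with n a multiple of 8 up to 64 bytes and 1 <= m <= n/2."""
    return [(n, m) for n in range(8, 65, 8) for m in range(1, n // 2 + 1)]


grid = parameter_grid()
print(len(grid))  # number of configurations per dataset
```

Summing n/2 over the eight values of n gives 4 + 8 + ... + 32 = 144 configurations for each of the three datasets.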

4 Results

From the experiments we conclude that:

• The larger the dataset the larger the encryption and search times. Encryption and search times grow linearly in the size of the dataset. Therefore the protocol does not scale well and can only be used for reasonably small databases. (Compare the absolute values of figures 2(a)-4(a) and 2(b)-4(b).)

• The larger n the shorter the encryption and search times (figures 2(a,b)-4(a,b)). This can be explained by looking at the number of blocks. The larger n the fewer blocks there are. For each block a fixed number of steps are taken. Most of these steps do not depend on the length of the blocks. Therefore less time is needed for the whole database.

• The larger n the fewer collisions occur (figures 2(c)-4(c)). This can also be explained by the smaller number of blocks.

• For a fixed value of n the encryption and search times hardly depend on the value of m (horizontal lines in figures 2(a,b)-4(a,b)).

• Collisions can be avoided by choosing a sufficiently large value of m. The optimal case is m = n/2. But also for m > 2 the number of collisions is negligible.
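The two observations about collisions can be made plausible with a back-of-the-envelope estimate. The sketch below is not a result from the experiments: it assumes that F behaves like a uniformly random function, so that a non-matching block passes Bob's check with probability 2^(−8m) for m in bytes, and it assumes 1 MB = 2^20 bytes. The function name is hypothetical.

```python
def expected_false_hits(size_bytes: int, n: int, m: int) -> float:
    """Rough expected number of false hits for one query (n, m in bytes)."""
    blocks = size_bytes // n          # fewer blocks when n is larger
    return blocks * 2.0 ** (-8 * m)   # each block collides with prob. 2^(-8m)


one_mb = 2 ** 20
print(expected_false_hits(one_mb, 8, 1))   # small n, small m
print(expected_false_hits(one_mb, 64, 1))  # larger n, same m
print(expected_false_hits(one_mb, 8, 3))   # small n, m > 2
```

Under these assumptions the estimate falls by a factor of 8 when n grows from 8 to 64 bytes, and drops below one expected false hit per query once m exceeds 2 bytes, in line with the qualitative conclusions listed above.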

5 Conclusions and future work

The prototype implementing the theory described in [1] shows that searching in small encrypted text files works not only in theory but also in practice, although the complexity is linear in the size of the text. We are currently investigating means to adapt the protocol to semi-structured XML data which combines both structured and semi-structured XML data.

Structured data. Although the tests were carried out on large XML files the structure of XML was not exploited. We plan to investigate the possibility of adapting the protocol to make use of the structure in the data. We will focus on XML data.


Faster search strategies. The tree structure of the XML data can be exploited to increase efficiency. Whereas linear search is necessary in order to search for a word in an unstructured text, faster search strategies are possible when looking for a specific path in structured XML data. Tree search or indexed search may decrease search time dramatically.

Variable block size. The original protocol works with a fixed block size. Words in a natural language such as English have variable lengths. Therefore English words have to be padded or split, which makes it more difficult to search for them.

References

[1] Dawn Xiaodong Song, David Wagner, and Adrian Perrig. Practical techniques for searches on encrypted data. In IEEE Symposium on Security and Privacy, pages 44–55, 2000.

[2] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, October 1996.


[Figure 2: Measurement results for small dataset (1 MB). Panels: (a) Encryption speed, (b) Search speed, (c) Number of collisions. Each panel plots its quantity against m and n: t (ms) in (a) and (b), number of collisions in (c).]


[Figure 3: Measurement results for medium sized dataset (10 MB). Panels: (a) Encryption speed, (b) Search speed, (c) Number of collisions. Each panel plots its quantity against m and n: t (ms) in (a) and (b), number of collisions in (c).]


[Figure 4: Measurement results for large dataset (100 MB). Panels: (a) Encryption speed, (b) Search speed, (c) Number of collisions. Each panel plots its quantity against m and n: t (ms) in (a) and (b), number of collisions in (c).]
