Delta Compression


  • Delta Compression of Executable Code: Analysis, Implementation and Application-Specific Improvements

    Lothar May

    Master Thesis, Information Technology

    Supervisors: Prof. Dr. André Neubauer, Prof. Dr. Michael Tüxen

    19 November 2008 (rev2)

  • Contents

    1 Introduction
        1.1 Security Patches
        1.2 Matching With Mismatches
        1.3 Structure
        1.4 Conventions
        1.5 Definitions

    2 Analysis
        2.1 Approach
        2.2 Motivation
            2.2.1 Intuitive Substring Matching
            2.2.2 Run Time Considerations
        2.3 Derivation of the Algorithm
            2.3.1 Formal Model
            2.3.2 Restrictions of the Model
            2.3.3 Optimisation using the FFT
            2.3.4 Projecting onto Subspaces
            2.3.5 Optimising Random Behaviour
        2.4 The Algorithm
            2.4.1 Reference Version
            2.4.2 The First Variant
            2.4.3 The Second Variant
            2.4.4 The Third Variant
            2.4.5 Comparison

    3 Implementation
        3.1 Considerations
            3.1.1 Portability
            3.1.2 Environment
            3.1.3 Library Integration
        3.2 Implementation
            3.2.1 Structure
            3.2.2 Vector Type
            3.2.3 Selection of Primes
            3.2.4 Core Algorithm
            3.2.5 File Input
            3.2.6 Post Processing
        3.3 Further Steps

    4 Improvements
        4.1 Theorem 1.1
            4.1.1 Selecting Primes Randomly
            4.1.2 Numerical Probability
        4.2 Derandomisation
            4.2.1 Selecting Primes Non-Randomly
            4.2.2 Creating Φ Non-Randomly

    5 Conclusion

    Appendix

    Bibliography

  • List of Figures

    1 Definition of example strings S and T
    2 Plot of match count calculated from example strings S and T
    3 Matlab function to calculate the match count vector V
    4 Matlab function to find positions with at least 50% matches
    5 Matlab function to construct strings S and T according to the formal model
    6 Construction example: At position 6 in S the string T randomly matches
    7 Plot of the probability of a randomly good match not in X
    8 Matlab function to calculate matches using the cyclic correlation
    9 Matlab code to calculate the cyclic correlation using the FFT
    10 Plot of the cyclic correlation C calculated from example strings S and T
    11 Matlab calls to construct data according to the model (|X| = 1)
    12 Plot of the cyclic correlation C of model data (|X| = 1)
    13 Matlab function to project data before calculating the cyclic correlation
    14 Plot of the cyclic correlation C^(1) of projected model data (|X| = 1)
    15 Plot of the cyclic correlation C^(2) of projected model data (|X| = 1)
    16 Chinese Remainder Theorem: Matlab function which calculates the solution
    17 Matlab calls to construct data according to the model (|X| = 3)
    18 Plot of the cyclic correlation C of model data (|X| = 3)
    19 Plot of the cyclic correlation C^(1) of projected model data (|X| = 3)
    20 Plot of the cyclic correlation C^(2) of projected model data (|X| = 3)
    21 Matlab function to randomly select primes for the reconstruction of X
    22 Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)
    23 Construction example: At position 6 in S a non-existing match of T is found
    24 Construction according to figure 23 with a different φ
    25 Matlab function to project data with varying φ_i(x)
    26 Simple Matlab function to perform matching with mismatches
    27 Matlab function to check input values according to the model
    28 Matlab code which demonstrates the usage of our reference algorithm
    29 Matlab function implementing Algorithm 1.1 of [1]
    30 Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)
    31 Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)
    32 Limits of m in algorithm 1.1 (ε = 0.1, p = 0.9, t = 2)
    33 Matlab calls to construct data using primes around n/10
    34 Plot of the cyclic correlation C^(1) of data using primes around n/10
    35 Plot of the cyclic correlation C^(2) of data using primes around n/10
    36 Matlab function implementing Algorithm 1.2 of [1]
    37 Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)
    38 Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)
    39 Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)
    40 Plot of y = log( + e^x) with  = 2
    41 Matlab function to calculate vector D
    42 Plot of the raw cyclic correlation C
    43 Plot of the filtered cyclic correlation D
    44 Matlab function implementing Algorithm 1.3 of [1]
    45 Matlab code which demonstrates the usage of algorithm 1.3 (pedantic=1)
    46 Limit on m in algorithm 1.3 (ε = 0.1, p = 0.9, t = 2)
    47 Matlab function to reconstruct X as in theorem 1.1 of [1]
    48 Matlab function to provide numeric probability on a good match not in X
    49 Helper function to add up good matches not in X
    50 Helper function to count good matches not in X
    51 Matlab function to compute C as in figure 8 by using the FFT as in figure 9
    52 Matlab function to simply calculate the positions of the t largest values
    53 Matlab function to iteratively calculate the Cartesian product (2 dimensions)
    54 Matlab function to recursively calculate the Cartesian product

  • List of Tables

    1 Simple and slow substring matching
    2 Residues and solutions for p1 = 5003, p2 = 6007, X = {7700, 8050, 9000}
    3 Test results using n = 102400, m = 512, p = 0.9, X = {34, 39411, 101410}
    4 Base libraries for the C++ implementation
    5 Core modules of the C++ implementation of the algorithm
    6 Additional modules needed for delta compression
    7 Elements of the pre-calculated index of the implementation
    8 Numeric tests on probability of reconstructing X

  • Chapter 1

    Introduction

    1.1 Security Patches

    Regardless of which operating system (OS) is used, security patches need to be applied frequently. This lays a heavy burden on OS vendors: they need to provide the patches quickly on servers with sufficient bandwidth1. The users, on the other hand, often need a lot of patience when downloading the patches. If they do not have this patience, their systems might end up being vulnerable to known exploits.

    The trigger for a security update is often as simple as changing a few lines of source code, for example to prevent a buffer overflow. Changing these few lines can have an enormous effect, however, if a large executable file needs to be replaced by a patched version. This can lead to a security patch of several megabytes, which is then downloaded by thousands or even millions of people. Not only does this cause huge bandwidth costs; the OS vendor also needs to provide update servers, network devices and manpower to administrate them. Even worse are accumulated patches, for example Service Packs, which easily surpass a size of 100 megabytes. Deploying these patches is costly and quite a challenge.

    Not releasing patches at all in order to save costs is not an option, because nowadays most computers2 are connected to the Internet. Known vulnerabilities need to be fixed as quickly as possible to provide basic security. In larger companies there are usually dedicated servers which maintain a cache of updates for employees. This speeds up the process and saves bandwidth costs, but it is not a general solution. Getting back to the root of the problem, the size of the patches needs to be reduced.

    One way to achieve this is to use many small shared libraries instead of big executable files. If the security fix is local, only a small library needs to be replaced. However, it is not always feasible to instruct the programmers how to write their programs, especially if development is not managed centrally. Also, sometimes there are good reasons not to use shared libraries. Not

    1 Even though not entirely correct, bandwidth is used as a synonym for data rate, as is often done in the literature on computer science.

    2 and also many embedded systems


    to mention that some security fixes cause changes in multiple files, and this will again result in large patches if the new files are copied.

    A totally different approach is to consider the existing files on the system, and compare them to the corresponding files containing security fixes, in order to avoid replacing entire files by new ones. Since these fixes usually implement only a few changes, most of the data is already present on the system within the files which need patching. If this method is applied, and only the differences of the files are stored in security patches, they could be tremendously smaller. Actually, some OS vendors have recently started using this technique to deploy at least some of their updates:

    Microsoft Windows Update on Windows XP and above supports Binary Delta Compression (BDC) (see [15]), but this is a proprietary system and little information is available on its inner workings.

    FreeBSD and Mac OS X both come with the open source tool bsdiff 4 (see [2]) and provide update tools which (can) make use of it.

    Colin Percival, the author of bsdiff, has written a doctoral thesis [1] on this subject, in which he presents an algorithm to further improve bsdiff 4. Unfortunately, he did not publish the source code for his new algorithm, and there is only little third-party material available on it. A reference to bsdiff 6 can be found in [5] (on page 939), but it does not provide a description. The authors of [4] use bsdiff 6 for comparison with their patch tool, but again no details are mentioned. If we want to make use of the new algorithm, we need to work through the doctoral thesis.

    1.2 Matching With Mismatches

    The present master thesis is mainly about the aforementioned doctoral thesis Matching with Mismatches and Assorted Applications [1]. Of that thesis, the focus is on the first chapter, which specifies an algorithm for matching with mismatches in three iterations.

    What is matching with mismatches? It basically means finding something similar to what is present, where similar implies that it can be exactly the same or might be modified at some places. These possible modifications are the mismatches. Insertions (additional data in between) or deletions (removed data) are not considered for the measurement of the similarity, only in-place changes.3 This is related to the Hamming distance (see e.g. [14] on page 19): Given a large string S and a small string T, we look for substrings4 in S with low Hamming distance to T.

    Why is this useful for delta compression of executable code? Our aim is to encode only the differences of two executable files, specifically the original file and a different version including a security fix. Security fixes usually implement only a few source code modifications, and thus,

    3 Other algorithms, which consider insertions and deletions, are described in [3].
    4 meaning contiguous parts of the string S with the same length as T


    very little code is actually added or removed. However, due to these modifications, memory addresses throughout the executable file change.5 In this context, an algorithm for matching with mismatches helps to find similar blocks of the original file in the new file, such that the differences can be identified and encoded. Blocks or parts of blocks which are identical in both files can be referenced and do not need to be copied.

    1.3 Structure

    This thesis is structured as follows:

    1. Introduction
       In the course of this chapter, conventions and definitions are provided which are used throughout the thesis.

    2. Analysis
       The analysis contains a step-by-step derivation of the algorithm for matching with mismatches. This includes the motivation for each step as well as numerical examples. Additionally, example code for the different iterations of the algorithms is provided, and basic tests are performed.

    3. Implementation
       Based on the prior analysis, an implementation of the algorithm using C++ is presented, and possible use in a tool for delta compression is prepared. This chapter describes the basic structure of the implementation and additional considerations like portability and third party libraries.

    4. Improvements
       With regard to the analysis and the C++ implementation, specific improvements of the algorithm are proposed and illustrated.

    5. Conclusion
       In this chapter, we provide an overview of what we have achieved and give an outlook identifying possible further projects.

    1.4 Conventions

    All examples are written in Matlab [19] code. However, neither knowledge of Matlab nor a license of Matlab is required to understand this thesis and test the examples. Basic knowledge of C should be sufficient to be able to read the code, and GNU Octave [20] (with the additional packages from Octave-Forge [21]) can be used to run the examples.

    5 see also [1] on page 32


    Nevertheless, it should be noted that array and vector indexing in Matlab code starts at 1 (i.e. array{1} is the first element). This is opposed to C, where indexing starts at 0 (i.e. array[0]). In spite of this, we still stick to the constraints of all values as described in [1], which are mainly zero-based. So whenever there is an unexplained increment or decrement in the code, the reason is almost certainly the difference in indexing.
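    For example, the off-by-one adjustments look like this in practice (a trivial illustration of our own, not taken from [1]):

    V = [4 0 5];   % match counts V_0, V_1, V_2 in the zero-based notation of [1]
    j = 2;         % a zero-based position as used in the formulas
    V(j + 1)       % returns 5, i.e. the value called V_2 in [1]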

    Whenever possible, we use the symbols of [1] (on pages iii-iv). This means that the reader may generally switch to and from the doctoral thesis without problems. One notable difference, though, is the use of the vector indices i and j: In [1], the index i is first used as the index for the match count vector (V_i), but later i represents the vector number and j becomes the index (e.g. A^(i)_j). For the sake of clarity, we use j as the vector index from the start. Additionally, when we are talking about the vectors A^(i), B^(i), or C^(i), we actually mean A^(i), B^(i), and C^(i), respectively, for all i ∈ {1, . . . , k}, with k being described in the context.

    Some concepts of probability theory are applied in this thesis without further explanation; for more information on this subject we refer to [8], especially chapter 3.

    1.5 Definitions

    A programmer usually regards executable code as binary data. A string, in contrast to that, is seen as human-readable data, which is binary data with special semantics. This distinction is generally applied because there are certain functions which do not work with all forms of binary data. Yet in the context of this thesis it is not relevant. We therefore take the mathematical point of view (see also [9] on pages 28-29):

    Definition (Alphabet)

    An alphabet is any finite non-empty set.

    Definition (String)

    A string over an alphabet is a finite sequence of elements from the alphabet.

    These definitions will be used throughout the thesis, so whenever we are talking about a string, we do not impose any semantics on its elements. To give an example, an executable file (analysed at byte level) is a string over the alphabet Σ_exe = {0, 1, . . . , 255}.


    Additionally, we provide some definitions of mathematical terms used in this thesis:

    Definition (Ceiling)

    ⌈·⌉ : R → Z is the ceiling function which rounds towards +∞, i.e. ⌈x⌉ returns the smallest n ∈ Z which is not less than x. The corresponding Matlab function is ceil.

    Definition (Addition of Sets)

    Given A = {a1, a2, . . . , an} ⊂ Z and B = {b1, b2, . . . , bm} ⊂ Z, we define

    A + B = { a_i + b_j | i ∈ {1, 2, . . . , |A|}, j ∈ {1, 2, . . . , |B|} }

    Definition (Interval)

    [a, b) = {n ∈ N | a ≤ n < b} for a, b ∈ R. Note that this is not a common definition, because we only allow positive integers as elements of the interval.
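    As a small illustration of the set addition defined above (our own example, not from [1]), the sums can be accumulated with the Matlab/Octave union function:

    A = [1 2 3];
    B = [10 20];
    AplusB = [];
    for i = 1 : length(A)
        AplusB = union(AplusB, A(i) + B);   % add A(i) to every element of B
    end
    AplusB   % returns [11 12 13 21 22 23]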


  • Chapter 2

    Analysis

    2.1 Approach

    In his doctoral thesis [1], Colin Percival writes that the first chapter, in which he introduces the new algorithm for matching with mismatches, is "not for the faint of heart"6. This is quite true: the mathematics for the algorithm might seem daunting. We try to ease the pain a bit. However, even if we have applications in mind, a good understanding of the underlying algorithm is essential.

    In this chapter we provide a description of the algorithm for matching with mismatches with

    • detailed common sense motivation and reasoning,

    • various numerical examples, including numerical evidence and restrictions.

    Our approach is to numerically show how and why the algorithm works. We regard this as a reasonable addition to chapter 1 of [1], which mainly consists of lemmas/theorems and proofs.

    2.2 Motivation

    2.2.1 Intuitive Substring Matching

    Substring matching is used very frequently in everyday life. For example, the "Find..." function of any word processor performs a substring search in a larger string. This search is usually designed to find only exact matches7. Whenever a single element does not match, it means that the whole substring does not match.

    Considering that we want to encode the differences of two files, exact matches would help, but they are too strict to be used in general (see also [1] on page 33). It is more useful to also find sections which mostly match, with some mismatches, where the mismatches could for example be modified memory addresses.

    6 See [1] on page 3.
    7 maybe being case insensitive


    Intuitively, this can be done as follows: We iterate through the large string and compare the small string with the substring at the current position. For each substring comparison, we count the number of elements which match, and then process these match counts to find the good matches.8 As an example, consider the strings9 S, T and their lengths n, m in figure 1.

    S = 'do not tarry water carry';
    T = 'carry';
    n = length(S); % = 24
    m = length(T); % = 5

    Figure 1: Definition of example strings S and T

    Performing a substring match of T in S using the intuitive algorithm we described above leads to the result shown in table 1 with a plot as in figure 2.

    Table 1: Simple and slow substring matching

    S = do not tarry water carry, T = carry

    Position (j)    Match Count (Vj)
     0              0
     1              0
     2              0
     3              0
     4              0
     5              0
     6              1
     7              4
     8              1
     9              0
    10              0
    11              0
    12              0
    13              1
    14              1
    15              1
    16              0
    17              0
    18              1
    19              5

    In the language of mathematics, the match counts can be seen as a vector, and the calculation is formally done as follows (see [1] on page 6):

    V_j = Σ_{i=0}^{m−1} δ(S_{i+j}, T_i)        ∀ j ∈ {0, . . . , n−m}   (2.1)

    8 Good matches are matches with a high number of matching characters, compared to the maximum possible match count.
    9 The longer string was taken from J. W. Goethe's The Sorcerer's Apprentice.


    Figure 2: Plot of match count calculated from example strings S and T

    The function δ : Σ × Σ → R is in our case the Kronecker delta, i.e.:

    δ(a, b) = { 1  if a = b
              { 0  otherwise        ∀ a, b ∈ Σ   (2.2)

    Translating equation (2.1) to Matlab code is fairly straightforward (see figure 3), except that we should not forget to check the input constraint.

    function [V] = match(S, T)
    n = length(S);
    m = length(T);
    % Check matching predicate.
    if (not (m < n))
        error('Invalid vector lengths.');
    end
    % Calculate match count vector.
    V = zeros(1, n-m+1);
    for j = 0 : (n-m)
        V(j+1) = sum(S(j+1:j+m) == T);
    end

    Figure 3: Matlab function to calculate the match count vector V

    Now that we are able to calculate the match counts, we need to process them to find good matches. In our example (table 1), the maximum possible match count is m = 5, which is the length of T, the smaller string. Assuming that we wish to find the positions with at least 50% of the maximum match count, we require no less than ⌈m/2⌉ = 3 matches for one substring. Thus, we extract only those positions j which satisfy the predicate V_j ≥ ⌈m/2⌉. This way, we get all the spikes in figure 2, while ignoring small match counts which are kind of random. This filtering is easily done in Matlab code (see figure 4).


    function [J] = find_good_matches(V, m)
    J = find(V >= ceil(m/2)) - 1;

    Figure 4: Matlab function to find positions with at least 50% matches
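    Running both functions on the example strings of figure 1 reproduces the spikes of table 1:

    S = 'do not tarry water carry';
    T = 'carry';
    V = match(S, T);
    J = find_good_matches(V, length(T))   % returns [7 19], the positions with at least 3 matches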

    We have presented an intuitive algorithm for matching with mismatches, implemented it, and it works fine. However, is this algorithm also applicable for our problem of finding similar sections of two different files, given that these files contain much more data than our test input strings?

    2.2.2 Run Time Considerations

    We observe that our algorithm requires n − m + 1 steps when iterating through S, and each step consists of m element comparisons. Additionally, it requires n − m + 1 steps to extract the good matches, even if this is done on the fly. All in all it will complete in O((n−m+1)(m+1)) = O(nm + n − m² + 1) time.10

    For large n with n ≫ m, the run time can be approximated as O(nm). The factor n seems quite natural, as we need to iterate through (most of) S. However, the factor m is specific to the method of comparison we are using, so there might be room for optimisation.

    In the context of comparing two files, n could be the size of one file, and m could be the size of a block of the second file which we are trying to match. The file size tends to be quite large nowadays, for example n_exe = 2,097,152 (2 MB), with a sample block size of m_exe = 2,048 (2 KB). Run time is O(n_exe · m_exe), which results in approximately 4.2950 · 10^9 steps, and that would be only for matching a single block. Assuming the second file also has a size of 2 MB, and we simply want to match all non-overlapping blocks, that would mean matching n_exe / m_exe = 1024 blocks, i.e. some O(n_exe · m_exe · (n_exe / m_exe)) = O(n_exe²), which are approximately 4.3980 · 10^12 steps. This is a quadratic time algorithm, (mostly) independent of the block size m. Even modern processors cannot compensate for this.
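    These step counts are easily reproduced with a few Matlab lines:

    n_exe = 2097152;                     % 2 MB file
    m_exe = 2048;                        % 2 KB block size
    steps_one_block = n_exe * m_exe      % approx. 4.2950e9
    num_blocks = n_exe / m_exe           % 1024 non-overlapping blocks
    steps_all = steps_one_block * num_blocks   % = n_exe^2, approx. 4.3980e12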

    Based on this result, we can be sure that our intuitive algorithm is not quite fast enough (in other words: it is too expensive), and that optimisation is a necessity. There is one thing in our favour, though: We do not have the absolute requirement to always calculate the exact match counts and find only the best matches. If we do not find them, the calculated difference between the two files will be larger, which will result in larger patches, but we will still succeed. With that in mind, one option to speed up this calculation is to estimate the match count vector V using a randomized algorithm with a sufficiently high chance of success. This leads us to the new algorithm of matching with mismatches as described in [1].

    10 For a description of the O-Notation as used in this thesis, see [6] on pages 44-45.


    2.3 Derivation of the Algorithm

    2.3.1 Formal Model

    Since we could not find a suitable algorithm quickly and intuitively, we now use a formal way to deal with the problem. Thus, we need to formalize the problem when comparing two similar versions of one executable program. If we choose a block of one file and try to locate a similar block in the other file, we expect to find exactly one good match. This is at the position where the corresponding code before applying the security fix is present. For the sake of a more general approach, instead of simply considering the best match, we plan to find the t best matches. Additionally, there might be not-so-good matches which occur by chance and are random. These need to be considered, because we prefer the good matches over them.

    Instead of choosing specific example strings S and T, we generate them randomly from an alphabet Σ subject to a certain condition: There are some positions within S where T matches well, in the sense that each character matches with probability p. The indices of these good matches in S are assumed to be elements of the set X. Even more formally, the model we are using is specified as follows (citing from [1] on pages 7-8):

    Problem space: A problem is determined by a tuple (n, m, t, p, ε, Σ, X) where {n, m, t} ⊂ N, {p, ε} ⊂ R, m < n, 0 < ε, 0 < p, |Σ| is even, and X = {x1, . . . , xt} ⊆ {0, . . . , n−m} with x_i ≤ x_{i+1} − m for 1 ≤ i < t.

    Construction: Let a string T of length m be constructed by selecting m characters independently and uniformly from the alphabet Σ. Let a string S′ be constructed by randomly selecting n characters independently and uniformly from the alphabet Σ. Let a string S of length n be constructed by independently taking S_i = T_{i−x_k} with probability p if x_k ∈ {i−m, . . . , i−1} and S_i = S′_i otherwise [...].

    In other words (with regard to finding differences of files), we create T from randomly chosen characters to be a block of one file. We then create another file S from randomly chosen characters, but at certain positions in S we copy the block T into the file S. This copy of T is not an exact copy, but an approximate copy, and the probability p states how accurate the copy is. For example, if p = 0.9, 9 of 10 characters will be correctly copied on average. The construction can be performed using a Matlab function as shown in figure 5.

    Based on this model, which generates S and T using given positions of good matches, we wish to invert the random construction and find X with a probability of at least 1 − ε. In this context, ε is a non-zero parameter which can be set to achieve the desired accuracy, but choosing ε will impose certain restrictions on other input values, as we will see later.

    In addition to the problem of reconstructing X, we wish to identify the parts of the algorithm which are independent of T. Following the nomenclature of [1], we call a pre-calculation of these parts an index of the algorithm. As we need to match several blocks (different strings T) with the same target file (constant string S), such an index can speed up the processing.


    function [S, T] = construct(n, m, p, Sigma, X)
    % X and Sigma should be sets.
    if (not (length(X) == length(unique(X))))
        error('X is not a set');
    end
    if (not (length(Sigma) == length(unique(Sigma))))
        error('Sigma is not a set');
    end
    % Check construction predicates.
    if (not (m < n && 0 < p && mod(length(Sigma), 2) == 0 ...
            && sum(ismember(X, [0:n-m])) == length(X)))
        error('Invalid construction value.');
    end
    for i = 1 : length(X) - 1
        if (not (X(i) <= X(i+1) - m))
            error('Invalid construction value.');
        end
    end
    % Construct T and S' by selecting characters independently and
    % uniformly from Sigma (see the model in section 2.3.1).
    T = Sigma(randi(length(Sigma), 1, m));
    S = Sigma(randi(length(Sigma), 1, n));
    % Copy T into S at the positions given by X; each character is
    % copied correctly with probability p.
    for k = 1 : length(X)
        for i = 1 : m
            if (rand() <= p)
                S(X(k) + i) = T(i);
            end
        end
    end

    Figure 5: Matlab function to construct strings S and T according to the formal model

    2.3.2 Restrictions of the Model


    Σ = {a, b, c, . . . , z}
    X = {1, 12}
    T = abc

    S = fabcklabchxvabcs

    T occurs in S at position 1 (1 ∈ X), at position 6 (6 ∉ X, a random match) and at position 12 (12 ∈ X).

    Figure 6: Construction example: At position 6 in S the string T randomly matches.

    contains the positions of maximum match counts, because V_i (our lucky match count) might actually be larger. Thus, we will have to guess the elements of X, which reduces the probability to find the correct X to 0.5 or less.

    We therefore conclude: To effectively estimate X without fifty-fifty guessing, all match counts at positions not in X must be smaller than those of positions in X, i.e.:

    V_i < V_j    ∀ (i, j) ∈ ({0, . . . , n−m} \ X) × X


    Figure 7: Plot of the probability of a randomly good match not in X

    1. If we found random matches between two files that we did not expect, we would be happy and would gladly accept them.

    2. The block size m when matching two files should not be too small anyway, so that proper compression can be achieved.

    3. We will set p to be near 1 anyway, otherwise we would not be able to achieve proper results.

    Even though we have identified certain restrictions of the model, we can continue with our work with the knowledge that these limitations will not block the progress of solving our problem.

    2.3.3 Optimisation using the FFT

    Based on the model, we intend to derive an improved algorithm for matching with mismatches. As a first optimisation, the match count vector V in equation (2.1) can be calculated with the help of the Fast Fourier Transform. This requires O(n·√(m·log(m))) time for matching one block (see [1] on page 7)13, which is less than the O(nm) time of our intuitive algorithm, but there is still potential for improvement.

    To improve this, we note that we do not specifically need the vector V as described in equation (2.1). When processing V, we extract large values, thus a vector containing spikes at the same positions as V would also be perfectly fine. A vector with this characteristic is

    13 Unfortunately, the references mentioned in [1] on using the FFT to calculate V do not cover the FFT at all, and are therefore not very useful in this context.


    the cyclic correlation14 of the two strings S and T, when treating them as discrete signals with certain properties.

    Assuming φ : Σ → R is a function which converts a character to a signal value,15 the cyclic correlation C is calculated as follows:16

    ∀ j ∈ {0, . . . , n−1} :

    A_j = φ(S_j)   (2.5)

    B_j = { φ(T_j)  if j < m
          { 0       otherwise   (2.6)

    C_j = Σ_{r=0}^{n−1} A_{(r+j) mod n} · B_r   (2.7)

    In this calculation, the string S is converted to the signal vector A, and T is converted to B with zero padding, such that A and B have the same size. In order to retrieve proper results, we have to define φ in a way that it does not weight characters differently. As a counter-example, defining φ to map each element of Σ to a unique numerical representation,

    given Σ = {x1, x2, . . . , x_|Σ|} = ∪_{j=1}^{|Σ|} {x_j} :   φ(x_j) = j   (2.8)

    will not produce proper results, because matching x1 and x1 (which are equal) when calculating the cyclic correlation will yield 1 · 1 = 1,17 while matching x1 and x2 (which are non-equal) will produce 1 · 2 = 2. In other words, certain mismatches will count much more than certain matches, and this is not the result we wish to have.

    Instead, we define φ to randomly map half of Σ to 1 and the other half to −1 (similar to [1] on page 12):

    Choose Φ ⊆ Σ with |Φ| = ½·|Σ| uniformly at random.

    φ(x) = (−1)^|Φ ∩ {x}|   (2.9)

    Note that in case |Σ| > 2 this is a lossy conversion of the characters to signal values, since it maps Σ to {−1, 1}. However, in contrast to equation (2.8), mismatches will never produce

    14 see e.g. [11] on page 72
    15 We do not consider the case that the function φ maps to C (as mentioned in [1] on page 27), because for all our requirements mapping to R is sufficient.
    16 This is a simplification based on parts of algorithm 1.1 in [1] on page 12.
    17 The underlying operation in equation (2.7) is a multiplication.


    larger values than matches: Matching x1 and x1 will produce 1, matching x1 and x2 will produce either 1 or −1, depending on how Φ was chosen.

    Figure 8 shows the calculation of the cyclic correlation in Matlab code using φ as defined in equation (2.9).

    function [C] = match_cyclic_correl(S, T, Sigma)
    % Retrieve lengths.
    n = length(S);
    m = length(T);
    % Sigma should be a set and sorted. We assume it to be continuous.
    alphabetSize = length(Sigma);
    if (not (alphabetSize == length(unique(Sigma)) && issorted(Sigma)))
        error('Sigma is not a sorted set');
    end
    % Check input predicates.
    if (not (m < n && mod(alphabetSize, 2) == 0))
        error('Invalid input value.');
    end
    Sigma_base = double(Sigma(1));
    % Calculate phi.
    tmp_phi = ones(1, alphabetSize, 'single');
    for j = 1 : alphabetSize/2
        tmp_phi(j) = single(-1);
    end
    % Use a random mapping.
    phi = intrlv(tmp_phi, randperm(alphabetSize));
    % Convert S and T to A and B.
    A = zeros(1, n, 'single');
    B = zeros(1, n, 'single');
    for j = 0 : n - 1
        A(j + 1) = phi(double(S(j + 1)) - Sigma_base + 1);
        if (j < m)
            B(j + 1) = phi(double(T(j + 1)) - Sigma_base + 1);
        end
    end
    % Calculate the cyclic correlation.
    C = zeros(1, n, 'single');
    for j = 0 : n - 1
        tmpC = single(0);
        for r = 0 : n - 1
            tmpC = tmpC + A(mod(r+j, n) + 1)*B(r+1);
        end
        C(j+1) = tmpC;
    end

    Figure 8: Matlab function to calculate matches using the cyclic correlation

    Due to the nested loop in equation (2.7),18 the calculation of C requires O(n²) time. The actual improvement comes from the fact that the cyclic correlation can be computed using the FFT (see [12] on pages 545-546) in O(n·log₂(n)) time (according to [10]). The corresponding Matlab code is shown in figure 9. In this context, fft and ifft are the Fast Fourier Transform and the inverse FFT, respectively, and conj is the complex conjugate.

    18 because this equation is applied ∀ j ∈ {0, . . . , n−1}


    % Calculate the cyclic correlation using the fft.
    C = ifft(fft(A) .* conj(fft(B)));

    Figure 9: Matlab code to calculate the cyclic correlation using the FFT
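    As a quick sanity check of our own (not part of [1]), the FFT formula of figure 9 can be compared against the direct computation of equation (2.7) for short random ±1 vectors:

    n = 64;
    A = 2 * (rand(1, n) > 0.5) - 1;                   % random +/-1 signal
    B = [2 * (rand(1, 8) > 0.5) - 1, zeros(1, n-8)];  % zero-padded as in equation (2.6)
    % Direct cyclic correlation following equation (2.7).
    C_direct = zeros(1, n);
    for j = 0 : n - 1
        for r = 0 : n - 1
            C_direct(j+1) = C_direct(j+1) + A(mod(r+j, n) + 1) * B(r+1);
        end
    end
    % FFT-based calculation as in figure 9.
    C_fft = ifft(fft(A) .* conj(fft(B)));
    max(abs(C_direct - real(C_fft)))   % numerically close to zero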

    The resulting vector C does not necessarily contain match counts. Mismatches decrease values in C, e.g., for m = 15 a match count of 10 at position j results in C_j = 10 − 5 = 5. This means that processing C as we processed V in figure 4, by filtering values ≥ ⌈m/2⌉, might lead to different results. This is specifically true for |Σ| > 2, because in that case the string-to-signal conversion is lossy. Thus, there is more background noise than in the match count vector V. Performing the matching of the example strings (see figure 1) using the cyclic correlation leads to a result19 as plotted in figure 10. Negative values in this plot are clear mismatches (because they are the result of accumulated −1 · 1 multiplications, see equation (2.7)). Positive values are likely but not guaranteed to be matches, depending on how suitable the randomly generated φ is for our example strings.

    Figure 10: Plot of the cyclic correlation C calculated from example strings S and T

    To find the t best matches using the result of the cyclic correlation, we identify the t positions for which C takes the largest values. In our example, the positions 7 and 19 both indicate full matches, although in fact position 7 only matches 4 of 5 characters (see table 1). This

    19 The result may vary because the algorithm is randomized.


    emphasizes the fact that our new method does not always lead to correct results. When choosing φ unluckily, we might even find full matches at positions where none are present.

    However, the method is still very useful if certain conditions are met. Let us now apply our model and create S and T according to it, which means that their characters are uniformly distributed20. If m is large enough to make up for the (possibly) lossy conversion done by φ,21 the spikes within C will (with high probability) be the good matches we wish to find, since T "matches within S well or not at all" ([1] page 6).

    Our initial problem remains, though: The run time is dominated by the O(n·log₂(n)) time required for the FFT, which does not scale well for our purpose.22 In addition to that, memory usage is O(3n),23 which can be too much for use with large files.24 The next step is therefore to shorten the lengths of the vectors A and B before calculating the cyclic correlation, while still retaining the necessary information.

    2.3.4 Projecting onto Subspaces

    In order to be able to reduce the size of the data before calculating the FFT, we need to find a projection (preferably lossless or with only small loss) which reduces the vector sizes in equations (2.5) and (2.6) but maintains the basic properties such that we can still perform the cyclic correlation and extract proper results.

    A Simplified Approach

    We now introduce such a projection (based on [1], page 9), starting with the special case |X| = 1 and |Σ| = 2, which basically means that we are only looking for the best match of T in S, whereby the conversions of strings to signals are lossless. For this case, we construct an example to show how the projection is performed and to explain the mathematical background.

    n = 10000;
    m = 256;
    p = 0.95;
    Sigma = uint8([0:1]);
    X = [9000];
    [S, T] = construct(n, m, p, Sigma, X);
    C = match_cyclic_correl(S, T, Sigma);

    Figure 11: Matlab calls to construct data according to the model (|X| = 1)

    20 except for the substrings in S which match well with T
    21 To be more exact, this does not solely depend on m, but choosing a large m is one way to deal with this problem.
    22 In fact, this method requires even more time than the FFT-based calculation mentioned at the beginning of this section, but it leaves room for optimisation.
    23 implicitly depending on the system-specific size of floating point values
    24 At least one would prefer low memory usage, especially when multiple file patches need to be generated in parallel.


    Figure 11 lists the Matlab calls used to create example data according to the model. Note that we choose n ≫ m·|X| and select m not to be too small, to make sure that the restrictions of the model (section 2.3.2) do not apply. Figure 12 shows the resulting vector C (following equation (2.7)). There is significant noise and a spike at position j = 9000, which we expected because X = {9000}.

    Figure 12: Plot of the cyclic correlation C of model data (|X| = 1)

    Now we need to reduce the data size and still get the position of the maximum, j = 9000, as result. Assuming we can somehow reduce the data size modulo a prime number, we could extract the position modulo this prime. Performing this several times with different primes will give us the position modulo multiple primes. To calculate the actual result we can make use of the Chinese Remainder Theorem, which states that it is possible to reconstruct integers in a certain range from their residues modulo a set of coprime moduli ([7], page 194). This is possible if the integer x we wish to reconstruct follows the predicate 0 ≤ x < M, where M = p1·p2·. . .·pk is the product of the coprime integers (with k being the number of primes).

    For example, we can choose the primes p1 = 5003, p2 = 6007 (being about n/2 with enough difference, this choice is for simplicity)25, and perform the character-to-signal conversions accumulated modulo each of the primes (based on algorithm 1.1 in [1] on page 12), for all i ∈ {1, . . . , k} and j ∈ {0, . . . , p_i−1}:

    25 The reconstruction is clearly possible because 0 ≤ n = 10000 < p1·p2 = 30053021. We are not required to choose primes; coprime values are sufficient, but we prepare for later changes to the algorithm.


    A^(i)_j = Σ_{λ=0}^{⌈(n−j)/p_i⌉−1} φ(S_{j+λ·p_i})   (2.10)

    B^(i)_j = Σ_{λ=0}^{⌈(m−j)/p_i⌉−1} φ(T_{j+λ·p_i})   (2.11)

    C^(i)_j = Σ_{r=0}^{p_i−1} A^(i)_{(r+j) mod p_i} · B^(i)_r   (2.12)

    This means that we shorten the original vector A_j = φ(S_j) by adding up roughly the second half of the vector to the first half (with a different boundary for each prime). Equation (2.10) specifies a projection from R^n to R^{p_i}, for all i ∈ {1, . . . , k}. Given p_i < n, this is a lossy (irreversible) projection, but it maintains certain characteristics. In our example, instead of one large vector A, we now have two vectors: A^(1) with size 5003 and A^(2) with size 6007. Vector B is less concerned: only some zeros are cut from the end, since in our case m < p1 < p2.26

    26 Basically, we could use the same definition for B^(i) as in equation (2.6), but the new definition covers the general case and is therefore preferable.


    function [C] = match_cyclic_correl_project(S, T, Sigma, Primes)
    % Retrieve lengths.
    n = length(S);
    m = length(T);
    k = length(Primes);
    % Sigma should be a set and sorted.
    % We assume it to be numeric, with values >= 0 and continuous.
    alphabet_size = length(Sigma);
    if (not (alphabet_size == length(unique(Sigma)) && issorted(Sigma)))
        error('Sigma is not a sorted set');
    end
    % Check input predicates.
    if (not (m < n && mod(alphabet_size, 2) == 0 && Sigma(1) >= 0))
        error('Invalid input value.');
    end
    sigma_base = double(Sigma(1));
    % Calculate phi (character to signal conversion).
    tmp_phi = ones(1, alphabet_size, 'single');
    for j = 1 : alphabet_size/2
        tmp_phi(j) = single(-1);
    end
    % Use a random mapping.
    phi = intrlv(tmp_phi, randperm(alphabet_size));
    % Convert S and T to A and B, project to subspaces.
    for i = 1 : k
        A{i} = zeros(1, Primes(i), 'single');
        B{i} = zeros(1, Primes(i), 'single');
        for j = 0 : Primes(i) - 1
            tmpA = single(0);
            tmpB = single(0);
            for lambda = 0 : ceil((n-j)/Primes(i))-1
                tmpA = tmpA + phi(double(S(j+lambda*Primes(i)+1)) ...
                    - sigma_base + 1);
            end
            for lambda = 0 : ceil((m-j)/Primes(i))-1
                tmpB = tmpB + phi(double(T(j+lambda*Primes(i)+1)) ...
                    - sigma_base + 1);
            end
            A{i}(j+1) = tmpA;
            B{i}(j+1) = tmpB;
        end
    end
    % Calculate the cyclic correlation using the fft.
    for i = 1 : k
        C{i} = ifft(fft(A{i}) .* conj(fft(B{i})));
    end

    Figure 13: Matlab function to project data before calculating the cyclic correlation
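    For the example data of figure 11 the projection can be tested as follows (the algorithm is randomized, so the result may occasionally differ; the expected residues are derived below):

    Primes = [5003, 6007];
    C = match_cyclic_correl_project(S, T, Sigma, Primes);
    [val1, pos1] = max(real(C{1}));   % pos1 - 1 is expected to be 9000 mod 5003 = 3997
    [val2, pos2] = max(real(C{2}));   % pos2 - 1 is expected to be 9000 mod 6007 = 2993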

    The cyclic correlation is calculated for each of these smaller vectors (see equation (2.12)). Figure 13 shows a Matlab function implementing this projection onto subspaces; figures 14 and 15 show the resulting vectors C^(1) and C^(2) for our example. Due to adding up the values before calculating the correlation, the level of noise has increased compared to figure 12, but the maximum value still rises clearly above the noise.


    Figure 14: Plot of the cyclic correlation C^(1) of projected model data (|X| = 1)

    Figure 15: Plot of the cyclic correlation C^(2) of projected model data (|X| = 1)

    The positions of maximum values in these vectors C^(i) are (with high probability) the maximum position of the original vector C modulo each of the primes p_i. Using the residues modulo these primes, we can reconstruct the position of the maximum correlation. In our example the


    residues are 9000 mod 5003 = 3997 (see figure 14) and 9000 mod 6007 = 2993 (see figure 15). The reconstruction following the Chinese Remainder Theorem is done as follows (see also [7] page 194f), for all i ∈ {1, . . . , k}:

    x ≡ a_i (mod p_i) :   x ≡ 3997 (mod 5003)   (2.13)
                          x ≡ 2993 (mod 6007)

    M = Π_{i=1}^{k} p_i :   M = 5003 · 6007 = 30053021   (2.14)

    M_i = M / p_i :   M_1 = 30053021 / 5003 = 6007   (2.15)
                      M_2 = 30053021 / 6007 = 5003

    N_i · M_i ≡ 1 (mod p_i) :   N_1 · 6007 ≡ 1 (mod 5003) has solution N_1 = −294   (2.16)
                                N_2 · 5003 ≡ 1 (mod 6007) has solution N_2 = 353

    Finally, the underlying value x can be calculated:

    x ≡ a_1·N_1·M_1 + · · · + a_k·N_k·M_k (mod M) :   (2.17)
    x ≡ 3997 · (−294) · 6007 + 2993 · 353 · 5003 (mod 30053021)
      ≡ −1773119239 (mod 30053021)
      ≡ 9000 (mod 30053021)

    While most of the steps are fairly straightforward, the solution of equation (2.16) requires some work. Rearranging it leads to a form which can be solved more easily:

    N_i·M_i ≡ 1 (mod p_i)  ⇔  N_i·M_i = 1 − r·p_i, r ∈ Z  ⇔  N_i·M_i + r·p_i = 1   (2.18)

    Since gcd(M_i, p_i) = 1 (by definition of M_i and p_i),27 we can apply the extended Euclidean algorithm (see [6] on pages 859-860) to calculate N_i.

    27 gcd: greatest common divisor
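    In Matlab the three-output form of gcd already performs the extended Euclidean algorithm, so the coefficients of equation (2.16) can be checked directly (the returned Bezout coefficient is one valid choice of N_i; it may differ from the value in the text by a multiple of the modulus):

    [g, u, v] = gcd(5003, 6007);    % g = u*5003 + v*6007 with g = 1
    mod(v * 6007, 5003)             % 1, so v is a valid N_1 (the text uses N_1 = -294)
    [g, u, v] = gcd(6007, 5003);
    mod(v * 5003, 6007)             % 1, so v is a valid N_2 (the text uses N_2 = 353)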


    Figure 16 shows a Matlab function which performs the reconstruction of an integer according to the Chinese Remainder Theorem.

    function [x] = solve_crt(Primes, Residues)
    num_primes = length(Primes);
    prime_prod = prod(Primes(1 : num_primes));
    for i = 1 : num_primes
        M_i = prime_prod/Primes(i);
        % Use extended euclidian algorithm
        [g, r, N_i] = gcd(Primes(i), M_i);
        % g = r * Primes(i) + N_i * M_i
        % with g = 1 because Primes(i) and M_i are coprime.
        NM{i} = N_i * M_i;
    end
    x = 0;
    for i = 1 : num_primes
        x = x + Residues(i) * NM{i};
    end
    x = mod(x, prime_prod);

    Figure 16: Chinese Remainder Theorem: Matlab function which calculates the solution
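    Applied to the residues of the running example, this function reproduces the value calculated by hand above:

    x = solve_crt([5003, 6007], [3997, 2993])   % returns 9000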

    Using this algorithm, the position x of the maximum correlation can be uniquely calculated as long as n ≤ M.


    n = 10000;
    m = 256;
    p = 0.95;
    Sigma = uint8([0:1]);
    X = [7700, 8050, 9000];
    [S, T] = construct(n, m, p, Sigma, X);
    C = match_cyclic_correl(S, T, Sigma);

    Figure 17: Matlab calls to construct data according to the model (|X| = 3)

    As an example, consider model data with |X| = 3, generated using the Matlab calls in figure 17. The cyclic correlation C (see equation (2.7)) of this data has three spikes at positions which are elements of X (see figure 18). We need to reconstruct these three positions from their residues modulo the primes p1 and p2 as above.

    Figure 18: Plot of the cyclic correlation C of model data (|X| = 3)

    Each of the cyclic correlations C^(i) of the projected data (calculated as in equation (2.12)) also has three spikes (see figures 19 and 20). Thus, when extracting the positions of the three largest values from each of the vectors C^(i), we have three residues modulo each prime. Unfortunately, it is not apparent which of the residues modulo one prime belongs to a specific residue modulo another prime to reconstruct one of the results.

    Intuitively, one might consider simply calculating the result x using the Chinese Remainder Theorem for all combinations of residues and checking each time whether it is a valid value (i.e. whether x ≤ n−m according to the model). This actually works fairly well, given M ≫ n−m, such that it is possible to drop invalid combinations.


    Figure 19: Plot of the cyclic correlation C^(1) of projected model data (|X| = 3)

    Figure 20: Plot of the cyclic correlation C^(2) of projected model data (|X| = 3)


    The residues and corresponding solutions for our example are shown in table 2 (with a1 being the position in C^(1), and a2 being the position in C^(2)). The valid results, i.e. those which are ≤ 9744, are marked with an asterisk. As shown in the table, we have successfully reconstructed the elements of X; all other combinations of residues lead to values well out of range.

    Table 2: Residues and solutions for p1 = 5003, p2 = 6007, X = {7700, 8050, 9000}

    a1      a2      x (mod 30053021)
    2697    1693    7700 *
    2697    2043    17067930
    2697    2993    11854804
    3047    1693    13000841
    3047    2043    8050 *
    3047    2993    24847945
    3997    1693    18214917
    3997    2043    5222126
    3997    2993    9000 *
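    A small sketch of this brute-force combination step, reusing solve_crt from figure 16 and the values of the running example (n = 10000, m = 256 as in figure 17; the residues are the ones listed in table 2):

    Primes = [5003, 6007];
    res1 = [2697, 3047, 3997];   % positions of the three largest values in C^(1)
    res2 = [1693, 2043, 2993];   % positions of the three largest values in C^(2)
    Xest = [];
    for i = 1 : 3
        for j = 1 : 3
            x = solve_crt(Primes, [res1(i), res2(j)]);
            if (x <= n - m)      % keep only valid positions (here: x <= 9744)
                Xest = [Xest, x];
            end
        end
    end
    sort(Xest)                   % returns [7700 8050 9000]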

    Theorem 1.1 in [1] on page 10 establishes a lower bound on the probability of whether this reconstruction will lead only to the correct results.28 The problem is formulated slightly differently in this theorem. It starts with a set of candidates for the solution, namely {0, . . . , n−1} (assuming m = 1, the minimum reasonable value). For each prime, only those values of this set which are elements of one of the residue classes (of the actual results modulo the prime) are accepted (see [1] on page 10):

    X′ = {0, . . . , n−1} ∩ (X + p1·Z) ∩ · · · ∩ (X + pk·Z)   (2.19)

    The filtering by intersecting for each prime is basically the same as trying all combinations and removing invalid results. The set X′ will always contain the correct results (because all combinations of residues are considered), but it might also contain additional values. A lower probability bound on the condition X′ = X according to [1] is

    1 − n · ( t · log(n) · log(L) / L )^k   (2.20)

    with L ∈ R, L ≥ 5 specifying the interval [L, L·(1 + 2/log(L))) from which the primes p1, . . . , pk are randomly selected, and t = |X| according to our model definition. This probability bound is used by Colin Percival for further proofs, which is why theorem 1.1 is the very foundation of the algorithm proposed in [1]. Still, what is missing in [1] is a critical analysis of this theorem. We provide this, together with suggestions for improvement, in section 4.1.

    For now we choose input values such that the lower probability bound in equation (2.20) is near 1. Hence, we can expect that the set X is properly reconstructed most of the time. To be able to choose input values accordingly, we need to select primes from the specified interval. Figure

    28 This theorem depends on p_i being prime and not just coprime for all i ∈ {1, . . . , k}, which is why we used prime numbers from the start.


    21 shows a Matlab function which performs this task. Please note that this function actually creates a random permutation of primes instead of selecting them uniformly at random (as described in [1]), which makes it behave slightly differently. This is explained in section 4.1.1.

    function [P] = select_primes(L, k)
    % Check selection predicate.
    if (not (L >= 5 && k >= 1))
        error('Invalid input for prime selection.');
    end
    % Round upper bound.
    primes_upper_bound = L * (1 + 2 / log(L));
    max_prime = fix(primes_upper_bound);
    % Make sure that values are < upper bound.
    if (max_prime == primes_upper_bound)
        max_prime = max_prime - 1;
    end
    % Retrieve subset of the set of primes.
    % Use a permutation to prevent double primes.
    primes_set = setdiff(primes(max_prime), primes(L - 1));
    primes_set = intrlv(primes_set, randperm(length(primes_set)));
    P = primes_set(1:k);

    Figure 21: Matlab function to randomly select primes for the reconstruction of X
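    A typical call, assuming we want k = 2 primes around n/2 = 5000 as in the earlier example:

    P = select_primes(5000, 2)   % two random primes from [5000, 5000*(1 + 2/log(5000)))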

    Now we have provided a solution which covers the general case |X| ≥ 1. The solution has the drawback of restricting some of the input values, an issue which we will revisit in section 2.4.2.

    To provide a fully generic approach, we still need to cover the case |Σ| > 2. However, this is only a small problem: We recall the fact that |Σ| > 2 will cause φ(x) to be a lossy conversion from characters to signal values. This means that when comparing the signal values of different characters, they might turn out to be equal. However, as we discussed in section 2.3.2, the larger |Σ|, the less likely are randomly good matches of T in S (if all other factors are constant). These two effects compensate each other. While the lossy character-to-signal conversion increases the level of the noise in which to find good matches, the reduced probability of random matches decreases the noise. Figure 22 shows vector C constructed as in figure 17 but with Σ = {0, . . . , 65535}. Compared to figure 18 we do not observe any difference except for the influence of the randomly constructed input values. Actually, there is a slight difference in detail: Since φ is a randomly created function, there is a certain chance that we create an unlucky φ for our specific input values. This issue will be handled in the next section.

    Comparing C to the match count vector V (see equation (2.1) on page 7) with varying |Σ| reveals another issue: For larger |Σ|, the level of noise in V is reduced, while the level of noise in C remains constant. This should be kept in mind for applications where C is meant to be used as a direct replacement for V.


    Figure 22: Plot of the cyclic correlation C of model data (|X| = 3, |Σ| = 65536)

    2.3.5 Optimising Random Behaviour

    As we mentioned above, for |Σ| > 2 the conversion performed by φ is lossy. It can randomly happen, though, that φ is defined (see equation (2.9)) such that it performs an unlucky conversion of specific input strings. By unlucky we mean that the resulting C has a spike at a position where there is no good match, because φ converts different characters to the same signal values. Figure 23 shows an example for an unluckily defined φ in the context of certain input strings.

    Σ = {a, b, c, d}
    X = {1}
    T = abbd

    S = dabbdcbabccaddc

    T occurs at position 1 (1 ∈ X); the substring babc at position 6 (6 ∉ X) is not equal to T.

    φ(x) = { 1   if x = a or x = b
           { −1  otherwise

    B = (1, 1, 1, −1, 0, 0, . . .)
    A = (−1, 1, 1, 1, −1, −1, 1, 1, 1, −1, −1, 1, −1, −1, −1)

    The four signal values starting at position 1 (1 ∈ X) and the four starting at position 6 (6 ∉ X) are identical to the non-zero part of B.

    Figure 23: Construction example: At position 6 in S a non-existing match of T is found


    For specific input strings this problem can often be solved by choosing a different (lucky) φ. In figure 24, φ is redefined, which leads to the correct result.

    φ(x) = { 1   if x = a or x = c
           { −1  otherwise

    B = (1, −1, −1, −1, 0, 0, . . .)
    A = (−1, 1, −1, −1, −1, 1, −1, 1, −1, 1, 1, 1, −1, −1, 1)

    Now only the four signal values starting at position 1 (1 ∈ X) match the non-zero part of B.

    Figure 24: Construction according to figure 23 with a different φ

    However, there is no general solution which works for all input strings as long as φ is a lossy conversion. Given φ, it is always possible to maliciously construct input strings such that a good match of T in S is found where there is none. One tempting way to reduce the probability of finding false matches is to analyse the input strings and define φ purposefully instead of at random. This possibility is discussed later in section 4.2.2.

    Nevertheless, there is something else we can do: In the equations (2.10) and (2.11), the same φ is used to calculate the vectors A^(i) and B^(i) for all i ∈ {1, . . . , k}. This means that if a false match occurs, it will occur in all vectors C^(i), at positions modulo the respective primes. If we use a different φ_i for each i to calculate the vectors A^(i) and B^(i), false matches are still possible, but most likely at different positions for each prime, and therefore likely to be filtered out as invalid results (as in table 2). Extending our previous equations (2.9), (2.10) and (2.11) we now have (see also [1] on page 12):

∀i ∈ {1, . . . , k}: choose Σ_i ⊆ Σ with |Σ_i| = |Σ|/2 uniformly at random.

\varphi_i(x) = (-1)^{|\Sigma_i \cap \{x\}|}    (2.21)

∀i ∈ {1, . . . , k}; ∀j ∈ {0, . . . , p_i − 1}:

A^{(i)}_j = \sum_{\lambda=0}^{\lceil (n-j)/p_i \rceil - 1} \varphi_i\left(S_{j+\lambda p_i}\right)    (2.22)

B^{(i)}_j = \sum_{\lambda=0}^{\lceil (m-j)/p_i \rceil - 1} \varphi_i\left(T_{j+\lambda p_i}\right)    (2.23)

For further usage, figure 25 shows a Matlab function performing the creation of the φ_i and the projection accordingly.


function [A, B] = project_onto_subspaces(S, T, Primes, Sigma)
n = length(S);
m = length(T);
alphabet_size = length(Sigma);
sigma_base = double(Sigma(1));
k = length(Primes);
% phi maps half of Sigma to 1, the other half to -1.
tmp_phi = ones(1, alphabet_size, 'single');
for j = 1 : alphabet_size/2
    tmp_phi(j) = single(-1);
end
phi = zeros(k, alphabet_size, 'single');
for i = 1 : k
    % Use a random mapping.
    phi(i, :) = intrlv(tmp_phi, randperm(alphabet_size));
end
% Perform projection onto subspaces of prime dimensions.
for i = 1 : k
    A{i} = zeros(1, Primes(i), 'single');
    B{i} = zeros(1, Primes(i), 'single');
    for j = 0 : Primes(i)-1
        tmpA = single(0);
        tmpB = single(0);
        for lambda = 0 : ceil((n-j)/Primes(i))-1
            tmpA = tmpA + phi(i, double(S(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        for lambda = 0 : ceil((m-j)/Primes(i))-1
            tmpB = tmpB + phi(i, double(T(j+lambda*Primes(i)+1)) ...
                - sigma_base + 1);
        end
        A{i}(j+1) = tmpA;
        B{i}(j+1) = tmpB;
    end
end

Figure 25: Matlab function to project data with varying φ_i(x)

    2.4 The Algorithm

    2.4.1 Reference Version

In the following sections, we present different variants of the algorithm to estimate X according to the model. Figure 26 shows an algorithm to be used as the base for later comparisons and tests. This algorithm is not specified in [1]; we have created it by strongly simplifying the first algorithm presented there. It implements only the optimisation using the FFT (see section 2.3.3) and does not reduce the data size. To be able to compare the run time of the different algorithms, calls to the Matlab functions tic and toc are added at the beginning and the end of the function, respectively. tic starts the timer, and when toc is called the elapsed time is displayed.

The Matlab implementation of this algorithm uses a few functions which have not yet been introduced:


check_model_predicates    This function checks input predicates according to our formal model (see section 2.3.1), and aborts if they are not met. It is shown in figure 27.

match_cyclic_correl_fft    This function computes the cyclic correlation C as in figure 8 by using the FFT as in figure 9 (see page 16). For the sake of clarity, this function is provided in the appendix.

pos_of_largest_val    Given two parameters C and t, this function retrieves the positions of the t largest values in C.[29] This is beyond the scope of this thesis and is therefore only presented in the appendix.[30]
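Although the appendix version of pos_of_largest_val works in O(tL), a minimal sketch based on a full sort conveys its interface (assuming C contains real values; the variant below is O(L log(L)) and therefore not the implementation used in the measurements):

    function [Pos] = pos_of_largest_val(C, t)
    % Return the Matlab indices of the t largest values in C.
    % Minimal sketch using a full sort; the appendix version avoids sorting.
    [~, idx] = sort(C, 'descend');
    Pos = idx(1:t);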

function [Xest] = algorithm_simple(S, T, Sigma, t)
tic
n = length(S);
m = length(T);
% Check input predicates. Use placeholder if variable not needed.
check_model_predicates(n, m, 0.9, Sigma, t, 0.1);
% Calculate the cyclic correlation using the FFT.
C = match_cyclic_correl_fft(S, T, Sigma);
Xest = pos_of_largest_val(C, t)-1;
toc

    Figure 26: Simple Matlab function to perform matching with mismatches

function check_model_predicates(n, m, p, Sigma, t, epsilon)
if (not (length(Sigma) == length(unique(Sigma)) && issorted(Sigma)))
    error('Sigma is not a sorted set');
end
if (not (m < n && 0 < epsilon && 0 < p && mod(length(Sigma), 2) == 0 ...
        && Sigma(1) >= 0))
    error('Invalid input value.');
end
if (not (t > 0))
    error('Nothing to do.');
end

    Figure 27: Matlab function to check input values according to the model

[29] The returned positions are Matlab indices, which is why an additional decrement is performed to get zero-based positions.

[30] Actually, our implementation of the function has a run time of O(tL), assuming that the length of C is L. Using an algorithm based on a priority queue (see e.g. [6] on page 194), a run time of O(L) can be achieved, but this would add a lot of complexity.


    Description

After checking its input values according to the model restrictions, this algorithm basically performs all the steps described in section 2.3.3 on page 13. It converts the strings S and T to discrete signals A and B through use of the function φ and computes the cyclic correlation C of these signals. The set X is then estimated by extracting the positions of the t largest values of the cyclic correlation.

    Example

Figure 28 shows an example application of the algorithm. Note that the resulting elements in Xest might be in a different order than in the supplied X. This, however, is assumed not to be a problem.[31]

n = 102400;
m = 512;
p = 0.9;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_simple(S, T, Sigma, length(X))

    Figure 28: Matlab code which demonstrates the usage of our reference algorithm

    Review

This algorithm has one great benefit: its run time relies mostly on the FFT; other than that, only simple processing is done. Matlab uses the FFTW library [22] for fft/ifft calls, which has been heavily optimised and is very fast even for large input sizes. There are several drawbacks, however: even FFTW does not calculate the FFT faster than O(n log₂(n)), and memory usage is very high. Also, false matches according to section 2.3.5 on page 28 cannot be filtered.

    2.4.2 The First Variant

Based on our previous analysis, we are about to present the first version of the algorithm for matching with mismatches according to [1]. Only two minor issues remain to be solved:

1. We need to deduce restrictions on the input values from the probability bound in equation (2.20). These restrictions should be based on ε, where 1 − ε is the probability of correctly reconstructing X (according to our model). If ε is near 0, a high chance to succeed is guaranteed, and thus the limits on the input values are stricter. If ε is near 1, the restrictions on the input values are more lenient.

[31] If it is a problem, the resulting set can be sorted.


2. In section 2.3.3 we have emphasized the fact that vector C does not generally contain match counts, and therefore we cannot easily process C to extract all positions with at least 50% matches. This is even more true of the vectors C(i), because some values have been added up. Extracting only the t largest values, as we proposed for C, has the drawback that one of these largest values might be a false match according to section 2.3.5, with the effect that we possibly miss one of the t results. Consequently, we still need to specify how to extract spikes from the vectors C(i).

    For both issues we accept the solutions given in [1].

    1. The input value constraints are specified as follows (see [1] on page 11):

\frac{16\log(4n/\varepsilon)}{p^2} < m < \min\left(\frac{\sqrt{32\,n\,\varepsilon}}{t\log n},\ \frac{8(\sqrt{n}+1)\log(4n/\varepsilon)}{p^2}\right)    (2.24)

The number of primes k and the minimum size of the primes L are calculated from input values and are therefore indirectly restricted (see [1] on page 12):

k = \left\lceil \frac{\log(2n/\varepsilon)}{\log(8n) - \log(m\,t\,\log(n))} \right\rceil    (2.25)

L = \frac{8n\log(2kn/\varepsilon)}{m\,p^2 - 8\log(2kn/\varepsilon)}    (2.26)

According to [1] on page 28, the restrictions placed upon the input parameters, and the values assigned to L, have "naturally erred on the side of caution". This means that they have been chosen such that the proofs concerning the algorithm succeed, but apparently some of them are also the result of trial and error to prevent certain border cases. For now we accept these constraints, but later we will analyse how restrictive they are in applications.

2. For each i ∈ {1, . . . , k}, we will process the vector C(i) by finding all positions j with C(i)_j > mp/2 (according to [1] on page 12). Please note, however, that this does not clearly define the number of character matches we require; it is simply a bound that tends to work reasonably well with our model.

With these supplements, we finally present a Matlab function implementing Algorithm 1.1 of [1] (on pages 11-12). It is shown in figure 29. All previous results are used in this implementation, and additionally one new helper function is introduced:

cartesian_prod    Given a cell array[32] and the number of dimensions, this function calculates the Cartesian product and returns it as a cell array of vectors. The implementation is beyond the scope of this thesis and is provided in the appendix. On a side note, the function has been optimised for two dimensions because this is the most frequent case.

[32] A cell array is a Matlab array with dynamic size and content.


function [Xest] = algorithm_11(S, T, p, Sigma, t, epsilon, pedantic)
tic
n = length(S);
m = length(T);
% Check input predicates.
check_model_predicates(n, m, p, Sigma, t, epsilon);
% Check additional predicates for the algorithm (if pedantic is nonzero).
if (pedantic) && (not ((16*log(4*n/epsilon))/p^2 < m && ...
        min(sqrt(32*n*epsilon)/(t*log(n)), ...
        (8*(sqrt(n)+1)*log(4*n/epsilon))/p^2) > m))
    error('Invalid m for this algorithm.');
end
% Initialization.
k = ceil(log(2*n/epsilon)/(log(8*n)-log(m*t*log(n))))
L = (8*n*log(2*k*n/epsilon))/(m*p^2-8*log(2*k*n/epsilon))
khat = ceil(log(n)/log(L))
Xest = [];
% Randomly select primes.
Primes = select_primes(L, k)
% Reduce data size by projecting onto subspaces.
[A, B] = project_onto_subspaces(S, T, Primes, Sigma);
% Calculate the cyclic correlation using the FFT.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i})));
end
% Extract positions of spikes.
for i = 1 : k
    X_residue{i} = find(C{i} > (m*p)/2) - 1;
end
% Calculate all khat tuples using a helper function.
x_tupel = cartesian_prod(X_residue, khat);
% Estimate X by applying the Chinese Remainder Theorem to all tuples.
for t = 1 : size(x_tupel, 2)
    x = solve_crt(Primes(1:khat), x_tupel{t});
    % Additional filtering of invalid values.
    xvalid = logical(1);
    for i = khat+1 : k
        if (not (ismember(mod(x, Primes(i)), X_residue{i})))
            xvalid = logical(0);
            break;
        end
    end
    % Accept results within the valid range (x <= n-m, see footnote 34).
    if (xvalid && x <= n-m)
        Xest = [Xest, x];
    end
end
toc

Figure 29: Matlab function implementing algorithm 1.1 of [1]


Description

After checking its input values against the limitations of the model as well as against the constraints given in equation (2.24), this algorithm initialises k and L according to equations (2.25) and (2.26), randomly selects k primes, and projects the input strings onto subspaces according to section 2.3.5, figure 25 on page 30. Afterwards, the cyclic correlations C(i) of the vectors A(i) and B(i) are calculated using the FFT as in section 2.3.3, figure 9 on page 16. Positions in C(i) with values larger than mp/2 are considered good matches and thus extracted, and the Chinese Remainder Theorem is applied, combining the extracted positions modulo one prime with the positions modulo each other prime (similar to table 2 on page 26).[33]

Results which are within the valid range are accepted, building the estimated set X.[34]
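The helper solve_crt is also provided in the appendix. As a rough illustration of its interface, a minimal sketch based on Garner's algorithm is shown below; it is exact in double precision only as long as the intermediate products stay below 2^53 (which covers the frequent two-prime case), so the appendix version may be implemented differently:

    function [x] = solve_crt(Moduli, Residues)
    % Minimal Chinese Remainder Theorem sketch (Garner's algorithm) for
    % pairwise coprime moduli: returns x with x = Residues(i) (mod Moduli(i)).
    x = mod(Residues(1), Moduli(1));
    M = Moduli(1);
    for i = 2 : length(Moduli)
        % Modular inverse of M modulo Moduli(i) via the extended Euclidean
        % algorithm ([g, u, v] = gcd(a, b) returns g = a*u + b*v).
        [~, u, ~] = gcd(mod(M, Moduli(i)), Moduli(i));
        c = mod(mod(Residues(i) - x, Moduli(i)) * mod(u, Moduli(i)), Moduli(i));
        x = x + M * c;
        M = M * Moduli(i);
    end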

    Example

Figure 30 shows an example application of this algorithm similar to the reference example in figure 28. It turns out that the input restrictions in equation (2.24) are not met by these input values, although algorithm 1.1 does produce the correct solution with high probability (as can be verified numerically). Thus the last parameter pedantic is set to 0 for this example in order to disable the checking of the input value constraints on m.[35] Figure 31 shows a different example where the conditions on m are met, and pedantic is set to 1.

n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 0)

    Figure 30: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=0)

n = 1000000;
m = 325;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_11(S, T, p, Sigma, length(X), epsilon, 1)

    Figure 31: Matlab code which demonstrates the usage of algorithm 1.1 (pedantic=1)

    Review

The first thing we note is that ε is just a probability bound, and one can achieve correct results with high numerical probability even if ε is near one. Similarly, the input constraints (equation

[33] The exact implementation is slightly different, because only k̂ primes are used for the reconstruction, and the result is checked against the remaining primes. However, this is only an optimisation; the effect is the same.

[34] Following the model, we only accept values x ≤ n − m (instead of x < n as in [1]).
[35] This should be done with caution because L might turn out to be negative, especially for small m.


(2.24)) are not irrevocable: we have to comply with them if we wish to use ε as a proven probability bound,[36] but often we also achieve good results when ignoring them.

If we provide input values in the context of executable file comparisons, it will be quite hard to stick to the input constraints. To achieve a good encoding of the file differences, we need the correct result of the matching with high probability, so we choose ε = 0.1. We are mainly interested in matches with only a few mismatches, and for that reason set p = 0.9. Further, we assume that we wish to find the two best matches (t = 2), to allow for some flexibility. Given these basic input parameters, n needs to be very large for it to be possible to meet the input conditions. Figure 32 shows the upper and lower limits of m with these input values for varying n. We observe that n needs to be roughly 8·10^7 simply to be able to select a valid m, and even then we are restricted to m ≈ 430. Using a block size m of several kilobytes is only possible for huge values of n, way beyond the usual size of executable files. This also means that the run time of the algorithm given in [1] (on page 13) is not valid for file comparisons,[37] because it relies (by definition) on

m \gg \frac{16\log(4n/\varepsilon)}{p^2}    (2.27)

    i.e. m is required to be a lot larger than its lower limit.

Figure 32: Limits of m in algorithm 1.1 (ε = 0.1, p = 0.9, t = 2)
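The narrow admissible range visible in figure 32 can be reproduced by evaluating equation (2.24) directly. A short sketch (with ε = 0.1, p = 0.9, t = 2 as above; n = 8·10^7 is chosen near the point where the limits cross):

    n = 8e7; epsilon = 0.1; p = 0.9; t = 2;
    m_low  = 16*log(4*n/epsilon)/p^2                    % lower limit, roughly 432
    m_high = min(sqrt(32*n*epsilon)/(t*log(n)), ...
                 8*(sqrt(n)+1)*log(4*n/epsilon)/p^2)    % upper limit, roughly 440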

There is even more to say about the input constraints: for our second Matlab example (see figure 31) we selected the relatively small (given the restrictions) n = 10^6 and chose m = 325 to be near its lower limit. In this example the input leads to L ≈ 8.9686·10^5, and while the condition L < n is true as observed in [1] (page 13), the interval [L, L(1 + 2/log(L)))

[36] Note that ε only guarantees a certain probability of success given random input according to our model.
[37] At least not for today's usual file sizes.


≈ [8.9686·10^5, 1.0277·10^6), from which the primes are randomly chosen, actually exceeds n. Therefore, it can happen that primes larger than n are selected, which is inefficient in terms of time and memory, and can even lead to incorrect results, because A(i) is zero-padded for the corresponding i. Even if false results are eventually caught, this is still an undesirable situation. Given the calculation of L as it is, the limits on m are probably not strict enough.
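The interval quoted above can be reproduced with a few lines of Matlab (a sketch; t = 1 because X contains a single element in figure 31):

    n = 1000000; m = 325; p = 0.9; epsilon = 0.7; t = 1;
    k = ceil(log(2*n/epsilon)/(log(8*n)-log(m*t*log(n))))       % k = 2
    L = (8*n*log(2*k*n/epsilon))/(m*p^2-8*log(2*k*n/epsilon))   % roughly 8.97e5
    L_upper = L*(1+2/log(L))                                    % roughly 1.03e6 > n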

Thinking in terms of L instead of considering the interval [L, L(1 + 2/log(L))) also seems to be a problem in the proofs of [1]. It specifically makes the proof of the algorithm's time bound appear questionable: the size of the cyclic correlation is assumed to be L (see [1] on page 13), but in fact it can be larger, and therefore the size is not L but worst case L(1 + 2/log(L)). Possibly this difference can be ignored for huge values of n, but it is nevertheless an assumption which should at least have been justified if used in a proof. Actually, the run time equation has the precondition n ≫ 1 (see [1] on page 13), implying asymptotic behaviour, but this underlines the fact that the time bound cannot be applied for our application. We have to expect slower run time, so further improvement of the algorithm is necessary.

Last but not least, given the way the set X is estimated in this algorithm (by extracting many results and filtering those out of range), we have no guarantee of retrieving t results. We can get any number of results. For some applications this might be a desired effect, but when matching files we usually wish to find the t best matches.

    2.4.3 The Second Variant

In the first version of the algorithm, we chose primes for the projection according to theorem 1.1 of [1]. If we use smaller primes and still achieve the desired result, we will reduce the size of the corresponding vectors as well as the processing time. However, using smaller primes will result in a higher level of background noise, and random spikes can occur more frequently. They can even rise as high as the matches we wish to find.

In section 2.3.4 we used primes around n/2. For this version, we construct model data using primes around n/10 in order to further reduce the size of the data. The corresponding Matlab calls are shown in figure 33. This example is similar to figure 17 on page 24; it simply specifies a larger n.

n = 50000;
m = 256;
p = 0.95;
Sigma = uint8([0:1]);
X = [7700, 8050, 9000]
Primes = [5003, 6007];
[S, T] = construct(n, m, p, Sigma, X);
C = match_cyclic_correl_project(S, T, Sigma, Primes);

Figure 33: Matlab calls to construct data using primes around n/10

Figures 34 and 35 show the resulting cyclic correlations. As expected, there is a considerably higher level of random noise compared to the previous example (figures 19 and 20).


For C(1), we expect spikes at positions 2697, 3047, 3997. Those actually are present, but we also observe random spikes, e.g. roughly at positions 1500, 3100. The expected spikes of C(2) at positions 1693, 2043, 2993 are also present, but there are additional spikes, e.g. roughly at positions 100, 2800. However, it is very unlikely that these spikes occur in all projections at positions which reconstruct a valid result. Therefore, these spikes are filtered during reconstruction, because they lead to results > n − m (with high probability). This is basically the same as in section 2.3.5, where false matches are removed.
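The expected spike positions are simply the match positions reduced modulo the corresponding primes:

    mod([7700, 8050, 9000], 5003)   % expected spikes of C(1): 2697  3047  3997
    mod([7700, 8050, 9000], 6007)   % expected spikes of C(2): 1693  2043  2993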

Figure 34: Plot of the cyclic correlation C(1) of data using primes around n/10

Theorem 1.3 in [1] on page 15 addresses this issue from a mathematical point of view. It extends theorem 1.1 of [1] by considering additional random elements for each of the intersections. Assuming that these elements are selected randomly in sets Y(i) ⊆ {0, . . . , n−1} \ X for all i ∈ {1, . . . , k},[38] this looks as follows:[39]

X = \{0, \dots, n-1\} \cap \left((X \cup Y^{(1)}) + p_1\mathbb{Z}\right) \cap \dots \cap \left((X \cup Y^{(k)}) + p_k\mathbb{Z}\right)    (2.28)

[38] This definition has been simplified; for more information see [1].
[39] This has been slightly corrected from [1].


Figure 35: Plot of the cyclic correlation C(2) of data using primes around n/10

The probability bound in equation (2.20) is correspondingly extended by the probability β ∈ [0, 1) of one random element falling into each of the k sets.[40] The new bound is specified as follows (see [1] on page 15):

1 - n\left(\beta + \frac{t\log(n)\log(L)}{L}\right)^{k}    (2.29)

In principle, we can use the same procedure as in the first variant of the algorithm but with shorter primes, and have a slightly smaller probability of succeeding. To stay within proven bounds, however, we revisit the input value restrictions and the processing of C(i) from the previous section:

1. The input value constraints should now be deduced from the new probability bound in equation (2.29). Unfortunately, this is not trivial, because it involves establishing propositions about β. Theorem 1.4 in [1] on page 17 performs this task, basically giving guidance on how high the spikes of the valid results still need to rise above the noise. The resulting restriction according to [1] on page 19 is (again chosen such that the proof is successful):

m < \min\left(\frac{\sqrt[3]{n^2\varepsilon/2}}{t\,(\log(n))^2},\ \frac{\sqrt{n\varepsilon/2}}{8p^2}\right)    (2.30)

[40] Cited from [1] on page 15. The definition of β given there, as the probability "[...] of y falling into each of the k sets", is very fuzzy, since y is undefined.


The number of primes k, the probability β and the minimum size of the primes L are derived values and thus indirectly restricted (see [1] on page 20):[41]

k = \left\lceil \frac{\log(2n/\varepsilon)}{\log(n) - \log(m\,t\,\log(n)^2)} \right\rceil    (2.31)

\beta = \frac{1}{2}\left(\frac{\varepsilon}{2n}\right)^{1/k}    (2.32)

y = \left(\sqrt{-\log(\beta)} + \sqrt{\log(4kt/\varepsilon)}\right)^2    (2.33)

L = \frac{2ny}{m\,p^2 - 2y}    (2.34)

    We will analyse later how restrictive the limit on m actually is for applications.

2. To solve the problem of algorithm 1.1 that we are not guaranteed to get t results, we can start by extracting the t largest values in the vectors C(i) for each i ∈ {1, . . . , k} and perform the reconstruction. However, as mentioned before, if for any reason we have a random spike rising above one of the results, we will miss some results and might end up with fewer than t values (because invalid results are removed). Given that the number of additional spikes in C(i) according to our new approach is expected to be βp_i for each i ∈ {1, . . . , k} (see [1] on page 15), we can assume that, in the worst case, all of the additional spikes rise above our actual results. Therefore, we are on the safe side if we extract the βp_i + t largest values.

Now that these issues are solved, we present a Matlab function implementing Algorithm 1.2 of [1] (on pages 19-20). It is shown in figure 36.

    Description

At first, this algorithm checks its input values against the limitations of the model as well as against the constraints given in equation (2.30). Next, it initialises k, β and L according to equations (2.31), (2.32) and (2.34). It selects k primes and then projects the input strings onto subspaces. In the next step, the cyclic correlations C(i) of the vectors A(i) and B(i) are calculated using the FFT. The positions of the βp_i + t largest values in C(i) are considered candidates for good matches and thus extracted, and the Chinese Remainder Theorem is applied, combining the extracted positions modulo one prime with the positions modulo each other prime (similar to table 2 on page 26). Those results that are within the valid range are accepted, and form the estimated set X.

    Example

Unfortunately, the Matlab application of Algorithm 1.2 according to the reference example does not run. Figure 37 shows the corresponding calls, ignoring the limit on m in equation (2.30),

[41] In equation (2.33) we choose the name y instead of x to avoid a naming conflict.


function [Xest] = algorithm_12(S, T, p, Sigma, t, epsilon, pedantic)
tic
n = length(S);
m = length(T);
% Check input predicates.
check_model_predicates(n, m, p, Sigma, t, epsilon);
% Check additional predicates for the algorithm (if pedantic is nonzero).
if (pedantic) && (not (m < min((((n^2*epsilon)/2)^1/3)/(t*(log(n)^2)), ...
        sqrt((n*epsilon)/2)/(8*p^2))))
    error('Invalid m for this algorithm.');
end
% Initialization.
k = ceil(log(2*n/epsilon)/(log(n)-log(m*t*log(n)^2)))
beta = (1/2)*((epsilon/(2*n))^(1/k))
y = (sqrt(-log(beta))+sqrt(log((4*k*t)/epsilon)))^2
L = (2*n*y)/(m*p^2-2*y)
khat = ceil(log(n)/log(L))
Xest = [];
% Randomly select primes.
Primes = select_primes(L, k)
% Reduce data size by projecting onto subspaces.
[A, B] = project_onto_subspaces(S, T, Primes, Sigma); pack;
% Calculate the cyclic correlation using the FFT.
for i = 1 : k
    C{i} = ifft(fft(A{i}) .* conj(fft(B{i}))); pack;
end
% Extract positions of spikes.
for i = 1 : k
    X_residue{i} = pos_of_largest_val(C{i}, ceil(beta * Primes(i)) + t) - 1;
    %X_residue{i} = find(C{i} > (m*p)/2) - 1;
end
% Calculate all khat tuples using a helper function.
x_tupel = cartesian_prod(X_residue, khat);
% Estimate X by applying the Chinese Remainder Theorem to all tuples.
for t = 1 : size(x_tupel, 2)
    x = solve_crt(Primes(1:khat), x_tupel{t});
    % Additional filtering of invalid values.
    xvalid = logical(1);
    for i = khat+1 : k
        if (not (ismember(mod(x, Primes(i)), X_residue{i})))
            xvalid = logical(0);
            break;
        end
    end
    % Accept results within the valid range (x <= n-m, see footnote 34).
    if (xvalid && x <= n-m)
        Xest = [Xest, x];
    end
end
toc

Figure 36: Matlab function implementing algorithm 1.2 of [1]


because k according to equation (2.31) turns out to be negative if m lies far above the limitation. Even if we choose m such that k is barely not negative (still ignoring the limit), k is unnecessarily large (e.g. around 20), which in turn has a very negative impact on the run time of the algorithm.

If the limit is regarded, however, the algorithm can be run.[42] Figure 38 shows an example with pedantic set to 1.

n = 102400;
m = 512;
p = 0.9;
epsilon = 0.1;
Sigma = uint8([0:255]);
X = [34, 39411, 101410]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 0)

    Figure 37: Matlab code which tries to use algorithm 1.2 but fails (pedantic=0)

n = 1000000;
m = 90;
p = 0.9;
epsilon = 0.7;
Sigma = uint8([0:255]);
X = [999601]
[S, T] = construct(n, m, p, Sigma, X);
Xest = algorithm_12(S, T, p, Sigma, length(X), epsilon, 1)

    Figure 38: Matlab code which demonstrates the usage of algorithm 1.2 (pedantic=1)

    Review

As mentioned above, the upper limit on m is very strict and cannot be set aside. When analysing it for varying n in the same context as before (i.e. with ε = 0.1, p = 0.9, t = 2), we observe that only small block sizes are allowed with this algorithm. Figure 39 shows the corresponding plot.

However, even if the limit on m is met, the run time and memory usage of the Matlab implementation of algorithm 1.2 are unfortunately rather inappropriate.[43] The main reason for this is that βp_i + t easily turns out to be several thousand. Thus, thousands of largest values are extracted from the vectors C(i). Since solving according to the Chinese Remainder Theorem is not optimised in Matlab, calculating the solutions for all combinations is a very slow process and clearly the bottleneck of our implementation. Colin Percival shows in [1] on page 22 that the seemingly quadratic run time O((βL + t)²) of this part of the algorithm is actually not quadratic, by definition of β and L. However, even in theory this is questionable: it again relies

[42] Please note that this example might take several hours to run, and Matlab might run out of memory due to the large Cartesian product.

[43] Specifically, either Matlab runs out of memory or we have given up waiting for results after 8 hours.


Figure 39: Limit of m in algorithm 1.2 (ε = 0.1, p = 0.9, t = 2)

on the assumption that primes of size L are used, while their worst-case size is roughly L(1 + 2/log(L)). The corresponding difference should not be set aside without further comment, especially because it is in the context of a square. If we use a completely different way to reconstruct X, which does not tend to have a quadratic run time, this problem will be solved.
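The claim above that βp_i + t easily reaches several thousand can be checked with the parameters of figure 38 (a sketch; the exact number depends on the randomly selected primes, which lie in [L, L(1 + 2/log(L)))):

    n = 1000000; m = 90; p = 0.9; epsilon = 0.7; t = 1;
    k = ceil(log(2*n/epsilon)/(log(n)-log(m*t*log(n)^2)))      % k = 4
    beta = (1/2)*((epsilon/(2*n))^(1/k))                       % roughly 0.012
    y = (sqrt(-log(beta))+sqrt(log((4*k*t)/epsilon)))^2
    L = (2*n*y)/(m*p^2-2*y)                                    % roughly 7e5
    candidates_per_prime = ceil(beta*L) + t                    % several thousand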

    2.4.4 The Third Variant

In the third version of the algorithm, the strict limits on m are relaxed, and the size of the primes is reduced even further. This is done by applying a different theory, namely Bayesian analysis. As a result, the method to reconstruct X is also changed, with the following background: instead of calculating all possible solutions according to the Chinese Remainder Theorem, the elements of the vectors C(i) can be added up for all positions up to n, taken modulo the corresponding prime. The result is a single vector F with a length of n:

\forall j \in \{0, \dots, n-1\}: \quad F_j = \sum_{i=1}^{k} C^{(i)}_{j \bmod p_i}    (2.35)

Spikes which existed e.g. in C(1) at positions modulo p_1 and in C(2) at corresponding positions modulo p_2 are added up and lead to spikes in F. Thus, F is actually an approximation of C, and equation (2.35) can be seen as an inverse projection, because it restores the spikes at the positions where they would have been without the projection. Further processing of F can then be done like the processing of C in our reference algorithm.
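A sketch of equation (2.35), assuming the cell arrays C and Primes produced by the functions shown earlier and the original length n:

    % Fold the projected correlations back onto the positions 0..n-1.
    F = zeros(1, n, 'single');
    for i = 1 : length(Primes)
        % real() guards against the small imaginary residue left by ifft.
        F = F + real(C{i}(mod(0:n-1, Primes(i)) + 1));
    end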

While this is intuitively a reasonable result, the approximation according to the Bayesian analysis is a bit different (see [1] on page 24):


\forall j \in \{0, \dots, n-1\}: \quad F'_j = \sum_{i=1}^{k} \frac{C^{(i)}_{j \bmod p_i} - \frac{mp}{2}}{\sigma_{p_i}(n, m, j)}    (2.36)

with σ being the standard deviation defined in [1] on page 13 as

\sigma_{p_i}(n, m, j) = \left|\{(x, y) \in \mathbb{Z} \times \mathbb{Z} : 0 \le x < n,\ 0 \le y < m,\ x \equiv y + j \ (\mathrm{mod}\ p_i)\}\right|    (2.37)

Using vector F′ instead of F for further processing leads to problems, however:

1. F′ depends on p, and if p cannot be predicted or at least estimated, the results will be falsified (see also [1] on page 26). This is a serious problem, because in an application which is comparing two files, one cannot tell in the first place how well these files will match.

2. Using equation (2.36) to calculate F′ will not lead to correct results for maliciously formed X (see [1] on page 24). This is more of a theoretical problem, because such X are very unlikely to occur in real applications, but it is still a drawback.

While the best way to solve the first problem is to use F instead of F′, the second problem can be dealt with by performing further processing of C(i) for some "appropriate" γ[44] (see [1] on page 25):

∀i ∈ {1, . . . , k}; ∀j ∈ {0, . . . , p_i − 1}:

\exp\left(-D^{(i)}_j\right) = \sqrt{\gamma} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right), \qquad D^{(i)}_j = -\log\left(\sqrt{\gamma} + \exp\left(-\frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}\right)\right)    (2.38)

These vectors D(i) are then added up:

\forall j \in \{0, \dots, n-1\}: \quad F'_j = \sum_{i=1}^{k} D^{(i)}_{j \bmod p_i}    (2.39)

In order to numerically show what happens during the calculation of the vectors D(i), we define

x = \frac{mp\left(C^{(i)}_j - \frac{mp}{2}\right)}{2\,\sigma_{p_i}(n, m, j)}

to be used as a variable, set γ = 2 and plot y = −log(γ + e^(−x)). The result is shown in figure 40.

Interpreting this plot, we observe that x > 0 if and only if C(i)_j > mp/2, due to the restrictions of the model. This means that, according to the plot, spikes in vectors C(i) larger than mp/2 are

[44] This is cited from [1] on page 24 to make it clear that γ is a very abstract value.


Figure 40: Plot of y = −log(γ + e^(−x)) with γ = 2
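Figure 40 can be reproduced with a few lines (a sketch, using the reading of the filter function given above):

    gamma = 2;
    x = -10 : 0.1 : 10;
    y = -log(gamma + exp(-x));
    % Left of zero the curve follows y = x closely; for large x it saturates
    % at -log(gamma), i.e. spikes are truncated.
    plot(x, y); grid on; xlabel('x'); ylabel('y');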

function [D] = filter_C(n, m, p, t, C)
gamma = (t*m*p^2*log(n))/n
size = length(C);
D = zeros(1, size, 'single');
for j = 1 : size
    D(j) = -log(sqrt(gamma)+exp(-(m*p*(C(j)-m*p/2))/(2*2*n*m/size)));
end

    Figure 41: Matlab function to calculate vector D

truncated. Since y ≈ x for all x ≤ 0, all other values are relatively maintained by this function. Figure 41 shows a Matlab function which calculates a vector D from C, using the definition of γ as specified later in equation (2.41) and an approximation from [1] on page 28:

\sigma_{p_i}(n, m, j) \approx \frac{nm}{p_i}    (2.40)

Applying this function to a vector C calculated according to figure 17 on page 24, we clearly see that all values are relatively maintained, but the spikes are truncated. Figures 42 and 43 show the vectors C and D, respectively.[45]

Based on these numerical observations, it is questionable why Colin Percival states in [1] on page 24, while deriving the algorithm, that D(i)_j = max(C(i)_j, γ). This is in contrast to the definition of D(i) in equation (2.38),[46] which, to the best of our knowledge, leads to a truncation

[45] Note that