Quadratic Residues and Machine Learning
MS CS Project
Michael Potter
Advisors: Dr. Leon Reznik, Dr. Stanisław Radziszowski
Aug 8, 2016
Abstract
The problem of quadratic residue detection is one which is well known in both number theory and
cryptography. While it garners less attention than problems such as factoring or discrete logarithms,
it is similar in both difficulty and importance. No polynomial–time algorithm is currently known
to the public by which the quadratic residue status of one number modulo another may be determined. This work leveraged machine learning algorithms in an attempt to create a detector capable
of solving instances of the problem in polynomial–time. A variety of neural networks, currently at
the forefront of machine learning methodologies, were compared to see if any were capable of consis-
tently outperforming random guessing as a mechanism for detection. Surprisingly, neural networks
were repeatably able to achieve accuracies well in excess of random guessing on numbers up to 20
bits in length. Unfortunately this performance was only achieved after a super–polynomial amount
of network training, and therefore it is not believed that the system as implemented could scale to
cryptographically relevant inputs which typically have between 500 and 1000 bits. This nonetheless
suggests a new avenue of attack in the search for solutions to the quadratic residues problem, where
future work focused on feature set refinement could potentially reveal the components necessary to
construct a true closed–form solution.
1 Mathematics Background
Before discussing the definition of quadratic residues in detail, some customary notation must be
introduced. First, the set of integers modulo a base n is denoted as Zn. If only integers relatively
prime to n are considered, then the resulting set is denoted Z∗n. This set is incidentally also a group
defined under the operation of multiplication, meaning that – amongst other properties – multiplying
any two elements from the set will result in another member of the set.[1] Membership in Z∗n may be
tested in polynomial–time by checking whether the greatest common divisor (gcd) between a candidate
integer x and n is equal to 1. Algorithm 1 provides a method for computing the gcd between two
numbers which has running time O(log(n)^2).
Algorithm 1 Euclidean Algorithm for Greatest Common Divisor[2]
# Returns the greatest common divisor between non-negative integers a and b
def gcd(a, b):
    while b != 0:
        r = a % b
        a = b
        b = r
    return a
If n is a prime number, then all integers from 1 to n − 1 are relatively prime to n, and so Z∗n is
simply the set of integers from 1 to n − 1. Note that zero is not considered in the set since 0 ≡ n
mod n, and the greatest common divisor of any number n with itself is n, not 1. Thus zero does
not satisfy the criteria for membership in Z∗n. Consider also that since Z∗n is a group, any element a
must have an inverse a−1 such that (a)(a−1) = 1. Since any number multiplied by zero yields zero, it
cannot have a multiplicative inverse and thus zero cannot be a member of Z∗n.
In general the number of elements in Z∗n is denoted by the Euler totient function φ(n). From the
above it is clear that for prime n, φ(n) = n − 1. Suppose, however, that n is composite. Then n = ∏_{i=1}^{k} pi^ei, where the pi are distinct primes and the ei are positive integers. In this case φ(n) = ∏_{i=1}^{k} (pi^ei − pi^(ei−1)).[1] There
is no known polynomial–time algorithm for computing φ(n) without knowing the factorization of n. If
there were then one could use φ(n) to factor n in polynomial–time. While the general proof is beyond
the scope of this project, consider the case where n = pq with p and q being two distinct primes. The
following two equalities therefore hold:
n = pq (1)
φ(n) = (p− 1)(q − 1) (2)
Expanding Equation 2 and inserting Equation 1 gives the following:
φ(n) = pq − p − q + 1
φ(n) = n − n/q − q + 1
qφ(n) = qn − n − q^2 + q
0 = −q^2 + (n + 1 − φ(n))q − n
Applying the quadratic formula then provides a solution for q:
q = ((n + 1 − φ(n)) ± √((n + 1 − φ(n))^2 − 4n)) / 2    (3)
Unfortunately, attempting to infer φ(n) from random sampling seems infeasible since the number of elements which do not contribute to φ(n) grows proportionally to p + q whereas the population size grows as pq. In fact, if p and q are distinct odd primes, then it can be shown that 2n/3 − 2 ≤ φ(n) ≤ n − 2√n + 1, which clearly still requires O(n) evaluations to guess and is therefore exponential in the size of n.
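As an illustration of the derivation above, p and q can be recovered from n and φ(n) in a few lines. This is only a sketch with toy primes (real moduli would be hundreds of bits), and the function name is illustrative:

```python
import math

def recover_factors(n, phi):
    """Given n = p*q and phi = (p-1)(q-1), recover p and q via Equation 3.

    Equation 3 solves 0 = -q^2 + (n + 1 - phi)q - n, whose two roots
    are exactly q and p (the two branches of the quadratic formula).
    """
    s = n + 1 - phi                  # equals p + q
    disc = s * s - 4 * n             # equals (p - q)^2, a perfect square
    root = math.isqrt(disc)          # exact integer square root
    assert root * root == disc, "phi is not the totient of a two-prime n"
    return (s + root) // 2, (s - root) // 2

# Toy example: n = 347 * 359, phi = 346 * 358
p, q = recover_factors(347 * 359, 346 * 358)
```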
Turning now to the definition of quadratic residues, suppose that you have two integers a and b.
Then a is a quadratic residue of b if and only if the following two conditions hold:[3]
1. The greatest common divisor of a and b is 1
2. There exists an integer c such that c2 ≡ a (mod b)
The first condition is satisfied automatically so long as a ∈ Z∗b . There is, however, no known
polynomial–time algorithm for determining whether the second condition is satisfied. The difficulty
and significance of this problem will be discussed in more detail in Section 2.
While no general solution exists, there are partial solutions to the quadratic residue problem. In
particular, if b is prime then it is easy to determine whether a is a quadratic residue of b. This can be
accomplished simply by computing a^((b−1)/2) (mod b) in a test known as Euler's criterion. If the result is 1 then a is a quadratic residue of b. If the result is −1 then a is a non–residue, and if the result is 0 then a and b are not relatively prime.[4] This value may be computed using the square and multiply algorithm – see Algorithm 2 – which has complexity O(log(b)^3) and is therefore polynomial in the length of the input.
Algorithm 2 Square and Multiply Algorithm for Modular Exponentiation[4]
# Returns x^c mod n using the square and multiply method over the bits of c
def modexp(x, c, n):
    L = c.bit_length()
    z = 1
    i = L - 1              # bit i = L-1 is the most significant bit of c
    while i >= 0:
        z = z * z % n
        if (c >> i) & 1:   # multiply in x only when bit i of c is set
            z = z * x % n
        i -= 1
    return z
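As a brief sketch, Euler's criterion can be exercised with Python's built-in three-argument pow, which performs modular exponentiation in the same square-and-multiply fashion as Algorithm 2 (the function name here is illustrative; note that a result of −1 appears as b − 1):

```python
def euler_criterion(a, b):
    """Residue test for an odd prime b: returns a^((b-1)/2) mod b.

    1 means a is a quadratic residue of b, b - 1 (i.e. -1 mod b) means it
    is a non-residue, and 0 means a and b share a factor.
    """
    return pow(a, (b - 1) // 2, b)

# 2 is a quadratic residue mod 7 since 3^2 = 9 = 2 (mod 7); 3 is not
assert euler_criterion(2, 7) == 1
assert euler_criterion(3, 7) == 7 - 1
```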
In addition to being able to detect whether a particular number is a quadratic residue, it is also
useful to know how many quadratic residues will exist for a particular modular base. This is again
easiest to investigate in the case of primes. Consider that for any odd prime b, Z∗b is a cyclic group
and so there exists an element α ∈ Z∗b such that any other element in the set may be written as
α^j for some j ∈ Z.[1] Under this construction, the quadratic residue problem reduces to determining whether j is even or odd. If an element a can be written as an even power of α then it is a quadratic residue. Otherwise it is not.[2] The set of quadratic residues, denoted Qb, must therefore have order (b−1)/2 since half of all possible values of j are even. Likewise the set of non–residues, denoted Q̄b, also has
order (b−1)/2. Extending consideration to non–prime bases, suppose n is the product of two distinct odd
primes p and q. The Chinese Remainder Theorem guarantees a bijection between Z∗n and Z∗p × Z∗q .[5]
In particular, suppose that a1 ∈ Z∗p and a2 ∈ Z∗q . The theorem guarantees that a ≡ a1 (mod p)
and a ≡ a2 (mod q) has a unique solution modulo pq. This solution, a, is then the element in Z∗n corresponding to (a1, a2) ∈ Z∗p × Z∗q.
Due to this bijective property, a is a quadratic residue of n if and only if a ∈ Qp and a ∈ Qq. Thus
the number of quadratic residues of n, denoted |Qn|, is |Qp| · |Qq| = (p−1)(q−1)/4, whereas the number of non–residues in Z∗n, |Q̄n|, is 3(p−1)(q−1)/4. The remaining p + q − 2 elements from Zn will not be relatively prime to n and are therefore in neither Qn nor Q̄n.
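These counts are easy to confirm by brute force for a small two-prime base. A minimal sketch (the helper name is illustrative):

```python
from math import gcd

def residues(n):
    """Brute-force the set Qn of quadratic residues in Z*_n."""
    return {x * x % n for x in range(1, n) if gcd(x, n) == 1}

p, q = 7, 11
n = p * q
units = [x for x in range(1, n) if gcd(x, n) == 1]
# |Qn| = (p-1)(q-1)/4 residues; the remaining units are non-residues
assert len(residues(n)) == (p - 1) * (q - 1) // 4
assert len(units) - len(residues(n)) == 3 * (p - 1) * (q - 1) // 4
```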
Although there is not yet a known polynomial–time residue detection algorithm, there are methods
for efficiently computing partial solutions. As mentioned previously, for a prime value p, the residue status, r, of an integer a is given by r = a^((p−1)/2) (mod p). This value, also known as the Legendre symbol of a with respect to p, forms the basis of the Jacobi symbol: a generalized approximation of quadratic residues for composite numbers. Suppose n = ∏_{i=1}^{k} pi^ei. Then the Jacobi symbol for an integer a is defined to be (a/n) = ∏_{i=1}^{k} ri^ei, where ri is the Legendre symbol of a with respect to pi. While clearly trivial to compute given the factorization of n, this value may also be computed using Algorithm 3, which has running time O(log(n)^2) and does not require knowledge of the factorization of n.[2] Unfortunately, knowing the Jacobi symbol (a/n) is not sufficient to determine whether a is a quadratic residue of n. Consider the case where n = p · q. Then if (a/n) = −1 it is safe to conclude that a is not a quadratic residue of n. If, however, (a/n) = 1, then
it could either be that both rp and rq are positive, in which case a is a residue of n, or it could be
that both rp and rq are negative, in which case a is not a residue of n. Either case is equally likely,
meaning that if the Jacobi symbol evaluates to 1 there will be a 50% chance that the number will be
a quadratic residue. There are no common–knowledge polynomial–time algorithms which are capable
of improving on this estimate. Thus any system capable of outperforming random guessing would
represent a highly significant result in number theory, computer science, and cryptography.
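The ambiguity can be seen with a tiny example. Computing the Jacobi symbol directly from its definition (given the factorization; the helper names here are illustrative, not from the project) exhibits a non–residue whose symbol is nonetheless 1:

```python
def legendre(a, p):
    """Legendre symbol of a with respect to an odd prime p, via Euler's criterion."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r

def jacobi_from_factors(a, factors):
    """Jacobi symbol (a/n) straight from the definition, given the
    prime factorization of n as (prime, exponent) pairs."""
    s = 1
    for p, e in factors:
        s *= legendre(a, p) ** e
    return s

# With n = 15 = 3 * 5: a = 2 is a non-residue mod 15 (both Legendre
# symbols are -1), yet (2/15) = (-1)(-1) = 1, so the Jacobi symbol
# alone cannot settle residue status
assert jacobi_from_factors(2, [(3, 1), (5, 1)]) == 1
assert all(x * x % 15 != 2 for x in range(15))
```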
Algorithm 3 Jacobi Symbol Computation[2]
# Returns the Jacobi symbol (a/n), which will be either 1, -1, or 0
def Jacobi(a, n):
    if a == 0:
        return 0
    if a == 1:
        return 1
    # Write a = 2^e * a1 where a1 is odd
    e = 0
    a1 = a
    while a1 % 2 == 0:
        a1 //= 2
        e += 1
    # Determine the current symbol from the factors of 2
    s = 0
    if e % 2 == 0:
        s = 1
    else:
        comparator = n % 8
        if comparator == 1 or comparator == 7:
            s = 1
        elif comparator == 3 or comparator == 5:
            s = -1
    # Apply the quadratic reciprocity law
    if n % 4 == 3 and a1 % 4 == 3:
        s = -s
    # Recurse
    n1 = n % a1
    if a1 == 1:
        return s
    return s * Jacobi(n1, a1)
2 Cryptography Background
While less well known than factoring, the quadratic residue problem also has great significance in
cryptography. As seen in Section 1, quadratic residue detection is clearly not more difficult than
factoring. While no polynomial–time reduction from the decision problem to factoring is known,
it is conjectured to be of similar difficulty. This conjecture has been proven within the context of
generic ring algorithms, but not for full Turing machines. This proof, however, is not strong evidence
since computation of Jacobi symbols is also provably hard for generic ring algorithms despite the
polynomial–time general algorithm provided in Section 1. It could therefore be the case that quadratic
residue detection is polynomial–time computable while factoring is not.[6] Despite this possibility, it
has become common practice for researchers to assume the difficulty of quadratic residue detection
when attempting to prove the intractability of other less studied problems in computer science and
cryptography.[7][8]
The closely related problem of actually computing the square root of a quadratic residue, however,
has been proven to be as hard as factoring. To see this, suppose that you had an algorithm capable
of computing the square root of a residue in polynomial–time. The number of distinct square roots for a residue of some composite number n is equal to 2^L, where L is the number of distinct primes in the factorization of n.[4] Thus for standard cryptographically relevant composites of the form n = pq, there will be 2^2 = 4 roots for every quadratic residue. Pick any number x from 1 to n − 1 and square it mod n. Use the polynomial–time square root algorithm to find a square root of x^2. Since there are four possible solutions, the algorithm will return a root x′ distinct from x with probability 3/4. Of these distinct values, one will be −x (mod n), but two will be other values. So long as x′ ≠ x and x′ ≠ −x, then it may be used to compute gcd(x − x′, n), which will return a non–trivial factor of n. This algorithm therefore succeeds with probability 1/2, which means that after several iterations one can expect to find a non–trivial factor of n.
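This reduction can be sketched concretely. Since no real polynomial–time root finder is known, the oracle below cheats by using the secret factors (for p, q ≡ 3 (mod 4), one root of c mod p is c^((p+1)/4)); all names are illustrative, and the parameters are toy-sized:

```python
import math
import random

def sqrt_oracle(c, p, q):
    """Stand-in for the hypothetical polynomial-time square root algorithm.

    For p, q = 3 (mod 4), c^((p+1)/4) is a square root of c mod p; the
    Chinese Remainder Theorem lifts a pair of such roots to a root mod p*q.
    """
    rp = pow(c, (p + 1) // 4, p)
    rq = pow(c, (q + 1) // 4, q)
    n = p * q
    return (rp * q * pow(q, -1, p) + rq * p * pow(p, -1, q)) % n

def factor_via_roots(p, q, rng=random.Random(0)):
    """Recover a factor of n = p*q using the square-root capability."""
    n = p * q
    while True:
        x = rng.randrange(1, n)
        g = math.gcd(x, n)
        if g != 1:
            return g                      # lucky draw: x shares a factor with n
        r = sqrt_oracle(x * x % n, p, q)
        if r != x and r != n - x:         # happens with probability 1/2
            return math.gcd(x - r, n)     # guaranteed non-trivial factor

assert factor_via_roots(347, 359) in (347, 359)
```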
RSA is a well known encryption algorithm which relies on the difficulty of factoring numbers of the
form n = pq. It has not been proven, however, that breaking RSA encryption actually requires the
ability to factor a number.[5] There are encryption schemes, however, which are provably as secure as
the modular square root problem – and hence as secure as factoring. One such procedure is the Rabin
Public Key Cryptosystem. In this scheme, a private key of two large primes p and q is generated. The
public key n is then posted. To encrypt a message m into ciphertext c, the following is performed: c = m^2 (mod n). Since the private key contains knowledge of the factorization of n, it can then be used to compute the four square roots of c, one of which will be m.[5]
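A toy sketch of the scheme follows, with p, q ≡ 3 (mod 4) so that square roots modulo each prime are easy to compute; the parameters are illustrative only, far below cryptographic size:

```python
def rabin_encrypt(m, n):
    """Rabin encryption: c = m^2 mod n for public key n = p*q."""
    return m * m % n

def rabin_decrypt(c, p, q):
    """Return all four square roots of c mod p*q; the plaintext is one of them.

    Assumes p = q = 3 (mod 4), so roots mod each prime are c^((p+1)/4),
    and combines each sign choice with the Chinese Remainder Theorem.
    """
    n = p * q
    rp = pow(c, (p + 1) // 4, p)
    rq = pow(c, (q + 1) // 4, q)
    roots = set()
    for sp in (rp, p - rp):
        for sq in (rq, q - rq):
            roots.add((sp * q * pow(q, -1, p) + sq * p * pow(p, -1, q)) % n)
    return roots

p, q = 347, 359
m = 4242
assert m in rabin_decrypt(rabin_encrypt(m, p * q), p, q)
```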
While the generic problem of factoring is believed to be hard, its difficulty can depend on the
particular values of p and q which form the prime decomposition. In particular, a notion of ‘strong’
primes has been proposed where p and q are considered strong if p − 1 and q − 1 have large prime
factors. Such primes are more difficult for certain factoring algorithms to contend with, though the
best modern algorithms are not significantly impacted by such conditions.[9] Nonetheless, the use of
strong primes is required by ANSI standards in an effort to create a system which is as secure as
possible. One way to ensure that the strong prime criterion is met is to pick p and q such that
p = 2r + 1 and q = 2t + 1, where r and t are also prime. This project will restrict its analysis to
primes of this form since they represent the most difficult class of numbers to handle and therefore
give a better indication of the cryptographic significance of any solutions encountered.
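The p = 2r + 1 condition is straightforward to check by direct primality testing. A minimal sketch (trial division suffices at the scales used in this project; the helper names are illustrative):

```python
def is_prime(n):
    """Trial division; adequate for the small moduli used in this project."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_safe_prime(p):
    """p qualifies when p = 2r + 1 with both p and r = (p - 1) / 2 prime."""
    return p % 2 == 1 and is_prime(p) and is_prime((p - 1) // 2)

assert is_safe_prime(23)        # 23 = 2 * 11 + 1 with 11 prime
assert not is_safe_prime(13)    # 13 is prime but (13 - 1) / 2 = 6 is not
```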
3 Machine Learning Background
Although it is generally believed to be both an important and difficult problem, it appears that no one has yet published an attempt to solve quadratic residue detection using
modern machine learning (ML) techniques. This project sought to remedy the apparent shortcoming
in the contemporary literature by providing evidence for the potential tractability of quadratic residue
detection within a ML context.
Artificial Neural Networks (ANNs) have garnered a great deal of attention in the literature recently
for their achievements in a variety of domains. In the field of game AI, ANNs have been able to
successfully learn to play Chess[10] and recently even the game Go[11] – a feat which had previously
been considered well beyond the capacity of modern machine learning. In such applications it appears
that these neural networks are able to collapse what would otherwise be an exponentially large search
tree down into a polynomial–time approximation function.
ANNs, and in particular Convolutional Neural Networks (CNNs), have also been extremely successful in pattern recognition. Their stunning success in the image processing domain, in particular
on the well known database ImageNet, has inspired a great deal of interest in CNN development.[12] In
fact, since its publication in 2012 this early proof of CNN effectiveness has been cited over 5000 times.
What is perhaps most surprising, however, is that CNNs have been shown recently to support transfer
learning – the ability to train a CNN on one dataset and then use it on another with very minimal
retraining required.[13][14][15] Such networks are so effective that they can out–perform systems which
have been developed and tuned for years on a specialized problem set.[15] This power comes from
the fact that many of the hidden layers within a CNN serve to perform generic feature extraction –
removing the need for humans to generate creative data representations in order to solve a problem.
Once this feature extractor has been created, the final layer or two of a CNN is all that needs to be
adjusted in order to move between different application areas – a process which requires relatively
little time and data. It was hoped that this ability to transfer learning between problem sets would
help to create a network able to detect residues for any number n with minimal retraining required.
Unfortunately the inputs to a CNN must be meaningfully related to one another, for example having
pixel 1 be above and to the left of pixel 3, in order for patterns derived from the relative positions
of those inputs to be meaningful. While it is not clear how to satisfy such a constraint when dealing
with a number theory problem, the principles and effectiveness of the transfer learning paradigm were
still of interest in this study.
The success of ANNs is not without theoretical precedent. The universality theorem, a well known
result in machine learning, suggests that neural networks with even a single hidden layer may be used
to approximate any continuous function to arbitrary accuracy (given certain constraints on network
topology).[16] Based on this premise, if there were a continuous function capable of testing quadratic
residues, it ought to be possible to mimic it using a neural network. While such a function, if one
exists, might well be discontinuous and therefore beyond the purview of the theorem, there is still
hope that a continuous approximation might suffice.
While the theoretical results are interesting, one could argue that such a network would be im-
possible to train in practice. This was indeed at least a partial motivation behind the deep networks
that have become popular in modern literature. There are, however, concrete results suggesting that
shallow networks really are capable of learning the same behavior as their deeper counterparts.[17]
In their work, Ba and Caruana are able to train shallow networks very effectively by leveraging the
outputs produced by a deep network for ground truth scoring as opposed to using the original supervised labeling. While this is not immediately useful since it still requires a trained deep network, it
suggests that if such a deep network can be developed it could potentially then be simplified in order
to improve performance.
4 Data Generation
There are two significant factors which may influence the difficulty of quadratic residue detection.
The first is the composition of the modular base being considered. This base could be prime, in
which case a polynomial–time solution is known, or the product of multiple primes, in which case a
polynomial–time solution is unknown. The number of prime factors composing the base controls both
the number of quadratic residues which will be present in the modular ring, and also how many roots
each residue will have.
The second factor which was expected to contribute to the difficulty of the problem is the size of
the modular base. As the length of this number increases, more and more elements are added to the
modular space. This could make it more difficult for the system to learn any underlying patterns in
the data, but also allows for an increased number of potential training instances. Cryptographically
relevant numbers contain hundreds of digits, so even if machine learning is capable of discriminating
with high accuracy on small numbers it might not pose a threat to modern encryption schemes.
For each base, n, training and testing data are required. For small bases the complete set of
residues and non–residues can be easily enumerated and randomly sampled. In order to generate data
for the larger bases, random numbers were selected from Zn. These numbers were squared mod n
in order to acquire values which are guaranteed to be quadratic residues. In order to acquire non–
residues, random numbers were drawn from Zn until a non–residue was found. This check is possible in
polynomial–time since the prime factorization for each base is known. Non–residues were only included
into the dataset if they had Jacobi symbol equal to 1, since said symbol already provides a method to
discriminate between residues and non–residues having the alternative sign. The testing datasets were
comprised of 50% residues and 50% non–residues, meaning that any system which can deviate from
50% accuracy by a notable margin is at least a partial success. Training data was typically presented
to the networks in batches, with 50% of said data being residues. An alternative training method was
also investigated wherein the batches were biased to a 75/25 split which alternated between favoring
residues and non–residues each iteration. As will be discussed in Section 5, this alternative data
presentation scheme did not have any substantive impact on performance.
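The generation procedure above can be sketched as follows. This is an illustrative helper (names are mine, not the project's): residues come from squaring random units mod n, while non–residues are rejection-sampled and kept only when both Legendre symbols are −1, so that the Jacobi symbol (a/n) equals 1:

```python
import random
from math import gcd

def make_dataset(p, q, count, rng=random.Random(42)):
    """Sample `count` residues and `count` Jacobi-symbol-1 non-residues mod n = p*q.

    Residue status is checked with Euler's criterion modulo each prime
    factor, which is possible only because the factorization of n is known
    to the data generator.
    """
    n = p * q
    residues, nonresidues = [], []
    while len(residues) < count:
        x = rng.randrange(2, n)
        if gcd(x, n) == 1:
            residues.append(x * x % n)    # guaranteed quadratic residue
    while len(nonresidues) < count:
        a = rng.randrange(2, n)
        if gcd(a, n) != 1:
            continue
        # Both Legendre symbols -1: a non-residue whose Jacobi symbol is 1
        if pow(a, (p - 1) // 2, p) == p - 1 and pow(a, (q - 1) // 2, q) == q - 1:
            nonresidues.append(a)
    return residues, nonresidues

res, non = make_dataset(347, 359, 100)
```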
5 Experimental Results
This project examined the behavior of ANN architectures using data generated as described in Section
4. The networks considered two integers b and n corresponding to the question: “Is b a quadratic
residue of n?” There are several important considerations when building a network to attempt to
answer this question. First and foremost is what feature representations should be used in order
to attempt to find the solution. Different network architectures may also be more or less effective at
finding a viable solution. Before taking on the challenging task of detecting quadratic residues relative
to a composite modular base, experiments were completed on a prime modular base. For each case,
a 50/50 random guess procedure was run on the testing data 10 times. The resulting accuracy along
with standard deviation was recorded so that the neural network’s performance can be compared to
see whether it is significantly better than what one would expect from random chance.
5.1 Primes
The first network architecture investigated was a MultiLayer Perceptron (MLP) with 4 hidden layers,
each containing 100 neurons employing hyperbolic tangent activation functions. The network also employs a dropout layer before the softmax output layer in an effort to combat over–fitting. A variety
of input feature combinations were then explored in the search for potentially useful inputs. As
discussed in Section 1, the quadratic residue problem for prime bases can be solved explicitly using
the Jacobi symbol, so this symbol value was not considered as a potential input. For initial tests, all
residues and non–residues from 1 to n were taken for the data set, with 80% being assigned to training
data and the remaining 20% held out for testing. A cryptographically relevant system would need to
be capable of learning off of a small subset of this data, but that is not necessary for feature viability
exploration.
For each feature set, three networks were independently initialized and trained in order to study
the sensitivity of the final system to variations in starting configuration. An n value of 100043 was
selected since it is a strong prime (see Section 2) which is large enough to allow for plentiful training
instances but not so large as to make network training times prohibitively long. The first input scheme
investigated was simply to supply the network with the binary representation of b and n in the hopes
that it would naturally extract its own features. This representation was not successful, with results
summarized in Table 1.
Table 1: ANN performance after 800,000 training iterations given binary representation of b and n
Trial   Expected Accuracy   Training Accuracy   Testing Accuracy
1       49.91±0.47%         94.53%              50.05%
2       50.28±0.54%         90.63%              50.22%
3       50.00±0.44%         96.48%              49.96%
As the results in Table 1 show, the neural network training accuracy was within one standard
deviation of random guessing in all instances. This was despite a very high training accuracy –
indicating that the system essentially memorized the training data rather than learning a function
useful for handling unseen data.
The next representation investigated was based on features inspired by the Jacobi Symbol algo-
rithm. Using variable names consistent with those outlined in Algorithm 3, these features included e
(mod 2), n (mod 8), n (mod 4), a1 (mod 4), and n (mod a1) – each normalized to the range [0,1].
While these features do not capture the recursive behavior of the algorithm, it was hoped that they
would provide at least marginal improvement over random guessing. The results are given in Table 2.
Table 2: ANN performance after 800,000 training iterations given a subset of Jacobi features
Trial   Expected Accuracy   Training Accuracy   Testing Accuracy
1       50.09±0.34%         46.88%              50.01%
2       50.01±0.45%         46.88%              50.06%
3       49.91±0.19%         51.56%              49.75%
As Table 2 indicates, using only a handful of features from the Jacobi algorithm was ineffective,
with testing accuracies being within one standard deviation of random guessing in all cases. Unlike
the first experiment, here the training accuracies were also held close to random chance, with only
Initialization 3 managing to move notably outside what would be expected by random guessing. While
the results of this test were not encouraging from a testing accuracy perspective, the data was at least
not over–fit. The next step was therefore to add all comparisons employed by the full recursive
Jacobi algorithm to the feature set in order to provide the network with more opportunity for rule
generation. This is possible since the maximum possible recursive depth of the algorithm is ⌈log2(n)⌉. Since five features are provided per iteration, the total feature count is still polynomial in the length
of the input, and therefore represents a cryptographically viable feature space. One consideration
when implementing this feature set is that normally the recursive algorithm would terminate after a1
reaches a value of 1 or 0. Since all feature vectors to the neural network must be the same length,
once a1 reaches one of these termination values the feature vector [0,1,2,2,1] is appended until the
necessary number of iterations has been completed. These values were chosen since they would leave
the output of the true Jacobi algorithm unchanged if they were to appear during one of its iterations.
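A sketch of how such a padded feature vector might be assembled follows (the function name and layout are mine; the project additionally normalized each value to the range [0, 1]):

```python
from math import ceil, log2

def jacobi_features(a, n):
    """Comparators used at each level of the recursive Jacobi algorithm,
    padded to depth ceil(log2(n)) with [0, 1, 2, 2, 1], values which would
    leave the true algorithm's output unchanged. Values are unnormalized."""
    depth = ceil(log2(n))
    features = []
    for _ in range(depth):
        if a in (0, 1):                   # recursion has terminated: pad
            features.extend([0, 1, 2, 2, 1])
            continue
        e, a1 = 0, a
        while a1 % 2 == 0:                # write a = 2^e * a1 with a1 odd
            a1 //= 2
            e += 1
        features.extend([e % 2, n % 8, n % 4, a1 % 4, n % a1])
        a, n = n % a1, a1                 # arguments of the recursive call
    return features

v = jacobi_features(12345, 100043)
assert len(v) == 5 * 17                   # 100043 has binary length 17
```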
This feature set was very effective, as demonstrated in Table 3.
Table 3: ANN performance on n = 100043 given all comparisons used in the Jacobi algorithm
Trial   Iterations   Expected Accuracy   Training Accuracy   Testing Accuracy
1       169,800      50.05±0.29%         100.00%             99.01%
2       170,900      49.97±0.42%         100.00%             98.45%
3       372,200      50.02±0.34%         100.00%             96.45%
The neural network training algorithm was set to terminate after achieving a training accuracy
of 100% for 10 checkpoints in a row; checkpoints being evaluated every 100 iterations. Thus Table 3
can also be used to study the speed and reliability of convergence. All three networks demonstrated
the ability to effectively mimic the Jacobi algorithm given only the modular comparators it is based
on, reaching perfect training accuracy and near perfect testing accuracy. Although all three networks
were eventually successful, the number of iterations required to learn the rules varied widely based on
starting configuration; Trial 3 taking more than twice as many iterations to converge as Trial 1. This
suggests that testing multiple network initializations may be highly important when transitioning to
the more complex case of composite numbers. It is also worth noting that the more iterations the
network required to converge, the lower its testing accuracy was. This may indicate that poorly initialized networks must develop overly complicated decision rules in order to achieve high performance,
in turn introducing greater over–fitting.
A natural question is whether a system trained for quadratic residue detection on one number
may be used on another with little to no retraining. To answer that question, three new prime values
were considered. Since the binary length of n determines the number of features for this particular
representation, new values of n need to have the same binary length. 100043 has a binary length of
17, meaning that the new values of n to be considered must also have binary length 17. The smallest
such safe prime is 65543. The largest is 130787. The value 98207 was also tested since it is roughly
half way through the permissible range of numbers, and is closer to the value on which the networks
were originally trained. These tests are summarized in Table 4.
Table 4: ANNs retrained for different n
Network ID   n        Initial Testing Accuracy   Training Iterations   Final Testing Accuracy
1            65543    61.61%                     9,600                 99.46%
1            98207    56.09%                     13,200                99.30%
1            130787   98.47%                     41,000                99.21%
2            65543    64.70%                     17,500                99.34%
2            98207    61.93%                     29,300                98.71%
2            130787   97.54%                     41,000                98.96%
3            65543    62.72%                     23,900                98.76%
3            98207    61.32%                     30,500                98.05%
3            130787   95.02%                     103,800               96.62%
Although it took a fair number of iterations in order to meet the stopping criteria outlined above,
Network 1 was able to achieve > 98% testing accuracy after only 2000 iterations for all values of n
investigated. Even the less favorable full iteration counts are in all but one case over an order of
magnitude smaller than those seen during the first training. The initial accuracy penalty faced by the
system appears to be due to the fact that the feature space was normalized by n in order to bound
the features between zero and one; a process which was found to help prevent numerical instability
during training. Unfortunately this means that the decision thresholds must be modified to account
for the changing size of 1/n. This interpretation is supported by the fact that larger values of n take only a small hit to accuracy even without retraining, whereas if n is smaller than the value originally trained on then 1/n will be larger, thereby causing whatever threshold the neural network set to model the Jacobi stopping criterion to be exceeded. In order to verify this hypothesis, the feature space was
modified so that instead of encoding n (mod a1) directly it was converted to a boolean expression
testing whether or not n (mod a1) was equal to 1. While this limits the potential flexibility of the
model, it also makes the system much more scalable by removing precision concerns introduced by
the need to normalize by n. Initial training results are given in Table 5.
Table 5: ANN performance on n = 1000667 using a boolean test rather than normalized feature
Trial   Iterations to 98% Testing Accuracy   Expected Accuracy   Final Training Accuracy   Final Testing Accuracy
1       1,081,000                            50.01±0.11%         100.00%                   99.93%
2       1,177,000                            50.05±0.08%         99.45%                    99.71%
3       1,728,000                            49.99±0.11%         98.62%                    98.02%
Whereas the system with normalized features required roughly 1.6n training iterations in order to converge to a satisfactory solution, the system which replaced the 1/n normalization with a boolean feature required only 1.3n training iterations, despite dealing with a much larger value of n. While this performance gain
is not enormous, the real advantage comes during re-training which is demonstrated in Table 6.
Table 6: ANNs retrained for different n
Network ID   n         Initial Testing Accuracy   Training Iterations to 98% Testing Accuracy   Final Testing Accuracy
1            524387    99.97%                     0                                             99.97%
1            786803    99.93%                     0                                             99.82%
1            1048127   66.43%                     <1,000                                        99.92%
2            524387    99.75%                     0                                             99.86%
2            786803    99.66%                     0                                             99.84%
2            1048127   66.33%                     <1,000                                        99.75%
3            524387    98.70%                     0                                             98.49%
3            786803    98.10%                     0                                             98.64%
3            1048127   64.98%                     22,000                                        98.31%
Interestingly the change in feature representation led to initial performance on smaller n values
being dramatically higher than on larger n values, a reversal from the previously observed pattern.
The networks were, however, able to retrain to these larger n values in very short order. One pat-
tern consistent across both feature representations is that networks which take longer to reach peak
performance also tend to have lower peak performance, as well as requiring longer to retrain. Such
networks are likely arriving at locally optimal solutions which are more complex than the true algorithm, suggesting that it is worthwhile to try multiple initializations when searching for more complex
solutions in the composite case. Furthermore, given the apparent benefits of the boolean approach
over normalizing by n, it was further investigated in the course of characterizing network performance
on composite bases.
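For reference, the ground truth labels used in the prime–base experiments can be generated directly from Euler's criterion: for an odd prime n, b is a quadratic residue exactly when b^((n−1)/2) ≡ 1 (mod n). The following is a minimal sketch (the function name is illustrative and is not taken from the project's implementation):

```python
def is_qr_mod_prime(b: int, n: int) -> bool:
    """Euler's criterion: for odd prime n and gcd(b, n) == 1,
    b is a quadratic residue mod n iff b^((n-1)/2) == 1 (mod n)."""
    return pow(b, (n - 1) // 2, n) == 1

# Sanity check on the small prime 23: the residues are exactly the
# squares of the units modulo 23.
squares_mod_23 = {b * b % 23 for b in range(1, 23)}
assert all(is_qr_mod_prime(b, 23) == (b in squares_mod_23)
           for b in range(1, 23))
```

The same check scales to the n values used above, since the modular exponentiation costs only O(log n) multiplications.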
5.2 Composites
5.2.1 Jacobi Method
While the Jacobi symbol is an effective mechanism for differentiating quadratic residues from non–
residues in a prime modular base, it is insufficient for a composite modular base, providing only a 50%
probability of success if n = pq where p and q are prime. This does not mean, however, that a neural
network trained with the modular discriminants used by the Jacobi symbol algorithm will necessarily
fail. With that in mind, the inputs which led to the results summarized in Table 3 were employed to
train new networks for the composite case given the base 347 ∗ 359 = 124573. For this experiment the
dataset consisted exclusively of residues and non–residues with Jacobi symbol equal to one, meaning
that simply mimicking the Jacobi algorithm would not be of any help in solving the problem. The
results of this investigation are provided in Table 7.
Table 7: ANN performance on n = 124573 given all comparisons used in the Jacobi algorithm

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 101,700 | 50.17±0.35% | 93.33% | 65.75%
2 | 105,100 | 49.99±0.29% | 96.89% | 66.13%
3 | 120,100 | 50.04±0.45% | 92.00% | 65.70%
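For context, the features in question derive from the standard binary Jacobi symbol algorithm, and the 50% ambiguity for n = pq noted above can be seen directly on a small example (p = 7 and q = 11 here for speed; the experiments used 347 * 359). This sketch is illustrative and independent of the project's code:

```python
from math import gcd

def jacobi(a: int, n: int) -> int:
    """Standard binary algorithm for the Jacobi symbol (a/n); n odd, positive."""
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:            # factor out 2s using the (2/n) rule
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                  # quadratic reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

# For n = pq, exactly half of the units with Jacobi symbol 1 are true residues,
# so the symbol alone gives only a coin-flip answer on this subset.
n = 7 * 11
units = [b for b in range(1, n) if gcd(b, n) == 1]
squares = {b * b % n for b in units}
jacobi_one = [b for b in units if jacobi(b, n) == 1]
assert sum(b in squares for b in jacobi_one) == len(jacobi_one) // 2
```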
Interestingly, the neural networks were each able to exceed random chance guessing by a significant
margin. While the difference between training and testing accuracy indicates that some over–fitting
is occurring, it appears that there may be some previously unknown pattern in the data that is also
being discovered. Exchanging the normalized feature n (mod a1)/n for the boolean n (mod a1) == 1, as
was described in Tables 5 and 6, shows that the possible loss of flexibility does not result in a large
decrease in testing accuracy. It does, however, lead to a drop in training accuracy, which is desirable
since it means the network is less prone to simply memorizing its input data. These results are
summarized in Table 8, and led to the
adoption of the boolean feature in future experiments.
Table 8: ANN performance on n = 124573 given boolean test value rather than normalized mod

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 116,700 | 49.95±0.31% | 88.00% | 65.77%
2 | 146,000 | 49.88±0.25% | 86.22% | 64.58%
3 | 103,100 | 49.88±0.36% | 88.89% | 64.66%
In an effort to determine whether final accuracy is being impacted by network architecture, the
experiment was re–run replacing the hyperbolic tangent functions previously used throughout the
hidden layer with RELU–6 activation functions. The resulting network performance is given in Table
9.
Table 9: ANN performance on n = 124573 using the RELU–6 activation function.

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 44,100 | 50.04±0.24% | 83.56% | 63.88%
2 | 27,300 | 50.00±0.44% | 84.89% | 64.67%
3 | 140,400 | 49.96±0.32% | 71.56% | 61.39%
It appears that despite the recent popularity of the RELU activation function in image processing
applications, the hyperbolic tangent function is more suitable for this particular application, providing
on average an absolute gain of 2.5% testing accuracy. The RELU networks did, however, train
dramatically faster than the hyperbolic tangent networks. The third trial was an exception, taking a
very long time to reach 60% accuracy. This appears to be because it became stuck in a very
unfavorable local minimum, causing 60% to be close to its asymptotic final performance level.
If training speed is desired over final performance, initializing a large number of RELU networks and
keeping the best one may be the optimal choice.
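For readers unfamiliar with it, RELU–6 is simply the standard rectifier clipped above at 6, which bounds hidden activations; the two activations compared here can be sketched in a few lines (illustrative code, not the project's TensorFlow graph):

```python
import math

def relu6(x: float) -> float:
    """RELU-6: the standard rectifier, clipped above at 6."""
    return min(max(0.0, x), 6.0)

def tanh_act(x: float) -> float:
    """Hyperbolic tangent activation, bounded to (-1, 1)."""
    return math.tanh(x)

# The clipped rectifier is piecewise linear, which tends to give larger
# gradients early in training than the saturating tanh.
assert relu6(-2.0) == 0.0 and relu6(3.0) == 3.0 and relu6(10.0) == 6.0
```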
Another key parameter of network architecture is hidden layer count. To explore the impact of
this variable on performance, the number of hidden layers in the network was increased from 4 to 5.
The results are given in Table 10.
Table 10: ANN performance on n = 124573 using 5 fully connected hidden layers

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 85,000 | 49.79±0.41% | 96.00% | 65.46%
2 | 67,200 | 49.93±0.25% | 94.22% | 66.48%
3 | 77,500 | 49.82±0.39% | 95.11% | 65.18%
Compared to the four layer networks from Table 7, there was no improvement gained by adding the
fifth layer. In fact, the average absolute testing error actually worsened by 0.15%, though this is
within random noise fluctuations. It therefore appears that whatever pattern is being detected is not
being inhibited by network depth. Since each hidden layer contains 100 neurons, the addition of the
extra hidden layer added as many neurons to the system. An alternative way to add 100 neurons is
to retain the depth 4 structure while boosting the per–layer neuron count to 125. Doing this led to
the results in Table 11.
Table 11: ANN performance on n = 124573 using 125 neurons per layer

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 125,300 | 50.01±0.42% | 95.56% | 66.34%
2 | 123,500 | 50.08±0.26% | 96.44% | 66.27%
3 | 135,000 | 49.97±0.32% | 96.44% | 66.57%
While the average testing accuracy of the 125 neuron–per–layer system is greater than the 100–
per–layer system by 0.53%, the larger networks take more than 2 hours longer to train. Given that
the additional benefits provided by the larger system are so limited that they may even be an artifact
of random noise, the smaller system appears better suited to the problem.
Although the Jacobi symbol itself should be of no use here, it could still be the case that networks
which have been trained for that purpose might provide a better starting point than random initializa-
tion for differentiating quadratic residues. To determine whether this is indeed the case, the networks
from Table 3 were used as the starting points for training the new quadratic residue detection systems.
The results of this experiment are given in Table 12.
Table 12: ANN performance on n = 124573 initialized from Jacobi networks

Network ID | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 39,700 | 50.01±0.42% | 93.78% | 65.25%
2 | 40,600 | 50.08±0.26% | 93.33% | 65.17%
3 | 24,200 | 49.97±0.32% | 93.33% | 64.84%
Although the pre-initialized networks did not attain a higher testing accuracy than those which
were randomly initialized, they did approach this value 3 times faster, requiring approximately 74,000
fewer iterations to reliably exceed 60%. Although the network training started out much faster based
on this pre–training, hundreds of thousands of iterations were still required to fully re–train the
network. This is weak evidence that whatever data property is being exploited by the system is
not directly connected to the decisions made in the Jacobi algorithm, though something about the
architecture of networks which have learned that algorithm is clearly valuable.
One obvious problem with these approaches is that the number of training iterations needed to
achieve better than random guessing is larger than the value of n being tested. In order to be useful
for cryptographic applications, a neural network would need to be trained on one value of n and then
be re–tasked to different n values with minimal retraining. To determine whether this is feasible,
the networks from Table 7 were retrained on an n value of 263 * 479 = 125977. The outcome is recorded
in Table 13.
Table 13: ANN performance on n = 125977 transferred from pre–trained networks

Network ID | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 14,500 | 49.92±0.37% | 90.22% | 64.23%
2 | 17,300 | 50.03±0.35% | 88.00% | 64.07%
3 | 16,200 | 49.85±0.32% | 93.33% | 64.68%
Although the final testing accuracy of the transfer learning system was no better than the results
from Table 7, the training accuracy did increase towards this asymptote much more rapidly. Whereas
the original training required an average of 108966 iterations to achieve 60% testing accuracy, the
retrained system was able to achieve this accuracy after an average of only 16000 iterations: nearly 7
times faster. This process is still slower than simply factoring n, but if the underlying pattern being
exploited by the network can be better understood, it may be possible to either extract an explicit
algorithm or else design networks which retrain more quickly in the future.
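For comparison, the "simply factor n" baseline mentioned above is direct: trial–divide to recover p and q, then apply Euler's criterion modulo each factor, since b (coprime to n) is a residue mod pq exactly when it is a residue mod both primes. A sketch with illustrative names:

```python
def trial_factor(n: int):
    """Recover (p, q) for odd n = p * q by trial division: O(sqrt(n)) steps."""
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("no nontrivial odd factor found")

def is_qr_composite(b: int, n: int) -> bool:
    """For b coprime to n = pq, b is a quadratic residue mod n iff it is a
    residue mod p and mod q, each checked via Euler's criterion."""
    p, q = trial_factor(n)
    return pow(b, (p - 1) // 2, p) == 1 and pow(b, (q - 1) // 2, q) == 1

# The n = 124573 case used throughout this section factors instantly.
assert trial_factor(124573) == (347, 359)
```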
Finally, additional features can also be added on top of the Jacobi algorithm comparators in an
effort to boost performance. Since it is not yet known how the neural networks are achieving their
improved performances, it is not clear what features should be added to arrive at an optimal system.
Testing a variety of modular comparators at random, however, reveals that computing n (mod 7) at
each stage of recursion has a substantive impact on accuracy. These improvements are summarized
in Table 14.
Table 14: ANN performance on n = 124573 with additional feature: n (mod 7)

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 54,800 | 49.79±0.41% | 99.56% | 70.71%
2 | 65,800 | 50.05±0.47% | 98.22% | 70.10%
3 | 56,700 | 50.11±0.36% | 99.11% | 71.17%
Adding a single feature, n (mod 7), at each level of recursion led to an average testing accuracy
increase of 4.77%, significantly beyond the influence of random noise variations for the system. It also
cut the required training iterations to reach 60% accuracy nearly in half. Other features were also
investigated, computing n and a against modular bases of 2, 3, 6, 9, 11, 13, and 17. None of these
individually yielded gains as significant as did the introduction of n (mod 7), though in combination
these features pushed the testing accuracy up to 75.02%. It is hoped that other as–of–yet unexplored
features might be able to bring testing accuracies to 100%.
In preparation for testing on a wider variety of n values, the network architecture was modified so
that each hidden layer contains a number of neurons equal to the number of input features. For the
values of n previously discussed this means 102 neurons, an insignificant change given that an increase
to 125 neurons was already shown to have very little impact. Performance for these various n values
is summarized in Table 15.
Table 15: ANN performance for various n

n | Factors | Iterations to 60% Testing Accuracy | Final Training Accuracy | Final Testing Accuracy
124,573 | 347 * 359 | 54,800 | 99.56% | 70.71%
125,321 | 7 * 17903 | <100 | 100.00% | 100.00%
126,109 | 23 * 5483 | 49,100 | 99.56% | 71.72%
603,241 | 719 * 839 | 130,500 | 92.04% | 80.13%
848,329 | 863 * 983 | 460,000 | 86.35% | 79.29%
854,941 | 839 * 1019 | 1,530,000 | 79.51% | 62.81%
995,893 | 839 * 1187 | 1,857,000 | 71.90% | 61.15%
1,076,437 | 839 * 1283 | 2,239,000 | 71.07% | 61.82%
1,307,377 | 1019 * 1283 | >3,262,000 | 48.80% | 50.67%
1,551,937 | 1019 * 1523 | >4,821,100 | 68.14% | 57.38%
There are several interesting phenomena captured in Table 15. First, there was a network which
achieved 100% performance, and did so extremely quickly. It is believed that this high level of
performance was achieved because one of the factors of n was 7, and the data mod 7 is one of
the feature values computed along the way. To see whether small factors in general lead to high
performance, an n value was tested having 23 as one of its factors. This network behaved virtually
identically to the network trained on an n with two factors around 350, beating out the larger–factor
network by only 1%. While this might have indicated a weak pattern of decline proportional to
factor size, a network with factors of 719 and 839 managed a testing accuracy of over 80%. It’s not
clear why the performance jumped so high for this value of n, but it does indicate that the system
is able to handle at least certain subsets of larger factors. On the other hand, system performance
took a precipitous hit on larger numbers, to the point where it could do barely better than random
guessing for n = 1307377. These results are complicated by the fact that the n values above 603241
were trained using a different implementation of the system which generated data on the fly in order
to reduce memory consumption. When that same technique was applied to n = 1551937 it also
failed; the 57% accuracy was instead achieved by a highly memory–intensive implementation. This
may indicate that the decrease in performance observed for the largest n values is a byproduct of
implementation difficulties rather than representative of any trend inherent to the problem. Larger n
values of 32188213, 860755297, and 25840758901 were also tested, but they had to be terminated due
to resource constraints before any significant progress was made.
This brings to the forefront a significant drawback of these results: even if the observed ability to
achieve greater than 50% accuracy holds for cryptographically sized values of n, it would be completely
infeasible to train such networks. Furthermore, experiments providing the neural networks with only
a small (logarithmic or square root) subset of the available data for training consistently failed. This
could be evidence that, when successful, the networks are only memorizing some pattern unique to
each value of n rather than learning a generalizable set of rules for detection. On the other hand, if
that were the case then pre–training a network on one n value should not have improved the learning
rate when transferring to other n values, as it did in Tables 12 and 13. Unfortunately, efforts to
demonstrate transferability of learning amongst the larger n values failed, though the mechanism of
failure is highly indicative of an implementation error. When attempting to re–train networks which
had previously been trained for n = 603241, n = 854941, and n = 1076437, all attempts failed to even
move beyond a training accuracy of 50%, let alone having any impact on testing performance. This
is what one would expect to see if the step size employed on each training iteration was forced to be
very small, a potential byproduct of the way in which the TensorFlow library handles reloading of
saved networks. Finding a work–around for this problem and further investigating the transferability
of systems at large n values would provide useful insights for future work seeking to discover how
these networks are able to generate such unexpectedly high accuracies.
5.2.2 Modular Exponentiation Method
A fundamentally different approach to quadratic residue detection is to compute b^x (mod n) for various
values of x and use the resulting values as features. Such features are inspired by the fact that for
prime n, b^((n−1)/2) (mod n) perfectly separates quadratic residues from non–residues. Unfortunately it
appears that not all possible powers x contain information useful for solving the quadratic residue
problem. For example, for n = 124573 suppose one were to select values of x equal to 3 and 897. The
resulting distribution of residues and non–residues would be virtually random, as visualized in Figure
1.
Figure 1: Apparently Random Distribution of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^3 (mod n), y axis b^897 (mod n), n = 124573.]
There are, however, values of x for which the data becomes obviously separable. Recall that n = pq
and that p = 2r + 1 and q = 2t+ 1. Then for odd multiples of rt, perfect separation of residues from
non–residues can be observed. This situation is depicted in Figure 2.
Figure 2: Perfect Separation of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^1 (mod n), y axis b^30967 (mod n), n = 124573.]
While such a case is simple enough so as to not require a neural network at all, it is very unlikely
that one would find an odd multiple of rt by sampling at random from the number line. As it turns
out, however, there are other values of x which are almost as useful for solving the problem. For
example, if both rt + 1 and 1 are in the feature space, then the residues may still be easily separated
from non–residues, as depicted in Figure 3.
Figure 3: Easy Separation of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^1 (mod n), y axis b^30968 (mod n), n = 124573.]
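The separation at odd multiples of rt can be verified directly on a small analogue (p = 23 and q = 47 here, chosen for speed, so r = 11 and t = 23 are both odd and rt = 253): quadratic residues satisfy b^rt ≡ 1 (mod n), while non–residues with Jacobi symbol 1 satisfy b^rt ≡ n − 1, which is exactly the two–band structure visible in Figure 2. A quick sketch:

```python
from math import gcd

p, q = 23, 47                       # small stand-ins for 347 and 359
n = p * q
r, t = (p - 1) // 2, (q - 1) // 2   # both odd here, as in the text
rt = r * t

units = [b for b in range(1, n) if gcd(b, n) == 1]
squares = {b * b % n for b in units}

# b^(rt) mod n equals 1 exactly for the quadratic residues...
assert all((pow(b, rt, n) == 1) == (b in squares) for b in units)

# ...and equals n - 1 for Jacobi-symbol-1 non-residues, i.e. those that are
# non-residues modulo both p and q.
both_nonres = [b for b in units
               if pow(b, r, p) == p - 1 and pow(b, t, q) == q - 1]
assert all(pow(b, rt, n) == n - 1 for b in both_nonres)
```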
In general, features of the form b^x (mod n) appear to be useful to the neural network if and only
if x is of the form m·r or m·t for m ∈ Z+. Features of the form m·r + y and m·t + y can also be
useful when paired with y. If no features of those forms are provided then the network is held around
50% testing accuracy. The question then becomes whether finding these useful values of x is easier
than simply factoring n and then using that knowledge to compute the quadratic residue status of a
number. The following analysis demonstrates that it is unlikely to be asymptotically easier if one is
restricted to randomly sampling values from the number line:
n = pq
n = (2r + 1)(2t + 1)
n = 4rt + 2r + 2t + 1
n = 4rt + 2(r + t) + 1
4rt = n − 1 − 2(r + t)
rt = (n − 1 − 2(r + t)) / 4
The bounds on rt therefore depend on the possible values of r + t. The smallest such sum would be
achieved if r = t, in which case r + t = (√n − 1)/2 + (√n − 1)/2 = √n − 1. Thus
rt ≤ (n − 1 − 2(√n − 1))/4 = n/4 + 1/4 − √n/2. Likewise the largest possible value of r + t would be
achieved if p and q are as far apart as possible, so p = 3 and q = n/3. In this case
r + t = 1 + (n/3 − 1)/2 = (n + 3)/6, which means that rt ≥ (n − 1 − 2(n + 3)/6)/4 = n/6. The
initial bounds on rt are therefore:

n/6 ≤ rt ≤ n/4 + 1/4 − √n/2
By employing trial division up to 5 before considering this method, the lower bound may be
raised to n/5. Unfortunately, the lower bound asymptotically approaches n/4 as more trial divisions are
performed, meaning that the gap between the lower and upper bounds cannot be closed. Thus the
space which must be searched in order to encounter rt is O(n), worse than trial division, which is
O(√n).
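These bounds can be sanity–checked numerically for the n = 124573 example (p = 347, q = 359, so rt = 173 * 179 = 30967); since p and q are nearly equal, rt sits almost exactly at the upper bound:

```python
n, p, q = 124573, 347, 359
r, t = (p - 1) // 2, (q - 1) // 2
rt = r * t                            # 173 * 179 = 30967

lower = n / 6                         # bound before any trial division
upper = n / 4 + 1 / 4 - n ** 0.5 / 2  # bound from the r = t case
assert lower <= rt <= upper           # roughly 20762.2 <= 30967 <= 30967.03
```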
On the other hand, if one expands the search space to include the multiples of r and t, then target
instances become much more bountiful. Since n = pq and since p and q are distinct, exactly one of p
or q must be less than √n. Let p be the smaller one, so that r = (p − 1)/2 ≤ (√n − 1)/2 and
consecutive multiples of r are spaced at most (√n − 1)/2 apart. Thus if one samples (√n − 1)/2
contiguous elements anywhere on the number line above √n, it is guaranteed that at least one useful
feature will be encountered. Unfortunately this search space is still O(√n) and thus no better than
trial division. Although neural network testing accuracies of over 99% were obtained using only a
logarithmically sized feature set for numbers up to around 1 million, this solution is not scalable to
cryptographically relevant sizes. As weak evidence against the existence of a better selection scheme,
consider that if it were possible to reliably encounter multiples of r and t, then one could use the
GCD algorithm on these multiples to infer the values of r and t and thereby factor n in polynomial
time.
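The closing observation can be made concrete: if sampling reliably produced several multiples of r, their GCD would reveal r itself, and hence the factor p = 2r + 1. A toy illustration (n = 1081 = 23 * 47, so r = 11; the "sampled" multiples are of course contrived here):

```python
from functools import reduce
from math import gcd

n, r = 1081, 11                       # n = 23 * 47, r = (23 - 1) / 2
multiples = [3 * r, 11 * r, 16 * r]   # pretend these were sampled: 33, 121, 176
recovered = reduce(gcd, multiples)    # gcd of random multiples of r is r w.h.p.
assert recovered == r
assert n % (2 * recovered + 1) == 0   # p = 2r + 1 = 23 divides n
```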
5.2.3 Alternative Methods
Several other input feature schemes were also considered, none of which saw any concrete improvement
over random guessing. First was the use of b (mod i) and n (mod i) for a logarithmic number of small
values of i ∈ Z+. The hope was that the system would find a pattern based on some previously
unconsidered implication of the Chinese Remainder Theorem, though the neural networks were unable
to make any headway there.
A wide variety of Fibonacci sequence values modulo different bases were also considered, for
example Fib_{b+1} (mod n). These were tested since similar features appear in an unproven primality
testing heuristic and it was hoped that their utility might extend to other number theory problems.
Unfortunately the neural networks were unable to make effective use of these features either.
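Although these features proved unhelpful, computing them is cheap: Fib_{b+1} (mod n) needs only O(log b) modular multiplications via the fast–doubling identities F(2m) = F(m)(2F(m+1) − F(m)) and F(2m+1) = F(m)² + F(m+1)². A sketch (this is a standard technique, not the project's code):

```python
def fib_mod(k: int, n: int) -> int:
    """F(k) mod n via fast doubling: O(log k) modular multiplications."""
    def doubling(m):
        # Returns (F(m) mod n, F(m+1) mod n).
        if m == 0:
            return 0, 1
        a, b = doubling(m >> 1)
        c = a * (2 * b - a) % n        # F(2m) = F(m) * (2F(m+1) - F(m))
        d = (a * a + b * b) % n        # F(2m+1) = F(m)^2 + F(m+1)^2
        return (d, (c + d) % n) if m & 1 else (c, d)
    return doubling(k)[0]

assert fib_mod(10, 1000) == 55        # F(10) = 55
assert fib_mod(12, 7) == 144 % 7      # F(12) = 144
```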
6 Conclusions and Future Work
This work has shown that neural networks are capable of learning to mimic the Jacobi symbol al-
gorithm in a way which is transferable to modular bases on which the network was not trained.
While these networks are less efficient than the standard Jacobi algorithm, and therefore not of any
pragmatic utility, this serves as another example of the ability of neural networks to converge on a
moderately complicated algorithm. This system also provides a launching point for addressing more
complex problems.
Using Jacobi algorithm features to drive machine learning, it appears that a significant improve-
ment over random guessing may be achieved in the problem of quadratic residue detection for compos-
ite numbers of the form n = pq. Such a result challenges the conventional wisdom of the community,
which suggests that quadratic residue detection is as unguessable as a coin toss [7, 8]. There are several
key considerations, however, which restrain the impact of this apparent result. First and foremost is
that cryptographically relevant n values tend to be on the order of 500 to 1000 bits. The numbers
tested during this investigation are only on the order of 17 to 21 bits. It is possible, therefore, that any
patterns detected on the small scale are not present at the larger scale. This would have been investi-
gated in greater detail if not for a second problem: training the neural networks from scratch appears
to require O(n) data instances and iterations. Using only a personal computer, it is not possible to
train a network on such a large set of data due to both memory and time constraints. To fully train a
network on a cryptographically relevant scale would be intractable even on a supercomputer unless a
more efficient convergence mechanism can be constructed. Based on the experiments presented here,
using a deeper network with simpler activation functions could be one way to cut down the training
iteration requirement. Another possible avenue of attack would be to take a random–forest style ap-
proach to the problem, providing numerous neural networks with only a small amount of training and
then taking a majority vote to generate the classification. One potential problem with such a method
is that, if the networks are learning some underlying algorithm for residue detection, their errors may
not be independent and thus the ensemble system may not see improved performance. Moreover, it
may still require O(n) training iterations simply to exceed 50% accuracy, in which case the ensemble
method can offer no advantage.
Given the relatively small size of the numbers investigated during this study, and the resource–
intensive training requirements of the neural networks, a much faster solution to quadratic residue
detection would have been to simply factor n and then directly compute the residue status. It is worth
noting, however, that factoring algorithms have been under development for centuries whereas the work
outlined in this paper has been under development for several weeks. Just as factoring algorithms have
become more and more efficient over time, it is possible that this methodology might be significantly
improved upon in order to decrease resource requirements and further improve accuracy. Over the
course of a relatively small number of experiments, features were discovered which provided an absolute
testing accuracy performance boost of up to 10% on top of what was already provided by the Jacobi
symbol features. If a polynomial–time algorithm for residue status detection exists, neural networks
of this form could allow for an easy way to identify what sort of computations are involved in it.
By studying the impact of different features on network performance, it may be possible to discover
a previously unknown algorithm. This would in turn make the computational resource requirement
of the neural networks a moot point and allow for solutions to even cryptographically sized inputs.
Although neural networks are conventionally treated as black–box systems, there is no reason that
they could not be used as relevant–feature detectors. Future work should therefore focus on the search
for additional features, as well as on performance optimizations so that applicability on larger n may
be investigated.
References
[1] David S. Dummit and Richard M. Foote. Abstract Algebra. English. 3rd ed. Wiley, 2004.
isbn: 978-0-471-43334-7.
[2] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied
Cryptography. English. 5th ed. Waterloo: CRC Press, 2001. isbn: 0-8493-8523-7.
[3] Kenneth H. Rosen. Elementary number theory and its applications. English. Boston:
Addison-Wesley, 2011. isbn: 978-0-321-50031-1 0-321-50031-8.
[4] Douglas R. Stinson. Cryptography Theory and Practice. English. 3rd ed. Waterloo: CRC
Press, 2006. isbn: 978-1-58488-508-5.
[5] Richard A. Mollin. An Introduction to Cryptography. English. 1st ed. New York: CRC
Press, 2001. isbn: 1-58488-127-5.
[6] Tibor Jager and Jörg Schwenk. The Generic Hardness of Subset Membership Problems
under the Factoring Assumption. Cryptology ePrint Archive, Report 2008/482. 2008.
url: http://eprint.iacr.org/.
[7] Ronald L. Rivest. “Advances in Cryptology — ASIACRYPT ’91: International Con-
ference on the Theory and Application of Cryptology Fujiyosida, Japan, November 1991
Proceedings”. In: ed. by Hideki Imai, Ronald L. Rivest, and Tsutomu Matsumoto. Berlin,
Heidelberg: Springer Berlin Heidelberg, 1993. Chap. Cryptography and machine learn-
ing, pp. 427–439. isbn: 978-3-540-48066-2. doi: 10.1007/3-540-57332-1_36. url:
http://dx.doi.org/10.1007/3-540-57332-1_36.
[8] Michael Kearns and Leslie Valiant. “Cryptographic Limitations on Learning Boolean
Formulae and Finite Automata”. In: J. ACM 41.1 (Jan. 1994), pp. 67–95. issn: 0004-5411.
doi: 10.1145/174644.174647. url: http://doi.acm.org/10.1145/174644.174647.
[9] Ron Rivest and Robert Silverman. Are ’Strong’ Primes Needed for RSA. 2001.
[10] Matthew Lai. “Giraffe: Using Deep Reinforcement Learning to Play Chess”. In: CoRR
abs/1509.01549 (2015). url: http://arxiv.org/abs/1509.01549.
[11] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”.
In: Nature 529.7587 (Jan. 2016), pp. 484–489. issn: 0028-0836. url: http://dx.doi.org/10.1038/nature16961.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with
Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing
Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url:
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[13] Jason Yosinski et al. “How transferable are features in deep neural networks?” In: CoRR
abs/1411.1792 (2014). url: http://arxiv.org/abs/1411.1792.
[14] Jeff Donahue et al. “DeCAF: A Deep Convolutional Activation Feature for Generic Visual
Recognition”. In: CoRR abs/1310.1531 (2013). url: http://arxiv.org/abs/1310.1531.
[15] Ali Sharif Razavian et al. “CNN Features off-the-shelf: an Astounding Baseline for Recog-
nition”. In: CoRR abs/1403.6382 (2014). url: http://arxiv.org/abs/1403.6382.
[16] G. Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics
of Control, Signals and Systems 2.4 (1989), pp. 303–314. issn: 1435-568X. doi: 10.1007/BF02551274. url: http://dx.doi.org/10.1007/BF02551274.
[17] Lei Jimmy Ba and Rich Caruana. “Do Deep Nets Really Need to be Deep?” In: CoRR
abs/1312.6184 (2013). url: http://arxiv.org/abs/1312.6184.