Quadratic Residues and Machine Learning
MS CS Project
Michael Potter
Advisors: Dr. Leon Reznik, Dr. Stanisław Radziszowski
Aug 8, 2016
Abstract
The problem of quadratic residue detection is one which is well known in both number theory and
cryptography. While it garners less attention than problems such as factoring or discrete logarithms,
it is similar in both difficulty and importance. No polynomial–time algorithm is currently known
to the public by which the quadratic residue status of one number modulo another may be determined. This work leveraged machine learning algorithms in an attempt to create a detector capable
of solving instances of the problem in polynomial–time. A variety of neural networks, currently at
the forefront of machine learning methodologies, were compared to see if any were capable of consis-
tently outperforming random guessing as a mechanism for detection. Surprisingly, neural networks
were repeatably able to achieve accuracies well in excess of random guessing on numbers up to 20
bits in length. Unfortunately this performance was only achieved after a super–polynomial amount
of network training, and therefore it is not believed that the system as implemented could scale to
cryptographically relevant inputs which typically have between 500 and 1000 bits. This nonetheless
suggests a new avenue of attack in the search for solutions to the quadratic residues problem, where
future work focused on feature set refinement could potentially reveal the components necessary to
construct a true closed–form solution.
1 Mathematics Background
Before discussing the definition of quadratic residues in detail, some customary notation must be
introduced. First, the set of integers modulo a base n is denoted as Zn. If only integers relatively
prime to n are considered, then the resulting set is denoted Z∗n. This set is incidentally also a group
defined under the operation of multiplication, meaning that – amongst other properties – multiplying
any two elements from the set will result in another member of the set.[1] Membership in Z∗n may be
tested in polynomial–time by checking whether the greatest common divisor (gcd) between a candidate
integer x and n is equal to 1. Algorithm 1 provides a method for computing the gcd between two
numbers which has running time O(log(n)^2).
Algorithm 1 Euclidean Algorithm for Greatest Common Divisor[2]
# Returns the greatest common divisor between non-negative integers a and b
def gcd(a, b):
    while b != 0:
        r = a % b
        a = b
        b = r
    return a
If n is a prime number, then all integers from 1 to n − 1 are relatively prime to n, and so Z∗n is
simply the set of integers from 1 to n − 1. Note that zero is not considered in the set since 0 ≡ n
mod n, and the greatest common divisor of any number n with itself is n, not 1. Thus zero does
not satisfy the criteria for membership in Z∗n. Consider also that since Z∗n is a group, any element a
must have an inverse a−1 such that (a)(a−1) = 1. Since any number multiplied by zero yields zero, it
cannot have a multiplicative inverse and thus zero cannot be a member of Z∗n.
In general the number of elements in Z∗n is denoted by the Euler totient function φ(n). From the
above it is clear that for prime n, φ(n) = n − 1. Suppose, however, that n is composite. Then n = ∏_{i=1}^{k} pi^ei, where the pi are distinct primes and the ei are positive integers. In this case φ(n) = ∏_{i=1}^{k} (pi^ei − pi^(ei−1)).[1] There
is no known polynomial–time algorithm for computing φ(n) without knowing the factorization of n. If
there were then one could use φ(n) to factor n in polynomial–time. While the general proof is beyond
the scope of this project, consider the case where n = pq with p and q being two distinct primes. The
following two equalities therefore hold:
n = pq (1)
φ(n) = (p− 1)(q − 1) (2)
Expanding Equation 2 and inserting Equation 1 gives the following:
φ(n) = pq − p − q + 1
φ(n) = n − n/q − q + 1
qφ(n) = qn − n − q^2 + q
0 = −q^2 + (n + 1 − φ(n))q − n
Applying the quadratic formula then provides a solution for q:
q = ((n + 1 − φ(n)) ± √((n + 1 − φ(n))^2 − 4n)) / 2    (3)
Unfortunately, attempting to infer φ(n) from random sampling seems infeasible since the number of elements which do not contribute to φ(n) grows proportionally to p + q whereas the population size grows as pq. In fact, if p and q are distinct odd primes, then it can be shown that 2n/3 − 2 ≤ φ(n) ≤ n − 2√n + 1, which clearly still requires O(n) evaluations to guess and is therefore exponential in the size of n.
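As an illustration of the derivation above, p and q can be recovered from n and φ(n) in a few lines. This is only a sketch with toy primes (real moduli would be hundreds of bits), and the function name is illustrative:

```python
import math

def recover_factors(n, phi):
    """Given n = p*q and phi = (p-1)(q-1), recover p and q via Equation 3.

    Equation 3 solves 0 = -q^2 + (n + 1 - phi)q - n, whose two roots
    are exactly q and p (the two branches of the quadratic formula).
    """
    s = n + 1 - phi                  # equals p + q
    disc = s * s - 4 * n             # equals (p - q)^2, a perfect square
    root = math.isqrt(disc)          # exact integer square root
    assert root * root == disc, "phi is not the totient of a two-prime n"
    return (s + root) // 2, (s - root) // 2

# Toy example: n = 347 * 359, phi = 346 * 358
p, q = recover_factors(347 * 359, 346 * 358)
```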
Turning now to the definition of quadratic residues, suppose that you have two integers a and b.
Then a is a quadratic residue of b if and only if the following two conditions hold:[3]
1. The greatest common divisor of a and b is 1
2. There exists an integer c such that c2 ≡ a (mod b)
The first condition is satisfied automatically so long as a ∈ Z∗b . There is, however, no known
polynomial–time algorithm for determining whether the second condition is satisfied. The difficulty
and significance of this problem will be discussed in more detail in Section 2.
While no general solution exists, there are partial solutions to the quadratic residue problem. In
particular, if b is prime then it is easy to determine whether a is a quadratic residue of b. This can be
accomplished simply by computing a^((b−1)/2) (mod b) in a test known as Euler's criterion. If the result is 1 then a is a quadratic residue of b. If the result is −1 then a is a non–residue, and if the result is 0 then a and b are not relatively prime.[4] This value may be computed using the square and multiply algorithm – see Algorithm 2 – which has complexity O(log(b)^3) and is therefore polynomial in the length of the input.
Algorithm 2 Square and Multiply Algorithm for Modular Exponentiation[4]
# Returns x^c mod n using the square and multiply method over the bits of c
def modexp(x, c, n):
    L = c.bit_length()
    z = 1
    i = L - 1              # bit i = L-1 is the most significant bit of c
    while i >= 0:
        z = z * z % n
        if (c >> i) & 1:   # multiply in x only when bit i of c is set
            z = z * x % n
        i -= 1
    return z
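As a brief sketch, Euler's criterion can be exercised with Python's built-in three-argument pow, which performs modular exponentiation in the same square-and-multiply fashion as Algorithm 2 (the function name here is illustrative; note that a result of −1 appears as b − 1):

```python
def euler_criterion(a, b):
    """Residue test for an odd prime b: returns a^((b-1)/2) mod b.

    1 means a is a quadratic residue of b, b - 1 (i.e. -1 mod b) means it
    is a non-residue, and 0 means a and b share a factor.
    """
    return pow(a, (b - 1) // 2, b)

# 2 is a quadratic residue mod 7 since 3^2 = 9 = 2 (mod 7); 3 is not
assert euler_criterion(2, 7) == 1
assert euler_criterion(3, 7) == 7 - 1
```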
In addition to being able to detect whether a particular number is a quadratic residue, it is also
useful to know how many quadratic residues will exist for a particular modular base. This is again
easiest to investigate in the case of primes. Consider that for any odd prime b, Z∗b is a cyclic group
and so there exists an element α ∈ Z∗b such that any other element in the set may be written as
α^j for some j ∈ Z.[1] Under this construction, the quadratic residue problem reduces to determining whether j is even or odd. If an element a can be written as an even power of α then it is a quadratic residue. Otherwise it is not.[2] The set of quadratic residues, denoted Qb, must therefore have order (b−1)/2 since half of all possible values of j are even. Likewise the set of non–residues, denoted Q̄b, also has
order (b−1)/2. Extending consideration to non–prime bases, suppose n is the product of two distinct odd
primes p and q. The Chinese Remainder Theorem guarantees a bijection between Z∗n and Z∗p × Z∗q .[5]
In particular, suppose that a1 ∈ Z∗p and a2 ∈ Z∗q . The theorem guarantees that a ≡ a1 (mod p)
and a ≡ a2 (mod q) has a unique solution modulo pq. This solution, a, is then the element in Z∗n corresponding to (a1, a2) ∈ Z∗p × Z∗q.
Due to this bijective property, a is a quadratic residue of n if and only if a ∈ Qp and a ∈ Qq. Thus
the number of quadratic residues of n, denoted |Qn|, is |Qp| · |Qq| = (p−1)(q−1)/4, whereas the number of non–residues in Z∗n, |Q̄n|, is 3(p−1)(q−1)/4. The remaining p + q − 2 elements from Zn will not be relatively prime to n and are therefore in neither Qn nor Q̄n.
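These counts are easy to confirm by brute force for a small two-prime base. A minimal sketch (the helper name is illustrative):

```python
from math import gcd

def residues(n):
    """Brute-force the set Qn of quadratic residues in Z*_n."""
    return {x * x % n for x in range(1, n) if gcd(x, n) == 1}

p, q = 7, 11
n = p * q
units = [x for x in range(1, n) if gcd(x, n) == 1]
# |Qn| = (p-1)(q-1)/4 residues; the remaining units are non-residues
assert len(residues(n)) == (p - 1) * (q - 1) // 4
assert len(units) - len(residues(n)) == 3 * (p - 1) * (q - 1) // 4
```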
Although there is not yet a known polynomial–time residue detection algorithm, there are methods
for efficiently computing partial solutions. As mentioned previously, for a prime value p, the residue status, r, of an integer a is given by r = a^((p−1)/2) (mod p). This value, also known as the Legendre symbol of a with respect to p, forms the basis of the Jacobi symbol: a generalized approximation of quadratic residues for composite numbers. Suppose n = ∏_{i=1}^{k} pi^ei. Then the Jacobi symbol for an integer a is defined to be (a/n) = ∏_{i=1}^{k} ri^ei, where ri is the Legendre symbol of a with respect to pi. While clearly trivial to compute given the factorization of n, this value may also be computed using Algorithm 3, which has running time O(log(n)^2) and does not require knowledge of the factorization of n.[2] Unfortunately, knowing the Jacobi symbol (a/n) is not sufficient to determine whether a is a quadratic residue of n. Consider the case where n = p · q. Then if (a/n) = −1 it is safe to conclude that a is not a quadratic residue of n. If, however, (a/n) = 1, then
it could either be that both rp and rq are positive, in which case a is a residue of n, or it could be
that both rp and rq are negative, in which case a is not a residue of n. Either case is equally likely,
meaning that if the Jacobi symbol evaluates to 1 there will be a 50% chance that the number will be
a quadratic residue. There are no common–knowledge polynomial–time algorithms which are capable
of improving on this estimate. Thus any system capable of outperforming random guessing would
represent a highly significant result in number theory, computer science, and cryptography.
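The ambiguity can be seen with a tiny example. Computing the Jacobi symbol directly from its definition (given the factorization; the helper names here are illustrative, not from the project) exhibits a non–residue whose symbol is nonetheless 1:

```python
def legendre(a, p):
    """Legendre symbol of a with respect to an odd prime p, via Euler's criterion."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r

def jacobi_from_factors(a, factors):
    """Jacobi symbol (a/n) straight from the definition, given the
    prime factorization of n as (prime, exponent) pairs."""
    s = 1
    for p, e in factors:
        s *= legendre(a, p) ** e
    return s

# With n = 15 = 3 * 5: a = 2 is a non-residue mod 15 (both Legendre
# symbols are -1), yet (2/15) = (-1)(-1) = 1, so the Jacobi symbol
# alone cannot settle residue status
assert jacobi_from_factors(2, [(3, 1), (5, 1)]) == 1
assert all(x * x % 15 != 2 for x in range(15))
```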
Algorithm 3 Jacobi Symbol Computation[2]
# Returns the Jacobi symbol (a/n), which will be either 1, -1, or 0
def Jacobi(a, n):
    if a == 0:
        return 0
    if a == 1:
        return 1
    # Write a = 2^e * a1 where a1 is odd
    e = 0
    a1 = a
    while a1 % 2 == 0:
        a1 //= 2
        e += 1
    # Determine the current symbol from the factors of 2
    s = 0
    if e % 2 == 0:
        s = 1
    else:
        comparator = n % 8
        if comparator == 1 or comparator == 7:
            s = 1
        elif comparator == 3 or comparator == 5:
            s = -1
    # Apply the quadratic reciprocity law
    if n % 4 == 3 and a1 % 4 == 3:
        s = -s
    # Recurse
    n1 = n % a1
    if a1 == 1:
        return s
    return s * Jacobi(n1, a1)
2 Cryptography Background
While less well known than factoring, the quadratic residue problem also has great significance in
cryptography. As seen in Section 1, quadratic residue detection is clearly not more difficult than
factoring. While no polynomial–time reduction from the decision problem to factoring is known,
it is conjectured to be of similar difficulty. This conjecture has been proven within the context of
generic ring algorithms, but not for full Turing machines. This proof, however, is not strong evidence
since computation of Jacobi symbols is also provably hard for generic ring algorithms despite the
polynomial–time general algorithm provided in Section 1. It could therefore be the case that quadratic
residue detection is polynomial–time computable while factoring is not.[6] Despite this possibility, it
has become common practice for researchers to assume the difficulty of quadratic residue detection
when attempting to prove the intractability of other less studied problems in computer science and
cryptography.[7][8]
The closely related problem of actually computing the square root of a quadratic residue, however,
has been proven to be as hard as factoring. To see this, suppose that you had an algorithm capable
of computing the square root of a residue in polynomial–time. The number of distinct square roots for a residue of some composite number n is equal to 2^L, where L is the number of distinct primes in the factorization of n.[4] Thus for standard cryptographically relevant composites of the form n = pq, there will be 2^2 = 4 roots for every quadratic residue. Pick any number x from 1 to n − 1 and square it mod n. Use the polynomial–time square root algorithm to find a square root of x^2. Since there are four possible solutions, the algorithm will return a root x′ distinct from x with probability 3/4. Of these distinct values, one will be −x (mod n), but two will be other values. So long as x′ ≠ x and x′ ≠ −x, then it may be used to compute gcd(x − x′, n), which will return a non–trivial factor of n. This algorithm therefore succeeds with probability 1/2, which means that after several iterations one can expect to find a non–trivial factor of n.
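This reduction can be sketched concretely. Since no real polynomial–time root finder is known, the oracle below cheats by using the secret factors (for p, q ≡ 3 (mod 4), one root of c mod p is c^((p+1)/4)); all names are illustrative, and the parameters are toy-sized:

```python
import math
import random

def sqrt_oracle(c, p, q):
    """Stand-in for the hypothetical polynomial-time square root algorithm.

    For p, q = 3 (mod 4), c^((p+1)/4) is a square root of c mod p; the
    Chinese Remainder Theorem lifts a pair of such roots to a root mod p*q.
    """
    rp = pow(c, (p + 1) // 4, p)
    rq = pow(c, (q + 1) // 4, q)
    n = p * q
    return (rp * q * pow(q, -1, p) + rq * p * pow(p, -1, q)) % n

def factor_via_roots(p, q, rng=random.Random(0)):
    """Recover a factor of n = p*q using the square-root capability."""
    n = p * q
    while True:
        x = rng.randrange(1, n)
        g = math.gcd(x, n)
        if g != 1:
            return g                      # lucky draw: x shares a factor with n
        r = sqrt_oracle(x * x % n, p, q)
        if r != x and r != n - x:         # happens with probability 1/2
            return math.gcd(x - r, n)     # guaranteed non-trivial factor

assert factor_via_roots(347, 359) in (347, 359)
```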
RSA is a well known encryption algorithm which relies on the difficulty of factoring numbers of the
form n = pq. It has not been proven, however, that breaking RSA encryption actually requires the
ability to factor a number.[5] There are encryption schemes, however, which are provably as secure as
the modular square root problem – and hence as secure as factoring. One such procedure is the Rabin
Public Key Cryptosystem. In this scheme, a private key of two large primes p and q is generated. The
public key n is then posted. To encrypt a message m into ciphertext c, the following is performed: c = m^2 (mod n). Since the private key contains knowledge of the factorization of n, it can then be used to compute the four square roots of c, one of which will be m.[5]
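A toy sketch of the scheme follows, with p, q ≡ 3 (mod 4) so that square roots modulo each prime are easy to compute; the parameters are illustrative only, far below cryptographic size:

```python
def rabin_encrypt(m, n):
    """Rabin encryption: c = m^2 mod n for public key n = p*q."""
    return m * m % n

def rabin_decrypt(c, p, q):
    """Return all four square roots of c mod p*q; the plaintext is one of them.

    Assumes p = q = 3 (mod 4), so roots mod each prime are c^((p+1)/4),
    and combines each sign choice with the Chinese Remainder Theorem.
    """
    n = p * q
    rp = pow(c, (p + 1) // 4, p)
    rq = pow(c, (q + 1) // 4, q)
    roots = set()
    for sp in (rp, p - rp):
        for sq in (rq, q - rq):
            roots.add((sp * q * pow(q, -1, p) + sq * p * pow(p, -1, q)) % n)
    return roots

p, q = 347, 359
m = 4242
assert m in rabin_decrypt(rabin_encrypt(m, p * q), p, q)
```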
While the generic problem of factoring is believed to be hard, its difficulty can depend on the
particular values of p and q which form the prime decomposition. In particular, a notion of ‘strong’
primes has been proposed where p and q are considered strong if p − 1 and q − 1 have large prime
factors. Such primes are more difficult for certain factoring algorithms to contend with, though the
best modern algorithms are not significantly impacted by such conditions.[9] Nonetheless, the use of
strong primes is required by ANSI standards in an effort to create a system which is as secure as
possible. One way to ensure that the strong prime criterion is met is to pick p and q such that
p = 2r + 1 and q = 2t + 1, where r and t are also prime. This project will restrict its analysis to
primes of this form since they represent the most difficult class of numbers to handle and therefore
give a better indication of the cryptographic significance of any solutions encountered.
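The p = 2r + 1 condition is straightforward to check by direct primality testing. A minimal sketch (trial division suffices at the scales used in this project; the helper names are illustrative):

```python
def is_prime(n):
    """Trial division; adequate for the small moduli used in this project."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_safe_prime(p):
    """p qualifies when p = 2r + 1 with both p and r = (p - 1) / 2 prime."""
    return p % 2 == 1 and is_prime(p) and is_prime((p - 1) // 2)

assert is_safe_prime(23)        # 23 = 2 * 11 + 1 with 11 prime
assert not is_safe_prime(13)    # 13 is prime but (13 - 1) / 2 = 6 is not
```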
3 Machine Learning Background
Although it is generally believed to be both an important and difficult problem, it appears that no one has yet published an attempt to solve quadratic residue detection using
modern machine learning (ML) techniques. This project sought to remedy the apparent shortcoming
in the contemporary literature by providing evidence for the potential tractability of quadratic residue
detection within a ML context.
Artificial Neural Networks (ANNs) have garnered a great deal of attention in the literature recently
for their achievements in a variety of domains. In the field of game AI, ANNs have been able to
successfully learn to play Chess[10] and recently even the game Go[11] – a feat which had previously
been considered well beyond the capacity of modern machine learning. In such applications it appears
that these neural networks are able to collapse what would otherwise be an exponentially large search
tree down into a polynomial–time approximation function.
ANNs, and in particular Convolutional Neural Networks (CNNs), have also been extremely successful in pattern recognition. Their stunning success in the image processing domain, in particular
on the well known database ImageNet, has inspired a great deal of interest in CNN development.[12] In
fact, since its publication in 2012 this early proof of CNN effectiveness has been cited over 5000 times.
What is perhaps most surprising, however, is that CNNs have been shown recently to support transfer
learning – the ability to train a CNN on one dataset and then use it on another with very minimal
retraining required.[13][14][15] Such networks are so effective that they can out–perform systems which
have been developed and tuned for years on a specialized problem set.[15] This power comes from
the fact that many of the hidden layers within a CNN serve to perform generic feature extraction –
removing the need for humans to generate creative data representations in order to solve a problem.
Once this feature extractor has been created, the final layer or two of a CNN is all that needs to be
adjusted in order to move between different application areas – a process which requires relatively
little time and data. It was hoped that this ability to transfer learning between problem sets would
help to create a network able to detect residues for any number n with minimal retraining required.
Unfortunately the inputs to a CNN must be meaningfully related to one another, for example having
pixel 1 be above and to the left of pixel 3, in order for patterns derived from the relative positions
of those inputs to be meaningful. While it is not clear how to satisfy such a constraint when dealing
with a number theory problem, the principles and effectiveness of the transfer learning paradigm were
still of interest in this study.
The success of ANNs is not without theoretical precedent. The universality theorem, a well known
result in machine learning, suggests that neural networks with even a single hidden layer may be used
to approximate any continuous function to arbitrary accuracy (given certain constraints on network
topology).[16] Based on this premise, if there were a continuous function capable of testing quadratic
residues, it ought to be possible to mimic it using a neural network. While such a function, if one
exists, might well be discontinuous and therefore beyond the purview of the theorem, there is still
hope that a continuous approximation might suffice.
While the theoretical results are interesting, one could argue that such a network would be im-
possible to train in practice. This was indeed at least a partial motivation behind the deep networks
that have become popular in modern literature. There are, however, concrete results suggesting that
shallow networks really are capable of learning the same behavior as their deeper counterparts.[17]
In their work, Ba and Caruana are able to train shallow networks very effectively by leveraging the
outputs produced by a deep network for ground truth scoring as opposed to using the original supervised labeling. While this is not immediately useful since it still requires a trained deep network, it
suggests that if such a deep network can be developed it could potentially then be simplified in order
to improve performance.
4 Data Generation
There are two significant factors which may influence the difficulty of quadratic residue detection.
The first is the composition of the modular base being considered. This base could be prime, in
which case a polynomial–time solution is known, or the product of multiple primes, in which case a
polynomial–time solution is unknown. The number of prime factors composing the base controls both
the number of quadratic residues which will be present in the modular ring, and also how many roots
each residue will have.
The second factor which was expected to contribute to the difficulty of the problem is the size of
the modular base. As the length of this number increases, more and more elements are added to the
modular space. This could make it more difficult for the system to learn any underlying patterns in
the data, but also allows for an increased number of potential training instances. Cryptographically
relevant numbers contain hundreds of digits, so even if machine learning is capable of discriminating
with high accuracy on small numbers it might not pose a threat to modern encryption schemes.
For each base, n, training and testing data are required. For small bases the complete set of
residues and non–residues can be easily enumerated and randomly sampled. In order to generate data
for the larger bases, random numbers were selected from Zn. These numbers were squared mod n
in order to acquire values which are guaranteed to be quadratic residues. In order to acquire non–
residues, random numbers were drawn from Zn until a non–residue was found. This check is possible in
polynomial–time since the prime factorization for each base is known. Non–residues were only included
into the dataset if they had Jacobi symbol equal to 1, since said symbol already provides a method to
discriminate between residues and non–residues having the alternative sign. The testing datasets were
comprised of 50% residues and 50% non–residues, meaning that any system which can deviate from
50% accuracy by a notable margin is at least a partial success. Training data was typically presented
to the networks in batches, with 50% of said data being residues. An alternative training method was
also investigated wherein the batches were biased to a 75/25 split which alternated between favoring
residues and non–residues each iteration. As will be discussed in Section 5, this alternative data
presentation scheme did not have any substantive impact on performance.
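The generation procedure above can be sketched as follows. This is an illustrative helper (names are mine, not the project's): residues come from squaring random units mod n, while non–residues are rejection-sampled and kept only when both Legendre symbols are −1, so that the Jacobi symbol (a/n) equals 1:

```python
import random
from math import gcd

def make_dataset(p, q, count, rng=random.Random(42)):
    """Sample `count` residues and `count` Jacobi-symbol-1 non-residues mod n = p*q.

    Residue status is checked with Euler's criterion modulo each prime
    factor, which is possible only because the factorization of n is known
    to the data generator.
    """
    n = p * q
    residues, nonresidues = [], []
    while len(residues) < count:
        x = rng.randrange(2, n)
        if gcd(x, n) == 1:
            residues.append(x * x % n)    # guaranteed quadratic residue
    while len(nonresidues) < count:
        a = rng.randrange(2, n)
        if gcd(a, n) != 1:
            continue
        # Both Legendre symbols -1: a non-residue whose Jacobi symbol is 1
        if pow(a, (p - 1) // 2, p) == p - 1 and pow(a, (q - 1) // 2, q) == q - 1:
            nonresidues.append(a)
    return residues, nonresidues

res, non = make_dataset(347, 359, 100)
```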
5 Experimental Results
This project examined the behavior of ANN architectures using data generated as described in Section
4. The networks considered two integers b and n corresponding to the question: “Is b a quadratic
residue of n?” There are several important considerations when building a network to attempt to
answer this question. First and foremost is what feature representations should be used in order
to attempt to find the solution. Different network architectures may also be more or less effective at
finding a viable solution. Before taking on the challenging task of detecting quadratic residues relative
to a composite modular base, experiments were completed on a prime modular base. For each case,
a 50/50 random guess procedure was run on the testing data 10 times. The resulting accuracy along
with standard deviation was recorded so that the neural network’s performance can be compared to
see whether it is significantly better than what one would expect from random chance.
5.1 Primes
The first network architecture investigated was a MultiLayer Perceptron (MLP) with 4 hidden layers,
each containing 100 neurons employing hyperbolic tangent activation functions. The network also employs a dropout layer before the softmax output layer in an effort to combat over–fitting. A variety
of input feature combinations were then explored in the search for potentially useful inputs. As
discussed in Section 1, the quadratic residue problem for prime bases can be solved explicitly using
the Jacobi symbol, so this symbol value was not considered as a potential input. For initial tests, all
residues and non–residues from 1 to n were taken for the data set, with 80% being assigned to training
data and the remaining 20% held out for testing. A cryptographically relevant system would need to
be capable of learning off of a small subset of this data, but that is not necessary for feature viability
exploration.
For each feature set, three networks were independently initialized and trained in order to study
the sensitivity of the final system to variations in starting configuration. An n value of 100043 was
selected since it is a strong prime (see Section 2) which is large enough to allow for plentiful training
instances but not so large as to make network training times prohibitively long. The first input scheme
investigated was simply to supply the network with the binary representation of b and n in the hopes
that it would naturally extract its own features. This representation was not successful, with results
summarized in Table 1.
Table 1: ANN performance after 800,000 training iterations given binary representation of b and n
Trial   Expected Accuracy   Training Accuracy   Testing Accuracy
1       49.91±0.47%         94.53%              50.05%
2       50.28±0.54%         90.63%              50.22%
3       50.00±0.44%         96.48%              49.96%
As the results in Table 1 show, the neural network training accuracy was within one standard
deviation of random guessing in all instances. This was despite a very high training accuracy –
indicating that the system essentially memorized the training data rather than learning a function
useful for handling unseen data.
The next representation investigated was based on features inspired by the Jacobi Symbol algo-
rithm. Using variable names consistent with those outlined in Algorithm 3, these features included e
(mod 2), n (mod 8), n (mod 4), a1 (mod 4), and n (mod a1) – each normalized to the range [0,1].
While these features do not capture the recursive behavior of the algorithm, it was hoped that they
would provide at least marginal improvement over random guessing. The results are given in Table 2.
Table 2: ANN performance after 800,000 training iterations given a subset of Jacobi features
Trial   Expected Accuracy   Training Accuracy   Testing Accuracy
1       50.09±0.34%         46.88%              50.01%
2       50.01±0.45%         46.88%              50.06%
3       49.91±0.19%         51.56%              49.75%
As Table 2 indicates, using only a handful of features from the Jacobi algorithm was ineffective,
with testing accuracies being within one standard deviation of random guessing in all cases. Unlike
the first experiment, here the training accuracies were also held close to random chance, with only
Initialization 3 managing to move notably outside what would be expected by random guessing. While
the results of this test were not encouraging from a testing accuracy perspective, the data was at least
not over–fit. The next step was therefore to add all comparisons employed by the full recursive
Jacobi algorithm to the feature set in order to provide the network with more opportunity for rule
generation. This is possible since the maximum possible recursive depth of the algorithm is ⌈log2(n)⌉. Since five features are provided per iteration, the total feature count is still polynomial in the length
of the input, and therefore represents a cryptographically viable feature space. One consideration
when implementing this feature set is that normally the recursive algorithm would terminate after a1
reaches a value of 1 or 0. Since all feature vectors to the neural network must be the same length,
once a1 reaches one of these termination values the feature vector [0,1,2,2,1] is appended until the
necessary number of iterations has been completed. These values were chosen since they would leave
the output of the true Jacobi algorithm unchanged if they were to appear during one of its iterations.
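A sketch of how such a padded feature vector might be assembled follows (the function name and layout are mine; the project additionally normalized each value to the range [0, 1]):

```python
from math import ceil, log2

def jacobi_features(a, n):
    """Comparators used at each level of the recursive Jacobi algorithm,
    padded to depth ceil(log2(n)) with [0, 1, 2, 2, 1], values which would
    leave the true algorithm's output unchanged. Values are unnormalized."""
    depth = ceil(log2(n))
    features = []
    for _ in range(depth):
        if a in (0, 1):                   # recursion has terminated: pad
            features.extend([0, 1, 2, 2, 1])
            continue
        e, a1 = 0, a
        while a1 % 2 == 0:                # write a = 2^e * a1 with a1 odd
            a1 //= 2
            e += 1
        features.extend([e % 2, n % 8, n % 4, a1 % 4, n % a1])
        a, n = n % a1, a1                 # arguments of the recursive call
    return features

v = jacobi_features(12345, 100043)
assert len(v) == 5 * 17                   # 100043 has binary length 17
```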
This feature set was very effective, as demonstrated in Table 3.
Table 3: ANN performance on n = 100043 given all comparisons used in the Jacobi algorithm
Trial   Iterations   Expected Accuracy   Training Accuracy   Testing Accuracy
1       169,800      50.05±0.29%         100.00%             99.01%
2       170,900      49.97±0.42%         100.00%             98.45%
3       372,200      50.02±0.34%         100.00%             96.45%
The neural network training algorithm was set to terminate after achieving a training accuracy
of 100% for 10 checkpoints in a row; checkpoints being evaluated every 100 iterations. Thus Table 3
can also be used to study the speed and reliability of convergence. All three networks demonstrated
the ability to effectively mimic the Jacobi algorithm given only the modular comparators it is based
on, reaching perfect training accuracy and near perfect testing accuracy. Although all three networks
were eventually successful, the number of iterations required to learn the rules varied widely based on
starting configuration; Trial 3 taking more than twice as many iterations to converge as Trial 1. This
suggests that testing multiple network initializations may be highly important when transitioning to
the more complex case of composite numbers. It is also worth noting that the more iterations the
network required to converge, the lower its testing accuracy was. This may indicate that poorly initialized networks must develop overly complicated decision rules in order to achieve high performance,
in turn introducing greater over–fitting.
A natural question is whether a system trained for quadratic residue detection on one number
may be used on another with little to no retraining. To answer that question, three new prime values
were considered. Since the binary length of n determines the number of features for this particular
representation, new values of n need to have the same binary length. 100043 has a binary length of
17, meaning that the new values of n to be considered must also have binary length 17. The smallest
such safe prime is 65543. The largest is 130787. The value 98207 was also tested since it is roughly
half way through the permissible range of numbers, and is closer to the value on which the networks
were originally trained. These tests are summarized in Table 4.
Table 4: ANNs retrained for different n
Network ID   n        Initial Testing Accuracy   Training Iterations   Final Testing Accuracy
1            65543    61.61%                     9,600                 99.46%
1            98207    56.09%                     13,200                99.30%
1            130787   98.47%                     41,000                99.21%
2            65543    64.70%                     17,500                99.34%
2            98207    61.93%                     29,300                98.71%
2            130787   97.54%                     41,000                98.96%
3            65543    62.72%                     23,900                98.76%
3            98207    61.32%                     30,500                98.05%
3            130787   95.02%                     103,800               96.62%
Although it took a fair number of iterations in order to meet the stopping criteria outlined above,
Network 1 was able to achieve > 98% testing accuracy after only 2000 iterations for all values of n
investigated. Even the less favorable full iteration counts are in all but one case over an order of
magnitude smaller than those seen during the first training. The initial accuracy penalty faced by the
system appears to be due to the fact that the feature space was normalized by n in order to bound
the features between zero and one; a process which was found to help prevent numerical instability
during training. Unfortunately this means that the decision thresholds must be modified to account
for the changing size of 1/n. This interpretation is supported by the fact that larger values of n take only a small hit to accuracy even without retraining, whereas if n is smaller than the value originally trained on then 1/n will be larger, thereby causing whatever threshold the neural network set to model the Jacobi stopping criterion to be exceeded. In order to verify this hypothesis, the feature space was
modified so that instead of encoding n (mod a1) directly it was converted to a boolean expression
testing whether or not n (mod a1) was equal to 1. While this limits the potential flexibility of the
model, it also makes the system much more scalable by removing precision concerns introduced by
the need to normalize by n. Initial training results are given in Table 5.
Table 5: ANN performance on n = 1000667 using a boolean test rather than normalized feature
Trial   Iterations to 98% Testing Accuracy   Expected Accuracy   Final Training Accuracy   Final Testing Accuracy
1       1,081,000                            50.01±0.11%         100.00%                   99.93%
2       1,177,000                            50.05±0.08%         99.45%                    99.71%
3       1,728,000                            49.99±0.11%         98.62%                    98.02%
Whereas the system with normalized features required roughly 1.6n training iterations in order to converge to a satisfactory solution, the system which replaced the 1/n normalization with a boolean feature required only 1.3n training iterations, despite dealing with a much larger value of n. While this performance gain
is not enormous, the real advantage comes during re-training which is demonstrated in Table 6.
Table 6: ANNs retrained for different n
Network ID   n         Initial Testing Accuracy   Training Iterations to 98% Testing Accuracy   Final Testing Accuracy
1            524387    99.97%                     0                                             99.97%
1            786803    99.93%                     0                                             99.82%
1            1048127   66.43%                     <1,000                                        99.92%
2            524387    99.75%                     0                                             99.86%
2            786803    99.66%                     0                                             99.84%
2            1048127   66.33%                     <1,000                                        99.75%
3            524387    98.70%                     0                                             98.49%
3            786803    98.10%                     0                                             98.64%
3            1048127   64.98%                     22,000                                        98.31%
Interestingly the change in feature representation led to initial performance on smaller n values
being dramatically higher than on larger n values, a reversal from the previously observed pattern.
The networks were, however, able to retrain to these larger n values in very short order. One pat-
tern consistent across both feature representations is that networks which take longer to reach peak
performance also tend to have lower peak performance, as well as requiring longer to retrain. Such
networks are likely arriving at locally optimal solutions which are more complex than the true algorithm, suggesting that it is worthwhile to try multiple initializations when searching for more complex
solutions in the composite case. Furthermore, given the apparent benefits of the boolean approach
over normalizing by n, it was further investigated in the course of characterizing network performance
on composite bases.
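For reference, the ground truth labels used in the prime–base experiments can be generated directly from Euler's criterion: for an odd prime n, b is a quadratic residue exactly when b^((n−1)/2) ≡ 1 (mod n). The following is a minimal sketch (the function name is illustrative and is not taken from the project's implementation):

```python
def is_qr_mod_prime(b: int, n: int) -> bool:
    """Euler's criterion: for odd prime n and gcd(b, n) == 1,
    b is a quadratic residue mod n iff b^((n-1)/2) == 1 (mod n)."""
    return pow(b, (n - 1) // 2, n) == 1

# Sanity check on the small prime 23: the residues are exactly the
# squares of the units modulo 23.
squares_mod_23 = {b * b % 23 for b in range(1, 23)}
assert all(is_qr_mod_prime(b, 23) == (b in squares_mod_23)
           for b in range(1, 23))
```

The same check scales to the n values used above, since the modular exponentiation costs only O(log n) multiplications.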
5.2 Composites
5.2.1 Jacobi Method
While the Jacobi symbol is an effective mechanism for differentiating quadratic residues from non–
residues in a prime modular base, it is insufficient for a composite modular base, providing only a 50%
probability of success if n = pq where p and q are prime. This does not mean, however, that a neural
network trained with the modular discriminants used by the Jacobi symbol algorithm will necessarily
fail. With that in mind, the inputs which led to the results summarized in Table 3 were employed to
train new networks for the composite case given the base 347 ∗ 359 = 124573. For this experiment the
dataset consisted exclusively of residues and non–residues with Jacobi symbol equal to one, meaning
that simply mimicking the Jacobi algorithm would not be of any help in solving the problem. The
results of this investigation are provided in Table 7.
Table 7: ANN performance on n = 124573 given all comparisons used in the Jacobi algorithm

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 101,700 | 50.17±0.35% | 93.33% | 65.75%
2 | 105,100 | 49.99±0.29% | 96.89% | 66.13%
3 | 120,100 | 50.04±0.45% | 92.00% | 65.70%
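For context, the features in question derive from the standard binary Jacobi symbol algorithm, and the 50% ambiguity for n = pq noted above can be seen directly on a small example (p = 7 and q = 11 here for speed; the experiments used 347 * 359). This sketch is illustrative and independent of the project's code:

```python
from math import gcd

def jacobi(a: int, n: int) -> int:
    """Standard binary algorithm for the Jacobi symbol (a/n); n odd, positive."""
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:            # factor out 2s using the (2/n) rule
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                  # quadratic reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

# For n = pq, exactly half of the units with Jacobi symbol 1 are true residues,
# so the symbol alone gives only a coin-flip answer on this subset.
n = 7 * 11
units = [b for b in range(1, n) if gcd(b, n) == 1]
squares = {b * b % n for b in units}
jacobi_one = [b for b in units if jacobi(b, n) == 1]
assert sum(b in squares for b in jacobi_one) == len(jacobi_one) // 2
```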
Interestingly, the neural networks were each able to exceed random chance guessing by a significant
margin. While the difference between training and testing accuracy indicates that some over–fitting
is occurring, it appears that there may be some previously unknown pattern in the data that is also
being discovered. Exchanging the normalized feature n (mod a1)/n for the boolean n (mod a1) == 1, as
was described in Tables 5 and 6, shows that the possible loss of flexibility does not result in a large
decrease in testing accuracy. It does, however, lead to a drop in training accuracy, which is desirable
since it means the network is less prone to simply memorizing its input data. These results are
summarized in Table 8, and led to the
adoption of the boolean feature in future experiments.
Table 8: ANN performance on n = 124573 given boolean test value rather than normalized mod

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 116,700 | 49.95±0.31% | 88.00% | 65.77%
2 | 146,000 | 49.88±0.25% | 86.22% | 64.58%
3 | 103,100 | 49.88±0.36% | 88.89% | 64.66%
In an effort to determine whether final accuracy is being impacted by network architecture, the
experiment was re–run replacing the hyperbolic tangent functions previously used throughout the
hidden layer with RELU–6 activation functions. The resulting network performance is given in Table
9.
Table 9: ANN performance on n = 124573 using the RELU–6 activation function.

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 44,100 | 50.04±0.24% | 83.56% | 63.88%
2 | 27,300 | 50.00±0.44% | 84.89% | 64.67%
3 | 140,400 | 49.96±0.32% | 71.56% | 61.39%
It appears that despite the recent popularity of the RELU activation function in image processing
applications, the hyperbolic tangent function is more suitable for this particular application, providing
on average an absolute gain of 2.5% testing accuracy. The RELU networks did, however, train
dramatically faster than the hyperbolic tangent networks. The third trial was an exception, taking a
very long time to reach 60% accuracy. This appears to be because it became stuck in a very
unfavorable local minimum, causing 60% to be close to its asymptotic final performance level.
If training speed is desired over final performance, initializing a large number of RELU networks and
keeping the best one may be the optimal choice.
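For readers unfamiliar with it, RELU–6 is simply the standard rectifier clipped above at 6, which bounds hidden activations; the two activations compared here can be sketched in a few lines (illustrative code, not the project's TensorFlow graph):

```python
import math

def relu6(x: float) -> float:
    """RELU-6: the standard rectifier, clipped above at 6."""
    return min(max(0.0, x), 6.0)

def tanh_act(x: float) -> float:
    """Hyperbolic tangent activation, bounded to (-1, 1)."""
    return math.tanh(x)

# The clipped rectifier is piecewise linear, which tends to give larger
# gradients early in training than the saturating tanh.
assert relu6(-2.0) == 0.0 and relu6(3.0) == 3.0 and relu6(10.0) == 6.0
```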
Another key parameter of network architecture is hidden layer count. To explore the impact of
this variable on performance, the number of hidden layers in the network was increased from 4 to 5.
The results are given in Table 10.
Table 10: ANN performance on n = 124573 using 5 fully connected hidden layers

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 85,000 | 49.79±0.41% | 96.00% | 65.46%
2 | 67,200 | 49.93±0.25% | 94.22% | 66.48%
3 | 77,500 | 49.82±0.39% | 95.11% | 65.18%
Compared to the four layer networks from Table 7, there was no improvement gained by adding the
fifth layer. In fact, the average absolute testing error actually worsened by 0.15%, though this is
within random noise fluctuations. It therefore appears that whatever pattern is being detected is not
being inhibited by network depth. Since each hidden layer contains 100 neurons, the addition of the
extra hidden layer added as many neurons to the system. An alternative way to add 100 neurons is
to retain the depth 4 structure while boosting the per–layer neuron count to 125. Doing this led to
the results in Table 11.
Table 11: ANN performance on n = 124573 using 125 neurons per layer

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 125,300 | 50.01±0.42% | 95.56% | 66.34%
2 | 123,500 | 50.08±0.26% | 96.44% | 66.27%
3 | 135,000 | 49.97±0.32% | 96.44% | 66.57%
While the average testing accuracy of the 125 neuron–per–layer system is greater than the 100–
per–layer system by 0.53%, the larger networks take more than 2 hours longer to train. Given that
the additional benefits provided by the larger system are so limited that they may even be an artifact
of random noise, the smaller system appears better suited to the problem.
Although the Jacobi symbol itself should be of no use here, it could still be the case that networks
which have been trained for that purpose might provide a better starting point than random initializa-
tion for differentiating quadratic residues. To determine whether this is indeed the case, the networks
from Table 3 were used as the starting points for training the new quadratic residue detection systems.
The results of this experiment are given in Table 12.
Table 12: ANN performance on n = 124573 initialized from Jacobi networks

Network ID | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 39,700 | 50.01±0.42% | 93.78% | 65.25%
2 | 40,600 | 50.08±0.26% | 93.33% | 65.17%
3 | 24,200 | 49.97±0.32% | 93.33% | 64.84%
Although the pre-initialized networks did not attain a higher testing accuracy than those which
were randomly initialized, they did approach this value 3 times faster, requiring approximately 74,000
fewer iterations to reliably exceed 60%. Although the network training started out much faster based
on this pre–training, hundreds of thousands of iterations were still required to fully re–train the
network. This is weak evidence that whatever data property is being exploited by the system is
not directly connected to the decisions made in the Jacobi algorithm, though something about the
architecture of networks which have learned that algorithm is clearly valuable.
One obvious problem with these approaches is that the number of training iterations needed to
achieve better than random guessing is larger than the value of n being tested. In order to be useful
for cryptographic applications, a neural network would need to be trained on one value of n and then
be re–tasked to different n values with minimal retraining. To determine whether this is feasible,
the networks from Table 7 were retrained on an n value of 263 * 479 = 125977. The outcome is recorded
in Table 13.
Table 13: ANN performance on n = 125977 transferred from pre–trained networks

Network ID | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 14,500 | 49.92±0.37% | 90.22% | 64.23%
2 | 17,300 | 50.03±0.35% | 88.00% | 64.07%
3 | 16,200 | 49.85±0.32% | 93.33% | 64.68%
Although the final testing accuracy of the transfer learning system was no better than the results
from Table 7, the training accuracy did increase towards this asymptote much more rapidly. Whereas
the original training required an average of 108966 iterations to achieve 60% testing accuracy, the
retrained system was able to achieve this accuracy after an average of only 16000 iterations: nearly 7
times faster. This process is still slower than simply factoring n, but if the underlying pattern being
exploited by the network can be better understood, it may be possible to either extract an explicit
algorithm or else design networks which retrain more quickly in the future.
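For comparison, the "simply factor n" baseline mentioned above is direct: trial–divide to recover p and q, then apply Euler's criterion modulo each factor, since b (coprime to n) is a residue mod pq exactly when it is a residue mod both primes. A sketch with illustrative names:

```python
def trial_factor(n: int):
    """Recover (p, q) for odd n = p * q by trial division: O(sqrt(n)) steps."""
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("no nontrivial odd factor found")

def is_qr_composite(b: int, n: int) -> bool:
    """For b coprime to n = pq, b is a quadratic residue mod n iff it is a
    residue mod p and mod q, each checked via Euler's criterion."""
    p, q = trial_factor(n)
    return pow(b, (p - 1) // 2, p) == 1 and pow(b, (q - 1) // 2, q) == 1

# The n = 124573 case used throughout this section factors instantly.
assert trial_factor(124573) == (347, 359)
```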
Finally, additional features can also be added on top of the Jacobi algorithm comparators in an
effort to boost performance. Since it is not yet known how the neural networks are achieving their
improved performances, it is not clear what features should be added to arrive at an optimal system.
Testing a variety of modular comparators at random, however, reveals that computing n (mod 7) at
each stage of recursion has a substantive impact on accuracy. These improvements are summarized
in Table 14.
Table 14: ANN performance on n = 124573 with additional feature: n (mod 7)

Trial | Iterations to 60% Testing Accuracy | Expected Accuracy | Training Accuracy | Testing Accuracy
1 | 54,800 | 49.79±0.41% | 99.56% | 70.71%
2 | 65,800 | 50.05±0.47% | 98.22% | 70.10%
3 | 56,700 | 50.11±0.36% | 99.11% | 71.17%
Adding a single feature, n (mod 7), at each level of recursion led to an average testing accuracy
increase of 4.77%, significantly beyond the influence of random noise variations for the system. It also
cut the required training iterations to reach 60% accuracy nearly in half. Other features were also
investigated, computing n and a against modular bases of 2, 3, 6, 9, 11, 13, and 17. None of these
individually yielded gains as significant as did the introduction of n (mod 7), though in combination
these features pushed the testing accuracy up to 75.02%. It is hoped that other as–of–yet unexplored
features might be able to bring testing accuracies to 100%.
In preparation for testing on a wider variety of n values, the network architecture was modified so
that each hidden layer contains a number of neurons equal to the number of input features. For the
values of n previously discussed this means 102 neurons, an insignificant change given that an increase
to 125 neurons was already shown to have very little impact. Performance for these various n values
is summarized in Table 15.
Table 15: ANN performance for various n

n | Factors | Iterations to 60% Testing Accuracy | Final Training Accuracy | Final Testing Accuracy
124,573 | 347 * 359 | 54,800 | 99.56% | 70.71%
125,321 | 7 * 17903 | <100 | 100.00% | 100.00%
126,109 | 23 * 5483 | 49,100 | 99.56% | 71.72%
603,241 | 719 * 839 | 130,500 | 92.04% | 80.13%
848,329 | 863 * 983 | 460,000 | 86.35% | 79.29%
854,941 | 839 * 1019 | 1,530,000 | 79.51% | 62.81%
995,893 | 839 * 1187 | 1,857,000 | 71.90% | 61.15%
1,076,437 | 839 * 1283 | 2,239,000 | 71.07% | 61.82%
1,307,377 | 1019 * 1283 | >3,262,000 | 48.80% | 50.67%
1,551,937 | 1019 * 1523 | >4,821,100 | 68.14% | 57.38%
There are several interesting phenomena captured in Table 15. First, there was a network which
achieved 100% performance, and did so extremely quickly. It is believed that this high level of
performance was achieved because one of the factors of n was 7, and the data mod 7 is one of
the feature values computed along the way. To see whether small factors in general lead to high
performance, an n value was tested having 23 as one of its factors. This network behaved virtually
identically to the network trained on an n with two factors around 350, beating out the larger–factor
network by only 1%. While this might have indicated a weak pattern of decline proportional to
factor size, a network with factors of 719 and 839 managed a testing accuracy of over 80%. It’s not
clear why the performance jumped so high for this value of n, but it does indicate that the system
is able to handle at least certain subsets of larger factors. On the other hand, system performance
took a precipitous hit on larger numbers, to the point where it could do barely better than random
guessing for n = 1307377. These results are complicated by the fact that the n values above 603241
were trained using a different implementation of the system which generated data on the fly in order
to reduce memory consumption. When that same technique was applied to n = 1551937 it also
failed; the 57% accuracy was instead achieved by a highly memory–intensive implementation. This
may indicate that the decrease in performance observed for the largest n values is a byproduct of
implementation difficulties rather than representative of any trend inherent to the problem. Larger n
values of 32188213, 860755297, and 25840758901 were also tested, but they had to be terminated due
to resource constraints before any significant progress was made.
This brings to the forefront a significant drawback of these results: even if the observed ability to
achieve greater than 50% accuracy holds for cryptographically sized values of n, it would be completely
infeasible to train such networks. Furthermore, experiments providing the neural networks with only
a small (logarithmic or square root) subset of the available data for training consistently failed. This
could be evidence that, when successful, the networks are only memorizing some pattern unique to
each value of n rather than learning a generalizable set of rules for detection. On the other hand, if
that were the case then pre–training a network on one n value should not have improved the learning
rate when transferring to other n values, as it did in Tables 12 and 13. Unfortunately, efforts to
demonstrate transferability of learning amongst the larger n values failed, though the mechanism of
failure is highly indicative of an implementation error. When attempting to re–train networks which
had previously been trained for n = 603241, n = 854941, and n = 1076437, all attempts failed to even
move beyond a training accuracy of 50%, let alone having any impact on testing performance. This
is what one would expect to see if the step size employed on each training iteration was forced to be
very small, a potential byproduct of the way in which the TensorFlow library handles reloading of
saved networks. Finding a work–around for this problem and further investigating the transferability
of systems at large n values would provide useful insights for future work seeking to discover how
these networks are able to generate such unexpectedly high accuracies.
5.2.2 Modular Exponentiation Method
A fundamentally different approach to quadratic residue detection is to compute b^x (mod n) for various
values of x and use the resulting values as features. Such features are inspired by the fact that for
prime n, b^((n−1)/2) (mod n) perfectly separates quadratic residues from non–residues. Unfortunately it
appears that not all possible powers x contain information useful for solving the quadratic residue
problem. For example, for n = 124573 suppose one were to select values of x equal to 3 and 897. The
resulting distribution of residues and non–residues would be virtually random, as visualized in Figure
1.
Figure 1: Apparently Random Distribution of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^3 (mod n), y axis b^897 (mod n), n = 124573.]
There are, however, values of x for which the data becomes obviously separable. Recall that n = pq
and that p = 2r + 1 and q = 2t+ 1. Then for odd multiples of rt, perfect separation of residues from
non–residues can be observed. This situation is depicted in Figure 2.
Figure 2: Perfect Separation of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^1 (mod n), y axis b^30967 (mod n), n = 124573.]
While such a case is simple enough so as to not require a neural network at all, it is very unlikely
that one would find an odd multiple of rt by sampling at random from the number line. As it turns
out, however, there are other values of x which are almost as useful for solving the problem. For
example, if both rt + 1 and 1 are in the feature space, then the residues may still be easily separated
from non–residues, as depicted in Figure 3.
Figure 3: Easy Separation of Residues (Green) and Non–Residues (Red). [Scatter plot: x axis b^1 (mod n), y axis b^30968 (mod n), n = 124573.]
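The separation at odd multiples of rt can be verified directly on a small analogue (p = 23 and q = 47 here, chosen for speed, so r = 11 and t = 23 are both odd and rt = 253): quadratic residues satisfy b^rt ≡ 1 (mod n), while non–residues with Jacobi symbol 1 satisfy b^rt ≡ n − 1, which is exactly the two–band structure visible in Figure 2. A quick sketch:

```python
from math import gcd

p, q = 23, 47                       # small stand-ins for 347 and 359
n = p * q
r, t = (p - 1) // 2, (q - 1) // 2   # both odd here, as in the text
rt = r * t

units = [b for b in range(1, n) if gcd(b, n) == 1]
squares = {b * b % n for b in units}

# b^(rt) mod n equals 1 exactly for the quadratic residues...
assert all((pow(b, rt, n) == 1) == (b in squares) for b in units)

# ...and equals n - 1 for Jacobi-symbol-1 non-residues, i.e. those that are
# non-residues modulo both p and q.
both_nonres = [b for b in units
               if pow(b, r, p) == p - 1 and pow(b, t, q) == q - 1]
assert all(pow(b, rt, n) == n - 1 for b in both_nonres)
```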
In general, features of the form b^x (mod n) appear to be useful to the neural network if and only
if x is of the form m·r or m·t for m ∈ Z+. Features of the form m·r + y and m·t + y can also be
useful when paired with y. If no features of those forms are provided then the network is held around
50% testing accuracy. The question then becomes whether finding these useful values of x is easier
than simply factoring n and then using that knowledge to compute the quadratic residue status of a
number. The following analysis demonstrates that it is unlikely to be asymptotically easier if one is
restricted to randomly sampling values from the number line:
n = pq
n = (2r + 1)(2t + 1)
n = 4rt + 2r + 2t + 1
n = 4rt + 2(r + t) + 1
4rt = n − 1 − 2(r + t)
rt = (n − 1 − 2(r + t)) / 4
The bounds on rt therefore depend on the possible values of r + t. The smallest such sum would be
achieved if r = t, in which case r + t = (√n − 1)/2 + (√n − 1)/2 = √n − 1. Thus
rt ≤ (n − 1 − 2(√n − 1))/4 = n/4 + 1/4 − √n/2. Likewise the largest possible value of r + t would be
achieved if p and q are as far apart as possible, so p = 3 and q = n/3. In this case
r + t = 1 + (n/3 − 1)/2 = (n + 3)/6, which means that rt ≥ (n − 1 − 2(n + 3)/6)/4 = n/6. The
initial bounds on rt are therefore:

n/6 ≤ rt ≤ n/4 + 1/4 − √n/2
By employing trial division up to 5 before considering this method, the lower bound may be
raised to n/5. Unfortunately, the lower bound asymptotically approaches n/4 as more trial divisions are
performed, meaning that the gap between the lower and upper bounds cannot be closed. Thus the
space which must be searched in order to encounter rt is O(n), worse than trial division, which is
O(√n).
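These bounds can be sanity–checked numerically for the n = 124573 example (p = 347, q = 359, so rt = 173 * 179 = 30967); since p and q are nearly equal, rt sits almost exactly at the upper bound:

```python
n, p, q = 124573, 347, 359
r, t = (p - 1) // 2, (q - 1) // 2
rt = r * t                            # 173 * 179 = 30967

lower = n / 6                         # bound before any trial division
upper = n / 4 + 1 / 4 - n ** 0.5 / 2  # bound from the r = t case
assert lower <= rt <= upper           # roughly 20762.2 <= 30967 <= 30967.03
```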
On the other hand, if one expands the search space to include the multiples of r and t, then target
instances become much more bountiful. Since n = pq and since p and q are distinct, exactly one of p
or q must be less than √n. Let p be the smaller one, so that r = (p − 1)/2 ≤ (√n − 1)/2 and
consecutive multiples of r are spaced at most (√n − 1)/2 apart. Thus if one samples (√n − 1)/2
contiguous elements anywhere on the number line above √n, it is guaranteed that at least one useful
feature will be encountered. Unfortunately this search space is still O(√n) and thus no better than
trial division. Although neural network testing accuracies of over 99% were obtained using only a
logarithmically sized feature set for numbers up to around 1 million, this solution is not scalable to
cryptographically relevant sizes. As weak evidence against the existence of a better selection scheme,
consider that if it were possible to reliably encounter multiples of r and t, then one could use the
GCD algorithm on these multiples to infer the values of r and t and thereby factor n in polynomial
time.
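The closing observation can be made concrete: if sampling reliably produced several multiples of r, their GCD would reveal r itself, and hence the factor p = 2r + 1. A toy illustration (n = 1081 = 23 * 47, so r = 11; the "sampled" multiples are of course contrived here):

```python
from functools import reduce
from math import gcd

n, r = 1081, 11                       # n = 23 * 47, r = (23 - 1) / 2
multiples = [3 * r, 11 * r, 16 * r]   # pretend these were sampled: 33, 121, 176
recovered = reduce(gcd, multiples)    # gcd of random multiples of r is r w.h.p.
assert recovered == r
assert n % (2 * recovered + 1) == 0   # p = 2r + 1 = 23 divides n
```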
5.2.3 Alternative Methods
Several other input feature schemes were also considered, none of which saw any concrete improvement
over random guessing. First was the use of b (mod i) and n (mod i) for a logarithmic number of small
values of i ∈ Z+. The hope was that the system would find a pattern based on some previously
unconsidered implication of the Chinese Remainder Theorem, though the neural networks were unable
to make any headway there.
A wide variety of Fibonacci sequence values modulo different bases were also considered, for
example Fib_{b+1} (mod n). These were tested since similar features appear in an unproven primality
testing heuristic and it was hoped that their utility might extend to other number theory problems.
Unfortunately the neural networks were unable to make effective use of these features either.
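Although these features proved unhelpful, computing them is cheap: Fib_{b+1} (mod n) needs only O(log b) modular multiplications via the fast–doubling identities F(2m) = F(m)(2F(m+1) − F(m)) and F(2m+1) = F(m)² + F(m+1)². A sketch (this is a standard technique, not the project's code):

```python
def fib_mod(k: int, n: int) -> int:
    """F(k) mod n via fast doubling: O(log k) modular multiplications."""
    def doubling(m):
        # Returns (F(m) mod n, F(m+1) mod n).
        if m == 0:
            return 0, 1
        a, b = doubling(m >> 1)
        c = a * (2 * b - a) % n        # F(2m) = F(m) * (2F(m+1) - F(m))
        d = (a * a + b * b) % n        # F(2m+1) = F(m)^2 + F(m+1)^2
        return (d, (c + d) % n) if m & 1 else (c, d)
    return doubling(k)[0]

assert fib_mod(10, 1000) == 55        # F(10) = 55
assert fib_mod(12, 7) == 144 % 7      # F(12) = 144
```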
6 Conclusions and Future Work
This work has shown that neural networks are capable of learning to mimic the Jacobi symbol al-
gorithm in a way which is transferable to modular bases on which the network was not trained.
While these networks are less efficient than the standard Jacobi algorithm, and therefore not of any
pragmatic utility, this serves as another example of the ability of neural networks to converge on a
moderately complicated algorithm. This system also provides a launching point for addressing more
complex problems.
Using Jacobi algorithm features to drive machine learning, it appears that a significant improve-
ment over random guessing may be achieved in the problem of quadratic residue detection for compos-
ite numbers of the form n = pq. Such a result challenges the conventional wisdom of the community,
which suggests that quadratic residue detection is as unguessable as a coin toss [7, 8]. There are several
key considerations, however, which restrain the impact of this apparent result. First and foremost is
that cryptographically relevant n values tend to be on the order of 500 to 1000 bits. The numbers
tested during this investigation are only on the order of 17 to 21 bits. It is possible, therefore, that any
patterns detected on the small scale are not present at the larger scale. This would have been investi-
gated in greater detail if not for a second problem: training the neural networks from scratch appears
to require O(n) data instances and iterations. Using only a personal computer, it is not possible to
train a network on such a large set of data due to both memory and time constraints. To fully train a
network on a cryptographically relevant scale would be intractable even on a supercomputer unless a
more efficient convergence mechanism can be constructed. Based on the experiments presented here,
using a deeper network with simpler activation functions could be one way to cut down the training
iteration requirement. Another possible avenue of attack would be to take a random–forest style ap-
proach to the problem, providing numerous neural networks with only a small amount of training and
then taking a majority vote to generate the classification. One potential problem with such a method
is that, if the networks are learning some underlying algorithm for residue detection, their errors may
not be independent and thus the ensemble system may not see improved performance. Moreover, it
may still require O(n) training iterations simply to exceed 50% accuracy, in which case the ensemble
method can offer no advantage.
Given the relatively small size of the numbers investigated during this study, and the resource–
intensive training requirements of the neural networks, a much faster solution to quadratic residue
detection would have been to simply factor n and then directly compute the residue status. It is worth
noting, however, that factoring algorithms have been under development for centuries whereas the work
outlined in this paper has been under development for several weeks. Just as factoring algorithms have
become more and more efficient over time, it is possible that this methodology might be significantly
improved upon in order to decrease resource requirements and further improve accuracy. Over the
course of a relatively small number of experiments, features were discovered which provided an absolute
testing accuracy performance boost of up to 10% on top of what was already provided by the Jacobi
symbol features. If a polynomial–time algorithm for residue status detection exists, neural networks
of this form could allow for an easy way to identify what sort of computations are involved in it.
By studying the impact of different features on network performance, it may be possible to discover
a previously unknown algorithm. This would in turn make the computational resource requirement
of the neural networks a moot point and allow for solutions to even cryptographically sized inputs.
Although neural networks are conventionally treated as black–box systems, there is no reason that
they could not be used as relevant–feature detectors. Future work should therefore focus on the search
for additional features, as well as on performance optimizations so that applicability on larger n may
be investigated.
References
[1] David S. Dummit and Richard M. Foote. Abstract Algebra. English. 3rd ed. Wiley, 2004.
isbn: 978-0-471-43334-7.
[2] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied
Cryptography. English. 5th ed. Waterloo: CRC Press, 2001. isbn: 0-8493-8523-7.
[3] Kenneth H. Rosen. Elementary number theory and its applications. English. Boston:
Addison-Wesley, 2011. isbn: 978-0-321-50031-1 0-321-50031-8.
[4] Douglas R. Stinson. Cryptography Theory and Practice. English. 3rd ed. Waterloo: CRC
Press, 2006. isbn: 978-1-58488-508-5.
[5] Richard A. Mollin. An Introduction to Cryptography. English. 1st ed. New York: CRC
Press, 2001. isbn: 1-58488-127-5.
[6] Tibor Jager and Jörg Schwenk. The Generic Hardness of Subset Membership Problems
under the Factoring Assumption. Cryptology ePrint Archive, Report 2008/482. 2008.
url: http://eprint.iacr.org/.
[7] Ronald L. Rivest. “Advances in Cryptology — ASIACRYPT ’91: International Con-
ference on the Theory and Application of Cryptology Fujiyosida, Japan, November 1991
Proceedings”. In: ed. by Hideki Imai, Ronald L. Rivest, and Tsutomu Matsumoto. Berlin,
Heidelberg: Springer Berlin Heidelberg, 1993. Chap. Cryptography and machine learn-
ing, pp. 427–439. isbn: 978-3-540-48066-2. doi: 10.1007/3-540-57332-1_36. url:
http://dx.doi.org/10.1007/3-540-57332-1_36.
[8] Michael Kearns and Leslie Valiant. “Cryptographic Limitations on Learning Boolean
Formulae and Finite Automata”. In: J. ACM 41.1 (Jan. 1994), pp. 67–95. issn: 0004-5411.
doi: 10.1145/174644.174647. url: http://doi.acm.org/10.1145/174644.174647.
[9] Ron Rivest and Robert Silverman. Are ’Strong’ Primes Needed for RSA. 2001.
[10] Matthew Lai. “Giraffe: Using Deep Reinforcement Learning to Play Chess”. In: CoRR
abs/1509.01549 (2015). url: http://arxiv.org/abs/1509.01549.
[11] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”.
In: Nature 529.7587 (Jan. 2016), pp. 484–489. issn: 0028-0836. url: http://dx.doi.org/10.1038/nature16961.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with
Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing
Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url:
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[13] Jason Yosinski et al. “How transferable are features in deep neural networks?” In: CoRR
abs/1411.1792 (2014). url: http://arxiv.org/abs/1411.1792.
[14] Jeff Donahue et al. “DeCAF: A Deep Convolutional Activation Feature for Generic Visual
Recognition”. In: CoRR abs/1310.1531 (2013). url: http://arxiv.org/abs/1310.1531.
[15] Ali Sharif Razavian et al. “CNN Features off-the-shelf: an Astounding Baseline for Recog-
nition”. In: CoRR abs/1403.6382 (2014). url: http://arxiv.org/abs/1403.6382.
[16] G. Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics
of Control, Signals and Systems 2.4 (1989), pp. 303–314. issn: 1435-568X. doi: 10.1007/BF02551274. url: http://dx.doi.org/10.1007/BF02551274.
[17] Lei Jimmy Ba and Rich Caruana. “Do Deep Nets Really Need to be Deep?” In: CoRR
abs/1312.6184 (2013). url: http://arxiv.org/abs/1312.6184.