7/29/2019 Sublinear Time Algorithm
1/28
A
Seminar Report
On
SUB-LINEAR TIME ALGORITHMS
Submitted in partial fulfillment of the
requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted by
AJAY YADAV
Roll No. - 1012210009
B.TECH (CS-61)
Under the guidance of
Ms. ANITA PAL
Mr. KAMAL KUMAR SRIVASTAVA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SHRI RAMSWAROOP MEMORIAL COLLEGE OF ENGINEERING &
MANAGEMENT, LUCKNOW
FEBRUARY
2013
ABSTRACT
The area of sublinear-time algorithms is a new, rapidly emerging area of computer science. It has
its roots in the study of massive data sets that occur more and more frequently in various
applications. Financial transactions with billions of input data and Internet traffic analyses
(Internet traffic logs, clickstreams, web data) are examples of modern data sets that show
unprecedented scale. Managing and analyzing such data sets forces us to reconsider the
traditional notions of efficient algorithms: processing such massive data sets in more than linear
time is by far too expensive and often even linear time algorithms may be too slow. Hence, there
is the desire to develop algorithms whose running times are not only polynomial, but in fact are
sublinear in n. Constructing a sublinear time algorithm may seem to be an impossible task, since it allows one to read only a small fraction of the input. However, in recent years, we have seen
development of sublinear time algorithms for optimization problems arising in such diverse areas
as graph theory, geometry, algebraic computations, and computer graphics. Initially, the main
research focus has been on designing efficient algorithms in the framework of property testing,
which is an alternative notion of approximation for decision problems. More recently, we have seen
some major progress in sublinear-time algorithms in the classical model of randomized and
approximation algorithms. In this paper, we survey some of the recent advances in this area. Our
main focus is on sublinear-time algorithms for combinatorial problems, especially for graph
problems and optimization problems in metric spaces.
ACKNOWLEDGEMENT
Acknowledgements provide us the opportunity to express our gratitude to the people who make such tasks come out as a success. I take this opportunity to thank Dr. (Mrs.) Vinodini Katiyar
(Head of Department, Computer Science) for her immense support and help. I would also like to
thank my seminar guides Ms. Anita Pal and Mr. Kamal Kumar Srivastava for their incessant and
perpetual cooperation and encouragement which helped me overcome hurdles at various stages.
I would also like to thank my family for providing me with resources, support and
encouragement; this report would not have been possible without you.
AJAY YADAV
List of Symbols and Abbreviations
DFA Deterministic finite automaton
R.E. Regular Expression
MST Minimum Spanning Tree
NP Nondeterministic polynomial time
PTAS Polynomial-time approximation scheme
O Asymptotic upper bound
Ω Asymptotic lower bound
Θ Asymptotic tight bound
< Less than
Σ Summation
ε Distance (proximity) parameter
TABLE OF CONTENTS
Abstract ........................................................ i
Acknowledgement ................................................. ii
List of Symbols, Abbreviations and Nomenclature ................. iii
1. Introduction ................................................. 5
1.1. Sublinear time algorithms .................................. 5
1.2. Property testing ........................................... 6
1.2.1. Definition and variants .................................. 7
1.2.2. Features and limitations ................................. 8
1.3. Data streaming algorithms .................................. 8
2. Related work ................................................. 9
2.1. Randomized algorithms ...................................... 9
2.2. Approximation algorithms ................................... 9
2.3. Time complexity comparisons ................................ 10
3. Design ....................................................... 12
3.1. Algorithm to determine sortedness of a list ................ 12
3.1.1. Algorithm ................................................ 13
3.2. Successor search in a list ................................. 14
3.2.1. Search algorithm ......................................... 14
3.3. Testing membership of a string in a R.E. ................... 15
3.3.1. Algorithm ................................................ 15
3.4. Searching in a sorted list ................................. 16
3.5. Geometry: intersection of two polygons ..................... 17
3.6. Sublinear time algorithms for graph problems ............... 19
3.6.1. Average degree of a graph ................................ 19
3.7. Sublinear time approximation algorithms for problems in metric spaces ... 23
3.7.1. Clustering via random sampling ........................... 25
4. Conclusion ................................................... 26
References ...................................................... 27
1. INTRODUCTION
The concept of sublinear-time algorithms has been known for a very long time, but initially it was
used to denote pseudo-sublinear-time algorithms, where after an appropriate preprocessing, an
algorithm solves the problem in sublinear-time. For example, if we have a set of n numbers, then
after an O(n log n) preprocessing (sorting), we can trivially solve a number of problems
involving the input elements. And so, if after the preprocessing the elements are put in a
sorted array, then in O(1) time we can find the kth smallest element, in O(log n) time we can test
if the input contains a given element x, and also in O(log n) time we can return the number of
elements equal to a given element x. Even though all these results are folklore, this is not what
we call nowadays a sublinear-time algorithm.
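The pseudo-sublinear pattern above — pay O(n log n) once, then answer queries in O(1) or O(log n) — can be sketched in a few lines of Python (the function names are illustrative, not from this report):

```python
import bisect

def preprocess(nums):
    # O(n log n) preprocessing: sort the input once.
    return sorted(nums)

def kth_smallest(arr, k):
    # O(1) after preprocessing (k is 1-indexed).
    return arr[k - 1]

def contains(arr, x):
    # O(log n) membership test via binary search.
    i = bisect.bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

def count_equal(arr, x):
    # O(log n): number of elements equal to x.
    return bisect.bisect_right(arr, x) - bisect.bisect_left(arr, x)
```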
In this Report, my goal is to study algorithms for which the input is taken to be in any standard
representation and with no extra assumptions. Then, an algorithm does not have to read the entire
input but it may determine the output by checking only a subset of the input elements. It is easy
to see that for many natural problems it is impossible to give any reasonable answer if not all or
almost all input elements are checked. But still, for some number of problems we can obtain
good algorithms that do not have to look at the entire input. Typically, these algorithms are
randomized (because most of the problems have a trivial linear-time deterministic lower bound)
and they return only an approximate solution rather than the exact one (because usually, without
looking at the whole input we cannot determine the exact solution). In this survey, we present
recently developed sublinear-time algorithms for some combinatorial optimization problems.
1.1 Sublinear time algorithms:
Sublinear time algorithms are algorithms whose running time is sublinear in the input size: if f(n)
denotes the time complexity, then f(n) = o(n). Here o (little-oh) is defined as follows: for two
functions g1 and g2, we write g1 = o(g2) if and only if for every constant c > 0 there exists n0
such that g1(n) < c*g2(n) for all n > n0. An algorithm is said to run in sub-linear time (often
spelled sublinear time) if T(n) = o(n). In particular this includes algorithms with the time
complexities defined above, as well as others such as the O(√n) Grover's search algorithm.
Typical algorithms that are exact and yet run in sub-linear time use parallel processing (as the
NC1 matrix determinant calculation does), non-classical processing (as Grover's search does), or
alternatively have guaranteed assumptions on the input structure (as the logarithmic time binary
search and many tree maintenance algorithms
do).
Figure 1.1 Comparison between sublinear and linear algorithm
However, languages such as the set of all strings that have a 1-bit indexed by the first log(n) bits
may depend on every bit of the input and yet be computable in sub-linear time.
The specific term sublinear time algorithm is usually reserved for algorithms that are unlike the
above in that they are run over classical serial machine models and are not allowed prior
assumptions on the input. They are however allowed to be randomized, and indeed must be
randomized for all but the most trivial of tasks. As such an algorithm must provide an answer
without reading the entire input, its particulars heavily depend on the access allowed to the input.
Usually for an input that is represented as a binary string b1,...,bk it is assumed that the algorithm
can in time O(1) request and obtain the value of bi for any i.
Sub-linear time algorithms are typically randomized, and provide only approximate solutions. In
fact, the property of a binary string having only zeros (and no ones) can be easily proved not to
be decidable by a (non-approximate) sub-linear time algorithm. Sub-linear time algorithms arise
naturally in the investigation of property testing.
1.2. Property testing:
In broad terms, property testing is the study of the following class of problems:
given the ability to perform (local) queries concerning a particular object (e.g., a function, or a
graph), the task is to determine whether the object has a predetermined (global) property (e.g.,
linearity or bipartiteness), or is far from having the property. The task should be performed by
inspecting only a small (possibly randomly selected) part of the whole object, where a small
probability of failure is allowed.
In order to define a property testing problem, we need to specify the types of queries that can be
performed by the algorithm and a distance measure between objects. The latter is required in
order to define what it means for an object to be far from having the property. We assume that
the algorithm is given a distance parameter ε.
In computer science, a property testing algorithm for a decision problem is an algorithm whose
query complexity to its input is much smaller than the instance size of the problem. Typically
property testing algorithms are used to decide if some mathematical object (such as a graph or a
boolean function) has a "global" property, or is "far" from having this property, using only a
small number of "local" queries to the object. For example, the following promise problem
admits an algorithm whose query complexity is independent of the instance size (for an arbitrary
constant ε > 0):
"Given a graph G on n vertices, decide if G is bipartite, or G cannot be made bipartite even after
removing an arbitrary subset of at most εn² edges of G."
Property testing algorithms are important in the theory of probabilistically checkable proofs.
1.2.1.Defi ni tion and var iants
Formally, a property testing algorithm with query complexity q(n) and proximity parameter ε for
a decision problem L is a randomized algorithm that, on input x (an instance of L) makes at most
q(|x|) queries to x and behaves as follows:
If x is in L, the algorithm accepts x with probability at least 2/3.
If x is ε-far from L, the algorithm rejects x with probability at least 2/3.
Here, "x is ε-far from L" means that the Hamming distance between x and any string in L is at
least ε|x|.
A property testing algorithm is said to have one-sided error if it satisfies the stronger condition
that the accepting probability for instances x in L is 1 instead of 2/3.
A property testing algorithm is said to be non-adaptive if it performs all its queries before it
"observes" any answers to previous queries. Such an algorithm can be viewed as operating in the
following manner. First the algorithm receives its input. Before looking at the input, using its
internal randomness, the algorithm decides which symbols of the input are to be queried. Next,
the algorithm observes these symbols. Finally, without making any additional queries (but
possibly using its randomness), the algorithm decides whether to accept or reject the input.
1.2.2.Features and limitations
The main efficiency parameter of a property testing algorithm is its query complexity, which is
the maximum number of input symbols inspected over all inputs of a given length (and all
random choices made by the algorithm). One is interested in designing algorithms whose query
complexity is as small as possible. In many cases the running time of property testing algorithms
is sublinear in the instance length. Typically, the goal is first to make the query complexity as
small as possible as a function of the instance size n, and then study the dependency on the
proximity parameter ε.
Unlike other complexity-theoretic settings, the asymptotic query complexity of property testing
algorithms is affected dramatically by the representation of instances. For example, when ε =
0.01, the problem of testing bipartiteness of dense graphs (which are represented by their
adjacency matrix) admits an algorithm of constant query complexity. In contrast, sparse graphs
on n vertices (which are represented by their adjacency list) require property testing algorithms of
query complexity Ω(√n).
The query complexity of property testing algorithms grows as the proximity parameter ε becomes
smaller for all non-trivial properties. This dependence on ε is necessary, as a change of fewer than
εn symbols in the input cannot be detected with constant probability using fewer than Ω(1/ε)
queries. Many interesting properties of dense graphs can be tested using query complexity that
depends only on ε and not on the graph size n. However, the query complexity can grow
enormously fast as a function of ε. For example, for a long time the best known algorithm for
testing if a graph does not contain any triangle had a query complexity which is a tower function
of poly(1/ε), and only in 2010 was this improved to a tower function of log(1/ε). One of the
reasons for this enormous growth in bounds is that many of the positive results for property
testing of graphs are established using the Szemerédi regularity lemma, which also has tower-
type bounds in its conclusions.
1.3 Data streaming algorithms:
In computer science, streaming algorithms are algorithms for processing data streams in which
the input is presented as a sequence of items and can be examined in only a few passes (typically
just one). These algorithms have limited memory available to them (much less than the input
size) and also limited processing time per item.
These constraints may mean that an algorithm produces an approximate answer based on a
summary or "sketch" of the data stream in memory.
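A minimal example of such a one-pass, small-memory algorithm is reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of a stream of unknown length; it is shown here as an illustration, not as an algorithm from this report:

```python
import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream whose
    # length is unknown in advance: one pass, O(k) memory.
    sample = []
    for t, item in enumerate(stream):
        if t < k:
            sample.append(item)          # fill the reservoir
        else:
            j = random.randrange(t + 1)  # keep item with prob. k/(t+1)
            if j < k:
                sample[j] = item
    return sample
```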
2. Related work
2.1. Randomized algorithms:
A randomized algorithm is an algorithm which employs a degree of randomness as part of its
logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its
behaviour, in the hope of achieving good performance in the "average case" over all possible
choices of random bits. Formally, the algorithm's performance will be a random variable
determined by the random bits; thus either the running time, or the output (or both) are random
variables.
One has to distinguish between algorithms that use the random input to reduce the expected
running time or memory usage, but always terminate with a correct result in a bounded amount
of time, and probabilistic algorithms, which, depending on the random input, have a chance of
producing an incorrect result (Monte Carlo algorithms) or fail to produce a result (Las Vegas
algorithms) either by signalling a failure or failing to terminate.
In the second case, random performance and random output, the term "algorithm" for a procedure
is somewhat questionable. In the case of random output, it is no longer formally effective.
However, in some cases, probabilistic algorithms are the only practical means of solving a
problem.
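A classic Monte Carlo algorithm of the kind just described is Freivalds' check that A·B = C: each trial costs O(n²) rather than the cost of a full matrix multiplication, and a wrong C survives a single trial with probability at most 1/2. The sketch below is illustrative:

```python
import random

def freivalds(A, B, C, trials=20):
    # Monte Carlo verification that A @ B == C for n x n matrices.
    # Each trial checks A(Br) == Cr for a random 0/1 vector r, in
    # O(n^2) time; a wrong C passes all trials with prob. <= 2**-trials.
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # certainly A @ B != C
    return True  # A @ B == C with high probability
```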
In common practice, randomized algorithms are approximated using a pseudorandom number
generator in place of a true source of random bits; such an implementation may deviate from the
expected theoretical behaviour.
2.2 Approximation algorithms:
In computer science and operations research, approximation algorithms are algorithms used to
find approximate solutions to optimization problems. Approximation algorithms are often
associated with NP-hard problems; since it is unlikely that there can ever be efficient polynomial-
time exact algorithms solving NP-hard problems, one settles for polynomial-time sub-optimal
solutions. Unlike heuristics, which usually only find reasonably good solutions reasonably fast,
one wants provable solution quality and provable run-time bounds. Ideally, the approximation is
optimal up to a small constant factor (for instance within 5% of the optimal solution).
Approximation algorithms are increasingly being used for problems where exact polynomial-
time algorithms are known but are too expensive due to the input size. A typical example for an
approximation algorithm is the one for vertex cover in graphs: find an uncovered edge and add
both endpoints to the vertex cover, until none remain. It is clear that the resulting cover is at most
twice as large as the optimal one. This is a constant factor approximation algorithm with a factor
of 2. NP-hard problems vary greatly in their approximability; some, such as the bin packing
problem, can be approximated within any factor greater than 1 (such a family of approximation
algorithms is often called a polynomial time approximation scheme or PTAS). Others are
impossible to approximate within any constant, or even polynomial factor unless P = NP, such as
the maximum clique problem. NP-hard problems can often be expressed as integer programs (IP)
and solved exactly in exponential time. Many approximation algorithms emerge from the linear
programming relaxation of the integer program. Not all approximation algorithms are suitable for
all practical applications. They often use IP/LP/Semidefinite solvers, complex data structures or
sophisticated algorithmic techniques which lead to difficult implementation problems. Also,
some approximation algorithms have impractical running times even though they are polynomial
time, for example O(n^2000). Yet the study of even very expensive algorithms is not a completely
theoretical pursuit as they can yield valuable insights. A classic example is the initial PTAS for
Euclidean TSP due to Sanjeev Arora which had prohibitive running time, yet within a year,
Arora refined the ideas into a linear time algorithm. Such algorithms are also worthwhile in some
applications where the running times and cost can be justified e.g. computational biology,
financial engineering, transportation planning, and inventory management. In such scenarios,
they must compete with the corresponding direct IP formulations.
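The vertex cover heuristic described earlier — repeatedly take an uncovered edge and add both endpoints — can be sketched as follows (a minimal illustration, not code from the report):

```python
def vertex_cover_2approx(edges):
    # Greedy matching-based 2-approximation: the edges whose endpoints
    # we add form a matching, and any cover must contain at least one
    # endpoint of each matched edge, so |cover| <= 2 * OPT.
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```

For the path 1-2-3-4 the optimum cover is {2, 3}; the greedy rule returns a cover of size at most twice that.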
Another limitation of the approach is that it applies only to optimization problems and not to
"pure" decision problems like satisfiability, although it is often possible to conceive optimization
versions of such problems, such as the maximum satisfiability problem (Max SAT).
Inapproximability has been a fruitful area of research in computational complexity theory since
the 1990 result of Feige, Goldwasser, Lovász, Safra and Szegedy on the inapproximability of
Independent Set. After Arora et al. proved the PCP theorem a year later, it has now been shown
that Johnson's 1974 approximation algorithms for Max SAT, Set Cover, Independent Set and
Coloring all achieve the optimal approximation ratio, assuming P ≠ NP.
2.3. Time Complexity Comparisons:
In computer science, the time complexity of an algorithm quantifies the amount of time taken by
an algorithm to run as a function of the length of the string representing the input. The time
complexity of an algorithm is commonly expressed using big O notation, which excludes
coefficients and lower order terms. When expressed this way, the time complexity is said to be
described asymptotically, i.e., as the input size goes to infinity. For example, if the time required
by an algorithm on all inputs of size n is at most 5n^3 + 3n, the asymptotic time complexity is
O(n^3).
Time complexity is commonly estimated by counting the number of elementary operations
performed by the algorithm, where an elementary operation takes a fixed amount of time to
perform. Thus the amount of time taken and the number of elementary operations performed by
the algorithm differ by at most a constant factor.
Since an algorithm's performance time may vary with different inputs of the same size, one
commonly uses the worst-case time complexity of an algorithm, denoted as T(n), which is
defined as the maximum amount of time taken on any input of size n. Time complexities are
classified by the nature of the function T(n). For instance, an algorithm with T(n) = O(n) is called
a linear time algorithm, and an algorithm with T(n) = O(2^n) is said to be an exponential time
algorithm.
Table 1 Time complexity comparison
3. Design
3.1. Algorithm for checking sortedness of a list in sublinear time:
Input: a list of n numbers x1, x2, ..., xn.
Question: is the list sorted, or ε-far from sorted?
Known upper bound: there are O((log n)/ε)-time testers for this problem.
Matching lower bound: Ω(log n) queries are required, for every constant ε ≤ 1/2, for every
1-sided error nonadaptive test.
A test has 1-sided error if it always accepts all YES instances.
A test is nonadaptive if its queries do not depend on answers to previous queries.
A pair (xi, xj) with i < j is violated if xi > xj.
Claim. A 1-sided error test can reject only if it finds a violated pair.
Proof: every sorted partial list can be extended to a sorted list.
The hard input ("Lola's" distribution) is uniform over the following log n lists:
Claim 1. All lists above are 1/2-far from sorted.
Claim 2. Every pair (xi, xj) is violated in exactly one list above.
Let the test pick a set Q of positions to query, and write its elements in order as a1 < a2 < ...
Our test must be correct, i.e., it must find a violated pair with probability ≥ 2/3
when the input is picked according to Lola's distribution.
Q contains a violated pair iff (ai, ai+1) is violated for some i.
By Claim 2, each such pair is violated in exactly one of the log n lists, so
Pr[Q contains a violated pair] ≤ (|Q| − 1) / log n.
If |Q| ≤ (2/3)·log n, then this probability is less than 2/3, so Ω(log n) queries are required.
3.1.1. Algorithm:
1. Pick an entry uniformly at random. Let x be the value in that entry.
2. Perform a binary search for x.
3. If the binary search reaches the entry picked in step 1 (i.e., x is found there), accept;
otherwise, reject.
Runtime: O(log n)
3.2 Successor search in a list in sublinear time:
Input: a sorted doubly linked list of numbers and a search value k
Output: first element greater than k
Data structure:
Figure 3.1. Data structure for doubly linked list
3.2.1 Search Algorithm:
Pick Θ(√n) random list elements.
Determine the closest predecessor and successor of k (within the random sample).
Perform a plain, linear search towards k (starting from these two elements).
Figure 3.2. Algorithm running status
7/29/2019 Sublinear Time Algorithm
16/2815
Good case:
Figure 3.3. Good case for algorithm
Worst case:
Figure 3.4. Worst case for algorithm
Runtime: O(√n) expected
3.3 Testing Membership of a String in a R.E.
For a fixed regular language L ⊆ {0,1}*, a testing algorithm should accept w.h.p. every
word w in L, and should reject w.h.p. every word w that differs in more than εn bits
(n = |w|) from every word in L. The algorithm can query any bit wi of w.
Figure 3.5. DFA for testing
3.3.1. Algorithm (simplified version):
Uniformly and independently select Θ(r/ε) indices 1 ≤ i ≤ n.
For each selected i, check that the substring wi ... wi+r/ε is feasible.
If any substring is infeasible then reject; otherwise accept.
Runtime: independent of n (for fixed L, polynomial in 1/ε).
3.4. Searching in a sorted list:
It is well known that if we can store the input in a sorted array, then we can solve various
problems on the input very efficiently. However, the assumption that the input array is sorted is
not natural in typical applications. Let us now consider a variant of this problem,
where our goal is to search for an element x in a sorted linked list containing n distinct elements.
Here, we assume that the n elements are stored in a doubly-linked list, each list element has access
to the next and preceding elements in the list, and the list is sorted (that is, if x follows y in the
list, then y < x). We also assume that we have access to all elements in the list, which, for example,
can correspond to the situation that all n list elements are stored in an array (but the array is not
sorted and we do not impose any order for the array elements). How can we find whether a given
number x is in our input or not. At first glance, it seems that since we do not have direct
access to the rank of any element in the list, this problem requires Ω(n) time. And indeed, if our
goal is to design a deterministic algorithm, then it is impossible to do the search in o(n)
time. However, if we allow randomization, then we can complete the search in O(√n) expected
time (and this bound is asymptotically tight). Let us first sample uniformly at random a set S of
Θ(√n) elements from the input. Since we have access to all elements in the list, we can select the
set S in O(√n) time. Next, we scan all the elements in S and in O(√n) time we can find two
elements p and q in S such that p ≤ x < q, and there is no element in S that is between p and q.
Observe that since the input consists of n distinct numbers, p and q are uniquely defined. Next, we
traverse the input list containing all the input elements starting at p until we find either the sought
key x or we find element q.
Lemma 1:
The algorithm above completes the search in expected O(√n) time. Moreover, no algorithm can
solve this problem in o(√n) expected time.
Proof. The running time of the algorithm is equal to O(√n) plus the number of input elements
between p and q. Since S contains Θ(√n) elements, the expected number of input elements
between p and q is O(n/|S|) = O(√n). This implies that the expected running time of the algorithm
is O(√n).
A matching lower bound of Ω(√n) expected time also holds.
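A sketch of the whole algorithm in Python; the array-of-cells representation, with `value[i]` and a successor index `nxt[i]`, is one illustrative encoding of "a linked list whose cells we can all access":

```python
import math
import random

def linked_list_search(value, nxt, head, x):
    # value[i]/nxt[i] store a sorted linked list laid out in an array
    # in arbitrary order (nxt[i] = array index of i's successor, -1 at
    # the end). Sample ~sqrt(n) cells, then walk from the sampled
    # predecessor closest to x. Expected running time O(sqrt(n)).
    n = len(value)
    s = max(1, int(math.sqrt(n)))
    start = head
    for i in random.sample(range(n), s):
        if value[i] <= x and value[i] >= value[start]:
            start = i  # best sampled predecessor of x so far
    # Plain linked-list walk from the best predecessor found.
    i = start
    while i != -1 and value[i] < x:
        i = nxt[i]
    return i != -1 and value[i] == x
```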
3.5. Geometry: Intersection of Two Polygons:
Let us consider a related problem, but this time in a geometric setting. Given two convex
polygons A and B in R², each with n vertices, determine if they intersect, and if so, then find a
point in their intersection.
It is well known that this problem can be solved in O(n) time, for example, by observing that it
can be described as a linear programming instance in 2-dimensions, a problem which is known to
have a linear-time algorithm. In fact, within the same time one can either find a point that is in the
intersection of A and B, or find a line L that separates A from B (actually, one can even find a
bitangent separating line L, i.e., a line separating A and B which intersects with each of A and B
in exactly one point). The question is whether we can obtain a better running time. The
complexity of this problem depends on the input representation. In the most powerful model, if
the vertices of both polygons are stored in an array in cyclic order, Chazelle and Dobkin showed
that the intersection of the polygons can be determined in logarithmic time. However, a standard
geometric representation assumes that the input is not stored in an array but rather A and B are
given by their doubly-linked lists of vertices such that each vertex has as its successor the next
vertex of the polygon in the clockwise order. Can we then test if A and B intersect?
Figure 1: (a) Bitangent line L separating CA and CB, and (b) the polygon PA.
Chazelle et al. [2] gave an O(√n)-time algorithm that reuses the approach discussed above for
searching in a sorted list. Let us first sample uniformly at random Θ(√n) vertices from each of A
and B, and let CA and CB be the convex hulls of the sampled point sets for the polygons A and
B, respectively. Using the linear-time algorithm mentioned above, in O(√n) time we can check if
CA and CB intersect. If they do, then the algorithm gives us a point that lies in the intersection
of CA and CB, and hence this point lies also in the intersection of A and B. Otherwise, let L be the
bitangent separating line returned by the algorithm. Let a and b be the points in L that belong to
A and B, respectively. Let a1 and a2 be the two vertices adjacent to a in A. We will now define a
new polygon PA. If neither a1 nor a2 is on the CA side of L, then we define PA to be empty.
Otherwise, exactly one of a1 and a2 is on the CA side of L; let it be a1. We define polygon PA by
walking from a to a1 and then continuing along the boundary of A until we cross L again
(see Figure 1(b)). In a similar way we define polygon PB. Observe that the expected size of each
of PA and PB is at most O(√n). It is easy to see that A and B intersect if and only if either A
intersects PB or B intersects PA. We only consider the case of checking if A intersects PB. We
first determine if CA intersects PB. If yes, then we are done. Otherwise, let LA be a bitangent
separating line that separates CA from PB. We use the same construction as above to determine a
subpolygon QA of A that lies on the PB side of LA. Then, A intersects PB if and only if QA
intersects PB. Since QA has expected size O(√n) and so does PB, testing the intersection of these
two polygons can be done in O(√n) expected time. Therefore, by our construction above, we
have reduced the problem of determining if two polygons of size n intersect to a constant number
of instances of determining if two polygons of expected size O(√n) intersect. This leads to the
following lemma.
Lemma 2 [2] The problem of determining whether two convex n-gons intersect can be solved in
O(√n) expected time, which is asymptotically optimal. Chazelle et al. [2] gave not only this
result, but they also showed how to apply a similar approach to design a number of sublinear-
time algorithms for some basic geometric problems. For example, one can extend the result
discussed above to test the intersection of two convex polyhedra in R³ with n vertices in O(√n)
expected time. One can also approximate the volume of an n-vertex convex polytope to within a
relative error ε > 0 in expected time O(√n/ε). Or even, for a pair of points on the boundary of
a convex polytope P with n vertices, one can estimate the length of an optimal shortest path
outside P between the given points in O(√n) expected time. In all the results mentioned above,
the input objects have been represented by a linked structure: either every point has access to its
adjacent vertices in the polygon in R², or the polytope is defined by a doubly-connected edge list,
or so. These input representations are standard in computational geometry, but a natural question
is whether this is necessary to achieve sublinear-time algorithms: what can we do if the input
polygon/polytope is represented by a set of points and no additional structure is provided to the
algorithm? In such a scenario, it is easy to see that no o(n)-time algorithm can solve exactly any
of the problems discussed above. That is, for example, to determine if two polygons with n
vertices intersect one needs Ω(n) time. However, we can still obtain some approximation to this
problem, one which is described in the framework of property testing. Suppose that we relax our
task and instead of determining if two (convex) polytopes A and B in Rd intersect, we just want
to distinguish between two cases: either A and B are intersection-free, or one has to significantly
modify A and B to make them intersection-free. The definition of the notion of "significantly
modify" may depend on the application at hand, but the most natural characterization would be to
remove at least εn points in A and B, for an appropriate parameter ε (see [3] for a discussion
about other geometric characterizations). Czumaj et al. [4] gave a simple algorithm that, for any
ε > 0, can distinguish between the case when A and B do not intersect, and the case when at least
εn points have to be removed from A and B to make them intersection-free: the algorithm returns
the outcome of a test of whether a random sample of O((d/ε) log(d/ε)) points from A intersects
with a random sample of O((d/ε) log(d/ε)) points from B.
3.6. Sublinear Time Algorithms for Graph Problems
In the previous section, we introduced the concept of sublinear-time algorithms and we presented
two basic sublinear-time algorithms for geometric problems. In this section, we will discuss
sublinear-time algorithms for graph problems. Our main focus is on sublinear-time algorithms for
graphs, with special emphasis on sparse graphs represented by adjacency lists, where
combinatorial algorithms are sought.
3.6.1. Approximating the Average Degree
Assume we have access to the degree distribution of the vertices of an undirected connected
graph G = (V, E), i.e., for any vertex v ∈ V we can query for its degree. Can we achieve a good
approximation of the average degree in G by looking at a sublinear number of vertices? At first
sight, this seems to be an impossible task: approximating the average degree appears equivalent
to approximating the average of a set of n numbers with values between 1 and n − 1, which is
not possible in sublinear time. However, Feige [5] proved that one can approximate the
average degree in O(√n/ε) time within a factor of 2 + ε. The difficulty with approximating the
average of a set of n numbers can be illustrated with the following example. Assume that almost
all numbers in the input set are 1 and a few of them are n − 1. To approximate the average, we
need to approximate how many occurrences of n − 1 exist. If there is only a constant number of
them, we can do this only by looking at Ω(n) numbers in the set. So the problem is that these
large numbers can hide in the set, and we cannot give a good approximation unless we can find
at least some of them. Why is the problem less difficult if, instead of an arbitrary set of numbers,
we have a set of numbers that are the vertex degrees of a graph? For example, we could still have
a few vertices of degree n − 1. The point is that in this case any edge incident to such a vertex
can be seen at another vertex. Thus, even if we do not sample a vertex with high degree, we will
see all of its incident edges at other vertices in the graph. Hence, vertices with a large degree
cannot hide.
We will sketch a proof of a slightly weaker result than that originally proven by Feige [5]. Let d
denote the average degree in G = (V,E) and let dS denote the random variable for the average
degree of a set S of s vertices chosen uniformly at random from V. We will show that if we set
s ≥ √n/ε^O(1) for an appropriate constant in the exponent, then dS ≥ (1/2 − ε)·d with probability
at least 1 − ε/64. Additionally, we observe that Markov's inequality immediately implies that
dS ≤ (1 + ε)·d with probability at least 1 − 1/(1 + ε) ≥ ε/2. Therefore, our algorithm will pick
8/ε sets Si, each of size s, and output the one with the smallest average degree. The probability
that all of the sets Si have too high an average degree is at most (1 − ε/2)^(8/ε) ≤ 1/8. The
probability that one of them has too small an average degree is at most (8/ε)·(ε/64) = 1/8.
Hence, the output value satisfies both inequalities with probability at least 3/4. By replacing
ε with ε/2, this yields a (2 + ε)-approximation algorithm.
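A minimal sketch of this sampling scheme follows (illustrative only; the function names are assumptions, and the sample size √n/ε and the 8/ε repetitions are taken from the discussion above):

```python
import random

def approx_average_degree(degree, n, eps, rng=random):
    # Sampling scheme from the text: pick 8/eps random vertex sets, each of size
    # about sqrt(n)/eps, and return the smallest sample average degree.
    # `degree` is an oracle: degree(v) returns the degree of vertex v in range(n).
    s = max(1, int(n ** 0.5 / eps))
    best = float("inf")
    for _ in range(max(1, int(8 / eps))):
        sample = [degree(rng.randrange(n)) for _ in range(s)]
        best = min(best, sum(sample) / len(sample))
    return best
```

Taking the minimum over the repetitions guards against overestimates, while the analysis below shows a single sample rarely underestimates by more than a factor of about 2.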
Now, our goal is to show that with high probability one does not underestimate the average
degree too much. Let H be the set of the √(εn) vertices with highest degree in G and let L = V \ H
be the set of the remaining vertices. We first argue that the sum of the degrees of the vertices in L
is at least (1/2 − ε) times the sum of the degrees of all vertices. This can be seen by
distinguishing between edges incident to a vertex from L and edges within H. Every edge incident
to a vertex from L contributes at least 1 to the sum of degrees of vertices in L, which is fine, as
this is at least 1/2 of its full contribution of 2. So the only edges that may cause problems are the
edges within H. However, since |H| = √(εn), there can be at most εn such edges, which is small
compared to the overall number of edges (which is at least n − 1, since the graph is connected).
Now, let dH be the degree of a vertex with the smallest degree in H. Since we aim at giving a
lower bound on the average degree of the sampled vertices, we can safely assume that all
sampled vertices come from the set L. We know that each vertex in L has a degree between 1 and
dH. Let Xi, 1 ≤ i ≤ s, be the random variable for the degree of the i-th vertex from S. Since the
Xi are independent and take values in [1, dH], Hoeffding's inequality yields

    Pr[ X1 + ... + Xs ≤ E[X1 + ... + Xs] − t ] ≤ exp(−2t² / (s·dH²)).

We know that the average degree is at least dH·|H|/n, because every vertex in H has degree at
least dH. Hence, the average degree of a vertex in L is at least (1/2 − ε)·dH·|H|/n. This just
means E[Xi] ≥ (1/2 − ε)·dH·|H|/n, and by linearity of expectation, E[X1 + ... + Xs] ≥
s·(1/2 − ε)·dH·|H|/n.
This implies that, for our choice of s, with high probability we have dS ≥ (1/2 − ε)·d. Feige [5]
showed a result that is stronger with respect to the dependence on ε: roughly, a (2 + ε)-
approximation of the average degree can be computed with O(√(n/d0)·poly(1/ε)) degree queries,
where d0 is any given lower bound on the average degree.
3.6.2. Minimum Spanning Trees
One of the most fundamental graph problems is to compute a minimum spanning tree. Since a
minimum spanning tree has size linear in the number of vertices, no sublinear-time algorithm for
sparse graphs can exist. It is also known that no constant-factor approximation algorithm with
o(n²) query complexity exists for dense graphs (even in metric spaces) [6]. Given these facts, it is
somewhat surprising that it is possible to approximate the cost of a minimum spanning tree in
sparse graphs [7], as well as in metric spaces [8], to within a factor of 1 + ε. In the following, we
will explain the algorithm for sparse graphs by Chazelle et al. [7]. We will prove a slightly
weaker result than in [7]. Let G = (V, E) be an undirected connected weighted graph with
maximum degree D and integer edge weights from {1, ..., W}. We assume that the graph is
given in adjacency list representation, i.e., for every vertex v there is a list of its at most D
neighbors, which can be accessed from v. Furthermore, we assume that the vertices are stored in
an array such that it is possible to select a vertex uniformly at random. We assume also that the
values of D and W are known to the algorithm.
The main idea behind the algorithm is to express the cost of a minimum spanning tree in terms of
the number of connected components in certain auxiliary subgraphs of G. Then, one runs a
randomized algorithm to estimate the number of connected components in each of these
subgraphs.
To start with basic intuition, let us assume that W = 2, i.e., the graph has only edges of weight 1
or 2. Let G(1) = (V, E(1)) denote the subgraph that contains all edges of weight (at most) 1 and let
c(1) be the number of connected components in G(1). It is easy to see that the minimum spanning
tree has to link these connected components by edges of weight 2. Since any connected
component in G(1) can be spanned by edges of weight 1, any minimum spanning tree of G has
c(1) − 1 edges of weight 2 and n − 1 − (c(1) − 1) edges of weight 1. Thus, the weight of a minimum
spanning tree is

    (n − 1 − (c(1) − 1)) + 2·(c(1) − 1) = n − 2 + c(1).

Next, let us consider an arbitrary integer value for W. Defining G(i) = (V, E(i)), where E(i) is the
set of edges in G with weight at most i, one can generalize the formula above to obtain that the
cost MST of a minimum spanning tree can be expressed as

    MST = n − W + Σ_{i=1}^{W−1} c(i),

where c(i) denotes the number of connected components in G(i).
This gives a simple algorithm: estimate the number of connected components c(i) of each subgraph G(i), 1 ≤ i ≤ W − 1, and output n − W plus the sum of these estimates.
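Instantiating the identity with an exact component count illustrates it directly; in the sublinear algorithm, the exact count is replaced by a sampling-based estimate (a sketch; all names are illustrative):

```python
def count_components(n, edges):
    # Exact connected-component count via union-find; stands in for the sublinear
    # estimator, to illustrate the identity MST = n - W + sum_{i=1}^{W-1} c(i).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return sum(1 for v in range(n) if find(v) == v)

def mst_weight_via_components(n, weighted_edges, W):
    # MST weight of a connected graph with integer weights in {1, ..., W}:
    # n - W plus, for each i = 1..W-1, the number of components of the subgraph
    # G(i) that keeps only edges of weight <= i.
    total = n - W
    for i in range(1, W):
        Ei = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, Ei)
    return total
```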
Thus, the key question that remains is how to estimate the number of connected components.
This is done by a randomized algorithm that samples vertices and explores only a small
neighborhood of each sampled vertex.
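One standard estimator of this kind can be sketched as follows (a sketch, not the exact procedure of [7]; the 2/ε truncation threshold and all names are assumptions):

```python
import random
from collections import deque

def approx_components(n, adj, eps, s, rng=random):
    # For each of s random start vertices u, run a BFS that stops once at least
    # 2/eps vertices have been seen; estimate 1/n_u (n_u = size of u's component)
    # by 1/min(#seen, 2/eps). Since the sum over all v of 1/n_v equals the number
    # of components, n times the sample mean estimates it to within about eps*n/2.
    cap = int(2 / eps)
    total = 0.0
    for _ in range(s):
        u = rng.randrange(n)
        seen = {u}
        queue = deque([u])
        while queue and len(seen) < cap:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        total += 1.0 / min(len(seen), cap)
    return n * total / s
```

Each truncated BFS touches O(D/ε) edges in a graph of maximum degree D, so the total running time is independent of n.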
To analyze this algorithm, let us fix an arbitrary connected component C and let |C| denote the
number of vertices in C. Let c denote the number of connected components in G. Since for every
component the terms 1/|C|, summed over its vertices, add up to exactly 1, we can write

    c = Σ over components C of ( Σ over v in C of 1/|C| ) = Σ over v in V of 1/n_v,

where n_v denotes the number of vertices in the connected component of v. By linearity of
expectation, the estimator ĉ computed by the algorithm satisfies E[ĉ] ≈ c, up to the small
additive error introduced by truncating the exploration of large components. To show that ĉ is
concentrated around its expectation, we apply Chebyshev's inequality. Since bi is an indicator
random variable, we have Var[bi] ≤ E[bi²] = E[bi].
With this bound for Var[ĉ], we can use Chebyshev's inequality to obtain

    Pr[ |ĉ − E[ĉ]| ≥ εn/2 ] ≤ Var[ĉ] / (εn/2)²,

which is a small constant for a sufficiently large number of samples. From this it follows that one
can approximate the number of connected components within additive error εn in a graph with
maximum degree D in time polynomial in D and 1/ε (and independent of n), with constant
probability of success. The following somewhat stronger result has been obtained in [7]: the
weight of the minimum spanning tree of a graph with maximum degree D and weights in
{1, ..., W} can be approximated to within a factor of 1 + ε in time O(D·W·ε⁻²·log(D·W/ε)).
Notice that the obtained running time is independent of the input size n.
3.7. Sublinear-Time Approximation Algorithms for Problems in Metric Spaces
One of the most widely considered models in the area of sublinear-time approximation
algorithms is the distance oracle model for metric spaces. In this model, the input of an algorithm
is a set P of n points in a metric space (P, d). We assume that it is possible to compute the
distance d(p, q) between any pair of points p, q in constant time. Equivalently, one could assume
that the algorithm is given access to the n × n distance matrix of the metric space, i.e., we have
oracle access to the matrix of a weighted undirected complete graph. Since the full description
size of this matrix is O(n²), we will call any algorithm with o(n²) running time a sublinear
algorithm. Which problems can and cannot be approximated in sublinear time in the distance
oracle model? One of the most basic problems is to find (an approximation of) the shortest or the
longest pairwise distance in the metric space. It turns out that the shortest distance cannot be
approximated. The counterexample is a uniform metric (all distances are 1) with one distance
set to some very small value ε. Obviously, it requires Ω(n²) time to find this single short
distance. Hence, no sublinear-time approximation algorithm for the shortest distance problem
exists. What about the longest distance? In this case, there is a very simple 1/2-approximation
algorithm, which was first observed by Indyk [6]. The algorithm chooses an arbitrary point p and
returns its furthest neighbor q. Let r, s be the furthest pair in the metric space. We claim that
d(p, q) ≥ (1/2)·d(r, s). By the triangle inequality, we have d(r, p) + d(p, s) ≥ d(r, s). This
immediately implies that either d(p, r) ≥ (1/2)·d(r, s) or d(p, s) ≥ (1/2)·d(r, s); since q is the
furthest neighbor of p, we have d(p, q) ≥ max{d(p, r), d(p, s)}, which shows the approximation
guarantee. In the following, we present some recent sublinear-time algorithms for a few
optimization problems in metric spaces.
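The furthest-neighbor 1/2-approximation above translates directly into code (a sketch; `dist` is assumed to be a constant-time distance oracle):

```python
def approx_diameter(points, dist):
    # Indyk's 1/2-approximation of the largest pairwise distance: fix an arbitrary
    # point p and return the distance to its furthest neighbor. By the triangle
    # inequality this is at least half the true diameter, using only n - 1 queries.
    p = points[0]
    return max(dist(p, q) for q in points[1:])
```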
3.7.1. Clustering via Random Sampling
The problem of clustering large data sets into subsets (clusters) of similar characteristics is one
of the most fundamental problems in computer science, operations research, and related fields.
Clustering problems arise naturally in various massive-dataset applications, including data
mining, bioinformatics, pattern classification, etc. In this section, we will discuss uniform
random sampling for clustering problems in metric spaces, as analyzed in two recent papers
[9, 10].

Figure: (a) a set of points in a metric space; (b) its 3-clustering (white points correspond to the
center points); (c) the distances used in the cost of the 3-median.
Let us consider a classical clustering problem known as the k-median problem. Given a finite
metric space (P, d), the goal is to find a set C ⊆ P of k centers (points in P) that minimizes the
cost

    Σ over p in P of d(p, C),

where d(p, C) denotes the distance from p to the nearest point in C. The k-median
problem has been studied in numerous research papers. It is known to be NP-hard, and there exist
constant-factor approximation algorithms running in Õ(n·k) time. In two recent papers [9, 10],
the authors asked about the quality of the uniform random sampling approach to k-median, that
is, about the quality of the following generic scheme: choose a multiset S of s points from P
uniformly at random, run an α-approximation algorithm for k-median on the sample S, and
return the resulting set of k centers.
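A sketch of this generic scheme (illustrative; the brute-force search over k-subsets of the sample merely stands in for the α-approximation algorithm run on the sample, and all names are assumptions):

```python
import random
from itertools import combinations

def kmedian_cost(points, centers, dist):
    # k-median objective: sum of distances to the nearest center.
    return sum(min(dist(p, c) for c in centers) for p in points)

def sample_kmedian(points, k, s, dist, rng=random):
    # Generic random-sampling scheme: draw a uniform sample S of size s and solve
    # k-median on S; the exact brute-force solver on the sample plays the role of
    # the alpha-approximation algorithm.
    S = [rng.choice(points) for _ in range(s)]
    best = min(combinations(S, k), key=lambda C: kmedian_cost(S, C, dist))
    return list(best)
```

The point of the analysis in [9, 10] is that the centers computed on the small sample S remain a good solution for the whole point set P.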
The goal is to show that a sublinear-size sample set S already suffices to obtain a good
approximation guarantee. Furthermore, as observed in [10], in order to have any guarantee of the
approximation, one has to consider the quality of the approximation as a function of the diameter
of the metric space. Therefore, we consider a model in which the diameter Δ of the metric space
is given, that is, with d : P × P → [0, Δ].

4. Limitations: What Cannot Be Done in Sublinear Time
The algorithms discussed in the previous sections may suggest that many optimization problems
in metric spaces have sublinear-time algorithms. However, it turns out that the problems listed in
the previous sections are more like exceptions than the norm. Indeed, most problems have a
trivial lower bound that excludes sublinear-time algorithms. We have
already mentioned in Section 3.7 that the problem of approximating the cost of the lightest edge
in a finite metric space (P, d) requires Ω(n²) time, even if randomization is allowed. Other
problems for which no sublinear-time algorithms are possible include estimating the cost of a
minimum-cost matching, the cost of a minimum-cost bi-chromatic matching, the cost of
minimum non-uniform facility location, and the cost of k-median for k = n/2; all these problems
require Ω(n²) (randomized) time to estimate the cost of their optimal solution to within any
constant factor λ.
To illustrate the lower bounds, we give two instances of metric spaces which are
indistinguishable by any o(n²)-time algorithm, but for which the cost of the minimum-cost
matching in one instance is greater than λ times the cost in the other (see Figure 3). Consider a
metric space (P, d) with 2n points: n points in L and n points in R. Take a random perfect
matching M between the points in L and R, and then choose an edge e ∈ M at random. Next,
define the distance in (P, d) as follows:
d(e) is either 1 or B, where we set B = n·(λ − 1) + 2,
for any e′ ∈ M \ {e}, set d(e′) = 1, and
for any other pair of points p, q ∈ P not connected by an edge from M, set d(p, q) = n³.
It is easy to see that both instances properly define a metric space (P, d). For such problem
instances, the cost of the minimum-cost matching depends on the choice of d(e): if d(e) = B,
then the cost is n − 1 + B > λ·n, and if d(e) = 1, then the cost is n. Hence, any λ-factor
approximation algorithm for the matching problem must distinguish between these two problem
instances. However, this requires determining whether there is an edge of length B, which is
known to require Ω(n²) time, even if a randomized algorithm is used.
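The gap between the two instances can be checked directly on small examples (a sketch; the random matching is taken to be the identity for simplicity, and matching costs are computed by brute force):

```python
from itertools import permutations

def min_cost_matching(cost):
    # Brute-force minimum-cost perfect matching between L = R = range(n).
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def build_instance(n, lam, special_weight):
    # The construction above, with the random matching taken as the identity:
    # matched pairs are at distance 1, except one special pair at distance
    # `special_weight` (either 1 or B = n*(lam - 1) + 2); every other cross
    # pair is at distance n**3.
    cost = [[n ** 3] * n for _ in range(n)]
    for i in range(n):
        cost[i][i] = 1
    cost[0][0] = special_weight
    return cost
```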
5. Conclusions
It would be impossible to present a complete picture of the large body of research in the area of
sublinear-time algorithms in such a report. In this report, my main goal was to give some flavor
of the area, of the types of results achieved, and of the techniques used. For more details, we
refer to the original works listed in the references. I did not discuss two important areas that are
closely related to sublinear-time algorithms: property testing and data streaming algorithms.
References:
[1] A. Czumaj and C. Sohler. Sublinear-time algorithms. Bulletin of the EATCS, 89: 23–47,
June 2006.
[2] B. Chazelle, D. Liu, and A. Magen. Sublinear geometric algorithms. SIAM Journal on
Computing, 35(3): 627–646, 2006.
[3] A. Czumaj and C. Sohler. Property testing with geometric queries. Proceedings of the 9th
Annual European Symposium on Algorithms (ESA), pp. 266–277, 2001.
[4] A. Czumaj, C. Sohler, and M. Ziegler. Property testing in computational geometry.
Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 155–166, 2000.
[5] U. Feige. On sums of independent random variables with unbounded variance and estimating
the average degree in a graph. SIAM Journal on Computing, 35(4): 964–984, 2006.
[6] P. Indyk. Sublinear time algorithms for metric space problems. Proceedings of the 31st
Annual ACM Symposium on Theory of Computing (STOC), pp. 428–434, 1999.
[7] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning tree
weight in sublinear time. SIAM Journal on Computing, 34(6): 1370–1379, 2005.
[8] A. Czumaj and C. Sohler. Estimating the weight of metric minimum spanning trees in
sublinear time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing
(STOC), pp. 175–183, 2004.
[9] A. Czumaj and C. Sohler. Sublinear-time approximation for clustering via random sampling.
Proceedings of the 31st Annual International Colloquium on Automata, Languages and
Programming (ICALP), pp. 396–407, 2004.
[10] N. Mishra, D. Oblinger, and L. Pitt. Sublinear time approximate clustering. Proceedings of
the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447, 2001.