7/29/2019 Sublinear Time Algorithm
1/28
A
Seminar Report
On
SUB-LINEAR TIME ALGORITHMS
Submitted in partial fulfillment of the
requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted by
AJAY YADAV
Roll No. - 1012210009
B.TECH (CS-61)
Under the guidance of
Ms. ANITA PAL
Mr. KAMAL KUMAR SRIVASTAVA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SHRI RAMSWAROOP MEMORIAL COLLEGE OF ENGINEERING &
MANAGEMENT, LUCKNOW
FEBRUARY
2013
ABSTRACT
The area of sublinear-time algorithms is a new, rapidly emerging area of computer science. It has
its roots in the study of massive data sets that occur more and more frequently in various
applications. Financial transactions with billions of input data and Internet traffic analyses
(Internet traffic logs, clickstreams, web data) are examples of modern data sets that show
unprecedented scale. Managing and analyzing such data sets forces us to reconsider the
traditional notions of efficient algorithms: processing such massive data sets in more than linear
time is by far too expensive and often even linear time algorithms may be too slow. Hence, there
is the desire to develop algorithms whose running times are not only polynomial, but in fact are
sublinear in n. Constructing a sublinear time algorithm may seem to be an impossible task, since it allows one to read only a small fraction of the input. However, in recent years, we have seen
development of sublinear time algorithms for optimization problems arising in such diverse areas
as graph theory, geometry, algebraic computations, and computer graphics. Initially, the main
research focus has been on designing efficient algorithms in the framework of property testing,
which is an alternative notion of approximation for decision problems. More recently, we have seen
some major progress in sublinear-time algorithms in the classical model of randomized and
approximation algorithms. In this paper, we survey some of the recent advances in this area. Our
main focus is on sublinear-time algorithms for combinatorial problems, especially for graph
problems and optimization problems in metric spaces.
ACKNOWLEDGEMENT
Acknowledgements provide us the opportunity to express our gratitude to the people who make such tasks come out as a success. I take this opportunity to thank Dr. (Mrs.) Vinodini Katiyar
(Head of Department, Computer Science) for her immense support and help. I would also like to
thank my seminar guides Ms. Anita Pal and Mr. Kamal Kumar Srivastava for their incessant and
perpetual cooperation and encouragement which helped me overcome hurdles at various stages.
I would also like to thank my family for providing me with resources, support and
encouragement; this report would not have been possible without you.
AJAY YADAV
List of Symbols and Abbreviations
DFA Deterministic finite automaton
R.E. Regular Expression
MST Minimum Spanning Tree
NP Nondeterministic polynomial time
PTAS Polynomial-time approximation scheme
O Asymptotic upper bound
Ω Asymptotic lower bound
Θ Asymptotic tight bound
< Less than
Σ Summation
ε Distance (proximity) parameter
TABLE OF CONTENTS
Abstract ........................................................ i
Acknowledgement ................................................. ii
List of Symbols, Abbreviations and Nomenclature ................. iii
1. Introduction ................................................. 5
1.1. Sublinear time algorithms .................................. 5
1.2. Property testing ........................................... 6
1.2.1. Definition and variants .................................. 7
1.2.2. Features and limitations ................................. 8
1.3. Data streaming algorithms .................................. 8
2. Related work ................................................. 9
2.1. Randomized algorithms ...................................... 9
2.2. Approximation algorithms ................................... 9
2.3. Time complexity comparisons ................................ 10
3. Design ....................................................... 12
3.1. Algorithm to determine sortedness of a list ................ 12
3.1.1. Algorithm ................................................ 13
3.2. Successor search in a list ................................. 14
3.2.1. Search algorithm ......................................... 14
3.3. Testing membership of a string in a R.E. ................... 15
3.3.1. Algorithm ................................................ 15
3.4. Searching in a sorted list ................................. 16
3.5. Geometry: intersection of two polygons ..................... 17
3.6. Sublinear time algorithms for graph problems ............... 19
3.6.1. Average degree of a graph ................................ 19
3.7. Sublinear time approximation algorithms for problems in metric spaces ... 23
3.7.1. Clustering via random sampling ........................... 25
4. Conclusion ................................................... 26
References ...................................................... 27
1. INTRODUCTION
The concept of sublinear-time algorithms has been known for a very long time, but initially it was
used to denote pseudo-sublinear-time algorithms, where after an appropriate preprocessing, an
algorithm solves the problem in sublinear-time. For example, if we have a set of n numbers, then
after an O(n log n) preprocessing (sorting), we can trivially solve a number of problems
involving the input elements. And so, if after the preprocessing the elements are put in a
sorted array, then in O(1) time we can find the kth smallest element, in O(log n) time we can test
if the input contains a given element x, and also in O(log n) time we can return the number of
elements equal to a given element x. Even though all these results are folklore, this is not what
we call nowadays a sublinear-time algorithm.
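The pseudo-sublinear pattern above — pay O(n log n) once, then answer queries in O(1) or O(log n) — can be sketched in a few lines of Python (the function names are illustrative, not from this report):

```python
import bisect

def preprocess(nums):
    # O(n log n) preprocessing: sort the input once.
    return sorted(nums)

def kth_smallest(arr, k):
    # O(1) after preprocessing (k is 1-indexed).
    return arr[k - 1]

def contains(arr, x):
    # O(log n) membership test via binary search.
    i = bisect.bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

def count_equal(arr, x):
    # O(log n): number of elements equal to x.
    return bisect.bisect_right(arr, x) - bisect.bisect_left(arr, x)
```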
In this Report, my goal is to study algorithms for which the input is taken to be in any standard
representation and with no extra assumptions. Then, an algorithm does not have to read the entire
input but it may determine the output by checking only a subset of the input elements. It is easy
to see that for many natural problems it is impossible to give any reasonable answer if not all or
almost all input elements are checked. But still, for some number of problems we can obtain
good algorithms that do not have to look at the entire input. Typically, these algorithms are
randomized (because most of the problems have a trivial linear-time deterministic lower bound)
and they return only an approximate solution rather than the exact one (because usually, without
looking at the whole input we cannot determine the exact solution). In this survey, we present
recently developed sublinear-time algorithms for some combinatorial optimization problems.
1.1 Sublinear time algorithms:
Sublinear time algorithms are algorithms whose running time is sublinear in the input size: if f(n)
denotes the time complexity, then f(n) = o(n). Here o (little-oh) is defined as follows: for two
functions g1 and g2, we write g1 = o(g2) if and only if for every constant c > 0 there exists n0
such that g1(n) < c*g2(n) for all n > n0. An algorithm is said to run in sub-linear time (often
spelled sublinear time) if T(n) = o(n). In particular this includes algorithms with the time
complexities defined above, as well as others such as the O(√n) Grover's search algorithm.
Typical algorithms that are exact and yet run in sub-linear time use parallel processing (as the
NC1 matrix determinant calculation does), non-classical processing (as Grover's search does), or
alternatively have guaranteed assumptions on the input structure (as the logarithmic time binary
search and many tree maintenance algorithms
do).
Figure 1.1 Comparison between sublinear and linear algorithm
However, languages such as the set of all strings that have a 1-bit indexed by the first log(n) bits
may depend on every bit of the input and yet be computable in sub-linear time.
The specific term sublinear time algorithm is usually reserved for algorithms that are unlike the
above in that they are run over classical serial machine models and are not allowed prior
assumptions on the input. They are however allowed to be randomized, and indeed must be
randomized for all but the most trivial of tasks. As such an algorithm must provide an answer
without reading the entire input, its particulars heavily depend on the access allowed to the input.
Usually for an input that is represented as a binary string b1,...,bk it is assumed that the algorithm
can in time O(1) request and obtain the value of bi for any i.
Sub-linear time algorithms are typically randomized, and provide only approximate solutions. In
fact, the property of a binary string having only zeros (and no ones) can be easily proved not to
be decidable by a (non-approximate) sub-linear time algorithm. Sub-linear time algorithms arise
naturally in the investigation of property testing.
1.2. Property testing:
In broad terms, property testing is the study of the following class of problems:
given the ability to perform (local) queries concerning a particular object (e.g., a function, or a
graph), the task is to determine whether the object has a predetermined (global) property (e.g.,
linearity or bipartiteness), or is far from having the property. The task should be performed by
inspecting only a small (possibly randomly selected) part of the whole object, where a small
probability of failure is allowed.
In order to define a property testing problem, we need to specify the types of queries that can be
performed by the algorithm and a distance measure between objects. The latter is required in
order to define what it means for an object to be far from having the property. We assume that
the algorithm is given a distance parameter ε.
In computer science, a property testing algorithm for a decision problem is an algorithm whose
query complexity to its input is much smaller than the instance size of the problem. Typically
property testing algorithms are used to decide if some mathematical object (such as a graph or a
boolean function) has a "global" property, or is "far" from having this property, using only a
small number of "local" queries to the object. For example, the following promise problem
admits an algorithm whose query complexity is independent of the instance size (for an arbitrary
constant ε > 0):
"Given a graph G on n vertices, decide if G is bipartite, or G cannot be made bipartite even after
removing an arbitrary subset of at most εn² edges of G."
Property testing algorithms are important in the theory of probabilistically checkable proofs.
1.2.1.Defi ni tion and var iants
Formally, a property testing algorithm with query complexity q(n) and proximity parameter ε for
a decision problem L is a randomized algorithm that, on input x (an instance of L) makes at most
q(|x|) queries to x and behaves as follows:
If x is in L, the algorithm accepts x with probability at least 2/3.
If x is ε-far from L, the algorithm rejects x with probability at least 2/3.
Here, "x is ε-far from L" means that the Hamming distance between x and any string in L is at
least ε|x|.
A property testing algorithm is said to have one-sided error if it satisfies the stronger condition
that the accepting probability for instances x in L is 1 instead of 2/3.
A property testing algorithm is said to be non-adaptive if it performs all its queries before it
"observes" any answers to previous queries. Such an algorithm can be viewed as operating in the
following manner. First the algorithm receives its input. Before looking at the input, using its
internal randomness, the algorithm decides which symbols of the input are to be queried. Next,
the algorithm observes these symbols. Finally, without making any additional queries (but
possibly using its randomness), the algorithm decides whether to accept or reject the input.
1.2.2.Features and limitations
The main efficiency parameter of a property testing algorithm is its query complexity, which is
the maximum number of input symbols inspected over all inputs of a given length (and all
random choices made by the algorithm). One is interested in designing algorithms whose query
complexity is as small as possible. In many cases the running time of property testing algorithms
is sublinear in the instance length. Typically, the goal is first to make the query complexity as
small as possible as a function of the instance size n, and then study the dependency on the
proximity parameter ε.
Unlike other complexity-theoretic settings, the asymptotic query complexity of property testing
algorithms is affected dramatically by the representation of instances. For example, when ε =
0.01, the problem of testing bipartiteness of dense graphs (which are represented by their
adjacency matrix) admits an algorithm of constant query complexity. In contrast, sparse graphs
on n vertices (which are represented by their adjacency list) require property testing algorithms of
query complexity Ω(√n).
The query complexity of property testing algorithms grows as the proximity parameter ε becomes
smaller for all non-trivial properties. This dependence on ε is necessary, as a change of fewer than
εn symbols in the input cannot be detected with constant probability using fewer than Ω(1/ε)
queries. Many interesting properties of dense graphs can be tested using query complexity that
depends only on ε and not on the graph size n. However, the query complexity can grow
enormously fast as a function of ε. For example, for a long time the best known algorithm for
testing if a graph does not contain any triangle had a query complexity which is a tower function
of poly(1/ε), and only in 2010 was this improved to a tower function of log(1/ε). One of the
reasons for this enormous growth in bounds is that many of the positive results for property
testing of graphs are established using the Szemerédi regularity lemma, which also has tower-
type bounds in its conclusions.
1.3 Data streaming algorithms:
In computer science, streaming algorithms are algorithms for processing data streams in which
the input is presented as a sequence of items and can be examined in only a few passes (typically
just one). These algorithms have limited memory available to them (much less than the input
size) and also limited processing time per item.
These constraints may mean that an algorithm produces an approximate answer based on a
summary or "sketch" of the data stream in memory.
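A minimal example of such a one-pass, small-memory algorithm is reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of a stream of unknown length; it is shown here as an illustration, not as an algorithm from this report:

```python
import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream whose
    # length is unknown in advance: one pass, O(k) memory.
    sample = []
    for t, item in enumerate(stream):
        if t < k:
            sample.append(item)          # fill the reservoir
        else:
            j = random.randrange(t + 1)  # keep item with prob. k/(t+1)
            if j < k:
                sample[j] = item
    return sample
```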
2. Related work
2.1. Randomized algorithms:
A randomized algorithm is an algorithm which employs a degree of randomness as part of its
logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its
behaviour, in the hope of achieving good performance in the "average case" over all possible
choices of random bits. Formally, the algorithm's performance will be a random variable
determined by the random bits; thus either the running time, or the output (or both) are random
variables.
One has to distinguish between algorithms that use the random input to reduce the expected
running time or memory usage, but always terminate with a correct result in a bounded amount
of time, and probabilistic algorithms, which, depending on the random input, have a chance of
producing an incorrect result (Monte Carlo algorithms) or fail to produce a result (Las Vegas
algorithms) either by signalling a failure or failing to terminate.
In the second case, random performance and random output, the term "algorithm" for a procedure
is somewhat questionable. In the case of random output, it is no longer formally effective.
However, in some cases, probabilistic algorithms are the only practical means of solving a
problem.
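A classic Monte Carlo algorithm of the kind just described is Freivalds' check that A·B = C: each trial costs O(n²) rather than the cost of a full matrix multiplication, and a wrong C survives a single trial with probability at most 1/2. The sketch below is illustrative:

```python
import random

def freivalds(A, B, C, trials=20):
    # Monte Carlo verification that A @ B == C for n x n matrices.
    # Each trial checks A(Br) == Cr for a random 0/1 vector r, in
    # O(n^2) time; a wrong C passes all trials with prob. <= 2**-trials.
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # certainly A @ B != C
    return True  # A @ B == C with high probability
```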
In common practice, randomized algorithms are approximated using a pseudorandom number
generator in place of a true source of random bits; such an implementation may deviate from the
expected theoretical behaviour.
2.2 Approximation algorithms:
In computer science and operations research, approximation algorithms are algorithms used to
find approximate solutions to optimization problems. Approximation algorithms are often
associated with NP-hard problems; since it is unlikely that there can ever be efficient polynomial-
time exact algorithms solving NP-hard problems, one settles for polynomial-time sub-optimal
solutions. Unlike heuristics, which usually only find reasonably good solutions reasonably fast,
one wants provable solution quality and provable run-time bounds. Ideally, the approximation is
optimal up to a small constant factor (for instance within 5% of the optimal solution).
Approximation algorithms are increasingly being used for problems where exact polynomial-
time algorithms are known but are too expensive due to the input size. A typical example for an
approximation algorithm is the one for vertex cover in graphs: find an uncovered edge and add
both endpoints to the vertex cover, until none remain. It is clear that the resulting cover is at most
twice as large as the optimal one. This is a constant factor approximation algorithm with a factor
of 2. NP-hard problems vary greatly in their approximability; some, such as the bin packing
problem, can be approximated within any factor greater than 1 (such a family of approximation
algorithms is often called a polynomial time approximation scheme or PTAS). Others are
impossible to approximate within any constant, or even polynomial factor unless P = NP, such as
the maximum clique problem. NP-hard problems can often be expressed as integer programs (IP)
and solved exactly in exponential time. Many approximation algorithms emerge from the linear
programming relaxation of the integer program. Not all approximation algorithms are suitable for
all practical applications. They often use IP/LP/Semidefinite solvers, complex data structures or
sophisticated algorithmic techniques which lead to difficult implementation problems. Also,
some approximation algorithms have impractical running times even though they are polynomial
time, for example O(n^2000). Yet the study of even very expensive algorithms is not a completely
theoretical pursuit as they can yield valuable insights. A classic example is the initial PTAS for
Euclidean TSP due to Sanjeev Arora which had prohibitive running time, yet within a year,
Arora refined the ideas into a linear time algorithm. Such algorithms are also worthwhile in some
applications where the running times and cost can be justified e.g. computational biology,
financial engineering, transportation planning, and inventory management. In such scenarios,
they must compete with the corresponding direct IP formulations.
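The vertex cover heuristic described earlier — repeatedly take an uncovered edge and add both endpoints — can be sketched as follows (a minimal illustration, not code from the report):

```python
def vertex_cover_2approx(edges):
    # Greedy matching-based 2-approximation: the edges whose endpoints
    # we add form a matching, and any cover must contain at least one
    # endpoint of each matched edge, so |cover| <= 2 * OPT.
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```

For the path 1-2-3-4 the optimum cover is {2, 3}; the greedy rule returns a cover of size at most twice that.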
Another limitation of the approach is that it applies only to optimization problems and not to
"pure" decision problems like satisfiability, although it is often possible to conceive optimization
versions of such problems, such as the maximum satisfiability problem (Max SAT).
Inapproximability has been a fruitful area of research in computational complexity theory since
the 1990 result of Feige, Goldwasser, Lovász, Safra and Szegedy on the inapproximability of
Independent Set. After Arora et al. proved the PCP theorem a year later, it has now been shown
that Johnson's 1974 approximation algorithms for Max SAT, Set Cover, Independent Set and
Coloring all achieve the optimal approximation ratio, assuming P ≠ NP.
2.3. Time Complexity Comparisons:
In computer science, the time complexity of an algorithm quantifies the amount of time taken by
an algorithm to run as a function of the length of the string representing the input. The time
complexity of an algorithm is commonly expressed using big O notation, which excludes
coefficients and lower order terms. When expressed this way, the time complexity is said to be
described asymptotically, i.e., as the input size goes to infinity. For example, if the time required
by an algorithm on all inputs of size n is at most 5n^3 + 3n, the asymptotic time complexity is
O(n^3).
Time complexity is commonly estimated by counting the number of elementary operations
performed by the algorithm, where an elementary operation takes a fixed amount of time to
perform. Thus the amount of time taken and the number of elementary operations performed by
the algorithm differ by at most a constant factor.
Since an algorithm's performance time may vary with different inputs of the same size, one
commonly uses the worst-case time complexity of an algorithm, denoted as T(n), which is
defined as the maximum amount of time taken on any input of size n. Time complexities are
classified by the nature of the function T(n). For instance, an algorithm with T(n) = O(n) is called
a linear time algorithm, and an algorithm with T(n) = O(2^n) is said to be an exponential time
algorithm.
Table 1 Time complexity comparison
3. Design
3.1. Algorithm for checking sortedness of a list in sublinear time:
Input: a list of n numbers x1, x2, ..., xn.
Question: is the list sorted, or ε-far from sorted?
Known upper bound: there are O((log n)/ε)-time testers for this problem.
Matching lower bound: Ω(log n) queries are required, for every constant ε ≤ 1/2, for every
1-sided error nonadaptive test.
A test has 1-sided error if it always accepts all YES instances.
A test is nonadaptive if its queries do not depend on answers to previous queries.
A pair (xi, xj) with i < j is violated if xi > xj.
Claim. A 1-sided error test can reject only if it finds a violated pair.
Proof: every sorted partial list can be extended to a sorted list.
The hard input ("Lola's" distribution) is uniform over the following log n lists:
Claim 1. All lists above are 1/2-far from sorted.
Claim 2. Every pair (xi, xj) is violated in exactly one list above.
Let the test pick a set Q of positions to query, and write its elements in order as a1 < a2 < ...
Our test must be correct, i.e., it must find a violated pair with probability ≥ 2/3
when the input is picked according to Lola's distribution.
Q contains a violated pair iff (ai, ai+1) is violated for some i.
By Claim 2, each such pair is violated in exactly one of the log n lists, so
Pr[Q contains a violated pair] ≤ (|Q| − 1) / log n.
If |Q| ≤ (2/3)·log n, then this probability is less than 2/3, so Ω(log n) queries are required.
3.1.1. Algorithm:
1. Pick an entry uniformly at random. Let x be the value in that entry.
2. Perform a binary search for x.
3. If the binary search reaches the entry picked in step 1 (i.e., x is found there), accept;
otherwise, reject.
Runtime: O(log n)
3.2 Successor search in a list in sublinear time:
Input: a sorted doubly linked list of numbers and a search value k
Output: first element greater than k
Data structure:
Figure 3.1. Data structure for doubly linked list
3.2.1 Search Algorithm:
Pick Θ(√n) random list elements.
Determine the closest predecessor and successor of k (within the random sample).
Perform a plain, linear search towards k (starting from these two elements).
Figure 3.2. Algorithm running status
7/29/2019 Sublinear Time Algorithm
16/2815
Good case:
Figure 3.3. Good case for algorithm
Worst case:
Figure 3.4. Worst case for algorithm
Runtime: O(√n) expected
3.3 Testing Membership of a String in a R.E.
For a fixed regular language L ⊆ {0,1}*, a testing algorithm should accept w.h.p. every
word w in L, and should reject w.h.p. every word w that differs in more than εn bits
(n = |w|) from every word in L. The algorithm can query any bit wi of w.
Figure 3.5. DFA for testing
3.3.1. Algorithm (simplified version):
Uniformly and independently select Θ(r/ε) indices 1 ≤ i ≤ n.
For each selected i, check that the substring wi ... wi+r/ε is feasible.
If any substring is infeasible then reject; otherwise accept.
Runtime: independent of n (for fixed L, polynomial in 1/ε).
3.4. Searching in a sorted list:
It is well known that if we can store the input in a sorted array, then we can solve various
problems on the input very efficiently. However, the assumption that the input array is sorted is
not natural in typical applications. Let us now consider a variant of this problem,
where our goal is to search for an element x in a sorted linked list containing n distinct elements.
Here, we assume that the n elements are stored in a doubly-linked list, each list element has access
to the next and preceding elements in the list, and the list is sorted (that is, if x follows y in the
list, then y < x). We also assume that we have access to all elements in the list, which, for example,
can correspond to the situation that all n list elements are stored in an array (but the array is not
sorted and we do not impose any order for the array elements). How can we find whether a given
number x is in our input or not. At first glance, it seems that since we do not have direct
access to the rank of any element in the list, this problem requires Ω(n) time. And indeed, if our
goal is to design a deterministic algorithm, then it is impossible to do the search in o(n)
time. However, if we allow randomization, then we can complete the search in O(√n) expected
time (and this bound is asymptotically tight). Let us first sample uniformly at random a set S of
Θ(√n) elements from the input. Since we have access to all elements in the list, we can select the
set S in O(√n) time. Next, we scan all the elements in S and in O(√n) time we can find two
elements p and q in S such that p ≤ x < q, and there is no element in S that is between p and q.
Observe that since the input consists of n distinct numbers, p and q are uniquely defined. Next, we
traverse the input list containing all the input elements starting at p until we find either the sought
key x or we find element q.
Lemma 1:
The algorithm above completes the search in expected O(√n) time. Moreover, no algorithm can
solve this problem in o(√n) expected time.
Proof. The running time of the algorithm is equal to O(√n) plus the number of input elements
between p and q. Since S contains Θ(√n) elements, the expected number of input elements
between p and q is O(n/|S|) = O(√n). This implies that the expected running time of the algorithm
is O(√n).
A matching lower bound of Ω(√n) expected time also holds.
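A sketch of the whole algorithm in Python; the array-of-cells representation, with `value[i]` and a successor index `nxt[i]`, is one illustrative encoding of "a linked list whose cells we can all access":

```python
import math
import random

def linked_list_search(value, nxt, head, x):
    # value[i]/nxt[i] store a sorted linked list laid out in an array
    # in arbitrary order (nxt[i] = array index of i's successor, -1 at
    # the end). Sample ~sqrt(n) cells, then walk from the sampled
    # predecessor closest to x. Expected running time O(sqrt(n)).
    n = len(value)
    s = max(1, int(math.sqrt(n)))
    start = head
    for i in random.sample(range(n), s):
        if value[i] <= x and value[i] >= value[start]:
            start = i  # best sampled predecessor of x so far
    # Plain linked-list walk from the best predecessor found.
    i = start
    while i != -1 and value[i] < x:
        i = nxt[i]
    return i != -1 and value[i] == x
```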
3.5. Geometry: Intersection of Two Polygons:
Let us consider a related problem, but this time in a geometric setting. Given two convex
polygons A and B in R², each with n vertices, determine if they intersect, and if so, then find a
point in their intersection.
It is well known that this problem can be solved in O(n) time, for example, by observing that it
can be described as a linear programming instance in 2-dimensions, a problem which is known to
have a linear-time algorithm. In fact, within the same time one can either find a point that is in the
intersection of A and B, or find a line L that separates A from B (actually, one can even find a
bitangent separating line L, i.e., a line separating A and B which intersects with each of A and B
in exactly one point). The question is whether we can obtain a better running time. The
complexity of this problem depends on the input representation. In the most powerful model, if
the vertices of both polygons are stored in an array in cyclic order, Chazelle and Dobkin showed
that the intersection of the polygons can be determined in logarithmic time. However, a standard
geometric representation assumes that the input is not stored in an array but rather A and B are
given by their doubly-linked lists of vertices such that each vertex has as its successor the next
vertex of the polygon in the clockwise order. Can we then test if A and B intersect?
Figure 1: (a) Bitangent line L separating CA and CB, and (b) the polygon PA.
Chazelle et al. [2] gave an O(√n)-time algorithm that reuses the approach discussed above for
searching in a sorted list. Let us first sample uniformly at random Θ(√n) vertices from each of A
and B, and let CA and CB be the convex hulls of the sampled point sets for the polygons A and
B, respectively. Using the linear-time algorithm mentioned above, in O(√n) time we can check if
CA and CB intersect. If they do, then the algorithm gives us a point that lies in the intersection
of CA and CB, and hence this point lies also in the intersection of A and B. Otherwise, let L be the
bitangent separating line returned by the algorithm. Let a and b be the points in L that belong to
A and B, respectively. Let a1 and a2 be the two vertices adjacent to a in A. We will now define a
new polygon PA. If neither a1 nor a2 is on the CA side of L, then we define PA to be empty.
Otherwise, exactly one of a1 and a2 is on the CA side of L; let it be a1. We define polygon PA by
walking from a to a1 and then continuing along the boundary of A until we cross L again
(see Figure 1(b)). In a similar way we define polygon PB. Observe that the expected size of each
of PA and PB is at most O(√n). It is easy to see that A and B intersect if and only if either A
intersects PB or B intersects PA. We only consider the case of checking if A intersects PB. We
first determine if CA intersects PB. If yes, then we are done. Otherwise, let LA be a bitangent
separating line that separates CA from PB. We use the same construction as above to determine a
subpolygon QA of A that lies on the PB side of LA. Then, A intersects PB if and only if QA
intersects PB. Since QA has expected size O(√n) and so does PB, testing the intersection of these
two polygons can be done in O(√n) expected time. Therefore, by our construction above, we
have reduced the problem of determining if two polygons of size n intersect to a constant number
of instances of determining if two polygons of expected size O(√n) intersect. This leads to the
following lemma.
Lemma 2 [2] The problem of determining whether two convex n-gons intersect can be solved in
O(√n) expected time, which is asymptotically optimal. Chazelle et al. [2] gave not only this
result, but they also showed how to apply a similar approach to design a number of sublinear-
time algorithms for some basic geometric problems. For example, one can extend the result
discussed above to test the intersection of two convex polyhedra in R³ with n vertices in O(√n)
expected time. One can also approximate the volume of an n-vertex convex polytope to within a
relative error ε > 0 in expected time O(√n/ε). Or even, for a pair of points on the boundary of
a convex polytope P with n vertices, one can estimate the length of an optimal shortest path
outside P between the given points in O(√n) expected time. In all the results mentioned above,
the input objects have been represented by a linked structure: either every point has access to its
adjacent vertices in the polygon in R², or the polytope is defined by a doubly-connected edge list,
or so. These input representations are standard in computational geometry, but a natural question
is whether this is necessary to achieve sublinear-time algorithms: what can we do if the input
polygon/polytope is represented by a set of points and no additional structure is provided to the
algorithm? In such a scenario, it is easy to see that no o(n)-time algorithm can solve exactly any
of the problems discussed above. That is, for example, to determine if two polygons with n
vertices intersect one needs Ω(n) time. However, we can still obtain some approximation to this
problem, one which is described in the framework of property testing. Suppose that we relax our
task and instead of determining if two (convex) polytopes A and B in Rd intersect, we just want
to distinguish between two cases: either A and B are intersection-free, or one has to significantly
modify A and B to make them intersection-free. The definition of the notion of "significantly
modify" may depend on the application at hand, but the most natural characterization would be to
remove at least εn points in A and B, for an appropriate parameter ε (see [3] for a discussion
about other geometric characterizations). Czumaj et al. [4] gave a simple algorithm that, for any
ε > 0, can distinguish between the case when A and B do not intersect, and the case when at least
εn points have to be removed from A and B to make them intersection-free: the algorithm returns
the outcome of a test of whether a random sample of O((d/ε) log(d/ε)) points from A intersects
with a random sample of O((d/ε) log(d/ε)) points from B.
3.6. Sublinear Time Algorithms for Graph Problems
In the previous section, we introduced the concept of sublinear-time algorithms and we presented
two basic sublinear-time algorithms for geometric problems. In this section, we will discuss
sublinear-time algorithms for graph problems. Our main focus is on sublinear-time algorithms for
graphs, with special emphasis on sparse graphs represented by adjacency lists, where
combinatorial algorithms are sought.
3.6.1. Approximating the Average Degree
Assume we have access to the degree distribution of the vertices of an undirected connected
graph G = (V, E), i.e., for any vertex v ∈ V we can query for its degree. Can we achieve a good
approximation of the average degree in G by looking at a sublinear number of vertices? At first
sight, this seems to be an impossible task: approximating the average degree appears equivalent
to approximating the average of a set of n numbers with values between 1 and n − 1, which is
not possible in sublinear time. However, Feige [5] proved that one can approximate the
average degree in O(√n/ε) time within a factor of 2 + ε. The difficulty with approximating the
average of a set of n numbers can be illustrated with the following example. Assume that almost
all numbers in the input set are 1 and a few of them are n − 1. To approximate the average, we
need to approximate how many occurrences of n − 1 exist. If there is only a constant number of
them, we can do this only by looking at Ω(n) numbers in the set. So the problem is that these
large numbers can hide in the set, and we cannot give a good approximation unless we can find
at least some of them. Why is the problem less difficult if, instead of an arbitrary set of numbers,
we have a set of numbers that are the vertex degrees of a graph? For example, we could still have
a few vertices of degree n − 1. The point is that in this case any edge incident to such a vertex
can be seen at another vertex. Thus, even if we do not sample a vertex with high degree, we will
see all of its incident edges at other vertices in the graph. Hence, vertices with a large degree
cannot hide.
We will sketch a proof of a slightly weaker result than that originally proven by Feige [5]. Let d
denote the average degree in G = (V,E) and let dS denote the random variable for the average
degree of a set S of s vertices chosen uniformly at random from V. We will show that if we set
s ≥ √n/ε^O(1) for an appropriate constant in the exponent, then dS ≥ (1/2 − ε)·d with probability
at least 1 − ε/64. Additionally, we observe that Markov's inequality immediately implies that
dS ≤ (1 + ε)·d with probability at least 1 − 1/(1 + ε) ≥ ε/2. Therefore, our algorithm will pick
8/ε sets Si, each of size s, and output the one with the smallest average degree. The probability
that all of the sets Si have too high an average degree is at most (1 − ε/2)^(8/ε) ≤ 1/8. The
probability that one of them has too small an average degree is at most (8/ε)·(ε/64) = 1/8.
Hence, the output value satisfies both inequalities with probability at least 3/4. By replacing
ε with ε/2, this yields a (2 + ε)-approximation algorithm.
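A minimal sketch of this sampling scheme follows (illustrative only; the function names are assumptions, and the sample size √n/ε and the 8/ε repetitions are taken from the discussion above):

```python
import random

def approx_average_degree(degree, n, eps, rng=random):
    # Sampling scheme from the text: pick 8/eps random vertex sets, each of size
    # about sqrt(n)/eps, and return the smallest sample average degree.
    # `degree` is an oracle: degree(v) returns the degree of vertex v in range(n).
    s = max(1, int(n ** 0.5 / eps))
    best = float("inf")
    for _ in range(max(1, int(8 / eps))):
        sample = [degree(rng.randrange(n)) for _ in range(s)]
        best = min(best, sum(sample) / len(sample))
    return best
```

Taking the minimum over the repetitions guards against overestimates, while the analysis below shows a single sample rarely underestimates by more than a factor of about 2.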
Now, our goal is to show that with high probability one does not underestimate the average
degree too much. Let H be the set of the √(εn) vertices with highest degree in G and let L = V \ H
be the set of the remaining vertices. We first argue that the sum of the degrees of the vertices in L
is at least (1/2 − ε) times the sum of the degrees of all vertices. This can be seen by
distinguishing between edges incident to a vertex from L and edges within H. Every edge incident
to a vertex from L contributes at least 1 to the sum of degrees of vertices in L, which is fine, as
this is at least 1/2 of its full contribution of 2. So the only edges that may cause problems are the
edges within H. However, since |H| = √(εn), there can be at most εn such edges, which is small
compared to the overall number of edges (which is at least n − 1, since the graph is connected).
Now, let dH be the degree of a vertex with the smallest degree in H. Since we aim at giving a
lower bound on the average degree of the sampled vertices, we can safely assume that all
sampled vertices come from the set L. We know that each vertex in L has a degree between 1 and
dH. Let Xi, 1 ≤ i ≤ s, be the random variable for the degree of the i-th vertex from S. Since the
Xi are independent and take values in [1, dH], Hoeffding's inequality yields

    Pr[ X1 + ... + Xs ≤ E[X1 + ... + Xs] − t ] ≤ exp(−2t² / (s·dH²)).

We know that the average degree is at least dH·|H|/n, because every vertex in H has degree at
least dH. Hence, the average degree of a vertex in L is at least (1/2 − ε)·dH·|H|/n. This just
means E[Xi] ≥ (1/2 − ε)·dH·|H|/n, and by linearity of expectation, E[X1 + ... + Xs] ≥
s·(1/2 − ε)·dH·|H|/n.
This implies that, for our choice of s, with high probability we have dS ≥ (1/2 − ε)·d. Feige [5]
showed a result that is stronger with respect to the dependence on ε: roughly, a (2 + ε)-
approximation of the average degree can be computed with O(√(n/d0)·poly(1/ε)) degree queries,
where d0 is any given lower bound on the average degree.
3.6.2. Minimum Spanning Trees
One of the most fundamental graph problems is to compute a minimum spanning tree. Since a
minimum spanning tree has size linear in the number of vertices, no sublinear-time algorithm for
sparse graphs can exist. It is also known that no constant-factor approximation algorithm with
o(n²) query complexity exists for dense graphs (even in metric spaces) [6]. Given these facts, it is
somewhat surprising that it is possible to approximate the cost of a minimum spanning tree in
sparse graphs [7], as well as in metric spaces [8], to within a factor of 1 + ε. In the following, we
will explain the algorithm for sparse graphs by Chazelle et al. [7]. We will prove a slightly
weaker result than in [7]. Let G = (V, E) be an undirected connected weighted graph with
maximum degree D and integer edge weights from {1, ..., W}. We assume that the graph is
given in adjacency list representation, i.e., for every vertex v there is a list of its at most D
neighbors, which can be accessed from v. Furthermore, we assume that the vertices are stored in
an array such that it is possible to select a vertex uniformly at random. We assume also that the
values of D and W are known to the algorithm.
The main idea behind the algorithm is to express the cost of a minimum spanning tree in terms of
the number of connected components in certain auxiliary subgraphs of G. Then, one runs a
randomized algorithm to estimate the number of connected components in each of these
subgraphs.
To start with basic intuition, let us assume that W = 2, i.e., the graph has only edges of weight 1
or 2. Let G(1) = (V, E(1)) denote the subgraph that contains all edges of weight (at most) 1 and let
c(1) be the number of connected components in G(1). It is easy to see that the minimum spanning
tree has to link these connected components by edges of weight 2. Since any connected
component in G(1) can be spanned by edges of weight 1, any minimum spanning tree of G has
c(1) − 1 edges of weight 2 and n − 1 − (c(1) − 1) edges of weight 1. Thus, the weight of a minimum
spanning tree is

    (n − 1 − (c(1) − 1)) + 2·(c(1) − 1) = n − 2 + c(1).

Next, let us consider an arbitrary integer value for W. Defining G(i) = (V, E(i)), where E(i) is the
set of edges in G with weight at most i, one can generalize the formula above to obtain that the
cost MST of a minimum spanning tree can be expressed as

    MST = n − W + Σ_{i=1}^{W−1} c(i),

where c(i) denotes the number of connected components in G(i).
This gives a simple algorithm: estimate the number of connected components c(i) of each subgraph G(i), 1 ≤ i ≤ W − 1, and output n − W plus the sum of these estimates.
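Instantiating the identity with an exact component count illustrates it directly; in the sublinear algorithm, the exact count is replaced by a sampling-based estimate (a sketch; all names are illustrative):

```python
def count_components(n, edges):
    # Exact connected-component count via union-find; stands in for the sublinear
    # estimator, to illustrate the identity MST = n - W + sum_{i=1}^{W-1} c(i).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return sum(1 for v in range(n) if find(v) == v)

def mst_weight_via_components(n, weighted_edges, W):
    # MST weight of a connected graph with integer weights in {1, ..., W}:
    # n - W plus, for each i = 1..W-1, the number of components of the subgraph
    # G(i) that keeps only edges of weight <= i.
    total = n - W
    for i in range(1, W):
        Ei = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, Ei)
    return total
```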
Thus, the key question that remains is how to estimate the number of connected components.
This is done by a randomized algorithm that samples vertices and explores only a small
neighborhood of each sampled vertex.
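One standard estimator of this kind can be sketched as follows (a sketch, not the exact procedure of [7]; the 2/ε truncation threshold and all names are assumptions):

```python
import random
from collections import deque

def approx_components(n, adj, eps, s, rng=random):
    # For each of s random start vertices u, run a BFS that stops once at least
    # 2/eps vertices have been seen; estimate 1/n_u (n_u = size of u's component)
    # by 1/min(#seen, 2/eps). Since the sum over all v of 1/n_v equals the number
    # of components, n times the sample mean estimates it to within about eps*n/2.
    cap = int(2 / eps)
    total = 0.0
    for _ in range(s):
        u = rng.randrange(n)
        seen = {u}
        queue = deque([u])
        while queue and len(seen) < cap:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        total += 1.0 / min(len(seen), cap)
    return n * total / s
```

Each truncated BFS touches O(D/ε) edges in a graph of maximum degree D, so the total running time is independent of n.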
To analyze this algorithm, let us fix an arbitrary connected component C and let |C| denote the
number of vertices in C. Let c denote the number of connected components in G. Since for every
component the terms 1/|C|, summed over its vertices, add up to exactly 1, we can write

    c = Σ over components C of ( Σ over v in C of 1/|C| ) = Σ over v in V of 1/n_v,

where n_v denotes the number of vertices in the connected component of v. By linearity of
expectation, the estimator ĉ computed by the algorithm satisfies E[ĉ] ≈ c, up to the small
additive error introduced by truncating the exploration of large components. To show that ĉ is
concentrated around its expectation, we apply Chebyshev's inequality. Since bi is an indicator
random variable, we have Var[bi] ≤ E[bi²] = E[bi].
With this bound for Var[ĉ], we can use Chebyshev's inequality to obtain

    Pr[ |ĉ − E[ĉ]| ≥ εn/2 ] ≤ Var[ĉ] / (εn/2)²,

which is a small constant for a sufficiently large number of samples. From this it follows that one
can approximate the number of connected components within additive error εn in a graph with
maximum degree D in time polynomial in D and 1/ε (and independent of n), with constant
probability of success. The following somewhat stronger result has been obtained in [7]: the
weight of the minimum spanning tree of a graph with maximum degree D and weights in
{1, ..., W} can be approximated to within a factor of 1 + ε in time O(D·W·ε⁻²·log(D·W/ε)).
Notice that the obtained running time is independent of the input size n.
3.7. Sublinear-Time Approximation Algorithms for Problems in Metric Spaces
One of the most widely considered models in the area of sublinear-time approximation
algorithms is the distance oracle model for metric spaces. In this model, the input of an algorithm
is a set P of n points in a metric space (P, d). We assume that it is possible to compute the
distance d(p, q) between any pair of points p, q in constant time. Equivalently, one could assume
that the algorithm is given access to the n × n distance matrix of the metric space, i.e., we have
oracle access to the matrix of a weighted undirected complete graph. Since the full description
size of this matrix is O(n²), we will call any algorithm with o(n²) running time a sublinear
algorithm. Which problems can and cannot be approximated in sublinear time in the distance
oracle model? One of the most basic problems is to find (an approximation of) the shortest or the
longest pairwise distance in the metric space. It turns out that the shortest distance cannot be
approximated. The counterexample is a uniform metric (all distances are 1) with one distance
set to some very small value ε. Obviously, it requires Ω(n²) time to find this single short
distance. Hence, no sublinear-time approximation algorithm for the shortest distance problem
exists. What about the longest distance? In this case, there is a very simple 1/2-approximation
algorithm, which was first observed by Indyk [6]. The algorithm chooses an arbitrary point p and
returns its furthest neighbor q. Let r, s be the furthest pair in the metric space. We claim that
d(p, q) ≥ (1/2)·d(r, s). By the triangle inequality, we have d(r, p) + d(p, s) ≥ d(r, s). This
immediately implies that either d(p, r) ≥ (1/2)·d(r, s) or d(p, s) ≥ (1/2)·d(r, s); since q is the
furthest neighbor of p, we have d(p, q) ≥ max{d(p, r), d(p, s)}, which shows the approximation
guarantee. In the following, we present some recent sublinear-time algorithms for a few
optimization problems in metric spaces.
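The furthest-neighbor 1/2-approximation above translates directly into code (a sketch; `dist` is assumed to be a constant-time distance oracle):

```python
def approx_diameter(points, dist):
    # Indyk's 1/2-approximation of the largest pairwise distance: fix an arbitrary
    # point p and return the distance to its furthest neighbor. By the triangle
    # inequality this is at least half the true diameter, using only n - 1 queries.
    p = points[0]
    return max(dist(p, q) for q in points[1:])
```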
3.7.1. Clustering via Random Sampling
The problem of clustering large data sets into subsets (clusters) of similar characteristics is one
of the most fundamental problems in computer science, operations research, and related fields.
Clustering problems arise naturally in various massive-dataset applications, including data
mining, bioinformatics, pattern classification, etc. In this section, we will discuss uniform
random sampling for clustering problems in metric spaces, as analyzed in two recent papers
[9, 10].

Figure: (a) a set of points in a metric space; (b) its 3-clustering (white points correspond to the
center points); (c) the distances used in the cost of the 3-median.
Let us consider a classical clustering problem known as the k-median problem. Given a finite
metric space (P, d), the goal is to find a set C ⊆ P of k centers (points in P) that minimizes the
cost

    Σ over p in P of d(p, C),

where d(p, C) denotes the distance from p to the nearest point in C. The k-median
problem has been studied in numerous research papers. It is known to be NP-hard, and there exist
constant-factor approximation algorithms running in Õ(n·k) time. In two recent papers [9, 10],
the authors asked about the quality of the uniform random sampling approach to k-median, that
is, about the quality of the following generic scheme: choose a multiset S of s points from P
uniformly at random, run an α-approximation algorithm for k-median on the sample S, and
return the resulting set of k centers.
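A sketch of this generic scheme (illustrative; the brute-force search over k-subsets of the sample merely stands in for the α-approximation algorithm run on the sample, and all names are assumptions):

```python
import random
from itertools import combinations

def kmedian_cost(points, centers, dist):
    # k-median objective: sum of distances to the nearest center.
    return sum(min(dist(p, c) for c in centers) for p in points)

def sample_kmedian(points, k, s, dist, rng=random):
    # Generic random-sampling scheme: draw a uniform sample S of size s and solve
    # k-median on S; the exact brute-force solver on the sample plays the role of
    # the alpha-approximation algorithm.
    S = [rng.choice(points) for _ in range(s)]
    best = min(combinations(S, k), key=lambda C: kmedian_cost(S, C, dist))
    return list(best)
```

The point of the analysis in [9, 10] is that the centers computed on the small sample S remain a good solution for the whole point set P.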
The goal is to show that a sublinear-size sample set S already suffices to obtain a good
approximation guarantee. Furthermore, as observed in [10], in order to have any guarantee of the
approximation, one has to consider the quality of the approximation as a function of the diameter
of the metric space. Therefore, we consider a model in which the diameter Δ of the metric space
is given, that is, with d : P × P → [0, Δ].

4. Limitations: What Cannot Be Done in Sublinear Time
The algorithms discussed in the previous sections may suggest that many optimization problems
in metric spaces have sublinear-time algorithms. However, it turns out that the problems listed in
the previous sections are more like exceptions than the norm. Indeed, most problems have a
trivial lower bound that excludes sublinear-time algorithms. We have
already mentioned in Section 3.7 that the problem of approximating the cost of the lightest edge
in a finite metric space (P, d) requires Ω(n²) time, even if randomization is allowed. Other
problems for which no sublinear-time algorithms are possible include estimating the cost of a
minimum-cost matching, the cost of a minimum-cost bi-chromatic matching, the cost of
minimum non-uniform facility location, and the cost of k-median for k = n/2; all these problems
require Ω(n²) (randomized) time to estimate the cost of their optimal solution to within any
constant factor λ.
To illustrate the lower bounds, we give two instances of metric spaces which are
indistinguishable by any o(n²)-time algorithm, but for which the cost of the minimum-cost
matching in one instance is greater than λ times the cost in the other (see Figure 3). Consider a
metric space (P, d) with 2n points: n points in L and n points in R. Take a random perfect
matching M between the points in L and R, and then choose an edge e ∈ M at random. Next,
define the distance in (P, d) as follows:
d(e) is either 1 or B, where we set B = n·(λ − 1) + 2,
for any e′ ∈ M \ {e}, set d(e′) = 1, and
for any other pair of points p, q ∈ P not connected by an edge from M, set d(p, q) = n³.
It is easy to see that both instances properly define a metric space (P, d). For such problem
instances, the cost of the minimum-cost matching depends on the choice of d(e): if d(e) = B,
then the cost is n − 1 + B > λ·n, and if d(e) = 1, then the cost is n. Hence, any λ-factor
approximation algorithm for the matching problem must distinguish between these two problem
instances. However, this requires determining whether there is an edge of length B, which is
known to require Ω(n²) time, even if a randomized algorithm is used.
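The gap between the two instances can be checked directly on small examples (a sketch; the random matching is taken to be the identity for simplicity, and matching costs are computed by brute force):

```python
from itertools import permutations

def min_cost_matching(cost):
    # Brute-force minimum-cost perfect matching between L = R = range(n).
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def build_instance(n, lam, special_weight):
    # The construction above, with the random matching taken as the identity:
    # matched pairs are at distance 1, except one special pair at distance
    # `special_weight` (either 1 or B = n*(lam - 1) + 2); every other cross
    # pair is at distance n**3.
    cost = [[n ** 3] * n for _ in range(n)]
    for i in range(n):
        cost[i][i] = 1
    cost[0][0] = special_weight
    return cost
```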
5. Conclusions
It would be impossible to present a complete picture of the large body of research in the area of
sublinear-time algorithms in such a report. In this report, my main goal was to give some flavor
of the area, of the types of results achieved, and of the techniques used. For more details, we
refer to the original works listed in the references. I did not discuss two important areas that are
closely related to sublinear-time algorithms: property testing and data streaming algorithms.
References:
[1] A. Czumaj and C. Sohler. Sublinear-time algorithms. Bulletin of the EATCS, 89: 23–47,
June 2006.
[2] B. Chazelle, D. Liu, and A. Magen. Sublinear geometric algorithms. SIAM Journal on
Computing, 35(3): 627–646, 2006.
[3] A. Czumaj and C. Sohler. Property testing with geometric queries. Proceedings of the 9th
Annual European Symposium on Algorithms (ESA), pp. 266–277, 2001.
[4] A. Czumaj, C. Sohler, and M. Ziegler. Property testing in computational geometry.
Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 155–166, 2000.
[5] U. Feige. On sums of independent random variables with unbounded variance and estimating
the average degree in a graph. SIAM Journal on Computing, 35(4): 964–984, 2006.
[6] P. Indyk. Sublinear time algorithms for metric space problems. Proceedings of the 31st
Annual ACM Symposium on Theory of Computing (STOC), pp. 428–434, 1999.
[7] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning tree
weight in sublinear time. SIAM Journal on Computing, 34(6): 1370–1379, 2005.
[8] A. Czumaj and C. Sohler. Estimating the weight of metric minimum spanning trees in
sublinear time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing
(STOC), pp. 175–183, 2004.
[9] A. Czumaj and C. Sohler. Sublinear-time approximation for clustering via random sampling.
Proceedings of the 31st Annual International Colloquium on Automata, Languages and
Programming (ICALP), pp. 396–407, 2004.
[10] N. Mishra, D. Oblinger, and L. Pitt. Sublinear time approximate clustering. Proceedings of
the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447, 2001.