
## RANDOM SAMPLING IN GRAPH OPTIMIZATION PROBLEMS

a dissertation

submitted to the department of computer science

and the committee on graduate studies

of stanford university

in partial fulfillment of the requirements

for the degree of

doctor of philosophy

By

David R. Karger

January 1995

© Copyright 1995 by David R. Karger

All Rights Reserved


I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Rajeev Motwani (Principal Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Serge Plotkin

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Andrew Goldberg

**Approved for the University Committee on Graduate Studies:**

## Abstract

The representative random sample is a central concept of statistics. It is often possible to gather a great deal of information about a large population by examining a small sample randomly drawn from it. This approach has obvious advantages in reducing the investigator's work, both in gathering and in analyzing the data.

We apply the concept of a representative sample to combinatorial optimization. Our general technique is to generate small random representative subproblems and solve them in lieu of the original ones, producing approximately correct answers which may then be refined to correct ones at little additional cost. Our focus is optimization problems on undirected graphs. Highlights of our results include:

- The first randomized linear-time minimum spanning tree algorithm;
- A randomized minimum cut algorithm with running time roughly O(n²), as compared to previous roughly O(n³) time bounds, as well as the first algorithm for finding all approximately minimal cuts and multiway cuts;
- An efficient parallelization of the minimum cut algorithm, providing the first parallel (RNC) algorithm for minimum cuts;
- The first proof that minimum cuts can be found deterministically in parallel (NC);
- Reliability theorems tightly bounding the connectivities and bandwidths in networks with random edge failures, and a fully polynomial-time approximation scheme for estimating all-terminal reliability, the probability that a particular graph remains connected under edge failures;
- A linear-time algorithm for approximating minimum cuts to within (1 + ε), a linear-processor parallel algorithm for 2-approximation, and fast algorithms for approximating s-t minimum cuts and maximum flows;
- For the NP-complete problem of designing minimum-cost networks satisfying specified connectivity requirements (a generalization of minimum spanning trees), significantly improved polynomial-time approximation bounds, from O(log n) to 1 + o(1), for many such problems;
- For coloring 3-colorable graphs, improvements in the approximation bounds from O(n^{3/8}) to O(n^{1/4}), and even better bounds for sparse graphs;
- An analysis of random sampling in matroids.

## Acknowledgements

Many people helped me to bring this thesis to fruition. First among them is Rajeev Motwani. As my advisor, he made himself frequently available day and night to help me through research snags, clarify my ideas and explanations, and provide advice on the larger questions of academic life. My other reading committee members, Serge Plotkin and Andrew Goldberg, have had the roles of informal advisors, giving me far more time and assistance than a non-advisee had a right to expect. I'd like especially to thank Serge for the time spent as a sounding board on some of the first ideas on random sampling for cuts and minimum spanning trees which grew into this thesis. Earlier, Harry Lewis and Umesh Vazirani were the people who, in my sophomore year, showed me what an exciting field theoretical computer science could be.

I must also thank my coauthors, Douglas Cutting, Perry Fizzano, Philip Klein, Daphne Koller, Rajeev Motwani, Noam Nisan, Michal Parnas, Jan Pedersen, Steven Phillips, G. D. S. Ramkumar, Clifford Stein, Robert Tarjan, Eric Torng, John Tukey, and Joel Wein. Each has taught me a great deal about research and about writing about research. I must especially thank Daphne Koller, who by giving generously of her time and comments has done more than anyone else to influence my writing (so it's her fault) and show me how to strive for a good presentation. She and Steven Phillips also made sure I got off to a fast start by helping me write my first paper in my first year at Stanford. Thanks also to those who have commented on drafts of various parts of this work, including David Applegate, Don Coppersmith, Tom Cormen, Hal Gabow, Michel Goemans, Robert Kennedy, Philip Klein, Michael Lomonosov, László Lovász, Jeffrey Oldham, Jan Pedersen, Satish Rao, John Tukey, David Williamson, and David Zuckerman.

Others in the community gave helpful advice on questions ranging from tiny details of equation manipulation to big questions of my place in the academic community. Thanks to Zvi Galil, David Johnson, Richard Karp, Don Knuth, Tom Leighton, Charles Leiserson, John Mitchell, David Shmoys, Eva Tardos, Robert Tarjan, and Jeffrey Ullman.

Thanks to my group at Xerox PARC, Doug Cutting, Jan Pedersen, and John Tukey, for giving me an eye for the relatively practical side of computer science, as well as a window on the exciting questions of information retrieval. Although none of our joint work appears in this thesis, my experience with them reminds me that an important end goal of algorithms is for them to be useful.

I wouldn't have enjoyed my stay at Stanford half as much had it not been for the students who made it such a fun place: Edith, who convinced me that the best way to spend conferences is climbing mountains; Donald, for keeping the office well stocked with food and books; Jeff, for rebooting my machine often; Daphne, who was willing to spend hours on one of my conjectures for the reward of being able to tell me I was wrong; Michael, Michael, Sanjeev, Steven, Eric, Ram, Alan, Kathleen, Robert…

Most of all, I must thank my family. My parents, for establishing my love of books and learning; my siblings, for their subjection to my experiments in teaching. My wife and son, for putting up with my mental disappearances as I chased down a stray thought and my physical disappearances as I panicked over a paper deadline or traveled to a conference, and for the constant love and support that took me through many times of doubt and worry.

They laid the foundation on which this thesis rests.

## CHAPTER 1. INTRODUCTION

The representative random sample is a central concept of statistics. It is often possible to gather a great deal of information about a large population by examining a small sample randomly drawn from it. This approach has obvious advantages in reducing the investigator's work, both in gathering and in analyzing the data.

We apply the concept of a representative sample to combinatorial optimization problems on graphs. The graph is one of the most common structures in computer science, modeling among other things roads, communication and transportation networks, electrical circuits, relationships between individuals, corporate hierarchies, hypertext collections, tournaments, resource allocations, project plans, database and program dependencies, and parallel architectures.

Given an optimization problem, it may be possible to generate a small representative subproblem by random sampling; perhaps the most natural sample from a graph is a random subset of its edges. Intuitively, such a subproblem should form a microcosm of the larger problem. Our goal is to examine the subproblem and use it to glean information about the original problem. Since the subproblem is small, we can spend proportionally more time examining it than we would spend examining the original problem. In one approach we use frequently, an optimal solution to the subproblem may be a nearly optimal solution to the problem as a whole. In some situations, such an approximation might be sufficient. In other situations, it may be easy to refine this good solution into a truly optimal solution.


### 1.1 Overview of Results

We show how these ideas can be used in several ways on problems of varying degrees of difficulty. For the "easy to solve" minimum spanning tree problem, where a long line of research has resulted in ever closer to linear-time algorithms, random sampling gives the final small increment to a truly linear-time algorithm. In harder problems it improves running times by a more significant factor. For example, we improve the time needed to find minimum cuts from roughly O(mn) (O(n³) on dense graphs) to roughly O(n²), and give the first parallel algorithm for the problem. Finally, addressing some very hard NP-complete problems such as network design and graph coloring, where finding an exact solution is thought to be hopeless, we use random sampling to give better approximation algorithms than were previously known. Our focus is optimization problems on undirected graphs.

#### 1.1.1 Random Selection

Perhaps the simplest random sample is a single individual. We investigate the use of random selection. The intuition behind this idea is that a single randomly selected individual is probably a "typical" representative of the entire population. This is the idea behind Quicksort [91], where the assumption is that the randomly selected pivot will be neither extremely large nor extremely small, and will therefore serve to separate the remaining elements into two roughly equal-sized groups.
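As an illustration of the random-pivot intuition (not part of the thesis itself), a randomized selection routine in the Quicksort style can be sketched in a few lines; the function name `quickselect` is mine:

```python
import random

def quickselect(items, k):
    """Return the element of 0-based rank k in items."""
    # A random pivot is probably "typical": neither extremely large
    # nor extremely small, so it splits the input into two roughly
    # equal-sized groups and the recursion shrinks quickly.
    pivot = random.choice(items)
    lo = [x for x in items if x < pivot]
    hi = [x for x in items if x > pivot]
    eq = len(items) - len(lo) - len(hi)   # copies of the pivot
    if k < len(lo):
        return quickselect(lo, k)
    if k < len(lo) + eq:
        return pivot
    return quickselect(hi, k - len(lo) - eq)
```

Whatever pivots are chosen, the answer is exact; randomness affects only the running time, which is linear in expectation.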

We apply this idea in a new algorithm for finding minimum cuts in undirected graphs. A cut is a partition of the graph vertices into two groups; the value of the cut is the number (or total weight) of edges with one endpoint in each group. The minimum cut problem is to identify a cut of minimum value. This problem is of great importance in analyzing network reliability, and also plays a central role in solving traveling salesman problems, compiling on parallel machines, and identifying topics in hypertext collections.

The idea behind our Recursive Contraction Algorithm is quite simple: a random edge is unlikely to cross the minimum cut, so its endpoints are probably on the same side. If we merge two vertices on the same side of the minimum cut, then we shall not affect the minimum cut but will reduce the number of graph vertices by one. Therefore, we can find the minimum cut by repeatedly selecting a random edge and merging its endpoints until only two vertices remain and the minimum cut becomes obvious. This leads to an algorithm that is strongly polynomial and runs in Õ(n²) time on an n-vertex, m-edge graph, a significant improvement on the previous Õ(mn) bounds. With high probability the algorithm finds all minimum cuts. The parallel version of our algorithm runs in polylogarithmic time using n² processors on a PRAM (Parallel Random Access Machine). It thus provides the first proof that the minimum cut problem with arbitrary edge weights can be solved in RNC. A derandomization of this algorithm provides the first proof that the minimum cut problem can be solved in NC. Our algorithm can be modified to find all approximately minimum cuts and analyze the reliability of a network. Parts of this work are joint with Clifford Stein [110].
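The contraction idea can be sketched in a few lines of Python. This is an illustrative, unweighted version that simply repeats independent contraction trials; it is not the Recursive Contraction Algorithm itself, which shares work between trials to achieve the Õ(n²) bound. The function names are mine:

```python
import random

def contract_once(edges, n):
    """One trial: contract random edges until two super-vertices remain."""
    # Union-find records which original vertices have been merged.
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    groups = n
    while groups > 2:
        u, v = edges[random.randrange(len(edges))]
        ru, rv = find(u), find(v)
        if ru != rv:            # skip edges already inside a super-vertex
            parent[ru] = rv     # merge: the endpoints were probably on
            groups -= 1         # the same side of the minimum cut
    # Value of the resulting 2-way cut: edges crossing the two groups.
    return sum(1 for u, v in edges if find(u) != find(v))

def min_cut(edges, n, trials=200):
    # Each trial succeeds with probability at least 2/(n(n-1)), so
    # repeating drives the failure probability down exponentially.
    return min(contract_once(edges, n) for _ in range(trials))
```

Every trial outputs the value of *some* cut, so the minimum over trials never underestimates; with enough trials it equals the minimum cut with high probability.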

#### 1.1.2 Random Sampling

A more general use of random sampling is to generate small representative subproblems.

Floyd and Rivest [58] use this approach in a fast and elegant algorithm for finding the median of an ordered set. They select a small random sample of elements from the set and show how inspecting this sample gives a very accurate estimate of the value of the median. It is then easy to find the actual median by examining only those elements close to the estimate. This very simple-to-implement algorithm uses fewer comparisons than any other known median-finding algorithm.
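The estimate-then-refine pattern can be sketched as follows. This is a simplified routine in the spirit of Floyd-Rivest, not their exact algorithm; the function name `approx_median` and the `slack` parameter are illustrative choices of mine:

```python
import random

def approx_median(items, slack=3):
    """Exact median via sampling: estimate from a small sample,
    then refine by examining only elements near the estimate."""
    n = len(items)
    s = max(1, int(round(n ** 0.5)))       # sample of about sqrt(n)
    sample = sorted(random.sample(items, s))
    target = n // 2                        # 0-based rank of the median
    # Sample order statistics bracketing the scaled target rank.
    lo_i = max(0, (target * s) // n - slack)
    hi_i = min(s - 1, (target * s) // n + slack)
    lo, hi = sample[lo_i], sample[hi_i]
    below = sum(1 for x in items if x < lo)
    middle = sorted(x for x in items if lo <= x <= hi)
    k = target - below
    if 0 <= k < len(middle):               # refinement succeeded
        return middle[k]
    return sorted(items)[target]           # unlucky sample: fall back
```

Only the small `middle` set is sorted in the common case; the fallback guarantees correctness even when the sample happens to be unrepresentative.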

The Floyd-Rivest algorithm typifies three components needed in a random-sampling algorithm. The first is a definition of a randomly sampled subproblem. The second is an approximation theorem that proves that a solution to the subproblem is an approximate solution to the original problem. These two components by themselves will typically yield an obvious approximation algorithm with a speed-accuracy tradeoff. The third component is a refinement algorithm that takes the approximate solution and turns it into an actual solution. Combining these three components can yield an algorithm whose running time will be determined by that of the refinement algorithm; intuitively, refinement should be easier than computing a solution from scratch.

In an application of this approach, we present the first randomized linear-time algorithm for finding minimum spanning trees in the comparison-based model of computation. The fundamental insight is that if we construct a subgraph of a graph by taking a random sample of the graph's edges, then the minimum spanning tree in the subgraph is a "nearly" minimum spanning tree of the entire graph. More precisely, very few graph edges can be used to improve the sample's minimum spanning tree. By examining these few edges, we can refine our approximation into the actual minimum spanning tree at little additional cost. This result reflects joint work with Philip Klein and Robert E. Tarjan [106]. Our results actually apply to the more general problem of matroid optimization, and show that the problem of constructing an optimum matroid basis is essentially equivalent to that of verifying the optimality of a candidate basis.

We also apply sampling to the minimum cut problem and several other problems involving cuts in graphs, including maximum flows. The maximum flow problem is perhaps the most widely studied of all graph optimization problems, having hundreds of applications.
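Returning to the spanning-tree result above: the sample-then-refine pattern can be sketched as follows. This illustrative version (names mine) uses a naive quadratic check for the few edges that can improve the sample's tree, whereas the actual algorithm relies on a linear-time verification routine; it assumes distinct edge weights:

```python
import random
from collections import defaultdict

def kruskal(n, edges):
    """Minimum spanning forest via Kruskal; edges are (w, u, v) triples."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def path_max(adj, src, dst):
    """Max edge weight on the unique src-dst path in a forest, else None."""
    stack, seen = [(src, float("-inf"))], {src}
    while stack:
        node, mw = stack.pop()
        if node == dst:
            return mw
        for nb, w in adj[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append((nb, max(mw, w)))
    return None

def sampled_mst(n, edges, p=0.5):
    # 1. Sample each edge independently with probability p.
    sample = [e for e in edges if random.random() < p]
    f = kruskal(n, sample)            # forest F: MST of the sample
    adj = defaultdict(list)
    for w, u, v in f:
        adj[u].append((v, w))
        adj[v].append((u, w))
    # 2. An edge is "F-light" if it could improve F; only these few
    #    edges (about n/p in expectation) can be in the true MST.
    light = [(w, u, v) for w, u, v in edges
             if (pm := path_max(adj, u, v)) is None or w < pm]
    # 3. Refine: the true MST lies inside F plus the F-light edges.
    return kruskal(n, f + light)
```

By the cycle property, every edge of the true minimum spanning tree is either in F or F-light, so step 3 always recovers the exact tree; randomness affects only how few edges survive the filter.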

Given vertices s and t and capacitated edges, the goal is to ship the maximum quantity of material from s to t without exceeding the capacities of the edges. The value of a graph's maximum flow is completely determined by the values of certain cuts in the graph.