
## HIGH PERFORMANCE COMPUTING FOR IRREGULAR ALGORITHMS AND APPLICATIONS WITH AN EMPHASIS ON BIG DATA ANALYTICS

A Thesis
Presented to
The Academic Faculty

by

Oded Green

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in the
School of Computational Science and Engineering

Georgia Institute of Technology
May 2014

Copyright © 2014 by Oded Green


**Approved by:**

Professor David A. Bader, Committee Chair
School of Computational Science and Engineering
Georgia Institute of Technology

Professor David A. Bader, Advisor
School of Computational Science and Engineering & School of Electrical and Computer Engineering
Georgia Institute of Technology

Professor Richard Vuduc
School of Computational Science and Engineering
Georgia Institute of Technology

Professor Srinivas Aluru
School of Computational Science and Engineering
Georgia Institute of Technology

Professor Polo Chau
School of Computational Science and Engineering
Georgia Institute of Technology

Professor Bo Hong
School of Electrical and Computer Engineering
Georgia Institute of Technology

Yitzhak Birk
Electrical Engineering Department
Technion

Date Approved: 13 March 2014

To my parents, Itzhak and Esther, and my twin brother, Yoav, you have always made me shoot to the stars.

To my wife, Ania, you are my Northern Star and guide me through the worst of storms.


## ACKNOWLEDGEMENTS

I am thankful for the support of my family throughout my graduate studies. Your support means the world to me and greatly simplified my life. I want to thank my friends and fellow graduate students. You helped me walk this long and winding road. I am especially thankful to Rob McColl and David Ediger, who helped me transition from Windows to Linux and had the patience to show me the basics; it is not easy to teach an old dog new tricks, and you were successful.

To my friends, Lluis Miquel Munguia, Aakash Goel, Rajath Prasad, Adam McLaughlin, Xing Liu, Piyush Sao, Marat Dukhan, Aparna Chandramowlishwaran, Aftab Patel, and Zhaoming Yin, thanks for the many discussions (over lunch, coffee, or even a glass of water). I learned a lot from you all.

To my advisor, Prof. David A. Bader, thanks for giving me the opportunity to study a new field. I am especially grateful to David for giving me as much leeway as I needed to pursue any new research problem.

I want to thank my committee: Professor Rich Vuduc, Professor Srinivas Aluru, Professor Polo Chau, and Professor Bo Hong. I had many interesting discussions with you all, some of them computer related and some not. I enjoyed them all.

## LIST OF TABLES

1. Memory bounds for different betweenness centrality algorithms. P denotes the number of cores used by each algorithm. K denotes the number of roots used in the approximation. The dynamic algorithms do not depend on the number of cores, as these data structures are maintained as part of the computation.
2. Notations used in the chapter.
3. Graphs from the 10th DIMACS Implementation Challenge used in our scaling experiments. The execution times are for a single thread.
4. Larger graphs from the 10th DIMACS Implementation Challenge. The execution times are for a single thread.
5. The scenarios for the medium-size graphs where the different approaches have better performance. The array-based approach is denoted with ◦ and the neighbor-traversal is denoted as. The first column denotes the graph size and average edge adjacency. The remaining columns represent the thread count. The neighbor-traversal approach dominates the table.
6. Memory bounds for different betweenness centrality algorithms.
7. Notations in this chapter.
8. Graphs from the 10th DIMACS Implementation Challenge that are used in our experiments, grouped by type and sorted within each group by vertex count.
9. Notations in this chapter.
10. Time complexities of both approaches.
11. Graphs from the 10th DIMACS Implementation Challenge used in our experiments, with the clustering coefficient computation runtimes for the different algorithms. Labels: SF - Straightforward (40 threads), V-B - Vertex-Based (40 threads), and E-B - Edge-Based (40 threads).
12. The data structures maintained while tracking dynamic connected components.
13. Graph sizes used in our experiments for testing the algorithm. Multiple graphs of each size were used.
14. Cache misses for parallel merging algorithms assuming a 3-way associativity.

## LIST OF FIGURES

1. The insertion of edge e = (u, v) connects two vertices that are on the same level in the BFS-like tree of root s. This insertion does not create any new shortest paths. In the case of edge deletion, no shortest paths are removed.
2. (a) The insertion of edge e = (u_high, u_low) connects vertices that are in adjacent levels before the insertion. (b) A BFS traversal is started at u_low, as the shortest paths above this vertex are not affected by the insertion. (c) Shows the vertices that are affected by the dependency accumulation; this includes vertices that were not found in the BFS traversal beginning at u_low. The difference between the insertion and the deletion is that the vertices found in the BFS traversal (green triangle) will have fewer paths to the root instead of additional paths to the root.
3. (a) The BFS-like tree prior to the insertion of edge e = (u_high, u_low). (b) The BFS-like tree after the insertion. Note that u_low has a single parent in the level above it. For deletions, swap between (a) and (b). (c) The BFS traversal starts at u_low and can possibly "pull up" vertices that are closer to the root following the insertion. (d) The dependency accumulation can find vertices that were not in the BFS traversal.
4. (a) Prior to the insertion, the vertices are in different connected components. (b) After the insertion, the vertices are in the same connected component. All the paths between vertices that were previously in different connected components go through the inserted edge. For deletions, (b) is before the deletion and (a) is after the deletion. (c) and (d) are for insertions. (c) The BFS traversal starts at the vertex that just got connected to the first connected component. Only vertices in the second component will be found in the BFS traversal. (d) The dependency accumulation will go through all the vertices of the second connected component and work its way back through the newly inserted edge and all the way up to the root.
5. Edge insertion or deletion occurs in a different connected component than that of the root.
6. Six different speedup bars for each graph (from left to right) for single-thread execution: 1) average, 2) minimal, 3) the 25th percentile, 4) median, 5) the 75th percentile, and 6) maximal.
7. Strong scaling of the coarse-grain parallel implementations: (a) static graph algorithm and (b) dynamic graph algorithm. The approximation uses K = 256 roots. Each of the curves is for a different network.

22. All legal vertex covers of the square made up of u, v, w, and x. Additional edges that are non-relevant have been removed. There are additional vertex covers for the square; however, these follow one of the presented patterns. The vertices in the vertex cover are marked with an additional circle.
23. The cross-vertices (u, x) and their common neighboring vertices found in the adjacency intersection. These neighbors are divided into groups: those in the vertex cover and those not in the vertex cover. The vertices in the vertex cover are marked with an additional circle.
24. This chart shows the global clustering coefficient value for graphs from Table 8.
25. The ordinate presents the ratio of time spent finding the vertex cover as a function of the total time spent computing the vertex cover and the clustering coefficients. The ordinate is a log scale. An additional blue curve has been placed at 2.5%. The bars of about 2/3 of the graphs are below this curve.
26. The ordinate is the size ratio of the vertex cover, V̂, and the vertex set, V. Note that |V̂| ≤ |V|. An additional blue curve has been added at 70%. The bars of about 3/5 of the graphs are below this curve.
27. The ordinate is the time ratio of the new vertex-cover-based clustering coefficients algorithm and the original clustering coefficient algorithm. The time of the new algorithm includes both the time needed to find the vertex cover and the time for computing the clustering coefficients. All bars below the blue curve occur when the new algorithm is faster; as such, lower is better.
28. The ordinate is the ratio of the number of intersections that are necessary for the new algorithm in comparison with the original algorithm.
29. The ordinate presents the ratio of time spent finding the vertex cover as a function of the total time spent finding all the 4-circuits. Note that the ordinate is a log scale. The abscissa values are the Clustering graphs from Table 8.
30. The ordinate is the time ratio of the new vertex-cover-based clustering coefficients algorithm and the original clustering coefficient algorithm. The time of the new algorithm includes both the time needed to find the vertex cover and the time for computing the clustering coefficients. The blue curve denotes equal run-times for the new and original algorithms.