# The Batched Predecessor Problem in External Memory

Michael A. Bender¹², Martín Farach-Colton²³, Mayank Goswami⁴, Dzejla Medjedovic⁵, Pablo Montes¹, and Meng-Tsung Tsai³

¹ Stony Brook University, Stony Brook NY 11794, USA
{bender,pmontes}@cs.stonybrook.edu
² Tokutek, Inc.
³ Rutgers University, Piscataway NJ 08854, USA
{farach,mtsung.tsai}@cs.rutgers.edu
⁴ Max-Planck Institute for Informatics, Saarbrücken 66123, Germany
gmayank@mpi-inf.mpg.de
⁵ Sarajevo School of Science and Technology, Sarajevo 71000, Bosnia-Herzegovina
dzejla.medjedovic@ssst.edu.ba

**Abstract.** We give lower and upper bounds for the batched predecessor problem in external memory. We study tradeoffs between the I/O budget to preprocess a dictionary S versus the I/O requirement to find the predecessor in S of each element in a query set Q. For Q polynomially smaller than S, we give lower bounds in three external-memory models: the I/O comparison model, the I/O pointer-machine model, and the indexability model.

In the I/O comparison model, we show that the batched predecessor problem needs Ω(log_B n) I/Os per query element (n = |S|) when the preprocessing is bounded by a polynomial. With exponential preprocessing, the problem can be solved faster, in Θ((log_2 n)/B) I/Os per element. We give the tradeoff that quantifies the minimum preprocessing required for a given searching cost.

In the pointer-machine model, we show that with O(n^{4/3−ε}) preprocessing, for any constant ε > 0, the optimal algorithm cannot perform asymptotically faster than a B-tree. In the indexability model, we exhibit the tradeoff between the redundancy r and access overhead α of the optimal indexing scheme, showing that to report all query answers in α(x/B) I/Os, log r = Ω((B/α²) log(n/B)).

Our lower bounds have matching or nearly matching upper bounds.

## 1 Introduction

A static dictionary is a data structure that represents a set S = {s_1, s_2, ..., s_n} subject to the following operations:

**PREPROCESS(S):** Prepare a data structure to answer queries.

**SEARCH(q, S):** Return TRUE if q ∈ S and FALSE otherwise.

**PREDECESSOR(q, S):** Return max_{s_i ∈ S} {s_i ≤ q}.

The traditional static dictionary can be extended to support batched operations. Let Q = {q_1, ..., q_x}. Then the batched predecessor problem can be defined as follows:

**BATCHEDPRED(Q, S):** Return A = {a_1, ..., a_x}, where a_i = PREDECESSOR(q_i, S).
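In code, these operations have a direct RAM-level reading. The following is a minimal reference sketch of their semantics (not the external-memory algorithms analyzed in this paper), using Python's `bisect` on a sorted list; `None` stands in for an undefined predecessor:

```python
import bisect

def preprocess(S):
    """PREPROCESS(S): here, simply keep S as a sorted list."""
    return sorted(S)

def predecessor(q, S):
    """PREDECESSOR(q, S): max{s in S : s <= q}, or None if no such s exists."""
    pos = bisect.bisect_right(S, q)   # number of elements <= q
    return S[pos - 1] if pos > 0 else None

def batched_pred(Q, S):
    """BATCHEDPRED(Q, S): one predecessor query per element of Q."""
    return [predecessor(q, S) for q in Q]
```

This naive version simply runs x independent searches; the point of the paper is how much better (or not) one can do when I/Os, rather than comparisons, are the cost measure.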

This research was supported in part by NSF grants CCF 1114809, CCF 1114930, CCF 1217708, IIS 1247726, IIS 1247750, and IIS 1251137.

In this paper we prove lower bounds on the batched predecessor problem in external memory [3], that is, when the dictionary is too large to fit into main memory. We study tradeoffs between the searching cost and the cost to preprocess the underlying set S. We present our results in three models: the comparison-based I/O model [3], the pointer-machine I/O model [18], and the indexability model [10, 11].

We focus on query size x ≤ n^c, for a constant c < 1. Thus, the query Q can be large, but is still polynomially smaller than the underlying set S. This query size is interesting because, although there is abundant parallelism in the batched query, common approaches such as linear merges and buffering [4, 6, 7] are suboptimal.

Our results show that the batched predecessor problem in external memory cannot be solved asymptotically faster than Ω(log_B n) I/Os per query element if the preprocessing is bounded by a polynomial; on the other hand, the problem can be solved asymptotically faster, in Θ((log_2 n)/B) I/Os per element, if we impose no constraints on preprocessing. These bounds stand in marked contrast to single-predecessor queries, where one search costs Ω(log_B n) I/Os even if preprocessing is unlimited.

We assume that S and Q are sorted. Without loss of generality Q is sorted, because Q's sort time is subsumed by the query time; S is sorted without loss of generality as long as the preprocessing time is slightly superlinear. We consider sorted S throughout the paper. For notational convenience, we let s_1 < s_2 < · · · < s_n and q_1 < q_2 < · · · < q_x, and therefore a_1 ≤ a_2 ≤ · · · ≤ a_x.

Given that S and Q are sorted, an alternative interpretation of this paper is as follows: how can we optimally merge two sorted lists in external memory? Specifically, what is the optimal algorithm for merging two sorted lists in external memory when one list is a polynomial factor smaller than the other?

Observe that the naïve linear-scan merging is suboptimal because it takes Θ(n/B) I/Os, which is greater than the O(n^c log_B n) I/Os of a B-tree-based solution. Buffer trees [4, 6, 7] also take Θ(n/B) I/Os during a terminal flush phase. This paper shows that with polynomial preprocessing, performing independent searches for each element in Q is optimal, but that it is possible to do better with more preprocessing.

**Single and batched predecessor problems in RAM.** In the comparison model, a single predecessor can be found in Θ(log n) time using binary search. The batched predecessor problem can be solved in Θ(x log(n/x) + x) time by combining merging and binary search [13, 14]. The bounds for both problems remain tight for any preprocessing budget.
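The merge-plus-binary-search bound can be realized with galloping (exponential) search: since Q is sorted, each query starts from the previous answer's position, and the total comparison count telescopes to O(x log(n/x) + x) by concavity of log. A minimal RAM sketch of this idea (the function name `batched_pred_gallop` and the `None` convention for queries below min(S) are ours):

```python
import bisect

def batched_pred_gallop(Q, S):
    """Batched predecessor for sorted Q over sorted S.

    Each query gallops right from where the previous answer left off,
    doubling the step to bracket q, then binary-searches the bracket.
    Total comparisons: O(x log(n/x) + x).
    """
    answers = []
    lo = 0  # answers are nondecreasing, so never look left of lo
    for q in Q:
        # Gallop: double the step until S[lo + step] > q or we run off the end.
        step = 1
        while lo + step < len(S) and S[lo + step] <= q:
            step *= 2
        hi = min(lo + step, len(S))
        # Binary search inside the bracket [lo, hi).
        pos = bisect.bisect_right(S, q, lo, hi)
        answers.append(S[pos - 1] if pos > 0 else None)
        lo = max(pos - 1, 0)
    return answers
```

Note that this strategy is comparison-optimal in RAM but, as Theorem 1 shows, per-element binary search does not translate into the optimal I/O bound in external memory.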

Pătraşcu and Thorup [15] give tight lower bounds for single predecessor queries in the cell-probe model. We are unaware of prior lower bounds for the batched predecessor problem in the pointer-machine and cell-probe models.

Although batching does not help algorithms that rely on comparisons, Karpinski and Nekrich [12] give an upper bound for this problem in the word-RAM model (bit operations are allowed), which achieves O(x) time for all batches of size x = O(√log n) (O(1) per element amortized) with superpolynomial preprocessing.

**Batched predecessor problem in external memory.** Dittrich et al. [8] consider multisearch problems where queries are simultaneously processed and satisfied by navigating through large data structures on parallel computers. They give a lower bound of Ω(x log_B(n/x) + x/B) under stronger assumptions: no duplicates of nodes are allowed, the ith query has to finish before the (i + 1)st query starts, and x ≤ n^{1/(2+ε)} for a constant ε > 0.

Buffering is a standard technique for improving the performance of external-memory algorithms [4, 6, 7]. By buffering, partial work on a set of operations can share an I/O, thus reducing the per-operation I/O cost. Queries can similarly be buffered. In this paper, the number of queries, x, is much smaller than the size, n, of the data structure being queried. As a result, as the partial work on the queries progresses, the query paths can diverge within the larger search structure, eliminating the benefit of buffering.

Goodrich et al. [9] present a general method for performing x simultaneous external-memory searches in O((n/B + x/B) log_{M/B}(n/B)) I/Os when x is large. When x is small, this technique achieves O(x log_B(n/B)) I/Os with a modified version of the parallel fractional cascading technique of Tamassia and Vitter [19].

**Results.** We first consider the comparison-based I/O model [3]. In this model, the problem cannot be solved faster than Ω(log_B n) I/Os per element if preprocessing is polynomial. That is, batching queries is not faster than processing them one by one. With exponential preprocessing, the problem can be solved faster, in Θ((log_2 n)/B) I/Os per element. We generalize to show a query-preprocessing tradeoff.

Next we study the I/O pointer-machine model [18], which is less restrictive than the I/O comparison model in main memory, but more restrictive in external memory.⁶ We show that with preprocessing at most O(n^{4/3−ε}) for constant ε > 0, the cost per element is again Ω(log_B n).

Finally, we turn to the more general indexability model [10, 11]. This model is frequently used to describe reporting problems, and it focuses on bounding the number of disk blocks that contain the answers to the query subject to the space limit of the data structure; the searching cost is ignored. Here, the redundancy parameter r measures the number of times an element is stored in the data structure, and the access overhead parameter α captures how far the reporting cost is from the optimal.

We show that to report all query answers in α(x/B) I/Os, r = (n/B)^{Ω(B/α²)}. The lower bounds in this model also hold in the previous two models. This result shows that it is impossible to obtain O(1/B) I/Os per element unless the space used by the data structure is exponential, which parallels the situation in RAM, where exponential preprocessing is required to achieve O(1) amortized time per query element [12].

The rest of this section formally outlines our results.

Theorem 1 (Lower bound, I/O comparison model). Let S be a set of size n and Q a set of size x ≤ n^c, for a constant c < 1. In the I/O comparison model, computing BATCHEDPRED(Q, S) requires Ω((x/B) log(n/(xB)) + x/B) I/Os in the worst case, no matter the preprocessing. There exists a comparison-based algorithm matching this bound.

Traditional information-theoretic techniques give tight sorting-like lower bounds for this problem in the RAM model. In external memory, the analogous approach yields a lower bound of Ω((x/B) log_{M/B}(n/x) + x/B). On the other hand, repeated finger searching in a B-tree yields an upper bound of O(x log_B n). Theorem 1 shows that both bounds are weak, and that in external memory this problem has a complexity that is between sorting and searching.

We can interpret results in the comparison model through the amount of information that can be learned from each I/O. For searching, a block input reduces the choices for the target position of the element by a factor of B, thus learning log B bits of information. For sorting, a block input learns up to log (M choose B) = Θ(B log(M/B)) bits (obtained by counting the ways that an incoming block can intersperse with the elements resident in main memory). Theorem 1 demonstrates that in the batched predecessor problem, the optimal unbounded-preprocessing algorithm learns B bits per I/O, more than for searching but less than for sorting.
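The sorting estimate follows from standard binomial-coefficient bounds (our own calculation, not part of the paper), assuming M ≥ 2B:

```latex
\left(\frac{M}{B}\right)^{B} \le \binom{M}{B} \le \left(\frac{eM}{B}\right)^{B}
\quad\Longrightarrow\quad
B\log\frac{M}{B} \;\le\; \log\binom{M}{B} \;\le\; B\left(\log\frac{M}{B} + \log e\right),
```

so log (M choose B) = Θ(B log(M/B)), as claimed.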

The following theorem captures the tradeoff between searching and preprocessing: at one end of the spectrum lies the B-tree (j = 1), with linear construction time and log_B n searching cost per element, and at the other end is parallel binary search (j = B), with exponential preprocessing cost and (log_2 n)/B searching cost per element. This tradeoff shows that even to obtain performance only twice as fast as that of a B-tree, quadratic preprocessing is necessary. To learn up to j log(B/j + 1) bits per I/O, the algorithm needs to spend n^{Ω(j)} I/Os in preprocessing.

Theorem 2 (Search-preprocessing tradeoff, I/O comparison model). Let S be a set of size n and Q a set of size x ≤ n^c, 0 ≤ c < 1. In the I/O comparison model, computing BATCHEDPRED(Q, S) in O((x log_{B/j+1} n)/j) I/Os requires that PREPROCESSING(S) use n^{Ω(j)} blocks of space and I/Os.
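As a sanity check (our own calculation), the two endpoints of Theorem 2 recover the bounds discussed above:

```latex
j = 1:\quad \frac{\log_{B/1+1} n}{1} = \log_{B+1} n = \Theta(\log_B n),
\qquad \text{preprocessing } n^{\Omega(1)} \text{ (polynomial, B-tree)};
```

```latex
j = B:\quad \frac{\log_{B/B+1} n}{B} = \frac{\log_2 n}{B},
\qquad \text{preprocessing } n^{\Omega(B)} \text{ (exponential, parallel binary search)}.
```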

In order to show results in the I/O pointer-machine model, we define a graph whose nodes are the blocks on disk of the data structure and whose edges are the pointers between blocks. Since a block has size B, it can contain at most B pointers, and thus the graph is fairly sparse. We show that any such sparse graph has a large set of nodes that are far apart. If the algorithm must visit those well-separated nodes, then it must perform many I/Os. The crux of the proof is that, as the preprocessing increases, the redundancy of the data structure increases, thus making it hard to pin down specific locations of the data structure that must be visited. We show that if the data structure has reasonable size, in our case O(n^{4/3−ε}), then we can still find a large, well-dispersed set of nodes that must be visited, thus establishing the following lower bound:

Theorem 3 (Lower bound, I/O pointer-machine model). Let S be a set of size n.

In the I/O pointer-machine model, if PREPROCESSING(S) uses O(n^{4/3−ε}) blocks of space and I/Os, for any constant ε > 0, then there exists a constant c and a set Q of size x = n^c such that computing BATCHEDPRED(Q, S) requires Ω(x log_B(n/x) + x/B) I/Os.

## 2 Batched Predecessor in the I/O Comparison Model

In this section we give the lower bound for the case when preprocessing is unrestricted. Then we study the tradeoff between preprocessing and the optimal number of I/Os.

**Proof of Theorem 1 (Lower Bound).** Consider the following problem instance:

1. For all q_i, |S_i| = n/x, where S_i denotes the search interval of q_i, i.e., the set of elements of S that are still candidates for q_i's predecessor. That is, all elements in Q have been given the first log x bits of information about where they belong in S.

2. For all i and j (1 ≤ i ≠ j ≤ x), S_i ∩ S_j = ∅. That is, the search intervals are disjoint.

We do not charge the algorithm for transferring elements of Q between main memory and disk. This accounting scheme is equivalent to allowing all elements of Q to reside in main memory at all times while still having the entire memory free for other manipulations. Storing Q in main memory does not provide the algorithm with any additional information, since the sorted order of Q is already known.

Now we only consider I/Os of elements of S. Denote a block being input as b = (b_1, ..., b_B). Observe that every b_i (1 ≤ i ≤ B) belongs to at most one S_j. The element b_i acts as a pivot and helps q_j learn at most one bit of information, by shrinking S_j to either its left or its right half.

Since a single pivot gives at most one bit of information, the entire block b can supply at most B bits, during an entire execution of BATCHEDPRED(Q, S).

We require the algorithm to identify the final block in S where each q_i belongs. Thus, the total number of bits that the algorithm needs to learn to solve the problem is Ω(x log(n/(xB))). Along with the scan bound to output the answer, the minimum number of block transfers required to solve the problem is Ω((x/B) log(n/(xB)) + x/B).
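Combining the counts above (a restatement of the argument, not new material): the algorithm must learn Ω(x log(n/(xB))) bits in total, each I/O of S-elements supplies at most B bits, and outputting the x answers costs another Ω(x/B) I/Os, giving

```latex
\#\text{I/Os} \;\ge\; \frac{x \log\!\big(n/(xB)\big)}{B} \;+\; \frac{x}{B}
\;=\; \Omega\!\left(\frac{x}{B}\log\frac{n}{xB} \;+\; \frac{x}{B}\right).
```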

### 2.2 Preprocessing-Searching Tradeoffs

We give a lower bound on the space required by the batched predecessor problem when the budget for searching is limited. We prove Theorem 2 by proving Theorem 7.

Definition 6. An I/O containing elements of S is a j-parallelization I/O if j distinct elements of Q acquire bits of information during this I/O.