# «Abstract Suﬃx automata and factor automata are eﬃcient data structures for represent- ing the full index of a set of strings. They are minimal ...»

General Suﬃx Automaton Construction Algorithm and

Space Bounds

Mehryar Mohria,b, Pedro Morenob, Eugene Weinsteina,b

a CourantInstitute of Mathematical Sciences

251 Mercer Street, New York, NY 10012.

b Google Research

76 Ninth Avenue, New York, NY 10011.

Abstract

Suﬃx automata and factor automata are eﬃcient data structures for represent-

ing the full index of a set of strings. They are minimal deterministic automata

representing the set of all suﬃxes or substrings of a set of strings. This paper presents a novel analysis of the size of the suﬃx automaton or factor automaton of a set of strings. It shows that the suﬃx automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a preﬁx-tree representing the strings in U. This bound signiﬁcantly improves over 2 U −1, the bound given by Blumer et al. (1987), where U is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suﬃx or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suﬃx shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suﬃx automaton S or factor automaton F of U in time O(|S|).

Our algorithm applies in fact to any input suﬃx-unique automaton and strictly generalizes the standard on-line construction of a suﬃx automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suﬃx oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identiﬁcation task with more than 15,000 songs.

Key words: string-matching, pattern-matching, indexing, inverted text, ﬁnite automata, suﬃx trees, suﬃx automata, factor automata, music identiﬁcation.

Email addresses: mohri@cs.nyu.edu (Mehryar Mohri), pedro@google.com (Pedro Moreno), eugenew@cs.nyu.edu (Eugene Weinstein) Preprint submitted to Elsevier April 26, 2009

1. Introduction Searching for patterns in massive quantities of natural language texts, bio- logical sequences, and other widely accessible digitized sequences is a problem of central importance in computer science. The problem has a variety of appli- cations and has been extensively studied in the past [1, 2].

This paper considers the problem of constructing a full index, or inverted ﬁle, for a set of strings represented by a ﬁnite automaton. When the number of strings is large, such as thousands or even millions, the collection of strings can be compactly stored as an automaton, which also enables eﬃcient implementations of search and other algorithms [3, 4]. In fact, in many contexts such as speech recognition or information extraction tasks, the entire set of strings is often directly given as an automaton.

An eﬃcient and compact data structure for representing a full index of a set of strings is a suﬃx automaton, a minimal deterministic automaton representing the set of all suﬃxes of a set of strings. Since a substring is a preﬁx of a suﬃx, a suﬃx automaton can be used to determine if a string x is a substring in time linear in its length O(|x|), which is optimal. Additionally, as with suﬃx trees, suﬃx automata have other interesting properties in string-matching problems which make their use and construction attractive [1, 2]. Another similar data structure for representing a full index of a set of strings is a factor automaton, a minimal deterministic automaton representing the set of all factors or substrings of a set of strings. Factor automata oﬀer the same optimal linear-time search property as suﬃx automata, and are never larger.

The construction and the size of a factor automaton have been speciﬁcally analyzed in the case of a single string [5, 6]. These authors demonstrated the remarkable result that the size of the factor automaton of a string x is linear, and that, more precisely, for strings x of length more than three, it has at most 2|x| − 2 states and 3|x| − 4 transitions. They also gave on-line lineartime algorithms for constructing a factor automaton from x. Similar results were given for suﬃx automata, the minimal deterministic automata accepting exactly the set of suﬃxes of a string.

The construction and the size of the factor automata of a ﬁnite set of strings U = {x1,..., xm } has also been previously studied [7]. These authors showed that an automaton accepting all factors of U can be constructed that has at most 2 U − 1 states and 3 U − 3 transitions, where U is the sum of the m lengths of all strings in U, that is U = i=1 |xi |.

This paper proves a signiﬁcantly better bound on the size of the suﬃx automaton or factor automaton of a set of strings. It shows that the suﬃx automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a preﬁx-tree representation of the strings in U. The number of nodes Q can be dramatically smaller than U, the sum of the lengths of all strings. Thus, our space bound clearly improves on previous work [7]. More generally, we give novel bounds for the size of the suﬃx automaton or factor automaton of an acyclic ﬁnite automaton as a function of the size of the original automaton and the maximal length of a suﬃx shared by the 2 strings accepted by the original automaton. This result can be compared to that of Inenaga et al. for compact directed acyclic word graphs whose complexity, O(|Σ|Q), depends on the size of the alphabet [8].

Using our space bound analysis, we also give a simple algorithm for constructing the suﬃx automaton S or factor automaton F of U in time O(|S|) from a preﬁx tree representing U. Our algorithm applies in fact to any input suﬃx-unique automaton and strictly generalizes the standard on-line construction of a suﬃx automaton for a single input string.

The original motivation for this work was the design of a large-scale music identiﬁcation system [4, 9], where we represented our song database by a compact ﬁnite automaton, as we shall brieﬂy describe later in this paper. To facilitate an eﬃcient search of song snippets, we constructed the minimal deterministic factor automaton of the song automaton. Empirically, the size of the factor automaton was not prohibitive. But, to ensure the scalability of our approach to a larger set of songs, e.g., several million songs, we wished to derive a bound on the size of the factor automata of automata. One characteristic of the strings considered in this application as in many others is that the original strings do not share long suﬃxes. This motivated our analysis of the size of the factor automata with respect to the length of the common suﬃxes in the original automaton.

The remainder of the paper is organized as follows. Section 2 introduces the string and automata deﬁnitions and terminology used throughout the paper. In Section 3, we describe a novel analysis of factor automata and present new bounds on the size of the suﬃx automaton and factor automaton of an automaton. Section 4 gives a detailed description of a linear-time algorithm for the construction of the suﬃx automaton and factor automaton of a ﬁnite set of strings, or of any suﬃx-unique automaton, including a pseudocode of the algorithm. Our algorithm can also be used straightforwardly to generate the suﬃx oracle or factor oracle of a set of strings, which has been shown to have various useful properties [10]. Section 5 brieﬂy describes the use of factor automata in music identiﬁcation and reports several empirical results related to their size.

**2. Factors of a Finite Automaton**

This section reviews some key properties of factors of a ﬁxed ﬁnite automaton, generalizing similar observations made by Blumer et al. for a single string [7].

We denote by Σ a ﬁnite alphabet. The length of a string x ∈ Σ∗ over that alphabet is denoted by |x|. A factor, or substring, of a string x ∈ Σ∗ is a sequence of symbols appearing consecutively in x. Thus, y is a factor of x iﬀ there exist u, v ∈ Σ∗ such that x = uyv. A suﬃx of a string x ∈ Σ∗ is a factor that appears at the end of x. Put otherwise, y is a suﬃx of x iﬀ there exists u ∈ Σ∗ such that x = uy. Analogously, y is a preﬁx of x iﬀ there exists u ∈ Σ∗ such that x = yu. More generally, a factor, suﬃx, or preﬁx of a set of strings U or an automaton A, is a factor, suﬃx, or preﬁx of a string in U or a string

Figure 1: Finite automaton A accepting the strings ac, acab, acba.

accepted by A, respectively. The symbol ǫ represents the empty string. For any string x ∈ Σ∗, ǫ is always a preﬁx, suﬃx, and factor of x.

In some applications such as music identiﬁcation the strings considered may be long, e.g., sequences of music sounds, but with relatively short common suﬃxes. This motivates the following deﬁnition.

Deﬁnition 1. Let k be a non-negative integer. We will say that a ﬁnite automaton A is k-suﬃx-unique if no two strings accepted by A share a suﬃx of length k. A is said to be suﬃx-unique when it is k-suﬃx-unique with k = 1.

Figure 1 gives an example of a simple automaton A accepting three strings ending in distinct symbols. Note that A is suﬃx-unique.

The main results of this paper hold for suﬃx-unique automata, but we also present some results for the general case of arbitrary acyclic automata. We denote by F (A) the minimal deterministic automaton accepting the set of factors of a ﬁnite automaton A, that is the set of factors of the strings accepted by A.

Similarly, we denote by S(A) the minimal deterministic automaton accepting the set of suﬃxes of an automaton A.

Deﬁnition 2. Let A be a ﬁnite automaton. For any string x ∈ Σ∗, we deﬁne end -set(x) as the set of states of A reached by the paths in A that begin with x. We say that two strings x and y in Σ∗ are equivalent and denote this by x ≡ y, when end -set(x) = end -set(y). This deﬁnes a right-invariant equivalence relation on Σ∗. We denote by [x] the equivalence class of x ∈ Σ∗.

** Lemma 1. Assume that A is suﬃx-unique.**

Then, a non-suﬃx factor x of the automaton A is the longest member of [x] iﬀ it is either a preﬁx of A, or both ax and bx are factors of A for distinct a, b ∈ Σ.

Proof. Let x be a non-suﬃx factor of A. Clearly, if x is not a preﬁx, then there must be distinct a and b such that ax and bx are factors of A, otherwise [x] would admit a longer member. Conversely, assume that ax and bx are both factors of A with a = b. Let y be the longest member of [x]. Let q be a state in end -set(x) = end -set (y). Since x is not a suﬃx, q is not a ﬁnal state, and there exists a non-empty string z labeling a path from q to a ﬁnal state. Since A is suﬃx-unique, both xz and yz are suﬃxes of the same string. Since y is the longest member of [x], x must be a suﬃx of y. Since ax and bx are both factors of A with a = b, we must have y = x. Finally, if x is a preﬁx, then clearly it is the longest member of [x]. P

Proposition 1. Assume that A is suﬃx-unique.

**Let SA = (QS, IS, FS, ES ) be the deterministic automaton whose states are the equivalence classes QS = {[x] = ∅ : x ∈ Σ∗ }, its initial state IS = {[ǫ]}, its ﬁnal states FS = {[x] :**

end -set(x) ∩ FA = ∅} where FA is the set of ﬁnal states of A, and its transition set ES = {([x], a, [xa]) : [x], [xa] ∈ QS }. Then, SA is the minimal deterministic suﬃx automaton of A: SA = S(A).

Proof. By construction, SA is deterministic and accepts exactly the set of sufxes of A. Let [x] and [y] be two equivalent states of SA. Then, for all z ∈ Σ∗, [xz] ∈ FA iﬀ [yz] ∈ FA, that is z is a suﬃx of A iﬀ yz is a suﬃx of A. Since A is suﬃx-unique, this implies that either x is a suﬃx of y or vice versa, and thus that [x] = [y]. Thus, SA is minimal. P In what follows, we will be interested in the case where the automaton A is acyclic. We denote by |A|Q the number of states of A, by |A|E the number of transitions of A, and by |A| the size of A deﬁned as the sum of the number of states and transitions of A.

3. Space Bounds for Factor Automata The objective of this section is to derive new bounds on the size of S(A) and F (A) in the case of interest for our applications where A is an acyclic automaton, typically deterministic and minimal, representing a set of strings.

When A represents a single string, there are standard algorithms for constructing S(A) and F (A) from A in linear time [5, 6]. In the general case, S(A) can be constructed from A as follows: add an ǫ-transition from the initial state of A to each state of A, then apply an ǫ-removal algorithm, followed by determinization and minimization. F (A) can be obtained similarly by further making all states ﬁnal before applying ǫ-removal, determinization, and minimization. It can also be obtained from S(A) by making all states of S(A) ﬁnal and applying minimization. For example, if A is the simple automaton of Figure 1, then Figure 2 is its suﬃx automaton S(A).

When A represents a single string x, the size of the automata S(A) and F (A) can be proved to be linear in |x|. More precisely, the following bounds

These bounds are tight for strings of length more than three. [7] gave similar results for the case of a set of strings U by showing that the size of the factor automaton F (U ) representing this set is bounded as follows

where U denotes the sum of the lengths of all strings in U.

In general, the size of an acyclic automaton A representing a ﬁnite set of strings U can be substantially smaller than U. In fact, |A| can be exponentially smaller than U. Thus, we are interested in bounding the size of S(A) or F (A) in terms of the size of A, rather than the sum of the lengths of all strings accepted by A.