WWW.DISSERTATION.XLIBX.INFO
FREE ELECTRONIC LIBRARY - Dissertations, online materials
 
<< HOME
CONTACTS



Pages:   || 2 | 3 |

«Abstract Suffix automata and factor automata are efficient data structures for represent- ing the full index of a set of strings. They are minimal ...»

-- [ Page 1 ] --

General Suffix Automaton Construction Algorithm and

Space Bounds

Mehryar Mohria,b, Pedro Morenob, Eugene Weinsteina,b

a CourantInstitute of Mathematical Sciences

251 Mercer Street, New York, NY 10012.

b Google Research

76 Ninth Avenue, New York, NY 10011.

Abstract

Suffix automata and factor automata are efficient data structures for represent-

ing the full index of a set of strings. They are minimal deterministic automata

representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2 U −1, the bound given by Blumer et al. (1987), where U is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|).

Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.

Key words: string-matching, pattern-matching, indexing, inverted text, finite automata, suffix trees, suffix automata, factor automata, music identification.

Email addresses: mohri@cs.nyu.edu (Mehryar Mohri), pedro@google.com (Pedro Moreno), eugenew@cs.nyu.edu (Eugene Weinstein) Preprint submitted to Elsevier April 26, 2009

1. Introduction Searching for patterns in massive quantities of natural language texts, bio- logical sequences, and other widely accessible digitized sequences is a problem of central importance in computer science. The problem has a variety of appli- cations and has been extensively studied in the past [1, 2].

This paper considers the problem of constructing a full index, or inverted file, for a set of strings represented by a finite automaton. When the number of strings is large, such as thousands or even millions, the collection of strings can be compactly stored as an automaton, which also enables efficient implementations of search and other algorithms [3, 4]. In fact, in many contexts such as speech recognition or information extraction tasks, the entire set of strings is often directly given as an automaton.

An efficient and compact data structure for representing a full index of a set of strings is a suffix automaton, a minimal deterministic automaton representing the set of all suffixes of a set of strings. Since a substring is a prefix of a suffix, a suffix automaton can be used to determine if a string x is a substring in time linear in its length O(|x|), which is optimal. Additionally, as with suffix trees, suffix automata have other interesting properties in string-matching problems which make their use and construction attractive [1, 2]. Another similar data structure for representing a full index of a set of strings is a factor automaton, a minimal deterministic automaton representing the set of all factors or substrings of a set of strings. Factor automata offer the same optimal linear-time search property as suffix automata, and are never larger.

The construction and the size of a factor automaton have been specifically analyzed in the case of a single string [5, 6]. These authors demonstrated the remarkable result that the size of the factor automaton of a string x is linear, and that, more precisely, for strings x of length more than three, it has at most 2|x| − 2 states and 3|x| − 4 transitions. They also gave on-line lineartime algorithms for constructing a factor automaton from x. Similar results were given for suffix automata, the minimal deterministic automata accepting exactly the set of suffixes of a string.

The construction and the size of the factor automata of a finite set of strings U = {x1,..., xm } has also been previously studied [7]. These authors showed that an automaton accepting all factors of U can be constructed that has at most 2 U − 1 states and 3 U − 3 transitions, where U is the sum of the m lengths of all strings in U, that is U = i=1 |xi |.

This paper proves a significantly better bound on the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix-tree representation of the strings in U. The number of nodes Q can be dramatically smaller than U, the sum of the lengths of all strings. Thus, our space bound clearly improves on previous work [7]. More generally, we give novel bounds for the size of the suffix automaton or factor automaton of an acyclic finite automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the 2 strings accepted by the original automaton. This result can be compared to that of Inenaga et al. for compact directed acyclic word graphs whose complexity, O(|Σ|Q), depends on the size of the alphabet [8].





Using our space bound analysis, we also give a simple algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|) from a prefix tree representing U. Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string.

The original motivation for this work was the design of a large-scale music identification system [4, 9], where we represented our song database by a compact finite automaton, as we shall briefly describe later in this paper. To facilitate an efficient search of song snippets, we constructed the minimal deterministic factor automaton of the song automaton. Empirically, the size of the factor automaton was not prohibitive. But, to ensure the scalability of our approach to a larger set of songs, e.g., several million songs, we wished to derive a bound on the size of the factor automata of automata. One characteristic of the strings considered in this application as in many others is that the original strings do not share long suffixes. This motivated our analysis of the size of the factor automata with respect to the length of the common suffixes in the original automaton.

The remainder of the paper is organized as follows. Section 2 introduces the string and automata definitions and terminology used throughout the paper. In Section 3, we describe a novel analysis of factor automata and present new bounds on the size of the suffix automaton and factor automaton of an automaton. Section 4 gives a detailed description of a linear-time algorithm for the construction of the suffix automaton and factor automaton of a finite set of strings, or of any suffix-unique automaton, including a pseudocode of the algorithm. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties [10]. Section 5 briefly describes the use of factor automata in music identification and reports several empirical results related to their size.

2. Factors of a Finite Automaton

This section reviews some key properties of factors of a fixed finite automaton, generalizing similar observations made by Blumer et al. for a single string [7].

We denote by Σ a finite alphabet. The length of a string x ∈ Σ∗ over that alphabet is denoted by |x|. A factor, or substring, of a string x ∈ Σ∗ is a sequence of symbols appearing consecutively in x. Thus, y is a factor of x iff there exist u, v ∈ Σ∗ such that x = uyv. A suffix of a string x ∈ Σ∗ is a factor that appears at the end of x. Put otherwise, y is a suffix of x iff there exists u ∈ Σ∗ such that x = uy. Analogously, y is a prefix of x iff there exists u ∈ Σ∗ such that x = yu. More generally, a factor, suffix, or prefix of a set of strings U or an automaton A, is a factor, suffix, or prefix of a string in U or a string

–  –  –

Figure 1: Finite automaton A accepting the strings ac, acab, acba.

accepted by A, respectively. The symbol ǫ represents the empty string. For any string x ∈ Σ∗, ǫ is always a prefix, suffix, and factor of x.

In some applications such as music identification the strings considered may be long, e.g., sequences of music sounds, but with relatively short common suffixes. This motivates the following definition.

Definition 1. Let k be a non-negative integer. We will say that a finite automaton A is k-suffix-unique if no two strings accepted by A share a suffix of length k. A is said to be suffix-unique when it is k-suffix-unique with k = 1.

Figure 1 gives an example of a simple automaton A accepting three strings ending in distinct symbols. Note that A is suffix-unique.

The main results of this paper hold for suffix-unique automata, but we also present some results for the general case of arbitrary acyclic automata. We denote by F (A) the minimal deterministic automaton accepting the set of factors of a finite automaton A, that is the set of factors of the strings accepted by A.

Similarly, we denote by S(A) the minimal deterministic automaton accepting the set of suffixes of an automaton A.

Definition 2. Let A be a finite automaton. For any string x ∈ Σ∗, we define end -set(x) as the set of states of A reached by the paths in A that begin with x. We say that two strings x and y in Σ∗ are equivalent and denote this by x ≡ y, when end -set(x) = end -set(y). This defines a right-invariant equivalence relation on Σ∗. We denote by [x] the equivalence class of x ∈ Σ∗.

Lemma 1. Assume that A is suffix-unique.

Then, a non-suffix factor x of the automaton A is the longest member of [x] iff it is either a prefix of A, or both ax and bx are factors of A for distinct a, b ∈ Σ.

Proof. Let x be a non-suffix factor of A. Clearly, if x is not a prefix, then there must be distinct a and b such that ax and bx are factors of A, otherwise [x] would admit a longer member. Conversely, assume that ax and bx are both factors of A with a = b. Let y be the longest member of [x]. Let q be a state in end -set(x) = end -set (y). Since x is not a suffix, q is not a final state, and there exists a non-empty string z labeling a path from q to a final state. Since A is suffix-unique, both xz and yz are suffixes of the same string. Since y is the longest member of [x], x must be a suffix of y. Since ax and bx are both factors of A with a = b, we must have y = x. Finally, if x is a prefix, then clearly it is the longest member of [x]. P

–  –  –

Proposition 1. Assume that A is suffix-unique.

Let SA = (QS, IS, FS, ES ) be the deterministic automaton whose states are the equivalence classes QS = {[x] = ∅ : x ∈ Σ∗ }, its initial state IS = {[ǫ]}, its final states FS = {[x] :

end -set(x) ∩ FA = ∅} where FA is the set of final states of A, and its transition set ES = {([x], a, [xa]) : [x], [xa] ∈ QS }. Then, SA is the minimal deterministic suffix automaton of A: SA = S(A).

Proof. By construction, SA is deterministic and accepts exactly the set of sufxes of A. Let [x] and [y] be two equivalent states of SA. Then, for all z ∈ Σ∗, [xz] ∈ FA iff [yz] ∈ FA, that is z is a suffix of A iff yz is a suffix of A. Since A is suffix-unique, this implies that either x is a suffix of y or vice versa, and thus that [x] = [y]. Thus, SA is minimal. P In what follows, we will be interested in the case where the automaton A is acyclic. We denote by |A|Q the number of states of A, by |A|E the number of transitions of A, and by |A| the size of A defined as the sum of the number of states and transitions of A.

3. Space Bounds for Factor Automata The objective of this section is to derive new bounds on the size of S(A) and F (A) in the case of interest for our applications where A is an acyclic automaton, typically deterministic and minimal, representing a set of strings.

When A represents a single string, there are standard algorithms for constructing S(A) and F (A) from A in linear time [5, 6]. In the general case, S(A) can be constructed from A as follows: add an ǫ-transition from the initial state of A to each state of A, then apply an ǫ-removal algorithm, followed by determinization and minimization. F (A) can be obtained similarly by further making all states final before applying ǫ-removal, determinization, and minimization. It can also be obtained from S(A) by making all states of S(A) final and applying minimization. For example, if A is the simple automaton of Figure 1, then Figure 2 is its suffix automaton S(A).

When A represents a single string x, the size of the automata S(A) and F (A) can be proved to be linear in |x|. More precisely, the following bounds

–  –  –

These bounds are tight for strings of length more than three. [7] gave similar results for the case of a set of strings U by showing that the size of the factor automaton F (U ) representing this set is bounded as follows

–  –  –

where U denotes the sum of the lengths of all strings in U.

In general, the size of an acyclic automaton A representing a finite set of strings U can be substantially smaller than U. In fact, |A| can be exponentially smaller than U. Thus, we are interested in bounding the size of S(A) or F (A) in terms of the size of A, rather than the sum of the lengths of all strings accepted by A.



Pages:   || 2 | 3 |


Similar works:

«Bulletin of Education and Research June 2010, Vol. 32, No. 1 pp 37-51 An Exploration of Emotional Intelligence of the Students of IIUI in Relation to Gender, Age and Academic Achievement Maliha Nasir* & Rehana Masrur** Abstract This correlational study was intended to examine the relationship of emotional intelligence (EI) with gender, age and academic achievement of students of International Islamic University Islamabad (IIUI). In this study the predictor variable was emotional intelligence...»

«2012/13 mnr:pnr: Redogörelse 2012/13:JO1 Justitieombudsmännens ämbetsberättelse Omslag Eva Lena Johansson Översättning till engelska av Summary in English: David Jones ISSN 0282-0560 Tryck: Elanders, Vällingby 2012 2 2012/13:JO1 Innehåll Skrivelse till riksdagen Allmänna domstolar m.m. Kritik mot en rådman vid Blekinge tingsrätt med anledning av agerandet i ett mål om umgänge med barn m.m. (222-2011). 36 Allvarlig kritik mot en rådman vid Attunda tingsrätt för det sätt på...»

«How Many Miles to Basra A radio play By Colin Teevan Commissioned by BBC Radio 3 Producer Toby Swift Characters: Ursula Stewart Freddie Dangermouse Geordie Malek Gus Janet Sayed First Bandit Second Bandit News Reader on BBC World Service Sheikh of the Kuffa Family. Jeannie The speech in italics is that taken from interviews that have been done at four separate times. These times are as followed and marked with the relevant symbols. # MARCH 17/03/03– DAYTIME IN THE EQUIPMENT HANGAR AT THE BASE...»

«In Ohio Warning Any places listed in the Haunted Places requires permission to visit or investigate. Many of the places are patrolled by the authorities, trespassers will be prosecuted. Aberdeen Aberdeen Baptist Church There is an uneasy feeling when you walk or drive past the church late in the evenings. Apparitions of 3 children can be seen; 2 girls and a boy standing outside the church. They seem to vanish when you look at them. Large black birds have been seen flying out of the church...»

«OCTOBER TERM, 2010 1 (Slip Opinion) Syllabus NOTE: Where it is feasible, a syllabus (headnote) will be released, as is being done in connection with this case, at the time the opinion is issued. The syllabus constitutes no part of the opinion of the Court but has been prepared by the Reporter of Decisions for the convenience of the reader. See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337.SUPREME COURT OF THE UNITED STATES Syllabus ARIZONA FREE ENTERPRISE CLUB’S FREEDOM...»

«1. In regards to the Right to Information documents what is your response to the claims from public servants that SANTOS did not supply adequate site specific information regarding wells and associated infrastructure for proper assessment for the Coordinator Generals Report?Santos GLNG response: It is important to note that the Coordinator-General concluded that: “the EIS process conducted for the project adequately meets the requirements for impact assessment, to the greatest extent...»

«POSITION STATEMENT Alcohol and cancer risk Key messages and recommendations Alcohol use is a cause of cancer. Any level of alcohol consumption increases the risk of • developing an alcohol-related cancer; the level of risk increases in line with the level of consumption. It is estimated that 5,070 cases of cancer (or 5% of all cancers) are attributable to long-term, chronic • use of alcohol each year in Australia. There is convincing evidence that alcohol use increases the risk of cancers...»

«CJEU Case List Juers Pharma Import-Export GmbH v Oberfinanzdirektion Nürnberg regarding the interpretation of Heading 3004 of the Combined Nomenclature in Annex I to Council Regulation (EEC) No 2658/87 of 1987 on the tariff and statistical nomenclature and on the Common Customs Tariff. Mr Paul, Ms Sonnen-Lütte and Ms Mörkens (‘Paul and others’) v Bundesrepublik Deutschland regarding the liability of the German government for the belated transposition of Directive 94/19 and for the...»

«Keynote Address at the 3rd USM-ISDEV International Islamic Management Conference on Islamic Capital Market, Organised by Centre for Islamic Management Studies University Sains Malaysia; 28th & 29th October 2009, USM, Penang. Challenges of Realizing Maqasid al-Shariah (Objectives of Shariah) in Islamic Capital Market: Special Focus on Equity-Based Sukuk Assoc. Prof. Dr. Asyraf Wajdi Dusuki∗ Abstract Islamic Capital Market is an important component of the overall Islamic Financial System...»

«TKK Dissertations 151 Espoo 2009 EVALUATION OF GUN PROPELLING CHARGE PERFORMANCE DURING THE LIFE CYCLE BY STATISTICAL UTILIZATION OF DATA COLLECTED IN TEST AND TROOP GUN FIRINGS Doctoral Dissertation Heli Nyberg Helsinki University of Technology Faculty of Chemistry and Materials Sciences Department of Biotechnology and Chemical Technology TKK Dissertations 151 Espoo 2009 EVALUATION OF GUN PROPELLING CHARGE PERFORMANCE DURING THE LIFE CYCLE BY STATISTICAL UTILIZATION OF DATA COLLECTED IN TEST...»

«International Journal of Oceans and Oceanography ISSN 0973-2667 Vol.1 No.1 (2006), pp. 99-109 © Research India Publications http://www.ripublication.com/ijoo.htm Environmental Assessment of Heavy Metal Pollution in Bottom Sediments of Aden Port, Yemen  2 Samir M. Nasr¹, Mohamed. A. Okbah, Shaif. M. Kasem³ ¹Department of Environmental Studies, Institute of Graduate Studies and Research, University of Alexandria, Alexandria, Egypt E-mail: samir_nasr@yahoo.com 2 National Institute of...»

«8. TESTING POWER TRANSFORMERS High-voltage transformers are some of the most important (and expensive) pieces of equipment required for operating a power system. The purchase, preparation, assembly, operation and maintenance of transformers represent a large expense to the power system.8.1 OVERVIEW When transformers are received from the factory or reallocated from another location it is necessary to verify that each transformer is dry, no damage has occurred during shipping, internal...»





 
<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.