
Extracting Semantic Networks from Text via Relational Clustering

Stanley Kok and Pedro Domingos

Department of Computer Science and Engineering

University of Washington, Seattle WA 98195-2350, USA

{koks,pedrod}@cs.washington.edu

Abstract. Extracting knowledge from text has long been a goal of AI. Initial approaches were purely logical and brittle. More recently, the availability of large quantities of text on the Web has led to the development of machine learning approaches. However, to date these have mainly extracted ground facts, as opposed to general knowledge. Other learning approaches can extract logical forms, but require supervision and do not scale. In this paper we present an unsupervised approach to extracting semantic networks from large volumes of text. We use the TextRunner system [1] to extract tuples from text, and then induce general concepts and relations from them by jointly clustering the objects and relational strings in the tuples. Our approach is defined in Markov logic using four simple rules. Experiments on a dataset of two million tuples show that it outperforms three other relational clustering approaches, and extracts meaningful semantic networks.

**1 Introduction**

A long-standing goal of AI is to build an autonomous agent that can read and understand text. The natural language processing (NLP) community attempted to achieve this goal in the 1970's and 1980's by building systems for understanding and answering questions about simple stories [3, 13, 23, 6]. These systems parsed text into a network of predefined concepts, and created a knowledge base from which inferences could be made. However, they required a large amount of manual engineering, only worked on small texts, and were not robust enough to perform well on unrestricted naturally occurring text. Gradually, research in this direction petered out.

Interest in the goal has recently been rekindled [16, 7] by the abundance of easily accessible Web text, and by the substantial progress over the last few years in machine learning and NLP. The confluence of these three developments has led to efforts to extract facts and knowledge bases from the Web [4]. Two recent steps in this direction are the system of Pasca et al. [18] and TextRunner [1].

Both systems extract facts on a large scale from Web corpora in an unsupervised manner. Pasca et al.'s system derives relation-specific extraction patterns from a starting set of seed facts, acquires candidate facts using the patterns, adds high-scoring facts to the seeds, and iterates until some convergence criterion is met. TextRunner uses a domain-independent approach to extract a large set of relational tuples of the form r(x, y), where x and y are strings denoting objects, and r is a string denoting a relation between the objects. It uses a lightweight noun phrase chunker to identify objects, and heuristically takes the text between two objects to be the relation between them. These are good first steps, but they still fall short of the goal. While they can quickly acquire a large database of ground facts in an unsupervised manner, they are not able to learn general knowledge that is embedded in the facts.

Another line of recent research takes the opposite approach. Semantic parsing [26, 17, 29] is the task of mapping a natural language sentence into logical form. The logical statements constitute a knowledge base that can be used to perform some task, like answering questions. Semantic parsing systems require a training corpus of sentences annotated with their associated logical forms (i.e., they are supervised). These systems are then trained to induce a parser that can convert novel sentences to their logical forms. Even though these systems can create knowledge bases directly, their need for annotated training data prevents them from scaling to large corpora like the Web.

In this paper, we present SNE (short for Semantic Network Extractor), a scalable, unsupervised, and domain-independent system that simultaneously extracts high-level relations and concepts, and learns a semantic network [20] from text. It first uses TextRunner to extract ground facts as triples from text, and then extracts knowledge from the triples. TextRunner's triples are noisy, sparse, and contain many co-referent objects and relations. Our system has to overcome these challenges in order to extract meaningful high-level relations and concepts from the triples in an unsupervised manner. It does so with a probabilistic model that clusters objects by the objects they are related to, and clusters relations by the objects they relate. This allows information to propagate between clusters of relations and clusters of objects as they are created. Each cluster represents a high-level relation or concept. A concept cluster can be viewed as a node in a graph, and a relation cluster can be viewed as links between the concept clusters that it relates. Together the concept clusters and relation clusters define a simple semantic network. Figure 1 illustrates part of a semantic network that our approach learns.
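As a concrete illustration, the structure just described can be held in plain data structures. The cluster names and members below are hypothetical examples, not taken from the paper's Figure 1:

```python
# Concept clusters become nodes and relation clusters become labeled links.
# All cluster names and members here are made-up examples.

concept_clusters = {
    "C1": {"Microsoft", "Google", "IBM"},   # a "company" concept
    "C2": {"Windows", "Gmail", "DB2"},      # a "product" concept
}

relation_clusters = {
    "R1": {"develops", "created", "released"},  # a "develops"-like relation
}

# A link (node, relation, node) exists when a true ground fact can be formed
# from symbols in the corresponding clusters, e.g. develops(Microsoft, Windows).
semantic_network = [("C1", "R1", "C2")]

for src, rel, dst in semantic_network:
    print(f"{src} -[{rel}]-> {dst}")
```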

SNE is based on Markov logic [22], and is related to the Multiple Relational Clusterings (MRC) model [12] that we recently proposed. SNE is our first step towards creating a system that can extract an arbitrary semantic network directly from text. Ultimately, we want to tightly integrate the information extraction component (TextRunner) and the knowledge learning component (SNE) to form a self-contained knowledge extraction system. This tight integration will enable information to flow between both tasks, allowing them to be solved jointly for better performance [14].

We begin by briefly reviewing Markov logic in the next section. Then we describe our model in detail (Section 3). Next we describe related work (Section 4). After that, we report our experiments comparing our model with three alternative approaches (Section 5). We conclude with a discussion of future work (Section 6).

**2 Markov Logic**

Markov logic combines first-order logic with Markov networks.

In first-order logic [9], formulas are constructed using four types of symbols: constants, variables, functions, and predicates. (In this paper we use only function-free logic.) Constants represent objects in the domain of discourse (e.g., people: Anna, Bob, etc.). Variables (e.g., x, y) range over the objects in the domain. Predicates represent relations among objects (e.g., Friends), or attributes of objects (e.g., Student). Variables and constants may be typed. An atom is a predicate symbol applied to a list of arguments, which may be variables or constants (e.g., Friends(Anna, x)). A ground atom is an atom all of whose arguments are constants (e.g., Friends(Anna, Bob)). A world is an assignment of truth values to all possible ground atoms. A database is a partial specification of a world; each atom in it is true, false, or (implicitly) unknown.
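This vocabulary can be made concrete with a small sketch, using the Anna/Bob running examples above; the data representation itself is our own choice, not anything prescribed by the paper:

```python
import itertools

# Constants and predicates (name -> arity), following the running examples.
constants = ["Anna", "Bob"]
predicates = {"Friends": 2, "Student": 1}

# All possible ground atoms: each predicate applied to each tuple of constants.
ground_atoms = [
    (pred, args)
    for pred, arity in predicates.items()
    for args in itertools.product(constants, repeat=arity)
]

# A world assigns a truth value to every possible ground atom.
world = {atom: False for atom in ground_atoms}
world[("Friends", ("Anna", "Bob"))] = True

# A database is only a partial specification: atoms absent from it are unknown.
database = {("Student", ("Anna",)): True}

print(len(ground_atoms))  # 4 Friends atoms + 2 Student atoms
```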

A Markov network or Markov random field [19] is a model for the joint distribution of a set of variables $X = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$. It is composed of an undirected graph $G$ and a set of potential functions $\phi_k$. The graph has a node for each variable, and the model has a potential function for each clique in the graph. A potential function is a non-negative real-valued function of the state of the corresponding clique. The joint distribution represented by a Markov network is given by $P(X{=}x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$, where $x_{\{k\}}$ is the state of the $k$th clique (i.e., the state of the variables that appear in that clique). $Z$, known as the partition function, is given by $Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$. Markov networks are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state, leading to $P(X{=}x) = \frac{1}{Z} \exp\left(\sum_j w_j f_j(x)\right)$. A feature may be any real-valued function of the state. This paper will focus on binary features, $f_j(x) \in \{0, 1\}$. In the most direct translation from the potential-function form, there is one feature corresponding to each possible state $x_{\{k\}}$ of each clique, with its weight being $\log \phi_k(x_{\{k\}})$. This representation is exponential in the size of the cliques.

However, we are free to specify a much smaller number of features (e.g., logical functions of the state of the clique), allowing for a more compact representation than the potential-function form, particularly when large cliques are present. Markov logic takes advantage of this.
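The equivalence of the two forms can be checked directly on a toy two-variable network; the potential values below are illustrative:

```python
import itertools
import math

# A tiny Markov network: two binary variables in a single clique, with an
# illustrative potential. We compute P(X = x) from the potential-function
# form, then from the equivalent log-linear form with one binary feature per
# clique state, whose weight is the log of the corresponding potential value.

phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}

states = list(itertools.product([0, 1], repeat=2))
Z = sum(phi[s] for s in states)                      # partition function
p_potential = {s: phi[s] / Z for s in states}

w = {s: math.log(phi[s]) for s in states}            # w_j = log(phi_j)
Z_ll = sum(math.exp(w[s]) for s in states)
p_loglinear = {s: math.exp(w[s]) / Z_ll for s in states}

# The two parameterizations define the same distribution.
for s in states:
    assert abs(p_potential[s] - p_loglinear[s]) < 1e-12
```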

A Markov logic network (MLN) is a set of weighted first-order formulas. Together with a set of constants representing objects in the domain, it defines a Markov network with one node per ground atom and one feature per ground formula. The weight of a feature is the weight of the first-order formula that originated it. The probability distribution over possible worlds $x$ specified by the ground Markov network is given by $P(X{=}x) = \frac{1}{Z} \exp\left(\sum_{i \in F} \sum_{j \in G_i} w_i g_j(x)\right)$, where $Z$ is the partition function, $F$ is the set of all first-order formulas in the MLN, $G_i$ is the set of groundings of the $i$th first-order formula, and $g_j(x) = 1$ if the $j$th ground formula is true and $g_j(x) = 0$ otherwise. Markov logic enables us to compactly represent complex models in non-i.i.d. domains. General algorithms for inference and learning in Markov logic are discussed in [22].
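A minimal sketch of this distribution for a hypothetical MLN with a single weighted formula, Smokes(x) ⇒ Cancer(x) (a standard textbook example, not the model of this paper):

```python
import itertools
import math

# One weighted formula Smokes(x) => Cancer(x) with weight w, over constants
# {Anna, Bob}. Ground atoms: Smokes/Cancer applied to each constant.
# P(X = x) = (1/Z) exp(w * number_of_true_groundings(x)).

w = 1.5
constants = ["Anna", "Bob"]
atoms = [f"{p}({c})" for p in ("Smokes", "Cancer") for c in constants]

def n_true_groundings(world):
    # One grounding of the implication per constant.
    return sum(
        (not world[f"Smokes({c})"]) or world[f"Cancer({c})"] for c in constants
    )

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
Z = sum(math.exp(w * n_true_groundings(wd)) for wd in worlds)

def prob(world):
    return math.exp(w * n_true_groundings(world)) / Z

# Worlds that satisfy more groundings are exponentially more likely.
all_true = {a: True for a in atoms}
violator = dict(all_true, **{"Cancer(Anna)": False})
assert prob(all_true) > prob(violator)
```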

**3 Semantic Network Extraction**

We call our model SNE, for Semantic Network Extractor. SNE simultaneously clusters objects and relations in an unsupervised manner, without requiring the number of clusters to be specified in advance. The object clusters and relation clusters respectively form the nodes and links of a semantic network. A link exists between two nodes if and only if a true ground fact can be formed from the symbols in the corresponding relation and object clusters. SNE can cluster objects of different types, and relations of any arity.

When faced with the task of extracting knowledge from noisy and sparse data like that used in our experiments, we have to glean every bit of useful information from the data to form coherent clusters. SNE does this by jointly clustering objects and relations. At each step of its algorithm, SNE allows information from the object clusters it has created to be used in forming relation clusters, and vice versa. As we shall see in our experimental results, this joint clustering approach does better than clustering objects and relations separately.

SNE is defined using a form of finite second-order Markov logic in which variables can range over relations (predicates) as well as objects (constants). Extending Markov logic to second order involves simply grounding atoms with all possible predicate symbols as well as all constant symbols, and allows us to represent some models much more compactly than first-order Markov logic.

For simplicity, we assume that relations are binary in our definition of SNE, i.e., relations are of the form $r(x, y)$ where $r$ is a relation symbol, and $x$ and $y$ are object symbols. (Extending the definition to relations of arbitrary arity is straightforward.) We use $\gamma_i$ and $\Gamma_i$ to respectively denote a cluster and a clustering (i.e., a partitioning) of symbols of type $i$. If $r$, $x$, and $y$ are respectively in clusters $\gamma_r$, $\gamma_x$, and $\gamma_y$, we say that $r(x, y)$ is in the cluster combination $(\gamma_r, \gamma_x, \gamma_y)$.

The learning problem in SNE consists of finding the cluster assignment $\Gamma = (\Gamma_r, \Gamma_x, \Gamma_y)$ that maximizes the posterior probability $P(\Gamma | R) \propto P(\Gamma, R) = P(\Gamma) P(R | \Gamma)$, where $R$ is a vector of truth assignments to the observable $r(x, y)$ ground atoms.

With just four simple rules, we define one MLN for the likelihood $P(R|\Gamma)$ component and one MLN for the prior $P(\Gamma)$ component of the posterior probability.

The MLN for the likelihood component contains only one rule, stating that the truth value of an atom is determined by the cluster combination it belongs to:

$$\forall r, x, y, +\gamma_r, +\gamma_x, +\gamma_y \quad r \in \gamma_r \land x \in \gamma_x \land y \in \gamma_y \Rightarrow r(x, y)$$

This rule is soft. The "+" notation is syntactic sugar signifying that there is an instance of this rule, with a separate weight, for each cluster combination $(\gamma_r, \gamma_x, \gamma_y)$. This rule predicts the probability of query atoms given the cluster memberships of the symbols in them, and is known as the atom prediction rule.

As shown in [12], given a cluster assignment, the MAP weight $w_k$ of an instance of the atom prediction rule is given by $\log(t_k / f_k)$, where $t_k$ is the empirical number of true atoms in cluster combination $k$, and $f_k$ is the number of false ones. Adding smoothing parameters $\alpha$ and $\beta$, we estimate the MAP weight as $\log((t_k + \alpha)/(f_k + \beta))$.
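This weight estimate is a smoothed log-odds and is easy to verify directly; the counts and smoothing values below are illustrative:

```python
import math

# MAP weight of one atom prediction rule instance: log((t_k + a)/(f_k + b)).
# The counts and smoothing parameters here are made-up examples.

def map_weight(t_k, f_k, alpha=1.0, beta=1.0):
    """Smoothed log-odds of an atom being true in cluster combination k."""
    return math.log((t_k + alpha) / (f_k + beta))

# A combination whose atoms are mostly true gets a positive weight...
assert map_weight(50, 2) > 0
# ...one whose atoms are mostly false gets a negative weight.
assert map_weight(1, 40) < 0
```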

Three rules are defined in the MLN for the prior component. The first rule states that each symbol belongs to exactly one cluster:

$$\forall x \, \exists^1 \gamma \quad x \in \gamma$$

This rule is hard, i.e., it has infinite weight and cannot be violated.

The second rule imposes an exponential prior on the number of cluster combinations, combating their proliferation and the consequent overfitting. It is represented by the formula

$$\forall \gamma_r, \gamma_x, \gamma_y \, \exists r, x, y \quad r \in \gamma_r \land x \in \gamma_x \land y \in \gamma_y$$

with negative weight $-\lambda$. The parameter $\lambda$ is fixed during learning, and is the penalty in log-posterior incurred by adding a cluster combination to the model. Thus larger values of $\lambda$ lead to fewer cluster combinations being formed. This rule represents the complexity of the model in terms of the number of instances of the atom prediction rule (which is equal to the number of cluster combinations).

The last rule encodes the belief that most symbols tend to be in different clusters. It is represented by the formula

$$\forall x, x', \gamma_x, \gamma_{x'} \quad x \in \gamma_x \land x' \in \gamma_{x'} \land x \neq x' \Rightarrow \gamma_x \neq \gamma_{x'}$$

with positive weight $\mu$. The parameter $\mu$ is also fixed during learning. We expect there to be many concepts and high-level relations in a large heterogeneous body of text. The tuple extraction process samples instances of these concepts and relations sparsely, and we expect each concept or relation to have only a few instances sampled, in many cases only one. Thus we expect most pairs of symbols to be in different concept and relation clusters.
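A sketch of how the two soft prior rules enter the log-posterior; the values of $\lambda$ and $\mu$ below are hypothetical (the paper fixes them before learning but does not use these numbers):

```python
# lambda_ penalizes each cluster combination (the exponential prior); mu
# rewards each pair of distinct symbols placed in different clusters. Both
# values are illustrative.

lambda_ = 100.0
mu = 10.0

def prior_log_score(n_cluster_combinations, n_separated_pairs):
    # Contribution of the two soft prior rules to the log-posterior.
    return -lambda_ * n_cluster_combinations + mu * n_separated_pairs

# Fewer cluster combinations improve the prior score...
assert prior_log_score(5, 100) > prior_log_score(6, 100)
# ...and so does keeping more pairs of symbols in different clusters.
assert prior_log_score(5, 101) > prior_log_score(5, 100)
```

Cluster merges trade these terms off against each other and against the likelihood: merging clusters reduces the number of separated pairs (losing $\mu$ reward) but can reduce the number of cluster combinations (saving $\lambda$ penalties).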