Designing Statistical Language Learners:

Experiments on Noun Compounds

Mark Lauer

Department of Computing

Macquarie University NSW 2109

Australia


Submitted in Partial Fulfillment of the Requirements

of the Degree of Doctor of Philosophy

December, 1995

Copyright © Mark Lauer, 1995

To Lesley Johnston,

without whom nothing good can ever come.

Abstract


Statistical language learning research takes the view that many traditional natural language processing tasks can be solved by training probabilistic models of language on a sufficient volume of training data. The design of statistical language learners therefore involves answering two questions: (i) Which of the multitude of possible language models will most accurately reflect the properties necessary to a given task? (ii) What will constitute a sufficient volume of training data? Regarding the first question, though a variety of successful models have been discovered, the space of possible designs remains largely unexplored. Regarding the second, exploration of the design space has so far proceeded without an adequate answer.

The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: it identifies a new class of designs by providing a novel theory of statistical natural language processing, and it presents the foundations for a predictive theory of data requirements to assist in future design explorations.

The first of these contributions is called the meaning distributions theory. This theory specifies an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements. Thus, rather than assigning probabilities to grammatical structures directly, grammatical forms inherit likelihoods from the semantic forms that they correspond to. The class of designs suggested by this theory represents a promising new area of the design space.

The second theoretical contribution concerns development of a mathematical theory whose aim is to predict the expected accuracy of a statistical language learning system in terms of the volume of data used to train it. Since availability of appropriate training data is a key design issue, such a theory constitutes an invaluable navigational aid. The work completed includes the development of a framework for viewing data requirements and a number of results allowing the prediction of necessary training data volumes under certain conditions.

The experimental contributions of this thesis illustrate the theoretical work by applying statistical language learning designs to the analysis of noun compounds. Both syntactic and semantic analysis of noun compounds have been approached using probabilistic models based on the meaning distributions theory.

In the experiments on syntax, a novel model, based on dependency relations between concepts, was developed and implemented. Empirical comparisons demonstrated that this model is significantly better than those previously proposed and approaches the performance of human judges on the same task. This model also correctly predicts the observed distribution of syntactic structures.

In the experiments on semantic analysis, a novel model, the first statistical model of this problem, was developed and implemented. The system uses statistics computed from prepositional phrases to predict a paraphrase with significantly better accuracy than the baseline strategy. The training data used is both sparse and noisy, and the experimental results support the need for a theory of data requirements. Without a predictive data requirements theory, statistical language learning remains an artform.

Acknowledgements

To do justice to the people who have, in one way or another, made this work what it is, feels like it would take more than the rest of the thesis. I can only say to everyone who has contributed that I apologise for the entirely inadequate acknowledgements to follow. I realise that the sheer number of names below will suggest to the reader that each has contributed only a little. What can I say? All have made a significant contribution; I have no real choice.

I could not ask for a pair of finer minds to share this journey with than my supervisors, Robert Dale and Mike Johnson, whose brilliance still amazes me on a regular basis. Together, they are my Plato and my Aristotle, and I doubt I will ever shed their influences on my thinking. I thank Mike for his infinite understanding and Robert for his boundless energy.

Without the vision and determination of Vance Gledhill, the unique environment at the Microsoft Institute, and all the work that has emerged from it, would never have existed. He deserves the success it currently enjoys, and my heartfelt thanks.

I want particularly to thank Mark Johnson for his assistance both in nurturing the development of my own ideas, and in generously contributing his own. Thanks also to Wayne Wobcke for discussions and input at various times.

Special thanks are deserved by Ted Briscoe, Gregory Grefenstette, Karen Jensen and Richard Sproat, all of whom don't realise how much I have valued their encouragements and ideas. I owe a great debt of gratitude to Philip Resnik, not only for his technical contributions, but also for his faith, passion and friendship; I only hope that one day I can do them justice.

More than anyone else, the other Microsoft Institute fellows have become part of the fabric of this thesis. I am grateful to everyone here. Especial thanks must go to Richard Buckland and Mark Dras; both are blessed with genius, as well as just being really friendly guys. Particular contributions have also been made by Sarah Boyd, Maria Milosavljevic, Steven Sommer, Wilco ter Stal, Jonathon Tidswell and Adrian Tulloch.

I wish also to thank the following people for their friendship, which has all helped: Ken Barker, Alan Blair, Tanya Bowden, Christophe Chastang, Phil Harrison, Rosie Jones, Patrick Juola, Elisabeth Maier, Michael Mitchell, Nick Nicholas, Peter Wallis, Susan Williams and Danny Yee.

Without the financial support generously given by the Microsoft Institute Fellowship Program and the Australian Government Postgraduate Award Scheme, this research would not have happened.

Every single one of the following has personally made a significant difference to the work presented here. I am sorry I cannot specifically thank you all.

John Bateman        George Heidorn        Malti Patel             Andrew Taylor
Ezra Black          Andrew Hunt           Pavlos Peppas           Lucy Vanderwende
Rebecca Bruce       Christian Jacquemin   Pam Peters              Wolfgang Wahlster
Ted Dunning         Geof Jones            David Powers            Bonnie Lyn Webber
Dominique Estival   Kevin Knight          James Pustejovsky       Yorick Wilks
Tim Finin           John Lafferty         Ross Quinlan            Dekai Wu
Norman Foo          Alon Lavie            Carolyn Penstein Rose   Collin Yallop
Louise Guthrie      Chris Manning         Jeff Siskind            David Yarowsky
Marti Hearst        Jenny Norris          Mark Steedman           Kobayasi Yoshiyuki

And now some very special people: There is nothing I can ever do to repay the unerring support and care provided by my father, my stepmother and my grandmother, without which I would be lost.

Finally, the love and friendship I have shared over the past few years with Andrew Campbell, Christine Cherry and Lesley Johnston goes beyond all words. Each has saved me from despair more times than I can count. They are the earth on which I stand, the air which I breathe and the sunlight that banishes my darkness.

Addendum to acknowledgements for first reprint

This thesis has been accepted without modification by Macquarie University in fulfillment of the requirements for the Degree of Doctor of Philosophy. Since submission I have received insightful comments from my examiners which have prompted me to make some small changes. I would therefore also like to thank them: Eugene Charniak, Mitch Marcus and Chris Wallace.

Preface

The research represented in this thesis was carried out at the Microsoft Institute. All work reported here is the original work of the author, with the following two exceptions.

1. The reasoning given in section 4.7 regarding empty and non-empty bins (pages 107-109) was developed by Mark Johnson of Brown University. The author's contribution was to extend the results to even values of n (the initial work only considered odd n) and complete the proof for equation 4.11. This work has been published as Lauer (1995a) with Mark Johnson's permission.

2. An original version of the probabilistic model given in section 5.1.2 was jointly developed by the author and Mark Dras, and has been published in Lauer and Dras (1994).

Some parts of this thesis include revised versions of published papers. I would like to thank the Association for Computational Linguistics for (automatically) granting permission to reuse material from Lauer (1995b) (this material is primarily contained in sections 5.1.2, 5.1.5 and 5.1.6). Similarly, the Pacific Association for Computational Linguistics has (automatically) granted permission to reuse material from Lauer (1995a) (this material appears in sections 4.4 through 4.7). Finally, kind permission has been given to reuse material from Lauer (1995c) ("Conserving Fuel in Statistical Language Learning: Predicting Data Requirements" in Proceedings of the Eighth Australian Joint Conference on Artificial Intelligence, pp. 443-450. Copyright by World Scientific Publishing Co. Pte, Singapore, 1995) which appears in sections 4.7 through 4.9.

Chapter 1

Introduction

This thesis is about computers learning about language. It is about bringing machines into communication with people, not by a laborious process of linguistic instruction, but by creating programs that learn about language for themselves. Just as for other work in natural language processing (NLP), the ultimate goal is to have computers understand language in the sense that humans do; in response to human language, computers should behave as we do, or at least analogously. And, just as with other work in NLP, it recognises that we must be content with meeker achievements in the short term. But the vision that inspires this work is freedom from the myriad of minutiae that make up language. Perhaps we, as teachers, will better serve our goal by showing computers how to learn and then leaving them to slave over the details of language. After all, remembering thousands of tiny facts is just what they are designed for.

This vision is gaining currency amongst NLP researchers. This is in part due to the staggering volume of text that is now available to computers as a learning resource. A single daily newspaper produces tens of millions of words of text every year, all stored electronically during typesetting. Electronic mail, business reports, novels and manuals are all being generated with every passing moment. An immense stream of text is being produced. At the same time, modern computers are growing enormously powerful. Statistics about linguistic phenomena can be computed from millions of words of text in just a few seconds. The question becomes: how can we put such statistics to use as a learning mechanism in order to exploit the ocean of text at our disposal?

The area of research concerning this question has been called statistical language learning (SLL). Already quite a bit of research has been devoted to its investigation, and it is by no means a recent idea. Many designs have been proposed for a variety of tasks, and a few have shown promising success. But the space of possible designs is enormous, and the part so far explored is relatively small. The goal of this thesis is to significantly further the exploration of the SLL design space. In order to advance, we must not only find previously uncharted areas of the design space, but also build tools that help us to navigate through it. This thesis contributes in both of these ways.

1.1 Theoretical Contributions

The theoretical component of this thesis comprises two main elements. First, it proposes an architectural theory of statistical natural language processing that identifies a new class of SLL designs. Second, it describes work on a mathematical theory of training data requirements for SLL systems that constitutes a powerful tool for selecting such designs. The following are brief outlines of the two theories.

Meaning distributions: Existing statistical language learning models are defined in terms of lexical and syntactic representations of language. Probability distributions generally capture only grammatical knowledge. The architectural theory proposed in this thesis advocates statistical models defined in terms of semantic representations of language. Rather than representing grammatical knowledge probabilistically, it views grammatical knowledge as a form of constraint. Syntactic structures inherit their probability distributions from semantic forms through these constraints. The aim of this theory is to suggest new designs. Thus, evaluation of the theory must come from using it to build SLL systems and testing their performance. The value of the theory lies in pointing out a promising direction for exploring the design space.
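The idea that syntactic structures inherit probability from semantic forms can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the meanings, the probabilities, the mapping `structure_of` (standing in for the grammatical constraints) and the function name are invented, and each meaning is assumed to correspond to at most one structure. It is not the model developed in the thesis.

```python
from collections import defaultdict


def structure_distribution(meaning_probs, structure_of):
    """Let each syntactic structure inherit probability from the meanings it expresses.

    meaning_probs: hypothetical probabilities assigned to semantic forms.
    structure_of:  hypothetical grammatical constraint mapping each semantic
                   form to the one syntactic structure that expresses it.
    """
    totals = defaultdict(float)
    for meaning, prob in meaning_probs.items():
        structure = structure_of.get(meaning)
        if structure is None:
            continue  # the grammar licenses no structure for this meaning
        totals[structure] += prob
    mass = sum(totals.values())
    if mass == 0:
        return {}
    # Renormalise over the structures the grammar actually permits.
    return {structure: prob / mass for structure, prob in totals.items()}


# Invented numbers, for illustration only.
meaning_probs = {"meaning_a": 0.5, "meaning_b": 0.3, "meaning_c": 0.2}
structure_of = {
    "meaning_a": "structure_1",
    "meaning_b": "structure_1",
    "meaning_c": "structure_2",
}

dist = structure_distribution(meaning_probs, structure_of)
best = max(dist, key=dist.get)
print(best, dist)
```

In this sketch no probability is ever attached to a structure directly; the grammar only decides which meanings a structure may stand for, and the numbers come entirely from the semantic side.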

Data requirements: The amount of text used to train a statistical language learning system is crucial to its performance. Since there is no well-known theory that can predict the amount of training data necessary, the prevalent methodology in SLL research is to get as much text as you can and see if your chosen model works. However, practical considerations of data availability have a strong impact on model design. Informed navigation of the design space rests on being able to predict data requirements. In this thesis, a framework for building a predictive theory is developed and several results are given that represent the first steps toward a general theory of data requirements.
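To give a feel for the kind of question a data requirements theory has to answer, the sketch below computes how many of a model's parameter bins are expected to receive no training data at all. It rests on a strong simplifying assumption that is not taken from the thesis: training instances fall independently and uniformly at random into the bins. Under that assumption the expected number of empty bins after m instances in b bins is b(1 - 1/b)^m. The function name is hypothetical.

```python
def expected_empty_bins(num_bins: int, num_instances: int) -> float:
    """Expected number of bins that receive no training instance.

    Assumes (for illustration only) that each instance lands in one of
    num_bins bins independently and uniformly at random.
    """
    if num_bins <= 0 or num_instances < 0:
        raise ValueError("need num_bins > 0 and num_instances >= 0")
    # A given bin is missed by one instance with probability 1 - 1/num_bins,
    # so it is missed by all of them with that probability to the power m.
    return num_bins * (1.0 - 1.0 / num_bins) ** num_instances


# How the expected fraction of unseen bins shrinks as training data grows.
for m in (100, 1_000, 10_000):
    fraction = expected_empty_bins(1_000, m) / 1_000
    print(f"{m:>6} instances: {fraction:.3f} of bins expected empty")
```

Real linguistic data is far from uniform, so a calculation like this is at best a rough guide to how sparse the training data will be for the events a model needs to see.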

Both of these theories have been investigated through experiments on statistical noun compound analysis.

1.2 Experiments on Noun Compounds

The experimental component of this work concerns noun compounds in English. Noun compounds are common constructions exemplified by the last three words in example 1.

(1) This time, let's avoid buying those styrofoam dinner plates.
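A three-noun compound such as the one in example 1 admits two binary bracketings, [[styrofoam dinner] plates] and [styrofoam [dinner plates]]. The sketch below shows one way a learner might choose between them from association scores. The scores are invented, the function name is hypothetical, and the decision rule (asking whether the first noun associates more strongly with the second or with the third) is an assumption made for illustration rather than a description of the model evaluated in this thesis.

```python
def bracket(n1, n2, n3, assoc):
    """Choose a bracketing for the three-noun compound n1 n2 n3.

    assoc maps (modifier, head) pairs to hypothetical association scores;
    pairs that are missing are treated as having score 0.
    """
    modifies_second = assoc.get((n1, n2), 0.0)
    modifies_third = assoc.get((n1, n3), 0.0)
    if modifies_second > modifies_third:
        return f"[[{n1} {n2}] {n3}]"  # left-branching
    return f"[{n1} [{n2} {n3}]]"      # right-branching (also chosen on ties)


# Invented scores, for illustration only.
assoc = {
    ("styrofoam", "dinner"): 0.1,
    ("styrofoam", "plates"): 0.6,
}

analysis = bracket("styrofoam", "dinner", "plates", assoc)
print(analysis)
```

With these made-up scores the sketch prefers the reading in which the plates, not the dinner, are made of styrofoam.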

Because noun compounds are frequent, highly ambiguous and require a great deal of knowledge to analyse, understanding them represents an ideal problem through which SLL designs
