THE STATISTICS OF TOPIC MODELLING

A thesis submitted in partial fulfilment of the requirements for

the Degree

of Master of Science in Statistics

in the Department of Mathematics and Statistics

by Rebecca Katherine Abey

University of Canterbury

2015

Contents

List of Illustrations

Figures

Tables

Acknowledgements

Abstract

A Selected Glossary

Introduction

Digital Humanities

Topic Modelling

PhilPapers

Research Objectives

Thesis Structure

Literature Review

What is Topic Modelling?

A Step-by-Step Introduction

Latent Dirichlet

The Generative Model

The Posterior Distribution

Latent Dirichlet Allocation

Approximate Posterior Inference

Gibbs Sampling

What is a Topic?

How to do a Topic Model

Dataset

Software

Mallet

R

Gensim

LDA-C

GibbsLDA++

Running an Analysis

Analysis of a Topic Model

Dataset

Software

R

Mallet

Running an Analysis

An Overview of the Dataset

Foreign Languages

Philosophy of Action

Analysis of Single Topics

Discussion

Bibliography

Appendices

Appendix A: Science Fiction Novels

Appendix B: Stop Words

Appendix C: R Code

List of Illustrations

Figures

Figure 1. Graph of trends in selected science fiction novels over time.

Figure 2. Illustration of a word cloud created from a topic model of articles written by Professor Brian Cox.

Figure 3. A network diagram showing the connections between a selection of topics using the PhilPapers dataset.

Figure 4. Photograph of a cat on a garden fence, taken by the author.

Figure 5. An example of words being taken from the large container and sorted into one of the smaller containers.

Figure 6. The top three words expressed in the three topics created from a paragraph of Alice’s Adventures in Wonderland.

Figure 7. Graph of the distribution of topics for paragraph one of Alice’s Adventures in Wonderland.

Figure 8. Graph of the distribution of topics for paragraph two of Alice’s Adventures in Wonderland.

Figure 9. A graphical model of the parameters of a Dirichlet distribution.

Figure 10. Diagram of a Markov chain.

Figure 11. Graph of the frequency of topics displayed in the topic model analysis.

Figure 12. Graph of the top ten topics displayed in the topic model analysis.

Figure 13. Graph of the frequency of categories found on the PhilPapers website.

Figure 14. Graph of the top ten categories found on the PhilPapers website.

Acknowledgements

This research project would not have been possible without the collaboration between the Department of Mathematics and Statistics and the Department of Digital Humanities at the University of Canterbury. As my thesis sits on the border between statistics and digital humanities, working with both parties has been extremely helpful, and I can only hope that my research is the beginning of many future collaborations between the two departments.

I would first like to thank Jennifer Brown and James Smithies for their supervision. Jennifer provided me with excellent feedback on getting the structure of my thesis right and had good ideas about what to add and where. James provided me with good feedback on tailoring this to the humanities and good discussions on what to include in this project.

I would also like to thank David Bourget and David Chalmers from PhilPapers for allowing me the use of their extremely large database to analyse. I hope this project provides some interesting findings.

Many thanks to Lauren for helping me get my coding in R working. I am not the strongest code writer out there but Lauren helped me fix any errors that came up and helped check that everything was in working order.

A huge thank you to Tim David and François Bissey for introducing me to the idea of using a supercomputer for part of my research. They were extremely helpful and while I did not end up using a supercomputer I still thank them for their time and effort in explaining how the process works.

And finally, to Richard who supported me through the entire year of writing my thesis, and who put up with my constant barrage of questions.

Abstract

This research project aims to provide a clear and concise guide to latent Dirichlet allocation, which is a form of topic modelling. The aim is to help researchers who do not have a strong background in mathematics or statistics to feel comfortable using topic modelling in their work. To achieve this, the thesis provides a step-by-step explanation of how topic modelling works. A range of tools that can be used to perform a topic model analysis is also described. The first chapter explains how topic modelling and, more specifically, latent Dirichlet allocation work; it offers a very basic explanation and then provides an easy-to-follow mathematical explanation. The second chapter explains how to perform a topic model analysis; it describes each step used to run a topic model analysis, from the type of dataset through to the software packages available. The third section provides an example topic model analysis based on the PhilPapers dataset. The final section discusses the highlights of each chapter and areas for further research.





A Selected Glossary

Anomaly Detection – The process of detecting data points which do not fit within an expected pattern or other items within a dataset.

Association Rule Learning – A process that extracts if-then statements from a set of data. It looks for relationships such as if x then y.

Attribute – A piece of information that states the properties of a field or tag within a database.

Classification – The task of predicting the label or class of a given data point with unknown labels.

Clustering – A process in data mining where data points are separated into particular groups.

Conjugate – Any of a set of numbers that satisfy the same irreducible polynomial.

Corpus – A collection of written texts.

Correlated – Having a mutual relationship or connection where one thing depends on another.

Exponential Family Distribution – A set of probability distributions of a specific form.

Generative Probabilistic Process – A process in which observable data is generated using random probabilities.

GUI – Graphical User Interface, a computer interface that allows users to interact with a program through graphical icons.

–  –  –

religion, languages, art and classics.

Iterations – Repetitions of a process until a desired result is achieved.

Parallel Computing – A form of computation where several calculations are performed simultaneously.

Parameter – A constant or variable term in a function that determines the specific form of the function, but not its general nature.

Posterior Inference – An inference made after the relevant information is taken into account.

Probabilistic Model – A statistical model that provides an estimate, based on historical data, of the probability of an event occurring again.

Regression – A measure of the relationship between the mean of a variable and corresponding values of other variables.

Salmonella Pulse-Field Gel Electrophoresis – A method of detecting salmonella in patients.

Simplex – A space on which a series of points are found.

Sparsity – How spread out or scattered a distribution is. For example, how many beetles in a distribution over beetles tend to have high positive probability.

Summarisation – A process for finding a compact description of a dataset.

Introduction

Digital Humanities

Digital humanities is the humanities in the digital age (Piez, 2013). It combines the traditional humanities subjects such as philosophy, history, art, linguistics, literature, archaeology and music with tools from disciplines such as data mining, statistics, text mining, digital mapping and information retrieval (Liu, 2013). There is much debate about the precise definition of digital humanities (Svensson, 2012). Two goals are commonly described for the digital humanities. The first is to study digital media and the cultures and cultural impacts of digital media, and to design and make digital media (Piez, 2013). The second is to bring the humanities into the digital age through the digitisation of text and the use of computational tools to analyse these texts.

Digitisation is the transformation of media such as text, sounds, images and data from electronic devices into computer files (Brynjolfsson & McAfee, 2014). In recent years projects such as Google Books have set out to digitise large libraries of books to allow access to users around the globe and to preserve information.

Large datasets of digitised content offer significant opportunities for humanities researchers. To realise these opportunities, advances in statistical analysis to account for the complex nature of the content are needed. An understanding of these methods and their application by researchers in the digital humanities is also required. The most significant area of change in statistics relevant to analyses of large datasets of digitised content is the field of data mining. Data mining is the collective term for exploring large datasets using various techniques to find patterns in data. It incorporates many fields of academia including machine learning, statistics and database systems. The aim of data mining is to analyse large datasets consisting of thousands to millions of attributes and data points (Zaki & Meira, 2014). Data mining uses six types of analysis: clustering, classification, regression,

–  –  –

Smyth, 1996).

Text mining, or text analysis, is one specific area of data mining. It covers not only the analysis of large volumes of text such as novels, academic journal articles and newspaper clippings, but also emails, tweets and blog posts. Any type of text file can be used in text mining (Dean, 2014). There are several techniques within text mining for analysing text. One of the more recent developments in this area is topic modelling. This is a new area of research and one specifically designed for the analysis of large datasets of digitised content.

Topic Modelling

Topic modelling is a form of text analysis used to explore relationships between words within a document, where words are grouped together to form topics. The earliest work on topic modelling is by Papadimitriou, Tamaki, Raghavan, and Vempala (1998), and Hofmann (1999). The technique was further developed by Blei, Ng, and Jordan (2003). There are a variety of methods for topic modelling, using different sampling algorithms for word selection and topic creation. Latent semantic analysis is the most basic of these: it looks at the frequency of words within each document and creates topics based on those frequencies (Steyvers & Griffiths, 2007). Latent Dirichlet allocation is another basic topic model. It groups words together based on how likely they are to appear in a document together (Blei et al., 2003). Correlated topic models explore the correlation of words to other words within a document; topics are created based on the strength of correlations between words (Blei & Lafferty, 2007). Explicit semantic analysis adds words from a document to a matrix based on frequency and creates topics based on the frequency of co-occurrence between words (Egozi, Markovitch, & Gabrilovich, 2011). Topic modelling can be used in many different academic domains including both

–  –  –

use for topic modelling.
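
The frequency-based methods described above all start from simple counts of words within documents. The following is a minimal sketch of that counting step only, not an implementation of latent semantic analysis or latent Dirichlet allocation. The toy corpus, the whitespace tokeniser and the function names are assumptions made for illustration and do not come from the thesis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; these documents are illustrative only.
documents = {
    "doc1": "the cat sat on the fence",
    "doc2": "the cat chased the rabbit",
    "doc3": "alice followed the rabbit",
}


def tokenize(text):
    """Lower-case the text and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def term_frequencies(docs):
    """Count how often each word occurs in each document."""
    return {name: Counter(tokenize(text)) for name, text in docs.items()}


def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents that contain both words."""
    pairs = Counter()
    for text in docs.values():
        # Sort the distinct words so each pair is always stored in the same order.
        vocabulary = sorted(set(tokenize(text)))
        pairs.update(combinations(vocabulary, 2))
    return pairs


frequencies = term_frequencies(documents)
pairs = cooccurrence_counts(documents)

print(frequencies["doc1"].most_common(3))
print(pairs.most_common(5))
```

Words that often appear in the same documents, such as "cat" and "the" here, receive high co-occurrence counts. A real analysis would remove stop words such as "the" first and would pass the counts to a topic modelling package rather than inspect them directly.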

The amount of data available on the Internet is vast and will only increase over time. Topic modelling provides an easy way to process large amounts of information efficiently. It also allows for individual search topics to be discovered. Edward Y. Chang is a research director at Google and is currently working on implementing topic modelling into Google’s search engines. This will allow for a better exploration of Google’s databases (Dickman, 2014).

A recent example of the use of topic modelling in science is the work on topic modelling for cluster analysis of large biological and medical datasets (Zhao, Zou, & Chen, 2014). The authors assessed whether topic modelling is useful for biology and medicine. They analysed three datasets, covering Salmonella pulse-field gel electrophoresis, lung cancer, and breast cancer, and compared topic modelling with other data mining techniques. Their goal was to assess whether topic modelling gave a better answer to the particular problem they were trying to solve for each dataset. The analysis found that topic modelling gave a better result than the other data mining techniques. They concluded that topic modelling is beneficial for sorting through large sets of medical data, with slightly better precision than other data mining methods (Zhao et al., 2014).


