THE STATISTICS OF TOPIC MODELLING
A thesis submitted in partial fulfilment of the requirements for the Degree
of Master of Science in Statistics
in the Department of Mathematics and Statistics
by Rebecca Katherine Abey
University of Canterbury
List of Illustrations
A Selected Glossary
What is Topic Modelling?
A Step-by-Step Introduction
The Generative Model
The Posterior Distribution
Latent Dirichlet Allocation
Approximate Posterior Inference
What is a Topic?
How to do a Topic Model
Running an Analysis
Analysis of a Topic Model
Running an Analysis
An Overview of the Dataset
Philosophy of Action
Analysis of Single Topics
Appendix A: Science Fiction Novels
Appendix B: Stop Words
Appendix C: R Code
List of Illustrations
Figure 1. Graph of trends in selected science fiction novels over time.
Figure 2. Illustration of a word cloud created from a topic model of articles written by Professor Brian Cox.
Figure 3. A network diagram showing the connections between a selection of topics using the PhilPapers dataset.
Figure 4. Photograph taken by myself of a cat on a garden fence.
Figure 5. An example of words being taken from the large container and sorted into one of the smaller containers.
Figure 6. The top three words expressed in the three topics created from a paragraph of Alice’s Adventures in Wonderland.
Figure 7. Graph of the distribution of topics for paragraph one of Alice’s Adventures in Wonderland.
Figure 8. Graph of the distribution of topics for paragraph two of Alice’s Adventures in Wonderland.
Figure 9. A graphical model of the parameters of a Dirichlet distribution.
Figure 10. Diagram of a Markov chain.
Figure 11. Graph of the frequency of topics displayed in the topic model analysis.
Figure 12. Graph of the top ten topics displayed in the topic model analysis.
Figure 13. Graph of the frequency of categories found on the PhilPapers website.
Figure 14. Graph of the top ten categories found on the PhilPapers website.
This research project would not have been possible without the collaboration between the Department of Mathematics and Statistics and the Department of Digital Humanities at the University of Canterbury. As my thesis sits at the border between statistics and the digital humanities, working with both parties has been extremely helpful, and I can only hope that my research is the beginning of many future collaborations between the two departments.
I would first like to thank Jennifer Brown and James Smithies for their supervision. Jennifer provided excellent feedback on getting the structure of my thesis right and had good ideas about what to add where. James provided valuable feedback on tailoring the work to the humanities and contributed good discussions on what to include in this project.
I would also like to thank David Bourget and David Chalmers from PhilPapers for allowing me the use of their extremely large database to analyse. I hope this project provides some interesting findings.
Many thanks to Lauren for helping me get my R code working. I am not the strongest code writer out there, but Lauren helped me fix any errors that came up and checked that everything was in working order.
A huge thank you to Tim David and François Bissey for introducing me to the idea of using a supercomputer for part of my research. They were extremely helpful, and while I did not end up using a supercomputer, I still thank them for their time and effort in explaining how the process works.
And finally, to Richard who supported me through the entire year of writing my thesis, and who put up with my constant barrage of questions.
This research project aims to provide a clear and concise guide to latent Dirichlet allocation, a form of topic modelling. The aim is to help researchers who do not have a strong background in mathematics or statistics feel comfortable using topic modelling in their work. To achieve this, the thesis provides a step-by-step explanation of how topic modelling works and describes a range of tools that can be used to perform a topic model analysis. The first chapter explains how topic modelling, and more specifically latent Dirichlet allocation, works; it offers a very basic explanation and then provides an easy-to-follow mathematical explanation. The second chapter explains how to perform a topic model analysis, walking through each step from the choice of dataset through to the software packages available. The third chapter provides an example topic model analysis based on the PhilPapers dataset. The final chapter discusses the highlights of each chapter and areas for further research.
Anomaly Detection – The process of detecting data points which do not fit within an expected pattern or other items within a dataset.
Association Rule Learning – A process that extracts if-then statements from a set of data. It looks for relationships such as if x then y.
Attribute – A piece of information that states the properties of a field or tag within a database.
Classification – The task of predicting the label or class of a data point whose label is unknown.
Clustering – A process in data mining where data points are separated into particular groups.
Conjugate – In Bayesian statistics, a prior distribution is conjugate to a likelihood if the resulting posterior distribution belongs to the same family as the prior.
Corpus – A collection of written texts.
Correlated – Having a mutual relationship or connection where one thing depends on another.
Exponential Family Distribution – A set of probability distributions of a specific mathematical form.
Generative Probabilistic Process – A process in which observable data are generated according to probability distributions.
GUI – Graphical User Interface, a computer interface that allows users to connect to the interface using graphical icons.
Humanities – Academic disciplines that study aspects of human society and culture, such as philosophy, history, religion, languages, art and classics.
Iterations – Repetitions of a process until a desired result is achieved.
Parallel Computing – A form of computation where several calculations are performed simultaneously.
Parameter – A constant or variable term in a function that determines the specific form of the function, but not its general nature.
Posterior Inference – An inference made after the relevant information is taken into account.
Probabilistic Model – A statistical model that provides an estimate, based on historical data, of the probability of an event occurring again.
Regression – A measure of the relationship between the mean of a variable and corresponding values of other variables.
Salmonella Pulse-Field Gel Electrophoresis – A laboratory technique that distinguishes strains of Salmonella by their DNA fingerprints.
Simplex – The set of points whose coordinates are non-negative and sum to one; probability distributions over a fixed number of outcomes live on a simplex.
Sparsity – How concentrated a distribution is. For example, a sparse distribution over topics places high probability on only a few topics.
Summarisation – A process for finding a compact description of a dataset.
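Two of the terms above, simplex and sparsity, can be illustrated with draws from a Dirichlet distribution, the distribution at the heart of latent Dirichlet allocation. The following is a minimal illustrative sketch; the thesis itself uses R, and Python with NumPy is used here only for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# A draw from a Dirichlet distribution is a point on the simplex:
# its components are non-negative and sum to one.
dense = rng.dirichlet(alpha=[10.0, 10.0, 10.0])   # large alpha: mass spread evenly
sparse = rng.dirichlet(alpha=[0.1, 0.1, 0.1])     # small alpha: mass on few components

print(dense, dense.sum())
print(sparse, sparse.sum())
```

With small alpha the draws tend to be sparse (most of the probability sits on one or two components), which is why LDA's Dirichlet priors encourage documents to use few topics and topics to use few words.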
Digital humanities is the humanities in the digital age (Piez, 2013). It combines traditional humanities subjects such as philosophy, history, art, linguistics, literature, archaeology and music with tools from disciplines such as data mining, statistics, text mining, digital mapping and information retrieval (Liu, 2013). There is much debate about the precise definition of digital humanities (Svensson, 2012). Two goals commonly described for the digital humanities are as follows: the first is to study digital media, including the cultures and cultural impacts of digital media, and to design and make digital media (Piez, 2013); the second is to bring the humanities into the digital age through the digitisation of texts and the use of computational tools to analyse them.
Digitisation is the transformation of media such as text, sounds, images and data from electronic devices into computer files (Brynjolfsson & McAfee, 2014). In recent years projects such as Google Books have set out to digitise large libraries of books to allow access to users around the globe and to preserve information.
Large datasets of digitised content offer significant opportunities for humanities researchers. To realise these opportunities, advances in statistical analysis to account for the complex nature of the content are needed. An understanding of these methods and their application by researchers in the digital humanities is also required. The most significant area of change in statistics relevant to analyses of large datasets of digitised content is the field of data mining. Data mining is the collective term for exploring large datasets using various techniques to find patterns in data. It incorporates many fields of academia including machine learning, statistics and database systems. The aim of data mining is to analyse large datasets consisting of thousands to millions of attributes and data points (Zaki & Meira, 2014). Data mining uses six types of analysis: anomaly detection, association rule learning, classification, clustering, regression, and summarisation.
Text mining or text analysis is one specific area of data mining. It not only covers analysis of large volumes of text such as novels, academic journal articles and newspaper clippings, it also covers emails, tweets and blog posts. Any type of text file can be used in text mining (Dean, 2014). There are several techniques within the area of text mining for analysing text. One of the more recent developments in this area is topic modelling. This is a new area of research and one specifically designed for analysis of large datasets of digitised content.
Topic modelling is a form of text analysis used to explore relationships between words within a document, where words are grouped together to form topics. The earliest work on topic modelling is by Papadimitriou, Tamaki, Raghavan, and Vempala (1998) and Hofmann (1999); the technique was further developed by Blei, Ng, and Jordan (2003). There is a variety of methods for topic modelling, using different sampling algorithms for word selection and topic creation. Latent semantic analysis is the most basic: it looks at the frequency of words within a document and creates topics based on the frequencies of words occurring in each document (Steyvers & Griffiths, 2007). Latent Dirichlet allocation is another basic topic model; it groups words together based on how likely they are to appear in a document together (Blei et al., 2003). Correlated topic models explore the correlation of words with other words within a document, creating topics based on the strength of those correlations (Blei & Lafferty, 2007). Explicit semantic analysis adds words from a document to a matrix based on frequency and creates topics based on the frequency of co-occurrence between words (Egozi, Markovitch, & Gabrilovich, 2011). Topic modelling can be used in many different academic domains, including both the sciences and the humanities, each of which finds its own uses for topic modelling.
The amount of data available on the Internet is vast and will only increase over time. Topic modelling provides an efficient way to process large amounts of information, and allows individual search topics to be discovered. Edward Y. Chang, a research director at Google, is currently working on incorporating topic modelling into Google's search engines, which will allow better exploration of Google's databases (Dickman, 2014).
A recent example of topic modelling in science is the work of Zhao, Zou, and Chen (2014) on topic modelling for cluster analysis of large biological and medical datasets. They assessed whether topic modelling is useful for biology and medicine, analysing three datasets (Salmonella pulse-field gel electrophoresis, lung cancer, and breast cancer) and comparing topic modelling with other data mining techniques. Their goal was to assess whether topic modelling gave a better answer to the particular problem posed by each dataset. They found that topic modelling performed better than the other data mining techniques, and concluded that it is beneficial for sorting through large sets of medical data, with slightly better precision than other methods (Zhao et al., 2014).