«by Yang Liu A dissertation submitted in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy (Information) in the ...»
Mining Social Media to Understand Consumers’
Health Concerns and the Public’s Opinion on
Controversial Health Topics
A dissertation submitted in partial fulﬁllment
of the requirements for the degree of
Doctor of Philosophy
in the University of Michigan
Associate Professor Kai Zheng, Co-Chair
Associate Professor Qiaozhu Mei, Co-Chair
Associate Professor David A. Hanauer
Associate Professor Joyce M. Lee
© Yang Liu 2016
All Rights Reserved
ACKNOWLEDGEMENTS
I would like to thank my advisors Kai Zheng and Qiaozhu Mei, who have been a wonderful source of support, inspiration and encouragement during my PhD program.
I am greatly indebted to my committee members, David Hanauer and Joyce Lee, for their medical expertise and consistently high standards of research.
There are many other people without whom this dissertation would not have been possible: V.G. Vinod Vydiswaran, with whom I have collaborated closely and from whom I have learned a lot; Matthew Davis and Helen Levy, who brought their public health policy perspective; Maria Woodward and Shreya Prabhu, who generously gave their time and offered ophthalmic expertise; and Jia Liu, Tricia O'Brien, Esha Sondhi, and Sonia Zhang, who helped me with an enormous amount of annotation.
I am fortunate to have had many wonderful collaborators while at the University of Michigan. Yan Chen, Roy Chen and Wei Ai, with whom I worked closely on a series of economic projects, have provided me with invaluable experience and knowledge of experimental economics. The Health Informatics Innovation group and Foreseer group have been a great source of ideas, feedback and friendship.
Finally, I would like to thank my parents, Aihong Cheng and Xianli Liu, for their love and continuous support.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER
I. Introduction
II. Systematic Literature Review
3.1 Distributions of the categories of site-defined and user-created groups
3.2 Frequency of tweets and users tweeting with those terms/hashtags
3.3 Frequency of the geo-tagged diabetes tweets in top countries
4.3 Top annotation disagreements on judging medical relevance
4.4 Top annotation disagreements between two error categories
ADR Adverse Drug Reaction
ACA Affordable Care Act
API application programming interface
ATAM Ailment Topic Aspect Model
BRFSS Behavioral Risk Factor Surveillance System
CDC The U.S. Centers for Disease Control and Prevention
CHV Consumer Health Vocabulary
CRF conditional random field
ILI Influenza-like Illness
LDA Latent Dirichlet Allocation
LIWC Linguistic Inquiry and Word Count
MMR measles, mumps, and rubella
POMS the Profile of Mood States
PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RBF radial basis function
SVM Support Vector Machine
UMLS Unified Medical Language System
Co-Chairs: Kai Zheng and Qiaozhu Mei
Social media websites are increasingly used by the general public as a venue to express health concerns and discuss controversial medical and public health issues.
This information could be used for public health surveillance as well as for soliciting public opinions. In this thesis, I developed methods to extract health-related information from multiple sources of social media data, and conducted studies to generate insights from the extracted information using text-mining techniques.
To understand the availability and characteristics of health-related information in social media, I first identified the users who seek health information online and participate in online health communities, and analyzed their motivations and behavior through two case studies: user-created groups on MedHelp and a diabetes online community on Twitter. Through a review of tweets mentioning eye-related medical concepts identified by MetaMap, I diagnosed the common reasons why tweets are mislabeled by natural language processing tools tuned for biomedical texts, and trained a classifier to exclude tweets that are not medically relevant, increasing the precision of the extracted data.
Furthermore, I conducted two studies to evaluate how effectively text-mining techniques can capture public opinions on controversial medical and public health issues from social media. The first study applied topic modeling and text summarization to automatically distill users' key concerns about the purported link between autism and vaccines. The outputs of the two methods cover most of the public concerns about MMR vaccines reported in previous survey studies. In the second study, I estimated the public's view on the Affordable Care Act (ACA) by applying sentiment analysis to four years of Twitter data, and demonstrated that the rates of positive and negative responses measured by tweet sentiment are in general agreement with the results of the Kaiser Family Foundation poll. Finally, I designed and implemented a system that automatically collects and analyzes online news comments to help researchers, public health workers, and policy makers better monitor and understand the public's opinion on controversial health-related topics.
Social media has revolutionized the way people disclose their personal health concerns and express opinions on controversial public health issues. It provides a unique platform for sharing health-related information without time and location constraints.
According to a 2014 Pew Research Center survey, 74% of adults with Internet access use social media sites (Pew, 2014). Another Pew report shows that 11% of social network site users have posted comments, queries, or information about health or medical matters (Fox, 2011). Meanwhile, both the government and individual companies have spent tremendous resources and effort to track public health conditions,1 risky health behaviors,2 and public opinions on controversial public health issues3 through personal interviews or telephone surveys. Policy makers and public health researchers rely on these poll results to monitor population health and develop intervention strategies.
Despite their large sample sizes, traditional polling methods (Groves et al., 2011) have several disadvantages, including their untimeliness, high cost, and respondents' limited availability. Health-related information in social media is a valuable source that can be used to overcome these disadvantages. Content analysis of online discussions of controversial public health issues can generate insights about public opinions. It can further help us estimate trends in public sentiment in real time at very low cost. Collections of personal health concerns expressed in social media can also be translated into effective early signals of disease outbreaks (Ginsberg et al., 2009). Finally, statistical analysis of this big data set can help clinical researchers discover new medical knowledge, such as adverse drug events (White et al., 2014) and disease comorbidities.
1 http://www.cdc.gov/nchs/nhis.htm
2 http://www.cdc.gov/brfss/about/index.htm
3 http://kff.org/report-section/kaiser-health-tracking-poll-april-2015-methodology/
Despite these opportunities, several challenges to mining social media text have prevented us from effectively utilizing this valuable information. First, the availability and characteristics of medically relevant data in social media remain unclear. This makes it difficult for researchers to determine what questions such data can help to answer, and to assess the validity and generalizability of the results generated. Second, compared with traditional health information sources such as electronic health records, social media data, which can be generated by anybody on the Internet, is inherently noisy due to misspellings, casual language style, and heterogeneous contexts. Extracting health-related information from this noisy data can be very challenging, and careless extraction can lead to false alarms of disease outbreaks or biased estimates of public opinion. Finally, the lack of efficient and effective methods to analyze and make sense of social media data further impedes the full utilization of this information. Since most existing text-mining and medical natural language processing techniques are designed for biomedical text (e.g., clinician notes, published scientific literature), their performance on social media data is questionable without careful evaluation against human-labeled ground truth.
In this thesis, I addressed each of these three challenges in turn. First, I summarized previous work by conducting a systematic literature review of studies on the motivations behind online health information sharing and seeking behavior, methods of extracting and analyzing health-related information in social media, and systems and tools leveraging such methods. I also investigated end-user motivation and behaviors in two scenarios, namely user self-initiated groups in a health forum and an online diabetes community on Twitter. Second, to extract health-related information from Twitter, I applied a state-of-the-art medical natural language processing tool, MetaMap, to identify potential mentions of medical concepts. I then evaluated the performance of MetaMap by comparing the eye-related concepts it identified to the results of a manual review of a sample of tweets. Using the manually annotated sample, I trained a classifier to correct the errors introduced by MetaMap and achieve higher accuracy. Third, I applied text-mining and natural language processing techniques to study public opinions using different social media data, and demonstrated the effectiveness of these tools by comparing the machine-generated results to human-annotated data or traditional poll results. Finally, I built a system that incorporates the techniques mentioned above and automates the process, facilitating information extraction and insight generation using the framework I developed.
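The evaluation step described above, comparing tool-flagged tweets against a manual review, can be illustrated with a minimal sketch. It assumes each tweet flagged by the concept extractor has received a single manual label (medically relevant or not); the function name and the toy labels are hypothetical and are not taken from the dissertation's data.

```python
from typing import Sequence


def precision_of_flagged(manual_labels: Sequence[bool]) -> float:
    """Fraction of tool-flagged tweets that reviewers judged medically relevant.

    Every item in `manual_labels` corresponds to one tweet the tool flagged,
    so the count of True labels is the number of true positives and the
    length of the sequence is the number of predicted positives.
    """
    if not manual_labels:
        raise ValueError("need at least one manually reviewed tweet")
    return sum(manual_labels) / len(manual_labels)


# Hypothetical toy review: four flagged tweets, three judged relevant.
toy_labels = [True, False, True, True]
toy_precision = precision_of_flagged(toy_labels)
```

A sample restricted to flagged tweets supports a precision estimate only; estimating recall would additionally require reviewing tweets the tool did not flag.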
Chapter II presents a literature review of existing techniques and tools for analyzing health-related information from social media discussions. Section 3.1 in Chapter III is based on part of our work published in ICWSM 2014 (Vydiswaran et al., 2014). Section 3.2 is based on unpublished work done in collaboration with Joyce Lee, David Hanauer and Qiaozhu Mei. Section 4.3 in Chapter IV is unpublished work done in collaboration with Vinod Vydiswaran, Kai Zheng, David Hanauer, Qiaozhu Mei, Trishia O’Brien, and Esha Sondhi. Section 5.1 in Chapter V is unpublished work done in collaboration with Vinod Vydiswaran, Kai Zheng, David Hanauer, and Qiaozhu Mei. Section 5.2 is ongoing work in collaboration with Matthew Davis, Kai Zheng, and Helen Levy.
The goal of this chapter is to summarize prior work in health sciences and computer science pertaining to the following four topics: (1) users' motivations for and concerns about sharing health-related data on social media websites, (2) methods of distilling health-related data from social media content, including methods of identifying medical concepts expressed in consumer language, (3) both quantitative and qualitative methods of analyzing health-related data, and (4) frameworks and applications using health-related data.
A systematic literature review was conducted according to the guidelines in the PRISMA statement (Moher et al., 2009). After consulting other interdisciplinary literature reviews spanning health and computer science (Saha et al., 2007; Crutzen et al., 2011; Fry and Neff, 2009; Fernandez-Luque et al., 2011a), I chose to search four databases in health sciences and computer science: PubMed, Web of Science, Google Scholar, and the ACM Digital Library. The following query was used to search the title and abstract fields (full text for Google Scholar) in the literature databases: health AND (twitter or tweets or facebook or myspace or youtube or “social media” or “user generated content”). The publication year had to be later than 2005, and the language was limited to English only. Eligible publications had to analyze content from popular social media websites rather than health-specific online communities. Furthermore, studies about the following topics were excluded: health policy research; using social media websites as a communication channel for health promotion or patient education; or health issues caused by using social media. In addition, references of relevant articles were reviewed, leading to 20 more articles being included. The PRISMA diagram is shown in Figure 2.1.
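The boolean query and the year and language limits described above can be sketched as a simple screening filter. This is an illustrative approximation, not the procedure actually run against the databases: the `Record` structure, the whole-word matching, and the function names are assumptions, and the topical exclusion criteria, which require human judgment, are not modeled.

```python
import re
from dataclasses import dataclass

# Platform terms taken from the search query in the text.
PLATFORM_TERMS = (
    "twitter", "tweets", "facebook", "myspace", "youtube",
    "social media", "user generated content",
)


@dataclass
class Record:
    """Hypothetical bibliographic record with the fields the query inspects."""
    title: str
    abstract: str
    year: int
    language: str


def matches_query(text: str) -> bool:
    """True if text satisfies: health AND (any platform term).

    Assumption: terms are matched as whole words, case-insensitively.
    """
    lowered = text.lower()
    has_health = re.search(r"\bhealth\b", lowered) is not None
    has_platform = any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered)
        for term in PLATFORM_TERMS
    )
    return has_health and has_platform


def is_candidate(record: Record) -> bool:
    """Apply the query plus the stated year (> 2005) and language limits."""
    text = f"{record.title} {record.abstract}"
    return (
        matches_query(text)
        and record.year > 2005
        and record.language.lower() == "english"
    )
```

Records that pass this filter would still need manual screening against the eligibility and exclusion criteria listed in the text.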
2.2.1 Benefits and Concerns of Sharing Personal Health Data
Although social media has been widely adopted across the population regardless of gender, education, race, health status, or health care access (Chou et al., 2009b; Fisher and Clayton, 2012; Shaw and Johnson, 2011), understanding the benefits users gain from sharing their personal health data, and their motivations for doing so, is still critical to inform future research, to improve the design of social media systems, and to increase their actual benefits to users.