Automatic Vandalism Detection in Wikipedia

Martin Potthast, Benno Stein, and Robert Gerling

Bauhaus University Weimar, Faculty of Media, 99421 Weimar, Germany

first name.last name@medien.uni-weimar.de

Abstract. We present results of a new approach to detect destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem, where vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not yet been addressed in the Information Retrieval literature. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to render vandalism detection as a machine learning task. We compiled a large number of vandalism edits in a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods that are currently applied in Wikipedia, our approach increases the F-Measure performance by 49% while being faster at the same time.

Introduction. The content of the well-known Web encyclopedia Wikipedia is created collaboratively by volunteers. Every visitor to a Wikipedia Web site can participate immediately in the authoring process: articles are created, edited, or deleted without the need for authentication. In practice, an article is developed incrementally since, ideally, authors review and revise the work of others. To date, about 8 million articles in 253 languages have been authored in this way.

However, Wikipedia's freedom of editing has at all times been misused by some editors. We divide them into three groups: (i) lobbyists, who try to push their own agenda, (ii) spammers, who solicit products or services, and (iii) vandals, who deliberately destroy the work of others. The Wikipedia community has developed policies for the manual recognition and handling of such cases, but enforcing them requires the manpower of many. With the rapid growth of Wikipedia, a shift from article contributors to editors working on article maintenance can be observed. Hence it is surprising that there is little research to support editors from the latter group or to automate their tasks.

As part of our research, Table 1 surveys the existing tools for the prevention of editing misuse.

Related Work. The first attempt to aid lobbying detection was the WikiScanner tool, which maps IP numbers recorded from anonymous editors to their domain names. In this way, editors can be found who are biased with respect to the topic in question. Since there are diverse ways for lobbyists to disguise their identity, a manual check of all edits for hints of lobbying is still necessary.

There has been much research concerning spam detection in e-mails, among Web pages, or in blogs. In general, machine learning approaches, possibly combined with manually developed rules, do an excellent spam detection job [1]. The respective technology may also be adequate for a misuse analysis in Wikipedia, but the applicability has not been investigated yet.

Table 1. Tools for the prevention of editing misuse with respect to the target group and the type of automation (aid, full). Tools shown gray use the same or a very similar rule set as the tool listed in the line above.

Tool                       Target   Automation  Status    URL
AntiVandalBot (AVB)        vandals  full        inactive  http://en.wikipedia.org/wiki/WP:AVB
MartinBot                  vandals  full        inactive  http://en.wikipedia.org/wiki/User:MartinBot
T-850 Robotic Assistant    vandals  full        active    http://en.wikipedia.org/wiki/User:T-850_Robotic_Assistant
WerdnaAntiVandalBot        vandals  full        active    http://en.wikipedia.org/wiki/User:WerdnaAntiVandalBot
Xenophon                   vandals  full        active    http://en.wikipedia.org/wiki/User:Xenophon_(bot)
ClueBot                    vandals  full        active    http://en.wikipedia.org/wiki/User:ClueBot
CounterVandalismBot        vandals  full        active    http://en.wikipedia.org/wiki/User:CounterVandalismBot
PkgBot                     vandals  aid         active    http://meta.wikimedia.org/wiki/CVN/Bots
MiszaBot                   vandals  aid         active    http://en.wikipedia.org/wiki/User:MiszaBot

C. Macdonald et al. (Eds.): ECIR 2008, LNCS 4956, pp. 663–668, 2008. © Springer-Verlag Berlin Heidelberg 2008

Vandalism was recognized as an open problem by researchers studying online collaboration [2,4,5,6,7,8], and, of course, by the Wikipedia community.¹ The former provide statistical or empirical analyses concerning vandalism, but neglect its detection. The latter developed four small sets of detection rules but did not evaluate the performance.

Misuses such as trolling and flame wars in discussion boards are related to vandalism, but so far no research exists on detecting either of them.

In this paper we develop foundations for an automatic vandalism detection in Wikipedia: (i) we define vandalism detection as a classification task, (ii) discuss the characteristics by which humans recognize vandalism, and (iii) develop tailored features to quantify them. (iv) A machine-readable corpus of vandalism edits is provided as a common baseline for future research. (v) Finally, we report on experiments related to vandalism detection based on this corpus.

Vandalism Detection Task. Let $E = \{e_1, \ldots, e_n\}$ denote a set of edits, where each edit $e$ comprises two consecutive revisions of the same document $d$ from Wikipedia, say, $e = (d_t, d_{t+1})$. Let $F = \{f_1, \ldots, f_p\}$ denote a set of vandalism-indicating features, where each feature $f_i$ is a function that maps edits onto real numbers, $f_i : E \to \mathbb{R}$.

Using $F$, an edit $e$ is represented as a vector $\mathbf{e} = (f_1(e), \ldots, f_p(e))$; $\mathbf{E}$ is the set of edit representations for the edits in $E$.

Given a vandalism corpus $E$ which has a realistic ratio of edits classified as vandalism and well-intentioned edits, a classifier $c$, $c : \mathbf{E} \to \{0, 1\}$, is trained with examples from $\mathbf{E}$. $c$ serves as an approximation of $c^*$, the true predictor of whether or not an edit forms a vandalism case. Using $F$ and $c$, one can classify an edit $e$ as vandalism by computing $c(\mathbf{e})$.
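The definitions above can be made concrete with a small sketch. The two feature functions and the threshold rule below are illustrative assumptions only (they are not the feature set of Table 3, and the threshold rule merely stands in for a trained classifier); the point is the shape of the pipeline: an edit is a pair of revisions, each feature maps an edit to a real number, and the classifier operates on the resulting vector.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Edit:
    """An edit e = (d_t, d_{t+1}): two consecutive revisions of one article."""
    old_text: str
    new_text: str


# A feature f_i maps an edit onto a real number.
Feature = Callable[[Edit], float]


def size_ratio(e: Edit) -> float:
    # Hypothetical feature: size of the new revision relative to the old one.
    return (len(e.new_text) + 1) / (len(e.old_text) + 1)


def upper_case_ratio(e: Edit) -> float:
    # Hypothetical feature: share of upper-case letters in the new revision.
    letters = [ch for ch in e.new_text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


FEATURES: List[Feature] = [size_ratio, upper_case_ratio]


def represent(e: Edit, features: Sequence[Feature] = FEATURES) -> List[float]:
    """Build the vector (f_1(e), ..., f_p(e)) for an edit e."""
    return [f(e) for f in features]


def classify(vector: Sequence[float], threshold: float = 0.5) -> int:
    # Placeholder for a trained classifier c: returns 1 (vandalism) when the
    # upper-case ratio exceeds an arbitrary threshold, else 0.
    return 1 if vector[1] > threshold else 0
```

In practice the placeholder `classify` would be replaced by a model trained on the edit representations, such as the logistic regression used in the evaluation.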

¹ http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Vandalism_studies (October 2007)

Table 2. Organization of vandalism edits along the dimensions “Edited content” and “Editing category”: the matrix shows for each combination the proportion of the respective vandalism edits among all vandalism edits. For vandalized structure insertion edits and content insertion edits, a list of their typical characteristics is also given. It includes both the characteristics described in previous research and those described in the Wikipedia policies.

–  –  –

Vandalism Indicating Features. We have manually analyzed 301 cases of vandalism to learn about their characteristics and, based on these insights, to develop a feature set F. Table 2 organizes our findings as a matrix of vandalism edits along the dimensions “Edited content” and “Editing category”; Table 3 summarizes our features.

Table 3. Features which quantify the characteristics of vandalism in Wikipedia

–  –  –

For two vandalism categories the matrix shows particular characteristics by which an edit is recognized as vandalism: a vandalism edit has the “point of view” characteristic if the vandal expresses a personal opinion, which often entails the use of personal pronouns.

Many vandalism edits introduce text that is off-topic with respect to the surrounding text, are nonsense in that they contradict common sense, or do not form a correct sentence in their language. The first three characteristics are very difficult to quantify, and research in this direction will be necessary to develop reliable analysis methods. Vulgar vandalism can be detected with a dictionary of vulgar words; however, one has to consider the context of a vulgar word since several Wikipedia articles contain vulgar words in a correct sense. Hence we quantify the impact of a vulgar word based on the point in time at which it was inserted into an article rather than simply checking its occurrence.

If an inserted text duplicates other text within the article or within Wikipedia, one may also speak of vandalism, but this is presumably the least offensive case. Very often vandalism consists only of gobbledygook: a string of characters which has no meaning whatsoever, for instance if the keyboard is hit randomly. Another common characteristic of vandalism is that it is often highlighted by capital letters or by the repetition of characters. In cases of deletion vandalism, larger parts of an article are deleted, which explains the high percentages of this vandalism type throughout all content types. Note that a vandalism edit typically shows several of these characteristics at the same time.
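Some of the characteristics described above lend themselves to simple surface measurements. The following is a minimal sketch under stated assumptions: the word lists are tiny stand-ins for real dictionaries, the function names are hypothetical, and restricting the analysis to the text an edit inserts is only one possible reading of quantifying a vulgar word by its time of insertion. It is not the feature set summarized in Table 3.

```python
import difflib
import re
from itertools import groupby
from typing import List, Set

# Illustrative stand-ins; real dictionaries would be far larger.
PRONOUNS: Set[str] = {"i", "me", "my", "you", "your", "we", "our"}
VULGAR_WORDS: Set[str] = {"stupid", "sucks"}


def inserted_text(old: str, new: str) -> str:
    """Return the tokens that the new revision adds relative to the old one."""
    old_tokens, new_tokens = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    parts: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            parts.extend(new_tokens[j1:j2])
    return " ".join(parts)


def words(text: str) -> List[str]:
    """Lower-cased word tokens of a text."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(text: str, vocabulary: Set[str]) -> float:
    """Share of words in the text that belong to the given vocabulary."""
    tokens = words(text)
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def longest_char_run(text: str) -> int:
    """Length of the longest run of one repeated character."""
    return max((len(list(group)) for _, group in groupby(text)), default=0)
```

Applying `term_frequency` to `inserted_text(old, new)` rather than to the whole new revision means that a vulgar word already present in a legitimate article does not count against a later, unrelated edit.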

Vandalism Corpus. Vandalism is currently not documented in Wikipedia, so automatic vandalism detection algorithms cannot be compared to each other. The best way to find vandalism manually is to look at the list of the most vandalized pages and then analyze the history of the listed articles.² We have set up the vandalism corpus WEBIS-VC07-11, which was compiled from our own investigations and the results of a study³ conducted by editors of Wikipedia. The corpus contains 940 human-assessed edits, of which 301 are classified as vandalism. It is available in machine-readable form for download at [9].

Evaluation. Within one-class classification tasks one is often confronted with the problem of class imbalance: one of the classes, either the target or the outlier class, is underrepresented, which makes training a classifier difficult. In a realistic detection scenario only 5% of all edits in a given time period are from the target class “vandalism” [5].

As a heuristic to alleviate the problem we resort to random over-sampling of the underrepresented class at training time. Nevertheless, an in-depth analysis with respect to domain characteristics of the training samples is still necessary; the authors of [3] have compared alternative methods to address class imbalance.
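Random over-sampling can be sketched as follows. This is one common variant, assumed here for illustration: examples of every smaller class are duplicated at random, with replacement, until each class is as large as the largest one. The function name, the fixed seed, and the choice to balance the classes fully are assumptions, not details given in the text.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A training example: (feature vector, label), label 1 = vandalism.
Example = Tuple[Sequence[float], int]


def random_oversample(examples: Sequence[Example], seed: int = 0) -> List[Example]:
    """Duplicate random examples of smaller classes until all classes match."""
    if not examples:
        return []
    rng = random.Random(seed)
    by_label: Dict[int, List[Example]] = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    target_size = max(len(group) for group in by_label.values())
    balanced: List[Example] = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target_size - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Over-sampling is applied to the training portion only; duplicating examples before splitting off test data would let copies of a test example leak into training.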

Using ten-fold cross-validation on the corpus WEBIS-VC07-11 and a classifier based on logistic regression we evaluated the discriminative power of the features described in Table 3 when telling apart vandalism and well-intentioned edits. We also analyzed the effort for computing these features and compared the results to AVB and to ClueBot. Table 4 summarizes the results.
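For readers unfamiliar with the protocol, ten-fold cross-validation can be sketched as a generator of train/test index splits. This is a plain, non-stratified variant with a hypothetical function name and a fixed shuffle seed; the text does not say whether the folds were stratified, so that choice is an assumption.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 940 edits of the corpus, each fold holds 94 edits for testing and leaves 846 for training; every edit is used for testing exactly once.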

² http://en.wikipedia.org/wiki/Wikipedia:Most_vandalized_pages (October 2007)
³ http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Vandalism_studies/Study1 (October 2007)

Table 4. Vandalism detection performance quantified as category-specific recall and averaged precision values. The first row shows, as the baseline, the currently best-performing Wikipedia bot, while the third row (bold) shows the results of our classifier. The right column shows the throughput on a standard PC. The underlying test corpus contains 940 human-assessed edits, of which 301 are classified as vandalism.

–  –  –

As can be seen, our approach (third row) outperforms the rule-based bots on all accounts. The individual analysis of each feature indicates its contribution to the overall performance. Note that vandalism detection suggests a two-stage analysis process (machine + human) and hence a preference for high recall over high precision: manual postprocessing of the classifier results is indispensable since visitors of a Wikipedia page should never see a vandalized document; moreover, a manual analysis is feasible because even an imprecisely retrieved target class contains only few elements.
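The relation between precision, recall, and the F-Measure can be made explicit with a short sketch. It assumes the standard weighted harmonic mean, with beta = 1 as the balanced case; the text does not state which weighting was used, so the beta parameter is an assumption. Values of beta above 1 weight recall more heavily, which matches the stated preference for recall.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
balanced_f = f_measure(0.83, 0.77)
recall_weighted_f = f_measure(0.83, 0.77, beta=2.0)
print(f"F1 = {balanced_f:.3f}, F2 = {recall_weighted_f:.3f}")
```

For the reported 83% precision and 77% recall, the balanced F-Measure works out to about 0.80.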


References

1. Blanzieri, E., Bryl, A.: A Survey of Anti-Spam Techniques. Technical Report DIT-06-056, University of Trento (2006)

2. Buriol, L.S., Castillo, C., Donato, D., Leonardi, S., Millozzi, S.: Temporal Analysis of the Wikigraph. In: WI 2006, pp. 45–51. IEEE Computer Society, Los Alamitos (2006)

3. Japkowicz, N., Stephen, S.: The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 6(5), 429–449 (2002)

4. Kittur, A., Suh, B., Pendleton, B., Chi, E.: He says, she says: Conflict and Coordination in Wikipedia. In: CHI 2007, pp. 453–462. ACM, New York (2007)

5. Priedhorsky, R., Chen, J., Lam, S., Panciera, K., Terveen, L., Riedl, J.: Creating, Destroying, and Restoring Value in Wikipedia. In: Group 2007 (2007)

6. Viégas, F.B.: The Visual Side of Wikipedia. In: HICSS 2007, p. 85. IEEE Computer Society, Los Alamitos (2007)

7. Viégas, F.B., Wattenberg, M., Dave, K.: Studying Cooperation and Conflict between Authors with History Flow Visualizations. In: CHI 2004, pp. 575–582. ACM Press, New York (2004)

8. Viégas, F.B., Wattenberg, M., Kriss, J., van Ham, F.: Talk before you Type: Coordination in Wikipedia. In: HICSS 2007, p. 78. IEEE Computer Society, Los Alamitos (2007)

9. Potthast, M., Gerling, R. (eds): Web Technology & Information Systems Group, Bauhaus University Weimar. Wikipedia Vandalism Corpus WEBIS-VC07-11 (2007), http://www.uni-weimar.de/medien/webis/research/corpora
