It’s All Fun and Games until Someone Annotates: Video Games with a Purpose for Linguistic Annotation

David Jurgens and Roberto Navigli
Department of Computer Science
Sapienza University of Rome
{jurgens,navigli}@di.uniroma1.it

Abstract

Annotated data is a prerequisite for many NLP applications. Acquiring large-scale annotated corpora is a major bottleneck, requiring significant time and resources. Recent work has proposed turning annotation into a game to increase its appeal and lower its cost; however, current games are largely text-based and closely resemble traditional annotation tasks. We propose a new linguistic annotation paradigm that produces annotations from playing graphical video games. The effectiveness of this design is demonstrated using two video games: one to create a mapping from WordNet senses to images, and a second game that performs Word Sense Disambiguation. Both games produce accurate results. The first game yields annotation quality equal to that of experts and a cost reduction of 73% over equivalent crowdsourcing; the second game provides a 16.3% improvement in accuracy over current state-of-the-art sense disambiguation games with WordNet.

1 Introduction

Nearly all of Natural Language Processing (NLP) depends on annotated examples, either for training systems or for evaluating their quality. Typically, annotations are created by linguistic experts or trained annotators. However, such effort is often very time- and cost-intensive, and as a result creating large-scale annotated datasets remains a long-standing bottleneck for many areas of NLP.

As an alternative to requiring expert-based annotations, many studies have used untrained online workers, an approach commonly known as crowdsourcing. When successful, crowdsourcing enables gathering annotations at scale; however, its performance is still limited by (1) the difficulty of expressing the annotation task in a simply understood form suitable for the layman, (2) the cost of collecting many annotations, and (3) the tediousness of the task, which can fail to attract workers. Therefore, several groups have proposed an alternate annotation method using games: an annotation task is converted into a game which, as a result of game play, produces annotations (Pe-Than et al., 2012; Chamberlain et al., 2013). Turning an annotation task into a Game with a Purpose (GWAP) has been shown to lead to better quality results and higher worker engagement (Lee et al., 2013), thanks to the annotators being stimulated by the playful component. Furthermore, because games may appeal to a different group of people than crowdsourcing, they provide a complementary channel for attracting new annotators.

Within NLP, gamified annotation tasks include anaphora resolution (Hladká et al., 2009; Poesio et al., 2013), paraphrasing (Chklovski and Gil, 2005), term associations (Artignan et al., 2009) and disambiguation (Seemakurty et al., 2010; Venhuizen et al., 2013). The games' interfaces typically incorporate common game elements such as scores, leaderboards, or difficulty levels. However, the game itself remains largely text-based, with a strong resemblance to a traditional annotation task and little resemblance to the games most people actively play.

In the current work, we propose a radical shift in NLP-focused GWAP design, building graphical, dynamic games that achieve the same result as traditional annotation.
Rather than embellish an annotation task with game elements, we start from a video game that is playable on its own and build the task into the game as a central component. By focusing on the game aspect, players are presented with a more familiar task, which leads to higher engagement. Furthermore, the video game interface can potentially attract more interest from the large percentage of the populace who play video games.

In two video games, we demonstrate how certain linguistic annotation tasks can be effectively represented as video games. The first video game, Puzzle Racer, produces a mapping between images and WordNet senses (Fellbaum, 1998), thereby creating a large-scale library of visual analogs of concepts. While resources such as ImageNet (Deng et al., 2009) provide a partial sense-image mapping, they are limited to only a few thousand concrete noun senses, whereas Puzzle Racer annotates all parts of speech and both concrete and abstract senses. Furthermore, Puzzle Racer's output enables new visual games for tasks using word senses such as Word Sense Disambiguation, frame detection, and selectional preference acquisition. The second game, Ka-boom!, performs Word Sense Disambiguation (WSD) to identify the meaning of a word in context by having players interact with pictures. Sense annotation is regarded as one of the most challenging NLP annotation tasks (Fellbaum et al., 1998; Edmonds and Kilgarriff, 2002; Palmer et al., 2007; Artstein and Poesio, 2008), so we view it as a demanding application for testing the limits of visual NLP games.

Our work provides the following four contributions. First, we present a new game-centric design methodology for NLP games with a purpose. Second, we demonstrate with the first game that video games can produce linguistic annotations equal in quality to those of experts and at a lower cost than gathering the same annotations via crowdsourcing; with the second game we show that video games provide a statistically significant performance improvement over a current state-of-the-art non-video game with a purpose for sense annotation. Third, we release both games as a platform for other researchers to use in building new games and for annotating new data. Fourth, we provide multiple resources produced by the games: (1) an image library mapped to noun, verb, and adjective WordNet senses, consisting of 19,073 images across 443 senses, (2) a set of associated word labels for most images, (3) sense annotations as a distribution over images and senses, and (4) mappings between word senses and related Web queries.

2 Related Work

Games with a Purpose Multiple works have proposed linguistic annotation-based games with a purpose for tasks such as anaphora resolution (Hladká et al., 2009; Poesio et al., 2013), paraphrasing (Chklovski and Gil, 2005), term associations (Artignan et al., 2009; Lafourcade and Joubert, 2010; Vannella et al., 2014), acquiring common sense knowledge (Kuo et al., 2009; Herdağdelen and Baroni, 2012), and WSD (Chklovski and Mihalcea, 2002; Seemakurty et al., 2010; Venhuizen et al., 2013).
Notably, all of these linguistic games have players primarily interacting with text, in contrast to other highly successful games with a purpose such as Foldit (Cooper et al., 2010), in which players fold protein sequences, and the ESP game (von Ahn and Dabbish, 2004), where players label images with words.

Most similar to our games are Wordrobe (Venhuizen et al., 2013) and Jinx (Seemakurty et al., 2010), which perform WSD, and The Knowledge Towers (Vannella et al., 2014), which associates images with senses. Wordrobe asks players to disambiguate nouns and verbs using multiple-choice questions whose options are sense definitions; disambiguation is limited to terms with at most five senses, a limitation that does not exist in our games. Jinx uses two players who must independently provide lexical substitutes for an ambiguous word and are then scored on the basis of their shared substitutes. While Jinx has a more game-like feel, producing annotations from the substitutes is non-trivial and requires looking for locality of the substitutes in the WordNet graph. In contrast to Wordrobe and Jinx, we provide a game-centric design methodology for the seamless integration of the annotation task into a video game with dynamic, graphical elements.

The Knowledge Towers (TKT) is a video game for validating the associations between images and word senses in BabelNet (Navigli and Ponzetto, 2012) and for associating each of the senses with new images acquired from a Web query of one of the sense's lemmas. To perform the annotation, players are shown a word and its definition and then asked to retrieve pictures matching that definition during game play. In contrast, our Puzzle Racer game is purely visual and does not require players to read definitions, instead showing picture examples, which increases its video game-like quality. Furthermore, Puzzle Racer is tested on nouns, verbs, and adjectives, whereas TKT can only annotate nouns, since it relies on the BabelNet knowledge base, which contains images only for nouns, to acquire its initial set of image-sense associations.

Image Libraries Associating images with conceptual entities is a long-standing goal in Computer Vision (Barnard et al., 2003), and two approaches have built large-scale image libraries based on the WordNet hypernym ontology. The data set of Torralba et al. (2008) contains over 80M images across all 75,062 non-abstract WordNet noun synsets. However, to support the size of the data set, images are down-scaled to 32×32 pixels; furthermore, the image-sense mapping error rates vary between 25% and 80%, with more general concepts having higher error rates. The second significant image library comes from ImageNet (Deng et al., 2009), which contains 3.2M high-resolution images for 5,247 non-abstract WordNet noun synsets. Notably, both libraries focus only on concrete nouns. In contrast, the present work provides a methodology for generating image-sense associations for all parts of speech and for both abstract and concrete concepts. Within NLP resources, BabelNet (Navigli and Ponzetto, 2012) merges the Wikipedia and WordNet sense inventories and contains mappings from WordNet senses to the pictures present on the corresponding Wikipedia page. However, since its images come from an encyclopedia, the associations are inherently limited to nouns and, due to the partial mapping, only 38.6% of the WordNet senses have images, with an average of 3.01 images for those senses.
The present work also differs from previous approaches in that image-sense pairs are rated according to the strength of association between the image and the sense, rather than having a binary, ungraded association.

Crowdsourced WSD Many NLP areas have applied crowdsourcing (Wang et al., 2013); of these areas, the most related to this work is crowdsourcing word sense annotations. Despite initial success in performing WSD using crowdsourcing (Snow et al., 2008), many approaches noted the difficulty of performing WSD with untrained annotators, especially as the degree of polysemy increases or when word senses are related. Several approaches have attempted to make the task more suitable for untrained annotators by (1) using the crowd itself to define the sense inventory (Biemann and Nygaard, 2010), thereby ensuring the crowd understands the sense distinctions, (2) modifying the questions to explicitly model annotator uncertainty (Hong and Baker, 2011; Jurgens, 2013), or (3) using sophisticated methods to aggregate multiple annotations (Passonneau et al., 2012; Passonneau and Carpenter, 2013). In all cases, annotation was purely text-based, in contrast to our work.

3 Game 1: Puzzle Racer

The first game was designed to fill an important need for enabling engaging NLP games: image representations of concepts, specifically WordNet senses. Our goals are two-fold: (1) to overcome the limits of current sense-image libraries, which have focused largely on concrete nouns, and (2) to provide a general game platform for annotation tasks that need to associate lexical items with images. In the following, we first describe the design, annotation process, and extensibility of the game, and then discuss how its input data is generated. A live demonstration of the game is available online.1

1 http://www.knowledgeforge.org

3.1 Design and Game Play

Puzzle Racer was designed to be as "video game-like" as possible, with no mention of linguistic terminology. Because the game targets the layperson, we view this as a fundamental design objective for making the game more engaging and long-lasting. Therefore, Puzzle Racer is modeled after popular games such as Temple Run and Subway Surfers, but with the twist of combining two game genres: racing and puzzle solving. Racing provides the core of game play, while the annotation is embedded as puzzle solving during and after the race. In the following, we describe the game play and then detail how playing produces annotations.

Racing To race, players navigate a race car along a linear track filled with obstacles and enemy pieces. During play, players collect coins, which can be used to obtain non-annotation achievements and to increase their score. Enemies were added to introduce variety into the game and increase the strategy required to keep playing. Players begin the race with 2–4 health points, depending on the racer chosen, which are decreased when touching enemies. During game play, players may collect power-ups with familiar actions such as restoring lost health, doubling their speed, or acting as a magnet to collect coins. To bring a sense of familiarity, the game was designed using a combination of sprites, sound effects, and music from Super Mario World, Mario Kart 64, and custom assets created by us. Races initially last for 90 seconds, but may last longer if players collect specific power-ups that add time.
Puzzle Solving Prior to racing, players are shown three images, described as "puzzle clues," and instructions asking them to find the common theme in the three pictures (Fig. 1a). Then, during racing, players encounter obstacles, referred to as puzzle gates, that show a series of images. To stay alive, players must navigate their racer through the one picture in the series that shares the same theme as the puzzle clues. Players activate a gate after touching one of its images; a gate may only be activated once, and racer movement over other pictures has no effect. Puzzle gates appear at random intervals during game play.

Two types of gate appear. In the first, the gate shows pictures where one picture is known to be related to the puzzle clues. We refer to these as golden gates. Racing over an unrelated image in a golden gate causes the player to lose one health point, and the race ends if their health reaches zero. The second type of gate, referred to as a mystery gate, shows three images that are potentially related to the clue. Moving over an image in a mystery gate has no effect on health. Prior to activating a gate, there is no visual difference between the two gate types. Figure 1b shows a racer approaching a puzzle gate. Upon first moving their racer onto one of the gate's images, the player receives visual and auditory feedback based on the type of gate. In the case of a golden gate, the borders around all pictures change colors to show which picture should have been selected, a feedback icon appears on the chosen picture (shown in Figure 1c), and an appropriate sound effect plays. For mystery gates, the borders become blue, indicating that the gate has no effect.

Finally, when the race ends, players are asked to solve the race's puzzle by entering a single word that describes the race's puzzle theme. For example, in the race shown in Figure 1, an answer of "paper" would solve the puzzle. Correctly answering the puzzle doubles the points accumulated by the player during the race. The initial question motivates players to pay attention to the picture content shown during the race; the longer the player stays alive, the more clues they can observe to help solve the puzzle.

Annotation Image-sense annotation takes place by means of the puzzle gates. Each race's puzzle theme is based on a specific WordNet sense. Initially, each sense is associated with a small set of gold standard images, G, and a much larger set of potentially-associated images, U, whose quality is unknown. At the start of a race, three gold standard images are randomly selected from G to serve as puzzle clues. The details of gold standard image selection are described later in Sec. 3.2. We note that not all gold standard images are shown initially as puzzle clues, helping mask potential differences between golden and mystery gates.

Mystery gates annotate the images in U. The images in a mystery gate are chosen by selecting the least-rated image in U and then pairing it with n−1 random images from U, where n is the number of pictures shown per gate. By always including the least-rated image, we guarantee that, given sufficient plays, all images for a sense will eventually be rated. When a player chooses an image in the mystery gate, that image receives n−1 positive votes for being a good depiction of the sense; the remaining unselected images each receive one negative vote. Thus, an image's rating is the cumulative sum of the positive and negative votes it receives.
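As a concrete illustration, the selection-and-voting scheme can be sketched as follows. This is a minimal sketch that assumes ratings are kept in a simple in-memory dictionary; the function names are ours, not the game's actual code.

```python
import random

def build_mystery_gate(ratings, n=3):
    """Choose images for a mystery gate: the least-rated image in U
    plus n-1 other images drawn at random from U."""
    least_rated = min(ratings, key=ratings.get)
    others = random.sample([img for img in ratings if img != least_rated], n - 1)
    return [least_rated] + others

def record_gate_choice(ratings, gate_images, chosen):
    """Update cumulative ratings after a player activates a mystery gate:
    the chosen image gains n-1 positive votes, every other image in the
    gate receives one negative vote."""
    n = len(gate_images)
    for img in gate_images:
        ratings[img] += (n - 1) if img == chosen else -1
```

Because every activated gate hands out n−1 positive votes and n−1 negative votes, the total rating mass stays constant, which is the zero-sum property discussed next.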
This rating scheme is zero-sum, so image ratings cannot become inflated to the point where all images have a positive rating. However, we do note that if U includes many related images, some good images may end up with negative ratings due to the voting if even better images become higher ranked.

Figure 1: Screenshots of the key elements of the Puzzle Racer game: (a) puzzle clues; (b) a puzzle gate prior to activation; (c) an activated puzzle gate.

Golden gates are used to measure how well a player understands the race's puzzle concept (i.e., the sense being annotated). The first three puzzle gates in a race are always golden gates. We denote the percentage of golden gates correctly answered thus far as α. After the three initial golden gates are shown, the type of new puzzle gates is metered by α: golden gates are generated with probability 0.3 + 0.7(1 − α) and mystery gates are generated for the remainder. In essence, accurate players with high α are more likely to be shown mystery gates that annotate pictures from U, whereas completely inaccurate players are prevented from adding new annotations. This mechanism adjusts the number of new annotations a player can produce in real time based on their current accuracy at recognizing the target concept, which is not currently possible on common crowdsourcing platforms.
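A minimal sketch of this gate-type metering, assuming the game tracks the player's golden-gate accuracy as two counters (the helper name is ours):

```python
import random

def next_gate_type(gates_shown, golden_correct, golden_shown):
    """Return the type of the next puzzle gate.  The first three gates
    are always golden; afterwards a golden gate is generated with
    probability 0.3 + 0.7 * (1 - alpha), where alpha is the fraction of
    golden gates the player has answered correctly so far."""
    if gates_shown < 3:
        return "golden"
    alpha = golden_correct / golden_shown
    return "golden" if random.random() < 0.3 + 0.7 * (1 - alpha) else "mystery"
```

Note that even a perfectly accurate player (α = 1) still sees golden gates with probability 0.3, so accuracy continues to be spot-checked throughout the race.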
Last, we note that puzzle answering also provides labels for the race's images, data that might prove valuable for tasks such as image labeling (Mensink et al., 2013) and image caption generation (Feng and Lapata, 2013; Kulkarni et al., 2013).

Additional Game Elements Puzzle Racer incorporates a number of standard Game with a Purpose design elements (von Ahn and Dabbish, 2008), with two notable features: unlockable achievements and a leaderboard. Players initially start out with a single racer and power-up available. Players can then unlock new racers and power-ups through various game play actions of varying difficulty, e.g., correctly answering three puzzle questions in a row. This feature proved highly popular and provided an extrinsic motivation for continued playing. Second, players were ranked according to level, which was determined by the number of correct puzzle answers, correct golden gates, and their total score. The top-ranking players were shown at the end of every round and via a special in-game screen. A full, live-updated leaderboard was added halfway through the study and proved an important feature for new players competing for the top ranks.

Extensibility At its core, Puzzle Racer provides three central annotation-related mechanics: (1) an initial set of instructions on how players are to interact with images, (2) multiple series of images shown during game play, and (3) an open-ended question at the end of the game. These mechanics can be easily extended to other types of annotation where players must choose between several concepts shown as options in the puzzle gates. For example, the instructions could show players a phrase such as "a bowl of *" and ask players to race over images of things that might fit the "*" argument in order to obtain selectional preference annotations of the phrase (à la Flati and Navigli (2013)); the lemmas or senses associated with the selected images can be aggregated to identify the types of arguments preferred by players for the game's provided phrase. Similarly, the instructions could be changed to provide a set of keywords or phrases (instead of images associated with a sense) and ask players to navigate over images of the words in order to perform image labeling.

Nouns: argument, arm, atmosphere, bank, difficulty, disc, interest, paper, party, shelter
Verbs: activate, add, climb, eat, encounter, expect, rule, smell, suspend, win
Adjectives: different, important, simple
Table 1: Lemmas for Puzzle Racer and Ka-boom!

3.2 Image Data

Puzzle Racer requires the sets of images G and U as input. Both were constructed using image queries via the Yahoo! Boss API as follows. Three annotators were asked to each produce three queries for each sense of a target word and, for each query, to select three images as gold standard images. Queries were checked for uniqueness and to ensure that at least a few of their image results were related to the sense. Each query was used to retrieve one result set of 35 images. Two additional annotators then validated the queries and gold standard images produced by the first three annotators. Validation ensured that each sense had at least three queries and that |G| ≥ 6 for all senses. After validation, the gold standard images were added to G and all non-gold images in the result sets were added to U, discarding duplicates. During game play, puzzle clues are sampled across queries, rather than directly from G. Query-based sampling ensures that players are not biased towards images for a single visual representation of the sense.

While the construction of G and U is manual, we note that alternate methods of constructing the sets could be considered, including automatic approaches such as those used by ImageNet (Deng et al., 2009). However, as the focus of this game is on ranking images, a manual process was used to ensure high-quality images in G.

Importantly, we stress that the images in G alone are often insufficient, for two reasons. First, most senses – especially those denoting abstract concepts – can be depicted in many ways. Relying on a small set of images for a sense can omit common visualizations, which may limit downstream applications that require general representations. Second, many games rely on a sense of novelty, i.e., not seeing the same pictures repeatedly; however, limiting the images to those in G can create issues where too few images exist to keep player interest high. While additional manual annotation could be used to select more gold standard images, such a process is time-intensive; hence, one of our game's purposes is to eventually move high-quality images from U to G.

disc^4_n (a flat circular plate): "circular plate", "dish plate", "plate"
paper^3_n (a daily or weekly publication on folded sheets): "news paper", "daily newspaper", "newspaper headline"
simple^2_a (elementary, simple, uncomplicated): "simple problem", "1+1=2", "elementary equation"
win^1_v (be the winner in a contest or competition): "olympic winner", "lottery winner", "world cup victory"
Table 2: Examples of queries used to gather images

4 Puzzle Racer Annotation Analysis

Puzzle Racer is intended to produce a sense-image mapping comparable to what would be produced by crowdsourcing. Therefore, we performed a large-scale study involving over 100 players and 16,000 images. Two experiments were performed. First, we directly compared the quality of the game-based annotations with those of crowdsourcing. Second, we compared the difference in quality between expert-based gold standard images and the highest-ranked images rated by players.
4.1 Experimental Setup

To test the potential of our approach, we selected 23 polysemous noun, verb, and adjective lemmas, shown in Table 1. Lemmas had 4–10 senses each, for a total of 132 senses. Many lemmas have both abstract and concrete senses, and some are known to have highly related senses (Erk and McCarthy, 2009). Hence, given their potential annotation difficulty, we view performance on these lemmas as a lower bound.

For all lemmas, during the image generation process (Sec. 3.2) annotators were able to produce queries for all but one sense, expect^2_v;2 this produced 1356 gold images in G and 16,656 unrated images in U. Tables 2 and 3 show examples of the queries and gold standard images, respectively.

2 The sense expect^2_v has the definition "consider obligatory; request and expect." Annotators were able to formulate many queries that could have potentially shown images of this definition, but the image results of such queries were consistently unrelated to the meaning.

Table 3: Examples of gold standard images, shown for the senses interest^1_n ("a sense of concern with and curiosity about someone or something"), eat^1_v ("take in solid food"), different^1_a ("unlike in nature or quality or form or degree"), party^2_n ("a group of people gathered together for pleasure"), expect^6_v ("be pregnant with"), and shelter^5_n ("temporary housing for homeless or displaced persons").

The game play study was conducted over two weeks using a pool of undergraduate students, who were allowed to recruit other students. After an email announcement, 126 players participated. Players were ranked according to their character's level and given the incentive that the four top-ranking players at the end of the study would receive gift cards ranging from $15 to $25 USD, for a total compensation of $70 USD.

4.2 Experiment 1: Crowdsourcing Comparison

The first experiment directly compares the image rankings produced by the game with those from an analogous crowdsourcing task. Tasks were created on the CrowdFlower platform using the identical set of examples and annotation questions encountered by players. In each task, workers were shown three example gold standard images (sampled from the configurations seen by players) and asked to identify the common theme among the three examples. Then, five annotation questions were shown in which workers were asked to choose which of three images was most related to the theme. Questions were created after the Puzzle Racer study finished in order to use the identical set of questions seen by players as mystery gates. Workers were paid $0.03 USD per task.

To compare the quality of the Puzzle Racer image rankings with those from CrowdFlower, the three highest-rated images of each sense from both rankings were compared. Two annotators were shown a sense's definition and example uses, and then asked to compare the quality of three image pairs, selecting whether (a) the left image was a better depiction of the sense, (b) the right image was better, or (c) the images were approximately equal in quality. In the case of disagreement, a third annotator was asked to compare the images; the majority answer was used when present or, when all three ratings differed, the images were treated as equal, the latter of which occurred for only 17% of the questions. For all 396 questions, the method used to rank the image was hidden and the order in which images appeared was randomized.
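The adjudication rule just described can be sketched as follows (a minimal illustration; the function name and the label strings are ours):

```python
def adjudicate(first, second, third=None):
    """Resolve one image-pair comparison.  Each vote is "left", "right",
    or "equal".  Two agreeing annotators decide the pair; otherwise the
    third annotator's vote supplies a majority, and if all three answers
    differ the images are treated as equal."""
    if first == second:
        return first
    if third in (first, second):
        return third
    return "equal"
```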
Results During the study period, players completed 7199 races, generating 20,253 ratings across 16,479 images. Ratings were balanced across senses, with a minimum of 231 and a maximum of 329 ratings per sense. Players accurately identified each race's theme, selecting the correct image in 83% of all golden puzzle gates shown. Table 4 shows example top-rated images from Puzzle Racer.

Table 4: Examples of the three highest-rated images for six senses: activate^4_v ("aerate (sewage) so as to favor the growth of organisms that decompose organic matter"), argument^2_n ("a contentious speech act; a dispute where there is strong disagreement"), atmosphere^4_n ("the weather or climate at some place"), climb^1_v ("go upward with gradual or continuous progress"), important^1_a ("of great significance or value"), and rule^2_v ("decide with authority").

Experiment 1 measures differences in the quality of the three top-ranked images produced by Puzzle Racer and CrowdFlower for each sense. Puzzle Racer and CrowdFlower produced similar ratings, with at least one image appearing in the top three positions of both rankings for 55% of the senses. The two annotators agreed in 72% of cases when selecting the best sense depiction, finding that in 88% of the agreed cases both images were approximately equal representations of the sense. In the remaining agreed cases, the Puzzle Racer image was better in 4% and the CrowdFlower image in 8%. When resolutions from a third annotator were included, a similar trend emerges: both images were equivalent in 79% of all cases, Puzzle Racer images were preferred in 7%, and CrowdFlower images in 14%. These results show that, as a video game, Puzzle Racer produces results very similar to what would be expected under equivalent conditions with crowdsourcing.

4.3 Experiment 2: Image Quality

The second experiment evaluates the ability of the game to produce high-quality images by measuring the difference in quality between gold standard images and the game's top-rated images. CrowdFlower workers were shown a question with a sense's definition and example uses and then asked to choose which of two images was a better visual representation of the gloss. Questions were created for each of the three highest-rated images for each sense, pairing each with a randomly-selected gold standard image for that sense. Image order was randomized between questions. Five questions were shown per task and workers were paid $0.05 USD per task.3 The 2670 worker responses were aggregated by selecting each question's most frequent answer.

3 Workers were paid more for the second task to adjust for the time required to read each question's sense definition and example uses; thus, hourly compensation rates in the two experiments were approximately equivalent.

Results For senses within each part of speech, workers preferred the gold standard image to the top-rated image for nouns, verbs, and adjectives 57.4%, 53.1%, and 56.2% of the time, respectively. This preference is not significant at p < 0.05, indicating that the top-ranked images produced through Puzzle Racer game play are approximately equivalent in quality to images manually chosen by experts with full knowledge of the sense inventory.

4.4 Cost Comparison

Puzzle Racer annotations cost $70, or $0.0034 USD per rating. In comparison, the analogous CrowdFlower annotations cost $256.60, or $0.0126 USD per annotation. Because the game's costs are fixed, the cost per annotation is driven down as more players compete.
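For reference, the relative cost implied by the two per-annotation figures above is roughly

\[
\frac{\$0.0034\ \text{per game rating}}{\$0.0126\ \text{per crowdsourced annotation}} \approx 0.27, \qquad 1 - 0.27 = 0.73 .
\]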
As a result, Puzzle Racer reduces the annotation cost to at most 27% of that required by crowdsourcing. We note that factors other than the video game itself could have contributed to the cost reduction over crowdsourcing. However, as we demonstrate in Vannella et al. (2014), players will play a video game with a purpose without compensation just as much as they do when compensated, using a setup similar to the one in this experiment. Hence, the video game itself is likely the largest motivating factor for the cost reduction.

Video game-based annotation does come with indirect costs due to game development. For example, Poesio et al. (2013) report spending £60,000 over a two-year period to develop their linguistic game with a purpose. In contrast, Puzzle Racer was created using open source software in just over a month and developed in the context of a Java programming class, removing any professional development costs. Furthermore, Puzzle Racer is easily extensible to other text-image annotation tasks, enabling the platform to be reused with minimal effort.

The decreased cost does come with an increase in the time per annotation. All tasks on the CrowdFlower platform required only a few hours to complete, whereas the Puzzle Racer data was gathered over the two-week contest period. The difference in collection time reflects an important difference in the current resources: while crowdsourcing has established platforms with on-demand workers, no central platforms exist for games with a purpose with an analogous pool of players. However, although the current games were released in a limited fashion, later releases on larger venues such as Facebook may attract more players and significantly decrease both collection times and overall annotation cost.

5 Game 2: Ka-boom!

Building large-scale sense-annotated corpora is a long-standing objective (see (Pilehvar and Navigli, 2014)) and has sparked significant interest in developing effective crowdsourcing annotation and GWAP strategies (cf. Sec. 2). Therefore, we propose a second video game, Ka-boom!, that produces sense annotations from game play. A live demonstration of the game is available online.4

4 http://www.knowledgeforge.org/

Design and Game Play Ka-boom! is an action game in the style of the popular Fruit Ninja game: pictures are tossed on screen from its boundaries, and the player must selectively destroy them in order to score points. The game's challenge stems from rapidly identifying which pictures should or should not be destroyed as they appear. Prior to the start of a round, players are shown a sentence with a word in bold (Fig. 2a) and asked to envision pictures related to that word's meaning in the context. Players are then instructed to destroy pictures that do not remind them of the bolded word's meaning and to let live pictures showing something reminiscent of it. Once finished reading the instructions, players begin a round of game play that shows (1) images for each sense of the word and (2) images for unrelated lemmas, referred to as distractor images.

Players destroy pictures by clicking or touching them, depending on their device's input (Fig. 2b). Players are penalized for failing to destroy the distractor images. Rounds begin with a limit of at most three pictures on screen at once, which increases as the round progresses.
The additional images provide two important benefits: (1) an increasing degree of challenge to keep the player's interest, and (2) more image interactions to use in producing the annotation. Additionally, the increasing picture rate enables us to measure the interaction between game play speed and annotation quality in order to help tune the speeds of future games. The round ends when players fail to destroy five or more distractor images or when 60 seconds elapse. Ending the game early after players fail to destroy distractor images gives Ka-boom! a mechanism for limiting the impact of inaccurate or adversarial players on annotation quality. After game play finishes, players are shown their score and all the lemma-related pictures they spared (Fig. 2c), providing a positive feedback loop in which players can evaluate their choices.

Annotation Traditionally, sense annotation is performed by having an annotator examine a word in context and then choose the word's sense from a list of definitions. Ka-boom! replaces the sense definitions with image examples of each sense. A sense annotation is built from the senses associated with the images that the player spared. Images are presented to players as a sequence of flights. Each flight contains one randomly-selected picture for each of a word's n senses and n distractor images. Images within a flight are randomly ordered. The structure of a flight's images ensures that, as the game progresses, players see the same number of images for each sense; otherwise, the player's annotation might become biased simply because one sense's images appear more often.

Once the game ends, the senses associated with the spared images are aggregated to produce a sense distribution. For simplicity, the sense with the highest probability is selected as the player's answer; in the case of ties, multiple senses are reported. We note, though, that the game's annotation method could also produce a weighted distribution over senses (Erk et al., 2012), revealing the different meanings that a player considered valid in the context.

Figure 2: Screenshots of the three key elements of the Ka-boom! game: (a) the context and target word; (b) players destroying images; (c) the round-over summary.

The highest probability of a sense from this distribution is then multiplied by the duration of the game to produce the player's score for the round. Players maximize their score when they consistently choose images associated with a single sense, which encourages precise game play.
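A minimal sketch of how one round's spared images could be turned into an annotation and a score, assuming a simple mapping from each sense-related image to its sense (distractor images are simply absent from the mapping); the function name is ours:

```python
from collections import Counter

def score_round(spared_images, image_sense, duration_seconds):
    """Aggregate the senses of the spared images into a distribution,
    pick the most probable sense(s), and compute the round score as the
    highest sense probability multiplied by the round's duration."""
    counts = Counter(image_sense[img] for img in spared_images if img in image_sense)
    total = sum(counts.values())
    if total == 0:
        return {}, [], 0.0                    # no sense-related images were spared
    distribution = {sense: c / total for sense, c in counts.items()}
    best = max(distribution.values())
    answers = [s for s, p in distribution.items() if p == best]   # ties: report several senses
    return distribution, answers, best * duration_seconds
```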
The annotation design of having players destroy unrelated images was motivated by two factors. First, the mechanism of destroying unrelated images does not introduce noise into the annotation when a player mistakenly destroys an image; because only retained images count towards the sense annotation, players may be highly selective in which images they retain – even destroying some images that are associated with the correct sense – while still producing a correct annotation. Second, our internal testing showed that the objective of destroying unrelated pictures keeps players more actively engaged. In the inverse type of play, where players destroy only related pictures, players often had to wait for a single picture to destroy, causing them to lose interest.

Extensibility Ka-boom! contains two core mechanics: (1) instructions on which pictures should be destroyed and which should be spared, and (2) the series of images shown to the player during game play. As with Puzzle Racer, the Ka-boom! mechanics can be modified to extend the game to new types of annotation. For example, the instructions could display picture examples and ask players to destroy either similar or opposite-meaning ideas in order to annotate synonyms or antonyms. In another setting, images can be associated with semantic frames (e.g., from FrameNet (Baker et al., 1998)) and players must spare images showing the frame of the game's sentence in order to provide frame annotations.

6 Ka-boom! Annotation Analysis

Ka-boom! is intended to provide a complementary and more enjoyable method for sense annotation using only pictures. To test its effectiveness, we perform a direct comparison with the state-of-the-art GWAP for sense annotation, Wordrobe (Venhuizen et al., 2013), which is not a video game.

6.1 Experimental Setup

The organizers of the Wordrobe project (Venhuizen et al., 2013) provided a data set of 111 recently-annotated contexts having between one and nine games played for each (mean 3.2 games). This data was distinct from the contexts used to evaluate Wordrobe in Venhuizen et al. (2013), where all contexts had six games played each. The contexts were for 74 noun and 16 verb lemmas with a total of 310 senses (mean 3.4 senses per word). Contexts were assigned the most-selected sense label from the Wordrobe games.

To gather the images for each lemma used with Ka-boom!, we repeated an image-gathering process similar to the one used for the gold standard images in Puzzle Racer. Annotators generated at least three queries for each sense, selecting three images for each query as gold standard examples of the sense. During annotation, four senses could not be associated with any queries that produced high-quality images. In total, 2594 images were gathered, with an average of 8.36 images per sense. The query data and unrated images are included in the released data set, but were not used further in the Ka-boom! experiments.

Game players were drawn from a small group of fluent English speakers and were free to recruit other players. A total of 19 players participated. Unlike Puzzle Racer, players were not compensated. Each context was seen in at least six games.

Figure 3: Players' average WSD accuracy within a single game relative to the number of flights of pictures seen (curves for all lemmas, nouns, and verbs, with the MFS and Random baselines).

Method      All    Noun   Verb
Ka-boom!    0.766  0.803  0.559
Wordrobe    0.603  0.659  0.313
MFS         0.678  0.702  0.588
Random      0.322  0.325  0.312
Table 5: Sense disambiguation accuracies

WSD performance is measured using the traditional precision and recall definitions and their F1 measure (Navigli, 2009); because all items are annotated, precision and recall are equivalent, and we report performance as accuracy. Performance is measured relative to two baselines: (1) a baseline that picks the sense of the lemma that is most frequent in SemCor (Miller et al., 1993), denoted MFS, and (2) a baseline equivalent to the performance obtained if players had randomly clicked on images, denoted Random.5

5 This baseline is similar to random sense selection but takes into account differences in the number of pictures per sense.

6.2 Results

Two analyses were performed. Because Ka-boom! continuously revises the annotation during game play based on which pictures players spare, the first analysis assesses how the accuracy changes with respect to the length of a single Ka-boom! game.
The second analysis measures the accuracy with respect to the number of games played per context.

In the first analysis, each context's annotation was evaluated using the most-probable sense after each flight of game play. Figure 3 shows the results after six games were played. Players were highly accurate at playing, surpassing the MFS baseline after seeing two flights of pictures (i.e., two pictures for each sense). Accuracy remained approximately constant after three rounds for noun lemmas, while verb lemmas showed a small drop-off in performance. We believe that the increased rate at which images appear on screen likely caused the lower performance, as players were unable to react quickly enough. Many noun lemmas had easily-recognizable associated images, so higher-speed game play may still be accurate. In contrast, verbs were more general (e.g., "decide," "concern," and "include"), which required more abstract thinking in order to recognize an associated picture; as the game speed increased, players were not able to identify these associated pictures as easily, causing slightly decreased performance.

Table 5 shows the players' disambiguation accuracy after three flights in comparison to the accuracy obtained with Wordrobe and the two baselines. Ka-boom! provides an increase in performance over Wordrobe that is statistically significant at p < 0.01.6 Ka-boom! also provides a performance increase over the MFS baseline, though it is statistically significant only at p = 0.14. The time required to gather annotations after three flights varied based on the number of senses, but was under a minute in all cases, which puts the rate of annotation on par with that of expert-based annotation (Krishnamurthy and Nicholls, 2000).

6 We note that, although Venhuizen et al. (2013) report a higher accuracy for Wordrobe in their original experiments (85.7 F1), that performance was measured on a different data set and used six games per context.

In the second analysis, disambiguation accuracy was measured based on the number of games played for a context.7 Because the provided Wordrobe data set has 3.2 games played per context on average, results are reported only for the subset of contexts played in at least four Wordrobe games, in order to obtain consistent performance estimates. Ka-boom! annotations are recorded after three flights were seen in each game. Figure 4 shows the performance relative to the number of annotators for both Ka-boom! and Wordrobe, i.e., the number of games played for that context by different players.

7 In all cases, players played at most one game per context.

Figure 4: Average WSD accuracy as a function of the number of games played for a context (Ka-boom! and Wordrobe, overall and by part of speech, with the MFS and Random baselines).

For nouns, Ka-boom! is able to exceed the MFS baseline after only two games are played. For both nouns and verbs, multiple rounds of Ka-boom! game play improve performance. In contrast to Ka-boom!, Wordrobe accuracy declines as the number of players increases; when multiple players disagree on the sense, no clear majority emerges in Wordrobe, lowering the accuracy of the resulting annotation. In contrast, a single player in Ka-boom! produces multiple sense judgments for a context in a single game by interacting with each flight of images.
These interactions provide a robust distributional annotation over senses that can be easily aggregated with other players' judgments to produce a higher-quality single sense annotation. This analysis suggests that Ka-boom! can produce accurate annotations with just a few games per context, removing the need for many redundant annotations and improving the overall annotation throughput.

7 Conclusion and Future Work

In this work we have presented a new model of linguistic Games with a Purpose focused on annotation using video games. Our contributions show that designing linguistic annotation tasks as video games can produce high-quality annotations. With the first game, Puzzle Racer, we demonstrated that game play can produce a high-quality library of images associated with WordNet senses, equivalent to those produced by expert annotators. Moreover, Puzzle Racer reduces the cost of producing an equivalent resource via crowdsourcing by at least 73% while providing similar-quality image ratings. With the second game, Ka-boom!, we demonstrated that a video game can be used to perform accurate word sense annotation, with a large improvement over the MFS baseline and a statistically significant improvement over current game-based WSD.

While not all linguistic annotation tasks are easily representable as video games, our two games provide an important starting point for building new types of NLP games with a purpose based on video game mechanics. Software for both games will be open-sourced, providing a new resource for future game development and extensions of our work. Furthermore, the multiple data sets produced by this work are available at http://lcl.uniroma1.it/videogames, providing (1) a sense-image mapping from hundreds of senses to tens of thousands of images, (2) word labels for most images in our dataset, (3) Web queries associated with all senses, and (4) image-based word sense annotations.

Based on our results, three directions for future work are planned. The two games presented here focus on concepts that can be represented visually and thus lend themselves to annotations for lexical semantics. However, the fact that the games are graphical does not prevent them from showing textual items (see Vannella et al. (2014)), and more apt video games could be developed for text-based annotations such as PP-attachment or pronoun resolution. Therefore, as our first line of future work, we plan to develop new types of video games for textual items as well as extend the current games to new semantic tasks such as selectional preferences and frame annotation. Second, we plan to scale up both games to a broader audience such as Facebook, creating a larger sense-image library and a standard platform for releasing video games with a purpose. Third, we plan to build multilingual games using the images from Puzzle Racer, which provide a language-independent concept representation and could therefore be used to enable the annotation and validation of automatically-created knowledge resources (Hovy et al., 2013).

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. We thank the many game players whose collective passion for video games made this work possible.

References

Guillaume Artignan, Mountaz Hascoët, and Mathieu Lafourcade. 2009. Multiscale visual analysis of lexical networks. In Proceedings of the International Conference on Information Visualisation, pages 685–690.
Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Canada, 10–14 August 1998.

Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135.

Chris Biemann and Valerie Nygaard. 2010. Crowdsourcing WordNet. In Proceedings of the 5th International Conference of the Global WordNet Association (GWC-2010).

Jon Chamberlain, Karën Fort, Udo Kruschwitz, Mathieu Lafourcade, and Massimo Poesio. 2013. Using games to create language resources: Successes and limitations of the approach. In Iryna Gurevych and Jungi Kim, editors, The People's Web Meets NLP, Theory and Applications of Natural Language Processing, pages 3–44. Springer.

Tim Chklovski and Yolanda Gil. 2005. Improving the design of intelligent acquisition interfaces for collecting world knowledge from web contributors. In Proceedings of the International Conference on Knowledge Capture, pages 35–42. ACM.

Tim Chklovski and Rada Mihalcea. 2002. Building a sense tagged corpus with Open Mind Word Expert. In Proceedings of the ACL 2002 Workshop on WSD: Recent Successes and Future Directions, Philadelphia, PA, USA.

Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David Baker, Zoran Popović, and Foldit players. 2010. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756–760.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255.

Philip Edmonds and Adam Kilgarriff. 2002. Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, 8(4):279–291.

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 440–449, Singapore.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2012. Measuring word meaning in context. Computational Linguistics, 39(3):511–554.

Christiane Fellbaum, Joachim Grabowski, and Shari Landes. 1998. Performance and confidence in a semantic annotation task. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 217–237. MIT Press.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.

Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):797–812.

Tiziano Flati and Roberto Navigli. 2013. SPred: Large-scale harvesting of semantic predicates. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1222–1232, Sofia, Bulgaria.

Amaç Herdağdelen and Marco Baroni. 2012. Bootstrapping a game with a purpose for common sense collection. ACM Transactions on Intelligent Systems and Technology, 3(4):1–24.

Barbora Hladká, Jiří Mírovský, and Pavel Schlesinger. 2009. Play the language: Play coreference. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP), pages 209–212. Association for Computational Linguistics.
Jisup Hong and Collin F. Baker. 2011. How good is the crowd at "real" WSD? In Proceedings of the Fifth Linguistic Annotation Workshop (LAW V), pages 30–37. ACL.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2–27.

David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 556–562.

Ramesh Krishnamurthy and Diane Nicholls. 2000. Peeling an onion: The lexicographer's experience of manual sense-tagging. Computers and the Humanities, 34(1-2):85–97.

Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.

Yen-ling Kuo, Jong-Chuan Lee, Kai-yang Chiang, Rex Wang, Edward Shen, Cheng-wei Chan, and Jane Yung-jen Hsu. 2009. Community-based game design: Experiments on social games for commonsense data collection. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 15–22.

Mathieu Lafourcade and Alain Joubert. 2010. Computing trees of named word usages from a crowdsourced lexical network. In Proceedings of the International Multiconference on Computer Science and Information Technology (IMCSIT), pages 439–446, Wisla, Poland.

Tak Yeon Lee, Casey Dugan, Werner Geyer, Tristan Ratchford, Jamie Rasmussen, N. Sadat Shami, and Stela Lupushor. 2013. Experiments on motivational feedback for crowdsourced workers. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), pages 341–350.

Thomas Mensink, Jakob J. Verbeek, and Gabriela Csurka. 2013. Tree-structured CRF models for interactive image labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):476–489.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303–308, Plainsboro, N.J.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.

Martha Palmer, Hoa Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2):137–163.

Rebecca J. Passonneau and Bob Carpenter. 2013. The benefits of a model of annotation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

Rebecca J. Passonneau, Vikas Bhardwaj, Ansaf Salleb-Aouissi, and Nancy Ide. 2012. Multiplicity and word sense: Evaluating and learning from multiply labeled word sense annotations. Language Resources and Evaluation, 46(2):209–252.

Ei Pa Pa Pe-Than, D. H.-L. Goh, and Chei Sian Lee. 2012. A survey and typology of human computation games. In Proceedings of the Ninth International Conference on Information Technology: New Generations (ITNG), pages 720–725. IEEE.
Mohammad Taher Pilehvar and Roberto Navigli. 2014. A large-scale pseudoword-based evaluation framework for state-of-the-art Word Sense Disambiguation. Computational Linguistics, 40(4).

Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 3(1):3:1–3:44.

Nitin Seemakurty, Jonathan Chu, Luis von Ahn, and Anthony Tomasic. 2010. Word sense disambiguation via human computation. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 60–63. ACM.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Waikiki, Honolulu, Hawaii.

Antonio Torralba, Robert Fergus, and William T. Freeman. 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970.

Daniele Vannella, David Jurgens, Daniele Scarfini, Domenico Toscani, and Roberto Navigli. 2014. Validating and extending semantic knowledge bases using video games with a purpose. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1294–1304.

Noortje J. Venhuizen, Valerio Basile, Kilian Evang, and Johan Bos. 2013. Gamification for word sense labeling. In Proceedings of the International Conference on Computational Semantics (IWCS), pages 397–403.

Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pages 319–326.

Luis von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8):58–67.

Aobo Wang, Cong Duy Vu Hoang, and Min-Yen Kan. 2013. Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation, 47(1):9–31.