key: cord-0479874-kocgrz22
authors: Perez, Beatrice; Machado, Sara R.; Andrews, Jerone T. A.; Kourtellis, Nicolas
title: I call BS: Fraud Detection in Crowdfunding Campaigns
date: 2020-06-30
sha: db04b431a055581e1f095ec8763b8af6f304467c
doc_id: 479874
cord_uid: kocgrz22

Donations to charity-based crowdfunding environments have been on the rise in the last few years. Unsurprisingly, deception and fraud in such platforms have also increased, but they have not been thoroughly studied to understand what characteristics can expose such behavior and allow its automatic detection and blocking. Indeed, crowdfunding platforms are typically the only ones performing oversight for the campaigns launched in each service. However, they are not properly incentivized to combat fraud among users and the campaigns they launch: on the one hand, a platform's revenue is directly proportional to the number of transactions performed (since the platform charges a fixed amount per donation); on the other hand, if a platform is transparent about how much fraud it hosts, it may discourage potential donors from participating. In this paper, we take the first step in studying fraud in crowdfunding campaigns. We analyze data collected from different crowdfunding platforms, and annotate 700 campaigns as fraud or not. We compute various textual and image-based features and study their distributions and how they associate with campaign fraud. Using these attributes, we build machine learning classifiers, and show that it is possible to automatically classify such fraudulent behavior with up to 90.14% accuracy and 96.01% AUC, only using features available from the campaign's description at the moment of publication (i.e., with no user or money activity), making our method applicable for real-time operation on a user browser.

Crowdfunding has become a standard means to financially support individuals' needs or ideas, typically through an online campaign appealing to contributions from the community. What started off as a grassroots movement is now a flourishing industry: from $597M raised worldwide in 2014 to $17.2B raised in North America alone in 2017, and the industry is expected to continue growing globally [34]. Over the last decade, the emergence and consolidation of crowdfunding platforms (CFPs) have narrowed the field to a handful of platforms. Top contenders, such as Kickstarter, FundingCircle, and GoFundMe, have specialized in one of three categories: investment-based platforms, where donors become angel investors in a new enterprise; reward-based platforms, where backers provide loans with the condition of interest upon repayment; and donation-based platforms, where campaigns are an appeal to charity [2].

CFPs, with their increasing popularity and fundraising ability for many causes (recently even coronavirus-related costs [30]), inevitably attract malicious actors who take advantage of unsuspecting users (e.g., [14, 27, 36, 37]). Immediate access to trusting investors and their funds makes these platforms particularly attractive for malicious activity. Some platforms allow fund disbursements to happen immediately following a donation, while others require scheduled intervals (e.g., weekly), or reaching donation goals. Furthermore, there is a striking lack of regulation in this space [17]. This void leaves a grey area where crimes are hard to define and difficult to prosecute.
The emergence of a few highly publicized cases of fraud in crowdfunding campaigns further undermines the general public's confidence in this trust-based enterprise. Nevertheless, evidence of campaign and fund misuse is scant. According to GoFundMe, one of the most prominent CFPs, fraudulent campaigns make up less than 0.1% of all campaigns posted on the site [13]. But even at this "low" rate of fraud, which has not been substantiated with transparent reports by GoFundMe or other CFPs, in a billion-dollar industry it can amount to tens of millions in defrauded funds every year. Given that these campaigns are CFPs' major source of revenue, through commissions on new campaigns and on every donation, CFPs are not properly incentivized to detect and stop fraud. Therefore, the lack of tools quantifying this problem is not surprising, effectively preventing the protection of unsuspecting contributors.

In this study, we aim to provide such tools to help combat fraud in donation-based CFPs. We analyze campaigns in North America created to cover medical expenses, a primary reason for these types of appeals (one in three crowdfunding campaigns [24]). The urgency and strong emotional content of health-related financial constraints easily attract donor attention and donations. We are interested in quantifying the prevalence of fraudulent behavior in these campaigns, which requires us to classify campaigns as fraud or not. Our goal is to create a machine learning (ML) classifier to distinguish between campaigns that are fraudulent or not at the moment of their creation, i.e., using only features extracted from newly published campaigns. To accomplish our task, we collect and annotate over 700 campaigns from all major CFPs (GoFundMe, MightyCause, Fundly, Fundrazr, and Indiegogo) and derive deception cues from both the text and the images provided in each campaign.

Overall, we find that fraud is a small percentage of the crowdfunding ecosystem, but an insidious problem: it corrodes the trust on which these platforms operate, endangering the support that thousands of people receive year after year. Our results show that using an ensemble ML classifier that combines both textual and visual cues, we can achieve a Precision of 91.14%, a Recall of 90.77%, and an AUC of 96.01%, i.e., approximately a 41% improvement over the deception-detection ability of people within the same culture [5]. This work is also the first to incorporate both text and images in the analysis of fraud, and we rely on features available immediately after a campaign goes live. This is a significant step in building a system that is preemptive (e.g., a browser plugin) as opposed to reactive. We believe our method could help build trust in this ecosystem by allowing potential donors to vet campaigns before contributing. Similarly, CFPs could use it to prompt vetting and request additional information from potentially fraudulent campaign creators before campaigns are made public.

Our contributions with the present study are as follows:
• We collect a dataset of over 700 crowdfunding campaigns on different health-related topics, including medical, emergency appeals, and memorials, and annotate them for being fraudulent or not.
• Through the use of NLP techniques, we extract language cues, including emotions and complexity of language, and study their association with fraudulent campaigns.
• Using convolutional neural networks, we extract characteristics of the images posted with each campaign, including emotions and content displayed, and associate them with fraudulent campaigns.
• Using the above features (text- and image-based), we train supervised classification techniques to perform automatic detection of fraudulent campaigns, and discuss our results.
• We make the collected and annotated dataset available for other researchers to further investigate this problem on CFPs.

Previous work on financial fraud highlights this task's complexity. Financial information has typically been used to predict the likelihood of a given transaction being fraudulent. Primarily, past works build behavioral profiles for each user to compute the likelihood of a new transaction being legitimate [1, 3, 6, 10, 26, 31, 33]. Novel models in financial technology have opened new areas of research in this space.

Some works focus on detecting deception online. Luca and Zervas [23] take a set of reviews identified as fraud or not fraud by the platform Yelp and explore the determinants of fraudulent behavior. The authors explore how restaurants' engagement in positive and/or negative review fraud (i.e., fake reviews) interacts with reputation and competition over time. We use insights from these fraudulent reviews to shape our understanding of deceptive behavior in CFP campaigns. In their work on peer-to-peer lending, Xu et al. [39] explore trust relationships between borrowers and lenders. Funds raised can be used for private affairs (i.e., there are no business plans and no milestones to rely on for validation), just like crowdfunding campaigns. However, in their model, the money is meant to be repaid and the network members build up their reputation over time. They find that soft descriptions of the borrower, e.g., age, gender, race, and appearance, are good predictors of whether a loan will be repaid in time. Similarly, the physical appearance (gender, race, and attractiveness) of a campaign's creator, as understood from their profile picture, can be used to predict the trustworthiness and, by extension, the success of a campaign [11, 22, 29]. Again, such features help us better understand trust relationships between funders and requesters. Like peer-to-peer lending, crowdfunding is an online financial tool in the hands of millions. Both are paradigms that depend on participant trust, and fraud against them "causes emotional and financial harm to lenders (donors) and great damage to sites (platforms) destroying their reputation" [39].

The study of fraud in crowdfunding has been tied to the success of the campaign [38], or to entrepreneurial endeavors. Wessel et al. [38] looked at social capital as a means to influence consumer decision making. They take 591 campaigns that have been flagged for having fake Facebook likes and find that fake social capital has an overall negative impact on the number of backers of a campaign. Siering et al. [32] looked at the problem of deception in crowdfunding campaigns, focusing on investment-based donations, where there is an expectation of a reward and a defined business plan. Importantly, these are business interactions, fundamentally different from the altruistic behavior implied in our campaigns. Based on linguistic text cues, the model presented in [32] achieves 75% accuracy. We build on textual features and combine them with image features, leading our model to achieve 86% accuracy.
Finally, and most similar to our work, Cumming et al. [9] look at entrepreneurial campaigns (i.e., commercial campaigns with pledges and rewards) and try to understand the difference between campaigns labeled as detected fraud, suspected fraud, and not fraud. Using theories from economics and the behavioral sciences, they identify four possible markers: the characteristics and background of the campaign creator (i.e., use of names and further participation in the community), a campaign's affinity to social media, the funding and reward structure of the campaign (i.e., the duration of the campaign), and finally, details in the campaign description (i.e., clarity of language and veracity). They test their markers on a dataset of 207 fraud cases from two major crowdfunding portals (Kickstarter and Indiegogo) and find that fraudulent campaigns can be described as having longer collection periods, no associated Facebook page, and campaign creators with comparatively less time in the crowdfunding community. Even though this is relevant to our work, there are several primary differences with Cumming et al. [9]: 1) the type of and incentive behind the campaigns (entrepreneurial vs. charitable donations for health problems), 2) when the funds are available (the full amount must be raised vs. immediately available), and 3) the tone and content of the text in the campaigns. Indeed, while we do borrow their insights on the relationship between fraud and the simplicity of text, the problem we are addressing is different.

Crowdfunding sites are designed to help people connect funding requests with benefactors. While each CFP is different, they all provide a search engine to find specific campaigns and a classification system that allows visitors to find campaigns that may be relevant to their interests. In this paper, we looked at campaigns from the top five crowdfunding platforms online: Indiegogo, GoFundMe, MightyCause, Fundrazr, and Fundly. The information displayed per campaign varies depending on the platform and on the creator of the campaign. Table 1 summarizes the fields and descriptions that are common across all sites:
• The general classification of the campaign (e.g., Memorial, Health, Emergencies, etc.).
• The total amount of money the creator hopes to raise.
• The money that has been raised by the campaign to date.
• The number of individual contributors to the cause.
• The individual amounts of each of the contributions.
• The number of likes or shares that the campaign has received over social media.
• The location from which the campaign was launched.

Depending on the type of campaign and hosting platform, funds raised are delivered either to the campaign creator or to the beneficiary. Typically, entrepreneurial campaigns require a goal to be met before funds are released (not reaching a funding goal results in contributions being returned to investors), whereas in charitable projects there is no limit on how soon or how often funds are withdrawn from the campaign. In terms of revenue, gofundme.com, the most prominent CFP, states that it charges a percentage of each transaction as a fee, plus a fixed amount per donation [12]. Finally, CFPs are aware of the risk of fraud and some offer a guarantee: any member who made a donation to a fraudulent campaign is entitled to a refund by the CFP. The reimbursement, however, must be requested by the donor after an internal investigation reveals the campaign to be fraudulent.
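To make the per-campaign record concrete, the following is a minimal sketch of how such a record could be represented; the field names are our own hypothetical labels for the descriptions summarized in Table 1, not the platforms' or the paper's terminology.

```python
# Sketch: one collected campaign record, with hypothetical field names
# corresponding to the common fields summarized in Table 1.
from dataclasses import dataclass, field

@dataclass
class Campaign:
    category: str            # general classification (e.g., Memorial, Health, Emergencies)
    goal: float              # total amount the creator hopes to raise
    raised: float            # amount raised to date
    num_donors: int          # number of individual contributors
    donations: list[float] = field(default_factory=list)  # individual donation amounts
    social_shares: int = 0   # likes/shares received over social media
    location: str = ""       # location from which the campaign was launched
```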
When a campaign is reported as suspicious, the CFP will send a request for information to the creator of the campaign. Following the initial report, continued suspicious behavior might result in a campaign being deactivated, or removed altogether. A deactivated campaign will show the title, primary image, and total funds raised. A removed campaign will result in a redirect to the CFP's main website. A missing or deactivated campaign, however, is not always an indication of fraud. For example, a campaign created to raise funds outside a CFP's "donation cover area" results in the campaign being removed from the platform. Alternatively, campaign creators might close a campaign if the fundraising goal has been met or an event has passed.

Fraud is generally defined as a misrepresentation of an existing fact, made from one person to another, with knowledge of its falsity and for the purpose of inducing the other to act [35]. One requirement of fraud is therefore deception, as it requires the perpetrator to convince the victim that their (false) statement is true. It also results in damages to the victim and is, most importantly, a criminal offense. First, we must recognize that fraud is an umbrella term covering a range of behaviors, including embezzlement, where (legitimately acquired) funds are misappropriated; opportunist fraud, where a real story draws criminals to fabricate an association with the people or event; or complete fiction in both events and associations, among others. In this work, we attempt to automate the recognition of cues available at the time of publication of a crowdfunding campaign whose creator is aware of the falsehood of the claims in the campaign. We refer to these campaigns as fake. As an example, opportunist campaigns are fake: the creator of the campaign has limited information, which can be reflected in their writing style and choice of picture.

In the results we present, our priority is to minimize the number of false positives (i.e., real campaigns mislabeled as fraud). However, it is not possible to completely eliminate type II errors (i.e., fraud campaigns misclassified as real). Cases like embezzlement, where the people, events, and description are real and have the appropriate level of detail, but where the funds were never delivered to the rightful recipient, cannot be identified before the crime has actually been committed. Therefore, the results we present should be understood as a lower bound on the number of cases to be expected in the wild. This work, however, is an improvement upon the current state of the art in detecting these types of campaigns, where the determination of fraud is delegated to the CFPs and individual contributors are left to their own devices and judgement to decide whether a campaign is fraudulent or not.

We have two main sources of data: a set of campaigns that have been confirmed as fraud, collected from GoFraudMe [14] (we will refer to these as set A), and two sets of manually annotated campaigns collected from different CFPs (sets B and C). The goal of the GoFraudMe website, maintained by an investigative journalist, is to expose fraudulent cases on the GoFundMe platform. The site serves the dual purpose of holding the CFP accountable for fraudulent campaigns and of presenting, preserving, and publicizing the evidence that led to the characterization of fraud. The site holds 192 confirmed cases of fraud that were shared with us by the website curator.
The process that leads to the inclusion of a campaign on the website varies greatly, but each case is accompanied by a narrative that presents the inconsistencies that led to the declaration of fraud. Some campaigns have the guarantee of a guilty verdict following criminal legal proceedings, whereas others have been denounced by beneficiaries and supported by their community. Some of the cases presented, typically those that follow from events reported in various news outlets, give rise to several fraudulent campaigns. For some cases, there is an archived version of the campaign that was used to collect money, with a link to the rightful beneficiary. For others, there is a timeline of the investigation that led to criminal charges (and the subsequent conviction, if available).

In addition to the labeled campaigns in set A, we created two manually annotated datasets. Set B: 191 campaigns from the Medical category on GoFundMe.com. Set C: 350 campaigns from different CFPs that were directly related to organ transplants. Sets B and C were manually annotated following the methodology described in Section 3.4. Set C is a random sample of 350 campaigns related to organ transplants, collected in January 2019 from the top five CFPs: Indiegogo, GoFundMe, MightyCause, Fundrazr, and Fundly. Both sets B and C were collected through automated crawlers written in Python. From each campaign, we collected the features in Table 1 as well as all comments, pictures, and individual donations. Each campaign was visited in the order presented by the CFP's search engine. While some CFPs provided APIs to connect with their database, the data fields were collected, for the most part, from the corresponding elements in the HTML.

Table 2. Fraud annotation scale and counts:
0 (invalid): 93, 117
1 (fraud): 141, 138
2 (probably fraud): 26, 71
3 (unknown): 105, 123
4 (probably not-fraud): 78, 141
5 (not-fraud): 290, 517

Combined, sets A, B, and C make up the ground truth of the study. Sets A and B are inversely balanced, in the sense that one provides mostly examples of fraud and the other mostly not-fraud. Set C was created as a means to augment the number of campaigns in the study. All campaigns were manually annotated using the scale proposed in Table 2, where (1) indicates certainty of fraud and (5) certainty of not-fraud. During manual annotation, two expert annotators developed guidelines to determine the label of each campaign. The considerations in the guidelines included:
• Personal (offline) knowledge of the circumstances that led to the appeal, as evidenced in the support messages posted to the campaign. Such knowledge is reflected in (but not limited to) having met the beneficiary (or having first-hand knowledge of the circumstances), participating in offline fundraising activities, or familial relationships between donors.
• A sense of closure to each campaign, particularly those that have been open for donations for several years.
• Coherence between the description, supporting documents, pictures, fundraising goal, donors, and level of detail.
• Participation of the creator in other campaigns.
• Reverse search of pictures and text displayed in the campaign leading to unrelated results on the web.
• Evidence of contradictory information.
• Overwhelming lack of engagement of campaign donors.

The label of fraud assigned by the annotators was independent of the features engineered for automated detection. The annotators relied on semantic interpretation. The features used in the models are stylistic markers of the textual description and the images in the campaign.
To this extent, we are reasonably certain that while the label of fraud might be incomplete (i.e., it will not capture all categories of fraud), it is correct. Ultimately, we considered 704 campaigns in the study, with some campaigns removed from the dataset because the content was no longer accessible, the text was in multiple languages, or there were too few characters to compute any of the text-based features.

Inter-annotator reliability (or consistency) refers to the validity of the variable being measured [25]. In this project, we had a structured subjective task, where we iteratively refined and applied a set of guidelines defining fraud to each of the campaigns reviewed. At the end of the first iteration, the annotators reconciled the labels and revised the guidelines. At the end of the second round of annotations, and for campaigns with contradictory labels, the final decision was reached by discussion and consensus between the annotators. We measured the consistency across annotators for each iteration using Cohen's Kappa (κ) [8]. We applied the interpretation scale proposed by Landis and Koch [21], where values between 0.6 and 0.8 are considered to reflect substantial agreement between annotators. At the end of the first iteration, κ was found to be 0.451, reflecting only moderate agreement. After revising the guidelines, at the end of the second round of annotations, κ = 0.675. Ultimately, the values used as labels for classification were binary, i.e., 1 for fraud and 0 for not-fraud, reflecting consensus across annotators (i.e., a dataset with κ = 1).

The description of the campaign is the best line of communication between the campaign creator or beneficiary and any potential donors. Therefore, it is the first place where a malicious actor might leave traces of deception. In this section, we present five different areas where automated analysis might find quantitative evidence of deception. We also present some preliminary analysis of the variable categories with respect to our classification variable: fraud.

4.1.1 Sentiment Analysis. We extract the sentiment and tone expressed in the text for further analysis using IBM services [4]. The sentiment is computed as a probability across five basic emotions: sadness, joy, fear, disgust, and anger. Complementary to emotions, the text's tone can also express a campaign's intent. We analyze confidence scores for seven possible tones: frustration, satisfaction, excitement, politeness, impoliteness, sadness, and sympathy.

4.1.2 Language Complexity and Word Choice. The need to appeal to a more general population can lead fake campaign creators to adapt (or carefully select) the language used. Simpler language and shorter sentences can appeal to the emotions of the reader and, therefore, be more successful. To check the language complexity of the document and the word choice, we look at a series of readability scores (e.g., the automated readability index, the Dale-Chall formula, etc.) and language features (e.g., function words, personal pronouns, average syllables per word, total number of characters, etc.) [9, 38].

4.1.3 Named-Entity Recognition. Named-entity recognition is the process of identifying named entities (e.g., proper nouns, numeric entities, currencies) in unstructured text and assigning them to a finite set of categories. In this project, we relied on spaCy [16], a Python tool that identifies 18 types of entities in text. spaCy's models are based on convolutional neural networks built with pre-trained vectors, which give an accuracy of 86.42%.
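As a minimal illustration of this step (the specific spaCy pipeline is our assumption; any of the pre-trained English models, e.g., en_core_web_sm, exposes the same interface), the entity types present in a campaign description can be counted as follows:

```python
# Sketch: counting named-entity types in a campaign description with spaCy.
# Assumes the en_core_web_sm model has been downloaded
# (python -m spacy download en_core_web_sm); the paper does not name the
# exact pipeline used, so this is illustrative only.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

description = (
    "Please help John cover the $50,000 cost of his kidney transplant "
    "at Mercy Hospital in Boston this December."
)

doc = nlp(description)

# One feature per entity label (PERSON, MONEY, GPE, DATE, ...): how often it appears.
entity_counts = Counter(ent.label_ for ent in doc.ents)
print(entity_counts)
```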
4.1.4 Form of the Text. The next group of features we considered was the visual structure of the text. For the entire textual dataset, we captured the form of each word: whether the letters were all lower-case or all upper-case, the number of emojis in the text, the number of words with an exclamation mark, the words with apostrophes, and many others. We generated a vector of 255 descriptors and evaluated the text in each campaign against these features.

4.1.5 Word Importance. Lastly, we considered the numerical vector representation of the text given by tf-idf. This method, similar to a bag-of-words approach, highlights content similarity between different documents. As with the other textual features, we compute word importance on the text included in the campaign description. Ultimately, this description is the primary method of communication between the campaign creator and potential donors. While the success of a campaign is mostly determined by the strength of a community and their participation in the system, a good story may persuade chance visitors to donate to the cause.

Also interesting is the balance between positive and negative emotions in each campaign. Figure 1(a) shows that, in terms of emotions in the text, campaigns that are not-fraud display more joy and less disgust than campaigns that are fraud. It is almost as if the narrator makes an effort to first present their friend or relative (i.e., the beneficiary) as they were, and then present the (presumably negative) reason for creating the campaign. One of the most interesting results we found is evidenced in Figure 2. Word importance for each category shows that while, generally, both sets of campaigns have similar characteristics, fraudulent campaigns are perhaps more desperate in their appeal. Starting from the left side of the x-axis, Figure 2 shows that the words money, help, please, cancer, and get are more prevalent in fraudulent campaigns, whereas not-fraud descriptions emphasize words like kidney, transplant, heart, medic(al), and work (right side of the x-axis). In general, legitimate campaigns are more descriptive, being open about the circumstances behind their appeal.

Combined, the five types of text-based analysis result in 8,341 features extracted from the description provided with the campaign, but several of them can be sparse (e.g., tf-idf) and others may not prove helpful in detecting fraud. Therefore, the final step in our pre-processing of the data is to analyze each feature with respect to the variable we are interested in, and filter out features that are not useful. We start by making no assumptions about the distribution of our random variables and choose the non-parametric, two-sample KS test to check whether the difference between the distributions of the fraud and not-fraud data for each feature is significant at level α = 0.05. This test removed features whose distributions were not significantly different, and reduced the space to 71 variables drawn from all five textual analysis categories. Ultimately, in this paper, any result computed with text-based features includes only these 71 KS-significant features.
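As a sketch of this filtering step (variable and column names are our own; the feature matrix is assumed to be a pandas DataFrame with a binary fraud label), the two-sample KS test can be applied per feature as follows:

```python
# Sketch: keep only features whose fraud / not-fraud distributions differ
# significantly under a two-sample Kolmogorov-Smirnov test (alpha = 0.05).
# Column and variable names are illustrative, not the paper's actual code.
import pandas as pd
from scipy.stats import ks_2samp

ALPHA = 0.05

def ks_significant_features(features: pd.DataFrame, labels: pd.Series) -> list[str]:
    """Return the names of columns where fraud and not-fraud distributions differ."""
    fraud = features[labels == 1]
    not_fraud = features[labels == 0]
    kept = []
    for column in features.columns:
        statistic, p_value = ks_2samp(fraud[column], not_fraud[column])
        if p_value < ALPHA:  # distributions differ significantly; keep the feature
            kept.append(column)
    return kept

# Usage (hypothetical data): text_features is an (n_campaigns x 8341) DataFrame,
# y is a Series of binary fraud labels; in our case this would yield ~71 columns.
# selected = ks_significant_features(text_features, y)
```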
Though the text is the primary means of information, pictures provide the often essential supporting details of the claim. As with Section 4, in this section we present the rationale for the features derived from images and the preliminary results of the analysis of the collected data.

Psychological studies show that images, as a form of visual stimuli, can be used to induce human emotion [18]. Visual emotion prediction has therefore attracted much interest from the computer vision community, framed as a multi-class classification problem using image-emotion pairs as input-output tuples for learning. Motivated by the successes of transfer learning for visual emotion prediction, we repurposed a ResNet-152 [15], a convolutional network pre-trained on the ImageNet dataset [20], which contains 1.2 million images of 1,000 diverse object categories. The fine-tuning was performed by replacing the original 1000-way fully connected classification layer with a newly initialized layer consisting of 8 neurons that correspond to the emotion categories of interest. As defined in [40, 42], the eight categories were: amusement, anger, awe, contentment, disgust, excitement, fear, and sadness. To fine-tune the model, we utilized the Flickr and Instagram (FI) dataset [40] of 23k images, where each image is labeled as evoking one of the eight emotions based on a majority vote among five Amazon Mechanical Turk workers. We used 90% of the images for training and the remainder for validation. During pre-processing, each image was resized to 256 × 256 × 3 and standardized (per channel) based on the original ImageNet training data statistics. We trained for 100 epochs, minimizing a negative log-likelihood loss with stochastic gradient descent, using an initial learning rate of 0.1, momentum of 0.9, and a batch size of 128. The learning rate was multiplied by a factor of 0.1 at epochs 30, 60, and 90. We performed data augmentation by randomly cropping 224 × 224 × 3 image patches, which is the resolution accepted by ResNet-152. During fine-tuning, all layers except the classification layer were frozen. The final model accuracy on the validation data was 73.9%, with predictions based on central 224 × 224 × 3 crops. The semantic evidence over the eight emotions, in the form of logits (unnormalized log probabilities), can then be extracted for the crowdfunding images. Each crowdfunding image was resized such that its shortest side was 256 pixels, and then a central crop of size 224 × 224 × 3 was extracted. Semantic emotion category representations were then extracted from the classification layer.

Appearance and Semantic Representations. Again, with the help of a pre-trained ResNet-152 model, trained on the ImageNet dataset, we extracted appearance representations and semantic representations of each of the images present in the campaigns. For pre-processing, each crowdfunding image was resized such that its shortest side was 256 pixels and then a central crop of size 224 × 224 × 3 was extracted. We standardized each image (per channel) based on the original ImageNet training data statistics. The appearance representation is meant to quantify the picture itself by generating a vector of descriptors from the penultimate layer of the network. These features (∈ R^2048) provide a description of each image, where the fields, automatically learnt by the network, can correspond, e.g., to the dominant color, the texture of the edges of a segment, or a lower-level object such as an eye, among others. In contrast, the semantic representation expresses the logit presence of predetermined objects in each image. This vector (∈ R^1000) is extracted from the classification layer over the 1,000 ImageNet classes. Each representation is useful, since convolutional neural networks are known to implicitly learn a level of correspondence between particular objects [41]. Moreover, such representations invariably outperform their hand-engineered counterparts [7].
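To make the representation extraction concrete, the following is a minimal sketch using torchvision; the specific library calls and the file name are our assumptions, while the network, pre-processing, and output dimensions follow the description above.

```python
# Sketch: extracting the appearance (2048-d, penultimate layer) and semantic
# (1000-d, classification layer) representations of a campaign image with a
# pre-trained ResNet-152. Pre-processing follows the paper (shortest side 256,
# central 224x224 crop, ImageNet normalization); the torchvision API usage is
# our choice, not the authors'.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),          # shortest side -> 256 pixels
    transforms.CenterCrop(224),      # central 224 x 224 crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Older torchvision versions use models.resnet152(pretrained=True) instead.
model = models.resnet152(weights="IMAGENET1K_V1").eval()

# Appearance representation: everything up to (and including) global pooling,
# i.e., the penultimate layer, which yields a 2048-dimensional vector.
backbone = torch.nn.Sequential(*list(model.children())[:-1])

image = Image.open("campaign_image.jpg").convert("RGB")  # hypothetical file
x = preprocess(image).unsqueeze(0)                       # shape: (1, 3, 224, 224)

with torch.no_grad():
    appearance = backbone(x).flatten(1)   # shape: (1, 2048)
    semantic = model(x)                   # logits over the 1000 ImageNet classes

print(appearance.shape, semantic.shape)

# For the emotion model, the paper instead replaces the final layer with an
# 8-way classifier (model.fc = torch.nn.Linear(2048, 8)) and fine-tunes only
# that layer on the FI dataset.
```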
Finally, we consider the number of faces present in an image as a possible distinguishing factor between fraud and not-fraud campaigns. We extract this feature using the dlib [19] HOG-based face detector and estimate the number of faces present per image.

In our analysis of emotion in images, we found that, compared to the text (Figure 1(a)), there is a greater imbalance between positive emotions and sadness. Figure 1(b) shows the positive emotions as shades of blue and the other emotions as indicated in the legend. Similar to the text, not-fraud campaigns display more positive emotions and proportionally less anger and fear in their images. In our analysis of the objects present in each image (Figure 3), we find that not-fraud campaigns have a stronger presence of objects associated with hospital stays (as evidenced by the presence of objects like lab coats, pajamas, stretchers, and neck braces), though the same categories are found, to a lesser degree, in the fraudulent campaigns. On the other hand, fraudulent campaigns appear to include images with objects or concepts that are more casual in nature, such as barbershop, suit, tie, and uniform, which may not fit the context of CFP campaigns launched for medical-related problems. Compared to the results in Figure 2, the signal revealed by the images is not as strong as the one contained in the text, as the separation is not as clear. The difference between text and images can be explained by considering that CFPs provide, at times, specific instructions regarding the types of images to include. For example, one such instruction is to include a picture of the fundraising organizer and the person in need looking happy. Not only does this homogenize the type of images used in the fundraisers, it also provides a clear guidebook for potentially fraudulent campaigns, hence diminishing the predictive power of images in general, and of the objects identified in those images in particular.

We also analyze the number of faces detected in the images of the two types of CFP campaigns, in Figure 4. Even though the number of faces in the extreme cases (e.g., above 10 faces detected) follows a similar distribution in the two classes, we notice that the majority of non-fraudulent campaigns tend to include images with more faces than fraudulent campaigns. Interestingly, the median for both classes is 1, but the mean for non-fraudulent campaigns is 1.488 versus 0.8341 for fraudulent ones, which means it is more common for non-fraudulent campaigns to include images with at least one face.

Combined, the visual cues amount to 3,057 features. As was the case with the textual features, we expect the semantic representation of each image to be sparse and some features to be more discriminative than others with regard to our target variable. As with the text-based features, we used the KS test to determine the significance of each descriptor. The result was a vector of 501 features with representatives from all categories: emotion, appearance, semantics, and number of faces. The classification models contained only these 501 features in all types of image analysis.
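As a small sketch of the face-count feature described above (assuming dlib's standard HOG-based frontal face detector; the image path and up-sampling choice are illustrative):

```python
# Sketch: counting faces per campaign image with dlib's HOG-based detector.
# Only the detector itself is specified in the paper; file handling is ours.
import dlib

detector = dlib.get_frontal_face_detector()

def count_faces(image_path: str) -> int:
    """Return the number of face bounding boxes found in the image."""
    image = dlib.load_rgb_image(image_path)
    # The second argument up-samples the image once, helping detect smaller faces.
    faces = detector(image, 1)
    return len(faces)

print(count_faces("campaign_image.jpg"))  # hypothetical file
```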
Next, we present our effort to train a machine learning (ML) classifier to automatically detect fraudulent campaigns using the various features discussed in the previous sections.

6.1.1 Fraud Scale Grouping. The fraud scale presented in Table 2 can be combined in different ways to generate the overall label of fraud. In our first experimental setup, we use the union of campaigns with scores {1,2} as fraud and scores {4,5} as not-fraud, omitting the campaigns with score 3, and denote this setup as Label I. In the second experimental setup, we define as fraud exclusively the campaigns with a score of {1} and as not-fraud the campaigns with a score of {5}, omitting the other campaigns, and denote this setup as Label II. Practically, in using Label I we prioritize the need for more observations for training the classifier, whereas in using Label II we give more importance to the strength of the signal being captured, at the cost of fewer instances. In our experiments, we observed better performance when minimizing the noise in the signal. Ultimately, we chose Label II for the final results.

In choosing a classifier, we need a method that is fast, robust to noise, and not prone to overfitting the data, thus allowing the model to be generalizable. We tested different classical ML methods whose implementations are available in sklearn [28]: Random Forests (RF), AdaBoost, Decision Tree, k-NN, Naive Bayes, and Support Vector Machine (SVM), and compared their performance across different metrics. In addition to the classical methods, we also built a multilayer perceptron (MLP) with one hidden layer (followed by a ReLU) of dimensionality equal to its input. Each MLP was trained for 50 epochs using SGD with momentum 0.9, weight decay 5 × 10^-4, a batch size of 1, and an initial learning rate of 0.001. During training, inputs were corrupted on-the-fly with additive white Gaussian noise drawn from N(0, √0.1).

For each classifier, we compute five metrics: accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC), which captures the relationship between true positives and false positives at different operating thresholds of the classifier. A perfect classifier would score 1 on all of these metrics. Initial attempts at classification showed that the classifiers' results for the different metrics were dispersed. To obtain accurate measures of each model's performance, and following the law of large numbers, we increased the number of iterations and looked at the distribution of results for each model. For each iteration, we perform a random split of train and test data. As expected, multiple iterations over different splits of the data yielded different results; overall, the mean of normally distributed classification results approximates the true value of each metric. Also, because the classes are not balanced, we forced the same number of observations for each class by under-sampling the larger class (i.e., randomly selecting from the available observations) when creating the split for each iteration. We perform two experiments: a preliminary one, to test the performance of each feature modality (text vs. images), and then a final one with an ensemble classifier that uses an average of the two preliminary ones. Results for the classical ML algorithms were computed by executing 2,000 iterations of the classifiers on the available text or image data. For the neural network, we used 1,000 models to obtain the final classification.

Text vs. Images. Tables 3 and 4 show the performance of the considered classifiers, using Label II, with textual and visual features, respectively. These results were obtained by running 2,000 iterations and computing all metrics for each model.
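A minimal sketch of this repeated evaluation protocol is given below: per iteration, under-sample the larger class, make a random train/test split, fit a classifier, and record the metrics. The Random Forest, the train/test ratio, and the summary statistics are illustrative choices; only the balancing, the random splits, and the 2,000 iterations come from the description above.

```python
# Sketch of the repeated, class-balanced evaluation protocol described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def evaluate(X: np.ndarray, y: np.ndarray, iterations: int = 2000) -> dict:
    aucs, f1s = [], []
    fraud_idx = np.where(y == 1)[0]
    not_fraud_idx = np.where(y == 0)[0]
    n = min(len(fraud_idx), len(not_fraud_idx))
    for seed in range(iterations):
        # Balance the classes by randomly under-sampling the larger one.
        sampled_fraud = resample(fraud_idx, n_samples=n,
                                 replace=False, random_state=seed)
        sampled_not_fraud = resample(not_fraud_idx, n_samples=n,
                                     replace=False, random_state=seed)
        idx = np.concatenate([sampled_fraud, sampled_not_fraud])
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[idx], y[idx], test_size=0.3, random_state=seed, stratify=y[idx])
        clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        scores = clf.predict_proba(X_te)[:, 1]
        aucs.append(roc_auc_score(y_te, scores))
        f1s.append(f1_score(y_te, clf.predict(X_te)))
    # Report the mean over all iterations, as in the protocol above.
    return {"auc_mean": float(np.mean(aucs)), "f1_mean": float(np.mean(f1s))}
```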
As shown in the tables, all classifiers outperform the 50% random baseline of binary classification, implying that the signal separating fraud from not-fraud is present in the data. Interestingly, tree-based models such as Decision Tree and Random Forest perform fairly well, with an AUC of up to 0.84, just under the 0.93 AUC exhibited by the neural network on textual features. We note that textual data alone provide better classification power than images alone (AUC = 0.93 vs. 0.67). However, the classification performance is improved by combining modalities.

The models based on text and on images each show a definite separation between the classes (fraud or not-fraud). The next step is to determine whether combining all features of a campaign into the same model provides an improvement over treating them separately. Tables 3 and 4 show that RF is the best among the classical algorithms, but the MLP outperforms RF on textual features. Thus, we use both RF and MLP to evaluate the ensemble classifier's performance. Furthermore, Table 2 showed that the information available for each campaign varies: some campaigns have no images while others have multiple. We first run the classification task separately for text and images, and then combine the results into a single score for each campaign. As before, we train and test RF with 2,000 runs, and the MLP with 1,000 models, on the Label II setup. We then run an ablation study to determine whether any of the feature groups (i.e., tf-idf, text sentiment analysis, named-entity recognition, the form of the words, the readability indices, the descriptive elements of an image, the objects present in an image, the emotions triggered by each image, and the number of faces recognized) have a negative interaction and should therefore be removed. The results, shown in Table 5, indicate that, while the neural network approach was comparable to the classical algorithms for the separate modalities (i.e., images and text), there is a clear improvement in all metrics when we combine all features in the same model, with AUC = 0.96. For completeness, Figure 5 presents the distribution of the individual evaluations of the neural network for the Label I and Label II setups. As expected, Label II performs better than Label I. Also, the models are not dispersed, and the results are consistently above 80% in median performance.

In Section 6.1.1, we discussed the impact the labels have on the classification output. Here, we investigate another configuration, denoted Label III. This corresponds to the scenario where we train on campaigns with scores of {1} for fraud and scores of {5} for not-fraud (i.e., the Label II setup), and then test this model on campaigns with label scores {2,4}, corresponding to fraud and not-fraud, respectively. These campaigns were dropped in the Label II setup, and were thus unseen by the classifier. In Table 6, we compare the performance of modeling fraud with the Label I, Label II, and Label III setups. Overall, we observe that classifying on a stronger fraud signal (Label II) translates into better performance. These results also indicate that once a model is trained with a sufficiently strong signal, it is able to correctly label noisy data (AUC = 0.936 for Label III). This shows great promise in terms of the extensibility and applicability of our work.
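To make the score-averaging ensemble described in this section concrete, the following sketch combines the per-campaign scores of a text-only model and an image-only model; the helper names, the equal-weight average, and the fallback for campaigns without images are our assumptions, not the authors' exact implementation.

```python
# Sketch: ensemble by averaging the fraud scores of the text model and the
# image model for each campaign. Model objects and feature arrays are
# hypothetical placeholders for the trained per-modality classifiers.
import numpy as np

def ensemble_score(text_model, image_model, text_features, image_features):
    """Average the two modality scores; fall back to text when no image exists."""
    text_score = text_model.predict_proba(text_features.reshape(1, -1))[0, 1]
    if image_features is None:          # campaign has no image
        return text_score
    image_score = image_model.predict_proba(image_features.reshape(1, -1))[0, 1]
    return (text_score + image_score) / 2.0

# A campaign would be flagged as potentially fraudulent when the combined
# score exceeds a chosen operating threshold, e.g., 0.5.
```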
In recent years, crowdfunding has emerged as a means of making personal appeals for financial support to members of the public. These appeals may concern simple tasks, such as a DIY project at home, or more complex ventures, such as starting a new company or covering medical procedures. The community trusts that the individual who requests support, whatever the task, is doing so without malicious intent. However, time and again, fraudulent cases come to light, ranging from fake objectives to embezzlement. Fraudsters often fly under the radar and, under the guise of crowdfunding support and enabled by small individual donations, defraud people of what adds up to tens of millions.

Detecting and preventing fraud is thus an adversarial problem. Inevitably, perpetrators adapt and attempt to bypass whatever system is deployed to prevent their malicious schemes. In this work, we take the first step in studying the problem of fraudulent crowdfunding campaigns and detecting them at the time of publication. We collect appropriate data from thousands of campaigns from different platforms and study fraud cases to better understand their characteristics. Armed with this knowledge, we perform an annotation study to label hundreds of campaigns as fraud or not, with substantial overall annotation agreement. We proceed to extract characteristics (features) from the text and image content included in each campaign, and compare these features with the associated label of the campaign. The dataset we built is useful in training machine learning ensemble classifiers, which can take visual and textual cues from any crowdfunding campaign and predict whether the campaign is fraudulent at the time it is created, with satisfactory performance (up to AUC = 0.96). Indeed, there is room for improvement, especially regarding feature engineering and classifier complexity and tuning. However, our results demonstrate that it is possible to detect fraudulent campaigns with high certainty, and to allow crowdfunding platforms to remove them semi-automatically, i.e., such campaigns can be marked for a more detailed inspection by an administrator. In practice, we propose an automatic method that can give donors an indication of which of the campaigns they are viewing may be fraudulent. With this method, we attempt to make the job of fraudsters harder by proposing a better system than is currently available. In fact, to mitigate the risk of fraudsters catching up with the online model and the features it monitors for predicting fraud, we can explore different methods and timings for delivering the warning flag to a donor.

In terms of limitations, while building this methodology we attempted to reduce any bias that may have been introduced by the annotators. During this process, we created checklists and standards for what would be defined as fraud, to minimize subjective bias. A further unbiased way to conduct this study would be to rely exclusively on convicted cases of fraud, instead of relying on manual annotations of suspected cases. However, this option would not provide enough examples to develop good machine learning models, and it would again be up to annotators to identify not-fraud examples. One solution that would further reduce the risk of bias is to increase the number of annotators that label each campaign, but this is highly dependent on resource availability. Finally, algorithmic bias could also be reduced. For example, poorly written campaigns by legitimate requestors who are uneducated or non-native English speakers can be mislabeled as fraud, a clear source of bias.
Also, there may be limited examples of such campaigns, since these users may not be willing or comfortable enough to post a campaign in the first place. These aspects point to the problem of fair and balanced representation of characteristics in our training data and labels. In the future, we plan to improve our classifier to take into account such sources of bias. We also plan to test our classifier on unlabeled data of medically-related campaigns to investigate its capability of detecting such fraud cases, which in the health domain can have a severe monetary and emotional impact on the defrauded.

References
Metafraud: A meta-learning framework for detecting financial fraud
The economics of crowdfunding platforms
Data mining for credit card fraud: A comparative study
A Deep Learning Semantic Approach to Emotion Recognition Using the IBM Watson Bluemix Alchemy Language
Lie detection across cultures
Detecting management fraud in public companies
Return of the devil in the details: Delving deep into convolutional nets
A coefficient of agreement for nominal scales
Disentangling crowdfunding from fraudfunding. Max Planck Institute for Innovation & Competition research paper
Predicting material accounting misstatements
Trust and credit: The role of appearance in peer-to-peer lending
GoFundMe Pricing
GoFundMe fraudulent campaigns
Deep residual learning for image recognition
spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017)
FBI details new methods of fraud born amid the pandemic
Aesthetics and emotions in images
Dlib-ml: A machine learning toolkit
ImageNet classification with deep convolutional neural networks
An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers
Judging borrowers by the company they keep: Friendship networks and information asymmetry in online peer-to-peer lending
Fake it till you make it: Reputation, competition, and Yelp review fraud
People Are Raising USD650 Million On GoFundMe Each Year To Attack Rising Healthcare Costs
Interrater reliability: the kappa statistic
Credit card fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian learning
Captain Tom Moore: Just Giving blocks copycats over fears scammers are 'cashing in' on £28m NHS fundraising campaign
I call BS: Fraud Detection in Crowdfunding Campaigns
Scikit-learn: Machine learning in Python
What's in a Picture? Evidence of Discrimination from Prosper.com
GoFundMe Confronts Coronavirus Demand
Association rules applied to credit card fraud detection. Expert Systems with Applications
Detecting fraudulent behavior on crowdfunding platforms: The role of linguistic and content-based cues in static and dynamic contexts
Credit card fraud detection using hidden Markov model
Crowdfunding Statistics and Facts
Fraud Law and Legal Definition
Woman and Homeless Man Plead Guilty in $400,000 GoFundMe Scam
LOWEST OF THE LOW: Sick scammers are setting up GoFundMe accounts for fake coronavirus victims
The emergence and effects of fake social information: Evidence from crowdfunding
P2P Lending Fraud Detection: A Big Data Approach
Building a large scale dataset for image emotion recognition: The fine print and the benchmark
Visualizing and understanding convolutional networks
Exploring principles-of-art features for image emotion recognition

The data that support the findings of this study are available from the corresponding author, B. Perez, upon reasonable request.