An expert system for detecting automobile insurance fraud using social network analysis

Lovro Šubelj *, Štefan Furlan, Marko Bajec

Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1001 Ljubljana, Slovenia

Keywords: Fraud detection; Automobile insurance; Social network analysis; Link analysis; Assessment propagation

Abstract

The article proposes an expert system for the detection, and subsequent investigation, of groups of collaborating automobile insurance fraudsters. The system is described and examined in detail, and several technical difficulties in detecting fraud are considered so that the system is applicable in practice. Unlike many other approaches, the system represents data with networks. Networks are the most natural representation of such a relational domain, allowing the formulation and analysis of complex relations between entities. Fraudulent entities are found with a novel assessment algorithm, the Iterative Assessment Algorithm (IAA), also presented in the article. Besides the intrinsic attributes of entities, the algorithm also explores the relations between entities. The prototype was evaluated and rigorously analyzed on real world data. Results show that automobile insurance fraud can be efficiently detected with the proposed system and that appropriate data representation is vital.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Fraud is encountered in a variety of domains. It comes in many shapes and sizes, from traditional fraud, e.g. (simple) tax cheating, to more sophisticated fraud, where entire groups of individuals collaborate in order to commit it. Such groups can be found in the automobile insurance domain. Here fraudsters stage traffic accidents and issue fake insurance claims to gain (unjustified) funds from their general or vehicle insurance.
There are also cases where an accident has never occurred, and the vehicles have only been placed onto the road. Still, the majority of such fraud is not planned (opportunistic fraud): an individual only seizes the opportunity arising from the accident and issues exaggerated insurance claims or claims for past damages.

Staged accidents have several common characteristics. They occur in late hours and non-urban areas in order to reduce the probability of witnesses. Drivers are usually younger males, and there are many passengers in the vehicles, but never children or elders. The police are always called to the scene to make the subsequent acquisition of funds easier. It is also not uncommon that all of the participants have multiple (serious) injuries, while there is almost no damage to the vehicles. Many other suspicious characteristics exist that are not mentioned here.

Insurance companies are most interested in organized groups of fraudsters consisting of drivers, chiropractors, garage mechanics, lawyers, police officers, insurance workers and others. Such groups represent the majority of revenue leakage.

Most analyses agree that approximately 20% of all insurance claims are in some way fraudulent (various sources). Most of these claims go unnoticed, however, as fraud investigation is usually done by hand by the domain expert or investigator and is only rarely computer supported. Inappropriate representation of data is also common, making the detection of groups of fraudsters extremely difficult. An expert system approach is thus needed.

Jensen (1997) has observed several technical difficulties in detecting fraud (in various domains). Most hold for (automobile) insurance fraud as well. Firstly, only a small portion of accidents or participants is fraudulent (skewed class distribution), making them extremely difficult to detect. Next, there is a severe lack of labeled data sets, as labeling is expensive and time consuming.
Besides, due to the sensitivity of the domain, there is even a lack of unlabeled data sets. Any approach for detecting such fraud should thus be founded on moderate resources (data sets) in order to be applicable in practice.

Fraudsters are very innovative and new types of fraud emerge constantly. Hence, the approach must also be highly adaptable, detecting new types of fraud as soon as they are noticed. Lastly, fully autonomous detection of automobile insurance fraud is not possible in practice. The final assessment of potential fraud can only be made by the domain expert or investigator, who also determines further actions in resolving it. The approach should therefore also support this investigation process.

* Corresponding author. Tel.: +386 1 4768 186. E-mail addresses: lovro.subelj@fri.uni-lj.si (L. Šubelj), stefan.furlan@fri.uni-lj.si (Š. Furlan), marko.bajec@fri.uni-lj.si (M. Bajec).
Expert Systems with Applications 38 (2011) 1039–1052. doi:10.1016/j.eswa.2010.07.143

Due to everything mentioned above, the set of approaches for detecting such fraud is extremely limited. We propose a novel expert system approach for the detection and subsequent investigation of automobile insurance fraud. The system is focused on the detection of groups of collaborating fraudsters, and their connecting accidents (non-opportunistic fraud), and not on isolated fraudulent entities.
The latter should be done independently for each particular entity, while in our system the entities are assessed in a way that also considers the relations between them. This is done with an appropriate representation of the domain: networks.

Networks are the most natural representation of any relational domain, allowing the formulation of complex relations between entities. They are also the main advantage of our system over other approaches that use a standard flat data form. As collaborating fraudsters are usually related to each other in various ways, detection of groups of fraudsters is only possible with an appropriate representation of data. Networks also provide a clear visualization of the assessment, which is crucial for the subsequent investigation process.

The system assesses the entities using a novel Iterative Assessment Algorithm (IAA algorithm), presented in this article. No learning from an initial labeled data set is done; the system rather allows simple incorporation of domain knowledge. This makes it applicable in practice and allows detection of new types of fraud as soon as they are encountered. The system can be used with poor data sets, which is often the case in practice. To simulate realistic conditions, the discussion in the article and the evaluation with the prototype system rely only on the data and entities found in the police record of the accident (the main entities are participant, vehicle, collision (see footnote 1) and police officer).

The article gives an in-depth description, evaluation and analysis of the proposed system. We pursue the hypothesis that automobile insurance fraud can be detected with such a system and that proper data representation is vital.
The main contributions of our work are: (1) a novel expert system approach for the detection of automobile insurance fraud with networks; (2) a benchmarking study, as no expert system approach for the detection of groups of automobile insurance fraudsters has yet been reported (to our knowledge); (3) an algorithm for the assessment of entities in a relational domain that demands no labeled data set (IAA algorithm); and (4) a framework for the detection of groups of fraudsters with networks (applicable in other relational domains).

The rest of the article is organized as follows. In Section 2 we discuss related work and emphasize weaknesses of other proposed approaches. Section 3 presents the formal grounds of (social) networks. Next, in Section 4, we introduce the proposed expert system for detecting automobile insurance fraud. The prototype system was evaluated and rigorously analyzed on real world data; a description of the data set and the obtained results are given in Section 5. The results are discussed in Section 6, followed by the conclusion in Section 7.

2. Related work

Our work belongs to the wide field of fraud detection. Fraud appears in many domains, including telecommunications, banking, medicine, e-commerce, and general and automobile insurance. Thus a number of expert system approaches for preventing, detecting and investigating fraud have been developed in the past.
Researchers have proposed using standard methods of data mining and machine learning, neural networks, fuzzy logic, genetic algorithms, support vector machines, (logistic) regression, consolidated (classification) trees, approaches over red-flags or profiles, various statistical methods and other methods and approaches (Artis et al., 2002; Bolton and Hand, 2002; Brockett et al., 2002; Estevez et al., 2006; Furlan and Bajec, 2008; Ghosh and Schwartzbard, 1999; Hu et al., 2007; Kirkos et al., 2007; Perez et al., 2005; Quah and Sriganesh, 2008; Rupnik et al., 2007; Sanchez et al., 2009; Viaene et al., 2005; Viaene et al., 2002; Weisberg and Derrig, 1998; Yang and Hwang, 2006). Analyses show that in practice none is significantly better than the others (Bolton and Hand, 2002; Viaene et al., 2005). Furthermore, they mainly have three weaknesses. They (1) use an inappropriate or inexpressive representation of data; (2) demand a labeled (initial) data set; and (3) are only suitable for larger, richer data sets. It turns out that these are generally a problem when dealing with fraud detection (Jensen, 1997; Phua et al., 2005).

In the narrower sense, our work comes near the approaches from the field of network analysis that combine intrinsic attributes of entities with their relational attributes. Noble et al. (2003) proposed detecting anomalies in networks with various types of vertices, but they focus on detecting suspicious structures in the network, not vertices (i.e. entities). Besides that, the approach is more appropriate for larger networks. Researchers have also proposed detecting anomalies using measures of centrality (Freeman, 1977, 1979), random walks (Sun et al., 2005) and other techniques (Holder and Cook, 2003; Maxion and Tan, 2000), but these approaches mainly rely only on the relational attributes of entities.
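To illustrate what relying only on relational attributes means, the following minimal sketch computes normalized degree centrality on a small network of collisions and participants. It is an illustration under stated assumptions, not a method taken from the cited works: the toy data, the vertex names and the helper functions build_network and degree_centrality are hypothetical. Note that the score of a vertex depends solely on its links, and no intrinsic attribute (age, injuries, time of the collision) enters the computation.

```python
from collections import defaultdict

# Hypothetical toy data: each collision lists the participants involved.
collisions = {
    "c1": ["p1", "p2", "p3"],
    "c2": ["p2", "p3", "p4"],
    "c3": ["p5", "p6"],
}


def build_network(collisions):
    """Return an undirected adjacency map linking collisions and participants."""
    adjacency = defaultdict(set)
    for collision, participants in collisions.items():
        for participant in participants:
            adjacency[collision].add(participant)
            adjacency[participant].add(collision)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each vertex divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    return {vertex: len(neighbors) / (n - 1)
            for vertex, neighbors in adjacency.items()}


network = build_network(collisions)
centrality = degree_centrality(network)

# Participants appearing in several collisions stand out purely by structure.
for vertex in sorted(centrality, key=centrality.get, reverse=True):
    print(vertex, centrality[vertex])
```

In this toy network, participants p2 and p3 each appear in two collisions and therefore score higher than the other participants, regardless of any other information about them.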
Many researchers have investigated the problem of classification in the relational context, following the hypothesis that the classification of an entity can be improved by also considering its related entities (inference). Thus many approaches formulating inference, spread or propagation on networks have been developed in various fields of research (Brin and Page, 1998; Domingos and Richardson, 2001; Kleinberg, 1999; Kschischang and Frey, 1998; Lu and Getoor, 2003b; Minka, 2001; Neville and Jensen, 2000). Most of them are based on one of the three most popular (approximate) inference algorithms: Relaxation Labeling (RL) (Hummel and Zucker, 1983) from the computer vision community, Loopy Belief Propagation (LBP) on loopy (Bayesian) graphical models (Kschischang and Frey, 1998) and the Iterative Classification Algorithm (ICA) from the data mining community (Neville and Jensen, 2000). For analyses and comparison see (Kempe et al., 2003; Sen and Getoor, 2007).

Researchers have reported good results with these algorithms (Brin and Page, 1998; Kschischang and Frey, 1998; Lu and Getoor, 2003b; Neville and Jensen, 2000). However, they mainly address the problem of learning from an (initial) labeled data set (supervised learning), or from a partially labeled one (semi-supervised learning) (Lu and Getoor, 2003a), so the approaches are generally inappropriate for fraud detection. The algorithm we introduce here, the IAA algorithm, is almost identical to the ICA algorithm, but it was developed with different intentions in mind: to assess the entities when no labeled data set is at hand (and not to improve classification with inference). Furthermore, IAA does not address the problem of classification, but that of ranking. In this respect it is actually a simplification of the RL algorithm, or even of Google's PageRank (Brin and Page, 1998), although it is not founded on probability theory like the latter.
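The general idea of ranking entities by letting assessments spread along relations, without any labeled data, can be sketched as follows. This is a minimal illustration under stated assumptions and not the IAA algorithm itself, which is not reproduced here: the update rule (a weighted mix of an entity's intrinsic score and the mean score of its neighbors), the weight alpha, the fixed number of iterations and the toy network are all hypothetical choices made for the example.

```python
def propagate_scores(adjacency, intrinsic, alpha=0.5, iterations=20):
    """Iteratively mix each vertex's intrinsic score with its neighbors' scores.

    adjacency: dict mapping a vertex to the set of its neighbors
    intrinsic: dict mapping a vertex to a score in [0, 1] derived from its
               own attributes (hypothetical values in this example)
    alpha:     assumed weight given to the neighborhood
    """
    scores = dict(intrinsic)
    for _ in range(iterations):
        updated = {}
        for vertex, neighbors in adjacency.items():
            if not neighbors:
                # An isolated vertex keeps its intrinsic score.
                updated[vertex] = intrinsic[vertex]
                continue
            neighbor_mean = sum(scores[n] for n in neighbors) / len(neighbors)
            updated[vertex] = ((1 - alpha) * intrinsic[vertex]
                               + alpha * neighbor_mean)
        scores = updated  # synchronous update: all vertices change together
    return scores


# Hypothetical toy network: a - b - c form a chain, d is isolated.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
# Only "a" looks suspicious on its own attributes.
intrinsic = {"a": 0.9, "b": 0.1, "c": 0.1, "d": 0.1}

scores = propagate_scores(adjacency, intrinsic)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Entities b, c and d have identical intrinsic scores, yet b and c end up ranked above the isolated entity d because they are related, directly or indirectly, to the suspicious entity a. The output is a ranking rather than a class label, which matches the distinction drawn in the text above.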
We conclude that, due to the weaknesses mentioned, most of the proposed approaches are inappropriate for the detection of (automobile) insurance fraud. Our approach differs, as it does not demand a labeled data set and is also appropriate for smaller data sets. It represents data with networks, which are one of the most natural representations and allow complex analysis without simplification of data. It should be pointed out that networks, despite their strong foundations and expressive power, have not yet been used for detecting (automobile) insurance fraud (at least to our knowledge).

1 Throughout the article, the term collision is used instead of (traffic) accident. The word accident implies there is no one to blame, which is at odds with the subject of the article.

3. (Social) networks

Networks are based upon mathematical objects called graphs. Informally speaking, a graph consists of a collection of points, called