Comparisons of machine learning techniques for detecting malicious webpages

H.B. Kazemian*, S. Ahmed
Intelligent Systems Research Centre, School of Computing, London Metropolitan University, United Kingdom
* Corresponding author.

Expert Systems with Applications 42 (2015) 1166–1177
http://dx.doi.org/10.1016/j.eswa.2014.08.046
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

Article history: Available online 16 September 2014

Keywords: K-Nearest Neighbor; Support Vector Machine; Naive Bayes; Affinity Propagation; K-Means; Supervised and unsupervised learning

Abstract

This paper compares machine learning techniques for detecting malicious webpages. The conventional method of detecting malicious webpages is to go through a blacklist and check whether the webpages are listed. A blacklist is a list of webpages which are classified as malicious from a user's point of view. These blacklists are created by trusted organizations and volunteers. They are then used by modern web browsers such as Chrome, Firefox, Internet Explorer, etc. However, blacklists are ineffective because of the frequently changing nature of webpages, the growing number of webpages, which poses scalability issues, and the crawlers' inability to visit intranet webpages that require computer operators to log in as authenticated users. This paper therefore uses alternative and novel approaches that apply machine learning algorithms to detect malicious webpages. Three supervised machine learning techniques (K-Nearest Neighbor, Support Vector Machine and Naive Bayes Classifier) and two unsupervised machine learning techniques (K-Means and Affinity Propagation) are employed.
Please note that K-Means and Affinity Propagation have not been applied to the detection of malicious webpages by other researchers. All these machine learning techniques have been used to build predictive models to analyze a large number of malicious and safe webpages. These webpages were downloaded by a concurrent crawler taking advantage of gevent. The webpages were parsed and various features such as content, URL and screenshot of webpages were extracted to feed into the machine learning models. Computer simulation results have produced an accuracy of up to 98% for the supervised techniques and a silhouette coefficient of close to 0.96 for the unsupervised techniques. These predictive models have been applied in a practical context whereby Google Chrome can harness the predictive capabilities of the classifiers that have the advantages of both the lightweight and the heavyweight classifiers.

1. Introduction

Web security threats are increasing day by day (Facebook, 2010; Malware, 2011; Sood & Enbody, 2011). The open nature of the Internet allows malicious webpages to pose as 'safe webpages' and consequently some users are misled into thinking that these webpages are safe.

As the use and the speed of the Internet have increased over the last two decades, web developers have increased the usage of images, JavaScript and other elements. The Google search engine is a clear example. At the beginning, it had very few elements. More elements, graphics, stylesheets and HTML specifications have been added as time went on. Initially, the only way to create a webpage was with static HTML. JavaScript was then added for user interactivity. ActiveX, Silverlight, Java Applets, etc. were further added to include more features. For example, ActiveX allowed browsers to host various executables, which enabled users to read PDF and various other file formats such as Flash, DivX, etc.
Web developers started using integrated development environments that generated a considerable amount of HTML markup, and this increased the HTML payload. The number of browsers increased and some of these browsers, especially Internet Explorer, had their own quirks and needed more work from the developers. These factors raised the complexity of webpages, which potentially increased the ways in which webpages can be 'adversely affected' and become malicious.

Cross Site Scripting (XSS) injects malicious code from an unexpected source. This malicious code can get hold of the cookies and browsing history and then send them over to the malicious webpage. Thus the user's privacy is jeopardized. There have been many attempts to prevent this sort of attack (Lucca, Fasolino, Mastoianni, & Tramontana, 2004). XSS affects not only the user but also the server. The webpage is used as the vehicle to transfer infections to multiple users. The malicious code then executes in the user's browser. The problem has been intensified by the addition of scripting capabilities that did not exist at the beginning of the history of web browsing. With the addition of scripting capabilities, users benefit from a better user experience but have become prone to these additional security problems. These scripts run in users' browsers. The web developer may build a webpage using only HTML, but an attacker can still inject scripts into it. These scripts can then access the cookies that are used for authentication. The XSS vulnerability therefore affects both the users and the webpages.
Take, for example, a user who visits a webpage and decides to purchase a product. The user adds the items to the basket and would like to check out. The user then fills in a form to register. Each user is uniquely identifiable by the webpage through the use of cookies. The criminal will be able to look at the cookies, impersonate the user and buy the products without the knowledge of the user. By the time the user has realized the problem, the money has already been dispatched from the user's account.

Almost all HTML tags are wrapped by 'less than' and 'greater than' characters. To write the script tag, these two characters are needed. There are several combinations of characters that can generate them (Braganza, 2006). The combinations are quite vast and are likely to increase. The character combinations that generate these two characters depend on the browser version and the default language. Braganza (2006) states that the browser cannot be trusted because of these extensive possibilities and that some precautions are required. To cleanse the data, entered data are encoded and displayed data are decoded; this process is known as 'sanitization'. In terms of how the webpage is deployed to the user, the operations team has to make sure that the firewall and any other preventative measures are kept up to date.

Another security threat that is very difficult to detect is clickjacking. This is a relatively new threat that has become more prevalent through the advancement of modern browsers. The interesting thing about clickjacking is that it does not use security vulnerabilities; rather, it uses the browser's most common features, such as hyperlinks. The user is encouraged to click a link to a webpage. But this webpage consists of two webpages: one is displayed to the user and the other, the malicious webpage, is hidden from the user (Hansen & Grossman, 2008). The hidden webpage executes the malicious code even though the user thinks that the information is on the right webpage.
This technique is very hard to detect by inspecting the source code and there have not been many successful ways to prevent it from happening.

Drive-by-download occurs without the knowledge of the user, and the downloaded file is used for malicious purposes. The malicious executable installs itself on the user's computer. This is a very popular method of spreading malware infection on the Internet (Harley & Bureau, 2008). There are three components in the attack: the web server, the browser and the malware. An attacker finds a web server to serve the malware. A user who visits a webpage hosted on this web server is then exploited by the webpage, which injects code that utilizes software loopholes to execute commands in the user's browser. The web server subsequently provides the malware, which is downloaded by the browser. The targeted browser is therefore one with a known vulnerability that the attacker tries to exploit. Internet Explorer had many instances of ActiveX loopholes that attackers have used and are still using, and Harley and Bureau (2008) have provided potential solutions. The first solution is to completely isolate the browser from the operating system so that arbitrary code is not executed in the browser at all. Another solution is for web crawlers to visit webpages and check whether they are hosting any malware content. But attackers can avoid this by using a URL that does not have a corresponding hyperlink. Crawlers by their nature only visit URLs that have a corresponding hyperlink.

Browsers these days use publicly available blacklists of malicious webpages. These blacklists are updated only every few days or even once a month. These gaps allow webpages to be affected while going unnoticed by the crawler. At this point, the users will also be affected, because the browser considers the webpage to be secure, as it has never been in the blacklist.
Take another scenario where a regular webpage is hacked and injected with malicious code that is visible only to some users, or to a group of users of an organization or a country. The blacklists will not be able to 'blacklist' those either. Some crawlers do not validate the JavaScript code because the code is executed on the server and not in a browser. This allows client vulnerabilities to pass through easily. Even scripts which are assumed to be safe can load malicious scripts remotely and then execute them on the computer. Some scripts create iFrames and then load external webpages that are malicious (Provos, Mavrommatis, Rajab, & Monrose, 2008). These external webpages then get hold of the cookies and steal the user's identity. The users then browse this malicious webpage, get infected and are then easily tracked by remote users from somewhere else. The users may also run malicious executables without even knowing that the executables already have access to the system and are monitored from somewhere else.

Webpages are the common victims of all the threats that have been described above. The features in a webpage can indicate whether it is malicious or not. Researchers have studied and analyzed a large number of features, with or without the machine learning techniques described below.

Kan and Thi (2005) carried out one of the first research works that utilized machine learning to detect malicious webpages. This work ignored webpage content and looked at URLs using a bag-of-words representation of tokens with annotations about the tokens' positions within the URL. A noteworthy result from Kan and Thi's research is that lexical features can achieve 95% of the accuracy of page content features. The work of Garera, Provos, Chew, and Rubin (2007) used logistic regression over 18 hand-selected features to classify phishing URLs.
The features include the presence of red flag keywords in the URL, features based on Google's page ranking, and Google's webpage quality guidelines. Garera et al. achieved a classification accuracy of 97.3% over a set of 2500 URLs. Although this paper has a similar motivation and methodology, it differs by trying to detect all types of malicious activities. This paper also uses more data for training and testing, as described in the subsequent sections. Spertus (1997) suggested an alternative approach and endeavored to identify malicious webpages, Cohen (1996) employed decision trees for detection, and Dumais, Platt, Heckerman, and Sahami (1998) utilized inductive learning algorithms and representations for text categorization.

Guan, Chen, and Lin (2009) focused on classifying URLs that appear in webpages. Several URL-based features were used, such as webpage timing and content. This paper has used similar techniques but applied them to webpages, which have much more complex structures, with better accuracies. Mcgrath and Gupta (2008) did not construct a classifier but performed a comparative analysis of phishing and non-phishing URLs. With respect to data sets, they compared non-phishing URLs drawn from the DMOZ Open Directory Project to phishing URLs from Phishtank (2013) and a non-public source. The features they analyzed included IP addresses, WHOIS thin records (containing date and registrar-provided information only), geographic information, and lexical features of the URL (length, character distribution and presence of predefined brand names). The difference is that this paper utilizes different types of features to add to the novelty. Provos et al. (2008) carried out a study of drive-by exploit URLs and used a patented machine learning algorithm as a pre-filter for virtual machine (VM) based analysis. This approach is based on heavyweight classifiers and is time consuming. Provos et al.
(2008) used the following features in computer simulation: content-based features from the page, whether inline frames are 'out of place', the presence of obfuscated JavaScript, and finally whether iFrames point to known malicious sites. Please note, an 'iFrame' is a window within a page that can contain another page. In their evaluations, the machine learning based pre-filter can achieve 0.1% false positives and 10% false negatives. Provos et al.'s approach is very different from this paper, as its features are primarily focused on iFrames. The research of Bannur, Saul, and Savage (2011) has some similarities to this paper, but it uses a very small dataset, and furthermore this paper utilizes various other types of features. Abbasi, Zhang, Zimbra, Chen, and Nunamaker (2010) and Abbasi, Zahedi, and Kaza (2012) ran some classification to detect fake medical sites, but the size of the dataset was very small, whereas this paper focuses on detecting any type of malicious webpage. The research of Fu, Wenyin, and Deng (2006) and Liu, Deng, Huang, and Fu (2006) considered the visual aspects of a webpage to determine whether the page is malicious or not. Le, Markopoulou, and Faloutsos (2010) detected phishing webpages using only the URLs. Ma, Saul, Savage, and Voelker (2009b) looked at online learning to detect malicious webpages from URL features.

As security improves, attackers will devise new ways to avoid the barriers raised by administrators and web developers. To further improve security, an automated tool is required in order to detect the vulnerabilities. One approach to automation is for web developers to secure and enhance their webpages. But there are limits to the extent that developers can work to secure webpages. Web developers are bound by the web frameworks they use (Okanovic, Mateljan, & May, 2011).
If the web frameworks fail to take preventative measures, the users' machines get infected and the webpages become vulnerable.

This paper takes the research in malicious webpages described above further by applying three supervised machine learning techniques (Naive Bayes Classifier, K-Nearest Neighbor and Support Vector Machine) and two unsupervised machine learning techniques (K-Means and Affinity Propagation), and compares the results. The novel unsupervised techniques of K-Means and Affinity Propagation have not been applied to the detection of malicious webpages by any other researchers in the past. Moreover, to add to the novelty, the research utilizes more complex structures for better accuracies, uses more data for training and testing, and employs both the lightweight and the heavyweight classifiers for detection of all types of malicious activities.

2. Machine learning models

This section provides a brief overview of the five machine learning techniques that have been used in this paper: Naive Bayes Classifier, K-Nearest Neighbor, Support Vector Machine, K-Means and Affinity Propagation. The computer simulation results of these machine learning methods are discussed in Section 3.

2.1. Naive Bayes Classifier

The Naive Bayes Classifier uses Bayes' theorem. The classifier assumes that all the features are independent of each other given the class. It learns $p(C_k \mid x)$ by modeling $p(x \mid C_k)$ and $p(C_k)$, using Bayes' rule to infer the posterior class probability (Bayes & Price, 1763). The model is

$$
\begin{aligned}
y(x) &= \arg\max_{k}\; p(C_k \mid x) = \arg\max_{k}\; p(x \mid C_k)\, p(C_k) \\
     &= \arg\max_{k}\; \prod_{i=1}^{D} p(x_i \mid C_k)\, p(C_k) \\
     &= \arg\max_{k}\; \sum_{i=1}^{D} \log p(x_i \mid C_k) + \log p(C_k)
\end{aligned}
\tag{1}
$$

The training in this paper was carried out using a Gaussian likelihood (alternative options for training include multivariate likelihood and multinomial likelihood):

$$
p(x \mid C_k) = \prod_{i=1}^{D} N(\mu_{ik}, \sigma_{ik}) \tag{2}
$$

where:

• $C_k$ are the classes, where $C = \{C_1, C_2, \ldots, C_k\}$.
• $N(\mu_{ik}, \sigma_{ik})$ is the normal distribution.
• $\mu$ is the mean of the Gaussian distribution.
• $\sigma$ is the standard deviation of the Gaussian distribution.

The training complexity of the model is $O(NM)$, since each training instance must be visited and each of its features must be counted. For non-linear problems, it can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learnt with unimodal distributions (Jordan, 2002). Naive Bayes is generally used in text classification and is one of the most widely used classification algorithms in machine learning because it is fast and space efficient, which is also noticed in the simulation results described in Section 3.

2.2. K-Nearest Neighbor

K-Nearest Neighbor classifies a new point $\hat{x}$ with the most frequent label $\hat{t}$ of the k nearest training instances. It is modeled as

$$
\hat{t} = \arg\max_{C} \sum_{i \,:\, x_i \in N_k(x, \hat{x})} \delta(t_i, C) \tag{3}
$$

where:

• $N_k(x, \hat{x})$ is the set of k points in $x$ closest to $\hat{x}$.
• The Euclidean distance formula is $\sqrt{\sum_{i=1}^{D} (x_i - \hat{x}_i)^2}$.
• $\delta(a, b) = 1$ if $a = b$; 0 otherwise.

The model does not require any optimization; cross validation is used to learn the appropriate k. The value of k regularizes the classifier: as $k \to N$ the boundary becomes smoother. The space complexity is $O(NM)$, since all training instances and all their features need to be kept in memory. K-Nearest Neighbor uses a very simple technique for classification and cannot handle a large training dataset, as shown in the results section. It can handle non-linear boundaries easily (Altman, 1992).

2.3. Support Vector Machines

SVM was developed by Cortes and Vapnik (1995) and is widely regarded as one of the most effective models for binary classification of high dimensional data. SVM, and indeed any other supervised classifier, uses a similar technique to classify webpages, as shown in Fig. 1.
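Before turning to the SVM decision function, the two simpler decision rules of Eqs. (1) to (3) can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names, the toy two-feature data and the small floor on the standard deviation are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter, defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the log prior and the per-feature mean/std of each class (Eq. 2)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor avoids division by zero; it is an assumption of this sketch.
        stds = [max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-9)
                for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, stds)
    return model


def predict_gaussian_nb(model, x):
    """Return arg max_k of log p(C_k) + sum_i log N(x_i | mu_ik, sigma_ik) (Eq. 1)."""
    def log_normal(v, mu, sigma):
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)

    scores = {
        label: log_prior + sum(log_normal(v, m, s) for v, m, s in zip(x, means, stds))
        for label, (log_prior, means, stds) in model.items()
    }
    return max(scores, key=scores.get)


def predict_knn(X, y, x_new, k=3):
    """Return the most frequent label among the k nearest training points (Eq. 3)."""
    neighbours = sorted((math.dist(row, x_new), label) for row, label in zip(X, y))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors and labels, made up purely for illustration.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
y_train = ["safe", "safe", "safe", "malicious", "malicious", "malicious"]
nb_model = fit_gaussian_nb(X_train, y_train)

print(predict_gaussian_nb(nb_model, [0.12, 0.18]))
print(predict_knn(X_train, y_train, [0.9, 0.9], k=3))
```

The Naive Bayes part only stores one mean and one standard deviation per feature and class, whereas the K-Nearest Neighbor part must keep every training row in memory, which mirrors the space complexity remarks in Sections 2.1 and 2.2.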
SVM is modeled as

$$
h(x) = b + \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) \tag{4}
$$

where

• $h(x)$ is the signed distance of $x$ from the decision boundary.
• $b$ is the bias weight.
• $\alpha_i$ are the coefficients that maximize the margin of correct classification on the training set.
• $N$ is the number of training examples.
• $y_i$ is the label of training example $x_i$.
• $K$ is the kernel function.
• $x$ is a feature vector.

Fig. 1. Supervised machine learning architecture for classifying webpages.

The positive or negative sign of this distance indicates the side of the decision boundary on which the example lies. The value of $h(x)$ is thresholded to predict a binary label for the feature vector $x$. The model is trained by initially specifying a kernel function $K(x, x')$ and then computing the coefficients $\alpha_i$ that maximize the margin of correct classification on the training set. The required optimization can be formulated as an instance of quadratic programming, a problem for which many efficient solutions have been developed (Chang & Lin, 2012).

2.4. K-Means

K-Means is a geometric, hard-assignment clustering algorithm, where each data point is assigned to its closest centroid. It is modeled using hard assignments $r_{nk} \in \{0, 1\}$ so that $\forall n\; \sum_{k} r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster k (MacQueen, 1967). The geometric distance is calculated using the Euclidean distance ($\ell_2$ norm):

$$
\lVert x_n - \mu_k \rVert_2 = \sqrt{\sum_{i=1}^{D} (x_{ni} - \mu_{ki})^2} \tag{5}
$$

where

• $\mu_k$ is the cluster centroid.
• $D$ is the number of dimensions (features) of each point.
• $x_n$ is one of the points.

The mini-batch K-Means algorithm uses mini-batches to reduce the computation time while still attempting to optimize the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution.
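The mini-batch idea can be sketched as follows. This is only an illustration under stated assumptions: the paper does not spell out its update rule, so the sketch uses the commonly described variant in which each centroid moves towards a sampled point with a per-centroid learning rate of one over the number of points it has absorbed so far. The function name, the parameters and the toy data are hypothetical.

```python
import math
import random


def mini_batch_kmeans(points, init_centers, batch_size=5, n_iter=50, seed=0):
    """Cluster `points` using randomly sampled mini-batches.

    Assumed update rule (not taken from the paper): a centroid moves towards
    each point assigned to it with step size 1 / (points absorbed so far).
    """
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Hard assignment: index of the closest centroid under the Euclidean distance (Eq. 5).
        nearest = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                   for p in batch]
        for p, k in zip(batch, nearest):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centers[k] = [(1 - eta) * c + eta * v for c, v in zip(centers[k], p)]
    return centers


# Two well-separated toy blobs, made up for illustration.
points = [(i * 0.1, i * 0.1) for i in range(10)] + \
         [(10 + i * 0.1, 10 + i * 0.1) for i in range(10)]
centers = mini_batch_kmeans(points, init_centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each iteration touches only `batch_size` points instead of the whole dataset, which is where the reduction in computation comes from; the price is that the centroids are running averages over sampled points rather than exact cluster means.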
Mini-batch K-Means converges faster than K-Means, but the quality of the results is reduced. In practice, the difference in quality can be quite small, as shown in the results section.

2.5. Affinity Propagation

Affinity Propagation is an unsupervised algorithm created by Dueck (2009) in which every data point is a candidate centroid (exemplar) and the data points themselves determine the number of clusters. The centroid chosen by data point i is represented as

$$
c_i \in \{1, \ldots, N\} \tag{6}
$$

The algorithm maximizes the following function:

$$
S(c) = \sum_{i=1}^{N} s(i, c_i) + \sum_{k=1}^{N} \delta_k(c) \tag{7}
$$

The first term in Eq. (7) measures the similarity of each point to its centroid, and the second term is a penalty term that takes the value $-\infty$ for invalid configurations. If some data point i has chosen k as its exemplar, that is $c_i = k$, but k has not chosen itself as an exemplar, i.e. $c_k \neq k$, then the constraint in Eq. (8) applies:

$$
\delta_k(c) =
\begin{cases}
-\infty & \text{if } c_k \neq k \text{ but } \exists\, i : c_i = k \\
0 & \text{otherwise}
\end{cases}
\tag{8}
$$

A factor graph can represent the objective function, and it is possible to use N nodes, each with N possible values, or $N^2$ binary nodes. Each variable node $c_i$ sends a message to each factor node $\delta_k$ and each factor node $\delta_k$ sends a message to each variable node $c_i$. The number of clusters is controllable by scaling the diagonal term $s(i, i)$, which shows how much each data point would like to be an exemplar. Although Affinity Propagation was developed very recently, it performs very well, as shown later in the results section. The overall architecture of the unsupervised machine learning is outlined in Fig. 2.

Fig. 2. Unsupervised machine learning architecture for classifying webpages.

3. Results

The architecture of the computer simulation carried out for this paper is presented in Fig. 3. The experiment was conducted on an Intel Xeon E3-1220 computer with 4 cores × 3.1 GHz and 12 GB of RAM. First, 100,000 webpages were downloaded using a crawler and then converted into feature vectors. Then a tool named Web Application Classifier (WAC) took these vectors as inputs and applied the machine learning algorithms described in the previous section to create the predictive models. These predictive models read the vectors of new webpages to produce an output that indicated whether a webpage is safe or not. Sections 3.1–3.5 describe which features were used and how the webpages were preprocessed and cleansed before being placed into the predictive models.

Fig. 3. The architecture of Web Application Classifier (WAC).

3.1. Data sources

The downloaded webpages were divided into two sets, malicious and safe. The list of safe webpages was gathered primarily from Alexa (2013) and the malicious ones were collated primarily from Phishtank (2013). Two types of representation of each webpage were created: one contained all the HTML code and the other contained only English characters.

3.2. Features

3.2.1. Semantic

This paper utilizes the vector space representation to extract the semantic features, as it is commonly used in classification and clustering. Salton, Wong, and Yang (1975) carried out research using a vector space model for automatic indexing of webpages, Cohen and Singer (1999) used a context-sensitive learning method for categorizing text, and Sahami, Dumais, Heckerman, and Horvitz (1998) utilized a Bayesian approach to classify emails for detecting spam. The vector space representation denotes webpages as vectors in a very high dimensional space. Each webpage is represented as a Boolean or numerical vector; in this paper, it is represented as a numerical vector. In this very high dimensional space, each dimension corresponds to a term, which is a sequence of alphanumeric characters. Each component of a given webpage's vector holds a numerical value specifying some function f of how often the term corresponding to that dimension appears in the webpage.
Salton and Buckley (1988) used an alternative, called term weighting, but it is not used in this simulation. In a vector space representation of webpages, $n(t_i, d)$ is the number of occurrences of term $t_i$ in webpage d; $n(t_i, d)$ is a random variable. The function f is applied to $n(t_i, d)$ and produces a value for the ith component of the vector for webpage d. The identity function $f(a) = a$ is applied to the term counts. Other common functions that are applied to the term frequencies are outlined below.

$$
f(a) = \log(a + 1) \tag{9}
$$

$$
f(a) = \sqrt{a} \tag{10}
$$

$$
f(a) = \frac{a}{a + \mathrm{const}} \tag{11}
$$

Eq. (9) was defined by Robertson and Jones (1976). Eq. (10) was used in the scatter/gather system (Cutting, Karger, Pedersen, & Tukey, 1992) for webpage clustering and was found to outperform Eq. (9). Robertson and Walker (1994) proposed Eq. (11) and found this general form to be useful for webpage retrieval by making use of various instantiations of the constant value.

Fig. 4. Visual representation of supervised learning models demonstrating clear separations of malicious and safe webpages.

Term frequency–inverse document frequency (TFIDF) weighting is the most widely used function for the webpage term frequencies. The term frequencies (TF) in each webpage and the inverse document (webpage) frequency (IDF) of each term in the entire collection are part of the weighting function. IDF is defined as

$$
\mathrm{IDF}(t) = \log\left(\frac{N}{n_t}\right) \tag{12}
$$

where N is the total number of webpages in the collection and $n_t$ is the number of webpages in which term t appears at least once. The TFIDF weight for a term t in a webpage d is the product of the term frequency and the inverse webpage frequency for that term:

$$
\mathrm{TFIDF}(t, d) = n(t, d) \cdot \mathrm{IDF}(t) \tag{13}
$$

This paper uses a simple Boolean representation of webpages that records whether or not a given term appears in a webpage. Most rule-based methods (Apté, Damerau, & Weiss, 1994; Cohen, 1996) use an underlying Boolean model, and the Boolean vector representation has also been used in probabilistic classification models.
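The term-weighting functions of Eqs. (9) to (13) and the Boolean representation can be illustrated with a short sketch. It is a minimal example using only the Python standard library, not the feature extraction code used in the paper; the tokenizer, the function names, the default constant in Eq. (11) and the three sample page texts are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """n(t, d): number of occurrences of each alphanumeric term in one webpage."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def f_log(a):
    """Eq. (9)."""
    return math.log(a + 1)


def f_sqrt(a):
    """Eq. (10)."""
    return math.sqrt(a)


def f_saturating(a, const=1.0):
    """Eq. (11); the default value of const is an arbitrary choice for this sketch."""
    return a / (a + const)


def idf(term, corpus_counts):
    """Eq. (12): log(N / n_t). Returns 0.0 for a term that appears in no webpage."""
    n_t = sum(1 for counts in corpus_counts if term in counts)
    return math.log(len(corpus_counts) / n_t) if n_t else 0.0


def tfidf(term, counts, corpus_counts):
    """Eq. (13): n(t, d) * IDF(t)."""
    return counts[term] * idf(term, corpus_counts)


def boolean_vector(counts, vocabulary):
    """Boolean representation: 1 if the term appears in the webpage, else 0."""
    return [1 if term in counts else 0 for term in vocabulary]


# Hypothetical page texts, made up for illustration.
pages = [
    "Free prize! Claim your free prize now",
    "Company news and product updates",
    "Login to claim your account",
]
corpus = [term_counts(page) for page in pages]
vocabulary = sorted(set().union(*corpus))

print(tfidf("free", corpus[0], corpus))
print(boolean_vector(corpus[1], vocabulary))
```

In this toy collection the term "free" occurs twice in the first page and in no other page, so its TFIDF weight there is 2 · log(3), while "claim" occurs in two of the three pages and is therefore weighted less per occurrence.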
Another way to incorporate word frequency information into probabilistic classification models is to use a parametric distribution, such as a bounded Gaussian or Poisson distribution, to capture the probability of words appearing a given number of times in webpages. Yang and Chute (1994) provide further evidence that the Boolean representation is adequate by comparing Boolean and frequency-based representations.

Table 1. Results of comparisons of supervised machine learning models that detect malicious webpages (key: KNN: K-Nearest Neighbor, LS: Linear SVM, RS: RBF SVM, NB: Naive Bayes).

No. of webpages   KNN (%)   LS (%)   RS (%)   NB (%)
50                74        80       79       77
100               75        82       83       78
500               79        86       92       78
5000              91        93       97       84
100,000           95        93       98       89

Word stemming reduces the words in the webpages to their root forms, known as word stems. Porter (1997) described a simulation model that utilizes the word-stemming algorithm, which is able to combine similar and dissimilar items into one. But Frakes and Baeza-Yates (1992) compared various stemming methods to unstemmed representations and showed that in many cases the performances of both representations are very close to each other.

3.2.2. URLs

URLs identify webpages and are used as unique identifiers. Many malicious webpages have suspicious-looking characters in their URLs and in their contents. Sometimes URLs have spelling mistakes too. The lexical features of URLs were fed into the machine learning models. If there were spelling mistakes or suspicious characters in a URL, then the URL was regarded as suspicious.

3.2.3. Page links

Webpages have many links in order to provide further information. Webpages that link to malicious webpages are likely to be malicious themselves. The computer simulations extracted all the links from each webpage and they were also fed into the machine learning models.

Fig. 5. ROC curves for supervised machine learning models.

3.2.4. Visual features

All the features that have been mentioned so far are text based, such as source code, stripped HTML, domain names and URLs. The visual features are based on images. First, screenshots of webpages were downloaded by passing the URLs to PhantomJS (a headless WebKit browser with a JavaScript API). PhantomJS took each URL, saved a screenshot of the webpage and converted it to the Portable Network Graphics (PNG) file format. The images were then converted to a format understandable by the proposed models. Two popular techniques are generally used for this: Speeded Up Robust Features (SURF) and Scale Invariant Feature Transform (SIFT) (Lowe, 2004). The simulation used SURF, as it has less stringent licensing options compared to SIFT. The idea behind using the visual features is that malicious webpages look simpler because they are likely to have less input from designers, whereas safe webpages are designed better.

A notable feature is that the unsupervised machine learning models were used in conjunction with the supervised machine learning models. Each screenshot of the webpages was analyzed using SURF. The values from SURF were clustered using the unsupervised machine learning models. These clusters were then fed into the supervised machine learning models to further improve the classification.

3.3. Evaluation of the machine learning models

The machine learning methods discussed previously were used on different combinations of features. First the webpages were classified from just content features, then other types of features were added and the gain was reexamined. It was found in this research that the highest accuracy is obtained by combining URL, page-link, semantic TFIDF and SURF features, which adds to the paper's novelty. This combination of features was used as the optimal feature configuration. The unsupervised models had to be evaluated differently because the truth labels are not known, so the evaluation had to be performed using the model itself. Finally, the machine learning techniques were trained on data sets with varying ratios of malicious and safe webpages. 70% of the labeled webpages were used for training and 30% for testing. The ratio of malicious to safe webpages is the same in testing as in training for the supervised machine learning models. The supervised classification performances were evaluated in terms of precision and recall, while the silhouette coefficient was used to evaluate the unsupervised techniques. A higher silhouette coefficient indicates a model with better-defined clusters. The silhouette coefficient is defined for each sample and is bounded between −1 for incorrect clustering and +1 for highly dense clustering.

Table 2. Results based on datasets provided by Ma et al. (2009a).

Classifier                       Accuracy (%)
K-Nearest Neighbor               91
RBF Support Vector Machine       97
Linear Support Vector Machine    92
Naive Bayes                      85

3.3.1. Supervised techniques

Fig. 4 shows a visual representation of webpages as safe or malicious, using supervised classification. K-Nearest Neighbor, Radial Basis Function (RBF) SVM, Linear SVM and Naive Bayes clearly separate the safe and malicious webpages. The number of webpages used in Fig. 4 is somewhat smaller in order to demonstrate pictorial representations of the outcomes. Table 1 further demonstrates the overall results of the supervised techniques. The accuracy of all the supervised models improved as the number of webpages increased. In general, SVM outperformed the rest.

Fig. 6. Confusion matrices for supervised machine learning techniques.
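As an aside on the evaluation protocol of Section 3.3, the silhouette coefficient can be computed without truth labels, as the following sketch shows. The paper does not give the formula, so the sketch assumes the standard per-sample definition s = (b − a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the smallest mean distance to the points of any other cluster; the function name and the toy points are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (assumed standard definition)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, hence the division by len - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy points, made up for illustration: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # clusters match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad)
```

A labelling that matches the two groups scores close to +1, while a labelling that cuts across them scores below zero, which is the behaviour the bounds in Section 3.3 describe.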
In the case of SVM, the accuracy values are remarkably low when only a small number of webpages is used. As soon as the number of webpages exceeds 500, the accuracy increases. The other models initially show a similar trend: as the number of webpages increases, the accuracy also increases. The results suggest that the models were able to generalize better as more patterns emerged from various sources.

Receiver Operating Characteristic (ROC) curves and confusion matrices are used to analyze the efficiency and performance of the supervised algorithms, as outlined in Figs. 5 and 6. An ROC curve is a graph that demonstrates the efficiency and performance of a classifier system by plotting the true positive rate against the false positive rate at various threshold settings. Fig. 5 illustrates that Linear SVM and RBF SVM perform the best of the four classifiers, with 0.93 and 0.91 respectively. K-Nearest Neighbor performs the worst because it had access to less training data due to memory constraints.

A confusion matrix is a contingency table that enables visualization of the efficiency and performance of a supervised learning algorithm, which makes it easy to see whether the system is confusing or mislabeling two classes. In Fig. 6, the confusion matrices have four sections: True Positive, True Negative, False Positive and False Negative. True Positive denotes that a malicious webpage is correctly identified as malicious. True Negative represents that a safe webpage is correctly labeled as safe. False Positive means that a safe webpage is incorrectly identified as malicious. False Negative indicates that a malicious webpage is incorrectly labeled as safe. The proposed machine learning techniques are designed so that priority is given to detecting the malicious websites, that is, the true positives. With this in mind, the True Positive percentages from highest to lowest are in the following order: RBF SVM (97.8%), Linear SVM (92.4%), Naïve Bayes (76.4%) and K-Nearest Neighbor (9.9%).

The supervised models were also run against one of the most popular datasets, provided by Ma, Saul, Savage, and Voelker (2009a). The data files were in an SVM-based format, and therefore the data were converted into formats that could be fed into the machine learning models. Table 2 shows the results of the simulations using the Ma et al. datasets. All the supervised algorithms scored 85% or higher, and RBF SVM performed the best with 97% accuracy.

3.3.2. Unsupervised techniques

Fig. 7 shows that the three unsupervised algorithms clearly separate the malicious and safe webpages. In Fig. 7, a smaller number of webpages was used so that the outcomes can be shown pictorially. The detailed numerical results, obtained using 100,000 webpages, are outlined in Table 3. Affinity Propagation performs the best among the unsupervised machine learning algorithms in Table 3 because its silhouette coefficient is closest to 1. Furthermore, the Affinity Propagation algorithm identifies three clusters, with the red cluster grouped as outliers, whereas the other two algorithms, Mini Batch K-Means and K-Means, find only two clusters.

Fig. 7. Visual representation of unsupervised learning models demonstrating clear separations of malicious and safe webpages.

Table 3
Results of comparisons of unsupervised machine learning models that detect malicious webpages.

Classifier              Silhouette coefficient
Mini Batch K-Means      0.877
Affinity Propagation    0.963
K-Means                 0.877

Fig. 8. The Chrome extension uses 4 steps to decide whether a webpage is malicious or not.

3.4. Online learning

The conventional batch processing machine learning models cannot learn incrementally.
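To make the contrast with batch training concrete, the sketch below shows one common way a model can be updated from a stream, one labeled example at a time. The paper does not specify which incremental algorithm was used, so this is a stand-in only: it uses logistic regression trained by stochastic gradient descent, and the class `OnlineLogisticClassifier`, its `partial_fit` method, the two-feature layout and the synthetic stream are all assumptions introduced for illustration.

```python
import math
import random


class OnlineLogisticClassifier:
    """Binary classifier updated one labeled example at a time."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.seen = 0  # number of examples consumed so far

    def predict_proba(self, features):
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        score = max(-30.0, min(30.0, score))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-score))

    def predict(self, features):
        return 1 if self.predict_proba(features) >= 0.5 else 0

    def partial_fit(self, features, label):
        """One stochastic gradient step on the log loss for a single example."""
        error = self.predict_proba(features) - label
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error
        self.seen += 1


def toy_stream(n_examples, seed=0):
    """Yield synthetic (features, label) pairs; 1 = malicious, 0 = safe."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        label = 1 if rng.random() < 0.5 else 0
        centre = (0.9, 0.8) if label else (0.1, 0.1)
        yield [c + rng.uniform(-0.1, 0.1) for c in centre], label


model = OnlineLogisticClassifier(n_features=2)
for features, label in toy_stream(500):
    model.partial_fit(features, label)  # each example is used once, then discarded

print(model.seen, model.predict([0.9, 0.8]), model.predict([0.1, 0.1]))
```

Because each update needs only the current example and the current weights, earlier training data never has to be stored or revisited, which is the property that distinguishes this style of training from batch processing.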
Online learning needs to employ a different approach from traditional batch processing in order to accommodate new incoming data. This problem is addressed in this paper by using streaming data in the form of a list of malicious webpages and safe webpages, which are compiled from various sources. This allows the supervised predictive models inside WAC to train automatically as new data come in. This has been found to be very useful in conjunction with the Chrome extension, which is described in Section 3.5.

3.5. Chrome extension

The Chrome extension has been built using the architecture shown in Fig. 8. The Content Script looks at the loading document and sends the loaded source code to the predictive classifier. The classifier then parses the source, creates the features and responds with whether it considers the webpage to be safe or malicious. The Background Script receives the response and displays whether the webpage is safe, as demonstrated in Fig. 9, or malicious, as shown in Fig. 10. The four steps in the Chrome extension are outlined below:

Step 1: Grab the webpage.
Step 2: Extract the features and send them to the predictive model. The predictive model then sends a response.
Step 3: Content Script notifies the Background Script with the response.
Step 4: Background Script displays the response.

Fig. 9. The Chrome extension shows that the website loaded by the user is safe.

Fig. 10. The Chrome extension shows that the website loaded by the user is malicious.

‘Heavyweight’ classifiers use more features and so have a higher accuracy, but they have a poor prediction time. ‘Lightweight’ classifiers do the opposite: they use fewer features and consume the features from the browser.
This paper utilizes Google Chrome, which exploits the predictive capabilities of the proposed supervised and unsupervised machine learning models by taking into account the advantages of both the heavyweight and the lightweight classifiers. The Chrome extension obtains all the features from the browser and sends them to the classifier that uses more features, and thus has a quick prediction time and yet a higher accuracy. The Chrome extension used the supervised models for accurate classification, while the unsupervised models were used to ascertain the clusters of the screenshots of webpages.

4. Conclusion

This paper presents the evaluation of three supervised machine learning models and two unsupervised machine learning models for text classification, to detect webpages as either malicious or not. All the supervised techniques were trained on and applied to a large number of webpages that had been manually split into two classes. In a nutshell, all of the machine learning techniques show encouraging performance, with accuracies above 89% for the supervised techniques and silhouette coefficients above 0.87 for the unsupervised techniques using 100,000 webpages. As the number of webpages increased, the accuracy of each type of machine learning model improved. The unsupervised models did not perform better than the supervised ones, but came close. This is interesting because the unsupervised models were not aware of the two classes of malicious and safe.

The major contribution of this paper is to explore a range of machine learning algorithms that use a wide range of features, including the use of unsupervised algorithms. The computer simulation results show that the classifiers can obtain information from the URLs, page links, semantics and visual features of webpages. Different ratios of malicious to safe webpages in the data sets do not significantly affect the error rates of the classifiers.
The content features of webpages contributed the most to reducing the error rates, followed by the URLs and then the SURF visual features. The visual features take more computational resources and are time consuming to compute. However, SURF features can capture the visual information faster than SIFT.

Another significant contribution of this paper is the use of the Chrome extension in the browser. This made it possible to detect whether a webpage is malicious or not in a very short time. Lightweight classifiers are very fast but slightly less accurate, whereas heavyweight classifiers are more accurate but take more time. This extension enabled the SVM classifier to be ‘middleweight’, with very high accuracy and yet a short prediction time.

The final contribution is the use of online learning to detect malicious webpages, which allows training to take place without going through the ‘old training data’. This is very significant for real world applications.

There is one limitation in this paper: malicious webpages will not be detected if the harmful elements lie outside the features that have been used in the proposed algorithms. Despite this limitation, the paper’s contribution is encouraging, as the accuracy rates are improved through the use of visual features.

This research work could be taken further in several ways, some of which are described below.

(i) Changing the implementation of the whole system to JavaScript, which would combine the separate processes into one. This approach would allow the application to run in the browser as a standalone extension without external dependencies. This method requires the machine learning models to be rewritten. The question to be answered is whether the prediction time will remain the same.

(ii) Using deep learning to extract features automatically.
With the current scheme, all the features have to be specified in the implementation, and it is difficult to incorporate new features because the system has to be pre-processed, which is time consuming. Deep learning may be able to extract features automatically and will probably adapt to future threats that use new features which have not yet been discovered. The question remains whether deep learning will be as effective as the current successful systems.

(iii) Using the multiple processing units of a Graphics Processing Unit. Using distributed processing in general will reduce the time needed to build the machine learning models and will also improve the prediction time. However, the main challenge will be to rewrite the algorithms so that they make efficient use of the hardware, especially on mobile devices.

Acknowledgement

The authors would like to thank Technology Strategy Board - KTP, UK for their generous support for part of the research [Grant No. ktp006367].

References

Abbasi, A., Zahedi, F. M., & Kaza, S. (2012). Detecting fake medical web sites using recursive trust labeling. ACM Transactions on Information Systems, 30(4), 22:1–22:36. URL http://dx.doi.org/10.1145/2382438.2382441.
Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., & Nunamaker, J. F. (2010). Detecting fake websites: The contribution of statistical learning theory. MIS Quarterly, 34(3), 435–461. URL http://dl.acm.org/citation.cfm?id=2017470.2017473.
Alexa (2013). Alexa. URL http://s3.amazonaws.com/alexa-static/top-1m.csv.zip.
Altman, N. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175–185.
Apté, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233–251. URL http://dx.doi.org/10.1145/183422.183423.
Bannur, S. N., Saul, L. K., & Savage, S. (2011). Judging a site by its content: Learning the textual, structural, and visual features of malicious web pages. In Proceedings of the 4th ACM workshop on security and artificial intelligence. AISec ’11 (pp. 1–10). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/2046684.2046686.
Bayes, M., & Price, M. (1763). An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions, 53, 370–418. URL http://www.rstl.royalsocietypublishing.org/content/53/370.short.
Braganza, R. (2006). Cross-site scripting: Cross-site scripting an alternative view. Network Security, 2006(9), 17–20. URL http://dx.doi.org/10.1016/S1353-4858(06)70425-1.
Chang, C.-C., & Lin, C.-J. (2012). Libsvm: A library for support vector machines. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Cohen, W. W. (1996). Learning trees and rules with set-valued features. In Proceedings of the thirteenth national conference on artificial intelligence. AAAI’96 (Vol. 1, pp. 709–716). Springer.
Cohen, W., & Singer, Y. (1999). Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS), 17(2), 141–173.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 273–297.
Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval. SIGIR ’92 (pp. 318–329). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/133160.133214.
Dueck, D. (2009). Affinity propagation: Clustering data by passing messages.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on information and knowledge management. CIKM ’98 (pp. 148–155). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/288627.288651.
Facebook (2010). Facebook suffers from rash of clickjacking. Network Security, 2010(6), 20. URL http://www.sciencedirect.com/science/article/pii/S1353485810700866.
Frakes, W. B., & Baeza-Yates, R. (Eds.). (1992). Information retrieval: Data structures and algorithms. Upper Saddle River, NJ, USA: Prentice-Hall Inc.
Fu, A. Y., Wenyin, L., & Deng, X. (2006). Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (EMD). IEEE Transactions on Dependable and Secure Computing, 3(4), 301–311. URL http://dx.doi.org/10.1109/TDSC.2006.50.
Garera, S., Provos, N., Chew, M., & Rubin, A. D. (2007). A framework for detection and measurement of phishing attacks. In Proceedings of the 2007 ACM workshop on recurring malcode. WORM ’07 (pp. 1–8). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/1314389.1314391.
Guan, D., Chen, C., & Lin, J. (2009). Anomaly based malicious URL detection in instant messaging. In Proceedings of the joint workshop on information security (JWIS).
Hansen, R., & Grossman, J. (2008). Clickjacking. URL http://www.sectheory.com/clickjacking.htm.
Harley, D., & Bureau, P.-M. (2008). Drive-by downloads from the trenches. In 3rd international conference on malicious and unwanted software, MALWARE 2008 (pp. 98–103).
Jordan, A. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Advances in Neural Information Processing Systems, 14, 841.
Kan, M., & Thi, H. (2005). Fast webpage classification using URL features. In Proceedings of the 14th ACM international conference on information and knowledge management (pp. 325–326). ACM.
Le, A., Markopoulou, A., & Faloutsos, M. (2010). Phishdef: URL names say it all. CoRR abs/1009.2275.
Liu, W., Deng, X., Huang, G., & Fu, A. Y. (2006). An antiphishing strategy based on visual similarity assessment. IEEE Internet Computing, 10(2), 58–65. URL http://dx.doi.org/10.1109/MIC.2006.23.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Lucca, G. A. D., Fasolino, A. R., Mastoianni, M., & Tramontana, P. (2004). Identifying cross site scripting vulnerabilities in web applications. In Proceedings of the web site evolution, sixth IEEE international workshop. WSE ’04 (pp. 71–80). Washington, DC, USA: IEEE Computer Society. URL http://dl.acm.org/citation.cfm?id=1025133.1026460.
Ma, J., Saul, L. K., Savage, S., & Voelker, G. M. (2009a). Beyond blacklists: Learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’09 (pp. 1245–1254). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/1557019.1557153.
Ma, J., Saul, L. K., Savage, S., & Voelker, G. M. (2009b). Identifying suspicious URLs: An application of large-scale online learning. In Proceedings of the 26th annual international conference on machine learning. ICML ’09 (pp. 681–688). New York, NY, USA: ACM. URL http://dx.doi.org/10.1145/1553374.1553462.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam & J. Neyman (Eds.). Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–297). University of California Press.
Malware (2011). A great year for malware. Computer Fraud and Security, 2011(1), 20. URL http://www.sciencedirect.com/science/article/pii/S1361372311700082.
Mcgrath, D. K., & Gupta, M. (2008). Behind phishing: An examination of phisher modi operandi. In Proceedings of the USENIX workshop on large-scale exploits and emergent threats.
Okanovic, V., & Mateljan, T. (2011). Designing a new web application framework. In MIPRO, 2011 proceedings of the 34th international convention (pp. 1315–1318).
Phishtank (2013). Phishtank. URL http://www.phishtank.com.
Porter, M. F. (1997). Readings in information retrieval. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, Ch. An algorithm for suffix stripping (pp. 313–316). URL http://dl.acm.org/citation.cfm?id=275537.275705.
Provos, N., Mavrommatis, P., Rajab, M. A., & Monrose, F. (2008). All your iFrames point to us. In Proceedings of the 17th conference on security symposium. SS’08 (pp. 1–15). Berkeley, CA, USA: USENIX Association. URL http://www.dl.acm.org/citation.cfm?id=1496711.1496712.
Robertson, S., & Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.
Robertson, S. E., & Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. SIGIR ’94 (pp. 232–241). New York, NY, USA: Springer-Verlag New York Inc. URL http://dl.acm.org/citation.cfm?id=188490.188561.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. In Learning for text categorization: Papers from the 1998 workshop (Vol. 62).
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523. URL http://dx.doi.org/10.1016/0306-4573(88)90021-0.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. URL http://dx.doi.org/10.1145/361219.361220.
Sood, A. K., & Enbody, R. J. (2011). Frame trapping the frame busting defence. Network Security, 2011(10), 8–12. URL http://www.sciencedirect.com/science/article/pii/S1353485811701052.
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. In Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on innovative applications of artificial intelligence. AAAI’97/IAAI’97 (pp. 1058–1065). AAAI Press. URL http://www.dl.acm.org/citation.cfm?id=1867406.1867616.
Yang, Y., & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 252–277. URL http://dx.doi.org/10.1145/183422.183424.