Auditing file system permissions using Association Rule Mining

S. Parkinson, V. Somaraki, R. Ward
Department of Informatics, School of Computing and Engineering, University of Huddersfield, Queensgate, Huddersfield, HD1 3DH, UK

Abstract

Identifying irregular file system permissions in large, multi-user systems is challenging due to the complexity of gaining a structural understanding of large volumes of permission information. This challenge is exacerbated when file system permissions are allocated in an ad-hoc manner as new access rights are required, and when access rights become redundant as users change job roles or terminate employment. These factors make it difficult to establish what constitutes an irregular file system permission, and whether an irregularity exposes a vulnerability. The current way of finding such irregularities is to perform an exhaustive audit of the permission distribution; however, this requires expert knowledge and a significant amount of time. In this paper, a novel method of modelling file system permissions is presented which can be used by association rule mining techniques to identify irregular permissions, producing an object-centric model as a by-product. The technique is implemented and tested on Microsoft's New Technology File System (NTFS) permissions. Empirical observations are derived by making comparisons with expert knowledge to determine the effectiveness of the proposed technique on five diverse real-world directory structures extracted from different organisations. The results demonstrate that the technique is able to correctly identify irregularities with an average accuracy rate of 91%, minimising the reliance on expert knowledge. Experiments are also performed on synthetic directory structures, which demonstrate an accuracy rate of 95% when the number of irregular permissions constitutes 1% of the total number. This is a significant contribution as it creates the possibility of identifying vulnerabilities without prior knowledge of how file system permissions are implemented within a directory structure.

Keywords: Access control, Auditing, Association Rule Mining.

1. Introduction

File systems are common amongst the majority of computer operating systems, and from a user perspective their primary use is to securely store files. Modern, multi-user computer systems contain high quantities of data that require strong access control mechanisms to restrict data access to intended users. Different operating systems provide different implementations of access control, but common to the most prevalent is that they provide a customisable architecture for access control. This is implemented through the use of both coarse- and fine-grained permissions (De Capitani di Vimercati et al., 2003). Coarse-grained permissions are predefined levels (e.g. read, write, full control) and fine-grained permissions are customised permissions created from a set of predefined attributes to represent highly customised access control rules. Using a mixture of both types of permissions provides a flexible architecture which can accommodate a large variety of different access control levels. However, this flexible nature also creates the possibility for complex permission relationships and anomalous permissions, which often go undetected.

∗Corresponding author
Email addresses: s.parkinson@hud.ac.uk (S. Parkinson), v.somaraki@hud.ac.uk (V. Somaraki), r.ward@hud.ac.uk (R. Ward)

Preprint submitted to Expert Systems with Applications, February 20, 2016
Administration of file system permissions on large file systems is both a challenging and cumbersome task, as it is often difficult to conceptualise the large volumes of information available. Due to the complexities of managing permissions, unforeseen weak and incorrect allocations are often made. These complexities are usually the result of there being a large number of directories to secure, a large number of users that need to be correctly assigned, and a large number of access control rules. There is a wide range of literature discussing these complexities (Cao and Iverson, 2006; Beznosov et al., 2009; De Capitani di Vimercati et al., 2003), and many factual guides have been produced for different operating systems (Thomas, 2010; Solomon, 2005). However, even with such guides, users are left to analyse the large amount of access control information independently. The level of their knowledge regarding access control and their system's configuration often directly determines the quality of their audit, producing a heavy reliance on expert knowledge.

Vulnerabilities within file system permissions can be divided into two groups: (1) known system vulnerabilities, and (2) those relative to the access control structure implemented within a system. A trivial example of a known system vulnerability is that users should not have access to an important system directory (e.g. C:\windows\system32). These are programmatically easy to find by using a predefined knowledge-base of potential vulnerabilities. Identifying such vulnerabilities is at most O(n × v), where n is the number of access control entries to examine and v is the number of known vulnerabilities. An example of a relative vulnerability is an incorrect assignment of permission with respect to an organisation's implementation of access control; for example, the anomaly of one user having write privileges on a directory where all other users have read access. Such anomalies are very difficult to identify within access control as there is no quick method of determining potential vulnerabilities.

The consequences of both types of vulnerabilities can be severe and can arise from both user and software actions. For example, if a user has an elevated level of permission on a network directory structure, they could unintentionally modify or remove important data. A more severe consequence would be if the user could access sensitive data, as this could result in the organisation breaching the Data Protection Act. It is also possible that software (such as viruses) executing under the user's credentials could exploit their file system permissions to perform malicious activity.

Trend mining, in the form of Association Rule Mining (ARM) (Ma, 1998), has been extensively used to identify anomalies and irregularities in many different application areas (Cheng et al., 2015). ARM is a method of automatically identifying interesting relationships amongst variables in large datasets. Interesting relationships are often those that frequently occur; however, in some applications, such as the one presented in this paper, an interesting relationship is one which occurs infrequently. There have been many successful applications of ARM for identifying interesting (both frequent and infrequent) relationships in large datasets.
For example, in finance (Yu et al., 2009; Barak and Modarres, 2015), medical data (Somaraki et al., 2011, 2015), and cellular networks (Khatib et al., 2015). The use of ARM is therefore a natural selection for the challenge of identifying irregularities in file system permissions.

The work presented in this paper is tailored towards Microsoft's New Technology File System (NTFS); however, it can be easily adapted to other file systems. The majority of multi-user environments in organisations will use Microsoft's NTFS for providing a distributed mechanism of file storage. There are many complexities associated with administrating and auditing file system permissions on Microsoft's NTFS which can result in the creation of vulnerabilities. The complexity of identifying these vulnerabilities has resulted in an unhealthy reliance on expert knowledge. The work presented in this paper helps to remove this reliance on expert knowledge and provides a method of auditing file system permissions for all NTFS users.

This paper first provides a review of related work in the area of aiding the auditing of file system permissions. The next section then provides a description of the NTFS access control structure, highlighting auditing complexities. At the end of this section, detail is provided on how file system permissions are modelled in the work presented in this paper, including the use of an algorithm to combine permissions to determine the effective permission. This then leads to a description of association rule mining techniques which are useful for identifying irregularities in file system permissions; the chosen technique is also discussed and justified. Empirical observations are then provided where the developed technique is tested on five diverse real-world file systems, and the results are compared to those of a domain expert.

2. Related work

Access control is typically defined as a relational model over the following domains: the set of subjects S (i.e. users), the set of resources O, and the set of permissions R. Access control is a characteristic function on the set A ⊆ S × O × R: a subject s is granted permission r over resource o iff ⟨s, o, r⟩ ∈ A. This model is typically called the access matrix. In many operating systems the access matrix is stored as an access list, which is associated with a resource object and is used to list all subjects and their permissions. In NTFS, access lists are implemented as Discretionary Access Control List (DACL) models, where only the owner of a resource is authorised to change its access permissions. However, in multi-user environments it is possible for any user with sufficient permission to take ownership of a resource or change its permission.

Other operating systems also implement access control lists to manage file system permissions. Unix (including Mac OS X and Linux) and Portable Operating System Interface (POSIX) compliant systems have a simple system for managing individual file permissions: the ability to assign a predetermined set of coarse-grained permissions (often called "traditional Unix permissions"). Most of these systems also support the use of Access Control Lists (ACLs), such as POSIX.1e ACLs (coarse-grained) (Nemeth, 2010) or NFSv4 ACLs (fine-grained) (Pawlowski et al., 2000). Interestingly, the NFSv4 ACL implementation is very similar to that of NTFS and also caters for fine-grained permissions.
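To make the relational model above concrete, the following minimal C# sketch (purely illustrative, not part of the tooling described later in the paper) represents A as a set of ⟨subject, resource, permission⟩ tuples with the characteristic membership check:

```csharp
using System;
using System.Collections.Generic;

// A minimal sketch of the access matrix: A is a subset of S x O x R, where
// a subject s holds permission r over resource o iff <s, o, r> is in A.
class AccessMatrix
{
    private readonly HashSet<(string Subject, string Resource, string Permission)> a =
        new HashSet<(string, string, string)>();

    public void Grant(string s, string o, string r) => a.Add((s, o, r));

    // The characteristic function: is <s, o, r> a member of A?
    public bool Check(string s, string o, string r) => a.Contains((s, o, r));
}

class Demo
{
    static void Main()
    {
        var acm = new AccessMatrix();
        acm.Grant("Dave", @"D:\", "FileReadData");
        Console.WriteLine(acm.Check("Dave", @"D:\", "FileReadData")); // True
        Console.WriteLine(acm.Check("Mike", @"D:\", "FileReadData")); // False
    }
}
```

An access list, by contrast, stores the same information grouped per resource object, which is the representation NTFS DACLs take.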
The ability to create fine-grained permissions significantly increases the complexity of the access control implementation. For this reason, and due to the strong similarities between NFSv4 and NTFS, the rest of this section describes the access control implementation of NTFS in more detail.

There are many tools available to assist with the examination of NTFS permission allocation (Microsoft, 2006b,a). However, they share a common weakness: they all still require expert knowledge to analyse their output and determine whether there are any weaknesses. Previous work in the area of file system permission analysis resulted in the development of a novel tool for permission administration that allows users to easily view file system permissions for large directories (Parkinson and Crampton, 2013; Parkinson and Hardcastle, 2014). The tool provides features to help the user identify vulnerabilities and view the information necessary to make better informed permission allocations. For example, the reduction in reported permissions by not displaying inherited permissions, and allowing the user to filter and show the effective permission for a specified user, are two prominent features.

Identifying anomalies or irregularities in system security is by no means a new topic (Lazarevic et al., 2003; Bhuyan et al., 2014). For decades, researchers have been developing techniques and tools to identify security anomalies. For example, recent works have covered the identification of anomalous user behaviour in social networks (Viswanath et al., 2014), anomalies in network traffic (Catania et al., 2012; Mahoney, 2003), anomaly detection in wireless networks (Islam and Rahman, 2011; Xie et al., 2011), and anomaly detection in power station security (Ten et al., 2011). All these studies generate good results and demonstrate the potential of using machine learning and statistics to identify anomalies. Recent research in the area of file systems includes the construction of a Bayesian network and a neural network from predetermined knowledge of the manipulation of file system artefacts for file system forensic analysis (Khan, 2012).

Research has been carried out into developing tools for identifying vulnerabilities in file system access control based on a predetermined knowledge-base of vulnerability definitions. Naldurg et al. (2006) use an inference engine alongside a snapshot of the access control implementation with static knowledge, and in further work they produce a technique where a model checking algorithm (Naldurg and KR, 2011) is used to identify vulnerabilities. Another area that has recently received significant interest in the access control community is that of access control and anomaly detection for cloud computing. For example, recent work by Hu et al. (2013) both motivates the need for identifying anomalies in web access control policies and provides a potential solution through the use of a binary decision diagram for anomaly discovery and resolution. These methods show significant ability in identifying vulnerabilities, but they require statically defining a knowledge-base of vulnerabilities with respect to relative file system permissions, which is heavily reliant on expert knowledge. The focus of the work presented in this paper is to provide a mechanism to identify likely vulnerabilities without the reliance on expert knowledge.
Literature suggests that there has been no work in the detection of irregularities in file system security without the need to statically define what constitutes an irregular permission. This includes the potential application of Association Rule Mining. This is surprising considering the vast user-base and the potential benefits that can be achieved.

3. NTFS Implementation and Modelling

In this section, a description of how NTFS permissions are implemented is provided. Alongside this description is a discussion of how the implementation can result in complexities which can ultimately result in vulnerabilities. Following this, a description is provided of how NTFS permissions are translated into a relational object-model.

3.1. Access Mask

NTFS implements DACLs by applying a DACL to each object within the file system. Each DACL will contain a Security Identifier (SID), which is a unique key that identifies the owner of the object and the primary associated group. The structure of the DACL is a sequential storage mechanism which contains access control entries (ACEs). An ACE is an element within a DACL which dictates the level of access given to the interacting subject. The ACE contains a SID that identifies the particular subject, an access mask which contains information regarding the level of permissions, and the inheritance flags. An ACE within NTFS is a set of attributes, p, from the predefined set of attributes p ⊆ P, P = {a1, ..., an}. Table 1 shows the fourteen predefined attributes in NTFS.

Attribute Number (a) | Description
1  | Full control
2  | Traverse folder \ execute files
3  | List folder \ read data
4  | Read attributes
5  | Read extended attributes
6  | Create files \ write data
7  | Create folders \ append data
8  | Write attributes
9  | Write extended attributes
10 | Delete subfolders and files
11 | Delete
12 | Read permissions
13 | Change permissions
14 | Take ownership

Table 1: Individual attributes.

NTFS provides six levels of standard coarse-grained permission that consist of a combination of predefined attributes. NTFS also allows for the creation of special fine-grained permissions that are constructed from a combination of the fourteen permission attributes (Russel et al., 2003). The potential to create a special permission is a useful feature; however, it can introduce complexity as it requires detailed knowledge regarding the authority that each attribute holds (Thomas, 2010). An example of where a special permission could be used is if an administrator wanted to assign a user or group the standard access level of Modify for all the contents of a shared directory. However, this has adverse consequences as the Modify permission will allow the user(s) to delete the folder itself. A potential solution here is to assign the user or group the default permission level of Modify, and then modify the permission's attributes so that only subfolders and files can be deleted, and not the folder itself. This modification would result in the creation of a special permission.

3.2. Propagation and Inheritance

In this section, a discussion is provided on the different mechanisms through which NTFS permissions can propagate through a directory structure. There are two types of ACE entries within the DACL: (1) explicit and (2) inherited. Explicit entries are those that are applied directly to a DACL, whereas inherited entries are propagated from the directory's parent.
The type property of the ACE allows programmatic determination of whether the permission has been explicitly assigned to a directory or inherited from a parent directory. Furthermore, it is also possible to create custom fine-grained inheritance rules. Such special inheritance rules can easily be overlooked during administration and auditing, resulting in the unintended propagation of access.

3.3. Accumulation

Permission accumulation creates the potential for an interacting subject to receive permissions from multiple different policies. Any interacting subject within NTFS can be assigned to access control groups. This means that they do not have to be explicitly added in the ACE; they could be an associated group member. Such policy combination is controlled by the Local Security Authority Subsystem Service (LSASS). The high-level functionality of this service is to create an ordered union of all relevant policies. However, there are a few complexities that can occur in this combination process: (1) explicit permissions take priority over those inherited, (2) explicit deny permissions always take priority over any other assigned permission, and (3) permissions inherited from closer relatives take priority over those further away. It would be logical to assume that deny permissions would always take precedence over allow permissions, to ensure that the user operates at the least possible level of access. However, the first point makes it possible for an inherited deny permission never to be reached. This goes against a fundamental aspect of policy combination: that a deny permission should never be ignored. If a situation were to arise where a user is able to bypass a deny permission, a system administrator would need to be aware of this so that they could take rectifying action. In addition to explicit permissions taking priority over inherited permissions, inherited permissions closer to the target directory take priority over those more distant. For example, a folder's permissions inherited from its parent will take priority over those inherited from its grandparent. To summarise, the ordered hierarchy for policy combination is:

1. Explicit deny.
2. Explicit allow.
3. Inherited deny.
4. Inherited allow.

3.4. Group Membership

A powerful feature of NTFS access control is that of group membership. An interacting subject (group, user, or process) that interacts with the file system can hold membership of other groups. This creates the possibility for permissions to be inherited from all of the associated groups. Users will often be grouped together to make management easier. This separation of duty can often be seen in many organisations; for example, users within a finance department will have different permissions to those of management. As Hanner and Hörmanseder (1999) identify, evaluating effective file permissions becomes increasingly complex with group association, because evaluating a user's effective permission requires knowledge of which groups they inherit permissions from. It should be noted that this is not a direct mechanism of how NTFS access control is implemented; rather, it results from how Microsoft allows users, groups, and processes to be managed and controlled through group memberships.
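The ACE properties described in Sections 3.2 and 3.3 are exposed programmatically. As a minimal illustration (a sketch assuming the .NET Framework System.Security.AccessControl API on Windows; this is not the extraction tool presented later in the paper), the following C# lists a directory's ACEs and orders them by the policy combination hierarchy above:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.AccessControl;
using System.Security.Principal;

class AceInspector
{
    static void Main(string[] args)
    {
        // Read the DACL of the directory given on the command line.
        DirectorySecurity acl = new DirectoryInfo(args[0]).GetAccessControl();

        // true, true: include both explicit and inherited ACEs;
        // NTAccount resolves each SID to a readable account name.
        var aces = acl.GetAccessRules(true, true, typeof(NTAccount))
                      .Cast<FileSystemAccessRule>()
                      // Rank 0..3: explicit deny, explicit allow,
                      // inherited deny, inherited allow (Section 3.3).
                      .OrderBy(ace => (ace.IsInherited ? 2 : 0) +
                                      (ace.AccessControlType == AccessControlType.Deny ? 0 : 1));

        foreach (var ace in aces)
            Console.WriteLine($"{(ace.IsInherited ? "inherited" : "explicit"),-9} " +
                              $"{ace.AccessControlType,-5} {ace.IdentityReference.Value}: " +
                              $"{ace.FileSystemRights}");
    }
}
```

Note that group membership (which groups an account belongs to) is not visible from the DACL itself and must be resolved separately, which is why Algorithm 1 below takes a membership relation as an additional input.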
Algorithm 1: Depth-first recursive permission extraction algorithm, returning an ordered list of effective permissions for each object within the directory structure.

Input: Initial directory d
Output: Set of ordered effective permissions E = {e1, e2, ..., en} where en = {d, o, p}. Here d is the directory resource, o is the object, and p is the permission level, p = {a1, a2, ..., an}
Output: Membership relation set G = {g1, g2, ..., gn} where gn = {o1, o2, ..., on}

Algorithm algo()
    proc(d)
    return E

Procedure proc(directory d)
    pACL ← ACL(d)
    foreach subdirectory c of d do
        cACL ← ACL(c)
        if cACL ≠ pACL then
            r ← memberships(G)
            foreach ACE a in cACL do
                foreach membership g in r do
                    if isExplicitDeny(a) then
                        E ← E ∪ (c, getObj(a), getPerm(a)); break
                    else if isExplicitAllow(a) then
                        E ← E ∪ (c, getObj(a), getPerm(a)); break
                    else if isInheritedDeny(a) then
                        E ← E ∪ (c, getObj(a), getPerm(a)); break
                    else if isInheritedAllow(a) then
                        E ← E ∪ (c, getObj(a), getPerm(a)); break
        proc(c)

3.5. Object-centric Modelling

In order to represent the accumulated permission that each object holds on a file system resource, it is necessary to adopt an object-centred approach to permission modelling. In this section, the description of NTFS's access control structure is translated into an object-centred model suitable for association rule mining. Each interacting object (user, group, etc.), o, holds a combination of permission attributes which are accumulated based on (1) the assigned permissions, (2) propagated and inherited permissions, and (3) group membership.

Algorithm 1 has been developed to iterate over all the directories and create a set of effective object permissions for each directory. Each effective permission for an object is the set of permission attributes, p, that o holds on a directory, d. As seen in Algorithm 1, the hierarchy for permission accumulation (Section 3.3) is taken into consideration. The algorithm also ignores permissions which do not differ from those of the parent directory. This helps to remove repetitive information and provides a useful heuristic for increasing performance and reducing the volume of captured data.

The data structure returned from executing the algorithm is a list ordered by the directory structure. This list, E = {e1, e2, ..., en}, contains an ordered list of tuples, en = {o, p, d}, where d is the directory resource object, o is the interacting object, and p is the set of permission level objects, p = {a1, a2, ..., an}. The returned list (E) contains the object model for each file system entry. For example, consider a directory where a user Dave, o = Dave, has the permission FileReadData, p = FileReadData, on a locally mounted directory, d = D:\. A diagrammatic illustration of this model is provided in Figure 1.

[Figure 1: Object-based diagram for the permission entry of D:\, Dave, FileReadData]

This object model is then stored in Comma Separated Value form for future processing. In the provided example, the following data set would be output:

Dave, FileReadData, D:\

The example provided is simplistic and only represents one individual file system permission. It is often the case that an interacting object will accumulate access permissions from multiple sources (explicit, inherited, etc.), and therefore Algorithm 1 is used to calculate each object's effective permission (i.e. the permission the user actually receives) before the creation of the object model.
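For illustration, a minimal C# sketch of serialising the effective-permission tuples into the comma separated form shown above (the EffectivePermission record and all other names here are hypothetical, not the authors' implementation):

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical record mirroring the tuple e = {o, p, d}: interacting
// object o, set of permission attributes p, and directory resource d.
public record EffectivePermission(string Obj, IReadOnlyList<string> Attrs, string Dir);

public static class ObjectModelWriter
{
    // Writes one CSV row per (object, attribute, directory) triple,
    // e.g. "Dave,FileReadData,D:\".
    public static void Write(IEnumerable<EffectivePermission> model, string path)
    {
        using var w = new StreamWriter(path);
        foreach (var e in model)
            foreach (var a in e.Attrs)
                w.WriteLine($"{e.Obj},{a},{e.Dir}");
    }
}
```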
The output object model (stored in text form) is then a complete representation of each interacting object's effective permission on the specified directory structure.

4. Association Rule Mining

In this section, the area of Association Rule Mining (ARM) is discussed, and details of how it is used for detecting irregularities in file system permissions are provided. ARM is a method of discovering associations within data that frequently occur; for example, from the perspective of file system permission analysis, determining that a user (e.g. Dave) frequently has the same level of permission (e.g. Read) on a wide array of directories.

In ARM, I is the set of binary items, and D is the set of data entries. In applying ARM to file system permissions, I is the total object set (users, permissions, directories) and D (previously defined as E) is the total set of effective permissions, D = {e1, e2, ..., en}. Each permission entry en is a subset of I. For example, a potential item set could be I = {Dave, Mike, FileRead, FileWrite, D:\} and the following are example data entries:

E1 = {Dave, FileRead, D:\}
E2 = {Mike, FileRead, D:\}
E3 = {Mike, FileWrite, D:\}

Association rule mining is a way to discover interesting relationships in datasets, or in the case of this paper, infrequent relationships. The following definitions are used in association rule mining:

Definition 1 (Association rule). X ⇒ Y is an Association Rule (AR), where X and Y are item sets, X is the LHS (left-hand side) or body of the rule, and Y is the RHS (right-hand side) or head of the rule. Given a set of data entries, an AR can be informally defined as a rule X ⇒ Y, where X, Y ⊂ I. In the continuing example, a file system specific association rule would be {FileRead} ⇒ {D:\}.

Definition 2 (Support of an AR). The support of an AR is defined as the percentage of permission entries that contain both X and Y, i.e. supp(X ⇒ Y) = supp(X ∪ Y), the fraction of entries containing all the items of X and Y. In the example, the item set {FileRead, D:\} has support of 0.67 since it occurs in two of the three permission entries.

Definition 3 (Confidence of an AR). The confidence of an AR is defined as the ratio between the number of entries that contain X ∪ Y and the number of entries that contain X, i.e. conf(X ⇒ Y) = supp(X ∪ Y)/supp(X). In the example, the confidence would equal 0.67/0.67 = 1, meaning that 100% of the permission entries that contain FileRead also contain D:\.

ARM procedures contain two stages: (i) the identification of frequent item sets, and (ii) the generation of ARs. Piatetsky-Shapiro (1991) defines ARM as a method for the description, analysis and presentation of ARs which are discovered in databases using different measures to determine interesting data. ARM is concerned with the discovery, in tabular databases, of rules that satisfy defined threshold requirements. Of these requirements, the most fundamental is concerned with the support (frequency) of the item sets used to make up the ARs: a rule is applicable only if the relationship occurs sufficiently often in the data. Whether an item set is relevant or not is determined by its support count and confidence value. An item set is deemed to be relevant if its support count is above a user-specified support threshold. Similarly, an AR is deemed relevant if the confidence value is above a user-specified confidence threshold (Singer and Willett, 2003).
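As a minimal worked illustration of Definitions 2 and 3 over the toy entries E1 to E3 (a sketch only; this is not the mining implementation used in the paper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SupportConfidence
{
    // Support of an item set: the fraction of permission entries containing it.
    static double Supp(List<HashSet<string>> db, params string[] items) =>
        db.Count(e => items.All(e.Contains)) / (double)db.Count;

    static void Main()
    {
        // The three permission entries E1..E3 from the running example.
        var db = new List<HashSet<string>>
        {
            new HashSet<string> { "Dave", "FileRead",  @"D:\" },
            new HashSet<string> { "Mike", "FileRead",  @"D:\" },
            new HashSet<string> { "Mike", "FileWrite", @"D:\" },
        };

        // Rule {FileRead} => {D:\}.
        double suppXY = Supp(db, "FileRead", @"D:\");  // 2/3 = 0.67
        double conf   = suppXY / Supp(db, "FileRead"); // 0.67/0.67 = 1.00
        Console.WriteLine($"supp = {suppXY:F2}, conf = {conf:F2}");
    }
}
```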
Although minimum support and confidence thresholds help remove the generation of uninteresting rules, many of the remaining rules are still not interesting to users. This work focuses on irregularities, which in the context of Association Rule Mining can be assumed to be non-frequent sets of items. Therefore, in this work an interesting rule (i.e. an irregular file system permission) is one with both low support and confidence. Frequent data set mining has been thoroughly studied by many researchers, but important rare items are often not discovered by these algorithms. In many cases, contradictions or exceptions also offer useful associations, and researchers have recently started to focus on the discovery of such associations, called rare associations. Rare data sets can be obtained by setting a low support threshold; however, this generates a huge number of rules. Therefore, the key idea is to invert the support threshold criterion by using a maximum support threshold, while also using the confidence as a second criterion to deal with the large number of rare relations between items. In the technique presented in this paper, the selection of the minimum support threshold value is based on the frequency of each item in the data set, i.e. the number of times an item appears, expressed as a percentage, which then serves as the threshold value. The confidence threshold is selected empirically, and its value is usually set equal to the minimum support threshold.

SOMA (Somaraki et al., 2010) is a trend-mining framework for knowledge discovery from large databases, developed alongside a validation framework for trend mining. SOMA was originally developed for the application of trend mining to medical data; however, SOMA itself is not domain dependent and can therefore be used with any type of data in different application areas. The framework comprises three steps: pre-processing, association rule mining, and trend mining; however, given the nature of the data and the aim of this investigation, the pre-processing and trend mining stages are not required. The pre-processing stage is superfluous in this application as the data is complete and discrete and there are no temporal requirements (in SOMA's original application to medical data, logic was utilised to aid with merging and time-stamping data subsets for analysis), and trend mining is likewise not needed. For this application the most important phase is association rule mining, where rules are extracted by considering the set of variables extracted from the data.

[Figure 2: Schematic illustrating the integration of both the NTFS permissions extraction software (Stage 1, NTFS processing: input an NTFS directory, output effective permissions) and SOMA (Stage 2: output infrequent permissions) for association rule mining.]

In this paper, the Matrix algorithm (Yuan and Huang, 2005), which is part of the SOMA framework, was used for association rule mining. The algorithm is more efficient than the well-known Apriori algorithm (Tsai and Chen, 2004) as it only requires one pass over the data. It is called a matrix algorithm as it creates a binary matrix with entries (0, 1) from passing over the database, resulting in the creation of a set of candidate items from which association rules are produced.
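The single-pass matrix construction can be sketched as follows (illustrative only; the actual Matrix algorithm, including candidate generation and rule production, is detailed in Yuan and Huang (2005)):

```csharp
using System;
using System.Linq;

class MatrixPass
{
    static void Main()
    {
        // Item set I and the database of permission entries (Section 4).
        string[] items = { "Dave", "Mike", "FileRead", "FileWrite", @"D:\" };
        string[][] db =
        {
            new[] { "Dave", "FileRead",  @"D:\" },
            new[] { "Mike", "FileRead",  @"D:\" },
            new[] { "Mike", "FileWrite", @"D:\" },
        };

        // One pass over the database builds the binary matrix:
        // one row per permission entry, one column per item.
        var m = new int[db.Length][];
        for (int r = 0; r < db.Length; r++)
            m[r] = items.Select(item => db[r].Contains(item) ? 1 : 0).ToArray();

        // Column sums give item supports, from which candidate item
        // sets (and then association rules) are derived.
        for (int c = 0; c < items.Length; c++)
        {
            double supp = m.Sum(row => row[c]) / (double)db.Length;
            Console.WriteLine($"{items[c]}: supp = {supp:F2}");
        }
    }
}
```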
In the work presented in this paper, the file system permission data is passed to the Matrix algorithm in tabular form for processing, and the output is a set of association rules generated based on the support threshold and confidence value. The output is then processed to identify association rules for infrequent data. Full details of the Matrix algorithm can be found in Yuan and Huang (2005).

There is an absence of literature detailing the use of ARM to detect anomalies in file systems; however, closely related research on the identification of anomalies and irregular data sets using ARM strongly motivates its use (Chandola et al., 2009). For example, Li et al. (2009) present the successful use of ARM to identify irregular behaviour in local area network traffic. In addition, intrusion detection is another area of research which has seen successful applications of ARM (Patcha and Park, 2007).

5. Produced Software

Software has been developed for the extraction, processing and association rule mining of file system permissions. For the extraction and accumulation of file system permissions (stage one), software was developed in C# using the .NET System.Management namespace. The developed application is a command line application which can be executed on any Windows NT operating system. The application runs under the user's credentials, and therefore can only analyse directories on which the user has authority to read security permissions. The output from the tool contains the ACL information for each directory in comma-separated value form, which is subsequently used for association rule mining. In the second stage, the SOMA framework was developed in MATLAB 2014a. In the experiments presented in this paper, both stages were executed on an Intel i7 processor with a clock speed of 2.2 GHz. Figure 2 provides an overview illustration of the integration of the two stages. The integration of these two stages allows the user to extract NTFS permissions and then identify irregularities through trend mining.

6. Empirical Observations

This section contains two types of empirical observations made using the tools developed in Section 5. The first is based on synthetically generated file system data, and the second is from analysing five real-world, multi-user directory structures of different configurations. The aim of these experiments is to determine the effectiveness of the proposed technique at identifying irregular and anomalous permissions within the directory structures. In both tests, to assess the performance, the following measures are considered:

(1) True Positive Rate (tpr): the fraction of irregular permissions correctly identified as irregular;
(2) False Positive Rate (fpr = 1 − tnr): the fraction of regular permissions incorrectly identified as irregular;
(3) True Negative Rate (tnr): the fraction of regular permissions correctly identified as regular;
(4) False Negative Rate (fnr = 1 − tpr): the fraction of irregular permissions incorrectly classified as regular.

Finally, the accuracy is reported as the fraction of all samples correctly identified.
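For clarity, the following C# sketch computes these measures from confusion-matrix counts (the counts passed in Main are hypothetical, purely to show the arithmetic, and are not results from the experiments):

```csharp
using System;

class Rates
{
    // tp: irregular correctly flagged;  fn: irregular missed;
    // tn: regular correctly passed;     fp: regular incorrectly flagged.
    static void Report(int tp, int fn, int tn, int fp)
    {
        double tpr = tp / (double)(tp + fn);
        double fnr = fn / (double)(tp + fn); // = 1 - tpr
        double tnr = tn / (double)(tn + fp);
        double fpr = fp / (double)(tn + fp); // = 1 - tnr
        double acc = (tp + tn) / (double)(tp + fn + tn + fp);
        Console.WriteLine($"tpr={tpr:F2} fpr={fpr:F2} tnr={tnr:F2} fnr={fnr:F2} acc={acc:F2}");
    }

    static void Main() => Report(tp: 5, fn: 1, tn: 200, fp: 40); // hypothetical counts
}
```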
Algorithm 2: Algorithm for generating synthetic directory structures

Input: The maximum number of directories to be created, MaxDir
Input: The step size of the increasing number of directories, StepDir
Input: The maximum number of anomalies to be created, MaxAnom
Input: The step size of increasing the number of anomalies, StepAnom
Output: A set of directories, S = {s1, s2, ..., sn}, where sn = {d, p}, d is the directory resource, and p is the permission level

Algorithm algo()
    n ← StepAnom
    while n ≤ MaxAnom do
        i ← StepDir
        while i ≤ MaxDir do
            S ← S ∪ createDirectoryStructure(i)
            for j ← 1 to n do
                dirNo ← genRandomInt(0, i)
                pLevel ← genRandomInt(1, 14)
                S[dirNo] ← pLevel
            i ← i + StepDir
        n ← n + StepAnom
    return S

6.1. Synthetic Data Sets

The motivation behind using synthetically generated test data is that, although acquiring security information from real-world directory structures is easily achievable, it is difficult to acquire ground-truth knowledge regarding the irregularities within those systems. This is because the identification of irregular permissions ultimately comes down to the auditor's knowledge, and although evaluating the system against such data is useful, it potentially limits the quality of assessment as both correct and incorrect permissions might have been incorrectly identified by the expert.

In order to generate synthetic file system data, a technique has been created to automatically generate directory structures, assign permissions, and insert irregular permissions (anomalies). From the created directory structures, it is then possible to use the proposed NTFS processing software to extract access control information, and then use SOMA to identify irregularities. Algorithm 2 describes the process used to generate synthetic directory structures. The first aspect of the algorithm is to create the directory structures by providing a maximum number of directories and a step size. For example, providing a maximum directory size (MaxDir) of 50 and a step size (StepDir) of 10 would result in the generation of five directory structures (10, 20, 30, 40, and 50 directories). In the creation of directories, the security permissions of their parent and the system defaults will automatically be inherited. In addition, the same directory structure is created multiple times, from 1 until MaxAnom anomalies in steps of StepAnom. For example, if MaxAnom = 20 and StepAnom = 5, the same directory structure would be created four times, with 5, 10, 15, and 20 anomalies.

Creating the anomalies for each directory structure is performed by generating a random number between 0 and the current directory size (i). This random number (dirNo) identifies the directory to receive an anomaly. The permission of the anomaly is determined by generating a random number between 1 and 14 (Table 1), pLevel. This number represents the number of permission attributes set; for example, if the generated number is 2, the first two permission attributes are set. In the synthetic experiments presented in this paper, MaxDir = 1,000, StepDir = 100, MaxAnom = 64, and on each iteration StepAnom is doubled (StepAnom = StepAnom × 2), i.e. the anomalies increase in the sequence of 1, 2, 4, 8, 16, 32, 64.

User Group     | Directory    | Attribute | Support | Confidence
Administrators | 100\100 16   | Delete    | 0.05    | 0.27
Anomalyuser    | 100\100 16\0 | Delete    | 0.001   | 0.01

Table 2: Example association rules generated when analysing a synthetically produced file system that has 100 directories and 16 anomalies.
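A minimal C# sketch of the anomaly-injection step of Algorithm 2 (illustrative only; the directory structure is reduced to a boolean attribute matrix, and all names here are hypothetical):

```csharp
using System;

class SyntheticAnomalies
{
    static readonly Random Rng = new Random();

    // Inject one anomaly: pick a random directory (dirNo) and set the
    // first pLevel permission attributes from Table 1 for the anomaly user.
    static void InjectAnomaly(bool[][] dirs)
    {
        int dirNo  = Rng.Next(0, dirs.Length); // random directory index
        int pLevel = Rng.Next(1, 15);          // 1..14 attributes (Table 1)
        for (int a = 0; a < pLevel; a++)
            dirs[dirNo][a] = true;
    }

    static void Main()
    {
        // 100 directories x 14 permission attributes, initially all unset.
        var dirs = new bool[100][];
        for (int i = 0; i < dirs.Length; i++) dirs[i] = new bool[14];

        for (int n = 0; n < 16; n++) InjectAnomaly(dirs); // e.g. 16 anomalies
    }
}
```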
Doubling the number of anomalies in each iteration allows for a comprehensive understanding of the system's ability to recognise anomalies as their frequency increases. The random approach to assigning permissions, as well as the permission level, helps to simulate a file system that is becoming increasingly complex and helps to replicate the process of 'ad-hoc' permission allocation. This results in the production of 70 different directory structures to test the proposed technique.

Figure 6 illustrates the average execution time for analysing each directory size. The difference in execution time for directory structures of the same size but with a different number of irregular permissions is less than 1 second. From Figure 6 it is noticeable that execution time increases by around 50 seconds per additional 100 directories until the directory size reaches 800, after which the time increases by around 100 seconds for each addition of 100 directories. It is also noticeable that the association rule mining stage makes up a larger portion of the duration than the permission extraction stage.

The results from all 10 directory structures, and the seven different anomaly counts, are presented in three graphs illustrating the Receiver Operating Characteristic (ROC) space (Hanley and McNeil, 1982). Figure 3 illustrates the tpr and the fpr for the first four directory structures, with directory sizes of 100, 200, 300, and 400, respectively. Figure 4 illustrates the results for the directory structures with 500, 600 and 700 directories. Finally, Figure 5 illustrates the results for the directory sizes of 800, 900, and 1000. These results are interesting as they demonstrate both an improvement in tpr and a decrease in fpr as the directory size increases. The reason behind this is that the proportion of anomalies in the directory structure containing 100 directories is greater than that of the directory structure with 200 directories. For example, the directory structure with 100 directories containing 32 anomalies consists of a total of 1008 permissions, whereas the directory structure containing 200 directories and 32 anomalies consists of 1936 total permissions. This is interesting as it demonstrates that the tool's ability improves with an increasing number of directories and a decreasing proportion of irregular permissions. In the first instance this might appear to be a negative characteristic of the system; however, as real-world directory structures will often be well in excess of 200 directories, with a low proportion of anomalies, the tool's performance would be suitable for application. It is, however, worth raising the point that the accuracy would reduce in directory structures where permissions have been managed in an ad-hoc manner for a prolonged period of time. However, considering the fact that a large portion of permissions will always be system generated, the technique should still be able to identify anomalies, even if accuracy is reduced.

The verbose results from all 70 experiments are omitted from the paper due to space constraints; however, the results and data sets are available from the authors upon request. In addition, from the graphs it is not possible to identify the number of irregular permissions each directory structure has; however, from analysing the results it can be stated that the directory structure with the fewest regular permissions has the highest tpr and the lowest tnr across all directory sizes.
Another observation that is consistent for all directory sizes is that the tpr increases and the tnr decreases incrementally with an increasing number of irregular permissions.

Table 2 shows two example association rules, demonstrating both a regular and an irregular permission extracted from a synthetically generated directory structure with 100 directories and 16 anomalies. The first rule is for a frequently occurring permission, and the second rule is for an irregular permission allocation. The first rule (X ⇒ Y) demonstrates that the Administrators group implies Delete, X = {Administrators} ⇒ Y = {Delete}. This rule is frequently occurring because the percentage (support) of all entries that contain both X and Y is 5%. This is high considering that there are in excess of five other users and groups within the system, and fourteen different attributes used in total. The confidence value is also quite high (27%), which indicates that there is a strong relationship between Administrators and Delete. The second rule demonstrates an irregular permission between the Anomalyuser and the Delete permission. Here the support is less than 0.1%, indicating that a low number of permission entries contain both X and Y. In addition, the confidence value is also low at 1%, indicating that a low number of the total Delete permissions are related to Anomalyuser. From this example, it is easy to determine how the second permission has been identified as an anomaly.

In the results presented in this section, the average accuracy rate improves incrementally from 0.85 for the directory structure with 100 directories to 0.95 for the directory structure with 1000 directories. This further supports the observation that the system performs best when the number of anomalies represents 1% or less of the total number of permissions (100 irregular and 9136 regular). However, even when the number of irregular permissions is around 10%, the accuracy rate is around 85%. This demonstrates the suitability of the technique for identifying file system permission irregularity, an ability which increases as the percentage of irregular permissions decreases. This runs counter to the main aim of this work, which is to eliminate the requirement for expert knowledge when auditing file system permissions: as the complexity of a file system increases through more ad-hoc permissions, the tool's ability decreases and the requirement for expert knowledge increases. However, the tool does make considerable progress towards removing the requirement for an exhaustive manual audit, and in doing so reduces the need for expert knowledge.

6.2. Real-world Data Sets

In this section the proposed system is validated against five real-world file systems to identify irregular permissions. As this test is not synthetic, the ground truth is acquired through expert knowledge. The difficulty of acquiring expert knowledge alongside access to real-world directory structures limits the number of available directory structures for analysis. Table 3 shows details of the five previously unseen NTFS directory structures that are used to establish the developed technique's performance.

Directory Number | Directories | Permission levels | Permission Entries
1 | 36  | 3 | 248
2 | 108 | 3 | 1142
3 | 427 | 2 | 1708
4 | 453 | 4 | 3184
5 | 407 | 4 | 3719

Table 3: Test NTFS directory structure specifics. The numbers only consider directories with permissions different from their parent, i.e. not inherited.
The diversity of different organisations' permissions varies widely; some have good access control which is maintained through adhering to a rigorous structure, whereas some have inconsistent access control resulting from ad-hoc fixes. The number of directories presented in Table 3 is significantly lower than the total number of directories in each file system. This is because the algorithm only considers directories which have permissions different from their parent. For example, directory structure 1 in Table 3 actually contains 2856 directories, of which only 55 are unique, i.e. the permissions of these 55 directories are inherited by all their sub-directories. For each test case, the ground truth (i.e. the correct identification of irregular permissions) is acquired through reports produced by an independent security auditing organisation.

Figure 7 illustrates the computation times for each of the directory structures presented in Table 3. From this graph, it is noticeable that the execution time for both stages of the algorithm increases as the size and complexity of the directory structure increase. However, the results demonstrate that the duration of the extraction and processing stage is more sensitive to an increase in both directory size and the number of assigned permissions, whereas stage two (association rule mining) is less affected by an increasing number of allocated permissions. For example, in the results it can be seen that the execution time of both stages increases rapidly up to directory structure 4. Interestingly, there is a large difference between the number of permission entries for directory structures 3 and 4 (1708 to 3184), which results in a large increase in computation time for stage one; however, the computation time for stage two only changes by a small amount (less than one second). This demonstrates the good scalability of SOMA (stage two) for analysing file system permissions.

Directory Number | Irregular | Regular | True Positive (tpr) | False Positive (fpr) | True Negative (tnr) | False Negative (fnr) | Accuracy
1 | 6   | 248  | 5 (0.83)   | 1 (0.17)  | 203 (0.82)  | 45 (0.18)  | 0.82
2 | 62  | 1142 | 50 (0.81)  | 12 (0.19) | 977 (0.86)  | 165 (0.14) | 0.85
3 | 24  | 1708 | 20 (0.83)  | 4 (0.17)  | 1538 (0.90) | 170 (0.10) | 0.90
4 | 14  | 3184 | 13 (0.93)  | 1 (0.07)  | 3037 (0.95) | 147 (0.04) | 0.97
5 | 144 | 3719 | 143 (0.97) | 4 (0.03)  | 3619 (0.97) | 100 (0.03) | 0.99

Table 4: Accuracy results from the empirical analysis. See paragraph 1 in Section 6 for a description of the measures.

Following the analysis of performance, it is important to evaluate the ability to correctly identify both irregular and regular permissions. This is performed by comparing the results of processing permissions with SOMA against the results of expert analysis. Figure 8 illustrates the Receiver Operating Characteristic (ROC) space (Hanley and McNeil, 1982) for all five directory structures. The ROC graph demonstrates that although accuracy is lower on the smaller directory structures (directory structures 1 and 2), it improves as both the directory size and complexity increase. For example, directory structure 1 has a tpr of 0.83, an fpr of 0.17, and an accuracy of 0.82, whereas directory structure 5 has a tpr of 0.97, an fpr of 0.03, and an accuracy of 0.99 (full results can be seen in Table 4). Although directory structure 1 demonstrates a reasonable performance, where over 80% of classifications are correct, it would still require a reliance on expert interpretation to improve confidence.
Directory structures two and three are slightly better, where the accuracy rate has increased to 0.85 and 0.90, respectively. It is interesting to discover that directory structures four and five have improved further, with accuracy levels of 0.97 and 0.99, respectively. This is a significant finding as it suggests that the accuracy of the proposed technique improves as both the directory size and complexity increase. This would suggest that directory structures one, two and three are not big enough to allow the association rule mining technique to correctly distinguish between what is normal and what is irregular.

The empirical results presented above demonstrate the potential of the proposed technique to programmatically identify irregular permissions with a minimal reliance on expert knowledge. These results have demonstrated that the proposed technique is sensitive to directory size, and in general performs better on larger directory structures with diverse permissions.

7. Conclusion

This paper presents a technique to identify irregular file system permissions using Association Rule Mining. The research focuses on the vulnerabilities of Microsoft's New Technology File System (NTFS). Motivation for applying the presented technique to NTFS is justified through a detailed discussion of the complexities associated with NTFS permission administration. A detailed discussion is then provided where the structure of NTFS permissions is discussed and modelled in a way suitable for developing algorithms to autonomously identify irregular file system permissions.

The developed technique is a two-stage algorithm. The first stage is a permission extraction algorithm for acquiring all relevant information for analysis. Empirical analysis has demonstrated that this stage is sensitive to both the size of the directory structure and the complexity of the assigned permissions. However, given that analysis is performed off-line and that permissions often do not change quickly, the time-scale presented (a maximum of just over 2 minutes for the largest directory) is not detrimental. The second stage uses association rule mining to identify permissions which are irregular and can potentially result in vulnerabilities. Using association rule mining results in the production of a final rule set containing those rules which have low support and confidence. This stage of the produced system scales well and has an accuracy rate which increases with both directory size and complexity.

Empirical observations were then performed, which included executing the produced technique against 70 synthetically generated and five real-world, multi-user directory structures. The synthetically generated directory structures created the possibility to rigorously evaluate the system's performance and understand its ability to identify irregular permissions which have been created in an 'ad-hoc' manner. The results from the synthetic directory structures demonstrated that the presented technique has a 90% average accuracy. The main finding from these experiments was that the correct identification of irregular permissions increases, and the incorrect identification of regular permissions decreases, as the directory size increases and the number of anomalies stays the same or decreases. It was identified that 95% accuracy can be achieved when the percentage of irregular permissions is equal to or less than 1% of the total permissions.
The real-world directory structures were different in terms of directory size, the number of assigned permissions, and the number of known irregular permissions. The irregular permissions were predetermined through the use of expert knowledge, which was used as the ground truth for the empirical analysis. The results are promising, with an average accuracy rate of 91%. It is interesting to discover that the technique has better accuracy on larger and more complex directory structures. Although an accuracy rate in excess of 90% has been achieved, it still does not completely remove the requirement for expert knowledge. However, it significantly reduces the reliance on expert knowledge, as well as the time-consuming process of having to perform an exhaustive manual security audit of a file system. The contribution of this paper to the anomaly detection and file system auditing communities is significant and strongly motivates further research. All the data sets and software presented in this paper are available from the corresponding author upon request. Future research is to further develop the techniques to include different mechanisms of file system access control, as well as identifying suitable mechanisms for further improving accuracy.

8. References

Barak, S., Modarres, M., 2015. Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications 42 (3), 1325–1339.

Beznosov, K., Inglesant, P., Lobo, J., Reeder, R., Zurko, M. E., 2009. Usability meets access control: challenges and research opportunities. In: Proceedings of the 14th ACM Symposium on Access Control Models and Technologies. SACMAT '09. ACM, New York, NY, USA, pp. 73–74. URL http://doi.acm.org/10.1145/1542207.1542220

Bhuyan, M., Bhattacharyya, D., Kalita, J., 2014. Network anomaly detection: Methods, systems and tools. Communications Surveys Tutorials, IEEE 16 (1), 303–336.

Cao, X., Iverson, L., 2006. Intentional access management: making access control usable for end-users. In: Proceedings of the Second Symposium on Usable Privacy and Security. SOUPS '06. ACM, New York, NY, USA, pp. 20–31. URL http://doi.acm.org/10.1145/1143120.1143124

Catania, C. A., Bromberg, F., Garino, C. G., 2012. An autonomous labeling approach to support vector machines algorithms for network traffic anomaly detection. Expert Systems with Applications 39 (2), 1822–1829.

Chandola, V., Banerjee, A., Kumar, V., 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41 (3), 15.

Cheng, Q., Lu, X., Liu, Z., Huang, J., 2015. Mining research trends with anomaly detection models: the case of social computing research. Scientometrics 103 (2), 453–469.

De Capitani di Vimercati, S., Paraboschi, S., Samarati, P., 2003. Access control: principles and solutions. Software: Practice and Experience 33 (5), 397–421. URL http://dx.doi.org/10.1002/spe.513

Hanley, J. A., McNeil, B. J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), 29–36.

Hanner, K., Hörmanseder, R., 1999. Managing Windows NT file system permissions: a security tool to master the complexity of Microsoft Windows NT file system permissions. Journal of Network and Computer Applications 22 (2), 119–131.

Hu, H., Ahn, G.-J., Kulkarni, K., 2013. Discovery and resolution of anomalies in web access control policies. Dependable and Secure Computing, IEEE Transactions on 10 (6), 341–354.

Islam, M. S., Rahman, S. A., 2011.
Anomaly intrusion detection system in wireless sensor networks: security threats and existing approaches. International Journal of Advanced Science and Technology 36 (1).

Khan, M. N. A., 2012. Performance analysis of Bayesian networks and neural networks in classification of file system activities. Computers & Security 31 (4), 391–401.

Khatib, E. J., Barco, R., Gómez-Andrades, A., Muñoz, P., Serrano, I., 2015. Data mining for fuzzy diagnosis systems in LTE networks. Expert Systems with Applications 42 (21), 7549–7559.

Lazarevic, A., Ertöz, L., Kumar, V., Ozgur, A., Srivastava, J., 2003. A comparative study of anomaly detection schemes in network intrusion detection. In: SDM. SIAM, pp. 25–36.

Li, X., Zhang, Y., Li, X., 2009. Local area network anomaly detection using association rules mining. In: Wireless Communications, Networking and Mobile Computing, 2009. WiCom '09. 5th International Conference on. pp. 1–5.

Ma, B. L. W. H. Y., 1998. Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.

Mahoney, M. V., 2003. Network traffic anomaly detection based on packet bytes. In: Proceedings of the 2003 ACM Symposium on Applied Computing. SAC '03. ACM, New York, NY, USA, pp. 346–350.

Microsoft, 2006a. AccessEnum v1.32. URL http://technet.microsoft.com/en-us/sysinternals/bb897332

Microsoft, 2006b. How to use Xcalcs.vbs to modify NTFS permissions. URL http://support.microsoft.com/kb/825751

Naldurg, P., KR, R., 2011. SEAL: a logic programming framework for specifying and verifying access control models. In: Proceedings of the 16th ACM Symposium on Access Control Models and Technologies. ACM, pp. 83–92.

Naldurg, P., Schwoon, S., Rajamani, S., Lambert, J., 2006. Netra: seeing through access control. In: Proceedings of the Fourth ACM Workshop on Formal Methods in Security. ACM, pp. 55–66.

Nemeth, E., 2010. UNIX and Linux System Administration Handbook. Pearson Education.

Parkinson, S., Crampton, A., 2013. A novel software tool for analysing NT file system permissions. International Journal of Advanced Computer Science and Applications 4 (6), 266–272.

Parkinson, S., Hardcastle, D., 2014. Automated planning for file system interaction. In: Proceedings of the 32nd Workshop of the UK Planning and Scheduling Special Interest Group (PlanSIG2014).

Patcha, A., Park, J.-M., 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51 (12), 3448–3470. URL http://www.sciencedirect.com/science/article/pii/S138912860700062X

Pawlowski, B., Shepler, S., Beame, C., Callaghan, B., Eisler, M., Noveck, D., Robinson, D., Thurlow, R., 2000. The NFS version 4 protocol. In: Proceedings of the 2nd International System Administration and Networking Conference (SANE 2000). Vol. 2. p. 50.

Piatetsky-Shapiro, G., 1991. Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, 229–238.

Russel, C., Crawford, S., Gerend, J., 2003. Microsoft Windows Server 2003 Administrator's Companion. Microsoft Press.

Singer, J. D., Willett, J. B., 2003. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford University Press.

Solomon, D. A., 2005. Microsoft Windows Internals: Microsoft Windows Server 2003, Windows XP, and Windows 2000.

Somaraki, V., Broadbent, D., Coenen, F., Harding, S., 2010. Finding temporal patterns in noisy longitudinal data: a study in diabetic retinopathy. Advances in Data Mining.
Applications and Theoretical Aspects, 418–431.

Somaraki, V., Harding, S., Broadbent, D., Coenen, F., 2011. SOMA: A proposed framework for trend mining in large UK diabetic retinopathy temporal databases. Research and Development in Intelligent Systems XXVII, 285–290.

Somaraki, V., Vallati, M., McCluskey, L., 2015. Discovering interesting trends in real medical data: A study in diabetic retinopathy. In: Proceedings of the 17th Portuguese Conference on Artificial Intelligence, EPIA 2015.

Ten, C.-W., Hong, J., Liu, C.-C., 2011. Anomaly detection for cybersecurity of the substations. Smart Grid, IEEE Transactions on 2 (4), 865–873.

Thomas, O., 2010. Are NTFS and share permissions a bit too complicated? Windows IT Pro.

Tsai, P. S., Chen, C.-M., 2004. Mining interesting association rules from customer databases and transaction databases. Information Systems 29 (8), 685–696.

Viswanath, B., Bashir, M. A., Crovella, M., Guha, S., Gummadi, K. P., Krishnamurthy, B., Mislove, A., 2014. Towards detecting anomalous user behavior in online social networks. In: Proceedings of the 23rd USENIX Security Symposium (USENIX Security).

Xie, M., Han, S., Tian, B., Parvin, S., 2011. Anomaly detection in wireless sensor networks: A survey. Journal of Network and Computer Applications 34 (4), 1302–1325, Advanced Topics in Cloud Computing.

Yu, L., Chen, H., Wang, S., Lai, K. K., 2009. Evolving least squares support vector machines for stock market trend mining. Evolutionary Computation, IEEE Transactions on 13 (1), 87–102.

Yuan, Y., Huang, T., 2005. A matrix algorithm for mining association rules. In: Advances in Intelligent Computing. pp. 370–379.

[Figure 3: ROC space demonstrating the fpr and tpr of identifying anomalies in the synthetic directory structures, where d is the number of directories (d = 100, 200, 300, 400).]

[Figure 4: ROC space demonstrating the fpr and tpr of identifying anomalies in the synthetic directory structures, where d is the number of directories (d = 500, 600, 700).]

[Figure 5: ROC space demonstrating the fpr and tpr of identifying anomalies in the synthetic directory structures, where d is the number of directories (d = 800, 900, 1000).]

[Figure 6: Computation time in seconds for (1) extracting and processing NTFS permissions (NTFS-r), followed by (2) the execution of the trend mining (SOMA) algorithm to identify irregular permissions, for each synthetic data set size.]

[Figure 7: Computation time in seconds for (1) extracting and processing NTFS permissions (NTFS-r), followed by (2) the execution of the trend mining (SOMA) algorithm to identify irregular permissions, for each real-world directory structure.]

[Figure 8: ROC curve for the analysis of the test file systems.]