key: cord-0057842-fxn8n4yv
authors: Derouiche, Abir; Layeb, Abdesslem; Habbas, Zineb
title: Mining Interesting Association Rules with a Modified Genetic Algorithm
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_20
sha: 2840a5475ce9eea07323827d536f3aa0fbf54c7f
doc_id: 57842
cord_uid: fxn8n4yv

Association Rules Mining is an important data mining task that has many applications. It is commonly treated as an optimization problem, and several metaheuristics have been developed to solve it because they have been shown to be faster than exact algorithms. However, most of them generate a lot of redundant rules. In this work, we propose a modified genetic algorithm for mining interesting non-redundant association rules. Different experiments have been carried out on several well-known benchmarks. Moreover, the algorithm was compared with those of other published works, and the results show the efficiency of our proposal.

Association Rules Mining (ARM) has become one of the main topics in data mining. It attracts a lot of attention because of its wide applicability in different areas such as web mining [24], document analysis [17], telecommunication alarm diagnosis [19], network intrusion detection [5], and bioinformatics applications [27]. Most classical association rules mining algorithms, such as the Apriori algorithm [2] and the FP-growth algorithm [13], work in two phases: the first phase generates the frequent itemsets and the second phase generates the rules. Generating the frequent itemsets might be considered a straightforward task. However, it is time-consuming when the number of items is large. For example, a dataset comprising n items contains $2^n - 1$ different itemsets, whereas the number of itemsets of size k is equal to $\binom{n}{k}$ for any $k \le n$.
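As a quick check of these counts, the following worked instance uses n = 4, a value chosen here purely for illustration. It also shows what happens when each itemset is weighted by its size k, using the standard identity $k\binom{n}{k} = n\binom{n-1}{k-1}$.

```latex
% Number of non-empty itemsets over n items
\sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1,
\qquad
n = 4:\quad 4 + 6 + 4 + 1 = 15 = 2^{4} - 1 .

% Same sum with each itemset weighted by its size k
\sum_{k=1}^{n} k \binom{n}{k}
  = n \sum_{k=1}^{n} \binom{n-1}{k-1}
  = n \, 2^{n-1},
\qquad
n = 4:\quad 1\cdot 4 + 2\cdot 6 + 3\cdot 4 + 4\cdot 1 = 32 = 4 \cdot 2^{3} .
```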
Thus, given that the amount of computation needed for each candidate of size k is $O(k)$, the overall complexity is $O\left(\sum_{k=1}^{n} k \binom{n}{k}\right) = O(2^{n-1} \times n)$. The complexity of finding itemsets is therefore exponential, and it is even higher when the frequency of each itemset is calculated [10]. For any dataset comprising n items and m different transactions, the complexity of computing the frequencies of all the itemsets within the dataset is $O(2^{n-1} \times n \times m)$ [28].

On the other hand, metaheuristics are increasingly considered a more promising alternative. They have proven beneficial because they generate association rules directly, skipping the frequent itemsets generation phase, by maximizing the support and the confidence of the rules. The quality of an association rule is not limited to its support and confidence: many other metrics are available to measure it, such as coverage, comprehensibility, leverage, interestingness, lift and conviction. Therefore, the ARM problem can also be considered a multi-objective optimization problem [11].

Although ARM metaheuristics have been shown to be faster than exact ARM algorithms, most of them suffer from limited accuracy. They do not generate all the rules, and the rules they generate are not necessarily of good quality. Besides, a large number of the association rules returned to the user are redundant, so analysing the results is time-consuming: the user has to handle a large proportion of redundant rules, most of which are not interesting for the application at hand.

This article proposes a Modified Genetic Algorithm for Mining interesting Association Rules (MGA-ARM). Several contributions are embedded in the proposed algorithm. First, the random initial population is replaced by a special initial population to improve both the CPU time and the solution quality.
Secondly, we consider measures besides support and confidence in the objective function in order to extract better association rules. Finally, we propose a method to handle and prune the non-significant redundant rules. Different experiments have been carried out on several well-known benchmarks. The results were compared with those of other published methods and show the efficiency of the proposal.

The formal definition of an association rule was introduced by Agrawal in [1]. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n items and $T = \{t_1, t_2, \ldots, t_m\}$ a set of m transactions, where each transaction $t_i$ is a set of items such that $t_i \subseteq I$. An association rule is a pattern of the form $X \rightarrow Y$, meaning that there is an association between the presence of the itemsets X and Y in transactions. X is called the antecedent or left-hand side of the rule, and Y is called the consequence or right-hand side of the rule. Association rules have two significant basic measures: support and confidence. Given an association rule $X \rightarrow Y$:
- The support, denoted support($X \rightarrow Y$), is the ratio between the number of transactions containing $X \cup Y$ and the number of transactions in the database. It determines how often the rule is applicable to a given dataset.
- The confidence, denoted confidence($X \rightarrow Y$), is the ratio between the number of transactions containing $X \cup Y$ and the number of transactions containing X. It determines how frequently items in Y appear in transactions that contain X.

The association rules mining task is formally stated as follows: given a set of transactions $T = \{t_1, t_2, \ldots, t_m\}$, the objective is to find all valid rules, i.e. rules having support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).
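The two measures and the validity test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy transactions and the inclusive comparison against the thresholds are choices made here, not part of the paper (whose implementation was done in Matlab).

```python
from typing import Collection, FrozenSet, Iterable


def count_containing(itemset: Iterable[str],
                     transactions: Collection[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(antecedent, consequent, transactions) -> float:
    """Share of all transactions that contain X union Y."""
    xy = frozenset(antecedent) | frozenset(consequent)
    return count_containing(xy, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions) -> float:
    """Share of the transactions containing X that also contain Y."""
    n_x = count_containing(antecedent, transactions)
    if n_x == 0:
        return 0.0  # X never occurs, so the rule is never applicable
    xy = frozenset(antecedent) | frozenset(consequent)
    return count_containing(xy, transactions) / n_x


def is_valid(antecedent, consequent, transactions,
             minsup: float, minconf: float) -> bool:
    """Validity test; inclusive thresholds are an assumption of this sketch."""
    return (support(antecedent, consequent, transactions) >= minsup
            and confidence(antecedent, consequent, transactions) >= minconf)


# Hypothetical toy database with four transactions.
TOY = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    print(support({"bread"}, {"milk"}, TOY))
    print(confidence({"bread"}, {"milk"}, TOY))
    print(is_valid({"bread"}, {"milk"}, TOY, minsup=0.4, minconf=0.6))
```

In the toy database, the itemset {bread, milk} appears in two of the four transactions and bread appears in three, so the rule bread → milk has support 2/4 and confidence 2/3.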
The association rules mining task is not limited to the previous definition. It is also considered an optimization problem that consists in finding the best rules $r \in R$, where R is the set of all possible rules, while maximizing the values of the support and confidence [8]. Many other metrics are available to measure the quality of an association rule, such as coverage, comprehensibility, leverage, interestingness, lift and conviction [11]. Therefore, the ARM problem can also be considered a multi-objective optimization problem rather than a single-objective one, where the goal is to find association rules while optimizing several objective measures simultaneously.

Whether only support and confidence are used or more measures are added, metaheuristics generate duplicated and redundant rules, and this affects the quality of the results presented to the user. [29].

Definition 2. Let R1 be the rule $X \rightarrow Y$ and R2 the rule $X' \rightarrow Y'$. We say that the rule R1 is more general than the rule R2, denoted R1 ≤≤ R2, if R2 can be generated by adding additional items to either the antecedent or the consequent of R1 [30]. Formally, let $R = \{R_1, \ldots, R_n\}$ be a set of rules such that all their supports and confidences are equal. For all i, In other words, since all the rules in the collection R have the same support and confidence, the simplest rules in the collection should suffice to represent the whole set. Thus the non-redundant rules in the collection R are those that are most general, i.e., those having minimal antecedents and consequents in terms of the subset relation [30]. This second definition is the one considered in our work.

Many metaheuristics, whether evolutionary or swarm intelligence algorithms, have been investigated to solve the ARM problem. Saggar et al.
[25] were the first authors to propose the use of evolutionary algorithms for ARM. Their idea consisted in optimizing association rules extracted with the Apriori algorithm by using a genetic algorithm. Genetic programming was later also used for ARM in [23]. An interesting survey of the use of evolutionary computation for frequent pattern mining, with particular emphasis on genetic algorithms, can be found in [28].

Particle swarm optimization has been widely used to mine association rules, and a survey of its applications is given in [4]. Examples include the work published in [20] and the work proposed by Sarath and Ravi in [26]. After that, a Modified Binary Cuckoo Search (MBCS-ARM) was proposed in [21]. In [14] the authors present an adaptation of the bat algorithm to the ARM problem, known as BAT-ARM. Later, they proposed a multi-population bat algorithm in [15]. In [22], the authors used Ant Colony Optimization for continuous domains ($\mathrm{ACO}_R$). This algorithm mines numeric association rules without any need to specify minimum support and minimum confidence. Other recent works propose algorithms based on animal migration optimization [6] and on the chemical reaction optimization metaheuristic [7].

Hybrid approaches were also proposed, such as the work in [9], where the authors proposed an algorithm based on bees swarm optimization and tabu search, and the work in [18], where the authors combine a genetic algorithm and particle swarm optimisation in an algorithm called GPSO.

The following works consider association rules mining as a multi-objective problem rather than a single-objective one. In [12] the authors proposed a Pareto-based genetic algorithm that uses three measures as objective functions: comprehensibility, interestingness and confidence. Alatas et al. [3] proposed a Pareto-based multi-objective differential evolution (DE) algorithm for extracting association rules.
They formulated the association rule mining problem as a four-objective optimization problem, where the support, confidence and comprehensibility of the rules are maximized, while the amplitude of the intervals that make up the itemset and the rule is minimized. In [11], the Multi-objective Binary Particle Swarm Optimization (MO-BPSO), the Multi-objective Binary Firefly Optimization and Threshold Accepting (MO-BFFO-TA), and the Multi-objective Binary Particle Swarm Optimization and Threshold Accepting (MO-BPSO-TA) were used to extract association rules without specifying support and confidence thresholds. Recently, a multi-objective bat algorithm known as MOB-ARM was proposed in [16].

To use metaheuristics, the objective function, the representation of the solution and the different operators need to be defined and adapted to the problem at hand. For the rule encoding, the solution S is represented by an integer vector defined as follows:

The Objective Function: Four quality measures of the rule are maximized in the objective function. We use the comprehensibility and the interestingness measures in addition to the support and the confidence.

where, according to [12]:
- Comprehensibility: quantifies how comprehensible the rule is.
- Interestingness: a rule is interesting when the individual support count values are greater than the collective support($X \rightarrow Y$) values.

In this section we present the Modified Genetic Algorithm for ARM (MGA-ARM). We explain the different steps of the algorithm and the proposed method to prune the non-significant redundant rules.
- Search space: we believe that the search space can easily be pruned using the Apriori principle and some other properties. The following example shows that infrequent items are useless because they always lead to infrequent rules. Let ab → c be a given frequent rule and d an infrequent item, that is, sup(d) < minsup.
Consequently sup(abcd) < minsup, and therefore all rules generated from the itemset {abcd} will be infrequent. In our proposal, the search space contains only the set of frequent items instead of all the items [8].
- Initial population: most metaheuristics use a random initialization and draw on the whole set of items to create the population. Infrequent items then generate infrequent rules, which is useless for the search procedure and a waste of time, as explained previously. In MGA-ARM, the random population is replaced by a set of valid rules of size two. Algorithm 1 summarizes the steps of the procedure InitPop used to generate the initial population. InitPop starts by extracting the set of frequent items $FIS_1$ from the dataset T. Based on this set, it generates a set of rules that contain two frequent items, one in the antecedent and the other in the consequence of the rule. Finally, it selects the best rules among the generated rules and saves them as $ValidAR_2$, the special initial population exploited by our algorithm [8].
- Crossover: in our algorithm, the proposed crossover takes an item of parent1 that does not exist in parent2 and adds it to parent2, and adds to parent1 an item of parent2 that parent1 does not already contain. This operator ensures that the new solutions are feasible, since the parents were feasible solutions. Figure 1 shows an example of the crossover operator. First, we copy parent1 to child1 and parent2 to child2. Then an item that exists in parent1 but not in parent2 is added to child2, and vice versa.

The second type of discarded rules are the rules constructed from the same itemset: they have the same support, but their confidence and fitness differ since they have different antecedents and consequences. The third type is the redundant rules, defined in Sect.
2.3; they are eliminated by grouping the rules that have the same values of support and confidence and whose itemsets are included in each other. We then keep the most general rule, i.e. the one with the smallest number of items, and delete the others. The details of the proposed algorithm are given in Algorithm 2.

The experiments were done in two steps. The first step compares the proposed approach to other approaches in terms of solution quality and CPU run time. The second step studies the effect of pruning redundancy on the final solution. The experiments were run under Windows 8 on a desktop computer with an Intel Core-i3 processor at 1.8 GHz and 6 GB of memory. All the implementations were written in Matlab. The tests were conducted on well-known scientific datasets frequently used by the data mining community: the Chess database, which contains 3196 transactions and 75 items; the Mushroom dataset, which is larger, with 8124 transactions and 119 items; the IBM Quest Standard dataset, with 1000 transactions and 40 items; and finally two small datasets, Book and Food, also with 1000 transactions and 11 items. The datasets can be found at (http://fimi.uantwerpen.be/data/), (http://funapp.cs.bilkent.edu.tr/DataSets/), (https://www.solver.com/xlminer) and (www.ibm.com/software/analytics/spss/) respectively.

In this section, we compare our proposed MGA-ARM to single-objective and multi-objective algorithms. We used three different versions of the bat algorithm designed for mining association rules: BAT-ARM [14], MPB-ARM [15] and MOB-ARM [16]. The experiment was carried out on three of the datasets listed above. The population size was fixed to 50 and the maximum number of iterations to 100. Tables 1 and 2 present the average support and confidence, respectively, over thirty executions.
In terms of support and confidence, the results show that our proposed algorithm MGA-ARM outperforms MOB-ARM, BAT-ARM and MPB-ARM in most cases. Table 3 presents the CPU run times of the different algorithms over 100 runs; the results are given in seconds. Our proposed algorithm is faster than the other algorithms in most cases thanks to the proposed special initial population, which avoids a lot of useless exploitation of the search space.

In this section, we examine the effect of applying the Filter rules procedure to the results in order to eliminate the duplicated and redundant rules. Table 4 shows the changes in support and confidence values before and after applying the Filter rules procedure. The interval in the first column gives the percentage of rules deleted, which differs from one dataset to another. For the small datasets Book and Food, the percentage of rules pruned is clearly large: in the best case 40% of the rules are deleted, and in the worst case half of the rules are duplicated or redundant. For the larger datasets, the percentage of rules deleted is smaller: 16% in the worst case and 4% in the best case. For the quality measures, on the other hand, there is no big difference in the average support and confidence across runs. Over the different datasets, the values decrease by 0.02 in the worst case and by 0.001 in the best case, which is negligible compared to the number of useless rules pruned. The Filter rules procedure does considerably decrease the number of rules, but it ensures that the user is presented with a set of interesting non-redundant association rules. Because the number of rules decreases after the Filter rules procedure is applied, it is recommended to select a large number of valid rules beforehand.
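The redundancy pruning discussed in this section can be sketched as follows. This is a minimal illustration and not the paper's Filter rules procedure: the `Rule` class, the tolerance `tol` and the function names are hypothetical, and the sketch covers only exact duplicates and rules that are redundant in the sense of Definition 2 (same support and confidence, with a more general rule present).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record: X -> Y with its precomputed measures."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float


def more_general(a: Rule, b: Rule) -> bool:
    """True if b can be obtained from a by adding items to either side."""
    same_sides = (a.antecedent == b.antecedent
                  and a.consequent == b.consequent)
    return (a.antecedent <= b.antecedent
            and a.consequent <= b.consequent
            and not same_sides)


def filter_rules(rules: List[Rule], tol: float = 1e-9) -> List[Rule]:
    """Drop exact duplicates, then rules dominated by a more general rule
    that has the same support and confidence (up to `tol`)."""
    # Frozen dataclasses are hashable, so dict.fromkeys removes exact
    # duplicates while preserving the original order.
    unique = list(dict.fromkeys(rules))
    kept = []
    for r in unique:
        redundant = any(
            abs(r.support - g.support) <= tol
            and abs(r.confidence - g.confidence) <= tol
            and more_general(g, r)
            for g in unique
        )
        if not redundant:
            kept.append(r)
    return kept


if __name__ == "__main__":
    # Made-up rules for illustration only.
    r1 = Rule(frozenset({"a"}), frozenset({"c"}), 0.4, 0.8)
    r2 = Rule(frozenset({"a", "b"}), frozenset({"c"}), 0.4, 0.8)
    r3 = Rule(frozenset({"b"}), frozenset({"d"}), 0.3, 0.7)
    for rule in filter_rules([r1, r2, r2, r3]):
        print(sorted(rule.antecedent), "->", sorted(rule.consequent))
```

The pairwise comparison is quadratic in the number of rules, which is acceptable for a final population of the size used in the experiments; grouping rules by their (support, confidence) pair first would be a natural refinement for larger rule sets.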
In this work, we proposed a modified genetic algorithm for mining interesting association rules by considering four different quality measures: support, confidence, comprehensibility, and interestingness. Moreover, we proposed a new technique to build the initial population of the GA, and we handled duplicated and redundant rules with the Filter rules procedure. Different experiments have been carried out on several well-known benchmarks. The results, compared with those of other published methods, show the efficiency of the proposal. In future work, we intend to optimize our algorithm by testing different crossover and mutation operators and to test it on larger datasets. Moreover, more quality measures can be used to determine which ones are really better for the ARM problem.

[1] Mining association rules between sets of items in large databases
[2] Fast algorithms for mining association rules
[3] MODENAR: multi-objective differential evolution algorithm for mining numeric association rules
[4] A review on application of particle swarm optimization in association rule mining
[5] ADAM: a testbed for exploring the use of data mining in intrusion detection
[6] ARM-AMO: an efficient association rule mining algorithm based on animal migration optimization. Knowl.-Based Syst.
[7] Chemical reaction optimization metaheuristic for solving association rule mining problem
[8] Metaheuristics guided by the apriori principle for association rule mining: Case study-CRO metaheuristic
[9] A hybrid bees swarm optimization and tabu search algorithm for association rule mining
[10] Data Mining and Knowledge Discovery with Evolutionary Algorithms
[11] Association rule mining via evolutionary multiobjective optimization
[12] Multi-objective rule mining using genetic algorithms
[13] Mining frequent patterns without candidate generation: a frequent-pattern tree approach
[14] Association rule mining based on bat algorithm
[15] Multi-population cooperative bat algorithm for association rule mining
[16] Multi-objective bat algorithm for mining interesting association rules
[17] Efficient mining of association rules in text databases
[18] Mining association rules using hybrid genetic algorithm and particle swarm optimisation algorithm
[19] A knowledge discovery methodology for telecommunication network alarm databases
[20] Application of particle swarm optimization to association rule mining
[21] Modified binary cuckoo search for association rule mining
[22] Multi-objective numeric association rules mining via ant colony optimization for continuous domains without specifying minimum support and minimum confidence
[23] Association rule mining using a multi-objective grammar-based ant programming algorithm
[24] Mining access patterns efficiently from web logs
[25] Optimization of association rule mining using improved genetic algorithms
[26] Association rule mining using binary particle swarm optimization
[27] Finding association rules on heterogeneous genome data
[28] Pattern Mining with Evolutionary Algorithms
[29] Generating non-redundant association rules
[30] Mining non-redundant association rules