Response Modeling with Support Vector Regression

Dongil Kim, Hyoung-joo Lee and Sungzoon Cho ∗

Department of Industrial Engineering, Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, Seoul, 151-744, Korea

Abstract

Response modeling, which predicts whether each customer will respond or how much each customer will spend based on a database of customers, has become a key factor in direct marketing. In previous research, several classification approaches, including Support Vector Machines (SVM) and Neural Networks (NN), have been applied to response modeling. However, conventional approaches have two drawbacks: (1) response models predict only classification scores rather than the total amount of money spent, and (2) the training datasets are too large to train on efficiently. To address the first drawback, we apply Support Vector Regression (SVR) to response modeling in order to predict the total amount of money each respondent will spend. To address the second, we employ a pattern selection method designed for SVR. This paper reports experimental results on a direct marketing dataset in terms of model fit, training time complexity and profitability.

Key words: Response modeling, Direct marketing, Support vector machines, Regression, Pattern selection

∗ Corresponding author. Tel: +82-2-883-4913, Fax: +82-2-883-4913
Email addresses: dikim01@snu.ac.kr (Dongil Kim), imhjlee@gmail.com (Hyoung-joo Lee), zoon@snu.ac.kr (Sungzoon Cho).

Preprint submitted to Elsevier Science, 26 May 2006

1 Introduction

A response model, given a mailing campaign, predicts whether each customer will respond, or how much each customer will spend, based on a database of the customers' demographic information and/or purchase history. Marketers send mails or catalogs to the customers who are predicted to respond or to spend large amounts of money. A well-targeted mail increases profit, while a mistargeted or unwanted mail not only increases marketing cost but may also worsen the customer's relationship with the firm (Gönül et al., 2000; Potharst et al., 2000). Various methods have been used for response modeling, such as statistical techniques (Bentz and Merunka, 2000; Haughton and Oulabi, 1997; Ling and Li, 1998; Suh et al., 1999), machine learning techniques (Wang et al., 2005; Chiu, 2002; Cheung et al., 2003; Shin and Cho, 2006; Viaene et al., 2001; Yu and Cho, 2005) and neural networks (NN) (Potharst et al., 2000; Bentz and Merunka, 2000; Zahavi and Levin, 1997).

So far, a response model has usually been formulated as a binary classification problem because of its straightforwardness. The customers are divided into two classes, respondents and non-respondents, and a classifier is constructed to predict whether a given customer will respond or not. From a modeling point of view, however, as pointed out for the KDD-CUP-98 task (KDD98 Cup, 1998), there is an inverse correlation between the likelihood to buy and the dollar amount spent (Wang et al., 2005): the larger the dollar amount involved, the more cautious the customer is in making a purchase decision. In this sense, the likelihood to buy estimated by a classification model may not lead to more profit. Therefore, in addition to the classification approach, a regression model needs to be applied to response modeling to predict the total amount of money each customer will spend.

Support Vector Machine (SVM) is a widely used learning algorithm with strong generalization performance, achieved by employing the Structural Risk Minimization (SRM) principle (Vapnik, 1995).
Support Vector Regression (SVR), a regression version of SVM, was developed to estimate regression functions (Drucker et al., 1997). Like SVM, SVR can solve non-linear problems using kernel functions and has been successful in various domains (Drucker et al., 1997; Müller et al., 1997). However, training SVR on real-world datasets is difficult: as the number of training patterns increases, SVR training takes much longer, with a time complexity of O(N³), where N denotes the number of training patterns. Many algorithms, such as Chunking, SMO, SVMlight and SOR, have been proposed to reduce the training time, but their training time complexity is still strongly related to the number of training patterns (Platt, 1999). We take another direction, pattern selection, which focuses on reducing the number of training patterns. NPPS (Shin and Cho, 2003) is a proven pattern selection method for SVM, but it applies only to SVM classifiers. Instead, we use a pattern selection method based on the ε-tube that is specifically designed for SVR (Kim and Cho, 2006).

In this paper, we apply SVR to response modeling to predict the total amount of money spent by each respondent. As mentioned above, a response model based on a classification method can score each customer's likelihood to respond. Direct marketers, however, would like to identify not only respondents but also the profitable customers who will spend more money than others. Hence, after respondents have been identified by a classification response model, a regression model is needed to predict the total amount of money each respondent will spend. Since the classification model is not our concern here, we assume a primitive ideal classifier that finds all respondents without False Positive (FP) errors. We built a subset consisting of all, and only, the respondents in the original dataset, and applied the SVR model to this subset of respondents. The ideal procedure of this paper is presented in Fig. 1. We used the DMEF4 dataset from the Direct Marketing Educational Foundation (DMEF), which was collected from a catalog mailing task.

Fig. 1. The ideal procedure of the paper. The paper focuses on the two dark boxes.

The remainder of this paper is organized as follows. In Section 2, we introduce the concepts of SVR and present the main idea of the pattern selection method with a simple toy example. In Section 3, we describe the DMEF4 dataset and the experimental settings. In Section 4, the experimental results are presented. In Section 5, we summarize the results and conclude the paper with remarks on limitations and future research directions.

2 Pattern Selection for Support Vector Regression

2.1 Support Vector Regression

For a brief review of SVR, consider a regression function f(x) to be estimated from the training patterns {(x_i, y_i)},

    f(x) = w · x + b,   with w, x ∈ R^N, b ∈ R,                                  (1)

where

    {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^N × R.                                     (2)

SVR fits the function so that the training patterns lie inside an ε-insensitive tube (ε-tube). By the SRM principle, generalization accuracy is optimized through the flatness of the regression function. Since flatness is guaranteed by a small w, SVR minimizes the norm ‖w‖².
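As a minimal illustration of this formulation (not taken from the paper), the following sketch fits an ε-SVR with an RBF kernel using scikit-learn; the toy data, parameter values and library choice are our own assumptions.

```python
# A minimal sketch (not the authors' code): fitting an epsilon-SVR with an RBF
# kernel using scikit-learn. Patterns inside the epsilon-tube incur no loss;
# patterns on or outside the tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))              # toy 1-D inputs
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)    # noisy target values

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)  # illustrative values
model.fit(X, y)

print("support vectors:", model.support_vectors_.shape[0], "of", len(X))
```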
The resulting soft-margin optimization problem, where C is the trade-off between the empirical error and the flatness, ε is the size of the ε-tube, and ξ_i, ξ*_i are slack variables, is

    Minimize    (1/2)‖w‖² + C ∑_{i=1}^{n} (ξ_i + ξ*_i)                           (3)

    Subject to  y_i − w · x_i − b ≤ ε + ξ_i,
                w · x_i + b − y_i ≤ ε + ξ*_i,
                ξ_i, ξ*_i ≥ 0,   i = 1, ..., n.

Hence, SVR is trained by minimizing ‖w‖² while keeping the training patterns inside the ε-tube. By introducing Lagrange multipliers α_i and α*_i, this quadratic programming problem can be solved in its dual form. The regression function estimated by SVR is

    f(x) = ∑_{i=1}^{n_s} (α_i − α*_i) K(x_i, x) + b,                             (4)

where n_s is the number of support vectors and K is the kernel function.

2.2 Pattern Selection for Support Vector Regression

The training time complexity of SVR is O(N³): as the number of training patterns increases, the training time grows in cubic proportion. Marketing databases usually consist of over a million customers and hundreds of input variables, so training SVR directly on a marketing dataset takes far too long. We therefore applied a pattern selection method from our previous research (Kim and Cho, 2006).

Fig. 2. (a) The regression function after training on the original dataset, and (b) the regression function after training ONLY on the patterns inside the estimated ε-tube.

SVR is trained with the ε-insensitive loss function and builds an ε-tube around the training patterns. Patterns inside the ε-tube are not counted as errors, while patterns on or outside the ε-tube, the Support Vectors (SVs), determine the solution. Moreover, SVR estimates the regression function as the center line of the ε-tube. Hence, if the ε-tube could be estimated before training, we could find the regression function using only the patterns inside the ε-tube (see Fig. 2). However, removing all patterns outside the ε-tube could shrink the ε-tube itself, so it is desirable to keep some of the "outside" patterns for training. We therefore define a "fitness" probability for each pattern based on its location with respect to the ε-tube and select patterns stochastically.

We make k bootstrap samples of size l (l < n) from the original training pattern set D. We train an SVR on each bootstrap sample and obtain k SVR regression functions. Each regression function is used to check whether a training pattern is located inside its ε-tube, so each training pattern in D lies inside a minimum of zero and a maximum of k ε-tubes. Let m_j denote the number of times that pattern j is found inside an ε-tube. We use m_j as the likelihood that pattern j is actually located inside the true ε-tube. Since we want to select patterns inside the ε-tube, each m_j is converted into a probability p_j, and pattern j is selected with probability

    p_j = m_j / ∑_{i=1}^{n} m_i.                                                 (5)

The procedure of the pattern selection method is illustrated in Fig. 3 with a simple toy example, and the algorithm is given in Fig. 4.

Fig. 3. (a) Original dataset and an SVR trained on it, (b) a bootstrap sample and an SVR trained on it, (c) original dataset and the ε-tube of (b)'s SVR, and (d) selected patterns and an SVR trained on them.

1. Initialize the number of bootstrap samples k, the number of patterns in each bootstrap sample l, and the number of patterns to be selected s.
2. Make k bootstrap samples D_i (i = 1, ..., k) from the original dataset D by random sampling without replacement.
3. Train an SVR f_i on each D_i (i = 1, ..., k).
4. Count the number of times m_j that pattern j is found inside the ε-tube of an f_i.
5. Convert m_j to p_j according to Eq. (5).
6. Select s patterns stochastically from D without replacement based on p_j.
7. Train the final SVR on the s selected patterns.

Fig. 4. ε-tube based pattern selection algorithm.
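The algorithm of Fig. 4 can be sketched as follows, assuming scikit-learn's SVR as the base learner (the paper does not name an implementation). The function and parameter names are ours, and the tiny smoothing term added to Eq. (5) is a practical tweak, not part of the paper, so that sampling without replacement never runs out of patterns with non-zero probability.

```python
# A sketch of the epsilon-tube based pattern selection of Fig. 4 (assumptions:
# scikit-learn SVR as base learner; illustrative C and epsilon values).
import numpy as np
from sklearn.svm import SVR

def select_patterns(X, y, k=10, l_frac=0.25, s_frac=0.5, C=10.0, eps=0.1, seed=0):
    """Return indices of patterns likely to lie inside the true eps-tube."""
    rng = np.random.default_rng(seed)
    n = len(X)
    l, s = int(l_frac * n), int(s_frac * n)
    m = np.zeros(n)                                    # in-tube counts m_j
    for _ in range(k):
        idx = rng.choice(n, size=l, replace=False)     # bootstrap sample D_i
        f_i = SVR(kernel="rbf", C=C, epsilon=eps).fit(X[idx], y[idx])
        m += (np.abs(f_i.predict(X) - y) <= eps)       # inside f_i's eps-tube?
    p = (m + 1e-12) / (m + 1e-12).sum()                # Eq. (5) with tiny smoothing
    return rng.choice(n, size=s, replace=False, p=p)   # step 6: stochastic selection

# Usage sketch: train the final SVR on the selected subset (step 7).
# sel = select_patterns(X_train, y_train)
# final_svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_train[sel], y_train[sel])
```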
3 Dataset and Experimental Settings

3.1 Dataset: DMEF4 dataset

We used the DMEF4 dataset, which contains 101,532 customers and 91 input variables. The response rate is 9.4%, with 9,571 respondents and 91,961 non-respondents. Following previous research, we selected a subset based on the weighted dollars spent (Yu and Cho, 2005; Ha et al., 2005). We intended to design the regression model as a scoring model applied after respondents have been picked by a primary classification model. Since the classification model was not our interest in this research, we assumed an ideal response model, built with a classification algorithm, that could pick all respondents without false acceptances. Hence, we selected a new dataset consisting only of customers whose target dollar amounts are positive, i.e. only respondents. The final dataset consists of 4,000 customers.

Table 1
Input variables

Name      Formulation                     Description
ORIGINAL VARIABLES
Purseas                                   Number of seasons with a purchase
Falord                                    LTD fall orders
Ordtyr                                    Number of orders this year
Puryear                                   Number of years with a purchase
Sprord                                    LTD spring orders
DERIVED VARIABLES
Recency                                   Order days since 10/1992
Tran53    I(180 ≤ recency ≤ 270)
Tran54    I(270 ≤ recency ≤ 366)
Tran55    I(366 ≤ recency ≤ 730)
Tran38    1/recency
Comb2     ∑_{m=1}^{14} ProdGrp_m          Number of product groups purchased this year
Tran46    √comb2
Tran42    log(1 + ordtyr × falord)        Interaction between the numbers of orders
Tran44    √(ordhist × sprord)             Interaction between LTD orders and LTD spring orders
Tran25    1/(1 + lorditm)                 Inverse of latest-season items

For performance evaluation, the dataset was partitioned into training and test sets: half of the customers were randomly assigned to the training set and the other half to the test set. Because the performance of a model varies considerably with the particular data split (Malthouse, 2002), ten different training/test splits were generated, and all reported results are averaged over these splits. We were not interested in feature selection/extraction. Malthouse (2002) extracted 17 input variables for this dataset, and Ha et al. (2005) used 15 of them, removing two variables whose variation is negligible. In this paper, these 15 variables were used as input variables, as listed in Table 1. We formulated the dataset as a regression problem by setting the total dollar amount spent as the target variable.

3.2 Experimental Settings

We built three different SVR-based response models in the experiments. All results of SVR with pattern selection (SVR-PS) were compared with those of SVR trained on all the data (SVR-100) and SVR with random sampling (SVR-Random). The hyper-parameters of SVR were determined by cross-validation for SVR-100 over the grid C × ε = {0.1, 1, 5, 10, 50, 100} × {0.01, 0.05, 0.07, 0.1, 0.5, 0.7, 1}. The RBF kernel was used as the kernel function, and the kernel parameter σ was fixed to 1.0 for all datasets. The parameters of the pattern selection method were set as follows: the number of bootstrap samples k was set to 10, the number of patterns in each bootstrap sample l was set to 25% of the number of training patterns n, and the number of selected patterns s ranged from 10% to 90% of n. All data were normalized before training. Root Mean Squared Error (RMSE) was used to evaluate model fit, and the average profit per catalog mail was calculated to evaluate the profitability of the response models. We also recorded training times. All stochastic results are averages over 10 repetitions.
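The cross-validation grid over C and ε described above could be run as in the following sketch, assuming scikit-learn's GridSearchCV (the paper does not name its implementation). If σ enters the RBF kernel as exp(−‖x − x′‖²/(2σ²)), then σ = 1 corresponds to gamma = 0.5 in scikit-learn's parameterization; this mapping is our assumption.

```python
# A sketch of the hyper-parameter search in Section 3.2 (assumptions:
# scikit-learn implementation; sigma = 1 mapped to gamma = 0.5).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C": [0.1, 1, 5, 10, 50, 100],
    "epsilon": [0.01, 0.05, 0.07, 0.1, 0.5, 0.7, 1],
}

search = GridSearchCV(
    SVR(kernel="rbf", gamma=0.5),
    param_grid,
    scoring="neg_root_mean_squared_error",   # model fit measured by RMSE
    cv=5,
)
# X_train, y_train = ...   # normalized respondent data, target = dollars spent
# search.fit(X_train, y_train)
# print(search.best_params_)
```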
4 Experimental Results

4.1 Model Fit and Time Complexity

Fig. 5 shows the experimental results. The solid lines show the results of SVR trained on all the data (SVR-100), the lines with circles show the results of SVR with pattern selection (SVR-PS), and the lines with boxes show the results of SVR with random sampling (SVR-Random). Fig. 5 (a) plots the RMSE in dollars. SVR-PS is slightly less accurate than SVR-100, but significantly more accurate than SVR-Random. As more patterns are selected, the training dataset becomes more similar to the original dataset and the gaps between the three SVRs narrow. With 10% pattern selection, the RMSE gap between SVR-PS and SVR-Random is larger than 15 dollars. When 40% of the patterns are selected, the RMSE gap between SVR-PS and SVR-100 is less than 5 dollars.

Fig. 5 (b) shows the average training time of the three SVRs. Training SVR on all the data takes about 1,200 seconds. With 10% pattern selection, SVR-PS needs only 16% of the training time of SVR-100. Even when the percentage of selected patterns is raised to 40%, which leaves only a 5-dollar RMSE gap, SVR-PS needs only 25% of the training time of SVR-100. SVR-PS is thus more efficient than SVR-100.

Fig. 5. (a) RMSE (dollars) versus percentage of patterns selected, and (b) training time (seconds) versus percentage of patterns selected, for SVR-100, SVR-PS and SVR-Random.

Fig. 6. Average profit per mail (dollars) versus mailing depth. (a) Results of 10% pattern selection and random sampling; (b)-(d) are 30%, 50% and 70%, respectively.

4.2 Profitability

Fig. 6 compares the three SVR response models in terms of profit. For each setting, the response model predicted the dollar amount for each test customer, and the test customers were then sorted in descending order of the predicted dollar amount. The mailing depth in Fig. 6 is the fraction of this sorted list that receives a mail, and the vertical axis is the expected profit per mail when mailing to that fraction. If we send catalog mails to all respondents, the average profit per mail is roughly 60 dollars. However, if we mail only the top 10% of respondents, the SVR model yields more profit per mail than a classification model alone would. As the mailing depth grows, the average profit per mail decreases. SVR-PS lies between SVR-100 and SVR-Random. Together with the training time analysis of Fig. 5 (b), this shows that SVR-PS yields efficient models with acceptable profitability and fast training.
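The profit curves of Fig. 6 could be computed as in the following sketch, under the assumption that the average profit per mail at a mailing depth d is the mean actual dollar amount of the top d-fraction of customers ranked by the model's predicted spending; the paper does not give an explicit formula, so this reading is ours.

```python
# A sketch of the profitability evaluation in Section 4.2 (our assumed formula).
import numpy as np

def profit_per_mail(y_true, y_pred, depths=np.arange(0.1, 1.01, 0.1)):
    """Average actual dollars per mail for each mailing depth."""
    order = np.argsort(-np.asarray(y_pred))        # rank customers by predicted dollars
    sorted_true = np.asarray(y_true)[order]
    curve = {}
    for d in depths:
        top = max(1, int(round(d * len(sorted_true))))
        curve[round(float(d), 1)] = sorted_true[:top].mean()  # profit per mail at depth d
    return curve

# Example with the final SVR:
# y_pred = final_svr.predict(X_test)
# print(profit_per_mail(y_test, y_pred))
```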
5 Conclusions and Discussion

We applied SVR to response modeling, assuming a preceding response model that could find all respondents perfectly. With SVR, we estimated the total dollar amount spent by each customer in order to find the more profitable respondents. As a result, we could identify high-profit customers rather than merely estimating response likelihood. To reduce the training time complexity, we also used the pattern selection method, which made SVR efficient while avoiding a serious loss of accuracy. SVR-100 performed best and SVR-Random worst in terms of accuracy and profit, while SVR-Random was the fastest to train. SVR-PS occupies a reasonable middle ground: it is nearly as accurate as SVR-100 and nearly as fast as SVR-Random, needing only 16-25% of the training time of SVR-100 with an acceptable loss of accuracy.

This research has some limitations. First, it is an early attempt at applying SVR to response modeling. We formulated the problem in a regression form, but we may have missed some key factors of the data setting for the regression formulation; further research may find a more effective regression formulation for response modeling. Second, although the results of SVR-PS look efficient, we cannot yet conclude that they are good enough for real-world problems; the pattern selection method should be improved to be more accurate. Finally, more varied experiments, including a deeper profit analysis, should follow. We assumed a perfect classification model; a comparison with a real classification model is left for future work.

References

Bentz, Y., Merunka, D., 2000. Neural networks and the multinomial logit for brand choice modeling: a hybrid approach. Journal of Forecasting 19, 177-200.

Cheung, K.-W., Kwok, J.T., Law, M.H., Tsui, K.-C., 2003. Mining customer product ratings for personalized marketing. Decision Support Systems 35, 231-243.

Chiu, C., 2002. A case-based customer classification approach for direct marketing. Expert Systems with Applications 22(2), 163-168.

Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A., Vapnik, V., 1997. Support vector regression machines. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, 155-161.

Gönül, F.F., Kim, B.D., Shi, M., 2000. Mailing smarter to catalog customers. Journal of Interactive Marketing 14(2), 2-16.

Ha, K., Cho, S., MacLachlan, D., 2005. Response models based on bagging neural networks. Journal of Interactive Marketing 19(1), 17-30.

Haughton, D., Oulabi, S., 1997. Direct marketing modeling with CART and CHAID. Journal of Direct Marketing 11(4), 42-52.

KDD98, The KDD-CUP-98 result, 1998. http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html.

Kim, D., Cho, S., 2006. ε-tube based pattern selection for support vector machines. Lecture Notes in Artificial Intelligence 3918, 215-224.

Ling, C.X., Li, C., 1998. Data mining for direct marketing: problems and solutions. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, pp. 73-79.

Malthouse, E.C., 2002. Performance-based variable selection for scoring models. Journal of Interactive Marketing 16(4), 37-50.

Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V., 1997. Predicting time series with support vector machines. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (Eds.), Artificial Neural Networks - ICANN'97, Berlin. Lecture Notes in Computer Science 1327, 999-1004.
Platt, J.C., 1999. Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, pp. 185-208.

Potharst, R., Kaymak, U., Pijls, W., 2000. Neural networks for target selection in direct marketing. Erasmus Research Institute of Management (ERIM), RSM Erasmus University, Research Paper ERS-2001-14-LIS. Available at http://ideas.repec.org/s/dgr/eureri.html.

Shin, H., Cho, S., 2003. Fast pattern selection algorithm for support vector classifiers: time complexity analysis. Lecture Notes in Computer Science 2690, 1008-1015.

Shin, H., Cho, S., 2006. Response modeling with support vector machines. Expert Systems with Applications 30(4), 746-760.

Suh, E.H., Noh, K.C., Suh, C.K., 1999. Customer list segmentation using the combined response model. Expert Systems with Applications 17(2), 89-97.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.

Viaene, S., Baesens, B., Van Gestel, T., Suykens, J.A.K., Van den Poel, D., Vanthienen, J., De Moor, B., Dedene, G., 2001. Knowledge discovery in a direct marketing case using least squares support vector machines. International Journal of Intelligent Systems 16, 1023-1036.

Wang, K., Zhou, S., Yang, Q., Yeung, J.M.S., 2005. Mining customer value: from association rules to direct marketing. Data Mining and Knowledge Discovery 11, 57-79.

Yu, E., Cho, S., 2005. Constructing response model using ensemble based on feature subset selection. Expert Systems with Applications, in press. Available online at http://www.sciencedirect.com/science/journal/09574174.

Zahavi, J., Levin, N., 1997. Applying neural computing to target marketing. Journal of Direct Marketing 11(4), 76-93.