Technology Classification with Latent Semantic Indexing

Dirk Thorleuchter a,*, Dirk Van den Poel b

a Fraunhofer INT, D-53879 Euskirchen, Appelsgarten 2, Germany, dirk.thorleuchter@int.fraunhofer.de
b Ghent University, Faculty of Economics and Business Administration, B-9000 Gent, Tweekerkenstraat 2, Belgium, dirk.vandenpoel@ugent.be
URL: http://www.crm.UGent.be

* Corresponding author at: Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany. Tel.: +49 2251 18305; fax: +49 2251 18 38 305. E-mail address: Dirk.Thorleuchter@int.fraunhofer.de (D. Thorleuchter).

Abstract

Many national and international governments establish organizations for funding applied science research. Several of these organizations have defined procedures for identifying relevant projects that are based on prioritized technologies. In particular, for applied science research projects, which combine several technologies, it is difficult to identify all corresponding technologies of all research-funding organizations. In this paper, we present an approach that supports researchers and research-funding planners by classifying applied science research projects according to the corresponding technologies of research-funding organizations. In contrast to related work, this problem is solved by considering results from the literature on application based technological relationships and by creating a new approach that is based on latent semantic indexing (LSI) as a semantic text classification algorithm. Technologies that occur together in the process of creating an application are grouped in classes, semantic textual patterns are identified as representative for each class, and projects are assigned to one of these classes. This enables the assignment of each project to all technologies semantically grouped by use of LSI.
This approach is evaluated using the example of defense and security based technological research, because the growing importance of this application field leads to an increasing number of research projects and to the appearance of many new technologies.

Key Words: Latent semantic indexing, SVD, Classification, Research Funding.

Using LSI for Technology Classification

1 Introduction

Research funding for applied science research projects is provided by many national and international organizations (Beaudry & Allaoui, 2012; Lepori, 2011). They evaluate proposals for new research projects and, based on self-defined procedures, identify the relevant projects, which are accepted for funding (Hicks, 2012; Mobjörk & Linnér, 2006). An important criterion for technological research is that the technologies behind the proposed research project are also mentioned in a specific list or taxonomy of prioritized technologies (Choi, Lee, & Sohn, 2009; Bradshaw et al., 2008). In general, these technology lists or taxonomies consist of a manually created label and a description for each technology. The descriptions contain terms from the technology as well as from potential application fields (Thorleuchter, Van den Poel, & Prinzie, 2010c). For example, the European Union has established a Framework Research Programme (FP7) theme for security with the objective of developing technologies needed to protect citizens from threats. It uses a list of prioritized technologies (ESRAB technology list) for research funding decisions (Remuss, 2010). This means that proposals for research projects that do not fit these prioritized technologies and the corresponding application field, e.g. ‘security’, are normally not accepted (McLeish & Nightingale, 2007; Jiricka & Pröbstl, 2012).
For a researcher, it is often difficult to identify the prioritized technologies and the corresponding application fields of each research-funding organization (Grimpe, 2012). It is also difficult for research planners to manually assign applied science research projects to the prioritized technologies of their research-funding organization (Ludwig, Roson, Zografos, & Kallis, 2011). Therefore, in this paper, we present an automated approach based on text classification that supports researchers as well as research-funding planners by identifying relationships between applied science research projects and technologies extracted from lists or taxonomies.

The literature proposes application based technological relationships (Yu, Hurley, Kliebenstein, & Orazem, 2012). It shows that, during the process of creating an application, technologies are related to their substitutive, integrative, predecessor, and successor technologies (Geschka, 1983). Examples of substitutive technologies are electrical fuel cells, electrical batteries, and solar cells in the context of creating an energy supply application. A research project that aims to create a new approach for an energy supply application can combine all three substitutive technologies. Alternatively, it can focus on one technology, e.g. fuel cells. In that case, however, it has to consider research results from the other substitutive technologies, because the newly created fuel cell approach for energy supply has to be compared to existing potential energy supply applications to demonstrate its advantages.
This fuel cell project processes knowledge from electrical battery and solar cell technology and thus is also related to these two technologies, even if key words from electrical battery technology or from solar cell technology do not occur in the project description (Geschka, Lenk, & Vietor, 2002). Applied science research projects have to combine or consider these related technologies to create an application (Thorleuchter, Van den Poel, & Prinzie, 2010b). This describes a binary classification problem because the test examples (research projects) are associated with a specific class (a set of related technologies) (Kim, Toh, Teoh, Eng, & Yau, 2012).

To identify related technologies, LSI is used, because semantically the descriptions of related technologies share terms describing the technology or the application field. LSI identifies the semantic textual patterns in the descriptions of the technologies, and it also identifies the impact of each technology description on each semantic textual pattern (Thorleuchter & Van den Poel, 2012b). Each semantic textual pattern then represents the set of related technologies whose impact on it is larger than a specific threshold. The descriptions of the projects are projected into the same semantic subspace. Each project can then be assigned to a set of technologies based on the calculated impact of the project on each semantic textual pattern (Thorleuchter, Van den Poel, & Prinzie, 2012).

Previous work calculates the similarity between each project and each technology separately, assuming that all technologies are independent (Thorleuchter & Van den Poel, 2011a). It uses machine-learning techniques as supervised learning methods and a knowledge structure text classification approach that uses a similarity measure (Jaccard’s coefficient) as well as a specific threshold to enable a multi-label classification.
This knowledge structure approach often fails because prevalent features that are characteristic for a technology are not simultaneously present in all projects that belong to one technology.

In contrast to previous and related work, this work considers research results on the application based technological relationships mentioned above. Aspects that are relevant for this task are extracted and used in this approach. Related technologies are grouped in several sets, each represented by a semantic textual pattern, and each project has to be assigned to one set of related technologies. This can be done with a binary text classification instead of a multi-label classification, which enables the use of LSI as a binary semantic classification algorithm.

This approach is evaluated using the example of defense and security (D&S) based technological research projects, because the growing importance of this application field leads to an increasing number of research projects and to the appearance of many new technologies, as indicated by the occurrence of several technology lists or taxonomies (e.g. EDA, WEAG, STACCATO, ESRAB, MCTL, and DSTL) in recent years (Gericke et al., 2009; Te Kulve & Smit, 2003).

The results are compared to a standard text classification algorithm that applies a multi-label classification to the same data set. For each class (technology), a centroid vector is created that represents the term vectors of the training examples (projects) of this class (Takci & Güngör, 2012). This vector is the average of all vectors that are assigned to the class in the training phase. Term vectors from further research projects (test examples) are compared to all centroid vectors to identify similar centroid vectors.
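The centroid baseline described above can be illustrated with a minimal sketch. It assumes sparse term vectors stored as dictionaries and uses the generalized (weighted) Jaccard coefficient, sum of minima over sum of maxima, as the similarity measure; the text does not state which variant of Jaccard's coefficient is applied to weighted vectors, so this choice, the function names, the toy data, and the threshold value are all hypothetical.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term vectors (dicts: term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def weighted_jaccard(a, b):
    """Generalized Jaccard coefficient (assumed variant): sum(min) / sum(max)."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def classify(project, centroids, threshold):
    """Multi-label step: return every class whose centroid is similar enough."""
    return sorted(label for label, cen in centroids.items()
                  if weighted_jaccard(project, cen) >= threshold)


# Hypothetical toy training data: two technologies with two projects each.
training = {
    "fuel_cell": [{"fuel": 1.0, "cell": 1.0, "hydrogen": 1.0},
                  {"fuel": 1.0, "cell": 1.0, "membrane": 1.0}],
    "radar": [{"radar": 1.0, "antenna": 1.0},
              {"radar": 1.0, "signal": 1.0}],
}
centroids = {tech: centroid(vecs) for tech, vecs in training.items()}

test_project = {"fuel": 1.0, "cell": 1.0, "hydrogen": 1.0}
labels = classify(test_project, centroids, threshold=0.5)
print(labels)
```

Because the threshold is applied to each class independently, a project can receive none, one, or several labels, which is what makes this baseline a multi-label classifier.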
We use a well-known similarity measure (Jaccard's coefficient) and a specific threshold to assign test examples to classes, that is, to identify none, one, or several technologies for each project (Madjarov, Kocev, Gjorgjevikj, & Džeroski, 2012). The evaluation shows that the new LSI based approach outperforms the centroid based text classification algorithm with respect to the performance measures precision and recall.

2 Background

In this approach, we consider findings from the literature on application based technological relationships. Some important aspects are adapted to this approach and described in Sect. 2.1. The text classification approaches used in this study are described in Sect. 2.2, where it is also explained why LSI is a good means to identify the technological relationships from Sect. 2.1. Further, a knowledge structure based classification approach is selected for evaluation purposes. It outperforms other knowledge structure approaches with respect to the aspects in Sect. 2.1.

2.1 Application based technological relationships

A large body of literature studies the relationships between technologies (Choi et al., 2012; Subramanian & Soh, 2010; Radder, 2009; Jiménez, Garrido-Vega, Díez de los Ríos, & González, 2011; Herstatt & Geschka, 2002; Rubenstein et al., 1977; Fleck & Howells, 2001). Below, the most important findings are adapted specifically for this study.

a) An applied science research project can be classified according to a technology only if there is a relation between the project and the technology. The simplest relation is that a project contains research activities concerning the core area of a technology. Then both the project description and the technology description contain the same technology specific terms that describe the technological field.
Therefore, the project can be directly assigned to one technology by computing the similarity of both descriptions.

b) Technologies are not single data points; they describe a technological field that consists of many different research topics. Multiple research projects occur inside this field. Two research projects that focus on different topics in a technological field have project descriptions with different terms although they belong to the same technology. Therefore, prevalent features that are characteristic for a technology are not simultaneously present in all projects that belong to one technology.

c) Technological project descriptions show a high degree of term co-occurrence, because several technical terms that normally occur together in a text phrase are used to describe a technical topic. Therefore, conditioned on each technology and on each project, different terms do not occur independently.

d) Applied science research projects focus on an application field and use many different technologies. The literature indicates that these projects involve up to ten technologies. Therefore, these research project descriptions contain features from several different technologies.

e) If a research project is assigned to a technology and this technology is related to further technologies, then the project can be assigned to these further technologies, too. One kind of relationship is that technologies can be similar to other technologies. They deal with the same technology field but have a different focus, e.g. passive radar technology and active radar technology. Technologies are not completely delimited from their similar technologies, which means that similar technologies overlap in some research areas. Descriptions of similar technologies also contain technology specific terms that describe the technological field.
Then, a research project can be assigned to a similar technology by comparing the project description to the technology description.

f) A further relationship exists between a technology and its substitutive technologies. These technologies substitute each other, e.g. electrical fuel cells, electrical batteries, and solar cells in the context of energy supply. An applied science research project normally examines several substitutive technologies to create an application. Its description then contains terms from different technology fields. Comparing this description to a single technology description does not yield a large similarity because terms from the other technology fields do not appear in the technology description. If the research project examines fuel cell, electrical battery, and solar cell technology to an equal extent, then the similarity between the project description and the fuel cell technology description is about one third. Therefore, project and technology descriptions are needed that also contain terms describing the application field. The comparison then yields a higher similarity, and a project is more likely to be assigned successfully to a substitutive technology.

g) Integrative technologies, sometimes called complementary technologies, occur together when realizing an application. An example of two integrative technologies is fuel technology and lubricants technology, because both are used, e.g., to create a new power plant prototype. Additionally, predecessor or successor technologies are technologies that precede or succeed another technology in the process of creating an application. Thus, it is important to use project and technology descriptions that also contain terms describing the application field.
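The "about one third" argument in paragraph f) can be reproduced with a small worked example. The sketch below assumes that similarity is measured with Jaccard's coefficient on term sets (the measure used for the baseline in this paper) and that the three technology vocabularies are disjoint and of equal size; all term sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical technology specific vocabularies (disjoint, equal size).
fuel_cell = {"membrane", "hydrogen", "electrolyte", "anode"}
battery = {"lithium", "cathode", "charging", "capacity"}
solar_cell = {"photovoltaic", "silicon", "irradiance", "wafer"}
# Hypothetical application field vocabulary shared by all descriptions.
application = {"energy", "supply", "power", "portable"}

# A project that examines all three substitutive technologies equally.
project = fuel_cell | battery | solar_cell

# Without application terms: only the fuel cell third of the project matches.
sim_without = jaccard(project, fuel_cell)
# With application terms in both descriptions, the overlap grows.
sim_with = jaccard(project | application, fuel_cell | application)

print(round(sim_without, 3), round(sim_with, 3))
```

In this toy setting the similarity rises from 4/12 (one third) to 8/16 (one half) once both descriptions contain the application field terms, which is the effect paragraph f) relies on.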
2.2 Text Classification

In general, the aim of text classification is the assignment of pre-defined classes to text documents (Ko & Seo, 2009; Sudhamathy & Jothi Venkateswaran, 2012; Lin & Hong, 2011; Finzen, Kintz, & Kaufmann, 2012). For the identification of the technologies behind projects, a class can be defined in two different ways. First, each technology can be represented by one class. This definition leads to a multi-label classification (Thorleuchter, Weck, & Van den Poel, 2012a; Thorleuchter & Van den Poel, 2011c) because a project involves several technologies and thus should be assigned to several classes. Second, a set of related technologies can be represented by one class. As shown in Sect. 2.1, the descriptions of related technologies contain similar terms that describe application fields or technology areas. Based on these characteristic textual patterns, related technologies can be identified. This definition leads to a binary classification where a project is either assigned to a class or not.

Extracting technologies from lists or taxonomies normally leads to a large number of technologies. For example, in the case study (see Sect. 4), 2,850 technologies are extracted from the application field security and defense. Defining each technology as a class leads to a large number of classes, which probably causes performance problems in text classification. Semantic generalization by grouping related technologies is a good means to reduce the number of classes.

The assignment of a project to a technology or to a set of related technologies depends on semantic aspects (aspects of meaning) and not on knowledge structure aspects (aspects of words), as described in Sect. 2.1.
A single term (a word) that is characteristic for a technology does not have to appear in the description of a project, even if this project processes the technology, but a semantic textual pattern of several terms probably will. Thus, for the text classification approach proposed in this paper, it is more important to compare the aspects of meaning between a project and technologies than to compare the aspects of words between them (Park, Kim, Choi, & Kim, 2012). The aspects of meaning can be identified by calculating the semantic textual patterns.

2.2.1 Knowledge structure approaches

The most frequently used approaches in text classification are knowledge structure approaches. Examples of standard algorithms are k nearest neighbor (k-NN) classification as an instance-based learning algorithm, C4.5 as a decision tree model, naive Bayes (NB) as a simple probabilistic algorithm, and support vector machines (SVM) (Shi & Setchi, 2012; Lee & Wang, 2012). These approaches are not able to identify hidden semantic textual patterns. Despite this weakness, a knowledge structure approach is selected as the baseline for the evaluation to show the success of the semantic approach used.

We select the centroid-based approach as this baseline. It contrasts with some standard categorization algorithms in text classification, in which classes are described not by one centroid vector but by a number of training examples. Below, we explain in detail why a centroid-based text classification is used. Our explanations are based on the results of Han (2000), where extensive evaluations of centroid-based classifications and comparisons with other classifiers are described.

With a centroid-based scheme, the characteristics of each class can be summarized. Through this summarization, several prevalent features are joined together.
This is very important for our approach because terms that represent these technology-characteristic features are not simultaneously present in all research project descriptions that belong to the technology, as shown in Sect. 2.1. Therefore, comparing a term vector from a project (as test example) to a centroid vector leads to better performance than comparing it to the term vectors of the individual projects (training examples) that describe a class. A similar summarization can be found in the naive Bayes algorithm, where a distribution function representing the term probabilities is created for each class. Further algorithms (k-NN, C4.5, SVM, etc.) describe a class by a number of training examples and therefore do not use summarizations (Buckinx, Moons, Van den Poel, & Wets, 2004).

A further problem in text classification is the appearance of synonyms. Synonyms are different words with identical or at least similar meanings. They also occur in technological texts such as applied science research project descriptions (e.g. assign, associate, classify, correlate). With a summarization, commonly used synonyms are also summarized, which means that they appear in the centroid vector. Therefore, comparing a term vector from a research project to a centroid vector also considers synonyms. In this respect, too, the centroid-based scheme and the naive Bayes algorithm outperform k-NN, C4.5, and SVM, which do not use summarization.

Additionally, we consider the computational complexity of the centroid-based approach. This is relevant because, as shown above, we select 2,850 technologies in our case study, which leads to 2,850 classes and thus to a time-consuming training and classification phase. In the training phase, the centroid-based approach has a linear time complexity that depends on the number of training examples. In the classification phase, the complexity is also linear and depends on the number of classes.
Therefore, the computational complexity in total is very low, and it equals the complexity of the naive Bayes algorithm. Thus, the centroid-based scheme and naive Bayes perform better with respect to computational complexity than k-NN, C4.5, and SVM.

We also see advantages of the centroid-based algorithm over the naive Bayes algorithm, which applies Bayes' theorem with strong (naive) independence assumptions. Conditioned on each class, this means that different terms occur independently. However, as shown in Sect. 2.1, the independence assumption does not hold when project descriptions are used as training and test examples. Therefore, we expect the centroid-based algorithm to also outperform the naive Bayes algorithm. Thus, we use the centroid-based algorithm in the evaluation to compare the results of the selected semantic approach to this knowledge structure approach.

2.2.2 Semantic approaches

As mentioned above, computational techniques are needed that are able to identify the aspect of meaning by calculating the semantic textual patterns. These techniques use eigenvectors in different variations and apply them to statistical procedures (Jiang, Berry, Donato, Ostrouchov, & Grady, 1999; Luo, Chen, & Xiong, 2011). With these techniques, the hidden semantic patterns are built not only from words that occur in project or technology descriptions but also from words that might occur in these descriptions (Thorleuchter & Van den Poel, 2012d). This enables the identification of a similarity between a project and a set of technologies even if the words in the project description are completely different from the words in the technology descriptions (Tsai, 2012; Christidis, Mentzas, & Apostolou, 2012; Thorleuchter, Weck, & Van den Poel, 2012b).

This approach uses LSI as a well-known representative of these techniques.
It extracts a large number of semantic textual patterns and reduces their number by considering the values of the eigenvectors (Thorleuchter & Van den Poel, 2013). LSI is a good means for the identification of application based technological relationships because it fulfills the requirements from Sect. 2.1, as described below.

Paragraph a) in Sect. 2.1 indicates that the approach should be able to compute textual similarity between project and technology descriptions. LSI assigns project and technology descriptions to semantic textual patterns. Textual similarity between a project and a technology description can be assumed if both descriptions are assigned to the same semantic textual pattern.

Paragraph b) in Sect. 2.1 shows that prevalent features that are characteristic for a technology are not simultaneously present in all projects that belong to one technology. LSI as a semantic classification approach takes this into account by using a semantic indexing that also includes terms that are not mentioned explicitly in a text but are related to the corresponding topic.

Different terms do not occur independently in the technology or project descriptions, as indicated by paragraph c) in Sect. 2.1. LSI considers this by calculating relationships between projects and technologies based on semantic textual patterns.

LSI groups several technologies that are related during the process of creating an application. This means it considers the fact that a project description contains features from several different technologies, as mentioned in paragraph d) in Sect. 2.1.

Paragraphs e), f), and g) indicate that similar, substitutive, integrative, predecessor, and successor technologies have to be identified by considering terms that also describe the application field (besides the technology area).
LSI as a semantic classification approach considers all related terms (those describing a technology as well as those describing an application field).

3 Methodology

Fig. 1 shows the processing steps of our approach. As input, technology lists or taxonomies as well as information about research projects are selected. Technology descriptions are extracted from the technology lists or taxonomies. Further, project descriptions are identified or created from the research projects. The technology and project descriptions consist of terms that describe the technology area as well as the application fields, as assumed in Sect. 2.1. They are pre-processed by using tokenization, stop word filtering, and stemming. Further, term vectors in a vector space model are created for each technology description and for each project description.

LSI is applied to create the semantic textual patterns within the technology descriptions, where the impact of each technology on each semantic textual pattern is calculated. This impact is used to identify related technologies. Technologies with a high impact on a specific semantic textual pattern are grouped together in a set of technologies.

Project descriptions are projected into the created LSI subspace, where LSI calculates the impact of each project on each semantic textual pattern and thus on each set of technologies. To determine the optimal value of the rank k as the number of semantic textual patterns, a cross-validation procedure is applied to test and training data from the project descriptions.
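The projection of a project description into an existing LSI subspace is commonly written as a fold-in step in the LSI literature. The following is a sketch under the assumption that this standard formula is what the projection amounts to here; the symbols q and \hat{q} are introduced only for illustration, while U_k and \Sigma_k denote the term-by-pattern matrix and the diagonal matrix of the k largest singular values of a rank-k model.

```latex
% q       : weighted term vector of one project description (m x 1)
% U_k     : first k columns of U (m x k)
% Sigma_k : diagonal matrix of the k largest singular values (k x k)
% \hat{q} : coordinates of the project on the k semantic textual patterns (k x 1)
\hat{q} = \Sigma_k^{-1} \, U_k^{T} \, q
```

Under this reading, the l-th component of \hat{q} plays the role of the impact of the project on the l-th semantic textual pattern.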
An evaluation compares the assignment of projects to the related technologies by this LSI based approach with the assignment by a knowledge structure based classification approach (centroid based approach).

3.1 Pre-processing

The extracted textual information (technology and project descriptions) has to be pre-processed. The aim of this step is to create term vectors in a vector space model, because textual information in term vectors can be used for further processing, e.g. as input for a singular value decomposition.

In a first step, the textual information is prepared (Thorleuchter & Van den Poel, 2011b). This consists of raw text cleaning, where specific objects such as images or XML tags are removed. A dictionary is used to identify and correct typographical errors in the raw text. Tokenization splits the text into terms, where the term unit is defined as a word. Terms are converted to lower case (case conversion).

In a second step, the text is filtered to reduce the number of distinct terms. Different filtering methods are applied (Thorleuchter, Schulze, & Van den Poel, 2012). Part-of-speech tagging is used to identify the syntactic category of each term (e.g. nouns and verbs), and non-informative terms are identified based on this category. Stop word filtering is also used to identify terms that carry no content information. Non-informative terms are discarded (Thorleuchter & Van den Poel, 2012a). As a further filtering method, stemming is applied. Because words occur in different forms, stemming maps related words to a common basic form. In contrast to lemmatization, stemming does not consider the context of a word. This leads to problems when processing words with the same spelling but a different meaning.
However, after the pre-processing step, latent semantic indexing is applied to the terms, where the aspect of meaning is considered. Thus, it is not necessary to use lemmatization at this point. The basic form of a word is taken from a dictionary. If a term is not in the dictionary, then a set of production rules is applied to transform the word to its basic form. Terms that appear only once or twice are discarded, in accordance with the Zipf distribution (Zipf, 1949; Zeng et al., 2012).

The literature shows that term vectors of weighted frequencies outperform term vectors of raw frequencies (Thorleuchter, Van den Poel, & Prinzie, 2010d; Prinzie & Van den Poel, 2006; Van den Poel, De Schamphelaere, & Wets, 2004; Prinzie & Van den Poel, 2007). Thus, in a third step, vectors of weighted frequencies are created for each description. Based on the calculated weights, the importance of a term within the collection of all descriptions can be estimated (Sparck Jones, 1973). A term is assigned a large weight if it occurs frequently in a small number of descriptions and seldom in the other descriptions (Salton & Buckley, 1988). Based on the weighting scheme proposed by Salton, Allan, & Buckley (1994), the weight w_{i,j} for a term i in description j is calculated by

\[ w_{i,j} = \frac{tf_{i,j} \cdot \log(n / df_i)}{\sqrt{\sum_{p=1}^{m} tf_{p,j}^{2} \cdot \left(\log(n / df_p)\right)^{2}}} \qquad (1) \]

where n is the number of descriptions, m is the dimension of the term vectors, df_i is the number of descriptions containing term i, tf_{i,j} is the frequency of term i in description j, and log(n / df_i) is the inverse description frequency idf_i (Chen, Chiu, & Chang, 2005). The different lengths of the descriptions are taken into account by the length normalization factor in the divisor of the formula.
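As an illustration of the weighting in Eq. (1), the following minimal sketch computes length-normalized weights for already tokenized descriptions. The function name `weight_vectors` and the toy descriptions are hypothetical, and the natural logarithm is an assumption (the base of the logarithm cancels in the normalization, so it does not affect the result).

```python
import math
from collections import Counter


def weight_vectors(descriptions):
    """Length-normalized tf-idf weights for a list of tokenized descriptions."""
    n = len(descriptions)
    # df[i]: number of descriptions that contain term i
    df = Counter(term for doc in descriptions for term in set(doc))
    vectors = []
    for doc in descriptions:
        tf = Counter(doc)
        # Numerator of Eq. (1): tf_{i,j} * log(n / df_i)
        raw = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Divisor of Eq. (1): Euclidean length of the raw weight vector.
        # Terms absent from the description have tf = 0 and add nothing.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return vectors


# Hypothetical toy descriptions after tokenization, filtering, and stemming.
docs = [
    ["fuel", "cell", "energy", "supply"],
    ["solar", "cell", "energy", "supply"],
    ["radar", "antenna", "signal", "radar"],
]
weights = weight_vectors(docs)
for vec in weights:
    print({term: round(w, 3) for term, w in sorted(vec.items())})
```

Each resulting vector has unit Euclidean length, and a term that occurs in every description receives the weight zero because log(n / df_i) = log(1) = 0.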
3.2 Identification of hidden semantic textual patterns with singular value decomposition

Based on the calculated vectors of weighted frequencies, a term-by-description matrix can be created. The dimensionality of this matrix is large because of the large number of distinct terms. Most of the terms occur frequently in only a few descriptions and not in the others. This leads to many zero values in the matrix and thus to a small matrix rank.

To reduce the dimensionality of the matrix, LSI is used together with a matrix factorization technique. LSI summarizes terms with respect to their semantics (Deerwester et al., 1990). Singular value decomposition as the matrix factorization technique identifies the relationships between terms based on their co-occurrences in the descriptions. All related terms are grouped into a semantic textual pattern, and each semantic textual pattern has high discriminatory power with respect to the other patterns (Thorleuchter & Van den Poel, 2012c).

The singular value decomposition algorithm associates each semantic textual pattern with a singular value. The singular values are obtained by splitting the term-by-description matrix A into a product of the matrices U, Σ, and V^T:

\[ A = U \Sigma V^{T} \qquad (2) \]

Matrix A consists of m terms and n descriptions (m x n matrix) and has rank r (r ≤ min(m,n)) because of the many zero values in the matrix. Matrix U consists of m terms and r semantic patterns (m x r matrix), matrix V consists of n descriptions and r semantic patterns (n x r matrix), and matrix Σ consists of the r singular values of matrix A. Thus, Σ is a diagonal (r x r) matrix, and the singular values are sorted in descending order.

For the singular value decomposition, the rank r is important. A large value of r leads to an unmanageably high number of semantic textual patterns.
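The decomposition in Eq. (2) and the shapes of the three factors can be checked with a short sketch. It assumes NumPy's `numpy.linalg.svd` as the factorization routine, which is only one possible implementation and not necessarily the one used in this study; the small term-by-description matrix is hypothetical.

```python
import numpy as np

# Hypothetical 5-term x 4-description matrix of weighted frequencies.
A = np.array([
    [0.8, 0.7, 0.0, 0.0],   # "fuel"
    [0.6, 0.5, 0.0, 0.1],   # "cell"
    [0.0, 0.0, 0.9, 0.8],   # "radar"
    [0.0, 0.1, 0.4, 0.6],   # "antenna"
    [0.3, 0.3, 0.2, 0.2],   # "energy"
])

# Thin SVD: U is m x min(m, n), s holds the singular values in
# descending order, and Vt is the transpose of V.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r numerically non-zero singular values (r = rank of A).
r = int(np.sum(s > 1e-10))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
Sigma = np.diag(s)

print("shapes of U, Sigma, V:", U.shape, Sigma.shape, Vt.T.shape)
print("singular values:", np.round(s, 3))
print("reconstruction error:", np.linalg.norm(A - U @ Sigma @ Vt))
```

The columns of U relate terms to semantic patterns and the columns of V relate descriptions to semantic patterns, matching the m x r and n x r shapes given in the text.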
In this case, many semantic textual patterns occur only in a single description and not in several descriptions. For a technology classification, it is important to identify the relationships between different technologies as represented by the technology descriptions. Thus, the semantic textual patterns that are relevant for this task are those that capture the relationships between terms based on their co-occurrences in the collection of descriptions.

These semantic textual patterns can be identified by reducing the rank r to a parameter k. As shown above, if k is too large, e.g. k = r, then too many semantic textual patterns are built that are not relevant. Conversely, if k is too small, then many relevant semantic textual patterns are not considered. Chen et al. (2010) propose the use of an operational criterion to obtain an optimal value of k. We satisfy this by calculating the cross-validated area under the ROC (receiver operating characteristics) curve (AUC) for each k (DeLong, DeLong, & Clarke-Pearson, 1988; Hanley & McNeil, 1982; Halpern et al., 1996; Van Erkel & Pattynama, 1998). For this, we construct several rank-k models as described below.

Based on the selection of a specific k, three matrices U_k, Σ_k, and V_k are calculated, where the first k columns of U, Σ, and V are retained while the columns from k+1 on are discarded. Thus, the new term-by-description matrix A_k is based on the reduced matrix rank k