Weak Signal Identification with Semantic Web Mining

Dirk Thorleuchter a,*, Dirk Van den Poel b

a Fraunhofer INT, D-53879 Euskirchen, Appelsgarten 2, Germany, dirk.thorleuchter@int.fraunhofer.de
b Ghent University, Faculty of Economics and Business Administration, B-9000 Gent, Tweekerkenstraat 2, Belgium, dirk.vandenpoel@ugent.be, URL: http://www.crm.UGent.be

* Corresponding author at: Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany. Tel.: +49 2251 18305; fax: +49 2251 18 38 305. E-mail address: Dirk.Thorleuchter@int.fraunhofer.de (D. Thorleuchter).

Abstract

We investigate the automated identification of weak signals according to Ansoff to improve strategic planning and technological forecasting. The literature shows that weak signals can be found in an organization's environment and that they appear in different contexts. We use internet information to represent the organization's environment, and we select those websites that are related to a given hypothesis. In contrast to related research, the proposed methodology uses latent semantic indexing (LSI) to identify weak signals. This improves existing knowledge-based approaches because LSI considers aspects of meaning and is therefore able to identify similar textual patterns in different contexts. A new weak signal maximization approach is introduced that replaces the prediction modeling approach commonly used with LSI. It calculates the largest number of relevant weak signals represented by singular value decomposition (SVD) dimensions. A case study identifies and analyzes weak signals to predict trends in the field of on-site medical oxygen production. This supports the planning of research and development (R&D) for a medical oxygen supplier. The results show that the proposed methodology enables organizations to identify weak signals from the internet for a given hypothesis. This helps strategic planners to react ahead of time.

Key words: Weak signal, Ansoff, Latent semantic indexing, SVD, Web mining

1 Introduction

Successful planning of research and development (R&D) requires an overview of current and future environmental conditions (Choi, Kim, & Park, 2007) in order to predict, in time, the emergence of new technological approaches - the technology push (Thorleuchter, 2008) - and changes in consumers' needs - the market pull (Thorleuchter, Van den Poel, & Prinzie, 2010d). The literature introduces the concept of environmental scanning (Abebe, Angriawan, & Tran, 2010; Tabatabei, 2011), which enables this prediction by extracting and analyzing information from the environment, especially to identify events, trends, and relationships (Choo & Auster, 1993). Environmental scanning realizes a predictive view by applying a weak signal approach (Ansoff, 1975). A weak signal is an event or a development whose impact on a target (e.g. on an organization's R&D) cannot be estimated accurately, because a single weak signal may well appear by chance (Ansoff, 1982). However, several weak signals from different sources that aim at a common target are probably a hint that this target will be impacted in the future. Thus, environmental changes that point to future problem areas and opportunities can be predicted in advance.
This enables the use of weak signals as an early warning system for strategic planning.

As shown by Decker, Wagner, and Scholz (2005), the internet is a valuable information source for environmental scanning and thus for detecting weak signals. A website or a document is normally not a weak signal in itself; however, a website or a document may contain a textual pattern that represents a weak signal (Uskali, 2005). Thus, full text access to information on the internet is necessary to identify these weak signals. Because of the large number of websites, performance reasons enforce a (semi-)automated approach, e.g. web mining, rather than manual scanning by humans (Gericke et al., 2009; Tabatabei, 2011).

Especially for R&D planning, information about three areas has to be considered (Thorleuchter, Van den Poel, & Prinzie, 2010c): science for new technological aspects (technology push), users for new product ideas (market pull), and industry for new product development aspects (the link between technology and market). Technological research results are described in articles published in scientific journals, in conference proceedings, and in various scientific document repositories. In recent years, access to the full text of these articles via the internet has become much easier because of the increased number of open access journals and articles available today. Further, some publishers (e.g. Elsevier) offer open archives that enable full text access to articles after a specific embargo period. Additionally, some publishers allow manuscript posting, where accepted manuscripts can be posted on authors' personal or institutional websites. The Google Books initiative enables full text access to selected pages of conference proceedings published in books. This shows that, in contrast to several years ago, full text access to a large number of scientific articles is available today via the internet (Thorleuchter, Van den Poel, & Prinzie, 2010a). Information about new product development can be found on companies' websites and in business magazines. Today, many magazines publish articles on their websites, so full text access to this information is also available. Patents, which represent both scientific results and industrial products, are also published in full text on the internet (Thorleuchter, Van den Poel, & Prinzie, 2010b). Information about new product ideas from users can be found in internet forums, blogs, microblogs, review sites, etc. Full text access to this information via the internet is possible, too. Overall, the planning of R&D can be supported today by environmental scanning and weak signal detection using full text information from the internet.

The proposed methodology uses semantic text classification combined with an automated web mining approach for environmental scanning and weak signal detection. This is in contrast to related research, where knowledge structure based text classification approaches are used (Yoon, 2012). The use of semantic text classification accounts for the fact that weak signals are formulated by different persons, in different languages, and in different contexts. Two textual patterns representing weak signals may be related to a specific topic even if they do not share a common word.
This relationship can only be identified with semantic approaches that consider aspects of meaning rather than aspects of words (Thorleuchter & Van den Poel, 2013b).

A further contrast to related research is the use of a new weak signal maximization approach. Existing studies that investigate latent semantic indexing, a well-known semantic approach, apply prediction modeling to calculate a performance-optimized number of singular value decomposition (SVD) dimensions (Thorleuchter & Van den Poel, 2012e). They use training and test sets that consist of a well-balanced number of positive and negative examples (Thorleuchter & Van den Poel, 2013a). The creation of such training and test sets is not applicable to weak signal identification, because weak signals for a specific topic occur with low frequency. The number of positive examples for a specific topic is not sufficient to create a well-balanced training and test set. Further, an evaluation of weak signals' impacts can only be done by considering the collection of all weak signals. Thus, a new weak signal maximization approach is proposed that identifies the maximal number of weak signals for a specific topic to enable such an evaluation.

Up to now, practical approaches to weak signal identification based on wide-scope environmental scanning have failed. High-tech companies in Europe had problems realizing weak signal detection and evaluation because of the high manual effort caused by the lack of environmental scanning tools and the low quality of the results (Schwarz, 2005). Existing successful practical approaches to weak signal identification are restricted to a small number of documents, e.g. 50 selected web pages (Decker et al., 2005) or the financial news articles of one Finnish newspaper (Uskali, 2005). The proposed semi-automated methodology bridges these gaps by implementing web mining based environmental scanning and semantic weak signal identification. This enables a wide scope for environmental scanning, a low manual effort for human experts, and an improved identification performance.

In a case study, the proposed methodology is applied in the field of on-site medical oxygen production. R&D planners have provided a hypothesis concerning future developments. The methodology identifies relevant weak signals that are related to the given hypothesis. The weak signals do not verify or falsify the hypothesis; however, they show whether the hypothesis is in accordance with current trends extracted from the internet. This supports R&D planners in their decision-making process.

Overall, a methodology is proposed that enables a practical use of the weak signal concept, considering a wide scope of information from the internet, aspects of meaning, and performance aspects to reduce the manual effort. Trends and developments can be identified in advance, and they are a valuable source for R&D planners to support their decision making.

2 Background

2.1 Using the internet for R&D environmental scanning

The internet contains a huge amount of information, and the literature shows that the added value of this information outperforms the added value gained from traditional information sources (D'Haen, Van den Poel, & Thorleuchter, 2012). Organizations use the internet in different ways, e.g.
for collecting and analyzing information from the organization's customers (Alallak, 2010) and from competing organizations (Teo & Choo, 2001) to advance the organization's strategic planning (Purandre, 2008). Web mining approaches support organizations in collecting information because they offer an automated way to scan the internet for relevant information on websites (Kosala & Blockeel, 2000; Kobayashi & Takeda, 2000). They apply automated filtering algorithms to reduce the large number of websites identified by search engines. This is necessary to overcome performance restrictions, because many retrieved and filtered results represent non-relevant information, which leads to low precision values in information retrieval. Further, many relevant documents are not retrieved by the internet search engine at all, which leads to low recall values. In recent years, information about the R&D environment (science, industry, and consumers) has become available and accessible on the internet, as shown in the introduction. This opens the opportunity to use the internet for R&D environmental scanning today.

2.2 Weak signal identification for R&D

The concept of weak signals has been introduced as an early warning system to advance strategic planning (Ansoff, 1975; Tabatabei, 2011). It enables the timely identification of future events or developments that are relevant for a decision maker (Kuosa, 2010). In the following, such future events and developments are called topics. The literature introduces many different definitions of weak signals, and most of them describe weak signals as unstructured information with low content value (Mendonça et al., 2004). In a first stage, weak signals reflect aspects of a threat or an opportunity. Then, their information content increases more and more, e.g. they also describe the origin of a threat or an opportunity. Finally, in a second stage, weak signals become strong signals, and they indicate possible actions in the future (Holopainen & Toivonen, 2012). Examples of weak or strong signals are articles in newspapers describing a specific topic, changes in the sentiments of experts concerning this topic, and trends in jurisdiction with impact on this topic (Mendonça, Cardoso, & Caraça, 2012).

Strong signals point to a concrete topic that will occur with medium to high probability. A large number of strong signals for a specific topic can be found on the internet, because the topic is mentioned and discussed widely on several websites, in news articles, in internet blogs, etc. Strong signals are not of interest for strategic planning because they occur too late to be considered in strategic decision making, and thus they do not provide a preview of environmental changes (Yoon, 2012). In contrast to the high-frequency occurrence of strong signals concerning a specific topic, weak signals occur with low frequency. Further, they can be used for strategic decision making because they occur early enough. For a specific topic, only a small number of weak signals can be seen, and it is hard to identify them within the large amount of information on the internet. This is the reason why many implementations of weak signal identification approaches fail in practice (Schwarz, 2005).
Further, the occurrence of one weak signal is not sufficient for a predictive view; however, the occurrence of several weak signals that point to the same topic may give a hint of future changes (Hiltunen, 2008). Thus, several weak signals concerning the same topic have to be identified and used for strategic decision making (Ilmola & Kuusi, 2006; Rossel, 2009; Tabatabei, 2011).

Strategic R&D decisions are a subset of strategic decisions. Descriptions of R&D topics are characterized by the occurrence of domain-specific technical words (characteristic terms). This is in contrast to colloquial language, where the meaning of a specific term is often not clearly defined. Texts describing R&D topics also show above-chance frequent co-occurrences of technical terms. This means that a specific technical term occurs together with a further term more frequently than would be expected by chance. Both characteristics, the occurrence of characteristic terms and the frequent co-occurrences, make weak signals easier to identify than in colloquial texts.

2.3 Latent semantic indexing for weak signal identification

With text classification, texts can be assigned to different classes. The classes have to be defined in advance, either manually by human experts or automatically by use of a set of training examples and machine-based learning (Ko & Seo, 2009; Lin & Hong, 2011; Sudhamathy & Jothi Venkateswaran, 2012; Finzen, Kintz, & Kaufmann, 2012). Knowledge structure approaches are commonly used as instance-based learning algorithms for classification. Examples are k-nearest-neighbor classification, simple probabilistic algorithms (e.g. naïve Bayes), decision tree models (e.g. C4.5), and support vector machine algorithms (Buckinx, Moons, Van den Poel, & Wets, 2004; Lee & Wang, 2012; Shi & Setchi, 2012). These approaches have already been used for weak signal identification in theory (Tabatabei, 2011). However, they do not consider semantic aspects of the information. This is important because several weak signals for a specific topic have to be identified, and they are normally formulated by different persons. Thus, considering aspects of meaning is important for identifying related weak signals. Further, some of the knowledge structure approaches do not consider dependencies between terms. Weak signals in R&D are characterized by the above-chance frequent co-occurrence of technical terms; thus, it is also important to consider these dependencies.

In contrast to knowledge structure approaches, semantic approaches are better suited to considering aspects of meaning and to calculating term dependencies. The calculation of these semantic relationships between terms is based on computational eigenvector techniques from algebra (Jiang, Berry, Donato, Ostrouchov, & Grady, 1999; Luo, Chen, & Xiong, 2011). Terms that occur together in a textual pattern are considered, as well as terms that might occur in this textual pattern (Thorleuchter & Van den Poel, 2012c; Thorleuchter & Van den Poel, 2012d). Semantic indexing is normally applied using LSI. Text patterns standing behind several documents from a document collection are identified (Park, Kim, Choi, & Kim, 2012). These text patterns also enable a clustering of the documents.
They consist of a list of semantically related terms, and the meaning expressed by the set of these terms is stated in different documents from the collection (Christidis, Mentzas, & Apostolou, 2012; Tsai, 2012). Further, the impact of each document on each text pattern is calculated. This takes term dependencies well into account (Thorleuchter, Van den Poel, & Prinzie, 2012). Thus, LSI can be used to identify weak signals.

Many modern approaches with a better theoretical foundation and better performance than LSI have been introduced and applied in the literature, e.g. 'Probabilistic Latent Semantic Indexing' (Hofmann, 1999), 'Latent Dirichlet Allocation' (Blei, Ng, & Jordan, 2003; Ramirez, Brena, Magatti, & Stella, 2012), and 'Non-Negative Matrix Factorization' (Lee & Seung, 1999; Lee & Seung, 2001). However, these improved approaches have a higher computational complexity than LSI, and applying them to a very large-scale document collection retrieved from the internet is difficult. Thus, LSI is used in the proposed approach to show its feasibility, in the knowledge that the performance can be improved by using the modern approaches instead of LSI. A recent and very interesting approach is proposed by Ramírez and Brena (2012). Their query based topic modeling approach allows analyzing very large-scale collections with at least similar or even better performance than the above-mentioned modern approaches by reducing computational complexity. As the case study (see Sect. 4) was already finished before we became aware of this approach, its use is an interesting avenue for future research.

As mentioned in Sect. 2.2, weak signals occur with low frequency, which means they occur on a small number of websites. Using weak signals as classes for text classification fails because the classes cannot be defined in advance by human experts. Further, a machine-based learning approach cannot be performed, because it cannot be guaranteed that the number of positive training examples is above the threshold needed to ensure statistical significance of the classification results. This excludes the use of knowledge structure or semantic classification approaches. To overcome these limitations, the classes have to be defined by a semantic clustering approach, in which the identified semantic textual patterns are evaluated to identify weak and strong signals in a textual collection.

3 Methodology

Fig. 1 shows the steps of the proposed methodology. The methodology identifies semantic textual patterns from the internet and analyzes them to identify weak and strong signals. It applies environmental scanning via the internet as described in Sect. 2.1, and it considers the characteristics of weak signals as described in Sect. 2.2 by applying a semantic clustering approach (see Sect. 2.3).

The methodology starts with a hypothesis in which an existing strategic decision problem is described. The weak signals are identified based on the topic described in this hypothesis. Thus, it is important that the hypothesis is formulated clearly and comprehensibly.
The words used to formulate the hypothesis are the input for the next step, the creation of search queries. It is important that the search queries are created by human experts with high quality, so that many relevant documents are retrieved and non-relevant documents are not considered. The full text of the retrieved documents is crawled. Terms in the full text are compared to terms in the hypothesis to identify the relevant sections within each document. These sections are used for further processing, while the other sections are discarded. LSI is applied to the data, and it creates a number of k different semantic textual patterns. They represent semantic aspects that occur in several different documents retrieved from the internet, and thus they can be used to represent strong and weak signals (see Sect. 2.2). The selection of k is based on a new weak signal maximization approach that builds on the weak signal characteristics from Sect. 2.2. This ensures that k is large enough to cover all weak signals and small enough to discard semantic textual patterns that are not in accordance with the weak signal characteristics. The semantic textual patterns are split into weak signals and strong signals. The relevance of each weak signal is analyzed manually with respect to its impact on relevant terms. The development of each weak signal is calculated. As a result, the developments of the weak signals are presented to the decision maker to improve strategic decision making.

3.1 Web Mining and Text Mining

Search queries are created manually to represent the hypothesis. A single search query is often not suited to cover a topic; thus, several search queries have to be created to cover all the different aspects of the hypothesis. To ensure a broad environmental scan, it has to be considered that websites are written in different languages. Thus, search queries created for a specific language have to be translated into the other languages. This should also be done manually by human experts to ensure a higher quality than a search engine can offer through automated translation.

The next steps are processed automatically. Each query is executed by an internet search engine, and the search is restricted to the corresponding language of the query. The query results (the URLs of the websites) are passed to a crawler that extracts the full text from all URLs. The retrieved results are stored as documents (one document per URL) in folders separated by the languages of the corresponding websites (Thorleuchter & Van den Poel, 2011b).

The full texts of the retrieved documents are preprocessed and filtered to reduce complexity. In a first step, the raw text is cleaned of scripting code, images, and HTML tags. Special characters and punctuation are eliminated, and typographical errors are corrected by using a dictionary of the corresponding language. Single words (terms) are identified by tokenization, and case conversion is applied. In a second step, filtering methods are applied to reduce the number of terms for further processing. This ranges from stop word filtering (removing non-informative terms) via part-of-speech tagging (filtering specific syntactic categories) up to stemming (conflating terms with the same stem). Lemmatizing is not applied because existing practical methods are still error-prone. The number of terms is reduced further by applying the Zipf distribution (Zipf, 1949; Zeng et al., 2012).
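A minimal Python sketch of this preprocessing chain is given below for illustration. It reflects our own assumptions (NLTK with the 'stopwords' corpus downloaded; the function names preprocess and zipf_filter, and the Zipf cut-off values, are ours), not the exact implementation used in the case study.

```python
# Illustrative preprocessing sketch: cleaning, tokenization, case conversion,
# stop word filtering, stemming, and a simple Zipf-based term reduction.
import re
from collections import Counter
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))

def preprocess(raw_html):
    """Turn one crawled document into a list of stemmed terms."""
    text = re.sub(r'<[^>]+>', ' ', raw_html)             # strip html tags
    text = re.sub(r'[^A-Za-z\s]', ' ', text)             # remove special characters
    tokens = text.lower().split()                        # tokenization, case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word filtering
    return [STEMMER.stem(t) for t in tokens]             # stemming

def zipf_filter(documents, min_df=2, max_doc_share=0.5):
    """Drop terms from both ends of the Zipf distribution: very rare terms
    (likely noise) and terms occurring in too many documents
    (little discriminatory power). Cut-off values are assumptions."""
    df = Counter(t for doc in documents for t in set(doc))
    n = len(documents)
    keep = {t for t, f in df.items() if f >= min_df and f / n <= max_doc_share}
    return [[t for t in doc if t in keep] for doc in documents]
```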
The preprocessed full texts of the retrieved documents have different lengths, from several bytes up to several megabytes, depending on the corresponding websites. A text is normally split into several sentences. However, a website may cover several different topics; then the text of one sentence may be relevant for the topic described in the hypothesis, while the text of other sections or sentences is not. Therefore, the full text of each retrieved document is split into its sentences, and a text similarity measure is used to compare the terms of each sentence to the terms of the description of the hypothesis. For this, term vectors in the vector space model have to be built, and Jaccard's coefficient can be used because it handles term vectors of different lengths well (Thorleuchter & Van den Poel, 2011a). A similarity value above a specific threshold shows that the corresponding sentence is relevant for the hypothesis and can be used for further processing; the other sentences are deleted. This reduces the length of the documents, and it ensures that the information used for latent semantic indexing is relevant.

The documents are written in different languages. However, latent semantic indexing requires that all documents are written in the same language. The translation of documents into a target language can be done automatically, e.g. by use of the Google Translate application programming interface (API), which offers an automated translation of a document collection. The quality of automated translations is low compared to the quality of a manual translation by a human expert. However, a manual translation of each document would lead to high effort because of the large number of documents. Further, a high-quality translation is not necessary, because the translated text is transformed into term vectors in the next step. Thus, grammatical aspects are not of interest, and it is sufficient to translate the relevant terms one-to-one. This can be done in good quality with automated approaches, too.

Term vectors in the vector space model are created from the collection of all translated documents. For the components of the vectors, weighted frequencies are used, because the literature shows that they outperform raw frequencies (Prinzie & Van den Poel, 2007; Prinzie & Van den Poel, 2006; Van den Poel, De Schamphelaere, & Wets, 2004). Weighted frequencies show the impact of a term on the collection of all documents (Sparck Jones, 1973). Large weights are assigned to terms that occur frequently in a very small number of documents and that do not or seldom occur in most of the documents of the collection (Salton & Buckley, 1988). The well-known term weighting scheme proposed by Salton, Allan, and Buckley (1994) is used to calculate the weight w_{i,j} of term i in document j:

w_{i,j} = \frac{tf_{i,j} \cdot \log(n/df_i)}{\sqrt{\sum_{p=1}^{m} tf_{p,j}^{2} \cdot \left(\log(n/df_p)\right)^{2}}}    (1)

In formula (1), n is the number of documents in the collection, m is the number of components of the term vectors, and df_i is the number of documents that contain term i. The inverse document frequency, represented by log(n/df_i), and the term frequency tf_{i,j} are used (Chen, Chiu, & Chang, 2005). The divisor is a length normalization factor that accounts for the different lengths of the documents.
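Equation (1) can be computed directly on a raw term-frequency matrix. The following NumPy sketch is our own illustration (the function name tfidf_weights is an assumption, and every term is assumed to occur in at least one document so that df_i > 0).

```python
import numpy as np

def tfidf_weights(tf):
    """tf: (m terms x n documents) matrix of raw term frequencies tf_{i,j}.
    Returns the length-normalized tf-idf weights w_{i,j} of Eq. (1).
    Assumes df_i > 0 for every term and a nonzero weight in every document."""
    n = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)      # df_i: documents containing term i
    w = tf * np.log(n / df)[:, None]       # tf_{i,j} * log(n / df_i)
    norm = np.sqrt((w ** 2).sum(axis=0))   # per-document length normalization
    return w / norm                        # each column is a unit-length document vector
```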
3.2 Latent semantic indexing

The created vectors of weighted frequencies are composed into a term-by-document matrix A with rank r (r ≤ min(m, n)). The large number of components of the term vectors leads to a large dimensionality of the matrix. Further, many terms occur in only a small number of documents, so the corresponding term vector component is zero in many documents. In total, this leads to term vectors with many zero-valued components and to a term-by-document matrix that also contains many zero values. This fact is exploited by singular value decomposition to reduce the dimensionality of the matrix, because it can be expected that the rank of the matrix is lower than its dimensionality.

Singular value decomposition is a commonly used matrix factorization technique that is applied within LSI. Reducing the dimensionality of the matrix summarizes terms with respect to aspects of meaning (Deerwester et al., 1990). Aspects of meaning are calculated based on the co-occurrences of terms in the documents. This enables grouping semantically related terms together with high discriminatory power between groups. The groups of terms represent semantic textual patterns (Thorleuchter & Van den Poel, 2012b). Singular value decomposition calculates a singular value for each group and thus for each semantic textual pattern. The singular values are sorted in descending order in a diagonal (r x r) matrix Σ. Singular value decomposition calculates two further matrices: the (m x r) matrix U shows the impact of the terms on the semantic textual patterns, and the (n x r) matrix V shows the impact of the documents on the semantic textual patterns. The calculation is shown in formula (2):

A = U Σ V^T    (2)

Matrix U and matrix V are used for the weak signal maximization approach introduced in Sect. 3.3.
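A rank-k model can be obtained from any standard SVD routine; the sketch below uses NumPy (the function name lsi is ours; for the large sparse matrices of a real web crawl, a truncated solver such as scipy.sparse.linalg.svds would be the usual choice).

```python
import numpy as np

def lsi(A, k):
    """Decompose the term-by-document matrix A as in Eq. (2) and keep the
    k strongest semantic textual patterns. Returns U_k (m x k, term impacts),
    s_k (k singular values, descending), and V_k (n x k, document impacts)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * Vt
    return U[:, :k], s[:k], Vt[:k, :].T
```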
3.3 Weak signal maximization approach for latent semantic indexing

The literature shows how to reduce the dimensionality of the matrix from r to k with singular value decomposition (Chen, Chu, & Chen, 2010). The data is split into a training and a test set, both containing a specific percentage of positive examples (Thorleuchter, Herberz, & Van den Poel, 2012). The training set is used to build an LSI subspace (Zhong & Li, 2010), and the test set is projected into this subspace to calculate the predictive performance. The performance of each rank-k model is measured using the area under the receiver operating characteristic curve, logistic regression, and n-fold cross-validation (DeLong, DeLong, & Clarke-Pearson, 1988; Hanley & McNeil, 1982; Halpern et al., 1996; Migueis, Van den Poel, Camanho, & Cunha, 2012; Van Erkel & Pattynama, 1998). As a result, a value of k that is optimal with respect to computational complexity and predictive performance is selected.

These existing approaches cannot be used to identify an optimized number of semantic textual patterns representing weak signals from the internet, because weak signals occur with low frequency, as shown in Sect. 2.2. Thus, the percentage of positive examples in a randomly selected training and test set is very low, and to obtain significant results, the training and test sets would have to be very large. The sets contain unseen documents retrieved by an internet search; for training and testing, these documents would have to be evaluated manually by human experts concerning the occurrence of weak signals. This would cause an unmanageably high manual effort.

Our proposed weak signal maximization approach identifies the value of k for which the number of weak signals, i.e. low-frequency semantic textual patterns with a strong relationship to the given hypothesis, is maximized. Singular value decomposition calculates k semantic textual patterns from the collection of all retrieved internet documents. It is characteristic of this processing that a small k leads to a small number of semantic textual patterns that are each impacted by a large number of documents. These patterns occur frequently in the collection of all documents, and thus they are not weak signals. Hence, using singular value decomposition with a very small k does not identify any weak signals. A very large k leads to a small number of patterns impacted by a large number of documents and to a very large number of patterns impacted by a very small number of documents. As shown in Sect. 2.2, weak signals should occur more than once or twice; otherwise, they probably occur by chance. To identify a weak signal, the number of documents with impact on the pattern should therefore be above a specific threshold. Thus, a very large k also leads to no identified weak signals at all.

Weak signals are not only defined by the number of impacted documents; they should also be related to the given hypothesis. Thus, the relevant terms of a semantic textual pattern, defined as the terms whose impact on the pattern is above a specific threshold, are compared to the relevant terms of the given hypothesis by using text similarity measures. A similarity value above a specific threshold shows that a low-frequency semantic textual pattern is also related to the given hypothesis, and thus it can be classified as a weak signal. Patterns that are impacted by a large number of documents are also impacted by the large number of terms from these documents; comparing these terms to the terms of the given hypothesis normally leads to a low similarity value. Conversely, patterns that are impacted by a very small number of documents, e.g. one or two, are impacted by only a very small number of terms; here, too, a low similarity value is obtained. Thus, a maximal number of weak signals can be identified if k is neither too small nor too large. Several rank-k models are created, and the number of weak signals is calculated for each model, as sketched below. The comparison is done with a similarity measure, e.g. Jaccard's coefficient, which handles input vectors of different lengths well, because the term vector created from the hypothesis normally differs in size from the term vectors created from the semantic textual patterns.
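The core of this maximization can be sketched as follows. The code is our own illustration: the default thresholds are those later reported in Sect. 4.2, the function names are assumptions, and absolute impact values are used because the signs of SVD factors are arbitrary.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard's coefficient between two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def count_weak_signals(U, V, terms, hypothesis_terms, k, doc_impact=0.4,
                       term_impact=0.4, similarity=0.3, low=0.02, high=0.06):
    """Count the rank-k patterns that qualify as weak signals: the share of
    impacted documents lies in the low-frequency band, and the pattern's
    relevant terms are sufficiently similar to the hypothesis terms.
    U (m x >=k) and V (n x >=k) are the factors from Eq. (2)."""
    count = 0
    for p in range(k):
        share = np.mean(np.abs(V[:, p]) >= doc_impact)   # share of impacted documents
        if not (low <= share < high):
            continue                                     # not a low-frequency pattern
        pattern_terms = [terms[i] for i in np.where(np.abs(U[:, p]) >= term_impact)[0]]
        if jaccard(pattern_terms, hypothesis_terms) >= similarity:
            count += 1                                   # related to the hypothesis
    return count

# The maximization itself is then a search over candidate ranks:
# best_k = max(candidates, key=lambda k: count_weak_signals(U, V, terms, hyp, k))
```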
4 Case Study

The case study applies the proposed methodology to support the R&D planners of a company that offers medical oxygen to hospitals. This is normally done in two different ways: off-site and on-site production. Off-site production means that the medical oxygen is produced by oxygen generators at the company, stored in high-pressure gas cylinders, and transported to the hospitals. On-site production means that the company sells oxygen generators to hospitals and the oxygen is produced in the hospitals. The production of medical oxygen is based on two different methods that separate air into its components. Cryogenic air separation uses a low-temperature rectification principle based on the fact that gases change their states of aggregation at different temperatures. This method can produce a large quantity of oxygen with a high purity of more than 99%, which can also be used to obtain liquid oxygen. Non-cryogenic air separation uses the pressure swing adsorption (PSA) principle. It produces oxygen with a purity of about 93%. The R&D department of the company is responsible for the technical improvement of the oxygen generators (for both methods and for both on-site and off-site production).

Medical oxygen for hospitals in Europe has to meet the requirements of the European Pharmacopoeia, in which oxygen purity is an important point. In the past, hospital management had to buy oxygen with a purity of at least 99% in Europe and in many non-European states. This excluded the use of the pressure swing adsorption principle for on-site oxygen production. Since mid-2011, the European Pharmacopoeia has changed the requirements for European states. After transposition into national law, it allows an oxygen purity of 93% for hospitals if the oxygen generators used are certified according to ISO 13485:2003. Based on this legislative amendment, the R&D planners have stated a hypothesis: The use of PSA for on-site production of medical oxygen (93% purity) in Europe will increase in the future. This increase will be equally distributed across European states. New companies, especially from the domain of machinery and plant engineering, will become suppliers by offering PSA oxygen generators for hospitals.

The aim of the case study is to identify weak signals that are or are not in accordance with the hypothesis.

4.1 Web Mining and Text Mining

Based on the given hypothesis, ten search queries are created in English. Examples are 'Medical +oxygen +high +purity', 'Oxygen +pressure +swing +adsorption +PSA', and 'Ultra +high +purity +oxygen +on +site +generation'. They describe the area of high-purity medical oxygen, and they enable the identification of all internet documents dealing with this topic. The search queries are translated into different languages: German (GE), French (FR), Polish (PL), Czech (CZ), and Romanian (RO), to cover different regions of Europe. The queries are executed automatically using the Google API in mid-2012, one year after the European Pharmacopoeia changed the purity requirement, because the process of transposing it into national law is time-consuming. The hyperlinks of all query results are stored separately by language. For each retrieved website, a crawler is used to extract the textual information and to store it in a document.
As a result, 14,792 plain text documents with a total size of 213 megabytes are created automatically. Text mining methods are applied as described in Sect. 3.1. They identify several non-relevant documents, and they reduce the size of the relevant documents. Overall, 8375 documents with a size of 40 megabytes are obtained.

4.2 LSI with weak signal maximization

Based on the data collected in Sect. 4.1, LSI is applied together with the proposed weak signal maximization approach. The thresholds of the approach are determined manually by human experts in a two-step process. Starting values are determined in the first step, and the results of several rank-k models are evaluated in the second step. Based on this evaluation, some parameter values are adapted (first step), and several rank-k models based on the adapted values are created and evaluated again (second step). This two-step process is repeated until the parameter values are optimized.

As a result, the threshold for the impact of a document on a textual pattern, which ranges from -1 to 1 in matrix V, is set to 0.4. The percentage of documents with an impact greater than or equal to this threshold on a textual pattern, relative to the total number of documents, is used to identify weak and strong signals: a value of 0% to 2% represents a very low-frequency semantic textual pattern that may have occurred by chance and can be discarded. Weak signals are identified from 2% up to 6%, and strong signals from 6% up to 10%. More than 10% means that the content of a semantic textual pattern can be found on more than every tenth website; this information is normally well known and thus not relevant for forecasting. Matrix U shows the impact of a term on a semantic textual pattern, also ranging from -1 to 1. The threshold for identifying relevant terms is set to 0.4. Thus, a term vector for each semantic textual pattern is built from the terms with an impact greater than or equal to 0.4. These term vectors are compared to the term vector of the hypothesis, where the threshold for the similarity value is set to 0.3 to identify patterns that are related to the hypothesis.
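These bands translate directly into a small decision helper; the sketch below is our own condensation of the thresholds above (the function name classify_pattern is an assumption).

```python
def classify_pattern(impacted_share):
    """Classify a semantic textual pattern by the share of documents whose
    impact in matrix V is >= 0.4, using the bands determined in Sect. 4.2."""
    if impacted_share < 0.02:
        return 'discard'        # may have occurred by chance
    if impacted_share < 0.06:
        return 'weak signal'
    if impacted_share <= 0.10:
        return 'strong signal'
    return 'well known'         # too widespread to be useful for forecasting
```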
4.3 Results

An optimal value of k is identified with k = 16; thus, 16 semantic textual patterns are identified. Three patterns are impacted by more than 10% of all documents. Eight patterns are identified where the percentage of impacted documents is between 0% and 2%. Further, two strong signals and three weak signals are identified. Comparing them to the hypothesis shows that the strong signals are not related to it, because they deal with cryogenic air separation techniques and the transportation of high-pressure gas cylinders. Two of the three weak signals are related to the hypothesis. The first weak signal, with 4.2% impacted documents, describes various aspects of PSA-based oxygen generators producing 93% purity. This is in accordance with the first sentence of the hypothesis. The second weak signal is based on a small number of impacted documents (2.3%), and it shows a new technological development that enables PSA-based oxygen generators to produce medical oxygen with 99% purity in a multi-stage process. The increase in purity can be realized at low additional cost. Oxygen generators based on this technique can be used in Europe as well as in many non-European states, and they are independent of future legislative amendments. This weak signal is not in accordance with the first sentence of the hypothesis, because in the future, 99% purity PSA generators might be increasingly used for on-site medical oxygen production.

Table 1: Number of documents with impact on the first and second weak signal in different languages, compared to the number of documents in total

Weak signal                        EN     CZ     FR     GE     PL     RO      ∑
1  Number of documents            182      9     66     68      3     24    352
   Percentage                     52%     2%    19%    19%     1%     7%   100%
2  Number of documents            118      3     47     13      1     11    193
   Percentage                     61%     2%    24%     7%     0%     6%   100%
Number of documents in total     3449    514   1133   1583    582   1114   8375
   Percentage                     41%     6%    14%    19%     7%    13%   100%

Table 1 sorts the number of documents with impact on the first and second weak signal by language; e.g. 182 English-language (EN) documents have an impact on the first weak signal, and 52% of all documents with impact on the first weak signal are English-language documents. These values are compared to the distribution of all 8375 documents; e.g. 41% of all documents are English-language documents. Table 1 shows that more English and French websites are related to the first weak signal than would be expected (increases from 41% to 52% and from 14% to 19%). In contrast, Romanian websites do not mention the weak signal as often as would be expected (decrease from 13% to 7%). Table 1 also shows that Czech and Polish websites seldom mention 93% purity PSA medical oxygen on-site production, relative to the total number of Czech and Polish websites. Further, it shows that information about on-site medical oxygen generators with 99% purity can often be found on English and French websites. This is not in accordance with the second sentence of the hypothesis, because the topic is not equally distributed across European states.

The URLs of the documents with impact on the first weak signal are also evaluated with respect to companies' websites. As a result, the following list of companies is identified: Linde, Air Liquide, Praxair, Messergroup, Pangas, Westfalen-ag, Basigas, DA-Energietechnik, Iga-gas, Airco-Druckluft, Cryotec, aircom24, Airtexx, IGS, Oxymat, Oxyplus, Oxair. This list contains established medical oxygen suppliers as well as companies from the domain of machinery and plant engineering. This is in accordance with the third sentence of the hypothesis.

4.4 Evaluation

In the case study, the proposed approach identifies two weak signals, each represented by a semantic textual pattern. The identification is based on the assignment of retrieved internet documents to the two corresponding semantic textual patterns by LSI with weak signal maximization. The performance of this assignment is evaluated based on precision and recall. For each of the two semantic textual patterns, the number of documents correctly assigned to the pattern is the true positives (TP), the number of documents incorrectly assigned to the pattern is the false positives (FP), and the number of documents incorrectly not assigned to the pattern (missing documents) is the false negatives (FN). Precision is defined as TP / (TP + FP), and recall is defined as TP / (TP + FN).
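As a small check of these definitions, the values reported in Table 2 below can be reproduced directly; the counts here are taken from Table 2 (first weak signal, English documents).

```python
def precision_recall(tp, fp, fn):
    """Precision and recall as defined above."""
    return tp / (tp + fp), tp / (tp + fn)

# First weak signal, English documents (cf. Table 2):
p, r = precision_recall(tp=153, fp=29, fn=69)
print(f'precision = {p:.0%}, recall = {r:.0%}')   # precision = 84%, recall = 69%
```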
The evaluation is performed for documents in English and German, because the number of these documents is sufficient for a statistical evaluation.

Table 2: Precision and recall for the assignment of English and German documents to the semantic textual patterns standing behind the first and second weak signal

Language   Number of documents   Weak signal   TP    FP   FN   Precision   Recall
EN         3449                  First         153   29   69   84%         69%
EN         3449                  Second         96   22   33   81%         74%
GE         1583                  First          48   20   26   71%         65%
GE         1583                  Second          9    4    3   69%         75%

Table 2 shows an average precision of 76% at 70% recall. This outperforms the average value of the frequency baseline (3% precision at 70% recall). Further, the precision and recall values for German documents are smaller than those for English documents. An explanation is that the German documents are translated into English automatically. This reduces the quality of the translated documents and thus also the corresponding precision and recall values.

5 Conclusion

This work proposes a new methodology that enables the automated identification of weak signals for strategic forecasting. Weak signals are extracted from an organization's environment as represented by internet information. Based on a given hypothesis about the future, related websites are identified and textual information is extracted. LSI with a new weak signal maximization approach is applied to this textual information to identify weak signals. The identified weak signals describe trends and future developments, and it is analyzed whether or not they are in accordance with the given hypothesis. The websites standing behind an identified weak signal can be analyzed with respect to language to identify the spatial distribution of this weak signal. Further, the website addresses can be used to identify relevant organizations or companies related to the weak signal. This enables strategic planners to identify new trends, the spatial distribution of these trends, and the corresponding players (e.g. competing organizations) ahead of time.

Future work should focus on the optimization of the parameters by applying a parameter selection procedure. This would enable an improved weak signal maximization approach. Further, the development of weak signals over time can be investigated with this methodology. For this, web mining has to retrieve the data at different points in time. Then, one could probably see that new weak signals appear, that existing weak signals disappear, or that existing weak signals become strong signals.

Literature

Abebe, M., Angriawan, A., & Tran, H. (2010). Chief executive external network ties and environmental scanning activities: An empirical examination. Strategic Management Review, 4(1), 30-43.
Alallak, B. (2010). Evaluating the adoption and use of Internet-based marketing information systems to improve marketing intelligence. International Journal of Marketing Studies, 2(2), 87-101.
Ansoff, I. H. (1975). Managing strategic surprise by response to weak signals. California Management Review, 18(2), 21-33.
Ansoff, I. H. (1984). Implanting strategic management. New Jersey: Prentice Hall.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5), 993-1022.
Buckinx, W., Moons, E., Van den Poel, D., & Wets, G. (2004). Customer-adapted coupon targeting using feature selection. Expert Systems with Applications, 26(4), 509-518.
Chen, M. C., Chiu, A. L., & Chang, H. H. (2005). Mining changes in customer behavior in retail marketing. Expert Systems with Applications, 28(4), 773-781.
Chen, M.-Y., Chu, H.-C., & Chen, Y.-M. (2010). Developing a semantic-enable information retrieval mechanism. Expert Systems with Applications, 37(1), 322-340.
Choi, C., Kim, S., & Park, Y. (2007). A patent-based cross impact analysis for quantitative estimation of technological impact: The case of information and communication technology. Technological Forecasting and Social Change, 74, 1296-1314.
Choo, C. W., & Auster, E. (1993). Environmental scanning: Acquisition and use of information by managers. Annual Review of Information Science and Technology, 28, 279-314.
Christidis, K., Mentzas, G., & Apostolou, D. (2012). Using latent topics to enhance search and recommendation in Enterprise Social Software. Expert Systems with Applications, 39(10), 9297-9307.
Decker, R., Wagner, R., & Scholz, S. W. (2005). An internet-based approach to environmental scanning in marketing planning. Marketing Intelligence & Planning, 23(2), 189-200.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837-845.
D'Haen, J., Van den Poel, D., & Thorleuchter, D. (2013). Predicting customer profitability during acquisition: Finding the optimal combination of data source and data mining technique. Expert Systems with Applications, doi: 10.1016/j.eswa.2012.10.023.
Finzen, J., Kintz, M., & Kaufmann, S. (2012). Aggregating web-based ideation platforms. International Journal of Technology Intelligence and Planning, 8(1), 32-46.
Gericke, W., Thorleuchter, D., Weck, G., Reiländer, F., & Loß, D. (2009). Vertrauliche Verarbeitung staatlich eingestufter Information - die Informationstechnologie im Geheimschutz. Informatik Spektrum, 32(2), 102-109.
Halpern, E. J., Albert, M., Krieger, A. M., Metz, C. E., & Maidment, A. D. (1996). Comparison of receiver operating characteristic curves on the basis of optimal operating points. Academic Radiology, 3(3), 245-253.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
Hiltunen, E. (2008). The future sign and its three dimensions. Futures, 40(3), 247-260.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99).
Holopainen, M., & Toivonen, M. (2012). Weak signals: Ansoff today. Futures, 44(3), 198-205.
Ilmola, L., & Kuusi, O. (2006). Filters of weak signals hinder foresight: Monitoring weak signals efficiently in corporate decision-making. Futures, 38(8), 908-924.
Jiang, J., Berry, M. W., Donato, J. M., Ostrouchov, G., & Grady, N. W. (1999). Mining consumer product data via latent semantic indexing. Intelligent Data Analysis, 3(5), 377-398.
Ko, Y., & Seo, J. (2009). Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing & Management, 45(1), 70-83.
Kobayashi, M., & Takeda, K. (2000). Information retrieval on the web. ACM Computing Surveys, 32(2), 144-173.
Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. ACM SIGKDD Explorations Newsletter, 2(1), 1-15.
Kuosa, T. (2010). Futures signals sense-making framework (FSSF): A start-up tool to analyse and categorise weak signals, wild cards, drivers, trends, and other types of information. Futures, 42(1), 42-48.
Lee, C. H., & Wang, S. H. (2012). An information fusion approach to integrate image annotation and text mining methods for geographic knowledge discovery. Expert Systems with Applications, 39(10), 8954-8967.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.
Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference (pp. 556-562). MIT Press.
Lin, M.-H., & Hong, C.-F. (2011). Opportunities for crossing the chasm between early adopters and the early majority through new uses of innovative products. The Review of Socionetwork Strategies, 5(2), 27-42.
Luo, Q., Chen, E., & Xiong, H. (2011). A semantic term weighting scheme for text categorization. Expert Systems with Applications, 38(10), 12708-12716.
Mendonça, S., Cardoso, G., & Caraça, J. (2012). The strategic strength of weak signal analysis. Futures, 44(3), 218-228.
Mendonça, S., Pina e Cunha, M., Kaivo-oja, J., & Ruff, F. (2004). Wild cards, weak signals and organisational improvisation. Futures, 36(2), 201-218.
Migueis, V. L., Van den Poel, D., Camanho, A. S., & Cunha, J. F. (2012). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39(12), 11250-11256.
Park, D. H., Kim, H. K., Choi, I. Y., & Kim, J. K. (2012). A literature review and classification of recommender systems research. Expert Systems with Applications, 39(11), 10059-10072.
Prinzie, A., & Van den Poel, D. (2006). Investigating purchasing-sequence patterns for financial services using Markov, MTD and MTDg models. European Journal of Operational Research, 170(3), 710-734.
Prinzie, A., & Van den Poel, D. (2007). Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models. Decision Support Systems, 44(1), 28-45.
Purandre, P. (2008). Web mining: A key to improve business on web. In IADIS European Conference Data Mining (pp. 155-159).
Ramírez, E. H., & Brena, R. F. (2012). Query based topic modeling: An information-theoretic framework for semantic analysis in large-scale collections. In Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications (pp. 69-95). Information Science Pub., USA.
Ramirez, E. H., Brena, R. F., Magatti, D., & Stella, F. (2012). Topic model validation. Neurocomputing, 76(1), 125-133.
Rossel, P. (2009). Weak signals as a flexible framing space for enhanced management and decision-making. Technology Analysis & Strategic Management, 21(3), 291-305.
Salton, G., Allan, J., & Buckley, C. (1994). Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2), 97-108.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.
Schwarz, J. O. (2005). Pitfalls in implementing a strategic early warning system. Future Studies, 7(4), 22-31.
Shi, L., & Setchi, R. (2012). User-oriented ontology-based clustering of stored memories. Expert Systems with Applications, 39(10), 9730-9742.
Sparck Jones, K. (1973). Index term weighting. Information Storage and Retrieval, 9(11), 619-633.
Sudhamathy, G., & Jothi Venkateswaran, C. (2012). Fuzzy temporal clustering approach for e-commerce websites. International Journal of Engineering and Technology, 4(3), 119-132.
Tabatabei, N. (2011). Detecting weak signals by internet-based environmental scanning. Master's thesis, University of Waterloo, Waterloo.
Teo, T. S., & Choo, W. Y. (2001). Assessing the impact of using Internet for competitive intelligence. Information & Management, 39(1), 67-83.
Tsai, H. H. (2012). Global data mining: An empirical study of current trends, future forecasts and technology diffusions. Expert Systems with Applications, 39(9), 8172-8181.
Thorleuchter, D. (2008). Finding technological ideas and inventions with text mining and technique philosophy. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data Analysis, Machine Learning, and Applications (pp. 413-420). Berlin: Springer.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2010a). Mining ideas from textual information. Expert Systems with Applications, 37(10), 7182-7188.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2010b). A compared R&D-based and patent-based cross impact analysis for identifying relationships between technologies. Technological Forecasting and Social Change, 77(7), 1037-1050.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2010c). Mining innovative ideas to support new product research and development. In H. Locarek-Junge, & C. Weihs (Eds.), Classification as a Tool for Research (pp. 587-594). Berlin: Springer.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2010d). Extracting consumers needs for new products - A web mining approach. In Proceedings WKDD 2010 (p. 441). Los Alamitos: IEEE Computer Society.
Thorleuchter, D., & Van den Poel, D. (2011a). Semantic technology classification - A defence and security case study. In Proc. Uncertainty Reasoning and Knowledge Engineering (pp. 36-39). New York: IEEE.
Thorleuchter, D., & Van den Poel, D. (2011b). Companies website optimising concerning consumer's searching for new products. In Proc. Uncertainty Reasoning and Knowledge Engineering (pp. 40-43). New York: IEEE.
Thorleuchter, D., & Van den Poel, D. (2011c). High granular multi-level-security model for improved usability. In System Science, Engineering Design and Manufacturing Informatization 1 (pp. 191-194). New York: IEEE.
Thorleuchter, D., Herberz, S., & Van den Poel, D. (2012). Mining social behavior ideas of Przewalski horses. Lecture Notes in Electrical Engineering, 121, 649-656.
Thorleuchter, D., Schulze, J., & Van den Poel, D. (2012). Improved emergency management by loosely coupled logistic system. Communications in Computer and Information Science, 318, 5-8.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2012). Analyzing existing customers' websites to improve the customer acquisition process as well as the profitability prediction in B-to-B marketing. Expert Systems with Applications, 39(3), 2597-2605.
Thorleuchter, D., & Van den Poel, D. (2012a). Extraction of ideas from microsystems technology. Advances in Intelligent and Soft Computing, 168, 563-568.
Thorleuchter, D., & Van den Poel, D. (2012b). Predicting e-commerce company success by mining the text of its publicly-accessible website. Expert Systems with Applications, 39(17), 13026-13034.
Thorleuchter, D., & Van den Poel, D. (2012c). Using NMF for analyzing war logs. Communications in Computer and Information Science, 318, 73-76.
Thorleuchter, D., & Van den Poel, D. (2012d). Using webcrawling of publicly-available websites to assess e-commerce relationships. In SRII Global Conference 2012 (pp. 402-410). San Jose, CA, USA: IEEE.
Thorleuchter, D., & Van den Poel, D. (2012e). Improved multilevel security with latent semantic indexing. Expert Systems with Applications, 39(18), 13462-13471.
Thorleuchter, D., Weck, G., & Van den Poel, D. (2012a). Granular deleting in multi level security models - An electronic engineering approach. Lecture Notes in Electrical Engineering, 177, 609-614.
Thorleuchter, D., Weck, G., & Van den Poel, D. (2012b). Usability based modeling for advanced IT-security - An electronic engineering approach. Lecture Notes in Electrical Engineering, 177, 615-619.
Thorleuchter, D., & Van den Poel, D. (2013a). Technology classification with latent semantic indexing. Expert Systems with Applications, 40(5), 1786-1795.
Thorleuchter, D., & Van den Poel, D. (2013b). Protecting research and technology from espionage. Expert Systems with Applications, doi: 10.1016/j.eswa.2012.12.051.
Uskali, T. (2005). Paying attention to weak signals: The key concept for innovation journalism. Innovation Journalism, 2(11), 19.
Van den Poel, D., De Schamphelaere, J., & Wets, G. (2004). Direct and indirect effects of retail promotions on sales and profits in the do-it-yourself market. Expert Systems with Applications, 27(1), 53-62.
Van Erkel, A. R., & Pattynama, P. M. T. (1998). Receiver operating characteristic (ROC) analysis: Basic principles and applications in radiology. European Journal of Radiology, 27(2), 88-94.
Yoon, J. (2012). Detecting weak signals for long-term business opportunities using text mining of Web news. Expert Systems with Applications, 39(16), 12543-12550.
Zeng, J., Duan, J., Cao, W., & Wu, C. (2012). Topics modeling based on selective Zipf distribution. Expert Systems with Applications, 39(7), 6541-6546.
Zhong, J., & Li, X. (2010). Unified collaborative filtering model based on combination of latent features. Expert Systems with Applications, 37(8), 5666-5672.
Zipf, G. K. (1949). Human Behaviour and the Principle of Least Effort. Cambridge: Addison-Wesley.