FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
TWEEKERKENSTRAAT 2, B-9000 GENT
Tel.: 32 - (0)9 - 264.34.61
Fax.: 32 - (0)9 - 264.35.92

WORKING PAPER

Mining Ideas from Textual Information

Dirk Thorleuchter (1), Dirk Van den Poel (2), and Anita Prinzie (3)

November 2009
2009/619

(1) Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany & PhD Candidate, Ghent University, dirk.thorleuchter@int.fraunhofer.de
(2) Prof. Dr. Dirk Van den Poel, Professor of Marketing Modeling/analytical Customer Relationship Management, Faculty of Economics and Business Administration, dirk.vandenpoel@ugent.be; more papers about customer relationship management can be obtained from the website www.crm.UGent.be, and more papers about text mining can be downloaded from www.textmining.UGent.be
(3) Prof. Dr. Anita Prinzie is visiting professor at Ghent University

D/2009/7012/71

Abstract

This approach introduces idea mining as the process of extracting new and useful ideas from unstructured text. We use an idea definition from the philosophy of technology, and we focus on ideas that can be used to solve technological problems. The rationale for the idea mining approach is taken from psychology and cognitive science and follows how persons create ideas. To realize the processing, we use methods from text mining and text classification (tokenization, term filtering methods, Euclidean distance measure etc.) and combine them with a new heuristic measure for mining ideas. As a result, the idea mining approach automatically extracts new and useful ideas from a user-given text.
We present these problem solution ideas in a comprehensible way to support users in problem solving. The approach is evaluated with patent data and is realized as a web-based application named 'Technological Idea Miner', which can be used for further testing and evaluation.

Keywords: Idea Mining, Text Mining, Text Classification, Technology

1 Introduction

1.1 Overview

An idea is an image existing or formed in the mind, but it can be written down as textual information. In recent years, the amount of available information has increased continually. About 80 % of all this information is stored in textual form [9]. Examples are research papers, articles in technical periodicals, reports, documents, web pages etc. These texts possibly contain many new ideas. A new idea is often needed to discover unconventional approaches, e.g. to create a technological breakthrough. However, extracting new ideas manually from this mass of text is time consuming and costly. Therefore, it is useful to search for new problem solution ideas automatically.

Text mining, or knowledge discovery from texts, refers generally to the process of extracting interesting information and knowledge from unstructured text [12]. Building on this, we introduce idea mining as an automatic process of extracting new and useful ideas from unstructured text using text mining methods.

Creating ideas is a well-known topic in psychology and cognitive science. There, we find many approaches dealing with how persons create ideas, especially for problem solving. Therefore, in Sect. 2 we focus on a general process of creating problem solution ideas and use it as the rationale for the idea mining approach.

In recent years, data and text mining techniques have been used to explore and analyze huge amounts of available textual data [4]. Idea mining uses known methods from these techniques and combines them with a new method to create text patterns and a new heuristic measure for mining ideas in order to realize the rationale.
We present the processing of the idea mining approach in Sect. 3 and introduce the new idea mining measure in Sect. 4. A further task of idea mining is to present the extracted ideas to the user in a comprehensible way. Therefore, we consider results of comprehensibility research and their relation to our task (see Sect. 5). Additionally, we provide an extensive evaluation to show the success of the idea mining approach and specifically of the heuristic idea mining measure (see Sect. 7).

1.2 Idea Definition

We limit our approach to technological language for two reasons. Firstly, technological language is much more standardized than colloquial language [11,16]. Therefore, we get better results by analyzing technological texts with text mining approaches. Secondly, our idea definition is taken from the philosophy of technology [21]. There, an idea is defined as a combination of two things: a means and an appertaining purpose.

An example of an idea is a transistor. A transistor is a semiconductor device. It can be used to amplify or switch electronic signals. Here, we have a means (a semiconductor device) and an appertaining purpose (to amplify or switch electronic signals). In general, we speak of a new idea if a known means is related to an unknown purpose or if a known purpose is related to an unknown means [1]. In this sense, a nanomagnet is a new idea because a nanomagnet is a miniaturized magnet that can also be used to amplify or switch electronic signals. Here, we have an unknown means (a miniaturized magnet) appearing together with a known purpose. This new idea could be useful to people working in the field of electronic signals because nanomagnetic technology could possibly replace transistor technology in the future.

Therefore, we define a new and probably useful idea as a text phrase. This text phrase consists of domain specific terms that occur together in textual information. These terms can be divided into two subsets.
The first subset should represent a known means (or a known purpose) and the second subset should represent an unknown purpose (or an unknown means). Additionally, all terms in the first subset should occur together in a text phrase of the technological problem description.

2 Rationale behind Idea Mining

Creating ideas is a well-known topic that is related to creativity in psychology and cognitive science. One of the first descriptions of the creative process was published by Wallas [23]. His stage model explains creative insights and illuminations for finding a problem solution. The model consists of four stages. In stage one, 'preparation', the problem is analyzed so that a person recognizes the problem's dimensions. Stage two, 'incubation / intimation', and stage three, 'illumination', transfer the problem from the conscious to the unconscious mind. The unconscious mind works on the problem continuously and probably finds a solution by creative insights and illuminations. This solution is transferred to the conscious mind, which means that after some time the person suddenly gets an idea that is new to him and that probably solves the problem. In the last stage, 'verification', the idea is tested for novelty and usefulness.

One of the best-known pragmatic approaches to practical creativity is brainstorming by Osborn [17]. The first step in brainstorming is to define the problem, e.g. by creating descriptions of the problem. Then, persons generate new ideas using creativity methods like idea association. The last step in the brainstorming process is to cluster the generated ideas and to evaluate them for novelty and usefulness.

Besides these, there are several further approaches dealing with the creation of new ideas. We can learn from all these approaches that three steps are necessary for creating ideas.
The first step is to focus on a problem, the second step is to generate new ideas specific to this problem with creative methods, and the third step is to evaluate the generated ideas for novelty and usefulness concerning the problem.

Based on these approaches, we build a corresponding rationale for the idea mining process. Therefore, idea mining also consists of three steps. In the first step, we focus on the problem. Here, the user of our idea mining approach has to provide textual information in which he describes his specific problem (a problem description). In the second step, the user has to provide further textual information in which he supposes the existence of new and useful ideas (a new text) that can probably solve his problem [20]. Ideas are contained in text phrases inside this new text as described in Sect. 1.2. Therefore, with an automatic process, we extract a very large number of overlapping text phrases from the new text. In the remainder of this paper, text phrases are named text patterns. In the third step, all extracted text patterns are evaluated for novelty and usefulness. This means that they are compared to the problem description by using a specific idea mining measure. With this measure, text patterns can be classified as new and useful ideas.

Therefore, idea mining identifies new and useful ideas in three steps:
1. Preparation of a problem description
2. Extraction of text patterns from a new text
3. Evaluation of text patterns for novelty and usefulness concerning the problem description

3 Idea Mining Process

Fig. 1 shows the processing of the idea mining approach in different steps based on the rationale for the idea mining process (see Sect. 2).

Figure 1: Processing of our idea mining approach in different steps: After tokenization and term filtering, text patterns are created and term vectors are built representing these text patterns.
Term vectors from the new text are compared to term vectors from the problem description using the Euclidean distance measure. Then, term vectors from the new text are compared to their most similar term vectors from the problem description using the idea mining measure. As a result, we get term vectors from the new text that represent new and useful ideas.

With tokenization [3], texts are separated into terms, where the term unit is the word. The set of different terms in a text is reduced by using stop word filtering and stemming [12]. For this, a general list of stop words is used as well as the well-known Porter stemming algorithm [18].

A problem related to the use of stemming is the identification of synonyms and homonyms. Synonyms are different words with identical or at least similar meanings. Homonyms are groups of words with the same spelling but with different meanings. Stemming cannot identify synonyms and homonyms because it does not use knowledge of the context of a term. In this idea mining approach, we do not identify synonyms and homonyms. This is because the approach always considers the context of a term by working on text patterns containing several co-occurring terms as described below.

Here, we show how to create these text patterns automatically. Around each appearance of each term in the new text, we create a text pattern containing the selected term and all terms that occur in the left and right context of the selected term. To reduce the number of text patterns, we only create text patterns around non-stop words and around terms that occur both in the new text and in the problem description.

One important decision is the length of a text pattern. Text patterns should not be too small, so that they contain all terms representing a new idea. Furthermore, text patterns should not be too large, so that only terms related to the new idea occur in them.
For example, if we set the length of the text patterns to $l$, then a text pattern contains the selected term, $l$ terms from its left context, and also $l$ terms from its right context. The cardinality of the set of stop word filtered and stemmed terms from this pattern is normally smaller than $2l + 1$ because some terms are stop words, some terms occur twice, and some terms have the same stem.

In this paper, we do not use a constant length $l$ for all patterns but a variable length of text patterns based on a dynamic adaptation of their context. This is realized by using a term weighting scheme based on the difference between stop words and non-stop words, because the importance of a stop word in a text pattern is not as high as the importance of a non-stop word. If an author formulates an idea very briefly by joining catchwords together, then he normally does not use many stop words and the text pattern length can be small. If an author formulates an idea in a flowery style, which means that his writing is not expressed in a clear and simple way, then he normally uses more stop words and the text pattern length has to be larger. In the idea mining application, the value of the text pattern length $l$ and the percentages of the importance of stop words $u$ and of non-stop words $v$ can be provided by the user.

To compute the variable length of a text pattern, we first define the term weighting scheme.

Definition 1. Let $T = [w_1, .., w_n]$ (a text) be a list of terms (words) $w_i$ in order of appearance, and let $n \in \mathbb{N}$ be the number of terms in $T$ and $i \in [1, .., n]$. Let $\Sigma = [\tilde{w}_1, .., \tilde{w}_m]$ be a set of domain specific stop terms [15] and let $m \in \mathbb{N}$ be the number of terms in $\Sigma$. Let the percentage $u \in \mathbb{N}$ be a term weighting coefficient for stop words. Let the percentage $v \in \mathbb{N}$ be a term weighting coefficient for non-stop words. Then, we define $f_g(w_i) \in \mathbb{N}$ as term weighting scheme:

\[
f_g(w_i) =
\begin{cases}
u & w_i \in \Sigma \\
v & w_i \notin \Sigma
\end{cases}
\qquad \forall i \in \{1, .., n\} \tag{1}
\]

We give an example for this.
The text pattern 'components for frequency conversion of infrared lasers' is built around the word 'conversion'. It contains the word 'conversion' itself, three terms from its left context (components for frequency), and three terms from its right context (of infrared lasers). Here, we use a constant length $l = 3$ and a term weighting scheme with $u = v = 100\,\%$. This means that the importance of a stop word is equal to the importance of a non-stop word.

The next text pattern is an example of a variable length: 'In a 1st phase, known but so far not available materials and technologies such as layer systems and crystals'. This text pattern is built around the word 'technologies'. Here, we use a constant length $l = 3$ and a term weighting scheme with $u = 10\,\%$ and $v = 100\,\%$. As a result, this text pattern contains six terms from the right context and eleven terms from the left context of the term 'technologies'. In this example, the non-stop words are phase, materials, technologies, layer, systems, and crystals. We compute the number of terms from the left and right context as described below.

Definition 2. Let $l \in \mathbb{N}$ be a constant length of text patterns. Let $l_i^{left} \in \mathbb{N}$ be the number of terms from the left context of a text pattern that is built around the term $w_i$. Let $l_i^{right} \in \mathbb{N}$ be the number of terms from the right context of a text pattern that is built around the term $w_i$. Then, we define $l_i^{right}$ and $l_i^{left}$ as:

\[
l_i^{right} = \min_j \Big( \big( \sum_{k=1}^{j} f_g(w_{i+k}) \geq l \big) \vee (i + j = n) \Big) \qquad \forall i \in \{1, .., n\} \tag{2}
\]

\[
l_i^{left} = \min_j \Big( \big( \sum_{k=1}^{j} f_g(w_{i-k}) \geq l \big) \vee (i - j = 1) \Big) \qquad \forall i \in \{1, .., n\} \tag{3}
\]

After computing $l_i^{left}$ and $l_i^{right}$, we can build a text pattern $T_i$ around the term $w_i$ from the text $T = [w_1, .., w_n]$:

\[
T_i = [w_{i - l_i^{left}}, .., w_i, .., w_{i + l_i^{right}}] \tag{4}
\]

For each text pattern from the new text, we create a term vector in the vector space model. The size of the vector is defined by the number of different stemmed and stop word filtered terms in the new text.
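The context lengths of Eqs. (2) and (3) and the pattern of Eq. (4) can be illustrated with a minimal Python sketch. It is not the implementation of the idea mining application. The stop term list `STOP_TERMS`, the tokenizer (which drops tokens containing digits such as '1st'), and the function names are assumptions made only for this illustration; the weights $u$ and $v$ are given in percent, so the threshold $l$ corresponds to `100 * l`.

```python
import re
from typing import List, Set, Tuple

# Hypothetical stop term list, chosen only to cover the two example phrases.
STOP_TERMS: Set[str] = {
    "in", "a", "known", "but", "so", "far", "not", "available",
    "and", "such", "as", "for", "of",
}


def tokenize(text: str) -> List[str]:
    """Split a text into word tokens; tokens with digits are dropped (assumption)."""
    return [tok for tok in re.findall(r"[A-Za-z0-9]+", text) if tok.isalpha()]


def weight(term: str, u: int, v: int) -> int:
    """Eq. (1): u percent for stop terms, v percent for all other terms."""
    return u if term.lower() in STOP_TERMS else v


def context_length(terms: List[str], i: int, step: int, l: int, u: int, v: int) -> int:
    """Eqs. (2)/(3): smallest j whose summed weights reach l, or all terms
    up to the text boundary. step=+1 scans right, step=-1 scans left; i is 0-based."""
    total, j, pos = 0, 0, i + step
    while 0 <= pos < len(terms):
        j += 1
        total += weight(terms[pos], u, v)
        if total >= 100 * l:  # weights are percentages
            break
        pos += step
    return j


def build_pattern(terms: List[str], i: int, l: int, u: int, v: int) -> Tuple[int, int, List[str]]:
    """Eq. (4): the pattern around terms[i] with its left and right context lengths."""
    left = context_length(terms, i, -1, l, u, v)
    right = context_length(terms, i, +1, l, u, v)
    return left, right, terms[i - left : i + right + 1]


first = tokenize("components for frequency conversion of infrared lasers")
second = tokenize(
    "In a 1st phase, known but so far not available materials and "
    "technologies such as layer systems and crystals"
)
print(build_pattern(first, first.index("conversion"), l=3, u=100, v=100)[:2])
print(build_pattern(second, second.index("technologies"), l=3, u=10, v=100)[:2])
```

Under these assumptions, the sketch prints `(3, 3)` for the first phrase and `(11, 6)` for the second, which agrees with the context sizes given in the two examples. In the second phrase the left context ends at the text boundary before the weight threshold is reached.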
For text pattern encoding, we use binary term vectors, which means that a vector element is set to one if the corresponding stemmed term is used in the text pattern and to zero if it is not. We also build text patterns from the problem description and create term vectors as described above.

To identify new and useful ideas, we create a specific idea mining measure. This idea mining measure is described in Sect. 4. By comparing a vector from the new text to one from the problem description with this measure, we compute a result value that always lies between 0 % and 100 %. The greater the result value, the higher the probability that the vector from the new text represents a new and useful idea concerning a vector from the problem description.

We use this measure for comparing vectors from the new text to their most similar vectors from the problem description but not to all vectors. This is because result values from comparing a vector to its most similar vectors dominate result values from comparing it to less similar vectors. For example, if a vector from the new text is similar to one from the problem description, then the idea is not new to the user, regardless of whether result values from comparing this vector to further vectors from the problem description are greater than zero. Therefore, we can be sure that a vector represents a new and useful idea only if it gets a high result value from the idea mining measure concerning one of its most similar vectors.

Furthermore, computing the idea mining measure is time consuming. Therefore, for implementing an idea mining application, it is necessary to limit the number of comparisons made with the idea mining measure. We choose a two-step classification. In the first step, we compare each vector from the new text to all vectors from the problem description by using the well-known Euclidean distance measure.
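The binary encoding and the first classification step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the toy term sets, and the function names are hypothetical, and ties at the minimal distance are all kept, in line with the statement that a vector can have several most similar vectors.

```python
import math
from typing import List, Sequence, Set


def binary_vector(pattern_terms: Set[str], vocabulary: Sequence[str]) -> List[int]:
    """One element per vocabulary term: 1 if the term occurs in the pattern, else 0."""
    return [1 if term in pattern_terms else 0 for term in vocabulary]


def euclidean(a: Sequence[int], b: Sequence[int]) -> float:
    """Euclidean distance between two term vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(new_vec: Sequence[int], problem_vecs: Sequence[Sequence[int]]) -> List[int]:
    """Indices of all problem description vectors with the lowest distance to new_vec."""
    if not problem_vecs:
        return []
    distances = [euclidean(new_vec, p) for p in problem_vecs]
    best = min(distances)
    return [idx for idx, d in enumerate(distances) if math.isclose(d, best)]


# Hypothetical vocabulary of stemmed, stop word filtered terms.
vocabulary = ["amplifi", "signal", "magnet", "transistor"]
new_vec = binary_vector({"magnet", "amplifi", "signal"}, vocabulary)
problem_vecs = [
    binary_vector({"transistor", "amplifi", "signal"}, vocabulary),
    binary_vector({"transistor"}, vocabulary),
]
print(most_similar(new_vec, problem_vecs))
```

Only the vectors returned by `most_similar` would then be passed to the more expensive idea mining measure in the second step.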
Fortunately, computing the Euclidean distance measure is not time consuming, so it is suited for implementation in an idea mining application. In detail, for each vector from the new text, we identify all vectors from the problem description for which the Euclidean distance result value is lowest, which means that we identify the most similar vectors. In the second step, we compare each vector from the new text to its most similar vectors using the idea mining measure.

Each vector from the new text that is compared to several similar vectors gets the highest result value from the idea mining measure as its result value. To identify a new and useful idea, we use the alpha-cut method. An alpha-cut of the idea mining measure result value is the set of all vectors from the new text whose result value is greater than or equal to alpha ($\tilde{\alpha}$). In the idea mining application, the user can provide the value of $\tilde{\alpha}$.

4 Idea Mining Measure

With the idea mining measure, we compare a vector that represents a text pattern from the new text to its most similar vectors from the problem description in order to identify a new and useful idea inside the text pattern from the new text. In detail, we have to find text patterns from the new text in which all terms representing a means (purpose) and no terms representing a purpose (means) occur in a text pattern from the problem description. If all terms in the text pattern from the new text are known, which means that all terms also occur in a text pattern from the problem description, then the idea is not new to the user. Furthermore, the idea is not useful if all terms in the text pattern from the new text are unknown because there is no relation to the problem. It is shown in [22] that, to find new and useful ideas, the number of known terms (e.g. representing a means) and the number of unknown terms (e.g. representing an appertaining purpose) should be well balanced.

Definition 3.
Let $\alpha_i$ be the set of stemmed and stop word filtered terms representing a text pattern with number $i$ from the new text. Let $\beta_j$ be the set of stemmed and stop word filtered terms representing a text pattern with number $j$ from the problem description. Let $\gamma$ be the set of all stemmed and stop word filtered terms from the new text. Let $x = |\gamma|$ be the cardinality of $\gamma$. Let $\omega_i \in \{0,1\}^x$ be a term vector in the vector space model concerning $\alpha_i$. Let $\rho_j \in \{0,1\}^x$ be a term vector in the vector space model concerning $\beta_j$. Let $p = |\alpha_i| = \sum_{k=1}^{x} \omega_{i,k}$ be the number of all (known and unknown) terms in the text pattern with number $i$. Let $q = |\alpha_i \cap \beta_j| = \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k}$ be the number of known terms in the text pattern with number $i$ concerning the text pattern with number $j$ from the problem description. Then, we define $m_1$ as measure for a well-balanced distribution of known and unknown terms.

\[
m_1 =
\begin{cases}
\dfrac{2 \cdot (p - q)}{p} & \left( q \geq \dfrac{p}{2} \right) \\[2ex]
\dfrac{2 \cdot q}{p} & \left( q < \dfrac{p}{2} \right)
\end{cases}
\tag{5}
\]

The known terms in the text pattern from the new text should occur in the problem description more frequently than other terms. This is because they represent a known means or a known purpose that is a central part of the problem. In the problem description, terms that represent the problem occur more frequently than other terms. For this, we define these frequent terms by using a percentage $z$ as parameter, and we compute $m_2$ as the number of known and frequent terms over the number of all known terms.

Definition 4. Let $z$ be a percentage. Let $\delta$ be the set of the $z$ % most frequent stemmed and stop word filtered terms in the problem description. Let $\xi \in \{0,1\}^x$ be a term vector in the vector space model concerning $\delta$. Let $r = |\alpha_i \cap \beta_j \cap \delta| = \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k} \cdot \xi_k$ be the number of known terms that occur frequently in the problem description. We define $m_2$ as measure for the frequent occurrence of known terms in the problem description.

\[
m_2 = \frac{r}{q} \tag{6}
\]

The unknown terms in the text pattern from the new text represent a new approach (an unknown means or purpose), which is a central part of the new idea. These terms normally occur more frequently than other terms in the new text because this text deals with the new idea. For this, we also define these frequent terms by using a percentage $z$ as parameter, and we compute $m_3$ as the number of unknown and frequent terms over the number of all unknown terms.

Definition 5. Let $\varphi$ be the set of the $z$ % most frequent stemmed and stop word filtered terms in the new text. Let $\tau \in \{0,1\}^x$ be a term vector in the vector space model concerning $\varphi$. Let $s = |(\alpha_i \setminus \beta_j) \cap \varphi| = \sum_{k=1}^{x} \omega_{i,k} \cdot \tau_k - \sum_{k=1}^{x} \omega_{i,k} \cdot \rho_{j,k} \cdot \tau_k$ be the number of unknown terms that occur frequently in the new text. We define $m_3$ as measure for the frequent occurrence of unknown terms in the new text.

\[
m_3 = \frac{s}{p - q} \tag{7}
\]

There are often characteristic terms (higher, quicker, integrated, minimized etc.) that occur together with new ideas. They point to a changing purpose or a changing means and can be an indicator for new ideas.

Definition 6. Let $\lambda$ be a set of these characteristic terms (stemmed and stop word filtered). Let $\theta \in \{0,1\}^x$ be a term vector in the vector space model concerning $\lambda$. Let $t = |\alpha_i \cap \lambda| = \sum_{k=1}^{x} \omega_{i,k} \cdot \theta_k$ be the number of these characteristic terms in the text pattern with number $i$. We define $m_4$ as measure for changing means and purposes.

\[
m_4 =
\begin{cases}
1 & (t > 0) \\
0 & (t = 0)
\end{cases}
\tag{8}
\]

The idea mining measure is based on all four heuristic sub measures.

Definition 7. Let $h \in \{1, .., 4\}$ and let $g_h \geq 0$ be weighting factors with $\sum_{h=1}^{4} g_h = 1$. Let the idea mining measure $m$ be the sum of all four sub measures multiplied by their weighting factors $g_h$ in the case of $p \neq q$.

\[
m =
\begin{cases}
g_1 m_1 + g_2 m_2 + g_3 m_3 + g_4 m_4 & (p \neq q) \\
0 & (p = q)
\end{cases}
\tag{9}
\]

5 Idea Mining and Comprehensibility Research

The aim of idea mining is to find new and useful ideas but also to present these ideas to the user in a comprehensible way. To realize this, we draw on comprehensibility research.
Up to the 1960s, comprehensibility was regarded as a property of the text. It was measured in an objective way by analysing text parameters like word length, sentence length, word-usability, and the relationship between the number of different words and the number of words. The best-known approach of this time was the 'Reading Ease' formula by Flesch [8]. Later research in this field focuses on cognitive effects in textual production and reception. The results of this research are presented by two approaches: the 'Hamburger Verständlichkeitsmodell' [14] and the 'Groebener Modell' [10]. Both approaches describe four dimensions of comprehensibility: simplicity, structure-organization, brevity-shortness, and interest-liveliness.

Figure 2: We present the new text back to the user with text patterns in bold print that represent new and useful ideas.

A further approach from cognition research is named text excerption. If a human expert finds new and useful ideas in texts, he highlights all corresponding text phrases, e.g. with text marking. This behaviour is described by Puppe et al. [19]. In the idea mining application, text excerption is used to present the extracted ideas to the user (Fig. 2 shows an example).

For the 'Groebener Modell', marking text patterns is important for structure-organization, and this leads directly to comprehensibility. On this point, the 'Groebener Modell' differs from the 'Hamburger Verständlichkeitsmodell', in which structure-organization is not as important for comprehensibility.

As a result, the presentation of ideas in the idea mining application is based on text excerption. It is comprehensible according to the 'Groebener Modell' and less comprehensible according to the 'Hamburger Verständlichkeitsmodell'.

6 Results and Discussions

In a study for the German Ministry of Defence (MoD), we use this approach to identify new technological ideas for the German defence research program.
In detail, we have to identify new solution ideas to solve current problems in German defence-based research projects. We extract new ideas from 300 descriptions of research projects granted in 2006 by the National Institute of Standards and Technology (NIST) in the United States Small Business Innovation Research (SBIR) Program. We use textual information from current defence-based research projects of the German MoD as problem description.

As a result, we extract several new ideas that are useful for German defence research planners and that are now used as starting points for collaboration projects or for new defence-based research projects. A proper selection of these ideas is a strategic issue and, together with the weapon selection problem [5], it has a significant impact on the efficiency of future defence systems. The results are published in [6]. Here, we show some successful examples:

A modified focal plane array technology is identified that can be used to create a detector for the far ultraviolet spectrum. It leads to an improvement of military reconnaissance. This idea is new because up to now focal plane array technology has only been used in the infrared, visual, and near ultraviolet range.

Further, the approach identifies personnel ultrasonic locating equipment that was originally developed to allow fire fighters to orient themselves in dense smoke. It can also be used to improve the location and navigation of soldiers in urban warfare (e.g. in buildings).

Additionally, the approach shows that the use of avalanche photodiode (APD) technology can improve the internal gain and the dark current of infrared detectors. This also leads to an improvement of military reconnaissance.

This study shows that some of the automatically extracted ideas are useful for technological research planners from the German MoD.
Unfortunately, the problem description used (textual information about current defence-based research projects) is classified as German restricted (Verschlusssache - Nur für den Dienstgebrauch), which means that it may not be distributed to the scientific community. Therefore, we cannot use the results of this study to evaluate the idea mining approach. However, a separate evaluation (see Sect. 7) is done using (unclassified) patent data, which allows the evaluation to be recomputed.

7 Evaluation

The idea mining measure, as the central point of the idea mining approach, consists of four heuristic sub measures that are not theoretically founded. Therefore, it is crucial to provide an extensive evaluation to show their success. We compare this approach to a baseline because we are not aware of other approaches for idea mining. As measure for the baseline, we use Jaccard's coefficient [7], a well-known heuristic similarity measure.

The idea mining approach is evaluated by using our idea mining application (see Sect. 8). There, the web-based application and all texts that are used for the evaluation are presented. Additionally, we create an alternative idea mining application based on Jaccard's coefficient instead of the idea mining measure for the sole purpose of comparison to the baseline.

For the evaluation, we use patent data because patent descriptions normally contain new ideas, which include a considerable part of scientific and technological knowledge [13]. We use the abstract of a patent as new text. A patent is often based on further patents. We aggregate the abstracts of these references as problem description. Then, we identify new and useful ideas from this patent concerning its patent references using the idea mining applications. For the evaluation, we use abstracts from 40 randomly selected patents and from their references, a general stop word list, and the Porter stemmer.
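For reference, Jaccard's coefficient between two term sets is the size of their intersection divided by the size of their union. The following Python sketch shows this standard definition on toy data; how the baseline application applies it to text patterns in detail is not specified here, so the function name, the term sets, and the handling of two empty sets are assumptions.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard's coefficient |a ∩ b| / |a ∪ b|; defined as 0.0 for two empty sets (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy term sets, not taken from the evaluation data.
new_pattern = {"magnet", "amplifi", "signal"}
problem_pattern = {"transistor", "amplifi", "signal"}
print(jaccard(new_pattern, problem_pattern))      # 0.5
print(jaccard(problem_pattern, problem_pattern))  # 1.0
```

The coefficient is highest for identical term sets, so a high value indicates similarity to the problem description rather than novelty.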
Then, we determine the parameters of the idea mining measure ($g_1$, $g_2$, $g_3$, $g_4$, $\tilde{\alpha}$, and $z$) as well as the parameters for the length of the text patterns ($l$, $u$, and $v$). For this, we use further patent data and their references as new text and as problem description.

The results are evaluated by a human expert and compared to each single sub measure $m_1$, $m_2$, $m_3$, and $m_4$ alone. We find that using the first sub measure alone is successful. If this sub measure is small, then the corresponding text pattern normally does not contain a new and useful idea. If this sub measure is large, then the probability that the text pattern contains a new idea is also high. We also find that using the further sub measures alone is not successful. They are successful only if the result value of the first sub measure is medium to high. Therefore, they can only be used in addition to the first sub measure.

The results of the second and third sub measures depend on the parameter $z$. This parameter is used to define frequent terms by building a set of the $z$ % most frequent stemmed and stop word filtered terms. Heuristically, we expect that this parameter should be between 10 % and 30 % to get good sub measures. This is because if $z$ is greater than 30 %, then we probably classify several terms that occur only once as frequent terms. If $z$ is smaller than 10 %, then we only identify highly frequent terms for the set. In this case, the result values of the second and third sub measures are small regardless of whether known terms occur frequently in the problem description or unknown terms occur frequently in the new text. Therefore, we set $z$ to the mean value (20 %).

Additionally, we see that the second and third sub measures are nearly equally successful and that the fourth sub measure is less successful. Therefore, we heuristically set the parameters $g_1$ to 50 %, $g_2$ to 20 %, $g_3$ to 20 %, and $g_4$ to 10 %.
We have also used other values to optimize the combination of these four sub measures. However, we do not find a combination that is generally superior to the selected combination. This is because the success of these value combinations depends on the quality of the user-given textual information.

Then, we determine the alpha-cut value of the idea mining measure $m$. If the percentage $\tilde{\alpha}$ is small, then we get many result items. This leads to a small precision value because many extracted text patterns do not contain a new and useful idea. If $\tilde{\alpha}$ is large, then we only get a very small number of results, and our recall value is probably small because we do not find most of the new and useful ideas in the new text. A human expert checks the results of several patent descriptions for an optimal value of $\tilde{\alpha}$. In his experience, 60 % is a good compromise. Therefore, we set $\tilde{\alpha}$ to 60 %. We also set the alpha-cut value of Jaccard's coefficient as measure for the baseline to 20 % by using the same evaluation procedure as described above.

After this, we determine the length of the text patterns. The length depends on the parameter $l$ and on $f_g(w_i)$, a term weighting scheme that is based on the difference between stop words and non-stop words (see Sect. 3). Text patterns should not be too small, so that they contain all terms representing a new and useful idea. Additionally, text patterns should not be too large, so that no further terms occur in the text patterns that are not related to the new and useful idea. To find an optimal size of text patterns, we create text patterns from several patent descriptions by using different values for $l$ and for the percentages $u$ and $v$. A human expert checks the different lengths of these text patterns for an optimal size. He gets the best results by setting the text pattern length $l$ to 7 terms, the percentage $u$ to 50 %, and $v$ to 100 %.
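With the parameter values chosen above, the idea mining measure of Sect. 4 and the alpha-cut can be sketched in Python as follows. This is a minimal illustration on sets of terms, not the code of the application. The function name, the toy term sets, and the fallback $m_2 = 0$ for $q = 0$ (a case that Eq. (6) leaves undefined) are assumptions.

```python
from typing import Sequence, Set


def idea_mining_measure(
    alpha: Set[str],   # terms of a text pattern from the new text
    beta: Set[str],    # terms of a similar pattern from the problem description
    delta: Set[str],   # frequent terms of the problem description
    phi: Set[str],     # frequent terms of the new text
    lam: Set[str],     # characteristic terms (higher, quicker, ...)
    g: Sequence[float] = (0.5, 0.2, 0.2, 0.1),
) -> float:
    p = len(alpha)          # all terms in the pattern
    q = len(alpha & beta)   # known terms
    if p == q:              # Eq. (9): nothing unknown, so no new idea
        return 0.0
    m1 = 2 * (p - q) / p if 2 * q >= p else 2 * q / p  # Eq. (5)
    r = len(alpha & beta & delta)
    m2 = r / q if q > 0 else 0.0  # Eq. (6); the value for q == 0 is an assumption
    s = len((alpha - beta) & phi)
    m3 = s / (p - q)              # Eq. (7)
    m4 = 1.0 if alpha & lam else 0.0  # Eq. (8)
    return g[0] * m1 + g[1] * m2 + g[2] * m3 + g[3] * m4


ALPHA_CUT = 0.6  # alpha-cut value of 60 %

# Toy term sets (unstemmed for readability), not taken from the evaluation data.
pattern = {"nanomagnet", "minimized", "amplify", "signal"}
problem = {"transistor", "semiconductor", "amplify", "signal"}
value = idea_mining_measure(
    pattern, problem,
    delta={"signal"},
    phi={"nanomagnet"},
    lam={"higher", "quicker", "integrated", "minimized"},
)
print(round(value, 2), value >= ALPHA_CUT)
```

In this toy case, half of the pattern terms are known, so $m_1 = 1$, while $m_2 = m_3 = 0.5$ and $m_4 = 1$, which gives $m = 0.8$ and places the pattern inside the alpha-cut. A pattern whose terms all occur in the problem description pattern yields $m = 0$.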
Then, the approach automatically extracts about 200 new ideas from the 40 randomly selected patents. To cluster these results, means and purposes are assigned to scientific categories of the Science Citation Index; examples are presented below. Several ideas are identified that use methods from 'Artificial Intelligence' (mean) for applications in 'Health Care Sciences and Services' (purpose). We also identify new ideas using 'Imaging Science and Photographic Technology' (mean) for 'Medical Informatics' purposes. Further ideas use techniques from 'Remote Sensing' (mean) in the field of 'Tropical Medicine' (purpose). Additionally, several ideas use 'Computer Science, Theory and Methods' (mean) for applications in 'Psychiatry' (purpose). Furthermore, methods from 'Artificial Intelligence' (mean) are used for 'Automation and Control Systems' purposes.

To evaluate these results, we use the precision and recall measures commonly used in information retrieval, which are based on true positives, false positives and false negatives. For this, we have to define the ground truth for our evaluation. Therefore, a human expert also identifies new and useful ideas in these patents manually, that is, without using our idea mining approach. He uses the idea definition in Sect. 1.2. This means he checks each text pattern for terms representing a known mean (purpose) and terms representing an unknown purpose (mean). These results are the ground truth for the evaluation. For each patent, we compute precision and recall by using the idea mining measure and by using Jaccard's coefficient. Then, we compute the average precision and recall values. As a result, we get a precision of 40 % and a recall of 25 % by using the idea mining approach with the idea mining measure. A precision of 40 % means that if the idea mining approach extracts ten text patterns, four of them represent a new and useful idea.
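The evaluation scheme can be made concrete with a small Python sketch. It assumes that, for each patent, the extracted text patterns and the expert's ground truth are available as sets of pattern identifiers, and that the averages are unweighted means over the patents. The function names and the toy data are hypothetical and do not reproduce the evaluation data of this study.

```python
def precision_recall(extracted, ground_truth):
    """Precision and recall of one patent from two sets of pattern ids."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    tp = len(extracted & ground_truth)   # extracted and a real idea
    fp = len(extracted - ground_truth)   # extracted but not a real idea
    fn = len(ground_truth - extracted)   # real idea that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision_recall(per_patent):
    """Unweighted mean of precision and recall over all patents.

    Assumption: per_patent is a non-empty list of
    (extracted, ground_truth) pairs, one pair per patent.
    """
    pairs = [precision_recall(e, g) for e, g in per_patent]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)


if __name__ == "__main__":
    # Toy patent: 10 extracted patterns, 4 of them correct, 16 true ideas.
    extracted = set(range(10))
    truth = {0, 1, 2, 3} | set(range(100, 112))
    print(precision_recall(extracted, truth))
```

In the toy example, four of ten extracted patterns are correct and four of sixteen true ideas are found, which corresponds to a precision of 40 % and a recall of 25 %.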
A recall of 25 % means that if there are four new and useful ideas in the new text, the idea mining approach extracts only one of them. In contrast, we get a precision of 30 % and a recall of 20 % by using Jaccard's coefficient. This is because in some texts Jaccard's coefficient extracts text patterns from the new text that are similar to text patterns from the problem description. Such a pattern probably represents a known idea rather than a new one. Besides Jaccard's coefficient, we also test other well-known heuristic measures such as the overlap index, cosine similarity and Dice similarity [7] as baselines. However, we get nearly the same results for precision (30 %) and recall (20 %).

The Idea Mining Application

The idea mining application is aimed at users without extensive knowledge of text mining as well as at text mining experts. It allows them to extract problem solution ideas for their own needs using this idea mining approach. They can access the web-based application via the internet. It is available at http://www.text-mining.info and is programmed in Perl and Ruby. A user has to provide two text files: a problem description and a new text that probably contains problem solution ideas. These files can be formatted in various ways, e.g. as plain text, HTML, or XML. However, scripting code, (HTML or XML) tags, and images are discarded, that is, the application extracts plain text from the provided files. Then, the user has to select the language of these texts so that a general stop word list for this language can be integrated. The application offers general stop word lists in English, German, Dutch, Spanish and French. After the parameters of the application have been determined, the automatic extraction of new and useful ideas from the new text starts as described in the idea mining process (see Sect. 3 and Sect. 4). As a result, new ideas are presented as described in Sect. 5.
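The plain-text extraction step of the application (discarding scripting code, tags, and images) can be sketched with Python's standard html.parser module. This is only an illustration: the application itself is written in Perl and Ruby, and its actual extraction code is not shown in this paper. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text content and skip everything inside script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tags themselves (including <img>) produce no text.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_plain_text(markup):
    """Return the visible text of an HTML (or simple XML) string."""
    parser = PlainTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = ("<html><head><script>var x = 1;</script></head>"
            "<body><p>A new <b>sensor</b> idea.</p>"
            "<img src='a.png'></body></html>")
    print(extract_plain_text(page))
```

Joining the chunks with a space is a simplification: it keeps the text of adjacent elements apart but can also split a word that is interrupted by inline markup.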
Conclusions and Future Research

This study shows the success of an automatic approach for finding new ideas in textual information. For this, the study transfers creativity approaches from psychology and cognitive science to text mining. One main finding is that an abstract term (an idea) can be redefined in a concrete way so that it can be processed with text mining methods. In detail, it is shown that a technological idea represents a combination of a purpose and a mean, and that purposes and means are defined by combinations of co-occurring terms. Additionally, it is shown that problems and problem solution ideas can be represented as term vectors in the vector space model. For this, the study contributes a new (idea mining) measure. This measure identifies new ideas by comparing vectors that represent a problem to vectors that represent a problem solution idea. Last, it is shown that approaches from comprehensibility research can be adapted to present the new ideas to the user in a comprehensible way. As a further main finding, it is demonstrated that this theoretical approach can be realized as a web-based application. The success of the idea mining measure is shown by comparing it to further heuristic measures (overlap index, cosine similarity and Dice similarity).

Directions for future research arise from the fact that a large amount of textual information is nowadays available on the internet, and this information probably contains many new technological ideas. Extending this approach to a web idea mining approach that automatically identifies problem solution ideas on the internet is an interesting topic for further research. Additionally, the parameters of the approach can be optimized, and the idea mining measure can probably be extended with further aspects to improve its quality, that is, to obtain better precision and recall values.
A further aspect is to transfer this idea mining approach to colloquial language. For this, the idea definition must also cover new product ideas from consumers. Then, new product ideas can be identified to support marketing activities. Last, the approach can be extended with innovation-related aspects. Then, extracted ideas can be classified as innovative ideas and might be used as a starting point for new product development.

Acknowledgements

We thank Joachim Schulze and Jörg Fenner for constructive technical comments.

References

[1] Albers, S., & Gassmann, O. (2005). Handbuch Technologie- und Innovationsmanagement: Strategie - Umsetzung - Controlling (p. 196). Wiesbaden: Gabler Verlag.
[2] Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM Press.
[3] Coussement, K., & Van den Poel, D. (2008). Integrating the voice of customers through call center emails into a decision support system for churn prediction. Information & Management 45, 165.
[4] Coussement, K., & Van den Poel, D. (2009). Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers. Expert Systems with Applications 36, 6127-6134.
[5] Dagdeviren, M., Yavuz, S., & Kilinc, N. (2009). Weapon selection using the AHP and TOPSIS methods under fuzzy environment. Expert Systems with Applications 36, 8150.
[6] Fenner, J., & Thorleuchter, D. (2009). Textmining-Analyse von Forschungsvorhaben des National Institute of Standards and Technology. Euskirchen: Fraunhofer INT Edition.
[7] Ferber, R. (2003). Information Retrieval (pp. 74-80). Heidelberg: dpunkt.verlag.
[8] Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology 32, 221-233.
[9] Gentsch, P., & Hänlein, M. (1999). Text Mining. WISU 12, 1646.
[10] Groeben, N. (1982). Leserpsychologie: Textverständnis - Textverständlichkeit. Münster: Aschendorff.
[11] Hoffmann, L., Kalverkämper, H., & Wiegand, H.E. (1998). Fachsprachen - Languages for Special Purposes: Ein internationales Handbuch zur Fachsprachenforschung und Terminologiewissenschaft - An International Handbook of Special-Language and Terminology Research (p. 1602). Berlin: Walter de Gruyter.
[12] Hotho, A., Nürnberger, A., & Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum 20(1), 19-26.
[13] Li, Y.R., Wang, L.H., & Hong, C.F. (2009). Extracting the significant-rare keywords for patent analysis. Expert Systems with Applications 36, 5200-5204.
[14] Langer, I., Schulz v. Thun, F., & Tausch, R. (1974). Verständlichkeit in Schule und Verwaltung. München: Ernst Reinhardt.
[15] Lustig, G. (1986). Automatische Indexierung zwischen Forschung und Anwendung (p. 92). Hildesheim: Georg Olms Verlag.
[16] Martin-Bautista, M.J., Sanches, D., Serrano, J.M., & Vila, M.A. (2004). Text Mining using Fuzzy Association Rules. In V. Loia, M. Nikravesh, & L.A. Zadeh (Eds.), Fuzzy Logic and the Internet (p. 173). Berlin: Springer-Verlag.
[17] Osborn, A.-F. (1948). Your Creative Power. New York: C. Scribner's Sons.
[18] Porter, M.F. (1980). An algorithm for suffix stripping. Program 14(3), 130-137.
[19] Puppe, F., Stoyan, H., & Studer, R. (2003). Knowledge Engineering. In G. Görz, C.R. Rollinger, & J. Schneeberger (Eds.), Handbuch der Künstlichen Intelligenz (p. 611). München: Oldenbourg.
[20] Ripke, M., & Stöber, G. (1972). Probleme und Methoden der Identifizierung potentieller Objekte der Forschungsförderung. In H. Paschen & H. Krauch (Eds.), Methoden und Probleme der Forschungs- und Entwicklungsplanung (p. 47). München: Oldenbourg.
[21] Ropohl, G. (1996). Das Ende der Natur. In L. Schäfer & E. Ströker (Eds.), Naturauffassungen in Philosophie, Wissenschaft und Technik (pp. 143-163). Freiburg, München: Alber.
[22] Thorleuchter, D. (2008). Finding Technological Ideas and Inventions with Text Mining and Technique Philosophy. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data Analysis, Machine Learning, and Applications (pp. 413-420). Berlin: Springer-Verlag.
[23] Wallas, G. (1926). The Art of Thought. New York: Harcourt Brace.